Gen-Verse/ReasonFlux-PRM-1.5B
Gen-Verse/ReasonFlux-PRM-1.5B is a 1.5-billion-parameter, trajectory-aware process reward model (PRM) designed to evaluate reasoning traces. It combines step-level and trajectory-level supervision to assign fine-grained rewards aligned with structured chain-of-thought data. The model supports both offline and online reward supervision, making it suitable for data selection, reinforcement learning training, and reward-guided test-time scaling. Its small size keeps inference costs low, which suits resource-constrained applications and edge deployment.
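As a rough illustration of how step-level and trajectory-level rewards might be used together for data selection and reward-guided test-time scaling, the sketch below blends the two signals and ranks candidate traces. It does not call the model: the scores are placeholder numbers, and the `ScoredTrace` type, the `alpha` weight, and the simple weighted-mean blend are all assumptions made for illustration, not the model's documented aggregation rule.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class ScoredTrace:
    """A candidate reasoning trace with hypothetical PRM outputs."""
    text: str
    step_rewards: list          # one score per reasoning step (assumed in [0, 1])
    trajectory_reward: float    # one score for the whole trace (assumed in [0, 1])


def combined_reward(trace, alpha=0.5):
    """Blend the mean step reward with the trajectory reward.

    alpha is a hypothetical mixing weight: 0 uses only step-level
    scores, 1 uses only the trajectory-level score.
    """
    step_part = mean(trace.step_rewards) if trace.step_rewards else 0.0
    return (1 - alpha) * step_part + alpha * trace.trajectory_reward


def best_of_n(traces, alpha=0.5):
    """Test-time scaling: return the highest-scoring candidate."""
    return max(traces, key=lambda t: combined_reward(t, alpha))


def select_top_k(traces, k, alpha=0.5):
    """Offline data selection: keep the k highest-scoring traces."""
    ranked = sorted(traces, key=lambda t: combined_reward(t, alpha), reverse=True)
    return ranked[:k]


# Placeholder scores standing in for real model outputs.
candidates = [
    ScoredTrace("trace A", [0.9, 0.8], 0.7),
    ScoredTrace("trace B", [0.4, 0.6], 0.9),
    ScoredTrace("trace C", [0.2, 0.3], 0.1),
]

if __name__ == "__main__":
    print(best_of_n(candidates).text)
    print([t.text for t in select_top_k(candidates, k=2)])
```

In an online setting such as reinforcement learning, the same combined score could serve as the reward signal for each sampled trace. The right weighting between the two levels depends on the task and would need to be tuned.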