RLHFlow/Llama3.1-8B-PRM-Deepseek-Data

RLHFlow/Llama3.1-8B-PRM-Deepseek-Data is an 8-billion-parameter process-supervised reward model (PRM), fine-tuned from Meta's Llama-3.1-8B-Instruct. Developed by RLHFlow, the model is trained on the Deepseek-PRM-Data dataset with a 32,768-token context length to score mathematical reasoning step by step. It performs strongly on mathematical problem-solving benchmarks such as GSM8K and MATH, particularly when used as a process-supervised reward model to guide or rerank solutions.
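RLHFlow's PRMs are typically queried by appending each reasoning step to a conversation and reading the model's next-token judgment, where the step reward is the probability of a "+" token relative to a "-" token. The sketch below illustrates only that two-way scoring rule in isolation; the token ids and logit values are illustrative placeholders, not the model's actual vocabulary.

```python
import math

def step_reward(logits, plus_id, minus_id):
    """Reward for one reasoning step: P('+') renormalized over
    the '+' / '-' judgment tokens (a two-way softmax)."""
    p, m = logits[plus_id], logits[minus_id]
    z = math.exp(p) + math.exp(m)
    return math.exp(p) / z

# Dummy next-token logits over a toy 5-token vocabulary;
# ids 1 ('+') and 2 ('-') are placeholder assumptions.
logits = [0.1, 2.0, -1.0, 0.3, 0.0]
print(round(step_reward(logits, plus_id=1, minus_id=2), 4))
```

In practice one would obtain the logits from the model's forward pass at the position following each step, then aggregate per-step rewards (e.g. taking the minimum over steps) to rank candidate solutions.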

Availability: Warm
Visibility: Public
Parameters: 8B
Precision: FP8
Context length: 32768
Source: Hugging Face