RLHFlow/Llama3.1-8B-PRM-Deepseek-Data

RLHFlow/Llama3.1-8B-PRM-Deepseek-Data is an 8-billion-parameter process-supervised reward model (PRM), fine-tuned from Meta's Llama-3.1-8B-Instruct. Developed by RLHFlow, the model is trained on the Deepseek-PRM-Data dataset with a 32,768-token context length to score mathematical reasoning step by step. It performs strongly on mathematical problem-solving benchmarks such as GSM8K and MATH, particularly when used as a process-supervised reward model to guide or rerank solutions.
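RLHFlow's PRMs are typically queried by appending each reasoning step to a conversation and reading the model's next-token judgment, where the step reward is the probability of a "+" token relative to a "-" token. The sketch below illustrates only that two-way scoring rule in isolation; the token ids and logit values are illustrative placeholders, not the model's actual vocabulary.

```python
import math

def step_reward(logits, plus_id, minus_id):
    """Reward for one reasoning step: P('+') renormalized over
    the '+' / '-' judgment tokens (a two-way softmax)."""
    p, m = logits[plus_id], logits[minus_id]
    z = math.exp(p) + math.exp(m)
    return math.exp(p) / z

# Dummy next-token logits over a toy 5-token vocabulary;
# ids 1 ('+') and 2 ('-') are placeholder assumptions.
logits = [0.1, 2.0, -1.0, 0.3, 0.0]
print(round(step_reward(logits, plus_id=1, minus_id=2), 4))
```

In practice one would obtain the logits from the model's forward pass at the position following each step, then aggregate per-step rewards (e.g. taking the minimum over steps) to rank candidate solutions.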

Availability: Warm
Visibility: Public
Parameters: 8B
Precision: FP8
Context length: 32768
Source: Hugging Face