internlm/OREAL-7B-SFT
The internlm/OREAL-7B-SFT model is a 7.6-billion-parameter instruction-tuned language model released by InternLM, built on the Qwen2.5-7B architecture. It is the initial supervised fine-tuning (SFT) policy model for OREAL (Outcome REwArd-based reinforcement Learning), a framework targeting mathematical reasoning. As the starting point for reinforcement learning in the OREAL series, which reports high accuracy on benchmarks such as MATH-500, it is intended for complex mathematical problem-solving.
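Since the model follows a standard chat-style checkpoint layout on the Hugging Face Hub, it can likely be used with the `transformers` library in the usual way. The sketch below is an assumption-laden example, not official usage from the model card: the generation settings are illustrative, and `build_messages`/`solve` are hypothetical helper names. The `transformers` import is kept inside `solve` so the prompt-building helper can be used without the library installed.

```python
MODEL_ID = "internlm/OREAL-7B-SFT"


def build_messages(problem: str) -> list[dict]:
    """Wrap a math problem in the chat-format message list
    expected by tokenizer.apply_chat_template."""
    return [{"role": "user", "content": problem}]


def solve(problem: str, max_new_tokens: int = 1024) -> str:
    """Generate a solution with the SFT model.

    Note: downloads ~15 GB of weights and realistically needs a GPU.
    Generation settings here are illustrative assumptions.
    """
    # Import here so build_messages stays usable without transformers.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype="auto", device_map="auto"
    )
    input_ids = tokenizer.apply_chat_template(
        build_messages(problem),
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(model.device)
    output = model.generate(input_ids, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, not the prompt.
    return tokenizer.decode(
        output[0][input_ids.shape[-1]:], skip_special_tokens=True
    )
```

A call such as `solve("Compute 1 + 2 + ... + 100.")` would then return the model's step-by-step solution as a string.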