LichengLiu03/Qwen2.5-3B-UFO
LichengLiu03/Qwen2.5-3B-UFO is a 3.1 billion parameter language model based on Qwen2.5-3B-Instruct, fine-tuned with Proximal Policy Optimization (PPO) on the MetaMathQA dataset. It utilizes the Unary Feedback as Observation (UFO) framework to significantly improve multi-turn mathematical reasoning by learning from minimal "Try Again" feedback. This model excels at revising its reasoning across multiple attempts, making it particularly effective for complex math, logic, and reasoning tasks requiring iterative problem-solving.