CriteriaPO/qwen2.5-3b-dpo-coarse

CriteriaPO/qwen2.5-3b-dpo-coarse is a 3.1-billion-parameter language model fine-tuned from CriteriaPO/qwen2.5-3b-sft-10 using Direct Preference Optimization (DPO) to better align its outputs with human preferences. Built on the Qwen2.5 architecture, it supports a 32,768-token context length, stores its weights in BF16, and is intended for general text generation tasks.
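A minimal usage sketch with the Hugging Face `transformers` library is shown below. It assumes the model ships a chat template (standard for Qwen2.5-based models); the prompt text and generation settings are illustrative, not part of the model card.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "CriteriaPO/qwen2.5-3b-dpo-coarse"

def generate(prompt: str, max_new_tokens: int = 128) -> str:
    """Load the model and generate a completion for a single user prompt.

    Loading is done lazily inside the function; for repeated calls you
    would cache the model and tokenizer instead.
    """
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    # Weights are published in BF16, so load them in that dtype.
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype="bfloat16", device_map="auto"
    )
    messages = [{"role": "user", "content": prompt}]
    # Apply the model's chat template (assumed present, as on Qwen2.5 models).
    text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Strip the prompt tokens and decode only the newly generated text.
    new_tokens = out[0][inputs["input_ids"].shape[-1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Calling `generate("Explain DPO in one sentence.")` downloads the weights on first use and returns the model's reply as a string.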
