nvidia/Llama-3.1-8B-Instruct-FP8

nvidia/Llama-3.1-8B-Instruct-FP8 is an 8-billion-parameter, instruction-tuned language model that NVIDIA quantized to FP8 precision with TensorRT Model Optimizer. It is derived from Meta's Llama 3.1 8B Instruct and is optimized for inference on NVIDIA hardware, with a reported speedup of approximately 1.3x on H100 GPUs. It maintains strong performance on benchmarks such as MMLU and GSM8K while significantly reducing the memory footprint. The model is suitable for commercial and non-commercial applications that need fast, resource-efficient text generation.
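To give a rough sense of the memory reduction mentioned above, here is a minimal back-of-the-envelope sketch. It assumes a nominal 8e9 parameters, 2 bytes per weight for a 16-bit baseline, and 1 byte per weight for FP8. The helper name `weight_memory_gib` is hypothetical, and the estimate covers weights only. It ignores the KV cache, activations, and any layers the actual checkpoint may keep in higher precision.

```python
def weight_memory_gib(num_params: float, bytes_per_param: float) -> float:
    """Estimate weight storage in GiB for a given parameter count and width."""
    return num_params * bytes_per_param / 2**30


# Assumption: nominal parameter count taken from the "8B" label, not an exact figure.
NUM_PARAMS = 8e9

# Assumption: the unquantized baseline stores each weight in 16 bits (2 bytes).
baseline_gib = weight_memory_gib(NUM_PARAMS, 2)

# FP8 stores each quantized weight in 8 bits (1 byte).
fp8_gib = weight_memory_gib(NUM_PARAMS, 1)

print(f"16-bit weights: ~{baseline_gib:.1f} GiB")
print(f"FP8 weights:    ~{fp8_gib:.1f} GiB")
print(f"Reduction:      ~{baseline_gib / fp8_gib:.1f}x (weights only)")
```

Under these assumptions the weight storage is roughly halved. Real deployments will see a smaller overall reduction, because runtime memory also includes the KV cache and activations.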

Status: Warm
Visibility: Public
Parameters: 8B
Quantization: FP8
Context length: 32768
License: llama3.1
Hugging Face