opendatalab/MinerU-HTML

opendatalab/MinerU-HTML is a tool from OpenDataLab for extracting the main content of HTML pages. It uses a Large Language Model (LLM) to identify which parts of a page are primary content, and state machine-guided generation to constrain the model's output to structured JSON. The pipeline covers extraction end to end and includes a fallback mechanism and evaluation tooling.
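To make the "structured JSON output plus fallback" idea concrete, here is a minimal sketch in plain Python. It is not MinerU-HTML's actual API or output format. It assumes, purely for illustration, that a page has already been split into text blocks keyed by id and that the model returns a JSON object mapping each block id to "main" or "other". The names `validate_labels`, `heuristic_labels`, `extract_main`, and the `label_fn` callback are all hypothetical, and the post-hoc validation shown here stands in for state machine-guided decoding, which would enforce the format during generation instead.

```python
import json
from typing import Callable, Dict, List

# Hypothetical label set; the real tool's schema may differ.
LABELS = ("main", "other")


def validate_labels(raw: str, block_ids: List[str]) -> Dict[str, str]:
    """Parse model output and check it maps every block id to a known label."""
    parsed = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on bad JSON
    if not isinstance(parsed, dict):
        raise ValueError("expected a JSON object")
    if set(parsed) != set(block_ids):
        raise ValueError("labels do not cover exactly the input blocks")
    for block_id, label in parsed.items():
        if label not in LABELS:
            raise ValueError(f"unknown label {label!r} for block {block_id}")
    return parsed


def heuristic_labels(blocks: Dict[str, str], min_chars: int = 40) -> Dict[str, str]:
    """Assumed fallback rule: treat sufficiently long text blocks as main content."""
    return {
        block_id: ("main" if len(text.strip()) >= min_chars else "other")
        for block_id, text in blocks.items()
    }


def extract_main(
    blocks: Dict[str, str],
    label_fn: Callable[[Dict[str, str]], str],
) -> List[str]:
    """Label blocks with label_fn; fall back to the heuristic if its output is invalid."""
    try:
        labels = validate_labels(label_fn(blocks), list(blocks))
    except ValueError:
        labels = heuristic_labels(blocks)
    # Keep the original block order when returning the main content.
    return [text for block_id, text in blocks.items() if labels[block_id] == "main"]
```

In this sketch `label_fn` is where a call to the model would go. Any output that is not valid JSON, or that does not label exactly the input blocks, triggers the heuristic path rather than an error.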

Status: Warm
Visibility: Public
Parameters: 0.8B
Precision: BF16
Context length: 40960
License: apache-2.0
Hugging Face
