Structured PDF Parser
Comprehensive guide to using the StructuredPDFParser.
Overview
The StructuredPDFParser is the foundational parser in Doctra, designed for general-purpose PDF document processing. It combines layout detection, OCR, and optional vision-language model (VLM) integration to extract all content from PDF documents.
Key Features
- Layout Detection: PaddleOCR-based document structure analysis
- OCR Processing: Text extraction from all document elements
- Visual Element Extraction: Automatic cropping of figures, charts, and tables
- VLM Integration: Optional structured data extraction
- Multiple Output Formats: Markdown, HTML, Excel, JSON
Basic Usage
from doctra import StructuredPDFParser
# Initialize parser with defaults
parser = StructuredPDFParser()
# Parse document
parser.parse("document.pdf")
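The single-file call above extends naturally to a folder of documents. The following is a minimal sketch, not part of Doctra: `parse_folder` and the `PDFParser` protocol are hypothetical helpers introduced here, and the only parser behavior they rely on is the `parse(path)` call shown above.

```python
from __future__ import annotations

from pathlib import Path
from typing import Protocol


class PDFParser(Protocol):
    """Any object exposing parse(path), e.g. a StructuredPDFParser instance."""

    def parse(self, pdf_path: str) -> object: ...


def parse_folder(parser: PDFParser, folder: str) -> dict[str, list[str]]:
    """Parse every *.pdf file directly inside `folder`.

    Returns the file names that parsed cleanly and those that raised,
    so that one bad document does not stop the whole batch.
    """
    results: dict[str, list[str]] = {"parsed": [], "failed": []}
    for pdf in sorted(Path(folder).glob("*.pdf")):
        try:
            parser.parse(str(pdf))
            results["parsed"].append(pdf.name)
        except Exception as exc:  # record the failure and continue
            results["failed"].append(f"{pdf.name}: {exc}")
    return results


# Usage, assuming doctra is installed and a "pdfs" folder exists (hypothetical):
#   from doctra import StructuredPDFParser
#   summary = parse_folder(StructuredPDFParser(), "pdfs")
#   print(summary["failed"])
```

Passing the parser in as an argument keeps the helper independent of how the parser was configured, so the same loop should work with any parser that exposes `parse`.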
Configuration
See API Reference for detailed parameter documentation.
Output Structure
When to Use
Use StructuredPDFParser for:
- General PDF processing
- Good-quality documents
- Documents that do not need image restoration
- Extraction of all content types
See Also
- Enhanced Parser - With image restoration
- Chart & Table Extractor - Focused extraction
- API Reference - Complete API documentation