Welcome to Doctra
Overview
Doctra is a Python library for parsing, extracting, and analyzing document content from PDFs. It combines layout detection, OCR, image restoration, and Vision Language Models (VLMs) to provide end-to-end document processing.
Key Features
Comprehensive PDF Parsing
- Layout Detection: Advanced document layout analysis using PaddleOCR
- OCR Processing: High-quality text extraction with Tesseract
- Visual Elements: Automatic extraction of figures, charts, and tables
- Multiple Parsers: Choose the right parser for your use case
Image Restoration
- 6 Restoration Tasks: Dewarping, deshadowing, appearance enhancement, deblurring, binarization, and end-to-end restoration
- DocRes Integration: State-of-the-art document image restoration
- GPU Acceleration: Automatic CUDA detection for faster processing
- Enhanced Quality: Improves document quality for better OCR results
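Automatic CUDA detection is commonly done by asking the deep-learning backend whether a GPU is usable and falling back to the CPU otherwise. The sketch below assumes a PyTorch backend; `pick_device` is a hypothetical helper for illustration and is not part of Doctra's API.

```python
def pick_device() -> str:
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch  # assumed backend; if it is missing we simply use the CPU
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


print(f"Restoration would run on: {pick_device()}")
```

Keeping the fallback inside one function means the rest of a script never has to branch on whether a GPU is present.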
VLM Integration
- Structured Data Extraction: Convert charts and tables to structured formats
- Multiple Providers: OpenAI, Gemini, Anthropic, and OpenRouter support
- Automatic Conversion: Transform visual elements into usable data
- Flexible Configuration: Easy API key management and model selection
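One common way to manage API keys for several providers is to read them from environment variables rather than hard-coding them. The sketch below illustrates that pattern only: the helper `load_api_key` and the variable names in `ENV_VARS` are assumptions based on common conventions, not settings defined by Doctra.

```python
import os

# Hypothetical mapping: conventional variable names, not Doctra configuration.
ENV_VARS = {
    "openai": "OPENAI_API_KEY",
    "gemini": "GEMINI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "openrouter": "OPENROUTER_API_KEY",
}


def load_api_key(provider: str, env=None) -> str:
    """Look up the API key for a provider, failing loudly if it is absent."""
    env = os.environ if env is None else env
    name = provider.lower()
    if name not in ENV_VARS:
        raise ValueError(f"Unknown provider: {provider!r}")
    key = env.get(ENV_VARS[name], "").strip()
    if not key:
        raise RuntimeError(f"Set {ENV_VARS[name]} before enabling VLM extraction")
    return key
```

The returned key can then be passed to whichever configuration option the chosen parser exposes; see the User Guide for the actual parameter names.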
Rich Output Formats
- Markdown: Human-readable documents with embedded images
- Excel: Structured data in spreadsheet format
- JSON: Programmatically accessible data
- HTML: Interactive web-ready documents
- Images: High-quality cropped visual elements
User-Friendly Interfaces
- Web UI: Gradio-based interface with drag & drop
- Command Line: Powerful CLI for automation
- Python API: Full programmatic access
- Real-time Progress: Track processing status
Quick Start
Installation
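Assuming the package is published on PyPI under the name `doctra` (as the import below and the PyPI link under Getting Help suggest), it would typically be installed with `pip install doctra`. The following sketch checks whether the package is visible to the current interpreter; `is_installed` is a hypothetical helper, not part of Doctra.

```python
import importlib.util


def is_installed(package: str) -> bool:
    """Return True if a top-level package can be located without importing it."""
    return importlib.util.find_spec(package) is not None


if is_installed("doctra"):
    print("doctra is installed")
else:
    # Assumed PyPI package name; see the Installation Guide for specifics.
    print("doctra not found; try: pip install doctra")
```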
Basic Usage
from doctra import StructuredPDFParser
# Initialize parser
parser = StructuredPDFParser()
# Parse a document
parser.parse("document.pdf")
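The same `parse()` call can be applied to a whole folder. The sketch below is one possible way to do that: `parse_folder` is a hypothetical helper that relies only on the `parse(path)` method shown above, and it keeps going when a single file fails.

```python
from pathlib import Path


def parse_folder(parser, folder):
    """Call parser.parse() on every PDF in folder; return (parsed, failed) names."""
    parsed, failed = [], []
    for pdf in sorted(Path(folder).glob("*.pdf")):
        try:
            parser.parse(str(pdf))
            parsed.append(pdf.name)
        except Exception as exc:  # keep the batch running if one file fails
            failed.append(pdf.name)
            print(f"Skipped {pdf.name}: {exc}")
    return parsed, failed
```

For example, `parse_folder(StructuredPDFParser(), "reports")` would process each PDF in a (hypothetical) `reports` directory in alphabetical order.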
System Dependencies
Doctra requires Poppler for PDF processing. See the Installation Guide for detailed setup instructions.
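A quick way to see whether Poppler is reachable is to look for its command-line tools on `PATH`. The sketch below checks for `pdftoppm` and `pdfinfo`, two utilities that ship with Poppler; which of them Doctra actually invokes is an assumption here, and `find_poppler_tools` is a hypothetical helper.

```python
import shutil


def find_poppler_tools(tools=("pdftoppm", "pdfinfo")):
    """Map each Poppler command-line tool to its path, or None if not on PATH."""
    return {tool: shutil.which(tool) for tool in tools}


missing = [tool for tool, path in find_poppler_tools().items() if path is None]
if missing:
    print(f"Poppler tools not found on PATH: {', '.join(missing)}")
else:
    print("Poppler looks available")
```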
Core Components
Parsers
| Parser | Description | Best For |
|---|---|---|
| StructuredPDFParser | Complete document processing | General purpose parsing |
| EnhancedPDFParser | Parsing with image restoration | Scanned or low-quality documents |
| ChartTablePDFParser | Focused extraction | Only charts and tables needed |
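The "Best For" column can be read as a small decision rule. The sketch below encodes it as a hypothetical helper, `suggest_parser`, that returns the name of the parser class from the table; giving `visuals_only` priority when both flags are set is an arbitrary choice made for this illustration.

```python
def suggest_parser(scanned: bool = False, visuals_only: bool = False) -> str:
    """Return the parser class name that the table recommends for a document."""
    if visuals_only:
        # Only charts and tables are needed
        return "ChartTablePDFParser"
    if scanned:
        # Scanned or low-quality input benefits from image restoration
        return "EnhancedPDFParser"
    # General-purpose default
    return "StructuredPDFParser"


print(suggest_parser(scanned=True))
```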
Engines
| Engine | Description | Use Case |
|---|---|---|
| DocResEngine | Image restoration | Standalone image enhancement |
| Layout Detection | Document analysis | Identify document structure |
| OCR Engine | Text extraction | Extract text from images |
| VLM Service | AI processing | Convert visuals to structured data |
Use Cases
- Financial Reports: Extract tables, charts, and text from financial documents
- Research Papers: Parse academic papers with figures and tables
- Document Archival: Convert scanned documents to searchable formats
- Data Extraction: Extract structured data from visual elements
- Document Enhancement: Restore and improve low-quality documents
Getting Help
- Documentation: You're reading it! Explore the sidebar for detailed guides
- GitHub Issues: Report bugs or request features
- PyPI: View package details
What's Next?
- Quick Start: Get up and running with Doctra in minutes
- User Guide: Learn about parsers, engines, and advanced features
- API Reference: Detailed API documentation for all components
- Examples: Real-world examples and integration patterns
Acknowledgments
Doctra builds upon several excellent open-source projects:
- PaddleOCR - Advanced document layout detection and OCR capabilities
- DocRes - State-of-the-art document image restoration model
- Outlines - Structured output generation for LLMs
We thank the developers and contributors of these projects for their valuable work.
License
Doctra is released under the MIT License. See the LICENSE file for details.