Quick Start¶
This guide will get you started with Doctra in just a few minutes.
Your First Document Parse¶
Let's parse a PDF document and extract its content:
from doctra import StructuredPDFParser
# Initialize the parser
parser = StructuredPDFParser()
# Parse a document
parser.parse("document.pdf")
That's it! Doctra will:
- Detect the document layout
- Extract text using OCR
- Save images of figures, charts, and tables
- Generate a Markdown file with all content
Understanding the Output¶
After parsing, you'll find the following structure:
outputs/
└── document/
    ├── full_parse/
    │   ├── result.md        # Markdown with all content
    │   ├── result.html      # HTML version
    │   └── images/          # Extracted visual elements
    │       ├── figures/     # Document figures
    │       ├── charts/      # Charts and graphs
    │       └── tables/      # Table images
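To check what a parse produced, a small helper can walk this layout. The following is a minimal sketch that uses only the standard library and assumes the directory names shown above; `summarize_outputs` is a hypothetical helper, not part of Doctra.

```python
from pathlib import Path


def summarize_outputs(pdf_stem: str, base_dir: str = "outputs") -> dict:
    """Report which result files exist and count the extracted images.

    Assumes the layout <base_dir>/<pdf_stem>/full_parse/ shown in this guide.
    """
    parse_dir = Path(base_dir) / pdf_stem / "full_parse"
    summary = {
        "markdown": (parse_dir / "result.md").is_file(),
        "html": (parse_dir / "result.html").is_file(),
    }
    for category in ("figures", "charts", "tables"):
        folder = parse_dir / "images" / category
        # A missing folder simply counts as zero extracted items.
        if folder.is_dir():
            summary[category] = sum(1 for p in folder.iterdir() if p.is_file())
        else:
            summary[category] = 0
    return summary


if __name__ == "__main__":
    print(summarize_outputs("document"))
```

Because missing folders count as zero, the helper can be run before or after parsing.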
Basic Examples¶
Parse with Custom Output Directory¶
from doctra import StructuredPDFParser
parser = StructuredPDFParser()
parser.parse("document.pdf", output_base_dir="my_outputs")
Parse Scanned Documents¶
For scanned or low-quality documents, use the enhanced parser:
from doctra import EnhancedPDFParser
parser = EnhancedPDFParser(
    use_image_restoration=True,
    restoration_task="appearance"  # Improve overall appearance
)
parser.parse("scanned_document.pdf")
Extract Only Charts and Tables¶
If you only need charts and tables:
from doctra import ChartTablePDFParser
parser = ChartTablePDFParser(
    extract_charts=True,
    extract_tables=True
)
parser.parse("data_report.pdf")
Using Vision Language Models¶
To convert charts and tables to structured data, add VLM support:
from doctra import StructuredPDFParser
parser = StructuredPDFParser(
    use_vlm=True,
    vlm_provider="openai",
    vlm_api_key="your-api-key-here"
)
parser.parse("document.pdf")
This will generate:
- tables.xlsx - Excel file with extracted table data
- tables.html - HTML tables for web viewing
- vlm_items.json - JSON with structured data
VLM Providers
Doctra supports multiple VLM providers:
- "openai" - GPT-4 Vision and GPT-4o
- "gemini" - Google's Gemini models
- "anthropic" - Claude with vision
- "openrouter" - Access multiple models
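Hard-coding an API key in a script is easy to leak, so one option is to read it from the environment and check the provider name before building the parser. This is a sketch under stated assumptions: `vlm_settings` and the `DOCTRA_VLM_API_KEY` variable name are hypothetical, and only the keyword names `use_vlm`, `vlm_provider`, and `vlm_api_key` come from the example above.

```python
import os

# Provider names as listed in this guide.
SUPPORTED_PROVIDERS = ("openai", "gemini", "anthropic", "openrouter")


def vlm_settings(provider: str, env=None, key_var: str = "DOCTRA_VLM_API_KEY") -> dict:
    """Build the VLM keyword arguments, reading the key from the environment.

    key_var is a hypothetical variable name chosen for this sketch.
    """
    env = os.environ if env is None else env
    if provider not in SUPPORTED_PROVIDERS:
        raise ValueError(f"Unknown provider {provider!r}; expected one of {SUPPORTED_PROVIDERS}")
    api_key = env.get(key_var)
    if not api_key:
        raise RuntimeError(f"Set the {key_var} environment variable first")
    return {"use_vlm": True, "vlm_provider": provider, "vlm_api_key": api_key}
```

The returned dictionary can then be unpacked into the constructor, for example `StructuredPDFParser(**vlm_settings("openai"))`.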
Document Restoration¶
Enhance document quality before parsing:
from doctra import DocResEngine
# Initialize restoration engine
docres = DocResEngine(device="cuda") # Use GPU for speed
# Restore a single image
restored_img, metadata = docres.restore_image(
    image="blurry_doc.jpg",
    task="deblurring"
)
# Or enhance an entire PDF
docres.restore_pdf(
    pdf_path="low_quality.pdf",
    output_path="enhanced.pdf",
    task="appearance"
)
Available restoration tasks:
| Task | Description |
|---|---|
| appearance | General appearance enhancement |
| dewarping | Correct perspective distortion |
| deshadowing | Remove shadows |
| deblurring | Reduce blur |
| binarization | Convert to black and white |
| end2end | Complete restoration pipeline |
Using the Web UI¶
Launch the graphical interface for easy document processing:
Or from the command line:
Then open your browser to the displayed URL (typically http://127.0.0.1:7860).
Command Line Interface¶
Doctra also provides a command-line interface (CLI):
# Parse a document
doctra parse document.pdf
# Enhanced parsing
doctra enhance document.pdf --restoration-task appearance
# Extract charts and tables
doctra extract both document.pdf --use-vlm
# Visualize layout
doctra visualize document.pdf
See the CLI Reference for all available commands.
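The same commands can be scripted from Python with the standard `subprocess` module. The sketch below only assembles and runs the command lines shown above; `doctra_command` and `run_doctra` are hypothetical helper names, and no flags beyond those in this guide are assumed.

```python
import shutil
import subprocess


def doctra_command(subcommand: str, pdf_path: str, *options: str) -> list:
    """Build an argument list, e.g. ["doctra", "parse", "document.pdf"]."""
    # "extract both" is split into two arguments, as a shell would do.
    return ["doctra", *subcommand.split(), pdf_path, *options]


def run_doctra(subcommand: str, pdf_path: str, *options: str) -> int:
    """Run the command and return its exit code."""
    if shutil.which("doctra") is None:
        raise RuntimeError("The doctra executable was not found on PATH")
    return subprocess.run(doctra_command(subcommand, pdf_path, *options)).returncode
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.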
Layout Visualization¶
Visualize how Doctra detects document elements:
from doctra import StructuredPDFParser
parser = StructuredPDFParser()
# Display layout detection results
parser.display_pages_with_boxes(
    pdf_path="document.pdf",
    num_pages=3,  # First 3 pages
    save_path="layout_viz.png"
)
This creates a visual representation showing:
- Detected text regions (blue boxes)
- Tables (red boxes)
- Charts (green boxes)
- Figures (orange boxes)
- Confidence scores for each element
Configuration Options¶
Parser Configuration¶
from doctra import StructuredPDFParser
parser = StructuredPDFParser(
    # Layout Detection
    layout_model_name="PP-DocLayout_plus-L",  # Model choice
    dpi=200,  # Image resolution
    min_score=0.5,  # Confidence threshold
    # OCR Settings
    ocr_lang="eng",  # Language code
    ocr_psm=6,  # Page segmentation mode
    # Output
    box_separator="\n"  # Separator between elements
)
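Assuming `dpi` is the resolution at which each page is rendered to an image (as the comment on that option suggests), the amount of pixel data grows with the square of the value. The sketch below is plain arithmetic and does not call Doctra; `page_pixels` and `pixel_ratio` are hypothetical helpers, and the US Letter page size is chosen only for illustration.

```python
def page_pixels(width_in: float, height_in: float, dpi: int) -> tuple:
    """Pixel dimensions of a page of the given size (in inches) at a given DPI."""
    return (round(width_in * dpi), round(height_in * dpi))


def pixel_ratio(dpi_a: int, dpi_b: int) -> float:
    """How many times more pixels a page has at dpi_a than at dpi_b."""
    return (dpi_a / dpi_b) ** 2


if __name__ == "__main__":
    # US Letter is 8.5 x 11 inches.
    print(page_pixels(8.5, 11, 200))  # (1700, 2200)
    print(pixel_ratio(300, 200))      # 2.25
```

Under that assumption, a page at 300 DPI carries 2.25 times as many pixels as the same page at 200 DPI, which is why lower values tend to process faster at some cost in detail.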
Enhanced Parser Configuration¶
from doctra import EnhancedPDFParser
parser = EnhancedPDFParser(
    # Image Restoration
    use_image_restoration=True,
    restoration_task="dewarping",
    restoration_device="cuda",  # or "cpu"
    restoration_dpi=300,
    # All StructuredPDFParser options also available
    use_vlm=True,
    vlm_provider="openai",
    vlm_api_key="your-key"
)
Common Patterns¶
Batch Processing¶
import os
from doctra import StructuredPDFParser
parser = StructuredPDFParser()
# Process all PDFs in a directory
pdf_dir = "documents"
for filename in os.listdir(pdf_dir):
    if filename.endswith(".pdf"):
        pdf_path = os.path.join(pdf_dir, filename)
        print(f"Processing {filename}...")
        parser.parse(pdf_path)
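The loop above matches only names ending in lowercase `.pdf` and visits files in whatever order `os.listdir` returns them. A possible variation with `pathlib` collects the files first, ignoring case and optionally descending into subfolders; `find_pdfs` is a hypothetical helper, not part of Doctra.

```python
from pathlib import Path


def find_pdfs(pdf_dir: str, recursive: bool = False) -> list:
    """Return a sorted list of PDF paths in pdf_dir, matching .pdf in any case."""
    root = Path(pdf_dir)
    candidates = root.rglob("*") if recursive else root.iterdir()
    return sorted(p for p in candidates if p.is_file() and p.suffix.lower() == ".pdf")
```

Each returned path can then be passed to the parser as a string, for example `parser.parse(str(path))`.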
Error Handling¶
from doctra import StructuredPDFParser
parser = StructuredPDFParser()
try:
    parser.parse("document.pdf")
except FileNotFoundError:
    print("Document not found!")
except Exception as e:
    print(f"Error parsing document: {e}")
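When many documents are processed, it is often more useful to record failures and continue than to stop at the first one. The sketch below applies the same two `except` branches to a list of paths. `parse_many` is a hypothetical helper, and the parse function is passed in as an argument so the example does not depend on any particular parser.

```python
from typing import Callable, Dict, Iterable


def parse_many(paths: Iterable[str], parse: Callable[[str], object]) -> Dict[str, str]:
    """Call parse on each path and return {path: reason} for the ones that failed."""
    failures = {}
    for path in paths:
        try:
            parse(path)
        except FileNotFoundError:
            failures[path] = "file not found"
        except Exception as exc:
            # Keep going; the caller can inspect the reasons afterwards.
            failures[path] = f"{type(exc).__name__}: {exc}"
    return failures
```

With a Doctra parser, the call would look like `parse_many(["a.pdf", "b.pdf"], parser.parse)`; an empty dictionary means every document was parsed without raising.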
Progress Tracking¶
from doctra import StructuredPDFParser
parser = StructuredPDFParser()
# Progress bars are shown automatically
parser.parse("large_document.pdf")
Next Steps¶
Now that you've learned the basics:
- Dive Deeper: Read the User Guide for detailed explanations
- Explore Parsers: Learn about each parser's capabilities
- Advanced Examples: Check out Advanced Examples
- API Reference: Browse the API Documentation
Getting Help¶
- Read the full documentation
- Check GitHub issues
- Ask questions in discussions
Common Issues¶
"Poppler not found" Error¶
Install Poppler (see Installation).
Low OCR Accuracy¶
Try the enhanced parser with image restoration:
from doctra import EnhancedPDFParser
parser = EnhancedPDFParser(
    use_image_restoration=True,
    restoration_task="appearance"
)
Slow Processing¶
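Use GPU acceleration where it is available; the examples in this guide select it with device="cuda" on DocResEngine and restoration_device="cuda" on EnhancedPDFParser.
Or reduce DPI by passing a lower dpi value to the parser (the Parser Configuration example uses dpi=200).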