Core Concepts¶
Understanding Doctra's core concepts will help you use the library effectively.
Document Processing Pipeline¶
Doctra follows a multi-stage pipeline for document processing:
```mermaid
graph LR
    A[PDF Input] --> B[Layout Detection]
    B --> C[Element Classification]
    C --> D[OCR Processing]
    D --> E[VLM Processing]
    E --> F[Output Generation]
    B -. Optional .-> G[Image Restoration]
    G --> C
```
Pipeline Stages¶
1. Layout Detection: Analyzes document structure using PaddleOCR
2. Element Classification: Identifies text, tables, charts, and figures
3. OCR Processing: Extracts text from identified regions
4. VLM Processing (Optional): Converts visual elements to structured data
5. Output Generation: Creates Markdown, Excel, HTML, and JSON files
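In practice, all of these stages run from a single call; a minimal sketch (the file path is a placeholder):

```python
from doctra import StructuredPDFParser

# Default pipeline: layout detection -> classification -> OCR -> output generation
parser = StructuredPDFParser()

# Runs every stage and writes Markdown, HTML, Excel, and image outputs to disk
parser.parse("document.pdf")  # placeholder path
```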
Architecture Overview¶
Doctra is organized into several key components:
Parsers¶
Parsers are the main entry point for document processing. They orchestrate the entire pipeline.
- StructuredPDFParser
- The base parser for general PDF processing. Handles layout detection, OCR, and output generation.
- EnhancedPDFParser
- Extends StructuredPDFParser with image restoration capabilities for low-quality documents.
- ChartTablePDFParser
- Specialized parser focused on extracting only charts and tables.
Engines¶
Engines provide specific processing capabilities:
- Layout Detection
- PaddleOCR-based layout analysis to identify document structure.
- OCR Engine
- Tesseract-based text extraction from images.
- DocRes Engine
- Image restoration for document enhancement.
- VLM Service
- Vision Language Model integration for structured data extraction.
Exporters¶
Exporters handle output generation in various formats:
- MarkdownWriter: Creates human-readable Markdown files
- ExcelWriter: Generates spreadsheets with structured data
- HTMLWriter: Produces web-ready HTML documents
- ImageSaver: Saves cropped visual elements
Element Types¶
Doctra classifies document elements into four main types:
Text Elements¶
Regular text content including:
- Paragraphs
- Headings
- Lists
- Captions
Processing: OCR → Text extraction → Markdown formatting
Tables¶
Tabular data with rows and columns.
Processing Options:
- Without VLM: Saved as images only
- With VLM: Converted to Excel/HTML + saved as images
Output: `tables.xlsx`, `tables.html`, cropped images
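For example, tables can be pulled into structured form by enabling VLM on the specialized parser. A sketch, assuming ChartTablePDFParser accepts the same VLM options shown later on this page:

```python
from doctra import ChartTablePDFParser

# Focus on tables only; with VLM enabled, each table is also written to Excel/HTML
parser = ChartTablePDFParser(
    extract_tables=True,
    extract_charts=False,
    use_vlm=True,               # assumes the VLM options are shared with StructuredPDFParser
    vlm_provider="openai",
    vlm_api_key="your-api-key",
)
parser.parse("report.pdf")  # placeholder path; produces tables.xlsx and tables.html
```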
Charts¶
Visual representations of data including:
- Bar charts
- Line graphs
- Pie charts
- Scatter plots
Processing Options:
- Without VLM: Saved as images with captions
- With VLM: Data extracted + description generated
Output: Cropped images, optional structured data
Figures¶
General images and diagrams including:
- Photographs
- Illustrations
- Diagrams
- Logos
Processing: Cropped and saved as images with context
Layout Detection¶
Layout detection is the foundation of Doctra's processing.
How It Works¶
1. Page Rendering: PDF pages are rendered to images at the specified DPI
2. Model Inference: The PaddleOCR layout model identifies regions
3. Bounding Boxes: Each element gets coordinates and a confidence score
4. Classification: Elements are labeled as text/table/chart/figure
Detection Parameters¶
```python
parser = StructuredPDFParser(
    layout_model_name="PP-DocLayout_plus-L",  # Model choice
    dpi=200,                                  # Image resolution
    min_score=0.5                             # Confidence threshold
)
```
- `layout_model_name`: PaddleOCR layout model to use. Options: `PP-DocLayout_plus-L` (best), `PP-DocLayout_plus-M` (faster)
- `dpi`: Image resolution. Higher = better quality but slower. Range: 100-300
- `min_score`: Minimum confidence score (0-1). Higher = fewer false positives
Visualization¶
Verify layout detection quality:
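A minimal sketch (the `display_pages_with_boxes` method name is an assumption here; see the API reference for the exact call):

```python
from doctra import StructuredPDFParser

parser = StructuredPDFParser()

# Render pages with the detected bounding boxes drawn on top for inspection
# (method name assumed; check the API reference)
parser.display_pages_with_boxes("document.pdf")  # placeholder path
```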
This shows bounding boxes with colors:
- 🔵 Blue: Text
- 🔴 Red: Tables
- 🟢 Green: Charts
- 🟠 Orange: Figures
OCR Processing¶
OCR (Optical Character Recognition) extracts text from images.
Configuration¶
```python
parser = StructuredPDFParser(
    ocr_lang="eng",  # Language
    ocr_psm=6,       # Page segmentation mode
    ocr_oem=3        # OCR engine mode
)
```
- `ocr_lang`: Tesseract language code. Examples: `eng`, `fra`, `spa`, `deu`
- `ocr_psm`: Page segmentation mode. Common values:
    - `3`: Automatic page segmentation
    - `6`: Uniform block of text (default)
    - `11`: Sparse text
    - `12`: Sparse text with OSD
- `ocr_oem`: OCR engine mode:
    - `0`: Legacy engine
    - `1`: Neural nets LSTM
    - `3`: Default (both)
Improving OCR Accuracy¶
- Increase DPI: Higher resolution = better text recognition
- Use image restoration: Enhance document quality first
- Correct language: Specify the document's language
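A sketch combining the three tips, assuming EnhancedPDFParser accepts the same DPI and OCR options as the base parser (values are illustrative):

```python
from doctra import EnhancedPDFParser

# Higher DPI + restoration + the right language code usually give the largest gains
parser = EnhancedPDFParser(
    use_image_restoration=True,
    restoration_task="appearance",
    dpi=300,         # upper end of the recommended 100-300 range
    ocr_lang="fra",  # match the document's language
)
parser.parse("scanned_french_document.pdf")  # placeholder path
```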
Image Restoration¶
Image restoration improves document quality before processing.
Restoration Tasks¶
| Task | Purpose | When to Use |
|---|---|---|
| `appearance` | General enhancement | Most documents (default) |
| `dewarping` | Fix perspective | Scanned with distortion |
| `deshadowing` | Remove shadows | Poor lighting |
| `deblurring` | Reduce blur | Motion blur, focus issues |
| `binarization` | B&W conversion | Clean text extraction |
| `end2end` | Full pipeline | Severely degraded documents |
Usage¶
```python
from doctra import EnhancedPDFParser

parser = EnhancedPDFParser(
    use_image_restoration=True,
    restoration_task="appearance",
    restoration_device="cuda"  # or "cpu"
)
```
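Apart from the restoration options, the enhanced parser is used like the base parser:

```python
# Restoration is applied automatically as part of the pipeline
parser.parse("scanned_report.pdf")  # placeholder path
```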
Performance Impact¶
| Restoration | Processing Time | Quality Improvement |
|---|---|---|
| None | Baseline | Baseline |
| CPU | +200% | +30-50% |
| GPU | +50% | +30-50% |
VLM Integration¶
Vision Language Models convert visual elements to structured data.
Supported Providers¶
- OpenAI: GPT-4 Vision, GPT-4o
- Gemini: Google's vision models
- Anthropic: Claude with vision
- OpenRouter: Access multiple models
Configuration¶
```python
parser = StructuredPDFParser(
    use_vlm=True,
    vlm_provider="openai",
    vlm_api_key="your-api-key",
    vlm_model="gpt-4o"  # Optional, uses default if not specified
)
```
What Gets Processed¶
With VLM enabled:
- Tables
- Converted to Excel/HTML with cell-by-cell data
- Charts
- Data extracted + description generated
- Figures
- Descriptions and context generated
Cost Considerations¶
VLM processing requires API calls:
- Per Document: 1-10 API calls depending on visual elements
- Per Element: ~1 API call per table/chart
- Cost: Varies by provider (typically $0.01-$0.10 per document)
Output Formats¶
Doctra generates multiple output formats simultaneously.
Markdown (.md)¶
Human-readable document with:
- All text content
- Embedded images
- Table references
- Section structure
Best for: Documentation, reading, version control
HTML (.html)¶
Web-ready document with:
- Styled content
- Interactive tables
- Image galleries
- Responsive layout
Best for: Web publishing, presentations
Excel (.xlsx)¶
Spreadsheet with:
- One sheet per table
- Formatted cells
- Headers and data
Best for: Data analysis, further processing
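Because each table lands in a regular workbook, the results feed straight into a data-analysis workflow. A sketch using pandas (the output path is a placeholder; actual locations depend on your output directory):

```python
import pandas as pd

# Each detected table is written to its own sheet in tables.xlsx
sheets = pd.read_excel("outputs/tables.xlsx", sheet_name=None)  # placeholder path

for name, df in sheets.items():
    print(name, df.shape)
```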
JSON (.json)¶
Structured data with:
- Element metadata
- Coordinates
- Content
- Relationships
Best for: Programmatic access, integration
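A sketch of programmatic access (the JSON filename is hypothetical; check your output directory for the actual name):

```python
import json

# Element metadata, coordinates, and content become plain Python objects
with open("outputs/result.json", encoding="utf-8") as f:  # hypothetical filename
    data = json.load(f)

print(type(data))  # inspect the top-level structure before building on it
```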
Best Practices¶
Choosing the Right Parser¶
```python
# General documents
from doctra import StructuredPDFParser
parser = StructuredPDFParser()

# Scanned or low-quality documents
from doctra import EnhancedPDFParser
parser = EnhancedPDFParser(use_image_restoration=True)

# Only need charts/tables
from doctra import ChartTablePDFParser
parser = ChartTablePDFParser(extract_charts=True, extract_tables=True)
```
Optimizing Performance¶
- Use appropriate DPI: Higher isn't always better
- Enable GPU when available: Set `restoration_device="cuda"` when image restoration is enabled
- Batch processing: Reuse parser instances across documents (see the sketch below)
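A sketch tying these together, assuming restoration is in use and that EnhancedPDFParser accepts the base parser's DPI option (file names are placeholders):

```python
from doctra import EnhancedPDFParser

# One parser instance, GPU-backed restoration, moderate DPI
parser = EnhancedPDFParser(
    use_image_restoration=True,
    restoration_device="cuda",  # use "cpu" when no GPU is available
    dpi=200,
)

# Reuse the same instance across a batch instead of re-creating it per file
for pdf in ["report_q1.pdf", "report_q2.pdf", "report_q3.pdf"]:
    parser.parse(pdf)
```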
Managing Costs¶
When using VLM:
- Test without VLM first: Ensure layout detection works
- Process selectively: Use ChartTablePDFParser for specific elements
- Use cheaper models: Consider Gemini for cost savings
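A sketch of the test-first workflow (the Gemini provider string is assumed; see the VLM Integration page for exact values):

```python
from doctra import StructuredPDFParser

# First pass: no VLM, confirm layout detection and OCR look right
parser = StructuredPDFParser(use_vlm=False)
parser.parse("document.pdf")  # placeholder path

# Second pass: enable VLM with a lower-cost provider for tables and charts
parser = StructuredPDFParser(
    use_vlm=True,
    vlm_provider="gemini",      # assumed provider identifier
    vlm_api_key="your-api-key",
)
parser.parse("document.pdf")
```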
Next Steps¶
- Structured Parser - Learn about the base parser
- Enhanced Parser - Document restoration
- VLM Integration - Structured data extraction
- Examples - See it in action