Core Concepts¶

Understanding Doctra's core concepts will help you use the library effectively.

Document Processing Pipeline¶

Doctra follows a multi-stage pipeline for document processing:

```mermaid
graph LR
    A[PDF Input] --> B[Layout Detection]
    B --> C[Element Classification]
    C --> D[OCR Processing]
    D --> E[VLM Processing]
    E --> F[Output Generation]

    B -.Optional.-> G[Image Restoration]
    G --> C
```

Pipeline Stages¶

Layout Detection: Analyzes document structure using PaddleOCR
Element Classification: Identifies text, tables, charts, and figures
OCR Processing: Extracts text from identified regions
VLM Processing (Optional): Converts visual elements to structured data
Output Generation: Creates Markdown, Excel, HTML, and JSON files
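
The stage order above can be modeled as a plain list of steps. The sketch below is illustrative only: `build_stage_list` and its two flags are hypothetical names chosen for this example, not part of Doctra's API. It shows which stages would run depending on the optional features.

```python
def build_stage_list(use_image_restoration: bool = False, use_vlm: bool = False) -> list[str]:
    """Return the ordered stage names for one run (illustrative, not Doctra's API)."""
    stages = ["Layout Detection"]
    if use_image_restoration:
        # Optional branch in the diagram: restoration feeds into classification.
        stages.append("Image Restoration")
    stages += ["Element Classification", "OCR Processing"]
    if use_vlm:
        # VLM processing is optional and runs after OCR.
        stages.append("VLM Processing")
    stages.append("Output Generation")
    return stages


print(build_stage_list())
print(build_stage_list(use_image_restoration=True, use_vlm=True))
```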

Architecture Overview¶

Doctra is organized into several key components:

Parsers¶

Parsers are the main entry point for document processing. They orchestrate the entire pipeline.

StructuredPDFParser: The base parser for general PDF processing. Handles layout detection, OCR, and output generation.
EnhancedPDFParser: Extends StructuredPDFParser with image restoration capabilities for low-quality documents.
ChartTablePDFParser: Specialized parser focused on extracting only charts and tables.

Engines¶

Engines provide specific processing capabilities:

Layout Detection: PaddleOCR-based layout analysis to identify document structure.
OCR Engine: Tesseract-based text extraction from images.
DocRes Engine: Image restoration for document enhancement.
VLM Service: Vision Language Model integration for structured data extraction.

Exporters¶

Exporters handle output generation in various formats:

MarkdownWriter: Creates human-readable Markdown files
ExcelWriter: Generates spreadsheets with structured data
HTMLWriter: Produces web-ready HTML documents
ImageSaver: Saves cropped visual elements

Element Types¶

Doctra classifies document elements into four main types:

Text Elements¶

Regular text content including:

Paragraphs
Headings
Lists
Captions

Processing: OCR → Text extraction → Markdown formatting

Tables¶

Tabular data with rows and columns.

Processing Options:

Without VLM: Saved as images only
With VLM: Converted to Excel/HTML + saved as images

Output: tables.xlsx, tables.html, cropped images

Charts¶

Visual representations of data including:

Bar charts
Line graphs
Pie charts
Scatter plots

Processing Options:

Without VLM: Saved as images with captions
With VLM: Data extracted + description generated

Output: Cropped images, optional structured data

Figures¶

General images and diagrams including:

Photographs
Illustrations
Diagrams
Logos

Processing: Cropped and saved as images with context

Layout Detection¶

Layout detection is the foundation of Doctra's processing.

How It Works¶

Page Rendering: PDF pages rendered to images at specified DPI
Model Inference: PaddleOCR layout model identifies regions
Bounding Boxes: Each element gets coordinates and confidence score
Classification: Elements labeled as text/table/chart/figure
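
To make the rendering step concrete: PDF page sizes are expressed in points (1/72 inch), so the rendered image size scales linearly with DPI. The helper `rendered_size` below is written for this example and is not a Doctra function.

```python
def rendered_size(width_pt: float, height_pt: float, dpi: int) -> tuple[int, int]:
    """Pixel dimensions of a page rendered at the given DPI (1 pt = 1/72 inch)."""
    return round(width_pt * dpi / 72), round(height_pt * dpi / 72)


# US Letter is 612 x 792 points.
print(rendered_size(612, 792, 200))  # (1700, 2200)
print(rendered_size(612, 792, 300))  # (2550, 3300)
```

Going from 200 to 300 DPI multiplies the pixel count by 2.25, which is why higher DPI settings take noticeably longer.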

Detection Parameters¶

```python
from doctra import StructuredPDFParser

parser = StructuredPDFParser(
    layout_model_name="PP-DocLayout_plus-L",  # Model choice
    dpi=200,  # Image resolution
    min_score=0.5  # Confidence threshold
)
```

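layout_model_name: PaddleOCR layout model to use. Options: PP-DocLayout_plus-L (most accurate), PP-DocLayout_plus-M (faster)
dpi: Resolution used to render pages to images. Higher values improve quality but slow processing. Typical range: 100-300
min_score: Minimum confidence score (0-1) a detection needs to be kept. Higher values mean fewer false positives but more missed elements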
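
The effect of `min_score` can be sketched as a simple threshold over detections. The `Detection` class and `filter_detections` function are hypothetical stand-ins for this example; how Doctra represents detections internally, and whether it compares with `>=` or `>`, are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    label: str  # "text", "table", "chart", or "figure"
    score: float  # model confidence in [0, 1]
    box: tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def filter_detections(detections, min_score=0.5):
    """Keep only detections whose confidence is at least min_score."""
    return [d for d in detections if d.score >= min_score]


detections = [
    Detection("text", 0.97, (50, 40, 800, 300)),
    Detection("table", 0.62, (50, 320, 800, 700)),
    Detection("figure", 0.31, (600, 20, 780, 90)),
]
kept = filter_detections(detections, min_score=0.5)
print([d.label for d in kept])  # ['text', 'table']
```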

Visualization¶

Verify layout detection quality:

```python
parser.display_pages_with_boxes(
    pdf_path="document.pdf",
    num_pages=3
)
```

This displays the pages with color-coded bounding boxes:

🔵 Blue: Text
🔴 Red: Tables
🟢 Green: Charts
🟠 Orange: Figures

OCR Processing¶

OCR (Optical Character Recognition) extracts text from images.

Configuration¶

```python
from doctra import StructuredPDFParser

parser = StructuredPDFParser(
    ocr_lang="eng",  # Language
    ocr_psm=6,  # Page segmentation mode
    ocr_oem=3  # OCR Engine mode
)
```

ocr_lang

Tesseract language code. Examples: eng, fra, spa, deu

ocr_psm

Page segmentation mode. Common values:

3: Fully automatic page segmentation, without OSD
6: Single uniform block of text (default)
11: Sparse text
12: Sparse text with OSD (orientation and script detection)

ocr_oem

OCR Engine mode:

0: Legacy engine only
1: Neural nets LSTM engine only
2: Legacy + LSTM engines combined
3: Default (uses whichever engine is available)
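
On the Tesseract command line these modes are passed as `--psm` and `--oem` flags. The helper below is a minimal sketch of how the two values map to such a flag string and which ranges Tesseract accepts (0-13 for PSM, 0-3 for OEM). `tesseract_config` is a hypothetical name, and whether Doctra builds the string this way internally is an assumption.

```python
def tesseract_config(psm: int = 6, oem: int = 3) -> str:
    """Build a Tesseract flag string such as '--psm 6 --oem 3'."""
    if not 0 <= psm <= 13:
        raise ValueError(f"psm must be between 0 and 13, got {psm}")
    if not 0 <= oem <= 3:
        raise ValueError(f"oem must be between 0 and 3, got {oem}")
    return f"--psm {psm} --oem {oem}"


print(tesseract_config())  # --psm 6 --oem 3
print(tesseract_config(psm=11, oem=1))  # --psm 11 --oem 1
```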

Improving OCR Accuracy¶

Increase DPI: Higher resolution = better text recognition
```
parser = StructuredPDFParser(dpi=300)
```

Use Image Restoration: Enhance document quality first

```python
from doctra import EnhancedPDFParser
parser = EnhancedPDFParser(use_image_restoration=True)
```

Correct Language: Specify document language

```python
parser = StructuredPDFParser(ocr_lang="fra")  # French
```

Image Restoration¶

Image restoration improves document quality before processing.

Restoration Tasks¶

| Task | Purpose | When to Use |
|------|---------|-------------|
| `appearance` | General enhancement | Most documents (default) |
| `dewarping` | Fix perspective | Scanned with distortion |
| `deshadowing` | Remove shadows | Poor lighting |
| `deblurring` | Reduce blur | Motion blur, focus issues |
| `binarization` | B&W conversion | Clean text extraction |
| `end2end` | Full pipeline | Severely degraded |
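
Because the task is passed as a plain string, a typo is easy to make. The sketch below checks a task name against the six values in the table before it reaches the parser. `check_restoration_task` is a hypothetical helper for this example, not something Doctra is known to provide.

```python
RESTORATION_TASKS = {
    "appearance": "general enhancement",
    "dewarping": "fix perspective distortion",
    "deshadowing": "remove shadows",
    "deblurring": "reduce blur",
    "binarization": "black-and-white conversion",
    "end2end": "full restoration pipeline",
}


def check_restoration_task(task: str) -> str:
    """Return the task name unchanged, or raise ValueError if it is not in the table."""
    if task not in RESTORATION_TASKS:
        valid = ", ".join(sorted(RESTORATION_TASKS))
        raise ValueError(f"Unknown restoration task {task!r}; expected one of: {valid}")
    return task


print(check_restoration_task("dewarping"))  # dewarping
```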

Usage¶

```python
from doctra import EnhancedPDFParser

parser = EnhancedPDFParser(
    use_image_restoration=True,
    restoration_task="appearance",
    restoration_device="cuda"  # or "cpu"
)
```

Performance Impact¶

| Restoration | Processing Time | Quality Improvement |
|-------------|-----------------|---------------------|
| None | Baseline | Baseline |
| CPU | +200% | +30-50% |
| GPU | +50% | +30-50% |
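
The percentages in the table are overheads on top of the baseline, so +200% means roughly three times the baseline time. A short worked example, assuming a hypothetical baseline of 10 seconds per document:

```python
def estimated_time(baseline_s: float, overhead_pct: float) -> float:
    """Total processing time given a baseline and a percentage overhead."""
    return baseline_s * (1 + overhead_pct / 100)


baseline = 10.0  # seconds per document, an assumed figure for illustration
print(estimated_time(baseline, 200))  # CPU restoration: 30.0
print(estimated_time(baseline, 50))  # GPU restoration: 15.0
```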

VLM Integration¶

Vision Language Models convert visual elements to structured data.

Supported Providers¶

OpenAI: GPT-4 Vision, GPT-4o
Gemini: Google's vision models
Anthropic: Claude with vision
OpenRouter: Access multiple models

Configuration¶

```python
from doctra import StructuredPDFParser

parser = StructuredPDFParser(
    use_vlm=True,
    vlm_provider="openai",
    vlm_api_key="your-api-key",
    vlm_model="gpt-4o"  # Optional, uses default if not specified
)
```

What Gets Processed¶

With VLM enabled:

Tables: Converted to Excel/HTML with cell-by-cell data
Charts: Data extracted + description generated
Figures: Descriptions and context generated

Cost Considerations¶

VLM processing requires API calls:

Per Document: 1-10 API calls depending on visual elements
Per Element: ~1 API call per table/chart
Cost: Varies by provider (typically $0.01-$0.10 per document)
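
These figures allow a rough budget estimate before a large run. The sketch below applies the per-element and per-document numbers quoted above. Both helper names are hypothetical, and real costs depend on the provider, the model, and image sizes.

```python
def estimate_vlm_calls(num_tables: int, num_charts: int) -> int:
    """Roughly one API call per table or chart."""
    return num_tables + num_charts


def estimate_cost_range(num_documents: int, low: float = 0.01, high: float = 0.10) -> tuple[float, float]:
    """Lower and upper cost bounds in dollars, using the typical per-document range."""
    return num_documents * low, num_documents * high


calls = estimate_vlm_calls(num_tables=3, num_charts=2)
low, high = estimate_cost_range(num_documents=500)
print(f"{calls} API calls for this document")
print(f"500 documents: ${low:.2f} to ${high:.2f}")
```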

Output Formats¶

Doctra generates multiple output formats simultaneously.

Markdown (.md)¶

Human-readable document with:

All text content
Embedded images
Table references
Section structure

Best for: Documentation, reading, version control

HTML (.html)¶

Web-ready document with:

Styled content
Interactive tables
Image galleries
Responsive layout

Best for: Web publishing, presentations

Excel (.xlsx)¶

Spreadsheet with:

One sheet per table
Formatted cells
Headers and data

Best for: Data analysis, further processing

JSON (.json)¶

Structured data with:

Element metadata
Coordinates
Content
Relationships

Best for: Programmatic access, integration
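
As an illustration of programmatic access, the sketch below filters visual elements out of a JSON document. The schema is not specified in this guide, so the field names used here (`elements`, `type`, `page`, `bbox`, `content`) are assumptions made for the example. Inspect a real output file for the actual structure.

```python
import json

# Hypothetical sample; the real Doctra JSON layout may differ.
sample = """
{
  "elements": [
    {"type": "text",  "page": 1, "bbox": [50, 40, 800, 300],  "content": "Quarterly results"},
    {"type": "table", "page": 1, "bbox": [50, 320, 800, 700], "content": null},
    {"type": "chart", "page": 2, "bbox": [60, 100, 700, 500], "content": null}
  ]
}
"""

data = json.loads(sample)

# Keep only the visual element types described in this guide.
visual = [e for e in data["elements"] if e["type"] in {"table", "chart", "figure"}]
for element in visual:
    print(element["page"], element["type"], element["bbox"])
```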

Best Practices¶

Choosing the Right Parser¶

```python
# General documents
from doctra import StructuredPDFParser
parser = StructuredPDFParser()

# Scanned or low-quality documents
from doctra import EnhancedPDFParser
parser = EnhancedPDFParser(use_image_restoration=True)

# Only need charts/tables
from doctra import ChartTablePDFParser
parser = ChartTablePDFParser(extract_charts=True, extract_tables=True)
```

Optimizing Performance¶

Use appropriate DPI: Higher isn't always better

```python
# Good quality documents
parser = StructuredPDFParser(dpi=150)

# Low quality documents
parser = StructuredPDFParser(dpi=250)
```

Enable GPU when available:

```python
parser = EnhancedPDFParser(restoration_device="cuda")
```

Batch processing: Reuse parser instances

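```python
pdf_files = ["first.pdf", "second.pdf"]  # example paths; replace with your own

# Create the parser once and reuse it for every file
parser = StructuredPDFParser()
for pdf in pdf_files:
    parser.parse(pdf)
```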
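
For larger batches it helps to collect the input files automatically and to keep going when one file fails. The following is a minimal sketch using only the standard library. `collect_pdfs` and `parse_all` are hypothetical helpers, and the demo uses a stand-in `fake_parse` function so that it runs without Doctra installed.

```python
import tempfile
from pathlib import Path


def collect_pdfs(folder):
    """Return the PDF files directly inside folder, sorted by name."""
    return sorted(p for p in Path(folder).iterdir() if p.suffix.lower() == ".pdf")


def parse_all(parse, pdf_paths):
    """Call parse(path) for each path and return (succeeded, failed) lists."""
    succeeded, failed = [], []
    for path in pdf_paths:
        try:
            parse(str(path))
            succeeded.append(path)
        except Exception as exc:  # keep going so one bad file does not stop the batch
            failed.append((path, exc))
    return succeeded, failed


def fake_parse(path):
    """Stand-in for a real parse call; fails for files named 'broken*'."""
    if Path(path).name.startswith("broken"):
        raise ValueError("could not read file")


with tempfile.TemporaryDirectory() as folder:
    for name in ("a.pdf", "broken.pdf", "notes.txt"):
        (Path(folder) / name).touch()
    ok, failed = parse_all(fake_parse, collect_pdfs(folder))

print([p.name for p in ok])  # ['a.pdf']
print([p.name for p, _ in failed])  # ['broken.pdf']
```

With a real parser instance, the call would presumably look like `parse_all(parser.parse, collect_pdfs("my_folder"))`, assuming `parse` accepts a path string as in the loop above.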

Managing Costs¶

When using VLM:

Test without VLM first: Ensure layout detection works
Process selectively: Use ChartTablePDFParser for specific elements
Use cheaper models: Consider Gemini for cost savings

Next Steps¶

Structured Parser - Learn about the base parser
Enhanced Parser - Document restoration
VLM Integration - Structured data extraction
Examples - See it in action