Welcome to Doctra¶

Overview¶

Doctra is a Python library for parsing, extracting, and analyzing content from PDF documents. It combines layout detection, OCR, image restoration, and Vision Language Models (VLMs) to cover the full document-processing workflow, from restoring scanned pages to exporting structured data.

Key Features¶

Comprehensive PDF Parsing¶

Layout Detection: Advanced document layout analysis using PaddleOCR
OCR Processing: High-quality text extraction with Tesseract
Visual Elements: Automatic extraction of figures, charts, and tables
Multiple Parsers: Choose the right parser for your use case

Image Restoration¶

Six Restoration Tasks: Dewarping, deshadowing, appearance enhancement, deblurring, binarization, and end-to-end restoration
DocRes Integration: State-of-the-art document image restoration
GPU Acceleration: Automatic CUDA detection for faster processing
Enhanced Quality: Improves document quality for better OCR results
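
Automatic CUDA detection of the kind listed above usually comes down to a check like the one below. This is a minimal sketch that assumes PyTorch is the backend; `pick_device` is a hypothetical helper written for illustration and is not part of Doctra's API.

```python
def pick_device(prefer_gpu: bool = True) -> str:
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    if not prefer_gpu:
        return "cpu"
    try:
        import torch  # optional dependency; assumed backend for this sketch
    except ImportError:
        # Without PyTorch there is nothing to query, so fall back to the CPU.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


print(pick_device())
```

Running the check yourself before a long restoration job is a quick way to confirm whether the GPU will actually be used.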

VLM Integration¶

Structured Data Extraction: Convert charts and tables to structured formats
Multiple Providers: OpenAI, Gemini, Anthropic, and OpenRouter support
Automatic Conversion: Transform visual elements into usable data
Flexible Configuration: Easy API key management and model selection
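
One common way to handle API keys for several providers is to read them from environment variables instead of hard-coding them. The sketch below illustrates that pattern only: the environment variable names and the `resolve_api_key` helper are assumptions for this example, not Doctra settings, so check the VLM documentation for the parameters Doctra actually accepts.

```python
import os
from typing import Mapping, Optional

# Assumed environment variable names; adjust them to match your own setup.
ENV_VARS = {
    "openai": "OPENAI_API_KEY",
    "gemini": "GEMINI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "openrouter": "OPENROUTER_API_KEY",
}


def resolve_api_key(provider: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Look up the API key for a provider, failing loudly if it is missing."""
    env = os.environ if env is None else env
    name = provider.lower()
    if name not in ENV_VARS:
        raise ValueError(f"Unknown provider: {provider!r}")
    key = env.get(ENV_VARS[name], "").strip()
    if not key:
        raise RuntimeError(f"Set {ENV_VARS[name]} before using {name}")
    return key
```

Failing early with a clear message is usually preferable to discovering a missing key halfway through a long parsing run.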

Rich Output Formats¶

Markdown: Human-readable documents with embedded images
Excel: Structured data in spreadsheet format
JSON: Programmatically accessible data
HTML: Interactive web-ready documents
Images: High-quality cropped visual elements

User-Friendly Interfaces¶

Web UI: Gradio-based interface with drag & drop
Command Line: Powerful CLI for automation
Python API: Full programmatic access
Real-time Progress: Track processing status

Quick Start¶

Installation¶

pip install doctra

Basic Usage¶

from doctra import StructuredPDFParser

# Initialize parser
parser = StructuredPDFParser()

# Parse a document
parser.parse("document.pdf")
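
To process a whole folder, the same `parse` call can be wrapped in a loop. The sketch below is one possible approach, not a Doctra feature: `parse_folder` is a hypothetical helper that accepts any callable taking a file path, so it makes no assumptions about what `parse` returns or where it writes its output.

```python
from pathlib import Path
from typing import Callable, Dict, Optional


def parse_folder(folder: str, parse: Callable[[str], object]) -> Dict[str, Optional[str]]:
    """Call parse() on every *.pdf file in folder.

    Returns a mapping of file name to error message (None on success).
    """
    results: Dict[str, Optional[str]] = {}
    for pdf in sorted(Path(folder).glob("*.pdf")):
        try:
            parse(str(pdf))
            results[pdf.name] = None
        except Exception as exc:
            # Record the failure and continue so one bad file does not stop the batch.
            results[pdf.name] = f"{type(exc).__name__}: {exc}"
    return results
```

With the parser from the example above, this could be called as `parse_folder("reports", parser.parse)`, where `"reports"` is a placeholder directory name. Note that the `*.pdf` pattern is case-sensitive on most Linux systems.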

System Dependencies

Doctra requires Poppler for PDF processing. See the Installation Guide for detailed setup instructions.
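
A quick way to check whether Poppler is installed is to look for one of its command-line tools on the `PATH`. The sketch below uses `pdftoppm`, which ships with Poppler; `poppler_available` is a hypothetical helper for this example, and a `False` result only means the tool was not found on the `PATH`, not necessarily that Poppler is absent from the system.

```python
import shutil


def poppler_available() -> bool:
    """Return True if Poppler's pdftoppm tool is found on PATH."""
    return shutil.which("pdftoppm") is not None


if not poppler_available():
    print("Poppler not found on PATH; see the Installation Guide before parsing PDFs.")
```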

Core Components¶

Parsers¶

Parser	Description	Best For
StructuredPDFParser	Complete document processing	General purpose parsing
EnhancedPDFParser	Parsing with image restoration	Scanned or low-quality documents
ChartTablePDFParser	Focused extraction	Only charts and tables needed
PaddleOCRVLPDFParser	End-to-end VLM parsing	Complex documents with charts and tables
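
The "Best For" column can be read as a simple decision rule. The sketch below encodes it as a hypothetical `suggest_parser` helper that returns a class name from the table; the order in which the conditions are checked is an assumption made for this example, not guidance from the Doctra documentation.

```python
def suggest_parser(
    scanned: bool = False,
    charts_tables_only: bool = False,
    complex_layout: bool = False,
) -> str:
    """Return the name of the parser class the table suggests."""
    if charts_tables_only:
        # Only charts and tables are needed.
        return "ChartTablePDFParser"
    if scanned:
        # Scanned or low-quality input benefits from image restoration.
        return "EnhancedPDFParser"
    if complex_layout:
        # Complex documents with charts and tables.
        return "PaddleOCRVLPDFParser"
    # General-purpose default.
    return "StructuredPDFParser"


print(suggest_parser(scanned=True))  # prints "EnhancedPDFParser"
```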

Engines¶

Engine	Description	Use Case
DocResEngine	Image restoration	Standalone image enhancement
Layout Detection	Document analysis	Identify document structure
OCR Engine	Text extraction	Extract text from images
VLM Service	AI processing	Convert visuals to structured data

Use Cases¶

Financial Reports: Extract tables, charts, and text from financial documents
Research Papers: Parse academic papers with figures and tables
Document Archival: Convert scanned documents to searchable formats
Data Extraction: Extract structured data from visual elements
Document Enhancement: Restore and improve low-quality documents

Getting Help¶

Documentation: You're reading it! Explore the sidebar for detailed guides
GitHub Issues: Report bugs or request features
PyPI: View package details

📓 Interactive Notebooks¶

Notebook	Description
01_doctra_quick_start	Comprehensive tutorial covering layout detection, content extraction, and multi-format outputs with visual examples
case_study_01_financial_report_analysis	Financial report analysis: Extract tables and charts from PDF reports, convert visual elements to structured data using VLM, and analyze financial data with pandas
case_study_02_scanned_document_restoration	Scanned document restoration: Apply DocRes engine for image restoration (appearance, dewarping, deshadowing, deblurring, binarization, end2end), restore PDFs, and compare parsing results before and after restoration

What's Next?¶

Quick Start: Get up and running with Doctra in minutes
User Guide: Learn about parsers, engines, and advanced features
API Reference: Detailed API documentation for all components
Examples: Real-world examples and integration patterns

Acknowledgments¶

Doctra builds upon several excellent open-source projects:

PaddleOCR - Advanced document layout detection and OCR capabilities
DocRes - State-of-the-art document image restoration model
Outlines - Structured output generation for LLMs

We thank the developers and contributors of these projects for their valuable work.

License¶

Doctra is released under the MIT License. See the LICENSE file for details.