Welcome to Doctra

Overview

Doctra is a Python library for parsing PDFs and extracting and analyzing their content. It combines layout detection, OCR, image restoration, and Vision Language Models (VLMs) to pull text, tables, charts, and figures out of documents.

Key Features

Comprehensive PDF Parsing

  • Layout Detection: Advanced document layout analysis using PaddleOCR
  • OCR Processing: High-quality text extraction with Tesseract
  • Visual Elements: Automatic extraction of figures, charts, and tables
  • Multiple Parsers: Choose the right parser for your use case

Image Restoration

  • 6 Restoration Tasks: Dewarping, deshadowing, appearance enhancement, deblurring, binarization, and end-to-end restoration
  • DocRes Integration: State-of-the-art document image restoration
  • GPU Acceleration: Automatic CUDA detection for faster processing
  • Enhanced Quality: Improves document quality for better OCR results
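
The automatic CUDA detection mentioned above can be pictured with a short sketch. It assumes a PyTorch backend, which this page does not confirm for Doctra, and `pick_device` is a hypothetical helper name rather than part of Doctra's API.

```python
def pick_device(prefer_gpu: bool = True) -> str:
    """Return "cuda" when a CUDA-capable GPU is usable, otherwise "cpu"."""
    if not prefer_gpu:
        return "cpu"
    try:
        import torch  # optional dependency; assumed backend for this sketch
    except ImportError:
        # Without PyTorch there is nothing to accelerate, so fall back to CPU.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Restoration would run on: {device}")
```

Falling back to the CPU instead of raising keeps the same script usable on machines without a GPU.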

VLM Integration

  • Structured Data Extraction: Convert charts and tables to structured formats
  • Multiple Providers: OpenAI, Gemini, Anthropic, and OpenRouter support
  • Automatic Conversion: Transform visual elements into usable data
  • Flexible Configuration: Easy API key management and model selection
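
One common way to handle API keys for several providers is to read them from environment variables. The sketch below illustrates that pattern only: the variable names follow general convention and are assumptions, and `resolve_api_key` is a hypothetical helper, not a Doctra function.

```python
import os
from typing import Mapping, Optional

# Assumed environment variable names; these follow common convention and are
# not taken from Doctra's documentation.
ENV_VAR_BY_PROVIDER = {
    "openai": "OPENAI_API_KEY",
    "gemini": "GEMINI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "openrouter": "OPENROUTER_API_KEY",
}


def resolve_api_key(provider: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Look up the API key for a provider, failing loudly when it is missing."""
    env = os.environ if env is None else env
    name = provider.lower()
    if name not in ENV_VAR_BY_PROVIDER:
        raise ValueError(f"Unknown provider: {provider!r}")
    var = ENV_VAR_BY_PROVIDER[name]
    key = env.get(var, "").strip()
    if not key:
        raise KeyError(f"Set {var} before using the {name} provider")
    return key
```

Keeping keys out of source code this way means the same script can switch providers by changing only the environment.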

Rich Output Formats

  • Markdown: Human-readable documents with embedded images
  • Excel: Structured data in spreadsheet format
  • JSON: Programmatically accessible data
  • HTML: Interactive web-ready documents
  • Images: High-quality cropped visual elements

User-Friendly Interfaces

  • Web UI: Gradio-based interface with drag & drop
  • Command Line: Powerful CLI for automation
  • Python API: Full programmatic access
  • Real-time Progress: Track processing status

Quick Start

Installation

pip install doctra

Basic Usage

from doctra import StructuredPDFParser

# Initialize parser
parser = StructuredPDFParser()

# Parse a document
parser.parse("document.pdf")
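
To process a whole folder, the single `parse` call above can be wrapped in a small loop. This is a minimal sketch: `parse_folder` is a hypothetical helper, and it accepts any callable so that it runs without Doctra installed.

```python
from pathlib import Path
from typing import Callable, Dict, List


def parse_folder(folder: str, parse_one: Callable[[str], object]) -> Dict[str, List[str]]:
    """Call parse_one on every PDF in folder and report which files failed."""
    report: Dict[str, List[str]] = {"parsed": [], "failed": []}
    for pdf in sorted(Path(folder).glob("*.pdf")):
        try:
            parse_one(str(pdf))
        except Exception as exc:
            # Keep going so one bad file does not stop the batch.
            report["failed"].append(f"{pdf.name}: {exc}")
        else:
            report["parsed"].append(pdf.name)
    return report
```

With Doctra installed, passing `StructuredPDFParser().parse` as `parse_one` would send each file through the parser from the example above.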

System Dependencies

Doctra requires Poppler for PDF processing. See the Installation Guide for detailed setup instructions.
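
A quick way to check whether Poppler is visible to Python is to look for its command-line tools on the PATH. The sketch below checks for `pdftoppm` and `pdfinfo`, which ship with Poppler; which of these Doctra actually calls is an assumption, and `find_poppler_tools` is a hypothetical helper name.

```python
import shutil
from typing import Dict, Optional

# pdftoppm and pdfinfo are command-line tools that ship with Poppler.
POPPLER_TOOLS = ("pdftoppm", "pdfinfo")


def find_poppler_tools() -> Dict[str, Optional[str]]:
    """Map each Poppler tool name to its path on PATH, or None if not found."""
    return {tool: shutil.which(tool) for tool in POPPLER_TOOLS}


missing = [tool for tool, path in find_poppler_tools().items() if path is None]
if missing:
    print("Poppler tools not found on PATH:", ", ".join(missing))
else:
    print("Poppler looks installed.")
```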

Core Components

Parsers

| Parser | Description | Best For |
|---|---|---|
| StructuredPDFParser | Complete document processing | General-purpose parsing |
| EnhancedPDFParser | Parsing with image restoration | Scanned or low-quality documents |
| ChartTablePDFParser | Focused extraction | Only charts and tables needed |
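
The guidance in the table above can be written down as a small decision helper. This is only a sketch: `suggest_parser` is a hypothetical function that returns the class name as a string, and giving chart-and-table extraction priority when both flags are set is an assumption, not documented behavior.

```python
def suggest_parser(scanned_or_low_quality: bool = False,
                   only_charts_and_tables: bool = False) -> str:
    """Return the name of the parser class the table recommends."""
    if only_charts_and_tables:
        return "ChartTablePDFParser"
    if scanned_or_low_quality:
        return "EnhancedPDFParser"
    return "StructuredPDFParser"


print(suggest_parser(scanned_or_low_quality=True))
```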

Engines

| Engine | Description | Use Case |
|---|---|---|
| DocResEngine | Image restoration | Standalone image enhancement |
| Layout Detection | Document analysis | Identify document structure |
| OCR Engine | Text extraction | Extract text from images |
| VLM Service | AI processing | Convert visuals to structured data |

Use Cases

  • Financial Reports: Extract tables, charts, and text from financial documents
  • Research Papers: Parse academic papers with figures and tables
  • Document Archival: Convert scanned documents to searchable formats
  • Data Extraction: Extract structured data from visual elements
  • Document Enhancement: Restore and improve low-quality documents

What's Next?

  • Quick Start: Get up and running with Doctra in minutes
  • User Guide: Learn about parsers, engines, and advanced features
  • API Reference: Detailed API documentation for all components
  • Examples: Real-world examples and integration patterns

Acknowledgments

Doctra builds upon several excellent open-source projects:

  • PaddleOCR - Advanced document layout detection and OCR capabilities
  • DocRes - State-of-the-art document image restoration model
  • Outlines - Structured output generation for LLMs

We thank the developers and contributors of these projects for their valuable work.

License

Doctra is released under the MIT License. See the LICENSE file for details.