OCR Engine

Guide to text extraction using OCR in Doctra.

Overview

Doctra supports two OCR engines for text extraction:

PyTesseract (default) - Traditional Tesseract OCR engine with extensive language support
PaddleOCR - Uses the PP-OCRv5_server model introduced in PaddleOCR 3.0, aimed at higher accuracy on complex documents

You can choose between these engines based on your needs. PyTesseract is the default and works well for most use cases, while PaddleOCR provides enhanced accuracy for complex documents.

Choosing an OCR Engine

Doctra uses a dependency injection pattern for OCR engines. You initialize the OCR engine externally and pass it to the parser. This provides a clearer API, avoids mixed configurations, and allows reusing OCR engines across multiple parsers.
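
The pattern itself can be illustrated without Doctra. The following is a minimal sketch, not Doctra's actual implementation: `OCREngine`, `FakeEngine`, and `Parser` are hypothetical names used only to show how an engine built once can be injected into several parsers.

```python
from typing import Optional, Protocol


class OCREngine(Protocol):
    """Hypothetical interface: anything with a recognize() method."""

    def recognize(self, image: bytes) -> str: ...


class FakeEngine:
    """Stand-in engine that counts how often it is called."""

    def __init__(self) -> None:
        self.calls = 0

    def recognize(self, image: bytes) -> str:
        self.calls += 1
        return f"text from {len(image)} bytes"


class Parser:
    """Hypothetical parser that receives its engine instead of building one."""

    def __init__(self, ocr_engine: Optional[OCREngine] = None) -> None:
        # Fall back to a default engine when none is injected.
        self.ocr_engine = ocr_engine if ocr_engine is not None else FakeEngine()

    def parse(self, image: bytes) -> str:
        return self.ocr_engine.recognize(image)


# One engine instance, shared by two parsers.
shared = FakeEngine()
first, second = Parser(ocr_engine=shared), Parser(ocr_engine=shared)
first.parse(b"abc")
second.parse(b"defg")
```

Both parsers hold the same engine object, so any configuration or loaded model lives in one place.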

PyTesseract (Default)

PyTesseract is the default OCR engine and works well for most documents. It offers extensive language support and fine-grained control.

from doctra import StructuredPDFParser
from doctra.engines.ocr import PytesseractOCREngine

# Option 1: Use default PyTesseract (automatic if ocr_engine=None)
parser = StructuredPDFParser()  # Creates default PytesseractOCREngine internally

# Option 2: Explicitly configure PyTesseract
tesseract_ocr = PytesseractOCREngine(
    lang="eng",
    psm=6,
    oem=3
)
parser = StructuredPDFParser(ocr_engine=tesseract_ocr)

PaddleOCR with PP-OCRv5_server

PaddleOCR provides the advanced PP-OCRv5_server model (default in PaddleOCR 3.0), which offers:

Higher accuracy for complex documents
Better performance on GPU
Advanced text detection and recognition
Automatic model management (models downloaded automatically)

from doctra import StructuredPDFParser
from doctra.engines.ocr import PaddleOCREngine

# Initialize PaddleOCR engine
paddle_ocr = PaddleOCREngine(
    device="gpu",  # Use "cpu" if no GPU available
    use_doc_orientation_classify=False,
    use_doc_unwarping=False,
    use_textline_orientation=False
)

# Pass to parser
parser = StructuredPDFParser(ocr_engine=paddle_ocr)

Reusing OCR Engines

Because the engine is injected rather than created by the parser, you can create an OCR engine once and reuse it across multiple parsers:

from doctra.engines.ocr import PytesseractOCREngine
from doctra import StructuredPDFParser, EnhancedPDFParser

# Create OCR engine once
shared_ocr = PytesseractOCREngine(lang="eng", psm=6, oem=3)

# Reuse across multiple parsers
parser1 = StructuredPDFParser(ocr_engine=shared_ocr)
parser2 = EnhancedPDFParser(ocr_engine=shared_ocr)
parser3 = StructuredPDFParser(ocr_engine=shared_ocr)

PyTesseract Parameters

These parameters are configured when initializing PytesseractOCREngine:

lang: Tesseract language code
  - eng: English
  - fra: French
  - spa: Spanish
  - deu: German
  - Multiple: eng+fra
psm: Page segmentation mode
  - 3: Automatic
  - 4: Assume a single column of text (default)
  - 6: Uniform block of text
  - 11: Sparse text
  - 12: Sparse text with OSD
oem: OCR engine mode
  - 0: Legacy
  - 1: Neural nets LSTM
  - 2: Legacy + LSTM
  - 3: Default, based on what is available
extra_config: Additional Tesseract configuration string

Example:

from doctra.engines.ocr import PytesseractOCREngine

ocr = PytesseractOCREngine(
    lang="eng",
    psm=6,
    oem=3,
    extra_config=""
)
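
For readers who know the Tesseract command line, `psm` and `oem` correspond to its `--psm` and `--oem` flags. The helper below is a sketch of how such values are typically combined into a single config string for `pytesseract`. `build_config` is a hypothetical name, and whether Doctra assembles the string exactly this way is an assumption.

```python
def build_config(psm: int = 4, oem: int = 3, extra_config: str = "") -> str:
    """Combine Tesseract options into one config string (illustrative only)."""
    if not 0 <= psm <= 13:
        raise ValueError(f"psm must be between 0 and 13, got {psm}")
    if not 0 <= oem <= 3:
        raise ValueError(f"oem must be between 0 and 3, got {oem}")
    parts = [f"--psm {psm}", f"--oem {oem}"]
    if extra_config.strip():
        # Append any additional flags verbatim after the standard ones.
        parts.append(extra_config.strip())
    return " ".join(parts)


config = build_config(psm=6, oem=3)
```

A string like this can be passed as the `config` argument of `pytesseract.image_to_string`, alongside `lang`.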

PaddleOCR Parameters

These parameters are configured when initializing PaddleOCREngine:

device: Device to use for OCR processing
  - "gpu": Use GPU acceleration (default, recommended if available)
  - "cpu": Use CPU processing
use_doc_orientation_classify: Enable document orientation classification model (default: False)
  - Automatically detects and corrects document orientation
use_doc_unwarping: Enable text image rectification model (default: False)
  - Corrects perspective distortion in scanned documents
use_textline_orientation: Enable text line orientation classification model (default: False)
  - Handles rotated text lines

Example:

from doctra.engines.ocr import PaddleOCREngine

ocr = PaddleOCREngine(
    device="gpu",
    use_doc_orientation_classify=False,
    use_doc_unwarping=False,
    use_textline_orientation=False
)

Note: PaddleOCR 3.0 uses the PP-OCRv5_server model by default. Models are downloaded on first use and cached for later runs.
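
Since `device` defaults to "gpu", it can help to choose the value at runtime on machines that may lack one. The sketch below is one possible approach, not part of Doctra: `pick_device` is a hypothetical helper, and it assumes that asking the PaddlePaddle package itself (`paddle.device.is_compiled_with_cuda` and `paddle.device.cuda.device_count`) is an acceptable way to detect a usable GPU.

```python
def pick_device() -> str:
    """Return "gpu" when a CUDA-enabled Paddle build sees a GPU, else "cpu"."""
    try:
        import paddle  # optional dependency; absent on many CPU-only setups
    except ImportError:
        return "cpu"
    if paddle.device.is_compiled_with_cuda() and paddle.device.cuda.device_count() > 0:
        return "gpu"
    return "cpu"


device = pick_device()
```

The result can then be passed as `device=device` when initializing `PaddleOCREngine`.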

Improving Accuracy

1. Choose the Right OCR Engine

For complex documents or when accuracy is critical, consider using PaddleOCR:

from doctra import StructuredPDFParser
from doctra.engines.ocr import PaddleOCREngine

paddle_ocr = PaddleOCREngine(device="gpu")
parser = StructuredPDFParser(ocr_engine=paddle_ocr)

2. Increase DPI

Rendering pages at a higher resolution generally improves text recognition with both engines, at the cost of larger images to process:

parser = StructuredPDFParser(dpi=300)
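
To get a feel for what a DPI setting means in pixels, recall that PDF page sizes are measured in points, with 72 points per inch, so rendering at a given DPI scales each dimension by `dpi / 72`. The sketch below is plain arithmetic and does not call Doctra. `rendered_size` is a hypothetical helper.

```python
def rendered_size(width_pt: float, height_pt: float, dpi: int) -> tuple[int, int]:
    """Pixel dimensions of a PDF page rendered at the given DPI."""
    scale = dpi / 72  # PDF user space: 72 points per inch
    return round(width_pt * scale), round(height_pt * scale)


A4 = (595, 842)  # A4 page size in points, rounded
low = rendered_size(*A4, dpi=150)   # (1240, 1754)
high = rendered_size(*A4, dpi=300)  # (2479, 3508)
```

Doubling the DPI roughly quadruples the pixel count per page, which is worth keeping in mind for large documents.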

3. Use Image Restoration

Enhance document quality before OCR:

from doctra import EnhancedPDFParser
from doctra.engines.ocr import PaddleOCREngine

paddle_ocr = PaddleOCREngine(device="gpu")
parser = EnhancedPDFParser(
    use_image_restoration=True,
    ocr_engine=paddle_ocr  # Combine with PaddleOCR for best results
)

4. Correct Language (PyTesseract)

For PyTesseract, specify the document language when initializing the engine:

from doctra import StructuredPDFParser
from doctra.engines.ocr import PytesseractOCREngine

tesseract_ocr = PytesseractOCREngine(lang="fra")  # For French documents
parser = StructuredPDFParser(ocr_engine=tesseract_ocr)

Multi-language Documents (PyTesseract)

PyTesseract can recognize several languages in the same document. Join the language codes with + when initializing the engine:

from doctra import StructuredPDFParser
from doctra.engines.ocr import PytesseractOCREngine

tesseract_ocr = PytesseractOCREngine(lang="eng+fra+deu")  # Multiple languages
parser = StructuredPDFParser(ocr_engine=tesseract_ocr)
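
When the language list comes from configuration or user input, the `lang` string can be built programmatically. The helper below is a small sketch. `join_languages` is a hypothetical name, and it only formats the string. Each code still needs its trained data installed in Tesseract.

```python
def join_languages(codes: list[str]) -> str:
    """Build a Tesseract language string such as "eng+fra" from a list of codes."""
    cleaned = [code.strip() for code in codes if code.strip()]
    if not cleaned:
        raise ValueError("at least one language code is required")
    # Drop duplicates but keep the caller's order; dict preserves insertion order.
    return "+".join(dict.fromkeys(cleaned))


lang = join_languages(["eng", "fra", "deu", "eng"])
```

The resulting string can be passed as `lang=lang` to `PytesseractOCREngine`.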

When to Use Each Engine

Use PyTesseract when:

You are working with standard documents
You need multi-language support
You want fine-grained control over OCR parameters
You are running in a CPU-only environment

Use PaddleOCR when:

You are dealing with complex or degraded documents
You need maximum accuracy
You have a GPU available for faster processing
You are working with Asian languages (better support)
You are processing large batches of documents
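
The guidance above can be condensed into a rough rule of thumb. The function below is only a sketch of one way to encode it. `suggest_engine` and its parameters are hypothetical and not part of Doctra, and the priorities it applies are an assumption that real workloads may not share.

```python
def suggest_engine(
    has_gpu: bool,
    complex_or_degraded: bool = False,
    asian_language: bool = False,
    large_batch: bool = False,
) -> str:
    """Return "paddleocr" or "pytesseract" following the lists above."""
    # Document difficulty and language support weigh most heavily.
    if complex_or_degraded or asian_language:
        return "paddleocr"
    # Large batches favor PaddleOCR only when a GPU can speed them up.
    if has_gpu and large_batch:
        return "paddleocr"
    # Standard documents and CPU-only environments stay on the default.
    return "pytesseract"


choice = suggest_engine(has_gpu=False)
```

Treat the output as a starting point and compare both engines on a sample of your own documents before committing to one.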