OCR Engine

Guide to text extraction using OCR in Doctra.

Overview

Doctra uses Tesseract OCR to extract text from document images. The language, page segmentation mode, and engine mode are all configurable, so the OCR engine can be tuned for different document types and languages.

Configuration

from doctra import StructuredPDFParser

parser = StructuredPDFParser(
    ocr_lang="eng",
    ocr_psm=6,
    ocr_oem=3
)

Parameters

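ocr_lang
Tesseract language code:
- eng: English
- fra: French
- spa: Spanish
- deu: German
- Multiple languages: join codes with +, e.g. eng+fra

ocr_psm
Page segmentation mode:
- 3: Fully automatic page segmentation, without orientation and script detection (OSD)
- 6: Single uniform block of text (default)
- 11: Sparse text
- 12: Sparse text with OSD

ocr_oem
OCR engine mode:
- 0: Legacy engine only
- 1: Neural nets LSTM engine only
- 2: Legacy and LSTM engines combined
- 3: Default, based on what is available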
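
For readers who know the Tesseract command line, the three parameters correspond to its -l, --psm, and --oem flags. The sketch below builds that flag list and rejects values outside the ranges Tesseract accepts (PSM 0 to 13, OEM 0 to 3). The helper name tesseract_cli_args is hypothetical and is not part of Doctra; how Doctra forwards these values to Tesseract internally is an assumption here.

```python
from __future__ import annotations


def tesseract_cli_args(lang: str = "eng", psm: int = 6, oem: int = 3) -> list[str]:
    """Build Tesseract CLI flags for the given settings.

    Hypothetical helper for illustration only; it is not a Doctra function.
    """
    # Tesseract defines page segmentation modes 0 through 13.
    if not 0 <= psm <= 13:
        raise ValueError(f"psm must be between 0 and 13, got {psm}")
    # Tesseract defines OCR engine modes 0 through 3.
    if not 0 <= oem <= 3:
        raise ValueError(f"oem must be between 0 and 3, got {oem}")
    return ["-l", lang, "--psm", str(psm), "--oem", str(oem)]


print(" ".join(tesseract_cli_args("eng", psm=6, oem=3)))
# prints: -l eng --psm 6 --oem 3
```

Validating the values before constructing a parser can surface a typo such as ocr_psm=60 early, rather than as a Tesseract error during parsing.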

Improving Accuracy

1. Increase DPI

parser = StructuredPDFParser(dpi=300)
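
To see why DPI matters, it helps to work out how many pixels a page and its text receive at a given setting. The following is a minimal arithmetic sketch, not Doctra code: the function names are hypothetical, and it assumes the page is rendered at exactly the requested DPI. PDF dimensions are measured in points, with 72 points per inch.

```python
from __future__ import annotations

POINTS_PER_INCH = 72  # PDF and typographic points


def rendered_size_px(width_in: float, height_in: float, dpi: int) -> tuple[int, int]:
    """Pixel dimensions of a page of the given physical size rendered at dpi."""
    return round(width_in * dpi), round(height_in * dpi)


def font_size_px(point_size: float, dpi: int) -> float:
    """Nominal height in pixels of a font of the given point size at dpi."""
    return point_size / POINTS_PER_INCH * dpi


# US Letter page (8.5 x 11 inches) at three common resolutions.
for dpi in (72, 150, 300):
    width, height = rendered_size_px(8.5, 11.0, dpi)
    print(f"{dpi} dpi: {width} x {height} px, 10 pt text ~ {font_size_px(10, dpi):.1f} px")
```

At 72 DPI a 10 pt font is only 10 pixels tall, which leaves little detail for character recognition; at 300 DPI the same font is about 42 pixels tall. The trade-off is that a 300 DPI page contains roughly 17 times as many pixels as a 72 DPI page, so rendering and OCR take longer and use more memory.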

2. Use Image Restoration

from doctra import EnhancedPDFParser

parser = EnhancedPDFParser(
    use_image_restoration=True
)

3. Set the Correct Language

parser = StructuredPDFParser(
    ocr_lang="fra"  # For French documents
)

Multi-language Documents

parser = StructuredPDFParser(
    ocr_lang="eng+fra+deu"  # Multiple languages
)
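
When the list of languages comes from configuration or user input, the combined string can be assembled programmatically. This is a small sketch with a hypothetical helper, join_ocr_langs, which is not part of Doctra. It trims whitespace, drops duplicates while preserving order, and joins the codes with +, the separator Tesseract expects. Codes are not lowercased, because some Tesseract language names are case-sensitive.

```python
from __future__ import annotations

from typing import Iterable


def join_ocr_langs(codes: Iterable[str]) -> str:
    """Join Tesseract language codes into a single 'a+b+c' string.

    Hypothetical helper for illustration only; it is not a Doctra function.
    """
    unique: list[str] = []
    for code in codes:
        code = code.strip()
        # Skip empty entries and codes that were already listed.
        if code and code not in unique:
            unique.append(code)
    if not unique:
        raise ValueError("at least one language code is required")
    return "+".join(unique)


print(join_ocr_langs(["eng", "fra", "deu"]))
# prints: eng+fra+deu
```

The result can be passed as ocr_lang. Each code in the string needs its Tesseract language data installed on the machine; a missing language pack typically causes Tesseract to report an error when it tries to load that language.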

See Also