OCR Engine¶
Guide to text extraction using OCR in Doctra.
Overview¶
Doctra uses Tesseract OCR to extract text from document images. The engine's language, page segmentation mode, and engine mode can each be set to suit different document types and languages.
Configuration¶
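from doctra import StructuredPDFParser

parser = StructuredPDFParser(
    ocr_lang="eng",  # Tesseract language code
    ocr_psm=6,       # page segmentation mode: uniform block of text
    ocr_oem=3,       # OCR engine mode: Tesseract default
)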
Parameters¶
- ocr_lang: Tesseract language code
  - eng: English
  - fra: French
  - spa: Spanish
  - deu: German
  - Multiple: eng+fra
- ocr_psm: Page segmentation mode
  - 3: Automatic
  - 6: Uniform block (default)
  - 11: Sparse text
  - 12: Sparse with OSD
- ocr_oem: OCR engine mode
  - 0: Legacy
  - 1: Neural nets LSTM
  - 3: Default (based on what is available)
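
To see how the three options relate to Tesseract itself, the sketch below maps them onto the `-l`, `--psm`, and `--oem` flags of the `tesseract` command line. The helper name `tesseract_args` is hypothetical and this is not Doctra's internal code; it only assumes that the values are forwarded to Tesseract unchanged.

```python
def tesseract_args(ocr_lang="eng", ocr_psm=6, ocr_oem=3):
    """Hypothetical helper: build tesseract CLI flags from the three options."""
    # Tesseract defines page segmentation modes 0-13 and engine modes 0-3.
    if ocr_psm not in range(14):
        raise ValueError(f"ocr_psm must be between 0 and 13, got {ocr_psm}")
    if ocr_oem not in range(4):
        raise ValueError(f"ocr_oem must be between 0 and 3, got {ocr_oem}")
    # A language string is one or more non-empty codes joined with '+'.
    if not ocr_lang or any(not code for code in ocr_lang.split("+")):
        raise ValueError(f"invalid language string: {ocr_lang!r}")
    return ["-l", ocr_lang, "--psm", str(ocr_psm), "--oem", str(ocr_oem)]


print(tesseract_args())                      # flags for the values in the Configuration example
print(tesseract_args("eng+fra", ocr_psm=3))  # two languages, automatic segmentation
```

Validating the values up front gives a clear error for an out-of-range mode instead of a failure inside the OCR step.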
Improving Accuracy¶
1. Increase DPI¶
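
Rendering pages at a higher DPI gives Tesseract more pixels per character. As a rough illustration (not Doctra code), the sketch below converts a font size in points to its approximate pixel height at a given DPI, using the fact that one point is 1/72 inch. The function name `glyph_height_px` is hypothetical, and the parser option that sets the render DPI is not shown in this section, so check the API Reference for it.

```python
def glyph_height_px(font_size_pt, dpi):
    """Approximate pixel height of text of the given point size at `dpi`."""
    # One typographic point is 1/72 inch, and `dpi` is pixels per inch.
    return font_size_pt / 72 * dpi


for dpi in (72, 150, 300):
    height = glyph_height_px(10, dpi)
    print(f"10 pt text at {dpi} DPI is about {height:.1f} px tall")
```

Small print that is only a few pixels tall at low DPI becomes several times taller at 300 DPI, the resolution that Tesseract's own documentation suggests as a minimum.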
2. Use Image Restoration¶
3. Correct Language¶
Multi-language Documents¶
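
Tesseract accepts several languages joined with `+`, as in the `eng+fra` value listed under Parameters. A minimal sketch of building that string from a list of codes follows; `join_languages` is a hypothetical helper, not part of Doctra.

```python
def join_languages(codes):
    """Hypothetical helper: join Tesseract language codes with '+'."""
    seen = []
    for code in codes:
        code = code.strip()
        if code and code not in seen:  # drop blanks and duplicates, keep order
            seen.append(code)
    if not seen:
        raise ValueError("at least one language code is required")
    return "+".join(seen)


ocr_lang = join_languages(["eng", "fra", "eng"])
print(ocr_lang)  # eng+fra
```

The resulting string can be passed as `ocr_lang` to `StructuredPDFParser`, in the same way as the single code in the Configuration example.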
See Also¶
- Enhanced Parser - Improve OCR with restoration
- Core Concepts - Understanding OCR in the pipeline
- API Reference - OCR configuration options