Installation¶
This guide will help you install Doctra and its dependencies on your system.
Requirements¶
- Python 3.8 or higher
- pip package manager
- Poppler (for PDF processing)
- Tesseract OCR (automatically handled by dependencies)
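The first two requirements can be checked from inside the interpreter you plan to use. This is a minimal sketch: the helper names `meets_python_requirement` and `pip_is_available` are illustrative and not part of Doctra, and the only value taken from this guide is the 3.8 minimum.

```python
import importlib.util
import sys

MIN_PYTHON = (3, 8)  # minimum version listed in the requirements above


def meets_python_requirement(version_info=None, minimum=MIN_PYTHON):
    """Return True if the given (or running) interpreter is new enough."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= tuple(minimum)


def pip_is_available():
    """Return True if the pip module can be found by this interpreter."""
    return importlib.util.find_spec("pip") is not None


status = "OK" if meets_python_requirement() else "too old"
print("Python", ".".join(map(str, sys.version_info[:3])), status)
print("pip available:", pip_is_available())
```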
Installing Doctra¶
From PyPI (Recommended)¶
The easiest way to install Doctra is from PyPI using pip:
pip install doctra
This installs Doctra and all of its Python dependencies automatically.
From Source¶
To install the latest development version from source, clone the Doctra repository and run the following from the repository root:
pip install -e .
The -e flag installs in editable mode, which is useful for development.
System Dependencies¶
Doctra requires Poppler for PDF processing. Follow the instructions for your operating system:
Ubuntu/Debian¶
Install the poppler-utils package with apt:
sudo apt-get update
sudo apt-get install poppler-utils
macOS¶
Using Homebrew:
brew install poppler
If you don't have Homebrew, install it from brew.sh.
Windows¶
Option 1: Using Conda¶
Install Poppler from the conda-forge channel:
conda install -c conda-forge poppler
Option 2: Manual Installation¶
- Download Poppler for Windows from this link
- Extract the archive
- Add the bin directory to your system PATH
Google Colab¶
In a Colab notebook, install Poppler with apt in a cell before using Doctra:
!apt-get install -y poppler-utils
Optional Dependencies¶
VLM Providers¶
To use Vision Language Models for structured data extraction, install the appropriate provider:
OpenAI¶
Google Gemini¶
All VLM Providers¶
Development Dependencies¶
For contributing to Doctra:
This installs testing, linting, and formatting tools.
Verifying Installation¶
After installation, verify that Doctra is installed correctly by querying its version from the Python environment you installed it into.
You should see the version number printed (e.g., 0.4.3).
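One way to query the version without importing Doctra itself is through the standard library's package metadata. This is a sketch: `installed_version` is an illustrative helper, and it relies only on the distribution being named `doctra`, as in the pip command used in this guide.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package="doctra"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


found = installed_version()
print(found if found else "doctra is not installed in this environment")
```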
Check System Dependencies¶
To check if Poppler is installed correctly, run:
pdftoppm -v
You should see the Poppler version information.
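The same check can be scripted, which is convenient in setup scripts or CI. The sketch below assumes only that Poppler's `pdftoppm` executable is what needs to be on PATH; `poppler_version` is an illustrative helper, not a Doctra function.

```python
import shutil
import subprocess


def poppler_version(executable="pdftoppm"):
    """Return the version banner of pdftoppm, or None if it is not on PATH."""
    path = shutil.which(executable)
    if path is None:
        return None
    result = subprocess.run(
        [path, "-v"], capture_output=True, text=True, timeout=10
    )
    # pdftoppm usually writes its version banner to stderr, so check both streams.
    return (result.stderr or result.stdout).strip()


banner = poppler_version()
if banner is None:
    print("pdftoppm not found on PATH")
else:
    print(banner)
```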
GPU Support¶
CUDA for Faster Processing¶
Doctra can leverage GPU acceleration for image restoration tasks. To enable GPU support:
- Install CUDA-compatible PyTorch:
- Verify CUDA is available:
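The verification step can be sketched as follows. It uses PyTorch's `torch.cuda.is_available()` and returns `None` rather than failing when PyTorch is not installed; `cuda_status` is an illustrative helper name.

```python
def cuda_status():
    """Return True/False from PyTorch, or None if PyTorch is not installed."""
    try:
        import torch
    except ImportError:
        return None
    return bool(torch.cuda.is_available())


status = cuda_status()
if status is None:
    print("PyTorch is not installed")
elif status:
    import torch

    # Report the name of the first visible GPU.
    print("CUDA available:", torch.cuda.get_device_name(0))
else:
    print("PyTorch is installed, but CUDA is not available")
```

If this reports that CUDA is not available on a machine with an NVIDIA GPU, a CPU-only PyTorch build is a common cause.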
PaddlePaddle GPU Support¶
For GPU-accelerated layout detection:
GPU Requirements
GPU support requires:
- NVIDIA GPU with CUDA Compute Capability 3.5+
- CUDA 11.8 or higher
- cuDNN 8.6 or higher
Troubleshooting¶
ImportError: No module named 'doctra'¶
Solution: Ensure Doctra is installed in your active Python environment:
pip show doctra
If pip reports that the package is not found, reinstall with pip install doctra.
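This error often means pip installed the package into a different interpreter than the one running your code. A small sketch to see which interpreter is active and whether it can locate the module (`locate_module` is an illustrative helper):

```python
import importlib.util
import sys


def locate_module(name="doctra"):
    """Return the file path the active interpreter would import, or None."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    return spec.origin


print("Interpreter:", sys.executable)
print("Environment:", sys.prefix)
print("doctra found at:", locate_module() or "not importable here")
```

If the interpreter shown is not the one you expected, installing with that interpreter explicitly (`python -m pip install doctra`) avoids the mismatch.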
Poppler not found¶
Symptoms: Error message mentioning "pdftoppm" or "Poppler"
Solution:
- Verify the Poppler installation: pdftoppm -v
- If it is not installed, follow the System Dependencies section
- On Windows, ensure Poppler's bin directory is in your PATH
CUDA out of memory¶
Solution: Use CPU processing or reduce DPI settings:
from doctra.parsers import StructuredPDFParser

parser = StructuredPDFParser(
    dpi=150,                   # Reduce from the default of 200 to lower memory use
    restoration_device="cpu",  # Force CPU usage
)
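If the same script runs on machines with and without enough GPU memory, the choice can be made in one place. The sketch below is hypothetical: `parser_kwargs` is not a Doctra function, and the `"cuda"` device string is an assumption. Only `dpi`, `restoration_device`, the `"cpu"` value, and the default of 200 come from the example above.

```python
from typing import Any, Dict


def parser_kwargs(cuda_available: bool, low_memory: bool = False) -> Dict[str, Any]:
    """Build keyword arguments for the parser (hypothetical helper)."""
    kwargs: Dict[str, Any] = {
        "dpi": 200,  # default mentioned in the example above
        # "cuda" is an assumed device string; "cpu" comes from the example above.
        "restoration_device": "cuda" if cuda_available else "cpu",
    }
    if low_memory:
        # Fall back to the conservative settings suggested for out-of-memory errors.
        kwargs["dpi"] = 150
        kwargs["restoration_device"] = "cpu"
    return kwargs


print(parser_kwargs(cuda_available=True, low_memory=True))
# The result would then be passed on as StructuredPDFParser(**kwargs).
```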
PaddleOCR model download fails¶
Solution: Check your network connection, then retry the download by creating a parser (or download the models manually):
from doctra.parsers import StructuredPDFParser
# This will trigger model download
parser = StructuredPDFParser()
Models are downloaded to ~/.paddleocr/ on first use.
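To see whether a download left anything behind, you can inspect that directory. This is a sketch using only the `~/.paddleocr/` location stated above; `cache_summary` is an illustrative helper and makes no assumption about which files the models consist of.

```python
from pathlib import Path


def cache_summary(cache_dir=None):
    """Return (file_count, total_bytes) for a model cache directory."""
    root = Path(cache_dir) if cache_dir is not None else Path.home() / ".paddleocr"
    if not root.is_dir():
        return (0, 0)
    files = [p for p in root.rglob("*") if p.is_file()]
    return (len(files), sum(p.stat().st_size for p in files))


count, size = cache_summary()
print(f"{count} files, {size / 1e6:.1f} MB in ~/.paddleocr")
```

A count of zero after a failed run suggests the download never started, which points to a network or proxy problem rather than a corrupted file.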
Next Steps¶
Now that you have Doctra installed, check out:
- Quick Start - Your first Doctra program
- System Requirements - Detailed hardware requirements
- User Guide - Learn about core concepts
Getting Help¶
If you encounter issues during installation:
- Check the GitHub Issues for similar problems
- Create a new issue with:
  - Your operating system and version
  - Python version (python --version)
  - Full error message
  - Installation method used
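Most of the details requested for an issue can be gathered with the standard library. This is a sketch: `environment_report` is an illustrative helper, and the error message and installation method still need to be added by hand.

```python
import platform
import sys
from importlib.metadata import PackageNotFoundError, version


def environment_report(package="doctra"):
    """Collect OS, Python, and package details as a dict of strings."""
    try:
        package_version = version(package)
    except PackageNotFoundError:
        package_version = "not installed"
    return {
        "os": f"{platform.system()} {platform.release()}",
        "python": platform.python_version(),
        "executable": sys.executable,
        package: package_version,
    }


for key, value in environment_report().items():
    print(f"{key}: {value}")
```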