Installation
This guide will help you install Doctra and its dependencies on your system.
Requirements
- Python 3.8 or higher
- pip package manager
- Poppler (for PDF processing)
- Tesseract OCR (automatically handled by dependencies)
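The requirements above can be checked from Python before installing anything. This is a minimal sketch, not part of Doctra: the function name check_requirements is hypothetical, and it assumes that finding the pdftoppm executable on PATH is a fair proxy for a working Poppler install.

```python
import importlib.util
import shutil
import sys


def check_requirements(min_python=(3, 8)):
    """Return a dict mapping each requirement to True or False."""
    return {
        # Python 3.8 or higher
        "python": sys.version_info >= min_python,
        # pip is importable by the interpreter running this script
        "pip": importlib.util.find_spec("pip") is not None,
        # pdftoppm is one of the command-line tools shipped with Poppler
        "poppler": shutil.which("pdftoppm") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_requirements().items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```

Run it with the same interpreter you plan to install Doctra into, since the pip check applies only to that interpreter.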
Installing Doctra
From PyPI (Recommended)
The easiest way to install Doctra is from PyPI using pip:
pip install doctra
This will install Doctra and all Python dependencies automatically.
From Source
To install the latest development version from source, clone the repository and run the following from its root directory:
pip install -e .
The -e flag installs in editable mode, which is useful for development.
System Dependencies
Doctra requires Poppler for PDF processing. Follow the instructions for your operating system:
Ubuntu/Debian
sudo apt-get update
sudo apt-get install poppler-utils
macOS
Using Homebrew:
brew install poppler
If you don't have Homebrew, install it from brew.sh.
Windows
Option 1: Using Conda
conda install -c conda-forge poppler
Option 2: Manual Installation
- Download Poppler for Windows from this link
- Extract the archive
- Add the bin directory to your system PATH
Google Colab
Optional Dependencies
VLM Providers
To use Vision Language Models for structured data extraction, install the appropriate provider:
OpenAI
Google Gemini
All VLM Providers
Development Dependencies
For contributing to Doctra:
This installs testing, linting, and formatting tools.
Verifying Installation
After installation, verify that Doctra is installed correctly:
You should see the version number printed (e.g., 0.4.3).
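One way to read the installed version without importing the package is the standard library's importlib.metadata. This is a sketch under the assumption that the installed distribution is named doctra, as the PyPI install name suggests. The helper installed_version is a hypothetical name, not a Doctra API.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package="doctra"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    print(found if found else "doctra is not installed in this environment")
```

Because this reads package metadata only, it still reports a version when importing the package itself would fail on a missing system dependency.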
Check System Dependencies
To check if Poppler is installed correctly, run:
pdftoppm -v
You should see the Poppler version information.
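The Poppler check can also be scripted, which helps when the shell and the Python process see different PATH values. This sketch assumes pdftoppm is the Poppler tool to look for (the troubleshooting section names it). The function name poppler_version is hypothetical.

```python
import shutil
import subprocess


def poppler_version():
    """Return the first line of pdftoppm's version banner, or None if absent."""
    exe = shutil.which("pdftoppm")
    if exe is None:
        return None
    # The banner may go to stderr rather than stdout, so capture both streams.
    proc = subprocess.run([exe, "-v"], capture_output=True, text=True)
    banner = (proc.stderr or proc.stdout).strip()
    return banner.splitlines()[0] if banner else None


if __name__ == "__main__":
    print(poppler_version() or "pdftoppm was not found on PATH")
```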
GPU Support
CUDA for Faster Processing
Doctra can leverage GPU acceleration for image restoration tasks. To enable GPU support:
- Install CUDA-compatible PyTorch:
- Verify CUDA is available:
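The CUDA check in the second step is usually done with torch.cuda.is_available(). This sketch wraps that call so it also behaves sensibly when PyTorch is not installed yet. The wrapper name cuda_available is hypothetical.

```python
import importlib.util


def cuda_available():
    """Return True or False from PyTorch, or None if PyTorch is not installed."""
    if importlib.util.find_spec("torch") is None:
        return None
    import torch

    return bool(torch.cuda.is_available())


if __name__ == "__main__":
    status = cuda_available()
    if status is None:
        print("PyTorch is not installed")
    else:
        print(f"CUDA available: {status}")
```

A False result with a GPU present usually points to a CPU-only PyTorch build or a driver mismatch rather than a Doctra problem.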
PaddlePaddle GPU Support
For GPU-accelerated layout detection and PaddleOCRVL:
# Install PaddlePaddle GPU (CUDA 12.6)
pip install paddlepaddle-gpu==3.2.1 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/
# Install PaddleOCR with doc-parser support
pip install -U "paddleocr[doc-parser]"
# Install platform-specific safetensors (required for PaddleOCRVL)
# For Linux:
pip install https://paddle-whl.bj.bcebos.com/nightly/cu126/safetensors/safetensors-0.6.2.dev0-cp38-abi3-linux_x86_64.whl
# For Windows:
pip install https://xly-devops.cdn.bcebos.com/safetensors-nightly/safetensors-0.6.2.dev0-cp38-abi3-win_amd64.whl
GPU Requirements
GPU support requires:
- NVIDIA GPU with CUDA Compute Capability 3.5+
- CUDA 12.6 (for PaddlePaddle 3.2.1); see the PaddlePaddle installation guide for other CUDA versions
- cuDNN 8.6 or higher
Automatic Installation
When you install Doctra from PyPI or from source, the PaddleOCR dependencies are installed automatically, and the installer selects the correct safetensors wheel for your platform (Linux or Windows).
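To illustrate what platform-specific selection means here, the sketch below maps the operating system name to the two wheel URLs from the manual commands above. It is not Doctra's actual installer logic, which this page does not show. The names SAFETENSORS_WHEELS and safetensors_wheel are hypothetical.

```python
import platform

# Wheel URLs copied from the manual installation commands above.
SAFETENSORS_WHEELS = {
    "Linux": (
        "https://paddle-whl.bj.bcebos.com/nightly/cu126/safetensors/"
        "safetensors-0.6.2.dev0-cp38-abi3-linux_x86_64.whl"
    ),
    "Windows": (
        "https://xly-devops.cdn.bcebos.com/safetensors-nightly/"
        "safetensors-0.6.2.dev0-cp38-abi3-win_amd64.whl"
    ),
}


def safetensors_wheel(system=None):
    """Return the wheel URL for an OS name, or None if no wheel is listed."""
    system = system or platform.system()
    return SAFETENSORS_WHEELS.get(system)


if __name__ == "__main__":
    print(safetensors_wheel() or "no platform-specific wheel listed for this OS")
```

On macOS, platform.system() returns "Darwin", so the lookup returns None. That matches the text, which lists wheels for Linux and Windows only.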
Troubleshooting
ImportError: No module named 'doctra'
Solution: Ensure Doctra is installed in your active Python environment, for example by checking the output of pip list.
If it is not listed, reinstall with pip install doctra.
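This error often means pip installed the package into a different interpreter than the one running your code. The sketch below reports which interpreter is active and whether it can locate the module. The function describe_environment is a hypothetical helper, not part of Doctra.

```python
import importlib.util
import sys


def describe_environment(module="doctra"):
    """Report the running interpreter and whether it can locate the module."""
    spec = importlib.util.find_spec(module)
    return {
        "interpreter": sys.executable,
        "importable": spec is not None,
        "location": spec.origin if spec is not None else None,
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

If the interpreter path is not the one you expect, reinstalling with python -m pip install doctra, using that same interpreter, keeps pip and Python pointed at the same environment.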
Poppler not found
Symptoms: Error message mentioning "pdftoppm" or "Poppler"
Solution:
- Verify the Poppler installation: pdftoppm -v
- If it is not installed, follow the System Dependencies section
- On Windows, ensure Poppler's bin directory is in your PATH
CUDA out of memory
Solution: Use CPU processing or reduce DPI settings:
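from doctra.parsers import StructuredPDFParser
# Lower the render resolution and keep image restoration off the GPU
parser = StructuredPDFParser(
    dpi=150,  # Reduce from default 200
    restoration_device="cpu",  # Force CPU usage
)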
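The same fallback can be automated by trying progressively lighter settings until one avoids the out-of-memory error. This is a generic sketch, not a Doctra feature: run_with_fallback and LADDER are hypothetical names, and run stands for a function of your own that builds a parser from the given keyword arguments and processes a document. It relies on PyTorch reporting GPU exhaustion as a RuntimeError whose message contains "out of memory".

```python
def run_with_fallback(run, settings_ladder):
    """Call run(**settings) for each settings dict until one avoids OOM."""
    last_error = None
    for settings in settings_ladder:
        try:
            return run(**settings)
        except RuntimeError as err:
            # Only swallow out-of-memory errors; re-raise anything else as is.
            if "out of memory" not in str(err).lower():
                raise
            last_error = err
    if last_error is None:
        raise ValueError("settings_ladder must not be empty")
    raise last_error


# Example ladder: the default DPI from the text, then the reduced CPU settings.
LADDER = [
    {"dpi": 200},
    {"dpi": 150, "restoration_device": "cpu"},
]
```

Ordering the ladder from highest quality to cheapest means a job only loses resolution when the GPU cannot handle the better settings.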
PaddleOCR model download fails
Solution: Check your network connection and retry, or download the models manually. Creating a parser triggers the download:
from doctra.parsers import StructuredPDFParser
# This will trigger model download
parser = StructuredPDFParser()
Models are downloaded to ~/.paddleocr/ on first use.
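To see whether a download completed, you can list what is in the cache directory. This sketch assumes the ~/.paddleocr/ location stated above and makes no assumption about the file layout inside it. The function cached_model_files is a hypothetical helper.

```python
from pathlib import Path


def cached_model_files(cache_dir=None):
    """List files under the model cache directory (empty if it does not exist)."""
    root = Path(cache_dir) if cache_dir else Path.home() / ".paddleocr"
    if not root.is_dir():
        return []
    return sorted(p for p in root.rglob("*") if p.is_file())


if __name__ == "__main__":
    files = cached_model_files()
    print(f"{len(files)} cached file(s)")
    for path in files:
        print(f"  {path} ({path.stat().st_size} bytes)")
```

A zero-byte or unusually small file can indicate an interrupted download. Deleting it and creating the parser again should trigger a fresh download.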
Next Steps
Now that you have Doctra installed, check out:
- Quick Start - Your first Doctra program
- System Requirements - Detailed hardware requirements
- User Guide - Learn about core concepts
Getting Help
If you encounter issues during installation:
- Check the GitHub Issues for similar problems
- Create a new issue with:
- Your operating system and version
- Python version (python --version)
- Full error message
- Installation method used
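Most of the details requested above can be gathered in one step and pasted into the issue. This is a minimal sketch. The function issue_report is hypothetical, and it assumes the installed distribution is named doctra. The error message and installation method still need to be added by hand.

```python
import platform
import sys
from importlib.metadata import PackageNotFoundError, version


def issue_report():
    """Collect the environment details requested for a bug report."""
    try:
        doctra_version = version("doctra")
    except PackageNotFoundError:
        doctra_version = "not installed"
    return {
        "os": platform.platform(),
        "python": sys.version.split()[0],
        "doctra": doctra_version,
    }


if __name__ == "__main__":
    for key, value in issue_report().items():
        print(f"{key}: {value}")
```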