Installation¶

This guide will help you install Doctra and its dependencies on your system.

Requirements¶

Python 3.8 or higher
pip package manager
Poppler (for PDF processing)
Tesseract OCR (automatically handled by dependencies)
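
As a rough pre-flight check, the requirements above can be tested from Python before installing anything. This is a minimal sketch, not part of Doctra: the `check_prerequisites` helper is hypothetical, and it assumes Poppler is detectable through its `pdftoppm` command-line tool being on PATH.

```python
import importlib.util
import shutil
import sys

MIN_PYTHON = (3, 8)  # minimum version listed in the requirements


def check_prerequisites():
    """Return a dict mapping each requirement to True or False."""
    return {
        "python>=3.8": sys.version_info >= MIN_PYTHON,
        # pip is normally discoverable as a module in the active interpreter
        "pip": importlib.util.find_spec("pip") is not None,
        # pdftoppm is one of the Poppler command-line tools
        "poppler": shutil.which("pdftoppm") is not None,
    }


for name, ok in check_prerequisites().items():
    print(f"{name}: {'OK' if ok else 'MISSING'}")
```

Any entry reported as MISSING points to the matching section of this guide.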

Installing Doctra¶

From PyPI (Recommended)¶

The easiest way to install Doctra is from PyPI using pip:

pip install doctra

This will install Doctra and all Python dependencies automatically.

From Source¶

To install the latest development version from source:

git clone https://github.com/AdemBoukhris457/Doctra.git
cd Doctra
pip install -e .

The -e flag installs the package in editable mode, so changes you make to the source take effect without reinstalling. This is useful for development.

System Dependencies¶

Doctra requires Poppler for PDF processing. Follow the instructions for your operating system:

Ubuntu/Debian¶

sudo apt-get update
sudo apt-get install poppler-utils

macOS¶

Using Homebrew:

brew install poppler

If you don't have Homebrew, install it from brew.sh.

Windows¶

Option 1: Using Conda¶

conda install -c conda-forge poppler

Option 2: Manual Installation¶

Download Poppler for Windows from this link
Extract the archive
Add the bin directory to your system PATH

Google Colab¶

!apt-get install -y poppler-utils

Optional Dependencies¶

VLM Providers¶

To use Vision Language Models for structured data extraction, install the appropriate provider:

OpenAI¶

pip install "doctra[openai]"

Google Gemini¶

pip install "doctra[gemini]"

All VLM Providers¶

pip install "doctra[openai,gemini]"
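
To see which provider SDKs are present in the current environment, a check along these lines may help. It is only a sketch: the import names `openai` and `google.generativeai` are assumptions about what each extra installs, and `is_importable` is a hypothetical helper, not a Doctra function.

```python
import importlib.util

# Assumed import names for the SDKs pulled in by each extra.
PROVIDER_MODULES = {
    "openai": "openai",
    "gemini": "google.generativeai",
}


def is_importable(module_name):
    """Return True if module_name can be located in this environment."""
    try:
        # find_spec imports parent packages but not the module itself
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # Raised when a parent package (such as "google") is missing
        return False


for extra, module in PROVIDER_MODULES.items():
    status = "available" if is_importable(module) else "not installed"
    print(f"{extra}: {status}")
```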

Development Dependencies¶

For contributing to Doctra:

pip install "doctra[dev]"

This installs testing, linting, and formatting tools.

Verifying Installation¶

After installation, verify that Doctra is installed correctly:

import doctra
print(doctra.__version__)

You should see the version number printed (e.g., 0.4.3).
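
If importing the package fails, the installed version can also be read from the package metadata without importing Doctra itself. A minimal sketch using the standard library (the `installed_version` helper is hypothetical):

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


print(installed_version("doctra") or "doctra is not installed in this environment")
```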

Check System Dependencies¶

To check if Poppler is installed correctly:

pdftoppm -v

You should see the Poppler version information.
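
The same check can be scripted, for example inside a notebook. This sketch assumes `pdftoppm` is on PATH; the `poppler_version` helper is hypothetical.

```python
import shutil
import subprocess


def poppler_version():
    """Return the first line printed by `pdftoppm -v`, or None if unavailable."""
    exe = shutil.which("pdftoppm")
    if exe is None:
        return None
    result = subprocess.run([exe, "-v"], capture_output=True, text=True)
    # pdftoppm typically writes its version banner to stderr
    output = (result.stderr or result.stdout).strip()
    return output.splitlines()[0] if output else None


print(poppler_version() or "pdftoppm not found on PATH")
```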

GPU Support¶

CUDA for Faster Processing¶

Doctra can leverage GPU acceleration for image restoration tasks. To enable GPU support:

Install CUDA-compatible PyTorch:

pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118

Verify CUDA is available:

import torch
print(f"CUDA available: {torch.cuda.is_available()}")
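
For a little more detail than a single boolean, the check can be extended to list each visible GPU and its compute capability. A sketch, assuming PyTorch may or may not be installed (the `cuda_summary` helper is hypothetical):

```python
def cuda_summary():
    """Summarize PyTorch's view of the available CUDA devices."""
    try:
        import torch
    except ImportError:
        return {"torch_installed": False, "cuda_available": False, "devices": []}

    available = torch.cuda.is_available()
    devices = []
    if available:
        for i in range(torch.cuda.device_count()):
            major, minor = torch.cuda.get_device_capability(i)
            devices.append({
                "index": i,
                "name": torch.cuda.get_device_name(i),
                "compute_capability": f"{major}.{minor}",
            })
    return {"torch_installed": True, "cuda_available": available, "devices": devices}


print(cuda_summary())
```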

PaddlePaddle GPU Support¶

For GPU-accelerated layout detection and PaddleOCRVL:

# Install PaddlePaddle GPU (CUDA 12.6)
pip install paddlepaddle-gpu==3.2.1 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/

# Install PaddleOCR with doc-parser support
pip install -U "paddleocr[doc-parser]"

# Install platform-specific safetensors (required for PaddleOCRVL)
# For Linux:
pip install https://paddle-whl.bj.bcebos.com/nightly/cu126/safetensors/safetensors-0.6.2.dev0-cp38-abi3-linux_x86_64.whl

# For Windows:
pip install https://xly-devops.cdn.bcebos.com/safetensors-nightly/safetensors-0.6.2.dev0-cp38-abi3-win_amd64.whl

GPU Requirements

GPU support requires:

NVIDIA GPU with CUDA Compute Capability 3.5+
CUDA 12.6 (for PaddlePaddle 3.2.1); for other CUDA versions, see the PaddlePaddle installation guide
cuDNN 8.6 or higher
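
One way to confirm that the installed PaddlePaddle build can actually see a GPU is sketched below. The `paddle_gpu_summary` helper is hypothetical; it relies on `paddle.device.is_compiled_with_cuda()` and `paddle.device.cuda.device_count()`, and falls back gracefully when PaddlePaddle is not installed.

```python
def paddle_gpu_summary():
    """Report whether PaddlePaddle is installed with CUDA and how many GPUs it sees."""
    try:
        import paddle
    except ImportError:
        return {"paddle_installed": False, "compiled_with_cuda": False, "gpu_count": 0}

    compiled = paddle.device.is_compiled_with_cuda()
    # Only query the device count when the build includes CUDA support
    gpu_count = paddle.device.cuda.device_count() if compiled else 0
    return {
        "paddle_installed": True,
        "compiled_with_cuda": compiled,
        "gpu_count": gpu_count,
    }


print(paddle_gpu_summary())
```

For a fuller self-test, PaddlePaddle also provides `paddle.utils.run_check()`.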

Automatic Installation

When you install Doctra from PyPI or from source, the PaddleOCR dependencies are installed automatically, and the installer selects the safetensors wheel that matches your platform (Linux or Windows).
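
To illustrate the idea of platform-specific wheel selection, the sketch below maps `sys.platform` to the two wheel URLs listed above. It is not Doctra's actual installer logic; the `safetensors_wheel_for` helper is hypothetical.

```python
import sys

# Wheel URLs taken from the manual installation commands in this guide.
SAFETENSORS_WHEELS = {
    "linux": "https://paddle-whl.bj.bcebos.com/nightly/cu126/safetensors/"
             "safetensors-0.6.2.dev0-cp38-abi3-linux_x86_64.whl",
    "win32": "https://xly-devops.cdn.bcebos.com/safetensors-nightly/"
             "safetensors-0.6.2.dev0-cp38-abi3-win_amd64.whl",
}


def safetensors_wheel_for(platform_name=None):
    """Return the wheel URL for a platform string, or None if unsupported."""
    if platform_name is None:
        platform_name = sys.platform
    for prefix, url in SAFETENSORS_WHEELS.items():
        if platform_name.startswith(prefix):
            return url
    return None


print(safetensors_wheel_for() or "no prebuilt wheel listed for this platform")
```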

Troubleshooting¶

ImportError: No module named 'doctra'¶

Solution: Ensure Doctra is installed in your active Python environment. The following command works on Linux, macOS, and Windows:

pip show doctra

If pip reports that the package is not found, install it with pip install doctra.
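
This error often means that pip and python point at different environments. A small diagnostic like the following shows which interpreter is running and whether it can locate the package. The `diagnose` helper is hypothetical.

```python
import importlib.util
import sys


def diagnose(package="doctra"):
    """Report the active interpreter and where (if anywhere) the package lives."""
    spec = importlib.util.find_spec(package)
    return {
        "interpreter": sys.executable,
        "found": spec is not None,
        "location": spec.origin if spec is not None else None,
    }


for key, value in diagnose().items():
    print(f"{key}: {value}")
```

If `found` is False, running `python -m pip install doctra` with that same interpreter installs the package into the environment that is actually in use.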

Poppler not found¶

Symptoms: Error message mentioning "pdftoppm" or "Poppler"

Solution:

Verify Poppler installation: pdftoppm -v
If not installed, follow the System Dependencies section
On Windows, ensure Poppler's bin directory is in your PATH

CUDA out of memory¶

Solution: Use CPU processing or reduce DPI settings:

from doctra.parsers import StructuredPDFParser

parser = StructuredPDFParser(
    dpi=150,  # Reduce from default 200
    restoration_device="cpu"  # Force CPU usage
)
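
Rather than always forcing the CPU, the device could be chosen based on how much GPU memory is currently free. The sketch below is illustrative only: `pick_restoration_device` is a hypothetical helper, the 4 GB threshold is an arbitrary assumption, and passing "cuda" as `restoration_device` is assumed rather than confirmed by this guide.

```python
def pick_restoration_device(min_free_gb=4.0):
    """Return "cuda" if enough GPU memory is free, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"
    if not torch.cuda.is_available():
        return "cpu"
    # mem_get_info returns (free_bytes, total_bytes) for the current device
    free_bytes, _total_bytes = torch.cuda.mem_get_info()
    return "cuda" if free_bytes / 1024**3 >= min_free_gb else "cpu"


device = pick_restoration_device()
print(f"Suggested restoration_device: {device}")
```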

PaddleOCR model download fails¶

Solution: Check your network connection and retry, or download the models manually. Instantiating a parser triggers the download:

from doctra.parsers import StructuredPDFParser

# This will trigger model download
parser = StructuredPDFParser()

Models are downloaded to ~/.paddleocr/ on first use.
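
To check whether a download completed, the cache directory can be listed. A minimal sketch, assuming the default location mentioned above (the `cached_model_files` helper is hypothetical):

```python
from pathlib import Path


def cached_model_files(cache_dir=None):
    """Return sorted (relative_path, size_in_bytes) pairs for files in the cache."""
    if cache_dir is None:
        cache_dir = Path.home() / ".paddleocr"
    cache_dir = Path(cache_dir)
    if not cache_dir.is_dir():
        return []
    return sorted(
        (str(p.relative_to(cache_dir)), p.stat().st_size)
        for p in cache_dir.rglob("*")
        if p.is_file()
    )


for rel_path, size in cached_model_files():
    print(f"{size:>12} bytes  {rel_path}")
```

An empty listing, or files of zero bytes, suggests the download did not finish.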

Next Steps¶

Now that you have Doctra installed, check out:

Quick Start - Your first Doctra program
System Requirements - Detailed hardware requirements
User Guide - Learn about core concepts

Getting Help¶

If you encounter issues during installation:

Check the GitHub Issues for similar problems
Create a new issue with:
- Your operating system and version
- Python version (python --version)
- Full error message
- Installation method used
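
The environment details requested above can be gathered in one step and pasted into the issue. A sketch using only the standard library (the `environment_report` helper is hypothetical):

```python
import platform
import sys
from importlib.metadata import PackageNotFoundError, version


def environment_report():
    """Build a short plain-text summary of the current environment."""
    try:
        doctra_version = version("doctra")
    except PackageNotFoundError:
        doctra_version = "not installed"
    return "\n".join([
        f"OS: {platform.platform()}",
        f"Python: {sys.version.split()[0]}",
        f"Doctra: {doctra_version}",
    ])


print(environment_report())
```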