Installation

This guide will help you install Doctra and its dependencies on your system.

Requirements

  • Python 3.8 or higher
  • pip package manager
  • Poppler (for PDF processing)
  • Tesseract OCR (automatically handled by dependencies)

Installing Doctra

The easiest way to install Doctra is from PyPI using pip:

pip install doctra

This will install Doctra and all Python dependencies automatically.

From Source

To install the latest development version from source:

git clone https://github.com/AdemBoukhris457/Doctra.git
cd Doctra
pip install -e .

The -e flag installs in editable mode, which is useful for development.

System Dependencies

Doctra requires Poppler for PDF processing. Follow the instructions for your operating system:

Ubuntu/Debian

sudo apt-get update
sudo apt-get install poppler-utils

macOS

Using Homebrew:

brew install poppler

If you don't have Homebrew, install it from brew.sh.

Windows

Option 1: Using Conda

conda install -c conda-forge poppler

Option 2: Manual Installation

  1. Download Poppler for Windows from this link
  2. Extract the archive
  3. Add the bin directory to your system PATH

Google Colab

!apt-get install -y poppler-utils

Optional Dependencies

VLM Providers

To use Vision Language Models for structured data extraction, install the appropriate provider:

OpenAI

pip install "doctra[openai]"

Google Gemini

pip install "doctra[gemini]"

All VLM Providers

pip install "doctra[openai,gemini]"

Development Dependencies

For contributing to Doctra:

pip install "doctra[dev]"

This installs testing, linting, and formatting tools.

Verifying Installation

After installation, verify that Doctra is installed correctly:

import doctra
print(doctra.__version__)

You should see the version number printed (e.g., 0.4.3).

Check System Dependencies

To check if Poppler is installed correctly:

pdftoppm -v

You should see the Poppler version information.
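The same check can be scripted from Python, which is convenient in notebooks or CI. The following is a minimal sketch using only the standard library; the helper name `find_poppler` is hypothetical and not part of Doctra.

```python
import shutil
import subprocess


def find_poppler(tool="pdftoppm"):
    """Return the full path of a Poppler tool found on PATH, or None if missing."""
    return shutil.which(tool)


if __name__ == "__main__":
    path = find_poppler()
    if path is None:
        print("pdftoppm not found on PATH; see the System Dependencies section")
    else:
        # The version banner may go to stderr or stdout, so check both.
        result = subprocess.run([path, "-v"], capture_output=True, text=True)
        print(result.stderr.strip() or result.stdout.strip())
```

If the function returns `None` on Windows, the Poppler `bin` directory is most likely not on your PATH.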

GPU Support

CUDA for Faster Processing

Doctra can leverage GPU acceleration for image restoration tasks. To enable GPU support:

  1. Install CUDA-compatible PyTorch:

     pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118

  2. Verify CUDA is available:

     import torch
     print(f"CUDA available: {torch.cuda.is_available()}")

PaddlePaddle GPU Support

For GPU-accelerated layout detection:

pip uninstall paddlepaddle
pip install paddlepaddle-gpu

GPU Requirements

GPU support requires:

  • NVIDIA GPU with CUDA Compute Capability 3.5+
  • CUDA 11.8 or higher
  • cuDNN 8.6 or higher
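To compare a machine against these minimums, a small version check may help. This is an illustrative sketch only: the function names are hypothetical, the thresholds are copied from the list above, and the versions are passed in as dotted strings rather than detected automatically.

```python
def parse_version(text):
    """Turn a dotted version string such as '11.8' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


# Minimums taken from the GPU Requirements list
MIN_COMPUTE_CAPABILITY = (3, 5)
MIN_CUDA = (11, 8)
MIN_CUDNN = (8, 6)


def meets_gpu_requirements(compute_capability, cuda_version, cudnn_version):
    """Return True only if all three versions reach the listed minimums."""
    return (
        parse_version(compute_capability) >= MIN_COMPUTE_CAPABILITY
        and parse_version(cuda_version) >= MIN_CUDA
        and parse_version(cudnn_version) >= MIN_CUDNN
    )


if __name__ == "__main__":
    # Example values; replace them with the versions reported on your system.
    print(meets_gpu_requirements("8.6", "12.1", "8.9"))
```

With PyTorch installed, `torch.cuda.get_device_capability()` and `torch.version.cuda` are one way to obtain the first two values; the capability is returned as a `(major, minor)` tuple, so it needs formatting as a dotted string before being passed in.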

Troubleshooting

ImportError: No module named 'doctra'

Solution: Ensure Doctra is installed in your active Python environment:

pip list | grep doctra

If not listed, reinstall with pip install doctra.
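If `pip` and `python` point at different environments, the package can appear in `pip list` and still fail to import. A quick way to see which interpreter is running and whether it can find Doctra is sketched below; `is_installed` is a hypothetical helper built on the standard library.

```python
import importlib.util
import sys


def is_installed(package):
    """Return True if the running interpreter can locate a top-level `package`."""
    return importlib.util.find_spec(package) is not None


if __name__ == "__main__":
    print("Interpreter:", sys.executable)
    print("doctra importable:", is_installed("doctra"))
```

If the interpreter path is not the one you expected, running `python -m pip install doctra` installs into the environment of that exact interpreter.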

Poppler not found

Symptoms: Error message mentioning "pdftoppm" or "Poppler"

Solution:

  1. Verify Poppler installation: pdftoppm -v
  2. If not installed, follow the System Dependencies section
  3. On Windows, ensure Poppler's bin directory is in your PATH

CUDA out of memory

Solution: Use CPU processing or reduce DPI settings:

from doctra.parsers import StructuredPDFParser

parser = StructuredPDFParser(
    dpi=150,  # Reduce from default 200
    restoration_device="cpu"  # Force CPU usage
)

PaddleOCR model download fails

Solution: Check your network connection and retry, or download the models manually. Creating a parser triggers the download:

from doctra.parsers import StructuredPDFParser

# This will trigger model download
parser = StructuredPDFParser()

Models are downloaded to ~/.paddleocr/ on first use.
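To confirm whether anything was actually downloaded, you can inspect that directory. The sketch below assumes the `~/.paddleocr/` location stated above; the helper names are hypothetical and the layout inside the directory is not assumed.

```python
from pathlib import Path


def model_cache_dir(home=None):
    """Return the cache path mentioned above (~/.paddleocr) under the given home."""
    base = Path(home) if home is not None else Path.home()
    return base / ".paddleocr"


def cached_files(cache_dir):
    """List all files under the cache directory; empty if it does not exist."""
    if not cache_dir.is_dir():
        return []
    return sorted(p for p in cache_dir.rglob("*") if p.is_file())


if __name__ == "__main__":
    files = cached_files(model_cache_dir())
    print(f"{len(files)} cached file(s) found")
```

An empty result after a parser has been created suggests the download did not complete.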

Next Steps

Now that you have Doctra installed, check out:

Getting Help

If you encounter issues during installation:

  1. Check the GitHub Issues for similar problems
  2. Create a new issue with:
    • Your operating system and version
    • Python version (python --version)
    • Full error message
    • Installation method used
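Part of the information listed above can be gathered with a few lines of Python. This is a minimal sketch using the standard library; `environment_report` is a hypothetical helper, and the full error message still has to be copied in by hand.

```python
import platform


def environment_report(install_method="pip"):
    """Build a short text block with OS, Python version, and install method."""
    lines = [
        f"OS: {platform.system()} {platform.release()}",
        f"Python: {platform.python_version()}",
        f"Installation method: {install_method}",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(environment_report())
```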