Doctra Documentation

AdemBoukhris457/Doctra

VLM Integration

Guide to using Vision Language Models with Doctra.

Overview

Doctra integrates with Vision Language Models (VLMs) to convert visual elements (charts, tables, figures) into structured data. This enables automatic data extraction and conversion to Excel, HTML, and JSON formats.

Supported Providers

OpenAI: GPT-4 Vision, GPT-4o
Gemini: Google's vision models
Anthropic: Claude with vision
OpenRouter: Access multiple models

Basic Configuration

from doctra import StructuredPDFParser

parser = StructuredPDFParser(
    use_vlm=True,
    vlm_provider="openai",
    vlm_api_key="your-api-key"
)

parser.parse("document.pdf")
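Hardcoding an API key in source is easy to leak. As a minimal sketch, the key can be read from an environment variable and passed through the same keyword arguments shown above. The helper name `build_vlm_kwargs` and the variable name `OPENAI_API_KEY` are assumptions chosen for illustration, not part of Doctra.

```python
import os


def build_vlm_kwargs(provider, env_var, model=None, environ=None):
    """Build StructuredPDFParser keyword arguments without hardcoding the key.

    `build_vlm_kwargs` is a hypothetical helper; only the keyword names
    (use_vlm, vlm_provider, vlm_api_key, vlm_model) come from this guide.
    """
    env = os.environ if environ is None else environ
    api_key = env.get(env_var)
    if not api_key:
        raise RuntimeError(f"Set {env_var} before enabling VLM processing")

    kwargs = {
        "use_vlm": True,
        "vlm_provider": provider,
        "vlm_api_key": api_key,
    }
    if model is not None:
        kwargs["vlm_model"] = model  # optional, as in the OpenAI example
    return kwargs


# Usage (requires doctra to be installed and the variable to be set):
# from doctra import StructuredPDFParser
# parser = StructuredPDFParser(**build_vlm_kwargs("openai", "OPENAI_API_KEY"))
# parser.parse("document.pdf")
```

Failing early when the variable is missing keeps the error close to its cause instead of surfacing later as a failed API call.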

Provider Setup

OpenAI

parser = StructuredPDFParser(
    use_vlm=True,
    vlm_provider="openai",
    vlm_api_key="sk-xxx",
    vlm_model="gpt-4o"  # Optional
)

Gemini

parser = StructuredPDFParser(
    use_vlm=True,
    vlm_provider="gemini",
    vlm_api_key="your-gemini-key"
)

Anthropic

parser = StructuredPDFParser(
    use_vlm=True,
    vlm_provider="anthropic",
    vlm_api_key="your-anthropic-key"
)
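The three setups above differ only in the provider string and, for OpenAI, an optional model name. One way to keep that in a single place is a small lookup table, sketched below. `PROVIDER_PRESETS` and `vlm_settings` are hypothetical names; the table lists only the provider strings shown in this guide, since the identifier for OpenRouter is not given here.

```python
# Per-provider extras, taken from the setup examples in this guide.
# PROVIDER_PRESETS and vlm_settings are illustrative names, not Doctra APIs.
PROVIDER_PRESETS = {
    "openai": {"vlm_model": "gpt-4o"},  # optional model from the OpenAI example
    "gemini": {},
    "anthropic": {},
}


def vlm_settings(provider, api_key):
    """Return keyword arguments for StructuredPDFParser for a known provider."""
    if provider not in PROVIDER_PRESETS:
        known = ", ".join(sorted(PROVIDER_PRESETS))
        raise ValueError(f"Unknown provider {provider!r}; expected one of: {known}")
    return {
        "use_vlm": True,
        "vlm_provider": provider,
        "vlm_api_key": api_key,
        **PROVIDER_PRESETS[provider],
    }


# Usage (requires doctra to be installed):
# from doctra import StructuredPDFParser
# parser = StructuredPDFParser(**vlm_settings("gemini", "your-gemini-key"))
```

Rejecting unknown provider names up front catches typos such as "open-ai" before any document is processed.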

What Gets Processed

With VLM enabled:

Tables: Converted to Excel/HTML with cell data
Charts: Data points extracted, plus descriptions
Figures: Descriptions and context generated

Output Files

outputs/
└── document/
    └── full_parse/
        ├── tables.xlsx      # Extracted table data
        ├── tables.html      # HTML tables
        ├── vlm_items.json   # Structured data
        └── ...
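After a parse, it can be useful to confirm which of these files were written and to load the structured data. The sketch below assumes the directory layout shown above and uses only the standard library. The helper name `inspect_outputs` is hypothetical, and the internal structure of `vlm_items.json` is not described in this guide, so the code loads it without assuming a schema.

```python
import json
from pathlib import Path

# File names taken from the output tree shown in this guide.
EXPECTED_FILES = ("tables.xlsx", "tables.html", "vlm_items.json")


def inspect_outputs(document_name, root="outputs"):
    """Report which expected files exist and load vlm_items.json if present.

    Returns (report, items): report maps each file name to True/False,
    items is the parsed JSON content or None when the file is missing.
    """
    out_dir = Path(root) / document_name / "full_parse"
    report = {name: (out_dir / name).is_file() for name in EXPECTED_FILES}

    items = None
    items_path = out_dir / "vlm_items.json"
    if items_path.is_file():
        with items_path.open(encoding="utf-8") as fh:
            items = json.load(fh)  # schema is not assumed here
    return report, items


# Usage:
# report, items = inspect_outputs("document")
# print(report)
```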

Cost Considerations

VLM processing requires API calls:

~1-10 calls per document
~$0.01-$0.10 per document
Costs vary by provider
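For a rough budget, the per-document range above can be scaled by the number of documents. This is a back-of-the-envelope sketch only: the default bounds are the approximate figures from this guide, the function name `estimate_cost` is hypothetical, and actual costs depend on the provider, model, and document content.

```python
def estimate_cost(num_documents, low_per_doc=0.01, high_per_doc=0.10):
    """Return a (low, high) cost estimate in US dollars for a batch.

    Defaults are the approximate per-document range quoted in this guide.
    """
    if num_documents < 0:
        raise ValueError("num_documents must be non-negative")
    low = round(num_documents * low_per_doc, 2)
    high = round(num_documents * high_per_doc, 2)
    return low, high


low, high = estimate_cost(250)
print(f"250 documents: roughly ${low:.2f} to ${high:.2f}")
```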

See Also

Parsers - Using VLM with parsers
API Reference - VLM configuration options
Examples - VLM usage examples