VLM Integration

Guide to using Vision Language Models with Doctra.

Overview

Doctra integrates with Vision Language Models (VLMs) to convert visual elements (charts, tables, figures) into structured data. This enables automatic data extraction and conversion to Excel, HTML, and JSON formats.

Supported Providers

  • OpenAI: GPT-4 Vision, GPT-4o
  • Gemini: Google's vision models
  • Anthropic: Claude with vision
  • OpenRouter: Access to multiple models through a single API

Basic Configuration

from doctra import StructuredPDFParser

parser = StructuredPDFParser(
    use_vlm=True,
    vlm_provider="openai",
    vlm_api_key="your-api-key"
)

parser.parse("document.pdf")
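
The vlm_api_key parameter takes a plain string, so the key can come from an environment variable instead of being hardcoded. This is a minimal sketch; the variable name DOCTRA_VLM_API_KEY below is just an example, not something the library reads automatically.

import os
from doctra import StructuredPDFParser

parser = StructuredPDFParser(
    use_vlm=True,
    vlm_provider="openai",
    vlm_api_key=os.getenv("DOCTRA_VLM_API_KEY")  # key loaded from the environment
)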

Provider Setup

OpenAI

parser = StructuredPDFParser(
    use_vlm=True,
    vlm_provider="openai",
    vlm_api_key="sk-xxx",
    vlm_model="gpt-4o"  # Optional
)

Gemini

parser = StructuredPDFParser(
    use_vlm=True,
    vlm_provider="gemini",
    vlm_api_key="your-gemini-key"
)

Anthropic

parser = StructuredPDFParser(
    use_vlm=True,
    vlm_provider="anthropic",
    vlm_api_key="your-anthropic-key"
)
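
OpenRouter

OpenRouter is listed above as a supported provider and presumably follows the same configuration pattern; the provider string "openrouter" in this sketch is an assumption, so check the API reference for the exact value.

parser = StructuredPDFParser(
    use_vlm=True,
    vlm_provider="openrouter",  # assumed provider identifier
    vlm_api_key="your-openrouter-key"
)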

What Gets Processed

With VLM enabled:

  • Tables: Converted to Excel/HTML with cell data
  • Charts: Data points extracted + descriptions
  • Figures: Descriptions and context generated

Output Files

outputs/
└── document/
    └── full_parse/
        ├── tables.xlsx      # Extracted table data
        ├── tables.html      # HTML tables
        ├── vlm_items.json   # Structured data
        └── ...
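
The generated files can be read back with standard Python tooling. The snippet below is a minimal sketch assuming the layout above; the exact schema of vlm_items.json is not documented here, so it is simply loaded and printed for inspection.

import json
import pandas as pd

base = "outputs/document/full_parse"

# Extracted table data: one sheet per table in the workbook
tables = pd.read_excel(f"{base}/tables.xlsx", sheet_name=None)
for sheet_name, df in tables.items():
    print(sheet_name, df.shape)

# Structured VLM output
with open(f"{base}/vlm_items.json") as f:
    vlm_items = json.load(f)
print(vlm_items)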

Cost Considerations

VLM processing requires API calls:

  • ~1-10 calls per document
  • ~$0.01-$0.10 per document
  • Costs vary by provider and model; a rough budgeting sketch is shown below
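
For batch jobs, a back-of-the-envelope estimate from the figures above is usually enough for budgeting. The per-document cost here is a placeholder taken from that range, not a measured value.

num_documents = 500
cost_per_document = 0.05  # placeholder: midpoint of the ~$0.01-$0.10 range above
estimated_cost = num_documents * cost_per_document
print(f"Estimated VLM cost: ${estimated_cost:.2f}")  # -> Estimated VLM cost: $25.00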

See Also