# VLM Integration
Guide to using Vision Language Models with Doctra.
## Overview
Doctra integrates with Vision Language Models (VLMs) to convert visual elements (charts, tables, figures) into structured data. This enables automatic data extraction and conversion to Excel, HTML, and JSON formats.
## Supported Providers
- OpenAI: GPT-4 Vision, GPT-4o
- Gemini: Google's vision models
- Anthropic: Claude with vision
- OpenRouter: Access to multiple models through a single API
## Basic Configuration
```python
from doctra import StructuredPDFParser

parser = StructuredPDFParser(
    use_vlm=True,
    vlm_provider="openai",
    vlm_api_key="your-api-key"
)

parser.parse("document.pdf")
```
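To avoid hardcoding credentials, the key can be read from the environment at runtime. A minimal sketch, assuming an `OPENAI_API_KEY` environment variable (the variable name is illustrative, not a Doctra convention):

```python
import os

from doctra import StructuredPDFParser

# Read the API key from an environment variable instead of embedding
# it in source code. The variable name here is illustrative.
parser = StructuredPDFParser(
    use_vlm=True,
    vlm_provider="openai",
    vlm_api_key=os.environ["OPENAI_API_KEY"],
)

parser.parse("document.pdf")
```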
## Provider Setup
### OpenAI
```python
parser = StructuredPDFParser(
    use_vlm=True,
    vlm_provider="openai",
    vlm_api_key="sk-xxx",
    vlm_model="gpt-4o"  # Optional
)
```
### Gemini
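Assuming Gemini follows the same constructor pattern as the other providers (the provider string and key placeholder below are assumptions based on the naming used elsewhere on this page):

```python
parser = StructuredPDFParser(
    use_vlm=True,
    vlm_provider="gemini",         # assumed provider string, matching the list above
    vlm_api_key="your-gemini-key"
)
```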
### Anthropic
```python
parser = StructuredPDFParser(
    use_vlm=True,
    vlm_provider="anthropic",
    vlm_api_key="your-anthropic-key"
)
```
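### OpenRouter

OpenRouter is listed as a supported provider above; assuming it follows the same configuration pattern (the provider string and key placeholder are assumptions):

```python
parser = StructuredPDFParser(
    use_vlm=True,
    vlm_provider="openrouter",         # assumed provider string
    vlm_api_key="your-openrouter-key"
)
```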
## What Gets Processed
With VLM enabled:
- Tables: Converted to Excel/HTML with cell data
- Charts: Data points extracted, plus generated descriptions
- Figures: Descriptions and context generated
## Output Files
```
outputs/
└── document/
    └── full_parse/
        ├── tables.xlsx     # Extracted table data
        ├── tables.html     # HTML tables
        ├── vlm_items.json  # Structured data
        └── ...
```
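Once parsing finishes, the generated files can be consumed with standard tooling. A sketch, assuming the layout above (the one-table-per-sheet reading of `tables.xlsx` and the shape of `vlm_items.json` are assumptions; `pd.read_excel` needs an Excel engine such as openpyxl installed):

```python
import json

import pandas as pd

base = "outputs/document/full_parse"

# Load every sheet from the workbook (assumption: one extracted
# table per sheet).
tables = pd.read_excel(f"{base}/tables.xlsx", sheet_name=None)
for name, df in tables.items():
    print(name, df.shape)

# vlm_items.json holds the structured VLM output; its exact schema is
# not documented here, so load it and inspect.
with open(f"{base}/vlm_items.json") as f:
    items = json.load(f)
print(type(items))
```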
## Cost Considerations
VLM processing requires API calls:
- ~1-10 calls per document
- ~$0.01-$0.10 per document
- Costs vary by provider
## See Also
- Parsers - Using VLM with parsers
- API Reference - VLM configuration options
- Examples - VLM usage examples