DOCX Parser¶

The StructuredDOCXParser is a comprehensive parser for Microsoft Word documents (.docx files) that extracts text, tables, images, and structured content while preserving document formatting and order.

Overview¶

The DOCX parser provides:

Complete DOCX Support: Extracts text, tables, images, and formatting from Word documents
Document Order Preservation: Maintains the original sequence of elements (paragraphs, tables, images)
VLM Integration: Optional Vision Language Model support for image analysis and table extraction
Multiple Output Formats: Generates Markdown, HTML, and Excel files
Excel Export: Creates structured Excel files with Table of Contents and clickable hyperlinks
Formatting Preservation: Maintains text formatting (bold, italic, etc.) in output
Progress Tracking: Real-time progress bars for VLM processing

Basic Usage¶

from doctra.parsers.structured_docx_parser import StructuredDOCXParser

# Basic DOCX parsing
parser = StructuredDOCXParser(
    extract_images=True,
    preserve_formatting=True,
    table_detection=True,
    export_excel=True
)

# Parse DOCX document
parser.parse("document.docx")

Advanced Configuration¶

With VLM Enhancement¶

from doctra import StructuredDOCXParser
from doctra.engines.vlm.service import VLMStructuredExtractor

# Initialize VLM engine
vlm_engine = VLMStructuredExtractor(
    vlm_provider="openai",  # or "gemini", "anthropic", "openrouter", "qianfan", "ollama"
    vlm_model="gpt-4-vision",  # Optional, uses default if None
    api_key="your_api_key"
)

parser = StructuredDOCXParser(
    # VLM Engine (pass initialized engine instance)
    vlm=vlm_engine,

    # Processing Options
    extract_images=True,
    preserve_formatting=True,
    table_detection=True,
    export_excel=True
)

# Parse with VLM enhancement
parser.parse("document.docx")

Custom Processing Options¶

from doctra.parsers.structured_docx_parser import StructuredDOCXParser

parser = StructuredDOCXParser(
    # Disable image extraction for faster processing
    extract_images=False,

    # Disable formatting preservation for plain text
    preserve_formatting=False,

    # Disable table detection if not needed
    table_detection=False,

    # Disable Excel export
    export_excel=False
)

Output Structure¶

When parsing a DOCX document, the parser creates:

outputs/document_name/
├── document.md            # Markdown version with all content
├── document.html          # HTML version with styling
├── tables.xlsx            # Excel file with extracted tables
│   ├── Table of Contents  # Summary sheet with hyperlinks
│   ├── Table 1            # Individual table sheets
│   ├── Table 2
│   └── ...
└── images/                # Extracted images
    ├── image1.png
    ├── image2.jpg
    └── ...
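
The layout above can be checked after a run. The following is a minimal sketch, assuming the output folder is named after the input file's stem (as in outputs/report/ for report.docx) and that the base directory is outputs. The helpers `expected_outputs` and `missing_outputs` are hypothetical and not part of Doctra.

```python
from pathlib import Path


def expected_outputs(docx_path: str, base_dir: str = "outputs") -> dict:
    """Return the paths the layout above describes for one input file.

    Assumption: the output folder name is the input file's stem.
    """
    out_dir = Path(base_dir) / Path(docx_path).stem
    return {
        "markdown": out_dir / "document.md",
        "html": out_dir / "document.html",
        "excel": out_dir / "tables.xlsx",
        "images": out_dir / "images",
    }


def missing_outputs(docx_path: str, base_dir: str = "outputs") -> list:
    """List the names of expected outputs that do not exist on disk."""
    return [
        name
        for name, path in expected_outputs(docx_path, base_dir).items()
        if not path.exists()
    ]
```

Note that tables.xlsx and images/ would legitimately be absent when export_excel or extract_images is disabled, so a missing entry is not necessarily an error.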

VLM Integration Features¶

When VLM is enabled, the parser:

Analyzes Images: Uses AI to extract structured data from images
Creates Tables: Converts chart images to structured table data
Enhanced Excel Output: Includes VLM-extracted tables in Excel file
Smart Content Display: Shows extracted tables instead of images in Markdown/HTML
Progress Tracking: Shows progress based on number of images processed

VLM Processing Flow¶

Image Detection: Scans document for embedded images
VLM Analysis: Processes each image with the selected VLM model
Structured Extraction: Converts visual content to structured data
Excel Integration: Adds VLM-extracted tables to Excel output
Content Replacement: Replaces image references with extracted tables in Markdown/HTML
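
The content replacement step can be illustrated with a toy example. This is a sketch of the idea only, not Doctra's implementation: the `analyze` callable stands in for a VLM call and is hypothetical, as are the helper names. It returns table rows for an image path, or None when the image holds no tabular data, in which case the image reference is kept.

```python
import re
from typing import Callable, List, Optional

Table = List[List[str]]

# Matches Markdown image references such as ![alt](images/chart.png)
IMAGE_REF = re.compile(r"!\[[^\]]*\]\(([^)]+)\)")


def table_to_markdown(rows: Table) -> str:
    """Render rows as a Markdown table; the first row is the header."""
    header, *body = rows
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


def replace_images_with_tables(
    markdown: str, analyze: Callable[[str], Optional[Table]]
) -> str:
    """Swap each image reference for its extracted table, if any."""

    def substitute(match):
        rows = analyze(match.group(1))
        # Keep the original image reference when nothing was extracted
        return table_to_markdown(rows) if rows else match.group(0)

    return IMAGE_REF.sub(substitute, markdown)
```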

Excel Output Features¶

The generated Excel file includes:

Table of Contents: Summary sheet with all extracted tables
Clickable Hyperlinks: Navigate between table sheets
Consistent Styling: Professional formatting with colors and fonts
VLM Integration: Includes both original and VLM-extracted tables
Sheet Naming: Uses actual table titles as sheet names
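
Using table titles as sheet names runs into Excel's own limits: a sheet name may hold at most 31 characters, may not contain any of : \ / ? * [ ], and must be unique regardless of case. How Doctra handles this is not documented here; the sketch below shows one possible approach, and `safe_sheet_name` is a hypothetical helper.

```python
import re

# Characters Excel does not allow in sheet names
INVALID_CHARS = re.compile(r"[\[\]:*?/\\]")
MAX_LEN = 31


def safe_sheet_name(title: str, used: set) -> str:
    """Turn a table title into a valid, unique Excel sheet name.

    `used` holds lower-cased names already taken and is updated in place.
    """
    # Replace invalid characters, then collapse runs of whitespace
    name = " ".join(INVALID_CHARS.sub(" ", title).split())
    # Names may not start or end with an apostrophe
    name = name.strip("'").strip() or "Table"
    name = name[:MAX_LEN]

    candidate, n = name, 1
    while candidate.lower() in used:
        n += 1
        suffix = f" ({n})"
        candidate = name[: MAX_LEN - len(suffix)] + suffix
    used.add(candidate.lower())
    return candidate
```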

CLI Usage¶

# Basic DOCX parsing
doctra parse-docx document.docx

# With VLM enhancement
doctra parse-docx document.docx --use-vlm --vlm-provider openai --vlm-api-key your_key

# Custom options
doctra parse-docx document.docx \
  --extract-images \
  --preserve-formatting \
  --table-detection \
  --export-excel
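
When the CLI is driven from a script, the argument list can be assembled in Python. This sketch uses only the flags shown in the commands above; `build_parse_docx_command` is a hypothetical helper, and it assumes the doctra executable is on the PATH.

```python
from typing import List, Optional


def build_parse_docx_command(
    docx_path: str,
    use_vlm: bool = False,
    vlm_provider: Optional[str] = None,
    vlm_api_key: Optional[str] = None,
) -> List[str]:
    """Build the argument list for `doctra parse-docx`."""
    cmd = ["doctra", "parse-docx", docx_path]
    if use_vlm:
        cmd.append("--use-vlm")
        if vlm_provider:
            cmd += ["--vlm-provider", vlm_provider]
        if vlm_api_key:
            cmd += ["--vlm-api-key", vlm_api_key]
    return cmd
```

The resulting list can be passed to subprocess.run(cmd, check=True). Passing a list rather than a shell string avoids quoting problems with file names that contain spaces.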

Web UI Usage¶

The DOCX parser is available in the Gradio web interface:

Upload DOCX File: Drag and drop your Word document
Configure VLM: Enable VLM and set your API key
Processing Options: Choose extraction settings
Parse Document: Click "Parse DOCX" to process
View Results: Preview content and download outputs

Parameters Reference¶

VLM Settings¶

Parameter	Type	Default	Description
`vlm`	`Optional[VLMStructuredExtractor]`	`None`	VLM engine instance. If `None`, VLM processing is disabled.

VLM Engine Configuration:

VLM engines must be initialized outside the parser and passed in through the vlm argument. This dependency injection pattern keeps provider configuration separate from the parser's own processing options.

VLMStructuredExtractor Parameters:

vlm_provider (str, required): VLM provider to use ("openai", "gemini", "anthropic", "openrouter", "qianfan", "ollama")
vlm_model (str, optional): Model name to use (defaults to the provider-specific default)
api_key (str, optional): API key for the VLM provider (required for all providers except Ollama)

Example:

from doctra import StructuredDOCXParser
from doctra.engines.vlm.service import VLMStructuredExtractor

vlm_engine = VLMStructuredExtractor(
    vlm_provider="openai",
    vlm_model="gpt-4-vision",  # Optional
    api_key="your-api-key"
)

parser = StructuredDOCXParser(vlm=vlm_engine)
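
Hard-coding API keys in source is easy to get wrong. The sketch below builds the keyword arguments for VLMStructuredExtractor from an environment variable and enforces the rule stated above that a key is required for every provider except Ollama. The helper `build_vlm_kwargs` and the variable name VLM_API_KEY are assumptions made for illustration, not part of Doctra.

```python
import os

SUPPORTED_PROVIDERS = {
    "openai", "gemini", "anthropic", "openrouter", "qianfan", "ollama",
}


def build_vlm_kwargs(provider, model=None, env_var="VLM_API_KEY", environ=None):
    """Return keyword arguments suitable for VLMStructuredExtractor.

    `environ` defaults to os.environ; a dict can be passed for testing.
    """
    environ = os.environ if environ is None else environ
    if provider not in SUPPORTED_PROVIDERS:
        raise ValueError(f"Unsupported VLM provider: {provider!r}")

    kwargs = {"vlm_provider": provider}
    if model is not None:
        kwargs["vlm_model"] = model

    # Every provider except Ollama needs an API key
    if provider != "ollama":
        api_key = environ.get(env_var)
        if not api_key:
            raise ValueError(f"{env_var} must be set for provider {provider!r}")
        kwargs["api_key"] = api_key
    return kwargs
```

The result would then be unpacked into the constructor, for example VLMStructuredExtractor(**build_vlm_kwargs("openai")).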

Processing Options¶

Parameter	Type	Default	Description
`extract_images`	bool	True	Extract embedded images from DOCX
`preserve_formatting`	bool	True	Preserve text formatting in output
`table_detection`	bool	True	Detect and extract tables
`export_excel`	bool	True	Export tables to Excel file

Error Handling¶

The parser handles common errors:

File Not Found: Invalid DOCX file path
Permission Errors: Read-only files or locked documents
VLM API Errors: Invalid API keys or rate limits
Processing Errors: Corrupted documents or unsupported formats

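from doctra.parsers.structured_docx_parser import StructuredDOCXParser

parser = StructuredDOCXParser()

try:
    parser.parse("document.docx")
except FileNotFoundError:
    print("DOCX file not found!")
except Exception as e:
    # Covers permission, VLM API, and processing errors
    print(f"Processing error: {e}")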
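
When several documents are processed in one run, it is usually better to record each failure and continue than to stop at the first one. The following sketch takes any parse callable, such as parser.parse, and returns a per-file status. It assumes missing files surface as FileNotFoundError (as in the example above) and permission problems as PermissionError; the exact exception types Doctra raises are not confirmed here, and `parse_all` is a hypothetical helper.

```python
from typing import Callable, Dict, Iterable


def parse_all(parse: Callable[[str], object], paths: Iterable[str]) -> Dict[str, str]:
    """Run `parse` on every path and collect a status string per file."""
    results = {}
    for path in paths:
        try:
            parse(path)
            results[path] = "ok"
        except FileNotFoundError:
            results[path] = "not found"
        except PermissionError:
            results[path] = "permission denied"
        except Exception as exc:
            # Anything else: VLM API errors, corrupted documents, etc.
            results[path] = f"failed: {exc}"
    return results
```

FileNotFoundError and PermissionError are both subclasses of OSError, so they must be caught before the generic Exception handler, as done here.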

Best Practices¶

Performance Optimization¶

Disable Unused Features: Turn off image extraction or Excel export if not needed
VLM Usage: Use VLM only when structured data extraction is required
Large Documents: Consider processing large documents in smaller chunks

Output Quality¶

Formatting Preservation: Keep enabled so bold, italic, and other text formatting carries over to the output
Table Detection: Essential for structured data extraction
VLM Enhancement: Improves table extraction from images

Error Prevention¶

File Validation: Ensure DOCX files are not corrupted
API Keys: Set up VLM API keys before processing
Permissions: Ensure write access to output directory
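
File validation can be done cheaply before parsing. A .docx file is a ZIP container, and Word normally stores the document body in the word/document.xml part, so a quick structural check catches many corrupted or mislabeled files. This is a sketch using only the standard library; `check_docx` is a hypothetical helper and passing it does not guarantee that the parser will succeed.

```python
import zipfile
from pathlib import Path


def check_docx(path: str) -> list:
    """Return a list of problems found; an empty list means the file looks sane."""
    p = Path(path)
    if not p.is_file():
        return ["file not found"]

    problems = []
    if p.suffix.lower() != ".docx":
        problems.append("extension is not .docx")

    if not zipfile.is_zipfile(str(p)):
        problems.append("not a valid ZIP container")
    else:
        with zipfile.ZipFile(str(p)) as zf:
            # Word normally writes the main document part here
            if "word/document.xml" not in zf.namelist():
                problems.append("missing word/document.xml")
    return problems
```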

Examples¶

Example 1: Basic Document Processing¶

from doctra.parsers.structured_docx_parser import StructuredDOCXParser

# Initialize parser
parser = StructuredDOCXParser()

# Process document
parser.parse("report.docx")

# Output: outputs/report/document.md, document.html, tables.xlsx

Example 2: VLM-Enhanced Processing¶

from doctra import StructuredDOCXParser
from doctra.engines.vlm.service import VLMStructuredExtractor

# Initialize VLM engine
vlm_engine = VLMStructuredExtractor(
    vlm_provider="openai",
    api_key="your_api_key"
)

parser = StructuredDOCXParser(vlm=vlm_engine)

# Process with AI enhancement
parser.parse("financial_report.docx")

# Output: Enhanced Excel with VLM-extracted tables

Example 3: Custom Configuration¶

from doctra.parsers.structured_docx_parser import StructuredDOCXParser

parser = StructuredDOCXParser(
    extract_images=True,
    preserve_formatting=False,  # Plain text output
    table_detection=True,
    export_excel=True
)

# Process with custom settings
parser.parse("data_document.docx")

Troubleshooting¶

Common Issues¶

"python-docx not installed": Install it with pip install python-docx
"No tables extracted": Check that table_detection=True and verify that the document contains tables
"VLM API error": Verify that the API key is correct and check provider and model compatibility
"Images not extracted": Check that extract_images=True and verify that the document contains embedded images
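
A missing dependency can be detected before the parser is constructed. The python-docx package is imported under the module name docx, so the check below looks for that name. This is a small sketch using the standard library, and `missing_modules` is a hypothetical helper.

```python
import importlib.util


def missing_modules(modules):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # python-docx installs the importable module "docx"
    missing = missing_modules(["docx"])
    if missing:
        print("Missing modules:", ", ".join(missing))
        print("Try: pip install python-docx")
```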

Performance Tips¶

Use VLM only when needed (adds processing time)
Disable unused features for faster processing
Process large documents in smaller batches
Ensure sufficient disk space for outputs
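
The last two tips can be sketched with the standard library: splitting a list of files into fixed-size batches and checking free disk space before each batch. The 500 MB default threshold is an arbitrary assumption, and both helpers are hypothetical.

```python
import shutil


def batches(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def has_free_space(path=".", required_bytes=500 * 1024 * 1024):
    """Return True if the filesystem holding `path` has enough free space."""
    return shutil.disk_usage(path).free >= required_bytes
```

A driver loop would call has_free_space on the output location before each batch and stop early when it returns False, rather than failing partway through writing outputs.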