DOCX Parser

The StructuredDOCXParser is a comprehensive parser for Microsoft Word documents (.docx files) that extracts text, tables, images, and structured content while preserving document formatting and order.

Overview

The DOCX parser provides:

  • Complete DOCX Support: Extracts text, tables, images, and formatting from Word documents
  • Document Order Preservation: Maintains the original sequence of elements (paragraphs, tables, images)
  • VLM Integration: Optional Vision Language Model support for image analysis and table extraction
  • Multiple Output Formats: Generates Markdown, HTML, and Excel files
  • Excel Export: Creates structured Excel files with Table of Contents and clickable hyperlinks
  • Formatting Preservation: Maintains text formatting (bold, italic, etc.) in output
  • Progress Tracking: Real-time progress bars for VLM processing

Basic Usage

from doctra.parsers.structured_docx_parser import StructuredDOCXParser

# Basic DOCX parsing
parser = StructuredDOCXParser(
    extract_images=True,
    preserve_formatting=True,
    table_detection=True,
    export_excel=True
)

# Parse DOCX document
parser.parse("document.docx")

Advanced Configuration

With VLM Enhancement

from doctra import StructuredDOCXParser
from doctra.engines.vlm.service import VLMStructuredExtractor

# Initialize VLM engine
vlm_engine = VLMStructuredExtractor(
    vlm_provider="openai",  # or "gemini", "anthropic", "openrouter", "qianfan", "ollama"
    vlm_model="gpt-4-vision",  # Optional, uses default if None
    api_key="your_api_key"
)

parser = StructuredDOCXParser(
    # VLM Engine (pass initialized engine instance)
    vlm=vlm_engine,

    # Processing Options
    extract_images=True,
    preserve_formatting=True,
    table_detection=True,
    export_excel=True
)

# Parse with VLM enhancement
parser.parse("document.docx")

Custom Processing Options

from doctra import StructuredDOCXParser

parser = StructuredDOCXParser(
    # Disable image extraction for faster processing
    extract_images=False,

    # Disable formatting preservation for plain text
    preserve_formatting=False,

    # Disable table detection if not needed
    table_detection=False,

    # Disable Excel export
    export_excel=False
)

Output Structure

When parsing a DOCX document, the parser creates:

outputs/document_name/
├── document.md            # Markdown version with all content
├── document.html          # HTML version with styling
├── tables.xlsx            # Excel file with extracted tables
│   ├── Table of Contents  # Summary sheet with hyperlinks
│   ├── Table 1            # Individual table sheets
│   ├── Table 2
│   └── ...
└── images/                # Extracted images
    ├── image1.png
    ├── image2.jpg
    └── ...
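
The layout above can be checked after a run. The following is a minimal sketch, assuming the output folder is named after the input file's stem (as in `outputs/report/` for `report.docx`) and that the base directory is `outputs`. `expected_outputs` and `missing_outputs` are hypothetical helpers, not part of Doctra.

```python
from pathlib import Path


def expected_outputs(docx_path: str, base_dir: str = "outputs") -> dict[str, Path]:
    """Return the paths the layout above describes for one input file.

    Assumption: the output folder is named after the input file's stem.
    """
    out_dir = Path(base_dir) / Path(docx_path).stem
    return {
        "markdown": out_dir / "document.md",
        "html": out_dir / "document.html",
        "excel": out_dir / "tables.xlsx",
        "images": out_dir / "images",
    }


def missing_outputs(docx_path: str, base_dir: str = "outputs") -> list[str]:
    """List the expected outputs that do not exist on disk."""
    paths = expected_outputs(docx_path, base_dir)
    return [name for name, path in paths.items() if not path.exists()]
```

With `export_excel=False` or `extract_images=False`, a missing `excel` or `images` entry is expected rather than an error.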

VLM Integration Features

When VLM is enabled, the parser:

  • Analyzes Images: Uses AI to extract structured data from images
  • Creates Tables: Converts chart images to structured table data
  • Enhanced Excel Output: Includes VLM-extracted tables in Excel file
  • Smart Content Display: Shows extracted tables instead of images in Markdown/HTML
  • Progress Tracking: Shows progress based on number of images processed

VLM Processing Flow

  1. Image Detection: Scans document for embedded images
  2. VLM Analysis: Processes each image with the selected VLM model
  3. Structured Extraction: Converts visual content to structured data
  4. Excel Integration: Adds VLM-extracted tables to Excel output
  5. Content Replacement: Replaces image references with extracted tables in Markdown/HTML
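
The five steps can be sketched in plain Python. This illustrates the flow as described, not Doctra's implementation: `Element`, `run_vlm_flow`, and the `analyze` callable are hypothetical stand-ins, and a stub takes the place of a real VLM call.

```python
from dataclasses import dataclass
from typing import Callable, Optional

Table = list[list[str]]  # rows of cell text


@dataclass
class Element:
    kind: str                      # "paragraph", "table", or "image"
    content: str                   # text, or an image path for images
    table: Optional[Table] = None  # filled in for tables


def run_vlm_flow(
    elements: list[Element],
    analyze: Callable[[str], Optional[Table]],
) -> tuple[list[Element], list[Table]]:
    """Replace images with extracted tables, keeping document order."""
    output: list[Element] = []
    vlm_tables: list[Table] = []
    for element in elements:
        if element.kind != "image":        # step 1: only images go to the VLM
            output.append(element)
            continue
        table = analyze(element.content)   # steps 2-3: analyze and structure
        if table is None:                  # nothing tabular found: keep the image
            output.append(element)
            continue
        vlm_tables.append(table)           # step 4: queue for the Excel output
        output.append(Element("table", element.content, table))  # step 5
    return output, vlm_tables


def stub_analyze(image_path: str) -> Optional[Table]:
    """Stand-in for a VLM call: only 'chart' images yield a table."""
    if "chart" in image_path:
        return [["Quarter", "Revenue"], ["Q1", "100"]]
    return None
```

Passing the analysis step in as a callable mirrors the way the parser receives an initialized engine, and makes the flow easy to exercise without network access.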

Excel Output Features

The generated Excel file includes:

  • Table of Contents: Summary sheet with all extracted tables
  • Clickable Hyperlinks: Navigate between table sheets
  • Consistent Styling: Professional formatting with colors and fonts
  • VLM Integration: Includes both original and VLM-extracted tables
  • Sheet Naming: Uses actual table titles as sheet names
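
Table titles cannot always be used as sheet names verbatim: Excel limits sheet names to 31 characters, rejects the characters `\ / ? * [ ] :`, and requires names to be unique regardless of case. How Doctra handles this is not documented here; the following is a minimal sketch of one way to do it, with `sheet_name` as a hypothetical helper.

```python
import re

INVALID_SHEET_CHARS = re.compile(r"[\[\]:*?/\\]")
MAX_SHEET_NAME = 31  # Excel's limit on sheet name length


def sheet_name(title: str, used: set[str], fallback: str = "Table") -> str:
    """Turn a table title into a unique, Excel-safe sheet name."""
    # Replace forbidden characters and collapse runs of whitespace.
    name = " ".join(INVALID_SHEET_CHARS.sub(" ", title).split())
    # Truncate, then drop edge spaces and apostrophes (Excel rejects
    # names that start or end with an apostrophe).
    name = name[:MAX_SHEET_NAME].strip(" '") or fallback
    candidate, n = name, 1
    # Excel compares sheet names case-insensitively.
    while candidate.lower() in used:
        n += 1
        suffix = f" ({n})"
        candidate = name[: MAX_SHEET_NAME - len(suffix)].rstrip() + suffix
    used.add(candidate.lower())
    return candidate
```

The `used` set is shared across calls for one workbook so that repeated titles receive numbered suffixes.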

CLI Usage

# Basic DOCX parsing
doctra parse-docx document.docx

# With VLM enhancement
doctra parse-docx document.docx --use-vlm --vlm-provider openai --vlm-api-key your_key

# Custom options
doctra parse-docx document.docx \
  --extract-images \
  --preserve-formatting \
  --table-detection \
  --export-excel
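
To run the CLI over a whole folder, the command can be driven from Python. This is a sketch that uses only the flags shown above; `build_command` and `parse_folder` are hypothetical helpers, and it assumes the CLI exits with a non-zero status on failure, which is not confirmed here.

```python
import subprocess
from pathlib import Path
from typing import Optional


def build_command(
    docx: Path, provider: Optional[str] = None, api_key: Optional[str] = None
) -> list[str]:
    """Build the argument list for one `doctra parse-docx` call."""
    cmd = ["doctra", "parse-docx", str(docx)]
    if provider:
        cmd += ["--use-vlm", "--vlm-provider", provider]
        if api_key:
            cmd += ["--vlm-api-key", api_key]
    return cmd


def parse_folder(
    folder: str, provider: Optional[str] = None, api_key: Optional[str] = None
) -> list[Path]:
    """Run the CLI on every .docx in a folder; return the files that failed."""
    failed = []
    for docx in sorted(Path(folder).glob("*.docx")):
        if docx.name.startswith("~$"):  # skip Word's temporary lock files
            continue
        result = subprocess.run(build_command(docx, provider, api_key))
        if result.returncode != 0:      # assumption: non-zero means failure
            failed.append(docx)
    return failed
```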

Web UI Usage

The DOCX parser is available in the Gradio web interface:

  1. Upload DOCX File: Drag and drop your Word document
  2. Configure VLM: Enable VLM and set your API key
  3. Processing Options: Choose extraction settings
  4. Parse Document: Click "Parse DOCX" to process
  5. View Results: Preview content and download outputs

Parameters Reference

VLM Settings

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| vlm | Optional[VLMStructuredExtractor] | None | VLM engine instance. If None, VLM processing is disabled. |

VLM Engine Configuration:

VLM engines must be initialized externally and passed to the parser. This uses a dependency injection pattern for clearer API design.

VLMStructuredExtractor Parameters:

  • vlm_provider (str, required): VLM provider to use ("openai", "gemini", "anthropic", "openrouter", "qianfan", "ollama")
  • vlm_model (str, optional): Model name to use (defaults to provider-specific defaults)
  • api_key (str, optional): API key for the VLM provider (required for all providers except Ollama)

Example:

from doctra import StructuredDOCXParser
from doctra.engines.vlm.service import VLMStructuredExtractor

vlm_engine = VLMStructuredExtractor(
    vlm_provider="openai",
    vlm_model="gpt-4-vision",  # Optional
    api_key="your-api-key"
)

parser = StructuredDOCXParser(vlm=vlm_engine)
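
Because the key is passed in explicitly, it can be read from the environment instead of being hard-coded. The following is a sketch under assumptions: the environment variable names are illustrative choices, not names Doctra is known to read, and `vlm_kwargs` is a hypothetical helper.

```python
import os
from typing import Optional

# Assumed environment variable names; adjust them to your own setup.
ENV_VARS = {
    "openai": "OPENAI_API_KEY",
    "gemini": "GEMINI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "openrouter": "OPENROUTER_API_KEY",
    "qianfan": "QIANFAN_API_KEY",
}


def vlm_kwargs(provider: str, model: Optional[str] = None) -> dict:
    """Build keyword arguments for VLMStructuredExtractor."""
    kwargs = {"vlm_provider": provider}
    if model:
        kwargs["vlm_model"] = model
    if provider == "ollama":  # the only provider that needs no API key
        return kwargs
    if provider not in ENV_VARS:
        raise ValueError(f"Unknown VLM provider: {provider!r}")
    api_key = os.environ.get(ENV_VARS[provider])
    if not api_key:
        raise RuntimeError(f"Set {ENV_VARS[provider]} before using {provider!r}")
    kwargs["api_key"] = api_key
    return kwargs
```

The result can then be unpacked into the constructor, for example `VLMStructuredExtractor(**vlm_kwargs("openai"))`, so a missing key fails early with a clear message rather than at the first API call.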

Processing Options

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| extract_images | bool | True | Extract embedded images from DOCX |
| preserve_formatting | bool | True | Preserve text formatting in output |
| table_detection | bool | True | Detect and extract tables |
| export_excel | bool | True | Export tables to Excel file |

Error Handling

The parser handles common errors:

  • File Not Found: Invalid DOCX file path
  • Permission Errors: Read-only files or locked documents
  • VLM API Errors: Invalid API keys or rate limits
  • Processing Errors: Corrupted documents or unsupported formats

try:
    parser.parse("document.docx")
except FileNotFoundError:
    print("DOCX file not found!")
except Exception as e:
    print(f"Processing error: {e}")
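
When several documents are processed in one run, the same pattern can collect failures instead of stopping at the first one. This is a minimal sketch: `parse_all` is a hypothetical helper, and the exception types raised for VLM and document errors are not documented here, so they fall under the generic handler.

```python
from typing import Callable, Iterable


def parse_all(parse: Callable[[str], object], paths: Iterable[str]) -> dict[str, str]:
    """Parse each file; return {path: error message} for the ones that failed."""
    failures: dict[str, str] = {}
    for path in paths:
        try:
            parse(path)
        except FileNotFoundError:
            failures[path] = "file not found"
        except PermissionError:
            failures[path] = "permission denied (read-only or open elsewhere?)"
        except Exception as exc:  # VLM/API and document errors; exact types unknown
            failures[path] = f"{type(exc).__name__}: {exc}"
    return failures
```

A typical call would be `parse_all(parser.parse, ["a.docx", "b.docx"])`; an empty result means every file was processed.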

Best Practices

Performance Optimization

  • Disable Unused Features: Turn off image extraction or Excel export if not needed
  • VLM Usage: Use VLM only when structured data extraction is required
  • Large Documents: Consider processing large documents in smaller chunks

Output Quality

  • Formatting Preservation: Keep enabled for better output quality
  • Table Detection: Essential for structured data extraction
  • VLM Enhancement: Improves table extraction from images

Error Prevention

  • File Validation: Ensure DOCX files are not corrupted
  • API Keys: Set up VLM API keys before processing
  • Permissions: Ensure write access to output directory
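
File and permission checks can be run before parsing. The following is a sketch using only the standard library; `preflight` is a hypothetical helper, and it relies on the fact that a .docx file is a ZIP archive whose main part, as written by Word, is `word/document.xml`.

```python
import os
import zipfile
from pathlib import Path


def preflight(docx_path: str, output_dir: str = "outputs") -> list[str]:
    """Return a list of problems found before parsing (empty means OK)."""
    path = Path(docx_path)
    if not path.is_file():
        return [f"{path} does not exist"]
    problems = []
    if path.suffix.lower() != ".docx":
        problems.append(f"{path.name} is not a .docx file")
    # A .docx file is a ZIP archive; Word stores the body in word/document.xml.
    if not zipfile.is_zipfile(path):
        problems.append(f"{path.name} is not a valid ZIP container")
    else:
        with zipfile.ZipFile(path) as archive:
            if "word/document.xml" not in archive.namelist():
                problems.append(f"{path.name} has no word/document.xml")
            elif archive.testzip() is not None:  # CRC check of every member
                problems.append(f"{path.name} contains a corrupted member")
    # Check the output directory, or its parent if it does not exist yet.
    out = Path(output_dir)
    target = out if out.exists() else out.resolve().parent
    if not os.access(target, os.W_OK):
        problems.append(f"no write access to {target}")
    return problems
```

These checks catch truncated downloads and renamed non-DOCX files cheaply; a file that passes can still fail during parsing.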

Examples

Example 1: Basic Document Processing

from doctra.parsers.structured_docx_parser import StructuredDOCXParser

# Initialize parser
parser = StructuredDOCXParser()

# Process document
parser.parse("report.docx")

# Output: outputs/report/document.md, document.html, tables.xlsx

Example 2: VLM-Enhanced Processing

from doctra import StructuredDOCXParser
from doctra.engines.vlm.service import VLMStructuredExtractor

# Initialize VLM engine
vlm_engine = VLMStructuredExtractor(
    vlm_provider="openai",
    api_key="your_api_key"
)

parser = StructuredDOCXParser(vlm=vlm_engine)

# Process with AI enhancement
parser.parse("financial_report.docx")

# Output: Enhanced Excel with VLM-extracted tables

Example 3: Custom Configuration

from doctra import StructuredDOCXParser

parser = StructuredDOCXParser(
    extract_images=True,
    preserve_formatting=False,  # Plain text output
    table_detection=True,
    export_excel=True
)

# Process with custom settings
parser.parse("data_document.docx")

Troubleshooting

Common Issues

  1. "python-docx not installed"
     • Solution: pip install python-docx

  2. "No tables extracted"
     • Check if table_detection=True
     • Verify document contains tables

  3. "VLM API error"
     • Verify API key is correct
     • Check provider and model compatibility

  4. "Images not extracted"
     • Check if extract_images=True
     • Verify document contains embedded images

Performance Tips

  • Use VLM only when needed (adds processing time)
  • Disable unused features for faster processing
  • Process large documents in smaller batches
  • Ensure sufficient disk space for outputs