
Parsers API Reference

Complete API documentation for all Doctra parsers.

StructuredPDFParser

The base parser for comprehensive PDF document processing.

doctra.parsers.structured_pdf_parser.StructuredPDFParser

Comprehensive PDF parser for extracting all types of content.

Processes PDF documents to extract text, tables, charts, and figures.
Supports OCR for text extraction and optional VLM processing for
converting visual elements into structured data.

Features automatic detection and merging of tables split across pages
using proximity detection and LSD-based structure analysis.

:param vlm: VLM engine instance (VLMStructuredExtractor). If None, VLM processing is disabled.
:param layout_model_name: Layout detection model name (default: "PP-DocLayout_plus-L")
:param dpi: DPI for PDF rendering (default: 200)
:param min_score: Minimum confidence score for layout detection (default: 0.0)
:param ocr_engine: OCR engine instance (PytesseractOCREngine or PaddleOCREngine). 
                   If None, creates a default PytesseractOCREngine with lang="eng", psm=4, oem=3.
:param box_separator: Separator between text boxes in output (default: "\n")
:param merge_split_tables: Whether to detect and merge split tables (default: False)
:param bottom_threshold_ratio: Ratio for "too close to bottom" detection (default: 0.20)
:param top_threshold_ratio: Ratio for "too close to top" detection (default: 0.15)
:param max_gap_ratio: Maximum allowed gap between tables (default: 0.25, accounts for headers/footers)
:param column_alignment_tolerance: Pixel tolerance for column alignment (default: 10.0)
:param min_merge_confidence: Minimum confidence score for merging (default: 0.65)
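A minimal usage sketch (hypothetical file path; assumes Doctra and a Tesseract install are available, and that the import path matches the module path shown below):

```python
from doctra.parsers.structured_pdf_parser import StructuredPDFParser

# Default configuration: pytesseract OCR, no VLM, no split-table merging.
parser = StructuredPDFParser(dpi=200, min_score=0.0)

# Enable split-table detection with the default merge thresholds.
merging_parser = StructuredPDFParser(merge_split_tables=True)

# Outputs are written under outputs/<pdf stem>/full_parse/.
parser.parse("report.pdf")  # hypothetical input file
```

Pass a `VLMStructuredExtractor` as `vlm` to convert charts and tables into structured data instead of cropped images.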

Source code in doctra/parsers/structured_pdf_parser.py
class StructuredPDFParser:
    """
    Comprehensive PDF parser for extracting all types of content.

    Processes PDF documents to extract text, tables, charts, and figures.
    Supports OCR for text extraction and optional VLM processing for
    converting visual elements into structured data.

    Features automatic detection and merging of tables split across pages
    using proximity detection and LSD-based structure analysis.

    :param vlm: VLM engine instance (VLMStructuredExtractor). If None, VLM processing is disabled.
    :param layout_model_name: Layout detection model name (default: "PP-DocLayout_plus-L")
    :param dpi: DPI for PDF rendering (default: 200)
    :param min_score: Minimum confidence score for layout detection (default: 0.0)
    :param ocr_engine: OCR engine instance (PytesseractOCREngine or PaddleOCREngine). 
                       If None, creates a default PytesseractOCREngine with lang="eng", psm=4, oem=3.
    :param box_separator: Separator between text boxes in output (default: "\n")
    :param merge_split_tables: Whether to detect and merge split tables (default: False)
    :param bottom_threshold_ratio: Ratio for "too close to bottom" detection (default: 0.20)
    :param top_threshold_ratio: Ratio for "too close to top" detection (default: 0.15)
    :param max_gap_ratio: Maximum allowed gap between tables (default: 0.25, accounts for headers/footers)
    :param column_alignment_tolerance: Pixel tolerance for column alignment (default: 10.0)
    :param min_merge_confidence: Minimum confidence score for merging (default: 0.65)
    """

    def __init__(
            self,
            *,
            vlm: Optional[VLMStructuredExtractor] = None,
            layout_model_name: str = "PP-DocLayout_plus-L",
            dpi: int = 200,
            min_score: float = 0.0,
            ocr_engine: Optional[Union[PytesseractOCREngine, PaddleOCREngine]] = None,
            box_separator: str = "\n",
            merge_split_tables: bool = False,
            bottom_threshold_ratio: float = 0.20,
            top_threshold_ratio: float = 0.15,
            max_gap_ratio: float = 0.25,
            column_alignment_tolerance: float = 10.0,
            min_merge_confidence: float = 0.65,
    ):
        """
        Initialize the StructuredPDFParser with processing configuration.

        Also suppresses noisy DEBUG logs from external libraries.

        :param vlm: VLM engine instance (VLMStructuredExtractor). If None, VLM processing is disabled.
        :param layout_model_name: Layout detection model name (default: "PP-DocLayout_plus-L")
        :param dpi: DPI for PDF rendering (default: 200)
        :param min_score: Minimum confidence score for layout detection (default: 0.0)
        :param ocr_engine: OCR engine instance (PytesseractOCREngine or PaddleOCREngine).
                           If None, creates a default PytesseractOCREngine with lang="eng", psm=4, oem=3.
        :param box_separator: Separator between text boxes in output (default: "\n")
        :param merge_split_tables: Whether to detect and merge split tables (default: False)
        :param bottom_threshold_ratio: Ratio for "too close to bottom" detection (default: 0.20)
        :param top_threshold_ratio: Ratio for "too close to top" detection (default: 0.15)
        :param max_gap_ratio: Maximum allowed gap between tables (default: 0.25, accounts for headers/footers)
        :param column_alignment_tolerance: Pixel tolerance for column alignment (default: 10.0)
        :param min_merge_confidence: Minimum confidence score for merging (default: 0.65)
        """
        self.layout_engine = PaddleLayoutEngine(model_name=layout_model_name)
        self.dpi = dpi
        self.min_score = min_score

        # Initialize OCR engine - use provided instance or create default
        if ocr_engine is None:
            self.ocr_engine = PytesseractOCREngine(lang="eng", psm=4, oem=3)
        elif isinstance(ocr_engine, (PytesseractOCREngine, PaddleOCREngine)):
            self.ocr_engine = ocr_engine
        else:
            raise TypeError(
                f"ocr_engine must be an instance of PytesseractOCREngine or PaddleOCREngine, "
                f"got {type(ocr_engine).__name__}"
            )

        self.box_separator = box_separator

        # Initialize VLM engine - use provided instance or None
        if vlm is None:
            self.vlm = None
        elif isinstance(vlm, VLMStructuredExtractor):
            self.vlm = vlm
        else:
            raise TypeError(
                f"vlm must be an instance of VLMStructuredExtractor or None, "
                f"got {type(vlm).__name__}"
            )

        self.merge_split_tables = merge_split_tables
        if self.merge_split_tables:
            self.split_table_detector = SplitTableDetector(
                bottom_threshold_ratio=bottom_threshold_ratio,
                top_threshold_ratio=top_threshold_ratio,
                max_gap_ratio=max_gap_ratio,
                column_alignment_tolerance=column_alignment_tolerance,
                min_merge_confidence=min_merge_confidence,
            )
        else:
            self.split_table_detector = None

        # Suppress noisy DEBUG logs from external libraries
        logging.getLogger('pytesseract').setLevel(logging.WARNING)
        logging.getLogger('markdown_it').setLevel(logging.WARNING)

    def parse(self, pdf_path: str) -> None:
        """
        Parse a PDF document and extract all content types.

        :param pdf_path: Path to the input PDF file
        :return: None
        """
        pdf_filename = os.path.splitext(os.path.basename(pdf_path))[0]
        out_dir = f"outputs/{pdf_filename}/full_parse"

        os.makedirs(out_dir, exist_ok=True)
        ensure_output_dirs(out_dir, IMAGE_SUBDIRS)

        pages: List[LayoutPage] = self.layout_engine.predict_pdf(
            pdf_path, batch_size=1, layout_nms=True, dpi=self.dpi, min_score=self.min_score
        )
        pil_pages = [im for (im, _, _) in render_pdf_to_images(pdf_path, dpi=self.dpi)]

        split_table_matches: List[SplitTableMatch] = []
        merged_table_segments = []

        if self.merge_split_tables and self.split_table_detector:
            try:
                split_table_matches = self.split_table_detector.detect_split_tables(pages, pil_pages)
                for match in split_table_matches:
                    merged_table_segments.append(match.segment1)
                    merged_table_segments.append(match.segment2)
            except Exception as e:
                import traceback
                traceback.print_exc()
                split_table_matches = []

        fig_count = sum(sum(1 for b in p.boxes if b.label == "figure") for p in pages)
        chart_count = sum(sum(1 for b in p.boxes if b.label == "chart") for p in pages)
        table_count = sum(sum(1 for b in p.boxes if b.label == "table") for p in pages)

        md_lines: List[str] = ["# Extracted Content\n"]
        html_lines: List[str] = ["<h1>Extracted Content</h1>"]
        structured_items: List[Dict[str, Any]] = []

        charts_desc = "Charts (VLM β†’ table)" if self.vlm is not None else "Charts (cropped)"
        tables_desc = "Tables (VLM β†’ table)" if self.vlm is not None else "Tables (cropped)"
        figures_desc = "Figures (cropped)"

        with ExitStack() as stack:
            is_notebook = "ipykernel" in sys.modules or "jupyter" in sys.modules
            is_terminal = hasattr(sys.stdout, 'isatty') and sys.stdout.isatty()
            if is_notebook:
                charts_bar = stack.enter_context(
                    create_notebook_friendly_bar(total=chart_count, desc=charts_desc)) if chart_count else None
                tables_bar = stack.enter_context(
                    create_notebook_friendly_bar(total=table_count, desc=tables_desc)) if table_count else None
                figures_bar = stack.enter_context(
                    create_notebook_friendly_bar(total=fig_count, desc=figures_desc)) if fig_count else None
            else:
                charts_bar = stack.enter_context(
                    create_beautiful_progress_bar(total=chart_count, desc=charts_desc, leave=True)) if chart_count else None
                tables_bar = stack.enter_context(
                    create_beautiful_progress_bar(total=table_count, desc=tables_desc, leave=True)) if table_count else None
                figures_bar = stack.enter_context(
                    create_beautiful_progress_bar(total=fig_count, desc=figures_desc, leave=True)) if fig_count else None

            for p in pages:
                page_num = p.page_index
                page_img: Image.Image = pil_pages[page_num - 1]
                md_lines.append(f"\n## Page {page_num}\n")
                html_lines.append(f"<h2>Page {page_num}</h2>")

                for i, box in enumerate(sorted(p.boxes, key=reading_order_key), start=1):
                    if box.label in EXCLUDE_LABELS:
                        img_path = save_box_image(page_img, box, out_dir, page_num, i, IMAGE_SUBDIRS)
                        abs_img_path = os.path.abspath(img_path)
                        rel = os.path.relpath(abs_img_path, out_dir)

                        if box.label == "figure":
                            figure_md = f"![Figure β€” page {page_num}]({rel})\n"
                            figure_html = f'<img src="{rel}" alt="Figure β€” page {page_num}" />'
                            md_lines.append(figure_md)
                            html_lines.append(figure_html)
                            if figures_bar: figures_bar.update(1)

                        elif box.label == "chart":
                            if self.vlm is not None:
                                wrote_table = False
                                try:
                                    chart = self.vlm.extract_chart(abs_img_path)
                                    item = to_structured_dict(chart)
                                    if item:
                                        item["page"] = page_num
                                        item["type"] = "Chart"
                                        structured_items.append(item)

                                        table_md = render_markdown_table(item.get("headers"), item.get("rows"),
                                                                         title=item.get("title"))
                                        table_html = render_html_table(item.get("headers"), item.get("rows"),
                                                                       title=item.get("title"))

                                        md_lines.append(table_md)
                                        html_lines.append(table_html)
                                        wrote_table = True
                                except Exception as e:
                                    pass
                                if not wrote_table:
                                    chart_md = f"![Chart β€” page {page_num}]({rel})\n"
                                    chart_html = f'<img src="{rel}" alt="Chart β€” page {page_num}" />'
                                    md_lines.append(chart_md)
                                    html_lines.append(chart_html)
                            else:
                                chart_md = f"![Chart β€” page {page_num}]({rel})\n"
                                chart_html = f'<img src="{rel}" alt="Chart β€” page {page_num}" />'
                                md_lines.append(chart_md)
                                html_lines.append(chart_html)
                            if charts_bar: charts_bar.update(1)

                        elif box.label == "table":
                            is_merged = any(seg.match_box(box, page_num) for seg in merged_table_segments)
                            if is_merged:
                                continue

                            if self.vlm is not None:
                                wrote_table = False
                                try:
                                    table = self.vlm.extract_table(abs_img_path)
                                    item = to_structured_dict(table)
                                    if item:
                                        item["page"] = page_num
                                        item["type"] = "Table"
                                        structured_items.append(item)

                                        table_md = render_markdown_table(item.get("headers"), item.get("rows"),
                                                                         title=item.get("title"))
                                        table_html = render_html_table(item.get("headers"), item.get("rows"),
                                                                       title=item.get("title"))

                                        md_lines.append(table_md)
                                        html_lines.append(table_html)
                                        wrote_table = True
                                except Exception as e:
                                    pass
                                if not wrote_table:
                                    table_md = f"![Table β€” page {page_num}]({rel})\n"
                                    table_html = f'<img src="{rel}" alt="Table β€” page {page_num}" />'
                                    md_lines.append(table_md)
                                    html_lines.append(table_html)
                            else:
                                table_md = f"![Table β€” page {page_num}]({rel})\n"
                                table_html = f'<img src="{rel}" alt="Table β€” page {page_num}" />'
                                md_lines.append(table_md)
                                html_lines.append(table_html)
                            if tables_bar: tables_bar.update(1)
                    else:
                        text = ocr_box_text(self.ocr_engine, page_img, box)
                        if text:
                            md_lines.append(text)
                            md_lines.append(self.box_separator if self.box_separator else "")
                            html_text = text.replace('\n', '<br>')
                            html_lines.append(f"<p>{html_text}</p>")
                            if self.box_separator:
                                html_lines.append("<br>")

            if split_table_matches and self.split_table_detector:
                for match_idx, match in enumerate(split_table_matches):
                    try:
                        merged_img = self.split_table_detector.merge_table_images(match)

                        tables_dir = os.path.join(out_dir, "tables")
                        os.makedirs(tables_dir, exist_ok=True)
                        merged_filename = f"merged_table_{match.segment1.page_index}_{match.segment2.page_index}.png"
                        merged_path = os.path.join(tables_dir, merged_filename)
                        merged_img.save(merged_path)

                        abs_merged_path = os.path.abspath(merged_path)
                        rel_merged = os.path.relpath(abs_merged_path, out_dir)

                        pages_str = f"pages {match.segment1.page_index}-{match.segment2.page_index}"

                        if self.vlm is not None:
                            wrote_table = False
                            try:
                                table = self.vlm.extract_table(abs_merged_path)
                                item = to_structured_dict(table)
                                if item:
                                    item["page"] = f"{match.segment1.page_index}-{match.segment2.page_index}"
                                    item["type"] = "Table (Merged)"
                                    item["split_merge"] = True
                                    item["merge_confidence"] = match.confidence
                                    structured_items.append(item)

                                    table_md = render_markdown_table(
                                        item.get("headers"), 
                                        item.get("rows"),
                                        title=item.get("title") or f"Merged Table ({pages_str})"
                                    )
                                    table_html = render_html_table(
                                        item.get("headers"), 
                                        item.get("rows"),
                                        title=item.get("title") or f"Merged Table ({pages_str})"
                                    )

                                    md_lines.append(f"\n### Merged Table ({pages_str})\n")
                                    md_lines.append(table_md)
                                    html_lines.append(f'<h3>Merged Table ({pages_str})</h3>')
                                    html_lines.append(table_html)
                                    wrote_table = True
                            except Exception as e:
                                pass

                            if not wrote_table:
                                table_md = f"![Merged Table β€” {pages_str}]({rel_merged})\n"
                                table_html = f'<img src="{rel_merged}" alt="Merged Table β€” {pages_str}" />'
                                md_lines.append(f"\n### Merged Table ({pages_str})\n")
                                md_lines.append(table_md)
                                html_lines.append(f'<h3>Merged Table ({pages_str})</h3>')
                                html_lines.append(table_html)
                        else:
                            table_md = f"![Merged Table β€” {pages_str}]({rel_merged})\n"
                            table_html = f'<img src="{rel_merged}" alt="Merged Table β€” {pages_str}" />'
                            md_lines.append(f"\n### Merged Table ({pages_str})\n")
                            md_lines.append(table_md)
                            html_lines.append(f'<h3>Merged Table ({pages_str})</h3>')
                            html_lines.append(table_html)

                        if tables_bar: tables_bar.update(1)

                    except Exception as e:
                        print(f"⚠️  Warning: Failed to merge table {match_idx + 1}: {e}")

        md_path = write_markdown(md_lines, out_dir)

        if self.vlm is not None and html_lines:
            html_path = write_html_from_lines(html_lines, out_dir)
        else:
            html_path = write_html(md_lines, out_dir)

        excel_path = None
        html_structured_path = None
        if self.vlm is not None and structured_items:
            excel_path = os.path.join(out_dir, "tables.xlsx")
            write_structured_excel(excel_path, structured_items)
            html_structured_path = os.path.join(out_dir, "tables.html")
            write_structured_html(html_structured_path, structured_items)

        print(f"βœ… Parsing completed successfully!")
        print(f"πŸ“ Output directory: {out_dir}")

    def display_pages_with_boxes(self, pdf_path: str, num_pages: int = 3, cols: int = 2,
                                 page_width: int = 800, spacing: int = 40, save_path: Optional[str] = None) -> Image.Image:
        """
        Display the first N pages of a PDF with bounding boxes and labels overlaid in a modern grid layout.

        Creates a visualization showing layout detection results with bounding boxes,
        labels, and confidence scores overlaid on the PDF pages in a grid format.

        :param pdf_path: Path to the input PDF file
        :param num_pages: Number of pages to display (default: 3)
        :param cols: Number of columns in the grid layout (default: 2)
        :param page_width: Width to resize each page to in pixels (default: 800)
        :param spacing: Spacing between pages in pixels (default: 40)
        :param save_path: Optional path to save the visualization (if None, displays only)
        :return: The composed grid visualization as a PIL Image
        """
        pages: List[LayoutPage] = self.layout_engine.predict_pdf(
            pdf_path, batch_size=1, layout_nms=True, dpi=self.dpi, min_score=self.min_score
        )
        pil_pages = [im for (im, _, _) in render_pdf_to_images(pdf_path, dpi=self.dpi)]

        pages_to_show = min(num_pages, len(pages))

        if pages_to_show == 0:
            print("No pages to display")
            return

        rows = (pages_to_show + cols - 1) // cols

        used_labels = set()
        for idx in range(pages_to_show):
            page = pages[idx]
            for box in page.boxes:
                used_labels.add(box.label.lower())

        base_colors = ['#3B82F6', '#EF4444', '#10B981', '#F59E0B', '#8B5CF6',
                       '#F97316', '#EC4899', '#6B7280', '#84CC16', '#06B6D4',
                       '#DC2626', '#059669', '#7C3AED', '#DB2777', '#0891B2']

        dynamic_label_colors = {}
        for i, label in enumerate(sorted(used_labels)):
            dynamic_label_colors[label] = base_colors[i % len(base_colors)]

        processed_pages = []

        for idx in range(pages_to_show):
            page = pages[idx]
            page_img = pil_pages[idx].copy()

            scale_factor = page_width / page_img.width
            new_height = int(page_img.height * scale_factor)
            page_img = page_img.resize((page_width, new_height), Image.LANCZOS)

            draw = ImageDraw.Draw(page_img)

            try:
                font = ImageFont.truetype("arial.ttf", 24)
                small_font = ImageFont.truetype("arial.ttf", 18)
            except OSError:
                try:
                    font = ImageFont.load_default()
                    small_font = ImageFont.load_default()
                except Exception:
                    font = None
                    small_font = None

            for box in page.boxes:
                x1 = int(box.x1 * scale_factor)
                y1 = int(box.y1 * scale_factor)
                x2 = int(box.x2 * scale_factor)
                y2 = int(box.y2 * scale_factor)

                color = dynamic_label_colors.get(box.label.lower(), '#000000')

                draw.rectangle([x1, y1, x2, y2], outline=color, width=3)

                label_text = f"{box.label} ({box.score:.2f})"
                if font:
                    bbox = draw.textbbox((0, 0), label_text, font=small_font)
                    text_width = bbox[2] - bbox[0]
                    text_height = bbox[3] - bbox[1]
                else:
                    text_width = len(label_text) * 8
                    text_height = 15

                label_x = x1
                label_y = max(0, y1 - text_height - 8)

                padding = 4
                draw.rectangle([
                    label_x - padding,
                    label_y - padding,
                    label_x + text_width + padding,
                    label_y + text_height + padding
                ], fill='white', outline=color, width=2)

                draw.text((label_x, label_y), label_text, fill=color, font=small_font)

            title_text = f"Page {page.page_index} ({len(page.boxes)} boxes)"
            if font:
                title_bbox = draw.textbbox((0, 0), title_text, font=font)
                title_width = title_bbox[2] - title_bbox[0]
            else:
                title_width = len(title_text) * 12

            title_x = (page_width - title_width) // 2
            title_y = 10
            draw.rectangle([title_x - 10, title_y - 5, title_x + title_width + 10, title_y + 35],
                           fill='white', outline='#1F2937', width=2)
            draw.text((title_x, title_y), title_text, fill='#1F2937', font=font)

            processed_pages.append(page_img)

        legend_width = 250
        grid_width = cols * page_width + (cols - 1) * spacing
        total_width = grid_width + legend_width + spacing
        grid_height = rows * (processed_pages[0].height if processed_pages else 600) + (rows - 1) * spacing

        final_img = Image.new('RGB', (total_width, grid_height), '#F8FAFC')

        for idx, page_img in enumerate(processed_pages):
            row = idx // cols
            col = idx % cols

            x_pos = col * (page_width + spacing)
            y_pos = row * (page_img.height + spacing)

            final_img.paste(page_img, (x_pos, y_pos))

        legend_x = grid_width + spacing
        legend_y = 20

        draw_legend = ImageDraw.Draw(final_img)

        legend_title = "Element Types"
        if font:
            title_bbox = draw_legend.textbbox((0, 0), legend_title, font=font)
            title_width = title_bbox[2] - title_bbox[0]
            title_height = title_bbox[3] - title_bbox[1]
        else:
            title_width = len(legend_title) * 12
            title_height = 20

        legend_bg_height = len(used_labels) * 35 + title_height + 40
        draw_legend.rectangle([legend_x - 10, legend_y - 10,
                               legend_x + legend_width - 10, legend_y + legend_bg_height],
                              fill='white', outline='#E5E7EB', width=2)

        draw_legend.text((legend_x + 10, legend_y + 5), legend_title,
                         fill='#1F2937', font=font)

        current_y = legend_y + title_height + 20

        for label in sorted(used_labels):
            color = dynamic_label_colors[label]

            square_size = 20
            draw_legend.rectangle([legend_x + 10, current_y,
                                   legend_x + 10 + square_size, current_y + square_size],
                                  fill=color, outline='#6B7280', width=1)

            draw_legend.text((legend_x + 40, current_y + 2), label.title(),
                             fill='#374151', font=small_font)

            current_y += 30

        if save_path:
            final_img.save(save_path, quality=95, optimize=True)
            print(f"Layout visualization saved to: {save_path}")
        else:
            final_img.show()

        print(f"\nπŸ“Š Layout Detection Summary for {os.path.basename(pdf_path)}:")
        print(f"Pages processed: {pages_to_show}")

        total_counts = {}
        for idx in range(pages_to_show):
            page = pages[idx]
            for box in page.boxes:
                total_counts[box.label] = total_counts.get(box.label, 0) + 1

        print("\nTotal elements detected:")
        for label, count in sorted(total_counts.items()):
            print(f"  - {label}: {count}")

        return final_img
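As the `parse()` source above shows, all outputs are written under a directory derived from the input filename. A small stdlib sketch of that convention:

```python
import os

def full_parse_out_dir(pdf_path: str) -> str:
    # Mirrors parse(): strip directory and extension, then nest
    # under outputs/<pdf stem>/full_parse
    pdf_filename = os.path.splitext(os.path.basename(pdf_path))[0]
    return f"outputs/{pdf_filename}/full_parse"

print(full_parse_out_dir("docs/annual_report.pdf"))
# outputs/annual_report/full_parse
```

Cropped images for figures, charts, and tables are saved into subdirectories of this folder, alongside the Markdown/HTML output.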

__init__(*, vlm=None, layout_model_name='PP-DocLayout_plus-L', dpi=200, min_score=0.0, ocr_engine=None, box_separator='\n', merge_split_tables=False, bottom_threshold_ratio=0.2, top_threshold_ratio=0.15, max_gap_ratio=0.25, column_alignment_tolerance=10.0, min_merge_confidence=0.65)

    Initialize the StructuredPDFParser with processing configuration.

    Also suppresses noisy DEBUG logs from external libraries.

    :param vlm: VLM engine instance (VLMStructuredExtractor). If None, VLM processing is disabled.
    :param layout_model_name: Layout detection model name (default: "PP-DocLayout_plus-L")
    :param dpi: DPI for PDF rendering (default: 200)
    :param min_score: Minimum confidence score for layout detection (default: 0.0)
    :param ocr_engine: OCR engine instance (PytesseractOCREngine or PaddleOCREngine).
                       If None, creates a default PytesseractOCREngine with lang="eng", psm=4, oem=3.
    :param box_separator: Separator between text boxes in output (default: "\n")
    :param merge_split_tables: Whether to detect and merge split tables (default: False)
    :param bottom_threshold_ratio: Ratio for "too close to bottom" detection (default: 0.20)
    :param top_threshold_ratio: Ratio for "too close to top" detection (default: 0.15)
    :param max_gap_ratio: Maximum allowed gap between tables (default: 0.25, accounts for headers/footers)
    :param column_alignment_tolerance: Pixel tolerance for column alignment (default: 10.0)
    :param min_merge_confidence: Minimum confidence score for merging (default: 0.65)
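The proximity parameters work together: a table ending near the bottom of one page and another starting near the top of the next are candidates for merging. The following is a hypothetical stdlib sketch of that check (not the actual SplitTableDetector code, whose source is not shown here):

```python
def is_merge_candidate(bottom_y: float, page_h: float, top_y: float,
                       bottom_threshold_ratio: float = 0.20,
                       top_threshold_ratio: float = 0.15) -> bool:
    # Hypothetical check mirroring the documented thresholds:
    # segment 1 must end within bottom_threshold_ratio of its page bottom,
    # segment 2 must start within top_threshold_ratio of the next page top
    # (equal page heights assumed for simplicity).
    near_bottom = (page_h - bottom_y) / page_h <= bottom_threshold_ratio
    near_top = top_y / page_h <= top_threshold_ratio
    return near_bottom and near_top

print(is_merge_candidate(bottom_y=980, page_h=1000, top_y=100))  # True
print(is_merge_candidate(bottom_y=700, page_h=1000, top_y=100))  # False
```

Candidates that pass this proximity screen are then scored by column alignment and LSD-based structure analysis, and merged only when the score reaches `min_merge_confidence`.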

Source code in doctra/parsers/structured_pdf_parser.py
def __init__(
        self,
        *,
        vlm: Optional[VLMStructuredExtractor] = None,
        layout_model_name: str = "PP-DocLayout_plus-L",
        dpi: int = 200,
        min_score: float = 0.0,
        ocr_engine: Optional[Union[PytesseractOCREngine, PaddleOCREngine]] = None,
        box_separator: str = "\n",
        merge_split_tables: bool = False,
        bottom_threshold_ratio: float = 0.20,
        top_threshold_ratio: float = 0.15,
        max_gap_ratio: float = 0.25,
        column_alignment_tolerance: float = 10.0,
        min_merge_confidence: float = 0.65,
):
    """
    Initialize the StructuredPDFParser with processing configuration.

    Also suppresses noisy DEBUG logs from external libraries.

    :param vlm: VLM engine instance (VLMStructuredExtractor). If None, VLM processing is disabled.
    :param layout_model_name: Layout detection model name (default: "PP-DocLayout_plus-L")
    :param dpi: DPI for PDF rendering (default: 200)
    :param min_score: Minimum confidence score for layout detection (default: 0.0)
    :param ocr_engine: OCR engine instance (PytesseractOCREngine or PaddleOCREngine).
                       If None, creates a default PytesseractOCREngine with lang="eng", psm=4, oem=3.
    :param box_separator: Separator between text boxes in output (default: "\n")
    :param merge_split_tables: Whether to detect and merge split tables (default: False)
    :param bottom_threshold_ratio: Ratio for "too close to bottom" detection (default: 0.20)
    :param top_threshold_ratio: Ratio for "too close to top" detection (default: 0.15)
    :param max_gap_ratio: Maximum allowed gap between tables (default: 0.25, accounts for headers/footers)
    :param column_alignment_tolerance: Pixel tolerance for column alignment (default: 10.0)
    :param min_merge_confidence: Minimum confidence score for merging (default: 0.65)
    """
    self.layout_engine = PaddleLayoutEngine(model_name=layout_model_name)
    self.dpi = dpi
    self.min_score = min_score

    # Initialize OCR engine - use provided instance or create default
    if ocr_engine is None:
        self.ocr_engine = PytesseractOCREngine(lang="eng", psm=4, oem=3)
    elif isinstance(ocr_engine, (PytesseractOCREngine, PaddleOCREngine)):
        self.ocr_engine = ocr_engine
    else:
        raise TypeError(
            f"ocr_engine must be an instance of PytesseractOCREngine or PaddleOCREngine, "
            f"got {type(ocr_engine).__name__}"
        )

    self.box_separator = box_separator

    # Initialize VLM engine - use provided instance or None
    if vlm is None:
        self.vlm = None
    elif isinstance(vlm, VLMStructuredExtractor):
        self.vlm = vlm
    else:
        raise TypeError(
            f"vlm must be an instance of VLMStructuredExtractor or None, "
            f"got {type(vlm).__name__}"
        )

    self.merge_split_tables = merge_split_tables
    if self.merge_split_tables:
        self.split_table_detector = SplitTableDetector(
            bottom_threshold_ratio=bottom_threshold_ratio,
            top_threshold_ratio=top_threshold_ratio,
            max_gap_ratio=max_gap_ratio,
            column_alignment_tolerance=column_alignment_tolerance,
            min_merge_confidence=min_merge_confidence,
        )
    else:
        self.split_table_detector = None

    # Suppress noisy DEBUG logs from external libraries
    logging.getLogger('pytesseract').setLevel(logging.WARNING)
    logging.getLogger('markdown_it').setLevel(logging.WARNING)

display_pages_with_boxes(pdf_path, num_pages=3, cols=2, page_width=800, spacing=40, save_path=None)

Display the first N pages of a PDF with bounding boxes and labels overlaid in a modern grid layout.

Creates a visualization showing layout detection results with bounding boxes, labels, and confidence scores overlaid on the PDF pages in a grid format.

:param pdf_path: Path to the input PDF file
:param num_pages: Number of pages to display (default: 3)
:param cols: Number of columns in the grid layout (default: 2)
:param page_width: Width to resize each page to in pixels (default: 800)
:param spacing: Spacing between pages in pixels (default: 40)
:param save_path: Optional path to save the visualization (if None, displays only)
:return: None
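
The grid height follows from `num_pages` and `cols` by ceiling division, the same arithmetic used in the source below:

```python
def grid_rows(pages_to_show: int, cols: int) -> int:
    # Ceiling division: the smallest row count that fits every page.
    return (pages_to_show + cols - 1) // cols
```

For example, 3 pages in 2 columns need 2 rows; 5 pages in 3 columns also need 2.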

Source code in doctra/parsers/structured_pdf_parser.py
def display_pages_with_boxes(self, pdf_path: str, num_pages: int = 3, cols: int = 2,
                             page_width: int = 800, spacing: int = 40, save_path: str = None) -> None:
    """
    Display the first N pages of a PDF with bounding boxes and labels overlaid in a modern grid layout.

    Creates a visualization showing layout detection results with bounding boxes,
    labels, and confidence scores overlaid on the PDF pages in a grid format.

    :param pdf_path: Path to the input PDF file
    :param num_pages: Number of pages to display (default: 3)
    :param cols: Number of columns in the grid layout (default: 2)
    :param page_width: Width to resize each page to in pixels (default: 800)
    :param spacing: Spacing between pages in pixels (default: 40)
    :param save_path: Optional path to save the visualization (if None, displays only)
    :return: None
    """
    pages: List[LayoutPage] = self.layout_engine.predict_pdf(
        pdf_path, batch_size=1, layout_nms=True, dpi=self.dpi, min_score=self.min_score
    )
    pil_pages = [im for (im, _, _) in render_pdf_to_images(pdf_path, dpi=self.dpi)]

    pages_to_show = min(num_pages, len(pages))

    if pages_to_show == 0:
        print("No pages to display")
        return

    rows = (pages_to_show + cols - 1) // cols

    used_labels = set()
    for idx in range(pages_to_show):
        page = pages[idx]
        for box in page.boxes:
            used_labels.add(box.label.lower())

    base_colors = ['#3B82F6', '#EF4444', '#10B981', '#F59E0B', '#8B5CF6',
                   '#F97316', '#EC4899', '#6B7280', '#84CC16', '#06B6D4',
                   '#DC2626', '#059669', '#7C3AED', '#DB2777', '#0891B2']

    dynamic_label_colors = {}
    for i, label in enumerate(sorted(used_labels)):
        dynamic_label_colors[label] = base_colors[i % len(base_colors)]

    processed_pages = []

    for idx in range(pages_to_show):
        page = pages[idx]
        page_img = pil_pages[idx].copy()

        scale_factor = page_width / page_img.width
        new_height = int(page_img.height * scale_factor)
        page_img = page_img.resize((page_width, new_height), Image.LANCZOS)

        draw = ImageDraw.Draw(page_img)

        try:
            font = ImageFont.truetype("arial.ttf", 24)
            small_font = ImageFont.truetype("arial.ttf", 18)
        except OSError:
            try:
                font = ImageFont.load_default()
                small_font = ImageFont.load_default()
            except Exception:
                font = None
                small_font = None

        for box in page.boxes:
            x1 = int(box.x1 * scale_factor)
            y1 = int(box.y1 * scale_factor)
            x2 = int(box.x2 * scale_factor)
            y2 = int(box.y2 * scale_factor)

            color = dynamic_label_colors.get(box.label.lower(), '#000000')

            draw.rectangle([x1, y1, x2, y2], outline=color, width=3)

            label_text = f"{box.label} ({box.score:.2f})"
            if font:
                bbox = draw.textbbox((0, 0), label_text, font=small_font)
                text_width = bbox[2] - bbox[0]
                text_height = bbox[3] - bbox[1]
            else:
                text_width = len(label_text) * 8
                text_height = 15

            label_x = x1
            label_y = max(0, y1 - text_height - 8)

            padding = 4
            draw.rectangle([
                label_x - padding,
                label_y - padding,
                label_x + text_width + padding,
                label_y + text_height + padding
            ], fill='white', outline=color, width=2)

            draw.text((label_x, label_y), label_text, fill=color, font=small_font)

        title_text = f"Page {page.page_index} ({len(page.boxes)} boxes)"
        if font:
            title_bbox = draw.textbbox((0, 0), title_text, font=font)
            title_width = title_bbox[2] - title_bbox[0]
        else:
            title_width = len(title_text) * 12

        title_x = (page_width - title_width) // 2
        title_y = 10
        draw.rectangle([title_x - 10, title_y - 5, title_x + title_width + 10, title_y + 35],
                       fill='white', outline='#1F2937', width=2)
        draw.text((title_x, title_y), title_text, fill='#1F2937', font=font)

        processed_pages.append(page_img)

    legend_width = 250
    grid_width = cols * page_width + (cols - 1) * spacing
    total_width = grid_width + legend_width + spacing
    grid_height = rows * (processed_pages[0].height if processed_pages else 600) + (rows - 1) * spacing

    final_img = Image.new('RGB', (total_width, grid_height), '#F8FAFC')

    for idx, page_img in enumerate(processed_pages):
        row = idx // cols
        col = idx % cols

        x_pos = col * (page_width + spacing)
        y_pos = row * (page_img.height + spacing)

        final_img.paste(page_img, (x_pos, y_pos))

    legend_x = grid_width + spacing
    legend_y = 20

    draw_legend = ImageDraw.Draw(final_img)

    legend_title = "Element Types"
    if font:
        title_bbox = draw_legend.textbbox((0, 0), legend_title, font=font)
        title_width = title_bbox[2] - title_bbox[0]
        title_height = title_bbox[3] - title_bbox[1]
    else:
        title_width = len(legend_title) * 12
        title_height = 20

    legend_bg_height = len(used_labels) * 35 + title_height + 40
    draw_legend.rectangle([legend_x - 10, legend_y - 10,
                           legend_x + legend_width - 10, legend_y + legend_bg_height],
                          fill='white', outline='#E5E7EB', width=2)

    draw_legend.text((legend_x + 10, legend_y + 5), legend_title,
                     fill='#1F2937', font=font)

    current_y = legend_y + title_height + 20

    for label in sorted(used_labels):
        color = dynamic_label_colors[label]

        square_size = 20
        draw_legend.rectangle([legend_x + 10, current_y,
                               legend_x + 10 + square_size, current_y + square_size],
                              fill=color, outline='#6B7280', width=1)

        draw_legend.text((legend_x + 40, current_y + 2), label.title(),
                         fill='#374151', font=small_font)

        current_y += 30

    if save_path:
        final_img.save(save_path, quality=95, optimize=True)
        print(f"Layout visualization saved to: {save_path}")
    else:
        final_img.show()

    print(f"\nπŸ“Š Layout Detection Summary for {os.path.basename(pdf_path)}:")
    print(f"Pages processed: {pages_to_show}")

    total_counts = {}
    for idx in range(pages_to_show):
        page = pages[idx]
        for box in page.boxes:
            total_counts[box.label] = total_counts.get(box.label, 0) + 1

    print("\nTotal elements detected:")
    for label, count in sorted(total_counts.items()):
        print(f"  - {label}: {count}")

    return final_img
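
Each detected label is assigned a color by cycling through a fixed palette with modular indexing, so any number of labels gets a deterministic color. A condensed sketch of that mapping (the palette here is truncated for illustration):

```python
BASE_COLORS = ['#3B82F6', '#EF4444', '#10B981']  # shortened palette for illustration

def assign_label_colors(used_labels):
    # Sort labels for a deterministic mapping, then wrap the palette with modulo
    # so a 16th label simply reuses the first color.
    return {label: BASE_COLORS[i % len(BASE_COLORS)]
            for i, label in enumerate(sorted(used_labels))}
```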

parse(pdf_path)

Parse a PDF document and extract all content types.

:param pdf_path: Path to the input PDF file
:return: None
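
`parse` derives its output directory from the PDF's base name, writing everything under `outputs/<name>/full_parse`. A sketch of that convention:

```python
import os

def output_dir_for(pdf_path: str) -> str:
    # outputs/<pdf basename without extension>/full_parse, as in parse().
    pdf_filename = os.path.splitext(os.path.basename(pdf_path))[0]
    return f"outputs/{pdf_filename}/full_parse"
```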

Source code in doctra/parsers/structured_pdf_parser.py
def parse(self, pdf_path: str) -> None:
    """
    Parse a PDF document and extract all content types.

    :param pdf_path: Path to the input PDF file
    :return: None
    """
    pdf_filename = os.path.splitext(os.path.basename(pdf_path))[0]
    out_dir = f"outputs/{pdf_filename}/full_parse"

    os.makedirs(out_dir, exist_ok=True)
    ensure_output_dirs(out_dir, IMAGE_SUBDIRS)

    pages: List[LayoutPage] = self.layout_engine.predict_pdf(
        pdf_path, batch_size=1, layout_nms=True, dpi=self.dpi, min_score=self.min_score
    )
    pil_pages = [im for (im, _, _) in render_pdf_to_images(pdf_path, dpi=self.dpi)]

    split_table_matches: List[SplitTableMatch] = []
    merged_table_segments = []

    if self.merge_split_tables and self.split_table_detector:
        try:
            split_table_matches = self.split_table_detector.detect_split_tables(pages, pil_pages)
            for match in split_table_matches:
                merged_table_segments.append(match.segment1)
                merged_table_segments.append(match.segment2)
        except Exception as e:
            import traceback
            traceback.print_exc()
            split_table_matches = []

    fig_count = sum(sum(1 for b in p.boxes if b.label == "figure") for p in pages)
    chart_count = sum(sum(1 for b in p.boxes if b.label == "chart") for p in pages)
    table_count = sum(sum(1 for b in p.boxes if b.label == "table") for p in pages)

    md_lines: List[str] = ["# Extracted Content\n"]
    html_lines: List[str] = ["<h1>Extracted Content</h1>"]
    structured_items: List[Dict[str, Any]] = []

    charts_desc = "Charts (VLM β†’ table)" if self.vlm is not None else "Charts (cropped)"
    tables_desc = "Tables (VLM β†’ table)" if self.vlm is not None else "Tables (cropped)"
    figures_desc = "Figures (cropped)"

    with ExitStack() as stack:
        is_notebook = "ipykernel" in sys.modules or "jupyter" in sys.modules
        is_terminal = hasattr(sys.stdout, 'isatty') and sys.stdout.isatty()
        if is_notebook:
            charts_bar = stack.enter_context(
                create_notebook_friendly_bar(total=chart_count, desc=charts_desc)) if chart_count else None
            tables_bar = stack.enter_context(
                create_notebook_friendly_bar(total=table_count, desc=tables_desc)) if table_count else None
            figures_bar = stack.enter_context(
                create_notebook_friendly_bar(total=fig_count, desc=figures_desc)) if fig_count else None
        else:
            charts_bar = stack.enter_context(
                create_beautiful_progress_bar(total=chart_count, desc=charts_desc, leave=True)) if chart_count else None
            tables_bar = stack.enter_context(
                create_beautiful_progress_bar(total=table_count, desc=tables_desc, leave=True)) if table_count else None
            figures_bar = stack.enter_context(
                create_beautiful_progress_bar(total=fig_count, desc=figures_desc, leave=True)) if fig_count else None

        for p in pages:
            page_num = p.page_index
            page_img: Image.Image = pil_pages[page_num - 1]
            md_lines.append(f"\n## Page {page_num}\n")
            html_lines.append(f"<h2>Page {page_num}</h2>")

            for i, box in enumerate(sorted(p.boxes, key=reading_order_key), start=1):
                if box.label in EXCLUDE_LABELS:
                    img_path = save_box_image(page_img, box, out_dir, page_num, i, IMAGE_SUBDIRS)
                    abs_img_path = os.path.abspath(img_path)
                    rel = os.path.relpath(abs_img_path, out_dir)

                    if box.label == "figure":
                        figure_md = f"![Figure β€” page {page_num}]({rel})\n"
                        figure_html = f'<img src="{rel}" alt="Figure β€” page {page_num}" />'
                        md_lines.append(figure_md)
                        html_lines.append(figure_html)
                        if figures_bar: figures_bar.update(1)

                    elif box.label == "chart":
                        if self.vlm is not None:
                            wrote_table = False
                            try:
                                chart = self.vlm.extract_chart(abs_img_path)
                                item = to_structured_dict(chart)
                                if item:
                                    item["page"] = page_num
                                    item["type"] = "Chart"
                                    structured_items.append(item)

                                    table_md = render_markdown_table(item.get("headers"), item.get("rows"),
                                                                     title=item.get("title"))
                                    table_html = render_html_table(item.get("headers"), item.get("rows"),
                                                                   title=item.get("title"))

                                    md_lines.append(table_md)
                                    html_lines.append(table_html)
                                    wrote_table = True
                            except Exception as e:
                                pass
                            if not wrote_table:
                                chart_md = f"![Chart β€” page {page_num}]({rel})\n"
                                chart_html = f'<img src="{rel}" alt="Chart β€” page {page_num}" />'
                                md_lines.append(chart_md)
                                html_lines.append(chart_html)
                        else:
                            chart_md = f"![Chart β€” page {page_num}]({rel})\n"
                            chart_html = f'<img src="{rel}" alt="Chart β€” page {page_num}" />'
                            md_lines.append(chart_md)
                            html_lines.append(chart_html)
                        if charts_bar: charts_bar.update(1)

                    elif box.label == "table":
                        is_merged = any(seg.match_box(box, page_num) for seg in merged_table_segments)
                        if is_merged:
                            continue

                        if self.vlm is not None:
                            wrote_table = False
                            try:
                                table = self.vlm.extract_table(abs_img_path)
                                item = to_structured_dict(table)
                                if item:
                                    item["page"] = page_num
                                    item["type"] = "Table"
                                    structured_items.append(item)

                                    table_md = render_markdown_table(item.get("headers"), item.get("rows"),
                                                                     title=item.get("title"))
                                    table_html = render_html_table(item.get("headers"), item.get("rows"),
                                                                   title=item.get("title"))

                                    md_lines.append(table_md)
                                    html_lines.append(table_html)
                                    wrote_table = True
                            except Exception as e:
                                pass
                            if not wrote_table:
                                table_md = f"![Table β€” page {page_num}]({rel})\n"
                                table_html = f'<img src="{rel}" alt="Table β€” page {page_num}" />'
                                md_lines.append(table_md)
                                html_lines.append(table_html)
                        else:
                            table_md = f"![Table β€” page {page_num}]({rel})\n"
                            table_html = f'<img src="{rel}" alt="Table β€” page {page_num}" />'
                            md_lines.append(table_md)
                            html_lines.append(table_html)
                        if tables_bar: tables_bar.update(1)
                else:
                    text = ocr_box_text(self.ocr_engine, page_img, box)
                    if text:
                        md_lines.append(text)
                        md_lines.append(self.box_separator if self.box_separator else "")
                        html_text = text.replace('\n', '<br>')
                        html_lines.append(f"<p>{html_text}</p>")
                        if self.box_separator:
                            html_lines.append("<br>")

        if split_table_matches and self.split_table_detector:
            for match_idx, match in enumerate(split_table_matches):
                try:
                    merged_img = self.split_table_detector.merge_table_images(match)

                    tables_dir = os.path.join(out_dir, "tables")
                    os.makedirs(tables_dir, exist_ok=True)
                    merged_filename = f"merged_table_{match.segment1.page_index}_{match.segment2.page_index}.png"
                    merged_path = os.path.join(tables_dir, merged_filename)
                    merged_img.save(merged_path)

                    abs_merged_path = os.path.abspath(merged_path)
                    rel_merged = os.path.relpath(abs_merged_path, out_dir)

                    pages_str = f"pages {match.segment1.page_index}-{match.segment2.page_index}"

                    if self.vlm is not None:
                        wrote_table = False
                        try:
                            table = self.vlm.extract_table(abs_merged_path)
                            item = to_structured_dict(table)
                            if item:
                                item["page"] = f"{match.segment1.page_index}-{match.segment2.page_index}"
                                item["type"] = "Table (Merged)"
                                item["split_merge"] = True
                                item["merge_confidence"] = match.confidence
                                structured_items.append(item)

                                table_md = render_markdown_table(
                                    item.get("headers"), 
                                    item.get("rows"),
                                    title=item.get("title") or f"Merged Table ({pages_str})"
                                )
                                table_html = render_html_table(
                                    item.get("headers"), 
                                    item.get("rows"),
                                    title=item.get("title") or f"Merged Table ({pages_str})"
                                )

                                md_lines.append(f"\n### Merged Table ({pages_str})\n")
                                md_lines.append(table_md)
                                html_lines.append(f'<h3>Merged Table ({pages_str})</h3>')
                                html_lines.append(table_html)
                                wrote_table = True
                        except Exception as e:
                            pass

                        if not wrote_table:
                            table_md = f"![Merged Table β€” {pages_str}]({rel_merged})\n"
                            table_html = f'<img src="{rel_merged}" alt="Merged Table β€” {pages_str}" />'
                            md_lines.append(f"\n### Merged Table ({pages_str})\n")
                            md_lines.append(table_md)
                            html_lines.append(f'<h3>Merged Table ({pages_str})</h3>')
                            html_lines.append(table_html)
                    else:
                        table_md = f"![Merged Table β€” {pages_str}]({rel_merged})\n"
                        table_html = f'<img src="{rel_merged}" alt="Merged Table β€” {pages_str}" />'
                        md_lines.append(f"\n### Merged Table ({pages_str})\n")
                        md_lines.append(table_md)
                        html_lines.append(f'<h3>Merged Table ({pages_str})</h3>')
                        html_lines.append(table_html)

                    if tables_bar: tables_bar.update(1)

                except Exception as e:
                    print(f"⚠️  Warning: Failed to merge table {match_idx + 1}: {e}")

    md_path = write_markdown(md_lines, out_dir)

    if self.vlm is not None and html_lines:
        html_path = write_html_from_lines(html_lines, out_dir)
    else:
        html_path = write_html(md_lines, out_dir)

    excel_path = None
    html_structured_path = None
    if self.vlm is not None and structured_items:
        excel_path = os.path.join(out_dir, "tables.xlsx")
        write_structured_excel(excel_path, structured_items)
        html_structured_path = os.path.join(out_dir, "tables.html")
        write_structured_html(html_structured_path, structured_items)

    print(f"βœ… Parsing completed successfully!")
    print(f"πŸ“ Output directory: {out_dir}")

EnhancedPDFParser

Enhanced parser with image restoration capabilities.

doctra.parsers.enhanced_pdf_parser.EnhancedPDFParser

Bases: StructuredPDFParser

Enhanced PDF Parser with Image Restoration capabilities.

Extends the StructuredPDFParser with DocRes image restoration to improve
document quality before processing. This is particularly useful for:
- Scanned documents with shadows or distortion
- Low-quality PDFs that need enhancement
- Documents with perspective issues

:param use_image_restoration: Whether to apply DocRes image restoration (default: True)
:param restoration_task: DocRes task to use ("dewarping", "deshadowing", "appearance", "deblurring", "binarization", "end2end", default: "appearance")
:param restoration_device: Device for DocRes processing ("cuda", "cpu", or None for auto-detect, default: None)
:param restoration_dpi: DPI for restoration processing (default: 200)
:param vlm: VLM engine instance (VLMStructuredExtractor). If None, VLM processing is disabled.
:param layout_model_name: Layout detection model name (default: "PP-DocLayout_plus-L")
:param dpi: DPI for PDF rendering (default: 200)
:param min_score: Minimum confidence score for layout detection (default: 0.0)
:param ocr_engine: OCR engine instance (PytesseractOCREngine or PaddleOCREngine). 
                   If None, creates a default PytesseractOCREngine with lang="eng", psm=4, oem=3.
:param box_separator: Separator between text boxes in output (default: "\n")
:param merge_split_tables: Whether to detect and merge split tables (default: False)
:param bottom_threshold_ratio: Ratio for "too close to bottom" detection (default: 0.20)
:param top_threshold_ratio: Ratio for "too close to top" detection (default: 0.15)
:param max_gap_ratio: Maximum allowed gap between tables (default: 0.25)
:param column_alignment_tolerance: Pixel tolerance for column alignment (default: 10.0)
:param min_merge_confidence: Minimum confidence score for merging (default: 0.65)
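
If DocRes initialization fails, the enhanced parser prints a warning and disables restoration rather than aborting, as the source below shows. A stand-alone sketch of that fallback pattern (`FlakyEngine` is a stand-in for illustration, not a Doctra class):

```python
class FlakyEngine:
    """Stand-in for a restoration engine whose initialization fails."""
    def __init__(self):
        raise RuntimeError("model weights not found")

def init_restoration(use_restoration: bool, engine_cls):
    # Try to build the engine; on any failure, disable restoration and continue,
    # mirroring EnhancedPDFParser.__init__.
    if not use_restoration:
        return False, None
    try:
        return True, engine_cls()
    except Exception as e:
        print(f"Restoration engine initialization failed: {e}; continuing without it")
        return False, None
```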

Source code in doctra/parsers/enhanced_pdf_parser.py
class EnhancedPDFParser(StructuredPDFParser):
    """
    Enhanced PDF Parser with Image Restoration capabilities.

    Extends the StructuredPDFParser with DocRes image restoration to improve
    document quality before processing. This is particularly useful for:
    - Scanned documents with shadows or distortion
    - Low-quality PDFs that need enhancement
    - Documents with perspective issues

    :param use_image_restoration: Whether to apply DocRes image restoration (default: True)
    :param restoration_task: DocRes task to use ("dewarping", "deshadowing", "appearance", "deblurring", "binarization", "end2end", default: "appearance")
    :param restoration_device: Device for DocRes processing ("cuda", "cpu", or None for auto-detect, default: None)
    :param restoration_dpi: DPI for restoration processing (default: 200)
    :param vlm: VLM engine instance (VLMStructuredExtractor). If None, VLM processing is disabled.
    :param layout_model_name: Layout detection model name (default: "PP-DocLayout_plus-L")
    :param dpi: DPI for PDF rendering (default: 200)
    :param min_score: Minimum confidence score for layout detection (default: 0.0)
    :param ocr_engine: OCR engine instance (PytesseractOCREngine or PaddleOCREngine). 
                       If None, creates a default PytesseractOCREngine with lang="eng", psm=4, oem=3.
    :param box_separator: Separator between text boxes in output (default: "\n")
    :param merge_split_tables: Whether to detect and merge split tables (default: False)
    :param bottom_threshold_ratio: Ratio for "too close to bottom" detection (default: 0.20)
    :param top_threshold_ratio: Ratio for "too close to top" detection (default: 0.15)
    :param max_gap_ratio: Maximum allowed gap between tables (default: 0.25)
    :param column_alignment_tolerance: Pixel tolerance for column alignment (default: 10.0)
    :param min_merge_confidence: Minimum confidence score for merging (default: 0.65)
    """

    def __init__(
        self,
        *,
        use_image_restoration: bool = True,
        restoration_task: str = "appearance",
        restoration_device: Optional[str] = None,
        restoration_dpi: int = 200,
        vlm: Optional[VLMStructuredExtractor] = None,
        layout_model_name: str = "PP-DocLayout_plus-L",
        dpi: int = 200,
        min_score: float = 0.0,
        ocr_engine: Optional[Union[PytesseractOCREngine, PaddleOCREngine]] = None,
        box_separator: str = "\n",
        merge_split_tables: bool = False,
        bottom_threshold_ratio: float = 0.20,
        top_threshold_ratio: float = 0.15,
        max_gap_ratio: float = 0.25,
        column_alignment_tolerance: float = 10.0,
        min_merge_confidence: float = 0.65,
    ):
        """
        Initialize the Enhanced PDF Parser with image restoration capabilities.
        """
        super().__init__(
            vlm=vlm,
            layout_model_name=layout_model_name,
            dpi=dpi,
            min_score=min_score,
            ocr_engine=ocr_engine,
            box_separator=box_separator,
            merge_split_tables=merge_split_tables,
            bottom_threshold_ratio=bottom_threshold_ratio,
            top_threshold_ratio=top_threshold_ratio,
            max_gap_ratio=max_gap_ratio,
            column_alignment_tolerance=column_alignment_tolerance,
            min_merge_confidence=min_merge_confidence,
        )

        self.use_image_restoration = use_image_restoration
        self.restoration_task = restoration_task
        self.restoration_device = restoration_device
        self.restoration_dpi = restoration_dpi

        self.docres_engine = None
        if self.use_image_restoration:
            try:
                self.docres_engine = DocResEngine(
                    device=restoration_device,
                    use_half_precision=True
                )
                print(f"✅ DocRes engine initialized with task: {restoration_task}")
            except Exception as e:
                print(f"⚠️ DocRes initialization failed: {e}")
                print("   Continuing without image restoration...")
                self.use_image_restoration = False
                self.docres_engine = None

    def parse(self, pdf_path: str, enhanced_output_dir: str = None) -> None:
        """
        Parse a PDF document with optional image restoration.

        :param pdf_path: Path to the input PDF file
        :param enhanced_output_dir: Directory for enhanced images (if None, uses default)
        :return: None
        """
        pdf_filename = os.path.splitext(os.path.basename(pdf_path))[0]

        if enhanced_output_dir is None:
            out_dir = f"outputs/{pdf_filename}/enhanced_parse"
        else:
            out_dir = enhanced_output_dir

        os.makedirs(out_dir, exist_ok=True)
        ensure_output_dirs(out_dir, IMAGE_SUBDIRS)

        if self.use_image_restoration and self.docres_engine:
            print(f"🔄 Processing PDF with image restoration: {os.path.basename(pdf_path)}")
            enhanced_pages = self._process_pages_with_restoration(pdf_path, out_dir)

            enhanced_pdf_path = os.path.join(out_dir, f"{pdf_filename}_enhanced.pdf")
            try:
                self._create_enhanced_pdf_from_pages(enhanced_pages, enhanced_pdf_path)
            except Exception as e:
                print(f"⚠️ Failed to create enhanced PDF: {e}")
        else:
            print(f"🔄 Processing PDF without image restoration: {os.path.basename(pdf_path)}")
            enhanced_pages = [im for (im, _, _) in render_pdf_to_images(pdf_path, dpi=self.dpi)]

        print("🔍 Running layout detection on enhanced pages...")
        pages = self.layout_engine.predict_pdf(
            pdf_path, batch_size=1, layout_nms=True, dpi=self.dpi, min_score=self.min_score
        )

        pil_pages = enhanced_pages

        self._process_parsing_logic(pages, pil_pages, out_dir, pdf_filename, pdf_path)

    def _process_pages_with_restoration(self, pdf_path: str, out_dir: str) -> List[Image.Image]:
        """
        Process PDF pages with DocRes image restoration.

        :param pdf_path: Path to the input PDF file
        :param out_dir: Output directory for enhanced images
        :return: List of enhanced PIL images
        """
        original_pages = [im for (im, _, _) in render_pdf_to_images(pdf_path, dpi=self.restoration_dpi)]

        if not original_pages:
            print("❌ No pages found in PDF")
            return []

        is_notebook = "ipykernel" in sys.modules or "jupyter" in sys.modules
        if is_notebook:
            progress_bar = create_notebook_friendly_bar(
                total=len(original_pages), 
                desc=f"DocRes {self.restoration_task}"
            )
        else:
            progress_bar = create_beautiful_progress_bar(
                total=len(original_pages), 
                desc=f"DocRes {self.restoration_task}",
                leave=True
            )

        enhanced_pages = []
        enhanced_dir = os.path.join(out_dir, "enhanced_pages")
        os.makedirs(enhanced_dir, exist_ok=True)

        try:
            with progress_bar:
                for i, page_img in enumerate(original_pages):
                    try:
                        img_array = np.array(page_img)

                        restored_img, metadata = self.docres_engine.restore_image(
                            img_array, 
                            task=self.restoration_task
                        )

                        enhanced_page = Image.fromarray(restored_img)
                        enhanced_pages.append(enhanced_page)

                        enhanced_path = os.path.join(enhanced_dir, f"page_{i+1:03d}_enhanced.jpg")
                        enhanced_page.save(enhanced_path, "JPEG", quality=95)

                        progress_bar.set_description(f"✅ Page {i+1}/{len(original_pages)} enhanced")
                        progress_bar.update(1)

                    except Exception as e:
                        print(f"  ⚠️ Page {i+1} restoration failed: {e}, using original")
                        enhanced_pages.append(page_img)
                        progress_bar.set_description(f"⚠️ Page {i+1} failed, using original")
                        progress_bar.update(1)

        finally:
            if hasattr(progress_bar, 'close'):
                progress_bar.close()

        return enhanced_pages

    def _process_parsing_logic(self, pages, pil_pages, out_dir, pdf_filename, pdf_path):
        """
        Process the parsing logic with enhanced pages.
        This is extracted from the parent class to allow customization.
        """
        split_table_matches: List[SplitTableMatch] = []
        merged_table_segments = []

        if self.merge_split_tables and self.split_table_detector:
            try:
                split_table_matches = self.split_table_detector.detect_split_tables(pages, pil_pages)
                if split_table_matches:
                    print(f"🔗 Detected {len(split_table_matches)} split table(s) to merge")
                for match in split_table_matches:
                    merged_table_segments.append(match.segment1)
                    merged_table_segments.append(match.segment2)
            except Exception as e:
                import traceback
                traceback.print_exc()
                split_table_matches = []

        fig_count = sum(sum(1 for b in p.boxes if b.label == "figure") for p in pages)
        chart_count = sum(sum(1 for b in p.boxes if b.label == "chart") for p in pages)
        table_count = sum(sum(1 for b in p.boxes if b.label == "table") for p in pages)

        md_lines: List[str] = ["# Enhanced Document Content\n"]
        html_lines: List[str] = ["<h1>Enhanced Document Content</h1>"]
        structured_items: List[Dict[str, Any]] = []
        page_content: Dict[int, List[str]] = {}

        charts_desc = "Charts (VLM → table)" if self.vlm is not None else "Charts (cropped)"
        tables_desc = "Tables (VLM → table)" if self.vlm is not None else "Tables (cropped)"
        figures_desc = "Figures (cropped)"

        with ExitStack() as stack:
            is_notebook = "ipykernel" in sys.modules or "jupyter" in sys.modules
            if is_notebook:
                charts_bar = stack.enter_context(
                    create_notebook_friendly_bar(total=chart_count, desc=charts_desc)) if chart_count else None
                tables_bar = stack.enter_context(
                    create_notebook_friendly_bar(total=table_count, desc=tables_desc)) if table_count else None
                figures_bar = stack.enter_context(
                    create_notebook_friendly_bar(total=fig_count, desc=figures_desc)) if fig_count else None
            else:
                charts_bar = stack.enter_context(
                    create_beautiful_progress_bar(total=chart_count, desc=charts_desc, leave=True)) if chart_count else None
                tables_bar = stack.enter_context(
                    create_beautiful_progress_bar(total=table_count, desc=tables_desc, leave=True)) if table_count else None
                figures_bar = stack.enter_context(
                    create_beautiful_progress_bar(total=fig_count, desc=figures_desc, leave=True)) if fig_count else None

            for page_num in range(1, len(pil_pages) + 1):
                page_content[page_num] = [f"# Page {page_num} Content\n"]

            for p in pages:
                page_num = p.page_index
                page_img: Image.Image = pil_pages[page_num - 1]
                md_lines.append(f"\n## Page {page_num}\n")
                html_lines.append(f"<h2>Page {page_num}</h2>")

                for i, box in enumerate(sorted(p.boxes, key=reading_order_key), start=1):
                    if box.label in EXCLUDE_LABELS:
                        img_path = save_box_image(page_img, box, out_dir, page_num, i, IMAGE_SUBDIRS)
                        abs_img_path = os.path.abspath(img_path)
                        rel = os.path.relpath(abs_img_path, out_dir)

                        if box.label == "figure":
                            figure_md = f"![Figure — page {page_num}]({rel})\n"
                            figure_html = f'<img src="{rel}" alt="Figure — page {page_num}" />'
                            md_lines.append(figure_md)
                            html_lines.append(figure_html)
                            page_content[page_num].append(figure_md)
                            if figures_bar: figures_bar.update(1)

                        elif box.label == "chart":
                            if self.vlm is not None:
                                wrote_table = False
                                try:
                                    chart = self.vlm.extract_chart(abs_img_path)
                                    item = to_structured_dict(chart)
                                    if item:
                                        item["page"] = page_num
                                        item["type"] = "Chart"
                                        structured_items.append(item)

                                        table_md = render_markdown_table(item.get("headers"), item.get("rows"),
                                                                         title=item.get("title"))
                                        table_html = render_html_table(item.get("headers"), item.get("rows"),
                                                                       title=item.get("title"))

                                        md_lines.append(table_md)
                                        html_lines.append(table_html)
                                        page_content[page_num].append(table_md)
                                        wrote_table = True
                                except Exception as e:
                                    pass
                                if not wrote_table:
                                    chart_md = f"![Chart — page {page_num}]({rel})\n"
                                    chart_html = f'<img src="{rel}" alt="Chart — page {page_num}" />'
                                    md_lines.append(chart_md)
                                    html_lines.append(chart_html)
                                    page_content[page_num].append(chart_md)
                            else:
                                chart_md = f"![Chart — page {page_num}]({rel})\n"
                                chart_html = f'<img src="{rel}" alt="Chart — page {page_num}" />'
                                md_lines.append(chart_md)
                                html_lines.append(chart_html)
                                page_content[page_num].append(chart_md)
                            if charts_bar: charts_bar.update(1)

                        elif box.label == "table":
                            is_merged = any(seg.match_box(box, page_num) for seg in merged_table_segments)
                            if is_merged:
                                continue

                            if self.vlm is not None:
                                wrote_table = False
                                try:
                                    table = self.vlm.extract_table(abs_img_path)
                                    item = to_structured_dict(table)
                                    if item:
                                        item["page"] = page_num
                                        item["type"] = "Table"
                                        structured_items.append(item)

                                        table_md = render_markdown_table(item.get("headers"), item.get("rows"),
                                                                         title=item.get("title"))
                                        table_html = render_html_table(item.get("headers"), item.get("rows"),
                                                                       title=item.get("title"))

                                        md_lines.append(table_md)
                                        html_lines.append(table_html)
                                        page_content[page_num].append(table_md)
                                        wrote_table = True
                                except Exception as e:
                                    pass
                                if not wrote_table:
                                    table_md = f"![Table — page {page_num}]({rel})\n"
                                    table_html = f'<img src="{rel}" alt="Table — page {page_num}" />'
                                    md_lines.append(table_md)
                                    html_lines.append(table_html)
                                    page_content[page_num].append(table_md)
                            else:
                                table_md = f"![Table — page {page_num}]({rel})\n"
                                table_html = f'<img src="{rel}" alt="Table — page {page_num}" />'
                                md_lines.append(table_md)
                                html_lines.append(table_html)
                                page_content[page_num].append(table_md)
                            if tables_bar: tables_bar.update(1)
                    else:
                        text = ocr_box_text(self.ocr_engine, page_img, box)
                        if text:
                            md_lines.append(text)
                            md_lines.append(self.box_separator if self.box_separator else "")
                            html_text = text.replace('\n', '<br>')
                            html_lines.append(f"<p>{html_text}</p>")
                            if self.box_separator:
                                html_lines.append("<br>")
                            page_content[page_num].append(text)
                            page_content[page_num].append(self.box_separator if self.box_separator else "")

            if split_table_matches and self.split_table_detector:
                for match_idx, match in enumerate(split_table_matches):
                    try:
                        merged_img = self.split_table_detector.merge_table_images(match)

                        tables_dir = os.path.join(out_dir, "tables")
                        os.makedirs(tables_dir, exist_ok=True)
                        merged_filename = f"merged_table_{match.segment1.page_index}_{match.segment2.page_index}.png"
                        merged_path = os.path.join(tables_dir, merged_filename)
                        merged_img.save(merged_path)

                        abs_merged_path = os.path.abspath(merged_path)
                        rel_merged = os.path.relpath(abs_merged_path, out_dir)

                        pages_str = f"pages {match.segment1.page_index}-{match.segment2.page_index}"

                        if self.vlm is not None:
                            wrote_table = False
                            try:
                                table = self.vlm.extract_table(abs_merged_path)
                                item = to_structured_dict(table)
                                if item:
                                    item["page"] = f"{match.segment1.page_index}-{match.segment2.page_index}"
                                    item["type"] = "Table (Merged)"
                                    item["split_merge"] = True
                                    item["merge_confidence"] = match.confidence
                                    structured_items.append(item)

                                    table_md = render_markdown_table(
                                        item.get("headers"), 
                                        item.get("rows"),
                                        title=item.get("title") or f"Merged Table ({pages_str})"
                                    )
                                    table_html = render_html_table(
                                        item.get("headers"), 
                                        item.get("rows"),
                                        title=item.get("title") or f"Merged Table ({pages_str})"
                                    )

                                    md_lines.append(f"\n### Merged Table ({pages_str})\n")
                                    md_lines.append(table_md)
                                    html_lines.append(f'<h3>Merged Table ({pages_str})</h3>')
                                    html_lines.append(table_html)
                                    wrote_table = True
                            except Exception as e:
                                pass

                            if not wrote_table:
                                table_md = f"![Merged Table — {pages_str}]({rel_merged})\n"
                                table_html = f'<img src="{rel_merged}" alt="Merged Table — {pages_str}" />'
                                md_lines.append(f"\n### Merged Table ({pages_str})\n")
                                md_lines.append(table_md)
                                html_lines.append(f'<h3>Merged Table ({pages_str})</h3>')
                                html_lines.append(table_html)
                        else:
                            table_md = f"![Merged Table — {pages_str}]({rel_merged})\n"
                            table_html = f'<img src="{rel_merged}" alt="Merged Table — {pages_str}" />'
                            md_lines.append(f"\n### Merged Table ({pages_str})\n")
                            md_lines.append(table_md)
                            html_lines.append(f'<h3>Merged Table ({pages_str})</h3>')
                            html_lines.append(table_html)

                        if tables_bar: tables_bar.update(1)

                    except Exception as e:
                        print(f"⚠️  Warning: Failed to merge table {match_idx + 1}: {e}")

        md_path = write_markdown(md_lines, out_dir)

        if self.vlm is not None and html_lines:
            html_path = write_html_from_lines(html_lines, out_dir)
        else:
            html_path = write_html(md_lines, out_dir)

        pages_dir = os.path.join(out_dir, "pages")
        os.makedirs(pages_dir, exist_ok=True)

        for page_num, content_lines in page_content.items():
            page_md_path = os.path.join(pages_dir, f"page_{page_num:03d}.md")
            write_markdown(content_lines, os.path.dirname(page_md_path), os.path.basename(page_md_path))

        excel_path = None
        html_structured_path = None
        if self.vlm is not None and structured_items:
            excel_path = os.path.join(out_dir, "tables.xlsx")
            write_structured_excel(excel_path, structured_items)
            html_structured_path = os.path.join(out_dir, "tables.html")
            write_structured_html(html_structured_path, structured_items)

        print(f"✅ Enhanced parsing completed successfully!")
        print(f"📁 Output directory: {out_dir}")

    def _create_enhanced_pdf_from_pages(self, enhanced_pages: List[Image.Image], output_path: str) -> None:
        """
        Create an enhanced PDF from already processed enhanced pages.

        :param enhanced_pages: List of enhanced PIL images
        :param output_path: Path for the enhanced PDF
        """
        if not enhanced_pages:
            raise ValueError("No enhanced pages provided")

        try:
            enhanced_pages[0].save(
                output_path,
                "PDF",
                resolution=100.0,
                save_all=True,
                append_images=enhanced_pages[1:] if len(enhanced_pages) > 1 else []
            )
            print(f"✅ Enhanced PDF saved from processed pages: {output_path}")
        except Exception as e:
            print(f"❌ Error creating enhanced PDF from pages: {e}")
            raise

    def restore_pdf_only(self, pdf_path: str, output_path: str = None, task: str = None) -> str:
        """
        Apply DocRes restoration to a PDF without parsing.

        :param pdf_path: Path to the input PDF file
        :param output_path: Path for the enhanced PDF (if None, auto-generates)
        :param task: DocRes restoration task (if None, uses instance default)
        :return: Path to the enhanced PDF
        """
        if not self.use_image_restoration or not self.docres_engine:
            raise RuntimeError("Image restoration is not enabled or DocRes engine is not available")

        task = task or self.restoration_task
        return self.docres_engine.restore_pdf(pdf_path, output_path, task, self.restoration_dpi)

    def get_restoration_info(self) -> Dict[str, Any]:
        """
        Get information about the current restoration configuration.

        :return: Dictionary with restoration settings and status
        """
        return {
            'enabled': self.use_image_restoration,
            'task': self.restoration_task,
            'device': self.restoration_device,
            'dpi': self.restoration_dpi,
            'engine_available': self.docres_engine is not None,
            'supported_tasks': self.docres_engine.get_supported_tasks() if self.docres_engine else []
        }
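The per-page fallback in `_process_pages_with_restoration` (restore each page, keep the original when restoration fails) can be sketched independently of DocRes. `restore_pages` and its `restore` callback are illustrative stand-ins, not part of the Doctra API:

```python
from typing import Callable, List, TypeVar

Page = TypeVar("Page")

def restore_pages(pages: List[Page], restore: Callable[[Page], Page]) -> List[Page]:
    """Apply `restore` to each page, falling back to the original on failure."""
    enhanced: List[Page] = []
    for page in pages:
        try:
            enhanced.append(restore(page))
        except Exception:
            # Mirror the parser's behavior: a failed page is kept as-is
            # rather than aborting the whole document.
            enhanced.append(page)
    return enhanced
```

A failing page therefore degrades only that page, never the whole run.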

__init__(*, use_image_restoration=True, restoration_task='appearance', restoration_device=None, restoration_dpi=200, vlm=None, layout_model_name='PP-DocLayout_plus-L', dpi=200, min_score=0.0, ocr_engine=None, box_separator='\n', merge_split_tables=False, bottom_threshold_ratio=0.2, top_threshold_ratio=0.15, max_gap_ratio=0.25, column_alignment_tolerance=10.0, min_merge_confidence=0.65)

Initialize the Enhanced PDF Parser with image restoration capabilities.

Source code in doctra/parsers/enhanced_pdf_parser.py
def __init__(
    self,
    *,
    use_image_restoration: bool = True,
    restoration_task: str = "appearance",
    restoration_device: Optional[str] = None,
    restoration_dpi: int = 200,
    vlm: Optional[VLMStructuredExtractor] = None,
    layout_model_name: str = "PP-DocLayout_plus-L",
    dpi: int = 200,
    min_score: float = 0.0,
    ocr_engine: Optional[Union[PytesseractOCREngine, PaddleOCREngine]] = None,
    box_separator: str = "\n",
    merge_split_tables: bool = False,
    bottom_threshold_ratio: float = 0.20,
    top_threshold_ratio: float = 0.15,
    max_gap_ratio: float = 0.25,
    column_alignment_tolerance: float = 10.0,
    min_merge_confidence: float = 0.65,
):
    """
    Initialize the Enhanced PDF Parser with image restoration capabilities.
    """
    super().__init__(
        vlm=vlm,
        layout_model_name=layout_model_name,
        dpi=dpi,
        min_score=min_score,
        ocr_engine=ocr_engine,
        box_separator=box_separator,
        merge_split_tables=merge_split_tables,
        bottom_threshold_ratio=bottom_threshold_ratio,
        top_threshold_ratio=top_threshold_ratio,
        max_gap_ratio=max_gap_ratio,
        column_alignment_tolerance=column_alignment_tolerance,
        min_merge_confidence=min_merge_confidence,
    )

    self.use_image_restoration = use_image_restoration
    self.restoration_task = restoration_task
    self.restoration_device = restoration_device
    self.restoration_dpi = restoration_dpi

    self.docres_engine = None
    if self.use_image_restoration:
        try:
            self.docres_engine = DocResEngine(
                device=restoration_device,
                use_half_precision=True
            )
            print(f"✅ DocRes engine initialized with task: {restoration_task}")
        except Exception as e:
            print(f"⚠️ DocRes initialization failed: {e}")
            print("   Continuing without image restoration...")
            self.use_image_restoration = False
            self.docres_engine = None

get_restoration_info()

Get information about the current restoration configuration.

:return: Dictionary with restoration settings and status

Source code in doctra/parsers/enhanced_pdf_parser.py
def get_restoration_info(self) -> Dict[str, Any]:
    """
    Get information about the current restoration configuration.

    :return: Dictionary with restoration settings and status
    """
    return {
        'enabled': self.use_image_restoration,
        'task': self.restoration_task,
        'device': self.restoration_device,
        'dpi': self.restoration_dpi,
        'engine_available': self.docres_engine is not None,
        'supported_tasks': self.docres_engine.get_supported_tasks() if self.docres_engine else []
    }

parse(pdf_path, enhanced_output_dir=None)

Parse a PDF document with optional image restoration.

:param pdf_path: Path to the input PDF file
:param enhanced_output_dir: Directory for enhanced images (if None, uses default)
:return: None

Source code in doctra/parsers/enhanced_pdf_parser.py
def parse(self, pdf_path: str, enhanced_output_dir: str = None) -> None:
    """
    Parse a PDF document with optional image restoration.

    :param pdf_path: Path to the input PDF file
    :param enhanced_output_dir: Directory for enhanced images (if None, uses default)
    :return: None
    """
    pdf_filename = os.path.splitext(os.path.basename(pdf_path))[0]

    if enhanced_output_dir is None:
        out_dir = f"outputs/{pdf_filename}/enhanced_parse"
    else:
        out_dir = enhanced_output_dir

    os.makedirs(out_dir, exist_ok=True)
    ensure_output_dirs(out_dir, IMAGE_SUBDIRS)

    if self.use_image_restoration and self.docres_engine:
        print(f"🔄 Processing PDF with image restoration: {os.path.basename(pdf_path)}")
        enhanced_pages = self._process_pages_with_restoration(pdf_path, out_dir)

        enhanced_pdf_path = os.path.join(out_dir, f"{pdf_filename}_enhanced.pdf")
        try:
            self._create_enhanced_pdf_from_pages(enhanced_pages, enhanced_pdf_path)
        except Exception as e:
            print(f"⚠️ Failed to create enhanced PDF: {e}")
    else:
        print(f"🔄 Processing PDF without image restoration: {os.path.basename(pdf_path)}")
        enhanced_pages = [im for (im, _, _) in render_pdf_to_images(pdf_path, dpi=self.dpi)]

    print("🔍 Running layout detection on enhanced pages...")
    pages = self.layout_engine.predict_pdf(
        pdf_path, batch_size=1, layout_nms=True, dpi=self.dpi, min_score=self.min_score
    )

    pil_pages = enhanced_pages

    self._process_parsing_logic(pages, pil_pages, out_dir, pdf_filename, pdf_path)
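When `enhanced_output_dir` is omitted, `parse` derives the output directory from the PDF filename. That defaulting can be reproduced with `os.path` alone; `default_output_dir` is an illustrative helper, not part of Doctra:

```python
import os
from typing import Optional

def default_output_dir(pdf_path: str, enhanced_output_dir: Optional[str] = None) -> str:
    """Mirror parse()'s defaulting: outputs/<pdf stem>/enhanced_parse."""
    if enhanced_output_dir is not None:
        return enhanced_output_dir
    stem = os.path.splitext(os.path.basename(pdf_path))[0]
    return f"outputs/{stem}/enhanced_parse"
```

For example, `report.pdf` with no override lands in `outputs/report/enhanced_parse`.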

restore_pdf_only(pdf_path, output_path=None, task=None)

Apply DocRes restoration to a PDF without parsing.

:param pdf_path: Path to the input PDF file
:param output_path: Path for the enhanced PDF (if None, auto-generates)
:param task: DocRes restoration task (if None, uses instance default)
:return: Path to the enhanced PDF

Source code in doctra/parsers/enhanced_pdf_parser.py
def restore_pdf_only(self, pdf_path: str, output_path: str = None, task: str = None) -> str:
    """
    Apply DocRes restoration to a PDF without parsing.

    :param pdf_path: Path to the input PDF file
    :param output_path: Path for the enhanced PDF (if None, auto-generates)
    :param task: DocRes restoration task (if None, uses instance default)
    :return: Path to the enhanced PDF
    """
    if not self.use_image_restoration or not self.docres_engine:
        raise RuntimeError("Image restoration is not enabled or DocRes engine is not available")

    task = task or self.restoration_task
    return self.docres_engine.restore_pdf(pdf_path, output_path, task, self.restoration_dpi)

ChartTablePDFParser

Specialized parser for extracting charts and tables.

doctra.parsers.table_chart_extractor.ChartTablePDFParser

Specialized PDF parser for extracting charts and tables.

Focuses specifically on chart and table extraction from PDF documents, with optional VLM (Vision Language Model) processing to convert visual elements into structured data.

:param extract_charts: Whether to extract charts from the document (default: True)
:param extract_tables: Whether to extract tables from the document (default: True)
:param vlm: VLM engine instance (VLMStructuredExtractor). If None, VLM processing is disabled.
:param layout_model_name: Layout detection model name (default: "PP-DocLayout_plus-L")
:param dpi: DPI for PDF rendering (default: 200)
:param min_score: Minimum confidence score for layout detection (default: 0.0)
:param merge_split_tables: Whether to detect and merge split tables (default: False)
:param bottom_threshold_ratio: Ratio for "too close to bottom" detection (default: 0.20)
:param top_threshold_ratio: Ratio for "too close to top" detection (default: 0.15)
:param max_gap_ratio: Maximum allowed gap between tables (default: 0.25, accounts for headers/footers)
:param column_alignment_tolerance: Pixel tolerance for column alignment (default: 10.0)
:param min_merge_confidence: Minimum confidence score for merging (default: 0.65)
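The split-table parameters above feed a merge decision. The following is an illustrative scoring sketch, not the library's exact algorithm: two table segments merge only when the inter-page gap and the worst column misalignment are within tolerance and a combined confidence clears the threshold.

```python
from typing import List

def should_merge(
    gap_px: float,
    page_height_px: float,
    col_offsets_px: List[float],
    *,
    max_gap_ratio: float = 0.25,
    column_alignment_tolerance: float = 10.0,
    min_merge_confidence: float = 0.65,
) -> bool:
    """Toy merge rule: reject out-of-tolerance candidates outright,
    then score the survivors on gap size and column alignment."""
    if gap_px / page_height_px > max_gap_ratio:
        return False
    if any(abs(off) > column_alignment_tolerance for off in col_offsets_px):
        return False
    # Confidence decays with gap size and with the worst column offset.
    gap_score = 1.0 - (gap_px / page_height_px) / max_gap_ratio
    align_score = 1.0 - max(abs(o) for o in col_offsets_px) / column_alignment_tolerance
    confidence = 0.5 * gap_score + 0.5 * align_score
    return confidence >= min_merge_confidence
```

A 50 px gap on a 1000 px page with columns offset by at most 3 px passes; a gap above 25% of the page height or a column offset beyond 10 px is rejected.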

Source code in doctra/parsers/table_chart_extractor.py
class ChartTablePDFParser:
    """
    Specialized PDF parser for extracting charts and tables.

    Focuses specifically on chart and table extraction from PDF documents,
    with optional VLM (Vision Language Model) processing to convert visual
    elements into structured data.

    :param extract_charts: Whether to extract charts from the document (default: True)
    :param extract_tables: Whether to extract tables from the document (default: True)
    :param vlm: VLM engine instance (VLMStructuredExtractor). If None, VLM processing is disabled.
    :param layout_model_name: Layout detection model name (default: "PP-DocLayout_plus-L")
    :param dpi: DPI for PDF rendering (default: 200)
    :param min_score: Minimum confidence score for layout detection (default: 0.0)
    :param merge_split_tables: Whether to detect and merge split tables (default: False)
    :param bottom_threshold_ratio: Ratio for "too close to bottom" detection (default: 0.20)
    :param top_threshold_ratio: Ratio for "too close to top" detection (default: 0.15)
    :param max_gap_ratio: Maximum allowed gap between tables (default: 0.25, accounts for headers/footers)
    :param column_alignment_tolerance: Pixel tolerance for column alignment (default: 10.0)
    :param min_merge_confidence: Minimum confidence score for merging (default: 0.65)
    """

    def __init__(
            self,
            *,
            extract_charts: bool = True,
            extract_tables: bool = True,
            vlm: Optional[VLMStructuredExtractor] = None,
            layout_model_name: str = "PP-DocLayout_plus-L",
            dpi: int = 200,
            min_score: float = 0.0,
            merge_split_tables: bool = False,
            bottom_threshold_ratio: float = 0.20,
            top_threshold_ratio: float = 0.15,
            max_gap_ratio: float = 0.25,
            column_alignment_tolerance: float = 10.0,
            min_merge_confidence: float = 0.65,
    ):
        """
        Initialize the ChartTablePDFParser with extraction configuration.

        :param extract_charts: Whether to extract charts from the document (default: True)
        :param extract_tables: Whether to extract tables from the document (default: True)
        :param vlm: VLM engine instance (VLMStructuredExtractor). If None, VLM processing is disabled.
        :param layout_model_name: Layout detection model name (default: "PP-DocLayout_plus-L")
        :param dpi: DPI for PDF rendering (default: 200)
        :param min_score: Minimum confidence score for layout detection (default: 0.0)
        :param merge_split_tables: Whether to detect and merge split tables (default: False)
        :param bottom_threshold_ratio: Ratio for "too close to bottom" detection (default: 0.20)
        :param top_threshold_ratio: Ratio for "too close to top" detection (default: 0.15)
        :param max_gap_ratio: Maximum allowed gap between tables (default: 0.25, accounts for headers/footers)
        :param column_alignment_tolerance: Pixel tolerance for column alignment (default: 10.0)
        :param min_merge_confidence: Minimum confidence score for merging (default: 0.65)
        """
        if not extract_charts and not extract_tables:
            raise ValueError("At least one of extract_charts or extract_tables must be True")

        self.extract_charts = extract_charts
        self.extract_tables = extract_tables
        self.layout_engine = PaddleLayoutEngine(model_name=layout_model_name)
        self.dpi = dpi
        self.min_score = min_score

        # Initialize VLM engine - use provided instance or None
        if vlm is None:
            self.vlm = None
        elif isinstance(vlm, VLMStructuredExtractor):
            self.vlm = vlm
        else:
            raise TypeError(
                f"vlm must be an instance of VLMStructuredExtractor or None, "
                f"got {type(vlm).__name__}"
            )

        # Initialize split table detector if enabled
        self.merge_split_tables = merge_split_tables
        if self.merge_split_tables and self.extract_tables:
            self.split_table_detector = SplitTableDetector(
                bottom_threshold_ratio=bottom_threshold_ratio,
                top_threshold_ratio=top_threshold_ratio,
                max_gap_ratio=max_gap_ratio,
                column_alignment_tolerance=column_alignment_tolerance,
                min_merge_confidence=min_merge_confidence,
            )
        else:
            self.split_table_detector = None

    def parse(self, pdf_path: str, output_base_dir: str = "outputs") -> None:
        """
        Parse a PDF document and extract charts and/or tables.

        :param pdf_path: Path to the input PDF file
        :param output_base_dir: Base directory for output files (default: "outputs")
        :return: None
        """
        pdf_name = Path(pdf_path).stem
        out_dir = os.path.join(output_base_dir, pdf_name, "structured_parsing")
        os.makedirs(out_dir, exist_ok=True)

        charts_dir = None
        tables_dir = None

        if self.extract_charts:
            charts_dir = os.path.join(out_dir, "charts")
            os.makedirs(charts_dir, exist_ok=True)

        if self.extract_tables:
            tables_dir = os.path.join(out_dir, "tables")
            os.makedirs(tables_dir, exist_ok=True)

        pages: List[LayoutPage] = self.layout_engine.predict_pdf(
            pdf_path, batch_size=1, layout_nms=True, dpi=self.dpi, min_score=self.min_score
        )
        pil_pages = [im for (im, _, _) in render_pdf_to_images(pdf_path, dpi=self.dpi)]

        # Detect split tables if enabled
        split_table_matches: List[SplitTableMatch] = []
        merged_table_segments = []

        if self.merge_split_tables and self.extract_tables:
            if self.split_table_detector:
                try:
                    split_table_matches = self.split_table_detector.detect_split_tables(pages, pil_pages)
                    if split_table_matches:
                        print(f"πŸ”— Detected {len(split_table_matches)} split table(s) to merge")
                    for match in split_table_matches:
                        merged_table_segments.append(match.segment1)
                        merged_table_segments.append(match.segment2)
                except Exception as e:
                    import traceback
                    traceback.print_exc()
                    split_table_matches = []

        target_labels = []
        if self.extract_charts:
            target_labels.append("chart")
        if self.extract_tables:
            target_labels.append("table")

        chart_count = sum(sum(1 for b in p.boxes if b.label == "chart") for p in pages) if self.extract_charts else 0
        table_count = sum(sum(1 for b in p.boxes if b.label == "table") for p in pages) if self.extract_tables else 0

        if self.vlm is not None:
            md_lines: List[str] = ["# Extracted Charts and Tables\n"]
            structured_items: List[Dict[str, Any]] = []
            vlm_items: List[Dict[str, Any]] = []

        charts_desc = "Charts (VLM β†’ table)" if self.vlm is not None else "Charts (cropped)"
        tables_desc = "Tables (VLM β†’ table)" if self.vlm is not None else "Tables (cropped)"

        chart_counter = 1
        table_counter = 1

        with ExitStack() as stack:
            is_notebook = "ipykernel" in sys.modules or "jupyter" in sys.modules
            is_terminal = hasattr(sys.stdout, 'isatty') and sys.stdout.isatty()

            if is_notebook:
                charts_bar = stack.enter_context(
                    create_notebook_friendly_bar(total=chart_count, desc=charts_desc)) if chart_count else None
                tables_bar = stack.enter_context(
                    create_notebook_friendly_bar(total=table_count, desc=tables_desc)) if table_count else None
            else:
                charts_bar = stack.enter_context(
                    create_beautiful_progress_bar(total=chart_count, desc=charts_desc, leave=True)) if chart_count else None
                tables_bar = stack.enter_context(
                    create_beautiful_progress_bar(total=table_count, desc=tables_desc, leave=True)) if table_count else None

            for p in pages:
                page_num = p.page_index
                page_img: Image.Image = pil_pages[page_num - 1]

                target_items = [box for box in p.boxes if box.label in target_labels]

                if target_items and self.vlm is not None:
                    md_lines.append(f"\n## Page {page_num}\n")

                for box in sorted(target_items, key=reading_order_key):
                    if box.label == "chart" and self.extract_charts:
                        chart_filename = f"chart_{chart_counter:03d}.png"
                        chart_path = os.path.join(charts_dir, chart_filename)

                        cropped_img = page_img.crop((box.x1, box.y1, box.x2, box.y2))
                        cropped_img.save(chart_path)

                        if self.vlm is not None:
                            rel_path = os.path.join("charts", chart_filename)
                            wrote_table = False

                            try:
                                extracted_chart = self.vlm.extract_chart(chart_path)
                                structured_item = to_structured_dict(extracted_chart)
                                if structured_item:
                                    structured_item["page"] = page_num
                                    structured_item["type"] = "Chart"
                                    structured_items.append(structured_item)
                                    vlm_items.append({
                                        "kind": "chart",
                                        "page": page_num,
                                        "image_rel_path": rel_path,
                                        "title": structured_item.get("title"),
                                        "headers": structured_item.get("headers"),
                                        "rows": structured_item.get("rows"),
                                    })
                                    md_lines.append(
                                        render_markdown_table(
                                            structured_item.get("headers"),
                                            structured_item.get("rows"),
                                            title=structured_item.get(
                                                "title") or f"Chart {chart_counter} β€” page {page_num}"
                                        )
                                    )
                                    wrote_table = True
                            except Exception:
                                pass

                            if not wrote_table:
                                md_lines.append(f"![Chart {chart_counter} β€” page {page_num}]({rel_path})\n")

                        chart_counter += 1
                        if charts_bar:
                            charts_bar.update(1)

                    elif box.label == "table" and self.extract_tables:
                        # Skip table segments that are part of merged tables
                        is_merged = any(seg.match_box(box, page_num) for seg in merged_table_segments)
                        if is_merged:
                            continue

                        table_filename = f"table_{table_counter:03d}.png"
                        table_path = os.path.join(tables_dir, table_filename)

                        cropped_img = page_img.crop((box.x1, box.y1, box.x2, box.y2))
                        cropped_img.save(table_path)

                        if self.vlm is not None:
                            rel_path = os.path.join("tables", table_filename)
                            wrote_table = False

                            try:
                                extracted_table = self.vlm.extract_table(table_path)
                                structured_item = to_structured_dict(extracted_table)
                                if structured_item:
                                    structured_item["page"] = page_num
                                    structured_item["type"] = "Table"
                                    structured_items.append(structured_item)
                                    vlm_items.append({
                                        "kind": "table",
                                        "page": page_num,
                                        "image_rel_path": rel_path,
                                        "title": structured_item.get("title"),
                                        "headers": structured_item.get("headers"),
                                        "rows": structured_item.get("rows"),
                                    })
                                    md_lines.append(
                                        render_markdown_table(
                                            structured_item.get("headers"),
                                            structured_item.get("rows"),
                                            title=structured_item.get(
                                                "title") or f"Table {table_counter} β€” page {page_num}"
                                        )
                                    )
                                    wrote_table = True
                            except Exception:
                                pass

                            if not wrote_table:
                                md_lines.append(f"![Table {table_counter} β€” page {page_num}]({rel_path})\n")

                        table_counter += 1
                        if tables_bar:
                            tables_bar.update(1)

        # Process merged tables if any were detected
        if split_table_matches and self.split_table_detector and self.extract_tables:
            for match_idx, match in enumerate(split_table_matches):
                try:
                    merged_img = self.split_table_detector.merge_table_images(match)

                    merged_filename = f"merged_table_{match.segment1.page_index}_{match.segment2.page_index}.png"
                    merged_path = os.path.join(tables_dir, merged_filename)
                    merged_img.save(merged_path)

                    abs_merged_path = os.path.abspath(merged_path)
                    rel_merged = os.path.relpath(abs_merged_path, out_dir)

                    pages_str = f"pages {match.segment1.page_index}-{match.segment2.page_index}"

                    if self.vlm is not None:
                        wrote_table = False
                        try:
                            extracted_table = self.vlm.extract_table(abs_merged_path)
                            structured_item = to_structured_dict(extracted_table)
                            if structured_item:
                                structured_item["page"] = f"{match.segment1.page_index}-{match.segment2.page_index}"
                                structured_item["type"] = "Table (Merged)"
                                structured_item["split_merge"] = True
                                structured_item["merge_confidence"] = match.confidence
                                structured_items.append(structured_item)

                                vlm_items.append({
                                    "kind": "table",
                                    "page": pages_str,
                                    "image_rel_path": rel_merged,
                                    "title": structured_item.get("title"),
                                    "headers": structured_item.get("headers"),
                                    "rows": structured_item.get("rows"),
                                    "split_merge": True,
                                    "merge_confidence": match.confidence,
                                })

                                md_lines.append(f"\n### Merged Table ({pages_str})\n")
                                md_lines.append(
                                    render_markdown_table(
                                        structured_item.get("headers"),
                                        structured_item.get("rows"),
                                        title=structured_item.get("title") or f"Merged Table ({pages_str})"
                                    )
                                )
                                wrote_table = True
                        except Exception as e:
                            pass

                        if not wrote_table:
                            md_lines.append(f"\n### Merged Table ({pages_str})\n")
                            md_lines.append(f"![Merged Table ({pages_str})]({rel_merged})\n")
                except Exception as e:
                    import traceback
                    traceback.print_exc()

        excel_path = None

        if self.vlm is not None:

            if structured_items:
                if self.extract_charts and self.extract_tables:
                    excel_filename = "parsed_tables_charts.xlsx"
                elif self.extract_charts:
                    excel_filename = "parsed_charts.xlsx"
                elif self.extract_tables:
                    excel_filename = "parsed_tables.xlsx"
                else:
                    excel_filename = "parsed_data.xlsx"  # fallback


                excel_path = os.path.join(out_dir, excel_filename)
                write_structured_excel(excel_path, structured_items)

                html_filename = excel_filename.replace('.xlsx', '.html')
                html_path = os.path.join(out_dir, html_filename)
                write_structured_html(html_path, structured_items)

            if 'vlm_items' in locals() and vlm_items:
                with open(os.path.join(out_dir, "vlm_items.json"), 'w', encoding='utf-8') as jf:
                    json.dump(vlm_items, jf, ensure_ascii=False, indent=2)

        extraction_types = []
        if self.extract_charts:
            extraction_types.append("charts")
        if self.extract_tables:
            extraction_types.append("tables")

        print(f"βœ… Parsing completed successfully!")
        print(f"πŸ“ Output directory: {out_dir}")

__init__(*, extract_charts=True, extract_tables=True, vlm=None, layout_model_name='PP-DocLayout_plus-L', dpi=200, min_score=0.0, merge_split_tables=False, bottom_threshold_ratio=0.2, top_threshold_ratio=0.15, max_gap_ratio=0.25, column_alignment_tolerance=10.0, min_merge_confidence=0.65)

Initialize the ChartTablePDFParser with extraction configuration.

:param extract_charts: Whether to extract charts from the document (default: True)
:param extract_tables: Whether to extract tables from the document (default: True)
:param vlm: VLM engine instance (VLMStructuredExtractor). If None, VLM processing is disabled.
:param layout_model_name: Layout detection model name (default: "PP-DocLayout_plus-L")
:param dpi: DPI for PDF rendering (default: 200)
:param min_score: Minimum confidence score for layout detection (default: 0.0)
:param merge_split_tables: Whether to detect and merge split tables (default: False)
:param bottom_threshold_ratio: Ratio for "too close to bottom" detection (default: 0.20)
:param top_threshold_ratio: Ratio for "too close to top" detection (default: 0.15)
:param max_gap_ratio: Maximum allowed gap between tables (default: 0.25, accounts for headers/footers)
:param column_alignment_tolerance: Pixel tolerance for column alignment (default: 10.0)
:param min_merge_confidence: Minimum confidence score for merging (default: 0.65)

Source code in doctra/parsers/table_chart_extractor.py
def __init__(
        self,
        *,
        extract_charts: bool = True,
        extract_tables: bool = True,
        vlm: Optional[VLMStructuredExtractor] = None,
        layout_model_name: str = "PP-DocLayout_plus-L",
        dpi: int = 200,
        min_score: float = 0.0,
        merge_split_tables: bool = False,
        bottom_threshold_ratio: float = 0.20,
        top_threshold_ratio: float = 0.15,
        max_gap_ratio: float = 0.25,
        column_alignment_tolerance: float = 10.0,
        min_merge_confidence: float = 0.65,
):
    """
    Initialize the ChartTablePDFParser with extraction configuration.

    :param extract_charts: Whether to extract charts from the document (default: True)
    :param extract_tables: Whether to extract tables from the document (default: True)
    :param vlm: VLM engine instance (VLMStructuredExtractor). If None, VLM processing is disabled.
    :param layout_model_name: Layout detection model name (default: "PP-DocLayout_plus-L")
    :param dpi: DPI for PDF rendering (default: 200)
    :param min_score: Minimum confidence score for layout detection (default: 0.0)
    :param merge_split_tables: Whether to detect and merge split tables (default: False)
    :param bottom_threshold_ratio: Ratio for "too close to bottom" detection (default: 0.20)
    :param top_threshold_ratio: Ratio for "too close to top" detection (default: 0.15)
    :param max_gap_ratio: Maximum allowed gap between tables (default: 0.25, accounts for headers/footers)
    :param column_alignment_tolerance: Pixel tolerance for column alignment (default: 10.0)
    :param min_merge_confidence: Minimum confidence score for merging (default: 0.65)
    """
    if not extract_charts and not extract_tables:
        raise ValueError("At least one of extract_charts or extract_tables must be True")

    self.extract_charts = extract_charts
    self.extract_tables = extract_tables
    self.layout_engine = PaddleLayoutEngine(model_name=layout_model_name)
    self.dpi = dpi
    self.min_score = min_score

    # Initialize VLM engine - use provided instance or None
    if vlm is None:
        self.vlm = None
    elif isinstance(vlm, VLMStructuredExtractor):
        self.vlm = vlm
    else:
        raise TypeError(
            f"vlm must be an instance of VLMStructuredExtractor or None, "
            f"got {type(vlm).__name__}"
        )

    # Initialize split table detector if enabled
    self.merge_split_tables = merge_split_tables
    if self.merge_split_tables and self.extract_tables:
        self.split_table_detector = SplitTableDetector(
            bottom_threshold_ratio=bottom_threshold_ratio,
            top_threshold_ratio=top_threshold_ratio,
            max_gap_ratio=max_gap_ratio,
            column_alignment_tolerance=column_alignment_tolerance,
            min_merge_confidence=min_merge_confidence,
        )
    else:
        self.split_table_detector = None
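The split-table thresholds accepted by the constructor can be tuned per document. A hedged sketch (parameter names mirror the signature above; assumes doctra is installed, and the function is not called on import):

```python
def make_merge_tuned_parser():
    # Hypothetical tuning sketch; parameter names mirror the __init__ signature above.
    from doctra.parsers.table_chart_extractor import ChartTablePDFParser
    return ChartTablePDFParser(
        extract_tables=True,
        extract_charts=False,          # tables only
        merge_split_tables=True,
        bottom_threshold_ratio=0.20,   # segment must end near the page bottom
        top_threshold_ratio=0.15,      # continuation must start near the page top
        max_gap_ratio=0.25,            # tolerate headers/footers between segments
        column_alignment_tolerance=10.0,
        min_merge_confidence=0.65,     # below this, segments stay separate tables
    )
```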

parse(pdf_path, output_base_dir='outputs')

Parse a PDF document and extract charts and/or tables.

:param pdf_path: Path to the input PDF file
:param output_base_dir: Base directory for output files (default: "outputs")
:return: None

Source code in doctra/parsers/table_chart_extractor.py
def parse(self, pdf_path: str, output_base_dir: str = "outputs") -> None:
    """
    Parse a PDF document and extract charts and/or tables.

    :param pdf_path: Path to the input PDF file
    :param output_base_dir: Base directory for output files (default: "outputs")
    :return: None
    """
    pdf_name = Path(pdf_path).stem
    out_dir = os.path.join(output_base_dir, pdf_name, "structured_parsing")
    os.makedirs(out_dir, exist_ok=True)

    charts_dir = None
    tables_dir = None

    if self.extract_charts:
        charts_dir = os.path.join(out_dir, "charts")
        os.makedirs(charts_dir, exist_ok=True)

    if self.extract_tables:
        tables_dir = os.path.join(out_dir, "tables")
        os.makedirs(tables_dir, exist_ok=True)

    pages: List[LayoutPage] = self.layout_engine.predict_pdf(
        pdf_path, batch_size=1, layout_nms=True, dpi=self.dpi, min_score=self.min_score
    )
    pil_pages = [im for (im, _, _) in render_pdf_to_images(pdf_path, dpi=self.dpi)]

    # Detect split tables if enabled
    split_table_matches: List[SplitTableMatch] = []
    merged_table_segments = []

    if self.merge_split_tables and self.extract_tables:
        if self.split_table_detector:
            try:
                split_table_matches = self.split_table_detector.detect_split_tables(pages, pil_pages)
                if split_table_matches:
                    print(f"πŸ”— Detected {len(split_table_matches)} split table(s) to merge")
                for match in split_table_matches:
                    merged_table_segments.append(match.segment1)
                    merged_table_segments.append(match.segment2)
            except Exception as e:
                import traceback
                traceback.print_exc()
                split_table_matches = []

    target_labels = []
    if self.extract_charts:
        target_labels.append("chart")
    if self.extract_tables:
        target_labels.append("table")

    chart_count = sum(sum(1 for b in p.boxes if b.label == "chart") for p in pages) if self.extract_charts else 0
    table_count = sum(sum(1 for b in p.boxes if b.label == "table") for p in pages) if self.extract_tables else 0

    if self.vlm is not None:
        md_lines: List[str] = ["# Extracted Charts and Tables\n"]
        structured_items: List[Dict[str, Any]] = []
        vlm_items: List[Dict[str, Any]] = []

    charts_desc = "Charts (VLM β†’ table)" if self.vlm is not None else "Charts (cropped)"
    tables_desc = "Tables (VLM β†’ table)" if self.vlm is not None else "Tables (cropped)"

    chart_counter = 1
    table_counter = 1

    with ExitStack() as stack:
        is_notebook = "ipykernel" in sys.modules or "jupyter" in sys.modules
        is_terminal = hasattr(sys.stdout, 'isatty') and sys.stdout.isatty()

        if is_notebook:
            charts_bar = stack.enter_context(
                create_notebook_friendly_bar(total=chart_count, desc=charts_desc)) if chart_count else None
            tables_bar = stack.enter_context(
                create_notebook_friendly_bar(total=table_count, desc=tables_desc)) if table_count else None
        else:
            charts_bar = stack.enter_context(
                create_beautiful_progress_bar(total=chart_count, desc=charts_desc, leave=True)) if chart_count else None
            tables_bar = stack.enter_context(
                create_beautiful_progress_bar(total=table_count, desc=tables_desc, leave=True)) if table_count else None

        for p in pages:
            page_num = p.page_index
            page_img: Image.Image = pil_pages[page_num - 1]

            target_items = [box for box in p.boxes if box.label in target_labels]

            if target_items and self.vlm is not None:
                md_lines.append(f"\n## Page {page_num}\n")

            for box in sorted(target_items, key=reading_order_key):
                if box.label == "chart" and self.extract_charts:
                    chart_filename = f"chart_{chart_counter:03d}.png"
                    chart_path = os.path.join(charts_dir, chart_filename)

                    cropped_img = page_img.crop((box.x1, box.y1, box.x2, box.y2))
                    cropped_img.save(chart_path)

                    if self.vlm is not None:
                        rel_path = os.path.join("charts", chart_filename)
                        wrote_table = False

                        try:
                            extracted_chart = self.vlm.extract_chart(chart_path)
                            structured_item = to_structured_dict(extracted_chart)
                            if structured_item:
                                structured_item["page"] = page_num
                                structured_item["type"] = "Chart"
                                structured_items.append(structured_item)
                                vlm_items.append({
                                    "kind": "chart",
                                    "page": page_num,
                                    "image_rel_path": rel_path,
                                    "title": structured_item.get("title"),
                                    "headers": structured_item.get("headers"),
                                    "rows": structured_item.get("rows"),
                                })
                                md_lines.append(
                                    render_markdown_table(
                                        structured_item.get("headers"),
                                        structured_item.get("rows"),
                                        title=structured_item.get(
                                            "title") or f"Chart {chart_counter} β€” page {page_num}"
                                    )
                                )
                                wrote_table = True
                        except Exception:
                            pass

                        if not wrote_table:
                            md_lines.append(f"![Chart {chart_counter} β€” page {page_num}]({rel_path})\n")

                    chart_counter += 1
                    if charts_bar:
                        charts_bar.update(1)

                elif box.label == "table" and self.extract_tables:
                    # Skip table segments that are part of merged tables
                    is_merged = any(seg.match_box(box, page_num) for seg in merged_table_segments)
                    if is_merged:
                        continue

                    table_filename = f"table_{table_counter:03d}.png"
                    table_path = os.path.join(tables_dir, table_filename)

                    cropped_img = page_img.crop((box.x1, box.y1, box.x2, box.y2))
                    cropped_img.save(table_path)

                    if self.vlm is not None:
                        rel_path = os.path.join("tables", table_filename)
                        wrote_table = False

                        try:
                            extracted_table = self.vlm.extract_table(table_path)
                            structured_item = to_structured_dict(extracted_table)
                            if structured_item:
                                structured_item["page"] = page_num
                                structured_item["type"] = "Table"
                                structured_items.append(structured_item)
                                vlm_items.append({
                                    "kind": "table",
                                    "page": page_num,
                                    "image_rel_path": rel_path,
                                    "title": structured_item.get("title"),
                                    "headers": structured_item.get("headers"),
                                    "rows": structured_item.get("rows"),
                                })
                                md_lines.append(
                                    render_markdown_table(
                                        structured_item.get("headers"),
                                        structured_item.get("rows"),
                                        title=structured_item.get(
                                            "title") or f"Table {table_counter} β€” page {page_num}"
                                    )
                                )
                                wrote_table = True
                        except Exception:
                            pass

                        if not wrote_table:
                            md_lines.append(f"![Table {table_counter} β€” page {page_num}]({rel_path})\n")

                    table_counter += 1
                    if tables_bar:
                        tables_bar.update(1)

    # Process merged tables if any were detected
    if split_table_matches and self.split_table_detector and self.extract_tables:
        for match_idx, match in enumerate(split_table_matches):
            try:
                merged_img = self.split_table_detector.merge_table_images(match)

                merged_filename = f"merged_table_{match.segment1.page_index}_{match.segment2.page_index}.png"
                merged_path = os.path.join(tables_dir, merged_filename)
                merged_img.save(merged_path)

                abs_merged_path = os.path.abspath(merged_path)
                rel_merged = os.path.relpath(abs_merged_path, out_dir)

                pages_str = f"pages {match.segment1.page_index}-{match.segment2.page_index}"

                if self.vlm is not None:
                    wrote_table = False
                    try:
                        extracted_table = self.vlm.extract_table(abs_merged_path)
                        structured_item = to_structured_dict(extracted_table)
                        if structured_item:
                            structured_item["page"] = f"{match.segment1.page_index}-{match.segment2.page_index}"
                            structured_item["type"] = "Table (Merged)"
                            structured_item["split_merge"] = True
                            structured_item["merge_confidence"] = match.confidence
                            structured_items.append(structured_item)

                            vlm_items.append({
                                "kind": "table",
                                "page": pages_str,
                                "image_rel_path": rel_merged,
                                "title": structured_item.get("title"),
                                "headers": structured_item.get("headers"),
                                "rows": structured_item.get("rows"),
                                "split_merge": True,
                                "merge_confidence": match.confidence,
                            })

                            md_lines.append(f"\n### Merged Table ({pages_str})\n")
                            md_lines.append(
                                render_markdown_table(
                                    structured_item.get("headers"),
                                    structured_item.get("rows"),
                                    title=structured_item.get("title") or f"Merged Table ({pages_str})"
                                )
                            )
                            wrote_table = True
                    except Exception as e:
                        pass

                    if not wrote_table:
                        md_lines.append(f"\n### Merged Table ({pages_str})\n")
                        md_lines.append(f"![Merged Table ({pages_str})]({rel_merged})\n")
            except Exception as e:
                import traceback
                traceback.print_exc()

    excel_path = None

    if self.vlm is not None:

        if structured_items:
            if self.extract_charts and self.extract_tables:
                excel_filename = "parsed_tables_charts.xlsx"
            elif self.extract_charts:
                excel_filename = "parsed_charts.xlsx"
            elif self.extract_tables:
                excel_filename = "parsed_tables.xlsx"
            else:
                excel_filename = "parsed_data.xlsx"  # fallback


            excel_path = os.path.join(out_dir, excel_filename)
            write_structured_excel(excel_path, structured_items)

            html_filename = excel_filename.replace('.xlsx', '.html')
            html_path = os.path.join(out_dir, html_filename)
            write_structured_html(html_path, structured_items)

        if 'vlm_items' in locals() and vlm_items:
            with open(os.path.join(out_dir, "vlm_items.json"), 'w', encoding='utf-8') as jf:
                json.dump(vlm_items, jf, ensure_ascii=False, indent=2)

    extraction_types = []
    if self.extract_charts:
        extraction_types.append("charts")
    if self.extract_tables:
        extraction_types.append("tables")

    print(f"βœ… Parsing completed successfully!")
    print(f"πŸ“ Output directory: {out_dir}")

PaddleOCRVLPDFParser

End-to-end document parser built on the PaddleOCRVL vision-language model.

doctra.parsers.paddleocr_vl_parser.PaddleOCRVLPDFParser

PDF Parser using PaddleOCRVL for end-to-end document parsing.

Combines PaddleOCRVL's vision-language model capabilities with:

- DocRes image restoration for enhanced document quality
- Split table detection and merging across pages

:param use_image_restoration: Whether to apply DocRes image restoration (default: True)
:param restoration_task: DocRes task to use (default: "appearance")
:param restoration_device: Device for DocRes processing (default: None for auto-detect)
:param restoration_dpi: DPI for restoration processing (default: 200)
:param use_chart_recognition: Enable chart recognition in PaddleOCRVL (default: True)
:param use_doc_orientation_classify: Enable document orientation classification (default: False)
:param use_doc_unwarping: Enable document unwarping (default: False)
:param use_layout_detection: Enable layout detection (default: True)
:param device: Device for PaddleOCRVL processing ("gpu" or "cpu", default: "gpu")
:param merge_split_tables: Whether to detect and merge split tables (default: True)
:param bottom_threshold_ratio: Ratio for "too close to bottom" detection (default: 0.20)
:param top_threshold_ratio: Ratio for "too close to top" detection (default: 0.15)
:param max_gap_ratio: Maximum allowed gap between tables (default: 0.25)
:param column_alignment_tolerance: Pixel tolerance for column alignment (default: 10.0)
:param min_merge_confidence: Minimum confidence score for merging (default: 0.65)

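When BeautifulSoup is not installed, the parser converts PaddleOCRVL's HTML tables to Markdown with a regex-based fallback (see `_simple_html_to_markdown` in the source below). A self-contained sketch of that fallback, assuming well-formed `<tr>`/`<td>`/`<th>` markup:

```python
import re

def html_table_to_markdown(html_table: str) -> str:
    """Regex-based HTML table to Markdown conversion (no BeautifulSoup)."""
    rows = []
    for row_match in re.finditer(r'<tr[^>]*>(.*?)</tr>', html_table, re.DOTALL):
        # Strip any nested tags from each cell, then trim whitespace
        cells = [
            re.sub(r'<[^>]+>', '', cell.group(1)).strip()
            for cell in re.finditer(r'<t[dh][^>]*>(.*?)</t[dh]>',
                                    row_match.group(1), re.DOTALL)
        ]
        if cells:
            rows.append(cells)
    if not rows:
        return html_table  # nothing recognizable; return input unchanged
    lines = ['| ' + ' | '.join(rows[0]) + ' |',
             '| ' + ' | '.join(['---'] * len(rows[0])) + ' |']
    for row in rows[1:]:
        row += [''] * (len(rows[0]) - len(row))  # pad short rows to header width
        lines.append('| ' + ' | '.join(row) + ' |')
    return '\n'.join(lines)
```

Note that a regex pass like this cannot handle `rowspan`/`colspan` or nested tables, which is why the BeautifulSoup path is preferred when available.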
Source code in doctra/parsers/paddleocr_vl_parser.py
class PaddleOCRVLPDFParser:
    """
    PDF Parser using PaddleOCRVL for end-to-end document parsing.

    Combines PaddleOCRVL's vision-language model capabilities with:
    - DocRes image restoration for enhanced document quality
    - Split table detection and merging across pages

    :param use_image_restoration: Whether to apply DocRes image restoration (default: True)
    :param restoration_task: DocRes task to use (default: "appearance")
    :param restoration_device: Device for DocRes processing (default: None for auto-detect)
    :param restoration_dpi: DPI for restoration processing (default: 200)
    :param use_chart_recognition: Enable chart recognition in PaddleOCRVL (default: True)
    :param use_doc_orientation_classify: Enable document orientation classification (default: False)
    :param use_doc_unwarping: Enable document unwarping (default: False)
    :param use_layout_detection: Enable layout detection (default: True)
    :param device: Device for PaddleOCRVL processing ("gpu" or "cpu", default: "gpu")
    :param merge_split_tables: Whether to detect and merge split tables (default: True)
    :param bottom_threshold_ratio: Ratio for "too close to bottom" detection (default: 0.20)
    :param top_threshold_ratio: Ratio for "too close to top" detection (default: 0.15)
    :param max_gap_ratio: Maximum allowed gap between tables (default: 0.25)
    :param column_alignment_tolerance: Pixel tolerance for column alignment (default: 10.0)
    :param min_merge_confidence: Minimum confidence score for merging (default: 0.65)
    """

    def __init__(
        self,
        *,
        use_image_restoration: bool = True,
        restoration_task: str = "appearance",
        restoration_device: Optional[str] = None,
        restoration_dpi: int = 200,
        use_chart_recognition: bool = True,
        use_doc_orientation_classify: bool = False,
        use_doc_unwarping: bool = False,
        use_layout_detection: bool = True,
        device: str = "gpu",
        merge_split_tables: bool = True,
        bottom_threshold_ratio: float = 0.20,
        top_threshold_ratio: float = 0.15,
        max_gap_ratio: float = 0.25,
        column_alignment_tolerance: float = 10.0,
        min_merge_confidence: float = 0.65,
    ):
        """
        Initialize the PaddleOCRVL PDF Parser.
        """
        if not PADDLEOCR_VL_AVAILABLE:
            raise ImportError(
                "PaddleOCRVL is not available. Please install paddleocr:\n"
                "pip install paddleocr>=2.6.0"
            )

        try:
            with silence():
                with warnings.catch_warnings():
                    warnings.simplefilter("ignore")
                    self.paddleocr_vl = PaddleOCRVL(
                        use_doc_orientation_classify=use_doc_orientation_classify,
                        use_doc_unwarping=use_doc_unwarping,
                        use_layout_detection=use_layout_detection,
                    )
            print("βœ… PaddleOCRVL pipeline initialized")
        except Exception as e:
            raise RuntimeError(f"Failed to initialize PaddleOCRVL: {e}")

        self.use_chart_recognition = use_chart_recognition
        self.device = device

        self.use_image_restoration = use_image_restoration
        self.restoration_task = restoration_task
        self.restoration_device = restoration_device
        self.restoration_dpi = restoration_dpi

        self.docres_engine = None
        if self.use_image_restoration:
            try:
                self.docres_engine = DocResEngine(
                    device=restoration_device,
                    use_half_precision=True
                )
                print(f"βœ… DocRes engine initialized with task: {restoration_task}")
            except Exception as e:
                print(f"⚠️ DocRes initialization failed: {e}")
                print("   Continuing without image restoration...")
                self.use_image_restoration = False
                self.docres_engine = None

        self.merge_split_tables = merge_split_tables
        if self.merge_split_tables:
            self.split_table_detector = SplitTableDetector(
                bottom_threshold_ratio=bottom_threshold_ratio,
                top_threshold_ratio=top_threshold_ratio,
                max_gap_ratio=max_gap_ratio,
                column_alignment_tolerance=column_alignment_tolerance,
                min_merge_confidence=min_merge_confidence,
            )
        else:
            self.split_table_detector = None

    def parse(self, pdf_path: str, output_dir: Optional[str] = None) -> None:
        """
        Parse a PDF document using PaddleOCRVL.

        :param pdf_path: Path to the input PDF file
        :param output_dir: Output directory (if None, uses default)
        :return: None
        """
        pdf_filename = os.path.splitext(os.path.basename(pdf_path))[0]

        if output_dir is None:
            out_dir = f"outputs/{pdf_filename}/paddleocr_vl_parse"
        else:
            out_dir = output_dir

        os.makedirs(out_dir, exist_ok=True)
        ensure_output_dirs(out_dir, IMAGE_SUBDIRS)

        print(f"πŸ”„ Processing PDF: {os.path.basename(pdf_path)}")

        if self.use_image_restoration and self.docres_engine:
            print("πŸ”„ Applying DocRes image restoration...")
            enhanced_pages = self._process_pages_with_restoration(pdf_path, out_dir)
        else:
            enhanced_pages = [im for (im, _, _) in render_pdf_to_images(pdf_path, dpi=self.restoration_dpi)]

        if not enhanced_pages:
            print("❌ No pages found in PDF")
            return

        print("πŸ” Processing pages with PaddleOCRVL...")
        all_results = []

        is_notebook = "ipykernel" in sys.modules or "jupyter" in sys.modules
        if is_notebook:
            progress_bar = create_notebook_friendly_bar(
                total=len(enhanced_pages),
                desc="PaddleOCRVL processing"
            )
        else:
            progress_bar = create_beautiful_progress_bar(
                total=len(enhanced_pages),
                desc="PaddleOCRVL processing",
                leave=True
            )

        with progress_bar:
            for page_idx, page_img in enumerate(enhanced_pages):
                try:
                    with tempfile.NamedTemporaryFile(suffix='.jpg', delete=False) as tmp_file:
                        tmp_path = tmp_file.name
                        page_img.save(tmp_path, "JPEG", quality=95)

                    try:
                        with warnings.catch_warnings():
                            warnings.simplefilter("ignore")
                            with open(os.devnull, "w") as devnull:
                                with contextlib.redirect_stderr(devnull):
                                    output = self.paddleocr_vl.predict(
                                        input=tmp_path,
                                        device=self.device,
                                        use_chart_recognition=self.use_chart_recognition
                                    )

                        if output and len(output) > 0:
                            result = output[0]
                            result['page_index'] = page_idx + 1
                            all_results.append(result)

                        progress_bar.set_description(f"βœ… Page {page_idx + 1}/{len(enhanced_pages)} processed")
                    finally:
                        try:
                            os.unlink(tmp_path)
                        except OSError:
                            pass  # temp file cleanup is best-effort

                    progress_bar.update(1)

                except Exception as e:
                    print(f"⚠️ Page {page_idx + 1} processing failed: {e}")
                    progress_bar.update(1)

        split_table_matches: List[SplitTableMatch] = []
        merged_table_segments = []

        if self.merge_split_tables and self.split_table_detector:
            print("πŸ”— Detecting split tables...")
            try:
                pages_for_detection = self._convert_to_layout_pages(all_results, enhanced_pages)
                split_table_matches = self.split_table_detector.detect_split_tables(
                    pages_for_detection, enhanced_pages
                )
                if split_table_matches:
                    print(f"πŸ”— Detected {len(split_table_matches)} split table(s) to merge")
                for match in split_table_matches:
                    merged_table_segments.append(match.segment1)
                    merged_table_segments.append(match.segment2)
            except Exception as e:
                import traceback
                traceback.print_exc()
                print(f"⚠️ Split table detection failed: {e}")
                split_table_matches = []

        self._generate_outputs(
            all_results, enhanced_pages, split_table_matches, merged_table_segments, out_dir
        )

        print(f"βœ… Parsing completed successfully!")
        print(f"πŸ“ Output directory: {out_dir}")

    def _process_pages_with_restoration(self, pdf_path: str, out_dir: str) -> List[Image.Image]:
        """
        Process PDF pages with DocRes image restoration.

        :param pdf_path: Path to the input PDF file
        :param out_dir: Output directory for enhanced images
        :return: List of enhanced PIL images
        """
        original_pages = [im for (im, _, _) in render_pdf_to_images(pdf_path, dpi=self.restoration_dpi)]

        if not original_pages:
            return []

        is_notebook = "ipykernel" in sys.modules or "jupyter" in sys.modules
        if is_notebook:
            progress_bar = create_notebook_friendly_bar(
                total=len(original_pages),
                desc=f"DocRes {self.restoration_task}"
            )
        else:
            progress_bar = create_beautiful_progress_bar(
                total=len(original_pages),
                desc=f"DocRes {self.restoration_task}",
                leave=True
            )

        enhanced_pages = []
        enhanced_dir = os.path.join(out_dir, "enhanced_pages")
        os.makedirs(enhanced_dir, exist_ok=True)

        try:
            with progress_bar:
                for i, page_img in enumerate(original_pages):
                    try:
                        img_array = np.array(page_img)

                        restored_img, metadata = self.docres_engine.restore_image(
                            img_array,
                            task=self.restoration_task
                        )

                        enhanced_page = Image.fromarray(restored_img)
                        enhanced_pages.append(enhanced_page)

                        enhanced_path = os.path.join(enhanced_dir, f"page_{i+1:03d}_enhanced.jpg")
                        enhanced_page.save(enhanced_path, "JPEG", quality=95)

                        progress_bar.set_description(f"βœ… Page {i+1}/{len(original_pages)} enhanced")
                        progress_bar.update(1)

                    except Exception as e:
                        print(f"  ⚠️ Page {i+1} restoration failed: {e}, using original")
                        enhanced_pages.append(page_img)
                        progress_bar.update(1)

        finally:
            if hasattr(progress_bar, 'close'):
                progress_bar.close()

        return enhanced_pages

    def _convert_to_layout_pages(self, results: List[Dict], page_images: List[Image.Image]):
        """
        Convert PaddleOCRVL results to a format compatible with split table detector.

        This creates a minimal LayoutPage-like structure from PaddleOCRVL output.
        """
        from doctra.engines.layout.layout_models import LayoutBox, LayoutPage

        pages = []
        for result in results:
            page_idx = result.get('page_index', 1)
            if page_idx < 1 or page_idx > len(page_images):
                continue

            page_img = page_images[page_idx - 1]
            boxes = []

            layout_det = result.get('layout_det_res', {})
            layout_boxes = layout_det.get('boxes', [])

            for box_data in layout_boxes:
                coords = box_data.get('coordinate', [])
                if len(coords) >= 4:
                    x1, y1, x2, y2 = float(coords[0]), float(coords[1]), float(coords[2]), float(coords[3])
                    label = box_data.get('label', 'unknown')
                    score = box_data.get('score', 0.0)

                    box = LayoutBox(
                        x1=x1,
                        y1=y1,
                        x2=x2,
                        y2=y2,
                        label=label,
                        score=score
                    )
                    boxes.append(box)

            page = LayoutPage(
                page_index=page_idx,
                width=page_img.width,
                height=page_img.height,
                boxes=boxes
            )
            pages.append(page)

        return pages

    def _generate_outputs(
        self,
        results: List[Dict],
        page_images: List[Image.Image],
        split_table_matches: List[SplitTableMatch],
        merged_table_segments: List[TableSegment],
        out_dir: str
    ) -> None:
        """
        Generate markdown, HTML, and Excel outputs from PaddleOCRVL results.
        """
        md_lines: List[str] = ["# PaddleOCRVL Document Content\n"]
        html_lines: List[str] = ["<h1>PaddleOCRVL Document Content</h1>"]
        structured_items: List[Dict[str, Any]] = []

        for result in results:
            page_idx = result.get('page_index', 1)
            page_img = page_images[page_idx - 1]

            md_lines.append(f"\n## Page {page_idx}\n")
            html_lines.append(f"<h2>Page {page_idx}</h2>")

            parsing_res_list = result.get('parsing_res_list', [])

            for item in parsing_res_list:
                if isinstance(item, dict):
                    label = item.get('block_label', item.get('label', 'unknown'))
                    bbox = item.get('block_bbox', item.get('bbox', None))
                    content = item.get('block_content', item.get('content', ''))
                else:
                    item_str = str(item)

                    label_match = re.search(r'label:\s*(\w+)', item_str)
                    label = label_match.group(1) if label_match else 'unknown'

                    bbox_match = re.search(r'bbox:\s*\[([\d\.,\s]+)\]', item_str)
                    bbox = None
                    if bbox_match:
                        bbox_str = bbox_match.group(1)
                        bbox = [float(x.strip()) for x in bbox_str.split(',')]

                    content_match = re.search(r'content:\s*(.+?)(?=\s*#################|$)', item_str, re.DOTALL)
                    content = content_match.group(1).strip() if content_match else ''

                if not content:
                    continue

                if label == 'table':
                    table_html_match = re.search(r'<table>.*?</table>', content, re.DOTALL)
                    if table_html_match:
                        table_html = table_html_match.group(0)
                        try:
                            table_md = self._html_table_to_markdown(table_html)
                            md_lines.append(f"\n### Table\n\n{table_md}\n")
                            html_lines.append(f"<h3>Table</h3>\n{table_html}")

                            structured_table = self._extract_table_data(table_html)
                            if structured_table:
                                structured_table['page'] = page_idx
                                structured_table['type'] = 'Table'
                                structured_items.append(structured_table)
                        except Exception as e:
                            if bbox:
                                self._save_element_image(page_img, bbox, out_dir, page_idx, label, md_lines, html_lines)

                elif label == 'chart':
                    chart_table = self._parse_chart_content(content)

                    if chart_table:
                        chart_table['page'] = page_idx
                        chart_table['type'] = 'Chart'
                        structured_items.append(chart_table)

                        table_md = render_markdown_table(
                            chart_table.get("headers"),
                            chart_table.get("rows"),
                            title=chart_table.get("title", "Chart")
                        )
                        table_html = render_html_table(
                            chart_table.get("headers"),
                            chart_table.get("rows"),
                            title=chart_table.get("title", "Chart")
                        )
                        md_lines.append(f"\n### Chart\n\n{table_md}\n")
                        html_lines.append(f"<h3>Chart</h3>\n{table_html}")
                    else:
                        md_lines.append(f"\n### Chart\n\n```\n{content}\n```\n")
                        html_lines.append(f"<h3>Chart</h3>\n<pre>{content}</pre>")

                elif label in ['header', 'text', 'figure_title', 'vision_footnote', 'number', 'numbers', 'paragraph_title', 'paragraph_titles']:
                    md_lines.append(f"{content}\n")
                    html_lines.append(f"<p>{content.replace(chr(10), '<br>')}</p>")

                else:
                    if bbox:
                        self._save_element_image(page_img, bbox, out_dir, page_idx, label, md_lines, html_lines)

        if split_table_matches and self.split_table_detector:
            for match_idx, match in enumerate(split_table_matches):
                try:
                    merged_img = self.split_table_detector.merge_table_images(match)

                    tables_dir = os.path.join(out_dir, "tables")
                    os.makedirs(tables_dir, exist_ok=True)
                    merged_filename = f"merged_table_{match.segment1.page_index}_{match.segment2.page_index}.png"
                    merged_path = os.path.join(tables_dir, merged_filename)
                    merged_img.save(merged_path)

                    abs_merged_path = os.path.abspath(merged_path)
                    rel_merged = os.path.relpath(abs_merged_path, out_dir)

                    pages_str = f"pages {match.segment1.page_index}-{match.segment2.page_index}"

                    with tempfile.NamedTemporaryFile(suffix='.png', delete=False) as tmp_file:
                        tmp_path = tmp_file.name
                        merged_img.save(tmp_path, "PNG")

                    try:
                        with warnings.catch_warnings():
                            warnings.simplefilter("ignore")
                            with open(os.devnull, "w") as devnull:
                                with contextlib.redirect_stderr(devnull):
                                    merged_output = self.paddleocr_vl.predict(
                                        input=tmp_path,
                                        device=self.device,
                                        use_chart_recognition=self.use_chart_recognition
                                    )

                        if merged_output and len(merged_output) > 0:
                            merged_result = merged_output[0]
                            parsing_res = merged_result.get('parsing_res_list', [])

                            for item in parsing_res:
                                if isinstance(item, dict):
                                    label = item.get('block_label', item.get('label', ''))
                                    content = item.get('block_content', item.get('content', ''))
                                else:
                                    item_str = str(item)
                                    label_match = re.search(r'label:\s*(\w+)', item_str)
                                    label = label_match.group(1) if label_match else ''
                                    content_match = re.search(r'content:\s*(.+?)(?=\s*#################|$)', item_str, re.DOTALL)
                                    content = content_match.group(1).strip() if content_match else ''

                                if label.lower() == 'table' and content:
                                    table_html_match = re.search(r'<table>.*?</table>', content, re.DOTALL)
                                    if table_html_match:
                                        table_html = table_html_match.group(0)
                                        table_md = self._html_table_to_markdown(table_html)
                                        md_lines.append(f"\n### Merged Table ({pages_str})\n\n{table_md}\n")
                                        html_lines.append(f"<h3>Merged Table ({pages_str})</h3>\n{table_html}")

                                        structured_table = self._extract_table_data(table_html)
                                        if structured_table:
                                            structured_table['page'] = pages_str
                                            structured_table['type'] = 'Table (Merged)'
                                            structured_table['split_merge'] = True
                                            structured_table['merge_confidence'] = match.confidence
                                            structured_items.append(structured_table)
                    finally:
                        try:
                            os.unlink(tmp_path)
                        except OSError:
                            pass  # temp file cleanup is best-effort

                except Exception as e:
                    print(f"⚠️ Warning: Failed to process merged table {match_idx + 1}: {e}")

        md_path = write_markdown(md_lines, out_dir)

        if structured_items:
            html_path = write_html_from_lines(html_lines, out_dir)
            excel_path = os.path.join(out_dir, "tables.xlsx")
            write_structured_excel(excel_path, structured_items)
            html_structured_path = os.path.join(out_dir, "tables.html")
            write_structured_html(html_structured_path, structured_items)
        else:
            html_path = write_html(md_lines, out_dir)

    def _save_element_image(
        self,
        page_img: Image.Image,
        bbox: List[float],
        out_dir: str,
        page_idx: int,
        label: str,
        md_lines: List[str],
        html_lines: List[str]
    ) -> None:
        """Save an element as an image and add references to markdown/HTML."""
        try:
            x1, y1, x2, y2 = int(bbox[0]), int(bbox[1]), int(bbox[2]), int(bbox[3])
            cropped = page_img.crop((x1, y1, x2, y2))

            label_dir = os.path.join(out_dir, label + "s")  # e.g. 'table' -> 'tables', 'figure' -> 'figures'
            os.makedirs(label_dir, exist_ok=True)

            img_filename = f"page_{page_idx:03d}_{label}_1.png"
            img_path = os.path.join(label_dir, img_filename)
            cropped.save(img_path, "PNG")

            rel_path = os.path.relpath(img_path, out_dir)
            md_lines.append(f"![{label.title()} β€” page {page_idx}]({rel_path})\n")
            html_lines.append(f'<img src="{rel_path}" alt="{label.title()} β€” page {page_idx}" />')
        except Exception as e:
            print(f"⚠️ Failed to save {label} image: {e}")

    def _html_table_to_markdown(self, html_table: str) -> str:
        """Convert HTML table to markdown format."""
        try:
            try:
                from bs4 import BeautifulSoup
                soup = BeautifulSoup(html_table, 'html.parser')
                table = soup.find('table')

                if not table:
                    return self._simple_html_to_markdown(html_table)

                rows = []
                for tr in table.find_all('tr'):
                    cells = []
                    for td in tr.find_all(['td', 'th']):
                        text = td.get_text(strip=True)
                        cells.append(text)
                    if cells:
                        rows.append(cells)

                if not rows:
                    return self._simple_html_to_markdown(html_table)

                md_lines = []
                md_lines.append('| ' + ' | '.join(rows[0]) + ' |')
                md_lines.append('| ' + ' | '.join(['---'] * len(rows[0])) + ' |')

                for row in rows[1:]:
                    while len(row) < len(rows[0]):
                        row.append('')
                    md_lines.append('| ' + ' | '.join(row) + ' |')

                return '\n'.join(md_lines)
            except ImportError:
                return self._simple_html_to_markdown(html_table)
        except Exception as e:
            return self._simple_html_to_markdown(html_table)

    def _simple_html_to_markdown(self, html_table: str) -> str:
        """Simple HTML table to markdown conversion without BeautifulSoup."""
        rows = []
        row_pattern = r'<tr[^>]*>(.*?)</tr>'
        cell_pattern = r'<t[dh][^>]*>(.*?)</t[dh]>'

        for row_match in re.finditer(row_pattern, html_table, re.DOTALL):
            row_html = row_match.group(1)
            cells = []
            for cell_match in re.finditer(cell_pattern, row_html, re.DOTALL):
                cell_text = cell_match.group(1).strip()
                cell_text = re.sub(r'<[^>]+>', '', cell_text)
                cells.append(cell_text)
            if cells:
                rows.append(cells)

        if not rows:
            return html_table

        md_lines = []
        if rows:
            md_lines.append('| ' + ' | '.join(rows[0]) + ' |')
            md_lines.append('| ' + ' | '.join(['---'] * len(rows[0])) + ' |')

            for row in rows[1:]:
                while len(row) < len(rows[0]):
                    row.append('')
                md_lines.append('| ' + ' | '.join(row) + ' |')

        return '\n'.join(md_lines)

    def _extract_table_data(self, html_table: str) -> Optional[Dict[str, Any]]:
        """Extract structured table data from HTML table."""
        try:
            try:
                from bs4 import BeautifulSoup
                soup = BeautifulSoup(html_table, 'html.parser')
                table = soup.find('table')

                if not table:
                    return self._simple_extract_table_data(html_table)

                rows = []
                headers = None

                for tr in table.find_all('tr'):
                    cells = []
                    for td in tr.find_all(['td', 'th']):
                        text = td.get_text(strip=True)
                        cells.append(text)

                    if cells:
                        if headers is None:
                            headers = cells
                        else:
                            rows.append(cells)

                if headers and rows:
                    return {
                        'title': '',
                        'headers': headers,
                        'rows': rows
                    }

                return None
            except ImportError:
                return self._simple_extract_table_data(html_table)
        except Exception:
            return self._simple_extract_table_data(html_table)

    def _simple_extract_table_data(self, html_table: str) -> Optional[Dict[str, Any]]:
        """Simple table data extraction without BeautifulSoup."""
        rows = []
        row_pattern = r'<tr[^>]*>(.*?)</tr>'
        cell_pattern = r'<t[dh][^>]*>(.*?)</t[dh]>'

        for row_match in re.finditer(row_pattern, html_table, re.DOTALL):
            row_html = row_match.group(1)
            cells = []
            for cell_match in re.finditer(cell_pattern, row_html, re.DOTALL):
                cell_text = cell_match.group(1).strip()
                cell_text = re.sub(r'<[^>]+>', '', cell_text)
                cells.append(cell_text)
            if cells:
                rows.append(cells)

        if not rows:
            return None

        headers = rows[0] if rows else None
        data_rows = rows[1:] if len(rows) > 1 else []

        if headers and data_rows:
            return {
                'title': '',
                'headers': headers,
                'rows': data_rows
            }

        return None

    def _parse_chart_content(self, content: str) -> Optional[Dict[str, Any]]:
        """
        Parse chart content from pipe-delimited format to structured table data.

        Example input:
        Category | Percentage
        PCT system fees | 358.6%
        Madrid system fees | 76.2%

        :param content: Chart content in pipe-delimited format
        :return: Dictionary with headers and rows, or None if parsing fails
        """
        if not content or not content.strip():
            return None

        lines = [line.strip() for line in content.split('\n') if line.strip()]
        if not lines:
            return None

        header_line = lines[0]
        headers = [h.strip() for h in header_line.split('|') if h.strip()]

        if not headers:
            return None

        rows = []
        for line in lines[1:]:
            cells = [c.strip() for c in line.split('|') if c.strip()]
            if cells:
                while len(cells) < len(headers):
                    cells.append('')
                if len(cells) > len(headers):
                    cells = cells[:len(headers)]
                rows.append(cells)

        if headers and rows:
            return {
                'title': '',
                'headers': headers,
                'rows': rows
            }

        return None
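The pipe-delimited parsing logic above can be sanity-checked with the docstring's own example. Below is a standalone rewrite of the same logic (an illustrative sketch, not part of Doctra's public API):

```python
def parse_chart_content(content: str):
    """Standalone version of the pipe-delimited chart parser shown above."""
    lines = [line.strip() for line in content.split('\n') if line.strip()]
    if not lines:
        return None

    # First non-empty line is the header row
    headers = [h.strip() for h in lines[0].split('|') if h.strip()]
    if not headers:
        return None

    rows = []
    for line in lines[1:]:
        cells = [c.strip() for c in line.split('|') if c.strip()]
        if cells:
            # Pad or trim each row to the header width
            cells = (cells + [''] * len(headers))[:len(headers)]
            rows.append(cells)

    if headers and rows:
        return {'title': '', 'headers': headers, 'rows': rows}
    return None

result = parse_chart_content("Category | Percentage\nPCT system fees | 358.6%\nMadrid system fees | 76.2%")
```

Here `result['headers']` is `['Category', 'Percentage']` and each data line becomes one row of cells.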

__init__(*, use_image_restoration=True, restoration_task='appearance', restoration_device=None, restoration_dpi=200, use_chart_recognition=True, use_doc_orientation_classify=False, use_doc_unwarping=False, use_layout_detection=True, device='gpu', merge_split_tables=True, bottom_threshold_ratio=0.2, top_threshold_ratio=0.15, max_gap_ratio=0.25, column_alignment_tolerance=10.0, min_merge_confidence=0.65)

Initialize the PaddleOCRVL PDF Parser.

Source code in doctra/parsers/paddleocr_vl_parser.py
def __init__(
    self,
    *,
    use_image_restoration: bool = True,
    restoration_task: str = "appearance",
    restoration_device: Optional[str] = None,
    restoration_dpi: int = 200,
    use_chart_recognition: bool = True,
    use_doc_orientation_classify: bool = False,
    use_doc_unwarping: bool = False,
    use_layout_detection: bool = True,
    device: str = "gpu",
    merge_split_tables: bool = True,
    bottom_threshold_ratio: float = 0.20,
    top_threshold_ratio: float = 0.15,
    max_gap_ratio: float = 0.25,
    column_alignment_tolerance: float = 10.0,
    min_merge_confidence: float = 0.65,
):
    """
    Initialize the PaddleOCRVL PDF Parser.
    """
    if not PADDLEOCR_VL_AVAILABLE:
        raise ImportError(
            "PaddleOCRVL is not available. Please install paddleocr:\n"
            "pip install paddleocr>=2.6.0"
        )

    try:
        with silence():
            with warnings.catch_warnings():
                warnings.simplefilter("ignore")
                self.paddleocr_vl = PaddleOCRVL(
                    use_doc_orientation_classify=use_doc_orientation_classify,
                    use_doc_unwarping=use_doc_unwarping,
                    use_layout_detection=use_layout_detection,
                )
        print("βœ… PaddleOCRVL pipeline initialized")
    except Exception as e:
        raise RuntimeError(f"Failed to initialize PaddleOCRVL: {e}") from e

    self.use_chart_recognition = use_chart_recognition
    self.device = device

    self.use_image_restoration = use_image_restoration
    self.restoration_task = restoration_task
    self.restoration_device = restoration_device
    self.restoration_dpi = restoration_dpi

    self.docres_engine = None
    if self.use_image_restoration:
        try:
            self.docres_engine = DocResEngine(
                device=restoration_device,
                use_half_precision=True
            )
            print(f"βœ… DocRes engine initialized with task: {restoration_task}")
        except Exception as e:
            print(f"⚠️ DocRes initialization failed: {e}")
            print("   Continuing without image restoration...")
            self.use_image_restoration = False
            self.docres_engine = None

    self.merge_split_tables = merge_split_tables
    if self.merge_split_tables:
        self.split_table_detector = SplitTableDetector(
            bottom_threshold_ratio=bottom_threshold_ratio,
            top_threshold_ratio=top_threshold_ratio,
            max_gap_ratio=max_gap_ratio,
            column_alignment_tolerance=column_alignment_tolerance,
            min_merge_confidence=min_merge_confidence,
        )
    else:
        self.split_table_detector = None
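The threshold parameters stored here feed `SplitTableDetector`. As an illustration only, a hypothetical proximity check shows how such ratios are typically applied; the function and argument names below are invented for this sketch and are not Doctra internals:

```python
def near_page_break(table1_bottom: float, page_height: float,
                    table2_top: float,
                    bottom_threshold_ratio: float = 0.20,
                    top_threshold_ratio: float = 0.15) -> bool:
    """True when a table ends close to the bottom of one page and the
    next table starts close to the top of the following page.

    Assumes both pages share the same height for simplicity.
    """
    # Fraction of the page left below table 1
    ends_low = (page_height - table1_bottom) / page_height <= bottom_threshold_ratio
    # Fraction of the page above table 2 on the next page
    starts_high = table2_top / page_height <= top_threshold_ratio
    return ends_low and starts_high
```

With the defaults above, a table ending 50 px from the bottom of a 1000 px page, followed by one starting 100 px from the top, would qualify as a split-table candidate; a table ending mid-page would not.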

parse(pdf_path, output_dir=None)

Parse a PDF document using PaddleOCRVL.

:param pdf_path: Path to the input PDF file
:param output_dir: Output directory (if None, uses default)
:return: None

Source code in doctra/parsers/paddleocr_vl_parser.py
def parse(self, pdf_path: str, output_dir: Optional[str] = None) -> None:
    """
    Parse a PDF document using PaddleOCRVL.

    :param pdf_path: Path to the input PDF file
    :param output_dir: Output directory (if None, uses default)
    :return: None
    """
    pdf_filename = os.path.splitext(os.path.basename(pdf_path))[0]

    if output_dir is None:
        out_dir = f"outputs/{pdf_filename}/paddleocr_vl_parse"
    else:
        out_dir = output_dir

    os.makedirs(out_dir, exist_ok=True)
    ensure_output_dirs(out_dir, IMAGE_SUBDIRS)

    print(f"πŸ”„ Processing PDF: {os.path.basename(pdf_path)}")

    if self.use_image_restoration and self.docres_engine:
        print("πŸ”„ Applying DocRes image restoration...")
        enhanced_pages = self._process_pages_with_restoration(pdf_path, out_dir)
    else:
        enhanced_pages = [im for (im, _, _) in render_pdf_to_images(pdf_path, dpi=self.restoration_dpi)]

    if not enhanced_pages:
        print("❌ No pages found in PDF")
        return

    print("πŸ” Processing pages with PaddleOCRVL...")
    all_results = []

    is_notebook = "ipykernel" in sys.modules or "jupyter" in sys.modules
    if is_notebook:
        progress_bar = create_notebook_friendly_bar(
            total=len(enhanced_pages),
            desc="PaddleOCRVL processing"
        )
    else:
        progress_bar = create_beautiful_progress_bar(
            total=len(enhanced_pages),
            desc="PaddleOCRVL processing",
            leave=True
        )

    with progress_bar:
        for page_idx, page_img in enumerate(enhanced_pages):
            try:
                with tempfile.NamedTemporaryFile(suffix='.jpg', delete=False) as tmp_file:
                    tmp_path = tmp_file.name
                    page_img.save(tmp_path, "JPEG", quality=95)

                try:
                    with warnings.catch_warnings():
                        warnings.simplefilter("ignore")
                        with open(os.devnull, "w") as devnull:
                            with contextlib.redirect_stderr(devnull):
                                output = self.paddleocr_vl.predict(
                                    input=tmp_path,
                                    device=self.device,
                                    use_chart_recognition=self.use_chart_recognition
                                )

                    if output and len(output) > 0:
                        result = output[0]
                        result['page_index'] = page_idx + 1
                        all_results.append(result)

                    progress_bar.set_description(f"βœ… Page {page_idx + 1}/{len(enhanced_pages)} processed")
                finally:
                    try:
                        os.unlink(tmp_path)
                    except OSError:
                        pass

                progress_bar.update(1)

            except Exception as e:
                print(f"⚠️ Page {page_idx + 1} processing failed: {e}")
                progress_bar.update(1)

    split_table_matches: List[SplitTableMatch] = []
    merged_table_segments = []

    if self.merge_split_tables and self.split_table_detector:
        print("πŸ”— Detecting split tables...")
        try:
            pages_for_detection = self._convert_to_layout_pages(all_results, enhanced_pages)
            split_table_matches = self.split_table_detector.detect_split_tables(
                pages_for_detection, enhanced_pages
            )
            if split_table_matches:
                print(f"πŸ”— Detected {len(split_table_matches)} split table(s) to merge")
            for match in split_table_matches:
                merged_table_segments.append(match.segment1)
                merged_table_segments.append(match.segment2)
        except Exception as e:
            import traceback
            traceback.print_exc()
            print(f"⚠️ Split table detection failed: {e}")
            split_table_matches = []

    self._generate_outputs(
        all_results, enhanced_pages, split_table_matches, merged_table_segments, out_dir
    )

    print("βœ… Parsing completed successfully!")
    print(f"πŸ“ Output directory: {out_dir}")

StructuredDOCXParser

Comprehensive parser for Microsoft Word documents (.docx files).

doctra.parsers.structured_docx_parser.StructuredDOCXParser

Comprehensive DOCX parser for extracting all types of content.

Processes DOCX documents to extract text, tables, images, and figures. Supports structured data extraction and optional VLM processing for enhanced content analysis.

:param vlm: VLM engine instance (VLMStructuredExtractor). If None, VLM processing is disabled.
:param extract_images: Whether to extract embedded images (default: True)
:param preserve_formatting: Whether to preserve text formatting in output (default: True)
:param table_detection: Whether to detect and extract tables (default: True)

Source code in doctra/parsers/structured_docx_parser.py
class StructuredDOCXParser:
    """
    Comprehensive DOCX parser for extracting all types of content.

    Processes DOCX documents to extract text, tables, images, and figures.
    Supports structured data extraction and optional VLM processing for
    enhanced content analysis.

    :param vlm: VLM engine instance (VLMStructuredExtractor). If None, VLM processing is disabled.
    :param extract_images: Whether to extract embedded images (default: True)
    :param preserve_formatting: Whether to preserve text formatting in output (default: True)
    :param table_detection: Whether to detect and extract tables (default: True)
    """

    def __init__(
        self,
        *,
        vlm: Optional[VLMStructuredExtractor] = None,
        extract_images: bool = True,
        preserve_formatting: bool = True,
        table_detection: bool = True,
        export_excel: bool = True,
    ):
        """
        Initialize the StructuredDOCXParser with processing configuration.

        :param vlm: VLM engine instance (VLMStructuredExtractor). If None, VLM processing is disabled.
        :param extract_images: Whether to extract embedded images (default: True)
        :param preserve_formatting: Whether to preserve text formatting in output (default: True)
        :param table_detection: Whether to detect and extract tables (default: True)
        :param export_excel: Whether to export tables to Excel file (default: True)
        """
        if Document is None:
            raise ImportError("python-docx is required for DOCX parsing. Install with: pip install python-docx")

        self.extract_images = extract_images
        self.preserve_formatting = preserve_formatting
        self.table_detection = table_detection
        self.export_excel = export_excel

        # Initialize VLM engine - use provided instance or None
        if vlm is None:
            self.vlm = None
        elif isinstance(vlm, VLMStructuredExtractor):
            self.vlm = vlm
        else:
            raise TypeError(
                f"vlm must be an instance of VLMStructuredExtractor or None, "
                f"got {type(vlm).__name__}"
            )

    def parse(self, docx_path: str) -> None:
        """
        Parse a DOCX document and extract all content.

        :param docx_path: Path to the DOCX file to parse
        """
        if not os.path.exists(docx_path):
            raise FileNotFoundError(f"DOCX file not found: {docx_path}")

        docx_path = Path(docx_path)
        output_dir = Path(f"outputs/{docx_path.stem}")
        output_dir.mkdir(parents=True, exist_ok=True)

        print(f"πŸ“„ Processing DOCX: {docx_path.name}")

        try:
            doc = Document(docx_path)

            document_data = self._extract_document_structure(doc)

            images_data = []
            if self.extract_images:
                images_data = self._extract_images(doc, output_dir)

            tables_data = [elem for elem in document_data['elements'] if elem['type'] == 'table']

            if self.vlm is not None and images_data:
                total_steps = len(images_data)
            else:
                total_steps = 1

            progress_bar = tqdm(total=total_steps, desc="Processing DOCX", unit="image")

            vlm_extracted_data = []
            if self.vlm is not None and images_data:
                vlm_extracted_data = self._process_vlm_data(images_data, output_dir, progress_bar)
            else:
                progress_bar.update(1)

            progress_bar.close()

            self._generate_markdown_output(document_data, images_data, output_dir, vlm_extracted_data)
            self._generate_html_output(document_data, images_data, output_dir, vlm_extracted_data)

            if self.export_excel:
                if vlm_extracted_data:
                    self._generate_excel_output_with_vlm(tables_data, vlm_extracted_data, output_dir)
                else:
                    self._generate_excel_output(tables_data, output_dir)

            print("βœ… DOCX parsing completed successfully!")
            print(f"πŸ“Š Extracted: {len(document_data.get('paragraphs', []))} paragraphs, "
                  f"{len(tables_data)} tables, {len(images_data)} images")

        except Exception as e:
            print(f"❌ Error parsing DOCX: {e}")
            raise

    def _extract_document_structure(self, doc: DocumentType) -> Dict[str, Any]:
        """Extract the overall document structure."""
        document_data = {
            'elements': [],  # Mixed list of paragraphs, tables, and other elements
            'paragraphs': [],
            'headings': [],
            'lists': [],
            'metadata': {}
        }

        document_data['metadata'] = {
            'title': doc.core_properties.title or '',
            'author': doc.core_properties.author or '',
            'subject': doc.core_properties.subject or '',
            'created': str(doc.core_properties.created) if doc.core_properties.created else '',
            'modified': str(doc.core_properties.modified) if doc.core_properties.modified else '',
        }

        self._extract_document_elements_in_order(doc, document_data)

        return document_data

    def _extract_document_elements_in_order(self, doc: DocumentType, document_data: Dict):
        """Extract document elements (paragraphs and tables) in their original order."""
        elements = []
        paragraph_index = 0
        table_index = 0

        for element in doc.element.body:
            if element.tag.endswith('p'):
                for para in doc.paragraphs:
                    if para._element == element and para.text.strip():
                        para_data = {
                            'type': 'paragraph',
                            'index': paragraph_index,
                            'text': para.text.strip(),
                            'style': para.style.name if para.style else 'Normal',
                            'is_heading': para.style.name.startswith('Heading') if para.style else False,
                            'level': self._get_heading_level(para.style.name) if para.style else 0,
                            'formatting': self._extract_formatting(para) if self.preserve_formatting else {}
                        }

                        elements.append(para_data)
                        document_data['paragraphs'].append(para_data)

                        # Categorize headings
                        if para_data['is_heading']:
                            document_data['headings'].append(para_data)

                        paragraph_index += 1
                        break

            elif element.tag.endswith('tbl'):
                for table in doc.tables:
                    if table._element == element:
                        table_data = {
                            'type': 'table',
                            'index': table_index,
                            'rows': len(table.rows),
                            'cols': len(table.columns),
                            'data': [],
                            'markdown': ''
                        }

                        for row_idx, row in enumerate(table.rows):
                            row_data = []
                            for cell in row.cells:
                                cell_text = cell.text.strip()
                                row_data.append(cell_text)
                            table_data['data'].append(row_data)

                        if table_data['data']:
                            headers = table_data['data'][0] if table_data['data'] else []
                            rows = table_data['data'][1:] if len(table_data['data']) > 1 else []
                            table_data['markdown'] = render_markdown_table(headers, rows)

                        elements.append(table_data)
                        table_index += 1
                        break

        document_data['elements'] = elements

    def _extract_tables(self, doc: DocumentType, output_dir: Path) -> List[Dict[str, Any]]:
        """Extract all tables from the document."""
        tables_data = []

        for table_idx, table in enumerate(doc.tables):
            table_data = {
                'index': table_idx,
                'rows': len(table.rows),
                'cols': len(table.columns),
                'data': [],
                'markdown': ''
            }

            for row_idx, row in enumerate(table.rows):
                row_data = []
                for cell in row.cells:
                    cell_text = cell.text.strip()
                    row_data.append(cell_text)
                table_data['data'].append(row_data)

            if table_data['data']:
                headers = table_data['data'][0] if table_data['data'] else []
                rows = table_data['data'][1:] if len(table_data['data']) > 1 else []
                table_data['markdown'] = render_markdown_table(headers, rows)
                print(f"πŸ“Š Table {table_idx + 1}: {len(table_data['data'])} rows, {len(table_data['data'][0]) if table_data['data'] else 0} columns")

            tables_data.append(table_data)

        return tables_data
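`render_markdown_table` is imported from Doctra's export utilities and its implementation is not shown on this page. A minimal stand-in with the behavior this method appears to assume (header row, `---` separator, short rows padded to the header width):

```python
def render_markdown_table(headers, rows):
    """Illustrative sketch of the assumed behavior, not Doctra's actual code."""
    width = len(headers)
    lines = ['| ' + ' | '.join(headers) + ' |',
             '| ' + ' | '.join(['---'] * width) + ' |']
    for row in rows:
        # Pad or trim each data row to match the header count
        padded = (list(row) + [''] * width)[:width]
        lines.append('| ' + ' | '.join(padded) + ' |')
    return '\n'.join(lines)

print(render_markdown_table(['A', 'B'], [['1', '2'], ['3']]))
```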

    def _extract_images(self, doc: DocumentType, output_dir: Path) -> List[Dict[str, Any]]:
        """Extract embedded images from the document."""
        images_data = []
        images_dir = output_dir / "images"
        images_dir.mkdir(exist_ok=True)

        try:
            for rel in doc.part.rels.values():
                if hasattr(rel, 'target_ref'):
                    content_type = getattr(rel, 'target_content_type', 'unknown')
                    is_image = False
                    if "image" in rel.target_ref or "media" in rel.target_ref:
                        is_image = True
                    elif content_type and "image/" in content_type:
                        is_image = True
                    elif rel.target_ref.lower().endswith(('.png', '.jpg', '.jpeg', '.gif', '.bmp', '.tiff', '.webp')):
                        is_image = True

                    if is_image:
                        try:
                            image_blob = rel.target_part.blob
                            if image_blob:
                                original_filename = rel.target_ref
                                clean_filename = Path(original_filename).name

                                image_data = {
                                    'filename': clean_filename,
                                    'original_path': original_filename,
                                    'type': clean_filename.split('.')[-1].lower(),
                                    'path': str(images_dir / clean_filename)
                                }

                                target_path = Path(image_data['path'])
                                target_path.parent.mkdir(parents=True, exist_ok=True)

                                with open(target_path, 'wb') as f:
                                    f.write(image_blob)

                                images_data.append(image_data)
                        except Exception:
                            pass  # Silently skip problematic images

        except Exception:
            pass  # Silently skip if relationships can't be accessed

        return images_data

    def _process_vlm_data(self, images_data: List, output_dir: Path, progress_bar=None) -> List[Dict]:
        """Process images with VLM to extract structured data."""
        vlm_extracted_data = []
        if images_data:
            for i, img_data in enumerate(images_data):
                try:
                    if progress_bar:
                        progress_bar.set_description(f"Processing image {i+1}/{len(images_data)}: {img_data['filename']}")

                    result = self.vlm.extract_table_or_chart(img_data['path'])

                    if hasattr(result, 'title') and hasattr(result, 'description'):
                        vlm_data = {
                            'title': result.title,
                            'description': result.description,
                            'headers': result.headers,
                            'rows': result.rows,
                            'type': 'TabularArtifact',
                            'source_image': img_data['filename'],
                            'page': f"Image {i+1}"
                        }
                        vlm_extracted_data.append(vlm_data)
                    elif isinstance(result, str):
                        # Try to parse JSON string and create proper structure
                        try:
                            parsed_data = json.loads(result)
                            vlm_data = {
                                'title': parsed_data.get('title', f"Extracted from {img_data['filename']}"),
                                'description': parsed_data.get('description', ''),
                                'headers': parsed_data.get('headers', []),
                                'rows': parsed_data.get('rows', []),
                                'type': 'TabularArtifact',
                                'source_image': img_data['filename'],
                                'page': f"Image {i+1}"
                            }
                            vlm_extracted_data.append(vlm_data)
                        except json.JSONDecodeError:
                            # Fallback for non-JSON string
                            vlm_data = {
                                'title': f"Extracted from {img_data['filename']}",
                                'description': result[:300] if len(result) > 300 else result,
                                'headers': [],
                                'rows': [],
                                'type': 'TabularArtifact',
                                'source_image': img_data['filename'],
                                'page': f"Image {i+1}",
                                'raw_response': result
                            }
                            vlm_extracted_data.append(vlm_data)

                    # Update progress bar after each image
                    if progress_bar:
                        progress_bar.update(1)

                except Exception:
                    # Still update the progress bar even if image processing
                    # fails, then silently skip the problematic image
                    if progress_bar:
                        progress_bar.update(1)

        return vlm_extracted_data
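The string branch above tries `json.loads` first and falls back to a raw-text record on failure. The same parse-or-fallback pattern in isolation (`normalize_vlm_result` is an illustrative name, not a Doctra function):

```python
import json

def normalize_vlm_result(result: str, source: str) -> dict:
    """Parse a JSON VLM response, or wrap a plain-text one."""
    try:
        parsed = json.loads(result)
        if not isinstance(parsed, dict):
            parsed = {}  # Valid JSON but not an object; treat as empty
        return {
            'title': parsed.get('title', f"Extracted from {source}"),
            'headers': parsed.get('headers', []),
            'rows': parsed.get('rows', []),
        }
    except json.JSONDecodeError:
        # Non-JSON response: keep a truncated copy of the raw text
        return {'title': f"Extracted from {source}", 'headers': [], 'rows': [],
                'raw_response': result[:300]}
```

A well-formed JSON response yields structured headers and rows; anything else degrades gracefully to a record carrying the raw text.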

    def _safe_sheet_name(self, raw_title: str) -> str:
        """
        Create a safe Excel sheet name from a raw title.

        Ensures the sheet name is valid for Excel by removing invalid characters,
        handling length limits, and avoiding duplicates.
        """
        import re

        # Excel invalid characters
        invalid_chars = r'[:\\/*?\[\]]'
        max_length = 31

        name = (raw_title or "Untitled").strip()
        name = re.sub(invalid_chars, "_", name)
        name = re.sub(r"\s+", " ", name)
        name = name[:max_length] if name else "Sheet"

        return name
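The sanitization rules above can be exercised standalone (same invalid-character regex and 31-character cap as `_safe_sheet_name`, rewritten as a module-level function for illustration):

```python
import re

def safe_sheet_name(raw_title: str) -> str:
    """Excel sheet names may not contain : \\ / * ? [ ] and max out at 31 chars."""
    name = (raw_title or "Untitled").strip()
    name = re.sub(r'[:\\/*?\[\]]', "_", name)  # Replace Excel-invalid characters
    name = re.sub(r"\s+", " ", name)           # Collapse runs of whitespace
    return name[:31] if name else "Sheet"

print(safe_sheet_name("Revenue: 2023/2024 [draft]"))  # β†’ Revenue_ 2023_2024 _draft_
```

Note that deduplication of identical sheet names (mentioned in the docstring) would need to happen at the workbook level; this function alone only sanitizes a single title.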

    def _generate_markdown_output(self, document_data: Dict, images_data: List, output_dir: Path, vlm_extracted_data: List = None):
        """Generate markdown output."""
        markdown_content = []

        if document_data['metadata']['title']:
            markdown_content.append(f"# {document_data['metadata']['title']}")

        for element in document_data['elements']:
            if element['type'] == 'paragraph':
                if element['is_heading']:
                    level = element['level']
                    markdown_content.append(f"{'#' * level} {element['text']}")
                else:
                    markdown_content.append(element['text'])
            elif element['type'] == 'table':
                if element['markdown']:
                    markdown_content.append(f"\n## Table {element['index'] + 1}")
                    markdown_content.append(element['markdown'])

        if vlm_extracted_data:
            for i, vlm_table in enumerate(vlm_extracted_data):
                if vlm_table['rows']:
                    markdown_content.append(f"\n## {vlm_table['title']}")
                    if vlm_table['description']:
                        markdown_content.append(f"*{vlm_table['description']}*")

                    if vlm_table['headers'] and vlm_table['rows']:
                        vlm_markdown = render_markdown_table(vlm_table['headers'], vlm_table['rows'])
                        markdown_content.append(vlm_markdown)
        else:
            for img in images_data:
                relative_path = f"images/{img['filename']}"
                markdown_content.append(f"\n![{img['filename']}]({relative_path})")

        write_markdown(markdown_content, str(output_dir), "document.md")

    def _generate_html_output(self, document_data: Dict, images_data: List, output_dir: Path, vlm_extracted_data: List = None):
        """Generate HTML output."""
        html_content = []

        if document_data['metadata']['title']:
            html_content.append(f"<h1>{document_data['metadata']['title']}</h1>")

        for element in document_data['elements']:
            if element['type'] == 'paragraph':
                if element['is_heading']:
                    level = element['level']
                    html_content.append(f"<h{level}>{element['text']}</h{level}>")
                else:
                    html_content.append(f"<p>{element['text']}</p>")
            elif element['type'] == 'table':
                if element['data']:
                    html_content.append(f"<h2>Table {element['index'] + 1}</h2>")
                    html_table = self._generate_html_table(element['data'])
                    html_content.append(html_table)

        if vlm_extracted_data:
            for i, vlm_table in enumerate(vlm_extracted_data):
                if vlm_table['rows']:
                    html_content.append(f"<h2>{vlm_table['title']}</h2>")
                    if vlm_table['description']:
                        html_content.append(f"<p><em>{vlm_table['description']}</em></p>")

                    if vlm_table['headers'] and vlm_table['rows']:
                        table_data = [vlm_table['headers']] + vlm_table['rows']
                        vlm_html_table = self._generate_html_table(table_data)
                        html_content.append(vlm_html_table)
        else:
            for img in images_data:
                relative_path = f"images/{img['filename']}"
                html_content.append(f'<img src="{relative_path}" alt="{img["filename"]}" />')

        write_html(html_content, str(output_dir), "document.html")

    def _generate_excel_output(self, tables_data: List, output_dir: Path):
        """Generate Excel output with all tables and Table of Contents."""
        if not tables_data:
            print("⚠️  No tables found to export to Excel")
            return

        if not EXCEL_AVAILABLE:
            print("⚠️  Excel export requires pandas and openpyxl: Missing dependencies")
            print("Install with: pip install pandas openpyxl")
            return

        try:
            wb = Workbook()
            wb.remove(wb.active)

            HEADER_FILL = PatternFill(fill_type="solid", start_color="FF2E7D32", end_color="FF2E7D32")
            HEADER_FONT = Font(color="FFFFFFFF", bold=True)
            HEADER_ALIGN = Alignment(horizontal="center", vertical="center", wrap_text=True)

            toc_data = []
            sheet_index = 1
            sheet_mapping = {}

            for i, table in enumerate(tables_data):
                if table['data']:
                    table_title = table.get('title', f"Table {i+1}")
                    sheet_name = self._safe_sheet_name(table_title)
                    ws = wb.create_sheet(title=sheet_name)

                    for row_idx, row_data in enumerate(table['data']):
                        for col_idx, cell_value in enumerate(row_data):
                            ws.cell(row=row_idx + 1, column=col_idx + 1, value=cell_value)

                    if table['data']:
                        ncols = len(table['data'][0]) if table['data'] else 0
                        for col_idx in range(1, ncols + 1):
                            cell = ws.cell(row=1, column=col_idx)
                            cell.fill = HEADER_FILL
                            cell.font = HEADER_FONT
                            cell.alignment = HEADER_ALIGN
                        ws.freeze_panes = "A2"

                    for column in ws.columns:
                        max_length = 0
                        column_letter = column[0].column_letter
                        for cell in column:
                            try:
                                if len(str(cell.value)) > max_length:
                                    max_length = len(str(cell.value))
                            except Exception:
                                pass
                        adjusted_width = min(max_length + 2, 50)
                        ws.column_dimensions[column_letter].width = adjusted_width

                    toc_data.append([
                        sheet_index,
                        table_title,
                        "Original table from document",
                        len(table['data']),
                        len(table['data'][0]) if table['data'] else 0,
                        "Document"
                    ])
                    sheet_mapping[table_title] = sheet_name
                    sheet_index += 1

            if toc_data:
                toc_ws = wb.create_sheet(title="Table_of_Contents", index=0)

                toc_headers = ["Sheet #", "Table Name", "Description", "Rows", "Columns", "Source"]
                for col_idx, header in enumerate(toc_headers):
                    cell = toc_ws.cell(row=1, column=col_idx + 1, value=header)
                    cell.fill = HEADER_FILL
                    cell.font = HEADER_FONT
                    cell.alignment = HEADER_ALIGN

                for row_idx, row_data in enumerate(toc_data):
                    for col_idx, cell_value in enumerate(row_data):
                        cell = toc_ws.cell(row=row_idx + 2, column=col_idx + 1, value=cell_value)

                        if col_idx == 1 and cell_value in sheet_mapping:
                            sheet_name = sheet_mapping[cell_value]

                            if ' ' in sheet_name or any(char in sheet_name for char in ['[', ']', '*', '?', ':', '\\', '/']):
                                hyperlink_ref = f"#'{sheet_name}'!A1"
                            else:
                                hyperlink_ref = f"#{sheet_name}!A1"

                            cell.hyperlink = Hyperlink(ref=hyperlink_ref, target=hyperlink_ref)
                            cell.font = Font(color="0000FF", underline="single")

                        if col_idx == 2:
                            cell.alignment = Alignment(wrap_text=True, vertical="top")

                toc_ws.column_dimensions['A'].width = 10
                toc_ws.column_dimensions['B'].width = 30
                toc_ws.column_dimensions['C'].width = 60
                toc_ws.column_dimensions['D'].width = 10
                toc_ws.column_dimensions['E'].width = 10
                toc_ws.column_dimensions['F'].width = 15

                for row_idx in range(2, len(toc_data) + 2):
                    toc_ws.row_dimensions[row_idx].height = 30

            excel_path = output_dir / "tables.xlsx"
            wb.save(excel_path)

        except Exception as e:
            print(f"❌ Error creating Excel file: {e}")

    def _generate_excel_output_with_vlm(self, tables_data: List, vlm_extracted_data: List, output_dir: Path):
        """Generate Excel output with both original tables and VLM extracted data, including table of contents."""
        if not tables_data and not vlm_extracted_data:
            print("⚠️  No tables found to export to Excel")
            return

        if not EXCEL_AVAILABLE:
            print("⚠️  Excel export requires pandas and openpyxl: Missing dependencies")
            print("Install with: pip install pandas openpyxl")
            return

        try:
            wb = Workbook()
            wb.remove(wb.active)


            # Define styling constants (matching PDF parser)
            HEADER_FILL = PatternFill(fill_type="solid", start_color="FF2E7D32", end_color="FF2E7D32")  # Green
            HEADER_FONT = Font(color="FFFFFFFF", bold=True)
            HEADER_ALIGN = Alignment(horizontal="center", vertical="center", wrap_text=True)

            toc_data = []
            sheet_index = 1
            sheet_mapping = {}

            for i, table in enumerate(tables_data):
                if table['data']:
                    table_title = table.get('title', f"Table {i+1}")
                    sheet_name = self._safe_sheet_name(table_title)
                    ws = wb.create_sheet(title=sheet_name)

                    for row_idx, row_data in enumerate(table['data']):
                        for col_idx, cell_value in enumerate(row_data):
                            ws.cell(row=row_idx + 1, column=col_idx + 1, value=cell_value)

                    if table['data']:
                        ncols = len(table['data'][0]) if table['data'] else 0
                        for col_idx in range(1, ncols + 1):
                            cell = ws.cell(row=1, column=col_idx)
                            cell.fill = HEADER_FILL
                            cell.font = HEADER_FONT
                            cell.alignment = HEADER_ALIGN
                        ws.freeze_panes = "A2"

                    for column in ws.columns:
                        max_length = 0
                        column_letter = column[0].column_letter
                        for cell in column:
                            try:
                                if len(str(cell.value)) > max_length:
                                    max_length = len(str(cell.value))
                            except Exception:
                                pass
                        adjusted_width = min(max_length + 2, 50)
                        ws.column_dimensions[column_letter].width = adjusted_width

                    toc_data.append([
                        sheet_index,
                        table_title,
                        "Original table from document",
                        len(table['data']),
                        len(table['data'][0]) if table['data'] else 0,
                        "Document"
                    ])
                    sheet_mapping[table_title] = sheet_name
                    sheet_index += 1

            for i, vlm_table in enumerate(vlm_extracted_data):
                if vlm_table['rows']:
                    table_title = vlm_table['title']
                    sheet_name = self._safe_sheet_name(table_title)
                    ws = wb.create_sheet(title=sheet_name)

                    for col_idx, header in enumerate(vlm_table['headers']):
                        cell = ws.cell(row=1, column=col_idx + 1, value=header)
                        cell.fill = HEADER_FILL
                        cell.font = HEADER_FONT
                        cell.alignment = HEADER_ALIGN

                    for row_idx, row_data in enumerate(vlm_table['rows']):
                        for col_idx, cell_value in enumerate(row_data):
                            ws.cell(row=row_idx + 2, column=col_idx + 1, value=cell_value)

                    ws.freeze_panes = "A2"

                    for column in ws.columns:
                        max_length = 0
                        column_letter = column[0].column_letter
                        for cell in column:
                            try:
                                if len(str(cell.value)) > max_length:
                                    max_length = len(str(cell.value))
                            except Exception:
                                pass
                        adjusted_width = min(max_length + 2, 50)
                        ws.column_dimensions[column_letter].width = adjusted_width

                    toc_data.append([
                        sheet_index,
                        table_title,
                        vlm_table['description'],
                        len(vlm_table['rows']),
                        len(vlm_table['headers']),
                        "VLM Extracted"
                    ])
                    sheet_mapping[table_title] = sheet_name
                    sheet_index += 1

            if toc_data:
                toc_ws = wb.create_sheet(title="Table_of_Contents", index=0)

                toc_headers = ["Sheet #", "Table Name", "Description", "Rows", "Columns", "Source"]
                for col_idx, header in enumerate(toc_headers):
                    cell = toc_ws.cell(row=1, column=col_idx + 1, value=header)
                    cell.fill = HEADER_FILL
                    cell.font = HEADER_FONT
                    cell.alignment = HEADER_ALIGN

                for row_idx, row_data in enumerate(toc_data):
                    for col_idx, cell_value in enumerate(row_data):
                        cell = toc_ws.cell(row=row_idx + 2, column=col_idx + 1, value=cell_value)

                        if col_idx == 1 and cell_value in sheet_mapping:
                            sheet_name = sheet_mapping[cell_value]

                            if ' ' in sheet_name or any(char in sheet_name for char in ['[', ']', '*', '?', ':', '\\', '/']):
                                hyperlink_ref = f"#'{sheet_name}'!A1"
                            else:
                                hyperlink_ref = f"#{sheet_name}!A1"

                            cell.hyperlink = Hyperlink(ref=hyperlink_ref, target=hyperlink_ref)
                            cell.font = Font(color="0000FF", underline="single")

                        if col_idx == 2:
                            cell.alignment = Alignment(wrap_text=True, vertical="top")

                toc_ws.column_dimensions['A'].width = 10
                toc_ws.column_dimensions['B'].width = 30
                toc_ws.column_dimensions['C'].width = 60
                toc_ws.column_dimensions['D'].width = 10
                toc_ws.column_dimensions['E'].width = 10
                toc_ws.column_dimensions['F'].width = 15

                for row_idx in range(2, len(toc_data) + 2):
                    toc_ws.row_dimensions[row_idx].height = 30

            excel_path = output_dir / "tables.xlsx"
            wb.save(excel_path)

        except Exception as e:
            print(f"❌ Error creating Excel file: {e}")


    def _get_heading_level(self, style_name: str) -> int:
        """Extract heading level from style name."""
        if style_name.startswith('Heading'):
            try:
                return int(style_name.split()[-1])
            except (ValueError, IndexError):
                return 1
        return 0

    def _extract_formatting(self, paragraph: Paragraph) -> Dict[str, Any]:
        """Extract formatting information from paragraph."""
        formatting = {
            'bold': False,
            'italic': False,
            'underline': False,
            'font_size': None,
            'font_name': None
        }

        try:
            for run in paragraph.runs:
                if run.bold:
                    formatting['bold'] = True
                if run.italic:
                    formatting['italic'] = True
                if run.underline:
                    formatting['underline'] = True
                if run.font.size:
                    formatting['font_size'] = run.font.size.pt
                if run.font.name:
                    formatting['font_name'] = run.font.name
        except Exception:
            pass

        return formatting

    def _generate_html_table(self, table_data: List[List[str]]) -> str:
        """Generate HTML table from table data."""
        if not table_data:
            return ""

        html = ["<table border='1'>"]

        for row_idx, row in enumerate(table_data):
            html.append("<tr>")
            for cell in row:
                tag = "th" if row_idx == 0 else "td"
                html.append(f"<{tag}>{cell}</{tag}>")
            html.append("</tr>")

        html.append("</table>")
        return '\n'.join(html)

__init__(*, vlm=None, extract_images=True, preserve_formatting=True, table_detection=True, export_excel=True)

Initialize the StructuredDOCXParser with processing configuration.

:param vlm: VLM engine instance (VLMStructuredExtractor). If None, VLM processing is disabled.
:param extract_images: Whether to extract embedded images (default: True)
:param preserve_formatting: Whether to preserve text formatting in output (default: True)
:param table_detection: Whether to detect and extract tables (default: True)
:param export_excel: Whether to export tables to Excel file (default: True)

Source code in doctra/parsers/structured_docx_parser.py
def __init__(
    self,
    *,
    vlm: Optional[VLMStructuredExtractor] = None,
    extract_images: bool = True,
    preserve_formatting: bool = True,
    table_detection: bool = True,
    export_excel: bool = True,
):
    """
    Initialize the StructuredDOCXParser with processing configuration.

    :param vlm: VLM engine instance (VLMStructuredExtractor). If None, VLM processing is disabled.
    :param extract_images: Whether to extract embedded images (default: True)
    :param preserve_formatting: Whether to preserve text formatting in output (default: True)
    :param table_detection: Whether to detect and extract tables (default: True)
    :param export_excel: Whether to export tables to Excel file (default: True)
    """
    if Document is None:
        raise ImportError("python-docx is required for DOCX parsing. Install with: pip install python-docx")

    self.extract_images = extract_images
    self.preserve_formatting = preserve_formatting
    self.table_detection = table_detection
    self.export_excel = export_excel

    # Initialize VLM engine - use provided instance or None
    if vlm is None:
        self.vlm = None
    elif isinstance(vlm, VLMStructuredExtractor):
        self.vlm = vlm
    else:
        raise TypeError(
            f"vlm must be an instance of VLMStructuredExtractor or None, "
            f"got {type(vlm).__name__}"
        )

parse(docx_path)

Parse a DOCX document and extract all content.

:param docx_path: Path to the DOCX file to parse

Source code in doctra/parsers/structured_docx_parser.py
def parse(self, docx_path: str) -> None:
    """
    Parse a DOCX document and extract all content.

    :param docx_path: Path to the DOCX file to parse
    """
    if not os.path.exists(docx_path):
        raise FileNotFoundError(f"DOCX file not found: {docx_path}")

    docx_path = Path(docx_path)
    output_dir = Path(f"outputs/{docx_path.stem}")
    output_dir.mkdir(parents=True, exist_ok=True)

    print(f"📄 Processing DOCX: {docx_path.name}")

    try:
        doc = Document(docx_path)

        document_data = self._extract_document_structure(doc)

        images_data = []
        if self.extract_images:
            images_data = self._extract_images(doc, output_dir)

        tables_data = [elem for elem in document_data['elements'] if elem['type'] == 'table']

        if self.vlm is not None and images_data:
            total_steps = len(images_data)
        else:
            total_steps = 1

        progress_bar = tqdm(total=total_steps, desc="Processing DOCX", unit="image")

        vlm_extracted_data = []
        if self.vlm is not None and images_data:
            vlm_extracted_data = self._process_vlm_data(images_data, output_dir, progress_bar)
        else:
            progress_bar.update(1)

        progress_bar.close()

        self._generate_markdown_output(document_data, images_data, output_dir, vlm_extracted_data)
        self._generate_html_output(document_data, images_data, output_dir, vlm_extracted_data)

        if self.export_excel:
            if vlm_extracted_data:
                self._generate_excel_output_with_vlm(tables_data, vlm_extracted_data, output_dir)
            else:
                self._generate_excel_output(tables_data, output_dir)

        print("✅ DOCX parsing completed successfully!")
        print(f"📊 Extracted: {len(document_data.get('paragraphs', []))} paragraphs, "
              f"{len(tables_data)} tables, {len(images_data)} images")

    except Exception as e:
        print(f"❌ Error parsing DOCX: {e}")
        raise

Quick Reference

StructuredPDFParser

from doctra import StructuredPDFParser
from doctra.engines.ocr import PytesseractOCREngine, PaddleOCREngine
from doctra.engines.vlm.service import VLMStructuredExtractor

# Initialize OCR engine (optional - defaults to PyTesseract if None)
ocr_engine = PytesseractOCREngine(lang="eng", psm=4, oem=3)

# Initialize VLM engine (optional - None to disable VLM)
vlm_engine = VLMStructuredExtractor(
    vlm_provider="openai",
    vlm_model="gpt-4o",  # Optional
    api_key="your-api-key"
)

parser = StructuredPDFParser(
    # Layout Detection
    layout_model_name="PP-DocLayout_plus-L",
    dpi=200,
    min_score=0.0,

    # OCR Engine (pass the initialized engine instance; None uses the default)
    ocr_engine=ocr_engine,

    # VLM Engine (pass the initialized engine instance; None disables VLM)
    vlm=vlm_engine,

    # Split Table Merging
    merge_split_tables=False,
    bottom_threshold_ratio=0.20,
    top_threshold_ratio=0.15,
    max_gap_ratio=0.25,
    column_alignment_tolerance=10.0,
    min_merge_confidence=0.65,

    # Output Settings
    box_separator="\n"
)

# Parse document
parser.parse(
    "document.pdf",
    output_base_dir="outputs"
)

# Visualize layout
parser.display_pages_with_boxes(
    "document.pdf",
    num_pages=3,
    cols=2,
    page_width=800,
    spacing=40,
    save_path=None
)

EnhancedPDFParser

from doctra import EnhancedPDFParser
from doctra.engines.vlm.service import VLMStructuredExtractor

# Initialize VLM engine (optional)
vlm_engine = VLMStructuredExtractor(
    vlm_provider="openai",
    api_key="your-api-key"
)

parser = EnhancedPDFParser(
    # Image Restoration
    use_image_restoration=True,
    restoration_task="appearance",
    restoration_device=None,  # "cuda", "cpu", or None (auto-detect)
    restoration_dpi=200,

    # VLM Engine (pass the initialized engine instance; None disables VLM)
    vlm=vlm_engine,

    # Layout Detection
    layout_model_name="PP-DocLayout_plus-L",
    dpi=200,
    min_score=0.0,

    # OCR Engine (None uses the default PyTesseract engine)
    ocr_engine=None,

    # Split Table Merging
    merge_split_tables=False,
    bottom_threshold_ratio=0.20,
    top_threshold_ratio=0.15,
    max_gap_ratio=0.25,
    column_alignment_tolerance=10.0,
    min_merge_confidence=0.65,

    # Output Settings
    box_separator="\n"
)

# Parse with enhancement
parser.parse(
    "document.pdf",
    output_base_dir="outputs"
)

ChartTablePDFParser

from doctra import ChartTablePDFParser
from doctra.engines.vlm.service import VLMStructuredExtractor

# Initialize VLM engine (optional)
vlm_engine = VLMStructuredExtractor(
    vlm_provider="openai",
    api_key="your-api-key"
)

parser = ChartTablePDFParser(
    # Extraction Settings
    extract_charts=True,
    extract_tables=True,

    # VLM Engine (pass the initialized engine instance; None disables VLM)
    vlm=vlm_engine,

    # Layout Detection
    layout_model_name="PP-DocLayout_plus-L",
    dpi=200,
    min_score=0.0,

    # Split Table Merging
    merge_split_tables=False,
    bottom_threshold_ratio=0.20,
    top_threshold_ratio=0.15,
    max_gap_ratio=0.25,
    column_alignment_tolerance=10.0,
    min_merge_confidence=0.65,
)

# Extract charts/tables
parser.parse(
    "document.pdf",
    output_base_dir="outputs"
)

StructuredDOCXParser

from doctra import StructuredDOCXParser
from doctra.engines.vlm.service import VLMStructuredExtractor

# Initialize VLM engine (optional)
vlm_engine = VLMStructuredExtractor(
    vlm_provider="openai",
    api_key="your-api-key"
)

parser = StructuredDOCXParser(
    # VLM Engine (pass the initialized engine instance; None disables VLM)
    vlm=vlm_engine,

    # Processing Options
    extract_images=True,
    preserve_formatting=True,
    table_detection=True,
    export_excel=True
)

# Parse DOCX document
parser.parse("document.docx")

Parameter Reference

Layout Detection Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| layout_model_name | str | "PP-DocLayout_plus-L" | PaddleOCR layout detection model |
| dpi | int | 200 | Image resolution for rendering PDF pages |
| min_score | float | 0.0 | Minimum confidence score for detected elements |
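The dpi setting scales the rendered page images directly. A quick sanity check (plain arithmetic, independent of Doctra; the helper name is illustrative) of what a given DPI implies for a US Letter page:

```python
def rendered_size(width_in: float, height_in: float, dpi: int) -> tuple:
    """Pixel dimensions of a page rendered at the given DPI."""
    return (round(width_in * dpi), round(height_in * dpi))

# US Letter (8.5 x 11 in) at the default dpi=200
print(rendered_size(8.5, 11, 200))  # (1700, 2200)
```

Higher DPI improves layout-detection and OCR fidelity at the cost of memory and processing time.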

OCR Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| ocr_engine | Optional[Union[PytesseractOCREngine, PaddleOCREngine]] | None | OCR engine instance. If None, creates a default PytesseractOCREngine with lang="eng", psm=4, oem=3 |

OCR Engine Configuration:

OCR engines must be initialized externally and passed to the parser. This uses a dependency injection pattern for clearer API design.

PytesseractOCREngine Parameters:

- lang (str, default: "eng"): Tesseract language code (e.g., "eng", "fra", "spa", "deu", or multiple: "eng+fra")
- psm (int, default: 4): Page segmentation mode (3=Automatic, 4=Single column, 6=Uniform block, 11=Sparse text, 12=Sparse with OSD)
- oem (int, default: 3): OCR engine mode (0=Legacy, 1=Neural nets LSTM, 3=Default both)
- extra_config (str, default: ""): Additional Tesseract configuration string

PaddleOCREngine Parameters:

- device (str, default: "gpu"): Device for OCR processing ("cpu" or "gpu")
- use_doc_orientation_classify (bool, default: False): Enable document orientation classification
- use_doc_unwarping (bool, default: False): Enable text image rectification
- use_textline_orientation (bool, default: False): Enable text line orientation classification

Example:

from doctra.engines.ocr import PytesseractOCREngine, PaddleOCREngine

# PyTesseract
tesseract_ocr = PytesseractOCREngine(lang="eng", psm=4, oem=3)
parser = StructuredPDFParser(ocr_engine=tesseract_ocr)

# PaddleOCR
paddle_ocr = PaddleOCREngine(device="gpu")
parser = StructuredPDFParser(ocr_engine=paddle_ocr)

Note: When using PaddleOCR, PaddleOCR 3.0's PP-OCRv5_server model is used by default. Models are automatically downloaded on first use.

VLM Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| vlm | Optional[VLMStructuredExtractor] | None | VLM engine instance. If None, VLM processing is disabled. |

VLM Engine Configuration:

VLM engines must be initialized externally and passed to the parser. This uses a dependency injection pattern for clearer API design.

VLMStructuredExtractor Parameters:

- vlm_provider (str, required): VLM provider to use ("openai", "gemini", "anthropic", "openrouter", "qianfan", "ollama")
- vlm_model (str, optional): Model name to use (defaults to provider-specific defaults)
- api_key (str, optional): API key for the VLM provider (required for all providers except Ollama)

Example:

from doctra.engines.vlm.service import VLMStructuredExtractor

# Initialize VLM engine
vlm_engine = VLMStructuredExtractor(
    vlm_provider="openai",
    vlm_model="gpt-4o",  # Optional
    api_key="your-api-key"
)

# Pass to parser
parser = StructuredPDFParser(vlm=vlm_engine)

Image Restoration Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| use_image_restoration | bool | True | Enable image restoration |
| restoration_task | str | "appearance" | Restoration task type |
| restoration_device | str | None | Device: "cuda", "cpu", or None (auto-detect) |
| restoration_dpi | int | 200 | DPI for restoration processing |

Split Table Merging Parameters

Available for both StructuredPDFParser and EnhancedPDFParser.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| merge_split_tables | bool | False | Enable automatic detection and merging of tables split across pages |
| bottom_threshold_ratio | float | 0.20 | Ratio (0-1) for detecting tables near the bottom of a page. Tables within this ratio from the bottom are considered candidates. |
| top_threshold_ratio | float | 0.15 | Ratio (0-1) for detecting tables near the top of a page. Tables within this ratio from the top are considered candidates. |
| max_gap_ratio | float | 0.25 | Maximum allowed gap between table segments as a ratio of page height. Accounts for headers, footers, and page margins. |
| column_alignment_tolerance | float | 10.0 | Pixel tolerance for column alignment validation when comparing table structures. |
| min_merge_confidence | float | 0.65 | Minimum confidence score (0-1) required to merge two table segments. Higher values are more conservative. |
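To build intuition for the ratio parameters, the following Doctra-independent sketch (the function name is illustrative, not part of the API) converts the defaults into pixel thresholds for one rendered page:

```python
def merge_bands(page_height_px, bottom_threshold_ratio=0.20,
                top_threshold_ratio=0.15, max_gap_ratio=0.25):
    """Pixel thresholds implied by the ratio parameters for one page."""
    return {
        # A table ending below this y is "too close to the bottom".
        "bottom_band_start": page_height_px * (1 - bottom_threshold_ratio),
        # A table starting above this y is "too close to the top".
        "top_band_end": page_height_px * top_threshold_ratio,
        # Largest vertical gap (in px) still eligible for merging.
        "max_gap_px": page_height_px * max_gap_ratio,
    }

bands = merge_bands(2200)  # e.g., a Letter page rendered at 200 DPI
```

A table ending inside the bottom band of one page and another starting inside the top band of the next page are merge candidates, subject to column alignment and the confidence threshold.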

Extraction Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| extract_charts | bool | True | Extract chart elements |
| extract_tables | bool | True | Extract table elements |

DOCX Processing Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| extract_images | bool | True | Extract embedded images from DOCX |
| preserve_formatting | bool | True | Preserve text formatting in output |
| table_detection | bool | True | Detect and extract tables |
| export_excel | bool | True | Export tables to Excel file |

Output Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| box_separator | str | "\n" | Separator between detected elements |
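Conceptually, box_separator is just the string placed between consecutive recognized text boxes when they are joined into the textual output (a Doctra-independent illustration with made-up box contents):

```python
# Recognized text boxes, in reading order (illustrative values).
boxes = ["First text box.", "Second text box.", "Third text box."]

# Default separator: a single newline between boxes.
joined_default = "\n".join(boxes)

# A blank line between boxes instead.
joined_blank = "\n\n".join(boxes)

print(joined_default)
```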

Return Values

parse() Method

Returns: None

Generates output files in the specified output_base_dir:

outputs/
└── <document_name>/
    ├── full_parse/  # or 'enhanced_parse/', 'structured_parsing/'
    │   ├── result.md
    │   ├── result.html
    │   ├── tables.xlsx  # If VLM enabled
    │   ├── tables.html  # If VLM enabled
    │   ├── vlm_items.json  # If VLM enabled
    │   └── images/
    │       ├── figures/
    │       ├── charts/
    │       └── tables/
For DOCX parsing, generates:

outputs/
└── <document_name>/
    ├── document.md
    ├── document.html
    ├── tables.xlsx  # With Table of Contents
    └── images/
        ├── image1.png
        ├── image2.jpg
        └── ...

display_pages_with_boxes() Method

Returns: None

Displays or saves visualization of layout detection.

Error Handling

All parsers may raise:

  • FileNotFoundError: PDF file not found
  • ValueError: Invalid parameter values
  • RuntimeError: Processing errors (e.g., Poppler not found)
  • APIError: VLM API errors (when VLM enabled)

Example error handling:

from doctra import StructuredPDFParser

parser = StructuredPDFParser()

try:
    parser.parse("document.pdf")
except FileNotFoundError:
    print("PDF file not found!")
except ValueError as e:
    print(f"Invalid parameter: {e}")
except RuntimeError as e:
    print(f"Processing error: {e}")

Examples

See the Examples section for detailed usage examples.