Parsers API Reference

Complete API documentation for all Doctra parsers.

StructuredPDFParser

The base parser for comprehensive PDF document processing.

doctra.parsers.structured_pdf_parser.StructuredPDFParser

Comprehensive PDF parser for extracting all types of content.

Processes PDF documents to extract text, tables, charts, and figures.
Uses OCR for text extraction and, optionally, a vision-language model (VLM)
to convert charts and tables into structured data.

:param use_vlm: Whether to use VLM for structured data extraction (default: False)
:param vlm_provider: VLM provider to use ("gemini", "openai", "anthropic", or "openrouter", default: "gemini")
:param vlm_model: Model name to use (defaults to provider-specific defaults)
:param vlm_api_key: API key for VLM provider (required if use_vlm is True)
:param layout_model_name: Layout detection model name (default: "PP-DocLayout_plus-L")
:param dpi: DPI for PDF rendering (default: 200)
:param min_score: Minimum confidence score for layout detection (default: 0.0)
:param ocr_lang: OCR language code (default: "eng")
:param ocr_psm: Tesseract page segmentation mode (default: 4)
:param ocr_oem: Tesseract OCR engine mode (default: 3)
:param ocr_extra_config: Additional Tesseract configuration (default: "")
:param box_separator: Separator between text boxes in output (default: "\n")

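The three OCR parameters correspond to Tesseract's `--psm` and `--oem` command-line options plus any free-form extras. How `PytesseractOCREngine` assembles them internally is not shown on this page, so the sketch below is only an assumption about one plausible mapping; `build_tesseract_config` is a hypothetical helper, not part of Doctra.

```python
def build_tesseract_config(psm: int = 4, oem: int = 3, extra_config: str = "") -> str:
    """Join Tesseract options into one config string (hypothetical helper)."""
    parts = [f"--psm {psm}", f"--oem {oem}"]
    # Append free-form extras only when something was actually supplied.
    if extra_config.strip():
        parts.append(extra_config.strip())
    return " ".join(parts)


# Defaults taken from the parameter list above.
default_config = build_tesseract_config()
# A customised variant: single uniform block of text, keep inter-word spacing.
custom_config = build_tesseract_config(psm=6, extra_config="-c preserve_interword_spaces=1")

print(default_config)
print(custom_config)
```

A string of this shape is what pytesseract accepts through its `config` argument, which is why the extra options are passed through unchanged.
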
Source code in doctra/parsers/structured_pdf_parser.py
class StructuredPDFParser:
    """
    Comprehensive PDF parser for extracting all types of content.

    Processes PDF documents to extract text, tables, charts, and figures.
    Supports OCR for text extraction and optional VLM processing for
    converting visual elements into structured data.

    :param use_vlm: Whether to use VLM for structured data extraction (default: False)
    :param vlm_provider: VLM provider to use ("gemini", "openai", "anthropic", or "openrouter", default: "gemini")
    :param vlm_model: Model name to use (defaults to provider-specific defaults)
    :param vlm_api_key: API key for VLM provider (required if use_vlm is True)
    :param layout_model_name: Layout detection model name (default: "PP-DocLayout_plus-L")
    :param dpi: DPI for PDF rendering (default: 200)
    :param min_score: Minimum confidence score for layout detection (default: 0.0)
    :param ocr_lang: OCR language code (default: "eng")
    :param ocr_psm: Tesseract page segmentation mode (default: 4)
    :param ocr_oem: Tesseract OCR engine mode (default: 3)
    :param ocr_extra_config: Additional Tesseract configuration (default: "")
    :param box_separator: Separator between text boxes in output (default: "\n")
    """

    def __init__(
            self,
            *,
            use_vlm: bool = False,
            vlm_provider: str = "gemini",
            vlm_model: str | None = None,
            vlm_api_key: str | None = None,
            layout_model_name: str = "PP-DocLayout_plus-L",
            dpi: int = 200,
            min_score: float = 0.0,
            ocr_lang: str = "eng",
            ocr_psm: int = 4,
            ocr_oem: int = 3,
            ocr_extra_config: str = "",
            box_separator: str = "\n",
    ):
        """
        Initialize the StructuredPDFParser with processing configuration.

        :param use_vlm: Whether to use VLM for structured data extraction (default: False)
        :param vlm_provider: VLM provider to use ("gemini", "openai", "anthropic", or "openrouter", default: "gemini")
        :param vlm_model: Model name to use (defaults to provider-specific defaults)
        :param vlm_api_key: API key for VLM provider (required if use_vlm is True)
        :param layout_model_name: Layout detection model name (default: "PP-DocLayout_plus-L")
        :param dpi: DPI for PDF rendering (default: 200)
        :param min_score: Minimum confidence score for layout detection (default: 0.0)
        :param ocr_lang: OCR language code (default: "eng")
        :param ocr_psm: Tesseract page segmentation mode (default: 4)
        :param ocr_oem: Tesseract OCR engine mode (default: 3)
        :param ocr_extra_config: Additional Tesseract configuration (default: "")
        :param box_separator: Separator between text boxes in output (default: "\n")
        """
        self.layout_engine = PaddleLayoutEngine(model_name=layout_model_name)
        self.dpi = dpi
        self.min_score = min_score
        self.ocr_engine = PytesseractOCREngine(
            lang=ocr_lang, psm=ocr_psm, oem=ocr_oem, extra_config=ocr_extra_config
        )
        self.box_separator = box_separator
        self.use_vlm = use_vlm
        self.vlm = None
        if self.use_vlm:
            try:
                self.vlm = VLMStructuredExtractor(
                    vlm_provider=vlm_provider,
                    vlm_model=vlm_model,
                    api_key=vlm_api_key,
                )
            except Exception:
                # VLM setup failed: continue without it, so charts and tables stay as cropped images
                self.vlm = None

    def parse(self, pdf_path: str) -> None:
        """
        Parse a PDF document and extract all content types.

        :param pdf_path: Path to the input PDF file
        :return: None
        """
        pdf_filename = os.path.splitext(os.path.basename(pdf_path))[0]
        out_dir = f"outputs/{pdf_filename}/full_parse"

        os.makedirs(out_dir, exist_ok=True)
        ensure_output_dirs(out_dir, IMAGE_SUBDIRS)

        pages: List[LayoutPage] = self.layout_engine.predict_pdf(
            pdf_path, batch_size=1, layout_nms=True, dpi=self.dpi, min_score=self.min_score
        )
        pil_pages = [im for (im, _, _) in render_pdf_to_images(pdf_path, dpi=self.dpi)]

        fig_count = sum(sum(1 for b in p.boxes if b.label == "figure") for p in pages)
        chart_count = sum(sum(1 for b in p.boxes if b.label == "chart") for p in pages)
        table_count = sum(sum(1 for b in p.boxes if b.label == "table") for p in pages)

        md_lines: List[str] = ["# Extracted Content\n"]
        html_lines: List[str] = ["<h1>Extracted Content</h1>"]  # For direct HTML generation
        structured_items: List[Dict[str, Any]] = []

        charts_desc = "Charts (VLM → table)" if self.use_vlm else "Charts (cropped)"
        tables_desc = "Tables (VLM → table)" if self.use_vlm else "Tables (cropped)"
        figures_desc = "Figures (cropped)"

        with ExitStack() as stack:
            is_notebook = "ipykernel" in sys.modules or "jupyter" in sys.modules
            is_terminal = hasattr(sys.stdout, 'isatty') and sys.stdout.isatty()
            if is_notebook:
                charts_bar = stack.enter_context(
                    create_notebook_friendly_bar(total=chart_count, desc=charts_desc)) if chart_count else None
                tables_bar = stack.enter_context(
                    create_notebook_friendly_bar(total=table_count, desc=tables_desc)) if table_count else None
                figures_bar = stack.enter_context(
                    create_notebook_friendly_bar(total=fig_count, desc=figures_desc)) if fig_count else None
            else:
                charts_bar = stack.enter_context(
                    create_beautiful_progress_bar(total=chart_count, desc=charts_desc, leave=True)) if chart_count else None
                tables_bar = stack.enter_context(
                    create_beautiful_progress_bar(total=table_count, desc=tables_desc, leave=True)) if table_count else None
                figures_bar = stack.enter_context(
                    create_beautiful_progress_bar(total=fig_count, desc=figures_desc, leave=True)) if fig_count else None

            for p in pages:
                page_num = p.page_index
                page_img: Image.Image = pil_pages[page_num - 1]
                md_lines.append(f"\n## Page {page_num}\n")
                html_lines.append(f"<h2>Page {page_num}</h2>")

                for i, box in enumerate(sorted(p.boxes, key=reading_order_key), start=1):
                    if box.label in EXCLUDE_LABELS:
                        img_path = save_box_image(page_img, box, out_dir, page_num, i, IMAGE_SUBDIRS)
                        abs_img_path = os.path.abspath(img_path)
                        rel = os.path.relpath(abs_img_path, out_dir)

                        if box.label == "figure":
                            figure_md = f"![Figure — page {page_num}]({rel})\n"
                            figure_html = f'<img src="{rel}" alt="Figure — page {page_num}" />'
                            md_lines.append(figure_md)
                            html_lines.append(figure_html)
                            if figures_bar: figures_bar.update(1)

                        elif box.label == "chart":
                            if self.use_vlm and self.vlm:
                                wrote_table = False
                                try:
                                    chart = self.vlm.extract_chart(abs_img_path)
                                    item = to_structured_dict(chart)
                                    if item:
                                        # Add page and type information to structured item
                                        item["page"] = page_num
                                        item["type"] = "Chart"
                                        structured_items.append(item)

                                        # Generate both markdown and HTML tables
                                        table_md = render_markdown_table(item.get("headers"), item.get("rows"),
                                                                         title=item.get("title"))
                                        table_html = render_html_table(item.get("headers"), item.get("rows"),
                                                                       title=item.get("title"))

                                        md_lines.append(table_md)
                                        html_lines.append(table_html)
                                        wrote_table = True
                                except Exception:
                                    # Extraction failed: fall back to the cropped image below
                                    pass
                                if not wrote_table:
                                    chart_md = f"![Chart — page {page_num}]({rel})\n"
                                    chart_html = f'<img src="{rel}" alt="Chart — page {page_num}" />'
                                    md_lines.append(chart_md)
                                    html_lines.append(chart_html)
                            else:
                                chart_md = f"![Chart — page {page_num}]({rel})\n"
                                chart_html = f'<img src="{rel}" alt="Chart — page {page_num}" />'
                                md_lines.append(chart_md)
                                html_lines.append(chart_html)
                            if charts_bar: charts_bar.update(1)

                        elif box.label == "table":
                            if self.use_vlm and self.vlm:
                                wrote_table = False
                                try:
                                    table = self.vlm.extract_table(abs_img_path)
                                    item = to_structured_dict(table)
                                    if item:
                                        # Add page and type information to structured item
                                        item["page"] = page_num
                                        item["type"] = "Table"
                                        structured_items.append(item)

                                        # Generate both markdown and HTML tables
                                        table_md = render_markdown_table(item.get("headers"), item.get("rows"),
                                                                         title=item.get("title"))
                                        table_html = render_html_table(item.get("headers"), item.get("rows"),
                                                                       title=item.get("title"))

                                        md_lines.append(table_md)
                                        html_lines.append(table_html)
                                        wrote_table = True
                                except Exception:
                                    # Extraction failed: fall back to the cropped image below
                                    pass
                                if not wrote_table:
                                    table_md = f"![Table — page {page_num}]({rel})\n"
                                    table_html = f'<img src="{rel}" alt="Table — page {page_num}" />'
                                    md_lines.append(table_md)
                                    html_lines.append(table_html)
                            else:
                                table_md = f"![Table — page {page_num}]({rel})\n"
                                table_html = f'<img src="{rel}" alt="Table — page {page_num}" />'
                                md_lines.append(table_md)
                                html_lines.append(table_html)
                            if tables_bar: tables_bar.update(1)
                    else:
                        text = ocr_box_text(self.ocr_engine, page_img, box)
                        if text:
                            md_lines.append(text)
                            md_lines.append(self.box_separator if self.box_separator else "")
                            # Convert text to HTML (basic conversion)
                            html_text = text.replace('\n', '<br>')
                            html_lines.append(f"<p>{html_text}</p>")
                            if self.box_separator:
                                html_lines.append("<br>")

        md_path = write_markdown(md_lines, out_dir)

        # Use HTML lines if VLM is enabled for better table formatting
        if self.use_vlm and html_lines:
            html_path = write_html_from_lines(html_lines, out_dir)
        else:
            html_path = write_html(md_lines, out_dir)

        excel_path = None
        html_structured_path = None
        if self.use_vlm and structured_items:
            excel_path = os.path.join(out_dir, "tables.xlsx")
            write_structured_excel(excel_path, structured_items)
            html_structured_path = os.path.join(out_dir, "tables.html")
            write_structured_html(html_structured_path, structured_items)

        print("✅ Parsing completed successfully!")
        print(f"📁 Output directory: {out_dir}")

    def display_pages_with_boxes(self, pdf_path: str, num_pages: int = 3, cols: int = 2,
                                 page_width: int = 800, spacing: int = 40,
                                 save_path: str | None = None) -> Image.Image | None:
        """
        Display the first N pages of a PDF with bounding boxes and labels overlaid in a grid layout.

        Creates a visualization showing layout detection results with bounding boxes,
        labels, and confidence scores overlaid on the PDF pages in a grid format.

        :param pdf_path: Path to the input PDF file
        :param num_pages: Number of pages to display (default: 3)
        :param cols: Number of columns in the grid layout (default: 2)
        :param page_width: Width to resize each page to in pixels (default: 800)
        :param spacing: Spacing between pages in pixels (default: 40)
        :param save_path: Optional path to save the visualization (if None, the image is shown instead)
        :return: The composed grid image, or None if there are no pages to display
        """
        pages: List[LayoutPage] = self.layout_engine.predict_pdf(
            pdf_path, batch_size=1, layout_nms=True, dpi=self.dpi, min_score=self.min_score
        )
        pil_pages = [im for (im, _, _) in render_pdf_to_images(pdf_path, dpi=self.dpi)]

        pages_to_show = min(num_pages, len(pages))

        if pages_to_show == 0:
            print("No pages to display")
            return

        rows = (pages_to_show + cols - 1) // cols

        used_labels = set()
        for idx in range(pages_to_show):
            page = pages[idx]
            for box in page.boxes:
                used_labels.add(box.label.lower())

        base_colors = ['#3B82F6', '#EF4444', '#10B981', '#F59E0B', '#8B5CF6',
                       '#F97316', '#EC4899', '#6B7280', '#84CC16', '#06B6D4',
                       '#DC2626', '#059669', '#7C3AED', '#DB2777', '#0891B2']

        dynamic_label_colors = {}
        for i, label in enumerate(sorted(used_labels)):
            dynamic_label_colors[label] = base_colors[i % len(base_colors)]

        processed_pages = []

        for idx in range(pages_to_show):
            page = pages[idx]
            page_img = pil_pages[idx].copy()

            scale_factor = page_width / page_img.width
            new_height = int(page_img.height * scale_factor)
            page_img = page_img.resize((page_width, new_height), Image.LANCZOS)

            draw = ImageDraw.Draw(page_img)

            try:
                font = ImageFont.truetype("arial.ttf", 24)
                small_font = ImageFont.truetype("arial.ttf", 18)
            except OSError:  # arial.ttf could not be read on this system
                try:
                    font = ImageFont.load_default()
                    small_font = ImageFont.load_default()
                except Exception:
                    font = None
                    small_font = None

            for box in page.boxes:
                x1 = int(box.x1 * scale_factor)
                y1 = int(box.y1 * scale_factor)
                x2 = int(box.x2 * scale_factor)
                y2 = int(box.y2 * scale_factor)

                color = dynamic_label_colors.get(box.label.lower(), '#000000')

                draw.rectangle([x1, y1, x2, y2], outline=color, width=3)

                label_text = f"{box.label} ({box.score:.2f})"
                if font:
                    bbox = draw.textbbox((0, 0), label_text, font=small_font)
                    text_width = bbox[2] - bbox[0]
                    text_height = bbox[3] - bbox[1]
                else:
                    text_width = len(label_text) * 8
                    text_height = 15

                label_x = x1
                label_y = max(0, y1 - text_height - 8)

                padding = 4
                draw.rectangle([
                    label_x - padding,
                    label_y - padding,
                    label_x + text_width + padding,
                    label_y + text_height + padding
                ], fill='white', outline=color, width=2)

                draw.text((label_x, label_y), label_text, fill=color, font=small_font)

            title_text = f"Page {page.page_index} ({len(page.boxes)} boxes)"
            if font:
                title_bbox = draw.textbbox((0, 0), title_text, font=font)
                title_width = title_bbox[2] - title_bbox[0]
            else:
                title_width = len(title_text) * 12

            title_x = (page_width - title_width) // 2
            title_y = 10
            draw.rectangle([title_x - 10, title_y - 5, title_x + title_width + 10, title_y + 35],
                           fill='white', outline='#1F2937', width=2)
            draw.text((title_x, title_y), title_text, fill='#1F2937', font=font)

            processed_pages.append(page_img)

        legend_width = 250
        grid_width = cols * page_width + (cols - 1) * spacing
        total_width = grid_width + legend_width + spacing
        grid_height = rows * (processed_pages[0].height if processed_pages else 600) + (rows - 1) * spacing

        final_img = Image.new('RGB', (total_width, grid_height), '#F8FAFC')

        for idx, page_img in enumerate(processed_pages):
            row = idx // cols
            col = idx % cols

            x_pos = col * (page_width + spacing)
            y_pos = row * (page_img.height + spacing)

            final_img.paste(page_img, (x_pos, y_pos))

        legend_x = grid_width + spacing
        legend_y = 20

        draw_legend = ImageDraw.Draw(final_img)

        legend_title = "Element Types"
        if font:
            title_bbox = draw_legend.textbbox((0, 0), legend_title, font=font)
            title_width = title_bbox[2] - title_bbox[0]
            title_height = title_bbox[3] - title_bbox[1]
        else:
            title_width = len(legend_title) * 12
            title_height = 20

        legend_bg_height = len(used_labels) * 35 + title_height + 40
        draw_legend.rectangle([legend_x - 10, legend_y - 10,
                               legend_x + legend_width - 10, legend_y + legend_bg_height],
                              fill='white', outline='#E5E7EB', width=2)

        draw_legend.text((legend_x + 10, legend_y + 5), legend_title,
                         fill='#1F2937', font=font)

        current_y = legend_y + title_height + 20

        for label in sorted(used_labels):
            color = dynamic_label_colors[label]

            square_size = 20
            draw_legend.rectangle([legend_x + 10, current_y,
                                   legend_x + 10 + square_size, current_y + square_size],
                                  fill=color, outline='#6B7280', width=1)

            draw_legend.text((legend_x + 40, current_y + 2), label.title(),
                             fill='#374151', font=small_font)

            current_y += 30

        if save_path:
            final_img.save(save_path, quality=95, optimize=True)
            print(f"Layout visualization saved to: {save_path}")
        else:
            final_img.show()

        print(f"\n📊 Layout Detection Summary for {os.path.basename(pdf_path)}:")
        print(f"Pages processed: {pages_to_show}")

        total_counts = {}
        for idx in range(pages_to_show):
            page = pages[idx]
            for box in page.boxes:
                total_counts[box.label] = total_counts.get(box.label, 0) + 1

        print("\nTotal elements detected:")
        for label, count in sorted(total_counts.items()):
            print(f"  - {label}: {count}")

        return final_img
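
Because `parse` returns `None`, results are found on disk rather than in a return value. The source above builds the output directory from the PDF's base name; the sketch below reproduces just that path logic with the standard library. `output_dir_for` is a hypothetical name used for illustration.

```python
import os


def output_dir_for(pdf_path: str) -> str:
    """Mirror the directory naming used in parse(): outputs/<pdf name>/full_parse."""
    # Drop the directory part and the extension, keeping only the file stem.
    pdf_filename = os.path.splitext(os.path.basename(pdf_path))[0]
    return f"outputs/{pdf_filename}/full_parse"


result_dir = output_dir_for("reports/annual_2023.pdf")
print(result_dir)  # outputs/annual_2023/full_parse
```

The path is relative, so the `outputs` folder is created under the current working directory. When VLM extraction yields structured items, `tables.xlsx` and `tables.html` are written to the same directory.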

__init__(*, use_vlm=False, vlm_provider='gemini', vlm_model=None, vlm_api_key=None, layout_model_name='PP-DocLayout_plus-L', dpi=200, min_score=0.0, ocr_lang='eng', ocr_psm=4, ocr_oem=3, ocr_extra_config='', box_separator='\n')

    Initialize the StructuredPDFParser with processing configuration.

    :param use_vlm: Whether to use VLM for structured data extraction (default: False)
    :param vlm_provider: VLM provider to use ("gemini", "openai", "anthropic", or "openrouter", default: "gemini")
    :param vlm_model: Model name to use (defaults to provider-specific defaults)
    :param vlm_api_key: API key for VLM provider (required if use_vlm is True)
    :param layout_model_name: Layout detection model name (default: "PP-DocLayout_plus-L")
    :param dpi: DPI for PDF rendering (default: 200)
    :param min_score: Minimum confidence score for layout detection (default: 0.0)
    :param ocr_lang: OCR language code (default: "eng")
    :param ocr_psm: Tesseract page segmentation mode (default: 4)
    :param ocr_oem: Tesseract OCR engine mode (default: 3)
    :param ocr_extra_config: Additional Tesseract configuration (default: "")
    :param box_separator: Separator between text boxes in output (default: "\n")

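The leading `*` in the signature makes every argument keyword-only, so options cannot be passed positionally. The sketch below shows that behaviour with a stand-in function; `make_parser_config` is hypothetical and copies only a few of the defaults listed above.

```python
def make_parser_config(*, use_vlm: bool = False, dpi: int = 200, ocr_lang: str = "eng") -> dict:
    """Stand-in with the same keyword-only shape as the constructor (hypothetical)."""
    return {"use_vlm": use_vlm, "dpi": dpi, "ocr_lang": ocr_lang}


# Keyword arguments are accepted; unspecified options keep their defaults.
config = make_parser_config(dpi=300)
print(config)

positional_error = ""
try:
    make_parser_config(True)  # positional arguments are rejected
except TypeError as exc:
    positional_error = str(exc)
    print("TypeError:", positional_error)
```
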
Source code in doctra/parsers/structured_pdf_parser.py
def __init__(
        self,
        *,
        use_vlm: bool = False,
        vlm_provider: str = "gemini",
        vlm_model: str | None = None,
        vlm_api_key: str | None = None,
        layout_model_name: str = "PP-DocLayout_plus-L",
        dpi: int = 200,
        min_score: float = 0.0,
        ocr_lang: str = "eng",
        ocr_psm: int = 4,
        ocr_oem: int = 3,
        ocr_extra_config: str = "",
        box_separator: str = "\n",
):
    """
    Initialize the StructuredPDFParser with processing configuration.

    :param use_vlm: Whether to use VLM for structured data extraction (default: False)
    :param vlm_provider: VLM provider to use ("gemini", "openai", "anthropic", or "openrouter", default: "gemini")
    :param vlm_model: Model name to use (defaults to provider-specific defaults)
    :param vlm_api_key: API key for VLM provider (required if use_vlm is True)
    :param layout_model_name: Layout detection model name (default: "PP-DocLayout_plus-L")
    :param dpi: DPI for PDF rendering (default: 200)
    :param min_score: Minimum confidence score for layout detection (default: 0.0)
    :param ocr_lang: OCR language code (default: "eng")
    :param ocr_psm: Tesseract page segmentation mode (default: 4)
    :param ocr_oem: Tesseract OCR engine mode (default: 3)
    :param ocr_extra_config: Additional Tesseract configuration (default: "")
    :param box_separator: Separator between text boxes in output (default: "\n")
    """
    self.layout_engine = PaddleLayoutEngine(model_name=layout_model_name)
    self.dpi = dpi
    self.min_score = min_score
    self.ocr_engine = PytesseractOCREngine(
        lang=ocr_lang, psm=ocr_psm, oem=ocr_oem, extra_config=ocr_extra_config
    )
    self.box_separator = box_separator
    self.use_vlm = use_vlm
    self.vlm = None
    if self.use_vlm:
        try:
            self.vlm = VLMStructuredExtractor(
                vlm_provider=vlm_provider,
                vlm_model=vlm_model,
                api_key=vlm_api_key,
            )
        except Exception:
            # VLM setup failed: continue without it, so charts and tables stay as cropped images
            self.vlm = None
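
Note that the constructor catches any exception raised while creating the VLM extractor and sets `self.vlm` to `None` without reporting it, while `use_vlm` stays `True`. A caller who depends on structured output may therefore want to check the attribute after construction. The sketch below imitates that pattern with stand-in classes; `StubExtractor` and `ParserSketch` are hypothetical and exist only for this illustration.

```python
from typing import Optional


class StubExtractor:
    """Stand-in for an extractor whose setup fails when no API key is given."""

    def __init__(self, api_key: Optional[str] = None):
        if not api_key:
            raise ValueError("API key is required")


class ParserSketch:
    """Imitates the constructor's fallback: errors are swallowed and vlm stays None."""

    def __init__(self, *, use_vlm: bool = False, vlm_api_key: Optional[str] = None):
        self.use_vlm = use_vlm
        self.vlm = None
        if self.use_vlm:
            try:
                self.vlm = StubExtractor(api_key=vlm_api_key)
            except Exception:
                self.vlm = None


parser = ParserSketch(use_vlm=True)  # no key supplied, so setup fails quietly
vlm_ready = parser.use_vlm and parser.vlm is not None
print("VLM ready:", vlm_ready)
```

A check such as `vlm_ready` distinguishes "VLM requested" from "VLM actually available", which the `use_vlm` flag alone does not.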

display_pages_with_boxes(pdf_path, num_pages=3, cols=2, page_width=800, spacing=40, save_path=None)

Display the first N pages of a PDF with bounding boxes and labels overlaid in a modern grid layout.

Creates a visualization showing layout detection results with bounding boxes, labels, and confidence scores overlaid on the PDF pages in a grid format.

:param pdf_path: Path to the input PDF file
:param num_pages: Number of pages to display (default: 3)
:param cols: Number of columns in the grid layout (default: 2)
:param page_width: Width to resize each page to in pixels (default: 800)
:param spacing: Spacing between pages in pixels (default: 40)
:param save_path: Optional path to save the visualization (if None, the image is shown instead)
:return: The composed grid image, or None if there are no pages to display
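
The grid geometry follows from `cols`, `page_width`, and `spacing`: the row count is the ceiling of pages over columns, and each page is pasted at an offset determined by its row and column. The sketch below restates that arithmetic from the source in isolation; the function names are hypothetical, and the 250-pixel legend width is the constant hard-coded in the method.

```python
from typing import Tuple


def grid_rows(pages_to_show: int, cols: int) -> int:
    """Ceiling division, as in the method: (pages_to_show + cols - 1) // cols."""
    return (pages_to_show + cols - 1) // cols


def cell_origin(idx: int, cols: int, page_width: int, page_height: int,
                spacing: int) -> Tuple[int, int]:
    """Top-left paste position of the page at zero-based position idx."""
    row, col = divmod(idx, cols)
    return col * (page_width + spacing), row * (page_height + spacing)


def canvas_size(pages_to_show: int, cols: int, page_width: int, page_height: int,
                spacing: int, legend_width: int = 250) -> Tuple[int, int]:
    """Overall image size: the page grid plus a legend column on the right."""
    rows = grid_rows(pages_to_show, cols)
    grid_width = cols * page_width + (cols - 1) * spacing
    total_width = grid_width + legend_width + spacing
    grid_height = rows * page_height + (rows - 1) * spacing
    return total_width, grid_height


# Defaults from the signature, with an assumed resized page height of 1000 px.
size = canvas_size(pages_to_show=3, cols=2, page_width=800, page_height=1000, spacing=40)
third_page = cell_origin(2, cols=2, page_width=800, page_height=1000, spacing=40)
print(size, third_page)
```

Because the canvas height is computed from the first page's height, documents with mixed page sizes may not line up exactly.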

Source code in doctra/parsers/structured_pdf_parser.py
def display_pages_with_boxes(self, pdf_path: str, num_pages: int = 3, cols: int = 2,
                             page_width: int = 800, spacing: int = 40,
                             save_path: str | None = None) -> Image.Image | None:
    """
    Display the first N pages of a PDF with bounding boxes and labels overlaid in a grid layout.

    Creates a visualization showing layout detection results with bounding boxes,
    labels, and confidence scores overlaid on the PDF pages in a grid format.

    :param pdf_path: Path to the input PDF file
    :param num_pages: Number of pages to display (default: 3)
    :param cols: Number of columns in the grid layout (default: 2)
    :param page_width: Width to resize each page to in pixels (default: 800)
    :param spacing: Spacing between pages in pixels (default: 40)
    :param save_path: Optional path to save the visualization (if None, the image is shown instead)
    :return: The composed grid image, or None if there are no pages to display
    """
    pages: List[LayoutPage] = self.layout_engine.predict_pdf(
        pdf_path, batch_size=1, layout_nms=True, dpi=self.dpi, min_score=self.min_score
    )
    pil_pages = [im for (im, _, _) in render_pdf_to_images(pdf_path, dpi=self.dpi)]

    pages_to_show = min(num_pages, len(pages))

    if pages_to_show == 0:
        print("No pages to display")
        return

    rows = (pages_to_show + cols - 1) // cols

    used_labels = set()
    for idx in range(pages_to_show):
        page = pages[idx]
        for box in page.boxes:
            used_labels.add(box.label.lower())

    base_colors = ['#3B82F6', '#EF4444', '#10B981', '#F59E0B', '#8B5CF6',
                   '#F97316', '#EC4899', '#6B7280', '#84CC16', '#06B6D4',
                   '#DC2626', '#059669', '#7C3AED', '#DB2777', '#0891B2']

    dynamic_label_colors = {}
    for i, label in enumerate(sorted(used_labels)):
        dynamic_label_colors[label] = base_colors[i % len(base_colors)]

    processed_pages = []

    for idx in range(pages_to_show):
        page = pages[idx]
        page_img = pil_pages[idx].copy()

        scale_factor = page_width / page_img.width
        new_height = int(page_img.height * scale_factor)
        page_img = page_img.resize((page_width, new_height), Image.LANCZOS)

        draw = ImageDraw.Draw(page_img)

        try:
            font = ImageFont.truetype("arial.ttf", 24)
            small_font = ImageFont.truetype("arial.ttf", 18)
        except OSError:  # arial.ttf could not be read on this system
            try:
                font = ImageFont.load_default()
                small_font = ImageFont.load_default()
            except Exception:
                font = None
                small_font = None

        for box in page.boxes:
            x1 = int(box.x1 * scale_factor)
            y1 = int(box.y1 * scale_factor)
            x2 = int(box.x2 * scale_factor)
            y2 = int(box.y2 * scale_factor)

            color = dynamic_label_colors.get(box.label.lower(), '#000000')

            draw.rectangle([x1, y1, x2, y2], outline=color, width=3)

            label_text = f"{box.label} ({box.score:.2f})"
            if font:
                bbox = draw.textbbox((0, 0), label_text, font=small_font)
                text_width = bbox[2] - bbox[0]
                text_height = bbox[3] - bbox[1]
            else:
                text_width = len(label_text) * 8
                text_height = 15

            label_x = x1
            label_y = max(0, y1 - text_height - 8)

            padding = 4
            draw.rectangle([
                label_x - padding,
                label_y - padding,
                label_x + text_width + padding,
                label_y + text_height + padding
            ], fill='white', outline=color, width=2)

            draw.text((label_x, label_y), label_text, fill=color, font=small_font)

        title_text = f"Page {page.page_index} ({len(page.boxes)} boxes)"
        if font:
            title_bbox = draw.textbbox((0, 0), title_text, font=font)
            title_width = title_bbox[2] - title_bbox[0]
        else:
            title_width = len(title_text) * 12

        title_x = (page_width - title_width) // 2
        title_y = 10
        draw.rectangle([title_x - 10, title_y - 5, title_x + title_width + 10, title_y + 35],
                       fill='white', outline='#1F2937', width=2)
        draw.text((title_x, title_y), title_text, fill='#1F2937', font=font)

        processed_pages.append(page_img)

    legend_width = 250
    grid_width = cols * page_width + (cols - 1) * spacing
    total_width = grid_width + legend_width + spacing
    grid_height = rows * (processed_pages[0].height if processed_pages else 600) + (rows - 1) * spacing

    final_img = Image.new('RGB', (total_width, grid_height), '#F8FAFC')

    for idx, page_img in enumerate(processed_pages):
        row = idx // cols
        col = idx % cols

        x_pos = col * (page_width + spacing)
        y_pos = row * (page_img.height + spacing)

        final_img.paste(page_img, (x_pos, y_pos))

    legend_x = grid_width + spacing
    legend_y = 20

    draw_legend = ImageDraw.Draw(final_img)

    legend_title = "Element Types"
    if font:
        title_bbox = draw_legend.textbbox((0, 0), legend_title, font=font)
        title_width = title_bbox[2] - title_bbox[0]
        title_height = title_bbox[3] - title_bbox[1]
    else:
        title_width = len(legend_title) * 12
        title_height = 20

    legend_bg_height = len(used_labels) * 35 + title_height + 40
    draw_legend.rectangle([legend_x - 10, legend_y - 10,
                           legend_x + legend_width - 10, legend_y + legend_bg_height],
                          fill='white', outline='#E5E7EB', width=2)

    draw_legend.text((legend_x + 10, legend_y + 5), legend_title,
                     fill='#1F2937', font=font)

    current_y = legend_y + title_height + 20

    for label in sorted(used_labels):
        color = dynamic_label_colors[label]

        square_size = 20
        draw_legend.rectangle([legend_x + 10, current_y,
                               legend_x + 10 + square_size, current_y + square_size],
                              fill=color, outline='#6B7280', width=1)

        draw_legend.text((legend_x + 40, current_y + 2), label.title(),
                         fill='#374151', font=small_font)

        current_y += 30

    if save_path:
        final_img.save(save_path, quality=95, optimize=True)
        print(f"Layout visualization saved to: {save_path}")
    else:
        final_img.show()

    print(f"\n📊 Layout Detection Summary for {os.path.basename(pdf_path)}:")
    print(f"Pages processed: {pages_to_show}")

    total_counts = {}
    for idx in range(pages_to_show):
        page = pages[idx]
        for box in page.boxes:
            total_counts[box.label] = total_counts.get(box.label, 0) + 1

    print("\nTotal elements detected:")
    for label, count in sorted(total_counts.items()):
        print(f"  - {label}: {count}")

    return final_img
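
The grid arithmetic in the listing above (`row = idx // cols`, `col = idx % cols`, then offsets by page size plus spacing) can be isolated into two small functions. This is only a sketch: the names `grid_position` and `canvas_size` are hypothetical, all pages are assumed to share one height, and the row count is assumed to be a ceiling division, since the part of the source that computes `rows` is not shown here.

```python
def grid_position(idx, cols, page_width, page_height, spacing):
    """Top-left paste position of page `idx` in a row-major grid."""
    row, col = divmod(idx, cols)
    x_pos = col * (page_width + spacing)
    y_pos = row * (page_height + spacing)
    return x_pos, y_pos


def canvas_size(n_pages, cols, page_width, page_height, spacing, legend_width=250):
    """Overall image size: the page grid plus a legend column on the right."""
    # Assumption: rows is the ceiling of n_pages / cols, with at least one row.
    rows = max(1, -(-n_pages // cols))
    grid_width = cols * page_width + (cols - 1) * spacing
    grid_height = rows * page_height + (rows - 1) * spacing
    return grid_width + legend_width + spacing, grid_height


if __name__ == "__main__":
    # Three 800x1000 pages in two columns with 20 px spacing.
    for i in range(3):
        print(i, grid_position(i, 2, 800, 1000, 20))
    print(canvas_size(3, 2, 800, 1000, 20))
```

Keeping the layout math separate from the drawing calls makes it easy to check positions without rendering any images.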

parse(pdf_path)

Parse a PDF document and extract all content types.

:param pdf_path: Path to the input PDF file
:return: None
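
Because `parse` returns `None`, its results have to be read from disk. The sketch below mirrors how the source listing derives the output directory from the PDF file name; the helper name `default_output_dir` is hypothetical and not part of Doctra, and the path is relative to the current working directory.

```python
import os


def default_output_dir(pdf_path: str, mode: str = "full_parse") -> str:
    """Mirror the directory naming used by parse(): outputs/<pdf stem>/<mode>."""
    stem = os.path.splitext(os.path.basename(pdf_path))[0]
    return f"outputs/{stem}/{mode}"


if __name__ == "__main__":
    print(default_output_dir("reports/q3.final.pdf"))
```

Only the final extension is stripped, so a file named `q3.final.pdf` ends up under `outputs/q3.final/`.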

Source code in doctra/parsers/structured_pdf_parser.py
def parse(self, pdf_path: str) -> None:
    """
    Parse a PDF document and extract all content types.

    :param pdf_path: Path to the input PDF file
    :return: None
    """
    pdf_filename = os.path.splitext(os.path.basename(pdf_path))[0]
    out_dir = f"outputs/{pdf_filename}/full_parse"

    os.makedirs(out_dir, exist_ok=True)
    ensure_output_dirs(out_dir, IMAGE_SUBDIRS)

    pages: List[LayoutPage] = self.layout_engine.predict_pdf(
        pdf_path, batch_size=1, layout_nms=True, dpi=self.dpi, min_score=self.min_score
    )
    pil_pages = [im for (im, _, _) in render_pdf_to_images(pdf_path, dpi=self.dpi)]

    fig_count = sum(sum(1 for b in p.boxes if b.label == "figure") for p in pages)
    chart_count = sum(sum(1 for b in p.boxes if b.label == "chart") for p in pages)
    table_count = sum(sum(1 for b in p.boxes if b.label == "table") for p in pages)

    md_lines: List[str] = ["# Extracted Content\n"]
    html_lines: List[str] = ["<h1>Extracted Content</h1>"]  # For direct HTML generation
    structured_items: List[Dict[str, Any]] = []

    charts_desc = "Charts (VLM → table)" if self.use_vlm else "Charts (cropped)"
    tables_desc = "Tables (VLM → table)" if self.use_vlm else "Tables (cropped)"
    figures_desc = "Figures (cropped)"

    with ExitStack() as stack:
        is_notebook = "ipykernel" in sys.modules or "jupyter" in sys.modules
        is_terminal = hasattr(sys.stdout, 'isatty') and sys.stdout.isatty()
        if is_notebook:
            charts_bar = stack.enter_context(
                create_notebook_friendly_bar(total=chart_count, desc=charts_desc)) if chart_count else None
            tables_bar = stack.enter_context(
                create_notebook_friendly_bar(total=table_count, desc=tables_desc)) if table_count else None
            figures_bar = stack.enter_context(
                create_notebook_friendly_bar(total=fig_count, desc=figures_desc)) if fig_count else None
        else:
            charts_bar = stack.enter_context(
                create_beautiful_progress_bar(total=chart_count, desc=charts_desc, leave=True)) if chart_count else None
            tables_bar = stack.enter_context(
                create_beautiful_progress_bar(total=table_count, desc=tables_desc, leave=True)) if table_count else None
            figures_bar = stack.enter_context(
                create_beautiful_progress_bar(total=fig_count, desc=figures_desc, leave=True)) if fig_count else None

        for p in pages:
            page_num = p.page_index
            page_img: Image.Image = pil_pages[page_num - 1]
            md_lines.append(f"\n## Page {page_num}\n")
            html_lines.append(f"<h2>Page {page_num}</h2>")

            for i, box in enumerate(sorted(p.boxes, key=reading_order_key), start=1):
                if box.label in EXCLUDE_LABELS:
                    img_path = save_box_image(page_img, box, out_dir, page_num, i, IMAGE_SUBDIRS)
                    abs_img_path = os.path.abspath(img_path)
                    rel = os.path.relpath(abs_img_path, out_dir)

                    if box.label == "figure":
                        figure_md = f"![Figure — page {page_num}]({rel})\n"
                        figure_html = f'<img src="{rel}" alt="Figure — page {page_num}" />'
                        md_lines.append(figure_md)
                        html_lines.append(figure_html)
                        if figures_bar: figures_bar.update(1)

                    elif box.label == "chart":
                        if self.use_vlm and self.vlm:
                            wrote_table = False
                            try:
                                chart = self.vlm.extract_chart(abs_img_path)
                                item = to_structured_dict(chart)
                                if item:
                                    # Add page and type information to structured item
                                    item["page"] = page_num
                                    item["type"] = "Chart"
                                    structured_items.append(item)

                                    # Generate both markdown and HTML tables
                                    table_md = render_markdown_table(item.get("headers"), item.get("rows"),
                                                                     title=item.get("title"))
                                    table_html = render_html_table(item.get("headers"), item.get("rows"),
                                                                   title=item.get("title"))

                                    md_lines.append(table_md)
                                    html_lines.append(table_html)
                                    wrote_table = True
                            except Exception:
                                # VLM extraction failed; fall back to the cropped image below.
                                pass
                            if not wrote_table:
                                chart_md = f"![Chart — page {page_num}]({rel})\n"
                                chart_html = f'<img src="{rel}" alt="Chart — page {page_num}" />'
                                md_lines.append(chart_md)
                                html_lines.append(chart_html)
                        else:
                            chart_md = f"![Chart — page {page_num}]({rel})\n"
                            chart_html = f'<img src="{rel}" alt="Chart — page {page_num}" />'
                            md_lines.append(chart_md)
                            html_lines.append(chart_html)
                        if charts_bar: charts_bar.update(1)

                    elif box.label == "table":
                        if self.use_vlm and self.vlm:
                            wrote_table = False
                            try:
                                table = self.vlm.extract_table(abs_img_path)
                                item = to_structured_dict(table)
                                if item:
                                    # Add page and type information to structured item
                                    item["page"] = page_num
                                    item["type"] = "Table"
                                    structured_items.append(item)

                                    # Generate both markdown and HTML tables
                                    table_md = render_markdown_table(item.get("headers"), item.get("rows"),
                                                                     title=item.get("title"))
                                    table_html = render_html_table(item.get("headers"), item.get("rows"),
                                                                   title=item.get("title"))

                                    md_lines.append(table_md)
                                    html_lines.append(table_html)
                                    wrote_table = True
                            except Exception:
                                # VLM extraction failed; fall back to the cropped image below.
                                pass
                            if not wrote_table:
                                table_md = f"![Table — page {page_num}]({rel})\n"
                                table_html = f'<img src="{rel}" alt="Table — page {page_num}" />'
                                md_lines.append(table_md)
                                html_lines.append(table_html)
                        else:
                            table_md = f"![Table — page {page_num}]({rel})\n"
                            table_html = f'<img src="{rel}" alt="Table — page {page_num}" />'
                            md_lines.append(table_md)
                            html_lines.append(table_html)
                        if tables_bar: tables_bar.update(1)
                else:
                    text = ocr_box_text(self.ocr_engine, page_img, box)
                    if text:
                        md_lines.append(text)
                        md_lines.append(self.box_separator if self.box_separator else "")
                        # Convert text to HTML (basic conversion)
                        html_text = text.replace('\n', '<br>')
                        html_lines.append(f"<p>{html_text}</p>")
                        if self.box_separator:
                            html_lines.append("<br>")

    md_path = write_markdown(md_lines, out_dir)

    # Use HTML lines if VLM is enabled for better table formatting
    if self.use_vlm and html_lines:
        html_path = write_html_from_lines(html_lines, out_dir)
    else:
        html_path = write_html(md_lines, out_dir)

    excel_path = None
    html_structured_path = None
    if self.use_vlm and structured_items:
        excel_path = os.path.join(out_dir, "tables.xlsx")
        write_structured_excel(excel_path, structured_items)
        html_structured_path = os.path.join(out_dir, "tables.html")
        write_structured_html(html_structured_path, structured_items)

    print("✅ Parsing completed successfully!")
    print(f"📁 Output directory: {out_dir}")
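
The chart and table branches above share one pattern: try structured extraction, and fall back to an image link if extraction fails or returns nothing. A minimal, library-free sketch of that pattern follows. `render_or_fallback` and `to_markdown_table` are hypothetical stand-ins (the latter for `render_markdown_table`, whose implementation is not shown here), and `extract` is any callable playing the role of `self.vlm.extract_chart` or `self.vlm.extract_table`. Unlike the listing, this sketch logs the failure instead of silently ignoring it.

```python
import logging
from typing import Callable, Optional, Sequence


def to_markdown_table(headers: Sequence[str], rows: Sequence[Sequence[object]]) -> str:
    """Hypothetical stand-in for render_markdown_table: a plain pipe table."""
    lines = [
        "| " + " | ".join(headers) + " |",
        "| " + " | ".join("---" for _ in headers) + " |",
    ]
    lines += ["| " + " | ".join(str(cell) for cell in row) + " |" for row in rows]
    return "\n".join(lines)


def render_or_fallback(
    extract: Callable[[str], Optional[dict]],
    img_path: str,
    rel: str,
    kind: str,
    page_num: int,
) -> str:
    """Return a Markdown table if extraction succeeds, else an image link."""
    try:
        item = extract(img_path)
    except Exception as exc:
        # Record why extraction failed, then fall through to the image link.
        logging.warning("%s extraction failed on page %d: %s", kind, page_num, exc)
        item = None
    if item:
        return to_markdown_table(item["headers"], item["rows"])
    return f"![{kind} - page {page_num}]({rel})\n"
```

Logging the exception keeps the fallback behaviour while leaving a trace of which elements could not be converted.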

EnhancedPDFParser

Enhanced parser with image restoration capabilities.

doctra.parsers.enhanced_pdf_parser.EnhancedPDFParser

Bases: StructuredPDFParser

Enhanced PDF Parser with Image Restoration capabilities.

Extends the StructuredPDFParser with DocRes image restoration to improve
document quality before processing. This is particularly useful for:
- Scanned documents with shadows or distortion
- Low-quality PDFs that need enhancement
- Documents with perspective issues

:param use_image_restoration: Whether to apply DocRes image restoration (default: True)
:param restoration_task: DocRes task to use ("dewarping", "deshadowing", "appearance", "deblurring", "binarization", "end2end", default: "appearance")
:param restoration_device: Device for DocRes processing ("cuda", "cpu", or None for auto-detect, default: None)
:param restoration_dpi: DPI for restoration processing (default: 200)
:param use_vlm: Whether to use VLM for structured data extraction (default: False)
:param vlm_provider: VLM provider to use ("gemini", "openai", "anthropic", or "openrouter", default: "gemini")
:param vlm_model: Model name to use (defaults to provider-specific defaults)
:param vlm_api_key: API key for VLM provider (required if use_vlm is True)
:param layout_model_name: Layout detection model name (default: "PP-DocLayout_plus-L")
:param dpi: DPI for PDF rendering (default: 200)
:param min_score: Minimum confidence score for layout detection (default: 0.0)
:param ocr_lang: OCR language code (default: "eng")
:param ocr_psm: Tesseract page segmentation mode (default: 4)
:param ocr_oem: Tesseract OCR engine mode (default: 3)
:param ocr_extra_config: Additional Tesseract configuration (default: "")
:param box_separator: Separator between text boxes in output (default: "\n")
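
A usage sketch based on the parameters documented above. The import path, the keyword-only constructor arguments and `parse(pdf_path)` come from this page; the helper names `enhanced_parser_kwargs` and `parse_with_restoration`, and the up-front validation, are assumptions added for illustration (the library may or may not perform its own checks). The import is deferred so the argument checks can run without Doctra installed.

```python
RESTORATION_TASKS = (
    "dewarping", "deshadowing", "appearance",
    "deblurring", "binarization", "end2end",
)


def enhanced_parser_kwargs(task="appearance", device=None, use_vlm=False, vlm_api_key=None):
    """Build constructor arguments, rejecting values the docs do not list."""
    if task not in RESTORATION_TASKS:
        raise ValueError(f"unknown restoration_task: {task!r}")
    if use_vlm and not vlm_api_key:
        # The parameter docs state the key is required when use_vlm is True.
        raise ValueError("vlm_api_key is required when use_vlm is True")
    return {
        "use_image_restoration": True,
        "restoration_task": task,
        "restoration_device": device,  # None lets the parser auto-detect
        "use_vlm": use_vlm,
        "vlm_api_key": vlm_api_key,
    }


def parse_with_restoration(pdf_path, **options):
    """Run EnhancedPDFParser on one file; requires Doctra to be installed."""
    from doctra.parsers.enhanced_pdf_parser import EnhancedPDFParser

    parser = EnhancedPDFParser(**enhanced_parser_kwargs(**options))
    parser.parse(pdf_path)  # returns None; results are written to disk
```

For example, `parse_with_restoration("scan.pdf", task="deshadowing", device="cpu")` would write its results under `outputs/scan/enhanced_parse`, following the default directory naming in the source listing.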

Source code in doctra/parsers/enhanced_pdf_parser.py
class EnhancedPDFParser(StructuredPDFParser):
    """
    Enhanced PDF Parser with Image Restoration capabilities.

    Extends the StructuredPDFParser with DocRes image restoration to improve
    document quality before processing. This is particularly useful for:
    - Scanned documents with shadows or distortion
    - Low-quality PDFs that need enhancement
    - Documents with perspective issues

    :param use_image_restoration: Whether to apply DocRes image restoration (default: True)
    :param restoration_task: DocRes task to use ("dewarping", "deshadowing", "appearance", "deblurring", "binarization", "end2end", default: "appearance")
    :param restoration_device: Device for DocRes processing ("cuda", "cpu", or None for auto-detect, default: None)
    :param restoration_dpi: DPI for restoration processing (default: 200)
    :param use_vlm: Whether to use VLM for structured data extraction (default: False)
    :param vlm_provider: VLM provider to use ("gemini", "openai", "anthropic", or "openrouter", default: "gemini")
    :param vlm_model: Model name to use (defaults to provider-specific defaults)
    :param vlm_api_key: API key for VLM provider (required if use_vlm is True)
    :param layout_model_name: Layout detection model name (default: "PP-DocLayout_plus-L")
    :param dpi: DPI for PDF rendering (default: 200)
    :param min_score: Minimum confidence score for layout detection (default: 0.0)
    :param ocr_lang: OCR language code (default: "eng")
    :param ocr_psm: Tesseract page segmentation mode (default: 4)
    :param ocr_oem: Tesseract OCR engine mode (default: 3)
    :param ocr_extra_config: Additional Tesseract configuration (default: "")
    :param box_separator: Separator between text boxes in output (default: "\n")
    """

    def __init__(
        self,
        *,
        use_image_restoration: bool = True,
        restoration_task: str = "appearance",
        restoration_device: Optional[str] = None,
        restoration_dpi: int = 200,
        use_vlm: bool = False,
        vlm_provider: str = "gemini",
        vlm_model: str | None = None,
        vlm_api_key: str | None = None,
        layout_model_name: str = "PP-DocLayout_plus-L",
        dpi: int = 200,
        min_score: float = 0.0,
        ocr_lang: str = "eng",
        ocr_psm: int = 4,
        ocr_oem: int = 3,
        ocr_extra_config: str = "",
        box_separator: str = "\n",
    ):
        """
        Initialize the Enhanced PDF Parser with image restoration capabilities.
        """
        # Initialize parent class
        super().__init__(
            use_vlm=use_vlm,
            vlm_provider=vlm_provider,
            vlm_model=vlm_model,
            vlm_api_key=vlm_api_key,
            layout_model_name=layout_model_name,
            dpi=dpi,
            min_score=min_score,
            ocr_lang=ocr_lang,
            ocr_psm=ocr_psm,
            ocr_oem=ocr_oem,
            ocr_extra_config=ocr_extra_config,
            box_separator=box_separator,
        )

        # Image restoration settings
        self.use_image_restoration = use_image_restoration
        self.restoration_task = restoration_task
        self.restoration_device = restoration_device
        self.restoration_dpi = restoration_dpi

        # Initialize DocRes engine if needed
        self.docres_engine = None
        if self.use_image_restoration:
            try:
                self.docres_engine = DocResEngine(
                    device=restoration_device,
                    use_half_precision=True
                )
                print(f"✅ DocRes engine initialized with task: {restoration_task}")
            except Exception as e:
                print(f"⚠️ DocRes initialization failed: {e}")
                print("   Continuing without image restoration...")
                self.use_image_restoration = False
                self.docres_engine = None

    def parse(self, pdf_path: str, enhanced_output_dir: Optional[str] = None) -> None:
        """
        Parse a PDF document with optional image restoration.

        :param pdf_path: Path to the input PDF file
        :param enhanced_output_dir: Directory for enhanced images (if None, uses default)
        :return: None
        """
        pdf_filename = os.path.splitext(os.path.basename(pdf_path))[0]

        # Set up output directories
        if enhanced_output_dir is None:
            out_dir = f"outputs/{pdf_filename}/enhanced_parse"
        else:
            out_dir = enhanced_output_dir

        os.makedirs(out_dir, exist_ok=True)
        ensure_output_dirs(out_dir, IMAGE_SUBDIRS)

        # Process PDF pages with optional restoration
        if self.use_image_restoration and self.docres_engine:
            print(f"🔄 Processing PDF with image restoration: {os.path.basename(pdf_path)}")
            enhanced_pages = self._process_pages_with_restoration(pdf_path, out_dir)

            # Create enhanced PDF file using the already processed enhanced pages
            enhanced_pdf_path = os.path.join(out_dir, f"{pdf_filename}_enhanced.pdf")
            try:
                self._create_enhanced_pdf_from_pages(enhanced_pages, enhanced_pdf_path)
            except Exception as e:
                print(f"⚠️ Failed to create enhanced PDF: {e}")
        else:
            print(f"🔄 Processing PDF without image restoration: {os.path.basename(pdf_path)}")
            enhanced_pages = [im for (im, _, _) in render_pdf_to_images(pdf_path, dpi=self.dpi)]

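        # Run layout detection. Note: predict_pdf is given the original pdf_path,
        # so boxes are detected on the original rendering at self.dpi; the
        # enhanced pages are only used afterwards for cropping and OCR.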
        print("🔍 Running layout detection on enhanced pages...")
        pages = self.layout_engine.predict_pdf(
            pdf_path, batch_size=1, layout_nms=True, dpi=self.dpi, min_score=self.min_score
        )

        # Use enhanced pages for processing
        pil_pages = enhanced_pages

        # Continue with standard parsing logic
        self._process_parsing_logic(pages, pil_pages, out_dir, pdf_filename, pdf_path)

    def _process_pages_with_restoration(self, pdf_path: str, out_dir: str) -> List[Image.Image]:
        """
        Process PDF pages with DocRes image restoration.

        :param pdf_path: Path to the input PDF file
        :param out_dir: Output directory for enhanced images
        :return: List of enhanced PIL images
        """
        # Render original pages
        original_pages = [im for (im, _, _) in render_pdf_to_images(pdf_path, dpi=self.restoration_dpi)]

        if not original_pages:
            print("❌ No pages found in PDF")
            return []

        # Create progress bar
        is_notebook = "ipykernel" in sys.modules or "jupyter" in sys.modules
        if is_notebook:
            progress_bar = create_notebook_friendly_bar(
                total=len(original_pages), 
                desc=f"DocRes {self.restoration_task}"
            )
        else:
            progress_bar = create_beautiful_progress_bar(
                total=len(original_pages), 
                desc=f"DocRes {self.restoration_task}",
                leave=True
            )

        enhanced_pages = []
        enhanced_dir = os.path.join(out_dir, "enhanced_pages")
        os.makedirs(enhanced_dir, exist_ok=True)

        try:
            with progress_bar:
                for i, page_img in enumerate(original_pages):
                    try:
                        # Convert PIL to numpy array
                        img_array = np.array(page_img)

                        # Apply DocRes restoration
                        restored_img, metadata = self.docres_engine.restore_image(
                            img_array, 
                            task=self.restoration_task
                        )

                        # Convert back to PIL Image
                        enhanced_page = Image.fromarray(restored_img)
                        enhanced_pages.append(enhanced_page)

                        # Save enhanced page for reference
                        enhanced_path = os.path.join(enhanced_dir, f"page_{i+1:03d}_enhanced.jpg")
                        enhanced_page.save(enhanced_path, "JPEG", quality=95)

                        progress_bar.set_description(f"✅ Page {i+1}/{len(original_pages)} enhanced")
                        progress_bar.update(1)

                    except Exception as e:
                        print(f"  ⚠️ Page {i+1} restoration failed: {e}, using original")
                        enhanced_pages.append(page_img)
                        progress_bar.set_description(f"⚠️ Page {i+1} failed, using original")
                        progress_bar.update(1)

        finally:
            if hasattr(progress_bar, 'close'):
                progress_bar.close()

        return enhanced_pages

    def _process_parsing_logic(self, pages, pil_pages, out_dir, pdf_filename, pdf_path):
        """
        Process the parsing logic with enhanced pages.
        This is extracted from the parent class to allow customization.
        """

        fig_count = sum(sum(1 for b in p.boxes if b.label == "figure") for p in pages)
        chart_count = sum(sum(1 for b in p.boxes if b.label == "chart") for p in pages)
        table_count = sum(sum(1 for b in p.boxes if b.label == "table") for p in pages)

        md_lines: List[str] = ["# Enhanced Document Content\n"]
        html_lines: List[str] = ["<h1>Enhanced Document Content</h1>"]  # For direct HTML generation
        structured_items: List[Dict[str, Any]] = []
        page_content: Dict[int, List[str]] = {}  # Store content by page

        charts_desc = "Charts (VLM → table)" if self.use_vlm else "Charts (cropped)"
        tables_desc = "Tables (VLM → table)" if self.use_vlm else "Tables (cropped)"
        figures_desc = "Figures (cropped)"

        with ExitStack() as stack:
            is_notebook = "ipykernel" in sys.modules or "jupyter" in sys.modules
            if is_notebook:
                charts_bar = stack.enter_context(
                    create_notebook_friendly_bar(total=chart_count, desc=charts_desc)) if chart_count else None
                tables_bar = stack.enter_context(
                    create_notebook_friendly_bar(total=table_count, desc=tables_desc)) if table_count else None
                figures_bar = stack.enter_context(
                    create_notebook_friendly_bar(total=fig_count, desc=figures_desc)) if fig_count else None
            else:
                charts_bar = stack.enter_context(
                    create_beautiful_progress_bar(total=chart_count, desc=charts_desc, leave=True)) if chart_count else None
                tables_bar = stack.enter_context(
                    create_beautiful_progress_bar(total=table_count, desc=tables_desc, leave=True)) if table_count else None
                figures_bar = stack.enter_context(
                    create_beautiful_progress_bar(total=fig_count, desc=figures_desc, leave=True)) if fig_count else None

            # Initialize page content for all pages first
            for page_num in range(1, len(pil_pages) + 1):
                page_content[page_num] = [f"# Page {page_num} Content\n"]

            for p in pages:
                page_num = p.page_index
                page_img: Image.Image = pil_pages[page_num - 1]
                md_lines.append(f"\n## Page {page_num}\n")
                html_lines.append(f"<h2>Page {page_num}</h2>")

                for i, box in enumerate(sorted(p.boxes, key=reading_order_key), start=1):
                    if box.label in EXCLUDE_LABELS:
                        img_path = save_box_image(page_img, box, out_dir, page_num, i, IMAGE_SUBDIRS)
                        abs_img_path = os.path.abspath(img_path)
                        rel = os.path.relpath(abs_img_path, out_dir)

                        if box.label == "figure":
                            figure_md = f"![Figure — page {page_num}]({rel})\n"
                            figure_html = f'<img src="{rel}" alt="Figure — page {page_num}" />'
                            md_lines.append(figure_md)
                            html_lines.append(figure_html)
                            page_content[page_num].append(figure_md)
                            if figures_bar: figures_bar.update(1)

                        elif box.label == "chart":
                            if self.use_vlm and self.vlm:
                                wrote_table = False
                                try:
                                    chart = self.vlm.extract_chart(abs_img_path)
                                    item = to_structured_dict(chart)
                                    if item:
                                        # Add page and type information to structured item
                                        item["page"] = page_num
                                        item["type"] = "Chart"
                                        structured_items.append(item)

                                        # Generate both markdown and HTML tables
                                        table_md = render_markdown_table(item.get("headers"), item.get("rows"),
                                                                         title=item.get("title"))
                                        table_html = render_html_table(item.get("headers"), item.get("rows"),
                                                                       title=item.get("title"))

                                        md_lines.append(table_md)
                                        html_lines.append(table_html)
                                        page_content[page_num].append(table_md)
                                        wrote_table = True
                                except Exception:
                                    # VLM extraction failed; fall back to the cropped image below.
                                    pass
                                if not wrote_table:
                                    chart_md = f"![Chart — page {page_num}]({rel})\n"
                                    chart_html = f'<img src="{rel}" alt="Chart — page {page_num}" />'
                                    md_lines.append(chart_md)
                                    html_lines.append(chart_html)
                                    page_content[page_num].append(chart_md)
                            else:
                                chart_md = f"![Chart — page {page_num}]({rel})\n"
                                chart_html = f'<img src="{rel}" alt="Chart — page {page_num}" />'
                                md_lines.append(chart_md)
                                html_lines.append(chart_html)
                                page_content[page_num].append(chart_md)
                            if charts_bar: charts_bar.update(1)

                        elif box.label == "table":
                            if self.use_vlm and self.vlm:
                                wrote_table = False
                                try:
                                    table = self.vlm.extract_table(abs_img_path)
                                    item = to_structured_dict(table)
                                    if item:
                                        # Add page and type information to structured item
                                        item["page"] = page_num
                                        item["type"] = "Table"
                                        structured_items.append(item)

                                        # Generate both markdown and HTML tables
                                        table_md = render_markdown_table(item.get("headers"), item.get("rows"),
                                                                         title=item.get("title"))
                                        table_html = render_html_table(item.get("headers"), item.get("rows"),
                                                                       title=item.get("title"))

                                        md_lines.append(table_md)
                                        html_lines.append(table_html)
                                        page_content[page_num].append(table_md)
                                        wrote_table = True
                                except Exception:
                                    # VLM extraction failed; fall back to the cropped image below.
                                    pass
                                if not wrote_table:
                                    table_md = f"![Table — page {page_num}]({rel})\n"
                                    table_html = f'<img src="{rel}" alt="Table — page {page_num}" />'
                                    md_lines.append(table_md)
                                    html_lines.append(table_html)
                                    page_content[page_num].append(table_md)
                            else:
                                table_md = f"![Table — page {page_num}]({rel})\n"
                                table_html = f'<img src="{rel}" alt="Table — page {page_num}" />'
                                md_lines.append(table_md)
                                html_lines.append(table_html)
                                page_content[page_num].append(table_md)
                            if tables_bar: tables_bar.update(1)
                    else:
                        text = ocr_box_text(self.ocr_engine, page_img, box)
                        if text:
                            md_lines.append(text)
                            md_lines.append(self.box_separator if self.box_separator else "")
                            # Convert text to HTML (basic conversion)
                            html_text = text.replace('\n', '<br>')
                            html_lines.append(f"<p>{html_text}</p>")
                            if self.box_separator:
                                html_lines.append("<br>")
                            page_content[page_num].append(text)
                            page_content[page_num].append(self.box_separator if self.box_separator else "")

        md_path = write_markdown(md_lines, out_dir)

        # Use HTML lines if VLM is enabled for better table formatting
        if self.use_vlm and html_lines:
            html_path = write_html_from_lines(html_lines, out_dir)
        else:
            html_path = write_html(md_lines, out_dir)

        # Create pages folder and save individual page markdown files
        pages_dir = os.path.join(out_dir, "pages")
        os.makedirs(pages_dir, exist_ok=True)

        for page_num, content_lines in page_content.items():
            page_md_path = os.path.join(pages_dir, f"page_{page_num:03d}.md")
            write_markdown(content_lines, os.path.dirname(page_md_path), os.path.basename(page_md_path))

        excel_path = None
        html_structured_path = None
        if self.use_vlm and structured_items:
            excel_path = os.path.join(out_dir, "tables.xlsx")
            write_structured_excel(excel_path, structured_items)
            html_structured_path = os.path.join(out_dir, "tables.html")
            write_structured_html(html_structured_path, structured_items)

        print("✅ Enhanced parsing completed successfully!")
        print(f"📁 Output directory: {out_dir}")

    def _create_enhanced_pdf_from_pages(self, enhanced_pages: List[Image.Image], output_path: str) -> None:
        """
        Create an enhanced PDF from already processed enhanced pages.

        :param enhanced_pages: List of enhanced PIL images
        :param output_path: Path for the enhanced PDF
        """
        if not enhanced_pages:
            raise ValueError("No enhanced pages provided")

        try:
            # Create enhanced PDF from the processed pages
            enhanced_pages[0].save(
                output_path,
                "PDF",
                resolution=100.0,
                save_all=True,
                append_images=enhanced_pages[1:] if len(enhanced_pages) > 1 else []
            )
            print(f"✅ Enhanced PDF saved from processed pages: {output_path}")
        except Exception as e:
            print(f"❌ Error creating enhanced PDF from pages: {e}")
            raise

    def restore_pdf_only(self, pdf_path: str, output_path: Optional[str] = None, task: Optional[str] = None) -> Optional[str]:
        """
        Apply DocRes restoration to a PDF without parsing.

        :param pdf_path: Path to the input PDF file
        :param output_path: Path for the enhanced PDF (if None, auto-generates)
        :param task: DocRes restoration task (if None, uses instance default)
        :return: Path to the enhanced PDF or None if failed
        """
        if not self.use_image_restoration or not self.docres_engine:
            raise RuntimeError("Image restoration is not enabled or DocRes engine is not available")

        task = task or self.restoration_task
        return self.docres_engine.restore_pdf(pdf_path, output_path, task, self.restoration_dpi)

    def get_restoration_info(self) -> Dict[str, Any]:
        """
        Get information about the current restoration configuration.

        :return: Dictionary with restoration settings and status
        """
        return {
            'enabled': self.use_image_restoration,
            'task': self.restoration_task,
            'device': self.restoration_device,
            'dpi': self.restoration_dpi,
            'engine_available': self.docres_engine is not None,
            'supported_tasks': self.docres_engine.get_supported_tasks() if self.docres_engine else []
        }

__init__(*, use_image_restoration=True, restoration_task='appearance', restoration_device=None, restoration_dpi=200, use_vlm=False, vlm_provider='gemini', vlm_model=None, vlm_api_key=None, layout_model_name='PP-DocLayout_plus-L', dpi=200, min_score=0.0, ocr_lang='eng', ocr_psm=4, ocr_oem=3, ocr_extra_config='', box_separator='\n')

Initialize the Enhanced PDF Parser with image restoration capabilities.

Source code in doctra/parsers/enhanced_pdf_parser.py
def __init__(
    self,
    *,
    use_image_restoration: bool = True,
    restoration_task: str = "appearance",
    restoration_device: Optional[str] = None,
    restoration_dpi: int = 200,
    use_vlm: bool = False,
    vlm_provider: str = "gemini",
    vlm_model: str | None = None,
    vlm_api_key: str | None = None,
    layout_model_name: str = "PP-DocLayout_plus-L",
    dpi: int = 200,
    min_score: float = 0.0,
    ocr_lang: str = "eng",
    ocr_psm: int = 4,
    ocr_oem: int = 3,
    ocr_extra_config: str = "",
    box_separator: str = "\n",
):
    """
    Initialize the Enhanced PDF Parser with image restoration capabilities.
    """
    # Initialize parent class
    super().__init__(
        use_vlm=use_vlm,
        vlm_provider=vlm_provider,
        vlm_model=vlm_model,
        vlm_api_key=vlm_api_key,
        layout_model_name=layout_model_name,
        dpi=dpi,
        min_score=min_score,
        ocr_lang=ocr_lang,
        ocr_psm=ocr_psm,
        ocr_oem=ocr_oem,
        ocr_extra_config=ocr_extra_config,
        box_separator=box_separator,
    )

    # Image restoration settings
    self.use_image_restoration = use_image_restoration
    self.restoration_task = restoration_task
    self.restoration_device = restoration_device
    self.restoration_dpi = restoration_dpi

    # Initialize DocRes engine if needed
    self.docres_engine = None
    if self.use_image_restoration:
        try:
            self.docres_engine = DocResEngine(
                device=restoration_device,
                use_half_precision=True
            )
            print(f"✅ DocRes engine initialized with task: {restoration_task}")
        except Exception as e:
            print(f"⚠️ DocRes initialization failed: {e}")
            print("   Continuing without image restoration...")
            self.use_image_restoration = False
            self.docres_engine = None

get_restoration_info()

Get information about the current restoration configuration.

:return: Dictionary with restoration settings and status

Source code in doctra/parsers/enhanced_pdf_parser.py
def get_restoration_info(self) -> Dict[str, Any]:
    """
    Get information about the current restoration configuration.

    :return: Dictionary with restoration settings and status
    """
    return {
        'enabled': self.use_image_restoration,
        'task': self.restoration_task,
        'device': self.restoration_device,
        'dpi': self.restoration_dpi,
        'engine_available': self.docres_engine is not None,
        'supported_tasks': self.docres_engine.get_supported_tasks() if self.docres_engine else []
    }
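A caller might inspect this dictionary before requesting a restoration-only run. The sketch below is a minimal illustration that operates on a plain dictionary with the keys listed above. The helper name `can_restore` and the second task name in the sample are hypothetical and not part of Doctra.

```python
from typing import Any, Dict, Optional


def can_restore(info: Dict[str, Any], task: Optional[str] = None) -> bool:
    """Return True if restoration looks usable for the requested task.

    `info` is assumed to have the shape returned by get_restoration_info().
    """
    if not info.get("enabled") or not info.get("engine_available"):
        return False
    # Fall back to the configured default task, as restore_pdf_only() does.
    wanted = task or info.get("task")
    supported = info.get("supported_tasks") or []
    # Assumption: a task missing from the reported list is treated as unsupported.
    return wanted in supported


# Hand-written sample; "placeholder_task" is illustrative, not a real DocRes task name.
sample_info = {
    "enabled": True,
    "task": "appearance",
    "device": None,
    "dpi": 200,
    "engine_available": True,
    "supported_tasks": ["appearance", "placeholder_task"],
}

if can_restore(sample_info):
    print("Restoration is available for the default task")
```

Checking the dictionary first avoids the `RuntimeError` that `restore_pdf_only` raises when restoration is disabled or the engine failed to initialize.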

parse(pdf_path, enhanced_output_dir=None)

Parse a PDF document with optional image restoration.

:param pdf_path: Path to the input PDF file
:param enhanced_output_dir: Directory for enhanced images (if None, uses default)
:return: None

Source code in doctra/parsers/enhanced_pdf_parser.py
def parse(self, pdf_path: str, enhanced_output_dir: Optional[str] = None) -> None:
    """
    Parse a PDF document with optional image restoration.

    :param pdf_path: Path to the input PDF file
    :param enhanced_output_dir: Directory for enhanced images (if None, uses default)
    :return: None
    """
    pdf_filename = os.path.splitext(os.path.basename(pdf_path))[0]

    # Set up output directories
    if enhanced_output_dir is None:
        out_dir = f"outputs/{pdf_filename}/enhanced_parse"
    else:
        out_dir = enhanced_output_dir

    os.makedirs(out_dir, exist_ok=True)
    ensure_output_dirs(out_dir, IMAGE_SUBDIRS)

    # Process PDF pages with optional restoration
    if self.use_image_restoration and self.docres_engine:
        print(f"🔄 Processing PDF with image restoration: {os.path.basename(pdf_path)}")
        enhanced_pages = self._process_pages_with_restoration(pdf_path, out_dir)

        # Create enhanced PDF file using the already processed enhanced pages
        enhanced_pdf_path = os.path.join(out_dir, f"{pdf_filename}_enhanced.pdf")
        try:
            self._create_enhanced_pdf_from_pages(enhanced_pages, enhanced_pdf_path)
        except Exception as e:
            print(f"⚠️ Failed to create enhanced PDF: {e}")
    else:
        print(f"🔄 Processing PDF without image restoration: {os.path.basename(pdf_path)}")
        enhanced_pages = [im for (im, _, _) in render_pdf_to_images(pdf_path, dpi=self.dpi)]

    # Run layout detection (note: predict_pdf is given the original pdf_path, not the enhanced page images)
    print("🔍 Running layout detection on enhanced pages...")
    pages = self.layout_engine.predict_pdf(
        pdf_path, batch_size=1, layout_nms=True, dpi=self.dpi, min_score=self.min_score
    )

    # Use enhanced pages for processing
    pil_pages = enhanced_pages

    # Continue with standard parsing logic
    self._process_parsing_logic(pages, pil_pages, out_dir, pdf_filename, pdf_path)
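To predict where `parse` will write its results, the directory selection in the listing above can be reproduced without touching the filesystem. This is a minimal sketch that mirrors the path logic shown in the source. The function names `resolve_enhanced_out_dir` and `enhanced_pdf_path` are hypothetical helpers, not Doctra APIs.

```python
import os
from typing import Optional


def resolve_enhanced_out_dir(pdf_path: str, enhanced_output_dir: Optional[str] = None) -> str:
    """Mirror the output-directory choice made at the top of parse()."""
    pdf_filename = os.path.splitext(os.path.basename(pdf_path))[0]
    if enhanced_output_dir is None:
        # Default location used when no directory is supplied.
        return f"outputs/{pdf_filename}/enhanced_parse"
    return enhanced_output_dir


def enhanced_pdf_path(pdf_path: str, out_dir: str) -> str:
    """Path of the restored PDF that parse() attempts to create when restoration is active."""
    pdf_filename = os.path.splitext(os.path.basename(pdf_path))[0]
    return os.path.join(out_dir, f"{pdf_filename}_enhanced.pdf")


out_dir = resolve_enhanced_out_dir("reports/q3.pdf")
print(out_dir)
print(enhanced_pdf_path("reports/q3.pdf", out_dir))
```

Note that the enhanced PDF is only written when image restoration is enabled and the DocRes engine is available; otherwise the pages are rendered directly and no `_enhanced.pdf` file is produced.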

restore_pdf_only(pdf_path, output_path=None, task=None)

Apply DocRes restoration to a PDF without parsing.

:param pdf_path: Path to the input PDF file
:param output_path: Path for the enhanced PDF (if None, auto-generates)
:param task: DocRes restoration task (if None, uses instance default)
:return: Path to the enhanced PDF or None if failed

Source code in doctra/parsers/enhanced_pdf_parser.py
def restore_pdf_only(self, pdf_path: str, output_path: Optional[str] = None, task: Optional[str] = None) -> Optional[str]:
    """
    Apply DocRes restoration to a PDF without parsing.

    :param pdf_path: Path to the input PDF file
    :param output_path: Path for the enhanced PDF (if None, auto-generates)
    :param task: DocRes restoration task (if None, uses instance default)
    :return: Path to the enhanced PDF or None if failed
    """
    if not self.use_image_restoration or not self.docres_engine:
        raise RuntimeError("Image restoration is not enabled or DocRes engine is not available")

    task = task or self.restoration_task
    return self.docres_engine.restore_pdf(pdf_path, output_path, task, self.restoration_dpi)

ChartTablePDFParser

Specialized parser for extracting charts and tables.

doctra.parsers.table_chart_extractor.ChartTablePDFParser

Specialized PDF parser for extracting charts and tables.

Focuses specifically on chart and table extraction from PDF documents, with optional VLM (Vision Language Model) processing to convert visual elements into structured data.

:param extract_charts: Whether to extract charts from the document (default: True)
:param extract_tables: Whether to extract tables from the document (default: True)
:param use_vlm: Whether to use VLM for structured data extraction (default: False)
:param vlm_provider: VLM provider to use ("gemini", "openai", "anthropic", or "openrouter", default: "gemini")
:param vlm_model: Model name to use (defaults to provider-specific defaults)
:param vlm_api_key: API key for VLM provider (required if use_vlm is True)
:param layout_model_name: Layout detection model name (default: "PP-DocLayout_plus-L")
:param dpi: DPI for PDF rendering (default: 200)
:param min_score: Minimum confidence score for layout detection (default: 0.0)

Source code in doctra/parsers/table_chart_extractor.py
class ChartTablePDFParser:
    """
    Specialized PDF parser for extracting charts and tables.

    Focuses specifically on chart and table extraction from PDF documents,
    with optional VLM (Vision Language Model) processing to convert visual
    elements into structured data.

    :param extract_charts: Whether to extract charts from the document (default: True)
    :param extract_tables: Whether to extract tables from the document (default: True)
    :param use_vlm: Whether to use VLM for structured data extraction (default: False)
    :param vlm_provider: VLM provider to use ("gemini", "openai", "anthropic", or "openrouter", default: "gemini")
    :param vlm_model: Model name to use (defaults to provider-specific defaults)
    :param vlm_api_key: API key for VLM provider (required if use_vlm is True)
    :param layout_model_name: Layout detection model name (default: "PP-DocLayout_plus-L")
    :param dpi: DPI for PDF rendering (default: 200)
    :param min_score: Minimum confidence score for layout detection (default: 0.0)
    """

    def __init__(
            self,
            *,
            extract_charts: bool = True,
            extract_tables: bool = True,
            use_vlm: bool = False,
            vlm_provider: str = "gemini",
            vlm_model: str | None = None,
            vlm_api_key: str | None = None,
            layout_model_name: str = "PP-DocLayout_plus-L",
            dpi: int = 200,
            min_score: float = 0.0,
    ):
        """
        Initialize the ChartTablePDFParser with extraction configuration.

        :param extract_charts: Whether to extract charts from the document (default: True)
        :param extract_tables: Whether to extract tables from the document (default: True)
        :param use_vlm: Whether to use VLM for structured data extraction (default: False)
        :param vlm_provider: VLM provider to use ("gemini", "openai", "anthropic", or "openrouter", default: "gemini")
        :param vlm_model: Model name to use (defaults to provider-specific defaults)
        :param vlm_api_key: API key for VLM provider (required if use_vlm is True)
        :param layout_model_name: Layout detection model name (default: "PP-DocLayout_plus-L")
        :param dpi: DPI for PDF rendering (default: 200)
        :param min_score: Minimum confidence score for layout detection (default: 0.0)
        """
        if not extract_charts and not extract_tables:
            raise ValueError("At least one of extract_charts or extract_tables must be True")

        self.extract_charts = extract_charts
        self.extract_tables = extract_tables
        self.layout_engine = PaddleLayoutEngine(model_name=layout_model_name)
        self.dpi = dpi
        self.min_score = min_score

        self.use_vlm = use_vlm
        self.vlm = None
        if self.use_vlm:
            self.vlm = VLMStructuredExtractor(
                vlm_provider=vlm_provider,
                vlm_model=vlm_model,
                api_key=vlm_api_key,
            )

    def parse(self, pdf_path: str, output_base_dir: str = "outputs") -> None:
        """
        Parse a PDF document and extract charts and/or tables.

        :param pdf_path: Path to the input PDF file
        :param output_base_dir: Base directory for output files (default: "outputs")
        :return: None
        """
        pdf_name = Path(pdf_path).stem
        out_dir = os.path.join(output_base_dir, pdf_name, "structured_parsing")
        os.makedirs(out_dir, exist_ok=True)

        charts_dir = None
        tables_dir = None

        if self.extract_charts:
            charts_dir = os.path.join(out_dir, "charts")
            os.makedirs(charts_dir, exist_ok=True)

        if self.extract_tables:
            tables_dir = os.path.join(out_dir, "tables")
            os.makedirs(tables_dir, exist_ok=True)

        pages: List[LayoutPage] = self.layout_engine.predict_pdf(
            pdf_path, batch_size=1, layout_nms=True, dpi=self.dpi, min_score=self.min_score
        )
        pil_pages = [im for (im, _, _) in render_pdf_to_images(pdf_path, dpi=self.dpi)]

        target_labels = []
        if self.extract_charts:
            target_labels.append("chart")
        if self.extract_tables:
            target_labels.append("table")

        chart_count = sum(sum(1 for b in p.boxes if b.label == "chart") for p in pages) if self.extract_charts else 0
        table_count = sum(sum(1 for b in p.boxes if b.label == "table") for p in pages) if self.extract_tables else 0

        if self.use_vlm:
            md_lines: List[str] = ["# Extracted Charts and Tables\n"]
            structured_items: List[Dict[str, Any]] = []
            vlm_items: List[Dict[str, Any]] = []

        charts_desc = "Charts (VLM → table)" if self.use_vlm else "Charts (cropped)"
        tables_desc = "Tables (VLM → table)" if self.use_vlm else "Tables (cropped)"

        chart_counter = 1
        table_counter = 1

        with ExitStack() as stack:
            is_notebook = "ipykernel" in sys.modules or "jupyter" in sys.modules
            is_terminal = hasattr(sys.stdout, 'isatty') and sys.stdout.isatty()

            if is_notebook:
                charts_bar = stack.enter_context(
                    create_notebook_friendly_bar(total=chart_count, desc=charts_desc)) if chart_count else None
                tables_bar = stack.enter_context(
                    create_notebook_friendly_bar(total=table_count, desc=tables_desc)) if table_count else None
            else:
                charts_bar = stack.enter_context(
                    create_beautiful_progress_bar(total=chart_count, desc=charts_desc, leave=True)) if chart_count else None
                tables_bar = stack.enter_context(
                    create_beautiful_progress_bar(total=table_count, desc=tables_desc, leave=True)) if table_count else None

            for p in pages:
                page_num = p.page_index
                page_img: Image.Image = pil_pages[page_num - 1]

                target_items = [box for box in p.boxes if box.label in target_labels]

                if target_items and self.use_vlm:
                    md_lines.append(f"\n## Page {page_num}\n")

                for box in sorted(target_items, key=reading_order_key):
                    if box.label == "chart" and self.extract_charts:
                        chart_filename = f"chart_{chart_counter:03d}.png"
                        chart_path = os.path.join(charts_dir, chart_filename)

                        cropped_img = page_img.crop((box.x1, box.y1, box.x2, box.y2))
                        cropped_img.save(chart_path)

                        if self.use_vlm and self.vlm:
                            rel_path = os.path.join("charts", chart_filename)
                            wrote_table = False

                            try:
                                extracted_chart = self.vlm.extract_chart(chart_path)
                                structured_item = to_structured_dict(extracted_chart)
                                if structured_item:
                                    # Add page and type information to structured item
                                    structured_item["page"] = page_num
                                    structured_item["type"] = "Chart"
                                    structured_items.append(structured_item)
                                    vlm_items.append({
                                        "kind": "chart",
                                        "page": page_num,
                                        "image_rel_path": rel_path,
                                        "title": structured_item.get("title"),
                                        "headers": structured_item.get("headers"),
                                        "rows": structured_item.get("rows"),
                                    })
                                    md_lines.append(
                                        render_markdown_table(
                                            structured_item.get("headers"),
                                            structured_item.get("rows"),
                                            title=structured_item.get(
                                                "title") or f"Chart {chart_counter} — page {page_num}"
                                        )
                                    )
                                    wrote_table = True
                            except Exception:
                                pass

                            if not wrote_table:
                                md_lines.append(f"![Chart {chart_counter} — page {page_num}]({rel_path})\n")

                        chart_counter += 1
                        if charts_bar:
                            charts_bar.update(1)

                    elif box.label == "table" and self.extract_tables:
                        table_filename = f"table_{table_counter:03d}.png"
                        table_path = os.path.join(tables_dir, table_filename)

                        cropped_img = page_img.crop((box.x1, box.y1, box.x2, box.y2))
                        cropped_img.save(table_path)

                        if self.use_vlm and self.vlm:
                            rel_path = os.path.join("tables", table_filename)
                            wrote_table = False

                            try:
                                extracted_table = self.vlm.extract_table(table_path)
                                structured_item = to_structured_dict(extracted_table)
                                if structured_item:
                                    # Add page and type information to structured item
                                    structured_item["page"] = page_num
                                    structured_item["type"] = "Table"
                                    structured_items.append(structured_item)
                                    vlm_items.append({
                                        "kind": "table",
                                        "page": page_num,
                                        "image_rel_path": rel_path,
                                        "title": structured_item.get("title"),
                                        "headers": structured_item.get("headers"),
                                        "rows": structured_item.get("rows"),
                                    })
                                    md_lines.append(
                                        render_markdown_table(
                                            structured_item.get("headers"),
                                            structured_item.get("rows"),
                                            title=structured_item.get(
                                                "title") or f"Table {table_counter} — page {page_num}"
                                        )
                                    )
                                    wrote_table = True
                            except Exception:
                                pass

                            if not wrote_table:
                                md_lines.append(f"![Table {table_counter} — page {page_num}]({rel_path})\n")

                        table_counter += 1
                        if tables_bar:
                            tables_bar.update(1)

        excel_path = None

        if self.use_vlm:

            if structured_items:
                if self.extract_charts and self.extract_tables:
                    excel_filename = "parsed_tables_charts.xlsx"
                elif self.extract_charts:
                    excel_filename = "parsed_charts.xlsx"
                elif self.extract_tables:
                    excel_filename = "parsed_tables.xlsx"
                else:
                    excel_filename = "parsed_data.xlsx"  # fallback


                excel_path = os.path.join(out_dir, excel_filename)
                write_structured_excel(excel_path, structured_items)

                html_filename = excel_filename.replace('.xlsx', '.html')
                html_path = os.path.join(out_dir, html_filename)
                write_structured_html(html_path, structured_items)

            # vlm_items is always defined on this path because it is created whenever use_vlm is True
            if vlm_items:
                with open(os.path.join(out_dir, "vlm_items.json"), 'w', encoding='utf-8') as jf:
                    json.dump(vlm_items, jf, ensure_ascii=False, indent=2)

        extraction_types = []
        if self.extract_charts:
            extraction_types.append("charts")
        if self.extract_tables:
            extraction_types.append("tables")

        print("✅ Parsing completed successfully!")
        print(f"📁 Output directory: {out_dir}")
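When VLM extraction is enabled and produces results, the listing above writes a `vlm_items.json` file whose records carry the keys `kind`, `page`, `image_rel_path`, `title`, `headers`, and `rows`. The following sketch shows one way such a file might be consumed downstream. The helper names and the sample records are hypothetical and hand-written for illustration, not real parser output.

```python
import json
from collections import Counter
from typing import Any, Dict, List


def load_vlm_items(path: str) -> List[Dict[str, Any]]:
    """Read a vlm_items.json file written with UTF-8 encoding."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def count_by_kind(items: List[Dict[str, Any]]) -> Dict[str, int]:
    """Count how many records are charts and how many are tables."""
    return dict(Counter(item["kind"] for item in items))


def items_on_page(items: List[Dict[str, Any]], page: int) -> List[Dict[str, Any]]:
    """Select the records extracted from a given page number."""
    return [item for item in items if item["page"] == page]


# Illustrative records using the same keys as the listing above.
sample_items = [
    {"kind": "chart", "page": 1, "image_rel_path": "charts/chart_001.png",
     "title": "Sample chart", "headers": ["x", "y"], "rows": [["a", "1"]]},
    {"kind": "table", "page": 1, "image_rel_path": "tables/table_001.png",
     "title": None, "headers": ["name", "value"], "rows": [["b", "2"]]},
    {"kind": "table", "page": 3, "image_rel_path": "tables/table_002.png",
     "title": "Sample table", "headers": ["k"], "rows": [["c"]]},
]

print(count_by_kind(sample_items))
print([item["image_rel_path"] for item in items_on_page(sample_items, 1)])
```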

__init__(*, extract_charts=True, extract_tables=True, use_vlm=False, vlm_provider='gemini', vlm_model=None, vlm_api_key=None, layout_model_name='PP-DocLayout_plus-L', dpi=200, min_score=0.0)

Initialize the ChartTablePDFParser with extraction configuration.

:param extract_charts: Whether to extract charts from the document (default: True)
:param extract_tables: Whether to extract tables from the document (default: True)
:param use_vlm: Whether to use VLM for structured data extraction (default: False)
:param vlm_provider: VLM provider to use ("gemini", "openai", "anthropic", or "openrouter", default: "gemini")
:param vlm_model: Model name to use (defaults to provider-specific defaults)
:param vlm_api_key: API key for VLM provider (required if use_vlm is True)
:param layout_model_name: Layout detection model name (default: "PP-DocLayout_plus-L")
:param dpi: DPI for PDF rendering (default: 200)
:param min_score: Minimum confidence score for layout detection (default: 0.0)

Source code in doctra/parsers/table_chart_extractor.py
def __init__(
        self,
        *,
        extract_charts: bool = True,
        extract_tables: bool = True,
        use_vlm: bool = False,
        vlm_provider: str = "gemini",
        vlm_model: str | None = None,
        vlm_api_key: str | None = None,
        layout_model_name: str = "PP-DocLayout_plus-L",
        dpi: int = 200,
        min_score: float = 0.0,
):
    """
    Initialize the ChartTablePDFParser with extraction configuration.

    :param extract_charts: Whether to extract charts from the document (default: True)
    :param extract_tables: Whether to extract tables from the document (default: True)
    :param use_vlm: Whether to use VLM for structured data extraction (default: False)
    :param vlm_provider: VLM provider to use ("gemini", "openai", "anthropic", or "openrouter", default: "gemini")
    :param vlm_model: Model name to use (defaults to provider-specific defaults)
    :param vlm_api_key: API key for VLM provider (required if use_vlm is True)
    :param layout_model_name: Layout detection model name (default: "PP-DocLayout_plus-L")
    :param dpi: DPI for PDF rendering (default: 200)
    :param min_score: Minimum confidence score for layout detection (default: 0.0)
    """
    if not extract_charts and not extract_tables:
        raise ValueError("At least one of extract_charts or extract_tables must be True")

    self.extract_charts = extract_charts
    self.extract_tables = extract_tables
    self.layout_engine = PaddleLayoutEngine(model_name=layout_model_name)
    self.dpi = dpi
    self.min_score = min_score

    self.use_vlm = use_vlm
    self.vlm = None
    if self.use_vlm:
        self.vlm = VLMStructuredExtractor(
            vlm_provider=vlm_provider,
            vlm_model=vlm_model,
            api_key=vlm_api_key,
        )
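The two extraction flags both gate construction and determine which report files appear later. The sketch below reproduces that decision logic in isolation, combining the constructor's validation with the filename selection used in `parse`. The function name `structured_report_names` is a hypothetical helper, not part of Doctra.

```python
from typing import Tuple


def structured_report_names(extract_charts: bool = True, extract_tables: bool = True) -> Tuple[str, str]:
    """Return the (Excel, HTML) report filenames implied by the extraction flags."""
    # Same validation as the constructor: at least one flag must be set.
    if not extract_charts and not extract_tables:
        raise ValueError("At least one of extract_charts or extract_tables must be True")

    if extract_charts and extract_tables:
        excel_filename = "parsed_tables_charts.xlsx"
    elif extract_charts:
        excel_filename = "parsed_charts.xlsx"
    else:
        excel_filename = "parsed_tables.xlsx"

    # The HTML report reuses the Excel name with a different extension.
    return excel_filename, excel_filename.replace(".xlsx", ".html")


print(structured_report_names(extract_charts=True, extract_tables=False))
```

These reports are only written when `use_vlm` is True and at least one item was successfully converted into structured data; without VLM processing the parser saves only the cropped images.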

parse(pdf_path, output_base_dir='outputs')

Parse a PDF document and extract charts and/or tables.

:param pdf_path: Path to the input PDF file
:param output_base_dir: Base directory for output files (default: "outputs")
:return: None
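Cropped images are saved under a predictable layout: `<output_base_dir>/<pdf name>/structured_parsing/charts` or `tables`, with zero-padded counters that start at 1 and run across the whole document. A minimal sketch of that naming scheme follows. The helper `expected_crop_path` is hypothetical and only mirrors the path construction in the source listing.

```python
import os
from pathlib import Path


def expected_crop_path(pdf_path: str, kind: str, index: int, output_base_dir: str = "outputs") -> str:
    """Build the path where the index-th cropped chart or table image would be saved."""
    if kind not in ("chart", "table"):
        raise ValueError("kind must be 'chart' or 'table'")
    out_dir = os.path.join(output_base_dir, Path(pdf_path).stem, "structured_parsing")
    # Subfolders are named "charts" and "tables"; files use a three-digit counter.
    return os.path.join(out_dir, f"{kind}s", f"{kind}_{index:03d}.png")


print(expected_crop_path("docs/report.pdf", "table", 2))
```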

Source code in doctra/parsers/table_chart_extractor.py
def parse(self, pdf_path: str, output_base_dir: str = "outputs") -> None:
    """
    Parse a PDF document and extract charts and/or tables.

    :param pdf_path: Path to the input PDF file
    :param output_base_dir: Base directory for output files (default: "outputs")
    :return: None
    """
    pdf_name = Path(pdf_path).stem
    out_dir = os.path.join(output_base_dir, pdf_name, "structured_parsing")
    os.makedirs(out_dir, exist_ok=True)

    charts_dir = None
    tables_dir = None

    if self.extract_charts:
        charts_dir = os.path.join(out_dir, "charts")
        os.makedirs(charts_dir, exist_ok=True)

    if self.extract_tables:
        tables_dir = os.path.join(out_dir, "tables")
        os.makedirs(tables_dir, exist_ok=True)

    pages: List[LayoutPage] = self.layout_engine.predict_pdf(
        pdf_path, batch_size=1, layout_nms=True, dpi=self.dpi, min_score=self.min_score
    )
    pil_pages = [im for (im, _, _) in render_pdf_to_images(pdf_path, dpi=self.dpi)]

    target_labels = []
    if self.extract_charts:
        target_labels.append("chart")
    if self.extract_tables:
        target_labels.append("table")

    chart_count = sum(sum(1 for b in p.boxes if b.label == "chart") for p in pages) if self.extract_charts else 0
    table_count = sum(sum(1 for b in p.boxes if b.label == "table") for p in pages) if self.extract_tables else 0

    if self.use_vlm:
        md_lines: List[str] = ["# Extracted Charts and Tables\n"]
        structured_items: List[Dict[str, Any]] = []
        vlm_items: List[Dict[str, Any]] = []

    charts_desc = "Charts (VLM → table)" if self.use_vlm else "Charts (cropped)"
    tables_desc = "Tables (VLM → table)" if self.use_vlm else "Tables (cropped)"

    chart_counter = 1
    table_counter = 1

    with ExitStack() as stack:
        is_notebook = "ipykernel" in sys.modules or "jupyter" in sys.modules
        is_terminal = hasattr(sys.stdout, 'isatty') and sys.stdout.isatty()

        if is_notebook:
            charts_bar = stack.enter_context(
                create_notebook_friendly_bar(total=chart_count, desc=charts_desc)) if chart_count else None
            tables_bar = stack.enter_context(
                create_notebook_friendly_bar(total=table_count, desc=tables_desc)) if table_count else None
        else:
            charts_bar = stack.enter_context(
                create_beautiful_progress_bar(total=chart_count, desc=charts_desc, leave=True)) if chart_count else None
            tables_bar = stack.enter_context(
                create_beautiful_progress_bar(total=table_count, desc=tables_desc, leave=True)) if table_count else None

        for p in pages:
            page_num = p.page_index
            page_img: Image.Image = pil_pages[page_num - 1]

            target_items = [box for box in p.boxes if box.label in target_labels]

            if target_items and self.use_vlm:
                md_lines.append(f"\n## Page {page_num}\n")

            for box in sorted(target_items, key=reading_order_key):
                if box.label == "chart" and self.extract_charts:
                    chart_filename = f"chart_{chart_counter:03d}.png"
                    chart_path = os.path.join(charts_dir, chart_filename)

                    cropped_img = page_img.crop((box.x1, box.y1, box.x2, box.y2))
                    cropped_img.save(chart_path)

                    if self.use_vlm and self.vlm:
                        rel_path = os.path.join("charts", chart_filename)
                        wrote_table = False

                        try:
                            extracted_chart = self.vlm.extract_chart(chart_path)
                            structured_item = to_structured_dict(extracted_chart)
                            if structured_item:
                                # Add page and type information to structured item
                                structured_item["page"] = page_num
                                structured_item["type"] = "Chart"
                                structured_items.append(structured_item)
                                vlm_items.append({
                                    "kind": "chart",
                                    "page": page_num,
                                    "image_rel_path": rel_path,
                                    "title": structured_item.get("title"),
                                    "headers": structured_item.get("headers"),
                                    "rows": structured_item.get("rows"),
                                })
                                md_lines.append(
                                    render_markdown_table(
                                        structured_item.get("headers"),
                                        structured_item.get("rows"),
                                        title=structured_item.get(
                                            "title") or f"Chart {chart_counter} — page {page_num}"
                                    )
                                )
                                wrote_table = True
                        except Exception:
                            pass

                        if not wrote_table:
                            md_lines.append(f"![Chart {chart_counter} — page {page_num}]({rel_path})\n")

                    chart_counter += 1
                    if charts_bar:
                        charts_bar.update(1)

                elif box.label == "table" and self.extract_tables:
                    table_filename = f"table_{table_counter:03d}.png"
                    table_path = os.path.join(tables_dir, table_filename)

                    cropped_img = page_img.crop((box.x1, box.y1, box.x2, box.y2))
                    cropped_img.save(table_path)

                    if self.use_vlm and self.vlm:
                        rel_path = os.path.join("tables", table_filename)
                        wrote_table = False

                        try:
                            extracted_table = self.vlm.extract_table(table_path)
                            structured_item = to_structured_dict(extracted_table)
                            if structured_item:
                                # Add page and type information to structured item
                                structured_item["page"] = page_num
                                structured_item["type"] = "Table"
                                structured_items.append(structured_item)
                                vlm_items.append({
                                    "kind": "table",
                                    "page": page_num,
                                    "image_rel_path": rel_path,
                                    "title": structured_item.get("title"),
                                    "headers": structured_item.get("headers"),
                                    "rows": structured_item.get("rows"),
                                })
                                md_lines.append(
                                    render_markdown_table(
                                        structured_item.get("headers"),
                                        structured_item.get("rows"),
                                        title=structured_item.get(
                                            "title") or f"Table {table_counter} — page {page_num}"
                                    )
                                )
                                wrote_table = True
                        except Exception:
                            pass

                        if not wrote_table:
                            md_lines.append(f"![Table {table_counter} — page {page_num}]({rel_path})\n")

                    table_counter += 1
                    if tables_bar:
                        tables_bar.update(1)

    excel_path = None

    if self.use_vlm:

        if structured_items:
            if self.extract_charts and self.extract_tables:
                excel_filename = "parsed_tables_charts.xlsx"
            elif self.extract_charts:
                excel_filename = "parsed_charts.xlsx"
            elif self.extract_tables:
                excel_filename = "parsed_tables.xlsx"
            else:
                excel_filename = "parsed_data.xlsx"  # fallback


            excel_path = os.path.join(out_dir, excel_filename)
            write_structured_excel(excel_path, structured_items)

            html_filename = excel_filename.replace('.xlsx', '.html')
            html_path = os.path.join(out_dir, html_filename)
            write_structured_html(html_path, structured_items)

        if 'vlm_items' in locals() and vlm_items:
            with open(os.path.join(out_dir, "vlm_items.json"), 'w', encoding='utf-8') as jf:
                json.dump(vlm_items, jf, ensure_ascii=False, indent=2)

    extraction_types = []
    if self.extract_charts:
        extraction_types.append("charts")
    if self.extract_tables:
        extraction_types.append("tables")

    print(f"✅ Parsing completed successfully!")
    print(f"📁 Output directory: {out_dir}")
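
The report file names chosen at the end of the listing above depend only on the two extraction flags. As a minimal sketch, the same branching can be reproduced to predict which files to look for. The helper name `expected_report_names` is hypothetical and not part of Doctra.

```python
def expected_report_names(extract_charts: bool, extract_tables: bool):
    """Mirror the file-name branching shown in the listing above.

    Hypothetical helper, not part of Doctra. The reports are only written
    when use_vlm is enabled and at least one structured item was extracted.
    """
    if extract_charts and extract_tables:
        excel_filename = "parsed_tables_charts.xlsx"
    elif extract_charts:
        excel_filename = "parsed_charts.xlsx"
    elif extract_tables:
        excel_filename = "parsed_tables.xlsx"
    else:
        excel_filename = "parsed_data.xlsx"  # fallback, as in the listing
    # The HTML report reuses the Excel base name with a different extension.
    html_filename = excel_filename.replace(".xlsx", ".html")
    return excel_filename, html_filename


print(expected_report_names(True, False))
```

With both flags left at their defaults, the expected reports are parsed_tables_charts.xlsx and parsed_tables_charts.html.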

Quick Reference

StructuredPDFParser

from doctra import StructuredPDFParser

parser = StructuredPDFParser(
    # Layout Detection
    layout_model_name="PP-DocLayout_plus-L",  # str
    dpi=200,                                  # int
    min_score=0.0,                            # float

    # OCR Settings
    ocr_lang="eng",       # str
    ocr_psm=4,            # int
    ocr_oem=3,            # int
    ocr_extra_config="",  # str

    # VLM Settings
    use_vlm=False,          # bool
    vlm_provider="gemini",  # str
    vlm_api_key=None,       # str, required when use_vlm is True
    vlm_model=None,         # str, None selects the provider-specific default

    # Output Settings
    box_separator="\n",  # str
)

# Parse document
parser.parse(
    "document.pdf",             # pdf_path: str
    output_base_dir="outputs",  # str
)

# Visualize layout
parser.display_pages_with_boxes(
    "document.pdf",  # pdf_path: str
    num_pages=3,     # int
    cols=2,          # int
    page_width=800,  # int
    spacing=40,      # int
    save_path=None,  # str, optional
)

EnhancedPDFParser

from doctra import EnhancedPDFParser

parser = EnhancedPDFParser(
    # Image Restoration
    use_image_restoration=True,     # bool
    restoration_task="appearance",  # str
    restoration_device=None,        # str: "cuda", "cpu", or None (auto-detect)
    restoration_dpi=200,            # int

    # All StructuredPDFParser parameters...
)

# Parse with enhancement
parser.parse(
    "document.pdf",             # pdf_path: str
    output_base_dir="outputs",  # str
)

ChartTablePDFParser

from doctra import ChartTablePDFParser

parser = ChartTablePDFParser(
    # Extraction Settings
    extract_charts=True,  # bool
    extract_tables=True,  # bool

    # VLM Settings
    use_vlm=False,          # bool
    vlm_provider="gemini",  # str
    vlm_api_key=None,       # str, required when use_vlm is True
    vlm_model=None,         # str, None selects the provider-specific default

    # Layout Detection
    layout_model_name="PP-DocLayout_plus-L",  # str
    dpi=200,                                  # int
    min_score=0.0,                            # float
)

# Extract charts/tables
parser.parse(
    "document.pdf",             # pdf_path: str
    output_base_dir="outputs",  # str
)

Parameter Reference

Layout Detection Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| layout_model_name | str | "PP-DocLayout_plus-L" | PaddleOCR layout detection model |
| dpi | int | 200 | Image resolution for rendering PDF pages |
| min_score | float | 0.0 | Minimum confidence score for detected elements |
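
To get a feel for what dpi means in practice, the sketch below converts a PDF page size in points (1 point = 1/72 inch) into pixel dimensions. The helper name `rendered_size_px` is hypothetical, and the rendering backend Doctra uses may round slightly differently.

```python
def rendered_size_px(width_pt: float, height_pt: float, dpi: int = 200):
    """Convert a PDF page size in points (1 pt = 1/72 inch) to pixels at a given DPI.

    Hypothetical helper for estimating image sizes; not part of Doctra.
    """
    return round(width_pt * dpi / 72), round(height_pt * dpi / 72)


# US Letter is 612 x 792 points (8.5 x 11 inches).
print(rendered_size_px(612, 792))       # (1700, 2200)
print(rendered_size_px(612, 792, 300))  # (2550, 3300)
```

Higher values yield larger page images, which increases memory use and processing time roughly with the square of the DPI.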

OCR Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| ocr_lang | str | "eng" | Tesseract language code |
| ocr_psm | int | 4 | Page segmentation mode |
| ocr_oem | int | 3 | OCR engine mode |
| ocr_extra_config | str | "" | Additional Tesseract configuration |
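
On the Tesseract command line, the page segmentation and engine modes are passed as `--psm` and `--oem`. The sketch below shows how the three OCR settings could map onto a single Tesseract configuration string. Whether Doctra assembles the string exactly this way is an assumption, and `build_tesseract_config` is a hypothetical helper.

```python
def build_tesseract_config(psm: int = 4, oem: int = 3, extra: str = "") -> str:
    """Combine OCR settings into a Tesseract-style configuration string.

    Hypothetical helper; the way Doctra builds this string internally
    is an assumption.
    """
    parts = [f"--psm {psm}", f"--oem {oem}"]
    if extra.strip():
        parts.append(extra.strip())
    return " ".join(parts)


print(build_tesseract_config())  # --psm 4 --oem 3
print(build_tesseract_config(psm=6, extra="-c preserve_interword_spaces=1"))
```

In Tesseract, PSM 4 assumes a single column of text of variable sizes, PSM 6 assumes a single uniform block of text, and OEM 3 selects the default engine based on what is available.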

VLM Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| use_vlm | bool | False | Enable VLM processing |
| vlm_provider | str | "gemini" | Provider: "openai", "gemini", "anthropic", "openrouter" |
| vlm_api_key | str | None | API key for the VLM provider (required if use_vlm is True) |
| vlm_model | str | None | Specific model to use (defaults to a provider-specific model) |
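
Because an API key is required whenever use_vlm is True, it can help to validate the VLM settings before constructing a parser. The following is a minimal sketch: `build_vlm_kwargs` and the `VLM_API_KEY` environment variable name are hypothetical and not part of Doctra.

```python
import os

SUPPORTED_PROVIDERS = ("openai", "gemini", "anthropic", "openrouter")


def build_vlm_kwargs(use_vlm, provider="gemini", api_key=None, model=None,
                     env_var="VLM_API_KEY"):
    """Return keyword arguments for the VLM settings, validating them first.

    Hypothetical helper; the environment variable name is an assumption.
    """
    if not use_vlm:
        return {"use_vlm": False}
    if provider not in SUPPORTED_PROVIDERS:
        raise ValueError(f"Unsupported VLM provider: {provider!r}")
    # Fall back to the environment when no key is passed explicitly.
    api_key = api_key or os.environ.get(env_var)
    if not api_key:
        raise ValueError("vlm_api_key is required when use_vlm is True")
    kwargs = {"use_vlm": True, "vlm_provider": provider, "vlm_api_key": api_key}
    if model is not None:
        kwargs["vlm_model"] = model
    return kwargs


print(build_vlm_kwargs(False))  # {'use_vlm': False}
```

The returned dictionary can then be unpacked into a parser constructor, for example `StructuredPDFParser(**build_vlm_kwargs(True, "openai", api_key=my_key))`, where `my_key` is a placeholder for a real key.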

Image Restoration Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| use_image_restoration | bool | True | Enable image restoration |
| restoration_task | str | "appearance" | Restoration task type |
| restoration_device | str | None | Device: "cuda", "cpu", or None (auto-detect) |
| restoration_dpi | int | 200 | DPI for restoration processing |

Extraction Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| extract_charts | bool | True | Extract chart elements |
| extract_tables | bool | True | Extract table elements |

Output Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| box_separator | str | "\n" | Separator between detected elements |
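
The effect of box_separator is easiest to see on a small list of text fragments. The sketch below assumes the separator is simply placed between the text of consecutive boxes; how Doctra applies it internally is an assumption, and `join_boxes` is a hypothetical helper.

```python
def join_boxes(texts, box_separator="\n"):
    """Join per-box text with the given separator, skipping empty boxes.

    Hypothetical illustration of what box_separator controls.
    """
    return box_separator.join(t.strip() for t in texts if t.strip())


boxes = ["Title", "  First paragraph. ", "", "Second paragraph."]
print(join_boxes(boxes))                        # one line break between boxes
print(repr(join_boxes(boxes, box_separator="\n\n")))
```

A double line break keeps boxes visually apart as separate Markdown paragraphs, while a single one produces more compact output.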

Return Values

parse() Method

Returns: None

Generates output files in the specified output_base_dir:

outputs/
└── <document_name>/
    ├── full_parse/  # or 'enhanced_parse/', 'structured_parsing/'
    │   ├── result.md
    │   ├── result.html
    │   ├── tables.xlsx  # If VLM enabled
    │   ├── tables.html  # If VLM enabled
    │   ├── vlm_items.json  # If VLM enabled
    │   └── images/
    │       ├── figures/
    │       ├── charts/
    │       └── tables/
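
When VLM processing is enabled, vlm_items.json can be post-processed with the standard library alone. The sketch below assumes each entry carries the keys seen in the ChartTablePDFParser listing ("kind", "page", "image_rel_path", "title", "headers", "rows"). The helper names and the example directory are hypothetical.

```python
import json
from collections import defaultdict
from pathlib import Path


def load_vlm_items(parse_dir):
    """Load vlm_items.json from a parse output directory, or [] if it is absent."""
    path = Path(parse_dir) / "vlm_items.json"
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as fh:
        return json.load(fh)


def group_by_kind(items):
    """Group items by their "kind" field (for example "chart" or "table")."""
    grouped = defaultdict(list)
    for item in items:
        grouped[item.get("kind", "unknown")].append(item)
    return dict(grouped)


# Illustrative path only; the actual directory depends on the parser and document name.
items = load_vlm_items("outputs/document/full_parse")
for kind, entries in group_by_kind(items).items():
    print(kind, len(entries))
```

Returning an empty list for a missing file keeps the same code usable for runs where VLM processing was disabled.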

display_pages_with_boxes() Method

Returns: None

Displays a visualization of the layout detection results, or saves it to save_path when one is given.

Error Handling

All parsers may raise:

  • FileNotFoundError: PDF file not found
  • ValueError: Invalid parameter values
  • RuntimeError: Processing errors (e.g., Poppler not found)
  • APIError: VLM API errors (when VLM enabled)

Example error handling:

from doctra import StructuredPDFParser

parser = StructuredPDFParser()

try:
    parser.parse("document.pdf")
except FileNotFoundError:
    print("PDF file not found!")
except ValueError as e:
    print(f"Invalid parameter: {e}")
except RuntimeError as e:
    print(f"Processing error: {e}")
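
When processing several documents, the same pattern can be wrapped so that one failing file does not stop the batch. The sketch below is a hypothetical helper (`parse_many` is not part of Doctra) that takes any parse callable, such as `parser.parse`, and records the outcome per file.

```python
from typing import Callable, Dict, Iterable, Optional


def parse_many(parse_fn: Callable[[str], None],
               pdf_paths: Iterable[str]) -> Dict[str, Optional[str]]:
    """Run parse_fn on each path; map path -> None on success or an error summary."""
    results: Dict[str, Optional[str]] = {}
    for path in pdf_paths:
        try:
            parse_fn(path)
            results[path] = None
        except (FileNotFoundError, ValueError, RuntimeError) as exc:
            results[path] = f"{type(exc).__name__}: {exc}"
    return results


def fake_parse(path: str) -> None:
    """Stand-in for parser.parse so the example runs without Doctra installed."""
    if not path.endswith(".pdf"):
        raise ValueError("not a PDF")


for path, error in parse_many(fake_parse, ["report.pdf", "notes.txt"]).items():
    print(path, "ok" if error is None else error)
```

With a real parser, the call would be `parse_many(parser.parse, paths)`. Exceptions outside the three listed types are deliberately left to propagate.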

Examples

See the Examples section for detailed usage examples.