Output Formats¶

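Silkweb exposes fetched page content in several representations and can return or save extracted data as Python objects, Pydantic models, DataFrames, files, databases, and HuggingFace datasets.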

Page content formats¶

Every SilkPage returned by fetch() provides the content in multiple representations:

| Property | Format | Description |
| --- | --- | --- |
| `page.html` | Raw HTML | The full HTML source as received |
| `page.text` | Plain text | Cleaned text with boilerplate removed (via Trafilatura) |
| `page.markdown` | Markdown | Cleaned content with headings, links, bold/italic, lists preserved |
| `page.json_ld()` | JSON-LD | Structured data from `<script type="application/ld+json">` tags |
| `page.hydration_data()` | Dict \| None | Next.js `__NEXT_DATA__`, Nuxt `__NUXT_DATA__` or inline `__NUXT__` (best-effort) |
| `page.network_requests()` | List[dict] | Browser-tier fetch capture only; empty otherwise |
| `page.tables()` | List | All `<table>` elements parsed into row/column lists |
| `page.links()` | List[str] | All `<a href>` links as absolute URLs |
| `page.article()` | Dict | Article metadata (title, author, date, content) |

import silkweb

page = silkweb.fetch("https://example.com")

print(page.html)         # raw HTML
print(page.text)         # "Example Domain\nThis domain is..."
print(page.markdown)     # "# Example Domain\n\nThis domain is..."
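To make the `page.json_ld()` and `page.links()` rows more concrete, here is a minimal standard-library sketch of the kind of work they describe: collecting `application/ld+json` blocks and resolving `<a href>` values to absolute URLs. It is not Silkweb's implementation; the class name `LdAndLinkParser` and the sample HTML are hypothetical.

```python
import json
from html.parser import HTMLParser
from urllib.parse import urljoin


class LdAndLinkParser(HTMLParser):
    """Hypothetical helper: collects JSON-LD blocks and absolute link URLs."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.json_ld: list = []
        self.links: list[str] = []
        self._in_ld = False
        self._buffer: list[str] = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "script" and attrs.get("type") == "application/ld+json":
            self._in_ld = True
            self._buffer = []
        elif tag == "a" and attrs.get("href"):
            # Resolve relative hrefs against the page URL.
            self.links.append(urljoin(self.base_url, attrs["href"]))

    def handle_data(self, data):
        if self._in_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_ld:
            self._in_ld = False
            try:
                self.json_ld.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip malformed JSON-LD blocks


SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">{"@type": "Product", "name": "Lamp"}</script>
</head><body><a href="/about">About</a> <a href="https://example.org/x">X</a></body></html>
"""

parser = LdAndLinkParser("https://example.com/shop/")
parser.feed(SAMPLE_HTML)
parser.close()
print(parser.json_ld)  # parsed JSON-LD objects
print(parser.links)    # absolute URLs
```

In practice the `SilkPage` methods do this for you; the sketch only shows what "structured data from script tags" and "absolute URLs" mean for a given page.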

Extraction return types¶

Python dict / list (default)¶

data = silkweb.ask(url, "all products")
# [{'name': '...', 'price': 29.99}, ...]

Pydantic models¶

from pydantic import BaseModel

class Product(BaseModel):
    name: str
    price: float

products: list[Product] = silkweb.extract(url, schema=Product, prompt="all products")
products[0].model_dump()
products[0].model_dump_json()

Automatic DataFrame detection (`ask` only by default)¶

ask() and async_ask() default to output="auto". With that setting, when pandas or polars is already imported and auto_detect_dataframe is enabled, ask() can return a DataFrame instead of list[dict].

extract() / async_extract() default to output="python", so they return list[BaseModel] even if pandas is imported. Use output="auto" for the same import-based heuristic, or output="df" for a deterministic DataFrame.

import pandas  # importing enables auto-detection for ask()
import silkweb

df = silkweb.ask(url, "all products")
# df is a pandas DataFrame, not a list[dict]

import polars
import silkweb

df = silkweb.ask(url, "all products")
# df is a polars DataFrame
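The "already imported" condition can be pictured as a lookup in `sys.modules`. The sketch below is an assumption about how such a heuristic might work, not Silkweb's actual code; the function name `detect_dataframe_engine` and the pandas-before-polars precedence are hypothetical.

```python
import sys
from typing import Mapping, Optional


def detect_dataframe_engine(
    modules: Optional[Mapping[str, object]] = None,
) -> Optional[str]:
    """Return "pandas" or "polars" if that library is already imported, else None."""
    loaded = sys.modules if modules is None else modules
    for name in ("pandas", "polars"):  # precedence order is an assumption
        if name in loaded:
            return name
    return None


# Pass a fake module table to see each outcome without importing anything.
print(detect_dataframe_engine({"json": object()}))
print(detect_dataframe_engine({"polars": object()}))
```

Because a heuristic like this depends on what other code has imported, the result can change between environments, which is why forcing `output="df"` is the deterministic option.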

Typed extraction with a schema:

import pandas
import silkweb
from pydantic import BaseModel

class Product(BaseModel):
    name: str
    price: float

# Same heuristic as ask: needs output="auto" (not the default for extract)
rows = silkweb.extract(url, schema=Product, prompt="all products", output="auto")

# Or force a DataFrame explicitly (recommended for extract)
df = silkweb.extract(url, schema=Product, prompt="all products", output="df", dataframe_engine="pandas")

Forcing a DataFrame (recommended)¶

If you want deterministic behavior (no auto-detection), force the output:

df = silkweb.ask(url, "all products", output="df", dataframe_engine="pandas")

Explicit DataFrame conversion¶

from silkweb.output.dataframe import to_dataframe

df = to_dataframe(data, engine="pandas")   # force pandas
df = to_dataframe(data, engine="polars")   # force polars
df = to_dataframe(data, engine="auto")     # auto-detect

File output formats¶

All file output functions accept a list of results (dicts or Pydantic models) and an output path.
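Accepting "dicts or Pydantic models" implies some normalization step before writing. A minimal sketch of that idea follows, assuming Pydantic v2 models (which expose `model_dump()`); the helper `rows_as_dicts` and the `FakeModel` stand-in are hypothetical and only illustrate the concept.

```python
from typing import Any, Iterable


class FakeModel:
    """Stand-in for a Pydantic v2 model; only mimics model_dump()."""

    def __init__(self, **fields: Any) -> None:
        self._fields = fields

    def model_dump(self) -> dict:
        return dict(self._fields)


def rows_as_dicts(rows: Iterable[Any]) -> list[dict]:
    """Convert a mixed list of dicts and model-like objects to plain dicts."""
    normalized = []
    for row in rows:
        if hasattr(row, "model_dump"):  # model-like object
            normalized.append(row.model_dump())
        else:  # assume a mapping
            normalized.append(dict(row))
    return normalized


mixed = [{"name": "Lamp", "price": 29.99}, FakeModel(name="Desk", price=149.0)]
print(rows_as_dicts(mixed))
```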

JSON¶

from silkweb.output.files import to_json

to_json(data, "products.json")
to_json(data, "products.json.gz")   # auto-gzip

JSONL (JSON Lines)¶

One JSON object per line, which suits streaming and large datasets:

from silkweb.output.files import to_jsonl

to_jsonl(data, "products.jsonl")
to_jsonl(data, "products.jsonl.gz")  # auto-gzip
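The one-object-per-line layout is what makes streaming possible: a consumer can process each record without loading the whole file. A minimal standard-library sketch of reading such a file back, where the file is written by hand as a stand-in for `to_jsonl()` output and `iter_jsonl` is a hypothetical helper:

```python
import json
import tempfile
from pathlib import Path


def iter_jsonl(path):
    """Yield one parsed object per non-empty line."""
    with open(path, "r", encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                yield json.loads(line)


# Stand-in for a file produced by to_jsonl(); contents are illustrative.
rows = [{"name": "Lamp", "price": 29.99}, {"name": "Desk", "price": 149.0}]
path = Path(tempfile.mkdtemp()) / "products.jsonl"
path.write_text("".join(json.dumps(r) + "\n" for r in rows), encoding="utf-8")

# Records are consumed lazily, one line at a time.
names = [row["name"] for row in iter_jsonl(path)]
print(names)
```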

CSV¶

from silkweb.output.files import to_csv

to_csv(data, "products.csv")
to_csv(data, "products.csv.gz")  # auto-gzip

Parquet¶

Columnar format, excellent for analytics:

from silkweb.output.files import to_parquet

to_parquet(data, "products.parquet")

Note

Requires pandas and pyarrow to be installed.

SQLite¶

from silkweb.output.files import to_sqlite

to_sqlite(data, "products.db", table="products")
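Once the data is in SQLite, it can be queried with Python's built-in `sqlite3` module. The sketch below builds an in-memory stand-in for the `products` table; the column names and types (`name TEXT`, `price REAL`) are assumptions based on the `Product` example, since the schema `to_sqlite()` creates is not described here.

```python
import sqlite3

# In-memory stand-in for a database written by to_sqlite(); schema is assumed.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT, price REAL)")
conn.executemany(
    "INSERT INTO products (name, price) VALUES (?, ?)",
    [("Lamp", 29.99), ("Desk", 149.0)],
)
conn.commit()

# Return rows that can be converted to dicts keyed by column name.
conn.row_factory = sqlite3.Row
query = "SELECT name, price FROM products WHERE price < ? ORDER BY price"
cheap = [dict(row) for row in conn.execute(query, (100,))]
conn.close()
print(cheap)
```

For a real file, replace `":memory:"` with the path passed to `to_sqlite()`, for example `"products.db"`.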

DuckDB¶

from silkweb.output.files import to_duckdb

to_duckdb(data, "store.duckdb", table="products")

Note

Requires the duckdb package to be installed.

Markdown table¶

from silkweb.output.files import to_markdown

to_markdown(data, "products.md")
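For orientation, this is roughly what rendering a list of dicts as a Markdown table involves. It is only a sketch: the exact layout `to_markdown()` writes is not specified here, and `render_markdown_table` is a hypothetical helper.

```python
def render_markdown_table(rows):
    """Render a list of dicts as a pipe-delimited Markdown table string."""
    if not rows:
        return ""
    headers = list(rows[0].keys())
    lines = [
        "| " + " | ".join(headers) + " |",
        "| " + " | ".join("---" for _ in headers) + " |",
    ]
    for row in rows:
        # Escape pipes so cell text cannot break the column layout.
        cells = [str(row.get(h, "")).replace("|", "\\|") for h in headers]
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)


table = render_markdown_table([{"name": "Lamp", "price": 29.99}])
print(table)
```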

HuggingFace Dataset¶

Convert results to a HuggingFace datasets.Dataset for ML pipelines:

from silkweb.output.dataset import to_dataset

ds = to_dataset(data)
ds.push_to_hub("your-org/scraped-products")

Note

Requires the datasets package to be installed.

Auto-gzip¶

to_json(), to_jsonl(), and to_csv() automatically gzip their output when the path ends in .gz (the other file writers do not support gzip, per the format summary):

to_json(data, "products.json.gz")
to_jsonl(data, "products.jsonl.gz")
to_csv(data, "products.csv.gz")
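Suffix-based compression is straightforward to picture with the standard library. The following is a minimal sketch of the pattern, assuming a simple `.gz` suffix check; it is not Silkweb's code, and `open_text_for_write` is a hypothetical helper.

```python
import csv
import gzip
import tempfile
from pathlib import Path


def open_text_for_write(path):
    """Open path for text writing, gzip-compressing when it ends in .gz."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "wt", encoding="utf-8", newline="")
    return open(path, "w", encoding="utf-8", newline="")


rows = [{"name": "Lamp", "price": 29.99}, {"name": "Desk", "price": 149.0}]
out_path = Path(tempfile.mkdtemp()) / "products.csv.gz"

with open_text_for_write(out_path) as fh:
    writer = csv.DictWriter(fh, fieldnames=["name", "price"])
    writer.writeheader()
    writer.writerows(rows)

# Gzip streams start with the magic bytes 0x1f 0x8b.
is_gzipped = out_path.read_bytes()[:2] == b"\x1f\x8b"

# Reading back through gzip yields the original rows (values as strings).
with gzip.open(out_path, "rt", encoding="utf-8", newline="") as fh:
    round_trip = list(csv.DictReader(fh))
print(is_gzipped, round_trip)
```

The same suffix check reads gzipped output back: open `.gz` paths with `gzip.open(..., "rt")` and everything else with `open()`.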

Format summary¶

| Format | Function | Requires | Gzip support |
| --- | --- | --- | --- |
| JSON | `to_json()` | built-in | Yes |
| JSONL | `to_jsonl()` | built-in | Yes |
| CSV | `to_csv()` | built-in | Yes |
| Parquet | `to_parquet()` | `pandas`, `pyarrow` | No |
| DuckDB | `to_duckdb()` | `duckdb` | No |
| SQLite | `to_sqlite()` | built-in | No |
| Markdown table | `to_markdown()` | built-in | No |
| pandas DataFrame | `to_dataframe(engine="pandas")` | `pandas` | N/A |
| polars DataFrame | `to_dataframe(engine="polars")` | `polars` | N/A |
| HuggingFace Dataset | `to_dataset()` | `datasets` | N/A |
