# Quick Start
Get from zero to structured web data in 5 minutes.
## 1. Install Silkweb
For local LLM support, install Ollama and pull a model:
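A sketch of the setup, assuming the package is published on PyPI under the import name `silkweb` and using Ollama's standard CLI with the `qwen2.5:14b` model shown later in the configuration section:

```shell
# Install the library (assumes the PyPI name matches the import name)
pip install silkweb

# Optional: local LLM support via Ollama (https://ollama.com)
ollama pull qwen2.5:14b
```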
## 2. Fetch a page
The simplest operation — fetch any URL and get a SilkPage object:
```python
import silkweb

page = silkweb.fetch("https://books.toscrape.com")

print(page.status)          # 200
print(page.url)             # https://books.toscrape.com/
print(len(page.html))       # raw HTML length
print(page.text[:200])      # cleaned plain text via Trafilatura
print(page.markdown[:200])  # cleaned markdown (headings, links, lists preserved)
```
Every SilkPage exposes the content in multiple formats:
| Property | Format | Description |
|---|---|---|
| `page.html` | Raw HTML | The full HTML source |
| `page.text` | Plain text | Cleaned text with boilerplate removed (via Trafilatura) |
| `page.markdown` | Markdown | Cleaned content with headings, links, bold/italic preserved |
Silkweb automatically selects the best fetcher tier. If a simple HTTP request fails (anti-bot, JavaScript-rendered content), it escalates to a stealth browser transparently.
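The escalation idea is easy to picture in isolation. As a rough conceptual sketch (not Silkweb's actual internals), a tiered fetcher tries the cheapest strategy first and falls back on failure:

```python
# Illustrative sketch of tier escalation -- not Silkweb's real implementation.
def fetch_with_escalation(url, tiers):
    """Try each fetcher tier in order; return the first successful result."""
    errors = []
    for name, fetcher in tiers:
        try:
            return name, fetcher(url)
        except Exception as exc:  # e.g. anti-bot block, missing JS content
            errors.append((name, exc))
    raise RuntimeError(f"all tiers failed for {url}: {errors}")

# Stand-in fetchers: a plain HTTP tier that fails, a browser tier that succeeds.
def http_tier(url):
    raise ConnectionError("403: anti-bot challenge")

def browser_tier(url):
    return "<html>rendered content</html>"

tier_used, html = fetch_with_escalation(
    "https://example.com", [("http", http_tier), ("browser", browser_tier)]
)
print(tier_used)  # browser
```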
## 3. Extract data with plain English
No selectors needed — just describe what you want:
```python
books = silkweb.ask(
    "https://books.toscrape.com",
    "all books with title and price"
)

for book in books:
    print(f"{book['title']}: £{book['price']}")
```
Behind the scenes, Silkweb:
- Fetches the page (auto-selecting the right tier)
- Cleans the HTML (strips nav, footer, ads)
- Uses an LLM to infer a schema from your prompt
- Extracts structured data matching that schema
- Compiles CSS/XPath selectors and caches them
- Returns typed results
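Step 5 is what keeps repeated extractions cheap: once selectors are compiled for a prompt, later runs can skip the LLM entirely. A minimal sketch of that caching idea (hypothetical, not Silkweb's real cache):

```python
# Hypothetical sketch of prompt -> compiled-selector caching.
selector_cache = {}

def get_selectors(domain, prompt, compile_with_llm):
    """Return cached selectors, calling the LLM only on a cache miss."""
    key = (domain, prompt)
    if key not in selector_cache:
        selector_cache[key] = compile_with_llm(prompt)
    return selector_cache[key]

calls = []
def fake_llm_compile(prompt):
    calls.append(prompt)  # track how often the expensive LLM step runs
    return {"title": "h3 a", "price": ".price_color"}

get_selectors("books.toscrape.com", "all books with title and price", fake_llm_compile)
get_selectors("books.toscrape.com", "all books with title and price", fake_llm_compile)
print(len(calls))  # 1 -- the second call hits the cache
```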
**Progress logging**

`ask` / `async_ask` emit structured logs (via structlog) when `log_level` is set to `INFO` or lower. By default (`log_level="WARNING"`), they are quiet and do not print progress to stdout.
## 4. Use a Pydantic schema for type safety
When you know the shape of the data ahead of time:
```python
from pydantic import BaseModel
from silkweb import extract

class Book(BaseModel):
    title: str
    price: float
    rating: int
    in_stock: bool

books = extract(
    "https://books.toscrape.com",
    schema=Book,
    prompt="all books on the page"
)

for book in books:
    print(f"{book.title} — £{book.price} ({book.rating} stars)")
```
Every item is validated against your Pydantic model. Invalid results trigger automatic self-healing (re-extraction with corrected prompts).
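The validation half of this is plain Pydantic, so it can be shown in isolation. Here `model_validate` coerces a well-formed row and rejects one whose price isn't numeric — the kind of invalid result that would trigger re-extraction:

```python
from pydantic import BaseModel, ValidationError

class Book(BaseModel):
    title: str
    price: float
    rating: int
    in_stock: bool

good = Book.model_validate(
    {"title": "Sharp Objects", "price": "47.82", "rating": 4, "in_stock": True}
)
print(good.price)  # 47.82 -- numeric strings are coerced to float

try:
    Book.model_validate(
        {"title": "Broken", "price": "n/a", "rating": 5, "in_stock": True}
    )
except ValidationError as exc:
    print(exc.error_count())  # 1
```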
## 5. Use SilkQL for structured queries
SilkQL is Silkweb's query language for the web:
```python
results = silkweb.query("https://news.ycombinator.com", """
{
  stories[] {
    title
    url
    score(int)
    author
    comments(int)
  }
}
""")

for item in results.data:
    print(f"[{item.score}] {item.title}")
```
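The `(int)` annotations request type coercion on the extracted fields. Conceptually — a sketch of the idea, not SilkQL's real parser — that maps to casting raw scraped strings during post-processing:

```python
# Sketch of how "(int)"-style field annotations could coerce scraped strings.
CASTS = {"int": int, "float": float, "str": str}

def coerce(raw, field_types):
    """Cast each raw string value to its annotated type (default: str)."""
    return {
        field: CASTS.get(field_types.get(field, "str"), str)(value)
        for field, value in raw.items()
    }

story = coerce(
    {"title": "Show HN: Silkweb", "score": "312", "comments": "87"},
    {"score": "int", "comments": "int"},
)
print(story["score"] + story["comments"])  # 399
```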
## 6. Traditional CSS/XPath scraping
Silkweb works perfectly fine without LLMs:
```python
page = silkweb.fetch("https://example.com")

# Content formats
print(page.html)      # raw HTML
print(page.text)      # clean plain text
print(page.markdown)  # clean markdown

# CSS selectors
headings = page.css("h1, h2, h3")
for h in headings:
    print(h.text)

# XPath
links = page.xpath("//a[@href]", kind="elements")
for link in links:
    print(link["href"], link.text)

# Convenience methods
all_links = page.links()
tables = page.tables()
json_ld = page.json_ld()
```
## 7. Async usage
All functions have async counterparts:
```python
import asyncio
import silkweb

async def main():
    page = await silkweb.async_fetch("https://example.com")
    data = await silkweb.async_ask(
        "https://books.toscrape.com",
        "all book titles and prices"
    )
    return data

results = asyncio.run(main())
```
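Because these are ordinary coroutines, they compose with the rest of asyncio — for example, `asyncio.gather` for concurrent fetches. A runnable sketch with a stand-in coroutine in place of Silkweb's:

```python
import asyncio

# Stand-in for an async fetcher: any awaitable fetch composes the same way.
async def fake_fetch(url):
    await asyncio.sleep(0.01)  # simulate network latency
    return f"<html>{url}</html>"

async def main():
    urls = ["https://example.com/a", "https://example.com/b"]
    # gather runs the fetches concurrently instead of one after another
    return await asyncio.gather(*(fake_fetch(u) for u in urls))

pages = asyncio.run(main())
print(len(pages))  # 2
```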
## 8. Configure Silkweb
```python
import silkweb

silkweb.configure(
    extraction_model="ollama/qwen2.5:14b",
    max_tier=2,               # don't go beyond Playwright
    rate_limit_per_domain=3,  # max 3 req/s per domain
    cache_backend="sqlite",   # persistent caching
    log_level="INFO",         # see what's happening
)
```
## Next steps
- Fetcher Tiers — understand auto-escalation
- LLM Extraction — the full extraction pipeline
- SilkQL — structured queries
- Anti-Bot — proxy pools, rate limiting, stealth
- API Reference — complete function signatures