HTML Parsing & Selectors

Silkweb exposes a selector API on top of lxml: CSS via lxml.cssselect, XPath via lxml, plus extractors that work without any LLM.

Page content formats

Every SilkPage returned by fetch() exposes the content in three formats:

  • page.html (raw HTML): the full HTML source as received
  • page.text (plain text): cleaned text with boilerplate removed (via Trafilatura)
  • page.markdown (Markdown): cleaned content with headings, links, bold/italic, and lists preserved

import silkweb

page = silkweb.fetch("https://example.com")

print(page.html)         # raw HTML source
print(page.text)         # "Example Domain\nThis domain is..."
print(page.markdown)     # "# Example Domain\n\nThis domain is..."

CSS selectors

page.css() uses lxml.cssselect.CSSSelector (the Selectors Level 3 subset supported by cssselect).

page = silkweb.fetch(url)

# Returns list[SilkElement]
items = page.css(".product-card")

# First match only
title = page.css_first("h1")

# Text shorthand
title_text = page.css_first("h1").text

XPath selectors

page.xpath() takes a kind argument that controls what it returns:
  • kind="elements" (default): returns list[SilkElement] — use for paths that match element nodes.
  • kind="values": returns raw XPath results (strings, numbers, attribute values, text nodes) — use for //@href, /text(), etc.

# Elements (default)
items = page.xpath("//div[@class='product-card']")

# Attribute nodes and text — use kind="values"
hrefs = page.xpath("//a[@class='product-link']/@href", kind="values")
prices = page.xpath("//span[contains(@class, 'price')]/text()", kind="values")

SilkElement

Every element returned by css() or xpath() (with kind="elements") is a SilkElement with a rich API:

el = page.css_first(".product-card")

el.text          # inner text content
el.html          # outer HTML
el.attrs         # dict of all attributes
el["href"]       # access attribute by name
el.xpath         # XPath address in the document
el.parent        # parent SilkElement
el.children      # list of child SilkElements
el.siblings      # list of sibling SilkElements

Built-in smart extractors

These methods work without any LLM — they use heuristics and standard patterns:

Links

page.links(external=...) needs a non-empty page.url (or the URL you passed to SilkPage) to decide what “internal” vs “external” means. If there is no base URL, external filtering is not applied.

all_links = page.links()                    # all <a href> as absolute URLs
external = page.links(external=True)        # external domains only
internal = page.links(external=False)       # same-domain only
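
The internal/external split can be sketched with the standard library. Everything below — `classify_links`, its arguments, and the sample URLs — is illustrative, not Silkweb's actual implementation:

```python
from urllib.parse import urljoin, urlparse

def classify_links(hrefs, base_url):
    """Resolve hrefs against base_url and split them by domain (illustrative)."""
    base_host = urlparse(base_url).netloc
    internal, external = [], []
    for href in hrefs:
        absolute = urljoin(base_url, href)       # relative hrefs become absolute
        host = urlparse(absolute).netloc
        (internal if host == base_host else external).append(absolute)
    return internal, external

internal, external = classify_links(
    ["/about", "https://other.example.org/page"],
    "https://example.com/",
)
```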

Tables

Returns each <table> as a list of rows; colspan/rowspan are not expanded. Nested tables each appear as a separate top-level table.

tables = page.tables()
# Returns a list of tables, each as a list of rows
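
The row-of-cells shape can be approximated with the standard-library html.parser. This `TableExtractor` is a simplified illustration (it ignores nested tables and, like Silkweb, does not expand colspan/rowspan), not Silkweb's code:

```python
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Collect each <table> as a list of rows of cell text (illustrative)."""
    def __init__(self):
        super().__init__()
        self.tables = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])      # start a new table
        elif tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._cell is not None:      # only capture text inside a cell
            self._cell.append(data)

parser = TableExtractor()
parser.feed("<table><tr><th>Name</th><th>Qty</th></tr>"
            "<tr><td>Apple</td><td>3</td></tr></table>")
```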

JSON-LD structured data

json_ld = page.json_ld()
# Parses all <script type="application/ld+json"> tags
# [{"@type": "Product", "name": "...", "price": "..."}, ...]

Metadata

meta = page.metadata
# {'title': '...', 'description': '...', ...} from <title> and <meta name|property>
# Open Graph / Twitter tags appear when present as meta tags (e.g. og:title, twitter:card)
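
Collecting <title> and <meta name|property> pairs can be sketched with html.parser; `MetaExtractor` is an illustrative stand-in for whatever Silkweb does internally:

```python
from html.parser import HTMLParser

class MetaExtractor(HTMLParser):
    """Collect <title> text and <meta name|property> pairs (illustrative)."""
    def __init__(self):
        super().__init__()
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            key = attrs.get("name") or attrs.get("property")
            if key and "content" in attrs:
                self.meta[key] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.meta["title"] = data

parser = MetaExtractor()
parser.feed('<head><title>Demo</title>'
            '<meta name="description" content="A page">'
            '<meta property="og:title" content="Demo OG"></head>')
```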

Article extraction

article = page.article()
# {'title': '...', 'text': '...', 'author': '...', 'date': '...', 'language': '...'}
# `text` is the same main-body text as `page.text` (Trafilatura when available).

Hydration data

Embedded JSON from common SPA patterns (best-effort):

  • Next.js: <script id="__NEXT_DATA__" type="application/json">
  • Nuxt: <script id="__NUXT_DATA__" ...> when present, otherwise a regex match for an inline __NUXT__ = {...} assignment

Other stacks (Remix, SvelteKit, etc.) are not parsed automatically yet; use page.html and custom extraction if needed.

data = page.hydration_data()  # dict | None
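
The Next.js case can be sketched as a regex for the __NEXT_DATA__ script plus json.loads; `next_data` below is a hypothetical helper, not Silkweb's implementation:

```python
import json
import re

def next_data(html):
    """Extract the __NEXT_DATA__ JSON payload from raw HTML, or None."""
    m = re.search(
        r'<script[^>]*id=["\']__NEXT_DATA__["\'][^>]*>(.*?)</script>',
        html,
        re.DOTALL,
    )
    if not m:
        return None
    try:
        return json.loads(m.group(1))
    except json.JSONDecodeError:
        return None  # malformed payload: behave like "no hydration data"

html = ('<script id="__NEXT_DATA__" type="application/json">'
        '{"props": {"pageProps": {"id": 7}}}</script>')
payload = next_data(html)
```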

Repeated pattern detection

Heuristic: finds the most repeated (tag, class) pair under <body> among elements that have a class attribute, then returns one record per element.

Each record has text, xpath, and url (first a[href] inside the element, made absolute with page.url), not separate title/price fields.

records = page.detect_records()
# [{'text': '...', 'xpath': '...', 'url': '...' | None}, ...]
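
The core of the heuristic — counting (tag, class) pairs and picking the most repeated one — can be sketched with Counter and html.parser; `RecordDetector` is illustrative and skips the per-record text/xpath/url bookkeeping:

```python
from collections import Counter
from html.parser import HTMLParser

class RecordDetector(HTMLParser):
    """Count (tag, class) pairs among elements with a class attribute (illustrative)."""
    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if cls:
            self.counts[(tag, cls)] += 1

detector = RecordDetector()
detector.feed(
    '<body>'
    '<div class="card">A</div><div class="card">B</div>'
    '<div class="card">C</div><span class="note">x</span>'
    '</body>'
)
# The most common (tag, class) pair is the candidate record pattern
pattern, count = detector.counts.most_common(1)[0]
```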

Network log (browser tiers)

page.network_requests() is only non-empty when a browser tier fetcher captured events (see fetch guides). Otherwise it returns [].

SilkMeta provenance

Every extracted item can carry provenance metadata:

from silkweb.parse.page import SilkMeta

# SilkMeta fields:
# - url: source URL
# - fetched_at: timestamp
# - fetch_tier: which tier was used (0-3)
# - xpath: XPath address of the source element
# - llm_model: which model extracted this (if LLM was used)
# - selector_from_cache: whether cached selectors were used
# - confidence: extraction confidence score