Caching

Silkweb uses a three-layer cache system to minimize redundant network requests, browser launches, and LLM calls.

Three-layer cache

Layer 1: HTTP cache

Stores raw HTTP responses with conditional GET support (ETag / Last-Modified). Prevents redundant network requests.

Backend: hishel library
Key: URL
Supports conditional GET headers
Configurable TTL and max size
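
The conditional GET flow can be sketched in a few lines. This is a minimal illustration of the revalidation logic only, not hishel's API or Silkweb's internals; `CachedResponse`, `conditional_headers`, and `resolve` are hypothetical names introduced for the example.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class CachedResponse:
    """Hypothetical stored entry: the body plus the validators the server sent."""
    body: bytes
    etag: Optional[str] = None
    last_modified: Optional[str] = None


def conditional_headers(cached: Optional[CachedResponse]) -> Dict[str, str]:
    """Build revalidation headers from a previously stored response."""
    headers: Dict[str, str] = {}
    if cached is None:
        return headers  # nothing stored yet: plain GET
    if cached.etag:
        headers["If-None-Match"] = cached.etag
    if cached.last_modified:
        headers["If-Modified-Since"] = cached.last_modified
    return headers


def resolve(cached: Optional[CachedResponse], status: int, body: bytes,
            response_headers: Dict[str, str]) -> CachedResponse:
    """Decide which entry to keep once the server has answered."""
    if status == 304 and cached is not None:
        # 304 Not Modified: the stored copy is still valid, no body was sent.
        return cached
    # Any other response replaces the stored entry and its validators.
    return CachedResponse(
        body=body,
        etag=response_headers.get("ETag"),
        last_modified=response_headers.get("Last-Modified"),
    )
```

A 304 response carries no body, so a revalidated page costs one small round trip instead of a full download.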

Layer 2: Rendered page cache

Stores post-JavaScript DOM snapshots as serialized SilkPage objects. Prevents redundant browser launches for Tier 2/3 fetches.

Key: (url, content-hash of raw HTML)
Backend: SQLite with JSON serialization (or Redis)
Configurable TTL
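
To make the key and storage scheme concrete, here is a rough sketch of a page cache keyed by URL plus a hash of the raw HTML, stored as JSON in SQLite. The table layout, the `PageCache` class, the use of SHA-256, and the plain dict standing in for a serialized SilkPage are all assumptions made for illustration; Silkweb's actual schema may differ.

```python
import hashlib
import json
import sqlite3
import time
from typing import Optional


def page_key(url: str, raw_html: str) -> str:
    """Combine the URL with a content hash of the raw (pre-JavaScript) HTML."""
    content_hash = hashlib.sha256(raw_html.encode("utf-8")).hexdigest()
    return f"{url}#{content_hash}"


class PageCache:
    """Hypothetical rendered-page cache: JSON payloads in a SQLite table."""

    def __init__(self, conn: sqlite3.Connection, ttl: float = 1800):
        self.conn = conn
        self.ttl = ttl
        conn.execute(
            "CREATE TABLE IF NOT EXISTS pages ("
            "key TEXT PRIMARY KEY, payload TEXT NOT NULL, stored_at REAL NOT NULL)"
        )

    def put(self, url: str, raw_html: str, page: dict,
            now: Optional[float] = None) -> None:
        now = time.time() if now is None else now
        self.conn.execute(
            "INSERT OR REPLACE INTO pages VALUES (?, ?, ?)",
            (page_key(url, raw_html), json.dumps(page), now),
        )

    def get(self, url: str, raw_html: str,
            now: Optional[float] = None) -> Optional[dict]:
        now = time.time() if now is None else now
        row = self.conn.execute(
            "SELECT payload, stored_at FROM pages WHERE key = ?",
            (page_key(url, raw_html),),
        ).fetchone()
        # Miss if the key is unknown or the entry is older than the TTL.
        if row is None or now - row[1] >= self.ttl:
            return None
        return json.loads(row[0])
```

Because the raw HTML hash is part of the key, a page whose source changes produces a different key and misses the cache, even when the URL is identical.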

Layer 3: Selector cache

Stores LLM-synthesized CSS/XPath selectors keyed by (domain, skeleton_key). This is the most impactful cache: the LLM is called only once per combination of page template and schema.

Key: (domain, skeleton_key)
skeleton_key: xxhash(dom_tag_nesting) plus a signature of the schema fields (prevents reusing selectors compiled for a different schema)
Backend: SQLite
Configurable TTL (default: never expires)
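
The idea behind skeleton_key can be sketched as follows. This is an approximation under stated assumptions: it uses the standard library's `hashlib.sha256` in place of xxhash so it runs without extra dependencies, it sorts the schema field names so their order does not matter, and the function and class names are hypothetical.

```python
import hashlib
from html.parser import HTMLParser
from typing import Iterable, List


class _TagNesting(HTMLParser):
    """Records opening and closing tag names; text and attributes are ignored."""

    def __init__(self) -> None:
        super().__init__()
        self.tokens: List[str] = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(f"<{tag}>")

    def handle_endtag(self, tag):
        self.tokens.append(f"</{tag}>")


def skeleton_key(html: str, schema_fields: Iterable[str]) -> str:
    """Hash the tag structure and the schema field names into one key."""
    parser = _TagNesting()
    parser.feed(html)
    parser.close()
    # Assumption: sha256 stands in for xxhash; only the tag nesting is hashed.
    dom_hash = hashlib.sha256("".join(parser.tokens).encode()).hexdigest()[:16]
    # Assumption: field order is irrelevant, so names are sorted first.
    schema_sig = hashlib.sha256(
        ",".join(sorted(schema_fields)).encode()
    ).hexdigest()[:16]
    return f"{dom_hash}:{schema_sig}"
```

Two product pages that differ only in text and attribute values produce the same key, while a change in either the tag structure or the requested fields produces a new one.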

Configuration

import silkweb

silkweb.configure(
    cache_enabled=True,
    cache_backend="sqlite",             # "sqlite" | "redis" | "memory"
    cache_path="~/.silkweb/cache",      # for sqlite
    http_cache_ttl=3600,                # HTTP cache TTL in seconds (1 hour)
    page_cache_ttl=1800,                # Rendered page cache TTL (30 min)
    selector_cache_ttl=None,            # Selector cache TTL (None = forever)
)
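
The TTL values above can be read as a simple freshness rule. The sketch below shows one plausible interpretation, where `None` means an entry never expires; the `is_fresh` helper and the treatment of the exact boundary (an entry aged exactly `ttl` seconds counts as expired) are assumptions, not documented Silkweb behavior.

```python
from typing import Optional

# TTLs mirroring the configuration example, in seconds.
TTLS = {
    "http": 3600,       # http_cache_ttl
    "page": 1800,       # page_cache_ttl
    "selectors": None,  # selector_cache_ttl (None = forever)
}


def is_fresh(stored_at: float, ttl: Optional[float], now: float) -> bool:
    """Return True while a cache entry may still be served."""
    if ttl is None:
        return True  # no TTL configured: the entry never expires
    return (now - stored_at) < ttl
```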

cache_backend="memory"

With cache_backend="memory", rendered pages (Layer 2) are cached in process memory. The HTTP cache (Layer 1) is disabled in this mode because it is implemented only for the SQLite and Redis backends.

Redis backend

silkweb.configure(
    cache_backend="redis",
    redis_url="redis://localhost:6379",
)

Managing the cache

Inspect stats

stats = silkweb.cache.stats()
# {
#   'http': {...},
#   'page': {...},
#   'selectors': {...}
# }

Clear cache

# Clear all layers
silkweb.cache.clear()

# Clear specific layer
silkweb.cache.clear(layer="http")
silkweb.cache.clear(layer="page")
silkweb.cache.clear(layer="selectors")

# Clear by domain
silkweb.cache.clear(domain="amazon.com")
silkweb.cache.clear(domain="amazon.com", layer="selectors")
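
When both arguments are given, the layer and domain filters narrow each other. The sketch below models that with a plain list of entries; `clear_entries`, the entry layout, and the rule that a leading `www.` is ignored when matching domains are assumptions for illustration, not Silkweb's implementation.

```python
from typing import List, Optional, Tuple
from urllib.parse import urlsplit

Entry = Tuple[str, str, object]  # (layer, url, cached value)


def domain_of(url: str) -> str:
    """Hostname of a URL, with a leading 'www.' removed (an assumption)."""
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def clear_entries(entries: List[Entry], layer: Optional[str] = None,
                  domain: Optional[str] = None) -> int:
    """Remove entries matching every filter that was given; return the count."""
    def matches(entry: Entry) -> bool:
        layer_ok = layer is None or entry[0] == layer
        domain_ok = domain is None or domain_of(entry[1]) == domain
        return layer_ok and domain_ok

    kept = [e for e in entries if not matches(e)]
    removed = len(entries) - len(kept)
    entries[:] = kept  # mutate in place so callers see the result
    return removed
```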

CLI cache management

silkweb cache stats
silkweb cache clear --layer selectors
silkweb cache clear --domain amazon.com

Bypassing the cache

# Skip cache for a single fetch
page = silkweb.fetch(url, no_cache=True)

# Force LLM re-extraction (bypass selector cache)
data = silkweb.ask(url, "products", force_llm=True)

force_llm scope

force_llm=True bypasses the selector-cache fast path, so Silkweb runs the full LLM extraction pipeline even when selectors are cached. It does not bypass the HTTP cache or the rendered page cache. To bypass page caching for a single call, pass no_cache=True to fetch(), ask(), extract(), or query().
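
One way to picture how the two flags interact is as a set of layers that remain in play for a call. The helper below is purely illustrative: `layers_consulted` is a hypothetical function, and it assumes that no_cache skips both the HTTP and rendered page layers while leaving the selector cache alone, which the text above implies but does not state outright.

```python
from typing import Set


def layers_consulted(no_cache: bool = False, force_llm: bool = False) -> Set[str]:
    """Return the cache layers still consulted for a single call."""
    layers = {"http", "page", "selectors"}
    if no_cache:
        # Assumption: no_cache covers the HTTP and rendered page layers only.
        layers -= {"http", "page"}
    if force_llm:
        # force_llm skips the selector-cache fast path and nothing else.
        layers.discard("selectors")
    return layers
```

Under these assumptions, a fully uncached run needs both flags set.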

How the selector cache works

The selector cache is the key to Silkweb's "extract once, scrape millions" approach:

First visit to a page template: full LLM pipeline runs (clean, schema, extract, compile selectors)
The compiled selectors are cached under the key (domain, skeleton_key)
The skeleton_key includes a DOM skeleton hash (tag structure only) plus a signature of the schema fields
Subsequent pages with a matching structure and a matching schema can reuse the cached selectors and skip LLM extraction
If selectors fail (empty results, validation errors), the self-healer invalidates the cache and re-runs the LLM pipeline
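
The steps above can be sketched as a small lookup function. This is a simplified model, not Silkweb's code: the cache is a plain dict, `run_llm_pipeline` and `apply_selectors` are hypothetical callables supplied by the caller, and "validation" is reduced to checking that every extracted field is non-empty.

```python
from typing import Callable, Dict, Tuple

Selectors = Dict[str, str]   # field name -> CSS/XPath selector
Result = Dict[str, str]      # field name -> extracted value
Key = Tuple[str, str]        # (domain, skeleton_key)


def extract_with_cache(
    cache: Dict[Key, Selectors],
    key: Key,
    page: dict,
    run_llm_pipeline: Callable[[dict], Tuple[Selectors, Result]],
    apply_selectors: Callable[[dict, Selectors], Result],
) -> Result:
    """Use cached selectors when they work; otherwise heal via the LLM path."""
    selectors = cache.get(key)
    if selectors is not None:
        result = apply_selectors(page, selectors)
        # Simplified validation: at least one field, and none of them empty.
        if result and all(result.values()):
            return result  # fast path: no LLM call
        # Self-healing: the cached selectors no longer work, so drop them.
        del cache[key]
    # First visit or healed entry: run the full pipeline and cache selectors.
    selectors, result = run_llm_pipeline(page)
    cache[key] = selectors
    return result
```

In this model the LLM pipeline runs once per key and again only after a failed extraction, which is why the number of LLM calls tracks the number of templates rather than the number of pages.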

Cache backends

Backend	Persistence	Distributed	Best for
`sqlite`	Yes	No	Default, single-machine
`redis`	Yes	Yes	Multi-process or distributed
`memory`	No	No	Testing, short-lived scripts