Caching¶
Silkweb uses a three-layer cache system to minimize redundant network requests, browser launches, and LLM calls.
Three-layer cache¶
Layer 1 — HTTP cache¶
Stores raw HTTP responses with conditional GET support (ETag / Last-Modified). Prevents redundant network requests.
- Backend: hishel library
- Key: URL
- Supports conditional GET headers
- Configurable TTL and max size
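As a rough illustration of what conditional GET support involves, the sketch below builds revalidation headers from stored validators and reuses the cached body on a 304 response. `CachedResponse`, `conditional_headers`, and `resolve_body` are hypothetical names for this example only; they are not part of Silkweb or hishel.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CachedResponse:
    """Hypothetical cache entry; not Silkweb's or hishel's actual type."""
    body: bytes
    etag: Optional[str] = None
    last_modified: Optional[str] = None


def conditional_headers(entry: Optional[CachedResponse]) -> dict:
    """Build revalidation headers from the validators stored with an entry."""
    headers = {}
    if entry is None:
        return headers
    if entry.etag:
        headers["If-None-Match"] = entry.etag
    if entry.last_modified:
        headers["If-Modified-Since"] = entry.last_modified
    return headers


def resolve_body(entry: Optional[CachedResponse], status: int, body: bytes) -> bytes:
    """On 304 Not Modified, reuse the cached body; otherwise take the new one."""
    if status == 304 and entry is not None:
        return entry.body
    return body
```

A 304 response carries no body, so the server only confirms that the cached copy is still valid and the page content is not transferred again.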
Layer 2 — Rendered page cache¶
Stores post-JavaScript DOM snapshots as serialized SilkPage objects. Prevents redundant browser launches for Tier 2/3 fetches.
- Key: (url, content-hash of raw HTML)
- Backend: SQLite with JSON serialization (or Redis)
- Configurable TTL
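A minimal sketch of how a (url, content-hash) key could be derived. The function name `page_cache_key` and the choice of SHA-256 are assumptions made for illustration; the hash function Silkweb actually uses is not stated here.

```python
import hashlib


def page_cache_key(url: str, raw_html: str) -> tuple:
    """Return a (url, content-hash) key for a rendered-page cache entry."""
    # SHA-256 is an assumption; any stable content hash would serve.
    digest = hashlib.sha256(raw_html.encode("utf-8")).hexdigest()
    return (url, digest)
```

Under this scheme, a change in the raw HTML at the same URL produces a different key, so a stale rendered snapshot would not be served for changed content even before its TTL expires.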
Layer 3 — Selector cache¶
Stores LLM-synthesized CSS/XPath selectors keyed by (domain, skeleton_key). This is the most impactful cache: as long as the cached selectors keep working, the LLM is called only once per page template and schema.
- Key: (domain, skeleton_key)
- skeleton_key: xxhash(dom_tag_nesting) plus a signature of the schema fields (prevents reusing selectors compiled for a different schema)
- Backend: SQLite
- Configurable TTL (default: never expires)
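The sketch below shows one way a skeleton key of this shape could be computed. It is an illustration under stated assumptions, not Silkweb's implementation: it uses the standard library's `hashlib.blake2b` as a stand-in for xxhash, represents tag nesting as `depth:tag` tokens, and sorts the schema field names so the signature is order-insensitive. `skeleton_key` is a hypothetical function with that name for this example.

```python
import hashlib
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not increase depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class _SkeletonParser(HTMLParser):
    """Collects 'depth:tag' tokens, ignoring text and attributes."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(f"{self.depth}:{tag}")
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1


def skeleton_key(html: str, schema_fields) -> str:
    """Hash the tag nesting, then append a signature of the schema fields."""
    parser = _SkeletonParser()
    parser.feed(html)
    parser.close()
    # blake2b stands in for xxhash here (assumption, stdlib only).
    dom_hash = hashlib.blake2b("|".join(parser.tokens).encode(), digest_size=8).hexdigest()
    schema_sig = hashlib.blake2b(",".join(sorted(schema_fields)).encode(), digest_size=8).hexdigest()
    return f"{dom_hash}:{schema_sig}"
```

Two product pages that differ only in text and attributes produce the same key, while a change in tag nesting or in the requested fields produces a different one.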
Configuration¶
silkweb.configure(
cache_enabled=True,
cache_backend="sqlite", # "sqlite" | "redis" | "memory"
cache_path="~/.silkweb/cache", # for sqlite
http_cache_ttl=3600, # HTTP cache TTL in seconds (1 hour)
page_cache_ttl=1800, # Rendered page cache TTL (30 min)
selector_cache_ttl=None, # Selector cache TTL (None = forever)
)
cache_backend='memory'
cache_backend="memory" uses an in-process RAM cache for rendered pages (Layer 2).
The HTTP cache (Layer 1) is disabled in this mode (it is only implemented for SQLite/Redis backends).
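To make the TTL semantics concrete, here is a minimal sketch of an in-process cache where `ttl=None` means entries never expire, matching the meaning of `selector_cache_ttl=None` above. `MemoryTTLCache` and its `clock` parameter are hypothetical and exist only for this example; Silkweb's memory backend may be implemented differently.

```python
import time
from typing import Callable, Optional


class MemoryTTLCache:
    """Hypothetical in-process cache; ttl=None means entries never expire."""

    def __init__(self, ttl: Optional[float], clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._store = {}

    def set(self, key, value) -> None:
        # Record the insertion time alongside the value.
        self._store[key] = (value, self._clock())

    def get(self, key, default=None):
        item = self._store.get(key)
        if item is None:
            return default
        value, stored_at = item
        if self.ttl is not None and self._clock() - stored_at > self.ttl:
            # Expired: evict lazily on read and report a miss.
            del self._store[key]
            return default
        return value
```

Because the store is a plain dictionary in process memory, everything in it is lost when the process exits, which is why this kind of backend suits tests and short-lived scripts.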
Redis backend¶
Managing the cache¶
Inspect stats¶
Clear cache¶
# Clear all layers
silkweb.cache.clear()
# Clear specific layer
silkweb.cache.clear(layer="http")
silkweb.cache.clear(layer="page")
silkweb.cache.clear(layer="selectors")
# Clear by domain
silkweb.cache.clear(domain="amazon.com")
silkweb.cache.clear(domain="amazon.com", layer="selectors")
CLI cache management¶
Bypassing the cache¶
# Skip cache for a single fetch
page = silkweb.fetch(url, no_cache=True)
# Force LLM re-extraction (bypass selector cache)
data = silkweb.ask(url, "products", force_llm=True)
force_llm scope
force_llm=True bypasses the selector-cache fast path (so Silkweb runs the full LLM extraction pipeline even if selectors are cached).
It does not automatically bypass the HTTP cache or rendered page cache. Use no_cache=True on fetch() / ask() / extract() / query() to bypass page caching for a call.
How the selector cache works¶
The selector cache is the key to Silkweb's "extract once, scrape millions" approach:
- First visit to a page template: full LLM pipeline runs (clean, schema, extract, compile selectors)
- Selectors are cached, keyed by (domain, skeleton_key)
- The skeleton_key includes a DOM skeleton hash (tag structure only) plus a signature of the schema fields
- Subsequent pages whose structure matches (and schema matches) can reuse selectors and skip LLM extraction for that path
- If selectors fail (empty results, validation errors), the self-healer invalidates the cache and re-runs the LLM pipeline
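The flow above can be sketched as a small lookup-then-heal function. This is a simplified model, not Silkweb's code: `extract_with_selector_cache`, `apply_selectors`, and `run_llm_pipeline` are hypothetical names, the cache is a plain dictionary, and only the "empty results" failure case is modeled (validation errors are omitted).

```python
from typing import Callable, Dict, List, Tuple

Key = Tuple[str, str]  # (domain, skeleton_key)


def extract_with_selector_cache(
    cache: Dict[Key, str],
    key: Key,
    apply_selectors: Callable[[str], List[dict]],
    run_llm_pipeline: Callable[[], Tuple[str, List[dict]]],
    force_llm: bool = False,
) -> List[dict]:
    """Try cached selectors first; fall back to the LLM pipeline and re-cache."""
    selectors = None if force_llm else cache.get(key)
    if selectors is not None:
        rows = apply_selectors(selectors)
        if rows:
            return rows  # fast path: no LLM call
        cache.pop(key, None)  # self-heal: drop selectors that returned nothing
    # Cache miss, forced run, or failed selectors: run the full pipeline.
    selectors, rows = run_llm_pipeline()
    cache[key] = selectors
    return rows
```

In this model the LLM pipeline runs on the first visit, after a selector failure, or when `force_llm=True`; every other call is served by the cached selectors.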
Cache backends¶
| Backend | Persistence | Distributed | Best for |
|---|---|---|---|
| sqlite | Yes | No | Default, single-machine |
| redis | Yes | Yes | Multi-process or distributed |
| memory | No | No | Testing, short-lived scripts |