
Cache

Silkweb uses a three-layer caching system to minimize network requests and LLM calls: an HTTP cache (Layer 1), a rendered page cache (Layer 2), and a selector cache (Layer 3).

Selector cache key (how reuse works)

The selector cache stores synthesized selector sets under a key derived from:

  • domain: the URL hostname (e.g. books.toscrape.com)
  • DOM skeleton hash: a stable fingerprint of the page’s tag nesting (ignores text/attributes)
  • schema signature: a signature of the schema’s field names/types, so selectors compiled for one schema aren’t reused for a different schema
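The three components above can be combined into a single lookup key. The following is a minimal sketch of that idea, not Silkweb's actual implementation: the helper names `schema_signature` and `selector_cache_key`, the SHA-256 digest, and the order-independent treatment of fields are all assumptions made for illustration.

```python
import hashlib
import json
from urllib.parse import urlsplit


def schema_signature(fields: dict[str, str]) -> str:
    """Hypothetical helper: digest of field names and type names.

    Fields are sorted first, so declaration order does not change the result.
    """
    canonical = json.dumps(sorted(fields.items()), separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def selector_cache_key(
    url: str, skeleton_hash: str, fields: dict[str, str]
) -> tuple[str, str, str]:
    """Hypothetical helper: (domain, DOM skeleton hash, schema signature)."""
    domain = urlsplit(url).hostname or ""
    return (domain, skeleton_hash, schema_signature(fields))


# Two pages on the same site that share a template and a schema map to one key.
key_a = selector_cache_key(
    "https://books.toscrape.com/catalogue/page-1.html",
    "abc123",  # placeholder skeleton hash
    {"title": "str", "price": "float"},
)
key_b = selector_cache_key(
    "https://books.toscrape.com/catalogue/page-2.html",
    "abc123",
    {"price": "float", "title": "str"},
)
# A different schema on the same template yields a different key.
key_c = selector_cache_key(
    "https://books.toscrape.com/catalogue/page-1.html",
    "abc123",
    {"title": "str", "rating": "int"},
)
```

Under these assumptions, `key_a` and `key_b` are equal, so selectors synthesized for the first page could be reused for the second, while `key_c` would miss the cache and trigger a fresh synthesis.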

Cache Manager

CacheManager dataclass

CacheManager(http: HttpCache, page: RenderedPageCache, selectors: SelectorCache)

HTTP Cache (Layer 1)

HttpCache dataclass

HttpCache(enabled: bool = True, backend: HttpBackend = 'sqlite', ttl_s: float | None = None, max_size_bytes: int | None = None, redis_url: str | None = None, sqlite_path: str | None = None)

HTTP cache via hishel for httpx.

Notes:

  • Conditional GET (ETag/Last-Modified) is handled by hishel.
  • TTL is implemented through AsyncSqliteStorage(default_ttl=...).
  • max_size_bytes is best-effort; hishel's storage does not currently enforce it directly.
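For readers unfamiliar with conditional GET, the sketch below shows the general HTTP mechanism: validators from a cached response are sent back as request headers, and a 304 status means the cached body can be reused. This only illustrates the protocol; hishel performs the equivalent work internally, and the function names here are hypothetical.

```python
def revalidation_headers(cached_response_headers: dict[str, str]) -> dict[str, str]:
    """Build conditional request headers from a cached response's validators."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {name.lower(): value for name, value in cached_response_headers.items()}
    headers: dict[str, str] = {}
    if "etag" in lowered:
        headers["If-None-Match"] = lowered["etag"]
    if "last-modified" in lowered:
        headers["If-Modified-Since"] = lowered["last-modified"]
    return headers


def can_reuse_cached_body(status_code: int) -> bool:
    """A 304 Not Modified response means the cached body is still valid."""
    return status_code == 304


conditional = revalidation_headers(
    {"ETag": '"abc"', "Last-Modified": "Wed, 21 Oct 2015 07:28:00 GMT"}
)
```

A server that supports validators answers such a request with 304 and no body when nothing has changed, which saves bandwidth even after a cached entry has expired.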

Rendered Page Cache (Layer 2)

RenderedPageCache dataclass

RenderedPageCache(backend: PageBackend = 'sqlite', sqlite_path: str | None = None, ttl_seconds: int | None = None, redis_url: str | None = None, _mem_pages: dict[tuple[str, str], dict[str, Any]] | None = None, _mem_last: dict[str, str] | None = None, _mem_timestamps: dict[tuple[str, str], datetime] | None = None)

Selector Cache (Layer 3)

SelectorCache

SelectorCache(path: str | None = None, ttl_seconds: int | None = None)
Source code in silkweb/cache/selectors.py
def __init__(self, path: str | None = None, ttl_seconds: int | None = None) -> None:
    self.path = path or _default_db_path()
    self.ttl_seconds = ttl_seconds
    os.makedirs(os.path.dirname(self.path), exist_ok=True)
    self._init_db()

dom_skeleton_hash

dom_skeleton_hash(html: str) -> str

Hash of DOM "skeleton": tag names + nesting only (no attrs, no text).

This is designed to be stable across content changes for the same template.
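To make the stability property concrete, here is a standard-library approximation of the same idea. It is a stand-in for illustration only: it uses `html.parser` and SHA-256 rather than the lxml tree and xxhash, so its digests will not match those of `dom_skeleton_hash`, and the name `skeleton_hash_sketch` is hypothetical.

```python
import hashlib
from html.parser import HTMLParser


class _SkeletonParser(HTMLParser):
    """Records tag open/close events and discards attributes and text."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        self.parts.append(f"<{tag}>")

    def handle_endtag(self, tag):
        self.parts.append(f"</{tag}>")


def skeleton_hash_sketch(html: str) -> str:
    """Hash the sequence of tag events, ignoring everything else."""
    parser = _SkeletonParser()
    parser.feed(html)
    parser.close()
    skeleton = "".join(parser.parts)
    return hashlib.sha256(skeleton.encode("utf-8")).hexdigest()[:16]


# Same template, different text and attributes.
page_a = '<html><body><div class="book"><h3>Dune</h3><p>9.99</p></div></body></html>'
page_b = '<html><body><div class="book sale"><h3>Emma</h3><p>4.50</p></div></body></html>'
# Different template: an extra <span> changes the nesting.
page_c = '<html><body><div><h3>Emma</h3><span>new</span><p>4.50</p></div></body></html>'

same_template = skeleton_hash_sketch(page_a) == skeleton_hash_sketch(page_b)
different_template = skeleton_hash_sketch(page_a) != skeleton_hash_sketch(page_c)
```

Here `same_template` and `different_template` are both true: changing text or attributes leaves the hash untouched, while adding or removing an element changes it.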

Source code in silkweb/cache/selectors.py
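# Imports inferred from the names used below; the module's actual import lines are not shown here.
import xxhash
from lxml import etree
from lxml import html as lxml_html


def dom_skeleton_hash(html: str) -> str: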
    """
    Hash of DOM "skeleton": tag names + nesting only (no attrs, no text).

    This is designed to be stable across content changes for the same template.
    """
    doc = lxml_html.fromstring(html or "<html/>")

    def walk(node: etree._Element, out: list[str]) -> None:
        out.append(f"<{node.tag}>")
        for child in node:
            if isinstance(child, etree._Element):
                walk(child, out)
        out.append(f"</{node.tag}>")

    parts: list[str] = []
    walk(doc, parts)
    skeleton = "".join(parts)
    return xxhash.xxh64(skeleton).hexdigest()