SilkQL Query Language
SilkQL is Silkweb's structured query language for web data extraction. It lets you define exactly what fields you want, with type coercions and modifiers, in a compact syntax.
Basic syntax
Example
import silkweb

results = silkweb.query("https://news.ycombinator.com", """
{
  stories[] {
    title
    url
    score(int)
    author
    comments(int, optional)
  }
}
""")

for story in results.data:
    print(f"[{story.score}] {story.title} by {story.author}")

print(f"Pages scraped: {results.pages_scraped}")
print(f"Cached: {results.cached}")
Type coercions
Type coercions convert extracted string values to Python types:
| Coercion | Result type | Example |
|---|---|---|
| int | int | score(int) → 312 |
| float | float | rating(float) → 4.5 |
| currency | float | price(currency) → strips $ and , → 29.99 |
| bool | bool | in_stock(bool) → True |
| url | str | link(url) → absolute URL |
| iso_date | str | published(iso_date) → "2024-01-15" |
| list | list[str] | tags(list) → ["python", "web"] |
| json | Any | metadata(json) → parsed JSON |
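Each coercion can be pictured as a small function from the extracted string to a Python value. The block below is a minimal stdlib-only sketch of three of them, not Silkweb's implementation: the helper names (`coerce_int`, `coerce_currency`, `coerce_bool`, `COERCIONS`), the handling of thousands separators, and the set of strings treated as true are all assumptions made for illustration.

```python
import re


def coerce_int(raw: str) -> int:
    """Parse the first integer in the text, ignoring thousands separators."""
    match = re.search(r"-?\d[\d,]*", raw)
    if match is None:
        raise ValueError(f"no integer found in {raw!r}")
    return int(match.group().replace(",", ""))


def coerce_currency(raw: str) -> float:
    """Drop everything except digits, '.', and '-', then parse as float."""
    cleaned = re.sub(r"[^\d.\-]", "", raw)
    return float(cleaned)


def coerce_bool(raw: str) -> bool:
    """Treat a small, assumed set of strings as true; everything else is false."""
    return raw.strip().lower() in {"true", "yes", "1", "in stock"}


# Hypothetical lookup table from coercion name to function.
COERCIONS = {"int": coerce_int, "currency": coerce_currency, "bool": coerce_bool}

print(COERCIONS["int"]("312 points"))      # 312
print(COERCIONS["currency"]("$1,029.99"))  # 1029.99
print(COERCIONS["bool"]("In stock"))       # True
```

A lookup table keyed by coercion name keeps the mapping from `field(coercion)` to a converter explicit, which is one plausible way to wire the table above to code.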
Modifiers
| Modifier | Effect |
|---|---|
| optional | Field defaults to None if not found |
| unique | Deduplicate values |
| min_count=N | Require at least N items in a collection |
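To make the three modifiers concrete, here is a small sketch of how they might act on already-extracted values. It is an illustration only: `apply_modifiers` is a hypothetical helper, and the choices to preserve first-seen order when deduplicating and to raise `ValueError` on violations are assumptions, not documented Silkweb behavior.

```python
def apply_modifiers(values, *, optional=False, unique=False, min_count=None):
    """Apply SilkQL-style modifiers to extracted values (illustrative only).

    `values` is None when the field was not found, otherwise a list.
    """
    if values is None:
        if optional:
            return None  # optional: a missing field becomes None
        raise ValueError("required field not found")
    if unique:
        # dict.fromkeys keeps the first occurrence of each value, in order.
        values = list(dict.fromkeys(values))
    if min_count is not None and len(values) < min_count:
        raise ValueError(f"expected at least {min_count} items, got {len(values)}")
    return values


print(apply_modifiers(None, optional=True))                      # None
print(apply_modifiers(["python", "web", "python"], unique=True))  # ['python', 'web']
```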
Collections
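Use [] after a field name to denote a collection (a list of records), for example stories[] { title url }.
Collections can be nested, so each record in one collection can contain its own collection, for example authors[] { name books[] { title } }.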
Pagination
Add a pagination block to automatically follow paginated results:
results = silkweb.query("https://example.com/products", """
{
  products[] {
    name
    price(currency)
  }
  pagination {
    next_page_url(url)
  }
}
""", follow_pagination=True)
When follow_pagination=True, Silkweb will:
- Extract the next_page_url from each page
- Navigate to the next page
- Merge results across all pages
- Continue until no more next_page_url is found
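The steps above amount to a simple loop. The sketch below shows that loop in plain Python against a fake three-page site. It is not Silkweb's code: `follow_pagination`, `extract_page`, and `PAGES` are hypothetical names, and the `max_pages` cap and the visited-URL guard are assumptions added so the loop cannot run forever.

```python
def follow_pagination(start_url, extract_page, max_pages=50):
    """Collect records across pages by following next_page_url.

    extract_page(url) is a hypothetical callable that returns
    (records, next_page_url_or_None) for a single page.
    """
    merged, seen, url = [], set(), start_url
    # Stop when there is no next URL, a URL repeats, or the cap is reached.
    while url is not None and url not in seen and len(seen) < max_pages:
        seen.add(url)
        records, url = extract_page(url)
        merged.extend(records)  # merge results across pages, in page order
    return merged, len(seen)


# Fake site standing in for real fetching and extraction.
PAGES = {
    "/p1": ([{"name": "A"}], "/p2"),
    "/p2": ([{"name": "B"}], "/p3"),
    "/p3": ([{"name": "C"}], None),
}

records, pages_scraped = follow_pagination("/p1", PAGES.__getitem__)
print([r["name"] for r in records])  # ['A', 'B', 'C']
print(pages_scraped)                 # 3
```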
Python API details
- query / async_query use the same cleaner / extraction / selector models as extract (from get_config() or configure(...)); pass cleaner_model= and selector_model= on the call to override. Those kwargs are not forwarded to the fetcher.
- QueryResult: data is a one-element list containing the merged root model for multi-page runs; cached is true if any page used the selector cache.
- Flat root queries (no single list collection at the root): the extractor must return exactly one row; otherwise a SilkwebExtractionError is raised (use a root list like items[] { ... } for many rows).
- query_from_html: same SilkQL pipeline as one page of async_query (no pagination); useful when you already have HTML.
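The flat-root rule can be sketched as a single check on the number of extracted rows. This is an illustration under assumptions: `check_root_rows` is a hypothetical helper, and the `SilkwebExtractionError` class defined here is a local stand-in so the snippet runs without the library installed.

```python
class SilkwebExtractionError(Exception):
    """Local stand-in for the library exception of the same name."""


def check_root_rows(rows, has_root_collection):
    """Validate the row count against the shape of the query root.

    A root collection (e.g. items[] { ... }) may yield any number of rows;
    a flat root must yield exactly one.
    """
    if not has_root_collection and len(rows) != 1:
        raise SilkwebExtractionError(
            f"flat root query expected exactly 1 row, got {len(rows)}"
        )
    return rows


print(check_root_rows([{"title": "Widget"}], has_root_collection=False))
print(len(check_root_rows([{"name": "a"}, {"name": "b"}], has_root_collection=True)))  # 2
```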
Validation
Validate a SilkQL string without executing it:
from silkweb.silkql.parser import parse

ast = parse("""
{
  title
  price(currency)
  items[] {
    name
    quantity(int)
  }
}
""")
print(ast)  # RootNode with fields and collections
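To give a feel for what syntax validation involves, here is a toy recursive-descent parser for the subset of SilkQL shown on this page. It is a sketch under assumptions, not the library's parser: `tokenize`, `parse_block`, and `parse_silkql` are hypothetical names, the grammar is inferred from the examples above, and the output is a list of plain dicts instead of the RootNode AST that parse returns.

```python
import re

NAME = r"[A-Za-z_]\w*"


def tokenize(text):
    # Names, integers, the "[]" collection marker, or any single other symbol.
    return re.findall(rf"{NAME}|\d+|\[\]|\S", text)


def parse_block(tokens, pos):
    """Parse '{ field* }' starting at tokens[pos]; return (fields, next_pos)."""
    if pos >= len(tokens) or tokens[pos] != "{":
        raise SyntaxError(f"expected '{{' at token {pos}")
    pos += 1
    fields = []
    while pos < len(tokens) and tokens[pos] != "}":
        name = tokens[pos]
        if not re.fullmatch(NAME, name):
            raise SyntaxError(f"expected field name, got {name!r}")
        node = {"name": name}
        pos += 1
        if pos < len(tokens) and tokens[pos] == "[]":
            node["collection"] = True
            pos += 1
        if pos < len(tokens) and tokens[pos] == "(":
            if ")" not in tokens[pos:]:
                raise SyntaxError(f"missing ')' after {name!r}")
            end = tokens.index(")", pos)
            args = "".join(tokens[pos + 1:end]).split(",")
            for arg in args:  # each argument is a name or name=integer
                if not re.fullmatch(rf"{NAME}(=\d+)?", arg):
                    raise SyntaxError(f"bad argument {arg!r} on {name!r}")
            node["args"] = args
            pos = end + 1
        if pos < len(tokens) and tokens[pos] == "{":
            node["fields"], pos = parse_block(tokens, pos)  # nested block
        fields.append(node)
    if pos >= len(tokens):
        raise SyntaxError("missing closing '}'")
    return fields, pos + 1


def parse_silkql(text):
    """Return the parsed field tree, or raise SyntaxError if malformed."""
    tokens = tokenize(text)
    fields, pos = parse_block(tokens, 0)
    if pos != len(tokens):
        raise SyntaxError(f"unexpected trailing token {tokens[pos]!r}")
    return fields


tree = parse_silkql("{ title price(currency) items[] { name quantity(int) } }")
print([field["name"] for field in tree])  # ['title', 'price', 'items']
```

Because the parser only inspects the query text, a malformed query such as one with an unclosed brace fails with `SyntaxError` before any page is fetched, which is the point of validating separately from executing.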
Or via the CLI:
Compilation
SilkQL queries are compiled to Pydantic models:
from silkweb.silkql.compiler import compile_query

Model = compile_query("""
{
  name
  price(currency)
  in_stock(bool)
}
""")

# Model is a Pydantic BaseModel with validated fields
print(Model.model_json_schema())
The currency coercion generates a BeforeValidator that strips currency symbols and commas before parsing as float.