Frequent Items Sketch Functions
Frequency sketches identify heavy hitters — items that appear frequently in a data stream. They estimate item frequencies and can enumerate the most frequent items.
Aggregation Functions
ds_freq_sketch
Builds a frequency sketch from string values.
```sql
ds_freq_sketch(value) -> varbinary
ds_freq_sketch(value, maxMapSize) -> varbinary
```
| Parameter | Type | Description |
|---|---|---|
| value | VARCHAR | String values to track |
| maxMapSize | INTEGER | Maximum internal map size (default 64). Larger values let the sketch track more distinct items, at the cost of memory. |
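For example, a sketch can be built per group, such as one sketch of search terms per day (the `day` column is illustrative; `searches` and `search_term` are reused from the examples below):

```sql
-- One frequency sketch of search terms per day;
-- a map size of 256 tracks more distinct terms than the default 64.
SELECT day, ds_freq_sketch(search_term, 256) AS term_sketch
FROM searches
GROUP BY day;
```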
ds_freq_union
Computes the union of multiple frequency sketches.
```sql
ds_freq_union(sketch) -> varbinary
```
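For example, pre-aggregated per-group sketches can be merged into a single overall sketch (the `daily_term_sketches` table and `term_sketch` column are hypothetical):

```sql
-- Merge previously stored daily sketches into one sketch
SELECT ds_freq_union(term_sketch) AS merged_sketch
FROM daily_term_sketches;
```

Unioning stored sketches avoids rescanning the raw data when rolling daily sketches up to weekly or monthly totals.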
Scalar Functions
ds_freq_estimate
Returns the estimated frequency of a specific item.
```sql
ds_freq_estimate(sketch, item) -> bigint
```
Example:

```sql
SELECT ds_freq_estimate(ds_freq_sketch(search_term), 'trino') AS trino_count
FROM searches;
```
ds_freq_lower_bound
Returns the lower bound on the frequency of an item.
```sql
ds_freq_lower_bound(sketch, item) -> bigint
```
ds_freq_upper_bound
Returns the upper bound on the frequency of an item.
```sql
ds_freq_upper_bound(sketch, item) -> bigint
```
The true frequency always lies between ds_freq_lower_bound and ds_freq_upper_bound. Use the lower bound when false positives are costly; use the upper bound when false negatives are costly.
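For example, to report a count for 'trino' that is guaranteed not to exceed its true frequency:

```sql
-- Conservative count: never higher than the true frequency
SELECT ds_freq_lower_bound(ds_freq_sketch(search_term), 'trino') AS min_trino_count
FROM searches;
```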
ds_freq_frequent_items
Returns the frequent items as a JSON array.
```sql
ds_freq_frequent_items(sketch, errorType) -> varchar
```
| Parameter | Type | Description |
|---|---|---|
| sketch | VARBINARY | Frequency sketch |
| errorType | VARCHAR | 'NO_FALSE_POSITIVES' or 'NO_FALSE_NEGATIVES' |
- NO_FALSE_POSITIVES: every returned item is truly frequent (may miss some)
- NO_FALSE_NEGATIVES: every truly frequent item is returned (may include extras)
Example:

```sql
SELECT ds_freq_frequent_items(ds_freq_sketch(search_term), 'NO_FALSE_POSITIVES')
FROM searches;
-- Returns: [{"item":"trino","estimate":15234,"lower":15200,"upper":15268}, ...]
```
ds_freq_stringify
Returns a human-readable summary of the sketch contents, useful for debugging and inspection.

```sql
ds_freq_stringify(sketch) -> varchar
```
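Example (the exact text of the summary depends on the sketch implementation):

```sql
-- Print a human-readable summary of the sketch for inspection
SELECT ds_freq_stringify(ds_freq_sketch(search_term))
FROM searches;
```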
When to Use Frequency Sketches
- Top-K queries: find the most popular products, search terms, or error codes
- Anomaly detection: track item frequencies over time, flag items with unusual spikes
- Data quality: identify duplicate or over-represented values in a dataset