Trino DataSketches Plugin
Apache DataSketches probabilistic data structures as Trino SQL functions.
Overview
This plugin provides 65 SQL functions across 8 sketch families for Trino 479, enabling approximate computations on massive datasets:
| Sketch Family | Use Case |
|---|---|
| HLL | Cardinality estimation (count distinct) |
| Theta | Cardinality with set operations (union, intersect, exclude) |
| CPC | Compact cardinality estimation |
| KLL | Quantile approximation (percentiles, ranks, CDF, PMF) |
| Quantiles | Classic quantile approximation (DoublesSketch) |
| Frequencies | Frequent items / heavy hitters |
| Tuple ArrayOfDoubles | Cardinality with associated numeric values |
| Tuple DoubleSummary | Cardinality with summary statistics |
Why Sketches?
Probabilistic data structures (sketches) let you compute approximate answers to queries like “count distinct” or “what’s the 99th percentile” in a single pass over the data, using a fraction of the memory that exact computation requires. They are:
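- Mergeable: pre-aggregate sketches per partition (for example, per day or per dimension value), then combine them for any time range or dimension without rescanning the raw data
- Fast: built in a single pass, with no sorting or shuffling of the raw rows
- Compact: a sketch summarizing 10 billion values typically occupies only kilobytes; the exact size depends on the sketch family and its configured size parameter
- Accurate: typical error is roughly 1-3% for cardinality, tighter for quantiles; the error bound is tunable through the sketch's size parameter, trading memory for accuracy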
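
To make the mergeability and accuracy properties above concrete, here is a toy K-Minimum-Values (KMV) sketch in pure Python. It is an illustration only, not the plugin's code and not the datasketches-java implementation. The class name `KMVSketch`, its methods, the choice of `k=1024`, and the sample ID ranges are all hypothetical. KMV is a simple relative of the idea behind Theta-style sketches: keep only the `k` smallest hash values and estimate the distinct count from how tightly they are packed.

```python
import hashlib
import heapq


class KMVSketch:
    """Toy K-Minimum-Values sketch (illustrative, not the plugin's code)."""

    def __init__(self, k=1024):
        self.k = k
        self._heap = []        # max-heap (stored negated) of the k smallest hashes
        self._members = set()  # same hashes, for duplicate detection

    @staticmethod
    def _hash(item):
        # Map any item to a pseudo-uniform float in [0, 1).
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2**64

    def _insert(self, h):
        if h in self._members:
            return  # duplicates never change the sketch
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # New hash is smaller than the current k-th minimum: swap it in.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def update(self, item):
        self._insert(self._hash(item))

    def merge(self, other):
        # The k smallest hashes of the union are always contained in the
        # union of each side's k smallest, so merging loses nothing.
        merged = KMVSketch(self.k)
        for h in self._members | other._members:
            merged._insert(h)
        return merged

    def estimate(self):
        if len(self._heap) < self.k:
            return float(len(self._heap))   # fewer than k distinct items: exact
        return (self.k - 1) / (-self._heap[0])  # standard KMV estimator


# Hypothetical data: two "daily" batches of user IDs with overlap.
day1 = KMVSketch()
for user_id in range(0, 60_000):
    day1.update(user_id)

day2 = KMVSketch()
for user_id in range(40_000, 100_000):
    day2.update(user_id)

# Combine the pre-aggregated sketches instead of rescanning raw rows.
merged = day1.merge(day2)

# Reference: one sketch built directly over all 100,000 distinct IDs.
direct = KMVSketch()
for user_id in range(0, 100_000):
    direct.update(user_id)

print(f"merged estimate: {merged.estimate():.0f}")
print(f"direct estimate: {direct.estimate():.0f}")
```

The merged sketch holds exactly the same state as the one built over all the data, which is what makes pre-aggregation safe. Each sketch stores at most 1,024 hash values no matter how many rows it has seen. Production sketches use far more compact encodings and more refined estimators than this toy does.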
Compatibility
| Component | Version |
|---|---|
| Trino | 479 |
| datasketches-java | 9.0.0 |
| Java | 21+ |