Getting Started
Table of contents
- Build
- Install
- Docker Quick Start
- Your First Sketch
- How Sketches Work in SQL
- Running Tests
- Known Limitations
Build
mvn clean package -DskipTests
This produces a fat jar at target/trino-datasketches-plugin.jar (~30 MB) containing the plugin and all its dependencies.
Install
Copy the plugin jar into a new directory under your Trino plugin path:
mkdir -p /usr/lib/trino/plugin/datasketches
cp target/trino-datasketches-plugin.jar /usr/lib/trino/plugin/datasketches/
Restart Trino. All ds_* functions are available immediately — no catalog configuration needed.
Docker Quick Start
# Build the plugin
mvn clean package -DskipTests
# Start Trino with the plugin mounted
docker run -d --name trino \
-v $(pwd)/target/trino-datasketches-plugin.jar:/usr/lib/trino/plugin/datasketches/trino-datasketches-plugin.jar \
-p 8080:8080 \
trinodb/trino:479
Connect with the Trino CLI:
trino --server localhost:8080
Verify the plugin is loaded:
SHOW FUNCTIONS LIKE 'ds_%';
Your First Sketch
-- Generate some synthetic data
WITH data AS (
SELECT CAST('user_' || CAST(x AS VARCHAR) AS VARCHAR) AS user_id
FROM UNNEST(sequence(1, 10000)) AS t(x)
)
-- Build an HLL sketch and estimate cardinality
SELECT ds_hll_estimate(ds_hll_sketch(user_id)) AS approx_count
FROM data;
approx_count
--------------
10012.0
How Sketches Work in SQL
The typical workflow is:
- Build a sketch from raw data using an aggregation function (
ds_hll_sketch,ds_kll_sketch, etc.) - Store the sketch as a
VARBINARYcolumn in a table - Query the sketch using scalar functions (
ds_hll_estimate,ds_kll_quantile, etc.) - Merge sketches from different partitions using union functions (
ds_hll_union,ds_kll_union, etc.)
-- Step 1 & 2: Pre-aggregate into a sketch table
CREATE TABLE daily_sketches AS
SELECT event_date,
ds_hll_sketch(user_id) AS user_sketch,
ds_kll_sketch(latency) AS latency_sketch
FROM events
GROUP BY event_date;
-- Step 3: Query a single day
SELECT ds_hll_estimate(user_sketch) AS users,
ds_kll_quantile(latency_sketch, 0.99) AS p99
FROM daily_sketches
WHERE event_date = DATE '2024-01-15';
-- Step 4: Merge across a week
SELECT ds_hll_estimate(ds_hll_union(user_sketch)) AS weekly_users,
ds_kll_quantile(ds_kll_union(latency_sketch), 0.99) AS weekly_p99
FROM daily_sketches
WHERE event_date BETWEEN DATE '2024-01-08' AND DATE '2024-01-14';
Sketches are serialized as VARBINARY. You can store them in any Trino connector that supports VARBINARY columns (Hive, Iceberg, Delta Lake, Memory, etc.).
Running Tests
# Unit tests
mvn test
# Docker integration tests (requires Docker)
mvn package -DskipTests
mvn failsafe:integration-test failsafe:verify
Known Limitations
GROUP BY with sketch aggregations does not work in Trino 479 due to a bytecode generation issue in Trino’s StateCompiler. Aggregations work correctly in single-group mode (no GROUP BY). To work around this, build sketches in separate queries and union them.