Getting Started

Table of contents

  1. Build
  2. Install
  3. Docker Quick Start
  4. Your First Sketch
  5. How Sketches Work in SQL
  6. Running Tests
  7. Known Limitations

Build

mvn clean package -DskipTests

This produces a fat jar at target/trino-datasketches-plugin.jar (~30 MB) containing the plugin and all its dependencies.

Install

Copy the plugin jar into a new directory under your Trino plugin path:

mkdir -p /usr/lib/trino/plugin/datasketches
cp target/trino-datasketches-plugin.jar /usr/lib/trino/plugin/datasketches/

Restart Trino. All ds_* functions are available immediately — no catalog configuration needed.

Docker Quick Start

# Build the plugin
mvn clean package -DskipTests

# Start Trino with the plugin mounted
docker run -d --name trino \
  -v $(pwd)/target/trino-datasketches-plugin.jar:/usr/lib/trino/plugin/datasketches/trino-datasketches-plugin.jar \
  -p 8080:8080 \
  trinodb/trino:479

Connect with the Trino CLI:

trino --server localhost:8080

Verify the plugin is loaded:

SHOW FUNCTIONS LIKE 'ds_%';

Your First Sketch

-- Generate some synthetic data
WITH data AS (
    SELECT CAST('user_' || CAST(x AS VARCHAR) AS VARCHAR) AS user_id
    FROM UNNEST(sequence(1, 10000)) AS t(x)
)
-- Build an HLL sketch and estimate cardinality
SELECT ds_hll_estimate(ds_hll_sketch(user_id)) AS approx_count
FROM data;
 approx_count
--------------
      10012.0

How Sketches Work in SQL

The typical workflow is:

  1. Build a sketch from raw data using an aggregation function (ds_hll_sketch, ds_kll_sketch, etc.)
  2. Store the sketch as a VARBINARY column in a table
  3. Query the sketch using scalar functions (ds_hll_estimate, ds_kll_quantile, etc.)
  4. Merge sketches from different partitions using union functions (ds_hll_union, ds_kll_union, etc.)
-- Step 1 & 2: Pre-aggregate into a sketch table
CREATE TABLE daily_sketches AS
SELECT event_date,
       ds_hll_sketch(user_id)  AS user_sketch,
       ds_kll_sketch(latency)  AS latency_sketch
FROM events
GROUP BY event_date;

-- Step 3: Query a single day
SELECT ds_hll_estimate(user_sketch)          AS users,
       ds_kll_quantile(latency_sketch, 0.99) AS p99
FROM daily_sketches
WHERE event_date = DATE '2024-01-15';

-- Step 4: Merge across a week
SELECT ds_hll_estimate(ds_hll_union(user_sketch)) AS weekly_users,
       ds_kll_quantile(ds_kll_union(latency_sketch), 0.99) AS weekly_p99
FROM daily_sketches
WHERE event_date BETWEEN DATE '2024-01-08' AND DATE '2024-01-14';

Sketches are serialized as VARBINARY. You can store them in any Trino connector that supports VARBINARY columns (Hive, Iceberg, Delta Lake, Memory, etc.).

Running Tests

# Unit tests
mvn test

# Docker integration tests (requires Docker)
mvn package -DskipTests
mvn failsafe:integration-test failsafe:verify

Known Limitations

GROUP BY with sketch aggregations does not work in Trino 479 due to a bytecode generation issue in Trino’s StateCompiler. Aggregations work correctly in single-group mode (no GROUP BY). To work around this, build sketches in separate queries and union them.


Back to top

Trino DataSketches Plugin — Apache License 2.0