Trino summit 2021:
Overview of Trino Pinot Connector, which bridges the flexibility of Trino's full SQL support to the power of Apache Pinot's realtime analytics, giving you the best of both worlds.
5. Flexibility takes time: Join on the Fly
customers
orders
customers.state =
‘California’ AND
customers.gender
= ‘Female’
JOIN customers ON
(customers.customer_id
= orders.customer_id)
Group By
customers.city,
Month(orders.date)
sum(orders.amount)
FILTER JOIN GROUP BY AGGREGATION
- Flexible to do any computation
- High query cost: disk & network
I/O, Data Partitioning, Data Serde
6. ETL Trade-offs: Pre-joined Table
customers
orders
state =
‘California’
AND gender
= ‘Female’
JOIN customers ON
(customers.customer_id
= orders.customer_id)
Group By city,
Month(orders.
date)
sum(amount)
FILTER
JOIN GROUP BY AGGREGATION
user_orders
_joined
Pre-Joined
Table
- Flexible to explore user dimensions
- Query time is still proportional to the
data scan, not predictable
7. ETL Trade-offs: Pre-aggregated Table
state =
‘California’
AND gender
= ‘Female’
Group By city,
Month(orders.
date)
sum(sum_amount)
FILTER GROUP BY AGGREGATION
user_orders
_joined
Pre-Joined
Table
user_orders_
aggregated
Pre-Aggregated
Table
SELECT
sum(amount) as
sum_amount, date,
city GROUP BY
date, city
Aggregation
+ GroupBy
- Reduced query runtime workload
- Query time is still proportional to the
multiplication of non-groupBy columns
8. ETL Trade-offs: Pre-cubed Table
state =
‘California’
AND gender
= ‘Female’
sum_amount,
month, city
FILTER
user_orders
_joined
Pre-Joined
Table
user_orders
_cubed
Pre-Aggregated
Table
SELECT
sum(amount) as
sum_amount, date,
Month(date) as
month, city
GROUP BY CUBE
(date, city,
Month(date))
Cubing
PROJECTION
- Predictable query runtime
- Storage overhead: one raw record
translates to multiple records
- Dimension explosion
9. Fact Table
Dimension Table Pre-Join Pre-Aggregation Pre-Cube
Latency
Flexibility
low
high
low
high
Not to Trade-off Using Apache Pinot
Throughput high
low
10. User Facing Applications Business Facing Metrics
Apache Pinot Overview
Anomaly Detection
- Ingestion: Millions of events/sec
- Workload: Thousands of queries/sec
- Performance: Millisecond
- Operation: Thousands Nodes Cluster
ADLS
GCS
Real-Time Offline
12. Secrets Behind Apache Pinot
Scan
Aggregation
Filter
Storage
Bloom
Filter
Inverted Index
Columnar Store
Byte
Encoding
Sorted
Index ❏Common Techniques
❏Pinot
Compression
Star-Tree Pre-aggregation
Star-
Tree
Index
Bit/RLE
Encoding
Per-segment flexible query planning
Range
Index
Text
Index
13. Apache Pinot - StarTree Index
• Configurable trade-off between latency and space by partial pre-aggregation
technique
• Be able to achieve a hard upper bound for query latencies
No pre-computation
Latency
Storage
Full Pre-Cube
(KV Store)
Partial pre-computation
(Startree Index)
T= 10000
T= 100
17. Trino Pinot Connector: Aggregation Pushdown
Chasing the light: Aggregation pushdown
- Issue single Pinot broker request
- Best-effort push down for aggregations like
count/sum/min/max/distinct/approximate_distinct, etc
- 10~100x latency improvement
18. Passthrough Broker Queries
SELECT CASE WHEN team = ‘Giants’ then ‘BIG’ else ‘SMALL’
END AS size, team, count(*) FROM
pinot.default.”SELECT team
FROM baseball_stats WHERE conference = ‘America East’”
GROUP BY CASE WHEN team = ‘Giants’ then ‘BIG’ else
‘SMALL’ END, 2
Group by Expression Support
19. Passthrough Broker Queries
SELECT team, count(*) FROM
pinot.default.”SELECT team, player
FROM baseball_stats WHERE conference = ‘America East’”
ORDER BY CASE WHEN team = ‘Giants’ then ‘BIG’ else
‘SMALL’ END, 2
Order by Expression Support
20. Trino Pinot Connector: Server Query + Pinot Streaming API
Pinot Streaming(gRpc) Connector
- Distributed workload in parallel among Trino workers
- Configurable memory footprint for data pulling from Pinot
- Open the gate of queries requires full table scan or join
21. Ongoing and Future Work on the Connector
● Data Insertion
○ Push segments to the controller
○ Adds or replaces segments.
Explain controller - single coordinator controlling all actions
cluster state - to maintain partition assignment - partition to server mapping
Cluster state maintains start and end offsets
For certain query pattern (slice and dice on a given list of dimensions), we allow users to configure a upper bound of documents to scan. Pinot will intelligently partial pre-aggregate the records to achieve the requirement, but without exploding the storage
Components: Coordinator (query endpoint, metadata), workers (process query results)
Divides the work into splits which are processed in parallel
Results returned from coordinator in final phase of processing
Join 2 pinot tables
Build Broker Query
Pushdown filter, aggregation, limit
Produce a single broker split
Submit broker request
Produce Results to Trino
Process joins, other filters, aggregations and final limit
Return results to client
Build Broker Query
Pushdown filter, aggregation, limit
Produce a single broker split
Submit broker request
Produce Results to Trino
Process joins, other filters, aggregations and final limit
Return results to client
First step is to get the metadata from the pinot controller,
Talk about how this is configurable with cache ttl config
Pinot - Fast single table OLAP
Trino - Powerful connector ecosystem
Complete system - covers entire landscape
Get the best of Trino and Pinot
Proven stack at Uber and many more
Pinot - Fast single table OLAP
Trino - Powerful connector ecosystem
Complete system - covers entire landscape
Get the best of Trino and Pinot
Proven stack at Uber and many more