Today, data lakes are widely used and have become extremely affordable as data volumes have grown. However, they are only meant for storage and by themselves provide no direct value. With up to 80% of data stored in the data lake today, how do you unlock the value of the data lake? The value lies in the compute engine that runs on top of a data lake.
Join us for this webinar where Ahana co-founder and Chief Product Officer Dipti Borkar will discuss how to unlock the value of your data lake with the emerging Open Data Lake analytics architecture.
Dipti will cover:
- Open Data Lake analytics: what it is and what use cases it supports
- Why companies are moving to an open data lake analytics approach
- Why the open source data lake query engine Presto is critical to this approach
Unlocking the Value of Your Data Lake
1. Unlocking the Value of Your Data Lake
Dipti Borkar
Cofounder, Chief Product Officer &
Chief Evangelist
Chairperson |Community Team
Presto Foundation
2. Today’s Speaker
Dipti is a Cofounder, CPO & Chief Evangelist of Ahana with over 15 years of experience
in distributed data and database technology, including relational, NoSQL and
federated systems. She is also the Presto Foundation Outreach Chairperson. Prior
to Ahana, Dipti held VP roles at Alluxio, Kinetica and Couchbase. At Alluxio, she was
Vice President of Products, and at Couchbase she held several leadership positions,
including VP of Product Marketing, Head of Global Technical Sales and Head of
Product Management. Earlier in her career, Dipti managed development teams at
IBM DB2 Distributed, where she started as a database software engineer.
Dipti holds an M.S. in Computer Science from UC San Diego and an MBA from the
Haas School of Business at UC Berkeley.
Dipti Borkar
Cofounder, Chief Product
Officer and Chief Evangelist
Ahana
3. The Traditional Data Warehouse
• Relational Database
• Columnar Structure
• In-Database Analytics
• Structured Data
• Modeled Data
• Extract, Transform, Load
• SQL Access
Challenges
• Expensive
• Difficult to Manage
• Costly to Maintain
• Limited Data
• Limited Access
4. The Drivers Behind Modernization
Digital
Transformation
Real Time
Events
Modern
Processing
Techniques
More Data
Fast Data
Smart Data
The Deconstructed Database
5. Why Open Data Lake Analytics?
Enterprise Data
Beyond Enterprise
Data
IoT, Third-party,
Telemetry, Event
1000X
More Data
Terabytes to
Petabytes
Open &
Flexible
Open Source,
Open Formats
Reporting &
Dashboarding
Data
Science
In-data lake
transformation
Reporting &
Dashboarding
Data Warehouse
Open Data Lakes
6. The Traditional Data Lake
• File System Data Store / Object Store
• Structured / Semi-Structured Data
• Ingestion
• Discovery
• Data Science
• Notebook and Python Access
• Less expensive, but…
• Good enough performance
• Supports ~70% of DW workloads
• Different approach to governance
7.
Data
SQL Query Processing
Data Warehouse
Cloud Data Lake
Data Processing
1-10 TB
1TB -> PB
The Next Data Warehouse is Open Data Lake Analytics
Reporting &
Dashboarding
Data
Science
In-data lake
transformation
Open Data Lake Analytics
Reporting & Dashboarding
8.
Data Warehouse
Operational
Data Stores
Third Party
Data
Machine Learning
Semi- | unstructured
Data Virtualization /
Federated Access
Streaming &
IoT Data
SQL Query Processing
SQL Query
Processing
The Data Platform
ETL
ELT
Data
Engineering
Storage
Compute
1-10 TB
Query & Processing
Storage Compute
SQL
Structured Workloads
1TB -> PB
Data
Lake
Reporting
Dashboards
Visualizations
Notebooks
Custom Apps
9. Cloud data lake driving open source SQL query engines
Presto is the De-Facto SQL Engine for Data Lakes
https://db-engines.com/en/ranking_trend/relational+dbms
10. Similarities Between the Modern Data Warehouse & the Modern Data Lake
• Cloud-First
• In-Memory Capabilities
• Complex Data Types
• Separate Storage & Compute
• Expanded Analytics
• Improved Performance
• Storage Options
• SQL Access
• Cloud-First
• In-Memory Capabilities
• Columnar Data Types
• Separate Storage & Compute
• Expanded Analytics
• Improved Performance
• Storage Options
• SQL Access
11. Merging the Data Warehouse and the Data Lake with a Distributed Query Engine
1. SQL Access
2. Data Lake and Data Warehouse Access
3. Unified Analytics
4. Distributed Queries
5. Limitless Scale
6. Complex Data Types
• Leverage Resources
• Better Insight
• More Use Cases
• Leverage Platforms
• Remove Limits
• Amplified Insight
13.
Emerging
use cases
Use Cases
Data Lakehouse
analytics
Reporting &
dashboarding
Interactive
querying
use cases
Transformation
using SQL (ETL)
Federated access
across data sources
SQL
Data Science
Customer-facing
app analytics
24. Challenges with SQL on Open Data Lakes
• Cloud DW / AWS serverless options get very expensive for growing data volumes
▪ Cloud data warehouse costs grow much faster than compute engine costs
▪ Serverless options like AWS Athena charge per query and get expensive
• The “do it yourself” approach is complicated
▪ Big data skills in platform teams are limited
▪ Presto is complicated and operationally very time-consuming
• Presto on AWS (such as AWS Athena) has limited capabilities and doesn’t scale
▪ Limited concurrency of 20 per account
▪ No visibility into cluster logs or query logs; no flexibility or control over scale
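The cost argument on this slide can be made concrete with a small break-even calculation. The sketch below is a simplified model, not a pricing tool: it assumes a flat price per query (the slide's description of serverless pricing) and a flat hourly price for a cluster you run yourself. Every number in it is a hypothetical placeholder, not actual AWS or Ahana pricing.

```python
import math


def per_query_cost(queries: int, price_per_query: float) -> float:
    """Total cost when every query is billed individually."""
    return queries * price_per_query


def cluster_cost(hours: float, price_per_hour: float) -> float:
    """Total cost of a cluster billed by the hour, regardless of query count."""
    return hours * price_per_hour


def break_even_queries(hours: float, price_per_hour: float, price_per_query: float) -> int:
    """Smallest query count at which per-query billing costs at least as much as the cluster."""
    return math.ceil(cluster_cost(hours, price_per_hour) / price_per_query)


# Hypothetical prices, chosen only to illustrate the shape of the comparison.
HOURS = 100
PRICE_PER_HOUR = 10.0
PRICE_PER_QUERY = 0.5

threshold = break_even_queries(HOURS, PRICE_PER_HOUR, PRICE_PER_QUERY)
print(f"Per-query billing overtakes the cluster at {threshold} queries")
```

Under this model the cluster cost stays flat as query volume grows, while per-query cost keeps climbing, which is the effect the slide describes for growing workloads.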
26. Open Source Presto Overview
• Distributed SQL query engine
• Created at
• ANSI SQL on Databases, Data lakes
• Designed to be interactive & access
petabytes of data
• Open source, hosted at
https://github.com/prestodb
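To make "distributed SQL query engine" a little more concrete from a client's point of view: open source Presto accepts SQL over HTTP, with the statement text POSTed to the coordinator's /v1/statement endpoint and the user identified by an X-Presto-User header. The sketch below only builds such a request and does not send it. The coordinator URL, user name, catalog and schema are placeholders, not values from the talk.

```python
from typing import Optional
from urllib.request import Request


def build_statement_request(
    coordinator_url: str,
    sql: str,
    user: str,
    catalog: Optional[str] = None,
    schema: Optional[str] = None,
) -> Request:
    """Build (but do not send) an HTTP request that submits SQL to a Presto coordinator."""
    headers = {"X-Presto-User": user}
    # Catalog and schema headers are optional session defaults.
    if catalog:
        headers["X-Presto-Catalog"] = catalog
    if schema:
        headers["X-Presto-Schema"] = schema
    return Request(
        url=coordinator_url.rstrip("/") + "/v1/statement",
        data=sql.encode("utf-8"),
        headers=headers,
        method="POST",
    )


# Placeholder coordinator address and user name.
request = build_statement_request(
    "http://localhost:8080", "SELECT 1", user="analyst", catalog="hive", schema="default"
)
print(request.get_method(), request.full_url)
```

A real client would send the request and keep following the `nextUri` links in the JSON responses until the results are complete; the Presto CLI or a client library normally handles that loop.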
29. How Ahana Cloud Works
~ 30 mins to create the compute plane
https://app.ahana.cloud/signup Create Presto Clusters in your account
30. Ahana Cloud for Presto
Ahana Console (Control Plane)
CLUSTER
ORCHESTRATION
CONSOLIDATED
LOGGING
SECURITY &
ACCESS
BILLING &
SUPPORT
In-VPC Presto Clusters (Compute Plane)
AD HOC CLUSTER 1
TEST CLUSTER 2
PROD CLUSTER N
Glue
S3
RDS
Elasticsearch
Ahana
Cloud Account
Ahana console
oversees and
manages every
Presto cluster
Customer
Cloud Account
In-VPC orchestration of
Presto clusters, where
metadata, monitoring,
and data sources
reside
31. Ahana Cloud Overview
1. Ahana Managed Service
Console
2. Add data sources
3. Query data where it lives with
Federated Connectors (in place)
4. Cluster management
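Querying data "where it lives" relies on Presto addressing every table as catalog.schema.table, so a single SQL statement can join a data lake table with a table in an operational database. The sketch below builds such a federated query as a string. The catalog names (`hive` for S3 data, `mysql` for an RDS database), schemas, tables and columns are all hypothetical; real names depend on how the data sources were added to the cluster.

```python
def qualified_name(catalog: str, schema: str, table: str) -> str:
    """Return a Presto-style fully qualified table name: catalog.schema.table."""
    return f"{catalog}.{schema}.{table}"


def build_federated_query(lake_table: str, db_table: str) -> str:
    """Build one SQL statement that joins a data lake table with a database table."""
    return (
        "SELECT c.customer_name, count(*) AS event_count\n"
        f"FROM {lake_table} AS e\n"
        f"JOIN {db_table} AS c ON e.customer_id = c.customer_id\n"
        "GROUP BY c.customer_name"
    )


# Hypothetical catalogs: "hive" for files in S3, "mysql" for an RDS database.
events = qualified_name("hive", "weblogs", "events")
customers = qualified_name("mysql", "crm", "customers")
print(build_federated_query(events, customers))
```

Neither table is copied before the query runs; the engine reads each source through its connector at query time, which is what "in place" refers to.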
32. Case study: Securonix
NextGen SIEM
Cluster
AWS S3 Data
Lake
Glue
Metastore
Securonix provides security information and
event management (SIEM) software.
They use Ahana for in-app SQL
analytics on data from AWS S3 for
threat hunting.
They pull in billions of events per day,
which are stored in S3.
With Ahana Cloud, they saw 3x better
price performance compared with
Presto on AWS.
33. Ahana Cloud for Presto - Summary
Brings SQL on AWS S3 with an open data lake
+
USER
Presto compute brought to your data in your
VPC in your account
Fully managed Presto cluster life cycle
including idle-time management
Query AWS DBs - RDS/MySQL, RDS/Postgres,
Elasticsearch, Redshift
Cloud-native and highly available running on
Kubernetes
Bring your own
BI tool / Data Science Notebook
Metadata Catalog
Transaction Manager
Easy to use
3x Price Performance
Open & Flexible