Data Lake Implementation: Processing and Querying Data in Place (STG204-R1) - AWS re:Invent 2018

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Data Lake Implementation
Processing and Querying Data in Place
STG204
John Mallory, Storage Business Development Manager, AWS/BD
Gene Stevens, CTO & Co-Founder, ProtectWise
Joshua Hollander, Principal Software Engineer, ProtectWise

Finding Value in Data is a Journey
Business Monitoring
Business Insights
New Business Opportunity
Business Optimization
Business Transformation
Evolving Tools and Methods
AI/ML
SQL Query

Why Use AWS for Big Data?
Agility
Scalability
Get to Insights Faster
Broadest and Deepest Capabilities
Low Cost
Data Migrations Made Easy

Defining the AWS Data Lake
Data lake is an architecture with a virtually
limitless centralized storage platform capable
of categorization, processing, analysis, and
consumption of heterogeneous datasets
Key data lake attributes
• Decoupled storage and compute
• Rapid ingest and transformation
• Secure multi-tenancy
• Query in place
• Schema on read

Processing and Querying In Place
Fully Managed Process & Query
• Catalog, Transform, & Query Data in Amazon S3
• No physical instances to manage
User-Defined Functions (Lambda Function)
• Bring your own functions & code
• Execute without provisioning servers

Example of AWS Services for Data Lake
Ingest Methods
Central Storage
Catalog
Processing and Analytics
Access and Secure

Why Amazon S3 for the Data Lake?
High Performance
Secure
Durable
Available
Easy to use
Scalable & Affordable
Integrated

Optimize Costs with Data Tiering
Hot to cold: HDFS, Amazon S3 Standard, Amazon S3 - Infrequent Access, Amazon Glacier
• Use EMR/Hadoop with local HDFS for hottest datasets
• Store cooler data in Amazon S3 and cold in Amazon Glacier to reduce costs
• Use Amazon S3 Analytics to optimize tiering strategy
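The tiering bullets above could be expressed as an S3 lifecycle configuration. The following is a minimal sketch, not a recommended policy: the `raw/` prefix, the bucket name, and the 30/90-day thresholds are illustrative assumptions that do not come from the slides.

```python
def tiering_rule(prefix, ia_after_days=30, glacier_after_days=90):
    """Build one S3 lifecycle rule that moves objects under `prefix` to colder tiers."""
    if glacier_after_days <= ia_after_days:
        raise ValueError("Glacier transition must come after the Infrequent Access transition")
    return {
        "ID": f"tier-{prefix.strip('/') or 'all'}",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [
            # Day counts are assumptions; tune them from S3 Analytics findings.
            {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
            {"Days": glacier_after_days, "StorageClass": "GLACIER"},
        ],
    }


lifecycle = {"Rules": [tiering_rule("raw/")]}

# Applying the rules needs boto3 and AWS credentials, so the call is left commented out.
# The bucket name is hypothetical.
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="example-data-lake", LifecycleConfiguration=lifecycle)
```

Keeping the rule builder separate from the API call makes the thresholds easy to review and adjust before anything is applied to a bucket.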

Rapidly Ingest All Data Sources
A Data Lake Needs to Accommodate a Wide Variety of Concurrent Data Sources
IoT, Sensor Data, Clickstream Data, Social Media Feeds, Streaming Logs
Oracle, MySQL, MongoDB, DB2, SQL Server, Amazon RDS
On-premises ERP, Mainframes, Lab Equipment, NAS Storage
Offline Sensor Data, NAS, On-premises Hadoop
On-premises data lakes, EDW, Large-Scale Data Collection
Ingest Methods

Ingest Optimization Considerations
Separate ingest buckets from Amazon S3 data lake buckets
• Prepare data before loading into data lake
• Preserve raw assets (potentially lifecycle to Amazon Glacier)
Aggregate smaller files/objects ahead of Amazon S3 where possible
• Reduces transaction costs
• Avoids TPS limits (can also pre-partition S3 bucket)
Consider AWS Storage Gateway/AWS Snowball Edge for file data
• Converts files to S3 objects, while preserving metadata
Build Automated Data Pipelines
• AWS Lambda
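The advice to aggregate smaller files before they reach Amazon S3 can be sketched as a simple size-based batcher. This is an illustration only: the function name and the 128 MB default target are assumptions, and a real pipeline would also need to respect record boundaries for its format.

```python
def batch_records(records, target_bytes=128 * 1024 * 1024):
    """Group small byte payloads into batches of at most roughly `target_bytes` each."""
    batch, size = [], 0
    for record in records:
        # Flush the current batch before it would grow past the target size.
        if batch and size + len(record) > target_bytes:
            yield b"".join(batch)
            batch, size = [], 0
        batch.append(record)
        size += len(record)
    if batch:
        yield b"".join(batch)


# Five tiny "files" combined into fewer, larger objects (tiny target for demonstration).
small_files = [b"a,1\n", b"b,2\n", b"c,3\n", b"d,4\n", b"e,5\n"]
objects = list(batch_records(small_files, target_bytes=10))
print(len(objects))  # → 3
```

Each yielded batch would become one PUT request, so five small uploads collapse into three here, which is how aggregation cuts both request costs and request rates.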

Amazon Kinesis—Real Time
Easily collect, process, and analyze video and data streams in real time
Kinesis Video Streams: Capture, process, and store video streams for analytics
Kinesis Data Streams: Build custom applications that analyze data streams
Kinesis Data Firehose: Load data streams into AWS data stores
Kinesis Data Analytics: Analyze data streams with SQL

Common AWS Data Pipeline Configuration
Highly decoupled configurations scale better, are more fault tolerant, and cost optimized
Raw Data: Amazon S3
ETL (Hadoop): Amazon EMR
Triggered Code: AWS Lambda
Staged Data (Data Lake): Amazon S3
ETL & Catalog Management: AWS Glue
Data Warehouse: Amazon Redshift
Triggered Code: AWS Lambda
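The "Triggered Code" step in this pipeline might look like the following Lambda handler, which reacts to an S3 event notification for a newly arrived raw object. It is a sketch under stated assumptions: the `raw/` and `staged/` prefixes and the helper names are hypothetical, and the actual transform and upload are omitted.

```python
from urllib.parse import unquote_plus

RAW_PREFIX = "raw/"        # hypothetical key layout for raw data
STAGED_PREFIX = "staged/"  # hypothetical key layout for staged data


def objects_in_event(event):
    """Yield (bucket, key) pairs from an S3 event notification payload."""
    for record in event.get("Records", []):
        s3 = record["s3"]
        # Object keys arrive URL-encoded in S3 event notifications.
        yield s3["bucket"]["name"], unquote_plus(s3["object"]["key"])


def staged_key(raw_key):
    """Map a raw object key to the key its processed copy would be written to."""
    if not raw_key.startswith(RAW_PREFIX):
        raise ValueError(f"unexpected key outside {RAW_PREFIX}: {raw_key}")
    return STAGED_PREFIX + raw_key[len(RAW_PREFIX):]


def handler(event, context):
    """Lambda entry point: plan where each new raw object should be staged."""
    # Only the plan is returned; reading, transforming, and writing are left out.
    return [
        {"bucket": bucket, "source": key, "target": staged_key(key)}
        for bucket, key in objects_in_event(event)
    ]
```

Because the handler only depends on the event payload, it can be exercised locally with a hand-built event dictionary before it is wired to a bucket notification.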

Process Data in Place…
Amazon S3
Amazon Athena
Amazon Redshift Spectrum
Amazon SageMaker
AWS Glue

Amazon S3 Select and Amazon Glacier Select
Select subset of data from an object based on a SQL expression

Motivation Behind Amazon S3 Select
GET all the data from S3 objects, and my application will filter the data that I need
Redshift Spectrum Example:
Customer: Run 50,000 queries
Amount of data fetched from S3: 6 PB
Amount of data used in Amazon Redshift: 650 TB
Data needed from S3: 10%

Amazon S3 Select
Input
  Format: delimited text (CSV, TSV), JSON, Parquet …
  Compression: GZIP, BZIP2 …
Output
  Format: delimited text (CSV, TSV), JSON …
Clauses   Data types                Operators           Functions
Select    String                    Conditional         String
From      Integer, Float, Decimal   Math                Cast
Where     Timestamp                 Logical             Math
          Boolean                   String (Like, ||)   Aggregate
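The input and output options listed above map onto the request parameters of the S3 Select API. The sketch below builds the keyword arguments for boto3's `select_object_content` call, assuming GZIP-compressed CSV input and JSON output; the helper name, bucket, key, and query are illustrative assumptions.

```python
def select_request(bucket, key, sql, compression="NONE", has_header=False):
    """Build keyword arguments for the boto3 S3 client's select_object_content call."""
    if compression not in {"NONE", "GZIP", "BZIP2"}:
        raise ValueError(f"unsupported compression: {compression}")
    return {
        "Bucket": bucket,
        "Key": key,
        "ExpressionType": "SQL",
        "Expression": sql,
        "InputSerialization": {
            # Without a header row, columns are addressed positionally as _1, _2, ...
            "CSV": {"FileHeaderInfo": "USE" if has_header else "NONE"},
            "CompressionType": compression,
        },
        "OutputSerialization": {"JSON": {"RecordDelimiter": "\n"}},
    }


# Hypothetical bucket, key, and query.
request = select_request(
    "example-bucket",
    "logs/part-0000.csv.gz",
    "SELECT s._1, s._2 FROM s3object s WHERE s._2 = 'GET'",
    compression="GZIP",
)

# With boto3 and credentials available, the call would look like:
# import boto3
# response = boto3.client("s3").select_object_content(**request)
```

The response is an event stream, so the selected records are read from its `Records` events rather than from a single body.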

Amazon S3 Select: Serverless MapReduce
Before
200 seconds and 11.2 cents
# Download and process all keys
for key in src_keys:
    response = s3_client.get_object(Bucket=src_bucket, Key=key)
    contents = response['Body'].read().decode('utf-8')
    for line in contents.split('\n')[:-1]:
        line_count += 1
        try:
            data = line.split(',')
            srcIp = data[0][:8]
            ….
After
95 seconds and 2.8 cents
# Select IP Address and Keys
for key in src_keys:
    response = s3_client.select_object_content(
        Bucket=src_bucket, Key=key, ExpressionType='SQL',
        Expression="SELECT SUBSTR(obj._1, 1, 8), obj._2 FROM s3object AS obj",
        InputSerialization={'CSV': {}},  # assumes uncompressed CSV input
        OutputSerialization={'CSV': {}})
    # The response is an event stream; 'Records' events carry the selected bytes
    contents = b''.join(event['Records']['Payload']
                        for event in response['Payload'] if 'Records' in event)
    for line in contents.decode('utf-8').split('\n')[:-1]:
        line_count += 1
        try:
            ….
2X Faster at 1/5 of the cost

Amazon S3 Select: Accelerating Big Data
Before and after comparison (EMR 5.18.0)
5X Faster with 1/40 of the CPU

Choosing the Right Data Formats
There is no such thing as the “best” data format
• All involve tradeoffs, depending on workload & tools
• CSV, TSV, JSON are easy, but not efficient
  • Compress & store/archive as raw input
• Compressed columnar formats are generally preferred
  • Parquet or ORC
  • Smaller storage footprint = lower cost
  • More efficient scan & query
• Row-oriented (Avro) is good for full data scans
Key considerations are cost, performance & support

Choosing the Right Data Formats (cont.)
Pay by the amount of data scanned per query
Use Compressed Columnar Formats
• Parquet
• ORC
Easy to integrate with wide variety of tools
Dataset                                 Size on Amazon S3       Query Run Time   Data Scanned            Cost
Logs stored as text files               1 TB                    237 seconds      1.15 TB                 $5.75
Logs stored in Apache Parquet format*   130 GB                  5.13 seconds     2.69 GB                 $0.013
Savings                                 87% less with Parquet   34x faster       99% less data scanned   99.7% cheaper
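The Cost column follows from the pay-per-data-scanned model. As a worked check, the sketch below assumes a rate of $5 per TB scanned (the rate the table's figures imply, not one stated on the slide) and takes 1 TB as 1000 GB.

```python
PRICE_PER_TB_USD = 5.00  # assumption: per-TB-scanned rate implied by the table


def query_cost(tb_scanned, price_per_tb=PRICE_PER_TB_USD):
    """Cost in USD of a query billed purely on the amount of data scanned."""
    return tb_scanned * price_per_tb


text_cost = query_cost(1.15)            # logs stored as text, 1.15 TB scanned
parquet_cost = query_cost(2.69 / 1000)  # Parquet, 2.69 GB scanned

print(f"text: ${text_cost:.2f}, Parquet: ${parquet_cost:.3f}")
```

Under these assumptions the two results round to the $5.75 and $0.013 shown in the table, which is why reducing bytes scanned translates directly into lower query cost.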

Data Prep is ~80% of Data Lake Work
Building training sets
Cleaning and organizing data
Collecting data sets
Mining data for patterns
Refining algorithms
Other

AWS Glue—Serverless Data Catalog & ETL
Data Catalog: Discover data and extract schema
ETL Job authoring: Auto-generates customizable ETL code in Python and Spark
Automatically discovers data and stores schema
Data searchable, and available for ETL
Generates customizable code
Schedules and runs your ETL jobs
Serverless

Amazon Athena—Interactive Analysis
Interactive query service to analyze data in Amazon S3 using standard SQL
No infrastructure to set up or manage and no data to load
Supports Multiple Data Formats – Define Schema on Demand
Query Instantly
Pay per query
Open
Easy

Amazon Redshift Spectrum
Extend the data warehouse to exabytes of data in S3 data lake
Components: S3 data lake, Amazon Redshift data, Redshift Spectrum query engine
Exabyte-scale Amazon Redshift SQL queries against S3
Join data across Amazon Redshift and Amazon S3
Scale compute and storage separately
Stable query performance and unlimited concurrency
CSV, ORC, Grok, Avro, & Parquet data formats
Pay only for the amount of data scanned

Amazon SageMaker
The quickest and easiest way to get ML models from idea to production
End-to-End Machine Learning Platform
Zero setup
Flexible Model Training
Pay by the second

Top Data Lake Use Case Scenarios
Data Warehouse Modernization
Protect Your On-premises Data Lake
Real Time Analytics
Business Intelligence & Data Exploration
AI/Machine Learning

Case Study: ProtectWise
Gene Stevens
CTO & Co-Founder
ProtectWise
Joshua Hollander
Principal Software Engineer
ProtectWise

ProtectWise Overview
● Cloud-Delivered Network Detection & Response (NDR) platform
● >500 terabytes of data analyzed/day
● >10M transactions/second
● Petabytes at rest

Why AWS
Core strategy
● Time to market
● COGS innovation as foundational
● Evolutionary architecture
● Continuous learning
● Human time as most critical
● AWS roadmap ahead of industry
Innovation as first principle
Or, how to be as good at what you don’t know as what you do know

Why AWS
● Rate of innovation is king
● Easily introduce new technologies
● Ability to surgically target optimizations
● Must be ahead of feature and volume growth
● Transactional costs must continuously trend down
Building your own infrastructure puts core strategy at risk
COGS innovation is foundational

What is Explorer?
Interactively search entire network timeline
What IPs did 172.20.243.225 talk to using BitTorrent in the last 30 days?
Hunt for threats
Did device X make any HTTP requests with a User Agent associated with known malware?
Visualize

Explorer

Explorer 1.0
Before Explorer:
● Used Apache Cassandra
● Used for just KV lookups
Explorer required:
● Searching
● Sorting
● Faceting

Choosing DataStax Enterprise Edition
● Already using Cassandra
● Needed searchability
● Not all of this data is immutable
● Time to market prioritized over cost

Challenges with initial solution
● Operational
○ Performance tuning
○ Scaling
○ Data retention
● Cost
○ Licensing
○ Instances
○ Storage
○ Operational

Possibilities
● Cassandra + Elasticsearch
● NoSQL solution “X”
● Hive
● Presto
Want the cost and scale profile of S3-based warehouse solutions but with similar query performance to DSE.

Hybrid custom + off-the-shelf

Recipe
● Inspired by Hive, Presto et al
● Leverage existing tools
Combine with secret ingredient...

Secret ingredient: Probabilistic meta indexes
Bloom filters!!!
● Hive already has Bloom filter support via ORC on block level
● Our customization: the metastore is a Bloom filter!

A “searchable” Bloom filter
Indexing
Index Field   Value         Indexed Values   Doc ID
ip            192.168.0.1   {0, 3, 5}        0
ip            10.0.0.1      {1, 2, 4}        1
ip            8.8.8.8       {0, 1, 4}        2
Term   Doc IDs
0      0,2
1      1,2
2      1
3      0
4      1,2
5      0
Queries
Field   Query String     Actual Query
ip      ip:192.168.0.1   ip_bits:0 AND 3 AND 5
ip      ip:10.0.0.1      ip_bits:1 AND 2 AND 4
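The indexing idea on this slide can be sketched as an inverted index whose terms are Bloom filter bit positions. This is an illustrative reconstruction, not ProtectWise's implementation: the class name, the SHA-256 based hashing scheme, and the filter size are assumptions. The example data reuses the slide's Indexed Values column directly.

```python
import hashlib
from collections import defaultdict


def bit_positions(value, num_bits=64, num_hashes=3):
    """Map a value to Bloom filter bit positions (hypothetical hashing scheme)."""
    positions = set()
    for seed in range(num_hashes):
        digest = hashlib.sha256(f"{seed}:{value}".encode("utf-8")).digest()
        positions.add(int.from_bytes(digest[:8], "big") % num_bits)
    return positions


class SearchableBloomIndex:
    """Inverted index from Bloom bit position (the "term") to document IDs."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, bits):
        for bit in bits:
            self.postings[bit].add(doc_id)

    def lookup(self, bits):
        """Return docs that have every bit set: an AND across posting lists."""
        doc_sets = [self.postings.get(bit, set()) for bit in bits]
        return set.intersection(*doc_sets) if doc_sets else set()


# Index the slide's three documents using its bit positions.
index = SearchableBloomIndex()
for doc_id, bits in enumerate([{0, 3, 5}, {1, 2, 4}, {0, 1, 4}]):
    index.add(doc_id, bits)

print(index.lookup({0, 3, 5}))  # candidates for 192.168.0.1 → {0}
print(index.lookup({1, 2, 4}))  # candidates for 10.0.0.1 → {1}
```

A lookup can return false positives but never false negatives, so the result is a candidate set that still has to be confirmed by reading the underlying data.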

Query system
REST API
Leverage Spark APIs: PrunedFilteredScan
1. Spark SQL analyzes queries
2. Meta Index is queried with Spark Pushdown values
   a. Apply partition filters to any Spark Pushdown filters
   b. Apply bloom filters to any Spark Pushdown filters
3. Hand off resulting file list to Spark: FileScanRDD
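Steps 2 and 3 above amount to pruning a file list before Spark reads anything. The sketch below shows that pruning in plain Python rather than through the Spark `PrunedFilteredScan` API; the `FileEntry` structure, the day-based partitioning, and the S3 paths are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileEntry:
    path: str
    day: str               # partition value, e.g. "2018-11-27" (assumed layout)
    bloom_bits: frozenset  # bit positions set in this file's Bloom filter


def prune(files, first_day, last_day, required_bits):
    """Return paths that survive the partition filter and then the Bloom filter."""
    in_range = [f for f in files if first_day <= f.day <= last_day]
    # A file can contain the value only if every required bit is set in its filter.
    return [f.path for f in in_range if required_bits <= f.bloom_bits]


catalog = [
    FileEntry("s3://example/day=2018-11-01/a.orc", "2018-11-01", frozenset({0, 3, 5})),
    FileEntry("s3://example/day=2018-11-02/b.orc", "2018-11-02", frozenset({1, 2, 4})),
    FileEntry("s3://example/day=2018-12-01/c.orc", "2018-12-01", frozenset({0, 3, 5})),
]

# Files worth scanning for a value whose bits are {0, 3, 5} during November 2018.
to_scan = prune(catalog, "2018-11-01", "2018-11-30", {0, 3, 5})
print(to_scan)  # → ['s3://example/day=2018-11-01/a.orc']
```

Applying the cheap partition filter first keeps the Bloom filter check limited to files that could match on time range at all.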

● Hive/Presto-like query system
● Custom metastore with probabilistic meta indexes

Outcome
● 95% Cost reduction over previous Explorer system
● Minimizes the amount of data pulled from S3 to satisfy queries
● Performance:
○ p95 Query times in seconds
○ p95 Key Value lookup times are sub-second!
● Scale out is nearly zero effort
○ No storage scale out required
○ Query servers auto-scale according to query load

Amazon S3
Cheap
Scalable
• Separate storage from compute
• Storage scales with zero effort
• Scale compute independently
Reliable
Fast

Spark "On-Demand"
Amazon EMR
Scaling
• Inter-cluster
• Intra-cluster
• Easy auto scaling
Used for
• Ingest
• Query

The future
● Incorporating S3 Select
○ Decrease query time and cost even more
○ Effort currently underway
● Leveraging Kinesis Data Firehose
○ Simplify ingest pipeline
○ Reduce operational complexity
● Go “Serverless”
○ Meta store
○ Query compute
○ Ingest

Lessons learned
• Features first, economics later
• A hybrid of build and buy can be effective
• Open source ecosystem is rich
• AWS tools are flexible building blocks
• AWS roadmap leads on innovation
High-speed, probabilistic data lakes sourced in Amazon S3 are foundational.

Thank you!
Gene Stevens @genestevens
Joshua Hollander @jholla14
ProtectWise.com/TestDrive
