© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Data Lake Implementation
Processing and Querying Data in Place
STG204
John Mallory, Storage Business Development Manager, AWS/BD
Gene Stevens, CTO & Co-Founder, ProtectWise
Joshua Hollander, Principal Software Engineer, ProtectWise
Finding Value in Data is a Journey
Business Monitoring
Business Insights
New Business Opportunity
Business Optimization
Business Transformation
Evolving Tools and Methods
SQL Query
AI/ML
Why Use AWS for Big Data?
Agility
Scalability
Get to Insights Faster
Broadest and Deepest Capabilities
Low Cost
Data Migrations Made Easy
Defining the AWS Data Lake
A data lake is an architecture with a virtually limitless centralized storage platform capable of categorization, processing, analysis, and consumption of heterogeneous datasets.
Key data lake attributes
• Decoupled storage and compute
• Rapid ingest and transformation
• Secure multi-tenancy
• Query in place
• Schema on read
Processing and Querying in Place
Fully Managed Process & Query
• Catalog, Transform, & Query Data in Amazon S3
• No physical instances to manage
User-Defined Functions (Lambda Function)
• Bring your own functions & code
• Execute without provisioning servers
Example of AWS Services for Data Lake
Ingest Methods
Central Storage
Catalog
Processing and Analytics
Access and Secure
Why Amazon S3 for the Data Lake?
High Performance
Secure
Durable
Available
Easy to use
Scalable & Affordable
Integrated
Optimize Costs with Data Tiering
Storage tiers, hot to cold: HDFS, Amazon S3 Standard, Amazon S3 - Infrequent Access, Amazon Glacier
• Use EMR/Hadoop with local HDFS for hottest datasets
• Store cooler data in Amazon S3 and cold in Amazon Glacier to reduce costs
• Use Amazon S3 Analytics to optimize tiering strategy
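The tiering above maps naturally onto an S3 lifecycle configuration. The sketch below only builds the rule document as a plain Python dictionary, in the shape that boto3's `put_bucket_lifecycle_configuration` accepts. The function name, the prefix, the rule ID scheme, and the 30/90 day thresholds are illustrative assumptions, not values from the talk.

```python
def build_tiering_rules(prefix, ia_after_days=30, glacier_after_days=90):
    """Return a lifecycle configuration that moves objects to colder tiers.

    The day thresholds are hypothetical; in practice S3 Analytics output
    would inform them.
    """
    if glacier_after_days <= ia_after_days:
        raise ValueError("Glacier transition must come after the IA transition")
    return {
        "Rules": [
            {
                "ID": "tier-" + prefix.strip("/").replace("/", "-"),
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }


rules = build_tiering_rules("logs/raw/")
# With boto3 installed and credentials configured, the rules could be applied with:
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="my-data-lake-bucket", LifecycleConfiguration=rules)
```

Keeping the rule builder separate from the API call makes the thresholds easy to review and test before anything touches a bucket.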
Rapidly Ingest All Data Sources
A data lake needs to accommodate a wide variety of concurrent data sources:
• IoT, Sensor Data, Clickstream Data, Social Media Feeds, Streaming Logs
• Oracle, MySQL, MongoDB, DB2, SQL Server, Amazon RDS
• On-premises ERP, Mainframes, Lab Equipment, NAS Storage
• Offline Sensor Data, NAS, On-premises Hadoop
• On-premises data lakes, EDW, Large-Scale Data Collection
Ingest
Methods
Ingest Optimization Considerations
Separate ingest buckets from Amazon S3 data lake buckets
• Prepare data before loading into data lake
• Preserve raw assets (potentially lifecycle to Amazon Glacier)
Aggregate smaller files/objects ahead of Amazon S3 where possible
• Reduces transaction costs
• Avoids TPS limits (can also pre-partition S3 bucket)
Consider AWS Storage Gateway/AWS Snowball Edge for file data
• Converts files to S3 objects, while preserving metadata
Build Automated Data Pipelines
• AWS Lambda
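The advice to aggregate smaller files ahead of Amazon S3 can be sketched as a simple size-based batcher. This is a minimal illustration, assuming records are already available as byte strings. The function name and the 32-byte limit are hypothetical; a real pipeline would use a limit in the megabyte range.

```python
def batch_records(records, max_batch_bytes):
    """Group small byte records into batches no larger than max_batch_bytes.

    A single record larger than the limit is emitted as its own batch.
    """
    batches, current, current_size = [], [], 0
    for record in records:
        # Flush the current batch if adding this record would exceed the limit.
        if current and current_size + len(record) > max_batch_bytes:
            batches.append(b"".join(current))
            current, current_size = [], 0
        current.append(record)
        current_size += len(record)
    if current:
        batches.append(b"".join(current))
    return batches


small_events = [b"event-%d\n" % i for i in range(10)]  # 8 bytes each
objects = batch_records(small_events, max_batch_bytes=32)
```

Each element of `objects` would then be written as one S3 object, so ten tiny PUT requests become three.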
Amazon Kinesis: Real Time
Easily collect, process, and analyze video and data streams in real time
• Kinesis Video Streams: capture, process, and store video streams for analytics
• Kinesis Data Streams: build custom applications that analyze data streams
• Kinesis Data Firehose: load data streams into AWS data stores
• Kinesis Data Analytics: analyze data streams with SQL
Common AWS Data Pipeline Configuration
Highly decoupled configurations scale better, are more fault tolerant, and are cost optimized
• Raw Data: Amazon S3
• ETL (Hadoop): Amazon EMR
• Triggered Code: AWS Lambda
• Staged Data (Data Lake): Amazon S3
• ETL & Catalog Management: AWS Glue
• Data Warehouse: Amazon Redshift
• Triggered Code: AWS Lambda
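The "Triggered Code" stages in this pipeline are typically Lambda functions invoked by S3 event notifications. The sketch below shows only the event-parsing part of such a handler. The bucket and key names in the sample event are made up, and the downstream action (starting an EMR step or Glue job) is left as a comment because the talk does not specify it.

```python
from urllib.parse import unquote_plus


def handler(event, context):
    """Collect (bucket, key) pairs from an S3 event notification."""
    new_objects = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        new_objects.append((bucket, key))
    # A real pipeline would start an ETL step here; this sketch only
    # returns what it saw.
    return new_objects


# Hypothetical event, trimmed to the fields the handler reads.
sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "raw-ingest"},
                "object": {"key": "logs/2018/11/flow+records.csv"}}}
    ]
}
result = handler(sample_event, None)
```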
Process Data in Place…
Amazon S3
Amazon Athena
Amazon Redshift Spectrum
Amazon SageMaker
AWS Glue
Amazon S3 Select and Amazon Glacier Select
Select subset of data from an object based on a SQL expression
Motivation Behind Amazon S3 Select
Without S3 Select: GET all the data from S3 objects, then filter in the application for the data actually needed
Redshift Spectrum example:
• Customer ran 50,000 queries
• Amount of data fetched from S3: 6 PB
• Amount of data used in Amazon Redshift: 650 TB
• Data needed from S3: 10%
Amazon S3 Select
Input
• Format: delimited text (CSV, TSV), JSON, Parquet …
• Compression: GZIP, BZIP2 …
Output
• Format: delimited text (CSV, TSV), JSON …
Clauses   Data types                Operators           Functions
Select    String                    Conditional         String
From      Integer, Float, Decimal   Math                Cast
Where     Timestamp                 Logical             Math
          Boolean                   String (Like, ||)   Aggregate
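Amazon S3 Select: Serverless MapReduce
Before: 200 seconds and 11.2 cents
# Download and process all keys
for key in src_keys:
    response = s3_client.get_object(Bucket=src_bucket, Key=key)
    contents = response['Body'].read().decode('utf-8')
    for line in contents.split('\n')[:-1]:
        line_count += 1
        try:
            data = line.split(',')
            srcIp = data[0][:8]
            ….
After: 95 seconds and 2.8 cents
# Select IP Address and Keys
for key in src_keys:
    response = s3_client.select_object_content(
        Bucket=src_bucket, Key=key,
        ExpressionType='SQL',
        Expression="SELECT SUBSTR(obj._1, 1, 8), obj._2 FROM s3object as obj",
        # Serialization settings are required by the API; plain CSV is assumed here
        InputSerialization={'CSV': {}},
        OutputSerialization={'CSV': {}})
    # The result arrives as an event stream; Records events carry the selected bytes
    contents = b''.join(event['Records']['Payload']
                        for event in response['Payload']
                        if 'Records' in event).decode('utf-8')
    for line in contents.splitlines():
        line_count += 1
        try:
            ….
2X Faster at 1/5 of the cost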
Amazon S3 Select: Accelerating Big Data
[Figure: before/after comparison, EMR 5.18.0]
5X Faster with 1/40 of the CPU
Choosing the Right Data Formats
There is no such thing as the “best” data format
• All involve tradeoffs, depending on workload & tools
• CSV, TSV, JSON are easy, but not efficient
  • Compress & store/archive as raw input
• Columnar compressed formats are generally preferred
  • Parquet or ORC
  • Smaller storage footprint = lower cost
  • More efficient scan & query
• Row-oriented formats (Avro) are good for full data scans
Key considerations are cost, performance & support
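A toy comparison can make the row versus columnar tradeoff concrete. The sketch below uses plain Python lists as stand-ins for the two layouts and counts how many values a single-column aggregate has to touch. The sample records and field names are invented, and real formats such as Parquet or ORC add compression and encoding on top of this basic idea.

```python
rows = [
    ("10.0.0.1", "GET", 200, 512),
    ("10.0.0.2", "POST", 500, 128),
    ("10.0.0.1", "GET", 404, 64),
]
names = ("src_ip", "method", "status", "bytes")

# Columnar layout: one list per column.
columns = {name: [row[i] for row in rows] for i, name in enumerate(names)}


def total_bytes_row_oriented(rows):
    """Sum the 'bytes' field; every field of every row is read."""
    total, touched = 0, 0
    for row in rows:
        touched += len(row)
        total += row[3]
    return total, touched


def total_bytes_columnar(columns):
    """Sum the 'bytes' field; only that one column is read."""
    col = columns["bytes"]
    return sum(col), len(col)
```

Both functions return the same total, but the columnar version touches one value per row instead of four, which is the effect behind "more efficient scan & query".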
Choosing the Right Data Formats (cont.)
Pay by the amount of data scanned per query
Use compressed columnar formats
• Parquet
• ORC
Easy to integrate with a wide variety of tools
Dataset                                 Size on Amazon S3       Query run time   Data scanned            Cost
Logs stored as text files               1 TB                    237 seconds      1.15 TB                 $5.75
Logs stored in Apache Parquet format*   130 GB                  5.13 seconds     2.69 GB                 $0.013
Savings                                 87% less with Parquet   34x faster       99% less data scanned   99.7% cheaper
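The cost column follows directly from paying per byte scanned. The worked arithmetic below assumes a price of $5 per TB scanned, which is what the text-file row implies ($5.75 for 1.15 TB), and treats 1 TB as 1024 GB. Both are assumptions for illustration rather than figures quoted in the talk.

```python
PRICE_PER_TB_SCANNED = 5.00  # assumed USD price, inferred from $5.75 / 1.15 TB


def query_cost(tb_scanned, price_per_tb=PRICE_PER_TB_SCANNED):
    """Cost of one query when billing is proportional to data scanned."""
    return tb_scanned * price_per_tb


text_cost = query_cost(1.15)             # logs stored as text
parquet_cost = query_cost(2.69 / 1024)   # 2.69 GB expressed in TB

storage_saving = 1 - 130 / 1024          # 130 GB of Parquet vs 1 TB of text
cost_saving = 1 - parquet_cost / text_cost
```

Under these assumptions the Parquet query costs a little over one cent, storage shrinks by about 87 percent, and the per-query cost drops by more than 99.7 percent, in line with the Savings row.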
Data Prep is ~80% of Data Lake Work
Building training sets
Cleaning and organizing data
Collecting data sets
Mining data for patterns
Refining algorithms
Other
AWS Glue: Serverless Data Catalog & ETL
Data Catalog: discover data and extract schema
• Automatically discovers data and stores schema
• Data searchable, and available for ETL
ETL job authoring: auto-generates customizable ETL code in Python and Spark
• Generates customizable code
• Schedules and runs your ETL jobs
• Serverless
Amazon Athena—Interactive Analysis
Interactive query service to analyze data in Amazon S3 using standard SQL
No infrastructure to set up or manage and no data to load
Supports Multiple Data Formats – Define Schema on Demand
• Query instantly
• Pay per query
• Open
• Easy
Amazon Redshift Spectrum
Extend the data warehouse to exabytes of data in S3 data lake
S3 data lake
Amazon Redshift data
Redshift Spectrum query engine
Exabyte Amazon Redshift SQL queries against S3
Join data across Amazon Redshift and Amazon S3
Scale compute and storage separately
Stable query performance and unlimited concurrency
CSV, ORC, Grok, Avro, & Parquet data formats
Pay only for the amount of data scanned
Amazon SageMaker
The quickest and easiest way to get ML models from idea to production
• End-to-End Machine Learning Platform
• Zero setup
• Flexible Model Training
• Pay by the second
Top Data Lake Use Case Scenarios
Data Warehouse Modernization
Protect Your On-premises Data Lake
Real Time Analytics
Business Intelligence & Data Exploration
AI/Machine Learning
Case Study: ProtectWise
Gene Stevens, CTO & Co-Founder, ProtectWise
Joshua Hollander, Principal Software Engineer, ProtectWise
ProtectWise Overview
● Cloud-Delivered Network Detection & Response (NDR) platform
● >500 terabytes of data analyzed/day
● >10M transactions/second
● Petabytes at rest
Why AWS
Core strategy
● Time to market
● COGS innovation as foundational
● Evolutionary architecture
● Continuous learning
● Human time as most critical
● AWS roadmap ahead of industry
Innovation as first principle
Or, how to be as good at what you don’t know as what you do know
Why AWS
● Rate of innovation is king
● Easily introduce new technologies
● Ability to surgically target optimizations
● Must be ahead of feature and volume growth
● Transactional costs must continuously trend down
Building your own infrastructure puts core strategy at risk
COGS innovation is foundational
What is Explorer?
Interactively search entire network timeline
What IPs did 172.20.243.225 talk to using BitTorrent in the last 30 days?
Hunt for threats
Did device X make any HTTP requests with a User Agent associated with known malware?
Visualize
Explorer
Explorer 1.0
Before Explorer:
● Used Apache Cassandra
● Used for just KV lookups
Explorer required:
● Searching
● Sorting
● Faceting
Choosing DataStax Enterprise Edition
● Already using Cassandra
● Needed searchability
● Not all of this data is immutable
● Time to market prioritized over cost
Challenges with initial solution
● Operational
○ Performance tuning
○ Scaling
○ Data retention
● Cost
○ Licensing
○ Instances
○ Storage
○ Operational
Possibilities
● Cassandra + Elasticsearch
● NoSQL solution “X”
● Hive
● Presto
Want the cost and scale profile of S3-based warehouse solutions but with similar query performance to DSE.
Hybrid custom + off-the-shelf
Recipe
● Inspired by Hive, Presto et al
● Leverage existing tools
Combine with secret ingredient...
Secret ingredient: Probabilistic meta indexes
Bloom filters!!!
● Hive already has Bloom filter support via ORC at the block level
● Our customization: the metastore is a Bloom filter!
A “searchable” Bloom filter
Indexing
Index Field   Value         Indexed Values   Doc ID
ip            192.168.0.1   {0, 3, 5}        0
ip            10.0.0.1      {1, 2, 4}        1
ip            8.8.8.8       {0, 1, 4}        2
Inverted index (term = Bloom filter bit position)
Term   Doc IDs
0      0,2
1      1,2
2      1
3      0
4      1,2
5      0
Queries
Field   Query String     Actual Query
ip      ip:192.168.0.1   ip_bits:0 AND 3 AND 5
ip      ip:10.0.0.1      ip_bits:1 AND 2 AND 4
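The indexing and query steps on this slide can be sketched in a few lines. The example below is a minimal illustration, not ProtectWise's implementation. It takes the bit positions straight from the "Indexed Values" column instead of hashing, builds the term-to-document postings, and answers a query by intersecting the postings for each bit. The function names are hypothetical.

```python
from collections import defaultdict

# Bit positions per document, copied from the "Indexed Values" column.
# A real Bloom filter would derive these by hashing each value.
doc_bits = {
    0: {0, 3, 5},   # ip 192.168.0.1
    1: {1, 2, 4},   # ip 10.0.0.1
    2: {0, 1, 4},   # ip 8.8.8.8
}


def build_postings(doc_bits):
    """Invert doc -> bits into bit -> set of doc IDs."""
    postings = defaultdict(set)
    for doc_id, bits in doc_bits.items():
        for bit in bits:
            postings[bit].add(doc_id)
    return postings


def candidate_docs(postings, query_bits):
    """AND together the posting sets for every bit the query value sets."""
    result = None
    for bit in query_bits:
        docs = postings.get(bit, set())
        result = docs if result is None else result & docs
    return result or set()


postings = build_postings(doc_bits)
hits = candidate_docs(postings, {0, 3, 5})   # ip:192.168.0.1
```

A hit is only a candidate: Bloom filters can report false positives but never false negatives, so the matching files still have to be scanned to confirm.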
Query system
REST API
Leverage Spark APIs: PrunedFilteredScan
1. Spark SQL analyzes queries
2. Meta Index is queried with Spark Pushdown values
   a. Apply partition filters to any Spark Pushdown filters
   b. Apply Bloom filters to any Spark Pushdown filters
3. Hand off resulting file list to Spark: FileScanRDD
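Steps 2 and 3 amount to pruning a file catalog before any data is read. The sketch below is a plain-Python stand-in for that logic, not the Spark `PrunedFilteredScan` code itself. The `FileEntry` type, the `prune_files` function, the example paths, and the date partition scheme are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileEntry:
    path: str
    partition: str          # for example, a date partition value
    bloom_bits: frozenset   # bit positions set in this file's Bloom filter


def prune_files(entries, partition_values, required_bits):
    """Return paths of files that may contain the queried value."""
    survivors = []
    for entry in entries:
        if entry.partition not in partition_values:
            continue   # step 2a: partition filter
        if not required_bits <= entry.bloom_bits:
            continue   # step 2b: Bloom filter says "definitely absent"
        survivors.append(entry.path)
    return survivors   # step 3: file list handed to the scan


catalog = [
    FileEntry("s3://example-bucket/dt=2018-11-01/a.parquet", "2018-11-01", frozenset({0, 3, 5})),
    FileEntry("s3://example-bucket/dt=2018-11-01/b.parquet", "2018-11-01", frozenset({1, 2, 4})),
    FileEntry("s3://example-bucket/dt=2018-11-02/c.parquet", "2018-11-02", frozenset({0, 3, 5})),
]
files = prune_files(catalog, {"2018-11-01"}, {0, 3, 5})
```

Only one of the three files survives both filters here, which is how the meta index keeps the amount of data pulled from S3 small.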
● Hive/Presto like query system
● Custom metastore with probabilistic meta indexes
Outcome
● 95% Cost reduction over previous Explorer system
● Minimizes the amount of data pulled from S3 to satisfy queries
● Performance:
○ p95 Query times in seconds
○ p95 Key Value lookup times are sub-second!
● Scale out is nearly zero effort
○ No storage scale out required
○ Query servers auto-scale according to query load
Amazon S3
Cheap
Scalable
• Separate storage from compute
• Storage scales with zero effort
• Scale compute independently
Reliable
Fast
Spark "On-Demand"
Amazon EMR
Scaling
• Inter-cluster
• Intra-cluster
• Easy auto scaling
Used for
• Ingest
• Query
The future
● Incorporating S3 Select
○ Decrease query time and cost even more
○ Effort currently underway
● Leveraging Kinesis Data Firehose
○ Simplify ingest pipeline
○ Reduce operational complexity
● Go “Serverless”
○ Meta store
○ Query compute
○ Ingest
Lessons learned
• Features first, economics later
• A hybrid of build and buy can be effective
• Open source ecosystem is rich
• AWS tools are flexible building blocks
• AWS roadmap leads on innovation
High-speed, probabilistic data lakes sourced in Amazon S3 are foundational.
Thank you!
Gene Stevens @genestevens
Joshua Hollander @jholla14
ProtectWise.com/TestDrive
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareAmazon Web Services
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWSAmazon Web Services
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckAmazon Web Services
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without serversAmazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...Amazon Web Services
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceAmazon Web Services
 

Plus de Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

Data Lake Implementation: Processing and Querying Data in Place (STG204-R1) - AWS re:Invent 2018

  • 1. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Data Lake Implementation: Processing and Querying Data in Place (STG204). John Mallory, Storage Business Development Manager, AWS/BD; Gene Stevens, CTO & Co-Founder, ProtectWise; Joshua Hollander, Principal Software Engineer, ProtectWise
  • 2. Finding Value in Data is a Journey: Business Monitoring; Business Insights; New Business Opportunity; Business Optimization; Business Transformation. Evolving Tools and Methods: AI/ML; SQL Query
  • 3. Why Use AWS for Big Data? Agility; Scalability; Get to Insights Faster; Broadest and Deepest Capabilities; Low Cost; Data Migrations Made Easy
  • 4. Defining the AWS Data Lake: A data lake is an architecture with a virtually limitless centralized storage platform capable of categorization, processing, analysis, and consumption of heterogeneous datasets. Key data lake attributes: • Decoupled storage and compute • Rapid ingest and transformation • Secure multi-tenancy • Query in place • Schema on read
  • 5. Processing and Querying In Place. User-Defined Functions (Lambda Function): • Bring your own functions & code • Execute without provisioning servers. Fully Managed Process & Query: • Catalog, Transform, & Query Data in Amazon S3 • No physical instances to manage
  • 6. Example of AWS Services for Data Lake: Central Storage; Catalog; Processing and Analytics; Access and Secure; Ingest Methods
  • 7. Why Amazon S3 for the Data Lake? High Performance; Secure; Durable; Available; Easy to use; Scalable & Affordable; Integrated
  • 8. Optimize Costs with Data Tiering. Tiers from hot to cold: HDFS (hottest), Amazon S3 Standard, Amazon S3 - Infrequent Access, Amazon Glacier. • Use EMR/Hadoop with local HDFS for hottest datasets • Store cooler data in Amazon S3 and cold data in Amazon Glacier to reduce costs • Use Amazon S3 Analytics to optimize tiering strategy
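The tiering on slide 8 can be expressed as an S3 lifecycle rule. The sketch below only builds the rule as a Python dictionary in the shape that boto3's `put_bucket_lifecycle_configuration` accepts; the `raw/` prefix, the rule ID, the 30/90-day thresholds, and the bucket name in the comment are illustrative assumptions, not values from the talk.

```python
def build_tiering_rule(prefix, ia_after_days=30, glacier_after_days=90):
    """Build one lifecycle rule that moves objects from S3 Standard to
    Infrequent Access and then to Glacier. Thresholds are assumptions."""
    if glacier_after_days <= ia_after_days:
        raise ValueError("Glacier transition must come after the IA transition")
    return {
        "ID": f"tier-{prefix.strip('/') or 'all'}",  # hypothetical naming scheme
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [
            {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
            {"Days": glacier_after_days, "StorageClass": "GLACIER"},
        ],
    }


lifecycle_configuration = {"Rules": [build_tiering_rule("raw/")]}

# Applying it needs boto3 and AWS credentials, so it is left as a comment:
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="my-ingest-bucket",  # hypothetical bucket name
#     LifecycleConfiguration=lifecycle_configuration,
# )
```

S3 Analytics output could then inform the day thresholds instead of the fixed guesses used here.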
  • 9. Rapidly Ingest All Data Sources. A data lake needs to accommodate a wide variety of concurrent data sources: IoT, Sensor Data, Clickstream Data, Social Media Feeds, Streaming Logs; Oracle, MySQL, MongoDB, DB2, SQL Server, Amazon RDS; On-premises ERP, Mainframes, Lab Equipment, NAS Storage; Offline Sensor Data, NAS, On-premises Hadoop; On-premises data lakes, EDW, Large-Scale Data Collection. Ingest Methods
  • 10. Ingest Optimization Considerations. Separate ingest buckets from Amazon S3 data lake buckets: • Prepare data before loading into data lake • Preserve raw assets (potentially lifecycle to Amazon Glacier). Aggregate smaller files/objects ahead of Amazon S3 where possible: • Reduces transaction costs • Avoids TPS limits (can also pre-partition S3 bucket). Consider AWS Storage Gateway/AWS Snowball Edge for file data: • Converts files to S3 objects, while preserving metadata. Build automated data pipelines: • AWS Lambda
  • 11. Amazon Kinesis: Real Time. Easily collect, process, and analyze video and data streams in real time. Kinesis Video Streams: capture, process, and store video streams for analytics. Kinesis Data Streams: build custom applications that analyze data streams. Kinesis Data Firehose: load data streams into AWS data stores. Kinesis Data Analytics: analyze data streams with SQL
  • 12. Common AWS Data Pipeline Configuration. Components: Raw Data (Amazon S3); ETL (Hadoop) (Amazon EMR); Triggered Code (AWS Lambda); Staged Data (Data Lake) (Amazon S3); ETL & Catalog Management (AWS Glue); Data Warehouse (Amazon Redshift); Triggered Code (AWS Lambda). Highly decoupled configurations scale better, are more fault tolerant, and are cost optimized
  • 13. Process Data in Place… Amazon S3; Amazon Athena; Amazon Redshift Spectrum; Amazon SageMaker; AWS Glue
  • 14. Amazon S3 Select and Amazon Glacier Select: select a subset of data from an object based on a SQL expression
  • 15. Motivation Behind Amazon S3 Select. Without it: GET all the data from S3 objects, and my application will filter the data that I need. Redshift Spectrum example: a customer ran 50,000 queries; amount of data fetched from S3: 6 PB; amount of data used in Amazon Redshift: 650 TB; data needed from S3: 10%
  • 16. Amazon S3 Select. Input format: delimited text (CSV, TSV), JSON, Parquet … Compression: GZIP, BZIP2 … Output format: delimited text (CSV, TSV), JSON … Clauses: Select; From; Where. Data types: String; Integer, Float, Decimal; Timestamp; Boolean. Operators: Conditional; Math; Logical; String (Like, ||). Functions: String; Cast; Math; Aggregate
  • 17. Amazon S3 Select: Serverless MapReduce. 2X faster at 1/5 of the cost
      Before (200 seconds and 11.2 cents):
        # Download and process all keys
        for key in src_keys:
            response = s3_client.get_object(Bucket=src_bucket, Key=key)
            contents = response['Body'].read()
            for line in contents.split('\n')[:-1]:
                line_count += 1
                try:
                    data = line.split(',')
                    srcIp = data[0][:8]
                    ….
      After (95 seconds and 2.8 cents):
        # Select IP Address and Keys
        for key in src_keys:
            response = s3_client.select_object_content(Bucket=src_bucket, Key=key, expression="SELECT SUBSTR(obj._1, 1, 8), obj._2 FROM s3object as obj")
            contents = response['Body'].read()
            for line in contents:
                line_count += 1
                try:
                    ….
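The "After" snippet on slide 17 is abbreviated slide code. A fuller sketch of the same idea follows, assuming headerless CSV objects and a boto3 S3 client passed in by the caller. In boto3, `select_object_content` takes `ExpressionType`, `Expression`, `InputSerialization`, and `OutputSerialization`, and returns an event stream under `Payload` rather than a `Body`. The function name `select_ip_prefixes` is hypothetical, and the expression uses `SUBSTRING`, the spelling given in the S3 Select SQL reference.

```python
def select_ip_prefixes(s3_client, bucket, key):
    """Yield one CSV line per record: the first 8 characters of column 1
    plus column 2, filtered server-side by S3 Select."""
    response = s3_client.select_object_content(
        Bucket=bucket,
        Key=key,
        ExpressionType="SQL",
        Expression="SELECT SUBSTRING(obj._1, 1, 8), obj._2 FROM s3object AS obj",
        InputSerialization={"CSV": {"FileHeaderInfo": "NONE"}},  # assumes no header row
        OutputSerialization={"CSV": {}},
    )
    buffer = b""
    for event in response["Payload"]:
        # Only "Records" events carry data; others (e.g. Stats, End) are skipped.
        if "Records" in event:
            buffer += event["Records"]["Payload"]
            # A chunk can end mid-line, so keep the trailing fragment buffered.
            *complete, buffer = buffer.split(b"\n")
            for line in complete:
                yield line.decode("utf-8")
    if buffer:
        yield buffer.decode("utf-8")


def count_lines(s3_client, bucket, keys):
    """Count the selected records across several objects, as the slide's loop does."""
    return sum(1 for key in keys for _ in select_ip_prefixes(s3_client, bucket, key))
```

With real credentials this would be called as `count_lines(boto3.client("s3"), src_bucket, src_keys)`; only the selected columns cross the network, which is where the time and cost difference on the slide comes from.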
  • 18. Amazon S3 Select: Accelerating Big Data (EMR 5.18.0). Before vs. after: 5X faster with 1/40 of the CPU
  • 19. Choosing the Right Data Formats. There is no such thing as the “best” data format • All involve tradeoffs, depending on workload & tools • CSV, TSV, JSON are easy, but not efficient: compress & store/archive as raw input • Columnar compressed formats are generally preferred: Parquet or ORC; smaller storage footprint = lower cost; more efficient scan & query • Row oriented (AVRO) is good for full data scans. Key considerations are cost, performance & support
  • 20. Choosing the Right Data Formats (cont’d). Pay by the amount of data scanned per query. Use compressed columnar formats: • Parquet • ORC. Easy to integrate with a wide variety of tools
      Dataset | Size on Amazon S3 | Query run time | Data scanned | Cost
      Logs stored as text files | 1 TB | 237 seconds | 1.15 TB | $5.75
      Logs stored in Apache Parquet format* | 130 GB | 5.13 seconds | 2.69 GB | $0.013
      Savings | 87% less with Parquet | 34x faster | 99% less data scanned | 99.7% cheaper
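The cost column on slide 20 follows directly from pay-per-data-scanned pricing. A small worked calculation, assuming a price of $5.00 per TB scanned (inferred from the slide's own $5.75 for 1.15 TB, not stated in the talk) and 1 TB = 1024 GB:

```python
PRICE_PER_TB_SCANNED = 5.00  # USD; assumption inferred from $5.75 / 1.15 TB
GB_PER_TB = 1024             # assumption: binary units


def scan_cost(tb_scanned, price_per_tb=PRICE_PER_TB_SCANNED):
    """Query cost when billing is proportional to the data scanned."""
    return tb_scanned * price_per_tb


text_cost = scan_cost(1.15)                 # logs stored as text files
parquet_cost = scan_cost(2.69 / GB_PER_TB)  # same logs stored as Parquet
storage_saving = 1 - 130 / GB_PER_TB        # 130 GB of Parquet vs. 1 TB of text

print(f"text query:     ${text_cost:.2f}")
print(f"parquet query:  ${parquet_cost:.3f}")
print(f"storage saving: {storage_saving:.0%}")
```

The columnar format helps twice: the files are smaller, and a query that touches only a few columns scans only a fraction of those smaller files.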
  • 21. Data Prep is ~80% of Data Lake Work. Chart categories: Building training sets; Cleaning and organizing data; Collecting data sets; Mining data for patterns; Refining algorithms; Other
  • 22. AWS Glue: Serverless Data Catalog & ETL. Data Catalog: discover data and extract schema; automatically discovers data and stores schema; data searchable, and available for ETL. ETL job authoring: auto-generates customizable ETL code in Python and Spark; generates customizable code. Serverless: schedules and runs your ETL jobs
  • 23. Amazon Athena: Interactive Analysis. Interactive query service to analyze data in Amazon S3 using standard SQL. No infrastructure to set up or manage and no data to load. Supports multiple data formats; define schema on demand. Query instantly; Pay per query; Open; Easy
  • 24. Amazon Redshift Spectrum: extend the data warehouse to exabytes of data in the S3 data lake. Diagram labels: S3 data lake; Amazon Redshift data; Redshift Spectrum query engine. Exabyte Amazon Redshift SQL queries against S3. Join data across Amazon Redshift and Amazon S3. Scale compute and storage separately. Stable query performance and unlimited concurrency. CSV, ORC, Grok, Avro, & Parquet data formats. Pay only for the amount of data scanned
  • 25. Amazon SageMaker: the quickest and easiest way to get ML models from idea to production. End-to-End Machine Learning Platform; Zero setup; Flexible Model Training; Pay by the second
  • 26. Top Data Lake Use Case Scenarios: Data Warehouse Modernization; Protect Your On-premises Data Lake; Real Time Analytics; Business Intelligence & Data Exploration; AI/Machine Learning
  • 27. Case Study: ProtectWise. Gene Stevens, CTO & Co-Founder, ProtectWise; Joshua Hollander, Principal Software Engineer, ProtectWise
  • 28. ProtectWise Overview ● Cloud-Delivered Network Detection & Response (NDR) platform ● >500 terabytes of data analyzed/day ● >10M transactions/second ● Petabytes at rest
  • 29. Why AWS. Innovation as first principle (or, how to be as good at what you don’t know as what you do know). Core strategy: ● Time to market ● COGS innovation as foundational ● Evolutionary architecture ● Continuous learning ● Human time as most critical ● AWS roadmap ahead of industry
  • 30. Why AWS. COGS innovation is foundational: ● Rate of innovation is king ● Easily introduce new technologies ● Ability to surgically target optimizations ● Must be ahead of feature and volume growth ● Transactional costs must continuously trend down. Building your own infrastructure puts core strategy at risk
  • 32. What is Explorer? Interactively search entire network timeline: “What IPs did 172.20.243.225 talk to using BitTorrent in the last 30 days?” Hunt for threats: “Did device X make any HTTP requests with a User Agent associated with known malware?” Visualize
  • 33. Explorer
  • 34. Explorer 1.0. Before Explorer: ● Used Apache Cassandra ● Used for just KV lookups. Explorer required: ● Searching ● Sorting ● Faceting
  • 35. Choosing DataStax Enterprise Edition ● Already using Cassandra ● Needed searchability ● Not all of this data is immutable ● Time to market prioritized over cost
  • 36. Challenges with initial solution ● Operational ○ Performance tuning ○ Scaling ○ Data retention ● Cost ○ Licensing ○ Instances ○ Storage ○ Operational
  • 38. Possibilities ● Cassandra + Elasticsearch ● NoSQL solution “X” ● Hive ● Presto. Want the cost and scale profile of S3-based warehouse solutions but with similar query performance to DSE.
  • 39. Hybrid custom + off-the-shelf
  • 40. Recipe ● Inspired by Hive, Presto et al ● Leverage existing tools. Combine with secret ingredient...
  • 41. Secret ingredient: Probabilistic meta indexes. Bloom filters!!! ● Hive already has Bloom filter support via ORC on block level ● Our customization: the metastore is a Bloom filter!
  • 42. A “searchable” Bloom filter
      Indexing:
        Field | Value | Indexed Values | Doc ID
        ip | 192.168.0.1 | {0, 3, 5} | 0
        ip | 10.0.0.1 | {1, 2, 4} | 1
        ip | 8.8.8.8 | {0, 1, 4} | 2
      Index:
        Term | Doc IDs
        0 | 0,1,2
        1 | 1,2
        2 | 1
        3 | 0
        4 | 1,2
        5 | 0
      Queries:
        Field | Query String | Actual Query
        ip | ip:192.168.0.1 | ip_bits:0 AND 3 AND 5
        ip | ip:10.0.0.1 | ip_bits:1 AND 4 AND 5
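The idea on slide 42 is to treat each Bloom filter bit position as a search term: indexing a value posts its document ID under every bit the value sets, and a lookup becomes an AND over those posting lists. A minimal sketch of that technique follows; the SHA-256-based hashing, the 64-bit filter width, and the class name `SearchableBloomIndex` are assumptions for illustration, not ProtectWise's implementation.

```python
import hashlib
from collections import defaultdict

NUM_BITS = 64    # assumption; the slide uses a tiny bit space for readability
NUM_HASHES = 3   # each value sets up to three bit positions, as on the slide


def bit_positions(value, num_bits=NUM_BITS, num_hashes=NUM_HASHES):
    """Map a value to the set of Bloom filter bit positions it sets."""
    positions = set()
    for seed in range(num_hashes):
        digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
        positions.add(int.from_bytes(digest[:8], "big") % num_bits)
    return positions


class SearchableBloomIndex:
    """Inverted index from bit position ("term") to document IDs."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, value):
        for bit in bit_positions(value):
            self.postings[bit].add(doc_id)

    def candidates(self, value):
        """Doc IDs that have every bit for `value` set: an AND of postings.
        False positives are possible, false negatives are not."""
        result = None
        for bit in bit_positions(value):
            docs = self.postings.get(bit, set())
            result = set(docs) if result is None else result & docs
        return result or set()


index = SearchableBloomIndex()
index.add(0, "192.168.0.1")
index.add(1, "10.0.0.1")
index.add(2, "8.8.8.8")
print(index.candidates("10.0.0.1"))  # always contains doc 1; may contain others
```

Because the result is a superset of the true matches, the candidate documents still have to be scanned to confirm a hit; the index only narrows down where to look.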
  • 43. Query system: REST API. Leverage Spark APIs: PrunedFilteredScan
      1. Spark SQL analyzes queries
      2. Meta Index is queried with Spark Pushdown values
         a. Apply partition filters to any Spark Pushdown filters
         b. Apply bloom filters to any Spark Pushdown filters
      3. Hand off resulting file list to Spark: FileScanRDD
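Steps 2a, 2b, and 3 on slide 43 amount to pruning the file list before any data is read from S3. The sketch below shows that pruning in plain Python rather than through Spark's `PrunedFilteredScan`; the `FileEntry` structure, the `column=value` key format, the filter width, and the example paths are all hypothetical.

```python
import hashlib
from dataclasses import dataclass, field


def bloom_bits(value, num_bits=256, num_hashes=3):
    """Bit positions a value sets in a Bloom filter (assumed hashing scheme)."""
    return {
        int.from_bytes(hashlib.sha256(f"{i}:{value}".encode()).digest()[:8], "big") % num_bits
        for i in range(num_hashes)
    }


@dataclass
class FileEntry:
    """One meta index row: a data file, its partition values, its Bloom bits."""
    path: str
    partition: dict
    bits: set = field(default_factory=set)

    def index(self, column, value):
        self.bits |= bloom_bits(f"{column}={value}")


def prune_files(entries, partition_filters, equality_filters):
    """Return paths of files that may contain matching rows."""
    survivors = []
    for entry in entries:
        # 2a. Partition pruning: drop files whose partition values differ.
        if any(entry.partition.get(k) != v for k, v in partition_filters.items()):
            continue
        # 2b. Bloom pruning: drop files that are missing any bit of a pushed-down value.
        if any(not bloom_bits(f"{c}={v}") <= entry.bits for c, v in equality_filters.items()):
            continue
        survivors.append(entry.path)
    # 3. The surviving list is what would be handed to the scan.
    return survivors


day1 = FileEntry("s3://example-bucket/day=2018-11-01/a.parquet", {"day": "2018-11-01"})
day1.index("ip", "10.0.0.1")
day2 = FileEntry("s3://example-bucket/day=2018-11-02/b.parquet", {"day": "2018-11-02"})
day2.index("ip", "10.0.0.1")

print(prune_files([day1, day2], {"day": "2018-11-01"}, {"ip": "10.0.0.1"}))
```

Only equality predicates are handled here, since a Bloom filter can answer "might contain this value" but not range questions.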
  • 44. ● Hive/Presto like query system ● Custom metastore with probabilistic meta indexes
  • 45. Outcome ● 95% cost reduction over previous Explorer system ● Minimizes the amount of data pulled from S3 to satisfy queries ● Performance: ○ p95 query times in seconds ○ p95 key-value lookup times are sub-second! ● Scale out is nearly zero effort ○ No storage scale out required ○ Query servers auto-scale according to query load
  • 47. Amazon S3: Cheap; Scalable (• Separate storage from compute • Storage scales with zero effort • Scale compute independently); Reliable; Fast
  • 48. Spark "On-Demand" on Amazon EMR. Scaling: • Inter-cluster • Intra-cluster • Easy auto scaling. Used for: • Ingest • Query
  • 49. The future ● Incorporating S3 Select ○ Decrease query time and cost even more ○ Effort currently underway ● Leveraging Kinesis Data Firehose ○ Simplify ingest pipeline ○ Reduce operational complexity ● Go “Serverless” ○ Meta store ○ Query compute ○ Ingest
  • 50. Lessons learned • Features first, economics later • A hybrid of build and buy can be effective • Open source ecosystem is rich • AWS tools are flexible building blocks • AWS roadmap leads on innovation. High-speed, probabilistic data lakes sourced in Amazon S3 are foundational.
  • 51. Thank you! Gene Stevens @genestevens; Joshua Hollander @jholla14; ProtectWise.com/TestDrive