Best Practices for Building a Data Lake on AWS

AWS Big Data: Your Fast Track to Data Lakes
Presented By: Jay Duff, Big Data Practice Director
jay@reluscloud.com

Big Data Model Maturity Index
Business Monitoring
Business Insights
Business Optimization
Data Monetization
Business Metamorphosis
Inward Focus → External Focus
Schmarzo, Bill. Big Data: Understanding How Data Powers Big Business (Kindle Locations 292-295). Wiley. Kindle Edition.

Seizing the Big Data Opportunity
• Business Monitoring – How DID the business perform?
• Business Insights – Discovering patterns, correlations, influences
• Business Optimization – Embedding algorithms to automatically adjust operations (or the customer experience)
• Data Monetization – Leveraging (or enriching) your data assets (or platform) for new revenue opportunities
• Business Metamorphosis – Ultimate goal of new products in new markets

Business Insights - Discovering
• Data Lake
  • Raw but accessible, unfiltered, largely unstructured
  • Reduced bias
  • More difficult to consume
  • Narrow audience
• Data Warehouse
  • Structured, Optimized
  • Defined Measures & Metrics
  • Easy to consume (e.g. Daily Dashboard)
  • Large audience

Getting Started
Diagram: Ingest, S3, ETL (Extraction, Transformation, Loading), Database, Access
• How to include data
• How to make the data valuable for discovery
• Accessible
• Secure
• Cataloged
• Durable
Well-Architected pillars: Security, Reliability, Performance, Cost Optimization, Operational Excellence
Ref: AWS Well-Architected Framework, https://aws.amazon.com/whitepapers/
"Data is Only as Valuable as the Decisions it Enables" – Ion Stoica, RISE Lab

AWS Athena
• Eliminate ETL
• Eliminate Database
• Query S3 Directly
• Auto Scale
Diagram: S3, ETL (Extraction, Transformation, Loading), Athena Service
Athena is an AWS service, based on Presto, that provides the ability to query data in many formats without a client-managed cluster.

S3 Ingest
• AWS Console
• CLI - $ aws s3 cp …
• SDK – embed into your existing application
• Sqoop (EMR) – RDBMS to S3
• AWS Kinesis Firehose – Streaming data to S3
• AWS Snowball
• AWS Direct Connect
Example data set: Download Airline Data – CSV Format
http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236&DB_Short_Name=On-Time

To Ingest Into S3:
1. Download flight data for 2016: Jan, Feb, Mar, Apr, May, Jun
2. Unzip
3. Give the files unique names (per year & month)
4. Copy to S3 (using the CLI):
$ aws s3 cp . s3://scalawag.data/fl/ --recursive
Result in the AWS Console – S3: 6 uploaded data files
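The renaming step can be sketched in a few lines of Python. This is only an illustration: the bucket and prefix come from the slide, but the `flights_YYYY_MM.csv` naming scheme and the function names are assumptions, not the presenter's actual file names.

```python
BUCKET = "scalawag.data"  # bucket name taken from the slide
PREFIX = "fl"             # key prefix taken from the slide


def unique_name(year: int, month: int) -> str:
    """Build a file name that is unique per year and month (hypothetical scheme)."""
    return f"flights_{year:04d}_{month:02d}.csv"


def s3_uri(year: int, month: int) -> str:
    """Return the S3 URI a monthly file would land at after the copy."""
    return f"s3://{BUCKET}/{PREFIX}/{unique_name(year, month)}"


# January through June 2016, the six months used in the walkthrough.
uris = [s3_uri(2016, month) for month in range(1, 7)]

for uri in uris:
    print(uri)
```

With the files renamed locally this way, the single recursive `aws s3 cp` call copies all six without name collisions.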

Athena: Creating a Table
• Create Database: AWS
• Table: flight
• Data location: s3://scalawag.data/fl/

Building Create Table Statement
Console will guide you through the process
• Select CSV format
• Add columns
• Skip Partitions
• Run Query
Additional Formats:
• Apache Web Log
• CSV, TSV, Delimited Text
• JSON
• Parquet & ORC
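For reference, the statement the console builds for a CSV table looks roughly like the one generated below. This is a sketch under stated assumptions: the database, table, and location come from the slides, while the column names and types are hypothetical placeholders (the slides do not list them), and the helper `create_table_ddl` is illustrative only. Real CSV files with quoted fields may need a different SerDe than plain delimited text.

```python
def create_table_ddl(database, table, columns, location):
    """Assemble a Hive-style CREATE EXTERNAL TABLE statement for comma-delimited text."""
    column_lines = ",\n".join(f"  `{name}` {sql_type}" for name, sql_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {database}.{table} (\n"
        f"{column_lines}\n"
        ")\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
        f"LOCATION '{location}'"
    )


# Hypothetical columns; replace with the fields selected when downloading the data.
columns = [
    ("year", "int"),
    ("month", "int"),
    ("carrier", "string"),
    ("origin", "string"),
    ("dest", "string"),
    ("dep_delay", "double"),
]

ddl = create_table_ddl("aws", "flight", columns, "s3://scalawag.data/fl/")
print(ddl)
```

No `PARTITIONED BY` clause appears because the walkthrough skips partitions.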

Query
• Created the table in Hive Metastore
• CSV
• External
• S3
• Athena – Hosted Cluster
• There is no Database
• Autoscaling
• 2.44 sec (query run time)
• You pay per byte scanned

Pricing
• 3 sec
• 104 MB scanned
• $5 per TB scanned ≈ $0.0005 for this query
• Partitioning by month – scans about 1/6 of the data
• Compressed Columnar Storage – scans about 1/8 of the data
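The arithmetic behind these figures can be worked through as follows. This is a rough sketch: the $5 per TB price, the 104 MB scan, and the 1/6 and 1/8 reductions come from the slide, while the decimal definition of a terabyte is an assumption, and any per-query minimum charge is ignored.

```python
PRICE_PER_TB_USD = 5.0   # price per TB scanned, from the slide
BYTES_PER_TB = 10 ** 12  # assumption: decimal terabyte
BYTES_PER_MB = 10 ** 6   # assumption: decimal megabyte


def query_cost(bytes_scanned: float) -> float:
    """Cost in USD of a query that scans the given number of bytes."""
    return bytes_scanned / BYTES_PER_TB * PRICE_PER_TB_USD


scanned = 104 * BYTES_PER_MB

full_scan = query_cost(scanned)        # whole table, uncompressed CSV
partitioned = query_cost(scanned / 6)  # one month out of six
columnar = query_cost(scanned / 8)     # compressed columnar storage

print(f"full scan:   ${full_scan:.6f}")
print(f"partitioned: ${partitioned:.6f}")
print(f"columnar:    ${columnar:.6f}")
```

The same function scales to larger data sets, where the savings from partitioning and columnar formats become material.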

Simplified Data Lake
• Retrieved External Data
• Uploaded to the Data Lake via Command Line Interface (CLI)
• (One time) Defined the Table
• Queried Data (directly from S3)
• No ETL
• No Cluster
• No Database
• What about more complex use cases?

Common Challenges
• Unstructured Data may lead to:
• Ungoverned Chaos
• Unusable Data
• Disparate & Complex Tools (that are quickly changing)
• Enterprise Wide Collaboration
• Security
• Unified, Consistent
• Common Toolset
• Storage & Compute costs

Complex Storage Requirements – Why S3
• Durable & Available
• High Performance & Scalable
• Easy To Use
• AWS SDKs
• Trigger Events – Notifications & Process Steps
• Integrated
• Encryption – Managed SSE, SSE-C, SSE-KMS
• Policies – Lifecycle, Encryption, Access, Backup
• Native connectivity to EMR, Redshift, DynamoDB, Elasticsearch
• Low Cost & Storage Cost is Decoupled From Compute
• Widely Adopted

Adding ETL Complexity
• Storage Formats
• Parquet, ORC
• Avro
• Partitioning
• By Date
• By Geography
• Deduplication
• Streaming
Diagram: S3, ETL (Extraction, Transformation, Loading) on EMR (Elastic Map Reduce Service)
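Partitioning by date or geography usually shows up as a key layout in S3. The sketch below builds Hive-style `key=value` prefixes, a common convention that Hive-compatible engines understand. The function name, the `region` key, and the choice of year and month as partition keys are assumptions for illustration.

```python
from datetime import date
from typing import Optional


def partition_prefix(base: str, day: date, region: Optional[str] = None) -> str:
    """Return a Hive-style partition prefix such as base/year=2016/month=03/."""
    parts = [f"year={day.year:04d}", f"month={day.month:02d}"]
    if region is not None:
        # Optional geographic partition key (hypothetical name).
        parts.append(f"region={region}")
    return "/".join([base.rstrip("/")] + parts) + "/"


by_date = partition_prefix("s3://scalawag.data/fl", date(2016, 3, 15))
by_date_and_geo = partition_prefix("s3://scalawag.data/fl/", date(2016, 3, 15), "US")

print(by_date)
print(by_date_and_geo)
```

An ETL job would write each output file under the prefix matching its records, so a query filtered on month reads only that prefix.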

AWS Elastic Map Reduce (EMR)
• EMR is a managed cluster running frameworks such as Apache Hadoop or Spark
• Hive – SQL based, Batch Oriented
• Query (not nearly as fast as Athena)
• Basic ETL operations
• Easy
• Spark
• Complex ETL
• Machine Learning
• Graph DB functionality
• SparkR
Figure: example relationship graph (people: Jay, Rebecca, Marissa, Alex; relationships: Friend, Employer, Neighbor; use: Marketing Promotion)
Business Insights -> Discovery & Prediction

Additional Database Options
Diagram: S3 and EMR-based ETL (Extraction, Transformation, Loading) alongside these database options:
• Amazon DynamoDB
• Elasticsearch
• Amazon Aurora
• Amazon Redshift
• Amazon Athena / Presto

Cataloging The Data Lake
Make Your Metadata Available
• Versions
• Content
• Schema
• Names
• Layout
• Enumeration
• Origins
Services shown: AWS Elasticsearch, AWS Relational Database Service

End User Tools
• RDS, Athena & Redshift Connectivity
• JDBC
• ODBC
• Commercially Available Tools – Amazon Marketplace
• AWS QuickSight
• Easy to Integrate
• Per User Per Month Pricing
• Super Fast, Parallel, In-memory Calculation Engine

Expense
• S3 – 100 TB, 1 year, $27K
• Athena ($5/TB scanned)
• Parquet & ORC – provide compression
• And columnar retrieval (only the needed columns)
• Redshift (Storage or Compute Oriented Nodes), 5 TB
• Assuming 20% compression
• Storage Intensive - $15K/year
• Compute Intensive - $50-85K/year
• Spark Cluster (assume 6 hr/day, 6 R3.4xLarge worker nodes)
• $25K/year
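The yearly totals above can be turned into unit rates for comparison. This sketch only rearranges the slide's own numbers. The decimal conversion (1 TB = 1,000 GB) and the 365-day year are assumptions, and the resulting rates are implied by the slide, not quoted AWS prices.

```python
GB_PER_TB = 1000      # assumption: decimal units
MONTHS_PER_YEAR = 12
DAYS_PER_YEAR = 365   # assumption

# S3: 100 TB stored for one year at $27K (slide figures).
s3_tb, s3_yearly_usd = 100, 27_000
s3_per_gb_month = s3_yearly_usd / (s3_tb * GB_PER_TB * MONTHS_PER_YEAR)

# Spark cluster: 6 worker nodes, 6 hours per day, $25K per year (slide figures).
workers, hours_per_day, spark_yearly_usd = 6, 6, 25_000
node_hours = workers * hours_per_day * DAYS_PER_YEAR
spark_per_node_hour = spark_yearly_usd / node_hours

print(f"S3 implied rate:    ${s3_per_gb_month:.4f} per GB-month")
print(f"Spark node-hours:   {node_hours} per year")
print(f"Spark implied rate: ${spark_per_node_hour:.2f} per node-hour")
```

Expressed this way, it is easier to see that storage is cheap relative to compute, which is why running clusters only when needed matters.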

Managing Costs
• S3 Storage (Infrequent Access, Reduced Redundancy)
• Storage Formats: Compression, Columnar Storage, Partitioning
• Complementary Use
• Athena & Redshift
• Spark & Redshift
• Kinesis Analytics
• Lambda: Serverless Tasks
• Reserved and Spot Instances
• Automated Processes & Transitory Resources

Amazon Builder’s Template
https://aws.amazon.com/answers/big-data/data-lake-solution/
• Better Access Control Policies
• Searchable Data Catalog
• API Access
• User Console
• Monitoring
Ready to deploy template

Summary
• S3
• Provides Excellent Storage Versatility
• Excellent For Data Lake Storage
• Athena Provides Quick Start
• Easy To Manage - No Server
• Cost Effective
• AWS Ecosystem Supports More Complex Solutions
• Integrated Authentication & Security
• Real Time Catalog Updates
