Organizations today operate in a data-rich world and increasingly need agile, accessible data for analysis and for deriving the insights that drive strategic decisions. A data lake helps you manage the disparate sources of data you collect, in their original format, and extract value from them. In this session, learn how to architect and implement an analytics data lake. Hear customer examples of best practices and learn from their architectural blueprints.
Best Practices for Building a Data Lake on AWS
1. AWS Big Data: Your Fast Track to Data Lakes
Presented By:
Jay Duff
jay@reluscloud.com
Big Data Practice Director
2. Big Data Model Maturity Index
Business Monitoring
Business Insights
Business Optimization
Data Monetization
Business Metamorphosis
Schmarzo, Bill. Big Data: Understanding How Data Powers Big Business (Kindle Locations 292-295). Wiley. Kindle Edition.
[Figure axis: Inward Focus to External Focus]
3. Seizing the Big Data Opportunity
• Business Monitoring – How DID the business perform?
• Business Insights – Discovering patterns, correlations, influences
• Business Optimization – Embedding algorithms to automatically adjust operations (or the customer experience)
• Data Monetization – Leveraging (or enriching) your data assets (or platform) for new revenue opportunities
• Business Metamorphosis – Ultimate goal of new products in new markets
4. Business Insights - Discovering
• Data Lake
• Raw but accessible, unfiltered, largely unstructured
• Reduced bias
• More difficult to consume
• Narrow audience
• Data Warehouse
• Structured, Optimized
• Defined Measures & Metrics
• Easy to consume (e.g. Daily Dashboard)
• Large audience
5. Getting Started
[Diagram labels: Ingest, S3, ETL (Extraction, Transformation, Loading), Database, Access]
How to include data
How to make the data valuable for discovery
Accessible
Secure
Cataloged
Durable
Security
Reliability
Performance
Cost Optimization
Operational Excellence
Ref: AWS Well-Architected Framework, https://aws.amazon.com/whitepapers/
"Data is Only as Valuable as the Decisions it Enables"
– Ion Stoica, RISE Lab
6. AWS Athena
• Eliminate ETL
• Eliminate Database
• Query S3 Directly
• Auto Scale
[Diagram labels: Ingest, S3, ETL (Extraction, Transformation, Loading), Database, Access]
Athena Service: an AWS service, based on Presto, that can query data in many formats without a customer-managed cluster
7. S3 Ingest
• AWS Console
• CLI - $ aws s3 cp …
• SDK – embed into your existing application
• Sqoop (EMR) – RDBMS to S3
• AWS Kinesis Firehose – Streaming data to S3
• AWS Snowball
• AWS Direct Connect
[Diagram labels: Ingest, S3]
http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236&DB_Short_Name=On-Time
Download Airline Data – CSV Format
8. AWS Console – S3, upload 6 data files
To Ingest Into S3:
$ aws s3 cp . s3://scalawag.data/fl/ --recursive
1. Download flight data for 2016: Jan, Feb, Mar, Apr, May, Jun
2. Unzip
3. Give each file a unique name (per year & month)
4. Copy to S3 (using CLI)
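Step 3 above can be sketched as a small naming helper. This is only an illustration: the bucket and prefix come from the slide, but the `flights_YYYY_MM.csv` pattern and the function names are hypothetical, and the sketch only plans the destination names rather than performing the upload.

```python
BUCKET = "scalawag.data"  # bucket name taken from the slide
PREFIX = "fl"             # key prefix taken from the slide


def s3_key_for(year, month, prefix=PREFIX):
    """Return a unique object key for one month of flight data."""
    if not 1 <= month <= 12:
        raise ValueError(f"month out of range: {month}")
    # Hypothetical naming scheme: one file per year and month.
    return f"{prefix}/flights_{year:04d}_{month:02d}.csv"


def plan_uploads(year, months):
    """List the S3 URIs that the monthly files would be copied to."""
    return [f"s3://{BUCKET}/{s3_key_for(year, m)}" for m in months]


# January through June 2016, as in the slide.
uris = plan_uploads(2016, range(1, 7))
for uri in uris:
    print(uri)
```

Once the local files carry these names, the recursive `aws s3 cp` command shown on the slide copies them all in one pass.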
9. Athena: Creating a Table
• Create Database: AWS
• Table: flight
• Data location: s3://scalawag.data/fl/
10. Building Create Table Statement
Console will guide you through the process
• Select CSV format
• Add columns
• Skip Partitions
• Run Query
Additional Formats:
• Apache Web Log
• CSV, TSV, Delimited Text
• JSON
• Parquet & ORC
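The statement the console produces might look roughly like the one generated below. This is a sketch under assumptions: the database, table, and location come from the slides, but the column names and types are hypothetical, and plain delimited parsing does not handle quoted fields that contain commas.

```python
def build_create_table(database, table, columns, location, skip_header=True):
    """Build a CREATE EXTERNAL TABLE statement for comma-delimited text in S3."""
    cols = ",\n".join(f"  `{name}` {dtype}" for name, dtype in columns)
    ddl = (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {database}.{table} (\n"
        f"{cols}\n"
        ")\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
        f"LOCATION '{location}'"
    )
    if skip_header:
        # Ignore the CSV header row when querying.
        ddl += "\nTBLPROPERTIES ('skip.header.line.count'='1')"
    return ddl


# Hypothetical subset of columns; the real file has many more fields.
columns = [
    ("year", "int"),
    ("month", "int"),
    ("carrier", "string"),
    ("origin", "string"),
    ("dest", "string"),
    ("dep_delay", "double"),
]

ddl = build_create_table("aws", "flight", columns, "s3://scalawag.data/fl/")
print(ddl)
```

Because the table is external, dropping it later removes only the definition and leaves the files in S3 untouched.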
11. Query
• Created the table in Hive Metastore
• CSV
• External
• S3
• Athena – Hosted Cluster
• There is no Database
• Autoscaling
• 2.44 sec
• You pay per byte scanned
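Pay-per-scan pricing can be estimated with simple arithmetic. The figures below are assumptions for illustration (a price of 5 USD per TB scanned and a 10 MB minimum per query, with binary units); check current Athena pricing before relying on them.

```python
PRICE_PER_TB_USD = 5.00             # assumption: price per TB scanned
MIN_BYTES_PER_QUERY = 10 * 1024**2  # assumption: 10 MB minimum charge per query
BYTES_PER_TB = 1024**4              # assumption: binary terabyte


def estimate_query_cost(bytes_scanned):
    """Estimate the cost in USD of one query from the bytes it scans."""
    billable = max(bytes_scanned, MIN_BYTES_PER_QUERY)
    return billable / BYTES_PER_TB * PRICE_PER_TB_USD


# Scanning 1 GiB of CSV versus a hypothetical 100 MiB columnar copy of the same data.
csv_cost = estimate_query_cost(1024**3)
columnar_cost = estimate_query_cost(100 * 1024**2)
print(f"CSV scan:      ${csv_cost:.6f}")
print(f"Columnar scan: ${columnar_cost:.6f}")
```

The comparison shows why compressed, columnar formats and partitioning matter: fewer bytes scanned means a lower bill per query.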
13. Simplified Data Lake
• Retrieved External Data
• Uploaded to the Data Lake via Command Line Interface (CLI)
• (One time) Defined the Table
• Queried Data (directly from S3)
• No ETL
• No Cluster
• No Database
• What about more complex use cases?
14. Common Challenges
• Unstructured Data may lead to:
• Ungoverned Chaos
• Unusable Data
• Disparate & Complex Tools (that are quickly changing)
• Enterprise Wide Collaboration
• Security
• Unified, Consistent
• Common Toolset
• Storage & Compute costs
15. Complex Storage Requirements – Why S3
• Durable & Available
• High Performance & Scalable
• Easy To Use
• AWS SDKs
• Trigger Events – Notifications & Process Steps
• Integrated
• Encryption – SSE-S3 (S3-managed keys), SSE-C, SSE-KMS
• Policies – Lifecycle, Encryption, Access, Backup
• Native connectivity to EMR, Redshift, DynamoDB, Elasticsearch
• Low Cost & Storage Cost is Decoupled From Compute
• Widely Adopted
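As one illustration of the lifecycle policies mentioned above, the sketch below builds a rule in the dictionary shape that the S3 lifecycle configuration API accepts. The rule ID, the day thresholds, and the choice of storage classes are hypothetical; the sketch only constructs the configuration and does not apply it to a bucket.

```python
import json


def lifecycle_rule(prefix, ia_after_days=30, glacier_after_days=365):
    """Build one lifecycle rule that tiers objects under a prefix to cheaper storage."""
    # Assumption: S3 expects at least 30 days before a STANDARD_IA transition.
    if not 30 <= ia_after_days < glacier_after_days:
        raise ValueError("expected 30 <= ia_after_days < glacier_after_days")
    return {
        "ID": f"tier-{prefix.strip('/')}",  # hypothetical rule name
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [
            {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
            {"Days": glacier_after_days, "StorageClass": "GLACIER"},
        ],
    }


config = {"Rules": [lifecycle_rule("fl/")]}
print(json.dumps(config, indent=2))
```

A dictionary of this shape could be passed to an SDK call such as boto3's `put_bucket_lifecycle_configuration`, which keeps storage cost management declarative rather than scripted.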
16. Adding ETL Complexity
• Storage Formats
• Parquet, ORC
• Avro
• Partitioning
• By Date
• By Geography
• Deduplication
• Streaming
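Partitioning by date or geography is commonly expressed through Hive-style `key=value` path segments. The sketch below shows one possible layout; the partition key names (`year`, `month`, `state`) and the helper function are hypothetical choices, not something the slides prescribe.

```python
from datetime import date


def partition_prefix(base, flight_date, origin_state=None):
    """Return a Hive-style partitioned S3 prefix for one record's partition."""
    parts = [
        base.strip("/"),
        f"year={flight_date.year:04d}",
        f"month={flight_date.month:02d}",
    ]
    if origin_state:
        # Optional geographic partition below the date partitions.
        parts.append(f"state={origin_state}")
    return "/".join(parts) + "/"


by_date = partition_prefix("fl", date(2016, 3, 15))
by_date_and_geo = partition_prefix("fl", date(2016, 3, 15), "GA")
print(by_date)
print(by_date_and_geo)
```

With a layout like this, a query that filters on year and month only needs to read the matching prefixes instead of the whole data set.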
[Diagram labels: Ingest, S3, EMR (Elastic MapReduce Service), ETL (Extraction, Transformation, Loading), Database, Access]
17. AWS Elastic MapReduce (EMR)
• EMR is a managed cluster running frameworks such as Apache Hadoop or Spark
• Hive – SQL based, Batch Oriented
• Query (not nearly as fast as Athena)
• Basic ETL operations
• Easy
• Spark
• Complex ETL
• Machine Learning
• Graph DB functionality
• SparkR
[Figure: relationship graph of people (Jay, Rebecca, Marissa, Alex) connected as Friend, Employer, and Neighbor, with Marketing and Promotion labels]
Business Insights -> Discovery & Prediction
19. Cataloging The Data Lake
Make Your Metadata Available
• Versions
• Content
• Schema
• Names
• Layout
• Enumeration
• Origins
AWS Elasticsearch
AWS Relational Database Service
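The metadata items listed on this slide could be captured in a small catalog record that is then indexed in a search service or stored in a relational database. The field names and sample values below are hypothetical, chosen only to mirror the bullets; this is a sketch of the idea rather than a catalog design.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class CatalogEntry:
    """One data set's metadata, mirroring the items listed on the slide."""
    name: str
    location: str     # layout: where the objects live
    data_format: str  # content: how the objects are encoded
    version: int
    origin: str       # where the data came from
    schema: dict = field(default_factory=dict)

    def to_document(self):
        """Serialize the entry to a JSON document suitable for indexing."""
        return json.dumps(asdict(self), sort_keys=True)


entry = CatalogEntry(
    name="flight",
    location="s3://scalawag.data/fl/",
    data_format="csv",
    version=1,
    origin="transtats.bts.gov",
    schema={"year": "int", "month": "int", "carrier": "string"},  # hypothetical columns
)
print(entry.to_document())
```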
20. End User Tools
• RDS, Athena & Redshift Connectivity
• JDBC
• ODBC
• Commercially Available Tools – Amazon Marketplace
• AWS QuickSight
• Easy to Integrate
• Per User Per Month Pricing
• Super Fast, Parallel, In-Memory Calculation Engine