Workshop: Architecting a Serverless Data Lake

In this workshop, learn how to create a serverless data lake architecture. Understand how to ingest data at scale from multiple data sources, how to transform the data, and how to catalog it to make it available for querying using a variety of tools. Also learn how to set up governance and data quality controls.

Speakers:
Rajanikanth Bhargava Chilakapati - Solutions Architect, AWS
Karl Hart - Solutions Architect, AWS
John Pignata - Startup Solutions Architect, AWS

  1. Serverless Data Lake Workshop (ARC302)
     Raj Chilakapati, Solutions Architect, Amazon Web Services
     © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
  2. Agenda
     • Development Environment Setup
     • Review Data Lake Architecture
     • Why Serverless?
     • Glue Extract Transform Load (ETL)
     • Data Governance
     • Bonus Content
  3. Scenario
     You support a successful online ecommerce website with millions of users. The website tracks end-user activity and online buying habits.
     Requirements
     • Build a cost-effective solution that provides a unified analytics environment.
     • Support ad hoc queries against the data.
     • Use Business Intelligence tools, with the end goal of helping business teams make their marketing campaigns more efficient.
     Constraints to keep in mind
     • Do not lose focus on data quality and governance controls.
     Data sources include weblogs, NoSQL databases, and other data sources.
  4. re:Invent workshop summary
     • Ingest data from various data sources and join them together
     • Enrich raw data
     • Convert data to Parquet for efficient querying
     • Grant access to roles based on the data classification
     • SQL access for data scientists
     • Data visualization with charts and graphs
     • Data lineage
  6. Requirements
     1. Your own device for console access
     2. An AWS account that you can use for testing (it should not be used for production or other purposes)
     3. The Lab Guide, downloaded from http://bit.do/nov28nyloftworkshop
  7. Development Environment
     Your Cloud Engineering team has deployed a development environment for you.
     Diagram labels:
     • Ingestion / Data Generation
     • Kinesis / Log Data
     • Data Generation Lambda Functions
     • Amazon S3 Buckets
     • Amazon DynamoDB
     • AWS Glue Management Console / Development Endpoint
     • Amazon Athena
     • Amazon QuickSight
  8. Deploy the lab environment
     1. Deploy the Lab CloudFormation template from http://bit.do/nov28datalaketemplate
     2. Examine the environment in CloudFormation Designer
     3. Deploy your stack
     Diagram labels: Template, Stack
  10. High-Level Architecture
  11. Amazon Kinesis Data Firehose
      • Serverless, easy to use
      • Seamless integration with AWS data stores
      • Support for serverless transformation
      • Near real-time ingestion
      • Pay only for what you use
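The "serverless transformation" bullet refers to Firehose invoking a Lambda function on each batch of records before delivery. The following is a minimal sketch of such a function, assuming the incoming records are JSON weblog entries; the `page` field and the lower-casing step are hypothetical enrichment chosen only for illustration, not the workshop's actual transformation. The request and response shapes (`records`, `recordId`, `result`, base64-encoded `data`) follow the Firehose data transformation contract.

```python
import base64
import json


def lambda_handler(event, context):
    """Transform each Firehose record and return it in the shape Firehose expects."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            if not isinstance(payload, dict):
                raise ValueError("expected a JSON object")
            # Hypothetical enrichment: normalize one field of the log entry.
            payload["page"] = str(payload.get("page", "")).lower()
            # Newline-delimit the output so downstream readers see one JSON object per line.
            encoded = base64.b64encode((json.dumps(payload) + "\n").encode("utf-8"))
            output.append({
                "recordId": record["recordId"],
                "result": "Ok",
                "data": encoded.decode("utf-8"),
            })
        except ValueError:
            # Hand unparseable records back unchanged, flagged as failed.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
    return {"records": output}
```

Every input `recordId` is echoed back exactly once, which is what lets Firehose match transformed records to the originals.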
  12. Amazon Simple Storage Service (S3)
      • Object store
      • Highly durable
      • Limitless scalability
      • Pay for what you use
      • Comprehensive security and compliance capabilities
      • Support for query in place
  13. AWS Glue
      • Serverless ETL
      • Universal Data Catalog
      • Open source Apache Spark environment
      • DynamicFrame: built-in functions
      • Seamless integration with AWS services
      • Support for on-premises data stores
  14. Amazon Athena
      • Serverless interactive query service
      • Integrated with AWS Glue Data Catalog
      • Open source, built on Presto, query with standard SQL
      • Pay per query
      • Support for standard formats like CSV, JSON, ORC, Avro, and Parquet
      • Fast parallel query execution
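Because Athena is a query service rather than a cluster, running SQL against it programmatically comes down to starting a query and polling for its state. The sketch below assumes the `athena` argument is a client such as the one returned by `boto3.client("athena")`; it is passed in rather than created here so the block carries no AWS dependency. The function name, database name, and S3 output location are hypothetical.

```python
import time


def run_athena_query(athena, sql, database, output_location, poll_seconds=1.0):
    """Start a query and block until Athena reports a terminal state."""
    started = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        # Athena writes result files to this S3 location (hypothetical here).
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]
    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return query_id, state
        time.sleep(poll_seconds)
```

A call might look like `run_athena_query(client, "SELECT * FROM weblogs LIMIT 10", "datalake", "s3://example-bucket/athena-results/")`, where the table, database, and bucket are placeholders.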
  15. Amazon QuickSight
      • Serverless, end-to-end BI solution
      • Built-in SPICE engine
      • Smart visualizations
      • Seamless integration with AWS services
      • On-premises database support
      • Pay only for what you use
      • Multiple device support
      • Share and collaborate
  17. Data Classification and Security
      • Grant S3 access by role to bucket / prefix
      • Approaches to segment data
      • Multiple copies of the data in different buckets
      • Grant access to roles to buckets
      • Tokenization, join to tokenized tables, and views to resolve them
      Diagram labels: Bucket with objects, Role, Permissions
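Granting S3 access by role to a bucket or prefix is typically done with an IAM policy attached to the role. The sketch below builds one such read-only policy document; the helper name, the bucket `example-datalake`, and the prefix `public/` are hypothetical, and a real deployment would likely need further statements (for example KMS permissions if the objects are encrypted).

```python
import json


def prefix_read_policy(bucket: str, prefix: str) -> dict:
    """Build an IAM policy document allowing read access to one prefix of a bucket."""
    prefix = prefix.strip("/")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is a bucket-level action, so it is narrowed with a condition.
                "Sid": "ListOnlyThisPrefix",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
            {
                # Reading is an object-level action, so it is narrowed by resource ARN.
                "Sid": "ReadObjectsUnderPrefix",
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            },
        ],
    }


policy = prefix_read_policy("example-datalake", "public/")
print(json.dumps(policy, indent=2))
```

One policy per data classification, each attached to a different role, gives the bucket/prefix segmentation the slide describes.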
  18. Duplication

      UserProfile
      ID  First  Last
      1   Sam    Smith
      2   Jane   Jones

      UserProfileSecure
      ID  First  Last   SSN
      1   Sam    Smith  111-11-1111
      2   Jane   Jones  222-22-2222
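The duplication approach keeps two copies of the data: a full copy for privileged roles and a reduced copy without the sensitive column. A minimal sketch of deriving the reduced copy is shown here with plain Python dictionaries standing in for the rows; the function name and the idea of a `SENSITIVE_COLUMNS` set are assumptions for illustration, and in the workshop this step would more plausibly run as an ETL job writing to a separate bucket.

```python
# Columns that must not appear in the general-access copy (assumed for illustration).
SENSITIVE_COLUMNS = {"SSN"}

user_profile_secure = [
    {"ID": 1, "First": "Sam", "Last": "Smith", "SSN": "111-11-1111"},
    {"ID": 2, "First": "Jane", "Last": "Jones", "SSN": "222-22-2222"},
]


def strip_sensitive(rows, sensitive=SENSITIVE_COLUMNS):
    """Return a copy of the rows with the sensitive columns removed."""
    return [{k: v for k, v in row.items() if k not in sensitive} for row in rows]


# The reduced copy corresponds to the UserProfile table on the slide.
user_profile = strip_sensitive(user_profile_secure)
```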
  20. Tokenization

      UserProfile
      ID  First  Last   SSN_Token
      1   Sam    Smith  8c9d409dcc43
      2   Jane   Jones  06a38ea94e69

      SSN_Tokens
      Token         SSN
      8c9d409dcc43  111-11-1111
      06a38ea94e69  222-22-2222
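Tokenization replaces the sensitive value with an opaque token and keeps the token-to-value mapping in a separately secured table. The slides do not say how the tokens were produced; the sketch below assumes random 12-character hex tokens (matching the length shown on the slide) generated with Python's `secrets` module. The function name and return shape are hypothetical.

```python
import secrets


def tokenize(rows, column, token_bytes=6):
    """Replace `column` with `<column>_Token`; return (tokenized rows, token -> value map)."""
    vault = {}  # token -> original value; store this in the restricted location
    seen = {}   # original value -> token, so repeated values share one token
    tokenized = []
    for row in rows:
        value = row[column]
        if value not in seen:
            token = secrets.token_hex(token_bytes)  # 6 bytes -> 12 hex characters
            while token in vault:  # regenerate on the (very unlikely) collision
                token = secrets.token_hex(token_bytes)
            seen[value] = token
            vault[token] = value
        out = {k: v for k, v in row.items() if k != column}
        out[f"{column}_Token"] = seen[value]
        tokenized.append(out)
    return tokenized, vault


rows = [
    {"ID": 1, "First": "Sam", "Last": "Smith", "SSN": "111-11-1111"},
    {"ID": 2, "First": "Jane", "Last": "Jones", "SSN": "222-22-2222"},
]
user_profile, ssn_tokens = tokenize(rows, "SSN")
```

Random tokens carry no information about the original value, so the tokenized table can be shared more widely than the token table.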
  21. Tokenization

      ProfileView
      ID  First  Last
      1   Sam    Smith
      2   Jane   Jones

      ProfileSecureView
      ID  First  Last   SSN
      1   Sam    Smith  111-11-1111
      2   Jane   Jones  222-22-2222
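The two views resolve the tokenized tables for different audiences: one hides the token entirely, the other joins to the token table to recover the SSN. The sketch below reproduces that idea with Python's built-in `sqlite3` module as a local stand-in; the view definitions are an assumption about how the slide's tables relate, and the workshop itself would define such views in a query engine such as Athena rather than SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE UserProfile (ID INTEGER, "First" TEXT, "Last" TEXT, SSN_Token TEXT);
CREATE TABLE SSN_Tokens (Token TEXT, SSN TEXT);

INSERT INTO UserProfile VALUES (1, 'Sam', 'Smith', '8c9d409dcc43');
INSERT INTO UserProfile VALUES (2, 'Jane', 'Jones', '06a38ea94e69');
INSERT INTO SSN_Tokens VALUES ('8c9d409dcc43', '111-11-1111');
INSERT INTO SSN_Tokens VALUES ('06a38ea94e69', '222-22-2222');

-- General-access view: the token column is simply left out.
CREATE VIEW ProfileView AS
    SELECT ID, "First", "Last" FROM UserProfile;

-- Restricted view: the join resolves each token back to its SSN.
CREATE VIEW ProfileSecureView AS
    SELECT p.ID, p."First", p."Last", t.SSN
    FROM UserProfile AS p
    JOIN SSN_Tokens AS t ON t.Token = p.SSN_Token;
""")

profile_rows = conn.execute("SELECT * FROM ProfileView ORDER BY ID").fetchall()
secure_rows = conn.execute("SELECT * FROM ProfileSecureView ORDER BY ID").fetchall()
```

Only roles allowed to read `SSN_Tokens` can get results from the restricted view, since the join needs that table.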
  22. Redshift Spectrum

      UserProfileSecure
      ID  First  Last   SSN
      1   Sam    Smith  111-11-1111
      2   Jane   Jones  222-22-2222
  24. Bonus Content
      • AWS Glue Development Endpoints: Apache Zeppelin notebook
      • Amazon Redshift/Spectrum integration
  25. Thank you!
      Raj Chilakapati, chilakap@amazon.com
