SUMMIT
Amsterdam
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Modern Data Platform on AWS
Damon Cortesi
Big Data Architect - AWS
@dacort
ANT001
David Morel
Takeaway.com
A brief history of significant Big Data releases
2004: Google publishes MapReduce paper
2006: Hadoop is created; HBase development starts
2008: Facebook launches Hive
2009: AWS EMR announced
2012: Facebook launches Presto; Apache Spark released
2015: MXNet paper published
2016: Amazon Athena & AWS Glue announced
There is more data than people think
Data grows >10x every 5 years
Data platforms need to live for 15 years
Data platforms need to scale 1,000x
There are more people accessing data
Data Scientists, Analysts, Business Users, Applications
And more requirements for making data available
Secure, Real time, Flexible, Scalable
AWS databases and analytics
Broad and deep portfolio, built for builders
Analytics
Amazon Redshift: Data warehousing
Amazon EMR: Hadoop + Spark
Athena: Interactive analytics
Kinesis Analytics: Real-time
Amazon Elasticsearch Service: Operational analytics
Databases
RDS: MySQL, PostgreSQL, MariaDB, Oracle, SQL Server
Aurora: MySQL, PostgreSQL
DynamoDB: Key value, Document
ElastiCache: Redis, Memcached
Neptune: Graph
Timestream: Time Series
QLDB: Ledger Database
RDS on VMware
Business Intelligence & Machine Learning
Amazon QuickSight
Amazon SageMaker
Amazon Comprehend
Amazon Rekognition
Amazon Lex
Amazon Transcribe
AWS DeepLens
Blockchain
Managed Blockchain
Blockchain Templates
Data Lake
S3/Amazon Glacier
AWS Glue: ETL & Data Catalog
Lake Formation: Data Lakes
Data Movement
Database Migration Service | Snowball | Snowmobile | Kinesis Data Firehose | Kinesis Data Streams | Data Pipeline | Direct Connect
AWS Marketplace
250+ solutions
730+ Database solutions
600+ Analytics solutions
25+ Blockchain solutions
20+ Data lake solutions
30+ solutions
A data lake is a centralized repository that allows
you to store all your structured and unstructured
data at any scale
Data lake with AWS Glue
Amazon S3
(Raw data)
Amazon S3
(Staging
data)
Amazon S3
(Processed data)
AWS Glue Data Catalog
Crawlers Crawlers Crawlers
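The three-zone layout above can be sketched as one crawler per zone, all registering tables in the same Data Catalog database. This is a minimal sketch, not the speakers' setup: it assumes the zones are prefixes (`raw/`, `staging/`, `processed/`) in a single bucket, and the bucket, database, and role names are hypothetical. The function only builds the keyword arguments that boto3's Glue `create_crawler` call accepts; it does not call AWS.

```python
from typing import Dict, List

# Assumed zone names, taken from the diagram labels (raw, staging, processed).
ZONES = ("raw", "staging", "processed")


def crawler_requests(bucket: str, database: str, role_arn: str) -> List[Dict]:
    """Build one create_crawler request per data lake zone."""
    requests = []
    for zone in ZONES:
        requests.append({
            "Name": f"{database}-{zone}-crawler",   # hypothetical naming scheme
            "Role": role_arn,                        # IAM role the crawler assumes
            "DatabaseName": database,                # Data Catalog database to populate
            "Targets": {"S3Targets": [{"Path": f"s3://{bucket}/{zone}/"}]},
            "TablePrefix": f"{zone}_",               # keeps tables from each zone apart
        })
    return requests


if __name__ == "__main__":
    for request in crawler_requests(
        "my-data-lake", "lake_db", "arn:aws:iam::123456789012:role/GlueCrawlerRole"
    ):
        print(request["Name"], "->", request["Targets"]["S3Targets"][0]["Path"])
```

Each dictionary could then be passed as `glue.create_crawler(**request)` with a boto3 Glue client.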
Amazon S3—Object Storage
Security and
Compliance
Three different forms of
encryption; encrypts data
in transit when
replicating across regions;
log and monitor with
CloudTrail, use ML to
discover and protect
sensitive data with Macie
Flexible Management
Classify, report, and
visualize data usage
trends; objects can be
tagged to see storage
consumption, cost, and
security; build lifecycle
policies to automate
tiering, and retention
Durability, Availability
& Scalability
Built for eleven nines of
durability; data
distributed across 3
physical facilities in an
AWS Region; can be
automatically replicated
to any other AWS Region
Query in Place
Run analytics & ML on
the data lake without data
movement; S3 Select can
retrieve a subset of data,
improving analytics
performance by up to 400%
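The "lifecycle policies to automate tiering and retention" point can be made concrete with a small sketch. It builds a rule in the shape boto3's S3 `put_bucket_lifecycle_configuration` expects for its `LifecycleConfiguration` argument. The prefix and day counts are assumptions chosen for illustration, and nothing here calls AWS.

```python
def lifecycle_configuration(prefix: str, glacier_after_days: int, expire_after_days: int) -> dict:
    """Build a lifecycle rule: move objects to Glacier, then expire them."""
    if expire_after_days <= glacier_after_days:
        raise ValueError("objects must expire after they are tiered")
    rule_id = "tier-and-expire-" + prefix.strip("/").replace("/", "-")
    return {
        "Rules": [{
            "ID": rule_id,                      # hypothetical rule name
            "Filter": {"Prefix": prefix},       # only objects under this prefix
            "Status": "Enabled",
            "Transitions": [{"Days": glacier_after_days, "StorageClass": "GLACIER"}],
            "Expiration": {"Days": expire_after_days},
        }]
    }


if __name__ == "__main__":
    # Assumed policy: tier raw data after 90 days, delete it after a year.
    print(lifecycle_configuration("raw/", 90, 365))
```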
Data Movement From Real-time Sources
Amazon Kinesis
Video Streams
Securely stream video
from connected devices
to AWS for analytics,
machine learning (ML),
and other processing
Amazon Kinesis Data
Firehose
Capture, transform, and
load data streams into
AWS data stores for near
real-time analytics with
existing business
intelligence tools.
Amazon Kinesis Data
Streams
Build custom, real-time
applications that process
data streams using
popular stream
processing frameworks
AWS IoT Core
Supports billions of
devices and trillions of
messages, and can
process and route those
messages to AWS
endpoints and to other
devices reliably and
securely
Managed Streaming
For Kafka
Fully managed open-
source platform for
building real-time
streaming data pipelines
and applications.
Amazon Kinesis Data Streams
Amazon Kinesis Data Firehose
Prefix: raw/life/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/
Buffer: Up to 128MB or 15 minutes
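To see what the prefix above produces, here is a small local sketch that expands the `!{timestamp:...}` expressions for a given arrival time. It is only an approximation of what Firehose does: it handles the four patterns listed in `_PATTERNS`, whereas the service accepts full Java date-time format strings, and `render_prefix` is a hypothetical helper, not part of any AWS SDK.

```python
import re
from datetime import datetime, timezone

# Subset of Java date-time patterns mapped to strftime codes (assumption: only these four).
_PATTERNS = {"yyyy": "%Y", "MM": "%m", "dd": "%d", "HH": "%H"}
_EXPRESSION = re.compile(r"!\{timestamp:([^}]+)\}")


def render_prefix(prefix: str, arrival: datetime) -> str:
    """Expand !{timestamp:...} expressions using the record arrival time in UTC."""
    utc_arrival = arrival.astimezone(timezone.utc)

    def substitute(match: "re.Match") -> str:
        pattern = match.group(1)
        if pattern not in _PATTERNS:
            raise ValueError(f"unsupported pattern in this sketch: {pattern}")
        return utc_arrival.strftime(_PATTERNS[pattern])

    return _EXPRESSION.sub(substitute, prefix)


PREFIX = "raw/life/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/"

if __name__ == "__main__":
    print(render_prefix(PREFIX, datetime(2019, 2, 12, 9, 30, tzinfo=timezone.utc)))
    # raw/life/year=2019/month=02/day=12/
```

The `year=/month=/day=` layout is the Hive-style partitioning that crawlers and Athena can pick up as partition columns.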
Kinesis events to S3
Kinesis Data
Streams
Kinesis Data
Firehose
Save as Parquet
Lambda
Transformation
Aggregated
JSON Data
Clients
Aggregated
Parquet Data
Source backup
New! as of 12th Feb
• Support for custom S3 prefix
Amazon Athena
Crawlers
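The "Lambda Transformation" step in the flow above can be sketched as a handler that follows the Firehose data-transformation contract: each incoming record carries a `recordId` and base64-encoded `data`, and each must be returned with a `result` of `Ok`, `Dropped`, or `ProcessingFailed`. The actual transformation used in the talk is not shown on the slide, so the `processed` flag added here is purely a placeholder.

```python
import base64
import json


def handler(event, context):
    """Decode each Firehose record, tag it, and hand it back re-encoded."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            if not isinstance(payload, dict):
                raise ValueError("expected a JSON object")
        except ValueError:
            # Undecodable records are returned unchanged and marked as failed.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
            continue
        payload["processed"] = True  # hypothetical enrichment, not from the talk
        encoded = base64.b64encode(json.dumps(payload).encode("utf-8")).decode("utf-8")
        output.append({"recordId": record["recordId"], "result": "Ok", "data": encoded})
    return {"records": output}


if __name__ == "__main__":
    sample = {"records": [
        {"recordId": "1", "data": base64.b64encode(b'{"order": 42}').decode("utf-8")},
    ]}
    print(handler(sample, None))
```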
Data Movement From On-premises Datacenters
AWS Snowball,
Snowball Edge and
Snowmobile
Petabyte- and exabyte-scale data
transport solutions that use secure
appliances to transfer
large amounts of data
into and out of the AWS
cloud
AWS Direct Connect
Establish a dedicated
network connection from
your premises to AWS;
reduces your network
costs, increases bandwidth
throughput, and provides
a more consistent
network experience than
Internet-based
connections
AWS Storage
Gateway
Lets your on-premises
applications use AWS
for storage; includes a
highly optimized data
transfer mechanism and
bandwidth management,
along with a local cache
AWS Database
Migration Service
Migrate databases from
the most widely used
commercial and open-source
offerings to AWS
quickly and securely with
minimal downtime to
applications
AWS Database Migration Service
DMS to S3
AWS Database Migration
Service
Source
database
Crawlers
Data catalog
Snapshot data
AWS Glue
Amazon Athena
Amazon EMR
New! as of 25th March
• Support for Parquet
• Support for S3 encryption with KMS
Amazon Redshift
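The two new DMS features listed above (Parquet output and KMS encryption) map onto settings of the S3 target endpoint. The sketch below builds a settings dictionary in the shape boto3's DMS `create_endpoint` accepts as `S3Settings`. The bucket, folder, role, and key identifiers are hypothetical, and the function does not call AWS.

```python
def s3_target_settings(bucket: str, folder: str, role_arn: str, kms_key_id: str) -> dict:
    """Build S3 target endpoint settings: Parquet output, SSE-KMS encryption."""
    return {
        "ServiceAccessRoleArn": role_arn,           # role DMS uses to write to the bucket
        "BucketName": bucket,
        "BucketFolder": folder,
        "DataFormat": "parquet",                    # instead of the default CSV
        "EncryptionMode": "sse-kms",                # instead of the default SSE-S3
        "ServerSideEncryptionKmsKeyId": kms_key_id,
    }


if __name__ == "__main__":
    settings = s3_target_settings(
        "my-data-lake",                                    # hypothetical bucket
        "raw/dms",                                         # hypothetical folder
        "arn:aws:iam::123456789012:role/DmsS3Access",      # hypothetical role
        "arn:aws:kms:eu-west-1:123456789012:key/example",  # hypothetical key
    )
    print(settings)
```

With a boto3 DMS client this would be passed as `create_endpoint(EndpointIdentifier=..., EndpointType="target", EngineName="s3", S3Settings=settings)`.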
DMS to S3 Change Data Capture (CDC)
• Challenging to do easily
• Need to maintain a staging table and reconstitute dataset
from pyspark.sql import Window, functions as F

# df2 holds the DMS output rows: "cdc" is the operation flag (I/U/D), "id" is the key,
# and "idx" is presumably an increasing sequence number (higher = newer)
insertsDf = df2.filter("cdc = 'I'")
updDf = df2.filter("cdc = 'U'")
delDf = df2.filter("cdc = 'D'")
# Keep only the most recent update for each id
w = Window.partitionBy("id").orderBy(F.col("idx").desc())
latestUpdateDf = (updDf.withColumn("rn", F.row_number().over(w))
                  .where(F.col("rn") == 1).drop("rn"))
# Create the update table, left-join it to the original table,
# keep only the original rows that have no update, then union in the latest updates
tempDf = latestUpdateDf.select("id").withColumnRenamed("id", "id_1")
filteredBaseDf = insertsDf.join(tempDf, insertsDf.id == tempDf.id_1, 'left')
filteredBaseDf = filteredBaseDf.filter("id_1 is null").drop("id_1")
insertAndUpdateDf = filteredBaseDf.union(latestUpdateDf)
# Ok, now remove any deleted rows!
tempDf = delDf.select("id").withColumnRenamed("id", "id_del")
finalDf = insertAndUpdateDf.join(tempDf, insertAndUpdateDf.id == tempDf.id_del, 'left')
finalDf = finalDf.filter("id_del is null").drop("id_del")
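The same reconstitution logic can be traced on a handful of rows without a Spark cluster. This plain-Python sketch mirrors the steps above (latest update per id wins, deleted ids are dropped) under the same assumptions about the `cdc`, `id`, and `idx` fields. The `value` field and the `apply_cdc` name are illustrative only.

```python
from typing import Dict, Iterable, List


def apply_cdc(changes: Iterable[dict]) -> List[dict]:
    """Rebuild the current table state from insert/update/delete rows."""
    inserts: Dict[int, dict] = {}
    latest_update: Dict[int, dict] = {}
    deleted = set()
    for row in changes:
        key = row["id"]
        if row["cdc"] == "I":
            inserts[key] = row
        elif row["cdc"] == "U":
            previous = latest_update.get(key)
            # Keep only the update with the highest idx for each id.
            if previous is None or row["idx"] > previous["idx"]:
                latest_update[key] = row
        elif row["cdc"] == "D":
            deleted.add(key)
    # Updates replace the inserted version of a row; deleted ids are removed last.
    merged = {**inserts, **latest_update}
    return [merged[key] for key in sorted(merged) if key not in deleted]


if __name__ == "__main__":
    changes = [
        {"cdc": "I", "id": 1, "idx": 1, "value": "a"},
        {"cdc": "I", "id": 2, "idx": 2, "value": "b"},
        {"cdc": "U", "id": 1, "idx": 3, "value": "a2"},
        {"cdc": "D", "id": 2, "idx": 4, "value": "b"},
    ]
    print(apply_cdc(changes))
```

Like the Spark version, this treats any delete as final, so a key that is deleted and later re-inserted would still be dropped.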
AWS Glue ETL
New!
Third-party API to S3
3rd Party
API
AWS Glue
Python Shell
Crawlers
Data catalog
Incremental exports
Amazon Athena
Glue ETL
Transformed
Data
Amazon Redshift
Parquet File Format
Row group metadata
allows a Parquet reader
to skip portions of files,
or entire files.
Columnar format is
optimized for
analytics.
Column metadata
allows for
pre-aggregation.
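How metadata lets a reader skip data can be shown with a toy model. This is not the Parquet format itself, only a sketch of the idea: each row group records the minimum and maximum of a column, so a filter can rule out whole groups without reading their rows, and an aggregate such as a maximum can be answered from the statistics alone. `RowGroup` and both functions are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RowGroup:
    min_value: int    # statistics stored in the row group metadata
    max_value: int
    rows: List[int]   # the column values themselves


def scan_greater_than(row_groups: List[RowGroup], threshold: int) -> Tuple[List[int], int]:
    """Return matching values and how many row groups were skipped unread."""
    matches: List[int] = []
    skipped = 0
    for group in row_groups:
        if group.max_value <= threshold:
            skipped += 1          # no value in this group can match
            continue
        matches.extend(value for value in group.rows if value > threshold)
    return matches, skipped


def column_max(row_groups: List[RowGroup]) -> int:
    """Answer MAX(column) from metadata only, without touching the rows."""
    return max(group.max_value for group in row_groups)


if __name__ == "__main__":
    groups = [
        RowGroup(1, 10, [1, 5, 10]),
        RowGroup(11, 20, [11, 15, 20]),
        RowGroup(21, 30, [21, 25, 30]),
    ]
    print(scan_greater_than(groups, 20))  # ([21, 25, 30], 2)
    print(column_max(groups))             # 30
```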
Parquet
• Previously it was common to deliver in JSON/CSV/text then run
another process to convert to Parquet. It’s becoming more common to
deliver straight to Parquet.
• Kinesis Firehose – Added support May 2018
• Custom prefix support: Feb 2019
• Requires schema in Glue Data Catalog
• Athena – CREATE TABLE AS SELECT: Oct 2018
• EMR – S3-optimized Parquet committer: Nov 2018
• Database Migration Service – Added Parquet support: Mar 2019
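For the Athena route in the list above, a CREATE TABLE AS SELECT statement can convert an existing JSON or CSV table to Parquet in one query. The helper below is a sketch that only assembles the SQL text. The table names and S3 location are hypothetical, while `format`, `external_location`, and `partitioned_by` are CTAS table properties documented by Athena.

```python
from typing import Sequence


def ctas_to_parquet(target: str, source: str, location: str,
                    partition_columns: Sequence[str] = ()) -> str:
    """Build an Athena CTAS statement that rewrites a table as Parquet."""
    properties = ["format = 'PARQUET'", f"external_location = '{location}'"]
    if partition_columns:
        quoted = ", ".join(f"'{name}'" for name in partition_columns)
        properties.append(f"partitioned_by = ARRAY[{quoted}]")
    with_clause = ",\n  ".join(properties)
    # Note: Athena expects partition columns to come last in the SELECT list,
    # so SELECT * only works if the source table already orders them that way.
    return f"CREATE TABLE {target}\nWITH (\n  {with_clause}\n)\nAS SELECT * FROM {source}"


if __name__ == "__main__":
    print(ctas_to_parquet(
        "orders_parquet",                        # hypothetical target table
        "orders_json",                           # hypothetical source table
        "s3://my-data-lake/processed/orders/",   # hypothetical output location
        ["year"],
    ))
```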
AWS Glue ETL
New!
Amazon EMR
Amazon Redshift
Amazon Athena
Permissions
Data Lake
AWS Cloud
AWS Cloud
Reporting
&
Analytics
Machine
Learning
AWS Cloud
Custom
Applications
AWS Glue
Data Catalog
Amazon EMR Notebooks in the Console
A managed analytics environment based on Jupyter Notebooks
Amazon EMR clusters
AWS Management
Console for EMR
EMR-managed notebook based
on Jupyter notebook
users
Auto saves notebook file to your S3 bucket
Run queries on your remote EMR cluster
EMR VPC
Customer VPC
Amazon QuickSight
Data lake with AWS Glue
Amazon S3
(Raw data)
Amazon S3
(Staging
data)
Amazon S3
(Processed data)
AWS Glue Data Catalog
Crawlers Crawlers Crawlers
Enforce security policies
across multiple services
Gain and manage new
insights
Identify, ingest, clean, and
transform data
Build a secure data lake in days
AWS Lake Formation
How it works
Easily load data to your data lake
logs
DBs
Blueprints
Data Lake Storage
Data
Catalog
Access
Control
Data
import
Lake Formation
Crawlers ML-based
data prep
one-shot
incremental
Blueprints build on AWS Glue
Easily de-duplicate your data with ML transforms
Secure once, access in multiple ways
Data Lake Storage
Data
Catalog
Access
Control
Lake Formation
Admin
Security permissions in Lake Formation
Control data access with simple
grant and revoke permissions
Specify permissions on tables and
columns rather than on buckets
and objects
Easily view policies granted to a
particular user
Audit all data access in one place
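A table- and column-level grant of the kind described above might look like the following. The sketch builds the arguments in the shape boto3's Lake Formation `grant_permissions` call accepts (the same shape works for `revoke_permissions`). It reflects the API as generally available, which may differ from the preview shown in the talk, and the principal, database, table, and column names are hypothetical.

```python
from typing import Iterable


def column_grant(principal_arn: str, database: str, table: str,
                 columns: Iterable[str]) -> dict:
    """Build a SELECT grant restricted to specific columns of one table."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": list(columns),   # only these columns become readable
            }
        },
        "Permissions": ["SELECT"],
    }


if __name__ == "__main__":
    grant = column_grant(
        "arn:aws:iam::123456789012:role/Analyst",  # hypothetical principal
        "lake_db",                                  # hypothetical database
        "orders",                                   # hypothetical table
        ["order_id", "order_date"],                 # hypothetical columns
    )
    print(grant)
```

The permission is expressed against catalog objects (database, table, columns) rather than S3 buckets and object keys, which is the point the slide makes.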
AWS Lake Formation Pricing
No additional charges – Only pay for the
underlying services used.
A tale of AWS at Takeaway.com
Data Engineering in the Business Intelligence team
1. Once upon a time
2. Learning
3. The kingdom
4. Lessons
5. Complexity
6. Flexibility
7. Simplicity
8. Expansion
9. Happily ever after
Thank you!

Contenu connexe

Tendances

Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...Cathrine Wilhelmsen
 
Data platform modernization with Databricks.pptx
Data platform modernization with Databricks.pptxData platform modernization with Databricks.pptx
Data platform modernization with Databricks.pptxCalvinSim10
 
Data Architecture Strategies: Data Architecture for Digital Transformation
Data Architecture Strategies: Data Architecture for Digital TransformationData Architecture Strategies: Data Architecture for Digital Transformation
Data Architecture Strategies: Data Architecture for Digital TransformationDATAVERSITY
 
Modern Data architecture Design
Modern Data architecture DesignModern Data architecture Design
Modern Data architecture DesignKujambu Murugesan
 
Putting the Ops in DataOps: Orchestrate the Flow of Data Across Data Pipelines
Putting the Ops in DataOps: Orchestrate the Flow of Data Across Data PipelinesPutting the Ops in DataOps: Orchestrate the Flow of Data Across Data Pipelines
Putting the Ops in DataOps: Orchestrate the Flow of Data Across Data PipelinesDATAVERSITY
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Activate Data Governance Using the Data Catalog
Activate Data Governance Using the Data CatalogActivate Data Governance Using the Data Catalog
Activate Data Governance Using the Data CatalogDATAVERSITY
 
Enterprise Data Architecture Deliverables
Enterprise Data Architecture DeliverablesEnterprise Data Architecture Deliverables
Enterprise Data Architecture DeliverablesLars E Martinsson
 
Building a modern data warehouse
Building a modern data warehouseBuilding a modern data warehouse
Building a modern data warehouseJames Serra
 
Enterprise Architecture vs. Data Architecture
Enterprise Architecture vs. Data ArchitectureEnterprise Architecture vs. Data Architecture
Enterprise Architecture vs. Data ArchitectureDATAVERSITY
 
How to Build the Data Mesh Foundation: A Principled Approach | Zhamak Dehghan...
How to Build the Data Mesh Foundation: A Principled Approach | Zhamak Dehghan...How to Build the Data Mesh Foundation: A Principled Approach | Zhamak Dehghan...
How to Build the Data Mesh Foundation: A Principled Approach | Zhamak Dehghan...HostedbyConfluent
 
Building a Data Strategy – Practical Steps for Aligning with Business Goals
Building a Data Strategy – Practical Steps for Aligning with Business GoalsBuilding a Data Strategy – Practical Steps for Aligning with Business Goals
Building a Data Strategy – Practical Steps for Aligning with Business GoalsDATAVERSITY
 
Business Intelligence & Data Analytics– An Architected Approach
Business Intelligence & Data Analytics– An Architected ApproachBusiness Intelligence & Data Analytics– An Architected Approach
Business Intelligence & Data Analytics– An Architected ApproachDATAVERSITY
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
Data Warehouse or Data Lake, Which Do I Choose?
Data Warehouse or Data Lake, Which Do I Choose?Data Warehouse or Data Lake, Which Do I Choose?
Data Warehouse or Data Lake, Which Do I Choose?DATAVERSITY
 
Modernizing to a Cloud Data Architecture
Modernizing to a Cloud Data ArchitectureModernizing to a Cloud Data Architecture
Modernizing to a Cloud Data ArchitectureDatabricks
 
Data Platform Architecture Principles and Evaluation Criteria
Data Platform Architecture Principles and Evaluation CriteriaData Platform Architecture Principles and Evaluation Criteria
Data Platform Architecture Principles and Evaluation CriteriaScyllaDB
 
Data Catalogs Are the Answer – What is the Question?
Data Catalogs Are the Answer – What is the Question?Data Catalogs Are the Answer – What is the Question?
Data Catalogs Are the Answer – What is the Question?DATAVERSITY
 
Power BI for Big Data and the New Look of Big Data Solutions
Power BI for Big Data and the New Look of Big Data SolutionsPower BI for Big Data and the New Look of Big Data Solutions
Power BI for Big Data and the New Look of Big Data SolutionsJames Serra
 
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...DataScienceConferenc1
 

Tendances (20)

Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...
 
Data platform modernization with Databricks.pptx
Data platform modernization with Databricks.pptxData platform modernization with Databricks.pptx
Data platform modernization with Databricks.pptx
 
Data Architecture Strategies: Data Architecture for Digital Transformation
Data Architecture Strategies: Data Architecture for Digital TransformationData Architecture Strategies: Data Architecture for Digital Transformation
Data Architecture Strategies: Data Architecture for Digital Transformation
 
Modern Data architecture Design
Modern Data architecture DesignModern Data architecture Design
Modern Data architecture Design
 
Putting the Ops in DataOps: Orchestrate the Flow of Data Across Data Pipelines
Putting the Ops in DataOps: Orchestrate the Flow of Data Across Data PipelinesPutting the Ops in DataOps: Orchestrate the Flow of Data Across Data Pipelines
Putting the Ops in DataOps: Orchestrate the Flow of Data Across Data Pipelines
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Activate Data Governance Using the Data Catalog
Activate Data Governance Using the Data CatalogActivate Data Governance Using the Data Catalog
Activate Data Governance Using the Data Catalog
 
Enterprise Data Architecture Deliverables
Enterprise Data Architecture DeliverablesEnterprise Data Architecture Deliverables
Enterprise Data Architecture Deliverables
 
Building a modern data warehouse
Building a modern data warehouseBuilding a modern data warehouse
Building a modern data warehouse
 
Enterprise Architecture vs. Data Architecture
Enterprise Architecture vs. Data ArchitectureEnterprise Architecture vs. Data Architecture
Enterprise Architecture vs. Data Architecture
 
How to Build the Data Mesh Foundation: A Principled Approach | Zhamak Dehghan...
How to Build the Data Mesh Foundation: A Principled Approach | Zhamak Dehghan...How to Build the Data Mesh Foundation: A Principled Approach | Zhamak Dehghan...
How to Build the Data Mesh Foundation: A Principled Approach | Zhamak Dehghan...
 
Building a Data Strategy – Practical Steps for Aligning with Business Goals
Building a Data Strategy – Practical Steps for Aligning with Business GoalsBuilding a Data Strategy – Practical Steps for Aligning with Business Goals
Building a Data Strategy – Practical Steps for Aligning with Business Goals
 
Business Intelligence & Data Analytics– An Architected Approach
Business Intelligence & Data Analytics– An Architected ApproachBusiness Intelligence & Data Analytics– An Architected Approach
Business Intelligence & Data Analytics– An Architected Approach
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
Data Warehouse or Data Lake, Which Do I Choose?
Data Warehouse or Data Lake, Which Do I Choose?Data Warehouse or Data Lake, Which Do I Choose?
Data Warehouse or Data Lake, Which Do I Choose?
 
Modernizing to a Cloud Data Architecture
Modernizing to a Cloud Data ArchitectureModernizing to a Cloud Data Architecture
Modernizing to a Cloud Data Architecture
 
Data Platform Architecture Principles and Evaluation Criteria
Data Platform Architecture Principles and Evaluation CriteriaData Platform Architecture Principles and Evaluation Criteria
Data Platform Architecture Principles and Evaluation Criteria
 
Data Catalogs Are the Answer – What is the Question?
Data Catalogs Are the Answer – What is the Question?Data Catalogs Are the Answer – What is the Question?
Data Catalogs Are the Answer – What is the Question?
 
Power BI for Big Data and the New Look of Big Data Solutions
Power BI for Big Data and the New Look of Big Data SolutionsPower BI for Big Data and the New Look of Big Data Solutions
Power BI for Big Data and the New Look of Big Data Solutions
 
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
 

Similaire à Modern Data Platform on AWS

Building a Modern Data Platform on AWS. Public Sector Summit Brussels 2019
Building a Modern Data Platform on AWS. Public Sector Summit Brussels 2019Building a Modern Data Platform on AWS. Public Sector Summit Brussels 2019
Building a Modern Data Platform on AWS. Public Sector Summit Brussels 2019javier ramirez
 
AWS Analytics Services - When to use what? | AWS Summit Tel Aviv 2019
AWS Analytics Services - When to use what? | AWS Summit Tel Aviv 2019AWS Analytics Services - When to use what? | AWS Summit Tel Aviv 2019
AWS Analytics Services - When to use what? | AWS Summit Tel Aviv 2019Amazon Web Services
 
AWS Analytics Services - When to use what? | AWS Summit Tel Aviv 2019
AWS Analytics Services - When to use what? | AWS Summit Tel Aviv 2019AWS Analytics Services - When to use what? | AWS Summit Tel Aviv 2019
AWS Analytics Services - When to use what? | AWS Summit Tel Aviv 2019AWS Summits
 
Discuss data migration with AWS experts - STG304 - Santa Clara AWS Summit
Discuss data migration with AWS experts - STG304 - Santa Clara AWS SummitDiscuss data migration with AWS experts - STG304 - Santa Clara AWS Summit
Discuss data migration with AWS experts - STG304 - Santa Clara AWS SummitAmazon Web Services
 
Building Data Lakes for Analytics on AWS - ADB201 - Anaheim AWS Summit
Building Data Lakes for Analytics on AWS - ADB201 - Anaheim AWS SummitBuilding Data Lakes for Analytics on AWS - ADB201 - Anaheim AWS Summit
Building Data Lakes for Analytics on AWS - ADB201 - Anaheim AWS SummitAmazon Web Services
 
Building a modern data platform in AWS
Building a modern data platform in AWSBuilding a modern data platform in AWS
Building a modern data platform in AWSAmazon Web Services
 
Building Serverless Analytics Pipelines with AWS Glue - AWS Summit Sydney 2019
Building Serverless Analytics Pipelines with AWS Glue - AWS Summit Sydney 2019Building Serverless Analytics Pipelines with AWS Glue - AWS Summit Sydney 2019
Building Serverless Analytics Pipelines with AWS Glue - AWS Summit Sydney 2019Amazon Web Services
 
Stream processing and managing real-time data
Stream processing and managing real-time dataStream processing and managing real-time data
Stream processing and managing real-time dataAmazon Web Services
 
Building-Serverless-Analytics-On-AWS
Building-Serverless-Analytics-On-AWSBuilding-Serverless-Analytics-On-AWS
Building-Serverless-Analytics-On-AWSAmazon Web Services
 
How to go from zero to data lakes in days - ADB202 - New York AWS Summit
How to go from zero to data lakes in days - ADB202 - New York AWS SummitHow to go from zero to data lakes in days - ADB202 - New York AWS Summit
How to go from zero to data lakes in days - ADB202 - New York AWS SummitAmazon Web Services
 
Build your own log analytics solution on AWS - ADB301 - Atlanta AWS Summit
Build your own log analytics solution on AWS - ADB301 - Atlanta AWS SummitBuild your own log analytics solution on AWS - ADB301 - Atlanta AWS Summit
Build your own log analytics solution on AWS - ADB301 - Atlanta AWS SummitAmazon Web Services
 
Make Your Data Move: Best Practices for Migrating Data to AWS
Make Your Data Move: Best Practices for Migrating Data to AWSMake Your Data Move: Best Practices for Migrating Data to AWS
Make Your Data Move: Best Practices for Migrating Data to AWSAmazon Web Services
 
Building data lakes for analytics on AWS - ADB201 - Santa Clara AWS Summit.pdf
Building data lakes for analytics on AWS - ADB201 - Santa Clara AWS Summit.pdfBuilding data lakes for analytics on AWS - ADB201 - Santa Clara AWS Summit.pdf
Building data lakes for analytics on AWS - ADB201 - Santa Clara AWS Summit.pdfAmazon Web Services
 
Migrating Data to the Cloud: Explore Your Options From AWS
Migrating Data to the Cloud: Explore Your Options From AWSMigrating Data to the Cloud: Explore Your Options From AWS
Migrating Data to the Cloud: Explore Your Options From AWSAmazon Web Services
 
Choosing the Right Database (Database Freedom)
Choosing the Right Database (Database Freedom)Choosing the Right Database (Database Freedom)
Choosing the Right Database (Database Freedom)Amazon Web Services
 
Make your data move: Best practices for migrating data to AWS - STG201 - New ...
Make your data move: Best practices for migrating data to AWS - STG201 - New ...Make your data move: Best practices for migrating data to AWS - STG201 - New ...
Make your data move: Best practices for migrating data to AWS - STG201 - New ...Amazon Web Services
 
AWS Data Transfer Services: Deep Dive - SRV302 - Chicago AWS Summit
AWS Data Transfer Services: Deep Dive - SRV302 - Chicago AWS SummitAWS Data Transfer Services: Deep Dive - SRV302 - Chicago AWS Summit
AWS Data Transfer Services: Deep Dive - SRV302 - Chicago AWS SummitAmazon Web Services
 
Best Practices for Migrating Databases to the Cloud - AWS Summit Sydney
Best Practices for Migrating Databases to the Cloud - AWS Summit SydneyBest Practices for Migrating Databases to the Cloud - AWS Summit Sydney
Best Practices for Migrating Databases to the Cloud - AWS Summit SydneyAmazon Web Services
 
Performing serverless analytics in AWS Glue - ADB202 - Chicago AWS Summit
Performing serverless analytics in AWS Glue - ADB202 - Chicago AWS SummitPerforming serverless analytics in AWS Glue - ADB202 - Chicago AWS Summit
Performing serverless analytics in AWS Glue - ADB202 - Chicago AWS SummitAmazon Web Services
 

Similaire à Modern Data Platform on AWS (20)

Building-a-Data-Lake-on-AWS
Building-a-Data-Lake-on-AWSBuilding-a-Data-Lake-on-AWS
Building-a-Data-Lake-on-AWS
 
Building a Modern Data Platform on AWS. Public Sector Summit Brussels 2019
Building a Modern Data Platform on AWS. Public Sector Summit Brussels 2019Building a Modern Data Platform on AWS. Public Sector Summit Brussels 2019
Building a Modern Data Platform on AWS. Public Sector Summit Brussels 2019
 
AWS Analytics Services - When to use what? | AWS Summit Tel Aviv 2019
AWS Analytics Services - When to use what? | AWS Summit Tel Aviv 2019AWS Analytics Services - When to use what? | AWS Summit Tel Aviv 2019
AWS Analytics Services - When to use what? | AWS Summit Tel Aviv 2019
 
AWS Analytics Services - When to use what? | AWS Summit Tel Aviv 2019
AWS Analytics Services - When to use what? | AWS Summit Tel Aviv 2019AWS Analytics Services - When to use what? | AWS Summit Tel Aviv 2019
AWS Analytics Services - When to use what? | AWS Summit Tel Aviv 2019
 
Discuss data migration with AWS experts - STG304 - Santa Clara AWS Summit
Discuss data migration with AWS experts - STG304 - Santa Clara AWS SummitDiscuss data migration with AWS experts - STG304 - Santa Clara AWS Summit
Discuss data migration with AWS experts - STG304 - Santa Clara AWS Summit
 
Building Data Lakes for Analytics on AWS - ADB201 - Anaheim AWS Summit
Building Data Lakes for Analytics on AWS - ADB201 - Anaheim AWS SummitBuilding Data Lakes for Analytics on AWS - ADB201 - Anaheim AWS Summit
Building Data Lakes for Analytics on AWS - ADB201 - Anaheim AWS Summit
 
Building a modern data platform in AWS
Building a modern data platform in AWSBuilding a modern data platform in AWS
Building a modern data platform in AWS
 
Building Serverless Analytics Pipelines with AWS Glue - AWS Summit Sydney 2019
Building Serverless Analytics Pipelines with AWS Glue - AWS Summit Sydney 2019Building Serverless Analytics Pipelines with AWS Glue - AWS Summit Sydney 2019
Building Serverless Analytics Pipelines with AWS Glue - AWS Summit Sydney 2019
 
Stream processing and managing real-time data
Stream processing and managing real-time dataStream processing and managing real-time data
Stream processing and managing real-time data
 
Building-Serverless-Analytics-On-AWS
Building-Serverless-Analytics-On-AWSBuilding-Serverless-Analytics-On-AWS
Building-Serverless-Analytics-On-AWS
 
How to go from zero to data lakes in days - ADB202 - New York AWS Summit
How to go from zero to data lakes in days - ADB202 - New York AWS SummitHow to go from zero to data lakes in days - ADB202 - New York AWS Summit
How to go from zero to data lakes in days - ADB202 - New York AWS Summit
 
Build your own log analytics solution on AWS - ADB301 - Atlanta AWS Summit
Build your own log analytics solution on AWS - ADB301 - Atlanta AWS SummitBuild your own log analytics solution on AWS - ADB301 - Atlanta AWS Summit
Build your own log analytics solution on AWS - ADB301 - Atlanta AWS Summit
 
Make Your Data Move: Best Practices for Migrating Data to AWS
Building data lakes for analytics on AWS - ADB201 - Santa Clara AWS Summit.pdf
Migrating Data to the Cloud: Explore Your Options From AWS
Choosing the Right Database (Database Freedom)
Make your data move: Best practices for migrating data to AWS - STG201 - New ...
AWS Data Transfer Services: Deep Dive - SRV302 - Chicago AWS Summit
Best Practices for Migrating Databases to the Cloud - AWS Summit Sydney
Performing serverless analytics in AWS Glue - ADB202 - Chicago AWS Summit

More from Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Esegui pod serverless con Amazon EKS e AWS Fargate
Costruire Applicazioni Moderne con AWS
Come spendere fino al 90% in meno con i container e le istanze spot
Open banking as a service
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Computer Vision con AWS
Database Oracle e VMware Cloud on AWS i miti da sfatare
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
API moderne real-time per applicazioni mobili e web
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Tools for building your MVP on AWS
How to Build a Winning Pitch Deck
Building a web application without servers
Fundraising Essentials
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
Introduzione a Amazon Elastic Container Service

Modern Data Platform on AWS

  • 1. SUMMIT Amsterdam
  • 2. Modern Data Platform on AWS (ANT001). Damon Cortesi, Big Data Architect, AWS (@dacort); David Morel, Takeaway.com. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
  • 3. A brief history of significant Big Data releases: 2004 Google publishes MapReduce paper 2006 Hadoop is created HBase development starts 2008 Facebook launches Hive AWS EMR announced 2009 Facebook launches Presto Apache Spark released 2012 MXNet Paper Published 2015 Amazon Athena & AWS Glue announced 2016
  • 4. There is more data than people think. Data grows >10x every 5 years. Data platforms need to live for 15 years and scale 1,000x.
  • 5. There are more people accessing data (Data Scientists, Analysts, Business Users, Applications) and more requirements for making data available (Secure, Real time, Flexible, Scalable).
  • 6. AWS databases and analytics: broad and deep portfolio, built for builders. Analytics: Amazon Redshift (data warehousing), Amazon EMR (Hadoop + Spark), Athena (interactive analytics), Kinesis Analytics (real-time), Amazon Elasticsearch Service (operational analytics). Databases: RDS (MySQL, PostgreSQL, MariaDB, Oracle, SQL Server), Aurora (MySQL, PostgreSQL), DynamoDB (key value, document), ElastiCache (Redis, Memcached), Neptune (graph), Timestream (time series), QLDB (ledger database), RDS on VMware. Business Intelligence & Machine Learning: Amazon QuickSight, Amazon SageMaker, Amazon Comprehend, Amazon Rekognition, Amazon Lex, Amazon Transcribe, AWS DeepLens. Data Lake: S3/Amazon Glacier, AWS Glue (ETL & Data Catalog), Lake Formation (data lakes). Blockchain: Managed Blockchain, Blockchain Templates. Data Movement: Database Migration Service | Snowball | Snowmobile | Kinesis Data Firehose | Kinesis Data Streams | Data Pipeline | Direct Connect. AWS Marketplace: 250+ solutions, 730+ Database solutions, 600+ Analytics solutions, 25+ Blockchain solutions, 20+ Data lake solutions, 30+ solutions.
  • 7. A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale.
  • 8. Data lake with AWS Glue. Diagram labels: Amazon S3 (raw data), Amazon S3 (staging data), Amazon S3 (processed data), Crawlers, AWS Glue Data Catalog.
  • 9.
  • 10. Amazon S3: Object Storage. Security and Compliance: three different forms of encryption; encrypts data in transit when replicating across regions; log and monitor with CloudTrail; use ML to discover and protect sensitive data with Macie. Flexible Management: classify, report, and visualize data usage trends; objects can be tagged to see storage consumption, cost, and security; build lifecycle policies to automate tiering and retention. Durability, Availability & Scalability: built for eleven nines of durability; data distributed across 3 physical facilities in an AWS region; automatically replicated to any other AWS region. Query in Place: run analytics & ML on the data lake without data movement; S3 Select can retrieve a subset of data, improving analytics performance by 400%.
  • 11. Data Movement From Real-time Sources. Amazon Kinesis Video Streams: securely stream video from connected devices to AWS for analytics, machine learning (ML), and other processing. Amazon Kinesis Data Firehose: capture, transform, and load data streams into AWS data stores for near real-time analytics with existing business intelligence tools. Amazon Kinesis Data Streams: build custom, real-time applications that process data streams using popular stream processing frameworks. AWS IoT Core: supports billions of devices and trillions of messages, and can process and route those messages to AWS endpoints and to other devices reliably and securely. Managed Streaming for Kafka: fully managed open-source platform for building real-time streaming data pipelines and applications.
  • 12. Amazon Kinesis Data Streams
  • 13. Amazon Kinesis Data Firehose
  • 14. Kinesis events to S3. Prefix: raw/life/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/ Buffer: up to 128 MB or 15 minutes. New as of 12th Feb: support for custom S3 prefix. Diagram labels: Clients, Kinesis Data Streams, Kinesis Data Firehose, Lambda Transformation, Save as Parquet, Aggregated JSON Data, Source backup, Aggregated Parquet Data, Crawlers, Amazon Athena.
  • 15. Data Movement From On-premises Datacenters. AWS Snowball, Snowball Edge and Snowmobile: petabyte- and exabyte-scale data transport solutions that use secure appliances to transfer large amounts of data into and out of the AWS cloud. AWS Direct Connect: establish a dedicated network connection from your premises to AWS; reduces your network costs, increases bandwidth throughput, and provides a more consistent network experience than Internet-based connections. AWS Storage Gateway: lets your on-premises applications use AWS for storage; includes a highly optimized data transfer mechanism and bandwidth management, along with a local cache. AWS Database Migration Service: migrate databases from the most widely used commercial and open-source offerings to AWS quickly and securely with minimal downtime to applications.
  • 16. AWS Database Migration Service
  • 17. DMS to S3. New as of 25th March: support for Parquet; support for S3 encryption with KMS. Diagram labels: Source database, AWS Database Migration Service, Snapshot Data, Crawlers, Data catalog, AWS Glue, Amazon Athena, Amazon EMR, Amazon Redshift.
  • 18. DMS to S3: Change Data Capture (CDC). Challenging to do easily; need to maintain a staging table and reconstitute the dataset:
        from pyspark.sql import Window, functions as F
        # df2 holds the CDC records; the "cdc" column marks inserts (I), updates (U), and deletes (D)
        newDf = df2.filter("cdc = 'I'")
        updDf = df2.filter("cdc = 'U'")
        delDf = df2.filter("cdc = 'D'")
        # Keep only the latest update per id (highest idx)
        w = Window.partitionBy("id").orderBy(F.col("idx").desc())
        latestUpdateDf = updDf.withColumn("rn", F.row_number().over(w)).where(F.col("rn") == 1).drop("rn")
        # Left-join the inserts to the updated ids, keep only inserts with no update, then union in the latest updates
        tempDf = latestUpdateDf.select("id").withColumnRenamed("id", "id_1")
        filteredBaseDf = newDf.join(tempDf, newDf.id == tempDf.id_1, "left")
        filteredBaseDf = filteredBaseDf.filter("id_1 is null").drop("id_1")
        insertAndUpdateDf = filteredBaseDf.union(latestUpdateDf)
        # Now remove any deleted rows: keep only ids with no matching delete record
        tempDf = delDf.select("id").withColumnRenamed("id", "id_del")
        finalDf = insertAndUpdateDf.join(tempDf, insertAndUpdateDf.id == tempDf.id_del, "left")
        finalDf = finalDf.filter("id_del is null").drop("id_del")
  • 19. AWS Glue ETL (New!)
  • 20. Third-party API to S3. Diagram labels: 3rd Party API, AWS Glue Python Shell, Incremental Exports, Crawlers, Data catalog, Glue ETL, Transformed Data, Amazon Athena, Amazon Redshift.
  • 21. Parquet File Format. Row group metadata allows the Parquet reader to skip portions of, or all, files. Columnar format is optimized for analytics. Column metadata allows for pre-aggregation.
  • 22. Parquet. Previously it was common to deliver in JSON/CSV/text and then run another process to convert to Parquet. It's becoming more common to deliver straight to Parquet. Kinesis Firehose: added support May 2018; custom prefix support Feb 2019; requires schema in Glue Data Catalog. Athena: CREATE TABLE AS SELECT, Oct 2018. EMR: S3-optimized Parquet committer, Nov 2018. Database Migration Service: added Parquet support Mar 2019.
  • 23.
  • 24. AWS Glue ETL (New!)
  • 25. Amazon EMR
  • 26. Amazon Redshift
  • 27. Diagram labels: Data Lake, AWS Glue Data Catalog, Permissions, Amazon Athena, Reporting & Analytics, Machine Learning, Custom Applications.
  • 28. Amazon EMR Notebooks in the Console: a managed analytics environment based on Jupyter Notebooks. Auto saves notebook file to your S3 bucket; run queries on your remote EMR cluster. Diagram labels: notebook users, AWS Management Console for EMR, EMR-managed notebook based on Jupyter, Amazon EMR clusters, EMR VPC, Customer VPC.
  • 29. Amazon QuickSight
  • 30.
  • 31. Data lake with AWS Glue. Diagram labels: Amazon S3 (raw data), Amazon S3 (staging data), Amazon S3 (processed data), Crawlers, AWS Glue Data Catalog.
  • 32. AWS Lake Formation: build a secure data lake in days. Identify, ingest, clean, and transform data; enforce security policies across multiple services; gain and manage new insights.
  • 33. How it works
  • 34. Easily load data to your data lake. Diagram labels: logs, DBs, Blueprints (one-shot, incremental), Data import, Crawlers, ML-based data prep, Lake Formation, Data Lake Storage, Data Catalog, Access Control.
  • 35. Blueprints build on AWS Glue
  • 36. Easily de-duplicate your data with ML transforms
  • 37. Secure once, access in multiple ways. Diagram labels: Admin, Lake Formation, Data Lake Storage, Data Catalog, Access Control.
  • 38. Security permissions in Lake Formation. Control data access with simple grant and revoke permissions. Specify permissions on tables and columns rather than on buckets and objects. Easily view policies granted to a particular user. Audit all data access in one place.
  • 39. AWS Lake Formation Pricing: no additional charges. Only pay for the underlying services used.
  • 40.
  • 41. A tale of AWS at Takeaway.com Data Engineering in the Business Intelligence team
  • 42. 1. Once upon a time
  • 45.
  • 52. Thank you!
  • 53.
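
The change data capture merge on slide 18 can also be illustrated without Spark. The following is a minimal plain-Python sketch, assuming each change record is a dict with the keys `id`, `idx` (change order), and `cdc` (`I`, `U`, or `D`) used on the slide plus an illustrative `name` payload. The function name `apply_cdc` is hypothetical and is not part of DMS or Glue. Unlike the DataFrame version, it replays the records in order instead of joining insert, update, and delete sets.

```python
from typing import Dict, Iterable, List


def apply_cdc(records: Iterable[dict]) -> List[dict]:
    """Rebuild the current dataset from CDC records (hypothetical record layout)."""
    current: Dict[int, dict] = {}
    # Replay changes in idx order so later changes win over earlier ones.
    for rec in sorted(records, key=lambda r: r["idx"]):
        if rec["cdc"] in ("I", "U"):
            # Inserts and updates both set the latest version of the row.
            current[rec["id"]] = rec
        elif rec["cdc"] == "D":
            # Deletes remove the row if it is present.
            current.pop(rec["id"], None)
    return [current[k] for k in sorted(current)]


changes = [
    {"id": 1, "idx": 1, "cdc": "I", "name": "a"},
    {"id": 2, "idx": 2, "cdc": "I", "name": "b"},
    {"id": 1, "idx": 3, "cdc": "U", "name": "a2"},
    {"id": 2, "idx": 4, "cdc": "D", "name": "b"},
    {"id": 3, "idx": 5, "cdc": "I", "name": "c"},
]
result = apply_cdc(changes)
print([r["id"] for r in result])
```

Row 1 survives with its updated payload, row 2 is removed by its delete record, and row 3 is a plain insert.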
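
To make the custom S3 prefix on slide 14 concrete, here is a small sketch of how such a prefix resolves to a partitioned path for a given record timestamp. Firehose performs this expansion itself; the helper `expand_prefix` below is a hypothetical illustration, handles only the three date patterns shown on the slide, and assumes the timestamp is in UTC.

```python
import re
from datetime import datetime, timezone

# Map the Java-style date patterns used on the slide to strftime codes.
# Only these three patterns are handled in this sketch; others raise KeyError.
_PATTERNS = {"yyyy": "%Y", "MM": "%m", "dd": "%d"}


def expand_prefix(prefix: str, ts: datetime) -> str:
    """Expand !{timestamp:...} expressions in a prefix (illustrative only)."""
    def repl(match):
        return ts.strftime(_PATTERNS[match.group(1)])
    return re.sub(r"!\{timestamp:(\w+)\}", repl, prefix)


prefix = "raw/life/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/"
path = expand_prefix(prefix, datetime(2019, 2, 12, tzinfo=timezone.utc))
print(path)
```

The resulting `year=/month=/day=` layout follows the Hive-style partition naming convention, which is what lets crawlers and Athena treat the path segments as partition columns.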
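
Slide 21 notes that row group metadata lets a Parquet reader skip data. The toy model below sketches the idea under simplifying assumptions: each row group stores the minimum and maximum of a single integer column, and a filter consults those statistics before reading any values. `RowGroup` and `scan_greater_than` are hypothetical names for this illustration, not the API of any Parquet library.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class RowGroup:
    """Toy stand-in for a Parquet row group with min/max column statistics."""
    values: List[int]
    min_value: int = field(init=False)
    max_value: int = field(init=False)

    def __post_init__(self) -> None:
        # A real writer stores these statistics in the file metadata.
        self.min_value = min(self.values)
        self.max_value = max(self.values)


def scan_greater_than(groups: List[RowGroup], threshold: int) -> Tuple[List[int], int]:
    """Return matching values and the number of row groups actually read."""
    matches: List[int] = []
    groups_read = 0
    for group in groups:
        # The statistics show no value can match, so skip without reading the data.
        if group.max_value <= threshold:
            continue
        groups_read += 1
        matches.extend(v for v in group.values if v > threshold)
    return matches, groups_read


groups = [RowGroup([1, 5, 9]), RowGroup([10, 15, 20]), RowGroup([21, 30, 42])]
matches, groups_read = scan_greater_than(groups, 18)
print(matches, groups_read)
```

Here the first row group is skipped entirely because its maximum is below the threshold, so only two of the three groups are read.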