This document discusses building data lakes and analytics pipelines on Amazon Web Services (AWS). It recommends using AWS services like Amazon S3, AWS Glue, AWS Lake Formation, and Amazon Redshift to build scalable and secure data lakes that can ingest and process large amounts of data from various sources. It also highlights how AWS provides the most comprehensive set of analytics services, enables the easiest setup of data lakes, and offers the most scalable and cost-effective infrastructure for analytics workloads.
3. Challenge
Amazon.com lowers costs and gains faster insights with AWS data analytics offerings
Amazon needed to analyze a massive amount of data to find insights, identify opportunities, and evaluate business performance across services including catalog browsing, order placement, transaction processing, delivery scheduling, video services, and Prime registration.
• 50 petabytes of data and 75,000 tables
• Processing 600,000 user analytics jobs each day
• Data is published by more than 1,800 teams
• 3,300+ data consumer teams analyze this data
The Oracle data warehouse did not scale for PB-level data, was difficult to maintain, and was costly.
4. Amazon uses an AWS data lake (100 PB)
[Architecture diagram: source systems (Amazon S3, Amazon DynamoDB, relational and nonrelational stores) feed data ingestion, including Amazon Kinesis, into an Amazon S3 data lake with data quality/curation; a big data marketplace layer provides a data lake web interface, data lake APIs, a workflows service, a discovery service, and a subscription service; analytics runs on Amazon EMR, Amazon Redshift, Amazon Redshift Spectrum, and other compute; data security and governance span the pipeline.]
Solution
Amazon deployed a data lake with Amazon S3, and it now runs analytics with Amazon Redshift, Amazon Redshift Spectrum, and Amazon EMR.
Benefits
Amazon doubled the data stored, from 50 PB to 100 PB, lowered costs, and gained insights faster.
6. Common analytics use cases – which do you need?
Data warehouse modernization
Big data and data lakes
Real-time streaming and analytics
Operational and search analytics
Self-service business analytics
Acquisition of third-party data for analysis
9. Automate data ingestion
Ingestion of data should be automated using triggers, schedules, and change detection
• Eliminates error-prone manual processes
• Allows data to be processed as it arrives
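To make the change-detection idea concrete, here is a minimal, library-free sketch: it fingerprints source files by content hash so repeated scans only surface files that are new or modified, which is the property a trigger- or schedule-driven ingestion job relies on. The function names and the `*.csv` pattern are illustrative, not from the original deck.

```python
import hashlib
from pathlib import Path

def file_fingerprint(path: Path) -> str:
    """Content hash used for change detection."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def detect_changes(source_dir: Path, seen: dict) -> list[Path]:
    """Return files that are new or modified since the last scan.

    `seen` maps filename -> fingerprint and is updated in place, so
    repeated scans only surface genuine changes (no manual tracking).
    """
    changed = []
    for path in sorted(source_dir.glob("*.csv")):
        fp = file_fingerprint(path)
        if seen.get(path.name) != fp:
            seen[path.name] = fp
            changed.append(path)
    return changed
```

In practice the same pattern is what an event-driven service (for example, an S3 event notification invoking a job) gives you for free: only new arrivals are processed, and no one has to remember what was already ingested.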
11. Preserve original source data
Having raw data in its pristine form allows you to repeat the ETL process in case of failures
• No transformation of the original data files
• Allows replay of the data pipeline
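A small sketch of the principle, assuming nothing beyond the slide itself: the raw zone is write-once, and curated outputs are always rebuilt from it, so a failed or buggy transformation can simply be replayed. The zone and function names are hypothetical.

```python
def ingest_raw(raw_zone: dict, name: str, payload: bytes) -> None:
    """Land source data untouched; the raw zone is write-once."""
    if name in raw_zone:
        raise ValueError(f"raw object {name!r} already exists (immutable)")
    raw_zone[name] = payload

def replay(raw_zone: dict, transform) -> dict:
    """Rebuild curated outputs from pristine raw data, e.g. after an ETL failure."""
    return {name: transform(data) for name, data in raw_zone.items()}
```

Because the raw objects are never mutated, replaying the pipeline is deterministic: running `replay` twice with the same transform yields the same curated zone.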
12. Describe data with metadata
It's essential that any dataset that makes its way into a data store environment is discoverable and classified
• Capture metadata so applications can leverage the ingested datasets
• Ensure that this activity is well-documented and automated
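As a toy illustration of "discoverable and classified", here is a minimal in-memory catalog: every ingested dataset must be registered with its source, format, and classification, after which consumers can find datasets by classification. This is a sketch of the idea only; a real deployment would use a service such as the AWS Glue Data Catalog, and all field values below are made up.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class DatasetEntry:
    name: str
    source: str
    fmt: str                 # e.g. "parquet", "csv"
    classification: str      # e.g. "public", "internal", "pii"
    ingested_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

class Catalog:
    """Minimal metadata catalog: every ingested dataset is registered
    so applications can discover and classify it later."""

    def __init__(self):
        self._entries = {}

    def register(self, entry: DatasetEntry) -> None:
        self._entries[entry.name] = entry

    def find_by_classification(self, classification: str) -> list:
        return sorted(e.name for e in self._entries.values()
                      if e.classification == classification)
```

Registering at ingestion time (rather than after the fact) is what keeps the activity automated and the catalog trustworthy.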
13. Use the right ETL tool for the job
Select an ETL tool that closely meets your requirements for streamlining the workflow between the source and the destination
• Several options:
  Custom built to solve specific problems
  Assembled from open source projects
  Commercially licensed ETL platforms
• Support for complex workflows, APIs, and specific languages
• Connectors to varied data stores
• Performance, budget, and enterprise scale
14. Automate ETL workflows
Chaining ETL jobs ensures the seamless execution of your ETL workflow
• Output from one process or job typically serves as input to another
• Ensure you have visibility to track and debug any failure
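The two bullets above can be sketched in a few lines: each job's output feeds the next, and a status log records where the chain stopped, giving the tracking and debugging visibility the slide calls for. In practice an orchestrator such as AWS Step Functions or AWS Glue workflows plays this role; the code below is only a library-free illustration with hypothetical job names.

```python
def run_pipeline(jobs, data):
    """Chain ETL jobs: each job's output is the next job's input.
    A per-job status log keeps failures visible and easy to debug."""
    history = []
    for job in jobs:
        try:
            data = job(data)
        except Exception as exc:
            history.append((job.__name__, f"failed: {exc}"))
            return None, history   # stop the chain at the first failure
        history.append((job.__name__, "succeeded"))
    return data, history
```

The returned history answers the first question asked after any pipeline incident: which step failed, and what did every step before it do?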
15. Tier storage appropriately
Store data in the optimal tier to ensure that you leverage the best features of the storage services for your analytics applications
• Two basic parameters for choosing the right data storage:
  Data format
  Access frequency
• Distribute your datasets into different services:
  Metadata tier and payload tier
  Hot, warm, and cold tiers
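A minimal sketch of tiering by access frequency, one of the two parameters named above. The thresholds are illustrative assumptions, not official AWS guidance; in S3 this decision maps to storage classes (Standard, an infrequent-access class, or a Glacier archival class), or can be delegated entirely to S3 Intelligent-Tiering.

```python
def choose_tier(access_per_month: float, archive: bool = False) -> str:
    """Map access frequency to a hot/warm/cold storage tier.
    Thresholds here are illustrative, not official guidance."""
    if archive or access_per_month == 0:
        return "cold"   # archival storage, e.g. a Glacier-style class
    if access_per_month < 1:
        return "warm"   # an infrequent-access storage class
    return "hot"        # standard storage for active analytics
```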
16. Secure, protect, and manage your entire analytics pipeline
Both the data assets and the infrastructure for storing, processing, and analyzing data must be secured
• Implement fine-grained controls that allow authorized users to manage particular assets
• Access roles might change at various stages of an analytics pipeline
• Ensure that unauthorized users are blocked from taking any actions that would compromise data confidentiality and security
Services: AWS Lake Formation, AWS Identity and Access Management (IAM)
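The essence of fine-grained control is a deny-by-default check over (principal, action, asset) triples, which is what Lake Formation and IAM evaluate for you. The sketch below shows only that shape; the principals and asset prefixes are invented for illustration.

```python
# (principal, action) -> asset prefixes that principal may touch.
GRANTS = {
    ("analyst", "read"):  ["curated/"],
    ("engineer", "read"): ["raw/", "curated/"],
    ("engineer", "write"): ["curated/"],
}

def is_allowed(principal: str, action: str, asset: str) -> bool:
    """Deny by default; permit only explicitly granted asset prefixes."""
    prefixes = GRANTS.get((principal, action), [])
    return any(asset.startswith(p) for p in prefixes)
```

Because roles change across pipeline stages, grants are keyed by action as well as principal: the same engineer who may write to the curated zone still cannot write to the raw zone.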
17. Design for scalable and reliable analytics pipelines
Make analytics execution compute environments reliable and scalable
• Keep up with the pace of data volume and velocity
• Provide high data reliability and optimized query performance to support different analytics applications, from batch and streaming ingest to fast ad hoc queries and data science
19. Why choose AWS for data lakes and analytics?
1. Easiest to build data lakes and analytics
2. Most secure infrastructure for analytics
3. Most comprehensive and open
4. Most scalable and cost-effective
20. 1. Easiest to build data lakes and analytics
• A single storage layer (Amazon S3) for all analytics and ML
• A service to build secure data lakes in days
• Deep integration across analytics and infrastructure (including federated queries)
[Diagram: one data lake feeding data warehousing, analytics, and machine learning (ML)]
The fastest way to go from zero to insights, covering all data for all users
21. 2. Most secure infrastructure for analytics
Customers need multiple levels of security, identity and access management, encryption, and compliance to secure their data lakes
Services for security and governance:
• Compliance: AWS Artifact, Amazon Inspector, AWS CloudHSM, Amazon Cognito, AWS CloudTrail
• Security: Amazon GuardDuty, AWS Shield, AWS WAF, Amazon Macie, Amazon VPC
• Encryption: AWS Certificate Manager, AWS Key Management Service, encryption at rest, encryption in transit, bring your own keys, HSM support
• Identity: IAM, AWS SSO, Amazon Cloud Directory, AWS Directory Service, AWS Organizations
22. 3. Most comprehensive and open
• Data movement: AWS Database Migration Service | AWS Snowball | AWS Snowmobile | Amazon Kinesis Data Firehose | Amazon Kinesis Data Streams | Amazon Managed Streaming for Apache Kafka
• Data lake infrastructure & management: Amazon S3 / Amazon S3 Glacier, AWS Glue, AWS Lake Formation
• Analytics: Amazon Redshift, Amazon EMR (Spark & Hadoop), Amazon Athena, Amazon Elasticsearch Service, Amazon Kinesis Data Analytics, AWS Glue (Spark & Python)
• Data, visualization, engagement & machine learning: Amazon QuickSight, Amazon SageMaker, Amazon Comprehend, Amazon Lex, Amazon Polly, Amazon Rekognition, Amazon Translate, Amazon Pinpoint, AWS Data Exchange (new)
+ Many more
24. Learn analytics with AWS Training and Certification
Resources created by the experts at AWS to help you build and validate data analytics skills
• Visit aws.amazon.com/training/paths-specialty/
• New free digital course: Data Analytics Fundamentals
• Validate expertise with the AWS Certified Big Data – Specialty exam or the new AWS Certified Data Analytics – Specialty beta exam
• Classroom offerings, including Big Data on AWS, feature AWS expert instructors and hands-on labs
The term "big data" has been around for more than a decade, and every industry now understands the importance of data; it has even become one of the core assets a company depends on. Banks use data to detect money laundering and fraudulent transactions, retailers use customers' browsing and purchase histories to build personalized recommendation systems, and manufacturers use production data and sensors to further optimize their processes.
Looking at the trends of the past five to seven years:
1. With the spread of connected devices and 4G/5G, systems generate more data than ever
2. Storage costs have fallen, especially since object storage such as S3 became popular
3. The technical barrier has dropped, thanks to the thriving open source Hadoop ecosystem and the managed offerings cloud vendors have launched
4. The cloud's pay-as-you-go model lowers costs
So if you can centrally store all the useful data your business generates, analyze it along different dimensions to find insights, and even use that data as the foundation for AI, you can truly lead your company to innovate and stay competitive.
We know data is a strategic asset for every organization, not just new businesses and gaming.
Data has gone from being something that was cumbersome and expensive to store to becoming the lifeblood of many a company's business model.
Over the past 5-7 years a couple of key trends have got us to this point.
Connected devices, apps, and systems now generate more data than ever before
With the cloud driving down the cost of storage, customers no longer need to decide what data to keep and what to throw away
With the cloud providing pay-as-you-go, on-demand compute, organizations can now more easily analyze their data to gain insights in a variety of different ways
So if you are able to store every relevant data point about your business (which could grow to massive volumes), and have the ability to analyze all of that data in different ways and distill it down to insights, it will fuel innovation in your organization, which can lead to a competitive advantage.
Companies today are using data to drive decisions like when to offer new product offerings, how to introduce new revenue streams, where to automate manual processes, how to earn customer trust, etc. All of these decisions can fuel innovation and drive your business forward.
For example,
- FINRA is able to catch fraudsters more effectively by processing exabytes of data on AWS.
- Expedia is able to process terabytes of data related to the cost and availability of lodgings to make real-time recommendations to their customers.
- Bristol Myers Squibb uses analytics to discover, develop, and deliver innovative medicines that help patients prevail.
Transition: …but customers are facing new challenges in managing and analyzing their data.
Take our own company, Amazon's e-commerce business, as an example: every functional module on our site generates a great deal of data, including catalog browsing, orders and shopping carts, transaction flows, delivery scheduling, and even Prime registration.
Amazon builds and operates thousands of microservices to serve millions of customers. These include catalog browsing, order placement, transaction processing, delivery scheduling, video services, and Prime registration. Each service publishes datasets to Amazon's massive analytics infrastructure, including over 50 petabytes of data and 75,000 tables, processing 600,000 user analytics jobs each day. Data is published by more than 1,800 teams, while more than 3,300 data consumer teams analyze this data to find insights, identify opportunities, prepare reports, and evaluate business performance.
Challenge:
The on-premises Oracle database infrastructure that supported this system was not designed for processing petabytes of data and resulted in a monolithic solution that was hard to maintain and operate due to lack of separation of concerns from both a functional and financial perspective. From an operational perspective, transformations of tables with over 100 million rows consistently failed. This limited business teams’ ability to generate insights or deploy large-scale machine learning solutions. Many abandoned the monolithic Oracle data warehouse in favor of custom solutions using Amazon Web Services (AWS) technologies.
Database administration for the monolithic Oracle Data Warehouse was complicated, expensive, and error-prone, requiring engineers to spend hundreds of hours each month on software upgrades, database backups, OS patching, and performance monitoring. Inefficient hardware provisioning required labor-intensive demand forecasting and capacity planning. It was financially inefficient being statically sized for peak loads and lacking ability to dynamically scale for hardware cost optimization, with ever increasing Oracle licensing costs.
Solution:
The new data-lake-based solution utilizes a variety of AWS services to deliver performance and reliability at exabyte scale for data processing, streaming, and analytics. Amazon Simple Storage Service (Amazon S3) was used as a data lake to hold raw data in native format until required for analysis. Using Amazon S3 gave Amazon the flexibility to manage a wide variety of data at scale with reduced costs, improved access control, and strong regulatory compliance. To enable self-service analytics for end users, Amazon developed a service that synchronizes data from the data lake with compute systems including Amazon EMR and Amazon Redshift. Amazon EMR provides a managed Hadoop framework that can run Apache Spark, HBase, Presto, and Flink on Amazon Elastic Compute Cloud (Amazon EC2) instances and interact with data in Amazon S3. Amazon Redshift is the AWS data warehouse service that allows analytics end users to run complex queries and visualize results using tools such as Amazon QuickSight. Further, Amazon integrated the data lake with the Redshift Spectrum feature, which allows users to query any dataset in the lake directly from Redshift without needing to synchronize the data to their cluster.
Result:
The new analytics infrastructure has one data lake with over 100 petabytes of data – almost twice the size of the previous Oracle data warehouse. Teams across Amazon are now using over 2700 Amazon Redshift or Amazon EMR clusters to process data from the data lake.
Amazon’s consumer businesses have benefitted from the separation of data storage from data processing in AWS. AWS storage services made it easy to store data in any format, securely, at massive scale, low cost, and to move data quickly and easily. The data lake architecture allows each system to scale independently, lowered overall costs, and broadened the arsenal of technologies available. Users can easily discover high-quality data in optimized formats, and teams are reporting reduced latency for their analytics results.
I think that this needs to be balanced with a cost assessment as otherwise it seems like we have taken a monolith and replaced it with something bigger and more complex.
Now that we have covered our own story, let's look at the challenges AWS customers typically face with their data.
We hear from companies all the time that they are looking to extract more value from their data but struggle to capture, store, and analyze all the data generated by today’s modern and digital businesses. Data is growing exponentially, coming from new sources, is increasingly diverse, and needs to be securely accessed and analyzed by any number of applications and people.
Data is the core behind every business and every decision. Gartner states the # 1 technology area where CIOs are increasing investment is ‘Business intelligence or data analytics solution’ (45% of CIOs are increasing investment and only 1% are decreasing investment). The same report shows 63% of CEOs say they are likely to make a change to the business model over the next two years, and performing analytics prior to, during and after changes of this magnitude is crucial. It is paramount to measure return on investment for digital activities. If you aren't measuring it, "you're not taking it seriously."
Whichever use case you pursue, these challenges remain.
Analytics is a broad space, but some common use cases emerge for most businesses. Which of these are you pursuing today?
And common data platform architectures typically rely on these components.
As for building a data platform, what exactly are the design principles?
Call a Lambda function to check each ETL job's result.
AWS analytics helps organizations quickly get from data to answers by providing mature and integrated analytics services ranging from cloud data warehouses to serverless data lakes. Getting answers quickly means spending less time building plumbing and configuring cloud analytics services to work together. AWS helps you do exactly that by giving you 1/ an easy path to build a data lake and start running diverse analytics workloads, 2/ secure cloud storage, compute, and network infrastructure that meets the specific needs of analytic workloads, 3/ a fully integrated analytics stack with a mature set of analytics tools, covering all common use cases and leveraging open source and standard languages, engines, and platforms, and 4/ the best performance, the most scalability, and the lowest cost for analytics.
Many organizations choose cloud data lakes as the foundation for their data and analytics architectures. Setting up and managing data lakes today involves a lot of manual and time-consuming tasks such as loading data from diverse sources, monitoring these data flows, setting up partitions, turning on encryption and managing keys, re-organizing data into columnar format, and granting and auditing access. AWS is focused on helping customers build and secure data lakes in the cloud, in days, not months. AWS Lake Formation enables secured self-service discovery and access for users (without having to ask IT to find and grant access to data), aware of multiple analytics services, and provides easy on-demand access to specific resources that fit the processor and memory requirements of each analytics workload. The data is curated and cataloged, already prepared for any flavor of analytics, and related records are matched and de-duplicated with machine learning. This automation greatly reduces the time it takes to get to answers when your data lake is built on top of AWS.
Amazon S3 provides a fully-featured storage layer alongside both a comprehensive compute layer (Amazon EC2), and a network stack with the speed and scale needed for advanced analytics. Building a data lake on this robust foundation allows customers to store all of their data without worrying about what to throw away and it reduces the cost needed to do analysis on that data. AWS provides a diverse and mature set of analytics services that are deeply integrated with the infrastructure layers, allowing you to easily take advantage of features like intelligent tiering and EC2 spot instances to reduce cost and run analytics faster. When you are ready for more advanced analytic approaches, our broad collection of ML and AI services can be used against that same data in S3 to provide even more insight without the delays and cost from moving or transforming your data.
To eliminate silos, you need to build a data lake. But that hasn't always been easy in the past, or with other vendors today. Only AWS makes it this simple to build a comprehensive and secure data lake environment so you can start getting new insights from all your data – fast!
AWS Lake Formation, which automates many of the complex steps required to set up a data lake, reducing the time required to build a secure data lake from months to days.
Security control at the object level for our object storage (data lake storage) layer. Other cloud vendors only provide bucket level security control.
Deep integration across services that are needed to get answers from your data, including storage, compute, networking, and data movement. For example, Amazon EMR makes it easy to use EC2 Spot instances to save up to 90% on analytic workloads. Amazon Redshift allows you to query your S3 objects directly from your data warehouse.
A single security model across all analytic services. AWS Lake Formation provides a single way to control access to your data whether you are accessing that data from a data warehouse, a Spark cluster, or a serverless query technology.
Mature analytics services. Amazon EMR was first released in 2009 and Amazon Redshift first launched in 2013. Amazon S3 was one of the first AWS products and has been available since 2006. Tens of thousands of customers have data lakes on AWS and X exabytes of data is analyzed every day.
A single object storage layer that is compatible with all AWS analytics and machine learning services. Amazon S3 is our only object storage service, we do not have different versions of S3 and we do not have separate “data lake storage.”
5 storage tiers and intelligent tiering in Amazon S3, so are able to store more data at a lower cost and with less manual data lifecycle work than with any other cloud provider.
Amazon S3 and AWS managed services store customer data in independent data centers across three Availability Zones within a single AWS Region, and can replicate data between Regions regardless of storage class, providing a very high degree of fault tolerance and data durability out of the box.
AWS analytics services provide best of breed performance. Amazon Redshift is 2x faster than the next most popular competitor and Amazon EMR runs Apache Spark workloads over 10x faster than open source Spark. Speed helps get to answers quickly and also helps keep costs down for complex analytics.
As organizations face a growing number of data breaches, cybercrime, and the need to meet compliance regulations like the General Data Protection Regulation (GDPR), their data lakes needs to be protected with the highest levels of security. AWS provides essential security, compliance, and audit capabilities with all of the services listed above available as needed.
AWS data lakes give you control at the object level (rather than the bucket level), allowing you to apply access, log, and audit policies at account and object levels. AWS offers different forms of encryption, including automatic server-side encryption, encryption with keys managed by the AWS Key Management Service (KMS), and encryption with keys that you can manage. AWS encrypts data in transit when replicating across regions, and lets customers use separate accounts for source and destination regions to protect against malicious insider deletions. AWS helps security teams proactively monitor, detect, and alert anomalies with Amazon Macie, an AI-powered security service that helps detect early stages of an attack. S3 is the only object store that gives you inventory reports on all your objects so you can answer questions like: Are all my objects encrypted? S3 is the only object store that allows you to analyze all the access permissions on all your objects and buckets with IAM Access Analyzer.
Deep integration between all the layers of the AWS analytics stack gives builders the tools to quickly analyze data using any approach. Use AWS Lake Formation to store your data once in standards-based formats (such as Parquet or ORC) in S3 and then analyze that data using the right tool for the job, including services for data warehouses, Apache Spark or Hadoop, data catalog, serverless ETL, operational analytics (Elasticsearch), and streaming analytics. Integration with EC2 makes it simple to scale up and down and to use techniques like EC2 Spot instances to reduce the cost for analysis by as much as 90%.
The breadth and depth of analytics services on AWS makes it easy to choose the right tool for the right job. From the fastest data warehouse service to a fully managed Apache Spark and Apache Hadoop service, AWS analytics makes it easy for you to spin up the right resources to run whatever analysis is most appropriate for your specific need. There is no compression algorithm for experience, and AWS has worked with customers to provide managed analytics services longer than anyone else. For example, Amazon EMR launched in 2009 and Amazon Redshift launched in 2013. When using these services, there is no need to continually move and transform data, and AWS has native and fully integrated services for core use cases rather than a collection of partially integrated services from other vendors. If you do need something beyond what our native services offer, we have ### partner services to complement our core offerings.
AWS offers at least:
10 data movement services
13 analytics services
18 machine learning and AI services
17 security and governance services
Maybe more since this slide was created!
Analytics requires robust storage and compute to get performance at the right cost. Amazon EC2's industry-leading collection of instance types, Reserved Instances, and Spot Instances makes choosing the right compute for your analytics workloads simple, and 100 Gbps network interfaces provide an order of magnitude more bandwidth between storage and compute, increasing performance and reducing cost. Does your workload require specialized GPU instances or FPGA-powered instances? Amazon EC2 provides over 200 instance types to meet the needs of any workload. Amazon S3 is designed for eleven nines of data durability, making it uniquely well suited for data lake storage. In addition, S3's five storage tiers and intelligent tiering make storing vast amounts of data less expensive and easier to manage.
Speaker Notes:
You came to re:Invent to learn. There’s no need to stop when you go home.
Keep re:Inventing with resources from AWS Training and Certification for Big Data - for you and your teams
AWS Training and Certification offers training built by AWS experts for learners at a variety of skill levels who want to understand data analytics, data lakes, and associated skills in Machine Learning.
Validate your expertise with AWS Certification and earn an industry-recognized credential.
For more information, visit aws.amazon.com/training and look for the Big Data Learning Path.