This document discusses building data lakes and analytics pipelines on Amazon Web Services (AWS). It recommends using AWS services like Amazon S3, AWS Glue, AWS Lake Formation, and Amazon Redshift to build scalable and secure data lakes that can ingest and process large amounts of data from various sources. It also highlights how AWS provides the most comprehensive set of analytics services, enables the easiest setup of data lakes, and offers the most scalable and cost-effective infrastructure for analytics workloads.
3. Challenge
Amazon.com lowers costs and gains faster insights with AWS data analytics offerings
Amazon needed to analyze a massive amount of data to find insights, identify opportunities, and evaluate business performance across services including catalog browsing, order placement, transaction processing, delivery scheduling, video services, and Prime registration.
• 50 petabytes of data and 75,000 tables
• Processing 600,000 user analytics jobs each day
• Data is published by more than 1,800 teams
• 3,300+ data consumer teams analyze this data
The Oracle data warehouse did not scale for PB-level data, was difficult to maintain, and was costly.
4. Amazon uses an AWS data lake (100 PB)
[Architecture diagram: source systems (Amazon S3, Amazon DynamoDB, relational and nonrelational stores) feed data ingestion, including Amazon Kinesis, into an Amazon S3 data lake with data quality/curation; a big data marketplace layer provides a data lake web interface, data lake APIs, a workflows service, a discovery service, and a subscription service; analytics runs on Amazon EMR, Amazon Redshift, Amazon Redshift Spectrum, and other compute; data security and governance span the pipeline.]
Solution
Amazon deployed a data lake with Amazon S3, and it now runs analytics with Amazon Redshift, Amazon Redshift Spectrum, and Amazon EMR.
Benefits
Amazon doubled the data stored, from 50 PB to 100 PB, lowered costs, and gained insights faster.
6. Common analytics use cases – which do you need?
Data warehouse modernization
Big data and data lakes
Real-time streaming and analytics
Operational and search analytics
Self-service business analytics
Acquisition of third-party data for analysis
9. Automate data ingestion
Ingestion of data should be automated using triggers, schedules, and change detection
• Eliminates error-prone manual processes
• Allows data to be processed as it arrives
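To make the change-detection idea concrete, here is a minimal, library-free sketch: it fingerprints source files by content hash so repeated scans only surface files that are new or modified, which is the property a trigger- or schedule-driven ingestion job relies on. The function names and the `*.csv` pattern are illustrative, not from the original deck.

```python
import hashlib
from pathlib import Path

def file_fingerprint(path: Path) -> str:
    """Content hash used for change detection."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def detect_changes(source_dir: Path, seen: dict) -> list[Path]:
    """Return files that are new or modified since the last scan.

    `seen` maps filename -> fingerprint and is updated in place, so
    repeated scans only surface genuine changes (no manual tracking).
    """
    changed = []
    for path in sorted(source_dir.glob("*.csv")):
        fp = file_fingerprint(path)
        if seen.get(path.name) != fp:
            seen[path.name] = fp
            changed.append(path)
    return changed
```

In practice the same pattern is what an event-driven service (for example, an S3 event notification invoking a job) gives you for free: only new arrivals are processed, and no one has to remember what was already ingested.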
11. Preserve original source data
Having raw data in its pristine form allows you to repeat the ETL process in case of failures
• No transformation of the original data files
• Allows replay of the data pipeline
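A small sketch of the principle, assuming nothing beyond the slide itself: the raw zone is write-once, and curated outputs are always rebuilt from it, so a failed or buggy transformation can simply be replayed. The zone and function names are hypothetical.

```python
def ingest_raw(raw_zone: dict, name: str, payload: bytes) -> None:
    """Land source data untouched; the raw zone is write-once."""
    if name in raw_zone:
        raise ValueError(f"raw object {name!r} already exists (immutable)")
    raw_zone[name] = payload

def replay(raw_zone: dict, transform) -> dict:
    """Rebuild curated outputs from pristine raw data, e.g. after an ETL failure."""
    return {name: transform(data) for name, data in raw_zone.items()}
```

Because the raw objects are never mutated, replaying the pipeline is deterministic: running `replay` twice with the same transform yields the same curated zone.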
12. Describe data with metadata
It's essential that any dataset that makes its way into a data store environment is discoverable and classified
• Capture metadata so applications can leverage the ingested datasets
• Ensure that this activity is well-documented and automated
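As a toy illustration of "discoverable and classified", here is a minimal in-memory catalog: every ingested dataset must be registered with its source, format, and classification, after which consumers can find datasets by classification. This is a sketch of the idea only; a real deployment would use a service such as the AWS Glue Data Catalog, and all field values below are made up.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class DatasetEntry:
    name: str
    source: str
    fmt: str                 # e.g. "parquet", "csv"
    classification: str      # e.g. "public", "internal", "pii"
    ingested_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

class Catalog:
    """Minimal metadata catalog: every ingested dataset is registered
    so applications can discover and classify it later."""

    def __init__(self):
        self._entries = {}

    def register(self, entry: DatasetEntry) -> None:
        self._entries[entry.name] = entry

    def find_by_classification(self, classification: str) -> list:
        return sorted(e.name for e in self._entries.values()
                      if e.classification == classification)
```

Registering at ingestion time (rather than after the fact) is what keeps the activity automated and the catalog trustworthy.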
13. Use the right ETL tool for the job
Select an ETL tool that closely meets your requirements for streamlining the workflow between the source and the destination
• Several options:
  Custom built to solve specific problems
  Assembled from open source projects
  Commercially licensed ETL platforms
• Support for complex workflows, APIs, and specific languages
• Connectors to varied data stores
• Performance, budget, and enterprise scale
14. Automate ETL workflows
Chaining ETL jobs ensures the seamless execution of your ETL workflow
• Output from one process or job typically serves as input to another
• Ensure you have visibility to track and debug any failure
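The two bullets above can be sketched in a few lines: each job's output feeds the next, and a status log records where the chain stopped, giving the tracking and debugging visibility the slide calls for. In practice an orchestrator such as AWS Step Functions or AWS Glue workflows plays this role; the code below is only a library-free illustration with hypothetical job names.

```python
def run_pipeline(jobs, data):
    """Chain ETL jobs: each job's output is the next job's input.
    A per-job status log keeps failures visible and easy to debug."""
    history = []
    for job in jobs:
        try:
            data = job(data)
        except Exception as exc:
            history.append((job.__name__, f"failed: {exc}"))
            return None, history   # stop the chain at the first failure
        history.append((job.__name__, "succeeded"))
    return data, history
```

The returned history answers the first question asked after any pipeline incident: which step failed, and what did every step before it do?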
15. Tier storage appropriately
Store data in the optimal tier to ensure that you leverage the best features of the storage services for your analytics applications
• Two basic parameters for choosing the right data storage:
  Data format
  Access frequency
• Distribute your datasets into different services:
  Metadata tier and payload tier
  Hot, warm, and cold tiers
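A minimal sketch of tiering by access frequency, one of the two parameters named above. The thresholds are illustrative assumptions, not official AWS guidance; in S3 this decision maps to storage classes (Standard, an infrequent-access class, or a Glacier archival class), or can be delegated entirely to S3 Intelligent-Tiering.

```python
def choose_tier(access_per_month: float, archive: bool = False) -> str:
    """Map access frequency to a hot/warm/cold storage tier.
    Thresholds here are illustrative, not official guidance."""
    if archive or access_per_month == 0:
        return "cold"   # archival storage, e.g. a Glacier-style class
    if access_per_month < 1:
        return "warm"   # an infrequent-access storage class
    return "hot"        # standard storage for active analytics
```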
16. Secure, protect, and manage your entire analytics pipeline
Both the data assets and the infrastructure for storing, processing, and analyzing data must be secured
• Implement fine-grained controls that allow authorized users to manage particular assets
• Access roles might change at various stages of an analytics pipeline
• Ensure that unauthorized users are blocked from taking any actions that would compromise data confidentiality and security
Services: AWS Lake Formation, AWS Identity and Access Management (IAM)
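The essence of fine-grained control is a deny-by-default check over (principal, action, asset) triples, which is what Lake Formation and IAM evaluate for you. The sketch below shows only that shape; the principals and asset prefixes are invented for illustration.

```python
# (principal, action) -> asset prefixes that principal may touch.
GRANTS = {
    ("analyst", "read"):  ["curated/"],
    ("engineer", "read"): ["raw/", "curated/"],
    ("engineer", "write"): ["curated/"],
}

def is_allowed(principal: str, action: str, asset: str) -> bool:
    """Deny by default; permit only explicitly granted asset prefixes."""
    prefixes = GRANTS.get((principal, action), [])
    return any(asset.startswith(p) for p in prefixes)
```

Because roles change across pipeline stages, grants are keyed by action as well as principal: the same engineer who may write to the curated zone still cannot write to the raw zone.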
17. Design for scalable and reliable analytics pipelines
Make analytics execution compute environments reliable and scalable
• Keep up with the pace of data volume and velocity
• Provide high data reliability and optimized query performance to support different analytics applications, from batch and streaming ingest to fast ad hoc queries and data science
19. Why choose AWS for data lakes and analytics?
1. Easiest to build data lakes and analytics
2. Most secure infrastructure for analytics
3. Most comprehensive and open
4. Most scalable and cost-effective
20. 1. Easiest to build data lakes and analytics
• A single storage layer (Amazon S3) for all analytics and ML
• A service to build secure data lakes in days
• Deep integration across analytics and infrastructure (including federated queries)
[Diagram: one data lake feeding data warehousing, analytics, and machine learning (ML)]
The fastest way to go from zero to insights, covering all data for all users
21. 2. Most secure infrastructure for analytics
Customers need multiple levels of security, identity and access management, encryption, and compliance to secure their data lakes
Services for security and governance:
• Compliance: AWS Artifact, Amazon Inspector, AWS CloudHSM, Amazon Cognito, AWS CloudTrail
• Security: Amazon GuardDuty, AWS Shield, AWS WAF, Amazon Macie, Amazon VPC
• Encryption: AWS Certificate Manager, AWS Key Management Service, encryption at rest, encryption in transit, bring your own keys, HSM support
• Identity: IAM, AWS SSO, Amazon Cloud Directory, AWS Directory Service, AWS Organizations
22. 3. Most comprehensive and open
• Data movement: AWS Database Migration Service | AWS Snowball | AWS Snowmobile | Amazon Kinesis Data Firehose | Amazon Kinesis Data Streams | Amazon Managed Streaming for Apache Kafka
• Data lake infrastructure & management: Amazon S3 / Amazon S3 Glacier, AWS Glue, AWS Lake Formation
• Analytics: Amazon Redshift, Amazon EMR (Spark & Hadoop), Amazon Athena, Amazon Elasticsearch Service, Amazon Kinesis Data Analytics, AWS Glue (Spark & Python)
• Data, visualization, engagement & machine learning: Amazon QuickSight, Amazon SageMaker, Amazon Comprehend, Amazon Lex, Amazon Polly, Amazon Rekognition, Amazon Translate, Amazon Pinpoint, AWS Data Exchange (new)
+ Many more
24. Learn analytics with AWS Training and Certification
Resources created by the experts at AWS to help you build and validate data analytics skills
• Visit aws.amazon.com/training/paths-specialty/
• New free digital course: Data Analytics Fundamentals
• Validate expertise with the AWS Certified Big Data – Specialty exam or the new AWS Certified Data Analytics – Specialty beta exam
• Classroom offerings, including Big Data on AWS, feature AWS expert instructors and hands-on labs
The term "big data" has been around for more than a decade, and every industry now understands the importance of data; it has even become one of the core assets a company depends on. Banks use data to detect money laundering and fraudulent transactions, retailers use customers' browsing and purchase histories to build personalized recommendation systems, and manufacturers use production data and sensors to further optimize their processes.
Looking at the trends of the past five to seven years:
1. With the spread of connected devices and 4G/5G, systems generate more data than ever
2. Storage costs have fallen, especially since object storage such as S3 became popular
3. The technical barrier has dropped, thanks to the thriving open source Hadoop ecosystem and the managed offerings cloud vendors have launched
4. The cloud's pay-as-you-go model lowers costs
So if you can centrally store all the useful data your business generates, analyze it along different dimensions to find insights, and even use that data as the foundation for AI, you can truly lead your company to innovate and stay competitive.
We know data is a strategic asset for every organization, not just new businesses and gaming.
Data has gone from being something that was cumbersome and expensive to store to becoming the lifeblood of many a company's business model.
Over the past 5-7 years a couple of key trends have got us to this point.
Connected devices, apps, and systems now generate more data than ever before
With the cloud driving down the cost of storage, customers no longer need to decide what data to keep and what to throw away
With the cloud providing pay-as-you-go, on-demand compute, organizations can now more easily analyze their data to gain insights in a variety of different ways
So if you are able to store every relevant data point about your business (which could grow to massive volumes), and have the ability to analyze all of that data in different ways and distill it down to insights, it will fuel innovation in your organization, which can lead to a competitive advantage.
Companies today are using data to drive decisions like when to offer new product offerings, how to introduce new revenue streams, where to automate manual processes, how to earn customer trust, etc. All of these decisions can fuel innovation and drive your business forward.
For example,
- FINRA is able to catch fraudsters more effectively by processing exabytes of data on AWS.
- Expedia is able to process terabytes of data related to the cost and availability of lodgings to make real-time recommendations to their customers.
- Bristol Myers Squibb uses analytics to discover, develop, and deliver innovative medicines that help patients prevail.
Transition: …but customers are facing new challenges in managing and analyzing their data.
Take our own company, Amazon's e-commerce business, as an example: every functional module on our site generates a great deal of data, including catalog browsing, orders and shopping carts, transaction flows, delivery scheduling, and even Prime registration.
Amazon builds and operates thousands of microservices to serve millions of customers. These include catalog browsing, order placement, transaction processing, delivery scheduling, video services, and Prime registration. Each service publishes datasets to Amazon's massive analytics infrastructure, including over 50 petabytes of data and 75,000 tables, processing 600,000 user analytics jobs each day. Data is published by more than 1,800 teams, while more than 3,300 data consumer teams analyze this data to find insights, identify opportunities, prepare reports, and evaluate business performance.
Challenge:
The on-premises Oracle database infrastructure that supported this system was not designed for processing petabytes of data and resulted in a monolithic solution that was hard to maintain and operate due to lack of separation of concerns from both a functional and financial perspective. From an operational perspective, transformations of tables with over 100 million rows consistently failed. This limited business teams’ ability to generate insights or deploy large-scale machine learning solutions. Many abandoned the monolithic Oracle data warehouse in favor of custom solutions using Amazon Web Services (AWS) technologies.
Database administration for the monolithic Oracle Data Warehouse was complicated, expensive, and error-prone, requiring engineers to spend hundreds of hours each month on software upgrades, database backups, OS patching, and performance monitoring. Inefficient hardware provisioning required labor-intensive demand forecasting and capacity planning. It was financially inefficient being statically sized for peak loads and lacking ability to dynamically scale for hardware cost optimization, with ever increasing Oracle licensing costs.
Solution:
The new data-lake-based solution utilizes a variety of AWS services to deliver performance and reliability at exabyte scale for data processing, streaming, and analytics. Amazon Simple Storage Service (Amazon S3) was used as a data lake to hold raw data in native format until required for analysis. Using Amazon S3 gave Amazon the flexibility to manage a wide variety of data at scale with reduced costs, improved access control, and strong regulatory compliance. To enable self-service analytics for end users, Amazon developed a service that synchronizes data from the data lake with compute systems including Amazon EMR and Amazon Redshift. Amazon EMR provides a managed Hadoop framework that can run Apache Spark, HBase, Presto, and Flink on Amazon Elastic Compute Cloud (Amazon EC2) instances and interact with data in Amazon S3. Amazon Redshift is the AWS data warehouse service that allows analytics end users to run complex queries and visualize results using tools such as Amazon QuickSight. Further, Amazon integrated the data lake with the Redshift Spectrum feature, which allows users to query any dataset in the lake directly from Redshift without needing to synchronize the data to their cluster.
Result:
The new analytics infrastructure has one data lake with over 100 petabytes of data – almost twice the size of the previous Oracle data warehouse. Teams across Amazon are now using over 2700 Amazon Redshift or Amazon EMR clusters to process data from the data lake.
Amazon’s consumer businesses have benefitted from the separation of data storage from data processing in AWS. AWS storage services made it easy to store data in any format, securely, at massive scale, low cost, and to move data quickly and easily. The data lake architecture allows each system to scale independently, lowered overall costs, and broadened the arsenal of technologies available. Users can easily discover high-quality data in optimized formats, and teams are reporting reduced latency for their analytics results.
I think that this needs to be balanced with a cost assessment as otherwise it seems like we have taken a monolith and replaced it with something bigger and more complex.
Now that we have covered our own story, let's look at the challenges AWS customers typically face with their data.
We hear from companies all the time that they are looking to extract more value from their data but struggle to capture, store, and analyze all the data generated by today’s modern and digital businesses. Data is growing exponentially, coming from new sources, is increasingly diverse, and needs to be securely accessed and analyzed by any number of applications and people.
Data is the core behind every business and every decision. Gartner states the # 1 technology area where CIOs are increasing investment is ‘Business intelligence or data analytics solution’ (45% of CIOs are increasing investment and only 1% are decreasing investment). The same report shows 63% of CEOs say they are likely to make a change to the business model over the next two years, and performing analytics prior to, during and after changes of this magnitude is crucial. It is paramount to measure return on investment for digital activities. If you aren't measuring it, "you're not taking it seriously."
Whichever use case you pursue, these challenges remain.
Analytics is a broad space, but some common use cases emerge for most businesses. Which of these are you pursuing today?
And common data platform architectures typically rely on these components.
As for building a data platform, what exactly are the design principles?
Call a Lambda function to check each ETL job's result.
AWS analytics helps organizations quickly get from data to answers by providing mature and integrated analytics services ranging from cloud data warehouses to serverless data lakes. Getting answers quickly means spending less time building plumbing and configuring cloud analytics services to work together. AWS helps you do exactly that by giving you 1/ an easy path to build a data lake and start running diverse analytics workloads, 2/ secure cloud storage, compute, and network infrastructure that meets the specific needs of analytic workloads, 3/ a fully integrated analytics stack with a mature set of analytics tools, covering all common use cases and leveraging open source and standard languages, engines, and platforms, and 4/ the best performance, the most scalability, and the lowest cost for analytics.
Many organizations choose cloud data lakes as the foundation for their data and analytics architectures. Setting up and managing data lakes today involves a lot of manual and time-consuming tasks such as loading data from diverse sources, monitoring these data flows, setting up partitions, turning on encryption and managing keys, re-organizing data into columnar format, and granting and auditing access. AWS is focused on helping customers build and secure data lakes in the cloud, in days, not months. AWS Lake Formation enables secured self-service discovery and access for users (without having to ask IT to find and grant access to data), aware of multiple analytics services, and provides easy on-demand access to specific resources that fit the processor and memory requirements of each analytics workload. The data is curated and cataloged, already prepared for any flavor of analytics, and related records are matched and de-duplicated with machine learning. This automation greatly reduces the time it takes to get to answers when your data lake is built on top of AWS.
Amazon S3 provides a fully-featured storage layer alongside both a comprehensive compute layer (Amazon EC2), and a network stack with the speed and scale needed for advanced analytics. Building a data lake on this robust foundation allows customers to store all of their data without worrying about what to throw away and it reduces the cost needed to do analysis on that data. AWS provides a diverse and mature set of analytics services that are deeply integrated with the infrastructure layers, allowing you to easily take advantage of features like intelligent tiering and EC2 spot instances to reduce cost and run analytics faster. When you are ready for more advanced analytic approaches, our broad collection of ML and AI services can be used against that same data in S3 to provide even more insight without the delays and cost from moving or transforming your data.
To eliminate silos, you need to build a data lake. But that hasn't always been easy in the past, or with other vendors today. Only AWS makes it this simple to build a comprehensive and secure data lake environment so you can start getting new insights from all your data – fast!
AWS Lake Formation, which automates many of the complex steps required to set up a data lake, reducing the time required to build a secure data lake from months to days.
Security control at the object level for our object storage (data lake storage) layer. Other cloud vendors only provide bucket level security control.
Deep integration across services that are needed to get answers from your data, including storage, compute, networking, and data movement. For example, Amazon EMR makes it easy to use EC2 Spot instances to save up to 90% on analytic workloads. Amazon Redshift allows you to query your S3 objects directly from your data warehouse.
A single security model across all analytic services. AWS Lake Formation provides a single way to control access to your data whether you are accessing that data from a data warehouse, a Spark cluster, or a serverless query technology.
Mature analytics services. Amazon EMR was first released in 2009 and Amazon Redshift first launched in 2013. Amazon S3 was one of the first AWS products and has been available since 2006. Tens of thousands of customers have data lakes on AWS and X exabytes of data is analyzed every day.
A single object storage layer that is compatible with all AWS analytics and machine learning services. Amazon S3 is our only object storage service, we do not have different versions of S3 and we do not have separate “data lake storage.”
5 storage tiers and intelligent tiering in Amazon S3, so are able to store more data at a lower cost and with less manual data lifecycle work than with any other cloud provider.
Amazon S3 and AWS managed services store customer data in independent data centers across three Availability Zones within a single AWS Region, and can replicate data between Regions regardless of storage class, providing a very high degree of fault tolerance and data durability out of the box.
AWS analytics services provide best of breed performance. Amazon Redshift is 2x faster than the next most popular competitor and Amazon EMR runs Apache Spark workloads over 10x faster than open source Spark. Speed helps get to answers quickly and also helps keep costs down for complex analytics.
As organizations face a growing number of data breaches, cybercrime, and the need to meet compliance regulations like the General Data Protection Regulation (GDPR), their data lakes needs to be protected with the highest levels of security. AWS provides essential security, compliance, and audit capabilities with all of the services listed above available as needed.
AWS data lakes give you control at the object level (rather than the bucket level), allowing you to apply access, log, and audit policies at account and object levels. AWS offers different forms of encryption, including automatic server-side encryption, encryption with keys managed by the AWS Key Management Service (KMS), and encryption with keys that you can manage. AWS encrypts data in transit when replicating across regions, and lets customers use separate accounts for source and destination regions to protect against malicious insider deletions. AWS helps security teams proactively monitor, detect, and alert anomalies with Amazon Macie, an AI-powered security service that helps detect early stages of an attack. S3 is the only object store that gives you inventory reports on all your objects so you can answer questions like: Are all my objects encrypted? S3 is the only object store that allows you to analyze all the access permissions on all your objects and buckets with IAM Access Analyzer.
Deep integration between all the layers of the AWS analytics stack gives builders the tools to quickly analyze data using any approach. Use AWS Lake Formation to store your data once in standards-based formats (such as Parquet or ORC) in S3 and then analyze that data using the right tool for the job, including services for data warehouses, Apache Spark or Hadoop, data catalog, serverless ETL, operational analytics (Elasticsearch), and streaming analytics. Integration with EC2 makes it simple to scale up and down and to use techniques like EC2 Spot instances to reduce the cost for analysis by as much as 90%.
The breadth and depth of analytics services on AWS makes it easy to choose the right tool for the right job. From the fastest data warehouse service to a fully managed Apache Spark and Apache Hadoop service, AWS analytics makes it easy for you to spin up the right resources to run whatever analysis is most appropriate for your specific need. There is no compression algorithm for experience, and AWS has worked with customers to provide managed analytics services longer than anyone else. For example, Amazon EMR launched in 2009 and Amazon Redshift launched in 2013. When using these services, there is no need to continually move and transform data, and AWS has native and fully integrated services for core use cases rather than a collection of partially integrated services from other vendors. If you do need something beyond what our native services offer, we have ### partner services to complement our core offerings.
AWS offers at least:
10 data movement services
13 analytics services
18 machine learning and AI services
17 security and governance services
Maybe more since this slide was created!
Analytics requires robust storage and compute to get performance at the right cost. Amazon EC2's industry-leading collection of instance types, Reserved Instances, and Spot Instances makes choosing the right compute for your analytics workloads simple, and 100 Gbps network interfaces provide an order of magnitude more bandwidth between storage and compute, increasing performance and reducing cost. Does your workload require specialized GPU instances or FPGA-powered instances? Amazon EC2 provides over 200 instance types to meet the needs of any workload. Amazon S3 is designed for eleven nines of data durability, making it uniquely well suited for data lake storage. In addition, S3's five storage tiers and intelligent tiering make storing vast amounts of data less expensive and easier to manage.
Speaker Notes:
You came to re:Invent to learn. There’s no need to stop when you go home.
Keep re:Inventing with resources from AWS Training and Certification for Big Data - for you and your teams
AWS Training and Certification offers training built by AWS experts for learners at a variety of skill levels who want to understand data analytics, data lakes, and associated skills in Machine Learning.
Validate your expertise with AWS Certification and earn an industry-recognized credential.
For more information, visit aws.amazon.com/training and look for the Big Data Learning Path.