Public Sector Summit
Washington, D.C.
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
High Performance Data Streaming with
Amazon Kinesis: Best Practices and Common
Pitfalls
Randy Ridgley, Principal Solutions Architect
Session 310570
Agenda
• Streaming data overview
• Introduction to Amazon Kinesis
• Serverless Stream Processing
with Amazon Kinesis Data Streams
• Enhanced Fan-out consumers with AWS Lambda
• Streaming ingest-transform-load (ITL)
with Amazon Kinesis Data Firehose
• Takeaway Demo
High-volume data produced continuously, at high velocity, from a large number of sources
"To create value, companies must derive insights in real time from a variety of data sources that are producing data at velocity and volume."
Challenges of data streaming
Difficult to set up
Hard to achieve high availability
Error prone and complex to manage
Tricky to scale
Integration requires development
Expensive to maintain
Streaming with Amazon Kinesis
Easily collect, process, and analyze video and data streams in real time
• Kinesis Video Streams: Capture and store video streams for analytics
• Kinesis Data Streams: Collect and store data streams for analytics
• Kinesis Data Firehose: Load data streams into AWS data stores
• Kinesis Data Analytics: Analyze data streams with SQL or Java
Benefits of Kinesis for Streaming
No infrastructure provisioning, no management
Automatically scales during re-shard operations
No stream consumption costs when there are no new records to process
Highly available and secure
Stream Ingestion
Data from tens of thousands of data sources can be written to a single stream
AWS Toolkits/Libraries: AWS SDK, AWS Mobile SDK, Kinesis Producer Library, Kinesis Agent
AWS Service Integrations: AWS IoT, Amazon CloudWatch Events and Logs, AWS Database Migration Service
3rd Party Offerings: Log4j, Flume, Fluentd
Stream Processing
Records are read in the order they are produced, enabling real-time analytics or streaming ETL
Processing options shown (AWS services and 3rd party): Amazon EMR, AWS Lambda, Kinesis, Kinesis Client Library + Connector Library, Apache Spark, SQL
Amazon Kinesis Data Streams
• Easy administration and low cost
• Real-time, elastic performance
• Secure, durable storage
• Available to multiple real-time analytics applications
Serverless Stream Processing
Create Kinesis Data Stream and Lambda Consumer
Apache Web Server → (Apache log data) → Amazon Kinesis Data Stream → Lambda Consumer (filter 500 errors)
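A Lambda consumer that filters 500 errors might look roughly like the sketch below. It assumes each Kinesis record carries one Apache access log line in Common Log Format; the `handler` and `is_500` names, the regular expression, and the log layout are illustrative assumptions rather than the session's demo code. The event shape (`Records[].kinesis.data`, base64-encoded) is the one Lambda delivers for Kinesis event sources.

```python
import base64
import re

# Assumed Common Log Format tail: "METHOD path PROTOCOL" status size
STATUS_RE = re.compile(r'"\S+ \S+ \S+" (\d{3}) ')


def is_500(line: str) -> bool:
    """Return True when the access log line reports HTTP status 500."""
    match = STATUS_RE.search(line)
    return match is not None and match.group(1) == "500"


def handler(event, context):
    """Decode each Kinesis record and keep only the 500-error log lines."""
    errors = []
    for record in event["Records"]:
        line = base64.b64decode(record["kinesis"]["data"]).decode("utf-8")
        if is_500(line):
            errors.append(line)
    # The return value is only useful for local testing; a real consumer
    # would forward or store the filtered lines instead.
    return errors
```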
Create Kinesis Stream
Kinesis Data Streams: Standard consumers
Data Producer → Kinesis Data Stream (Shards 1 to n) → Consumer Application A, via GetRecords()
• Data Producer: up to 1 MB or 1000 records per second, per shard
• GetRecords(): 5 transactions per second, per shard
• Data: 2 MB per second, per shard
With only one consumer application, records can be retrieved every 200 ms.
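The per-shard limits above translate directly into a minimum shard count. The following is a minimal sizing sketch using only the three figures quoted on the slide; the function name and the sample workload are hypothetical, and real sizing would also need headroom for traffic spikes and uneven partition keys.

```python
import math

# Per-shard limits quoted on the slide.
INGEST_MB_PER_SEC = 1.0
INGEST_RECORDS_PER_SEC = 1000
EGRESS_MB_PER_SEC = 2.0


def shards_needed(write_mb_per_sec, records_per_sec, read_mb_per_sec):
    """Smallest shard count that satisfies all three per-shard limits."""
    return max(
        1,
        math.ceil(write_mb_per_sec / INGEST_MB_PER_SEC),
        math.ceil(records_per_sec / INGEST_RECORDS_PER_SEC),
        math.ceil(read_mb_per_sec / EGRESS_MB_PER_SEC),
    )


# Hypothetical workload: 3.5 MB/s written as 2,500 records/s, 7 MB/s read.
print(shards_needed(write_mb_per_sec=3.5, records_per_sec=2500, read_mb_per_sec=7.0))  # 4
```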
Kinesis Data Streams: Standard consumers
Data Producer → Kinesis Data Stream (Shards 1 to n) → Consumer Applications A, B, C, D, and E
With more consumer applications, propagation delay increases. For example, with 5 consumer applications, each can only retrieve records once per second, at no more than 400 KBps (<= 400 KBps).
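The arithmetic behind these figures is simply the per-shard limits divided evenly among the applications. The sketch below assumes an even split and takes 1 MB = 1000 KB so that the result matches the slide's 400 KBps; the function name is hypothetical.

```python
# Per-shard limits for standard (shared-throughput) consumers, from the slides.
GET_RECORDS_CALLS_PER_SEC = 5
EGRESS_KB_PER_SEC = 2000  # 2 MB/s, taking 1 MB = 1000 KB to match the slide


def per_consumer_share(consumer_apps: int):
    """Return (seconds between polls, KB/s) available to each application."""
    if consumer_apps < 1:
        raise ValueError("need at least one consumer application")
    poll_interval_sec = consumer_apps / GET_RECORDS_CALLS_PER_SEC
    throughput_kb_per_sec = EGRESS_KB_PER_SEC / consumer_apps
    return poll_interval_sec, throughput_kb_per_sec


for apps in (1, 5):
    interval, kbps = per_consumer_share(apps)
    print(f"{apps} app(s): poll every {interval * 1000:.0f} ms, up to {kbps:.0f} KB/s each")
```

With one application this gives a 200 ms polling interval; with five it gives one poll per second and 400 KB/s each.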
Beware poison messages
Lambda checkpoints upon the success of each batch
Failed batches are retried indefinitely (until the bad record expires from
the shard)
Data producer → Kinesis Data Streams (300 records) → Lambda Consumer, Function A (instance 1), batch size = 200
Function A (instance 1) is invoked again with the same batch; this continues until record expiration
Capture and log exceptions
Catch exceptions and log for offline analysis
Data producer → Kinesis Data Streams (300 records) → Lambda Consumer, Function A (instance 1), batch size = 200
✔ Catch exceptions and log to Amazon CloudWatch Logs
✔ Return successfully from Lambda function
Ensure processing moves forward by catching exceptions and returning successfully
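One way to apply this pattern is sketched below: wrap the per-record work in a try/except, log the failure, and let the handler return normally. The `process` function and its "status" check are hypothetical stand-ins for real business logic; the event shape is the standard Kinesis event that Lambda delivers, and output written through `logging` in a Lambda function ends up in CloudWatch Logs.

```python
import base64
import json
import logging

logger = logging.getLogger(__name__)


def process(payload) -> None:
    """Hypothetical business logic; raises on a record it cannot handle."""
    if "status" not in payload:
        raise ValueError("record has no status field")


def handler(event, context):
    """Process every record; log failures instead of failing the whole batch."""
    failed = 0
    for record in event["Records"]:
        try:
            raw = base64.b64decode(record["kinesis"]["data"])
            process(json.loads(raw))
        except Exception:
            failed += 1
            # Logs the traceback plus enough context for offline analysis.
            logger.exception(
                "skipping bad record %s", record["kinesis"].get("sequenceNumber")
            )
    # Returning normally lets Lambda checkpoint past the poison record.
    return {"processed": len(event["Records"]) - failed, "failed": failed}
```

The trade-off is that a skipped record is not retried, so the log entry should carry whatever is needed to reprocess it later.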
Create Lambda Kinesis Data Stream Consumer
Kinesis Data Streams: Enhanced fan-out consumers
Consumers do not poll. Messages are pushed to the consumer as they arrive.
Data Producer → Kinesis Data Stream (Shard 1) → Consumer Application A, via SubscribeToShard()
SubscribeToShard() uses HTTP/2
• Connection persists for up to 5 minutes
• Data is pushed to the consumer: one or more records pushed in an array
Kinesis Data Streams: Enhanced fan-out consumers
Each consumer application gets dedicated 2 MB per second egress, per shard
Data Producer → Kinesis Data Stream (Shard 1) → one EFO pipe per consumer → Consumer Application A, Consumer Application B
Each consumer application is registered with RegisterStreamConsumer() and gets its own EFO pipe
Kinesis Data Streams: Enhanced fan-out Lambda consumers
Data Producer → Kinesis Data Stream (Shard 1) → EFO pipes → Lambda service → Invoke function
Records: <= configured batch size
Create Kinesis Consumer
aws kinesis register-stream-consumer \
  --stream-arn arn:aws:kinesis:{{region}}:{{accountId}}:stream/demo-stream \
  --consumer-name lastUpdateConsumer
Kinesis Data Streams: Enhanced fan-out
When to use standard consumers:
• Total number of consuming applications is low (< 3)
• Consumers are not latency-sensitive
• Minimize cost
When to use enhanced fan-out consumers:
• Multiple consumer applications for the same Kinesis Data Stream
  Default limit of 5 registered consuming applications. More can be supported with a service limit increase request.
• Low-latency requirements for data processing
  Messages are typically delivered to a consumer in less than 70 ms
Streaming ingest-transform-load (ITL)
with Kinesis Data Firehose
Amazon Kinesis Data Firehose
• Zero administration and seamless elasticity
• Direct-to-data store integration
• Serverless continuous data transformations
• Near real-time
Enrich streaming data
Data producer → Kinesis Data Firehose
Incoming record:
{
"ip_addr": "1.2.3.4",
..
}
Enriched record:
{
"ip_addr": "1.2.3.4",
"city": "Boston",
"state": "MA",
..
}
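Enrichment of this kind is typically done in a Firehose data-transformation Lambda. The sketch below follows the transformation contract (each input record has a `recordId` and base64 `data`; each output record echoes the `recordId` with a `result` and re-encoded `data`). The `GEO_BY_IP` table is a hypothetical stand-in for a real IP geolocation lookup, and the handler name is an assumption.

```python
import base64
import json

# Hypothetical lookup table standing in for a real IP geolocation service.
GEO_BY_IP = {"1.2.3.4": {"city": "Boston", "state": "MA"}}


def handler(event, context):
    """Firehose transformation handler that adds city/state to each record."""
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        # Unknown addresses pass through unchanged.
        payload.update(GEO_BY_IP.get(payload.get("ip_addr"), {}))
        data = (json.dumps(payload) + "\n").encode("utf-8")
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(data).decode("utf-8"),
        })
    return {"records": output}
```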
Filter streaming data
Data producer → Kinesis Data Firehose
{
"type": "info",
..
}
{
"type": "error",
..
}
{
"type": "error",
..
}
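Filtering can be expressed in a Firehose transformation Lambda by marking unwanted records as `Dropped`. This is a minimal sketch that assumes the goal is to keep only records whose `type` is "error"; that reading of the slide and the handler name are assumptions.

```python
import base64
import json


def handler(event, context):
    """Keep records whose type is "error"; mark everything else as Dropped."""
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        keep = payload.get("type") == "error"
        output.append({
            "recordId": record["recordId"],
            "result": "Ok" if keep else "Dropped",
            "data": record["data"],  # payload is passed through unchanged
        })
    return {"records": output}
```

Every input `recordId` still appears in the response; dropped records are simply not delivered to the destination.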
Convert streaming data
Data producer → Kinesis Data Firehose
[Wed Oct 11 14:32:52 2017] [error] [client 127.0.0.1]
{
"date": "2017/10/11 14:32:52",
"status": "error",
"source": "127.0.0.1"
}
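The conversion itself can be sketched as a small parsing function. The regular expression below assumes the Apache error log layout shown above (timestamp, level, client address, each in square brackets); the function and pattern names are hypothetical, and such a function would normally run inside a Firehose transformation Lambda.

```python
import json
import re
from datetime import datetime

# Assumed layout: [timestamp] [level] [client address], as on the slide.
LINE_RE = re.compile(
    r"\[(?P<date>[^\]]+)\] \[(?P<status>\w+)\] \[client (?P<source>[^\]]+)\]"
)


def to_json(line: str) -> str:
    """Convert one Apache error log line into the JSON shape shown above."""
    match = LINE_RE.match(line)
    if match is None:
        raise ValueError(f"unrecognized log line: {line!r}")
    when = datetime.strptime(match["date"], "%a %b %d %H:%M:%S %Y")
    return json.dumps({
        "date": when.strftime("%Y/%m/%d %H:%M:%S"),
        "status": match["status"],
        "source": match["source"],
    })


print(to_json("[Wed Oct 11 14:32:52 2017] [error] [client 127.0.0.1]"))
```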
Kinesis Data Firehose: Record format conversion
• Convert to Parquet or ORC before delivery to S3
• Reduces file sizes and provides an optimal format for Hadoop usage
Kinesis Data Firehose → Amazon S3
Data Lake metadata catalog, ETL, and data prep
with AWS Glue
Record format conversion
Data Producer → (JSON data) → Kinesis Data Firehose → (convert to columnar format) → Amazon S3
Schema: AWS Glue Data Catalog
Failed records: /failed
Custom Amazon Simple Storage Service (Amazon S3) Prefixes
Data Producer → (JSON data) → Kinesis Data Firehose → Amazon S3
Hive compatible partition naming: [column_name = column_value]
e.g. s3://datalake-example/logs/year=2019/month=6/
Partitioning Examples
Partition by year, month and day
• s3://datalake-example/logs/parquet/year=2019/month=6/day=1/
• s3://datalake-example/logs/parquet/year=2019/month=6/day=2/
Can optimize queries like:
SELECT request_method, response_code, access_time FROM logs
WHERE year = 2019 AND month = 6 AND day = 2
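A query like this only scans the prefix that matches its partition filter, so the writer has to lay objects out under those `column=value` paths. A small sketch of building such a prefix follows; the function name is hypothetical and the bucket path is the example one from the slide.

```python
from datetime import date


def partition_prefix(base: str, day: date) -> str:
    """Build a Hive-style year=/month=/day= prefix under `base`."""
    return f"{base.rstrip('/')}/year={day.year}/month={day.month}/day={day.day}/"


print(partition_prefix("s3://datalake-example/logs/parquet", date(2019, 6, 2)))
# s3://datalake-example/logs/parquet/year=2019/month=6/day=2/
```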
Source Record Amazon S3 Backup
Data Producer → (JSON data) → Kinesis Data Firehose, which calls an enrichment service
Kinesis Data Firehose → Amazon S3 (processed data)
Kinesis Data Firehose → Amazon S3 (source record backup)
With Source Record Amazon S3 backup you can filter, enrich, and convert records while maintaining the raw incoming records.
Serverless Data Lake Ingestion Architecture
Kinesis Data Firehose conversion, aggregation, and persistence to your Data Lake
Apache log data → Amazon Kinesis Data Stream → Amazon Kinesis Data Firehose (with pre-processing service) → Amazon S3
Parquet data is written to Amazon S3 under s3://datalake/dataset/processed/year=2019/month=6/
Create AWS Glue Table
CREATE EXTERNAL TABLE `p_streaming_logs`(
`host_address` string,
`request_time` string,
`request_method` string,
`request_path` string,
`request_protocol` string,
`response_code` string,
`response_size` string,
`referrer_host` string,
`user_agent` string)
PARTITIONED BY (
`year` int,
`month` int,
`day` int,
`hour` int)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://{{S3 bucket}}/weblogs/processed/'
TBLPROPERTIES (
'classification'='parquet',
'compressionType'='none')
Create Kinesis Data Firehose
Configure Name and Source
Create Kinesis Data Firehose
Configure Process Records
Create Kinesis Data Firehose
Configure Settings – Query Results
Serverless Data Lake Ingestion Architecture - Review
Apache Web Server → (Apache log data) → Amazon Kinesis Data Stream → Lambda Consumer (filter 500 errors)
Amazon Kinesis Data Stream → (Apache log data) → Amazon Kinesis Data Firehose (with pre-processing service) → Amazon S3
Parquet data is written to Amazon S3 under s3://datalake/dataset/processed/year=2019/month=6/
Data streams are foundational to many
of the Amazon core functions
Amazon Go
video analytics
Thank you!
Randy Ridgley, Principal Solutions Architect
http://bit.ly/2H98RMC
Data Lake Architecture
Ingestion: Amazon Kinesis Data Firehose, Amazon EC2
Data Lake: Data Catalog (AWS Glue)
Operational Workloads: Redshift, EMR (Operational Reporting and Dashboarding Users)
Self-Service Workloads: Athena (Ad-hoc/Self-service Users)
Data Science/Analytics Workloads: EMR, SageMaker (Data Science, ML/AI Users)
An Overview of Data Streaming Technology
Stream Ingestion
Deliver large volumes of high velocity data from a variety of sources into a stream
Stream Storage
Store large volumes of high velocity data and make it highly available for processing
Stream Processing
Real-time streaming analytics and machine learning
Analyze data streams in real-time, use ML (e.g. Anomaly detection use cases)
Streaming ETL
Transform and deliver data into data lakes and warehouses for near real-time analysis (or durable
storage)
Serverless Data Lake Ingestion Architecture
Apache Web Server → (Apache log data) → Amazon Kinesis Data Stream → Lambda Consumer (filter 500 errors)
Amazon Kinesis Data Stream → (Apache log data) → Amazon Kinesis Data Firehose (with pre-processing service) → Amazon S3
Parquet data is written to Amazon S3 under s3://datalake/dataset/processed/year=2019/month=6/
Example Custom Prefixes
Prefix:
datalake/dataset/raw/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/
ErrorOutputPrefix:
datalake/dataset/failed/!{firehose:error-output-type}/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/
Resulting Hive compliant S3 Prefix:
datalake/dataset/raw/year=2010/month=06/day=01/
Resulting Hive compliant S3 ErrorOutputPrefix:
datalake/dataset/failed/processing-failed/year=2018/month=06/day=01/
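To see how the `!{namespace:value}` expressions turn into concrete prefixes, the following sketch mimics the substitution locally. It is an illustration only, not Firehose's implementation: it supports just the three date patterns and the `error-output-type` expression used above, and the function name and sample date are assumptions.

```python
import re
from datetime import datetime, timezone

# Only the three Java date patterns used on the slide are mapped here.
JAVA_TO_STRFTIME = {"yyyy": "%Y", "MM": "%m", "dd": "%d"}


def expand_prefix(template: str, arrival: datetime, error_type: str = "") -> str:
    """Substitute !{timestamp:...} and !{firehose:error-output-type} expressions."""
    def replace(match):
        namespace, value = match.group(1), match.group(2)
        if namespace == "timestamp" and value in JAVA_TO_STRFTIME:
            return arrival.strftime(JAVA_TO_STRFTIME[value])
        if namespace == "firehose" and value == "error-output-type":
            return error_type
        raise ValueError(f"unsupported expression: {match.group(0)}")

    return re.sub(r"!\{(\w+):([\w-]+)\}", replace, template)


arrival = datetime(2018, 6, 1, tzinfo=timezone.utc)  # hypothetical arrival time
print(expand_prefix(
    "datalake/dataset/failed/!{firehose:error-output-type}/"
    "year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/",
    arrival,
    error_type="processing-failed",
))
```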

Contenu connexe

Plus de Amazon Web Services

Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsAmazon Web Services
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareAmazon Web Services
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSAmazon Web Services
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAmazon Web Services
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareAmazon Web Services
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWSAmazon Web Services
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckAmazon Web Services
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without serversAmazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...Amazon Web Services
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceAmazon Web Services
 
Come costruire un'architettura Serverless nel Cloud AWS
Come costruire un'architettura Serverless nel Cloud AWSCome costruire un'architettura Serverless nel Cloud AWS
Come costruire un'architettura Serverless nel Cloud AWSAmazon Web Services
 
AWS Serverless per startup: come innovare senza preoccuparsi dei server
AWS Serverless per startup: come innovare senza preoccuparsi dei serverAWS Serverless per startup: come innovare senza preoccuparsi dei server
AWS Serverless per startup: come innovare senza preoccuparsi dei serverAmazon Web Services
 
Crea dashboard interattive con Amazon QuickSight
Crea dashboard interattive con Amazon QuickSightCrea dashboard interattive con Amazon QuickSight
Crea dashboard interattive con Amazon QuickSightAmazon Web Services
 
Costruisci modelli di Machine Learning con Amazon SageMaker Autopilot
Costruisci modelli di Machine Learning con Amazon SageMaker AutopilotCostruisci modelli di Machine Learning con Amazon SageMaker Autopilot
Costruisci modelli di Machine Learning con Amazon SageMaker AutopilotAmazon Web Services
 
Migra le tue file shares in cloud con FSx for Windows
Migra le tue file shares in cloud con FSx for Windows Migra le tue file shares in cloud con FSx for Windows
Migra le tue file shares in cloud con FSx for Windows Amazon Web Services
 
La tua organizzazione è pronta per adottare una strategia di cloud ibrido?
La tua organizzazione è pronta per adottare una strategia di cloud ibrido?La tua organizzazione è pronta per adottare una strategia di cloud ibrido?
La tua organizzazione è pronta per adottare una strategia di cloud ibrido?Amazon Web Services
 
Protect your applications from DDoS/BOT & Advanced Attacks
Protect your applications from DDoS/BOT & Advanced AttacksProtect your applications from DDoS/BOT & Advanced Attacks
Protect your applications from DDoS/BOT & Advanced AttacksAmazon Web Services
 
Track 6 Session 6_ 透過 AWS AI 服務模擬、部署機器人於產業之應用
Track 6 Session 6_ 透過 AWS AI 服務模擬、部署機器人於產業之應用Track 6 Session 6_ 透過 AWS AI 服務模擬、部署機器人於產業之應用
Track 6 Session 6_ 透過 AWS AI 服務模擬、部署機器人於產業之應用Amazon Web Services
 

Plus de Amazon Web Services (20)

Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 
Come costruire un'architettura Serverless nel Cloud AWS
Come costruire un'architettura Serverless nel Cloud AWSCome costruire un'architettura Serverless nel Cloud AWS
Come costruire un'architettura Serverless nel Cloud AWS
 
AWS Serverless per startup: come innovare senza preoccuparsi dei server
AWS Serverless per startup: come innovare senza preoccuparsi dei serverAWS Serverless per startup: come innovare senza preoccuparsi dei server
AWS Serverless per startup: come innovare senza preoccuparsi dei server
 
Crea dashboard interattive con Amazon QuickSight
Crea dashboard interattive con Amazon QuickSightCrea dashboard interattive con Amazon QuickSight
Crea dashboard interattive con Amazon QuickSight
 
Costruisci modelli di Machine Learning con Amazon SageMaker Autopilot
Costruisci modelli di Machine Learning con Amazon SageMaker AutopilotCostruisci modelli di Machine Learning con Amazon SageMaker Autopilot
Costruisci modelli di Machine Learning con Amazon SageMaker Autopilot
 
Migra le tue file shares in cloud con FSx for Windows
Migra le tue file shares in cloud con FSx for Windows Migra le tue file shares in cloud con FSx for Windows
Migra le tue file shares in cloud con FSx for Windows
 
La tua organizzazione è pronta per adottare una strategia di cloud ibrido?
La tua organizzazione è pronta per adottare una strategia di cloud ibrido?La tua organizzazione è pronta per adottare una strategia di cloud ibrido?
La tua organizzazione è pronta per adottare una strategia di cloud ibrido?
 
Protect your applications from DDoS/BOT & Advanced Attacks
Protect your applications from DDoS/BOT & Advanced AttacksProtect your applications from DDoS/BOT & Advanced Attacks
Protect your applications from DDoS/BOT & Advanced Attacks
 
Track 6 Session 6_ 透過 AWS AI 服務模擬、部署機器人於產業之應用
Track 6 Session 6_ 透過 AWS AI 服務模擬、部署機器人於產業之應用Track 6 Session 6_ 透過 AWS AI 服務模擬、部署機器人於產業之應用
Track 6 Session 6_ 透過 AWS AI 服務模擬、部署機器人於產業之應用
 

High Performance Data Streaming with Amazon Kinesis: Best Practices and Common Pitfalls

  • 1. P U B L I C S E C T O R S U M M I T Washingt on D .C
  • 2. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.P U B L I C S E C T O R S U M M I T High Performance Data Streaming with Amazon Kinesis: Best Practices and Common Pitfalls Randy Ridgley, Principal Solutions Architect S e s s i o n 3 1 0 5 7 0
  • 3. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.P U B L I C S E C T O R S U M M I T Agenda • Streaming data overview • Introduction to Amazon Kinesis • Serverless Stream Processing with Amazon Kinesis Data Streams • Enhanced Fan-out consumers with AWS Lambda • Streaming ingest-transform-load (ITL) with Amazon Kinesis Data Firehose • Takeaway Demo
  • 4. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.P U B L I C S E C T O R S U M M I T
  • 5. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.P U B L I C S E C T O R S U M M I T High volume data produced continuously from a large of velocity
  • 6. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.P U B L I C S E C T O R S U M M I T ” ” To create value, companies must derive insights from a variety of data sources that are producing data at velocity and volume real-time
  • 7. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.P U B L I C S E C T O R S U M M I T Challenges of data streaming Difficult to setup Hard to achieve high availability Error prone and complex to manage Tricky to scale Integration requires development Expensive to maintain
  • 8. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.P U B L I C S E C T O R S U M M I T
  • 9. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.P U B L I C S E C T O R S U M M I T Streaming with Amazon Kinesis Easily collect, process, and analyze video and data streams in real time Easily collect, process, and analyze video and data streams in real time Kinesis Video Streams Capture and store video streams for analytics Kinesis Data Streams Collect and store data streams for analytics Kinesis Data Firehose Load data streams into AWS data stores Kinesis Data Analytics Analyze data streams with SQL or Java
  • 10. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.P U B L I C S E C T O R S U M M I T Benefits of Kinesis for Streaming No infrastructure provisioning, no management Automatically scales during re-shard operations No stream consumption costs when no new records to process Highly available and secure
  • 11. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.P U B L I C S E C T O R S U M M I T Stream Ingestion Data from tens of thousands of data sources can be written to a single stream AWS IoT Amazon CloudWatch Events and Logs AWS SDK LOG4J Flume Fluentd AWS Mobile SDK Kinesis Producer Library Kinesis Agent AWS Toolkits/Libraries AWS Service Integrations 3rd Party Offerings Amazon Database Migration Service
  • 12. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.P U B L I C S E C T O R S U M M I T Stream Processing Records are read in the order they are produced enabling real-time analytics or streaming ETL Amazon EMR AWS Lambda Kinesis Kinesis Client Library + Connector Library AWS Services Apache Spark 3rd party SQL
  • 13. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.P U B L I C S E C T O R S U M M I T
  • 14. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.P U B L I C S E C T O R S U M M I T Amazon Kinesis Data Streams • Easy administration and low cost • Real-time, elastic performance • Secure, durable storage • Available to multiple real-time analytics applications
  • 15. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.P U B L I C S E C T O R S U M M I T Serverless Stream Processing Create Kinesis Data Stream and Lambda Consumer Apache Web Server Apache Log data Amazon Kinesis Data Stream Filter 500 errors Lambda Consumer
  • 16. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.P U B L I C S E C T O R S U M M I T Create Kinesis Stream
  • 17. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.P U B L I C S E C T O R S U M M I T Kinesis Data Streams: Standard consumers Shard 1 Shard 2 Shard 3 Shard n Kinesis Data Stream Consumer Application A GetRecords() Data GetRecords(): 5 transactions per second, per shard Data: 2MB per second, per shard Data Producer up to 1 MB or 1000 records per second, per shard With only one consumer application, records can be retrieved every 200 ms.
  • 18. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.P U B L I C S E C T O R S U M M I T Kinesis Data Streams: Standard consumers Shard 1 Shard 2 Shard 3 Shard n Kinesis Data Stream Consumer Application A Data Producer Consumer Application B Consumer Application C Consumer Application D Consumer Application E With more consumer applications, propagation delay increases. For example, with 5 consumer applications, each can only retrieve records once per second, and less than 400 KBps. <= 400 KBps
  • 19. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.P U B L I C S E C T O R S U M M I T Beware poison messages Lambda checkpoints upon the success of each batch Failed batches are retried indefinitely (until the bad record expires from the shard) Data producer Kinesis Data Streams Lambda Consumer Function A (instance 1) Batch size = 200 300 records . . Continues until record expiration Function A (instance 1)
  • 20. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.P U B L I C S E C T O R S U M M I T Capture and log exceptions Catch exceptions and log for offline analysis Data producer Kinesis Data Streams Lambda Consumer Function A (instance 1) Batch size = 200 300 records ✔ Function A (instance 1) ✔ Catch exceptions and log to CloudWatch Logs Amazon CloudWatch Logs Return successfully from Lambda function Ensure processing moves forward by catching exceptions and returning successfully !
  • 21. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.P U B L I C S E C T O R S U M M I T Create Lambda Kinesis Data Stream Consumer
  • 22. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.P U B L I C S E C T O R S U M M I T
  • 23. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.P U B L I C S E C T O R S U M M I T Kinesis Data Streams: Enhanced fan-out consumers Consumers do not poll. Messages are pushed to the consumer as they arrive. Shard 1 Kinesis Data Stream Data Producer Consumer Application A SubscribeToShard() Uses HTTP/2 • Up to 5 mins connection • Data pushed to consumer persist One or more records pushed in an array records
  • 24. Kinesis Data Streams: Enhanced fan-out consumers. Each consumer application gets a dedicated 2 MB per second of egress, per shard. [Diagram: Consumer Applications A and B each call RegisterStreamConsumer() and each receive their own EFO pipe from Shard 1.]
  • 25. Kinesis Data Streams: Enhanced fan-out Lambda consumers. [Diagram: a data producer writes to Shard 1 of a Kinesis data stream; each EFO pipe feeds the Lambda service, which invokes the function with records <= the configured batch size.]
  • 26. Create Kinesis Consumer: aws kinesis register-stream-consumer --stream-arn arn:aws:kinesis:{{region}}:{{account-id}}:stream/demo-stream --consumer-name lastUpdateConsumer
  • 27. Kinesis Data Streams: Enhanced fan-out. When to use standard consumers: • The total number of consuming applications is low (< 3) • Consumers are not latency-sensitive • You want to minimize cost. When to use enhanced fan-out consumers: • Multiple consumer applications read the same Kinesis data stream (default limit of 5 registered consuming applications; more can be supported with a service limit increase request) • Data processing has low-latency requirements (messages are typically delivered to a consumer in less than 70 ms).
  • 28. Streaming ingest-transform-load (ITL) with Kinesis Data Firehose
  • 29. Amazon Kinesis Data Firehose • Zero administration and seamless elasticity • Direct-to-data store integration • Serverless continuous data transformations • Near real-time
  • 30. Enrich streaming data (data producer to Kinesis Data Firehose). Input: { "ip_addr": "1.2.3.4", .. } Output: { "ip_addr": "1.2.3.4", "city": "Boston", "state": "MA", .. }
  • 31. Filter streaming data (data producer to Kinesis Data Firehose). { "type": "info", .. } { "type": "error", .. } { "type": "error", .. }
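One way to implement this kind of filtering is a Kinesis Data Firehose data-transformation Lambda function. The sketch below assumes the intent of the slide is to keep only records whose `type` is `"error"`; that rule is an illustration, not something the deck spells out. Each returned record carries the original `recordId`, a `result` of `Ok`, `Dropped`, or `ProcessingFailed`, and base64-encoded `data`.

```python
import base64
import json


def handler(event, context):
    """Firehose transformation handler: keep 'error' records, drop the rest."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            # Assumed filter rule: only error records continue to delivery.
            result = "Ok" if payload.get("type") == "error" else "Dropped"
        except (ValueError, AttributeError):
            # Undecodable, non-JSON, or non-object payloads are reported as
            # failed rather than silently dropped.
            result = "ProcessingFailed"
        output.append({
            "recordId": record["recordId"],
            "result": result,
            "data": record["data"],  # payload is passed through unchanged
        })
    return {"records": output}
```

Every incoming `recordId` must appear in the response, which is why dropped records are returned with a `Dropped` status instead of being omitted.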
  • 32. Convert streaming data (data producer to Kinesis Data Firehose). Input: [Wed Oct 11 14:32:52 2017] [error] [client 127.0.0.1] Output: { "date": "2017/10/11 14:32:52", "status": "error", "source": "127.0.0.1" }
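The conversion shown on this slide could be done with a small parsing function like the one below, called from a transformation Lambda. This is a sketch that handles only the bracketed prefix shown in the example; the function name and regular expression are illustrative, and real Apache error logs carry a message after the prefix that this sketch ignores.

```python
import re
from datetime import datetime

# Matches "[<date>] [<level>] [client <ip>]" at the start of a log line.
LOG_PATTERN = re.compile(
    r"^\[(?P<date>[^\]]+)\] \[(?P<status>[^\]]+)\] \[client (?P<source>[^\]]+)\]"
)


def convert_log_line(line: str) -> dict:
    """Convert an Apache-style error log prefix into a JSON-ready dict."""
    match = LOG_PATTERN.match(line)
    if match is None:
        raise ValueError(f"unrecognized log line: {line!r}")
    # Example input date: "Wed Oct 11 14:32:52 2017"
    parsed = datetime.strptime(match.group("date"), "%a %b %d %H:%M:%S %Y")
    return {
        "date": parsed.strftime("%Y/%m/%d %H:%M:%S"),
        "status": match.group("status"),
        "source": match.group("source"),
    }


if __name__ == "__main__":
    print(convert_log_line("[Wed Oct 11 14:32:52 2017] [error] [client 127.0.0.1]"))
```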
  • 33. Kinesis Data Firehose: Record format conversion • Convert to Parquet or ORC before delivery to S3 • Reduces file sizes and produces a format well suited to Hadoop usage. [Diagram: Kinesis Data Firehose to Amazon S3.]
  • 34. Data Lake metadata catalog, ETL, and data prep with AWS Glue
  • 35. Record format conversion. [Diagram: a data producer sends JSON data to Kinesis Data Firehose, which takes the schema from the AWS Glue Data Catalog, converts the records to a columnar format, and delivers them to Amazon S3; a /failed prefix is also shown.]
  • 36. Custom Amazon Simple Storage Service (Amazon S3) prefixes. Hive-compatible partition naming uses [column_name=column_value], e.g. s3://datalake-example/logs/year=2019/month=6/ [Diagram: a data producer sends JSON data to Kinesis Data Firehose, which writes to Amazon S3.]
  • 37. Partitioning examples. Partition by year, month and day: • s3://datalake-example/logs/parquet/year=2019/month=6/day=1/ • s3://datalake-example/logs/parquet/year=2019/month=6/day=2/ This can optimize queries like: SELECT request_method, response_code, access_time FROM logs WHERE year = 2019 AND month = 6 AND day = 2
  • 38. Source Record Amazon S3 Backup. With source record Amazon S3 backup, you can filter, enrich, and convert records while maintaining the raw incoming records. [Diagram labels: data producer, JSON data, Kinesis Data Firehose, enrichment service, processed data, Amazon S3 (two buckets).]
  • 39. Serverless Data Lake Ingestion Architecture. Kinesis Data Firehose conversion, aggregation, and persistence to your data lake. [Diagram labels: Apache log data, Amazon Kinesis Data Stream, pre-processing service, Amazon Kinesis Data Firehose, Amazon S3, Amazon S3 Parquet data at s3://datalake/dataset/processed/year=2019/month=6/.]
  • 40. Create AWS Glue Table: CREATE EXTERNAL TABLE `p_streaming_logs`( `host_address` string, `request_time` string, `request_method` string, `request_path` string, `request_protocol` string, `response_code` string, `response_size` string, `referrer_host` string, `user_agent` string) PARTITIONED BY ( `year` int, `month` int, `day` int, `hour` int) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' LOCATION 's3://{{S3 bucket}}/weblogs/processed/' TBLPROPERTIES ( 'classification'='parquet', 'compressionType'='none')
  • 41. Create Kinesis Data Firehose: Configure Name and Source
  • 42. Create Kinesis Data Firehose: Configure Process Records
  • 43. Create Kinesis Data Firehose: Configure Settings – Query Results
  • 44. Serverless Data Lake Ingestion Architecture - Review. [Diagram labels: Apache Web Server, Apache log data, Amazon Kinesis Data Stream, Lambda consumer (filter 500 errors), pre-processing service, Amazon Kinesis Data Firehose, Amazon S3, Amazon S3 Parquet data at s3://datalake/dataset/processed/year=2019/month=6/.]
  • 45. Data streams are foundational to many of the Amazon core functions. Amazon Go video analytics
  • 46. Thank you! Randy Ridgley, Principal Solutions Architect http://bit.ly/2H98RMC
  • 48. Data Lake Architecture. [Diagram labels: Data Lake; Data Catalog (AWS Glue); Amazon Kinesis Data Firehose; Amazon EC2; workloads: Operational, Data Science/Analytics, Self-Service; services: Redshift, EMR, Athena, SageMaker; users: Operational Reporting and Dashboarding Users, Ad-hoc/Self-service Users, Data Science, ML/AI Users.]
  • 49. An Overview of Data Streaming Technology • Stream Ingestion: deliver large volumes of high-velocity data from a variety of sources into a stream • Stream Storage: store large volumes of high-velocity data and make it highly available for processing • Stream Processing (real-time streaming analytics and machine learning): analyze data streams in real time, use ML (e.g. anomaly detection use cases) • Streaming ETL: transform and deliver data into data lakes and warehouses for near real-time analysis (or durable storage)
  • 50. Serverless Data Lake Ingestion Architecture. [Diagram labels: Apache Web Server, Apache log data, Amazon Kinesis Data Stream, Lambda consumer (filter 500 errors), pre-processing service, Amazon Kinesis Data Firehose, Amazon S3, Amazon S3 Parquet data at s3://datalake/dataset/processed/year=2019/month=6/.]
  • 51. Example Custom Prefixes. Prefix: datalake/dataset/raw/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/ ErrorOutputPrefix: datalake/dataset/failed/!{firehose:error-output-type}/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/ Resulting Hive-compliant S3 prefix: datalake/dataset/raw/year=2010/month=06/day=01/ Resulting Hive-compliant S3 ErrorOutputPrefix: datalake/dataset/failed/processing-failed/year=2018/month=06/day=01/
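To see how such a template expands, the sketch below renders the two prefix expressions locally. It is only an illustration of the substitution, not how Kinesis Data Firehose is implemented: `render_prefix` is a hypothetical helper, it supports only the `yyyy`, `MM`, and `dd` timestamp patterns used on the slide, and the timestamp is supplied by the caller.

```python
import re
from datetime import datetime, timezone

# Only the timestamp patterns that appear on the slide are mapped here.
_STRFTIME = {"yyyy": "%Y", "MM": "%m", "dd": "%d"}
_EXPRESSION = re.compile(r"!\{(timestamp|firehose):([^}]+)\}")


def render_prefix(template: str, when: datetime, error_type: str = "") -> str:
    """Expand !{timestamp:...} and !{firehose:error-output-type} expressions."""

    def substitute(match: re.Match) -> str:
        namespace, value = match.group(1), match.group(2)
        if namespace == "timestamp":
            return when.strftime(_STRFTIME[value])  # KeyError if unsupported
        if value == "error-output-type":
            return error_type
        raise ValueError(f"unsupported expression: {match.group(0)}")

    return _EXPRESSION.sub(substitute, template)


if __name__ == "__main__":
    arrival = datetime(2019, 6, 1, tzinfo=timezone.utc)
    print(render_prefix(
        "datalake/dataset/raw/year=!{timestamp:yyyy}/"
        "month=!{timestamp:MM}/day=!{timestamp:dd}/",
        arrival,
    ))
```

Note that `MM` and `dd` produce zero-padded values (`month=06`), so partitions written this way differ from unpadded paths such as `month=6`.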