PPT de la presentación de Richard T. Freeman en el Meetup de BEEVA. Marzo 2017.
https://www.meetup.com/es-ES/Innovative-technology-BEEVA/events/238027581/
JustGiving | Serverless Data Pipelines, API, Messaging and Stream Processing
1. JustGiving – Serverless Data Pipelines,
API, Messaging and Stream Processing
Richard Freeman, Ph.D. Lead Data Engineer and Architect
23rd March 2017
RAVEN data science platform in AWS
2. What to Expect from the Session
• Recap of some AWS services
• Event-driven data platform at JustGiving
• Serverless computing
• Six serverless patterns
• Serverless recommendations and best practices
4. Managed Services
Amazon S3
• Distributed web
scale object storage
• Highly scalable,
reliable, and secure
• Pay only for what
you use
• Supports encryption
AWS Lambda
• Runs code in
response to triggers
• Amazon API
Gateway requests
• Serverless
• Pay only when code
runs
• Automatically scales
and high availability
Amazon Simple
Notification Service (SNS)
• Set up, operate, and
send notifications
• Publish messages from
an application and
immediately deliver them
to subscribers or other
applications
• Push messages to
mobile devices
Amazon DynamoDB
• Fast, fully-managed
NoSQL Database
Service
• Capable of handling
any amount of data
• Durable and Highly
Available
• SSD storage
• Cost Effective
5. Amazon Kinesis
Amazon Kinesis
Firehose
Easily load massive volumes
of streaming data into S3,
Amazon Redshift, and
Amazon Elasticsearch
Service
Amazon Kinesis
Analytics
Easily analyze
streaming data with
standard SQL
Amazon Kinesis
Streams
Build your own custom
applications that
process or analyze
streaming data.
Scales out using
shards
7. We are
A tech-for-good platform for
events based fundraising,
charities and crowdfunding
“Ensure no good cause
goes unfunded”
• The #1 platform for online social
giving in the world
• Raised $4.2bn in donations
• 28.5m users
• 196 countries
• 27,000 good causes
• Peaks in traffic: Ice bucket,
natural disasters
• GiveGraph
• 91 million nodes
• 0.53 billion relationships
9. Our Requirements
• Limitation in existing SQL Server data warehouse
• Long running and complex queries for data scientists
• New data sources: API, clickstream, unstructured, log, behavioural
data etc.
• Easy to add data sources and pipelines
• Reduce time spent on data preparation and experiments
Machine
learning
Graph
processing
Natural language
processing
Stream processing
Data ingestion
Data
preparation
Automated Pipelines
Insight
Predictions
Metrics
Inspire
Data-driven
Complex Impact
10. Event Driven Data Platform at JustGiving [1 of 2]
• JustGiving developed in-house analytics and data science
platform in AWS called RAVEN.
• Reporting, Analytics, Visualization, Experimental, Networks
• Uses event-driven and serverless pipelines rather than
directed acyclic graphs (DAGs) or Workflows
• Messaging, queues, pub/sub patterns
• Separate storage from compute
• Supports scalable event driven
• ETL / ELT
• Machine learning
• Natural language processing
• Graph processing
• Allows users to consume raw tables, data blocks, metrics,
KPIs, insight, reports, etc.
13. Event Driven ETL and Data Pipeline Patterns [2 of 2]
Pros
• Decouple the L in ETL / ELT
• Adds lots of flexibility to loading
• Supports incremental and multi-cluster loads
• Simple reload on failure, e.g., send an Amazon SNS/Amazon
SQS message
• Basic sequencing can be done within message
Cons:
• Does not support complex loading workflows
• No native FIFO support in SQS (support added Nov 17, 2016)
15. AWS Lambda Functions
EVENT SOURCE FUNCTION SERVICES (ANYTHING)
Data stores & streaming event sources
e.g. Amazon S3, Amazon DynamoDB,
Amazon Kinesis
Requests to endpoints
e.g. Amazon Alexa Skills, Amazon API Gateway
Changes in repositories and logs
e.g. AWS Code Commit, Amazon CloudWatch
Node.js
Python
Java
C#
Event & Message services
e.g. Amazon SNS, Cron events
16. Writing Lambda Functions
• The Basics
• AWS SDK, IAM Roles and Policies
• Python, Node.js, Java8, and C#
• Lambda handles inbound traffic and integration
• Stateless
• Use S3, DynamoDB, or other Internet storage for persistent data
• 128Mb and 1.5GB RAM
• Duration 100ms to 5min
• Runs your code in response to events or requests
• Pay for execution time in 100ms increments
• Familiar
• Use bash, processes, threads, /tmp, sockets, …
• Bring your own libraries, even native ones
17. AWS Lambda, container or EC2 instance?
Amazon EC2 Instance Containers AWS Lambda
Infrastructure You configure, maintain and
needs be built for HA
You configure, maintain
and needs be built for HA
Request-driven and AWS manages
Instance
Flexibility
Choose instance type, OS,
etc.
Choose instance type, OS,
etc.
Prioritises ease of use - one OS,
default hardware, no maintenance or
scheduled downtime
Scaling Provisioning instances and
configure auto-scaling
Provisioning containers and
configure auto-scaling
Implicit, based on requests
Launch Minutes Seconds Milliseconds to few seconds
State State or stateless State or stateless Stateless
AWS Integration Custom Custom Built-in triggers SNS, Kinesis Stream,
S3, API Gateway etc.
Cost EC2 per hour and spot Containers on EC2 clusters
per hour and spot
Per 100ms, invocation and RAM
20. Pattern 1 – Serverless Message Bus [2 of 2]
Pros
• Simple serverless and highly available service bus
• Centralised logic for transformation and routing
• Kinesis supports more than one Lambda per message
• Kinesis streams support replay
Cons:
• SNS - Scalability one lambda per SNS message
• Lambda limits: 500 MB disk, 1.5GB RAM, complete within
5min
21. Pattern 2 – Serverless Steaming Data Capture API [1 of 2]
22. Pattern 2 – Serverless Steaming Data Capture API [2 of 2]
Pros:
• Serverless real-time stream capture pipeline
• Managed API configuration and deployment
• Secure - IAM Roles
• Simple transformation
• Cost effective – pay for what you use
Cons:
• Complex transformation
• API Gateway and Kinesis soft limits need to be aligned
23. Pattern 3 - Serverless Small Data Pipeline [1 of 2]
24. Pattern 3 - Serverless Small Data Pipeline [2 of 2]
Pros:
• Batch and incremental possible
• Useful for small and infrequently added files
• Can preserve file name
• Can be used for projection, enrichment, parsing, etc.
Cons:
• Lambda limits: 500 MB disk, 1.5GB RAM, complete within 5min
• Complex joins
• One file on input and output
25. Pattern 4 - Serverless Streamify File and Merge into Larger Files [1 of 2]
26. Pattern 4 - Serverless Streamify File and Merge into Larger Files [2 of 2]
Pros:
• Batch and incremental possible
• Useful for small and frequently added files
• Lambda and Amazon Kinesis Firehose - Serverless way to
persist a stream
Cons:
• Lambda limits: 500 MB disk, 1.5GB RAM, complete within 5min
• Firehose limits: 15 min, 128 MB buffer
• Complex joins
27. Pattern 5 - Serverless Streaming Analytics and Persist Stream [1 of 2]
28. Pattern 5 - Serverless Streaming Analytics and Persist Stream [2 of 2]
Pros:
• Stream processing without a running cluster
• Lambda and Amazon Kinesis Analytics automatic scaling
• A static website can access DynamoDB metrics
• Choice of language: Python, SQL, Node.js, Java8, and C#
• Code can be changed without interruption
Cons:
• Lambda limits: 500 MB disk, 1.5GB RAM, complete within 5min
• Firehose limits: 15 min, 128 MB buffer
• Complex stream joins
29. Pattern 6 – Serverless Data and Process API [1 of 2]
30. Pattern 6 – Serverless Data and Process API [2 of 2]
Pros:
• Serverless data API
• DynamoDB uses IAM roles
• Choice of language: Python, Node.js, C#, and Java8
• Code can be changed without interruption
Cons:
• RDS needs encrypted user/pass and VPC
• Lambda limits: 500 MB disk, 1.5GB RAM, complete within 5min
• Complex or compute intensive queries
32. Lambda Functions Are Good For…
• REST API micro-services backend
• Rows & events
• Parsing, projecting, enriching, filtering etc.
• REST and other AWS integration
• Built in triggers API Gateway, S3, SNS,
CloudWatch, Kinesis Streams etc.
• Secure
• IAM Roles, VPC and KMS
• Highly Available and auto-scaling
• Simple packaging and deployment
• For streams, easily modify code
33. Serverless Data Processing Patterns
Pattern Serverless Spark on EMR when …
Pattern 3 - Serverless
small data pipeline
• Small and infrequently added files
• Preserves existing file names
• Big files > 400MB
• Joining data sources
• Merge small files into one
Pattern 4 - Serverless
streamify file and
merge into larger files
• Merge small frequently added files
into a stream or larger ones in S3
• Streaming analytics and time series
• Big files > 400MB
• Joining streams or data
sources
Pattern 5 - Serverless
analytics and persist
stream
• Streaming analytics and time series
• Real-time dashboard
• Persist an Amazon Kinesis stream
• Joining streams or data
sources
34. Understand the Lambda limits
• Memory, local disk, execution time
• Limited Tooling – Improving
• Throttle – review the architecture and exception DLQ
• Concurrent Lambda functions execution
• AWS account
• DynamoDB and Amazon Kinesis shard iterators
• Complex joins harder to implement – better in Spark on EMR
• ORC and Parquet – better in Spark on EMR
35. Lambda Recommendations
• Test with production volumes
• Think deployment and version control
• Frameworks Serverless, Chalice etc.
• New
• AWS Serverless Application Model (SAM)
• Environment variables
• Dead Letter Queue
• Program defensively
Exceptions, e.g. invalid records
• Reduce execution time
• Minimize logging
• Optimize code, e.g. aggregate in memory to reduce writes
36. 1-2-3 of Lambda
1. Create IAM Role
• Access resources, e.g. DynamoDB, Kinesis
Streams etc.
• Lambda execution
2. Create Lambda Function
• Configuration and function code
• Test manually
3. Deploy package
• Enable event Trigger
• Monitor
First 1 million
requests per
month are free !
37. Other Serverless Example Use Cases
Thumbnail Generation
• http://docs.aws.amazon.com/lambda/latest/dg/with-s3-example.html
Mobile Application Backend
• http://docs.aws.amazon.com/lambda/latest/dg/with-on-demand-custom-android-example.html
Authorizing Access Through a Proxy Resource
• https://aws.amazon.com/blogs/compute/authorizing-access-through-a-proxy-resource-to-amazon-api-gateway-and-aws-lambda-
using-amazon-cognito-user-pools/
Serverless IoT Back Ends
• https://www.slideshare.net/AmazonWebServices/aws-reinvent-2016-serverless-iot-back-ends-iot401
Creating Voice Experiences with Alexa Skills
• https://www.slideshare.net/AmazonWebServices/aws-reinvent-2016-workshop-creating-voice-experiences-with-alexa-skills-from-
idea-to-testing-in-two-hours-alx203
Lambda with AWS CloudTrail
• http://docs.aws.amazon.com/lambda/latest/dg/with-cloudtrail.html
Slack Integration for AWS Lambda
• https://aws.amazon.com/blogs/aws/new-slack-integration-blueprints-for-aws-lambda/
Dynamic GitHub Actions with AWS Lambda
• https://aws.amazon.com/blogs/compute/dynamic-github-actions-with-aws-lambda/
Lambda with Scheduled Events
• http://docs.aws.amazon.com/lambda/latest/dg/with-scheduled-events.html
38. Find Out More
Serverless Cross Account Stream Replication Using AWS Lambda, Amazon DynamoDB, and Amazon Kinesis
Firehose (AWS Compute Blog)
• https://aws.amazon.com/blogs/compute/serverless-cross-account-stream-replication-using-aws-lambda-
amazon-dynamodb-and-amazon-kinesis-firehose
Analyze a Time Series in Real Time with AWS Lambda, Amazon Kinesis and Amazon DynamoDB Streams (AWS Big
Data Blog)
• https://blogs.aws.amazon.com/bigdata/post/Tx148NMGPIJ6F6F/Analyze-a-Time-Series-in-Real-Time-with-AWS-Lambda-
Amazon-Kinesis-and-Amazon-Dyn
Serverless Dynamic Real-Time Dashboard with AWS DynamoDB, S3 and Cognito (Medium.com)
• https://medium.com/@rfreeman/serverless-dynamic-real-time-dashboard-with-aws-dynamodb-a1a7f8d3bc01
JustGiving Public Repository
• https://github.com/JustGiving
Serverless at re:Invent 2016 – Wrap-up (AWS Compute Blog)
• https://aws.amazon.com/blogs/compute/serverless-at-reinvent-2016-wrap-up/
39. Summary
• Challenges of existing big data pipelines
• Event-driven ETL and serverless pipelines at JustGiving
• Serverless computing
• Six patterns for scalable data pipelines
1. Serverless message bus
2. Serverless steaming data capture API
3. Serverless small data pipeline
4. Serverless streamify file and merge into larger files
5. Serverless streaming analytics and persist stream
6. Serverless data and process API
• Recommendations
• Using AWS managed services
• Understanding Lambda
40. Thank You!
“Ensure no good cause goes unfunded”
Contact
https://linkedin.com/in/drfreeman