Big Data & Analytics - Use Cases in Mobile, E-commerce, Media and more
1. Big Data & Analytics
Use Cases in
Mobile, E-commerce, Media and more
Russell Nash
AWS Solutions Architect
3. Product?
Do we have a product?
Can we ship?
How to develop faster?
Better? Cheaper?
Market?
Can we scale?
What do people do & why?
How do we optimize?
5. • 10 million guests
• 550,000 properties listed
• Massive growth on AWS
• $776.4M from top investors
• $10B valuation – more than Hyatt
6. “At Airbnb, we look into all possible ways to
improve our product and user experience. Often
times this involves lots of analytics behind the
scene.”
Henry Cai 蔡明航
Software Engineer, Growth at Airbnb
http://nerds.airbnb.com/redshift-performance-cost/
8. Agenda
• Big Data Overview
• MapReduce / Hadoop
• Case Study: Yelp
• Data Warehousing
• Case Study: Foursquare
• NoSQL
• Case Study: AdRoll
• Streaming
• Case Study: Supercell
14. Hadoop cluster
Input File → Functions → Output
1. Very Flexible
2. Very Scalable
3. Often Transient
15. Big Data Verticals and Use cases
• Media/Advertising: Targeted Advertising; Image and Video Processing
• Oil & Gas: Seismic Analysis
• Retail: Recommendations; Transactions Analysis
• Life Sciences: Genome Analysis
• Financial Services: Monte Carlo Simulations; Risk Analysis
• Security: Anti-virus; Fraud Detection; Image Recognition
• Social Network/Gaming: User Demographics; Usage analysis; In-game metrics
19. 400 GB of logs per day
~12 Terabytes per month
21. 1) Load log file data for six months of user search history into Amazon S3
Search ID   Search Text   Final Selection
12423451    westen        Westin
14235235    wisten        Westin
54332232    westenn       Westin
22. 2) Spin up a 200-node Hadoop cluster on Amazon EMR, with the log files in Amazon S3
23. 3) 200 nodes simultaneously analyze this data looking for common misspellings … this takes a few hours
24. 4) New common misspellings and suggestions are loaded back into S3
25. 5) When the job is done, the cluster is shut down.
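The five steps above can be sketched in a few lines of Python. This is only an illustration of the map and reduce idea on a single machine, not Yelp's actual job and not the EMR or Hadoop API. The first three rows come from the slide's table; the last two rows, the function names, and the `min_count` threshold are hypothetical.

```python
from collections import Counter

# (search_id, search_text, final_selection)
# The first three rows are from the slide; the last two are hypothetical.
SEARCH_LOG = [
    (12423451, "westen", "Westin"),
    (14235235, "wisten", "Westin"),
    (54332232, "westenn", "Westin"),
    (99000001, "westen", "Westin"),
    (99000002, "Westin", "Westin"),
]


def map_phase(row):
    """Emit ((typed_text, selection), 1) when the typed text differs from the selection."""
    _search_id, text, selection = row
    if text.lower() != selection.lower():
        yield (text.lower(), selection), 1


def reduce_phase(pairs):
    """Sum the emitted counts for each (typed_text, selection) key."""
    totals = Counter()
    for key, count in pairs:
        totals[key] += count
    return totals


def suggest(totals, min_count=2):
    """Keep misspellings seen at least min_count times, with their most frequent selection."""
    best = {}
    for (text, selection), count in totals.items():
        if count >= min_count and count > best.get(text, ("", 0))[1]:
            best[text] = (selection, count)
    return {text: selection for text, (selection, _count) in best.items()}


def run(log, min_count=2):
    """Run map, then reduce, then build the suggestion table."""
    emitted = (pair for row in log for pair in map_phase(row))
    return suggest(reduce_phase(emitted), min_count)


if __name__ == "__main__":
    print(run(SEARCH_LOG))
```

On a real cluster the map calls would run in parallel across the nodes and the reduce step would be grouped by key; here both run in one process so the logic is easy to follow.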
26. E-Commerce Case Study
• Online Marketplace
• EMR
– Weblog analysis
– Recommendations
• Link logs with production database in EMR
“Enables us to focus on developing our…analysis stack
without worrying about the underlying infrastructure”
34. Mobile Case Study
• Location based social app
• 40 Million users
• 4.5 Billion check-ins
• Multi-terabytes of log data
35. Who is checking in?
[Charts: check-ins by Gender (Female, Male) and by Age (0 to 80); vertical axis from 0 to 0.6]
36. When do people go to a place?
[Chart: check-ins from Thursday to Sunday for Gorilla Coffee, Gray's Papaya, and Amorino]
37. “Using Amazon Redshift has enabled the
company to perform more agile analytics
while saving costs.”
38. Media Case Study
• Placeshifting and media streaming
• Collect terabytes of event logs
• Viewership, devices, etc.
• Hadoop for transformation
• Redshift for analysis
“Redshift allows us to turn on a dime”
39. Performance Evaluation on 2B Rows
Traditional
SQL Database
Amazon
Redshift
Aggregate by month 02:08:35 00:35:46 00:00:12
40.            Hadoop                       MPP               NoSQL
Structure      Any                          Full
Latency        Mins-Hours                   Seconds-Minutes
Interfaces     Programming, SQL-Like Tools  SQL, BI Tools
42. ID Age State
123 20 CA
345 25 WA
678 40 FL
Relational Table
ID Attributes
123 Age:20, State:CA
345 Age:25, Country: Australia, Gender: F, Smoker: No
678 Age:40
Non-Relational Table
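As a rough illustration of the difference between the two tables, the same data can be written in Python as fixed-width rows versus per-item attribute maps. This is a sketch only and not tied to any particular database; the helper names `get_attribute` and `attribute_names` are hypothetical.

```python
# Relational table: every row has exactly the same columns.
COLUMNS = ("ID", "Age", "State")
relational_rows = [
    (123, 20, "CA"),
    (345, 25, "WA"),
    (678, 40, "FL"),
]

# Non-relational table: each item is keyed by ID and carries its own attributes.
items = {
    123: {"Age": 20, "State": "CA"},
    345: {"Age": 25, "Country": "Australia", "Gender": "F", "Smoker": "No"},
    678: {"Age": 40},
}


def get_attribute(item_id, name, default=None):
    """Look up an attribute that may be absent for this particular item."""
    return items.get(item_id, {}).get(name, default)


def attribute_names():
    """Collect the union of attribute names used across all items."""
    names = set()
    for attrs in items.values():
        names.update(attrs)
    return sorted(names)


if __name__ == "__main__":
    print(get_attribute(345, "Country"))
    print(get_attribute(678, "State", default="unknown"))
    print(attribute_names())
```

The reading code has to tolerate missing attributes, which is the price of not declaring a schema up front.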
54. Use Cases
• Gaming analytics
• Sensor networks analytics
• Ad network analytics
• Log centralization
• Click stream analysis
• Hardware and software appliance metrics
• …more…
55. [Architecture diagram: Data Sources send records to an AWS Endpoint. Amazon Kinesis holds the stream in Shard 1, Shard 2, … Shard N across three Availability Zones. Consuming applications shown: App.1 [Aggregate & De-Duplicate], App.2 [Metric Extraction], App.3 [Sliding Window Analysis], App.4 [Machine Learning]. Downstream services shown: S3, DynamoDB, Redshift, EMR.]
56. “Amazon Kinesis enables our business-critical analytics and dashboard
applications to reliably get the data streams they need, without delays. Amazon
Kinesis also offloads a lot of developer burden in building a real-time, streaming
data ingestion platform, and enables Supercell to focus on delivering games that
delight players worldwide.”
Sami Yliharju, Supercell Services Lead
57. Big Data Tutorials
aws.amazon.com/big-data
Redshift Free Trial
aws.amazon.com/redshift/free-trial
Editor's notes
Put something in users’ hands (it doesn’t need to be code) and get feedback ASAP.
Depending on your data’s structure, size, and access patterns, you will need to pick the right solution:
* S3 is ideal for large unstructured objects such as files, pictures, binary data, etc.
* DynamoDB, or another NoSQL alternative such as Cassandra, is ideal for small objects that you have to read or write at high speed. It is great for data powering web or mobile applications.
* Amazon RDS (or another relational database) is great for structured schemas and standard SQL access, but the size of the data is typically limited to a single server. It is possible to shard data across many RDS instances, but this requires substantial development and ops work.
* HBase – ideal for analytics use cases
▪ Optimized for append-heavy, light-read workloads
So there are a variety of ways to store your data in the cloud, depending on the particular needs of your application.
Hadoop and cloud marriage
Shared nothing
Yelp – Autocomplete, spelling suggestions
S&P Capital IQ – Recommendations for investors based on behaviour
Australian company – uses it to calculate which ad space it should buy.
Let’s look at another company – Yelp.
As you can see, this company is growing rapidly: with more than 50 million monthly visitors and 18 million reviews, it generates about 400 GB of data a day. That data needs to be processed and analyzed.
The more searches you collect from your customers, the better recommendations you can provide.
Using Hadoop on Amazon Elastic MapReduce, Yelp analyses customer search results to deliver features such as hotel or restaurant recommendations.
Yelp processes all customer reviews with natural language processing technologies to provide customers with review highlights.
From this example we can see that companies such as Yelp can use data generated by their customers on their website to develop more innovative data products.
By looking at typical queries, Yelp can list common suggestions for a query even before you finish typing.
Both of these products are possible because Yelp analyses all the web logs from its websites.
Map Reduce – Programming model for Hadoop
Flume – Open source Log collection tool
Mahout – Machine learning project
Nutch – web search engine
Cascading – Software abstraction layer
HBase – Columnar NoSQL database
Cassandra – NoSQL database
Sqoop – Data transfer between Hadoop and relational databases
Hive – SQL-like language for Hadoop
Chukwa – Log collection
Approaching 50/50 male/female
You can see that some places are best for lunch during work hours, while others are dinner joints.
Use Case – IMDB uses it for new applications, e.g., a movie rating system
[2 minutes]
Kinesis is a new service that scales elastically for near-real-time processing of streaming big data. The service stores large streams of data in durable, consistent storage so that an elastically scalable fleet of data processing servers can process them in near real time. “Large streams” means millions of records per second and GBs of data per second; “near real time” means on the order of a few seconds.
Streaming data processing has two layers: a storage layer and a processing layer. The storage layer needs to support specialized ordering and consistency semantics that enable fast, inexpensive, and replayable reads and writes of large streams of data. Kinesis is the storage layer. The processing layer is responsible for reading data from the storage layer, processing that data, and notifying the storage layer to delete data that is no longer needed. Kinesis supports the processing layer as well: customers compile the Kinesis library into their data processing application, and Kinesis notifies the application (the Kinesis Worker) when there is new data to process. The Kinesis control plane works with Kinesis Workers to solve scalability and fault tolerance problems in the processing layer.
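The two layers can be illustrated with a toy, in-memory model in Python. This is a sketch under simplifying assumptions and is not the Kinesis API: the `Stream` and `Worker` classes, the modulo-based shard choice, and the summing logic are all hypothetical, and the worker only records a checkpoint rather than asking the storage layer to delete data.

```python
import hashlib


class Stream:
    """Toy storage layer: ordered, replayable records split across shards."""

    def __init__(self, shard_count):
        self.shards = [[] for _ in range(shard_count)]

    def shard_for(self, partition_key):
        """Pick a shard from a hash of the partition key (simplified to a modulo)."""
        digest = hashlib.md5(partition_key.encode("utf-8")).hexdigest()
        return int(digest, 16) % len(self.shards)

    def put(self, partition_key, data):
        """Append a record and return (shard_id, sequence_number)."""
        shard_id = self.shard_for(partition_key)
        shard = self.shards[shard_id]
        shard.append(data)
        return shard_id, len(shard) - 1

    def read(self, shard_id, start=0):
        """Return (sequence_number, data) pairs from start onward; reads are repeatable."""
        return list(enumerate(self.shards[shard_id]))[start:]


class Worker:
    """Toy processing layer: reads one shard and remembers how far it got."""

    def __init__(self, stream, shard_id):
        self.stream = stream
        self.shard_id = shard_id
        self.checkpoint = 0  # next sequence number to process
        self.total = 0       # running sum of the values seen so far

    def poll(self):
        """Process records that arrived since the last checkpoint."""
        records = self.stream.read(self.shard_id, self.checkpoint)
        for sequence_number, value in records:
            self.total += value
            self.checkpoint = sequence_number + 1
        return len(records)


if __name__ == "__main__":
    stream = Stream(shard_count=2)
    shard_id, _ = stream.put("player-1", 10)
    stream.put("player-1", 5)
    worker = Worker(stream, shard_id)
    print(worker.poll(), worker.total)
```

Because the storage layer keeps records in order per shard and the worker keeps only a checkpoint, a restarted worker can resume from its checkpoint or replay the shard from the beginning.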
Supercell is using Amazon Kinesis for real-time delivery of game insight data sent by hundreds of game engine servers.
TALKING POINTS
AWS Training and Certification is an organization dedicated to expanding and deepening knowledge of AWS, as well as driving proliferation in the usage of AWS services.
Our programs are designed for customers, partners and AWS employees.
Over the past several months, we have rolled out several new courses, training labs, and certifications to our customers and partners.
Go and visit the training team at the training booth to receive your 30% discount voucher for a certification exam.