2. Agenda
MapReduce
Google
Scaling Out
Key Value Store
Chaining
Fault Tolerance
Functional Example
Business Problem
Design
Processes
Schema
Big Data Guidelines
4. Google MapReduce
+ Paper published in 2004
+ Implemented in 2003
+ Production use at Google
+ Built for Google
+ Not open sourced
5. Google in 2004
+ Clusters of 100s or 1000s of servers
o Linux
o dual-processor x86
o 2-4 GB memory
o 100BaseT or GigE
o inexpensive IDE hard drives
+ Servers fail every day
+ Network maintenance is constant
6. Scaling Out
+ Scaling up (faster computer) doesn’t get far
+ Scaling out is the only next step
+ Hundreds/thousands of modest computers
outperform the biggest single computers
+ Scaling one to a few is hard
+ Scaling a few to many is easy
+ Scaling many to massive is (almost) trivial
8. Intermediate Data
+ Input data is split between the workers
+ Map workers create key/value pairs
+ Reduce workers read in all intermediate
data and sort by key
+ Reduce workers then iterate over the sorted
data producing a result for each key
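As a concrete (if sequential) illustration of that flow, here is a hedged Erlang sketch: a map function emits key/value pairs, the pairs are grouped by key, and a reduce function produces one result per key. The module and function names are illustrative assumptions only and are not Google's API.

-module(mr_sketch).
-export([mapreduce/3, wordcount/1]).

%% Apply Map to every input, group the emitted {Key, Value} pairs by key,
%% then apply Reduce to each key and its list of values.
mapreduce(Map, Reduce, Inputs) ->
    Pairs = lists:flatmap(Map, Inputs),
    ByKey = lists:foldl(fun({K, V}, Acc) ->
                                maps:update_with(K, fun(Vs) -> [V | Vs] end, [V], Acc)
                        end, #{}, Pairs),
    maps:map(Reduce, ByKey).

%% Example: count how often each word occurs across a list of "documents".
wordcount(Docs) ->
    Map = fun(Doc) -> [{Word, 1} || Word <- string:lexemes(Doc, " ")] end,
    Reduce = fun(_Word, Counts) -> lists:sum(Counts) end,
    mapreduce(Map, Reduce, Docs).

For example, mr_sketch:wordcount(["to be or not to be"]) returns #{"be" => 2, "not" => 1, "or" => 1, "to" => 2}.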
10. Rinse and Repeat
+ Often the results of one MapReduce are
used as input to another
+ By building on this simple but powerful
functional model, complex data processing
can be accomplished
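To make the chaining concrete, here is a hedged sketch that feeds the output of the wordcount pass into a second pass that groups words by how often they occur. It assumes the illustrative mr_sketch module from the previous sketch.

-module(mr_chain).
-export([chain/1]).

%% First pass: word counts. Second pass: group words by their count,
%% using the first pass's output as the second pass's input.
chain(Docs) ->
    Counts = maps:to_list(mr_sketch:wordcount(Docs)),
    Map    = fun({Word, N}) -> [{N, Word}] end,
    Reduce = fun(_N, Words) -> lists:sort(Words) end,
    mr_sketch:mapreduce(Map, Reduce, Counts).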
12. Fault Tolerance
+ Likelihood of failure rises with number of
servers and processing time
+ Resiliency is a necessity at scale
+ Scheduler/Supervisor (master) reassigns
failed jobs and ensures reduce workers find
the (right) data
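A minimal Erlang sketch of this reassignment idea, under the assumption that "work" can be expressed as a fun applied to a task: the master spawns and monitors one worker per task, and when a worker dies its task is handed to a freshly spawned replacement. This is only a skeleton of the supervision loop, not Google's scheduler.

-module(master_sketch).
-export([run/2]).

%% Run WorkFun on every task in Tasks using one worker process per task.
%% If a worker dies before reporting, its task is handed to a replacement.
run(WorkFun, Tasks) ->
    Running = maps:from_list([start_worker(WorkFun, T) || T <- Tasks]),
    wait(WorkFun, Running, []).

start_worker(WorkFun, Task) ->
    Master = self(),
    {Pid, Ref} = spawn_monitor(fun() -> Master ! {done, self(), WorkFun(Task)} end),
    {Pid, {Ref, Task}}.

wait(_WorkFun, Running, Results) when map_size(Running) =:= 0 ->
    Results;
wait(WorkFun, Running, Results) ->
    receive
        {done, Pid, Result} ->
            {Ref, _Task} = maps:get(Pid, Running),
            erlang:demonitor(Ref, [flush]),
            wait(WorkFun, maps:remove(Pid, Running), [Result | Results]);
        {'DOWN', Ref, process, Pid, _Reason} ->
            %% The worker failed before reporting: reassign its task.
            {Ref, Task} = maps:get(Pid, Running),
            {NewPid, New} = start_worker(WorkFun, Task),
            wait(WorkFun, maps:put(NewPid, New, maps:remove(Pid, Running)), Results)
    end.

For example, master_sketch:run(fun(X) -> X * X end, [1, 2, 3]) returns [1, 4, 9] in completion order.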
16. Example Business Problem
Scenario:
A mobile operator wants to know if an instant
messaging (IM) service would be useful to
current subscribers.
Question:
What percentage of text messages (SMS)
are part of a conversation?
17. Challenge
✓ 10 million subscribers
✓ average of 100 SMS a month per subscriber
✓ ∴ one billion SMS each month
✓ call detail records (CDR) include SMS but also
voice and data events
✓ ∴ 20 billion (20,000,000,000) records/month
18. Requirements
+ Identify SMS conversations
o messages sent or received with one other party
o interval between messages < 10 minutes
o at least three messages exchanged
+ Provide result as
o ratio of conversational to non-conversational SMS
o per subscriber
o per month
20. Filter
+ Read events from CDR files
o records are in chronological order
o read files in chronological order
+ Discard non-SMS events
+ Distribute SMS events to Map processes
o Consistent distribution by subscriber
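A hedged sketch of the filter step. The CDR shape (a map with a type field) and RouteFun are illustrative assumptions; RouteFun stands in for the hashing scheme on the next slide and returns the map process that should receive the event.

-module(filter_sketch).
-export([sms_only/1, dispatch/2]).

%% Keep only SMS events from a chronologically ordered list of CDRs.
sms_only(CDRs) ->
    [CDR || CDR = #{type := Type} <- CDRs, Type =:= sms].

%% Send each SMS event to the map process chosen by RouteFun, so that all
%% of a subscriber's events end up on the same process.
dispatch(CDRs, RouteFun) ->
    lists:foreach(fun(CDR) -> RouteFun(CDR) ! CDR end, sms_only(CDRs)),
    ok.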
21. Hashing
+ To analyze the interval between
messages, one process must
handle all events for a
particular subscriber
+ Simple Hash:
o M = last four digits of subscriber’s
mobile number
o N = number of processes available
o Pid = M rem N
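A hedged Erlang sketch of that hash, assuming the subscriber number is a plain digit string and the available map processes are held in a tuple; only the "last four digits rem N" rule comes from the slide.

-module(hash_sketch).
-export([worker_for/2]).

%% Route a subscriber to one of N worker processes: M is the last four
%% digits of the mobile number, the worker index is M rem N.
worker_for(Msisdn, Workers) when is_list(Msisdn), is_tuple(Workers) ->
    LastFour = lists:reverse(lists:sublist(lists:reverse(Msisdn), 4)),
    M = list_to_integer(LastFour),
    N = tuple_size(Workers),
    element((M rem N) + 1, Workers).   %% element/2 is 1-based

With Workers = {W0, W1, W2, W3} and Msisdn = "16135551234", M is 1234, 1234 rem 4 is 2, and the third element (W2) handles that subscriber.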
22. Map
+ Read subscriber’s stored data
+ Find other party in set
+ Increment total count of messages
+ Was the previous message < 10 minutes ago?
o Was the next previous message < 10 minutes before the previous?
o If both, increment the conversational message count
+ Update the previous and next previous times
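A hedged Erlang sketch of that per-message update. The #party{} record shape and the function name are assumptions; only the 10-minute rule and the counting logic come from the slide, and (as noted in the speaker notes) the first two messages of a conversation are not counted.

-module(map_sketch).
-export([handle_sms/3]).

%% Per-(subscriber, other party) state: message counts plus the timestamps
%% (in seconds) of the previous and next previous messages exchanged.
-record(party, {total = 0, conv = 0, prev, next_prev}).

-define(GAP, 600).   %% 10 minutes

%% Update one subscriber's state (a map of OtherParty => #party{}) for one
%% SMS. As on the slide, the first two messages of a conversation are not
%% counted as conversational.
handle_sms(OtherParty, Time, State) ->
    P0 = maps:get(OtherParty, State, #party{}),
    P1 = P0#party{total = P0#party.total + 1},
    P2 = case {P1#party.prev, P1#party.next_prev} of
             {Prev, NextPrev} when is_integer(Prev), is_integer(NextPrev),
                                   Time - Prev < ?GAP, Prev - NextPrev < ?GAP ->
                 P1#party{conv = P1#party.conv + 1};
             _ ->
                 P1
         end,
    maps:put(OtherParty, P2#party{prev = Time, next_prev = P1#party.prev}, State).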
24. Interim Data
+ We are using an in-memory key value store
+ The key is the subscriber number
+ The value is a set of OtherParty
+ OtherParty data structure contains counts
+ When the map is complete we transfer the
data to disk for persistence
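A hedged ETS sketch of the interim store: the key is the subscriber number, the value is that subscriber's OtherParty state as built by the map sketch above. ets:new/2, ets:lookup/2, ets:insert/2 and ets:tab2file/2 are real OTP calls; the table name, filename and the dependence on map_sketch are assumptions.

-module(store_sketch).
-export([new/0, update/4, persist/2]).

%% In-memory key/value store: the key is the subscriber number, the value
%% is that subscriber's OtherParty state as built by map_sketch above.
new() ->
    ets:new(interim, [set, public]).

update(Table, Subscriber, OtherParty, Time) ->
    State = case ets:lookup(Table, Subscriber) of
                [{Subscriber, S}] -> S;
                []                -> #{}
            end,
    ets:insert(Table, {Subscriber, map_sketch:handle_sms(OtherParty, Time, State)}).

%% When the map is complete, copy the table to disk so the reduce workers
%% can read it even if this worker goes away.
persist(Table, Filename) ->
    ok = ets:tab2file(Table, Filename).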
25. Reduce
+ Collect intermediate data
from disk copies
+ Iterate through all parties for
each subscriber
+ Total all party counts
+ Provide result as percentage
of conversational messages
to total messages
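A hedged sketch of the reduce step, assuming the map workers persisted their ETS tables with ets:tab2file/2 as above. The filenames and module names are assumptions.

-module(reduce_sketch).
-export([percentages/1]).

%% Same record as in map_sketch; in real code it would live in a shared
%% header (.hrl) rather than being repeated.
-record(party, {total = 0, conv = 0, prev, next_prev}).

%% Read the interim tables the map workers saved to disk and produce, per
%% subscriber, the percentage of messages that were part of a conversation.
percentages(Filenames) ->
    lists:foldl(fun(File, Acc) ->
                        {ok, Tab} = ets:file2tab(File),
                        Acc1 = ets:foldl(fun add_subscriber/2, Acc, Tab),
                        ets:delete(Tab),
                        Acc1
                end, #{}, Filenames).

add_subscriber({Subscriber, Parties}, Acc) ->
    {Total, Conv} = maps:fold(fun(_Party, #party{total = T, conv = C}, {TA, CA}) ->
                                      {TA + T, CA + C}
                              end, {0, 0}, Parties),
    maps:put(Subscriber, 100 * Conv / Total, Acc).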
26. Big Data Guidelines
+ Find opportunities for concurrency
+ Choose the right containers for your data
+ Use memory as effectively as possible
+ Minimize copying data
+ Avoid any unnecessary overhead
+ Anything you are going to do hundreds of
billions of times should be efficient!
Successfully handling really big data requires massive concurrency, and in the real world that requires fault tolerance.
Google didn’t invent map and reduce, but they were the first to apply the paradigm in a general way on a massive scale.
… or, more probably, a number of results. By dividing the work we can assign it to many servers. This concurrency is what allows scale.
Here is an example of something Google does as part of its core business. Google ranks web sites that are linked to by many other web sites higher in search results (PageRank). To determine this, a map reads web pages found by crawlers and creates key/value pairs. These are written in memory and then pushed out in blocks to disk. A reduce reads these disk blocks and sorts all the intermediate data by key. The reduce function then iterates over all the pairs for a key and outputs one result for each key.
The results from one MapReduce can, and often are, provided as input for further MapReduce runs.
Something like RAID, maybe a Redundant Array of Inexpensive Servers (RAIS)? They can and do fail individually without the system failing.
The user process forks all of the other processes which will be used, including a master process. The master then assigns those processes work to perform, in either map or reduce roles.
The master process monitors each worker by pinging it periodically. When it detects that a server has failed (or is no longer reachable), it reassigns that server’s work to another worker. After this reassignment, each of the reduce workers is notified to ignore the failed server and instead get the interim data from the newly assigned server.
This is a contrived example.
That’s billion with a ‘B’. In Canada that’s 1,000 million.
There is an obvious hole in this pseudocode: the first two messages of a conversation are not included in the conversational totals. I could have accommodated that, but I left it out to keep the example as simple as possible.