My presentation for the Cloud Data Management course at EPFL, taught by Anastasia Ailamaki and Christoph Koch.
It is mainly based on the following two papers:
1) S. Ghemawat, H. Gobioff, S. Leung. The Google File System. SOSP, 2003
2) J. Dean, S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI, 2004
1. Cloud Infrastructure:
GFS & MapReduce
Andrii Vozniuk
Based on slides by Jeff Dean and Ed Austin
Data Management in the Cloud
EPFL
February 27, 2012
2. Outline
• Motivation
• Problem Statement
• Storage: Google File System (GFS)
• Processing: MapReduce
• Benchmarks
• Conclusions
3. Motivation
• Huge amounts of data to store and process
• Example @2004:
– 20+ billion web pages x 20KB/page = 400+ TB
– Reading from one disk at 30-35 MB/s
• Four months just to read the web
• 1000 hard drives just to store the web
• Even more complicated if we want to process data
• Exponential growth: the solution must be scalable
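A quick back-of-the-envelope check of the numbers above (a sketch in Python; the ~400 GB per-drive capacity is my assumption for circa-2004 hardware, the other figures are the slide's):

# Rough check of the slide's estimates.
corpus_bytes = 20e9 * 20e3        # 20+ billion pages x 20 KB/page ~= 400 TB
disk_read_rate = 30e6             # ~30 MB/s sequential read from one disk
drive_capacity = 400e9            # ~400 GB per drive (assumed, circa 2004)

days_to_read = corpus_bytes / disk_read_rate / 86400
print(days_to_read)                      # ~154 days, i.e. four to five months on one disk
print(corpus_bytes / drive_capacity)     # ~1000 drives just to store the corpus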
4. Motivation
• Buy super fast, ultra reliable hardware?
– Ultra expensive
– Controlled by third party
– Internals can be hidden and proprietary
– Hard to predict scalability
– Fails less often, but still fails!
– No suitable solution on the market
5. Motivation
• Use commodity hardware? Benefits:
– Commodity machines offer much better perf/$
– Full control over and understanding of internals
– Can be highly optimized for their workloads
– Really smart people can do really smart things
• Not that easy:
– Fault tolerance: something breaks all the time
– Application development
– Debugging, Optimization, Locality
– Communication and coordination
– Status reporting, monitoring
• You would have to handle all of these issues for every problem you want to solve
6. Problem Statement
• Develop a scalable distributed file system for
large data-intensive applications running on
inexpensive commodity hardware
• Develop a tool for processing large data sets in
parallel on inexpensive commodity hardware
• Develop both in coordination for optimal performance
8. Google Cluster Environment
• @2009:
– 200+ clusters
– 1000+ machines in many of them
– 4+ PB File systems
– 40GB/s read/write load
– Frequent HW failures
– 100s to 1000s of active jobs (1 to 1000 tasks each)
• A cluster is 1000s of machines
– Stuff breaks: with 1000 machines expect about 1 failure per day; with 10,000, about 10 per day
– How to store data reliably with high throughput?
– How to make it easy to develop distributed applications?
11. GFS: Google File System*
• Inexpensive commodity hardware
• High throughput favored over low latency
• Large files (multi GB)
• Multiple clients
• Workload
– Large streaming reads
– Small random reads
– Concurrent append to the same file
* The Google File System. S. Ghemawat, H. Gobioff, S. Leung. SOSP, 2003
12. GFS Architecture
• User-level process running on commodity Linux machines
• Consists of Master Server and Chunk Servers
• Files broken into chunks (typically 64 MB), replicated 3x by default (spread across machines and racks)
• Data transfers happen directly between clients and Chunk Servers
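A minimal sketch of that read path in Python (CHUNK_SIZE matches the slide; master.lookup and the chunkserver read call are illustrative names, not GFS's actual API):

CHUNK_SIZE = 64 * 2**20  # 64 MB chunks, as above

def gfs_read(master, chunkservers, file_name, offset, length):
    # Client converts the byte offset into a chunk index and asks the master for metadata only.
    chunk_index = offset // CHUNK_SIZE
    handle, replicas = master.lookup(file_name, chunk_index)   # hypothetical metadata RPC
    # The actual bytes flow directly from a chunkserver (here simply the first replica), never via the master.
    return chunkservers[replicas[0]].read(handle, offset % CHUNK_SIZE, length)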
13. GFS Master Node
• Centralization for simplicity
• Namespace and metadata management
• Managing chunks
– Where they are (file => chunks, chunk => replica locations)
– Where to place new chunks
– When to re-replicate (failure, load-balancing)
– When and what to delete (garbage collection)
• Fault tolerance
– Shadow masters
– Monitoring infrastructure outside of GFS
– Periodic snapshots
– Mirrored operations log
14. GFS Master Node
• Metadata is in memory – it’s fast!
– A 64 MB chunk needs less than 64 B of metadata, so 640 TB of data needs less than 640 MB of metadata (checked below)
• Asks Chunk Servers which chunks they hold when
– Master starts
– Chunk Server joins the cluster
• Operation log
– Is used for serialization of concurrent operations
– Replicated
– Respond to client only when the log is flushed locally and remotely
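The size check behind that claim, plus an illustrative view of the in-memory maps (a sketch; the dictionary layout is mine, not GFS's actual data structures):

CHUNK_SIZE = 64 * 2**20

file_to_chunks = {}     # file name -> ordered list of chunk handles (persisted via the operation log)
chunk_locations = {}    # chunk handle -> chunkserver addresses (not persisted: rebuilt by
                        # asking Chunk Servers at master startup or when a server joins)

data_bytes = 640 * 2**40                          # 640 TB of file data
num_chunks = data_bytes // CHUNK_SIZE             # ~10 million chunks
print(num_chunks * 64 / 2**20, "MB of metadata")  # 640 MB at 64 B per chunk: fits in RAM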
15. GFS Chunk Servers
• 64MB chunks as Linux files
– Reduces the size of the master's data structures
– Reduces client-master interaction
– Internal fragmentation => allocate space lazily
– Possible hotspots => re-replicate
• Fault tolerance
– Heartbeat messages to the master
– Something wrong => master initiates re-replication
16. GFS Mutation Order
• Write control flow:
– 1. Client asks the master which Chunk Server holds the current lease for the chunk (the primary) and where the replicas are
– 2. Master replies with the identity of the primary and the locations of the replicas (cached by the client)
– 3. Client pushes the data to all replicas (the 3a, 3b, 3c data flows)
– 4. Once all replicas have the data, the client sends the write request to the primary
– 5. Primary assigns a serial number to the mutation, applies it, and forwards the write request to the secondaries
– 6. Secondaries apply the mutation and report completion to the primary
– 7. Primary replies to the client: operation completed, or an error report
17. GFS Control & Data Flow
• Decouple control flow and data flow
• Control flow
– Master -> Primary -> Secondaries
• Data flow
– Carefully picked chain of Chunk Servers
• Forward to the closest first
• Distance estimated based on IP
– Fully utilize outbound bandwidth
– Pipelining to exploit full-duplex links
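• With pipelining over a chain of R replicas, the GFS paper estimates the time to push B bytes as roughly B/T + R*L (T = per-link network throughput, L = per-hop latency), so replicating to the whole chain costs little more than a single transfer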
18. GFS Other Important Things
• Snapshot operation – makes a copy of a file or directory tree almost instantly (copy-on-write)
• Smart chunk creation policy
– Prefer Chunk Servers with below-average disk utilization; limit the number of recent creations per server
• Smart re-replication policy
– Under replicated first
– Chunks that are blocking client progress
– Live files first (rather than deleted)
• Rebalance and GC periodically
How to process data stored in GFS?
19. MapReduce*
• A simple programming model applicable to many
large-scale computing problems
• Divide and conquer strategy
• Hide messy details in MapReduce runtime library:
– Automatic parallelization
– Load balancing
– Network and disk transfer optimizations
– Fault tolerance
– Part of the stack: improvements to the core library benefit all of its users
* MapReduce: Simplified Data Processing on Large Clusters. J. Dean, S. Ghemawat. OSDI, 2004
20. Typical problem
• Read a lot of data and break it into parts
• Map: extract something important from each part
• Shuffle and Sort
• Reduce: aggregate, summarize, filter, transform Map results
• Write the results
• Chain or cascade multiple MapReduce rounds for more complex computations
• Implement Map and Reduce to fit the problem
22. Other suitable examples
• Distributed grep
• Distributed sort (Hadoop MapReduce won TeraSort)
• Term-vector per host
• Document clustering
• Machine learning
• Web access log stats
• Web link-graph reversal
• Inverted index construction
• Statistical machine translation
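Taking the first example above, distributed grep, the map and reduce functions are tiny (a sketch in the spirit of the paper's description; the pattern and function names are illustrative):

import re

PATTERN = re.compile(r"ERROR")          # pattern to grep for (illustrative)

def map_fn(file_name, contents):
    # Map: emit every line that matches the pattern.
    for line in contents.splitlines():
        if PATTERN.search(line):
            yield line, ""

def reduce_fn(line, values):
    # Reduce: identity; matched lines pass straight through to the output.
    yield line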
23. Model
• Programmer specifies two primary methods
– Map(k,v) -> <k’,v’>*
– Reduce(k’, <v’>*) -> <k’,v’>*
• All v’ with same k’ are reduced together, in order
• Usually also specify how to partition k’
– Partition(k’, total partitions) -> partition for k’
• Often a simple hash of the key
• Allows reduce operations for different k’ to be parallelized
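A minimal sketch of the default-style partitioner just described (hash of the key modulo the number of reduce partitions; in a real distributed run a stable hash, e.g. from hashlib, would replace Python's per-process hash()):

def partition(key: str, num_partitions: int) -> int:
    # All values for the same key hash to the same partition, so they end up
    # at the same reducer; different keys spread across the reducers.
    return hash(key) % num_partitions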
24. Code
Map
map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");
Reduce
reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));
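The same word count as runnable Python, with a toy in-memory shuffle standing in for the framework (a sketch for illustration only; function and variable names are mine):

from collections import defaultdict

def map_fn(doc_name, contents):
    # Map: one (word, 1) pair per occurrence.
    for word in contents.split():
        yield word, 1

def reduce_fn(word, counts):
    # Reduce: sum the counts for one word.
    yield word, sum(counts)

def run(documents):
    # Toy driver: map every document, group by key (the "shuffle"), reduce each group.
    groups = defaultdict(list)
    for name, contents in documents.items():
        for key, value in map_fn(name, contents):
            groups[key].append(value)
    return {k: v for key, values in groups.items() for k, v in reduce_fn(key, values)}

print(run({"doc1": "the quick brown fox", "doc2": "the lazy dog"}))
# {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 1}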
25. Architecture
• One master, many workers
• Infrastructure manages scheduling and distribution
• User implements Map and Reduce
28. Architecture
Combiner = local Reduce: pre-aggregates map output on the mapper's machine before the shuffle
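For the word count above the combiner can reuse the reduce logic (reduce_fn in the earlier sketch), since addition is associative and commutative; this cuts the number of (word, 1) pairs that cross the network:

def combine_fn(word, counts):
    # Same aggregation as reduce_fn, but run locally on each mapper's output, so the
    # shuffle ships one partial count per word instead of one pair per occurrence.
    yield word, sum(counts)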
29. Important things
• Mappers scheduled close to data
• Chunk replication improves locality
• Reducers often run on same machine as mappers
• Fault tolerance
– Map worker crash => re-run all of that machine's map tasks (their output was on its local disk)
– Reduce worker crash => re-run only the crashed task (completed output is already in the global file system)
– Master crash => re-run the whole job
– Skip bad records
• Fighting ‘stragglers’ by launching backup tasks
• Proven scalability
• In September 2009 Google ran 3,467,000 MapReduce jobs, averaging 488 machines per job
• Extensively used at Yahoo and Facebook via Hadoop
32. Conclusions
“We believe we get tremendous competitive advantage by essentially building our own infrastructure” -- Eric Schmidt
• GFS & MapReduce
– Google achieved their goals
– A fundamental part of their stack
• Open source implementations
– GFS => Hadoop Distributed File System (HDFS)
– MapReduce => Hadoop MapReduce