2. Outline
Large-scale Distributed Systems
Introduction to Cloud Computing
Cloud Computing paradigms and models
Introduction to MapReduce
Alternative architectures
Writing Applications using Hadoop
3. Distributed Systems
A set of discrete machines that cooperate to perform a
computation
Gives the notion of a single “machine”
Keeps the distribution transparent
Examples:
Compute clusters
Distributed storage systems, such as Dropbox, Google Drive, etc.
The Web
4. Characteristics
Ordering
Time is used to ensure ordering
In most cases, we only need to know that event a happened
before event b, known as the happens-before relation
Distributed Mutual Exclusion
Concurrent access to shared resources needs to be
synchronized
Central lock server: All lock requests are handled by a
central server
Token passing: Arrange nodes into a ring and a token is
passed around
Totally-ordered multicast: Clients multicast requests to
each other
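The central-lock-server approach above can be sketched as a toy single-process simulation (class and method names are made up for illustration; a real server would handle requests over the network):

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Toy central lock server: all acquire/release requests go to one server,
// which grants the lock in FIFO order. Single-process sketch for clarity.
public class LockServer {
    private String holder = null;                    // client currently holding the lock
    private final Queue<String> waiting = new ArrayDeque<>();

    // Returns true if the lock was granted immediately, false if the client was queued.
    public synchronized boolean acquire(String client) {
        if (holder == null) { holder = client; return true; }
        waiting.add(client);
        return false;
    }

    // Releases the lock and grants it to the next queued client (null if none).
    public synchronized String release(String client) {
        if (!client.equals(holder)) throw new IllegalStateException("not the holder");
        holder = waiting.poll();
        return holder;
    }

    public synchronized String holder() { return holder; }
}
```

The design point is that mutual exclusion follows trivially from serializing all requests through one server; the cost is that the server is a single point of failure and a bottleneck, which is what motivates the token-ring and multicast alternatives.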
5. Characteristics (2)
Distributed transactions
Distributed transactions span multiple transaction processing servers
Actions need to be coordinated across multiple parties
Replication
A number of distributed systems involve replication
Data replication: Multiple copies of some object stored at different servers
Computation replication: Multiple servers capable of providing an
operation
Advantages
1. Load balancing: Work spread out across replicas
2. Lower latency: Better performance if replica close to the client
3. Fault tolerance: Failure of some replicas can be tolerated
6. CAP
Consistency: All nodes see the same state
Availability: All requests get a response
Partition tolerance: The system continues to operate even
in the face of network partitions
Brewer's conjecture states that a distributed system can
guarantee only 2 of these 3 properties at once
In practice, partitioning is a given:
Hardware/software fails all the time
Therefore, systems need to choose between
consistency and availability
7. Advantages
Scalability:
The scale of the Internet (think how many queries Google servers handle daily)
Only a matter of adding more machines
Cheaper than supercomputers
More machines means more parallelism, hence better performance
Sharing:
The same resource is shared between multiple users
Just like the Internet is shared between millions of users
Communication:
Communication between (potentially geographically isolated) machines and
users (via email, Facebook, etc.)
Reliability:
The service can remain active even if multiple machines go down
8. Challenges
Concurrency:
Concurrent execution requires some form of coordination
Fault-tolerance:
Any component can fail at any instant due to a software or a
hardware bug
Security:
One machine can compromise the entire system
Coordination:
No global time so non-trivial to coordinate
Troubleshooting:
Hard to troubleshoot because it is hard to reason about the
system
9. Introduction to Cloud
Computing
An emerging IT development, deployment, and delivery model that enables
real-time delivery of a broad range of IT products, services and solutions over
the internet
A realization of utility computing in which computation, storage, and services
are offered as a metered service
Grid Computing: form of distributed computing, acting
in concert to perform very large tasks
Utility Computing: metered service similar to a
traditional public utility
Autonomic Computing: capable of self-management
Cloud Computing: deployments as of 2009 depend on
grids, have autonomic characteristics and bill like
utilities
10. Characteristics
On-demand self-service: allows users to obtain,
configure and deploy cloud services themselves using
cloud service catalogues, without requiring the
assistance of IT.
Broad network access: capabilities are available over
the network and accessed through standard
mechanisms that promote use by heterogeneous thin
or thick client platforms
Resource pooling: The provider's computing resources
are pooled to serve multiple consumers using a multi-
tenant model, with different physical and virtual
resources dynamically assigned and reassigned
according to consumer demand.
11. Characteristics (2)
Rapid elasticity: Capabilities can be rapidly and
elastically provisioned, in some cases automatically,
to quickly scale out and rapidly released to quickly
scale in. To the consumer, the capabilities available
for provisioning often appear to be unlimited and
can be purchased in any quantity at any time.
Measured service: Cloud systems automatically
control and optimize resource use by leveraging a
metering capability at some level of abstraction
appropriate to the type of service (e.g., storage,
processing, bandwidth, and active user accounts).
12. Cloud Service Models
SaaS – Software as a Service: Network-hosted
application
PaaS– Platform as a Service: Network-hosted software
development platform
IaaS – Infrastructure as a Service: Provider hosts
customer VMs or provides network storage
DaaS – Data as a Service: Customer queries against
provider’s database
IPMaaS – Identity and Policy Management as a
Service: Provider manages identity and/or access
control policy for customer
NaaS – Network as a Service: Provider offers virtualized
networks (e.g. VPNs)
13. Deployment Models
Private Cloud: infrastructure is operated solely for an
organization.
Public Cloud: infrastructure is made available to the general
public as a pay-as-you-go model, e.g. Amazon Web Services,
Google AppEngine, and Microsoft Azure
Community Cloud: infrastructure between several
organizations from a specific community with common
concerns (security, compliance, jurisdiction, etc.), whether
managed internally or by a third-party and hosted internally or
externally.
Hybrid Cloud: infrastructure is a combination of two or more
clouds (private, community, or public) that remain unique
entities but are bound together by standardized or proprietary
technology that enables data and application portability
between environments.
18. Advantages
Advantages to both service providers and end users
Service providers:
Simplified software installation and maintenance
Centralized control over versioning
No need to build, provision, and maintain a datacenter
On-the-fly scaling
End users:
“Anytime, anywhere” access
Share data and collaborate easily
Safeguard data stored in the infrastructure
19. Obstacles
Bugs in large-scale distributed systems: Hard to
debug large-scale applications in full deployment
Scaling quickly: Automatically scaling while
conserving resources and money is an open-ended
problem
Reputation fate sharing: Bad behavior by one
tenant can reflect badly on the rest
Software licensing: Gap between pay-as-you-go
model and software licensing
20. Obstacles (2)
Service availability: Possibility of cloud outage
Data lock-in: Dependence on cloud specific APIs
Security: Requires strong encrypted storage, VLANs,
and network middle-boxes (firewalls, etc.)
Data transfer bottlenecks: Moving large amounts of
data in and out is expensive
Performance unpredictability: Resource sharing
between applications
Scalable storage: No standard model to arbitrarily
scale storage up and down on-demand while
ensuring data durability and high availability
21. Introduction to MapReduce
A simple programming model that applies to many
large-scale computing problems
Hide messy details in MR runtime library:
Automatic parallelization
Load balancing
Network and disk transfer optimization
Handling of machine failures
Robustness
Improvements to core library benefit all users of library
22. Google MapReduce – Idea
The core idea behind MapReduce is mapping your
data set into a collection of <key, value> pairs, and
then reducing over all pairs with the same key.
Map
Apply function to all elements of a list
square x = x * x
map square [1, 2, 3, 4, 5]
⇒ [1, 4, 9, 16, 25]
Reduce
Combine all elements of a list
reduce (+) [1, 2, 3, 4, 5]
⇒ 15
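The same map and reduce steps can be reproduced with Java streams (a plain-Java sketch of the idea, not the Hadoop API; class and method names are made up):

```java
import java.util.List;
import java.util.stream.Collectors;

public class MapReduceIdea {
    // map: apply a function (here, squaring) to every element of a list
    public static List<Integer> squareAll(List<Integer> xs) {
        return xs.stream().map(x -> x * x).collect(Collectors.toList());
    }

    // reduce: combine all elements of a list with (+)
    public static int sum(List<Integer> xs) {
        return xs.stream().reduce(0, Integer::sum);
    }
}
```

`squareAll(List.of(1, 2, 3, 4, 5))` yields `[1, 4, 9, 16, 25]` and `sum` of the same list yields 15, matching the pseudocode above.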
24. MapReduce architecture
Master: In charge of all meta data, work scheduling
and distribution, and job orchestration
Workers: Contain slots to execute map or reduce
functions
Mappers:
A map worker reads the contents of the input split that it has
been assigned
It parses the file and converts it to key/value pairs and invokes
the user-defined map function for each pair
The intermediate key/value pairs after the application of the
map logic are collected (buffered) in memory
Once the buffered key/value pairs exceed a threshold they are
written to local disk and partitioned (using a partitioning
function) into R partitions. The location of each partition is passed
to the master
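The partitioning function mentioned above is typically a hash of the key modulo R; Hadoop's default HashPartitioner does essentially this (sketched here in plain Java, outside the Hadoop API):

```java
public class Partitioner {
    // Maps a key to one of numReducers partitions: mask off the sign bit
    // of the hash, then take it modulo the number of reduce tasks.
    public static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }
}
```

Because the function is deterministic, every map worker sends all occurrences of a given key to the same reduce partition, which is what lets a single reducer see all values for its keys.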
25. MapReduce architecture (2)
Workers: Contain slots to execute map or reduce
functions
Reducers:
A reduce worker gets locations of its input partitions from the
master and uses HTTP requests to retrieve them
Once it has read all its input, it sorts it by key to group
together all occurrences of the same key
It then invokes the user-defined reduce for each key and
passes it the key and its associated values
The key/value pairs generated after the application of the
reduce logic are then written to a final output file, which is
subsequently written to the distributed filesystem
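The sort-and-group step the reducer performs can be sketched in plain Java with a sorted map (an illustration of the effect, not the Hadoop implementation):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ReduceSide {
    // Groups (key, value) pairs by key, sorted by key -- the effect of the
    // reducer's sort phase before the user-defined reduce is invoked.
    public static TreeMap<String, List<Integer>> group(List<Map.Entry<String, Integer>> pairs) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }
}
```

The user-defined reduce is then invoked once per key with the key's full list of values, in key order.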
26. Google File System - GFS
In-house distributed file system at Google
Stores all input and output files
Stores files…
divided into 64 MB blocks
on at least 3 different machines
Machines running GFS also run MapReduce
27. MapReduce job phases
A MapReduce job can be divided into 4 phases:
Input split: The input dataset is sliced into M splits, one
per map task
Map logic: The user-supplied map function is invoked
In tandem a sort phase is also applied that ensures that
map output is locally sorted by key
In addition, the key space is also partitioned amongst the
reducers
Shuffle: Map output is relayed to all reduce tasks
Reduce logic: The user-provided reduce function is
invoked
Before the application of the reduce function, the input
keys are merged to get globally sorted key/value pairs
29. Wordcount map in Java
public void map(Object key, Text value, Context context)
    throws IOException, InterruptedException {
  StringTokenizer itr = new StringTokenizer(value.toString());
  while (itr.hasMoreTokens()) {
    word.set(itr.nextToken());  // word: reusable Text field of the Mapper
    context.write(word, one);   // one: constant IntWritable(1)
  }
}
30. Wordcount reduce in Java
public void reduce(Text key, Iterable<IntWritable> values, Context context)
    throws IOException, InterruptedException {
  int sum = 0;
  for (IntWritable val : values) {
    sum += val.get();
  }
  result.set(sum);              // result: reusable IntWritable field
  context.write(key, result);
}
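As a sanity check, the combined effect of this map/reduce pair is just a word-frequency count, which can be reproduced in plain Java without any Hadoop types (the class name is made up; the tokenization mirrors the mapper's StringTokenizer):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;

public class WordCountLocal {
    // Tokenizes the text (the map step) and sums a count of 1 per
    // occurrence of each word (the reduce step).
    public static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new HashMap<>();
        StringTokenizer itr = new StringTokenizer(text);
        while (itr.hasMoreTokens()) {
            counts.merge(itr.nextToken(), 1, Integer::sum);
        }
        return counts;
    }
}
```

What MapReduce adds over this single-machine version is exactly the machinery of the earlier slides: splitting the input across mappers, partitioning and shuffling the intermediate pairs, and running many reducers in parallel.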
31. Hadoop
Open-source implementation of MapReduce, started
by Doug Cutting in 2004 and later developed at
Yahoo!
Now a top-level Apache open-source project
Implemented in Java (Google's in-house
implementation is in C++)
Jobs can be written in C++, Java, Python, etc.
Comes with an associated distributed filesystem,
HDFS (clone of GFS)
32. Hadoop Components
Hadoop consists of two core components
– The Hadoop Distributed File System (HDFS)
– MapReduce Software Framework
There are many other projects based around
core Hadoop
– Often referred to as the “Hadoop Ecosystem”
– Pig, Hive, HBase, Flume, Oozie, Sqoop, etc.
33. Hadoop Users
Adobe: Several areas from social services to
unstructured data storage and processing
eBay: 532 nodes cluster storing 5.3PB of data
Facebook: Used for reporting/analytics; one cluster
with 1100 nodes (12PB) and another with 300 nodes
(3PB)
LinkedIn: 3 clusters with collectively 4000 nodes
Twitter: To store and process Tweets and log files
Yahoo!: Multiple clusters with collectively 40000
nodes; largest cluster has 4500 nodes!
34. Running a Hadoop Application
The first order of the day is to format the Hadoop
DFS
Jump to the Hadoop directory and execute:
bin/hadoop namenode -format
Running Hadoop
To run Hadoop and HDFS:
bin/start-all.sh
To terminate them:
bin/stop-all.sh
35. Running a Hadoop Application
Generating a dataset
Create a temporary directory to hold the data:
mkdir /tmp/gutenberg
Jump to it:
cd /tmp/gutenberg
Download text files:
wget www.gutenberg.org/etext/20417
wget www.gutenberg.org/etext/5000
wget www.gutenberg.org/etext/4300
36. Running a Hadoop Application
Copying the dataset to the HDFS
Jump to the Hadoop directory and execute:
bin/hadoop dfs -copyFromLocal /tmp/gutenberg
/ccw/gutenberg
Running Wordcount
bin/hadoop jar hadoop-examples-1.0.4.jar wordcount
/ccw/gutenberg /ccw/gutenberg-output
Retrieving results from the HDFS
Copy to the local FS:
bin/hadoop dfs -getmerge /ccw/gutenberg-output
/tmp/gutenberg-output
37. Running a Hadoop Application
Accessing the web interface
JobTracker: http://localhost:50030
TaskTracker: http://localhost:50060
Reference: Running Hadoop on Ubuntu Linux
(Single-Node Cluster):
http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/