1. © Copyright 2012 EMC Corporation. All rights reserved.
MapReduce
Design Patterns
Donald Miner
Greenplum Hadoop Solutions Architect
@donaldpminer
2.
Book was made available December 2012
4.
What are design patterns?
(in general)
Reusable solutions to problems
Domain independent
Not a cookbook, but not a guide
Not a finished solution
5.
Why design patterns?
(in general)
Makes the intent of code easier to understand
Provides a common language for solutions
Be able to reuse code
Known performance profiles and limitations of solutions
6.
Why MapReduce design patterns?
Recurring patterns in data-related problem solving
Groups are building patterns independently
Lots of new users every day
MapReduce is a new way of thinking
Foundation for higher-level tools (Pig, Hive, …)
Community is reaching the right level of maturity
7.
Pattern Template
Intent
Motivation
Applicability
Structure
Consequences
Resemblances
Performance analysis
Examples
8.
Pattern Categories
Summarization
Filtering
Data Organization
Joins
Metapatterns
Input and output
9.
Filtering patterns
"I only want some of my data!"
Extract interesting subsets
Filtering
Bloom filtering
Top ten
Distinct
Summarization patterns
"I only want a top-level view of my data!"
Top-down summaries
Numerical summarizations
Inverted index
Counting with counters
10.
Data organization patterns
"I want to change the way my data is organized!"
Reorganize, restructure
Structured to hierarchical
Partitioning
Binning
Total order sorting
Shuffling
Join patterns
"I want to mash my different data sources together!"
Bringing data sets together
Reduce-side join
Replicated join
Composite join
Cartesian product
11.
Metapatterns
"I want to solve a complex problem with multiple patterns!"
Patterns of patterns
Job chaining
Chain folding
Job merging
Input and output patterns
"I want to get data or put data in an unusual place!"
Custom input and output
Generating data
External source output
External source input
Partition pruning
12.
Pattern: “Top Ten”
(filtering)
Intent
Retrieve a relatively small number of top K records, according to a ranking scheme in your data set, no matter how large the data.
Motivation
Finding outliers
Top ten lists are fun
Building dashboards
Sorting/Limit isn’t going to work here
13.
Pattern: “Top Ten”
Applicability
Rank-able records
Limited number of output records
Consequences
The top K records are returned.
14.
Pattern: “Top Ten”
Structure
class mapper:
    setup():
        initialize top ten sorted list
    map(key, record):
        insert record into top ten sorted list
        if length of list is greater than 10:
            truncate list to a length of 10
    cleanup():
        for record in top ten sorted list:
            emit null, record

class reducer:
    setup():
        initialize top ten sorted list
    reduce(key, records):
        sort records
        truncate records to top 10
        for record in records:
            emit record
15.
Pattern: “Top Ten”
Resemblances
SQL:
SELECT * FROM table ORDER BY col4 DESC LIMIT 10;
Pig:
B = ORDER A BY col4 DESC;
C = LIMIT B 10;
16.
Pattern: “Top Ten”
Performance analysis
Pretty quick: map-heavy, low network usage
Pay attention to how many records the reducer is getting
[number of input splits] x K
Example
Top ten StackOverflow users by reputation
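The "[number of input splits] x K" figure can be checked with a small stand-alone sketch. It does not use Hadoop: the splits are plain lists of integers, and the class and method names (`TopKSketch`, `mapPhase`, `reducePhase`) are made up for illustration. It assumes each split is handled by one mapper that emits only its local top K.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

class TopKSketch {
    // Keep the k largest values in a min-heap: the smallest survivor sits at
    // the head, so it is the one evicted when the heap grows past k.
    static List<Integer> topK(Iterable<Integer> values, int k) {
        PriorityQueue<Integer> heap = new PriorityQueue<>();
        for (int v : values) {
            heap.add(v);
            if (heap.size() > k) {
                heap.poll();
            }
        }
        List<Integer> result = new ArrayList<>(heap);
        result.sort(Comparator.reverseOrder());
        return result;
    }

    // Simulated map phase: every split contributes at most k records, so the
    // single reducer receives at most splits.size() * k records in total.
    static List<Integer> mapPhase(List<List<Integer>> splits, int k) {
        List<Integer> reducerInput = new ArrayList<>();
        for (List<Integer> split : splits) {
            reducerInput.addAll(topK(split, k));
        }
        return reducerInput;
    }

    // Simulated reduce phase: the global top k of all local top-k lists.
    static List<Integer> reducePhase(List<Integer> reducerInput, int k) {
        return topK(reducerInput, k);
    }
}
```

With three splits and K = 2, `mapPhase` hands the reducer six records regardless of how large each split is, which is why the pattern stays light on network traffic until the number of splits or K grows large.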
17.
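import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public static class TopTenMapper extends Mapper<Object, Text, NullWritable, Text> {
    // Reputation -> record, sorted ascending, so firstKey() is the lowest reputation held.
    private TreeMap<Integer, Text> repToRecordMap = new TreeMap<Integer, Text>();

    @Override
    public void map(Object key, Text value, Context context) {
        // MRDPUtils.transformXmlToMap is a helper defined elsewhere (not shown on the slide).
        Map<String, String> parsed = MRDPUtils.transformXmlToMap(value.toString());
        String reputation = parsed.get("Reputation");
        // Copy the Text, because Hadoop reuses the value object between map() calls.
        repToRecordMap.put(Integer.parseInt(reputation), new Text(value));
        // Evict the lowest reputation once more than ten records are held.
        if (repToRecordMap.size() > 10) {
            repToRecordMap.remove(repToRecordMap.firstKey());
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Emit this mapper's local top ten under a single null key.
        for (Text t : repToRecordMap.values()) {
            context.write(NullWritable.get(), t);
        }
    }
}
Top Ten Mapper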
18.
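import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public static class TopTenReducer extends Reducer<NullWritable, Text, NullWritable, Text> {
    private TreeMap<Integer, Text> repToRecordMap = new TreeMap<Integer, Text>();

    @Override
    public void reduce(NullWritable key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Every mapper emits under the same null key, so this call sees all local top tens.
        for (Text value : values) {
            Map<String, String> parsed = MRDPUtils.transformXmlToMap(value.toString());
            repToRecordMap.put(Integer.parseInt(parsed.get("Reputation")), new Text(value));
            if (repToRecordMap.size() > 10) {
                repToRecordMap.remove(repToRecordMap.firstKey());
            }
        }
        // Write the global top ten, highest reputation first.
        for (Text t : repToRecordMap.descendingMap().values()) {
            context.write(NullWritable.get(), t);
        }
    }
}
Top Ten Reducer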
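One caveat of keying a TreeMap on reputation: two users with the same reputation share a key, so the later record overwrites the earlier one. A possible alternative, sketched here in plain Java rather than against the Hadoop API, is a bounded min-heap that orders records by reputation without treating equal reputations as duplicates. The `User` class and `TieSafeTopTen` name are hypothetical stand-ins, not part of the original example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

class TieSafeTopTen {
    // Hypothetical stand-in for a parsed user record.
    static final class User {
        final String id;
        final int reputation;

        User(String id, int reputation) {
            this.id = id;
            this.reputation = reputation;
        }
    }

    static List<User> top(List<User> users, int k) {
        Comparator<User> byReputation = Comparator.comparingInt((User u) -> u.reputation);
        // Min-heap: the lowest-reputation survivor is at the head and is
        // evicted first. Equal reputations coexist instead of overwriting.
        PriorityQueue<User> heap = new PriorityQueue<>(byReputation);
        for (User u : users) {
            heap.add(u);
            if (heap.size() > k) {
                heap.poll();
            }
        }
        List<User> result = new ArrayList<>(heap);
        result.sort(byReputation.reversed());
        return result;
    }
}
```

Which of several tied records survives at the cut-off is arbitrary in this sketch; a secondary sort key would be needed if that choice has to be deterministic.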
19.
Pattern: “Bloom Filtering”
(filtering)
Intent
Keep records that are a member of some predefined set of values. It is not a problem if the output is a bit inaccurate.
Motivation
Similar to normal Boolean filtering, but we are filtering on set membership
Set membership is evaluated with a Bloom filter
20.
Pattern: “Bloom Filtering”
Applicability
A feature can be extracted and tested for set membership
Predetermined set is available
Some false positives are acceptable
Consequences
Records that pass the Bloom filter membership test are returned
Known Uses
Keep all records in a watch list (and a few records that aren’t)
Pre-filtering records before an expensive membership test
21.
Pattern: “Bloom Filtering”
Structure
class mapper:
setup():
load bloom filter into memory
map(key, record):
if record in bloom filter:
emit (record, null)
Resemblances
UDFs?
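To make the structure above concrete, here is a minimal Bloom filter in plain Java together with a map-style filtering loop. It is a sketch, not Hadoop's own Bloom filter class: `SimpleBloomFilter`, its double-hashing scheme, and the `filter` helper are assumptions chosen for brevity. The property it illustrates is the one the pattern relies on: members are never rejected, while non-members may occasionally pass.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

class SimpleBloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    SimpleBloomFilter(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    // FNV-1a over the bytes, used as the second base hash.
    private static int fnv1a(byte[] data) {
        int hash = 0x811c9dc5;
        for (byte b : data) {
            hash ^= (b & 0xff);
            hash *= 16777619;
        }
        return hash;
    }

    // Derive the i-th bit position from two base hashes (double hashing).
    private int position(byte[] data, int i) {
        int h1 = Arrays.hashCode(data);
        int h2 = fnv1a(data);
        return Math.floorMod(h1 + i * h2, numBits);
    }

    void add(String item) {
        byte[] data = item.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < numHashes; i++) {
            bits.set(position(data, i));
        }
    }

    // false means "definitely not in the set"; true means "probably in the set".
    boolean mightContain(String item) {
        byte[] data = item.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(position(data, i))) {
                return false;
            }
        }
        return true;
    }

    // Stand-in for the mapper: keep only records that pass the membership test.
    static List<String> filter(List<String> records, SimpleBloomFilter filter) {
        List<String> kept = new ArrayList<>();
        for (String record : records) {
            if (filter.mightContain(record)) {
                kept.add(record);
            }
        }
        return kept;
    }
}
```

In a real job the filter would be built ahead of time from the predetermined set and loaded in `setup()`, as the pseudocode describes.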
22.
Pattern: “Bloom Filtering”
Performance analysis
Map-only
Slight overhead in moving Bloom filter into memory
Bloom filter membership tests are constant time
Example
Filter StackOverflow comments that do not contain a keyword
Distributed HBase query using a Bloom filter
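How much memory the filter costs each mapper depends on the set size and the tolerated false-positive rate. The slides do not give a sizing rule; the sketch below uses the standard textbook formulas, m = -n ln p / (ln 2)^2 bits and k = (m / n) ln 2 hash functions, where n is the number of set members and p the target false-positive rate. The class and method names are hypothetical.

```java
class BloomSizing {
    // Bits needed for n members at false-positive rate p (0 < p < 1).
    static long optimalBits(long n, double p) {
        double ln2 = Math.log(2);
        return (long) Math.ceil(-n * Math.log(p) / (ln2 * ln2));
    }

    // Number of hash functions that minimizes false positives for m bits and n members.
    static int optimalHashes(long n, long m) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }
}
```

For one million members at a 1% false-positive rate this works out to roughly 9.6 million bits (about 1.2 MB) and 7 hash functions, which is consistent with the "slight overhead" of loading the filter into memory.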
23.
Candidate new patterns
Link Graph processing patterns (new category)
– Shortest path, diameter, graph stats, connected components, etc.
– Too domain specific?
– Has its own distinct patterns
Projection (filtering)
– Remove “columns” of data
Transformation (data organization?)
– Take a data set but transform it into something else
24.
Future and call to action
Contributing your own patterns
Trends in the nature of data
– Images, audio, video, biomedical, social …
Libraries, abstractions, and tools
Ecosystem patterns: YARN, HBase, ZooKeeper, …
Editor's notes
Quick overview of the book
This talk is not to sell you on the book; it's to sell you on why MRDPs are important for the community
Intermediate to advanced MapReduce resource
Early beginners and experts alike can find some use in it
Some knowledge of Hadoop is encouraged
Tom White’s Hadoop: The Definitive Guide is a good start
Story about explaining joins
Teaching Hadoop classes and explaining how to solve problems was challenging
Most of the stuff in this book is not novel; it’s been collected through different sources
Spend time talking about each and what purpose they have
Remember to mention that examples are in Hadoop
Just briefly outline