© Copyright 2012 EMC Corporation. All rights reserved.
MapReduce
Design Patterns
Donald Miner
Greenplum Hadoop Solutions Architect
@donaldpminer
Book was made available December 2012
Inspiration for my book
What are design patterns?
(in general)
Reusable solutions to problems
Domain independent
Not a cookbook, but not a guide
Not a finished solution
Why design patterns?
(in general)
Makes the intent of code easier to understand
Provides a common language for solutions
Makes code reusable
Gives solutions known performance profiles and limitations
Why MapReduce design patterns?
Recurring patterns in data-related problem solving
Groups are building patterns independently
Lots of new users every day
MapReduce is a new way of thinking
Foundation for higher-level tools (Pig, Hive, …)
Community is reaching the right level of maturity
Pattern Template
Intent
Motivation
Applicability
Structure
Consequences
Resemblances
Performance analysis
Examples
Pattern Categories
Summarization
Filtering
Data Organization
Joins
Metapatterns
Input and output
Filtering patterns
Extract interesting subsets
“I only want some of my data!”
Filtering
Bloom filtering
Top ten
Distinct
Summarization patterns
Top-down summaries
“I only want a top-level view of my data!”
Numerical summarizations
Inverted index
Counting with counters
Data organization patterns
Reorganize, restructure
“I want to change the way my data is organized!”
Structured to hierarchical
Partitioning
Binning
Total order sorting
Shuffling
Join patterns
Bringing data sets together
“I want to mash my different data sources together!”
Reduce-side join
Replicated join
Composite join
Cartesian product
Metapatterns
Patterns of patterns
“I want to solve a complex problem with multiple patterns!”
Job chaining
Chain folding
Job merging
Input and output patterns
“I want to get data or put data in an unusual place!”
Custom input and output
Generating data
External source output
External source input
Partition pruning
Pattern: “Top Ten”
(filtering)
Intent
Retrieve a relatively small number of top K records, according to a ranking scheme in your data set, no matter how large the data.
Motivation
Finding outliers
Top ten lists are fun
Building dashboards
Sorting/Limit isn’t going to work here
Pattern: “Top Ten”
Applicability
Rank-able records
Limited number of output records
Consequences
The top K records are returned.
Pattern: “Top Ten”
Structure
class mapper:
    setup():
        initialize top ten sorted list
    map(key, record):
        insert record into top ten sorted list
        if length of list is greater than 10:
            truncate list to a length of 10
    cleanup():
        for record in top ten sorted list:
            emit null, record

class reducer:
    setup():
        initialize top ten sorted list
    reduce(key, records):
        sort records
        truncate records to top 10
        for record in records:
            emit record
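The structure above can be tried out without a cluster. The following is a minimal sketch in plain Java, not the author's Hadoop code: each inner list stands in for one input split, `localTopK` plays the mapper, and `topK` plays the single reducer. The class name, method names, and sample values are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

public class TopKSketch {
    // "Mapper" side: keep only the k largest values seen in one input split.
    static List<Integer> localTopK(List<Integer> split, int k) {
        // Min-heap: the smallest value kept so far sits on top and is evicted first.
        PriorityQueue<Integer> heap = new PriorityQueue<>();
        for (int value : split) {
            heap.add(value);
            if (heap.size() > k) {
                heap.poll();
            }
        }
        return new ArrayList<>(heap);
    }

    // "Reducer" side: merge every split's candidates and keep the global top k, largest first.
    static List<Integer> topK(List<List<Integer>> splits, int k) {
        List<Integer> candidates = new ArrayList<>();
        for (List<Integer> split : splits) {
            candidates.addAll(localTopK(split, k)); // at most k records per split
        }
        List<Integer> result = localTopK(candidates, k);
        result.sort(Collections.reverseOrder());
        return result;
    }

    public static void main(String[] args) {
        List<List<Integer>> splits = List.of(
            List.of(5, 42, 17, 8),
            List.of(99, 1, 63),
            List.of(23, 77));
        System.out.println(topK(splits, 3)); // prints [99, 77, 63]
    }
}
```

Because each split forwards at most k candidates, the reduce step only ever sees a small list, regardless of how large the splits are.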
Pattern: “Top Ten”
Resemblances
SQL:
SELECT * FROM table ORDER BY col4 DESC LIMIT 10;
Pig:
B = ORDER A BY col4 DESC;
C = LIMIT B 10;
Pattern: “Top Ten”
Performance analysis
Pretty quick: map-heavy, low network usage
Pay attention to how many records the reducer receives: [number of input splits] × K
Example
Top ten StackOverflow users by reputation
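Top Ten Mapper

// Needs java.io.IOException, java.util.Map, java.util.TreeMap,
// org.apache.hadoop.io.NullWritable, org.apache.hadoop.io.Text and
// org.apache.hadoop.mapreduce.Mapper in the enclosing file.
public static class TopTenMapper extends Mapper<Object, Text, NullWritable, Text> {
    // Reputation -> record, sorted ascending, so firstKey() is the smallest kept value.
    private TreeMap<Integer, Text> repToRecordMap = new TreeMap<Integer, Text>();

    @Override
    public void map(Object key, Text value, Context context) {
        Map<String, String> parsed = MRDPUtils.transformXmlToMap(value.toString());
        String reputation = parsed.get("Reputation");
        if (reputation == null) {
            return; // skip records without a Reputation attribute
        }
        // Copy the Text, because Hadoop reuses the same object for every input record.
        repToRecordMap.put(Integer.parseInt(reputation), new Text(value));
        if (repToRecordMap.size() > 10) {
            repToRecordMap.remove(repToRecordMap.firstKey());
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Emit this mapper's local top ten under a single null key.
        for (Text t : repToRecordMap.values()) {
            context.write(NullWritable.get(), t);
        }
    }
}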
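Top Ten Reducer

// Needs java.io.IOException, java.util.Map, java.util.TreeMap,
// org.apache.hadoop.io.NullWritable, org.apache.hadoop.io.Text and
// org.apache.hadoop.mapreduce.Reducer in the enclosing file.
public static class TopTenReducer extends Reducer<NullWritable, Text, NullWritable, Text> {
    private TreeMap<Integer, Text> repToRecordMap = new TreeMap<Integer, Text>();

    @Override
    public void reduce(NullWritable key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Every mapper emits under the same null key, so this one call sees all candidates.
        for (Text value : values) {
            Map<String, String> parsed = MRDPUtils.transformXmlToMap(value.toString());
            repToRecordMap.put(Integer.parseInt(parsed.get("Reputation")), new Text(value));
            if (repToRecordMap.size() > 10) {
                repToRecordMap.remove(repToRecordMap.firstKey());
            }
        }
        // Emit from highest reputation to lowest.
        for (Text t : repToRecordMap.descendingMap().values()) {
            context.write(NullWritable.get(), t);
        }
    }
}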
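One caveat with keying a `TreeMap` by reputation: `put` replaces any existing entry with the same key, so two users with equal reputation cannot both be kept. A possible alternative, sketched below in plain Java under the assumption that ties should be preserved, is a bounded min-heap of whole records. The `User` class and the sample data are hypothetical stand-ins for parsed StackOverflow records, not part of the slides.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class TopTenWithTies {
    // Hypothetical stand-in for a parsed user record.
    static final class User {
        final String id;
        final int reputation;

        User(String id, int reputation) {
            this.id = id;
            this.reputation = reputation;
        }
    }

    // Keeps the k highest-reputation users; equal reputations coexist instead of overwriting.
    static List<String> topIds(List<User> users, int k) {
        PriorityQueue<User> heap =
            new PriorityQueue<>(Comparator.comparingInt((User u) -> u.reputation));
        for (User u : users) {
            heap.add(u);
            if (heap.size() > k) {
                heap.poll(); // drop one of the lowest-reputation users seen so far
            }
        }
        List<User> kept = new ArrayList<>(heap);
        // Highest reputation first; ties are ordered by id so the output is deterministic.
        kept.sort(Comparator.comparingInt((User u) -> u.reputation).reversed()
            .thenComparing(u -> u.id));
        List<String> ids = new ArrayList<>();
        for (User u : kept) {
            ids.add(u.id);
        }
        return ids;
    }

    public static void main(String[] args) {
        List<User> users = List.of(
            new User("a", 100), new User("b", 100),
            new User("c", 50), new User("d", 200));
        System.out.println(topIds(users, 3)); // prints [d, a, b]
    }
}
```

With the `TreeMap` approach, users "a" and "b" would share the key 100 and only one would survive. When a tie falls exactly on the cutoff, which of the tied records is evicted by the heap is arbitrary.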
Pattern: “Bloom Filtering”
(filtering)
Intent
Keep records that are a member of some predefined set of values. It is not a problem if the output is a bit inaccurate.
Motivation
Similar to normal Boolean filtering, but we are filtering on set membership
Set membership is evaluated with a Bloom filter
Pattern: “Bloom Filtering”
Applicability
A feature can be extracted and tested for set membership
Predetermined set is available
Some false positives are acceptable
Consequences
Records that pass the Bloom filter membership test are returned
Known Uses
Keep all records in a watch list (and a few records that aren’t)
Pre-filtering records before an expensive membership test
Pattern: “Bloom Filtering”
Structure
class mapper:
    setup():
        load bloom filter into memory
    map(key, record):
        if record in bloom filter:
            emit (record, null)
Resemblances
UDFs?
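To make the mapper structure concrete, here is a minimal, self-contained sketch of a Bloom filter and a map-only filtering step in plain Java. It is an illustration under assumptions, not Hadoop's own Bloom filter class: the class name, the FNV-1a and double-hashing scheme, and the sample watch list are all choices made for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

public class BloomFilterSketch {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    BloomFilterSketch(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    // 32-bit FNV-1a hash, used here as the second base hash.
    private static int fnv1a(byte[] data) {
        int hash = 0x811C9DC5;
        for (byte b : data) {
            hash ^= (b & 0xFF);
            hash *= 0x01000193;
        }
        return hash;
    }

    // i-th bit position for a value, derived from two base hashes (double hashing).
    private int position(byte[] bytes, int i) {
        int h1 = Arrays.hashCode(bytes);
        int h2 = fnv1a(bytes);
        return Math.floorMod(h1 + i * h2, numBits);
    }

    void add(String value) {
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < numHashes; i++) {
            bits.set(position(bytes, i));
        }
    }

    // False means "definitely not in the set"; true means "probably in the set".
    boolean mightContain(String value) {
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(position(bytes, i))) {
                return false;
            }
        }
        return true;
    }

    // Map-only step: keep every record that passes the membership test.
    static List<String> filter(List<String> records, BloomFilterSketch filter) {
        List<String> kept = new ArrayList<>();
        for (String record : records) {
            if (filter.mightContain(record)) {
                kept.add(record);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        BloomFilterSketch watchList = new BloomFilterSketch(1024, 3);
        watchList.add("alice");
        watchList.add("bob");
        System.out.println(filter(List.of("alice", "carol", "bob", "dave"), watchList));
    }
}
```

Every record on the watch list is guaranteed to pass; a record that is not on it may occasionally pass too, which is the false-positive trade-off the pattern accepts.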
Pattern: “Bloom Filtering”
Performance analysis
Map-only
Slight overhead in moving Bloom filter into memory
Bloom filter membership tests are constant time
Example
Filter StackOverflow comments that do not contain a keyword
Distributed HBase query using a Bloom filter
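How many false positives to expect depends on how the filter is sized. The slides do not give numbers, so the sketch below uses the standard Bloom filter sizing formulas: for n members and a target false-positive rate p, the bit count is m = -n ln p / (ln 2)^2 and the number of hash functions is k = (m / n) ln 2. The class and method names are assumptions for illustration.

```java
public class BloomSizing {
    // Number of bits m for n expected members and target false-positive rate p.
    static int optimalBits(int n, double p) {
        return (int) Math.ceil(-n * Math.log(p) / (Math.log(2) * Math.log(2)));
    }

    // Number of hash functions k for n members stored in m bits.
    static int optimalHashes(int n, int m) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    // Approximate false-positive rate once n members are stored: (1 - e^(-kn/m))^k.
    static double falsePositiveRate(int n, int m, int k) {
        return Math.pow(1.0 - Math.exp(-(double) k * n / m), k);
    }

    public static void main(String[] args) {
        int n = 1000;
        int m = optimalBits(n, 0.01);
        int k = optimalHashes(n, m);
        System.out.println(m + " bits, " + k + " hash functions");
        System.out.println("approximate false-positive rate: " + falsePositiveRate(n, m, k));
    }
}
```

For 1,000 members and a 1% target this works out to 9,586 bits and 7 hash functions, a little over 1 KB, which is why loading the filter into each mapper's memory costs so little.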
Candidate new patterns
Link Graph processing patterns (new category)
– Shortest path, diameter, graph stats, connected components, etc.
– Too domain specific?
– Has its own distinct patterns
Projection (filtering)
– Remove “columns” of data
Transformation (data organization?)
– Take a data set but transform it into something else
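As a rough idea of what the candidate Projection pattern could look like, the sketch below keeps only selected columns of a delimited record, which is the per-record work a map-only job would do. The slides do not define the pattern in detail, so the class name, the method signature, and the comma-separated sample record are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class ProjectionSketch {
    // Keep only the requested column indexes of a delimited record, in the order given.
    static String project(String record, String delimiter, int... keep) {
        // Pattern.quote treats the delimiter literally; -1 preserves trailing empty fields.
        String[] fields = record.split(Pattern.quote(delimiter), -1);
        List<String> kept = new ArrayList<>();
        for (int index : keep) {
            kept.add(fields[index]);
        }
        return String.join(delimiter, kept);
    }

    public static void main(String[] args) {
        System.out.println(project("42,alice,1500,NY", ",", 0, 2)); // prints 42,1500
    }
}
```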
Future and call to action
Contributing your own patterns
Trends in the nature of data
– Images, audio, video, biomedical, social …
Libraries, abstractions, and tools
Ecosystem patterns: YARN, HBase, ZooKeeper, …
Editor's notes

  1. Quick overview of the book. This talk is not to sell you on the book; it's to sell you on why MRDPs are important for the community. Intermediate to advanced MapReduce resource. Early beginners and experts alike can find some use in it. Some knowledge of Hadoop is encouraged. Tom White’s Hadoop: The Definitive Guide is a good start.
  2. Story about explaining joins. Teaching Hadoop classes and explaining how to solve problems was challenging. Most of the stuff in this book is not novel; it’s been collected through different sources.
  3. Spend time talking about each and what purpose they have. Remember to mention that examples are in Hadoop.
  4. Just briefly outline.