MapReduce paradigm explained
with Hadoop examples

by Dmytro Sandu
How things began
• 1998 – Google founded:
– Needed to index the entire Web – terabytes of data
– No option other than distributed processing
– Decided to use clusters of low-cost commodity
PCs instead of expensive servers
– Began developing a specialized distributed file
system, later called GFS
– This made it possible to handle terabytes of data and scale
smoothly
A few years later
• A key problem emerges:
– Simple algorithms: search, sort, compute indexes,
etc.
– But a complex environment:
• Parallel computations (thousands of PCs)
• Distributed data
• Load balancing
• Fault tolerance (both hardware and software)

• Result: large and complex code for simple
tasks
Solution
• Some abstraction needed:
– To express simple programs…
– and hide messy details of distributed computing

• Inspired by LISP and other functional
languages
MapReduce algorithm
• Most programs can be expressed as:
– Split input data into pieces
– Apply Map function to each piece
• Map function emits some number of (key, value) pairs

– Gather all pairs with the same key
– Pass each (key, list(values)) to Reduce function
• Reduce function computes single final value out of
list(values)

– List of all (key, final value) pairs is the result
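
The steps above can be sketched in a few lines of single-process Python. This is only an illustration of the data flow, not how Hadoop or Google's library is implemented; the names run_mapreduce, first_letter_map and total_reduce are made up for this example.

```python
from collections import defaultdict

def run_mapreduce(pieces, map_fn, reduce_fn):
    """Run map, gather-by-key, and reduce in one process (illustration only)."""
    grouped = defaultdict(list)
    for piece in pieces:
        # Map: each input piece yields zero or more (key, value) pairs.
        for key, value in map_fn(piece):
            # Gather: collect all values that share the same key.
            grouped[key].append(value)
    # Reduce: collapse each list of values into a single final value.
    # Keys are visited in sorted order, mimicking the framework's sort by key.
    return [(key, reduce_fn(key, values)) for key, values in sorted(grouped.items())]

# Hypothetical example: total length of words, grouped by first letter.
def first_letter_map(word):
    yield word[0], len(word)

def total_reduce(key, values):
    return sum(values)

result = run_mapreduce(["apple", "avocado", "banana"], first_letter_map, total_reduce)
print(result)  # [('a', 12), ('b', 6)]
```

In a real cluster the gather step moves data between machines; here it is just a dictionary.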
For example
• Process election protocols:
– Split protocols into bulletins (ballots)
– Map(bulletin_number, bulletin_data) {
    Emit(bulletin_data.selected_candidate, 1); }
– Reduce(candidate, iterator:votes) {
    int sum = 0;
    for each vote in votes
        sum += vote;
    Emit(sum);
}
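
A runnable Python version of the Map and Reduce pseudocode above might look like the following sketch. It runs in one process and assumes each bulletin is a dictionary with a selected_candidate field; the helper count_votes and the sample data are invented for illustration.

```python
from itertools import groupby
from operator import itemgetter

def map_bulletin(bulletin_number, bulletin_data):
    # Emit one (candidate, 1) pair per bulletin.
    yield bulletin_data["selected_candidate"], 1

def reduce_votes(candidate, votes):
    # Sum the 1s to get the candidate's total.
    return candidate, sum(votes)

def count_votes(bulletins):
    pairs = []
    for number, data in enumerate(bulletins):
        pairs.extend(map_bulletin(number, data))
    # Sorting by key stands in for the framework's gather step.
    pairs.sort(key=itemgetter(0))
    return [
        reduce_votes(candidate, (vote for _, vote in group))
        for candidate, group in groupby(pairs, key=itemgetter(0))
    ]

# Made-up sample data.
bulletins = [
    {"selected_candidate": "Alice"},
    {"selected_candidate": "Bob"},
    {"selected_candidate": "Alice"},
]
totals = count_votes(bulletins)
print(totals)  # [('Alice', 2), ('Bob', 1)]
```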
And run in parallel
What you have to do
• Set up a cluster of many machines
– Usually one master and many slaves

• Pull data into cluster’s file system
– distributed and replicated automatically

• Select data formatter (text, csv, xml, your own)
– Splits data into meaningful pieces for Map() stage

• Write Map() and Reduce() functions
• Run it!
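
As a rough idea of what a data formatter does, the sketch below splits raw text into (byte offset, line) records, which is similar in spirit to how line-oriented text input is usually handed to Map(). The function name text_records is hypothetical, and real formatters also have to deal with records that straddle split boundaries, which this sketch ignores.

```python
def text_records(data: bytes):
    """Yield (byte_offset, line) records, one per line of text."""
    offset = 0
    for raw_line in data.splitlines(keepends=True):
        # The key is the position of the line in the input, the value is its text.
        yield offset, raw_line.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw_line)

records = list(text_records(b"to be\nor not\nto be\n"))
print(records)  # [(0, 'to be'), (6, 'or not'), (13, 'to be')]
```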
What the framework does
• Manages the distributed file system (GFS or HDFS)
• Schedules and distributes Mappers and Reducers
across cluster
• Attempts to run Mappers as close to data
location as possible
• Automatically stores and routes intermediate
data from Mappers to Reducers
• Partitions and sorts output keys
• Restarts failed jobs, monitors failed machines
What this looks like
Distributed reduce
• There are multiple reducers to speed up work
• Each reducer provides separate output file
• Intermediate keys from Map phase are
partitioned across Reducers
– Balanced partitioning function is used, based on
key hash
– Same keys go into single reducer!
– User-defined partitioning function can be used
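
A hash-based partitioning function can be sketched as follows. This is a simplified illustration, not Hadoop's actual partitioner; it uses zlib.crc32 because Python's built-in hash() for strings changes between runs. The names partition and route are made up for this example.

```python
import zlib

def partition(key: str, num_reducers: int) -> int:
    # A stable hash of the key, reduced to a reducer index in [0, num_reducers).
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def route(pairs, num_reducers):
    """Send each (key, value) pair to the bucket of the reducer that owns its key."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)].append((key, value))
    return buckets

pairs = [("Alice", 1), ("Bob", 1), ("Alice", 1), ("Carol", 1)]
buckets = route(pairs, 3)
for index, bucket in enumerate(buckets):
    print(index, bucket)
```

Because the index depends only on the key, both ("Alice", 1) pairs always land in the same bucket, whichever mapper produced them.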
What to do with multiple outputs?
• Can be processed outside the cluster
– Amount of output data is usually much smaller

• User-defined partitioner can sort data across
outputs
– Need to think about partitioning balance
– May require separate smaller MapReduce step to
estimate key distribution

• Or just pass as-is to next MapReduce step
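
One way to get a balanced, order-preserving partitioner is to sample the keys first and derive range boundaries from the sample. The sketch below assumes the sample is representative of the full key distribution; split_points and range_partition are hypothetical names.

```python
import bisect

def split_points(sample_keys, num_reducers):
    """Pick num_reducers - 1 boundaries that divide the sample evenly."""
    ordered = sorted(sample_keys)
    step = len(ordered) / num_reducers
    return [ordered[int(step * i)] for i in range(1, num_reducers)]

def range_partition(key, boundaries):
    # Keys below the first boundary go to reducer 0, the next range to reducer 1, etc.
    return bisect.bisect_right(boundaries, key)

sample = [5, 1, 9, 3, 7, 2, 8, 4, 6]
boundaries = split_points(sample, 3)
print(boundaries)  # [4, 7]
print([range_partition(k, boundaries) for k in (1, 4, 6, 9)])  # [0, 1, 1, 2]
```

With such a partitioner, reading the output files in reducer order yields globally sorted data.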
Now let’s sort
• MapReduce steps can be chained together
• Built-in sort by key is actively exploited
• First example output was sorted by candidate
name, with the vote count as the value
• Let’s re-sort by vote count and see the leader
– Map(candidate, count)
{Emit(concat(count, candidate), null)}
– Partition(key)
// max_count: largest vote count, assumed known in advance
{return get_count(key) * reducers_count div (max_count + 1);}
– Reduce(key, values[]) { Emit(null) }
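
The re-sort step can be sketched in Python as below. Several things here are assumptions for illustration: the key is a (count, candidate) tuple rather than a concatenated string, so counts compare numerically; the largest count (max_count) is taken as known so that the partition index stays within the number of reducers; and the sample totals are made up.

```python
def map_swap(candidate, count):
    # Move the count into the key so that the sort by key orders by count.
    yield (count, candidate), None

def partition(key, reducers_count, max_count):
    # Range partitioning: higher counts go to higher-numbered reducers.
    count, _ = key
    return count * reducers_count // (max_count + 1)

def sort_by_count(totals, reducers_count=2):
    max_count = max(count for _, count in totals)
    outputs = [[] for _ in range(reducers_count)]
    for candidate, count in totals:
        for key, _ in map_swap(candidate, count):
            outputs[partition(key, reducers_count, max_count)].append(key)
    # Each reducer receives its keys in sorted order and emits them unchanged.
    return [sorted(keys) for keys in outputs]

totals = [("Alice", 120), ("Bob", 45), ("Carol", 300)]
outputs = sort_by_count(totals)
print(outputs)  # [[(45, 'Bob'), (120, 'Alice')], [(300, 'Carol')]]
```

The leader is the last key in the last non-empty output.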
What happened next
• 2003–2004 - Google tells the world about their work:
– GFS file system (2003), MapReduce C++ library (2004)

• 2005 - Doug Cutting and Mike Cafarella create
their open-source implementation in Java:
– Apache HDFS and Apache Hadoop

• The Big Data wave first hits Facebook, Yahoo and
other internet giants, then others
• Tons of tools and cloud solutions emerge around it
• 2013, Oct 15 – Hadoop 2.2.0 released
Hadoop 2.2.0 vs 1.2.1
• Moves to more general cluster management (YARN)

• Better Windows support (still little docs)
How to get in
• Download from http://hadoop.apache.org/
– Explore the API docs and example code
– Pull the examples into Eclipse, resolve dependencies by
linking JARs, and try writing your own MR code
– Export your code as a JAR

• Here the problems begin:
– Setup is hard and time-consuming, especially on Windows
– 2.2.0 is more complex than 1.x, and less information is available
Possible solutions
• Windows + Cygwin + Hadoop – fail
• Ubuntu + Hadoop – too much time
• Hortonworks Sandbox – win!
– Bundled VM images
– Single-node Hadoop ready to use
– All major Hadoop-based tools also installed
– Apache Hue – web-based management UI
– Educational-only license

• http://hortonworks.com/products/hortonworks-sandbox/
UI look
Let’s pull in some files
And set up standard word count
• Job Designer -> New Action -> Java
– Jar path: /user/hue/oozie/workspaces/lib/hadoopexamples.jar
– Main class: org.apache.hadoop.examples.WordCount
– Args:
/user/hue/oozie/workspaces/data/Voroshilovghrad_SierghiiViktorovichZhadan.txt
/user/hue/oozie/workspaces/data/wc.txt
TokenizerMapper
IntSumReducer
WordCount
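
The code shown on these three slides is not part of this transcript. As a stand-in, here is a plain-Python sketch of the same word count logic: a tokenizing mapper that emits (word, 1) and a reducer that sums the ones. It is not the Hadoop Java source; the function names only echo the class names above, and the line index is used as the map key for simplicity.

```python
from collections import defaultdict

def tokenizer_map(line_number, line):
    # Emit (word, 1) for every whitespace-separated token in the line.
    for word in line.split():
        yield word, 1

def int_sum_reduce(word, counts):
    # Add up the ones emitted for this word.
    return word, sum(counts)

def word_count(lines):
    grouped = defaultdict(list)
    for line_number, line in enumerate(lines):
        for word, one in tokenizer_map(line_number, line):
            grouped[word].append(one)
    # Output is ordered by word, mimicking the framework's sort by key.
    return [int_sum_reduce(word, counts) for word, counts in sorted(grouped.items())]

counts = word_count(["to be or not", "to be"])
print(counts)  # [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```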
Now let’s sort the result
WordSortCount
Sources
• http://research.google.com/archive/mapreduce.html
• http://hadoop.apache.org
• http://hortonworks.com/products/hortonworks-sandbox/
• http://stackoverflow.com/questions/tagged/hadoop
Thanks!
