June 14, 2012

Optimizing MapReduce Job Performance
Todd Lipcon [@tlipcon]
Introductions

    •  Software Engineer at Cloudera since 2009
    •  Committer and PMC member on HDFS,
       MapReduce, and HBase
    •  Spend lots of time looking at full stack
       performance

    •  This talk is to help you develop faster jobs
      –  If you want to hear about how we made Hadoop
         faster, see my Hadoop World 2011 talk on
         cloudera.com

                       ©2011 Cloudera, Inc. All Rights Reserved.
Aspects of Performance

    •  Algorithmic performance
      –  big-O, join strategies, data structures,
         asymptotes
    •  Physical performance
      –  Hardware (disks, CPUs, etc)
    •  Implementation performance
      –  Efficiency of code, avoiding extra work
      –  Make good use of available physical perf


Performance fundamentals

    •  You can’t tune what you don’t
       understand
      –  MR’s strength as a framework is its black-box
         nature
      –  To get optimal performance, you have to
         understand the internals

    •  This presentation: understanding the
       black box

Performance fundamentals (2)

    •  You can’t improve what you can’t
       measure
      –  Ganglia/Cacti/Cloudera Manager/etc a must
      –  Top 4 metrics: CPU, Memory, Disk, Network
      –  MR job metrics: slot-seconds, CPU-seconds,
         task wall-clocks, and I/O


    •  Before you start: run jobs, gather data


Graphing bottlenecks

    •  [Figure: cluster monitoring graphs, with annotations]
      –  Most jobs are not CPU-bound; this job might be
         CPU-bound in the map phase
      –  Plenty of free RAM: perhaps we can make better
         use of it?
      –  Fairly flat-topped network graph: a bottleneck?
Performance tuning cycle

    •  Run job
    •  Identify bottleneck
      –  Graphs
      –  Job counters
      –  Job logs
      –  Profiler results
    •  Address bottleneck
      –  Tune configs
      –  Improve code
      –  Rethink algos
    •  Repeat

    •  In order to understand these metrics and make
       changes, you need to understand MR internals.




MR from 10,000 feet

    •  InputFormat → Map Task → Sort/Spill → Fetch → Merge → Reduce Task → OutputFormat




Map-side sort/spill overview
  •  Goal: when complete, map task outputs one sorted file
  •  What happens when you call OutputCollector.collect()?
      1. Map task calls .collect(K,V): the in-memory
         MapOutputBuffer holds serialized, unsorted key-values
      2. Output buffer fills up: contents are sorted, partitioned
         and spilled to disk as an IFile
      3. Map task finishes: all IFiles are merged (map-side
         merge) into a single IFile per task
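
The three steps above can be sketched as a toy model. This is only an illustration in Python, not Hadoop's implementation: it ignores partitioning and serialization, measures buffer capacity in records rather than bytes, and the name `map_side_sort` is hypothetical.

```python
import heapq

def map_side_sort(records, buffer_capacity):
    """Toy model: buffer records, spill sorted runs, then merge the runs."""
    spills = []   # each entry stands in for one sorted on-disk IFile
    buffer = []   # stands in for the in-memory MapOutputBuffer
    for record in records:
        buffer.append(record)              # step 1: collect into the buffer
        if len(buffer) >= buffer_capacity:
            spills.append(sorted(buffer))  # step 2: sort and spill
            buffer = []
    if buffer:
        spills.append(sorted(buffer))      # final flush when the task ends
    merged = list(heapq.merge(*spills))    # step 3: merge all spills
    return merged, len(spills)

output, spill_count = map_side_sort([5, 3, 9, 1, 7, 2, 8], buffer_capacity=3)
print(output, spill_count)  # [1, 2, 3, 5, 7, 8, 9] 3
```

With a buffer large enough to hold all seven records, the same call would produce a single spill and the merge step would have nothing to combine.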


Zooming further: MapOutputBuffer (Hadoop 1.0)

  •  io.sort.mb is the total buffer size, split into:
      –  Metadata buffers: io.sort.record.percent * io.sort.mb
          kvoffsets: 1 indirect-sort index per record
          (4 bytes/rec)
          kvindices: (Partition, KOff, VOff) per record
          (12 bytes/rec)
          16 bytes/rec of metadata in total
      –  kvbuffer: (1-io.sort.record.percent) * io.sort.mb
          Raw, serialized (Key, Val) pairs (R bytes/rec)
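
To see which part of the buffer fills first, the layout can be turned into a small calculation. This is a sketch under the assumptions that metadata costs 16 bytes per record (4 + 12) and that every record has the same serialized size; the function name `buffer_capacity` is hypothetical.

```python
RECORD_META_BYTES = 16  # 4 bytes in kvoffsets + 12 bytes in kvindices

def buffer_capacity(io_sort_mb, record_percent, avg_record_bytes):
    """Return (metadata capacity, data capacity, side that fills first).

    Capacities are expressed in records.
    """
    total_bytes = io_sort_mb * 1024 * 1024
    meta_bytes = int(total_bytes * record_percent)
    data_bytes = total_bytes - meta_bytes
    meta_records = meta_bytes // RECORD_META_BYTES
    data_records = data_bytes // avg_record_bytes
    first_full = "metadata" if meta_records < data_records else "kvbuffer"
    return meta_records, data_records, first_full

# 100MB buffer, default io.sort.record.percent, 100-byte records
print(buffer_capacity(100, 0.05, 100))
```

For 100-byte records the metadata side runs out long before the kvbuffer does, which is the situation reported as "record full = true" in the task logs.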




MapOutputBuffer spill behavior

 •  Memory is limited: must spill
     –  If either the kvbuffer or the metadata
        buffers fill up, “spill” to disk
     –  In fact, we spill before it’s full (in another
        thread): configure io.sort.spill.percent
 •  Performance impact
     –  If we spill more than one time, we must re-
        read and re-write all data: 3x the IO!
     –  #1 goal for map task optimization: spill once!
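
The 3x figure follows from counting the passes over the data. A rough sketch, assuming a single merge pass is enough (the number of spills does not exceed the merge fan-in) and ignoring compression and Combiners; `map_spill_io` is a hypothetical name:

```python
def map_spill_io(map_output_bytes, num_spills):
    """Estimate local disk bytes read plus written by one map task."""
    written = map_output_bytes         # every record is spilled once
    read = 0
    if num_spills > 1:
        read += map_output_bytes       # the merge re-reads all spill files
        written += map_output_bytes    # and re-writes them as one file
    return read + written

print(map_spill_io(100, 1))  # 100
print(map_spill_io(100, 2))  # 300
print(map_spill_io(100, 3))  # 300
```

Under these assumptions two spills and three spills cost the same, so the saving only appears when the task gets down to exactly one spill.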

Spill counters on map tasks

 •  ratio of Spilled Records vs Map Output
    Records
     –  if unequal, then you are doing more than one
        spill
 •  FILE: Number of bytes read/written
     –  get a sense of I/O amplification due to spilling
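
A small helper can turn these counters into the two checks described above. This is a sketch: the counter values are passed in by hand, the function name `spill_report` is hypothetical, and it assumes no Combiner is shrinking the record count between map output and spill.

```python
def spill_report(map_output_records, spilled_records,
                 map_output_bytes, file_bytes_written):
    """Summarize spill behavior of a map task from its counters."""
    return {
        # unequal counters suggest more than one spill
        "multiple_spills": spilled_records != map_output_records,
        "spill_ratio": spilled_records / map_output_records,
        # local bytes written per byte of map output
        "write_amplification": file_bytes_written / map_output_bytes,
    }

print(spill_report(1000, 2000, 100000, 200000))
```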




Spill logs on map tasks
 2012-06-04 11:52:21,445 INFO MapTask: Spilling map output: record full = true
 2012-06-04 11:52:21,445 INFO MapTask: bufstart = 0; bufend = 60030900; bufvoid = 228117712
 2012-06-04 11:52:21,445 INFO MapTask: kvstart = 0; kvend = 600309; length = 750387
 2012-06-04 11:52:24,320 INFO MapTask: Finished spill 0
 2012-06-04 11:52:26,117 INFO MapTask: Spilling map output: record full = true
 2012-06-04 11:52:26,118 INFO MapTask: bufstart = 60030900; bufend = 120061700; bufvoid = 228117712
 2012-06-04 11:52:26,118 INFO MapTask: kvstart = 600309; kvend = 450230; length = 750387
 2012-06-04 11:52:26,666 INFO MapTask: Starting flush of map output
 2012-06-04 11:52:28,272 INFO MapTask: Finished spill 1
 2012-06-04 11:52:29,105 INFO MapTask: Finished spill 2

 •  “record full = true” indicates that the metadata buffers
    filled up before the data buffer
 •  3 spills total! Maybe we can do better?
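
Counting spills by eye gets tedious across many tasks, so the same reading of the log can be scripted. A minimal sketch that assumes the message wording shown in this log excerpt; `summarize_spills` is a hypothetical helper and the sample text is abbreviated from the excerpt.

```python
import re

SAMPLE_LOG = """\
2012-06-04 11:52:21,445 INFO MapTask: Spilling map output: record full = true
2012-06-04 11:52:24,320 INFO MapTask: Finished spill 0
2012-06-04 11:52:26,117 INFO MapTask: Spilling map output: record full = true
2012-06-04 11:52:26,666 INFO MapTask: Starting flush of map output
2012-06-04 11:52:28,272 INFO MapTask: Finished spill 1
2012-06-04 11:52:29,105 INFO MapTask: Finished spill 2
"""

def summarize_spills(log_text):
    """Return (number of finished spills, spills caused by full metadata)."""
    finished = re.findall(r"Finished spill \d+", log_text)
    record_full = re.findall(r"record full = true", log_text)
    return len(finished), len(record_full)

print(summarize_spills(SAMPLE_LOG))  # (3, 2)
```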


Tuning to reduce spills

 •  Parameters:
     –  io.sort.mb: total buffer space
     –  io.sort.record.percent: proportion between
        metadata buffers and key/value data
     –  io.sort.spill.percent: threshold at which
        spill is triggered
     –  Total map output generated: can you use
        more compact serialization?
 •  Optimal settings depend on your data and
    available RAM!

Setting io.sort.record.percent

 •  Common mistake: metadata buffers fill up
    way before kvdata buffer
 •  Optimal setting:
     –  io.sort.record.percent = 16/(16 + R)
     –  R = average record size: divide “Map Output
        Bytes” counter by “Map Output Records” counter
 •  Default (0.05) is usually too low (optimal for
    ~300-byte records)
 •  Hadoop 2.0: this is no longer necessary!
     –  see MAPREDUCE-64 for gory details
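
The formula above is easy to apply directly to the two job counters. A minimal sketch; the function name `optimal_record_percent` is hypothetical and the counter values are supplied by hand.

```python
def optimal_record_percent(map_output_bytes, map_output_records):
    """Apply io.sort.record.percent = 16 / (16 + R)."""
    avg_record_bytes = map_output_bytes / map_output_records  # R
    return 16 / (16 + avg_record_bytes)

# 100-byte records
print(round(optimal_record_percent(134217700, 1342177), 3))  # 0.138
# 304-byte records give the default value
print(round(optimal_record_percent(304000, 1000), 3))        # 0.05
```

The second call shows why the default only suits records of roughly 300 bytes: anything smaller needs a larger share of the buffer for metadata.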

Tuning Example (terasort)

 •  Map input size = output size
     –  128MB block = 1,342,177 records, each 100
        bytes
     –  metadata: 16 * 1342177 = 20.5MB
 •  io.sort.mb
     –  128MB data + 20.5MB meta = 148.5MB
 •  io.sort.record.percent
     –  16/(16+100)=0.138
 •  io.sort.spill.percent = 1.0
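
The terasort arithmetic can be reproduced as a short calculation. This sketch takes 1MB = 2^20 bytes throughout, which gives about 20.5MB of metadata; results can differ by a few tenths of a MB under other unit conventions. The function name `single_spill_settings` is hypothetical.

```python
MB = 1024 * 1024
META_BYTES_PER_RECORD = 16

def single_spill_settings(map_output_bytes, record_bytes):
    """Return (records, io.sort.mb needed in MB, io.sort.record.percent)."""
    records = map_output_bytes // record_bytes
    meta_bytes = records * META_BYTES_PER_RECORD
    io_sort_mb = (map_output_bytes + meta_bytes) / MB
    record_percent = META_BYTES_PER_RECORD / (META_BYTES_PER_RECORD + record_bytes)
    return records, io_sort_mb, record_percent

records, io_sort_mb, pct = single_spill_settings(128 * MB, 100)
print(records, round(io_sort_mb, 1), round(pct, 3))  # 1342177 148.5 0.138
```

Since io.sort.mb is configured as a whole number of megabytes, the computed value would be rounded up in practice.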

More tips on spill tuning
 •  Biggest win is going from 2 spills to 1 spill
     –  3 spills is approximately the same speed as 2 spills
        (same IO amplification)
 •  Calculate if it’s even possible, given your heap
    size
     –  io.sort.mb has to fit within your Java heap (plus
        whatever RAM your Mapper needs, plus ~30% for
        overhead)
 •  Only bother if this is the bottleneck!
     –  Look at map task logs: if the merge step at the end is
        taking a fraction of a second, not worth it!
     –  Typically most impact on jobs with big shuffle (sort/
        dedup)
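
The heap check above can be written down as a one-line test. This is a sketch under one possible reading of the rule, namely that the ~30% overhead applies on top of the sort buffer plus the Mapper's own memory; the function name `fits_in_heap` and the example sizes are hypothetical.

```python
def fits_in_heap(io_sort_mb, mapper_mb, heap_mb, overhead=0.30):
    """Check whether the sort buffer plus mapper memory fits in the heap."""
    required_mb = (io_sort_mb + mapper_mb) * (1 + overhead)
    return required_mb <= heap_mb

print(fits_in_heap(149, 50, 200))  # False
print(fits_in_heap(149, 50, 300))  # True
```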


Reducer fetch tuning

 •  Reducers fetch map output via HTTP
 •  Tuning parameters:
     –  Server side: tasktracker.http.threads
     –  Client side:
      mapred.reduce.parallel.copies
 •  Turns out this is not so interesting
     –  follow the best practices from Hadoop:
        Definitive Guide


Improving fetch bottlenecks

 •  Reduce intermediate data
     –  Implement a Combiner: less data transfers faster
     –  Enable intermediate compression: Snappy is
        easy to enable; trades off some CPU for less IO/
        network
 •  Double-check for network issues
     –  Frame errors, NICs auto-negotiated to 100Mbit,
        etc: one or two slow hosts can bottleneck a job
     –  Tell-tale sign: all maps are done, and reducers sit
        in fetch stage for many minutes (look at logs)
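
As an illustration of enabling intermediate compression, the settings can be kept as a small property map and rendered as command-line options. The property names and codec class below are the Hadoop 1.x names as I understand them and are not taken from the slides, so verify them against your version; `as_generic_options` is a hypothetical helper.

```python
# Assumed Hadoop 1.x property names; check your distribution's defaults file.
INTERMEDIATE_COMPRESSION = {
    "mapred.compress.map.output": "true",
    "mapred.map.output.compression.codec":
        "org.apache.hadoop.io.compress.SnappyCodec",
}

def as_generic_options(props):
    """Render properties as -D key=value arguments for a job command line."""
    args = []
    for key, value in props.items():
        args += ["-D", f"{key}={value}"]
    return args

print(" ".join(as_generic_options(INTERMEDIATE_COMPRESSION)))
```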


Reducer merge (Hadoop 1.0)

 •  Remote map outputs are fetched via HTTP
     –  Fits in RAM? Yes: fetch to RAM (RAMManager)
     –  No: fetch to disk (IFiles on local disk)
 •  1. Data accumulated in RAM is merged to disk files
    (RAM-to-disk merges)
 •  2. If too many disk files accumulate, they are re-merged
    (disk-to-disk merges)
 •  3. Segments from RAM and disk are merged into the
    reducer code (merged iterator feeding the Reduce Task)
Reducer merge triggers
 •  RAMManager
     –  Total buffer size:
       mapred.job.shuffle.input.buffer.percent
       (default 0.70, percentage of reducer heapsize)
 •  Mem-to-disk merge triggers:
     –  RAMManager is
        mapred.job.shuffle.merge.percent % full
        (default 0.66)
     –  Or mapred.inmem.merge.threshold segments
        accumulated (default 1000)
 •  Disk-to-disk merge
     –  io.sort.factor on-disk segments pile up (fairly rare)
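
The memory-to-disk triggers above can be expressed as a short calculation. This is a sketch using the default values quoted on the slide; the function names are hypothetical, and treating a threshold of 0 as "no segment-count trigger" is my understanding of the setting rather than something stated here.

```python
def shuffle_thresholds(reducer_heap_mb, input_buffer_percent=0.70,
                       merge_percent=0.66):
    """Return (shuffle buffer size, fill level that triggers a merge) in MB."""
    shuffle_buffer_mb = reducer_heap_mb * input_buffer_percent
    merge_trigger_mb = shuffle_buffer_mb * merge_percent
    return shuffle_buffer_mb, merge_trigger_mb

def should_merge_to_disk(used_mb, segments, reducer_heap_mb,
                         inmem_threshold=1000):
    """True if either the fill level or the segment count triggers a merge."""
    _, merge_trigger_mb = shuffle_thresholds(reducer_heap_mb)
    too_full = used_mb >= merge_trigger_mb
    # Assumption: a threshold of 0 or less disables the count-based trigger.
    too_many = inmem_threshold > 0 and segments >= inmem_threshold
    return too_full or too_many

print(shuffle_thresholds(1000))
print(should_merge_to_disk(500, 10, 1000))  # True
```

With a 1000MB reducer heap, roughly 700MB is available for fetched map outputs and a merge to disk starts once about 462MB of it is in use.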



Final merge phase

 •  MR assumes that reducer code needs the
    full heap worth of RAM
     –  Spills all in-RAM segments before running
        user code to free memory
 •  This isn’t true if your reducer is simple
     –  eg sort, simple aggregation, etc with no state
 •  Configure
     mapred.job.reduce.input.buffer.percent to
     0.70 to keep reducer input data in RAM


Reducer merge counters

 •  FILE: number of bytes read/written
     –  Ideally close to 0 if you can fit in RAM
 •  Spilled records:
     –  Ideally close to 0. If significantly more than
        reduce input records, job is hitting a multi-
        pass merge which is quite expensive




Tuning reducer merge

 •  Configure
     mapred.job.reduce.input.buffer.percent
    to 0.70 to keep data in RAM if you don’t
    have any state in reducer
 •  Experiment with setting
    mapred.inmem.merge.threshold to 0 to
    avoid spills
 •  Hadoop 2.0: experiment with
     mapreduce.reduce.merge.memtomem.enabled


Rules of thumb for # maps/reduces

 •  Aim for map tasks running 1-3 minutes each
     –  Too small: wasted startup overhead, less efficient
        shuffle
     –  Too big: not enough parallelism, harder to share
        cluster
 •  Reduce task count:
     –  Large reduce phase: base on cluster slot count (a
        few GB per reducer)
     –  Small reduce phase: fewer reducers will result in
        more efficient shuffle phase


Tuning Java code for MR
 •  Follow general Java best practices
     –  String parsing and formatting is slow
     –  Guard debug statements with isDebugEnabled()
     –  StringBuffer.append vs repeated string concatenation
 •  For CPU-intensive jobs, make a test harness/
    benchmark outside MR
     –  Then use your favorite profiler
 •  Check for GC overhead: -XX:+PrintGCDetails
    -verbose:gc
 •  Easiest profiler: add -Xprof to
    mapred.child.java.opts, then look at the
    stdout task log

Other tips for fast MR code

 •  Use the most compact and efficient data
    formats
     –  LongWritable is way faster than parsing text
     –  BytesWritable instead of Text for SHA1
        hashes/dedup
     –  Avro/Thrift/Protobuf for complex data, not JSON!
 •  Write a Combiner and RawComparator
 •  Enable intermediate compression (Snappy/
    LZO)
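
The size difference between binary and text encodings is easy to measure. The sketch below is plain Python rather than Hadoop code: it compares payload sizes only and ignores any length prefix or framing that a Writable type adds. The helper names and sample values are hypothetical.

```python
import hashlib
import struct

def sha1_sizes(data):
    """Return (raw digest bytes, hex text bytes) for the SHA1 of data."""
    h = hashlib.sha1(data)
    return len(h.digest()), len(h.hexdigest().encode("ascii"))

def long_sizes(value):
    """Return (fixed 8-byte binary size, decimal text size) for an integer."""
    return len(struct.pack(">q", value)), len(str(value).encode("ascii"))

print(sha1_sizes(b"example record"))  # (20, 40)
print(long_sizes(1339632000000))      # (8, 13)
```

The raw digest is half the size of its hex form, and the binary long also avoids the cost of parsing digits on every comparison.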

Summary

 •  Understanding MR internals helps understand
    configurations and tuning
 •  Focus your tuning effort on things that are
    bottlenecks, following a scientific approach
 •  Don’t forget that you can always just add nodes!
     –  Spending 1 month of engineer time to make your job
        20% faster is not worth it if you have a 10 node
        cluster!
 •  We’re working on simplifying this where we can,
    but deep understanding will always allow more
    efficient jobs


Questions?

    @tlipcon
todd@cloudera.com

 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 

More from DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Recently uploaded

Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Principled Technologies
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 

Recently uploaded (20)

Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 

Optimizing MapReduce Job performance

  • 1. June 14, 2012 Optimizing MapReduce Job Performance Todd Lipcon [@tlipcon]
  • 2. Introductions •  Software Engineer at Cloudera since 2009 •  Committer and PMC member on HDFS, MapReduce, and HBase •  Spend lots of time looking at full stack performance •  This talk is to help you develop faster jobs –  If you want to hear about how we made Hadoop faster, see my Hadoop World 2011 talk on cloudera.com 2 ©2011 Cloudera, Inc. All Rights Reserved.
  • 3. Aspects of Performance •  Algorithmic performance –  big-O, join strategies, data structures, asymptotes •  Physical performance –  Hardware (disks, CPUs, etc) •  Implementation performance –  Efficiency of code, avoiding extra work –  Make good use of available physical perf
  • 4. Performance fundamentals •  You can’t tune what you don’t understand –  MR’s strength as a framework is its black-box nature –  To get optimal performance, you have to understand the internals •  This presentation: understanding the black box
  • 5. Performance fundamentals (2) •  You can’t improve what you can’t measure –  Ganglia/Cacti/Cloudera Manager/etc a must –  Top 4 metrics: CPU, Memory, Disk, Network –  MR job metrics: slot-seconds, CPU-seconds, task wall-clocks, and I/O •  Before you start: run jobs, gather data
  • 6. Graphing bottlenecks [graphs of cluster CPU, memory, and network usage, with annotations] •  This job might be CPU-bound in map phase •  Most jobs not CPU-bound •  Plenty of free RAM, perhaps can make better use of it? •  Fairly flat-topped network – bottleneck?
  • 7. Performance tuning cycle: Run job → Identify bottleneck (graphs, job counters, job logs, profiler results) → Address bottleneck (tune configs, improve code, rethink algos) → Run job again. In order to understand these metrics and make changes, you need to understand MR internals.
  • 8. MR from 10,000 feet: InputFormat → Map Task → Sort/Spill → Fetch → Merge → Reduce Task → OutputFormat
  • 9. MR from 10,000 feet: InputFormat → Map Task → Sort/Spill → Fetch → Merge → Reduce Task → OutputFormat
  • 10. Map-side sort/spill overview •  Goal: when complete, map task outputs one sorted file •  What happens when you call OutputCollector.collect()? 1. In-memory buffer (MapOutputBuffer) holds serialized, unsorted key-values 2. Output buffer fills up. Contents sorted, partitioned and spilled to disk as an IFile 3. Map task finishes. All IFiles merged to single IFile per task (map-side merge)
  • 11. Zooming further: MapOutputBuffer (Hadoop 1.0) •  Total size: io.sort.mb •  Metadata buffers (kvoffsets, kvindices): io.sort.record.percent * io.sort.mb –  12 bytes/rec: (Partition, KOff, VOff) per record –  4 bytes/rec: 1 indirect-sort index per record •  kvbuffer: (1-io.sort.record.percent) * io.sort.mb –  R bytes/rec: raw, serialized (Key, Val) pairs
  • 12. MapOutputBuffer spill behavior •  Memory is limited: must spill –  If either the kvbuffer or the metadata buffers fill up, “spill” to disk –  In fact, we spill before it’s full (in another thread): configure io.sort.spill.percent •  Performance impact –  If we spill more than one time, we must re-read and re-write all data: 3x the IO! –  #1 goal for map task optimization: spill once!
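
The 1x versus 3x rule above can be put into numbers with a small sketch. This is an illustration only: the function name `local_spill_io_bytes` is hypothetical, and the model assumes a single merge pass over the spill files (it ignores extra passes that could occur with very many spills).

```python
def local_spill_io_bytes(map_output_bytes: int, num_spills: int) -> int:
    """Rough local-disk traffic for one map task, following the 1x vs 3x rule."""
    if map_output_bytes < 0 or num_spills < 0:
        raise ValueError("inputs must be non-negative")
    if num_spills <= 1:
        # One spill: the output is written once and that file is the final result.
        return map_output_bytes
    # Several spills: write the spills, read them all back, write the merged file.
    return 3 * map_output_bytes


one_spill = local_spill_io_bytes(128 * 1024 * 1024, 1)
three_spills = local_spill_io_bytes(128 * 1024 * 1024, 3)
print(three_spills // one_spill)  # prints 3
```

Under this model, 2 spills and 3 spills cost the same, which is why the biggest win is getting down to exactly one spill.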
  • 13. Spill counters on map tasks •  ratio of Spilled Records vs Map Output Records –  if unequal, then you are doing more than one spill •  FILE: Number of bytes read/written –  get a sense of I/O amplification due to spilling
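
The counter check above can be sketched as follows. The function names and sample values are hypothetical; only the counter names ("Spilled Records", "Map Output Records") come from the slide. The sketch treats a ratio above 1 as the sign of multiple spills and ignores the effect a Combiner may have on the spilled count.

```python
def spill_ratio(spilled_records: int, map_output_records: int) -> float:
    """Ratio of the 'Spilled Records' counter to 'Map Output Records'."""
    if map_output_records <= 0:
        raise ValueError("map_output_records must be positive")
    return spilled_records / map_output_records


def spilled_more_than_once(spilled_records: int, map_output_records: int) -> bool:
    """True when records were written to local disk more often than they were emitted."""
    return spilled_records > map_output_records


# Hypothetical counter values read from a map task's counter page.
print(spill_ratio(2_000_000, 1_000_000))             # prints 2.0
print(spilled_more_than_once(1_000_000, 1_000_000))  # prints False
```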
  • 14. Spill logs on map tasks
    2012-06-04 11:52:21,445 INFO MapTask: Spilling map output: record full = true
    2012-06-04 11:52:21,445 INFO MapTask: bufstart = 0; bufend = 60030900; bufvoid = 228117712
    2012-06-04 11:52:21,445 INFO MapTask: kvstart = 0; kvend = 600309; length = 750387
    2012-06-04 11:52:24,320 INFO MapTask: Finished spill 0
    2012-06-04 11:52:26,117 INFO MapTask: Spilling map output: record full = true
    2012-06-04 11:52:26,118 INFO MapTask: bufstart = 60030900; bufend = 120061700; bufvoid = 228117712
    2012-06-04 11:52:26,118 INFO MapTask: kvstart = 600309; kvend = 450230; length = 750387
    2012-06-04 11:52:26,666 INFO MapTask: Starting flush of map output
    2012-06-04 11:52:28,272 INFO MapTask: Finished spill 1
    2012-06-04 11:52:29,105 INFO MapTask: Finished spill 2
    Annotations: “record full = true” indicates that the metadata buffers filled up before the data buffer. 3 spills total! Maybe we can do better?
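
Reading these logs by hand gets tedious across many tasks, so a minimal sketch of a log summarizer may help. The function name `summarize_spills` is hypothetical. The "record full" and "Finished spill" message shapes are taken from the log excerpt on this slide; the "buffer full" variant is an assumption about how a data-buffer-triggered spill is logged.

```python
import re

# "record full" appears in the slide's log; "buffer full" is an assumed variant.
SPILL_START = re.compile(r"Spilling map output: (record|buffer) full\s*=\s*true")
SPILL_DONE = re.compile(r"Finished spill (\d+)")

SAMPLE_LOG = """\
2012-06-04 11:52:21,445 INFO MapTask: Spilling map output: record full = true
2012-06-04 11:52:24,320 INFO MapTask: Finished spill 0
2012-06-04 11:52:26,117 INFO MapTask: Spilling map output: record full = true
2012-06-04 11:52:26,666 INFO MapTask: Starting flush of map output
2012-06-04 11:52:28,272 INFO MapTask: Finished spill 1
2012-06-04 11:52:29,105 INFO MapTask: Finished spill 2
"""


def summarize_spills(log_text: str) -> dict:
    """Count finished spills and which buffer triggered each threshold spill."""
    triggers = SPILL_START.findall(log_text)
    finished = SPILL_DONE.findall(log_text)
    return {
        "spills": len(finished),
        "record_full": triggers.count("record"),
        "buffer_full": triggers.count("buffer"),
    }


print(summarize_spills(SAMPLE_LOG))
```

A task whose spills are all "record full" is a candidate for raising io.sort.record.percent rather than io.sort.mb.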
  • 15. Tuning to reduce spills •  Parameters: –  io.sort.mb: total buffer space –  io.sort.record.percent: proportion between metadata buffers and key/value data –  io.sort.spill.percent: threshold at which spill is triggered –  Total map output generated: can you use more compact serialization? •  Optimal settings depend on your data and available RAM!
  • 16. Setting io.sort.record.percent •  Common mistake: metadata buffers fill up way before kvdata buffer •  Optimal setting: –  io.sort.record.percent = 16/(16 + R) –  R = average record size: divide “Map Output Bytes” counter by “Map Output Records” counter •  Default (0.05) is usually too low (optimal for ~300-byte records) •  Hadoop 2.0: this is no longer necessary! –  see MAPREDUCE-64 for gory details
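
The formula above is easy to evaluate from job counters. The following is a minimal sketch, assuming the Hadoop 1.0 figure of 16 metadata bytes per record from the slides; the function name `optimal_record_percent` is hypothetical.

```python
METADATA_BYTES_PER_RECORD = 16  # Hadoop 1.0 figure quoted in the slides


def optimal_record_percent(map_output_bytes: int, map_output_records: int) -> float:
    """io.sort.record.percent = 16 / (16 + R), with R the average record size."""
    if map_output_records <= 0:
        raise ValueError("map_output_records must be positive")
    avg_record_bytes = map_output_bytes / map_output_records
    return METADATA_BYTES_PER_RECORD / (METADATA_BYTES_PER_RECORD + avg_record_bytes)


# 100-byte records, as in terasort.
print(round(optimal_record_percent(100_000, 1_000), 3))  # prints 0.138
# Records of 304 bytes are where the default of 0.05 is optimal.
print(round(optimal_record_percent(304_000, 1_000), 3))  # prints 0.05
```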
  • 17. Tuning Example (terasort) •  Map input size = output size –  128MB block = 1,342,177 records, each 100 bytes –  metadata: 16 * 1342177 bytes = 20.5MB •  io.sort.mb –  128MB data + 20.5MB meta = 148.5MB •  io.sort.record.percent –  16/(16+100)=0.138 •  io.sort.spill.percent = 1.0
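
The terasort arithmetic can be reproduced with a short sketch. The helper name `size_sort_buffer` is hypothetical, and the sketch works in binary megabytes (1MB = 1,048,576 bytes), so its figures may differ slightly from hand-rounded ones. It assumes fixed-size records and 16 metadata bytes per record, as in the slides.

```python
import math

MIB = 1024 * 1024
METADATA_BYTES_PER_RECORD = 16  # Hadoop 1.0 figure quoted in the slides


def size_sort_buffer(map_output_bytes: int, record_bytes: int) -> dict:
    """Estimate the sort buffer settings needed for a single spill."""
    records = map_output_bytes // record_bytes
    metadata_bytes = METADATA_BYTES_PER_RECORD * records
    total_bytes = map_output_bytes + metadata_bytes
    return {
        "records": records,
        "metadata_mb": metadata_bytes / MIB,
        # Round up: the buffer must hold all data plus metadata.
        "io_sort_mb": math.ceil(total_bytes / MIB),
        "record_percent": metadata_bytes / total_bytes,
    }


terasort = size_sort_buffer(128 * MIB, 100)
print(terasort["records"])                   # prints 1342177
print(terasort["io_sort_mb"])                # prints 149
print(round(terasort["record_percent"], 3))  # prints 0.138
```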
  • 18. More tips on spill tuning •  Biggest win is going from 2 spills to 1 spill –  3 spills is approximately the same speed as 2 spills (same IO amplification) •  Calculate if it’s even possible, given your heap size –  io.sort.mb has to fit within your Java heap (plus whatever RAM your Mapper needs, plus ~30% for overhead) •  Only bother if this is the bottleneck! –  Look at map task logs: if the merge step at the end is taking a fraction of a second, not worth it! –  Typically most impact on jobs with big shuffle (sort/dedup)
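
Whether a single spill is even possible can be checked before touching any configuration. The sketch below is one reading of the heap rule above: it assumes the ~30% overhead applies to the sum of io.sort.mb and the Mapper's own memory. The function names and the sample numbers are hypothetical.

```python
import math

OVERHEAD_FACTOR = 1.30  # the "~30% for overhead" from the slide, applied to the sum (assumption)


def required_heap_mb(io_sort_mb: int, mapper_working_set_mb: int) -> int:
    """Heap needed for the sort buffer plus the Mapper's own memory, with overhead."""
    return math.ceil((io_sort_mb + mapper_working_set_mb) * OVERHEAD_FACTOR)


def fits_in_heap(io_sort_mb: int, mapper_working_set_mb: int, heap_mb: int) -> bool:
    """True when the planned io.sort.mb fits the task's Java heap."""
    return required_heap_mb(io_sort_mb, mapper_working_set_mb) <= heap_mb


# Hypothetical task: 149MB sort buffer, Mapper needs about 100MB of its own.
print(required_heap_mb(149, 100))   # prints 324
print(fits_in_heap(149, 100, 512))  # prints True
print(fits_in_heap(149, 100, 200))  # prints False
```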
  • 19. MR from 10,000 feet: InputFormat → Map Task → Sort/Spill → Fetch → Merge → Reduce Task → OutputFormat
  • 20. Reducer fetch tuning •  Reducers fetch map output via HTTP •  Tuning parameters: –  Server side: tasktracker.http.threads –  Client side: mapred.reduce.parallel.copies •  Turns out this is not so interesting –  follow the best practices from Hadoop: The Definitive Guide
  • 21. Improving fetch bottlenecks •  Reduce intermediate data –  Implement a Combiner: less data transfers faster –  Enable intermediate compression: Snappy is easy to enable; trades off some CPU for less IO/network •  Double-check for network issues –  Frame errors, NICs auto-negotiated to 100Mbit, etc: one or two slow hosts can bottleneck a job –  Tell-tale sign: all maps are done, and reducers sit in fetch stage for many minutes (look at logs)
  • 22. MR from 10,000 feet: InputFormat → Map Task → Sort/Spill → Fetch → Merge → Reduce Task → OutputFormat
  • 23. Reducer merge (Hadoop 1.0) •  Remote map outputs are fetched via HTTP –  Fits in RAM? Yes: fetch to RAM (RAMManager) –  No: fetch to disk (local disk IFiles) 1. Data accumulated in RAM is merged to disk files (RAM-to-disk merges) 2. If too many disk files accumulate, they are re-merged (disk-to-disk merges) 3. Segments from RAM and disk are merged into the reducer code (merged iterator feeding the Reduce Task)
  • 24. Reducer merge triggers •  RAMManager –  Total buffer size: mapred.job.shuffle.input.buffer.percent (default 0.70, percentage of reducer heapsize) •  Mem-to-disk merge triggers: –  RAMManager is mapred.job.shuffle.merge.percent % full (default 0.66) –  Or mapred.inmem.merge.threshold segments accumulated (default 1000) •  Disk-to-disk merge –  io.sort.factor on-disk segments pile up (fairly rare)
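
To see what these percentages mean in megabytes, here is a minimal sketch using the defaults quoted above. The function name `reducer_shuffle_thresholds` and the 1000MB heap are hypothetical; the sketch only multiplies the two settings and does not model the segment-count trigger.

```python
def reducer_shuffle_thresholds(reducer_heap_mb: float,
                               input_buffer_percent: float = 0.70,
                               merge_percent: float = 0.66) -> dict:
    """Shuffle buffer size and the fill level that triggers a mem-to-disk merge."""
    # mapred.job.shuffle.input.buffer.percent: share of the reducer heap.
    shuffle_buffer_mb = reducer_heap_mb * input_buffer_percent
    # mapred.job.shuffle.merge.percent: share of that buffer that triggers a merge.
    merge_trigger_mb = shuffle_buffer_mb * merge_percent
    return {
        "shuffle_buffer_mb": shuffle_buffer_mb,
        "merge_trigger_mb": merge_trigger_mb,
    }


thresholds = reducer_shuffle_thresholds(1000)
print(round(thresholds["shuffle_buffer_mb"]))  # prints 700
print(round(thresholds["merge_trigger_mb"]))   # prints 462
```

With the defaults, a reducer with a 1000MB heap starts merging to disk once roughly 462MB of map output has accumulated in RAM.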
  • 25. Final merge phase •  MR assumes that reducer code needs the full heap worth of RAM –  Spills all in-RAM segments before running user code to free memory •  This isn’t true if your reducer is simple –  e.g. sort, simple aggregation, etc with no state •  Configure mapred.job.reduce.input.buffer.percent to 0.70 to keep reducer input data in RAM
  • 26. Reducer merge counters •  FILE: number of bytes read/written –  Ideally close to 0 if you can fit in RAM •  Spilled records: –  Ideally close to 0. If significantly more than reduce input records, job is hitting a multi-pass merge which is quite expensive
  • 27. Tuning reducer merge •  Configure mapred.job.reduce.input.buffer.percent to 0.70 to keep data in RAM if you don’t have any state in reducer •  Experiment with setting mapred.inmem.merge.threshold to 0 to avoid spills •  Hadoop 2.0: experiment with mapreduce.reduce.merge.memtomem.enabled
  • 28. Rules of thumb for # maps/reduces •  Aim for map tasks running 1-3 minutes each –  Too small: wasted startup overhead, less efficient shuffle –  Too big: not enough parallelism, harder to share cluster •  Reduce task count: –  Large reduce phase: base on cluster slot count (a few GB per reducer) –  Small reduce phase: fewer reducers will result in more efficient shuffle phase
  • 29. MR from 10,000 feet: InputFormat → Map Task → Sort/Spill → Fetch → Merge → Reduce Task → OutputFormat
  • 30. Tuning Java code for MR •  Follow general Java best practices –  String parsing and formatting is slow –  Guard debug statements with isDebugEnabled() –  StringBuffer.append vs repeated string concatenation •  For CPU-intensive jobs, make a test harness/benchmark outside MR –  Then use your favorite profiler •  Check for GC overhead: -XX:+PrintGCDetails -verbose:gc •  Easiest profiler: add -Xprof to mapred.child.java.opts – then look at stdout task log
  • 31. Other tips for fast MR code •  Use the most compact and efficient data formats –  LongWritable is way faster than parsing text –  BytesWritable instead of Text for SHA1 hashes/dedup –  Avro/Thrift/Protobuf for complex data, not JSON! •  Write a Combiner and RawComparator •  Enable intermediate compression (Snappy/LZO)
  • 32. Summary •  Understanding MR internals helps understand configurations and tuning •  Focus your tuning effort on things that are bottlenecks, following a scientific approach •  Don’t forget that you can always just add nodes! –  Spending 1 month of engineer time to make your job 20% faster is not worth it if you have a 10 node cluster! •  We’re working on simplifying this where we can, but deep understanding will always allow more efficient jobs
  • 33. Questions? @tlipcon todd@cloudera.com