Building Scale-Free Applications with Hadoop and Cascading


                        Chris K Wensel
                        Concurrent, Inc.



Sunday, June 21, 2009
Introduction
                                        Chris K Wensel
                                           chris@wensel.net


                    • Cascading, Lead developer
                        •   http://cascading.org/
                    • Concurrent, Inc., Founder
                        •   Hadoop/Cascading support and tools
                        •   http://concurrentinc.com
                    • Scale Unlimited, Principal Instructor
                        •   Hadoop-related training and consulting
                        •   http://scaleunlimited.com
                                                                     Copyright Concurrent, Inc. 2009. All rights reserved
Topics

                                    Cascading


                                    MapReduce


                                     Hadoop




                 A rapid introduction to Hadoop, MapReduce
                 patterns, and best practices with Cascading.


What is Hadoop?




Conceptually
                           Cluster




                                          Execution


                                          Storage




                    • Single FileSystem Namespace (large files)
                    • Single Execution-Space (high latency/batch)

CPU, Disk, Rack, and
                               Node
                            Cluster




                                 Rack               Rack                Rack

                                 Node        Node   Node         Node   ...


                                      cpu             Execution


                                      disk             Storage




                    • Virtualization of storage and execution resources
                    • Normalizes disks and CPU
                    • Rack and node aware
                    • Data locality and fault tolerance
Storage
                         Rack                Rack             Rack

                         Node      Node      Node      Node   ...

                          block1    block1                          block1


                          block2              block2                block2




                         One file of two blocks, each block replicated on 3 nodes



                    • Files chunked into blocks
                    • Blocks independently replicated
Execution
                         Rack                Rack            Rack

                         Node      Node      Node    Node    ...

                           map       map       map     map         map



                          reduce    reduce                     reduce




                    • Map and Reduce tasks move to data
                    • Or available processing slots

Clients and Jobs
                        Users       Cluster

                           Client




                           Client
                                                          Execution
                                              job   job               job
                                                          Storage
                           Client




              • Clients launch jobs into the cluster
              • One at a time if there are dependencies
              • Clients are not managed by Hadoop
Physical Architecture
[Diagram: the NameNode (with a Secondary) serves ns operations; DataNodes serve
 block reads and writes; the Client submits jobs to the JobTracker, which hands
 tasks to TaskTrackers; mapper and reducer child JVMs also issue ns operations
 and read/write blocks directly]

                    • Client, NameNode, Secondary, DataNode, JobTracker, and TaskTracker are each unique applications
                    • Each TaskTracker runs its mapper and reducer tasks in child JVM instances on the same node
                    • Each DataNode manages the blocks of files kept on the same node
Deployment
[Diagram: Client, JobTracker, and TaskTracker each on their own node; the Client
 submits jobs to the JobTracker, which assigns tasks to TaskTrackers; a DataNode
 runs alongside each TaskTracker; the NameNode and Secondary run on separate
 nodes, though it is not uncommon for them to be the same node]

                    • TaskTracker and DataNode collocated

Why Hadoop?




Why It’s Adopted

                         Big Data: few apps
                         Platform as a Service (PaaS): many apps




How It’s Used


                         Ad hoc queries -> Querying / Insights
                         Processing & integration -> Integration / Processing




Cascading and ...?
                                                 Big Data    PaaS
                    Ad hoc Queries               Pig/Hive    ?
                    Processing / Integration     Cascading   Cascading




                    • Designed for Processing / Integration
                    • OK for ad-hoc queries (API, not Syntax)
                    • Should you use Hadoop for small queries?

ShareThis.com
                             Processing
                                             Big Data
                             Integration


                    • Primarily integration and processing
                    • Running on AWS

                              Ad hoc
                                              PaaS
                              Queries



                    • Ad-hoc queries (mining) performed on
                      Aster Data nCluster, not Hadoop
ShareThis in AWS
[Diagram: ShareThis on Amazon Web Services — loggers feed raw logs into S3; SQS
 coordinates the logprocessor, bixo web crawler, and content indexer stages,
 each running Cascading on its own Hadoop cluster; derived data flows out to
 AsterData, Hypertable, and Katta]

Staged Pipeline
                  logprocessor (every X hrs) -> bixo crawler (every Y hrs) -> indexer (on bixo completion) -> ...
              • Each cluster
                    • is “on demand”
                    • is “right sized” to load
                    • is tuned to boot at optimum frequency
ShareThis:
                           Best Practices

                 • Authoritative/Original data kept in S3 in its
                   native format

                 • Derived data pipelined into relevant systems,
                   when possible

                 • Clusters tuned to load for best utilization

                 • Loose coupling keeps change cases isolated

                 • GC’ing Hadoop not a bad idea

Thinking In MapReduce




It’s really
                           Map-Group-Reduce
            •       Map and Reduce are user defined functions

            •       Map translates input Keys and Values to new Keys and Values

                                          [K1,V1]      Map       [K2,V2]



            •       System sorts the keys, and groups each unique Key with all its Values

                                          [K2,V2]      Group    [K2,{V2,V2,....}]



            •       Reduce translates the Values of each unique Key to new Keys and Values


                                   [K2,{V2,V2,....}]   Reduce    [K3,V3]


Word Count
                    • Read a line of text, output every word

     [0, "when in the course of
           human events"]                 Map      ["when",1]     ["in",1]      ["the",1]                    [...,1]



                    • Group all the values with each unique word
                                  ["when",1]
                                   ["when",1]
                                    ["when",1]
                                     ["when",1]    Group        ["when",{1,1,1,1,1}]
                                      ["when",1]


                    • Add up the values associated with each unique word

                           ["when",{1,1,1,1,1}]    Reduce       ["when",5]
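The same word count flow, simulated in plain Python (a sketch, not Hadoop code; `mapper` and `reducer` are illustrative names):

```python
import re
from collections import defaultdict

def mapper(offset, line):
    # emit ["word", 1] for every word on the line
    for word in re.findall(r"[a-z']+", line.lower()):
        yield word, 1

def reducer(word, ones):
    # add up the values grouped under each unique word
    yield word, sum(ones)

def word_count(lines):
    groups = defaultdict(list)                  # stands in for the Group step
    for offset, line in enumerate(lines):
        for word, one in mapper(offset, line):
            groups[word].append(one)
    return dict(kv for word in sorted(groups)   # keys arrive in sort order
                for kv in reducer(word, groups[word]))
```

Feeding it the slide's example line plus four more "when"s groups five 1s under "when", which the reducer sums to 5.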


Know That...
                    •   1 Job = 1 Map impl + 1 Reduce impl
                    •   Map:
                        •   is required in a Hadoop Job
                        •   may emit 0 or more Key Value pairs [K2,V2]
                    •   Reduce:
                        •   is optional in a Hadoop Job
                        •   sees Keys in sort order
                        •   but Keys will be randomly distributed across Reducers
                        •   the collection of Values per Key are unordered [K2,{V2..}]
                        •   may emit 0 or more Key Value pairs [K3,V3]
Runtime Distribution
                  [K1,V1]  Map  [K2,V2]  Group  [K2,{V2,...}]  Reduce  [K3,V3]

[Diagram: the blocks of the input file are read as splits, one Mapper Task per
 split; the Shuffle routes each Mapper's output to the Reducer Tasks, each of
 which writes one part-0000N file into the output directory. Mappers must
 complete before Reducers can begin.]
Common Patterns


            • Filtering or “Grepping”
            • Parsing, Conversion
            • Counting, Summing
            • Binning, Collating
            • Distributed Tasks
            • Simple Total Sorting
            • Chained Jobs
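Filtering or “grepping” is the simplest of these patterns because it needs no Reduce at all. A minimal Python sketch of the idea (the pattern and helper names are hypothetical, not Hadoop's API):

```python
import re

ERROR = re.compile(r"ERROR")  # hypothetical pattern to grep for

def grep_mapper(offset, line):
    # a map may emit zero pairs: only matching lines pass through
    if ERROR.search(line):
        yield offset, line

def run_map_only(lines):
    # a map-only job: with the (optional) reducer omitted, there is no
    # sort/group phase and map output is written straight to the part files
    return [kv for offset, line in enumerate(lines)
            for kv in grep_mapper(offset, line)]
```

In a real Hadoop job this corresponds to setting the number of reduce tasks to zero, which skips the shuffle entirely.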




Advanced Patterns


            • Group By
            • Distinct
            • Secondary Sort
            • CoGrouping / Joining
            • Distributed Total Sort




For Example:
                               Secondary Sort
                Mapper:  takes [K1,V1], where V1 -> <A1,B1,C1,D1>
                         Map emits <A2,B2><C2> -> K2, <D2> -> V2, giving [K2,V2]

                Reducer: takes [K2,{V2,V2,....}], i.e. <A2,B2,{<C2,D2>,...}>
                         Reduce emits <A3,B3> -> K3, <C3,D3> -> V3, giving [K3,V3]



                    • The K2 key becomes a composite Key
                       • Key: [grouping, secondary], Value: [remaining values]
                    • Shuffle phase
                       • Custom Partitioner, must only partition on grouping Fields
                       • Standard ‘output key’ Comparator, must compare on all Fields
                       • Custom ‘value grouping’ Comparator, must compare grouping Fields
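The three roles above can be simulated in plain Python (a sketch under the slide's assumptions, not Hadoop's Partitioner/Comparator API): the partitioner hashes only the grouping half of the composite key, the sort uses the full key, and the value-grouping step compares only the grouping half.

```python
from itertools import groupby

def secondary_sort(pairs, n_reducers=2):
    """Simulate secondary sort for ((grouping, secondary), value) records."""
    # custom Partitioner: hash only the grouping half of the composite key,
    # so every record for a grouping key lands on the same reducer
    partitions = [[] for _ in range(n_reducers)]
    for (grouping, secondary), value in pairs:
        partitions[hash(grouping) % n_reducers].append(
            ((grouping, secondary), value))
    result = {}
    for partition in partitions:
        # standard output-key Comparator: sort on the full composite key
        partition.sort(key=lambda kv: kv[0])
        # custom value-grouping Comparator: group on the grouping half only,
        # so each reduce call sees its values already ordered by `secondary`
        for grouping, kvs in groupby(partition, key=lambda kv: kv[0][0]):
            result[grouping] = [value for _key, value in kvs]
    return result
```

Because the sort happens on the full composite key but the grouping happens on only its first half, the reducer receives each key's values pre-sorted by the secondary field without buffering them itself.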
Very Advanced

                        • Classification
                        • Clustering
                        • Regression
                        • Dimension Reduction
                        • Evolutionary Algorithms




Troubles w/ MapReduce
[Diagram: two chained jobs — Map -> Reduce -> File -> Map -> Reduce -> File,
 passing [k,v] pairs between phases and [k,[v]] groups into each Reduce]

                              [k,v] = key and value pair
                              [k,[v]] = key and associated values collection



                    • Keys and Values are brittle
                    • Map and Reduce is coarse grained


Real World Apps
[Diagram: the dependency graph of a single real application — 75 chained
 MapReduce jobs]

       1 app, 75 jobs

       green  = map + reduce
       purple = map
       blue   = join/merge
       orange = map split
Not Thinking in
                                          MapReduce
[Diagram: Source -> Pipe -> Pipe -> ... -> Sink, with [f1,f2,..] tuples flowing
 between pipes; the Map and Reduce boundaries are drawn over the pipe assembly
 rather than exposed to it]

                              [f1,f2,...] = tuples with field names


                    • It’s easier with Fields/Tuples...
                    • Streaming through Pipes/Operations
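The tuple/pipe idea can be sketched in a few lines of plain Python (hypothetical helpers, not Cascading's Java API): operations address fields by name and chain into pipes, while the planner, not the developer, decides where Map and Reduce boundaries fall.

```python
def each(pipe, operation):
    # apply an Operation to every tuple flowing through the pipe
    for tup in pipe:
        yield from operation(tup)

def group_by(pipe, field):
    # collect tuples sharing a field value, like a GroupBy pipe
    groups = {}
    for tup in pipe:
        groups.setdefault(tup[field], []).append(tup)
    for value in sorted(groups):
        yield from groups[value]

# a hypothetical operation addressing fields by name, not key/value position
def split_words(tup):
    for word in tup["line"].split():
        yield {"word": word}

pipe = group_by(each(iter([{"line": "hadoop hello hadoop"}]),
                     split_words), "word")
```

The developer only names fields and chains operations; the equivalent raw MapReduce job would have to encode the same logic into brittle key/value pairs by hand.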
[Diagram: a planned Cascading flow — Dfs TextLine taps at '/data/' feeding
 Each(RegexSplitter) pipes into CoGroup joins and Each(Identity) field
 selections, with tuple field lists flowing between elements]
                                                                                                                                                                                           ['', '', '', '']                                      ['', '', '', '']
                                                                                                                                                                                                                                                 ['', '', '', '']

                                                                                                                                            CoGroup('')[by:a:['']b:['']]                                                              CoGroup('')[by:a:['']b:['']]

                                                                                                                                       a[''],b['']                                                                                            a[''],b['']
                                                                                                                                       ['', '', '', '', '', '', '']                                                                           ['', '', '', '', '', '']

                                                                                                                                                                                                                                 Each('')[Identity[decl:'', '', '', '']]

                                                                                        ['', '', '', '', '', '', '']                                                                                                                             ['', '', '', '']
                                                                                        ['', '', '', '', '', '', '']                                                                                                                             ['', '', '', '']

                                        CoGroup('')[by:a:['']b:['']]

                                          a[''],b['']                                                                                                                                                                      ['', '', '', '']
                                          ['', '', '', '', '', '', '', '', '']                                                                                                                                             ['', '', '', '']

                                     Each('')[Identity[decl:'', '']]                                                                                                            CoGroup('')[by:a:['']b:['']]

                                                   ['', '']                                                                                                                         a[''],b['']
                                                   ['', '']                                                                                                                         ['', '', '', '', '', '', '']

                                                                                                                                                                      Each('')[Identity[decl:'', '', '']]

                                                ['', '']                                                                                                                  ['', '', '']
                                                ['', '']                                                                                                                  ['', '', '']

                                      GroupBy('')[by:['']]

                                                a['']                                                                        ['', '', '']
                                                ['', '']                                                                     ['', '', '']

                                 Every('')[Agregator[decl:'']]                            CoGroup('')[by:a:['']b:['']]




                                                                                                                                                                                                                                 1 app, 12 jobs
                                                                                                     a[''],b['']
                                                                                                     ['', '', '', '', '']
                                                 ['', '']
                                                 ['', '']
                                                                                      Each('')[Identity[decl:'', '', '']]


                                                                                                        ['', '', '']
                        Dfs['TextLine[['offset', 'line']->[ALL]]']['/data/']']
                                                                                                        ['', '', '']




                                                                                                        ['', '', '']
                                                                                                        ['', '', '']

                                                                                              GroupBy('')[by:['']]
                                                                                                                                                                                                                                 green = source or sink
                                                                                                        a['']
                                                                                                        ['', '', '']

                                                                                     Every('')[Agregator[decl:'', '']]
                                                                                                                                                                                                                                 blue = pipe+operation
                                                                                                        ['', '', '']
                                                                                                        ['', '', '']

                                                                            Dfs['TextLine[['offset', 'line']->[ALL]]']['/data/']']

                                                                                                                                                                                                                                                                                                                                  Copyright Concurrent, Inc. 2009. All rights reserved
Sunday, June 21, 2009
Cascading




In a Nutshell


                    • An alternative API to MapReduce
                        • Fields and Tuples
                        • Functions, Filters, and Aggregators
                        • Built-in ‘relational’ operations
                    • Can be used by any JVM based language



Developers


           • Re-use or write new Operations (Function, Filter, Aggregator)


           • Chain the Operations into data-processing workflows


           • Instantiate the workflow with data end-points




Some Sample Code


        http://github.com/cwensel/cascading.samples


        • hadoop
        • logparser
        • loganalysis



                         bit.ly links on next pages
Raw MapReduce:
                        Parsing a Log File

    > read one key/value at a time from inputPath
    > using some regular expression...
    > parse value into regular expression groups
    > save results as a TAB delimited row to the outputPath



                          http://bit.ly/RegexMap
                          http://bit.ly/RegexMain
http://bit.ly/RegexMap
 public class RegexParserMap extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text>
   {
   private Pattern pattern;
   private Matcher matcher;

    @Override
    public void configure( JobConf job )
      {
      pattern = Pattern.compile( job.get( "logparser.regex" ) );
    pattern = Pattern.compile( job.get( "logparser.regex" ) );
    matcher = pattern.matcher( "" ); // let's re-use the matcher
      }

    @Override
    public void map( LongWritable key, Text value, OutputCollector<Text, Text> output, Reporter reporter )
      {
      matcher.reset( value.toString() );

        if( !matcher.find() )
           throw new RuntimeException( "could not match pattern: [" + pattern + "] with value: [" + value + "]" );

        StringBuffer buffer = new StringBuffer();

        for( int i = 0; i < matcher.groupCount(); i++ )
          {
          if( i != 0 )
             buffer.append( "\t" );

          buffer.append( matcher.group( i + 1 ) ); // skip group 0
          }

        // pass null so a TAB is not prepended, not all OutputFormats accept null
        output.collect( null, new Text( buffer.toString() ) );
        }
    }
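The mapper's behavior is easy to exercise outside Hadoop, since the parse is just java.util.regex. A minimal stand-alone sketch of the same idea (the class name and sample log line here are illustrative, not from the samples; the ']' inside the character class is escaped for clarity):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LogLineParser
  {
  // the same Apache access-log pattern the job is configured with,
  // with the ']' inside the character class escaped for clarity
  static final Pattern PATTERN = Pattern.compile(
    "^([^ ]*) +[^ ]* +[^ ]* +\\[([^\\]]*)\\] +\"([^ ]*) ([^ ]*) [^ ]*\" ([^ ]*) ([^ ]*).*$" );

  // returns the captured groups: ip, time, method, event, status, size
  public static String[] parse( String line )
    {
    Matcher matcher = PATTERN.matcher( line );

    if( !matcher.find() )
      throw new RuntimeException( "could not match pattern with value: [" + line + "]" );

    String[] groups = new String[ matcher.groupCount() ];

    for( int i = 0; i < groups.length; i++ )
      groups[ i ] = matcher.group( i + 1 ); // group 0 is the whole match, skip it

    return groups;
    }

  public static void main( String[] args )
    {
    // a made-up log line, for illustration only
    String line = "127.0.0.1 - - [01/Jul/2009:00:00:00 -0700] \"GET /index.html HTTP/1.1\" 200 1024";

    // emit the groups as one TAB-delimited row, as the mapper does
    System.out.println( String.join( "\t", parse( line ) ) );
    }
  }
```

Because the parse lives in one method, it can be unit tested without spinning up a cluster, which is the point the Cascading API makes structural.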

{
    public static void main( String[] args ) throws IOException
      {
      // create Hadoop path instances
      Path inputPath = new Path( args[ 0 ] );
                                                                                      http://bit.ly/RegexMain
      Path outputPath = new Path( args[ 1 ] );

        // get the FileSystem instances for the input path
        FileSystem outputFS = outputPath.getFileSystem( new JobConf() );

        // if output path exists, delete recursively
        if( outputFS.exists( outputPath ) )
           outputFS.delete( outputPath, true );

        // initialize Hadoop job configuration
        JobConf jobConf = new JobConf();
        jobConf.setJobName( "logparser" );

        // set the current job jar
        jobConf.setJarByClass( Main.class );

        // set the input path and input format
        TextInputFormat.setInputPaths( jobConf, inputPath );
        jobConf.setInputFormat( TextInputFormat.class );

        // set the output path and output format
        TextOutputFormat.setOutputPath( jobConf, outputPath );
        jobConf.setOutputFormat( TextOutputFormat.class );
        jobConf.setOutputKeyClass( Text.class );
        jobConf.setOutputValueClass( Text.class );

        // must set to zero since we have no reduce function
        jobConf.setNumReduceTasks( 0 );

        // configure our parsing map class
        jobConf.setMapperClass( RegexParserMap.class );
        String apacheRegex = "^([^ ]*) +[^ ]* +[^ ]* +\\[([^]]*)\\] +\"([^ ]*) ([^ ]*) [^ ]*\" ([^ ]*) ([^ ]*).*$";
        jobConf.set( "logparser.regex", apacheRegex );

        // create Hadoop client, must pass in this JobConf for some reason
        JobClient jobClient = new JobClient( jobConf );

        // submit job
        RunningJob runningJob = jobClient.submitJob( jobConf );

        // block until job completes
        runningJob.waitForCompletion();
        }
    }
Cascading:
                        Parsing a Log File

    > read one “line” at a time from “source”
    > using some regular expression...
    > parse “line” into “ip, time, method, event, status, size”
    > save as a TAB delimited row to the “sink”




                           http://bit.ly/LPMain
public static void main( String[] args )

                                                                                                 http://bit.ly/LPMain
     {
     String inputPath = args[ 0 ];
     String outputPath = args[ 1 ];

      // define what the input file looks like, "offset" is bytes from beginning
      TextLine scheme = new TextLine( new Fields( "offset", "line" ) );

      // create SOURCE tap to read a resource from the local file system, if input is not a URL
      Tap logTap = inputPath.matches( "^[^:]+://.*" ) ? new Hfs( scheme, inputPath ) : new Lfs( scheme, inputPath );

      // create an assembly to parse an Apache log file and store on an HDFS cluster

      // declare the field names we will parse out of the log file
      Fields apacheFields = new Fields( "ip" , "time" , "method", "event" , "status" , "size" );

      // define the regular expression to parse the log file with
      String apacheRegex = "^([^ ]*) +[^ ]* +[^ ]* +\\[([^]]*)\\] +\"([^ ]*) ([^ ]*) [^ ]*\" ([^ ]*) ([^ ]*).*$";

      // declare the groups from the above regex we want to keep. each regex group will be given
      // a field name from 'apacheFields', above, respectively
      int[] allGroups = {1, 2, 3, 4, 5, 6};

      // create the parser
      RegexParser parser = new RegexParser( apacheFields, apacheRegex, allGroups );

      // create the import pipe element, with the name 'import', and with the input argument named "line"
      // replace the incoming tuple with the parser results
      // "line" -> parser -> "ts"
      Pipe importPipe = new Each( "import" , new Fields( "line" ), parser, Fields.RESULTS );

      // create a SINK tap to write to the default filesystem
      // by default, TextLine writes all fields out
      Tap remoteLogTap = new Hfs( new TextLine(), outputPath, SinkMode.REPLACE );

      // set the current job jar
      Properties properties = new Properties();
      FlowConnector.setApplicationJarClass( properties, Main.class );

      // connect the assembly to the SOURCE and SINK taps
      Flow parsedLogFlow = new FlowConnector( properties ).connect( logTap, remoteLogTap, importPipe );

      // optionally print out the parsedLogFlow to a DOT file for import into a graphics package
      // parsedLogFlow.writeDOT( "logparser.dot" );

      // start execution of the flow (either locally or on the cluster)
      parsedLogFlow.start();

      // block until the flow completes
      parsedLogFlow.complete();
      }
    }
What Changes When...

                 • We want to sort by “status” field?
                 • We want to filter out some “urls”?
                 • We want to sum the “size”?
                 • Do all the above?


                 • We are thinking in ‘fields’ not Key/Values

Cascading:
                        Arrival rate

    > ..parse data as before..
    > read parsed logs
    > calculate “urls” per second
    > calculate “urls” per minute
    > save “sec interval” & “count” as a TAB’d row to “sink1”
    > save “min interval” & “count” as a TAB’d row to “sink2”




                        http://bit.ly/LAMain
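The "per minute" interval is just the second-granularity timestamp truncated to its minute boundary; in the flow this is done with an ExpressionFunction evaluating "ts - (ts % (60 * 1000))". The same expression in plain Java (class and method names are illustrative):

```java
public class MinuteBucket
  {
  // truncate a millisecond timestamp to the start of its minute,
  // mirroring the ExpressionFunction "ts - (ts % (60 * 1000))"
  public static long toMinuteInterval( long ts )
    {
    return ts - ( ts % ( 60 * 1000 ) );
    }

  public static void main( String[] args )
    {
    long ts = 123456789L; // some millisecond timestamp
    System.out.println( toMinuteInterval( ts ) ); // 123420000
    }
  }
```

Grouping on the truncated value and counting each group yields the per-minute arrival rate.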
http://bit.ly/LAMain

                                 Arrival Rate Flow

    Hfs['SequenceFile[['ip', 'time', 'method', 'event', 'status', 'size']]']['output/logs/']
          |
    Each('arrival rate')[DateParser[decl:'ts']]
          |                                     |
    GroupBy('tsCount')[by:['ts']]               Each('arrival rate')[ExpressionFunction[decl:'tm']]
          |                                     |
    Every('tsCount')[Count[decl:'count']]       GroupBy('tmCount')[by:['tm']]
          |                                     |
    Hfs['TextLine']['output/arrivalrate/sec']   Every('tmCount')[Count[decl:'count']]
                                                |
                                                Hfs['TextLine']['output/arrivalrate/min']
                                                                              http://bit.ly/LAMain
                   // set the current job jar
                   Properties properties = new Properties();
                   FlowConnector.setApplicationJarClass( properties, Main.class );

                   FlowConnector flowConnector = new FlowConnector( properties );
                   CascadeConnector cascadeConnector = new CascadeConnector();

                   String inputPath = args[ 0 ];
                   String logsPath = args[ 1 ] + "/logs/";
                   String arrivalRatePath = args[ 1 ] + "/arrivalrate/";
                   String arrivalRateSecPath = arrivalRatePath + "sec";
                   String arrivalRateMinPath = arrivalRatePath + "min";

                   // create an assembly to import an Apache log file and store on DFS
                   // declares: "time", "method", "event", "status", "size"
                   Fields apacheFields = new Fields( "ip", "time", "method", "event", "status", "size" );
                   String apacheRegex = "^([^ ]*) +[^ ]* +[^ ]* +\\[([^]]*)\\] +\"([^ ]*) ([^ ]*) [^ ]*\" ([^ ]*) ([^ ]*).*$";
                   int[] apacheGroups = {1, 2, 3, 4, 5, 6};
                   RegexParser parser = new RegexParser( apacheFields, apacheRegex, apacheGroups );
                   Pipe importPipe = new Each( "import", new Fields( "line" ), parser );

                   // create tap to read a resource from the local file system, if not a URL to an external resource
                   // Lfs allows for relative paths
                   Tap logTap =
                     inputPath.matches( "^[^:]+://.*" ) ? new Hfs( new TextLine(), inputPath ) : new Lfs( new TextLine(), inputPath );

                   // create a tap to read/write from the default filesystem
                   Tap parsedLogTap = new Hfs( apacheFields, logsPath );

                   // connect the assembly to source and sink taps
                   Flow importLogFlow = flowConnector.connect( logTap, parsedLogTap, importPipe );

                   // create an assembly to parse out the time field into a timestamp
                   // then count the number of requests per second and per minute

                   // apply a text parser to create a timestamp with 'second' granularity
                   // declares field "ts"
                   DateParser dateParser = new DateParser( new Fields( "ts" ), "dd/MMM/yyyy:HH:mm:ss Z" );
                   Pipe tsPipe = new Each( "arrival rate", new Fields( "time" ), dateParser, Fields.RESULTS );

                   // name the per second assembly and split on tsPipe
                   Pipe tsCountPipe = new Pipe( "tsCount", tsPipe );
                   tsCountPipe = new GroupBy( tsCountPipe, new Fields( "ts" ) );
                   tsCountPipe = new Every( tsCountPipe, Fields.GROUP, new Count() );

                   // apply expression to create a timestamp with 'minute' granularity
                   // declares field "tm"
                   Pipe tmPipe = new Each( tsPipe, new ExpressionFunction( new Fields( "tm" ), "ts - (ts % (60 * 1000))", long.class ) );

                   // name the per minute assembly and split on tmPipe
                   Pipe tmCountPipe = new Pipe( "tmCount", tmPipe );
                   tmCountPipe = new GroupBy( tmCountPipe, new Fields( "tm" ) );
                   tmCountPipe = new Every( tmCountPipe, Fields.GROUP, new Count() );

                   // create taps to write the results to the default filesystem, using the given fields
                   Tap tsSinkTap = new Hfs( new TextLine(), arrivalRateSecPath );
                   Tap tmSinkTap = new Hfs( new TextLine(), arrivalRateMinPath );

                   // a convenience method for binding taps and pipes, order is significant
                   Map<String, Tap> sinks = Cascades.tapsMap( Pipe.pipes( tsCountPipe, tmCountPipe ), Tap.taps( tsSinkTap, tmSinkTap ) );

                   // connect the assembly to the source and sink taps
                   Flow arrivalRateFlow = flowConnector.connect( parsedLogTap, sinks, tsCountPipe, tmCountPipe );

                   // optionally print out the arrivalRateFlow to a graph file for import into a graphics package
                   //arrivalRateFlow.writeDOT( "arrivalrate.dot" );

                   // connect the flows by their dependencies, order is not significant
                   Cascade cascade = cascadeConnector.connect( importLogFlow, arrivalRateFlow );

                   // execute the cascade, which in turn executes each flow in dependency order
                   cascade.complete();
Cascading Model

                  Pipe:        Each | GroupBy | CoGroup | Every | SubAssembly
                  Operation:   Function | Filter | Aggregator | Buffer | Assertion
                  Joiner:      InnerJoin | OuterJoin | LeftJoin | RightJoin | MixedJoin | ...
                  Tap:         Hdfs | JDBC | HBase | ...
                  TupleEntry:  Fields + Tuple

                  Each applies:  Function, Filter, Value Assertion
                  Every applies: Aggregator, Buffer, Group Assertion
Assembling Applications

                   Tap      -> Each          -> Group -> Every  -> Tap
                 [source]     (func|filter)             (aggr)    [sink]
                                                 |
                                                Tap
                                               [trap]

          • Pipes can be arranged into arbitrary assemblies

                   Tap      -> Each          -> Group -> Every         -> Tap
                 [source]     (func|filter)             (aggr|buffer)    [sink]
                                                 |
                                                Tap
                                               [trap]

          • Some limitations imposed by Operations
Assembly -> Flow

                         Pipe Assembly

                           func -> CoGroup -> aggr -> GroupBy -> aggr -> func
                           func -> func --^

                         Flow

                           Source -> func -> CoGroup -> aggr -> GroupBy -> aggr -> func -> Sink
                           Source -> func -> func --^

                     • Assemblies are planned into Flows
Flow Planner

                              Map                  Reduce         Map       Reduce

                    Source -> func ---> CoGroup -> aggr -> temp -> GroupBy -> aggr -> func -> Sink
                    Source -> func -> func --^

                    • The Planner handles how operations
                      are partitioned into MapReduce jobs
Assembly -> Cluster

                 Client:  Assembly -> Flow
                 Cluster: Job -> Job
Best Practices:
                               Developing
              • Organize Flows as medium/small workloads
                    • Improves reusability and work-sharing

              • Small/Medium-grained Operations
                    • Improves reusability

              • Coarse-grained Operations
                    • Better performance vs flexibility

              • SubAssemblies encapsulate responsibilities
                    • Separate “rules” from “integration”

              • Solve problem first, Optimize second
                    • Can replace Flows with raw MapReduce

Best Practices: Testing

                • Unit tests
                         • Operations are tested independently of Hadoop or MapReduce
                • Regression / Integration testing
                         • Use the built-in testing cluster for small data
                         • Use EC2 for large regression data sets
                • Inline “strict” Assertions into Assemblies
                         • The Planner can remove them at run-time
                         • SubAssemblies are tested independently of other logic
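The “planner can remove them at run-time” point is worth sketching: each assertion carries a level, and the planner keeps only assertions at or below the level the run is configured with, so production runs pay nothing for them. This is a hypothetical Python model of the idea, not Cascading's AssertionLevel API; the level names and numeric encoding are invented for illustration.

```python
STRICT, VALID, NONE = 2, 1, 0   # assertion levels, strictest first

def plan_with_assertions(assembly, level):
    """Keep ordinary operations always; keep an assertion only if its
    level is at or below the planner's configured level. With
    level=NONE every assertion is planned out of the job."""
    planned = []
    for op, assertion_level in assembly:
        if assertion_level is None or assertion_level <= level:
            planned.append(op)
    return planned

assembly = [("parse", None), ("assert-not-null", STRICT), ("count", None)]
print(plan_with_assertions(assembly, STRICT))  # ['parse', 'assert-not-null', 'count']
print(plan_with_assertions(assembly, NONE))    # ['parse', 'count']
```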
Best Practices: Data Processing

              • Account for good/bad data
                    • Split good/bad data by rules
                    • Capture rule reasons with the “bad” data
              • Inline “validating” Assertions
                    • For staging or untrusted data-sources
                    • Again, can be “planned” out for performance at run-time
              • Capture really bad data through “traps”
                    • Depending on the fidelity of the app
              • Use s3n:// before s3://
                    • s3:// stores data in a proprietary block format; never use it as the default FS
                    • Store compressed, in chunks (1 file per Mapper)
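The first three bullets can be sketched together: route each record to a good or bad branch, keep the failed rule names alongside the bad record, and send records that blow up entirely to a trap. A hypothetical Python illustration (the rule names and record shapes are invented); it mirrors the pattern, not Cascading's actual trap mechanism.

```python
def split_records(records, rules):
    """Split records into good/bad by named rules, keeping the reasons
    a record failed with the bad record so it can be inspected or
    replayed. Records that raise while being checked go to a trap."""
    good, bad, trap = [], [], []
    for record in records:
        try:
            reasons = [name for name, rule in rules if not rule(record)]
        except Exception as exc:        # really bad data: trap it, keep going
            trap.append((record, repr(exc)))
            continue
        if reasons:
            bad.append((record, reasons))
        else:
            good.append(record)
    return good, bad, trap

rules = [("has-url", lambda r: bool(r.get("url"))),
         ("has-ts",  lambda r: r["ts"] > 0)]
records = [{"url": "http://a", "ts": 1}, {"url": "", "ts": 1}, {"url": "x"}]
good, bad, trap = split_records(records, rules)
print(good)  # [{'url': 'http://a', 'ts': 1}]
print(bad)   # [({'url': '', 'ts': 1}, ['has-url'])]
print(trap)  # the malformed record, with the exception that trapped it
```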
Cascading

                  • Free Software / GPLv3
                  • 1.0 since January
                  • Growing community
                  • User contributed modules
                  • Elastic MapReduce
1.1

                    • Flow level counters
                    • Fields.SWAP and Fields.REPLACE
                    • “Safe” Operations
                    • Hadoop “globbing” support
                    • Detailed job statistics


Resources
                    • Cascading
                     • http://cascading.org


                    • Enterprise tools and support
                     • http://concurrentinc.com


                    • Professional Services and Training
                     • http://scaleunlimited.com
                                                     Copyright Concurrent, Inc. 2009. All rights reserved
Sunday, June 21, 2009


Building Scale Free Applications with Hadoop and Cascading

  • 1. Building Scale-Free Applications with Hadoop and Cascading Chris K Wensel Concurrent, Inc. Sunday, June 21, 2009
  • 2. Introduction Chris K Wensel chris@wensel.net • Cascading, Lead developer • http://cascading.org/ • Concurrent, Inc., Founder • Hadoop/Cascading support and tools • http://concurrentinc.com • Scale Unlimited, Principal Instructor • Hadoop related Training and Consulting • http://scaleunlimited.com Copyright Concurrent, Inc. 2009. All rights reserved Sunday, June 21, 2009
  • 3. Topics Cascading MapReduce Hadoop A rapid introduction to Hadoop, MapReduce patterns, and best practices with Cascading. Copyright Concurrent, Inc. 2009. All rights reserved Sunday, June 21, 2009
  • 4. What is Hadoop? Copyright Concurrent, Inc. 2009. All rights reserved Sunday, June 21, 2009
  • 5. Conceptually Cluster Execution Storage • Single FileSystem Namespace (large files) • Single Execution-Space (high latency/batch) Copyright Concurrent, Inc. 2009. All rights reserved Sunday, June 21, 2009
  • 6. CPU, Disk, Rack, and Node Cluster Rack Rack Rack Node Node Node Node ... cpu Execution disk Storage • Virtualization of storage and execution resources • Normalizes disks and CPU • Rack and node aware • Data locality and fault tolerance Copyright Concurrent, Inc. 2009. All rights reserved Sunday, June 21, 2009
  • 7. Storage Rack Rack Rack Node Node Node Node ... block1 block1 block1 block2 block2 block2 One file, two blocks large, 3 nodes replicated • Files chunked into blocks • Blocks independently replicated Copyright Concurrent, Inc. 2009. All rights reserved Sunday, June 21, 2009
  • 8. Execution Rack Rack Rack Node Node Node Node ... map map map map map reduce reduce reduce • Map and Reduce tasks move to data • Or available processing slots Copyright Concurrent, Inc. 2009. All rights reserved Sunday, June 21, 2009
  • 9. Clients and Jobs Users Cluster Client Client Execution job job job Storage Client • Clients launch jobs into the cluster • One at a time if there are dependencies • Clients are not managed by Hadoop Copyright Concurrent, Inc. 2009. All rights reserved Sunday, June 21, 2009
  • 10. Physical Architecture NameNode DataNode DataNode DataNode DataNode data block ns read/write operations Secondary ns operations read/write ns operations read/write mapper mapper child jvm mapper child jvm jobs tasks child jvm Client JobTracker TaskTracker reducer reducer child jvm reducer child jvm child jvm • Solid boxes are unique applications • Dashed boxes are child JVM instances on same node as parent • Dotted boxes are blocks of managed files on same node as parent Copyright Concurrent, Inc. 2009. All rights reserved Sunday, June 21, 2009
  • 11. Deployment Node Node Node jobs tasks Client JobTracker TaskTracker DataNode Node NameNode Not uncommon to Node be same node Secondary • TaskTracker and DataNode collocated Copyright Concurrent, Inc. 2009. All rights reserved Sunday, June 21, 2009
  • 12. Why Hadoop? Copyright Concurrent, Inc. 2009. All rights reserved Sunday, June 21, 2009
  • 13. Why It’s Adopted Big Data PaaS Big Data Platform as a Service Few Apps Many Apps Copyright Concurrent, Inc. 2009. All rights reserved Sunday, June 21, 2009
  • 14. How It’s Used Ad hoc Queries Querying / Insights Processing Integration Integration / Processing Copyright Concurrent, Inc. 2009. All rights reserved Sunday, June 21, 2009
  • 15. Cascading and ...? Big Data PaaS Ad hoc Pig/Hive ? Queries Processing Cascading Cascading Integration • Designed for Processing / Integration • OK for ad-hoc queries (API, not Syntax) • Should you use Hadoop for small queries? Copyright Concurrent, Inc. 2009. All rights reserved Sunday, June 21, 2009
  • 16. ShareThis.com Processing Big Data Integration • Primarily integration and processing • Running on AWS Ad hoc PaaS Queries • Ad-hoc queries (mining) performed on Aster Data nCluster, not Hadoop Copyright Concurrent, Inc. 2009. All rights reserved Sunday, June 21, 2009
  • 17. ShareThis in AWS bixo web content logprocessor ... crawler indexer Amazon Web Services SQS AsterData loggers loggers loggers Hadoop loggers Hadoop Hadoop Cascading Cascading Cascading Katta Hyper S3 Table Copyright Concurrent, Inc. 2009. All rights reserved Sunday, June 21, 2009
  • 18. Staged Pipeline every Y hrs on bixo completion every X hrs logprocessor bixo indexer crawler ... ... • Each cluster • is “on demand” • is “right sized” to load • is tuned to boot at optimum frequency Copyright Concurrent, Inc. 2009. All rights reserved Sunday, June 21, 2009
  • 19. ShareThis: Best Practices • Authoritative/Original data kept in S3 in its native format • Derived data pipelined into relevant systems, when possible • Clusters tuned to load for best utilization • Loose coupling keeps change cases isolated • GC’ing Hadoop not a bad idea Copyright Concurrent, Inc. 2009. All rights reserved Sunday, June 21, 2009
  • 20. Thinking In MapReduce Copyright Concurrent, Inc. 2009. All rights reserved Sunday, June 21, 2009
  • 21. It’s really Map-Group-Reduce • Map and Reduce are user defined functions • Map translates input Keys and Values to new Keys and Values [K1,V1] Map [K2,V2] • System sorts the keys, and groups each unique Key with all its Values [K2,V2] Group [K2,{V2,V2,....}] • Reduce translates the Values of each unique Key to new Keys and Values [K2,{V2,V2,....}] Reduce [K3,V3] Copyright Concurrent, Inc. 2009. All rights reserved Sunday, June 21, 2009
  • 22. Word Count • Read a line of text, output every word [0, "when in the course of human events"] Map ["when",1] ["in",1] ["the",1] [...,1] • Group all the values with each unique word ["when",1] ["when",1] ["when",1] ["when",1] Group ["when",{1,1,1,1,1}] ["when",1] • Add up the values associated with each unique word ["when",{1,1,1,1,1}] Reduce ["when",5] Copyright Concurrent, Inc. 2009. All rights reserved Sunday, June 21, 2009
  • 23. Know That... • 1 Job = 1 Map impl + 1 Reduce impl • Map: • is required in a Hadoop Job • may emit 0 or more Key Value pairs [K2,V2] • Reduce: • is optional in a Hadoop Job • sees Keys in sort order • but Keys will be randomly distributed across Reducers • the collection of Values per Key are unordered [K2,{V2..}] • may emit 0 or more Key Value pairs [K3,V3] Copyright Concurrent, Inc. 2009. All rights reserved Sunday, June 21, 2009
• 24. Runtime Distribution [K1,V1] Map [K2,V2] Group [K2,{V2,...}] Reduce [K3,V3] — [diagram: file directory blocks -> input splits -> Mapper Tasks -> Shuffle -> Reducer Tasks -> part-00000..part-000N output files] Mappers must complete before Reducers can begin Copyright Concurrent, Inc. 2009. All rights reserved Sunday, June 21, 2009
  • 25. Common Patterns • Filtering or “Grepping” • Distributed Tasks • Parsing, Conversion • Simple Total Sorting • Counting, Summing • Chained Jobs • Binning, Collating Copyright Concurrent, Inc. 2009. All rights reserved Sunday, June 21, 2009
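The first pattern above, filtering or “grepping”, is a map-only job: zero reduce tasks, and the map function emits zero or one pairs per input record. A minimal plain-Java stand-in (not the Hadoop API; names are hypothetical):

```java
import java.util.*;
import java.util.regex.*;

// Sketch of the "grepping" pattern: a map-only pass where each
// input line is emitted only if it matches a regular expression.
public class GrepSim
  {
  public static List<String> grepMap( List<String> lines, String regex )
    {
    Pattern pattern = Pattern.compile( regex );
    List<String> out = new ArrayList<>();

    for( String line : lines )
      if( pattern.matcher( line ).find() ) // emit 0 or 1 records per input
        out.add( line );

    return out;
    }

  public static void main( String[] args )
    {
    List<String> lines = Arrays.asList(
      "GET /index.html 200", "GET /missing 404", "POST /form 200" );

    // keep only requests that ended in a 404
    System.out.println( grepMap( lines, " 404$" ) ); // [GET /missing 404]
    }
  }
```

In Hadoop terms this is a Mapper plus `setNumReduceTasks( 0 )`, as in the logparser example later in the deck.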
  • 26. Advanced Patterns • Group By • CoGrouping / Joining • Distinct • Distributed Total Sort • Secondary Sort Copyright Concurrent, Inc. 2009. All rights reserved Sunday, June 21, 2009
• 27. For Example: Secondary Sort — [diagram: Mapper [K1,V1] -> [K2,V2]: <A1,B1,C1,D1> maps to K2 = <A2,B2><C2>, V2 = <D2>; Reducer [K2,{V2,V2,....}] -> [K3,V3]: <A2,B2,{<C2,D2>,...}> reduces to K3 = <A3,B3>, V3 = <C3,D3>] • The K2 key becomes a composite Key • Key: [grouping, secondary], Value: [remaining values] • Shuffle phase • Custom Partitioner, must only partition on grouping Fields • Standard ‘output key’ Comparator, must compare on all Fields • Custom ‘value grouping’ Comparator, must compare grouping Fields Copyright Concurrent, Inc. 2009. All rights reserved Sunday, June 21, 2009
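The composite-key trick on the slide can be simulated in plain Java: sort on the full [grouping, secondary] key, but group (and, on a cluster, partition) on the grouping part only, so each group's values arrive already sorted. This is a conceptual sketch, not Hadoop's actual Partitioner/Comparator classes; all names are hypothetical.

```java
import java.util.*;

// Plain-Java simulation of the secondary-sort pattern.
public class SecondarySortSim
  {
  // composite key: [grouping field, secondary sort field]
  public record CompositeKey(String group, int secondary) {}

  // 'output key' Comparator: compares ALL fields
  static final Comparator<CompositeKey> SORT =
    Comparator.comparing( CompositeKey::group ).thenComparingInt( CompositeKey::secondary );

  public static LinkedHashMap<String, List<Integer>> groupSorted( List<CompositeKey> keys )
    {
    keys.sort( SORT ); // sort on the full composite key

    // 'value grouping' compares only the grouping field, so each
    // group's values come out in secondary-sorted order
    LinkedHashMap<String, List<Integer>> groups = new LinkedHashMap<>();

    for( CompositeKey key : keys )
      groups.computeIfAbsent( key.group(), g -> new ArrayList<>() ).add( key.secondary() );

    return groups;
    }

  public static void main( String[] args )
    {
    List<CompositeKey> keys = new ArrayList<>( List.of(
      new CompositeKey( "b", 3 ), new CompositeKey( "a", 2 ),
      new CompositeKey( "a", 1 ), new CompositeKey( "b", 1 ) ) );

    System.out.println( groupSorted( keys ) ); // {a=[1, 2], b=[1, 3]}
    }
  }
```

In real Hadoop the partitioner must also hash only the grouping fields, otherwise records for one group can land on different reducers.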
  • 28. Very Advanced • Classification • Dimension Reduction • Clustering • Evolutionary • Regression Algorithms Copyright Concurrent, Inc. 2009. All rights reserved Sunday, June 21, 2009
  • 29. Troubles w/ MapReduce [ k, [v] ] [ k, [v] ] Map Reduce Map Reduce [ k, v ] [ k, v ] [ k, v ] [ k, v ] File File File [ k, v ] = key and value pair [ k, [v] ] = key and associated values collection • Keys and Values are brittle • Map and Reduce is coarse grained Copyright Concurrent, Inc. 2009. All rights reserved Sunday, June 21, 2009
• 30. Real World Apps — [dependency graph of a single application compiled into 75 MapReduce jobs] 1 app, 75 jobs green = map + reduce purple = map blue = join/merge orange = map split Copyright Concurrent, Inc. 2009. All rights reserved Sunday, June 21, 2009
• 31. Not Thinking in MapReduce — [diagram: a Source of tuples [ f1,f2,.. ] streaming through a chain of Pipes to a Sink, planned onto Map and Reduce phases underneath] [ f1, f2,... ] = tuples with field names • It’s easier with Fields/Tuples... • Streaming through Pipes/Operations Copyright Concurrent, Inc. 2009. All rights reserved Sunday, June 21, 2009
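The shift the slide describes, from key/value pairs to named-field tuples flowing through operations, can be sketched in plain Java (this is not the Cascading API; the tuple representation and field names are invented for illustration):

```java
import java.util.*;
import java.util.stream.*;

// Conceptual sketch: a "tuple" is a map from field name to value,
// and a pipeline of operations addresses fields by name instead of
// by key/value position.
public class TupleStreamSketch
  {
  public static Map<String, Object> tuple( Object... fieldValuePairs )
    {
    Map<String, Object> t = new LinkedHashMap<>();

    for( int i = 0; i < fieldValuePairs.length; i += 2 )
      t.put( (String) fieldValuePairs[ i ], fieldValuePairs[ i + 1 ] );

    return t;
    }

  // pipe: filter on the "status" field, then group on "url" and sum "size".
  // adding or reordering other fields would not break this pipeline.
  public static Map<Object, Integer> sizeByOkUrl( List<Map<String, Object>> source )
    {
    return source.stream()
      .filter( t -> (int) t.get( "status" ) == 200 )
      .collect( Collectors.groupingBy( t -> t.get( "url" ),
               Collectors.summingInt( t -> (int) t.get( "size" ) ) ) );
    }

  public static void main( String[] args )
    {
    List<Map<String, Object>> source = List.of(
      tuple( "url", "/a", "status", 200, "size", 10 ),
      tuple( "url", "/b", "status", 404, "size", 5 ),
      tuple( "url", "/a", "status", 200, "size", 20 ) );

    System.out.println( sizeByOkUrl( source ) ); // {/a=30}
    }
  }
```

This is the brittleness point from the previous slide: with raw MapReduce, every operation must agree on what is packed into the key and value; with named fields, each operation only names the fields it touches.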
• 32. [Cascading flow graph, generated from the planner: Dfs TextLine sources over '/data/', Each(RegexSplitter) parsers, a chain of CoGroup and Identity operations, GroupBy plus Every(Aggregator) steps, Dfs sinks] 1 app, 12 jobs green = source or sink blue = pipe+operation Copyright Concurrent, Inc. 2009. All rights reserved Sunday, June 21, 2009
  • 33. Cascading Copyright Concurrent, Inc. 2009. All rights reserved Sunday, June 21, 2009
  • 34. In A Nut Shell • An alternative API to MapReduce • Fields and Tuples • Functions, Filters, and Aggregators • Built-in ‘relational’ operations • Can be used by any JVM based language Copyright Concurrent, Inc. 2009. All rights reserved Sunday, June 21, 2009
  • 35. Developers • Re-use or write new Operations (Func,Filter,Agg) • Chain the Ops into data processing work-flows • Instantiate the work-flow with data end-points Copyright Concurrent, Inc. 2009. All rights reserved Sunday, June 21, 2009
  • 36. Some Sample Code http://github.com/cwensel/cascading.samples • hadoop • logparser • loganalysis bit.ly links on next pages Copyright Concurrent, Inc. 2009. All rights reserved Sunday, June 21, 2009
  • 37. Raw MapReduce: Parsing a Log File > read one key/value at a time from inputPath > using some regular expression... > parse value into regular expression groups > save results as a TAB delimited row to the outputPath http://bit.ly/RegexMap http://bit.ly/RegexMain Copyright Concurrent, Inc. 2009. All rights reserved Sunday, June 21, 2009
• 38. http://bit.ly/RegexMap File - /Users/cwensel/sandbox/calku/cascading.samples/hadoop/src/java/hadoop/RegexParserMap.java

public class RegexParserMap extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text>
  {
  private Pattern pattern;
  private Matcher matcher;

  @Override
  public void configure( JobConf job )
    {
    pattern = Pattern.compile( job.get( "logparser.regex" ) );
    matcher = pattern.matcher( "" ); // lets re-use the matcher
    }

  @Override
  public void map( LongWritable key, Text value, OutputCollector<Text, Text> output, Reporter reporter ) throws IOException
    {
    matcher.reset( value.toString() );

    if( !matcher.find() )
      throw new RuntimeException( "could not match pattern: [" + pattern + "] with value: [" + value + "]" );

    StringBuffer buffer = new StringBuffer();

    for( int i = 0; i < matcher.groupCount(); i++ )
      {
      if( i != 0 )
        buffer.append( "\t" );

      buffer.append( matcher.group( i + 1 ) ); // skip group 0
      }

    // pass null so a TAB is not prepended, not all OutputFormats accept null
    output.collect( null, new Text( buffer.toString() ) );
    }
  }

Copyright Concurrent, Inc. 2009. All rights reserved Sunday, June 21, 2009
• 39. http://bit.ly/RegexMain

public class Main
  {
  public static void main( String[] args ) throws IOException
    {
    // create Hadoop path instances
    Path inputPath = new Path( args[ 0 ] );
    Path outputPath = new Path( args[ 1 ] );

    // get the FileSystem instances for the input path
    FileSystem outputFS = outputPath.getFileSystem( new JobConf() );

    // if output path exists, delete recursively
    if( outputFS.exists( outputPath ) )
      outputFS.delete( outputPath, true );

    // initialize Hadoop job configuration
    JobConf jobConf = new JobConf();
    jobConf.setJobName( "logparser" );

    // set the current job jar
    jobConf.setJarByClass( Main.class );

    // set the input path and input format
    TextInputFormat.setInputPaths( jobConf, inputPath );
    jobConf.setInputFormat( TextInputFormat.class );

    // set the output path and output format
    TextOutputFormat.setOutputPath( jobConf, outputPath );
    jobConf.setOutputFormat( TextOutputFormat.class );
    jobConf.setOutputKeyClass( Text.class );
    jobConf.setOutputValueClass( Text.class );

    // must set to zero since we have no reduce function
    jobConf.setNumReduceTasks( 0 );

    // configure our parsing map class
    jobConf.setMapperClass( RegexParserMap.class );

    String apacheRegex = "^([^ ]*) +[^ ]* +[^ ]* +\\[([^]]*)\\] +\"([^ ]*) ([^ ]*) [^ ]*\" ([^ ]*) ([^ ]*).*$";
    jobConf.set( "logparser.regex", apacheRegex );

    // create Hadoop client, must pass in this JobConf for some reason
    JobClient jobClient = new JobClient( jobConf );

    // submit job
    RunningJob runningJob = jobClient.submitJob( jobConf );

    // block until job completes
    runningJob.waitForCompletion();
    }
  }

Copyright Concurrent, Inc. 2009. All rights reserved Sunday, June 21, 2009
  • 40. Cascading: Parsing a Log File > read one “line” at a time from “source” > using some regular expression... > parse “line” into “ip, time, method, event, status, size” > save as a TAB delimited row to the “sink” http://bit.ly/LPMain Copyright Concurrent, Inc. 2009. All rights reserved Sunday, June 21, 2009
• 41. http://bit.ly/LPMain

public static void main( String[] args )
  {
  String inputPath = args[ 0 ];
  String outputPath = args[ 1 ];

  // define what the input file looks like, "offset" is bytes from beginning
  TextLine scheme = new TextLine( new Fields( "offset", "line" ) );

  // create SOURCE tap to read a resource from the local file system, if input is not an URL
  Tap logTap = inputPath.matches( "^[^:]+://.*" ) ? new Hfs( scheme, inputPath ) : new Lfs( scheme, inputPath );

  // create an assembly to parse an Apache log file and store on an HDFS cluster

  // declare the field names we will parse out of the log file
  Fields apacheFields = new Fields( "ip", "time", "method", "event", "status", "size" );

  // define the regular expression to parse the log file with
  String apacheRegex = "^([^ ]*) +[^ ]* +[^ ]* +\\[([^]]*)\\] +\"([^ ]*) ([^ ]*) [^ ]*\" ([^ ]*) ([^ ]*).*$";

  // declare the groups from the above regex we want to keep. each regex group will be given
  // a field name from 'apacheFields', above, respectively
  int[] allGroups = {1, 2, 3, 4, 5, 6};

  // create the parser
  RegexParser parser = new RegexParser( apacheFields, apacheRegex, allGroups );

  // create the import pipe element, with the name 'import', and with the input argument named "line"
  // replace the incoming tuple with the parser results
  // "line" -> parser -> "ts"
  Pipe importPipe = new Each( "import", new Fields( "line" ), parser, Fields.RESULTS );

  // create a SINK tap to write to the default filesystem
  // by default, TextLine writes all fields out
  Tap remoteLogTap = new Hfs( new TextLine(), outputPath, SinkMode.REPLACE );

  // set the current job jar
  Properties properties = new Properties();
  FlowConnector.setApplicationJarClass( properties, Main.class );

  // connect the assembly to the SOURCE and SINK taps
  Flow parsedLogFlow = new FlowConnector( properties ).connect( logTap, remoteLogTap, importPipe );

  // optionally print out the parsedLogFlow to a DOT file for import into a graphics package
  // parsedLogFlow.writeDOT( "logparser.dot" );

  // start execution of the flow (either locally or on the cluster)
  parsedLogFlow.start();

  // block until the flow completes
  parsedLogFlow.complete();
  }

Copyright Concurrent, Inc. 2009. All rights reserved Sunday, June 21, 2009
  • 42. What Changes When... • We want to sort by “status” field? • We want to filter out some “urls”? • We want to sum the “size”? • Do all the above? • We are thinking in ‘fields’ not Key/Values Copyright Concurrent, Inc. 2009. All rights reserved Sunday, June 21, 2009
  • 43. Cascading: Arrival rate > ..parse data as before.. > read parsed logs > calculate “urls” per second > calculate “urls” per minute > save “sec interval” & “count” as a TAB’d row to “sink1” > save “min interval” & “count” as a TAB’d row to “sink2” http://bit.ly/LAMain Copyright Concurrent, Inc. 2009. All rights reserved Sunday, June 21, 2009
• 44. http://bit.ly/LAMain — Arrival Rate Flow [flow graph: Hfs SequenceFile source ['ip', 'time', 'method', 'event', 'status', 'size'] over 'output/logs/' -> Each('arrival rate')[DateParser -> 'ts']; one branch: GroupBy('tsCount')[by 'ts'] -> Every(Count -> 'count') -> Hfs TextLine sink 'output/arrivalrate/sec'; other branch: Each('arrival rate')[ExpressionFunction -> 'tm'] -> GroupBy('tmCount')[by 'tm'] -> Every(Count -> 'count') -> Hfs TextLine sink 'output/arrivalrate/min'] Copyright Concurrent, Inc. 2009. All rights reserved Sunday, June 21, 2009
• 45. http://bit.ly/LAMain

// set the current job jar
Properties properties = new Properties();
FlowConnector.setApplicationJarClass( properties, Main.class );

FlowConnector flowConnector = new FlowConnector( properties );
CascadeConnector cascadeConnector = new CascadeConnector();

String inputPath = args[ 0 ];
String logsPath = args[ 1 ] + "/logs/";
String arrivalRatePath = args[ 1 ] + "/arrivalrate/";
String arrivalRateSecPath = arrivalRatePath + "sec";
String arrivalRateMinPath = arrivalRatePath + "min";

// create an assembly to import an Apache log file and store on DFS
// declares: "time", "method", "event", "status", "size"
Fields apacheFields = new Fields( "ip", "time", "method", "event", "status", "size" );
String apacheRegex = "^([^ ]*) +[^ ]* +[^ ]* +\\[([^]]*)\\] +\"([^ ]*) ([^ ]*) [^ ]*\" ([^ ]*) ([^ ]*).*$";
int[] apacheGroups = {1, 2, 3, 4, 5, 6};
RegexParser parser = new RegexParser( apacheFields, apacheRegex, apacheGroups );
Pipe importPipe = new Each( "import", new Fields( "line" ), parser );

// create tap to read a resource from the local file system, if not an url for an external resource
// Lfs allows for relative paths
Tap logTap = inputPath.matches( "^[^:]+://.*" ) ? new Hfs( new TextLine(), inputPath ) : new Lfs( new TextLine(), inputPath );

// create a tap to read/write from the default filesystem
Tap parsedLogTap = new Hfs( apacheFields, logsPath );

// connect the assembly to source and sink taps
Flow importLogFlow = flowConnector.connect( logTap, parsedLogTap, importPipe );

// create an assembly to parse out the time field into a timestamp
// then count the number of requests per second and per minute

// apply a text parser to create a timestamp with 'second' granularity
// declares field "ts"
DateParser dateParser = new DateParser( new Fields( "ts" ), "dd/MMM/yyyy:HH:mm:ss Z" );
Pipe tsPipe = new Each( "arrival rate", new Fields( "time" ), dateParser, Fields.RESULTS );

// name the per second assembly and split on tsPipe
Pipe tsCountPipe = new Pipe( "tsCount", tsPipe );
tsCountPipe = new GroupBy( tsCountPipe, new Fields( "ts" ) );
tsCountPipe = new Every( tsCountPipe, Fields.GROUP, new Count() );

// apply expression to create a timestamp with 'minute' granularity
// declares field "tm"
Pipe tmPipe = new Each( tsPipe, new ExpressionFunction( new Fields( "tm" ), "ts - (ts % (60 * 1000))", long.class ) );

// name the per minute assembly and split on tmPipe
Pipe tmCountPipe = new Pipe( "tmCount", tmPipe );
tmCountPipe = new GroupBy( tmCountPipe, new Fields( "tm" ) );
tmCountPipe = new Every( tmCountPipe, Fields.GROUP, new Count() );

// create taps to write the results to the default filesystem, using the given fields
Tap tsSinkTap = new Hfs( new TextLine(), arrivalRateSecPath );
Tap tmSinkTap = new Hfs( new TextLine(), arrivalRateMinPath );

// a convenience method for binding taps and pipes, order is significant
Map<String, Tap> sinks = Cascades.tapsMap( Pipe.pipes( tsCountPipe, tmCountPipe ), Tap.taps( tsSinkTap, tmSinkTap ) );

// connect the assembly to the source and sink taps
Flow arrivalRateFlow = flowConnector.connect( parsedLogTap, sinks, tsCountPipe, tmCountPipe );

// optionally print out the arrivalRateFlow to a graph file for import into a graphics package
//arrivalRateFlow.writeDOT( "arrivalrate.dot" );

// connect the flows by their dependencies, order is not significant
Cascade cascade = cascadeConnector.connect( importLogFlow, arrivalRateFlow );

// execute the cascade, which in turn executes each flow in dependency order
cascade.complete();

Copyright Concurrent, Inc. 2009. All rights reserved Sunday, June 21, 2009
• 46. Cascading Model — [model diagram: Pipe types: Each, GroupBy, CoGroup, Every, SubAssembly; Operations: Function, Filter, Aggregator, Buffer, Assertion; Joiners: InnerJoin, OuterJoin, LeftJoin, RightJoin, MixedJoin; Taps: Hdfs, JDBC, HBase, and others; an Each applies a Function, Filter, or Assertion per Value, an Every applies an Aggregator, Buffer, or Assertion per Group; a TupleEntry pairs Fields with a Tuple] Copyright Concurrent, Inc. 2009. All rights reserved Sunday, June 21, 2009
• 47. Assembling Applications — [diagram: Tap [source] -> Each(func | filter) -> Group -> Every(aggr | buffer) -> Tap [sink], with a Tap [trap] attached] • Pipes can be arranged into arbitrary assemblies • Some limitations imposed by Operations Copyright Concurrent, Inc. 2009. All rights reserved Sunday, June 21, 2009
• 48. Assembly -> Flow — [diagram: a Pipe Assembly (funcs -> CoGroup aggr -> GroupBy aggr -> func) is bound to Sources and a Sink and planned into a Flow] • Assemblies are planned into Flows Copyright Concurrent, Inc. 2009. All rights reserved Sunday, June 21, 2009
• 49. Flow Planner — [diagram: Sources -> func -> CoGroup aggr | temp file | GroupBy aggr -> func -> Sink, partitioned across the Map and Reduce phases of two jobs] • The Planner handles how operations are partitioned into MapReduce jobs Copyright Concurrent, Inc. 2009. All rights reserved Sunday, June 21, 2009
• 50. Assembly -> Cluster — [diagram: on the Client, an Assembly is planned into a Flow; the Flow submits its Jobs to the Cluster] Copyright Concurrent, Inc. 2009. All rights reserved Sunday, June 21, 2009
  • 51. Best Practices: Developing • Organize Flows as med/small work-loads • Improves reusability and work-sharing • Small/Medium-grained Operations • Improves reusability • Coarse-grained Operations • Better performance vs flexibility • SubAssemblies encapsulate responsibilities • Separate “rules” from “integration” • Solve problem first, Optimize second • Can replace Flows with raw MapReduce Copyright Concurrent, Inc. 2009. All rights reserved Sunday, June 21, 2009
  • 52. Best Practices: Testing • Unit tests • Operations tested independent of Hadoop or Map/Reduce • Regression / Integration Testing • Use built-in testing cluster for small data • Use EC2 for large regression data sets • Inline “strict” Assertions into Assemblies • Planner can remove them at run-time • SubAssemblies tested independently of other logic Copyright Concurrent, Inc. 2009. All rights reserved Sunday, June 21, 2009
• 53. Best Practices: Data Processing • Account for good/bad data • Split good/bad data by rules • Capture rule reasons with “bad” data • Inline “validating” Assertions • For staging or untrusted data-sources • Again, can be “planned” out for performance at run-time • Capture really bad data through “traps” • Depending on fidelity of the app • Use s3n:// before s3:// • s3:// stores data in a proprietary format, never use it as the default FS • Store compressed, in chunks (1 file per Mapper) Copyright Concurrent, Inc. 2009. All rights reserved Sunday, June 21, 2009
• 54. Cascading • Free Software / GPLv3 • 1.0 since January • Growing community • User contributed modules • Elastic MapReduce Copyright Concurrent, Inc. 2009. All rights reserved Sunday, June 21, 2009
  • 55. 1.1 • Flow level counters • Fields.SWAP and Fields.REPLACE • “Safe” Operations • Hadoop “globbing” support • Detailed job statistics Copyright Concurrent, Inc. 2009. All rights reserved Sunday, June 21, 2009
  • 56. Resources • Cascading • http://cascading.org • Enterprise tools and support • http://concurrentinc.com • Professional Services and Training • http://scaleunlimited.com Copyright Concurrent, Inc. 2009. All rights reserved Sunday, June 21, 2009