Low-Latency “OLAP” with Hadoop and HBase
      Andrei Dragomir | Software Engineer




© 2012 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential.
Synopsis


  §  What are we trying to solve
  §  Description of our system
  §  How it works
  §  Minimizing Latency




In a nutshell


  Low-latency OLAP system
  Hadoop DFS stores the input data (i.e. log files or
  HBase tables)
  The processing loop of the system takes a cube
  description and processes it (pre-aggregations)
  using Hadoop Map/Reduce.
  The output is written to a statistics HBase table.
  To get the data, users query a server, which scans
  the HBase table, applying the filters, roll-ups or
  drill-downs, and returning the result.
Vocabulary

  Date         Country       City       OS       Browser      Sales
  2012-05-12   USA           NY         Win      FF           $ 0.0
  2012-05-12   USA           NY         Win      FF           $ 10.0
  2012-05-13   USA           SF         OSX      Chrome       $ 25.0
  2012-05-13   Canada        Ontario    Linux    Chrome       $ 0.0
  2012-05-14   USA           Chicago    OSX      Safari       $ 15.0
  ...          ...           ...        ...      ...          ...
  5 Visits     2 Countries:  4 Cities:  3 OS:    3 Browser:   $ 50.0
  3 Days       USA: 4        NY: 2      Win: 2   FF: 2        3 sales
               Canada: 1     SF: 1      OSX: 2   Chrome: 2




  §    We want to get (mostly) numeric data: metrics
  §    These metrics have a set of labels (dimensions)
  §    We want to view the metrics by any combination of
        dimensions
OLAP Queries

  §    Rolling up to country level

         SELECT COUNT(visits), SUM(sales) GROUP BY country

         Country   visits   sales
         USA       4        $50
         Canada    1        0

  §    “Slicing” by browser

         SELECT COUNT(visits), SUM(sales) GROUP BY country HAVING browser = "FF"

         Country   visits   sales
         USA       2        $10
         Canada    0        0

  §    Top browsers by sales

         SELECT SUM(sales), COUNT(visits) GROUP BY browser ORDER BY sales

         Browser   sales    visits
         Chrome    $25      2
         Safari    $15      1
         FF        $10      2

Looking inside – physical diagram




Looking inside – logical diagram




Simplifying assumptions: pre-aggregation


  §  In most cases...
       §  Data needs to be summarized – hard to draw 1B data points
       §  You don’t need to look at all dimensions at the same time – hard to correlate
       §  Not all queries are used with the same frequency




A timeless CS problem: Optimize...


  Time
       §  Pre-aggregation
       §  Fast
       §  Efficient reads – O(1)
       §  Inflexible
       §  Processing latency
       §  Combinatorial explosion

  Space
       §  Runtime aggregation
       §  Flexible
       §  I/O, CPU intensive
       §  Slow – always need to look at all the data
       §  Low throughput
Solution ?


  §  Just do both!
  §  Can tune: pre-aggregate more, or rely more on
      runtime aggregation
  §  Trade-off: ingestion + processing speed vs. query speed

  §  Works just like normal queries +
      materialized views




Solution ?


  §  Process: pre-aggregate all the report
       definitions, create an indexed HBase table.
  §  Query: use the indexes to get the data
       fast. Perform extra aggregation and filtering
       at runtime if needed.
  §  Platform strengths
       §  Parallelism in M/R
       §  Fast access and natural key ordering in HBase
Minimal HBase details

  §    Data is stored in tables
  §    Each row has a key, and any number of
        columns (long & wide)
  §    Ordered by row keys: clustered indexes
        built-in
  §    Sparse tables. NULLs are free.

  Row Key   Columns...
  u1        v1     v2     v3
  u2        v      X      ...
  u3        v      x      ...
  u4        x      v2     ...
  u5        ...    v3     ...
  u6        ...    v5     ...
  u7        ...    ...    ...
  u8        ...    ...    ...


Minimal HBase details

  §    Operations use row key: get(), put()
  §    Can scan a range of rows: [start, end)
  §    We can use the row key as a built-in
        indexing mechanism

  Row key   Column
  aaa       v1
  aab       v2
  aac       v3     ←
  aad       v4     ←
  aae       v5     ←
  aaf       v6     ←
  aba       ...
  abb       ...
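A minimal sketch of why sorted row keys behave like a built-in index: a half-open range scan reduces to two binary searches over the ordered keys. This is plain Python over an in-memory list, not the HBase client API; the `ROWS` list and the `scan` helper are illustrative assumptions that reuse the keys from the slide.

```python
from bisect import bisect_left

# Toy stand-in for an HBase table: (row key, value) pairs kept sorted by key.
# The keys mirror the slide; "..." marks values the slide leaves unspecified.
ROWS = [("aaa", "v1"), ("aab", "v2"), ("aac", "v3"), ("aad", "v4"),
        ("aae", "v5"), ("aaf", "v6"), ("aba", "..."), ("abb", "...")]
KEYS = [key for key, _ in ROWS]


def scan(start, stop):
    """Return all rows with start <= key < stop, i.e. the range [start, stop)."""
    lo = bisect_left(KEYS, start)   # first key >= start
    hi = bisect_left(KEYS, stop)    # first key >= stop (excluded)
    return ROWS[lo:hi]


if __name__ == "__main__":
    for key, value in scan("aac", "aba"):
        print(key, value)
```

Because the keys are ordered, the cost of locating the range does not depend on how many rows fall outside it, which is the property the later slides rely on.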
  




SaasBase vs. SQL Views Comparison




Reports configuration


  §    List of Dimensions (with custom classes,
        arguments, etc)
  §    List of Metrics (with custom classes, arguments,
        etc)
  §    List of Reports, each containing
        §    Dimensions (subset)
        §    Metrics (subset)
        §    Sorting, etc
  §    The reports configuration is used in the
        entire system: import, process, query
Solution ?

  Date         Country   City   Sales
  2012-05-12   USA       NY     3
  2012-05-12   USA       NY     10
  2012-05-13   USA       SF     25
  2012-05-13   CAN       ON     0
  2012-05-14   USA       CH     15

  visits_by_city: {
    dimensions: [country, city],
    metrics: [visits]
  },
  daily_sales: {
    dimensions: [year, month, day, country],
    metrics: [sales]
  }

  Statistics HBASE Output Table

  ROWKEY                        VALUE
  daily_sales/2012+05+12+USA    $13
  daily_sales/2012+05+13+CAN    $0
  daily_sales/2012+05+13+USA    $25
  daily_sales/2012+05+14+USA    $15
  visits_by_city/CAN+ON         1
  visits_by_city/USA+CH         1
  visits_by_city/USA+NY         2
  visits_by_city/USA+SF         1

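The mapping from input rows and report definitions to statistics rows can be sketched in a few lines. This is a single-process Python illustration of the pre-aggregation idea, not the Map/Reduce job itself; the function names and the dictionary layout (row key to a dict of metric columns) are assumptions made for the example, while the data and the two reports come from the slide.

```python
from collections import defaultdict

# Input rows from the slide: (date, country, city, sales).
VISITS = [
    ("2012-05-12", "USA", "NY", 3),
    ("2012-05-12", "USA", "NY", 10),
    ("2012-05-13", "USA", "SF", 25),
    ("2012-05-13", "CAN", "ON", 0),
    ("2012-05-14", "USA", "CH", 15),
]

# Report definitions from the slide.
REPORTS = {
    "visits_by_city": {"dimensions": ["country", "city"], "metrics": ["visits"]},
    "daily_sales": {"dimensions": ["year", "month", "day", "country"],
                    "metrics": ["sales"]},
}


def to_record(date, country, city, sales):
    """Expand one input row into named dimensions and metrics."""
    year, month, day = date.split("-")
    return {"year": year, "month": month, "day": day,
            "country": country, "city": city,
            "visits": 1, "sales": sales}   # every input row counts as one visit


def pre_aggregate(rows, reports):
    """Build {row key: {metric: total}} for every report in one pass."""
    table = defaultdict(lambda: defaultdict(int))
    for row in rows:
        record = to_record(*row)
        for name, report in reports.items():
            key = name + "/" + "+".join(record[d] for d in report["dimensions"])
            for metric in report["metrics"]:
                table[key][metric] += record[metric]
    # Sort by key to mimic HBase's natural row ordering.
    return {key: dict(cols) for key, cols in sorted(table.items())}


if __name__ == "__main__":
    for key, cols in pre_aggregate(VISITS, REPORTS).items():
        print(key, cols)
```

Each input row is read once and contributes to every report, so adding a report changes the amount of output but not the number of passes over the input.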
HBase natural order: hierarchical filtering




Sorting


  §    Add the metrics that you want to sort by to the
        row key...
  §    In a way that preserves the ordering
  §    ORDER BY metric DESC == Long.MAX_VALUE - metric

  2012+05+USA+0000000000+
  2012+05+USA+4294961296+SF  =  1000 visits
  2012+05+USA+4294961396+NY  =  900 visits
  ...
  2012+05+USA+9999999999+
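A hedged sketch of the inverted-metric trick: subtracting the metric from a fixed maximum and zero-padding the result makes plain lexicographic key order equal to descending metric order. The key layout and the `sort_key` helper are assumptions for illustration; the padding width here follows Java's `Long.MAX_VALUE` (19 digits), so the keys do not reproduce the 10-digit numbers shown on the slide.

```python
LONG_MAX = 2**63 - 1          # value of Java's Long.MAX_VALUE
WIDTH = len(str(LONG_MAX))    # 19 digits, so every inverted value pads to equal length


def sort_key(prefix, metric, label):
    """Build a row key whose lexicographic order is descending by metric.

    Assumes 0 <= metric <= LONG_MAX.
    """
    inverted = LONG_MAX - metric
    return f"{prefix}+{inverted:0{WIDTH}d}+{label}"


if __name__ == "__main__":
    keys = [
        sort_key("2012+05+USA", 900, "NY"),
        sort_key("2012+05+USA", 1000, "SF"),
        sort_key("2012+05+USA", 5, "CH"),
    ]
    # A plain sort stands in for HBase's natural row-key ordering.
    for key in sorted(keys):
        print(key)
```

The zero-padding matters: without a fixed width, "99" would sort after "100" and the ordering would break.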
  


Minimizing Latency




Minimizing Import Latency


  §    Only import the minimal set of changes
  §    Map/Reduce input filters:
        §    c.a.s.a.i.FileCache – checks if file already
              processed
        §    c.a.s.a.i.FileDateFilter – checks if a date in
              the file path falls within a specified interval
        §    process files from 3 days ago up until now,
              once
        §    HBase scan (from import table) start and stop row
  §    Minimize map-task overhead – stitch input splits
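The two input filters described above can be approximated as follows. This is a sketch only: the real FileCache and FileDateFilter classes are not shown in the slides, and the `.../YYYY/MM/DD/...` path layout, the function names, and the in-memory set of processed files are all assumptions.

```python
import re
from datetime import date, timedelta

# Hypothetical path layout: the date appears as .../YYYY/MM/DD/... in the path.
DATE_IN_PATH = re.compile(r"(\d{4})/(\d{2})/(\d{2})")


def accept(path, start, end, already_processed):
    """Keep a file only if it is new and its path date lies in [start, end]."""
    if path in already_processed:          # the role of the file cache
        return False
    match = DATE_IN_PATH.search(path)
    if match is None:                      # no date in the path: skip the file
        return False
    file_date = date(*map(int, match.groups()))
    return start <= file_date <= end       # the role of the date filter


def select_inputs(paths, today, days_back, already_processed):
    """Select files dated from `days_back` days ago up until today, once each."""
    start = today - timedelta(days=days_back)
    return [p for p in paths if accept(p, start, today, already_processed)]


if __name__ == "__main__":
    paths = ["/logs/2012/05/10/a.log", "/logs/2012/05/12/a.log",
             "/logs/2012/05/13/b.log", "/logs/2012/05/14/c.log"]
    done = {"/logs/2012/05/13/b.log"}
    print(select_inputs(paths, date(2012, 5, 14), 3, done))
```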
Minimizing Import Latency


  §    Minimize map-task overhead – stitch input splits
  §    400000 files -> 400000 map tasks and a slow reduce-copy
        phase
  §    o.a.h.m.i.CombineFileInputFormat – make 2GB
        splits
  §    c.a.s.a.m.i.FixedMappersTableInputFormat –
        stitches multiple HBase regions in the same
        map task
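To illustrate what stitching input splits buys, the sketch below greedily packs small files into splits of at most 2GB, so the number of map tasks follows the data volume rather than the file count. It is a simplified Python model, not the logic of Hadoop's CombineFileInputFormat (which also considers data locality); `combine_splits` is a hypothetical helper.

```python
GIB = 1024 ** 3


def combine_splits(file_sizes, max_split_bytes=2 * GIB):
    """Greedily pack (name, size) pairs into splits no larger than the limit.

    A single file larger than the limit still gets a split of its own.
    """
    splits, current, current_size = [], [], 0
    for name, size in file_sizes:
        if current and current_size + size > max_split_bytes:
            splits.append(current)          # close the split that is full
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        splits.append(current)              # flush the last partial split
    return splits


if __name__ == "__main__":
    files = [(f"part-{i:05d}", 512 * 1024 ** 2) for i in range(10)]
    splits = combine_splits(files)
    print(len(files), "files ->", len(splits), "map tasks")
```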



Minimizing Import Latency


  §    If warehousing in HBase, use
        o.a.h.h.m.HFileOutputFormat
  §    ~ 100 times faster than using the API
  §    No shuffle step! You must use a global order partitioner
  §    Problem: data grows over time
  §    Solution: estimate output partitions based on input data
        size, and make partitions (regions) using this heuristic
  §    c.a.s.a.m.FileSizeDatePartitioner – inject input files
        size and dates and rebalance regions based on these,
        and a fixed size (2GB)
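The sizing heuristic can be sketched roughly as below. Everything here is an assumption layered on the slide's description: that output volume is estimated as equal to input volume, that boundaries are picked evenly from a sorted sample of keys, and all three function names. The actual FileSizeDatePartitioner also uses file dates, which this sketch ignores.

```python
from bisect import bisect_right

REGION_BYTES = 2 * 1024 ** 3   # fixed target region size (2GB) from the slide


def estimate_regions(input_bytes, region_bytes=REGION_BYTES):
    """Estimate how many regions the output needs (assumes output ~ input size)."""
    return max(1, (input_bytes + region_bytes - 1) // region_bytes)


def split_points(sorted_sample_keys, num_regions):
    """Pick num_regions - 1 boundary keys spread evenly over a sorted key sample."""
    n = len(sorted_sample_keys)
    return [sorted_sample_keys[(i * n) // num_regions] for i in range(1, num_regions)]


def region_for(key, boundaries):
    """Global-order partitioning: region i only holds keys below region i + 1's."""
    return bisect_right(boundaries, key)


if __name__ == "__main__":
    regions = estimate_regions(5 * 1024 ** 3)
    bounds = split_points(list("abcdefgh"), regions)
    print(regions, bounds, [region_for(k, bounds) for k in "adz"])
```

Because every key in one partition sorts before every key in the next, each partition can be written out as an already-ordered file for its region.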


Minimizing Processing Latency


  §    Processing involves reading the input (files, tables,
        events), pre-aggregating it (reducing cardinality) and
        generating tables that can be queried in real-time
  §    Processing does GROUP BY, COUNT/SUM/AVG, ORDER
        BY
  §    Minimize each M/R step: read, map, partition, combine,
        copy, sort, reduce, write
  §    Read
        §    Filter input data (incremental processing) – differentiate
              between OPEN and CLOSED data
        §    HBase Scan options: caching, batching, etc
        §    Ensure HBase table regions are distributed in the cluster
Minimizing Processing Latency


  §    c.a.s.a.m.j.SuperProcessor
        §    One shot M/R job: for all data, for all reports, emit the
              pre-aggregated values in 1 map() call
        §    no allocations
        §    Simple and tight
        §    no system calls (avoid context switches)
        §    no String <> byte[] transformations
        §    minimize Map > Combine > Reduce I/O
        §    NO ALLOCATIONS



Minimizing Query Latency


  §    c.a.s.a.m.t.ReportHandler
        §    Simple Thrift server
  §    Data is already processed and pre-aggregated
  §    Query time does HAVING/WHERE (filters), extra
        GROUP BY (roll-ups)
  §    Calculate an optimal set of HBase scan()s
        §    single / multiple scans
        §    start / stop rows (prefixes, index positions)
  §    Perform extra roll-ups / sorting
  §    Assorted sundries: paging, display-time ser/des, etc.
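As a rough illustration of the query path, the sketch below turns a filter into a start/stop row pair derived from a key prefix, then performs an extra roll-up over the scanned rows. It runs against an in-memory sorted list rather than HBase, and the helper names are hypothetical; the row keys and values follow the statistics table shown earlier in the deck.

```python
from bisect import bisect_left
from collections import defaultdict

# Pre-aggregated statistics rows, sorted by row key as HBase would keep them.
STATS = sorted({
    "daily_sales/2012+05+12+USA": 13,
    "daily_sales/2012+05+13+CAN": 0,
    "daily_sales/2012+05+13+USA": 25,
    "daily_sales/2012+05+14+USA": 15,
    "visits_by_city/CAN+ON": 1,
    "visits_by_city/USA+CH": 1,
    "visits_by_city/USA+NY": 2,
    "visits_by_city/USA+SF": 1,
}.items())
KEYS = [key for key, _ in STATS]


def prefix_scan(prefix):
    """Scan [prefix, stop), where stop is the prefix with its last char bumped.

    Assumes a non-empty prefix.
    """
    stop = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    return STATS[bisect_left(KEYS, prefix):bisect_left(KEYS, stop)]


def visits_in_country(country):
    """Filter: one scan over the rows of a single country."""
    return prefix_scan(f"visits_by_city/{country}+")


def visits_by_country():
    """Roll-up: sum city-level rows up to country level at query time."""
    totals = defaultdict(int)
    for key, visits in prefix_scan("visits_by_city/"):
        country = key.split("/", 1)[1].split("+")[0]
        totals[country] += visits
    return dict(totals)


if __name__ == "__main__":
    print(visits_in_country("USA"))
    print(visits_by_country())
```

The filter never touches the daily_sales rows or the Canadian rows, and the roll-up only aggregates the handful of pre-aggregated rows instead of the raw input.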

Flexible


  §    Report configuration – the core of the system
  §    c.a.s.a.e.Dimension, c.a.s.a.e.Metric
        §    Can override ser/des, aggregate functions (for metrics)
        §    Can override behavior (only add 1 if X...)
        §    Emergent patterns are rolled-up in the reporting core
  §    The entire processing loop can be written outside of
        M/R for realtime
        §    Storm ?
  §    Applied in 4 use-cases right now, easy to extend
  §    Some programming required
Thank you


                                 adragomi@adobe.com / @adragomir
                                          http://hstack.org


       Our team: Adrian Muraru, Andrei Dulvac, Bogdan Dragu,
     Bogdan Drutu, Cosmin Lehene, Raluca Podiuc, Tudor Scurtu


Contenu connexe

Tendances

Apache Kylin – Cubes on Hadoop
Apache Kylin – Cubes on HadoopApache Kylin – Cubes on Hadoop
Apache Kylin – Cubes on HadoopDataWorks Summit
 
Cassandra Introduction & Features
Cassandra Introduction & FeaturesCassandra Introduction & Features
Cassandra Introduction & FeaturesDataStax Academy
 
Hadoop hive presentation
Hadoop hive presentationHadoop hive presentation
Hadoop hive presentationArvind Kumar
 
Millions of Regions in HBase: Size Matters
Millions of Regions in HBase: Size MattersMillions of Regions in HBase: Size Matters
Millions of Regions in HBase: Size MattersDataWorks Summit
 
Hadoop and Enterprise Data Warehouse
Hadoop and Enterprise Data WarehouseHadoop and Enterprise Data Warehouse
Hadoop and Enterprise Data WarehouseDataWorks Summit
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveDataWorks Summit
 
Scaling Hadoop at LinkedIn
Scaling Hadoop at LinkedInScaling Hadoop at LinkedIn
Scaling Hadoop at LinkedInDataWorks Summit
 
HBaseCon 2012 | Lessons learned from OpenTSDB - Benoit Sigoure, StumbleUpon
HBaseCon 2012 | Lessons learned from OpenTSDB - Benoit Sigoure, StumbleUponHBaseCon 2012 | Lessons learned from OpenTSDB - Benoit Sigoure, StumbleUpon
HBaseCon 2012 | Lessons learned from OpenTSDB - Benoit Sigoure, StumbleUponCloudera, Inc.
 
Kafka Streams: What it is, and how to use it?
Kafka Streams: What it is, and how to use it?Kafka Streams: What it is, and how to use it?
Kafka Streams: What it is, and how to use it?confluent
 
Scylla Summit 2022: How to Migrate a Counter Table for 68 Billion Records
Scylla Summit 2022: How to Migrate a Counter Table for 68 Billion RecordsScylla Summit 2022: How to Migrate a Counter Table for 68 Billion Records
Scylla Summit 2022: How to Migrate a Counter Table for 68 Billion RecordsScyllaDB
 
ClickHouse tips and tricks. Webinar slides. By Robert Hodges, Altinity CEO
ClickHouse tips and tricks. Webinar slides. By Robert Hodges, Altinity CEOClickHouse tips and tricks. Webinar slides. By Robert Hodges, Altinity CEO
ClickHouse tips and tricks. Webinar slides. By Robert Hodges, Altinity CEOAltinity Ltd
 
Introduction to memcached
Introduction to memcachedIntroduction to memcached
Introduction to memcachedJurriaan Persyn
 
The Patterns of Distributed Logging and Containers
The Patterns of Distributed Logging and ContainersThe Patterns of Distributed Logging and Containers
The Patterns of Distributed Logging and ContainersSATOSHI TAGOMORI
 
Kafka replication apachecon_2013
Kafka replication apachecon_2013Kafka replication apachecon_2013
Kafka replication apachecon_2013Jun Rao
 
Stream processing using Kafka
Stream processing using KafkaStream processing using Kafka
Stream processing using KafkaKnoldus Inc.
 

Tendances (20)

Apache Kylin – Cubes on Hadoop
Apache Kylin – Cubes on HadoopApache Kylin – Cubes on Hadoop
Apache Kylin – Cubes on Hadoop
 
Cassandra Introduction & Features
Cassandra Introduction & FeaturesCassandra Introduction & Features
Cassandra Introduction & Features
 
Hadoop hive presentation
Hadoop hive presentationHadoop hive presentation
Hadoop hive presentation
 
Millions of Regions in HBase: Size Matters
Millions of Regions in HBase: Size MattersMillions of Regions in HBase: Size Matters
Millions of Regions in HBase: Size Matters
 
Apache Flink Deep Dive
Apache Flink Deep DiveApache Flink Deep Dive
Apache Flink Deep Dive
 
Hadoop and Enterprise Data Warehouse
Hadoop and Enterprise Data WarehouseHadoop and Enterprise Data Warehouse
Hadoop and Enterprise Data Warehouse
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep Dive
 
HBase Storage Internals
HBase Storage InternalsHBase Storage Internals
HBase Storage Internals
 
Scaling Hadoop at LinkedIn
Scaling Hadoop at LinkedInScaling Hadoop at LinkedIn
Scaling Hadoop at LinkedIn
 
HBaseCon 2012 | Lessons learned from OpenTSDB - Benoit Sigoure, StumbleUpon
HBaseCon 2012 | Lessons learned from OpenTSDB - Benoit Sigoure, StumbleUponHBaseCon 2012 | Lessons learned from OpenTSDB - Benoit Sigoure, StumbleUpon
HBaseCon 2012 | Lessons learned from OpenTSDB - Benoit Sigoure, StumbleUpon
 
Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
 
Kafka Streams: What it is, and how to use it?
Kafka Streams: What it is, and how to use it?Kafka Streams: What it is, and how to use it?
Kafka Streams: What it is, and how to use it?
 
Scylla Summit 2022: How to Migrate a Counter Table for 68 Billion Records
Scylla Summit 2022: How to Migrate a Counter Table for 68 Billion RecordsScylla Summit 2022: How to Migrate a Counter Table for 68 Billion Records
Scylla Summit 2022: How to Migrate a Counter Table for 68 Billion Records
 
ClickHouse tips and tricks. Webinar slides. By Robert Hodges, Altinity CEO
ClickHouse tips and tricks. Webinar slides. By Robert Hodges, Altinity CEOClickHouse tips and tricks. Webinar slides. By Robert Hodges, Altinity CEO
ClickHouse tips and tricks. Webinar slides. By Robert Hodges, Altinity CEO
 
Introduction to memcached
Introduction to memcachedIntroduction to memcached
Introduction to memcached
 
Kafka 101
Kafka 101Kafka 101
Kafka 101
 
The Patterns of Distributed Logging and Containers
The Patterns of Distributed Logging and ContainersThe Patterns of Distributed Logging and Containers
The Patterns of Distributed Logging and Containers
 
Kafka replication apachecon_2013
Kafka replication apachecon_2013Kafka replication apachecon_2013
Kafka replication apachecon_2013
 
Stream processing using Kafka
Stream processing using KafkaStream processing using Kafka
Stream processing using Kafka
 
The PostgreSQL Query Planner
The PostgreSQL Query PlannerThe PostgreSQL Query Planner
The PostgreSQL Query Planner
 

Low Latency OLAP with Hadoop and HBase

  • 1. Low-Latency “OLAP” with Hadoop and HBase Andrei Dragomir | Software Engineer © 2012 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential.
  • 2. Synopsis
      §  What are we trying to solve
      §  Description of our system
      §  How it works
      §  Minimizing latency
  • 3. In a nutshell
      Low-latency OLAP system.
      Hadoop DFS stores the input data (e.g. log files or HBase tables).
      The processing loop of the system takes a cube description and processes it (pre-aggregations) using Hadoop Map/Reduce.
      The output is written to a statistics HBase table.
      To get the data, users query a server, which scans the HBase table, applies the filters, roll-ups or drill-downs, and returns the result.
  • 9–12. Vocabulary
      Date        Country  City     OS     Browser  Sales
      2012-05-12  USA      NY       Win    FF       $ 0.0
      2012-05-12  USA      NY       Win    FF       $ 10.0
      2012-05-13  USA      SF       OSX    Chrome   $ 25.0
      2012-05-13  Canada   Ontario  Linux  Chrome   $ 0.0
      2012-05-14  USA      Chicago  OSX    Safari   $ 15.0
      ...         ...      ...      ...    ...      ...
      Summary per column:
          Date:     5 visits, 3 days
          Country:  2 countries (USA: 4, Canada: 1)
          City:     4 cities (NY: 2, SF: 1)
          OS:       3 OS (Win: 2, OSX: 2)
          Browser:  3 browsers (FF: 2, Chrome: 2)
          Sales:    $50.0, 3 sales
      §  We want to get (mostly) numeric data: metrics
      §  These metrics have a set of labels (dimensions)
      §  We want to view the metrics by any combination of dimensions
  • 13. OLAP Queries
      §  Rolling up to country level
          SELECT COUNT(visits), SUM(sales)
          GROUP BY country
              Country  visits  sales
              USA      4       $50
              Canada   1       0
      §  “Slicing” by browser
          SELECT COUNT(visits), SUM(sales)
          WHERE browser = “FF”
          GROUP BY country
              Country  visits  sales
              USA      2       $10
              Canada   0       0
      §  Top browsers by sales
          SELECT SUM(sales), COUNT(visits)
          GROUP BY browser
          ORDER BY sales DESC
              Browser  sales  visits
              Chrome   $25    2
              Safari   $15    1
              FF       $10    2
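The three queries can be reproduced in memory on the five sample rows from the Vocabulary slide. This is only an illustrative sketch in plain Python, not the system's implementation; the `aggregate` helper and its parameters are hypothetical names, and each input row is assumed to count as one visit.

```python
from collections import defaultdict

COLUMNS = ("date", "country", "city", "os", "browser", "sales")

# Sample rows copied from the Vocabulary slide.
ROWS = [
    ("2012-05-12", "USA", "NY", "Win", "FF", 0.0),
    ("2012-05-12", "USA", "NY", "Win", "FF", 10.0),
    ("2012-05-13", "USA", "SF", "OSX", "Chrome", 25.0),
    ("2012-05-13", "Canada", "Ontario", "Linux", "Chrome", 0.0),
    ("2012-05-14", "USA", "Chicago", "OSX", "Safari", 15.0),
]


def aggregate(rows, group_by, where=None):
    """Return {group value: (visits, sales)}; every row is one visit (assumption)."""
    totals = defaultdict(lambda: [0, 0.0])
    for values in rows:
        row = dict(zip(COLUMNS, values))
        if where is not None and not where(row):
            continue  # the filter runs before grouping, like SQL WHERE
        bucket = totals[row[group_by]]
        bucket[0] += 1
        bucket[1] += row["sales"]
    return {key: (visits, sales) for key, (visits, sales) in totals.items()}


# Roll-up to country level.
by_country = aggregate(ROWS, "country")

# "Slicing" by browser: only Firefox rows, still grouped by country.
firefox_by_country = aggregate(ROWS, "country", where=lambda r: r["browser"] == "FF")

# Top browsers by sales, highest first.
top_browsers = sorted(
    aggregate(ROWS, "browser").items(), key=lambda kv: kv[1][1], reverse=True
)
```

Unlike the slide's result table, this sketch omits groups with no matching rows, so Canada does not appear in `firefox_by_country`.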
  • 14. Looking inside – physical diagram
  • 15. Looking inside – logical diagram
  • 16. Simplifying assumptions: pre-aggregation
      §  In most cases...
          §  Data needs to be summarized – hard to draw 1B data points
          §  You don’t need to look at all dimensions at the same time – hard to correlate
          §  Not all queries are used with the same frequency
  • 17. A timeless CS problem: Optimize...
      Time                           Space
      §  Pre-aggregation             §  Runtime aggregation
      §  Fast                        §  Flexible
      §  Efficient reads – O(1)      §  I/O, CPU intensive
      §  Inflexible                  §  Slow – always need to look at all the data
      §  Processing latency          §  Low throughput
      §  Combinatorial explosion
  • 18. Solution ?
      §  Just do both!
      §  Can tune: pre-aggregate more, or rely on runtime aggregation
      §  Ingestion + process speed vs. query speed
      §  Works just like normal queries + materialized views
  • 19. Solution ?
      §  Process: pre-aggregate all the report definitions, create an indexed HBase table.
      §  Query: use the indexes to get the data fast. Perform extra aggregation and filtering at runtime if needed.
      §  Platform strengths
          §  Parallelism in M/R
          §  Fast access and natural key ordering in HBase
  • 20. Minimal HBase details
      §  Data is stored in tables
      §  Each row has a key, and any number of columns (long & wide)
      §  Ordered by row keys: clustered indexes built-in
      §  Sparse tables. NULLs are free.
      Row Key   Columns...
      u1        v1    v2    v3
      u2        v     X     ...
      u3        v     x     ...
      u4        x     v2    ...
      u5        ...   v3    ...
      u6        ...   v5    ...
      u7        ...   ...   ...
      u8        ...   ...   ...
  • 21. Minimal HBase details
      §  Operations use row key: get(), put()
      §  Can scan a range of rows: [start, end)
      §  We can use the row key as a built-in indexing mechanism
      Row key   Column ...
      aaa       v1
      aab       v2
      aac       v3    ←
      aad       v4    ←
      aae       v5    ←
      aaf       v6    ←
      aba       ...
      abb       ...
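The row-key operations above can be illustrated with a toy in-memory table that keeps its keys sorted. This is a sketch of the idea only, not the HBase client API; `SortedTable` and its method names are hypothetical, and the scan is half-open, `[start, stop)`, as on the slide.

```python
import bisect


class SortedTable:
    """Toy stand-in for an HBase table: rows are kept ordered by row key."""

    def __init__(self):
        self._keys = []    # sorted list of row keys
        self._values = {}  # row key -> value

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)  # keep the key list sorted
        self._values[key] = value

    def get(self, key):
        return self._values.get(key)

    def scan(self, start, stop):
        """Return rows with start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        return [(key, self._values[key]) for key in self._keys[lo:hi]]


table = SortedTable()
for key, value in [("aba", "v7"), ("aaa", "v1"), ("aad", "v4"), ("aab", "v2"),
                   ("aaf", "v6"), ("aac", "v3"), ("aae", "v5"), ("abb", "v8")]:
    table.put(key, value)

# Scanning [aac, aba) touches only the contiguous run of rows aac..aaf.
scanned = table.scan("aac", "aba")
```

Because the keys are ordered, a range scan needs only two binary searches and one contiguous read, which is why the row key can serve as an index.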
  • 22. SaasBase vs. SQL Views Comparison
  • 23. Reports configuration
      §  List of Dimensions (with custom classes, arguments, etc.)
      §  List of Metrics (with custom classes, arguments, etc.)
      §  List of Reports, each containing
          §  Dimensions (subset)
          §  Metrics (subset)
          §  Sorting, etc.
      §  The reports configuration is used in the entire system: import, process, query
  • 24–26. Solution ?
      Input data:
          Date        Country  City  Sales
          2012-05-12  USA      NY    3
          2012-05-12  USA      NY    10
          2012-05-13  USA      SF    25
          2012-05-13  CAN      ON    0
          2012-05-14  USA      CH    15
      Report definitions:
          visits_by_city: {
              dimensions: [country, city],
              metrics: [visits]
          },
          daily_sales: {
              dimensions: [year, month, day, country],
              metrics: [sales]
          }
      Statistics HBase output table:
          ROWKEY                        VALUE
          daily_sales/2012+05+12+USA    $13
          daily_sales/2012+05+13+CAN    $0
          daily_sales/2012+05+13+USA    $25
          daily_sales/2012+05+14+USA    $15
          visits_by_city/CAN+ON         1
          visits_by_city/USA+CH         1
          visits_by_city/USA+NY         2
          visits_by_city/USA+SF         1
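The step from input rows and report definitions to the statistics table can be sketched as a single pass over the data. This is an in-memory illustration, not the actual Map/Reduce job; the `pre_aggregate` and `to_record` helpers are hypothetical, and each input row is assumed to count as one visit.

```python
from collections import defaultdict

# Report definitions as shown on the slide.
REPORTS = {
    "visits_by_city": {"dimensions": ["country", "city"], "metrics": ["visits"]},
    "daily_sales": {"dimensions": ["year", "month", "day", "country"],
                    "metrics": ["sales"]},
}

# Input rows as shown on the slide: (date, country, city, sales).
INPUT = [
    ("2012-05-12", "USA", "NY", 3),
    ("2012-05-12", "USA", "NY", 10),
    ("2012-05-13", "USA", "SF", 25),
    ("2012-05-13", "CAN", "ON", 0),
    ("2012-05-14", "USA", "CH", 15),
]


def to_record(date, country, city, sales):
    """Expand one input row into named dimensions and metrics."""
    year, month, day = date.split("-")
    return {"year": year, "month": month, "day": day, "country": country,
            "city": city, "sales": sales, "visits": 1}  # one row = one visit


def pre_aggregate(rows, reports):
    """Return {row key: {metric: total}}, ordered by row key like an HBase table."""
    table = defaultdict(lambda: defaultdict(int))
    for row in rows:
        record = to_record(*row)
        for name, report in reports.items():
            key = name + "/" + "+".join(record[d] for d in report["dimensions"])
            for metric in report["metrics"]:
                table[key][metric] += record[metric]
    return {key: dict(table[key]) for key in sorted(table)}


stats = pre_aggregate(INPUT, REPORTS)
```

Prefixing every row key with the report name keeps each report in its own contiguous key range, so one table can hold all reports.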
  • 27. HBase natural order: hierarchical filtering
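One way to read "hierarchical filtering" is that, with dimensions concatenated in the row key from most general to most specific, every key prefix selects one contiguous range of rows. The sketch below assumes that reading and uses a sorted list of byte strings in place of a table; `prefix_stop` and `prefix_scan` are hypothetical helper names.

```python
import bisect

# Row keys from the statistics table example, in sorted (HBase) order.
KEYS = sorted([
    b"daily_sales/2012+05+12+USA",
    b"daily_sales/2012+05+13+CAN",
    b"daily_sales/2012+05+13+USA",
    b"daily_sales/2012+05+14+USA",
    b"visits_by_city/CAN+ON",
    b"visits_by_city/USA+CH",
    b"visits_by_city/USA+NY",
    b"visits_by_city/USA+SF",
])


def prefix_stop(prefix: bytes) -> bytes:
    """Smallest key that sorts after every key starting with `prefix`."""
    trimmed = prefix.rstrip(b"\xff")  # trailing 0xff bytes cannot be incremented
    if not trimmed:
        raise ValueError("prefix has no upper bound")
    return trimmed[:-1] + bytes([trimmed[-1] + 1])


def prefix_scan(sorted_keys, prefix: bytes):
    """Return all keys starting with `prefix` using a [start, stop) range."""
    lo = bisect.bisect_left(sorted_keys, prefix)
    hi = bisect.bisect_left(sorted_keys, prefix_stop(prefix))
    return sorted_keys[lo:hi]


whole_month = prefix_scan(KEYS, b"daily_sales/2012+05")     # all days in May 2012
single_day = prefix_scan(KEYS, b"daily_sales/2012+05+13")   # one day, all countries
usa_cities = prefix_scan(KEYS, b"visits_by_city/USA")       # one country, all cities
```

A longer prefix narrows the range, which corresponds to drilling down one level in the dimension hierarchy.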
  • 28–29. Sorting
      §  Add the metrics that you want to sort by to the row key...
      §  In a way that preserves the ordering
      §  ORDER BY metric DESC == Long.MAX_VALUE – metric
          2012+05+USA+0000000000+
          2012+05+USA+4294961296+SF  =  1000 visits
          2012+05+USA+4294961396+NY  =  900 visits
          . . .
          2012+05+USA+9999999999+
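The descending-order trick can be sketched as follows. The slide subtracts the metric from `Long.MAX_VALUE`; this sketch instead assumes a fixed 10-digit, zero-padded field (matching the width of the slide's scan bounds), so the encoded values differ from those shown on the slide. `desc_key`, `WIDTH` and `MAX_VALUE` are hypothetical names.

```python
WIDTH = 10                  # assumption: 10-digit field, as in the slide's scan bounds
MAX_VALUE = 10**WIDTH - 1   # 9999999999; stands in for the slide's Long.MAX_VALUE


def desc_key(prefix: str, metric: int, suffix: str) -> str:
    """Build a row key whose ascending order is descending order of `metric`."""
    if not 0 <= metric <= MAX_VALUE:
        raise ValueError("metric does not fit in the fixed-width field")
    inverted = MAX_VALUE - metric
    # Zero padding keeps lexicographic order identical to numeric order.
    return f"{prefix}+{inverted:0{WIDTH}d}+{suffix}"


keys = sorted([
    desc_key("2012+05+USA", 900, "NY"),
    desc_key("2012+05+USA", 5, "LA"),
    desc_key("2012+05+USA", 1000, "SF"),
])
```

Scanning these keys in their natural order returns SF (1000 visits) first, then NY (900), then LA (5), so a "top N" query reads only the first N rows of the range.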
  • 30. Minimizing Latency
  • 31. Minimizing Import Latency
      §  Only import the minimal set of changes
      §  Map/Reduce input filters:
          §  c.a.s.a.i.FileCache – checks if a file has already been processed
          §  c.a.s.a.i.FileDateFilter – checks if a date in the file path falls within a specified interval
              §  process files from 3 days ago up until now, once
      §  HBase scan (from import table) start and stop row
      §  Minimize map-task overhead – stitch input splits
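The two input filters can be combined into one selection step, sketched below. This is not the code of `FileCache` or `FileDateFilter`; the function name, the `YYYY/MM/DD` path layout and the in-memory `processed` set are all assumptions made for illustration.

```python
import re
from datetime import date, timedelta

# Assumption: file paths embed their date as .../YYYY/MM/DD/...
DATE_IN_PATH = re.compile(r"(\d{4})/(\d{2})/(\d{2})")


def select_new_files(paths, processed, today, days_back=3):
    """Keep files dated within [today - days_back, today] that were not processed yet."""
    earliest = today - timedelta(days=days_back)
    selected = []
    for path in paths:
        if path in processed:
            continue  # already imported once; skip it
        match = DATE_IN_PATH.search(path)
        if match is None:
            continue  # no date in the path; cannot decide, so skip
        file_date = date(*map(int, match.groups()))
        if earliest <= file_date <= today:
            selected.append(path)
    return selected


paths = [
    "/logs/2012/05/09/a.log",   # older than 3 days
    "/logs/2012/05/12/b.log",   # already processed
    "/logs/2012/05/13/c.log",
    "/logs/2012/05/14/d.log",
]
processed = {"/logs/2012/05/12/b.log"}
new_files = select_new_files(paths, processed, today=date(2012, 5, 14))
```

After a successful import the caller would add the selected paths to `processed`, so each file inside the window is handled once.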
  • 32. Minimizing Import Latency
      §  Minimize map-task overhead – stitch input splits
          §  for 400000 files -> 400000 Map Tasks, slow reduce-copy phase
          §  o.a.h.m.i.CombineFileInputFormat – make 2GB splits
          §  c.a.s.a.m.i.FixedMappersTableInputFormat – stitches multiple HBase regions in the same map task
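The effect of stitching many small files into 2GB splits can be estimated with a simple greedy packing. This sketch is not how `CombineFileInputFormat` is implemented (it ignores data locality, for example); `combine_splits` is a hypothetical helper, and the 64 KB file size is an assumption chosen only to make the arithmetic concrete.

```python
def combine_splits(files, max_split_bytes=2 * 1024**3):
    """Greedily pack (name, size) pairs into splits of at most max_split_bytes."""
    splits, current, current_size = [], [], 0
    for name, size in files:
        # Close the current split when the next file would overflow it.
        if current and current_size + size > max_split_bytes:
            splits.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        splits.append(current)
    return splits


# Assumption: 400000 input files of 64 KB each.
files = [(f"part-{i:06d}", 64 * 1024) for i in range(400_000)]
splits = combine_splits(files)
```

Under these assumptions the 400000 one-file map tasks shrink to 13 map tasks, each reading up to 32768 files.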
  • 33. Minimizing Import Latency
      §  If warehousing in HBase, use o.a.h.h.m.HFileOutputFormat
          §  ~ 100 times faster than using the API
          §  No shuffle step! You must use a global order partitioner
      §  Problem: data grows over time
      §  Solution: estimate output partitions based on input data size, and make partitions (regions) using this heuristic
          §  c.a.s.a.m.FileSizeDatePartitioner – inject input file sizes and dates, and rebalance regions based on these and a fixed size (2GB)
  • 34. Minimizing Processing Latency
      §  Processing involves reading the input (files, tables, events), pre-aggregating it (reducing cardinality) and generating tables that can be queried in real time
      §  Processing does GROUP BY, COUNT/SUM/AVG, ORDER BY
      §  Minimize each M/R step: read, map, partition, combine, copy, sort, reduce, write
      §  Read
          §  Filter input data (incremental processing) – differentiate between OPEN and CLOSED data
          §  HBase Scan options: caching, batching, etc.
          §  Ensure HBase table regions are distributed in the cluster
  • 35. Minimizing Processing Latency
      §  c.a.s.a.m.j.SuperProcessor
          §  One-shot M/R job: for all data, for all reports, emit the pre-aggregated values in 1 map() call
          §  no allocations
          §  Simple and tight
          §  no system calls (avoid context switches)
          §  no String <> byte[] transformations
          §  minimize Map > Combine > Reduce I/O
          §  NO ALLOCATIONS
  • 36. Minimizing Query Latency
      §  c.a.s.a.m.t.ReportHandler
          §  Simple Thrift server
      §  Data is already processed and pre-aggregated
      §  Query time does HAVING/WHERE (filters), extra GROUP BY (roll-ups)
      §  Calculate an optimal set of HBase scan()s
          §  single / multiple scans
          §  start / stop rows (prefixes, index positions)
      §  Perform extra roll-ups / sorting
      §  Assorted sundries: paging, display-time ser/des, etc.
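The query-time work (filters plus extra roll-ups over already pre-aggregated rows) can be sketched on the daily_sales rows from the statistics table example. This is not the `ReportHandler` code; `roll_up` is a hypothetical helper, and a real server would narrow the read with scan start/stop rows instead of checking every key as this sketch does.

```python
from collections import defaultdict

# Pre-aggregated rows from the statistics table example (values in dollars).
STATS = {
    "daily_sales/2012+05+12+USA": 13,
    "daily_sales/2012+05+13+CAN": 0,
    "daily_sales/2012+05+13+USA": 25,
    "daily_sales/2012+05+14+USA": 15,
    "visits_by_city/USA+NY": 2,
}


def roll_up(table, report, dimensions, group_by, where=None):
    """Filter a report's rows and re-aggregate them over a subset of its dimensions."""
    start = report + "/"
    totals = defaultdict(int)
    for key in sorted(table):
        if not key.startswith(start):
            continue  # row belongs to another report
        labels = dict(zip(dimensions, key[len(start):].split("+")))
        if where is not None and not where(labels):
            continue  # query-time filter (WHERE)
        totals[tuple(labels[d] for d in group_by)] += table[key]
    return dict(totals)


DIMS = ["year", "month", "day", "country"]

# Extra GROUP BY: roll daily sales up to one total per country.
sales_by_country = roll_up(STATS, "daily_sales", DIMS, ["country"])

# Filter plus roll-up: sales per country on the 13th only.
sales_on_13th = roll_up(STATS, "daily_sales", DIMS, ["country"],
                        where=lambda labels: labels["day"] == "13")
```

The query touches four pre-aggregated rows rather than the raw input, which is where the latency saving comes from.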
  • 37. Flexible
      §  Report configuration – the core of the system
      §  c.a.s.a.e.Dimension, c.a.s.a.e.Metric
          §  Can override ser/des, aggregate functions (for metrics)
          §  Can override behavior (only add 1 if X...)
      §  Emergent patterns are rolled up in the reporting core
      §  The entire processing loop can be written outside of M/R for realtime
          §  Storm ?
      §  Applied in 4 use-cases right now, easy to extend
      §  Some programming required
  • 38. Thank you
      adragomi@adobe.com / @adragomir
      http://hstack.org
      Our team: Adrian Muraru, Andrei Dulvac, Bogdan Dragu, Bogdan Drutu, Cosmin Lehene, Raluca Podiuc, Tudor Scurtu
  • 40. Break! Break takes place in the Community Showcase (Hall 2). Sessions will resume at 3:35pm.