Framework for Big Data Discovery and Analytics

© 2013 42six Solutions, All Rights Reserved, www.42six.com
Hadoop MapReduce
• We can look across all our data to
answer questions!

Problem Statement:
Developers can write MapReduce code to analyze data, but don’t know what to look
for; the analysts know what to look for, but don’t know how to write code.

Technology is not the problem. The problem is enabling the analyst to leverage
the technology effectively and reuse it.

Typical Analyst Workflow:
• I have an entity I want to learn more about
• Everything is indexed by entities

• We can ask questions of Big Data, but they aren’t Big
Questions – we always start with an entity

We should be able to:
• Start from a pattern and see the entities that match it
• Ask complex questions of Big Data

Naïve Way:
Custom MapReduce job for each question

Amino Way:
Pre-compute features (micro-analytics), the building blocks of questions, and let
analysts mix them on the fly to ask complex questions.
The Amino index executes analysts’ complex questions as a real-time scan, which
means less competition for cluster resources and better scalability.
Scales to billions of entities and features.
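
A minimal sketch of the idea, not Amino's actual API: features are pre-computed once, and a question is just a combination of them. The feature names and entity ids below are invented for illustration, and plain Python sets stand in for the index.

```python
# Hypothetical pre-computed features: feature name -> set of entity ids.
# In Amino these would be produced by MapReduce jobs; here they are hard-coded.
features = {
    "followers_10_to_200": {"alice", "bob", "stevetouw"},
    "tweets_per_day_0_to_6": {"bob", "stevetouw", "carol"},
    "handle_starts_with_s": {"stevetouw", "sam"},
}


def ask(*feature_names):
    """Return the entities that fall in every named feature (a logical AND)."""
    sets = [features[name] for name in feature_names]
    return set.intersection(*sets)


print(sorted(ask("followers_10_to_200", "handle_starts_with_s")))
```

No new job is written per question; the analyst only chooses which pre-computed building blocks to combine.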

Live Demo
What could go wrong?…

Amino Framework
Feature Creation API
• Abstracts the complexities of MapReduce
• Focus on logic of the feature/micro-analytic
• Write-once DataLoader for each data source
• Simple and powerful data joins

Amino Index
• AminoOutputFormat
• Bulk Ingest into Accumulo
Query API
• Iterators

Workflow

Benefits
• Data Agnostic
• Not a black box
• Fully scalable
• Crowd source micro-analytics
• Inherent cross-datasource linked indexes
• Encourages sharing of knowledge, discovery
• Index built to support machine learning
• Security considered up front – index is in Accumulo
• Built on open source, for open source

Feature Creation

• Can join multiple datasets
• Keys are established in the DataLoader

Any external job can output this format
and it will be indexed properly during
indexing jobs

Notice there’s no key – that’s on
purpose!
Index Goals

Now that all our features are indexed, let’s let the analysts start building!

• Fast scans
• Highly dimensional scans
• Data compression
• Simple query structure

Accumulo Index 1: More Dimensions than Entities
Row | CF | CQ | Value
Shard Number:Data Source:Bucket Name | Bucket Value | Hash Salt | Compressed Bitmap

Example:
Row | CF | CQ | Value
2:Twitter:handle | stevetouw | 0 | 010011010010011

JavaEWAH is a word-aligned compressed variant of the Java BitSet class. It does
not achieve the best compression, but rather improves query processing time.

Indexes in the bit vector represent the features that the entity falls in –
a feature vector
At Query Time…
Bloom filter based on the lexicographical first and last entity of each
dimension of the query:

Dimension | First | Last
Number of followers: 10 - 200 | aachimba | zzrka
Number of tweets per day: 0 - 6 | aaabbb | zyrbb
Handle starts with letter: S | saarba | szaban (smallest range)

Dimensions map to a query bit vector:
000001001111000101000011100101010011100101
Note: for ratio features, there is an index for every possible value in the range.
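
As a rough sketch of the range-selection step, under stated assumptions: the slides do not say how the "smallest" range is measured, so this example counts row keys in each dimension's [first, last] range over a small, made-up sorted list of handles and picks the minimum. `handles`, `range_size`, and `smallest_range` are hypothetical names.

```python
from bisect import bisect_left, bisect_right

# Hypothetical sorted row keys (handles); a real table would hold far more.
handles = sorted([
    "aaabbb", "aachimba", "bob", "saarba", "saarra",
    "stevetouw", "szaban", "zyrbb", "zzrka",
])

# Lexicographic first/last entity per query dimension, as listed on the slide.
dimension_ranges = {
    "Number of followers: 10 - 200": ("aachimba", "zzrka"),
    "Number of tweets per day: 0 - 6": ("aaabbb", "zyrbb"),
    "Handle starts with letter: S": ("saarba", "szaban"),
}


def range_size(first, last):
    """Count the row keys that fall in the inclusive range [first, last]."""
    return bisect_right(handles, last) - bisect_left(handles, first)


def smallest_range(ranges):
    """Pick the dimension whose range covers the fewest row keys."""
    return min(ranges, key=lambda name: range_size(*ranges[name]))


best = smallest_range(dimension_ranges)
print(best, dimension_ranges[best])
```

Only the chosen range then needs to be scanned; the other dimensions are checked against each row's bit vector.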
Accumulo Iterator Time!!

Row | CF | CQ | Value
2:Twitter:handle | saarba | 0 | 00101011001110
2:Twitter:handle | saarra | 0 | 00101111010100
2:Twitter:handle | stevetouw | 0 | 01111100001100
2:Twitter:handle | szaban | 0 | 00110011001111

Push our query bit vector through the range found in
the previous step

If the result of the bitwise operation contains an index at each
dimension, we have a match!
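
The match test can be sketched as follows. This is an illustration only, not the Amino iterator: Python integers stand in for compressed bitmaps, and the bit positions assigned to each dimension and to each handle are invented (the slide's bit strings are not reused because their feature layout is not given).

```python
def bits(*positions):
    """Build an integer bit vector with the given bit positions set."""
    vector = 0
    for p in positions:
        vector |= 1 << p
    return vector


# Hypothetical layout: each query dimension owns one bit per acceptable value.
dimension_masks = {
    "followers_10_to_200": bits(1, 2, 3),
    "tweets_per_day_0_to_6": bits(4, 5, 6),
    "handle_starts_with_s": bits(11),
}

# The query bit vector is the union of all dimension masks.
query_vector = 0
for mask in dimension_masks.values():
    query_vector |= mask


def matches(entity_vector):
    """An entity matches if the AND result has a set bit in every dimension."""
    hits = entity_vector & query_vector
    return all(hits & mask for mask in dimension_masks.values())


# Hypothetical rows in the scanned range: (handle, feature vector).
rows = [
    ("saarba", bits(2, 5, 11)),     # hits all three dimensions
    ("saarra", bits(2, 11)),        # no tweets-per-day feature in range
    ("stevetouw", bits(3, 4, 11)),  # hits all three dimensions
    ("szaban", bits(8, 9, 11)),     # only the handle dimension
]


def scan(rows):
    """Return the handles whose feature vectors satisfy the query."""
    return [handle for handle, vector in rows if matches(vector)]


print(scan(rows))
```

Because a row must hit every dimension, this form of the check expresses an AND across dimensions.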

What is the Salt For?
Row | CF | CQ | Value
Shard Number:Data Source:Bucket Name | Bucket Value | Hash Salt | Compressed Bitmap

Example:
Row | CF | CQ | Value
2:Twitter:handle | stevetouw | 0 | 0100110100100101

Collisions are possible (using a 32-bit vector). The salt is used to hash the
feature indexes, so you need as many matches in the previous step as you have
salts.
We have used 3 salts with 15 billion records and have had no collisions.
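
A hedged sketch of why several salts help. It assumes that "32 bit vector" means features are hashed to 32-bit positions, uses `zlib.crc32` as a stand-in hash (the slides do not name the hash function), and uses Python sets in place of compressed bitmaps. The feature strings are invented.

```python
import zlib

NUM_SALTS = 3  # the slide reports using 3 salts


def position(feature, salt):
    """Hash a feature to a 32-bit position; the salt varies the hash."""
    return zlib.crc32(f"{salt}:{feature}".encode("utf-8"))


def index_entity(features):
    """Build one set of hashed positions per salt (one 'row' per salt)."""
    return [{position(f, salt) for f in features} for salt in range(NUM_SALTS)]


def matches(entity_index, query_features):
    """True only if every queried feature is present under every salt."""
    return all(
        all(position(f, salt) in entity_index[salt] for f in query_features)
        for salt in range(NUM_SALTS)
    )


entity = index_entity(["followers:150", "tweets_per_day:3", "handle_starts_with:s"])
print(matches(entity, ["followers:150", "handle_starts_with:s"]))
```

A false positive would need the same unrelated feature to collide under every salt at once, which is far less likely than a collision under a single hash.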

Benefits of this Index

• Tables are small, bit vector compression is good, only one row per
entity
• Works great if you have more dimensions than entities, or if the ranges in
your dimensions make good Bloom filters (like “handle starts with
letter …”)
• No matter how many dimensions, the query will always be as fast
as the smallest range
• All processing/boolean logic occurs on the nodes (thanks iterators),
fully scalable
• Represents a feature vector for your entities – great for machine
learning

Accumulo Index 2: More Entities than Dimensions

Row | CF | CQ | Value
shard:salt | Data Source#Bucket Name#FeatureId | Feature Value | Compressed Bitmap

Example:
Row | CF | CQ | Value
2:0 | Twitter#handle#123456 | s | 0100110100101001


123456 could map to feature “Handle starts with letter”
Indexes in the bit vector represent the entities that fall in that
feature
So handle stevetouw could map to index 73 (for salt 0)

That Same Query Again…
Number of followers: 10 – 200 (feature id: 444411)
Number of tweets per day: 0 – 6 (feature id: 555522)
Handle starts with letter: S (feature id: 123456)
Row | CF | CQ | Value

OR across the values of feature 444411:
2:0 | Twitter#handle#444411 | 10 | 0010111011100
2:0 | Twitter#handle#444411 | 11 | 0101010101101
……
2:0 | Twitter#handle#444411 | 200 | 0000001011000

AND

OR across the values of feature 555522:
2:0 | Twitter#handle#555522 | 0 | 1111110001101
2:0 | Twitter#handle#555522 | 1 | 1010100000100
……
2:0 | Twitter#handle#555522 | 6 | 1111001010000

AND

2:0 | Twitter#handle#123456 | S | 1111110001101

Magic iterator that handles all the boolean logic
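
The boolean logic can be sketched with plain integers standing in for compressed bitmaps. This is an illustration, not the Amino iterator: it uses only the bitmap rows shown on the slide (the rows elided by "……" are left out), ORs the values within each feature, then ANDs across features. The helper names are hypothetical.

```python
WIDTH = 13  # length of the example bitmaps shown on the slide


def or_all(bitmaps):
    """OR together the bitmaps for the accepted values of one feature."""
    result = 0
    for bitmap in bitmaps:
        result |= bitmap
    return result


def and_all(bitmaps):
    """AND together the per-feature results."""
    result = (1 << WIDTH) - 1
    for bitmap in bitmaps:
        result &= bitmap
    return result


# CF -> bitmaps of the rows shown on the slide for shard:salt 2:0.
query = {
    "Twitter#handle#444411": ["0010111011100", "0101010101101", "0000001011000"],
    "Twitter#handle#555522": ["1111110001101", "1010100000100", "1111001010000"],
    "Twitter#handle#123456": ["1111110001101"],
}


def evaluate(query):
    """Return the bitmap of entities that satisfy every feature in the query."""
    per_feature = [
        or_all(int(text, 2) for text in rows) for rows in query.values()
    ]
    return format(and_all(per_feature), f"0{WIDTH}b")


print(evaluate(query))  # 0111110001101
```

Each set bit in the result is an entity index for this shard and salt; the indexes still have to be converted back to entities.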
More Details
The same entity is guaranteed to always land in the same shard:salt no
matter the feature

We are left with a set of indexes for each salt, now what?
Convert Indexes to Entities
Row | CF | CQ | Value
shard | Index Position#Data Source#Bucket Name#Salt | Bucket Value |

Example:
Row | CF | CQ | Value
2 | 73#Twitter#handle#0 | stevetouw |

The iterator scans the rows using a CF filter with the desired indexes.
The iterator ensures it gets the same CQ “# of salts” times before it sends
that CQ back as a result.
Again, use the power of iterators and push code to the data rather than
doing the salt set operation in the web tier.
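
A small sketch of the salt check during the reverse lookup, with a dictionary standing in for the Accumulo table. Only the entry mapping index 73 under salt 0 to stevetouw comes from the slide; every other entry, and the names `lookup` and `resolve`, are invented for illustration.

```python
from collections import Counter

NUM_SALTS = 3  # assumed number of salts

# Hypothetical reverse-lookup rows for one shard:
# (index position, data source, bucket name, salt) -> bucket value.
lookup = {
    (73, "Twitter", "handle", 0): "stevetouw",
    (12, "Twitter", "handle", 1): "stevetouw",
    (40, "Twitter", "handle", 2): "stevetouw",
    (73, "Twitter", "handle", 1): "saarba",  # matches under one salt only
}


def resolve(indexes_per_salt):
    """Return entities whose index matched under every salt.

    indexes_per_salt maps salt -> set of index positions left by the query.
    """
    seen = Counter()
    for (index, _source, _bucket, salt), entity in lookup.items():
        if index in indexes_per_salt.get(salt, set()):
            seen[entity] += 1
    return sorted(entity for entity, count in seen.items() if count == NUM_SALTS)


print(resolve({0: {73}, 1: {12, 73}, 2: {40}}))
```

An entity that shows up for only some of the salts is treated as a hash collision and dropped.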

Benefits of this Index

• Tables are small, bit vector compression is good
• Works great if you have more entities than you have dimensions
(most likely scenario)

• Affords the ability to do full boolean logic in-iterator, rather than just
ANDs as in the previous index
• All processing/boolean logic occurs on the nodes (thanks iterators),
fully scalable

Conclusion

• Amino helps non-technical folk leverage MapReduce cleanly and without
hogging cluster resources
• Accumulo iterators are the reason for the index performance

• Amino is all about sharing and reuse: crowdsource the building blocks and
save analysts’ hypotheses. The more people touch Amino, the smarter
it becomes
• Open source (documentation needs help): https://github.com/aminocloud/amino

Questions?
Steve Touw, steve@42six.com
Barrett Stabile, bstabile@42six.com
Joe Bruner, jbruner@42six.com
Sapan Shah, sshah@42six.com

Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 

The Amino Analytical Framework - Leveraging Accumulo to the Fullest

  • 1. Framework for Big Data Discovery and Analytics © 2013 42six Solutions, All Rights Reserved, www.42six.com
  • 2. Hadoop MapReduce
    • We can look across all our data to answer questions!
    Problem Statement: Developers can write MapReduce code to analyze data, but don’t know what to look for; the analysts know what to look for, but don’t know how to write code.
    Technology is not the problem. The problem is enabling the analyst to effectively leverage technology and reuse it.
  • 3. Typical Analyst Workflow:
    • I have an entity I want to learn more about
    • Everything is indexed by entities
    • We can ask questions of Big Data, but they aren’t Big Questions – we always start with an entity
    We should be able to:
    • Have a pattern and see entities that match that pattern
    • Ask complex questions of Big Data
  • 4. Naïve Way: Custom MapReduce job for each question
    Amino Way: Pre-compute features (micro-analytics), the building blocks of questions, and let analysts mix those on the fly to ask complex questions.
    The Amino index executes analysts’ complex questions as a real-time scan, with less competition for resources and better scalability.
    Scales to billions of entities and features.
  • 5. Live Demo: What could go wrong?…
  • 6. Amino Framework
    Feature Creation API
    • Abstracts the complexities of MapReduce
    • Focus on logic of the feature/micro-analytic
    • Write-once DataLoader for each data source
    • Simple and powerful data joins
    Amino Index
    • AminoOutputFormat
    • Bulk Ingest into Accumulo
    Query API
    • Iterators
  • 7. Workflow
  • 8. Benefits
    • Data Agnostic
    • Not a black box
    • Fully scalable
    • Crowd source micro-analytics
    • Inherent cross-datasource linked indexes
    • Encourages sharing of knowledge, discovery
    • Index built to support machine learning
    • Security considered up front – index is in Accumulo
    • Built on open source, for open source
  • 9. Feature Creation
    • Can join multiple datasets
    • Keys are established in the DataLoader
    Any external job can output this format and it will be indexed properly during indexing jobs.
    Notice there’s no key – that’s on purpose!
  • 10. Index Goals
    Now that all our features are indexed, let’s let the analysts start building!
    • Fast scans
    • Highly dimensional scans
    • Data compression
    • Simple query structure
  • 11. Accumulo Index 1: More Dimensions than Entities
    Row                                   CF            CQ         Value
    Shard Number:Data Source:Bucket Name  Bucket Value  Hash Salt  Compressed Bitmap
    Example:
    Row                                   CF            CQ         Value
    2:Twitter:handle                      stevetouw     0          010011010010011
    JavaEWAH is a word-aligned compressed variant of the Java BitSet class. It does not achieve the best compression, but rather improves query processing time.
    Indexes in the bit vector represent the features that entity falls in – a feature vector.
  • 12. At Query Time…
    Bloom filter based on the lexicographical first and last of each dimension of the query:
    • Number of followers: 10 - 200 (First: aachimba, Last: zzrka)
    • Number of tweets per day: 0 - 6 (First: aaabbb, Last: zyrbb)
    • Handle starts with letter: S (First: saarba, Last: szaban), the smallest range
    Dimensions map to a query bit vector: 000001001111000101000011100101010011100101
    Note there is an index for every possible value between the ratio features.
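A minimal sketch of how the scan range could be narrowed, assuming each query dimension carries the lexicographically first and last entity it contains (the handles are the ones on the slide). Intersecting the per-dimension ranges is one simple way to get a range that must contain every match. Whether Amino intersects the ranges or just takes the narrowest dimension is not stated on the slide, and `scan_range` is a hypothetical helper, not part of the Amino API.

```python
def scan_range(dimension_ranges):
    """Intersect (first, last) ranges; return None if they cannot overlap."""
    first = max(lo for lo, _ in dimension_ranges)
    last = min(hi for _, hi in dimension_ranges)
    return (first, last) if first <= last else None


ranges = [
    ("aachimba", "zzrka"),  # number of followers: 10 - 200
    ("aaabbb", "zyrbb"),    # number of tweets per day: 0 - 6
    ("saarba", "szaban"),   # handle starts with letter: S
]
print(scan_range(ranges))  # ('saarba', 'szaban')
```

For this query the intersection is the same range the slide marks as the smallest one.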
  • 13. Accumulo Iterator Time!!
    Row               CF         CQ  Value
    2:Twitter:handle  saarba     0   00101011001110
    2:Twitter:handle  saarra     0   00101111010100
    2:Twitter:handle  stevetouw  0   01111100001100
    2:Twitter:handle  szaban     0   00110011001111
    Push our query bit vector through the range found in the previous step.
    If the result of the bitwise operation contains an index at each dimension, we have a match!
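The match test on this slide can be sketched roughly as follows, using plain Python integers as bit vectors. This is an illustration only: the feature index layout (bits 0-3 for follower counts, 4-7 for tweets per day, 8 for "starts with S") and the entity bitmaps are made up for the example and are not the values shown on the slide. The real check runs inside an Accumulo iterator on compressed bitmaps.

```python
def mask(indexes):
    """Build an integer bit vector with the given bit positions set."""
    m = 0
    for i in indexes:
        m |= 1 << i
    return m


def matches(entity_bits, dimension_masks):
    """True if the entity has at least one set bit inside every dimension."""
    return all(entity_bits & dim for dim in dimension_masks)


# One mask per query dimension (hypothetical index layout).
query = [
    mask(range(0, 4)),  # number of followers: 10 - 200
    mask(range(4, 8)),  # number of tweets per day: 0 - 6
    mask([8]),          # handle starts with letter: S
]

# Hypothetical feature vectors for the handles in the scanned range.
entities = {
    "saarba": mask([1, 5, 8]),
    "saarra": mask([1, 8, 9]),     # no tweets-per-day bit in range
    "stevetouw": mask([3, 4, 8]),
    "szaban": mask([5, 8, 10]),    # no follower-count bit in range
}

hits = sorted(name for name, bits in entities.items() if matches(bits, query))
print(hits)  # ['saarba', 'stevetouw']
```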
  • 14. What is the Salt For?
    Row                                   CF            CQ         Value
    Shard Number:Data Source:Bucket Name  Bucket Value  Hash Salt  Compressed Bitmap
    Example:
    Row                                   CF            CQ         Value
    2:Twitter:handle                      stevetouw     0          0100110100100101
    Collisions are possible (using 32 bit vector). Salt is used to hash the feature indexes, so you need as many matches in the previous step as you have salts.
    We have used 3 salts with 15 billion records and have had no collisions.
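A rough sketch of why requiring a match under every salt cuts down on collisions. Everything here is an assumption made for illustration: the SHA-256 based `position` function, the deliberately tiny `WIDTH` (chosen so that collisions are easy to provoke) and the feature names. Amino's actual hashing scheme is not described on the slide.

```python
import hashlib

WIDTH = 32         # assumption: tiny vector so collisions show up quickly
SALTS = (0, 1, 2)  # three salts, the number the slide reports using


def position(feature, salt):
    """Hash a feature name to a bit position; each salt gives a new mapping."""
    digest = hashlib.sha256(f"{salt}:{feature}".encode()).digest()
    return int.from_bytes(digest[:4], "big") % WIDTH


def bitmap(features, salt):
    """Bit vector of an entity's features under one salt."""
    bits = 0
    for feature in features:
        bits |= 1 << position(feature, salt)
    return bits


def has_feature(features, wanted, salts):
    """Report a match only if the wanted bit is set under every salt."""
    return all((bitmap(features, s) >> position(wanted, s)) & 1 for s in salts)


owned = [f"feature-{i}" for i in range(8)]   # features the entity has
probes = [f"other-{i}" for i in range(200)]  # features it does not have

false_hits_one_salt = sum(has_feature(owned, p, SALTS[:1]) for p in probes)
false_hits_all_salts = sum(has_feature(owned, p, SALTS) for p in probes)
print(false_hits_one_salt, false_hits_all_salts)
```

A feature the entity really has is set under every salt, so it always matches. A colliding feature has to collide under all salts at once, so the second count can never exceed the first.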
  • 15. Benefits of this Index
    • Tables are small, bit vector compression is good, only one row per entity
    • Works great if you have more dimensions than you have entities, or if the ranges in your dimensions are good Bloom filters (like “handle starts with letter …”)
    • No matter how many dimensions, the query will always be as fast as the smallest range
    • All processing/boolean logic occurs on the nodes (thanks iterators), fully scalable
    • Represents a feature vector for your entities – great for machine learning
  • 16. Accumulo Index 2: More Entities than Dimensions
    Row         CF                                 CQ             Value
    shard:salt  Data Source#Bucket Name#FeatureId  Feature Value  Compressed Bitmap
    Example:
    Row         CF                                 CQ             Value
    2:0         Twitter#handle#123456              s              0100110100101001
    123456 could map to feature “Handle starts with letter”.
    Indexes in the bit vector represent the entities that fall in that feature.
    So handle stevetouw could map to index 73 (for salt 0).
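To make the inversion concrete, here is a small sketch that builds this kind of feature-to-entities bitmap with Python integers. The function name `build_feature_index`, the position 5 for saarba and the choice to key on (feature id, feature value) are assumptions for the example; only index 73 for stevetouw and feature id 123456 come from the slide.

```python
def build_feature_index(entity_positions, entity_features):
    """Map (feature id, feature value) -> int bitmap of entity positions."""
    index = {}
    for entity, features in entity_features.items():
        bit = 1 << entity_positions[entity]
        for key in features:
            index[key] = index.get(key, 0) | bit
    return index


entity_positions = {"stevetouw": 73, "saarba": 5}  # 5 is made up
entity_features = {
    "stevetouw": [("123456", "s")],  # handle starts with letter s
    "saarba": [("123456", "s")],
}

index = build_feature_index(entity_positions, entity_features)
bits = index[("123456", "s")]
set_positions = [i for i in range(bits.bit_length()) if (bits >> i) & 1]
print(set_positions)  # [5, 73]
```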
  • 17. That Same Query Again…
    • Number of followers: 10 – 200 (feature id: 444411)
    • Number of tweets per day: 0 – 6 (feature id: 555522)
    • Handle starts with letter: S (feature id: 123456)
    Row  CF                     CQ   Value
    2:0  Twitter#handle#444411  10   0010111011100
    2:0  Twitter#handle#444411  11   0101010101101
    ……
    2:0  Twitter#handle#444411  200  0000001011000
    2:0  Twitter#handle#555522  0    1111110001101
    2:0  Twitter#handle#555522  1    1010100000100
    ……
    2:0  Twitter#handle#555522  6    1111001010000
    2:0  Twitter#handle#123456  S    1111110001101
    Annotations on the slide: OR between the rows of one feature, AND between features.
    Magic iterator that handles all the boolean logic
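The boolean logic can be sketched as below: OR the bitmaps of all values inside one feature's range, then AND the per-feature results. The bit strings are the rows visible on the slide; the rows elided by "……" are simply left out, so the result is illustrative only. `evaluate` is a hypothetical helper standing in for the iterator, which in Amino runs server-side on compressed bitmaps.

```python
from functools import reduce
from operator import and_, or_

# (feature id, feature value) -> bitmap, copied from the rows shown on the slide
rows = {
    ("444411", "10"): "0010111011100",
    ("444411", "11"): "0101010101101",
    ("444411", "200"): "0000001011000",
    ("555522", "0"): "1111110001101",
    ("555522", "1"): "1010100000100",
    ("555522", "6"): "1111001010000",
    ("123456", "S"): "1111110001101",
}


def evaluate(rows, feature_ids):
    """OR the bitmaps within each feature, then AND across features."""
    per_feature = []
    for fid in feature_ids:
        bitmaps = [int(bits, 2) for (f, _), bits in rows.items() if f == fid]
        per_feature.append(reduce(or_, bitmaps))
    return reduce(and_, per_feature)


result = evaluate(rows, ["444411", "555522", "123456"])
print(format(result, "013b"))  # 0111110001101
```

Each set bit in the result is the index of an entity that satisfies all three dimensions.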
  • 19. Convert Indexes to Entities
    Row    CF                                           CQ
    shard  Index Position#Data Source#Bucket Name#Salt  Bucket Value
    Example:
    Row    CF                                           CQ
    2      73#Twitter#handle#0                          stevetouw
    The iterator scans the rows using a CF filter with the indexes desired.
    The iterator makes sure it sees the same CQ once per salt (“# of salts” times) before it sends that CQ back as a result.
    Again, use the power of iterators and push code to the data rather than doing the salt set operation in the web tier.
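A minimal sketch of the "same CQ once per salt" rule, written as ordinary Python rather than as an Accumulo iterator. The `lookup` table and `resolve` helper are hypothetical; only the entry for position 73 under salt 0 comes from the slide, and the collision on position 12 is invented to show a false hit being dropped.

```python
from collections import Counter

NUM_SALTS = 3

# (index position, salt) -> bucket value
lookup = {
    (73, 0): "stevetouw",  # from the slide
    (12, 1): "stevetouw",  # made up
    (40, 2): "stevetouw",  # made up
    (12, 0): "saarba",     # made-up entity that only shows up under salt 0
}


def resolve(hit_positions, lookup, num_salts):
    """Keep only bucket values that were hit under every salt."""
    seen = Counter()
    for salt, positions in hit_positions.items():
        values = {lookup[(p, salt)] for p in positions if (p, salt) in lookup}
        for value in values:
            seen[value] += 1
    return sorted(v for v, n in seen.items() if n == num_salts)


# Positions set in the query result bitmap, per salt.
hits = {0: [73, 12], 1: [12], 2: [40]}
print(resolve(hits, lookup, NUM_SALTS))  # ['stevetouw']
```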
  • 20. Benefits of this Index
    • Tables are small, bit vector compression is good
    • Works great if you have more entities than you have dimensions (most likely scenario)
    • Affords the ability to do full boolean logic in-iterator, rather than just ANDs as in the previous index
    • All processing/boolean logic occurs on the nodes (thanks iterators), fully scalable
  • 21. Conclusion
    • Amino helps non-technical folk leverage MapReduce cleanly and without hogging cluster resources
    • Accumulo iterators are the reason for the index performance
    • Amino is all about sharing and reuse: crowd source the building blocks and save analysts’ hypotheses. The more people touching Amino, the smarter it becomes
    • Open source (documentation needs help): https://github.com/aminocloud/amino
  • 22. Questions?
    Steve Touw, steve@42six.com
    Barrett Stabile, bstabile@42six.com
    Joe Bruner, jbruner@42six.com
    Sapan Shah, sshah@42six.com