The Amino Analytical Framework - Leveraging Accumulo to the Fullest

Framework for Big Data Discovery and Analytics

© 2013 42six Solutions, All Rights Reserved, www.42six.com

Hadoop MapReduce
• We can look across all our data to answer questions!

Problem Statement:
Developers can write MapReduce code to analyze data, but don’t know what to look
for; the analysts know what to look for, but don’t know how to write code.

Technology is not the problem. The problem is enabling the analyst to leverage
technology effectively and reuse it.


Typical Analyst Workflow:
• I have an entity I want to learn more about
• Everything is indexed by entities

• We can ask questions of Big Data, but they aren’t Big
Questions – we always start with an entity

We should be able to:
• Have a pattern and see entities that match that pattern
• Ask complex questions of Big Data


Naïve Way:
Custom MapReduce job for each question

Amino Way:
Pre-compute features (micro-analytics), the building blocks of questions, and let
analysts mix those on the fly to ask complex questions
The Amino index executes analysts’ complex questions as a real-time scan, which
means less competition for cluster resources and better scalability.
Scales to billions of entities and features


Live Demo
What could go wrong?…


Amino Framework
Feature Creation API
• Abstracts the complexities of MapReduce
• Focus on logic of the feature/micro-analytic
• Write-once DataLoader for each data source
• Simple and powerful data joins

Amino Index
• AminoOutputFormat
• Bulk Ingest into Accumulo

Query API
• Iterators


Workflow


Benefits
• Data Agnostic
• Not a black box
• Fully scalable
• Crowd source micro-analytics
• Inherent cross-datasource linked indexes
• Encourages sharing of knowledge, discovery
• Index built to support machine learning
• Security considered up front – index is in Accumulo
• Built on open source, for open source


Feature Creation

• Can join multiple datasets
• Keys are established in the DataLoader

Any external job can output this format, and it will be indexed properly during
indexing jobs.

Notice there’s no key – that’s on purpose!

Index Goals

Now that all our features are indexed, let’s let the analysts start building!

• Fast scans
• Highly dimensional scans
• Data compression
• Simple query structure


Accumulo Index 1: More Dimensions than Entities
Row                                   CF            CQ         Value
Shard Number:Data Source:Bucket Name  Bucket Value  Hash Salt  Compressed Bitmap

Example:
Row               CF         CQ  Value
2:Twitter:handle  stevetouw  0   010011010010011

JavaEWAH is a word-aligned compressed variant of the
Java bitset class. It does not aim for the best
compression; the goal is to improve query processing time.

Indexes in the bit vector represent the features that entity falls in –
a feature vector
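
As a rough illustration of the layout above, the sketch below packs the features of one entity into a bit vector and forms the four parts of the key. The feature names, the FEATURE_INDEX mapping and the helper names are assumptions made for this example, and a plain Python integer stands in for the JavaEWAH compressed bitmap.

```python
# Hypothetical mapping from feature name to bit position (not from the slides).
FEATURE_INDEX = {
    "followers:10-200": 0,
    "tweets_per_day:0-6": 1,
    "handle_starts_with:s": 2,
}

def feature_vector(features):
    """Set one bit for every feature the entity falls in."""
    bits = 0
    for name in features:
        bits |= 1 << FEATURE_INDEX[name]
    return bits

def index1_entry(shard, source, bucket, entity, salt, features):
    """Return (row, cf, cq, value) following the Index 1 layout."""
    row = f"{shard}:{source}:{bucket}"
    return row, entity, str(salt), feature_vector(features)

entry = index1_entry(2, "Twitter", "handle", "stevetouw", 0,
                     ["followers:10-200", "handle_starts_with:s"])
```

Each entity yields one row per salt, which is why the table stays small even when the number of features grows.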

At Query Time…
Bloom filter based on the lexicographically first and last bucket values of
each dimension of the query:

Dimension                        First     Last
Number of followers: 10 - 200    aachimba  zzrka
Number of tweets per day: 0 - 6  aaabbb    zyrbb
Handle starts with letter: S     saarba    szaban   (smallest range)

Dimensions map to a query bit vector:
000001001111000101000011100101010011100101
Note that there is an index for every possible value within the range of a
ratio feature.
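
A minimal sketch of the range selection, using the first and last values from the slide. The slide picks the smallest range; the sketch approximates that by intersecting the ranges (latest first, earliest last), which is a substitute technique that gives the same scan range for this example. The dictionary layout and function names are assumptions.

```python
# (first, last) bucket value per query dimension, taken from the slide.
RANGES = {
    "Number of followers: 10 - 200": ("aachimba", "zzrka"),
    "Number of tweets per day: 0 - 6": ("aaabbb", "zyrbb"),
    "Handle starts with letter: S": ("saarba", "szaban"),
}

def scan_range(ranges):
    """Intersect the ranges: latest 'first' and earliest 'last'."""
    first = max(lo for lo, _hi in ranges.values())
    last = min(hi for _lo, hi in ranges.values())
    return first, last

def in_range(value, bounds):
    """True if value lies inside the inclusive lexicographic bounds."""
    first, last = bounds
    return first <= value <= last

bounds = scan_range(RANGES)
```

Only entities between these two bounds need to be read, so the cost of the scan is driven by the most selective dimension.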

Accumulo Iterator Time!!

Row               CF         CQ  Value
2:Twitter:handle  saarba     0   00101011001110
2:Twitter:handle  saarra     0   00101111010100
2:Twitter:handle  stevetouw  0   01111100001100
2:Twitter:handle  szaban     0   00110011001111

Push our query bit vector through the range found in the previous step.

If the result of the bitwise operation contains an index at each
dimension, we have a match!
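
The matching rule can be sketched as follows. The bit layout (which bits belong to which dimension) and the toy bit vectors are assumptions made for illustration and are not the values shown on the slide; plain integers stand in for compressed bitmaps.

```python
# Hypothetical query: one mask per dimension, covering that dimension's bits.
DIMENSION_MASKS = [
    0b000111,  # number of followers (three possible values)
    0b011000,  # tweets per day (two possible values)
    0b100000,  # handle starts with "s"
]

# Toy rows inside the scanned range: (entity, feature bit vector).
ROWS = [
    ("saarba",    0b101010),
    ("saarra",    0b100011),
    ("stevetouw", 0b110100),
    ("szaban",    0b011001),
]

def matches(entity_bits, masks):
    """An entity matches when the AND leaves a bit set in every dimension."""
    return all(entity_bits & mask for mask in masks)

hits = [entity for entity, bits in ROWS if matches(bits, DIMENSION_MASKS)]
```

In Amino this test runs inside an Accumulo iterator on the tablet servers; the loop here only mimics that logic on the client.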


What is the Salt For?
Row                                   CF            CQ         Value
Shard Number:Data Source:Bucket Name  Bucket Value  Hash Salt  Compressed Bitmap

Row               CF         CQ  Value
2:Twitter:handle  stevetouw  0   0100110100100101

Collisions are possible (using a 32-bit vector). The salt is used to hash the
feature indexes, so you need as many matches in the previous step as you have
salts.
We have used 3 salts with 15 billion records and have had no collisions.
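
One way to picture the role of the salt is the sketch below. The hash function (SHA-256 reduced modulo the vector width), the feature names and the helper names are all assumptions; the slides do not say how Amino hashes feature indexes. The point is only that an entity counts as a match when it matches under every salt.

```python
import hashlib

NUM_SALTS = 3      # the slide reports using 3 salts
VECTOR_BITS = 32   # assumed width of the bit vector

def salted_index(feature, salt):
    """Hypothetical hash: map a feature to a bit position for a given salt."""
    digest = hashlib.sha256(f"{salt}:{feature}".encode()).digest()
    return int.from_bytes(digest[:4], "big") % VECTOR_BITS

def salted_vector(features, salt):
    """Bit vector for a set of features under one salt."""
    bits = 0
    for feature in features:
        bits |= 1 << salted_index(feature, salt)
    return bits

def matches_all_salts(entity_features, query_dimensions):
    """Require a hit in every dimension, under every salt."""
    for salt in range(NUM_SALTS):
        entity_bits = salted_vector(entity_features, salt)
        for alternatives in query_dimensions:
            if not entity_bits & salted_vector(alternatives, salt):
                return False
    return True

entity = ["followers:50", "tweets_per_day:3", "handle_starts_with:s"]
query = [["followers:50", "followers:51"],
         ["tweets_per_day:3"],
         ["handle_starts_with:s"]]
result = matches_all_salts(entity, query)
```

A false positive would need a collision under each salt at once, which is why adding salts makes accidental matches less likely.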


Benefits of this Index

• Tables are small, bit vector compression is good, only one row per
entity
• Works great if you have more dimensions than you have entities or
the ranges in your dimensions make good Bloom filters (like “handle
starts with letter …”)
• No matter how many dimensions, the query will always be as fast
as the smallest range
• All processing/boolean logic occurs on the nodes (thanks iterators),
fully scalable
• Represents a feature vector for your entities – great for machine
learning


Accumulo Index 2: More Entities than Dimensions

Row         CF                                CQ             Value
shard:salt  Data Source#Bucket Name#FeatureId  Feature Value  Compressed Bitmap

Example:
Row  CF                     CQ  Value
2:0  Twitter#handle#123456  s   0100110100101001

123456 could map to feature “Handle starts with letter”
Indexes in the bit vector represent the entities that fall in that
feature
So handle stevetouw could map to index 73 (for salt 0)
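
The inverted layout can be sketched like this: one bitmap per feature value, with one bit per entity. The entity positions (other than 73 for stevetouw, which comes from the slide), the toy entities and the helper names are assumptions, and a Python integer stands in for the compressed bitmap.

```python
from collections import defaultdict

# Hypothetical index position of each entity for salt 0.
ENTITY_POSITION = {"stevetouw": 73, "saarba": 5, "szaban": 12, "aachimba": 1}

# Hypothetical (feature id, feature value) pairs per entity.
ENTITY_FEATURES = {
    "stevetouw": [("123456", "s")],
    "saarba":    [("123456", "s")],
    "szaban":    [("123456", "s")],
    "aachimba":  [("123456", "a")],
}

def build_index2(shard, salt, source, bucket):
    """Return {(row, cf, cq): bitmap} following the Index 2 layout."""
    index = defaultdict(int)
    for entity, features in ENTITY_FEATURES.items():
        for feature_id, value in features:
            key = (f"{shard}:{salt}", f"{source}#{bucket}#{feature_id}", value)
            index[key] |= 1 << ENTITY_POSITION[entity]
    return dict(index)

index2 = build_index2(2, 0, "Twitter", "handle")
```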


That Same Query Again…
Number of followers: 10 – 200 (feature id: 444411)
Number of tweets per day: 0 – 6 (feature id: 555522)
Handle starts with letter: S (feature id: 123456)
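Row  CQ   Value

CF for feature id 444411 (number of followers), rows ORed together:
2:0  10   0010111011100
2:0  11   0101010101101
……
2:0  200  0000001011000

AND

CF for feature id 555522 (number of tweets per day), rows ORed together:
2:0  0    1111110001101
2:0  1    1010100000100
……
2:0  6    1111001010000

AND

CF for feature id 123456 (handle starts with letter):
2:0  S    1111110001101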
Magic iterator that handles all the boolean logic
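
The boolean logic can be sketched as an OR within each feature and an AND across features. The bitmaps below are toy values chosen for illustration, not the ones on the slide, and the function names are assumptions.

```python
from functools import reduce
from operator import and_, or_

def evaluate(groups):
    """OR the bitmaps inside each group, then AND the group results."""
    ored = [reduce(or_, bitmaps, 0) for bitmaps in groups]
    return reduce(and_, ored)

def positions(bits):
    """List the index positions of the set bits."""
    return [i for i in range(bits.bit_length()) if (bits >> i) & 1]

# One group per feature; one bitmap per feature value in the queried range.
GROUPS = [
    [0b0011, 0b0100],  # number of followers: values ORed
    [0b0110, 0b1000],  # number of tweets per day: values ORed
    [0b0101],          # handle starts with "S"
]

result_bits = evaluate(GROUPS)
result_positions = positions(result_bits)
```

The surviving positions are entity indexes for one shard:salt pair, which still have to be turned back into entities.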

More Details

The same entity is guaranteed to always land in the same shard:salt no
matter the feature

We are left with a set of indexes for each salt, now what?

Convert Indexes to Entities
Row    CF                                           CQ            Value
shard  Index Position#Data Source#Bucket Name#Salt  Bucket Value

Example:
Row  CF                   CQ         Value
2    73#Twitter#handle#0  stevetouw

The iterator scans the rows using a CF filter built from the desired indexes.
The iterator makes sure it has seen the same CQ “# of salts” times before it
sends that CQ back as a result.
Again, this uses the power of iterators, pushing code to the data rather than
doing the salt set operation in the web tier.
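
A minimal sketch of this last step, assuming a small in-memory lookup table in place of the Accumulo table. The index positions other than 73 for salt 0, the extra entities and the function name are hypothetical.

```python
from collections import Counter

NUM_SALTS = 3  # the slides report using 3 salts

# Toy lookup rows: (row, cf, cq) with cf = position#source#bucket#salt.
LOOKUP = [
    ("2", "73#Twitter#handle#0", "stevetouw"),
    ("2", "12#Twitter#handle#1", "stevetouw"),
    ("2", "40#Twitter#handle#2", "stevetouw"),
    ("2", "5#Twitter#handle#0", "saarba"),
    ("2", "30#Twitter#handle#1", "saarba"),
    ("2", "22#Twitter#handle#2", "saarba"),
    ("2", "9#Twitter#handle#2", "szaban"),
]

def resolve(indexes_by_salt, source="Twitter", bucket="handle"):
    """Keep only entities whose CQ is seen once for every salt."""
    wanted = {f"{pos}#{source}#{bucket}#{salt}"
              for salt, found in indexes_by_salt.items()
              for pos in found}
    seen = Counter(cq for _row, cf, cq in LOOKUP if cf in wanted)
    return sorted(cq for cq, count in seen.items() if count == NUM_SALTS)

# Index positions left over after the boolean step, per salt.
entities = resolve({0: {73, 5}, 1: {12}, 2: {40, 9}})
```

Here saarba and szaban each appear under only one salt, so they are treated as collisions and dropped.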


Benefits of this Index

• Tables are small, bit vector compression is good
• Works great if you have more entities than you have dimensions
(most likely scenario)

• Affords the ability to do full boolean logic in-iterator, rather than just
ANDs as in the previous index
• All processing/boolean logic occurs on the nodes (thanks iterators),
fully scalable


Conclusion

• Amino helps non-technical folk leverage MapReduce cleanly and without
hogging cluster resources
• Accumulo iterators are the reason for the index performance

• Amino is all about sharing and reuse: crowdsource the building blocks
and save analysts’ hypotheses. The more people touch Amino, the smarter
it becomes
• Open source (documentation needs help): https://github.com/aminocloud/amino


Questions?
Steve Touw, steve@42six.com
Barrett Stabile, bstabile@42six.com
Joe Bruner, jbruner@42six.com
Sapan Shah, sshah@42six.com

