SURVEY OF ACCUMULO
TECHNIQUES FOR INDEXING DATA
Donald Miner
@donaldpminer
January 18th, 2015
INTRODUCTION TO
ACCUMULO
The Apache Accumulo sorted, distributed key/value store is
a robust, scalable, high performance data storage and
retrieval system.
Adelaide Bartkowski
Alyssa Files
Beatriz Palmore
Cecilia Ours
Craig Avalos
Dianna Lapointe
Erma Davis
Fermina Smead
Garrett Harsh
Gaylene Sherry
Gilberto Pardue
Hui Nodal
Janell Tomita
Jannette Betters
Jeana Delk
Madlyn Radke
Peggie Allis
Rhona Zygmont
Tran Degarmo
Wilhelmina Papp
-inf to D:
Adelaide Bartkowski
Alyssa Files
Beatriz Palmore
Cecilia Ours
Craig Avalos
Dianna Lapointe
E to H:
Erma Davis
Fermina Smead
Garrett Harsh
Gaylene Sherry
Gilberto Pardue
Hui Nodal
J to +inf:
Janell Tomita
Jannette Betters
Jeana Delk
Madlyn Radke
Peggie Allis
Rhona Zygmont
Tran Degarmo
Wilhelmina Papp
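The range split above can be sketched in plain Java with a sorted map. This is only an illustration of the idea, not how Accumulo locates tablets internally: the class name TabletLocator, the use of an empty string for -inf, and the tablet descriptions are all assumptions made for the example.

```java
import java.util.Map;
import java.util.TreeMap;

class TabletLocator {
    // Hypothetical tablets keyed by the first row each one can hold.
    // "" stands in for -inf; names are illustrative, not Accumulo's.
    private final TreeMap<String, String> tabletsByStartRow = new TreeMap<>();

    TabletLocator() {
        tabletsByStartRow.put("", "-inf to D");
        tabletsByStartRow.put("E", "E to H");
        tabletsByStartRow.put("J", "J to +inf");
    }

    // Returns the range whose start row is the greatest one <= rowId.
    String locate(String rowId) {
        Map.Entry<String, String> entry = tabletsByStartRow.floorEntry(rowId);
        return entry.getValue();
    }
}
```

Because the rows are sorted, finding the right range for "Hui Nodal" is a single ordered-map lookup rather than a search through every tablet.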
Accumulo Master
TabletServer TabletServer TabletServer
ZooKeeper
KEY VALUE
Adelaide Bartkowski 91294124
Alyssa Files 491294
Beatriz Palmore 4124124124
Cecilia Ours 419120
Craig Avalos 940124
Dianna Lapointe 4921
Erma Davis 050194
Fermina Smead 10024599949
Garrett Harsh 140095931
Gaylene Sherry 914815
Gilberto Pardue 412414124124
Hui Nodal 962195192
Janell Tomita 12121
Jannette Betters 9192012
Jeana Delk 9120150
Madlyn Radke 4921
Peggie Allis 944944
Rhona Zygmont 123103
Tran Degarmo 9499494
Wilhelmina Papp 11221
Lookup “Garrett Harsh” (by key): FAST
Lookup “4921” (by value): SLOW
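A minimal Java sketch of why the two lookups differ, using a TreeMap as a stand-in for a sorted key/value table. The class LookupDemo and its method names are made up for illustration; only a few rows from the table above are loaded.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class LookupDemo {
    // A small sorted table holding a subset of the rows shown above.
    static TreeMap<String, String> sampleTable() {
        TreeMap<String, String> table = new TreeMap<>();
        table.put("Dianna Lapointe", "4921");
        table.put("Garrett Harsh", "140095931");
        table.put("Madlyn Radke", "4921");
        table.put("Wilhelmina Papp", "11221");
        return table;
    }

    // Fast: the sorted structure is searched directly by key.
    static String lookupByKey(TreeMap<String, String> table, String key) {
        return table.get(key);
    }

    // Slow: values are not sorted, so every entry has to be inspected.
    static List<String> lookupByValue(TreeMap<String, String> table, String value) {
        List<String> keys = new ArrayList<>();
        for (Map.Entry<String, String> entry : table.entrySet()) {
            if (entry.getValue().equals(value)) {
                keys.add(entry.getKey());
            }
        }
        return keys;
    }
}
```

The value scan also shows that a value such as 4921 can belong to more than one key, which is one reason a separate index table is useful.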
MIT Lincoln Lab study:
100 million inserts per second using Accumulo
http://arxiv.org/ftp/arxiv/papers/1406/1406.4923.pdf
Booz Allen Hamilton study:
http://sqrrl.com/media/Accumulo-Benchmark-10312013-1.pdf
942 tablet servers, 7.56 trillion entries, 408 TB, 26 hours
94 MB/sec, 15 TB/hr, 80 million inserts per second
11 tablet servers went down with no interruption
Showed linear scalability for write throughput
22,000 queries per second
HBase vs. Accumulo
• Subtle yet important differences in visibility implementation
• Coprocessors vs. Iterators
• Accumulo has faster write throughput*
• HBase’s reads are faster*
• HBase has more ecosystem integration
• Accumulo can shift around column families and locality groups
after the fact
• Accumulo has been shown to work with no problems at 1,000 nodes
(BAH paper). Facebook and others run a “cell” design for
HBase. Largest clusters in the hundreds*.
* We believe
Disclaimer: I am biased
Column Visibility Syntax
Label Description
A & B Both ‘A’ and ‘B’ are required
A | B Either ‘A’ or ‘B’ is required
A & (C | B) ‘A’ and ‘C’ or ‘A’ and ‘B’ is required
A | (B & C) ‘A’ or ‘B’ and ‘C’ is required
(A | B) & (C & D) ?
A & (B & (C | D)) ?
Patient has schizophrenia: insurer | (MD & psych)
Patient has stomach ulcers: insurer | doctor
Patient has cavity: insurer | dentist
Patient has consent for general anesthesia: surgeon
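To make the label syntax concrete, here is a small stand-alone evaluator sketch in Java. It is not Accumulo's own implementation: the class VisibilityCheck and the method canSee are hypothetical names, labels are limited to letters and digits, and, like Accumulo, it rejects expressions that mix & and | without parentheses.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class VisibilityCheck {
    private final String expr;
    private final Set<String> auths;
    private int pos = 0;

    private VisibilityCheck(String expr, Set<String> auths) {
        this.expr = expr.replace(" ", "");
        this.auths = auths;
    }

    // True if a reader holding the given authorizations satisfies the label.
    static boolean canSee(String label, String... authorizations) {
        VisibilityCheck c = new VisibilityCheck(label, new HashSet<>(Arrays.asList(authorizations)));
        boolean result = c.parseExpr();
        if (c.pos != c.expr.length()) {
            throw new IllegalArgumentException("unexpected input at " + c.pos);
        }
        return result;
    }

    // expr := term (op term)*, where every op at one nesting level is the same
    private boolean parseExpr() {
        boolean result = parseTerm();
        char op = 0;
        while (pos < expr.length() && (expr.charAt(pos) == '&' || expr.charAt(pos) == '|')) {
            char next = expr.charAt(pos++);
            if (op != 0 && op != next) {
                throw new IllegalArgumentException("mixing & and | needs parentheses");
            }
            op = next;
            boolean rhs = parseTerm();
            result = (op == '&') ? (result && rhs) : (result || rhs);
        }
        return result;
    }

    // term := '(' expr ')' | label
    private boolean parseTerm() {
        if (pos < expr.length() && expr.charAt(pos) == '(') {
            pos++;
            boolean inner = parseExpr();
            if (pos >= expr.length() || expr.charAt(pos) != ')') {
                throw new IllegalArgumentException("missing )");
            }
            pos++;
            return inner;
        }
        int start = pos;
        while (pos < expr.length() && Character.isLetterOrDigit(expr.charAt(pos))) {
            pos++;
        }
        if (start == pos) {
            throw new IllegalArgumentException("expected a label at " + pos);
        }
        return auths.contains(expr.substring(start, pos));
    }
}
```

With this sketch, canSee("insurer | (MD & psych)", "MD", "psych") is true, while a reader holding only "MD" is refused.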
More cool features
• Iterator framework: customizable server-side processing
• Constraints: user-defined Java functions that allow or
prevent new writes based on a condition
• Large rows: no limit on data stored in a row
• MapReduce InputFormats
• Thrift proxy: access Accumulo through Ruby, Python, …
• Monitor page: shows performance, status, errors, more
• Locality groups: group column families together on disk
for performance tuning (changeable later)
• On-HDFS at rest encryption (work in progress)
• Table import and export
Scalability & Performance
• Multiple HDFS volumes: Accumulo can use multiple
NameNodes to store its data
• Master stores metadata in an Accumulo table
• Native in-memory map: data is first written into a buffer
written in C++, outside of Java
• Relative encoding: consecutive keys with the same values
are flagged instead of rewritten
• Scan pipelines: stages of the read path are parallelized
into separate threads
• Caching: data recently scanned is cached
HOW IT WORKS
Data Model
KEY: ROW ID, COLUMN (FAMILY, QUALIFIER, VISIBILITY), TIMESTAMP
VALUE
Row ID  Family   Qualifier  Visibility       Timestamp  Value
derek   …        …          …                …          …
don     contact  email      admin | private  11905014   dminer@gopivotal.com
don     contact  email      admin | private  12412412   dminer@clearedgeit.com
don     contact  email      public           12412412   dm…@cl....com
don     contact  twitter    public           12423523   @donaldpminer
don     info     height     public           12314514   5’ 9”
don     info     SSN        private          12314514   123-45-6789
erica   …        …          …                …          …
Row ID: lookup key
Family: collection of data that is kept together
Qualifier: what the data is
Visibility: who can see the data
Timestamp: when the data was created
Key as a whole: UNIQUENESS
Key as a whole: SORTED
Value: some piece of information
Writing data into Accumulo
Row ID  Family  Qualifier  Visibility  Timestamp  Value
don     info    picture    public      13119103   dd3ae1d3b951a33f…
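import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.BatchWriterConfig;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.ColumnVisibility;
import org.apache.hadoop.io.Text;

// conn is an existing Connector; MyPictureObj holds the picture data
Text rowID = new Text("don");
Text colFam = new Text("info");
Text colQual = new Text("picture");
ColumnVisibility colVis = new ColumnVisibility("public");
long timestamp = System.currentTimeMillis();
Value value = new Value(MyPictureObj.getBytes());
// A Mutation collects all changes to a single row
Mutation mutation = new Mutation(rowID);
mutation.put(colFam, colQual, colVis, timestamp, value);
BatchWriterConfig config = new BatchWriterConfig();
BatchWriter writer = conn.createBatchWriter("usertable", config);
writer.addMutation(mutation);
writer.close();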
Row ID Family Qualifier Visibility Timestamp Value
derek … … … … …
don contact email admin | private 11905014 dminer@gopivotal.com
don contact email admin | private 12412412 dminer@clearedgeit.com
don contact email public 12412412 dm…@cl....com
don contact twitter public 12423523 @donaldpminer
don info height public 12314514 5’ 9”
don info picture public 13119103 dd3ae1d3b951a33f…
don info SSN private 12314514 123-45-6789
erica … … … … …
Reading data
Range    Family  Visibilities
don-don  info    public
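import java.util.Map.Entry;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.hadoop.io.Text;

Authorizations auths = new Authorizations("public");
Scanner scan = conn.createScanner("usertable", auths); // conn is an existing Connector
scan.setRange(new Range("don", "don"));
scan.fetchColumnFamily(new Text("info"));
for (Entry<Key,Value> entry : scan) {
  String row = entry.getKey().getRow().toString(); // getRow() returns a Text
  Value value = entry.getValue();
}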
Reading data: other scans
Range    Family  Visibilities
don-don  info    public, user, tech
don-don  (any)   public, user, tech
d-e      (any)   public, user, tech
INDEXING & TABLE DESIGN
Basic Structured Data
Row ID  Column Family  Column Qualifier  Column Visibility  Timestamp  Value
bob     attribute      height            public             Jun 2012   5’11”
bob     attribute      surname           public             Jul 2013   doe
bob     insurance      dental            private            Sep 2009   MetLife
jane    attribute      bloodType         public             Jul 2011   ab-
jane    attribute      surname           public             Aug 2013   doe
jane    contact        cellPhone         public             Dec 2010   (808) 345-9876
jane    insurance      vision            private            Jan 2008   VSP
john    allergy        major             private            Feb 1988   amoxicillin
john    attribute      weight            public             Sep 2013   180
john    contact        homeAddr          public             Mar 2003   34 Baker LN
Indexing Everything
Event Table:
Row ID  Column Fam  Column Qual         Visibility  Time  value
Index Table (index to values):
value   Column Fam  Column Qual:Row ID  Visibility  Time  -
Index Table
Row ID          Column Family  Column Qualifier  Column Visibility  Timestamp  Value
(808) 345-9876  contact        cellPhone:jane    public             Dec 2010   -
180             attribute      weight:john       public             Sep 2013   -
34 Baker LN     contact        homeAddr:john     public             Mar 2003   -
5’11”           attribute      height:bob        public             Jun 2012   -
MetLife         insurance      dental:bob        private            Sep 2009   -
VSP             insurance      vision:jane       private            Jan 2008   -
ab-             attribute      bloodType:jane    public             Jul 2011   -
amoxicillin     allergy        major:john        private            Feb 1988   -
doe             attribute      surname:bob       public             Jul 2013   -
doe             attribute      surname:jane      public             Aug 2013   -
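The mapping from an event-table entry to its index-table entry can be sketched as a small pure-Java transform. The Cell class and the IndexBuilder name are invented for this example and leave out visibility and timestamp, which the tables above simply carry over unchanged.

```java
import java.util.ArrayList;
import java.util.List;

class IndexBuilder {
    // One table cell: row ID, column family, column qualifier, value.
    static final class Cell {
        final String row;
        final String family;
        final String qualifier;
        final String value;

        Cell(String row, String family, String qualifier, String value) {
            this.row = row;
            this.family = family;
            this.qualifier = qualifier;
            this.value = value;
        }
    }

    // Event cell (row, family, qualifier) -> value becomes the index cell
    // (value, family, qualifier:row) -> "-", matching the index table layout.
    static Cell toIndexCell(Cell event) {
        return new Cell(event.value, event.family, event.qualifier + ":" + event.row, "-");
    }

    // Builds one index cell per event cell.
    static List<Cell> buildIndex(List<Cell> events) {
        List<Cell> index = new ArrayList<>();
        for (Cell event : events) {
            index.add(toIndexCell(event));
        }
        return index;
    }
}
```

For example, the event cell (john, allergy, major) with value amoxicillin yields the index cell (amoxicillin, allergy, major:john), so a lookup on the row "amoxicillin" leads back to john.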
Data Lake
PATIENTS MEDICINES DOCTORS
INDEX
Tell me everything you know of amoxicillin
amoxicillin
Data Lake
PATIENTS DISEASES DOCTORS
INDEX
amoxicillin
bob:allergy:amoxicillin
larry:takes:amoxicillin
Stomach ulcer:treatment:amoxicillin
smith:prescribed:amoxicillin
Infection:treatment:amoxicillin
Diarrhea:side effect:amoxicillin
Visibility labels help converge
data sources but still protect
who can see them.
Graphs
Nodes: a, b, c, d, e
Adjacency matrix (columns: start nodes, rows: end nodes)
   a  b  c  d  e
a  -  0  1  0  0
b  1  -  0  0  0
c  0  0  -  1  0
d  1  0  1  -  1
e  0  0  0  0  -
Row ID Column Family Column Qualifier Value
a edge b 1
a edge d 1
c edge a 1
c edge d 1
d edge c 1
e edge d 1
• Random walk
• Neighborhoods
• Traversals
Each edge can have
a visibility label!
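A rough Java sketch of how the edge table supports neighborhoods and traversals: each start node is a row, and its end nodes are the qualifiers under that row. EdgeTable and its methods are illustrative names only, and an in-memory sorted map stands in for the Accumulo table.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class EdgeTable {
    // row ID (start node) -> column qualifiers (end nodes), both kept sorted
    private final TreeMap<String, TreeSet<String>> rows = new TreeMap<>();

    void addEdge(String from, String to) {
        rows.computeIfAbsent(from, k -> new TreeSet<>()).add(to);
    }

    // Neighborhood: one row lookup returns every outgoing edge of a node.
    List<String> neighbors(String node) {
        TreeSet<String> out = rows.get(node);
        return out == null ? new ArrayList<String>() : new ArrayList<String>(out);
    }

    // Traversal: breadth-first walk that repeats the row lookup per node.
    List<String> reachableFrom(String start) {
        Set<String> visited = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        visited.add(start);
        queue.add(start);
        while (!queue.isEmpty()) {
            for (String next : neighbors(queue.remove())) {
                if (visited.add(next)) {
                    queue.add(next);
                }
            }
        }
        return new ArrayList<>(visited);
    }

    // The six edges from the table above.
    static EdgeTable sample() {
        EdgeTable t = new EdgeTable();
        t.addEdge("a", "b");
        t.addEdge("a", "d");
        t.addEdge("c", "a");
        t.addEdge("c", "d");
        t.addEdge("d", "c");
        t.addEdge("e", "d");
        return t;
    }
}
```

A random walk would follow the same pattern, picking one entry from neighbors(node) at each step instead of visiting all of them.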
Term-Partitioned Index
Tablet Server 1 (knows about the term “baseball”)
Row ID    Column Family  Value
baseball  document       docid_3
baseball  document       docid_2
bat       document       docid_2
RESULTS: [docid_2, docid_3]
Tablet Server 2 (knows about the term “football”)
Row ID    Column Family  Value
football  document       docid_1
football  document       docid_3
glove     document       docid_1
RESULTS: [docid_1, docid_3]
Tablet Server 3 (knows about the term “soccer”)
Row ID    Column Family  Value
nba       document       docid_1
shoes     document       docid_1
soccer    document       docid_3
RESULTS: [docid_3]
Query: “baseball” AND “football” AND “soccer”
Client
Client-side Set
Intersection
[docid_2, docid_3]
[docid_1, docid_3]
[docid_3]
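The client-side step can be sketched in a few lines of Java: the client collects one result list per term and keeps only the document IDs present in all of them. TermQuery and intersect are hypothetical names for this illustration.

```java
import java.util.List;
import java.util.TreeSet;

class TermQuery {
    // Intersects the per-term result lists returned by the tablet servers.
    static TreeSet<String> intersect(List<List<String>> resultsPerTerm) {
        TreeSet<String> result = new TreeSet<>();
        if (resultsPerTerm.isEmpty()) {
            return result;
        }
        result.addAll(resultsPerTerm.get(0));
        for (List<String> docIds : resultsPerTerm.subList(1, resultsPerTerm.size())) {
            // Keep only document IDs that also appear for this term.
            result.retainAll(docIds);
        }
        return result;
    }
}
```

For the query above, intersecting the three lists leaves only docid_3. The cost of this design is that every result list travels to the client before anything is discarded.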
Visibility labels allow protected search
Iterators can maintain stats about docs
Geospatial Indexing: Grid Squares
Geospatial Indexing: Z-Order Curve
33.333W, 55.555N = 3535.353535
3535.353535 is the rowkey
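The row key above comes from interleaving the digits of the two coordinates. A minimal Java sketch of that digit interleaving follows; ZOrderKey and its methods are made-up names, and the sketch assumes both coordinates are already unsigned, zero-padded strings of the same width (hemisphere letters are ignored). Z-order curves are often built by interleaving bits instead; this follows the decimal example shown here.

```java
class ZOrderKey {
    // Interleaves the characters of two equal-length digit strings: a0 b0 a1 b1 ...
    static String interleaveDigits(String a, String b) {
        if (a.length() != b.length()) {
            throw new IllegalArgumentException("pad both inputs to the same width");
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < a.length(); i++) {
            sb.append(a.charAt(i)).append(b.charAt(i));
        }
        return sb.toString();
    }

    // Builds a row key from fixed-width coordinates such as "33.333" and "55.555".
    static String rowKey(String first, String second) {
        String[] x = first.split("\\.");
        String[] y = second.split("\\.");
        return interleaveDigits(x[0], y[0]) + "." + interleaveDigits(x[1], y[1]);
    }
}
```

Because the most significant digits of both coordinates come first in the key, points that are close together tend to share a key prefix and so tend to sort near each other in the table.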
Temporal Indexing
Row ID Column Family Column Qualifier Value
Router37 2014-12 1418624102 cold
Router37 2015-01 1421633979 cold
Router37 2015-01 1421634319 hot
Router37 2015-01 1421635001 cold
Server92 2014-12 1418555102 cold
Server92 2014-12 1418556999 hot
Server92 2014-12 1418651002 cold
Server92 2014-12 1418756987 hot
Server92 2014-12 1418853304 cold
Server98 2014-12 1418555104 cold
Server98 2015-01 1421633319 cold
Note: Dynamically adding column families
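The key layout in the table can be derived from a device name and an epoch timestamp. The Java sketch below assumes the column family is the UTC year and month of the timestamp; the time zone, the TemporalKey class and its method names are assumptions for illustration.

```java
import java.time.Instant;
import java.time.YearMonth;
import java.time.ZoneOffset;

class TemporalKey {
    // Column family: the year-month bucket (UTC) that an epoch-second timestamp falls in.
    static String family(long epochSeconds) {
        return YearMonth.from(Instant.ofEpochSecond(epochSeconds).atZone(ZoneOffset.UTC)).toString();
    }

    // Key layout from the table: row = device, family = yyyy-MM, qualifier = epoch seconds.
    static String[] key(String device, long epochSeconds) {
        return new String[] { device, family(epochSeconds), Long.toString(epochSeconds) };
    }
}
```

A reading from a new month simply produces a new column family value, which is what adding column families dynamically amounts to in this layout.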
Resources
Apache Accumulo website
accumulo.apache.org
Accumulo Summit 2014
accumulosummit.com
slideshare.net/AccumuloSummit
Accumulo Summit 2015
End of April!
accumulosummit.com

Contenu connexe

Tendances

Embrace NoSQL and Eventual Consistency with Ripple
Embrace NoSQL and Eventual Consistency with RippleEmbrace NoSQL and Eventual Consistency with Ripple
Embrace NoSQL and Eventual Consistency with RippleSean Cribbs
 
Maximum Overdrive: Tuning the Spark Cassandra Connector
Maximum Overdrive: Tuning the Spark Cassandra ConnectorMaximum Overdrive: Tuning the Spark Cassandra Connector
Maximum Overdrive: Tuning the Spark Cassandra ConnectorRussell Spitzer
 
Cassandra Day Atlanta 2015: Data Modeling In-Depth: A Time Series Example
Cassandra Day Atlanta 2015: Data Modeling In-Depth: A Time Series ExampleCassandra Day Atlanta 2015: Data Modeling In-Depth: A Time Series Example
Cassandra Day Atlanta 2015: Data Modeling In-Depth: A Time Series ExampleDataStax Academy
 
Big data analytics with Spark & Cassandra
Big data analytics with Spark & Cassandra Big data analytics with Spark & Cassandra
Big data analytics with Spark & Cassandra Matthias Niehoff
 
Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)
Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)
Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)Spark Summit
 
Spark and Cassandra 2 Fast 2 Furious
Spark and Cassandra 2 Fast 2 FuriousSpark and Cassandra 2 Fast 2 Furious
Spark and Cassandra 2 Fast 2 FuriousRussell Spitzer
 
SQL-on-Hadoop with Apache Drill
SQL-on-Hadoop with Apache DrillSQL-on-Hadoop with Apache Drill
SQL-on-Hadoop with Apache DrillMapR Technologies
 
Strata Presentation: One Billion Objects in 2GB: Big Data Analytics on Small ...
Strata Presentation: One Billion Objects in 2GB: Big Data Analytics on Small ...Strata Presentation: One Billion Objects in 2GB: Big Data Analytics on Small ...
Strata Presentation: One Billion Objects in 2GB: Big Data Analytics on Small ...randyguck
 
Time series with Apache Cassandra - Long version
Time series with Apache Cassandra - Long versionTime series with Apache Cassandra - Long version
Time series with Apache Cassandra - Long versionPatrick McFadin
 
Introduction to Riak and Ripple (KC.rb)
Introduction to Riak and Ripple (KC.rb)Introduction to Riak and Ripple (KC.rb)
Introduction to Riak and Ripple (KC.rb)Sean Cribbs
 
Analytics with Cassandra & Spark
Analytics with Cassandra & SparkAnalytics with Cassandra & Spark
Analytics with Cassandra & SparkMatthias Niehoff
 
Escape From Hadoop: Spark One Liners for C* Ops
Escape From Hadoop: Spark One Liners for C* OpsEscape From Hadoop: Spark One Liners for C* Ops
Escape From Hadoop: Spark One Liners for C* OpsRussell Spitzer
 
Data Exploration with Apache Drill: Day 1
Data Exploration with Apache Drill:  Day 1Data Exploration with Apache Drill:  Day 1
Data Exploration with Apache Drill: Day 1Charles Givre
 
Riak with node.js
Riak with node.jsRiak with node.js
Riak with node.jsSean Cribbs
 

Tendances (14)

Embrace NoSQL and Eventual Consistency with Ripple
Embrace NoSQL and Eventual Consistency with RippleEmbrace NoSQL and Eventual Consistency with Ripple
Embrace NoSQL and Eventual Consistency with Ripple
 
Maximum Overdrive: Tuning the Spark Cassandra Connector
Maximum Overdrive: Tuning the Spark Cassandra ConnectorMaximum Overdrive: Tuning the Spark Cassandra Connector
Maximum Overdrive: Tuning the Spark Cassandra Connector
 
Cassandra Day Atlanta 2015: Data Modeling In-Depth: A Time Series Example
Cassandra Day Atlanta 2015: Data Modeling In-Depth: A Time Series ExampleCassandra Day Atlanta 2015: Data Modeling In-Depth: A Time Series Example
Cassandra Day Atlanta 2015: Data Modeling In-Depth: A Time Series Example
 
Big data analytics with Spark & Cassandra
Big data analytics with Spark & Cassandra Big data analytics with Spark & Cassandra
Big data analytics with Spark & Cassandra
 
Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)
Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)
Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)
 
Spark and Cassandra 2 Fast 2 Furious
Spark and Cassandra 2 Fast 2 FuriousSpark and Cassandra 2 Fast 2 Furious
Spark and Cassandra 2 Fast 2 Furious
 
SQL-on-Hadoop with Apache Drill
SQL-on-Hadoop with Apache DrillSQL-on-Hadoop with Apache Drill
SQL-on-Hadoop with Apache Drill
 
Strata Presentation: One Billion Objects in 2GB: Big Data Analytics on Small ...
Strata Presentation: One Billion Objects in 2GB: Big Data Analytics on Small ...Strata Presentation: One Billion Objects in 2GB: Big Data Analytics on Small ...
Strata Presentation: One Billion Objects in 2GB: Big Data Analytics on Small ...
 
Time series with Apache Cassandra - Long version
Time series with Apache Cassandra - Long versionTime series with Apache Cassandra - Long version
Time series with Apache Cassandra - Long version
 
Introduction to Riak and Ripple (KC.rb)
Introduction to Riak and Ripple (KC.rb)Introduction to Riak and Ripple (KC.rb)
Introduction to Riak and Ripple (KC.rb)
 
Analytics with Cassandra & Spark
Analytics with Cassandra & SparkAnalytics with Cassandra & Spark
Analytics with Cassandra & Spark
 
Escape From Hadoop: Spark One Liners for C* Ops
Escape From Hadoop: Spark One Liners for C* OpsEscape From Hadoop: Spark One Liners for C* Ops
Escape From Hadoop: Spark One Liners for C* Ops
 
Data Exploration with Apache Drill: Day 1
Data Exploration with Apache Drill:  Day 1Data Exploration with Apache Drill:  Day 1
Data Exploration with Apache Drill: Day 1
 
Riak with node.js
Riak with node.jsRiak with node.js
Riak with node.js
 

Similaire à Survey of Accumulo Techniques for Indexing Data

PHP and MySQL.pptx
PHP and MySQL.pptxPHP and MySQL.pptx
PHP and MySQL.pptxnatesanp1234
 
Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson
Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena EdelsonStreaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson
Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena EdelsonSpark Summit
 
Distributed Data Quality - Technical Solutions for Organizational Scaling
Distributed Data Quality - Technical Solutions for Organizational ScalingDistributed Data Quality - Technical Solutions for Organizational Scaling
Distributed Data Quality - Technical Solutions for Organizational ScalingJustin Cunningham
 
Cassandra Data Modeling - Practical Considerations @ Netflix
Cassandra Data Modeling - Practical Considerations @ NetflixCassandra Data Modeling - Practical Considerations @ Netflix
Cassandra Data Modeling - Practical Considerations @ Netflixnkorla1share
 
C19013010 the tutorial to build shared ai services session 2
C19013010 the tutorial to build shared ai services session 2C19013010 the tutorial to build shared ai services session 2
C19013010 the tutorial to build shared ai services session 2Bill Liu
 
Managing Your Content with Elasticsearch
Managing Your Content with ElasticsearchManaging Your Content with Elasticsearch
Managing Your Content with ElasticsearchSamantha Quiñones
 
Querying datasets on the Web with high availability
Querying datasets on the Web with high availabilityQuerying datasets on the Web with high availability
Querying datasets on the Web with high availabilityRuben Verborgh
 
AWS Cloud Experience CA: Bases de Datos en AWS: distintas necesidades, distin...
AWS Cloud Experience CA: Bases de Datos en AWS: distintas necesidades, distin...AWS Cloud Experience CA: Bases de Datos en AWS: distintas necesidades, distin...
AWS Cloud Experience CA: Bases de Datos en AWS: distintas necesidades, distin...Amazon Web Services LATAM
 
How Rackspace Cloud Monitoring uses Cassandra
How Rackspace Cloud Monitoring uses CassandraHow Rackspace Cloud Monitoring uses Cassandra
How Rackspace Cloud Monitoring uses Cassandragdusbabek
 
Big Data Solutions in Azure - David Giard
Big Data Solutions in Azure - David GiardBig Data Solutions in Azure - David Giard
Big Data Solutions in Azure - David GiardITCamp
 
All course slides.pdf
All course slides.pdfAll course slides.pdf
All course slides.pdfssuser98bffa1
 
Deconstructing Lambda
Deconstructing LambdaDeconstructing Lambda
Deconstructing Lambdadarach
 
Deploying your Data Warehouse on AWS
Deploying your Data Warehouse on AWSDeploying your Data Warehouse on AWS
Deploying your Data Warehouse on AWSAmazon Web Services
 
LA Cassandra Day 2015 - Cassandra for developers
LA Cassandra Day 2015  - Cassandra for developersLA Cassandra Day 2015  - Cassandra for developers
LA Cassandra Day 2015 - Cassandra for developersChristopher Batey
 
Streaming Transformations - Putting the T in Streaming ETL
Streaming Transformations - Putting the T in Streaming ETLStreaming Transformations - Putting the T in Streaming ETL
Streaming Transformations - Putting the T in Streaming ETLconfluent
 
Slide presentation pycassa_upload
Slide presentation pycassa_uploadSlide presentation pycassa_upload
Slide presentation pycassa_uploadRajini Ramesh
 
The Great Lakes: How to Approach a Big Data Implementation
The Great Lakes: How to Approach a Big Data ImplementationThe Great Lakes: How to Approach a Big Data Implementation
The Great Lakes: How to Approach a Big Data ImplementationInside Analysis
 
Understanding the Data Lookup Pattern
Understanding the Data Lookup PatternUnderstanding the Data Lookup Pattern
Understanding the Data Lookup PatternCraig Dunn
 
Amazon Elastic Map Reduce - Ian Meyers
Amazon Elastic Map Reduce - Ian MeyersAmazon Elastic Map Reduce - Ian Meyers
Amazon Elastic Map Reduce - Ian Meyershuguk
 
Portfolio Oversight With eazyBI
Portfolio Oversight With eazyBIPortfolio Oversight With eazyBI
Portfolio Oversight With eazyBIeazyBI
 

Similaire à Survey of Accumulo Techniques for Indexing Data (20)

PHP and MySQL.pptx
PHP and MySQL.pptxPHP and MySQL.pptx
PHP and MySQL.pptx
 
Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson
Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena EdelsonStreaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson
Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson
 
Distributed Data Quality - Technical Solutions for Organizational Scaling
Distributed Data Quality - Technical Solutions for Organizational ScalingDistributed Data Quality - Technical Solutions for Organizational Scaling
Distributed Data Quality - Technical Solutions for Organizational Scaling
 
Cassandra Data Modeling - Practical Considerations @ Netflix
Cassandra Data Modeling - Practical Considerations @ NetflixCassandra Data Modeling - Practical Considerations @ Netflix
Cassandra Data Modeling - Practical Considerations @ Netflix
 
C19013010 the tutorial to build shared ai services session 2
C19013010 the tutorial to build shared ai services session 2C19013010 the tutorial to build shared ai services session 2
C19013010 the tutorial to build shared ai services session 2
 
Managing Your Content with Elasticsearch
Managing Your Content with ElasticsearchManaging Your Content with Elasticsearch
Managing Your Content with Elasticsearch
 
Querying datasets on the Web with high availability
Querying datasets on the Web with high availabilityQuerying datasets on the Web with high availability
Querying datasets on the Web with high availability
 
AWS Cloud Experience CA: Bases de Datos en AWS: distintas necesidades, distin...
AWS Cloud Experience CA: Bases de Datos en AWS: distintas necesidades, distin...AWS Cloud Experience CA: Bases de Datos en AWS: distintas necesidades, distin...
AWS Cloud Experience CA: Bases de Datos en AWS: distintas necesidades, distin...
 
How Rackspace Cloud Monitoring uses Cassandra
How Rackspace Cloud Monitoring uses CassandraHow Rackspace Cloud Monitoring uses Cassandra
How Rackspace Cloud Monitoring uses Cassandra
 
Big Data Solutions in Azure - David Giard
Big Data Solutions in Azure - David GiardBig Data Solutions in Azure - David Giard
Big Data Solutions in Azure - David Giard
 
All course slides.pdf
All course slides.pdfAll course slides.pdf
All course slides.pdf
 
Deconstructing Lambda
Deconstructing LambdaDeconstructing Lambda
Deconstructing Lambda
 
Deploying your Data Warehouse on AWS
Deploying your Data Warehouse on AWSDeploying your Data Warehouse on AWS
Deploying your Data Warehouse on AWS
 
LA Cassandra Day 2015 - Cassandra for developers
LA Cassandra Day 2015  - Cassandra for developersLA Cassandra Day 2015  - Cassandra for developers
LA Cassandra Day 2015 - Cassandra for developers
 
Streaming Transformations - Putting the T in Streaming ETL
Streaming Transformations - Putting the T in Streaming ETLStreaming Transformations - Putting the T in Streaming ETL
Streaming Transformations - Putting the T in Streaming ETL
 
Slide presentation pycassa_upload
Slide presentation pycassa_uploadSlide presentation pycassa_upload
Slide presentation pycassa_upload
 
The Great Lakes: How to Approach a Big Data Implementation
The Great Lakes: How to Approach a Big Data ImplementationThe Great Lakes: How to Approach a Big Data Implementation
The Great Lakes: How to Approach a Big Data Implementation
 
Understanding the Data Lookup Pattern
Understanding the Data Lookup PatternUnderstanding the Data Lookup Pattern
Understanding the Data Lookup Pattern
 
Amazon Elastic Map Reduce - Ian Meyers
Amazon Elastic Map Reduce - Ian MeyersAmazon Elastic Map Reduce - Ian Meyers
Amazon Elastic Map Reduce - Ian Meyers
 
Portfolio Oversight With eazyBI
Portfolio Oversight With eazyBIPortfolio Oversight With eazyBI
Portfolio Oversight With eazyBI
 

Survey of Accumulo Techniques for Indexing Data

  • 1. SURVEY OF ACCUMULO TECHNIQUES FOR INDEXING DATA Donald Miner @donaldpminer January 18th, 2015
  • 3. The Apache Accumulo sorted, distributed key/value store is a robust, scalable, high performance data storage and retrieval system.
  • 4. Adelaide Bartkowski, Alyssa Files, Beatriz Palmore, Cecilia Ours, Craig Avalos, Dianna Lapointe, Erma Davis, Fermina Smead, Garrett Harsh, Gaylene Sherry, Gilberto Pardue, Hui Nodal, Janell Tomita, Jannette Betters, Jeana Delk, Madlyn Radke, Peggie Allis, Rhona Zygmont, Tran Degarmo, Wilhelmina Papp
  • 5. The same names split into three ranges:
        -inf to D:  Adelaide Bartkowski, Alyssa Files, Beatriz Palmore, Cecilia Ours, Craig Avalos, Dianna Lapointe
        E to H:     Erma Davis, Fermina Smead, Garrett Harsh, Gaylene Sherry, Gilberto Pardue, Hui Nodal
        J to +inf:  Janell Tomita, Jannette Betters, Jeana Delk, Madlyn Radke, Peggie Allis, Rhona Zygmont, Tran Degarmo, Wilhelmina Papp
  • 6. Accumulo Master; TabletServer, TabletServer, TabletServer; ZooKeeper
  • 7. KEY                   VALUE
        Adelaide Bartkowski   91294124
        Alyssa Files          491294
        Beatriz Palmore       4124124124
        Cecilia Ours          419120
        Craig Avalos          940124
        Dianna Lapointe       4921
        Erma Davis            050194
        Fermina Smead         10024599949
        Garrett Harsh         140095931
        Gaylene Sherry        914815
        Gilberto Pardue       412414124124
        Hui Nodal             962195192
        Janell Tomita         12121
        Jannette Betters      9192012
        Jeana Delk            9120150
        Madlyn Radke          4921
        Peggie Allis          944944
        Rhona Zygmont         123103
        Tran Degarmo          9499494
        Wilhelmina Papp       11221
      Lookup “Garrett Harsh” (a key): FAST
      Lookup “4921” (a value): SLOW
  • 13. MIT Lincoln Lab study: 100 million inserts per second using Accumulo
        http://arxiv.org/ftp/arxiv/papers/1406/1406.4923.pdf
        http://sqrrl.com/media/Accumulo-Benchmark-10312013-1.pdf
      Booz Allen Hamilton study:
        • 942 tablet servers, 7.56 trillion entries, 408 TB, 26 hours
        • 94 MB/sec, 15 TB/hr, 80 million inserts per second
        • 11 tablet servers went down with no interruption
        • Showed linear scalability for write throughput
        • 22,000 queries per second
  • 14. HBase vs. Accumulo
        • Subtle yet important differences in visibility implementation
        • Coprocessors vs. Iterators
        • Accumulo has faster write throughput*
        • HBase’s reads are faster*
        • HBase has more ecosystem integration
        • Accumulo can shift around column families and locality groups after the fact
        • Accumulo has been shown to work with no problems at 1,000 nodes (BAH paper). Facebook and others run a “cell” design for HBase. Largest clusters are in the hundreds*.
      * We believe. Disclaimer: I am biased.
  • 15. Column Visibility Syntax
        Label               Description
        A & B               Both ‘A’ and ‘B’ are required
        A | B               Either ‘A’ or ‘B’ is required
        A & (C | B)         ‘A’ and ‘C’ or ‘A’ and ‘B’ is required
        A | (B & C)         ‘A’ or ‘B’ and ‘C’ is required
        (A | B) & (C & D)   ?
        A & (B & (C | D))   ?
      Examples:
        Patient has schizophrenia: insurer | MD & psych
        Patient has stomach ulcers: insurer | doctor
        Patient has cavity: insurer | dentist
        Patient has consent for general anesthesia: surgeon
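The label semantics on the Column Visibility Syntax slide can be illustrated with a small evaluator. This is a minimal sketch in plain Java, not Accumulo's own implementation: the class name `VisibilityCheck` and method `canSee` are hypothetical, tokens are assumed to be alphanumeric, and the sketch rejects expressions that mix `&` and `|` at one level without parentheses, following the editor's note that parentheses are required for nested logic.

```java
import java.util.Set;

// Hypothetical helper: evaluates a visibility label against a user's tokens.
final class VisibilityCheck {
    private final String expr;
    private final Set<String> auths;
    private int pos = 0;

    private VisibilityCheck(String expr, Set<String> auths) {
        this.expr = expr.replace(" ", ""); // ignore spaces around operators
        this.auths = auths;
    }

    static boolean canSee(String expression, Set<String> auths) {
        VisibilityCheck p = new VisibilityCheck(expression, auths);
        boolean result = p.parseExpr();
        if (p.pos != p.expr.length()) {
            throw new IllegalArgumentException("unexpected input at " + p.pos);
        }
        return result;
    }

    // expr := term (op term)*, where every op at one level is the same kind
    private boolean parseExpr() {
        boolean result = parseTerm();
        char op = 0;
        while (pos < expr.length() && (expr.charAt(pos) == '&' || expr.charAt(pos) == '|')) {
            char next = expr.charAt(pos++);
            if (op != 0 && op != next) {
                throw new IllegalArgumentException("mixing & and | requires parentheses");
            }
            op = next;
            boolean rhs = parseTerm();
            result = (op == '&') ? (result && rhs) : (result || rhs);
        }
        return result;
    }

    // term := '(' expr ')' | token
    private boolean parseTerm() {
        if (pos < expr.length() && expr.charAt(pos) == '(') {
            pos++;
            boolean inner = parseExpr();
            if (pos >= expr.length() || expr.charAt(pos) != ')') {
                throw new IllegalArgumentException("missing )");
            }
            pos++;
            return inner;
        }
        int start = pos;
        while (pos < expr.length() && Character.isLetterOrDigit(expr.charAt(pos))) {
            pos++;
        }
        if (start == pos) {
            throw new IllegalArgumentException("expected a token at " + pos);
        }
        return auths.contains(expr.substring(start, pos));
    }

    public static void main(String[] args) {
        System.out.println(canSee("(A | B) & (C & D)", Set.of("B", "C", "D")));
    }
}
```

Under these assumptions, the two open questions on the slide can be answered by trying different token sets: `(A | B) & (C & D)` needs C and D plus at least one of A or B.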
  • 16. More cool features
        • Iterator framework: customizable server-side processing
        • Constraints: user-defined Java functions that allow or prevent new writes based on a condition
        • Large rows: no limit on data stored in a row
        • MapReduce InputFormats
        • Thrift proxy: access Accumulo through Ruby, Python, …
        • Monitor page: shows performance, status, errors, more
        • Locality groups: group column families together on disk for performance tuning (changeable later)
        • On-HDFS at rest encryption (work in progress)
        • Table import and export
  • 17. Scalability & Performance
        • Multiple HDFS volumes: Accumulo can use multiple NameNodes to store its data
        • Master stores metadata in an Accumulo table
        • Native in-memory map: data is first written into a buffer written in C++, outside of Java
        • Relative encoding: consecutive keys with the same values are flagged instead of rewritten
        • Scan pipelines: stages of the read path are parallelized into separate threads
        • Caching: data recently scanned is cached
  • 19. Data Model
      KEY: ROW ID, COLUMN (FAMILY, QUALIFIER, VISIBILITY), TIMESTAMP. Each key maps to a VALUE.
        Row ID  Family   Qualifier  Visibility       Timestamp  Value
        derek   …        …          …                …          …
        don     contact  email      admin | private  11905014   dminer@gopivotal.com
        don     contact  email      admin | private  12412412   dminer@clearedgeit.com
        don     contact  email      public           12412412   dm…@cl....com
        don     contact  twitter    public           12423523   @donaldpminer
        don     info     height     public           12314514   5’ 9”
        don     info     SSN        private          12314514   123-45-6789
        erica   …        …          …                …          …
  • 20. Data Model (same table as slide 19): Lookup key
  • 21. Data Model (same table as slide 19): Collection of data that is kept together
  • 22. Data Model (same table as slide 19): What the data is
  • 23. Data Model (same table as slide 19): Who can see the data
  • 24. Data Model (same table as slide 19): When the data was created
  • 25. Data Model (same table as slide 19): UNIQUENESS
  • 26. Data Model (same table as slide 19): SORTED
  • 27. Data Model (same table as slide 19): Some piece of information
  • 28. Writing data into Accumulo. The user table is the one from slide 19; the new entry to write is:
        Row ID  Family  Qualifier  Visibility  Timestamp  Value
        don     info    picture    public      13119103   dd3ae1d3b951a33f…
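  • 29. Writing data into Accumulo
        Row ID  Family  Qualifier  Visibility  Timestamp  Value
        don     info    picture    public      13119103   dd3ae1d3b951a33f…

        import org.apache.accumulo.core.client.BatchWriter;
        import org.apache.accumulo.core.client.BatchWriterConfig;
        import org.apache.accumulo.core.data.Mutation;
        import org.apache.accumulo.core.data.Value;
        import org.apache.accumulo.core.security.ColumnVisibility;
        import org.apache.hadoop.io.Text;

        // Build the key parts and the value for the new entry.
        Text rowID = new Text("don");
        Text colFam = new Text("info");
        Text colQual = new Text("picture");
        ColumnVisibility colVis = new ColumnVisibility("public");
        long timestamp = System.currentTimeMillis();
        Value value = new Value(MyPictureObj.getBytes());
        // A Mutation collects the changes to apply to one row.
        Mutation mutation = new Mutation(rowID);
        mutation.put(colFam, colQual, colVis, timestamp, value);
        // conn is an existing Connector to the Accumulo instance.
        BatchWriterConfig config = new BatchWriterConfig();
        BatchWriter writer = conn.createBatchWriter("usertable", config);
        writer.addMutation(mutation);
        writer.close();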
  • 30. Writing data into Accumulo. The new entry (don, info, picture) now appears in the table in sorted position:
        Row ID  Family   Qualifier  Visibility       Timestamp  Value
        derek   …        …          …                …          …
        don     contact  email      admin | private  11905014   dminer@gopivotal.com
        don     contact  email      admin | private  12412412   dminer@clearedgeit.com
        don     contact  email      public           12412412   dm…@cl....com
        don     contact  twitter    public           12423523   @donaldpminer
        don     info     height     public           12314514   5’ 9”
        don     info     picture    public           13119103   dd3ae1d3b951a33f…
        don     info     SSN        private          12314514   123-45-6789
        erica   …        …          …                …          …
  • 31. Reading data (same table as slide 30)
        Range: don-don    Family: info    Visibilities: public
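  • 32. Reading data
        Range: don-don    Family: info    Visibilities: public

        import java.util.Map.Entry;
        import org.apache.accumulo.core.client.Scanner;
        import org.apache.accumulo.core.data.Key;
        import org.apache.accumulo.core.data.Range;
        import org.apache.accumulo.core.data.Value;
        import org.apache.accumulo.core.security.Authorizations;
        import org.apache.hadoop.io.Text;

        // Scan row "don", family "info", with the "public" authorization.
        Authorizations auths = new Authorizations("public");
        Scanner scan = conn.createScanner("usertable", auths);
        scan.setRange(new Range("don", "don"));
        scan.fetchColumnFamily(new Text("info"));
        for (Entry<Key,Value> entry : scan) {
            String row = entry.getKey().getRow().toString();
            Value value = entry.getValue();
        }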
  • 33. Reading data (same table as slide 30)
        Range: don-don    Family: info    Visibilities: public, user, tech
  • 34. Reading data (same table as slide 30)
        Range: don-don    Visibilities: public, user, tech
  • 35. Reading data: Scan (same table as slide 30)
        Range: d-e    Visibilities: public, user, tech
  • 37. Basic Structured Data
        Row ID  Column Family  Column Qualifier  Column Visibility  Timestamp  Value
        bob     attribute      height            public             Jun 2012   5’11”
        bob     attribute      surname           public             Jul 2013   doe
        bob     insurance      dental            private            Sep 2009   MetLife
        jane    attribute      bloodType         public             Jul 2011   ab-
        jane    attribute      surname           public             Aug 2013   doe
        jane    contact        cellPhone         public             Dec 2010   (808) 345-9876
        jane    insurance      vision            private            Jan 2008   VSP
        john    allergy        major             private            Feb 1988   amoxicillin
        john    attribute      weight            public             Sep 2013   180
        john    contact        homeAddr          public             Mar 2003   34 Baker LN
  • 38. Basic Structured Data (same table as slide 37)
  • 39. Indexing Everything
      Event Table:
        Row ID  Column Fam  Column Qual         Visibility  Time  value
      Index Table:
        index   Column Fam  Column Qual:Row ID  Visibility  Time  -
        to      Column Fam  Column Qual:Row ID  Visibility  Time  -
        values  Column Fam  Column Qual:Row ID  Visibility  Time  -
  • 40. Index Table
        Row ID          Column Family  Column Qualifier  Column Visibility  Timestamp  Value
        (808) 345-9876  contact        cellPhone:jane    public             Dec 2010   -
        180             attribute      weight:john       public             Sep 2013   -
        34 Baker LN     contact        homeAddr:john     public             Mar 2003   -
        5’11”           attribute      height:bob        public             Jun 2012   -
        MetLife         insurance      dental:bob        private            Sep 2009   -
        VSP             insurance      vision:jane       private            Jan 2008   -
        ab-             attribute      bloodType:jane    public             Jul 2011   -
        amoxicillin     allergy        major:john        private            Feb 1988   -
        doe             attribute      surname:bob       public             Jul 2013   -
        doe             attribute      surname:jane      public             Aug 2013   -
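The index table on slide 40 moves each value into the Row ID and appends the original Row ID to the qualifier, so a lookup by value becomes a sorted-key lookup. The following is a rough sketch of that transformation in plain Java, with a `TreeSet` standing in for Accumulo's sorted storage. The class `IndexTableSketch` and its methods are hypothetical, visibility and timestamp are left out for brevity, and row IDs are assumed not to contain a colon.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical in-memory stand-in for an index table.
final class IndexTableSketch {
    private static final char SEP = '\u0000'; // separates key parts
    private final TreeSet<String> index = new TreeSet<>();

    // Event cell (row, family, qualifier, value) becomes
    // index key (value, family, qualifier:row).
    void indexCell(String row, String family, String qualifier, String value) {
        index.add(value + SEP + family + SEP + qualifier + ":" + row);
    }

    // Return the original row IDs whose cells hold exactly this value.
    List<String> rowsWithValue(String value) {
        String start = value + SEP;
        List<String> rows = new ArrayList<>();
        for (String key : index.tailSet(start)) {
            if (!key.startsWith(start)) {
                break; // walked past the last key for this value
            }
            rows.add(key.substring(key.lastIndexOf(':') + 1));
        }
        return rows;
    }

    public static void main(String[] args) {
        IndexTableSketch idx = new IndexTableSketch();
        idx.indexCell("bob", "attribute", "surname", "doe");
        idx.indexCell("jane", "attribute", "surname", "doe");
        idx.indexCell("john", "allergy", "major", "amoxicillin");
        System.out.println(idx.rowsWithValue("doe"));
    }
}
```

Because the value leads the key, the lookup only touches the contiguous run of keys that start with that value, which is what turns the SLOW value lookup from the introduction into a FAST one.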
  • 42. Data Lake
      Tables: PATIENTS, MEDICINES, DOCTORS, INDEX
      Tell me everything you know of amoxicillin.
      Index lookup: amoxicillin
  • 43. Data Lake
      Tables: PATIENTS, DISEASES, DOCTORS, INDEX
      Index lookup: amoxicillin
        bob:allergy:amoxicillin
        larry:takes:amoxicillin
        Stomach ulcer:treatment:amoxicillin
        smith:prescribed:amoxicillin
        Infection:treatment:amoxicillin
        Diarrhea:side effect:amoxicillin
      Visibility labels help converge data sources but still protect who can see them.
  • 44. Graphs
      Graph with nodes a, b, c, d, e.
      Adjacency matrix (columns: Start Nodes, rows: End Nodes):
              a  b  c  d  e
          a   -     1
          b   1  -
          c         -  1
          d   1     1  -  1
          e               -
      Edge table:
        Row ID  Column Family  Column Qualifier  Value
        a       edge           b                 1
        a       edge           d                 1
        c       edge           a                 1
        c       edge           d                 1
        d       edge           c                 1
        e       edge           d                 1
      • Random walk
      • Neighborhoods
      • Traversals
      Each edge can have a visibility label!
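The edge table on the Graphs slide stores one entry per edge, keyed by start node, so a node's neighborhood is a single row read and a traversal is a sequence of such reads. Below is a minimal sketch of that idea in plain Java, using a `TreeMap` in place of an Accumulo table; the class `EdgeTableSketch` and its methods are hypothetical names, and visibility labels are omitted.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical stand-in: row ID = start node, qualifier = end node.
final class EdgeTableSketch {
    private final TreeMap<String, TreeSet<String>> rows = new TreeMap<>();

    void addEdge(String from, String to) {
        rows.computeIfAbsent(from, k -> new TreeSet<>()).add(to);
    }

    // Neighborhood: every qualifier stored under one row ID.
    Set<String> neighbors(String node) {
        return rows.getOrDefault(node, new TreeSet<>());
    }

    // Breadth-first traversal: one row read per visited node.
    List<String> reachable(String start) {
        List<String> order = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        seen.add(start);
        queue.add(start);
        while (!queue.isEmpty()) {
            String node = queue.remove();
            order.add(node);
            for (String next : neighbors(node)) {
                if (seen.add(next)) {
                    queue.add(next);
                }
            }
        }
        return order;
    }

    public static void main(String[] args) {
        EdgeTableSketch g = new EdgeTableSketch();
        String[][] edges = {{"a", "b"}, {"a", "d"}, {"c", "a"}, {"c", "d"}, {"d", "c"}, {"e", "d"}};
        for (String[] e : edges) {
            g.addEdge(e[0], e[1]);
        }
        System.out.println(g.reachable("a"));
    }
}
```

With the six edges from the slide, node e has an outgoing edge but no incoming one, so a traversal that starts at a never visits it.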
  • 45. Term-Partitioned Index
      Query: “baseball” AND “football” AND “soccer”
      Tablet Server 1 (knows about the term “baseball”):
        Row ID    Column Family  Value
        baseball  document       docid_3
        baseball  document       docid_2
        bat       document       docid_2
        RESULTS: [docid_2, docid_3]
      Tablet Server 2 (knows about the term “football”):
        Row ID    Column Family  Value
        football  document       docid_1
        football  document       docid_3
        glove     document       docid_1
        RESULTS: [docid_1, docid_3]
      Tablet Server 3 (knows about the term “soccer”):
        Row ID    Column Family  Value
        nba       document       docid_1
        shoes     document       docid_1
        soccer    document       docid_3
        RESULTS: [docid_3]
      Client: client-side set intersection of [docid_2, docid_3], [docid_1, docid_3], [docid_3]
      Visibility labels allow protected search.
      Iterators can maintain stats about docs.
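The client-side step of the term-partitioned query on slide 45 is a plain set intersection over the per-term result lists. A small sketch in Java, where the class `TermIntersectionSketch` and method `intersect` are hypothetical and the per-term document ID sets are assumed to have already been fetched from the tablet servers:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical client-side helper for an AND query over several terms.
final class TermIntersectionSketch {
    // Keep only the document IDs that every term's result set contains.
    static Set<String> intersect(List<Set<String>> perTermResults) {
        if (perTermResults.isEmpty()) {
            return new TreeSet<>();
        }
        Set<String> result = new TreeSet<>(perTermResults.get(0));
        for (Set<String> docs : perTermResults.subList(1, perTermResults.size())) {
            result.retainAll(docs);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Set<String>> results = Arrays.asList(
                Set.of("docid_2", "docid_3"),   // baseball
                Set.of("docid_1", "docid_3"),   // football
                Set.of("docid_3"));             // soccer
        System.out.println(intersect(results));
    }
}
```

Every per-term set has to reach the client in full before the intersection can run, which is the network and memory cost the editor's notes call out for this design.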
  • 47. Geospatial Indexing: Z-Order Curve
      33.333W, 55.555N = 3535.353535
      3535.353535 is the row key
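The row key on the Z-order slide comes from interleaving the digits of the two coordinates, so that points close together in both dimensions tend to share a key prefix. The sketch below reproduces only the slide's decimal illustration. The class `ZOrderKey` is hypothetical, and it assumes both coordinates are unsigned, fixed-width strings with the same number of digits on each side of the decimal point; practical implementations usually interleave the bits of normalized coordinates instead.

```java
// Hypothetical helper that mirrors the slide's digit-interleaving example.
final class ZOrderKey {
    // Alternate characters: a0 b0 a1 b1 ...
    static String interleave(String a, String b) {
        if (a.length() != b.length()) {
            throw new IllegalArgumentException("inputs must be padded to equal width");
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < a.length(); i++) {
            sb.append(a.charAt(i)).append(b.charAt(i));
        }
        return sb.toString();
    }

    // Interleave the integer parts and the fraction parts separately.
    static String rowKey(String longitude, String latitude) {
        String[] lon = longitude.split("\\.");
        String[] lat = latitude.split("\\.");
        return interleave(lon[0], lat[0]) + "." + interleave(lon[1], lat[1]);
    }

    public static void main(String[] args) {
        System.out.println(rowKey("33.333", "55.555"));
    }
}
```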
  • 48. Temporal Indexing
        Row ID    Column Family  Column Qualifier  Value
        Router37  2014-12        1418624102        cold
        Router37  2015-01        1421633979        cold
        Router37  2015-01        1421634319        hot
        Router37  2015-01        1421635001        cold
        Server92  2014-12        1418555102        cold
        Server92  2014-12        1418556999        hot
        Server92  2014-12        1418651002        cold
        Server92  2014-12        1418756987        hot
        Server92  2014-12        1418853304        cold
        Server98  2014-12        1418555104        cold
        Server98  2015-01        1421633319        cold
      Note: Dynamically adding column families
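In the temporal layout on slide 48, the device is the Row ID, the month bucket is the column family, and the epoch timestamp is the qualifier, so one device's readings for one month sit next to each other in time order. A rough sketch of that key layout in plain Java, with a `TreeMap` standing in for the sorted table; the class `TemporalIndexSketch` and its methods are hypothetical, and timestamps are assumed to be non-negative epoch seconds of at most ten digits.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical stand-in: key = device, month bucket, epoch seconds.
final class TemporalIndexSketch {
    private static final char SEP = '\u0000'; // separates key parts
    private final TreeMap<String, String> table = new TreeMap<>();

    void put(String device, String month, long epochSeconds, String value) {
        // Zero-pad so that string order matches numeric order.
        String qualifier = String.format("%010d", epochSeconds);
        table.put(device + SEP + month + SEP + qualifier, value);
    }

    // All readings for one device in one month bucket, oldest first.
    List<String> readingsFor(String device, String month) {
        String prefix = device + SEP + month + SEP;
        return new ArrayList<>(table.subMap(prefix, prefix + Character.MAX_VALUE).values());
    }

    public static void main(String[] args) {
        TemporalIndexSketch t = new TemporalIndexSketch();
        t.put("Router37", "2014-12", 1418624102L, "cold");
        t.put("Router37", "2015-01", 1421633979L, "cold");
        t.put("Router37", "2015-01", 1421634319L, "hot");
        t.put("Router37", "2015-01", 1421635001L, "cold");
        System.out.println(t.readingsFor("Router37", "2015-01"));
    }
}
```

A new month needs no schema change in this layout: the first write for `2015-02` simply introduces a new column family, which matches the slide's note about adding column families dynamically.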
  • 49. Resources
        • Apache Accumulo website: accumulo.apache.org
        • Accumulo Summit 2014: accumulosummit.com, slideshare.net/AccumuloSummit
        • Accumulo Summit 2015 (end of April!): accumulosummit.com

Editor's notes

  1. This talk will go over table design and row key design approaches for indexing large amounts of data in Apache Accumulo. We'll do an overview of how to store geographical data, entity relationship graphs, natural language text, numbers, and more in Accumulo. This will serve as a starting point to learning how to effectively store different types of data in Accumulo as well as showcase the capabilities of Accumulo for handling varying situations.
  2. Two basic operators AND operator represented by & OR operator represented by | In the examples A,B, C, and D are security tokens Security Tokens are strings of alphanumeric characters Tokens are user defined Parenthesis are required to use nested logic
  3. This data is our original health care data. Rows in red are only viewable to users with the “private” authorization.
  4. This data is our original health care data. Rows in red are only viewable to users with the “private” authorization.
  5. It is easy to create a text index by splitting large values into constituent words. This is similar to the previous example, except we are indexing unstructured data (text) instead of structured data (single value).
  6. After indexing our data, the Row IDs store data of all of our different types. It may be necessary to transform these values to get the desired sort order. It may also be handy to prepend type information to the Row ID, like INT or CHAR.
  7. Matrix representation is efficient for densely connected graphs. A matrix allows the weights of edges to be stored. It is not great for sparse graphs, since most cells will be empty or null.
  8. Process: Tablets are partitioned on row boundaries; in this example, the row boundaries are the terms. Tablets are assigned to one Tablet Server each, distributing document information across many servers. To perform a query that searches for multiple terms, all of the Tablet Servers need to be searched for each term. Each Tablet Server returns a list of document IDs that contain the terms. The client needs to perform a set intersection to determine which documents were returned from all of the Tablet Servers. Problems: This method requires a lot of network traffic to return all the documents found from each Tablet Server to the client. The client will filter out many documents, probably a majority of what is returned. If a lot of documents are contained in the table, the client could run out of memory before completing the intersection.
  9. After indexing our data, the Row IDs store data of all of our different types. It may be necessary to transform these values to get the desired sort order. It may also be handy to prepend type information to the Row ID, like INT or CHAR.
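
One common way to get the desired sort order for numeric values is to zero-pad them to a fixed width, so that byte-wise (lexicographic) order matches numeric order. The sketch below also prepends a type tag as the note suggests. The `INT:` prefix format, the 12-digit width, and the function names are assumptions chosen for this example, and it only handles non-negative integers.

```python
def encode_int_row_id(value: int, width: int = 12) -> str:
    """Encode a non-negative integer as a type-tagged, zero-padded row ID."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    digits = f"{value:0{width}d}"
    if len(digits) > width:
        raise ValueError("value does not fit in the chosen width")
    return "INT:" + digits


def decode_int_row_id(row_id: str) -> int:
    """Reverse encode_int_row_id."""
    prefix, digits = row_id.split(":", 1)
    if prefix != "INT":
        raise ValueError("not an INT row ID")
    return int(digits)


values = [91294124, 4921, 412414124124]

# Sorting the raw strings gives the wrong numeric order.
print(sorted(str(v) for v in values))
# ['412414124124', '4921', '91294124']

# Sorting the padded row IDs matches numeric order.
print(sorted(encode_int_row_id(v) for v in values))
# ['INT:000000004921', 'INT:000091294124', 'INT:412414124124']
```

The type tag keeps integers grouped together in the sorted table and lets a reader of the index tell how to decode each row ID.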