Apache Accumulo and the Convergence of
Machine Learning, Big Data, and Supercomputing
Jeremy Kepner1,2,3, Lauren Milechin4, Vijay Gadepally1,2, Dylan Hutchison5, Timothy Weale6
MIT Lincoln Laboratory Supercomputing Center1, MIT Computer Science & AI Laboratory2,
MIT Mathematics Department3, MIT Earth, Atmospheric and Planetary Sciences4,
University of Washington5, US Department of Defense6
Oct 2017
This material is based upon work supported by the National Science Foundation under Grant No.
DMS-1312831. Any opinions, findings, and conclusions or recommendations expressed in this material
are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
Slide - 2
•  Introduction & Accumulo
•  Mathematical Foundations
•  Scalable Performance
•  High Level Interface
•  Summary
Outline
Slide - 3
Why Accumulo? World Record Database Performance
Achieving 100,000,000 database inserts per second using Accumulo and D4M, Kepner et al, HPEC 2014
BAH = Booz Allen Hamilton
[Figure: world-record database inserts per second vs. total hardware nodes (log-log scale): Accumulo at 4M/s (MIT LL 2012), 108M/s (BAH 2013), and 115M/s (MIT LL 2014); Cassandra at 1M/s (Google 2014); Oracle at 140K/s (Oracle 2013)]
Slide - 4
Big Data
•  Challenge: Scale of data beyond what current approaches can handle
Social media, cyber networks, bioinformatics, ...
Supercomputing
•  Challenge: Analytics beyond what current approaches can handle
Engineering simulation, drug discovery, autonomy, ...
Machine Learning
•  Challenge: Diversity beyond what current approaches can handle
Computer vision, language processing, decision making, ...
Large Scale Computing: Challenges
[Images: hyperscale data centers, accelerators, specialized processors]
Slide - 5
Big Data
•  Challenge: Scale of data beyond what current approaches can handle
•  Accumulo: High performance ingest and scanning of vast amounts of data
Supercomputing
•  Challenge: Analytics beyond what current approaches can handle
•  Accumulo: Real-time filesystem, scheduler, and system monitoring
Machine Learning
•  Challenge: Diversity beyond what current approaches can handle
•  Hardware: Rapid storage and retrieval of training and model data
Large Scale Computing: Accumulo
Requires mathematically rigorous approaches to enable scalable solutions
Slide - 6
•  Introduction & Accumulo
•  Mathematical Foundations
•  Scalable Performance
•  High Level Interface
•  Summary
Outline
Slide - 7
Modern Database Paradigm Shifts
Relational Databases (SQL, 1970) → NoSQL (2006) → NewSQL (2010)
[Image: "A Relational Model of Data for Large Shared Data Banks," E. F. Codd, IBM Research Laboratory, San Jose, Communications of the ACM, Vol. 13, No. 6, June 1970]
Relational Model, E.F. Codd (1970)
[Image: "Bigtable: A Distributed Storage System for Structured Data," Chang et al., Google, OSDI 2006]
GRAPHULO
Google BigTable, Chang et al (2006)
[Image: "Scalable SQL and NoSQL Data Stores," Rick Cattell, originally published 2010, last revised December 2011]
NewSQL, Cattell (2010)
SQL Era: common interface (Prof. Stonebraker, U.C. Berkeley; Larry Ellison)
NoSQL Era: rapid ingest for internet search (Apache)
NewSQL Era: fast analytics inside databases (Prof. Stonebraker, MIT)
Future: polystore, high performance ingest and analytics (NSF & MIT)
SQL = Structured Query Language
NoSQL = Not only SQL
Slide - 8
Declarative, Mathematically Rigorous Interfaces
[Figure: citation graph among alice, bob, and carl; one breadth-first step expressed as v → ATv]
SQL: Set Operations | NoSQL: Graph Operations | NewSQL: Linear Algebra
Associative Array Algebra Provides a Unified Mathematics for SQL, NoSQL, NewSQL
Operations in all representations are equivalent and are linear systems
A = NxM(k1,k2,v,⊕)   (k1,k2,v) = A   C = AT   C = A ⊕ B   C = A ⊗ B   C = A B = A ⊕.⊗ B
     from    link    to
001  alice   cited   bob
002  bob     cited   alice
003  alice   cited   carl
Associative Array Model of SQL, NoSQL, and NewSQL Databases, Kepner et al, HPEC 2016
Mathematics of Big Data, Kepner & Jananthan, MIT Press 2017
Operation: finding Alice's nearest neighbors
SELECT to FROM T WHERE from = 'alice'
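As a rough illustration of this equivalence, here is a minimal sketch in plain Java (plain collections, not the D4M or Graphulo APIs; all names are illustrative). It stores the triples above as an associative array and finds Alice's nearest neighbors by reading row alice, i.e. the graph/linear-algebra form (ATv with a one-hot v) of the SQL query:

import java.util.*;

public class NearestNeighbors {
    public static void main(String[] args) {
        // Adjacency associative array: row key -> (column key -> value),
        // built from the (from, cited, to) triples above.
        Map<String, Map<String, Integer>> A = new HashMap<>();
        A.computeIfAbsent("alice", k -> new HashMap<>()).put("bob", 1);
        A.computeIfAbsent("bob", k -> new HashMap<>()).put("alice", 1);
        A.computeIfAbsent("alice", k -> new HashMap<>()).put("carl", 1);

        // A one-hot vector v selecting "alice" reduces ATv to reading
        // row "alice"; its non-zero columns are Alice's neighbors.
        Map<String, Integer> row = A.getOrDefault("alice", Map.of());
        System.out.println(row.keySet()); // prints [bob, carl] (order may vary)
    }
}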
Slide - 9
Modern Computers: Neural Networks, Language, Strategy, Vision
[Images: excerpts from 1955 Western Joint Computer Conference papers on neural networks, chess-playing strategy machines, and vision]
Slide - 10
•  Increased abstraction at deeper layers:
   yi+1 = h(Wi yi + bi)
   requires a non-linear function, such as h(y) = max(y, 0)
•  Naturally implemented in Graphulo
   Remark: can be rewritten in GraphBLAS as
   yi+1 = Wi yi ⊗ bi ⊕ 0
   where ⊕ = max and ⊗ = +
•  The DNN thus oscillates over two linear semirings:
   S1 = (ℝ, +, ×, 0, 1)
   S2 = (ℝ ∪ {-∞}, max, +, -∞, 0)
Example: Deep Neural Networks (DNNs)
[Figure: four-layer DNN y0 → y4 with weights W0..W3 and biases b0..b3, mapping input features to output classification via edges, object parts, and objects]
Mathematical Foundations of the GraphBLAS, Kepner et al, IEEE HPEC 2016
Mathematics of Big Data, Kepner & Jananthan, MIT Press 2017
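A minimal sketch in plain Java (not the GraphBLAS or Graphulo APIs; all names are illustrative) of one layer evaluated both ways: the conventional form h(Wi yi + bi) and the semiring form Wi yi ⊗ bi ⊕ 0 with ⊕ = max and ⊗ = + give the same result.

public class DnnLayer {
    // One DNN layer: the S1 = (R, +, x) matrix-vector product W*y, then the
    // S2 = (R ∪ {-∞}, max, +) steps: "⊗ b" adds the bias, "⊕ 0" is ReLU.
    static double[] layer(double[][] W, double[] y, double[] b) {
        double[] out = new double[W.length];
        for (int i = 0; i < W.length; i++) {
            double s = 0.0;
            for (int j = 0; j < y.length; j++)
                s += W[i][j] * y[j];       // S1: conventional (+, x) semiring
            s = s + b[i];                  // S2 "⊗ = +": apply bias
            out[i] = Math.max(s, 0.0);     // S2 "⊕ = max": h(y) = max(y, 0)
        }
        return out;
    }

    public static void main(String[] args) {
        double[][] W = {{1, -1}, {2, 0}};
        double[] y0 = {1, 2};
        double[] b0 = {0, -1};
        // W*y0 = [-1, 2]; plus b0 = [-1, 1]; max with 0 = [0, 1]
        System.out.println(java.util.Arrays.toString(layer(W, y0, b0)));
    }
}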
Slide - 11
GraphBLAS.org Standard for Sparse Matrix Math
•  5 key operations
A = NxM(i,j,v)   (i,j,v) = A   C = A ⊕ B   C = A ⊗ B   C = A B = A ⊕.⊗ B
•  That are composable
A ⊕ B = B ⊕ A (A ⊕ B) ⊕ C = A ⊕ (B ⊕ C) A ⊗ (B ⊕ C) = (A ⊗ B) ⊕ (A ⊗ C)
A ⊗ B = B ⊗ A (A ⊗ B) ⊗ C = A ⊗ (B ⊗ C) A (B ⊕ C) = (A B) ⊕ (A C)
•  Can be used to build standard GraphBLAS functions
buildMatrix, extractTuples, Transpose, mXm, mXv, vXm, extract, assign, eWiseAdd, ...
•  Can be used to build a variety of graph utility functions
Tril(), Triu(), Degree-Filtered BFS, …
•  Can be used to build a variety of graph algorithms
Triangle Counting, K-Truss, Jaccard Coefficient, Non-Negative Matrix Factorization, …
•  That work on a wide range of graphs
Hyper, multi-directed, multi-weighted, multi-partite, multi-edge graphs
The GraphBLAS C API Specification, Buluc, Mattson, McMillan, Moreira, Yang, GraphBLAS.org 2017
GraphBLAS Mathematics, Kepner, GraphBLAS.org 2017
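A minimal sketch of how ⊕ and ⊗ behave on sparse data, in plain Java over a map-of-entries representation (illustrative helper code, not the GraphBLAS C API): eWiseAdd unions the nonzero patterns while eWiseMult intersects them, which is what the composability identities above exercise.

import java.util.*;
import java.util.function.BinaryOperator;

public class EWise {
    // Sparse "matrix" as "row,col" -> value; missing entries are implicit zeros.
    // C = A ⊕ B: union of keys, ⊕-combining values present in both.
    static Map<String, Double> eWiseAdd(Map<String, Double> A, Map<String, Double> B,
                                        BinaryOperator<Double> plus) {
        Map<String, Double> C = new HashMap<>(A);
        B.forEach((k, v) -> C.merge(k, v, plus));
        return C;
    }

    // C = A ⊗ B: intersection of keys, ⊗-combining the two values.
    static Map<String, Double> eWiseMult(Map<String, Double> A, Map<String, Double> B,
                                         BinaryOperator<Double> times) {
        Map<String, Double> C = new HashMap<>();
        A.forEach((k, v) -> { if (B.containsKey(k)) C.put(k, times.apply(v, B.get(k))); });
        return C;
    }

    public static void main(String[] args) {
        Map<String, Double> A = Map.of("1,1", 2.0, "1,2", 3.0);
        Map<String, Double> B = Map.of("1,2", 5.0, "2,1", 7.0);
        System.out.println(eWiseAdd(A, B, Double::sum));      // values: (1,1)=2, (1,2)=8, (2,1)=7
        System.out.println(eWiseMult(A, B, (x, y) -> x * y)); // values: (1,2)=15
    }
}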
Slide - 12
GraphBLAS in Accumulo = Graphulo
•  Save Communication / Data Locality
•  Infrastructure Reuse: built into the existing database
•  Distributed Execution
•  Fast Access*
•  Free Database Features
Slide - 13
Accumulo Graph Schema Variants
•  Adjacency Matrix (directed/undirected/weighted graphs)
–  row = start vertex; column = end vertex; value = edge weight
•  Incidence Matrix (multi-hyper-graphs)
–  row = edge; column = vertices associated with edge; value = weight
•  D4M Schema
–  Standard: main table, transpose table, column degree table, row degree table, raw data table
–  Multi-Family: use one table with multiple column families
–  Many-Table: use different tables for different classes of data
•  Single-Table
–  use concatenated v1|v2 as a row key, and isolated v1 or v2 row key implies a degree
Graphulo works with many
Accumulo graph schemas
Accumulo entry structure: Key = (Row ID, Column Family, Column Qualifier, Visibility, Timestamp) → Value
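To make the adjacency-matrix variant concrete, here is a minimal ingest sketch against the classic Accumulo 1.x client API; the connection details, table name, and edge are placeholders, and a real loader would batch many mutations.

import org.apache.accumulo.core.client.*;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;

public class AdjacencyIngest {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details.
        Instance inst = new ZooKeeperInstance("instance", "zookeeper:2181");
        Connector conn = inst.getConnector("user", new PasswordToken("password"));
        if (!conn.tableOperations().exists("Tadj"))
            conn.tableOperations().create("Tadj");

        // Edge alice -> bob with weight 1:
        // row = start vertex, column qualifier = end vertex, value = weight.
        BatchWriter bw = conn.createBatchWriter("Tadj", new BatchWriterConfig());
        Mutation m = new Mutation("alice");
        m.put("", "bob", new Value("1".getBytes()));
        bw.addMutation(m);
        bw.close();
    }
}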
Slide - 14
Key Operation: Multisource BFS (Matrix Multiply)
via Accumulo Distributed Iterators
C = AT B: server-side iterators stream AT entries (c1, r1, v1) and B entries (r2, c2, v2); wherever the inner keys match (c1 = r2), they emit a partial product (r1, c2, v1 ⊗ v2), and ⊕ combines the partial products into C.
Scalable by Design
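A minimal sketch of that dataflow in plain Java (hypothetical code, not Graphulo's actual TableMult iterators): join the streamed tuples on the inner key, ⊗ the values, and ⊕-reduce collisions into C, here with ⊗ = × and ⊕ = +.

import java.util.*;

public class TupleMult {
    // C = AT B from tuple streams: AT as (c1, r1, v1), B as (r2, c2, v2).
    public static void main(String[] args) {
        double[][] At = {{0, 0, 2}, {0, 1, 3}};  // (c1, r1, v1) with numeric keys
        double[][] B  = {{0, 0, 5}, {0, 1, 7}};  // (r2, c2, v2)
        Map<String, Double> C = new HashMap<>();
        for (double[] a : At)
            for (double[] b : B)
                if (a[0] == b[0])                 // inner keys match: c1 = r2
                    C.merge((int) a[1] + "," + (int) b[1],
                            a[2] * b[2],          // ⊗ = × on the values
                            Double::sum);         // ⊕ = + combines collisions
        System.out.println(C); // entries (0,0)=10, (0,1)=14, (1,0)=15, (1,1)=21
    }
}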
Slide - 17
•  Introduction & Accumulo
•  Mathematical Foundations
•  Scalable Performance
•  High Level Interface
•  Summary
Outline
Slide - 18
Published Graphulo Performance
Graphulo scales well in data size and number of nodes, and continues to perform when the data is too large for local operations
Slide - 19
From NoSQL Accumulo to NewSQL Graphulo: Design and Utility of Graph Algorithms inside a BigTable Database, Hutchison, Kepner, Gadepally, Howe, IEEE HPEC 2016
Low Latency Analytics
[Figure: multi-source BFS time in seconds (best and worst cases) for MapReduce vs. Graphulo; Graphulo is ~50x faster]
Best Student Paper Award
IEEE HPEC 2016
Slide - 20
Application Example: Jaccard Distance
The Jaccard index is computed as the ratio of the number of common neighbors to the number of all neighbors between two vertices in a graph
[Figure: example graph on vertices A, B, C, D, E; J(D,E) = high Jaccard distance, J(A,B) = low Jaccard distance]
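In symbols, using the standard definition that matches the description above, for vertices u and v with neighbor sets N(u) and N(v):

J(u,v) = |N(u) ∩ N(v)| / |N(u) ∪ N(v)|

The index ranges from 0 (no shared neighbors) to 1 (identical neighborhoods).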
Slide - 21
Graphulo for Jaccard: Larger Problems with Better Performance
[Figure: relative performance (90%-130%) and maximum problem size (2^13 to 2^17 entries) for the Standard Client (Java math toolkit) vs. Graphulo (MIT), best and worst cases; Graphulo solves an 8x larger problem with better performance]
From NoSQL Accumulo to NewSQL Graphulo: Design and Utility of Graph Algorithms inside a BigTable Database, Hutchison, Kepner, Gadepally, Howe, IEEE HPEC 2016
Best Student Paper Award
IEEE HPEC 2016
Slide - 22
•  Introduction & Accumulo
•  Mathematical Foundations
•  Scalable Performance
•  High Level Interface
•  Summary
Outline
Slide - 23
Graphulo Java Call
int numSteps = 3;
String Atable = "ex10A";               // Input table
String Rtable = "ex10Astep3";          // Output table
String RTtable = null;
String ADegtable = "ex10ADeg";         // Input degree table
String degColumn = "out";              // Degree table column qualifier
boolean degInColQ = false;             // Degrees stored in the Value
int minDegree = 20;                    // High-pass degree filter
int maxDegree = Integer.MAX_VALUE;
int AScanIteratorPriority = -1;        // Default scan iterator priority
String v0 = "1,25,:,27,";              // Starting nodes/range
Map<Key,Value> clientResultMap = null; // Not outputting to the client
Authorizations Aauth = Authorizations.EMPTY;
Authorizations ADegauth = Authorizations.EMPTY;
boolean outputUnion = false;           // Return nodes EXACTLY k steps away
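IteratorSetting plusOp = null;         // Assumed declaration (not on the original slide): optional ⊕ combiner for values written to the result table; null keeps the default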
String vReached = graphulo.AdjBFS(Atable, v0, numSteps, Rtable,
    RTtable, clientResultMap, AScanIteratorPriority,
    ADegtable, degColumn, degInColQ, minDegree, maxDegree,
    plusOp, Aauth, ADegauth);
•  Graphulo design: methods take many parameters; pass null or -1 to use "defaults"
•  Allows different degree table schemas, such as putting the degree in the Column Qualifier instead of the Value
•  Be careful with priority when stacking iterators!
•  Degree filtering happens on the fly with SmallLargeRowFilter if no degree table is given
Slide - 24
•  Platform for analyzing large datasets
•  Part of Hadoop ecosystem
•  Runs by compiling programs into Map/Reduce jobs,
which are then executed using Hadoop
•  Has a SQL-like language, Pig Latin, that allows you to
specify operations in a familiar, declarative way
•  Supports user-defined functions
Apache Pig
Slide - 25
•  Implements Graphulo calls as user-defined functions
•  Preserves much of the Graphulo functionality, without all
the Java coding overhead
•  Call Graphulo Java functions to
–  Connect to Accumulo tables
–  Delete tables, ingest data, query data
–  Run graph algorithms in Accumulo
•  BFS, Jaccard, kTruss
•  The Graphulo relational data model is well described by
associative array algebra
Pig-ulo
Slide - 26
A = LOAD 'dbSetup.txt' USING PigStorage() AS
    (configFile:chararray, mainTable:chararray, transposeTable:chararray);

Tadj = FOREACH A GENERATE
    DbTableBinder(configFile, mainTable, transposeTable);

TadjDeg = FOREACH A GENERATE
    DbTableBinder(configFile, 'TadjDeg');
Setting Up
•  DbTableBinder puts configuration information and table names into one package
•  Supports table/transpose pairs; three tables in three lines
Slide - 27
fnames = LOAD 'ingestFiles.txt' USING PigStorage() AS (fname:chararray);

ingest = FOREACH fnames GENERATE PutTriple(Tadj.dbTable,    'adjacency', fname);
ingest = FOREACH fnames GENERATE PutTriple(Tedge.dbTable,   'incidence', fname);
ingest = FOREACH fnames GENERATE PutTriple(Tsingle.dbTable, 'single',    fname);
ingest = FOREACH fnames GENERATE PutTriple(Traw.dbTable,    'other',     fname);
Ingesting Data
•  One tsv or csv for each ingest; same function for each schema
•  Pig will automatically parallelize over multiple files using the Map/Reduce framework
Slide - 28
bfs = FOREACH Tadj    GENERATE BFS(dbTable, 'adjacency', '3,', 2);
bfs = FOREACH Tedge   GENERATE BFS(dbTable, 'incidence', '3,', 2);
bfs = FOREACH Tsingle GENERATE BFS(dbTable, 'single',    '3,', 2);
Calling Graph Algorithms
•  Parameters stripped down to the most often used
•  Optional configuration file name can be passed in for more advanced options
•  Jaccard and kTruss work similarly
Slide - 29
TadjBFS = FOREACH A GENERATE DbTableBinder(configFile, 'Tadj_BFS');

Q = FOREACH TadjBFS GENERATE FLATTEN(D4mQuery(dbTable, ':',      ':'));
Q = FOREACH Tadj    GENERATE FLATTEN(D4mQuery(dbTable, '1,2,4,', ':'));
Q = FOREACH Tadj    GENERATE FLATTEN(D4mQuery(dbTable, ':',      '1,2,4,'));
Querying Data and Retrieving Results
•  Query using the same strings as in D4M
•  Automatically searches transpose table for column key queries
•  Returned as a bag of tuples
Slide - 30
Jupyter provides an interactive portal interface similar to Mathematica notebooks
Other interfaces: Pig, Matlab, Octave, and Julia in Jupyter
Slide - 31
Future Challenge: New OS for a New Era of Computers
Computing Past: Serial, Local, Homogeneous, Deterministic
  OS Managed: Processes, Memory, Files, Communications, Security
Computing Present: Massively Parallel, Distributed, Heterogeneous, Non-Deterministic
  User Managed: Processes, Memory, Files, Communications, Security
[Images: TX-0, PDP-11, Intel x86, massive data centers, manycore processors, specialized processors]
Slide - 32
Potential Organizing Principles
•  Faster, simpler, easier-to-use, provides the data to know what happened when
•  Database, data analysis, machine learning are first order operations
•  OS designed to analyze itself
•  Graduate Unix's "everything is a file" philosophy to "everything is a table"
•  Rigorously defined mathematical interfaces and properties
TabularOSA: Tabular Operating System Architecture
Complexity of modern systems and applications requires mathematically rigorous organizing principles
Slide - 33
Big Data
•  Challenge: Scale of data beyond what current approaches can handle
•  Accumulo: High performance ingest and scanning of vast amounts of data
Supercomputing
•  Challenge: Analytics beyond what current approaches can handle
•  Accumulo: Real-time filesystem, scheduler, and system monitoring
Machine Learning
•  Challenge: Diversity beyond what current approaches can handle
•  Hardware: Rapid storage and retrieval of training and model data
Summary
Requires mathematically rigorous approaches to enable scalable solutions
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...ssuserf63bd7
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档208367051
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...Boston Institute of Analytics
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanMYRABACSAFRA2
 

Dernier (20)

Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
 
Machine learning classification ppt.ppt
Machine learning classification  ppt.pptMachine learning classification  ppt.ppt
Machine learning classification ppt.ppt
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdf
 
While-For-loop in python used in college
While-For-loop in python used in collegeWhile-For-loop in python used in college
While-For-loop in python used in college
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdf
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population Mean
 

Accumulo and the Convergence of Machine Learning, Big Data, and Supercomputing

  • 1. Apache Accumulo and the Convergence of Machine Learning, Big Data, and Supercomputing Jeremy Kepner1,2,3, Lauren Milechin4, Vijay Gadepally1,2, Dylan Hutchison5, Timothy Weale6 MIT Lincoln Laboratory Supercomputing Center1, MIT Computer Science & AI Laboratory2, MIT Mathematics Department3, MIT Earth, Atmospheric and Planetary Sciences4, University of Washington5, US Department of Defense6 Oct 2017 This material is based upon work supported by the National Science Foundation under Grant No. DMS-1312831. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
  • 2. Slide - 2 •  Introduction & Accumulo •  Mathematical Foundations •  Scalable Performance •  High Level Interface •  Summary Outline
  • 3. Slide - 3 Why Accumulo? World Record Database Performance. [Chart: database inserts/second vs. number of processors (10^0 to 10^3 nodes, 10^5 to 10^9 inserts/s): Accumulo at 4M/s (MIT LL 2012) and a 115M/s world record (MIT LL 2014); Cassandra at 108M/s (BAH 2013) and 1M/s (Google 2014); Oracle at 140K/s (Oracle 2013).] Achieving 100,000,000 database inserts per second using Accumulo and D4M, Kepner et al, HPEC 2014. BAH = Booz Allen Hamilton
  • 4. Slide - 4 Large Scale Computing: Challenges. Big Data •  Challenge: Scale of data beyond what current approaches can handle (social media, cyber networks, bioinformatics, ...). Supercomputing •  Challenge: Analytics beyond what current approaches can handle (engineering simulation, drug discovery, autonomy, ...). Machine Learning •  Challenge: Diversity beyond what current approaches can handle (computer vision, language processing, decision making, ...). [Background art: excerpts from a Lincoln Laboratory graph-processor paper describing a 100% efficient k-way systolic merge sorter, a 6D toroidal communication network with randomized message routing, and an FPGA prototype (Fig. 7). Labels: Hyperscale Data Centers, Accelerators, Specialized Processors.]
  • 5. Slide - 5 Large Scale Computing: Accumulo. Big Data •  Challenge: Scale of data beyond what current approaches can handle •  Accumulo: High performance ingest and scanning of vast amounts of data. Supercomputing •  Challenge: Analytics beyond what current approaches can handle •  Accumulo: Real-time filesystem, scheduler, and system monitoring. Machine Learning •  Challenge: Diversity beyond what current approaches can handle •  Hardware: Rapid storage and retrieval of training and model data. Requires mathematically rigorous approaches to enable scalable solutions. [Background art: excerpts from the same Lincoln Laboratory graph-processor paper as Slide 4. Labels: Hyperscale Data Centers, Accelerators, Specialized Processors.]
  • 6. Slide - 6 •  Introduction & Accumulo •  Mathematical Foundations •  Scalable Performance •  High Level Interface •  Summary Outline
  • 7. Slide - 7 Modern Database Paradigm Shifts. Timeline 1970-2010: •  SQL Era: Relational Databases, beginning with the Relational Model, E.F. Codd (1970); Prof. Stonebraker (U.C. Berkeley), Larry Ellison; contribution: a common interface. •  NoSQL Era: Google Bigtable, Chang et al (2006); Apache; contribution: rapid ingest for internet search. •  NewSQL Era: Cattell (2010); Prof. Stonebraker (MIT), NSF & MIT, Graphulo; contribution: fast analytics inside databases. •  Future: polystore, high performance ingest and analytics. SQL = Structured Query Language; NoSQL = Not only SQL. [Slide reproduces the first pages of three papers: A Relational Model of Data for Large Shared Data Banks, E.F. Codd, CACM 1970; Bigtable: A Distributed Storage System for Structured Data, Chang et al, OSDI 2006; Scalable SQL and NoSQL Data Stores, Rick Cattell, 2010 (revised 2011).]
  • 8. Slide - 8 Declarative, Mathematically Rigorous Interfaces. Operation: finding Alice's nearest neighbors. SQL Set Operations: SELECT 'to' FROM T WHERE 'from=alice'. NoSQL Graph Operations: follow the 'cited' edges out of alice to reach bob and carl. NewSQL Linear Algebra: A^T v, the adjacency matrix transpose applied to a vector v selecting alice. Source table T: (001: from=alice, link=cited, to=bob), (002: from=bob, link=cited, to=alice), (003: from=alice, link=cited, to=carl). Associative Array Algebra provides a unified mathematics for SQL, NoSQL, and NewSQL; operations in all representations are equivalent and are linear systems: A = NxM(k1,k2,v,⊕); (k1,k2,v) = A; C = A^T; C = A ⊕ B; C = A ⊗ B; C = A B = A ⊕.⊗ B. Associative Array model of SQL, NoSQL, and NewSQL Databases, Kepner et al, HPEC 2016; Mathematics of Big Data, Kepner & Jananthan, MIT Press 2017
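The equivalence can be made concrete in a few lines of code. Below is a minimal sketch in plain Java (ordinary HashMaps stand in for associative arrays; this is not D4M's or Graphulo's actual API) of the same nearest-neighbor query expressed as a sparse vector-matrix product:

    import java.util.HashMap;
    import java.util.Map;

    public class NearestNeighbors {
        public static void main(String[] args) {
            // Sparse adjacency array A: row key -> (column key -> value),
            // built from table T: alice cited bob and carl; bob cited alice.
            Map<String, Map<String, Integer>> A = new HashMap<>();
            A.put("alice", Map.of("bob", 1, "carl", 1));
            A.put("bob", Map.of("alice", 1));

            // Start vector v selecting alice.
            Map<String, Integer> v = Map.of("alice", 1);

            // One BFS step, equivalent to SELECT 'to' FROM T WHERE 'from=alice'.
            Map<String, Integer> x = new HashMap<>();
            for (Map.Entry<String, Integer> vi : v.entrySet())
                for (Map.Entry<String, Integer> aij :
                        A.getOrDefault(vi.getKey(), Map.of()).entrySet())
                    // x(j) = oplus_i v(i) otimes A(i,j), here with oplus=+, otimes=*.
                    x.merge(aij.getKey(), vi.getValue() * aij.getValue(), Integer::sum);

            System.out.println(x);  // e.g. {bob=1, carl=1}
        }
    }

Because the loop is written over ⊕ and ⊗ rather than hard-coded + and ×, swapping in a different semiring changes the analytic without changing the traversal.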
  • 9. Slide - 9 [Slide shows excerpts from 1950s AI papers (e.g., the 1955 Western Joint Computer Conference): a neural-network figure captioned "Typical network elements i and j showing connection weights w", a chess-machine performance schema, and an early machine-vision figure, grouped under the themes Modern Computers, Neural Networks, Language, Strategy, and Vision.]
  • 10. Slide - 10 Example: Deep Neural Networks (DNNs). •  Increased abstraction at deeper layers: y_{i+1} = h(W_i y_i + b_i) requires a non-linear function, such as h(y) = max(y,0). •  Naturally implemented in Graphulo. Remark: can rewrite in GraphBLAS as y_{i+1} = W_i y_i ⊗ b_i ⊕ 0, where ⊕ = max() and ⊗ = +, so the DNN oscillates over two linear semirings: S1 = (ℝ, +, ×, 0, 1) and S2 = ({-∞} ∪ ℝ, max, +, -∞, 0). [Diagram: a four-layer network y0 → y1 → y2 → y3 → y4 with weights W0..W3 and biases b0..b3, mapping input features to edges, object parts, objects, and the output classification.] Mathematical Foundations of the GraphBLAS, Kepner et al, IEEE HPEC 2016; Mathematics of Big Data, Kepner & Jananthan, MIT Press, 2017
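The remark can be checked numerically. A self-contained sketch in plain Java (the matrix sizes and values are illustrative assumptions) of one DNN layer computed exactly as the slide describes, first in S1 and then in S2:

    import java.util.Arrays;

    public class DnnSemiring {
        public static void main(String[] args) {
            double[][] W = {{0.5, -1.0}, {2.0, 0.25}};
            double[] y0 = {1.0, 3.0};
            double[] b  = {-1.0, 0.5};
            int n = y0.length;

            // Step 1, semiring S1 = (R, +, x, 0, 1): z = W y0.
            double[] z = new double[n];
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++)
                    z[i] += W[i][j] * y0[j];

            // Step 2, semiring S2 = ({-inf} u R, max, +, -inf, 0):
            // y1 = z (otimes) b (oplus) 0, i.e. elementwise max(z + b, 0).
            double[] y1 = new double[n];
            for (int i = 0; i < n; i++)
                y1[i] = Math.max(z[i] + b[i], 0.0);

            // Identical to the conventional ReLU layer y1 = max(W y0 + b, 0).
            System.out.println(Arrays.toString(y1));  // [0.0, 3.25]
        }
    }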
  • 11. Slide - 11 GraphBLAS.org Standard for Sparse Matrix Math. •  5 key operations: A = NxM(i,j,v); (i,j,v) = A; C = A ⊕ B; C = A ⊗ B; C = A B = A ⊕.⊗ B. •  That are composable: A ⊕ B = B ⊕ A; (A ⊕ B) ⊕ C = A ⊕ (B ⊕ C); A ⊗ (B ⊕ C) = (A ⊗ B) ⊕ (A ⊗ C); A ⊗ B = B ⊗ A; (A ⊗ B) ⊗ C = A ⊗ (B ⊗ C); A (B ⊕ C) = (A B) ⊕ (A C). •  Can be used to build standard GraphBLAS functions: buildMatrix, extractTuples, Transpose, mXm, mXv, vXm, extract, assign, eWiseAdd, ... •  Can be used to build a variety of graph utility functions: Tril(), Triu(), Degree Filtered BFS, ... •  Can be used to build a variety of graph algorithms: Triangle Counting, K-Truss, Jaccard Coefficient, Non-Negative Matrix Factorization, ... •  That work on a wide range of graphs: hyper, multi-directed, multi-weighted, multi-partite, multi-edge. CMU SEI, PNNL. The GraphBLAS C API Specification, Buluc, Mattson, McMillan, Moreira, Yang, GraphBLAS.org 2017; GraphBLAS Mathematics, Kepner, GraphBLAS.org 2017
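The composability comes from keeping ⊕ and ⊗ abstract: one matrix-multiply routine serves many algorithms. A minimal sketch in plain Java (not the GraphBLAS C API; dense arrays used for brevity) of C = A ⊕.⊗ B over a caller-supplied semiring:

    import java.util.function.DoubleBinaryOperator;

    public class SemiringMatMul {
        // C = A oplus.otimes B, with 'zero' the identity element of oplus.
        static double[][] mxm(double[][] A, double[][] B,
                              DoubleBinaryOperator oplus, DoubleBinaryOperator otimes,
                              double zero) {
            int n = A.length, k = B.length, m = B[0].length;
            double[][] C = new double[n][m];
            for (int i = 0; i < n; i++)
                for (int j = 0; j < m; j++) {
                    double acc = zero;
                    for (int l = 0; l < k; l++)
                        acc = oplus.applyAsDouble(acc,
                                otimes.applyAsDouble(A[i][l], B[l][j]));
                    C[i][j] = acc;
                }
            return C;
        }

        public static void main(String[] args) {
            final double INF = Double.POSITIVE_INFINITY;
            // (min, +) tropical semiring on a weight matrix (INF = no edge):
            // D2[i][j] = cheapest path cost using at most 2 hops.
            double[][] D = {{0, 3}, {INF, 0}};
            double[][] D2 = mxm(D, D, Math::min, (a, b) -> a + b, INF);
            // Ordinary (+, x) on a 0/1 adjacency matrix counts length-2 paths.
            double[][] A = {{0, 1}, {1, 0}};
            double[][] P = mxm(A, A, (a, b) -> a + b, (a, b) -> a * b, 0.0);
            System.out.println(D2[0][1] + " " + P[0][0]);  // 3.0 1.0
        }
    }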
  • 12. Slide - 12 GraphBLAS in Accumulo = Graphulo: Save Communication, Built in to Existing Database, Data Locality, Infrastructure Reuse, Distributed Execution, Fast Access*, Free Database Features
  • 13. Slide - 13 Accumulo Graph Schema Variants. •  Adjacency Matrix (directed/undirected/weighted graphs): row = start vertex; column = end vertex; value = edge weight. •  Incidence Matrix (multi-hyper-graphs): row = edge; column = vertices associated with edge; value = weight. •  D4M Schema: Standard: main table, transpose table, column degree table, row degree table, raw data table; Multi-Family: use one table with multiple column families; Many-Table: use different tables for different classes of data. •  Single-Table: use concatenated v1|v2 as a row key, and an isolated v1 or v2 row key implies a degree. Graphulo works with many Accumulo graph schemas. [Accumulo entry structure: Key = (Row ID; Column: Family, Qualifier, Visibility; Timestamp) → Value]
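For concreteness, a sketch of writing one weighted edge under the adjacency-matrix schema with a D4M-style transpose table, using the standard Accumulo client API (the table names Tadj/TadjT and the empty column family are illustrative assumptions, not fixed by the schema):

    import org.apache.accumulo.core.client.BatchWriter;
    import org.apache.accumulo.core.client.BatchWriterConfig;
    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.data.Mutation;
    import org.apache.accumulo.core.data.Value;

    public class EdgeIngest {
        // Write edge v1 -> v2 with weight w into the main table, mirrored
        // into a transpose table for fast column (incoming-edge) queries.
        static void putEdge(Connector conn, String v1, String v2, String w)
                throws Exception {
            BatchWriter bw  = conn.createBatchWriter("Tadj",  new BatchWriterConfig());
            BatchWriter bwT = conn.createBatchWriter("TadjT", new BatchWriterConfig());

            Mutation m = new Mutation(v1);              // row key = start vertex
            m.put("", v2, new Value(w.getBytes()));     // column qualifier = end vertex
            bw.addMutation(m);

            Mutation mT = new Mutation(v2);             // transpose: row key = end vertex
            mT.put("", v1, new Value(w.getBytes()));
            bwT.addMutation(mT);

            bw.close();
            bwT.close();
        }
    }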
  • 14. Slide - 14 Key Operation: Multisource BFS (Matrix Multiply) via Accumulo Distributed Iterators. C = A^T B: the iterators stream entries (c1, r1, v1) from the transpose table A^T and entries (r2, c2, v2) from B. Scalable by Design.
  • 15. Slide - 15 Key Operation: Multisource BFS (Matrix Multiply) via Accumulo Distributed Iterators. Entries whose inner keys match (c1 = r2) generate partial products (r1, c2, v1 ⊗ v2). Scalable by Design.
  • 16. Slide - 16 Key Operation: Multisource BFS (Matrix Multiply) via Accumulo Distributed Iterators. Partial products with the same output key (r1, c2) are combined with ⊕ to form C. Scalable by Design.
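A simplified, in-memory sketch of the join the last three slides build up (plain Java maps; the real Graphulo implementation streams sorted Accumulo entries through server-side iterators instead of materializing tables):

    import java.util.HashMap;
    import java.util.Map;

    public class MultiplyStep {
        // C(r1, c2) = oplus over matching inner keys of v1 otimes v2,
        // where AT is stored as c1 -> (r1 -> v1) and B as r2 -> (c2 -> v2).
        // Here otimes = * and oplus = +, the ordinary counting semiring.
        static Map<String, Map<String, Long>> multiply(
                Map<String, Map<String, Long>> AT,
                Map<String, Map<String, Long>> B) {
            Map<String, Map<String, Long>> C = new HashMap<>();
            for (String inner : AT.keySet()) {          // inner key c1 of A^T
                Map<String, Long> bRow = B.get(inner);  // rows of B where r2 == c1
                if (bRow == null) continue;
                for (Map.Entry<String, Long> a : AT.get(inner).entrySet())   // (r1, v1)
                    for (Map.Entry<String, Long> b : bRow.entrySet())        // (c2, v2)
                        C.computeIfAbsent(a.getKey(), k -> new HashMap<>())
                         .merge(b.getKey(),
                                a.getValue() * b.getValue(),  // otimes: partial product
                                Long::sum);                   // oplus: combine partials
            }
            return C;
        }
    }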
  • 17. Slide - 17 •  Introduction & Accumulo •  Mathematical Foundations •  Scalable Performance •  High Level Interface •  Summary Outline
  • 18. Slide - 18 Published Graphulo Performance. Graphulo scales well in terms of data size and number of nodes, and continues to perform when data is too large for local operations.
  • 19. Slide - 19 Low Latency Analytics. [Chart: multi-source BFS time in seconds (0-60) for MapReduce vs. Graphulo, worst to best case: Graphulo is ~50x faster.] From NoSQL Accumulo to NewSQL Graphulo: Design and Utility of Graph Algorithms inside a BigTable Database, Hutchison, Kepner, Gadepally, Howe, IEEE HPEC 2016. Best Student Paper Award, IEEE HPEC 2016.
  • 20. Slide - 20 Application Example: Jaccard Distance. The Jaccard index is computed as the ratio of the number of common neighbors to all neighbors between two vertices in a graph. [Diagram: example graph on vertices A-E; J(D,E) = high Jaccard distance, J(A,B) = low Jaccard distance.]
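A minimal sketch of the definition (plain Java sets; Graphulo's server-side Jaccard operates directly on the adjacency table rather than on materialized neighbor sets):

    import java.util.HashSet;
    import java.util.Set;

    public class Jaccard {
        // J(u,v) = |N(u) intersect N(v)| / |N(u) union N(v)|
        static double index(Set<String> nu, Set<String> nv) {
            Set<String> inter = new HashSet<>(nu);
            inter.retainAll(nv);
            Set<String> union = new HashSet<>(nu);
            union.addAll(nv);
            return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
        }

        public static void main(String[] args) {
            Set<String> nA = Set.of("B", "C", "D");
            Set<String> nB = Set.of("A", "C", "D");
            // Jaccard distance = 1 - Jaccard index; similar neighborhoods
            // give a low distance, disjoint neighborhoods a high one.
            System.out.println(1.0 - index(nA, nB));
        }
    }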
  • 21. Slide - 21 Graphulo for Jaccard: Larger Problems with Better Performance. [Charts: relative performance (90%-130%), worst to best case, of Graphulo (MIT) vs. a standard client (Math Toolkit Java); and maximum problem size in entries (8096, 16192, 32384, 64768, 129536, i.e. roughly 2^13 to 2^17): Graphulo handles an 8x larger problem.] From NoSQL Accumulo to NewSQL Graphulo: Design and Utility of Graph Algorithms inside a BigTable Database, Hutchison, Kepner, Gadepally, Howe, IEEE HPEC 2016. Best Student Paper Award, IEEE HPEC 2016.
  • 22. Slide - 22 •  Introduction & Accumulo •  Mathematical Foundations •  Scalable Performance •  High Level Interface •  Summary Outline
  • 23. Slide - 23 Graphulo Java Call

    int numSteps = 3;
    String Atable = "ex10A";                    // Input table
    String Rtable = "ex10Astep3";               // Output table
    String RTtable = null;
    String ADegtable = "ex10ADeg";              // Input degree table
    String degColumn = "out";                   // Degree table column qualifier
    boolean degInColQ = false;                  // Degrees in Value
    int minDegree = 20;                         // High-pass filter
    int maxDegree = Integer.MAX_VALUE;
    int AScanIteratorPriority = -1;             // Default scan iterator priority
    String v0 = "1,25,:,27,";                   // Starting nodes/range
    Map<Key,Value> clientResultMap = null;      // Not outputting to client
    Authorizations Aauth = Authorizations.EMPTY;
    Authorizations ADegauth = Authorizations.EMPTY;
    boolean outputUnion = false;                // Return nodes EXACTLY k steps away

    String vReached = graphulo.AdjBFS(Atable, v0, numSteps, Rtable, RTtable,
        clientResultMap, AScanIteratorPriority, ADegtable, degColumn, degInColQ,
        minDegree, maxDegree, plusOp, Aauth, ADegauth);

Graphulo design: methods take many parameters; pass null or -1 to use "defaults". This allows different degree table schemas, such as putting the degree in the Column Qualifier instead of the Value. Degree filtering happens on the fly with SmallLargeRowFilter if no degree table is given. Be careful with priority when stacking iterators!
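Following the design note above, most arguments can be left at their defaults. A hedged sketch of a stripped-down call (assuming null/-1/no-op values request default behavior, as the slide states; not verified against a specific Graphulo release):

    // Hypothetical minimal call: no degree table, no client output,
    // no degree filtering, default iterator priority and plus operation.
    String vReached = graphulo.AdjBFS("ex10A", "1,25,:,27,", 3, "ex10Astep3",
        null,                       // RTtable: no transpose of the result
        null,                       // clientResultMap: not outputting to client
        -1,                         // AScanIteratorPriority: default
        null, null, false,          // ADegtable, degColumn, degInColQ: no degree table
        0, Integer.MAX_VALUE,       // minDegree, maxDegree: no filtering
        null,                       // plusOp: default
        Authorizations.EMPTY, Authorizations.EMPTY);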
  • 24. Slide - 24 •  Platform for analyzing large datasets •  Part of Hadoop ecosystem •  Runs by compiling programs into Map/Reduce jobs, which are then executed using Hadoop •  Has SQL-like language Pig Latin that allows you to specify operations in a familiar, declarative way •  Supports user-defined functions Apache Pig
  • 25. Slide - 25 Pig-ulo. •  Implements Graphulo calls as user-defined functions •  Preserves much of the Graphulo functionality, without all the Java coding overhead •  Call Graphulo Java functions to: connect to Accumulo tables; delete tables, ingest data, query data; run graph algorithms in Accumulo (BFS, Jaccard, kTruss) •  Graphulo's relational data model is well described by associative array algebra
  • 26. Slide - 26 Setting Up

    A = LOAD 'dbSetup.txt' USING PigStorage() AS
        (configFile:chararray, mainTable:chararray, transposeTable:chararray);

    Tadj = FOREACH A GENERATE
        DbTableBinder(configFile, mainTable, transposeTable);

    TadjDeg = FOREACH A GENERATE
        DbTableBinder(configFile, 'TadjDeg');

•  DbTableBinder puts configuration information and table names into one package •  Supports table/transpose pairs; three tables in three lines
  • 27. Slide - 27 Ingesting Data

    fnames = LOAD 'ingestFiles.txt' USING PigStorage() AS (fname:chararray);

    ingest = FOREACH fnames GENERATE PutTriple(Tadj.dbTable,   'adjacency', fname);
    ingest = FOREACH fnames GENERATE PutTriple(Tedge.dbTable,  'incidence', fname);
    ingest = FOREACH fnames GENERATE PutTriple(Tsingle.dbTable,'single',    fname);
    ingest = FOREACH fnames GENERATE PutTriple(Traw.dbTable,   'other',     fname);

•  One tsv or csv for each ingest; same function for each schema •  Pig will automatically parallelize over multiple files using the Map/Reduce framework
  • 28. Slide - 28 Calling Graph Algorithms

    bfs = FOREACH Tadj    GENERATE BFS(dbTable, 'adjacency', '3,', 2);
    bfs = FOREACH Tedge   GENERATE BFS(dbTable, 'incidence', '3,', 2);
    bfs = FOREACH Tsingle GENERATE BFS(dbTable, 'single',    '3,', 2);

•  Parameters stripped down to the most often used •  Optional configuration file name can be passed in for more advanced options •  Jaccard and kTruss work similarly
  • 29. Slide - 29 Querying Data and Retrieving Results

    TadjBFS = FOREACH A GENERATE DbTableBinder(configFile, 'Tadj_BFS');

    Q = FOREACH TadjBFS GENERATE FLATTEN(D4mQuery(dbTable, ':',      ':'));
    Q = FOREACH Tadj    GENERATE FLATTEN(D4mQuery(dbTable, '1,2,4,', ':'));
    Q = FOREACH Tadj    GENERATE FLATTEN(D4mQuery(dbTable, ':',      '1,2,4,'));

•  Query using the same strings as in D4M •  Automatically searches the transpose table for column key queries •  Results are returned as a bag of tuples
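Row queries like D4mQuery(dbTable, '1,2,4,', ':') ultimately become Accumulo scans. A sketch using the standard Accumulo client API, assuming a 1.7+ client (the D4M string handling here is deliberately simplified; range syntax such as '1,:,3,' is not parsed):

    import java.util.Map;
    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.Scanner;
    import org.apache.accumulo.core.data.Key;
    import org.apache.accumulo.core.data.Range;
    import org.apache.accumulo.core.data.Value;
    import org.apache.accumulo.core.security.Authorizations;

    public class RowQuery {
        // Scan the rows named in a D4M-style list such as "1,2,4," (',' separated).
        static void query(Connector conn, String table, String rowList)
                throws Exception {
            for (String row : rowList.split(",")) {
                Scanner scan = conn.createScanner(table, Authorizations.EMPTY);
                scan.setRange(new Range(row));   // exact-row range
                for (Map.Entry<Key, Value> e : scan)
                    System.out.println(e.getKey() + " -> " + e.getValue());
                scan.close();
            }
        }
        // A column query (third D4mQuery argument) would run the same scan
        // against the transpose table, which is why DbTableBinder tracks
        // table/transpose pairs.
    }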
  • 30. Slide - 30 Other Interfaces: Pig/Matlab/Octave/Julia in Jupyter. Jupyter provides an interactive portal interface similar to Mathematica notebooks.
  • 31. Slide - 31 Future Challenge: New OS for a New Era of Computers. Computing Past (TX-0, PDP-11, Intel x86): serial, local, homogeneous, deterministic; OS-managed processes, memory, files, communications, and security. Computing Present (massive data centers, manycore processors, specialized processors): massively parallel, distributed, heterogeneous, non-deterministic; user-managed processes, memory, files, communications, and security. [Background art: excerpts from the same Lincoln Laboratory graph-processor paper as Slide 4, including the FPGA prototype (Fig. 7).]
  • 32. Slide - 32 TabularOSA: Tabular Operating System Architecture. Potential Organizing Principles: •  Faster, simpler, easier to use; provides the data to know what happened when •  Database, data analysis, and machine learning are first-order operations •  OS designed to analyze itself •  Graduate Unix's "everything is a file" philosophy to "everything is a table" •  Rigorously defined mathematical interfaces and properties. The complexity of modern systems and applications requires mathematically rigorous organizing principles.
  • 33. Slide - 33 Summary. Big Data •  Challenge: Scale of data beyond what current approaches can handle •  Accumulo: High performance ingest and scanning of vast amounts of data. Supercomputing •  Challenge: Analytics beyond what current approaches can handle •  Accumulo: Real-time filesystem, scheduler, and system monitoring. Machine Learning •  Challenge: Diversity beyond what current approaches can handle •  Hardware: Rapid storage and retrieval of training and model data. Requires mathematically rigorous approaches to enable scalable solutions.