SciDB

Topics
• The Big Complex Analytics Space
• SciDB Overview
• How we are different and why that matters
• Architecture
Note: We call our company P4 for short

Rich Data + Complex Analytics drive insights and
innovative product offerings
● Tap new types and sources of data
– Location, genomic, behavioral, speech, sensors, images, …
● Integrate mixed data sources for novel insights
– Genomic/wearable sensors/EHRs/clinical/payer/provider
– Satellite images/smart grid data
– Location/weather/traffic/driving behavior
● Generate micro-segmented pricing & products
– Personalized insurance
– Precision medicine
– Precision warranties
– Behavioral targeting
– Location-based services
● Look at whole populations, big time windows, big regions

Where some of the ‘complex analytics’ problems are
Pharma, Biotech, AgroBusiness, Healthcare Informatics
• Next-gen sequencing analysis & GWAS
• Population studies
• Evidence-based outcome studies
• Pharmaco-economics
Insurance Analytics
• Personalized auto or workman’s comp insurance
• Catastrophe modeling and policy pricing
• Risk modeling for insurance exchanges
Industrial Analytics
• Precision warranty pricing & maintenance schedules
Call Centers
• Speech analytics
Energy
• Data from smart sensor grids
Digital Marketing
• Geo-targeting & other personalization strategies
• Recommendation engines
Financial Services
• Financial modeling, back testing, sensitivity testing
• Algorithmic trading
• Portfolio management & risk management
Scientific Research
• Astronomy, Climatology, High Energy Physics, et al.

‘Big Analytics’ covers two categories
• Big Volume + Simple analytics
– Traditional Data Warehouses, RDBMSs
– Business analysts
– Count statistics, roll-ups, aggregates
• Big Volume + Complex Analytics (P4’s space)
– Emerging markets; new tools
– Data scientists / healthcare analysts / quants / operations researchers
– Multivariate statistics, clustering, SVD, machine-learning, et al

Why would industrial & commercial analytics
applications benefit from yet another
software platform?
• Sensor data, geospatial data, temporal data, genomic
data & images are far more efficiently managed as multi-dimensional
arrays than as relational tables
• Complex analytics should execute in place where the data
resides and scale easily with additional nodes and cores

P4’s new ‘Complex Analytics’ database
scientific data management & analytics for the
commercial world
Rich Data
Massively Scalable Math
Smart Data Management

P4 is well-matched for M2M data
Machine-generated data have inherent ordering & structure
© Paradigm4 Inc.
• location data from cars and cell phones
• telematics data from sensors
• energy usage data from smart sensor grids
• genetic sequencing data
• patient telemetry data
• time series and longitudinal event data
• satellite images of the earth’s surface

A new ‘Complex Analytics’ database
scientific data management & analytics for the commercial & industrial worlds
• All-in-one
next generation database with data life cycle management
native, seamlessly integrated, scalable complex math operations
• Array data model
optimal for temporal, geospatial, and machine-generated data
n-dimensional
• Open Source
• Commodity HW grid or cloud

SciDB Features
Distributed data storage
With redundancy/fault tolerance and high-availability
Scalable Parallel operations
Parallel linear algebra, aggregates, summaries, data loading
ACID Transactions
Structured N-dimensional Sparse Array Data Model
Defined by schema
Expressive SQL-like Query Syntax
Supports joins by array dimensions
No-Overwrite Data Versioning
Extensible
User-defined types, functions, operators

Paradigm4 enables data-intensive research
• Capture: ingest, store, and manage data throughout its lifecycle
• Curation: save raw, corrected, pre-processed, and derived analytic data, with metadata and provenance
• Curiosity: explore, drill down, filter, select
• Compute: complex math and modeling
• Collaboration: shared resource; no data silos with long, metadata-laden filenames
• Compliance: no-overwrite, versioned data storage supports reproducibility and validation of results

First class support for scientific data & scientific
research
• Ingest, store, access, and manage data throughout its life
cycle
• No overwrite database; historical versioning support
• Metadata – store curation and calibration information
• Extensibility (user defined types and operations)
• Save raw, corrected, pre-processed, and derived data
• Support for provenance
• Support reproducibility of results
• Share data across work groups and with outside
organizations
Why SciDB for scientists?

P4’s native Array DB beats Relational DBs* on
storage efficiency & complex computations
[Figure labels: 16 cells; 48 cells]
● Math functions run directly on native storage format
● Dramatic storage efficiencies as # of dimensions &
attributes grows
– Architecture supports n dimensions
● Facilitates drill-down & clustering by like groups
● High performance for both sparse and dense data
– 10-100x faster than RDBMSs on array operations
* Applies to both row stores & column stores

Data exploration & analytics work better when
the natural ordering of data is preserved
Clusters, temporal regions are stored together
Resample time or re-grid geospatial data at any resolution
Slice & drill-down in any n-dimensional region
Fast data selection for ad hoc queries
Efficient analytics over sub-regions & moving windows

Complex math underpins many use cases
Industrial Analytics
• Precision warranty pricing
• Proactive preventive maintenance
• Modeling & optimization
• Event monitoring in refineries and factories
Pharma, Biotech, Healthcare
• Next-gen sequencing
• Population studies
• Outcome studies
• Precision medicine
Underlying math
• covariance
• PCA, SVD
• cross validation
• bootstrapping
• cluster analysis
• linear/logistic regression

Complex math underpins many use cases
Computational Finance
• Back testing
• Sophisticated modeling
• Portfolio optimization
• Risk management
Underlying math
• covariance
• PCA, SVD
• cross validation
• bootstrapping
• cluster analysis
• Monte Carlo methods
• linear/logistic regression

P4’s native math library supports
distributed processing
• Task parallelism
‘Embarrassingly parallel’ tasks
Process subpopulations in parallel
Run simulations in parallel
• Massively scalable complex math
‘Non-embarrassingly’ parallel tasks like large scale linear algebra
Math operations that pass intermediate data between nodes
Challenging O(n^3) computations
Math operations on data too large to fit on one node
• Large scale analytics without sampling
Look at whole populations, big time windows, big regions
Sample when you want to; not to fit analytics package constraints
Use all the data: sometimes you really want the long tail or the black swan

Query language seamlessly
integrates data manipulation & math
Array Query Language -- AQL
Declarative SQL-like language extended for working with array data
Large-scale math operations embedded in queries
Extensible
Add user-defined types and functions
R, Python, and other client interfaces
Compute the log odds ratio for a failure model using logistic regression
SELECT *
FROM LOGISTREGR (model_matrix, success_count, failure_count,
'coefficients')

Linear Algebra as Building Block
Mathematical and Data Manipulation Operations
multiply ( transpose ( Simple_Array ), Simple_Array );
regrid( Simple_Array, 10, 10, avg (v2) );
cumsum ( filter ( Simple_Array, v1 = 'Odd' ), I, v1 );
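
To make the first two operators concrete, here is a minimal pure-Python sketch of what multiply(transpose(A), A) and a regrid with an avg aggregate compute on a small dense matrix. It is not SciDB code: the helper names (transpose, multiply, regrid_avg) and the 2 x 2 block size are assumptions chosen for illustration, and the distributed, chunked execution that SciDB performs is not modeled.

```python
def transpose(a):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(col) for col in zip(*a)]


def multiply(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def regrid_avg(a, block_rows, block_cols):
    """Replace each block_rows x block_cols block by the average of its cells."""
    rows, cols = len(a), len(a[0])
    out = []
    for r0 in range(0, rows, block_rows):
        out_row = []
        for c0 in range(0, cols, block_cols):
            block = [a[r][c]
                     for r in range(r0, min(r0 + block_rows, rows))
                     for c in range(c0, min(c0 + block_cols, cols))]
            out_row.append(sum(block) / len(block))
        out.append(out_row)
    return out


a = [[1, 2], [3, 4], [5, 6], [7, 8]]       # hypothetical 4 x 2 input
gram = multiply(transpose(a), a)           # 2 x 2 result of A^T * A
coarse = regrid_avg(a, 2, 2)               # 2 x 1 grid of block averages
print(gram)    # [[84, 100], [100, 120]]
print(coarse)  # [[2.5], [6.5]]
```

In the database the same steps are expressed as operators inside a query, so the data does not have to leave the store to be transformed.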

Flexible Schema
• Ad hoc queries
• Don’t have to know a priori what questions you will want to ask of your data
• Change schema dynamically
• Values <=> dimensions
• Supports transparent data exploration and mining

Well-suited for storing, accessing, & analyzing images
Satellite images Healthcare images
GIS data
Store metadata with the data
• Instrument id & calibration data
• Experimental conditions and variables
• Data set identifiers & comments

SciDB support for images
• Regrid operator
• Change resolution and coordinate systems
• Overlap
• Supports feature detection when features fall between nodes
• Support for multi-dimensional window operations
• Spatial averaging
• Non-integer dimensions
• Access image through spatio-temporal coordinate systems
• Astronomy (right ascension, declination)
• Remote sensing (lat, long, time)
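
The overlap and window bullets above can be illustrated with a small sketch. It assumes a 1-D array split into fixed-size chunks, where each chunk also carries copies of up to `overlap` cells from its neighbours. With that padding, a moving average of radius 1 can be computed chunk by chunk with no further data exchange. The function names and the list-based storage are hypothetical and do not reflect SciDB internals.

```python
def split_with_overlap(values, chunk_size, overlap):
    """Split a 1-D list into chunks of chunk_size cells.

    Each chunk is (start, end, padded_start, padded_values), where
    padded_values also holds up to `overlap` neighbouring cells per side.
    """
    chunks = []
    for start in range(0, len(values), chunk_size):
        end = min(start + chunk_size, len(values))
        lo = max(0, start - overlap)
        hi = min(len(values), end + overlap)
        chunks.append((start, end, lo, values[lo:hi]))
    return chunks


def local_window_avg(chunk, radius):
    """Moving average over the chunk's own cells, using only its padded data."""
    start, end, lo, padded = chunk
    out = []
    for i in range(start, end):
        left = max(lo, i - radius)
        right = min(lo + len(padded), i + radius + 1)
        window = padded[left - lo:right - lo]
        out.append(sum(window) / len(window))
    return out


values = [1, 2, 3, 4, 5, 6, 7, 8]
chunks = split_with_overlap(values, chunk_size=4, overlap=1)
per_chunk = [x for ch in chunks for x in local_window_avg(ch, radius=1)]

# Reference result computed over the whole array at once.
reference = []
for i in range(len(values)):
    window = values[max(0, i - 1):i + 2]
    reference.append(sum(window) / len(window))

print(per_chunk == reference)  # True
```

The design point is that the overlap must be at least as wide as the window radius; otherwise cells near a chunk boundary would need data held elsewhere.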

SciDB array model: create array
CREATE ARRAY RGB
< red : int16,
green : int16,
blue : int16>
[ longitude(double) = *, 10000, 0,
latitude(double) = *, 10000, 0 ];
Attributes: red, green, blue
Dimensions: longitude, latitude
Dimension size: * indicates unbounded
Chunk size: 10000
Chunk overlap: 0

[SciDB] Scalable data management
[Figure: four instances; instance 1 is the coordinator, instances 2, 3, and 4 are workers]

Soft scalability test
on automotive telematics and location data
• This graph shows how performance scales when both the data volume
and the number of instances are increased together
• Query computes a score for each driver based on how many other
vehicles were driving at the same time, in the same areas as the driver
• If data is perfectly distributed and if all operations in a query are
perfectly parallelizable, the graph should be a 0 slope line
[Figure: execution time relative to 1X, plotted against scale factor]

New Data Window operator
• Computes aggregates over rolling one-dimensional windows
• skipping over empty array cells
• particularly useful for analysis of time series events that happen at
varying frequencies
• Data window accepts an input array, a dimension name,
number of preceding values, number of following values, and
a list of aggregate calls
data_window (input_array, dim_name,
num_preceding, num_following,
aggregate1(attribute1), aggregate2(attribute2)...)
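
A minimal Python sketch of the window semantics described above, under stated assumptions: the array is modeled as a dict of non-empty cells along one dimension, the window is counted in non-empty cells rather than coordinates, the current cell is included, and windows are truncated at the ends. The slide does not spell out these details, so the function below is an illustration and not the operator's implementation.

```python
def data_window(cells, num_preceding, num_following, aggregate):
    """Aggregate over a rolling window of non-empty cells.

    cells: dict mapping a coordinate to its value; missing keys are empty cells.
    aggregate: any function that takes a list of values, e.g. sum or max.
    """
    coords = sorted(cells)
    result = {}
    for pos, coord in enumerate(coords):
        lo = max(0, pos - num_preceding)
        hi = min(len(coords), pos + num_following + 1)
        result[coord] = aggregate([cells[c] for c in coords[lo:hi]])
    return result


# Hypothetical events at irregular times; coordinates 2..6 and 8..49 are empty.
events = {0: 10.0, 1: 20.0, 7: 30.0, 50: 40.0}
out = data_window(events, 1, 1, sum)
print(out)  # {0: 30.0, 1: 60.0, 7: 90.0, 50: 70.0}
```

Because empty cells are skipped, the event at coordinate 7 is combined with its nearest recorded neighbours at 1 and 50, which is what makes this kind of window useful for events that arrive at varying frequencies.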

Analyzing event data
• Event hot spots
• Look at which specific sets of locations (at the lat-long level) have
the most hard acceleration and hard braking events (count or
volume normalized metric)
• Profile hot spots by day of week, time of day
• Event windows
• Look at a 30 second window before and after each hard braking
and hard acceleration event
• Look for patterns to predict adverse events or profile drivers
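
The event-window idea can be sketched in a few lines of Python. The example assumes a sorted list of sample timestamps in seconds and selects every sample within 30 seconds before and after each event. The function name, the data, and the use of plain lists are assumptions for illustration only.

```python
from bisect import bisect_left, bisect_right


def event_windows(timestamps, event_times, before=30, after=30):
    """Return, for each event time, the samples within [t - before, t + after].

    timestamps must be sorted in ascending order.
    """
    windows = {}
    for t in event_times:
        lo = bisect_left(timestamps, t - before)
        hi = bisect_right(timestamps, t + after)
        windows[t] = timestamps[lo:hi]
    return windows


timestamps = list(range(0, 120, 10))        # hypothetical samples every 10 s
windows = event_windows(timestamps, [50])   # hypothetical hard-braking event at t = 50 s
print(windows[50])  # [20, 30, 40, 50, 60, 70, 80]
```

Binary search keeps each lookup logarithmic in the number of samples, which mirrors why storing time as an ordered dimension makes such range selections cheap.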

Manage data throughout its life cycle
• Data is never overwritten
• Preserve raw data, corrected data, and updates in the database
• Facilitates reproducibility, audits, compliance
• Supports model development and testing: what-if modeling,
scenario testing, back-testing, sensitivity testing
• Updates are versioned

Client Interfaces
• iquery: interactive command-line query interface
• Python, C++, R clients
• GUI (forms) interface coming
• Open source client API – roll your own!

What about Hadoop?
• Hadoop alone is not a DBMS
• No indexes, updates, data consistency, metadata
• Modules (Hadoop, Pig, Hive, HBase, HDFS) are loosely integrated and
require a lot of glue code
• Requires skilled development staff to write custom code and maintain clusters
• Slower than a real parallel distributed database so needs more HW
• Linear algebra operators are hard to implement as a map and a reduce
• See Stonebraker, Kepner CACM blog post: Possible Hadoop Trajectories
http://cacm.acm.org/blogs/blog-cacm/149074-possible-hadoop-trajectories

What about NoSQL like MongoDB?
Great for some use cases: match the tool to your requirements
• NoSQL and XML-based systems bake ‘schema’ into the application
code or the records themselves
• NoSQL is most easily defined by what it excludes
• No schemas
• No query language
• Lacks the automatic data integrity of ACID databases
• No support for joins which are useful when working with multiple data
sources
• Requires coding to walk the data structures to manage data and extract
information
• Harder to collaborate and share data across groups
• More custom code than a DB means potential longer term maintenance
and data archiving issues
• Paradigm4 offers the flexibility of object-oriented data schemas
without sacrificing ACID database integrity or ad hoc query support

SciDB and Paradigm4
• SciDB is a global, open source community
• Scientists from many fields & computer scientists
• www.scidb.org
• Paradigm4, a commercial company, sponsors & manages SciDB
• Doing all the initial development for SciDB
• Sells and supports a commercial-quality release of SciDB
• Along with enterprise management tools (e.g. provisioning,
security, recovery)
• And industry-specific add-ons
• www.paradigm4.com

Get more from your analytical database
• Power, Productivity & Performance
– Less coding
– Less data movement
– Transparent scale-up & speed-up
– Prototypes scale to production without rewriting
– Lower cost deployment
• Highly pedigreed technical team
CTO is Mike Stonebraker
renowned database researcher & entrepreneur
• Ready to work with early adopters

Big Complex Analytics
combines data sources for novel insights & products
Automotive Telematics Healthcare Informatics

Big Complex Analytics
powers population studies
> 70K tissue samples
> 65K gene probes per sample
covariance, clustering, SVD
> 10 million cars
GPS & driving data every sec
insurance by the trip & how you drive
linear regressions, risk & pricing modeling

Architecture
• ‘Shared Nothing’ cluster of commodity hardware nodes
• Interconnected with standard Ethernet and TCP/IP

SciDB Array Schema
CREATE ARRAY Simple_Array
< v1 : double,
v2 : int64,
v3 : string >
[ I = 0:*, 5, 0, J = 0:9, 5, 0 ];
Attributes: v1, v2, v3
Dimensions: I, J
Dimension size: * is unbounded
Chunk size: 5
Chunk overlap: 0
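
As a rough illustration of how a chunk size partitions the coordinate space, the Python sketch below maps a cell of Simple_Array (both dimensions start at 0, chunk size 5, overlap 0) to the origin of the chunk that would contain it. The function name and the tuple representation are assumptions; the sketch says nothing about how SciDB lays chunks out on disk.

```python
def chunk_origin(coords, lows, chunk_sizes):
    """Return the lowest coordinate of the chunk that contains `coords`."""
    return tuple(low + ((c - low) // size) * size
                 for c, low, size in zip(coords, lows, chunk_sizes))


LOWS = (0, 0)          # I and J both start at 0 in the schema above
CHUNK_SIZES = (5, 5)   # chunk size 5 along I and along J

print(chunk_origin((7, 3), LOWS, CHUNK_SIZES))    # (5, 0)
print(chunk_origin((12, 9), LOWS, CHUNK_SIZES))   # (10, 5)
```

Cells that are close together in I and J land in the same chunk, which is how the array model keeps neighbouring data together.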

SciDB array model: data types
• Whole numbers: int8, int16, int32, int64
• Unsigned whole numbers: uint8, ..., uint64
• Date and Time: datetime
• Date and Time with timezone: datetimez
• Floating point: float, double
• Boolean: bool
• Character: char
• Variable-length strings: string

SciDB array model: Storage
• SciDB stores every attribute separately
• Good compression:
– RLE
– zlib
• Parallel processing

SciDB array model: 1D-array
• Chunk: unit of data processing
• Chunk should fit in memory entirely
• User chooses chunk size

SciDB array model: bitmap
• SciDB marks EMPTY cells using a bitmap
• The bitmap compresses efficiently with RLE
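
A minimal sketch of why run-length encoding suits an empty-cell bitmap: long runs of identical bits collapse into (value, count) pairs. The list-of-ints bitmap and the function names are assumptions for illustration; the actual on-disk encoding used by SciDB is not described on the slide.

```python
from itertools import groupby


def rle_encode(bits):
    """Collapse consecutive equal bits into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(bits)]


def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original bit list."""
    return [value for value, count in runs for _ in range(count)]


# Hypothetical chunk: 6 occupied cells, 10 empty cells, 4 occupied cells.
bitmap = [1] * 6 + [0] * 10 + [1] * 4
runs = rle_encode(bitmap)
print(runs)                         # [(1, 6), (0, 10), (1, 4)]
print(rle_decode(runs) == bitmap)   # True
```

Twenty bits shrink to three pairs here; the gain grows with the length of the runs, which is typical for both very sparse and very dense chunks.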

SciDB array model: clustering
• Several available chunk distributions:
– Round-Robin (default)
– Replication
• Optimizer splits queries into stages
• Every stage is processed in parallel
• Scatter/Gather intermediate results after
every stage according to requirements
• Overlap helps decrease scatter/gather (SG) size (!)
• NO single point of failure
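
The round-robin distribution named above can be pictured with the following sketch, which deals chunks out to instances in turn. The chunk identifiers, the instance numbering from 0, and the dict-based result are assumptions for illustration; how SciDB actually orders or hashes chunks before assigning them is not stated on the slide.

```python
def round_robin(chunk_ids, num_instances):
    """Assign chunks to instances in turn: chunk n goes to instance n % num_instances."""
    placement = {i: [] for i in range(num_instances)}
    for n, chunk in enumerate(chunk_ids):
        placement[n % num_instances].append(chunk)
    return placement


# Hypothetical chunk origins for a 15 x 10 array with 5 x 5 chunks.
chunks = [(i, j) for i in (0, 5, 10) for j in (0, 5)]
placement = round_robin(chunks, 4)
for instance, assigned in placement.items():
    print(instance, assigned)
```

Every instance ends up with roughly the same number of chunks, so each stage of a query has a similar amount of local work on every instance.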

SciDB array model: redundancy
• --redundancy=X
• Every chunk is replicated X times
• A single copy of a given chunk per node
• Redundant chunks are used only when a node
becomes unavailable
• We protect against network and disk failures
• Use RAID to protect against disk failures
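
One way to picture the redundancy rules above is the sketch below. It assumes that redundancy X means X extra copies in addition to the primary, that copies go to consecutive nodes so no node holds two copies of the same chunk, and that a read falls back to a replica only when the primary's node is down. All names and the placement rule are hypothetical; the slide does not describe SciDB's actual placement algorithm.

```python
def place_with_redundancy(chunk_index, num_nodes, redundancy):
    """Return the nodes holding a chunk: the primary first, then its replicas."""
    if redundancy >= num_nodes:
        raise ValueError("need more nodes than extra copies")
    primary = chunk_index % num_nodes
    return [(primary + k) % num_nodes for k in range(redundancy + 1)]


def read_node(chunk_index, num_nodes, redundancy, failed):
    """Pick the first live node for a chunk, or None if every copy is lost."""
    for node in place_with_redundancy(chunk_index, num_nodes, redundancy):
        if node not in failed:
            return node
    return None


print(place_with_redundancy(5, 4, 1))        # [1, 2]
print(read_node(5, 4, 1, failed=set()))      # 1 (primary is used)
print(read_node(5, 4, 1, failed={1}))        # 2 (replica takes over)
print(read_node(5, 4, 1, failed={1, 2}))     # None
```

With redundancy 1 the array survives any single node failure; losing both nodes that hold a chunk makes that chunk unavailable.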

SciDB array model: release 12.7
• Time series
• Optimizations
• Binary loader (based on PostgreSQL binary
loader)
• data_window operator

SciDB array model: next release
• Repair failed nodes using redundant data
• Elastic cluster:
– Increase/decrease node count

Contact
– Marilyn Matz
– CEO & co-founder
– 781 718 3999
– mmatz@paradigm4.com
– www.paradigm4.com
– www.scidb.org
