© Paradigm4 Inc. confidential
Topics 
• The Big Complex Analytics Space 
• SciDB Overview 
• How we are different and why that matters 
• Architecture 
Note: We call our company P4 for short
Rich Data + Complex Analytics drive insights and
innovative product offerings
● Tap new types and sources of data 
– Location, genomic, behavioral, speech, sensors, images, … 
● Integrate mixed data sources for novel insights 
– Genomic/wearable sensors/EHRs/clinical/payer/provider
– Satellite images/smart grid data 
– Location/weather/traffic/driving behavior 
● Generate micro-segmented pricing & products 
– Personalized insurance 
– Precision medicine 
– Precision warranties 
– Behavioral targeting 
– Location-based services 
● Look at whole populations, big time windows, big regions
Where some of the ‘complex analytics’ problems are
Pharma, Biotech, AgroBusiness, Healthcare Informatics
• Next-gen sequencing analysis & GWAS
• Population studies
• Evidence-based outcome studies
• Pharmaco-economics
Insurance Analytics
• Personalized auto or workman’s comp insurance
• Catastrophe modeling and policy pricing
• Risk modeling for insurance exchanges
Industrial Analytics
• Precision warranty pricing & maintenance schedules
Call Centers
• Speech analytics
Energy
• Data from smart sensor grids
Digital Marketing
• Geo-targeting & other personalization strategies
• Recommendation engines
Financial Services
• Financial modeling, back testing, sensitivity testing
• Algorithmic trading
• Portfolio management & risk management
Scientific Research
• Astronomy, Climatology, High Energy Physics, et al.
‘Big Analytics’ covers two categories
• Big Volume + Simple Analytics
– Traditional Data Warehouses, RDBMSs
– Business analysts
– Count statistics, roll-ups, aggregates
• Big Volume + Complex Analytics (P4’s space)
– Emerging markets; new tools
– Data scientists / healthcare analysts / quants / operations researchers
– Multivariate statistics, clustering, SVD, machine learning, et al.
Why would industrial & commercial analytics 
applications benefit from yet another 
software platform? 
• Sensor data, geospatial data, temporal data, genomic 
data & images are far more efficiently managed as multi-dimensional 
arrays than as relational tables 
• Complex analytics should execute in place where the data 
resides and scale easily with additional nodes and cores 
P4’s new ‘Complex Analytics’ database
scientific data management & analytics for the commercial world
Rich Data 
Massively Scalable Math 
Smart Data Management
P4 is well-matched for M2M data 
Machine-generated data have inherent ordering & structure 
• location data from cars and cell phones 
• telematics data from sensors 
• energy usage data from smart sensor grids
• genetic sequencing data 
• patient telemetry data 
• time series and longitudinal event data 
• satellite images of the earth’s surface 
2012-01-31 22:32:36.968000
A new ‘Complex Analytics’ database
scientific data management & analytics for the commercial & industrial worlds
• All-in-one
– next-generation database with data life cycle management
– native, seamlessly integrated, scalable complex math operations
• Array data model
– optimal for temporal, geospatial, and machine-generated data
– n-dimensional
• Open Source
• Commodity HW grid or cloud
SciDB Features 
Distributed data storage 
With redundancy/fault tolerance and high-availability 
Scalable Parallel operations 
Parallel linear algebra, aggregates, summaries, data loading 
ACID Transactions 
Structured N-dimensional Sparse Array Data Model
Defined by schema 
Expressive SQL-like Query Syntax 
Supports joins by array dimensions 
No-Overwrite Data Versioning 
Extensible 
User-defined types, functions, operators
Paradigm4 enables data-intensive research
Capture: Ingest, store, and manage data throughout its lifecycle
Curation: Save raw, corrected, pre-processed, and derived analytic data, with metadata and provenance
Curiosity: Explore, drill down, filter, select
Compute: Complex math and modeling
Collaboration: Shared resource; no data silos with long, metadata filenames
Compliance: No-overwrite, versioned data storage supports reproducibility and validation of results
First class support for scientific data & scientific 
research 
• Ingest, store, access, and manage data throughout its life 
cycle 
• No overwrite database; historical versioning support 
• Metadata – store curation and calibration information 
• Extensibility (user defined types and operations) 
• Save raw, corrected, pre-processed, and derived data 
• Support for provenance 
• Support reproducibility of results 
• Share data across work groups and with outside 
organizations 
Why SciDB for scientists?
P4’s native Array DB beats Relational DBs* on storage efficiency & complex computations
● Math functions run directly on native storage format
● Dramatic storage efficiencies as # of dimensions & attributes grows
– Architecture supports n dimensions
● Facilitates drill-down & clustering by like groups
● High performance for both sparse and dense data
– 10-100x faster than RDBMSs on array operations
* Applies to both row stores & column stores
[Figure labels: 16 cells; 48 cells]
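One possible reading of the "16 cells" and "48 cells" labels on this slide, sketched in Python. It assumes (the slide does not say so) that the figure contrasts a dense 4 x 4 array holding one attribute with the same data stored as relational rows of (i, j, value). The function names are illustrative only.

```python
def array_cells(shape, num_attributes):
    """Cells stored when coordinates are implicit in the array layout."""
    cells = num_attributes
    for extent in shape:
        cells *= extent
    return cells


def table_cells(shape, num_attributes):
    """Cells stored when every row must also carry its coordinates."""
    rows = 1
    for extent in shape:
        rows *= extent
    return rows * (len(shape) + num_attributes)


# Assumed example: a dense 4 x 4 array with a single attribute.
print(array_cells((4, 4), 1))  # 16
print(table_cells((4, 4), 1))  # 48
```

Under this reading the table overhead grows with every added dimension, because each row repeats one more coordinate column.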
Data exploration & analytics work better when 
the natural ordering of data is preserved 
Clusters, temporal regions are stored together 
Resample time or re-grid geospatial data at any resolution 
Slice & drill-down in any n-dimensional region 
Fast data selection for ad hoc queries 
Efficient analytics over sub-regions & moving windows 
Complex math underpins many use cases
Industrial Analytics
• Precision warranty pricing
• Proactive preventive maintenance
• Modeling & optimization
• Event monitoring in refineries and factories
Pharma, Biotech, Healthcare
• Next-gen Sequencing
• Population studies
• Outcome studies
• Precision medicine
Complex math: covariance, PCA, SVD, cross validation, bootstrapping, cluster analysis, linear/logistic regression
Complex math underpins many use cases
Computational Finance
• Back testing
• Sophisticated modeling
• Portfolio optimization
• Risk management
Complex math: covariance, PCA, SVD, cross validation, bootstrapping, cluster analysis, Monte Carlo methods, linear/logistic regression
P4’s native math library supports distributed processing
• Task parallelism
– ‘Embarrassingly parallel’ tasks
– Process subpopulations in parallel
– Run simulations in parallel
• Massively scalable complex math
– ‘Non-embarrassingly’ parallel tasks like large-scale linear algebra
– Math operations that pass intermediate data between nodes
– Challenging O(n^3) computations
– Math operations on data too large to fit on one node
• Large-scale analytics without sampling
– Look at whole populations, big time windows, big regions
– Sample when you want to, not to fit analytics package constraints
– Use all the data: sometimes you really want the long tail or the black swan
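To illustrate why large-scale linear algebra is not embarrassingly parallel, here is a minimal Python sketch of a matrix product split across hypothetical nodes. It is not SciDB's implementation: the strip-wise split, the function names, and the assumption that the matrix sizes divide evenly by the node count are all choices made for this example.

```python
def matmul(a, b):
    """Plain triple-loop product of two small dense matrices (lists of rows)."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def split_columns(m, parts):
    """Split a matrix into `parts` vertical strips of equal width."""
    width = len(m[0]) // parts
    return [[row[p * width:(p + 1) * width] for row in m] for p in range(parts)]


def split_rows(m, parts):
    """Split a matrix into `parts` horizontal strips of equal height."""
    height = len(m) // parts
    return [m[p * height:(p + 1) * height] for p in range(parts)]


def distributed_matmul(a, b, nodes):
    """Each hypothetical node multiplies its strip of A by its strip of B.

    Every partial product is a full-size matrix, and the partials must be
    summed across nodes: that sum is the intermediate data exchange.
    """
    partials = [matmul(a_strip, b_strip)
                for a_strip, b_strip in zip(split_columns(a, nodes),
                                            split_rows(b, nodes))]
    rows, cols = len(a), len(b[0])
    return [[sum(p[i][j] for p in partials) for j in range(cols)]
            for i in range(rows)]


a = [[1, 2, 3, 4], [5, 6, 7, 8]]
b = [[1, 0], [0, 1], [2, 0], [0, 2]]
print(distributed_matmul(a, b, nodes=2))  # [[7, 10], [19, 22]]
```

Unlike per-subpopulation tasks, no node can finish its part of the answer alone; the final sum needs data from every node.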
Query language seamlessly integrates data manipulation & math
• Array Query Language (AQL)
– Declarative SQL-like language extended for working with array data
– Large-scale math operations embedded in queries
• Extensible
– Add user-defined types and functions
• R, Python, and other client interfaces
Compute the log odds ratio for a failure model using logistic regression 
SELECT * 
FROM LOGISTREGR (model_matrix, success_count, failure_count, 
'coefficients')
Linear Algebra as Building Block
Mathematical and Data Manipulation Operations
multiply ( transpose ( Simple_Array ), Simple_Array );
regrid ( Simple_Array, 10, 10, avg (v2) );
cumsum ( filter ( Simple_Array, v1 = 'Odd' ), I, v1 );
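For readers unfamiliar with regrid, the following Python sketch shows the kind of block aggregation that the regrid call above expresses, assuming it replaces each block of cells (10 x 10 in the slide) with the average of one attribute. The helper name regrid_avg and the 4 x 4 sample grid with 2 x 2 blocks are illustrative and not part of SciDB.

```python
def regrid_avg(grid, block_rows, block_cols):
    """Average each block_rows x block_cols block of a dense 2-D grid."""
    out = []
    for r0 in range(0, len(grid), block_rows):
        out_row = []
        for c0 in range(0, len(grid[0]), block_cols):
            block = [value
                     for row in grid[r0:r0 + block_rows]
                     for value in row[c0:c0 + block_cols]]
            out_row.append(sum(block) / len(block))
        out.append(out_row)
    return out


grid = [[1, 2, 3, 4],
        [5, 6, 7, 8],
        [9, 10, 11, 12],
        [13, 14, 15, 16]]
print(regrid_avg(grid, 2, 2))  # [[3.5, 5.5], [11.5, 13.5]]
```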
Flexible Schema
• Ad hoc queries
– Don’t have to know a priori what questions you will want to ask of your data
• Change schema dynamically
– Values <=> dimensions
• Supports transparent data exploration and mining
Well-suited for storing, accessing, & analyzing images
[Figure panels: Satellite images, Healthcare images, GIS data]
Store metadata with the data
• Instrument id & calibration data
• Experimental conditions and variables
• Data set identifiers & comments
SciDB support for images
• Regrid operator
– Change resolution and coordinate systems
• Overlap
– Supports feature detection when features fall between nodes
• Support for multi-dimensional window operations
– Spatial averaging
• Non-integer dimensions
– Access image through spatio-temporal coordinate systems
– Astronomy (right ascension, declination)
– Remote sensing (lat, long, time)
SciDB array model: create array
CREATE ARRAY RGB
< red : int16,
green : int16,
blue : int16 >
[ longitude(double) = *, 10000, 0,
latitude(double) = *, 10000, 0 ];
Attributes: red, green, blue
Dimensions: longitude, latitude
Dimension size: * indicates unbounded
Chunk size: 10000
Chunk overlap: 0
[SciDB] Scalable data management
[Figure: four instances; instance 1 is the coordinator, instances 2, 3, and 4 are workers]
Soft scalability test
on automotive telematics and location data
• This graph shows how performance scales when both the data volume and the number of instances are increased together
• Query computes a score for each driver based on how many other vehicles were driving at the same time, in the same areas as the driver
• If data is perfectly distributed and if all operations in a query are perfectly parallelizable, the graph should be a zero-slope line
[Figure axes: execution time relative to 1X; scale factor]
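The "0 slope" expectation can be checked numerically. The sketch below normalizes execution times to the 1X run and fits a least-squares slope against the scale factor. The timings are hypothetical values made up for illustration; they are not measurements from this test, and the function names are not from SciDB.

```python
def relative_times(times_by_scale):
    """Execution time at each scale factor divided by the 1X time."""
    base = times_by_scale[1]
    return {scale: t / base for scale, t in sorted(times_by_scale.items())}


def slope(points):
    """Least-squares slope of y against x for a dict {x: y}."""
    n = len(points)
    mean_x = sum(points) / n
    mean_y = sum(points.values()) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in points.items())
    den = sum((x - mean_x) ** 2 for x in points)
    return num / den


# Hypothetical timings in seconds, made up for illustration only.
ideal = {1: 40.0, 2: 40.0, 4: 40.0, 8: 40.0}
print(relative_times(ideal))         # {1: 1.0, 2: 1.0, 4: 1.0, 8: 1.0}
print(slope(relative_times(ideal)))  # 0.0
```

A positive slope on real timings would indicate skewed data distribution or query stages that do not parallelize perfectly.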
New Data Window operator
• Computes aggregates over rolling one-dimensional windows
– skipping over empty array cells
– particularly useful for analysis of time series events that happen at varying frequencies
• Data window accepts an input array, a dimension name, number of preceding values, number of following values, and a list of aggregate calls
data_window (input_array, dim_name, 
num_preceding, num_following, 
aggregate1(attribute1), aggregate2(attribute2)...)
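A small Python sketch of the behavior described for data_window, assuming the window for each cell counts only non-empty cells along the chosen dimension. This is an interpretation of the slide, not SciDB code; the function name data_window_sum, the dict representation of a sparse array, and the sample values are illustrative.

```python
def data_window_sum(cells, num_preceding, num_following):
    """Rolling sum over the non-empty cells of a sparse 1-D array.

    `cells` maps coordinate -> value; missing coordinates are empty cells.
    The window for each cell spans the cell itself plus up to
    `num_preceding` earlier and `num_following` later non-empty cells.
    """
    coords = sorted(cells)
    result = {}
    for pos, coord in enumerate(coords):
        lo = max(0, pos - num_preceding)
        hi = min(len(coords), pos + num_following + 1)
        result[coord] = sum(cells[c] for c in coords[lo:hi])
    return result


# Events at irregular times; coordinates 2 and 4 through 8 are empty.
events = {0: 1, 1: 2, 3: 4, 9: 8}
print(data_window_sum(events, 1, 1))  # {0: 3, 1: 7, 3: 14, 9: 12}
```

Because empty cells are skipped, the event at coordinate 9 still sees its nearest earlier neighbor at coordinate 3, which a fixed-width coordinate window would miss.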
Analyzing event data
• Event hot spots
– Look at which specific sets of locations (at the lat-long level) have the most hard acceleration and hard braking events (count or volume-normalized metric)
– Profile hot spots by day of week, time of day
• Event windows
– Look at a 30-second window before and after each hard braking and hard acceleration event
– Look for patterns to predict adverse events or profile drivers
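The event-window idea can be sketched in a few lines of Python: for each event time, collect the readings that fall within 30 seconds on either side. The data layout (a time-sorted list of (second, speed) pairs), the function name, and the sample readings are assumptions for this example, not part of the original analysis.

```python
from bisect import bisect_left, bisect_right


def event_windows(samples, event_times, half_width=30):
    """Map each event time to the samples within +/- half_width seconds.

    `samples` is a list of (time, value) pairs sorted by time.
    """
    times = [t for t, _ in samples]
    windows = {}
    for event in event_times:
        lo = bisect_left(times, event - half_width)
        hi = bisect_right(times, event + half_width)
        windows[event] = samples[lo:hi]
    return windows


# Hypothetical (second, speed) readings and one hard-braking event at t=100.
samples = [(60, 50), (75, 52), (95, 55), (100, 30), (120, 20), (140, 25)]
print(event_windows(samples, [100]))
# {100: [(75, 52), (95, 55), (100, 30), (120, 20)]}
```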
Manage data throughout its life cycle 
• Data is never overwritten 
• Preserve raw data, corrected data, and updates in the database 
• Facilitates reproducibility, audits, compliance 
• Supports model development and testing: what-if modeling, 
scenario testing, back-testing, sensitivity testing 
• Updates are versioned
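A toy Python model of no-overwrite, versioned storage, to make the idea concrete. It is a sketch only: the class name and methods are hypothetical, and it keeps a full copy per version, which says nothing about how SciDB actually stores versions.

```python
class VersionedArray:
    """Toy no-overwrite store: every update appends a new version."""

    def __init__(self):
        self._versions = []  # list of dicts, oldest first

    def store(self, cells):
        """Save a new version and return its 1-based version number."""
        self._versions.append(dict(cells))
        return len(self._versions)

    def read(self, version=None):
        """Return a copy of the requested version (latest by default)."""
        if version is None:
            version = len(self._versions)
        return dict(self._versions[version - 1])


arr = VersionedArray()
raw = arr.store({0: 10.0, 1: 99.0})    # raw data becomes version 1
fixed = arr.store({0: 10.0, 1: 11.0})  # corrected data becomes version 2
print(arr.read(raw))  # {0: 10.0, 1: 99.0}
print(arr.read())     # {0: 10.0, 1: 11.0}
```

Because the raw version stays readable after the correction, an earlier analysis can be rerun against exactly the data it originally saw.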
Client Interfaces
• iquery: interactive command-line query interface
• Python, C++, R clients
• GUI (forms) interface coming
• Open source client API: roll your own!
What about Hadoop?
• Hadoop alone is not a DBMS
– No indexes, updates, data consistency, metadata
• Modules (Hadoop, Pig, Hive, HBase, HDFS) are loosely integrated and require a lot of glue code
• Requires skilled development staff to write custom code and maintain clusters
• Slower than a real parallel distributed database, so needs more HW
• Linear algebra operators are hard to implement as a map and a reduce
• See Stonebraker, Kepner CACM blog post: Possible Hadoop Trajectories
http://cacm.acm.org/blogs/blog-cacm/149074-possible-hadoop-trajectories
What about NoSQL like MongoDB?
Great for some use cases: match the tool to your requirements
• NoSQL and XML-based systems bake ‘schema’ into the application code or the records themselves
• NoSQL is most easily defined by what it excludes
– No schemas
– No query language
• Lacks the automatic data integrity of ACID databases
• No support for joins, which are useful when working with multiple data sources
• Requires coding to walk the data structures to manage data and extract information
• Harder to collaborate and share data across groups
• More custom code than a DB means potential longer-term maintenance and data archiving issues
• Paradigm4 offers the flexibility of object-oriented data schemas without sacrificing ACID database integrity or ad hoc query support
SciDB and Paradigm4
• SciDB is a global, open source community
– Scientists from many fields & computer scientists
– www.scidb.org
• Paradigm4, a commercial company, sponsors & manages SciDB
– Doing all the initial development for SciDB
– Sells and supports a commercial-quality release of SciDB
– Along with enterprise management tools (e.g. provisioning, security, recovery)
– And industry-specific add-ons
– www.paradigm4.com
Get more from your analytical database 
• Power, Productivity & Performance 
– Less coding 
– Less data movement 
– Transparent scale-up & speed-up 
– Prototypes scale to production without rewriting 
– Lower cost deployment 
• Highly pedigreed technical team 
CTO is Mike Stonebraker 
renowned database researcher & entrepreneur 
• Ready to work with early adopters
Big Complex Analytics 
combines data sources for novel insights & products 
Automotive Telematics Healthcare Informatics 
Big Complex Analytics 
powers population studies 
> 70K tissue samples 
> 65K gene probes per sample 
covariance, clustering, SVD 
> 10 million cars 
GPS & driving data every sec 
insurance by the trip & how you drive 
linear regressions, risk & pricing modeling
Architecture
• ‘Shared Nothing’ cluster of commodity hardware nodes
• Interconnected with standard Ethernet and TCP/IP
SciDB Array Schema
CREATE ARRAY Simple_Array
< v1 : double,
v2 : int64,
v3 : string >
[ I = 0:*, 5, 0, J = 0:9, 5, 0 ];
Attributes: v1, v2, v3
Dimensions: I, J
Dimension size: * is unbounded
Chunk size: 5
Chunk overlap: 0
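To show what the chunk size in the Simple_Array schema implies, here is a Python sketch that maps a cell's coordinates to the chunk containing it, assuming chunks tile each dimension in fixed-size steps starting at the dimension's lower bound and that overlap is 0. The function names are illustrative, not SciDB API.

```python
def chunk_of(coords, lows, chunk_sizes):
    """Index of the chunk holding a cell, one entry per dimension."""
    return tuple((c - low) // size
                 for c, low, size in zip(coords, lows, chunk_sizes))


def chunk_origin(coords, lows, chunk_sizes):
    """Coordinates of the first cell of the chunk holding a cell."""
    return tuple(low + ((c - low) // size) * size
                 for c, low, size in zip(coords, lows, chunk_sizes))


# Simple_Array: I = 0:*, J = 0:9, chunk size 5 in both dimensions.
lows, sizes = (0, 0), (5, 5)
print(chunk_of((7, 3), lows, sizes))      # (1, 0)
print(chunk_origin((7, 3), lows, sizes))  # (5, 0)
print(chunk_of((12, 9), lows, sizes))     # (2, 1)
```

With these settings each chunk covers a 5 x 5 block of cells, so J's ten values span exactly two chunks while the unbounded I dimension can keep adding chunks.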
SciDB array model: data types 
• Whole numbers: int8, int16, int32, int64 
• Unsigned whole numbers: uint8, ..., uint64 
• Date and Time: datetime 
• Date and Time with timezone: datetimez 
• Floating point: float, double 
• Boolean: bool 
• Character: char 
• Variable-length strings: string 
SciDB array model: Storage
• SciDB stores every attribute separately
• Good compression: 
– RLE 
– zlib 
• Parallel processing
SciDB array model: 1D-array 
• Chunk: unit of data processing 
• Chunk should fit in memory entirely 
• User chooses chunk size
SciDB array model: bitmap
• SciDB describes EMPTY values using a bitmap
• The bitmap is compressed efficiently with RLE
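Run-length encoding suits an empty-cell bitmap because empty and non-empty cells tend to occur in long runs. The Python sketch below shows generic RLE on such a bitmap. It is not SciDB's on-disk format, and the choice of True for a non-empty cell is an assumption.

```python
from itertools import groupby


def rle_encode(bits):
    """Run-length encode a sequence of booleans as (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(bits)]


def rle_decode(runs):
    """Expand (value, run_length) pairs back into a list of booleans."""
    return [value for value, length in runs for _ in range(length)]


# True marks a non-empty cell, False an empty one (polarity is an assumption).
bitmap = [True] * 3 + [False] * 6 + [True] * 1
runs = rle_encode(bitmap)
print(runs)                        # [(True, 3), (False, 6), (True, 1)]
print(rle_decode(runs) == bitmap)  # True
```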
SciDB array model: 2D-array 
• Stride-major-order of chunks
SciDB array model: 2D-chunk
SciDB array model: clustering
• Several available chunk distributions:
– Round-Robin (default)
– Replication
• Optimizer splits queries into stages
• Every stage is processed in parallel
• Scatter/Gather (SG) of intermediate results after every stage according to requirements
• Overlap helps decrease SG size (!)
• NO single point of failure
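A minimal Python sketch of round-robin chunk distribution, assuming chunks are simply dealt out to instances in rotating order. The enumeration order of the chunks and the 0-based instance numbering are assumptions for this example, not details taken from SciDB.

```python
def round_robin(chunk_ids, num_instances):
    """Assign chunks to instances 0..num_instances-1 in rotating order."""
    placement = {i: [] for i in range(num_instances)}
    for n, chunk in enumerate(chunk_ids):
        placement[n % num_instances].append(chunk)
    return placement


chunks = [(i, j) for i in range(3) for j in range(2)]  # 6 chunks of a 2-D array
print(round_robin(chunks, 4))
# {0: [(0, 0), (2, 0)], 1: [(0, 1), (2, 1)], 2: [(1, 0)], 3: [(1, 1)]}
```

Dealing chunks out this way keeps per-instance chunk counts within one of each other, which is what lets every query stage run in parallel on all instances.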
SciDB array model: redundancy
• --redundancy=X
• Every chunk is replicated X times
• Single copy on every node
• Redundant chunks are used only when a node becomes unavailable
• We protect against network and disk failures
• Use RAID to protect against disk failures
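The redundancy rules can be illustrated with a short Python sketch. It assumes that --redundancy=X means X extra copies beyond the primary and that replicas are placed on the instances following the primary; both are assumptions made for this example, and the function names are hypothetical.

```python
def place_with_redundancy(chunk_index, num_instances, redundancy):
    """Return the instances holding a chunk: one primary plus replicas.

    Replicas go to the instances that follow the primary, so no instance
    holds more than one copy of the same chunk.
    """
    if redundancy >= num_instances:
        raise ValueError("need more instances than extra copies")
    primary = chunk_index % num_instances
    return [(primary + k) % num_instances for k in range(redundancy + 1)]


def readable_copy(holders, failed):
    """First holder that is still up, or None if every copy is lost."""
    for instance in holders:
        if instance not in failed:
            return instance
    return None


holders = place_with_redundancy(chunk_index=5, num_instances=4, redundancy=1)
print(holders)                             # [1, 2]
print(readable_copy(holders, failed={1}))  # 2
```

The replica on instance 2 is consulted only after instance 1 is marked as failed, matching the rule that redundant chunks are used only when a node becomes unavailable.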
SciDB array model: release 12.7 
• Time series 
• Optimizations 
• Binary loader (based on PostgreSQL binary 
loader) 
• data_window operator 
SciDB array model: next release 
• Repair failed nodes from redundant data
• Elastic cluster:
– Increase/decrease node count
Contact 
– Marilyn Matz 
– CEO & co-founder 
– 781 718 3999 
– mmatz@paradigm4.com 
– www.paradigm4.com 
– www.scidb.org
innovative data management with complex analytics 

 
The New Trillium DQ: Big Data Insights When and Where You Need Them
The New Trillium DQ: Big Data Insights When and Where You Need ThemThe New Trillium DQ: Big Data Insights When and Where You Need Them
The New Trillium DQ: Big Data Insights When and Where You Need Them
 
Predictive Analytics - Big Data Warehousing Meetup, Zementis
Predictive Analytics - Big Data Warehousing Meetup, ZementisPredictive Analytics - Big Data Warehousing Meetup, Zementis
Predictive Analytics - Big Data Warehousing Meetup, Zementis
 
Data Science Machine Lerning Bigdat.pptx
Data Science Machine Lerning Bigdat.pptxData Science Machine Lerning Bigdat.pptx
Data Science Machine Lerning Bigdat.pptx
 

Dernier

Cybersecurity Threats and Cybersecurity Best Practices
Cybersecurity Threats and Cybersecurity Best PracticesCybersecurity Threats and Cybersecurity Best Practices
Cybersecurity Threats and Cybersecurity Best PracticesLumiverse Solutions Pvt Ltd
 
IP addressing and IPv6, presented by Paul Wilson at IETF 119
IP addressing and IPv6, presented by Paul Wilson at IETF 119IP addressing and IPv6, presented by Paul Wilson at IETF 119
IP addressing and IPv6, presented by Paul Wilson at IETF 119APNIC
 
ETHICAL HACKING dddddddddddddddfnandni.pptx
ETHICAL HACKING dddddddddddddddfnandni.pptxETHICAL HACKING dddddddddddddddfnandni.pptx
ETHICAL HACKING dddddddddddddddfnandni.pptxNIMMANAGANTI RAMAKRISHNA
 
『澳洲文凭』买詹姆士库克大学毕业证书成绩单办理澳洲JCU文凭学位证书
『澳洲文凭』买詹姆士库克大学毕业证书成绩单办理澳洲JCU文凭学位证书『澳洲文凭』买詹姆士库克大学毕业证书成绩单办理澳洲JCU文凭学位证书
『澳洲文凭』买詹姆士库克大学毕业证书成绩单办理澳洲JCU文凭学位证书rnrncn29
 
SCM Symposium PPT Format Customer loyalty is predi
SCM Symposium PPT Format Customer loyalty is prediSCM Symposium PPT Format Customer loyalty is predi
SCM Symposium PPT Format Customer loyalty is predieusebiomeyer
 
Company Snapshot Theme for Business by Slidesgo.pptx
Company Snapshot Theme for Business by Slidesgo.pptxCompany Snapshot Theme for Business by Slidesgo.pptx
Company Snapshot Theme for Business by Slidesgo.pptxMario
 
『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书
『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书
『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书rnrncn29
 
Unidad 4 – Redes de ordenadores (en inglés).pptx
Unidad 4 – Redes de ordenadores (en inglés).pptxUnidad 4 – Redes de ordenadores (en inglés).pptx
Unidad 4 – Redes de ordenadores (en inglés).pptxmibuzondetrabajo
 
TRENDS Enabling and inhibiting dimensions.pptx
TRENDS Enabling and inhibiting dimensions.pptxTRENDS Enabling and inhibiting dimensions.pptx
TRENDS Enabling and inhibiting dimensions.pptxAndrieCagasanAkio
 

Dernier (9)

Cybersecurity Threats and Cybersecurity Best Practices
Cybersecurity Threats and Cybersecurity Best PracticesCybersecurity Threats and Cybersecurity Best Practices
Cybersecurity Threats and Cybersecurity Best Practices
 
IP addressing and IPv6, presented by Paul Wilson at IETF 119
IP addressing and IPv6, presented by Paul Wilson at IETF 119IP addressing and IPv6, presented by Paul Wilson at IETF 119
IP addressing and IPv6, presented by Paul Wilson at IETF 119
 
ETHICAL HACKING dddddddddddddddfnandni.pptx
ETHICAL HACKING dddddddddddddddfnandni.pptxETHICAL HACKING dddddddddddddddfnandni.pptx
ETHICAL HACKING dddddddddddddddfnandni.pptx
 
『澳洲文凭』买詹姆士库克大学毕业证书成绩单办理澳洲JCU文凭学位证书
『澳洲文凭』买詹姆士库克大学毕业证书成绩单办理澳洲JCU文凭学位证书『澳洲文凭』买詹姆士库克大学毕业证书成绩单办理澳洲JCU文凭学位证书
『澳洲文凭』买詹姆士库克大学毕业证书成绩单办理澳洲JCU文凭学位证书
 
SCM Symposium PPT Format Customer loyalty is predi
SCM Symposium PPT Format Customer loyalty is prediSCM Symposium PPT Format Customer loyalty is predi
SCM Symposium PPT Format Customer loyalty is predi
 
Company Snapshot Theme for Business by Slidesgo.pptx
Company Snapshot Theme for Business by Slidesgo.pptxCompany Snapshot Theme for Business by Slidesgo.pptx
Company Snapshot Theme for Business by Slidesgo.pptx
 
『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书
『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书
『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书
 
Unidad 4 – Redes de ordenadores (en inglés).pptx
Unidad 4 – Redes de ordenadores (en inglés).pptxUnidad 4 – Redes de ordenadores (en inglés).pptx
Unidad 4 – Redes de ordenadores (en inglés).pptx
 
TRENDS Enabling and inhibiting dimensions.pptx
TRENDS Enabling and inhibiting dimensions.pptxTRENDS Enabling and inhibiting dimensions.pptx
TRENDS Enabling and inhibiting dimensions.pptx
 

SciDB

  • 1. © Paradigm4 Inc. confidential
  • 2. © Paradigm4 Inc. confidential Topics • The Big Complex Analytics Space • SciDB Overview • How we are different and why that matters • Architecture Note: We call our company P4 for short
  • 3. Rich Data + Complex Analytics drive insights and innovative product offerings ● Tap new types and sources of data – Location, genomic, behavioral, speech, sensors, images, … ● Integrate mixed data sources for novel insights – Genomic/wearable sensors/EHRs/clinical/payer/provider – Satellite images/smart grid data – Location/weather/traffic/driving behavior ● Generate micro-segmented pricing & products – Personalized insurance – Precision medicine – Precision warranties – Behavioral targeting – Location-based services ● Look at whole populations, big time windows, big regions
  • 4. Where some of the ‘complex analytics’ problems are © Paradigm4 Inc. confidential Pharma, Biotech, AgroBusiness, Healthcare Informatics • Next-gen sequencing analysis & GWAS • Population studies • Evidence-based outcome studies • Pharmaco-economics Insurance Analytics • Personalized auto or workman’s comp insurance • Catastrophe modeling and policy pricing • Risk modeling for insurance exchanges Industrial Analytics • Precision warranty pricing & maintenance schedules Call Centers • Speech analytics Energy • Data from smart sensor grids Digital Marketing • Geo-targeting & other personalization strategies • Recommendation engines Financial Services • Financial modeling, back testing, sensitivity testing • Algorithmic trading • Portfolio management & risk management Scientific Research • Astronomy, Climatology, High Energy Physics, et al
  • 5. ‘Big Analytics’ covers two categories P4’s space © Paradigm4 Inc. confidential • Big Volume + Simple analytics – Traditional Data Warehouses, RDBMSs – Business analysts – Count statistics, roll-ups, aggregates • Big Volume + Complex Analytics – Emerging markets; new tools – Data scientists / healthcare analysts / quants / operations researchers – Multivariate statistics, clustering, SVD, machine-learning, et al
  • 6. Why would industrial & commercial analytics applications benefit from yet another software platform? • Sensor data, geospatial data, temporal data, genomic data & images are far more efficiently managed as multi-dimensional arrays than as relational tables • Complex analytics should execute in place where the data resides and scale easily with additional nodes and cores © Paradigm4 Inc. confidential
  • 7. P4’s new ‘Complex Analytics’ database: scientific data management & analytics for the commercial world • Rich Data • Massively Scalable Math • Smart Data Management
  • 8. P4 is well-matched for M2M data: machine-generated data have inherent ordering & structure • location data from cars and cell phones • telematics data from sensors • energy usage data from smart sensor grids • genetic sequencing data • patient telemetry data • time series and longitudinal event data • satellite images of the earth’s surface 2012-01-31 22:32:36.968000
  • 9. A new ‘Complex Analytics’ database: scientific data management & analytics for the commercial & industrial worlds • All-in-one next-generation database with data life cycle management and native, seamlessly integrated, scalable complex math operations • n-dimensional array data model, optimal for temporal, geospatial, and machine-generated data • Open Source • Commodity HW grid or cloud
  • 10. SciDB Features • Distributed data storage with redundancy/fault tolerance and high availability • Scalable parallel operations: parallel linear algebra, aggregates, summaries, data loading • ACID transactions • Structured N-dimensional sparse array data model defined by schema • Expressive SQL-like query syntax; supports joins by array dimensions • No-overwrite data versioning • Extensible: user-defined types, functions, operators
  • 11. Paradigm4 enables data-intensive research • Capture: ingest, store, and manage data throughout its lifecycle • Curation: save raw, corrected, pre-processed and derived analytic data, with metadata and provenance • Curiosity: explore, drill down, filter, select • Compute: complex math and modeling • Collaboration: shared resource; no data silos with long, metadata filenames • Compliance: no-overwrite, versioned data storage supports reproducibility and validation of results
  • 12. First class support for scientific data & scientific research • Ingest, store, access, and manage data throughout its life cycle • No overwrite database; historical versioning support • Metadata – store curation and calibration information • Extensibility (user defined types and operations) • Save raw, corrected, pre-processed, and derived data • Support for provenance • Support reproducibility of results • Share data across work groups and with outside organizations © Paradigm4 Inc. confidential Why SciDB for scientists?
  • 13. P4’s native Array DB beats Relational DBs* on storage efficiency & complex computations ● Math functions run directly on native storage format ● Dramatic storage efficiencies as # of dimensions & attributes grows – Architecture supports n dimensions ● Facilitates drill-down & clustering by like groups ● High performance for both sparse and dense data – 10-100x faster than RDBMSs on array operations * Applies to both row stores & column stores [Figure labels: 16 cells; 48 cells]
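The figure behind the "16 cells" and "48 cells" labels is not preserved in this transcript. One plausible reading, offered here only as an assumption, is a dense 4×4 array stored natively versus the same data flattened into a relational table with one (row, column, value) tuple per cell. A minimal Python sketch of that count (the function names are illustrative, not part of SciDB):

```python
def array_values(shape, num_attributes=1):
    """Values stored when the array is kept in native dense form:
    one value per attribute per cell; coordinates are implicit."""
    cells = 1
    for extent in shape:
        cells *= extent
    return cells * num_attributes


def relational_values(shape, num_attributes=1):
    """Values stored when every array cell becomes one table row of
    (dim_1, ..., dim_n, attr_1, ..., attr_k): coordinates are explicit."""
    rows = array_values(shape, 1)
    return rows * (len(shape) + num_attributes)


native = array_values((4, 4))        # 4 * 4 cells, one attribute
tabular = relational_values((4, 4))  # 16 rows * (2 dims + 1 attribute)
```

Under this reading the overhead grows with the number of dimensions, since every extra dimension adds an explicit coordinate column to each relational row.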
  • 14. Data exploration & analytics work better when the natural ordering of data is preserved Clusters, temporal regions are stored together Resample time or re-grid geospatial data at any resolution Slice & drill-down in any n-dimensional region Fast data selection for ad hoc queries Efficient analytics over sub-regions & moving windows © Paradigm4 Inc. confidential
  • 15. Complex math underpins many use cases: Industrial Analytics • Precision warranty pricing • Proactive preventive maintenance • Modeling & optimization • Event monitoring in refineries and factories • covariance • PCA, SVD • cross validation • bootstrapping • cluster analysis • linear/logistic regression Pharma Biotech Healthcare • Next-gen Sequencing • Population studies • Outcome studies • Precision medicine
  • 16. Complex math underpins many use cases: Computational Finance • Back testing • Sophisticated modeling • Portfolio optimization • Risk management • covariance • PCA, SVD • cross validation • bootstrapping • cluster analysis • Monte Carlo methods • linear/logistic regression
  • 17. P4’s native math library supports distributed processing • Task parallelism: ‘embarrassingly parallel’ tasks; process subpopulations in parallel; run simulations in parallel • Massively scalable complex math: ‘non-embarrassingly’ parallel tasks like large-scale linear algebra; math operations that pass intermediate data between nodes; challenging O(n^3) computations; math operations on data too large to fit on one node • Large-scale analytics without sampling: look at whole populations, big time windows, big regions; sample when you want to, not to fit analytics package constraints; use all the data: sometimes you really want the long tail or the black swan
  • 18. Query language seamlessly integrates data manipulation & math • Array Query Language (AQL): declarative SQL-like language extended for working with array data; large-scale math operations embedded in queries • Extensible: add user-defined types and functions • R, Python, and other client interfaces • Example: compute the log odds ratio for a failure model using logistic regression: SELECT * FROM LOGISTREGR (model_matrix, success_count, failure_count, 'coefficients')
  • 19. Linear Algebra as Building Block: Mathematical and Data Manipulation Operations • multiply ( transpose ( Simple_Array ), Simple_Array ); • regrid ( Simple_Array, 10, 10, avg (v2) ); • cumsum ( filter ( Simple_Array, v1 = 'Odd' ), I, v1 );
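To make the first two operations on this slide concrete, here is a small single-machine sketch in plain Python. It only illustrates what a transpose-multiply and a block-averaging regrid compute; it says nothing about how SciDB executes them, and the helper names and sample data are assumptions made for illustration.

```python
def transpose(a):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(col) for col in zip(*a)]


def multiply(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    b_cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols]
            for row in a]


def regrid_avg(a, block_rows, block_cols):
    """Replace each block_rows x block_cols tile with its average,
    producing a coarser grid (edge tiles may be smaller)."""
    out = []
    for r in range(0, len(a), block_rows):
        out_row = []
        for c in range(0, len(a[0]), block_cols):
            tile = [a[i][j]
                    for i in range(r, min(r + block_rows, len(a)))
                    for j in range(c, min(c + block_cols, len(a[0])))]
            out_row.append(sum(tile) / len(tile))
        out.append(out_row)
    return out


a = [[1, 2], [3, 4], [5, 6], [7, 8]]
gram = multiply(transpose(a), a)   # analogous to multiply(transpose(A), A)
coarse = regrid_avg(a, 2, 2)       # analogous to regrid(A, 2, 2, avg(v))
```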
  • 20. © Paradigm4 Inc. confidential Flexible Schema • Ad hoc queries • Don’t have to know a priori what questions you will want to ask of your data • Change schema dynamically • Values <=> dimensions • Supports transparent data exploration and mining
  • 21. Well-suited for storing, accessing, & analyzing images Satellite images Healthcare images © Paradigm4 Inc. confidential GIS data Store metadata with the data • Instrument id & calibration data • Experimental conditions and variables • Data set identifiers & comments
  • 22. © Paradigm4 Inc. confidential SciDB support for images • Regrid operator • Change resolution and coordinate systems • Overlap • Supports feature detection when features fall between nodes • Support for multi-dimensional window operations • Spatial averaging • Non-integer dimensions • Access image through spatio-temporal coordinate systems • Astronomy (right ascension, declination) • Remote sensing (lat, long, time)
  • 23. SciDB array model: create array CREATE ARRAY RGB < red : int16, green : int16, blue : int16 > [ longitude(double) = *, 10000, 0, latitude(double) = *, 10000, 0 ]; • Attributes: red, green, blue • Dimensions: longitude, latitude • Dimension size: * indicates unbounded • Each dimension also specifies a chunk size (here 10000) and a chunk overlap (here 0)
  • 24. [SciDB] Scalable data management 1 2 instance 1 (coordinator) instance 2 (worker) 3 4 instance 3 (worker) instance 4 (worker) © Paradigm4 Inc. confidential
  • 25. Soft scalability test on automotive telematics and location data • This graph shows how performance scales when both the data volume and the number of instances are increased together • Query computes a score for each driver based on how many other vehicles were driving at the same time, in the same areas as the driver • If data is perfectly distributed and if all operations in a query are perfectly parallelizable, the graph should be a zero-slope line [Graph y-axis: execution time relative to 1X scale factor]
  • 26. © Paradigm4 Inc. confidential New Data Window operator • Computes aggregates over rolling one-dimensional windows • skipping over empty array cells • particularly useful for analysis of time series events that happen at varying frequencies • Data window accepts an input array, a dimension name, number of preceding values, number of following values, and a list of aggregate calls data_window (input_array, dim_name, num_preceding, num_following, aggregate1(attribute1), aggregate2(attribute2)...)
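The behavior described on this slide can be sketched in a few lines of Python. This is a reading of the slide text only, not SciDB's implementation: the array is modeled as a dict from coordinate to value (missing keys are empty cells), and the window counts preceding and following non-empty cells rather than coordinates.

```python
def data_window(cells, num_preceding, num_following, aggregate):
    """Rolling aggregate along one dimension that skips empty cells.

    cells: dict mapping an integer coordinate to a value; coordinates
           that are absent are treated as empty.
    Returns a dict with one aggregate per non-empty coordinate.
    """
    coords = sorted(cells)
    result = {}
    for pos, coord in enumerate(coords):
        lo = max(0, pos - num_preceding)
        hi = min(len(coords), pos + num_following + 1)
        window = [cells[c] for c in coords[lo:hi]]
        result[coord] = aggregate(window)
    return result


# Events arrive at irregular times: two early, two much later.
events = {1: 10.0, 2: 20.0, 50: 30.0, 51: 40.0}
sums = data_window(events, 1, 1, sum)
```

Because the window is defined over neighboring events rather than a fixed coordinate range, the event at coordinate 50 is still aggregated with the one at coordinate 2 despite the long gap between them.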
  • 27. © Paradigm4 Inc. confidential Analyzing event data • Event hot spots • Look at which specific sets of locations (at the lat-long level) have the most hard acceleration and hard braking events (count or volume normalized metric) • Profile hot spots by day of week, time of day • Event windows • Look at a 30 second window before and after each hard braking and hard acceleration event • Look for patterns to predict adverse events or profile drivers
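As an illustration of the hot-spot idea, the sketch below bins event locations into a regular lat-long grid and counts events per grid cell. The grid size, the sample coordinates, and the function name are all hypothetical; the slide does not say how the binning was actually done.

```python
import math
from collections import Counter


def hot_spots(events, cell_size_deg, top_n):
    """Count events per lat-long grid cell and return the busiest cells.

    events: iterable of (latitude, longitude) pairs in degrees.
    cell_size_deg: edge length of a grid cell in degrees (assumed).
    """
    counts = Counter()
    for lat, lon in events:
        cell = (math.floor(lat / cell_size_deg),
                math.floor(lon / cell_size_deg))
        counts[cell] += 1
    return counts.most_common(top_n)


# Made-up hard-braking locations, for illustration only.
braking_events = [(42.3, -71.1), (42.4, -71.2), (40.7, -74.0)]
top = hot_spots(braking_events, 1.0, 1)
```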
  • 28. Manage data throughout its life cycle © Paradigm4 Inc. confidential • Data is never overwritten • Preserve raw data, corrected data, and updates in the database • Facilitates reproducibility, audits, compliance • Supports model development and testing: what-if modeling, scenario testing, back-testing, sensitivity testing • Updates are versioned
  • 29. Client Interfaces • iquery interactive command-line query interface • Python, C++, R clients • GUI (forms) interface coming • Open source client API – roll your own!
  • 30. What about Hadoop? • Hadoop alone is not a DBMS • No indexes, updates, data consistency, metadata • Modules (Hadoop, Pig, Hive, HBase, HDFS) are loosely integrated and require a lot of glue code • Requires skilled development staff to write custom code and maintain clusters • Slower than a real parallel distributed database, so it needs more HW • Linear algebra operators are hard to implement as a map and a reduce • See Stonebraker, Kepner CACM blog post: Possible Hadoop Trajectories http://cacm.acm.org/blogs/blog-cacm/149074-possible-hadoop-trajectories
  • 31. What about NoSQL like MongoDB? Great for some use cases: match the tool to your requirements • NoSQL and XML-based systems bake ‘schema’ into the application code or the records themselves • NoSQL is most easily defined by what it excludes • No schemas • No query language • Lacks the automatic data integrity of ACID databases • No support for joins, which are useful when working with multiple data sources • Requires coding to walk the data structures to manage data and extract information • Harder to collaborate and share data across groups • More custom code than a DB means potential longer-term maintenance and data archiving issues • Paradigm4 offers the flexibility of object-oriented data schemas without sacrificing ACID database integrity or ad hoc query support
  • 32. © Paradigm4 Inc. confidential SciDB and Paradigm4 • SciDB is a global, open source community • Scientists from many fields & computer-scientists • www.scidb.org • Paradigm4, a commercial company, sponsors & manages SciDB • Doing all the initial development for SciDB • Sells and supports a commercial-quality release of SciDB • Along with enterprise management tools (e.g. provisioning, security, recovery) • And industry-specific add-ons • www.paradigm4.com
  • 33. Get more from your analytical database • Power, Productivity & Performance – Less coding – Less data movement – Transparent scale-up & speed-up – Prototypes scale to production without rewriting – Lower cost deployment • Highly pedigreed technical team: CTO is Mike Stonebraker, renowned database researcher & entrepreneur • Ready to work with early adopters
  • 34. Big Complex Analytics combines data sources for novel insights & products Automotive Telematics Healthcare Informatics © Paradigm4 Inc.
  • 35. Big Complex Analytics powers population studies • > 70K tissue samples; > 65K gene probes per sample; covariance, clustering, SVD • > 10 million cars; GPS & driving data every sec; insurance by the trip & how you drive; linear regressions, risk & pricing modeling
  • 36. Architecture • ‘Shared Nothing’ cluster of commodity hardware nodes • Interconnected with standard Ethernet and TCP/IP
  • 37. SciDB array model: create array CREATE ARRAY RGB < red : int16, green : int16, blue : int16 > [ longitude(double) = *, 10000, 0, latitude(double) = *, 10000, 0 ]; • Attributes: red, green, blue • Dimensions: longitude, latitude • Dimension size: * indicates unbounded • Each dimension also specifies a chunk size (here 10000) and a chunk overlap (here 0)
  • 38. © Paradigm4 Inc. confidential SciDB Array Schema CREATE ARRAY Simple_Array < v1 : double, v2 : int64, v3 : string > [ I = 0:*, 5, 0, J = 0:9, 5, 0 ]; Attributes v1, v2, v3 Dimensions I, J Dimension size * is unbounded Chunk size Chunk overlap
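The chunk size in this schema (5 along both I and J) determines which chunk a given cell falls into. The Python sketch below assumes a regular, non-overlapping tiling that starts at each dimension's lower bound; the function names are illustrative and the internal layout SciDB actually uses is not described on the slide.

```python
def chunk_index(coord, dim_start, chunk_size):
    """Index of the chunk containing `coord` along one dimension."""
    return (coord - dim_start) // chunk_size


def chunk_of_cell(cell, dim_starts, chunk_sizes):
    """Chunk index along every dimension for an n-dimensional cell."""
    return tuple(chunk_index(c, s, n)
                 for c, s, n in zip(cell, dim_starts, chunk_sizes))


def chunk_origin(chunk, dim_starts, chunk_sizes):
    """Coordinates of the first cell in a chunk."""
    return tuple(s + k * n
                 for k, s, n in zip(chunk, dim_starts, chunk_sizes))


# Simple_Array: I starts at 0 with chunk size 5, J starts at 0 with chunk size 5.
chunk = chunk_of_cell((12, 7), (0, 0), (5, 5))
origin = chunk_origin(chunk, (0, 0), (5, 5))
```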
  • 39. SciDB array model: data types • Whole numbers: int8, int16, int32, int64 • Unsigned whole numbers: uint8, ..., uint64 • Date and Time: datetime • Date and Time with timezone: datetimez • Floating point: float, double • Boolean: bool • Character: char • Variable-length strings: string © Paradigm4 Inc.
  • 40. SciDB array model: Storage • SciDB stores every attribute separately • Good compression: – RLE – zlib • Parallel processing
  • 41. © Paradigm4 Inc. SciDB array model: 1D-array • Chunk: unit of data processing • Chunk should fit in memory entirely • User chooses chunk size
  • 42. SciDB array model: bitmap • SciDB describes EMPTY values using a bitmap • The bitmap is compressed efficiently with RLE
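Run-length encoding works well on an emptiness bitmap because empty and non-empty cells tend to occur in long runs. The sketch below shows generic RLE in Python on a made-up bitmap (1 = cell present, 0 = empty); it is not SciDB's on-disk format.

```python
def rle_encode(bits):
    """Collapse a sequence of bits into (bit, run_length) pairs."""
    runs = []
    for bit in bits:
        if runs and runs[-1][0] == bit:
            runs[-1][1] += 1
        else:
            runs.append([bit, 1])
    return [(bit, length) for bit, length in runs]


def rle_decode(runs):
    """Expand (bit, run_length) pairs back into the original bits."""
    bits = []
    for bit, length in runs:
        bits.extend([bit] * length)
    return bits


bitmap = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
encoded = rle_encode(bitmap)
```

Ten bits collapse into three runs here; a mostly empty chunk with thousands of cells would collapse into a similarly small number of pairs.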
  • 43. © Paradigm4 Inc. SciDB array model: 2D-array • Stride-major-order of chunks
  • 44. © Paradigm4 Inc. SciDB array model: 2D-chunk
  • 45. SciDB array model: clustering • Several available chunk distributions: – Round-Robin (default) – Replication • Optimizer splits queries into stages • Every stage is processed in parallel • Scatter/Gather (SG) of intermediate results after every stage according to requirements • Overlap helps decrease SG size (!) • NO single point of failure
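A round-robin distribution can be sketched as assigning chunk numbers to instances in rotation. The Python below assumes chunks are numbered consecutively and instances are numbered from 0; both are simplifications for illustration, not details taken from the slide.

```python
def round_robin(num_chunks, num_instances):
    """Assign chunk ids 0..num_chunks-1 to instances in rotation."""
    placement = {instance: [] for instance in range(num_instances)}
    for chunk_id in range(num_chunks):
        placement[chunk_id % num_instances].append(chunk_id)
    return placement


# Ten chunks spread over a four-instance cluster.
placement = round_robin(10, 4)
```

Each instance ends up with either two or three chunks, which is the even spread that lets every query stage run in parallel across instances.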
  • 46. SciDB array model: redundancy • --redundancy=X • Every chunk is replicated X times • Single copy on every node • Redundant chunks are used only when a node becomes unavailable • Protects against network and disk failures • Use RAID to protect against disk failures
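The effect of a redundancy setting can be illustrated with a hypothetical placement rule: the primary copy goes to one node and each of the X replicas goes to a different node, so no node holds two copies of the same chunk. The "next nodes in order" rule below is an assumption chosen for simplicity, not SciDB's actual placement algorithm.

```python
def replica_nodes(chunk_id, num_nodes, redundancy):
    """Nodes holding a chunk: the primary plus `redundancy` replicas,
    each on a distinct node (hypothetical successor-node placement)."""
    if redundancy >= num_nodes:
        raise ValueError("need more nodes than replicas")
    primary = chunk_id % num_nodes
    return [(primary + k) % num_nodes for k in range(redundancy + 1)]


def readable(chunk_id, num_nodes, redundancy, failed_nodes):
    """True if at least one copy of the chunk is on a live node."""
    return any(node not in failed_nodes
               for node in replica_nodes(chunk_id, num_nodes, redundancy))


copies = replica_nodes(5, 4, 1)           # primary on node 1, replica on node 2
survives_one = readable(5, 4, 1, {1})     # replica on node 2 is still live
survives_two = readable(5, 4, 1, {1, 2})  # both copies lost
```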
  • 47. SciDB array model: release 12.7 • Time series • Optimizations • Binary loader (based on PostgreSQL binary loader) • data_window operator © Paradigm4 Inc.
  • 48. SciDB array model: next release • Repair failed nodes from redundant data • Elastic cluster: – Increase/decrease node count
  • 49. © Paradigm4 Inc. confidential Contact – Marilyn Matz – CEO & co-founder – 781 718 3999 – mmatz@paradigm4.com – www.paradigm4.com – www.scidb.org
  • 50. innovative data management with complex analytics © Paradigm4 Inc.