SciDB
Oleg Tsarev
Talk from the Moscow Data Science Meetup
SciDB
1.
© Paradigm4 Inc.
confidential
2.
Topics
• The Big Complex Analytics Space
• SciDB Overview
• How we are different and why that matters
• Architecture
Note: We call our company P4 for short
3.
Rich Data + Complex Analytics drive insights and innovative product offerings
● Tap new types and sources of data
  – Location, genomic, behavioral, speech, sensors, images, …
● Integrate mixed data sources for novel insights
  – Genomic/wearable sensors/EHRs/clinical/payer/provider
  – Satellite images/smart grid data
  – Location/weather/traffic/driving behavior
● Generate micro-segmented pricing & products
  – Personalized insurance
  – Precision medicine
  – Precision warranties
  – Behavioral targeting
  – Location-based services
● Look at whole populations, big time windows, big regions
4.
Where some of the ‘complex analytics’ problems are
Pharma, Biotech, AgroBusiness, Healthcare Informatics
• Next-gen sequencing analysis & GWAS
• Population studies
• Evidence-based outcome studies
• Pharmaco-economics
Insurance Analytics
• Personalized auto or workman’s comp insurance
• Catastrophe modeling and policy pricing
• Risk modeling for insurance exchanges
Industrial Analytics
• Precision warranty pricing & maintenance schedules
Call Centers
• Speech analytics
Energy
• Data from smart sensor grids
Digital Marketing
• Geo-targeting & other personalization strategies
• Recommendation engines
Financial Services
• Financial modeling, back testing, sensitivity testing
• Algorithmic trading
• Portfolio management & risk management
Scientific Research
• Astronomy, Climatology, High Energy Physics, et al.
5.
‘Big Analytics’ covers two categories
• Big Volume + Simple Analytics
  – Traditional Data Warehouses, RDBMSs
  – Business analysts
  – Count statistics, roll-ups, aggregates
• Big Volume + Complex Analytics (P4’s space)
  – Emerging markets; new tools
  – Data scientists / healthcare analysts / quants / operations researchers
  – Multivariate statistics, clustering, SVD, machine learning, et al.
6.
Why would industrial & commercial analytics applications benefit from yet another software platform?
• Sensor data, geospatial data, temporal data, genomic data & images are far more efficiently managed as multi-dimensional arrays than as relational tables
• Complex analytics should execute in place where the data resides and scale easily with additional nodes and cores
7.
P4’s new ‘Complex Analytics’ database
Scientific data management & analytics for the commercial world
• Rich Data
• Massively Scalable Math
• Smart Data Management
8.
P4 is well-matched for M2M data
Machine-generated data have inherent ordering & structure
• location data from cars and cell phones
• telematics data from sensors
• energy usage data from smart sensor grids
• genetic sequencing data
• patient telemetry data
• time series and longitudinal event data
• satellite images of the earth’s surface
2012-01-31 22:32:36.968000
9.
A new ‘Complex Analytics’ database
Scientific data management & analytics for the commercial & industrial worlds
• All-in-one next generation database with data life cycle management and native, seamlessly integrated, scalable complex math operations
• n-dimensional array data model, optimal for temporal, geospatial, and machine-generated data
• Open Source
• Commodity HW grid or cloud
10.
SciDB Features
• Distributed data storage: with redundancy/fault tolerance and high availability
• Scalable parallel operations: parallel linear algebra, aggregates, summaries, data loading
• ACID transactions
• Structured N-dimensional sparse array data model: defined by schema
• Expressive SQL-like query syntax: supports joins by array dimensions
• No-overwrite data versioning
• Extensible: user-defined types, functions, operators
11.
Paradigm4 enables data-intensive research
• Capture: ingest, store, and manage data throughout its lifecycle
• Curation: save raw, corrected, pre-processed and derived analytic data, with metadata and provenance
• Curiosity: explore, drill down, filter, select
• Compute: complex math and modeling
• Collaboration: shared resource; no data silos with long metadata filenames
• Compliance: no-overwrite, versioned data storage supports reproducibility and validation of results
12.
First class support for scientific data & scientific research
Why SciDB for scientists?
• Ingest, store, access, and manage data throughout its life cycle
• No-overwrite database; historical versioning support
• Metadata: store curation and calibration information
• Extensibility (user-defined types and operations)
• Save raw, corrected, pre-processed, and derived data
• Support for provenance
• Support reproducibility of results
• Share data across work groups and with outside organizations
13.
P4’s native Array DB beats Relational DBs* on storage efficiency & complex computations
● Math functions run directly on native storage format
● Dramatic storage efficiencies as # of dimensions & attributes grow
  – Architecture supports n dimensions
● Facilitates drill-down & clustering by like groups
● High performance for both sparse and dense data
  – 10-100x faster than RDBMSs on array operations
* Applies to both row stores & column stores
[Figure labels: 16 cells; 48 cells]
14.
Data exploration & analytics work better when the natural ordering of data is preserved
• Clusters, temporal regions are stored together
• Resample time or re-grid geospatial data at any resolution
• Slice & drill-down in any n-dimensional region
• Fast data selection for ad hoc queries
• Efficient analytics over sub-regions & moving windows
15.
Complex math underpins many use cases
Industrial Analytics
• Precision warranty pricing
• Proactive preventive maintenance
• Modeling & optimization
• Event monitoring in refineries and factories

• covariance
• PCA, SVD
• cross validation
• bootstrapping
• cluster analysis
• linear/logistic regression

Pharma, Biotech, Healthcare
• Next-gen Sequencing
• Population studies
• Outcome studies
• Precision medicine
16.
Complex math underpins many use cases
Computational Finance
• Back testing
• Sophisticated modeling
• Portfolio optimization
• Risk management

• covariance
• PCA, SVD
• cross validation
• bootstrapping
• cluster analysis
• Monte Carlo methods
• linear/logistic regression
17.
P4’s native math library supports distributed processing
• Task parallelism
  – ‘Embarrassingly parallel’ tasks
  – Process subpopulations in parallel
  – Run simulations in parallel
• Massively scalable complex math
  – ‘Non-embarrassingly’ parallel tasks like large scale linear algebra
  – Math operations that pass intermediate data between nodes
  – Challenging O(n^3) computations
  – Math operations on data too large to fit on one node
• Large scale analytics without sampling
  – Look at whole populations, big time windows, big regions
  – Sample when you want to; not to fit analytics package constraints
  – Use all the data: sometimes you really want the long tail or the black swan
18.
Query language seamlessly integrates data manipulation & math
Array Query Language (AQL)
• Declarative SQL-like language extended for working with array data
• Large-scale math operations embedded in queries
Extensible
• Add user-defined types and functions
• R, Python, and other client interfaces
Example: compute the log odds ratio for a failure model using logistic regression
    SELECT * FROM LOGISTREGR (model_matrix, success_count, failure_count, 'coefficients')
19.
Linear Algebra as Building Block
Mathematical and Data Manipulation Operations
    multiply ( transpose ( Simple_Array ), Simple_Array );
    regrid ( Simple_Array, 10, 10, avg (v2) );
    cumsum ( filter ( Simple_Array, v1 = 'Odd' ), I, v1 );
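For readers unfamiliar with these operators, here is a rough plain-Python sketch of what the three queries on this slide compute on a small in-memory matrix. It is not SciDB code: the function names simply mirror the operator names, and regrid_avg assumes block averaging over fixed-size tiles, which is how the slide's regrid(..., 10, 10, avg(v2)) call reads.

```python
def transpose(m):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(col) for col in zip(*m)]


def multiply(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def regrid_avg(m, block_rows, block_cols):
    """Replace each block_rows x block_cols tile by the average of its cells."""
    out = []
    for r0 in range(0, len(m), block_rows):
        out_row = []
        for c0 in range(0, len(m[0]), block_cols):
            block = [v for row in m[r0:r0 + block_rows]
                     for v in row[c0:c0 + block_cols]]
            out_row.append(sum(block) / len(block))
        out.append(out_row)
    return out


def cumsum(values):
    """Running total of a sequence."""
    total, out = 0, []
    for v in values:
        total += v
        out.append(total)
    return out


if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    print(multiply(transpose(a), a))                       # A^T * A
    print(regrid_avg([[1, 2, 3, 4], [5, 6, 7, 8]], 2, 2))  # 2x2 tile averages
    print(cumsum(v for v in [1, 2, 3, 4, 5] if v % 2 == 1))  # filter, then cumsum
```

In SciDB these operations run in parallel across chunks on several instances; the sketch only shows the arithmetic.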
20.
Flexible Schema
• Ad hoc queries
• Don’t have to know a priori what questions you will want to ask of your data
• Change schema dynamically
• Values <=> dimensions
• Supports transparent data exploration and mining
21.
Well-suited for storing, accessing, & analyzing images
• Satellite images
• Healthcare images
• GIS data
Store metadata with the data
• Instrument id & calibration data
• Experimental conditions and variables
• Data set identifiers & comments
22.
SciDB support for images
• Regrid operator
  – Change resolution and coordinate systems
• Overlap
  – Supports feature detection when features fall between nodes
• Support for multi-dimensional window operations
  – Spatial averaging
• Non-integer dimensions
  – Access image through spatio-temporal coordinate systems
  – Astronomy (right ascension, declination)
  – Remote sensing (lat, long, time)
23.
SciDB array model: create array
    CREATE ARRAY RGB
    < red : int16, green : int16, blue : int16 >
    [ longitude(double) = *, 10000, 0,
      lattitude(double) = *, 10000, 0 ];
• Attributes: red, green, blue
• Dimensions: longitude, lattitude
• Dimension size: * indicates unbounded
• Chunk size: 10000 in this example
• Chunk overlap: 0 in this example
24.
[SciDB] Scalable data management
[Figure: four instances: instance 1 (coordinator), instance 2 (worker), instance 3 (worker), instance 4 (worker)]
25.
Soft scalability test on automotive telematics and location data
• This graph shows how performance scales when both the data volume and the number of instances are increased together
• The query computes a score for each driver based on how many other vehicles were driving at the same time, in the same areas as the driver
• If data is perfectly distributed and all operations in a query are perfectly parallelizable, the graph should be a line with zero slope
[Figure axis: execution time relative to 1X scale factor]
26.
New Data Window operator
• Computes aggregates over rolling one-dimensional windows
  – skipping over empty array cells
  – particularly useful for analysis of time series events that happen at varying frequencies
• Data window accepts an input array, a dimension name, number of preceding values, number of following values, and a list of aggregate calls
    data_window (input_array, dim_name, num_preceding, num_following, aggregate1(attribute1), aggregate2(attribute2)...)
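The window semantics described on this slide can be sketched in plain Python. This is an illustration only, not SciDB's implementation: a sparse one-dimensional array is modelled as a dict from coordinate to value, a missing key stands for an empty cell, and the behaviour at the array edges (truncating the window) is an assumption.

```python
def data_window(cells, num_preceding, num_following, aggregate):
    """Aggregate over the num_preceding and num_following NON-EMPTY cells
    around each non-empty cell of a sparse 1-D array.

    cells: dict mapping integer coordinate -> value; absent keys are empty cells.
    aggregate: any function of a list of values, e.g. sum, min, max.
    """
    coords = sorted(cells)  # only non-empty cells take part in the window
    result = {}
    for pos, coord in enumerate(coords):
        start = max(0, pos - num_preceding)   # assumed: window is cut at the edge
        stop = pos + num_following + 1
        window = [cells[c] for c in coords[start:stop]]
        result[coord] = aggregate(window)
    return result


if __name__ == "__main__":
    # Events at irregular timestamps; the gaps between them are empty cells.
    events = {0: 1, 5: 2, 6: 3, 20: 4}
    print(data_window(events, 1, 1, sum))
```

Because the window is counted in non-empty cells rather than in coordinate distance, the event at coordinate 20 still sees its neighbour at coordinate 6, which is what makes the operator convenient for events arriving at varying frequencies.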
27.
Analyzing event data
• Event hot spots
  – Look at which specific sets of locations (at the lat-long level) have the most hard acceleration and hard braking events (count or volume normalized metric)
  – Profile hot spots by day of week, time of day
• Event windows
  – Look at a 30 second window before and after each hard braking and hard acceleration event
  – Look for patterns to predict adverse events or profile drivers
28.
Manage data throughout its life cycle
• Data is never overwritten
• Preserve raw data, corrected data, and updates in the database
• Facilitates reproducibility, audits, compliance
• Supports model development and testing: what-if modeling, scenario testing, back-testing, sensitivity testing
• Updates are versioned
29.
Client Interfaces
• iquery: interactive command-line query interface
• Python, C++, R clients
• GUI (forms) interface coming
• Open source client API: roll your own!
30.
What about Hadoop?
• Hadoop alone is not a DBMS
  – No indexes, updates, data consistency, metadata
• Modules (Hadoop, Pig, Hive, HBase, HDFS) are loosely integrated and require a lot of glue code
• Requires skilled development staff to write custom code and maintain clusters
• Slower than a real parallel distributed database, so needs more HW
• Linear algebra operators are hard to implement as a map and a reduce
• See Stonebraker, Kepner CACM blog post: Possible Hadoop Trajectories
  http://cacm.acm.org/blogs/blog-cacm/149074-possible-hadoop-trajectories
31.
What about NoSQL like MongoDB?
Great for some use cases: match the tool to your requirements
• NoSQL and XML-based systems bake ‘schema’ into the application code or the records themselves
• NoSQL is most easily defined by what it excludes
  – No schemas
  – No query language
• Lacks the automatic data integrity of ACID databases
• No support for joins, which are useful when working with multiple data sources
• Requires coding to walk the data structures to manage data and extract information
• Harder to collaborate and share data across groups
• More custom code than a DB means potential longer-term maintenance and data archiving issues
• Paradigm4 offers the flexibility of object-oriented data schemas without sacrificing ACID database integrity or ad hoc query support
32.
SciDB and Paradigm4
• SciDB is a global, open source community
  – Scientists from many fields & computer scientists
  – www.scidb.org
• Paradigm4, a commercial company, sponsors & manages SciDB
  – Doing all the initial development for SciDB
  – Sells and supports a commercial-quality release of SciDB
  – Along with enterprise management tools (e.g. provisioning, security, recovery)
  – And industry-specific add-ons
  – www.paradigm4.com
33.
Get more from your analytical database
• Power, Productivity & Performance
  – Less coding
  – Less data movement
  – Transparent scale-up & speed-up
  – Prototypes scale to production without rewriting
  – Lower cost deployment
• Highly pedigreed technical team
  – CTO is Mike Stonebraker, renowned database researcher & entrepreneur
• Ready to work with early adopters
34.
Big Complex Analytics combines data sources for novel insights & products
• Automotive Telematics
• Healthcare Informatics
35.
Big Complex Analytics powers population studies
• > 70K tissue samples
  – > 65K gene probes per sample
  – covariance, clustering, SVD
• > 10 million cars
  – GPS & driving data every sec
  – insurance by the trip & how you drive
  – linear regressions, risk & pricing modeling
36.
Architecture
• ‘Shared Nothing’ cluster of commodity hardware nodes
• Interconnected with standard Ethernet and TCP/IP
37.
SciDB array model: create array
    CREATE ARRAY RGB
    < red : int16, green : int16, blue : int16 >
    [ longitude(double) = *, 10000, 0,
      lattitude(double) = *, 10000, 0 ];
• Attributes: red, green, blue
• Dimensions: longitude, lattitude
• Dimension size: * indicates unbounded
• Chunk size: 10000 in this example
• Chunk overlap: 0 in this example
38.
SciDB Array Schema
    CREATE ARRAY Simple_Array
    < v1 : double, v2 : int64, v3 : string >
    [ I = 0:*, 5, 0,
      J = 0:9, 5, 0 ];
• Attributes: v1, v2, v3
• Dimensions: I, J
• Dimension size: * is unbounded
• Chunk size: 5 in this example
• Chunk overlap: 0 in this example
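To make the chunk size parameter concrete, here is a small Python sketch of how a cell coordinate maps to a chunk for a schema like Simple_Array (both dimensions start at 0, chunk size 5, overlap 0). The helper names chunk_of and chunk_bounds are hypothetical, and the sketch assumes chunks are aligned to each dimension's lower bound; it ignores overlap.

```python
def chunk_of(coords, lows, chunk_sizes):
    """Index of the chunk holding a cell, one index per dimension.

    coords:      cell coordinates, e.g. (I, J)
    lows:        lower bound of each dimension
    chunk_sizes: chunk length along each dimension
    """
    return tuple((c - lo) // size
                 for c, lo, size in zip(coords, lows, chunk_sizes))


def chunk_bounds(chunk_index, lows, chunk_sizes):
    """Inclusive (first, last) coordinate range covered by a chunk, per dimension."""
    return [(lo + k * size, lo + (k + 1) * size - 1)
            for k, lo, size in zip(chunk_index, lows, chunk_sizes)]


if __name__ == "__main__":
    lows, sizes = (0, 0), (5, 5)          # I = 0:*, 5, 0 and J = 0:9, 5, 0
    idx = chunk_of((7, 3), lows, sizes)   # cell I=7, J=3
    print(idx, chunk_bounds(idx, lows, sizes))
```

With these numbers every chunk holds at most 5 x 5 = 25 cells, and the unbounded I dimension simply keeps producing new chunk rows as data arrives.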
39.
SciDB array model: data types
• Whole numbers: int8, int16, int32, int64
• Unsigned whole numbers: uint8, ..., uint64
• Date and time: datetime
• Date and time with timezone: datetimez
• Floating point: float, double
• Boolean: bool
• Character: char
• Variable-length strings: string
40.
SciDB array model: Storage
• SciDB stores every attribute separately
• Good compression:
  – RLE
  – zlib
• Parallel processing
41.
SciDB array model: 1D-array
• Chunk: unit of data processing
• A chunk should fit in memory entirely
• The user chooses the chunk size
42.
SciDB array model: bitmap
• SciDB describes EMPTY values using a bitmap
• The bitmap is compressed efficiently with RLE
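A short Python sketch shows why run-length encoding suits an emptiness bitmap: long runs of empty or non-empty cells collapse into a handful of (value, length) pairs. This is generic RLE, not SciDB's on-disk format, and the function names are illustrative.

```python
from itertools import groupby


def rle_encode(bits):
    """Collapse a sequence into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(bits)]


def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original sequence."""
    return [value for value, length in runs for _ in range(length)]


if __name__ == "__main__":
    # 1 = cell present, 0 = cell EMPTY
    bitmap = [1, 1, 1, 0, 0, 0, 0, 1]
    runs = rle_encode(bitmap)
    print(runs)
    print(rle_decode(runs) == bitmap)
```

A chunk that is entirely dense or entirely empty encodes to a single pair, whatever its size.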
43.
SciDB array model: 2D-array
• Stride-major-order of chunks
44.
SciDB array model: 2D-chunk
45.
SciDB array model: clustering
• Several available chunk distributions:
  – Round-Robin (default)
  – Replication
• The optimizer splits queries into stages
• Every stage is processed in parallel
• Scatter/Gather (SG) of intermediate results after every stage according to requirements
• Overlap helps decrease SG size (!)
• NO single point of failure
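The default round-robin distribution named above can be illustrated with a few lines of Python. The sketch assumes chunks are numbered in some fixed linear order and dealt out to instances in turn; the exact ordering and hashing SciDB uses is not stated on the slide, so treat this as a model of the idea only.

```python
def round_robin(chunk_ids, num_instances):
    """Deal chunks out to instances in turn: chunk n goes to instance n % num_instances."""
    placement = {instance: [] for instance in range(num_instances)}
    for n, chunk in enumerate(chunk_ids):
        placement[n % num_instances].append(chunk)
    return placement


if __name__ == "__main__":
    # Six chunks spread over a four-instance cluster.
    for instance, chunks in round_robin(list("abcdef"), 4).items():
        print(instance, chunks)
```

Dealing chunks out evenly is what lets each query stage run in parallel on all instances, with Scatter/Gather moving data only when a stage needs a different layout.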
46.
SciDB array model: redundancy
• --redundancy=X
• Every chunk is replicated X times
• A node holds at most a single copy of any chunk
• Redundant chunks are used only when a node becomes unavailable
• Protects against network and disk failures
• Use RAID to protect against disk failures
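One way to picture the redundancy rule is the following Python sketch. It assumes a primary copy placed round-robin plus X extra replicas on the next X nodes, so that no node holds two copies of the same chunk. The placement rule and the name replica_nodes are illustrative assumptions, not SciDB's documented algorithm.

```python
def replica_nodes(chunk_number, num_nodes, redundancy):
    """Nodes holding a chunk: the primary first, then `redundancy` replicas.

    Assumed rule: primary = chunk_number % num_nodes, replicas on the
    following nodes (wrapping around), all distinct from each other.
    """
    if redundancy >= num_nodes:
        raise ValueError("redundancy must be smaller than the number of nodes")
    primary = chunk_number % num_nodes
    return [(primary + k) % num_nodes for k in range(redundancy + 1)]


def readable_copy(chunk_number, num_nodes, redundancy, failed):
    """First live node holding the chunk; replicas are used only if the primary is down."""
    for node in replica_nodes(chunk_number, num_nodes, redundancy):
        if node not in failed:
            return node
    return None  # every copy is on a failed node


if __name__ == "__main__":
    print(replica_nodes(3, 4, 2))                   # primary on node 3, replicas wrap around
    print(readable_copy(3, 4, 2, failed={3}))       # primary down: fall back to a replica
```

With redundancy X the cluster tolerates the loss of any X nodes without losing a chunk, since each chunk lives on X + 1 distinct nodes.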
47.
SciDB array model: release 12.7
• Time series
• Optimizations
• Binary loader (based on PostgreSQL binary loader)
• data_window operator
48.
SciDB array model: next release
• Repair failed nodes from redundant data
• Elastic cluster:
  – Increase/decrease node count
49.
Contact
– Marilyn Matz
– CEO & co-founder
– 781 718 3999
– mmatz@paradigm4.com
– www.paradigm4.com
– www.scidb.org
50.
innovative data management with complex analytics