Recording available here: https://youtu.be/VNYB373by0s
In any enterprise you’ll find a myriad of data technologies and architectures in use across departments and groups – segmented and disconnected – yet all aimed at the same objective: turning the vast amounts of data the organization generates and gathers into (ideally real-time) insights and quantified decisions. The components of a multi-model architecture likely already exist, but they often lack the strategy, structure, and integration needed to become an effective real-time data architecture. We discuss these issues and present solutions that pave the way.
This presentation covers:
- Defining a Strategy for Real-Time
- Understanding Each Model’s Strengths and Purpose
- Integrating It All Together
Getting to Real-Time in a Multi-Model Architecture
1.
Getting to Real-Time in a
Multi-Model Architecture
Data Architecture Summit 2017
Benjamin Nussbaum – CTO, AtomRain Inc.
(Creators of GraphGrid Connected Data Platform)
ben@atomrain.com | @bennussbaum
atomrain.com | graphgrid.com
3. #DASummit
Technology is Overhyped
• No one-size-fits-all silver bullets
• Need to align the shape of your data with the technology made for it
• Rarely does a truly transformational technology come along
• Deep understanding and discipline required to get through the noise
• Avoid running from one technology to the next hot one
• A cursory or surface understanding is not enough here
• Throwing technology or more developer bodies at a problem is not the solution; often it makes things worse
• Work with an SME to establish and navigate the data landscape
11/21/17
4. #DASummit
Data is Living
• Constantly changing as business needs and requirements change
• Data comes in many shapes and sizes
• Data goes through many transformations within an enterprise
• Data needs to become all things to all people
10. #DASummit
Separating Data Concerns
What do I need to do with my data?
Ingest it (ETL) from multiple sources
Store both structured and unstructured data
Process the unstructured data to make it usable
Contextualize, enrich and improve the structured data
Analyze, Reason & Learn. Understand & Drive Decisions
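The five concerns above can be sketched end to end as a minimal, illustrative Python pipeline (the stage names and record shapes are mine, not any product’s API):

```python
# Minimal sketch of the five data concerns as pipeline stages.
# All names here are illustrative, not tied to any product.

def ingest(sources):
    """ETL: pull raw records in from multiple sources."""
    return [record for source in sources for record in source]

def store(records):
    """Keep structured and unstructured data side by side."""
    structured = [r for r in records if isinstance(r, dict)]
    unstructured = [r for r in records if not isinstance(r, dict)]
    return structured, unstructured

def process(unstructured):
    """Turn unstructured text into usable structured records."""
    return [{"text": t, "tokens": t.split()} for t in unstructured]

def enrich(structured):
    """Contextualize and improve structured data."""
    return [dict(r, enriched=True) for r in structured]

def analyze(records):
    """Drive decisions: here, just count enriched records."""
    return sum(1 for r in records if r.get("enriched"))

sources = [[{"id": 1}, "raw log line"], [{"id": 2}]]
structured, unstructured = store(ingest(sources))
records = enrich(structured + process(unstructured))
print(analyze(records))  # all three records make it through enriched
```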
11. #DASummit
Indexes for Direct Retrieval
• Designed for responding to anticipated questions
• Rapid responses for indexed values
• Not designed for ad hoc and unexpected questions
• Costly to maintain in a rapidly changing data environment
• Slow to update because it requires dev/dba cycles
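The trade-off on this slide fits in a few lines: a hash index answers the anticipated question instantly, while an ad hoc question falls back to a full scan (an illustrative sketch, not any particular database):

```python
# A hash index answers anticipated questions in O(1),
# but an un-indexed (ad hoc) question forces a full scan.

rows = [
    {"id": 1, "name": "Ada", "city": "London"},
    {"id": 2, "name": "Ben", "city": "Paris"},
    {"id": 3, "name": "Cat", "city": "London"},
]

# Built ahead of time for the question we anticipated: lookup by name.
name_index = {row["name"]: row for row in rows}

def by_name(name):
    return name_index[name]          # direct retrieval

def by_city(city):
    # No index on city: every ad hoc question pays a full scan,
    # and adding the index later costs a dev/DBA cycle.
    return [row for row in rows if row["city"] == city]

print(by_name("Ben")["id"])          # 2
print(len(by_city("London")))        # 2
```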
12. #DASummit
Pointers for Traversal
• Designed for responding to unanticipated questions
• Rapid responses for connection-centric and depth-based questions
• Not designed for static cache-like return sets
• Easy to maintain in a rapidly changing data environment
• No dev/dba cycle required to maintain performance as data changes
• Optimal for ad hoc and unexpected connection-centric questions
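A rough sketch of index-free adjacency, the idea behind pointer-based traversal: an index locates only the starting point, and each hop then follows direct references, so the cost per hop doesn’t grow with the dataset (names are illustrative):

```python
# Index-free adjacency sketch: the index only finds a starting
# point; from there, traversal follows direct node pointers.

class Node:
    def __init__(self, name):
        self.name = name
        self.edges = []              # direct pointers to neighbors

    def connect(self, other):
        self.edges.append(other)

def traverse(start, depth):
    """Collect names of nodes reachable within `depth` hops."""
    frontier, seen = {start}, {start}
    for _ in range(depth):
        frontier = {n for node in frontier for n in node.edges} - seen
        seen |= frontier
    return sorted(n.name for n in seen)

a, b, c, d = Node("a"), Node("b"), Node("c"), Node("d")
a.connect(b); b.connect(c); c.connect(d)

index = {"a": a}                     # the index finds the start only
print(traverse(index["a"], 2))       # ['a', 'b', 'c']
```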
13. #DASummit
Disk/S3/etc for Binary Files
• Designed for storing files of binary types (e.g., PDF, DOCX)
• Efficient for raw storage of files
• Processing phase should extract meaningful information
• Structured data should reference stored location when relevant
• Optimal for non-processed, unstructured data
14. #DASummit
Search for Natural Language
• Designed for finding data through plain text
• Rapid response in ranked order for defined indexes
• Not designed for ad hoc and unexpected questions
• Requires synchronization with primary database as data changes
• Optimal for processed data built into indexed text documents
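Under the hood this is an inverted index; a toy sketch follows (the ranking here is a simple term-match count, far cruder than a real search engine’s scoring):

```python
# A search index is an inverted index: terms map to documents,
# and results come back ranked (here, by term-match count).
# It must be resynchronized whenever the primary data changes.

from collections import defaultdict

docs = {
    1: "real time data architecture",
    2: "graph data model",
    3: "time series data",
}

inverted = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        inverted[term].add(doc_id)

def search(query):
    scores = defaultdict(int)
    for term in query.split():
        for doc_id in inverted.get(term, ()):
            scores[doc_id] += 1
    # rank by match count, best first; break ties by doc id
    return [d for d, _ in sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))]

print(search("real time data"))      # doc 1 matches all three terms
```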
17. #DASummit
Object DBMS
Specialty Database:
• Designed for storing object oriented structures without translation
• Defines classes, objects and models (no separation or mapping layer between data and application code)
• Object structures including inheritance and other objects referenced on a property persisted together
• Not designed for multiple applications with varying code structures
• Object oriented code structures are stored directly
• Today many applications use the same database but with varying views and configurations of the data
• Not an option as a backing store for your primary data
19. #DASummit
Native XML DBMS
Specialty Database:
• Designed for storing XML structures without translation
• Internal data model corresponds to XML documents, though data isn’t necessarily stored as XML documents
• Supports XML-specific query languages such as XPath, XQuery and/or XSLT
• Not designed for normalized representations of data
• Similar to document stores in this way
• Overlapping representations of the same data in varying XML documents
• Not an option as a backing store for your primary data
21. #DASummit
Time Series
Specialty Database:
• Designed for streams of time series data as inputs, written in ascending timestamp order (most recent appended last)
• Not designed for all CRUD operations
• Updates to existing records are expected to be a rare occurrence
• Deletions are rare and typically remove a large chunk of data far in the past
• Not an option as a backing store for your primary data
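The append-mostly access pattern described above can be sketched as follows (illustrative only, not any time-series product’s API):

```python
# Time-series sketch: appends arrive in ascending timestamp order;
# updates are rare, and deletes drop a large chunk of old data.

series = []                          # list of (timestamp, value)

def append(ts, value):
    assert not series or ts >= series[-1][0], "out-of-order write"
    series.append((ts, value))

def drop_before(ts):
    """Bulk-delete old data, the only common delete pattern."""
    series[:] = [p for p in series if p[0] >= ts]

for t, v in [(1, 10.0), (2, 10.5), (5, 9.8)]:
    append(t, v)
drop_before(2)
print(series)                        # [(2, 10.5), (5, 9.8)]
```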
23. #DASummit
Search Engine
Specialty Database:
• Designed for finding content within the search data store
• Typically stored as finely tuned textual or geospatial indexes
• Should be seen as a purposely built cache to support finding data in ranked order
• Not designed for ACID reliability
• Not an option as a backing store for your primary data
25. #DASummit
RDF/Triple Stores
Specialty Database:
• Designed for storing the RDF model
• Resource Description Framework is a methodology for description of information
• Information is represented in triples: subject – predicate – object
• Provide methods specifically for dealing with triples via SPARQL, an SQL-like query language
• Rely on indexes being built and maintained for retrieving data
• Not designed for application and other such non-RDF format data
• Not an option as a backing store for your primary data
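A toy sketch of triples and SPARQL-style pattern matching (here `None` stands in for the wildcard a SPARQL variable would play; the data and names are made up):

```python
# RDF sketch: facts as (subject, predicate, object) triples,
# queried by matching triple patterns against the store.

triples = [
    ("ben", "worksAt", "AtomRain"),
    ("ben", "knows", "alice"),
    ("alice", "worksAt", "AtomRain"),
]

def match(s=None, p=None, o=None):
    """Return triples matching the pattern; None = wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# Who works at AtomRain?
print([s for s, _, _ in match(p="worksAt", o="AtomRain")])
```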
27. #DASummit
Key-Value Stores
Specialty Database:
• Designed to store and retrieve values associated with a given key
• Very simple structure that works well in a store<-->retrieve paradigm with a known key
• Good for caching type requirements when near instant retrieval of a key’s value is needed
• Not designed for ACID reliability
• Not an option as a backing store for your primary data
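The whole paradigm fits in a few lines (illustrative sketch; the key names are made up):

```python
# Key-value sketch: near-instant retrieval by a known key; there is
# no way to ask questions about the values themselves.

store = {}

def put(key, value):
    store[key] = value

def get(key, default=None):
    return store.get(key, default)

put("session:42", {"user": "ben", "ttl": 300})
print(get("session:42")["user"])     # ben
print(get("session:99", "miss"))     # miss
```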
29. #DASummit
Document Stores
Specialty Database:
• Designed to store and index schema-free textual documents (e.g., JSON)
• Enables defining and maintaining indexes for querying against fields in the documents
• Good for store<-->retrieve of large text document structures based on indexed fields within them
• Some implementations support varying document levels of ACID reliability
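A minimal sketch of a document store with one maintained field index (the `status` field is an arbitrary example, not part of any product):

```python
# Document-store sketch: schema-free JSON documents, plus an index
# maintained on a chosen field so queries avoid scanning every document.

import json
from collections import defaultdict

documents = {}                        # doc_id -> parsed document
field_index = defaultdict(list)       # value of "status" -> doc_ids

def insert(doc_id, raw_json):
    doc = json.loads(raw_json)        # schema-free: any shape accepted
    documents[doc_id] = doc
    if "status" in doc:
        field_index[doc["status"]].append(doc_id)

def find_by_status(status):
    return [documents[i] for i in field_index[status]]

insert(1, '{"status": "open", "title": "first"}')
insert(2, '{"status": "done"}')
insert(3, '{"status": "open", "extra": [1, 2]}')
print(len(find_by_status("open")))   # 2
```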
31. #DASummit
Wide Column Stores
Specialty Database:
• Designed to store a large number of dynamic columns
• Uses a table structure where columns are created for each row instead of being defined by the table
• Column names and record keys are not fixed (schema-free in this regard compared to RDBMS tables)
• Good for store<-->retrieve of data that looks like two-dimensional key-value store and is within a table
• Some implementations support varying levels of ACID reliability
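The dynamic-columns idea in miniature (illustrative sketch; real wide-column stores add column families, timestamps, and distribution on top of this):

```python
# Wide-column sketch: each row key owns its own set of columns;
# column names are not fixed by the table, so rows can differ.

table = {}                            # row_key -> {column: value}

def put(row_key, column, value):
    table.setdefault(row_key, {})[column] = value

put("user:1", "name", "Ada")
put("user:1", "last_login", "2017-11-21")
put("user:2", "name", "Ben")          # no last_login column at all

print(sorted(table["user:1"]))        # ['last_login', 'name']
print(sorted(table["user:2"]))        # ['name']
```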
33. #DASummit
Relational DBMS
Primary Database:
• Designed to store and retrieve data using a table-oriented data model
• Tables look like Excel spreadsheets – columns are the schema (properties/attributes) and rows are the data entries
• Good for aggregations and filters within a single table
• Supports JOIN operations to return data across 2 or more tables
• JOIN operations slow down exponentially with each table included
• This is due to the way the Cartesian product of a JOIN works
• Designed to be fully ACID and Transactional (avoid those that aren’t)
• Historically has been the backing store for primary data throughout the enterprise
• Yes, you can read this as “a changing of the guard is taking place” (I’ve seen it accelerate a lot this year)
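The JOIN cost the slide warns about can be made concrete with a toy nested-loop join; real RDBMSs optimize far beyond this, but the multiplicative pairing of rows is the underlying issue (tables and data here are made up):

```python
# JOIN sketch: a nested-loop join over two tables examines the
# Cartesian product, so cost multiplies with every table added.

people = [(1, "Ada"), (2, "Ben")]
accounts = [(10, 1), (11, 1), (12, 2)]   # (account_id, person_id)

def join(left, right, on):
    comparisons = 0
    rows = []
    for l in left:                    # every pair is considered
        for r in right:
            comparisons += 1
            if on(l, r):
                rows.append(l + r)
    return rows, comparisons

rows, comparisons = join(people, accounts, lambda l, r: l[0] == r[1])
print(len(rows), comparisons)         # 3 matches out of 6 comparisons
```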
35. #DASummit
Graph DBMS
Primary Database:
• Designed to store and retrieve data using a connection-oriented model
• Connections (edges) provide the context of how two things are related
• Entities (nodes) are the things (e.g., Person, Account) in your data
• Properties are supported on nodes and edges (avoid those that don’t)
• Indexes are only used to find starting points in the data (avoid those that don’t)
• This removes JOIN pain when answering questions requiring movement across data entities (traversals)
• Designed to be fully ACID and Transactional (avoid those that aren’t)
• Provide Dynamic (index-free), Constant Time movement across data entities
• Becoming the backing store for primary data throughout the enterprise
• Yes, you can read this as “replacing RDBMS as the primary store” (I’ve been using one like this for 5+yrs)
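A toy property-graph sketch showing nodes and edges both carrying properties, with the edge type giving the context of the relationship (names and classes here are illustrative, not Neo4j’s API):

```python
# Property-graph sketch: nodes and edges both carry properties,
# and an edge's type says how two things are related.

class Edge:
    def __init__(self, rel_type, target, **props):
        self.type, self.target, self.props = rel_type, target, props

class GraphNode:
    def __init__(self, label, **props):
        self.label, self.props, self.out = label, props, []

    def relate(self, rel_type, target, **props):
        self.out.append(Edge(rel_type, target, **props))

ben = GraphNode("Person", name="Ben")
acct = GraphNode("Account", number="001")
ben.relate("OWNS", acct, since=2012)  # the edge itself carries context

owns = [e for e in ben.out if e.type == "OWNS"]
print(owns[0].target.props["number"], owns[0].props["since"])
```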
37. #DASummit
Multi-Model Database
Distracted Database:
• Mostly designed for or favors one primary storage model and purpose
• Then added support for other conceptual models at the API level
• Beneficial for marketing
• Convenient for developers
• Largely a distraction to initial primary objective
• There are always tradeoffs in optimizing for a specific model
38. #DASummit
Not a Database
Not a Database:
• Be wary of databases that aren’t actually databases
• There is much marketing confusion and misclassification happening
• Commonly tout a pluggable storage engine (that just means they’re using a database for storage)
• Other times you have to dig through the documentation to see how they’re actually storing their data
• Often the novel thing is a value added API layer
• Using these will tie you into whichever data storage decisions they’ve made
• Not all “databases” listed on db-engines are actually databases in the true sense
• A database that uses another database as the storage engine
• A database that is all in memory and doesn’t deal with persisted data
41. #DASummit
Real-World, Real-Time
Business: Buying and selling of online advertising
Accepted Reality: Maximum of 1hr to update bids
Original Technical: 3TB SQL RDBMS relying on distributed, federated and highly indexed views to come close to 1hr
Challenge: Taking more than 1hr to update bids
42. #DASummit
Real-World, Real-Time
Solution: Identified data structure as highly-connected & deep
New Reality: Search and Intelligent Bid Optimization
Solution Technical: 3TB Neo4j (10% of hardware), Elasticsearch
integrated on GraphGrid, writing over 2B nodes/edges per day
Result: Taking less than 300ms to update bids
44. #DASummit
Real-World, Real-Time
Business: Selling complex content packages
Accepted Reality: Between 4-6hrs for sales rep to get answer
Original Technical: Generating 1B row hash tables (Oracle RDBMS) w/ only 1 or 2 SMEs able to modify the stored proc
Challenge: Takes 4-6hrs to know if content package can be sold
45. #DASummit
Real-World, Real-Time
Solution: Identified data structure as highly-connected, living
New Reality: Search and intelligent content package negotiator
Solution Technical: Neo4j, Elasticsearch integrated on
GraphGrid, interactive package optimizer & recommender
Result: Sub-second determination of non-conflicting package
across entire sales organization & advisory recommender
system suggesting content to include/exclude throughout deal
47. #DASummit
Real-World, Real-Time
Business: Highly regulated global financial institution
Accepted Reality: Complex data lineages will never finish
Original Technical: Oracle SQL RDBMS
Challenge: Queries for complex lineages never finish
48. #DASummit
Real-World, Real-Time
Solution: Identified data structure as highly-connected & deep
New Reality: Complex lineages finish in under 1 minute
Solution Technical: Neo4j
Result: Even the most complex lineages finish under 1 minute
49. #DASummit
Thank You! Questions?
Getting to Real-Time in a Multi-Model Architecture
by
Benjamin Nussbaum – CTO, AtomRain Inc.
(Creators of GraphGrid Connected Data Platform)
ben@atomrain.com | @bennussbaum
atomrain.com | graphgrid.com
@atomrain | @graphgrid