Recording available here: https://youtu.be/VNYB373by0s
In any enterprise you’ll find a myriad of data technologies and architectures in use across departments and groups – segmented and disconnected – yet all aimed at the same objective: turning the vast amounts of data the organization generates and gathers into (ideally real-time) insights and quantified decisions. The components of a multi-model architecture likely already exist, but they often lack the strategy, structure, and integration needed to become an effective real-time data architecture. We discuss these issues and present solutions that pave the way.
This presentation covers:
- Defining a Strategy for Real-Time
- Understanding Each Model’s Strengths and Purpose
- Integrating It All Together
Getting to Real-Time in a Multi-Model Architecture
1.
Getting to Real-Time in a
Multi-Model Architecture
Data Architecture Summit 2017
Benjamin Nussbaum – CTO, AtomRain Inc.
(Creators of GraphGrid Connected Data Platform)
ben@atomrain.com | @bennussbaum
atomrain.com | graphgrid.com
3. #DASummit
Technology is Overhyped
• No one-size-fits-all silver bullets
• Need to align the shape of your data with the technology made for it
• Rarely does a truly transformational technology come along
• Deep understanding and discipline required to get through the noise
• Avoid running from one technology to the next hot one
• A cursory or surface understanding is not enough here
• Throwing technology or more developer bodies at a problem is not the solution; often it makes things worse
• Work with an SME to establish and navigate the data landscape
11/21/17
4. #DASummit
Data is Living
• Constantly changing as business needs and requirements change
• Data comes in many shapes and sizes
• Data goes through many transformations within an enterprise
• Data needs to become all things to all people
10. #DASummit
Separating Data Concerns
What do I need to do with my data?
Ingest it (ETL) from multiple sources
Store both structured and unstructured data
Process the unstructured data to make it usable
Contextualize, enrich and improve the structured data
Analyze, Reason & Learn. Understand & Drive Decisions
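The five concerns above can be sketched end to end as a minimal, illustrative Python pipeline (the stage names and record shapes are mine, not any product’s API):

```python
# Minimal sketch of the five data concerns as pipeline stages.
# All names here are illustrative, not tied to any product.

def ingest(sources):
    """ETL: pull raw records in from multiple sources."""
    return [record for source in sources for record in source]

def store(records):
    """Keep structured and unstructured data side by side."""
    structured = [r for r in records if isinstance(r, dict)]
    unstructured = [r for r in records if not isinstance(r, dict)]
    return structured, unstructured

def process(unstructured):
    """Turn unstructured text into usable structured records."""
    return [{"text": t, "tokens": t.split()} for t in unstructured]

def enrich(structured):
    """Contextualize and improve structured data."""
    return [dict(r, enriched=True) for r in structured]

def analyze(records):
    """Drive decisions: here, just count enriched records."""
    return sum(1 for r in records if r.get("enriched"))

sources = [[{"id": 1}, "raw log line"], [{"id": 2}]]
structured, unstructured = store(ingest(sources))
records = enrich(structured + process(unstructured))
print(analyze(records))  # all three records make it through enriched
```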
11. #DASummit
Indexes for Direct Retrieval
• Designed for responding to anticipated questions
• Rapid responses for indexed values
• Not designed for ad hoc and unexpected questions
• Costly to maintain in a rapidly changing data environment
• Slow to update because it requires dev/dba cycles
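The trade-off on this slide fits in a few lines: a hash index answers the anticipated question instantly, while an ad hoc question falls back to a full scan (an illustrative sketch, not any particular database):

```python
# A hash index answers anticipated questions in O(1),
# but an un-indexed (ad hoc) question forces a full scan.

rows = [
    {"id": 1, "name": "Ada", "city": "London"},
    {"id": 2, "name": "Ben", "city": "Paris"},
    {"id": 3, "name": "Cat", "city": "London"},
]

# Built ahead of time for the question we anticipated: lookup by name.
name_index = {row["name"]: row for row in rows}

def by_name(name):
    return name_index[name]          # direct retrieval

def by_city(city):
    # No index on city: every ad hoc question pays a full scan,
    # and adding the index later costs a dev/DBA cycle.
    return [row for row in rows if row["city"] == city]

print(by_name("Ben")["id"])          # 2
print(len(by_city("London")))        # 2
```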
12. #DASummit
Pointers for Traversal
• Designed for responding to unanticipated questions
• Rapid responses for connection-centric and depth-based questions
• Not designed for static cache-like return sets
• Easy to maintain in a rapidly changing data environment
• No dev/dba cycle required to maintain performance as data changes
• Optimal for ad hoc and unexpected connection-centric questions
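A rough sketch of index-free adjacency, the idea behind pointer-based traversal: an index locates only the starting point, and each hop then follows direct references, so the cost per hop doesn’t grow with the dataset (names are illustrative):

```python
# Index-free adjacency sketch: the index only finds a starting
# point; from there, traversal follows direct node pointers.

class Node:
    def __init__(self, name):
        self.name = name
        self.edges = []              # direct pointers to neighbors

    def connect(self, other):
        self.edges.append(other)

def traverse(start, depth):
    """Collect names of nodes reachable within `depth` hops."""
    frontier, seen = {start}, {start}
    for _ in range(depth):
        frontier = {n for node in frontier for n in node.edges} - seen
        seen |= frontier
    return sorted(n.name for n in seen)

a, b, c, d = Node("a"), Node("b"), Node("c"), Node("d")
a.connect(b); b.connect(c); c.connect(d)

index = {"a": a}                     # the index finds the start only
print(traverse(index["a"], 2))       # ['a', 'b', 'c']
```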
13. #DASummit
Disk/S3/etc for Binary Files
• Designed for storing files of binary types (e.g., PDF, DOCX)
• Efficient for raw storage of files
• Processing phase should extract meaningful information
• Structured data should reference stored location when relevant
• Optimal for non-processed, unstructured data
14. #DASummit
Search for Natural Language
• Designed for finding data through plain text
• Rapid response in ranked order for defined indexes
• Not designed for ad hoc and unexpected questions
• Requires synchronization with primary database as data changes
• Optimal for processed data built into indexed text documents
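Under the hood this is an inverted index; a toy sketch follows (the ranking here is a simple term-match count, far cruder than a real search engine’s scoring):

```python
# A search index is an inverted index: terms map to documents,
# and results come back ranked (here, by term-match count).
# It must be resynchronized whenever the primary data changes.

from collections import defaultdict

docs = {
    1: "real time data architecture",
    2: "graph data model",
    3: "time series data",
}

inverted = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        inverted[term].add(doc_id)

def search(query):
    scores = defaultdict(int)
    for term in query.split():
        for doc_id in inverted.get(term, ()):
            scores[doc_id] += 1
    # rank by match count, best first; break ties by doc id
    return [d for d, _ in sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))]

print(search("real time data"))      # doc 1 matches all three terms
```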
17. #DASummit
Object DBMS
Specialty Database:
• Designed for storing object oriented structures without translation
• Defines classes, objects and models (no separation or mapping layer between data and application code)
• Object structures including inheritance and other objects referenced on a property persisted together
• Not designed for multiple applications with varying code structures
• Object oriented code structures are stored directly
• Today many applications use the same database but with varying views and configurations of the data
• Not an option as a backing store for your primary data
19. #DASummit
Native XML DBMS
Specialty Database:
• Designed for storing XML structures without translation
• Internal data model corresponds to XML documents, though data isn’t necessarily stored as XML documents
• Supports XML-specific query languages such as XPath, XQuery and/or XSLT
• Not designed for normalized representations of data
• Similar to document stores in this way
• Overlapping representations of the same data in varying XML documents
• Not an option as a backing store for your primary data
21. #DASummit
Time Series
Specialty Database:
• Designed for streams of time series data as inputs, written in ascending timestamp order (most recent appended last)
• Not designed for all CRUD operations
• Updates to existing records are expected to be a rare occurrence
• Deletions are rare and typically remove a large chunk of data far in the past
• Not an option as a backing store for your primary data
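The append-mostly access pattern described above can be sketched as follows (illustrative only, not any time-series product’s API):

```python
# Time-series sketch: appends arrive in ascending timestamp order;
# updates are rare, and deletes drop a large chunk of old data.

series = []                          # list of (timestamp, value)

def append(ts, value):
    assert not series or ts >= series[-1][0], "out-of-order write"
    series.append((ts, value))

def drop_before(ts):
    """Bulk-delete old data, the only common delete pattern."""
    series[:] = [p for p in series if p[0] >= ts]

for t, v in [(1, 10.0), (2, 10.5), (5, 9.8)]:
    append(t, v)
drop_before(2)
print(series)                        # [(2, 10.5), (5, 9.8)]
```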
23. #DASummit
Search Engine
Specialty Database:
• Designed for finding content within the search data store
• Typically stored as finely tuned textual or geospatial indexes
• Should be seen as a purposely built cache to support finding data in ranked order
• Not designed for ACID reliability
• Not an option as a backing store for your primary data
25. #DASummit
RDF/Triple Stores
Specialty Database:
• Designed for storing the RDF model
• Resource Description Framework is a methodology for description of information
• Information is represented in triples: subject – predicate – object
• Provide methods specifically for dealing with triples via SPARQL, an SQL-like query language
• Rely on indexes being built and maintained for retrieving data
• Not designed for application and other such non-RDF format data
• Not an option as a backing store for your primary data
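A toy sketch of triples and SPARQL-style pattern matching (here `None` stands in for the wildcard a SPARQL variable would play; the data and names are made up):

```python
# RDF sketch: facts as (subject, predicate, object) triples,
# queried by matching triple patterns against the store.

triples = [
    ("ben", "worksAt", "AtomRain"),
    ("ben", "knows", "alice"),
    ("alice", "worksAt", "AtomRain"),
]

def match(s=None, p=None, o=None):
    """Return triples matching the pattern; None = wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# Who works at AtomRain?
print([s for s, _, _ in match(p="worksAt", o="AtomRain")])
```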
27. #DASummit
Key-Value Stores
Specialty Database:
• Designed to store and retrieve values associated with a given key
• Very simple structure that works well in a store<-->retrieve paradigm with a known key
• Good for caching type requirements when near instant retrieval of a key’s value is needed
• Not designed for ACID reliability
• Not an option as a backing store for your primary data
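The whole paradigm fits in a few lines (illustrative sketch; the key names are made up):

```python
# Key-value sketch: near-instant retrieval by a known key; there is
# no way to ask questions about the values themselves.

store = {}

def put(key, value):
    store[key] = value

def get(key, default=None):
    return store.get(key, default)

put("session:42", {"user": "ben", "ttl": 300})
print(get("session:42")["user"])     # ben
print(get("session:99", "miss"))     # miss
```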
29. #DASummit
Document Stores
Specialty Database:
• Designed to store and index schema-free textual documents (e.g., JSON)
• Enables defining and maintaining indexes for querying against fields in the documents
• Good for store<-->retrieve of large text document structures based on indexed fields within them
• Some implementations support varying document levels of ACID reliability
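A minimal sketch of a document store with one maintained field index (the `status` field is an arbitrary example, not part of any product):

```python
# Document-store sketch: schema-free JSON documents, plus an index
# maintained on a chosen field so queries avoid scanning every document.

import json
from collections import defaultdict

documents = {}                        # doc_id -> parsed document
field_index = defaultdict(list)       # value of "status" -> doc_ids

def insert(doc_id, raw_json):
    doc = json.loads(raw_json)        # schema-free: any shape accepted
    documents[doc_id] = doc
    if "status" in doc:
        field_index[doc["status"]].append(doc_id)

def find_by_status(status):
    return [documents[i] for i in field_index[status]]

insert(1, '{"status": "open", "title": "first"}')
insert(2, '{"status": "done"}')
insert(3, '{"status": "open", "extra": [1, 2]}')
print(len(find_by_status("open")))   # 2
```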
31. #DASummit
Wide Column Stores
Specialty Database:
• Designed to store a large number of dynamic columns
• Uses a table structure where columns are created for each row instead of being defined by the table
• Column names and record keys are not fixed (schema-free in this regard compared to RDBMS tables)
• Good for store<-->retrieve of data that looks like two-dimensional key-value store and is within a table
• Some implementations support varying levels of ACID reliability
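The dynamic-columns idea in miniature (illustrative sketch; real wide-column stores add column families, timestamps, and distribution on top of this):

```python
# Wide-column sketch: each row key owns its own set of columns;
# column names are not fixed by the table, so rows can differ.

table = {}                            # row_key -> {column: value}

def put(row_key, column, value):
    table.setdefault(row_key, {})[column] = value

put("user:1", "name", "Ada")
put("user:1", "last_login", "2017-11-21")
put("user:2", "name", "Ben")          # no last_login column at all

print(sorted(table["user:1"]))        # ['last_login', 'name']
print(sorted(table["user:2"]))        # ['name']
```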
33. #DASummit
Relational DBMS
Primary Database:
• Designed to store and retrieve data using a table-oriented data model
• Tables look like Excel spreadsheets – columns are the schema (properties/attributes) and rows are the data entries
• Good for aggregations and filters within a single table
• Supports JOIN operations to return data across 2 or more tables
• JOIN operations slow down exponentially with each table included
• This is due to the way the Cartesian product of a JOIN works
• Designed to be fully ACID and Transactional (avoid those that aren’t)
• Historically has been the backing store for primary data throughout the enterprise
• Yes, you can read this as “a changing of the guard is taking place” (I’ve seen it accelerate a lot this year)
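The JOIN cost the slide warns about can be made concrete with a toy nested-loop join; real RDBMSs optimize far beyond this, but the multiplicative pairing of rows is the underlying issue (tables and data here are made up):

```python
# JOIN sketch: a nested-loop join over two tables examines the
# Cartesian product, so cost multiplies with every table added.

people = [(1, "Ada"), (2, "Ben")]
accounts = [(10, 1), (11, 1), (12, 2)]   # (account_id, person_id)

def join(left, right, on):
    comparisons = 0
    rows = []
    for l in left:                    # every pair is considered
        for r in right:
            comparisons += 1
            if on(l, r):
                rows.append(l + r)
    return rows, comparisons

rows, comparisons = join(people, accounts, lambda l, r: l[0] == r[1])
print(len(rows), comparisons)         # 3 matches out of 6 comparisons
```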
35. #DASummit
Graph DBMS
Primary Database:
• Designed to store and retrieve data using a connection-oriented model
• Connections (edges) provide the context of how two things are related
• Entities (nodes) are the things (e.g., Person, Account) in your data
• Properties are supported on nodes and edges (avoid those that don’t)
• Indexes are only used to find starting points in the data (avoid those that don’t)
• This removes JOIN pain when answering questions requiring movement across data entities (traversals)
• Designed to be fully ACID and Transactional (avoid those that aren’t)
• Provide Dynamic (index-free), Constant Time movement across data entities
• Becoming the backing store for primary data throughout the enterprise
• Yes, you can read this as “replacing RDBMS as the primary store” (I’ve been using one like this for 5+yrs)
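A toy property-graph sketch showing nodes and edges both carrying properties, with the edge type giving the context of the relationship (names and classes here are illustrative, not Neo4j’s API):

```python
# Property-graph sketch: nodes and edges both carry properties,
# and an edge's type says how two things are related.

class Edge:
    def __init__(self, rel_type, target, **props):
        self.type, self.target, self.props = rel_type, target, props

class GraphNode:
    def __init__(self, label, **props):
        self.label, self.props, self.out = label, props, []

    def relate(self, rel_type, target, **props):
        self.out.append(Edge(rel_type, target, **props))

ben = GraphNode("Person", name="Ben")
acct = GraphNode("Account", number="001")
ben.relate("OWNS", acct, since=2012)  # the edge itself carries context

owns = [e for e in ben.out if e.type == "OWNS"]
print(owns[0].target.props["number"], owns[0].props["since"])
```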
37. #DASummit
Multi-Model Database
Distracted Database:
• Mostly designed for or favors one primary storage model and purpose
• Then added support for other conceptual models at the API level
• Beneficial for marketing
• Convenient for developers
• Largely a distraction to initial primary objective
• There are always tradeoffs in optimizing for a specific model
38. #DASummit
Not a Database
Not a Database:
• Be wary of databases that aren’t actually databases
• There is much marketing confusion and misclassification happening
• Commonly tout a pluggable storage engine (that just means they’re using a database for storage)
• Other times you have to dig through the documentation to see how they’re actually storing their data
• Often the novel thing is a value added API layer
• Using these will tie you into whichever data storage decisions they’ve made
• Not all “databases” listed on db-engines are actually databases in the true sense
• A database that uses another database as the storage engine
• A database that is all in memory and doesn’t deal with persisted data
41. #DASummit
Real-World, Real-Time
Business: Buying and selling of online advertising
Accepted Reality: Maximum of 1hr to update bids
Original Technical: 3TB SQL RDBMS relying on distributed, federated and highly indexed views to come close to 1hr
Challenge: Taking more than 1hr to update bids
42. #DASummit
Real-World, Real-Time
Solution: Identified data structure as highly-connected & deep
New Reality: Search and Intelligent Bid Optimization
Solution Technical: 3TB Neo4j (10% of hardware), Elasticsearch
integrated on GraphGrid, writing over 2B nodes/edges per day
Result: Taking less than 300ms to update bids
44. #DASummit
Real-World, Real-Time
Business: Selling complex content packages
Accepted Reality: Between 4-6hrs for sales rep to get answer
Original Technical: Generating 1B row hash tables (Oracle RDBMS) w/ only 1 or 2 SMEs able to modify the stored proc
Challenge: Takes 4-6hrs to know if content package can be sold
45. #DASummit
Real-World, Real-Time
Solution: Identified data structure as highly-connected, living
New Reality: Search and intelligent content package negotiator
Solution Technical: Neo4j, Elasticsearch integrated on
GraphGrid, interactive package optimizer & recommender
Result: Sub-second determination of non-conflicting package
across entire sales organization & advisory recommender
system suggesting content to include/exclude throughout deal
47. #DASummit
Real-World, Real-Time
Business: Highly regulated global financial institution
Accepted Reality: Complex data lineages will never finish
Original Technical: Oracle SQL RDBMS
Challenge: Queries for complex lineages never finish
48. #DASummit
Real-World, Real-Time
Solution: Identified data structure as highly-connected & deep
New Reality: Complex lineages finish in under 1 minute
Solution Technical: Neo4j
Result: Even the most complex lineages finish under 1 minute
49. #DASummit
Thank You! Questions?
Getting to Real-Time in a Multi-Model Architecture
by
Benjamin Nussbaum – CTO, AtomRain Inc.
(Creators of GraphGrid Connected Data Platform)
ben@atomrain.com | @bennussbaum
atomrain.com | graphgrid.com
@atomrain | @graphgrid