Explore the use-cases and architecture for Apache Kafka, and how it integrates with MongoDB to build sophisticated data-driven applications that exploit new sources of data.
11. What is Streaming Data
Latency tiers:
• Synchronous Req/Response: 0 – 100s ms
• Near Real Time: > 100s ms
• Offline Batch: > 1 hour
[Diagram: Kafka as the Stream Data Platform, connecting Search, RDBMS, Apps, Monitoring, Real-time Analytics, NoSQL, Stream Processing, the Hadoop Data Lake (Impala, Hive, Spark, Map-Reduce) and the DWH]
13. Confluent Platform: It’s Kafka ++
Editions compared: Apache Kafka, Confluent Platform, Confluent Platform Enterprise
Feature: Benefit
• Apache Kafka: High throughput, low latency, high availability, secure distributed messaging system
• Kafka Connect: Advanced framework for connecting external sources/destinations into Kafka
• Java Client: Provides easy integration into Java applications
• Kafka Streams: Simple library that enables streaming application development within the Kafka framework
• Additional Clients: Supports non-Java clients; C, C++, Python, etc.
• REST Proxy: Provides universal access to Kafka from any network connected device via HTTP
• Schema Registry: Central registry for the format of Kafka data; guarantees all data is always consumable
• Pre-Built Connectors: HDFS, JDBC and other connectors fully certified and fully supported by Confluent
• Confluent Control Center: Includes Connector Management and Stream Monitoring
• Support: Enterprise class support to keep your Kafka environment running at top performance
Support level by edition: Community, Community, 24x7x365
Price by edition: Free, Free, Subscription
14. Common Kafka Use Cases
Data transport and integration
• Log data
• Database changes
• Sensors and device data
• Monitoring streams
• Call data records
• Stock ticker data
Real-time stream processing
• Monitoring
• Asynchronous applications
• Fraud and security
15. People Using Kafka Today
Financial Services
Entertainment & Media
Consumer Tech
Travel & Leisure
Enterprise Tech
Telecom
Retail
29. Design Pattern: Operationalized Data Lake
[Diagram labels]
• Data sources: Sensors, User Data, Clickstreams, Logs
• Message Queue: Raw Data, Processed Events
• Distributed Processing Frameworks, Kafka Streams
• Analytics models: Churn Analysis, Enriched Customer Profiles, Risk Modeling, Predictive Analytics
• Operational apps: Customer Data Mgmt, Mobile App, IoT App, Live Dashboards
• Real-Time Access: Millisecond latency. Expressive querying & flexible indexing against subsets of data. Updates in place. In-database aggregations & transformations
• Batch Processing, Batch Views: Multi-minute latency with scans across TB/PB of data. No indexes. Data stored in 128MB blocks. Write-once-read-many & append-only storage model
30. Design Pattern: Operationalized Data Lake
Callout: Configure where to land incoming data
31. Design Pattern: Operationalized Data Lake
Callout: Raw data processed to generate analytics models
32. Design Pattern: Operationalized Data Lake
Callout: MongoDB exposes analytics models to operational apps. Handles real time updates
33. Design Pattern: Operationalized Data Lake
Callout: Compute new models against MongoDB & HDFS
41. MongoDB Atlas
Database as a service for MongoDB
MongoDB Atlas is…
• Automated: The easiest way to build, launch, and scale apps on MongoDB
• Flexible: The only database as a service with all you need for modern applications
• Secured: Multiple levels of security available to give you peace of mind
• Scalable: Deliver massive scalability with zero downtime as you grow
• Highly available: Your deployments are fault-tolerant and self-healing by default
• High performance: The performance you need for your most demanding workloads
42. MongoDB Atlas Features
Run for You
• Spin up a cluster in seconds
• Replicated & always-on deployments
• Fully elastic: scale out or up in a few clicks with zero downtime
• Automatic patches & simplified upgrades for the newest MongoDB features
Safe & Secure
• Authenticated & encrypted
• Continuous backup with point-in-time recovery
• Fine-grained monitoring & custom alerts
No Lock-In
• On-demand pricing model; billed by the hour
• Multi-cloud support (AWS available with others coming soon)
• Part of a suite of products & services designed for all phases of your app; migrate easily to different environments (private cloud, on-prem, etc.) when needed
43. MongoDB Enterprise Advanced
Tooling
• MongoDB Ops Manager or MongoDB Cloud Manager Premium
• MongoDB Compass
• MongoDB Connector for BI
Security
• Encrypted Storage Engine
• LDAP / Kerberos Integration
• DDL & DML Auditing
• FIPS 140-2 Support
Support
• 24 x 7 Support
• 1 hr SLA
• Emergency Patches
• Customer Success Program
• On-Demand Training
License
• Commercial License
44. Resources
• Data Streaming with Apache Kafka & MongoDB
• https://www.mongodb.com/collateral/data-streaming-with-apache-kafka-and-mongodb
• Implementing a Kafka Consumer for MongoDB
• https://www.mongodb.com/blog/post/mongodb-and-data-streaming-implementing-a-mongodb-kafka-consumer
• Tailing the Oplog on a sharded MongoDB Cluster
• https://www.mongodb.com/blog/post/tailing-mongodb-oplog-sharded-clusters
46. Document Data Model
Relational
Customer ID | First Name | Last Name | City
0 | John | Doe | New York
1 | Mark | Smith | San Francisco
2 | Jay | Black | Newark
3 | Meagan | White | London
4 | Edward | Daniels | Boston
Phone Number | Type | DNC | Customer ID
1-212-555-1212 | home | T | 0
1-212-555-1213 | home | T | 0
1-212-555-1214 | cell | F | 0
1-212-777-1212 | home | T | 1
1-212-777-1213 | cell | (null) | 1
1-212-888-1212 | home | F | 2
MongoDB
{
  customer_id : 1,
  first_name : "Mark",
  last_name : "Smith",
  city : "San Francisco",
  phones : [
    {
      number : "1-212-777-1212",
      dnc : true,
      type : "home"
    },
    {
      number : "1-212-777-1213",
      type : "cell"
    }
  ]
}
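The mapping between the two representations on this slide can be sketched in a few lines of Python. This is only an illustration: the rows are copied from the slide, while the `to_document` helper and the choice to omit a null DNC value from the embedded phone entry are assumptions made for the example.

```python
# Rows copied from the two relational tables on this slide.
customers = [
    (0, "John", "Doe", "New York"),
    (1, "Mark", "Smith", "San Francisco"),
    (2, "Jay", "Black", "Newark"),
    (3, "Meagan", "White", "London"),
    (4, "Edward", "Daniels", "Boston"),
]

# (number, type, dnc, customer_id); None stands in for the (null) DNC value.
phones = [
    ("1-212-555-1212", "home", True, 0),
    ("1-212-555-1213", "home", True, 0),
    ("1-212-555-1214", "cell", False, 0),
    ("1-212-777-1212", "home", True, 1),
    ("1-212-777-1213", "cell", None, 1),
    ("1-212-888-1212", "home", False, 2),
]


def to_document(customer, phone_rows):
    """Embed a customer's phone rows in one document (hypothetical helper)."""
    customer_id, first_name, last_name, city = customer
    embedded = []
    for number, phone_type, dnc, owner_id in phone_rows:
        if owner_id != customer_id:
            continue
        phone = {"number": number, "type": phone_type}
        if dnc is not None:
            # Assumption: a null column is simply left out of the document.
            phone["dnc"] = dnc
        embedded.append(phone)
    return {
        "customer_id": customer_id,
        "first_name": first_name,
        "last_name": last_name,
        "city": city,
        "phones": embedded,
    }


documents = [to_document(customer, phones) for customer in customers]
```

Here `documents[1]` is the Mark Smith document shown on the slide: what took a join across two tables becomes a single record with the phone numbers embedded.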
47. Document Model Benefits
{
  customer_id : 1,
  first_name : "Mark",
  last_name : "Smith",
  city : "San Francisco",
  phones : [
    {
      number : "1-212-777-1212",
      dnc : true,
      type : "home"
    },
    {
      number : "1-212-777-1213",
      type : "cell"
    }
  ]
}
Agility and flexibility
• Data model supports business change
• Rapidly iterate to meet new requirements
Intuitive, natural data representation
• Eliminates ORM layer
• Developers are more productive
Reduces the need for joins, disk seeks
• Programming is simpler
• Performance delivered at scale
48. Rich Functionality
MongoDB
Expressive Queries
• Find anyone with phone # “1-212…”
• Check if the person with number “555…” is on the “do not call” list
Geospatial
• Find the best offer for the customer at geo coordinates of 42nd St. and 6th Ave
Text Search
• Find all tweets that mention the firm within the last 2 days
Aggregation
• Count and sort number of customers by city
Native Binary JSON support
• Add an additional phone number to Mark Smith’s record without rewriting the document
• Select just the mobile phone number in the list
• Sort on the modified date
{
  customer_id : 1,
  first_name : "Mark",
  last_name : "Smith",
  city : "San Francisco",
  phones : [
    {
      number : "1-212-777-1212",
      dnc : true,
      type : "home"
    },
    {
      number : "1-212-777-1213",
      type : "cell"
    }
  ]
}
Left outer join ($lookup)
• Query for all San Francisco residents, look up their transactions, and sum the amount by person
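Several of the examples on this slide can be written down as MongoDB query, update and aggregation documents. The sketch below expresses them as plain Python dictionaries, in the form a driver such as PyMongo accepts. The field names come from the sample document; the `transactions` collection, its `amount` field, the `total_amount` output field and the extra phone number are assumptions made purely for illustration.

```python
# Expressive query: anyone whose phone number starts with "1-212".
phone_prefix_filter = {"phones.number": {"$regex": "^1-212"}}

# Add a phone number to Mark Smith's record without rewriting the document.
# The number and type pushed here are hypothetical.
add_phone_filter = {"customer_id": 1}
add_phone_update = {
    "$push": {"phones": {"number": "1-212-777-1214", "type": "work"}}
}

# Projection that selects just the mobile phone entry in the list.
cell_only_projection = {"phones": {"$elemMatch": {"type": "cell"}}}

# Aggregation: count customers by city, largest count first.
customers_by_city = [
    {"$group": {"_id": "$city", "count": {"$sum": 1}}},
    {"$sort": {"count": -1}},
]

# Left outer join: San Francisco residents, their transactions, and the
# summed amount per person. "transactions" and "amount" are assumed names.
spend_by_sf_resident = [
    {"$match": {"city": "San Francisco"}},
    {"$lookup": {
        "from": "transactions",
        "localField": "customer_id",
        "foreignField": "customer_id",
        "as": "transactions",
    }},
    {"$project": {
        "first_name": 1,
        "last_name": 1,
        "total_amount": {"$sum": "$transactions.amount"},
    }},
]
```

With PyMongo, these would typically be passed to `collection.find(filter, projection)`, `collection.update_one(filter, update)` and `collection.aggregate(pipeline)` respectively; no connection is made here, so the block runs without a database.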
50. MongoDB Use Cases
Single View Internet of Things Mobile Real-Time Analytics
Catalog Personalization Content Management
Editor's Notes
A lot of people expect us to come in and bash relational databases or say we don’t think they’re good. That’s simply not true.
Relational databases have laid the foundation for what you’d want out of a database, and we absolutely think there are capabilities that remain critical today:
Expressive query language & secondary indexes. Users should be able to access and manipulate their data in sophisticated ways, and you need a query language that lets you do all that out of the box. Indexes are a critical part of providing efficient access to data. We believe these are table stakes for a database.
Strong consistency. Strong consistency has become second nature for how we think about building applications, and for good reason. The database should always provide access to the most up-to-date copy of the data. Strong consistency is the right way to design a database.
Enterprise Management and Integrations. Finally, databases are just one piece of the puzzle, and they need to fit into the enterprise IT stack. Organizations need a database that can be secured, monitored, automated, and integrated with their existing IT infrastructure and staff, such as operations teams, DBAs, and data analysts.
But of course the world has changed a lot since the 1980s when the relational database first came about.
First of all, data and risk are significantly up.
In terms of data:
90% of data was created in the last 2 years - think about that for a moment: of all the data ever created, 90% of it was created in the last 2 years
80% of enterprise data is unstructured - this is data that doesn’t fit into the neat tables of a relational database
Unstructured data is growing at twice the rate of structured data
At the same time, risks of running a database are higher than ever before. You are now faced with:
More users - Apps have shifted from small internal departmental systems with thousands of users to large external audiences with millions of users
No downtime - It’s no longer the case that apps only need to be available during standard business hours. They must be up 24/7.
All across the globe - your users are everywhere, and they are always connected
On the other hand, time and costs are way down.
There’s less time to build apps than ever before. You’re being asked to:
Ship apps in a few months not years - Development methods have shifted from a waterfall process to an iterative process that ships new functionality in weeks and in some cases multiple times per day at companies like Facebook and Amazon.
And costs are way down too. Companies want to:
Pay for value over time - Companies have shifted to open-source business and SaaS models that allow them to pay for value over time
Use cloud and commodity resources - to reduce the time to provision their infrastructure, and to lower their total cost of ownership
Because the relational database was not designed for modern applications, starting about 10 years ago a number of companies began to build their own databases that are fundamentally different. The market calls these NoSQL.
NoSQL databases were designed for this new world…
Flexibility. All of them have some kind of flexible data model to allow for faster iteration and to accommodate the data we see dominating modern applications. While they all have different approaches, what they have in common is they want to be more flexible.
Scalability + Performance. Similarly, they were all built with a focus on scalability, so they all include some form of sharding or partitioning. And they're all designed to deliver great performance. Some are better at reads, some are better at writes, but more or less they all strive to have better performance than a relational database.
Always-On Global Deployments. Lastly, NoSQL databases are designed for highly available systems that provide a consistent, high quality experience for users all over the world. They are designed to run on many computers, and they include replication to automatically synchronize the data across servers, racks, and data centers.
However, when you take a closer look at these NoSQL systems, it turns out they have thrown out the baby with the bathwater. They have sacrificed the core database capabilities you’ve come to expect and rely on in order to build fully functional apps, like rich querying and secondary indexes, strong consistency, and enterprise management.
MongoDB was built to address the way the world has changed while preserving the core database capabilities required to build modern applications.
Our vision is to leverage the work that Oracle and others have done over the last 40 years to make relational databases what they are today, and to take the reins from here. We pick up where they left off, incorporating the work that internet pioneers like Google and Amazon did to address the requirements of modern applications.
MongoDB is the only database that harnesses the innovations of NoSQL and maintains the foundation of relational databases – and we call this our Nexus Architecture.
When using any database as a producer, it's necessary to capture any database changes so that they can be written to Kafka. With MongoDB this can be achieved by monitoring its oplog.
The oplog (operations log) is a special capped collection that keeps a rolling record of all operations that modify the data stored in your database.
Tailable cursors have many uses, such as real-time notifications of all the changes to your database. A tailable cursor is conceptually similar to the Unix `tail -f` command: once you've reached the end of the result set, the cursor is not closed; it continues to wait for new data and returns it when it arrives.
MongoDB replication is implemented using the oplog and tailable cursors; the primary records all write operations in its oplog. The secondary members then asynchronously fetch and then apply those operations. By using a tailable cursor on the oplog, an application receives all changes that are made to the database in near real-time.
A producer can be written to propagate all MongoDB writes to Kafka by tailing the oplog in the same way. The logic is more complex when using a sharded cluster:
• The oplog for each shard must be tailed
• The MongoDB shard balancer occasionally moves documents from one shard to another, causing *deletes* to be written to the originating shard's oplog and *inserts* to that of the receiving shard. Those internal operations must be filtered out.
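The filtering step described above can be sketched as a small function that turns an oplog entry into a message for Kafka, or skips it. This is a minimal illustration rather than a complete producer: it assumes oplog entries carry the usual `op`, `ns`, `o` and `o2` fields and that balancer migrations are marked with `fromMigrate`. The topic naming scheme and the message layout are hypothetical. In a real deployment the entries would come from a tailable cursor on each shard's oplog and the results would be handed to a Kafka producer; both are left out so the example runs on its own.

```python
import json

# Oplog operation codes that represent data changes.
# ("n" is a no-op and "c" is a command; neither is propagated here.)
DATA_OPS = {"i": "insert", "u": "update", "d": "delete"}


def oplog_entry_to_message(entry):
    """Return (topic, key, value) for a data change, or None to skip the entry."""
    if entry.get("fromMigrate"):
        # Written by the balancer while moving documents between shards.
        return None
    operation = DATA_OPS.get(entry.get("op"))
    if operation is None:
        return None
    # Updates identify the target document in "o2"; inserts and deletes in "o".
    doc_id = (entry.get("o2") or entry["o"]).get("_id")
    topic = "mongodb." + entry["ns"]  # hypothetical topic naming scheme
    value = json.dumps({"operation": operation, "document": entry["o"]}, default=str)
    return topic, str(doc_id), value


# Hand-written sample entries standing in for what a tailable cursor returns.
sample_entries = [
    {"op": "i", "ns": "shop.customers", "o": {"_id": 1, "city": "San Francisco"}},
    {"op": "d", "ns": "shop.customers", "o": {"_id": 7}, "fromMigrate": True},
    {"op": "n", "ns": "", "o": {"msg": "periodic noop"}},
    {"op": "u", "ns": "shop.customers", "o": {"$set": {"city": "Boston"}}, "o2": {"_id": 1}},
]

messages = [m for m in map(oplog_entry_to_message, sample_entries) if m is not None]
```

Of the four sample entries, only the insert and the update produce messages; the migration delete and the no-op are dropped.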
This is a design pattern for the data lake: multiple components that collectively handle ingest, storage, processing and analysis of data, then serve it to consuming operational apps.
Step through:
Data ingestion: Data streams are ingested to a pub/sub message queue, which routes all raw data into HDFS.
Event processing often also runs against the queue to find interesting events that need to be consumed by the operational apps immediately, such as displaying an offer to a user browsing a product page, or alarms generated against vehicle telemetry from an IoT app. These events are routed to MongoDB for immediate consumption by operational applications.
Raw data is loaded into the data lake, where Hadoop jobs (MapReduce or Spark) generate analytics models from the raw data; see the examples in the layer above HDFS.
MongoDB exposes these models to the operational processes, serving indexed queries and updates against them with real-time latency.
The distributed processing frameworks can re-compute analytics models against data stored in either HDFS or MongoDB, continuously flowing updates from the operational database to the analytics models.
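The ingestion and event-processing steps can be illustrated with a toy router: every raw event lands in the data lake, and only events flagged as interesting are also forwarded for immediate use by operational apps. The sketch below uses in-memory lists in place of HDFS and MongoDB, and the overheating rule, field names and threshold are all hypothetical.

```python
def route_event(event, raw_sink, operational_sink, is_interesting):
    """Land every raw event in the data lake; forward interesting ones as well."""
    raw_sink.append(event)
    if is_interesting(event):
        operational_sink.append(event)


def engine_overheating(event):
    # Hypothetical rule: raise an alarm on vehicle telemetry above an assumed threshold.
    return event.get("type") == "telemetry" and event.get("engine_temp_c", 0) > 110


hdfs_raw = []        # stands in for raw data written to HDFS
mongodb_events = []  # stands in for processed events written to MongoDB

stream = [
    {"type": "telemetry", "vehicle": "A", "engine_temp_c": 92},
    {"type": "telemetry", "vehicle": "B", "engine_temp_c": 127},
    {"type": "clickstream", "page": "/product/42"},
]

for event in stream:
    route_event(event, hdfs_raw, mongodb_events, engine_overheating)
```

All three events end up in `hdfs_raw`, while only vehicle B's reading reaches `mongodb_events`.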
We'll look at some examples of users who have deployed this type of design pattern a little later.
**Comparethemarket.com**: One of the UK’s leading price comparison providers, and one of the country’s best known household brands. Comparethemarket.com uses MongoDB as the default operational database across its microservices architecture. Its online comparison systems need to collect customer details efficiently and then securely send them to a number of different providers. Once the insurers' systems respond, Comparethemarket.com can aggregate and display prices for consumers. At the same time, MongoDB generates real-time analytics to personalize the customer experience across the company's web and mobile properties.
As Comparethemarket.com transitioned to microservices, the data warehousing and analytics stack were also modernized. While each microservice uses its own MongoDB database, the company needs to maintain synchronization between services, so every application event is written to a Kafka topic. Event processing runs against the topic to identify relevant events that can then trigger specific actions – for example customizing customer questions, firing off emails, presenting new offers and more. Relevant events are written to MongoDB, enabling the user experience to be personalized in real time as customers interact with the service.
**Man AHL**: Man is one of the largest hedge fund managers in the world. AHL is a subsidiary focused on systematic trading, and has been moving all of its data to MongoDB.
It needs to analyse data from a large number of disparate data sources.
Retrieving data is 100x faster than when they were using flat files and RDBMS.
MongoDB is used for futures and single stock forecasting. The Kafka use case is for “tick data” (every change in the price of a stock): 400,000,000 messages per day.
The source of the data is 3rd party commercial feeds into a 3rd party message bus. 150K ticks/sec are written to Kafka, which can buffer, batch and replay ticks in the event of a problem. The data is then written to MongoDB. Each database holds a year’s worth of ticks.
25X greater tick throughput with just 2 machines => 250M per second. 40x cost saving.
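To put the tick volumes quoted above in perspective, the short calculation below compares the daily message count with the quoted write rate. It assumes, purely for illustration, that the 400,000,000 messages are spread evenly over a full 24 hours and that 150K/sec is a peak rate; the notes do not say either.

```python
SECONDS_PER_DAY = 24 * 60 * 60

messages_per_day = 400_000_000   # figure quoted in the notes
peak_ticks_per_second = 150_000  # figure quoted in the notes

# Average rate if the daily volume were spread evenly across the day.
average_per_second = messages_per_day / SECONDS_PER_DAY
peak_to_average = peak_ticks_per_second / average_per_second

print(f"average: {average_per_second:,.0f} messages/sec")
print(f"peak is {peak_to_average:.1f}x the average")
```

Under those assumptions the average is roughly 4,630 messages per second, so the quoted 150K/sec is about 32 times the average. A gap of that size is one reason a buffer such as Kafka is useful in front of the database.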
**State**: State is an intelligent opinion network; connecting people with similar beliefs who want to join forces and make waves. User and opinion data is written to MongoDB and then the oplog is tailed so that all changes are written to user and opinion topics in Kafka where they are consumed by the user recommendation engine. Details on State's use of MongoDB and Kafka can be found in this [presentation](http://www.slideshare.net/danharvey/change-data-capture-with-mongodb-and-kafka "Use of MongoDB and Kafka in the State social network").
Built and managed by the same team that builds the database, MongoDB Atlas provides the features of MongoDB without the operational heavy lifting, enabling you to focus on what you do best.
MongoDB Enterprise Advanced provides everything you need to [insert relevant value driver. Draw from relevant bullets below to support this claim]
MongoDB Ops Manager or Cloud Manager Premium: full management platform to de-risk MongoDB in production
Monitor the health of your system
Visual Query profiler to identify slow-running queries
Index suggestions and automated index rollouts
Automate deployment, configuration, maintenance, upgrades and scaling
Back up and restore to any point in time (standard network mountable filesystems supported)
APM integration with enhanced drivers
(Ops Manager) Runs behind your firewall.
MongoDB Compass – Schema and data visualization; understand the data stored in your database with no knowledge of the MongoDB query language. Ad hoc queries with a few clicks of your mouse
BI Connector – Visualize and analyze the multi-structured data stored in MongoDB using SQL-based BI tools such as Tableau, Qlikview, Spotfire and more
Enterprise-grade, follow the sun support with a 1-hour SLA
Not just break/fix support
Direct access to industry best-practices
On-Demand Training
Access to our online courses at your own pace to get team members up to speed
Advanced Security
Encrypted Storage Engine for end-to-end database encryption
LDAP and Kerberos to integrate with existing authentication and authorization infrastructure
Auditing of all database operations for compliance
Commercial license
To meet the needs of organizations that have policies against using open source, AGPL software
Platform Certification
Tested and certified for stability and performance on Windows, Red Hat/CentOS, Ubuntu, and Amazon Linux
Beta access to the in-memory storage engine for your highest-throughput, most demanding apps: in-memory computing without sacrificing data durability