Kafka is a scalable, distributed publish/subscribe messaging system that's used as a data transmission backbone in many data-intensive digital businesses. Couchbase Server is a scalable, flexible document database that's fast, agile, and elastic. Because they both appeal to the same type of customers, Couchbase and Kafka are often used together.
This presentation from a meetup in Mountain View describes Kafka's design and why people use it, Couchbase Server and its uses, and the use cases for both together. Also covered is a description and demo of Couchbase Server writing documents to a Kafka topic and consuming messages from a Kafka topic using the Couchbase Kafka Connector.
If you tweet about Kafka, you may be followed by this weird Franz Kafka bot…
… but the real Franz Kafka died of tuberculosis in 1924…
Interestingly, this Franz Kafka bot sounds quite a bit like the real Franz Kafka. Not overly cheery… I thought perhaps it was just real quotes from his work
But no – it just sounds like it could be him…
It helps you decouple systems in time – systems can be asynchronous, but this is more than that
Consuming systems don’t have to be on or even exist at the time that producers are making messages
Schema registry that describes what is known about data produced in different systems
Publish / subscribe system
Hooking up application server logs, caches, databases, and so forth
You don’t want each system to have to be matched or hand-integrated with every other system and service, with different adapter code, different error handling behavior, logging, etc. And how do you share metadata? What do you do with different revisions of the schema?
That’s madness
Imagine trying to add a new service that needs to read from 10 other services….
Organizationally that’s difficult, and every team has the potential to make different decisions about what systems to use…
Kafka helps mitigate different expectations of speed and size of data being ingested in various systems
Hadoop – HDFS can take tons of data, but not in tiny pieces – it’s a batch oriented system
NoSQL databases like Couchbase can scale to billions of users with sub-millisecond response times, but they aren’t designed for bulk loads
Compare application server logs, a bulk database extraction, processing a stream of Twitter messages
There can be issues with integrations where the slowest system becomes the bottleneck
Vision is to have a scalable, low-latency pub/sub message queue as the standard interface for real-time streaming data
Hadoop, HDFS specifically, fills this role for batch systems and led to a large ecosystem of useful tools that can interoperate via Hadoop data storage
Kafka does the same for real-time data, and can scale to handle your entire organization’s data. Kafka acts as the hub and applications hang off of it, exchanging data through Kafka
We refer to this architecture as a stream data platform.
Reminder: On this slide – you need to talk about the differences between Couchbase and Hadoop – they are complementary, they solve different problems
Messaging: Decouple data processing from data producers
Log Aggregation: A log as stream of messages
Stream Processing: Consume data from one topic and put the filtered/transformed data into another one
Click Stream Analysis: Page views/searches as real-time publish-subscribe feeds
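The stream processing use case above can be sketched with a minimal, in-memory stand-in for Kafka topics (the topic names and the "slow page" filter here are hypothetical, purely for illustration):

```python
from collections import defaultdict

# In-memory stand-in for Kafka topics: topic name -> ordered message list.
topics = defaultdict(list)

def produce(topic, message):
    topics[topic].append(message)

def consume(topic, offset):
    """Read everything after `offset`; return (messages, new offset)."""
    messages = topics[topic][offset:]
    return messages, offset + len(messages)

# Producer writes raw click events to the "clicks" topic.
produce("clicks", {"page": "/home", "ms": 42})
produce("clicks", {"page": "/checkout", "ms": 950})

# Stream processor: consume "clicks", keep only slow page loads,
# and publish the filtered stream to a second topic.
offset = 0
events, offset = consume("clicks", offset)
for e in events:
    if e["ms"] > 500:
        produce("slow-clicks", e)

print(topics["slow-clicks"])  # [{'page': '/checkout', 'ms': 950}]
```

The key shape is consume → transform → produce into another topic; real stream processors (Kafka Streams, Spark, etc.) add partitioning, fault tolerance, and state on top of this loop.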
Publish Subscribe
Broker
Stores messages
Failover: Leader vs. Follower
Load balanced
Producer
Publish data/messages to the topic
Consumer
Applications/processes/threads that are subscribed to the topic
Can be grouped (consumer groups) in order to process messages in parallel
Multiple consumer instances can load balance reading the partitions of a topic
Consumer groups are elastic and fault tolerant
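The load-balancing idea — each partition of a topic is read by exactly one consumer in the group — can be sketched as a simple round-robin assignment (a simplified stand-in for Kafka’s pluggable assignment strategies):

```python
def assign_partitions(partitions, consumers):
    """Round-robin assignment of topic partitions to the consumers in a
    group: each partition is owned by exactly one consumer, and partitions
    are spread evenly, so reads are load balanced."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# 6 partitions shared by 2 consumers -> 3 partitions each.
print(assign_partitions(list(range(6)), ["c1", "c2"]))
# {'c1': [0, 2, 4], 'c2': [1, 3, 5]}
```

Elasticity falls out of this: when a consumer joins or leaves the group, Kafka just re-runs the assignment over the surviving members.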
Topic
Distributed and partitioned message queue
Topics are partitioned so they can scale across multiple servers
Partitions are also replicated for fault tolerance
This is what the producers actually write to and what the consumers actually read
Scales the Kafka brokers
High performance: Log
Kafka operates only on logs – they are always append-only logs, and messages are always read sequentially
Kafka does not track per-message read state – it doesn’t need to, because access is sequential
Retention based on policy – either time based or size based
Not keeping per message state
Multiple consumers reading from the same log means that multiple consumers can do what they need to do (they know where they left off, Kafka doesn’t need to). This is like DCP in Couchbase
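The design above can be sketched in a few lines: the broker is just an append-only list, and each consumer keeps its own offset (the offsets and messages here are illustrative):

```python
class Log:
    """Append-only log: the broker stores messages in order; it keeps no
    per-message read state. Each consumer tracks its own read offset."""
    def __init__(self):
        self.messages = []

    def append(self, msg):
        self.messages.append(msg)
        return len(self.messages) - 1  # offset of the new message

    def read(self, offset):
        """Sequential read of everything at or after `offset`."""
        return self.messages[offset:]

log = Log()
for m in ["a", "b", "c"]:
    log.append(m)

# Two independent consumers at different positions in the same log:
fast_offset, slow_offset = 3, 1
print(log.read(slow_offset))  # ['b', 'c'] -- the slow consumer catches up
print(log.read(fast_offset))  # []         -- the fast consumer is caught up
```

Because consumers own their offsets, a slow consumer never blocks a fast one, and retention can be a simple policy (drop messages older than N days, or beyond N bytes) rather than bookkeeping per message.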
ZooKeeper – distributed synchronization and configuration store
Needed to partition topics and to support consumer groups (where multiple consumers work together to process/ingest a Kafka topic in parallel)
Multiple data models
N1QL - SQL-like query language
Multiple indexes
SDKs, ODBC / JDBC drivers and frameworks
Push-button scalability
Consistent high-performance
Always on 24x7 with HA/DR
Easy Administration with Web UI, Rest API and CLI
KEY POINT: COUCHBASE HAS YOU COVERED FOR YOUR GENERAL PURPOSE DB NEEDS. FROM CACHING TO KV STORE, TO JSON DOCUMENT STORE, TO MOBILE APPS. NO OTHER NOSQL DB VENDOR HAS THIS BREADTH AND DEPTH OF TECHNOLOGY
The purpose of this slide is to discuss the high level concepts of Couchbase, and if the SE wants to discuss what parts of Couchbase make up each concept. It is not to go over specific technologies like N1QL, ODBC, etc
KEY POINT: YOU HAVE THE OPTION TO REPRESENT DATA QUITE DIFFERENTLY USING JSON AS OPPOSED TO A RELATIONAL DATABASE.
- Where in relational databases you might have to have multiple tables to best represent your data, in JSON you can model your data like an object might already be in your programming language of choice. No ORM (Object-Relational Mapping) needed.
You can do relationships in Couchbase, but they are different than in a relational database and outside of the scope of an intro call normally.
Make sure to stress that normalization is still something that can be done in Couchbase where it makes sense for the application, but this diagram is something that helps people coming from relational understand what is possible for JSON.
Work people do in these systems -
Training ML models
ETL / Data wrangling
Aggregations
Reporting / BI
Kafka is a data multiplexer – some people are still going to want to do this, but it’s designed for higher-latency applications with known high complexity (e.g. eBay – many different consumers for the same information)
Traditional data warehouse – definitely will be a different programming language – how do you make sense of the data feed? You get into the problem that making changes on one side introduces tons of complexity on the other
Downsides – maturity is not 100% on the Spark side, still in active development on the Couchbase side
KV / N1QL
KEY POINT: ENTERPRISES ARE USING COUCHBASE ACROSS A RANGE OF MISSION CRITICAL USE CASES.
As the slide shows, Couchbase supports a wide range of use cases, from Profile Management to HA Cache.
Each use case has its own set of requirements – some need very high performance, some need very high availability, some need flexibility of the data model.
The ability to meet all of these requirements is what has driven adoption of Couchbase by large enterprise companies
You should memorize a few things about a customer use per case so you can quickly go through these. What you want is a sound bite per use case.
1. All your data is managed in Couchbase and the other systems record these changes – for example, a user’s purchase might be logged
2. A user’s web session is being stored in a Couchbase bucket and you want to react on it – for example – delete the session in another system like people do in Single Sign On
Couchbase can handle hundreds of thousands of operations per second
3. Real Time data integration
For example, you want to do a quick check on purchases to see if there’s anything suspicious about them – that may be done in another system
4. In this case, it’s important to note that Couchbase can be a Kafka Consumer or a Kafka Producer, so doing tasks like ML – flow data out, train models and flow data back into Couchbase. This is similar to number 3, but the difference is you’re loading something back into Couchbase so that users can quickly interact with it. You may have systems that build recommendations but then flow those back into Couchbase so that the next visitors get a slightly better mix of product offers
Write data to a topic, process it with a framework and load it back into another separate bucket to serve users
Skip if short of time – don’t need to cover anything besides DCP
Punchline is, this mechanism allows Couchbase to scale elastically and without downtime while still enabling any client to find exactly where the active copy of a piece of data is (using the cluster map)
Multiple buckets can exist within a single cluster of nodes
(1, 2 or 3 extra copies)
Each data set has 1024 Virtual Buckets (vBuckets)
Each vBucket contains 1/1024th portion of the data set
vBuckets do not have a fixed physical server location
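A minimal sketch of key-to-vBucket mapping: Couchbase clients hash the document key with a CRC32-based function into one of the 1024 vBuckets, then look the vBucket up in the cluster map to find the node holding the active copy. The exact bit-twiddling and the node names below are simplified/hypothetical:

```python
import zlib

NUM_VBUCKETS = 1024  # Couchbase uses 1024 vBuckets per bucket

def vbucket_for_key(key: str) -> int:
    """Map a document key to a vBucket id. Real clients use a CRC32-based
    hash like this; the exact formula is simplified here."""
    return zlib.crc32(key.encode()) % NUM_VBUCKETS

# Hypothetical cluster map: vBucket id -> node owning the active copy.
# Rebalancing just reassigns vBuckets to nodes and publishes a new map.
cluster_map = {vb: f"node{vb % 4}" for vb in range(NUM_VBUCKETS)}

vb = vbucket_for_key("user::1234")
print(vb, cluster_map[vb])  # every client computes the same vBucket
```

This is why the cluster can rebalance without downtime: keys never move between vBuckets, only vBuckets move between nodes, and clients just pick up the new cluster map.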
What can possibly go wrong if you write your own connector with DCP?
A lot –
First of all, you need to be able to drink from the firehose.
Couchbase can emit hundreds of thousands of messages per second – the Kafka brokers sit there and soak up those messages and can write them out the other end at whatever speed your consuming systems are capable of
DCP is written for memory to memory type replication – if you’re writing to a system that can’t keep up, the client needs to do some fancy footwork to make everything come out ok
This just does the work of creating some messages – for demo purposes, I can type things in here and see them show up as documents in Couchbase
The keys are random – and limited to a set of 10
DCP streams mutations, so sometimes we are going to overwrite existing docs and sometimes we will end up making new docs
Overwrites are captured as sequence numbers (when combined with a document key, you have version information) similar to the offset in Kafka
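The sequence-number idea can be sketched as follows (a toy model, not the Couchbase implementation): each mutation in a vBucket gets a monotonically increasing sequence number, so (key, seqno) identifies a document version, much like a partition offset in Kafka:

```python
import itertools

class VBucket:
    """Toy model of DCP sequencing: every mutation in a vBucket gets the
    next sequence number, so (key, seqno) is version information."""
    def __init__(self):
        self._seq = itertools.count(1)
        self.latest = {}  # key -> (value, seqno) of the newest mutation

    def mutate(self, key, value):
        seqno = next(self._seq)
        self.latest[key] = (value, seqno)
        return seqno

vb = VBucket()
vb.mutate("doc1", "v1")     # seqno 1
vb.mutate("doc2", "hello")  # seqno 2
vb.mutate("doc1", "v2")     # seqno 3 -- an overwrite gets a newer seqno
print(vb.latest["doc1"])    # ('v2', 3)
```

A DCP client (like the Kafka connector) can remember the last seqno it saw per vBucket and resume the stream from there, just as a Kafka consumer resumes from its last offset.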
Producer is going to grab the docs and send them to Kafka
This filter that we’re using prints out the dcpEvent to the console so we can read it but otherwise does no filtering
You can add logic to the filter to mark events false, in which case they won’t be written to Kafka
Finally, this is attaching to my Kafka vagrant image on port 9092 and subscribing to the topic default, partition 0 (we only have one partition)
Fully transparent cluster and bucket management, including direct access if needed