SlideShare une entreprise Scribd logo
1  sur  63
1© Cloudera, Inc. All rights reserved.
Decoupling Decisions
with Apache Kafka
August, 2016
2© Cloudera, Inc. All rights reserved.
About Me
linkedin.com/in/granthenke
github.com/granthenke
@gchenke
grant@cloudera.com
• Cloudera Kafka Software Engineer
• Distributed Systems Enthusiast
• Father to a 15 month old
3© Cloudera, Inc. All rights reserved.
Agenda
Introduction
Terminology
Guarantees
What? Command Line
Configurations
Choosing Partitions
Apache Kafka Decoupling Decisions Getting Started
4© Cloudera, Inc. All rights reserved.
Apache Kafka
A brief overview
5© Cloudera, Inc. All rights reserved.
What Is Kafka?
Kafka provides the functionality of a
messaging system, but with a unique design.
6© Cloudera, Inc. All rights reserved.
What Is Kafka?
Kafka is a distributed, partitioned, replicated
commit log service.
7© Cloudera, Inc. All rights reserved.
What Is Kafka?
Kafka is Fast:
A single Kafka broker can handle hundreds of
megabytes of reads and writes per second
from thousands of clients.
8© Cloudera, Inc. All rights reserved.
What Is Kafka?
Kafka is Scalable:
Kafka is designed to allow a single cluster to
serve as the central data backbone for a
large organization.
9© Cloudera, Inc. All rights reserved.
What Is Kafka?
Kafka is Scalable:
Kafka can be expanded without downtime.
10© Cloudera, Inc. All rights reserved.
What Is Kafka?
Kafka is Durable:
Messages are persisted and replicated within
the cluster to prevent data loss.
11© Cloudera, Inc. All rights reserved.
What Is Kafka?
Kafka is Durable:
Each broker can handle terabytes of
messages without performance impact.
12© Cloudera, Inc. All rights reserved.
• Messages are organized into topics
• Topics are broken into partitions
• Partitions are replicated across the
brokers as replicas
• Kafka runs in a cluster. Nodes are
called brokers
• Producers push messages
• Consumers pull messages
The Basics
13© Cloudera, Inc. All rights reserved.
Replicas
• A partition has 1 leader replica. The
others are followers.
• Followers are considered in-sync when:
• The replica is alive
• The replica is not “too far” behind the
leader (configurable)
• The group of in-sync replicas for a
partition is called the ISR (In-Sync
Replicas)
• Replicas map to physical locations on a
broker
Messages
• Optionally be keyed in order to map to a
static partition
• Used if ordering within a partition is
needed
• Avoid otherwise (extra complexity,
skew, etc.)
• Location of a message is denoted by its
topic, partition & offset
• A partitions offset increases as
messages are appended
Beyond Basics…
14© Cloudera, Inc. All rights reserved.
Kafka Guarantees
15© Cloudera, Inc. All rights reserved.
Kafka Guarantees
WARNING: Guarantees can vary based on your
configuration choices.
16© Cloudera, Inc. All rights reserved.
• Messages sent to each partition will
be appended to the log in the order
they are sent
• Messages read from each partition
will be seen in the order stored in the
log
Kafka Guarantees: Message Ordering
17© Cloudera, Inc. All rights reserved.
Kafka Guarantees: Message Delivery
• At-least-once: Messages are never lost but may be redelivered
• Duplicates can be minimized but not totally eliminated
• Generally only get duplicates during failure or rebalance scenarios
• It’s a good practice to build pipelines with duplicates in mind regardless
18© Cloudera, Inc. All rights reserved.
Kafka Guarantees: Message Safety
• Messages written to Kafka are durable and safe
• Once a published message is committed it will not be lost as long as one broker
that replicates the partition to which this message was written remains "alive”
• Only committed messages are ever given out to the consumer. This means that
the consumer need not worry about potentially seeing a message that could be
lost if the leader fails.
19© Cloudera, Inc. All rights reserved.
Decoupling Decisions
Flexible from the beginning
20© Cloudera, Inc. All rights reserved.
• Data pipelines start simple
• One or two data sources
• One backend application
Initial Decisions:
• How can I be successful quickly?
• What does this specific pipeline
need?
• Don’t prematurely optimize
How It Starts
Client Backend
21© Cloudera, Inc. All rights reserved.
• Multiple sources
• Another backend application
• Initial decisions need to change
Then Quickly…
Source Batch Backend
Source
Source
Source
Streaming
Backend
22© Cloudera, Inc. All rights reserved.
• More environments
• Backend applications feed other
backend applications
• You may also want to
• Experiment with new software
• Change data formats
• Move to a streaming architecture
And Eventually…
Source Batch Backend
Source
Source
Source
Streaming
Backend
Cloud Backend
QA Backend
23© Cloudera, Inc. All rights reserved.
• Early decisions made for that single
pipeline have impacted each system
added
• Because sources and applications are
tightly coupled change is difficult
• Progress becomes slower and slower
• The system has grown fragile
• Experimentation, growth, and
innovation is risky
Technical Debt
24© Cloudera, Inc. All rights reserved.
Decision Types: Type 1 decisions
“Some decisions are consequential and
irreversible or nearly irreversible – one-way
doors – and these decisions must be made
methodically, carefully, slowly, with great
deliberation and consultation…” —Jeff Bezos
25© Cloudera, Inc. All rights reserved.
Decision Types: Type 2 Decisions
“Type 2 decisions are changeable, reversible
– they’re two-way doors. If you’ve made a
suboptimal Type 2 decision, you don’t have
to live with the consequences for that
long.”—Jeff Bezos
26© Cloudera, Inc. All rights reserved.
Kafka Is Here To Help!
27© Cloudera, Inc. All rights reserved.
• A central backbone for the entire
system
• Decouples source and backend
systems
• Slow or failing consumers don’t
impact source system
• Adding new sources or consumers is
easy and low impact
With Kafka
Source
Batch
Backend
Source
Source
Source
Streaming
Backend
Cloud
Backend
QA
Backend
Kafka
28© Cloudera, Inc. All rights reserved.
Lets Make Some Changes
29© Cloudera, Inc. All rights reserved.
• Add new source or backend
• Process more data
• Move from batch to streaming
• Change data source
The Really Easy Changes
Batch
Backend
Batch
Backend
Source
Source
Kafka
Source
Batch
Backend
Source
Source
Old Source
Streaming
Backend
Cloud
Backend
QA
Backend
Kafka
Kafka
New Source
30© Cloudera, Inc. All rights reserved.
• I would like to support avro (or thrift,
protobuf, xml, json, …)
• Keep source data raw
• In a streaming application transform
formats
• Read from source-topic and produce
to source-topic-{format}
• This could also include
lossy/optimization transformations
Change Data Format
Source
Batch
Backend
Source
Source
Source
Streaming
Backend
Cloud
Backend
QA
Backend
Kafka
Format
Conversion App
31© Cloudera, Inc. All rights reserved.
• Deploy new application and replay
the stream
• Great for testing and development
• Extremely useful for handling failures
and recovery too
Change Business Logic
32© Cloudera, Inc. All rights reserved.
• There are well written clients in a lot
of programming languages
• In the rare case your language of
choice doesn’t have a client, you can
use the binary wire protocol and
write one
Change Application Language
33© Cloudera, Inc. All rights reserved.
• Many processing frameworks get
Kafka integration early on
• Because consumers don’t affect
source applications its safe to
experiment
Change Processing Framework
Streams
34© Cloudera, Inc. All rights reserved.
35© Cloudera, Inc. All rights reserved.
Quick Start
Sounds great...but how do I use it?
36© Cloudera, Inc. All rights reserved.
Let’s Keep it Simple
37© Cloudera, Inc. All rights reserved.
Install Kafka
38© Cloudera, Inc. All rights reserved.
Start with the CLI tools
39© Cloudera, Inc. All rights reserved.
# Create a topic & describe
kafka-topics --zookeeper my-zk-host:2181 --create --topic my-topic --partitions 10 -
-replication-factor 3
kafka-topics --zookeeper my-zk-host:2181 --describe --topic my-topic
# Produce in one shell
vmstat -w -n -t 1 | kafka-console-producer --broker-list my-broker-host:9092 --
topic my-topic
# Consume in a separate shell
kafka-console-consumer --zookeeper my-zk-host:2181 --topic my-topic
40© Cloudera, Inc. All rights reserved.
# Create a topic & describe
kafka-topics --zookeeper my-zk-host:2181 --create --topic my-topic --partitions 10
--replication-factor 3
kafka-topics --zookeeper my-zk-host:2181 --describe --topic my-topic
# Produce in one shell
vmstat -w -n -t 1 | kafka-console-producer --broker-list my-broker-host:9092 --
topic my-topic
# Consume in a separate shell
kafka-console-consumer --zookeeper my-zk-host:2181 --topic my-topic
41© Cloudera, Inc. All rights reserved.
# Create a topic & describe
kafka-topics --zookeeper my-zk-host:2181 --create --topic my-topic --partitions 10 -
-replication-factor 3
kafka-topics --zookeeper my-zk-host:2181 --describe --topic my-topic
# Produce in one shell
vmstat -w -n -t 1 | kafka-console-producer --broker-list my-broker-host:9092 --
topic my-topic
# Consume in a separate shell
kafka-console-consumer --zookeeper my-zk-host:2181 --topic my-topic
42© Cloudera, Inc. All rights reserved.
# Create a topic & describe
kafka-topics --zookeeper my-zk-host:2181 --create --topic my-topic --partitions 10 -
-replication-factor 3
kafka-topics --zookeeper my-zk-host:2181 --describe --topic my-topic
# Produce in one shell
vmstat -w -n -t 1 | kafka-console-producer --broker-list my-broker-host:9092 --
topic my-topic
# Consume in a separate shell
kafka-console-consumer --zookeeper my-zk-host:2181 --topic my-topic
43© Cloudera, Inc. All rights reserved.
Kafka Configuration
A starting point
44© Cloudera, Inc. All rights reserved.
• Tune for throughput or
safety
• At least once or at most
once
• Per topic overrides and
client overrides
Flexible Configuration
45© Cloudera, Inc. All rights reserved.
Broker Configuration
• 3 or more Brokers
• broker_max_heap_size=8GiB
• zookeeper.chroot=kafka
• auto.create.topics.enable=false
• If you must use it make sure you set
• num.partitions >= #OfBrokers
• default.replication.factor=3
• min.insync.replicas=2
• unclean.leader.election=false (default)
46© Cloudera, Inc. All rights reserved.
Producer Configuration
• Use the new Java Producer
• acks=all
• retries=Integer.MAX_VALUE
• max.block.ms=Long.MAX_VALUE
• max.in.flight.requests.per.connection=1
• linger.ms=1000
• compression.type=snappy
47© Cloudera, Inc. All rights reserved.
Consumer Configuration
• Use the new Java Consumer
• group.id represents the “Coordinated Application”
• Consumers within the group share the load
• auto.offset.reset = latest/earliest/none
• enable.auto.commit=false
48© Cloudera, Inc. All rights reserved.
Choosing Partition Counts: Quick Pick
• Just getting started, don’t overthink it
• Don’t make the mistake of picking (1 partition)
• Don’t pick way too many (1000 partitions)
• Often a handwave choice of 25 to 100 partitions is a good
start
• Tune when you can understand your data and use case
better
49© Cloudera, Inc. All rights reserved.
Make something
Getting started is the
hardest part
What’s Next?
50© Cloudera, Inc. All rights reserved.
Thank you
51© Cloudera, Inc. All rights reserved.
Common Questions
52© Cloudera, Inc. All rights reserved.
How do I size broker hardware?
Brokers
• Similar profile to data nodes
• Depends on what’s important
• Message Retention = Disk Size
• Client Throughput = Network Capacity
• Producer Throughput = Disk I/O
• Consumer Throughput = Memory
53© Cloudera, Inc. All rights reserved.
• Brokers: 3->15 per Cluster
• Common to start with 3-5
• Very large are around 30-40 nodes
• Having many clusters is common
• Topics: 1->100s per Cluster
• Partitions: 1->1000s per Topic
• Clusters with up to 10k total
partitions are workable. Beyond
that we don't aggressively test. [src]
• Consumer Groups: 1->100s active per
Cluster
• Could Consume 1 to all topics
Kafka Cardinality—What is large?
54© Cloudera, Inc. All rights reserved.
• Kafka is not designed for very large
messages
• Optimal performance ~10KB
• Could consider breaking up the
messages/files into smaller chunks
Large Messages
55© Cloudera, Inc. All rights reserved.
Should I use Raid 10 or JBOD?
RAID10
• Can survive single disk failure
• Single log directory
• Lower total I/O
JBOD
• Single disk failure kills broker
• More available disk space
• Higher write throughput
• Broker is not smart about balancing
partitions across disk
56© Cloudera, Inc. All rights reserved.
Do I need a separate Zookeeper for Kafka?
• It’s not required but preferred
• Kafka relies on Zookeeper for cluster
metadata & state
• Correct Zookeeper configuration is most
important
57© Cloudera, Inc. All rights reserved.
Zookeeper Configuration
• ZooKeeper's transaction log must be on a dedicated device (A dedicated
partition is not enough) for optimal performance
• ZooKeeper writes the log sequentially, without seeking
• Set dataLogDir to point to a directory on that device
• Make sure to point dataDir to a directory not residing on that device
• Do not put ZooKeeper in a situation that can cause a swap
• Therefore, make certain that the maximum heap size given to ZooKeeper is
not bigger than the amount of real memory available to ZooKeeper
58© Cloudera, Inc. All rights reserved.
Choosing Partition Counts
• A topic partition is the unit of parallelism in Kafka
• It is easier to increase partitions than it is reduce them
•Especially when using keyed messages
•Consumers are assigned partitions to consume
•They can’t split/share partitions
•Parallelism is bounded by the number of partitions
59© Cloudera, Inc. All rights reserved.
Choosing Partition Counts: Quick Pick
• Just getting started, don’t overthink it
• Don’t make the mistake of picking (1 partition)
• Don’t pick way too many (1000 partitions)
• Often a handwave choice of 25 to 100 partitions is a good
start
• Tune when you can understand your data and use case
better
60© Cloudera, Inc. All rights reserved.
Choosing Partition Counts: Estimation
Given:
• pt = production throughput per partition
• ct = consumption throughput per partition
• tt = total throughput you want to achieve
• pc = the minimum partition count
Then:
• pc >= max(tt/pt, tt/ct)
61© Cloudera, Inc. All rights reserved.
Choosing Partition Counts: Tools
• Kafka includes rudimentary benchmarking tools to help you get a
rough estimate
• kafka-producer-perft-test.sh (kafka.tools.ConsumerPerformance)
• kafka-consumer-perf-test.sh (kafka.tools.ProducerPerformance)
• kafka.tools.EndToEndLatency
• Use with kafka-run-class.sh
• Nothing is more accurate than a real application
• With real/representative data
62© Cloudera, Inc. All rights reserved.
How do I manage Schemas?
• A big topic with enough content for its own talk
• Options
•Schema Registry
•Source Controlled Dependency
•Static "Envelop Schema”
63© Cloudera, Inc. All rights reserved.
Thank you

Contenu connexe

Tendances

Apache Kafka Introduction
Apache Kafka IntroductionApache Kafka Introduction
Apache Kafka IntroductionAmita Mirajkar
 
Kafka Tutorial: Advanced Producers
Kafka Tutorial: Advanced ProducersKafka Tutorial: Advanced Producers
Kafka Tutorial: Advanced ProducersJean-Paul Azar
 
Stream processing using Kafka
Stream processing using KafkaStream processing using Kafka
Stream processing using KafkaKnoldus Inc.
 
Kafka Cluster Federation at Uber (Yupeng Fui & Xiaoman Dong, Uber) Kafka Summ...
Kafka Cluster Federation at Uber (Yupeng Fui & Xiaoman Dong, Uber) Kafka Summ...Kafka Cluster Federation at Uber (Yupeng Fui & Xiaoman Dong, Uber) Kafka Summ...
Kafka Cluster Federation at Uber (Yupeng Fui & Xiaoman Dong, Uber) Kafka Summ...confluent
 
Kafka at Peak Performance
Kafka at Peak PerformanceKafka at Peak Performance
Kafka at Peak PerformanceTodd Palino
 
Introduction to Apache Kafka
Introduction to Apache KafkaIntroduction to Apache Kafka
Introduction to Apache KafkaJeff Holoman
 
Apache Kafka Fundamentals for Architects, Admins and Developers
Apache Kafka Fundamentals for Architects, Admins and DevelopersApache Kafka Fundamentals for Architects, Admins and Developers
Apache Kafka Fundamentals for Architects, Admins and Developersconfluent
 
Apache Kafka: A high-throughput distributed messaging system @ JCConf 2014
Apache Kafka: A high-throughput distributed messaging system @ JCConf 2014Apache Kafka: A high-throughput distributed messaging system @ JCConf 2014
Apache Kafka: A high-throughput distributed messaging system @ JCConf 2014Chen-en Lu
 
A Deep Dive into Kafka Controller
A Deep Dive into Kafka ControllerA Deep Dive into Kafka Controller
A Deep Dive into Kafka Controllerconfluent
 
Sharing is Caring: Toward Creating Self-tuning Multi-tenant Kafka (Anna Povzn...
Sharing is Caring: Toward Creating Self-tuning Multi-tenant Kafka (Anna Povzn...Sharing is Caring: Toward Creating Self-tuning Multi-tenant Kafka (Anna Povzn...
Sharing is Caring: Toward Creating Self-tuning Multi-tenant Kafka (Anna Povzn...HostedbyConfluent
 
Please Upgrade Apache Kafka. Now. (Gwen Shapira, Confluent) Kafka Summit SF 2019
Please Upgrade Apache Kafka. Now. (Gwen Shapira, Confluent) Kafka Summit SF 2019Please Upgrade Apache Kafka. Now. (Gwen Shapira, Confluent) Kafka Summit SF 2019
Please Upgrade Apache Kafka. Now. (Gwen Shapira, Confluent) Kafka Summit SF 2019confluent
 
A Hitchhiker's Guide to Apache Kafka Geo-Replication with Sanjana Kaundinya ...
 A Hitchhiker's Guide to Apache Kafka Geo-Replication with Sanjana Kaundinya ... A Hitchhiker's Guide to Apache Kafka Geo-Replication with Sanjana Kaundinya ...
A Hitchhiker's Guide to Apache Kafka Geo-Replication with Sanjana Kaundinya ...HostedbyConfluent
 
From Message to Cluster: A Realworld Introduction to Kafka Capacity Planning
From Message to Cluster: A Realworld Introduction to Kafka Capacity PlanningFrom Message to Cluster: A Realworld Introduction to Kafka Capacity Planning
From Message to Cluster: A Realworld Introduction to Kafka Capacity Planningconfluent
 
KafkaConsumer - Decoupling Consumption and Processing for Better Resource Uti...
KafkaConsumer - Decoupling Consumption and Processing for Better Resource Uti...KafkaConsumer - Decoupling Consumption and Processing for Better Resource Uti...
KafkaConsumer - Decoupling Consumption and Processing for Better Resource Uti...confluent
 
Introducing Scylla Manager: Cluster Management and Task Automation
Introducing Scylla Manager: Cluster Management and Task AutomationIntroducing Scylla Manager: Cluster Management and Task Automation
Introducing Scylla Manager: Cluster Management and Task AutomationScyllaDB
 
Common issues with Apache Kafka® Producer
Common issues with Apache Kafka® ProducerCommon issues with Apache Kafka® Producer
Common issues with Apache Kafka® Producerconfluent
 
Fundamentals of Apache Kafka
Fundamentals of Apache KafkaFundamentals of Apache Kafka
Fundamentals of Apache KafkaChhavi Parasher
 
Developing Real-Time Data Pipelines with Apache Kafka
Developing Real-Time Data Pipelines with Apache KafkaDeveloping Real-Time Data Pipelines with Apache Kafka
Developing Real-Time Data Pipelines with Apache KafkaJoe Stein
 
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013mumrah
 
Kafka Tutorial - Introduction to Apache Kafka (Part 1)
Kafka Tutorial - Introduction to Apache Kafka (Part 1)Kafka Tutorial - Introduction to Apache Kafka (Part 1)
Kafka Tutorial - Introduction to Apache Kafka (Part 1)Jean-Paul Azar
 

Tendances (20)

Apache Kafka Introduction
Apache Kafka IntroductionApache Kafka Introduction
Apache Kafka Introduction
 
Kafka Tutorial: Advanced Producers
Kafka Tutorial: Advanced ProducersKafka Tutorial: Advanced Producers
Kafka Tutorial: Advanced Producers
 
Stream processing using Kafka
Stream processing using KafkaStream processing using Kafka
Stream processing using Kafka
 
Kafka Cluster Federation at Uber (Yupeng Fui & Xiaoman Dong, Uber) Kafka Summ...
Kafka Cluster Federation at Uber (Yupeng Fui & Xiaoman Dong, Uber) Kafka Summ...Kafka Cluster Federation at Uber (Yupeng Fui & Xiaoman Dong, Uber) Kafka Summ...
Kafka Cluster Federation at Uber (Yupeng Fui & Xiaoman Dong, Uber) Kafka Summ...
 
Kafka at Peak Performance
Kafka at Peak PerformanceKafka at Peak Performance
Kafka at Peak Performance
 
Introduction to Apache Kafka
Introduction to Apache KafkaIntroduction to Apache Kafka
Introduction to Apache Kafka
 
Apache Kafka Fundamentals for Architects, Admins and Developers
Apache Kafka Fundamentals for Architects, Admins and DevelopersApache Kafka Fundamentals for Architects, Admins and Developers
Apache Kafka Fundamentals for Architects, Admins and Developers
 
Apache Kafka: A high-throughput distributed messaging system @ JCConf 2014
Apache Kafka: A high-throughput distributed messaging system @ JCConf 2014Apache Kafka: A high-throughput distributed messaging system @ JCConf 2014
Apache Kafka: A high-throughput distributed messaging system @ JCConf 2014
 
A Deep Dive into Kafka Controller
A Deep Dive into Kafka ControllerA Deep Dive into Kafka Controller
A Deep Dive into Kafka Controller
 
Sharing is Caring: Toward Creating Self-tuning Multi-tenant Kafka (Anna Povzn...
Sharing is Caring: Toward Creating Self-tuning Multi-tenant Kafka (Anna Povzn...Sharing is Caring: Toward Creating Self-tuning Multi-tenant Kafka (Anna Povzn...
Sharing is Caring: Toward Creating Self-tuning Multi-tenant Kafka (Anna Povzn...
 
Please Upgrade Apache Kafka. Now. (Gwen Shapira, Confluent) Kafka Summit SF 2019
Please Upgrade Apache Kafka. Now. (Gwen Shapira, Confluent) Kafka Summit SF 2019Please Upgrade Apache Kafka. Now. (Gwen Shapira, Confluent) Kafka Summit SF 2019
Please Upgrade Apache Kafka. Now. (Gwen Shapira, Confluent) Kafka Summit SF 2019
 
A Hitchhiker's Guide to Apache Kafka Geo-Replication with Sanjana Kaundinya ...
 A Hitchhiker's Guide to Apache Kafka Geo-Replication with Sanjana Kaundinya ... A Hitchhiker's Guide to Apache Kafka Geo-Replication with Sanjana Kaundinya ...
A Hitchhiker's Guide to Apache Kafka Geo-Replication with Sanjana Kaundinya ...
 
From Message to Cluster: A Realworld Introduction to Kafka Capacity Planning
From Message to Cluster: A Realworld Introduction to Kafka Capacity PlanningFrom Message to Cluster: A Realworld Introduction to Kafka Capacity Planning
From Message to Cluster: A Realworld Introduction to Kafka Capacity Planning
 
KafkaConsumer - Decoupling Consumption and Processing for Better Resource Uti...
KafkaConsumer - Decoupling Consumption and Processing for Better Resource Uti...KafkaConsumer - Decoupling Consumption and Processing for Better Resource Uti...
KafkaConsumer - Decoupling Consumption and Processing for Better Resource Uti...
 
Introducing Scylla Manager: Cluster Management and Task Automation
Introducing Scylla Manager: Cluster Management and Task AutomationIntroducing Scylla Manager: Cluster Management and Task Automation
Introducing Scylla Manager: Cluster Management and Task Automation
 
Common issues with Apache Kafka® Producer
Common issues with Apache Kafka® ProducerCommon issues with Apache Kafka® Producer
Common issues with Apache Kafka® Producer
 
Fundamentals of Apache Kafka
Fundamentals of Apache KafkaFundamentals of Apache Kafka
Fundamentals of Apache Kafka
 
Developing Real-Time Data Pipelines with Apache Kafka
Developing Real-Time Data Pipelines with Apache KafkaDeveloping Real-Time Data Pipelines with Apache Kafka
Developing Real-Time Data Pipelines with Apache Kafka
 
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
 
Kafka Tutorial - Introduction to Apache Kafka (Part 1)
Kafka Tutorial - Introduction to Apache Kafka (Part 1)Kafka Tutorial - Introduction to Apache Kafka (Part 1)
Kafka Tutorial - Introduction to Apache Kafka (Part 1)
 

En vedette

Kafka Reliability - When it absolutely, positively has to be there
Kafka Reliability - When it absolutely, positively has to be thereKafka Reliability - When it absolutely, positively has to be there
Kafka Reliability - When it absolutely, positively has to be thereGwen (Chen) Shapira
 
Tuning Kafka for Fun and Profit
Tuning Kafka for Fun and ProfitTuning Kafka for Fun and Profit
Tuning Kafka for Fun and ProfitTodd Palino
 
Optimizing Regulatory Compliance with Big Data
Optimizing Regulatory Compliance with Big DataOptimizing Regulatory Compliance with Big Data
Optimizing Regulatory Compliance with Big DataCloudera, Inc.
 
Kafka Reliability Guarantees ATL Kafka User Group
Kafka Reliability Guarantees ATL Kafka User GroupKafka Reliability Guarantees ATL Kafka User Group
Kafka Reliability Guarantees ATL Kafka User GroupJeff Holoman
 
Building Realtim Data Pipelines with Kafka Connect and Spark Streaming
Building Realtim Data Pipelines with Kafka Connect and Spark StreamingBuilding Realtim Data Pipelines with Kafka Connect and Spark Streaming
Building Realtim Data Pipelines with Kafka Connect and Spark StreamingGuozhang Wang
 
Introduction to Apache Kudu
Introduction to Apache KuduIntroduction to Apache Kudu
Introduction to Apache KuduJeff Holoman
 
Reducing Microservice Complexity with Kafka and Reactive Streams
Reducing Microservice Complexity with Kafka and Reactive StreamsReducing Microservice Complexity with Kafka and Reactive Streams
Reducing Microservice Complexity with Kafka and Reactive Streamsjimriecken
 
Apache Kafka - Messaging System Overview
Apache Kafka - Messaging System OverviewApache Kafka - Messaging System Overview
Apache Kafka - Messaging System OverviewDmitry Tolpeko
 
Gartner Data and Analytics Summit: Bringing Self-Service BI & SQL Analytics ...
 Gartner Data and Analytics Summit: Bringing Self-Service BI & SQL Analytics ... Gartner Data and Analytics Summit: Bringing Self-Service BI & SQL Analytics ...
Gartner Data and Analytics Summit: Bringing Self-Service BI & SQL Analytics ...Cloudera, Inc.
 
Microservices in the Apache Kafka Ecosystem
Microservices in the Apache Kafka EcosystemMicroservices in the Apache Kafka Ecosystem
Microservices in the Apache Kafka Ecosystemconfluent
 

En vedette (12)

Kafka Reliability - When it absolutely, positively has to be there
Kafka Reliability - When it absolutely, positively has to be thereKafka Reliability - When it absolutely, positively has to be there
Kafka Reliability - When it absolutely, positively has to be there
 
intro-kafka
intro-kafkaintro-kafka
intro-kafka
 
Tuning Kafka for Fun and Profit
Tuning Kafka for Fun and ProfitTuning Kafka for Fun and Profit
Tuning Kafka for Fun and Profit
 
Optimizing Regulatory Compliance with Big Data
Optimizing Regulatory Compliance with Big DataOptimizing Regulatory Compliance with Big Data
Optimizing Regulatory Compliance with Big Data
 
Kafka Reliability Guarantees ATL Kafka User Group
Kafka Reliability Guarantees ATL Kafka User GroupKafka Reliability Guarantees ATL Kafka User Group
Kafka Reliability Guarantees ATL Kafka User Group
 
Building Realtim Data Pipelines with Kafka Connect and Spark Streaming
Building Realtim Data Pipelines with Kafka Connect and Spark StreamingBuilding Realtim Data Pipelines with Kafka Connect and Spark Streaming
Building Realtim Data Pipelines with Kafka Connect and Spark Streaming
 
Introduction to Apache Kudu
Introduction to Apache KuduIntroduction to Apache Kudu
Introduction to Apache Kudu
 
Reducing Microservice Complexity with Kafka and Reactive Streams
Reducing Microservice Complexity with Kafka and Reactive StreamsReducing Microservice Complexity with Kafka and Reactive Streams
Reducing Microservice Complexity with Kafka and Reactive Streams
 
Apache Kafka - Messaging System Overview
Apache Kafka - Messaging System OverviewApache Kafka - Messaging System Overview
Apache Kafka - Messaging System Overview
 
Kafka presentation
Kafka presentationKafka presentation
Kafka presentation
 
Gartner Data and Analytics Summit: Bringing Self-Service BI & SQL Analytics ...
 Gartner Data and Analytics Summit: Bringing Self-Service BI & SQL Analytics ... Gartner Data and Analytics Summit: Bringing Self-Service BI & SQL Analytics ...
Gartner Data and Analytics Summit: Bringing Self-Service BI & SQL Analytics ...
 
Microservices in the Apache Kafka Ecosystem
Microservices in the Apache Kafka EcosystemMicroservices in the Apache Kafka Ecosystem
Microservices in the Apache Kafka Ecosystem
 

Similaire à Decoupling Decisions with Apache Kafka

Big Data Day LA 2015 - Introduction to Apache Kafka - The Big Data Message Bu...
Big Data Day LA 2015 - Introduction to Apache Kafka - The Big Data Message Bu...Big Data Day LA 2015 - Introduction to Apache Kafka - The Big Data Message Bu...
Big Data Day LA 2015 - Introduction to Apache Kafka - The Big Data Message Bu...Data Con LA
 
Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive

Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive

Cloudera, Inc.
 
Apache NiFi SDLC Improvements
Apache NiFi SDLC ImprovementsApache NiFi SDLC Improvements
Apache NiFi SDLC ImprovementsBryan Bende
 
The Flink - Apache Bigtop integration
The Flink - Apache Bigtop integrationThe Flink - Apache Bigtop integration
The Flink - Apache Bigtop integrationMárton Balassi
 
Building Effective Near-Real-Time Analytics with Spark Streaming and Kudu
Building Effective Near-Real-Time Analytics with Spark Streaming and KuduBuilding Effective Near-Real-Time Analytics with Spark Streaming and Kudu
Building Effective Near-Real-Time Analytics with Spark Streaming and KuduJeremy Beard
 
Fraud Detection for Israel BigThings Meetup
Fraud Detection  for Israel BigThings MeetupFraud Detection  for Israel BigThings Meetup
Fraud Detection for Israel BigThings MeetupGwen (Chen) Shapira
 
End to End Streaming Architectures
End to End Streaming ArchitecturesEnd to End Streaming Architectures
End to End Streaming ArchitecturesCloudera, Inc.
 
Best Practices For Workflow
Best Practices For WorkflowBest Practices For Workflow
Best Practices For WorkflowTimothy Spann
 
Hello, kafka! (an introduction to apache kafka)
Hello, kafka! (an introduction to apache kafka)Hello, kafka! (an introduction to apache kafka)
Hello, kafka! (an introduction to apache kafka)Timothy Spann
 
Apache Spark Operations
Apache Spark OperationsApache Spark Operations
Apache Spark OperationsCloudera, Inc.
 
Ingest and Stream Processing - What will you choose?
Ingest and Stream Processing - What will you choose?Ingest and Stream Processing - What will you choose?
Ingest and Stream Processing - What will you choose?Pat Patterson
 
Not Your Mother's Kafka - Deep Dive into Confluent Cloud Infrastructure | Gwe...
Not Your Mother's Kafka - Deep Dive into Confluent Cloud Infrastructure | Gwe...Not Your Mother's Kafka - Deep Dive into Confluent Cloud Infrastructure | Gwe...
Not Your Mother's Kafka - Deep Dive into Confluent Cloud Infrastructure | Gwe...HostedbyConfluent
 
Lambda architecture on Spark, Kafka for real-time large scale ML
Lambda architecture on Spark, Kafka for real-time large scale MLLambda architecture on Spark, Kafka for real-time large scale ML
Lambda architecture on Spark, Kafka for real-time large scale MLhuguk
 
Intro to Apache Kafka
Intro to Apache KafkaIntro to Apache Kafka
Intro to Apache KafkaJason Hubbard
 

Similaire à Decoupling Decisions with Apache Kafka (20)

Kafka for DBAs
Kafka for DBAsKafka for DBAs
Kafka for DBAs
 
Big Data Day LA 2015 - Introduction to Apache Kafka - The Big Data Message Bu...
Big Data Day LA 2015 - Introduction to Apache Kafka - The Big Data Message Bu...Big Data Day LA 2015 - Introduction to Apache Kafka - The Big Data Message Bu...
Big Data Day LA 2015 - Introduction to Apache Kafka - The Big Data Message Bu...
 
Apache kafka
Apache kafkaApache kafka
Apache kafka
 
Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive

Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive


 
Apache NiFi SDLC Improvements
Apache NiFi SDLC ImprovementsApache NiFi SDLC Improvements
Apache NiFi SDLC Improvements
 
Apache Kafka - Strakin Technologies Pvt Ltd
Apache Kafka - Strakin Technologies Pvt LtdApache Kafka - Strakin Technologies Pvt Ltd
Apache Kafka - Strakin Technologies Pvt Ltd
 
The Flink - Apache Bigtop integration
The Flink - Apache Bigtop integrationThe Flink - Apache Bigtop integration
The Flink - Apache Bigtop integration
 
Building Effective Near-Real-Time Analytics with Spark Streaming and Kudu
Building Effective Near-Real-Time Analytics with Spark Streaming and KuduBuilding Effective Near-Real-Time Analytics with Spark Streaming and Kudu
Building Effective Near-Real-Time Analytics with Spark Streaming and Kudu
 
Fraud Detection for Israel BigThings Meetup
Fraud Detection  for Israel BigThings MeetupFraud Detection  for Israel BigThings Meetup
Fraud Detection for Israel BigThings Meetup
 
End to End Streaming Architectures
End to End Streaming ArchitecturesEnd to End Streaming Architectures
End to End Streaming Architectures
 
Best Practices For Workflow
Best Practices For WorkflowBest Practices For Workflow
Best Practices For Workflow
 
Hello, kafka! (an introduction to apache kafka)
Hello, kafka! (an introduction to apache kafka)Hello, kafka! (an introduction to apache kafka)
Hello, kafka! (an introduction to apache kafka)
 
Apache Spark Operations
Apache Spark OperationsApache Spark Operations
Apache Spark Operations
 
Ingest and Stream Processing - What will you choose?
Ingest and Stream Processing - What will you choose?Ingest and Stream Processing - What will you choose?
Ingest and Stream Processing - What will you choose?
 
Effective Spark on Multi-Tenant Clusters
Effective Spark on Multi-Tenant ClustersEffective Spark on Multi-Tenant Clusters
Effective Spark on Multi-Tenant Clusters
 
Not Your Mother's Kafka - Deep Dive into Confluent Cloud Infrastructure | Gwe...
Not Your Mother's Kafka - Deep Dive into Confluent Cloud Infrastructure | Gwe...Not Your Mother's Kafka - Deep Dive into Confluent Cloud Infrastructure | Gwe...
Not Your Mother's Kafka - Deep Dive into Confluent Cloud Infrastructure | Gwe...
 
Securing Spark Applications
Securing Spark ApplicationsSecuring Spark Applications
Securing Spark Applications
 
Lambda architecture on Spark, Kafka for real-time large scale ML
Lambda architecture on Spark, Kafka for real-time large scale MLLambda architecture on Spark, Kafka for real-time large scale ML
Lambda architecture on Spark, Kafka for real-time large scale ML
 
Ingest and Stream Processing - What will you choose?
Ingest and Stream Processing - What will you choose?Ingest and Stream Processing - What will you choose?
Ingest and Stream Processing - What will you choose?
 
Intro to Apache Kafka
Intro to Apache KafkaIntro to Apache Kafka
Intro to Apache Kafka
 

Dernier

Innovate and Collaborate- Harnessing the Power of Open Source Software.pdf
Innovate and Collaborate- Harnessing the Power of Open Source Software.pdfInnovate and Collaborate- Harnessing the Power of Open Source Software.pdf
Innovate and Collaborate- Harnessing the Power of Open Source Software.pdfYashikaSharma391629
 
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Natan Silnitsky
 
Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Velvetech LLC
 
Large Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and RepairLarge Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and RepairLionel Briand
 
MYjobs Presentation Django-based project
MYjobs Presentation Django-based projectMYjobs Presentation Django-based project
MYjobs Presentation Django-based projectAnoyGreter
 
Sending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdfSending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdf31events.com
 
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company OdishaBalasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odishasmiwainfosol
 
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsSensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsChristian Birchler
 
VK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web DevelopmentVK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web Developmentvyaparkranti
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfAlina Yurenko
 
Comparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfComparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfDrew Moseley
 
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Matt Ray
 
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...OnePlan Solutions
 
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptxReal-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptxRTS corp
 
Precise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalPrecise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalLionel Briand
 
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...confluent
 
A healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfA healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfMarharyta Nedzelska
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtimeandrehoraa
 
Odoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 EnterpriseOdoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 Enterprisepreethippts
 

Dernier (20)

Innovate and Collaborate- Harnessing the Power of Open Source Software.pdf
Innovate and Collaborate- Harnessing the Power of Open Source Software.pdfInnovate and Collaborate- Harnessing the Power of Open Source Software.pdf
Innovate and Collaborate- Harnessing the Power of Open Source Software.pdf
 
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
 
Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...
 
Large Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and RepairLarge Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and Repair
 
MYjobs Presentation Django-based project
MYjobs Presentation Django-based projectMYjobs Presentation Django-based project
MYjobs Presentation Django-based project
 
Sending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdfSending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdf
 
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company OdishaBalasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
 
Advantages of Odoo ERP 17 for Your Business
Advantages of Odoo ERP 17 for Your BusinessAdvantages of Odoo ERP 17 for Your Business
Advantages of Odoo ERP 17 for Your Business
 
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsSensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
 
VK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web DevelopmentVK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web Development
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
 
Comparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfComparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdf
 
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
 
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
 
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptxReal-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
 
Precise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalPrecise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive Goal
 
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
 
A healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfA healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdf
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtime
 
Odoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 EnterpriseOdoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 Enterprise
 

Decoupling Decisions with Apache Kafka

  • 1. 1© Cloudera, Inc. All rights reserved. Decoupling Decisions with Apache Kafka August, 2016
  • 2. 2© Cloudera, Inc. All rights reserved. About Me linkedin.com/in/granthenke github.com/granthenke @gchenke grant@cloudera.com • Cloudera Kafka Software Engineer • Distributed Systems Enthusiast • Father to a 15 month old
  • 3. 3© Cloudera, Inc. All rights reserved. Agenda Introduction Terminology Guarantees What? Command Line Configurations Choosing Partitions Apache Kafka Decoupling Decisions Getting Started
  • 4. 4© Cloudera, Inc. All rights reserved. Apache Kafka A brief overview
  • 5. 5© Cloudera, Inc. All rights reserved. What Is Kafka? Kafka provides the functionality of a messaging system, but with a unique design.
  • 6. 6© Cloudera, Inc. All rights reserved. What Is Kafka? Kafka is a distributed, partitioned, replicated commit log service.
  • 7. 7© Cloudera, Inc. All rights reserved. What Is Kafka? Kafka is Fast: A single Kafka broker can handle hundreds of megabytes of reads and writes per second from thousands of clients.
  • 8. 8© Cloudera, Inc. All rights reserved. What Is Kafka? Kafka is Scalable: Kafka is designed to allow a single cluster to serve as the central data backbone for a large organization.
  • 9. 9© Cloudera, Inc. All rights reserved. What Is Kafka? Kafka is Scalable: Kafka can be expanded without downtime.
  • 10. 10© Cloudera, Inc. All rights reserved. What Is Kafka? Kafka is Durable: Messages are persisted and replicated within the cluster to prevent data loss.
  • 11. 11© Cloudera, Inc. All rights reserved. What Is Kafka? Kafka is Durable: Each broker can handle terabytes of messages without performance impact.
  • 12. 12© Cloudera, Inc. All rights reserved. • Messages are organized into topics • Topics are broken into partitions • Partitions are replicated across the brokers as replicas • Kafka runs in a cluster. Nodes are called brokers • Producers push messages • Consumers pull messages The Basics
  • 13. 13© Cloudera, Inc. All rights reserved. Replicas • A partition has 1 leader replica. The others are followers. • Followers are considered in-sync when: • The replica is alive • The replica is not “too far” behind the leader (configurable) • The group of in-sync replicas for a partition is called the ISR (In-Sync Replicas) • Replicas map to physical locations on a broker Messages • Optionally be keyed in order to map to a static partition • Used if ordering within a partition is needed • Avoid otherwise (extra complexity, skew, etc.) • Location of a message is denoted by its topic, partition & offset • A partitions offset increases as messages are appended Beyond Basics…
  • 14. 14© Cloudera, Inc. All rights reserved. Kafka Guarantees
  • 15. 15© Cloudera, Inc. All rights reserved. Kafka Guarantees WARNING: Guarantees can vary based on your configuration choices.
  • 16. 16© Cloudera, Inc. All rights reserved. • Messages sent to each partition will be appended to the log in the order they are sent • Messages read from each partition will be seen in the order stored in the log Kafka Guarantees: Message Ordering
  • 17. 17© Cloudera, Inc. All rights reserved. Kafka Guarantees: Message Delivery • At-least-once: Messages are never lost but may be redelivered • Duplicates can be minimized but not totally eliminated • Generally only get duplicates during failure or rebalance scenarios • It’s a good practice to build pipelines with duplicates in mind regardless
  • 18. 18© Cloudera, Inc. All rights reserved. Kafka Guarantees: Message Safety • Messages written to Kafka are durable and safe • Once a published message is committed it will not be lost as long as one broker that replicates the partition to which this message was written remains "alive” • Only committed messages are ever given out to the consumer. This means that the consumer need not worry about potentially seeing a message that could be lost if the leader fails.
  • 19. 19© Cloudera, Inc. All rights reserved. Decoupling Decisions Flexible from the beginning
  • 20. 20© Cloudera, Inc. All rights reserved. • Data pipelines start simple • One or two data sources • One backend application Initial Decisions: • How can I be successful quickly? • What does this specific pipeline need? • Don’t prematurely optimize How It Starts Client Backend
  • 21. 21© Cloudera, Inc. All rights reserved. • Multiple sources • Another backend application • Initial decisions need to change Then Quickly… Source Batch Backend Source Source Source Streaming Backend
  • 22. 22© Cloudera, Inc. All rights reserved. • More environments • Backend applications feed other backend applications • You may also want to • Experiment with new software • Change data formats • Move to a streaming architecture And Eventually… Source Batch Backend Source Source Source Streaming Backend Cloud Backend QA Backend
  • 23. 23© Cloudera, Inc. All rights reserved. • Early decisions made for that single pipeline have impacted each system added • Because sources and applications are tightly coupled change is difficult • Progress becomes slower and slower • The system has grown fragile • Experimentation, growth, and innovation is risky Technical Debt
  • 24. 24© Cloudera, Inc. All rights reserved. Decision Types: Type 1 decisions “Some decisions are consequential and irreversible or nearly irreversible – one-way doors – and these decisions must be made methodically, carefully, slowly, with great deliberation and consultation…” —Jeff Bezos
  • 25. 25© Cloudera, Inc. All rights reserved. Decision Types: Type 2 Decisions “Type 2 decisions are changeable, reversible – they’re two-way doors. If you’ve made a suboptimal Type 2 decision, you don’t have to live with the consequences for that long.”—Jeff Bezos
  • 26. 26© Cloudera, Inc. All rights reserved. Kafka Is Here To Help!
  • 27. 27© Cloudera, Inc. All rights reserved. • A central backbone for the entire system • Decouples source and backend systems • Slow or failing consumers don’t impact source system • Adding new sources or consumers is easy and low impact With Kafka Source Batch Backend Source Source Source Streaming Backend Cloud Backend QA Backend Kafka
  • 28. 28© Cloudera, Inc. All rights reserved. Lets Make Some Changes
  • 29. 29© Cloudera, Inc. All rights reserved. • Add new source or backend • Process more data • Move from batch to streaming • Change data source The Really Easy Changes Batch Backend Batch Backend Source Source Kafka Source Batch Backend Source Source Old Source Streaming Backend Cloud Backend QA Backend Kafka Kafka New Source
  • 30. 30© Cloudera, Inc. All rights reserved. • I would like to support avro (or thrift, protobuf, xml, json, …) • Keep source data raw • In a streaming application transform formats • Read from source-topic and produce to source-topic-{format} • This could also include lossy/optimization transformations Change Data Format Source Batch Backend Source Source Source Streaming Backend Cloud Backend QA Backend Kafka Format Conversion App
  • 31. 31© Cloudera, Inc. All rights reserved. • Deploy new application and replay the stream • Great for testing and development • Extremely useful for handling failures and recovery too Change Business Logic
  • 32. 32© Cloudera, Inc. All rights reserved. • There are well written clients in a lot of programming languages • In the rare case your language of choice doesn’t have a client, you can use the binary wire protocol and write one Change Application Language
  • 33. 33© Cloudera, Inc. All rights reserved. • Many processing frameworks get Kafka integration early on • Because consumers don’t affect source applications its safe to experiment Change Processing Framework Streams
  • 34. 34© Cloudera, Inc. All rights reserved.
  • 35. 35© Cloudera, Inc. All rights reserved. Quick Start Sounds great...but how do I use it?
  • 36. 36© Cloudera, Inc. All rights reserved. Let’s Keep it Simple
  • 37. 37© Cloudera, Inc. All rights reserved. Install Kafka
  • 38. 38© Cloudera, Inc. All rights reserved. Start with the CLI tools
  • 39. 39© Cloudera, Inc. All rights reserved. # Create a topic & describe kafka-topics --zookeeper my-zk-host:2181 --create --topic my-topic --partitions 10 - -replication-factor 3 kafka-topics --zookeeper my-zk-host:2181 --describe --topic my-topic # Produce in one shell vmstat -w -n -t 1 | kafka-console-producer --broker-list my-broker-host:9092 -- topic my-topic # Consume in a separate shell kafka-console-consumer --zookeeper my-zk-host:2181 --topic my-topic
  • 40. 40© Cloudera, Inc. All rights reserved. # Create a topic & describe kafka-topics --zookeeper my-zk-host:2181 --create --topic my-topic --partitions 10 --replication-factor 3 kafka-topics --zookeeper my-zk-host:2181 --describe --topic my-topic # Produce in one shell vmstat -w -n -t 1 | kafka-console-producer --broker-list my-broker-host:9092 -- topic my-topic # Consume in a separate shell kafka-console-consumer --zookeeper my-zk-host:2181 --topic my-topic
  • 41. 41© Cloudera, Inc. All rights reserved. # Create a topic & describe kafka-topics --zookeeper my-zk-host:2181 --create --topic my-topic --partitions 10 - -replication-factor 3 kafka-topics --zookeeper my-zk-host:2181 --describe --topic my-topic # Produce in one shell vmstat -w -n -t 1 | kafka-console-producer --broker-list my-broker-host:9092 -- topic my-topic # Consume in a separate shell kafka-console-consumer --zookeeper my-zk-host:2181 --topic my-topic
  • 42. 42© Cloudera, Inc. All rights reserved. # Create a topic & describe kafka-topics --zookeeper my-zk-host:2181 --create --topic my-topic --partitions 10 - -replication-factor 3 kafka-topics --zookeeper my-zk-host:2181 --describe --topic my-topic # Produce in one shell vmstat -w -n -t 1 | kafka-console-producer --broker-list my-broker-host:9092 -- topic my-topic # Consume in a separate shell kafka-console-consumer --zookeeper my-zk-host:2181 --topic my-topic
  • 43. 43© Cloudera, Inc. All rights reserved. Kafka Configuration A starting point
  • 44. 44© Cloudera, Inc. All rights reserved. • Tune for throughput or safety • At least once or at most once • Per topic overrides and client overrides Flexible Configuration
  • 45. 45© Cloudera, Inc. All rights reserved. Broker Configuration • 3 or more Brokers • broker_max_heap_size=8GiB • zookeeper.chroot=kafka • auto.create.topics.enable=false • If you must use it make sure you set • num.partitions >= #OfBrokers • default.replication.factor=3 • min.insync.replicas=2 • unclean.leader.election=false (default)
  • 46. 46© Cloudera, Inc. All rights reserved. Producer Configuration • Use the new Java Producer • acks=all • retries=Integer.MAX_VALUE • max.block.ms=Long.MAX_VALUE • max.in.flight.requests.per.connection=1 • linger.ms=1000 • compression.type=snappy
  • 47. 47© Cloudera, Inc. All rights reserved. Consumer Configuration • Use the new Java Consumer • group.id represents the “Coordinated Application” • Consumers within the group share the load • auto.offset.reset = latest/earliest/none • enable.auto.commit=false
  • 48. 48© Cloudera, Inc. All rights reserved. Choosing Partition Counts: Quick Pick • Just getting started, don’t overthink it • Don’t make the mistake of picking (1 partition) • Don’t pick way too many (1000 partitions) • Often a handwave choice of 25 to 100 partitions is a good start • Tune when you can understand your data and use case better
  • 49. 49© Cloudera, Inc. All rights reserved. Make something Getting started is the hardest part What’s Next?
  • 50. 50© Cloudera, Inc. All rights reserved. Thank you
  • 51. 51© Cloudera, Inc. All rights reserved. Common Questions
  • 52. 52© Cloudera, Inc. All rights reserved. How do I size broker hardware? Brokers • Similar profile to data nodes • Depends on what’s important • Message Retention = Disk Size • Client Throughput = Network Capacity • Producer Throughput = Disk I/O • Consumer Throughput = Memory
  • 53. 53© Cloudera, Inc. All rights reserved. • Brokers: 3->15 per Cluster • Common to start with 3-5 • Very large are around 30-40 nodes • Having many clusters is common • Topics: 1->100s per Cluster • Partitions: 1->1000s per Topic • Clusters with up to 10k total partitions are workable. Beyond that we don't aggressively test. [src] • Consumer Groups: 1->100s active per Cluster • Could Consume 1 to all topics Kafka Cardinality—What is large?
  • 54. 54© Cloudera, Inc. All rights reserved. • Kafka is not designed for very large messages • Optimal performance ~10KB • Could consider breaking up the messages/files into smaller chunks Large Messages
  • 55. 55© Cloudera, Inc. All rights reserved. Should I use Raid 10 or JBOD? RAID10 • Can survive single disk failure • Single log directory • Lower total I/O JBOD • Single disk failure kills broker • More available disk space • Higher write throughput • Broker is not smart about balancing partitions across disk
  • 56. 56© Cloudera, Inc. All rights reserved. Do I need a separate Zookeeper for Kafka? • It’s not required but preferred • Kafka relies on Zookeeper for cluster metadata & state • Correct Zookeeper configuration is most important
  • 57. 57© Cloudera, Inc. All rights reserved. Zookeeper Configuration • ZooKeeper's transaction log must be on a dedicated device (A dedicated partition is not enough) for optimal performance • ZooKeeper writes the log sequentially, without seeking • Set dataLogDir to point to a directory on that device • Make sure to point dataDir to a directory not residing on that device • Do not put ZooKeeper in a situation that can cause a swap • Therefore, make certain that the maximum heap size given to ZooKeeper is not bigger than the amount of real memory available to ZooKeeper
  • 58. 58© Cloudera, Inc. All rights reserved. Choosing Partition Counts • A topic partition is the unit of parallelism in Kafka • It is easier to increase partitions than it is reduce them •Especially when using keyed messages •Consumers are assigned partitions to consume •They can’t split/share partitions •Parallelism is bounded by the number of partitions
  • 59. 59© Cloudera, Inc. All rights reserved. Choosing Partition Counts: Quick Pick • Just getting started, don’t overthink it • Don’t make the mistake of picking (1 partition) • Don’t pick way too many (1000 partitions) • Often a handwave choice of 25 to 100 partitions is a good start • Tune when you can understand your data and use case better
  • 60. 60© Cloudera, Inc. All rights reserved. Choosing Partition Counts: Estimation Given: • pt = production throughput per partition • ct = consumption throughput per partition • tt = total throughput you want to achieve • pc = the minimum partition count Then: • pc >= max(tt/pt, tt/ct)
  • 61. 61© Cloudera, Inc. All rights reserved. Choosing Partition Counts: Tools • Kafka includes rudimentary benchmarking tools to help you get a rough estimate • kafka-producer-perft-test.sh (kafka.tools.ConsumerPerformance) • kafka-consumer-perf-test.sh (kafka.tools.ProducerPerformance) • kafka.tools.EndToEndLatency • Use with kafka-run-class.sh • Nothing is more accurate than a real application • With real/representative data
  • 62. 62© Cloudera, Inc. All rights reserved. How do I manage Schemas? • A big topic with enough content for its own talk • Options •Schema Registry •Source Controlled Dependency •Static "Envelop Schema”
  • 63. 63© Cloudera, Inc. All rights reserved. Thank you