After a quick overview and introduction of Apache Kafka, this session covers two components that extend the core of Apache Kafka: Kafka Connect and Kafka Streams/KSQL.
Kafka Connect's role is to access data from the outside world and make it available inside Kafka by publishing it to a Kafka topic. Kafka Connect is also responsible for transporting information from inside Kafka to the outside world, which could be a database or a file system. Many connectors for different source and target systems are available out of the box, provided by the community, by Confluent or by other vendors. You simply configure these connectors and off you go.
Kafka Streams is a lightweight component that extends Kafka with stream-processing functionality. With it, Kafka can not only transport events and messages reliably and scalably through the Kafka broker but also analyse and process these events in real time. Interestingly, Kafka Streams does not provide its own cluster infrastructure, and it is not meant to run on a Kafka cluster either. The idea is to run Kafka Streams where it makes sense: inside a “normal” Java application, inside a web container, or on a more modern containerized (cloud) infrastructure such as Mesos, Kubernetes or Docker. Kafka Streams has many interesting features, such as reliable state handling and queryable state. KSQL is a streaming engine for Apache Kafka that provides a simple and completely interactive SQL interface for processing data in Kafka.
Kafka Connect & Kafka Streams/KSQL - the ecosystem around Kafka
1. BASEL BERN BRUGG DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. GENF
HAMBURG KOPENHAGEN LAUSANNE MÜNCHEN STUTTGART WIEN ZÜRICH
Kafka Connect & Streams
the Ecosystem around Kafka
Guido Schmutz – 29.11.2017
@gschmutz guidoschmutz.wordpress.com
2. Guido Schmutz
Working at Trivadis for more than 20 years
Oracle ACE Director for Fusion Middleware and SOA
Consultant, Trainer, Software Architect for Java, Oracle, SOA and Big Data / Fast Data
Head of Trivadis Architecture Board
Technology Manager @ Trivadis
More than 30 years of software development experience
Contact: guido.schmutz@trivadis.com
Blog: http://guidoschmutz.wordpress.com
Slideshare: http://www.slideshare.net/gschmutz
Twitter: gschmutz
Kafka Connect & Streams - the Ecosystem around Kafka
3. Our company.
Trivadis is a market leader in IT consulting, system integration, solution engineering
and the provision of IT services focusing on and
technologies
in Switzerland, Germany, Austria and Denmark. We offer our services in the following
strategic business fields:
OPERATION: Trivadis Services takes over the interacting operation of your IT systems.
4. With over 600 specialists and IT experts in your region.
[Map: Copenhagen, Munich, Lausanne, Bern, Zurich, Brugg, Geneva, Hamburg, Düsseldorf, Frankfurt, Stuttgart, Freiburg, Basel, Vienna]
• 14 Trivadis branches and more than 600 employees
• 200 Service Level Agreements
• Over 4,000 training participants
• Research and development budget: CHF 5.0 million
• Financially self-supporting and sustainably profitable
• Experience from more than 1,900 projects per year at over 800 customers
5. Agenda
1. What is Apache Kafka?
2. Kafka Connect
3. Kafka Streams
4. KSQL
5. Kafka and "Big Data" / "Fast Data" Ecosystem
6. Kafka in Software Architecture
10. Kafka High Level Architecture
The who is who
• Producers write data to brokers.
• Consumers read data from brokers.
• All this is distributed.
The data
• Data is stored in topics.
• Topics are split into partitions, which are replicated.
[Diagram: Kafka Cluster with Broker 1, Broker 2 and Broker 3, three Producers, three Consumers and a Zookeeper Ensemble]
11. Kafka – Distributed Log at the Core
At the heart of Apache Kafka sits a distributed log
• collection of messages, appended sequentially to a file
• service ‘seeks’ to the position of the last message it read, then scans sequentially, reading messages in order
• log-structured character makes Kafka well suited to performing the role of an Event Store in Event Sourcing
[Diagram: Event Hub log with offsets 01 to 22; reads are a single seek & scan, writes are append only]
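The seek-and-scan read and the append-only write can be illustrated with a toy in-memory log. This is a minimal sketch for intuition only; the class `AppendOnlyLog` and its methods are hypothetical names, not part of Kafka, and Kafka's real log is a set of segment files on disk.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy in-memory append-only log; a teaching model, not Kafka's storage engine. */
class AppendOnlyLog {
    private final List<String> messages = new ArrayList<>();

    /** Appends a message to the end of the log and returns its offset. */
    long append(String message) {
        messages.add(message);
        return messages.size() - 1;
    }

    /** Seeks to fromOffset, then scans sequentially to the end of the log. */
    List<String> readFrom(long fromOffset) {
        int start = (int) Math.max(0, Math.min(fromOffset, messages.size()));
        return new ArrayList<>(messages.subList(start, messages.size()));
    }

    public static void main(String[] args) {
        AppendOnlyLog log = new AppendOnlyLog();
        log.append("event-01");
        log.append("event-02");
        long next = log.append("event-03") + 1; // position a reader would remember
        log.append("event-04");
        System.out.println(log.readFrom(next)); // prints [event-04]
    }
}
```

The reader only has to remember a single number, the next offset, which is what makes resuming after a restart cheap.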
12. Scale-Out Architecture
• topic consists of many partitions
• producer load is load-balanced over all partitions
• consumer can consume with as many threads as there are partitions
[Diagram: Kafka Cluster with Broker 1, Broker 2 and Broker 3; Producers 1 to 3; Consumers 1 to 4 organized in Consumer Group 1 and Consumer Group 2]
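Why the number of partitions caps consumer parallelism can be seen in a small model of partition assignment within one consumer group. This is a simplified round-robin sketch under the assumption that each partition is read by exactly one consumer of the group; `GroupAssignment` is a hypothetical name, and Kafka's actual assignors differ in detail.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of spreading partitions over the consumers of one group. */
class GroupAssignment {
    /** Assigns partitions 0..numPartitions-1 round-robin to numConsumers consumers. */
    static List<List<Integer>> assign(int numPartitions, int numConsumers) {
        List<List<Integer>> assignment = new ArrayList<>();
        for (int c = 0; c < numConsumers; c++) {
            assignment.add(new ArrayList<>());
        }
        for (int p = 0; p < numPartitions; p++) {
            assignment.get(p % numConsumers).add(p);
        }
        return assignment;
    }

    public static void main(String[] args) {
        System.out.println(assign(3, 2)); // prints [[0, 2], [1]]
        System.out.println(assign(3, 4)); // prints [[0], [1], [2], []]
    }
}
```

In the second call the fourth consumer receives no partition and stays idle, so adding consumers beyond the partition count does not increase throughput.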
13. Strong Ordering Guarantees
• most business systems need strong ordering guarantees
• messages that require relative ordering need to be sent to the same partition
• supply same key for all messages that require a relative order
• To maintain global ordering use a single partition topic
[Diagram: Producer 1, Brokers 1 to 3 and Consumers 1 to 3; messages keyed Key-1 to Key-6, with Key-1 and Key-3 appearing twice]
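The link between key and partition can be sketched as a hash of the key modulo the partition count. The hash used below (`Arrays.hashCode` over the UTF-8 bytes) is an assumption chosen for brevity; Kafka's own default partitioner uses a different hash function (murmur2), so the concrete partition numbers will not match a real cluster. `KeyPartitioner` is a hypothetical name.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Toy key-based partitioner: equal keys always map to the same partition. */
class KeyPartitioner {
    static int partitionFor(String key, int numPartitions) {
        int hash = Arrays.hashCode(key.getBytes(StandardCharsets.UTF_8));
        // Mask the sign bit so the result of the modulo is never negative
        return (hash & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        String[] keys = {"Key-1", "Key-2", "Key-3", "Key-1"};
        for (String key : keys) {
            System.out.println(key + " -> partition " + partitionFor(key, 3));
        }
        // With a single partition every key lands in partition 0 (global ordering)
        System.out.println(partitionFor("Key-5", 1)); // prints 0
    }
}
```

Because both occurrences of Key-1 resolve to the same partition, a consumer of that partition sees them in the order they were produced.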
16. Replay-ability – Logs never forget
• by keeping events in a log, we have a version control system for our data
• if you were to deploy a faulty program, the system might become corrupted, but it would always be recoverable
• sequence of events provides an audit point, so that you can examine exactly what happened
• rewind and replay events, once service is back and bug is fixed
[Diagram: Event Hub log with offsets 01 to 22; a Service (Logic, State) rewinds and replays]
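The recovery idea can be sketched in a few lines: state is derived purely from the log, so a corrected program can rebuild it by replaying from the beginning. This is a minimal illustration with hypothetical names (`ReplayDemo`, integer events as deltas); it does not use the Kafka client API.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.IntBinaryOperator;

/** Toy replay: state is a fold over the event log, so it can always be rebuilt. */
class ReplayDemo {
    /** Rebuilds state from scratch by applying logic to every logged event in order. */
    static int replay(List<Integer> eventLog, IntBinaryOperator logic) {
        int state = 0;
        for (int event : eventLog) {
            state = logic.applyAsInt(state, event);
        }
        return state;
    }

    public static void main(String[] args) {
        List<Integer> eventLog = Arrays.asList(5, 3, 7);
        IntBinaryOperator buggy = (state, e) -> state - e; // faulty deployment: wrong sign
        IntBinaryOperator fixed = (state, e) -> state + e; // corrected logic
        System.out.println(replay(eventLog, buggy)); // prints -15 (corrupted state)
        System.out.println(replay(eventLog, fixed)); // prints 15 (recovered by replay)
    }
}
```

The log itself is never modified by the faulty program, which is why the corrupted state is recoverable.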
17. Hold Data for Long-Term – Data Retention
[Diagram: Producer 1, Broker 1, Broker 2 and Broker 3]
1. Never
2. Time based (TTL)
log.retention.{ms | minutes | hours}
3. Size based
log.retention.bytes
4. Log compaction based
(older entries with the same key are removed, the latest entry per key is kept):
kafka-topics.sh --zookeeper zk:2181 \
  --create --topic customers \
  --replication-factor 1 \
  --partitions 1 \
  --config cleanup.policy=compact
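The effect of log compaction can be modelled as keeping only the most recent entry per key. This is a simplified sketch with hypothetical names (`LogCompaction`, entries as `{key, value}` string pairs); the real broker compacts segment files in the background, so older entries may remain visible for some time.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy log compaction: keeps only the latest entry for every key. */
class LogCompaction {
    /** Each entry is {key, value}; surviving entries keep their relative order. */
    static List<String[]> compact(List<String[]> log) {
        Map<String, Integer> lastIndex = new HashMap<>();
        for (int i = 0; i < log.size(); i++) {
            lastIndex.put(log.get(i)[0], i); // later entries overwrite earlier ones
        }
        List<String[]> compacted = new ArrayList<>();
        for (int i = 0; i < log.size(); i++) {
            if (lastIndex.get(log.get(i)[0]) == i) {
                compacted.add(log.get(i));
            }
        }
        return compacted;
    }

    public static void main(String[] args) {
        List<String[]> log = Arrays.asList(
            new String[] {"customer-1", "Zurich"},
            new String[] {"customer-2", "Bern"},
            new String[] {"customer-1", "Basel"});
        for (String[] entry : compact(log)) {
            System.out.println(entry[0] + " = " + entry[1]);
        }
        // prints:
        // customer-2 = Bern
        // customer-1 = Basel
    }
}
```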
19. How to get a Kafka environment
On Premises
• Bare Metal Installation
• Docker
• Mesos / Kubernetes
• Hadoop Distributions
Cloud
• Oracle Event Hub Cloud Service
• Azure HDInsight Kafka
• Confluent Cloud
• …
22. Demo (I) – Run Producer and Kafka-Console-Consumer
23. Demo (I) – Java Producer to "truck_position"
Constructing a Kafka Producer
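// Connection and serialization settings for the producer
private Properties kafkaProps = new Properties();
kafkaProps.put("bootstrap.servers", "broker-1:9092");
kafkaProps.put("key.serializer", "...StringSerializer");
kafkaProps.put("value.serializer", "...StringSerializer");
producer = new KafkaProducer<String, String>(kafkaProps);

// Key the record by driverId so all events of one driver go to the same partition
ProducerRecord<String, String> record =
    new ProducerRecord<>("truck_position", driverId, eventData);
try {
    // send() is asynchronous; get() blocks until the broker acknowledges the record
    metadata = producer.send(record).get();
} catch (Exception e) {
    e.printStackTrace(); // do not silently swallow send failures
}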
24. Demo (II) – devices send to MQTT instead of Kafka
[Diagram: Truck-1, Truck-2 and Truck-3 publish to the MQTT topic truck/nn/position]
2016-06-02 14:39:56.605|98|27|803014426|Wichita to Little Rock Route2|Normal|38.65|90.21|5187297736652502631
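The sample message is a pipe-delimited record, and a consumer has to split it into fields before it can be processed. The sketch below parses it in plain Java. The field names (timestamp, truckId, driverId, routeId, routeName, eventType, latitude, longitude, correlationId) are assumptions inferred from the sample and from the columns used in the later KSQL demo; the slides do not spell out the schema.

```java
/** Toy parser for the pipe-delimited truck position message; field names are assumed. */
class TruckPosition {
    final String timestamp;
    final int truckId;
    final int driverId;
    final long routeId;
    final String routeName;
    final String eventType;
    final double latitude;
    final double longitude;
    final long correlationId;

    private TruckPosition(String[] f) {
        timestamp = f[0];
        truckId = Integer.parseInt(f[1]);
        driverId = Integer.parseInt(f[2]);
        routeId = Long.parseLong(f[3]);
        routeName = f[4];
        eventType = f[5];
        latitude = Double.parseDouble(f[6]);
        longitude = Double.parseDouble(f[7]);
        correlationId = Long.parseLong(f[8]);
    }

    /** Splits on '|' and rejects messages that do not have exactly nine fields. */
    static TruckPosition parse(String line) {
        String[] fields = line.trim().split("\\|", -1);
        if (fields.length != 9) {
            throw new IllegalArgumentException("expected 9 fields, got " + fields.length);
        }
        return new TruckPosition(fields);
    }

    public static void main(String[] args) {
        TruckPosition p = parse("2016-06-02 14:39:56.605|98|27|803014426|"
            + "Wichita to Little Rock Route2|Normal|38.65|90.21|5187297736652502631");
        System.out.println(p.driverId + " / " + p.eventType); // prints 27 / Normal
    }
}
```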
25. Demo (II) – devices send to MQTT instead of Kafka
26. Demo (II) - devices send to MQTT instead of Kafka –
how to get the data into Kafka?
[Diagram: Truck-1, Truck-2 and Truck-3 publish to the MQTT topic truck/nn/position; how the data reaches the Kafka topic "truck position raw" is still open (?)]
2016-06-02 14:39:56.605|98|27|803014426|Wichita to Little Rock Route2|Normal|38.65|90.21|5187297736652502631
28. Kafka Connect - Overview
• Source Connector
• Sink Connector
29. Kafka Connect – Single Message Transforms (SMT)
Simple Transformations for a single message
Defined as part of Kafka Connect
• some useful transforms provided out-of-the-box
• Easily implement your own
Optionally deploy 1+ transforms with each connector
• Modify messages produced by source connector
• Modify messages sent to sink connectors
Makes it much easier to mix and match connectors
Some of the currently available transforms:
• InsertField
• ReplaceField
• MaskField
• ValueToKey
• ExtractField
• TimestampRouter
• RegexRouter
• SetSchemaMetadata
• Flatten
• TimestampConverter
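The idea of chaining one or more transforms per connector can be modelled as a list of functions applied to each message in order. The sketch below is a plain-Java model, not the Kafka Connect transformation interface: `SmtSketch`, `maskField` and `insertField` are hypothetical helpers that only mimic the spirit of MaskField and InsertField, and masking with an empty string is a simplification.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

/** Toy model of a Single Message Transform chain over map-shaped records. */
class SmtSketch {
    /** Mimics MaskField: blanks out one field (here always with an empty string). */
    static UnaryOperator<Map<String, Object>> maskField(String field) {
        return record -> {
            Map<String, Object> copy = new LinkedHashMap<>(record);
            if (copy.containsKey(field)) {
                copy.put(field, "");
            }
            return copy;
        };
    }

    /** Mimics InsertField with a static value. */
    static UnaryOperator<Map<String, Object>> insertField(String field, Object value) {
        return record -> {
            Map<String, Object> copy = new LinkedHashMap<>(record);
            copy.put(field, value);
            return copy;
        };
    }

    /** Applies the transforms one after another to a single message. */
    static Map<String, Object> applyChain(Map<String, Object> record,
                                          List<UnaryOperator<Map<String, Object>>> chain) {
        Map<String, Object> current = record;
        for (UnaryOperator<Map<String, Object>> transform : chain) {
            current = transform.apply(current);
        }
        return current;
    }

    public static void main(String[] args) {
        Map<String, Object> record = new LinkedHashMap<>();
        record.put("driverId", 21);
        record.put("lastName", "Page");
        List<UnaryOperator<Map<String, Object>>> chain = new ArrayList<>();
        chain.add(maskField("lastName"));
        chain.add(insertField("source", "mqtt"));
        System.out.println(applyChain(record, chain)); // prints {driverId=21, lastName=, source=mqtt}
    }
}
```

Each transform sees exactly one message and returns a new one, which is what keeps the transforms independent of the connector they are attached to.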
30. Kafka Connect – Many Connectors
60+ since first release (0.9+)
20+ from Confluent and Partners
Source: http://www.confluent.io/product/connectors
Confluent supported Connectors
Certified Connectors Community Connectors
36. Kafka Streams - Overview
• Designed as a simple and lightweight library in Apache Kafka
• no external dependencies on systems other than Apache Kafka
• Part of open source Apache Kafka, introduced in 0.10+
• Leverages Kafka as its internal messaging layer
• Supports fault-tolerant local state
• Event-at-a-time processing (not microbatch) with millisecond latency
• Windowing with out-of-order data using a Google Dataflow-like model
37. Kafka Stream DSL and Processor Topology
KStream<Integer, String> stream1 =
    builder.stream("in-1");
KStream<Integer, String> stream2 =
    builder.stream("in-2");
KStream<Integer, String> joined =
    stream1.leftJoin(stream2, …);
KTable<> aggregated =
    joined.groupBy(…).count("store");
aggregated.to("out-1");
[Diagram: processor topology with source nodes 1 and 2, a left join (lj), an aggregation (a) backed by State, and a sink (t)]
39. Kafka Streams Cluster
[Diagram: Processor Topology (1, 2, lj, a, t, State) next to a Kafka Cluster holding the topics input-1, input-2, store (changelog) and output]
43. Kafka Streams: Key Features
• Native, 100%-compatible Kafka integration
• Secure stream processing using Kafka’s security features
• Elastic and highly scalable
• Fault-tolerant
• Stateful and stateless computations
• Interactive queries
• Time model
• Windowing
• Supports late-arriving and out-of-order data
• Millisecond processing latency, no micro-batching
• At-least-once and exactly-once processing guarantees
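Windowing with out-of-order data works because an event is placed into a window by its own timestamp rather than by its arrival time. The sketch below counts events per tumbling window in plain Java. It is a model of the idea only, with hypothetical names (`TumblingWindowCount`); it accepts arbitrarily late events, whereas a real stream processor bounds how long a window stays open.

```java
import java.util.Map;
import java.util.TreeMap;

/** Toy tumbling-window count keyed by event time, not arrival time. */
class TumblingWindowCount {
    /** Start of the window that contains the given (non-negative) timestamp. */
    static long windowStart(long timestampMs, long sizeMs) {
        return timestampMs - (timestampMs % sizeMs);
    }

    /** Counts events per window start; the order of arrival does not matter. */
    static Map<Long, Integer> countPerWindow(long[] eventTimesMs, long sizeMs) {
        Map<Long, Integer> counts = new TreeMap<>();
        for (long t : eventTimesMs) {
            counts.merge(windowStart(t, sizeMs), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // The event stamped 59_000 arrives after one stamped 61_000 (out of order)
        long[] arrivals = {1_000, 61_000, 59_000, 62_000};
        System.out.println(countPerWindow(arrivals, 60_000)); // prints {0=2, 60000=2}
    }
}
```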
47. KSQL: a Streaming SQL Engine for Apache Kafka
• Enables stream processing with zero coding required
• The simplest way to process streams of data in real-time
• Powered by Kafka and Kafka Streams: scalable, distributed, mature
• All you need is Kafka – no complex deployments
• available as Developer preview!
• STREAM and TABLE as first-class citizens
• STREAM = data in motion
• TABLE = collected state of a stream
• join STREAM and TABLE
55. Demo (V) – Create JDBC Connect through REST API
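Creating a connector through the Kafka Connect REST API means sending a JSON document with a name and a config map to the `/connectors` endpoint. The sketch below builds such a request in Java 11+ without sending it. Everything specific in it is an assumption for illustration: the connector name, the database URL, the `last_update` column, the `driver` table and the worker address `http://localhost:8083` are hypothetical, and the configuration keys are those of the Confluent JDBC source connector as far as they are known here.

```java
import java.net.URI;
import java.net.http.HttpRequest;

/** Builds (but does not send) a request that registers a JDBC source connector. */
class CreateJdbcConnector {
    /** JSON body for POST /connectors; all values are illustrative assumptions. */
    static String payload() {
        return "{\n"
            + "  \"name\": \"jdbc-driver-source\",\n"
            + "  \"config\": {\n"
            + "    \"connector.class\": \"io.confluent.connect.jdbc.JdbcSourceConnector\",\n"
            + "    \"connection.url\": \"jdbc:postgresql://db:5432/truckdb\",\n"
            + "    \"mode\": \"timestamp\",\n"
            + "    \"timestamp.column.name\": \"last_update\",\n"
            + "    \"table.whitelist\": \"driver\",\n"
            + "    \"topic.prefix\": \"trucking_\"\n"
            + "  }\n"
            + "}";
    }

    /** Wraps the payload in an HTTP POST against the Connect worker's REST endpoint. */
    static HttpRequest request(String connectUrl) {
        return HttpRequest.newBuilder(URI.create(connectUrl + "/connectors"))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(payload()))
            .build();
    }

    public static void main(String[] args) {
        HttpRequest req = request("http://localhost:8083");
        System.out.println(req.method() + " " + req.uri());
        System.out.println(payload());
        // To submit it against a running worker:
        // HttpClient.newHttpClient().send(req, HttpResponse.BodyHandlers.ofString());
    }
}
```

With the assumed prefix `trucking_` and table `driver`, the connector would publish to a topic named `trucking_driver`, which is the topic the KSQL table in the following demo step reads from.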
56. Demo (V) - Create Table with Driver State
ksql> CREATE TABLE driver_t
(id BIGINT,
first_name VARCHAR,
last_name VARCHAR,
available VARCHAR)
WITH (kafka_topic='trucking_driver',
value_format='JSON');
Message
----------------
Table created
57. Demo (V) - Create Table with Driver State
ksql> CREATE STREAM dangerous_driving_and_driver_s
WITH (kafka_topic='dangerous_driving_and_driver_s',
value_format='JSON')
AS SELECT driverid, first_name, last_name, truckid, routeid, routename,
eventtype
FROM truck_position_s
LEFT JOIN driver_t
ON truck_position_s.driverid = driver_t.id;
Message
----------------------------
Stream created and running
ksql> select * from dangerous_driving_and_driver_s;
1511173352906 | 21 | 21 | Lila | Page | 58 | 1594289134 | Memphis to Little Rock Route 2 | Unsafe tail distance
1511173353669 | 12 | 12 | Laurence | Lindsey | 93 | 1384345811 | Joplin to Kansas City | Lane Departure
1511173435385 | 11 | 11 | Micky | Isaacson | 22 | 1198242881 | Saint Louis to Chicago Route2 | Unsafe tail distance
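What the stream-table LEFT JOIN does per event can be sketched as a lookup of the current table row for the event's key. This is a plain-Java model with hypothetical names (`StreamTableLeftJoin`, events as `{driverId, eventType}` pairs, the table as a map from driver id to name); it is not how KSQL executes the query internally.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy stream-table left join: every event is enriched with the current table row. */
class StreamTableLeftJoin {
    /** Events are {driverId, eventType}; events without a matching driver keep a null name. */
    static List<String> leftJoin(List<String[]> events, Map<String, String> driverTable) {
        List<String> joined = new ArrayList<>();
        for (String[] event : events) {
            String name = driverTable.get(event[0]); // null when the table has no row
            joined.add(event[0] + " | " + name + " | " + event[1]);
        }
        return joined;
    }

    public static void main(String[] args) {
        Map<String, String> drivers = new HashMap<>();
        drivers.put("21", "Lila Page");
        List<String[]> events = Arrays.asList(
            new String[] {"21", "Unsafe tail distance"},
            new String[] {"99", "Lane Departure"});
        for (String row : leftJoin(events, drivers)) {
            System.out.println(row);
        }
        // prints:
        // 21 | Lila Page | Unsafe tail distance
        // 99 | null | Lane Departure
    }
}
```

Because it is a left join, the event for the unknown driver 99 is still emitted, only without driver details.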
58. Kafka and "Big Data" / "Fast Data" Ecosystem
59. Kafka and the Big Data / Fast Data ecosystem
Kafka integrates with many popular products / frameworks
• Apache Spark Streaming
• Apache Flink
• Apache Storm
• Apache Apex
• Apache NiFi
• StreamSets
• Oracle Stream Analytics
• Oracle Service Bus
• Oracle GoldenGate
• Oracle Event Hub Cloud Service
• Debezium CDC
• …
Additional Info: https://cwiki.apache.org/confluence/display/KAFKA/Ecosystem
60. Kafka in Software Architecture
61. Traditional Big Data Architecture
[Diagram labels: source systems (Billing & Ordering, CRM / Profile, Marketing Campaigns); File Import / SQL Import; Big Data Cluster / Hadoop Cluster (Distributed Filesystem, Parallel Batch Processing, NoSQL, SQL, Search); BI Tools, Enterprise Data Warehouse, Search / Explore, Online & Mobile Apps]
• Machine Learning
• Graph Algorithms
• Natural Language Processing
62. Event Hub – handle event stream data
[Diagram labels: event sources (Location, Social, Click stream, Sensor Data, Weather Data, Mobile Apps, Call Center); source systems (Billing & Ordering, CRM / Profile, Marketing Campaigns); Event Hub; Data Flow; Big Data Cluster / Hadoop Cluster (Distributed Filesystem, Parallel Batch Processing, NoSQL, SQL, Search); BI Tools, Enterprise Data Warehouse, Search / Explore, Online & Mobile Apps]
• Machine Learning
• Graph Algorithms
• Natural Language Processing
63. Event Hub – taking Velocity into account
[Diagram labels: event sources (Location, Social, Click stream, Sensor Data, Weather Data, Mobile Apps, Call Center); source systems (Billing & Ordering, CRM / Profile, Marketing Campaigns); File Import / SQL Import; Event Hub; Big Data Cluster / Hadoop Cluster; Batch Analytics (Distributed Filesystem, Parallel Batch Processing); Streaming Analytics (Stream Analytics, Reference / Models); Results (NoSQL, SQL, Search); Dashboard, BI Tools, Enterprise Data Warehouse, Search / Explore, Online & Mobile Apps]
64. Event Hub – Asynchronous Microservice Architecture
[Diagram labels: event sources (Location, Social, Click stream, Sensor Data, Weather Data, Mobile Apps, Call Center); source systems (Billing & Ordering, CRM / Profile, Marketing Campaigns); File Import / SQL Import; Event Hub; Container with Microservice, { } API, NoSQL and RDBMS; Big Data Cluster / Hadoop Cluster (Distributed Filesystem, Parallel Batch Processing, SQL, Search); BI Tools, Enterprise Data Warehouse, Search / Explore, Online & Mobile Apps]
65. Technology on its own won't help you.
You need to know how to use it properly.