Advanced Stream Processing with Flink and Pulsar
Till Rohrmann, Engineering Lead at Ververica, @stsffap
Addison Higham, Chief Architect at StreamNative, @addisonjh
Why is Stream Processing Important?
Spectrum of data streaming applications (from offline to real time)
● Data warehousing
● OLAP / BI / reporting
● Machine learning & model training
● Continuous ETL
● Unified offline / real-time analytics
● Continuous monitoring (position, risk, …)
● Real-time behaviour modeling (pricing, recommenders, …)
● Real-time alerts (fraud, security, …)
● Real-time ML model training/evaluation
● Distributed OLTP applications
Choice of data architecture defines which use cases you can cover!
Modern Streaming Data Architecture
[Diagram labels: Event Producers, Stream Storage, Stream Processing (×3), Long Term Storage, Results / Views, Triggered Applications]
Reference Streaming Data Architecture: Flink + Pulsar
[Diagram labels: Event Producers, Stream Storage, Long Term Storage, Results / Views, Stateful Functions]
Apache Flink: Analytics and Applications on Streaming Data
Flink Runtime: Stateful Computations over Data Streams
● Stateful Stream Processing: Streams, State, Time
● Streaming Analytics: SQL and Tables
● Event-driven Applications: Stateful Functions
Apache Flink in Numbers
● Contributors: 887
● Most active Apache mailing list
● Most active sources by visits and commits (#2)
● GitHub stars: 16.4k
● Commits: >27k
● Releases: 1.13 latest major release
● Maven downloads per month: ~ 170k
● LOC: > 1.8 million
● Latency: < 1s
● Throughput: 4 billion events/s, 7 TB/s
● Input size: 100 TB for batch job
Some Apache Flink Users
One of The Most Advanced Stream Processors
● First class support for state
○ Asynchronous barrier checkpointing algorithm to create globally consistent checkpoints
● Event-time support
○ Correctness under delayed events
● End-to-end exactly-once processing guarantees
○ Correctness under faults
● Resource elastic
○ Flink applications’ resources can be adjusted to the actual need
● Unified batch-streaming APIs
○ A single query to process historical as well as live data
Everything Is a Stream
● Stream: Sequence of data which is made available over time
● All computation processes chunks of data over time, producing results over time → Stream processing
○ E.g. reading data from disks is done in streaming fashion
● Events can be of various forms
● Decisive difference: Is my stream bounded or not?
SQL / Table API: Unified Batch & Stream Processing
SQL Query
Batch query
execution
SELECT
  room,
  TUMBLE_END(rowtime, INTERVAL '1' HOUR),
  AVG(temperature)
FROM
  sensors
GROUP BY
  TUMBLE(rowtime, INTERVAL '1' HOUR),
  room
Stream-Table Duality
User lastLogin
Alice 2021-01-01
Bob 2021-01-02
User lastLogin
Alice 2021-01-14
Bob 2021-02-01
Eve 2021-01-21
Alice, 2021-01-01
Bob, 2021-01-02
Alice, 2021-01-14
Eve, 2021-01-21
Bob, 2021-02-01
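The duality on this slide can be sketched in a few lines of plain Java: replaying the stream of login events in order and keeping only the latest value per user yields the table, and stopping the replay early yields the table as it looked at that earlier point. This is only an illustration of the idea, not how Flink implements it; the class and method names (`StreamTableDuality`, `LoginEvent`, `materialize`) are made up for this example.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class StreamTableDuality {
    /** One event of the stream: a user and the date of that login (hypothetical type). */
    static final class LoginEvent {
        final String user;
        final String lastLogin;

        LoginEvent(String user, String lastLogin) {
            this.user = user;
            this.lastLogin = lastLogin;
        }
    }

    /** Replays the stream in order; a later event for a user overwrites the earlier one. */
    static Map<String, String> materialize(List<LoginEvent> stream) {
        Map<String, String> table = new LinkedHashMap<>();
        for (LoginEvent event : stream) {
            table.put(event.user, event.lastLogin);
        }
        return table;
    }

    public static void main(String[] args) {
        // The five events shown on the slide, in stream order.
        List<LoginEvent> stream = List.of(
            new LoginEvent("Alice", "2021-01-01"),
            new LoginEvent("Bob", "2021-01-02"),
            new LoginEvent("Alice", "2021-01-14"),
            new LoginEvent("Eve", "2021-01-21"),
            new LoginEvent("Bob", "2021-02-01"));

        // Table after the first two events.
        System.out.println(materialize(stream.subList(0, 2)));
        // prints {Alice=2021-01-01, Bob=2021-01-02}

        // Table after the whole stream has been applied.
        System.out.println(materialize(stream));
        // prints {Alice=2021-01-14, Bob=2021-02-01, Eve=2021-01-21}
    }
}
```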
SQL / Table API: Running The Same Query On Streams
SQL Query
Incremental
query execution
SELECT
  room,
  TUMBLE_END(rowtime, INTERVAL '1' HOUR),
  AVG(temperature)
FROM
  sensors
GROUP BY
  TUMBLE(rowtime, INTERVAL '1' HOUR),
  room
Interpret stream as table
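To make "incremental query execution" concrete, here is a minimal plain-Java sketch of how the hourly average per room could be maintained one row at a time instead of scanning a finished table. It is an illustration under simplifying assumptions, not Flink's implementation: it returns the running average on every row, whereas a real event-time window would emit its result once the watermark passes the window end. The names `TumblingAverage`, `Acc`, `windowEnd`, and `add` are invented for this example.

```java
import java.util.HashMap;
import java.util.Map;

class TumblingAverage {
    static final long HOUR_MS = 3_600_000L;

    /** Running sum and count for one (room, window) group. */
    static final class Acc {
        double sum;
        long count;

        double average() {
            return sum / count;
        }
    }

    // State kept between rows, keyed by room and window end.
    private final Map<String, Acc> state = new HashMap<>();

    /** Exclusive end of the one-hour tumbling window that contains rowtimeMs. */
    static long windowEnd(long rowtimeMs) {
        return (Math.floorDiv(rowtimeMs, HOUR_MS) + 1) * HOUR_MS;
    }

    /** Folds one row into its group and returns that group's current average. */
    double add(String room, long rowtimeMs, double temperature) {
        String key = room + "@" + windowEnd(rowtimeMs);
        Acc acc = state.computeIfAbsent(key, k -> new Acc());
        acc.sum += temperature;
        acc.count++;
        return acc.average();
    }

    public static void main(String[] args) {
        TumblingAverage query = new TumblingAverage();
        System.out.println(query.add("kitchen", 0L, 20.0));         // prints 20.0
        System.out.println(query.add("kitchen", 1_800_000L, 22.0)); // prints 21.0
        System.out.println(query.add("kitchen", 3_600_000L, 30.0)); // prints 30.0 (next window)
    }
}
```

The same accumulator logic gives the batch result when every row of a bounded input has been folded in, which is the sense in which one query serves both modes.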
Data Has to Come From Somewhere
● Flink is a powerful compute engine for unified batch and stream processing that is pushing the boundaries of data processing
● For Flink to really shine, it needs a storage system that supports efficient stream and batch ingestion
Pulsar offers exactly that!
● Apache Pulsar is evolving to meet the needs of a system like Flink, offering both a low-latency, high-bandwidth streaming API and a flexible architecture to support batch access
Why is Pulsar a Great Fit For Streaming And Batch?
Streaming
● Like other streaming systems, Pulsar can be used like a distributed scale-out log, providing consumer-controlled offsets and high throughput through parallel partitions of topics
Batch
● In addition, Pulsar’s segment-based, multi-layer architecture allows applications to access historical data more directly via the underlying storage
[Diagram labels: Producer, Consumer; Brokers 1-3; Topic1-Part1, Topic1-Part2, Topic1-Part3; Segments 1 to X; Bookies 1-3; Offload Segments]
Flink + Pulsar
Adoption at Scale
While Flink and Pulsar are both widely used independently, together they offer a
strong technology for unified batch and stream storage and compute
Many large organizations are adopting Flink + Pulsar for complex workloads that
can take advantage of the strengths of both systems
State of The Pulsar-Flink Connector Today
● Pulsar-Flink connector supports Flink 1.11, 1.12 and soon 1.13
● Exactly-once source and sink via producer deduplication
● Full support for Flink Schema with integration to Pulsar Schema Registry
● Full Flink SQL support for batch and streaming modes
○ Also support for an “upsert” mode which can reason about inserts, updates, and deletes
● Support for KeyShared subscriptions for higher source parallelism than
number of topic-partitions
● https://github.com/streamnative/pulsar-flink
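The idea behind "producer deduplication" can be sketched in plain Java: each message carries a producer name and a monotonically increasing sequence ID, and the log drops any message whose sequence ID is not higher than the last one it stored for that producer, so a retried send after a failure does not create a duplicate. This is a simplified model for illustration only, not Pulsar broker code; `DedupLog`, `append`, and `contents` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class DedupLog {
    // Highest sequence ID stored so far, per producer name.
    private final Map<String, Long> lastSequenceId = new HashMap<>();
    private final List<String> log = new ArrayList<>();

    /** Appends the payload unless it repeats an already stored sequence ID; returns true if stored. */
    boolean append(String producerName, long sequenceId, String payload) {
        Long last = lastSequenceId.get(producerName);
        if (last != null && sequenceId <= last) {
            return false; // duplicate, e.g. a resend after a timeout or restart
        }
        lastSequenceId.put(producerName, sequenceId);
        log.add(payload);
        return true;
    }

    List<String> contents() {
        return log;
    }

    public static void main(String[] args) {
        DedupLog topic = new DedupLog();
        System.out.println(topic.append("flink-sink-0", 1, "a")); // prints true
        System.out.println(topic.append("flink-sink-0", 2, "b")); // prints true
        System.out.println(topic.append("flink-sink-0", 2, "b")); // prints false (retry ignored)
        System.out.println(topic.contents());                     // prints [a, b]
    }
}
```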
Demo
The Road Ahead
Many in-progress features are being developed
● Pulsar-Flink connector being upstreamed into Flink
○ Targeted for Flink 1.14
● Pulsar Transaction Support
○ Pulsar transactions allow for stronger exactly-once processing guarantees
● Native Pulsar Watermarking
○ Pulsar is able to broker watermarks between producers and consumers
● Parallel Batch Source
○ Query multiple segments in parallel for higher throughput
● Unified Batch + Streaming Source
○ Using Pulsar’s batch mode for catch up and then switching over to streaming ingestion
Flink Support For Pulsar Transactions
Pulsar transactions are GA in Pulsar 2.8.0
Support in Flink is nearing completion, which allows for end-to-end exactly-once processing guarantees!
Learn more about transactions at the talk Exactly-Once Made Easy: Transactional Messaging in Apache Pulsar at 11:30 AM
[Diagram labels: Job 1, Job 2, Job 3, tx]
The Importance of Watermarks
High-quality watermarks are crucial for correct and
stable stream processing jobs.
In order for results to be correct, we need to take into
account “Event Time” rather than solely relying on
“Processing Time”
This is especially important when replaying or
processing older streaming data
“Watermarks” advance the event-time clock. If the clock does not “tick” accurately, the results will not be correct.
[Figure labels: Ideal, Reality, Skew; axis: Event Time]
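One common way to make the event-time clock "tick" is a bounded-out-of-orderness watermark: the watermark trails the highest event time seen so far by a fixed allowance, so moderately delayed events still arrive ahead of it. The plain-Java sketch below shows that rule and how it classifies an event as late. It is an illustration, not connector code; the class `BoundedOutOfOrdernessClock` and its methods are names chosen for this example, and the 5-second allowance is an arbitrary assumption.

```java
class BoundedOutOfOrdernessClock {
    private final long maxOutOfOrdernessMs;
    private long maxEventTimeMs;

    BoundedOutOfOrdernessClock(long maxOutOfOrdernessMs) {
        this.maxOutOfOrdernessMs = maxOutOfOrdernessMs;
        // Start high enough that the first watermark computation cannot underflow.
        this.maxEventTimeMs = Long.MIN_VALUE + maxOutOfOrdernessMs + 1;
    }

    /** Observes one event; only a new maximum moves the clock forward. */
    void onEvent(long eventTimeMs) {
        maxEventTimeMs = Math.max(maxEventTimeMs, eventTimeMs);
    }

    /** Claim: no further events with a timestamp at or below this value are expected. */
    long currentWatermark() {
        return maxEventTimeMs - maxOutOfOrdernessMs - 1;
    }

    /** An event at or below the current watermark arrived too late to be counted on time. */
    boolean isLate(long eventTimeMs) {
        return eventTimeMs <= currentWatermark();
    }

    public static void main(String[] args) {
        BoundedOutOfOrdernessClock clock = new BoundedOutOfOrdernessClock(5_000L);
        clock.onEvent(10_000L);
        System.out.println(clock.currentWatermark()); // prints 4999
        clock.onEvent(8_000L);                        // out of order, clock does not move back
        System.out.println(clock.currentWatermark()); // prints 4999
        clock.onEvent(20_000L);
        System.out.println(clock.currentWatermark()); // prints 14999
        System.out.println(clock.isLate(12_000L));    // prints true
    }
}
```

If the allowance is too small, results are computed before delayed events arrive; if it is too large, results are delayed unnecessarily. That trade-off is why the accuracy of the watermark matters.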
Native Pulsar Watermarking
Pulsar is adding support for watermarks
● Producers have a new API for injecting
watermarks into the stream, with
multiple producers potentially producing
● Watermarks are broadcast across all
partitions, which allows the broker to
dispatch more accurate watermarks to
consumers, in both real-time and
historical scenarios
● The API is designed to be simple to
integrate with Flink, removing the
complexity of accurate watermark
generation from the developer
// Pulsar: send messages that carry an event time
for (int i = 0; i < NUM_MESSAGES; i++) {
    producer.newMessage()
        .value("hello with event time! " + i)
        .eventTime(System.currentTimeMillis())
        .sendAsync();
}
// Pulsar: inject a watermark into the stream (the new producer API described on this slide)
producer.newWatermark()
    .eventTime(System.currentTimeMillis())
    .sendAsync();

// Flink: let the source use the watermarks brokered by Pulsar
PulsarSource<String> pulsarSource = new
    FlinkPulsarSource(topic, ...);
pulsarSource.assignTimestampsAndWatermarks(
    PulsarWatermarkStrategy.forPulsarWatermarks());
Shared Event-Time Domain
End-to-End Watermarks
[Diagram labels: Producer, Job 1, Job 2, Job 3; watermarks W|24, W|17, W|12, W|12, W|8, W|4]
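When watermarks flow through several jobs, each consumer with more than one input has to combine them. In Flink, an operator with several inputs advances its event-time clock to the minimum of the latest watermarks seen on each input, since the slowest input bounds what can still arrive. The plain-Java sketch below shows that rule in isolation; it is an illustration with invented names (`WatermarkCombiner`, `onWatermark`) and says nothing about how the Pulsar broker itself merges watermarks.

```java
import java.util.Arrays;

class WatermarkCombiner {
    private final long[] inputWatermarks;
    private long emitted = Long.MIN_VALUE;

    /** numInputs must be at least 1. */
    WatermarkCombiner(int numInputs) {
        inputWatermarks = new long[numInputs];
        // Until an input reports a watermark, it holds the combined clock back.
        Arrays.fill(inputWatermarks, Long.MIN_VALUE);
    }

    /** Records a watermark from one input and returns the combined watermark. */
    long onWatermark(int input, long watermark) {
        // Watermarks never move backwards on a single input.
        inputWatermarks[input] = Math.max(inputWatermarks[input], watermark);
        long min = Arrays.stream(inputWatermarks).min().getAsLong();
        emitted = Math.max(emitted, min);
        return emitted;
    }

    public static void main(String[] args) {
        WatermarkCombiner combiner = new WatermarkCombiner(2);
        System.out.println(combiner.onWatermark(0, 17L) == Long.MIN_VALUE); // prints true
        System.out.println(combiner.onWatermark(1, 12L)); // prints 12
        System.out.println(combiner.onWatermark(1, 24L)); // prints 17
    }
}
```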
Pulsar + Flink and StreamNative + Ververica
Pulsar + Flink community collaboration
● Contributing Pulsar-Flink connector to Flink repository
● Evolve connector to support more advanced Pulsar and Flink features
● Build a best-in-class open source stream processing platform
StreamNative + Ververica Cloud partnership
● Help customers to unlock the full potential of stream processing
Thanks a lot for your attention!
Questions?


Advanced Stream Processing with Flink and Pulsar - Pulsar Summit NA 2021 Keynote

  • 1. Stream Processing Advanced with Flink and Pulsar Till Rohrmann, Engineering Lead at Ververica, @stsffap Addison Higham, Chief Architect at StreamNative, @addisonjh
  • 2. Why is Stream Processing Important?
  • 3.
  • 4.
  • 5. Spectrum of data streaming applications offline real time Data warehousing OLAP / BI / reporting Machine learning & model training Continuous ETL Unified offline / real-time analytics Continuous monitoring (position, risk, …) Real-time behaviour modeling (pricing, recommenders, …) Real-time alerts (fraud, security, …) Real-time ML model training/evaluation Distributed OLTP applications Choice of data architecture defines which uses cases you can cover!
  • 6. Modern Streaming Data Architecture Stream Processing Stream Processing Stream Processing Stream Storage Long Term Storage Results / Views Triggered Applications Event Producers
  • 7. Reference Streaming Data Architecture: Flink + Pulsar Stream Storage Long Term Storage Results / Views Stateful Functions Event Producers
  • 8.
  • 9. Apache Flink: Analytics and Applications on Streaming Data Flink Runtime Stateful Computations over Data Streams Stateful Stream Processing Streams, State, Time Streaming Analytics SQL and Tables Event-driven Applications Stateful Functions
  • 10. Apache Flink in Numbers ● Contributors: 887 ● Most active Apache ML ● Most active sources by visits and commits (#2) ● Github stars: 16.4k ● Commits: >27k ● Releases: 1.13 latest major release ● Maven downloads per month: ~ 170k ● LOC: > 1.8 million ● Latency: < 1s ● Throughput: 4 billion events/s, 7 TB/s ● Input size: 100 TB for batch job
  • 12. One of The Most Advanced Stream Processors ● First class support for state ○ Asynchronous barrier checkpointing algorithm to create globally consistent checkpoints ● Event-time support ○ Correctness under delayed events ● End-to-end exactly once processing guarantees ○ Correctness under faults ● Resource elastic ○ Flink applications’ resources can be adjusted to the actual need ● Unified batch-streaming APIs ○ A single query to process historical as well as live data
  • 13. Everything Is a Stream ● Stream: Sequence of data which is made available over time ● All computation processes chunks of data over time, producing results over time → Stream processing ○ E.g. reading data from disks is done in streaming fashion ● Events can be of various forms ● Decisive difference: Is my stream bounded or not?
  • 14. SQL / Table API: Unified Batch & Stream Processing. SQL query, batch query execution: SELECT room, TUMBLE_END(rowtime, INTERVAL '1' HOUR), AVG(temperature) FROM sensors GROUP BY TUMBLE(rowtime, INTERVAL '1' HOUR), room
  • 15. Stream-Table Duality. Table (User, lastLogin), earlier snapshot: Alice 2021-01-01; Bob 2021-01-02. Table (User, lastLogin), later snapshot: Alice 2021-01-14; Bob 2021-02-01; Eve 2021-01-21. Stream: (Alice, 2021-01-01), (Bob, 2021-01-02), (Alice, 2021-01-14), (Eve, 2021-01-21), (Bob, 2021-02-01)
  • 16. SQL / Table API: Running The Same Query On Streams. SQL query, incremental query execution (interpret stream as table): SELECT room, TUMBLE_END(rowtime, INTERVAL '1' HOUR), AVG(temperature) FROM sensors GROUP BY TUMBLE(rowtime, INTERVAL '1' HOUR), room
  • 18. Data Has to Come From Somewhere ● Flink is a powerful compute engine for unified batch and stream processing that is pushing the boundaries of data processing ● For Flink to really shine, a storage system that supports efficient stream and batch ingestion is required. Pulsar offers exactly that! ● Apache Pulsar is evolving to meet the needs of a system like Flink, offering both a low-latency, high-bandwidth streaming API and a flexible architecture to support batch access
  • 19. Why is Pulsar a Great Fit For Streaming And Batch? Streaming ● Like other streaming systems, Pulsar can be used like a distributed scale-out log, providing consumer-controlled offsets and high throughput through parallel partitions of topics Batch ● In addition, Pulsar’s segment-based, multi-layer architecture allows applications to access historical data more directly via the underlying storage (Diagram labels: Producer, Consumer; Broker 1, Broker 2, Broker 3; Topic1-Part1, Topic1-Part2, Topic1-Part3; Segment 1 … Segment X; Bookie 1, Bookie 2, Bookie 3; Offload Segments)
  • 22. Adoption at Scale. While Flink and Pulsar are both widely used independently, together they offer a strong technology for unified batch and stream storage and compute. Many large organizations are adopting Flink + Pulsar for complex workloads that can take advantage of the strengths of both systems.
  • 23. State of The Pulsar-Flink Connector Today ● Pulsar-Flink connector supports Flink 1.11, 1.12 and soon 1.13 ● Exactly once source and sink via producer deduplication ● Full support for Flink Schema with integration to Pulsar Schema Registry ● Full Flink SQL support for batch and streaming modes ○ Also support for an “upsert” mode which can reason about inserts, updates, and deletes ● Support for KeyShared subscriptions for higher source parallelism than number of topic-partitions ● https://github.com/streamnative/pulsar-flink
  • 25. The Road Ahead. Many in-progress features are being developed ● Pulsar-Flink connector being upstreamed into Flink ○ Targeted for Flink 1.14 ● Pulsar Transaction Support ○ Pulsar transactions allow for stronger exactly-once processing guarantees ● Native Pulsar Watermarking ○ Pulsar is able to broker watermarks between producers and consumers ● Parallel Batch Source ○ Query multiple segments in parallel for higher throughput ● Unified Batch + Streaming Source ○ Using Pulsar’s batch mode for catch-up and then switching over to streaming ingestion
  • 26. Flink Support For Pulsar Transactions. Pulsar transactions are GA in Pulsar 2.8.0. Support in Flink is nearing completion, which allows for end-to-end exactly-once processing guarantees! Learn more about transactions at the talk Exactly-Once Made Easy: Transactional Messaging in Apache Pulsar at 11:30 AM. (Diagram labels: Job 1, Job 2, Job 3, tx)
  • 27. The Importance of Watermarks. High-quality watermarks are crucial for correct and stable stream processing jobs. For results to be correct, we need to take “Event Time” into account rather than relying solely on “Processing Time”. This is especially important when replaying or processing older streaming data. “Watermarks” advance the event-time clock; if the clock does not “tick” accurately, the results will not be correct. (Diagram labels: Ideal, Reality, Skew, Event Time)
  • 28. Native Pulsar Watermarking. Pulsar is adding support for watermarks ● Producers have a new API for injecting watermarks into the stream, with multiple producers potentially producing ● Watermarks are broadcast across all partitions, which allows the broker to dispatch more accurate watermarks to consumers, in both real-time and historical scenarios ● The API is designed to be simple to integrate with Flink, removing the complexity of accurate watermark generation from the developer // Pulsar for (int i = 0; i < NUM_MESSAGES; i++) { producer.newMessage() .value("hello with event time! " + i) .eventTime(System.currentTimeMillis()).sendAsync(); } producer.newWatermark() .eventTime(System.currentTimeMillis()).sendAsync(); // Flink FlinkPulsarSource<String> pulsarSource = new FlinkPulsarSource<>(topic, ...); pulsarSource.assignTimestampsAndWatermarks(PulsarWatermarkStrategy.forPulsarWatermarks());
  • 29. Shared Event-Time Domain End-to-End Watermarks Job 1 Job 2 Job 3 Producer W|24 W|17 W|12 W|12 W|8 W|4
  • 30. Pulsar + Flink and StreamNative + Ververica. Pulsar + Flink community collaboration ● Contributing the Pulsar-Flink connector to the Flink repository ● Evolve the connector to support more advanced Pulsar and Flink features ● Build a best-in-class open source stream processing platform StreamNative + Ververica Cloud partnership ● Help customers to unlock the full potential of stream processing
  • 31. Thanks a lot for your attention! Questions?
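
The tumbling-window query on slides 14 and 16 can be sketched in plain Java to show what it computes. This is a minimal illustration under stated assumptions, not Flink's implementation: the `Reading` and `TumblingAverage` classes are hypothetical, event times are plain epoch milliseconds, and the result key format `room@windowEnd` is chosen only for readability. Each reading is assigned to the one-hour window containing its timestamp, and temperatures are averaged per room and window.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sensor reading; field names follow the columns used in the query.
final class Reading {
    final String room;
    final long rowtime;        // event time in epoch milliseconds (assumption)
    final double temperature;

    Reading(String room, long rowtime, double temperature) {
        this.room = room;
        this.rowtime = rowtime;
        this.temperature = temperature;
    }
}

final class TumblingAverage {
    static final long HOUR_MS = 60L * 60L * 1000L;

    // Exclusive end of the one-hour tumbling window that contains rowtime.
    static long windowEnd(long rowtime) {
        return (Math.floorDiv(rowtime, HOUR_MS) + 1) * HOUR_MS;
    }

    // Groups readings by (room, window) and averages the temperature.
    // Keys have the form "room@windowEndMillis".
    static Map<String, Double> averagePerRoomAndWindow(List<Reading> readings) {
        Map<String, double[]> sumAndCount = new TreeMap<>();
        for (Reading r : readings) {
            String key = r.room + "@" + windowEnd(r.rowtime);
            double[] acc = sumAndCount.computeIfAbsent(key, k -> new double[2]);
            acc[0] += r.temperature; // running sum
            acc[1] += 1;             // running count
        }
        Map<String, Double> averages = new TreeMap<>();
        sumAndCount.forEach((key, acc) -> averages.put(key, acc[0] / acc[1]));
        return averages;
    }
}
```

The sketch processes a bounded list in one pass, which corresponds to the batch execution on slide 14. Because the accumulator per key only keeps a sum and a count, the same logic could be fed one reading at a time, which is the idea behind the incremental execution on slide 16.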

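The stream-table duality on slide 15 can be illustrated with a small Java sketch that replays the login events from the slide and upserts them into a table keyed by user. The `StreamTableDuality` class and its `materialize` method are hypothetical names introduced here for illustration; this is not how Flink stores tables internally.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class StreamTableDuality {
    // Materializes a table from a changelog stream by upserting on the user key.
    // Each event is a two-element array: {user, lastLogin}.
    static Map<String, String> materialize(List<String[]> events) {
        Map<String, String> table = new LinkedHashMap<>();
        for (String[] event : events) {
            // A later event for the same user overwrites the existing row.
            table.put(event[0], event[1]);
        }
        return table;
    }

    public static void main(String[] args) {
        List<String[]> stream = List.of(
            new String[] {"Alice", "2021-01-01"},
            new String[] {"Bob", "2021-01-02"},
            new String[] {"Alice", "2021-01-14"},
            new String[] {"Eve", "2021-01-21"},
            new String[] {"Bob", "2021-02-01"});

        // Table after the first two events, then after the whole stream.
        System.out.println(materialize(stream.subList(0, 2)));
        System.out.println(materialize(stream));
    }
}
```

Replaying a prefix of the stream yields the earlier table snapshot, and replaying the full stream yields the later one, so the stream carries the same information as the table's change history.
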
Editor's notes

  1. Faster results → faster business decisions → competitive advantage → more money
  2. Generalization of batch processing: processing of historical data; processing of live data; processing both historical data + live data.
  3. Brings together the worlds of data analytics and event-driven applications. Building powerful applications which benefit from real-time analytics (e.g. fraud detection, match-making, etc.).
  4. Real-time search and recommendation models (e.g., Alibaba). Build a real-time session behavior profile of users (e.g., Netflix). Real-time trade settlement dashboard (e.g., UBS). Real-time revenue accounting (various AdTechs). Machine Learning-based anomaly/fraud detection (e.g., ING, Microsoft). Real-time data refinement and data pipelines (many). Real-time Analytics Platforms (e.g., Alibaba, Uber, Lyft, Yelp!, Tencent). Materializing Views (dashboards, data marts). ETL - batch and continuous. Machine Learning Training (Alibaba, new ML library). Sub-second latency. Peak throughput: 4B events/s. Throughput size: 7 TB/s. Largest batch job input size: 100 TB. Flink cluster size: 30K.
  5. If everything is a stream, can I treat it with a single API? Table API, SQL and DataStream can handle bounded (batch) and unbounded streams. Write the program once → run it on live and historic data. Lower development and maintenance costs. Having a single engine for batch and stream processing. Do batch processing when catching up with the stream. One engine to run batch & streaming workloads. Scheduler which can schedule bounded and unbounded tasks. Exploiting the bounded-stream property for runtime optimizations. More batching of results. Out-of-order processing. Stage-wise execution possible.
  6. In this section, I will give a brief overview of watermarks and discuss correctness and skew a bit.
  7. Watermarks are assertions about the progression of event time, according to the data producer. Pulsar is brokering the assertions from producer to consumer. The quality of the watermarks (and hence the job output) is determined by the producer; system internals (partitioning, I/O performance, etc) are effectively factored out.
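
Note 7 describes watermarks as assertions about the progression of event time. A minimal Java sketch of how a consumer with several inputs might combine such assertions is shown below. The `WatermarkTracker` class is hypothetical and is not the Pulsar or Flink API; it only illustrates the common rule that the combined event-time clock is the minimum of the per-input watermarks, and that an event at or before the current watermark counts as late.

```java
import java.util.Arrays;

final class WatermarkTracker {
    private final long[] perInput; // latest watermark seen on each input

    WatermarkTracker(int numInputs) {
        if (numInputs <= 0) {
            throw new IllegalArgumentException("numInputs must be positive");
        }
        perInput = new long[numInputs];
        Arrays.fill(perInput, Long.MIN_VALUE); // no assertion received yet
    }

    // Records a watermark for one input and returns the combined watermark.
    // A watermark never moves an input's clock backwards.
    long onWatermark(int input, long watermark) {
        perInput[input] = Math.max(perInput[input], watermark);
        return current();
    }

    // The combined clock can only advance as far as the slowest input.
    long current() {
        return Arrays.stream(perInput).min().getAsLong();
    }

    // An event is late if its timestamp is not after the combined watermark.
    boolean isLate(long eventTime) {
        return eventTime <= current();
    }
}
```

With three inputs that report watermarks 12, 17 and 24, the combined watermark is 12, so the quality of the slowest producer's watermarks bounds how quickly downstream results can be finalized.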