In fintech, catching fraudsters is one of the primary opportunities to use streaming applications to apply ML models in real time. This talk reviews our journey to bring fraud decisioning to our tellers at Capital One using Kafka, Flink, and AWS Lambda. We will share our lessons learned and experiences with common problems such as custom windowing, breaking a monolithic app into small queryable state apps, feature engineering with Jython, dealing with back pressure from combining two disparate streams, model/feature validation in a regulatory environment, and running Flink jobs on Kubernetes.
3. Developing a Fraud Defense Platform
Fraud Defense at the Teller Using Flink
Our journey to build a Fraud Decisioning Platform and use Flink to build out the use cases
12. PROS
• Community support for Docker/Kubernetes
• Resilient
• Easy to tear down and bring back
• Maximizing resource efficiency
CONS
• Maintaining your own Kubernetes solution
• Containing the blast radius
• Edge cases when combining a number of technology solutions
Developing on Kubernetes has been challenging but very rewarding
16. A FLINK MONOLITH
• Problem: Develop a stream processing workflow for two legacy batch data sources
• First Attempt: Do everything in Flink and take advantage of Flink Connected Streams
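The connected-streams monolith can be sketched in plain Python (an illustrative simulation, not actual Flink code; the event field names like `customer_id` are assumptions): two streams are keyed on the same customer id and share state, so each teller transaction can see prior account activity.

```python
# Plain-Python sketch of the monolith's connected-streams idea:
# two streams keyed by customer id sharing state. In Flink the
# interleaving is driven by arrival order; here, for simplicity,
# account activity is processed before teller transactions.
from collections import defaultdict

def connect_streams(account_events, teller_events):
    """Merge two event streams on customer_id with shared keyed state."""
    state = defaultdict(list)   # keyed state: customer_id -> activity history
    decisions = []
    for ev in account_events:
        state[ev["customer_id"]].append(ev)
    for tx in teller_events:
        history = state[tx["customer_id"]]
        decisions.append({"customer_id": tx["customer_id"],
                          "history_size": len(history)})
    return decisions
```

In the real job, Flink's `ConnectedStreams` with a co-process function plays the role of this shared dictionary, with the state checkpointed rather than held in memory.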
18. PROS
• Cheap
• Not a lot of code/config
• Scalability / availability
• Deployments are a breeze
CONS
• Not truly stateless
• Start-up time
AWS Lambda is a good fit for our use case and works well with our underlying technologies
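A scoring Lambda in this style can be as small as the sketch below. The event shape, threshold, and scoring rule are illustrative assumptions, not the production contract; a real handler would call out to the model instead of the stand-in arithmetic here.

```python
# Hypothetical AWS Lambda handler for scoring one transaction.
import json

RISK_THRESHOLD = 0.8  # assumed cut-off for flagging a transaction

def handler(event, context=None):
    features = event.get("features", {})
    # Stand-in for a real model call (H2O/TensorFlow/etc.):
    # larger amounts score as riskier, capped at 1.0.
    score = min(1.0, features.get("amount", 0) / 10000.0)
    return {
        "statusCode": 200,
        "body": json.dumps({"score": score,
                            "flagged": score >= RISK_THRESHOLD}),
    }
```

The "start-up time" con above is the cold-start penalty: the first invocation after idle must load the runtime and model artifacts before `handler` runs.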
23. USING JYTHON TO BRIDGE THE GAP TO DATA SCIENTISTS
[Diagram: data flows into Flink, which windows it and hands it to a Jython adapter; the adapter runs .py scripts, each computing one Feature]
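One of those `.py` scripts might look like the sketch below. The function name, window shape, and transaction fields are assumptions chosen for illustration; the point is that a data scientist writes plain Python over a window of transactions and the Jython adapter runs it inside the Flink job.

```python
# Illustrative shape of a single .py feature script loaded by the
# Jython adapter: it receives the windowed transactions for one
# customer and returns a single feature value.
def compute(window):
    """Feature: total cash withdrawn within the window."""
    return sum(tx["amount"] for tx in window
               if tx.get("type") == "withdrawal")
```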
24. GITFLOW AND JYTHON IMPROVE TRACEABILITY
[Diagram: a commit goes through a pull request (which can be denied), JUnit tests, and a build (which can fail); the merged develop branch produces a versioned Feature JAR (e.g. v1.0.42), which is imported via Maven, tested again with JUnit, and built into the Flink Job JAR]
26. FEATURES EXIST TO FEED MODELS
[Diagram: Features feed a Model — H2O, TensorFlow, Seldon, whatever — which produces a Score]
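The "whatever" in that diagram suggests the model backend sits behind a common interface so it can be swapped. A minimal sketch of that idea, with a hypothetical `Model` interface (not the platform's actual API) and a toy backend standing in for an H2O/TensorFlow/Seldon call:

```python
# Sketch of "features feed models": a swappable model backend
# behind one interface. ThresholdModel is a toy stand-in.
from abc import ABC, abstractmethod

class Model(ABC):
    @abstractmethod
    def score(self, features: dict) -> float:
        """Return a fraud score in [0, 1] for one feature vector."""

class ThresholdModel(Model):
    """Toy backend: flags large amounts. A real backend would wrap
    an H2O/TensorFlow/Seldon call behind this same interface."""
    def score(self, features: dict) -> float:
        return 1.0 if features.get("amount", 0) > 5000 else 0.0

def run_pipeline(features: dict, model: Model) -> float:
    # The pipeline only depends on the interface, not the backend.
    return model.score(features)
```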
28. BREAKING UP THE MONOLITH
• Problem: Back pressure leading to delayed transactions
• Solution: Break up the monolithic Flink app into small Queryable State apps
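The resulting shape can be sketched in plain Python (an illustrative simulation, not Flink code): each small app owns keyed state for one data source and answers point queries, instead of one job joining everything via connected streams. The aggregate fields here are assumptions.

```python
# Plain-Python sketch of a "small queryable state app": it ingests
# one source's events into keyed state and serves point lookups,
# analogous to Flink keyed state read via the Queryable State API.
class QueryableStateApp:
    def __init__(self):
        self._state = {}   # key -> running aggregate

    def ingest(self, key, event):
        agg = self._state.setdefault(key, {"count": 0, "total": 0})
        agg["count"] += 1
        agg["total"] += event.get("amount", 0)

    def query(self, key):
        # In Flink this lookup would go through QueryableStateClient.
        return self._state.get(key)
```

Because each app only consumes its own source, a slow source backs up its own app rather than stalling the shared connected-streams operator.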
30. Features Used
• Connected Streams
• Flink Keyed State
• Checkpointing/Savepointing
• Queryable State
Issues
• Flink versioning (FLINK-7783, FLINK-8487)
• Keyed source function
• Kafka offsets
We had a lot of fun and success using Flink, but not without a few hiccups
31. QUESTIONS?
Speaker notes
Jeff Intro
Andrew Intro
We are part of the Forest teams (very high-level intro)
Kubernetes-based fraud decisioning platform that you can deploy multiple fraud use cases on
With the goal of being able to rapidly spin up fraud apps
Running in Production since September 2017
Our talk today:
Talk briefly about our journey building out this Forest platform using Kubernetes as well as talk about how we used Flink with Kubernetes at a high level
Then talk about a specific use case we have on the platform and do a deep dive on what’s inside our Flink app
Customers First
If one day you take a look at your bank account and it's empty
However, if your account was locked for no reason you would be upset
This balance between catching/stopping fraud and providing a great customer experience is a common tension that we have to deal with
If we wanted to stop fraud completely we could just stop letting people take their money
On a similar note, we have a limited number of fraud operators
Do not have the manpower to call every single person up and ask them
Primary directive of the platform is to empower Data Scientists/ Data Analysts by building the tools on the platform to help create the models needed to make decisions
This includes having access to all the data in a fast and easy-to-understand format
Seeing how their models are performing, and whether the features are being calculated as expected
When they need to refit the model they need to be able to do the data transformations quickly so we can turn a refreshed model around
Lastly as we are developing a fraud platform, we need to keep in mind the engineers/developers that will be developing the fraud app
it should be something that engineers enjoy developing on
When you have a feature/model/action repository it's very easy to develop and turn around fraud apps
To help us balance these different needs we have our product owners to help bridge the gap
14 EC2s
6 m4.10xlarge for general minions
5 m4.2xlarge for kafka nodes
3 m4.large for masters
Ansible to provision
200+ pods
Flink apps in Java/Scala/Kotlin
Microservices in Golang
Holy smokes that’s a lot
Zookeeper/Kafka/Flink/Nifi
Kappa Architecture
Kafka is our primary messaging bus throughout the platform
Nifi is one of the tools we use to grab data from different sources in the company
Flink does the calculations and applies needed transformations
Minio/Istio to handle http communications throughout the platform
EFK = ElasticSearch / FluentD / Kibana
Docker logs
Managed AWS service
Influx / Prometheus / Grafana
Metrics reporting and Dashboards
Platform health
Fraud health
Drill / zeppelin / s3 for data analysts to view transactions
Why are we switching from influx to prometheus
Kubernetes has been a challenge
If a task manager goes down, it will auto-heal
If your configurations are set up correctly you can just delete pods and they’ll come back
Unless your configurations are completely fleshed out, the blast radius on failure can be rippling
We hit a situation where Docker logs could not make it out to the Kubernetes logs because the Docker machines were dying
Developed internal tool for ci/cd and deployment
Use cases tell us the resources they need and we provision them a flink cluster
1 Job Manager per cluster
5 Task Managers per cluster
RocksDB backend
Checkpoint/Savepoint persist on S3
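The cluster settings above roughly correspond to a flink-conf.yaml fragment like this (these are real Flink configuration keys, but the bucket name and slot count are illustrative):

```yaml
# State backend and checkpoint/savepoint persistence as described above
state.backend: rocksdb
state.checkpoints.dir: s3://example-fraud-bucket/checkpoints   # illustrative bucket
state.savepoints.dir: s3://example-fraud-bucket/savepoints
taskmanager.numberOfTaskSlots: 4                               # illustrative value
```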
Job Deployment Options
Considerations
People obviously don’t want to wait too long
But we want to respond with the most data we have available on the customer
Two data streams need to share state
Data stream from online interactions / all other customer interactions
Data stream that we receive from the branch
Need to calculate Features
Need to apply ML model
Need to respond in real-time
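The tension in these considerations, respond quickly but with the most data available, can be sketched as a deadline-bounded enrichment lookup (an illustrative simulation; the SLA budget, scores, and function names are assumptions):

```python
# Sketch of the latency/completeness trade-off: score with whatever
# enrichment data has arrived by the SLA deadline rather than
# blocking indefinitely on the second stream.
import time

SLA_SECONDS = 0.2  # assumed response budget

def decide(tx, get_enrichment, deadline=SLA_SECONDS):
    start = time.monotonic()
    enrichment = None
    while time.monotonic() - start < deadline:
        enrichment = get_enrichment(tx["customer_id"])
        if enrichment is not None:
            break
        time.sleep(0.01)
    # Illustrative scoring: missing context is treated as riskier;
    # the point is that a partial answer beats a missed SLA.
    score = 0.9 if enrichment is None else 0.1
    return {"score": score, "enriched": enrichment is not None}
```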
Developed in python, evaluating golang
Developed internal tool for ci/cd and deployment
Teller transactions have a real-time SLA
Connected Streams is the culprit
Break Up One Flink App into Smaller Flink Queryable State Apps
Flink Apps as Functions
Disparate Data Streams: Back Pressure
In our case: we have all the account level activity for a given customer from one source and on the other we have the data from the teller machine
Not all transactions are equal due to their source. However, in an ML world we still want to examine every transaction
Results in back pressure and uneven transaction flow
Alvin for each data source
Scurry of Alvins build out our feature repository
Theodore builds his own features, adds on features from Alvin, and then passes it down
Why did we break Simon out?
We can replace it with anything such as Seldon