Large Scale Machine Learning for Fraud Prevention
Disclaimer: views expressed here are my own and do not
necessarily represent the views of PayPal, its affiliates or
subsidiaries.
Agenda
• Introduction
• Fraud Prevention
• Algorithm
• Experiments
• Conclusion
©2016 PayPal Inc. Confidential and proprietary.
INTRODUCTION
About Me
• Software Engineer/Data Scientist/ML Researcher
• Ph.D. in Computer Science
• Research in Face Recognition, Phishing/Spam, Fraud Prevention
About PayPal
The power of our platform
• Vibrant developer ecosystem
• Efficient payment processing with low SLA
• 200M active customer accounts
• Secure data storage & handling
• Traditional & NoSQL databases
PayPal operates one of the largest private clouds in the world.
We have transformed core business processes into robust service-based platforms.
Our technology transformation enables us to:
• Process payments at tremendous scale
• Accelerate the innovation of new products
• Engage world-class developers & technologists
FRAUD PREVENTION
Fraud Prevention @ PayPal
• Robust feature engineering, machine learning and statistical models
• Highly scalable and multi-layered infrastructure software
• Superior team of data scientists, researchers, financial and intelligence analysts
Fraud Prevention @ PayPal
Transaction Level
• Employs advanced machine learning and statistical models to flag fraudulent behavior up front
• More sophisticated algorithms run after the transaction is complete
Account Level
• Monitor account-level activity to identify abusive behavior
• Abusive patterns include frequent payments and suspicious profile changes
Network Level
• Monitor account-to-account interaction
• Frequent transfers of money from several accounts to one central account
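The network-level pattern on this slide (several accounts sending money to one central account) can be illustrated with a small fan-in count. This is only a sketch: the function name `find_fan_in_accounts`, the `min_senders` cutoff, and the sample transfers are hypothetical and are not taken from the talk.

```python
from collections import defaultdict


def find_fan_in_accounts(transfers, min_senders=3):
    """Return receivers that got money from at least `min_senders` distinct senders.

    `transfers` is an iterable of (sender, receiver, amount) tuples.
    The cutoff is a made-up illustration value, not a recommended setting.
    """
    senders_by_receiver = defaultdict(set)
    for sender, receiver, _amount in transfers:
        senders_by_receiver[receiver].add(sender)
    return {
        receiver: len(senders)
        for receiver, senders in senders_by_receiver.items()
        if len(senders) >= min_senders
    }


# Hypothetical sample data: three distinct accounts pay into "hub".
transfers = [
    ("a1", "hub", 50.0),
    ("a2", "hub", 75.0),
    ("a3", "hub", 20.0),
    ("a1", "hub", 10.0),  # a repeat sender is counted once
    ("a4", "b1", 15.0),
]
flagged = find_fan_in_accounts(transfers, min_senders=3)
print(flagged)
```

A real system would also need time windows and amounts; the sketch only shows the account-to-account view that the slide describes.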
Fraud Prevention – What are we up against?
• Fraudsters are becoming increasingly smart and adaptive
• Need cost-effective solutions that can model complex attack patterns not previously observed
• Need scalable and computationally efficient prediction models
Fraud Prevention – What are we up against?
• Much harder to get performance lift on our flagship models
• Need to re-examine all aspects of traditional model building
• Need out-of-the-box thinking
Area we are missing (AUC 0.96)
Fraud Prevention – What can we do to build better models?
[Figure: data matrix with rows d1 … dM, columns feature1 … featureN, and a Target (Label) column, annotated with four levers]
• Better features
• Better labeling
• Advanced ML algorithms
• Bigger, better data
LSML ALGORITHMS – ACTIVE LEARNING, DEEP LEARNING, GBT
Active Learning – What is it?
• Supervised learning algorithms require data to be labeled
• Labelling is difficult, time-consuming and expensive: active learning to the rescue
• Idea – an ML algorithm can achieve better accuracy if it is allowed to “choose the data” from which it learns*
• Overcome the labelling bottleneck by posing queries (unlabeled instances) to be labeled by a human
[Figure: active learning cycle connecting Unlabeled Data, Machine Learning Model, Select Queries, Human Annotator, Labeled Data, and (Re)Build Model]
Source*: Burr Settles
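The cycle on this slide (select queries, have a human label them, rebuild the model) can be sketched on a toy one-dimensional problem. Everything here is an assumption made for illustration: the threshold "model", the `oracle` function standing in for the human annotator, the hidden boundary at 0.62, and the query budget.

```python
def fit_threshold(labeled):
    """Fit a 1-D threshold classifier: the midpoint between the two classes."""
    negatives = [x for x, y in labeled if y == 0]
    positives = [x for x, y in labeled if y == 1]
    return (max(negatives) + min(positives)) / 2.0


def select_query(pool, threshold):
    """Pick the unlabeled point closest to the boundary (the least certain one)."""
    return min(pool, key=lambda x: abs(x - threshold))


def active_learning_loop(labeled, pool, oracle, budget):
    """Repeat: select a query, ask the oracle for its label, rebuild the model."""
    labeled = list(labeled)
    pool = list(pool)
    threshold = fit_threshold(labeled)
    for _ in range(budget):
        if not pool:
            break
        x = select_query(pool, threshold)
        pool.remove(x)
        labeled.append((x, oracle(x)))       # the "human annotator" step
        threshold = fit_threshold(labeled)   # the "(re)build model" step
    return threshold, labeled


def oracle(x):
    """Hypothetical annotator: the true (hidden) boundary sits at 0.62."""
    return 1 if x >= 0.62 else 0


seed = [(0.0, 0), (1.0, 1)]              # two labeled points to start from
pool = [i / 100 for i in range(1, 100)]  # 99 unlabeled candidates
threshold, labeled = active_learning_loop(seed, pool, oracle, budget=8)
print(round(threshold, 3), len(labeled))
```

With only eight queries the learned threshold lands close to the hidden boundary, because each query is spent where the current model is least certain.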
Active Learning – What is it?
• Scenarios
  • Membership Query Synthesis – the learner may request labels for any unlabeled instance in the input space
  • Stream-based Selective Sampling – unlabeled instances are drawn one at a time and the learner decides whether to discard or query each one
  • Pool-based Sampling – instances are queried from a pool according to an informativeness measure
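As a rough illustration of the stream-based scenario, the sketch below looks at one instance at a time and either queries it or discards it. The uncertainty band around 0.5 and the `predict_proba` callable are hypothetical choices for this example; the slides do not say how the decision is made.

```python
def stream_selective_sampling(stream, predict_proba, band=0.2):
    """Split a stream into instances to query and instances to discard.

    An instance is queried when the model's fraud probability falls inside
    a band of width `band` centred on 0.5 (an assumed decision rule).
    """
    queried, discarded = [], []
    for x in stream:
        p = predict_proba(x)
        if abs(p - 0.5) <= band / 2:
            queried.append(x)    # model is unsure: ask for a label
        else:
            discarded.append(x)  # model is confident: skip labelling
    return queried, discarded


# Hypothetical model: the instance value itself is used as the score.
stream = [0.05, 0.45, 0.55, 0.95, 0.5]
queried, discarded = stream_selective_sampling(stream, lambda x: x)
print(queried, discarded)
```

Pool-based sampling differs in that all candidates are scored first and the most informative ones are picked, instead of deciding on each instance as it arrives.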
Active Learning – What is it?
• Query Strategy Frameworks
  • Uncertainty Sampling
  • Query-By-Committee
  • Expected Model Change
  • Expected Error Reduction
  • Variance Reduction
  • Density-Weighted Methods
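For the first strategy in this list, uncertainty sampling, three commonly used scores are least confidence, margin, and entropy. The sketch below computes them from predicted class probabilities; the transaction ids and probabilities are made up for illustration.

```python
import math


def least_confidence(probs):
    """1 minus the top class probability; higher means more uncertain."""
    return 1.0 - max(probs)


def margin(probs):
    """Gap between the two most likely classes; lower means more uncertain."""
    top_two = sorted(probs, reverse=True)[:2]
    return top_two[0] - top_two[1]


def entropy(probs):
    """Shannon entropy in nats; higher means more uncertain."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical [P(legitimate), P(fraud)] predictions for three transactions.
pool_probs = {
    "t1": [0.95, 0.05],
    "t2": [0.55, 0.45],
    "t3": [0.80, 0.20],
}
most_uncertain = max(pool_probs, key=lambda t: entropy(pool_probs[t]))
print(most_uncertain)
```

For a two-class problem such as fraud versus non-fraud, all three scores rank instances in the same order, so the choice matters mainly when there are more classes.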
Active Learning – Toy Example
• Toy data – 400 instances
• Model using random sampling – 70% accuracy
• Model using active learning (uncertainty sampling) – 90% accuracy
Active Learning For Fraud Prevention – Why is it unique?
• Data is unbalanced
• Fraud labelling requires trained experts and can’t be outsourced
• Fraud labelling is time-consuming
• Fraud labelling requires more than individual instances; it requires the transactions before and after
• Fraud labelling requires data from other entities (e.g., IP address)
• Fraud labelling requires aggregate data
• Fraud tags mature at different times (e.g., chargebacks) and are not instantaneous
Active Learning For Fraud Prevention – High Level Framework
[Figure: high-level framework diagram. Components shown: Labeled Data, Create Bags, Deep Learning Model, GBT Model, (Re)Build Models, Unlabeled Data, Predict, Query By Committee, Human Expert, Create Statistics, Active Feature Engineering, Simulate Features]
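The framework lists a Deep Learning Model and a GBT Model next to Query By Committee. One plausible reading, which is an assumption here and not something the slide states, is that the two models form the committee and the transactions they disagree on most are sent to the human expert. A minimal sketch of that selection step, with made-up scores:

```python
def select_by_disagreement(ids, scores_a, scores_b, k):
    """Return the k ids on which two models' fraud scores differ the most."""
    ranked = sorted(
        ids,
        key=lambda i: abs(scores_a[i] - scores_b[i]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical fraud scores from two committee members.
deep_learning_scores = {"t1": 0.90, "t2": 0.20, "t3": 0.10, "t4": 0.70}
gbt_scores = {"t1": 0.85, "t2": 0.80, "t3": 0.15, "t4": 0.30}

to_label = select_by_disagreement(
    list(deep_learning_scores), deep_learning_scores, gbt_scores, k=2
)
print(to_label)
```

The absolute score difference is just one simple disagreement measure; vote entropy or KL divergence are common alternatives for larger committees.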
Modeling Algorithm – Deep Learning
[Figure: neural network with an Input Layer, Hidden Layers, and an Output Layer]
• If a network has many layers of non-linearity, it is “deep”
• Needs a scalable platform
• Needs lots of training data
Modeling Algorithm – Deep Learning
• Network topology – feed-forward
• Key parameters
  • # of hidden layers
  • # of neurons at each hidden layer
  • Regularization
  • Activation function
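To make the key parameters concrete, here is a small pure-Python forward pass through a feed-forward network. It is a sketch under stated assumptions: the layer sizes, the ReLU hidden activation, the sigmoid output, and the L2 penalty are illustrative choices, not the configuration used in the talk.

```python
import math
import random


def relu(z):
    return max(0.0, z)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_network(n_inputs, hidden_sizes, seed=0):
    """Build one (weights, biases) pair per layer.

    `hidden_sizes` sets both the number of hidden layers and the number
    of neurons in each; a single output neuron is appended.
    """
    rng = random.Random(seed)
    sizes = [n_inputs] + list(hidden_sizes) + [1]
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                   for _ in range(n_out)]
        biases = [0.0] * n_out
        layers.append((weights, biases))
    return layers


def forward(layers, x, activation=relu):
    """Feed-forward pass: hidden layers use `activation`, the output uses sigmoid."""
    a = list(x)
    for idx, (weights, biases) in enumerate(layers):
        act = sigmoid if idx == len(layers) - 1 else activation
        a = [act(sum(w * v for w, v in zip(row, a)) + b)
             for row, b in zip(weights, biases)]
    return a[0]


def l2_penalty(layers, lam):
    """L2 regularization term: lam times the sum of squared weights."""
    return lam * sum(w * w for weights, _ in layers for row in weights for w in row)


net = init_network(n_inputs=4, hidden_sizes=[8, 8])  # two hidden layers of 8 neurons
score = forward(net, [0.1, 0.5, -0.3, 0.9])
print(len(net), score, l2_penalty(net, lam=0.01))
```

Training (backpropagation) is left out; the point is only to show where each listed parameter enters the model.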
Modeling Algorithm – Gradient Boosting Trees
• GBT = Gradient Descent + Boosting
• Fit an additive (ensemble) model in a forward stage-wise manner
• In each stage, introduce a new model to compensate for the shortcomings of the existing models
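The stage-wise idea can be sketched with squared loss, where the "shortcomings" of the current ensemble are simply its residuals. The example below fits one-split regression stumps to those residuals on toy 1-D data; the data, the stump learner, and the parameter values are assumptions for illustration and are much simpler than a production GBT.

```python
def fit_stump(xs, residuals):
    """Least-squares regression stump: one split, one constant per side."""
    best = None
    for split in sorted(set(xs))[1:]:
        left = [r for x, r in zip(xs, residuals) if x < split]
        right = [r for x, r in zip(xs, residuals) if x >= split]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        err = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, split, left_mean, right_mean)
    _, split, left_mean, right_mean = best
    return lambda x: left_mean if x < split else right_mean


def fit_gbt(xs, ys, n_trees=20, learning_rate=0.3):
    """Forward stage-wise additive model for squared loss."""
    base = sum(ys) / len(ys)
    stumps = []
    preds = [base] * len(xs)
    for _ in range(n_trees):
        # For squared loss the negative gradient is the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


def training_mse(predict, xs, ys):
    return sum((y - predict(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Toy data: a noisy step function.
xs = [float(i) for i in range(10)]
ys = [0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0]

mean_y = sum(ys) / len(ys)
baseline_mse = training_mse(lambda x: mean_y, xs, ys)
mse_1 = training_mse(fit_gbt(xs, ys, n_trees=1), xs, ys)
mse_20 = training_mse(fit_gbt(xs, ys, n_trees=20), xs, ys)
print(baseline_mse, mse_1, mse_20)
```

Training error keeps falling as stages are added, which is also why a stopping point has to be chosen on held-out data.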
Modeling Algorithm – Gradient Boosting Trees
• Strengths
  • No pre-processing required
  • Robust
  • Scalable
• Weaknesses
  • Can overfit (need to find a proper stopping point)
  • Sensitive to noise
• Key parameters
  • # of trees
  • Max depth
  • Max observations
  • Learning rate
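One common way to find a stopping point is to track validation loss as trees are added and stop once it has not improved for a few trees in a row. The sketch below assumes that rule; the `patience` parameter and the loss values are hypothetical and do not come from the talk.

```python
def pick_stopping_point(val_losses, patience=3):
    """Return the number of trees (1-based) with the best validation loss.

    The scan stops once `patience` consecutive trees fail to improve
    on the best loss seen so far.
    """
    best_n, best_loss, since_best = 0, float("inf"), 0
    for n, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_n, best_loss, since_best = n, loss, 0
        else:
            since_best += 1
            if since_best >= patience:
                break
    return best_n


# Hypothetical validation loss after 1, 2, ... 8 trees.
val_losses = [0.50, 0.40, 0.35, 0.33, 0.34, 0.36, 0.37, 0.30]
n_trees = pick_stopping_point(val_losses, patience=3)
print(n_trees)
```

A small `patience` stops early and saves training time but can miss a later improvement, as the last value in the sample list shows; a larger one trades compute for a more thorough search.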
EXPERIMENTS
Datasets
• Training data
  • 1 year
  • 11 million transactions (1 million for active labelling)
• Test data
  • 4 months
  • 4 million transactions
• # of features
  • 500 - 600
Tools
• H2O
  • Open source
  • Scalable
  • Robust
  • Deep Learning & GBM implementations
• R
  • Open source
  • Active learning package
Early Results – Active Learning Shows Promise…
# of instances queried    AUC (*weighted)
0                         0.960
1000                      0.961
10000                     0.963
50000                     0.971
100000                    0.975
500000                    0.977
1000000                   0.979
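The reported numbers can be read as a share of the "area we are missing" from the earlier slide: at an AUC of 0.960 the missing area is 0.040. The sketch below does that arithmetic for each row. It treats the weighted AUC like an ordinary AUC for this purpose, which is an assumption, and the function name `summarize` is made up.

```python
# (instances queried, weighted AUC) pairs as reported on the slide.
results = [
    (0, 0.960),
    (1000, 0.961),
    (10000, 0.963),
    (50000, 0.971),
    (100000, 0.975),
    (500000, 0.977),
    (1000000, 0.979),
]


def summarize(results):
    """For each row after the baseline, return (queried, AUC lift, share of gap closed)."""
    baseline_auc = results[0][1]
    missing_area = 1.0 - baseline_auc
    summary = []
    for queried, auc in results[1:]:
        lift = auc - baseline_auc
        summary.append((queried, lift, lift / missing_area))
    return summary


summary = summarize(results)
for queried, lift, gap_closed in summary:
    print(f"{queried:>8} queried: +{lift:.3f} AUC, {gap_closed:.1%} of the missing area closed")
```

On these figures, most of the gain arrives by 100,000 queried instances, with smaller increments after that.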
CONCLUSIONS
Conclusions
• Deep learning & GBT have shown tremendous performance for fraud detection
• Active learning shows promise in improving the performance of these champion models
• Active learning also significantly reduces our labelling cost

08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 

Dernier (20)

08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 

Large Scale Graph Processing & Machine Learning Algorithms for Payment Fraud Prevention

  • 1. Large Scale Machine Learning for Fraud Prevention. Disclaimer: the views expressed here are my own and do not necessarily represent the views of PayPal, its affiliates, or subsidiaries.
  • 4. © 2016 PayPal Inc. Confidential and proprietary. About Me
      • Software Engineer / Data Scientist / ML Researcher
      • Ph.D. in Computer Science
      • Research in face recognition, phishing/spam, and fraud prevention
  • 5. About PayPal: the power of our platform
      • 200M active customer accounts
      • Vibrant developer ecosystem
      • Efficient payment processing with low SLA
      • Secure data storage and handling
      • Traditional and NoSQL databases
      • PayPal operates one of the largest private clouds in the world
      • We have transformed core business processes into robust service-based platforms
      • Our technology transformation enables us to:
          • Process payments at tremendous scale
          • Accelerate the innovation of new products
          • Engage world-class developers and technologists
  • 7. Fraud Prevention @ PayPal
      • Robust feature engineering, machine learning, and statistical models
      • Highly scalable, multi-layered infrastructure software
      • Superior team of data scientists, researchers, and financial and intelligence analysts
  • 8. Fraud Prevention @ PayPal
      • Transaction level
          • Employs advanced machine learning and statistical models to flag fraudulent behavior up front
          • Runs more sophisticated algorithms after the transaction is complete
      • Account level
          • Monitors account-level activity to identify abusive behavior
          • Abusive patterns include frequent payments and suspicious profile changes
      • Network level
          • Monitors account-to-account interaction
          • Example pattern: frequent transfers of money from several accounts to one central account
  • 9. Fraud Prevention – What are we up against?
      • Fraudsters are becoming increasingly smart and adaptive
      • Need cost-effective solutions that can model complex attack patterns not previously observed
      • Need scalable and computationally efficient prediction models
  • 10. Fraud Prevention – What are we up against?
      • Much harder to get performance lift on our flagship models
      • Need to re-examine all aspects of traditional model building
      • Need out-of-the-box thinking
      • [Figure: area we are missing (AUC 0.96)]
  • 11. Fraud Prevention – What can we do to build better models?
      • [Diagram: a training table with rows d1, d2, …, dM, columns feature1 … featureN, and a Target (Label) column]
      • Better features
      • Better labeling
      • Advanced ML algorithms
      • Bigger, better data
  • 12. LSML ALGORITHMS – ACTIVE LEARNING, DEEP LEARNING, GBT
  • 13. Active Learning – What is it?
      • Supervised learning algorithms require labeled data
      • Labelling is difficult, time-consuming, and expensive: active learning to the rescue
      • Idea: an ML algorithm can achieve better accuracy if it is allowed to "choose the data" from which it learns*
      • Overcome the labelling bottleneck by posing queries (unlabeled instances) to be labeled by a human
      • [Diagram: active learning cycle linking unlabeled data, select queries, human annotator, labeled data, (re)build model, and machine learning model]
      • *Source: Burr Settles
  • 14. Active Learning – What is it?
      • Scenarios
          • Membership query synthesis: the learner requests labels for any unlabeled instance in the input space
          • Stream-based selective sampling: unlabeled instances are drawn one at a time, and the learner decides whether to discard or query each one
          • Pool-based sampling: instances are queried from a pool according to an informativeness measure
  • 15. Active Learning – What is it?
      • Query strategy frameworks
          • Uncertainty sampling
          • Query-by-committee
          • Expected model change
          • Expected error reduction
          • Variance reduction
          • Density-weighted methods
  • 16. Active Learning – Toy Example
      • Toy data: 400 instances
      • Model using random sampling: 70% accuracy
      • Model using active learning (uncertainty sampling): 90% accuracy
  • 17. Active Learning For Fraud Prevention – Why is it unique?
      • Data is unbalanced
      • Fraud labelling requires trained experts and can't be outsourced
      • Fraud labelling is time-consuming
      • Fraud labelling requires more than individual instances: it also needs the transactions before and after
      • Fraud labelling requires data from other entities (e.g., IP address)
      • Fraud labelling requires aggregate data
      • Fraud tags mature at different times (e.g., chargebacks) rather than instantaneously
  • 18. Active Learning For Fraud Prevention – High Level Framework
      • [Diagram components: labeled data, create bags, deep learning model, GBT model, (re)build models, unlabeled data, predict, query by committee, human expert, create statistics, active feature engineering, simulate features]
  • 19. Modeling Algorithm – Deep Learning
      • [Diagram: input layer, hidden layers, output layer]
      • A network with many layers of non-linearity is "deep"
      • Needs a scalable platform
      • Needs lots of training data
  • 20. Modeling Algorithm – Deep Learning
      • Network topology: feed-forward
      • Key parameters
          • # of hidden layers
          • # of neurons at each hidden layer
          • Regularization
          • Activation function
  • 21. Modeling Algorithm – Gradient Boosting Trees
      • GBT = gradient descent + boosting
      • Fit an additive (ensemble) model in a forward stage-wise manner
      • In each stage, introduce a new model to compensate for the shortcomings of the existing models
  • 22. Modeling Algorithm – Gradient Boosting Trees
      • Strengths
          • No pre-processing required
          • Robust
          • Scalable
      • Weaknesses
          • Overfits (need to find a proper stopping point)
          • Sensitive to noise
      • Key parameters
          • # of trees
          • Max depth
          • Max observations
          • Learning rate
  • 24. Datasets
      • Training data
          • 1 year
          • 11 million transactions (1 million for active labelling)
      • Test data
          • 4 months
          • 4 million transactions
      • # of features: 500–600
  • 25. Tools
      • H2O
          • Open source
          • Scalable
          • Robust
          • Deep Learning & GBM implementations
      • R
          • Open source
          • Active learning package
  • 26. Early Results – Active Learning Shows Promise…
      # of instances queried    AUC (*weighted)
      0                         0.960
      1000                      0.961
      10000                     0.963
      50000                     0.971
      100000                    0.975
      500000                    0.977
      1000000                   0.979
  • 28. Conclusions
      • Deep learning and GBT have shown tremendous performance for fraud detection
      • Active learning shows promise in improving the performance of these champion models
      • Active learning also significantly reduces our labelling cost
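
The pool-based, uncertainty-sampling loop from slides 13 to 16 can be sketched roughly as follows. This is a minimal illustration, not the deck's actual experiment: the 1-D threshold model, the `oracle` function standing in for the human annotator, the hidden threshold of 0.6, and the randomly generated 400-point pool are all assumptions made for the example, so the accuracies it prints will not match the toy-example figures on slide 16.

```python
import random


def oracle(x, true_threshold=0.6):
    """Hypothetical stand-in for the human annotator: reveals the hidden label."""
    return 1 if x > true_threshold else 0


def fit_threshold(labeled):
    """Fit a 1-D threshold model: midpoint between the nearest opposite-class points."""
    negatives = [x for x, y in labeled if y == 0]
    positives = [x for x, y in labeled if y == 1]
    lo = max(negatives) if negatives else 0.0
    hi = min(positives) if positives else 1.0
    return (lo + hi) / 2.0


def accuracy(threshold, points):
    """Fraction of points whose predicted label matches the oracle's label."""
    correct = sum((1 if x > threshold else 0) == oracle(x) for x in points)
    return correct / len(points)


def run(pool, budget, strategy, rng):
    """Label `budget` instances from the pool, rebuilding the model before each query."""
    unlabeled = list(pool)
    labeled = []
    for _ in range(budget):
        threshold = fit_threshold(labeled)  # (re)build model
        if strategy == "uncertainty":
            # Select the instance closest to the decision boundary,
            # i.e. the one the current model is least certain about.
            query = min(unlabeled, key=lambda x: abs(x - threshold))
        else:
            # Baseline: pick an instance uniformly at random.
            query = rng.choice(unlabeled)
        unlabeled.remove(query)
        labeled.append((query, oracle(query)))  # ask the annotator
    return fit_threshold(labeled)


rng = random.Random(0)
pool = [rng.random() for _ in range(400)]  # synthetic pool, not the deck's data
budget = 10

active_acc = accuracy(run(pool, budget, "uncertainty", rng), pool)
random_acc = accuracy(run(pool, budget, "random", rng), pool)
print(f"uncertainty sampling: {active_acc:.3f}, random sampling: {random_acc:.3f}")
```

In this one-dimensional setting, uncertainty sampling behaves like a binary search for the decision boundary, which is why a small labelling budget goes a long way. Real fraud data is high-dimensional and unbalanced, so the gain there has to be measured rather than assumed.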
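
The forward stage-wise idea from slide 21 (each new model compensates for the shortcomings of the existing ones) can be sketched as below. This is a simplified illustration under stated assumptions, not H2O's GBM implementation: it uses squared loss, a single input feature, depth-1 trees (stumps), and a made-up regression target. The names `fit_stump` and `fit_gbt` are hypothetical helpers introduced for this example; `n_trees` and `learning_rate` correspond to two of the key parameters listed on slide 22.

```python
def fit_stump(xs, residuals):
    """Fit a depth-1 regression tree: the single split that minimizes squared error."""
    best = None
    # Candidate splits: every distinct x except the largest, so both sides are non-empty.
    for split in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= split]
        right = [r for x, r in zip(xs, residuals) if x > split]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, split, left_mean, right_mean)
    _, split, left_mean, right_mean = best

    def stump(x):
        return left_mean if x <= split else right_mean

    return stump


def fit_gbt(xs, ys, n_trees=50, learning_rate=0.1):
    """Build an additive model stage by stage, each stump fitted to current residuals."""
    base = sum(ys) / len(ys)  # stage 0: constant prediction
    stumps = []
    preds = [base] * len(xs)
    for _ in range(n_trees):
        # For squared loss, the negative gradient is simply the residual y - F(x).
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Shrink each stage's contribution by the learning rate.
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)

    return predict


def mse(predict, xs, ys):
    """Mean squared error of a prediction function on (xs, ys)."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Made-up training data for illustration: y = x^2 on a small grid.
xs = [i / 20 for i in range(40)]
ys = [x * x for x in xs]

mean_y = sum(ys) / len(ys)
baseline_mse = mse(lambda x: mean_y, xs, ys)
boosted_mse = mse(fit_gbt(xs, ys), xs, ys)
print(f"constant model MSE: {baseline_mse:.4f}, boosted model MSE: {boosted_mse:.4f}")
```

Training error keeps falling as stages are added, which is the overfitting risk noted on slide 22: in practice the number of trees is chosen by monitoring error on held-out data rather than on the training set.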