Developing big data products and solutions at scale brings many challenges for teams of platform architects, data scientists, and data engineers. While it is easy to find ourselves working in silos, successful organizations collaborate intensively across disciplines so that problems are fully understood and a proposed model and solution can be scaled and optimized on multi-terabyte datasets.
6. What does typical fraudster behavior look like?
▪ Bot farms
▪ Domain spoofing
7. How do we detect suspicious behavior?
We need data:
▪ DMP data, bid request data, and data from SSPs and DSPs
▪ Device network data
We need a model that is adaptive and can detect different anomalies, which requires historical data.
We need to be able to scale the model to network data volumes of 4-10 TB per day.
8. Building Data Science Products
[Architecture diagram: the T-Mobile data platform and 3rd-party data arrive as CSV, ORC, and Parquet files; data science products run on YARN/Mesos with MR, Tez, Spark, and Storm as execution engines.]
9. Building a Data Science Product: Working Pipeline
[Pipeline diagram: read data (ORC, Parquet, CSV) → develop model → save model and outputs → visualization and business interpretation.]
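A minimal PySpark skeleton of this working pipeline, as a sketch only; the paths, column names, and model choice below are illustrative assumptions, not the talk's actual code:

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier

spark = SparkSession.builder.appName("fraud-pipeline").getOrCreate()

# READ DATA: the same reader API handles ORC, Parquet, and CSV
df = spark.read.parquet("/data/events")  # hypothetical path

# DEVELOP MODEL: assemble feature columns and fit a classifier
assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
assembled = assembler.transform(df)
model = RandomForestClassifier(labelCol="label").fit(assembled)

# SAVE MODEL AND OUTPUTS: persist both for downstream
# visualization and business interpretation
model.write().overwrite().save("/models/fraud_rf")
model.transform(assembled).write.mode("overwrite").parquet("/data/scored")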
10. Spark and Big Data
▪ When working with big data, Spark becomes a necessity
▪ Hive and plain SQL do not support machine learning
▪ Python and R cannot operate on large datasets (> 4 GB) on a single machine
19. From Python to PySpark
▪ Many data scientists are most comfortable coding in Python
▪ Spark can seem very intimidating to the newcomer
▪ UDFs provide a useful tool to run Python code in Spark (see the sketch after this list)
▪ But it is often still much more efficient to run PySpark code directly
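For instance, a plain Python function can be wrapped as a UDF and applied to a DataFrame column. This is a sketch with made-up data and names:

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 10.0), (2, 250.0)], ["id", "spend"])

# Ordinary Python logic...
def spend_ratio(spend):
    return min(spend / 100.0, 1.0)

# ...registered as a UDF so Spark can run it row by row on executors
spend_ratio_udf = udf(spend_ratio, DoubleType())
df.withColumn("ratio", spend_ratio_udf(col("spend"))).show()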
20. Python vs PySpark schema
Python stages:
▪ Load Data: data = pd.read_csv()
▪ Data Split: from … import train_test_split
  X_train, X_test, y_train, y_test = train_test_split()
▪ Model: from … import RandomForestRegressor

PySpark stages:
▪ Load Data: data = spark.read.option().csv()
▪ Features: from … import VectorAssembler
▪ Model: from … import RandomForestClassifier
  rf = RandomForestClassifier()
▪ Pipeline: from … import Pipeline
  pipeline = Pipeline()
▪ Data Split: (trainDF, testDF) = df.randomSplit()
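Filling in the elided imports with their usual homes in pyspark.ml (an assumption about what the slide abbreviated), the PySpark column assembles into roughly this runnable sketch; the file path and column names are made up:

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml import Pipeline

spark = SparkSession.builder.getOrCreate()

# Load Data
data = spark.read.option("header", True).csv("/data/train.csv", inferSchema=True)

# Features: combine raw columns into a single vector column
assembler = VectorAssembler(inputCols=["x1", "x2"], outputCol="features")

# Model
rf = RandomForestClassifier(labelCol="label", featuresCol="features")

# Pipeline: chain feature assembly and the model
pipeline = Pipeline(stages=[assembler, rf])

# Data Split, then fit on the training portion
(trainDF, testDF) = data.randomSplit([0.8, 0.2], seed=42)
model = pipeline.fit(trainDF)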
23. Python UDF vs PySpark (cont.)
UDFs can serve as a useful go-between for getting from Python to Spark, but
converting to PySpark will almost always be more efficient
[Figure: Python UDF vs PySpark SQL performance comparison.]
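As a hedged side-by-side sketch (column names assumed), here is the same transformation written both ways; the native version never ships rows out to a Python worker and stays eligible for Catalyst optimization:

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(10.0,), (250.0,)], ["spend"])

# Python UDF: every row is serialized to a Python process and back
to_ratio = udf(lambda spend: min(spend / 100.0, 1.0), DoubleType())
df_udf = df.withColumn("ratio", to_ratio("spend"))

# Native PySpark: the same logic as built-in column expressions
df_native = df.withColumn("ratio", F.least(F.col("spend") / 100.0, F.lit(1.0)))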
24. Normalized Entropy: The algorithm
App               | TMO user 1 | TMO user 2 | TMO user 3 | … | TMO user 70M | Norm. Entropy Score
Facebook          | 28         | 50         | 0          | … | 154          | 76 - normal
Netflix           | 287        | 340        | 78         | … | 0            | 54 - normal
Free weather app  | 0          | 0          | 1000       | … | 0            | 0 - stalker
Misc. banking app | 1          | 1          | 0          | … | 1            | 100 - spammer
Shannon entropy: H(X) = -Σᵢ P(xᵢ) log P(xᵢ)
Normalized Shannon entropy: H_norm(X) = H(X) / log n, where the divisor log n is the maximum possible entropy over n users

Here P(xᵢ) = C(xᵢ)/C(X) is the probability that user xᵢ used the app, C(xᵢ) is the number of times the app showed up in user xᵢ's network, and C(X) is the number of times the app showed up in the entire network. A score near 0 means the app's traffic is concentrated in a single user (stalker-like), while a score at the top of the scale means it is spread evenly across users (spammer-like).
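A sketch of how this score could be computed in PySpark at the data volumes discussed earlier; the input table and its app/user columns are assumptions:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
events = spark.read.parquet("/data/app_events")  # hypothetical: one row per app sighting

# C(x_i): how often each app showed up in each user's network
counts = events.groupBy("app", "user").agg(F.count("*").alias("c_xi"))

# C(X) and n: the app's total across the whole network, and its distinct user count
totals = counts.groupBy("app").agg(
    F.sum("c_xi").alias("c_x"),
    F.countDistinct("user").alias("n_users"),
)

# H(X) = -sum P(x_i) log P(x_i), then divide by log(n) to normalize
p = F.col("c_xi") / F.col("c_x")
scores = (
    counts.join(totals, "app")
    .groupBy("app", "n_users")
    .agg((-F.sum(p * F.log(p))).alias("h"))
    .withColumn("norm_entropy", F.col("h") / F.log("n_users"))
)

Note that apps seen by only one user divide by log(1) = 0 and come out null, so single-user apps need special handling; multiplying norm_entropy by 100 would match the 0-100 scale shown in the table.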
29. Performance Tracking
import mlflow
import mlflow.spark

# Point the client at the tracking server and select an experiment
mlflow.set_tracking_uri('http://tracking-server/')
mlflow.set_experiment('datasource_0')

# Collect the small per-date metrics table to the driver
data = spark.read.(...).collect()  # reader details elided on the slide

# Log one MLflow run per date partition, with every numeric column as a metric
for row in data:
    with mlflow.start_run(run_name=row['date_part']):
        mlflow.log_metrics({m: v for (m, v) in row.asDict().items()
                            if m != 'date_part'})
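Since mlflow.spark is imported above, a natural companion step (a sketch, assuming a fitted Spark ML model named model, such as the pipeline fit earlier) is logging the model itself as a run artifact:

with mlflow.start_run(run_name='model_training'):
    # Persist the fitted Spark ML model under this run for later retrieval
    mlflow.spark.log_model(model, 'model')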