Developing big data products and solutions at scale brings many challenges for teams of platform architects, data scientists, and data engineers. While it is easy to find ourselves working in silos, successful organizations collaborate intensively across disciplines so that problems are fully understood and a proposed model and solution can be scaled and optimized on multi-terabyte datasets.
6. What does typical fraudster behavior look like?
▪ Bot farms
▪ Domain spoofing
7. How do we detect suspicious behavior?
We need data:
▪ DMP data, bid request data, and data from SSPs and DSPs
▪ Device network data
We need a model that is adaptive and can detect different anomalies, which requires historical data.
We need to be able to scale the model to network data volumes of 4-10 TB per day.
8. Building Data Science Products
[Architecture diagram: the T-Mobile data platform and 3rd-party data arrive as CSV, ORC, and Parquet files; data science products run on YARN/Mesos with MR, Tez, Spark, and Storm as execution engines.]
9. Building a Data Science Product: Working Pipeline
[Pipeline diagram: read data (ORC, Parquet, CSV) → develop model → save model and outputs → visualization and business interpretation.]
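A minimal PySpark skeleton of this working pipeline, as a sketch only; the paths, column names, and model choice below are illustrative assumptions, not the talk's actual code:

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier

spark = SparkSession.builder.appName("fraud-pipeline").getOrCreate()

# READ DATA: the same reader API handles ORC, Parquet, and CSV
df = spark.read.parquet("/data/events")  # hypothetical path

# DEVELOP MODEL: assemble feature columns and fit a classifier
assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
assembled = assembler.transform(df)
model = RandomForestClassifier(labelCol="label").fit(assembled)

# SAVE MODEL AND OUTPUTS: persist both for downstream
# visualization and business interpretation
model.write().overwrite().save("/models/fraud_rf")
model.transform(assembled).write.mode("overwrite").parquet("/data/scored")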
10. Spark and Big Data
▪ When working with big data, Spark becomes a necessity
▪ Hive and plain SQL do not support machine learning
▪ Python and R cannot operate on large datasets (> 4 GB) on a single machine
19. From Python to PySpark
▪ Many data scientists are most comfortable coding in Python
▪ Spark can seem very intimidating to the newcomer
▪ UDFs provide a useful tool to run Python code in Spark (see the sketch after this list)
▪ But it is often still much more efficient to run PySpark code directly
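For instance, a plain Python function can be wrapped as a UDF and applied to a DataFrame column. This is a sketch with made-up data and names:

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 10.0), (2, 250.0)], ["id", "spend"])

# Ordinary Python logic...
def spend_ratio(spend):
    return min(spend / 100.0, 1.0)

# ...registered as a UDF so Spark can run it row by row on executors
spend_ratio_udf = udf(spend_ratio, DoubleType())
df.withColumn("ratio", spend_ratio_udf(col("spend"))).show()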
20. Python vs PySpark schema
Python stages:
▪ Load Data: data = pd.read_csv()
▪ Data Split: from … import train_test_split
  X_train, X_test, y_train, y_test = train_test_split()
▪ Model: from … import RandomForestRegressor

PySpark stages:
▪ Load Data: data = spark.read.option().csv()
▪ Features: from … import VectorAssembler
▪ Model: from … import RandomForestClassifier
  rf = RandomForestClassifier()
▪ Pipeline: from … import Pipeline
  pipeline = Pipeline()
▪ Data Split: (trainDF, testDF) = df.randomSplit()
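Filling in the elided imports with their usual homes in pyspark.ml (an assumption about what the slide abbreviated), the PySpark column assembles into roughly this runnable sketch; the file path and column names are made up:

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml import Pipeline

spark = SparkSession.builder.getOrCreate()

# Load Data
data = spark.read.option("header", True).csv("/data/train.csv", inferSchema=True)

# Features: combine raw columns into a single vector column
assembler = VectorAssembler(inputCols=["x1", "x2"], outputCol="features")

# Model
rf = RandomForestClassifier(labelCol="label", featuresCol="features")

# Pipeline: chain feature assembly and the model
pipeline = Pipeline(stages=[assembler, rf])

# Data Split, then fit on the training portion
(trainDF, testDF) = data.randomSplit([0.8, 0.2], seed=42)
model = pipeline.fit(trainDF)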
23. Python UDF vs PySpark (cont.)
UDFs can serve as a useful go-between for getting from Python to Spark, but
converting to PySpark will almost always be more efficient
[Figure: Python UDF vs PySpark SQL performance comparison.]
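As a hedged side-by-side sketch (column names assumed), here is the same transformation written both ways; the native version never ships rows out to a Python worker and stays eligible for Catalyst optimization:

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(10.0,), (250.0,)], ["spend"])

# Python UDF: every row is serialized to a Python process and back
to_ratio = udf(lambda spend: min(spend / 100.0, 1.0), DoubleType())
df_udf = df.withColumn("ratio", to_ratio("spend"))

# Native PySpark: the same logic as built-in column expressions
df_native = df.withColumn("ratio", F.least(F.col("spend") / 100.0, F.lit(1.0)))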
24. Normalized Entropy: The algorithm
App               | TMO user 1 | TMO user 2 | TMO user 3 | … | TMO user 70M | Norm. Entropy Score
Facebook          | 28         | 50         | 0          | … | 154          | 76 - normal
Netflix           | 287        | 340        | 78         | … | 0            | 54 - normal
Free weather app  | 0          | 0          | 1000       | … | 0            | 0 - stalker
Misc. banking app | 1          | 1          | 0          | … | 1            | 100 - spammer
Shannon entropy: H(X) = -Σᵢ P(xᵢ) log P(xᵢ)
Normalized Shannon entropy: H_norm(X) = H(X) / log n, where the divisor log n is the maximum possible entropy over n users

Here P(xᵢ) = C(xᵢ)/C(X) is the probability that user xᵢ used the app, C(xᵢ) is the number of times the app showed up in user xᵢ's network, and C(X) is the number of times the app showed up in the entire network. A score near 0 means the app's traffic is concentrated in a single user (stalker-like), while a score at the top of the scale means it is spread evenly across users (spammer-like).
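A sketch of how this score could be computed in PySpark at the data volumes discussed earlier; the input table and its app/user columns are assumptions:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
events = spark.read.parquet("/data/app_events")  # hypothetical: one row per app sighting

# C(x_i): how often each app showed up in each user's network
counts = events.groupBy("app", "user").agg(F.count("*").alias("c_xi"))

# C(X) and n: the app's total across the whole network, and its distinct user count
totals = counts.groupBy("app").agg(
    F.sum("c_xi").alias("c_x"),
    F.countDistinct("user").alias("n_users"),
)

# H(X) = -sum P(x_i) log P(x_i), then divide by log(n) to normalize
p = F.col("c_xi") / F.col("c_x")
scores = (
    counts.join(totals, "app")
    .groupBy("app", "n_users")
    .agg((-F.sum(p * F.log(p))).alias("h"))
    .withColumn("norm_entropy", F.col("h") / F.log("n_users"))
)

Note that apps seen by only one user divide by log(1) = 0 and come out null, so single-user apps need special handling; multiplying norm_entropy by 100 would match the 0-100 scale shown in the table.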
29. Performance Tracking
import mlflow
import mlflow.spark

# Point the client at the tracking server and select an experiment
mlflow.set_tracking_uri('http://tracking-server/')
mlflow.set_experiment('datasource_0')

# Collect the small per-date metrics table to the driver
data = spark.read.(...).collect()  # reader details elided on the slide

# Log one MLflow run per date partition, with every numeric column as a metric
for row in data:
    with mlflow.start_run(run_name=row['date_part']):
        mlflow.log_metrics({m: v for (m, v) in row.asDict().items()
                            if m != 'date_part'})
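Since mlflow.spark is imported above, a natural companion step (a sketch, assuming a fitted Spark ML model named model, such as the pipeline fit earlier) is logging the model itself as a run artifact:

with mlflow.start_run(run_name='model_training'):
    # Persist the fitted Spark ML model under this run for later retrieval
    mlflow.spark.log_model(model, 'model')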