Introduction to Data Science and Large-scale Machine Learning

Large Scale Distributed Data Science using Spark © KDD 2015 James G. Shanahan Contact: James.Shanahan @ gmail.com
November 16, 2016
How Data and Data Science are revolutionizing the world
James G. Shanahan (1, 2)
(1) IoTGurus, (2) iSchool, UC Berkeley, CA
EMAIL: James_DOT_Shanahan_AT_gmail_DOT_com

Outline
• Introduction
• Artificial Intelligence
• Machine Learning
– Empirical Sport
– NetFlix
– Dashboards
• Data Science
• Applications
• Architecture
• What’s next?

James G. Shanahan: 25+ years in data science
[Figure: skills diagram; year labels shown in the figure: 16+, 25+, 25+ years, 16+]
• Technology: Systems, Parallel Computing, Hadoop, Spark, Python, R, Scala, Java
• Domain Expertise: Digital Advertising & Marketing, Web+mobile+local Search, Anticipatory info. systems, Cellular Networks, Social Networks
• Math & Theory: Statistics, Optimization Theory, Probability, Social Network Analytics, Geo-Informational Science, HCI, Graphs, NLP
• Leadership, Business Acumen, Teacher: Led teams of R&D, r&D (Xerox Research, AT&T, Turn, NativeX, Adobe); Entrepreneur; Teach at UC Berkeley

James G. Shanahan
• 25+ years in data science
• Currently
– Principal and Founder, Data Science Consultancy
• Clients: Target, Adobe, Akamai, Ancestry, AT&T, Nokia Siemens, SearchMe, …
– Teaching
• Co-creator of UC Berkeley MIDS program; curriculum development
• Teach Large Scale Machine Learning (Fall 2014,2015,2016)
• Teach Machine Learning and Optimization Theory at University of California
Santa Cruz (UCSC), TIM 206, TIM 209, TIM 250, TIM 251 (since 2008)
– Advising: Quixey, InferSystems, Knotch
• Previously
– NativeX: SVP of Data Science, Chief Scientist, and board member
– Founding Chief Scientist, Turn Inc.
– Principal Scientist, Clairvoyance Corp (CMU spinoff; sister lab to JRC)
– Research Scientist, Xerox Research;
– Entrepreneur: Cofounder of Document Souls and RTB Fast
• Education: PhD in ML, University of Bristol, UK; B.Sc. CS, Uni. of Limerick, Ireland

Large-Scale Machine Learning, MIDS, UC Berkeley © 2015 James G. Shanahan Contact: James.Shanahan @ gmail.com
Audience Participation is encouraged!

Outline
• Introduction
• Data Science
• Applications
• What’s next?

Data science everywhere

Traditional Data Science

Deep Learning

What is Intelligence?
• Intelligence:
– “the capacity to learn and solve problems” (Webster's dictionary)
– in particular,
• the ability to solve novel problems
• the ability to act rationally
• the ability to act like humans
– build and understand intelligent entities or agents
– 2 main approaches: “engineering” versus “cognitive
modeling”

What’s involved in Intelligence?
• Ability to interact with the real world
– to perceive, understand, and act
– e.g., speech recognition and understanding and synthesis
– e.g., image understanding
– e.g., ability to take actions, have an effect
• Reasoning and Planning
– modeling the external world, given input
– solving new problems, planning, and making decisions
– ability to deal with unexpected problems, uncertainties
• Learning and Adaptation
– we are continuously learning and adapting
– our internal models are always being “updated”
• e.g., a baby learning to categorize and recognize animals

Can machines think?  Turing Test
• In the test, an interrogator converses with a man
and a machine via a text-based channel.
– If the interrogator fails to guess which one is the machine,
then the machine is said to have passed the Turing test.
(This is a simplification; there are more nuances in and
variants of the Turing test, but these are not relevant for our
present purposes.)
• The beauty of the Turing test is its simplicity and
its objectivity, because it is only a test of
behavior, not of the internals of the machine. It
doesn't care whether the machine is using logical
methods or neural networks. This decoupling of
what to solve from how to solve is an important
theme in this class.

What AI can do for you?
Instead of asking what AI is, let us turn to the more pragmatic question
of what AI can do for you. We will go through some examples where AI
has already had a substantial impact on society.

Academic Disciplines relevant to AI
• Philosophy: logic, methods of reasoning, mind as physical system, foundations of learning, language, rationality
• Mathematics: formal representation and proof, algorithms, computation, (un)decidability, (in)tractability
• Probability/Statistics: modeling uncertainty, learning from data
• Economics: utility, decision theory, rational economic agents
• Neuroscience: neurons as information processing units
• Psychology/Cognitive Science: how do people behave, perceive, process cognitive information, represent knowledge
• Computer engineering: building fast computers
• Control theory: design systems that maximize an objective function over time
• Linguistics: knowledge representation, grammars

History of AI
• 1943: early beginnings
– McCulloch & Pitts: Boolean circuit model of brain
• 1950: Turing
– Turing's "Computing Machinery and Intelligence"
• 1956: birth of AI
– Dartmouth meeting: "Artificial Intelligence" name adopted
• 1950s: initial promise
– Early AI programs, including
– Samuel's checkers program
– Newell & Simon's Logic Theorist
• 1955-65: “great enthusiasm”
– Newell and Simon: GPS, general problem solver
– Gelernter: Geometry Theorem Prover
– McCarthy: invention of LISP

History of AI
• 1966—73: Reality dawns
– Realization that many AI problems are intractable
– Limitations of existing neural network methods identified
• Neural network research almost disappears
• 1969—85: Adding domain knowledge
– Development of knowledge-based systems
– Success of rule-based expert systems,
• E.g., DENDRAL, MYCIN
• But were brittle and did not scale well in practice
• 1986-- Rise of machine learning
– Neural networks return to popularity
– Major advances in machine learning algorithms and
applications
• 1990-- Role of uncertainty
– Bayesian networks as a knowledge representation framework
• 1995-- AI as Science
– Integration of learning, reasoning, knowledge representation
– AI methods used in vision, language, data mining, etc

http://www.andreykurenkov.com/writing/images/2016-4-15-a-brief-history-of-game-ai/0-history.png

Success Stories
• Deep Blue defeated the reigning world chess
champion Garry Kasparov in 1997
• AI program proved a mathematical conjecture
(Robbins conjecture) unsolved for decades
• During the 1991 Gulf War, US forces deployed an
AI logistics planning and scheduling program
that involved up to 50,000 vehicles, cargo, and
people
• NASA's on-board autonomous planning program
controlled the scheduling of operations for a
spacecraft
• Proverb solves crossword puzzles better than
most humans

Can Computers beat Humans at Chess?
• Chess Playing is a classic AI problem
– well-defined problem
– very complex: difficult for humans to play well
[Figure: chess ratings (points) by year, 1966 to 1997, on a scale from 1200 to 3000; series: Human World Champion, Deep Thought, Deep Blue]

Summary of State of AI Systems in Practice
• Speech synthesis, recognition and understanding
– very useful for limited vocabulary applications
– unconstrained speech understanding is still too hard
• Computer vision
– works for constrained problems (hand-written zip-codes)
– understanding real-world, natural scenes is still too hard
• Learning
– adaptive systems are used in many applications: have their limits
• Planning and Reasoning
– only works for constrained problems: e.g., chess
– real-world is too complex for general systems
• Overall:
– many components of intelligent systems are “doable”
– there are many interesting research problems remaining

Can Computers Talk?
• This is known as “speech synthesis”
– translate text to phonetic form
• e.g., “fictitious” -> fik-tish-es
– use pronunciation rules to map phonemes to actual sound
• e.g., “tish” -> sequence of basic audio sounds
• Difficulties
– sounds made by this “lookup” approach sound unnatural
– sounds are not independent
• e.g., “act” and “action”
• modern systems (e.g., at AT&T) can handle this pretty well
– a harder problem is emphasis, emotion, etc
• humans understand what they are saying
• machines don’t: so they sound unnatural
• Conclusion:
– NO, for complete sentences
– YES, for individual words

Can Computers Recognize Speech?
• Speech Recognition:
– mapping sounds from a microphone into a list of words
– classic problem in AI, very difficult
• “Let's talk about how to wreck a nice beach”
• (I really said “________________________”)
• Recognizing single words from a small
vocabulary
• systems can do this with high accuracy (order of 99%)
• e.g., directory inquiries
– limited vocabulary (area codes, city names)
– computer tries to recognize you first, if unsuccessful hands you
over to a human operator
– saves millions of dollars a year for the phone companies

Recognizing human speech (ctd.)
• Recognizing normal speech is much more difficult
– speech is continuous: where are the boundaries between words?
• e.g., “John’s car has a flat tire”
– large vocabularies
• can be many thousands of possible words
• we can use context to help figure out what someone said
– e.g., hypothesize and test
– try telling a waiter in a restaurant:
“I would like some dream and sugar in my coffee”
– background noise, other speakers, accents, colds, etc
– on normal speech, modern systems are only about 60-70%
accurate
• Conclusion:
– NO, normal speech is too complex to accurately recognize
– YES, for restricted problems (small vocabulary, single speaker)

Can Computers Understand speech?
• Understanding is different to recognition:
– “Time flies like an arrow”
• assume the computer can recognize all the words
• how many different interpretations are there?

Can Computers Understand speech? (ctd.)
– 1. time passes quickly like an arrow?
– 2. command: time the flies the way an arrow times the flies
– 3. command: only time those flies which are like an arrow
– 4. “time-flies” are fond of arrows
• only 1. makes any sense,
– but how could a computer figure this out?
– clearly humans use a lot of implicit commonsense knowledge in
communication
• Conclusion: NO, much of what we say is beyond
the capabilities of a computer to understand at
present

Can Computers Learn and Adapt?
• Learning and Adaptation
– consider a computer learning to drive on the freeway
– we could teach it lots of rules about what to do
– or we could let it drive and steer it back on course when it heads for
the embankment
• systems like this are under development (e.g., Daimler Benz)
• e.g., RALPH at CMU
– in mid 90’s it drove 98% of the way from Pittsburgh to San Diego without
any human assistance
– machine learning allows computers to learn to do things without
explicit programming
– many successful applications:
• requires some “set-up”: does not mean your PC can learn to
forecast the stock market or become a brain surgeon
• Conclusion: YES, computers can learn and adapt, when
presented with information in the appropriate way

Can Computers “see”?
• Recognition v. Understanding (like Speech)
– Recognition and Understanding of Objects in a scene
• look around this room
• you can effortlessly recognize objects
• human brain can map 2d visual image to 3d “map”
• Why is visual recognition a hard problem?
• Conclusion:
– mostly NO: computers can only “see” certain types of objects under limited circumstances
– YES for certain constrained problems (e.g., face recognition)

Can computers plan and make optimal decisions?
• Intelligence
– involves solving problems and making decisions and plans
– e.g., you want to take a holiday in Brazil
• you need to decide on dates, flights
• you need to get to the airport, etc
• involves a sequence of decisions, plans, and actions
• What makes planning hard?
– the world is not predictable:
• your flight is canceled or there’s a backup on the 405
– there are a potentially huge number of details
• do you consider all flights? all dates?
– no: commonsense constrains your solutions
– AI systems are only successful in constrained planning problems
• Conclusion: NO, real-world planning and decision-making is still beyond the capabilities of modern computers
– exception: very well-defined, constrained problems


Separate what to compute (modeling) from how to compute it (algorithms)

Lecture Outline
• Introduction
• Data Science
• Applications
• What’s next?

What is machine learning?

machine learning
• Supporting all of these models is machine learning.
• In the non-machine learning approach, one would write a complex
program (remember, we are solving tasks of significant
complexity), but this gets very tedious.
– For example, how should a spellchecker know that for "hte", "the" (transposition) is more likely to be the correct output as compared to "hate" (insertion)?
• The machine learning approach is to instead write a really simple
program with unknown parameters (e.g., numbers measuring how
bad it is to transpose or insert characters).
• Then, we obtain a set of training examples that partially specifies
the desired system behavior. A learning algorithm takes these
training examples and sets the parameters of our simple program
so that the resulting program approximately produces the desired
system behavior.
• Abstractly, machine learning allows us to shift the complexity
from the program to the data, which is much easier to obtain
(either naturally occurring or via crowdsourcing).
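
As a rough illustration of the idea above, the following Python sketch writes a very simple spelling corrector whose only unknown parameters are the costs of a transposition and of an insertion, and then sets those costs from a handful of training examples. The training pairs, the function names (edit_type, learn_costs, correct) and the choice of negative log frequency as the cost are all assumptions made for this sketch; they are not taken from the slides.

```python
import math
from collections import Counter

def edit_type(typo, word):
    """Name the single edit that turns `typo` into `word`, or None."""
    if len(typo) == len(word):
        for i in range(len(typo) - 1):
            # Swap the adjacent characters at positions i and i + 1.
            swapped = typo[:i] + typo[i + 1] + typo[i] + typo[i + 2:]
            if swapped == word:
                return "transposition"
    if len(word) == len(typo) + 1:
        for i in range(len(word)):
            # Dropping one character of `word` gives `typo`, so the fix inserts it.
            if word[:i] + word[i + 1:] == typo:
                return "insertion"
    return None

def learn_costs(examples):
    """Set the parameters from data: rarer edit types get a higher cost."""
    counts = Counter(edit_type(typo, word) for typo, word in examples)
    counts.pop(None, None)
    total = sum(counts.values())
    return {kind: -math.log(n / total) for kind, n in counts.items()}

def correct(typo, candidates, costs):
    """Pick the candidate whose edit is cheapest under the learned costs."""
    scored = []
    for word in candidates:
        kind = edit_type(typo, word)
        if kind in costs:
            scored.append((costs[kind], word))
    return min(scored)[1] if scored else typo

# Made-up training examples: (observed typo, intended word).
training = [("hte", "the"), ("teh", "the"), ("adn", "and"), ("wrk", "work")]
costs = learn_costs(training)
best = correct("hte", ["the", "hate"], costs)
print(best)
```

With these made-up examples transpositions are more frequent than insertions, so the learned cost of a transposition is lower and "the" is preferred over "hate" for "hte". The program itself stays tiny; the behavior comes from the data.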

Equation of a line
y = mx +b
f(x) = mx +b

Machine Learning in one slide
• Machine learning, a branch of artificial intelligence, is a scientific
discipline that is concerned with the design and development of
algorithms that allow computers to evolve behaviors based on
empirical data, such as from sensor data or databases.
• A learner can take advantage of examples (data) to capture
characteristics of interest of their unknown underlying probability
distribution. Data can be seen as examples that illustrate relations
between observed variables.
• A major focus of machine learning research is to automatically
learn to recognize complex patterns and make intelligent
decisions based on data; the difficulty lies in the fact that the set
of all possible behaviors given all possible inputs is too large to
be covered by the set of observed examples (training data).
• Hence the learner must generalize from the given examples, so as
to be able to produce a useful output in new cases. Machine
learning, like all subjects in artificial intelligence, requires
cross-disciplinary proficiency in several areas, such as probability
theory, statistics, pattern recognition, cognitive science, data
mining, adaptive control, computational neuroscience and
theoretical computer science.

What is the Learning Problem?
• Improve over Task T
• with respect to performance measure P
• based on experience E
Learning = Improving with experience at some task

Types of Learning
• Supervised learning - Generates a function that maps inputs
to desired outputs. For example, in a classification problem,
the learner approximates a function mapping a vector into
classes by looking at input-output examples of the function.
• Unsupervised learning - Models a set of inputs: like clustering
• Semi-supervised learning - Combines both labeled and
unlabeled examples to generate an appropriate function or
classifier.
• Reinforcement learning - Learns how to act given an
observation of the world. Every action has some impact in the
environment, and the environment provides feedback in the
form of rewards that guides the learning algorithm.
• Transduction - Tries to predict new outputs based on training
inputs, training outputs, and test inputs.

Supervised Learning: Regression
• Regression
– Linear Regression
• Classification
– Logistic Regression
• Generalized Linear Models (GLMs)
– Broader family of models (that subsumes Linear Regression, Logistic Regression and more)
– In R, check out ?glm()
Parametric Approaches vs. Non-parametric
Convex/Concave
Discriminative versus generative

Classification versus Regression
• Classification is just like a regression problem,
except where the values of y that we now want to
predict take on only a small number of discrete
values (assume no order in y)
• Binary logistic regression
– For now let’s focus on binary classification where y can take
on two values 0 and 1 (can be generalized to multi-class
case)
• E.g., building an ancestor class; a person is an
ancestor (where y might take the value of 1) or
not (y=0).
– Given Xi the corresponding yi is AKA the label for the training
data

Families of Supervised Learning
• Generative Classifier (Bottom-up learning)
– Build model of each class
– Assume the underlying form of the classes and estimate their parameters (e.g., a Gaussian)
• Discriminative Classifier (Top down)
– Build model of boundary between classes
– Assume the underlying form of the discriminant and estimate its parameters (e.g., a hyperplane)
[Figure: two panels showing four classes: Sports, Arts, Business, Health]
Terminology: linear regression
y                    X_i's
Predicted            Predictor variables
Response variable    Explanatory variables
Outcome variable     Covariables
Dependent            Independent variables
$y = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_n x_n$
The $w_i$ are the model coefficients; $w_0$ is the y-intercept/threshold

Pr(Click): Advertising Problem
• Predict Pr(Click|dwellTimeOnWebpage)
– at the times 1, 2, 3, 4, and 5 seconds after loading the page.
• Graph each data point with time on the x-axis and CTR on the y-axis. Your data should follow a straight line.
• Use locator() to input data
• Find the equation of this line.
#    x    y%
1    1    2
.    2    3
.    3    7
.    4    8
m    5    9
[Figure: plot of F(x) against x]
X are features, aka variables: continuous, discrete, ordinal ($X \in \mathbb{R}^n$)
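
One way to "find the equation of this line" for the table above is the closed-form least-squares fit for a single feature. The sketch below is a minimal Python version under that assumption; the function name fit_line is hypothetical, and the slides themselves point to R (locator()) rather than Python.

```python
# Data from the slide: dwell time in seconds (x) and CTR in percent (y).
xs = [1, 2, 3, 4, 5]
ys = [2, 3, 7, 8, 9]

def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Covariance-like and variance-like sums (the 1/n factors cancel).
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept

m, b = fit_line(xs, ys)
print(f"CTR% = {m:.2f} * seconds + {b:.2f}")
```

For this table the fit comes out at a slope of about 1.9 and an intercept of about 0.1.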

Least Square Fit Approximations
Suppose we want to fit the data set.
We would like to find the best straight
line to fit the data.
#    x    y
1    1    2
.    2    3
.    3    7
.    4    8
m    5    9

Fit a line based on…
• If we assume that the first two points are correct
and choose the line that goes through them, we
get the line y = 1 + x.
• If we substitute our points (x-values) into this
equation, we get the following chart.
• How good is this line?
– The sum of the squares of the errors is 27.
SSE = 27
Do you think that we can do better
than this?

Linear Model More Generally
• E.g., y = mx + b can be seen more generally as a function of the form
$$y = f(x_0, x_1) = w_0 x_0 + w_1 x_1 = \sum_{i=0}^{n} w_i x_i = W^T X$$
• Here the W’s are the parameters (also called weights) parameterizing the space of linear functions mapping from X → Y = F(x)
• Sometimes $\theta$ is used instead of $W$:
$$y = f(x_0, x_1) = \sum_{i=0}^{n} \theta_i x_i = \theta^T X$$
#    x0   x1   y
1    1    1    2
.    1    2    3
.    1    3    7
.    1    4    8
m    1    5    9
[Figure: line F(x) plotted against x1, with slope m and intercept b]


Machine Learning Background
Machine Learning (ML): “a computer program that improves its performance at some task through experience” [Mitchell 1997]
GIVEN: Input data is a table of attribute values and associated class values (in the case of supervised learning)
GOAL: Approximate f(x1,…,xn) -> y
Instance\Attr   x1   x2   …     xn   y
1               3    0    ..    7    -1
2                                    +1
…               …    …    …     …    …
L (aka m)       0    4    ...   8    -1
Y is categorical

Machine Learning: Regression
Instance\Attr   x1   x2   …     xn   y
1               3    0    ..    7    73
2                                    76
…               …    …    …     …    …
L (aka m)       0    4    ...   8    97
Y is real valued

Machine Learning semi-supervised
Instance\Attr   x1   x2   …     xn   y
1               3    0    ..    7    73
2                                    76
…               …    …    …     …    …
L (aka m)       0    4    ...   8    97
Y is only partially available

Machine Learning Unsupervised
Instance\Attr   x1   x2   …     xn   y
1               3    0    ..    7    73
2                                    76
…               …    …    …     …    …
L (aka m)       0    4    ...   8    97
Y is not available


Generative vs. Discriminative
• Generative learning (e.g., Bayesian
Networks, HMM, Naïve Bayes, EM GMM)
typically more flexible
– More complex problems
– More flexible predictions
• Discriminative learning (e.g., ANN, SVM)
typically more accurate
– Better with small datasets
– Faster to train

Parametric vs. Non-Parametric ML Algorithms
• Parametric ML Algorithms (e.g., OLS, Decision Trees;
SVMs, NNs)
– Model-based methods, such as neural networks and the mixture of
Gaussians, use the data to build a parameterized model. After training,
the model is used for predictions and the data are generally discarded.
• Non-Parametric (lowess(); knn; some flavours of SVMs)
– In contrast, “memory-based” methods are non-parametric approaches
that explicitly retain the training data, and use it each time a prediction
needs to be made.
– The term “non-parametric” (roughly) refers to the fact that the amount
of stuff we need to keep in order to represent the hypothesis/model
grows linearly with the size of the training set.
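
The contrast above can be sketched in a few lines of Python. The example reuses the five (x, y) points from the linear regression slides; the function names, the query point 3.2, and the choice of k-nearest-neighbour averaging as the memory-based method are assumptions made for illustration only.

```python
# Training data: the five (x, y) points used in the regression slides.
train = [(1, 2), (2, 3), (3, 7), (4, 8), (5, 9)]

def fit_line(data):
    """Parametric: compress the data into two numbers (slope, intercept)."""
    n = len(data)
    x_mean = sum(x for x, _ in data) / n
    y_mean = sum(y for _, y in data) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in data)
    sxx = sum((x - x_mean) ** 2 for x, _ in data)
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean

def knn_predict(data, x, k=1):
    """Non-parametric: keep all the data and average the k nearest targets."""
    nearest = sorted(data, key=lambda point: abs(point[0] - x))[:k]
    return sum(y for _, y in nearest) / k

slope, intercept = fit_line(train)        # the training data could now be discarded
line_prediction = slope * 3.2 + intercept
knn_prediction = knn_predict(train, 3.2)  # needs `train` at prediction time
print(line_prediction, knn_prediction)
```

The parametric model stores two numbers no matter how many training rows there are, while the nearest-neighbour predictor must carry the whole training set around, which is the storage growth the last bullet describes.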

Linear Model: Ordinary Least Squares
• How do we pick, or learn, the parameters W (aka θ)?
• One reasonable method seems to be to make f(x) close to y, at least for the training examples.
• To formalize, let’s define a function that measures, for each possible model/hypothesis W, how close the $f_\theta(x_i)$’s are to the corresponding $y_i$’s:
• Sum of squared error
• AKA Residual Sum of Squares (Residual squared)
Measuring Quality
Residual sum of squares:
$$J(W) = \frac{1}{2} \sum_{i=1}^{m} (W^T X_i - y_i)^2$$
This error minimization is going to have problems?
$$J(W) = \sum_{i=1}^{m} (W^T X_i - y_i)$$
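
One plausible reading of the question on this slide is that summing raw (unsquared) residuals lets positive and negative errors cancel. The Python sketch below checks this on the five-point data set from the earlier slides: three quite different lines that all pass through the mean point (3, 5.8) have a residual sum of zero, yet very different squared-error costs. The three weight vectors are chosen for illustration and do not come from the slides.

```python
# Rows are (x0, x1) with the dummy intercept x0 = 1; y is the target.
X = [(1, 1), (1, 2), (1, 3), (1, 4), (1, 5)]
y = [2, 3, 7, 8, 9]

def predict(W, x):
    """Linear model W^T x."""
    return sum(w * xi for w, xi in zip(W, x))

def sum_of_residuals(W):
    """Unsquared objective: sum_i (W^T X_i - y_i)."""
    return sum(predict(W, x) - yi for x, yi in zip(X, y))

def J(W):
    """Squared objective: 1/2 * sum_i (W^T X_i - y_i)^2."""
    return 0.5 * sum((predict(W, x) - yi) ** 2 for x, yi in zip(X, y))

# Three lines through the mean point (3, 5.8), written as (w0, w1).
lines = {"least squares": (0.1, 1.9), "flat": (5.8, 0.0), "wrong slope": (14.8, -3.0)}
for name, W in lines.items():
    print(f"{name}: residual sum = {sum_of_residuals(W):.6f}, J = {J(W):.2f}")
```

The unsquared sum cannot tell these lines apart, whereas the squared cost ranks the least-squares line as clearly best.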

Residual
[Figure: plot of y (0 to 60) against x (0 to 16) illustrating the residual, $\text{Residual}_i = W^T X_i - y_i$]

Which Line is it anyway?
• Select another two points and build a line
• If we choose the line that goes through the points
when x = 3 and 4, we get the line y = 4 + x. Will we
get a better fit? Let's look at it.
SSE = 18. Getting better but can we do better?

Can we do better than guesswork?
• Let's try the line that is half way between these
two lines. The equation would be y = 2.5 + x.
• Is there a more scientific or efficient way than
guessing which line would give the best fit?
– Surely there is a methodical way to determine the best fit
line. Let's think about what we want.
SSE = 11.25. Getting better but can we do
better?
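
The three SSE values quoted on these slides (27, 18 and 11.25) can be reproduced with a short Python check. The data and the three candidate lines come from the slides; the helper name sse is an assumption.

```python
# Data from the slides.
xs = [1, 2, 3, 4, 5]
ys = [2, 3, 7, 8, 9]

def sse(intercept, slope):
    """Sum of squared errors of the line y = intercept + slope * x."""
    return sum((intercept + slope * x - y) ** 2 for x, y in zip(xs, ys))

# Candidate lines from the slides, stored as (intercept, slope).
candidates = {
    "y = 1 + x": (1, 1),
    "y = 4 + x": (4, 1),
    "y = 2.5 + x": (2.5, 1),
}
results = {name: sse(b, m) for name, (b, m) in candidates.items()}
for name, value in results.items():
    print(name, "SSE =", value)
```

Guessing lines this way does improve the fit step by step, but each guess only changes the intercept; a systematic method would search over both the intercept and the slope.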

Hypothesis Space of Linear Models
• Here the W’s are the parameters (also called weights) parameterizing the space of linear functions mapping from X → Y = f(X)
• Augment Training Data with dummy intercept variable (simplifies notation and modeling)
#    x0   x1   y
1    1    1    2
.    1    2    3
.    1    3    7
.    1    4    8
m    1    5    9
$$y = f(x_0, x_1) = w_0 x_0 + w_1 x_1 = \sum_{i=0}^{n} w_i x_i = W^T X$$
Sometimes $\theta$ is used instead of $W$:
$$y = f(x_0, x_1) = \sum_{i=0}^{n} \theta_i x_i = \theta^T X$$
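
The effect of the dummy intercept variable can be seen in a small Python sketch: once every row carries x0 = 1, the prediction is a single dot product and the intercept needs no special handling. The weight vector below is a hypothetical choice (intercept 2.5, slope 1, i.e. the line y = 2.5 + x from the earlier slide), not a fitted result.

```python
def predict(W, X):
    """Linear model f(X) = W^T X, computed as a dot product."""
    return sum(w * x for w, x in zip(W, X))

# Augmented rows (x0, x1): x0 = 1 is the dummy intercept variable.
rows = [(1, 1), (1, 2), (1, 3), (1, 4), (1, 5)]

# Hypothetical weights: w0 plays the role of b, w1 the role of m.
W = (2.5, 1.0)

predictions = [predict(W, X) for X in rows]
print(predictions)
```

The same predict function works unchanged for any number of features, which is the notational simplification the slide refers to.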



Space of Hypotheses: Weights
example.OLS_Heatmap()
• Each model is in our case a coefficient for the y-intercept (bias) and a coefficient for the feature variable (time)
• Plot weight-space in 2D where the third dimension is the error
• Select the combination that minimizes the sum of squared error
[Figure panels: heat map with isolines overlaid; 3D error surface z=log(w0+w1x)]
$$J(W) = \frac{1}{2} \sum_{i=1}^{m} (W^T X_i - y_i)^2$$
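
A brute-force version of "select the combination that minimizes the error" is a grid search over the two weights. The Python sketch below is not the example.OLS_Heatmap() function named on the slide (its implementation is not shown in the deck); it only evaluates J on a grid whose ranges and 0.1 step size are assumptions made for illustration.

```python
# Data from the earlier slides.
xs = [1, 2, 3, 4, 5]
ys = [2, 3, 7, 8, 9]

def J(w0, w1):
    """Squared-error cost 1/2 * sum_i (w0 + w1 * x_i - y_i)^2."""
    return 0.5 * sum((w0 + w1 * x - y) ** 2 for x, y in zip(xs, ys))

# Grid over weight space: w0 in [-2, 4] and w1 in [0, 3], in steps of 0.1.
grid = [(i / 10, j / 10) for i in range(-20, 41) for j in range(0, 31)]

# Each grid cell is one hypothesis; its cost would be the heat map colour.
best_w0, best_w1 = min(grid, key=lambda w: J(*w))
print(best_w0, best_w1, J(best_w0, best_w1))
```

On this grid the minimum lands at w0 = 0.1 and w1 = 1.9. A grid search scales poorly with the number of weights, which is one motivation for closed-form or gradient-based solutions.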

Hyperplanes partition the input space (a line in a 2-input-variable problem) and do NOT predict real values
• Many methods in machine learning are based on finding parameters that minimize some objective function.
• Very often, the objective function is a weighted sum of two terms:
– a cost function and a regularization term.
– In statistics terms, the (log-)likelihood and the (log-)prior.
– If both of these components are convex, then their sum is also convex.
– Loss functions are summed over examples, so the sum of convex functions is a convex function.
• Given a linear regression model W, please type in the loss function for linear regression.
Partitions versus predicts
• Regression (minimize residuals): y = f(X1), where y is real-valued
– Prediction line: y = mX + b
– Prediction(X1) = mX1 + b
• Classification: y = f(X1, X2), where y is in {0,1} or {-1, 1}
– Separating hyperplane partitions: AX1 + BX2 + C
– Class(X1, X2) = sign(AX1 + BX2 + C)
[Figure: left panel, prediction line in the (X1, Y) plane; right panel, separating hyperplane in the (X1, X2) plane]
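
The difference between predicting and partitioning can be shown with two short Python functions. All coefficient values below (m, b, A, B, C) are hypothetical and chosen only for illustration; treating a score of exactly zero as the positive class is also an assumption of this sketch.

```python
def predict(x1, m, b):
    """Regression: return a real value on the line y = m * x1 + b."""
    return m * x1 + b

def classify(x1, x2, A, B, C):
    """Classification: return which side of the hyperplane A*x1 + B*x2 + C = 0 a point is on."""
    score = A * x1 + B * x2 + C
    return 1 if score >= 0 else -1

# Hypothetical regression line with slope 1.9 and intercept 0.1.
value = predict(3, m=1.9, b=0.1)

# Hypothetical separating line x1 + x2 - 1 = 0 in a two-input problem.
below = classify(0, 0, A=1, B=1, C=-1)
above = classify(2, 2, A=1, B=1, C=-1)
print(value, below, above)
```

The regression function returns a real number for each input, while the classifier collapses the same kind of linear expression to one of two labels.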

Unsupervised Learning (Clustering)
Input data
We want 3 clusters: red, green and blue

Center of a cluster
Let’s compute the center of those points
We can use the mean on each dimension
But the mean has trouble with outliers
Using the median on each dimension is more robust
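
A small Python sketch makes the mean-versus-median point concrete. The five 2D points are made up for this example (four close together plus one outlier); the helper name center is an assumption.

```python
from statistics import mean, median

# Made-up cluster: four nearby points and one outlier at (50, 40).
points = [(1.0, 2.0), (2.0, 1.0), (2.0, 3.0), (3.0, 2.0), (50.0, 40.0)]

def center(points, stat):
    """Apply `stat` (e.g. mean or median) to each dimension separately."""
    return tuple(stat(coords) for coords in zip(*points))

mean_center = center(points, mean)      # dragged towards the outlier
median_center = center(points, median)  # stays with the bulk of the points
print(mean_center, median_center)
```

Here the per-dimension median stays at (2.0, 2.0), in the middle of the four nearby points, while the mean is pulled far away by the single outlier.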

Assignment
All points coloured properly already
⇒ we are done!

Three generations of machine learning
• First generation: dataset that fits in memory
– Single-node learning: summary statistics and some batch modeling (at small scale); SQL, R
– Down-sampling the data
• Second generation: General-purpose clusters and frameworks
– Distributed frameworks that allow us to divide and conquer problems
– Learning using general-purpose frameworks such as Hadoop: offline big data analysis, real-time decision making, homegrown specialist systems (Hadoop for analysis and modeling); Hadoop, R
– In-house purpose-built systems; specialist sport
• Third generation: Purpose-built libraries and frameworks
– Built for iterative algorithms that are commonplace in ML
– Huge-scale real-time analysis and decision-making systems
– Specialized frameworks for large-scale manipulation of the type of data you are working with.
– For example, machine learning libraries like MLlib in Spark, graph processing libraries like Apache Giraph or GraphX in Spark

Evolution of Map-Reduce frameworks for big data processing
[Figure: timeline from the mid 90s (Jimi’s PhD) to 2015. Labels: First generation; 2nd generation; 3rd generation; Hadoop V2.0; Spark 1.0; Spark 1.5 (as of 10/2015)]

Top 10 ML Algorithms
https://www.dezyre.com/article/top-10-machine-learning-algorithms/202
Naïve Bayes Classifier
K Means Clustering Algorithm
Nearest Neighbours
Apriori Algorithm
Linear Regression
Logistic Regression
Support Vector Machine
Decision Trees
Ensembles/Forests
Artificial Neural Networks/Deep Learning
Reinforcement learning
Forecasting
Many more!


Lecture Outline
• Introduction
• Data Science
• Applications
• What’s next?

Internet companies started the revolution
But more traditional companies are leveraging their data and DS Tech

Data Analysis Has Been Around for a While
[Figure: timeline. People: R.A. Fisher, Howard Dresner, Peter Luhn, W.E. Deming. Milestones: 1997 Google; 2012: Deep Learning; 2013: Spark]

Data Science DS Skillset
• Linear regression, DT models for domain experts
[Figure: a Venn diagram with a Danger Bearing, adapted from Drew Conway; labeled region: Domain Expertise]

Data Science
[Figure: Venn diagram with DS in the center; adapted from Drew Conway’s Venn diagram of data science]
• Technology: Hadoop, Spark, Python, Scala, Java, R
• Domain Expertise: Marketing, Econometrics, Web Search, Cellular Networks, Social Networks, Mobile Advertising
• Math: Statistics, Optimization Theory, Geo-Informational Science

Data Scientist
[Figure: Venn diagram with DS in the center]
• Technology: Hadoop, Spark, Python, Scala, Java, R
• Domain Expertise: Marketing, Econometrics, Web Search, Cellular Networks, Social Networks, Mobile Advertising
• Math: Statistics, Optimization Theory, Geo-Informational Science
• Communication

RockStars and Super Models
[Figure: Venn diagram of Technology, Math and Domain expertise; label: RockStar]

Data Analytics at Scale
[Figure: components surrounding a central "Data Analytics at Scale" label]
• Algorithms: Machine Learning and Analytics, Representation, Visualization
• Big Data: human-centric, M2M, IoT
• Machines: Cloud Computing, storage and compute
• Frameworks: MapReduce, HDFS, Hadoop, Spark, MPI
• Security/Privacy

DS is Systems + Theory + Verticals
http://support.sas.com/resources/papers/proceedings14/SAS313-2014.pdf
Systems
- NoSQL
- Hadoop
- Spark
- MPI
Verticals
- Advertising
- Voting
- Sports
- Autonomous Agents
- Healthcare
- Education
Theory
Visualization
Legal

Typical Abstract Data Analytics Pipeline
[Figure: pipeline diagram with steps numbered 1 to 7]
• Understand domain, collect requirements
• Warehouse Data
• Exploratory data analysis
• Feature Engineering
• Modeling
• Lab-based experiments
• Deploy Models in the wild (e.g., AB test)
• Reports and Decisions
• Models and decisions

Lecture Outline
• Google Doc and Group
• Welcome & Class Introductions
• Big Data and Applications
• Course introduction
• Class logistics
• Systems (part 1 of N)

Data Science at Scale
[Figure: components surrounding a central "Machine learning at Scale" label]
• Security/Privacy
• Big Data: human-centric, M2M, IoT
• Machines: Cloud Computing
• Parallel Frameworks: MapReduce: cmdLine, Hadoop, MRJob, Spark
• Algorithms: Machine Learning and Analytics

Big data Definition: use
• Big data is a broad term for data sets so large or
complex that traditional data processing
applications are inadequate.
– PROCESSING:
• Think of your laptop that gets overwhelmed with 3-4 GB of data (disk space is 1 TB)
– STORAGE:
• Laptop: 1 TB (10^12 bytes)
– THROUGHPUT (reading at 10^8 bytes/sec, i.e., 100 MB/sec, takes 10^4 seconds)
• 1 TB would take about 3 hours to read using your laptop
• Challenges
– Challenges include analysis, capture, data curation, search,
sharing, storage, transfer, visualization, security, and
information privacy.
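
The read-time estimate on this slide is simple arithmetic, sketched below in Python. The figures (1 TB of data, 100 MB/sec) are the ones quoted on the slide; the assumption that the read is sequential and sustained at a constant rate is added here for the calculation.

```python
# Figures from the slide.
data_bytes = 10**12         # 1 TB
bytes_per_second = 10**8    # 100 MB/sec, assumed constant

seconds = data_bytes / bytes_per_second
hours = seconds / 3600

print(f"{seconds:.0f} seconds, about {hours:.1f} hours")
```

At that rate a single machine needs roughly 10^4 seconds (a little under three hours) just to read the data once, before doing any computation on it.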

Big Data
• In 2012, Gartner updated its definition as follows:
"Big data is high volume, high velocity, and/or
high variety information assets that require new
forms of processing to enable enhanced decision
making, insight discovery and process
optimization."[18]

Big Data: V3
10^12, 10^21
speed of generation of data, or how fast the data is generated and processed
2015: 1-2 TB per online individual
4 ZB (10^21) today → 40 ZB in 2020

Sources Driving Big Data
It’s All Happening On-line
Every:
Click
Ad impression
Billing event
Fast Forward, pause,…
Friend Request
Transaction
Network message
Fault
…
User Generated
(Web, Social & Mobile)
…
..
Internet of Things / M2M Scientific Computing
Quantified Self

Big Data Infographic
http://www.ibmbigdatahub.com/sites/default/files/infographic_file/4-Vs-of-big-data.jpg
http://www.ibmbigdatahub.com/infographic/four-vs-big-data
By 2005 we had 120 × 10^18
By 2007 we had 280 × 10^18
By 2020 we will have 40 × 10^21
The quality of the data being captured can vary greatly

3 Vs of Big Data
40 TB per person by 2020
1-2 TB per person today (2014/2015)

Lecture Outline
• Introduction
• Data Science
• Applications
• What’s next?

Why all the excitement?
• Government:
– Obama used 80 pieces of information on each person; 4 year
history (versus Romney)
– Nate Silver used Bayesian techniques to publish analyses and
predictions related to the 2008 and 2012 United States presidential
election
• Sports:
– Oakland Athletics baseball team and its manager Billy Beane
• Transportation ( e.g., Autonomous Vehicles)
• HCI: Speech Recognition and Translation
• Healthcare
– AI Cure: Do you know if your patients are taking their meds?
• Digital Advertising
• Search (web, local, mobile)

How does data, ML, data science work?

• ..

• .

Web search lifecycle
• ..
http://www.slideshare.net/GaneshVenkataraman3/learn-to-rank-using-machine-learning
https://en.wikipedia.org/wiki/Monty_Hall_problem

Understand user intent
• ,,

Fixing user errors
• ..

• ,,

Like the Index at end of book
• ..

PageRank
• ..

Search is a ranking problem
• ..

Learning to rank
• ..

• ..

Training Data
• ..

Supervised Feedback Loop
Guided by human editors
• ..

Mining relevance judgements
• ..

Search ranking (web, jobs, local, etc)
And Ads
• ..

DB size = 100s billions of sites
Google server farms
2 million machines (est)
10^11 × 10^4 = 10^15 bytes, about 1 petabyte of data

Learning to Rank at SearchMe
• Page Quality, Page Category, Webspam, Query understanding LETOR

LeToR: Improve in a measured way
Doubled size of index
More labeled training data

More data or more data science?
• ..

Advertising ~2% of US GDP; $140B WW
"Half the money I spend on advertising is wasted; the trouble is, I don't
know which half." - John Wanamaker, father of modern advertising.
– Less than 1% of all impressions lead to measurable ROI
Despite its problems (Attribution, etc.)
• US GDP = $14.1 Trillion (Global $56 Trillion, 56 × 10^12)
• US Advertising Spend
– ~$275 Billion across all media
• (2% of GDP since the early 1900s)
• In 2014, worldwide online advertising was $140 billion
– I.e., about 20% of all ad spending across all media
– $42 billion global mobile-advertising market in 2014

Making Money from Apps
• 93% of downloaded apps in 2013 (globally) were free apps
• 76% of revenue generated from apps (globally) in 2013 was from in-app purchases
– [http://www.forbes.com/sites/chuckjones/2013/03/31/apps-with-in-app-purchase-generate-the-highest-revenue/]
• In the freemium economy
– To make money from apps, publishers must maintain
customer satisfaction through superior app performance and
design,
– then monetize through advertising and in-app purchases
[http://venturebeat.com/2014/03/27/mobile-app-monetization-freemium-is-king-but-in-app-ads-are-growing-fast/, IDC, AppAnnie]

Mobile Publisher: How do I make money?
Auction
Ad
Which Ad?
Publisher:
App Developer
Consumer:
App user
• Paid app download
• In app purchases
• In app Advertising

• ..

Native Advertising

Rich Media Templates
• Advertiser
Template/Configuration
• Defines an offer design/display for a
specific ad unit
• Publisher
Template/Configuration
• Defined and designed to provide
native experience in publisher games
• Controls allowable content (ad units)
with a placement
• Is a “shell” to an ad (advertiser offer
template)
• Tracks placement performance
• Allows to control the behavior
and look/design from the server

Native Design and Dynamic Creative Optimization
Ad Frame Treatment
Variable Intro Text
Publisher Game Art or
Character Integration
Variable Integrated Call
to Action
Context, where is this
solution being shown
in the game?
N
Native Design
Blends with
Content
Dynamic UI elements adapt to
the ad and the audience

• ..
http://venturebeat.com/2014/04/29/mobile-apps-could-hit-70b-in-revenues-by-2017-as-non-game-categories-take-off/

Mobile Ad Spend to Top $100 Billion
Worldwide in 2016, 51% of Digital Market
• US and China will account for nearly 62% of
global mobile ad spending next year
• http://www.emarketer.com/Article/Mobile-Ad-Spend-Top-100-Billion-Worldwide-2016-51-of-Digital-Market/1012299#sthash.FBfZAlaC.dpuf

CPMs on Mobile are catching up
• Mobile Advertising: What is the average CPM on
mobile?
– The effective cost per thousand impressions (CPM) for
desktop web ads is about $3.50, while the CPM for mobile
ads is just $0.75.
– Video-based CPMs typically > $15
http://www.quora.com/Mobile-Advertising/What-is-the-average-CPM-on-mobile
http://mashable.com/2012/10/23/mobile-ad-prices/

NativeX: Art and Science of Native Mobile Advertising
[Diagram: App Publisher, NativeX, SSP, Ad Net, Exchange, DSP, Advertiser, Ad Agency]
SUPPLY/Publishers | CONSUMERS | DEMAND/Advertisers

[Diagram: App Publisher, NativeX, SSP, Ad Net, Exchange, DSP, Advertiser, Ad Agency]
• DOE Native Ads
• Yield mgt
• LTV/Churn
• SDK
• LTV/Churn
• Event-based CPA
• Flexible, multiple
conversions
• Segment-based targeting
• Forecasting
• Coldstart
• Pacing
• Metrics and Evaluation

“OLTP” Data
Pipeline
“OLAP” Data
Pipeline
• Offline Data
• Logging Data
• Used for Reporting
and Modeling
• Online Data
• Used in Real Time
• Used for Offer
Serving
Realtime Batch
NativeX Data Pipelines
Data Science
Predictive
Analytics
Pipeline
• Offline Batch
Modeling
• Real-time Ad
Serving
ETL
(Extract, Transform, and Load)
Ad serving data pipelines

Devices
Ad serving architecture
SDK
Kinesis
Lambda
Spark
&
Scala
Spark
Ad
Servers
Aurora SSAS
Cassandra
SQL
Server
EMR
Modeling
Java / Python / R
Excel Pivots
Self-Service
S3
S3 S3
Ad Hoc / Deep Analysis Pipeline
BI Pipeline
Data Science Pipeline
Glacier
Spark
ELB
HA
Proxy
Elasticache
Activity Tracking
Raw Data
Archived Activity
Tracking
EC2 Cluster
Tableau
Reporting
Services
Reporting APIs
Hourly ETL
EC2 Instance
Data Warehouse
Alerts
Dashboards
Debugging / Ops
Ad-hoc Analysis
EventTracking
Data (Logs)
Device Profiles
Device
Data
Configuration / Lookup Data

Publisher: Which ad to show?
Ads, Bid (CPI)
Auction
getAd
Ad
Which Ad?

NativeX conducts an eCPM-based Auction
Ads Pick
best ads
Bids
Auction
Action: argmax over ads of eCPM = bid × CR × 1000
getAd
Ad
Transaction
Logs
$5×0.010×1000=$50
$10×0.002×1000=$20
$3×0.002×1000=$6
$4×0.001×1000=$4
eCPM_Ad = CR_Ad × Bid_Ad × 1000
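
The eCPM rule above can be sketched in a few lines: score every candidate ad by bid × conversion rate × 1000 and serve the highest scorer. The bids and conversion rates are the four examples from the slide; the `Candidate` class, the ad identifiers, and the function names are hypothetical and only for illustration.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    ad_id: str              # hypothetical identifier, not from the slides
    bid: float              # advertiser bid in dollars per conversion
    conversion_rate: float  # predicted conversions per impression

def ecpm(candidate):
    """Expected revenue per 1000 impressions for one candidate ad."""
    return candidate.bid * candidate.conversion_rate * 1000

def run_auction(candidates):
    """Return the candidate with the highest eCPM."""
    return max(candidates, key=ecpm)

candidates = [
    Candidate("ad_1", 5.0, 0.010),   # eCPM $50
    Candidate("ad_2", 10.0, 0.002),  # eCPM $20
    Candidate("ad_3", 3.0, 0.002),   # eCPM $6
    Candidate("ad_4", 4.0, 0.001),   # eCPM $4
]

winner = run_auction(candidates)
print(winner.ad_id, round(ecpm(winner), 2))
```

Note that the highest bid ($10) does not win: the $5 bid has a five times higher predicted conversion rate, which is why the quality of the conversion rate model matters so much to revenue.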

7 Steps in Modeling: E.g., Conversion Rate Modeling
• Understand domain, collect requirements
• Warehouse data
• Exploratory data analysis
• Feature engineering
• Modeling: conversion rate models
• Lab-based experiments
• Deploy models in the wild (e.g., A/B test)

Multiple Ad Sources:
• DSPs, Exchanges
• Ad Networks
• Internal/Self-service
Multiple conversion types:
• CPM, CPC, CPI,
CPCV, CPA, CPE
De-duplication
Optimization by geo
Modeling Features:
• Geo location
• Device
• Reviews (star rating, review
text; Geo location of
reviews)
• Social media Tweets/ FB
posts
• Categories on Android and
iOS
• Creative Message
• User profiles (RFM based
on network behavior)
• Device Behavioral (based
on installed apps on device
RFM, recommendations,
categories)
• Graph-based features
• Others….
Campaign-specific models for CTR/CR

Modeling
• ML Approaches
– Gradient boosted decision trees
– Bayesian hierarchical approaches
– Segmentation via matrix factorization
• Feature engineering
– Feature invention
• Metrics and evaluation
• Storing and accessing data
• Perennial Challenges
– Coldstart
– Bias
– Scale

• ..
https://upload.wikimedia.org/wikipedia/commons/thumb/5/5f/Minard%27s_Map_%28vectorized%29.svg/2023px-Minard%27s_Map_%28vectorized%29.svg.png

If we can’t measure it then…
• …
Data Science Updates: 2013/10/25 ©2013 NativeX Holdings, LLC For
16%  40%

From a systems perspective:
Three generations of machine learning
• First generation: datasets that fit in memory
– Single-node learning: summary statistics and some batch modeling (at small
scale); SQL, R
– Down-sampling the data
• Second generation: General-purpose clusters and frameworks
– Distributed frameworks that allow us to divide and conquer problems
– Learning using general-purpose frameworks such as Hadoop: big data
analysis offline, realtime decision making, homegrown specialist systems
(Hadoop for analysis and modeling; ), Hadoop, R
– In-house purpose-built systems; specialist sport
• Third generation: Purpose-built libraries and frameworks
– Built for iterative algorithms that are commonplace in ML
– Huge-scale realtime analysis and decision making systems
– Specialized frameworks for large-scale manipulation of the type of data you are
working with.
– For example, machine learning libraries like MLlib in Spark, graph

Ranking Ads (more) at Turn Inc.

Text Processing
• .. http://aylien.com/
Deep Learning based
CNN
RNN

• ..
Linking other things such as
groups

• ..
Growing

• Deep Learning

• ..

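Logistic Regression Model
Inputs (independent variables x1, x2, x3): Age = 34, Gender = 1, Stage = 4
Coefficients (a, b, c): 0.5, 0.4, 0.8
Output (dependent variable p, the prediction): “Probability of being alive” = 0.6
[Diagram: each input is multiplied by its coefficient and the products are summed (Σ) to produce the output]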

Σ is the sum of inputs * weights
Inputs (independent variables): Age = 34, Gender = 1, Stage = 4
Coefficients: 0.5, 0.4, 0.8
Output: prediction
Σ = 34 × 0.5 + 1 × 0.4 + 4 × 0.8 = 20.6
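
The weighted sum on this slide can be reproduced directly. The sketch below computes the sum of inputs times coefficients and then applies the standard logistic (sigmoid) function. The slide does not show an intercept or any input scaling, so the sigmoid of 20.6 will not reproduce the illustrative 0.6 shown in the diagram; the function names are assumptions for illustration.

```python
import math

def weighted_sum(inputs, weights):
    """Sum of inputs * weights, the quantity the slide labels with a sigma."""
    return sum(x * w for x, w in zip(inputs, weights))

def sigmoid(s):
    """Standard logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-s))

inputs = [34, 1, 4]          # Age, Gender, Stage from the slide
weights = [0.5, 0.4, 0.8]    # coefficients from the slide

s = weighted_sum(inputs, weights)
p = sigmoid(s)
print(f"S = {s:.1f}, sigmoid(S) = {p:.6f}")
```

In practice a fitted model would include an intercept and inputs on comparable scales, which keeps the sum in the range where the sigmoid is informative.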

Neural Network Model
Inputs (independent variables): Age = 34, Gender = 2, Stage = 4
Weights shown in the diagram: .6, .5, .8, .2, .1, .3, .7, .2, .4, .2
Hidden layer: summation nodes (Σ)
Output (dependent variable, the prediction): “Probability of being alive” = 0.6

Intelligent Systems in Your Everyday Life
• Post Office
– automatic address recognition and sorting of mail
• Banks
– automatic check readers, signature verification systems
– automated loan application classification
• Customer Service
– automatic voice recognition
• The Web
– Identifying your age, gender, location, from your Web surfing
– Automated fraud detection
• Digital Cameras
– Automated face detection and focusing
• Computer Games
– Intelligent characters/agents

• ..

• .. http://3.bp.blogspot.com/-iEx-C0ljkKk/VV38zjj_vdI/AAAAAAAAA7w/aron8CBjmos/s1600/alexnet.png

• .
Data requirements

• ..

Conversational UI
• We’re witnessing an explosion of applications
that no longer have a graphical user interface
(GUI).
• They’ve actually been around for a while, but
they’ve only recently started spreading into the
mainstream.
• They are called bots, virtual assistants, invisible
apps.
• They can run on Slack, WeChat, Facebook
Messenger, plain SMS, or Amazon Echo.
• They can be entirely driven by artificial
intelligence, or there can be a human behind the
curtain.

Conversational UI
• ..

• ..
Check Balance replenish
Charts --__--

Conversational UI
• Amazon Echo is controlled by voice, but has a
companion app.

Microsoft Research
Cortana

• ..

Speech Recognition Breakthrough for the
Spoken, Translated Word
• Published on Nov 8, 2012
• Chief Research Officer Rick Rashid demonstrates a speech
recognition breakthrough via machine translation that
converts his spoken English words into computer-
generated Chinese language. The breakthrough is
patterned after deep neural networks and significantly
reduces errors in spoken as well as written translation.
• For more information on Speech Recognition and
Translation, visit
– http://www.microsoft.com/translator/skype.aspx
• Excellent Video (please watch all this video!)
– https://www.youtube.com/watch?v=Nu-nlQqFCKg (Minute 7:11)
– English text (ASR) → Chinese text → text-to-speech system (sounds like the
English speaker)

• ..
English text (ASR) → Chinese text → text-to-speech system (sounds like the English speaker)

ASR (audio signal → word sequence)
• ..
HMM, Deep Learning, Language models

Tipping point: Humans no longer the
center to the data universe
• ..

IoT/IoE
• ..

Personal; society; M2M; crowdsourcing
• Society
– Graphs: Social, professional;
– Quantified self: Eating; Sleeping; exercising
– Voting
– Education
– Healthcare…. Economics, shopping, etc.
• Internet of things
– Tracking wildebeests in the Serengeti, Tanzania (not just with GPS tags, but
also with cameras at key strategic locations throughout the Serengeti)
• Population changes in species; scheduling safaris
– 1 billion smart meters by 2020
• 1 petabyte of data per day? 10^9 meters × 10^3 readings per day = 10^12 readings; × 10^3 bytes each = 10^15 bytes
• 1 billion smart meters (one megabyte of data per device per day: poll each
meter 1000 times per day; 1000 bytes of data each time)
– Smart cities
• Etc.
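
The smart-meter estimate is a back-of-the-envelope calculation that is easy to verify. A minimal sketch, using only the figures on the slide; the variable names are illustrative.

```python
meters = 10**9             # 1 billion smart meters
polls_per_day = 10**3      # each meter polled 1000 times per day
bytes_per_poll = 10**3     # 1000 bytes returned per poll

bytes_per_meter_per_day = polls_per_day * bytes_per_poll   # 1 megabyte
total_bytes_per_day = meters * bytes_per_meter_per_day

PETABYTE = 10**15
print(f"{total_bytes_per_day / PETABYTE:.0f} PB per day")
```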

Japanese to English
• ..
http://www.ustar-consortium.com/research.html

From analytics to closed loop control systems
Historical Realtime Future
Analytical
Now
Predictive
Customer exit rate

Analytical
Now
Predictive
Customer exit rate
Decisive

Analytical
$
Now
Predictive
$$
Customer exit rate
Decisive
$$$

Managers and CEOs see the value of DA
Data (Science) improves KPIs dramatically
Summary
Stats and
Reports
Offline Data
Mining (e.g,
user Profiles)
Realtime
decision
making
Personalization
LTV
Advanced BI,
Regional Sales
KPI Performance Improvement
(e.g., Sales)
10-20%
20-30%
2X-10X
10X+
Churn, Repeat,
BigSpender
Realtime
Recommendations,
LookAlike Modeling
Ads (DSP/DMP)
Amazon
Google
Netflix
Oracle, SQL
Hadoop
(Omniture,
Hyperion)
SAS, SPSS
Cloudera, R

Autonomous Vehicles
• ..

• ..

Autonomous Vehicles
• ..
An image of what Google's self-driving car
sees when it makes a left turn.
http://www.rand.org/pubs/research_briefs/RB9755.html

autonomous vehicles
• Research in autonomous cars started in the 1980s, but the
technology wasn't there.
• Perhaps the first significant event was the 2005 DARPA Grand
Challenge, in which the goal was to have a driverless car go
through a 132-mile off-road course. Stanford finished in first
place. The car was equipped with various sensors (laser, vision,
radar), whose readings needed to be synthesized (using
probabilistic techniques that we'll learn from this class) to localize
the car and then to generate control signals for the steering,
throttle, and brake.
• In 2007, DARPA created an even harder Urban Challenge, which
was won by CMU.
• In 2009, Google started a self-driving car program, and since then,
their self-driving cars have driven over 1 million miles on freeways
and streets.
• In January 2015, Uber hired about 50 people from CMU's robotics
department to build self-driving cars.
• While there are still technological and policy issues to be worked
out, the potential impact on transportation is huge.

• ..
http://www.nature.com/news/auto
omous-vehicles-no-drivers-
required-1.16832
http://asirt.org/initiatives/informing
road-users/road-safety-facts/road
crash-statistics
800 million
parking spots
in the US

Save fuel, Safer logistics
• ..
http://peloton-tech.com/

Data Science in Ecommerce
• ..
This is just a
subset

Defining Product Strategy for the
optimum product mix
• Ecommerce, and bricks and mortar businesses
– What products should they sell?
– What price should be offered for the products and when?
• Data science algorithms help ecommerce businesses
define and optimize the product mix.
– Every ecommerce business has a product team that looks into the
design process where data science algorithms can help the business
with forecasting like-
• What are the loopholes in the product mix?
• What should they make?
• How many quantities should be ordered as initial batch from the factory
outlet?
• When should they halt the supply of those products?
• When should they sell?
• Data scientists versus Data Analysts
– work on advanced predictive and prescriptive analytics
– whereas data analysts will merely look into the retrospective analysis like

• https://www.aicure.com/

Do you know if your patients are taking their meds?
• ..

Trust but verify!
• ..

Rank patients
• ..

Alerts
• ..

Machine learning at Scale
Algorithms: Machine
Learning and Analytics
Big Data: human-centric,
M2M, IoT
Machines:
Cloud Computing
Parallel Frameworks:
MapReduce:cmdLine,
Hadoop, MRJob,Spark
Security/Privacy
Machine
learning
at Scale

Lecture Outline
• Introduction
• Data Science
• Applications
• What’s next?

150,000 Data Scientists needed in US
[McKinsey Report on Big Data 2011]
With such enormous potential to change the world, it will come as no surprise
that data scientists are in huge demand

Top 10 Best Jobs in the US as of 2/2016
How much you make
The demand for your skills
How easily you can advance
117K Median salary
1,700 openings right now

Analytical
$
Now
Predictive
$$
Customer exit rate
Decisive
$$$
IoE, Deep Learning, GPU, Data,Bandwidth (5G)

•Architecture

Cool Thing #2: Schema on Read
LOAD DATA FIRST, ASK QUESTIONS LATER
Data is parsed/interpreted as it is loaded out of HDFS
What implications does this have?
BEFORE:
ETL, schema design upfront,
tossing out original data,
comprehensive data study
WITH HADOOP:
Keep original data around!
Have multiple views of the same data!
Work with unstructured data sooner!
Store first, figure out what to do with it later!
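
Schema on read can be illustrated without a cluster. In the sketch below a plain Python list stands in for raw files in HDFS, and two "views" interpret the same untouched lines at read time. The log format, field names, and functions are all hypothetical and chosen only for illustration.

```python
# Raw lines are stored exactly as they arrived; nothing is discarded at load time.
RAW_EVENTS = [
    "2016-11-15T10:00:00 user=42 action=click price=0",
    "2016-11-15T10:00:05 user=7 action=purchase price=19.99",
    "malformed line kept as-is",
]

def parse_event(line):
    """Apply a schema at read time; return None when the line does not fit it."""
    parts = line.split()
    if len(parts) < 2 or "=" not in parts[1]:
        return None
    fields = dict(p.split("=", 1) for p in parts[1:] if "=" in p)
    fields["timestamp"] = parts[0]
    return fields

def revenue_view(lines):
    """View 1: total purchase revenue."""
    total = 0.0
    for line in lines:
        event = parse_event(line)
        if event and event.get("action") == "purchase":
            total += float(event["price"])
    return total

def actions_per_user_view(lines):
    """View 2: number of events per user, from the same raw data."""
    counts = {}
    for line in lines:
        event = parse_event(line)
        if event:
            counts[event["user"]] = counts.get(event["user"], 0) + 1
    return counts

print(revenue_view(RAW_EVENTS))
print(actions_per_user_view(RAW_EVENTS))
```

Because the malformed line is still stored, a later parser with a different schema could recover it, which is the point of keeping the original data around.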

Cool Thing #4: Unstructured Data
• Unstructured data:
media, text,
forms, log data
lumped structured data
• Query languages like SQL and
Pig assume some sort of
“structure”
• MapReduce is just Java:
You can do anything Java can
do in a Mapper or Reducer

Left outer join: return all rows from the left table
even if there are no matches in the right table
• ..
Customers is the left table
Customers Orders
CustomerName OrderID
A 2
A 4
A 3
B BLANK

Inner Join or simply join
• .
CustomerName OrderID
A 2
A 4
A 3
B BLANK
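
The difference between the two join types can be sketched in plain Python using the Customers and Orders example. An inner join drops customer B, who has no orders, while a left outer join keeps B with a blank (None) OrderID. The `CustomerID` key and the `join` helper are assumptions added for illustration; the slides show only CustomerName and OrderID.

```python
customers = [
    {"CustomerID": 1, "CustomerName": "A"},
    {"CustomerID": 2, "CustomerName": "B"},   # B has placed no orders
]
orders = [
    {"OrderID": 2, "CustomerID": 1},
    {"OrderID": 4, "CustomerID": 1},
    {"OrderID": 3, "CustomerID": 1},
]

def join(left, right, key, keep_unmatched_left):
    """Hash join on key; optionally keep left rows that have no match."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    right_columns = {col for row in right for col in row if col != key}
    result = []
    for left_row in left:
        matches = index.get(left_row[key], [])
        for right_row in matches:
            result.append({**left_row, **right_row})
        if not matches and keep_unmatched_left:
            # Fill the right-hand columns with None (the BLANK on the slide)
            result.append({**left_row, **{col: None for col in right_columns}})
    return result

left_outer = join(customers, orders, "CustomerID", keep_unmatched_left=True)
inner = join(customers, orders, "CustomerID", keep_unmatched_left=False)

for row in left_outer:
    print(row["CustomerName"], row["OrderID"])
```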

Join Question: Ecommerce Company
• Given
– Transaction logfile/DB/CSV file (1 billion transactions)
• User ID, Date, Time, Referring URL, item purchased,
price, etc.
– User Information/Location file/DB (1 million records)
• User ID, HomeCountry, HomeState, HomeZipCode, etc.
• 5 fields × 2 bytes × 10^6 records = 10^7 bytes (around 10 MB)
• Join Transaction DB with Location DB using the
USER_ID (e.g., Phone number)
• Complete this job within one hour every hour!
• Using Hadoop, what type of join would you
recommend?
– NOTE: remember to specify type of join, role of each table,
and how to do it in Hadoop
TASK

• In memory join with user table broadcast to all
nodes
• Left = User table; right = Transactions table
• Right outer join:
– Transaction + User
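
The in-memory (broadcast) join can be sketched as follows. The small user table is loaded into a hash map that every mapper holds in memory, and each mapper streams its own split of the large transaction table past that map, so no shuffle or reduce phase is needed. This is a single-process simulation under stated assumptions: the field names, sample rows, and function names are hypothetical, and in a real Hadoop job the lookup table would be shipped to each node rather than shared in one process.

```python
def build_lookup(user_rows):
    """Small (left) table: loaded fully into memory on every node."""
    return {user["user_id"]: user for user in user_rows}

def map_side_join(transaction_split, user_lookup):
    """Stream one split of the large (right) table and join each record.

    Every transaction is emitted even when the user is unknown, which gives
    right outer join semantics with the transaction table on the right.
    """
    for txn in transaction_split:
        user = user_lookup.get(txn["user_id"])
        yield {
            "user_id": txn["user_id"],
            "item": txn["item"],
            "price": txn["price"],
            "home_country": user["home_country"] if user else None,
        }

users = [
    {"user_id": "u1", "home_country": "US"},
    {"user_id": "u2", "home_country": "IE"},
]
# Two splits stand in for the blocks processed by two different mappers.
transaction_splits = [
    [{"user_id": "u1", "item": "book", "price": 12.0}],
    [{"user_id": "u3", "item": "pen", "price": 1.5},
     {"user_id": "u2", "item": "lamp", "price": 30.0}],
]

user_lookup = build_lookup(users)
joined = [row for split in transaction_splits
          for row in map_side_join(split, user_lookup)]
for row in joined:
    print(row)
```

This works here because the user table is only around 10 MB; a reduce-side join would be the fallback when neither table fits in memory.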

Join part 2
• Left table (Customer information table)
• Right table (Transaction table)
• Question: Left/Right/Inner/Outer Join?
• Right join:
– some customers may not exist
• HashJoin? Reduce side Join?

Advertising ~2% of US GDP; $140B WW
"Half the money I spend on advertising is wasted; the trouble is, I don't
know which half." - John Wanamaker, father of modern advertising.
– Less than 1% of all impressions lead to measurable ROI
Despite its problems (Attribution, etc.)
• US GDP = $14.1 Trillion (Global $56 Trillion, 56 × 10^12)
• US Advertising Spend
– ~$275 Billion across all media
• (2% of GDP since the early 1900s)
• In 2015, worldwide online advertising was $150 billion
– I.e., about 20% of all ad spending across all media
$400 Million on Super Bowl Advertising TV/Online
Cover in more detail in Week 12

[Diagram: App Publisher, NativeX, SSP, Ad Net, Exchange, DSP, Advertiser, Ad Agency]

“OLTP” Data
Pipeline
“OLAP” Data
Pipeline
• Offline Data
• Logging Data
and Modeling
• Online Data
• Used for Offer
Serving
Realtime Batch
Data Science
Predictive
Analytics
Pipeline
• Offline Batch
Modeling
• Real-timeAd
Serving
ETL
Devices
SDK
Bid × CTR_{Ad, Context} × 1000 = eCPM_Ad
100 Milliseconds

Ranking Ads (more) at Turn Inc.

Devices
Potential Ad serving architecture
SDK
Streaming
Spark
Ad
Servers
Aurora Cube
Cassandra
SQL
Server
EMR
Modeling
Java / Python / R
Excel Pivots
Self-Service
S3
S3 S3
BI Pipeline
Data Science Pipeline
Glacier
Spark
MemCache
December 2015 View
Activity Tracking
Raw Data
Archived Activity
Tracking
EC2 Cluster
Tableau
Reporting
Services
Reporting APIs
Hourly ETL
EC2 Instance
Data Warehouse
EventTracking
Data (Logs)
Device Profiles
Device
Data
Configuration / Lookup Data

[Diagram: App Publisher, NativeX, SSP, Ad Net, Exchange, DSP, Advertiser, Ad Agency]
• DOE Native Ads
• Yield mgt
• LTV/Churn
• SDK
• LTV/Churn
• Event-based
CPA
• Flexible,
multiple
conversions
• Segment-based targeting
• Forecasting
• Coldstart
• Pacing
• Metrics and Evaluation

• End Deep Artificial
Intelligence Talk

Data Mining Lectures Lecture 18: Credit Scoring
ICS 278: Data Mining
Lecture 18: Credit Scoring
Padhraic Smyth
Department of Information and Computer Science
University of California, Irvine

Presentations for Next Week
• Names for each day will be emailed out by
tomorrow
• Instructions:
– Email me your presentations by 12 noon the day of your
presentation (no later please)
– I will load them on my laptop (so no need to bring a
machine)
– Each presentation will be 6 minutes long + 2 minutes
questions
• So probably about 4 to 8 (max) slides per presentation

References on Credit Scoring
Statistical Classification Methods in Consumer
Credit Scoring: a Review
D. J. Hand and W. E. Henley
Journal of the Royal Statistical Society: Series A
Volume 160: Issue 3, November 1997
Available online at class Web page under lecture notes
Also:
Credit Scoring and its Applications: L. C. Thomas, D. B. Edelman, J.
N. Crook,
SIAM, 2002
Credit Risk Modeling, E. Mays (editor), American Management
Association, 1998.

Outline
• Credit Scoring
– Problem definition, standard notation
• Data Sources
• Models
– Logistic regression, trees, linear regression, etc
• Model building issues
– Problem of reject inference
• Practical issues
– Cutoff selection, updating models

The Problem of Credit Scoring
• Applicants apply for a bank loan
– Population 1 is rejected
– Population 2 is accepted
• Population 2a repays their loan -> labeled “good”
• Population 2b goes into some form of default -> labeled
“bad”
• Model building
– Build a model that can discriminate population 2a from
population 2b
– Usually treated as a classification problem
– Typically want to estimate p(good | features) and rank
individuals this way
• Widely used by banks and credit card companies
– Similar problems occur in direct marketing and other

Many different applications for
Customer Scoring
• Other financial applications:
– Delinquent loans: who is most likely to pay up
• Uses historical data on who paid in the past
• Often used to create “portfolios” of delinquent debt
– Customer revenue
• How much will each customer generate in revenue over the next K years
• Predicting marketing response
– Cost of a mailer to a customer is on the order of $1
– Targeted marketing
• Rank customers in terms of “likelihood to respond”
• “Churn” prediction
– Predicting which customers are most likely to switch to another brand
– E.g., wireless phone service
– Scores used to rank customers and then target most likely with incentives
• Many more….

Some background
• History
– General ideas started in the 1950’s
• e.g., Bill Fair and Eric Isaac -> FairIsaac -> FICO scores
– Initially a bit controversial
• Worries about it being unfair to some segments of
society
– US Equal Credit Opportunity Acts, 1975/76
• Skepticism that “machine generated rules” from data
could outperform human generated guidelines
– First adopted in credit-card approvals (1960’s)
– Later broadly adopted in home-loans, etc
– Now widely accepted and used by almost all banks, credit-
granting agencies, etc

Data Sources
• Data from the loan application
– Age, address, income, profession, SS#, number of credit cards, savings, etc
– Easy to obtain
• Internal Performance data
– How the individual has performed on other loans with the same bank
– May only be available for a subset of customers
• External Performance data:
– Credit Reports
• How the individual has performed historically on all loans and credit cards
• Relatively expensive to obtain (e.g., $1 per individual)
– Court Judgements
– Real Estate records
• Macro-level external data
– Demographic characteristics for applicant’s zip code or census tract

Loan Application Data
• Issues
– Data entry errors (e.g., birthday = date of loan application)
– Deliberate falsifications (e.g., over-reporting of income)
– Legal issues
• US Equal Credit Opportunity Acts, 1975/76
• Illegal to use race, color, religion, national origin, sex,
marital status, or age in the decision to grant credit
• But what if other variables are highly predictive of some
of these variables?

Variable name | Description | Codings
dob | Year of birth | If unknown the year will be 99
nkid | Number of children | number
dep | Number of other dependents | number
phon | Is there a home phone | 1 = yes, 0 = no
sinc | Spouse's income |
aes | Applicant's employment status | V = Government; W = housewife; M = military; P = private sector; B = public sector; R = retired; E = self employed; T = student; U = unemployed; N = others; Z = no response

Variable name | Description | Codings
dainc | Applicant's income |
res | Residential status | O = Owner; F = tenant furnished; U = Tenant Unfurnished; P = With parents; N = Other; Z = No response
dhval | Value of Home | 0 = no response or not owner; 000001 = zero value; blank = no response
dmort | Mortgage balance outstanding | 0 = no response or not owner; 000001 = zero balance; blank = no response
doutm | Outgoings on mortgage or rent |
doutl | Outgoings on Loans |
douthp | Outgoings on Hire Purchase |
doutcc | Outgoings on credit cards |
Bad | Good/bad indicator | 1 = Bad; 0 = Good

Credit Report Data
• Available from 3 major bureaus in the US:
– Experian, Trans-Union, and Equifax
• Data in the form of a list of transactions/events
– Typically needs to be converted into feature-value form
• E.g., “number of credit cards opened in past 12 months”
– Can result in a huge number of features
• Cost varies as a function of type and time-window
of data requested
– Interesting problem: “cost-optimal” downloading of selected
credit report features adapted to each individual as a
function of cheaper features

Defining Good and Bad
• Good versus Bad
– Not necessarily clear how to define 2 classes
– E.g.,
• bad = ever 3 or more payments in arrears?
• Bad = 2 or more payments in arrears more than once?
– A “spectrum” of behavior
• Never any problems in payments
• Occasional problems
• Persistent problems
– Typical to discard the intermediate cases and also those with
insufficient experience to reliably classify them
• Not ideal theoretically, but convenient

Selecting a Data Set for Model
Building
• Sample selection
– Typical sample sizes ~ 10k to 100k per class
– Should be representative of customers who will apply in the
future
– Need to be able to get the relevant variables for this set of
customers
• Internal performance data
• External performance data
• Etc
• External data sources (e.g., credit reports) can
result in a very large number of possible
variables
– E.g., in the 1000’s

Models used in Credit Scoring
• Regression:
– Ignore the fact that we are estimating a probability
– Typically linear regression is used
• Classification (more common approach)
– Logistic regression (most widely used)
– Decision trees (becoming more popular)
– Neural networks (experimented with, but not used in practice so much)
– Nearest neighbors
– Model combining - some work in this area
– SVMs - too new, relatively unproven
• General comments
– Many trade-secrets, companies like FairIsaac do not publish details
– Generally the industry is conservative: prefer well-established methods
– Classification accuracy is only one part of the overall solution….

Logistic Regression Models
Training Data
p = g^-1(w0 + w1x1 + … + wpxp)
logit(p) = log(p / (1 - p)) = log(odds)
[Plot: logit(p) against p for p between 0.0 and 1.0; logit(p) = 0 at p = 0.5]
Note that near 0,
logit(p) is almost linear,
so linear and logistic regression
will be similar in this region
[Plot: p against w0 + w1x1]
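
The near-linearity of the logit around logit(p) = 0 can be checked numerically. The sketch below defines the logit and its inverse and compares logit(p) with its tangent line at p = 0.5, whose slope is 1 / (p(1 - p)) = 4. The function names are illustrative.

```python
import math

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

def inv_logit(z):
    """Inverse of the logit (the logistic function g^-1)."""
    return 1.0 / (1.0 + math.exp(-z))

def linear_approx(p):
    """Tangent line to logit(p) at p = 0.5."""
    return 4.0 * (p - 0.5)

for p in (0.45, 0.50, 0.55, 0.90):
    print(f"p={p:.2f}  logit={logit(p):+.4f}  linear={linear_approx(p):+.4f}")
```

Close to p = 0.5 the two columns agree to about three decimal places, while at p = 0.90 they differ substantially, which is where logistic and linear regression start to give different answers.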

Modeling Example
Model | Bad Risk Rate (%)
k nearest neighbor with special metric | 43.09
k nearest neighbor (standard) | 43.25
logistic regression | 43.30
linear regression | 43.36
decision tree | 43.77
(from Hand and Henley paper)

Evaluation Methods
• Decile/Centile reporting:
– Rank customers by predicted scores
– Report “lift” rate in each decile (and cumulatively) compared to accepting
everyone
• Receiver Operating Characteristic (ROC)
– Vary classification threshold
– Plot proportion of good risks accepted vs. bad risks accepted
• Bad risk rate = proportion of bad risks among those accepted
– Let p = proportion of good risks
– Let a = proportion accepted
– E.g., one can show that, with a > p, the bad risk rate among those accepted is
lower bounded by 1 – p/a
– E.g., p = 0.45, a = 0.70 => bad risk rate must be between 0.35 and 0.78
