Large Scale Distributed Data Science using Spark © KDD2015James G. Shanahan Contact:James.Shanahan @ gmail.com 1
November 16, 2016
How Data and Data Science are revolutionizing the world
James G. Shanahan (1, 2)
(1) IoTGurus; (2) iSchool, UC Berkeley, CA
EMAIL: James_DOT_Shanahan_AT_gmail_DOT_com
Outline
• Introduction
• Artificial Intelligence
• Machine Learning
– Empirical Sport
– NetFlix
– Dashboards
• Data Science
• Applications
• Architecture
• What’s next?
James G. Shanahan: 25+ years in data science
• Technology: Systems, Parallel Computing, Hadoop, Spark, Python, R, Scala, Java
• Domain Expertise: Digital Advertising & Marketing, Web + mobile + local Search, Anticipatory info. systems, Cellular Networks, Social Networks
• Math & Theory: Statistics, Optimization Theory, Probability
• Social Network Analytics, Geo-Informational Science, HCI, Graphs, NLP
• Leadership, Business Acumen, Teacher: Led teams of R&D, r&D; Xerox Research, AT&T, Turn, NativeX, Adobe; Entrepreneur; Teach at UC Berkeley
[Diagram labels for years of experience: 16+, 25+, 25+, 16+]
James G. Shanahan
• 25+ years in data science
• Currently
– Principal and Founder, Data Science Consultancy
• Clients: Target, Adobe, Akamai, Ancestry, AT&T, Nokia Siemens, SearchMe, …
– Teaching
• Co-creator of UC Berkeley MIDS program; curriculum development
• Teach Large Scale Machine Learning (Fall 2014,2015,2016)
• Teach Machine Learning and Optimization Theory at University of California
Santa Cruz (UCSC), TIM 206, TIM 209, TIM 250, TIM 251 (since 2008)
– Advising: Quixey, InferSystems, Knotch
• Previously
– NativeX: SVP of Data Science, Chief Scientist, and board member
– Founding Chief Scientist, Turn Inc.
– Principal Scientist, Clairvoyance Corp (CMU spinoff; sister lab to JRC)
– Research Scientist, Xerox Research;
– Entrepreneur: Cofounder of Document Souls and RTB Fast
• Education: PhD in ML, University of Bristol, UK; B.Sc. CS, Uni. of Limerick, Ireland
Large-Scale Machine Learning, MIDS, UC Berkeley © 2015 James G. Shanahan Contact:James.Shanahan @ gmail.com 5
Audience Participation is encouraged!
Outline
• Introduction
• Artificial Intelligence
• Machine Learning
• Data Science
• Applications
• What’s next?
Data science everywhere
Traditional Data Science
Deep Learning
What is Intelligence?
• Intelligence:
– “the capacity to learn and solve problems” (Webster's dictionary)
– in particular,
• the ability to solve novel problems
• the ability to act rationally
• the ability to act like humans
• Artificial Intelligence
– build and understand intelligent entities or agents
– 2 main approaches: “engineering” versus “cognitive
modeling”
What’s involved in Intelligence?
• Ability to interact with the real world
– to perceive, understand, and act
– e.g., speech recognition and understanding and synthesis
– e.g., image understanding
– e.g., ability to take actions, have an effect
• Reasoning and Planning
– modeling the external world, given input
– solving new problems, planning, and making decisions
– ability to deal with unexpected problems, uncertainties
• Learning and Adaptation
– we are continuously learning and adapting
– our internal models are always being “updated”
• e.g., a baby learning to categorize and recognize animals
Can machines think?  Turing Test
• In the test, an interrogator converses with a man
and a machine via a text-based channel.
– If the interrogator fails to guess which one is the machine,
then the machine is said to have passed the Turing test.
(This is a simplification; there are more nuances in and
variants of the Turing test, but these are not relevant for our
present purposes.)
• The beauty of the Turing test is its simplicity and
its objectivity, because it is only a test of
behavior, not of the internals of the machine. It
doesn't care whether the machine is using logical
methods or neural networks. This decoupling of
what to solve from how to solve is an important
theme in this class.
What AI can do for you?
Instead of asking what AI is, let us turn to the more pragmatic question
of what AI can do for you. We will go through some examples where AI
has already had a substantial impact on society.
Academic Disciplines relevant to AI
• Philosophy: logic, methods of reasoning, mind as physical system, foundations of learning, language, rationality
• Mathematics: formal representation and proof, algorithms, computation, (un)decidability, (in)tractability
• Probability/Statistics: modeling uncertainty, learning from data
• Economics: utility, decision theory, rational economic agents
• Neuroscience: neurons as information processing units
• Psychology/Cognitive Science: how people behave, perceive, process cognitive information, represent knowledge
• Computer engineering: building fast computers
• Control theory: design systems that maximize an objective function over time
• Linguistics: knowledge representation, grammars
History of AI
• 1943: early beginnings
– McCulloch & Pitts: Boolean circuit model of brain
• 1950: Turing
– Turing's “Computing Machinery and Intelligence”
• 1956: birth of AI
– Dartmouth meeting: “Artificial Intelligence” name adopted
• 1950s: initial promise
– Early AI programs, including
– Samuel's checkers program
– Newell & Simon's Logic Theorist
• 1955-65: “great enthusiasm”
– Newell and Simon: GPS, general problem solver
– Gelernter: Geometry Theorem Prover
– McCarthy: invention of LISP
History of AI
• 1966-73: Reality dawns
– Realization that many AI problems are intractable
– Limitations of existing neural network methods identified
• Neural network research almost disappears
• 1969-85: Adding domain knowledge
– Development of knowledge-based systems
– Success of rule-based expert systems
• E.g., DENDRAL, MYCIN
• But were brittle and did not scale well in practice
• 1986 onward: Rise of machine learning
– Neural networks return to popularity
– Major advances in machine learning algorithms and applications
• 1990 onward: Role of uncertainty
– Bayesian networks as a knowledge representation framework
• 1995 onward: AI as Science
– Integration of learning, reasoning, knowledge representation
– AI methods used in vision, language, data mining, etc.
[Figure: a brief history of game AI. Source: http://www.andreykurenkov.com/writing/images/2016-4-15-a-brief-history-of-game-ai/0-history.png]
Success Stories
• Deep Blue defeated the reigning world chess
champion Garry Kasparov in 1997
• AI program proved a mathematical conjecture
(Robbins conjecture) unsolved for decades
• During the 1991 Gulf War, US forces deployed an
AI logistics planning and scheduling program
that involved up to 50,000 vehicles, cargo, and
people
• NASA's on-board autonomous planning program
controlled the scheduling of operations for a
spacecraft
• Proverb solves crossword puzzles better than
most humans
Can Computers beat Humans at Chess?
• Chess Playing is a classic AI problem
– well-defined problem
– very complex: difficult for humans to play well
[Figure: chess ratings (points, 1200 to 3000) by year, 1966 to 1997, comparing Deep Thought and Deep Blue with the human world champion]
Summary of State of AI Systems in Practice
• Speech synthesis, recognition and understanding
– very useful for limited vocabulary applications
– unconstrained speech understanding is still too hard
• Computer vision
– works for constrained problems (hand-written zip-codes)
– understanding real-world, natural scenes is still too hard
• Learning
– adaptive systems are used in many applications: have their limits
• Planning and Reasoning
– only works for constrained problems: e.g., chess
– real-world is too complex for general systems
• Overall:
– many components of intelligent systems are “doable”
– there are many interesting research problems remaining
Can Computers Talk?
• This is known as “speech synthesis”
– translate text to phonetic form
• e.g., “fictitious” -> fik-tish-es
– use pronunciation rules to map phonemes to actual sound
• e.g., “tish” -> sequence of basic audio sounds
• Difficulties
– sounds made by this “lookup” approach sound unnatural
– sounds are not independent
• e.g., “act” and “action”
• modern systems (e.g., at AT&T) can handle this pretty well
– a harder problem is emphasis, emotion, etc
• humans understand what they are saying
• machines don’t: so they sound unnatural
• Conclusion:
– NO, for complete sentences
– YES, for individual words
Can Computers Recognize Speech?
• Speech Recognition:
– mapping sounds from a microphone into a list of words
– classic problem in AI, very difficult
• “Let's talk about how to wreck a nice beach”
• (I really said “________________________”)
• Recognizing single words from a small
vocabulary
• systems can do this with high accuracy (order of 99%)
• e.g., directory inquiries
– limited vocabulary (area codes, city names)
– computer tries to recognize you first, if unsuccessful hands you
over to a human operator
– saves millions of dollars a year for the phone companies
Recognizing human speech (ctd.)
• Recognizing normal speech is much more difficult
– speech is continuous: where are the boundaries between words?
• e.g., “John’s car has a flat tire”
– large vocabularies
• can be many thousands of possible words
• we can use context to help figure out what someone said
– e.g., hypothesize and test
– try telling a waiter in a restaurant:
“I would like some dream and sugar in my coffee”
– background noise, other speakers, accents, colds, etc
– on normal speech, modern systems are only about 60-70%
accurate
• Conclusion:
– NO, normal speech is too complex to accurately recognize
– YES, for restricted problems (small vocabulary, single speaker)
Can Computers Understand speech?
• Understanding is different to recognition:
– “Time flies like an arrow”
• assume the computer can recognize all the words
• how many different interpretations are there?
– 1. time passes quickly like an arrow?
– 2. command: time the flies the way an arrow times the flies
– 3. command: only time those flies which are like an arrow
– 4. “time-flies” are fond of arrows
• only 1. makes any sense,
– but how could a computer figure this out?
– clearly humans use a lot of implicit commonsense knowledge in
communication
• Conclusion: NO, much of what we say is beyond
the capabilities of a computer to understand at
present
Can Computers Learn and Adapt ?
• Learning and Adaptation
– consider a computer learning to drive on the freeway
– we could teach it lots of rules about what to do
– or we could let it drive and steer it back on course when it heads for
the embankment
• systems like this are under development (e.g., Daimler Benz)
• e.g., RALPH at CMU
– in mid 90’s it drove 98% of the way from Pittsburgh to San Diego without
any human assistance
– machine learning allows computers to learn to do things without
explicit programming
– many successful applications:
• requires some “set-up”: does not mean your PC can learn to
forecast the stock market or become a brain surgeon
• Conclusion: YES, computers can learn and adapt, when
presented with information in the appropriate way
Can Computers “see”?
• Recognition v. Understanding (like Speech)
– Recognition and Understanding of Objects in a scene
• look around this room
• you can effortlessly recognize objects
• human brain can map 2D visual image to 3D “map”
• Why is visual recognition a hard problem?
• Conclusion:
– mostly NO: computers can only “see” certain types of objects
under limited circumstances
– YES for certain constrained problems (e.g., face recognition)
Can computers plan and make optimal decisions?
• Intelligence
– involves solving problems and making decisions and plans
– e.g., you want to take a holiday in Brazil
• you need to decide on dates, flights
• you need to get to the airport, etc.
• involves a sequence of decisions, plans, and actions
• What makes planning hard?
– the world is not predictable:
• your flight is canceled or there’s a backup on the 405
– there are a potentially huge number of details
• do you consider all flights? all dates?
– no: commonsense constrains your solutions
– AI systems are only successful in constrained planning problems
• Conclusion: NO, real-world planning and decision-making is still beyond
the capabilities of modern computers
– exception: very well-defined, constrained problems
Separate what to compute (modeling) from how to compute it (algorithms)
Lecture Outline
• Introduction
• Artificial Intelligence
• Machine Learning
• Data Science
• Applications
• What’s next?
What is machine learning?
machine learning
• Supporting all of these models is machine learning.
• In the non-machine learning approach, one would write a complex
program (remember, we are solving tasks of significant
complexity), but this gets very tedious.
– For example, how should a spellchecker know that for "hte", "the" (transposition) is
more likely to be the correct output as compared to "hate" (insertion)?
• The machine learning approach is to instead write a really simple
program with unknown parameters (e.g., numbers measuring how
bad it is to transpose or insert characters).
• Then, we obtain a set of training examples that partially specifies
the desired system behavior. A learning algorithm takes these
training examples and sets the parameters of our simple program
so that the resulting program approximately produces the desired
system behavior.
• Abstractly, machine learning allows us to shift the complexity
from the program to the data, which is much easier to obtain
(either naturally occurring or via crowdsourcing).
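The spellchecker idea above can be sketched in a few lines. This is only an illustration under stated assumptions: the training pairs, the function names (edit_type, learn_costs, correct) and the choice of one cost per edit type are hypothetical and do not come from the slides. The "simple program" picks the cheapest single edit; the learning step sets the costs from examples.

```python
from collections import Counter
import math

EDIT_TYPES = ["transposition", "insertion", "deletion", "substitution"]

def edit_type(typo, word):
    """Name the single edit that turns `typo` into `word`, or None."""
    if len(typo) == len(word):
        diff = [i for i in range(len(word)) if typo[i] != word[i]]
        if (len(diff) == 2 and diff[1] == diff[0] + 1
                and typo[diff[0]] == word[diff[1]]
                and typo[diff[1]] == word[diff[0]]):
            return "transposition"
        if len(diff) == 1:
            return "substitution"
    elif len(typo) + 1 == len(word):
        # The correction must insert one character into the typo.
        if any(word[:i] + word[i + 1:] == typo for i in range(len(word))):
            return "insertion"
    elif len(typo) == len(word) + 1:
        # The correction must delete one character from the typo.
        if any(typo[:i] + typo[i + 1:] == word for i in range(len(typo))):
            return "deletion"
    return None

def learn_costs(training_pairs):
    """Set one cost per edit type: -log of its smoothed relative frequency."""
    counts = Counter(edit_type(t, w) for t, w in training_pairs)
    total = sum(counts[e] for e in EDIT_TYPES) + len(EDIT_TYPES)
    return {e: -math.log((counts[e] + 1) / total) for e in EDIT_TYPES}

def correct(typo, dictionary, costs):
    """Return the dictionary word reachable by the cheapest single edit."""
    scored = []
    for word in dictionary:
        kind = edit_type(typo, word)
        if kind is not None:
            scored.append((costs[kind], word))
    return min(scored)[1] if scored else typo

# Hypothetical training examples: (what was typed, what was meant).
training_pairs = [("teh", "the"), ("recieve", "receive"), ("adn", "and"),
                  ("wrod", "word"), ("frm", "from")]
costs = learn_costs(training_pairs)
print(correct("hte", ["the", "hate", "hat"], costs))
```

Because transpositions dominate these made-up examples, the learned cost of a transposition is lower than that of an insertion, so "hte" is corrected to "the" rather than "hate". The program stays the same; only the data decides the parameters.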
Equation of a line
y = mx + b
f(x) = mx + b
Machine Learning in one slide
• Machine learning, a branch of artificial intelligence, is a scientific
discipline that is concerned with the design and development of
algorithms that allow computers to evolve behaviors based on
empirical data, such as from sensor data or databases.
• A learner can take advantage of examples (data) to capture
characteristics of interest of their unknown underlying probability
distribution. Data can be seen as examples that illustrate relations
between observed variables.
• A major focus of machine learning research is to automatically
learn to recognize complex patterns and make intelligent
decisions based on data; the difficulty lies in the fact that the set
of all possible behaviors given all possible inputs is too large to
be covered by the set of observed examples (training data).
• Hence the learner must generalize from the given examples, so as
to be able to produce a useful output in new cases. Machine
learning, like all subjects in artificial intelligence, requires
cross-disciplinary proficiency in several areas, such as probability
theory, statistics, pattern recognition, cognitive science, data
mining, adaptive control, computational neuroscience and
theoretical computer science.
What is the Learning Problem?
• Improve over Task T
• with respect to performance measure P
• based on experience E
Learning = Improving with experience at some task
Types of Learning
• Supervised learning - Generates a function that maps inputs
to desired outputs. For example, in a classification problem,
the learner approximates a function mapping a vector into
classes by looking at input-output examples of the function.
• Unsupervised learning - Models a set of inputs: like clustering
• Semi-supervised learning - Combines both labeled and
unlabeled examples to generate an appropriate function or
classifier.
• Reinforcement learning - Learns how to act given an
observation of the world. Every action has some impact in the
environment, and the environment provides feedback in the
form of rewards that guides the learning algorithm.
• Transduction - Tries to predict new outputs based on training
inputs, training outputs, and test inputs.
Supervised Learning: Regression
• Regression
– Linear Regression
• Classification
– Logistic Regression
• Generalized Linear Models (GLMs)
– Broader family of models (that subsumes linear regression,
logistic regression, and more)
– In R, check out ?glm
Parametric Approaches vs. Non-parametric
Convex/Concave
Discriminative versus generative
Classification versus Regression
• Classification is just like a regression problem,
except where the values of y that we now want to
predict take on only a small number of discrete
values (assume no order in y)
• Binary logistic regression
– For now let’s focus on binary classification where y can take
on two values 0 and 1 (can be generalized to multi-class
case)
• E.g., building an ancestor class; a person is an
ancestor (where y might take the value of 1) or
not (y=0).
– Given Xi, the corresponding yi is also known as the label of
that training example
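To make the binary case concrete, here is a minimal sketch of how a logistic regression model turns a linear score into a probability for y = 1 and then into a 0/1 label. The weights and the 0.5 threshold are hypothetical values chosen for illustration; in practice the weights are learned from labeled training data.

```python
import math

def sigmoid(z):
    """Squash a real-valued score into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, x):
    """Estimate Pr(y = 1 | x); weights[0] is the intercept."""
    z = weights[0] + sum(w * xi for w, xi in zip(weights[1:], x))
    return sigmoid(z)

def predict_label(weights, x, threshold=0.5):
    """Map the probability to one of the two discrete values 0 and 1."""
    return 1 if predict_proba(weights, x) >= threshold else 0

# Hypothetical model with one feature: intercept -3.0, coefficient 1.0.
weights = [-3.0, 1.0]
for x in ([1.0], [5.0]):
    print(x, round(predict_proba(weights, x), 3), predict_label(weights, x))
```

The output of predict_proba is continuous, as in regression; it is the final thresholding step that restricts y to the two discrete values.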
Families of Supervised Learning
• Generative Classifier (Bottom-up learning)
– Build model of each class
– Assume the underlying form of the classes and estimate
their parameters (e.g., a Gaussian)
• Discriminative Classifier (Top down)
– Build model of boundary between classes
– Assume the underlying form of the discriminant and
estimate its parameters (e.g., a hyperplane)
[Figures: the classes Sports, Arts, Business, and Health, drawn once for each classifier family]
Terminology: linear regression
$y = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_n x_n$

y                     X_i's
Predicted             Predictor variables
Response variable     Explanatory variables
Outcome variable      Covariables
Dependent variable    Independent variables

The w_i are the model coefficients; w_0 is the y-intercept/threshold.
Pr(Click): Advertising Problem
• Predict Pr(Click|dwellTimeOnWebpage)
– at the times 1, 2, 3, 4, and 5 seconds after
loading the page.
• Graph each data point with time on the
x-axis and CTR on the y-axis. Your data
should follow a straight line.
• Use locator() to input data
• Find the equation of this line.
#   x (seconds)   y (CTR %)
1   1             2
.   2             3
.   3             7
.   4             8
m   5             9
[Plot axes: x (horizontal), F(x) (vertical)]
X are features, aka variables: continuous, discrete, or ordinal ($X \in \mathbb{R}^n$)
Least Square Fit Approximations
Suppose we want to fit the data set below.
We would like to find the best straight line to fit the data.
#   x   y
1   1   2
.   2   3
.   3   7
.   4   8
m   5   9
Fit a line based on…
• If we assume that the first two points are correct
and choose the line that goes through them, we
get the line y = 1 + x.
• If we substitute our points (x-values) into this
equation, we get the following chart.
• How good is this line?
– The sum of the squares of the errors is 27.
SSE = 27
Do you think that we can do better
than this?
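The chart for this slide did not survive extraction, so the short sketch below rebuilds it from the five data points and the line y = 1 + x. The helper name sse is an assumption made for illustration.

```python
xs = [1, 2, 3, 4, 5]
ys = [2, 3, 7, 8, 9]

def sse(intercept, slope, xs, ys):
    """Sum of squared errors of the line y = intercept + slope * x."""
    return sum((intercept + slope * x - y) ** 2 for x, y in zip(xs, ys))

# Line through the first two points: y = 1 + x.
print("x  y  predicted  error  error^2")
for x, y in zip(xs, ys):
    predicted = 1 + x
    error = y - predicted
    print(f"{x}  {y}  {predicted:9d}  {error:5d}  {error ** 2:7d}")
print("SSE =", sse(1, 1, xs, ys))
```

The first two points are fit exactly, while the last three each miss by 3, which gives 9 + 9 + 9 = 27.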
Linear Model More Generally
• E.g., y = mx + b can be seen more generally as a function of
the form
$y = f(x_0, x_1) = w_0 x_0 + w_1 x_1$
$y = f(x_0, x_1) = \sum_{i=0}^{n} w_i x_i = W^T X$ (here n = 1 and x_0 = 1)
Sometimes θ is used instead of W.
• Here the W’s are the parameters (also called weights)
parameterizing the space of linear functions mapping
from X → Y = F(x)
#   x0   x1   y
1   1    1    2
.   1    2    3
.   1    3    7
.   1    4    8
m   1    5    9
[Plot: F(x) against x1; a line with slope m and intercept b]
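A minimal sketch of the same idea in code, assuming the hypothetical helper names augment and f: prepending the constant x0 = 1 lets the intercept b be treated as just another weight, so the whole model is a single dot product.

```python
def augment(x):
    """Prepend the dummy intercept variable x0 = 1 to a feature vector."""
    return [1.0] + list(x)

def f(W, X):
    """Linear model as a dot product: sum of w_i * x_i, i.e. W^T X."""
    return sum(w * x for w, x in zip(W, X))

# y = mx + b with b = 1 and m = 1 becomes W = [w0, w1] = [b, m].
W = [1.0, 1.0]
for x1 in [1.0, 2.0, 3.0]:
    print(x1, f(W, augment([x1])))
```

With this convention the same function f works unchanged for any number of features; only the lengths of W and X grow.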
Machine Learning Background
Machine Learning (ML): “a computer program that improves its performance at
some task through experience” [Mitchell 1997]
GIVEN: Input data is a table of attribute values and associated class values (in
the case of supervised learning)
GOAL: Approximate f(x1,…,xn) -> y
Instance\Attr   x1   x2   …   xn   y
1               3    0    …   7    -1
2                                  +1
…               …    …    …   …    …
L (aka m)       0    4    …   8    -1
Y is categorical
Machine Learning: Regression
Machine Learning (ML): “a computer program that improves its performance at
some task through experience” [Mitchell 1997]
GIVEN: Input data is a table of attribute values and associated class values (in
the case of supervised learning)
GOAL: Approximate f(x1,…,xn) -> y
Instance\Attr   x1   x2   …   xn   y
1               3    0    …   7    73
2                                  76
…               …    …    …   …    …
L (aka m)       0    4    …   8    97
Y is real valued
Machine Learning: Semi-supervised
Machine Learning (ML): “a computer program that improves its performance at
some task through experience” [Mitchell 1997]
GIVEN: Input data is a table of attribute values and associated class values (in
the case of supervised learning)
GOAL: Approximate f(x1,…,xn) -> y
Instance\Attr   x1   x2   …   xn   y
1               3    0    …   7    73
2                                  76
…               …    …    …   …    …
L (aka m)       0    4    …   8    97
Y is only partially available
Machine Learning: Unsupervised
Machine Learning (ML): “a computer program that improves its performance at
some task through experience” [Mitchell 1997]
GIVEN: Input data is a table of attribute values and associated class values (in
the case of supervised learning)
GOAL: Approximate f(x1,…,xn) -> y
Instance\Attr   x1   x2   …   xn   y
1               3    0    …   7    73
2                                  76
…               …    …    …   …    …
L (aka m)       0    4    …   8    97
Y is not available
Generative vs. Discriminative
• Generative learning (e.g., Bayesian
Networks, HMM, Naïve Bayes, EM GMM)
typically more flexible
– More complex problems
– More flexible predictions
• Discriminative learning (e.g., ANN, SVM)
typically more accurate
– Better with small datasets
– Faster to train
Parametric vs. Non-Parametric ML Algorithms
• Parametric ML Algorithms (e.g., OLS, Decision Trees;
SVMs, NNs)
– Model-based methods, such as neural networks and the mixture of
Gaussians, use the data to build a parameterized model. After training,
the model is used for predictions and the data are generally discarded.
• Non-Parametric (lowess(); knn; some flavours of SVMs)
– In contrast, “memory-based” methods are non-parametric approaches
that explicitly retain the training data, and use it each time a prediction
needs to be made.
– The term “non-parametric” (roughly) refers to the fact that the amount
of stuff we need to keep in order to represent the hypothesis/model
grows linearly with the size of the training set.
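The contrast can be sketched on a tiny data set. The example below is illustrative only: the five training points are borrowed from the line-fitting slides, the two coefficients of the parametric model are assumed to have been fitted already, and k = 2 is an arbitrary choice.

```python
train = [(1, 2), (2, 3), (3, 7), (4, 8), (5, 9)]  # (x, y) pairs

def parametric_predict(w0, w1, x):
    """Model-based: after training, only the two numbers w0 and w1 are kept."""
    return w0 + w1 * x

def knn_predict(train, x, k=2):
    """Memory-based: every prediction consults the stored training data."""
    neighbors = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return sum(y for _, y in neighbors) / k

# Assumed coefficients of an already fitted line (hypothetical values).
w0, w1 = 0.1, 1.9
print(parametric_predict(w0, w1, 3.4))  # needs only w0 and w1
print(knn_predict(train, 3.4))          # averages the y of the 2 nearest x
```

The k-nearest-neighbour predictor has nothing to store except the data itself, so its memory footprint grows with the training set, while the linear model stays at two numbers however many examples it saw.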
Linear Model: Ordinary Least Squares
• How do we pick, or learn, the parameters W (aka θ)?
• One reasonable method seems to be to make f(x)
close to y, at least for the training examples.
• To formalize, let’s define a function that measures,
for each possible model/hypothesis W, how close the
f_W(x_i)’s are to the corresponding y_i’s:
• Sum of squared error
• AKA Residual Sum of Squares (Residual squared)
Measuring Quality
Residual sum of squares:
$J(W) = \frac{1}{2} \sum_{i=1}^{m} (W^T X_i - y_i)^2$
Is this error minimization going to have problems?
$J(W) = \sum_{i=1}^{m} (W^T X_i - y_i)$
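One way to explore the question on this slide is to evaluate both objectives on the five-point data set used elsewhere in the deck. This is a sketch under assumptions: the function names are hypothetical, and each X_i is taken to be (1, x_i) so that W = (w0, w1).

```python
xs = [1, 2, 3, 4, 5]
ys = [2, 3, 7, 8, 9]

def residuals(W, xs, ys):
    """Residual W^T X_i - y_i for each example, with X_i = (1, x_i)."""
    w0, w1 = W
    return [w0 * 1 + w1 * x - y for x, y in zip(xs, ys)]

def J_squared(W, xs, ys):
    """Half the sum of squared residuals."""
    return 0.5 * sum(r ** 2 for r in residuals(W, xs, ys))

def J_signed(W, xs, ys):
    """Sum of raw (signed) residuals, without squaring."""
    return sum(residuals(W, xs, ys))

flat_line = (5.8, 0.0)    # horizontal line at the mean of y: a poor fit
far_below = (-100.0, 0.0) # a line nowhere near the data

for name, W in [("flat line", flat_line), ("far below", far_below)]:
    print(name, round(J_signed(W, xs, ys), 6), round(J_squared(W, xs, ys), 6))
```

For the flat line the positive and negative residuals cancel, so the signed sum is essentially zero even though the fit is poor. For the line far below the data the signed sum is strongly negative, so minimizing it would push the line away from the points without bound. Squaring removes both problems.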
Residual
[Figure: scatter plot of y (0 to 60) against x (0 to 16) with a fitted line; the residual of point i is marked]
$\text{Residual}_i = W^T X_i - y_i$
Which Line is it anyway?
• Select another two points and build a line
• If we choose the line that goes through the points
when x = 3 and 4, we get the line y = 4 + x. Will we
get a better fit? Let's look at it.
SSE = 18. Getting better but can we do better?
Can we do better than guesswork?
• Let's try the line that is half way between these
two lines. The equation would be y = 2.5 + x.
• Is there a more scientific or efficient way than
guessing which line would give the best fit?
– Surely there is a methodical way to determine the best fit
line. Let's think about what we want.
SSE = 11.25. Getting better, but can we do better?
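For a single feature there is a well-known closed-form answer to this question: the least-squares slope is the covariance of x and y divided by the variance of x, and the intercept follows from the means. The sketch below applies it to the same five points; the function names are assumptions made for illustration.

```python
xs = [1, 2, 3, 4, 5]
ys = [2, 3, 7, 8, 9]

def fit_line(xs, ys):
    """Return (intercept, slope) minimizing the sum of squared errors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def sse(intercept, slope, xs, ys):
    """Sum of squared errors of the line y = intercept + slope * x."""
    return sum((intercept + slope * x - y) ** 2 for x, y in zip(xs, ys))

b, m = fit_line(xs, ys)
print(f"y = {b:.2f} + {m:.2f}x, SSE = {sse(b, m, xs, ys):.2f}")
```

On this data the fit comes out at roughly y = 0.1 + 1.9x with an SSE of about 2.7, well below the 11.25 reached by guessing. Note that the best line has a different slope from all three guesses, which only varied the intercept.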
Hypothesis Space of Linear Models
• Here the W’s are the parameters (also called
weights) parameterizing the space of linear
functions mapping from X → Y = f(X)
• Augment Training Data with dummy intercept
variable (simplifies notation and modeling)
#   x0   x1   y
1   1    1    2
.   1    2    3
.   1    3    7
.   1    4    8
m   1    5    9
$y = f(x_0, x_1) = w_0 x_0 + w_1 x_1$
$y = f(x_0, x_1) = \sum_{i=0}^{n} w_i x_i = W^T X$ (here n = 1 and x_0 = 1)
Sometimes θ is used instead of W.
Large-Scale Machine Learning, MIDS, UC Berkeley © 2015 James G. Shanahan Contact:James.Shanahan @ gmail.com 66
Space of Hypotheses: Weights
example.OLS_Heatmap()
• In our case, each model is a pair of coefficients: one for the y-intercept (bias)
and one for the feature variable (time)
• Plot weight space in 2D, where the third dimension is the error
• Select the combination that minimizes the sum of squared errors
[Figures: heat map with isolines overlaid; 3D error surface, z = log(w0 + w1x)]
$J(W) = \frac{1}{2} \sum_{i=1}^{m} (W X_i - y_i)^2$
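A brute-force version of "select the combination that minimizes the error" can be sketched as a grid search over the two weights. This is not the `example.OLS_Heatmap()` routine referenced on the slide, whose implementation is not shown; the data points, grid ranges, and function names below are all assumptions chosen for illustration.

```python
# Training points: an assumption taken from the deck's data table (x1, y).
xs = [1, 2, 3, 4, 5]
ys = [2, 3, 7, 8, 9]


def cost(w0, w1):
    """J(W) = 1/2 * sum of squared residuals for the line y = w0 + w1 * x."""
    return 0.5 * sum((w0 + w1 * x - y) ** 2 for x, y in zip(xs, ys))


def grid_search(w0_values, w1_values):
    """Return (J, w0, w1) for the lowest-cost point on the weight grid."""
    return min((cost(w0, w1), w0, w1) for w0 in w0_values for w1 in w1_values)


# Hypothetical grid: intercepts from -2.0 to 4.0 and slopes from 0.0 to 3.0, step 0.1.
w0_grid = [i / 10 for i in range(-20, 41)]
w1_grid = [i / 10 for i in range(0, 31)]

best_j, best_w0, best_w1 = grid_search(w0_grid, w1_grid)
print(f"best grid point: w0={best_w0}, w1={best_w1}, J={best_j:.3f}")
```

Each grid cell corresponds to one pixel of the heat map; the search simply reads off the darkest cell. The cost grows with the grid resolution and with the number of weights, which is why iterative methods are preferred beyond a couple of parameters.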
Hyperplanes partition the input space (a line in a 2-input-
variable problem) and do NOT predict real values
• Many methods in machine learning are based on
finding parameters that minimize some objective
function.
• Very often, the objective function is a weighted
sum of two terms:
• a cost function and a regularization term.
• In statistical terms, the (log-)likelihood and the (log-)prior.
– If both of these components are convex, then their sum is
also convex.
– Loss functions are summed over examples, and a sum of
convex functions is itself convex
Minimize Residuals
Given a linear regression model W, please type in the loss
function for linear regression
[Figure, left panel: prediction line. y = f(X1), where y is real-valued; y = mX + b; Prediction(X1) = mX1 + b]
[Figure, right panel: separating hyperplane partitions the input space. y = f(X1, X2), where y is in {0, 1} or {-1, 1}; hyperplane AX1 + BX2 + C; Class(X1, X2) = sign(AX1 + BX2 + C)]
Partitions versus predicts
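The contrast between predicting and partitioning can be made concrete with two small functions. This is a minimal sketch: the coefficient values in the usage lines are hypothetical, and treating a point exactly on the hyperplane as class +1 is a convention assumed here, not something stated on the slide.

```python
def predict(x1, m, b):
    """Regression: return a real-valued prediction from the line y = m * x1 + b."""
    return m * x1 + b


def classify(x1, x2, a, b, c):
    """Classification: return +1 or -1 according to the side of the
    hyperplane a*x1 + b*x2 + c = 0 on which the point falls.
    Points exactly on the hyperplane are assigned +1 (an assumed convention)."""
    return 1 if a * x1 + b * x2 + c >= 0 else -1


# Hypothetical coefficients, for illustration only.
print(predict(3, m=1.9, b=0.1))           # a real value on the prediction line
print(classify(1, 1, a=1, b=1, c=-1))     # a class label, one side of the boundary
print(classify(0, 0, a=1, b=1, c=-1))     # a class label, the other side
```

The regression function returns a point on the line itself, whereas the classifier only reports which side of the line the input lies on.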
Unsupervised Learning (Clustering)
Input data
We want 3 clusters: red, green, and blue
UNIVERSITY OF SCIENCE AND TECHNOLOGY OF CHINA SCHOOL OF SOFTWARE ENGINEERING OF USTC
Center of a cluster
Let’s compute the center of those points
We can use the mean on each dimension
But the mean has trouble with outliers
Using the median on each dimension is more robust
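The effect of an outlier on the two kinds of center can be seen on a tiny example. The points below are made up for illustration (the slide's actual data is in a figure that did not survive extraction), and the helper name `center` is hypothetical.

```python
from statistics import mean, median


def center(points, stat):
    """Apply `stat` independently to each dimension of a list of (x, y) points."""
    return tuple(stat(dim) for dim in zip(*points))


# Hypothetical cluster: four nearby points plus one outlier at (20, 30).
cluster = [(1.0, 1.0), (2.0, 1.5), (1.5, 2.0), (2.0, 2.0), (20.0, 30.0)]

mean_center = center(cluster, mean)      # dragged towards the outlier
median_center = center(cluster, median)  # stays with the bulk of the points

print("mean center:  ", mean_center)
print("median center:", median_center)
```

The per-dimension median ignores how far away the outlier is and only uses its rank, which is why it stays near the four clustered points.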
Assignment
All points coloured properly already
⇒ we are done!
Three generations of machine learning
• First generation: datasets that fit in memory
– Single-node learning: summary statistics and some batch modeling (at small
scale); SQL, R
– Down-sampling the data
• Second generation: general-purpose clusters and frameworks
– Distributed frameworks that allow us to divide and conquer problems
– Learning with general-purpose frameworks such as Hadoop: offline big data
analysis, real-time decision making, homegrown specialist systems (Hadoop and R
for analysis and modeling)
– In-house purpose-built systems; specialist sport
• Third generation: purpose-built libraries and frameworks
– Built for the iterative algorithms that are commonplace in ML
– Huge-scale real-time analysis and decision-making systems
– Specialized frameworks for large-scale manipulation of the type of data you are
working with
– For example, machine learning libraries like MLlib in Spark, and graph processing
libraries like Apache Giraph or GraphX in Spark
Evolution of Map-Reduce frameworks
for big data processing
[Timeline figure: mid 90s (Jimi’s PhD); first generation; 2nd generation; Hadoop V2.0; 3rd generation; Spark 1.0; 2015: Spark 1.5 (as of 10/2015)]
Top 10 ML Algorithms
https://www.dezyre.com/article/top-10-machine-learning-algorithms/202
Naïve Bayes Classifier
K Means Clustering Algorithm
Nearest Neighbours
Apriori Algorithm
Linear Regression
Logistic Regression
Support Vector Machine
Decision Trees
Ensembles/Forests
Artificial Neural Networks/Deep Learning
Reinforcement learning
Forecasting
Many more!
Lecture Outline
• Introduction
• Artificial Intelligence
• Machine Learning
• Data Science
• Applications
• What’s next?
Internet companies started the revolution
But more traditional
companies are leveraging
their data and DS Tech
Data Analysis Has Been Around for a While
R.A. Fisher
Howard Dresner
Peter Luhn
W.E. Deming
2012: Deep Learning
2013: Spark
1997: Google
Data Science DS Skillset
• Linear regression, DT models for
domain experts
Domain Expertise
A Venn diagram with a Danger Bearing [adapted from Drew Conway]
Data Science
Technology: Hadoop, Spark, Python, Scala, Java, R
Domain Expertise: Digital Advertising & Marketing, Econometrics, Web Search, Cellular Networks, Social Networks, Mobile Advertising
Math: Statistics, Optimization Theory, Social Network Analytics, Geo-Informational Science
DS (center of the diagram)
Adapted from Drew Conway’s Venn diagram of data science
Data Scientist
Technology: Hadoop, Spark, Python, Scala, Java, R
Domain Expertise: Digital Advertising & Marketing, Econometrics, Web Search, Cellular Networks, Social Networks, Mobile Advertising
Math: Statistics, Optimization Theory, Social Network Analytics, Geo-Informational Science
Communication
DS (center of the diagram)
RockStars and Super Models
Technology
Math
Domain expertise
RockStar
Data Analytics at Scale
Algorithms: Machine Learning and Analytics, Representation, Visualization
Big Data: human-centric, M2M, IoT
Machines: Cloud Computing, Storage and compute
Frameworks: MapReduce, HDFS, Hadoop, Spark, MPI
Security/Privacy
[Center of the diagram: Data Analytics at Scale]
DS is Systems + Theory + Verticals
http://support.sas.com/resources/papers/proceedings14/SAS313-2014.pdf
Systems
- NoSQL
- Hadoop
- Spark
- MPI
Verticals
- Advertising
- Voting
- Sports
- Autonomous Agents
- Healthcare
- Education
Theory
Visualization
Legal
Typical Abstract Data Analytics Pipeline
[Figure: seven numbered steps, with the following labels]
• Understand domain, collect requirements
• Exploratory data analysis
• Modeling
• Feature engineering
• Deploy models in the wild (e.g., A/B test)
• Lab-based experiments
• Warehouse data
• Reports and decisions
• Models and decisions
Lecture Outline
• Google Doc and Group
• Welcome & Class Introductions
• Big Data and Applications
• Course introduction
• Class logistics
• Systems (part 1 of N)
Data Science at Scale
Security/Privacy
Big Data: human-centric, M2M, IoT
Machines: Cloud Computing
Parallel Frameworks: MapReduce: cmdLine, Hadoop, MRJob, Spark
Algorithms: Machine Learning and Analytics
[Center of the diagram: Machine learning at Scale]
Big data Definition: use
• Big data is a broad term for data sets so large or
complex that traditional data processing
applications are inadequate.
– PROCESSING:
• Think of your laptop, which gets overwhelmed with 3-4 GB
of data (even though disk space is 1 TB)
– STORAGE:
• Laptop: 1 TB (10^12 bytes)
– THROUGHPUT (reading at 10^8 bytes/sec, i.e., 100 MB/sec, takes 10^4 seconds)
• 1 TB would take about 3 hours to read using your laptop
• Challenges
– Challenges include analysis, capture, data curation, search,
sharing, storage, transfer, visualization, security, and
information privacy.
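The throughput figure above is simple arithmetic, sketched here so the numbers can be varied. The disk size and read rate are the round numbers quoted on the slide; the function name is illustrative.

```python
def read_time_seconds(size_bytes, throughput_bytes_per_sec):
    """Time needed to scan a dataset once at a sustained sequential read rate."""
    return size_bytes / throughput_bytes_per_sec


laptop_disk_bytes = 10 ** 12      # 1 TB, as quoted on the slide
read_rate = 10 ** 8               # 100 MB/sec, as quoted on the slide

seconds = read_time_seconds(laptop_disk_bytes, read_rate)
hours = seconds / 3600

print(f"{seconds:.0f} seconds, about {hours:.1f} hours")
```

A single sequential pass over 1 TB already takes close to three hours on one machine, which is the motivation for spreading both storage and reads across many machines.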
Big Data
• In 2012, Gartner updated its definition as follows:
"Big data is high volume, high velocity, and/or
high variety information assets that require new
forms of processing to enable enhanced decision
making, insight discovery and process
optimization."
Big Data: V3
10^12, 10^21
Speed of generation of data,
or how fast the data is
generated and processed
2015: 1-2 TB per online individual
4 ZB (10^21) today → 40 ZB in 2020
Sources Driving Big Data
It’s All Happening On-line
Every:
Click
Ad impression
Billing event
Fast Forward, pause,…
Friend Request
Transaction
Network message
Fault
…
User Generated
(Web, Social & Mobile)
…
..
Internet of Things / M2M Scientific Computing
Quantified Self
Big Data Infographic
http://www.ibmbigdatahub.com/sites/default/files/infographic_file/4-Vs-of-big-data.jpg
http://www.ibmbigdatahub.com/infographic/four-vs-big-data
By 2005 we had 120 × 10^18 bytes
By 2007 we had 280 × 10^18 bytes
By 2020 we will have 40 × 10^21 bytes
The quality of the
data being captured
can vary greatly
3 Vs of Big Data
40 TB per person by 2020
1-2 TB per person today (2014/2015)
Lecture Outline
• Introduction
• Artificial Intelligence
• Machine Learning
• Data Science
• Applications
• What’s next?
Why all the excitement?
• Government:
– Obama used 80 pieces of information on each person; 4 year
history (versus Romney)
– Nate Silver used Bayesian techniques to publish analyses and
predictions related to the 2008 and 2012 United States presidential
elections
• Sports:
– Oakland Athletics baseball team and its manager Billy Beane
• Transportation (e.g., autonomous vehicles)
• HCI: Speech Recognition and Translation
• Healthcare
– AI Cure: Do you know if your patients are taking their meds?
• Digital Advertising
• Search (web, local, mobile)
How does data, ML, data science work?
Web search lifecycle
http://www.slideshare.net/GaneshVenkataraman3/learn-to-rank-using-machine-learning
https://en.wikipedia.org/wiki/Monty_Hall_problem
Understand user intent
Fixing user errors
Like the Index at end of book
PageRank
Search is a ranking problem
Learning to rank
Training Data
Supervised Feedback Loop
Guided by human editors
Mining relevance judgements
Search ranking (web, jobs, local, etc)
And Ads
DB size = 100s billions of sites
Google server farms
2 million machines (est)
10^11 × 10^4 = 10^15 bytes ≈ 1 petabyte of data
Learning to Rank at SearchMe
• Page Quality, Page Category, Webspam, Query understanding LETOR
LeToR: Improve in a measured way
Doubled size of index
More labeled training data
More data or more data science?
Advertising ~2% of US GDP; $140B WW
"Half the money I spend on advertising is wasted; the trouble is, I don't
know which half." - John Wanamaker, father of modern advertising.
– Less than 1% of all impressions lead to measurable ROI
Despite its problems (attribution, etc.)
• US GDP = $14.1 trillion (global: $56 trillion, 56 × 10^12)
• US advertising spend
– ~$275 billion across all media
• (2% of GDP since the early 1900s)
• In 2014, worldwide online advertising was $140 billion
– I.e., about 20% of all ad spending across all media
– $42 billion global mobile-advertising market in 2014
Making Money from Apps
• 93% of apps downloaded in 2013 (globally) were
free apps!
• 76% of revenue generated from apps (globally) in
2013 was from in-app purchases
– [http://www.forbes.com/sites/chuckjones/2013/03/31/apps-with-in-app-purchase-generate-the-highest-revenue/]
• In the freemium economy
– To make money from apps, publishers must maintain
customer satisfaction through superior app performance and
design,
– then monetize through advertising and in-app purchases
[http://venturebeat.com/2014/03/27/mobile-app-monetization-freemium-is-king-but-in-app-ads-are-growing-fast/, IDC, AppAnnie]
Mobile Publisher: How do I make money?
Auction
Ad
Which Ad?
Publisher: App Developer
Consumer: App user
• Paid app download
• In app purchases
• In app Advertising
Native Advertising
Rich Media Templates
• Advertiser Template/Configuration
– Defines an offer design/display for a specific ad unit
• Publisher Template/Configuration
– Defined and designed to provide a native experience in publisher games
– Controls allowable content (ad units) within a placement
– Is a “shell” for an ad (advertiser offer template)
– Tracks placement performance
– Allows the behavior and look/design to be controlled from the server
Native Design and Dynamic Creative Optimization
Ad Frame Treatment
Variable Intro Text
Publisher Game Art or Character Integration
Variable Integrated Call to Action
Context: where is this solution being shown in the game?
Native Design Blends with Content
Dynamic UI elements adapt to the ad and the audience
http://venturebeat.com/2014/04/29/mobile-apps-could-hit-70b-in-revenues-by-2017-as-non-game-categories-take-off/
Mobile Ad Spend to Top $100 Billion
Worldwide in 2016, 51% of Digital Market
• US and China will account for nearly 62% of
global mobile ad spending next year
• http://www.emarketer.com/Article/Mobile-Ad-Spend-Top-100-Billion-Worldwide-2016-51-of-Digital-Market/1012299#sthash.FBfZAlaC.dpuf
CPMs on Mobile are catching up
• Mobile Advertising: What is the average CPM on
mobile?
– The effective cost per thousand impressions (CPM) for
desktop web ads is about $3.50, while the CPM for mobile
ads is just $0.75.
– Video-based CPMs typically > $15
http://www.quora.com/Mobile-Advertising/What-is-the-average-CPM-on-mobile
http://mashable.com/2012/10/23/mobile-ad-prices/
NativeX: Art and Science of Native Mobile Advertising
[Figure: mobile advertising ecosystem. Boxes: App Publisher, NativeX SSP, Ad Net, Exchange, DSP, Advertiser, Ad Agency]
CONSUMERS, SUPPLY/Publishers, DEMAND/Advertisers
• DOE Native Ads
• Yield management
• LTV/Churn
• SDK
• LTV/Churn
• Event-based CPA
• Flexible, multiple
conversions
• Segment-based targeting
• Forecasting
• Coldstart
• Pacing
• Metrics and Evaluation
NativeX Data Pipelines (ad serving data pipelines)
“OLTP” Data Pipeline (Realtime)
• Online Data
• Used in Real Time
• Used for Offer Serving
“OLAP” Data Pipeline (Batch)
• Offline Data
• Logging Data
• Used for Reporting and Modeling
Data Science Predictive Analytics Pipeline
• Offline Batch Modeling
• Real-time Ad Serving
ETL (Extract, Transform, and Load)
Ad serving architecture
[Architecture diagram. Components: Devices, SDK, ELB, HAProxy, Ad Servers, ElastiCache, Kinesis, Lambda, Spark & Scala, Spark, Aurora, Cassandra, S3, Glacier, EMR, EC2 Cluster, EC2 Instance, SQL Server, SSAS, Tableau, Reporting Services, Reporting APIs, Excel Pivots]
[Pipelines: Ad Hoc / Deep Analysis Pipeline; BI Pipeline; Data Science Pipeline (Modeling in Java / Python / R)]
[Data and outputs: Activity Tracking Raw Data, Archived Activity Tracking, Event Tracking Data (Logs), Device Profiles, Device Data, Configuration / Lookup Data, Hourly ETL, Data Warehouse, Self-Service, Alerts, Dashboards, Debugging / Ops, Ad-hoc Analysis]
Publisher: Which ad to show?
Bids
Auction
getAd
Ad
Which Ad?
Publisher: Which ad to show?
Ads, Bid (CPI)
Auction
getAd
Ad
Which Ad?
NativeX conducts an eCPM-based Auction
Ads
Pick best ads
Bids
Auction
Action: argmax over ads of eCPM = bid × CR
getAd
Ad
Transaction
Logs
$5×0.010×1000=$50
$10×0.002×1000=$20
$3×0.002×1000=$6
$4×0.001×1000=$4
$\text{eCPM}_{Ad} = \text{CR}_{Ad} \times \text{Bid}_{Ad} \times 1000$
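The auction rule can be sketched in a few lines. The bids and conversion rates are the four examples shown on the slide; the ad identifiers and function names are hypothetical, and a production auction would include pricing, pacing, and filtering steps that are not modeled here.

```python
def ecpm(bid, conversion_rate):
    """Expected revenue per thousand impressions: bid * CR * 1000."""
    return bid * conversion_rate * 1000


def run_auction(candidates):
    """Pick the candidate (ad_id, bid, conversion_rate) with the highest eCPM."""
    return max(candidates, key=lambda ad: ecpm(ad[1], ad[2]))


# (ad_id, bid in dollars per conversion, predicted conversion rate); ids are made up.
candidates = [
    ("ad_a", 5.0, 0.010),
    ("ad_b", 10.0, 0.002),
    ("ad_c", 3.0, 0.002),
    ("ad_d", 4.0, 0.001),
]

for ad_id, bid, cr in candidates:
    print(f"{ad_id}: eCPM = ${ecpm(bid, cr):.2f}")

winner = run_auction(candidates)
print("winner:", winner[0])
```

The highest bid does not win here: the $10 bid has a low predicted conversion rate, so the $5 bid with a 1% conversion rate yields the larger expected revenue. This is why the accuracy of the conversion-rate model matters as much as the bids.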
7 Steps in Modeling: E.g., Conversion Rate Modeling
[Figure: seven numbered steps, with the following labels]
• Understand domain, collect requirements
• Exploratory data analysis
• Modeling: Conversion Rate Models
• Feature engineering
• Deploy models in the wild (e.g., A/B test)
• Lab-based experiments
• Warehouse data
Campaign-specific models for CTR/CR
Multiple ad sources:
• DSPs, Exchanges
• Ad Networks
• Internal/Self-service
Multiple conversion types:
• CPM, CPC, CPI, CPCV, CPA, CPE
De-duplication
Optimization by geo
Modeling features:
• Geo location
• Device
• Reviews (star rating, review text; geo location of reviews)
• Social media tweets / FB posts
• Categories on Android and iOS
• Creative message
• User profiles (RFM based on network behavior)
• Device behavioral (based on apps installed on the device: RFM, recommendations, categories)
• Graph-based features
• Others…
Modeling
• ML Approaches
– Gradient boosted decision trees
– Bayesian hierarchical approaches
– Segmentation via matrix factorization
• Feature engineering
– Feature invention
• Metrics and evaluation
• Storing and accessing data
• Perennial Challenges
– Coldstart
– Bias
– Scale
https://upload.wikimedia.org/wikipedia/commons/thumb/5/5f/Minard%27s_Map_%28vectorized%29.svg/2023px-Minard%27s_Map_%28vectorized%29.svg.png
If we can’t measure it then…
16% → 40%
From a systems perspective:
Three generations of machine learning
• First generation: datasets that fit in memory
– Single-node learning: summary statistics and some batch modeling (at small
scale); SQL, R
– Down-sampling the data
• Second generation: general-purpose clusters and frameworks
– Distributed frameworks that allow us to divide and conquer problems
– Learning with general-purpose frameworks such as Hadoop: offline big data
analysis, real-time decision making, homegrown specialist systems
(Hadoop and R for analysis and modeling)
– In-house purpose-built systems; specialist sport
• Third generation: purpose-built libraries and frameworks
– Built for the iterative algorithms that are commonplace in ML
– Huge-scale real-time analysis and decision-making systems
– Specialized frameworks for large-scale manipulation of the type of data you are
working with
– For example, machine learning libraries like MLlib in Spark, and graph
processing libraries like Apache Giraph or GraphX in Spark
Ranking Ads (more) at Turn Inc.
Text Processing
http://aylien.com/
Deep Learning based: CNN, RNN
Linking other things such as groups
Growing
• Deep Learning
Logistic Regression Model
Inputs (independent variables x1, x2, x3): Age = 34, Gender = 1, Stage = 4
Coefficients (a, b, c): 0.5, 0.4, 0.8
Σ (weighted sum of the inputs)
Output (dependent variable p, the prediction): “Probability of being alive” = 0.6
Σ is the sum of inputs × weights
Inputs (independent variables): Age = 34, Gender = 1, Stage = 4
Coefficients: 0.5, 0.4, 0.8
Σ = 34 × 0.5 + 1 × 0.4 + 4 × 0.8 = 20.6
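The weighted sum on this slide, and the squashing step that turns it into a probability, can be sketched as follows. The inputs and coefficients are the ones shown on the slide; the function names are illustrative, and the absence of an intercept term is an assumption, since none appears in the figure.

```python
import math


def weighted_sum(inputs, weights):
    """S = sum of inputs * weights (no intercept term, as assumed here)."""
    return sum(x * w for x, w in zip(inputs, weights))


def logistic(s):
    """Logistic (sigmoid) function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-s))


inputs = [34, 1, 4]          # Age, Gender, Stage
weights = [0.5, 0.4, 0.8]    # coefficients from the slide

s = weighted_sum(inputs, weights)
p = logistic(s)

print(f"S = {s:.1f}")
print(f"p = {p:.6f}")
```

With these particular numbers the logistic function saturates very close to 1, so the 0.6 shown as the output on the slide is presumably illustrative rather than computed from these coefficients. In practice inputs such as age are usually scaled, or an intercept is included, so that S lands in a range where the output is informative.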
Neural Network Model
[Figure: the inputs (independent variables) Age = 34, Gender = 2, Stage = 4 feed a hidden layer through a first set of weights; the hidden layer feeds the output through a second set of weights; each node computes a weighted sum Σ]
Values shown in the figure: .6, .5, .8, .2, .1, .3, .7, .2, .4, .2
Output (dependent variable, the prediction): “Probability of being alive” = 0.6
Intelligent Systems in Your Everyday Life
• Post Office
– automatic address recognition and sorting of mail
• Banks
– automatic check readers, signature verification systems
– automated loan application classification
• Customer Service
– automatic voice recognition
• The Web
– Identifying your age, gender, and location from your Web surfing
– Automated fraud detection
• Digital Cameras
– Automated face detection and focusing
• Computer Games
– Intelligent characters/agents
http://3.bp.blogspot.com/-iEx-C0ljkKk/VV38zjj_vdI/AAAAAAAAA7w/aron8CBjmos/s1600/alexnet.png
Data requirements
Conversational UI
• We’re witnessing an explosion of applications
that no longer have a graphical user interface
(GUI).
• They’ve actually been around for a while, but
they’ve only recently started spreading into the
mainstream.
• They are called bots, virtual assistants, invisible
apps.
• They can run on Slack, WeChat, Facebook
Messenger, plain SMS, or Amazon Echo.
• They can be entirely driven by artificial
intelligence, or there can be a human behind the
curtain.
[Figure: conversational UI example with the options Check Balance, Replenish, and Charts]
• Amazon Echo is controlled by voice, but has a
companion app.
Microsoft Research
Cortana
Speech Recognition Breakthrough for the
Spoken, Translated Word
• Published on Nov 8, 2012
• Chief Research Officer Rick Rashid demonstrates a speech
recognition breakthrough via machine translation that
converts his spoken English words into computer-
generated Chinese language. The breakthrough is
patterned after deep neural networks and significantly
reduces errors in spoken as well as written translation.
• For more information on Speech Recognition and
Translation, visit
– http://www.microsoft.com/translator/skype.aspx
• Excellent video (please watch the whole video!)
– https://www.youtube.com/watch?v=Nu-nlQqFCKg (minute 7:11)
– English text (ASR) -> Chinese text -> text-to-speech system (sounds like the
English speaker)
ASR (audio signal -> word sequence)
• HMM, Deep Learning, Language models
Tipping point: Humans are no longer the
center of the data universe
IoT/IoE
Personal; society; M2M; crowdsourcing
• Society
– Graphs: Social, professional;
– Quantified self: Eating; Sleeping; exercising
– Voting
– Education
– Healthcare…. Economics, shopping, etc.
• Internet of things
– Tracking wildebeests in the Serengeti, Tanzania (not just with GPS tags, but
also with cameras at key strategic locations throughout the Serengeti)
• Population changes in species; scheduling safaris
– 1 billion smart meters by 2020
• 1 petabyte of data per day? 10^9 meters x 10^3 polls x 10^3 bytes = 10^15 bytes
• One megabyte of data per device per day: poll each
meter 1,000 times per day, 1,000 bytes of data each time
– Smart cities
• Etc.
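The smart-meter estimate above can be checked with a few lines of arithmetic. This is only a back-of-envelope sketch: the constants are the slide's round numbers, and the function name `daily_bytes` is made up for illustration.

```python
# Back-of-envelope data volume for smart meters (inputs are the slide's round numbers).
METERS = 10**9           # 1 billion smart meters
POLLS_PER_DAY = 1_000    # each meter is polled 1,000 times per day
BYTES_PER_POLL = 1_000   # each poll returns about 1,000 bytes

PETABYTE = 10**15        # decimal petabyte


def daily_bytes(meters=METERS, polls=POLLS_PER_DAY, size=BYTES_PER_POLL):
    """Total bytes generated per day across all meters."""
    return meters * polls * size


per_meter_per_day = POLLS_PER_DAY * BYTES_PER_POLL   # bytes per meter per day
total_per_day = daily_bytes()

print(f"per meter: {per_meter_per_day / 10**6:.0f} MB/day")
print(f"all meters: {total_per_day / PETABYTE:.0f} PB/day")
```

Changing any one of the three inputs by a factor of ten moves the total by the same factor, which is why the polling frequency matters as much as the number of devices.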
Japanese to English
http://www.ustar-consortium.com/research.html
From analytics to closed loop control systems
• Timeline: Historical, Realtime (now), Future
• Analytical ($) -> Predictive ($$) -> Decisive ($$$)
• Example metric: customer exit rate
Managers and CEOs see the value of DA
Data (Science) improves KPIs dramatically
• Timeline: Historical, Realtime, Future
• Stages: summary stats and reports; offline data mining (e.g., user profiles); realtime decision making; personalization
• KPI performance improvement (e.g., sales): 10-20%; 20-30%; 2X-10X; 10X+
• Example analyses: LTV; advanced BI, regional sales; churn, repeat, big spender; realtime recommendations, look-alike modeling
• Example tools: Oracle, SQL; Hadoop; Omniture, Hyperion; SAS, SPSS; Cloudera, R
• Examples: Ads (DSP/DMP), Amazon, Google, Netflix
Autonomous Vehicles
An image of what Google's self-driving car
sees when it makes a left turn.
http://www.rand.org/pubs/research_briefs/RB9755.html
Autonomous Vehicles
• Research in autonomous cars started in the 1980s, but the
technology wasn't there.
• Perhaps the first significant event was the 2005 DARPA Grand
Challenge, in which the goal was to have a driverless car go
through a 132-mile off-road course. Stanford finished in first
place. The car was equipped with various sensors (laser, vision,
radar), whose readings needed to be synthesized (using
probabilistic techniques that we'll learn from this class) to localize
the car and then to generate control signals for the steering,
throttle, and brake.
• In 2007, DARPA created an even harder Urban Challenge, which
was won by CMU.
• In 2009, Google started a self-driving car program, and since then,
their self-driving cars have driven over 1 million miles on freeways
and streets.
• In January 2015, Uber hired about 50 people from CMU's robotics
department to build self-driving cars.
• While there are still technological and policy issues to be worked
out, the potential impact on transportation is huge.
• http://www.nature.com/news/autonomous-vehicles-no-drivers-required-1.16832
• http://asirt.org/initiatives/informing-road-users/road-safety-facts/road-crash-statistics
• 800 million parking spots in the US
Save fuel, Safer logistics
http://peloton-tech.com/
Data Science in Ecommerce
• This is just a subset
Defining Product Strategy for the
optimum product mix
• Ecommerce, and bricks and mortar businesses
– What products should they sell?
– What price should be offered for the products and when?
• Data science algorithms help ecommerce businesses
define and optimize the product mix.
– Every ecommerce business has a product team that oversees the
design process, where data science algorithms can help with
forecasting questions such as:
• What are the gaps in the product mix?
• What should they make?
• How many units should be ordered in the initial batch from the
factory?
• When should they halt the supply of those products?
• When should they sell?
• Data scientists versus data analysts
– Data scientists work on advanced predictive and prescriptive analytics,
– whereas data analysts mainly perform retrospective analysis like …
• https://www.aicure.com/
Do you know if your patients are taking their meds?
Trust but verify!
Rank patients
Alerts
Machine learning at Scale
• Algorithms: Machine Learning and Analytics
• Big Data: human-centric, M2M, IoT
• Machines: Cloud Computing
• Parallel Frameworks: MapReduce: cmdLine, Hadoop, MRJob, Spark
• Security/Privacy
Lecture Outline
• Introduction
• Artificial Intelligence
• Machine Learning
• Data Science
• Applications
• What’s next?
150,000 Data Scientists needed in US
[McKinsey Report on Big Data 2011]
With such enormous potential to change the world, it will come as no surprise
that data scientists are in huge demand
Top 10 Best Jobs in the US as of 2/2016
How much you make
The demand for your skills
How easily you can advance
117K Median salary
1,700 openings right now
From analytics to closed loop control systems
• Timeline: Historical, Realtime (now), Future
• Analytical ($) -> Predictive ($$) -> Decisive ($$$)
• Example metric: customer exit rate
• IoE, Deep Learning, GPU, Data, Bandwidth (5G)
Architecture
Cool Thing #2: Schema on Read
LOAD DATA FIRST, ASK QUESTIONS LATER
Data is parsed/interpreted as it is loaded out of HDFS
What implications does this have?
BEFORE:
• ETL, schema design upfront, tossing out original data,
comprehensive data study
WITH HADOOP:
• Keep original data around!
• Have multiple views of the same data!
• Work with unstructured data sooner!
• Store first, figure out what to do with it later!
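To make "schema on read" concrete, here is a minimal sketch in plain Python rather than Hadoop. The raw lines, field names, and the helper `read_with_schema` are all hypothetical; the point is only that the raw data is stored untouched and each reader applies its own schema at load time, so two views can coexist over the same records.

```python
import json

# Raw records kept exactly as they arrived (hypothetical log lines).
RAW_LINES = [
    '{"user": "u1", "url": "/home", "ms": 120}',
    '{"user": "u2", "url": "/cart", "ms": 340, "item": "book"}',
    'not json at all',
]


def read_with_schema(lines, schema):
    """Apply a schema while reading.

    schema maps field name -> cast function. Fields missing from a record
    become None; lines that cannot be parsed are skipped, not deleted.
    """
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(record, dict):
            continue
        yield {
            name: cast(record[name]) if name in record else None
            for name, cast in schema.items()
        }


# Two different views over the same stored data.
latency_view = list(read_with_schema(RAW_LINES, {"url": str, "ms": int}))
purchase_view = list(read_with_schema(RAW_LINES, {"user": str, "item": str}))

print(latency_view)
print(purchase_view)
```

Because nothing is discarded at write time, a third view can be added later without reloading or re-running an upfront ETL step.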
Cool Thing #4: Unstructured Data
• Unstructured data:
media, text,
forms, log data
lumped structured data
• Query languages like SQL and
Pig assume some sort of
“structure”
• MapReduce is just Java:
You can do anything Java can
do in a Mapper or Reducer
Left outer join: return all rows from left table
even if there are no matches in the right table
Customers is the left table; Orders is the right table
CustomerName  OrderID
A             2
A             4
A             3
B             BLANK
Inner Join or simply join: return only rows with matches in both tables
CustomerName  OrderID
A             2
A             4
A             3
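The difference between the two join types can be sketched in plain Python. The toy tables below are hypothetical (a `CustomerID` key is assumed; the slides only show `CustomerName` and `OrderID`), and `join` is an illustrative helper, not a library function.

```python
# Hypothetical toy tables: customer A has three orders, customer B has none.
customers = [
    {"CustomerID": 1, "CustomerName": "A"},
    {"CustomerID": 2, "CustomerName": "B"},
]
orders = [
    {"OrderID": 2, "CustomerID": 1},
    {"OrderID": 4, "CustomerID": 1},
    {"OrderID": 3, "CustomerID": 1},
]


def join(left, right, key, how="inner"):
    """Hash join on `key`; how is 'inner' or 'left' (left outer)."""
    # Index the right table by join key.
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    right_cols = {col for row in right for col in row} - {key}

    result = []
    for left_row in left:
        matches = index.get(left_row[key], [])
        for right_row in matches:
            result.append({**left_row, **right_row})
        if not matches and how == "left":
            # Left outer join keeps the unmatched left row, padded with None.
            result.append({**left_row, **{col: None for col in right_cols}})
    return result


left_outer = [(r["CustomerName"], r["OrderID"]) for r in join(customers, orders, "CustomerID", "left")]
inner = [(r["CustomerName"], r["OrderID"]) for r in join(customers, orders, "CustomerID", "inner")]

print(left_outer)
print(inner)
```

Customer B appears only in the left outer result, with a blank (`None`) order, and is dropped by the inner join.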
Join Question: Ecommerce Company
• Given
– Transaction logfile/DB/CSV file (1 billion transactions)
• User ID, date, time, referring URL, item purchased,
price, etc.
– User information/location file/DB (1 million records)
• User ID, HomeCountry, HomeState, HomeZipCode, etc.
• 5 numbers x 2 bytes x 10^6 records = 10^7 bytes (around 10 MB)
• Join the transaction DB with the location DB using the
USER_ID (e.g., phone number)
• Complete this job within one hour, every hour!
• Using Hadoop, what type of join would you
recommend?
– NOTE: remember to specify the type of join, the role of each table,
and how to do it in Hadoop
• In memory join with user table broadcast to all
nodes
• Left = User table; right = Transactions table
• Right outer join:
– Transaction + User
Join part 2
• Left table (Customer information table)
• Right table (Transaction table)
• Question: Left/Right/Inner/Outer Join?
• Right join:
– some customers may not exist
• HashJoin? Reduce side Join?
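The in-memory (map-side) join suggested above can be sketched as follows. This is plain Python standing in for a Hadoop mapper, with hypothetical field names and sample rows: the small user table is loaded into a dictionary on every node (the "broadcast" side), and the large transaction table is streamed past it. Every transaction is emitted, even when the user is unknown, which gives the right outer join with the user table on the left.

```python
def build_user_lookup(user_rows):
    """Small table (about 10 MB in the exercise): hold it fully in memory on each node."""
    return {row["user_id"]: row for row in user_rows}


def map_side_join(transactions, user_lookup):
    """Stream the big table and enrich each transaction with user fields.

    Transactions without a matching user are kept, with None for the
    user columns (right outer join: Transaction + User).
    """
    for tx in transactions:
        user = user_lookup.get(tx["user_id"])
        yield {
            **tx,
            "home_country": user["home_country"] if user else None,
            "home_state": user["home_state"] if user else None,
        }


# Hypothetical sample data.
users = [
    {"user_id": "u1", "home_country": "US", "home_state": "CA"},
    {"user_id": "u2", "home_country": "IE", "home_state": None},
]
transactions = [
    {"user_id": "u1", "item": "book", "price": 12.5},
    {"user_id": "u9", "item": "lamp", "price": 30.0},  # user missing from the user table
    {"user_id": "u2", "item": "pen", "price": 1.2},
]

joined = list(map_side_join(transactions, build_user_lookup(users)))
for row in joined:
    print(row)
```

Since no shuffle or reduce phase is needed, the cost is essentially one pass over the transaction table, which is what makes the hourly deadline plausible; a reduce-side join would be the fallback if the user table did not fit in memory.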
Advertising ~2% of US GDP; $140B WW
"Half the money I spend on advertising is wasted; the trouble is, I don't
know which half." - John Wanamaker, father of modern advertising.
– Less than 1% of all impressions lead to measurable ROI
Despite its problems (attribution, etc.)
• US GDP = $14.1 trillion (global: $56 trillion, 56 x 10^12)
• US advertising spend
– ~$275 billion across all media
• (2% of GDP since the early 1900s)
• In 2015, worldwide online advertising was $150 billion
– I.e., about 20% of all ad spending across all media
– $42 billion global mobile-advertising market in 2014
– $100 billion global mobile-advertising market in 2016
$400 million on Super Bowl advertising (TV/online)
Cover in more detail in Week 12
NativeX: Art and Science of Native Mobile Advertising
CONSUMERS | SUPPLY/Publishers | DEMAND/Advertisers
App Publisher, NativeX, SSP, Ad Net, Exchange, DSP, Advertiser, Ad Agency
NativeX Data Pipelines: ad serving data pipelines
• Devices, SDK
• ETL (Extract, Transform, and Load)
• “OLTP” data pipeline (realtime)
– Online data
– Used in real time
– Used for offer serving
• “OLAP” data pipeline (batch)
– Offline data
– Logging data
– Used for reporting and modeling
• Data science predictive analytics pipeline
– Offline batch modeling
– Real-time ad serving
• eCPM_Ad = Bid x CTR(Ad, Context) x 1000
• 100 milliseconds
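The eCPM formula on the pipeline slide can be turned into a small ranking sketch. The candidate ads, bids, and click-through rates below are made up, and in a real system the CTR would come from a model that takes the ad and its context as input; here it is simply given.

```python
def ecpm(bid, ctr):
    """Expected revenue per 1,000 impressions: bid per click x predicted CTR x 1000."""
    return bid * ctr * 1000


def rank_ads(candidates):
    """Order candidate ads by eCPM, highest first."""
    return sorted(candidates, key=lambda ad: ecpm(ad["bid"], ad["ctr"]), reverse=True)


# Hypothetical candidates for one ad request; ctr stands in for CTR(Ad, Context).
candidates = [
    {"ad": "A", "bid": 0.50, "ctr": 0.010},  # eCPM of about 5
    {"ad": "B", "bid": 2.00, "ctr": 0.004},  # eCPM of about 8
    {"ad": "C", "bid": 0.10, "ctr": 0.030},  # eCPM of about 3
]

ranked = rank_ads(candidates)
for ad in ranked:
    print(ad["ad"], round(ecpm(ad["bid"], ad["ctr"]), 2))
```

Note that the ad with the highest CTR (C) ranks last and the ad with the lowest CTR (B) ranks first, because the bid enters the score multiplicatively.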
Ranking Ads (more) at Turn Inc.
Potential Ad serving architecture (December 2015 view)
• Components: Devices, SDK, Ad Servers, Spark Streaming, Spark, MemCache, Cassandra, Aurora, Cube, SQL Server, EMR, EC2 cluster, EC2 instance, S3, Glacier, Data Warehouse, Hourly ETL
• Data: activity tracking raw data, archived activity tracking, event tracking data (logs), device profiles, device data, configuration / lookup data
• BI pipeline: Tableau, Reporting Services, reporting APIs, Excel pivots, self-service
• Data science pipeline: modeling, Java / Python / R
NativeX: Art and Science of Native Mobile Advertising
CONSUMERS | SUPPLY/Publishers | DEMAND/Advertisers
App Publisher, NativeX, SSP, Ad Net, Exchange, DSP, Advertiser, Ad Agency
• DOE Native Ads
• Yield mgt
• LTV/Churn
• SDK
• LTV/Churn
• Event-based CPA
• Flexible, multiple conversions
• Segment-based targeting
• Forecasting
• Coldstart
• Pacing
• Metrics and Evaluation
• End Deep Artificial
Intelligence Talk
Data Mining Lectures Lecture 18: Credit Scoring
ICS 278: Data Mining
Lecture 18: Credit Scoring
Padhraic Smyth
Department of Information and Computer Science
University of California, Irvine
Presentations for Next Week
• Names for each day will be emailed out by
tomorrow
• Instructions:
– Email me your presentations by 12 noon the day of your
presentation (no later please)
– I will load them on my laptop (so no need to bring a
machine)
– Each presentation will be 6 minutes long + 2 minutes
questions
• So probably about 4 to 8 (max) slides per presentation
References on Credit Scoring
• D. J. Hand and W. E. Henley, "Statistical Classification Methods in Consumer
Credit Scoring: a Review", Journal of the Royal Statistical Society: Series A,
Volume 160, Issue 3, November 1997
– Available online at class Web page under lecture notes
• L. C. Thomas, D. B. Edelman, and J. N. Crook, Credit Scoring and its Applications,
SIAM, 2002
• E. Mays (editor), Credit Risk Modeling, American Management
Association, 1998
Outline
• Credit Scoring
– Problem definition, standard notation
• Data Sources
• Models
– Logistic regression, trees, linear regression, etc
• Model building issues
– Problem of reject inference
• Practical issues
– Cutoff selection, updating models
The Problem of Credit Scoring
• Applicants apply for a bank loan
– Population 1 is rejected
– Population 2 is accepted
• Population 2a repays their loan -> labeled “good”
• Population 2b goes into some form of default -> labeled
“bad”
• Model building
– Build a model that can discriminate population 2a from
population 2b
– Usually treated as a classification problem
– Typically want to estimate p(good | features) and rank
individuals this way
• Widely used by banks and credit card companies
– Similar problems occur in direct marketing and other
Many different applications for
Customer Scoring
• Other financial applications:
– Delinquent loans: who is most likely to pay up
• Uses historical data on who paid in the past
• Often used to create “portfolios” of delinquent debt
– Customer revenue
• How much will each customer generate in revenue over the next K years
• Predicting marketing response
– Cost of a mailer to a customer is on the order of $1
– Targeted marketing
• Rank customers in terms of “likelihood to respond”
• “Churn” prediction
– Predicting which customers are most likely to switch to another brand
– E.g., wireless phone service
– Scores used to rank customers and then target most likely with incentives
• Many more….
Some background
• History
– General ideas started in the 1950’s
• e.g., Bill Fair and Eric Isaac -> FairIsaac -> FICO scores
– Initially a bit controversial
• Worries about it being unfair to some segments of
society
– US Equal Credit Opportunity Acts, 1975/76
• Skepticism that “machine generated rules” from data
could outperform human generated guidelines
– First adopted in credit-card approvals (1960’s)
– Later broadly adopted in home-loans, etc
– Now widely accepted and used by almost all banks, credit-
granting agencies, etc
Data Sources
• Data from the loan application
– Age, address, income, profession, SS#, number of credit cards, savings, etc
– Easy to obtain
• Internal Performance data
– How the individual has performed on other loans with the same bank
– May only be available for a subset of customers
• External Performance data:
– Credit Reports
• How the individual has performed historically on all loans and credit cards
• Relatively expensive to obtain (e.g., $1 per individual)
– Court Judgements
– Real Estate records
• Macro-level external data
– Demographic characteristics for applicant’s zip code or census tract
Loan Application Data
• Issues
– Data entry errors (e.g., birthday = date of loan application)
– Deliberate falsifications (e.g., over-reporting of income)
– Legal issues
• US Equal Credit Opportunity Acts, 1975/76
• Illegal to use race, color, religion, national origin, sex,
marital status, or age in the decision to grant credit
• But what if other variables are highly predictive of some
of these variables?
Variable name  Description                    Codings
dob            Year of birth                  If unknown, the year will be 99
nkid           Number of children             number
dep            Number of other dependents     number
phon           Is there a home phone          1 = yes; 0 = no
sinc           Spouse's income
aes            Applicant's employment status  V = government; W = housewife; M = military; P = private sector; B = public sector; R = retired; E = self employed; T = student; U = unemployed; N = others; Z = no response
dainc          Applicant's income
res            Residential status             O = owner; F = tenant furnished; U = tenant unfurnished; P = with parents; N = other; Z = no response
dhval          Value of home                  0 = no response or not owner; 000001 = zero value; blank = no response
dmort          Mortgage balance outstanding   0 = no response or not owner; 000001 = zero balance; blank = no response
doutm          Outgoings on mortgage or rent
doutl          Outgoings on loans
douthp         Outgoings on hire purchase
doutcc         Outgoings on credit cards
Bad            Good/bad indicator             1 = bad; 0 = good
Credit Report Data
• Available from 3 major bureaus in the US:
– Experian, Trans-Union, and Equifax
• Data in the form of a list of transactions/events
– Typically needs to be converted into feature-value form
• E.g., “number of credit cards opened in past 12 months”
– Can result in a huge number of features
• Cost varies as a function of type and time-window
of data requested
– Interesting problem: “cost-optimal” downloading of selected
credit report features adapted to each individual as a
function of cheaper features
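Converting a list of credit-report events into feature-value form might look like the sketch below. The event types, field names, dates, and the two features are hypothetical examples chosen to mirror “number of credit cards opened in past 12 months”; real bureau data has its own formats.

```python
from datetime import date, timedelta

# Hypothetical event list for one applicant, as it might appear in a credit report.
events = [
    {"type": "card_opened", "date": date(2015, 3, 1)},
    {"type": "card_opened", "date": date(2014, 6, 15)},
    {"type": "late_payment", "date": date(2015, 7, 10)},
    {"type": "card_opened", "date": date(2015, 11, 20)},
]


def count_events(events, event_type, as_of, window_days=365):
    """Count events of one type in the window (as_of - window_days, as_of]."""
    cutoff = as_of - timedelta(days=window_days)
    return sum(
        1 for e in events
        if e["type"] == event_type and cutoff < e["date"] <= as_of
    )


def to_features(events, as_of):
    """Turn the raw event list into a flat feature-value record."""
    return {
        "cards_opened_12m": count_events(events, "card_opened", as_of),
        "late_payments_12m": count_events(events, "late_payment", as_of),
        "cards_opened_24m": count_events(events, "card_opened", as_of, window_days=730),
    }


features = to_features(events, as_of=date(2015, 12, 31))
print(features)
```

Each combination of event type and time window yields another column, which is how a modest event list can expand into a very large number of candidate features.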
Defining Good and Bad
• Good versus Bad
– Not necessarily clear how to define 2 classes
– E.g.,
• bad = ever 3 or more payments in arrears?
• Bad = 2 or more payments in arrears more than once?
– A “spectrum” of behavior
• Never any problems in payments
• Occasional problems
• Persistent problems
– Typical to discard the intermediate cases and also those with
insufficient experience to reliably classify them
• Not ideal theoretically, but convenient
Selecting a Data Set for Model
Building
• Sample selection
– Typical sample sizes ~ 10k to 100k per class
– Should be representative of customers who will apply in the
future
– Need to be able to get the relevant variables for this set of
customers
• Internal performance data
• External performance data
• Etc
• External data sources (e.g., credit reports) can
result in a very large number of possible
variables
– E.g., in the 1000’s
Models used in Credit Scoring
• Regression:
– Ignore the fact that we are estimating a probability
– Typically linear regression is used
• Classification (more common approach)
– Logistic regression (most widely used)
– Decision trees (becoming more popular)
– Neural networks (experimented with, but not used in practice so much)
– Nearest neighbors
– Model combining - some work in this area
– SVMs - too new, relatively unproven
• General comments
– Many trade-secrets, companies like FairIsaac do not publish details
– Generally the industry is conservative: prefer well-established methods
– Classification accuracy is only one part of the overall solution….
Logistic Regression Models
• Model, fit to training data: $g^{-1}(p) = w_0 + w_1 x_1 + \dots + w_p x_p$
• Link function: $g^{-1}(p) = \operatorname{logit}(p) = \log(\text{odds}) = \log\left(\frac{p}{1 - p}\right)$
• Plot: logit(p) against p for p from 0.0 to 1.0; logit(p) = 0 at p = 0.5
• Note that near logit(p) = 0 (p = 0.5), logit(p) is almost linear,
so linear and logistic regression will be similar in this region
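A minimal sketch of the logistic model above, with one feature and plain batch gradient descent. The training data (a scaled income-like feature and a good/bad label) is invented for illustration, and the learning rate and epoch count are arbitrary choices; production credit models would use an established library rather than this loop.

```python
import math


def logit(p):
    """log(odds) = log(p / (1 - p))."""
    return math.log(p / (1.0 - p))


def sigmoid(z):
    """Inverse of logit: maps w0 + w1*x back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.1, epochs=2000):
    """Fit p(good | x) = sigmoid(w0 + w1*x) by batch gradient descent on log loss."""
    w0, w1 = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w0 + w1 * x) - y   # gradient of log loss w.r.t. the linear score
            g0 += err
            g1 += err * x
        w0 -= lr * g0 / n
        w1 -= lr * g1 / n
    return w0, w1


# Hypothetical data: x is a scaled feature, y = 1 means "good" (repaid).
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 1, 0, 1, 1, 1, 1]

w0, w1 = fit_logistic(xs, ys)
p_low = sigmoid(w0 + w1 * 0.5)
p_high = sigmoid(w0 + w1 * 4.0)
print(f"w0={w0:.2f}, w1={w1:.2f}, p(good|x=0.5)={p_low:.2f}, p(good|x=4.0)={p_high:.2f}")
```

The fitted probabilities can then be used directly to rank applicants by p(good | features).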
Modeling Example
Model                                   Bad Risk Rate (%)
k nearest neighbor with special metric  43.09
k nearest neighbor (standard)           43.25
logistic regression                     43.30
linear regression                       43.36
decision tree                           43.77
(from Hand and Henley paper)
Evaluation Methods
• Decile/Centile reporting:
– Rank customers by predicted scores
– Report “lift” rate in each decile (and cumulatively) compared to accepting
everyone
• Receiver Operating Characteristic (ROC)
– Vary classification threshold
– Plot proportion of good risks accepted vs. bad risks accepted
• Bad risk rate = proportion of bad risks among those accepted
– Let p = proportion of good risks
– Let a = proportion accepted
– Can show that, with a > p, the bad risk rate among those accepted is
lower bounded by 1 – p/a (and, when a > 1 – p, upper bounded by (1 – p)/a)
– E.g., p = 0.45, a = 0.70 => bad risk rate must be between 0.35 and 0.78
Introduction to Data Science and Large-scale Machine Learning
Introduction to Data Science and Large-scale Machine Learning
Introduction to Data Science and Large-scale Machine Learning
Introduction to Data Science and Large-scale Machine Learning
Introduction to Data Science and Large-scale Machine Learning
Introduction to Data Science and Large-scale Machine Learning
Introduction to Data Science and Large-scale Machine Learning
Introduction to Data Science and Large-scale Machine Learning
Introduction to Data Science and Large-scale Machine Learning
Introduction to Data Science and Large-scale Machine Learning
Introduction to Data Science and Large-scale Machine Learning
Introduction to Data Science and Large-scale Machine Learning
Introduction to Data Science and Large-scale Machine Learning
Introduction to Data Science and Large-scale Machine Learning
Introduction to Data Science and Large-scale Machine Learning
Introduction to Data Science and Large-scale Machine Learning
Introduction to Data Science and Large-scale Machine Learning
Introduction to Data Science and Large-scale Machine Learning
Introduction to Data Science and Large-scale Machine Learning
Introduction to Data Science and Large-scale Machine Learning
Introduction to Data Science and Large-scale Machine Learning
Introduction to Data Science and Large-scale Machine Learning
Introduction to Data Science and Large-scale Machine Learning
Introduction to Data Science and Large-scale Machine Learning
Introduction to Data Science and Large-scale Machine Learning
Introduction to Data Science and Large-scale Machine Learning
Introduction to Data Science and Large-scale Machine Learning
Introduction to Data Science and Large-scale Machine Learning
Introduction to Data Science and Large-scale Machine Learning
Introduction to Data Science and Large-scale Machine Learning
Introduction to Data Science and Large-scale Machine Learning

Contenu connexe

Tendances

How to Become a Data Scientist
How to Become a Data ScientistHow to Become a Data Scientist
How to Become a Data Scientistryanorban
 
Introduction on Data Science
Introduction on Data ScienceIntroduction on Data Science
Introduction on Data ScienceEdureka!
 
Data science presentation
Data science presentationData science presentation
Data science presentationMSDEVMTL
 
Be a Data Scientist in 8 steps!
Be a Data Scientist in 8 steps! Be a Data Scientist in 8 steps!
Be a Data Scientist in 8 steps! PromptCloud
 
Life of a data scientist (pub)
Life of a data scientist (pub)Life of a data scientist (pub)
Life of a data scientist (pub)Buhwan Jeong
 
Data Science: Not Just For Big Data
Data Science: Not Just For Big DataData Science: Not Just For Big Data
Data Science: Not Just For Big DataRevolution Analytics
 
Intro to Data Science for Non-Data Scientists
Intro to Data Science for Non-Data ScientistsIntro to Data Science for Non-Data Scientists
Intro to Data Science for Non-Data ScientistsSri Ambati
 
Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Big Data [sorry] & Data Science: What Does a Data Scientist Do?Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Big Data [sorry] & Data Science: What Does a Data Scientist Do?Data Science London
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data ScienceEdureka!
 
Data Science presentation for elementary school students
Data Science presentation for elementary school studentsData Science presentation for elementary school students
Data Science presentation for elementary school studentsMelanie Manning, CFA
 
Machine Learning in Big Data
Machine Learning in Big DataMachine Learning in Big Data
Machine Learning in Big DataDataWorks Summit
 
Data Science For Social Scientists Workshop
Data Science For Social Scientists WorkshopData Science For Social Scientists Workshop
Data Science For Social Scientists WorkshopIan Hopkinson
 
Introduction to data science
Introduction to data scienceIntroduction to data science
Introduction to data scienceSampath Kumar
 
8 minute intro to data science
8 minute intro to data science 8 minute intro to data science
8 minute intro to data science Mahesh Kumar CV
 
Big Data and Data Science: The Technologies Shaping Our Lives
Big Data and Data Science: The Technologies Shaping Our LivesBig Data and Data Science: The Technologies Shaping Our Lives
Big Data and Data Science: The Technologies Shaping Our LivesRukshan Batuwita
 
Data Science 101
Data Science 101Data Science 101
Data Science 101odsc
 
Introduction to Data Science (Data Science Thailand Meetup #1)
Introduction to Data Science (Data Science Thailand Meetup #1)Introduction to Data Science (Data Science Thailand Meetup #1)
Introduction to Data Science (Data Science Thailand Meetup #1)Data Science Thailand
 

Introduction to Data Science and Large-scale Machine Learning

  • 1. How Data and Data Science are revolutionizing the world
      Large Scale Distributed Data Science using Spark © KDD2015 James G. Shanahan. Contact: James.Shanahan @ gmail.com
      November 16, 2016
      James G. Shanahan (1, 2)
      (1) IoTGurus, (2) iSchool, UC Berkeley, CA
      EMAIL: James_DOT_Shanahan_AT_gmail_DOT_com
  • 2. Outline
      • Introduction
      • Artificial Intelligence
      • Machine Learning
          – Empirical Sport
          – Netflix
          – Dashboards
      • Data Science
      • Applications
      • Architecture
      • What’s next?
  • 3. James G. Shanahan: 25+ years in data science
      • Technology: Systems, Parallel Computing, Hadoop, Spark, Python, R, Scala, Java
      • Domain Expertise: Digital Advertising & Marketing; Web + mobile + local Search; Anticipatory info. systems; Cellular Networks; Social Networks
      • Math & Theory: Statistics, Optimization Theory, Probability; Social Network Analytics, Geo-Informational Science, HCI, Graphs, NLP
      • Leadership, Business Acumen, Teacher: Led teams of R&D, r&D (Xerox Research, AT&T, Turn, NativeX, Adobe); Entrepreneur; Teach at UC Berkeley
      • Years-of-experience labels in the figure: 16+, 25+, 25+ years, 16+
  • 4. James G. Shanahan
      • 25+ years in data science
      • Currently
          – Principal and Founder, Data Science Consultancy
              • Clients: Target, Adobe, Akamai, Ancestry, AT&T, Nokia Siemens, SearchMe, …
          – Teaching
              • Co-creator of UC Berkeley MIDS program; curriculum development
              • Teach Large Scale Machine Learning (Fall 2014, 2015, 2016)
              • Teach Machine Learning and Optimization Theory at University of California Santa Cruz (UCSC): TIM 206, TIM 209, TIM 250, TIM 251 (since 2008)
          – Advising: Quixey, InferSystems, Knotch
      • Previously
          – NativeX: SVP of Data Science, Chief Scientist, and board member
          – Founding Chief Scientist, Turn Inc.
          – Principal Scientist, Clairvoyance Corp (CMU spinoff; sister lab to JRC)
          – Research Scientist, Xerox Research
          – Entrepreneur: Cofounder of Document Souls and RTB Fast
      • Education: PhD in ML, University of Bristol, UK; B.Sc. CS, University of Limerick, Ireland
  • 5. Audience participation is encouraged!
      Large-Scale Machine Learning, MIDS, UC Berkeley © 2015 James G. Shanahan. Contact: James.Shanahan @ gmail.com
  • 6. Outline
      • Introduction
      • Artificial Intelligence
      • Machine Learning
      • Data Science
      • Applications
      • What’s next?
  • 7. Data science everywhere
  • 8. Traditional Data Science
  • 9. Deep Learning
  • 11. What is Intelligence?
      • Intelligence:
          – “the capacity to learn and solve problems” (Webster's dictionary)
          – in particular,
              • the ability to solve novel problems
              • the ability to act rationally
              • the ability to act like humans
      • Artificial Intelligence
          – build and understand intelligent entities or agents
          – 2 main approaches: “engineering” versus “cognitive modeling”
  • 12. What’s involved in Intelligence?
      • Ability to interact with the real world
          – to perceive, understand, and act
          – e.g., speech recognition, understanding, and synthesis
          – e.g., image understanding
          – e.g., ability to take actions, have an effect
      • Reasoning and Planning
          – modeling the external world, given input
          – solving new problems, planning, and making decisions
          – ability to deal with unexpected problems, uncertainties
      • Learning and Adaptation
          – we are continuously learning and adapting
          – our internal models are always being “updated”
              • e.g., a baby learning to categorize and recognize animals
  • 13. Can machines think? Turing Test
      • In the test, an interrogator converses with a man and a machine via a text-based channel.
          – If the interrogator fails to guess which one is the machine, then the machine is said to have passed the Turing test. (This is a simplification; the Turing test has more nuances and variants, but these are not relevant for our present purposes.)
      • The beauty of the Turing test is its simplicity and its objectivity: it is only a test of behavior, not of the internals of the machine. It doesn't care whether the machine is using logical methods or neural networks. This decoupling of what to solve from how to solve it is an important theme in this class.
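The decoupling described on the Turing test slide can be sketched in Python. This is a minimal illustration, not part of the original deck: the names `interrogate`, `judge_fooled_rate`, `uppercase_judge`, and the toy respondents are all hypothetical. The evaluation only calls a respondent as a function from question text to reply text, so it never depends on how the respondent produces its answers.

```python
import random
from typing import Callable, List, Tuple

# Assumed illustrative types (not from the slides):
Respondent = Callable[[str], str]            # question text -> reply text
Transcript = List[Tuple[str, str]]           # (question, reply) pairs
Judge = Callable[[Transcript, Transcript], int]  # index (0 or 1) guessed to be the machine


def interrogate(respondent: Respondent, questions: List[str]) -> Transcript:
    """Collect replies over a text-only channel; internals stay hidden."""
    return [(q, respondent(q)) for q in questions]


def judge_fooled_rate(judge: Judge, human: Respondent, machine: Respondent,
                      questions: List[str], trials: int = 100, seed: int = 0) -> float:
    """Fraction of trials in which the judge fails to pick out the machine."""
    rng = random.Random(seed)
    fooled = 0
    for _ in range(trials):
        machine_slot = rng.randrange(2)       # hide the machine in slot 0 or 1
        transcripts: List[Transcript] = [[], []]
        transcripts[machine_slot] = interrogate(machine, questions)
        transcripts[1 - machine_slot] = interrogate(human, questions)
        if judge(transcripts[0], transcripts[1]) != machine_slot:
            fooled += 1
    return fooled / trials


# Toy respondents and judge, purely for illustration.
def human(question: str) -> str:
    return "i think " + question.lower()


def shouting_machine(question: str) -> str:
    return human(question).upper()            # behaviorally easy to spot


def mimic_machine(question: str) -> str:
    return human(question)                    # behaviorally identical to the human


def uppercase_judge(a: Transcript, b: Transcript) -> int:
    """Guess that whoever replies only in upper case is the machine."""
    def shouts(t: Transcript) -> bool:
        return all(reply.isupper() for _, reply in t)
    if shouts(a):
        return 0
    if shouts(b):
        return 1
    return 0                                  # no evidence: arbitrary guess


questions = ["What is your name?", "Do you like chess?"]
easy_rate = judge_fooled_rate(uppercase_judge, human, shouting_machine, questions)
hard_rate = judge_fooled_rate(uppercase_judge, human, mimic_machine, questions)
```

Swapping `mimic_machine` for a rule-based system or a neural network would not change `judge_fooled_rate` at all, which is the point of testing behavior rather than internals.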
  • 15. What can AI do for you?
      Instead of asking what AI is, let us turn to the more pragmatic question of what AI can do for you. We will go through some examples where AI has already had a substantial impact on society.
  • 16. Academic Disciplines relevant to AI
      • Philosophy: logic, methods of reasoning, mind as physical system, foundations of learning, language, rationality
      • Mathematics: formal representation and proof, algorithms, computation, (un)decidability, (in)tractability
      • Probability/Statistics: modeling uncertainty, learning from data
      • Economics: utility, decision theory, rational economic agents
      • Neuroscience: neurons as information processing units
      • Psychology/Cognitive Science: how people behave, perceive, process cognitive information, represent knowledge
      • Computer engineering: building fast computers
      • Control theory: design systems that maximize an objective function over time
      • Linguistics: knowledge representation, grammars
  • 17. History of AI • 1943: early beginnings – McCulloch & Pitts: Boolean circuit model of brain • 1950: Turing – Turing's "Computing Machinery and Intelligence" • 1956: birth of AI – Dartmouth meeting: "Artificial Intelligence" name adopted • 1950s: initial promise – Early AI programs, including Samuel's checkers program and Newell & Simon's Logic Theorist • 1955-65: "great enthusiasm" – Newell and Simon: GPS, General Problem Solver – Gelernter: Geometry Theorem Prover – McCarthy: invention of LISP
  • 18. History of AI • 1966-73: Reality dawns – Realization that many AI problems are intractable – Limitations of existing neural network methods identified • Neural network research almost disappears • 1969-85: Adding domain knowledge – Development of knowledge-based systems – Success of rule-based expert systems • E.g., DENDRAL, MYCIN • But these were brittle and did not scale well in practice • 1986 onward: Rise of machine learning – Neural networks return to popularity – Major advances in machine learning algorithms and applications • 1990 onward: Role of uncertainty – Bayesian networks as a knowledge representation framework • 1995 onward: AI as Science – Integration of learning, reasoning, knowledge representation – AI methods used in vision, language, data mining, etc.
  • 19. [Figure: history of game AI; source: http://www.andreykurenkov.com/writing/images/2016-4-15-a-brief-history-of-game-ai/0-history.png]
  • 20. Success Stories • Deep Blue defeated the reigning world chess champion Garry Kasparov in 1997 • An AI program proved a mathematical conjecture (the Robbins conjecture) that had been unsolved for decades • During the 1991 Gulf War, US forces deployed an AI logistics planning and scheduling program that involved up to 50,000 vehicles, cargo, and people • NASA's on-board autonomous planning program controlled the scheduling of operations for a spacecraft • Proverb solves crossword puzzles better than most humans
  • 21. Can Computers beat Humans at Chess? • Chess playing is a classic AI problem – well-defined problem – very complex: difficult for humans to play well [Figure: chess ratings (1200 to 3000 points) by year, 1966 to 1997, for the human world champion, Deep Thought, and Deep Blue]
  • 22. Summary of State of AI Systems in Practice • Speech synthesis, recognition and understanding – very useful for limited vocabulary applications – unconstrained speech understanding is still too hard • Computer vision – works for constrained problems (hand-written zip-codes) – understanding real-world, natural scenes is still too hard • Learning – adaptive systems are used in many applications: have their limits • Planning and Reasoning – only works for constrained problems: e.g., chess – real-world is too complex for general systems • Overall: – many components of intelligent systems are “doable” – there are many interesting research problems remaining
  • 23. Can Computers Talk? • This is known as “speech synthesis” – translate text to phonetic form • e.g., “fictitious” -> fik-tish-es – use pronunciation rules to map phonemes to actual sound • e.g., “tish” -> sequence of basic audio sounds • Difficulties – sounds made by this “lookup” approach sound unnatural – sounds are not independent • e.g., “act” and “action” • modern systems (e.g., at AT&T) can handle this pretty well – a harder problem is emphasis, emotion, etc. • humans understand what they are saying • machines don’t, so they sound unnatural • Conclusion: – NO, for complete sentences – YES, for individual words
  • 24. Can Computers Recognize Speech? • Speech Recognition: – mapping sounds from a microphone into a list of words – classic problem in AI, very difficult • “Let's talk about how to wreck a nice beach” • (I really said “________________________”) • Recognizing single words from a small vocabulary • systems can do this with high accuracy (order of 99%) • e.g., directory inquiries – limited vocabulary (area codes, city names) – the computer tries to recognize you first and, if unsuccessful, hands you over to a human operator – saves millions of dollars a year for the phone companies
  • 25. Recognizing human speech (ctd.) • Recognizing normal speech is much more difficult – speech is continuous: where are the boundaries between words? • e.g., “John’s car has a flat tire” – large vocabularies • can be many thousands of possible words • we can use context to help figure out what someone said – e.g., hypothesize and test – try telling a waiter in a restaurant: “I would like some dream and sugar in my coffee” – background noise, other speakers, accents, colds, etc. – on normal speech, modern systems are only about 60-70% accurate • Conclusion: – NO, normal speech is too complex to accurately recognize – YES, for restricted problems (small vocabulary, single speaker)
  • 28. Can Computers Understand Speech? • Understanding is different from recognition: – “Time flies like an arrow” • assume the computer can recognize all the words • how many different interpretations are there? – 1. time passes quickly like an arrow? – 2. command: time the flies the way an arrow times the flies – 3. command: only time those flies which are like an arrow – 4. “time-flies” are fond of arrows • only 1. makes any sense – but how could a computer figure this out? – clearly humans use a lot of implicit commonsense knowledge in communication • Conclusion: NO, much of what we say is beyond the capabilities of a computer to understand at present
  • 29. Can Computers Learn and Adapt? • Learning and Adaptation – consider a computer learning to drive on the freeway – we could teach it lots of rules about what to do – or we could let it drive and steer it back on course when it heads for the embankment • systems like this are under development (e.g., Daimler Benz) • e.g., RALPH at CMU – in the mid-90s it drove 98% of the way from Pittsburgh to San Diego without any human assistance – machine learning allows computers to learn to do things without explicit programming – many successful applications: • requires some “set-up”: does not mean your PC can learn to forecast the stock market or become a brain surgeon • Conclusion: YES, computers can learn and adapt, when presented with information in the appropriate way
  • 30. Can Computers “see”? • Recognition vs. Understanding (like Speech) – Recognition and Understanding of Objects in a scene • look around this room • you can effortlessly recognize objects • human brain can map 2d visual image to 3d “map” • Why is visual recognition a hard problem? • Conclusion: – mostly NO: computers can only “see” certain types of objects under limited circumstances – YES for certain constrained problems (e.g., face recognition)
  • 31. Can computers plan and make optimal decisions? • Intelligence – involves solving problems and making decisions and plans – e.g., you want to take a holiday in Brazil • you need to decide on dates, flights • you need to get to the airport, etc. • involves a sequence of decisions, plans, and actions • What makes planning hard? – the world is not predictable: • your flight is canceled or there’s a backup on the 405 – there are a potentially huge number of details • do you consider all flights? all dates? – no: commonsense constrains your solutions – AI systems are only successful in constrained planning problems • Conclusion: NO, real-world planning and decision-making is still beyond the capabilities of modern computers – exception: very well-defined, constrained problems
  • 35. Separate what to compute (modeling) from how to compute it (algorithms)
  • 36. Lecture Outline • Introduction • Artificial Intelligence • Machine Learning • Data Science • Applications • What’s next?
  • 38. What is machine learning?
  • 39. Machine learning • Supporting all of these models is machine learning. • In the non-machine learning approach, one would write a complex program (remember, we are solving tasks of significant complexity), but this gets very tedious. – For example, how should a spellchecker know that for "hte", "the" (transposition) is more likely to be the correct output than "hate" (insertion)? • The machine learning approach is to instead write a really simple program with unknown parameters (e.g., numbers measuring how bad it is to transpose or insert characters). • Then, we obtain a set of training examples that partially specifies the desired system behavior. A learning algorithm takes these training examples and sets the parameters of our simple program so that the resulting program approximately produces the desired system behavior. • Abstractly, machine learning allows us to shift the complexity from the program to the data, which is much easier to obtain (either naturally occurring or via crowdsourcing).
  • 40. Equation of a line • $y = mx + b$, or equivalently $f(x) = mx + b$
  • 41. Machine Learning in one slide • Machine learning, a branch of artificial intelligence, is a scientific discipline that is concerned with the design and development of algorithms that allow computers to evolve behaviors based on empirical data, such as from sensor data or databases. • A learner can take advantage of examples (data) to capture characteristics of interest of their unknown underlying probability distribution. Data can be seen as examples that illustrate relations between observed variables. • A major focus of machine learning research is to automatically learn to recognize complex patterns and make intelligent decisions based on data; the difficulty lies in the fact that the set of all possible behaviors given all possible inputs is too large to be covered by the set of observed examples (training data). • Hence the learner must generalize from the given examples, so as to be able to produce a useful output in new cases. Machine learning, like all subjects in artificial intelligence, requires cross-disciplinary proficiency in several areas, such as probability theory, statistics, pattern recognition, cognitive science, data mining, adaptive control, computational neuroscience and theoretical computer science.
  • 42. What is the Learning Problem? • Improve over Task T • with respect to performance measure P • based on experience E • Learning = Improving with experience at some task
  • 43. Types of Learning • Supervised learning - Generates a function that maps inputs to desired outputs. For example, in a classification problem, the learner approximates a function mapping a vector into classes by looking at input-output examples of the function. • Unsupervised learning - Models a set of inputs, as in clustering. • Semi-supervised learning - Combines both labeled and unlabeled examples to generate an appropriate function or classifier. • Reinforcement learning - Learns how to act given an observation of the world. Every action has some impact in the environment, and the environment provides feedback in the form of rewards that guides the learning algorithm. • Transduction - Tries to predict new outputs based on training inputs, training outputs, and test inputs.
  • 44. Supervised Learning: Regression • Regression – Linear Regression • Classification – Logistic Regression • Generalized Linear Models (GLMs) – Broader family of models (that subsumes linear regression, logistic regression, and more) – In R, check out ?glm • Parametric vs. non-parametric approaches • Convex/concave • Discriminative versus generative
  • 45. Classification versus Regression • Classification is just like a regression problem, except that the values of y we now want to predict take on only a small number of discrete values (assume no order in y) • Binary logistic regression – For now let’s focus on binary classification, where y can take on two values, 0 and 1 (can be generalized to the multi-class case) • E.g., building an ancestor classifier: a person is an ancestor (y = 1) or not (y = 0). – Given $x_i$, the corresponding $y_i$ is also known as the label of the training example
  • 46. Families of Supervised Learning • Generative Classifier (bottom-up learning) – Build model of each class – Assume the underlying form of the classes and estimate their parameters (e.g., a Gaussian) • Discriminative Classifier (top-down) – Build model of boundary between classes – Assume the underlying form of the discriminant and estimate its parameters (e.g., a hyperplane) [Figure: two panels, each showing the classes Sports, Arts, Business, and Health]
  • 47. Terminology: linear regression • Model: $y = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_n x_n$ • $y$: predicted, response variable, outcome variable, dependent variable • $x_i$: predictor variables, explanatory variables, covariables, independent variables • $w_i$: the model coefficients; $w_0$ is the y-intercept/threshold
  • 48. Pr(Click): Advertising Problem • Predict Pr(Click | dwellTimeOnWebpage) – at the times 1, 2, 3, 4, and 5 seconds after loading the page. • Graph each data point with time on the x-axis and CTR on the y-axis. Your data should follow a straight line. • Use locator() to input data • Find the equation of this line. • Data, m = 5 examples given as (x, y%): (1, 2), (2, 3), (3, 7), (4, 8), (5, 9) • X are features (aka variables): continuous, discrete, or ordinal ($X \in \mathbb{R}^n$) [Figure: F(x) plotted against x]
  • 49. Least Square Fit Approximations • Suppose we want to fit the data set: what is the best straight line through the data? • Data, m = 5 examples given as (x, y): (1, 2), (2, 3), (3, 7), (4, 8), (5, 9)
  • 50. Fit a line based on… • If we assume that the first two points are correct and choose the line that goes through them, we get the line y = 1 + x. • If we substitute our points (x-values) into this equation, we get the following chart. • How good is this line? – The sum of the squares of the errors (SSE) is 27. • Do you think that we can do better than this?
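The SSE quoted on this slide can be checked with a few lines of Python. This is a minimal sketch, not part of the original deck; the function name `sse` is an illustrative choice, and the data are the five (x, y) pairs from the advertising example.

```python
# Data from the slides: dwell time in seconds (x) and CTR in percent (y).
xs = [1, 2, 3, 4, 5]
ys = [2, 3, 7, 8, 9]

def sse(intercept, slope):
    """Sum of squared errors of the line y = intercept + slope * x."""
    return sum((intercept + slope * x - y) ** 2 for x, y in zip(xs, ys))

# The line through the first two points, y = 1 + x.
print(sse(1, 1))
```

The first two points contribute no error, and each of the last three is under-predicted by 3, which gives 3 × 9 = 27.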
  • 51. Linear Model More Generally • E.g., $y = mx + b$ can be seen more generally as a function of the form $y = f(x_0, x_1) = w_0 x_0 + w_1 x_1 = \sum_{i=0}^{n} w_i x_i = W^T X$ (here $n = 1$, $x_0 = 1$, $w_0 = b$ is the intercept, and $w_1 = m$ is the slope) • Sometimes $\theta$ is used instead of $W$: $y = f(x_0, x_1) = \sum_{i=0}^{n} \theta_i x_i = \theta^T X$ • Here the W’s are the parameters (also called weights) parameterizing the space of linear functions mapping from $X \rightarrow Y = F(x)$ • Data, m = 5 examples given as (x0, x1, y): (1, 1, 2), (1, 2, 3), (1, 3, 7), (1, 4, 8), (1, 5, 9) [Figure: F(x) plotted against x1, with slope m and intercept b]
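As a rough illustration of the $W^T X$ form, the sketch below prepends the dummy feature $x_0 = 1$ and takes a dot product. The function name `predict` and the example weights are assumptions made for illustration only.

```python
def predict(weights, features):
    """Linear model f(x) = w0*x0 + w1*x1 + ... with the dummy x0 = 1 prepended."""
    augmented = [1.0] + list(features)
    return sum(w * x for w, x in zip(weights, augmented))

# Hypothetical weights: w0 = b = 1 (intercept), w1 = m = 1 (slope), i.e. y = 1 + x.
w = [1.0, 1.0]
print([predict(w, [x]) for x in [1, 2, 3, 4, 5]])
```

Treating the intercept as just another weight means the same function handles any number of features without a special case for b.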
  • 53. Machine Learning Background • Machine Learning (ML): "a computer program that improves its performance at some task through experience" [Mitchell 1997] • GIVEN: input data is a table of attribute values and associated class values (in the case of supervised learning) • GOAL: approximate f(x1, …, xn) -> y • Table (Instance | x1 | x2 | … | xn | y): 1 | 3 | 0 | … | 7 | -1; 2 | … | +1; …; L (aka m) | 0 | 4 | … | 8 | -1 • Y is categorical
  • 54. Machine Learning: Regression • Machine Learning (ML): "a computer program that improves its performance at some task through experience" [Mitchell 1997] • GIVEN: input data is a table of attribute values and associated class values (in the case of supervised learning) • GOAL: approximate f(x1, …, xn) -> y • Table (Instance | x1 | x2 | … | xn | y): 1 | 3 | 0 | … | 7 | 73; 2 | … | 76; …; L (aka m) | 0 | 4 | … | 8 | 97 • Y is real valued
  • 55. Machine Learning: Semi-supervised • Machine Learning (ML): "a computer program that improves its performance at some task through experience" [Mitchell 1997] • GIVEN: input data is a table of attribute values and associated class values (in the case of supervised learning) • GOAL: approximate f(x1, …, xn) -> y • Table (Instance | x1 | x2 | … | xn | y): 1 | 3 | 0 | … | 7 | 73; 2 | … | 76; …; L (aka m) | 0 | 4 | … | 8 | 97 • Y is only partially available
  • 56. Machine Learning: Unsupervised • Machine Learning (ML): "a computer program that improves its performance at some task through experience" [Mitchell 1997] • GIVEN: input data is a table of attribute values and associated class values (in the case of supervised learning) • GOAL: approximate f(x1, …, xn) -> y • Table (Instance | x1 | x2 | … | xn | y): 1 | 3 | 0 | … | 7 | 73; 2 | … | 76; …; L (aka m) | 0 | 4 | … | 8 | 97 • Y is not available
  • 58. Generative vs. Discriminative • Generative learning (e.g., Bayesian Networks, HMM, Naïve Bayes, EM GMM) typically more flexible – More complex problems – More flexible predictions • Discriminative learning (e.g., ANN, SVM) typically more accurate – Better with small datasets – Faster to train
  • 59. Parametric vs. Non-Parametric ML Algorithms • Parametric ML Algorithms (e.g., OLS, Decision Trees; SVMs, NNs) – Model-based methods, such as neural networks and the mixture of Gaussians, use the data to build a parameterized model. After training, the model is used for predictions and the data are generally discarded. • Non-Parametric (lowess(); knn; some flavours of SVMs) – In contrast, "memory-based" methods are non-parametric approaches that explicitly retain the training data and use it each time a prediction needs to be made. – The term “non-parametric” (roughly) refers to the fact that the amount of information we need to keep in order to represent the hypothesis/model grows linearly with the size of the training set.
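To make the contrast concrete, here is a small sketch, assuming the five-point advertising data set from the earlier slides. The parametric predictor keeps only two numbers (the least-squares weights for that data), while the memory-based predictor, a simple k-nearest-neighbour average, has to keep every training pair. The function names and the choice of k = 2 are illustrative assumptions.

```python
xs = [1, 2, 3, 4, 5]
ys = [2, 3, 7, 8, 9]

# Parametric: after training, only these two weights are kept
# (the least-squares fit to the data above); the data can be discarded.
w0, w1 = 0.1, 1.9

def predict_parametric(x):
    return w0 + w1 * x

# Non-parametric (memory-based): the training pairs themselves are the model.
def predict_knn(x, k=2):
    """Average the y-values of the k training points whose x is closest."""
    nearest = sorted(zip(xs, ys), key=lambda pair: abs(pair[0] - x))[:k]
    return sum(y for _, y in nearest) / k

print(predict_parametric(3.4), predict_knn(3.4))
```

Doubling the training set doubles what `predict_knn` must store and search, whereas `predict_parametric` still needs only `w0` and `w1`.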
  • 60. Linear Model: Ordinary Least Squares (Measuring Quality) • How do we pick, or learn, the parameters W (aka θ)? • One reasonable method seems to be to make f(x) close to y, at least for the training examples. • To formalize, let’s define a function that measures, for each possible model/hypothesis W, how close the $f_\theta(x_i)$ are to the corresponding $y_i$: • Sum of squared error, AKA residual sum of squares: $J(W) = \frac{1}{2} \sum_{i=1}^{m} (W^T X_i - y_i)^2$ • Is minimizing the unsquared error going to have problems? $J(W) = \sum_{i=1}^{m} (W^T X_i - y_i)$
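One way to see why the unsquared sum is problematic is that positive and negative residuals cancel. The sketch below, which reuses the five-point data set from the earlier slides and uses illustrative function names, evaluates both criteria for a flat line at the mean of y: the signed sum is essentially zero even though the fit is poor.

```python
xs = [1, 2, 3, 4, 5]
ys = [2, 3, 7, 8, 9]

def residuals(w0, w1):
    """Signed residuals (prediction minus target) of y = w0 + w1 * x."""
    return [w0 + w1 * x - y for x, y in zip(xs, ys)]

def cost(w0, w1):
    """J(W) = 1/2 * sum of squared residuals."""
    return 0.5 * sum(r ** 2 for r in residuals(w0, w1))

def signed_sum(w0, w1):
    """The unsquared criterion: positive and negative residuals cancel."""
    return sum(residuals(w0, w1))

mean_y = sum(ys) / len(ys)
# A flat line at the mean ignores x entirely, yet its signed sum is ~0.
print(signed_sum(mean_y, 0.0), cost(mean_y, 0.0))
```

Squaring (or taking absolute values) removes the cancellation, so a lower value always means the predictions are closer to the targets.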
  • 61. Residual • $\text{Residual}_i = W^T X_i - y_i$ [Figure: plot of y against x with the residual of one point marked]
  • 62. Which Line is it anyway? • Select another two points and build a line • If we choose the line that goes through the points at x = 3 and x = 4, we get the line y = 4 + x. Will we get a better fit? Let's look at it. • SSE = 18. Getting better, but can we do better?
  • 63. Can we do better than guesswork? • Let's try the line that is halfway between these two lines. The equation would be y = 2.5 + x. • SSE = 11.25. Getting better, but can we do better? • Is there a more scientific or efficient way than guessing at which line would give the best fit? – Surely there is a methodical way to determine the best-fit line. Let's think about what we want.
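One methodical answer is the closed-form least-squares solution for a single feature: slope = S_xy / S_xx and intercept = mean(y) − slope × mean(x). The sketch below applies it to the five-point data set from the earlier slides; the function names are illustrative, and this is the standard textbook formula rather than something stated in the deck.

```python
xs = [1, 2, 3, 4, 5]
ys = [2, 3, 7, 8, 9]

def fit_line(xs, ys):
    """Closed-form ordinary least squares for one feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def sse(intercept, slope):
    return sum((intercept + slope * x - y) ** 2 for x, y in zip(xs, ys))

intercept, slope = fit_line(xs, ys)
print(intercept, slope, sse(intercept, slope))
```

For this data the fit comes out at roughly y = 0.1 + 1.9x with an SSE of about 2.7, well below the 27, 18, and 11.25 obtained by guessing.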
  • 64. Hypothesis Space of Linear Models • Here the W’s are the parameters (also called weights) parameterizing the space of linear functions mapping from $X \rightarrow Y = f(X)$ • Augment the training data with a dummy intercept variable $x_0 = 1$ (simplifies notation and modeling) • $y = f(x_0, x_1) = w_0 x_0 + w_1 x_1 = \sum_{i=0}^{n} w_i x_i = W^T X$ • Sometimes $\theta$ is used instead of $W$: $y = f(x_0, x_1) = \sum_{i=0}^{n} \theta_i x_i = \theta^T X$ • Data, m = 5 examples given as (x0, x1, y): (1, 1, 2), (1, 2, 3), (1, 3, 7), (1, 4, 8), (1, 5, 9)
  • 65. Space of Hypotheses: Weights • example.OLS_Heatmap() • Each model is in our case a coefficient for the y-intercept (bias) and a coefficient for the feature variable (time) • Plot weight-space in 2D, where the third dimension is the error • Select the combination that minimizes the sum of squared errors: $J(W) = \frac{1}{2} \sum_{i=1}^{m} (W^T X_i - y_i)^2$ [Figures: heat map with isolines overlaid; 3D error surface, z = log(w0 + w1x)]
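The idea of scanning weight space can be sketched as a brute-force grid search: evaluate J(W) at every (w0, w1) pair on a grid and keep the minimum. The grid bounds and the 0.1 step below are arbitrary assumptions, the data are the five points from the earlier slides, and this is not the `example.OLS_Heatmap()` code referenced on the slide.

```python
xs = [1, 2, 3, 4, 5]
ys = [2, 3, 7, 8, 9]

def cost(w0, w1):
    """J(W) = 1/2 * sum of squared residuals for y = w0 + w1 * x."""
    return 0.5 * sum((w0 + w1 * x - y) ** 2 for x, y in zip(xs, ys))

# Hypothetical grid: w0 in [-3, 3] and w1 in [0, 3], both in steps of 0.1.
grid = [(i / 10, j / 10) for i in range(-30, 31) for j in range(0, 31)]

# Each grid cell is one hypothesis; the heat map colours cells by cost.
best_w0, best_w1 = min(grid, key=lambda w: cost(*w))
print(best_w0, best_w1, cost(best_w0, best_w1))
```

Grid search scales poorly with the number of weights, which is one reason closed-form solutions and gradient-based methods are used in practice.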
  • 66. Hyperplanes partition the input space (a line in a 2-input-variable problem) and do NOT predict real values • Many methods in machine learning are based on finding parameters that minimize some objective function. • Very often, the objective function is a weighted sum of two terms: a cost function and a regularization term. • In statistics terms, these are the (log-)likelihood and the (log-)prior. – If both of these components are convex, then their sum is also convex. – Loss functions are summed over examples, and a sum of convex functions is a convex function. • Partitions versus predicts – Regression, y = f(X1) where y is real-valued: the prediction line y = mX1 + b gives Prediction(X1) = mX1 + b, fitted by minimizing residuals. – Classification, y = f(X1, X2) where y is in {0, 1} or {-1, 1}: the separating hyperplane AX1 + BX2 + C = 0 partitions the input space, and Class(X1, X2) = sign(AX1 + BX2 + C). • Given a linear regression model W, please type in the loss function for linear regression.
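The difference between predicting and partitioning can be shown side by side. In this sketch the regression function returns a real value on the line, while the classifier returns only which side of the hyperplane a point falls on. The function names, the example coefficients, and the convention that points exactly on the boundary get +1 are all assumptions for illustration.

```python
def predict(x1, m, b):
    """Regression: return the real value on the line y = m * x1 + b."""
    return m * x1 + b

def classify(x1, x2, a, b, c):
    """Classification: return the side of the hyperplane a*x1 + b*x2 + c = 0.

    Points exactly on the boundary are assigned +1 (an arbitrary convention).
    """
    return 1 if a * x1 + b * x2 + c >= 0 else -1

# Hypothetical coefficients: the boundary x1 - x2 = 0 splits the plane in two.
print(predict(3, 1.9, 0.1))
print(classify(2, 0, 1, -1, 0), classify(0, 2, 1, -1, 0))
```

Both models are linear in their inputs; only the final step differs, a raw value in one case and its sign in the other.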
  • 67. Unsupervised Learning (Clustering) • Input data • We want 3 clusters: red, green, and blue
  • 68. UNIVERSITY OF SCIENCE AND TECHNOLOGY OF CHINA, SCHOOL OF SOFTWARE ENGINEERING OF USTC • Center of a cluster • Let’s compute the center of those points
  • 69. Center of a cluster • We can use the mean on each dimension
  • 72. Center of a cluster • But the mean has trouble with outliers
  • 73. Center of a cluster • Using the median on each dimension is more robust
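The effect of an outlier on the two centers can be sketched with the standard library's `statistics` module. The cluster below is made-up data (four nearby points plus one far-away outlier), and `center` is an illustrative helper name.

```python
from statistics import mean, median

def center(points, stat):
    """Apply stat (e.g. mean or median) independently on each dimension."""
    return tuple(stat(dim) for dim in zip(*points))

# Made-up cluster: four nearby points and one outlier at (100, 100).
cluster = [(1, 1), (2, 1), (1, 2), (2, 2), (100, 100)]

print(center(cluster, mean))    # dragged towards the outlier
print(center(cluster, median))  # stays with the bulk of the points
```

Here the per-dimension mean lands at about (21.2, 21.2), far from any of the four clustered points, while the per-dimension median stays at (2, 2).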
  • 74. Assignment • All points coloured properly already ⇒ we are done!
  • 75. Three generations of machine learning • First generation: dataset that fits in memory – Single-node learning: summary statistics and some batch modeling (at small scale); SQL, R – Down-sampling the data • Second generation: general-purpose clusters and frameworks – Distributed frameworks that allow us to divide and conquer problems – Learning using general-purpose frameworks such as Hadoop: offline big data analysis, real-time decision making, homegrown specialist systems (Hadoop for analysis and modeling); Hadoop, R – In-house purpose-built systems; specialist sport • Third generation: purpose-built libraries and frameworks – Built for iterative algorithms that are commonplace in ML – Huge-scale real-time analysis and decision-making systems – Specialized frameworks for large-scale manipulation of the type of data you are working with – For example, machine learning libraries like MLlib in Spark, and graph processing libraries like Apache Giraph or GraphX in Spark
  • 76. Evolution of Map-Reduce frameworks for big data processing [Figure: timeline with the labels: mid 90s (Jimi’s PhD), first generation, 2nd generation, 3rd generation, Hadoop V2.0, Spark 1.0, Spark 1.5 (as of 10/2015), 2015]
  • 77. Top 10 ML Algorithms (source: https://www.dezyre.com/article/top-10-machine-learning-algorithms/202) • Naïve Bayes Classifier • K Means Clustering Algorithm • Nearest Neighbours • Apriori Algorithm • Linear Regression • Logistic Regression • Support Vector Machine • Decision Trees • Ensembles/Forests • Artificial Neural Networks/Deep Learning • Reinforcement learning • Forecasting • Many more!
  • 79. Lecture Outline • Introduction • Artificial Intelligence • Machine Learning • Data Science • Applications • What’s next?
  • 80. Large-Scale Machine Learning, MIDS, UC Berkeley © 2015 James G. Shanahan Contact:James.Shanahan @ gmail.com 81 Internet companies started the revolution • ..
  • 81. Large-Scale Machine Learning, MIDS, UC Berkeley © 2015 James G. Shanahan Contact:James.Shanahan @ gmail.com 82 Internet companies started the revolution • .. But more traditional companies are leveraging their data and DS Tech
  • 82. Large-Scale Machine Learning, MIDS, UC Berkeley © 2015 James G. Shanahan Contact:James.Shanahan @ gmail.com 83 Data Analysis Has Been Around for a While R.A. Fisher Howard Dresner Peter Luhn W.E. Demming 2012: Deep Learning 2013: Spark 1997 Google
  • 83. Large-Scale Machine Learning, MIDS, UC Berkeley © 2015 James G. Shanahan Contact:James.Shanahan @ gmail.com 84 Data Science DS Skillset • Linear regression, DT models for domain experts Domain Expertise A venn diagram with a Danger Bearing [adapted from Drew Conway]
  • 84. Data Science (DS) [adapted from Drew Conway’s Venn diagram of data science] • Technology: Hadoop, Spark, Python, Scala, Java, R • Domain Expertise: Digital Advertising & Marketing, Econometrics, Web Search, Cellular Networks, Social Networks, Mobile Advertising • Math: Statistics, Optimization Theory, Social Network Analytics, Geo-Informational Science
  • 85. Data Scientist (DS) [Venn diagram] • Technology: Hadoop, Spark, Python, Scala, Java, R • Domain Expertise: Digital Advertising & Marketing, Econometrics, Web Search, Cellular Networks, Social Networks, Mobile Advertising • Math: Statistics, Optimization Theory, Social Network Analytics, Geo-Informational Science • Communication
  • 86. RockStars and Super Models [Venn diagram; labels: Technology; Math; Domain expertise; RockStar]
  • 87. Data Analytics at Scale • Algorithms: Machine Learning and Analytics, Representation, Visualization • Big Data: human-centric, M2M, IoT • Machines: Cloud Computing; storage and compute • Frameworks: MapReduce, HDFS, Hadoop, Spark, MPI • Security/Privacy
  • 88. DS is Systems + Theory + Verticals (http://support.sas.com/resources/papers/proceedings14/SAS313-2014.pdf) • Systems: NoSQL, Hadoop, Spark, MPI • Verticals: Advertising, Voting, Sports, Autonomous Agents, Healthcare, Education • Theory • Visualization • Legal
  • 89. Typical Abstract Data Analytics Pipeline [7-step diagram] • Stages: understand domain, collect requirements; exploratory data analysis; feature engineering; modeling; lab-based experiments; deploy models in the wild (e.g., A/B test) • Also shown: data warehouse; reports and decisions; models and decisions
  • 90. Lecture Outline • Google Doc and Group • Welcome & Class Introductions • Big Data and Applications • Course introduction • Class logistics • Systems (part 1 of N)
  • 91. Data Science at Scale / Machine Learning at Scale • Security/Privacy • Big Data: human-centric, M2M, IoT • Machines: Cloud Computing • Parallel Frameworks: MapReduce (command line), Hadoop, MRJob, Spark • Algorithms: Machine Learning and Analytics
  • 92. Big data: definition • Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate. – PROCESSING: think of your laptop, which gets overwhelmed with 3-4 GB of data (even though disk space is 1 TB) – STORAGE: laptop: 1 TB (10^12 bytes) – THROUGHPUT: reading at 10^8 bytes/sec (100 MB/sec) takes 10^4 seconds, so 1 TB would take about 3 hours to read on your laptop • Challenges – Challenges include analysis, capture, data curation, search, sharing, storage, transfer, visualization, security, and information privacy.
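The throughput estimate on this slide is simple arithmetic and can be checked with a few lines of Python. The function name `read_time_seconds` is only illustrative; the figures (1 TB of data, 100 MB/sec sequential read) are the ones quoted on the slide.

```python
def read_time_seconds(size_bytes, throughput_bytes_per_sec):
    """Time needed to scan a dataset once at a constant read rate."""
    return size_bytes / throughput_bytes_per_sec


ONE_TB = 10**12          # bytes, as on the slide
LAPTOP_RATE = 10**8      # bytes/sec, i.e. 100 MB/sec

seconds = read_time_seconds(ONE_TB, LAPTOP_RATE)
hours = seconds / 3600

print(f"{seconds:.0f} seconds, about {hours:.1f} hours")
```

The result is 10^4 seconds, a little under 3 hours, which is why a single pass over 1 TB is already impractical on one machine.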
  • 93. Big Data • In 2012, Gartner updated its definition as follows: "Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization."
  • 94. Big Data: V3 • Volume: 10^12 … 10^21 bytes – 2015: 1-2 TB per online individual – 4 ZB (10^21 bytes) today → 40 ZB in 2020 • Velocity: speed of generation of data, or how fast the data is generated and processed
  • 95. Sources Driving Big Data • It’s all happening online. Every: click, ad impression, billing event, fast forward, pause, …, friend request, transaction, network message, fault, … • User Generated (Web, Social & Mobile) • Internet of Things / M2M • Scientific Computing • Quantified Self
  • 96. Big Data Infographic (http://www.ibmbigdatahub.com/sites/default/files/infographic_file/4-Vs-of-big-data.jpg, http://www.ibmbigdatahub.com/infographic/four-vs-big-data) • By 2005 we had 120 × 10^18 bytes • By 2007 we had 280 × 10^18 bytes • By 2020 we will have 40 × 10^21 bytes • The quality of the data being captured can vary greatly
  • 97. 3 Vs of Big Data • 1-2 TB per person today (2014/2015) • 40 TB per person by 2020
  • 98. Lecture Outline • Introduction • Artificial Intelligence • Machine Learning • Data Science • Applications • What’s next?
  • 99. Why all the excitement? • Government: – Obama used 80 pieces of information on each person; 4-year history (versus Romney) – Nate Silver used Bayesian techniques to publish analyses and predictions related to the 2008 and 2012 United States presidential elections • Sports: – Oakland Athletics baseball team and its general manager Billy Beane • Transportation (e.g., autonomous vehicles) • HCI: speech recognition and translation • Healthcare – AI Cure: do you know if your patients are taking their meds? • Digital advertising • Search (web, local, mobile)
  • 100. How do data, ML, and data science work?
  • 103. Web search lifecycle (http://www.slideshare.net/GaneshVenkataraman3/learn-to-rank-using-machine-learning, https://en.wikipedia.org/wiki/Monty_Hall_problem)
  • 104. Understand user intent
  • 105. Fixing user errors
  • 107. Like the index at the end of a book
  • 108. PageRank
  • 109. Search is a ranking problem
  • 110. Learning to rank
  • 112. Training Data
  • 113. Supervised feedback loop guided by human editors
  • 114. Mining relevance judgements
  • 115. Search ranking (web, jobs, local, etc.) and ads
  • 116. DB size = 100s of billions of sites • Google server farms: 2 million machines (est.) • 10^11 × 10^4 = 10^15 bytes, ~1 petabyte of data
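The index-size estimate can be reproduced as a back-of-the-envelope calculation. The slide gives only the product 10^11 × 10^4; reading the second factor as roughly 10^4 bytes stored per site is an assumption made here for illustration, as is the per-machine share computed at the end.

```python
NUM_SITES = 10**11            # "100s of billions of sites" (order of magnitude)
BYTES_PER_SITE = 10**4        # assumption: ~10 KB stored per site
NUM_MACHINES = 2 * 10**6      # estimated size of the server farms

total_bytes = NUM_SITES * BYTES_PER_SITE
petabytes = total_bytes / 10**15
bytes_per_machine = total_bytes / NUM_MACHINES  # if spread evenly (illustrative)

print(f"total: {petabytes:.0f} PB")
print(f"per machine: {bytes_per_machine / 10**6:.0f} MB")
```

Under these assumptions the whole index is about 1 PB, or roughly 500 MB per machine if it were spread evenly, which is small enough to hold in memory on each node.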
  • 117. Learning to Rank (LETOR) at SearchMe • Page quality, page category, webspam, query understanding
  • 118. LeToR: improve in a measured way • Doubled size of index • More labeled training data
  • 119. More data or more data science?
  • 120. Advertising: ~2% of US GDP; $140B worldwide • "Half the money I spend on advertising is wasted; the trouble is, I don't know which half." - John Wanamaker, father of modern advertising – Less than 1% of all impressions lead to measurable ROI • Despite its problems (attribution, etc.) • US GDP = $14.1 trillion (global: $56 trillion, 56 × 10^12) • US advertising spend – ~$275 billion across all media (2% of GDP since the early 1900s) • In 2014, worldwide online advertising was $140 billion – I.e., about 20% of all ad spending across all media – $42 billion global mobile-advertising market in 2014
  • 121. Stopped here 11/15/2016 (Jgs)
  • 122. Making Money from Apps • 93% of downloaded apps in 2013 (globally) were free apps! • 76% of revenue generated from apps (globally) in 2013 was from in-app purchases – [http://www.forbes.com/sites/chuckjones/2013/03/31/apps-with-in-app-purchase-generate-the-highest-revenue/] • In the freemium economy – To make money from apps, publishers must maintain customer satisfaction through superior app performance and design, – then monetize through advertising and in-app purchases [http://venturebeat.com/2014/03/27/mobile-app-monetization-freemium-is-king-but-in-app-ads-are-growing-fast/, IDC, AppAnnie]
  • 123. Mobile Publisher: How do I make money? • Paid app download • In-app purchases • In-app advertising [diagram labels: Publisher: App Developer; Consumer: App user; Auction; Ad; Which Ad?]
  • 125. Native Advertising
  • 126. Rich Media Templates • Advertiser Template/Configuration – Defines an offer design/display for a specific ad unit • Publisher Template/Configuration – Defined and designed to provide a native experience in publisher games – Controls allowable content (ad units) within a placement – Is a “shell” around an ad (advertiser offer template) – Tracks placement performance – Allows the behavior and look/design to be controlled from the server
  • 127. Native Design and Dynamic Creative Optimization • Native design blends with content • Dynamic UI elements adapt to the ad and the audience [diagram labels: Ad Frame Treatment; Variable Intro Text; Publisher Game Art or Character Integration; Variable Integrated Call to Action; Context: where is this solution being shown in the game?]
  • 128. http://venturebeat.com/2014/04/29/mobile-apps-could-hit-70b-in-revenues-by-2017-as-non-game-categories-take-off/
  • 129. Mobile Ad Spend to Top $100 Billion Worldwide in 2016, 51% of Digital Market • US and China will account for nearly 62% of global mobile ad spending next year • http://www.emarketer.com/Article/Mobile-Ad-Spend-Top-100-Billion-Worldwide-2016-51-of-Digital-Market/1012299#sthash.FBfZAlaC.dpuf
  • 130. CPMs on Mobile are catching up • Mobile advertising: what is the average CPM on mobile? – The effective cost per thousand impressions (CPM) for desktop web ads is about $3.50, while the CPM for mobile ads is just $0.75. – Video-based CPMs are typically > $15 • http://www.quora.com/Mobile-Advertising/What-is-the-average-CPM-on-mobile • http://mashable.com/2012/10/23/mobile-ad-prices/
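To make the CPM figures concrete, the sketch below converts them into revenue for a fixed number of impressions. The CPM values are the ones quoted on the slide (with $15 used as the lower bound for video); the function name `revenue_from_cpm` and the one-million-impression volume are illustrative assumptions.

```python
def revenue_from_cpm(impressions, cpm_dollars):
    """Revenue in dollars: CPM is the price per thousand impressions."""
    return impressions / 1000 * cpm_dollars


IMPRESSIONS = 1_000_000  # hypothetical monthly volume

# CPMs as quoted on the slide; video uses the $15 lower bound.
cpms = {"desktop web": 3.50, "mobile": 0.75, "video": 15.00}

for channel, cpm in cpms.items():
    print(f"{channel}: ${revenue_from_cpm(IMPRESSIONS, cpm):,.2f}")
```

At these rates the same million impressions earn $3,500 on desktop, $750 on mobile, and at least $15,000 as video, which shows the size of the gap that mobile CPMs still had to close.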
  • 131. NativeX: Art and Science of Native Mobile Advertising [ecosystem diagram: App Publisher; NativeX; SSP; Ad Net; Exchange; DSP; Advertiser; Ad Agency. Groups: SUPPLY/Publishers; CONSUMERS; DEMAND/Advertisers]
  • 132. NativeX: Art and Science of Native Mobile Advertising [ecosystem diagram: App Publisher; NativeX; SSP; Ad Net; Exchange; DSP; Advertiser; Ad Agency. Groups: SUPPLY/Publishers; CONSUMERS; DEMAND/Advertisers] • DOE NativeAds • Yield mgt • LTV/Churn • SDK • LTV/Churn • Event-based CPA • Flexible, multiple conversions • Segment-based targeting • Forecasting • Coldstart • Pacing • Metrics and Evaluation
  • 133. NativeX Data Pipelines: ad serving data pipelines • “OLTP” data pipeline (realtime): online data; used in real time; used for offer serving • “OLAP” data pipeline (batch): offline data; logging data; used for reporting and modeling • ETL (Extract, Transform, and Load) • Data science predictive analytics pipeline: offline batch modeling; real-time ad serving
  • 135. Ad serving architecture [architecture diagram] • Components: Devices; SDK; ELB; HA Proxy; Ad Servers; Elasticache; Kinesis; Lambda; Spark & Scala; Spark; EC2 Cluster; EC2 Instance; Aurora; Cassandra; SQL Server; SSAS; EMR; S3; Glacier • Pipelines: Ad Hoc / Deep Analysis Pipeline; BI Pipeline; Data Science Pipeline • Data: Activity Tracking; Raw Data; Archived Activity Tracking; Event Tracking Data (Logs); Device Profiles; Device Data; Configuration / Lookup Data; Data Warehouse; Hourly ETL • Outputs: Modeling (Java / Python / R); Excel Pivots; Self-Service; Tableau; Reporting Services; Reporting APIs; Alerts; Dashboards; Debugging / Ops; Ad-hoc Analysis
  • 136. Publisher: Which ad to show? [diagram labels: getAd; Bids; Auction; Ad; Which Ad?]
  • 137. Publisher: Which ad to show? [diagram labels: getAd; Ads, Bid (CPI); Auction; Ad; Which Ad?]
  • 138. NativeX conducts an eCPM-based Auction • Pick best ads: argmax over ads of eCPM = bid × CR • $5 × 0.010 × 1000 = $50 • $10 × 0.002 × 1000 = $20 • $3 × 0.002 × 1000 = $6 • $4 × 0.001 × 1000 = $4 [diagram labels: getAd; Ads; Bids; Auction; Action; Ad; Transaction Logs]
  • 139. NativeX conducts an eCPM-based Auction • Pick best ads: argmax over ads of eCPM = bid × CR • $5 × 0.010 × 1000 = $50 • $10 × 0.002 × 1000 = $20 • $3 × 0.002 × 1000 = $6 • $4 × 0.001 × 1000 = $4 • eCPM_Ad = CR_Ad × Bid_Ad × 1000 [diagram labels: getAd; Ads; Bids; Auction; Action; Ad; Transaction Logs]
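The eCPM ranking on these slides can be sketched in a few lines of Python. The bids and conversion rates are the four examples shown on the slide; the `Ad` class, the ad names, and the `run_auction` helper are hypothetical and are not NativeX's actual implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ad:
    name: str                # hypothetical identifier
    bid: float               # advertiser bid in dollars per conversion
    conversion_rate: float   # predicted conversions per impression (CR)


def ecpm(ad):
    """Expected revenue per thousand impressions: CR x bid x 1000."""
    return ad.conversion_rate * ad.bid * 1000


def run_auction(ads):
    """Pick the ad with the highest eCPM (the argmax on the slide)."""
    return max(ads, key=ecpm)


# Bid and CR values taken from the slide's worked examples.
candidates = [
    Ad("ad_1", bid=5.0, conversion_rate=0.010),
    Ad("ad_2", bid=10.0, conversion_rate=0.002),
    Ad("ad_3", bid=3.0, conversion_rate=0.002),
    Ad("ad_4", bid=4.0, conversion_rate=0.001),
]

for ad in candidates:
    print(f"{ad.name}: eCPM = ${ecpm(ad):.2f}")

print("winner:", run_auction(candidates).name)
```

Note that the highest bid ($10) does not win: the $5 bid has a conversion rate five times higher, so its expected revenue per thousand impressions is larger. This is why the accuracy of the conversion rate model matters as much as the bid itself.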
  • 140. 7 Steps in Modeling: e.g., Conversion Rate Modeling [7-step diagram] • Stages: understand domain, collect requirements; exploratory data analysis; feature engineering; modeling: conversion rate models; lab-based experiments; deploy models in the wild (e.g., A/B test) • Also shown: data warehouse
  • 141. Campaign-specific models for CTR/CR • Multiple ad sources: – DSPs, exchanges – Ad networks – Internal/self-service • Multiple conversion types: – CPM, CPC, CPI, CPCV, CPA, CPE • De-duplication • Optimization by geo • Modeling features: – Geo location – Device – Reviews (star rating, review text; geo location of reviews) – Social media: tweets / FB posts – Categories on Android and iOS – Creative message – User profiles (RFM based on network behavior) – Device behavioral (based on installed apps on device: RFM, recommendations, categories) – Graph-based features – Others…
  • 142. Modeling • ML approaches – Gradient boosted decision trees – Bayesian hierarchical approaches – Segmentation via matrix factorization • Feature engineering – Feature invention • Metrics and evaluation • Storing and accessing data • Perennial challenges – Coldstart – Bias – Scale
  • 143. https://upload.wikimedia.org/wikipedia/commons/thumb/5/5f/Minard%27s_Map_%28vectorized%29.svg/2023px-Minard%27s_Map_%28vectorized%29.svg.png
  • 144. If we can’t measure it then… • Data Science Updates: 2013/10/25, ©2013 NativeX Holdings, LLC • For 16% → 40%
  • 145. From a systems perspective: Three generations of machine learning • First generation: datasets that fit in memory – Single-node learning: summary statistics and some batch modeling (at small scale); SQL, R – Downsampling the data • Second generation: general-purpose clusters and frameworks – Distributed frameworks that allow us to divide and conquer problems – Learning using general-purpose frameworks such as Hadoop: offline big data analysis, real-time decision making, homegrown specialist systems (Hadoop for analysis and modeling); Hadoop, R – In-house purpose-built systems; specialist sport • Third generation: purpose-built libraries and frameworks – Built for the iterative algorithms that are commonplace in ML – Huge-scale real-time analysis and decision-making systems – Specialized frameworks for large-scale manipulation of the type of data you are working with – For example, machine learning libraries like MLlib in Spark, and graph processing libraries like Apache Giraph or GraphX in Spark
  • 146. Ranking Ads (more) at Turn Inc.
  • 147. Text Processing (http://aylien.com/) • Deep learning based: CNN, RNN
  • 148. Linking other things such as groups
  • 149. Growing
  • 150. Deep Learning
  • 152. Logistic Regression Model • Inputs (independent variables x1, x2, x3): Age = 34, Gender = 1, Stage = 4 • Coefficients a, b, c: .5, .4, .8 • Σ • Output (dependent variable p, the prediction): 0.6, “Probability of being alive”
  • 153. Σ is the sum of inputs × weights • Inputs (independent variables): Age = 34, Gender = 1, Stage = 4 • Coefficients: .5, .4, .8 • Σ = 34 × .5 + 1 × .4 + 4 × .8 = 20.6
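The weighted sum on this slide, followed by the logistic function that turns it into a probability, can be sketched as follows. The inputs and coefficients are the slide's values; applying the standard logistic (sigmoid) function with no intercept is an assumption made here. With these numbers the sigmoid of 20.6 is very close to 1, not the 0.6 shown on the slide, so the slide's figures are presumably illustrative rather than a fitted model.

```python
import math

# Inputs and coefficients as shown on the slide.
inputs = {"age": 34, "gender": 1, "stage": 4}
coefficients = {"age": 0.5, "gender": 0.4, "stage": 0.8}


def weighted_sum(x, w):
    """Sum of inputs times weights (the slide's sigma node)."""
    return sum(x[name] * w[name] for name in x)


def logistic(z):
    """Standard logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


s = weighted_sum(inputs, coefficients)   # 34*0.5 + 1*0.4 + 4*0.8
p = logistic(s)                          # assumption: no intercept term

print(f"weighted sum = {s:.1f}")
print(f"logistic(weighted sum) = {p:.4f}")
```

In a fitted model an intercept and scaled inputs would normally keep the weighted sum in a range where the predicted probability is not saturated.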
  • 154. Neural Network Model [network diagram] • Inputs (independent variables): Age = 34, Gender = 2, Stage = 4 • Hidden layer: Σ units • Weights shown: .6, .5, .8, .2, .1, .3, .7, .2, .4, .2 • Output (dependent variable, the prediction): 0.6, “Probability of being alive”
  • 155. Intelligent Systems in Your Everyday Life • Post Office – automatic address recognition and sorting of mail • Banks – automatic check readers, signature verification systems – automated loan application classification • Customer Service – automatic voice recognition • The Web – identifying your age, gender, location from your Web surfing – automated fraud detection • Digital Cameras – automated face detection and focusing • Computer Games – intelligent characters/agents
  • 165. http://3.bp.blogspot.com/-iEx-C0ljkKk/VV38zjj_vdI/AAAAAAAAA7w/aron8CBjmos/s1600/alexnet.png
  • 166. Data requirements
  • 170. Conversational UI • We’re witnessing an explosion of applications that no longer have a graphical user interface (GUI). • They’ve actually been around for a while, but they’ve only recently started spreading into the mainstream. • They are called bots, virtual assistants, invisible apps. • They can run on Slack, WeChat, Facebook Messenger, plain SMS, or Amazon Echo. • They can be entirely driven by artificial intelligence, or there can be a human behind the curtain.
  • 171. Conversational UI
  • 172. Check balance • Replenish • Charts
  • 173. Conversational UI • Amazon Echo is controlled by voice, but has a companion app.
  • 174. Microsoft Research: Cortana
  • 177. Speech Recognition Breakthrough for the Spoken, Translated Word • Published on Nov 8, 2012 • Chief Research Officer Rick Rashid demonstrates a speech recognition breakthrough via machine translation that converts his spoken English words into computer-generated Chinese language. The breakthrough is patterned after deep neural networks and significantly reduces errors in spoken as well as written translation. • For more information on speech recognition and translation, visit – http://www.microsoft.com/translator/skype.aspx • Excellent video (please watch all of it!) – https://www.youtube.com/watch?v=Nu-nlQqFCKg (minute 7:11) – English text (ASR) → Chinese text → text-to-speech system (sounds like the English speaker)
  • 178. English text (ASR) → Chinese text → text-to-speech system (sounds like the English speaker)
  • 179. ASR (audio signal → word sequence) • HMM, deep learning, language models
  • 180. Tipping point: humans are no longer the center of the data universe
  • 181. IoT/IoE
  • 182. Personal; society; M2M; crowdsourcing • Society – Graphs: social, professional – Quantified self: eating, sleeping, exercising – Voting – Education – Healthcare, economics, shopping, etc. • Internet of things – Tracking wildebeests in the Serengeti, Tanzania (not just with GPS tags, but also with cameras at key strategic locations throughout the Serengeti) • Population changes in species; scheduling safaris – 1 billion smart meters by 2020 • 1 petabyte of data per day? 10^9 meters × 10^6 bytes = 10^15 bytes • 1 billion smart meters, one megabyte of data per device per day (poll each meter 1000 times per day; 1000 bytes of data each time) – Smart cities • Etc.
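The smart-meter estimate follows directly from the numbers on the slide and can be verified with a short calculation. The constant names are illustrative; the values (1 billion meters, 1000 polls per day, 1000 bytes per poll) are the slide's.

```python
NUM_METERS = 10**9        # 1 billion smart meters by 2020
POLLS_PER_DAY = 1000      # each meter is polled 1000 times per day
BYTES_PER_POLL = 1000     # 1000 bytes of data each time

bytes_per_meter_per_day = POLLS_PER_DAY * BYTES_PER_POLL      # 10^6 = 1 MB
total_bytes_per_day = NUM_METERS * bytes_per_meter_per_day    # 10^15 = 1 PB

print(f"per meter: {bytes_per_meter_per_day / 10**6:.0f} MB/day")
print(f"all meters: {total_bytes_per_day / 10**15:.0f} PB/day")
```

So the answer to the slide's question is yes: under these polling assumptions, a billion meters produce about one petabyte of data per day.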
  • 183. Japanese to English (http://www.ustar-consortium.com/research.html)
  • 184. From analytics to closed loop control systems [chart; x-axis: Historical, Realtime, Future; y-axis: Customer exit rate; labels: Analytical; Now; Predictive]
  • 185. From analytics to closed loop control systems [chart; x-axis: Historical, Realtime, Future; y-axis: Customer exit rate; labels: Analytical; Now; Predictive; Decisive]
  • 186. From analytics to closed loop control systems [chart; x-axis: Historical, Realtime, Future; y-axis: Customer exit rate; labels: Analytical ($); Now; Predictive ($$); Decisive ($$$)]
  • 187. Managers and CEOs see the value of DA: data (science) improves KPIs dramatically [chart; x-axis: Historical, Realtime, Future; y-axis: KPI performance improvement (e.g., sales)] • Improvement levels: 10-20%; 20-30%; 2X-10X; 10X+ • Capabilities: summary stats and reports; advanced BI, regional sales; offline data mining (e.g., user profiles); churn, repeat, big spender; LTV; realtime decision making; personalization; realtime recommendations; look-alike modeling • Tools and examples: Oracle, SQL; Hadoop (Omniture, Hyperion); SAS, SPSS; Cloudera, R; Ads (DSP/DMP); Amazon; Google; Netflix
  • 188. Autonomous Vehicles
  • 190. Autonomous Vehicles • An image of what Google's self-driving car sees when it makes a left turn. • http://www.rand.org/pubs/research_briefs/RB9755.html
  • 191. Autonomous vehicles • Research in autonomous cars started in the 1980s, but the technology wasn't there. • Perhaps the first significant event was the 2005 DARPA Grand Challenge, in which the goal was to have a driverless car go through a 132-mile off-road course. Stanford finished in first place. The car was equipped with various sensors (laser, vision, radar), whose readings needed to be synthesized (using probabilistic techniques that we'll learn from this class) to localize the car and then to generate control signals for the steering, throttle, and brake. • In 2007, DARPA created an even harder Urban Challenge, which was won by CMU. • In 2009, Google started a self-driving car program, and since then, their self-driving cars have driven over 1 million miles on freeways and streets. • In January 2015, Uber hired about 50 people from CMU's robotics department to build self-driving cars. • While there are still technological and policy issues to be worked out, the potential impact on transportation is huge.
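As one small illustration of what "synthesizing sensor readings with probabilistic techniques" can mean, the sketch below fuses several noisy position estimates by inverse-variance weighting. This is a textbook technique chosen here for illustration, not the method used by the Stanford team; it assumes independent Gaussian noise, and the sensor readings and variances are hypothetical.

```python
def fuse_estimates(estimates):
    """Fuse independent Gaussian estimates given as (mean, variance) pairs.

    Each estimate is weighted by its precision (1 / variance), so the
    more reliable sensors pull the fused value harder.
    """
    total_precision = sum(1.0 / variance for _, variance in estimates)
    fused_mean = sum(mean / variance for mean, variance in estimates) / total_precision
    fused_variance = 1.0 / total_precision
    return fused_mean, fused_variance


# Hypothetical position readings in meters: (mean, variance).
readings = [
    (10.0, 1.0),   # laser: most precise
    (12.0, 4.0),   # radar
    (11.0, 4.0),   # vision
]

position, variance = fuse_estimates(readings)
print(f"fused position = {position:.2f} m, variance = {variance:.3f}")
```

The fused variance is smaller than that of any single sensor, which is the basic reason combining laser, vision, and radar gives a better localization than relying on one of them.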
  • 192. 800 million parking spots in the US. http://www.nature.com/news/autonomous-vehicles-no-drivers-required-1.16832 http://asirt.org/initiatives/informing-road-users/road-safety-facts/road-crash-statistics
  • 193. Save fuel, safer logistics. http://peloton-tech.com/
  • 194. Data Science in Ecommerce (this is just a subset)
  • 195. Defining product strategy for the optimum product mix • Ecommerce and bricks-and-mortar businesses – What products should they sell? – What price should be offered for the products, and when? • Data science algorithms help ecommerce businesses define and optimize the product mix. – Every ecommerce business has a product team that looks into the design process, where data science algorithms can help with forecasting questions such as: • What are the gaps in the product mix? • What should they make? • How many units should be ordered as the initial batch from the factory? • When should they halt the supply of those products? • When should they sell? • Data scientists versus data analysts – Data scientists work on advanced predictive and prescriptive analytics – Data analysts look mainly at retrospective analysis
  • 196. https://www.aicure.com/
  • 197. Do you know if your patients are taking their meds?
  • 198. Trust but verify!
  • 199. Rank patients
  • 200. Alerts
  • 201. Machine learning at Scale • Algorithms: Machine Learning and Analytics • Big Data: human-centric, M2M, IoT • Machines: Cloud Computing • Parallel Frameworks: MapReduce (command line), Hadoop, MRJob, Spark • Security/Privacy
  • 202. Lecture Outline • Introduction • Artificial Intelligence • Machine Learning • Data Science • Applications • What’s next?
  • 203. 150,000 Data Scientists needed in US [McKinsey Report on Big Data 2011]. With such enormous potential to change the world, it will come as no surprise that data scientists are in huge demand.
  • 204. Top 10 Best Jobs in the US as of 2/2016. Criteria: how much you make; the demand for your skills; how easily you can advance. 117K median salary; 1,700 openings right now.
  • 205. From analytics to closed loop control systems. Time axis: Historical, Realtime (now), Future. Stages: Analytical ($), Predictive ($$), Decisive ($$$). Example metric: customer exit rate. Enablers: IoE, Deep Learning, GPU, Data, Bandwidth (5G).
  • 206. Architecture
  • 207. Cool Thing #2: Schema on Read. LOAD DATA FIRST, ASK QUESTIONS LATER. Data is parsed/interpreted as it is loaded out of HDFS. What implications does this have? BEFORE: ETL, schema design upfront, tossing out original data, comprehensive data study. WITH HADOOP: Keep original data around! Have multiple views of the same data! Work with unstructured data sooner! Store first, figure out what to do with it later!
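As a rough illustration of schema on read, the sketch below keeps raw records untouched and applies a schema only when the data is read. It is plain Python rather than Hadoop code; the record layout and the function names `read_purchases` and `read_items` are assumptions made for this example.

```python
import json

# Raw records are stored exactly as they arrived; no schema is imposed at write time.
RAW_LINES = [
    '{"user": "u1", "item": "book", "price": "12.50", "ts": "2015-12-01T10:00:00"}',
    '{"user": "u2", "item": "pen", "price": "1.20"}',
    'not valid json at all',
]


def _parse(line):
    """Try to interpret one raw line as a JSON object; return None if that fails."""
    try:
        rec = json.loads(line)
    except ValueError:
        return None
    return rec if isinstance(rec, dict) else None


def read_purchases(lines):
    """View 1 (hypothetical): (user, price) pairs, parsed at read time."""
    for line in lines:
        rec = _parse(line)
        if rec and "user" in rec and "price" in rec:
            yield rec["user"], float(rec["price"])


def read_items(lines):
    """View 2 (hypothetical): a different schema over the same raw data."""
    for line in lines:
        rec = _parse(line)
        if rec and "item" in rec:
            yield rec["item"]


purchases = list(read_purchases(RAW_LINES))
items = list(read_items(RAW_LINES))
```

Both views skip the malformed line, but the line itself stays in storage, so a later view with a more tolerant parser could still use it.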
  • 208. Cool Thing #4: Unstructured Data • Unstructured data: media, text, forms, log data, lumped structured data • Query languages like SQL and Pig assume some sort of “structure” • MapReduce is just Java: you can do anything Java can do in a Mapper or Reducer
  • 209. Left outer join: return all rows from the left table even if there are no matches in the right table. Customers is the left table; Orders is the right table.
    CustomerName | OrderID
    A | 2
    A | 4
    A | 3
    B | BLANK
  • 210. Inner join (or simply join): return only the rows that have a match in both tables, so customer B, who has no orders, is dropped.
    CustomerName | OrderID
    A | 2
    A | 4
    A | 3
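The difference between the two join types can be sketched in a few lines of plain Python. The input tables below are an assumption reconstructed from the join output shown on the left outer join slide (customers A and B, three orders for A); the `join` helper is hypothetical and not a library function.

```python
# Assumed inputs, reconstructed from the join output on the slide.
customers = ["A", "B"]                    # left table: CustomerName
orders = [(2, "A"), (4, "A"), (3, "A")]   # right table: (OrderID, CustomerName)


def join(customers, orders, how="inner"):
    """Nested-loop join on CustomerName; how is 'inner' or 'left'."""
    rows = []
    for name in customers:
        matches = [order_id for order_id, cname in orders if cname == name]
        if matches:
            rows.extend((name, order_id) for order_id in matches)
        elif how == "left":
            # Left outer join keeps unmatched left rows, padded with None (BLANK).
            rows.append((name, None))
    return rows


left_rows = join(customers, orders, how="left")
inner_rows = join(customers, orders, how="inner")
```

The left outer join keeps customer B with an empty OrderID, while the inner join returns only the three rows for customer A.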
  • 211. Join Question: Ecommerce Company (TASK) • Given – Transaction logfile/DB/CSV file (1 billion transactions) • User ID, date, time, referring URL, item purchased, price, etc. – User information/location file/DB (1 million records) • User ID, HomeCountry, HomeState, HomeZipCode, etc. • 5 numbers × 2 bytes × 10^6 records = 10^7 bytes (around 10 MB) • Join the Transaction DB with the Location DB using the USER_ID (e.g., phone number) • Complete this job within one hour, every hour! • Using Hadoop, what type of join would you recommend? – NOTE: remember to specify the type of join, the role of each table, and how to do it in Hadoop
  • 212. • In-memory join with the user table broadcast to all nodes • Left = User table; right = Transactions table • Right outer join: – Transaction + User
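A minimal sketch of the in-memory (map-side) join described above, in plain Python rather than Hadoop code. The field names and sample records are assumptions for illustration. In a real Hadoop job the small user table would typically be shipped to every mapper (for example via the distributed cache) and loaded into a hash table before the large table is streamed.

```python
# Small user table (about 10 MB in the task): held in memory on every node.
users = {
    "u1": ("US", "CA"),
    "u2": ("IE", "Dublin"),
}

# Large transaction table: streamed record by record, never fully loaded.
transactions = [
    ("u1", "2015-12-01", 9.99),
    ("u3", "2015-12-01", 4.50),   # user id missing from the user table
    ("u2", "2015-12-02", 20.00),
]


def map_side_join(transactions, users):
    """Right outer join with transactions as the preserved (right) table.

    Every transaction is emitted; user fields are None when no user matches.
    """
    for user_id, date, price in transactions:
        country, state = users.get(user_id, (None, None))
        yield (user_id, date, price, country, state)


joined = list(map_side_join(transactions, users))
```

Because each probe is a hash lookup, the job needs a single pass over the transactions and no shuffle or reduce phase, which is what makes the hourly deadline plausible.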
  • 213. Join part 2 • Left table (customer information table) • Right table (transaction table) • Question: left/right/inner/outer join? • Right join: – some customers may not exist • Hash join? Reduce-side join?
  • 214. Advertising: ~2% of US GDP; $140B WW, despite its problems (attribution, etc.). "Half the money I spend on advertising is wasted; the trouble is, I don't know which half." - John Wanamaker, father of modern advertising. – Less than 1% of all impressions lead to measurable ROI • US GDP = $14.1 trillion (global $56 trillion, 56 × 10^12) • US advertising spend – ~$275 billion across all media • (2% of GDP since the early 1900s) • In 2015, worldwide online advertising was $150 billion – I.e., about 20% of all ad spending across all media – $42 billion global mobile-advertising market in 2014 – $100 billion global mobile-advertising market in 2016 • $400 million on Super Bowl advertising (TV/online) • Covered in more detail in Week 12
  • 215. NativeX: Art and Science of Native Mobile Advertising. SUPPLY/Publishers, CONSUMERS, DEMAND/Advertisers. Chain: App Publisher, NativeX, SSP, Ad Net, Exchange, DSP, Advertiser, Ad Agency
  • 216. NativeX Data Pipelines (ad serving data pipelines) • “OLTP” data pipeline (realtime): online data; used in real time; used for offer serving • “OLAP” data pipeline (batch): offline data; logging data; used for reporting and modeling • Data science predictive analytics pipeline: offline batch modeling; real-time ad serving • ETL (Extract, Transform, and Load) • Devices, SDK • Ad ranking (100 milliseconds): $\mathrm{eCPM}_{Ad} = \mathrm{Bid} \times \mathrm{CTR}_{Ad,\,Context} \times 1000$
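The eCPM formula on this slide can be turned into a small ranking sketch. The candidate ads, bids, and click-through rates below are invented for illustration, and treating the bid as a price per click is an assumption; the slide only gives the formula.

```python
def ecpm(bid_per_click, ctr):
    """Expected revenue per 1000 impressions: bid x CTR x 1000.

    Assumes the bid is a price per click and ctr is the predicted
    click-through rate for this ad in the current context.
    """
    return bid_per_click * ctr * 1000


# Hypothetical candidates: (ad name, bid in dollars per click, predicted CTR).
candidates = [
    ("ad_a", 0.50, 0.010),
    ("ad_b", 2.00, 0.002),
    ("ad_c", 0.10, 0.080),
]

# Serve the ad with the highest expected revenue, not the highest bid.
ranked = sorted(candidates, key=lambda ad: ecpm(ad[1], ad[2]), reverse=True)
ranked_names = [name for name, _, _ in ranked]
```

Here the lowest bidder, `ad_c`, ranks first because its predicted CTR more than compensates for its bid, which is why the CTR model sits on the real-time serving path.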
  • 218. Ranking Ads (more) at Turn Inc.
  • 219. Potential ad serving architecture (December 2015 view). [Diagram components: Devices, SDK; Streaming Spark; Ad Servers; Aurora; Cube; Cassandra; SQL Server; EMR; Modeling (Java / Python / R); Excel Pivots; Self-Service; S3; Glacier; Spark; MemCache; EC2 Cluster; EC2 Instance; Tableau; Reporting Services; Reporting APIs; Hourly ETL; Data Warehouse. Pipelines: BI Pipeline; Data Science Pipeline. Data: Activity Tracking Raw Data; Archived Activity Tracking; Event Tracking Data (Logs); Device Profiles; Device Data; Configuration / Lookup Data.]
  • 220. NativeX: Art and Science of Native Mobile Advertising. SUPPLY/Publishers, CONSUMERS, DEMAND/Advertisers. Chain: App Publisher, NativeX, SSP, Ad Net, Exchange, DSP, Advertiser, Ad Agency • DOE Native Ads • Yield mgt • LTV/Churn • SDK • LTV/Churn • Event-based CPA • Flexible, multiple conversions • Segment-based targeting • Forecasting • Coldstart • Pacing • Metrics and Evaluation
  • 221. End Deep Artificial Intelligence Talk
  • 222. Data Mining Lectures, Lecture 18: Credit Scoring. ICS 278: Data Mining. Padhraic Smyth, Department of Information and Computer Science, University of California, Irvine
  • 223. Presentations for Next Week • Names for each day will be emailed out by tomorrow • Instructions: – Email me your presentations by 12 noon on the day of your presentation (no later, please) – I will load them on my laptop (so no need to bring a machine) – Each presentation will be 6 minutes long + 2 minutes for questions • So probably about 4 to 8 (max) slides per presentation
  • 224. References on Credit Scoring • D. J. Hand and W. E. Henley, “Statistical Classification Methods in Consumer Credit Scoring: a Review”, Journal of the Royal Statistical Society: Series A, Volume 160, Issue 3, November 1997 (available online at the class Web page under lecture notes) • L. C. Thomas, D. B. Edelman, J. N. Crook, Credit Scoring and its Applications, SIAM, 2002 • E. Mays (editor), Credit Risk Modeling, American Management Association, 1998
  • 225. Outline • Credit Scoring – Problem definition, standard notation • Data Sources • Models – Logistic regression, trees, linear regression, etc. • Model building issues – Problem of reject inference • Practical issues – Cutoff selection, updating models
  • 226. The Problem of Credit Scoring • Applicants apply for a bank loan – Population 1 is rejected – Population 2 is accepted • Population 2a repays their loan -> labeled “good” • Population 2b goes into some form of default -> labeled “bad” • Model building – Build a model that can discriminate population 2a from population 2b – Usually treated as a classification problem – Typically want to estimate p(good | features) and rank individuals this way • Widely used by banks and credit card companies – Similar problems occur in direct marketing and other applications
  • 227. Many different applications for Customer Scoring • Other financial applications: – Delinquent loans: who is most likely to pay up • Uses historical data on who paid in the past • Often used to create “portfolios” of delinquent debt – Customer revenue • How much will each customer generate in revenue over the next K years • Predicting marketing response – Cost of a mailer to a customer is on the order of $1 – Targeted marketing • Rank customers in terms of “likelihood to respond” • “Churn” prediction – Predicting which customers are most likely to switch to another brand – E.g., wireless phone service – Scores used to rank customers and then target the most likely with incentives • Many more…
  • 228. Some background • History – General ideas started in the 1950s • e.g., Bill Fair and Earl Isaac -> FairIsaac -> FICO scores – Initially a bit controversial • Worries about it being unfair to some segments of society – US Equal Credit Opportunity Acts, 1975/76 • Skepticism that “machine generated rules” from data could outperform human generated guidelines – First adopted in credit-card approvals (1960s) – Later broadly adopted in home loans, etc. – Now widely accepted and used by almost all banks, credit-granting agencies, etc.
  • 229. Data Sources • Data from the loan application – Age, address, income, profession, SS#, number of credit cards, savings, etc. – Easy to obtain • Internal performance data – How the individual has performed on other loans with the same bank – May only be available for a subset of customers • External performance data: – Credit reports • How the individual has performed historically on all loans and credit cards • Relatively expensive to obtain (e.g., $1 per individual) – Court judgements – Real estate records • Macro-level external data – Demographic characteristics for the applicant’s zip code or census tract
  • 230. Loan Application Data • Issues – Data entry errors (e.g., birthday = date of loan application) – Deliberate falsifications (e.g., over-reporting of income) – Legal issues • US Equal Credit Opportunity Acts, 1975/76 • Illegal to use race, color, religion, national origin, sex, marital status, or age in the decision to grant credit • But what if other variables are highly predictive of some of these variables?
  • 231. Loan application variables
    Variable Name | Description | Codings
    dob | Year of birth | If unknown, the year will be 99
    nkid | Number of children | number
    dep | Number of other dependents | number
    phon | Is there a home phone | 1 = yes, 0 = no
    sinc | Spouse's income |
    aes | Applicant's employment status | V = government, W = housewife, M = military, P = private sector, B = public sector, R = retired, E = self employed, T = student, U = unemployed, N = others, Z = no response
  • 232. Loan application variables (continued)
    Variable Name | Description | Codings
    dainc | Applicant's income |
    res | Residential status | O = owner, F = tenant furnished, U = tenant unfurnished, P = with parents, N = other, Z = no response
    dhval | Value of home | 0 = no response or not owner, 000001 = zero value, blank = no response
    dmort | Mortgage balance outstanding | 0 = no response or not owner, 000001 = zero balance, blank = no response
    doutm | Outgoings on mortgage or rent |
    doutl | Outgoings on loans |
    douthp | Outgoings on hire purchase |
    doutcc | Outgoings on credit cards |
    Bad | Good/bad indicator | 1 = bad, 0 = good
  • 234. Credit Report Data • Available from 3 major bureaus in the US: – Experian, TransUnion, and Equifax • Data in the form of a list of transactions/events – Typically needs to be converted into feature-value form • E.g., “number of credit cards opened in past 12 months” – Can result in a huge number of features • Cost varies as a function of type and time-window of data requested – Interesting problem: “cost-optimal” downloading of selected credit report features adapted to each individual as a function of cheaper features
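Converting an event list into feature-value form can be sketched as follows. The event types, dates, and feature names are invented for illustration and are not taken from any bureau's actual report format.

```python
from datetime import date, timedelta

# Hypothetical credit-report events for one applicant: (event type, date).
events = [
    ("card_opened", date(2015, 3, 1)),
    ("late_payment", date(2015, 6, 15)),
    ("card_opened", date(2013, 1, 10)),
    ("card_opened", date(2015, 11, 20)),
]


def count_recent(events, event_type, as_of, window_days=365):
    """Count events of one type in the window (as_of - window_days, as_of]."""
    start = as_of - timedelta(days=window_days)
    return sum(
        1 for kind, when in events
        if kind == event_type and start < when <= as_of
    )


as_of = date(2015, 12, 31)
features = {
    "cards_opened_12m": count_recent(events, "card_opened", as_of),
    "late_payments_12m": count_recent(events, "late_payment", as_of),
}
```

Varying the event type and the window length already multiplies the number of candidate features, which is how a report can expand into thousands of variables.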
  • 235. Defining Good and Bad • Good versus bad – Not necessarily clear how to define 2 classes – E.g., • Bad = ever 3 or more payments in arrears? • Bad = 2 or more payments in arrears more than once? – A “spectrum” of behavior • Never any problems in payments • Occasional problems • Persistent problems – Typical to discard the intermediate cases and also those with insufficient experience to reliably classify them • Not ideal theoretically, but convenient
  • 236. Selecting a Data Set for Model Building • Sample selection – Typical sample sizes ~ 10k to 100k per class – Should be representative of customers who will apply in the future – Need to be able to get the relevant variables for this set of customers • Internal performance data • External performance data • Etc. • External data sources (e.g., credit reports) can result in a very large number of possible variables – E.g., in the 1000s
  • 237. Models used in Credit Scoring • Regression: – Ignore the fact that we are estimating a probability – Typically linear regression is used • Classification (more common approach) – Logistic regression (most widely used) – Decision trees (becoming more popular) – Neural networks (experimented with, but not used in practice so much) – Nearest neighbors – Model combining: some work in this area – SVMs: too new, relatively unproven • General comments – Many trade secrets; companies like FairIsaac do not publish details – Generally the industry is conservative: prefers well-established methods – Classification accuracy is only one part of the overall solution…
  • 238. Logistic Regression Models. The model is fit to training data on the log-odds scale: $\operatorname{logit}(p) = \log\left(\frac{p}{1-p}\right) = w_0 + w_1 x_1 + \dots + w_p x_p$, or equivalently $p = g^{-1}(w_0 + w_1 x_1 + \dots + w_p x_p)$, where $g$ is the logit function. [Plot: logit(p) against p on the interval 0.0 to 1.0; logit(p) = 0 at p = 0.5.] Note that near logit(p) = 0 (that is, p near 0.5), logit(p) is almost linear in p, so linear and logistic regression will be similar in this region.
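The logit link and its near-linearity around p = 0.5 can be checked numerically. This is a small sketch using only the standard library; the weights passed to `predict_good` are made-up values, not fitted coefficients, and `linear_approx` is a first-order Taylor expansion added here for comparison.

```python
import math


def logit(p):
    """Log odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))


def inv_logit(z):
    """Inverse of logit (the logistic function): maps a real score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_good(weights, features):
    """p(good | features) for weights [w0, w1, ..., wp] and features [x1, ..., xp]."""
    z = weights[0] + sum(w * x for w, x in zip(weights[1:], features))
    return inv_logit(z)


def linear_approx(p):
    """First-order Taylor expansion of logit around p = 0.5 (slope 1/(p(1-p)) = 4)."""
    return 4.0 * (p - 0.5)


# Close to p = 0.5 the linear approximation is very accurate...
err_near = abs(logit(0.55) - linear_approx(0.55))
# ...but it breaks down towards the ends of the interval.
err_far = abs(logit(0.95) - linear_approx(0.95))

# Hypothetical weights: intercept 0.0, one feature with weight 1.0.
p_example = predict_good([0.0, 1.0], [0.0])
```

This is why a linear regression on the probability scale gives similar rankings to logistic regression when most applicants have p(good) near 0.5, and diverges for very safe or very risky applicants.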
  • 239. Modeling Example (from Hand and Henley paper)
    Model | Bad Risk Rate (%)
    k nearest neighbor with special metric | 43.09
    k nearest neighbor (standard) | 43.25
    logistic regression | 43.30
    linear regression | 43.36
    decision tree | 43.77
  • 240. Evaluation Methods • Decile/centile reporting: – Rank customers by predicted scores – Report “lift” rate in each decile (and cumulatively) compared to accepting everyone • Receiver Operating Characteristic (ROC): – Vary classification threshold – Plot proportion of good risks accepted vs. bad risks accepted • Bad risk rate = proportion of bad risks among those accepted – Let p = proportion of good risks – Let a = proportion accepted – E.g., one can show that, with a > p, the bad risk rate among those accepted is lower bounded by 1 – p/a – E.g., p = 0.45, a = 0.70 => bad risk rate must be between about 0.36 and 0.79