Machine Learning
Algorithms
Girish Khanzode
Contents
• Supervised Learning Model
• Linear Regression
• KNN
• Decision Tree Learning
• Optimized Tree Induction
• Random Forest
• Logistic Regression
• SVM
• Naive Bayes Classifier
• Clustering
• K-means Clustering
• Cluster Classification
• Algorithmic Complexity
• Dimensionality Reduction
• PCA
• Fisher Linear Discriminant
• LDA
• Kernel PCA
• SVD
• HMM
• Model Evaluation
• Confusion Matrix
• K-Fold Cross Validation
• References
Machine Learning and Pattern Classification
• Predictive modelling is building a model capable of making predictions
• Such a model includes a machine learning algorithm that learns certain properties from a
training dataset in order to make those predictions
• Predictive modelling types - Regression and pattern classification
• Regression models analyze relationships between variables and trends in order to make
predictions about continuous variables
– Prediction of the maximum temperature for the upcoming days in weather forecasting
• Pattern classification assigns discrete class labels to particular observations as outcomes of
a prediction
– Prediction of a sunny, rainy or snowy day
Machine Learning Methodologies
• Supervised learning
– Learning from labelled data
– Classification, Regression, Prediction, Function Approximation
• Unsupervised learning
– Learning from unlabelled data
– Clustering, Visualization, Dimensionality Reduction
Machine Learning Methodologies
• Semi-supervised learning
– A mix of supervised and unsupervised learning
– Usually only a small part of the data is labelled
• Reinforcement learning
– Model learns from a series of actions by maximizing a reward function
– The reward function can either be maximized by penalizing bad actions
and/or rewarding good actions
– Example - training of self-driving car using feedback from the environment
Applications
• Speech recognition
• Effective web search
• Recommendation systems
• Computer vision
• Information retrieval
• Spam filtering
• Computational finance
• Fraud detection
• Medical diagnosis
• Stock market analysis
• Structural health monitoring
Learning Types
Machine Learning Algorithms
Learning Process
• Supervised Learning Algorithms are used in classification and prediction
• Training set - each record contains a set of attributes, one of the
attributes is the class
• Classification or prediction algorithm learns from training data about
relationship between predictor variables and outcome variable
• This process results in
– Classification model
– Predictive model
Learning Process
Typical Steps in ML
Supervised Learning Model
• The class labels in the dataset used to build the classification model are
known
• Example - a dataset for spam filtering would contain spam messages as
well as "ham" (= not-spam) messages
• In a supervised learning problem, it is known which message in the
training set is spam or ham and this information is used to train our model
in order to classify new unseen messages
Supervised Learning Model
Classification and Regression
Linear Regression
• A standard and simple mathematical technique for predicting numeric outcome
• Oldest and most widely used predictive model
• Goal - minimize the sum of the squared errors to fit a straight line to a set of data points
• Fits a linear function to a set of data points
• Form of the function
– Y = β0 + β1*X1 + β2*X2 + … + βn*Xn
– Y is the target variable and X1, X2, ... Xn are the predictor variables
– β1, β2, … βn are the coefficients that multiply the predictor variables
– β0 is constant
• Linear regression with multiple variables
– Scale the data, and implement the gradient descent and the cost function
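As a rough illustration of the form above, the following sketch fits β0…βn by ordinary least squares on synthetic data; NumPy and all variable names are illustrative choices, not part of the original slides.

```python
import numpy as np

# Synthetic data: Y = 4 + 3*X1 - 2*X2 + noise (illustrative coefficients)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                     # predictor variables X1, X2
y = 4 + 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

# Add a column of ones so the constant beta_0 is estimated too
Xb = np.hstack([np.ones((len(X), 1)), X])

# Ordinary least squares: minimize the sum of squared errors
beta, *_ = np.linalg.lstsq(Xb, y, rcond=None)
print("beta_0, beta_1, beta_2 =", beta)

# Prediction for a new point [1, X1, X2]
x_new = np.array([1.0, 0.5, -1.0])
print("predicted Y:", x_new @ beta)
```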
Linear Regression
K Nearest Neighbors - KNN
• A simple algorithm that stores all available cases and classifies new cases based on a
similarity measure
• Extremely simple to implement
• Lazy Learning - function is only approximated locally and all computation is deferred until
classification
• Has a weighted version and can also be used for regression
• Usually works very well when there is a distance between examples (Euclidean, Manhattan)
• Slow speed when training set is large (say 10^6 examples) and distance calculation is non-
trivial
• Only a single hyper-parameter – K (usually optimized using cross-validation)
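A minimal sketch of the idea, assuming scikit-learn; the toy dataset, the Euclidean metric and the candidate values of K are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Toy data standing in for the stored "available cases"
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# Tune the single hyper-parameter K with cross-validation
search = GridSearchCV(KNeighborsClassifier(metric="euclidean"),
                      param_grid={"n_neighbors": [1, 3, 5, 7, 9]}, cv=5)
search.fit(X, y)
print("best K:", search.best_params_["n_neighbors"])
print("new case ->", search.predict(X[:1]))
```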
KNN
KNN Classification
(Figure: loan-default example plotted on Age vs. Loan $ axes, with Default and Non-Default classes)
Decision Tree Learning
• Decision trees classify instances or examples by starting at the root of the
tree and moving through it until a leaf node is reached
• A method for approximating discrete-valued functions
• Decision tree is a classifier in the form of a tree structure
– Decision node - specifies a test on a single attribute
– Leaf node - indicates the value of the target attribute
– Branch - split of one attribute
– Path - a conjunction of tests leading to the final decision
When to Consider Decision Trees
• Attribute-value description - object or case must be expressible in terms
of a fixed collection of properties or attributes
– hot, mild, cold
• Predefined classes (target values) - the target function has discrete
output values
– Boolean or multiclass
• Sufficient data - enough training cases should be provided to learn the model
• Possibly noisy training data
• Missing attribute values
Decision Tree Applications
• Credit risk analysis
• Manufacturing – chemical material evaluation
• Production – Process optimization
• Biomedical Engineering – identify features to use in implantable devices
• Astronomy – filter noise from Hubble telescope images
• Molecular biology – analyze amino acid sequences in Human Genome project
• Pharmacology – drug efficacy analysis
• Planning – scheduling of PCB assembly lines
• Medicine – analysis of syndromes
Strengths
• Trees are inexpensive to construct
• Extremely fast at classifying unknown records
• Easy to interpret for small-sized trees
• Accuracy is comparable to other classification techniques for many simple
data sets
• Generates understandable rules
• Handles continuous and categorical variables
• Provides a clear indication of which fields are most important for
prediction or classification
Weaknesses
• Not suitable for prediction of continuous attributes
• Performs poorly with many classes and small data sets
• Computationally expensive to train
– At each node each candidate splitting field must be sorted before its best split can
be found
– In some algorithms combinations of fields are used and a search must be made for
optimal combining weights
– Pruning algorithms can also be expensive since many candidate sub-trees must be
formed and compared
• Not suitable for non-rectangular regions
Tree Representation
• Each node in the tree specifies a test for some
attribute of the instance
• Each branch corresponds to an attribute value
• Each leaf node assigns a classification
Tree Representation
Tree Induction
Problems of Random split
• The tree can grow huge
• These trees are hard to understand
• Larger trees are typically less accurate than smaller trees
• So most tree construction methods use a greedy strategy
– find the feature that best divides the positive examples from the negative
examples, measured by information gain
Optimized Tree Induction
• Greedy strategy - Split the records based on an attribute test
that optimizes certain criterion
• Issues
– Determine root node
– Determine how to split the records
• How to specify the attribute test condition?
• How to determine the best split?
– Determine when to stop splitting
Optimized Tree Induction
• Selection of an attribute at each node
– Choose the most useful attribute for classifying training examples
• Information gain
– Measures how well a given attribute separates the training examples
according to their target classification
– This measure is used to select among the candidate attributes at each
step while growing the tree
Entropy
• A measure of homogeneity of the set of examples
• Given a set S of positive and negative examples of some target
concept (a 2-class problem), the entropy of set S relative to this
binary classification
– E(S) = - p(P)log2 p(P) – p(N)log2 p(N)
• Example
– Suppose S has 25 examples, 15 positive and 10 negative [15+, 10-]
– Then the entropy of S relative to this classification
• E(S) = -(15/25) log2(15/25) - (10/25) log2(10/25) ≈ 0.971
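A small sketch of the entropy formula above in plain Python (the function name is illustrative), reproducing the [15+, 10-] example.

```python
import math

def entropy(pos, neg):
    """E(S) = -p(P) log2 p(P) - p(N) log2 p(N) for a 2-class set."""
    total = pos + neg
    e = 0.0
    for count in (pos, neg):
        p = count / total
        if p > 0:                      # 0 * log2(0) is taken as 0
            e -= p * math.log2(p)
    return e

print(entropy(15, 10))                 # the [15+, 10-] example, approx. 0.971
```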
Entropy
Information Gain
• Information gain measures the expected reduction in entropy or uncertainty
• Values(A) is the set of all possible values for attribute A and Sv is the subset of S for which attribute A has
value v, Sv = {s in S | A(s) = v}
• First term in the equation is the entropy of the original collection S
• Second term is the expected value of the entropy after S is partitioned using attribute A
• It is the expected reduction in entropy caused by partitioning the examples according to this attribute
• It is the number of bits saved when encoding the target value of an arbitrary member of S by knowing the
value of attribute A
Gain(S, A) = Entropy(S) - Σ v∈Values(A) (|Sv| / |S|) · Entropy(Sv)
A simple example
• Guess the outcome of next week's game between the MallRats and the
Chinooks
• Available knowledge / Attribute
– was the game at Home or Away
– was the starting time 5pm, 7pm or 9pm
– Did Joe play center, or forward
– whether that opponent's center was tall or not
– …..
Basketball data
Problem Data
• The game will be away at 9pm and Joe will play center on offense…
• A classification problem
• Generalizing the learned rule to new examples
Examples
• Before partitioning, the entropy is
– H(10/20, 10/20) = - 10/20 log(10/20) - 10/20 log(10/20) = 1
• Using the where attribute, divide into 2 subsets
– Entropy of the first set H(home) = - 6/12 log(6/12) - 6/12 log(6/12) = 1
– Entropy of the second set H(away) = - 4/8 log(4/8) - 4/8 log(4/8) = 1
• Expected entropy after partitioning
– 12/20 * H(home) + 8/20 * H(away) = 1
Examples
• Using the when attribute, divide into 3 subsets
– Entropy of the first set H(5pm) = - 1/4 log(1/4) - 3/4 log(3/4) ≈ 0.811
– Entropy of the second set H(7pm) = - 9/12 log(9/12) - 3/12 log(3/12) ≈ 0.811
– Entropy of the third set H(9pm) = - 0/4 log(0/4) - 4/4 log(4/4) = 0
• Expected entropy after partitioning
– 4/20 * H(1/4, 3/4) + 12/20 * H(9/12, 3/12) + 4/20 * H(0/4, 4/4) = 0.65
• Information gain 1-0.65 = 0.35
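A short sketch that recomputes these gains from the counts quoted above (plain Python; function names are illustrative).

```python
import math

def entropy(pos, neg):
    """Two-class entropy from positive/negative counts."""
    total = pos + neg
    return -sum((c / total) * math.log2(c / total) for c in (pos, neg) if c)

def gain(total_pos, total_neg, splits):
    """Gain = Entropy(S) - sum over subsets of |Sv|/|S| * Entropy(Sv)."""
    n = total_pos + total_neg
    remainder = sum((p + q) / n * entropy(p, q) for p, q in splits)
    return entropy(total_pos, total_neg) - remainder

# Counts from the slides: 20 games, 10 wins / 10 losses overall
print(gain(10, 10, [(6, 6), (4, 4)]))          # where: home (6+,6-), away (4+,4-) -> 0.0
print(gain(10, 10, [(1, 3), (9, 3), (0, 4)]))  # when: 5pm, 7pm, 9pm -> about 0.35
```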
Decision
• Knowing the when attribute values provides larger information gain than where
• Therefore the when attribute should be chosen for testing prior to the where
attribute
• Similarly we can compute the information gain for other attributes
• At each node choose the attribute with the largest information gain
• Stopping rule
– Every attribute has already been included along this path through the tree or
– The training examples associated with this leaf node all have the same target attribute
value - entropy is zero
Continuous Attribute
• Each non-leaf node is a test
• Its edge partitions the attribute into subsets (easy for discrete attribute)
• For continuous attribute
– Partition the continuous value of attribute A into a discrete set of intervals
– Create a new Boolean attribute Ac, looking for a threshold c,
How to choose c ?
Ac = true if A < c, false otherwise
Evaluation
• Training accuracy
– How many training instances can be correctly classified based on the available data?
– It is high when the tree is deep/large or when there is little conflict in the training
instances
– A higher value does not mean good generalization
• Testing accuracy
– Given a number of new instances how many of them can be correctly classified?
– Cross validation
Decision Tree Creation Algorithms
• ID3
• C4.5
• Hunt’s Algorithm
• CART
• SLIQ, SPRINT
Random Forest
• An ensemble classifier that consists of many decision trees
• Outputs the class that is the mode of the classes output by the
individual trees
• The method combines Breiman's bagging idea and the
random selection of features
• Used for classification and regression
Random Forest
Algorithm
• Let the number of training cases be N and number of variables in the classifier M
• The number m of input variables to be used to determine the decision at a node of the tree -
m should be much less than M
• Choose a training set for this tree by choosing n times with replacement from all N available
training cases
• Use the rest of cases to estimate the error of the tree by predicting their classes
• For each node of the tree, randomly choose m variables on which to base the decision at
that node
• Calculate the best split based on these m variables in the training set
• Each tree is fully grown and not pruned
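A minimal sketch of the procedure, assuming scikit-learn's RandomForestClassifier; the dataset and parameter values are illustrative, and the out-of-bag score stands in for the error estimated from the cases left out of each bootstrap sample.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_estimators ~ number of trees; max_features ~ the m variables tried per split;
# bootstrap=True draws n cases with replacement; oob_score uses the left-out
# cases to estimate the error, as described above
forest = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                                bootstrap=True, oob_score=True, random_state=0)
forest.fit(X_train, y_train)
print("OOB error estimate:", 1 - forest.oob_score_)
print("test accuracy:", forest.score(X_test, y_test))
print("top feature importances:", forest.feature_importances_[:5])
```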
Gini Index
• Random forest uses Gini index taken from CART learning system to
construct decision trees
• The Gini Index of node impurity is the measure most commonly chosen
for classification type problems
• How to select N? - Build trees until the error no longer decreases
• How to select m? - Try the recommended default, half of it and twice of it, and
pick the best
Random Forest - Flow Chart
Working of Random Forest
• For prediction, a new sample is pushed down the tree
• It is assigned the label of the training sample in the terminal node it ends
up in
• This procedure is iterated over all trees in the ensemble
• Average vote of all trees is reported as random forest prediction
Random Forest - Advantages
• One of the most accurate learning algorithms
• Produces a highly accurate classifier
• Runs efficiently on large databases
• Handles thousands of input variables without variable deletion
• Gives estimates of what variables are important in classification
• Generates an internal unbiased estimate of the generalization error as
the forest building progresses
• Effective method for estimating missing data and maintains accuracy
when a large proportion of the data are missing
Random Forest - Advantages
• Has methods for balancing error in class-imbalanced data sets
• Prototypes are computed that give information about the relation
between the variables and the classification
• Computes proximities between pairs of cases that can be used in
clustering, locating outliers, or, by scaling, to give interesting views of the data
• Above capabilities can be extended to unlabeled data, leading to
unsupervised clustering, data views and outlier detection
• Offers an experimental method for detecting variable interactions
Random Forest - Disadvantages
• Random forests have been observed to overfit for some datasets with
noisy classification/regression tasks
• For data including categorical variables with different number of levels,
random forests are biased in favor of those attributes with more levels
• Therefore the variable importance scores from random forest are not
reliable for this type of data
Logistic Regression
• Models the relationship between a dependent and one or more independent variables
• Allows to look at the fit of the model as well as at the significance of the relationships
(between dependent and independent variables) being modelled
• Estimates the probability of an event occurring - the probability of a pupil continuing in
education post 16
• Predicts, from knowledge of relevant independent variables, the probability (p) that the outcome is 1
(event occurring) rather than 0
• While in linear regression the relationship between the dependent and the independent
variables is linear, this assumption is not made in logistic regression
Logistic Regression
• Logistic regression function: P = e^(α+βx) / (1 + e^(α+βx))
• P is the probability of a 1 and e is the base of the natural logarithm (about 2.718)
• α and β are the parameters of the model
• The value of α yields P when x is zero and β indicates how the
probability of a 1 changes when x changes by a single unit
• Because the relation between x and P is nonlinear, β does not have as
straightforward an interpretation in this model as it does in ordinary linear
regression
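A small sketch of the logistic function and of fitting α and β, assuming scikit-learn; the numbers are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def logistic(alpha, beta, x):
    """P = e^(alpha + beta*x) / (1 + e^(alpha + beta*x))."""
    return 1.0 / (1.0 + np.exp(-(alpha + beta * x)))

print(logistic(alpha=-1.0, beta=0.8, x=2.0))      # probability of a 1

# Fitting alpha and beta from data with scikit-learn
x = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0]])
y = np.array([0, 0, 0, 1, 1, 1])
model = LogisticRegression().fit(x, y)
print("alpha:", model.intercept_[0], "beta:", model.coef_[0][0])
print("P(y=1 | x=1.75):", model.predict_proba([[1.75]])[0, 1])
```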
Logistic Regression
Support Vector Machine - SVM
• A supervised learning model with associated learning algorithms that analyze
data and recognize patterns
• Given a set of training examples, each marked for belonging to one of two
categories, SVM training algorithm builds a model that assigns new examples into
one category or the other, making it a non-probabilistic binary linear classifier
• An SVM model is a representation of the examples as points in space, mapped so
that the examples of the separate categories are divided by a clear gap that is as
wide as possible
• New examples are then mapped into that same space and predicted to belong to
a category based on which side of the gap they fall on
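A minimal sketch assuming scikit-learn's SVC with a linear kernel; the blob data simply stands in for two categories separated by a clear gap.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated categories, as in the "clear gap" description above
X, y = make_blobs(n_samples=100, centers=2, random_state=0)

clf = SVC(kernel="linear", C=1.0)   # linear decision boundary with maximum margin
clf.fit(X, y)

print("number of support vectors per class:", clf.n_support_)
print("predicted side of the gap for a new point:", clf.predict([[0.0, 2.0]]))
```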
SVM
Naive Bayes Classifier
• A family of simple probabilistic classifiers based on applying Bayes'
theorem with strong (naive) independence assumptions between the
features
• A popular method for text categorization, the problem of judging
documents as belonging to one category or the other such as spam or
legitimate, sports or politics etc with word frequencies as the features
• Highly scalable, requires a number of parameters linear in the number of
variables (features/predictors) in a learning problem
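A small sketch of the text-categorization use case, assuming scikit-learn's MultinomialNB with word counts as features; the tiny corpus is made up.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up corpus: word frequencies as features, spam/ham as classes
texts  = ["win money now", "cheap money offer", "meeting at noon",
          "lunch with team", "win a free offer", "project meeting notes"]
labels = ["spam", "spam", "ham", "ham", "spam", "ham"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)
print(model.predict(["free money offer", "team lunch meeting"]))
```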
Conditional Probability Model
Naive Bayes - Example
UNSUPERVISED LEARNING
Clustering
• A technique to find similar groups, called clusters, in data
• Groups data instances that are similar to (near) each other in one cluster and data
instances that are very different (far away) from each other into different clusters
• Called an unsupervised learning task - since no class values denoting an a priori
grouping of the data instances are given, which is the case in supervised learning
• One of the most utilized data mining techniques
• A long history and used in almost every field like medicine, psychology, botany,
sociology, biology, archeology, marketing, insurance, libraries and text clustering
Clustering
Applications
• Group people of similar sizes together to make small, medium and large T-Shirts
– Tailor-made for each person - too expensive
– One-size-fits-all - does not fit all
• In marketing, segment customers according to their similarities
– Targeted marketing
• Given a collection of text documents, organize them according to their content
similarities
– To produce a topic hierarchy
Aspects of clustering
• Clustering algorithms
– Partitional clustering
– Hierarchical clustering
• A distance function - similarity or dissimilarity
• Clustering quality
– Inter-cluster distance → maximized
– Intra-cluster distance → minimized
• Quality of a clustering process depends on algorithm, distance function and
application
K-means Clustering
• A partitional clustering algorithm
• Classify a given data set through a certain number of k clusters (k is fixed)
• Let the set of data points D be {x1, x2, …, xn}
– xi = (xi1, xi2, …, xir) is a vector in a real-valued space X ⊆ Rr
– r = number of attributes (dimensions) in the data
• Algorithm partitions given data into k clusters
– Each cluster has a cluster center (centroid)
– K is user defined
K-means Clustering
K-Means Algorithm
1. Choose k
2. Randomly choose k data points (seeds) as initial centroids
3. Assign each data point to the closest centroid
4. Re-compute the centroids using the current cluster
memberships
5. If a convergence criterion is not met, go to 3
(Figure: k initial means (here k = 3) are randomly generated within the data domain; k clusters are created by associating every observation with the nearest mean; the centroid of each of the k clusters becomes the new mean; steps 2 and 3 are repeated until convergence has been reached)
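A minimal sketch assuming scikit-learn's KMeans; the blob data and k = 3 are illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# k is user defined; n_init restarts with different random seeds to reduce
# the sensitivity to the initial centroids discussed later
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print("centroids:\n", km.cluster_centers_)
print("SSE (inertia):", km.inertia_)          # sum of squared distances to centroids
print("cluster of a new point:", km.predict([[0.0, 0.0]]))
```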
K-Means Algorithm
Stopping / Convergence Criterion
1. No (or minimum) re-assignments of data points to different clusters
2. No (or minimum) change of centroids or minimum decrease in the sum
of squared error (SSE)
– Cj is the jth cluster, mj is the centroid of cluster Cj (the mean vector of all the data
points in Cj), and dist(x, mj) is the distance between data point x and centroid mj
SSE = Σ j=1..k Σ x∈Cj dist(x, mj)^2
Example
Example
K Means - Strengths
• Simple to understand and implement
• Efficient: time complexity O(tkn) where
– n is number of data points
– k is number of clusters
– t is number of iterations
• Since both k and t are usually small, it is effectively a linear algorithm
• Most popular clustering algorithm
• Terminates at a local optimum if SSE is used
• The global optimum is hard to find due to complexity
K Means - Weaknesses
• Only applicable if mean is defined
– For categorical data, use k-modes - the centroid is represented by the most frequent
values
• User must specify k
• Sensitive to outliers
– Outliers are data points that are very far away from other data points
– Outliers could be errors in the data recording or some special data points with
very different values
K Means - Weaknesses
Handling Outliers
• Remove data points in the clustering process that are much further away
from the centroids than other data points
• Perform random sampling
– Since in sampling we only choose a small subset of the data points, the
chance of selecting an outlier is very small
– Assign the rest of the data points to the clusters by distance or similarity
comparison or classification
Sensitivity to Initial Seeds
Use of Different Seeds for Good Results
There are some methods to help choose good seeds
Weaknesses
Unsuitable for discovering clusters that are not hyper-ellipsoids (or hyper-spheres)
K-Means
• Still the most popular algorithm - simplicity, efficiency
• Other clustering algorithms have their own weaknesses
• No clear evidence that any other clustering algorithm performs better in
general
– although other algorithms could be more suitable for some specific types of
data or applications
• Comparing different clustering algorithms is a difficult task
• No one knows the correct clusters
Clusters Representation
• Use the centroid of each cluster to represent the cluster
• Compute the radius and standard deviation of the cluster to determine its
spread in each dimension
• Centroid representation alone works well if the clusters are of the hyper-
spherical shape
• If clusters are elongated or are of other shapes, centroids are not
sufficient
Cluster Classification
• All the points in a cluster have the same class label - the
cluster ID
• Run a supervised learning algorithm on the data to find a
classification model
Cluster Classification
Distance Between Two Clusters
• Single link
• Complete link
• Average link
• Centroids
• …
Single Link Method
• The distance between two clusters is the distance between two closest
data points in the two clusters, one data point from each cluster
• It can find arbitrarily shaped clusters, but
– It may cause the undesirable chain effect by noisy points
Complete Link Method
• Distance between two clusters is the distance of two furthest data points
in the two clusters
• Sensitive to outliers because they are far away
Average Link Method
• Distance between two clusters is the average distance of all pair-wise
distances between the data points in two clusters
• A compromise between
– the sensitivity of complete-link clustering to outliers
– the tendency of single-link clustering to form long chains that do not
correspond to the intuitive notion of clusters as compact, spherical objects
Centroid Method
• Distance between two clusters is the distance
between their centroids
Algorithmic Complexity
• All the algorithms are at least O(n2)
– n is the number of data points
• Single link can be done in O(n2)
• Complete and average links can be done in O(n2logn)
• Due to the complexity, hard to use for large data sets
– Sampling
– Scale-up methods (BIRCH)
Distance Functions
• Key to clustering
• similarity and dissimilarity are also commonly used terms
• Numerous distance functions
– Different types of data
• Numeric data
• Nominal data
– Different specific applications
Distance Functions - Numeric Attributes
• Euclidean distance
• Manhattan (city block) distance
• Denote distance with dist(xi, xj) where xi and xj are data points (vectors)
• They are special cases of Minkowski distance
• h is positive integer
dist(xi, xj) = (|xi1 - xj1|^h + |xi2 - xj2|^h + … + |xir - xjr|^h)^(1/h)
Distance Formulae
• If h = 2, it is the Euclidean distance
• If h = 1, it is the Manhattan distance
• Weighted Euclidean distance
Euclidean distance: dist(xi, xj) = ((xi1 - xj1)^2 + (xi2 - xj2)^2 + … + (xir - xjr)^2)^(1/2)
Manhattan distance: dist(xi, xj) = |xi1 - xj1| + |xi2 - xj2| + … + |xir - xjr|
Weighted Euclidean distance: dist(xi, xj) = (w1(xi1 - xj1)^2 + w2(xi2 - xj2)^2 + … + wr(xir - xjr)^2)^(1/2)
Distance Formulae
• Squared Euclidean distance - to place progressively greater weight on
data points that are further apart
• Chebychev distance - used when one wants to define two data points as different if
they are different on any one of the attributes
Squared Euclidean distance: dist(xi, xj) = (xi1 - xj1)^2 + (xi2 - xj2)^2 + … + (xir - xjr)^2
Chebychev distance: dist(xi, xj) = max(|xi1 - xj1|, |xi2 - xj2|, …, |xir - xjr|)
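A small NumPy sketch of the distance functions above; the vectors and weights are made up.

```python
import numpy as np

x_i = np.array([1.0, 4.0, 2.0])
x_j = np.array([3.0, 1.0, 2.0])
w   = np.array([1.0, 0.5, 2.0])          # weights for the weighted variant

euclidean = np.sqrt(np.sum((x_i - x_j) ** 2))
manhattan = np.sum(np.abs(x_i - x_j))
minkowski = lambda h: np.sum(np.abs(x_i - x_j) ** h) ** (1.0 / h)
weighted  = np.sqrt(np.sum(w * (x_i - x_j) ** 2))
sq_euclid = np.sum((x_i - x_j) ** 2)
chebychev = np.max(np.abs(x_i - x_j))

print(euclidean, manhattan, minkowski(2), weighted, sq_euclid, chebychev)
# minkowski(2) equals the Euclidean distance; minkowski(1) the Manhattan distance
```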
Curse of Dimensionality
• Various problems that arise analyzing and organizing data in high
dimensional spaces do not occur in low dimensional space like 2D or 3D
• In the context of classification/function approximation, performance of
classification algorithm can improve by removing irrelevant features
Dimensionality Reduction - Applications
• Information Retrieval – web documents where dimensionality is
vocabulary of words
• Recommender systems – large-scale ratings matrices
• Social networks – social graph with large number of users
• Biology – gene expressions
• Image processing – facial recognition
Dimensionality Reduction
• Defying the curse of dimensionality - simpler models result in improved generalization
• Classification algorithm may not scale up to the size of the full feature set either in space or
time
• Improves understanding of domain
• Cheaper to collect and store data based on reduced feature set
• Two Techniques
– Feature Construction
– Feature Selection
Dimensionality Reduction
Techniques
• Linear Discriminant Analysis – LDA
– Tries to identify attributes that account for the most variance between classes
– LDA compared to PCA is a supervised method using known labels
• Principal component analysis – PCA
– Identifies combinations of linearly correlated attributes (principal components or directions in
feature space) that account for the most variance in the data
– Plot the different samples on the first 2 principal components
• Singular Value Decomposition – SVD
– Factorization of a real or complex matrix
– Closely related to PCA (PCA can be computed via an SVD)
Feature Construction
• Linear methods
– Principal component analysis (PCA)
– Independent component analysis (ICA)
– Fisher Linear Discriminant (LDA)
– ….
• Non-linear methods
– Kernel PCA
– Non linear component analysis (NLCA)
– Local linear embedding (LLE)
– ….
Principal component analysis - PCA
• A tool in exploratory data analysis and to create predictive models
• Involves calculating Eigenvalue decomposition of a data covariance matrix,
usually after mean centering the data for each attribute
• Mathematically defined as an orthogonal linear transformation to map data to a
new coordinate system such that the greatest variance by any projection of the
data comes to lie on the first coordinate (called the first principal component), the
second greatest variance on the second coordinate, and so on
• Theoretically the optimal linear scheme, in terms of least mean square error, for
compressing a set of high dimensional vectors into a set of lower dimensional
vectors and then reconstructing the original set
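A minimal sketch of the eigendecomposition view of PCA described above, using NumPy on synthetic data; the shapes and the choice of 2 components are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                       # 200 samples, 5 attributes

# Mean-center each attribute, then eigendecompose the covariance matrix
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)              # returned in ascending order

# Principal components = eigenvectors with the largest eigenvalues
order = np.argsort(eigvals)[::-1]
components = eigvecs[:, order[:2]]                  # keep the first 2 components
X_reduced = Xc @ components                         # project onto the new coordinates

print("variance explained by first 2 components:",
      eigvals[order[:2]] / eigvals.sum())
```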
PCA
PCA
Fisher Linear Discriminant
• A classification method that projects high-dimensional data
onto a line and performs classification in this one-
dimensional space
• The projection maximizes the distance between the means of
the two classes while minimizing the variance within each
class
Fisher Linear Discriminant
Linear Discriminant Analysis - LDA
• A generalization of Fisher's linear
discriminant
• A method used to find a linear
combination of features that
characterizes or separates two or more
classes of objects or events
• The resulting combination may be used
as a linear classifier or more commonly
for dimensionality reduction before later
classification
Difference
Kernel PCA
• Classic PCA approach is a linear projection technique that works well if
the data is linearly separable
• In the case of linearly inseparable data, a nonlinear technique is required
if the task is to reduce the dimensionality of a dataset
Kernel PCA
• The basic idea to deal with linearly inseparable data is to project it onto a higher
dimensional space where it becomes linearly separable
• Consider a nonlinear mapping function ϕ so that the mapping of a sample x can be written
as x → ϕ(x)
• The term kernel describes a function that calculates the dot product of the images of the
samples x under ϕ
• κ(xi, xj) = ϕ(xi)Tϕ(xj)
• The function ϕ maps the original d-dimensional features into a larger k-dimensional feature
space by creating nonlinear combinations of the original features
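A small sketch assuming scikit-learn's KernelPCA; the concentric-circles data and the RBF kernel with gamma = 10 are illustrative choices for a linearly inseparable case.

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA, PCA

# Concentric circles are not linearly separable in the original space
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

linear = PCA(n_components=2).fit_transform(X)                 # stays inseparable
kernel = KernelPCA(n_components=2, kernel="rbf", gamma=10).fit_transform(X)

# After the RBF-kernel mapping, the first component separates the two classes
print("class means on first kernel component:",
      kernel[y == 0, 0].mean(), kernel[y == 1, 0].mean())
```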
Kernel PCA
Singular Value Decomposition - SVD
• A mechanism to break a matrix into simpler meaningful pieces
• Used to detect groupings in data
• A factorization of a real or complex matrix
• A general rectangular M-by-N matrix A has an SVD into the product of an
M-by-N orthogonal matrix U, an N-by-N diagonal matrix of singular
values S and the transpose of an N-by-N orthogonal square matrix V
– A = U S V^T
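A minimal sketch of the factorization A = U S V^T using NumPy's thin SVD; the matrix is made up.

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 3.0],
              [1.0, 1.0]])                 # a general rectangular M-by-N matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # thin SVD: U is M-by-N

# Reconstruct A = U S V^T
A_rebuilt = U @ np.diag(s) @ Vt
print("singular values:", s)
print("max reconstruction error:", np.abs(A - A_rebuilt).max())
```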
SVD
Hidden Markov Models - HMM
• A statistical Markov model in which the system being modelled is
assumed to be a Markov process with unobserved (hidden) states
• Used in pattern recognition, such as handwriting and speech analysis
• In simpler Markov models like a Markov chain, the state is directly visible
to the observer, and therefore the state transition probabilities are the
only parameters
• In HMM, the state is not directly visible, but output, dependent on the
state, is visible
Hidden Markov Models - HMM
• Each state has a probability distribution over the possible output tokens
• Therefore the sequence of tokens generated by an HMM gives some
information about the sequence of states
• Adjective hidden refers to the state sequence through which the model
passes, not to the parameters of the model
• The model is still referred to as a 'hidden' Markov model even if these
parameters are known exactly
HMM
MODEL EVALUATION
Model Evaluation
• How accurate is the classifier?
• When the classifier is wrong, how is it wrong?
• Decide on which classifier (which parameters) to use
and to estimate what the performance of the system
will be
Testing Set
• Split the available data into a training set and a test set
• Train the classifier on the training set and evaluate based on
the test set
Classifier Accuracy
• The accuracy of a classifier on a given test set is the
percentage of test set tuples that are correctly classified by
the classifier
• Often also referred to as recognition rate
• Error rate (or misclassification rate) is the opposite of
accuracy
False Positive vs. Negative
• When is the model wrong?
– False positives vs. false negatives
– Related to type I and type II errors in statistics
• Often there is a different cost associated with false
positives and false negatives
– Diagnosing diseases
Confusion Matrix
• Mechanism to illustrate how a model is performing in terms
of false positives and false negatives
• Provides more information than a single accuracy figure
• Allows thinking about the cost of mistakes
• Extendable to any number of classes
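A small sketch assuming scikit-learn's confusion_matrix; the labels and predictions are made up.

```python
from sklearn.metrics import confusion_matrix, accuracy_score

# Hypothetical true labels and classifier predictions (1 = positive class)
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

# Rows = actual class, columns = predicted class:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_true, y_pred))
print("accuracy:", accuracy_score(y_true, y_pred))
```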
Confusion Matrix
More Accuracy Measures
Area Under ROC Curve - AUC
• ROC curves can be used to
compare models
• Bigger the AUC, the more
accurate the model
• ROC index is the area under
the ROC curve
Gini-Statistic
• Calculated from the ROC Curve
– Gini = 2 *AUC – 1
• Where the AUC is the area under the ROC curve
K-Fold Cross Validation
• Divide the entire data set into k folds
• For each of k experiments, use kth fold for testing and
everything else for training
K-Fold Cross Validation
• The accuracy of the system is calculated as the average accuracy across the k
folds
• The main advantages of k-fold cross validation are that every example is
used in testing at some stage and the problem of an unfortunate split is
avoided
• Any value can be used for k
– 10 is most common
– Depends on the data set
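A minimal sketch of 10-fold cross validation assuming scikit-learn's cross_val_score; the dataset and the choice of logistic regression are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

# 10-fold cross validation: each fold is used once for testing,
# the rest for training; report the average accuracy
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=10)
print("fold accuracies:", scores.round(3))
print("mean accuracy:", scores.mean())
```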
References
1. W. L. Chao, J. J. Ding, “Integrated Machine Learning Algorithms for Human Age Estimation”, NTU, 2011
2. Phil Simon (March 18, 2013). Too Big to Ignore: The Business Case for Big Data. Wiley
3. Mitchell, T. (1997). Machine Learning, McGraw Hill. ISBN 0-07-042807-7
4. Harnad, Stevan (2008), "The Annotation Game: On Turing (1950) on Computing, Machinery, and Intelligence", in Epstein, Robert;
Peters, Grace, The Turing Test Sourcebook: Philosophical and Methodological Issues in the Quest for the Thinking Computer,
Kluwer
5. Russell, Stuart; Norvig, Peter (2003) [1995] Artificial Intelligence: A Modern Approach (2nd ed.) Prentice Hall
6. Langley, Pat (2011). "The changing science of machine learning". Machine Learning 82 (3): 275–279. doi:10.1007/s10994-011-5242-y
7. Le Roux, Nicolas; Bengio, Yoshua; Fitzgibbon, Andrew (2012). "Improving First and Second-Order Methods by Modeling
Uncertainty". In Sra, Suvrit; Nowozin, Sebastian; Wright, Stephen J. Optimization for Machine Learning. MIT Press. p. 404
8. MI Jordan (2014-09-10). "statistics and machine learning“ Cornell University Library. "Breiman : Statistical Modeling: The Two
Cultures (with comments and a rejoinder by the author)”
9. Gareth James; Daniela Witten; Trevor Hastie; Robert Tibshirani (2013). An Introduction to Statistical Learning. Springer
10. Yoshua Bengio (2009). Learning Deep Architectures for AI. Now Publishers Inc. pp. 1–3. ISBN 978-1-60198-294-0
11. A. M. Tillmann, "On the Computational Intractability of Exact and Approximate Dictionary Learning", IEEE Signal Processing
Letters 22(1), 2015: 45–49
12. Aharon, M, M Elad, and A Bruckstein. 2006. "K-SVD: An Algorithm for Designing Overcomplete Dictionaries for Sparse
Representation." Signal Processing, IEEE Transactions on 54 (11): 4311-4322
13. Goldberg, David E.; Holland, John H. (1988). "Genetic algorithms and machine learning". Machine Learning 3 (2): 95–99
Thank You
Check Out My LinkedIn Profile at
https://in.linkedin.com/in/girishkhanzode

Contenu connexe

Tendances

Classification Based Machine Learning Algorithms
Classification Based Machine Learning AlgorithmsClassification Based Machine Learning Algorithms
Classification Based Machine Learning AlgorithmsMd. Main Uddin Rony
 
Classification and Regression
Classification and RegressionClassification and Regression
Classification and RegressionMegha Sharma
 
Machine learning Algorithms
Machine learning AlgorithmsMachine learning Algorithms
Machine learning AlgorithmsWalaa Hamdy Assy
 
Applications in Machine Learning
Applications in Machine LearningApplications in Machine Learning
Applications in Machine LearningJoel Graff
 
Machine learning ppt
Machine learning ppt Machine learning ppt
Machine learning ppt Poojamanic
 
Supervised Machine Learning
Supervised Machine LearningSupervised Machine Learning
Supervised Machine LearningAnkit Rai
 
Unsupervised learning
Unsupervised learningUnsupervised learning
Unsupervised learningamalalhait
 
Supervised Machine Learning Techniques
Supervised Machine Learning TechniquesSupervised Machine Learning Techniques
Supervised Machine Learning TechniquesTara ram Goyal
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine LearningEng Teong Cheah
 
Introduction to Deep Learning
Introduction to Deep LearningIntroduction to Deep Learning
Introduction to Deep LearningOswald Campesato
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine LearningLior Rokach
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine LearningRahul Jain
 
Presentation on supervised learning
Presentation on supervised learningPresentation on supervised learning
Presentation on supervised learningTonmoy Bhagawati
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine LearningKmPooja4
 
Machine Learning presentation.
Machine Learning presentation.Machine Learning presentation.
Machine Learning presentation.butest
 

Tendances (20)

Classification Based Machine Learning Algorithms
Classification Based Machine Learning AlgorithmsClassification Based Machine Learning Algorithms
Classification Based Machine Learning Algorithms
 
Machine Can Think
Machine Can ThinkMachine Can Think
Machine Can Think
 
Machine learning
Machine learningMachine learning
Machine learning
 
Classification and Regression
Classification and RegressionClassification and Regression
Classification and Regression
 
Machine learning Algorithms
Machine learning AlgorithmsMachine learning Algorithms
Machine learning Algorithms
 
Applications in Machine Learning
Applications in Machine LearningApplications in Machine Learning
Applications in Machine Learning
 
Machine learning ppt
Machine learning ppt Machine learning ppt
Machine learning ppt
 
Supervised Machine Learning
Supervised Machine LearningSupervised Machine Learning
Supervised Machine Learning
 
supervised learning
supervised learningsupervised learning
supervised learning
 
Unsupervised learning
Unsupervised learningUnsupervised learning
Unsupervised learning
 
Supervised Machine Learning Techniques
Supervised Machine Learning TechniquesSupervised Machine Learning Techniques
Supervised Machine Learning Techniques
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
 
Machine learning
Machine learningMachine learning
Machine learning
 
Introduction to Deep Learning
Introduction to Deep LearningIntroduction to Deep Learning
Introduction to Deep Learning
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
 
Supervised learning
  Supervised learning  Supervised learning
Supervised learning
 
Presentation on supervised learning
Presentation on supervised learningPresentation on supervised learning
Presentation on supervised learning
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
 
Machine Learning presentation.
Machine Learning presentation.Machine Learning presentation.
Machine Learning presentation.
 

Similaire à Machine Learning

ML SFCSE.pptx
ML SFCSE.pptxML SFCSE.pptx
ML SFCSE.pptxNIKHILGR3
 
How Machine Learning Helps Organizations to Work More Efficiently?
How Machine Learning Helps Organizations to Work More Efficiently?How Machine Learning Helps Organizations to Work More Efficiently?
How Machine Learning Helps Organizations to Work More Efficiently?Tuan Yang
 
background.pptx
background.pptxbackground.pptx
background.pptxKabileshCm
 
CSA 3702 machine learning module 2
CSA 3702 machine learning module 2CSA 3702 machine learning module 2
CSA 3702 machine learning module 2Nandhini S
 
Predictive analytics
Predictive analyticsPredictive analytics
Predictive analyticsDinakar nk
 
04 Classification in Data Mining
04 Classification in Data Mining04 Classification in Data Mining
04 Classification in Data MiningValerii Klymchuk
 
Instance based learning
Instance based learningInstance based learning
Instance based learningSlideshare
 
An introduction to variable and feature selection
An introduction to variable and feature selectionAn introduction to variable and feature selection
An introduction to variable and feature selectionMarco Meoni
 
Data mining techniques unit v
Data mining techniques unit vData mining techniques unit v
Data mining techniques unit vmalathieswaran29
 
algoritma klastering.pdf
algoritma klastering.pdfalgoritma klastering.pdf
algoritma klastering.pdfbintis1
 
From decision trees to random forests
From decision trees to random forestsFrom decision trees to random forests
From decision trees to random forestsViet-Trung TRAN
 
Data mining techniques unit iv
Data mining techniques unit ivData mining techniques unit iv
Data mining techniques unit ivmalathieswaran29
 
Lect8 Classification & prediction
Lect8 Classification & predictionLect8 Classification & prediction
Lect8 Classification & predictionhktripathy
 

Similaire à Machine Learning (20)

ML SFCSE.pptx
ML SFCSE.pptxML SFCSE.pptx
ML SFCSE.pptx
 
How Machine Learning Helps Organizations to Work More Efficiently?
How Machine Learning Helps Organizations to Work More Efficiently?How Machine Learning Helps Organizations to Work More Efficiently?
How Machine Learning Helps Organizations to Work More Efficiently?
 
background.pptx
background.pptxbackground.pptx
background.pptx
 
CSA 3702 machine learning module 2
CSA 3702 machine learning module 2CSA 3702 machine learning module 2
CSA 3702 machine learning module 2
 
Predictive analytics
Predictive analyticsPredictive analytics
Predictive analytics
 
04 Classification in Data Mining
04 Classification in Data Mining04 Classification in Data Mining
04 Classification in Data Mining
 
Instance based learning
Instance based learningInstance based learning
Instance based learning
 
KNN
KNNKNN
KNN
 
PPT s10-machine vision-s2
PPT s10-machine vision-s2PPT s10-machine vision-s2
PPT s10-machine vision-s2
 
An introduction to variable and feature selection
An introduction to variable and feature selectionAn introduction to variable and feature selection
An introduction to variable and feature selection
 
Data mining techniques unit v
Data mining techniques unit vData mining techniques unit v
Data mining techniques unit v
 
algoritma klastering.pdf
algoritma klastering.pdfalgoritma klastering.pdf
algoritma klastering.pdf
 
7 decision tree
7 decision tree7 decision tree
7 decision tree
 
AL slides.ppt
AL slides.pptAL slides.ppt
AL slides.ppt
 
From decision trees to random forests
From decision trees to random forestsFrom decision trees to random forests
From decision trees to random forests
 
UNIT_V_Cluster Analysis.pptx
UNIT_V_Cluster Analysis.pptxUNIT_V_Cluster Analysis.pptx
UNIT_V_Cluster Analysis.pptx
 
Data mining techniques unit iv
Data mining techniques unit ivData mining techniques unit iv
Data mining techniques unit iv
 
Moviereview prjct
Moviereview prjctMoviereview prjct
Moviereview prjct
 
Weka bike rental
Weka bike rentalWeka bike rental
Weka bike rental
 
Lect8 Classification & prediction
Lect8 Classification & predictionLect8 Classification & prediction
Lect8 Classification & prediction
 

Plus de Girish Khanzode (13)

Apache Spark Components
Apache Spark ComponentsApache Spark Components
Apache Spark Components
 
Apache Spark Core
Apache Spark CoreApache Spark Core
Apache Spark Core
 
Data Visulalization
Data VisulalizationData Visulalization
Data Visulalization
 
Graph Databases
Graph DatabasesGraph Databases
Graph Databases
 
IR
IRIR
IR
 
NLP
NLPNLP
NLP
 
NLTK
NLTKNLTK
NLTK
 
NoSql
NoSqlNoSql
NoSql
 
Recommender Systems
Recommender SystemsRecommender Systems
Recommender Systems
 
Hadoop
HadoopHadoop
Hadoop
 
Language R
Language RLanguage R
Language R
 
Python Scipy Numpy
Python Scipy NumpyPython Scipy Numpy
Python Scipy Numpy
 
Funtional Programming
Funtional ProgrammingFuntional Programming
Funtional Programming
 

Dernier

TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...itnewsafrica
 
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical InfrastructureVarsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructureitnewsafrica
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesBernd Ruecker
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Alkin Tezuysal
 
React Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkReact Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkPixlogix Infotech
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesManik S Magar
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesKari Kakkonen
 

Dernier (20)

TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
 
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical InfrastructureVarsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architectures
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
React Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkReact Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App Framework
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 

Machine Learning

  • 2. Contents • Supervised Learning Model • Linear Regression • KNN • DecisionTree Learning • OptimizedTree Induction • Random Forest • Logistic Regression • SVM • Naive Bayes Classifier • Clustering • K-means Clustering • Cluster Classification • AlgorithmicComplexity • Dimensionality Reduction • PCA • Fisher Linear Discriminant • LDA • Kernel PCA • SVD • HMM • Model Evaluation • Confusion Matrix • K-Fold CrossValidation • References
  • 3. Machine Learning and Pattern Classification • Predictive modelling is building a model capable of making predictions • Such a model includes a machine learning algorithm that learns certain properties from a training dataset in order to make those predictions • Predictive modelling types - Regression and pattern classification • Regression models analyze relationships between variables and trends in order to make predictions about continuous variables – Prediction of the maximum temperature for the upcoming days in weather forecasting • Pattern classification assigns discrete class labels to particular observations as outcomes of a prediction – Prediction of a sunny, rainy or snowy day
  • 4. Machine Learning Methodologies • Supervised learning – Learning from labelled data – Classification, Regression, Prediction, Function Approximation • Unsupervised learning – Learning from unlabelled data – Clustering,Visualization, Dimensionality Reduction
  • 5. Machine Learning Methodologies • Semi-supervised learning – mix of Supervised and Unsupervised learning – usually small part of data is labelled • Reinforcement learning – Model learns from a series of actions by maximizing a reward function – The reward function can either be maximized by penalizing bad actions and/or rewarding good actions – Example - training of self-driving car using feedback from the environment
  • 6. Applications • Speech recognition • Effective web search • Recommendation systems • Computer vision • Information retrieval • Spam filtering • Computational finance • Fraud detection • Medical diagnosis • Stock market analysis • Structural health monitoring
  • 9. Learning Process • Supervised LearningAlgorithms are used in classification and prediction • Training set - each record contains a set of attributes, one of the attributes is the class • Classification or prediction algorithm learns from training data about relationship between predictor variables and outcome variable • This process results in – Classification model – Predictive model
  • 12. Supervised Learning Model • The class labels in the dataset used to build the classification model are known • Example - a dataset for spam filtering would contain spam messages as well as "ham" (= not-spam) messages • In a supervised learning problem, it is known which message in the training set is spam or ham and this information is used to train our model in order to classify new unseen messages
  • 15. Linear Regression • A standard and simple mathematical technique for predicting numeric outcome • Oldest and most widely used predictive model • Goal - minimize the sum of the squared errors to fit a straight line to a set of data points • Fits a linear function to a set of data points • Form of the function – Y = β0 + β1*X1 + β2*X2 + … + βn*Xn – Y is the target variable and X1, X2, ... Xn are the predictor variables – β1, β2, … βn are the coefficients that multiply the predictor variables – β0 is constant • Linear regression with multiple variables – Scale the data, and implement the gradient descent and the cost function
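For illustration, here is a minimal sketch of fitting such a linear function with batch gradient descent on feature-scaled data. The synthetic dataset, learning rate and iteration count below are arbitrary choices, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                          # predictor variables X1..X3
y = 4.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Scale the data, then add a column of ones so beta[0] plays the role of the constant
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
Xb = np.hstack([np.ones((len(X_scaled), 1)), X_scaled])

beta = np.zeros(Xb.shape[1])
learning_rate = 0.1
for _ in range(1000):
    residual = Xb @ beta - y                           # prediction error
    beta -= learning_rate * (Xb.T @ residual) / len(y) # gradient descent step

cost = ((Xb @ beta - y) ** 2).mean() / 2               # squared-error cost after fitting
print(beta.round(3), round(cost, 5))                   # beta[1:] are coefficients on the scaled features
```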
  • 17. K Nearest Neighbors - KNN • A simple algorithm that stores all available cases and classifies new cases based on a similarity measure • Extremely simple to implement • Lazy learning - the function is only approximated locally and all computation is deferred until classification • Has a weighted version and can also be used for regression • Usually works very well when a meaningful distance can be defined between examples (Euclidean, Manhattan) • Slow when the training set is large (say 10^6 examples) and the distance calculation is non-trivial • Only a single hyper-parameter - K (usually optimized using cross-validation)
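A possible sketch with scikit-learn, where the single hyper-parameter K is tuned by cross-validation; the Iris dataset, Euclidean metric and search range are illustrative choices only.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Cross-validated search over K (n_neighbors), the only hyper-parameter
search = GridSearchCV(KNeighborsClassifier(metric="euclidean"),
                      {"n_neighbors": list(range(1, 16))}, cv=5)
search.fit(X_train, y_train)

print(search.best_params_)                 # the K chosen by cross-validation
print(search.score(X_test, y_test))        # accuracy on the held-out test set
```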
  • 18. KNN
  • 20. Decision Tree Learning • Decision trees classify instances or examples by starting at the root of the tree and moving through it until a leaf node is reached • A method for approximating discrete-valued functions • A decision tree is a classifier in the form of a tree structure – Decision node - specifies a test on a single attribute – Leaf node - indicates the value of the target attribute – Branch - a split on one attribute – Path - a conjunction of tests leading to the final decision
  • 21. When to Consider Decision Trees • Attribute-value description - the object or case must be expressible in terms of a fixed collection of properties or attributes – e.g. hot, mild, cold • Predefined classes (target values) - the target function has discrete output values – Boolean or multiclass • Sufficient data - enough training cases should be provided to learn the model • Possibly noisy training data • Missing attribute values
  • 22. Decision Tree Applications • Credit risk analysis • Manufacturing – chemical material evaluation • Production – process optimization • Biomedical engineering – identify features to use in implantable devices • Astronomy – filter noise from Hubble telescope images • Molecular biology – analyze amino acid sequences in the Human Genome project • Pharmacology – drug efficacy analysis • Planning – scheduling of PCB assembly lines • Medicine – analysis of syndromes
  • 23. Strengths • Trees are inexpensive to construct • Extremely fast at classifying unknown records • Easy to interpret for small-sized trees • Accuracy is comparable to other classification techniques for many simple data sets • Generates understandable rules • Handles continuous and categorical variables • Provides a clear indication of which fields are most important for prediction or classification
  • 24. Weaknesses • Not suitable for predicting continuous attributes • Performs poorly with many classes and small data sets • Computationally expensive to train – At each node, each candidate splitting field must be sorted before its best split can be found – In some algorithms, combinations of fields are used and a search must be made for optimal combining weights – Pruning algorithms can also be expensive since many candidate sub-trees must be formed and compared • Not suitable for non-rectangular regions
  • 25. Tree Representation • Each node in the tree specifies a test for some attribute of the instance • Each branch corresponds to an attribute value • Each leaf node assigns a classification
  • 28. Problems of Random Splits • The tree can grow huge • Such trees are hard to understand • Larger trees are typically less accurate than smaller trees • So most tree construction methods split in a greedy manner – at each step, find the feature that best divides the positive examples from the negative examples, e.g. by information gain
  • 29. Optimized Tree Induction • Greedy strategy - split the records based on an attribute test that optimizes a certain criterion • Issues – Determine the root node – Determine how to split the records • How to specify the attribute test condition? • How to determine the best split? – Determine when to stop splitting
  • 30. Optimized Tree Induction • Selection of an attribute at each node – Choose the most useful attribute for classifying training examples • Information gain – Measures how well a given attribute separates the training examples according to their target classification – This measure is used to select among the candidate attributes at each step while growing the tree
  • 31. Entropy • A measure of the homogeneity of the set of examples • Given a set S of positive and negative examples of some target concept (a 2-class problem), the entropy of S relative to this binary classification is – E(S) = - p(P) log2 p(P) - p(N) log2 p(N) • Example – Suppose S has 25 examples, 15 positive and 10 negative [15+, 10-] – Then the entropy of S relative to this classification is • E(S) = -(15/25) log2(15/25) - (10/25) log2(10/25) ≈ 0.97
  • 33. Information Gain • Information gain measures the expected reduction in entropy or uncertainty • Values(A) is the set of all possible values for attribute A and Sv is the subset of S for which attribute A has value v, Sv = {s in S | A(s) = v} – Gain(S, A) = Entropy(S) - Σ_{v ∈ Values(A)} (|Sv| / |S|) · Entropy(Sv) • The first term in the equation is the entropy of the original collection S • The second term is the expected value of the entropy after S is partitioned using attribute A • It is the expected reduction in entropy caused by partitioning the examples according to this attribute • It is the number of bits saved when encoding the target value of an arbitrary member of S by knowing the value of attribute A
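The two formulas above can be checked with a few lines of Python; the helper names entropy and information_gain below are purely illustrative, not from the slides' source.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """E(S) = -sum over classes of p * log2(p)."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(labels, attribute_values):
    """Gain(S, A) = Entropy(S) - sum_v |Sv|/|S| * Entropy(Sv)."""
    total = len(labels)
    remainder = 0.0
    for v in set(attribute_values):
        subset = [lab for lab, a in zip(labels, attribute_values) if a == v]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

# Entropy of the [15+, 10-] example from the previous slide (~0.97 bits)
print(round(entropy(["+"] * 15 + ["-"] * 10), 3))

# A perfectly informative binary attribute recovers the full entropy as gain
print(information_gain(["+", "+", "-", "-"], ["a", "a", "b", "b"]))   # 1.0
```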
  • 34. A simple example • Guess the outcome of next week's game between the MallRats and the Chinooks • Available knowledge / Attribute – was the game at Home or Away – was the starting time 5pm, 7pm or 9pm – Did Joe play center, or forward – whether that opponent's center was tall or not – …..
  • 36. Problem Data • The game will be away, at 9pm, and Joe will play center on offense… • A classification problem • Generalizing the learned rule to new examples
  • 37. Examples • Before partitioning, the entropy is – H(10/20, 10/20) = - 10/20 log(10/20) - 10/20 log(10/20) = 1 • Using the where attribute, divide into 2 subsets – Entropy of the first set H(home) = - 6/12 log(6/12) - 6/12 log(6/12) = 1 – Entropy of the second set H(away) = - 4/8 log(4/8) - 4/8 log(4/8) = 1 • Expected entropy after partitioning – 12/20 * H(home) + 8/20 * H(away) = 1
  • 38. Examples • Using the when attribute, divide into 3 subsets – Entropy of the first set H(5pm) = - 1/4 log(1/4) - 3/4 log(3/4) – Entropy of the second set H(7pm) = - 9/12 log(9/12) - 3/12 log(3/12) – Entropy of the third set H(9pm) = - 0/4 log(0/4) - 4/4 log(4/4) = 0 • Expected entropy after partitioning – 4/20 * H(1/4, 3/4) + 12/20 * H(9/12, 3/12) + 4/20 * H(0/4, 4/4) = 0.65 • Information gain = 1 - 0.65 = 0.35
  • 39. Decision • Knowing the when attribute values provides larger information gain than where • Therefore the when attribute should be chosen for testing prior to the where attribute • Similarly we can compute the information gain for other attributes • At each node choose the attribute with the largest information gain • Stopping rule – Every attribute has already been included along this path through the tree or – The training examples associated with this leaf node all have the same target attribute value - entropy is zero
  • 40. Continuous Attribute • Each non-leaf node is a test • Its edges partition the attribute into subsets (easy for a discrete attribute) • For a continuous attribute – Partition the continuous values of attribute A into a discrete set of intervals – Create a new Boolean attribute Ac by looking for a threshold c: Ac = true if A > c, false otherwise – How to choose c?
  • 41. Evaluation • Training accuracy – How many training instances can be correctly classified based on the available data? – It is high when the tree is deep/large or when there is little conflict among the training instances – A higher value does not mean good generalization • Testing accuracy – Given a number of new instances, how many of them can be correctly classified? – Cross validation
  • 42. Decision Tree Creation Algorithms • ID3 • C4.5 • Hunt’s Algorithm • CART • SLIQ, SPRINT
  • 43. Random Forest • An ensemble classifier that consists of many decision trees • Outputs the class that is the mode of the classes output by the individual trees • The method combines Breiman's bagging idea and the random selection of features • Used for classification and regression
  • 45. Algorithm • Let the number of training cases be N and the number of variables in the classifier be M • m is the number of input variables used to determine the decision at a node of the tree - m should be much less than M • Choose a training set for this tree by sampling N times with replacement from all N available training cases (a bootstrap sample) • Use the remaining cases to estimate the error of the tree by predicting their classes • For each node of the tree, randomly choose m variables on which to base the decision at that node • Calculate the best split based on these m variables in the training set • Each tree is fully grown and not pruned
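A sketch of the same recipe using scikit-learn's RandomForestClassifier: bootstrap sampling plus max_features corresponds to drawing N cases with replacement and considering m << M variables per node, and oob_score_ exposes the out-of-bag accuracy. The dataset and settings below are illustrative only.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=200,      # number of fully grown trees
                                max_features="sqrt",   # m variables considered per split
                                bootstrap=True,        # sample N cases with replacement
                                oob_score=True,        # internal generalization estimate
                                random_state=0)
forest.fit(X_train, y_train)

print("out-of-bag accuracy:", round(forest.oob_score_, 3))
print("test accuracy      :", round(forest.score(X_test, y_test), 3))
```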
  • 46. Gini Index • Random forest uses the Gini index, taken from the CART learning system, to construct decision trees • The Gini index of node impurity is the measure most commonly chosen for classification-type problems • How to select N? - Build trees until the error no longer decreases • How to select M? - Try the recommended default, half of it and twice it, and pick the best
  • 47. Random Forest - Flow Chart
  • 48. Working of Random Forest • For prediction, a new sample is pushed down the tree • It is assigned the label of the training samples in the terminal node it ends up in • This procedure is iterated over all trees in the ensemble • The average (majority) vote of all trees is reported as the random forest prediction
  • 49. Random Forest - Advantages • One of the most accurate learning algorithms • Produces a highly accurate classifier • Runs efficiently on large databases • Handles thousands of input variables without variable deletion • Gives estimates of what variables are important in classification • Generates an internal unbiased estimate of the generalization error as the forest building progresses • Effective method for estimating missing data and maintains accuracy when a large proportion of the data are missing
  • 50. Random Forest - Advantages • Has methods for balancing error in data sets with unbalanced class populations • Prototypes are computed that give information about the relation between the variables and the classification • Computes proximities between pairs of cases that can be used in clustering, locating outliers or, by scaling, giving interesting views of the data • The above capabilities can be extended to unlabeled data, leading to unsupervised clustering, data views and outlier detection • Offers an experimental method for detecting variable interactions
  • 51. Random Forest - Disadvantages • Random forests have been observed to overfit on some datasets with noisy classification/regression tasks • For data including categorical variables with different numbers of levels, random forests are biased in favor of the attributes with more levels • Therefore the variable importance scores from random forest are not reliable for this type of data
  • 52. Logistic Regression • Models the relationship between a dependent variable and one or more independent variables • Allows looking at the fit of the model as well as at the significance of the relationships (between dependent and independent variables) being modelled • Estimates the probability of an event occurring - e.g. the probability of a pupil continuing in education post 16 • Predicts, from knowledge of the relevant independent variables, the probability (p) that the outcome is 1 (the event occurring) rather than 0 • While in linear regression the relationship between the dependent and the independent variables is linear, this assumption is not made in logistic regression
  • 53. Logistic Regression • Logistic regression function $$P = \frac{e^{\alpha+\beta x}}{1+e^{\alpha+\beta x}}$$ • P is the probability of a 1 and e is the base of the natural logarithm (about 2.718) • $$\alpha$$ and $$\beta$$ are the parameters of the model • The value of $$\alpha$$ yields P when x is zero and $$\beta$$ indicates how the probability of a 1 changes when x changes by a single unit • Because the relation between x and P is nonlinear, $$\beta$$ does not have as straightforward an interpretation in this model as it does in ordinary linear regression
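A minimal sketch of evaluating the logistic function above; the values of alpha and beta here are made up for illustration, not estimated from any dataset.

```python
import numpy as np

def logistic(x, alpha, beta):
    """P(y=1 | x) = exp(alpha + beta*x) / (1 + exp(alpha + beta*x))."""
    z = alpha + beta * x
    return np.exp(z) / (1.0 + np.exp(z))

x = np.linspace(-4, 4, 9)
print(logistic(x, alpha=0.5, beta=1.2).round(3))   # probabilities stay between 0 and 1
print(logistic(0, alpha=0.5, beta=1.2))            # at x = 0, P depends only on alpha
```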
  • 55. Support Vector Machine - SVM • A supervised learning model with associated learning algorithms that analyze data and recognize patterns • Given a set of training examples, each marked as belonging to one of two categories, the SVM training algorithm builds a model that assigns new examples to one category or the other, making it a non-probabilistic binary linear classifier • An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible • New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall on
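An illustrative sketch of a linear, maximum-margin binary classifier using scikit-learn's SVC; the synthetic two-class data and the C value are arbitrary choices.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_blobs(n_samples=200, centers=2, random_state=0)   # two separable categories
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = SVC(kernel="linear", C=1.0)        # learns the widest-margin separating hyperplane
clf.fit(X_train, y_train)

print("support vectors:", len(clf.support_vectors_))   # points that define the gap
print("test accuracy  :", clf.score(X_test, y_test))
```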
  • 56. SVM
  • 57. Naive Bayes Classifier • A family of simple probabilistic classifiers based on applying Bayes' theorem with strong (naive) independence assumptions between the features • A popular method for text categorization, the problem of judging documents as belonging to one category or the other such as spam or legitimate, sports or politics etc with word frequencies as the features • Highly scalable, requires a number of parameters linear in the number of variables (features/predictors) in a learning problem
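A sketch of naive Bayes text categorization with word counts as features; the tiny spam/ham corpus below is invented purely for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

docs = ["win a free prize now", "meeting agenda for monday",
        "free offer click now", "project status and agenda"]
labels = ["spam", "ham", "spam", "ham"]

# Word-frequency features feeding a multinomial naive Bayes classifier
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(docs, labels)

print(model.predict(["free prize meeting"]))   # predicted category for a new message
```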
  • 59. Naive Bayes - Example
  • 61. Clustering • A technique to find groups of similar data, called clusters • Groups data instances that are similar to (near) each other into one cluster and data instances that are very different (far away) from each other into different clusters • Called an unsupervised learning task - no class values denoting an a priori grouping of the data instances are given, as is the case in supervised learning • One of the most utilized data mining techniques • Has a long history and is used in almost every field - medicine, psychology, botany, sociology, biology, archeology, marketing, insurance, libraries and text clustering
  • 63. Applications • Group people of similar sizes together to make small, medium and large T-shirts – Tailor-made for each person - too expensive – One-size-fits-all - does not fit all • In marketing, segment customers according to their similarities – Targeted marketing • Given a collection of text documents, organize them according to their content similarities – To produce a topic hierarchy
  • 64. Aspects of clustering • Clustering algorithms – Partitional clustering – Hierarchical clustering • A distance function - similarity or dissimilarity • Clustering quality – Inter-cluster distance → maximized – Intra-cluster distance → minimized • Quality of a clustering process depends on algorithm, distance function and application
  • 65. K-means Clustering • A partitional clustering algorithm • Classifies a given data set into a fixed number k of clusters • Let the set of data points D be {x1, x2, …, xn} – xi = (xi1, xi2, …, xir) is a vector in a real-valued space X ⊆ Rr – r = number of attributes (dimensions) in the data • The algorithm partitions the given data into k clusters – Each cluster has a cluster center (centroid) – k is user defined
  • 67. K-Means Algorithm 1. Choose k 2. Randomly choose k data points (seeds) as initial centroids 3. Assign each data point to the closest centroid 4. Re-compute the centroids using the current cluster memberships 5. If a convergence criterion is not met, go to 3
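A compact NumPy sketch of these five steps; empty clusters and ties are ignored for brevity, and the data and k are illustrative.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # step 2: random seeds
    for _ in range(n_iter):
        # step 3: assign each data point to the closest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # step 4: re-compute centroids from current cluster memberships
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):               # step 5: convergence check
            break
        centroids = new_centroids
    sse = ((X - centroids[labels]) ** 2).sum()                  # sum of squared error
    return labels, centroids, sse

# Three well-separated synthetic blobs in 2-D
X = np.vstack([np.random.default_rng(1).normal(loc=c, size=(50, 2)) for c in (0, 5, 10)])
labels, centroids, sse = kmeans(X, k=3)
print(centroids.round(2), round(sse, 2))
```

With an unlucky random seed the algorithm can converge to a poor local optimum, which is why good seed selection (discussed on a later slide) or multiple restarts are commonly used.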
  • 68. K-Means Algorithm (illustration) – 1. k initial means (in this case k = 3) are randomly generated within the data domain – 2. k clusters are created by associating every observation with the nearest mean – 3. The centroid of each of the k clusters becomes the new mean – 4. Steps 2 and 3 are repeated until convergence has been reached
  • 69. Stopping / Convergence Criterion 1. No (or minimum) re-assignments of data points to different clusters 2. No (or minimum) change of centroids, or minimum decrease in the sum of squared error (SSE) – SSE = Σ_{j=1..k} Σ_{x ∈ Cj} dist(x, mj)^2 – Cj is the jth cluster, mj is the centroid of cluster Cj (the mean vector of all the data points in Cj), and dist(x, mj) is the distance between data point x and centroid mj
  • 72. K-Means - Strengths • Simple to understand and implement • Efficient: time complexity O(tkn) where – n is the number of data points – k is the number of clusters – t is the number of iterations • Since both k and t are small, it is effectively a linear algorithm • The most popular clustering algorithm • Terminates at a local optimum if SSE is used • The global optimum is hard to find due to complexity
  • 73. K-Means - Weaknesses • Only applicable if a mean is defined – For categorical data, use k-modes, where the centroid is represented by the most frequent values • The user must specify k • Sensitive to outliers – Outliers are data points that are very far away from other data points – Outliers could be errors in the data recording or some special data points with very different values
  • 75. Handling Outliers • Remove data points in the clustering process that are much further away from the centroids than other data points • Perform random sampling – Since in sampling we only choose a small subset of the data points, the chance of selecting an outlier is very small – Assign the rest of the data points to the clusters by distance or similarity comparison or classification
  • 77. Use of Different Seeds for Good Results There are some methods to help choose good seeds
  • 78. Weaknesses • Unsuitable for discovering clusters that are not hyper-ellipsoids (or hyper-spheres)
  • 79. K-Means • Still the most popular algorithm - simplicity, efficiency • Other clustering algorithms have their own weaknesses • No clear evidence that any other clustering algorithm performs better in general – although other algorithms could be more suitable for some specific types of data or applications • Comparing different clustering algorithms is a difficult task • No one knows the correct clusters
  • 80. Cluster Representation • Use the centroid of each cluster to represent the cluster • Compute the radius and standard deviation of the cluster to determine its spread in each dimension • Centroid representation alone works well if the clusters are of a hyper-spherical shape • If clusters are elongated or are of other shapes, centroids are not sufficient
  • 81. Cluster Classification • All the points in a cluster have the same class label - the cluster ID • Run a supervised learning algorithm on the data to find a classification model
  • 83. Distance Between Two Clusters • Single link • Complete link • Average link • Centroids • …
  • 84. Single Link Method • The distance between two clusters is the distance between two closest data points in the two clusters, one data point from each cluster • It can find arbitrarily shaped clusters, but – It may cause the undesirable chain effect by noisy points
  • 85. Complete Link Method • Distance between two clusters is the distance of two furthest data points in the two clusters • Sensitive to outliers because they are far away
  • 86. Average Link Method • Distance between two clusters is the average distance of all pair-wise distances between the data points in two clusters • A compromise between – the sensitivity of complete-link clustering to outliers – the tendency of single-link clustering to form long chains that do not correspond to the intuitive notion of clusters as compact, spherical objects
  • 87. Centroid Method • Distance between two clusters is the distance between their centroids
  • 88. Algorithmic Complexity • All the algorithms are at least O(n²) – n is the number of data points • Single link can be done in O(n²) • Complete and average links can be done in O(n² log n) • Due to the complexity, they are hard to use for large data sets – Sampling – Scale-up methods (BIRCH)
  • 89. Distance Functions • Key to clustering • similarity and dissimilarity are also commonly used terms • Numerous distance functions – Different types of data • Numeric data • Nominal data – Different specific applications
  • 90. Distance Functions - Numeric Attributes • Euclidean distance • Manhattan (city block) distance • Denote distance with dist(xi, xj) where xi and xj are data points (vectors) • They are special cases of the Minkowski distance – dist(xi, xj) = (|xi1 - xj1|^h + |xi2 - xj2|^h + … + |xir - xjr|^h)^(1/h) – h is a positive integer
  • 91. Distance Formulae • If h = 2, it is the Euclidean distance – dist(xi, xj) = sqrt((xi1 - xj1)^2 + (xi2 - xj2)^2 + … + (xir - xjr)^2) • If h = 1, it is the Manhattan distance – dist(xi, xj) = |xi1 - xj1| + |xi2 - xj2| + … + |xir - xjr| • Weighted Euclidean distance – dist(xi, xj) = sqrt(w1(xi1 - xj1)^2 + w2(xi2 - xj2)^2 + … + wr(xir - xjr)^2)
  • 92. Distance Formulae • Squared Euclidean distance - places progressively greater weight on data points that are further apart – dist(xi, xj) = (xi1 - xj1)^2 + (xi2 - xj2)^2 + … + (xir - xjr)^2 • Chebychev distance - used when two data points should be considered different if they differ on any one of the attributes – dist(xi, xj) = max(|xi1 - xj1|, |xi2 - xj2|, …, |xir - xjr|)
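The distance functions of the last three slides can be sketched in a few lines of NumPy; the vectors and weights below are arbitrary examples.

```python
import numpy as np

xi = np.array([1.0, 2.0, 3.0])
xj = np.array([4.0, 0.0, 3.5])
w  = np.array([1.0, 0.5, 2.0])               # weights for the weighted Euclidean form

def minkowski(a, b, h):                      # general form; h is a positive integer
    return (np.abs(a - b) ** h).sum() ** (1.0 / h)

print(minkowski(xi, xj, 2))                  # Euclidean distance (h = 2)
print(minkowski(xi, xj, 1))                  # Manhattan / city block distance (h = 1)
print(np.sqrt((w * (xi - xj) ** 2).sum()))   # weighted Euclidean distance
print(((xi - xj) ** 2).sum())                # squared Euclidean distance
print(np.abs(xi - xj).max())                 # Chebychev distance
```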
  • 93. Curse of Dimensionality • Various problems that arise when analyzing and organizing data in high-dimensional spaces do not occur in low-dimensional spaces like 2D or 3D • In the context of classification/function approximation, the performance of a classification algorithm can improve by removing irrelevant features
  • 94. Dimensionality Reduction - Applications • Information Retrieval – web documents where dimensionality is vocabulary of words • Recommender systems – large scale of ratings matrix • Social networks – social graph with large number of users • Biology – gene expressions • Image processing – facial recognition
  • 95. Dimensionality Reduction • Defying the curse of dimensionality - simpler models result in improved generalization • The classification algorithm may not scale up to the size of the full feature set either in space or time • Improves understanding of the domain • Cheaper to collect and store data based on a reduced feature set • Two techniques – Feature Construction – Feature Selection
  • 97. Techniques • Linear Discriminant Analysis – LDA – Tries to identify attributes that account for the most variance between classes – Unlike PCA, LDA is a supervised method that uses known class labels • Principal Component Analysis – PCA – Identifies combinations of linearly correlated attributes (principal components, or directions in feature space) that account for the most variance in the data – Plot the different samples on the first 2 principal components • Singular Value Decomposition – SVD – Factorization of a real or complex matrix – Closely related to PCA
  • 98. Feature Construction • Linear methods – Principal component analysis (PCA) – Independent component analysis (ICA) – Fisher Linear Discriminant (LDA) – …. • Non-linear methods – Kernel PCA – Non linear component analysis (NLCA) – Local linear embedding (LLE) – ….
  • 99. Principal component analysis - PCA • A tool in exploratory data analysis and to create predictive models • Involves calculating Eigenvalue decomposition of a data covariance matrix, usually after mean centering the data for each attribute • Mathematically defined as an orthogonal linear transformation to map data to a new coordinate system such that the greatest variance by any projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on • Theoretically the optimal linear scheme, in terms of least mean square error, for compressing a set of high dimensional vectors into a set of lower dimensional vectors and then reconstructing the original set
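A NumPy sketch of the procedure described above: mean-center the data, take the eigenvalue decomposition of the covariance matrix, and project onto the leading components. The synthetic 3-D data and the choice of two components are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[0, 0, 0],
                            cov=[[3, 2, 0], [2, 2, 0], [0, 0, 0.1]], size=300)

X_centered = X - X.mean(axis=0)                       # mean centering per attribute
cov = np.cov(X_centered, rowvar=False)                # data covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(cov)       # eigenvalue decomposition

order = np.argsort(eigenvalues)[::-1]                 # sort by explained variance
components = eigenvectors[:, order[:2]]               # first two principal components
X_reduced = X_centered @ components                   # 3-D data projected to 2-D

print(X_reduced.shape)
print((eigenvalues[order] / eigenvalues.sum()).round(3))   # variance explained per component
```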
  • 100. PCA
  • 101. PCA
  • 102. Fisher Linear Discriminant • A classification method that projects high-dimensional data onto a line and performs classification in this one-dimensional space • The projection maximizes the distance between the means of the two classes while minimizing the variance within each class
  • 104. Linear Discriminant Analysis - LDA • A generalization of Fisher's linear discriminant • A method used to find a linear combination of features that characterizes or separates two or more classes of objects or events • The resulting combination may be used as a linear classifier or more commonly for dimensionality reduction before later classification
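An illustrative scikit-learn sketch of LDA used for dimensionality reduction before (or as) classification; the Iris dataset is just a convenient example and requires the known class labels.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)          # project 4 features onto 2 discriminant axes

print(X_lda.shape)                        # (150, 2): reduced representation
print(round(lda.score(X, y), 3))          # LDA can also be used directly as a classifier
```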
  • 106. Kernel PCA • Classic PCA approach is a linear projection technique that works well if the data is linearly separable • In the case of linearly inseparable data, a nonlinear technique is required if the task is to reduce the dimensionality of a dataset
  • 107. Kernel PCA • The basic idea for dealing with linearly inseparable data is to project it onto a higher-dimensional space where it becomes linearly separable • Consider a nonlinear mapping function ϕ so that the mapping of a sample x can be written as x → ϕ(x) • The term kernel describes a function that calculates the dot product of the images of the samples x under ϕ – κ(xi, xj) = ϕ(xi)T ϕ(xj) • The function ϕ maps the original d-dimensional features into a larger k-dimensional feature space by creating nonlinear combinations of the original features
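A sketch contrasting linear PCA and RBF-kernel PCA on linearly inseparable data (two concentric circles); the dataset, gamma value and number of components are arbitrary illustrative choices.

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA, PCA

X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

linear    = PCA(n_components=2).fit_transform(X)
nonlinear = KernelPCA(n_components=2, kernel="rbf", gamma=10).fit_transform(X)

# Class means on the first component: under linear PCA they typically stay close
# together, while after the RBF kernel mapping the two circles separate.
print("linear PCA :", linear[y == 0, 0].mean().round(3), linear[y == 1, 0].mean().round(3))
print("kernel PCA :", nonlinear[y == 0, 0].mean().round(3), nonlinear[y == 1, 0].mean().round(3))
```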
  • 109. Singular Value Decomposition - SVD • A mechanism to break a matrix into simpler, meaningful pieces • Used to detect groupings in data • A factorization of a real or complex matrix • A general rectangular M-by-N matrix A has an SVD into the product of an M-by-N orthogonal matrix U, an N-by-N diagonal matrix of singular values S and the transpose of an N-by-N orthogonal square matrix V – A = U S V^T
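A NumPy sketch of this factorization; with full_matrices=False the call returns the M-by-N U and N-by-N V described above, and the product U S V^T reconstructs A. The random matrix is only an example.

```python
import numpy as np

A = np.random.default_rng(0).normal(size=(6, 4))      # M = 6, N = 4
U, s, Vt = np.linalg.svd(A, full_matrices=False)      # s holds the singular values

S = np.diag(s)                                        # N-by-N diagonal matrix of singular values
print(U.shape, S.shape, Vt.shape)                     # (6, 4) (4, 4) (4, 4)
print(np.allclose(A, U @ S @ Vt))                     # True: A = U S V^T
```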
  • 110. SVD
  • 111. Hidden Markov Models - HMM • A statistical Markov model in which the system being modelled is assumed to be a Markov process with unobserved (hidden) states • Used in pattern recognition, such as handwriting and speech analysis • In simpler Markov models like a Markov chain, the state is directly visible to the observer, and therefore the state transition probabilities are the only parameters • In HMM, the state is not directly visible, but output, dependent on the state, is visible
  • 112. Hidden Markov Models - HMM • Each state has a probability distribution over the possible output tokens • Therefore the sequence of tokens generated by an HMM gives some information about the sequence of states • The adjective 'hidden' refers to the state sequence through which the model passes, not to the parameters of the model • The model is still referred to as a 'hidden' Markov model even if these parameters are known exactly
  • 113. HMM
  • 115. Model Evaluation • How accurate is the classifier? • When the classifier is wrong, how is it wrong? • Decide on which classifier (which parameters) to use and to estimate what the performance of the system will be
  • 116. Testing Set • Split the available data into a training set and a test set • Train the classifier on the training set and evaluate it on the test set
  • 117. Classifier Accuracy • The accuracy of a classifier on a given test set is the percentage of test set tuples that are correctly classified by the classifier • Often also referred to as the recognition rate • The error rate (or misclassification rate) is the complement of accuracy (1 - accuracy)
  • 118. False Positives vs. False Negatives • When is the model wrong? – False positives vs. false negatives – Related to type I and type II errors in statistics • Often there is a different cost associated with false positives and false negatives – e.g. diagnosing diseases
  • 119. Confusion Matrix • Mechanism to illustrate how a model is performing in terms of false positives and false negatives • Provides more information than a single accuracy figure • Allows thinking about the cost of mistakes • Extendable to any number of classes
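A minimal sketch of a two-class confusion matrix and the accuracy and error rate it summarizes; the label vectors below are invented for illustration.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

print(confusion_matrix(y_true, y_pred))    # rows: actual class, columns: predicted class
acc = accuracy_score(y_true, y_pred)
print("accuracy:", acc, "error rate:", 1 - acc)
```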
  • 122. Area Under ROC Curve - AUC • ROC curves can be used to compare models • The bigger the AUC, the more accurate the model • The ROC index is the area under the ROC curve
  • 123. Gini Statistic • Calculated from the ROC curve – Gini = 2 * AUC - 1 • where AUC is the area under the ROC curve
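A sketch of computing the AUC and the Gini statistic derived from it; the scores below simply stand in for a model's predicted probabilities.

```python
from sklearn.metrics import roc_auc_score

y_true  = [0, 0, 1, 1, 0, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.5, 0.9]   # predicted probabilities of class 1

auc = roc_auc_score(y_true, y_score)      # area under the ROC curve
print("AUC :", round(auc, 3))
print("Gini:", round(2 * auc - 1, 3))     # Gini = 2 * AUC - 1
```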
  • 124. K-Fold Cross Validation • Divide the entire data set into k folds • For each of the k experiments, use the kth fold for testing and everything else for training
  • 125. K-Fold Cross Validation • The accuracy of the system is calculated as the average accuracy (or error) across the k folds • The main advantages of k-fold cross validation are that every example is used in testing at some stage and the problem of an unfortunate split is avoided • Any value can be used for k – 10 is most common – Depends on the data set
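A sketch of 10-fold cross validation with scikit-learn, where every example is used for testing in exactly one fold and the fold scores are averaged; the dataset and classifier are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# cv=10: ten experiments, each holding out a different fold for testing
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)

print(scores.round(2))                       # accuracy of each of the 10 folds
print("mean accuracy:", scores.mean().round(3))
```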
  • 127. Thank You • Check out my LinkedIn profile at https://in.linkedin.com/in/girishkhanzode