3. Machine Learning and Pattern Classification
• Predictive modelling is building a model capable of making predictions
• Such a model includes a machine learning algorithm that learns certain properties from a
training dataset in order to make those predictions
• Predictive modelling types - Regression and pattern classification
• Regression models analyze relationships between variables and trends in order to make
predictions about continuous variables
– Prediction of the maximum temperature for the upcoming days in weather forecasting
• Pattern classification assigns discrete class labels to particular observations as outcomes of
a prediction
– Prediction of a sunny, rainy or snowy day
4. Machine Learning Methodologies
• Supervised learning
– Learning from labelled data
– Classification, Regression, Prediction, Function Approximation
• Unsupervised learning
– Learning from unlabelled data
– Clustering, Visualization, Dimensionality Reduction
5. Machine Learning Methodologies
• Semi-supervised learning
– mix of Supervised and Unsupervised learning
– usually small part of data is labelled
• Reinforcement learning
– Model learns from a series of actions by maximizing a reward function
– The reward function can be maximized by penalizing bad actions
and/or rewarding good actions
– Example - training of self-driving car using feedback from the environment
6. Applications
• Speech recognition
• Effective web search
• Recommendation systems
• Computer vision
• Information retrieval
• Spam filtering
• Computational finance
• Fraud detection
• Medical diagnosis
• Stock market analysis
• Structural health monitoring
9. Learning Process
• Supervised learning algorithms are used in classification and prediction
• Training set - each record contains a set of attributes, one of which
is the class
• The classification or prediction algorithm learns from the training data the
relationship between the predictor variables and the outcome variable
• This process results in
– Classification model
– Predictive model
12. Supervised Learning Model
• The class labels in the dataset used to build the classification model are
known
• Example - a dataset for spam filtering would contain spam messages as
well as "ham" (= not-spam) messages
• In a supervised learning problem, it is known which message in the
training set is spam or ham and this information is used to train our model
in order to classify new unseen messages
15. Linear Regression
• A standard and simple mathematical technique for predicting numeric outcome
• One of the oldest and most widely used predictive models
• Goal - minimize the sum of the squared errors to fit a straight line to a set of data points
• Fits a linear function to a set of data points
• Form of the function
– Y = β0 + β1*X1 + β2*X2 + … + βn*Xn
– Y is the target variable and X1, X2, ... Xn are the predictor variables
– β1, β2, … βn are the coefficients that multiply the predictor variables
– β0 is the constant (intercept) term
• Linear regression with multiple variables
– Scale the data, and implement the gradient descent and the cost function
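The last bullet can be made concrete with a minimal sketch, assuming standardization as the scaling step and half the mean squared error as the cost function. The toy data, the learning rate and the step count are made up for illustration and are not from the slides.

```python
def scale(column):
    """Standardize one feature column to zero mean and unit variance."""
    mean = sum(column) / len(column)
    std = (sum((v - mean) ** 2 for v in column) / len(column)) ** 0.5
    return [(v - mean) / std for v in column]

def predict(row, beta):
    """Y = b0 + b1*X1 + ... + bn*Xn for one record."""
    return beta[0] + sum(b * x for b, x in zip(beta[1:], row))

def cost(X, y, beta):
    """Half the mean of the squared errors."""
    return sum((predict(r, beta) - t) ** 2 for r, t in zip(X, y)) / (2 * len(y))

def gradient_descent(X, y, lr=0.1, steps=2000):
    """Batch gradient descent on the cost above (lr and steps are assumptions)."""
    n, p = len(y), len(X[0])
    beta = [0.0] * (p + 1)
    for _ in range(steps):
        errors = [predict(r, beta) - t for r, t in zip(X, y)]
        grad = [sum(errors) / n] + [
            sum(e * r[j] for e, r in zip(errors, X)) / n for j in range(p)
        ]
        beta = [b - lr * g for b, g in zip(beta, grad)]
    return beta

# Toy data generated from Y = 1 + 2*X1 + 3*X2 (hypothetical values).
raw_x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
raw_x2 = [2.0, 1.0, 4.0, 3.0, 5.0]
y = [1 + 2 * a + 3 * b for a, b in zip(raw_x1, raw_x2)]
X = [list(r) for r in zip(scale(raw_x1), scale(raw_x2))]
beta = gradient_descent(X, y)
```

Because the features are standardized, the fitted coefficients apply to the scaled inputs, and the intercept equals the mean of the target values.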
17. K Nearest Neighbors - KNN
• A simple algorithm that stores all available cases and classifies new cases based on a
similarity measure
• Extremely simple to implement
• Lazy Learning - function is only approximated locally and all computation is deferred until
classification
• Has a weighted version and can also be used for regression
• Usually works very well when a meaningful distance between examples can be defined (Euclidean, Manhattan)
• Slow when the training set is large (say 10^6 examples) and the distance calculation is
non-trivial
• Only a single hyper-parameter – K (usually optimized using cross-validation)
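A minimal sketch of the unweighted classifier described above, assuming Euclidean distance and a simple majority vote. The function name `knn_classify` and the toy training set are hypothetical.

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """train is a list of (feature_tuple, label); returns the majority
    label among the k stored cases closest to the query."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical two-class training set.
train = [((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.5), "A"),
         ((6.0, 6.0), "B"), ((6.5, 7.0), "B"), ((7.0, 6.5), "B")]

label_near_a = knn_classify(train, (1.2, 1.4))
label_near_b = knn_classify(train, (6.4, 6.6))
```

Nothing is learned ahead of time: all the work happens inside `knn_classify`, which is why prediction slows down as the stored training set grows.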
20. Decision Tree Learning
• Decision trees classify instances or examples by starting at the root of the
tree and moving through it until a leaf node is reached
• A method for approximating discrete-valued functions
• Decision tree is a classifier in the form of a tree structure
– Decision node - specifies a test on a single attribute
– Leaf node - indicates the value of the target attribute
– Branch - split of one attribute
– Path - a conjunction of tests that leads to the final decision
21. When to Consider Decision Trees
• Attribute-value description - object or case must be expressible in terms
of a fixed collection of properties or attributes
– hot, mild, cold
• Predefined classes (target values) - the target function has discrete
output values
– Boolean or multiclass
• Sufficient data - enough training cases should be provided to learn the model
• Possibly noisy training data
• Missing attribute values
22. Decision Tree Applications
• Credit risk analysis
• Manufacturing – chemical material evaluation
• Production – Process optimization
• Biomedical Engineering – identify features to use in implantable devices
• Astronomy – filter noise from Hubble telescope images
• Molecular biology – analyze amino acid sequences in Human Genome project
• Pharmacology – drug efficacy analysis
• Planning – scheduling of PCB assembly lines
• Medicine – analysis of syndromes
23. Strengths
• Trees are inexpensive to construct
• Extremely fast at classifying unknown records
• Easy to interpret for small-sized trees
• Accuracy is comparable to other classification techniques for many simple
data sets
• Generates understandable rules
• Handles continuous and categorical variables
• Provides a clear indication of which fields are most important for
prediction or classification
24. Weaknesses
• Not suitable for predicting continuous attributes
• Performs poorly with many classes and little data
• Computationally expensive to train
– At each node each candidate splitting field must be sorted before its best split can
be found
– In some algorithms combinations of fields are used and a search must be made for
optimal combining weights
– Pruning algorithms can also be expensive since many candidate sub-trees must be
formed and compared
• Not suitable for non-rectangular regions
25. Tree Representation
• Each node in the tree specifies a test for some
attribute of the instance
• Each branch corresponds to an attribute value
• Each leaf node assigns a classification
28. Problems of Random split
• The tree can grow huge
• These trees are hard to understand
• Larger trees are typically less accurate than smaller trees
• So most tree construction methods use a greedy strategy
– at each step, find the feature that best divides positive examples from negative
examples, as measured by information gain
29. Optimized Tree Induction
• Greedy strategy - Split the records based on an attribute test
that optimizes a certain criterion
• Issues
– Determine root node
– Determine how to split the records
• How to specify the attribute test condition?
• How to determine the best split?
– Determine when to stop splitting
30. Optimized Tree Induction
• Selection of an attribute at each node
– Choose the most useful attribute for classifying training examples
• Information gain
– Measures how well a given attribute separates the training examples
according to their target classification
– This measure is used to select among the candidate attributes at each
step while growing the tree
31. Entropy
• A measure of the impurity (lack of homogeneity) of a set of examples
• Given a set S of positive and negative examples of some target
concept (a 2-class problem), the entropy of set S relative to this
binary classification is
– E(S) = - p(P) log2 p(P) - p(N) log2 p(N), where p(P) and p(N) are the proportions of positive and negative examples in S
• Example
– Suppose S has 25 examples, 15 positive and 10 negatives [15+, 10-]
– Then the entropy of S relative to this classification is
• E(S) = -(15/25) log2(15/25) - (10/25) log2(10/25) ≈ 0.971
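The [15+, 10-] example can be checked with a few lines of code. This is a small sketch; the function name `entropy` and the choice to pass class counts rather than probabilities are assumptions made for illustration.

```python
import math

def entropy(counts):
    """Entropy in bits of a set described by its class counts.
    Classes with a zero count contribute nothing (0 * log2(0) is taken as 0)."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

e_example = entropy([15, 10])   # the [15+, 10-] set from the slide
e_even = entropy([10, 10])      # evenly mixed set: maximum impurity
e_pure = entropy([0, 4])        # single-class set: no impurity
```

The evenly mixed set gives the maximum of 1 bit for two classes, and the pure set gives 0.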
33. Information Gain
• Information gain measures the expected reduction in entropy or uncertainty
$$ \mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|} \, \mathrm{Entropy}(S_v) $$
• Values(A) is the set of all possible values for attribute A and Sv is the subset of S for which attribute A has
value v, Sv = {s in S | A(s) = v}
• First term in the equation is the entropy of the original collection S
• Second term is the expected value of the entropy after S is partitioned using attribute A
• It is the expected reduction in entropy caused by partitioning the examples according to this attribute
• It is the number of bits saved when encoding the target value of an arbitrary member of S by knowing the
value of attribute A
34. A simple example
• Guess the outcome of next week's game between the MallRats and the
Chinooks
• Available knowledge / Attribute
– was the game at Home or Away
– was the starting time 5pm, 7pm or 9pm
– Did Joe play center, or forward
– whether that opponent's center was tall or not
– …..
36. Problem Data
• Predict the outcome given that the game will be away at 9pm and that Joe will play center on offense…
• A classification problem
• Generalizing the learned rule to new examples
37. Examples
• Before partitioning, the entropy is (all logarithms are base 2)
– H(10/20, 10/20) = - 10/20 log(10/20) - 10/20 log(10/20) = 1
• Using the where attribute, divide into 2 subsets
– Entropy of the first set H(home) = - 6/12 log(6/12) - 6/12 log(6/12) = 1
– Entropy of the second set H(away) = - 4/8 log(4/8) - 4/8 log(4/8) = 1
• Expected entropy after partitioning
– 12/20 * H(home) + 8/20 * H(away) = 1
38. Examples
• Using the when attribute, divide into 3 subsets
– Entropy of the first set H(5pm) = - 1/4 log(1/4) - 3/4 log(3/4) ≈ 0.81
– Entropy of the second set H(7pm) = - 9/12 log(9/12) - 3/12 log(3/12) ≈ 0.81
– Entropy of the third set H(9pm) = - 0/4 log(0/4) - 4/4 log(4/4) = 0 (taking 0 log 0 = 0)
• Expected entropy after partitioning
– 4/20 * H(1/4, 3/4) + 12/20 * H(9/12, 3/12) + 4/20 * H(0/4, 4/4) = 0.65
• Information gain 1-0.65 = 0.35
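The arithmetic for the where and when attributes can be reproduced from the win/loss counts on these slides. A small sketch follows; the helper names are hypothetical, and each partition is written as a pair of class counts.

```python
import math

def entropy(counts):
    """Entropy in bits from class counts (zero counts are skipped)."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def expected_entropy(partitions):
    """Size-weighted average entropy of the subsets after a split."""
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * entropy(p) for p in partitions)

def information_gain(parent, partitions):
    """Entropy before the split minus expected entropy after it."""
    return entropy(parent) - expected_entropy(partitions)

parent = [10, 10]                  # 20 games before partitioning
where = [[6, 6], [4, 4]]           # home, away
when = [[1, 3], [9, 3], [0, 4]]    # 5pm, 7pm, 9pm

gain_where = information_gain(parent, where)
gain_when = information_gain(parent, when)
```

Here `gain_where` comes out as 0 and `gain_when` as roughly 0.35, matching the figures on the slides.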
39. Decision
• Knowing the when attribute values provides larger information gain than where
• Therefore the when attribute should be chosen for testing prior to the where
attribute
• Similarly we can compute the information gain for other attributes
• At each node choose the attribute with the largest information gain
• Stopping rule
– Every attribute has already been included along this path through the tree or
– The training examples associated with this leaf node all have the same target attribute
value - entropy is zero
40. Continuous Attribute
• Each non-leaf node is a test
• Its edge partitions the attribute into subsets (easy for discrete attribute)
• For continuous attribute
– Partition the continuous value of attribute A into a discrete set of intervals
– Create a new Boolean attribute Ac by choosing a threshold c
$$ A_c = \begin{cases} \text{true} & \text{if } A < c \\ \text{false} & \text{otherwise} \end{cases} $$
– How to choose c?
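One common way to choose c, sketched here as an assumption rather than as the method the slides prescribe, is to try the midpoints between adjacent sorted values whose class labels differ and keep the one with the highest information gain. The sketch assumes Ac is true when A < c; the temperature data is a made-up toy set.

```python
import math

def entropy(labels):
    """Entropy in bits of a list of class labels."""
    n = len(labels)
    return -sum((labels.count(v) / n) * math.log2(labels.count(v) / n)
                for v in set(labels))

def best_threshold(values, labels):
    """Return (c, gain) for the candidate threshold with the largest gain."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best_c, best_gain = None, -1.0
    for (v1, l1), (v2, l2) in zip(pairs, pairs[1:]):
        if l1 == l2 or v1 == v2:
            continue                      # only class boundaries are candidates
        c = (v1 + v2) / 2
        below = [l for v, l in pairs if v < c]     # A_c is true
        above = [l for v, l in pairs if v >= c]    # A_c is false
        remainder = (len(below) * entropy(below)
                     + len(above) * entropy(above)) / len(pairs)
        if base - remainder > best_gain:
            best_c, best_gain = c, base - remainder
    return best_c, best_gain

temperature = [40, 48, 60, 72, 80, 90]
play = ["no", "no", "yes", "yes", "yes", "no"]
c, gain = best_threshold(temperature, play)
```

For this data the candidates are 54 and 85, and 54 gives the larger gain.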
41. Evaluation
• Training accuracy
– How many training instances can be correctly classified based on the available data?
– It is high when the tree is deep/large or when there are few conflicts among the training
instances
– Higher value does not mean good generalization
• Testing accuracy
– Given a number of new instances how many of them can be correctly classified?
– Cross validation
43. Random Forest
• An ensemble classifier that consists of many decision trees
• Outputs the class that is the mode of the classes output by
individual trees
• The method combines Breiman's bagging idea and the
random selection of features
• Used for classification and regression
45. Algorithm
• Let the number of training cases be N and number of variables in the classifier M
• The number m of input variables to be used to determine the decision at a node of the tree -
m should be much less than M
• Choose a training set for this tree by sampling n times with replacement from all N available
training cases (a bootstrap sample, typically with n = N)
• Use the remaining (out-of-bag) cases to estimate the error of the tree by predicting their classes
• For each node of the tree, randomly choose m variables on which to base the decision at
that node
• Calculate the best split based on these m variables in the training set
• Each tree is fully grown and not pruned
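The two sources of randomness in the steps above can be sketched as follows. This is an illustration only: the function names are hypothetical, the sample size is taken equal to N, and m = 4 (the square root of M = 16) is a common default for classification that the slides do not state.

```python
import random

def bootstrap_sample(n_cases, rng):
    """Draw n_cases indices with replacement.
    Returns the in-bag indices and the left-out (out-of-bag) indices."""
    in_bag = [rng.randrange(n_cases) for _ in range(n_cases)]
    out_of_bag = sorted(set(range(n_cases)) - set(in_bag))
    return in_bag, out_of_bag

def random_feature_subset(n_features, m, rng):
    """Pick m distinct candidate variables for one node split."""
    return rng.sample(range(n_features), m)

rng = random.Random(0)      # fixed seed so the sketch is repeatable
N, M, m = 100, 16, 4

in_bag, oob = bootstrap_sample(N, rng)           # training set for one tree
candidates = random_feature_subset(M, m, rng)    # variables tried at one node
```

Each tree would be trained on its own `in_bag` cases and scored on its `oob` cases, with a fresh `candidates` draw at every node.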
46. Gini Index
• Random forest uses Gini index taken from CART learning system to
construct decision trees
• The Gini Index of node impurity is the measure most commonly chosen
for classification type problems
• How many trees to build? - Add trees until the error no longer decreases
• How to select m? - Try the recommended default, half of it and twice
it, and pick the best
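The slide names the Gini index without giving its formula. A minimal sketch using the standard definition, 1 minus the sum of squared class proportions, is shown below. The helper names are hypothetical, and the counts reuse the 5pm/7pm/9pm split from the earlier game example.

```python
def gini(counts):
    """Gini impurity of a node described by its class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def split_gini(partitions):
    """Size-weighted Gini impurity of the child nodes of a split."""
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * gini(p) for p in partitions)

g_mixed = gini([10, 10])    # evenly mixed two-class node
g_pure = gini([0, 4])       # pure node
g_split = split_gini([[1, 3], [9, 3], [0, 4]])
```

As with entropy, lower values mean purer nodes, so a split is preferred when it lowers the weighted impurity of the children.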
48. Working of Random Forest
• For prediction, a new sample is pushed down the tree
• It is assigned the label of the training sample in the terminal node it ends
up in
• This procedure is iterated over all trees in the ensemble
• The majority vote of all trees (or the average, for regression) is reported as the random forest prediction
49. Random Forest - Advantages
• One of the most accurate learning algorithms
• Produces a highly accurate classifier
• Runs efficiently on large databases
• Handles thousands of input variables without variable deletion
• Gives estimates of what variables are important in classification
• Generates an internal unbiased estimate of the generalization error as
the forest building progresses
• Effective method for estimating missing data and maintains accuracy
when a large proportion of the data are missing
50. Random Forest - Advantages
• Provides methods for balancing error in data sets with unbalanced class populations
• Prototypes are computed that give information about the relation
between the variables and the classification
• Computes proximities between pairs of cases that can be used in
clustering, locating outliers or by scaling gives interesting views of data
• Above capabilities can be extended to unlabeled data, leading to
unsupervised clustering, data views and outlier detection
• Offers an experimental method for detecting variable interactions
51. Random Forest - Disadvantages
• Random forests have been observed to overfit for some datasets with
noisy classification/regression tasks
• For data including categorical variables with different number of levels,
random forests are biased in favor of those attributes with more levels
• Therefore the variable importance scores from random forest are not
reliable for this type of data
52. Logistic Regression
• Models the relationship between a dependent and one or more independent variables
• Allows looking at the fit of the model as well as at the significance of the relationships
(between dependent and independent variables) being modelled
• Estimates the probability of an event occurring, e.g. the probability of a pupil continuing in
education post 16
• Predicts, from knowledge of the relevant independent variables, the probability (p) that the dependent variable is 1
(event occurring) rather than 0
• While in linear regression the relationship between the dependent and the independent
variables is linear, this assumption is not made in logistic regression
53. Logistic Regression
• Logistic regression function $$ P = \frac{e^{\alpha+\beta x}}{1+e^{\alpha+\beta x}} $$
• P is the probability of a 1 and e is the base of the natural logarithm (about 2.718)
• $$\alpha$$ and $$\beta$$ are the parameters of the model
• The value of $$\alpha$$ yields P when x is zero and $$\beta$$ indicates how the
probability of a 1 changes when x changes by a single unit
• Because the relation between x and P is nonlinear, $$\beta$$ does not have as
straightforward an interpretation in this model as it does in ordinary linear
regression
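The roles of the two parameters can be seen numerically. In the sketch below the values of alpha and beta are hypothetical. It shows that P at x = 0 depends only on alpha, and that a one-unit change in x multiplies the odds P / (1 - P) by e^beta, which is the usual way to interpret beta.

```python
import math

def logistic(x, alpha, beta):
    """P = e^(alpha + beta*x) / (1 + e^(alpha + beta*x))."""
    z = alpha + beta * x
    return math.exp(z) / (1.0 + math.exp(z))

def odds(p):
    """Odds of the event: probability of a 1 over probability of a 0."""
    return p / (1.0 - p)

alpha, beta = -1.0, 0.5     # hypothetical parameter values

p_at_zero = logistic(0.0, alpha, beta)
odds_ratio = odds(logistic(3.0, alpha, beta)) / odds(logistic(2.0, alpha, beta))
```

The change in P itself for a one-unit step depends on where x starts, while the odds ratio stays at e^beta everywhere.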
55. Support Vector Machine - SVM
• A supervised learning model with associated learning algorithms that analyze
data and recognize patterns
• Given a set of training examples, each marked as belonging to one of two
categories, an SVM training algorithm builds a model that assigns new examples to
one category or the other, making it a non-probabilistic binary linear classifier
• An SVM model is a representation of the examples as points in space, mapped so
that the examples of the separate categories are divided by a clear gap that is as
wide as possible
• New examples are then mapped into that same space and predicted to belong to
a category based on which side of the gap they fall on
57. Naive Bayes Classifier
• A family of simple probabilistic classifiers based on applying Bayes'
theorem with strong (naive) independence assumptions between the
features
• A popular method for text categorization, the problem of judging
documents as belonging to one category or the other such as spam or
legitimate, sports or politics etc with word frequencies as the features
• Highly scalable, requires a number of parameters linear in the number of
variables (features/predictors) in a learning problem
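A minimal sketch of the spam/ham use case with word frequencies as features. It assumes a multinomial model with add-one (Laplace) smoothing, which the slides do not specify, and the four training messages are made up.

```python
import math
from collections import Counter

def train(docs):
    """docs is a list of (words, label). Returns per-class document counts,
    per-class word counts and the vocabulary."""
    class_counts = Counter(label for _, label in docs)
    word_counts = {label: Counter() for label in class_counts}
    for words, label in docs:
        word_counts[label].update(words)
    vocab = {w for words, _ in docs for w in words}
    return class_counts, word_counts, vocab

def classify(words, model):
    """Pick the class with the highest log prior plus summed log likelihoods."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / total_docs)
        for w in words:
            if w in vocab:   # words never seen in training are ignored
                score += math.log((word_counts[label][w] + 1)
                                  / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

docs = [("win cash prize now".split(), "spam"),
        ("cheap prize offer".split(), "spam"),
        ("meeting agenda for monday".split(), "ham"),
        ("lunch on monday".split(), "ham")]
model = train(docs)
```

The independence assumption shows up in `classify`: each word adds its own log probability regardless of the other words in the message.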
61. Clustering
• A technique for finding groups of similar instances, called clusters, in data
• Groups data instances that are similar to (near) each other in one cluster and data
instances that are very different (far away) from each other into different clusters
• Called an unsupervised learning task - no class values denoting an a priori
grouping of the data instances are given, as they are in supervised learning
• One of the most utilized data mining techniques
• Has a long history and is used in almost every field, such as medicine, psychology, botany,
sociology, biology, archeology, marketing, insurance, libraries and text clustering
63. Applications
• Group people of similar sizes together to make small, medium and large T-shirts
– Tailor-made for each person - too expensive
– One-size-fits-all - does not fit all
• In marketing, segment customers according to their similarities
– Targeted marketing
• Given a collection of text documents, organize them according to their content
similarities
– To produce a topic hierarchy
64. Aspects of clustering
• Clustering algorithms
– Partitional clustering
– Hierarchical clustering
• A distance function - similarity or dissimilarity
• Clustering quality
– Inter-clusters distance maximized
– Intra-clusters distance minimized
• Quality of a clustering process depends on algorithm, distance function and
application
65. K-means Clustering
• A partitional clustering algorithm
• Classify a given data set through a certain number of k clusters (k is fixed)
• Let the set of data points D be {x1, x2, …, xn}
– xi = (xi1, xi2, …, xir) is a vector in a real-valued space X ⊆ R^r
– r = number of attributes (dimensions) in the data
• Algorithm partitions given data into k clusters
– Each cluster has a cluster center (centroid)
– K is user defined
67. K-Means Algorithm
1. Choose k
2. Randomly choose k data points (seeds) as initial centroids
3. Assign each data point to the closest centroid
4. Re-compute the centroids using the current cluster
memberships
5. If a convergence criterion is not met, go to 3
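The five steps can be sketched in plain Python. Assumptions in this sketch: Euclidean distance, "no re-assignments" as the convergence criterion, a cap on iterations, and a made-up data set with two well-separated groups.

```python
import math
import random

def kmeans(points, k, rng, max_iter=100):
    """Returns (centroids, assignment), where assignment[i] is the cluster
    index of points[i]."""
    centroids = rng.sample(points, k)            # step 2: random seeds
    assignment = None
    for _ in range(max_iter):
        # Step 3: assign each point to the closest centroid.
        new_assignment = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_assignment == assignment:         # step 5: nothing moved, stop
            break
        assignment = new_assignment
        # Step 4: recompute each centroid as the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:                          # keep old centroid if empty
                centroids[j] = tuple(sum(c) / len(members)
                                     for c in zip(*members))
    return centroids, assignment

points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
          (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
centroids, assignment = kmeans(points, k=2, rng=random.Random(0))
```

Step 1, choosing k, is left to the caller, as in the slides.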
68. K-Means Algorithm
1. k initial means (in this case k=3) are randomly generated within the data domain
2. k clusters are created by associating every observation with the nearest mean
3. The centroid of each of the k clusters becomes the new mean
4. Steps 2 and 3 are repeated until convergence has been reached
69. Stopping / Convergence Criterion
1. No (or minimum) re-assignments of data points to different clusters
2. No (or minimum) change of centroids or minimum decrease in the sum
of squared error (SSE)
– Cj is the jth cluster, mj is the centroid of cluster Cj (the mean vector of all the data
points in Cj), and dist(x, mj) is the distance between data point x and centroid mj
$$ SSE = \sum_{j=1}^{k} \sum_{\mathbf{x} \in C_j} dist(\mathbf{x}, \mathbf{m}_j)^2 $$
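A short sketch of the SSE computation, assuming Euclidean distance and taking each centroid as the mean vector of its cluster. The two example clusterings of the same four points are made up to show that a tighter grouping gives a lower SSE.

```python
import math

def sse(clusters):
    """Sum over clusters of squared distances from each point to the
    cluster centroid (the mean vector of the cluster's points)."""
    total = 0.0
    for members in clusters:
        centroid = tuple(sum(c) / len(members) for c in zip(*members))
        total += sum(math.dist(x, centroid) ** 2 for x in members)
    return total

# The same four points grouped two different ways.
tight = [[(0, 0), (0, 2)], [(10, 0), (10, 2)]]
loose = [[(0, 0), (10, 0)], [(0, 2), (10, 2)]]

sse_tight = sse(tight)
sse_loose = sse(loose)
```

In the tight grouping every point is 1 unit from its centroid, and in the loose grouping every point is 5 units away.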
72. K Means - Strengths
• Simple to understand and implement
• Efficient: time complexity O(tkn) where
– n is number of data points
– k is number of clusters
– t is number of iterations
• Since both k and t are usually small relative to n, k-means is considered a linear algorithm
• Most popular clustering algorithm
• Terminates at a local optimum if SSE is used
• The global optimum is hard to find due to complexity
73. K Means - Weaknesses
• Only applicable if mean is defined
– For categorical data, use k-modes, where the centroid is represented by the most frequent
values
• User must specify k
• Sensitive to outliers
– Outliers are data points that are very far away from other data points
– Outliers could be errors in the data recording or some special data points with
very different values
75. Handling Outliers
• Remove data points in the clustering process that are much further away
from the centroids than other data points
• Perform random sampling
– Since in sampling we only choose a small subset of the data points, the
chance of selecting an outlier is very small
– Assign the rest of the data points to the clusters by distance or similarity
comparison or classification
79. K-Means
• Still the most popular algorithm - simplicity, efficiency
• Other clustering algorithms have their own weaknesses
• No clear evidence that any other clustering algorithm performs better in
general
– although other algorithms could be more suitable for some specific types of
data or applications
• Comparing different clustering algorithms is a difficult task
• No one knows the correct clusters
80. Clusters Representation
• Use the centroid of each cluster to represent the cluster
• Compute the radius and standard deviation of the cluster to determine its
spread in each dimension
• Centroid representation alone works well if the clusters are of
hyper-spherical shape
• If clusters are elongated or are of other shapes, centroids are not
sufficient
81. Cluster Classification
• All the points in a cluster have the same class label - the
cluster ID
• Run a supervised learning algorithm on the data to find a
classification model
84. Single Link Method
• The distance between two clusters is the distance between two closest
data points in the two clusters, one data point from each cluster
• It can find arbitrarily shaped clusters, but
– It may cause the undesirable chain effect by noisy points
85. Complete Link Method
• Distance between two clusters is the distance between the two furthest data points
in the two clusters, one data point from each cluster
• Sensitive to outliers because they are far away
86. Average Link Method
• Distance between two clusters is the average distance of all pair-wise
distances between the data points in two clusters
• A compromise between
– the sensitivity of complete-link clustering to outliers
– the tendency of single-link clustering to form long chains that do not
correspond to the intuitive notion of clusters as compact, spherical objects
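The three linkage definitions differ only in how they reduce the set of pairwise distances. A small sketch, assuming Euclidean distance and two made-up clusters on a line:

```python
import math
from itertools import product

def pairwise(c1, c2):
    """All distances between a point of c1 and a point of c2."""
    return [math.dist(a, b) for a, b in product(c1, c2)]

def single_link(c1, c2):
    """Distance between the two closest points, one from each cluster."""
    return min(pairwise(c1, c2))

def complete_link(c1, c2):
    """Distance between the two furthest points, one from each cluster."""
    return max(pairwise(c1, c2))

def average_link(c1, c2):
    """Average of all pairwise distances between the two clusters."""
    d = pairwise(c1, c2)
    return sum(d) / len(d)

c1 = [(0.0, 0.0), (1.0, 0.0)]
c2 = [(3.0, 0.0), (5.0, 0.0)]
```

For these clusters the pairwise distances are 3, 5, 2 and 4, so the single, complete and average link distances are 2, 5 and 3.5.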
88. Algorithmic Complexity
• All the algorithms are at least O(n^2)
– n is the number of data points
• Single link can be done in O(n^2)
• Complete and average links can be done in O(n^2 log n)
• Due to the complexity, hard to use for large data sets
– Sampling
– Scale-up methods (BIRCH)
89. Distance Functions
• Key to clustering
• similarity and dissimilarity are also commonly used terms
• Numerous distance functions
– Different types of data
• Numeric data
• Nominal data
– Different specific applications
90. Distance Functions - Numeric Attributes
• Euclidean distance
• Manhattan (city block) distance
• Denote distance with dist(xi, xj) where xi and xj are data points (vectors)
• They are special cases of the Minkowski distance, where h is a positive integer
$$ dist(\mathbf{x}_i, \mathbf{x}_j) = \left( |x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \dots + |x_{ir} - x_{jr}|^h \right)^{1/h} $$
91. Distance Formulae
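• If h = 2, it is the Euclidean distance
$$ dist(\mathbf{x}_i, \mathbf{x}_j) = \sqrt{(x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \dots + (x_{ir} - x_{jr})^2} $$
• If h = 1, it is the Manhattan distance
$$ dist(\mathbf{x}_i, \mathbf{x}_j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \dots + |x_{ir} - x_{jr}| $$
• Weighted Euclidean distance
$$ dist(\mathbf{x}_i, \mathbf{x}_j) = \sqrt{w_1 (x_{i1} - x_{j1})^2 + w_2 (x_{i2} - x_{j2})^2 + \dots + w_r (x_{ir} - x_{jr})^2} $$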
92. Distance Formulae
• Squared Euclidean distance - to place progressively greater weight on
data points that are further apart
$$ dist(\mathbf{x}_i, \mathbf{x}_j) = (x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \dots + (x_{ir} - x_{jr})^2 $$
• Chebychev distance - one wants to define two data points as different if
they are different on any one of the attributes
$$ dist(\mathbf{x}_i, \mathbf{x}_j) = \max(|x_{i1} - x_{j1}|, |x_{i2} - x_{j2}|, \dots, |x_{ir} - x_{jr}|) $$
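The distance functions on these slides translate directly into code. This is a sketch: the function names and the two sample vectors are chosen for illustration, and absolute differences are used in the Minkowski form so that it is also valid for odd h.

```python
def minkowski(x, y, h):
    """Minkowski distance; h=1 gives Manhattan, h=2 gives Euclidean."""
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1.0 / h)

def weighted_euclidean(x, y, w):
    """Euclidean distance with one weight per attribute."""
    return sum(wi * (a - b) ** 2 for wi, a, b in zip(w, x, y)) ** 0.5

def squared_euclidean(x, y):
    """Euclidean distance without the square root."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def chebyshev(x, y):
    """Largest difference on any single attribute."""
    return max(abs(a - b) for a, b in zip(x, y))

xi = (1.0, 2.0, 3.0)
xj = (4.0, 6.0, 3.0)

manhattan = minkowski(xi, xj, 1)
euclidean = minkowski(xi, xj, 2)
```

The attribute differences here are 3, 4 and 0, so the Manhattan distance is 7, the Euclidean distance 5, the squared Euclidean distance 25 and the Chebychev distance 4.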
93. Curse of Dimensionality
• Various problems that arise when analyzing and organizing data in high
dimensional spaces do not occur in low dimensional spaces such as 2D or 3D
• In the context of classification/function approximation, performance of
classification algorithm can improve by removing irrelevant features
94. Dimensionality Reduction - Applications
• Information Retrieval – web documents where dimensionality is
vocabulary of words
• Recommender systems – large scale of ratings matrix
• Social networks – social graph with large number of users
• Biology – gene expressions
• Image processing – facial recognition
95. Dimensionality Reduction
• Defying the curse of dimensionality - simpler models result in improved generalization
• Classification algorithm may not scale up to the size of the full feature set either in space or
time
• Improves understanding of domain
• Cheaper to collect and store data based on reduced feature set
• Two Techniques
– Feature Construction
– Feature Selection
97. Techniques
• Linear Discriminant Analysis – LDA
– Tries to identify attributes that account for the most variance between classes
– LDA compared to PCA is a supervised method using known labels
• Principal component analysis – PCA
– Identifies combinations of linearly correlated attributes (principal components or directions in
feature space) that account for the most variance of the data
– Plot the different samples on the first 2 principal components
• Singular Value Decomposition – SVD
– Factorization of real or complex matrix
– Closely related to PCA, which can be computed from the SVD of the mean-centered data matrix
98. Feature Construction
• Linear methods
– Principal component analysis (PCA)
– Independent component analysis (ICA)
– Fisher Linear Discriminant (LDA)
– ….
• Non-linear methods
– Kernel PCA
– Non linear component analysis (NLCA)
– Local linear embedding (LLE)
– ….
99. Principal component analysis - PCA
• A tool in exploratory data analysis and to create predictive models
• Involves calculating the eigenvalue decomposition of a data covariance matrix,
usually after mean centering the data for each attribute
• Mathematically defined as an orthogonal linear transformation to map data to a
new coordinate system such that the greatest variance by any projection of the
data comes to lie on the first coordinate (called the first principal component), the
second greatest variance on the second coordinate, and so on
• Theoretically the optimal linear scheme, in terms of least mean square error, for
compressing a set of high dimensional vectors into a set of lower dimensional
vectors and then reconstructing the original set
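A minimal sketch of the first step of PCA. It mean-centers the data, builds the covariance matrix, and finds only the first principal component by power iteration, a simple substitute for a full eigenvalue decomposition that is used here to keep the example dependency-free. The data points are made up and lie roughly along a diagonal line.

```python
def covariance_matrix(data):
    """Mean-center each attribute, then return the sample covariance matrix."""
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in data]
    return [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
            for a in range(d)]

def first_principal_component(cov, iterations=200):
    """Power iteration: repeatedly multiply by cov and normalize.
    Converges to the unit eigenvector with the largest eigenvalue."""
    d = len(cov)
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    # Variance of the data along v (the matching eigenvalue).
    eigenvalue = sum(v[a] * sum(cov[a][b] * v[b] for b in range(d))
                     for a in range(d))
    return v, eigenvalue

data = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.0)]
cov = covariance_matrix(data)
pc1, variance_along_pc1 = first_principal_component(cov)
explained = variance_along_pc1 / (cov[0][0] + cov[1][1])
```

For this data the first component points along the diagonal and carries almost all of the total variance, so projecting onto it loses very little.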
102. Fisher Linear Discriminant
• A classification method that projects high-dimensional data
onto a line and performs classification in this one-
dimensional space
• The projection maximizes the distance between the means of
the two classes while minimizing the variance within each
class
104. Linear Discriminant Analysis - LDA
• A generalization of Fisher's linear
discriminant
• A method used to find a linear
combination of features that
characterizes or separates two or more
classes of objects or events
• The resulting combination may be used
as a linear classifier or more commonly
for dimensionality reduction before later
classification
106. Kernel PCA
• Classic PCA approach is a linear projection technique that works well if
the data is linearly separable
• In the case of linearly inseparable data, a nonlinear technique is required
if the task is to reduce the dimensionality of a dataset
107. Kernel PCA
• The basic idea to deal with linearly inseparable data is to project it onto a higher
dimensional space where it becomes linearly separable
• Consider a nonlinear mapping function ϕ so that the mapping of a sample x can be written
as x→ϕ(x)
• The term kernel describes a function that calculates the dot product of the images of the
samples under ϕ
• κ(xi, xj) = ϕ(xi)^T ϕ(xj)
• Function ϕ maps the original d-dimensional features into a larger k-dimensional feature
space by creating nonlinear combinations of the original features
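The kernel identity can be checked on a concrete case. The sketch below uses the degree-2 polynomial kernel on 2-dimensional inputs, chosen here only as an example, whose explicit feature map ϕ has 3 dimensions (d = 2, k = 3). The kernel value computed in the original space equals the dot product of the mapped samples.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(p * q for p, q in zip(a, b))

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel in 2-D:
    nonlinear combinations of the original features."""
    x1, x2 = x
    return (x1 * x1, 2 ** 0.5 * x1 * x2, x2 * x2)

def poly_kernel(x, y):
    """Kernel evaluated directly in the original 2-D space."""
    return dot(x, y) ** 2

xi, xj = (1.0, 2.0), (3.0, 4.0)
via_kernel = poly_kernel(xi, xj)
via_mapping = dot(phi(xi), phi(xj))
```

Both routes give 121 (up to rounding), and the kernel route never has to build the higher-dimensional vectors.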
109. Singular Value Decomposition - SVD
• A mechanism to break a matrix into simpler meaningful pieces
• Used to detect groupings in data
• A factorization of a real or complex matrix
• A general rectangular M-by-N matrix A (with M ≥ N) has an SVD into the product of an
M-by-N matrix U with orthonormal columns, an N-by-N diagonal matrix of singular
values S and the transpose of an N-by-N orthogonal matrix V
– A = U S V^T
111. Hidden Markov Models - HMM
• A statistical Markov model in which the system being modelled is
assumed to be a Markov process with unobserved (hidden) states
• Used in pattern recognition, such as handwriting and speech analysis
• In simpler Markov models like a Markov chain, the state is directly visible
to the observer, and therefore the state transition probabilities are the
only parameters
• In HMM, the state is not directly visible, but output, dependent on the
state, is visible
112. Hidden Markov Models - HMM
• Each state has a probability distribution over the possible output tokens
• Therefore the sequence of tokens generated by an HMM gives some
information about the sequence of states
• Adjective hidden refers to the state sequence through which the model
passes, not to the parameters of the model
• The model is still referred to as a 'hidden' Markov model even if these
parameters are known exactly
115. Model Evaluation
• How accurate is the classifier?
• When the classifier is wrong, how is it wrong?
• Decide on which classifier (which parameters) to use
and to estimate what the performance of the system
will be
116. Testing Set
• Split the available data into a training set and a test set
• Train the classifier in the training set and evaluate based on
the test set
117. Classifier Accuracy
• The accuracy of a classifier on a given test set is the
percentage of test set tuples that are correctly classified by
the classifier
• Often also referred to as recognition rate
• Error rate (or misclassification rate) is the complement of
accuracy (1 - accuracy)
118. False Positive vs. Negative
• When is the model wrong?
– False positives vs. false negatives
– Related to type I and type II errors in statistics
• Often there is a different cost associated with false
positives and false negatives
– Diagnosing diseases
119. Confusion Matrix
• Mechanism to illustrate how a model is performing in terms
of false positives and false negatives
• Provides more information than a single accuracy figure
• Allows thinking about the cost of mistakes
• Extendable to any number of classes
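A small sketch of a confusion matrix stored as a counter keyed by (actual, predicted) pairs, which extends to any number of classes without change. The labels and predictions are made up.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count how often each (actual, predicted) pair occurs."""
    return Counter(zip(actual, predicted))

actual    = ["pos", "pos", "pos", "pos", "neg", "neg", "neg", "neg", "neg", "neg"]
predicted = ["pos", "pos", "pos", "neg", "neg", "neg", "neg", "neg", "pos", "pos"]

cm = confusion_matrix(actual, predicted)
tp = cm[("pos", "pos")]   # true positives
fn = cm[("pos", "neg")]   # false negatives
fp = cm[("neg", "pos")]   # false positives
tn = cm[("neg", "neg")]   # true negatives

accuracy = (tp + tn) / len(actual)
error_rate = 1 - accuracy
```

The accuracy of 0.7 hides the fact that the two kinds of mistake are not equally frequent here (2 false positives, 1 false negative), which is the extra information the matrix provides.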
122. Area Under ROC Curve - AUC
• ROC curves can be used to
compare models
• The larger the AUC, the better
the model separates the classes
• ROC index is the area under
the ROC curve
124. K-Fold Cross Validation
• Divide the entire data set into k folds
• In the i-th of the k experiments, use the i-th fold for testing and
everything else for training
125. K-Fold Cross Validation
• The accuracy of the system is calculated as the average accuracy across the k
folds
• The main advantages of k-fold cross validation are that every example is
used in testing at some stage and the problem of an unfortunate split is
avoided
• Any value can be used for k
– 10 is most common
– Depends on the data set
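The fold bookkeeping can be sketched in a few lines. The generator name `k_fold_indices` is hypothetical, and the folds are formed by striding through the indices. In practice the data would usually be shuffled first, a step omitted here to keep the output deterministic.

```python
def k_fold_indices(n, k):
    """Yield (train_indices, test_indices) for each of the k experiments."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]

splits = list(k_fold_indices(n=10, k=5))
tested = sorted(idx for _, test in splits for idx in test)
```

Every example lands in exactly one test fold, so `tested` covers all 10 indices once. The per-fold accuracies would then be averaged to give the final estimate.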