A Deep Dive into Classification with Naive Bayes. Along the way we take a look at some basics from Ian Witten's Data Mining book and dig into the algorithm.
Presented on Wed Apr 27 2011 at SeaHUG in Seattle, WA.
Contrasts with the “1Rule” method (1Rule uses only one attribute). NB allows all attributes to make contributions that are equally important and independent of one another.
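Under that independence assumption the evidence likelihood factorizes into one term per attribute (the standard naive Bayes simplification): Pr[E|H] = Pr[E1|H] × Pr[E2|H] × ... × Pr[En|H].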
This classifier produces a probability estimate for each class rather than a single prediction. It is considered “supervised learning”.
A comparison with other classification methods in 2006 showed that Bayes classification is outperformed by more current approaches, such as boosted trees or random forests. An advantage of the naive Bayes classifier is that it requires only a small amount of training data to estimate the parameters (means and variances of the variables) necessary for classification.
Pr[E|H] -> the likelihood of the evidence for instances with H -> “yes”
Pr[H] -> the percentage of instances with this outcome
Pr[E] -> the sum of Pr[E|H] × Pr[H] over all outcomes
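A minimal sketch (not from the talk; the two outcomes and all numbers are made up for illustration) of how these three quantities combine via Bayes’ rule, Pr[H|E] = Pr[E|H] × Pr[H] / Pr[E]:

    public class BayesRuleSketch {
        public static void main(String[] args) {
            // Hypothetical two-outcome problem; likelihoods and priors are illustrative only.
            String[] outcomes  = {"yes", "no"};
            double[] prEGivenH = {0.02, 0.05};  // Pr[E|H]: likelihood of the evidence per outcome
            double[] prH       = {0.60, 0.40};  // Pr[H]: fraction of training instances per outcome

            // Pr[E] = sum over all outcomes of Pr[E|H] * Pr[H]
            double prE = 0.0;
            for (int i = 0; i < outcomes.length; i++) {
                prE += prEGivenH[i] * prH[i];
            }

            // Pr[H|E] = Pr[E|H] * Pr[H] / Pr[E] -- a probability estimate per class, not just a prediction
            for (int i = 0; i < outcomes.length; i++) {
                System.out.printf("Pr[%s|E] = %.3f%n", outcomes[i], prEGivenH[i] * prH[i] / prE);
            }
        }
    }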
Book reference: Snow Crash. For each attribute “a” there are multiple values, and given these combinations we need to look at how many times the instances were actually classified into each class. In training we use the term “outcome”; in classification we use the term “class”. Example: say we have 2 attributes per instance (see the sketch below).
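A rough sketch of that 2-attribute example (the dataset, attribute values, and class labels below are invented for illustration): training boils down to counting how often each attribute value co-occurs with each outcome, and those counts become the conditional probabilities.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class NaiveBayesCounts {
        public static void main(String[] args) {
            // Each row: {attribute1, attribute2, outcome}; values are made up for illustration.
            String[][] training = {
                {"sunny", "high",   "no"},
                {"sunny", "normal", "yes"},
                {"rainy", "high",   "yes"},
                {"rainy", "normal", "yes"},
            };

            Map<String, Integer> outcomeCounts = new HashMap<>();
            List<Map<String, Integer>> valueOutcomeCounts = new ArrayList<>();
            for (int a = 0; a < 2; a++) valueOutcomeCounts.add(new HashMap<>());

            for (String[] row : training) {
                String outcome = row[2];
                outcomeCounts.merge(outcome, 1, Integer::sum);
                for (int a = 0; a < 2; a++) {
                    // key = "attrValue|outcome": how often this value appeared with this outcome
                    valueOutcomeCounts.get(a).merge(row[a] + "|" + outcome, 1, Integer::sum);
                }
            }

            // Pr[attribute a = v | outcome] = count(v, outcome) / count(outcome)
            String value = "sunny", outcome = "yes";
            int joint = valueOutcomeCounts.get(0).getOrDefault(value + "|" + outcome, 0);
            double conditional = (double) joint / outcomeCounts.get(outcome);
            System.out.printf("Pr[attr1=%s | %s] = %.2f%n", value, outcome, conditional);
        }
    }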
We don’t take into account some of the other issues here, like missing values.
Now that we’ve established the case for Naïve Bayes + text, show how it fits in with other classification algorithms.
*** Need to sell the case for using another feature-calculating mechanic ***
When one class has more training examples than another, Naive Bayes selects poor weights for the decision boundary. To balance the number of training examples used per estimate, they introduced a “complement class” formulation of Naive Bayes (sketched below). A document is treated as a sequence of words, and it is assumed that each word position is generated independently of every other word.
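A hedged sketch of the complement-class idea (in the spirit of Rennie et al.’s Complement Naive Bayes, not Mahout’s actual implementation; the classes, vocabulary, counts, and smoothing value are made up): for each class, word probabilities are estimated from the training examples of every class except that one.

    public class ComplementNBSketch {
        public static void main(String[] args) {
            String[] classes = {"spam", "ham"};
            String[] vocab   = {"free", "meeting", "offer"};
            // wordCounts[c][w] = occurrences of word w in training docs of class c (made-up numbers)
            double[][] wordCounts = {
                {40, 2, 30},   // spam
                {3, 25, 4},    // ham
            };
            double alpha = 1.0;  // Laplace smoothing

            for (int c = 0; c < classes.length; c++) {
                double complementTotal = 0;
                double[] complementWord = new double[vocab.length];
                for (int other = 0; other < classes.length; other++) {
                    if (other == c) continue;  // complement: use every class EXCEPT c
                    for (int w = 0; w < vocab.length; w++) {
                        complementWord[w] += wordCounts[other][w];
                        complementTotal   += wordCounts[other][w];
                    }
                }
                for (int w = 0; w < vocab.length; w++) {
                    double theta = (complementWord[w] + alpha) / (complementTotal + alpha * vocab.length);
                    // The per-word weight is log(theta); a document is scored by summing
                    // weight * term frequency and picking the class with the LOWEST complement score.
                    System.out.printf("class=%s word=%s complement theta=%.3f%n",
                                      classes[c], vocab[w], theta);
                }
            }
        }
    }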
Term frequency = number of occurrences of the considered term ti in document dj / total number of words in document dj, normalized to protect against bias toward larger docs.
IDF = log(total number of documents / number of documents containing the term).
Normalized frequency for a term (feature) in a document is calculated by dividing the term frequency by the root mean square of the term frequencies in that document.
Weight-normalized TF for a given feature in a given label = sum of the normalized frequency of the feature across all the documents in the label.
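A toy sketch of those weighting steps (the two-document corpus and vocabulary are made up, and “root mean square” is read here as the RMS over the document’s term-frequency vector; an illustration, not Mahout’s code). The per-label weight would then sum the normalized frequencies across all documents in that label.

    import java.util.Arrays;

    public class TermWeightingSketch {
        public static void main(String[] args) {
            String[] vocab = {"apache", "mahout", "bayes"};
            // termCounts[d][t] = occurrences of term t in document d (toy numbers)
            double[][] termCounts = {
                {3, 1, 0},
                {0, 2, 2},
            };

            for (int d = 0; d < termCounts.length; d++) {
                double docLength = Arrays.stream(termCounts[d]).sum();
                // Term frequency = occurrences of term t in doc d / number of words in doc d
                double[] tf = new double[vocab.length];
                for (int t = 0; t < vocab.length; t++) tf[t] = termCounts[d][t] / docLength;

                // Normalized frequency = tf / root-mean-square of the tf values in the same doc
                double rms = Math.sqrt(Arrays.stream(tf).map(x -> x * x).sum() / vocab.length);
                for (int t = 0; t < vocab.length; t++) {
                    System.out.printf("doc=%d term=%s tf=%.3f normalized=%.3f%n",
                                      d, vocab[t], tf[t], tf[t] / rms);
                }
            }

            // IDF = log(total number of docs / number of docs containing the term)
            int totalDocs = termCounts.length;
            for (int t = 0; t < vocab.length; t++) {
                int docsWithTerm = 0;
                for (double[] doc : termCounts) if (doc[t] > 0) docsWithTerm++;
                System.out.printf("term=%s idf=%.3f%n", vocab[t], Math.log((double) totalDocs / docsWithTerm));
            }
        }
    }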
Need to get a better handle on Sigma_k, Sigma_j, and the weights W_ij: https://cwiki.apache.org/MAHOUT/bayesian.html