Classification with Naïve Bayes
A Deep Dive into Apache Mahout

Today's speaker: Josh Patterson
  • josh@cloudera.com / twitter: @jpatanooga
  • Master's Thesis: self-organizing mesh networks
  • Published in IAAI-09: TinyTermite: A Secure Routing Algorithm
  • Conceived, built, and led Hadoop integration for the openPDC project at TVA (Smartgrid stuff)
  • Led small team which designed classification techniques for time series and Map Reduce
  • Open source work at http://openpdc.codeplex.com
  • Now: Solutions Architect at Cloudera

What is Classification?
  • Supervised Learning
  • We give the system a set of instances to learn from
  • System builds knowledge of some structure
      • Learns “concepts”
  • System can then classify new instances

Supervised vs Unsupervised Learning
  • Supervised
      • Give system examples/instances of multiple concepts
      • System learns “concepts”
      • More “hands on”
      • Example: Naïve Bayes, Neural Nets
  • Unsupervised
      • Uses unlabeled data
      • Builds joint density model
      • Example: k-means clustering

Naïve Bayes
  • Called Naïve Bayes because it is based on “Bayes' Rule” and “naively” assumes independence given the label
      • It is only valid to multiply probabilities when the events are independent
      • Simplistic assumption in real life
  • Despite the name, Naïve Bayes works well on actual datasets

Naïve Bayes Classifier
  • Simple probabilistic classifier based on
      • applying Bayes' theorem (from Bayesian statistics)
      • strong (naive) independence assumptions
  • A more descriptive term for the underlying probability model would be “independent feature model”.

Naïve Bayes Classifier (2)
  • Assumes that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature.
  • Example:
      • A fruit may be considered to be an apple if it is red, round, and about 4" in diameter.
      • Even if these features depend on each other or upon the existence of the other features, a naive Bayes classifier considers all of these properties to independently contribute to the probability that this fruit is an apple.

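The apple example can be sketched in a few lines of Python. This is only an illustration of the independence assumption: every probability below is a hypothetical number invented for the sketch, and the feature names and the `naive_scores` helper are not from the talk.

```python
from math import prod

# Hypothetical per-feature likelihoods Pr[feature | class], invented for illustration.
likelihoods = {
    "apple": {"red": 0.7, "round": 0.9, "about_4_inches": 0.6},
    "other": {"red": 0.2, "round": 0.4, "about_4_inches": 0.3},
}
# Hypothetical class priors Pr[class].
priors = {"apple": 0.3, "other": 0.7}


def naive_scores(observed, likelihoods, priors):
    """Return normalized class probabilities under the independence assumption."""
    # Each feature contributes its own factor, regardless of the other features.
    raw = {
        label: priors[label] * prod(likelihoods[label][f] for f in observed)
        for label in priors
    }
    total = sum(raw.values())
    return {label: score / total for label, score in raw.items()}


scores = naive_scores(["red", "round", "about_4_inches"], likelihoods, priors)
best = max(scores, key=scores.get)
```

With these made-up numbers the three features multiply up in favour of "apple", even though in reality colour, shape, and size are not independent of one another.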
A Little Bit o' Theory

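The equation on this slide was an image and did not survive in the transcript. A plausible reconstruction, assuming the standard naive Bayes form and the Pr[E|H], Pr[H], Pr[E] notation used in the speaker notes, is:

```latex
\[
  \Pr[H \mid E]
  = \frac{\Pr[E_1 \mid H]\,\Pr[E_2 \mid H]\cdots\Pr[E_n \mid H]\;\Pr[H]}{\Pr[E]},
  \qquad
  \Pr[E] = \sum_{H'} \Pr[H'] \prod_{i=1}^{n} \Pr[E_i \mid H']
\]
```

Here H is the outcome (class), E_1 ... E_n are the individual pieces of evidence (attribute values), and the sum in Pr[E] runs over all possible outcomes H'.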
Condensing Meaning
… From the Vapor of That Last Big Equation
  • To train our system we need
      • Total number of input training instances (count)
      • Counts of tuples: {attribute_n, outcome_o, value_m}
      • Total counts of each outcome_o: {outcome-count}
  • To calculate each Pr[E_n|H]
      • ({attribute_n, outcome_o, value_m} / {outcome-count})

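The counting described on this slide can be sketched as follows. The toy training set and the helper names `conditional` and `prior` are hypothetical; the sketch uses raw counts only, with no smoothing and no handling of missing values.

```python
from collections import Counter

# Hypothetical toy training set: (attributes, outcome) pairs.
training = [
    ({"color": "red", "shape": "round"}, "apple"),
    ({"color": "red", "shape": "round"}, "apple"),
    ({"color": "green", "shape": "round"}, "apple"),
    ({"color": "yellow", "shape": "long"}, "banana"),
    ({"color": "green", "shape": "long"}, "banana"),
]

total_instances = len(training)  # total number of training instances
tuple_counts = Counter()         # (attribute_n, outcome_o, value_m) -> count
outcome_counts = Counter()       # outcome_o -> count

for attributes, outcome in training:
    outcome_counts[outcome] += 1
    for attribute, value in attributes.items():
        tuple_counts[(attribute, outcome, value)] += 1


def conditional(attribute, value, outcome):
    """Estimate Pr[E_n | H] as tuple count / outcome count (no smoothing)."""
    return tuple_counts[(attribute, outcome, value)] / outcome_counts[outcome]


def prior(outcome):
    """Estimate Pr[H] as the share of training instances with this outcome."""
    return outcome_counts[outcome] / total_instances
```

For example, `conditional("color", "red", "apple")` divides the two red apples by the three apples seen in training. A value never seen with an outcome yields a zero estimate, which is one reason practical implementations add a smoothing term.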
A Real Example From Witten, et al

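The example itself was an image and is not in the transcript. The sketch below assumes it was the well-known weather ("play") data set from Witten et al., and classifies a new day with outlook = sunny, temperature = cool, humidity = high, windy = true. The fractions are the per-outcome counts from that data set; if the slide used a different example, treat this purely as an illustration of the arithmetic.

```python
from fractions import Fraction as F

# Assumed source: the weather data in Witten et al. (9 "yes" days, 5 "no" days).
# Each list holds Pr[E_i | H] for sunny, cool, high humidity, windy = true.
evidence = {
    "yes": [F(2, 9), F(3, 9), F(3, 9), F(3, 9)],
    "no": [F(3, 5), F(1, 5), F(4, 5), F(3, 5)],
}
priors = {"yes": F(9, 14), "no": F(5, 14)}

# Numerator of Bayes' rule: Pr[H] times the product of the Pr[E_i | H].
likelihood = {}
for outcome, factors in evidence.items():
    value = priors[outcome]
    for factor in factors:
        value *= factor
    likelihood[outcome] = value

# Pr[E] is the sum of the numerators over all outcomes.
pr_e = sum(likelihood.values())
posterior = {outcome: value / pr_e for outcome, value in likelihood.items()}
```

Under that assumption the numerators are 1/189 for "yes" and 18/875 for "no", which normalize to roughly 20.5% "yes" and 79.5% "no".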
Enter Apache Mahout
  • What is it?
      • Apache Mahout is a scalable machine learning library that supports large data sets
  • What Are the Major Algorithm Types?
      • Classification
      • Recommendation
      • Clustering
  • http://mahout.apache.org/

Mahout Algorithms

Naïve Bayes and Text
  • Naive Bayes does not model text well.
      • “Tackling the Poor Assumptions of Naive Bayes Text Classifiers”
      • http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf
  • Mahout does some modifications based around TF-IDF scoring (Next Slide)
      • Includes two other pre-processing steps, common for information retrieval but not for Naive Bayes classification

High Level Algorithm
  • For each feature (word) in each doc:
      • Calc: “Weight Normalized Tf-Idf”
      • For a given feature in a label, this is the Tf-idf calculated using standard idf multiplied by the Weight Normalized Tf
  • We calculate the sum of W-N-Tf-idf for all the features in a label, called Sigma_k, and set alpha_i == 1.0
  • Weight = Log [ ( W-N-Tf-Idf + alpha_i ) / ( Sigma_k + N ) ]

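A minimal Python sketch of the weight formula on this slide, for a single label. It assumes N is the number of distinct features (the vocabulary size), which the slide does not define, and it takes the W-N-Tf-Idf scores as given inputs. The function name and the sample scores are hypothetical, and this is not Mahout's actual code.

```python
import math

ALPHA_I = 1.0  # smoothing term, as stated on the slide


def feature_weights(wn_tfidf_by_feature, vocabulary_size):
    """Apply Weight = log((W-N-Tf-Idf + alpha_i) / (Sigma_k + N)) for one label."""
    # Sigma_k: sum of W-N-Tf-Idf over all features in the label.
    sigma_k = sum(wn_tfidf_by_feature.values())
    return {
        feature: math.log((value + ALPHA_I) / (sigma_k + vocabulary_size))
        for feature, value in wn_tfidf_by_feature.items()
    }


# Hypothetical W-N-Tf-Idf scores for one label.
scores = {"hockey": 3.0, "puck": 2.0, "the": 0.5}
weights = feature_weights(scores, vocabulary_size=3)
```

Because alpha_i is added to every score, a feature with a W-N-Tf-Idf of zero still gets a finite (negative) weight instead of log(0).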
BayesDriver Training Workflow
  • Naïve Bayes Training MapReduce Workflow in Mahout

Logical Classification Process
  • Gather, Clean, and Examine the Training Data
      • Really get to know your data!
  • Train the Classifier, allowing the system to “Learn” the “Concepts”
      • But not “overfit” to this specific training data set
  • Classify New Unseen Instances
      • With Naïve Bayes we'll calculate the probabilities of each class with respect to this instance

How Is Classification Done?
  • Sequentially or via Map Reduce
  • TestClassifier.java
      • Creates ClassifierContext
      • For Each File in Dir
          • For Each Line
              • Break line into map of tokens
              • Feed array of words to Classifier engine for new classification/label
              • Collect classifications as output

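The sequential loop outlined above can be sketched in Python. This mirrors only the shape of the steps listed on the slide; it is not Mahout's TestClassifier.java, and `tokenize`, `classify_directory`, and `toy_classifier` are hypothetical names, with `toy_classifier` standing in for a trained classifier engine.

```python
import tempfile
from collections import Counter
from pathlib import Path


def tokenize(line):
    """Break a line into a map of token -> count."""
    return Counter(line.lower().split())


def classify_directory(directory, classify):
    """Classify every non-empty line of every file and collect the labels."""
    results = []
    for path in sorted(Path(directory).iterdir()):  # for each file in dir
        if not path.is_file():
            continue
        with path.open(encoding="utf-8") as handle:
            for line in handle:                     # for each line
                tokens = tokenize(line)
                if tokens:
                    # Feed the words to the classifier and collect its label.
                    results.append((path.name, classify(list(tokens))))
    return results


def toy_classifier(words):
    """Hypothetical stand-in for the classifier engine."""
    return "sports" if "hockey" in words else "unknown"


# Usage on two throwaway files.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.txt").write_text("Hockey game tonight\n", encoding="utf-8")
    Path(tmp, "b.txt").write_text("Stock prices fell\n", encoding="utf-8")
    labels = classify_directory(tmp, toy_classifier)
```

Passing the classifier in as a function keeps the file-walking loop independent of how the model was trained, which is roughly the role the classifier context plays in the outline above.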
A Quick Note About Training Data…
  • Your classifier can only be as good as the training data lets it be…
      • If you don't do good data prep, everything will perform poorly
      • Data collection and pre-processing take the bulk of the time

Enough Math, Run the Code
  • Download and install Mahout
      • http://www.apache.org
  • Run 20Newsgroups Example
      • https://cwiki.apache.org/confluence/display/MAHOUT/Twenty+Newsgroups
      • Uses Naïve Bayes Classification
      • Download and extract 20news-bydate.tar.gz from the 20newsgroups dataset

Generate Test and Train Dataset
  • Training Dataset:

      mahout org.apache.mahout.classifier.bayes.PrepareTwentyNewsgroups \
        -p examples/bin/work/20news-bydate/20news-bydate-train \
        -o examples/bin/work/20news-bydate/bayes-train-input \
        -a org.apache.mahout.vectorizer.DefaultAnalyzer \
        -c UTF-8

  • Test Dataset:

      mahout org.apache.mahout.classifier.bayes.PrepareTwentyNewsgroups \
        -p examples/bin/work/20news-bydate/20news-bydate-test \
        -o examples/bin/work/20news-bydate/bayes-test-input \
        -a org.apache.mahout.vectorizer.DefaultAnalyzer \
        -c UTF-8

Train and Test Classifier
  • Train:

      $MAHOUT_HOME/bin/mahout trainclassifier \
        -i 20news-input/bayes-train-input \
        -o newsmodel \
        -type bayes \
        -ng 3 \
        -source hdfs

  • Test:

      $MAHOUT_HOME/bin/mahout testclassifier \
        -m newsmodel \
        -d 20news-input \
        -type bayes \
        -ng 3 \
        -source hdfs \
        -method mapreduce

Other Use Cases
  • Predictive Analytics
      • You'll hear this term a lot in the field, especially in the context of SAS
  • General Supervised Learning Classification
      • We can recognize a lot of things with practice
      • And lots of tuning!
  • Document Classification
  • Sentiment Analysis

Questions?
  • We're Hiring!
  • Cloudera's Distro of Apache Hadoop:
      • http://www.cloudera.com
  • Resources
      • “Tackling the Poor Assumptions of Naive Bayes Text Classifiers”
      • http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf

Editor's notes

  1. https://cwiki.apache.org/MAHOUT/books-tutorials-and-talks.html
  2. Contrasts with the “1Rule” method (1Rule uses 1 attribute). NB allows all attributes to make contributions that are equally important and independent of one another.
  3. This classifier produces a probability estimate for each class rather than a prediction. Considered “Supervised Learning”.
  4. A comparison with other classification methods in 2006 showed that Bayes classification is outperformed by more current approaches, such as boosted trees or random forests. An advantage of the naive Bayes classifier is that it requires a small amount of training data to estimate the parameters (means and variances of the variables) necessary for classification.
  5. Pr[E|H] -> all evidence for instances with H -> “yes”. Pr[H] -> percent of instances with this outcome. Pr[E] -> sum of the values ( ) for all outcomes.
  6. Book reference: Snow Crash. For each attribute “a” there are multiple values, and given these combinations we need to look at how many times the instances were actually classified as each class. In training we use the term “outcome”; in classification we use the term “class”. Example: say we have 2 attributes to an instance.
  7. We don’t take into account some of the other things like “missing values” here
  8. Now that we've established the case for Naïve Bayes + text, show how it fits in with other classification algorithms.
  9. *** Need to sell case for using another feature calculating mechanic *** When one class has more training examples than another, Naive Bayes selects poor weights for the decision boundary. To balance the amount of training examples used per estimate, they introduced a “complement class” formulation of Naive Bayes. A document is treated as a sequence of words, and it is assumed that each word position is generated independently of every other word.
  10. Term frequency = number of occurrences of the considered term ti in document dj / sizeof(words in doc dj). Normalized to protect against bias in larger docs. IDF = log( Normalized Frequency for a term (feature) in a document is calculated by dividing the term frequency by the root mean square of term frequencies in that document. Weight Normalized Tf for a given feature in a given label = sum of Normalized Frequency of the feature across all the documents in the label.
  11. Need to get a better handle on Sigma_kirSigmaWij. https://cwiki.apache.org/MAHOUT/bayesian.html
  12. https://cwiki.apache.org/confluence/display/MAHOUT/Twenty+Newsgroups
  13. Can also test sequentially
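
The frequency definitions in note 10 can be sketched in Python as a reading aid. This follows the note's wording literally (term frequency divided by the root mean square of the document's term frequencies); Mahout's own implementation may normalize differently. The function names and sample documents are hypothetical, and documents are assumed to be non-empty.

```python
import math
from collections import Counter


def term_frequencies(words):
    """Term frequency: occurrences of each term / number of words in the doc."""
    counts = Counter(words)
    return {term: n / len(words) for term, n in counts.items()}


def normalized_frequencies(words):
    """Divide each term frequency by the RMS of the doc's term frequencies."""
    tf = term_frequencies(words)
    rms = math.sqrt(sum(v * v for v in tf.values()) / len(tf))
    return {term: v / rms for term, v in tf.items()}


def weight_normalized_tf(docs_in_label):
    """Sum the normalized frequency of each feature across all docs in a label."""
    totals = Counter()
    for words in docs_in_label:
        for term, value in normalized_frequencies(words).items():
            totals[term] += value
    return dict(totals)


# Hypothetical tokenized documents belonging to a single label.
docs = [["puck", "puck", "ice"], ["ice", "goal"]]
wn_tf = weight_normalized_tf(docs)
```

Dividing by the per-document RMS keeps a long document from dominating the label's totals simply because it contains more words.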