{ “Mahout” : “Scalable Machine Learning Library” }
{ “Presented By” : “Varad Meru”,
“Company” : “Orzota, Inc”,
“Twitter” : “@vrdmr” }
1
{ “Mahout” : “Introduction” }
2
{ “Introduction” : “History and Etymology” }
• A Scalable Machine Learning Library built on Hadoop, written in Java.
• Driven by Ng et al.’s paper “MapReduce for Machine Learning on Multicore”.
• Started as a Lucene sub-project. Became an Apache TLP in April 2010.
• Latest version: 0.6 (released on 6th Feb 2012).
• Mahout means Keeper/Driver of Elephants. The name fits because many of the algorithms are implemented in MapReduce on Hadoop.
• Mahout was started by Isabel Drost, Grant Ingersoll and Karl Wettin.
• The Taste Recommendation Framework was added later by Sean Owen.
3
Figure 1.1 Apache Mahout and its related projects within the Apache Foundation.
Much of Mahout’s work has been to not only implement these algorithms conventionally, […] and scalable way, but also to convert some of these algorithms to work at scale on to[…] Hadoop’s mascot is an elephant, which at last explains the project name!
Mahout incubates a number of techniques and algorithms, many still in developm[…] experimental phase. At this early stage in the project's life, three core themes are evident[…] filtering / recommender engines, clustering, and classification. This is by no means all tha[…] Mahout, but are the most prominent and mature themes at the time of writing. These the[…] scope of this book.
Chances are that if you are reading this, you are already aware of the interesting pot[…] three families of techniques. But just in case, read on.
{ “Mahout” : “Machine Learning” }
4
{ “Machine Learning” : “Introduction” }
“Machine Learning is Programming Computers to optimize a Performance Criterion using Example Data or Past Experience”
• Branch of Artificial Intelligence
• Design and Development of Algorithms
• Computers Evolve Behavior based on Empirical Data.
• Supervised Learning
‣ Using labeled training data to create a classifier that can predict output for unseen inputs.
• Unsupervised Learning
‣ Using unlabeled training data to find hidden structure, such as groups of similar items.
• Semi-Supervised Learning
5
{ “Machine Learning” : “Applications” }
• Recommend Friends, Dates, Products to end-user.
• Classify content into pre-defined groups.
• Find Similar content based on Object Properties.
• Identify key topics in large Collections of Text.
• Detect Anomalies within given data.
• Ranking Search Results with User Feedback Learning.
• Classifying DNA sequences.
• Sentiment Analysis/ Opinion Mining
• Computer Vision.
• Natural Language Processing.
• Bioinformatics.
• Speech and Handwriting Recognition.
• Others ...
6
{“Machine Learning”: “Challenges”}
• BigData
• Yesterday’s Processing on next-generation Data.
• Time for Processing
• Large and Cheap Storage
7
Size               Classification                              Tools
Lines              Sample Data: Analysis and Visualization     Whiteboard, bash, ...
KBs - low MBs      Prototype Data: Analysis and Visualization  Matlab, Octave, R, Processing, bash, ...
MBs - low GBs      Online Data: Storage                        MySQL (DBs), ...
MBs - low GBs      Online Data: Analysis                       NumPy, SciPy, Weka, BLAS/LAPACK, ...
MBs - low GBs      Online Data: Visualization                  Flare, AmCharts, Raphael, Protovis, ...
GBs - TBs - PBs    Big Data: Storage                           HDFS, HBase, Cassandra, ...
GBs - TBs - PBs    Big Data: Analysis                          Hive, Mahout, Hama, Giraph, ...
{ “Machine Learning” : “Mahout for Big Data”}
• Goal: “Be as Fast and Efficient as possible given the intrinsic design of the Algorithm”.
• Some Algorithms won’t scale to massive machine clusters.
• Others fit naturally on a MapReduce framework such as Apache Hadoop.
• Most Mahout implementations are MapReduce enabled.
• Focus: “Scalability with Hadoop’s MapReduce Processing Framework on BigData on Hadoop’s HDFS Storage”.
• The only Machine Learning Library built on a MapReduce framework. Other MapReduce frameworks such as Disco, Skynet, FileMap, Phoenix and AEMR either don’t scale or don’t have any ML library.
• The only Scalable Machine Learning Framework with MapReduce and Hadoop Support (www.mloss.org: Machine Learning Open Source Software).
8
{ “Mahout” : “Internals” }
9
10
{ “Internals” : “Architecture” }
Architecture diagram components:
• Applications, Examples
• Recommenders, Clustering, Classification, Freq. Pattern Mining, Evolutionary Algorithms, Regression, Dimension Reduction
• Utilities (Lucene/Vectorizer), Collections (primitives), Math (Vectors/Matrices/SVD)
• Apache Hadoop
{ “Internals” : “Features” }
• Scalable
• Dual-Mode (Sequential and MapReduce Enabled)
• Support for easy Extension.
• Large number of Data Sources supported, including the newer NoSQL variants.
• It is a Java library: a framework of tools intended to be used and adapted by developers.
• Advanced Implementations of Java’s Collections Framework for better Performance.
11
{ “Mahout” : “Algorithms” }
12
{ “Algorithms” : “Recommender Systems”, “id” : “Introduction”}
• Help Users find items they might like based on historical behavior and preferences.
• Top-level packages define the Mahout interfaces to these key abstractions:
‣ DataModel – FileDataModel, MySQLJDBCDataModel, PostgreSQLJDBCDataModel, MongoDBDataModel, CassandraDataModel
‣ UserSimilarity – Pearson-Correlation, Tanimoto, Log-Likelihood, Uncentered Cosine Similarity, Euclidean Distance Similarity
‣ ItemSimilarity – Pearson-Correlation, Tanimoto, Log-Likelihood, Uncentered Cosine Similarity, Euclidean Distance Similarity
‣ UserNeighborhood – Nearest N-User Neighborhood, Threshold User Neighborhood.
‣ Recommender – KNN Item-Based Recommender, Slope One Recommender, Tree Clustering Recommender.
13
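How the abstractions listed for recommender systems fit together can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the toy ratings, the function names (`pearson_similarity`, `nearest_n_users`, `recommend`) and the weighting scheme are hypothetical and are not Mahout's Java API. The data model feeds a user similarity, the similarity feeds a nearest-N neighborhood, and the neighborhood feeds the recommender.

```python
from math import sqrt

# Hypothetical toy data model: user -> {item: rating}. Not Mahout's DataModel API.
DATA_MODEL = {
    "u1": {"a": 5.0, "b": 3.0, "c": 4.0},
    "u2": {"a": 4.0, "b": 2.0, "c": 5.0, "d": 4.0},
    "u3": {"a": 1.0, "b": 5.0, "d": 2.0},
}


def pearson_similarity(model, u, v):
    """Pearson correlation over the items both users have rated."""
    common = sorted(set(model[u]) & set(model[v]))
    if len(common) < 2:
        return 0.0
    xs = [model[u][i] for i in common]
    ys = [model[v][i] for i in common]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return num / den if den else 0.0


def nearest_n_users(model, user, n):
    """Keep the n most similar users, dropping non-positive similarities."""
    scored = [(pearson_similarity(model, user, v), v) for v in model if v != user]
    scored.sort(reverse=True)
    return [(s, v) for s, v in scored[:n] if s > 0]


def recommend(model, user, n_neighbors=2, how_many=3):
    """Similarity-weighted average of neighbor ratings for unseen items."""
    scores, weights = {}, {}
    for sim, v in nearest_n_users(model, user, n_neighbors):
        for item, rating in model[v].items():
            if item in model[user]:
                continue  # only recommend items the user has not rated
            scores[item] = scores.get(item, 0.0) + sim * rating
            weights[item] = weights.get(item, 0.0) + sim
    ranked = sorted(((scores[i] / weights[i], i) for i in scores), reverse=True)
    return [(item, round(score, 3)) for score, item in ranked[:how_many]]


recommendations = recommend(DATA_MODEL, "u1")
print(recommendations)
```

Swapping `pearson_similarity` for another measure, or the nearest-N rule for a threshold rule, leaves the rest of the pipeline unchanged, which is the point of keeping the abstractions separate.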
14
{ “Algorithms” : “Recommender Systems”, “id” : “Example”}
Binary Values (1 = bought, 0 = did not buy); rows are customers, columns are products P1 to P4:
         P1  P2  P3  P4
Alice    0   1   1   1
Bob      1   0   1   1
John     0   1   0   0
Jane     1   0   1   1
Bill     1   1   1   1
Steve    1   0   1   1
Larry    1   0   0   0
Don      1   1   1   0
Jack     1   1   0   1
Recommendation
15
{ “Algorithms” : “Recommender Systems” , “Similarity” : “Tanimoto”}
Tanimoto Coefficient: T(A, B) = Nc / (NA + NB - Nc)
NA – Number of customers who bought Product A
NB – Number of customers who bought Product B
Nc – Number of customers who bought both Product A and Product B

Product-to-product similarity (products P1 to P4 in column order of the purchase matrix):
      P1            P2            P3            P4
P1    1             1/3 ≈ 0.33    5/8 = 0.625   5/8 = 0.625
P2    1/3 ≈ 0.33    1             3/8 = 0.375   3/8 = 0.375
P3    5/8 = 0.625   3/8 = 0.375   1             5/7 ≈ 0.714
P4    5/8 = 0.625   3/8 = 0.375   5/7 ≈ 0.714   1
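The Tanimoto values can be reproduced from the binary purchase values of the example slide. The sketch below assumes the 36 values read as nine customer rows by four product columns, and it uses the standard definition Nc / (NA + NB - Nc); the product indices 0 to 3 are placeholders, since the slide's product names did not survive extraction.

```python
# Purchase matrix from the example slide: one row per customer, one column per product.
PURCHASES = [
    [0, 1, 1, 1],
    [1, 0, 1, 1],
    [0, 1, 0, 0],
    [1, 0, 1, 1],
    [1, 1, 1, 1],
    [1, 0, 1, 1],
    [1, 0, 0, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 1],
]


def tanimoto(matrix, a, b):
    """Tanimoto coefficient between product columns a and b."""
    na = sum(row[a] for row in matrix)           # customers who bought A
    nb = sum(row[b] for row in matrix)           # customers who bought B
    nc = sum(row[a] * row[b] for row in matrix)  # customers who bought both
    union = na + nb - nc
    return nc / union if union else 0.0


n_products = len(PURCHASES[0])
tanimoto_matrix = [
    [round(tanimoto(PURCHASES, a, b), 3) for b in range(n_products)]
    for a in range(n_products)
]
for row in tanimoto_matrix:
    print(row)
```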
16
{ “Algorithms” : “Recommender Systems” , “Similarity” : “Cosine”}
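Cosine Coefficient: cos(A, B) = Nc / sqrt(NA × NB)
NA – Number of customers who bought Product A
NB – Number of customers who bought Product B
Nc – Number of customers who bought both Product A and Product B

Product-to-product similarity (products P1 to P4 in column order of the purchase matrix):
      P1      P2      P3      P4
P1    1       0.507   0.772   0.772
P2    0.507   1       0.548   0.548
P3    0.772   0.548   1       0.833
P4    0.772   0.548   0.833   1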
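For binary purchase vectors the cosine coefficient reduces to Nc / sqrt(NA × NB), so it can be checked against the same purchase values. As before, this is a sketch that assumes the example values form nine customer rows by four product columns; the product indices are placeholders. Under this formula the second product's similarity to the third and to the fourth comes out as 3/√30 ≈ 0.548.

```python
from math import sqrt

# Purchase matrix from the example slide: one row per customer, one column per product.
PURCHASES = [
    [0, 1, 1, 1],
    [1, 0, 1, 1],
    [0, 1, 0, 0],
    [1, 0, 1, 1],
    [1, 1, 1, 1],
    [1, 0, 1, 1],
    [1, 0, 0, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 1],
]


def cosine(matrix, a, b):
    """Uncentered cosine similarity between binary product columns a and b."""
    na = sum(row[a] for row in matrix)           # customers who bought A
    nb = sum(row[b] for row in matrix)           # customers who bought B
    nc = sum(row[a] * row[b] for row in matrix)  # customers who bought both
    return nc / sqrt(na * nb) if na and nb else 0.0


n_products = len(PURCHASES[0])
cosine_matrix = [
    [round(cosine(PURCHASES, a, b), 3) for b in range(n_products)]
    for a in range(n_products)
]
for row in cosine_matrix:
    print(row)
```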
{ “Algorithms” : “Classification” , “id” : “Introduction”}
• Assigning Data to discrete Categories.
• Train a model on Labeled Data
• Run the Model on new, Unlabeled Data
• Classifier: An algorithm that implements classification, especially in a concrete implementation.
• Classification Algorithms
‣ Maximum entropy classifier
‣ Naïve Bayes classifier
‣ Decision trees, decision lists
‣ Support vector machines
‣ Kernel estimation and K-nearest-neighbor algorithms
‣ Perceptrons
‣ Neural networks (multi-layer perceptrons)
17
Spam
 Not spam
?
18
{ “Algorithms” : “Classification” , “id” : “Naïve Bayes Example”}
Train: Not Spam
President Obama’s Nobel Prize Speech
19
{ “Algorithms” : “Classification” , “id” : “Naïve Bayes Example”}
Train: Spam
Spam Email Content
20
{ “Algorithms” : “Classification” , “id” : “Naïve Bayes Example”}
Run
“Order a trial Adobe chicken daily
EAB-List new summer savings, welcome!”
21
{ “Algorithms” : “Classification” , “id” : “Naïve Bayes in Mahout”}
• Naïve Bayes is a pretty complex process in Mahout: training the classifier requires four separate Hadoop jobs.
• Training:
• Read the Features
• Calculate per-Document Statistics
• Normalize across Categories
• Calculate normalizing factor of each label
• Testing
• Classification (fifth job, explicitly invoked)
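The train-then-classify flow can be illustrated on a single machine. The sketch below is a plain multinomial Naïve Bayes with add-one smoothing, not Mahout's four-job Hadoop pipeline; the training sentences and the label names `spam` and `ham` are made up for illustration. Reading the features and counting them corresponds loosely to the first training steps, and the smoothing denominator plays the role of a per-label normalizing factor.

```python
from collections import Counter, defaultdict
from math import log

# Hypothetical toy training set; not the documents shown on the slides.
TRAINING = [
    ("spam", "order trial savings now"),
    ("spam", "summer savings order today"),
    ("ham", "peace prize speech"),
    ("ham", "president speech on peace"),
]


def train(examples):
    """Count words per label, documents per label, and the vocabulary."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    vocab = set()
    for label, text in examples:
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        doc_counts[label] += 1
        vocab.update(tokens)
    return word_counts, doc_counts, vocab


def classify(model, text):
    """Return the label with the highest log-posterior for the text."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, float("-inf")
    for label in sorted(doc_counts):
        score = log(doc_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            if token not in vocab:
                continue  # ignore words never seen in training
            # add-one (Laplace) smoothed likelihood
            score += log((word_counts[label][token] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(TRAINING)
prediction = classify(model, "Order a trial new summer savings")
print(prediction)
```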
[…] algorithm through which the system will learn, and the variables used as input are key steps in the […] phase of building the classification system.
The basic steps in building a classification system are illustrated in figure 13.2.
Figure 13.2. How a classification system works. Inside the dotted lasso is the heart of the classification system, a train[…] algorithm that learns a model to emulate human decisions. A copy of the model is then used in evaluation or in produc[…] with new input examples to estimate the target variable.
The figure shows two phases of the classification process, with the upper path representing training […] classification model and the lower path providing new examples for which the model will assign catego[…] (the target variables) as a way to emulate decisions. For the training phase, input for the train[…]
{ “Algorithms” : “Clustering” , “id” : “Introduction”}
• Grouping unstructured data without any training data.
• Self learning from experience.
• Small intra-cluster distance - Trying for local and global Minima
• Large inter-cluster distance
• Mahout’s Canopy Clustering MapReduce algorithm is often used to compute initial cluster centroids.
22
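The idea of using canopy clustering to seed centroids can be sketched on a single machine. This is a textbook canopy pass, not Mahout's MapReduce implementation: the thresholds `t1` (loose) and `t2` (tight), the function names and the sample points are assumptions chosen for illustration. Each canopy's mean is then a candidate initial centroid for K-Means.

```python
from math import dist


def canopies(points, t1, t2):
    """Group points into canopies using a loose (t1) and a tight (t2) threshold."""
    if not t1 > t2:
        raise ValueError("t1 must be larger than t2")
    remaining = list(points)
    result = []
    while remaining:
        center = remaining.pop(0)
        members = [center]
        still_candidates = []
        for p in remaining:
            d = dist(center, p)
            if d < t1:
                members.append(p)           # loosely close: joins this canopy
            if d >= t2:
                still_candidates.append(p)  # not tightly close: may start or join others
        remaining = still_candidates
        result.append(members)
    return result


def centroid(members):
    """Component-wise mean of a list of points."""
    return tuple(sum(col) / len(members) for col in zip(*members))


# Hypothetical sample data: two well-separated groups.
POINTS = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
seeds = [centroid(m) for m in canopies(POINTS, t1=3.0, t2=2.0)]
print(seeds)
```

Because one pass over the points is enough, this kind of approximation is cheap compared with running K-Means from random starting points.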
23
{ “Algorithms” : “Clustering” , “id” : “K-Means Clustering Example”}
32
{ “Algorithms” : “Clustering” , “id” : “K-Means Clustering Example”}
Cats
Dogs
33
{ “Algorithms” : “Clustering” , “id” : “K-Means in Mahout”}
Diagram: chunks C0-C3 are read by mappers M0-M3 (map phase), which produce IO0-IO3; after shuffling the data, reducers R0-R1 (reduce phase) produce FO0-FO1.
• Assume: the number of clusters is far smaller than the number of points, i.e. |Clusters| << |Points|.
• Hadoop’s DistributedCache is used in order to give each Mapper access to all the current cluster centroids.
34
{ “Algorithms” : “Clustering” , “id” : “K-Means in Mahout”}
Diagram: mappers M0-M3 emit <clusterID, observation> pairs to reducers R0 and R1.
Important arguments
--maxIter
--convergenceDelta
--method
35
{ “Algorithms” : “Clustering” , “id” : “MapReduce KMeans Clustering”}
Map phase: assign cluster IDs
Reduce phase: reset centroids
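The two phases can be imitated in plain Python to show what one iteration does. This is a single-process sketch, not Mahout's Hadoop job: `map_phase` emits `<clusterID, observation>` pairs, `reduce_phase` recomputes each centroid, and the parameters `max_iter` and `convergence_delta` are only named after the `--maxIter` and `--convergenceDelta` arguments; their exact semantics in Mahout are an assumption here. The sample points and starting centroids are hypothetical.

```python
from collections import defaultdict


def nearest(point, centroids):
    """Index of the centroid closest to the point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centroids[k])),
    )


def map_phase(points, centroids):
    """Map: emit a (clusterID, observation) pair for every point."""
    for point in points:
        yield nearest(point, centroids), point


def reduce_phase(pairs, centroids):
    """Reduce: reset each centroid to the mean of the points assigned to it."""
    groups = defaultdict(list)
    for cluster_id, point in pairs:
        groups[cluster_id].append(point)
    updated = list(centroids)  # clusters with no points keep their old centroid
    for cluster_id, members in groups.items():
        updated[cluster_id] = tuple(sum(col) / len(members) for col in zip(*members))
    return updated


def kmeans(points, centroids, max_iter=10, convergence_delta=1e-3):
    """Repeat map and reduce until centroids move less than convergence_delta."""
    for _ in range(max_iter):
        updated = reduce_phase(map_phase(points, centroids), centroids)
        shift = max(
            sum((a - b) ** 2 for a, b in zip(old, new)) ** 0.5
            for old, new in zip(centroids, updated)
        )
        centroids = updated
        if shift <= convergence_delta:
            break
    return centroids


# Hypothetical sample data and starting centroids.
POINTS = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
final_centroids = kmeans(POINTS, [(1.0, 1.0), (8.0, 8.0)])
print(final_centroids)
```

Every mapper needs the full list of current centroids to compute `nearest`, which is cheap to share only because the number of clusters is small compared with the number of points.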
36
{ “Algorithms” : “Other Algorithms” }
• Classification
‣ Stochastic Gradient Descent
‣ Support Vector Machines
‣ Random Forests
• Clustering
‣ Latent Dirichlet Allocation
- Topic models
‣ Fuzzy K-Means
- Points can belong to multiple clusters
‣ Canopy clustering
- Fast approximations of clusters
‣ Spectral clustering
- Treat points as a graph
• Evolutionary Algorithms - Integration with Watchmaker for Genetic Programming Fitness Functions
• Dimensionality Reduction
• Regression
37
{ “Algorithms” : “Future” }
• Classification
‣ Decision Trees such as J48 and ID3
• Clustering
‣ DBSCAN and COBWEB Clustering techniques
• Evolutionary Algorithms
‣ Classical Genetic Algorithms
• Association Rules
‣ Apriori (Mahout already has an alternative frequent itemset algorithm implementation).
{ “Mahout” : “Summary” }
38
{ “Summary”: “Apache Mahout” }
• Scalable Library
• Three Primary Areas of Focus
• Other Algorithms
• All in your friendly neighborhood MapReduce
42
{ “Mahout” : “Demo” }
43
{ “Mahout” : “Questions” }
44
{ “Mahout” : “References” }
45
• Books
• “Mahout in Action”, Owen et al., Manning Pub.
• “Pattern Recognition and Machine Learning”, Christopher Bishop, Springer Pub.
• “Elements of Statistical Learning: Data Mining, Inference, and Prediction”, Hastie et al., Springer Pub.
• Videos
• CS-229, Machine Learning at Stanford University - Prof. Andrew Ng.
• Collaborative filtering at scale - Sean Owen
• Distributed Item-based Collaborative Filtering - Sebastian Schelter
• EMail Classification with Mahout - Grant Ingersoll @ Lucid Imagination
46
{ “References” : “Mahout Books, Tutorials, Links”, “id” : 1}
• WWW
• http://mahout.apache.org - Mahout@Apache
• http://hadoop.apache.org - Hadoop@Apache
• dev@mahout.apache.org - Developer mailing list
• user@mahout.apache.org - User mailing list
• http://www.ibm.com/developerworks/java/library/j-mahout/ - Introducing Apache Mahout
47
{ “References” : “Mahout Books, Tutorials, Links”, “id” : 2}
{ “Mahout” : “The End” }
48
{“Thank You” : “Have a Nice and Green Day” }

QuTrack: Model Life Cycle Management for AI and ML models using a Blockchain ...
 
IOTA 2016 Social Recomender System Presentation.
IOTA 2016 Social Recomender System Presentation.IOTA 2016 Social Recomender System Presentation.
IOTA 2016 Social Recomender System Presentation.
 





Introduction to Mahout and Machine Learning

  • 1. { “Mahout” : “Scalable Machine Learning Library” } { “Presented By” : “Varad Meru”, “Company” : “Orzota, Inc”, “Twitter” : “@vrdmr” }
  • 2. { “Mahout” : “Introduction” }
  • 3. { “Introduction” : “History and Etymology” }
      • A scalable machine learning library built on Hadoop, written in Java.
      • Driven by the paper “Map-Reduce for Machine Learning on Multicore” by Chu et al. (co-authored by Andrew Ng).
      • Started as a Lucene sub-project. Became an Apache top-level project (TLP) in April 2010.
      • Latest version: 0.6 (released on 6 February 2012).
      • “Mahout” means keeper or driver of elephants. The name fits because many of the algorithms are implemented in MapReduce on Hadoop.
      • Mahout was started by Isabel Drost, Grant Ingersoll, and Karl Wettin.
      • The Taste recommendation framework was added later by Sean Owen.
      Excerpt shown on the slide (cropped in the original; gaps are marked with …):
      Figure 1.1 Apache Mahout and its related projects within the Apache Foundation.
      Much of Mahout’s work has been to not only implement these algorithms conventionally, … and scalable way, but also to convert some of these algorithms to work at scale on to… Hadoop’s mascot is an elephant, which at last explains the project name! Mahout incubates a number of techniques and algorithms, many still in development … experimental phase. At this early stage in the project's life, three core themes are evident: … filtering / recommender engines, clustering, and classification. This is by no means all that … Mahout, but are the most prominent and mature themes at the time of writing. These … the scope of this book. Chances are that if you are reading this, you are already aware of the interesting pot… three families of techniques. But just in case, read on.
  • 4. { “Mahout” : “Machine Learning” }
  • 5. { “Machine Learning” : “Introduction” }
      “Machine Learning is Programming Computers to optimize a Performance Criterion using Example Data or Past Experience”
      • Branch of Artificial Intelligence
      • Design and Development of Algorithms
      • Computers Evolve Behavior based on Empirical Data.
      • Supervised Learning
          • Using labeled training data to create a classifier that can predict the output for unseen inputs.
      • Unsupervised Learning
          • Using unlabeled training data to discover structure in the inputs (for example, clusters), with no known outputs to learn from.
      • Semi-Supervised Learning
  • 6. { “Machine Learning” : “Applications” }
      • Recommend friends, dates, and products to the end user.
      • Classify content into pre-defined groups.
      • Find similar content based on object properties.
      • Identify key topics in large collections of text.
      • Detect anomalies within given data.
      • Rank search results by learning from user feedback.
      • Classify DNA sequences.
      • Sentiment analysis / opinion mining.
      • Computer vision.
      • Natural language processing.
      • Bioinformatics.
      • Speech and handwriting recognition.
      • Others ...
  • 7. { “Machine Learning” : “Challenges” }
      • Big Data
      • Yesterday's processing on next-generation data.
      • Time for processing
      • Large and cheap storage
      Size              Classification                                   Tools
      Lines             Sample Data: Analysis and Visualization          Whiteboard, bash, ...
      KBs - low MBs     Prototype Data: Analysis and Visualization       Matlab, Octave, R, Processing, bash, ...
      MBs - low GBs     Online Data: Storage                             MySQL (DBs), ...
      MBs - low GBs     Online Data: Analysis                            NumPy, SciPy, Weka, BLAS/LAPACK, ...
      MBs - low GBs     Online Data: Visualization                       Flare, AmCharts, Raphael, Protovis, ...
      GBs - TBs - PBs   Big Data: Storage                                HDFS, HBase, Cassandra, ...
      GBs - TBs - PBs   Big Data: Analysis                               Hive, Mahout, Hama, Giraph, ...
  • 8. { “Machine Learning” : “Mahout for Big Data” }
      • Goal: “Be as Fast and Efficient as possible given the intrinsic design of the Algorithm”.
      • Some algorithms won’t scale to massive machine clusters.
      • Others fit logically on a MapReduce framework like Apache Hadoop.
      • Most Mahout implementations are MapReduce enabled.
      • Focus: “Scalability with Hadoop’s MapReduce Processing Framework on BigData on Hadoop’s HDFS Storage”.
      • The only machine learning library built on a MapReduce framework. Other MapReduce frameworks such as Disco, Skynet, FileMap, Phoenix, and AEMR either don’t scale or don’t have any ML library.
      • The only scalable machine learning framework with MapReduce and Hadoop support (www.mloss.org: Machine Learning Open Source Software).
  • 9. { “Mahout” : “Internals” }
  • 10. { “Internals” : “Architecture” }
      Architecture diagram, components shown:
      • Applications, Examples
      • Recommenders, Clustering, Classification, Freq. Pattern Mining, Evolutionary Algorithms, Regression, Dimension Reduction
      • Utilities (Lucene/Vectorizer), Math (Vectors/Matrices/SVD), Collections (primitives)
      • Apache Hadoop
  • 11. { “Internals” : “Features” }
      • Scalable.
      • Dual-mode (sequential and MapReduce enabled).
      • Support for easy extension.
      • Supports a large number of data sources, including the newer NoSQL variants.
      • It is a Java library: a framework of tools intended to be used and adapted by developers.
      • Advanced implementations of Java’s Collections Framework for better performance.
  • 12. { “Mahout” : “Algorithms” }
  • 13. { “Algorithms” : “Recommender Systems”, “id” : “Introduction”}
      • Help users find items they might like based on historical behavior and preferences.
      • Top-level packages define the Mahout interfaces to these key abstractions:
          • DataModel: FileDataModel, MySQLJDBCDataModel, PostgreSQLJDBCDataModel, MongoDBDataModel, CassandraDataModel
          • UserSimilarity: Pearson Correlation, Tanimoto, Log-Likelihood, Uncentered Cosine Similarity, Euclidean Distance Similarity
          • ItemSimilarity: Pearson Correlation, Tanimoto, Log-Likelihood, Uncentered Cosine Similarity, Euclidean Distance Similarity
          • UserNeighborhood: Nearest N-User Neighborhood, Threshold User Neighborhood
          • Recommender: KNN Item-Based Recommender, Slope One Recommender, Tree Clustering Recommender
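The abstractions on this slide chain together in the order data model, similarity, neighborhood, recommender. The sketch below mimics that flow in plain Python instead of calling Mahout's Java API. The function names (`tanimoto`, `nearest_n_users`, `recommend`) and the toy preference data are illustrative assumptions, not Mahout identifiers.

```python
# Toy "data model": user -> set of item IDs the user has shown a preference for.
# The users and items are made up for illustration.
data_model = {
    "u1": {"a", "b", "c"},
    "u2": {"a", "b", "d"},
    "u3": {"a", "e"},
    "u4": {"f"},
}


def tanimoto(items_x, items_y):
    """User similarity: size of the intersection over size of the union."""
    union = items_x | items_y
    return len(items_x & items_y) / len(union) if union else 0.0


def nearest_n_users(model, user, n):
    """User neighborhood: the n other users most similar to `user`."""
    others = [u for u in model if u != user]
    others.sort(key=lambda u: tanimoto(model[user], model[u]), reverse=True)
    return others[:n]


def recommend(model, user, n_neighbors=2, how_many=3):
    """Recommender: score unseen items by the summed similarity of the neighbors who have them."""
    scores = {}
    for neighbor in nearest_n_users(model, user, n_neighbors):
        similarity = tanimoto(model[user], model[neighbor])
        for item in model[neighbor] - model[user]:
            scores[item] = scores.get(item, 0.0) + similarity
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:how_many]]


print(recommend(data_model, "u1"))
```

Swapping `tanimoto` for another similarity function changes the neighborhood and the ranking without touching the rest of the pipeline, which is the point of keeping the abstractions separate.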
  • 14. { “Algorithms” : “Recommender Systems”, “id” : “Example”}
      Binary Values Recommendation (one row per customer, one column per product)
      Alice   0 1 1 1
      Bob     1 0 1 1
      John    0 1 0 0
      Jane    1 0 1 1
      Bill    1 1 1 1
      Steve   1 0 1 1
      Larry   1 0 0 0
      Don     1 1 1 0
      Jack    1 1 0 1
  • 15. { “Algorithms” : “Recommender Systems” , “Similarity” : “Tanimoto”}
      Tanimoto Coefficient: T(A, B) = Nc / (NA + NB - Nc)
      NA: number of customers who bought Product A
      NB: number of customers who bought Product B
      Nc: number of customers who bought both Product A and Product B
      Pairwise product similarities:
      1             1/3 (0.33)    5/8 (0.625)   5/8 (0.625)
      1/3 (0.33)    1             3/8 (0.375)   3/8 (0.375)
      5/8 (0.625)   3/8 (0.375)   1             5/7 (0.714)
      5/8 (0.625)   3/8 (0.375)   5/7 (0.714)   1
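The Tanimoto values on this slide can be reproduced from the binary matrix of the example slide. This is a minimal sketch, assuming the 36 values are read row by row as nine customers by four products; the column indices 0 to 3 stand in for the products, whose names are not given in the transcript.

```python
from fractions import Fraction

# Binary purchase matrix from the example slide, read row by row:
# one row per customer, one column per product (assumed layout).
purchases = [
    [0, 1, 1, 1],
    [1, 0, 1, 1],
    [0, 1, 0, 0],
    [1, 0, 1, 1],
    [1, 1, 1, 1],
    [1, 0, 1, 1],
    [1, 0, 0, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 1],
]


def tanimoto(matrix, a, b):
    """Tanimoto coefficient between product columns a and b: Nc / (NA + NB - Nc)."""
    n_a = sum(row[a] for row in matrix)             # customers who bought product a
    n_b = sum(row[b] for row in matrix)             # customers who bought product b
    n_c = sum(row[a] and row[b] for row in matrix)  # customers who bought both
    return Fraction(n_c, n_a + n_b - n_c)


# Print the 4 x 4 similarity matrix as exact fractions.
for a in range(4):
    print([str(tanimoto(purchases, a, b)) for b in range(4)])
```

Exact fractions are used so the output can be compared directly with the 1/3, 5/8, 3/8 and 5/7 entries on the slide.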
  • 16. { “Algorithms” : “Recommender Systems” , “Similarity” : “Cosine”}
      Cosine Coefficient: cos(A, B) = Nc / sqrt(NA × NB)
      NA: number of customers who bought Product A
      NB: number of customers who bought Product B
      Nc: number of customers who bought both Product A and Product B
      Pairwise product similarities:
      1       0.507   0.772   0.772
      0.507   1       0.548   0.548
      0.772   0.548   1       0.833
      0.772   0.548   0.833   1
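For 0/1 purchase vectors the dot product equals Nc and the squared norms equal NA and NB, so cosine similarity reduces to Nc / sqrt(NA × NB). The sketch below shows that computation; the function name `cosine_binary` is an illustrative choice, and the two columns are taken from the example matrix under the assumption that it is laid out as nine customers by four products.

```python
import math


def cosine_binary(column_a, column_b):
    """Cosine similarity of two 0/1 vectors: Nc / sqrt(NA * NB)."""
    n_a = sum(column_a)
    n_b = sum(column_b)
    n_c = sum(x * y for x, y in zip(column_a, column_b))
    if n_a == 0 or n_b == 0:
        return 0.0  # a product nobody bought has no direction to compare
    return n_c / math.sqrt(n_a * n_b)


# First and second product columns of the example matrix (assumed layout).
product_1 = [0, 1, 0, 1, 1, 1, 1, 1, 1]
product_2 = [1, 0, 1, 0, 1, 0, 0, 1, 1]

print(round(cosine_binary(product_1, product_2), 3))  # 0.507
```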
  • 17. { “Algorithms” : “Classification” , “id” : “Introduction”}
      • Assigning data to discrete categories.
      • Train a model on labeled data.
      • Run the model on new, unlabeled data.
      • Classifier: an algorithm that implements classification, especially in a concrete implementation.
      • Classification algorithms
          • Maximum entropy classifier
          • Naïve Bayes classifier
          • Decision trees, decision lists
          • Support vector machines
          • Kernel estimation and K-nearest-neighbor algorithms
          • Perceptrons
          • Neural networks (multi-layer perceptrons)
      Figure labels: Spam, Not spam, ?
  • 18. { “Algorithms” : “Classification” , “id” : “Naïve Bayes Example”} Train: Not Spam (President Obama’s Nobel Prize Speech)
  • 19. { “Algorithms” : “Classification” , “id” : “Naïve Bayes Example”} Train: Spam (Spam Email Content)
  • 20. { “Algorithms” : “Classification” , “id” : “Naïve Bayes Example”} Run: “Order a trial Adobe chicken daily EAB-List new summer savings, welcome!”
  • 21. { “Algorithms” : “Classification” , “id” : “Naïve Bayes in Mahout”}
      • Naïve Bayes is a fairly complex process in Mahout: training the classifier requires four separate Hadoop jobs.
      • Training:
          • Read the features
          • Calculate per-document statistics
          • Normalize across categories
          • Calculate the normalizing factor of each label
      • Testing
          • Classification (fifth job, explicitly invoked)
      Excerpt shown on the slide (cropped in the original; gaps are marked with …):
      … algorithm through which the system will learn, and the variables used as input are key steps in the phase of building the classification system. The basic steps in building a classification system are illustrated in figure 13.2.
      Figure 13.2. How a classification system works. Inside the dotted lasso is the heart of the classification system, a train… algorithm that learns a model to emulate human decisions. A copy of the model is then used in evaluation or in production with new input examples to estimate the target variable.
      The figure shows two phases of the classification process, with the upper path representing training … classification model and the lower path providing new examples for which the model will assign categories (the target variables) as a way to emulate decisions. For the training phase, input for the train…
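The train-then-run flow of the spam example can be sketched on a single machine as follows. This is a plain multinomial Naïve Bayes with add-one smoothing, not Mahout's multi-job Hadoop implementation. The training sentences are made up for illustration, and the function names `train` and `classify` are assumptions.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """Count words per label and documents per label."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for label, text in labeled_docs:
        doc_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocabulary = {word for counts in word_counts.values() for word in counts}
    return word_counts, doc_counts, vocabulary


def classify(model, text):
    """Return the label with the highest log-posterior, using add-one smoothing."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        score = math.log(doc_counts[label] / total_docs)  # log prior
        label_total = sum(word_counts[label].values())
        for word in text.lower().split():
            # Smoothed log likelihood of the word under this label.
            score += math.log((word_counts[label][word] + 1) / (label_total + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up training data standing in for the two training slides.
training = [
    ("spam", "order a trial now and get new summer savings"),
    ("spam", "welcome to daily savings order today"),
    ("not spam", "the president gave a speech on peace"),
    ("not spam", "the prize committee praised the speech"),
]
model = train(training)
print(classify(model, "new summer savings welcome"))
```

Working in log space avoids underflow when many small probabilities are multiplied, and the add-one term keeps an unseen word from zeroing out a label.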
  • 22. { “Algorithms” : “Clustering” , “id” : “Introduction”}
      • Grouping unstructured data without any training data.
      • Self-learning from experience.
      • Small intra-cluster distance (trying for local and global minima).
      • Large inter-cluster distance.
      • Mahout’s Canopy Clustering MapReduce algorithm is often used to compute initial cluster centroids.
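Canopy clustering, mentioned in the last bullet, makes one cheap pass over the data with a loose threshold and a tight threshold to pick rough cluster centers. The sketch below is a sequential, simplified version written under that general description. It is not Mahout's MapReduce implementation; the name `canopy_centers` and the parameters `t1` and `t2` are illustrative, and it returns the seed point of each canopy rather than an averaged centroid.

```python
import math


def canopy_centers(points, t1, t2):
    """Pick canopy centers in one pass; t1 is the loose threshold, t2 the tight one (t1 > t2)."""
    if t1 <= t2:
        raise ValueError("t1 must be greater than t2")
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)
        # Every remaining point within the loose threshold joins this canopy.
        members = [center] + [p for p in remaining if math.dist(center, p) < t1]
        canopies.append((center, members))
        # Points within the tight threshold can no longer start a canopy of their own.
        remaining = [p for p in remaining if math.dist(center, p) >= t2]
    return canopies


points = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.8), (10.0, 10.0), (10.5, 10.0)]
centers = [center for center, _ in canopy_centers(points, t1=3.0, t2=1.0)]
print(centers)
```

The resulting centers can then seed K-Means, which removes the need to guess the number of clusters and their starting positions by hand.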
  • 23. { “Algorithms” : “Clustering” , “id” : “K-Means Clustering Example”}
  • 24. { “Algorithms” : “Clustering” , “id” : “K-Means Clustering Example”}
  • 25. { “Algorithms” : “Clustering” , “id” : “K-Means Clustering Example”}
  • 26. { “Algorithms” : “Clustering” , “id” : “K-Means Clustering Example”}
  • 27. { “Algorithms” : “Clustering” , “id” : “K-Means Clustering Example”}
  • 28. { “Algorithms” : “Clustering” , “id” : “K-Means Clustering Example”}
  • 29. { “Algorithms” : “Clustering” , “id” : “K-Means Clustering Example”}
  • 30. { “Algorithms” : “Clustering” , “id” : “K-Means Clustering Example”}
  • 31. { “Algorithms” : “Clustering” , “id” : “K-Means Clustering Example”}
  • 32. { “Algorithms” : “Clustering” , “id” : “K-Means Clustering Example”} Figure labels: Cats, Dogs
  • 33. { “Algorithms” : “Clustering” , “id” : “K-Means in Mahout”}
      Diagram: data chunks C0 to C3 feed mappers M0 to M3 (Map phase), which emit IO0 to IO3; after shuffling, reducers R0 and R1 (Reduce phase) produce FO0 and FO1.
  • 34. { “Algorithms” : “Clustering” , “id” : “K-Means in Mahout”}
      • Assume the number of clusters is far smaller than the number of points.
      • Therefore, |Clusters| << |Points|.
      • Hadoop’s DistributedCache is used to give each mapper access to all the current cluster centroids.
      Diagram: mappers M0 to M3 emit <clusterID, observation> pairs to reducers R0 and R1.
      Important arguments: --maxIter, --convergenceDelta, --method
  • 35. { “Algorithms” : “Clustering” , “id” : “MapReduce KMeans Clustering”}
      • Map phase: assign cluster IDs
      • Reduce phase: reset centroids
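The two phases can be imitated in a single process to show what each one computes. This is a sketch, not Mahout code: `map_phase` and `reduce_phase` are hypothetical names, and the `max_iter` and `convergence_delta` parameters are only assumed to play the role suggested by the `--maxIter` and `--convergenceDelta` arguments named on the previous slide.

```python
import math
from collections import defaultdict


def map_phase(points, centroids):
    """Mapper: emit (cluster_id, point) for the centroid nearest to each point."""
    for point in points:
        cluster_id = min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))
        yield cluster_id, point


def reduce_phase(pairs, centroids):
    """Reducer: reset each centroid to the mean of the points assigned to it."""
    grouped = defaultdict(list)
    for cluster_id, point in pairs:
        grouped[cluster_id].append(point)
    new_centroids = list(centroids)  # a cluster with no points keeps its old centroid
    for cluster_id, members in grouped.items():
        new_centroids[cluster_id] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return new_centroids


def kmeans(points, centroids, max_iter=10, convergence_delta=0.001):
    """Repeat map and reduce until the centroids move less than convergence_delta."""
    for _ in range(max_iter):
        updated = reduce_phase(map_phase(points, centroids), centroids)
        shift = max(math.dist(old, new) for old, new in zip(centroids, updated))
        centroids = updated
        if shift <= convergence_delta:
            break
    return centroids


points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
final_centroids = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
print(final_centroids)
```

Because each mapper only needs the current centroids and its own chunk of points, the map step parallelizes cleanly; only the small list of centroids has to be shared between iterations.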
  • 36. { “Algorithms” : “Other Algorithms” }
      • Classification
          ‣ Stochastic Gradient Descent
          ‣ Support Vector Machines
          ‣ Random Forests
      • Clustering
          ‣ Latent Dirichlet Allocation: topic models
          ‣ Fuzzy K-Means: points can be assigned to multiple clusters
          ‣ Canopy clustering: fast approximations of clusters
          ‣ Spectral clustering: treat points as a graph
      • Evolutionary Algorithms: integration with Watchmaker for Genetic Programming fitness functions
      • Dimensionality Reduction
      • Regression
  • 37. { “Algorithms” : “Future” }
      • Classification
          ‣ Decision trees such as J48 and ID3
      • Clustering
          ‣ DBSCAN and COBWEB clustering techniques
      • Evolutionary Algorithms
          ‣ Classical genetic algorithms
      • Association Rules
          ‣ Apriori (Mahout already has an alternative frequent itemset algorithm implementation)
  • 38. { “Mahout” : “Summary” }
  • 39. { “Summary”: “Apache Mahout” } • Scalable Library
  • 40. { “Summary”: “Apache Mahout” } • Scalable Library • Three Primary Areas of Focus
  • 41. { “Summary”: “Apache Mahout” } • Scalable Library • Three Primary Areas of Focus • Other Algorithms
  • 42. { “Summary”: “Apache Mahout” } • Scalable Library • Three Primary Areas of Focus • Other Algorithms • All in your friendly neighborhood MapReduce
  • 43. { “Mahout” : “Demo” }
  • 44. { “Mahout” : “Questions” }
  • 45. { “Mahout” : “References” }
  • 46. { “References” : “Mahout Books, Tutorials, Links”, “id” : 1}
      • Books
          • “Mahout in Action”, Owen et al., Manning Pub.
          • “Pattern Recognition and Machine Learning”, Christopher Bishop, Springer Pub.
          • “Elements of Statistical Learning: Data Mining, Inference, and Prediction”, Hastie et al., Springer Pub.
      • Videos
          • CS-229, Machine Learning at Stanford University: Prof. Andrew Ng
          • Collaborative filtering at scale: Sean Owen
          • Distributed Item-based Collaborative Filtering: Sebastian Schelter
          • EMail Classification with Mahout: Grant Ingersoll @ Lucid Imagination
  • 47. { “References” : “Mahout Books, Tutorials, Links”, “id” : 2}
      • WWW
          • http://mahout.apache.org - Mahout@Apache
          • http://hadoop.apache.org - Hadoop@Apache
          • dev@mahout.apache.org - Developer mailing list
          • user@mahout.apache.org - User mailing list
          • http://www.ibm.com/developerworks/java/library/j-mahout/ - Introducing Apache Mahout
  • 48. { “Mahout” : “The End” } {“Thank You” : “Have a Nice and Green Day” }