Machine learning is overhyped nowadays. There is a strong belief that this area is exclusively for data scientists with a deep mathematical background who rely on the Python (scikit-learn, Theano, TensorFlow, etc.) or R ecosystem and use specialized tools like MATLAB, Octave, or similar. Of course, there is a big grain of truth in this statement, but we Java engineers can also take the best of the machine learning universe from an applied perspective by using our native language and familiar frameworks like Apache Spark. During this introductory presentation, you will get acquainted with the simplest machine learning tasks and algorithms, such as regression, classification, and clustering, widen your outlook, use Apache Spark MLlib to distinguish pop music from heavy metal, and simply have fun.
Source code: https://github.com/tmatyashovsky/spark-ml-samples
Design by Yarko Filevych: http://filevych.com/
6. “I'm a rolling thunder, a pouring rain
I'm comin' on like a hurricane
My lightning's flashing across the sky
You're only young but you're gonna die
I won't take no prisoners, won't spare no lives
Nobody's putting up a fight
I got my bell, I'm gonna take you to hell
I'm gonna get you, Satan get you”
15. Date & time
Conference name
Speaker
Talk name
Track
Duration
Type
Overall impression
Overall rating
Number of slides
Time spent on live coding
Number of jokes
Etc.
33. Initialize cluster centroids
Assign each example to the closest cluster centroid
Recalculate centroids as an average (mean) of examples assigned to a cluster
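The clustering steps on this slide can be sketched in plain Java. This is a minimal illustration, not the Spark MLlib implementation: the class name `KMeansSketch`, the choice of the first k points as initial centroids, and the sample points are all assumptions made for the example. The loop stops once the centroids no longer move or the iteration limit is reached.

```java
import java.util.Arrays;

final class KMeansSketch {

    // Squared Euclidean distance between two points of equal dimension.
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Index of the centroid closest to the given point.
    static int closest(double[] point, double[][] centroids) {
        int best = 0;
        for (int c = 1; c < centroids.length; c++) {
            if (squaredDistance(point, centroids[c]) < squaredDistance(point, centroids[best])) {
                best = c;
            }
        }
        return best;
    }

    // Runs k-means. Assumption: the first k points are used as the initial centroids.
    static double[][] fit(double[][] points, int k, int maxIterations) {
        int dim = points[0].length;
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = points[c].clone();
        }
        for (int iteration = 0; iteration < maxIterations; iteration++) {
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            // Assignment step: attach every example to its closest centroid.
            for (double[] point : points) {
                int c = closest(point, centroids);
                counts[c]++;
                for (int i = 0; i < dim; i++) {
                    sums[c][i] += point[i];
                }
            }
            // Update step: move each centroid to the mean of its examples.
            boolean moved = false;
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) {
                    continue; // an empty cluster keeps its previous centroid
                }
                for (int i = 0; i < dim; i++) {
                    double mean = sums[c][i] / counts[c];
                    if (mean != centroids[c][i]) {
                        moved = true;
                    }
                    centroids[c][i] = mean;
                }
            }
            if (!moved) {
                break; // centroids are stable, so the assignments will not change
            }
        }
        return centroids;
    }

    public static void main(String[] args) {
        // Made-up two-dimensional examples forming two obvious groups.
        double[][] points = {{1, 1}, {9, 9}, {1, 2}, {8, 9}, {2, 1}, {9, 8}};
        System.out.println(Arrays.deepToString(fit(points, 2, 100)));
    }
}
```

Real implementations usually pick the initial centroids randomly or with a smarter seeding strategy, since the result depends on the starting points.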
37. Collect data set of lyrics:
Pop: ABBA, Ace of Base, Backstreet Boys, Britney Spears, Christina Aguilera, Madonna, etc.
Heavy metal: Black Sabbath, In Flames, Iron Maiden, Metallica, Moonspell, Nightwish, Sentenced, etc.
Create training set, i.e. label (0|1) + features
Train logistic regression (or other classification algorithm)
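To make the "label + features" idea concrete, here is a small sketch of logistic regression trained with batch gradient descent in plain Java. It is not the code from the talk: the class name `LogisticRegressionSketch`, the two word-count features, the label assignment (0 for pop, 1 for metal), and the learning rate are assumptions chosen only for illustration.

```java
final class LogisticRegressionSketch {

    // Logistic (sigmoid) function mapping any real number into (0, 1).
    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    // Predicted probability of label 1; weights[0] is the bias term.
    static double predict(double[] weights, double[] features) {
        double z = weights[0];
        for (int i = 0; i < features.length; i++) {
            z += weights[i + 1] * features[i];
        }
        return sigmoid(z);
    }

    // Batch gradient descent on the average log-loss, starting from zero weights.
    static double[] train(double[][] features, int[] labels, double learningRate, int epochs) {
        int rows = features.length;
        int dim = features[0].length;
        double[] weights = new double[dim + 1];
        for (int epoch = 0; epoch < epochs; epoch++) {
            double[] gradient = new double[dim + 1];
            for (int r = 0; r < rows; r++) {
                double error = predict(weights, features[r]) - labels[r];
                gradient[0] += error;
                for (int i = 0; i < dim; i++) {
                    gradient[i + 1] += error * features[r][i];
                }
            }
            for (int i = 0; i <= dim; i++) {
                weights[i] -= learningRate * gradient[i] / rows;
            }
        }
        return weights;
    }

    public static void main(String[] args) {
        // Hypothetical features per song: {count of "love", count of "hell"}.
        double[][] features = {{3, 0}, {2, 0}, {4, 1}, {0, 3}, {1, 4}, {0, 2}};
        // Assumed labels: 0 = pop, 1 = metal.
        int[] labels = {0, 0, 0, 1, 1, 1};
        double[] weights = train(features, labels, 0.1, 1000);
        System.out.println("P(metal | lots of hell) = " + predict(weights, new double[] {0, 5}));
        System.out.println("P(metal | lots of love) = " + predict(weights, new double[] {5, 0}));
    }
}
```

A prediction above 0.5 is read as label 1 and below 0.5 as label 0; real lyrics would need far richer features than two word counts.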
43.
Verse                                  Cosine Distance
baby one more time                     0.482028
crazy for you                          0.437875
show me the meaning of being lonely    0.258147
highway to hell                        -0.1120049
kill them all                          -0.231876
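The values in this table fall between -1 and 1, which is the range of cosine similarity, i.e. the cosine of the angle between two vectors. How the verse vectors were produced is not shown here, so the sketch below only illustrates the measure itself; the class name `CosineSimilaritySketch` and the sample vectors are made up.

```java
final class CosineSimilaritySketch {

    // Cosine of the angle between two non-zero vectors of equal length.
    static double cosineSimilarity(double[] a, double[] b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Made-up three-dimensional vectors, for illustration only.
        double[] reference = {1.0, 2.0, 0.5};
        double[] similar = {0.9, 2.1, 0.4};
        double[] opposite = {-1.0, -1.5, 0.2};
        System.out.println("similar:  " + cosineSimilarity(reference, similar));
        System.out.println("opposite: " + cosineSimilarity(reference, opposite));
    }
}
```

Vectors pointing the same way score close to 1, unrelated (orthogonal) vectors score close to 0, and vectors pointing in opposite directions score close to -1.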
56. Spark MLlib is a library of ML algorithms and utilities designed to run in parallel on a Spark cluster
57. Introduces a few new data types, e.g. vector (dense and sparse), labeled point, rating, etc.
Allows invoking various algorithms on distributed datasets (RDD/Dataset)
http://spark.apache.org/docs/latest/mllib-guide.html
59. Utilities: linear algebra, statistics, etc.
Feature extraction, feature transformation, etc.
Regression
Classification
Clustering
Collaborative filtering, e.g. alternating least squares
Dimensionality reduction
And many more
http://spark.apache.org/docs/latest/mllib-guide.html
60. “All” spark.mllib features plus:
• Pipelines
• Persistence
• Model selection and tuning:
  • Train-validation split
  • K-fold cross-validation
http://spark.apache.org/docs/latest/ml-guide.html
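As a rough illustration of what k-fold cross-validation does with the data, the sketch below splits row indices into k folds in plain Java; each fold serves once as the validation set while the remaining folds form the training set. This is not the spark.ml API: the class name `KFoldSketch` and the round-robin assignment are assumptions, and real implementations typically shuffle the rows first.

```java
import java.util.ArrayList;
import java.util.List;

final class KFoldSketch {

    // Splits row indices 0..rowCount-1 into k folds; index i goes to fold i % k.
    static List<List<Integer>> folds(int rowCount, int k) {
        List<List<Integer>> result = new ArrayList<>();
        for (int f = 0; f < k; f++) {
            result.add(new ArrayList<>());
        }
        for (int i = 0; i < rowCount; i++) {
            result.get(i % k).add(i);
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<Integer>> folds = folds(10, 3);
        for (int f = 0; f < folds.size(); f++) {
            // Training rows are all rows that are not in the validation fold.
            List<Integer> training = new ArrayList<>();
            for (int other = 0; other < folds.size(); other++) {
                if (other != f) {
                    training.addAll(folds.get(other));
                }
            }
            System.out.println("validation=" + folds.get(f) + " training=" + training);
        }
    }
}
```

The model is trained and evaluated k times, and the k scores are averaged to compare candidate parameter settings.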
64. I'm a rolling thunder, a pouring rain
I'm comin' on like a hurricane
My lightning's flashing across the sky
You're only young but you're gonna die
I won't take no prisoners, won't spare no lives
Nobody's putting up a fight
I got my bell, I'm gonna take you to hell
I'm gonna get you, Satan get you
68. Im a rolling thunder a pouring rain
Im comin on like a hurricane
My lightnings flashing across the sky
Youre only young but youre gonna die
I wont take no prisoners wont spare no lives
Nobodys putting up a fight
I got my bell Im gonna take you to hell
Im gonna get you Satan get you
70. im a rolling thunder a pouring rain
im comin on like a hurricane
my lightnings flashing across the sky
youre only young but youre gonna die
i wont take no prisoners wont spare no lives
nobodys putting up a fight
i got my bell im gonna take you to hell
im gonna get you satan get you
72. im rolling thunder pouring rain
im comin like hurricane
lightnings flashing across sky
youre young youre gonna die
wont take prisoners wont spare lives
nobodys putting fight
got bell im gonna take hell
im gonna get satan get
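The cleaning steps shown so far (strip punctuation, lowercase, drop stop words) can be sketched in a few lines of plain Java. This is only an approximation of what a real pipeline does: the class name `LyricsCleanerSketch` is made up, and the tiny stop word list is an assumption chosen to cover this one verse, not the list used in the talk.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class LyricsCleanerSketch {

    // A tiny illustrative stop word list (an assumption, not the list used in the talk).
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "a", "i", "my", "on", "the", "to", "you", "but", "no", "up", "only"));

    // Strips punctuation, lowercases, and removes stop words from one line of lyrics.
    static List<String> clean(String line) {
        // Step 1: keep only ASCII letters and whitespace.
        String lettersOnly = line.replaceAll("[^A-Za-z\\s]", "");
        // Step 2: lowercase everything.
        String lowerCased = lettersOnly.toLowerCase(Locale.ROOT);
        // Step 3: split into words and drop the stop words.
        List<String> words = new ArrayList<>();
        for (String word : lowerCased.trim().split("\\s+")) {
            if (!word.isEmpty() && !STOP_WORDS.contains(word)) {
                words.add(word);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(String.join(" ", clean("I'm a rolling thunder, a pouring rain")));
        System.out.println(String.join(" ", clean("I'm comin' on like a hurricane")));
    }
}
```

Stemming, the next step in the slides, is left out because a faithful stemmer is considerably longer than the rest of the pipeline.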
74. 4
im roll thunder pour rain
im comin like hurrican
lightn flash across sky
your young your gonna die
wont take prison wont spare live
nobodi put fight
got bell im gonna take hell
im gonna get satan get
(Lines 1-8 grouped into: verse1, verse2)
75. 8
im roll thunder pour rain
im comin like hurrican
lightn flash across sky
your young your gonna die
wont take prison wont spare live
nobodi put fight
got bell im gonna take hell
im gonna get satan get
(Lines 1-8 grouped into: verse1)
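The last two slides appear to group the eight cleaned lines into verses of a fixed size. Assuming the numbers 4 and 8 on those slides are the lines-per-verse setting (an interpretation, not something the slides state), the grouping could look like this in plain Java; the class name `VerseGrouperSketch` is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class VerseGrouperSketch {

    // Joins consecutive lines into verses of at most linesPerVerse lines each.
    static List<String> toVerses(List<String> lines, int linesPerVerse) {
        if (linesPerVerse <= 0) {
            throw new IllegalArgumentException("linesPerVerse must be positive");
        }
        List<String> verses = new ArrayList<>();
        for (int start = 0; start < lines.size(); start += linesPerVerse) {
            int end = Math.min(start + linesPerVerse, lines.size());
            verses.add(String.join(" ", lines.subList(start, end)));
        }
        return verses;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "im roll thunder pour rain",
                "im comin like hurrican",
                "lightn flash across sky",
                "your young your gonna die",
                "wont take prison wont spare live",
                "nobodi put fight",
                "got bell im gonna take hell",
                "im gonna get satan get");
        System.out.println(toVerses(lines, 4).size() + " verses with 4 lines per verse");
        System.out.println(toVerses(lines, 8).size() + " verse with 8 lines per verse");
    }
}
```

Each verse then becomes one training example, so the verse size controls how much text the classifier sees per example.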
86. ML is not as complex as it seems from an applied perspective
Existing libraries and frameworks reduce a lot of tedious work
For instance, Spark MLlib can help to build nice ML pipelines
Number of jokes used. Whether you liked the speaker or not.
Assign (index) each example to the cluster centroid closest to it.
Recalculate (move) the centroids as the average (mean) of the examples assigned to each cluster.
Repeat until the centroids no longer move.
Bag of words: each single word is a one-hot encoded vector with the size of the dictionary. As a result, there are a lot of sparse vectors.
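A small sketch can show why these vectors are sparse. The plain Java below builds a dictionary from a few words, encodes a single word as a one-hot vector, and sums the words of a line into a count vector. The class and method names are made up for illustration, and dense `int[]` arrays are used only for readability; a real dictionary with thousands of words would call for a sparse representation.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

final class BagOfWordsSketch {

    // Assigns each distinct word its position in the dictionary, in order of first appearance.
    static Map<String, Integer> buildDictionary(String[] words) {
        Map<String, Integer> dictionary = new LinkedHashMap<>();
        for (String word : words) {
            dictionary.putIfAbsent(word, dictionary.size());
        }
        return dictionary;
    }

    // One-hot vector: all zeros except a 1 at the word's dictionary position.
    static int[] oneHot(String word, Map<String, Integer> dictionary) {
        int[] vector = new int[dictionary.size()];
        Integer position = dictionary.get(word);
        if (position != null) {
            vector[position] = 1;
        }
        return vector;
    }

    // Bag of words: the sum of the one-hot vectors of all words, i.e. word counts.
    static int[] bagOfWords(String[] words, Map<String, Integer> dictionary) {
        int[] counts = new int[dictionary.size()];
        for (String word : words) {
            Integer position = dictionary.get(word);
            if (position != null) {
                counts[position]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] words = "im gonna get satan get".split(" ");
        Map<String, Integer> dictionary = buildDictionary(words);
        System.out.println(dictionary);
        System.out.println(Arrays.toString(oneHot("get", dictionary)));
        System.out.println(Arrays.toString(bagOfWords(words, dictionary)));
    }
}
```

With a dictionary of several thousand words, each one-hot vector holds a single 1 among thousands of zeros.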
Word2Vec: behind the scenes, a two-layer neural net that processes text.
It captures semantic and morphological similarity, so similar words are close in the vector space.
Similar words would be clustered together in the high-dimensional sphere.
If two words are very close to synonymous, you’d expect them to show up in similar contexts, and indeed synonymous words tend to be close.
For two completely random words, the similarity is pretty close to 0.
On the opposite side there is usually not an antonym, just noise.
Used the pretrained Google News model (GoogleNews-vectors-negative300).
My corpus: 8316 words.
Let’s finally move on to the implementation, using a library or framework that helps us avoid tedious transformations and provides algorithms as well as feature extractors out of the box.