THORNY PATH TO DATA MINING PROJECTS
Java is often criticized for making CSV datasets hard to parse and for its poor support for matrix and vector manipulation. This makes it hard to implement certain types of machine learning algorithms easily and efficiently. In many cases data scientists choose R or Python for modeling and problem solving, and you as a Java developer have to rewrite R algorithms in Java or integrate many small Python scripts into a Java application.
But why are so many high-load tools, such as Cassandra, Hadoop, Giraph and Spark, written in Java or run on the JVM? What is the secret of implementing and running them successfully? Maybe we should forget the old manufacturing approach of separating developers from research engineers in production projects?
In this talk we will discuss typical Data Mining tasks, the advantages and disadvantages of the Hadoop ecosystem, the battle between Spark and Hadoop for a place in the sun, and the differences between popular Machine Learning tools and libraries.
Attendees will become more familiar with the abbreviations and buzzwords of the field and will get useful tips on self-education in this area.
2. 2JDD conference
About
I am a <graph theory, machine learning, traffic jam prediction, BigData algorithms> scientist
But I'm a <Java, NoSQL, Hadoop, Spark> programmer
In this talk …
A lot of strange pictures and technologies from a crazy zoo
We will talk about
• Data Mining
• Hadoop ecosystem
• Spark and its friends
• Machine Learning libraries
Before discovering patterns you should ...
• Select small pieces of the data
• Define default values for missing data
• Remove strange signals from the data
• Merge several tables into one if required
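The cleaning steps above can be sketched in plain Java. This is only an illustration under stated assumptions: the class DataPrep and its methods are hypothetical, missing values are assumed to be encoded as NaN, and "strange signals" are approximated by a simple range check.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class for illustration; not part of any library named in the talk.
final class DataPrep {

    // Replace missing values (assumed to be encoded as NaN) with a default value.
    static double[] fillMissing(double[] values, double defaultValue) {
        double[] result = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            result[i] = Double.isNaN(values[i]) ? defaultValue : values[i];
        }
        return result;
    }

    // Keep only values inside [min, max]; a simple stand-in for removing strange signals.
    static double[] removeOutOfRange(double[] values, double min, double max) {
        List<Double> kept = new ArrayList<>();
        for (double v : values) {
            if (v >= min && v <= max) {
                kept.add(v);
            }
        }
        double[] result = new double[kept.size()];
        for (int i = 0; i < result.length; i++) {
            result[i] = kept.get(i);
        }
        return result;
    }

    public static void main(String[] args) {
        double[] raw = {12.5, Double.NaN, 13.1, 9999.0, 11.8};
        double[] cleaned = removeOutOfRange(fillMissing(raw, 12.0), 0.0, 100.0);
        System.out.println(Arrays.toString(cleaned)); // prints [12.5, 12.0, 13.1, 11.8]
    }
}
```

In a real project the default value and the valid range would come from domain knowledge about each column rather than from constants in code.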
Typical questions for DM
• Which loan applicants are high-risk?
• How do we detect phone card fraud?
• What is the revenue prediction for next year?
• Can you recommend music for users?
Data Sources
• Relational Databases
• Data warehouses (Historical data)
• Files in CSV or binary format
• Internet or e-mail
• Scientific and research tools (R, Octave, Matlab)
What is Classification?
It is the process of finding a model (or function) that describes and distinguishes data classes, so that it can predict the class of objects whose class label is unknown.
Classification
• Training set of classified examples (supervised learning)
• Test set of non-classified items
• Main goal: find a function (classifier) that maps input data to a category (class)
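To make the idea of a classifier as a function from input data to a class concrete, here is a minimal sketch of a nearest-centroid classifier in plain Java. The technique and the class name NearestCentroidClassifier are chosen only for illustration; the talk does not prescribe a particular algorithm, and the sample features are made up.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical example class: learns one centroid per class from labelled examples.
final class NearestCentroidClassifier {
    private final Map<String, double[]> centroids = new HashMap<>();

    // Training: average the feature vectors of each labelled class.
    void train(double[][] features, String[] labels) {
        Map<String, double[]> sums = new HashMap<>();
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i < features.length; i++) {
            double[] sum = sums.computeIfAbsent(labels[i], k -> new double[features[0].length]);
            for (int j = 0; j < sum.length; j++) {
                sum[j] += features[i][j];
            }
            counts.merge(labels[i], 1, Integer::sum);
        }
        for (Map.Entry<String, double[]> e : sums.entrySet()) {
            double[] centroid = e.getValue().clone();
            int n = counts.get(e.getKey());
            for (int j = 0; j < centroid.length; j++) {
                centroid[j] /= n;
            }
            centroids.put(e.getKey(), centroid);
        }
    }

    // Classification: map an unlabelled item to the class with the closest centroid.
    String classify(double[] item) {
        String best = null;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (Map.Entry<String, double[]> e : centroids.entrySet()) {
            double[] c = e.getValue();
            double distance = 0.0;
            for (int j = 0; j < item.length; j++) {
                distance += (item[j] - c[j]) * (item[j] - c[j]);
            }
            if (distance < bestDistance) {
                bestDistance = distance;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up training set: {income, debt} in arbitrary units.
        double[][] features = {{9, 1}, {8, 2}, {2, 8}, {1, 9}};
        String[] labels = {"low-risk", "low-risk", "high-risk", "high-risk"};
        NearestCentroidClassifier classifier = new NearestCentroidClassifier();
        classifier.train(features, labels);
        System.out.println(classifier.classify(new double[]{3, 7})); // prints high-risk
    }
}
```

The training set plays the role of the classified examples, and classify is the function that assigns a category to items from the test set.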
Why not Octave?
• A small number of ML algorithms
• All your matrixes are belong to us!
• Single-threaded model
• Java support
• Octave in Java?
Why not R?
• 25% of R packages are written in Java
• Syntax is too sweet
• You should read 1000 lines of docs to write 1 line of code
• Single-threaded model for 95% of algorithms
Why not Python?
• High-level language
• Have you ever heard of Jython?
• Long way to real high-load production
• We are not Python developers
MapReduce for iterative calculations
• Reducing graph problems to the key-value model is highly complex
• Iterative algorithms require multiple chained M/R jobs, with the full state saved and read at every step
Think like a vertex…
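As a contrast to chained M/R jobs, the vertex-centric idea can be sketched on a single JVM: every vertex repeatedly looks at its neighbours' values and updates its own, and the state stays in memory between supersteps. The class below is a hypothetical illustration that finds connected components by propagating the smallest vertex id; it does not use the Giraph API.

```java
import java.util.Arrays;

// Hypothetical single-machine sketch of vertex-centric ("think like a vertex") computation.
final class VertexCentricSketch {

    // neighbours[v] lists the vertices adjacent to v (undirected graph).
    // Returns, for each vertex, the smallest vertex id in its connected component.
    static int[] connectedComponents(int[][] neighbours) {
        int n = neighbours.length;
        int[] label = new int[n];
        for (int v = 0; v < n; v++) {
            label[v] = v; // every vertex starts with its own id
        }
        boolean changed = true;
        while (changed) { // one pass of this loop corresponds to one superstep
            changed = false;
            int[] next = label.clone(); // state is kept in memory, not written out per iteration
            for (int v = 0; v < n; v++) {
                for (int u : neighbours[v]) {
                    // "message" from neighbour u: its label from the previous superstep
                    if (label[u] < next[v]) {
                        next[v] = label[u];
                        changed = true;
                    }
                }
            }
            label = next;
        }
        return label;
    }

    public static void main(String[] args) {
        // Edges: 0-1, 1-2 and 3-4, so there are two components.
        int[][] neighbours = {{1}, {0, 2}, {1}, {4}, {3}};
        System.out.println(Arrays.toString(connectedComponents(neighbours))); // prints [0, 0, 0, 3, 3]
    }
}
```

With plain MapReduce each pass of the while loop would typically be a separate job that reads the previous labels from storage and writes the new ones back, which is the overhead the slide refers to.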
JDM
Java API for Data Mining, JSR 73 and JSR 247
• javax.datamining.supervised defines the supervised function-related interfaces
• javax.datamining.algorithm contains all mining algorithm subpackages
• JDM 2.0 adds text mining, time series and so on
SPMF
• It is a codebase of algorithms in the pattern mining field
• It has cool examples and implementations of 109 algorithms
• Cool performance results in its specific area
• The codebase grows very fast
• Not many classification algorithms are covered
Mahout
• Scalable machine learning with Samsara
• Advanced implementations of Java's Collections Framework for better performance
• New algorithms will be built on the Spark platform
• Collaborative filtering, classification, clustering, dimensionality reduction and miscellaneous algorithms are supported
Code sample Mahout (K-Means)
// Read the point values and generate vectors from the input data
final List<Vector> vectors = vectorize(points);
// Write the vectors to Hadoop sequence files
writePointsToFile(configuration, vectors);
// Write initial centers for clusters
writeClusterInitialCenters(configuration, vectors);
// Run K-means algorithm
final Path inputPath = new Path(POINTS_PATH);
final Path clustersPath = new Path(CLUSTERS_PATH);
final Path outputPath = new Path(OUTPUT_PATH);
HadoopUtil.delete(configuration, outputPath);
KMeansDriver.run(configuration, inputPath, clustersPath, outputPath, 0.001, 10, true, 0, false);
// Read and print output values
readAndPrintOutputValues(configuration);
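For readers who want to see what the driver call is iterating on, here is a minimal single-machine sketch of the K-means loop in plain Java. It is not the Mahout implementation: the class KMeansSketch is hypothetical, and its convergenceDelta and maxIterations parameters only play roughly the role of the 0.001 and 10 arguments in the sample above.

```java
import java.util.Arrays;

// Hypothetical single-machine K-means (Lloyd's algorithm) sketch; not the Mahout code.
final class KMeansSketch {

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return sum;
    }

    // Index of the center closest to the given point.
    static int nearest(double[] point, double[][] centers) {
        int best = 0;
        for (int c = 1; c < centers.length; c++) {
            if (squaredDistance(point, centers[c]) < squaredDistance(point, centers[best])) {
                best = c;
            }
        }
        return best;
    }

    static double[][] cluster(double[][] points, double[][] initialCenters,
                              double convergenceDelta, int maxIterations) {
        int k = initialCenters.length;
        int dim = points[0].length;
        double[][] centers = new double[k][];
        for (int c = 0; c < k; c++) {
            centers[c] = initialCenters[c].clone();
        }
        for (int iteration = 0; iteration < maxIterations; iteration++) {
            // Assignment step: add each point to the running sum of its nearest center.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (double[] p : points) {
                int c = nearest(p, centers);
                counts[c]++;
                for (int d = 0; d < dim; d++) {
                    sums[c][d] += p[d];
                }
            }
            // Update step: move each non-empty center to the mean of its points.
            double maxShift = 0.0;
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) {
                    continue; // an empty cluster keeps its previous center
                }
                double[] updated = new double[dim];
                for (int d = 0; d < dim; d++) {
                    updated[d] = sums[c][d] / counts[c];
                }
                maxShift = Math.max(maxShift, Math.sqrt(squaredDistance(centers[c], updated)));
                centers[c] = updated;
            }
            if (maxShift <= convergenceDelta) {
                break; // centers have stopped moving
            }
        }
        return centers;
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {0, 2}, {10, 0}, {10, 2}};
        double[][] initial = {{0, 0}, {10, 0}};
        System.out.println(Arrays.deepToString(cluster(points, initial, 0.001, 10)));
        // prints [[0.0, 1.0], [10.0, 1.0]]
    }
}
```

The distributed version runs the same assignment and update steps, but spreads the points across the cluster and stores the centers between iterations.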
SPARK: the bloody son of MR
• MapReduce in memory
• Up to 50x faster than Hadoop
• RDD is the basic building block (an immutable distributed collection of objects)
Code sample MLlib (K-Means)
// Cluster the data into two classes using KMeans
int numClusters = 2;
int numIterations = 20;
KMeansModel clusters = KMeans.train(parsedData.rdd(), numClusters, numIterations);
// Evaluate clustering by computing Within Set Sum of Squared Errors
double WSSSE = clusters.computeCost(parsedData.rdd());
System.out.println("Within Set Sum of Squared Errors = " + WSSSE);
// Save and load model
clusters.save(sc.sc(), "myModelPath");
KMeansModel sameModel = KMeansModel.load(sc.sc(), "myModelPath");
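For reference, the quantity printed as WSSSE in the sample above is the sum of squared distances from each point to its nearest cluster center. In the notation below (introduced here only for illustration), x_1, ..., x_n are the data points and c_1, ..., c_k are the learned centers:

```latex
\mathrm{WSSSE} = \sum_{i=1}^{n} \min_{1 \le j \le k} \lVert x_i - c_j \rVert^2
```

A lower value means tighter clusters, but it always decreases as the number of clusters grows, so it is mainly useful for comparing runs with the same numClusters.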
MLlib
• .. draws on ideas from scikit-learn (Python lib) and Mahout
• .. runs fully on Spark
• .. is documented
• .. works well for large datasets and parallelized algorithms