THORNY PATH TO DATA MINING PROJECTS
Alexey Zinovyev, Java Trainer at EPAM
2JDD conference
About
I am a <graph theory, machine learning, traffic jam prediction, Big Data algorithms> scientist
But I'm a <Java, NoSQL, Hadoop, Spark> programmer
What do I know about Krakow?
"Bigos was being heated in the kettles; words can hardly convey
the wondrous taste, colour and marvellous aroma of bigos"
(Adam Mickiewicz, Pan Tadeusz)
True story about Poland!
In this talk …
Lots of strange pictures and technologies from a crazy zoo
We will talk about
• Data Mining
• Hadoop ecosystem
• Spark and its friends
• Machine Learning libraries
Are you a Hadoop developer?
Let’s do THIS!
The Good Old Days
One of these fine days...
We need a Python dev 'cause Data Mining
No, you are only a Java EE developer, carry on …
Write your backends, dude!
Let’s talk about it, Java-boy...
Can a Java programmer be a Data Scientist?
Sexy Data Scientist
Real Data Scientist
And here is what I tell you, young man
WHAT IS DATA MINING?
Statistics?
Not OLAP, 100%
Hey, man, predict something!
Man or sofa?
It’s Time for Java Superhero, yeah!
Before discovering patterns you should:
• Select small pieces of the data
• Define default values for missing data
• Remove strange signals from the data
• Merge several tables into one if required
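A minimal Java sketch of two of the cleaning steps above: dropping strange signals and filling in missing values. It assumes that missing values are encoded as NaN and that "strange" simply means outside a known valid range; the class and method names are illustrative and do not come from the talk.

```java
import java.util.Arrays;

final class PreprocessingSketch {

    // Keep NaN (missing) entries and values inside [min, max]; drop everything else.
    public static double[] removeOutOfRange(double[] values, double min, double max) {
        return Arrays.stream(values)
                .filter(v -> Double.isNaN(v) || (v >= min && v <= max))
                .toArray();
    }

    // Replace NaN entries with the mean of the known values (0.0 if nothing is known).
    public static double[] fillMissingWithMean(double[] values) {
        double mean = Arrays.stream(values)
                .filter(v -> !Double.isNaN(v))
                .average()
                .orElse(0.0);
        return Arrays.stream(values)
                .map(v -> Double.isNaN(v) ? mean : v)
                .toArray();
    }

    public static void main(String[] args) {
        // 400 is a strange signal for an age column, NaN marks a missing value.
        double[] ages = {20, Double.NaN, 40, 400};
        double[] cleaned = fillMissingWithMean(removeOutOfRange(ages, 0, 120));
        System.out.println(Arrays.toString(cleaned));
    }
}
```

Outliers are removed before the mean is computed, so the value 400 does not distort the default that replaces the missing entry.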
SUBJECT AREA
Typical questions for DM
• Which loan applicants are high-risk?
• How do we detect phone card fraud?
• What is the revenue prediction for next year?
• Can you recommend music for users?
DATA
Datasets
• Facebook users, tweets
• Trade transactions
• Government
• Medicine (genomic data)
• Telecommunications
Data Sources
• Relational databases
• Data warehouses (historical data)
• Files in CSV or binary format
• The Internet or e-mail
• Scientific and research tools (R, Octave, Matlab)
PATTERN MINING
Association rule learning
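As a rough illustration of association rule learning, the two standard measures of a rule X => Y, support and confidence, can be computed in a few lines of plain Java. The shopping baskets and all names below are made up for the example.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class AssociationRuleSketch {

    // support(X): fraction of transactions that contain every item of X.
    public static double support(List<Set<String>> transactions, Set<String> items) {
        long hits = transactions.stream().filter(t -> t.containsAll(items)).count();
        return (double) hits / transactions.size();
    }

    // confidence(X => Y) = support(X and Y together) / support(X).
    public static double confidence(List<Set<String>> transactions, Set<String> x, Set<String> y) {
        Set<String> both = new HashSet<>(x);
        both.addAll(y);
        return support(transactions, both) / support(transactions, x);
    }

    public static void main(String[] args) {
        // Hypothetical market baskets.
        List<Set<String>> baskets = List.of(
                Set.of("bread", "milk"),
                Set.of("bread", "butter"),
                Set.of("bread", "milk", "butter"),
                Set.of("milk"));
        System.out.println("support(bread, milk) = "
                + support(baskets, Set.of("bread", "milk")));
        System.out.println("confidence(bread => milk) = "
                + confidence(baskets, Set.of("bread"), Set.of("milk")));
    }
}
```

Real pattern mining algorithms (Apriori, FP-Growth and friends) search for all rules above given support and confidence thresholds; this sketch only scores one rule.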
What is Cluster Analysis?
It is the process of grouping objects so that objects in the
same group (cluster) are more similar to each other than to
objects in other groups. Unlike classification, it needs no
class labels known in advance.
Different algorithms – different results
Regression
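A small sketch of the simplest kind of regression, a least-squares straight line, applied to the "revenue prediction for next year" question. The revenue figures are invented and the class name is illustrative.

```java
final class LinearRegressionSketch {

    // Returns {intercept, slope} of the least-squares line y = intercept + slope * x.
    public static double[] fit(double[] x, double[] y) {
        int n = x.length;
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        double sxy = 0.0; // sum of (x - meanX) * (y - meanY)
        double sxx = 0.0; // sum of (x - meanX)^2
        for (int i = 0; i < n; i++) {
            sxy += (x[i] - meanX) * (y[i] - meanY);
            sxx += (x[i] - meanX) * (x[i] - meanX);
        }
        double slope = sxy / sxx;
        return new double[]{meanY - slope * meanX, slope};
    }

    public static double predict(double[] model, double x) {
        return model[0] + model[1] * x;
    }

    public static void main(String[] args) {
        double[] year = {1, 2, 3, 4};         // hypothetical years of operation
        double[] revenue = {10, 12, 14, 16};  // hypothetical revenue per year
        double[] model = fit(year, revenue);
        System.out.println("Predicted revenue for year 5: " + predict(model, 5));
    }
}
```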
Classification
• Training set of classified examples (supervised learning)
• Test set of non-classified items
• Main goal: find a function (classifier) that maps input data
to a category (class)
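The idea of "a function that maps input data to a category" can be sketched with about the simplest classifier there is: a single threshold learned from a labeled training set, here for the high-risk loan question. The feature (debt-to-income ratio), the data and all names are assumptions made for the example.

```java
import java.util.Arrays;

final class ThresholdClassifierSketch {

    // Training: try the midpoints between neighbouring feature values and keep
    // the threshold with the fewest errors on the labeled examples.
    // Needs at least two training examples.
    public static double train(double[] feature, boolean[] highRisk) {
        double[] sorted = feature.clone();
        Arrays.sort(sorted);
        double best = sorted[0];
        int bestErrors = Integer.MAX_VALUE;
        for (int i = 0; i + 1 < sorted.length; i++) {
            double candidate = (sorted[i] + sorted[i + 1]) / 2.0;
            int errors = 0;
            for (int j = 0; j < feature.length; j++) {
                if ((feature[j] > candidate) != highRisk[j]) {
                    errors++;
                }
            }
            if (errors < bestErrors) {
                bestErrors = errors;
                best = candidate;
            }
        }
        return best;
    }

    // The classifier itself: maps an input value to a class (true = high-risk).
    public static boolean classify(double threshold, double value) {
        return value > threshold;
    }

    public static void main(String[] args) {
        // Training set of classified examples (hypothetical debt-to-income ratios).
        double[] ratio = {0.1, 0.2, 0.3, 0.6, 0.7, 0.9};
        boolean[] highRisk = {false, false, false, true, true, true};
        double threshold = train(ratio, highRisk);

        // Test set of non-classified items.
        for (double applicant : new double[]{0.25, 0.8}) {
            System.out.println(applicant + " -> high-risk: " + classify(threshold, applicant));
        }
    }
}
```

A decision tree can be thought of as many such threshold tests stacked on top of each other.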
Decision trees
Cruel Tree
kNN (k-nearest neighbor)
Is the green circle a blue square or a red triangle?
Let’s ask its neighbors!
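"Asking the neighbors" can be sketched in plain Java as follows: sort the labeled points by distance to the query point and let the k closest ones vote. The coordinates are invented and the names are illustrative.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;

final class KnnSketch {

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            sum += (a[d] - b[d]) * (a[d] - b[d]);
        }
        return sum;
    }

    // Returns the majority label among the k labeled points closest to the query.
    public static String classify(double[][] points, String[] labels, double[] query, int k) {
        Integer[] order = new Integer[points.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        // Sort point indices by distance to the query.
        Arrays.sort(order, Comparator.comparingDouble((Integer i) -> squaredDistance(points[i], query)));

        Map<String, Integer> votes = new HashMap<>();
        for (int n = 0; n < Math.min(k, order.length); n++) {
            votes.merge(labels[order[n]], 1, Integer::sum);
        }

        String best = null;
        int bestVotes = 0;
        for (Map.Entry<String, Integer> e : votes.entrySet()) {
            if (e.getValue() > bestVotes) {
                bestVotes = e.getValue();
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[][] points = {{1, 1}, {1, 2}, {2, 1}, {5, 5}, {6, 5}, {5, 6}};
        String[] labels = {"blue square", "blue square", "blue square",
                           "red triangle", "red triangle", "red triangle"};
        double[] greenCircle = {2, 2};
        System.out.println(classify(points, labels, greenCircle, 3));
    }
}
```

With two classes an odd k avoids tied votes; this sketch does not define which label wins a tie.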
FASHIONABLE LANGUAGES
Octave
Why not Octave?
• A small number of ML algorithms
• All your matrixes are belong to us!
• Single-threaded model
• Java support
• Octave in Java?
Do you like this GUI?
Why not R?
• 25% of R packages are written in Java
• The syntax is too sweet
• You should read 1000 lines of docs to write 1 line of code
• Single-threaded model for 95% of algorithms
Why not Python?
Python is now an idol for young scientists
due to its low barrier to entry
• High-level language
• Have you ever heard of Jython?
• A long way to real high-load production
• We are not Python developers
DM libraries
Think Java
JAVA ECOSYSTEM
Hadoop
How to make features from a Hadoop cluster?
Pig & Hive
Hive
PIG (Triangle count)
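The Pig script from this slide is not part of the transcript. As a stand-in, here is a single-machine Java sketch of what triangle counting computes: for every edge, count the neighbors shared by both endpoints. It assumes an undirected graph whose edges are listed once, with no self-loops; the names are illustrative.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class TriangleCountSketch {

    public static long countTriangles(int[][] edges) {
        // Build adjacency sets for the undirected graph.
        Map<Integer, Set<Integer>> adjacency = new HashMap<>();
        for (int[] e : edges) {
            adjacency.computeIfAbsent(e[0], k -> new HashSet<>()).add(e[1]);
            adjacency.computeIfAbsent(e[1], k -> new HashSet<>()).add(e[0]);
        }

        // For each edge (u, v), every common neighbor w closes a triangle u-v-w.
        long closed = 0;
        for (int[] e : edges) {
            Set<Integer> neighborsOfV = adjacency.get(e[1]);
            for (int w : adjacency.get(e[0])) {
                if (neighborsOfV.contains(w)) {
                    closed++;
                }
            }
        }
        // Each triangle was found once per edge, that is three times.
        return closed / 3;
    }

    public static void main(String[] args) {
        int[][] edges = {{1, 2}, {2, 3}, {1, 3}, {3, 4}, {2, 4}};
        System.out.println("Triangles: " + countTriangles(edges));
    }
}
```

In Pig or MapReduce the same "common neighbor" test becomes a self-join of the edge list, which is one reason graph features are awkward to compute in the key-value model.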
Why do we need a special graph approach?
HOW TO MAKE GRAPH FEATURES
SNA (social network analysis)
MapReduce for iterative calculations
• Reducing a graph problem to the key-value model is highly
complex
• Iterative algorithms become multiple chained M/R jobs, with
the full state saved and re-read at every iteration
Think like a vertex…
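A single-machine sketch of the "think like a vertex" style, assuming a Pregel-like model: in every superstep each vertex looks at the labels of its neighbors and keeps the smallest one, until nothing changes. The result labels each connected component with its smallest vertex id. This is plain Java for illustration, not the API of any graph framework.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class VertexCentricSketch {

    // Vertices are numbered 0..vertexCount-1; edges are undirected.
    public static int[] connectedComponents(int vertexCount, int[][] edges) {
        List<List<Integer>> neighbors = new ArrayList<>();
        for (int v = 0; v < vertexCount; v++) {
            neighbors.add(new ArrayList<>());
        }
        for (int[] e : edges) {
            neighbors.get(e[0]).add(e[1]);
            neighbors.get(e[1]).add(e[0]);
        }

        // Every vertex starts with its own id as its label.
        int[] label = new int[vertexCount];
        for (int v = 0; v < vertexCount; v++) {
            label[v] = v;
        }

        boolean changed = true;
        while (changed) {               // one pass of the loop = one superstep
            changed = false;
            int[] next = label.clone(); // state of the next superstep
            for (int v = 0; v < vertexCount; v++) {
                for (int n : neighbors.get(v)) {
                    // "Message" from neighbor n: its label from the previous superstep.
                    if (label[n] < next[v]) {
                        next[v] = label[n];
                        changed = true;
                    }
                }
            }
            label = next;
        }
        return label;
    }

    public static void main(String[] args) {
        int[][] edges = {{0, 1}, {1, 2}, {3, 4}};
        System.out.println(Arrays.toString(connectedComponents(5, edges)));
    }
}
```

The state lives in memory between supersteps, which is exactly what a chain of M/R jobs cannot do without writing it out and reading it back.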
Data vs Graph
TRAIN MODEL
JDM
Java API for Data Mining: JSR 73 (JDM 1.0) and JSR 247 (JDM 2.0)
• javax.datamining.supervised defines the supervised
function-related interfaces
• javax.datamining.algorithm contains all mining algorithm
subpackages
• JDM 2.0 adds text mining, time series and so on
Who knows Weka?
Weka
• Connectors to R, Octave, Matlab, Hadoop, NoSQL/SQL
databases
• Source code of all algorithms in Java
• Preprocessing tools: discretization, normalization,
resampling, attribute selection, transforming and combining
SPMF
• It’s a codebase of algorithms in the pattern mining field
• It has cool examples and implementations of 109
algorithms
• Cool performance results in its specific area
• The codebase grows very fast
• Not many classification algorithms are covered
Mahout
• Scalable machine learning with Samsara
• Advanced implementations of Java’s Collections Framework
for better performance
• New algorithms will be built on the Spark platform
• Collaborative filtering, classification, clustering,
dimensionality reduction and miscellaneous algorithms are supported
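Code sample Mahout (K-Means)
// Imports this fragment relies on (Hadoop and Mahout 0.9+)
import java.util.List;
import org.apache.hadoop.fs.Path;
import org.apache.mahout.clustering.kmeans.KMeansDriver;
import org.apache.mahout.common.HadoopUtil;
import org.apache.mahout.math.Vector;
// vectorize, writePointsToFile, writeClusterInitialCenters and readAndPrintOutputValues
// are the speaker's own helpers, not shown on the slide
// Read the point values and generate vectors from the input data
// (assuming the helper returns Mahout Vector objects)
final List<Vector> vectors = vectorize(points);
// Write the vectors to Hadoop sequence files
writePointsToFile(configuration, vectors);
// Write the initial centers for the clusters
writeClusterInitialCenters(configuration, vectors);
// Run the K-means algorithm
final Path inputPath = new Path(POINTS_PATH);
final Path clustersPath = new Path(CLUSTERS_PATH);
final Path outputPath = new Path(OUTPUT_PATH);
HadoopUtil.delete(configuration, outputPath);
KMeansDriver.run(configuration, inputPath, clustersPath, outputPath,
    0.001,  // convergence delta
    10,     // maximum number of iterations
    true,   // cluster the input points after the iterations
    0.0,    // cluster classification threshold
    false); // false = run as MapReduce jobs, not sequentially
// Read and print the output values
readAndPrintOutputValues(configuration);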
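To make the convergence delta and iteration limit in the driver call less magical, here is a plain-Java sketch of the underlying K-means loop (Lloyd's algorithm). It is not Mahout code and runs on a single machine; the class name, method names and data are illustrative.

```java
import java.util.Arrays;

final class KMeansSketch {

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            sum += (a[d] - b[d]) * (a[d] - b[d]);
        }
        return sum;
    }

    // Index of the center closest to the point.
    static int nearest(double[][] centers, double[] point) {
        int best = 0;
        for (int c = 1; c < centers.length; c++) {
            if (squaredDistance(centers[c], point) < squaredDistance(centers[best], point)) {
                best = c;
            }
        }
        return best;
    }

    public static double[][] cluster(double[][] points, double[][] initialCenters,
                                     double convergenceDelta, int maxIterations) {
        int k = initialCenters.length;
        int dim = points[0].length;
        double[][] centers = new double[k][];
        for (int c = 0; c < k; c++) {
            centers[c] = initialCenters[c].clone();
        }

        for (int iteration = 0; iteration < maxIterations; iteration++) {
            // Assignment step: add every point to the sum of its nearest center.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (double[] p : points) {
                int c = nearest(centers, p);
                counts[c]++;
                for (int d = 0; d < dim; d++) {
                    sums[c][d] += p[d];
                }
            }

            // Update step: move each center to the mean of its points.
            double maxShift = 0.0;
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) {
                    continue; // an empty cluster keeps its old center
                }
                double[] updated = new double[dim];
                for (int d = 0; d < dim; d++) {
                    updated[d] = sums[c][d] / counts[c];
                }
                maxShift = Math.max(maxShift, Math.sqrt(squaredDistance(centers[c], updated)));
                centers[c] = updated;
            }

            // Stop early once no center moved farther than the convergence delta.
            if (maxShift <= convergenceDelta) {
                break;
            }
        }
        return centers;
    }

    public static void main(String[] args) {
        double[][] points = {{1, 1}, {1, 3}, {9, 1}, {9, 3}};
        double[][] initialCenters = {{0, 0}, {10, 0}};
        System.out.println(Arrays.deepToString(cluster(points, initialCenters, 0.001, 10)));
    }
}
```

The distributed versions follow the same two steps; the assignment step parallelizes naturally because each point is handled independently.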
Hadoop ecosystem
HADOOP IS NOT SEXY
Whaaaat?
Writing a MapReduce job
YARN?
SPARK: the bloody son of MR
• MapReduce in memory
• Up to 50x faster than Hadoop
• The RDD is the basic building block
(an immutable distributed collection of objects)
Mahout’s killer?
MLlib supports
• Classification and regression
• Collaborative filtering
• Clustering
• Dimensionality reduction
• Optimization
Code sample MLlib (K-Means)
// Imports this fragment relies on
import org.apache.spark.mllib.clustering.KMeans;
import org.apache.spark.mllib.clustering.KMeansModel;
// parsedData is a JavaRDD<Vector> and sc is a JavaSparkContext, both created earlier
// Cluster the data into two clusters using KMeans
int numClusters = 2;
int numIterations = 20;
KMeansModel clusters = KMeans.train(parsedData.rdd(), numClusters, numIterations);
// Evaluate the clustering by computing the Within Set Sum of Squared Errors
double WSSSE = clusters.computeCost(parsedData.rdd());
System.out.println("Within Set Sum of Squared Errors = " + WSSSE);
// Save and load the model
clusters.save(sc.sc(), "myModelPath");
KMeansModel sameModel = KMeansModel.load(sc.sc(), "myModelPath");
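For readers who wonder what the Within Set Sum of Squared Errors measures: it is the sum, over all points, of the squared distance to the nearest cluster center. A plain-Java sketch, independent of Spark and with illustrative names and data:

```java
final class WssseSketch {

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            sum += (a[d] - b[d]) * (a[d] - b[d]);
        }
        return sum;
    }

    // Sum over all points of the squared distance to the closest center.
    public static double computeCost(double[][] points, double[][] centers) {
        double cost = 0.0;
        for (double[] p : points) {
            double nearest = Double.POSITIVE_INFINITY;
            for (double[] c : centers) {
                nearest = Math.min(nearest, squaredDistance(p, c));
            }
            cost += nearest;
        }
        return cost;
    }

    public static void main(String[] args) {
        double[][] points = {{1, 1}, {1, 3}, {9, 1}, {9, 3}};
        double[][] centers = {{1, 2}, {9, 2}};
        System.out.println("WSSSE = " + computeCost(points, centers));
    }
}
```

A lower value means tighter clusters, but the value always shrinks as the number of clusters grows, so it is useful for comparing runs with the same k rather than for choosing k blindly.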
MLlib
• .. borrows ideas from scikit-learn (Python lib) and Mahout
• .. runs fully on Spark
• .. is documented
• .. works well for large datasets and parallelized algorithms
It solves all problems!
In conclusion
• Think about your data
• Make friends with a DevOps engineer
• Run Spark
• Learn algorithms
• Write Java code
Hold your data and go ahead!
CALL ME IF YOU WANT TO KNOW MORE
Thanks a lot
Contacts
E-mail : Alexey_Zinovyev@epam.com
Twitter : @zaleslaw
LinkedIn: https://www.linkedin.com/in/zaleslaw

JDD2015: Thorny path to Data Mining projects - Alexey Zinoviev