Machine Learning & Spark
Predict Prices of Houses
About me
Ran Silberman
Architect at Tikal Knowledge
Big Data Consultant
ran@tikalk.com
Predict house prices
Use Case:
Predict house prices in Tel Aviv based on parameters of houses
Use data from Kaggle
Hypotheses:
1. ML: not only for mathematicians!
2. ML + Big-Data = Spark!
Technology Stack
Spark Machine Learning library
Scala
Apache Spark - RDD
HDFS Input File (Block-1 to Block-4)
  -> Load -> RDD (Partition-1 to Partition-4)
  -> Map -> RDD (Partition-1 to Partition-4)
  -> Write -> HDFS Output (Block-1 to Block-4)
Spark MLlib
Build: Training Data (Block-1 to Block-3) -> Load -> RDD (Partition-1 to Partition-3) -> Build Model (Algorithm) -> ML Model (Function)
Predict: Test Data (Block-1 to Block-3) -> Load -> RDD (Partition-1 to Partition-3) -> Predict (using the ML Model)
Spark MLlib API
RDD-based API - Package: spark.mllib
DataFrame-based API - Package: spark.ml
Spark 2.0
Approach
1. Explore the Data
2. Assess Algorithms
3. Put it all together
Explore the Data
Cond  Bsmt Cond  Year built  Bsmt area  Roof Style  Grnd Liv Area  Grg cars  Grg area  Year sold  Sale Price
7     Good       2003        856        Gable       1710           2         548       2008       208500
6     Good       1976        1262       Gable       1262           2         460       2007       181500
7     Exc        2001        920        Hip         1786           2         608       2008       223500
7     Fair       1915        756        Hip         1717           3         642       2006       140000
8     Typical    2000        1145       Gable       2198           3         836       2008       250000
8     No Bsmt    2004        1686       Flat        1694           (null)    636       2007       307000
Data Format
208500 1:7.00 2:5.00 3:2003.00 4:856.00 5:1710.00 6:2.00 7:548.00 8:856.00 9:2.00 10:8.00
181500 1:6.00 2:8.00 3:1976.00 4:1262.00 5:1262.00 6:2.00 7:460.00 8:1262.00 9:2.00 10:6.00
223500 1:7.00 2:5.00 3:2001.00 4:920.00 5:1786.00 6:2.00 7:608.00 8:920.00 9:2.00 10:6.00
Each line starts with the "Label" value, followed by index:value pairs for the 1st feature, 2nd feature, 3rd feature, up to the 10th feature.
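The line layout above can be illustrated with a small parser. This is only a sketch in plain Scala, assuming whitespace-separated tokens and 1-based feature indices; the names `LibSvmLine`, `Row` and `parse` are made up for illustration, and in the talk's code Spark's "libsvm" reader does this work.

```scala
// Illustrative sketch only: not Spark's implementation of the "libsvm" format.
object LibSvmLine {
  // One parsed row: the label (sale price) and a map from feature index to value.
  final case class Row(label: Double, features: Map[Int, Double])

  def parse(line: String): Row = {
    val tokens = line.trim.split("\\s+")
    // The first token is the label.
    val label = tokens.head.toDouble
    // Every remaining token has the form index:value.
    val features = tokens.tail.map { token =>
      val parts = token.split(":")
      parts(0).toInt -> parts(1).toDouble
    }.toMap
    Row(label, features)
  }
}
```

For example, `LibSvmLine.parse("208500 1:7.00 2:5.00 3:2003.00")` yields a row whose label is 208500.0 and whose feature 3 is 2003.0.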
Roof Types
One-Hot Encoding
RoofStyle: ["Gable", "Hip", "Flat"]
[{"Gable", 0}, {"Hip", 1}, {"Flat", 2}]
Line # 1: RoofStyle: "Gable" Vector: [1.0, 0, 0]
Line # 2: RoofStyle: "Hip" Vector: [0, 1.0, 0]
Line # 3: RoofStyle: "Flat" Vector: [0, 0, 1.0]
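The mapping above can be sketched in plain Scala without Spark. The object and method names (`OneHot`, `buildIndex`, `encode`) are assumptions made for this illustration. Note that Spark's own transformers behave a little differently by default: StringIndexer orders labels by frequency, and OneHotEncoder drops the last category, so real output may not match these vectors exactly.

```scala
// Illustrative sketch only: plain-Scala one-hot encoding, not Spark's encoder.
object OneHot {
  // Give each distinct category an index in order of first appearance.
  def buildIndex(categories: Seq[String]): Map[String, Int] =
    categories.distinct.zipWithIndex.toMap

  // Encode one value as a vector with 1.0 at its index and 0.0 elsewhere.
  def encode(index: Map[String, Int], value: String): Vector[Double] = {
    val position = index(value)
    Vector.tabulate(index.size)(i => if (i == position) 1.0 else 0.0)
  }
}
```

With `val index = OneHot.buildIndex(Seq("Gable", "Hip", "Flat"))`, calling `OneHot.encode(index, "Hip")` gives `Vector(0.0, 1.0, 0.0)`.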
Pipelines in Spark MLlib
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}
import org.apache.spark.ml.regression.LinearRegression
// Transform strings to a numeric index
val indexer = new StringIndexer()
  .setInputCol("roofType").setOutputCol("roofIndex")
// Encode index values as a one-hot vector
val encoder = new OneHotEncoder()
  .setInputCol("roofIndex").setOutputCol("features")
// Linear Regression algorithm
val lr = new LinearRegression()
// The Pipeline puts it all together
val pipeline = new Pipeline().setStages(Array(indexer, encoder, lr))
// Train the model
val model = pipeline.fit(dataFrame)
Linear Regression using Spark-MLlib
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.SparkSession
// Set Spark session
val spark = SparkSession.builder().getOrCreate()
// Load data file
val training = spark.read.format("libsvm").load(trainingFile)
// Linear Regression class
val lr = new LinearRegression()
// Train model
val model = lr.fit(training)
// Predict using model (test is a DataFrame of held-out rows)
model.transform(test).show()
// Print RMSE
println(s"RMSE = ${model.summary.rootMeanSquaredError}")
Cost Function - Root Mean Squared Error
A way to measure how well our model predicts
Root Mean Square Error
Linear Regression - 10 Features 49972
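As a reminder of what the metric measures, here is a minimal plain-Scala sketch of RMSE: the square root of the mean of squared differences between actual and predicted prices. The names `Rmse` and `rmse` are illustrative; in the talk's code the value comes from Spark's model summary or evaluator.

```scala
// Illustrative sketch only: RMSE over two equally sized sequences.
object Rmse {
  def rmse(actual: Seq[Double], predicted: Seq[Double]): Double = {
    require(actual.nonEmpty && actual.size == predicted.size,
      "actual and predicted must be non-empty and of equal size")
    // Squared error for each (actual, predicted) pair.
    val squaredErrors = actual.zip(predicted).map { case (a, p) => (a - p) * (a - p) }
    // Mean of the squared errors, then the square root.
    math.sqrt(squaredErrors.sum / squaredErrors.size)
  }
}
```

Because the errors are squared before averaging, a few large misses raise the RMSE more than many small ones, and the result is in the same unit as the sale price.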
Find Correlations
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.rdd.RDD
val seriesX: RDD[Double] = sc.parallelize(grdLivArea) // X-axis
val seriesY: RDD[Double] = sc.parallelize(SalePrice) // Y-axis
// Compute correlation using Spark MLlib's Statistics (Pearson by default)
val correlation: Double = Statistics.corr(seriesX, seriesY)
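For intuition about the number this returns, the Pearson correlation coefficient can be sketched in plain Scala on in-memory sequences. The names `Pearson` and `corr` are illustrative and this is not how Spark computes it on an RDD; it only shows the formula: covariance divided by the product of the standard deviations.

```scala
// Illustrative sketch only: Pearson correlation of two equally sized sequences.
object Pearson {
  def corr(xs: Seq[Double], ys: Seq[Double]): Double = {
    require(xs.size == ys.size && xs.size > 1, "need two sequences of equal size, at least 2 points")
    val meanX = xs.sum / xs.size
    val meanY = ys.sum / ys.size
    // Sum of products of deviations (unnormalised covariance).
    val cov = xs.zip(ys).map { case (x, y) => (x - meanX) * (y - meanY) }.sum
    // Sums of squared deviations (unnormalised variances).
    val varX = xs.map(x => (x - meanX) * (x - meanX)).sum
    val varY = ys.map(y => (y - meanY) * (y - meanY)).sum
    cov / math.sqrt(varX * varY)
  }
}
```

A value near 1.0 means the feature rises with the sale price, a value near -1.0 means it falls as the price rises, and a value near 0 suggests little linear relationship.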
2. Find Best Correlated Features
Correlation Matrix - Best features to use
Bivariate Analysis of several Features
Linear Regression - Improve Training Data
val spark = SparkSession.builder().getOrCreate()
// Feature set with the best correlation
val training = spark.read.format("libsvm").load(betterFeaturesFile)
val lr = new LinearRegression()
val model = lr.fit(training)
model.transform(test).show()
println(s"RMSE = ${model.summary.rootMeanSquaredError}")
Root Mean Square Error
Linear Regression - 10 Arbitrary Features 49972
Linear Regression - 10 Best Features 35654
Linear Regression (Bivariate Analysis)
Gradient Descent of Cost Function (= RMSE) on one feature
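The idea of descending a cost function for a single feature can be sketched as follows. This is a hedged illustration in plain Scala, not Spark's solver: it runs batch gradient descent on the mean squared error (which has the same minimum as RMSE) to fit y = slope * x + intercept. The names `GradientDescent1D`, `fit`, `learningRate` and `iterations` are assumptions made for the example.

```scala
// Illustrative sketch only: batch gradient descent for one-feature linear regression.
object GradientDescent1D {
  // Returns (slope, intercept) after the given number of iterations.
  def fit(xs: Seq[Double], ys: Seq[Double], learningRate: Double, iterations: Int): (Double, Double) = {
    require(xs.nonEmpty && xs.size == ys.size, "xs and ys must be non-empty and of equal size")
    val n = xs.size.toDouble
    var slope = 0.0
    var intercept = 0.0
    for (_ <- 1 to iterations) {
      // Prediction error for every point under the current parameters.
      val errors = xs.zip(ys).map { case (x, y) => slope * x + intercept - y }
      // Partial derivatives of the mean squared error.
      val gradSlope = 2.0 / n * errors.zip(xs).map { case (e, x) => e * x }.sum
      val gradIntercept = 2.0 / n * errors.sum
      // Step against the gradient.
      slope -= learningRate * gradSlope
      intercept -= learningRate * gradIntercept
    }
    (slope, intercept)
  }
}
```

On data that lies exactly on y = 2x + 1, for example `fit(Seq(0.0, 1.0, 2.0, 3.0), Seq(1.0, 3.0, 5.0, 7.0), 0.05, 5000)`, the result approaches (2.0, 1.0). With real house data the feature would usually need scaling first, otherwise a small learning rate is required to keep the updates from diverging.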
Linear Regression - Set more iterations
// Set a higher number of iterations (was 10)
val lr = new LinearRegression().setMaxIter(50)
val model = lr.fit(training)
Output Window
numIterations: 33
RMSE: 35583
Root Mean Square Error
Linear Regression - 10 Arbitrary Features 49972
Linear Regression - 10 Best Features 35654
Linear Regression - Higher num of iterations 35583
Decision Tree Regressor
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.regression.DecisionTreeRegressor
// Load data file
val data = spark.read.format("libsvm").load(logFile)
// Split data into training and test sets
val Array(train, test) = data.randomSplit(Array(0.7, 0.3))
// Decision Tree Regressor class
val dt = new DecisionTreeRegressor()
  .setMaxDepth(3)
// Train model
val model = dt.fit(train)
// Predict using model
val predictions = model.transform(test)
// Print RMSE (the evaluator's default metric)
val rmse = new RegressionEvaluator().evaluate(predictions)
println(s"RMSE = $rmse")
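The 70/30 split used in this slide can be pictured with a small plain-Scala sketch. It assumes a per-row random draw, which means the resulting sizes are only approximately 70/30; the names `SplitSketch`, `randomSplit`, `trainFraction` and `seed` are illustrative and this is not Spark's implementation.

```scala
import scala.util.Random

// Illustrative sketch only: random train/test split of an in-memory sequence.
object SplitSketch {
  // Each row goes to the training set with probability trainFraction, otherwise to the test set.
  def randomSplit[A](rows: Seq[A], trainFraction: Double, seed: Long): (Seq[A], Seq[A]) = {
    require(trainFraction >= 0.0 && trainFraction <= 1.0, "trainFraction must be in [0, 1]")
    val rng = new Random(seed)
    rows.partition(_ => rng.nextDouble() < trainFraction)
  }
}
```

Fixing the seed makes the split reproducible, which helps when comparing RMSE values across algorithms: otherwise part of the difference between two runs may come from a different test set rather than from the model.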
Decision Tree (diagram, each split has a yes branch and a no branch)
a: Overall Quality (1-10)
b: Grd. Living Area (ft.)
c: Year built
level-1 split: a > 7
level-2 splits: a > 6, a > 8
level-3 splits: b > 1389, b > 1964, b > 1964, c > 1996
Leaf predictions (in extraction order): $126,901, $162,206, $194,954, $257,919, $252,771, $320,193, $551,666, $368,137
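How a trained regression tree turns a house into a price can be sketched as a walk from the root to a leaf. The sketch below is plain Scala with made-up names (`TreeSketch`, `Split`, `Leaf`, `predict`). The example tree reuses two thresholds from the slide (a > 7 and b > 1964), but its shape and leaf prices are hypothetical placeholders, not the tree from the talk.

```scala
import scala.annotation.tailrec

// Illustrative sketch only: evaluating a small regression tree.
object TreeSketch {
  sealed trait Node
  // A leaf holds the predicted price.
  final case class Leaf(prediction: Double) extends Node
  // A split sends the row to `yes` when features(feature) > threshold, otherwise to `no`.
  final case class Split(feature: String, threshold: Double, yes: Node, no: Node) extends Node

  @tailrec
  def predict(node: Node, features: Map[String, Double]): Double = node match {
    case Leaf(prediction) => prediction
    case Split(feature, threshold, yes, no) =>
      predict(if (features(feature) > threshold) yes else no, features)
  }

  // Hypothetical two-level tree: thresholds echo the slide, leaf prices are placeholders.
  val example: Node =
    Split("a", 7.0,
      yes = Split("b", 1964.0, yes = Leaf(300000.0), no = Leaf(250000.0)),
      no = Leaf(150000.0))
}
```

For instance, `TreeSketch.predict(TreeSketch.example, Map("a" -> 8.0, "b" -> 2000.0))` follows the yes branch twice and returns the placeholder 300000.0. A deeper tree has more leaves and therefore more distinct prices it can predict.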
Root Mean Square Error
Linear Regression - 10 Arbitrary Features 49972
Linear Regression - 10 Best Features 35654
Linear Regression - Higher num of iterations 35583
Decision Tree 46643
Decision Tree - set Tree Depth
val dt = new DecisionTreeRegressor()
.setMaxDepth(3)
RMSE = 46643
val dt = new DecisionTreeRegressor()
.setMaxDepth(20)
RMSE = 40571
val dt = new DecisionTreeRegressor()
.setMaxDepth(8)
RMSE = 37198
Bias-Variance tradeoff: Tree-depth
Root Mean Square Error
Linear Regression - 10 Arbitrary Features 49972
Linear Regression - 10 Best Features 35654
Linear Regression - Higher num of iterations 35583
Decision Tree (depth = 3) 46643
Decision Tree (depth = 20) 40571
Decision Tree (depth = 8) 37198
Random Forest Regressor
import org.apache.spark.ml.regression.RandomForestRegressor
val data = spark.read.format("libsvm").load(logFile)
val Array(train, test) = data.randomSplit(Array(0.7, 0.3))
// Random Forest Regressor class; set the number of trees
val dt = new RandomForestRegressor()
  .setMaxDepth(8).setNumTrees(10)
val model = dt.fit(train)
val predictions = model.transform(test)
val rmse = new RegressionEvaluator().evaluate(predictions)
println(s"RMSE = $rmse")
Random Forest
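For regression, a random forest combines its trees by averaging their individual predictions. The plain-Scala sketch below shows only that averaging step; training each tree on a random sample of rows and features is left out. The names `ForestSketch`, `Tree` and `predict` are illustrative.

```scala
// Illustrative sketch only: a forest's regression prediction as the mean of its trees.
object ForestSketch {
  // A "tree" here is any function from a feature vector to a predicted price.
  type Tree = Vector[Double] => Double

  def predict(trees: Seq[Tree], features: Vector[Double]): Double = {
    require(trees.nonEmpty, "a forest needs at least one tree")
    // Ask every tree, then average the answers.
    trees.map(tree => tree(features)).sum / trees.size
  }
}
```

Averaging many trees that were trained on different samples tends to smooth out the errors of any single deep tree, which is one reason a forest can reach a lower RMSE than one tree of the same depth.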
Root Mean Square Error
Linear Regression - 10 Arbitrary Features 49972
Linear Regression - 10 Best Features 35654
Linear Regression - Higher num of iterations 35583
Decision Tree (depth = 3) 46643
Decision Tree (depth = 20) 40571
Decision Tree (depth = 8) 37198
Random Forest (depth = 8, #Trees = 10) 35023
Random Forest - set Number of Trees
val dt = new RandomForestRegressor()
.setMaxDepth(8).setNumTrees(10)
RMSE = 35023
val dt = new RandomForestRegressor()
.setMaxDepth(20).setNumTrees(100)
RMSE = 34607
Root Mean Square Error
Linear Regression - 10 Arbitrary Features 49972
Linear Regression - 10 Best Features 35654
Linear Regression - Higher num of iterations 35583
Decision Tree (depth = 3) 46643
Decision Tree (depth = 20) 40571
Decision Tree (depth = 8) 37198
Random Forest (depth = 8, #Trees = 10) 35023
Random Forest (depth = 20, #Trees = 100) 34607
Summary
1. Explored the Data - data transformations, correlations
2. Assessed Algorithms: Regression, Tree, Forest
3. Tuned parameters: # iterations, tree depth, # of trees
Thank You!

"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 

Machine learning using Spark

  • 1. Machine Learning & Spark: Predict Prices of Houses
  • 2. About me: Ran Silberman, Architect at Tikal Knowledge, Big Data Consultant. mailto:ran@tikalk.com
  • 3. Predict house prices. Use case: predict house prices in Tel Aviv based on parameters of houses. Use data from Kaggle.
  • 4. Hypotheses: 1. ML: not only for mathematicians! 2. ML + Big-Data = Spark!
  • 5. Technology Stack: Spark Machine Learning library, Scala
  • 6. Apache Spark - RDD (diagram): an HDFS input file (Block-1 to Block-4) is loaded into an RDD (Partition-1 to Partition-4), mapped to a second RDD (Partition-1 to Partition-4), and written to HDFS output (Block-1 to Block-4).
  • 7. Spark MLlib (diagram): training data (Block-1 to Block-3) is loaded into an RDD (Partition-1 to Partition-3); an algorithm function builds an ML model from it. Test data (Block-1 to Block-3) is loaded into an RDD (Partition-1 to Partition-3), and the model is used to predict.
  • 8. Spark MLlib API: RDD-based API (package: spark.mllib); DataFrame-based API (package: spark.ml). Spark 2.0
  • 9. Approach: 1. Explore the Data 2. Assess Algorithms 3. Put it all together
  • 11.
      Cond | Bsmt Cond | Year built | Bsmt area | Roof Style | Grnd Liv Area | Grg cars | Grg area | Year sold | Sale Price
      7    | Good      | 2003       | 856       | Gable      | 1710          | 2        | 548      | 2008      | 208500
      6    | Good      | 1976       | 1262      | Gable      | 1262          | 2        | 460      | 2007      | 181500
      7    | Exc       | 2001       | 920       | Hip        | 1786          | 2        | 608      | 2008      | 223500
      7    | Fair      | 1915       | 756       | Hip        | 1717          | 3        | 642      | 2006      | 140000
      8    | Typical   | 2000       | 1145      | Gable      | 2198          | 3        | 836      | 2008      | 250000
      8    | No Bsmt   | 2004       | 1686      | Flat       | 1694          | (null)   | 636      | 2007      | 307000
  • 12. Data Format
      208500 1:7.00 2:5.00 3:2003.00 4:856.00 5:1710.00 6:2.00 7:548.00 8:856.00 9:2.00 10:8.00
      181500 1:6.00 2:8.00 3:1976.00 4:1262.00 5:1262.00 6:2.00 7:460.00 8:1262.00 9:2.00 10:6.00
      223500 1:7.00 2:5.00 3:2001.00 4:920.00 5:1786.00 6:2.00 7:608.00 8:920.00 9:2.00 10:6.00
      Each line: the “Label” value, then the 1st feature, 2nd feature, 3rd feature, and so on up to the 10th feature, as index:value pairs.
  • 15. One-Hot Encoding
      RoofStyle: [“Gable”, “Hip”, “Flat”]
      [{“Gable”, 0}, {“Hip”, 1}, {“Flat”, 2}]
      Line # 1: RoofStyle: “Gable”  Vector: [1.0, 0, 0]
      Line # 2: RoofStyle: “Hip”    Vector: [0, 1.0, 0]
      Line # 3: RoofStyle: “Flat”   Vector: [0, 0, 1.0]
  • 16. Pipelines in Spark MLlib
      import org.apache.spark.ml.Pipeline
      import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}
      import org.apache.spark.ml.regression.LinearRegression
      // Transform strings to a numeric index
      val indexer = new StringIndexer().setInputCol("roofType").setOutputCol("roofIndex")
      // Encode index values as a one-hot vector
      val encoder = new OneHotEncoder().setInputCol("roofIndex").setOutputCol("features")
      // Linear Regression algorithm
      val lr = new LinearRegression()
      // Pipeline: put it all together
      val pipeline = new Pipeline().setStages(Array(indexer, encoder, lr))
      // Train the model
      val model = pipeline.fit(dataFrame)
  • 18. Linear Regression using Spark-MLlib
      import org.apache.spark.ml.regression.LinearRegression
      import org.apache.spark.sql.SparkSession
      val spark = SparkSession.builder().getOrCreate()              // set Spark session
      val training = spark.read.format("libsvm").load(trainingFile) // load data file
      val lr = new LinearRegression()                               // Linear Regression class
      val model = lr.fit(training)                                  // train model
      model.transform(test).show()                                  // predict using model
      println(s" ${model.summary.rootMeanSquaredError}")            // print RMSE
  • 19. Cost Function - Root Mean Squared Error: a way to measure how well our model can predict
  • 20. Root Mean Square Error
      Linear Regression - 10 Features: 49972
  • 21. Find Correlations
      import org.apache.spark.mllib.stat.Statistics
      import org.apache.spark.rdd.RDD
      val seriesX: RDD[Double] = sc.parallelize(grdLivArea) // X-axis
      val seriesY: RDD[Double] = sc.parallelize(SalePrice)  // Y-axis
      // Compute correlation using Spark MLlib's Statistics object
      val correlation: Double = Statistics.corr(seriesX, seriesY)
  • 22. 2. Find Best Correlated Features
  • 23. Correlation Matrix - Best features to use
  • 24. Bivariate Analysis of several Features
  • 25. Linear Regression - Improve Training Data
      val spark = SparkSession.builder().getOrCreate()
      // Feature set with the best correlation
      val training = spark.read.format("libsvm").load(betterFeaturesFile)
      val lr = new LinearRegression()
      val model = lr.fit(training)
      model.transform(test).show()
      println(s" ${model.summary.rootMeanSquaredError}")
  • 26. Root Mean Square Error
      Linear Regression - 10 Arbitrary Features: 49972
      Linear Regression - 10 Best Features: 35654
  • 28. Gradient Descent of Cost Function (= RMSE) on one feature
  • 29. Linear Regression - Set more iterations
      // Set a higher number of iterations (was 10); the setting must be applied before fit()
      val lr = new LinearRegression().setMaxIter(50)
      val model = lr.fit(training)
      Output window: numIterations: 33, RMSE: 35583
  • 30. Root Mean Square Error
      Linear Regression - 10 Arbitrary Features: 49972
      Linear Regression - 10 Best Features: 35654
      Linear Regression - Higher num of iterations: 35583
  • 31. Decision Tree Regressor
      import org.apache.spark.ml.evaluation.RegressionEvaluator
      import org.apache.spark.ml.regression.DecisionTreeRegressor
      val data = spark.read.format("libsvm").load(logFile)        // load data file
      val Array(train, test) = data.randomSplit(Array(0.7, 0.3))  // split data into training and test
      val dt = new DecisionTreeRegressor().setMaxDepth(3)         // Decision Tree Regressor class
      val model = dt.fit(train)                                   // train model
      val predictions = model.transform(test)                     // predict using model
      val rmse = new RegressionEvaluator().evaluate(predictions)  // RMSE is the evaluator's default metric
      println(s"RMSE = $rmse")                                    // print RMSE
  • 32. Decision Tree (diagram): a three-level tree (level-1, level-2, level-3) with yes/no branches.
      Split conditions shown: a > 7, a > 6, b > 1389, a > 8, b > 1964, b > 1964, c > 1996
      Leaf predictions shown: $126,901, $162,206, $194,954, $257,919, $252,771, $320,193, $551,666, $368,137
      a: Overall Quality (1-10); b: Grd. Living Area (ft.); c: Year built
  • 33. Root Mean Square Error
      Linear Regression - 10 Arbitrary Features: 49972
      Linear Regression - 10 Best Features: 35654
      Linear Regression - Higher num of iterations: 35583
      Decision Tree: 46643
  • 34. Decision Tree - set Tree Depth
      val dt = new DecisionTreeRegressor().setMaxDepth(3)   // RMSE = 46643
      val dt = new DecisionTreeRegressor().setMaxDepth(20)  // RMSE = 40571
      val dt = new DecisionTreeRegressor().setMaxDepth(8)   // RMSE = 37198
  • 36. Root Mean Square Error
      Linear Regression - 10 Arbitrary Features: 49972
      Linear Regression - 10 Best Features: 35654
      Linear Regression - Higher num of iterations: 35583
      Decision Tree (depth = 3): 46643
      Decision Tree (depth = 20): 40571
      Decision Tree (depth = 8): 37198
  • 37. Random Forest Regressor
      import org.apache.spark.ml.evaluation.RegressionEvaluator
      import org.apache.spark.ml.regression.RandomForestRegressor
      val data = spark.read.format("libsvm").load(logFile)
      val Array(train, test) = data.randomSplit(Array(0.7, 0.3))
      // Random Forest Regressor class; set the number of trees
      val dt = new RandomForestRegressor().setMaxDepth(8).setNumTrees(20)
      val model = dt.fit(train)
      val predictions = model.transform(test)
      val rmse = new RegressionEvaluator().evaluate(predictions)
      println(s"RMSE = $rmse")
  • 39. Root Mean Square Error
      Linear Regression - 10 Arbitrary Features: 49972
      Linear Regression - 10 Best Features: 35654
      Linear Regression - Higher num of iterations: 35583
      Decision Tree (depth = 3): 46643
      Decision Tree (depth = 20): 40571
      Decision Tree (depth = 8): 37198
      Random Forest (depth = 8, #Trees = 10): 35023
  • 40. Random Forest - set Number of Trees
      val dt = new RandomForestRegressor().setMaxDepth(8).setNumTrees(10)    // RMSE = 35023
      val dt = new RandomForestRegressor().setMaxDepth(20).setNumTrees(100)  // RMSE = 34607
  • 41. Root Mean Square Error
      Linear Regression - 10 Arbitrary Features: 49972
      Linear Regression - 10 Best Features: 35654
      Linear Regression - Higher num of iterations: 35583
      Decision Tree (depth = 3): 46643
      Decision Tree (depth = 20): 40571
      Decision Tree (depth = 8): 37198
      Random Forest (depth = 8, #Trees = 10): 35023
      Random Forest (depth = 20, #Trees = 100): 34607
  • 42. Summary: 1. Explored the data: data transformations, correlations. 2. Assessed algorithms: Regression, Tree, Forest. 3. Played with parameters: # of iterations, tree depth, # of trees.
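
The slides lean on two quantities throughout: the root mean squared error used to compare models, and the correlation used to pick features. The following is a minimal plain-Scala sketch of both, without Spark, to show what is being computed. The object name `Metrics` and its method names are assumptions made for illustration; in the slides these values come from `model.summary.rootMeanSquaredError`, `RegressionEvaluator` and `Statistics.corr` (which computes the Pearson correlation by default).

```scala
object Metrics {
  /** Root mean squared error between predicted and actual values. */
  def rmse(predicted: Seq[Double], actual: Seq[Double]): Double = {
    require(predicted.nonEmpty && predicted.length == actual.length,
      "inputs must be non-empty and of equal length")
    val squaredErrors = predicted.zip(actual).map { case (p, a) => (p - a) * (p - a) }
    math.sqrt(squaredErrors.sum / squaredErrors.length)
  }

  /** Pearson correlation coefficient between two equal-length series. */
  def pearson(xs: Seq[Double], ys: Seq[Double]): Double = {
    require(xs.length == ys.length && xs.length > 1,
      "need two series of equal length with at least two points")
    val meanX = xs.sum / xs.length
    val meanY = ys.sum / ys.length
    // Sum of products of deviations, and sums of squared deviations
    val cov  = xs.zip(ys).map { case (x, y) => (x - meanX) * (y - meanY) }.sum
    val varX = xs.map(x => (x - meanX) * (x - meanX)).sum
    val varY = ys.map(y => (y - meanY) * (y - meanY)).sum
    cov / math.sqrt(varX * varY)
  }
}
```

For example, `Metrics.pearson(grdLivArea, salePrice)` on two in-memory `Seq[Double]` columns gives a value between -1 and 1; features whose value is closest to 1 or -1 are the candidates for the "10 best features" set. Because RMSE is expressed in the same unit as the label, the figures in the comparison table can be read directly as a typical error in sale price.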

Editor's notes

  1. Look how simple this code is… It can deal with very big data. But all features need to have float values...
  2. Look how simple this code is… It can deal with very big data. But all features need to have float values...
  3. Look how simple this code is… It can deal with very big data. But things are not that simple...
  4. Look how simple this code is… It can deal with very big data. But things are not that simple...
  5. Look how simple this code is… It can deal with very big data. But things are not that simple...
  6. Some problems with trees: if the tree has only a few levels of depth, then all predicted values come from a predefined list (one value per leaf). If the tree has too many levels of depth, we get an overfitting problem.
  7. Look how simple this code is… It can deal with very big data. But things are not that simple...
  8. Look how simple this code is… It can deal with very big data. But things are not that simple...
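
The point in the notes about shallow trees can be sketched in a few lines of plain Scala: a regression tree can only ever return the value stored in one of its leaves, so a tree of depth d has at most 2^d possible predictions. The tree below is hypothetical; its thresholds and prices are made up for illustration and are not taken from the trained model in the slides. `ShallowTree`, `Node`, `Leaf` and `Split` are names introduced here, not Spark classes.

```scala
object ShallowTree {
  sealed trait Node
  final case class Leaf(prediction: Double) extends Node
  final case class Split(feature: Int, threshold: Double, yes: Node, no: Node) extends Node

  /** Walk the tree: take the `yes` branch when the feature value exceeds the threshold. */
  @annotation.tailrec
  def predict(node: Node, features: Vector[Double]): Double = node match {
    case Leaf(p)              => p
    case Split(f, t, yes, no) => predict(if (features(f) > t) yes else no, features)
  }

  /** Every value the tree can output: one per leaf. */
  def leafValues(node: Node): Set[Double] = node match {
    case Leaf(p)              => Set(p)
    case Split(_, _, yes, no) => leafValues(yes) ++ leafValues(no)
  }

  // Hypothetical depth-2 tree. Feature 0: overall quality, feature 1: living area.
  val tree: Node =
    Split(0, 7.0,
      Split(1, 2000.0, Leaf(320000.0), Leaf(250000.0)),
      Split(1, 1400.0, Leaf(160000.0), Leaf(125000.0)))
}
```

Whatever house is passed to `ShallowTree.predict(ShallowTree.tree, ...)`, the result is one of the four values in `ShallowTree.leafValues(ShallowTree.tree)`. Raising the depth enlarges that list, but with enough levels each leaf may end up covering only a handful of training rows, which is the overfitting problem the note mentions.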