Machine Learning Using Spark

Machine Learning & Spark
Predict Prices of Houses

About me: Ran Silberman
Architect at Tikal Knowledge
Big Data Consultant
mailto:ran@tikalk.com

Predict house prices
Use Case:
Predict house prices in Tel Aviv based on parameters of houses
Use data from Kaggle

Hypotheses:
1. ML: not only for mathematicians!
2. ML + Big-Data = Spark!

Technology Stack
Spark Machine Learning library
Scala

Apache Spark - RDD
[Diagram: an HDFS input file (Block-1 to Block-4) is loaded into an RDD (Partition-1 to Partition-4), a Map step produces a second RDD with the same four partitions, and the result is written to HDFS output (Block-1 to Block-4).]

Spark MLlib
[Diagram: training data (Block-1 to Block-3) is loaded into an RDD (Partition-1 to Partition-3), and an algorithm function builds an ML model from it. Test data (Block-1 to Block-3) is loaded into a second RDD (Partition-1 to Partition-3), and the model is used to predict on it.]

Spark MLlib API
RDD-based API: package spark.mllib
DataFrame-based API: package spark.ml (the primary API since Spark 2.0)

Approach
1. Explore the Data
2. Assess Algorithms
3. Put it all together

Cond  Bsmt Cond  Year built  Bsmt area  Roof Style  Grnd Liv Area  Grg cars  Grg area  Year sold  Sale Price
7     Good       2003        856        Gable       1710           2         548       2008       208500
6     Good       1976        1262       Gable       1262           2         460       2007       181500
7     Exc        2001        920        Hip         1786           2         608       2008       223500
7     Fair       1915        756        Hip         1717           3         642       2006       140000
8     Typical    2000        1145       Gable       2198           3         836       2008       250000
8     No Bsmt    2004        1686       Flat        1694           (null)    636       2007       307000

Data Format
208500 1:7.00 2:5.00 3:2003.00 4:856.00 5:1710.00 6:2.00 7:548.00 8:856.00 9:2.00 10:8.00
181500 1:6.00 2:8.00 3:1976.00 4:1262.00 5:1262.00 6:2.00 7:460.00 8:1262.00 9:2.00 10:6.00
223500 1:7.00 2:5.00 3:2001.00 4:920.00 5:1786.00 6:2.00 7:608.00 8:920.00 9:2.00 10:6.00
Each line starts with the "label" value, followed by index:value pairs for the 1st, 2nd, 3rd ... 10th feature.
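To make the line layout concrete, here is a minimal sketch in plain Scala (no Spark dependency) that splits one such line into its label and its features. The names LibSvmLine, LabeledRow and parseLine are hypothetical, introduced only for this illustration; in Spark the "libsvm" data source does this parsing for you.

```scala
// Plain Scala illustration; LibSvmLine, LabeledRow and parseLine are
// hypothetical names, not part of Spark.
object LibSvmLine {
  final case class LabeledRow(label: Double, features: Map[Int, Double])

  // Splits "label idx:value idx:value ..." into the label and a map
  // from the 1-based feature index to the feature value.
  def parseLine(line: String): LabeledRow = {
    val tokens = line.trim.split("\\s+")
    val features = tokens.tail.map { token =>
      val Array(index, value) = token.split(":")
      index.toInt -> value.toDouble
    }.toMap
    LabeledRow(tokens.head.toDouble, features)
  }

  def main(args: Array[String]): Unit = {
    val row = parseLine(
      "208500 1:7.00 2:5.00 3:2003.00 4:856.00 5:1710.00 6:2.00 7:548.00 8:856.00 9:2.00 10:8.00")
    println(s"label = ${row.label}, feature 3 = ${row.features(3)}")
  }
}
```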

One-Hot Encoding
RoofStyle: ["Gable", "Hip", "Flat"]
[{"Gable", 0}, {"Hip", 1}, {"Flat", 2}]
Line #1: RoofStyle: "Gable" Vector: [1.0, 0, 0]
Line #2: RoofStyle: "Hip" Vector: [0, 1.0, 0]
Line #3: RoofStyle: "Flat" Vector: [0, 0, 1.0]
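The mapping above can be sketched in a few lines of plain Scala. This mirrors the slide, not Spark's exact output: Spark's own StringIndexer and OneHotEncoder differ in details such as index ordering and whether the last category is dropped. The names OneHot, buildIndex and encode are hypothetical.

```scala
// Plain Scala illustration of the slide's two steps; OneHot, buildIndex
// and encode are hypothetical names, not Spark APIs.
object OneHot {
  // Step 1: give each distinct category an index, in order of first appearance.
  def buildIndex(categories: Seq[String]): Map[String, Int] =
    categories.distinct.zipWithIndex.toMap

  // Step 2: a vector of zeros with a single 1.0 at the category's index.
  def encode(index: Map[String, Int], category: String): Vector[Double] =
    Vector.tabulate(index.size)(i => if (i == index(category)) 1.0 else 0.0)

  def main(args: Array[String]): Unit = {
    val index = buildIndex(Seq("Gable", "Hip", "Flat"))
    Seq("Gable", "Hip", "Flat").foreach { style =>
      println(s"$style -> ${encode(index, style)}")
    }
  }
}
```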

Pipelines in Spark MLlib
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}
import org.apache.spark.ml.regression.LinearRegression

// Transform strings to a numeric index
val indexer = new StringIndexer()
  .setInputCol("roofType").setOutputCol("roofIndex")
// Encode index values to a one-hot vector
val encoder = new OneHotEncoder()
  .setInputCol("roofIndex").setOutputCol("features")
// Linear Regression algorithm
val lr = new LinearRegression()
// The Pipeline puts it all together
val pipeline = new Pipeline().setStages(Array(indexer, encoder, lr))
// Train the model
val model = pipeline.fit(dataFrame)

Linear Regression using Spark-MLlib
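import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.SparkSession

// Set up the Spark session
val spark = SparkSession.builder().getOrCreate()
// Load the data file (LIBSVM format)
val training = spark.read.format("libsvm").load(trainingFile)
// Linear Regression class
val lr = new LinearRegression()
// Train the model
val model = lr.fit(training)
// Predict using the model; test is a held-out DataFrame (its loading is not shown)
model.transform(test).show()
// Print the RMSE (model.summary reports it on the training data)
println(s" ${model.summary.rootMeanSquaredError}")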

Cost Function - Root Mean Squared Error
A way to measure how well our model predicts: the lower the RMSE, the better.
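As a rough illustration of what this number means, the sketch below computes RMSE in plain Scala: square each prediction error, average the squares, and take the square root. The object and function names are hypothetical, and the predicted prices in main are made up for the example. Only the actual prices come from the data shown earlier.

```scala
// Plain Scala illustration; Rmse and rmse are hypothetical names.
object Rmse {
  // Root of the mean of the squared differences between actual and predicted values.
  def rmse(actual: Seq[Double], predicted: Seq[Double]): Double = {
    require(actual.nonEmpty && actual.size == predicted.size,
      "actual and predicted must be non-empty and of equal length")
    val squaredErrors = actual.zip(predicted).map { case (y, yHat) => math.pow(y - yHat, 2) }
    math.sqrt(squaredErrors.sum / squaredErrors.size)
  }

  def main(args: Array[String]): Unit = {
    val actual = Seq(208500.0, 181500.0, 223500.0)    // sale prices from the sample rows
    val predicted = Seq(200000.0, 190000.0, 230000.0) // hypothetical predictions
    println(s"RMSE = ${rmse(actual, predicted)}")
  }
}
```

Because the errors are squared before averaging, a few large misses raise the RMSE more than many small ones.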

Root Mean Square Error
Linear Regression - 10 Features 49972

Find Correlations
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.rdd.RDD

val seriesX: RDD[Double] = sc.parallelize(grdLivArea) // X-axis
val seriesY: RDD[Double] = sc.parallelize(SalePrice) // Y-axis
// Compute the correlation (Pearson by default) using Spark MLlib's Statistics
val correlation: Double = Statistics.corr(seriesX, seriesY)
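For intuition about the number that comes back, here is a minimal plain-Scala sketch of the Pearson correlation coefficient. It is an illustration of the formula, not Spark's implementation, and the names Pearson and corr are hypothetical. The sample values in main are the Grnd Liv Area and Sale Price columns of the six rows shown earlier.

```scala
// Plain Scala illustration; Pearson and corr are hypothetical names.
object Pearson {
  // Covariance of x and y divided by the product of their standard deviations.
  // The result lies between -1 (perfect inverse relation) and 1 (perfect direct relation).
  def corr(xs: Seq[Double], ys: Seq[Double]): Double = {
    require(xs.size == ys.size && xs.size > 1, "need two series of equal length > 1")
    val meanX = xs.sum / xs.size
    val meanY = ys.sum / ys.size
    val cov = xs.zip(ys).map { case (x, y) => (x - meanX) * (y - meanY) }.sum
    val varX = xs.map(x => math.pow(x - meanX, 2)).sum
    val varY = ys.map(y => math.pow(y - meanY, 2)).sum
    cov / math.sqrt(varX * varY)
  }

  def main(args: Array[String]): Unit = {
    val grdLivArea = Seq(1710.0, 1262.0, 1786.0, 1717.0, 2198.0, 1694.0)
    val salePrice = Seq(208500.0, 181500.0, 223500.0, 140000.0, 250000.0, 307000.0)
    println(s"correlation = ${corr(grdLivArea, salePrice)}")
  }
}
```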

2. Find Best Correlated Features

Correlation Matrix - Best features to use

Bivariate Analysis of several Features

Linear Regression - Improve Training Data
val spark = SparkSession.builder().getOrCreate()
// Load the feature set with the best correlation
val training = spark.read.format("libsvm").load(betterFeaturesFile)
// lr is a LinearRegression instance (new LinearRegression())
val model = lr.fit(training)
model.transform(test).show()
println(s" ${model.summary.rootMeanSquaredError}")

Linear Regression - 10 Arbitrary Features 49972
Linear Regression - 10 Best Features 35654

Linear Regression (Bivariate Analysis)

Gradient Descent of Cost Function (= RMSE) on one feature
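The idea can be sketched in plain Scala for a single feature: start from a guess, measure the slope of the cost, and step downhill repeatedly. This is a generic textbook illustration, not the solver Spark uses internally. It minimizes the mean squared error, which has its minimum at the same point as the RMSE. The names, the learning rate and the toy data are all assumptions made for the example.

```scala
// Plain Scala illustration; GradientDescent and fit are hypothetical names.
object GradientDescent {
  // Fits y ~ w * x + b by gradient descent on the mean squared error.
  // Returns the pair (w, b) after the given number of iterations.
  def fit(xs: Seq[Double], ys: Seq[Double], learningRate: Double, iterations: Int): (Double, Double) = {
    require(xs.nonEmpty && xs.size == ys.size, "need two non-empty series of equal length")
    val n = xs.size.toDouble
    (1 to iterations).foldLeft((0.0, 0.0)) { case ((w, b), _) =>
      // Prediction error for every point with the current w and b.
      val errors = xs.zip(ys).map { case (x, y) => (w * x + b) - y }
      // Partial derivatives of the mean squared error with respect to w and b.
      val gradW = 2.0 / n * errors.zip(xs).map { case (e, x) => e * x }.sum
      val gradB = 2.0 / n * errors.sum
      // Step against the gradient.
      (w - learningRate * gradW, b - learningRate * gradB)
    }
  }

  def main(args: Array[String]): Unit = {
    val xs = Seq(1.0, 2.0, 3.0, 4.0, 5.0)
    val ys = xs.map(x => 2.0 * x + 1.0) // toy data on the line y = 2x + 1
    val (w, b) = fit(xs, ys, learningRate = 0.05, iterations = 2000)
    println(s"w = $w, b = $b")
  }
}
```

More iterations move w and b closer to the minimum, which is why the iteration limit matters for the final RMSE. With unscaled features such as living area, the learning rate would need to be much smaller, or the features rescaled first.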

Linear Regression - Set more iterations
// Set a higher number of iterations (was 10); maxIter is set on the estimator before fit
val model = lr.setMaxIter(50).fit(training)
Output window:
numIterations: 33
RMSE: 35583

Linear Regression - Higher num of iterations 35583

Decision Tree Regressor
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.regression.DecisionTreeRegressor
// Load the data file
val data = spark.read.format("libsvm").load(logFile)
// Split the data into training and test sets
val Array(train, test) = data.randomSplit(Array(0.7, 0.3))
// Decision Tree Regressor class
val dt = new DecisionTreeRegressor()
  .setMaxDepth(3)
// Train the model
val model = dt.fit(train)
// Predict using the model
val predictions = model.transform(test)
// Print the RMSE (the evaluator's default metric)
val rmse = new RegressionEvaluator().evaluate(predictions)
println(s"RMSE = $rmse")

Decision Tree
[Diagram: a regression tree with three levels of yes/no splits.
level-1 split: a > 7
level-2 splits: a > 6, a > 8
level-3 splits: b > 1389, b > 1964, b > 1964, c > 1996
Leaf predictions: $126,901, $162,206, $194,954, $257,919, $252,771, $320,193, $551,666, $368,137]
a: Overall Quality (1-10)
b: Grd. Living Area (sq. ft.)
c: Year built
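How such a tree turns a house into a price can be sketched in plain Scala: each split asks one yes/no question about a feature, and each leaf holds a predicted price. This is only an illustration of the structure, not Spark's DecisionTreeRegressionModel. The names are hypothetical, and the one-split tree in main uses made-up leaf values, not the values from the diagram.

```scala
// Plain Scala illustration; TreeSketch, Node, Leaf, Split and predict
// are hypothetical names, not Spark APIs.
object TreeSketch {
  sealed trait Node
  // A leaf holds the predicted value.
  final case class Leaf(prediction: Double) extends Node
  // A split asks "features(featureIndex) > threshold?" and picks a branch.
  final case class Split(featureIndex: Int, threshold: Double, ifYes: Node, ifNo: Node) extends Node

  // Walks from the root to a leaf, following the yes/no answers.
  @annotation.tailrec
  def predict(node: Node, features: Vector[Double]): Double = node match {
    case Leaf(prediction) => prediction
    case Split(i, t, yes, no) => predict(if (features(i) > t) yes else no, features)
  }

  def main(args: Array[String]): Unit = {
    // Feature 0 stands for overall quality; both leaf values are made up.
    val tree = Split(0, 7.0, ifYes = Leaf(300000.0), ifNo = Leaf(150000.0))
    println(predict(tree, Vector(8.0)))
    println(predict(tree, Vector(6.0)))
  }
}
```

A deeper tree simply nests more Split nodes, which is what the maximum depth parameter controls.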

Decision Tree 46643

Decision Tree - set Tree Depth
.setMaxDepth(3)
RMSE = 46643
.setMaxDepth(20)
RMSE = 40571
.setMaxDepth(8)
RMSE = 37198

Bias-Variance tradeoff: Tree-depth

Decision Tree (depth = 3) 46643

Random Forest Regressor
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.regression.RandomForestRegressor

val data = spark.read.format("libsvm").load(logFile)
val Array(train, test) = data.randomSplit(Array(0.7, 0.3))
// Random Forest Regressor class; set the number of trees
val rf = new RandomForestRegressor()
  .setMaxDepth(8).setNumTrees(20)
val model = rf.fit(train)
val predictions = model.transform(test)
val rmse = new RegressionEvaluator().evaluate(predictions)
println(s"RMSE = $rmse")

Random Forest (depth = 8, #Trees = 10) 35023

Random Forest - set Number of Trees
RMSE = 35023
RMSE = 34607

Summary
1. Explored the data: data transformations, correlations
2. Assessed algorithms: Linear Regression, Decision Tree, Random Forest
3. Played with parameters: # iterations, tree depth, # of trees
