16. Pipelines in Spark MLlib
val indexer = new StringIndexer()                   // transform strings to a numeric index
  .setInputCol("roofType").setOutputCol("roofIndex")
val encoder = new OneHotEncoder()                   // encode index values to a one-hot vector
  .setInputCol("roofIndex").setOutputCol("features")
val lr = new LinearRegression()                     // Linear Regression algorithm
val pipeline = new Pipeline().setStages(Array(indexer, encoder, lr)) // the Pipeline puts it all together
val model = pipeline.fit(dataFrame)                 // train the model
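Once fitted, the PipelineModel replays every stage (indexer, encoder, regression) on new data in one call. A sketch, where `newHouses` is a hypothetical DataFrame with the same `roofType` column as the training data:

```scala
// `model` is the fitted PipelineModel from above; `newHouses` is a
// hypothetical DataFrame with a "roofType" column.
// transform() runs all three stages in order: index, encode, predict.
val predictions = model.transform(newHouses)
predictions.select("roofType", "prediction").show()
```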
18. Linear Regression using Spark-MLlib
val spark = SparkSession.builder().getOrCreate()               // set up the Spark session
val training = spark.read.format("libsvm").load(trainingFile)  // load the data file
val lr = new LinearRegression()                                // Linear Regression class
val model = lr.fit(training)                                   // train the model
model.transform(test).show                                     // predict using the model
println(s"${model.summary.rootMeanSquaredError}")              // print RMSE
19. Cost Function - Root Mean Squared Error
A way to measure how well the model predicts: the smaller the error, the better the fit.
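As a sketch (with made-up labels and predictions), RMSE is the square root of the mean squared difference between predicted and actual values:

```scala
// RMSE = sqrt( mean( (prediction - label)^2 ) )
val labels      = Seq(200000.0, 150000.0, 320000.0)  // actual prices (made up)
val predictions = Seq(210000.0, 140000.0, 300000.0)  // model outputs (made up)

val rmse = math.sqrt(
  labels.zip(predictions)
        .map { case (y, yHat) => math.pow(y - yHat, 2) }
        .sum / labels.size
)
println(f"RMSE = $rmse%.0f")   // ≈ 14142; lower is better
```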
25. Linear Regression - Improve Training Data
val spark = SparkSession.builder().getOrCreate()
val training = spark.read.format("libsvm").load(betterFeaturesFile)  // feature set with the best correlation
val lr = new LinearRegression()
val model = lr.fit(training)
model.transform(test).show
println(s"${model.summary.rootMeanSquaredError}")
26. Root Mean Square Error
Linear Regression - 10 Arbitrary Features 49972
Linear Regression - 10 Best Features 35654
29. Linear Regression - Set more iterations
val lr = new LinearRegression()
  .setMaxIter(50)        // set a higher number of iterations (was 10);
                         // must be set on the estimator before fit, not on the fitted model
val model = lr.fit(training)
Output window:
numIterations: 33
RMSE: 35583
30. Root Mean Square Error
Linear Regression - 10 Arbitrary Features 49972
Linear Regression - 10 Best Features 35654
Linear Regression - Higher num of iterations 35583
31. Decision Tree Regressor
val data = spark.read.format("libsvm").load(logFile)        // load the data file
val Array(train, test) = data.randomSplit(Array(0.7, 0.3))  // split data into training and test sets
val dt = new DecisionTreeRegressor()                        // Decision Tree Regressor class
  .setMaxDepth(3)
val model = dt.fit(train)                                   // train the model
val predictions = model.transform(test)                     // predict using the model
val rmse = new RegressionEvaluator().evaluate(predictions)
println(s"RMSE = $rmse")                                    // print RMSE
32. Decision Tree
[Diagram: a depth-3 decision tree. The root (level-1) splits on a > 7; level-2 nodes split on a > 6 and a > 8; level-3 nodes split on b > 1389, b > 1964, b > 1964, and c > 1996. Leaf predictions range from $126,901 up to $551,666.]
a: Overall Quality (1-10)
b: Grd. Living Area (ft.)
c: Year built
33. Root Mean Square Error
Linear Regression - 10 Arbitrary Features 49972
Linear Regression - 10 Best Features 35654
Linear Regression - Higher num of iterations 35583
Decision Tree 46643
34. Decision Tree - set Tree Depth
val dt = new DecisionTreeRegressor()
.setMaxDepth(3)
RMSE = 46643
val dt = new DecisionTreeRegressor()
.setMaxDepth(20)
RMSE = 40571
val dt = new DecisionTreeRegressor()
.setMaxDepth(8)
RMSE = 37198
36. Root Mean Square Error
Linear Regression - 10 Arbitrary Features 49972
Linear Regression - 10 Best Features 35654
Linear Regression - Higher num of iterations 35583
Decision Tree (depth = 3) 46643
Decision Tree (depth = 20) 40571
Decision Tree (depth = 8) 37198
37. Random Forest Regressor
val data = spark.read.format("libsvm").load(logFile)        // load the data file
val Array(train, test) = data.randomSplit(Array(0.7, 0.3))  // split data into training and test sets
val rf = new RandomForestRegressor()                        // Random Forest Regressor class
  .setMaxDepth(8).setNumTrees(10)                           // set the number of trees (10, matching the RMSE table)
val model = rf.fit(train)                                   // train the model
val predictions = model.transform(test)                     // predict using the model
val rmse = new RegressionEvaluator().evaluate(predictions)
println(s"RMSE = $rmse")                                    // print RMSE
39. Root Mean Square Error
Linear Regression - 10 Arbitrary Features 49972
Linear Regression - 10 Best Features 35654
Linear Regression - Higher num of iterations 35583
Decision Tree (depth = 3) 46643
Decision Tree (depth = 20) 40571
Decision Tree (depth = 8) 37198
Random Forest (depth = 8, #Trees = 10) 35023
40. Random Forest - set Number of Trees
val dt = new RandomForestRegressor()
.setMaxDepth(8).setNumTrees(10)
RMSE = 35023
val dt = new RandomForestRegressor()
.setMaxDepth(20).setNumTrees(100)
RMSE = 34607
41. Root Mean Square Error
Linear Regression - 10 Arbitrary Features 49972
Linear Regression - 10 Best Features 35654
Linear Regression - Higher num of iterations 35583
Decision Tree (depth = 3) 46643
Decision Tree (depth = 20) 40571
Decision Tree (depth = 8) 37198
Random Forest (depth = 8, #Trees = 10) 35023
Random Forest (depth = 20, #Trees = 100) 34607
42. Summary
1. Explored the data: data transformations, correlations
2. Assessed algorithms: Linear Regression, Decision Tree, Random Forest
3. Played with parameters: number of iterations, tree depth, number of trees
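Point 3 can also be automated: rather than trying depths and tree counts by hand, MLlib's CrossValidator can search a parameter grid. A sketch, assuming the `train` DataFrame from the earlier slides:

```scala
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.regression.RandomForestRegressor
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val rf = new RandomForestRegressor()

// Try every combination of depth and tree count.
val grid = new ParamGridBuilder()
  .addGrid(rf.maxDepth, Array(4, 8, 12))
  .addGrid(rf.numTrees, Array(10, 50, 100))
  .build()

val cv = new CrossValidator()
  .setEstimator(rf)
  .setEvaluator(new RegressionEvaluator().setMetricName("rmse"))
  .setEstimatorParamMaps(grid)
  .setNumFolds(3)          // 3-fold cross-validation

val best = cv.fit(train)   // keeps the combination with the lowest RMSE
```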
Note how simple this code is, and how it scales to very big data. But all features must first be converted to numeric (float) values...
Note how simple this code is, and how it scales to very big data. But things are not that simple...
Some problems with trees:
If the tree is too shallow, every prediction comes from a small, predefined set of leaf values.
If the tree is too deep, the model overfits the training data.
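The first point can be made concrete: a binary tree of depth d has at most 2^d leaves, and each leaf stores one constant prediction, so a shallow regressor can only ever output a handful of distinct prices. A sketch (not tied to the slides' data):

```scala
// A binary decision tree of depth d has at most 2^d leaves,
// each holding one constant prediction.
def maxDistinctPredictions(depth: Int): Long = math.pow(2, depth).toLong

Seq(3, 8, 20).foreach { d =>
  println(s"depth $d -> at most ${maxDistinctPredictions(d)} distinct predicted values")
}
// depth 3 gives at most 8 possible prices; depth 20 gives over a
// million leaves, enough to memorize (overfit) the training set.
```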