Extending Spark ML
Estimators and Transformers
kroszk@
Built with
public APIs*
*Scala only - see developer for details.
Holden:
● My name is Holden Karau
● Preferred pronouns are she/her
● I’m a Principal Software Engineer at IBM’s Spark Technology Center
● Apache Spark committer (as of January!) :)
● previously Alpine, Databricks, Google, Foursquare & Amazon
● co-author of Learning Spark & Fast Data processing with Spark
○ co-author of a new book focused on Spark performance coming this year*
● @holdenkarau
● Slide share http://www.slideshare.net/hkarau
● Linkedin https://www.linkedin.com/in/holdenkarau
● Github https://github.com/holdenk
● Spark Videos http://bit.ly/holdenSparkVideos
Seth:
● Data Scientist at Cloudera
● Previously machine learning engineer at IBM’s Spark Technology Center
● Two years contributing to Spark MLlib
● Twitter: @shendrickson16
● Linkedin https://www.linkedin.com/in/sethah
● Github https://github.com/sethah
● SlideShare http://www.slideshare.net/SethHendrickson
IBM Spark Technology Center
Founded in 2015.
Location:
Physical: 505 Howard St., San Francisco CA
Web: http://spark.tc Twitter: @apachespark_tc
Mission:
Contribute intellectual and technical capital to the Apache Spark
community.
Make the core technology enterprise- and cloud-ready.
Build data science skills to drive intelligence into business
applications — http://bigdatauniversity.com
Key statistics:
About 50 developers, co-located with 25 IBM designers.
Major contributions to Apache Spark http://jiras.spark.tc
Apache SystemML is now an Apache Incubator project.
Founding member of UC Berkeley AMPLab and RISE Lab
Member of R Consortium and Scala Center
Who I think you wonderful humans are?
● Nice enough people
● Don’t mind pictures of cats
● Might know some Apache Spark
● Possibly know some Scala
● Think machine learning is kind of cool
● Don’t overly mind a grab-bag of topics
Lori Erickson
What are we going to talk about?
● What Spark ML pipelines look like
● What Estimators and Transformers are
● How to implement both of them
● What tools can help us
● Publishing your fancy new Spark model so others (like me) can use it!
● Holden will of course try and sell you many copies of her new book if you
have an expense account.
Loading data with Spark SQL (DataSets)
sparkSession.read returns a DataFrameReader
We can specify general properties & data specific options
● option(“key”, “value”)
○ spark-csv ones we will use are header & inferSchema
● format(“formatName”)
○ built in formats include parquet, jdbc, etc.
● load(“path”)
Jess Johnson
Loading some simple CSV data
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("delimiter", ";")
  .format("csv")
  .load("hdfs:///user/data/admissions.csv")
Jess Johnson
Spark ML Pipelines
data → Pipeline Stage → Pipeline Stage → Pipeline Stage → Pipeline Stage → ... → ?
(together the stages form a Pipeline)
Spark ML Pipelines
data → Pipeline Stage → Pipeline Stage → Pipeline Stage → Pipeline Stage → ... → ?
The Pipeline as a whole is also a pipeline stage!
Two main types of pipeline stages
Transformer: data → data
Estimator: data → transformer
Pipelines are estimators
Pipeline (Transformer → Transformer → Estimator): data → model
Also an estimator!
PipelineModels are transformers
PipelineModel (Transformer → Transformer → Transformer): data → data
Also a transformer!
How are transformers made?
Estimator.fit(data) → Transformer
class Estimator extends PipelineStage {
  def fit(dataset: Dataset[_]): Transformer = {
    // magic happens here
  }
}
How is new data made?
Transformer.transform(data) → new data
class Transformer extends PipelineStage {
  def transform(df: Dataset[_]): DataFrame
}
Feature transformations
+-----+-----+----+--------+
|admit| gre| gpa|prestige|
+-----+-----+----+--------+
| no|380.0|3.61| 3.0|
| yes|660.0|3.67| 3.0|
| yes|800.0| 4.0| 1.0|
| yes|640.0|3.19| 4.0|
| no|520.0|2.93| 4.0|
+-----+-----+----+--------+
val assembler = new VectorAssembler()
  .setInputCols(Array("gre", "gpa", "prestige"))
  .setOutputCol("features")
val df2 = assembler.transform(df)
VectorAssembler
+-----+-----+----+--------+----------------+
|admit| gre| gpa|prestige| features|
+-----+-----+----+--------+----------------+
| no|380.0|3.61| 3.0|[380.0,3.61,3.0]|
| yes|660.0|3.67| 3.0|[660.0,3.67,3.0]|
| yes|800.0| 4.0| 1.0| [800.0,4.0,1.0]|
| yes|640.0|3.19| 4.0|[640.0,3.19,4.0]|
| no|520.0|2.93| 4.0|[520.0,2.93,4.0]|
+-----+-----+----+--------+----------------+
Train a classifier on the transformed data
StringIndexer
StringIndexerModel
val si = new StringIndexer().setInputCol("admit").setOutputCol("label")
val siModel = si.fit(df2)
val df3 = siModel.transform(df2)
+-----+-----+----+--------+----------------+
|admit| gre| gpa|prestige| features|
+-----+-----+----+--------+----------------+
| no|380.0|3.61| 3.0|[380.0,3.61,3.0]|
| yes|660.0|3.67| 3.0|[660.0,3.67,3.0]|
| yes|800.0| 4.0| 1.0| [800.0,4.0,1.0]|
| yes|640.0|3.19| 4.0|[640.0,3.19,4.0]|
| no|520.0|2.93| 4.0|[520.0,2.93,4.0]|
+-----+-----+----+--------+----------------+
+-----+-----+----+--------+----------------+-----+
|admit| gre| gpa|prestige| features|label|
+-----+-----+----+--------+----------------+-----+
| no|380.0|3.61| 3.0|[380.0,3.61,3.0]| 0.0|
| yes|660.0|3.67| 3.0|[660.0,3.67,3.0]| 1.0|
| yes|800.0| 4.0| 1.0| [800.0,4.0,1.0]| 1.0|
| yes|640.0|3.19| 4.0|[640.0,3.19,4.0]| 1.0|
| no|520.0|2.93| 4.0|[520.0,2.93,4.0]| 0.0|
+-----+-----+----+--------+----------------+-----+
Train a classifier on the transformed data
+----------------+-----+
| features|label|
+----------------+-----+
|[380.0,3.61,3.0]| 0.0|
|[660.0,3.67,3.0]| 1.0|
| [800.0,4.0,1.0]| 1.0|
|[640.0,3.19,4.0]| 1.0|
|[520.0,2.93,4.0]| 0.0|
+----------------+-----+
DecisionTreeClassifier
DecisionTreeClassificationModel
+----------------+-----+----------+
| features|label|prediction|
+----------------+-----+----------+
|[380.0,3.61,3.0]| 0.0| 0.0|
|[660.0,3.67,3.0]| 1.0| 0.0|
| [800.0,4.0,1.0]| 1.0| 1.0|
|[640.0,3.19,4.0]| 1.0| 1.0|
|[520.0,2.93,4.0]| 0.0| 0.0|
+----------------+-----+----------+
val dt = new DecisionTreeClassifier()
val dtModel = dt.fit(df3)
val df4 = dtModel.transform(df3)
Or just throw it all in a pipeline
● Keeping track of intermediate data and calling fit/transform on every stage is
way too much work
● This problem is worse when more stages are used
● Use a pipeline instead!
val assembler = new VectorAssembler()
assembler.setInputCols(Array("gre", "gpa", "prestige")).setOutputCol("features")
val sb = new StringIndexer()
sb.setInputCol("admit").setOutputCol("label")
val dt = new DecisionTreeClassifier()
val pipeline = new Pipeline()
pipeline.setStages(Array(assembler, sb, dt))
val pipelineModel = pipeline.fit(df)
Yay! You have an ML pipeline!
Photo by Jessica Fiess-Hill
Pipeline API has many models:
● org.apache.spark.ml.classification
○ LogisticRegression, DecisionTreeClassifier, GBTClassifier, etc.
● org.apache.spark.ml.regression
○ DecisionTreeRegressor, GBTRegressor, IsotonicRegression, LinearRegression, etc.
● org.apache.spark.ml.recommendation
○ ALS
● You can also check out spark-packages for some more
● But possibly not your special AwesomeFooBazinatorML
carterse
& data prep stages...
● org.apache.spark.ml.feature
○ ~30 elements from VectorAssembler to Tokenizer, to PCA, etc.
● Often simpler to understand while getting started with
building our own stages
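For example, two of the built-in feature stages in action (a hedged sketch - the text column "review" is made up for illustration):
import org.apache.spark.ml.feature.{Binarizer, Tokenizer}
// Tokenizer: split free-form text into an array of words (hypothetical "review" column)
val tok = new Tokenizer().setInputCol("review").setOutputCol("words")
// Binarizer: threshold a numeric column into 0.0/1.0 values
val bin = new Binarizer().setInputCol("gpa").setOutputCol("highGpa").setThreshold(3.5)
val withFeatures = bin.transform(tok.transform(df))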
carterse
So now begins our adventure to add stages
So what does a pipeline stage look like?
Must provide:
● transformSchema (used to validate input schema is
reasonable) & copy
Often have:
● Special params for configuration (so we can do
meta-algorithms)
Wendy Piersall
Building a simple transformer:
class HardCodedWordCountStage(override val uid: String) extends Transformer {
def this() = this(Identifiable.randomUID("hardcodedwordcount"))
def copy(extra: ParamMap): HardCodedWordCountStage = {
defaultCopy(extra)
}
...
}
Not to be confused with the Transformers franchise from Hasbro and Tomy.
Verify the input schema is reasonable:
override def transformSchema(schema: StructType): StructType = {
  // Check that the input type is a string
  val idx = schema.fieldIndex("happy_pandas")
  val field = schema.fields(idx)
  if (field.dataType != StringType) {
    throw new Exception(s"Input type ${field.dataType} did not match input type StringType")
  }
  // Add the return field
  schema.add(StructField("happy_panda_counts", IntegerType, false))
}
How is transformSchema used?
● When you call fit on a pipeline it calls transformSchema
on the pipeline stages in order
● This is used to verify that things should work
● Ideally allows pipelines to fail fast when misconfigured,
instead of at the final stage of a 48-hour process
● Doesn’t always work that way :p
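A quick illustration with the HardCodedWordCountStage above (assuming df has no "happy_pandas" column):
val wordCount = new HardCodedWordCountStage()
val badPipeline = new Pipeline().setStages(Array(wordCount))
// fit() validates the schema through every stage up front, so this throws during
// schema checking instead of after earlier stages have already chewed on the data
badPipeline.fit(df)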
Tricia Hall
Do the “work” (e.g. predict labels or w/e):
def transform(df: Dataset[_]): DataFrame = {
val wordcount = udf { in: String => in.split(" ").size }
df.select(col("*"),
wordcount(df.col("happy_pandas")).as("happy_panda_counts"))
}
vic15
What about configuring our stage?
class ConfigurableWordCount(override val uid: String) extends Transformer {
  final val inputCol = new Param[String](this, "inputCol", "The input column")
  final val outputCol = new Param[String](this, "outputCol", "The output column")
  def setInputCol(value: String): this.type = set(inputCol, value)
  def setOutputCol(value: String): this.type = set(outputCol, value)
Jason Wesley Upton
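A possible continuation of ConfigurableWordCount (not on the original slide, so treat it as an assumption; it reuses the same imports as the hard-coded version) - the work is the same, just driven by the configured params:
  override def transformSchema(schema: StructType): StructType = {
    // Check that the configured input column is a string
    val field = schema($(inputCol))
    if (field.dataType != StringType) {
      throw new Exception(s"Input type ${field.dataType} did not match input type StringType")
    }
    // Add the configured output column
    schema.add(StructField($(outputCol), IntegerType, false))
  }
  def transform(df: Dataset[_]): DataFrame = {
    val wordcount = udf { in: String => in.split(" ").size }
    df.select(col("*"), wordcount(df.col($(inputCol))).as($(outputCol)))
  }
  def copy(extra: ParamMap): ConfigurableWordCount = defaultCopy(extra)
}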
So why do we configure it that way?
● Allow meta algorithms to work on it
● If you look inside of spark you’ll see “sharedParams” for
common params (like input column)
● We can’t access those unless we pretend to be inside of
org.apache.spark - so we have to make our own
Tricia Hall
So how to make an estimator?
● Very similar: instead of directly providing transform,
provide a `fit` which returns a “model”, which implements
the transformer interface shown above
● Also take a look at the algorithms in Spark itself (helpful
traits you can mixin to take care of many common things).
● Let’s look at a simple one now!
sneakerdog
A simple string indexer estimator
class SimpleIndexer(override val uid: String) extends
Estimator[SimpleIndexerModel] with SimpleIndexerParams {
….
override def fit(dataset: Dataset[_]): SimpleIndexerModel = {
import dataset.sparkSession.implicits._
val words = dataset.select(dataset($(inputCol)).as[String]).distinct
.collect()
new SimpleIndexerModel(uid, words)
}
}
Quick aside: What’s that “$(inputCol)”?
● How you get access to a configuration parameter
● Inside the stage only (externally use getInputCol, just like Java™ :p)
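The SimpleIndexerParams trait used by SimpleIndexer isn't shown here; a minimal sketch of what it might contain (names are assumptions):
import org.apache.spark.ml.param.{Param, Params}
trait SimpleIndexerParams extends Params {
  final val inputCol = new Param[String](this, "inputCol", "The input column")
  final val outputCol = new Param[String](this, "outputCol", "The output column")
  // "Java-style" getters for use from outside the stage
  final def getInputCol: String = $(inputCol)
  final def getOutputCol: String = $(outputCol)
}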
And our friend the transformer is back:
class SimpleIndexerModel(
override val uid: String, words: Array[String]) extends
Model[SimpleIndexerModel] with SimpleIndexerParams {
...
private val labelToIndex: Map[String, Double] = words.zipWithIndex.
map{case (x, y) => (x, y.toDouble)}.toMap
override def transform(dataset: Dataset[_]): DataFrame = {
val indexer = udf { label: String => labelToIndex(label) }
dataset.select(col("*"),
indexer(dataset($(inputCol)).cast(StringType)).as($(outputCol)))
Still not to be confused with the Transformers franchise from Hasbro and Tomy.
Ok so how do you make the train function?
● Read some papers on the algorithm(s) you care about
● Most likely some iterative approach (pro-tip: RDDs >
Datasets for iterative)
○ Seth has some interesting work around pluggable
optimizers
● Closed form solution? Go have a party!
What else can you add to your models?
● Put in an ML pipeline
● Do hyper-parameter tuning
And if you have some coffee left over:
● Persistence*
○ MLWriter & MLReader give you the basics
○ You’ll have to do a lot of work yourself :( (rough sketch below)
● Serving*
*With enough coffee. Not guaranteed.
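As a rough sketch of the persistence route for the SimpleIndexerModel above (lots of assumptions: words is exposed as a val, and we skip the uid & params metadata a real implementation would also save):
import org.apache.hadoop.fs.Path
import org.apache.spark.ml.util.{Identifiable, MLReader, MLWriter}
class SimpleIndexerModelWriter(instance: SimpleIndexerModel) extends MLWriter {
  override protected def saveImpl(path: String): Unit = {
    // Persist just the learned words - real writers also save uid & params metadata
    val dataPath = new Path(path, "data").toString
    sparkSession.createDataFrame(Seq(Tuple1(instance.words))).toDF("words")
      .write.parquet(dataPath)
  }
}
class SimpleIndexerModelReader extends MLReader[SimpleIndexerModel] {
  override def load(path: String): SimpleIndexerModel = {
    val dataPath = new Path(path, "data").toString
    val words = sparkSession.read.parquet(dataPath)
      .select("words").head.getSeq[String](0).toArray
    // A real reader would restore the original uid & params instead of making new ones
    new SimpleIndexerModel(Identifiable.randomUID("simpleindexer"), words)
  }
}
On the model itself you would mix in MLWritable and have write return the writer above.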
Ok so I put my new fancy thing on GitHub
● Yay thank you!
● Please publish to maven central
● Also consider listing on spark-packages + user@ list
○ Let me know ( holden@pigscanfly.ca ) :)
● Think of the Python users (and I guess the R users) too?
Custom Estimators/Transformers in the Wild
Classification/Regression
xgboost
Deep Learning!
MXNet
Feature Transformation
FeatureHasher
More resources:
● High Performance Spark Example Repo has some
sample models
○ Of course buy several copies of the book - it is the gift of the season :p
● The models inside of Spark itself (use some internal APIs
but a good starting point)
● Nick Pentreath’s FeatureHasher
● O’Reilly radar blog post
https://www.oreilly.com/learning/extend-structured-streaming-for-spark-ml
Captain Pancakes
Learning Spark
Fast Data Processing with Spark (Out of Date)
Fast Data Processing with Spark (2nd edition)
Advanced Analytics with Spark
Spark in Action
Coming soon:
High Performance Spark
Learning PySpark
The next book…..
Available in “Early Release”*:
● Buy from O’Reilly - http://bit.ly/highPerfSpark
● Extending ML is covered in Chapter 9
Get notified when updated & finished:
● http://www.highperformancespark.com
● https://twitter.com/highperfspark
● Should be finished between May 22nd ~ June 18th :D
* Early Release means extra mistakes, but also a chance to help us make a more awesome
book.
And some upcoming talks:
● June
○ Berlin Buzzwords
○ Scala Swarm (Porto, Portugal)
k thnx bye :)
If you care about Spark testing and
don’t hate surveys:
http://bit.ly/holdenTestingSpark
Will tweet results
“eventually” @holdenkarau
Any PySpark Users: Have some
simple UDFs you wish ran faster
that you are willing to share?:
http://bit.ly/pySparkUDF
Pssst: Have feedback on the presentation? Give me a
shout (holden@pigscanfly.ca) if you feel comfortable doing
so :)
Bonus/Appendix slides
Cross-validation
because saving a test set is effort
● Automagically* fit your model params
● Because thinking is effort
● org.apache.spark.ml.tuning has the tools
Jonathan Kotta
Cross-validation
because saving a test set is effort & a reason to integrate
// ParamGridBuilder constructs an Array of parameter combinations.
val paramGrid: Array[ParamMap] = new ParamGridBuilder()
  .addGrid(nb.smoothing, Array(0.1, 0.5, 1.0, 2.0))
  .build()
val cv = new CrossValidator()
  .setEstimator(pipeline)
  // An evaluator is required - CrossValidator has no default one
  .setEvaluator(new BinaryClassificationEvaluator())
  .setEstimatorParamMaps(paramGrid)
val cvModel = cv.fit(df)
val bestModel = cvModel.bestModel
Jonathan Kotta
So what does a pipeline stage look like?
Are either an:
● Estimator - has a method called “fit” which returns a
Transformer (e.g. NaiveBayes, etc.)
● Transformer - no need to train, can directly transform (e.g.
HashingTF, VectorAssembler, etc.) (with transform)
Wendy Piersall
We’ve left out a lot of “transformSchema”...
● It is necessary (but I’m lazy)
● But there are helper classes that can implement some of
the boilerplate we’ve been skipping
● Classifier & Estimator base classes are your friends
● They provide transformSchema
Let’s make a Classifier* :)
// Example only - not for production use.
class SimpleNaiveBayes(val uid: String)
extends Classifier[Vector, SimpleNaiveBayes, SimpleNaiveBayesModel] {
(Vector is the input type; SimpleNaiveBayesModel is the trained model)
Let’s make a Classifier* :)
override def train(ds: Dataset[_]): SimpleNaiveBayesModel = {
import ds.sparkSession.implicits._
ds.cache()
….
…
….
}
If you reallllly want to see inside the ...s (1/5)
// Get the number of features by peeking at the first row
val numFeatures: Integer = ds.select(col($(featuresCol))).head
  .get(0).asInstanceOf[Vector].size
// Determine the number of records for each class
val groupedByLabel = ds.select(col($(labelCol)).as[Double]).groupByKey(x => x)
val classCounts = groupedByLabel.agg(count("*").as[Long])
  .sort(col("value")).collect().toMap
// Select the labels and features so we can more easily map over them.
// Note: we do this as a DataFrame using the untyped API because the Vector
// UDT is no longer public.
val df = ds.select(col($(labelCol)).cast(DoubleType), col($(featuresCol)))
If you reallllly want to see inside the ...s (2/5)
// Note: you can use getNumClasses & extractLabeledPoints to get an RDD instead.
// Using the RDD approach is common when integrating with legacy machine learning code
// or iterative algorithms which can create large query plans.
// Here we use `Datasets` since neither of those apply.
// Compute the number of documents
val numDocs = ds.count
// Get the number of classes.
// Note this estimator assumes they start at 0 and go to numClasses
val numClasses = getNumClasses(ds)
If you reallllly want to see inside the ...s (3/5)
// Figure out the non-zero frequency of each feature for each label and
// output label-index pairs using a case class to make it easier to work with.
val labelCounts: Dataset[LabeledToken] = df.flatMap {
  case Row(label: Double, features: Vector) =>
    // Pair each feature value with its (0-based) index and keep the non-zero ones
    features.toArray.zip(Stream from 0)
      .filter{case (v, idx) => v != 0.0}
      .map{case (v, idx) => LabeledToken(label, idx)}
}
// Use the typed Dataset aggregation API to count the number of non-zero
// features for each label-feature index.
val aggregatedCounts: Array[((Double, Integer), Long)] = labelCounts
.groupByKey(x => (x.label, x.index))
.agg(count("*").as[Long]).collect()
val theta = Array.fill(numClasses)(new Array[Double](numFeatures))
If you reallllly want to see inside the ...s (4/5)
// Compute the denominator for the general priors
val piLogDenom = math.log(numDocs + numClasses)
// Compute the priors for each class
val pi = classCounts.map{case(_, cc) =>
math.log(cc.toDouble) - piLogDenom }.toArray
// For each label/feature update the probabilities
aggregatedCounts.foreach{case ((label, featureIndex), count) =>
// log of number of documents for this label + 2.0 (smoothing)
val thetaLogDenom = math.log(
classCounts.get(label).map(_.toDouble).getOrElse(0.0) + 2.0)
theta(label.toInt)(featureIndex) = math.log(count + 1.0) - thetaLogDenom
}
// Unpersist now that we are done computing everything
ds.unpersist()
If you reallllly want to see inside the ...s (5/5)
// Construct a model
new SimpleNaiveBayesModel(uid, numClasses, numFeatures, Vectors.dense(pi),
new DenseMatrix(numClasses, theta(0).length, theta.flatten, true))
}
override def copy(extra: ParamMap) = {
defaultCopy(extra)
}
}
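The slides stop at the estimator; a hedged sketch of the matching model class (the constructor mirrors the call above; the rest, including the scoring rule, is an assumption rather than the original code):
import org.apache.spark.ml.classification.ClassificationModel
import org.apache.spark.ml.linalg.{DenseMatrix, Vector, Vectors}
import org.apache.spark.ml.param.ParamMap
class SimpleNaiveBayesModel(
    override val uid: String,
    override val numClasses: Int,
    override val numFeatures: Int,
    val pi: Vector,
    val theta: DenseMatrix)
  extends ClassificationModel[Vector, SimpleNaiveBayesModel] {
  override def copy(extra: ParamMap): SimpleNaiveBayesModel = {
    copyValues(new SimpleNaiveBayesModel(uid, numClasses, numFeatures, pi, theta), extra)
  }
  // Raw score per class: class prior plus the log-probabilities of the
  // features that are present in this (assumed binary) feature vector
  override def predictRaw(features: Vector): Vector = {
    val scores = Array.tabulate(numClasses) { k =>
      var score = pi(k)
      features.foreachActive { (idx, value) =>
        if (value != 0.0) score += theta(k, idx)
      }
      score
    }
    Vectors.dense(scores)
  }
}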
What is Spark?
● General purpose distributed system
○ With a really nice API including Python :)
● Apache project (one of the most active)
● Much faster than Hadoop Map/Reduce
● Good when data is too big for a single
machine
● Built on top of two abstractions for
distributed data: RDDs & Datasets
DataFrames & Datasets
Totally the future
● Distributed collection
● Recomputed on node failure
● Distributes data & work across the cluster
● Lazily evaluated (transformations & actions)
● Has runtime schema information
● Allows for relational queries & supports SQL
● Declarative - many optimizations applied automagically
● Input for Spark Machine Learning
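A tiny illustration with the admissions df from earlier (hedged - assumes spark and df are in scope):
import org.apache.spark.sql.functions._
// Relational-style query
df.filter(col("gre") > 600).groupBy("admit").count().show()
// Or plain SQL over the same lazily evaluated data
df.createOrReplaceTempView("admissions")
spark.sql("SELECT admit, avg(gpa) FROM admissions GROUP BY admit").show()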
Helen Olney
What is the performance like?
Andrew Skudder
Spark ML pipelines
Tokenizer → HashingTF → String Indexer → Naive Bayes
fit(df): the whole pipeline is an Estimator; fitting it produces a Transformer
● Consist of different stages (estimators or transformers)
● Themselves are an estimator
We are going to build a stage together!
Minimal data prep:
● At a minimum most algorithms in Spark work on feature
vectors of doubles (and if labeled - doubles too)
Imports:
import org.apache.spark.ml._
import org.apache.spark.ml.feature._
import org.apache.spark.ml.classification._
import org.apache.spark.ml.linalg.{Vector => SparkVector}
Huang Yun Chung
Minimal prep continued
// Combines a list of double input features into a vector
val assembler = new VectorAssembler()
assembler.setInputCols(Array("age", "education-num"))
// String indexer converts a set of strings into doubles
val sb = new StringIndexer()
sb.setInputCol("category").setOutputCol("category-index")
// Can be used to combine pipeline components together
val pipeline = new Pipeline()
pipeline.setStages(Array(assembler, sb))
Huang Yun Chung
Minimal prep continued
val assembler = new VectorAssembler()
assembler.setInputCols(Array("gre", "gpa", "prestige")).setOutputCol("features")
val si = new StringIndexer()
si.setInputCol("admit").setOutputCol("label")
val pipeline = new Pipeline()
pipeline.setStages(Array(assembler, si))
Huang Yun Chung
+-----+-----+----+--------+----------------+-----+
|admit| gre| gpa|prestige| features|label|
+-----+-----+----+--------+----------------+-----+
| no|380.0|3.61| 3.0|[380.0,3.61,3.0]| 0.0|
| yes|660.0|3.67| 3.0|[660.0,3.67,3.0]| 1.0|
| yes|800.0| 4.0| 1.0| [800.0,4.0,1.0]| 1.0|
| yes|640.0|3.19| 4.0|[640.0,3.19,4.0]| 1.0|
| no|520.0|2.93| 4.0|[520.0,2.93,4.0]| 0.0|
+-----+-----+----+--------+----------------+-----+
+-----+-----+----+--------+----------------+
|admit| gre| gpa|prestige| features|
+-----+-----+----+--------+----------------+
| no|380.0|3.61| 3.0|[380.0,3.61,3.0]|
| yes|660.0|3.67| 3.0|[660.0,3.67,3.0]|
| yes|800.0| 4.0| 1.0| [800.0,4.0,1.0]|
| yes|640.0|3.19| 4.0|[640.0,3.19,4.0]|
| no|520.0|2.93| 4.0|[520.0,2.93,4.0]|
+-----+-----+----+--------+----------------+
+-----+-----+----+--------+
|admit| gre| gpa|prestige|
+-----+-----+----+--------+
| no|380.0|3.61| 3.0|
| yes|660.0|3.67| 3.0|
| yes|800.0| 4.0| 1.0|
| yes|640.0|3.19| 4.0|
| no|520.0|2.93| 4.0|
+-----+-----+----+--------+
So a bit more about that pipeline
● Each of our previous components has a “fit” & “transform” stage
● Constructing the pipeline this way makes it easier to
work with (only need to call one fit & one transform)
● Can re-use the fitted model on future data
val model = pipeline.fit(df)
val prepared = model.transform(df)
Andrey
Let's train a model on our prepared data:
// Specify model
val dt = new DecisionTreeClassifier()
dt.setFeaturesCol("features")
dt.setPredictionCol("prediction")
// Fit it
val dtModel = dt.fit(prepared)
Or wait let's just add it to the pipeline:
// Specify model
val dt = new DecisionTreeClassifier()
dt.setFeaturesCol("features")
dt.setPredictionCol("prediction")
// Add to the pipeline
pipeline.setStages(Array(assembler, si, dt))
val pipelineModel = pipeline.fit(df)
And predict the results on the same data:
pipelineModel.transform(df).select("prediction",
"label").take(20)
+----------+-----+
|prediction|label|
+----------+-----+
| 0.0| 0.0|
| 0.0| 1.0|
| 1.0| 1.0|
| 1.0| 1.0|
| 0.0| 0.0|
+----------+-----+
Contenu connexe

Tendances

Credit Fraud Prevention with Spark and Graph Analysis
Credit Fraud Prevention with Spark and Graph AnalysisCredit Fraud Prevention with Spark and Graph Analysis
Credit Fraud Prevention with Spark and Graph AnalysisJen Aman
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache SparkRahul Jain
 
Optimizing Your Cluster with Coordinator Nodes (Eric Lubow, SimpleReach) | Ca...
Optimizing Your Cluster with Coordinator Nodes (Eric Lubow, SimpleReach) | Ca...Optimizing Your Cluster with Coordinator Nodes (Eric Lubow, SimpleReach) | Ca...
Optimizing Your Cluster with Coordinator Nodes (Eric Lubow, SimpleReach) | Ca...DataStax
 
Why your Spark Job is Failing
Why your Spark Job is FailingWhy your Spark Job is Failing
Why your Spark Job is FailingDataWorks Summit
 
Rust Programming Language
Rust Programming LanguageRust Programming Language
Rust Programming LanguageJaeju Kim
 
What Is RDD In Spark? | Edureka
What Is RDD In Spark? | EdurekaWhat Is RDD In Spark? | Edureka
What Is RDD In Spark? | EdurekaEdureka!
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with HadoopPhilippe Julio
 
A Deep Dive into Structured Streaming: Apache Spark Meetup at Bloomberg 2016
A Deep Dive into Structured Streaming:  Apache Spark Meetup at Bloomberg 2016 A Deep Dive into Structured Streaming:  Apache Spark Meetup at Bloomberg 2016
A Deep Dive into Structured Streaming: Apache Spark Meetup at Bloomberg 2016 Databricks
 
Image Processing on Delta Lake
Image Processing on Delta LakeImage Processing on Delta Lake
Image Processing on Delta LakeDatabricks
 
How to design and implement a data ops architecture with sdc and gcp
How to design and implement a data ops architecture with sdc and gcpHow to design and implement a data ops architecture with sdc and gcp
How to design and implement a data ops architecture with sdc and gcpJoseph Arriola
 
Ai pipelines powered by jupyter notebooks
Ai pipelines powered by jupyter notebooksAi pipelines powered by jupyter notebooks
Ai pipelines powered by jupyter notebooksLuciano Resende
 
Introduction to Storm
Introduction to Storm Introduction to Storm
Introduction to Storm Chandler Huang
 
Maximize Greenplum For Any Use Cases Decoupling Compute and Storage - Greenpl...
Maximize Greenplum For Any Use Cases Decoupling Compute and Storage - Greenpl...Maximize Greenplum For Any Use Cases Decoupling Compute and Storage - Greenpl...
Maximize Greenplum For Any Use Cases Decoupling Compute and Storage - Greenpl...VMware Tanzu
 
Dynamic Allocation in Spark
Dynamic Allocation in SparkDynamic Allocation in Spark
Dynamic Allocation in SparkDatabricks
 

Tendances (20)

Credit Fraud Prevention with Spark and Graph Analysis
Credit Fraud Prevention with Spark and Graph AnalysisCredit Fraud Prevention with Spark and Graph Analysis
Credit Fraud Prevention with Spark and Graph Analysis
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Optimizing Your Cluster with Coordinator Nodes (Eric Lubow, SimpleReach) | Ca...
Optimizing Your Cluster with Coordinator Nodes (Eric Lubow, SimpleReach) | Ca...Optimizing Your Cluster with Coordinator Nodes (Eric Lubow, SimpleReach) | Ca...
Optimizing Your Cluster with Coordinator Nodes (Eric Lubow, SimpleReach) | Ca...
 
Why your Spark Job is Failing
Why your Spark Job is FailingWhy your Spark Job is Failing
Why your Spark Job is Failing
 
Rust Programming Language
Rust Programming LanguageRust Programming Language
Rust Programming Language
 
Rapid json tutorial
Rapid json tutorialRapid json tutorial
Rapid json tutorial
 
What Is RDD In Spark? | Edureka
What Is RDD In Spark? | EdurekaWhat Is RDD In Spark? | Edureka
What Is RDD In Spark? | Edureka
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
 
COBOL to Apache Spark
COBOL to Apache SparkCOBOL to Apache Spark
COBOL to Apache Spark
 
A Deep Dive into Structured Streaming: Apache Spark Meetup at Bloomberg 2016
A Deep Dive into Structured Streaming:  Apache Spark Meetup at Bloomberg 2016 A Deep Dive into Structured Streaming:  Apache Spark Meetup at Bloomberg 2016
A Deep Dive into Structured Streaming: Apache Spark Meetup at Bloomberg 2016
 
Image Processing on Delta Lake
Image Processing on Delta LakeImage Processing on Delta Lake
Image Processing on Delta Lake
 
Spark
SparkSpark
Spark
 
HDFS Design Principles
HDFS Design PrinciplesHDFS Design Principles
HDFS Design Principles
 
OrientDB
OrientDBOrientDB
OrientDB
 
How to design and implement a data ops architecture with sdc and gcp
How to design and implement a data ops architecture with sdc and gcpHow to design and implement a data ops architecture with sdc and gcp
How to design and implement a data ops architecture with sdc and gcp
 
Ai pipelines powered by jupyter notebooks
Ai pipelines powered by jupyter notebooksAi pipelines powered by jupyter notebooks
Ai pipelines powered by jupyter notebooks
 
Introduction to Storm
Introduction to Storm Introduction to Storm
Introduction to Storm
 
Maximize Greenplum For Any Use Cases Decoupling Compute and Storage - Greenpl...
Maximize Greenplum For Any Use Cases Decoupling Compute and Storage - Greenpl...Maximize Greenplum For Any Use Cases Decoupling Compute and Storage - Greenpl...
Maximize Greenplum For Any Use Cases Decoupling Compute and Storage - Greenpl...
 
Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
 
Dynamic Allocation in Spark
Dynamic Allocation in SparkDynamic Allocation in Spark
Dynamic Allocation in Spark
 

Similaire à Spark Machine Learning: Adding Your Own Algorithms and Tools with Holden Karau and Seth Hendrickson

Extending spark ML for custom models now with python!
Extending spark ML for custom models  now with python!Extending spark ML for custom models  now with python!
Extending spark ML for custom models now with python!Holden Karau
 
Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...
Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...
Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...Holden Karau
 
Introduction to and Extending Spark ML
Introduction to and Extending Spark MLIntroduction to and Extending Spark ML
Introduction to and Extending Spark MLHolden Karau
 
An introduction into Spark ML plus how to go beyond when you get stuck
An introduction into Spark ML plus how to go beyond when you get stuckAn introduction into Spark ML plus how to go beyond when you get stuck
An introduction into Spark ML plus how to go beyond when you get stuckData Con LA
 
Introducing Apache Spark's Data Frames and Dataset APIs workshop series
Introducing Apache Spark's Data Frames and Dataset APIs workshop seriesIntroducing Apache Spark's Data Frames and Dataset APIs workshop series
Introducing Apache Spark's Data Frames and Dataset APIs workshop seriesHolden Karau
 
Holden Karau - Spark ML for Custom Models
Holden Karau - Spark ML for Custom ModelsHolden Karau - Spark ML for Custom Models
Holden Karau - Spark ML for Custom Modelssparktc
 
Spark ML for custom models - FOSDEM HPC 2017
Spark ML for custom models - FOSDEM HPC 2017Spark ML for custom models - FOSDEM HPC 2017
Spark ML for custom models - FOSDEM HPC 2017Holden Karau
 
Writing Continuous Applications with Structured Streaming PySpark API
Writing Continuous Applications with Structured Streaming PySpark APIWriting Continuous Applications with Structured Streaming PySpark API
Writing Continuous Applications with Structured Streaming PySpark APIDatabricks
 
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...Chester Chen
 
Introduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at lastIntroduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at lastHolden Karau
 
Intro to Spark and Spark SQL
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQLjeykottalam
 
Getting started with Apache Spark in Python - PyLadies Toronto 2016
Getting started with Apache Spark in Python - PyLadies Toronto 2016Getting started with Apache Spark in Python - PyLadies Toronto 2016
Getting started with Apache Spark in Python - PyLadies Toronto 2016Holden Karau
 
Spark streaming , Spark SQL
Spark streaming , Spark SQLSpark streaming , Spark SQL
Spark streaming , Spark SQLYousun Jeong
 
Beyond Wordcount with spark datasets (and scalaing) - Nide PDX Jan 2018
Beyond Wordcount  with spark datasets (and scalaing) - Nide PDX Jan 2018Beyond Wordcount  with spark datasets (and scalaing) - Nide PDX Jan 2018
Beyond Wordcount with spark datasets (and scalaing) - Nide PDX Jan 2018Holden Karau
 
ETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupRafal Kwasny
 
ScalaTo July 2019 - No more struggles with Apache Spark workloads in production
ScalaTo July 2019 - No more struggles with Apache Spark workloads in productionScalaTo July 2019 - No more struggles with Apache Spark workloads in production
ScalaTo July 2019 - No more struggles with Apache Spark workloads in productionChetan Khatri
 
Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...
Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...
Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...Databricks
 
4Developers 2018: Pyt(h)on vs słoń: aktualny stan przetwarzania dużych danych...
4Developers 2018: Pyt(h)on vs słoń: aktualny stan przetwarzania dużych danych...4Developers 2018: Pyt(h)on vs słoń: aktualny stan przetwarzania dużych danych...
4Developers 2018: Pyt(h)on vs słoń: aktualny stan przetwarzania dużych danych...PROIDEA
 
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...Databricks
 
Scylla Summit 2016: Analytics Show Time - Spark and Presto Powered by Scylla
Scylla Summit 2016: Analytics Show Time - Spark and Presto Powered by ScyllaScylla Summit 2016: Analytics Show Time - Spark and Presto Powered by Scylla
Scylla Summit 2016: Analytics Show Time - Spark and Presto Powered by ScyllaScyllaDB
 

Similaire à Spark Machine Learning: Adding Your Own Algorithms and Tools with Holden Karau and Seth Hendrickson (20)

Extending spark ML for custom models now with python!
Extending spark ML for custom models  now with python!Extending spark ML for custom models  now with python!
Extending spark ML for custom models now with python!
 
Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...
Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...
Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...
 
Introduction to and Extending Spark ML
Introduction to and Extending Spark MLIntroduction to and Extending Spark ML
Introduction to and Extending Spark ML
 
An introduction into Spark ML plus how to go beyond when you get stuck
An introduction into Spark ML plus how to go beyond when you get stuckAn introduction into Spark ML plus how to go beyond when you get stuck
An introduction into Spark ML plus how to go beyond when you get stuck
 
Introducing Apache Spark's Data Frames and Dataset APIs workshop series
Introducing Apache Spark's Data Frames and Dataset APIs workshop seriesIntroducing Apache Spark's Data Frames and Dataset APIs workshop series
Introducing Apache Spark's Data Frames and Dataset APIs workshop series
 
Holden Karau - Spark ML for Custom Models
Holden Karau - Spark ML for Custom ModelsHolden Karau - Spark ML for Custom Models
Holden Karau - Spark ML for Custom Models
 
Spark ML for custom models - FOSDEM HPC 2017
Spark ML for custom models - FOSDEM HPC 2017Spark ML for custom models - FOSDEM HPC 2017
Spark ML for custom models - FOSDEM HPC 2017
 
Writing Continuous Applications with Structured Streaming PySpark API
Writing Continuous Applications with Structured Streaming PySpark APIWriting Continuous Applications with Structured Streaming PySpark API
Writing Continuous Applications with Structured Streaming PySpark API
 
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...
 
Introduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at lastIntroduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at last
 
Intro to Spark and Spark SQL
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQL
 
Getting started with Apache Spark in Python - PyLadies Toronto 2016
Getting started with Apache Spark in Python - PyLadies Toronto 2016Getting started with Apache Spark in Python - PyLadies Toronto 2016
Getting started with Apache Spark in Python - PyLadies Toronto 2016
 
Spark streaming , Spark SQL
Spark streaming , Spark SQLSpark streaming , Spark SQL
Spark streaming , Spark SQL
 
Beyond Wordcount with spark datasets (and scalaing) - Nide PDX Jan 2018
Beyond Wordcount  with spark datasets (and scalaing) - Nide PDX Jan 2018Beyond Wordcount  with spark datasets (and scalaing) - Nide PDX Jan 2018
Beyond Wordcount with spark datasets (and scalaing) - Nide PDX Jan 2018
 
ETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetup
 
ScalaTo July 2019 - No more struggles with Apache Spark workloads in production
ScalaTo July 2019 - No more struggles with Apache Spark workloads in productionScalaTo July 2019 - No more struggles with Apache Spark workloads in production
ScalaTo July 2019 - No more struggles with Apache Spark workloads in production
 
Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...
Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...
Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...
 
4Developers 2018: Pyt(h)on vs słoń: aktualny stan przetwarzania dużych danych...
4Developers 2018: Pyt(h)on vs słoń: aktualny stan przetwarzania dużych danych...4Developers 2018: Pyt(h)on vs słoń: aktualny stan przetwarzania dużych danych...
4Developers 2018: Pyt(h)on vs słoń: aktualny stan przetwarzania dużych danych...
 
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
 
Scylla Summit 2016: Analytics Show Time - Spark and Presto Powered by Scylla
Scylla Summit 2016: Analytics Show Time - Spark and Presto Powered by ScyllaScylla Summit 2016: Analytics Show Time - Spark and Presto Powered by Scylla
Scylla Summit 2016: Analytics Show Time - Spark and Presto Powered by Scylla
 

Plus de Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 

Plus de Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Dernier

RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhijennyeacort
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改yuu sss
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 
办理学位证加利福尼亚大学洛杉矶分校毕业证,UCLA成绩单原版一比一
办理学位证加利福尼亚大学洛杉矶分校毕业证,UCLA成绩单原版一比一办理学位证加利福尼亚大学洛杉矶分校毕业证,UCLA成绩单原版一比一
办理学位证加利福尼亚大学洛杉矶分校毕业证,UCLA成绩单原版一比一F sss
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfJohn Sterrett
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfchwongval
 
IMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxIMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxdolaknnilon
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxaleedritatuxx
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degreeyuu sss
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Boston Institute of Analytics
 
毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...
毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...
毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...ttt fff
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Seán Kennedy
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理e4aez8ss
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024Susanna-Assunta Sansone
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Thomas Poetter
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queensdataanalyticsqueen03
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGILLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGIThomas Poetter
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Seán Kennedy
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectBoston Institute of Analytics
 

Dernier (20)

RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
办理学位证加利福尼亚大学洛杉矶分校毕业证,UCLA成绩单原版一比一
办理学位证加利福尼亚大学洛杉矶分校毕业证,UCLA成绩单原版一比一办理学位证加利福尼亚大学洛杉矶分校毕业证,UCLA成绩单原版一比一
办理学位证加利福尼亚大学洛杉矶分校毕业证,UCLA成绩单原版一比一
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdf
 
IMA MSN - Medical Students Network (2).pptx

Spark Machine Learning: Adding Your Own Algorithms and Tools with Holden Karau and Seth Hendrickson

  • 16. How is new data made?
Transformer.transform(data) → new data
class Transformer extends PipelineStage {
  def transform(df: Dataset[_]): DataFrame
}
  • 17. Feature transformations
+-----+-----+----+--------+
|admit|  gre| gpa|prestige|
+-----+-----+----+--------+
|   no|380.0|3.61|     3.0|
|  yes|660.0|3.67|     3.0|
|  yes|800.0| 4.0|     1.0|
|  yes|640.0|3.19|     4.0|
|   no|520.0|2.93|     4.0|
+-----+-----+----+--------+

val assembler = new VectorAssembler()
  .setInputCols(Array("gre", "gpa", "prestige"))
  .setOutputCol("features") // so the assembled column is called "features"
val df2 = assembler.transform(df)

VectorAssembler adds the features column:
+-----+-----+----+--------+----------------+
|admit|  gre| gpa|prestige|        features|
+-----+-----+----+--------+----------------+
|   no|380.0|3.61|     3.0|[380.0,3.61,3.0]|
|  yes|660.0|3.67|     3.0|[660.0,3.67,3.0]|
|  yes|800.0| 4.0|     1.0| [800.0,4.0,1.0]|
|  yes|640.0|3.19|     4.0|[640.0,3.19,4.0]|
|   no|520.0|2.93|     4.0|[520.0,2.93,4.0]|
+-----+-----+----+--------+----------------+
  • 18. Train a classifier on the transformed data
val si = new StringIndexer().setInputCol("admit").setOutputCol("label")
val siModel = si.fit(df2)
val df3 = siModel.transform(df2)

StringIndexer → StringIndexerModel adds a label column to df2 (shown above):
+-----+-----+----+--------+----------------+-----+
|admit|  gre| gpa|prestige|        features|label|
+-----+-----+----+--------+----------------+-----+
|   no|380.0|3.61|     3.0|[380.0,3.61,3.0]|  0.0|
|  yes|660.0|3.67|     3.0|[660.0,3.67,3.0]|  1.0|
|  yes|800.0| 4.0|     1.0| [800.0,4.0,1.0]|  1.0|
|  yes|640.0|3.19|     4.0|[640.0,3.19,4.0]|  1.0|
|   no|520.0|2.93|     4.0|[520.0,2.93,4.0]|  0.0|
+-----+-----+----+--------+----------------+-----+
  • 19. Train a classifier on the transformed data
val dt = new DecisionTreeClassifier()
val dtModel = dt.fit(df3)
val df4 = dtModel.transform(df3)

DecisionTreeClassifier → DecisionTreeClassificationModel adds predictions
to the (features, label) data from the previous slide:
+----------------+-----+----------+
|        features|label|prediction|
+----------------+-----+----------+
|[380.0,3.61,3.0]|  0.0|       0.0|
|[660.0,3.67,3.0]|  1.0|       0.0|
| [800.0,4.0,1.0]|  1.0|       1.0|
|[640.0,3.19,4.0]|  1.0|       1.0|
|[520.0,2.93,4.0]|  0.0|       0.0|
+----------------+-----+----------+
  • 20. Or just throw it all in a pipeline
● Keeping track of intermediate data and calling fit/transform on every stage is way too much work
● This problem is worse when more stages are used
● Use a pipeline instead!

val assembler = new VectorAssembler()
assembler.setInputCols(Array("gre", "gpa", "prestige"))
assembler.setOutputCol("features") // dt expects a "features" column
val sb = new StringIndexer()
sb.setInputCol("admit").setOutputCol("label")
val dt = new DecisionTreeClassifier()
val pipeline = new Pipeline()
pipeline.setStages(Array(assembler, sb, dt))
val pipelineModel = pipeline.fit(df)
  • 21. Yay! You have an ML pipeline! Photo by Jessica Fiess-Hill
  • 22. Pipeline API has many models:
● org.apache.spark.ml.classification
  ○ BinaryLogisticRegressionClassification, DecisionTreeClassification, GBTClassifier, etc.
● org.apache.spark.ml.regression
  ○ DecisionTreeRegression, GBTRegressor, IsotonicRegression, LinearRegression, etc.
● org.apache.spark.ml.recommendation
  ○ ALS
● You can also check out spark-packages for some more
● But possibly not your special AwesomeFooBazinatorML
carterse
  • 23. & data prep stages...
● org.apache.spark.ml.feature
  ○ ~30 elements, from VectorAssembler to Tokenizer to PCA, etc.
● Often simpler to understand while getting started with building our own stages (quick example below)
carterse
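For reference, the built-in feature stages all follow the same input-column / output-column pattern we will copy in our own stages. A minimal sketch using Tokenizer (the "text" and "words" column names are made up for illustration):

import org.apache.spark.ml.feature.Tokenizer

// Splits a string column into an array-of-words column.
val tok = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")
val tokenized = tok.transform(df)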
  • 24. So now begins our adventure to add stages
  • 25. So what does a pipeline stage look like?
Must provide:
● transformSchema (used to validate input schema is reasonable) & copy
Often have:
● Special params for configuration (so we can do meta-algorithms)
Wendy Piersall
  • 26. Building a simple transformer:
class HardCodedWordCountStage(override val uid: String) extends Transformer {
  def this() = this(Identifiable.randomUID("hardcodedwordcount"))

  def copy(extra: ParamMap): HardCodedWordCountStage = {
    defaultCopy(extra)
  }
  ...
}
Not to be confused with the Transformers franchise from Hasbro and Tomy.
  • 27. Verify the input schema is reasonable:
override def transformSchema(schema: StructType): StructType = {
  // Check that the input type is a string
  val idx = schema.fieldIndex("happy_pandas")
  val field = schema.fields(idx)
  if (field.dataType != StringType) {
    throw new Exception(s"Input type ${field.dataType} did not match input type StringType")
  }
  // Add the return field
  schema.add(StructField("happy_panda_counts", IntegerType, false))
}
  • 28. How is transformSchema used?
● When you call fit on a pipeline it calls transformSchema on the pipeline stages in order
● This is used to verify that things should work
● Ideally allows pipelines to fail fast when misconfigured, instead of at the final stage of a 48-hour process (see the sketch below)
● Doesn’t always work that way :p
Tricia Hall
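A minimal sketch of that fail-fast behaviour, assuming the HardCodedWordCountStage above and a DataFrame that is missing the expected column:

import org.apache.spark.ml.Pipeline

val pipeline = new Pipeline()
  .setStages(Array(new HardCodedWordCountStage()))
// fit() chains transformSchema through the stages before doing any work,
// so if df has no string "happy_pandas" column this throws immediately
// instead of after earlier stages have already churned through the data.
val model = pipeline.fit(df)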
  • 29. Do the “work” (e.g. predict labels or w/e):
def transform(df: Dataset[_]): DataFrame = {
  val wordcount = udf { in: String => in.split(" ").size }
  df.select(col("*"), wordcount(df.col("happy_pandas")).as("happy_panda_counts"))
}
vic15
  • 30. What about configuring our stage?
class ConfigurableWordCount(override val uid: String) extends Transformer {
  final val inputCol = new Param[String](this, "inputCol", "The input column")
  final val outputCol = new Param[String](this, "outputCol", "The output column")

  def setInputCol(value: String): this.type = set(inputCol, value)

  def setOutputCol(value: String): this.type = set(outputCol, value)
Jason Wesley Upton
  • 31. So why do we configure it that way?
● Allow meta algorithms to work on it
● If you look inside of Spark you’ll see “sharedParams” for common params (like input column)
● We can’t access those unless we pretend to be inside of org.apache.spark - so we have to make our own (sketch below)
Tricia Hall
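A minimal sketch of such a hand-rolled params trait. This is an assumption about what the SimpleIndexerParams trait mixed in on the following slides could look like, not the exact code from the deck:

import org.apache.spark.ml.param.{Param, Params}

trait SimpleIndexerParams extends Params {
  // Our own stand-ins for Spark's private sharedParams traits.
  final val inputCol = new Param[String](this, "inputCol", "The input column")
  final val outputCol = new Param[String](this, "outputCol", "The output column")
  final def getInputCol: String = $(inputCol)
  final def getOutputCol: String = $(outputCol)
  def setInputCol(value: String): this.type = set(inputCol, value)
  def setOutputCol(value: String): this.type = set(outputCol, value)
}

Both the estimator and its model can mix this in, so the configuration is shared between them.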
  • 32. So how to make an estimator?
● Very similar: instead of directly providing transform, implement the Estimator interface shown above - a `fit` which returns a “model” (itself a Transformer)
● Also take a look at the algorithms in Spark itself (helpful traits you can mix in to take care of many common things)
● Let’s look at a simple one now!
sneakerdog
  • 33. A simple string indexer estimator
class SimpleIndexer(override val uid: String)
    extends Estimator[SimpleIndexerModel] with SimpleIndexerParams {
  ...
  override def fit(dataset: Dataset[_]): SimpleIndexerModel = {
    import dataset.sparkSession.implicits._
    val words = dataset.select(dataset($(inputCol)).as[String]).distinct
      .collect()
    new SimpleIndexerModel(uid, words)
  }
}
  • 34. Quick aside: What’s that “$(inputCol)”?
● How you get access to a configuration parameter
● Inside the stage only (externally use getInputCol, just like Java™ :p)
  • 35. And our friend the transformer is back:
class SimpleIndexerModel(override val uid: String, words: Array[String])
    extends Model[SimpleIndexerModel] with SimpleIndexerParams {
  ...
  private val labelToIndex: Map[String, Double] = words.zipWithIndex.
    map{case (x, y) => (x, y.toDouble)}.toMap

  override def transform(dataset: Dataset[_]): DataFrame = {
    val indexer = udf { label: String => labelToIndex(label) }
    dataset.select(col("*"),
      indexer(dataset($(inputCol)).cast(StringType)).as($(outputCol)))
  }
}
Still not to be confused with the Transformers franchise from Hasbro and Tomy.
  • 36. Ok so how do you make the train function?
● Read some papers on the algorithm(s) you care about
● Most likely some iterative approach (pro-tip: RDDs > Datasets for iterative) - see the sketch below
  ○ Seth has some interesting work around pluggable optimizers
● Closed form solution? Go have a party!
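If you do go the iterative route, dropping down to the RDD API inside fit/train is a short hop. A rough fragment, assuming it lives inside the train() of a Classifier subclass so dataset, $(labelCol) and $(featuresCol) are in scope:

import org.apache.spark.ml.linalg.Vector
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.col

// Pull out (label, features) pairs as an RDD and cache it, then run the
// iterative updates over `instances` as many passes as the algorithm needs.
val instances: RDD[(Double, Vector)] = dataset
  .select(col($(labelCol)), col($(featuresCol)))
  .rdd
  .map { case Row(label: Double, features: Vector) => (label, features) }
instances.cache()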
  • 37. What else can you add to your models?
● Put in an ML pipeline
● Do hyper-parameter tuning
And if you have some coffee left over:
● Persistence*
  ○ MLWriter & MLReader give you the basics (sketch below)
  ○ You’ll have to do a lot of work yourself :(
● Serving*
*With enough coffee. Not guaranteed.
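For stages whose state lives entirely in their params, recent Spark releases expose DefaultParamsWritable / DefaultParamsReadable, which cover persistence with almost no code; anything that also carries learned data (like our word arrays) still needs a custom MLWriter/MLReader. A minimal sketch with a hypothetical NoopStage:

import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable}
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.types.StructType

// Persistence "for free" when all state is in params.
class NoopStage(override val uid: String) extends Transformer
    with DefaultParamsWritable {
  def this() = this(Identifiable.randomUID("noop"))
  override def transform(df: Dataset[_]): DataFrame = df.toDF()
  override def transformSchema(schema: StructType): StructType = schema
  override def copy(extra: ParamMap): NoopStage = defaultCopy(extra)
}

object NoopStage extends DefaultParamsReadable[NoopStage]

// new NoopStage().write.overwrite().save("/tmp/noop")
// val reloaded = NoopStage.load("/tmp/noop")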
  • 38. Ok so I put my new fancy thing on GitHub ● Yay thank you! ● Please publish to maven central ● Also consider listing on spark-packages + user@ list ○ Let me know ( holden@pigscanfly.ca ) :) ● Think of the Python users (and I guess the R users) too?
  • 39. Custom Estimators/Transformers in the Wild
● Classification/Regression: xgboost
● Deep Learning: MXNet
● Feature Transformation: FeatureHasher
  • 40. More resources:
● High Performance Spark Example Repo has some sample models
  ○ Of course buy several copies of the book - it is the gift of the season :p
● The models inside of Spark itself (use some internal APIs but a good starting point)
● Nick Pentreath’s FeatureHasher
● O’Reilly radar blog post https://www.oreilly.com/learning/extend-structured-streaming-for-spark-ml
Captain Pancakes
  • 41. Learning Spark Fast Data Processing with Spark (Out of Date) Fast Data Processing with Spark (2nd edition) Advanced Analytics with Spark Spark in Action Coming soon: High Performance Spark Learning PySpark
  • 42. The next book….. Available in “Early Release”*: ● Buy from O’Reilly - http://bit.ly/highPerfSpark ● Extending ML is covered in Chapter 9 Get notified when updated & finished: ● http://www.highperformancespark.com ● https://twitter.com/highperfspark ● Should be finished between May 22nd ~ June 18th :D * Early Release means extra mistakes, but also a chance to help us make a more awesome book.
  • 43. And some upcoming talks: ● June ○ Berlin Buzzwords ○ Scala Swarm (Porto, Portugal)
  • 44. k thnx bye :) If you care about Spark testing and don’t hate surveys: http://bit.ly/holdenTestingSpark Will tweet results “eventually” @holdenkarau Any PySpark Users: Have some simple UDFs you wish ran faster you are willing to share?: http://bit.ly/pySparkUDF Pssst: Have feedback on the presentation? Give me a shout (holden@pigscanfly.ca) if you feel comfortable doing so :)
  • 46. Cross-validation because saving a test set is effort ● Automagically* fit your model params ● Because thinking is effort ● org.apache.spark.ml.tuning has the tools Jonathan Kotta
  • 47. Cross-validation because saving a test set is effort & a reason to integrate
// ParamGridBuilder constructs an Array of parameter combinations.
val paramGrid: Array[ParamMap] = new ParamGridBuilder()
  .addGrid(nb.smoothing, Array(0.1, 0.5, 1.0, 2.0))
  .build()
val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEstimatorParamMaps(paramGrid)
val cvModel = cv.fit(df)
val bestModel = cvModel.bestModel
Jonathan Kotta
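As written the snippet leans on defaults; in practice CrossValidator also needs an evaluator (and usually an explicit fold count). A hedged completion, with the evaluator choice and fold count being my own assumptions rather than part of the deck:

import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEstimatorParamMaps(paramGrid)
  .setEvaluator(new BinaryClassificationEvaluator()) // scores each fold
  .setNumFolds(3)
val cvModel = cv.fit(df)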
  • 48. So what does a pipeline stage look like?
Are either an:
● Estimator - has a method called “fit” which returns a Transformer (e.g. NaiveBayes, etc.)
● Transformer - no need to train, can directly transform (e.g. HashingTF, VectorAssembler, etc.) (with transform)
Wendy Piersall
  • 49. We’ve left out a lot of “transformSchema”...
● It is necessary (but I’m lazy)
● But there are helper classes that can implement some of the boilerplate we’ve been skipping
● Classifier & Estimator base classes are your friends
● They provide transformSchema
  • 50. Let’s make a Classifier* :)
// Example only - not for production use.
class SimpleNaiveBayes(val uid: String)
    extends Classifier[Vector, SimpleNaiveBayes, SimpleNaiveBayesModel] {
(Input type: Vector; trained model: SimpleNaiveBayesModel)
  • 51. Let’s make a Classifier* :)
override def train(ds: Dataset[_]): SimpleNaiveBayesModel = {
  import ds.sparkSession.implicits._
  ds.cache()
  ...
}
  • 52. If you reallllly want to see inside the ...s (1/5)
// Get the number of features by peeking at the first row
val numFeatures: Integer = ds.select(col($(featuresCol))).head
  .get(0).asInstanceOf[Vector].size
// Determine the number of records for each class
val groupedByLabel = ds.select(col($(labelCol)).as[Double]).groupByKey(x => x)
val classCounts = groupedByLabel.agg(count("*").as[Long])
  .sort(col("value")).collect().toMap
// Select the labels and features so we can more easily map over them.
// Note: we do this as a DataFrame using the untyped API because the Vector
// UDT is no longer public.
val df = ds.select(col($(labelCol)).cast(DoubleType), col($(featuresCol)))
  • 53. If you reallllly want to see inside the ...s (2/5)
// Note: you can use getNumClasses & extractLabeledPoints to get an RDD instead
// Using the RDD approach is common when integrating with legacy machine learning code
// or iterative algorithms which can create large query plans.
// Here we use `Datasets` since neither of those apply.
// Compute the number of documents
val numDocs = ds.count
// Get the number of classes.
// Note this estimator assumes they start at 0 and go to numClasses
val numClasses = getNumClasses(ds)
  • 54. If you reallllly want to see inside the ...s (3/5)
// Figure out the non-zero frequency of each feature for each label and
// output label index pairs using a case class to make it easier to work with.
val labelCounts: Dataset[LabeledToken] = df.flatMap {
  case Row(label: Double, features: Vector) =>
    features.toArray.zip(Stream from 0)
      .filter{vIdx => vIdx._1 == 1.0}
      .map{case (v, idx) => LabeledToken(label, idx)}
}
// Use the typed Dataset aggregation API to count the number of non-zero
// features for each label-feature index.
val aggregatedCounts: Array[((Double, Integer), Long)] = labelCounts
  .groupByKey(x => (x.label, x.index))
  .agg(count("*").as[Long]).collect()
val theta = Array.fill(numClasses)(new Array[Double](numFeatures))
  • 55. If you reallllly want to see inside the ...s (4/5)
// Compute the denominator for the general priors
val piLogDenom = math.log(numDocs + numClasses)
// Compute the priors for each class
val pi = classCounts.map{case(_, cc) =>
  math.log(cc.toDouble) - piLogDenom }.toArray
// For each label/feature update the probabilities
aggregatedCounts.foreach{case ((label, featureIndex), count) =>
  // log of number of documents for this label + 2.0 (smoothing)
  val thetaLogDenom = math.log(
    classCounts.get(label).map(_.toDouble).getOrElse(0.0) + 2.0)
  theta(label.toInt)(featureIndex) = math.log(count + 1.0) - thetaLogDenom
}
// Unpersist now that we are done computing everything
ds.unpersist()
  • 56. If you reallllly want to see inside the ...s (5/5)
// Construct a model
new SimpleNaiveBayesModel(uid, numClasses, numFeatures, Vectors.dense(pi),
  new DenseMatrix(numClasses, theta(0).length, theta.flatten, true))
}

override def copy(extra: ParamMap) = {
  defaultCopy(extra)
}
}
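The model class itself isn't shown in the transcript; a rough sketch of what a matching SimpleNaiveBayesModel could look like. The predictRaw math here is a hedged reconstruction of the Bernoulli-style scoring implied by the training code above, not the authors' exact implementation:

import org.apache.spark.ml.classification.ClassificationModel
import org.apache.spark.ml.linalg.{DenseMatrix, Vector, Vectors}
import org.apache.spark.ml.param.ParamMap

class SimpleNaiveBayesModel(
    override val uid: String,
    override val numClasses: Int,
    override val numFeatures: Int,
    val pi: Vector,
    val theta: DenseMatrix)
  extends ClassificationModel[Vector, SimpleNaiveBayesModel] {

  override def copy(extra: ParamMap): SimpleNaiveBayesModel = {
    copyValues(new SimpleNaiveBayesModel(uid, numClasses, numFeatures, pi, theta), extra)
  }

  // Raw (log-space) score per class: class prior plus the log likelihoods of
  // the features that are present.
  override def predictRaw(features: Vector): Vector = {
    val scores = Array.tabulate(numClasses) { k =>
      var score = pi(k)
      features.foreachActive { (idx, value) =>
        if (value == 1.0) score += theta(k, idx)
      }
      score
    }
    Vectors.dense(scores)
  }
}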
  • 57. What is Spark? ● General purpose distributed system ○ With a really nice API including Python :) ● Apache project (one of the most active) ● Much faster than Hadoop Map/Reduce ● Good when data is too big for a single machine ● Built on top of two abstractions for distributed data: RDDs & Datasets
  • 58. DataFrames & Datasets: totally the future
● Distributed collection
● Recomputed on node failure
● Distributes data & work across the cluster
● Lazily evaluated (transformations & actions)
● Has runtime schema information
● Allows for relational queries & supports SQL
● Declarative - many optimizations applied automagically
● Input for Spark Machine Learning
Helen Olney
  • 59. What is the performance like? Andrew Skudder
  • 60. Spark ML pipelines
(diagram: a pipeline of Tokenizer → HashingTF → String Indexer → Naive Bayes; calling fit(df) on the pipeline Estimator yields the fitted Transformer version)
● Consist of different stages (estimators or transformers)
● Themselves are an estimator
We are going to build a stage together!
  • 61. Minimal data prep:
● At a minimum most algorithms in Spark work on feature vectors of doubles (and if labeled - doubles too)
Imports:
import org.apache.spark.ml._
import org.apache.spark.ml.feature._
import org.apache.spark.ml.classification._
import org.apache.spark.ml.linalg.{Vector => SparkVector}
Huang Yun Chung
  • 62. Minimal prep continued
// Combines a list of double input features into a vector
val assembler = new VectorAssembler()
assembler.setInputCols(Array("age", "education-num"))
// String indexer converts a set of strings into doubles
val sb = new StringIndexer()
sb.setInputCol("category").setOutputCol("category-index")
// Can be used to combine pipeline components together
val pipeline = new Pipeline()
pipeline.setStages(Array(assembler, sb))
Huang Yun Chung
  • 63. Minimal prep continued
val assembler = new VectorAssembler()
assembler.setInputCols(Array("gre", "gpa", "prestige"))
assembler.setOutputCol("features") // the classifier on the next slides expects "features"
val si = new StringIndexer()
si.setInputCol("admit").setOutputCol("label")
val pipeline = new Pipeline()
pipeline.setStages(Array(assembler, si))

Input:
+-----+-----+----+--------+
|admit|  gre| gpa|prestige|
+-----+-----+----+--------+
|   no|380.0|3.61|     3.0|
|  yes|660.0|3.67|     3.0|
|  yes|800.0| 4.0|     1.0|
|  yes|640.0|3.19|     4.0|
|   no|520.0|2.93|     4.0|
+-----+-----+----+--------+
After the assembler adds features and the indexer adds label:
+-----+-----+----+--------+----------------+-----+
|admit|  gre| gpa|prestige|        features|label|
+-----+-----+----+--------+----------------+-----+
|   no|380.0|3.61|     3.0|[380.0,3.61,3.0]|  0.0|
|  yes|660.0|3.67|     3.0|[660.0,3.67,3.0]|  1.0|
|  yes|800.0| 4.0|     1.0| [800.0,4.0,1.0]|  1.0|
|  yes|640.0|3.19|     4.0|[640.0,3.19,4.0]|  1.0|
|   no|520.0|2.93|     4.0|[520.0,2.93,4.0]|  0.0|
+-----+-----+----+--------+----------------+-----+
Huang Yun Chung
  • 64. So a bit more about that pipeline
● Each of our previous components has “fit” & “transform” stage
● Constructing the pipeline this way makes it easier to work with (only need to call one fit & one transform)
● Can re-use the fitted model on future data
val model = pipeline.fit(df)
val prepared = model.transform(df)
Andrey
  • 65. Let's train a model on our prepared data:
// Specify model
val dt = new DecisionTreeClassifier()
dt.setFeaturesCol("features")
dt.setPredictionCol("prediction")
// Fit it
val dtModel = dt.fit(prepared)
  • 66. Or wait let's just add it to the pipeline:
// Specify model
val dt = new DecisionTreeClassifier()
dt.setFeaturesCol("features")
dt.setPredictionCol("prediction")
// Add to the pipeline
pipeline.setStages(Array(assembler, si, dt))
val pipelineModel = pipeline.fit(df)
  • 67. And predict the results on the same data:
pipelineModel.transform(df).select("prediction", "label").take(20)
+----------+-----+
|prediction|label|
+----------+-----+
|       0.0|  0.0|
|       0.0|  1.0|
|       1.0|  1.0|
|       1.0|  1.0|
|       0.0|  0.0|
+----------+-----+