1. Apache Spark, an Introduction
Jonathan Lacefield – Solution Architect
DataStax
2. Disclaimer
The contents of this presentation represent my
personal views and do not reflect or represent
any views of my employer.
This is my take on Spark.
This is not DataStax’s take on Spark.
3. Notes
• Meetup Sponsor:
– Data Exchange Platform
– Core Software Engineering – Equifax
• Announcement:
– Data Exchange Platform is currently hiring to build the
next generation data platform. We are looking for
people with experience in one or more of the
following skills: Spark, Storm, Kafka, Samza, Hadoop,
Cassandra
– How to apply?
– Email aravind.yarram@equifax.com
4. Introduction
• Jonathan Lacefield
– Solutions Architect, DataStax
– Former Dev, DBA, Architect, reformed PM
– Email: jlacefie@gmail.com
– Twitter: @jlacefie
– LinkedIn: www.linkedin.com/in/jlacefield
This deck represents my own views and not the
views of my employer
5. DataStax Introduction
DataStax delivers Apache Cassandra in a database platform
purpose-built for the performance and availability demands of
IoT, web, and mobile applications, giving enterprises a secure,
always-on database that remains operationally simple when
scaled in a single datacenter or across multiple datacenters
and clouds.
Includes
1. Apache Cassandra
2. Apache Spark
3. Apache Solr
4. Apache Hadoop
5. Graph Coming Soon
6. DataStax, What we Do (Use Cases)
• Fraud Detection
• Personalization
• Internet of Things
• Messaging
• Lists of Things (Products, Playlists, etc.)
• A smaller set of other things, too!
We are all about working with temporal data sets at
large volumes with high transaction counts
(velocity).
7. Agenda
• Set Baseline (Pre-Distributed Days and
Hadoop)
• Spark Conceptual Introduction
• Spark Key Concepts (Core)
• Spark Look at Each Module
– Spark SQL
– MLlib
– Spark Streaming
– GraphX
14. • Started in 2009 in Berkeley's AMPLab
• Open sourced in 2010
• Commercial provider is Databricks – http://databricks.com
• Solves 2 big Hadoop pain points
Speed – in-memory and fault tolerant
Ease of Use – API of operations and datasets
15. Use Cases for Apache Spark
• Data ETL
• Interactive dashboard creation for customers
• Streaming (e.g., fraud detection, real-time
video optimization)
• “Complex analytics” (e.g., anomaly detection,
trend analysis)
16. Key Concepts - Core
• Resilient Distributed Datasets (RDDs) – Spark’s datasets
• Spark Context – Provides information on the Spark environment
and the application
• Transformations - Transforms data
• Actions - Triggers actual processing
• Directed Acyclic Graph (DAG) – Spark’s execution algorithm
• Broadcast Variables – Read only variables on Workers
• Accumulators – Variables that can be added to with an
associated function on Workers
• Driver - “Main” application container for Spark Execution
• Executors – Execute tasks on data
• Resource Manager – Manages task assignment and status
• Worker – Executes tasks and caches data (a sketch tying these together follows)
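A minimal sketch tying several of these concepts together, in the Spark 1.2-era Scala API this deck uses (app name and input file are made up for illustration):

import org.apache.spark.{SparkConf, SparkContext}

// Driver: owns the SparkContext that connects the app to the cluster
val conf = new SparkConf().setAppName("concepts-demo").setMaster("local[2]")
val sc = new SparkContext(conf)

// Broadcast variable: read-only value shipped once to each Worker
val stopWords = sc.broadcast(Set("a", "an", "the"))

// Accumulator: Workers may only add to it; the Driver reads the result
val blankLines = sc.accumulator(0)

val lines = sc.textFile("input.txt")                // RDD from a file
val words = lines.flatMap { line =>                 // Transformation (lazy)
  if (line.isEmpty) blankLines += 1
  line.split(" ")
}.filter(w => !stopWords.value.contains(w))         // Transformation (lazy)

println(words.count())                              // Action: runs the DAG
println("blank lines: " + blankLines.value)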
17. Resilient Distributed Datasets (RDDs)
• Fault-tolerant collection of elements that enables
parallel processing
• Spark’s Main Abstraction
• Transformation and Actions are executed against
RDDs
• Can persist in Memory, on Disk, or both
• Can be partitioned to control parallel processing
• Can be reused
– HUGE Efficiencies with processing
18. RDDs - Resilient
Source – databricks.com
[Diagram: HDFS File → filter(func = someFilter(…)) → Filtered RDD → map(func = someAction(...)) → Mapped RDD]
RDDs track lineage information that can be used to
efficiently recompute lost data
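You can inspect that lineage yourself: toDebugString prints the chain of parent RDDs. A small sketch (assuming sc is an existing SparkContext; the HDFS path is hypothetical):

val file = sc.textFile("hdfs://namenode/logs/app.log") // base RDD
val filtered = file.filter(_.contains("error"))        // Filtered RDD
val mapped = filtered.map(_.toUpperCase)               // Mapped RDD

// Prints the lineage Spark would replay to recompute lost partitions
println(mapped.toDebugString)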
20. RDDs – From the API
val someRdd = sc.textFile(someURL)
• Create an RDD from a text file
val lines = sc.parallelize(List("pandas", "i like pandas"))
• Create an RDD from a list of elements
• Can create RDDs from many different sources
• RDDs can, and should, be persisted in most cases
– lines.persist() or lines.cache()
• See here for more info
– http://spark.apache.org/docs/1.2.0/programming-guide.html
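Why persisting matters: without it, every Action recomputes the RDD from its source. A sketch reusing the lines RDD from above:

lines.persist()          // or lines.cache() for the default MEMORY_ONLY

println(lines.count())   // first Action: computes the RDD and caches it
println(lines.first())   // second Action: served from the cache, no recompute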
21. Transformations
• Take one RDD and transform its contents to produce another RDD
• Examples
– Map
– Filter
– Union
– Distinct
– Join
• Complete list -
http://spark.apache.org/docs/1.2.0/programming-guide.html
• Lazy execution
– Transformations aren’t applied to an RDD until an Action is executed
inputRDD = sc.textFile("log.txt")
errorsRDD = inputRDD.filter(lambda x: "error" in x)
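To see the laziness in action: building errorsRDD above returns immediately without reading log.txt; only an Action forces work. The same flow in Scala (log.txt is a placeholder file):

val inputRDD = sc.textFile("log.txt")                 // nothing read yet
val errorsRDD = inputRDD.filter(_.contains("error"))  // still nothing
val numErrors = errorsRDD.count()                     // NOW the file is read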
22. Actions
• Cause data to be returned to the driver or saved to output
• Trigger retrieval of data and execution of all
Transformations on RDDs
• Common Actions
– Reduce
– Collect
– Take
– SaveAs….
• Complete list - http://spark.apache.org/docs/1.2.0/programming-guide.html
• errorsRDD.take(1)
23. Example App
import sys
from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext("local", "WordCount")            # 1 – connect to Spark
    lines = sc.textFile(sys.argv[1])                   # load the input file
    counts = (lines.flatMap(lambda s: s.split(" "))    # 2 – transformations
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda x, y: x + y))
    counts.saveAsTextFile(sys.argv[2])                 # 3 – action writes output
Based on source from – databricks.com
27. Spark SQL
Abstraction of Spark API to support SQL like interaction
[Diagram: a Spark SQL or HiveQL query → Parse → Analyze (LogicalPlan) → Optimize → PhysicalPlan → Execute, handled by the Catalyst optimizer and the SQL Core]
• Programming Guide - https://spark.apache.org/docs/1.2.0/sql-programming-guide.html
• Used for code source in examples
• Catalyst - http://spark-summit.org/talk/armbrust-catalyst-a-query-optimization-framework-for-spark-and-shark/
28. SQLContext and SchemaRDD
val sc: SparkContext // An existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// createSchemaRDD is used to implicitly convert an RDD to a SchemaRDD.
import sqlContext.createSchemaRDD
A SchemaRDD can be created two ways:
1) Using reflection to infer the schema from an existing RDD
2) Using the programmatic interface to create a schema and apply it to an existing RDD
29. SchemaRDD Creation - Reflection
// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// createSchemaRDD is used to implicitly convert an RDD to a SchemaRDD.
import sqlContext.createSchemaRDD
// Define the schema using a case class.
// Note: Case classes in Scala 2.10 can support only up to 22 fields. To work around this limit,
// you can use custom classes that implement the Product interface.
case class Person(name: String, age: Int)
// Create an RDD of Person objects and register it as a table.
val people = sc.textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))
people.registerTempTable("people")
// SQL statements can be run by using the sql methods provided by sqlContext.
val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
// The results of SQL queries are SchemaRDDs and support all the normal RDD operations.
// The columns of a row in the result can be accessed by ordinal.
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
30. SchemaRDD Creation - Explicit
// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// Create an RDD
val people = sc.textFile("examples/src/main/resources/people.txt")
// The schema is encoded in a string
val schemaString = "name age"
// Import Spark SQL data types and Row.
import org.apache.spark.sql._
// Generate the schema based on the string of schema
val schema =
StructType(
schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))
// Convert records of the RDD (people) to Rows.
val rowRDD = people.map(_.split(",")).map(p => Row(p(0), p(1).trim))
// Apply the schema to the RDD.
val peopleSchemaRDD = sqlContext.applySchema(rowRDD, schema)
// Register the SchemaRDD as a table.
peopleSchemaRDD.registerTempTable("people")
// SQL statements can be run by using the sql methods provided by sqlContext.
val results = sqlContext.sql("SELECT name FROM people")
// The results of SQL queries are SchemaRDDs and support all the normal RDD operations.
// The columns of a row in the result can be accessed by ordinal.
results.map(t => "Name: " + t(0)).collect().foreach(println)
31. DataFrames
• DataFrames will replace SchemaRDD (starting in Spark 1.3)
• https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html
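For a taste of the replacement, a minimal DataFrame sketch against the Spark 1.3 API described in that post (people.json is a hypothetical input file; details may differ from the final release):

val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// jsonFile infers the schema and returns a DataFrame
val df = sqlContext.jsonFile("examples/src/main/resources/people.json")

df.printSchema()                   // inspect the inferred schema
df.select("name").show()          // projection
df.filter(df("age") > 21).show()  // filtering
df.groupBy("age").count().show()  // aggregation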
42. Initializing Streaming Context
import org.apache.spark._
import org.apache.spark.streaming._
// Spark connection options
val conf = new SparkConf().setAppName(appName).setMaster(master)
// streaming with 1 second batch window
val ssc = new StreamingContext(conf, Seconds(1))
46. Initializing Socket Stream
import org.apache.spark._
import org.apache.spark.streaming._
// Spark connection options
val conf = new SparkConf().setAppName(appName).setMaster(master)
// streaming with 1 second batch window
val ssc = new StreamingContext(conf, Seconds(1))
val text = ssc.socketTextStream("localhost", 9191)
47. Initializing Twitter Stream
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.twitter._
// Spark connection options
val conf = new SparkConf().setAppName(appName).setMaster(master)
// streaming with 1 second batch window
val ssc = new StreamingContext(conf, Seconds(1))
val tweets = TwitterUtils.createStream(ssc, auth)
48. Custom Receiver (WebSocket)
import org.apache.spark._
import org.apache.spark.streaming._
// Spark connection options
val conf = new SparkConf().setAppName(appName).setMaster(master)
// streaming with 1 second batch window
val ssc = new StreamingContext(conf, Seconds(1))
val rsvp = ssc.receiverStream(
  new WebSocketReceiver("ws://stream.meetup.com/2/rsvps", StorageLevel.MEMORY_AND_DISK_2))

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class WebSocketReceiver(url: String, storageLevel: StorageLevel)
  extends Receiver[String](storageLevel) {
  // ...
}
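A fuller skeleton of the same receiver, showing the onStart/onStop lifecycle from the Receiver API (a sketch; the actual WebSocket client code is elided):

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class WebSocketReceiver(url: String, storageLevel: StorageLevel)
  extends Receiver[String](storageLevel) {

  def onStart(): Unit = {
    // Connect on a separate thread so onStart returns quickly;
    // push each incoming message to Spark with store(...)
    new Thread("WebSocket Receiver") {
      override def run(): Unit = {
        // open connection to url; for each message m: store(m)
      }
    }.start()
  }

  def onStop(): Unit = {
    // Close the connection; threads started in onStart should stop here
  }
}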
52. Multiple Streams Transformation
[Diagram: two DStreams with 1-second batch intervals – Chars (A, B, C, D, E) and Digits (1, 2, 3, 4, 5) – combined via Chars.union(Digits) into a single DStream whose batches contain the elements of both.]
53. Word Count
import org.apache.spark._
import org.apache.spark.streaming._
// Spark connection options
val conf = new SparkConf().setAppName(appName).setMaster(master)
// streaming with 1 second batch window
val ssc = new StreamingContext(conf, Seconds(1))
val text = ssc.socketTextStream("localhost", 9191)
val words = text.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1))
.reduceByKey(_ + _)
55. Window Operations
• Transformations over a sliding window of data
1. Window Length – duration of the window
2. Sliding Interval – interval at which the operation is performed
[Diagram: a timeline of 5s batches (1, 2, …, 12) with a Window Length = 60 sec window spanning the twelve most recent batches.]
56. Window Operations
• Transformations over a sliding window of data
1. Window Length – duration of the window
2. Sliding Interval – interval at which the operation is performed
[Diagram: the same 5s-batch timeline with Window Length = 60s; the window advances by a Sliding Interval = 10s, i.e. two batches at a time.]
57. Window Operations
• Transformations over a sliding window of data
1. Window Length – duration of the window
2. Sliding Interval – interval at which the operation is performed
[Diagram: the Window Length = 60s window after a further 10s slide, dropping the two oldest 5s batches and taking in the two newest.]
59. Word Count by Window
import org.apache.spark._
import org.apache.spark.streaming._
// Spark connection options
val conf = new SparkConf().setAppName(appName).setMaster(master)
// streaming with 1 second batch window
val ssc = new StreamingContext(conf, Seconds(1))
val text = ssc.socketTextStream("localhost", 9191)
val words = text.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1))
.reduceByKeyAndWindow((a:Int,b:Int) => a+b, Seconds(60),
Seconds(10))
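A common refinement: also supply an inverse reduce function so Spark subtracts the batches leaving the window instead of recomputing all 60 seconds (this form requires checkpointing to be enabled; a sketch):

ssc.checkpoint("checkpoint-dir")  // required for the inverse-function form

val wordCounts = words.map(x => (x, 1))
  .reduceByKeyAndWindow(
    (a: Int, b: Int) => a + b,    // add counts entering the window
    (a: Int, b: Int) => a - b,    // subtract counts leaving the window
    Seconds(60), Seconds(10))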
60. Large Window Considerations
• Large windows:
1. Take longer to process
2. Require a larger batch interval for stable processing
• Hour-scale windows are not recommended
• For multi-hour aggregations, use a real data store (e.g., Cassandra)
• Spark Streaming is NOT designed to be a persistent data store
• Set spark.cleaner.ttl and spark.streaming.unpersist (be careful) – see the sketch below
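Those last two knobs are ordinary Spark properties set on the SparkConf before the StreamingContext is created; the values below are illustrative, not recommendations:

val conf = new SparkConf()
  .setAppName(appName)
  .setMaster(master)
  // Age (in seconds) after which old RDDs/metadata may be cleaned up
  .set("spark.cleaner.ttl", "3600")
  // Eagerly unpersist DStream RDDs once they are no longer needed
  .set("spark.streaming.unpersist", "true")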
63. Saving to Cassandra
import org.apache.spark._
import org.apache.spark.streaming._
import com.datastax.spark.connector.streaming._
// Spark connection options
val conf = new SparkConf().setAppName(appName).setMaster(master)
// streaming with 1 second batch window
val ssc = new StreamingContext(conf, Seconds(1))
val text = ssc.socketTextStream("localhost", 9191)
val words = text.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.saveToCassandra("keyspace", "table", SomeColumns("word", "total"))
64. Start Processing
import org.apache.spark._
import org.apache.spark.streaming._
import com.datastax.spark.connector.streaming._
// Spark connection options
val conf = new SparkConf().setAppName(appName).setMaster(master)
// streaming with 1 second batch window
val ssc = new StreamingContext(conf, Seconds(1))
val text = ssc.socketTextStream("localhost", 9191)
val words = text.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.saveToCassandra("keyspace", "table", SomeColumns("word", "total"))
ssc.start()
ssc.awaitTermination()
66. Scaling Streaming
• How to scale stream processing?
[Diagram: Kafka Producer → Spark Receiver → Spark Processor → Output]
67. Parallelism – Partitioning
• Partition input stream (e.g. by topics)
• Each receiver can be run on a separate worker
[Diagram: Kafka Topics 1 through N, each feeding its own Spark Receiver (1 through N) and Spark Processor, each producing Output in parallel.]
68. Parallelism – Partitioning
• Partition stream (e.g. by topics)
• Use union() to create single DStream
• Transformations applied on the unified stream
val numStreams = 5
val kafkaStreams = (1 to numStreams).map { i => KafkaUtils.createStream(...) }
val unifiedStream = streamingContext.union(kafkaStreams)
unifiedStream.print()
69. Parallelism – RePartitioning
• Explicitly repartition input stream
• Distribute received batches across specified number of machines
[Diagram: Twitter Producer → a single Spark Receiver, whose received batches are redistributed across four parallel Spark Processors, each producing Output.]
70. Parallelism – RePartitioning
• Explicitly repartition input stream
• Distribute received batches across specified number of machines
• Use inputstream.repartition(N)
val numWorkers = 5
val twitterStream = TwitterUtils.createStream(...)
val repartitionedStream = twitterStream.repartition(numWorkers)
71. Parallelism – Tasks
• Each block processed by separate task
• To increase parallel tasks, increase number of blocks in a batch
• Tasks per Receiver per Batch ≈ Batch Interval / Block Interval
• Example: 2s batch / 200ms block = 10 tasks
• CPU cores will be under-utilized if the number of tasks is too low
• Consider tuning the default number of parallel tasks,
spark.default.parallelism – see the sketch below
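Both knobs from the arithmetic above can be tuned; a sketch with illustrative values (the batch interval is set on the StreamingContext, the rest on the conf):

val conf = new SparkConf()
  .setAppName(appName)
  .setMaster(master)
  // Block interval in ms: a 2s batch / 200ms blocks = 10 tasks per receiver
  .set("spark.streaming.blockInterval", "200")
  // Default task count for shuffle operations like reduceByKey
  .set("spark.default.parallelism", "8")

val ssc = new StreamingContext(conf, Seconds(2))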
74. Fault Tolerance – RDD
• Recomputing an RDD may be impossible for a streaming source (the original data may be gone)
• Protect data by replicating the RDD
• RDD replication is controlled by org.apache.spark.storage.StorageLevel
• Use a storage level with the _2 suffix (2 replicas) – see the sketch below:
– DISK_ONLY_2
– MEMORY_ONLY_2
– MEMORY_ONLY_SER_2
– MEMORY_AND_DISK_2
– MEMORY_AND_DISK_SER_2 (default for most receivers)
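The storage level is typically passed when the input stream is created; a sketch for the socket stream used in earlier slides:

import org.apache.spark.storage.StorageLevel

// Ask for two serialized in-memory replicas instead of the default
val text = ssc.socketTextStream("localhost", 9191,
  StorageLevel.MEMORY_ONLY_SER_2)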
75. Fault Tolerance – Checkpointing
• Periodically writes:
1. DAG/metadata of DStream(s)
2. RDD data for some stateful transformations (updateStateByKey &
reduceByKeyAndWindow*)
• Uses a fault-tolerant distributed file system (e.g., HDFS) for persistence.
• After a failure, the StreamingContext is recreated from the checkpoint data
on restart.
• Choose the interval carefully, as checkpoint storage will impact processing times.
76. Fault Tolerance – Checkpointing
import org.apache.spark._
import org.apache.spark.streaming._
import com.datastax.spark.connector.streaming._

val checkpointDirectory = "words.cp" // Directory name for checkpoint data
val conf = new SparkConf().setAppName(appName).setMaster(master)

def createContext(): StreamingContext = {
  // streaming with 1 second batch window
  val ssc = new StreamingContext(conf, Seconds(1))
  ssc.checkpoint(checkpointDirectory) // set checkpoint directory
  val text = ssc.socketTextStream("localhost", 9191)
  val words = text.flatMap(_.split(" "))
  val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
  wordCounts.saveToCassandra("keyspace", "table", SomeColumns("word", "total"))
  ssc
}

// Get StreamingContext from checkpoint data or create a new one
val ssc = StreamingContext.getOrCreate(checkpointDirectory, createContext _)
ssc.start()
ssc.awaitTermination()
79. State of Data
1. Data received and replicated
• Will survive failure of 1 replica
2. Data received but only buffered for replication
• Not replicated yet
• Needs recomputation if lost
80. Receiver Reliability Types
1. Reliable Receivers
• Receiver acknowledges the source only after ensuring the data is replicated.
• Source must support message acknowledgement, e.g. Kafka, Flume.
2. Unreliable Receivers
• Data can be lost in case of failure.
• Source doesn't support message acknowledgement, e.g. Twitter.
81. Fault Tolerance
• Spark 1.2 adds Write Ahead Log (WAL) support for Streaming
• Protection for Unreliable Receivers
• See SPARK-3129 for architecture details
State / Receiver Type | Received, Replicated | Received, Only Buffered
Reliable Receiver     | Safe                 | Safe
Unreliable Receiver   | Safe                 | Data Loss
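Enabling the WAL takes one property plus a checkpoint directory where the log files live (property name as of Spark 1.2; the HDFS path is hypothetical):

val conf = new SparkConf()
  .setAppName(appName)
  .setMaster(master)
  // Write all received data to a write-ahead log before processing
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")

val ssc = new StreamingContext(conf, Seconds(1))
ssc.checkpoint("hdfs://namenode/spark/checkpoints") // WAL lives here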
82. GraphX
• Alpha release
• Provides Graph computation capabilities on
top of RDDs
• Resilient Distributed Property Graph: a
directed multigraph with properties attached
to each vertex and edge.
• The goal of the GraphX project is to unify
graph-parallel and data-parallel computation
in one system with a single composable API.
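A tiny property-graph sketch using the GraphX API (vertex and edge data are made up; assumes an existing SparkContext sc):

import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

// Vertices: (VertexId, property) pairs
val users: RDD[(VertexId, String)] =
  sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))

// Directed edges, each carrying a property
val follows: RDD[Edge[String]] =
  sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))

val graph = Graph(users, follows)
println(graph.vertices.count())                   // 3
println(graph.inDegrees.collect().mkString(", ")) // who is followed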
83. I am not a Graph-guy yet.
Who here is working with Graph today?
84. Handy Tools
• Ooyala Spark Job Server -
https://github.com/ooyala/spark-jobserver
• Monitoring with Graphite and Grafana –
http://www.hammerlab.org/2015/02/27/monitoring-spark-with-graphite-and-grafana/