From 0 to Streaming
Cassandra and Spark Streaming
Russell Spitzer
+ =
Who am I?
• Bioinformatics Ph.D. from UCSF
• Works on the integration of
Cassandra (C*) with Hadoop, Solr,
and SPARK!
• Spends a lot of time spinning up
clusters on EC2, GCE, Azure, …
http://www.datastax.com/dev/blog/testing-cassandra-1000-nodes-at-a-time
• Writing FAQs for Spark
Troubleshooting
http://www.datastax.com/dev/blog/common-spark-troubleshooting
From 0 to Streaming
Spark
How does it work?
What are the main Components?
Cluster Layout
Spark Submit
Connecting Cassandra To Spark
Spark Cassandra Connector
Spark SQL
RDD Basics
Spark Streaming
Streaming Basics
Writing Streaming Applications
Custom Receivers
Part 1: What is Spark
Not this ^
Spark is a Distributed Analytics
Platform
HADOOP
• Has Generalized DAG execution
• Integrated SQL Queries
• Streaming
• Easy Abstraction for Datasets
• Support in lots of languages
All in one package!
Spark Provides a Simple and Efficient
framework for Distributed Computations
Node Roles: 2
In Memory Caching: Yes!
Generic DAG Execution: Yes!
Great Abstraction For Datasets? RDD!
[Diagram: a Spark Master coordinates several Spark Workers; each Worker hosts a Spark Executor holding Spark Partitions of a Resilient Distributed Dataset.]
Spark Master: Assigns cluster resources to applications
Spark Worker: Manages executors running on a machine
Spark Executor: Started by Worker - Workhorse of the Spark application
RDDs Can be Generated
from a Variety of Sources
Textfiles
Parallelized Collections
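For example, from the Spark shell either source is one line (a minimal sketch; num.txt is a placeholder file):
val fromFile = sc.textFile("num.txt")          // one RDD element per line of the file
val fromCollection = sc.parallelize(1 to 100)  // distribute a local Scala collection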
Transformations and Actions
RDDs are immutable
New RDDs are created with transforms
Only when we call an action are the transforms applied
val rdd  = sc.textFile("num.txt")
val rdd2 = rdd.map( x => x.toInt * 2 )
val rdd3 = rdd2.filter( _ > 4 )
rdd3.collect
Create: rdd
Transform: rdd → rdd2 → rdd3
ACTION: rdd3.collect walks the chain backwards (rdd3 → rdd2 → rdd) and only then are the transformations actually applied
Application of Transformations is
done one Partition per Executor
[Diagram: an RDD with partitions 1-9; each Executor applies the transformation to one partition at a time (1 → 1', 2 → 2', …) until the new RDD' contains all of the transformed partitions 1'-9'.]
Failed Transformations Can be Redone By
Reapplying the Transformation to the Old
Partition
[Diagram: a node fails while computing partition 5 → 5'; the transformation is reapplied to partition 5 of the original RDD to regenerate 5' without recomputing the rest of RDD'.]
Because the transformations that produced any partition can be traced backwards through the lineage,
we can recover from a failure without recomputing the
entire RDD
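You can see this lineage from the shell with toDebugString (a minimal sketch; num.txt is a placeholder file):
val rdd3 = sc.textFile("num.txt").map(_.toInt * 2).filter(_ > 4)
println(rdd3.toDebugString)  // prints the chain of parent RDDs Spark would replay after a failure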
Use the Spark Shell to
quickly try out code samples
Available in the Spark Shell (Scala)
and PySpark (Python)
The Spark Context is the Core API for
all Communication with Spark
val conf = new SparkConf()
.setAppName(appName)
.setMaster(master)
.set("spark.cassandra.auth.username", "cassandra")
.set("spark.cassandra.auth.password", "cassandra")
new SparkContext(conf)
Almost all options can also be set as environment
variables or on the command line during spark-submit!
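For example, the same connection settings could be supplied at submit time instead of in code (a sketch; the host, credentials, and jar name are placeholders, and spark.cassandra.connection.host is the connector property naming the Cassandra contact point):
spark-submit --class MainClass \
  --conf spark.cassandra.connection.host=127.0.0.1 \
  --conf spark.cassandra.auth.username=cassandra \
  --conf spark.cassandra.auth.password=cassandra \
  YourApp.jar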
Deploy Compiled Jars using
Spark Submit
https://spark.apache.org/docs/1.1.0/submitting-applications.html
Some of the commonly used options are:
--class: The entry point for your application
--master: The master URL for the cluster (e.g. spark://23.195.26.187:7077)
--conf: Arbitrary Spark configuration property in key=value format.
spark-submit --class MainClass JarYouWantDistributedToExecutor.jar
[Diagram: spark-submit hands the application Jar to the Spark Master for distribution to the Spark Workers.]
Co-locate Spark and C* for
Best Performance
[Diagram: Spark Workers and the Spark Master running on the same nodes as the Cassandra ring.]
Running Spark Workers
on the same nodes as
your C* Cluster will save
network hops when
reading and writing
Use a Separate Datacenter
for your Analytics Workloads
[Diagram: two Cassandra datacenters - an OLTP ring for transactional traffic and a separate OLAP ring whose nodes also run the Spark Master and Workers.]
Part 2: Connecting Spark To
Cassandra
Exactly like this ^
DataStax OSS Connector
Spark to Cassandra
https://github.com/datastax/spark-cassandra-connector
Keyspace Table
Cassandra Spark
RDD[CassandraRow]
RDD[Tuples]
Bundled and Supported with DSE > 4.5!
Spark Cassandra Connector uses the
DataStax Java Driver to Read from and
Write to C*
Spark C*
Full Token
Range
Each Executor Maintains
a connection to the C*
Cluster
Spark
Executor
DataStax
Java Driver
Tokens 1-1000
Tokens 1001-2000
Tokens …
RDDs are read in as separate splits,
each based on a set of
tokens
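The split size is tunable; a sketch, assuming the 1.1-era connector property spark.cassandra.input.split.size (roughly how many Cassandra partitions land in each Spark partition):
val conf = new SparkConf()
  .set("spark.cassandra.input.split.size", "10000")  // assumed property: ~10k C* partitions per Spark partition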
Setting up C* and Spark
DSE > 4.5.0
Just start your nodes with
dse cassandra -k
Apache Cassandra
Follow the excellent guide by Al Tobey
http://tobert.github.io/post/2014-07-15-installing-cassandra-spark-stack.html
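With open source Spark and Cassandra, the connector is added as an ordinary dependency; a sketch of an sbt line, assuming the 1.1.x connector artifact (match the version to your Spark version):
// build.sbt
libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "1.1.0"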
Several Easy Ways To Use the
Spark Cassandra Connector
• SparkSQL
• Scala
• Java
• RDD Manipulation
• Scala
• Java
• Python
Requirements for Following
Code Examples
The following examples are targeted at
Spark 1.1.X
Cassandra 2.0.X
or if you are using DataStax Enterprise
DSE 4.6.x
Basics: Getting a Table and
Counting
CREATE KEYSPACE candy WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1 };
use candy;
CREATE TABLE inventory ( brand text, name text, amount int, PRIMARY KEY (brand, name) ) ;
CREATE TABLE requests ( user text, name text, amount int, PRIMARY KEY (user, name) );
INSERT INTO inventory (brand, name , amount ) VALUES ( 'Wonka','Gobstopper', 10 );
INSERT INTO inventory (brand, name , amount ) VALUES ( 'Wonka','WonkaBar', 3 );
INSERT INTO inventory (brand, name , amount ) VALUES ( 'CandyTown','SugarMountain', 2 );
INSERT INTO inventory (brand, name , amount ) VALUES ( 'CandyTown','ChocoIsland', 5 );
INSERT INTO requests (user, name , amount ) VALUES ( 'Russ', 'WonkaBar', 2);
INSERT INTO requests (user, name , amount ) VALUES ( 'Russ', 'ChocoIsland', 1);
scala> val rdd = sc.cassandraTable("candy","inventory")
scala> rdd.count
res13: Long = 4
(cassandraTable → count → 4)
Basics: take() and collect()
sc.cassandraTable("candy","inventory").take(1)
Array(CassandraRow{brand: Wonka, name: Gobstopper, amount: 10})
sc.cassandraTable("candy","inventory").collect
Array[com.datastax.spark.connector.CassandraRow] = Array(CassandraRow{…})
take(1) returns an Array containing a single CassandraRow; collect returns an Array of every CassandraRow in the table.
Getting Values From Cassandra Rows
scala> sc.cassandraTable("candy","inventory")
         .take(1)(0)
         .get[Int]("amount")
res5: Int = 10
(cassandraTable → take(1) → Array of CassandraRows → get[Int] → 10)
http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkSupportedTypes.html
Rows can also be mapped directly onto a case class:
scala> case class invRow(brand: String, name: String, amount: Integer)
scala> sc.cassandraTable[invRow]("candy","inventory").take(1)(0).amount
(cassandraTable → take(1) → Array of invRows → amount → 10)
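The connector can also push work down to Cassandra instead of shipping whole rows back; a minimal sketch using its select and where methods (the predicate must be something Cassandra itself can serve, such as a clustering column):
sc.cassandraTable("candy", "inventory")
  .select("name", "amount")        // only these columns are fetched from C*
  .where("name = ?", "WonkaBar")   // appended to the CQL the connector generates
  .collect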
Saving Back to Cassandra
CREATE TABLE low ( brand text, name text, amount int, PRIMARY KEY ( brand, name ));
sc.cassandraTable[invRow]("candy","inventory")
  .filter( _.amount < 5 )
  .saveToCassandra("candy","low")
(cassandraTable → filter on amount < 5 → saveToCassandra)
Under the hood this is done via the
Cassandra Java Driver
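Any RDD whose elements line up with the table's columns can be written the same way - for example a plain collection of tuples (a minimal sketch; the candy values are made up, and SomeColumns names the target columns explicitly):
import com.datastax.spark.connector._
sc.parallelize(Seq(("Wonka", "FizzyLifting", 7)))
  .saveToCassandra("candy", "inventory", SomeColumns("brand", "name", "amount"))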
Several Easy Ways To Use the
Spark Cassandra Connector
• SparkSQL
• Scala
• Java
• RDD Manipulation
• Scala
• Java
• Python
Spark SQL Provides a Fast SQL-Like
Syntax For Cassandra!
[Diagram: HQL / SQL → Catalyst → Query Plan (Grab Data, Filter, Group, Return Results) → SchemaRDD. SQL in, RDDs out.]
Building a Context Object For
interacting with Spark SQL
In the DSE Spark Shell both the HiveContext and the CassandraSQLContext are created
automatically on startup
import org.apache.spark.sql.cassandra.CassandraSQLContext
val sc: SparkContext = ...
val csc = new CassandraSQLContext(sc)
JavaSparkContext jsc = new JavaSparkContext(conf);
// create a Cassandra Spark SQL context
CassandraSQLContext csc = new CassandraSQLContext(jsc.sc());
Since the HiveContext requires the Hive driver to access C* directly, the HiveContext is only available in DSE.
Workaround: get SchemaRDDs with the CassandraSQLContext, then register them with the HiveContext
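A sketch of that workaround, assuming a HiveContext named hc alongside csc and Spark 1.1's registerRDDAsTable (the table name is a placeholder):
val inventory = csc.sql("SELECT * FROM candy.inventory")  // SchemaRDD built by the CassandraSQLContext
hc.registerRDDAsTable(inventory, "inventory_tmp")          // expose it to the HiveContext
hc.sql("SELECT COUNT(*) FROM inventory_tmp").collect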
Reading Data From
Cassandra With SQL Syntax
scala> csc.sql(
  "SELECT * FROM candy.inventory").collect
Array[org.apache.spark.sql.Row] = Array(
  [Wonka,Gobstopper,10],
  [Wonka,WonkaBar,3],
  [CandyTown,ChocoIsland,5],
  [CandyTown,SugarMountain,2]
)
(SQL text → QueryPlan → SchemaRDD)
Counting Data From
Cassandra With SQL Syntax
scala> csc.sql("SELECT COUNT(*) FROM candy.inventory").collect
res5: Array[org.apache.spark.sql.Row] = Array([4])
Joining Data From
Cassandra With SQL Syntax
scala> csc.sql("
  SELECT * FROM candy.inventory as inventory
  JOIN candy.requests as requests
  WHERE inventory.name = requests.name").collect
res12: Array[org.apache.spark.sql.Row] = Array(
  [Wonka,WonkaBar,3,Russ,WonkaBar,2],
  [CandyTown,ChocoIsland,5,Russ,ChocoIsland,1]
)
Insert to another Cassandra
Table
csc.sql("
  INSERT INTO candy.low
  SELECT * FROM candy.inventory as inv
  WHERE inv.amount < 5 ").collect
Part 3: How To Stream To
Cassandra From Spark
Streaming is Cool
and if you like Streaming you will be cool too
Your Data is Delicious
Like a Candy
You want it right now!
Batch Analytics:
Waiting to do analysis after data has
accumulated means data may be out of date or
unimportant by the time we process it.
Streaming Analytics:
We do our analytics on the data as it arrives.
The data won’t be stale and neither will our
analytics
DStreams: Basic unit of
Spark Streaming
Receiver
DStream
Events
Streaming involves a receiver or set of receivers each of which publishes a DStream
DStreams: Basic unit of
Spark Streaming
Receiver
DStream
Events
Batch Batch
RDD RDD RDD RDD
The DStream is (Discretized) into batches, the timing of which is set in the
Spark Streaming Context. Each Batch is made up of RDDs.
Streaming Provides Extra
Functions
DStream
RDD RDD RDD RDD RDD RDD RDD RDD
Time
Window 1-2
Window 2-3
Window 3-4
Windowing gives us easy access to slices of data in time
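For example, a count over the last 30 seconds recomputed every 10 seconds is one call on the DStream (a minimal sketch; requests is the HttpRequest DStream built later in this deck):
val last30s = requests.window(Seconds(30), Seconds(10))  // window length, slide interval
last30s.count().print()                                   // number of events in each window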
Receivers that Come With
Spark Streaming
And more!
Demo Streaming Application:
Analyze HttpRequests with Spark Streaming
Spark Cassandra
HttpServerTraffic
Spark Executor
Source Included in DSE 4.6.0
Spark Receivers only really need to
describe how to publish to a DStream
	
case class HttpRequest(
  timeuuid: UUID,
  method: String,
  headers: Map[String, List[String]],
  uri: URI,
  body: String) extends ReceiverClass
First we need to define a Case Class to make moving around
HttpRequest information Easier. This type will be used to
specify what type of DStream we are creating.
Spark Receivers only really need to
describe how to publish to a DStream
class HttpReceiver(port: Int)
  extends Receiver[HttpRequest](StorageLevel.MEMORY_AND_DISK_2)
  with Logging
{
  def onStart(): Unit = {}
  def onStop(): Unit = {}
}
Now we just need to write the code for a receiver to actually
publish these HttpRequest Objects
Receiver
[HttpRequest]
Spark Receivers only really need to
describe how to publish to a DStream
import com.sun.net.httpserver.{HttpExchange, HttpHandler, HttpServer}
def onStart(): Unit = {
  val s = HttpServer.create(new InetSocketAddress(p), 0)
  s.createContext("/", new StreamHandler())
  s.start()
  server = Some(s)
}
def onStop(): Unit = server map (_.stop(0))
This will start up our server and direct all HTTP traffic to be
handled by StreamHandler
Receiver
[HttpRequest]
HttpServer
Spark Receivers only really need to
describe how to publish to a DStream
	
  class	
  StreamHandler	
  extends	
  HttpHandler	
  {	
  
	
  	
  	
  override	
  def	
  handle(transaction:	
  HttpExchange):	
  Unit	
  =	
  {	
  
	
  	
  	
  	
  	
  	
  val	
  dataReader	
  =	
  new	
  BufferedReader(new	
  
InputStreamReader(transaction.getRequestBody))	
  
	
  	
  	
  	
  	
  	
  val	
  data	
  =	
  Stream.continually(dataReader.readLine).takeWhile(_	
  !=	
  
null).mkString("n")	
  
	
  	
  	
  	
  	
  val	
  headers:	
  Map[String,	
  List[String]]	
  =	
  
transaction.getRequestHeaders.toMap.map	
  {	
  case	
  (k,	
  v)	
  =>	
  (k,	
  v.toList)}	
  
	
  	
  	
  	
  	
  	
  store(HttpRequest(	
  
	
  	
  	
  	
  	
  	
  	
  	
  UUIDs.timeBased(),	
  
	
  	
  	
  	
  	
  	
  	
  	
  transaction.getRequestMethod,	
  
	
  	
  	
  	
  	
  	
  	
  	
  headers,	
  
	
  	
  	
  	
  	
  	
  	
  	
  transaction.getRequestURI,	
  
	
  	
  	
  	
  	
  	
  	
  	
  data))	
  
	
  	
  	
  	
  	
  	
  transaction.sendResponseHeaders(200,	
  0)	
  
	
  	
  	
  	
  	
  	
  val	
  response	
  =	
  transaction.getResponseBody	
  
	
  	
  	
  	
  	
  	
  response.close()	
  //	
  Empty	
  response	
  body	
  
	
  	
  	
  	
  	
  	
  transaction.close()	
  //	
  Finish	
  Transaction	
  
	
  	
  	
  	
  }	
  
	
  	
  }	
  
StreamHandler actually does the work
publishing events to the DStream.
Receiver
[HttpRequest]
HttpServer
StreamHandler
Streaming Context sets Batch Timing
val ssc = new StreamingContext(conf, Seconds(5))
val multipleStreams = (1 to config.numDstreams).map { i =>
  ssc.receiverStream[HttpRequest](new HttpReceiver(config.port))
}
val requests = ssc.union(multipleStreams)
Create One Receiver Per Node
Receiver
[HttpRequest]
HttpServer
StreamHandler
Receiver
[HttpRequest]
HttpServer
StreamHandler
Receiver
[HttpRequest]
HttpServer
StreamHandler
Merge Separate DStreams into One
requests[HttpRequest]
Cassandra Tables to Store HttpEvents
Persist Every Event That
Comes into the System:
CREATE TABLE IF NOT EXISTS timeline (
  timesegment bigint,
  url text,
  t_uuid timeuuid,
  method text,
  headers map<text, text>,
  body text,
  PRIMARY KEY ((url, timesegment), t_uuid))
Table For Counting the
Number of Accesses to
each Url Over Time:
CREATE TABLE IF NOT EXISTS method_agg(
  url text,
  method text,
  time timestamp,
  count bigint,
  PRIMARY KEY ((url, method), time))
Table for finding the most
popular url in each batch:
CREATE TABLE IF NOT EXISTS sorted_urls(
  url text,
  time timestamp,
  count bigint,
  PRIMARY KEY (time, count))
Persist the events without doing any
manipulation
requests.map {
  request =>
    timelineRow(
      timesegment = UUIDs.unixTimestamp(request.timeuuid) / 10000L,
      url = request.uri.toString,
      t_uuid = request.timeuuid,
      method = request.method,
      headers = request.headers.map { case (k, v) => (k, v.mkString("#")) },
      body = request.body)
}.saveToCassandra("requests_ks", "timeline")
(Each HttpRequest becomes a timelineRow - timesegment, url, t_uuid, method, headers, body - and is written straight to the timeline table.)
Aggregate Requests by
URI and Method
requests.map(request => (request.method, request.uri.toString))
  .countByValue()
  .transform((rdd, time) => rdd.map {
    case ((m, u), c) => ((m, u), c, time.milliseconds) })
  .map { case ((m, u), c, t) =>
    methodAggRow(time = t, url = u, method = m, count = c) }
  .saveToCassandra("requests_ks", "method_agg")
((method, uri) → countByValue → (method, uri, count) → transform attaches the batch time → methodAggRow → saveToCassandra)
Sort Aggregates by Batch
requests.map(request => (request.uri.toString))
  .countByValue()
  .transform((rdd, time) => rdd.map {
    case (u, c) => (u, c, time.milliseconds) })
  .map { case (u, c, t) => sortedUrlRow(time = t, url = u, count = c) }
  .saveToCassandra("requests_ks", "sorted_urls")
(uri → countByValue → (uri, count) → transform attaches the batch time → sortedUrlRow → saveToCassandra)
Let Cassandra
Do the Sorting! PRIMARY KEY (time, count)
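Because count is the clustering column, reading one time partition back already returns rows in count order, so no Spark-side sort is needed (a minimal sketch; the batch timestamp is a placeholder):
val batchTime = new java.util.Date(1420070400000L)  // hypothetical batch time
sc.cassandraTable("requests_ks", "sorted_urls")
  .where("time = ?", batchTime)   // one partition; rows arrive ordered by the clustering column count
  .collect()
  .takeRight(10)                  // ascending clustering order, so the largest counts are last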
Start the application!
	
  	
  	
  	
ssc.start()
ssc.awaitTermination()
This will start the streaming application
piping all incoming data to Cassandra!
Live Demo
Demo run Script
#Start Streaming Application
echo "Starting Streaming Receiver(s): Logging to http_receiver.log"
cd HttpSparkStream
dse spark-submit --class com.datastax.HttpSparkStream target/HttpSparkStream.jar -d $NUM_SPARK_NODES > ../http_receiver.log 2>&1 &
cd ..
echo "Waiting for 60 Seconds for streaming to come online"
sleep 60
#Start Http Requester
echo "Starting to send requests against streaming receivers: Logging to http_requester.log"
cd HttpRequestGenerator
./sbt/sbt "run -i $SPARK_NODE_IPS " > ../http_requester.log 2>&1 &
cd ..
#Monitor Results Via Cqlsh
watch -n 5 './monitor_queries.sh'
Live Demo
I hope this gives you some
exciting ideas for your
applications!
Questions?
Thanks for coming to the meetup!!
DataStax Academy offers free online Cassandra training!
Planet Cassandra has resources for learning the basics, from 'Try Cassandra' tutorials to in-depth
language and migration pages!
Find a way to contribute back to the community: talk at a meetup, or share your story on
PlanetCassandra.org!
Need help? Get questions answered with Planet Cassandra's free virtual office hours running weekly!
Getting started with Cassandra? In production?
Email us: Community@DataStax.com
Tweet us: @PlanetCassandra
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
React Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkReact Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App Framework
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
Accelerating Enterprise Software Engineering with Platformless
Accelerating Enterprise Software Engineering with PlatformlessAccelerating Enterprise Software Engineering with Platformless
Accelerating Enterprise Software Engineering with Platformless
 
Microservices, Docker deploy and Microservices source code in C#
Microservices, Docker deploy and Microservices source code in C#Microservices, Docker deploy and Microservices source code in C#
Microservices, Docker deploy and Microservices source code in C#
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 

Zero to Streaming: Spark and Cassandra

  • 1. From 0 to Streaming Cassandra and Spark Streaming Russell Spitzer + =
  • 2. Who am I? • Bioinformatics Ph.D from UCSF • Works on the integration of Cassandra (C*) with Hadoop, Solr, and SPARK! • Spends a lot of time spinning up clusters on EC2, GCE, Azure, … http://www.datastax.com/dev/blog/testing-cassandra-1000-nodes-at-a-time • Writing FAQ’s for Spark Troubleshooting http://www.datastax.com/dev/blog/common-spark-troubleshooting
  • 3. From 0 to Streaming Spark How does it work? What are the main Components? Cluster Layout Spark Submit
  • 4. From 0 to Streaming Connecting Cassandra To Spark Spark Cassandra Connector Spark SQL RDD Basics Spark How does it work? What are the main Components? Cluster Layout Spark Submit
  • 5. From 0 to Streaming Connecting Cassandra To Spark Spark Cassandra Connector Spark SQL RDD Basics Spark Streaming Streaming Basics Writing Streaming Applications Custom Receivers Spark How does it work? What are the main Components? Cluster Layout Spark Submit
  • 6. Part 1: What is Spark Not this ^
  • 7. Spark is a Distributed Analytics Platform HADOOP •Has Generalized DAG execution •Integrated SQL Queries •Streaming •Easy Abstraction for Datasets •Support in lots of languages All in one package!
  • 8. Spark Provides a Simple and Efficient framework for Distributed Computations Node Roles 2 In Memory Caching Yes! Generic DAG Execution Yes! Great Abstraction For Datasets? RDD! Spark Worker Spark Worker Spark Master Spark Worker Resilient Distributed Dataset Spark Executor Spark Partition
  • 9. Spark Provides a Simple and Efficient framework for Distributed Computations Spark Worker Spark Worker Spark Master Spark Worker Resilient Distributed Dataset Spark Executor Spark Partition Spark Master: Assigns cluster resources to applications Spark Worker: Manages executors running on a machine Spark Executor: Started by Worker - Workhorse of the spark application
  • 10. Spark Provides a Simple and Efficient framework for Distributed Computations Spark Worker Spark Worker Spark Master Spark Worker Resilient Distributed Dataset Spark Executor Spark Partition Spark Master: Assigns cluster resources to applications Spark Worker: Manages executors running on a machine Spark Executor: Started by Worker - Workhorse of the spark application
  • 11. Spark Provides a Simple and Efficient framework for Distributed Computations Spark Worker Spark Worker Spark Master Spark Worker Resilient Distributed Dataset Spark Executor Spark Partition Spark Master: Assigns cluster resources to applications Spark Worker: Manages executors running on a machine Spark Executor: Started by Worker - Workhorse of the spark application
  • 12. RDDs Can be Generated from a Variety of Sources Textfiles Parallelized Collections
  • 13. RDDs Can be Generated from a Variety of Sources Textfiles Parallelized Collections
  • 14. Transformations and Actions RDD’s are immutable New RDD’s created with transforms Only when we call an action are the transforms applied rdd  =  sc.textFile("num.txt")   val  rdd2  =  rdd.map(  x  =>  x.toInt  *2  )   val  rdd3  =  rdd2.filter(  _  >  4)     rdd3.collect
  • 15. Transformations and Actions RDD’s are immutable New RDD’s created with transforms Only when we call an action are the transforms applied rdd  =  sc.textFile("num.txt")   val  rdd2  =  rdd.map(  x  =>  x.toInt  *2  )   val  rdd3  =  rdd2.filter(  _  >  4)     rdd3.collect rdd Create
  • 16. Transformations and Actions RDD’s are immutable New RDD’s created with transforms Only when we call an action are the transforms applied rdd  =  sc.textFile("num.txt")   val  rdd2  =  rdd.map(  x  =>  x.toInt  *2  )   val  rdd3  =  rdd2.filter(  _  >  4)     rdd3.collect rdd rdd2 Transform
  • 17. Transformations and Actions RDD’s are immutable New RDD’s created with transforms Only when we call an action are the transforms applied rdd  =  sc.textFile("num.txt")   val  rdd2  =  rdd.map(  x  =>  x.toInt  *2  )   val  rdd3  =  rdd2.filter(  _  >  4)     rdd3.collect rdd rdd2 rdd3 Transform
  • 18. Transformations and Actions RDD’s are immutable New RDD’s created with transforms Only when we call an action are the transforms applied rdd rdd2 rdd3rdd ACTION rdd  =  sc.textFile("num.txt")   val  rdd2  =  rdd.map(  x  =>  x.toInt  *2  )   val  rdd3  =  rdd2.filter(  _  >  4)     rdd3.collect
  • 19. Transformations and Actions RDD’s are immutable New RDD’s created with transforms Only when we call an action are the transforms applied rdd rdd2 rdd3rdd ACTION rdd2 rdd  =  sc.textFile("num.txt")   val  rdd2  =  rdd.map(  x  =>  x.toInt  *2  )   val  rdd3  =  rdd2.filter(  _  >  4)     rdd3.collect
  • 20. Transformations and Actions RDD’s are immutable New RDD’s created with transforms Only when we call an action are the transforms applied rdd  =  sc.textFile("num.txt")   val  rdd2  =  rdd.map(  x  =>  x.toInt  *2  )   val  rdd3  =  rdd2.filter(  _  >  4)     rdd3.collect rdd rdd2 rdd3 ACTION
  • 21. Application of Transformations is done one Partition per Executor 1 32 4 5 6 7 8 9 RDD Executor Executor Transformation RDD’
  • 22. Application of Transformations is done one Partition per Executor 1 32 4 5 6 7 8 9 RDD Executor 1 1’ Executor 2 2’ Transformation RDD’
  • 23. 1 32 4 5 6 7 8 9 RDD Executor Executor 1’ 2’ Transformation RDD’ Application of Transformations is done one Partition per Executor
  • 24. 1 32 4 5 6 7 8 9 RDD Executor 3 3’ Executor 4 4’ 1’ 2’ Transformation RDD’ Application of Transformations is done one Partition per Executor
  • 25. 1 32 4 5 6 7 8 9 RDD Executor 5 5’ Executor 6 6’ 1’ 2’ Transformation RDD’ Application of Transformations is done one Partition per Executor 3’ 4’
  • 26. 1 32 4 5 6 7 8 9 RDD Executor Executor 1’ 3’2’ 4’ 5’ 6’ 7’ 8’ 9’ Transformation RDD’ Application of Transformations is done one Partition per Executor
  • 27. 1 32 4 5 6 7 8 9 RDD Executor Executor 1’ 3’2’ 4’ 6’ 7’ 8’ 9’ RDD’ Failed Transformations Can be Redone By Reapplying the Transformation to the Old Partition 5 5’ Node Failure
  • 28. 1 32 4 5 6 7 8 9 RDD Executor Executor 1’ 3’2’ 4’ 6’ 7’ 8’ 9’ RDD’ Failed Transformations Can be Redone By Reapplying the Transformation to the Old Partition Reapply Transformation 5 5’ Node Failure
  • 29. 1 32 4 5 6 7 8 9 RDD Executor Executor 1’ 3’2’ 4’ 6’ 7’ 8’ 9’ RDD’ Failed Transformations Can be Redone By Reapplying the Transformation to the Old Partition Reapply Transformation 5’ Because the actions on any partition can be tracked backwards we can recover from failure without redoing the entire RDD
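Lineage is also something you can inspect directly: toDebugString prints the chain of parent RDDs Spark would walk to rebuild a lost partition, and cache() pins intermediate results so recovery does not have to go all the way back to the source. A minimal sketch, reusing the num.txt example from the earlier slides:

    val rdd  = sc.textFile("num.txt")
    val rdd2 = rdd.map(_.toInt * 2)
    val rdd3 = rdd2.filter(_ > 4).cache()  // a lost partition is rebuilt from rdd2
                                           // instead of re-reading the text file
    println(rdd3.toDebugString)            // shows the lineage: filter <- map <- textFile
    rdd3.collect()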
  • 30. Use the Spark Shell to quickly try out code samples Available in and Pyspark Spark Shell
  • 31. Spark Context is the Core Api for all Communication with Spark val conf = new SparkConf() .setAppName(appName) .setMaster(master) .set("spark.cassandra.auth.username", "cassandra") .set("spark.cassandra.auth.password", "cassandra") new SparkContext(conf) Almost all options can also be set as environment variables or on the command line during spark-submit!
  • 32. Deploy Compiled Jars using Spark Submit https://spark.apache.org/docs/1.1.0/submitting-applications.html Some of the commonly used options are: --class: The entry point for your application --master: The master URL for the cluster (e.g. spark://23.195.26.187:7077) --conf: Arbitrary Spark configuration property in key=value format. spark-submit --class MainClass JarYouWantDistributedToExecutor.jar Spark Worker Spark Worker Spark Master Spark Worker Spark-Submit Jar
  • 33. Deploy Compiled Jars using Spark Submit https://spark.apache.org/docs/1.1.0/submitting-applications.html Some of the commonly used options are: --class: The entry point for your application --master: The master URL for the cluster (e.g. spark://23.195.26.187:7077) --conf: Arbitrary Spark configuration property in key=value format. spark-submit --class MainClass JarYouWantDistributedToExecutor.jar Spark Worker Spark Worker Spark Master Spark Worker Spark-Submit Jar
  • 34. Co-locate Spark and C* for Best Performance C* C*C* C* Spark Worker Spark Worker Spark Master Spark Worker Running Spark Workers on the same nodes as your C* Cluster will save network hops when reading and writing
  • 35. Use a Separate Datacenter for your Analytics Workloads C* C*C* C* Spark Worker Spark Worker Spark Master Spark Worker C* C*C* C* OLTP OLAP
  • 36. Part 2: Connecting Spark To Cassandra Exactly like this ^
  • 37. DataStax OSS Connector Spark to Cassandra https://github.com/datastax/spark-­‐cassandra-­‐connector Keyspace Table Cassandra Spark RDD[CassandraRow] RDD[Tuples] Bundled  and  Supported  with  DSE  >  4.5!
  • 38. Spark Cassandra Connector uses the DataStax Java Driver to Read from and Write to C* Spark C* Full Token Range Each Executor Maintains a connection to the C* Cluster Spark Executor DataStax Java Driver Tokens 1-1000 Tokens 1001 -2000 Tokens … RDD’s read into different splits based on sets of tokens
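Outside of the DSE shell, the connector is pulled in as a normal dependency and its implicits are imported; that import is what adds cassandraTable to the SparkContext and saveToCassandra to RDDs. A sketch of the wiring for the Spark 1.1 era (the version number and contact-point address here are assumptions, check the connector's compatibility table for your Spark release):

    // build.sbt
    libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "1.1.0"

    // in the application
    import org.apache.spark.{SparkConf, SparkContext}
    import com.datastax.spark.connector._                    // cassandraTable / saveToCassandra

    val conf = new SparkConf()
      .set("spark.cassandra.connection.host", "127.0.0.1")   // a Cassandra contact point
    val sc = new SparkContext(conf)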
  • 39. Setting up C* and Spark DSE > 4.5.0 Just start your nodes with dse cassandra -k Apache Cassandra Follow the excellent guide by Al Tobey http://tobert.github.io/post/2014-07-15-installing-cassandra-spark-stack.html
  • 40. Several Easy Ways To Use the Spark Cassandra Connector • SparkSQL • Scala • Java • RDD Manipulation • Scala • Java • Python
  • 41. Requirements for Following Code Examples The following examples use are targeted at Spark 1.1.X Cassandra 2.0.X or if you are using DataStax Enterprise DSE 4.6.x
  • 42. Basics: Getting a Table and Counting CREATE KEYSPACE candy WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1 }; use candy; CREATE TABLE inventory ( brand text, name text, amount int, PRIMARY KEY (brand, name) ) ; CREATE TABLE requests ( user text, name text, amount int, PRIMARY KEY (user, name) ); INSERT INTO inventory (brand, name , amount ) VALUES ( 'Wonka','Gobstopper', 10 ); INSERT INTO inventory (brand, name , amount ) VALUES ( 'Wonka','WonkaBar', 3 ); INSERT INTO inventory (brand, name , amount ) VALUES ( 'CandyTown','SugarMountain', 2 ); INSERT INTO inventory (brand, name , amount ) VALUES ( 'CandyTown','ChocoIsland', 5 ); INSERT INTO requests (user, name , amount ) VALUES ( 'Russ', 'WonkaBar', 2); INSERT INTO requests (user, name , amount ) VALUES ( 'Russ', 'ChocoIsland', 1);
  • 43. Basics: Getting a Table and Counting CREATE KEYSPACE candy WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1 }; use candy; CREATE TABLE inventory ( brand text, name text, amount int, PRIMARY KEY (brand, name) ) ; CREATE TABLE requests ( user text, name text, amount int, PRIMARY KEY (user, name) ); INSERT INTO inventory (brand, name , amount ) VALUES ( 'Wonka','Gobstopper', 10 ); INSERT INTO inventory (brand, name , amount ) VALUES ( 'Wonka','WonkaBar', 3 ); INSERT INTO inventory (brand, name , amount ) VALUES ( 'CandyTown','SugarMountain', 2 ); INSERT INTO inventory (brand, name , amount ) VALUES ( 'CandyTown','ChocoIsland', 5 ); INSERT INTO requests (user, name , amount ) VALUES ( 'Russ', 'WonkaBar', 2); INSERT INTO requests (user, name , amount ) VALUES ( 'Russ', 'ChocoIsland', 1); scala>  val  rdd  =  sc.cassandraTable("candy","inventory")   scala>  rdd.count   res13:  Long  =  4 cassandraTable
  • 44. Basics: Getting a Table and Counting CREATE KEYSPACE candy WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1 }; use candy; CREATE TABLE inventory ( brand text, name text, amount int, PRIMARY KEY (brand, name) ) ; CREATE TABLE requests ( user text, name text, amount int, PRIMARY KEY (user, name) ); INSERT INTO inventory (brand, name , amount ) VALUES ( 'Wonka','Gobstopper', 10 ); INSERT INTO inventory (brand, name , amount ) VALUES ( 'Wonka','WonkaBar', 3 ); INSERT INTO inventory (brand, name , amount ) VALUES ( 'CandyTown','SugarMountain', 2 ); INSERT INTO inventory (brand, name , amount ) VALUES ( 'CandyTown','ChocoIsland', 5 ); INSERT INTO requests (user, name , amount ) VALUES ( 'Russ', 'WonkaBar', 2); INSERT INTO requests (user, name , amount ) VALUES ( 'Russ', 'ChocoIsland', 1); scala>  val  rdd  =  sc.cassandraTable("candy","inventory")   scala>  rdd.count   res13:  Long  =  4 cassandraTable count 4
  • 45. Basics: take() and collect() sc.cassandraTable("candy","inventory").take(1)   Array(CassandraRow{brand:  Wonka,  name:  Gobstopper,  amount:  10})
  • 46. Basics: take() and collect() sc.cassandraTable("candy","inventory").take(1)   Array(CassandraRow{brand:  Wonka,  name:  Gobstopper,  amount:  10}) cassandraTable take(1) Array of CassandraRows Wonka Gob 10
  • 47. Basics: take() and collect() sc.cassandraTable("candy","inventory").take(1)   Array(CassandraRow{brand:  Wonka,  name:  Gobstopper,  amount:  10}) sc.cassandraTable("candy","inventory").collect   Array[com.datastax.spark.connector.CassandraRow]  =  Array(CassandraRow{…}) cassandraTable take(1) Array of CassandraRows Wonka Gob 10
  • 48. Basics: take() and collect() sc.cassandraTable("candy","inventory").take(1)   Array(CassandraRow{brand:  Wonka,  name:  Gobstopper,  amount:  10}) sc.cassandraTable("candy","inventory").collect   Array[com.datastax.spark.connector.CassandraRow]  =  Array(CassandraRow{…}) cassandraTable take(1) Array of CassandraRows Wonka Gob 10 cassandraTable collect 9 NYC Array of CassandraRows 9 NYC9 NYC9 NYCWonka Gob 10
  • 49. Getting Values From Cassandra Rows scala>  sc.cassandraTable("candy","inventory")   .take(1)(0)   .get[Int]("amount")   res5:  Int  =  10
  • 50. Getting Values From Cassandra Rows scala>  sc.cassandraTable("candy","inventory")   .take(1)(0)   .get[Int]("amount")   res5:  Int  =  10 cassandraTable
  • 51. Getting Values From Cassandra Rows scala>  sc.cassandraTable("candy","inventory")   .take(1)(0)   .get[Int]("amount")   res5:  Int  =  10 cassandraTable take(1) Array of CassandraRows Wonka Gob 10
  • 52. Getting Values From Cassandra Rows scala>  sc.cassandraTable("candy","inventory")   .take(1)(0)   .get[Int]("amount")   res5:  Int  =  10 10 get[Int] cassandraTable take(1) Array of CassandraRows Wonka Gob 10
  • 53. Getting Values From Cassandra Rows scala>  sc.cassandraTable("candy","inventory")   .take(1)(0)   .get[Int]("amount")   res5:  Int  =  10 10 get[Int] cassandraTable take(1) Array of CassandraRows Wonka Gob 10 scala>  case  class  invRow  (  brand:String,  name:String,  amount:Integer)   scala>  sc.cassandraTable[invRow]("candy","inventory").take(1)(0).amount
  • 54. Getting Values From Cassandra Rows scala>  sc.cassandraTable("candy","inventory")   .take(1)(0)   .get[Int]("amount")   res5:  Int  =  10 10 get[Int] http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkSupportedTypes.html cassandraTable take(1) Array of CassandraRows Wonka Gob 10 scala>  case  class  invRow  (  brand:String,  name:String,  amount:Integer)   scala>  sc.cassandraTable[invRow]("candy","inventory").take(1)(0).amount cassandraTable
  • 55. Getting Values From Cassandra Rows scala>  sc.cassandraTable("candy","inventory")   .take(1)(0)   .get[Int]("amount")   res5:  Int  =  10 10 get[Int] http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkSupportedTypes.html cassandraTable take(1) Array of CassandraRows Wonka Gob 10 scala>  case  class  invRow  (  brand:String,  name:String,  amount:Integer)   scala>  sc.cassandraTable[invRow]("candy","inventory").take(1)(0).amount cassandraTable take(1) Array of invRows Wonka Gob 10 Brand Name Amount
  • 56. Getting Values From Cassandra Rows scala>  sc.cassandraTable("candy","inventory")   .take(1)(0)   .get[Int]("amount")   res5:  Int  =  10 10 get[Int] http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkSupportedTypes.html cassandraTable take(1) Array of CassandraRows Wonka Gob 10 scala>  case  class  invRow  (  brand:String,  name:String,  amount:Integer)   scala>  sc.cassandraTable[invRow]("candy","inventory").take(1)(0).amount cassandraTable take(1) Array of invRows Wonka Gob 10 Brand Name Amount amount 10
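When only one column is needed, the projection can also be pushed down to Cassandra with select, so the other columns never cross the network. A small sketch against the same inventory table:

    sc.cassandraTable("candy", "inventory")
      .select("amount")                 // only this column is fetched from Cassandra
      .map(_.get[Int]("amount"))
      .reduce(_ + _)                    // total candy count across all rows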
  • 57. Saving Back to Cassandra CREATE TABLE low ( brand text, name text, amount int, PRIMARY KEY ( brand, name ));
  • 58. Saving Back to Cassandra CREATE TABLE low ( brand text, name text, amount int, PRIMARY KEY ( brand, name )); sc.cassandraTable[invRow]("candy","inventory")   .filter(  _.amount  <5)   .saveToCassandra("candy","low")
  • 59. Saving Back to Cassandra CREATE TABLE low ( brand text, name text, amount int, PRIMARY KEY ( brand, name )); sc.cassandraTable[invRow]("candy","inventory")   .filter(  _.amount  <5)   .saveToCassandra("candy","low") cassandraTable
  • 60. Saving Back to Cassandra CREATE TABLE low ( brand text, name text, amount int, PRIMARY KEY ( brand, name )); sc.cassandraTable[invRow]("candy","inventory")   .filter(  _.amount  <5)   .saveToCassandra("candy","low") cassandraTable amount 1 <5_ (Anonymous Param) Filter Wonka Gob 10 Brand Name Amount
  • 61. Saving Back to Cassandra CREATE TABLE low ( brand text, name text, amount int, PRIMARY KEY ( brand, name )); sc.cassandraTable[invRow]("candy","inventory")   .filter(  _.amount  <5)   .saveToCassandra("candy","low") cassandraTable amount 1 <5_ (Anonymous Param) Filter Wonka Gob 10 Brand Name Amount C* C*C* C* Under the hood this is done via the Cassandra Java Driver
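When the RDD's fields don't line up exactly with the target table, the connector also accepts an explicit column list via SomeColumns. A sketch of the same write with the mapping spelled out:

    import com.datastax.spark.connector._

    sc.cassandraTable[invRow]("candy", "inventory")
      .filter(_.amount < 5)
      .saveToCassandra("candy", "low", SomeColumns("brand", "name", "amount"))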
  • 62. Several Easy Ways To Use the Spark Cassandra Connector • SparkSQL • Scala • Java • RDD Manipulation • Scala • Java • Python
  • 63. Spark Sql Provides a Fast SQL Like Syntax For Cassandra! HQL SQL Catalyst Query Plan Grab Data Filter Group Return Results SchemaRDD SQL In, RDD’s Out
  • 64. Building a Context Object For interacting with Spark SQL In the DSE Spark Shell both HiveContext and Cassandra Sql Context are created automatically on startup import  org.apache.spark.sql.cassandra.CassandraSQLContext   val  sc:  SparkContext  =  ...   val  csc  =  new  CassandraSQLContext(sc) JavaSparkContext  jsc  =  new  JavaSparkContext(conf);   //  create  a  Cassandra  Spark  SQL  context   CassandraSQLContext  csc  =  new   CassandraSQLContext(jsc.sc()); Since HiveContext Requires the Hive Driver accessing C* Directly, HC only available in DSE. Workaround: get SchemaRDD’s with Cassandra Sql Context then Register with HC
  • 65. Reading Data From Cassandra With SQL Syntax scala>  csc.sql(   "SELECT  *  FROM  candy.inventory").collect   Array[org.apache.spark.sql.Row]  =  Array(   [Wonka,Gobstopper,10],     [Wonka,WonkaBar,3],     [CandyTown,ChocoIsland,5],     [CandyTown,SugarMountain,2]   ) QueryPlan
  • 66. Reading Data From Cassandra With SQL Syntax scala>  csc.sql(   "SELECT  *  FROM  candy.inventory").collect   Array[org.apache.spark.sql.Row]  =  Array(   [Wonka,Gobstopper,10],     [Wonka,WonkaBar,3],     [CandyTown,ChocoIsland,5],     [CandyTown,SugarMountain,2]   ) SchemaRDDQueryPlan
  • 67. Reading Data From Cassandra With SQL Syntax scala>  csc.sql(   "SELECT  *  FROM  candy.inventory").collect   Array[org.apache.spark.sql.Row]  =  Array(   [Wonka,Gobstopper,10],     [Wonka,WonkaBar,3],     [CandyTown,ChocoIsland,5],     [CandyTown,SugarMountain,2]   ) SchemaRDDQueryPlan
  • 68. Counting Data From Cassandra With SQL Syntax scala>  csc.sql("SELECT  COUNT(*)  FROM  candy.inventory").collect   res5:  Array[org.apache.spark.sql.Row]  =  Array([4])
  • 69. Counting Data From Cassandra With SQL Syntax scala>  csc.sql("SELECT  COUNT(*)  FROM  candy.inventory").collect   res5:  Array[org.apache.spark.sql.Row]  =  Array([4])
  • 70. Joining Data From Cassandra With SQL Syntax scala>  csc.sql("   SELECT  *  FROM  candy.inventory  as  inventory     JOIN  candy.requests  as  requests     WHERE  inventory.name  =  requests.name").collect   res12:  Array[org.apache.spark.sql.Row]  =  Array(   [Wonka,WonkaBar,3,Russ,WonkaBar,2],     [CandyTown,ChocoIsland,5,Russ,ChocoIsland,1]   )
  • 71. Joining Data From Cassandra With SQL Syntax scala>  csc.sql("   SELECT  *  FROM  candy.inventory  as  inventory     JOIN  candy.requests  as  requests     WHERE  inventory.name  =  requests.name").collect   res12:  Array[org.apache.spark.sql.Row]  =  Array(   [Wonka,WonkaBar,3,Russ,WonkaBar,2],     [CandyTown,ChocoIsland,5,Russ,ChocoIsland,1]   )
  • 72. Insert to another Cassandra Table csc.sql("   INSERT  INTO  candy.low     SELECT  *  FROM  candy.inventory  as  inv     WHERE  inv.amount  <  5  ").collect
  • 73. Insert to another Cassandra Table csc.sql("   INSERT  INTO  candy.low     SELECT  *  FROM  candy.inventory  as  inv     WHERE  inv.amount  <  5  ").collect
  • 74. Part 3: How To Stream To Cassandra From Spark
  • 75. Streaming is Cool and if you like Streaming you will be cool too Your Data is Delicious Like a Candy
  • 76. Streaming is Cool and if you like Streaming you will be cool too Your Data is Delicious Like a Candy You want it right now!
  • 77. Streaming is Cool and if you like Streaming you will be cool too Your Data is Delicious Like a Candy You want it right now! Batch Analytics: Waiting to do analysis after data has accumulated means data may be out of date or unimportant by the time we process it.
  • 78. Streaming is Cool and if you like Streaming you will be cool too Your Data is Delicious Like a Candy You want it right now! Batch Analytics: Waiting to do analysis after data has accumulated means data may be out of date or unimportant by the time we process it. Streaming Analytics: We do our analytics on the data as it arrives. The data won’t be stale and neither will our analytics
  • 79. DStreams: Basic unit of Spark Streaming Receiver DStream Events Streaming involves a receiver or set of receivers each of which publishes a DStream
  • 80. DStreams: Basic unit of Spark Streaming Receiver DStream Events Batch Batch RDD RDD RDD RDD The DStream is (Discretized) into batches, the timing of which is set in the Spark Streaming Context. Each Batch is made up of RDDs.
  • 81. Streaming Provides Extra Functions DStream RDD RDD RDD RDDRDD RDDRDD RDD Time Window 1-2 Window 2-3 Window 3-4 Windowing gives us easy access to slices of data in time
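In code a window is just another DStream transformation: you pass a window length and a slide interval, both multiples of the batch interval set on the StreamingContext. A sketch of a sliding per-URL count, assuming the requests DStream that is built later in the talk:

    import org.apache.spark.streaming.Seconds
    import org.apache.spark.streaming.StreamingContext._    // pair-DStream operations

    // hits per URL over the last 30 seconds, recomputed every 10 seconds
    val hitsPerUrl = requests
      .map(request => (request.uri.toString, 1L))
      .reduceByKeyAndWindow(_ + _, Seconds(30), Seconds(10))

    hitsPerUrl.print()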
  • 82. Receivers that Come With Spark Streaming And more!
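For common sources there is no custom receiver to write at all; Spark Streaming ships with several. Two sketches, assuming the ssc StreamingContext created later in the talk (the host, port, ZooKeeper address, and topic are placeholders, and the Kafka receiver lives in the separate spark-streaming-kafka module):

    // plain TCP text source: one line per event
    val lines = ssc.socketTextStream("localhost", 9999)

    // Kafka source
    import org.apache.spark.streaming.kafka.KafkaUtils
    val kafkaStream = KafkaUtils.createStream(
      ssc,
      "zookeeper-host:2181",      // ZooKeeper quorum
      "http-request-consumers",   // consumer group id
      Map("requests" -> 1))       // topic -> number of receiver threads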
  • 83. Demo Streaming Application: Analyze HttpRequests with Spark Streaming Spark Cassandra HttpServerTraffic Spark Executor Source Included in DSE 4.6.0
  • 84. Spark Receivers only really need to describe how to publish to a DStream  case  class  HttpRequest(          timeuuid:  UUID,          method:  String,          headers:  Map[String,  List[String]],          uri:  URI,          body:  String)  extends  ReceiverClass First we need to define a Case Class to make moving around HttpRequest information Easier. This type will be used to specify what type of DStream we are creating.
  • 85. Spark Receivers only really need to describe how to publish to a DStream class  HttpReceiver(port:  Int)      extends  Receiver[HttpRequest] (StorageLevel.MEMORY_AND_DISK_2)     with  Logging   {      def  onStart():  Unit  =  {}      def  onStop():  Unit  =  {}   } Now we just need to write the code for a receiver to actually publish these HttpRequest Objects Receiver [HttpRequest]
  • 86. Spark Receivers only really need to describe how to publish to a DStream import  com.sun.net.httpserver.{   HttpExchange,  HttpHandler,  HttpServer}   def  onStart():  Unit  =  {          val  s  =  HttpServer.create(new  InetSocketAddress(p),  0)        s.createContext("/",  new  StreamHandler())        s.start()        server  =  Some(s)   }       def  onStop():  Unit  =  server  map(_.stop(0))   This will start up our server and direct all HttpTraffic to be handled by StreamHandler Receiver [HttpRequest] HttpServer
  • 87. Spark Receivers only really need to describe how to publish to a DStream class StreamHandler extends HttpHandler { override def handle(transaction: HttpExchange): Unit = { val dataReader = new BufferedReader(new InputStreamReader(transaction.getRequestBody)) val data = Stream.continually(dataReader.readLine).takeWhile(_ != null).mkString("\n") val headers: Map[String, List[String]] = transaction.getRequestHeaders.toMap.map { case (k, v) => (k, v.toList)} store(HttpRequest( UUIDs.timeBased(), transaction.getRequestMethod, headers, transaction.getRequestURI, data)) transaction.sendResponseHeaders(200, 0) val response = transaction.getResponseBody response.close() // Empty response body transaction.close() // Finish Transaction } } StreamHandler actually does the work publishing events to the DStream. Receiver [HttpRequest] HttpServer StreamHandler
  • 88. Streaming Context sets Batch Timing val  ssc  =  new  StreamingContext(conf,  Seconds(5))   val  multipleStreams  =  (1  to  config.numDstreams).map  {  i  =>   ssc.receiverStream[HttpRequest](new  HttpReceiver(config.port))   }   val  requests  =  ssc.union(multipleStreams)
  • 89. Create One Receiver Per Node val  ssc  =  new  StreamingContext(conf,  Seconds(5))   val  multipleStreams  =  (1  to  config.numDstreams).map  {  i  =>   ssc.receiverStream[HttpRequest](new  HttpReceiver(config.port))   }   val  requests  =  ssc.union(multipleStreams) Receiver [HttpRequest] HttpServer StreamHandler Receiver [HttpRequest] HttpServer StreamHandler Receiver [HttpRequest] HttpServer StreamHandler
  • 90. Merge Separate DStreams into One val  ssc  =  new  StreamingContext(conf,  Seconds(5))   val  multipleStreams  =  (1  to  config.numDstreams).map  {  i  =>   ssc.receiverStream[HttpRequest](new  HttpReceiver(config.port))   }   val  requests  =  ssc.union(multipleStreams) Receiver [HttpRequest] HttpServer StreamHandler Receiver [HttpRequest] HttpServer StreamHandler Receiver [HttpRequest] HttpServer StreamHandler requests[HttpRequest]
  • 91. Cassandra Tables to Store HttpEvents CREATE  TABLE  IF  NOT  EXISTS  timeline  (   timesegment  bigint  ,   url  text,   t_uuid  timeuuid  ,   method  text,   headers  map  <text,  text>,   body  text  ,   PRIMARY  KEY  ((url,  timesegment)  ,  t_uuid  )) Persist Every Event That Comes into the System
  • 92. Cassandra Tables to Store HttpEvents CREATE  TABLE  IF  NOT  EXISTS  timeline  (   timesegment  bigint  ,   url  text,   t_uuid  timeuuid  ,   method  text,   headers  map  <text,  text>,   body  text  ,   PRIMARY  KEY  ((url,  timesegment)  ,  t_uuid  )) CREATE  TABLE  IF  NOT  EXISTS  method_agg(   url  text,   method  text,   time  timestamp,   count  bigint,   PRIMARY  KEY  ((url,method),  time)) Persist Every Event That Comes into the System Table For Counting the Number of Accesses to each Url Over Time
  • 93. Cassandra Tables to Store HttpEvents CREATE  TABLE  IF  NOT  EXISTS  timeline  (   timesegment  bigint  ,   url  text,   t_uuid  timeuuid  ,   method  text,   headers  map  <text,  text>,   body  text  ,   PRIMARY  KEY  ((url,  timesegment)  ,  t_uuid  )) CREATE  TABLE  IF  NOT  EXISTS  method_agg(   url  text,   method  text,   time  timestamp,   count  bigint,   PRIMARY  KEY  ((url,method),  time)) CREATE  TABLE  IF  NOT  EXISTS  sorted_urls(   url  text,   time  timestamp,   count  bigint,   PRIMARY  KEY  (time,  count)   ) Persist Every Event That Comes into the System Table For Counting the Number of Accesses to each Url Over Time Table for finding the most popular url in each batch
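The saveToCassandra calls on the next slides write through case classes whose field names match these column names. The deck doesn't show their definitions, so here is a sketch of what they plausibly look like (field types follow the CQL schema above; time is kept as a Long of epoch milliseconds, which the connector converts to a timestamp):

    case class timelineRow(
      timesegment: Long,
      url: String,
      t_uuid: java.util.UUID,
      method: String,
      headers: Map[String, String],
      body: String)

    case class methodAggRow(time: Long, url: String, method: String, count: Long)

    case class sortedUrlRow(time: Long, url: String, count: Long)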
  • 94. Persist the events without doing any manipulation requests.map  {              request  =>                  timelineRow(                      timesegment  =  UUIDs.unixTimestamp(request.timeuuid)  /  10000L,                      url  =  request.uri.toString,                      t_uuid  =  request.timeuuid,                      method  =  request.method,                      headers  =  request.headers.map  {  case  (k,  v)  =>  (k,   v.mkString("#"))},                      body  =  request.body)          }.saveToCassandra("requests_ks",  "timeline") Results
  • 95. Persist the events without doing any manipulation requests.map  {              request  =>                  timelineRow(                      timesegment  =  UUIDs.unixTimestamp(request.timeuuid)  /  10000L,                      url  =  request.uri.toString,                      t_uuid  =  request.timeuuid,                      method  =  request.method,                      headers  =  request.headers.map  {  case  (k,  v)  =>  (k,   v.mkString("#"))},                      body  =  request.body)          }.saveToCassandra("requests_ks",  "timeline") timesegment url t_uuid method Headers Body Results timelineRow
  • 96. Persist the events without doing any manipulation requests.map  {              request  =>                  timelineRow(                      timesegment  =  UUIDs.unixTimestamp(request.timeuuid)  /  10000L,                      url  =  request.uri.toString,                      t_uuid  =  request.timeuuid,                      method  =  request.method,                      headers  =  request.headers.map  {  case  (k,  v)  =>  (k,   v.mkString("#"))},                      body  =  request.body)          }.saveToCassandra("requests_ks",  "timeline") C* C*C* C* timesegment url t_uuid method Headers Body Results timelineRow
  • 97. Aggregate Requests by URI and Method requests.map(request  =>  (request.method,  request.uri.toString))              .countByValue()              .transform((rdd,  time)  =>  rdd.map  {     case  ((m,  u),  c)  =>  ((m,  u),  c,  time.milliseconds)})              .map  {  case  ((m,  u),  c,  t)  =>     methodAggRow(time  =  t,  url  =  u,  method  =  m,  count  =  c)}              .saveToCassandra("requests_ks",  "method_agg") method uri
  • 98. Aggregate Requests by URI and Method requests.map(request  =>  (request.method,  request.uri.toString))              .countByValue()              .transform((rdd,  time)  =>  rdd.map  {     case  ((m,  u),  c)  =>  ((m,  u),  c,  time.milliseconds)})              .map  {  case  ((m,  u),  c,  t)  =>     methodAggRow(time  =  t,  url  =  u,  method  =  m,  count  =  c)}              .saveToCassandra("requests_ks",  "method_agg") method uri method uri count CountByValue
  • 99. method uri count time Aggregate Requests by URI and Method requests.map(request  =>  (request.method,  request.uri.toString))              .countByValue()              .transform((rdd,  time)  =>  rdd.map  {     case  ((m,  u),  c)  =>  ((m,  u),  c,  time.milliseconds)})              .map  {  case  ((m,  u),  c,  t)  =>     methodAggRow(time  =  t,  url  =  u,  method  =  m,  count  =  c)}              .saveToCassandra("requests_ks",  "method_agg") method uri method uri count countByValue transform
  • 100. method uri count time Aggregate Requests by URI and Method requests.map(request  =>  (request.method,  request.uri.toString))              .countByValue()              .transform((rdd,  time)  =>  rdd.map  {     case  ((m,  u),  c)  =>  ((m,  u),  c,  time.milliseconds)})              .map  {  case  ((m,  u),  c,  t)  =>     methodAggRow(time  =  t,  url  =  u,  method  =  m,  count  =  c)}              .saveToCassandra("requests_ks",  "method_agg") method uri method uri count countByValue transform C* C*C* C* saveToCassandra
  • 101. method uri count time Aggregate Requests by URI and Method requests.map(request  =>  (request.method,  request.uri.toString))              .countByValue()              .transform((rdd,  time)  =>  rdd.map  {     case  ((m,  u),  c)  =>  ((m,  u),  c,  time.milliseconds)})              .map  {  case  ((m,  u),  c,  t)  =>     methodAggRow(time  =  t,  url  =  u,  method  =  m,  count  =  c)}              .saveToCassandra("requests_ks",  "method_agg") method uri method uri count countByValue transform C* C*C* C* saveToCassandra
  • 102. Sort Aggregates by Batch requests.map(request  =>  (request.uri.toString))              .countByValue()              .transform((rdd,  time)  =>  rdd.map  {     case  (u,  c)  =>  (u,  c,  time.milliseconds)})              .map  {  case  (u,  c,  t)  =>  sortedUrlRow(time  =  t,  url  =  u,  count  =  c)}              .saveToCassandra("requests_ks",  "sorted_urls") uri
  • 103. Sort Aggregates by Batch requests.map(request  =>  (request.uri.toString))              .countByValue()              .transform((rdd,  time)  =>  rdd.map  {     case  (u,  c)  =>  (u,  c,  time.milliseconds)})              .map  {  case  (u,  c,  t)  =>  sortedUrlRow(time  =  t,  url  =  u,  count  =  c)}              .saveToCassandra("requests_ks",  "sorted_urls") uri uri count countByValue
  • 104. Sort Aggregates by Batch requests.map(request  =>  (request.uri.toString))              .countByValue()              .transform((rdd,  time)  =>  rdd.map  {     case  (u,  c)  =>  (u,  c,  time.milliseconds)})              .map  {  case  (u,  c,  t)  =>  sortedUrlRow(time  =  t,  url  =  u,  count  =  c)}              .saveToCassandra("requests_ks",  "sorted_urls") uri uri count uri count time countByValue transform
  • 105. Sort Aggregates by Batch requests.map(request  =>  (request.uri.toString))              .countByValue()              .transform((rdd,  time)  =>  rdd.map  {     case  (u,  c)  =>  (u,  c,  time.milliseconds)})              .map  {  case  (u,  c,  t)  =>  sortedUrlRow(time  =  t,  url  =  u,  count  =  c)}              .saveToCassandra("requests_ks",  "sorted_urls") uri uri count uri count time countByValue transform Let Cassandra Do the Sorting! PRIMARY KEY (time, count) C* C*C* C*saveToCassandra
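Reading a batch's leaderboard back is then a single-partition query, since count is the clustering column (ascending by default, so the largest counts sit at the end unless the table adds WITH CLUSTERING ORDER BY (count DESC)). A sketch via the connector, where batchTime is a placeholder for the batch timestamp you want:

    val topUrls = sc.cassandraTable("requests_ks", "sorted_urls")
      .where("time = ?", batchTime)   // batchTime: the batch's timestamp, e.g. a java.util.Date
      .collect()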
  • 106. Start the application!        ssc.start()          ssc.awaitTermination() This will start the streaming application piping all incoming data to Cassandra!
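awaitTermination blocks until the context is stopped. When it is time to shut the application down, the context can be stopped gracefully so in-flight batches finish before the executors go away; a sketch (the shutdown trigger itself is whatever fits your deployment):

    ssc.start()
    ssc.awaitTermination()

    // from a shutdown hook or an admin endpoint:
    ssc.stop(stopSparkContext = true, stopGracefully = true)  // drain queued batches first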
  • 107. Live Demo Demo run Script #Start Streaming Application echo "Starting Streaming Receiver(s): Logging to http_receiver.log" cd HttpSparkStream dse spark-submit --class com.datastax.HttpSparkStream target/HttpSparkStream.jar -d $NUM_SPARK_NODES > ../http_receiver.log 2>&1 & cd .. echo "Waiting for 60 Seconds for streaming to come online" sleep 60 #Start Http Requester echo "Starting to send requests against streaming receivers: Logging to http_requester.log" cd HttpRequestGenerator ./sbt/sbt "run -i $SPARK_NODE_IPS " > ../http_requester.log 2>&1 & cd .. #Monitor Results Via Cqlsh watch -n 5 './monitor_queries.sh'
  • 109. I hope this gives you some exciting ideas for your applications! Questions?
  • 110. Thanks for coming to the meetup!! DataStax Academy offers free online Cassandra training! Planet Cassandra has resources for learning the basics from ‘Try Cassandra’ tutorials to in depth language and migration pages! Find a way to contribute back to the community: talk at a meetup, or share your story on PlanetCassandra.org! Need help? Get questions answered with Planet Cassandra’s free virtual office hours running weekly! Email us: Community@DataStax.com! Getting started with Cassandra?! In production?! Tweet us: @PlanetCassandra!