3. About me
• Sysadmin/DevOps background
• Worked as DevOps @Visualdna
• Now building game analytics platform @Sony Computer Entertainment Europe
4. Outline
• What is ETL
• How do we do it in the standard Hadoop stack
• How can we supercharge it with Spark
• Real-life use cases
• How to deploy Spark
• Lessons learned
10. Hadoop
• Industry standard
• Have you ever looked at Hadoop code and tried to fix something?
11. How simple is simple?
"Simple YARN application to run n copies of a unix command - deliberately kept simple (with minimal error handling etc.)"
➜ $ git clone https://github.com/hortonworks/simple-yarn-app.git
(…)
➜ $ find simple-yarn-app -name "*.java" | xargs cat | wc -l
232
12. ETL Workflow
• Get some data from S3/HDFS
• Map
• Shuffle
• Reduce
• Save to S3/HDFS
13. ETL Workflow
• Get some data from S3/HDFS
• Map
• Shuffle
• Reduce
• Save to S3/HDFS
Repeat 10 times
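For orientation, here is one such read-map-shuffle-reduce-write cycle written as a single Spark job, which is where this deck is heading. A hedged sketch: the bucket paths and the tab-separated key extraction are illustrative, not from the talk.

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._ // pair-RDD operations (needed pre-Spark 1.3)

object EtlCycle {
  def main(args: Array[String]) {
    val sc = new SparkContext("local[4]", "etl-cycle")
    sc.textFile("s3n://bucket/input")           // get some data from S3/HDFS
      .map(line => (line.split("\t")(0), 1L))   // map: extract a key
      .reduceByKey(_ + _)                       // shuffle + reduce
      .map { case (k, v) => k + "\t" + v }
      .saveAsTextFile("s3n://bucket/output")    // save to S3/HDFS
    sc.stop()
  }
}

In the classic Hadoop stack each such cycle is a separate MapReduce job, so "repeat 10 times" means 10 jobs and 10 HDFS round trips.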
14. Issue: Test run time
• Job startup time: ~20s to run a job that does nothing
• Hard to test the code without a cluster (Cascading simulation mode != real life; see the local-mode sketch below)
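For contrast with where the deck is going: Spark can run the same code on an in-process local "cluster", so tests start in seconds with no cluster at all. A minimal sketch, assuming only spark-core on the classpath:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._ // pair-RDD implicits (needed pre-Spark 1.3)

// "local[4]" = in-process scheduler with 4 threads - no cluster needed
val sc = new SparkContext("local[4]", "unit-test")
val counts = sc.parallelize(Seq("a", "b", "a"))
  .map(w => (w, 1))
  .reduceByKey(_ + _)
  .collectAsMap()
assert(counts("a") == 2)
sc.stop()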
15. Issue: new applications
MapReduce is awkward for key big data workloads:
• Low-latency dispatch (e.g. quick queries)
• Iterative algorithms (e.g. ML, graph…)
• Streaming data ingest
16. Issue: hardware is moving on
Hardware has advanced since Hadoop started:
• Very large RAM, faster networks (10Gb+)
• Bandwidth to disk not keeping up
• 1 GB of RAM ~ $0.75/month *
* based on the spot price of an AWS r3.8xlarge instance
18. Use Spark
• Fast and Expressive Cluster Computing Engine
• Compatible with Apache Hadoop
• In-memory storage
• Rich APIs in Java, Scala, Python
19. Why Spark?
• Up to 40x faster than Hadoop MapReduce
(for some use cases; see: https://amplab.cs.berkeley.edu/benchmark/)
• Jobs can be scheduled and run in <1s
• Typically 2-5x less code
• Seamless Hadoop/HDFS integration
• REPL (see the spark-shell snippet after this list)
• Accessible source code, in terms of LOC and modularity
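The REPL alone is a big workflow change: you can poke at cluster data interactively instead of compiling and submitting a job. A sketch (the HDFS path is illustrative):

➜ $ ./bin/spark-shell
scala> val events = sc.textFile("hdfs:///data/events") // sc is pre-created by the shell
scala> events.filter(_.contains("purchase")).count()   // runs on the cluster, answer returns to the prompt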
20. Why Spark?
• Berkeley Data Analytics Stack ecosystem:
• Spark, Spark Streaming, Shark, BlinkDB, MLlib
• Deep integration into Hadoop ecosystem
• Read/write Hadoop formats
• Interoperability with other ecosystem components
• Runs on Mesos & YARN, also MR1
• EC2, EMR
• HDFS, S3
27. Spark use-cases
• next-generation ETL platform
• No more "multiple chained MapReduce jobs" architecture (see the sketch below)
• Fewer jobs to worry about
• Better sleep for your DevOps team
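A hedged sketch of what replaces the chained-jobs architecture: several map/shuffle stages live in one application, and the intermediate result is cached in memory instead of being written to HDFS between jobs. The "user<TAB>event" input format and the sessionizing placeholder are illustrative, not the talk's actual pipeline.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD operations (needed pre-Spark 1.3)

val sc = new SparkContext(new SparkConf().setAppName("etl-chain"))

val events = sc.textFile("s3n://bucket/raw").map { line =>
  val Array(user, event) = line.split("\t", 2) // hypothetical input format
  (user, event)
}

val sessions = events
  .groupByKey()               // shuffle, like the first MR job in the old chain
  .mapValues(_.toList.sorted) // placeholder for real sessionizing logic
  .cache()                    // intermediate result stays in memory

// two outputs from the same cached RDD - no HDFS write between "jobs"
sessions.mapValues(_.size).saveAsTextFile("s3n://bucket/session-lengths")
sessions.filter(_._2.contains("purchase")).saveAsTextFile("s3n://bucket/payers")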
38. v1.0 - running on EC2
• Start with an EC2 script
./spark-ec2 -k <keypair> -i <key-file> -s <num-slaves> --instance-type=c3.xlarge launch <cluster-name>
If it does not work for you, modify it - it's just simple Python + boto
39. v2.0 - Autoscaling on spot instances
1x Master - on-demand (c3.large)
XX Slaves - spot instances, their number depending on usage patterns (r3.*)
• no HDFS
• persistence in memory + S3
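In code, "persistence in memory + S3" might look like this - a sketch assuming an existing SparkContext sc; the bucket names are illustrative, and s3n:// was the usual scheme at the time:

import org.apache.spark.storage.StorageLevel

val data = sc.textFile("s3n://bucket/input")
  .persist(StorageLevel.MEMORY_ONLY_SER) // working set lives in executor RAM, serialized

// durability comes from writing results back to S3 instead of HDFS
data.filter(_.nonEmpty).saveAsTextFile("s3n://bucket/output")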
42. JVM issues
• java.lang.OutOfMemoryError: GC overhead limit exceeded
• add more memory?
val sparkConf = new SparkConf()
.set("spark.executor.memory", "120g")
.set("spark.storage.memoryFraction","0.3")
.set("spark.shuffle.memoryFraction","0.3")
• increase parallelism by passing explicit partition counts:
sc.textFile("s3://..path", 10000)
pairRdd.groupByKey(10000) // on any key/value RDD
43. Full GC
2014-05-21T10:15:23.203+0000: 200.710: [Full GC 109G->45G(110G), 79.3771030 secs]
2014-05-21T10:16:42.580+0000: 280.087: Total time for which application threads were stopped: 79.3773830 seconds
We want to avoid this.
• Use G1GC + Java 8
• Store data serialized:
.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.set("spark.kryo.registrator", "scee.SceeKryoRegistrator")
44. Bugs
• for example: CDH5 does not work with Amazon S3 out of the box (thanks to Sean, it will be fixed in the next release)
• If in doubt, use the provided ec2/spark-ec2 script:
./spark-ec2 -k <keypair> -i <key-file> -s <num-slaves> --instance-type=c3.xlarge launch <cluster-name>
45. Tips & Tricks
• You do not need to package the whole of Spark with your app - just mark the Spark and Hadoop dependencies as "provided" in sbt:
libraryDependencies += "org.apache.spark" %% "spark-core" % "0.9.0-cdh5.0.1" % "provided"
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.3.0-cdh5.0.1" % "provided"
Assembly jar size goes from 120MB -> 5MB (built with sbt-assembly; sketch below)
• Always make sure you are compiling against the same versions of the artifacts; if not, "bad things will happen"™
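The 120MB -> 5MB drop comes from building the fat jar with the sbt-assembly plugin while Spark and Hadoop stay "provided"; a sketch of the wiring (the plugin version is a guess for that era):

// project/plugins.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.11.2")
// build.sbt additionally needs the plugin's settings (assemblySettings in 0.11.x)

➜ $ sbt assembly
(the resulting jar contains your code and real dependencies, but no Spark/Hadoop classes)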
46. Future - Spark 1.0
• Voting in progress to release Spark 1.0.0 RC11
• Spark SQL
• History server
• Job Submission Tool
• Java 8 support
47. Spark - Hadoop done right
• Faster to run, less code to write
• Deploying Spark can be easy and cost-effective
• Still rough around the edges, but improving quickly