3. About me
• Sysadmin/DevOps background
• Worked as DevOps @Visualdna
• Now building game analytics platform @Sony Computer Entertainment Europe
4. Outline
• What is ETL
• How do we do it in the standard Hadoop stack
• How can we supercharge it with Spark
• Real-life use cases
• How to deploy Spark
• Lessons learned
10. Hadoop
• Industry standard
• Have you ever looked at Hadoop code and tried to fix something?
11. How simple is simple?
"Simple YARN application to run n copies of a unix command - deliberately kept simple (with minimal error handling etc.)"
➜ $ git clone https://github.com/hortonworks/simple-yarn-app.git
(…)
➜ $ find simple-yarn-app -name "*.java" | xargs cat | wc -l
232
12. ETL Workflow
• Get some data from S3/HDFS
• Map
• Shuffle
• Reduce
• Save to S3/HDFS
13. ETL Workflow
• Get some data from S3/HDFS
• Map
• Shuffle
• Reduce
• Save to S3/HDFS
Repeat 10 times
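For orientation, here is one such read-map-shuffle-reduce-write cycle written as a single Spark job, which is where this deck is heading. A hedged sketch: the bucket paths and the tab-separated key extraction are illustrative, not from the talk.

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._ // pair-RDD operations (needed pre-Spark 1.3)

object EtlCycle {
  def main(args: Array[String]) {
    val sc = new SparkContext("local[4]", "etl-cycle")
    sc.textFile("s3n://bucket/input")           // get some data from S3/HDFS
      .map(line => (line.split("\t")(0), 1L))   // map: extract a key
      .reduceByKey(_ + _)                       // shuffle + reduce
      .map { case (k, v) => k + "\t" + v }
      .saveAsTextFile("s3n://bucket/output")    // save to S3/HDFS
    sc.stop()
  }
}

In the classic Hadoop stack each such cycle is a separate MapReduce job, so "repeat 10 times" means 10 jobs and 10 HDFS round trips.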
14. Issue: Test run time
• Job startup time: ~20s to run a job that does nothing
• Hard to test the code without a cluster (Cascading simulation mode != real life; see the local-mode sketch below)
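For contrast with where the deck is going: Spark can run the same code on an in-process local "cluster", so tests start in seconds with no cluster at all. A minimal sketch, assuming only spark-core on the classpath:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._ // pair-RDD implicits (needed pre-Spark 1.3)

// "local[4]" = in-process scheduler with 4 threads - no cluster needed
val sc = new SparkContext("local[4]", "unit-test")
val counts = sc.parallelize(Seq("a", "b", "a"))
  .map(w => (w, 1))
  .reduceByKey(_ + _)
  .collectAsMap()
assert(counts("a") == 2)
sc.stop()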
15. Issue: new applications
MapReduce is awkward for key big data workloads:
• Low-latency dispatch (e.g. quick queries)
• Iterative algorithms (e.g. ML, graph…)
• Streaming data ingest
16. Issue: hardware is moving on
Hardware has advanced since Hadoop started:
• Very large RAM, faster networks (10Gb+)
• Bandwidth to disk not keeping up
• 1 GB of RAM ~ $0.75/month *
* based on the spot price of an AWS r3.8xlarge instance
18. Use Spark
• Fast and Expressive Cluster Computing Engine
• Compatible with Apache Hadoop
• In-memory storage
• Rich APIs in Java, Scala, Python
19. Why Spark?
• Up to 40x faster than Hadoop MapReduce
(for some use cases; see: https://amplab.cs.berkeley.edu/benchmark/)
• Jobs can be scheduled and run in <1s
• Typically 2-5x less code
• Seamless Hadoop/HDFS integration
• REPL (see the spark-shell snippet after this list)
• Accessible source code, in terms of LOC and modularity
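The REPL alone is a big workflow change: you can poke at cluster data interactively instead of compiling and submitting a job. A sketch (the HDFS path is illustrative):

➜ $ ./bin/spark-shell
scala> val events = sc.textFile("hdfs:///data/events") // sc is pre-created by the shell
scala> events.filter(_.contains("purchase")).count()   // runs on the cluster, answer returns to the prompt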
20. Why Spark?
• Berkeley Data Analytics Stack ecosystem:
• Spark, Spark Streaming, Shark, BlinkDB, MLlib
• Deep integration into Hadoop ecosystem
• Read/write Hadoop formats
• Interoperability with other ecosystem components
• Runs on Mesos & YARN, also MR1
• EC2, EMR
• HDFS, S3
27. Spark use-cases
• next-generation ETL platform
• No more "multiple chained MapReduce jobs" architecture (see the sketch below)
• Fewer jobs to worry about
• Better sleep for your DevOps team
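A hedged sketch of what replaces the chained-jobs architecture: several map/shuffle stages live in one application, and the intermediate result is cached in memory instead of being written to HDFS between jobs. The "user<TAB>event" input format and the sessionizing placeholder are illustrative, not the talk's actual pipeline.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD operations (needed pre-Spark 1.3)

val sc = new SparkContext(new SparkConf().setAppName("etl-chain"))

val events = sc.textFile("s3n://bucket/raw").map { line =>
  val Array(user, event) = line.split("\t", 2) // hypothetical input format
  (user, event)
}

val sessions = events
  .groupByKey()               // shuffle, like the first MR job in the old chain
  .mapValues(_.toList.sorted) // placeholder for real sessionizing logic
  .cache()                    // intermediate result stays in memory

// two outputs from the same cached RDD - no HDFS write between "jobs"
sessions.mapValues(_.size).saveAsTextFile("s3n://bucket/session-lengths")
sessions.filter(_._2.contains("purchase")).saveAsTextFile("s3n://bucket/payers")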
38. v1.0 - running on EC2
• Start with an EC2 script
./spark-ec2 -k <keypair> -i <key-file> -s <num-slaves> --instance-type=c3.xlarge launch <cluster-name>
If it does not work for you, modify it - it's just simple Python + boto
39. v2.0 - Autoscaling on spot instances
1x Master - on-demand (c3.large)
XX Slaves - spot instances, their number depending on usage patterns (r3.*)
• no HDFS
• persistence in memory + S3
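In code, "persistence in memory + S3" might look like this - a sketch assuming an existing SparkContext sc; the bucket names are illustrative, and s3n:// was the usual scheme at the time:

import org.apache.spark.storage.StorageLevel

val data = sc.textFile("s3n://bucket/input")
  .persist(StorageLevel.MEMORY_ONLY_SER) // working set lives in executor RAM, serialized

// durability comes from writing results back to S3 instead of HDFS
data.filter(_.nonEmpty).saveAsTextFile("s3n://bucket/output")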
42. JVM issues
• java.lang.OutOfMemoryError: GC overhead limit exceeded
• add more memory?
val sparkConf = new SparkConf()
.set("spark.executor.memory", "120g")
.set("spark.storage.memoryFraction","0.3")
.set("spark.shuffle.memoryFraction","0.3")
• increase parallelism by passing explicit partition counts:
sc.textFile("s3://..path", 10000)
pairRdd.groupByKey(10000) // on any key/value RDD
43. Full GC
2014-05-21T10:15:23.203+0000: 200.710: [Full GC 109G->45G(110G), 79.3771030 secs]
2014-05-21T10:16:42.580+0000: 280.087: Total time for which application threads were stopped: 79.3773830 seconds
We want to avoid this.
• Use G1GC + Java 8
• Store data serialized:
.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.set("spark.kryo.registrator", "scee.SceeKryoRegistrator")
44. Bugs
• for example: CDH5 does not work with Amazon S3 out of the box (thanks to Sean, it will be fixed in the next release)
• If in doubt, use the provided ec2/spark-ec2 script:
./spark-ec2 -k <keypair> -i <key-file> -s <num-slaves> --instance-type=c3.xlarge launch <cluster-name>
45. Tips & Tricks
• You do not need to package the whole of Spark with your app - just mark the Spark and Hadoop dependencies as "provided" in sbt:
libraryDependencies += "org.apache.spark" %% "spark-core" % "0.9.0-cdh5.0.1" % "provided"
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.3.0-cdh5.0.1" % "provided"
Assembly jar size goes from 120MB -> 5MB (built with sbt-assembly; sketch below)
• Always make sure you are compiling against the same versions of the artifacts; if not, "bad things will happen"™
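The 120MB -> 5MB drop comes from building the fat jar with the sbt-assembly plugin while Spark and Hadoop stay "provided"; a sketch of the wiring (the plugin version is a guess for that era):

// project/plugins.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.11.2")
// build.sbt additionally needs the plugin's settings (assemblySettings in 0.11.x)

➜ $ sbt assembly
(the resulting jar contains your code and real dependencies, but no Spark/Hadoop classes)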
46. Future - Spark 1.0
• Voting in progress to release Spark 1.0.0 RC11
• Spark SQL
• History server
• Job Submission Tool
• Java 8 support
47. Spark - Hadoop done right
• Faster to run, less code to write
• Deploying Spark can be easy and cost-effective
• Still rough around the edges, but improving quickly