11. BDAS Summary (1/2)
Spark Core: General-purpose, low-level, low-latency processing engine. Supports the HDFS API, the Amazon S3 API, and Hive metadata.
Shark: Replaces Hive’s MapReduce execution engine with Spark.
Spark Streaming: Competitor to Storm. Accepts input from Kafka, Flume, Twitter, and TCP sockets.
MLlib: Low-level machine learning library running on Spark.
MLbase (in dev): Competitor to Mahout; runs on top of MLlib.
GraphX (in dev): Enables users to interactively build, transform, and reason about graph-structured data at scale.
12. BDAS Summary (2/2)
BlinkDB (alpha): SQL queries with bounded errors and bounded response times on very large data.
SparkR (alpha): Runs R on top of Spark.
Tachyon: A reliable in-memory distributed file system providing an HDFS-compatible API. Can persist data to HDFS, Amazon S3, the local file system, etc.
Mesos: Cluster resource manager; supports multi-tenancy.
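The "bounded errors" idea in the BlinkDB entry can be illustrated with a small sketch: answer an aggregate query from a random sample instead of the full data set, and report an error bar alongside the estimate. This is plain Python, not BlinkDB's API; the function name `approx_mean`, its parameters, and the normal-approximation confidence interval are assumptions made for illustration only.

```python
import math
import random
import statistics


def approx_mean(values, sample_size, z=1.96, seed=0):
    """Estimate the mean of `values` from a uniform random sample.

    Hypothetical helper, not part of BlinkDB. Returns (estimate, margin),
    where margin is the half-width of an approximate confidence interval
    (normal approximation; z=1.96 corresponds to roughly 95%).
    """
    rng = random.Random(seed)
    # Sample without replacement; a smaller sample is faster but less precise.
    sample = rng.sample(values, sample_size)
    estimate = statistics.fmean(sample)
    # Standard error of the sample mean.
    std_err = statistics.stdev(sample) / math.sqrt(sample_size)
    return estimate, z * std_err


if __name__ == "__main__":
    rng = random.Random(42)
    data = [rng.uniform(0, 100) for _ in range(100_000)]
    est, margin = approx_mean(data, sample_size=5_000)
    print(f"approx mean = {est:.2f} +/- {margin:.2f}")
```

The sample size is the knob that trades accuracy for response time: a larger sample narrows the error bar but takes longer to scan.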
16. Spark and the future of big data applications
Eric Baldeschwieler (Tech Advisor)
18. Spark’s current (v1.0) challenges
Better job scheduling tools
Increase focus on ETL
R bindings
Extend SparkSQL to run on more data stores
Add more machine learning algorithms
Basics: stability, profiling and debugging, error reporting, logging, etc.
28. SNAP: Scalable Nucleotide Alignment Program
A new genome aligner based on Spark that is 10-100x faster, and at the same time more accurate, than existing tools based on MapReduce or other algorithms [1]
[1] https://amplab.cs.berkeley.edu/projects/snap/
29. SNAP helps save a life [1]
A teenager was hospitalized for 5 weeks without a successful diagnosis
He developed brain seizures and was placed in a medically induced coma
Analysis of a sample of his spinal fluid with SNAP revealed a rare infectious bacterium
The boy was treated and discharged 4 weeks later
[1] https://amplab.cs.berkeley.edu/2014/06/04/snap-helps-save-a-life/