This presentation gives an overview of Apache Spark and explains the features of Apache Zeppelin (incubating). Zeppelin is an open source tool for data discovery, exploration and visualization. It supports REPLs for shell, SparkSQL, Spark (Scala), Python and Angular. This presentation was given on Big Data Day at the Great Indian Developer Summit, Bangalore, April 2015.
4. Big Data
• Data size beyond a single system's capability
– Terabyte, Petabyte, Exabyte
• Storage
– Commodity servers, RAID, SAN
• Processing
– In reasonable response time
– A challenge here
5. Traditional processing tools
• Move what?
– the data to the code, or
– the code to the data
[Diagram: a Data server and a Code server, with arrows showing the two options – move data to code, or move code to data]
6. Traditional processing tools
• Traditional tools
– RDBMS, DWH, BI
– High cost
– Difficult to scale beyond a certain data size
• price/performance skew
• data variety not supported
7. Map-Reduce and NoSQL
• Hadoop toolset
– Free and open source
– Commodity hardware
– Scales to exabytes (10^18 bytes), maybe even more
• NoSQL ("Not only SQL")
– Storage and query processing only
– Complements the Hadoop toolset
– Volume, velocity and variety
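The map-reduce model behind the Hadoop toolset can be sketched in a few lines of plain Python (a toy, single-machine illustration, not Hadoop itself): a map phase emits (word, 1) pairs, a shuffle groups them by key, and a reduce phase sums each group.

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in every input line
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle: group all values by key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data big tools", "big data"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts)  # {'big': 3, 'data': 2, 'tools': 1}
```

On a cluster, the map and reduce phases run in parallel on many machines and the shuffle moves data over the network; the shape of the computation is the same.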
8. All is well?
• Hadoop was designed for batch processing
• Disk based processing: slow
• Many tools to enhance Hadoop’s capabilities
– Distributed cache, Haloop, Hive, HBase
• Not suited for interactive and iterative workloads
10. What is the singularity?
[Chart: "Decade vs AI capacity" – AI capacity grows with each decade until it crosses the point of singularity]
11. Technological singularity
• When AI capability exceeds human capability
• AI or non-AI singularity
• 2045: the predicted year
– http://en.wikipedia.org/wiki/Ray_Kurzweil
13. History of Spark
• 2009 – Spark created by Matei Zaharia at UC Berkeley
• 2010 – Spark becomes open source
• 2013 – Spark donated to the Apache Software Foundation
• 2014 – Spark 1.0.0 released; 100 TB sort achieved in 23 mins
• 2015 (March) – Spark 1.3.1 released
26. Interactive data analytics
• For Spark and Flink
• Web front end
• At the back end, it connects to
– SQL systems (e.g. Hive)
– Spark
– Flink
27. Deployment Architecture
[Diagram: web browsers 1–3 connect to the web server of the Zeppelin daemon; the daemon runs local interpreters and, optionally, remote interpreters, which talk to Spark / Flink / Hive]
28. Notebook
• Is where you do your data analysis
• Web UI REPL with pluggable interpreters
• Interpreters
– Scala, Python, Angular, SparkSQL, Markdown and Shell
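Each notebook paragraph starts with a directive that selects its interpreter. An illustrative fragment (the bank table is the one used later in the SQL example; the shell path is hypothetical):

```
%md ## Campaign report

%sh ls /data/bank

%sql select count(1) from bank
```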
30. User Interface features
• Markdown
• Dynamic HTML generation
• Dynamic chart generation
• Screen sharing via websockets
31. SQL Interpreter
• SQL shell
– Query Spark data using SQL
– Returns plain text, HTML or chart-type results
32. Scala interpreter for Spark
• Similar to the Spark shell
• Upload your data into Spark
• Query the datasets (RDDs) on your Spark server
• Execute map-reduce tasks
• Actions on RDD
• Transformations on RDD
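The transformation/action split is the key idea: transformations (map, filter) only describe a computation, while an action (collect, count) triggers it and returns a result. A plain-Python sketch of the same laziness using generators (not Spark itself; in the Zeppelin Scala shell the equivalent would be something like sc.parallelize(data).map(...).filter(...).collect()):

```python
data = range(1, 6)

# "Transformations": lazy, nothing is computed yet
squared = (x * x for x in data)             # like rdd.map(lambda x: x * x)
evens = (x for x in squared if x % 2 == 0)  # like rdd.filter(lambda x: x % 2 == 0)

# "Action": forces evaluation and returns a result to the driver
result = list(evens)                        # like rdd.collect()
print(result)  # [4, 16]
```

Because nothing runs until the action, Spark can fuse the whole chain of transformations into one pass over the data.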
41. Zeppelin views: Table from SQL
%sql select age, count(1) from bank where
marital="${marital=single,single|divorced|married}"
group by age order by age
• The ${marital=single,single|divorced|married} syntax is a dynamic form: it renders a select box named marital with a default value of single and the listed choices
45. Share variables: MVVM
• Between Scala/Python/Spark and Angular
• Observe Scala variables from Angular
[Diagram: Zeppelin propagates a variable x between the Scala-Spark side (x = "foo") and the Angular side (x = "bar")]
47. Screen sharing using Zeppelin
• Share your graphical reports
– Live sharing
– Get the share URL from Zeppelin and share it with others
– Uses websockets
• Embed live reports in web pages