BDM26: Spark Summit 2014 Debriefing

David Lauzon

Highlights of the most interesting topics discussed at the Spark Summit 2014 in San Francisco, California

Spark Summit 2014
Debriefing
David Lauzon
Presented at Big Data Montreal #26 on July 8th 2014

Plan
● Spark Summit 2014 summary
● Tachyon
● BlinkDB
● Databricks Cloud

Disclaimer
I haven’t used Spark yet
I haven’t validated all the info gathered in this
presentation
Try it out for yourself :-)

Spark’s Role in the Big
Data Ecosystem
Matei Zaharia (CTO, Databricks)

“Spark is now the most active
project in the Hadoop ecosystem”

“The goal of Spark is to be a unified
platform and standard library for big
data apps”

What’s Next for BDAS?
Mike Franklin
(Director, UC Berkeley AMPLab)

LAYERS
● Application
● Data Processing
● Resource Management
● Data Management

BDAS Summary (1/2)
● Spark Core: General-purpose, low-level, low-latency processing engine. Supports the HDFS API, the Amazon S3 API, and Hive metadata
● Shark: Replaces Hive’s MapReduce execution engine with Spark
● Spark Streaming: Competitor to Storm. Inputs from Kafka, Flume, Twitter, and TCP sockets
● MLlib: Low-level machine learning library running on Spark
● MLbase (in dev): Competitor to Mahout; runs on top of MLlib
● GraphX (in dev): Enables users to interactively build, transform, and reason about graph-structured data at scale

BDAS Summary (2/2)
● BlinkDB (alpha): SQL queries with bounded errors and bounded response times on very large data
● SparkR (alpha): Run R on top of Spark
● Tachyon: A reliable in-memory distributed file system providing an HDFS-compatible API. Can persist data to HDFS, Amazon S3, LocalFS, etc.
● Mesos: Cluster resource manager with multi-tenancy
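
To make the "bounded errors" idea behind BlinkDB more concrete, here is a minimal sketch in plain Python. It is not BlinkDB's implementation and uses none of its APIs; it only illustrates the general technique of answering an aggregate query from a random sample and reporting an error bound alongside the answer. The function name `approximate_mean`, the synthetic data, and the 95% normal-approximation interval are all assumptions made for illustration.

```python
import math
import random
import statistics


def approximate_mean(values, sample_size, z=1.96, seed=0):
    """Estimate the mean of `values` from a uniform random sample.

    Returns (estimate, margin), where margin is the half-width of an
    approximate 95% confidence interval (normal approximation, z = 1.96).
    Hypothetical helper for illustration only, not a BlinkDB API.
    """
    rng = random.Random(seed)
    # Sample without replacement; only these rows are scanned.
    sample = rng.sample(values, sample_size)
    estimate = statistics.fmean(sample)
    # Standard error of the sample mean: s / sqrt(n).
    std_error = statistics.stdev(sample) / math.sqrt(sample_size)
    return estimate, z * std_error


if __name__ == "__main__":
    # Synthetic stand-in for a large table column (assumed data).
    data_rng = random.Random(42)
    data = [data_rng.gauss(100.0, 15.0) for _ in range(200_000)]
    estimate, margin = approximate_mean(data, sample_size=10_000)
    print(f"mean ~ {estimate:.2f} +/- {margin:.2f} (95% confidence)")
```

The trade-off is the one the slide describes: a smaller sample gives a faster answer with a wider error margin, and a larger sample narrows the margin at the cost of response time.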

Spark and the future of
big data applications
Eric Baldeschwieler (Tech Advisor)

Spark’s current (v1.0) challenges
Better job scheduling tools
Increase focus on ETL
R bindings
Extend SparkSQL to run on more data stores
Add more machine learning algorithms
Basics: stability, profiling & debugging, error
reporting, logging, etc.

Spark’s current (v1.0) challenges
Better stability
Profiling & debugging
Error reporting
Logging

The Future of Spark
Patrick Wendell (Databricks)

Timeline
and:
● join optimisations
● MLlib: from 15 to 30 algorithms
● Core internal API for pluggable
implementations

The Emergence of the
Enterprise Data Hub
Mike Olson (Chief Strategy Officer,
Cloudera)

This means that sooner or later ...
Hadoop
MapReduce

Spark meets Genomics:
Helping Fight the Big C
with the Big D
David Patterson (AMP Lab, UC Berkeley)

SNAP: Scalable Nucleotide
Alignment Program
=> A new genome aligner based on Spark that
is 10-100X faster and simultaneously more
accurate than existing tools based on
MapReduce or other algorithms [1]
[1] https://amplab.cs.berkeley.edu/projects/snap/

SNAP helps save a life [1]
A teenager was hospitalized for 5 weeks
without successful diagnosis
He developed brain seizures and was placed in
a medically induced coma
With a sample of his spinal fluid and the use of
SNAP, a rare infectious bacterium was found
The boy was treated and discharged 4 weeks later
[1] https://amplab.cs.berkeley.edu/2014/06/04/snap-helps-save-a-life/

Databricks Update and
Announcing Databricks
Cloud
Ion Stoica (CEO, Databricks)

Databricks Cloud Demo
The following video extract integrates:
● Databricks Workspace
● Databricks Platform
● Spark Streaming
● Spark SQL
● Spark MLlib

Databricks Cloud Demo
14min extract:
http://youtu.be/dJQ5lV5Tldw?t=26m57s
Full video:
https://www.youtube.com/watch?v=dJQ5lV5Tldw

Databricks Cloud
Great tool for data scientists

Conclusion
Most interesting Spark-related projects:
● SparkSQL
● BlinkDB
● Tachyon
● Databricks Cloud

8. native driver

17. Big Data Application Model

24. (a vision of the future)

31. even RedHat Fedora

32. New: Databricks Cloud Platform

33. Databricks Platform

34. Databricks Workspace: Notebooks

35. Databricks Workspace: Dashboards

Editor's notes

a kind of cloud-hosted iPython Notebook
Demo Wikipedia ML Twitter realtime graph