New Directions for Spark in 2015 - Spark Summit East

This document summarizes new directions for Spark in 2015: high-level interfaces for data science similar to single-machine tools, platform interfaces for plugging in external data sources and algorithms, machine learning pipelines inspired by scikit-learn, an R interface for Spark, and a community index of third-party packages. The goal is a unified engine that handles a variety of data sources, workloads, and environments.

New Directions for Spark in 2015
Matei Zaharia
March 18, 2015

2014: an Amazing Year for Spark
Total contributors: 150 => 500
Lines of code: 190K => 370K
500+ active production deployments

[Chart: Contributors per Month to Spark, 2011 to 2015]
Most active project in big data

On-Disk Sort Record: Time to sort 100TB
2013 Record: Hadoop, 2100 machines, 72 minutes
2014 Record: Spark, 207 machines, 23 minutes
Source: Daytona GraySort benchmark, sortbenchmark.org
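
The two sort records can be compared with some quick arithmetic. The sketch below uses only the machine counts and times from the slide. The "machine-minutes" figure is an added, crude measure that assumes the machines are comparable, which the slide does not state.

```python
# Figures taken from the slide (Daytona GraySort, 100TB).
hadoop_2013 = {"machines": 2100, "minutes": 72}
spark_2014 = {"machines": 207, "minutes": 23}

# How much faster the 2014 run finished, and how many fewer machines it used.
time_ratio = hadoop_2013["minutes"] / spark_2014["minutes"]
machine_ratio = hadoop_2013["machines"] / spark_2014["machines"]

# Crude combined measure (assumption: machines are roughly comparable).
machine_minutes_ratio = (
    hadoop_2013["machines"] * hadoop_2013["minutes"]
) / (spark_2014["machines"] * spark_2014["minutes"])

print(f"time: {time_ratio:.1f}x, machines: {machine_ratio:.1f}x, "
      f"machine-minutes: {machine_minutes_ratio:.1f}x")
```

In other words, the 2014 run finished about 3 times faster on about a tenth of the machines.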

Major Additions in 2014
Spark SQL
Java 8 syntax
Python streaming
…
GraphX
Random forests
Streaming MLlib

New Directions in 2015
Data Science: High-level interfaces similar to single-machine tools
Platform Interfaces: Plug in data sources and algorithms

DataFrames
Similar API to data frames in R and Pandas
Automatically optimized via Spark SQL
Out in Spark 1.3
df = jsonFile("tweets.json")
(df[df["user"] == "matei"]
    .groupBy("date")
    .sum("retweets"))
[Chart: Running Time for Python, Scala, and DataFrame]
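
To make the meaning of the DataFrame query concrete, here is a plain-Python sketch (not Spark code) of what the filter, groupBy, and sum steps compute. The sample records and the helper name `retweets_per_date` are made up for illustration. Only the column names `user`, `date`, and `retweets` come from the slide.

```python
from collections import defaultdict

# Hypothetical sample records standing in for the contents of tweets.json.
tweets = [
    {"user": "matei", "date": "2015-03-17", "retweets": 3},
    {"user": "matei", "date": "2015-03-17", "retweets": 5},
    {"user": "matei", "date": "2015-03-18", "retweets": 2},
    {"user": "other", "date": "2015-03-18", "retweets": 9},
]

def retweets_per_date(records, user):
    """Keep one user's records, group them by date, and sum the retweets."""
    totals = defaultdict(int)
    for rec in records:
        if rec["user"] == user:                     # df[df["user"] == "matei"]
            totals[rec["date"]] += rec["retweets"]  # .groupBy("date").sum("retweets")
    return dict(totals)

result = retweets_per_date(tweets, "matei")
print(result)
```

The DataFrame version expresses the same computation declaratively, which is what lets Spark SQL optimize it before running it.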

Machine Learning Pipelines
High-level API inspired by SciKit-Learn
Featurization, evaluation, parameter search
tokenizer = Tokenizer()
tf = HashingTF(numFeatures=1000)
lr = LogisticRegression()
pipe = Pipeline(stages=[tokenizer, tf, lr])
model = pipe.fit(df)
[Diagram: DataFrame -> tokenizer -> TF -> LR -> model]
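
The pipeline idea (transformers chained in front of an estimator, with `fit` returning a reusable model) can be sketched in plain Python. This is a toy illustration of the pattern, not Spark's implementation. Rows are plain dicts rather than a DataFrame, the bucket function is a deliberately simple stand-in for a real hash, the field names `text`, `words`, `features`, `label`, and `prediction` are assumptions, and the training data is made up.

```python
import math

class Tokenizer:
    """Transformer: adds a 'words' field by splitting 'text' on whitespace."""
    def transform(self, rows):
        return [dict(r, words=r["text"].lower().split()) for r in rows]

class HashingTF:
    """Transformer: maps words to sparse term counts in numFeatures buckets."""
    def __init__(self, numFeatures=1000):
        self.numFeatures = numFeatures

    def _bucket(self, word):
        # Toy deterministic hash (sum of code points); a real system would
        # use a proper hash function.
        return sum(ord(c) for c in word) % self.numFeatures

    def transform(self, rows):
        out = []
        for r in rows:
            counts = {}
            for w in r["words"]:
                b = self._bucket(w)
                counts[b] = counts.get(b, 0) + 1
            out.append(dict(r, features=counts))
        return out

class LogisticRegressionModel:
    """Fitted model: a transformer that adds a 0/1 'prediction' field."""
    def __init__(self, weights, bias):
        self.weights, self.bias = weights, bias

    def transform(self, rows):
        def score(f):
            return self.bias + sum(self.weights.get(i, 0.0) * v for i, v in f.items())
        return [dict(r, prediction=1 if score(r["features"]) > 0 else 0) for r in rows]

class LogisticRegression:
    """Estimator: fit() runs plain stochastic gradient descent on log loss."""
    def __init__(self, epochs=50, lr=0.5):
        self.epochs, self.lr = epochs, lr

    def fit(self, rows):
        w, b = {}, 0.0
        for _ in range(self.epochs):
            for r in rows:
                z = b + sum(w.get(i, 0.0) * v for i, v in r["features"].items())
                err = r["label"] - 1.0 / (1.0 + math.exp(-z))
                for i, v in r["features"].items():
                    w[i] = w.get(i, 0.0) + self.lr * err * v
                b += self.lr * err
        return LogisticRegressionModel(w, b)

class PipelineModel:
    """Fitted pipeline: applies every fitted stage in order."""
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows

class Pipeline:
    """Fits estimators in order, feeding each stage the previous stage's output."""
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):   # estimator: replace it with its model
                stage = stage.fit(rows)
            rows = stage.transform(rows)
            fitted.append(stage)
        return PipelineModel(fitted)

# Made-up training data for illustration.
train = [
    {"text": "spark is fast", "label": 1},
    {"text": "spark is great", "label": 1},
    {"text": "hadoop is slow", "label": 0},
    {"text": "hadoop is old", "label": 0},
]
model = Pipeline([Tokenizer(), HashingTF(numFeatures=1000), LogisticRegression()]).fit(train)
predictions = [r["prediction"] for r in model.transform(train)]
print(predictions)
```

The design point is that the fitted pipeline is itself a single object with a `transform` method, so the same featurization is applied at training time and at prediction time.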

R Interface (SparkR)
Targeting Spark 1.4 (June)
Exposes DataFrames, RDDs, and ML library in R
df = jsonFile("tweets.json")
summarize(
  group_by(
    df[df$user == "matei",],
    "date"),
  sum("retweets"))

New Directions in 2015
Data Science: High-level interfaces similar to single-machine tools
Platform Interfaces: Plug in data sources and algorithms

External Data Sources
Platform API to plug smart data sources into Spark
Returns DataFrames usable in Spark apps or SQL
Pushes logic into sources

SELECT * FROM mysql_users u JOIN hive_logs h
WHERE u.lang = "en"

[Diagram: Spark connected to external data sources, including {JSON}]
Query pushed into the source:
SELECT * FROM users WHERE lang="en"
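
"Pushes logic into sources" can be illustrated with a small plain-Python sketch. It is not the Spark data sources API. The class `InMemoryUsersSource`, its `scan` method, and the sample rows are hypothetical, chosen only to show the difference between filtering after reading everything and letting the source apply the filter itself.

```python
class InMemoryUsersSource:
    """Hypothetical 'smart' source that can apply equality filters itself."""
    def __init__(self, rows):
        self._rows = rows
        self.rows_returned = 0  # rows that crossed the source boundary

    def scan(self, filters=None):
        filters = filters or {}
        for row in self._rows:
            if all(row.get(col) == val for col, val in filters.items()):
                self.rows_returned += 1
                yield row

def query_without_pushdown(source, column, value):
    # The engine reads every row, then filters on its own side.
    return [r for r in source.scan() if r[column] == value]

def query_with_pushdown(source, column, value):
    # The engine hands the filter to the source, which returns only matches.
    return list(source.scan({column: value}))

# Made-up rows for illustration.
users = [
    {"name": "a", "lang": "en"},
    {"name": "b", "lang": "fr"},
    {"name": "c", "lang": "en"},
    {"name": "d", "lang": "de"},
]

naive = InMemoryUsersSource(users)
smart = InMemoryUsersSource(users)
naive_result = query_without_pushdown(naive, "lang", "en")
smart_result = query_with_pushdown(smart, "lang", "en")
print(naive.rows_returned, smart.rows_returned)
```

Both queries return the same rows, but with pushdown only the matching rows leave the source, which is the effect of sending the `WHERE lang="en"` filter to the database in the slide's example.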

Spark Packages
Community index of third party packages
bin/spark-shell --packages databricks/spark-csv:0.2
spark-packages.org

[Diagram of the Spark stack, built up over four slides]
Spark Core with Spark Streaming, Spark SQL, MLlib, GraphX
Added: DataFrames, ML Pipelines
Added: Data Sources ({JSON})

Goal: unified engine across data sources, workloads and environments

Enjoy Spark Summit East!