SlideShare une entreprise Scribd logo
1  sur  51
Télécharger pour lire hors ligne
Distributed Heterogeneous
Mixture Learning on Spark
Masato Asahara and Ryohei Fujimaki
NEC Data Science Research Labs.
Jun/08/2016 @Spark Summit 2016
2 © NEC Corporation 2016
Agenda
3 © NEC Corporation 2016
Agenda
HML
4 © NEC Corporation 2016
Agenda
CPU
12
SEP
11
SEP
10
SEP
HML
5 © NEC Corporation 2016
Agenda
HML
NEC’s Predictive Analytics and
Heterogeneous Mixture Learning
7 © NEC Corporation 2016
Enterprise Applications of HML
Driver Risk
Assessment
Inventory
Optimization
Churn
Retention
Predictive
Maintenance
Product Price
Optimization
Sales
Optimization
Energy/Water
Operation Mgmt.
8 © NEC Corporation 2016
NEC’s Heterogeneous Mixture Learning (HML)
NEC’s machine learning that automatically derives “accurate” and
“transparent” formulas behind Big Data.
Sun
Sat
Y=a1x1+ ・・・ +anxn
Y=a1x1+ ・・・+knx3
Y=b1x2+ ・・・+hnxn
Y=c1x4+ ・・・+fnxn
Mon
Sun
Explore Massive Formulas
Transparent
data segmentation
and predictive formulas
Heterogeneous
mixture data
9 © NEC Corporation 2016
The HML Model
Gender
Height
tallshort
age
smoke
food
weight
sports
sports
tea/coffee
beer/wine
The health risk of a
tall man can be
predicted by age,
alcohol and smoke!
10 © NEC Corporation 2016
HML (Heterogeneous Mixture Learning)
11 © NEC Corporation 2016
HML (Heterogeneous Mixture Learning)
12 © NEC Corporation 2016
HML Algorithm
Segment data by a rule-based tree
Feature
Selection
Pruning
Data
Segmentation
Training data
13 © NEC Corporation 2016
HML Algorithm
Select features and fit predictive models for data segments
Feature
Selection
Pruning
Data
Segmentation
Training data
14 © NEC Corporation 2016
Enterprise Applications of HML
Driver Risk
Assessment
Inventory
Optimization
Churn
Retention
Predictive
Maintenance
Energy/Water
Operation Mgmt.
Sales
Optimization
Product Price
Optimization
15 © NEC Corporation 2016
Driver Risk
Assessment
Inventory
Optimization
Churn
Retention
Predictive
Maintenanc
e
Energy/Water
Operation Mgmt.
Product Price
Optimization
Demand for Big Data Analysis
24(hour)×365(days)×3(year)×1000(shops)
~26,000,000 training samples
Sales
Optimization
16 © NEC Corporation 2016
Driver Risk
Assessment
Inventory
Optimization
Churn
Retention
Predictive
Maintenance
Energy/Water
Demand Forecasting
Sales
Forecasting
Product Price
Optimization
Demand for Big Data Analysis
5 million(customers)×12(months)
=60,000,000 training samples
17 © NEC Corporation 2016
Driver Risk
Assessment
Inventory
Optimization
Churn
Retention
Predictive
Maintenanc
e
Energy/Water
Demand Forecasting
Sales
Forecasting
Product Price
Optimization
Demand for Big Data Analysis
Distributed HML on Spark
19 © NEC Corporation 2016
Why Spark, not Hadoop?
Because Spark’s in-memory architecture can run HML faster
Feature
Selection
Pruning
Data
Segmentation
CPU CPU CPU
Data segmentation Feature selection
~
Pruning
CPU CPU
Data segmentation Feature selection
20 © NEC Corporation 2016
Data Scalability powered by Spark
Treat unlimited scale of training data by adding executors
executor executor executor
CPU CPU CPU
driver
HDFS
Add executors,
and get
more memory
and CPU power
Training data
21 © NEC Corporation 2016
Distributed HML Engine: Architecture
HDFS
Distributed HML
Core (Scala)
YARN
invoke
HML
model
Distributed HML
Interface (Python)
Matrix
computation
libraries
(Breeze, BLAS)
Data Scientist
22 © NEC Corporation 2016
3 Technical Key Points to Fast Run ML on Spark
CPU
12
SEP
11
SEP
10
SEP
23 © NEC Corporation 2016
3 Technical Key Points to Fast Run ML on Spark
CPU
12
SEP
11
SEP
10
SEP
24 © NEC Corporation 2016
Challenge to Avoid Data Shuffling
executor executor executor
driver
Model:
10KB~10MB
HML Model Merge Algorithm
Training data:
TB~
executor executor executor
driver
Naïve Design Distributed HML
25 © NEC Corporation 2016
Execution Flow of Distributed HML
Executors build a rule-based tree from their local data in parallel
Driver
Executor
Feature
Selection
Pruning
Data
Segmentation
26 © NEC Corporation 2016
Execution Flow of Distributed HML
Driver merges the trees
Driver
Executor
Feature
Selection
Pruning
Data
Segmentation
HML Segment Merge Algorithm
27 © NEC Corporation 2016
Execution Flow of Distributed HML
Driver broadcasts the merged tree to executors
Driver
Executor
Feature
Selection
Pruning
Data
Segmentation
28 © NEC Corporation 2016
Execution Flow of Distributed HML
Executors perform feature selection with their local data in parallel
Driver
Executor
Feature
Selection
Pruning
Data
Segmentation
Sat
Sat
Sat
Sat
Sat
29 © NEC Corporation 2016
Execution Flow of Distributed HML
Driver merges the results of feature selection
Driver
Executor
Feature
Selection
Pruning
Data
Segmentation
Sat
Sat
Sat
Sat
Sat
HML Feature
Merge Algorithm
30 © NEC Corporation 2016
Execution Flow of Distributed HML
Driver merges the results of feature selection
Driver
Executor
Feature
Selection
Pruning
Data
Segmentation
Sat
HML Feature
Merge Algorithm
31 © NEC Corporation 2016
Execution Flow of Distributed HML
Driver broadcasts the merged results of feature selection to executors
Driver
Executor
Feature
Selection
Pruning
Data
Segmentation
Sat
SatSatSat
32 © NEC Corporation 2016
3 Technical Key Points to Fast Run ML on Spark
CPU
12
SEP
11
SEP
10
SEP
33 © NEC Corporation 2016
Wait Time for Other Executors Delays Execution
Machine learning is likely to cause unbalanced computation load
Executor
Sat
Sat
Time
Executor Executor
34 © NEC Corporation 2016
Balance Computational Effort for Each Executor
Executor optimizes all predictive formulas with equally-divided training data
Executor
Sat
Sat
Time
Executor Executor
35 © NEC Corporation 2016
Balance Computational Effort for Each Executor
Executor optimizes all predictive formulas with equally-divided training data
Executor
Sat
Sat
Time
Executor Executor
Sat Sat
Sat Sat
Completion time reduced!
36 © NEC Corporation 2016
3 Technical Key Points to Fast Run ML on Spark
1TB
CPU
12
SEP
11
SEP
10
SEP
37 © NEC Corporation 2016
Machine Learning Consists of Many Matrix Computations
Feature
Selection
Pruning
Data
Segmentation
Basic Matrix Computation
Matrix Inversion
Combinatorial Optimization
…
CPU
38 © NEC Corporation 2016
RDD Design for Leveraging High-Speed Libraries
Distributed HML’s RDD Design
Matrix object1
element
…
Straightforward RDD Design
Training sample1
partition
Training sample2
CPU
CPU
Matrix computation libraries
(Breeze, BLAS)
~Training sample1
Training sample2
…
seq(Training sample1,
Training sample2, …)
mapPartition
Matrix object1
…
Training sample1
Training sample2
map
Iterator
Matrix
object
creation
element
39 © NEC Corporation 2016
Performance Degradation by Long-Chained RDD
Long-chained RDD operations cause high-cost recalculations
Feature
Selection
Pruning
Data
Segmentation RDD.map(TreeMapFunc)
.map(FeatureSelectionMapFunc)
.reduce(TreeReduceFunc)
.reduce(FeatureSelectionReduceFunc)
.map(TreeMapFunc)
.map(FeatureSelectionMapFunc)
.reduce(TreeReduceFunc)
.reduce(FeatureSelectionReduceFunc)
.map(PruningMapFunc)
40 © NEC Corporation 2016
Performance Degradation by Long-Chained RDD
Long-chained RDD operations cause high-cost recalculations
Feature
Selection
Pruning
Data
Segmentation RDD.map(TreeMapFunc)
.map(FeatureSelectionMapFunc)
.reduce(TreeReduceFunc)
.reduce(FeatureSelectionReduceFunc)
.map(TreeMapFunc)
.map(FeatureSelectionMapFunc)
.reduce(TreeReduceFunc)
.reduce(FeatureSelectionReduceFunc)
GC
.map(TreeMapFunc)
CPU
.map(PruningMapFunc)
41 © NEC Corporation 2016
Performance Degradation by Long-Chained RDD
Cut long-chained operations periodically by checkpoint()
Feature
Selection
Pruning
Data
Segmentation RDD.map(TreeMapFunc)
.map(FeatureSelectionMapFunc)
.reduce(TreeReduceFunc)
.reduce(FeatureSelectionReduceFunc)
.map(TreeMapFunc)
.map(FeatureSelectionMapFunc)
.reduce(TreeReduceFunc)
.reduce(FeatureSelectionReduceFunc)
GC
.map(TreeMapFunc)
savedRDD
.checkpoint()
=
CPU
~
.map(PruningMapFunc)
Benchmark Performance Evaluations
43 © NEC Corporation 2016
Eval 1: Prediction Error (vs. Spark MLlib algorithms)
Distributed HML achieves low error competitive to a complex model
data # samples Distributed
HML
Logistic
Regression
Decision
Tree
Random
Forests
gas sensor
array (CO)*
4,208,261 0.542 0.597 0.587 0.576
household
power
consumption
*
2,075,259 0.524 0.531 0.529 0.655
HIGGS* 11,000,000 0.335 0.358 0.337 0.317
HEPMASS* 7,000,000 0.156 0.163 0.167 0.175
* UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/)
1st
1st
1st
2nd
44 © NEC Corporation 2016
Eval 2: Performance Scalability (vs. Spark MLlib algos)
Distributed HML is competitive with Spark MLlib implementations.
HIGGS* (11M samples) HEPMASS* (7M samples)# CPU cores
Distributed
HML
Logistic
Regression
Distributed
HML
* UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/)
Logistic
Regression
Evaluation in Real Case
46 © NEC Corporation 2016
ATM Cash Balance Prediction
Much cash  High
inventory cost
Little cash  High risk of
out of stock
Around 20,000 ATMs
47 © NEC Corporation 2016
Training Speed
2 hours9 days
▌Summary of data
# ATMs: around 20,000 (in Japan)
# training samples: around 10M
▌Cluster spec (10 nodes)
 # CPU cores: 128
 Memory: 2.5TB
 Spark 1.6.1,
Hadoop 2.7.2 (HDP 2.3.2)
* Run w/ 1 CPU core and 256GB memory
110x
Speed up
Serial HML* Distributed HML
48 © NEC Corporation 2016
Prediction Error
▌Summary of data
# ATMs: around 20,000 (in Japan)
# training samples: around 23M
▌Cluster spec (10 nodes)
 # CPU cores: 128
 Memory: 2.5TB
 Spark 1.6.1,
Hadoop 2.7.2 (HDP 2.3.2)
Average Error:
17% lower
Serial
HML
Distributed HMLSerial
HML
Serial
HML
vs
Winner: Distributed HML
49 © NEC Corporation 2016
Summary
CPU
12
SEP
11
SEP
10
SEP
50 © NEC Corporation 2016
Summary
Distributed Heterogeneous Mixture Learning On Spark

Contenu connexe

Tendances

AI as a Service, Build Shared AI Service Platforms Based on Deep Learning Tec...
AI as a Service, Build Shared AI Service Platforms Based on Deep Learning Tec...AI as a Service, Build Shared AI Service Platforms Based on Deep Learning Tec...
AI as a Service, Build Shared AI Service Platforms Based on Deep Learning Tec...Databricks
 
Taming the Search: A Practical Way of Enforcing GDPR and CCPA in Very Large D...
Taming the Search: A Practical Way of Enforcing GDPR and CCPA in Very Large D...Taming the Search: A Practical Way of Enforcing GDPR and CCPA in Very Large D...
Taming the Search: A Practical Way of Enforcing GDPR and CCPA in Very Large D...Databricks
 
Spark Summit EU talk by Brij Bhushan Ravat
Spark Summit EU talk by Brij Bhushan RavatSpark Summit EU talk by Brij Bhushan Ravat
Spark Summit EU talk by Brij Bhushan RavatSpark Summit
 
Why Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
Why Apache Spark is the Heir to MapReduce in the Hadoop EcosystemWhy Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
Why Apache Spark is the Heir to MapReduce in the Hadoop EcosystemCloudera, Inc.
 
Downscaling: The Achilles heel of Autoscaling Apache Spark Clusters
Downscaling: The Achilles heel of Autoscaling Apache Spark ClustersDownscaling: The Achilles heel of Autoscaling Apache Spark Clusters
Downscaling: The Achilles heel of Autoscaling Apache Spark ClustersDatabricks
 
Big Data Analytics-Open Source Toolkits
Big Data Analytics-Open Source ToolkitsBig Data Analytics-Open Source Toolkits
Big Data Analytics-Open Source ToolkitsDataWorks Summit
 
Machine learning at scale challenges and solutions
Machine learning at scale challenges and solutionsMachine learning at scale challenges and solutions
Machine learning at scale challenges and solutionsStavros Kontopoulos
 
Fully Automated QA System For Large Scale Search And Recommendation Engines U...
Fully Automated QA System For Large Scale Search And Recommendation Engines U...Fully Automated QA System For Large Scale Search And Recommendation Engines U...
Fully Automated QA System For Large Scale Search And Recommendation Engines U...Spark Summit
 
Accelerate Your Apache Spark with Intel Optane DC Persistent Memory
Accelerate Your Apache Spark with Intel Optane DC Persistent MemoryAccelerate Your Apache Spark with Intel Optane DC Persistent Memory
Accelerate Your Apache Spark with Intel Optane DC Persistent MemoryDatabricks
 
Geospatial Analytics at Scale with Deep Learning and Apache Spark
Geospatial Analytics at Scale with Deep Learning and Apache SparkGeospatial Analytics at Scale with Deep Learning and Apache Spark
Geospatial Analytics at Scale with Deep Learning and Apache SparkDatabricks
 
Cooperative Task Execution for Apache Spark
Cooperative Task Execution for Apache SparkCooperative Task Execution for Apache Spark
Cooperative Task Execution for Apache SparkDatabricks
 
Practical Distributed Machine Learning Pipelines on Hadoop
Practical Distributed Machine Learning Pipelines on HadoopPractical Distributed Machine Learning Pipelines on Hadoop
Practical Distributed Machine Learning Pipelines on HadoopDataWorks Summit
 
Spark summit 2019 infrastructure for deep learning in apache spark 0425
Spark summit 2019 infrastructure for deep learning in apache spark 0425Spark summit 2019 infrastructure for deep learning in apache spark 0425
Spark summit 2019 infrastructure for deep learning in apache spark 0425Wee Hyong Tok
 
Scaling Data and ML with Apache Spark and Feast
Scaling Data and ML with Apache Spark and FeastScaling Data and ML with Apache Spark and Feast
Scaling Data and ML with Apache Spark and FeastDatabricks
 
Handling Data Skew Adaptively In Spark Using Dynamic Repartitioning
Handling Data Skew Adaptively In Spark Using Dynamic RepartitioningHandling Data Skew Adaptively In Spark Using Dynamic Repartitioning
Handling Data Skew Adaptively In Spark Using Dynamic RepartitioningSpark Summit
 
Deep Learning on Apache® Spark™ : Workflows and Best Practices
Deep Learning on Apache® Spark™ : Workflows and Best PracticesDeep Learning on Apache® Spark™ : Workflows and Best Practices
Deep Learning on Apache® Spark™ : Workflows and Best PracticesJen Aman
 
Apache Spark for Machine Learning with High Dimensional Labels: Spark Summit ...
Apache Spark for Machine Learning with High Dimensional Labels: Spark Summit ...Apache Spark for Machine Learning with High Dimensional Labels: Spark Summit ...
Apache Spark for Machine Learning with High Dimensional Labels: Spark Summit ...Spark Summit
 
Hyperspace for Delta Lake
Hyperspace for Delta LakeHyperspace for Delta Lake
Hyperspace for Delta LakeDatabricks
 

Tendances (20)

AI as a Service, Build Shared AI Service Platforms Based on Deep Learning Tec...
AI as a Service, Build Shared AI Service Platforms Based on Deep Learning Tec...AI as a Service, Build Shared AI Service Platforms Based on Deep Learning Tec...
AI as a Service, Build Shared AI Service Platforms Based on Deep Learning Tec...
 
Taming the Search: A Practical Way of Enforcing GDPR and CCPA in Very Large D...
Taming the Search: A Practical Way of Enforcing GDPR and CCPA in Very Large D...Taming the Search: A Practical Way of Enforcing GDPR and CCPA in Very Large D...
Taming the Search: A Practical Way of Enforcing GDPR and CCPA in Very Large D...
 
Spark Summit EU talk by Brij Bhushan Ravat
Spark Summit EU talk by Brij Bhushan RavatSpark Summit EU talk by Brij Bhushan Ravat
Spark Summit EU talk by Brij Bhushan Ravat
 
Why Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
Why Apache Spark is the Heir to MapReduce in the Hadoop EcosystemWhy Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
Why Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
 
Downscaling: The Achilles heel of Autoscaling Apache Spark Clusters
Downscaling: The Achilles heel of Autoscaling Apache Spark ClustersDownscaling: The Achilles heel of Autoscaling Apache Spark Clusters
Downscaling: The Achilles heel of Autoscaling Apache Spark Clusters
 
Big Data Analytics-Open Source Toolkits
Big Data Analytics-Open Source ToolkitsBig Data Analytics-Open Source Toolkits
Big Data Analytics-Open Source Toolkits
 
Machine learning at scale challenges and solutions
Machine learning at scale challenges and solutionsMachine learning at scale challenges and solutions
Machine learning at scale challenges and solutions
 
Fully Automated QA System For Large Scale Search And Recommendation Engines U...
Fully Automated QA System For Large Scale Search And Recommendation Engines U...Fully Automated QA System For Large Scale Search And Recommendation Engines U...
Fully Automated QA System For Large Scale Search And Recommendation Engines U...
 
Accelerate Your Apache Spark with Intel Optane DC Persistent Memory
Accelerate Your Apache Spark with Intel Optane DC Persistent MemoryAccelerate Your Apache Spark with Intel Optane DC Persistent Memory
Accelerate Your Apache Spark with Intel Optane DC Persistent Memory
 
Geospatial Analytics at Scale with Deep Learning and Apache Spark
Geospatial Analytics at Scale with Deep Learning and Apache SparkGeospatial Analytics at Scale with Deep Learning and Apache Spark
Geospatial Analytics at Scale with Deep Learning and Apache Spark
 
Cooperative Task Execution for Apache Spark
Cooperative Task Execution for Apache SparkCooperative Task Execution for Apache Spark
Cooperative Task Execution for Apache Spark
 
Practical Distributed Machine Learning Pipelines on Hadoop
Practical Distributed Machine Learning Pipelines on HadoopPractical Distributed Machine Learning Pipelines on Hadoop
Practical Distributed Machine Learning Pipelines on Hadoop
 
Big Data Heterogeneous Mixture Learning on Spark
Big Data Heterogeneous Mixture Learning on SparkBig Data Heterogeneous Mixture Learning on Spark
Big Data Heterogeneous Mixture Learning on Spark
 
Spark summit 2019 infrastructure for deep learning in apache spark 0425
Spark summit 2019 infrastructure for deep learning in apache spark 0425Spark summit 2019 infrastructure for deep learning in apache spark 0425
Spark summit 2019 infrastructure for deep learning in apache spark 0425
 
Scaling Data and ML with Apache Spark and Feast
Scaling Data and ML with Apache Spark and FeastScaling Data and ML with Apache Spark and Feast
Scaling Data and ML with Apache Spark and Feast
 
Handling Data Skew Adaptively In Spark Using Dynamic Repartitioning
Handling Data Skew Adaptively In Spark Using Dynamic RepartitioningHandling Data Skew Adaptively In Spark Using Dynamic Repartitioning
Handling Data Skew Adaptively In Spark Using Dynamic Repartitioning
 
Deep Learning on Apache® Spark™ : Workflows and Best Practices
Deep Learning on Apache® Spark™ : Workflows and Best PracticesDeep Learning on Apache® Spark™ : Workflows and Best Practices
Deep Learning on Apache® Spark™ : Workflows and Best Practices
 
Apache Spark for Machine Learning with High Dimensional Labels: Spark Summit ...
Apache Spark for Machine Learning with High Dimensional Labels: Spark Summit ...Apache Spark for Machine Learning with High Dimensional Labels: Spark Summit ...
Apache Spark for Machine Learning with High Dimensional Labels: Spark Summit ...
 
Hyperspace for Delta Lake
Hyperspace for Delta LakeHyperspace for Delta Lake
Hyperspace for Delta Lake
 
Producing Spark on YARN for ETL
Producing Spark on YARN for ETLProducing Spark on YARN for ETL
Producing Spark on YARN for ETL
 

Similaire à Distributed Heterogeneous Mixture Learning On Spark

No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark with Ma...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark with Ma...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark with Ma...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark with Ma...Databricks
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark with Ma...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark with Ma...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark with Ma...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark with Ma...Databricks
 
Quick! Quick! Exploration!: A framework for searching a predictive model on A...
Quick! Quick! Exploration!: A framework for searching a predictive model on A...Quick! Quick! Exploration!: A framework for searching a predictive model on A...
Quick! Quick! Exploration!: A framework for searching a predictive model on A...DataWorks Summit
 
Accelerating Spark MLlib and DataFrame with Vector Processor “SX-Aurora TSUBASA”
Accelerating Spark MLlib and DataFrame with Vector Processor “SX-Aurora TSUBASA”Accelerating Spark MLlib and DataFrame with Vector Processor “SX-Aurora TSUBASA”
Accelerating Spark MLlib and DataFrame with Vector Processor “SX-Aurora TSUBASA”Databricks
 
The Sierra Supercomputer: Science and Technology on a Mission
The Sierra Supercomputer: Science and Technology on a MissionThe Sierra Supercomputer: Science and Technology on a Mission
The Sierra Supercomputer: Science and Technology on a Missioninside-BigData.com
 
Performance Characterization and Optimization of In-Memory Data Analytics on ...
Performance Characterization and Optimization of In-Memory Data Analytics on ...Performance Characterization and Optimization of In-Memory Data Analytics on ...
Performance Characterization and Optimization of In-Memory Data Analytics on ...Ahsan Javed Awan
 
PyMADlib - A Python wrapper for MADlib : in-database, parallel, machine learn...
PyMADlib - A Python wrapper for MADlib : in-database, parallel, machine learn...PyMADlib - A Python wrapper for MADlib : in-database, parallel, machine learn...
PyMADlib - A Python wrapper for MADlib : in-database, parallel, machine learn...Srivatsan Ramanujam
 
OpenACC Monthly Highlights: October2020
OpenACC Monthly Highlights: October2020OpenACC Monthly Highlights: October2020
OpenACC Monthly Highlights: October2020OpenACC
 
Multi Processor Architecture for image processing
Multi Processor Architecture for image processingMulti Processor Architecture for image processing
Multi Processor Architecture for image processingideas2ignite
 
The Fast Path to Building Operational Applications with Spark
The Fast Path to Building Operational Applications with SparkThe Fast Path to Building Operational Applications with Spark
The Fast Path to Building Operational Applications with SparkSingleStore
 
Applying Cloud Techniques to Address Complexity in HPC System Integrations
Applying Cloud Techniques to Address Complexity in HPC System IntegrationsApplying Cloud Techniques to Address Complexity in HPC System Integrations
Applying Cloud Techniques to Address Complexity in HPC System Integrationsinside-BigData.com
 
OpenACC and Open Hackathons Monthly Highlights May 2023.pdf
OpenACC and Open Hackathons Monthly Highlights May  2023.pdfOpenACC and Open Hackathons Monthly Highlights May  2023.pdf
OpenACC and Open Hackathons Monthly Highlights May 2023.pdfOpenACC
 
OpenACC Monthly Highlights Summer 2019
OpenACC Monthly Highlights Summer 2019OpenACC Monthly Highlights Summer 2019
OpenACC Monthly Highlights Summer 2019OpenACC
 
OpenACC Monthly Highlights: September 2021
OpenACC Monthly Highlights: September 2021OpenACC Monthly Highlights: September 2021
OpenACC Monthly Highlights: September 2021OpenACC
 
TensorFlow 16: Building a Data Science Platform
TensorFlow 16: Building a Data Science Platform TensorFlow 16: Building a Data Science Platform
TensorFlow 16: Building a Data Science Platform Seldon
 
“Introduction to the TVM Open Source Deep Learning Compiler Stack,” a Present...
“Introduction to the TVM Open Source Deep Learning Compiler Stack,” a Present...“Introduction to the TVM Open Source Deep Learning Compiler Stack,” a Present...
“Introduction to the TVM Open Source Deep Learning Compiler Stack,” a Present...Edge AI and Vision Alliance
 
electronics-11-03883.pdf
electronics-11-03883.pdfelectronics-11-03883.pdf
electronics-11-03883.pdfRioCarthiis
 
Optimize Single Particle Orbital (SPO) Evaluations Based on B-splines
Optimize Single Particle Orbital (SPO) Evaluations Based on B-splinesOptimize Single Particle Orbital (SPO) Evaluations Based on B-splines
Optimize Single Particle Orbital (SPO) Evaluations Based on B-splinesIntel® Software
 

Similaire à Distributed Heterogeneous Mixture Learning On Spark (20)

No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark with Ma...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark with Ma...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark with Ma...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark with Ma...
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark with Ma...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark with Ma...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark with Ma...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark with Ma...
 
Quick! Quick! Exploration!: A framework for searching a predictive model on A...
Quick! Quick! Exploration!: A framework for searching a predictive model on A...Quick! Quick! Exploration!: A framework for searching a predictive model on A...
Quick! Quick! Exploration!: A framework for searching a predictive model on A...
 
Accelerating Spark MLlib and DataFrame with Vector Processor “SX-Aurora TSUBASA”
Accelerating Spark MLlib and DataFrame with Vector Processor “SX-Aurora TSUBASA”Accelerating Spark MLlib and DataFrame with Vector Processor “SX-Aurora TSUBASA”
Accelerating Spark MLlib and DataFrame with Vector Processor “SX-Aurora TSUBASA”
 
The Sierra Supercomputer: Science and Technology on a Mission
The Sierra Supercomputer: Science and Technology on a MissionThe Sierra Supercomputer: Science and Technology on a Mission
The Sierra Supercomputer: Science and Technology on a Mission
 
Performance Characterization and Optimization of In-Memory Data Analytics on ...
Performance Characterization and Optimization of In-Memory Data Analytics on ...Performance Characterization and Optimization of In-Memory Data Analytics on ...
Performance Characterization and Optimization of In-Memory Data Analytics on ...
 
PyMADlib - A Python wrapper for MADlib : in-database, parallel, machine learn...
PyMADlib - A Python wrapper for MADlib : in-database, parallel, machine learn...PyMADlib - A Python wrapper for MADlib : in-database, parallel, machine learn...
PyMADlib - A Python wrapper for MADlib : in-database, parallel, machine learn...
 
OpenACC Monthly Highlights: October2020
OpenACC Monthly Highlights: October2020OpenACC Monthly Highlights: October2020
OpenACC Monthly Highlights: October2020
 
Multi Processor Architecture for image processing
Multi Processor Architecture for image processingMulti Processor Architecture for image processing
Multi Processor Architecture for image processing
 
The Fast Path to Building Operational Applications with Spark
The Fast Path to Building Operational Applications with SparkThe Fast Path to Building Operational Applications with Spark
The Fast Path to Building Operational Applications with Spark
 
Applying Cloud Techniques to Address Complexity in HPC System Integrations
Applying Cloud Techniques to Address Complexity in HPC System IntegrationsApplying Cloud Techniques to Address Complexity in HPC System Integrations
Applying Cloud Techniques to Address Complexity in HPC System Integrations
 
OpenACC and Open Hackathons Monthly Highlights May 2023.pdf
OpenACC and Open Hackathons Monthly Highlights May  2023.pdfOpenACC and Open Hackathons Monthly Highlights May  2023.pdf
OpenACC and Open Hackathons Monthly Highlights May 2023.pdf
 
HPC in higher education
HPC in higher educationHPC in higher education
HPC in higher education
 
OpenACC Monthly Highlights Summer 2019
OpenACC Monthly Highlights Summer 2019OpenACC Monthly Highlights Summer 2019
OpenACC Monthly Highlights Summer 2019
 
OpenACC Monthly Highlights: September 2021
OpenACC Monthly Highlights: September 2021OpenACC Monthly Highlights: September 2021
OpenACC Monthly Highlights: September 2021
 
Big Data Analytics With MATLAB
Big Data Analytics With MATLABBig Data Analytics With MATLAB
Big Data Analytics With MATLAB
 
TensorFlow 16: Building a Data Science Platform
TensorFlow 16: Building a Data Science Platform TensorFlow 16: Building a Data Science Platform
TensorFlow 16: Building a Data Science Platform
 
“Introduction to the TVM Open Source Deep Learning Compiler Stack,” a Present...
“Introduction to the TVM Open Source Deep Learning Compiler Stack,” a Present...“Introduction to the TVM Open Source Deep Learning Compiler Stack,” a Present...
“Introduction to the TVM Open Source Deep Learning Compiler Stack,” a Present...
 
electronics-11-03883.pdf
electronics-11-03883.pdfelectronics-11-03883.pdf
electronics-11-03883.pdf
 
Optimize Single Particle Orbital (SPO) Evaluations Based on B-splines
Optimize Single Particle Orbital (SPO) Evaluations Based on B-splinesOptimize Single Particle Orbital (SPO) Evaluations Based on B-splines
Optimize Single Particle Orbital (SPO) Evaluations Based on B-splines
 

Plus de Spark Summit

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang Spark Summit
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...Spark Summit
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang WuSpark Summit
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya RaghavendraSpark Summit
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...Spark Summit
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...Spark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingSpark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingSpark Summit
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...Spark Summit
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakSpark Summit
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimSpark Summit
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraSpark Summit
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Spark Summit
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...Spark Summit
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spark Summit
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovSpark Summit
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Spark Summit
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkSpark Summit
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Spark Summit
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...Spark Summit
 

Plus de Spark Summit (20)

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
 

Dernier

Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Seán Kennedy
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our WorldEduminds Learning
 
Cyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataCyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataTecnoIncentive
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Boston Institute of Analytics
 
INTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processingINTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processingsocarem879
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...Dr Arash Najmaei ( Phd., MBA, BSc)
 
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...KarteekMane1
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxaleedritatuxx
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsVICTOR MAESTRE RAMIREZ
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksdeepakthakur548787
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Cathrine Wilhelmsen
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryJeremy Anderson
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Boston Institute of Analytics
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 
Principles and Practices of Data Visualization
Principles and Practices of Data VisualizationPrinciples and Practices of Data Visualization
Principles and Practices of Data VisualizationKianJazayeri1
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 217djon017
 
convolutional neural network and its applications.pdf
convolutional neural network and its applications.pdfconvolutional neural network and its applications.pdf
convolutional neural network and its applications.pdfSubhamKumar3239
 
Decoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectDecoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectBoston Institute of Analytics
 
Networking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxNetworking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxHimangsuNath
 

Dernier (20)

Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 
Cyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataCyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded data
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
 
INTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processingINTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processing
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
 
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business Professionals
 
Insurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis ProjectInsurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis Project
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing works
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
Principles and Practices of Data Visualization
Principles and Practices of Data VisualizationPrinciples and Practices of Data Visualization
Principles and Practices of Data Visualization
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
 
convolutional neural network and its applications.pdf
convolutional neural network and its applications.pdfconvolutional neural network and its applications.pdf
convolutional neural network and its applications.pdf
 
Decoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectDecoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis Project
 
Networking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxNetworking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptx
 

Distributed Heterogeneous Mixture Learning On Spark

  • 1. Distributed Heterogeneous Mixture Learning on Spark Masato Asahara and Ryohei Fujimaki NEC Data Science Research Labs. Jun/08/2016 @Spark Summit 2016
  • 2. 2 © NEC Corporation 2016 Agenda
  • 3. 3 © NEC Corporation 2016 Agenda HML
  • 4. 4 © NEC Corporation 2016 Agenda CPU 12 SEP 11 SEP 10 SEP HML
  • 5. 5 © NEC Corporation 2016 Agenda HML
  • 6. NEC’s Predictive Analytics and Heterogeneous Mixture Learning
  • 7. 7 © NEC Corporation 2016 Enterprise Applications of HML Driver Risk Assessment Inventory Optimization Churn Retention Predictive Maintenance Product Price Optimization Sales Optimization Energy/Water Operation Mgmt.
  • 8. 8 © NEC Corporation 2016 NEC’s Heterogeneous Mixture Learning (HML) NEC’s machine learning that automatically derives “accurate” and “transparent” formulas behind Big Data. Sun Sat Y=a1x1+ ・・・ +anxn Y=a1x1+ ・・・+knx3 Y=b1x2+ ・・・+hnxn Y=c1x4+ ・・・+fnxn Mon Sun Explore Massive Formulas Transparent data segmentation and predictive formulas Heterogeneous mixture data
  • 9. 9 © NEC Corporation 2016 The HML Model Gender Height tallshort age smoke food weight sports sports tea/coffee beer/wine The health risk of a tall man can be predicted by age, alcohol and smoke!
  • 10. 10 © NEC Corporation 2016 HML (Heterogeneous Mixture Learning)
  • 11. 11 © NEC Corporation 2016 HML (Heterogeneous Mixture Learning)
  • 12. 12 © NEC Corporation 2016 HML Algorithm Segment data by a rule-based tree Feature Selection Pruning Data Segmentation Training data
  • 13. 13 © NEC Corporation 2016 HML Algorithm Select features and fit predictive models for data segments Feature Selection Pruning Data Segmentation Training data
  • 14. 14 © NEC Corporation 2016 Enterprise Applications of HML Driver Risk Assessment Inventory Optimization Churn Retention Predictive Maintenance Energy/Water Operation Mgmt. Sales Optimization Product Price Optimization
  • 15. 15 © NEC Corporation 2016 Driver Risk Assessment Inventory Optimization Churn Retention Predictive Maintenanc e Energy/Water Operation Mgmt. Product Price Optimization Demand for Big Data Analysis 24(hour)×365(days)×3(year)×1000(shops) ~26,000,000 training samples Sales Optimization
  • 16. 16 © NEC Corporation 2016 Driver Risk Assessment Inventory Optimization Churn Retention Predictive Maintenance Energy/Water Demand Forecasting Sales Forecasting Product Price Optimization Demand for Big Data Analysis 5 million(customers)×12(months) =60,000,000 training samples
  • 17. 17 © NEC Corporation 2016 Driver Risk Assessment Inventory Optimization Churn Retention Predictive Maintenanc e Energy/Water Demand Forecasting Sales Forecasting Product Price Optimization Demand for Big Data Analysis
  • 19. 19 © NEC Corporation 2016 Why Spark, not Hadoop? Because Spark’s in-memory architecture can run HML faster Feature Selection Pruning Data Segmentation CPU CPU CPU Data segmentation Feature selection ~ Pruning CPU CPU Data segmentation Feature selection
  • 20. 20 © NEC Corporation 2016 Data Scalability powered by Spark Treat unlimited scale of training data by adding executors executor executor executor CPU CPU CPU driver HDFS Add executors, and get more memory and CPU power Training data
  • 21. 21 © NEC Corporation 2016 Distributed HML Engine: Architecture HDFS Distributed HML Core (Scala) YARN invoke HML model Distributed HML Interface (Python) Matrix computation libraries (Breeze, BLAS) Data Scientist
  • 22. 22 © NEC Corporation 2016 3 Technical Key Points to Fast Run ML on Spark CPU 12 SEP 11 SEP 10 SEP
  • 23. 23 © NEC Corporation 2016 3 Technical Key Points to Fast Run ML on Spark CPU 12 SEP 11 SEP 10 SEP
  • 24. 24 © NEC Corporation 2016 Challenge to Avoid Data Shuffling executor executor executor driver Model: 10KB~10MB HML Model Merge Algorithm Training data: TB~ executor executor executor driver Naïve Design Distributed HML
  • 25. 25 © NEC Corporation 2016 Execution Flow of Distributed HML Executors build a rule-based tree from their local data in parallel Driver Executor Feature Selection Pruning Data Segmentation
  • 26. 26 © NEC Corporation 2016 Execution Flow of Distributed HML Driver merges the trees Driver Executor Feature Selection Pruning Data Segmentation HML Segment Merge Algorithm
  • 27. 27 © NEC Corporation 2016 Execution Flow of Distributed HML Driver broadcasts the merged tree to executors Driver Executor Feature Selection Pruning Data Segmentation
  • 28. 28 © NEC Corporation 2016 Execution Flow of Distributed HML Executors perform feature selection with their local data in parallel Driver Executor Feature Selection Pruning Data Segmentation Sat Sat Sat Sat Sat
  • 29. 29 © NEC Corporation 2016 Execution Flow of Distributed HML Driver merges the results of feature selection Driver Executor Feature Selection Pruning Data Segmentation Sat Sat Sat Sat Sat HML Feature Merge Algorithm
  • 30. 30 © NEC Corporation 2016 Execution Flow of Distributed HML Driver merges the results of feature selection Driver Executor Feature Selection Pruning Data Segmentation Sat HML Feature Merge Algorithm
  • 31. 31 © NEC Corporation 2016 Execution Flow of Distributed HML Driver broadcasts the merged results of feature selection to executors Driver Executor Feature Selection Pruning Data Segmentation Sat SatSatSat
  • 32. 32 © NEC Corporation 2016 3 Technical Key Points to Fast Run ML on Spark CPU 12 SEP 11 SEP 10 SEP
  • 33. 33 © NEC Corporation 2016 Wait Time for Other Executors Delays Execution Machine learning is likely to cause unbalanced computation load Executor Sat Sat Time Executor Executor
  • 34. 34 © NEC Corporation 2016 Balance Computational Effort for Each Executor Executor optimizes all predictive formulas with equally-divided training data Executor Sat Sat Time Executor Executor
  • 35. 35 © NEC Corporation 2016 Balance Computational Effort for Each Executor Executor optimizes all predictive formulas with equally-divided training data Executor Sat Sat Time Executor Executor Sat Sat Sat Sat Completion time reduced!
  • 36. 36 © NEC Corporation 2016 3 Technical Key Points to Fast Run ML on Spark 1TB CPU 12 SEP 11 SEP 10 SEP
  • 37. 37 © NEC Corporation 2016 Machine Learning Consists of Many Matrix Computations Feature Selection Pruning Data Segmentation Basic Matrix Computation Matrix Inversion Combinatorial Optimization … CPU
  • 38. 38 © NEC Corporation 2016 RDD Design for Leveraging High-Speed Libraries Distributed HML’s RDD Design Matrix object1 element … Straightforward RDD Design Training sample1 partition Training sample2 CPU CPU Matrix computation libraries (Breeze, BLAS) ~Training sample1 Training sample2 … seq(Training sample1, Training sample2, …) mapPartition Matrix object1 … Training sample1 Training sample2 map Iterator Matrix object creation element
  • 39. 39 © NEC Corporation 2016 Performance Degradation by Long-Chained RDD Long-chained RDD operations cause high-cost recalculations Feature Selection Pruning Data Segmentation RDD.map(TreeMapFunc) .map(FeatureSelectionMapFunc) .reduce(TreeReduceFunc) .reduce(FeatureSelectionReduceFunc) .map(TreeMapFunc) .map(FeatureSelectionMapFunc) .reduce(TreeReduceFunc) .reduce(FeatureSelectionReduceFunc) .map(PruningMapFunc)
  • 40. 40 © NEC Corporation 2016 Performance Degradation by Long-Chained RDD Long-chained RDD operations cause high-cost recalculations Feature Selection Pruning Data Segmentation RDD.map(TreeMapFunc) .map(FeatureSelectionMapFunc) .reduce(TreeReduceFunc) .reduce(FeatureSelectionReduceFunc) .map(TreeMapFunc) .map(FeatureSelectionMapFunc) .reduce(TreeReduceFunc) .reduce(FeatureSelectionReduceFunc) GC .map(TreeMapFunc) CPU .map(PruningMapFunc)
  • 41. 41 © NEC Corporation 2016 Performance Degradation by Long-Chained RDD Cut long-chained operations periodically by checkpoint() Feature Selection Pruning Data Segmentation RDD.map(TreeMapFunc) .map(FeatureSelectionMapFunc) .reduce(TreeReduceFunc) .reduce(FeatureSelectionReduceFunc) .map(TreeMapFunc) .map(FeatureSelectionMapFunc) .reduce(TreeReduceFunc) .reduce(FeatureSelectionReduceFunc) GC .map(TreeMapFunc) savedRDD .checkpoint() = CPU ~ .map(PruningMapFunc)
  • 43. 43 © NEC Corporation 2016 Eval 1: Prediction Error (vs. Spark MLlib algorithms) Distributed HML achieves low error competitive to a complex model data # samples Distributed HML Logistic Regression Decision Tree Random Forests gas sensor array (CO)* 4,208,261 0.542 0.597 0.587 0.576 household power consumption * 2,075,259 0.524 0.531 0.529 0.655 HIGGS* 11,000,000 0.335 0.358 0.337 0.317 HEPMASS* 7,000,000 0.156 0.163 0.167 0.175 * UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/) 1st 1st 1st 2nd
  • 44. 44 © NEC Corporation 2016 Eval 2: Performance Scalability (vs. Spark MLlib algos) Distributed HML is competitive with Spark MLlib implementations. HIGGS* (11M samples) HEPMASS* (7M samples)# CPU cores Distributed HML Logistic Regression Distributed HML * UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/) Logistic Regression
  • 46. 46 © NEC Corporation 2016 ATM Cash Balance Prediction Much cash  High inventory cost Little cash  High risk of out of stock Around 20,000 ATMs
  • 47. 47 © NEC Corporation 2016 Training Speed 2 hours9 days ▌Summary of data # ATMs: around 20,000 (in Japan) # training samples: around 10M ▌Cluster spec (10 nodes)  # CPU cores: 128  Memory: 2.5TB  Spark 1.6.1, Hadoop 2.7.2 (HDP 2.3.2) * Run w/ 1 CPU core and 256GB memory 110x Speed up Serial HML* Distributed HML
  • 48. 48 © NEC Corporation 2016 Prediction Error ▌Summary of data # ATMs: around 20,000 (in Japan) # training samples: around 23M ▌Cluster spec (10 nodes)  # CPU cores: 128  Memory: 2.5TB  Spark 1.6.1, Hadoop 2.7.2 (HDP 2.3.2) Average Error: 17% lower Serial HML Distributed HMLSerial HML Serial HML vs Winner: Distributed HML
  • 49. 49 © NEC Corporation 2016 Summary CPU 12 SEP 11 SEP 10 SEP
  • 50. 50 © NEC Corporation 2016 Summary