C-Cube Elastic Continuous Clustering in the Cloud

Qian Lin

C-Cube: Elastic Continuous Clustering
in the Cloud

Speaker: LIN Qian
http://www.comp.nus.edu.sg/~linqian

Problem & Objective
• Existing solutions for continuous clustering are not elastic
– Central server
– Distributed setting with a fixed number of dedicated servers
• Objective
– An elastic algorithm for real-time, continuous clustering analysis
Note: C-Cube is somewhat tricky on this point. It instead maintains a fixed number of VMs.

Clustering
• Divide a set of unlabeled objects into groups that are not pre-defined
– objects in the same group → similar
– objects in different groups → dissimilar
• C-Cube’s elastic solution
– Dynamically adjust the amount of computational resources based on the current workload
Note: C-Cube is actually doing workload balancing.

C-Cube
• A general and elastic streaming framework to support a variety of clustering algorithms
Notes:
– Provided by Storm
– Only the distance-based clustering algorithm is discussed

Elastic Operator
• Achieve elasticity by dynamically adjusting the number of processing units
• Figure labels (role / Storm counterpart)
– Mapper / Spout
– Worker nodes / Intermediate Bolts
– Reducer / Last Bolt
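
The mapper / worker / reducer shape of the elastic operator can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not C-Cube's or Storm's code: the function name `elastic_operator` and its parameters `work`, `combine`, and `n_units` are hypothetical, and the units run sequentially here rather than on separate machines.

```python
def elastic_operator(items, work, combine, n_units):
    """Run `work` on n_units partitions of `items`, then `combine` the partial results."""
    if n_units < 1:
        raise ValueError("need at least one processing unit")
    # Mapper (spout): spread the input round-robin over the processing units.
    parts = [items[i::n_units] for i in range(n_units)]
    # Worker nodes (intermediate bolts): each unit handles only its own part.
    # They run one after another here; a real deployment would run them in parallel.
    partials = [work(part) for part in parts]
    # Reducer (last bolt): merge the partial results into a single output.
    return combine(partials)


data = list(range(250))
# The unit count can change between calls without changing the result.
results = [elastic_operator(data, sum, sum, n) for n in (1, 3, 6)]
```

The point of the sketch is that the number of processing units is a free parameter: as long as the reducer can merge partial results, the unit count can be raised or lowered between invocations.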

Verification-Reclustering
• Scheme
– Verify the clustering results computed at a
previous timestamp, and
– only re-run the clustering algorithm when
the verifier module determines that the
previous results no longer fit the current
data distribution
• Verification module
– Performed by an elastic operator
• Distance-based clustering criteria
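
The verify-then-recluster scheme can be sketched as follows. The slides do not state the verifier's exact criterion, so this example assumes one: the previous centers are kept while their clustering cost on the current data stays within a `tolerance` factor of the cost recorded when they were computed. The names `assign_cost`, `kmeans`, `verify_or_recluster`, and the `tolerance` parameter are hypothetical, and plain Lloyd's k-means stands in for whatever reclustering algorithm is plugged in.

```python
import random


def assign_cost(points, centers):
    """Sum of squared distances from each point to its nearest center."""
    return sum(
        min(sum((p - c) ** 2 for p, c in zip(pt, ctr)) for ctr in centers)
        for pt in points
    )


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm, used here as a stand-in reclustering step."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for pt in points:
            # Index of the center closest to this point.
            j = min(
                range(k),
                key=lambda i: sum((p - c) ** 2 for p, c in zip(pt, centers[i])),
            )
            groups[j].append(pt)
        # Move each center to the mean of its group; keep it if the group is empty.
        centers = [
            tuple(sum(x) / len(g) for x in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
    return centers


def verify_or_recluster(points, centers, baseline_cost, tolerance=1.5, k=2):
    """Assumed criterion: keep old centers while cost <= tolerance * baseline."""
    cost = assign_cost(points, centers)
    if cost <= tolerance * baseline_cost:
        return centers, baseline_cost, False  # verified, no reclustering
    new_centers = kmeans(points, k)
    return new_centers, assign_cost(points, new_centers), True
```

Verification is a single pass over the data, so it is much cheaper than reclustering, which is why reclustering only runs when verification fails.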

Distance-based Clustering
• Goal
– Partition the objects into clusters to
minimize the sum of distances from all
objects in a cluster to the cluster center
• Distance functions
– K-Means
– K-Median
(and their approximations)
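
As a small illustration of the two objectives, the sketch below computes both costs for a given set of centers. In their standard definitions, k-median sums the distances to the nearest center, while k-means sums the squared distances. The function names are chosen for this example and Euclidean distance is assumed.

```python
import math


def nearest_distance(point, centers):
    """Euclidean distance from a point to its closest center."""
    return min(math.dist(point, c) for c in centers)


def k_median_cost(points, centers):
    """k-median objective: sum of distances to the nearest center."""
    return sum(nearest_distance(p, centers) for p in points)


def k_means_cost(points, centers):
    """k-means objective: sum of squared distances to the nearest center."""
    return sum(nearest_distance(p, centers) ** 2 for p in points)


points = [(0, 0), (3, 4), (10, 0)]
centers = [(0, 0), (10, 0)]
median_cost = k_median_cost(points, centers)
means_cost = k_means_cost(points, centers)
```

In this example only the point (3, 4) is away from its nearest center, at distance 5 from (0, 0), so the k-median cost is about 5 and the k-means cost about 25.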

C-Cube Architecture
(Figure: architecture diagram, not recoverable from the slide text.)

Implementation
• 9 PCs
– 2 GB memory, 1.8 GHz CPU (2 cores)
– Ubuntu 10.04
• Storm 0.6.2
– Zookeeper (1 PC)
– Nimbus node (1 PC)
– Kestrel message queue server (1 PC)
– Supervisor nodes (6 PCs)

Scaling Strategy
• Start a maximal number of virtual machines at the beginning
• Only use a fraction of the virtual machines and keep the other virtual machines idle
• Activate the virtual machines on demand according to the workload
Note: the fixed maximum is still a limitation.
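
A minimal sketch of this strategy, assuming a simple capacity rule that the slides do not specify: every VM is started up front, and the pool only changes how many of them are marked active. The class name `VMPool` and the `capacity_per_vm` parameter are hypothetical.

```python
import math


class VMPool:
    """All VMs are started up front; only some of them are marked active."""

    def __init__(self, max_vms, capacity_per_vm):
        self.max_vms = max_vms                  # fixed upper bound, set at start-up
        self.capacity_per_vm = capacity_per_vm  # assumed load one VM can absorb
        self.active = 1                         # the remaining VMs stay idle

    def rebalance(self, workload):
        """Activate or idle VMs so that active capacity covers the workload."""
        wanted = math.ceil(workload / self.capacity_per_vm)
        # Never go below one VM, and never above the pre-started maximum.
        self.active = max(1, min(self.max_vms, wanted))
        return self.active

    @property
    def idle(self):
        """Number of started VMs that are currently unused."""
        return self.max_vms - self.active


pool = VMPool(max_vms=6, capacity_per_vm=100)
active_at_250 = pool.rebalance(250)
idle_at_250 = pool.idle
active_at_1000 = pool.rebalance(1000)
```

The cap in `rebalance` is where the limitation shows: once the workload exceeds what the pre-started VMs can handle, the pool cannot grow any further.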

System Performance
• Number of clusters
• Approximation factor
• Number of verifiers used in C-Cube
• Workload change rate
• Number of machines in the cluster

