C-Cube: Elastic Continuous Clustering
            in the Cloud


          Speaker: LIN Qian
 http://www.comp.nus.edu.sg/~linqian
Problem & Objective
• Existing solutions for continuous
  clustering are not elastic
  – Central server
  – Distributed setting with a fixed number of
    dedicated servers.
• Objective
  – An elastic algorithm for real-time,
    continuous clustering analysis
  – Note: C-Cube is somewhat tricky on this point;
    it instead maintains a fixed number of VMs.
Clustering
• Divide a set of unlabeled objects into
  groups that are not pre-defined
  – objects in the same group are similar
  – objects in different groups are dissimilar
• C-Cube’s elastic solution
  – Dynamically adjust the amount of
    computational resources based on the
    current workload
  – Note: C-Cube is actually doing
    workload balancing
C-Cube
• A general and elastic streaming
  framework to support a variety of
  clustering algorithms.

  – Note: provided by Storm
  – Note: only the distance-based clustering
    algorithm is discussed
Elastic Operator
• Mapper / Spout
• Worker nodes / Intermediate Bolts
• Reducer / Last Bolt
• Achieve elasticity by dynamically adjusting
  the number of processing units
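As a rough illustration of the operator layout above (not C-Cube's or Storm's actual code), the sketch below splits the input across a configurable number of workers and merges their partial results. The function and parameter names are hypothetical, and the workers run sequentially here for simplicity.

```python
def run_elastic_operator(items, num_workers, work, combine):
    """Mapper -> workers -> reducer, with an adjustable worker count."""
    if num_workers < 1:
        raise ValueError("need at least one worker")
    # Mapper: deal items round-robin into one shard per worker.
    shards = [items[i::num_workers] for i in range(num_workers)]
    # Workers: each processes its own shard (sequentially in this sketch).
    partials = [work(shard) for shard in shards]
    # Reducer: merge the partial results into one answer.
    return combine(partials)


readings = list(range(1, 101))
# The same operator with two different degrees of parallelism.
light_load = run_elastic_operator(readings, 2, sum, sum)
heavy_load = run_elastic_operator(readings, 8, sum, sum)
```

Changing `num_workers` changes only how the work is spread, not the result, which is the property an elastic operator relies on.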
Verification-Reclustering
• Scheme
  – Verify the clustering results computed at a
    previous timestamp, and
  – only re-run the clustering algorithm when
    the verifier module determines that the
    previous results no longer fit the current
    data distribution
• Verification module
  – Performed by an elastic operator
• Distance-based clustering criteria
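The verify-then-recluster scheme can be sketched as follows. This is a minimal illustration under stated assumptions, not the C-Cube verifier: the `verify` rule (average distance to the nearest center must stay within `tolerance` times the value recorded at the last reclustering) and all function names are hypothetical, and a plain Lloyd-style k-means stands in for whatever clustering algorithm is plugged in.

```python
import math
import random


def cost(points, centers):
    """Sum of distances from each point to its nearest center."""
    return sum(min(math.dist(p, c) for c in centers) for p in points)


def recluster(points, k, iters=10, seed=0):
    """Lloyd-style k-means; a stand-in for any clustering algorithm."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            groups[nearest].append(p)
        # Move each center to the mean of its group; keep it if the group is empty.
        centers = [
            tuple(sum(xs) / len(g) for xs in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
    return centers


def verify(points, centers, baseline, tolerance=1.5):
    """Hypothetical rule: old centers still fit if the average cost is within tolerance."""
    return cost(points, centers) / len(points) <= tolerance * baseline


def process_stream(batches, k):
    """Re-run clustering only when verification of the previous result fails."""
    centers, baseline, reruns = None, None, 0
    for batch in batches:
        if centers is None or not verify(batch, centers, baseline):
            centers = recluster(batch, k)
            baseline = cost(batch, centers) / len(batch)
            reruns += 1
    return centers, reruns
```

With a stable data distribution, `process_stream` pays only the cheaper verification cost per timestamp; the full clustering runs again only after the data drifts.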
Distance-based Clustering
• Goal
  – Partition the objects into clusters to
    minimize the sum of distances from all
    objects in a cluster to the cluster center
• Distance functions
  – K-Means
  – K-Median
  – and their approximations
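To make the objective concrete, the sketch below evaluates both costs for a fixed set of centers. It uses the standard definitions, which the slide does not spell out: k-median sums the plain distance from each object to its nearest center, while k-means conventionally sums the squared distance. The function names and sample points are illustrative only.

```python
import math


def nearest_distance(point, centers):
    """Euclidean distance from a point to its closest center."""
    return min(math.dist(point, c) for c in centers)


def k_median_cost(points, centers):
    """Sum of distances to the nearest center."""
    return sum(nearest_distance(p, centers) for p in points)


def k_means_cost(points, centers):
    """Sum of squared distances to the nearest center."""
    return sum(nearest_distance(p, centers) ** 2 for p in points)


points = [(0.0, 0.0), (3.0, 4.0), (10.0, 0.0)]
centers = [(0.0, 0.0), (10.0, 0.0)]
median_cost = k_median_cost(points, centers)
means_cost = k_means_cost(points, centers)
```

Only the point (3, 4) is off-center here, at distance 5 from (0, 0), so squaring makes the k-means cost weigh it more heavily than the k-median cost does.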
C-Cube Architecture
[Figure: C-Cube architecture diagram]
Implementation
• 9 PCs
  – 2 GB memory, 1.8 GHz CPU (2 cores)
  – Ubuntu 10.04
• Storm 0.6.2
  – Zookeeper (1 PC)
  – Nimbus node (1 PC)
  – Kestrel message queue server (1 PC)
  – Supervisor nodes (6 PCs)
Scaling Strategy
• Start a maximal number of virtual
  machines at the beginning
  – Note: this is still a limitation
• Only use a fraction of the virtual
  machines and keep the other virtual
  machines idle
• Activate the virtual machines on demand
  according to the workload
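The strategy above amounts to a fixed pool in which only the active subset receives work. The sketch below is a hypothetical model of that idea, not the C-Cube implementation; `VmPool`, `capacity_per_vm`, and the simple ceiling rule for choosing the active count are all assumptions made for illustration.

```python
import math


class VmPool:
    """All VMs are started up front; only `active` of them receive work."""

    def __init__(self, max_vms, capacity_per_vm):
        self.max_vms = max_vms
        # Assumed unit: workload items one VM can handle per time step.
        self.capacity_per_vm = capacity_per_vm
        self.active = 1

    def rescale(self, workload):
        """Activate just enough VMs for the workload, capped at the pool size."""
        needed = math.ceil(workload / self.capacity_per_vm)
        self.active = max(1, min(self.max_vms, needed))
        return self.active

    @property
    def idle(self):
        """VMs that are running but currently receive no work."""
        return self.max_vms - self.active


pool = VmPool(max_vms=6, capacity_per_vm=100)
active_now = pool.rescale(250)
```

The cap in `rescale` reflects the limitation noted on the slide: the pool can never grow beyond the number of VMs started at the beginning.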
System Performance
•   Number of clusters
•   Approximation factor
•   Number of verifiers used in C-Cube
•   Workload change rate
•   Number of machines in the cluster



