2. Problem & Objective
• Existing solutions for continuous clustering are not elastic
– Central server
– Distributed setting with a fixed number of dedicated servers
• Objective
– An elastic algorithm for real-time, continuous clustering analysis
Comment: C-Cube is somewhat tricky on this point. Rather than scaling freely, it maintains a fixed number of VMs.
3. Clustering
• Divide a set of unlabeled objects into groups that are not pre-defined
– Objects in the same group are similar
– Objects in different groups are dissimilar
• C-Cube’s elastic solution
– Dynamically adjust the amount of computational resources based on the current workload
Comment: Actually, C-Cube is doing workload balancing.
4. C-Cube
• A general and elastic streaming framework to support a variety of clustering algorithms.
Comment: Provided by Storm.
Comment: Only the distance-based clustering algorithm is discussed.
5. Elastic Operator
• Mapper / Spout
• Worker nodes / Intermediate Bolts
• Reducer / Last Bolt
• Achieve elasticity by dynamically adjusting the number of processing units
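The mapper, worker, reducer layout above can be illustrated with a small single-process sketch. It is not C-Cube's or Storm's implementation: the class `ElasticOperator`, its `resize` method, and the round-robin split are hypothetical names and choices made only to show how the number of processing units can change between batches without changing the result.

```python
from typing import Callable, List, Sequence


class ElasticOperator:
    """Hypothetical operator: mapper -> n processing units -> reducer."""

    def __init__(
        self,
        process: Callable[[List[float]], float],
        reduce: Callable[[List[float]], float],
        units: int = 1,
    ) -> None:
        self.process = process  # work done by each processing unit
        self.reduce = reduce    # combines the partial results
        self.units = units      # current number of processing units

    def resize(self, units: int) -> None:
        """Change the number of processing units (the elastic part)."""
        if units < 1:
            raise ValueError("at least one processing unit is required")
        self.units = units

    def run(self, batch: Sequence[float]) -> float:
        # Mapper: split the batch round-robin across the processing units.
        shares = [list(batch[i::self.units]) for i in range(self.units)]
        # Processing units: each handles its own share (sequentially here).
        partials = [self.process(share) for share in shares]
        # Reducer: merge the partial results into one answer.
        return self.reduce(partials)


op = ElasticOperator(process=sum, reduce=sum, units=2)
small = op.run([1, 2, 3, 4])         # light workload, two units
op.resize(4)                         # heavier workload, so add units
large = op.run(list(range(1, 101)))  # same operator, four units
```

In a real deployment the processing units would run in parallel on separate workers; the sketch runs them in a loop to stay self-contained.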
6. Verification-Reclustering
• Scheme
– Verify the clustering results computed at a previous timestamp, and
– only re-run the clustering algorithm when the verifier module determines that the previous results no longer fit the current data distribution
• Verification module
– Performed by an elastic operator
• Distance-based clustering criteria
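The verify-then-recluster scheme can be sketched as follows. This is a minimal illustration under stated assumptions, not C-Cube's verifier: the acceptance rule (mean distance to the nearest previous center stays within a `tolerance` factor of the value measured at the last reclustering), the 1-D data, and the names `VerifyThenRecluster`, `mean_cost`, and `recluster` are all hypothetical. Batches are assumed to be non-empty.

```python
from typing import List, Optional, Sequence


def mean_cost(points: Sequence[float], centers: Sequence[float]) -> float:
    """Average distance from each point to its nearest center."""
    return sum(min(abs(p - c) for c in centers) for p in points) / len(points)


def recluster(points: Sequence[float], k: int, iters: int = 10) -> List[float]:
    """Stand-in clustering step: a few 1-D Lloyd iterations."""
    ordered = sorted(points)
    # Seed the centers at evenly spaced positions of the sorted data.
    centers = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iters):
        groups: List[List[float]] = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centers[i]))
            groups[nearest].append(p)
        # Keep the old center when a group ends up empty.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers


class VerifyThenRecluster:
    def __init__(self, k: int, tolerance: float = 1.5) -> None:
        self.k = k
        self.tolerance = tolerance  # hypothetical acceptance threshold
        self.centers: Optional[List[float]] = None
        self.baseline = 0.0         # mean cost right after the last reclustering
        self.reclusterings = 0

    def update(self, points: Sequence[float]) -> List[float]:
        if self.centers is not None:
            # Verifier: do the previous centers still fit the current data?
            if mean_cost(points, self.centers) <= self.tolerance * self.baseline:
                return self.centers
        # Verification failed (or first batch): run the clustering algorithm.
        self.centers = recluster(points, self.k)
        self.baseline = mean_cost(points, self.centers)
        self.reclusterings += 1
        return self.centers


model = VerifyThenRecluster(k=2)
model.update([0.0, 1.0, 2.0, 10.0, 11.0, 12.0])     # first batch: must cluster
model.update([0.5, 1.5, 2.5, 10.5, 11.5, 11.0])     # similar data: verifier accepts
model.update([20.0, 21.0, 22.0, 30.0, 31.0, 32.0])  # shifted data: recluster
```

Verification only evaluates distances to existing centers, so it is much cheaper than running the clustering algorithm again, which is the point of the scheme.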
7. Distance-based Clustering
• Goal
– Partition the objects into clusters to minimize the sum of distances from all objects in a cluster to the cluster center
• Distance functions
– K-Means
– K-Median
– and their approximations
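To make the two objectives concrete, the sketch below evaluates both for a fixed set of centers, using the standard definitions: k-median sums the distances to the nearest center, while k-means sums the squared distances. The function names and the sample points are illustrative assumptions, and Euclidean distance is assumed.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def nearest_distance(p: Point, centers: Sequence[Point]) -> float:
    """Euclidean distance from p to its closest center."""
    return min(math.dist(p, c) for c in centers)


def k_median_cost(points: Sequence[Point], centers: Sequence[Point]) -> float:
    """Sum of distances to the nearest center."""
    return sum(nearest_distance(p, centers) for p in points)


def k_means_cost(points: Sequence[Point], centers: Sequence[Point]) -> float:
    """Sum of squared distances to the nearest center."""
    return sum(nearest_distance(p, centers) ** 2 for p in points)


points = [(0.0, 0.0), (3.0, 4.0), (10.0, 0.0)]
centers = [(0.0, 0.0), (10.0, 0.0)]
median_cost = k_median_cost(points, centers)
means_cost = k_means_cost(points, centers)
```

Because k-means squares each distance, a single far-away object affects it more strongly than it affects k-median.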
10. Scaling Strategy
• Start a maximal number of virtual machines at the beginning
Comment: Still the limitation.
• Only use a fraction of the virtual machines and keep the other virtual machines idle
• Activate the virtual machines on demand according to the workload
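The scaling strategy can be sketched as a pre-started pool in which only some machines are active. This is an assumption-laden illustration: `VmPool`, `capacity_per_vm`, and the sizing rule (one active VM per `capacity_per_vm` units of workload) are hypothetical, and returning VMs to idle when the workload drops is an assumption the slide does not state.

```python
import math


class VmPool:
    """Hypothetical pool: all VMs are started up front, only some are active."""

    def __init__(self, max_vms: int, capacity_per_vm: float, initial_active: int = 1) -> None:
        if not 1 <= initial_active <= max_vms:
            raise ValueError("initial_active must be between 1 and max_vms")
        self.max_vms = max_vms                  # fixed at start-up
        self.capacity_per_vm = capacity_per_vm  # workload one VM can absorb
        self.active = initial_active

    @property
    def idle(self) -> int:
        """VMs that are running but not currently used."""
        return self.max_vms - self.active

    def rebalance(self, workload: float) -> int:
        """Activate (or idle) VMs so the active set matches the workload."""
        needed = max(1, math.ceil(workload / self.capacity_per_vm))
        # The pool can never grow past the number started at the beginning.
        self.active = min(self.max_vms, needed)
        return self.active


pool = VmPool(max_vms=8, capacity_per_vm=100.0, initial_active=2)
moderate = pool.rebalance(450.0)   # needs 5 VMs, 3 stay idle
overload = pool.rebalance(2000.0)  # would need 20 VMs, capped at 8
quiet = pool.rebalance(50.0)       # drops back to a single active VM
```

The cap in `rebalance` shows why the comment above still applies: elasticity is bounded by the number of machines started at the beginning.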
11. System Performance
• Number of clusters
• Approximation factor
• Number of verifiers used in C-Cube
• Workload change rate
• Number of machines in the cluster