1. Sharding and things we’d like to see improved
© Pythian Services Inc 2022 | Confidential|
Igor Donchovski
May 17th, 2022
2. © Pythian Services Inc 2022 | Confidential | 2
About me
Igor Donchovski
Principal Consultant
Pythian - OSDB
3. Pythian overview
Pythian maximizes the value of your data estate by delivering advanced on-prem, hybrid, cloud, and multi-cloud solutions and solving your toughest data and analytics challenges.
● 25 years in business
● 450+ experts across every data domain & technology
● 400+ global customers
● Gold Partner: 40+ certifications; 5 competencies, incl. Data Analytics and Data Platform
● Silver Partner: 50+ certifications; Migration Factory; Certified Apps Hosting
● Advanced Partner: 100+ certs/accreds; Migration & DevOps competencies
● Premier Partner: 120+ certs/creds; 6 specializations; MSP badge; Partner & Technical Advisory Boards
● Select Partner: 10+ certs/accreds
● Platinum Partner: 60+ certifications; Advisory Board member
4. Overview
• Scaling (Vertical and Horizontal)
• What is a sharded cluster in MongoDB
• Cluster components - shards, config servers, mongos
• Shard keys and chunks
• Hashed and range based sharding
• Choosing a shard key
• Things we’d like to see improved
• Q&A
6. Scaling
Time for scaling?
(diagram: an app and its database)
7. Scaling
Time for scaling?
• The CPU and/or memory becomes overloaded, and the database server either cannot respond to all of the request throughput or cannot do so in a reasonable amount of time.
• Your database server runs out of storage, and thus cannot store all the data.
• Your network interface is overloaded, so it cannot support all the network traffic received.
8. Scaling
• Vertical scaling
– Adding more power (CPU, RAM, DISK) to an
existing machine
– Might require downtime while scaling up
• Horizontal scaling
– Adding more machines into your pool of
resources
– Partitioning the data on multiple machines
– Parallelizing the processing
– More complex to implement and maintain
9. Vertical Scaling
(diagram: a server scaled up from 8 vCPU / 16G memory / 1TB storage to 16 vCPU / 32G memory / 2TB storage)
10. Horizontal Scaling - Reads
(diagram: a 1TB replica set with one primary and many secondaries spread across DC1-DC4; secondaries beyond the voting members are configured with votes: 0)
11. Horizontal Scaling - Writes
(diagram: a 1TB data set partitioned into 4 x 256GB across Shard 1-Shard 4)
13. Scaling per tenant
(diagram: standalone app, where the app talks to a single tenant database holding tenants A, B, and C)
14. Scaling per tenant
(diagrams: standalone app with a shared tenant database; database per tenant (A, B, C); sharded multi-tenant with a catalog)
15. Scaling per tenant
(diagram: sharded multi-tenant, where the app consults a catalog to route tenants A through N)
16. Scaling with MongoDB - Sharding
• Shard/Replica set
(subset of the sharded data)
• Config servers
(metadata and config settings)
• mongos
(query router, cluster interface)
mongos> sh.addShard("shardN")
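The three component types above are started with their respective roles; a minimal sketch (hostnames, replica set names, ports, and paths are hypothetical):

```shell
# Config server replica set member (cluster metadata)
mongod --configsvr --replSet cfgRS --port 27019 --dbpath /data/configdb

# Shard replica set member (subset of the sharded data)
mongod --shardsvr --replSet shard1 --port 27018 --dbpath /data/shard1

# Query router, pointed at the config server replica set
mongos --configdb cfgRS/cfg1.example.net:27019,cfg2.example.net:27019,cfg3.example.net:27019
```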
17. Scaling with MongoDB - Shards
• Contains subset of sharded data
• Replica set for redundancy and HA
• Primary shard per database (mongos picks the shard with the least amount of data when the database is created)
• Holds the database's non-sharded collections
• --shardsvr in config file (port 27018)
18. Scaling with MongoDB - Config Servers
• Stores the metadata for sharded cluster in
config database
• Authentication configuration information in
admin database
• Config servers must be deployed as a replica set (>= 3.4)
• Holds balancer on Primary node (>= 3.4)
• --configsvr in config file (port 27019)
19. Scaling with MongoDB - mongos
• Caching metadata from config servers
• Routes queries to shards
• No persistent state
• Updates cache on metadata changes
• Holds balancer (MongoDB <= 3.2)
• Starting in MongoDB 4.0, the mongos binary will crash when attempting to connect to mongod instances whose feature compatibility version (fCV) is greater than that of the mongos
20. Choosing a shard key
Key properties to evaluate: cardinality, frequency, and whether the key changes monotonically
● Choose a key that is included in most of your queries
● Ideally you don’t want a huge number of documents to share the same shard key value
● Choose something that will co-locate data you wish to retrieve together
21. Chunks
• A contiguous range of shard key values within a particular shard
• Chunk ranges are inclusive of the lower boundary and exclusive of the upper boundary
• Chunks split when they grow beyond the configured chunk size (default 64MB; configurable between 1MB and 1GB)
• MongoDB migrates chunks when a shard contains too many chunks of a collection relative to other shards

Number of chunks | Migration threshold
< 20 | 2
20 - 79 | 4
>= 80 | 8
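The chunk-range semantics and migration thresholds above can be expressed as plain functions (a sketch in shell-style JavaScript; the threshold values are the defaults from the table):

```javascript
// Chunk ranges include the lower boundary and exclude the upper boundary.
function chunkContains(min, max, key) {
  return min <= key && key < max;
}

// Balancer migration threshold as a function of a collection's chunk count
// (values from the table above).
function migrationThreshold(numChunks) {
  if (numChunks < 20) return 2;
  if (numChunks < 80) return 4;
  return 8;
}

console.log(chunkContains(0, 100, 100)); // false: upper boundary is exclusive
console.log(migrationThreshold(25));     // 4
```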
22. Range-based sharding
• Dividing data into contiguous ranges determined by the shard key values
• Documents with “close” shard key values are likely to be in the same chunk or shard
• Query Isolation - more likely to target single shard for range queries
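A mongosh sketch of range sharding and a targeted range query (the collection and field names are borrowed from the example later in the deck):

```javascript
// Range-shard on pushed_at (illustration only: a purely time-based key
// is monotonically increasing, which this deck warns against).
sh.shardCollection("assets.tracking", { pushed_at: 1 })

// A range query on the shard key can be routed to the few shards owning
// that range instead of being broadcast to every shard.
db.getSiblingDB("assets").tracking.find({
  pushed_at: { $gte: ISODate("2022-04-01"), $lt: ISODate("2022-05-01") }
})
```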
23. Hash-based sharding
• Uses a hashed index of a single or compound key to partition data
• More even data distribution at the cost of reducing Query Isolation
• Applications do not need to compute hashes
24. Balancer
• Background process that monitors the number of chunks on each shard
• Migrates chunks between shards to reach an equal number of chunks per shard
• Runs on the Primary of the config servers replica set
(diagram: 64MB chunks stacked unevenly on Shard1-Shard3, with a chunk migrating off the fullest shard)
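Balancer state can be inspected and toggled from mongos with the standard helpers:

```javascript
sh.getBalancerState()   // true if the balancer is enabled
sh.isBalancerRunning()  // true while a balancing round is in progress
sh.stopBalancer()       // disable before maintenance windows
sh.startBalancer()
```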
25. Balancer
(diagram: the same cluster after migration, with an equal number of 64MB chunks on each shard)
26. Sharding a collection
• A tracking collection whose documents contain {client_id, asset_id, pushed_at, …}
• Each client has their assets tracked by GPS location
• Tens of thousands of clients
• Between tens and hundreds of thousands of assets per client
• Millions of document inserts per minute at the cluster level
• Data expires after 12 months, via a TTL index on pushed_at
• Clients request trajectories of asset_id[s] within a time range
27. Sharding a collection
• Shard key options:
{client_id : 1}
{client_id : 1, asset_id : 1}
{client_id : 1, asset_id : 1, pushed_at : 1}
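A sketch of sharding the example collection with the widest of the three candidate keys (database and collection names as used elsewhere in the deck):

```javascript
sh.enableSharding("assets")
sh.shardCollection("assets.tracking",
                   { client_id: 1, asset_id: 1, pushed_at: 1 })
```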
28. Refining a shard key
• MongoDB version >= 4.4
{client_id : 1}
db.adminCommand( { refineCollectionShardKey: "assets.tracking",
key: { client_id: 1, asset_id: 1 }} )
db.adminCommand( { refineCollectionShardKey: "assets.tracking",
key: { client_id: 1, asset_id: 1, pushed_at : 1 }} )
29. Resharding a collection
MongoDB version >=5.0
db.adminCommand({
reshardCollection: "assets.tracking",
key: { asset_id: 1, pushed_at : "hashed" }
})
31. Balancing
● Goal state: reach an equal number of chunks per shard
● The balancer balances chunk counts, not data; it has no information about how much data each chunk actually holds
(diagram: shards with equal chunk counts despite uneven chunk sizes)
32. Balancing
● Data expires and gets deleted over time (TTL)
● Empty chunks accumulate as a result
● There is no internal process that cleans up empty chunks
(diagram: shards holding chunks of 50MB, 64MB, 32MB, 40MB, 48MB, 5MB, 5MB, and 10MB)
33. Balancing
● Real data distribution
(diagram: actual data per shard: Shard1 50MB, Shard2 184MB, Shard3 20MB)
35. Cluster metadata
● The config database's chunks collection has no info on chunk size
mongos> db.chunks.findOne()
{ "_id" : "assets.tracking-client_id_1asset_id_3705937pushed_at_MinKey",
"lastmod" : Timestamp(562503, 1),
"lastmodEpoch" : ObjectId("5e2e7c6bcab375a6b995002d"),
"ns" : "assets.tracking",
"min" : {"client_id" : NumberLong(1), "asset_id" : NumberLong(3705937),"pushed_at" : { "$minKey" : 1 }},
"max" : {"client_id" : NumberLong(1),"asset_id" : NumberLong(3972512),"pushed_at" : { "$minKey" : 1 }},
"shard" : "s6",
"jumbo" : false,
"history" : [{"validAfter" : Timestamp(1642642009, 1312), "shard" : "s6"}]
}
36. Cluster metadata
● You have to run a separate script to get all chunks with their sizes
  ○ The dataSize command returns the size in bytes for the specified range
● This requires scanning all of the chunks and the documents associated with them
● Merge empty chunks using the mergeChunks operation
● Rebalance the chunks after the initial merge
  ○ Use the moveChunk command to move them manually
  ○ Or let the balancer redistribute the chunks
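A sketch of such a script, run through mongos (the namespace and keyPattern follow the example collection; the mergeChunks bounds are hypothetical and must describe a contiguous range on a single shard):

```javascript
// Measure every chunk of the collection with the dataSize command.
var chunks = db.getSiblingDB("config").chunks.find({ ns: "assets.tracking" });
chunks.forEach(function (c) {
  var res = db.getSiblingDB("assets").runCommand({
    dataSize: "assets.tracking",
    keyPattern: { client_id: 1, asset_id: 1, pushed_at: 1 },
    min: c.min,
    max: c.max
  });
  print(JSON.stringify(c.min) + " on " + c.shard + ": " + res.size + " bytes");
});

// Merge a contiguous run of empty chunks living on the same shard.
db.adminCommand({
  mergeChunks: "assets.tracking",
  bounds: [ { client_id: 1, asset_id: MinKey, pushed_at: MinKey },
            { client_id: 2, asset_id: MinKey, pushed_at: MinKey } ]
})
```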
37. Merging empty chunks
● Merge chunks that form a contiguous range on the same shard
(diagram: shards holding chunks of 50MB, 64MB, 32MB, 40MB, 48MB, 5MB, 5MB, and 10MB)
38. Merging empty chunks
● Rebalancing should only move chunks that are not in active use (‘hot’)
(diagram: the small merged chunks migrating between shards)
39. Migrating chunks
1. The balancer process sends the moveChunk command to the source shard
2. The source starts the move with an internal moveChunk command. During the migration process,
operations to the chunk route to the source shard. The source shard is responsible for incoming write
operations for the chunk
3. The destination shard builds any indexes required by the source that do not exist on the destination
4. The destination shard begins requesting documents in the chunk and starts receiving copies of the data.
5. After receiving the final document in the chunk, the destination shard starts a synchronization process to
ensure that it has the changes to the migrated documents that occurred during the migration
6. When fully synchronized, the source shard connects to the config database and updates the
cluster metadata with the new location for the chunk
7. After the source shard completes the update of the metadata, and once there are no open cursors on the
chunk, the source shard deletes its copy of the documents
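Step 1's command can also be issued manually; a sketch with hypothetical bounds and the shard names used earlier in the deck:

```javascript
// Move the chunk containing this shard key value to shard "s2".
db.adminCommand({
  moveChunk: "assets.tracking",
  find: { client_id: 1, asset_id: 3705937, pushed_at: MinKey },
  to: "s2"
})
```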
40. Migrating chunks
● A collection-level lock is taken to update the cluster metadata
● Rebalancing should only move chunks that are not ‘hot’
● Perform migrations during off-peak hours
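Off-peak hours can be enforced with a balancing window (a sketch; the start/stop times are hypothetical and are interpreted on the config server's clock):

```javascript
var cfg = db.getSiblingDB("config");
cfg.settings.updateOne(
  { _id: "balancer" },
  { $set: { activeWindow: { start: "01:00", stop: "05:00" } } },
  { upsert: true }
)
```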
43. Observability
mongostat -u<username> -p<password> --authenticationDatabase=admin --discover --all --interactive 5
host insert query update delete getmore command dirty used flushes mapped vsize res nonmapped faults lrw lrwt qrw arw net_in net_out conn set repl time
A1r2:27018 4 290 197 *0 1531 1082|0 2.9% 80.0% 0 69.0G 49.2G 69.0G n/a 0.0%|0.0% 0|0 0|0 1|0 2.96m 11.9m 711 s1 PRI May 10 11:48:55.927
A2r1:27018 8 205 146 *0 1244 885|0 0.7% 80.0% 0 65.8G 48.2G 65.8G n/a 0.0%|0.0% 0|0 0|0 1|0 2.42m 9.64m 529 s2 PRI May 10 11:48:55.934
A3r2:27018 10 158 106 *0 986 694|0 0.8% 80.0% 0 68.1G 48.7G 68.1G n/a 0.0%|0.0% 0|0 0|0 1|0 1.77m 6.18m 476 s3 PRI May 10 11:48:55.913
A4r1:27018 4 203 143 *0 1195 793|0 2.0% 80.0% 0 70.0G 48.7G 70.0G n/a 0.0%|0.0% 0|0 0|0 1|0 2.13m 7.88m 499 s4 PRI May 10 11:48:55.945
A5r1:27018 3 174 110 *0 964 693|0 1.5% 80.0% 0 68.1G 49.1G 68.1G n/a 0.0%|0.0% 0|0 0|0 1|0 1.78m 6.40m 487 s5 PRI May 10 11:48:55.946
A6r3:27018 3 216 134 *0 1114 815|0 1.1% 80.0% 0 68.6G 48.8G 68.6G n/a 0.0%|0.0% 0|0 0|0 1|0 2.16m 10.7m 461 s6 PRI May 10 11:48:55.941
A7r3:27018 4 192 130 *0 680 456|0 1.4% 80.0% 0 66.7G 49.0G 66.7G n/a 0.0%|0.0% 0|0 0|0 1|0 1.69m 7.22m 472 s7 PRI May 10 11:48:55.955
A8r1:27018 2 265 155 *0 1817 1116|0 2.8% 80.0% 0 67.2G 48.8G 67.2G n/a 0.0%|0.0% 0|0 0|2 1|0 2.79m 10.8m 498 s8 PRI May 10 11:48:55.930
A9r3:27018 2 146 88 *0 712 555|0 1.2% 80.0% 0 65.3G 48.9G 65.3G n/a 0.0%|0.0% 0|0 0|0 1|0 1.63m 7.95m 454 s9 PRI May 10 11:48:55.942
mongostat -u<username> -p<password> --authenticationDatabase=admin --discover
-o="host,connections.current=connectionsCurrent,metrics.document.returned.rate()=documentsReturned,metrics.operation.scanAndOrder.rate()=scanAndOrder,metrics.queryExecutor.scanned.rate()=
indexScans,metrics.queryExecutor.scannedObjects.rate()=documentScans,localTime" --interactive 5
host connectionsCurrent documentsReturned scanAndOrder indexScans documentScans localTime
A1r2:27018 711 2378 1 3478 3442 2022-05-10 12:03:06.847 +0000 UTC
A2r1:27018 533 1457 2 719 1639 2022-05-10 12:03:06.88 +0000 UTC
A3r2:27018 472 945 1 651 1132 2022-05-10 12:03:06.846 +0000 UTC
A4r1:27018 498 1344 2 1835 2557 2022-05-10 12:03:06.861 +0000 UTC
A5r1:27018 482 1097 3 3510 3674 2022-05-10 12:03:06.897 +0000 UTC
A6r3:27018 463 1146 1 1123 1457 2022-05-10 12:03:06.903 +0000 UTC
A7r3:27018 478 927 1 1034 1313 2022-05-10 12:03:06.908 +0000 UTC
A8r1:27018 498 1539 1 979 1682 2022-05-10 12:03:06.859 +0000 UTC
A9r3:27018 455 995 1 7988 1985 2022-05-10 12:03:06.897 +0000 UTC
44. Summary
● The balancer only cares whether the cluster has an equal number of chunks per shard
● Migrating a chunk from the source to the target shard requires a collection-level lock
● Each chunk moved is effectively written 3 times (written on the source, written on the destination, deleted from the source)
● Choose a shard key that avoids broadcast operations on the cluster
● Hash-based sharding usually requires more resources than range-based sharding
● Choose a shard key that is included in most of your queries and has high cardinality
● Scaling out might take days or weeks before the data gets redistributed
● Only use sharding for your large collections
● Avoid sharding if you can manage your data in replica sets