Cassandra was built from the ground up to enable linearly scalable, always-on applications. But the path to high availability has many land mines that can mean failure for the inexperienced user. In this talk, I will offer practical advice on how to achieve 100% uptime on millions of transactions per second. I'll address all aspects of the topic, including deployment, configuration, application design, and operations.
2. Who Am I?
Robbie Strickland
VP, Software Engineering
rstrickland@weather.com
@rs_atl
An IBM Business
3. Who Am I?
• Contributor to C* community since 2010
• DataStax MVP 2014/15/16
• Author, Cassandra High Availability & Cassandra 3.x High Availability
• Founder, ATL Cassandra User Group
4. What is HA?
• Five nines – 99.999% uptime?
– That allows only ~5 minutes per year
– Even 99.9% means ~9 hours per year – a full work day of down time!
• Can we do better?
5. Cassandra + HA
• No SPOF
• Multi-DC replication
• Incremental backups
• Client-side failure handling
• Server-side failure handling
• Lots of JMX stats
6. HA by Design (it’s not an add-on)
• Properly designed topology
• Data model that respects C* architecture
• Application that handles failure
• Monitoring strategy with early warning
• DevOps mentality
9. Consistency Basics
• Start with LOCAL_QUORUM reads & writes
– Balances performance & availability, and provides full consistency within a single DC
– Experiment with eventual consistency (e.g. CL=ONE) in a controlled environment
• Avoid non-local CLs in multi-DC environments
– Otherwise it’s a crapshoot
10. Rack Failure
• Don’t put all your nodes in one rack!
• Use rack awareness
– Places replicas in different racks
• But don’t use RackInferringSnitch
24. Multi-DC Routing
• Use DCAwareRoundRobinPolicy wrapped by TokenAwarePolicy
– This is the default
– Prefers local DC, chosen based on host distance and seed list
– BUT this can fail for logical DCs that are physically co-located, or for improperly defined seed lists!
25. Multi-DC Routing
Pro tip:
val localDC = ??? // get from config
val dcPolicy =
  new TokenAwarePolicy(
    DCAwareRoundRobinPolicy.builder()
      .withLocalDc(localDC)
      .build()
  )
Be explicit!!
26. Handling DC Failure
• Make sure backup DC has sufficient capacity
– Don’t try to add capacity on the fly!
• Try to limit updates
– Avoids potential consistency issues on recovery
• Be careful with retry logic
– Isolate it to a single point in the stack
– Don’t DDoS yourself with retries!
27. Topology Lessons
• Leverage rack awareness
• Use LOCAL_QUORUM
– Full local consistency
– Eventual consistency across DCs
• Run incremental repairs to maintain inter-DC consistency
• Explicitly route local app to local C* DC
• Plan for DC failure
29. Quick Primer
• C* is a distributed hash table
– Partition key (first field in PK declaration) determines placement in the cluster
– Efficient queries MUST know the key!
• Data for a given partition is naturally sorted based on clustering columns
• Column range scans are efficient
30. Quick Primer
• All writes are immutable
– Deletes create tombstones
– Updates do not immediately purge old data
– Compaction has to sort all this out
31. Who Cares?
• Bad performance = application downtime & lost users
• Lagging compaction is an operations nightmare
• Some models & query patterns create serious availability problems
32. Do
• Choose a partition key that distributes evenly
• Model your data based on common read patterns
• Denormalize using collections & materialized views
• Use efficient single-partition range queries
33. Don’t
• Create hot spots in either data or traffic patterns
• Build a relational data model
• Create an application-side join
• Run multi-node queries
• Use batches to group unrelated writes
35. Client
Problem Case #1
SELECT *
FROM contacts
WHERE id IN (1,3,5,7)
[Diagram: 6-node cluster, RF=3, with replicas of keys 1–8 spread across the nodes]
Must ask 4 out of 6 nodes in the cluster to satisfy quorum!
36. Client
Problem Case #1
SELECT *
FROM contacts
WHERE id IN (1,3,5,7)
[Diagram: the same cluster with two nodes down]
“Not enough replicas available for query at consistency LOCAL_QUORUM”
Keys 1, 3, and 5 all have sufficient replicas, yet the entire query fails because of 7
37. Solution #1
• Option 1: Be optimistic and run it anyway
– If it fails, you can fall back to option 2
• Option 2: Run parallel queries for each key
– Return the results that are available
– Fall back to CL ONE for failed keys
– Client token awareness means coordinator does less work
38. Problem Case #2
CREATE INDEX ON contacts(birth_year)
SELECT *
FROM contacts
WHERE birth_year=1975
39. Client
Problem Case #2
SELECT *
FROM contacts
WHERE birth_year=1975
[Diagram: index entries for birth_year=1975 stored on each node alongside the source data]
Index lives with the source data … so 5 nodes must be queried!
40. Client
Problem Case #2
SELECT *
FROM contacts
WHERE birth_year=1975
[Diagram: the same cluster with two nodes down]
“Not enough replicas available for query at consistency LOCAL_QUORUM”
Index lives with the source data … so 5 nodes must be queried!
41. Solution #2
• Option 1: Build your own index
– App has to maintain the index
• Option 2: Use a materialized view
– Not available before 3.0
• Option 3: Run it anyway
– Ok for small amounts of data (think 10s to 100s of rows) that can live in memory
– Good for parallel analytics jobs (Spark, Hadoop, etc.)
42. Problem Case #3
CREATE TABLE sensor_readings (
  sensorID uuid,
  timestamp int,
  reading decimal,
  PRIMARY KEY (sensorID, timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC);
43. Problem Case #3
• Partition will grow unbounded
– i.e. it creates wide rows
• Unsustainable number of columns in each partition
• No way to archive off old data
44. Solution #3
CREATE TABLE sensor_readings (
  sensorID uuid,
  time_bucket int,
  timestamp int,
  reading decimal,
  PRIMARY KEY ((sensorID, time_bucket), timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC);
46. Monitoring Basics
• Enable remote JMX
• Connect a stats collector (jmxtrans, collectd, etc.)
• Use nodetool for quick single-node queries
• C* tells you pretty much everything via JMX
47. Thread Pools
• C* is a SEDA architecture
– Essentially message queues feeding thread pools
– nodetool tpstats
• Pending messages are bad:

Pool Name             Active  Pending   Completed  Blocked  All time blocked
CounterMutationStage  0       0         0          0        0
ReadStage             0       0         103        0        0
RequestResponseStage  0       0         0          0        0
MutationStage         0       13234794  0          0        0
48. Lagging Compaction
• Lagging compaction is the reason for many performance issues
• Reads can grind to a halt in the worst case
• Use nodetool tablestats/cfstats & compactionstats
50. Lagging Compaction
• Leveled: watch for SSTables remaining in L0:

Keyspace: my_keyspace
  Read Count: 11207
  Read Latency: 0.047931114482020164 ms.
  Write Count: 17598
  Write Latency: 0.053502954881236506 ms.
  Pending Flushes: 0
    Table: my_table
    SSTable Count: 70
    SSTables in each level: [50/4, 15/10, 5/100]
    (50 in L0 – should be 4)
51. Lagging Compaction Solution
• Triage:
– Check stats history to see if it’s a trend or a blip
– Increase compaction throughput using nodetool setcompactionthroughput
– Temporarily switch to SizeTiered
• Do some digging:
– I/O problem?
– Add nodes?
52. Wide Rows / Hotspots
• Only takes one to wreak havoc
• It’s a data model problem
• Early detection is key!
• Watch partition max bytes
– Make sure it doesn’t grow unbounded
– … or become significantly larger than mean bytes
53. Wide Rows / Hotspots
• Use nodetool toppartitions to sample reads/writes and find the offending partition
• Take action early to avoid OOM issues with:
– Compaction
– Streaming
– Reads
Thank you for joining me for my talk today. My name is Robbie Strickland, and I’m going to talk about how to build highly available applications on Cassandra. If this is not the session you’re looking for, this would be a good time to head out and find the right one. Alternatively, if you’re an expert on this subject, please come talk to me afterward and maybe I can find you a new job…
A little background for those who don’t know me. I lead the analytics team at The Weather Company, based in the beautiful city of Atlanta. I am responsible for our data warehouse and our analytics platform, as well as a team of engineers who get to work on cool analytics projects on massive and varied data sets. We were recently acquired by IBM’s analytics group, and so my role has expanded to include work on the larger IBM platform efforts as well.
Why am I qualified to talk about this? I’ve been around the community for a while, since 2010 and Cassandra 0.5 to be exact, and I’ve worked on a variety of Cassandra-related open source projects. If there’s a way to screw things up with Cassandra, I’ve done it. If you’re interested in learning more about that, you can pick up a copy of my book, Cassandra High Availability, which has a newly released second edition focusing on the 3x series.
I’d like to start by asking the question: what do we mean by high availability? A common definition is the so-called five nines of uptime, or 99.999%. The math matters here: five nines allows only about 5 minutes of downtime per year, but drop just two nines to 99.9% and you’re looking at roughly 9 hours per year, a full work day of down time! I don’t know about your business, but to me that sounds like an unacceptable number. Can we do better?
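The uptime arithmetic is easy to sanity-check yourself; here is a quick sketch in Python (assuming a 365.25-day year):

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # ~525,960 minutes

def downtime_minutes(uptime_pct: float) -> float:
    """Minutes of allowed downtime per year for a given uptime percentage."""
    return MINUTES_PER_YEAR * (1 - uptime_pct / 100)

# Three nines: roughly a full work day of downtime per year
assert 500 < downtime_minutes(99.9) < 530
# Five nines: barely five minutes
assert 5 < downtime_minutes(99.999) < 5.5
```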
The conversation around HA and Cassandra is complex and multi-faceted, so it would be impossible to cover everything that needs to be said in a half hour talk. Today I’m going to touch on the highlights, and hopefully eliminate many of the unknown unknowns. Fortunately Cassandra was built from the ground up to be highly available, and if properly used it can deliver 100% uptime on your critical applications. This is possible by leveraging some key capabilities, such as its distributed, no-single-point-of-failure design and its replication across data centers. It supports incremental backups, and robust failure handling on both the client and the server. And Cassandra exposes pretty much anything you’d like to know about its inner workings via a host of JMX stats, so ignorance is no excuse.
As you begin to design your application, I would encourage you to channel the Cassandra architects and think about availability from the start. It’s very difficult to bolt on HA capability to an existing app, and this is especially true with Cassandra. Let’s talk about the ingredients that comprise a successful HA deployment, starting with a properly designed topology. By this I mean the physical deployment of both the database and your application. Next you need a data model that leverages Cassandra’s strengths and mitigates its weaknesses. You’ll want to make sure that your application handles failure as well, and there are some specific strategies I’ll discuss to drive that point home. You will need to keep a close watch on the key performance metrics so you have reaction time before a failure, and lastly you’ll need to cultivate a devops mentality if you don’t already think this way.
I’ve decided to approach this topic by walking you through some common failure scenarios.
Let’s lay a few ground rules. I’m going to assume a few things about your configuration that are commonly considered to be table stakes for any production Cassandra deployment. First, you should be using NetworkTopologyStrategy and either the GossipingPropertyFileSnitch or the appropriate snitch for your cloud provider. For the record, we run many multi-region EC2 clusters, yet we still use the Gossiping snitch because it gives us more control. Next, I’m assuming you have at least 5 nodes, since anything less is really insufficient for production use. A replication factor of three is the de facto standard; while there are reasons to have more or fewer, your reason probably isn’t valid. Pop quiz: If you set your replication factor to two, what constitutes a quorum? That’s right: two. Now let’s say you have five nodes. How many nodes can fail without some subset of your data becoming unavailable? Zero. So at RF=2, every node in your cluster becomes a catastrophic failure point. Lastly, please don’t put your cluster behind a load balancer. You will break the client-side smarts built into the driver and produce a lot of unnecessary overhead on your cluster.
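The quorum arithmetic behind that pop quiz is worth internalizing: a quorum is a majority of replicas, ⌊RF/2⌋ + 1. A small sketch:

```python
def quorum(rf: int) -> int:
    """Replicas that must respond to satisfy a QUORUM read or write."""
    return rf // 2 + 1

def tolerable_failures(rf: int) -> int:
    """Replicas that can be down while quorum is still achievable."""
    return rf - quorum(rf)

# RF=2: quorum is 2, so losing a single replica makes data unavailable
assert quorum(2) == 2 and tolerable_failures(2) == 0
# RF=3: quorum is 2, so one replica of each key may fail safely
assert quorum(3) == 2 and tolerable_failures(3) == 1
```

This is why RF=2 turns every node into a catastrophic failure point, while RF=3 tolerates a single replica loss per key.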
With that out of the way, let’s talk about how we build an HA topology.
As I’m sure you’re aware, Cassandra has a robust consistency model with a number of knobs to turn. There are plenty of great resources that cover this, so I’m going to leave you with just a few rules of thumb and let you explore further on your own. I always recommend that people start with LOCAL_QUORUM reads and writes, because this gives you a good balance of performance and availability, and you don’t have to deal with eventual consistency within a single data center. As a corollary, my suggestion is to experiment with eventual consistency (meaning something less than quorum) in a controlled environment. You’ll want to gain some operational experience handling eventually consistent behavior before deploying a mission critical app. Second, don’t use non-local consistency levels in multi-data center environments, because the behavior will be unpredictable. I’ll cover this situation in detail later.
If you follow the basic replication and consistency guidelines I just outlined, single node failures will be relatively straightforward to recover from. But what happens when someone trips over the power cord to your rack, or a switch fails? Fortunately Cassandra offers a mechanism to handle this, as long as you’re smart about your topology. Obviously if you put all your nodes in a single rack, you’re kind of on your own, so don’t do that! Assuming you have multiple racks, you can leverage the rack awareness feature, which places replicas in different racks. However, I would advise against using the RackInferringSnitch, as it makes assumptions about your network configuration that may not always hold.
Let’s look at how rack awareness works. Assuming you have two racks, A and B, Cassandra will ensure that the three replicas of each key are distributed across racks. This means you’ll have at least one available replica even if an entire rack is down. In this case, if rack B goes down, your application will have to support reading at CL ONE if you want to continue to serve that data.
To set this up with the GossipingPropertyFileSnitch, you’ll need to add a cassandra-rackdc.properties file to the config directory, where you’ll specify which data center and rack the node belongs to. This information is automatically gossiped to the rest of the cluster, so there’s no need to keep files in synch as with the legacy PropertyFileSnitch.
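For illustration, a minimal cassandra-rackdc.properties might look like this (the data center and rack names here are made up; use your own):

```properties
# Gossiped to the rest of the cluster by GossipingPropertyFileSnitch
dc=us-1
rack=rack-a
```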
Alternatively, if you’re using a cloud snitch, you can accomplish the same thing by locating your nodes in different availability zones. The cloud snitches will map the region to a data center and the availability zone to a rack. Just as with physical racks, it’s important to evenly distribute your nodes across zones if you want this to work properly.
Once you’ve improved local availability, it’s likely that you’ll want or need to expand geographically. There are a variety of reasons for this, such as disaster recovery, failover, and bringing the data closer to your users. Cassandra handles this multi-DC replication automatically through the keyspace definition. In my example here, I have a data center in the US, which we’re calling us-1, and one in Europe (but not England), which we’re calling eu-1.
The setup for this is straightforward using the “with replication” clause on the create keyspace CQL command. You can specify a list of data center names with the corresponding number of replicas you want maintained in each.
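A keyspace definition along these lines, using the us-1 and eu-1 data centers from the example (the keyspace name and replica counts are assumptions), would look like:

```sql
CREATE KEYSPACE my_keyspace
WITH replication = {
  'class': 'NetworkTopologyStrategy',
  'us-1': 3,   -- three replicas maintained in the US data center
  'eu-1': 3    -- three replicas maintained in the European data center
};
```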
One important question when it comes to multi-DC operations is what sort of consistency guarantees you get, again assuming local_quorum reads and writes.
I’ve already established that within a given DC, local_quorum gives you full consistency,
But what guarantee do you get between data centers?
The answer is eventual consistency. This is an extremely important point when designing your application, and it brings me to a closely related topic: client-side routing.
At the risk of stating the obvious, the ideal scenario is to have each client app only talk to local Cassandra nodes using a local consistency level.
This is the right approach, but it’s surprisingly easy to mess this up. I’ve seen this simple rule break down due to misunderstanding about the relationship between consistency level and client load balancing policy.
The breakdown often comes from failing to set the consistency level to a local variant. This slide illustrates what happens when you don’t run a local consistency level.
Don’t do this. You end up with traffic running all over the place, because Cassandra is trying to check replicas in the remote data center to satisfy your requested consistency guarantee. This can also happen if you give your app a list of all nodes in your cluster. So make sure you explicitly set a local consistency level, and make sure your client is only connecting to local nodes.
If you want to guarantee that traffic from your app is routed to the right node in the local DC, you’ll want to leverage the DCAwareRoundRobinPolicy, wrapped by the TokenAwarePolicy. The good news is this is the default configuration for the Datastax driver, but there is still potential for problems when relying on the default. If you don’t explicitly specify the DC, it will be chosen automatically using the provided seed list and host distance. We have run into issues where a non-local node was accidentally included in the seed list, which of course caused the driver to learn about other nodes and begin directing traffic to those nodes.
To solve this, obtain your local DC using a local environment configuration, then explicitly specify it using the withLocalDc option, as I’ve shown here. This is essentially a fail-safe against a non-local node getting inadvertently added to your seed list.
So how can you handle the failure of an entire DC? First, assuming you plan to fail over to another DC, please make sure your backup DC can handle the extra load. Trying to add capacity on the fly is unwise, as you’ll be introducing bootstrap overhead as well as the additional traffic. This is very likely to result in failure of your backup data center as well! Second, try to limit updates, as they can cause consistency issues when you try to bring the downed data center back online. Many applications will have a read-only failure mode, which can be significantly better than being down altogether. Lastly, be very careful when designing your retry logic. Make sure to isolate the retries to a single point in the stack, so you don’t end up bringing your app down due to your own retry explosion.
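One way to isolate retries to a single point in the stack is a single wrapper with bounded attempts and exponential backoff. A rough sketch in Python (the names and limits are illustrative, not a prescription):

```python
import time

def with_retries(operation, max_attempts=3, base_delay=0.1):
    """Single choke point for retries: bounded attempts, exponential backoff.

    Keeping all retry logic in one place prevents nested retries at multiple
    layers from multiplying into a self-inflicted DDoS during an outage.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # budget exhausted; surface the failure
            time.sleep(base_delay * (2 ** attempt))

# Demo: an operation that fails twice, then succeeds
calls = []
def flaky():
    calls.append(1)
    if len(calls) < 3:
        raise TimeoutError("node overloaded")
    return "ok"

assert with_retries(flaky, base_delay=0.01) == "ok"
assert len(calls) == 3  # bounded: never more than max_attempts calls
```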
To recap the lessons learned on topology, make sure you’re leveraging rack awareness, use local quorum for full local consistency, run incremental repairs to maintain inter-DC consistency, explicitly set the local DC in your app, and create a plan to handle the failure of a DC.
Now let’s move on to one of the most critical aspects of availability, and frankly the one that trips up most people. It’s easy to become lulled by the familiarity of the CQL syntax, but you really need to pay close attention to what Cassandra is doing with your data. Otherwise you’ll almost certainly run into performance and availability problems.
I’ll begin with a quick primer, though I’m sure many of you know this stuff already. But these are critical points, so a quick recap is in order just in case. First, Cassandra is a distributed hash table, and the partition key determines where data lives in the cluster—specifically which nodes contain replicas. Data for a given partition is sorted based on the clustering column values using the natural sort order of the type. It follows that queries resulting in column range scans are efficient, because they leverage this natural sorting.
Lastly, all Cassandra writes are immutable. Inserts and updates are really the same operation, and deletes create new records called tombstones that shadow the old values. Old data hangs around following an update, and compaction has to reconcile all of this to avoid holding onto a bunch of garbage and to keep reads efficient.
So why do we even care about these details? Obviously bad performance results in down time. Maybe less obviously, bad data models can result in significant compaction overhead, which can cause compaction to lag. Lagging compaction is a serious operations problem, especially if it’s allowed to continue undetected for too long. Also, some models and patterns have significant and inherent availability implications.
A couple of general do’s and don’ts. First some rules of thumb: Choose your partition key carefully, such that you get even distribution across the cluster. And unlike your favorite third normal form data model, you’ll want to model based on your most common read patterns. To help accomplish this, you’ll want to denormalize your data. Collections and the new materialized view feature are valuable tools that can help you accomplish this. And this last point could be considered the unifying theory for Cassandra data modeling: always run single partition range queries. If you’re unsure what constitutes a range query, there are a number of excellent resources available to explain this, including a talk I did at a past summit called CQL under the hood.
Now for a list of don’ts. First off, avoid models that result in hot spots in either data load or traffic patterns. A hot spot is simply an unusually large amount of data being written to or read from a single partition key. Secondly, if you find yourself building foreign key style relationships, you need to think differently about the problem. Relational models do not translate well to the Cassandra paradigm. A corollary principle is to avoid joining data on the application side, unless the join table is just a few rows that you can cache in memory. Next, don’t run queries that require many nodes to answer. I’ll cover a couple of these cases in a minute. And lastly, batches are not meant for grouping unrelated writes. In a single-node relational database, batching can be very efficient for loading large amounts of data, but again, this does not translate to Cassandra. If you need to do this, there’s a bulk loader Java API that can be leveraged for this purpose.
Now let’s examine a few problem cases and talk about how they can be addressed. Case number one looks innocuous. You have a contacts table, and you want to retrieve a set of them by ID. So you use an IN clause to filter your results. Why would this be an issue?
Let’s look at what’s happening here: The client issues the request, which will be routed to a coordinator based on one of the keys in the list. Assuming a quorum read, the coordinator will have to find two replicas of every key you asked for, resulting in four out of six nodes participating in the query.
Now suppose we lose two nodes. Cassandra can satisfy quorum for keys 1, 3, and 5, but there aren’t enough available replicas to return the query for 7. Because the keys are grouped using the IN clause, the entire query will fail.
There are two potential solutions to this problem. Option one is to just throw caution to the wind and do it anyway, then in the failure case you can fall back to option two, which is to run parallel queries for each key. If you do this, you are able to return any available results, which may be better than nothing at all. In addition, you can choose to reduce your consistency level to ONE for any failed keys, so you’re effectively taking a best effort approach to returning the latest data. This approach also allows the client to more effectively leverage token awareness, so the coordinator is doing less work.
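The parallel-queries option can be sketched as follows, in Python with a stand-in `query_one` function in place of a real driver call (the function name and the simulated failure are assumptions for illustration):

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_keys(query_one, keys):
    """Issue one query per key in parallel instead of a single IN clause.

    `query_one(key, cl)` stands in for your driver call and raises when a
    key's replicas can't satisfy the requested consistency level. Failed
    keys are retried at CL ONE (best effort), and whatever is available
    is returned rather than failing the whole request.
    """
    def attempt(key):
        try:
            return key, query_one(key, "LOCAL_QUORUM")
        except Exception:
            try:
                return key, query_one(key, "ONE")  # best-effort fallback
            except Exception:
                return key, None                   # truly unavailable
    with ThreadPoolExecutor(max_workers=len(keys)) as pool:
        return dict(pool.map(attempt, keys))

# Simulated cluster where key 7 has lost quorum but one replica still answers
def query_one(key, cl):
    if key == 7 and cl == "LOCAL_QUORUM":
        raise RuntimeError("Not enough replicas available")
    return {"id": key}

results = fetch_keys(query_one, [1, 3, 5, 7])
assert results[7] == {"id": 7}  # recovered at CL ONE instead of failing
assert all(results[k] == {"id": k} for k in (1, 3, 5))
```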
Even after years of warning, our next case seems to stick around like a lingering cold. Revisiting the contacts table, let’s say you have a field called birth year that you want to use to filter your results. So you do what any good relational database architect would do: you create an index on that field.
But this is a bad plan, because index entries are stored alongside the source table. Architecturally this is a sound strategy, but it means that you have to read the index across the cluster to find out which nodes contain data that matches the value you’re querying. This pattern does not scale well, and is prone to availability issues just like the IN clause.
As in the previous example, if you lose two nodes you can no longer satisfy quorum for the query.
There are three potential alternatives to secondary indexes. One option is to build your own index, which may work well if the column you’re indexing has reasonable distribution. Birth year would likely be such a column, but a boolean value would not. The disadvantage is that your app has to maintain the index and deal with potential consistency and orphan issues. But this is a tried and true approach, and may be the right solution for some cases. Option two is to use a materialized view, which results in essentially the same underlying result as option one, but has the advantage that Cassandra maintains it for you—thus alleviating the burden on the application. The downside is that you’ll need 3.x to get this feature. The last option is to run it anyway, which may be ok if you have only small amounts of data that you’re returning. Indexes can also be a good pairing with analytics frameworks, where good parallelism is important. In this case, the distribution of the query across the cluster is actually a positive attribute.
For our last case, let’s assume you want to capture sensor data, which is inherently time series, and you want to be able to read the latest few values. The obvious model would look like this, where you partition by sensorID and then group by timestamp. You can add the clustering order by clause to reverse the sort order, such that it’s stored with the latest value first. This model allows you to query a given sensor and obtain the readings in descending order by timestamp.
But this model suffers from one of the most insidious of Cassandra evils—the unbounded partition, which is the worst form of wide row problem. Eventually your partition sizes will become unsustainable, which will result in serious problems with compaction and streaming at the very least. Unfortunately, if you find yourself in this situation you will also realize that there’s no way to efficiently archive off old data, because doing so would create a significant number of tombstones and therefore compound the problem.
The solution is to create a compound partition key, using sensorID plus time bucket to create a boundary for growth, where the time bucket can be known at query time. One important trick here is to choose your time bucket such that you only need at most two buckets to satisfy your most common queries. The reason is to limit the number of nodes that will have to be consulted to answer the query. It’s also worth noting that this concept where you add a known value to your partition key to limit growth and provide better distribution will generalize to other use cases as well.
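The bucketing arithmetic can be sketched in Python (a one-day bucket size is an assumption here; pick yours so that common queries touch at most two buckets):

```python
BUCKET_SECONDS = 24 * 60 * 60  # assumption: one-day buckets

def time_bucket(ts: int) -> int:
    """Derive the partition's time_bucket from an epoch-seconds timestamp."""
    return ts // BUCKET_SECONDS

def buckets_for_range(start_ts: int, end_ts: int):
    """Buckets (hence partitions) a query must touch for a time range.

    Keep this at one or two for your most common reads, to limit the
    number of nodes consulted per query.
    """
    return list(range(time_bucket(start_ts), time_bucket(end_ts) + 1))

day = BUCKET_SECONDS
assert time_bucket(0) == 0 and time_bucket(day) == 1
# a "latest few hours" query spans at most two day-sized buckets
assert len(buckets_for_range(day - 3600, day + 3600)) == 2
```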
Now to the last subject I want to cover today. Monitoring is a key part of any HA strategy, for reasons that I hope are obvious. What may be less obvious is exactly what you should be looking for to determine the health of the system. While I cannot hope to cover every possible scenario, I’m going to touch on some of the more critical problem areas that may be less obvious.
First a few basic concepts. Before you can collect anything you’ll need to enable remote JMX in the cassandra-env.sh script as well as on the JVM itself. Then you’ll need some way to collect the stats, using jmxtrans, collectd, or something similar. For simple, single-node diagnostics, nodetool provides a convenient interface to some of the more common questions you want to answer. Beyond that, Cassandra exposes just about every stat you can imagine through its JMX interface. So let’s look at a few things you might want to watch.
One very important metric to keep an eye on is the state of the thread pools. Cassandra uses a staged event driven architecture, or SEDA, that’s essentially a set of message queues feeding thread pools where the workers are dequeuing the messages. Nodetool tpstats gives you a view into what’s going on with the pools. The important thing to look for here is a buildup of pending messages, in this case on the mutation pool. As you may have guessed, this indicates that writes to disk aren’t keeping up with the queued requests. This doesn’t tell you why that’s happening, but if you’re also monitoring related areas like disk I/O, you should be able to quickly diagnose the problem. The point is to catch this early so you can resolve the situation before it gets out of hand.
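A monitoring hook for this might watch the Pending column of tpstats output. A rough Python sketch (the column layout is assumed from the sample on the slide, and the threshold is illustrative):

```python
def pending_backlog(tpstats_output: str, threshold: int = 100):
    """Flag thread pools whose Pending count exceeds a threshold.

    Assumes `nodetool tpstats`-style columns: name, active, pending,
    completed, blocked, all-time blocked. Returns {pool: pending}.
    """
    alerts = {}
    for line in tpstats_output.strip().splitlines()[1:]:  # skip header row
        parts = line.split()
        name, pending = parts[0], int(parts[2])
        if pending > threshold:
            alerts[name] = pending
    return alerts

sample = """\
Pool Name            Active Pending  Completed Blocked All time blocked
ReadStage            0      0        103       0       0
MutationStage        0      13234794 0         0       0
"""
# Writes aren't keeping up with queued requests on this node
assert pending_backlog(sample) == {"MutationStage": 13234794}
```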
Another very common but largely misunderstood problem relates to compaction, specifically when it gets behind. Lagging compaction can cause significant performance issues, especially with reads. Because it’s responsible for maintaining sane key distribution across SSTables, you can end up with reads that span many tables; in the worst case, read latencies spiral out of control and queries eventually time out. To diagnose compaction issues, use nodetool tablestats and compactionstats.
The metric you’re looking for will depend on the compaction strategy you’re using. For Size-tiered, keep an eye on SSTable count, which should stay within a reasonable margin. If your monitoring system shows a consistent growth in SSTables, you’ll need to take action to avoid the situation getting out of hand.
When leveled compaction gets behind the curve, you’ll start to see a buildup of SSTables in the lower levels, specifically in level 0. Since leveled compaction is designed to very quickly compact level 0 SSTables, you should never see more than a handful at time.
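A simple check for L0 buildup can parse the “SSTables in each level” value from nodetool tablestats. A sketch (the healthy maximum of 4 follows the earlier example; the counts are reported as actual/target):

```python
import re

def l0_backlog(level_counts: str, healthy_max: int = 4):
    """Given the 'SSTables in each level' value, e.g. '[50/4, 15/10, 5/100]',
    return the L0 SSTable count if it exceeds the healthy maximum, else None.
    """
    actuals = [int(m) for m in re.findall(r"(\d+)/", level_counts)]
    l0 = actuals[0]
    return l0 if l0 > healthy_max else None

assert l0_backlog("[50/4, 15/10, 5/100]") == 50  # compaction is lagging
assert l0_backlog("[2/4, 8/10, 50/100]") is None  # healthy
```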
Dealing with a lagging compaction situation involves a two-part solution: triage quickly to get to as stable a state as possible, then dig into the root cause. Start by making sure it’s a trend and not just a blip. This is where history is important, as you can’t make good decisions from a single data point. If you have the ability to do so, consider increasing compaction throughput. In some cases this may be all you need, as long as it recovers successfully and your cluster can handle it. If you’re running leveled compaction, which requires substantially more I/O than Size-tiered, you can often recover by temporarily switching to Size-tiered to catch up. Ultimately you’ll need to figure out what’s causing compaction to lag. Sometimes you can just turn up the throughput, but often there’s an underlying problem, such as poor disk performance, or perhaps you’re underprovisioned and need to add nodes. Either way, these guidelines can help you keep your system running, buying you time to get to the bottom of it.
As I mentioned earlier, one of the worst Cassandra problems I’ve personally experienced is related to wide rows. It really only takes one to completely ruin your day. Fundamentally, wide rows are a data model problem, so the fix is usually retooling your model—which is not usually quick or easy. This is why you’ll want to find out about it as soon as possible, so you have some runway to deal with it before it takes down your application. The key metric here is to watch the max partition bytes for each table. Make sure you don’t see unbounded growth, or a value that greatly exceeds the mean partition bytes.
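An early-warning check on partition sizes might look like this (the ratio and absolute limits are illustrative, not recommendations; tune them for your workload):

```python
def wide_partition_alerts(stats, ratio_limit=10.0, abs_limit=100 * 2**20):
    """Flag tables whose max partition size dwarfs the mean or exceeds a cap.

    `stats` maps table name -> (max_partition_bytes, mean_partition_bytes),
    as reported by nodetool tablestats / JMX.
    """
    alerts = []
    for table, (max_b, mean_b) in stats.items():
        if max_b > abs_limit or (mean_b > 0 and max_b / mean_b > ratio_limit):
            alerts.append(table)
    return alerts

stats = {
    "contacts": (2 * 2**20, 1 * 2**20),           # max is 2x mean: fine
    "sensor_readings": (900 * 2**20, 4 * 2**20),  # unbounded growth: alert
}
assert wide_partition_alerts(stats) == ["sensor_readings"]
```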
Once you detect a problem, you can often use the nodetool toppartitions tool to sample your traffic and get a list of candidate partition keys. This works as long as the traffic pattern at the time of the sampling is indicative of the hotspot pattern. When you find a wide row, deal with it as soon as possible, or you may start seeing OOM issues with compaction, streaming, and reads.
I’ve covered a lot of territory, but there’s much more detail to this subject. If you’d like to learn more, there’s this really amazing new book that was just printed, and I’d shamelessly encourage you to get your very own copy today.
Thanks again for coming out to my talk today, and I’d love to answer any questions you may have.