In our product, we use Cassandra, a high-performance distributed database. It is open-source software, but using it is not cost-free: running your own clusters requires considerable expertise and regular maintenance.
In one of our environments, a 6-node Cassandra cluster was deployed. In the beginning, nothing about the cluster was worrying. After some time, however, it turned out that disk space usage kept increasing. The storage was resized a couple of times, from 500GB to 1TB, then from 1TB to 2TB, and finally from 2TB to 4TB. But resizing was no remedy at all, as the cluster started lagging; it only delayed the upcoming catastrophe. So we had to figure out what was really going on.
Tracking down the root cause
❯ nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN 192.168.1.115 2.36 TB 256 ? f2f370d6-c829-495d-9b1a-499e799f1916 rack1
UN 192.168.1.53 1.92 TB 256 ? 921e6c39-7f02-4114-8a6a-d8cd5afbd8b8 rack1
UN 192.168.1.52 1.24 TB 256 ? 4e9fdf72-4154-4ff2-a1d3-6413d815ec37 rack1
UN 192.168.1.114 1.29 TB 256 ? 65ceedce-ed5e-4730-b1b9-51c988682a55 rack1
UN 192.168.1.51 2.14 TB 256 ? a3d5f6de-e91d-4778-bcb2-086133c7a81b rack1
UN 192.168.1.116 2.31 TB 256 ? e28ed4f6-288b-4ceb-b737-70c70860245e rack1
Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless
The nodetool status output showed that some nodes had already grown beyond 2TB of data and were choking. It was only a matter of time before they reached 4TB.
❯ nodetool tpstats
Pool Name Active Pending Completed Blocked All time blocked
MutationStage 0 0 817585641 0 0
ReadStage 3 0 83673866 0 0
RequestResponseStage 0 0 629340424 0 0
ReadRepairStage 0 0 11325221 0 0
CounterMutationStage 0 0 0 0 0
MiscStage 0 0 0 0 0
HintedHandoff 0 0 300 0 0
GossipStage 0 0 3163813 0 0
CacheCleanupExecutor 0 0 0 0 0
InternalResponseStage 0 0 0 0 0
CommitLogArchiver 0 0 0 0 0
CompactionExecutor 2 39 123353 0 0
ValidationExecutor 0 0 0 0 0
MigrationStage 0 0 0 0 0
AntiEntropyStage 0 0 0 0 0
PendingRangeCalculator 0 0 13 0 0
Sampler 0 0 0 0 0
MemtableFlushWriter 0 0 70928 0 0
MemtablePostFlush 0 0 111004 0 0
MemtableReclaimMemory 0 0 70928 0 0
Native-Transport-Requests 2 0 1197812673 0 2561165
Message type Dropped
READ 470097
RANGE_SLICE 21
_TRACE 0
MUTATION 7627446
COUNTER_MUTATION 0
BINARY 0
REQUEST_RESPONSE 27253
PAGED_RANGE 837
READ_REPAIR 11969
The choking was also visible in nodetool tpstats: over 2.5 million Native-Transport-Requests had been blocked, and hundreds of thousands of READ and millions of MUTATION messages had been dropped.
The next step was to look closely at the keyspaces and their column families, known simply as tables. Issuing nodetool cfstats gave the insight: the read latency in the data keyspace was suspiciously high, with reads taking 265ms on average.
❯ nodetool cfstats
Keyspace: data
Read Count: 14817739
Read Latency: 265.7262566112144 ms.
Write Count: 270622271
Write Latency: 0.9290800410842758 ms.
Pending Flushes: 0
...
Table: results
SSTable count: 3358
Space used (live): 2066615188640
Space used (total): 2066615188640
Space used by snapshots (total): 0
Off heap memory used (total): 2296918130
SSTable Compression Ratio: 0.14971098992452106
Number of keys (estimate): 8655588
Memtable cell count: 28102
Memtable data size: 245565603
Memtable off heap memory used: 239301550
Memtable switch count: 9079
Local read count: 14817739
Local read latency: 265.727 ms
Local write count: 180389361
Local write latency: 1.142 ms
Pending flushes: 0
Bloom filter false positives: 899066
Bloom filter false ratio: 0.00660
Bloom filter space used: 202764152
Bloom filter off heap memory used: 202737288
Index summary off heap memory used: 26621412
Compression metadata off heap memory used: 1828257880
Compacted partition minimum bytes: 125
Compacted partition maximum bytes: 43388628
Compacted partition mean bytes: 98167
Average live cells per slice (last five minutes): 1.9702823082522913
Maximum live cells per slice (last five minutes): 2.0
Average tombstones per slice (last five minutes): 1.3038426442792657E-4
Maximum tombstones per slice (last five minutes): 2.0
Looking further, one table was really worth a closer look: results. This table consisted of 3358 SSTables, quite a large number!
Printing nodetool cfhistograms for the results table showed that up to 60 SSTables had to be read to return data. A single query to Cassandra could therefore trigger reads from 60 files on disk, which hurt performance heavily.
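For reference, that per-read SSTable count comes from a command along these lines, using the keyspace and table names reported by nodetool cfstats above:
❯ nodetool cfhistograms data results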
This pointed us towards Cassandra's compaction process, so the next stop was nodetool compactionstats.
❯ nodetool compactionstats
pending tasks: 86
compaction type keyspace table completed total unit progress
Compaction log commitlog 624030592 5659451776 bytes 11.03%
Compaction data results 2920702929876 4813653709033 bytes 60.68%
Active compaction remaining time : 31h25m28s
The statistics showed that about 31 more hours were needed to finish the ongoing compaction of the results table, which was only at roughly 60% progress; a single full compaction of the table would therefore take around 77 hours. The compaction process was clearly falling behind.
The next thing to check was the virtual memory statistics, to find out the block I/O throughput.
❯ vmstat 10 1000
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
3 2 13699692 301824 38448 14898784 14 11 31540 2240 0 0 17 6 57 20 0
5 9 13715456 281188 35816 14990932 11 1586 668936 54885 138226 104549 19 9 41 31 0
4 1 13725484 289156 36064 14992604 14 1025 513021 30527 114013 104072 15 7 63 15 0
3 3 13742652 292892 33252 15011716 17 1696 718570 58119 140681 105421 18 10 38 35 0
3 1 13754044 294876 33308 15021608 11 1182 564726 42929 122446 101715 18 8 50 25 0
13 13 13764612 291908 33516 15029748 11 1019 541156 43989 123884 102516 18 8 48 27 0
7 13 13774820 293080 35236 15035136 133 1279 758554 36706 125018 99450 19 9 40 31 0
8 5 13785004 303612 35360 15013740 16 940 548025 48032 125423 99228 18 8 36 38 0
2 6 13792172 294076 33080 15061304 11 743 466678 39152 113678 98034 16 7 30 47 0
6 7 13800876 294300 33996 15061108 19 836 495162 52322 120387 100311 18 8 42 33 0
4 1 13812444 299568 30396 15077220 16 1189 564618 51925 139990 103598 21 9 48 22 0
Taking the maximum from the blocks-out (bo) column, i.e. 58119 blocks per second, and assuming 1024 bytes per block, gave us an approximate write throughput of 57MB per second.
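As a sanity check, that back-of-the-envelope conversion (assuming vmstat reports bo in 1024-byte blocks per second) can be reproduced like this:
❯ awk 'BEGIN { printf "%.1f MB/s\n", 58119 * 1024 / 1024 / 1024 }'
56.8 MB/s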
The Remedy
Cassandra has a compaction_throughput_mb_per_sec configuration parameter that throttles compaction.
The cassandra.yaml configuration file
compaction_throughput_mb_per_sec (Default: 16) Throttles compaction to the specified total throughput across the node. The faster you insert data, the faster you need to compact in order to keep the SSTable count down. The recommended value is 16 to 32 times the rate of write throughput (in Mb/second). Setting the value to 0 disables compaction throttling.
It turned out that the root cause of the issue was this parameter being left at its default value: compaction could not keep up with the incoming writes, so SSTables kept piling up.
The remedy to the issue was to increase compaction_throughput_mb_per_sec to 196.
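A change like this can also be applied to a running node with nodetool, without waiting for a restart. A minimal sketch, assuming a Cassandra version whose nodetool ships the setcompactionthroughput and getcompactionthroughput commands:
❯ nodetool setcompactionthroughput 196    # new limit in MB/s, effective immediately
❯ nodetool getcompactionthroughput        # confirm the node picked up the new value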
concurrent_compactors (Default: Smaller of number of disks or number of cores, with a minimum of 2 and a maximum of 8 per CPU core) Sets the number of concurrent compaction processes allowed to run simultaneously on a node, not including validation compactions for anti-entropy repair. Simultaneous compactions help preserve read performance in a mixed read-write workload by mitigating the tendency of small SSTables to accumulate during a single long-running compaction. If your data directories are backed by SSD, increase this value to the number of cores. If compaction is running too slowly or too fast, adjust compaction_throughput_mb_per_sec first.
At the same time, we increased the number of concurrent_compactors from 2 to 6.
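To make both settings survive restarts, they also have to be changed in cassandra.yaml on every node; concurrent_compactors in particular requires a node restart to take effect. A rough sketch of the rollout, assuming the config lives at /etc/cassandra/cassandra.yaml (the path and sed patterns are illustrative only):
❯ sed -i -E 's/^#?\s*compaction_throughput_mb_per_sec:.*/compaction_throughput_mb_per_sec: 196/' /etc/cassandra/cassandra.yaml
❯ sed -i -E 's/^#?\s*concurrent_compactors:.*/concurrent_compactors: 6/' /etc/cassandra/cassandra.yaml
After editing the file on each node, a rolling restart picks up the new concurrent_compactors value.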
It took a while for the cluster to go back to its normal state, but finally it stopped lagging and disk usage dropped to around 500GB per node!