
STAR-1872: Parallelize UCS compactions per output shard #1342

Open · wants to merge 3 commits into base: main

Conversation


@blambov blambov commented Oct 9, 2024

This splits compactions that are to produce more than one
output sstable into tasks that can execute in parallel.
Such tasks share a transaction and have combined progress
and observer. Because we cannot mark parts of an sstable
as unneeded, the transaction is only applied when all
tasks have succeeded. This also means that early open
is not supported for such tasks.

The parallelization also takes into account thread reservations,
reducing the parallelism to the number of available threads
for its level. The new functionality is turned on by default.

Major compactions will apply the same mechanism to
parallelize the operation. They will only split on pre-
existing boundary points if they are also boundary
points for the current UCS configuration. This is done
to ensure that major compactions can re-shard data when
the configuration is changed. If pre-existing boundaries
match the current state, a major compaction will still be
broken into multiple operations to reduce the space
overhead of the operation.

Also:

  • Introduces a parallelism parameter to major compactions
    (nodetool compact -j <threads>, defaulting to half the
    compaction threads) to avoid stopping all other compaction
    for the duration.

  • Changes SSTable expiration to be done in a separate
    getNextBackgroundCompactions round to improve the
    efficiency of expiration (separate task can run quickly
    and remove the relevant sstables without waiting for
    a compaction to end).

  • Applies small-partition-count correction in
    ShardManager.calculateCombinedDensity.
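As an illustration of the thread-reservation cap described above, here is a minimal, hypothetical sketch (illustrative names, not the actual Cassandra API): parallelism is one task per output shard, clamped to the threads available for the operation's level, with shards distributed round-robin across tasks.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of capping per-operation parallelism by the
// threads reserved for the compaction's level, as described above.
public class ShardParallelism
{
    /** Number of parallel tasks to run: one per output shard,
     *  but never more than the threads available for this level. */
    public static int permittedParallelism(int outputShards, int availableThreadsForLevel)
    {
        if (outputShards <= 0 || availableThreadsForLevel <= 0)
            return 1; // always run at least one task
        return Math.min(outputShards, availableThreadsForLevel);
    }

    /** Split shard indexes into groups, one group per parallel task. */
    public static List<List<Integer>> assignShards(int outputShards, int parallelism)
    {
        List<List<Integer>> groups = new ArrayList<>();
        for (int i = 0; i < parallelism; i++)
            groups.add(new ArrayList<>());
        for (int shard = 0; shard < outputShards; shard++)
            groups.get(shard % parallelism).add(shard); // round-robin assignment
        return groups;
    }
}
```

With 8 output shards but only 4 threads reserved for the level, this would run 4 tasks of 2 shards each.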

return tasks;
}

private <T> List<T> splitSSTablesInShards(Collection<SSTableReader> sstables,


what about making this method static and writing specific unit tests to cover all of the cases?
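To illustrate the suggestion: a static, side-effect-free splitting helper is trivial to unit test in isolation. A hypothetical sketch (tokens reduced to plain longs; not the actual `splitSSTablesInShards` signature):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the reviewer's suggestion: a static
// helper with no dependencies on ColumnFamilyStore state can be
// covered directly by unit tests.
public class ShardSplitter
{
    /** Assign each token (modeled as a long) to the shard whose range
     *  contains it; boundaries are sorted, and each boundary is the
     *  exclusive end of the shard before it. */
    public static List<List<Long>> splitByBoundaries(List<Long> tokens, long[] boundaries)
    {
        List<List<Long>> shards = new ArrayList<>();
        for (int i = 0; i <= boundaries.length; i++)
            shards.add(new ArrayList<>());
        for (long token : tokens)
        {
            int shard = 0;
            while (shard < boundaries.length && token >= boundaries[shard])
                shard++; // advance to the shard containing this token
            shards.get(shard).add(token);
        }
        return shards;
    }
}
```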


blambov commented Oct 16, 2024

The PR is not yet ready for review.

@blambov blambov force-pushed the STAR-1872 branch 3 times, most recently from 6cc862f to b6295c0 on October 29, 2024 at 12:38
@blambov
Copy link
Author

blambov commented Oct 29, 2024

The patch is now ready for review.


sonarcloud bot commented Nov 8, 2024

@cassci-bot

❌ Build ds-cassandra-pr-gate/PR-1342 rejected by Butler


13 new test failure(s) in 16 builds
See build details here


Found 13 new test failures

Test Explanation Branch history Upstream history
...47,483,647 Modifier 0.5 Levels 4 Compactors 15] flaky 🔵🔴🔵
...,147,483,647 Modifier 1 Levels 3 Compactors 30] flaky 🔵🔴🔵
...,147,483,647 Modifier 1 Levels 4 Compactors 15] flaky 🔵🔴🔵
...oadCommitLogAndSSTablesWithDroppedColumnTestDSE regression 🔴🔴🔵🔵 🔵🔵🔵🔵🔵🔵🔵
...ToolEnableDisableBinaryTest.testMaybeChangeDocs flaky 🔵🔴🔵🔵🔵🔵🔴 🔵🔵🔵🔵🔵🔵🔵
...positePartitionKeyDataModel{primaryKey=p1, p2}] regression 🔴🔴🔴🔴🔴🔴🔴 🔵🔵🔵🔵🔵🔵🔵
...positePartitionKeyDataModel{primaryKey=p1, p2}] failing 🔴🔴🔴🔴🔴🔴🔴 🔵🔵🔵🔵🔵🔵🔵
...positePartitionKeyDataModel{primaryKey=p1, p2}] regression 🔴🔵🔵🔵 🔵🔵🔵🔵🔵🔵🔵
...positePartitionKeyDataModel{primaryKey=p1, p2}] regression 🔴🔴🔵🔵🔴🔵🔴 🔵🔵🔵🔵🔵🔵🔵
...positePartitionKeyDataModel{primaryKey=p1, p2}] failing 🔴🔴🔴🔴🔴🔴🔴 🔵🔵🔵🔵🔵🔵🔵
...i.s.c.VectorSiftSmallTest.testMultiSegmentBuild failing 🔴🔴🔴🔴🔴🔴🔴 🔵🔵🔵🔵🔵🔵🔵
...st.testTTLOverwriteHasCorrectOnDiskRowCount[dc] regression 🔴🔵🔵🔵 🔵🔵🔵🔵🔵🔵🔵
o.a.c.u.b.BinLogTest.testTruncationReleasesLogS... flaky 🔵🔴🔵🔴🔴🔵🔵 🔵🔵🔵🔵🔵🔵🔵

Found 99 known test failures

At this time the new parallelization mechanism is not taken
into account by the thread allocation scheme, and thus
some levels may take more resources than they should.
Because of this limitation (which should be fixed in the
near future), the new behaviour is off by default.

Also:
- Adds a flag to combine non-overlapping sets in major
  compactions to reshard data, as major compactions can
  now be executed as a parallelized operation.

- Changes SSTable expiration to be done in a separate
  getNextBackgroundCompactions round to improve the
  efficiency of expiration (separate task can run quickly
  and remove the relevant sstables without waiting for
  a compaction to end).

- Applies small-partition-count correction in
  ShardManager.calculateCombinedDensity.

Change parallelize_output_shards default to true.

@eolivelli eolivelli left a comment


I have reviewed the code and left some smaller comments.
I am not very familiar with this code, but I cannot find anything wrong.
All of my previous comments have been addressed.

I am testing the patch on CNDB https://github.com/riptano/cndb/pull/11690 to see if there is something that breaks.

}
if (nonEmptyTasks > 1)
logger.info("Major compaction will not result in a single sstable - repaired and unrepaired data is kept separate and compaction runs per data_file_directory.");
logger.info("Major compaction will not result in a single sstable.");


What about adding a reference to the CFS name? In a live system this log may happen for multiple tables.

Member

Agreed, I could see something from CFS being a valuable addition.

@@ -18,6 +18,8 @@

package org.apache.cassandra.db.compaction;

import java.util.ArrayList;


nit: unused new imports


import static org.mockito.Mockito.*;

/// Tests mostly written by Copilot.


nit: remove this reference to Copilot? (here and in other places)
I am not sure about licensing issues.

@@ -382,6 +385,7 @@ private CompletableFuture<Void> startTask(ColumnFamilyStore cfs, AbstractCompact
{
ongoingCompactions.decrementAndGet();
logger.debug("Background compaction task for {} was rejected", cfs);
task.rejected(ex);
Member

Are we relying on this to call the observer with the error? I see a warning in my IDE because we're ignoring the resulting throwable. Do we want to return a failed future on the next line? It's minor, but a comment would make that clearer.

Comment on lines +54 to +59
public synchronized void addSubtask(CompactionProgress progress)
{
if (!sources.isEmpty())
assert sources.get(0).operationId() == progress.operationId();
sources.add(progress);
}
Member

I see that this method is synchronized, but none of the others are. How are we ensuring safe publication of the sources list after it is modified?
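One way to address the publication concern (a sketch only, with illustrative names rather than the actual `CompactionProgress` API): back the list with a `CopyOnWriteArrayList`, which makes each addition safely visible to unsynchronized readers and gives iterators a consistent snapshot.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Sketch: CopyOnWriteArrayList publishes additions safely without
// requiring every reader to synchronize. Names are illustrative.
public class CombinedProgress
{
    private final List<Long> sources = new CopyOnWriteArrayList<>();

    public void addSubtask(long subtaskBytes)
    {
        sources.add(subtaskBytes); // safely published to concurrent readers
    }

    public long totalBytes()
    {
        long total = 0;
        for (long bytes : sources) // iterates a consistent snapshot
            total += bytes;
        return total;
    }
}
```

Copy-on-write is a reasonable fit here because subtask registration is rare compared to progress reads.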

this.endPosition = position.position; // 0 if end is before our first key.
else
{
assert false : "Range " + tokenRange + " end is before last sstable token " + sstable.last.getToken() + " but no position was found";
Member

Why is this only assert false as opposed to always throwing an IllegalArgumentException?

Author

Changed to AssertionError, because this is a case of broken sstable invariants rather than illegal input.
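For context on the distinction: a plain `assert` statement is a no-op unless the JVM runs with `-ea`, while an explicitly thrown `AssertionError` fires unconditionally. A small standalone demonstration (hypothetical method, not code from the patch):

```java
// Demonstrates that an explicitly thrown AssertionError fires
// regardless of whether assertions are enabled on the JVM,
// unlike a plain `assert` statement.
public class AssertDemo
{
    public static String check(boolean invariantHolds)
    {
        try
        {
            if (!invariantHolds)
                throw new AssertionError("broken sstable invariant");
            return "ok";
        }
        catch (AssertionError e)
        {
            return "caught: " + e.getMessage();
        }
    }
}
```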

transaction.finish();
transaction.prepareToCommit();
transaction.commit();
Member

What is the purpose of making this change? It looks like finish() calls these methods by default but could be overridden, so after a quick glance, it's not clear that this is a safe replacement.

Comment on lines +1498 to +1501
if (lowerChunkStart < lastEnd) // if regions include the same chunk, count it only once
lowerChunkStart = lastEnd;
total += upperChunkEnd - lowerChunkStart;
lastEnd = upperChunkEnd;
Member

Looks like Range.normalize ensures the ranges don't overlap. It seems like this code is being defensive in case that implementation changes. In that case, should we also add something like the following:

if (upperChunkEnd <= lastEnd)
    continue;

Member

Perhaps there is something about compression metadata that I'm not familiar with here?

Author

If we have two sections immediately (or close after) one another, they may fall in the same compression chunk. In that case the lower bound is its start and the upper its end. Without this adjustment the chunk would be counted twice.
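The accumulation being discussed can be sketched in isolation (illustrative standalone method, not the patch code): the clamp counts a compression chunk shared by adjacent sections only once, and the reviewer's suggested early skip handles a section that falls entirely inside already-counted chunks.

```java
// Sketch of overlap-free chunk accounting: starts[i]/ends[i] are the
// chunk-aligned byte offsets of section i, in ascending start order.
public class ChunkAccounting
{
    /** Total on-disk bytes, counting any chunk shared by two sections once. */
    public static long totalOnDiskBytes(long[] starts, long[] ends)
    {
        long total = 0;
        long lastEnd = 0;
        for (int i = 0; i < starts.length; i++)
        {
            long lowerChunkStart = starts[i];
            long upperChunkEnd = ends[i];
            if (upperChunkEnd <= lastEnd)
                continue; // section fully inside already-counted chunks
            if (lowerChunkStart < lastEnd) // sections share a chunk
                lowerChunkStart = lastEnd; // count the shared chunk only once
            total += upperChunkEnd - lowerChunkStart;
            lastEnd = upperChunkEnd;
        }
        return total;
    }
}
```

For two sections whose ranges map to [0, 100] and [50, 150], the naive sum would be 200 bytes; the clamp yields 150, counting the shared chunk once.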

Comment on lines +228 to +231
protected Set<SSTableReader> inputSSTables()
{
return transaction.originals();
}
Member

I see that not all transaction.originals() references in this class were replaced by inputSSTables(). Is that intentional? It seems like it might only be in methods that are not overridden in subclasses, but I worry that this might be fragile, assuming I understand it correctly.


public SSTableReader current(SSTableReader reader)
{
return mainTransaction.current(reader);
Member

Is there a reason we don't wrap this one in a synchronized block? In looking at how it is implemented in LifecycleTransaction, it seems like current(reader) might not be thread safe.

Comment on lines +205 to +212
/**
* @return The token range that the operation should compact. This is usually null, but if we have a parallelizable
* multi-task operation (see {@link UnifiedCompactionStrategy#createAndAddTasks}), it will specify a subrange.
*/
protected Range<Token> tokenRange()
{
return null;
}
Member

Does the InclusionMethod enum become ignored when using parallelizable compaction? This comment indicates we only compact sstables within a given range, which seems like it might map to the NONE method, is that right?

Member

I see that the range is generated by the ShardManager, which indicates that maybe this token range is independent of the InclusionMethod.

Author

The range is indeed independent of the InclusionMethod. The range here is a range inside the input sstables, and however we select these we still split the output at predetermined positions.

In other words the InclusionMethod is a feature of the compaction selection, while the range here is a feature of the output sharding. The two are generally independent (and should be, to be able to correctly act in case of changes in sharding or upgrade from a legacy strategy), even though the latter should work in a way that makes the former efficient.

@michaeljmarshall michaeljmarshall (Member) left a comment

I have completed my initial review. It looks generally good to me, though I am not sure I fully comprehend the nuance in the UnifiedCompactionStrategy class. I left several minor comments.

public List<Future<?>> submitMaximal(final ColumnFamilyStore cfStore,
final int gcBefore,
boolean splitOutput,
Integer parallelism,

Here and in callers, I feel an OptionalInt would be clearer/less error prone.

@@ -930,9 +950,15 @@ protected void runMayThrow()
Future<?> fut = executor.submitIfRunning(runnable, "maximal task");
if (!fut.isCancelled())
futures.add(fut);
else
{
Throwable error = task.rejected(null);

My reading of the code is that, because the argument is null, this will call the observers CompactionObserver#onCompleted method with isSuccess == true, which the javadoc says means "compaction finished without any exceptions". Given this case means the task was rejected (probably because of shutdown, but could theoretically be something else), that feels a bit dodgy to me. I'd have passed a RejectedExecutionException or something.
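The observer-side interpretation the comment refers to can be sketched as follows (illustrative method, not the actual `CompactionObserver` API): a null throwable reads as success downstream, so passing an explicit exception on rejection would make the failure visible.

```java
// Sketch of how a null error is interpreted as success by an
// observer-style callback, motivating the suggestion to pass an
// explicit exception on rejection. Names are illustrative.
public class RejectionDemo
{
    public static String onCompleted(Throwable error)
    {
        boolean isSuccess = error == null; // null reads as "finished without exceptions"
        return isSuccess ? "success" : "failed: " + error.getClass().getSimpleName();
    }
}
```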

*
* @param gcBefore throw away tombstones older than this
* @param permittedParallelism
* @param reshard

Nit: the reshard param does not exist (but splitOutput does and isn't listed). Of course, listing parameters without actual documentation is of limited interest in the first place :).

///
/// Subtasks may start and add themselves in any order. There may also be periods of time when all started tasks have
/// completed but there are new ones to still initiate. Because of this all parameters returned by this progress may
/// increase over time, including the total sizes and sstable lists.

This feels like a non-negligible issue at least for correct progress tracking. My understanding is that in practice the tasks that share on such object will execute completely sequentially, so if you wanted to get an idea of progress by looking at completed() / total(), then you will get something fairly off due to total() essentially lying to you.

I guess my question is, can't we register all the tasks first at the beginning (we create all tasks upfront anyway, and it's not like addSubtask triggers any action by itself), instead of adding task only when they start?
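The suggestion amounts to the following shape (a sketch with illustrative names, not the actual progress API): register every subtask's expected size at construction, so `total()` is stable from the start and only the completed amount grows as tasks run.

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch of registering all subtasks up front so total() never
// changes after construction, keeping completed()/total() meaningful.
public class UpfrontProgress
{
    private final long totalBytes;
    private final AtomicLong completedBytes = new AtomicLong();

    public UpfrontProgress(long[] subtaskSizes) // all tasks known at creation
    {
        long sum = 0;
        for (long size : subtaskSizes)
            sum += size;
        this.totalBytes = sum; // fixed for the lifetime of the operation
    }

    public void onSubtaskProgress(long bytes)
    {
        completedBytes.addAndGet(bytes); // only this side grows over time
    }

    public double ratio()
    {
        return totalBytes == 0 ? 1.0 : (double) completedBytes.get() / totalBytes;
    }
}
```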

else
return getNextBackgroundTasks(getNextCompactionAggregates(gcBefore), gcBefore);

// Always check for expired sstables (not just periodically) as expiration will save us unnecessary work.

I find that comment (the "always check ... (not just periodically)") a tad contradictory with the implementation, since getExpirationTasks starts by giving up if we have checked less than some predefined "period" ago. I'm not sure if the comment means something else and could maybe be rephrased, or ...?

if (committedOrAborted.get())
throw new IllegalStateException("Partial transaction already committed or aborted.");

throwIfAborted();

Nit: maybe rename to throwIfCompositeAborted; it reads oddly otherwise, as it looks like we already checked for aborted on the previous line.


blambov commented Nov 22, 2024

I need to port over some changes that came from the development of the OSS version.

I think it would be easier for all of us if we continue the review on the OSS version (CASSANDRA-18802 and CASSANDRA-20092) -- it is quite a bit simpler. We can then come back to the additions here, with any updates that come from that review.

Labels: none yet
Projects: none yet
5 participants