
HDDS-11209. Avoid insufficient EC pipelines in the container pipeline cache #6974

Merged
merged 6 commits into apache:master on Aug 21, 2024

Conversation

@whbing (Contributor) commented Jul 20, 2024

What changes were proposed in this pull request?

The OM's cached pipeline may not contain all of a container's datanode information (for example, some DNs may not report in time), which can be problematic for EC files. If the number of nodes is less than the number of EC data blocks, an error occurs when reading data:

There are insufficient datanodes to read the EC block

This error persists until the cache expires (after 6 hours), even if all DNs corresponding to the EC file are completely healthy.

So we should not cache EC pipelines with an incomplete set of datanodes.
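
For concreteness, a hedged illustration of the failure condition described above (RS-6-3 layout assumed; plain Java, not the actual Ozone client):

  import java.util.Set;

  // Illustration only: with RS-6-3, an EC block group has data indexes 1..6
  // and parity indexes 7..9. Per the description above, the read fails when
  // the cached pipeline lists fewer nodes than the data-block count.
  public final class InsufficientEcCacheDemo {
    public static void main(String[] args) {
      Set<Integer> cachedIndexes = Set.of(1, 2, 3, 4, 6); // only 5 of 9 nodes cached
      int dataNum = 6; // RS-6-3 has 6 data blocks
      if (cachedIndexes.size() < dataNum) {
        // Matches the client error quoted above.
        System.out.println("There are insufficient datanodes to read the EC block");
      }
    }
  }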

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-11209

How was this patch tested?

  • unit tests
  • manual tests

@whbing whbing marked this pull request as draft July 20, 2024 09:30
@errose28 (Contributor) commented:

It looks like the code will also cache incomplete Ratis pipelines, just not those that are empty. While the reads will technically still work, this could cause under-replicated containers to be sitting in the cache for a while, and may skew the read load for some time even after replication completes. Should we make the cache more strict to only keep pipelines that have all nodes present?

There are insufficient datanodes to read the EC block
This error persists until the cache expires (after 6 hours), even if all DNs corresponding to the EC file are completely healthy.

The client should force a cache refresh on read failures like this, or on corrupted data, since SCM has likely made new copies since the last cache refresh. It looks like that is not happening, which is another problem.

cc @duongkame

@whbing whbing marked this pull request as ready for review July 29, 2024 12:14
@whbing (Contributor, Author) commented Jul 29, 2024

Should we make the cache more strict to only keep pipelines that have all nodes present?

@errose28 Sorry for the late reply. I am inclined to make the cache more strict. At the very least, EC pipelines that do not have enough datanodes for reconstruction should not be cached, since they are unusable.
I believe the following approaches can be considered:

  1. datanodes < data num: uncache
     datanodes >= data num: cache
  2. datanodes < data num + parity num: uncache
     datanodes = data num + parity num: cache
  3. datanodes missing any data index: uncache
     datanodes contain all data indexes: cache

The current submission is based on approach 3.
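
As a rough illustration of approach 3 (using the Pipeline#getReplicaIndexes() map that appears later in this thread; a sketch only, not the merged patch):

  import java.util.Map;

  // Sketch of approach 3: an EC pipeline is cacheable only if its
  // replica-index map covers every data index 1..dataNum; parity indexes
  // (dataNum+1..dataNum+parityNum) may be absent.
  final class EcPipelineCheck {
    static boolean missesAnyDataIndex(Map<?, Integer> replicaIndexes, int dataNum) {
      for (int i = 1; i <= dataNum; i++) {
        if (!replicaIndexes.containsValue(i)) {
          return true; // a data index is missing: do not cache this pipeline
        }
      }
      return false; // all data indexes present: safe to cache
    }

    private EcPipelineCheck() { }
  }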


The client should force a cache refresh on read failures like this or corrupted data since SCM likely made new copies since the last cache refresh. It looks like that is not happening which is another problem.

Currently there is no retry on InsufficientLocationsException. I tested some cases, such as incorrect locations (expired cached locations), and all of them can be retried with forceFreshPipeline = true.
Perhaps for now we can just handle the InsufficientLocationsException case; of course, it could also be addressed in a separate ticket.
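
For illustration, a hedged sketch of that retry idea (method names, signatures, and the local exception stub are assumptions, not the actual Ozone client API):

  import java.io.IOException;

  // Hypothetical sketch only: on an InsufficientLocationsException, refresh
  // the pipeline cache once and retry, since SCM may have re-replicated
  // since the last cache refresh.
  abstract class RefreshingEcRead {
    byte[] read(long blockId) throws IOException {
      try {
        return readWithPipeline(blockId, /* forceFreshPipeline = */ false);
      } catch (InsufficientLocationsException e) {
        // Drop the stale cache entry and retry once with fresh locations.
        return readWithPipeline(blockId, /* forceFreshPipeline = */ true);
      }
    }

    // Stub standing in for the real client read path.
    abstract byte[] readWithPipeline(long blockId, boolean forceFreshPipeline)
        throws IOException;

    // Stand-in for the real exception type discussed above.
    static class InsufficientLocationsException extends IOException { }
  }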

@siddhantsangwan (Contributor) commented:

@whbing thanks for working on this.

datanodes missing any data index: uncache
datanodes contain all data indexes: cache

I like this approach. Will review the code soon.

The OM's cached pipeline may not contain all of a container's datanode information (for example, some DNs may not report in time), which can be problematic for EC files.

This makes me wonder whether the SCM safe-mode rule for EC containers should be stricter than just seeing 1 replica.

@errose28 (Contributor) commented:

This makes me wonder whether the SCM safe-mode rule for EC containers should be stricter than just seeing 1 replica.

@siddhantsangwan see #7008

@siddhantsangwan (Contributor) left a review comment:

Looks good so far. @whbing can you please add a test?

@whbing (Contributor, Author) commented Aug 4, 2024

please add a test?

@siddhantsangwan Added a test for refreshing the container when EC data indexes are missing.

@siddhantsangwan (Contributor) left a review comment:

@whbing It's worth checking for thread-safety issues here. I see that the cache is documented as thread safe, but the new map that we're using is not thread safe.

Comment on lines 133 to 134
Set<Integer> dataIndexes = EC_DATA_INDEXES.computeIfAbsent(d, k ->
IntStream.rangeClosed(1, d).boxed().collect(Collectors.toSet()));
@siddhantsangwan (Contributor) commented:

Here, we're writing to the map in a multi-threaded context. We should either use a concurrent hash map or change this logic such that it's read-only.
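
A sketch of the read-only alternative mentioned here (illustrative only; the bound of 20 is an assumption, echoing the "less than 20 indexes" expectation stated later in this thread, and the PR ultimately dropped the map entirely):

  import java.util.Collections;
  import java.util.HashMap;
  import java.util.Map;
  import java.util.Set;
  import java.util.stream.Collectors;
  import java.util.stream.IntStream;

  // Read-only option: precompute the index sets once for the small range of
  // plausible EC data counts, so the lookup path never writes to shared state.
  final class EcDataIndexSets {
    private static final Map<Integer, Set<Integer>> EC_DATA_INDEXES;

    static {
      Map<Integer, Set<Integer>> m = new HashMap<>();
      for (int d = 1; d <= 20; d++) {
        m.put(d, Collections.unmodifiableSet(
            IntStream.rangeClosed(1, d).boxed().collect(Collectors.toSet())));
      }
      EC_DATA_INDEXES = Collections.unmodifiableMap(m);
    }

    static Set<Integer> forDataCount(int dataNum) {
      return EC_DATA_INDEXES.get(dataNum); // read-only after class init
    }

    private EcDataIndexSets() { }
  }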

@whbing (Contributor, Author) replied:

@siddhantsangwan Thanks for the suggestion! It has been fixed to use a concurrent hash map.

Btw, I benchmarked several approaches (see below). Using the map seems unnecessary here, so I switched to an alternative approach.


// case1: pipeline replicaIndexes missing some data indexes, not cache
@siddhantsangwan (Contributor) commented:

Suggested change
// case1: pipeline replicaIndexes missing some data indexes, not cache
// case1: pipeline replicaIndexes missing some data indexes, should not cache

verify(mockScmContainerClient, times(2))
.getContainerWithPipelineBatch(newHashSet(CONTAINER_ID.get()));

// case2: pipeline replicaIndexes contain all data indexes, cache
@siddhantsangwan (Contributor) commented:

Suggested change
// case2: pipeline replicaIndexes contain all data indexes, cache
// case2: pipeline replicaIndexes contain all data indexes, should cache

@whbing (Contributor, Author) commented Aug 17, 2024

  private static final Map<Integer, Set<Integer>> EC_DATA_INDEXES = new ConcurrentHashMap<>();
  private static final int D = 6;
  private List<Integer> replicaIndexes;

  @Setup
  public void setup() {
    EC_DATA_INDEXES.clear();
    replicaIndexes = IntStream.rangeClosed(1, 9).boxed().collect(Collectors.toList());
    replicaIndexes.add(1);
    Collections.shuffle(replicaIndexes);
  }

  @Benchmark
  @Threads(500)
  public boolean testComputeIfAbsentAndContainsAll() {
    Set<Integer> dataIndexes = EC_DATA_INDEXES.computeIfAbsent(D, k ->
        IntStream.rangeClosed(1, k).boxed().collect(Collectors.toSet()));
    return !replicaIndexes.containsAll(dataIndexes);
  }

  @Benchmark
  @Threads(500)
  public boolean testDirectPutAndContainsAll() {
    Set<Integer> dataIndexes = IntStream.rangeClosed(1, D).boxed().collect(Collectors.toSet());
    return !replicaIndexes.containsAll(dataIndexes);
  }

  @Benchmark
  @Threads(500)
  public boolean testIndexCompare() {
    List<Integer> indexes = replicaIndexes.stream()
        .sorted()
        .distinct()
        .collect(Collectors.toList());
    return !(indexes.size() >= D && indexes.get(D - 1) == D);
  }

Benchmark                                                 Mode  Cnt  Score   Error  Units
ECDataIndexesBenchmark.testComputeIfAbsentAndContainsAll  avgt    4  0.062 ± 0.023  ms/op
ECDataIndexesBenchmark.testDirectPutAndContainsAll        avgt    4  0.029 ± 0.015  ms/op
ECDataIndexesBenchmark.testIndexCompare                   avgt    4  0.039 ± 0.017  ms/op

Testing showed that using the map didn't yield much benefit, so I used another, more concise method instead.

@whbing (Contributor, Author) commented Aug 20, 2024

@siddhantsangwan Could you have another look? cc @duongkame @errose28

@siddhantsangwan (Contributor) commented:

@whbing thanks for the update and the benchmark. I agree, the map introduces a bit more complexity than necessary. Have you also considered another simple approach: using a nested loop to check whether each index is present?

      List<Long> uncachePipelines = result.entrySet().stream()
          .filter(e -> {
            Pipeline pipeline = e.getValue();
            // filter empty pipelines
            if (pipeline.isEmpty()) {
              return true;
            }
            // filter insufficient EC pipelines that are missing any data index
            ReplicationConfig repConfig = pipeline.getReplicationConfig();
            if (repConfig instanceof ECReplicationConfig) {
              int d = ((ECReplicationConfig) repConfig).getData();
//              List<Integer> indexes = pipeline.getReplicaIndexes().values().stream()
//                  .sorted()
//                  .distinct()
//                  .collect(Collectors.toList());
//              return !(indexes.size() >= d && indexes.get(d - 1) == d);
              for (int i = 1; i <= d; i++) { // another approach
                if (!pipeline.getReplicaIndexes().containsValue(i)) {
                  return true;
                }
              }
            }
            return false;
          })
          .map(Map.Entry::getKey)
          .collect(Collectors.toList());

The tests you wrote pass with this approach. We expect the number of indexes for EC to be less than 20, so this should barely take any time and is also thread-safe. Let me know if you see any problems with this.

@siddhantsangwan (Contributor) commented:

@whbing thanks for the update. Is it ready for review?

@whbing (Contributor, Author) commented Aug 21, 2024

Is it ready for review?

@siddhantsangwan Thanks! It's ready.

Let me know if you see any problems with this.

No problem! The given method is more concise and efficient.

  @Benchmark
  @Threads(500)
  public boolean testIndexCompare() {
    List<Integer> indexes = replicaIndexes.stream()
        .sorted()
        .distinct()
        .collect(Collectors.toList());
    return !(indexes.size() >= D && indexes.get(D - 1) == D);
  }

  @Benchmark
  @Threads(500)
  public boolean testLoop() {
    for (int i = 1; i <= D; i++) {
      if (!replicaIndexes.contains(i)) {
        return true;
      }
    }
    return false;
  }
Benchmark                                 Mode  Cnt  Score   Error  Units
ECDataIndexesBenchmark1.testIndexCompare  avgt    4  0.035 ± 0.035  ms/op
ECDataIndexesBenchmark1.testLoop          avgt    4  0.001 ± 0.001  ms/op

@siddhantsangwan (Contributor) commented:

No problem! The given method is more concise and efficient.

Ah, that's good. Thanks for benchmarking this as well. LGTM, pending green CI.

@siddhantsangwan siddhantsangwan merged commit c81615f into apache:master Aug 21, 2024
39 checks passed
@whbing (Contributor, Author) commented Aug 21, 2024

Thanks @errose28 and @siddhantsangwan for the review, and thanks @siddhantsangwan for the merge!

errose28 added a commit to errose28/ozone that referenced this pull request Aug 21, 2024
errose28 added a commit to errose28/ozone that referenced this pull request Aug 26, 2024
errose28 added a commit to errose28/ozone that referenced this pull request Aug 28, 2024
xichen01 pushed a commit to xichen01/ozone that referenced this pull request Sep 16, 2024