
HDDS-11209. Avoid insufficient EC pipelines in the container pipeline cache #6974

Merged
merged 6 commits into apache:master on Aug 21, 2024

Conversation

@whbing (Contributor) commented Jul 20, 2024

What changes were proposed in this pull request?

The OM's cached pipeline may not contain all of a container's datanode information (for example, some DNs may not report in time), which can be problematic for EC files. If the number of nodes is less than the number of EC data blocks, an error occurs when reading data:

There are insufficient datanodes to read the EC block

This error persists until the cache expires (after 6 hours), even if all DNs corresponding to the EC file are completely healthy.

So we should not cache EC pipelines with an incomplete set of datanodes.
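
For concreteness, a hedged illustration of the failure condition described above (RS-6-3 layout assumed; plain Java, not the actual Ozone client):

  import java.util.Set;

  // Illustration only: with RS-6-3, an EC block group has data indexes 1..6
  // and parity indexes 7..9. Per the description above, the read fails when
  // the cached pipeline lists fewer nodes than the data-block count.
  public final class InsufficientEcCacheDemo {
    public static void main(String[] args) {
      Set<Integer> cachedIndexes = Set.of(1, 2, 3, 4, 6); // only 5 of 9 nodes cached
      int dataNum = 6; // RS-6-3 has 6 data blocks
      if (cachedIndexes.size() < dataNum) {
        // Matches the client error quoted above.
        System.out.println("There are insufficient datanodes to read the EC block");
      }
    }
  }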

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-11209

How was this patch tested?

  • unit tests
  • manual tests

@whbing whbing marked this pull request as draft July 20, 2024 09:30
@errose28 (Contributor) commented:

It looks like the code will also cache incomplete Ratis pipelines, just not those that are empty. While the reads will technically still work, this could cause under-replicated containers to be sitting in the cache for a while, and may skew the read load for some time even after replication completes. Should we make the cache more strict to only keep pipelines that have all nodes present?

There are insufficient datanodes to read the EC block
This error persists until the cache expires (after 6 hours), even if all DNs corresponding to the EC file are completely healthy.

The client should force a cache refresh on read failures like this, or on corrupted data, since SCM has likely made new copies since the last cache refresh. It looks like that is not happening, which is another problem.

cc @duongkame

@whbing whbing marked this pull request as ready for review July 29, 2024 12:14
@whbing (Contributor, Author) commented Jul 29, 2024

Should we make the cache more strict to only keep pipelines that have all nodes present?

@errose28 Sorry for the late reply. I am inclined to make the cache more strict. At the very least, EC pipelines that do not have enough datanodes for reconstruction should not be cached, since they are unusable.
I believe the following approaches can be considered:

  1. datanodes < data num: uncache
     datanodes >= data num: cache
  2. datanodes < data num + parity num: uncache
     datanodes = data num + parity num: cache
  3. datanodes missing any data index: uncache
     datanodes contain all data indexes: cache

The current submission is based on approach 3.
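
As a rough illustration of approach 3 (using the Pipeline#getReplicaIndexes() map that appears later in this thread; a sketch only, not the merged patch):

  import java.util.Map;

  // Sketch of approach 3: an EC pipeline is cacheable only if its
  // replica-index map covers every data index 1..dataNum; parity indexes
  // (dataNum+1..dataNum+parityNum) may be absent.
  final class EcPipelineCheck {
    static boolean missesAnyDataIndex(Map<?, Integer> replicaIndexes, int dataNum) {
      for (int i = 1; i <= dataNum; i++) {
        if (!replicaIndexes.containsValue(i)) {
          return true; // a data index is missing: do not cache this pipeline
        }
      }
      return false; // all data indexes present: safe to cache
    }

    private EcPipelineCheck() { }
  }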


The client should force a cache refresh on read failures like this or corrupted data since SCM likely made new copies since the last cache refresh. It looks like that is not happening which is another problem.

Currently there is no retry on InsufficientLocationsException. I tested some cases, such as incorrect locations (expired cached locations), and all of them can be retried with forceFreshPipeline = true.
Perhaps for now we can just handle the InsufficientLocationsException case; of course, it could also be addressed in a separate ticket.
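
For illustration, a hedged sketch of that retry idea (method names, signatures, and the local exception stub are assumptions, not the actual Ozone client API):

  import java.io.IOException;

  // Hypothetical sketch only: on an InsufficientLocationsException, refresh
  // the pipeline cache once and retry, since SCM may have re-replicated
  // since the last cache refresh.
  abstract class RefreshingEcRead {
    byte[] read(long blockId) throws IOException {
      try {
        return readWithPipeline(blockId, /* forceFreshPipeline = */ false);
      } catch (InsufficientLocationsException e) {
        // Drop the stale cache entry and retry once with fresh locations.
        return readWithPipeline(blockId, /* forceFreshPipeline = */ true);
      }
    }

    // Stub standing in for the real client read path.
    abstract byte[] readWithPipeline(long blockId, boolean forceFreshPipeline)
        throws IOException;

    // Stand-in for the real exception type discussed above.
    static class InsufficientLocationsException extends IOException { }
  }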

@siddhantsangwan (Contributor) commented:

@whbing thanks for working on this.

datanodes missing any data index: uncache
datanodes contain all data indexes: cache

I like this approach. Will review the code soon.

The OM's cached pipeline may not contain all of a container's datanode information (for example, some DNs may not report in time), which can be problematic for EC files.

This makes me wonder whether the SCM safe-mode rule for EC containers should be stricter than just seeing 1 replica.

@errose28 (Contributor) commented:

This makes me wonder whether the SCM safe-mode rule for EC containers should be stricter than just seeing 1 replica.

@siddhantsangwan see #7008

@siddhantsangwan (Contributor) left a review comment:

Looks good so far. @whbing can you please add a test?

@whbing (Contributor, Author) commented Aug 4, 2024

please add a test?

@siddhantsangwan Added a test for refreshing the container when EC data indexes are missing.

@siddhantsangwan (Contributor) left a review comment:

@whbing It's worth checking for thread-safety issues here. I see that the cache is documented as thread safe, but the new map that we're using is not thread safe.

Comment on lines 133 to 134
Set<Integer> dataIndexes = EC_DATA_INDEXES.computeIfAbsent(d, k ->
IntStream.rangeClosed(1, d).boxed().collect(Collectors.toSet()));
@siddhantsangwan (Contributor) commented:

Here, we're writing to the map in a multi-threaded context. We should either use a concurrent hash map or change this logic such that it's read-only.
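
A sketch of the read-only alternative mentioned here (illustrative only; the bound of 20 is an assumption, echoing the "less than 20 indexes" expectation stated later in this thread, and the PR ultimately dropped the map entirely):

  import java.util.Collections;
  import java.util.HashMap;
  import java.util.Map;
  import java.util.Set;
  import java.util.stream.Collectors;
  import java.util.stream.IntStream;

  // Read-only option: precompute the index sets once for the small range of
  // plausible EC data counts, so the lookup path never writes to shared state.
  final class EcDataIndexSets {
    private static final Map<Integer, Set<Integer>> EC_DATA_INDEXES;

    static {
      Map<Integer, Set<Integer>> m = new HashMap<>();
      for (int d = 1; d <= 20; d++) {
        m.put(d, Collections.unmodifiableSet(
            IntStream.rangeClosed(1, d).boxed().collect(Collectors.toSet())));
      }
      EC_DATA_INDEXES = Collections.unmodifiableMap(m);
    }

    static Set<Integer> forDataCount(int dataNum) {
      return EC_DATA_INDEXES.get(dataNum); // read-only after class init
    }

    private EcDataIndexSets() { }
  }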

@whbing (Contributor, Author) replied:

@siddhantsangwan Thanks for the suggestion! It has been fixed to use a concurrent hash map.

Btw, I benchmarked several approaches (see below). Using the map seems unnecessary here, so I switched to an alternative approach.


// case1: pipeline replicaIndexes missing some data indexes, not cache
@siddhantsangwan (Contributor) commented:

Suggested change
// case1: pipeline replicaIndexes missing some data indexes, not cache
// case1: pipeline replicaIndexes missing some data indexes, should not cache

verify(mockScmContainerClient, times(2))
.getContainerWithPipelineBatch(newHashSet(CONTAINER_ID.get()));

// case2: pipeline replicaIndexes contain all data indexes, cache
@siddhantsangwan (Contributor) commented:

Suggested change
// case2: pipeline replicaIndexes contain all data indexes, cache
// case2: pipeline replicaIndexes contain all data indexes, should cache

@whbing (Contributor, Author) commented Aug 17, 2024

  private static final Map<Integer, Set<Integer>> EC_DATA_INDEXES = new ConcurrentHashMap<>();
  private static final int D = 6;
  private List<Integer> replicaIndexes;

  @Setup
  public void setup() {
    EC_DATA_INDEXES.clear();
    replicaIndexes = IntStream.rangeClosed(1, 9).boxed().collect(Collectors.toList());
    replicaIndexes.add(1);
    Collections.shuffle(replicaIndexes);
  }

  @Benchmark
  @Threads(500)
  public boolean testComputeIfAbsentAndContainsAll() {
    Set<Integer> dataIndexes = EC_DATA_INDEXES.computeIfAbsent(D, k ->
        IntStream.rangeClosed(1, k).boxed().collect(Collectors.toSet()));
    return !replicaIndexes.containsAll(dataIndexes);
  }

  @Benchmark
  @Threads(500)
  public boolean testDirectPutAndContainsAll() {
    Set<Integer> dataIndexes = IntStream.rangeClosed(1, D).boxed().collect(Collectors.toSet());
    return !replicaIndexes.containsAll(dataIndexes);
  }

  @Benchmark
  @Threads(500)
  public boolean testIndexCompare() {
    List<Integer> indexes = replicaIndexes.stream()
        .sorted()
        .distinct()
        .collect(Collectors.toList());
    return !(indexes.size() >= D && indexes.get(D - 1) == D);
  }

Benchmark                                                 Mode  Cnt  Score   Error  Units
ECDataIndexesBenchmark.testComputeIfAbsentAndContainsAll  avgt    4  0.062 ± 0.023  ms/op
ECDataIndexesBenchmark.testDirectPutAndContainsAll        avgt    4  0.029 ± 0.015  ms/op
ECDataIndexesBenchmark.testIndexCompare                   avgt    4  0.039 ± 0.017  ms/op

Testing showed that using the map didn't yield much benefit, so I used another, more concise method instead.

@whbing (Contributor, Author) commented Aug 20, 2024

@siddhantsangwan Could you have another look? cc @duongkame @errose28

@siddhantsangwan (Contributor) commented:

@whbing thanks for the update and the benchmark. I agree, the map introduces a bit more complexity than necessary. Have you also considered another simple approach: using a nested loop to check whether each index is present?

      List<Long> uncachePipelines = result.entrySet().stream()
          .filter(e -> {
            Pipeline pipeline = e.getValue();
            // filter empty pipelines
            if (pipeline.isEmpty()) {
              return true;
            }
            // filter insufficient EC pipelines that are missing any data index
            ReplicationConfig repConfig = pipeline.getReplicationConfig();
            if (repConfig instanceof ECReplicationConfig) {
              int d = ((ECReplicationConfig) repConfig).getData();
//              List<Integer> indexes = pipeline.getReplicaIndexes().values().stream()
//                  .sorted()
//                  .distinct()
//                  .collect(Collectors.toList());
//              return !(indexes.size() >= d && indexes.get(d - 1) == d);
              for (int i = 1; i <= d; i++) { // another approach
                if (!pipeline.getReplicaIndexes().containsValue(i)) {
                  return true;
                }
              }
            }
            return false;
          })
          .map(Map.Entry::getKey)
          .collect(Collectors.toList());

The tests you wrote pass with this approach. We expect the number of indexes for EC to be less than 20, so this should barely take any time and is also thread-safe. Let me know if you see any problems with this.

@siddhantsangwan (Contributor) commented:

@whbing thanks for the update. Is it ready for review?

@whbing (Contributor, Author) commented Aug 21, 2024

Is it ready for review?

@siddhantsangwan Thanks! It's ready.

Let me know if you see any problems with this.

No problem! The given method is more concise and efficient.

  @Benchmark
  @Threads(500)
  public boolean testIndexCompare() {
    List<Integer> indexes = replicaIndexes.stream()
        .sorted()
        .distinct()
        .collect(Collectors.toList());
    return !(indexes.size() >= D && indexes.get(D - 1) == D);
  }

  @Benchmark
  @Threads(500)
  public boolean testLoop() {
    for (int i = 1; i <= D; i++) {
      if (!replicaIndexes.contains(i)) {
        return true;
      }
    }
    return false;
  }
Benchmark                                 Mode  Cnt  Score   Error  Units
ECDataIndexesBenchmark1.testIndexCompare  avgt    4  0.035 ± 0.035  ms/op
ECDataIndexesBenchmark1.testLoop          avgt    4  0.001 ± 0.001  ms/op

@siddhantsangwan (Contributor) commented:

No problem! The given method is more concise and efficient.

Ah, that's good. Thanks for benchmarking this as well. LGTM, pending green CI.

@siddhantsangwan siddhantsangwan merged commit c81615f into apache:master Aug 21, 2024
39 checks passed
@whbing (Contributor, Author) commented Aug 21, 2024

Thanks @errose28 and @siddhantsangwan for the review, and thanks @siddhantsangwan for the merge!

errose28 added a commit to errose28/ozone that referenced this pull request Aug 21, 2024
errose28 added a commit to errose28/ozone that referenced this pull request Aug 26, 2024
errose28 added a commit to errose28/ozone that referenced this pull request Aug 28, 2024
xichen01 pushed a commit to xichen01/ozone that referenced this pull request Sep 16, 2024