HDDS-11209. Avoid insufficient EC pipelines in the container pipeline cache #6974
Conversation
It looks like the code will also cache incomplete Ratis pipelines, just not those that are empty. While the reads will technically still work, this could cause under-replicated containers to be sitting in the cache for a while, and may skew the read load for some time even after replication completes. Should we make the cache more strict to only keep pipelines that have all nodes present?
The client should force a cache refresh on read failures like this, or on corrupted data, since SCM has likely made new copies since the last cache refresh. It looks like that is not happening, which is another problem. cc @duongkame
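Roughly the behavior I mean, as a sketch (pipelineCache, fetchFromScm, and readBlock are hypothetical names for illustration, not the actual client API):

import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class RefreshOnFailureSketch {
  private final Map<Long, String> pipelineCache = new ConcurrentHashMap<>();

  // On a read failure, evict the possibly stale pipeline and retry once with
  // freshly fetched information, since SCM may have re-replicated the
  // container since the last cache refresh.
  byte[] readBlock(long containerId) throws IOException {
    try {
      return read(pipelineCache.computeIfAbsent(containerId, this::fetchFromScm));
    } catch (IOException e) {
      pipelineCache.remove(containerId);
      return read(pipelineCache.computeIfAbsent(containerId, this::fetchFromScm));
    }
  }

  // Stand-ins for the real SCM lookup and datanode read.
  private String fetchFromScm(long containerId) { return "pipeline-" + containerId; }
  private byte[] read(String pipeline) throws IOException { return new byte[0]; }
}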
@errose28 Sorry for the late reply. I am inclined to make the cache stricter. At the very least, EC pipelines that do not have enough datanodes for reconstruction should not be cached, as they are invalid.
The current submission is based on point 3.
Now there is no retry when …
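Roughly the check I have in mind, as a sketch (the names and the count-based criterion are illustrative, not the final code in this PR):

import java.util.Set;

final class EcPipelineValiditySketch {
  // An EC(d, p) pipeline can reconstruct data only if at least d of its
  // d + p replica indexes are still present; with fewer, cached reads can
  // only fail until SCM rebuilds replicas, so such pipelines should not
  // be cached at all.
  static boolean enoughForReconstruction(Set<Integer> presentReplicaIndexes, int dataBlocks) {
    return presentReplicaIndexes.size() >= dataBlocks;
  }
}

For EC(6, 3), for instance, a pipeline reporting only 5 distinct replica indexes would be rejected.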
@whbing thanks for working on this.
I like this approach. Will review the code soon.
This makes me wonder if the SCM safe-mode rule for EC containers should be stricter than just seeing 1 replica.
Looks good so far. @whbing can you please add a test?
@siddhantsangwan Added a test for refreshing the container when EC data indexes are missing.
@whbing It's worth checking for thread-safety issues here. I see that the cache is documented as thread-safe, but the new map that we're using is not.
Set<Integer> dataIndexes = EC_DATA_INDEXES.computeIfAbsent(d, k ->
    IntStream.rangeClosed(1, d).boxed().collect(Collectors.toSet()));
Here, we're writing to the map in a multi-threaded context. We should either use a concurrent hash map or change this logic such that it's read-only.
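For example, a thread-safe variant could look like this (a sketch with illustrative names; ConcurrentHashMap.computeIfAbsent applies the mapping function atomically per key):

import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

final class EcDataIndexesSketch {
  // ConcurrentHashMap runs computeIfAbsent atomically per key, so concurrent
  // callers never observe a partially built set; the unmodifiable wrapper
  // keeps the shared value read-only afterwards.
  private static final Map<Integer, Set<Integer>> EC_DATA_INDEXES =
      new ConcurrentHashMap<>();

  static Set<Integer> dataIndexes(int d) {
    return EC_DATA_INDEXES.computeIfAbsent(d, k ->
        Collections.unmodifiableSet(
            IntStream.rangeClosed(1, k).boxed().collect(Collectors.toSet())));
  }
}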
@siddhantsangwan Thanks for the suggestion! Fixed to use a ConcurrentHashMap.
Btw, I benchmarked several methods (as follows). I think using a map is unnecessary here, so I tried alternative approaches instead.
// case1: pipeline replicaIndexes missing some data indexes, not cache

Suggested change:

// case1: pipeline replicaIndexes missing some data indexes, should not cache
verify(mockScmContainerClient, times(2))
    .getContainerWithPipelineBatch(newHashSet(CONTAINER_ID.get()));
// case2: pipeline replicaIndexes contain all data indexes, cache

Suggested change:

// case2: pipeline replicaIndexes contain all data indexes, should cache
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import org.openjdk.jmh.annotations.*;

@State(Scope.Benchmark)
public class ECDataIndexesBenchmark {
  private static final Map<Integer, Set<Integer>> EC_DATA_INDEXES = new ConcurrentHashMap<>();
  private static final int D = 6;
  private List<Integer> replicaIndexes;

  @Setup
  public void setup() {
    EC_DATA_INDEXES.clear();
    // 9 replica indexes (EC 6+3) plus one duplicate, shuffled.
    replicaIndexes = IntStream.rangeClosed(1, 9).boxed().collect(Collectors.toList());
    replicaIndexes.add(1);
    Collections.shuffle(replicaIndexes);
  }

  @Benchmark
  @Threads(500)
  public boolean testComputeIfAbsentAndContainsAll() {
    Set<Integer> dataIndexes = EC_DATA_INDEXES.computeIfAbsent(D, k ->
        IntStream.rangeClosed(1, k).boxed().collect(Collectors.toSet()));
    return !replicaIndexes.containsAll(dataIndexes);
  }

  @Benchmark
  @Threads(500)
  public boolean testDirectPutAndContainsAll() {
    Set<Integer> dataIndexes = IntStream.rangeClosed(1, D).boxed().collect(Collectors.toSet());
    return !replicaIndexes.containsAll(dataIndexes);
  }

  @Benchmark
  @Threads(500)
  public boolean testIndexCompare() {
    // Note: the duplicate index skews this variant; a later version adds distinct().
    List<Integer> indexes = replicaIndexes.stream()
        .sorted()
        .collect(Collectors.toList());
    return !(indexes.size() >= D && indexes.get(D - 1) == D);
  }
}

Benchmark                                          Mode  Cnt  Score   Error  Units
ECDataIndexesBenchmark.testDirectContainsAll       avgt    4  0.029 ± 0.015  ms/op
ECDataIndexesBenchmark.testFromMapAndContainsAll   avgt    4  0.062 ± 0.023  ms/op
ECDataIndexesBenchmark.testIndexCompare            avgt    4  0.039 ± 0.017  ms/op

I found that using the map didn't yield much benefit, so I used a more concise method instead.
@siddhantsangwan Could you have another look? cc @duongkame @errose28
@whbing thanks for the update and benchmark. I agree, the map introduces a bit more complexity than necessary. Have you also considered another simple approach: using a loop to check whether each index is present?
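Something like this (a sketch with illustrative names, reconstructed from the benchmark further down, not the exact snippet):

import java.util.List;

final class LoopCheckSketch {
  // True when some data index in 1..d is missing from the pipeline's replica
  // indexes, i.e. the pipeline should be refreshed rather than cached.
  static boolean missingDataIndex(List<Integer> replicaIndexes, int d) {
    for (int i = 1; i <= d; i++) {
      if (!replicaIndexes.contains(i)) {
        return true;
      }
    }
    return false;
  }
}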
The tests you wrote pass with this approach. We expect the number of indexes for EC to be less than 20, so this should barely take any time, and it is also thread-safe. Let me know if you see any problems with this.
@whbing thanks for the update. Is it ready for review?
@siddhantsangwan Thanks! It's ready.
No problem! The suggested method is more concise and efficient.

@Benchmark
@Threads(500)
public boolean testIndexCompare() {
List<Integer> indexes = replicaIndexes.stream()
.sorted()
.distinct()
.collect(Collectors.toList());
return !(indexes.size() >= D && indexes.get(D - 1) == D);
}
@Benchmark
@Threads(500)
public boolean testLoop() {
for (int i = 1; i <= D; i++) {
// A missing data index means the pipeline is incomplete and should be refreshed.
if (!replicaIndexes.contains(i)) {
return true;
}
}
return false;
}
Benchmark                                  Mode  Cnt  Score   Error  Units
ECDataIndexesBenchmark1.testIndexCompare   avgt    4  0.035 ± 0.035  ms/op
ECDataIndexesBenchmark1.testLoop           avgt    4  0.001 ± 0.001  ms/op
Ah, that's good. Thanks for benchmarking this as well. LGTM, pending green CI.
Thanks @errose28 and @siddhantsangwan for the review, and thanks @siddhantsangwan for the merge!
What changes were proposed in this pull request?
The OM's cached pipeline may not include all of a container's datanode information (for example, some DNs may not report in time), which is problematic for EC files. If the number of nodes is less than the number of EC data blocks, an error occurs when reading the data.
This error will persist until the cache expires (after 6 hours), even if all DNs corresponding to the EC file are completely healthy.
So, EC pipelines with incomplete datanodes should not be cached.
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-11209
How was this patch tested?