HDDS-11650. ContainerId list to track all containers created in a datanode #7402
Conversation
```java
StorageContainerException {
    Preconditions.checkNotNull(container, "container cannot be null");

    long containerId = container.getContainerData().getContainerID();
    State containerState = container.getContainerData().getState();
    if (!overwriteMissingContainers && missingContainerSet.contains(containerId)) {
```
I don't understand why we cannot just check if the container is in the ContainerIDTable? Why do we need to use this missingContainerSet instead?
I traced the creation of the missing set to a ratis snapshot (I know nothing about it) and a diff against the containers found in memory. What about EC containers? Do they all become part of this missing set?
Prior to this change, the missingSet doesn't seem to be used for anything inside this class, but now it is used for these checks. What was the missingSet used for before?
I would have thought the ContainerID table starts out empty after this change is committed. On first start, it gets populated with all containers in the ContainerSet (scanned from disk). On later startups, all containers can be added again, but they should already all be there. Then we could just track previously created containers in the containerID table?
The container set is first initialised with all the containers present on disk by performing an ls on the data volumes. Then we iterate through the containerIds table in RocksDB, and any container present in this table but not present on disk is added to the missing container set. This includes both ratis and EC containers.
Now, on container creation, we just have to check the in-memory set rather than going to RocksDB every time, which would incur an I/O cost; however small, that cost is not needed in my opinion when we have everything in memory.
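A minimal sketch of that startup reconciliation, assuming a hypothetical containerIds table keyed by container ID (the value type is a guess) and a set of IDs built from the on-disk scan; this only illustrates the diff described above, not the PR's actual code:

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.hdds.utils.db.Table;
import org.apache.hadoop.hdds.utils.db.TableIterator;

// Sketch: missing = (container IDs persisted in the containerIds table) - (container IDs found on disk).
final class MissingContainerSketch {
  static Set<Long> computeMissing(Set<Long> onDiskIds, Table<Long, Long> containerIdsTable)
      throws IOException {
    Set<Long> missing = new HashSet<>();
    try (TableIterator<Long, ? extends Table.KeyValue<Long, Long>> it =
             containerIdsTable.iterator()) {
      while (it.hasNext()) {
        long id = it.next().getKey();
        if (!onDiskIds.contains(id)) {
          missing.add(id); // witnessed earlier on this datanode, but absent from every data volume now
        }
      }
    }
    return missing;
  }
}
```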
Ah ok, I misread the Javadoc and thought it was a ratis snapshot. However my question still stands - why not simply use the ContainerIDTable to check for previous container creation, rather than augmenting this missing list?
I am just reusing the data structure which was already there. This missing container set was initially used only for ratis containers; I am now making it track the EC containers as well. I didn't want to change a lot of code, so I reused most of what is already there.
Moreover, an in-memory get is faster than fetching from RocksDB.
Prior to this change, the missingContainerSet variable in the ContainerSet was not used for anything within that class. Now, we are potentially adding to it from the containerList - what impact does that have more widely? What was the missingContainerSet used for before?
We also have to open the new containerList table - it would be clearer if we just loaded that into memory and used the new "seenContainerList" for previously-existing checks.
However backing up a bit - how can containers get created?
One way, is via the write path - that is the problem we are trying to solve.
Another is via EC reconstruction - there is already logic there to deal with a duplicate container, I think, though it may not cover volume failure. EC reconstruction effectively uses the same write path as normal writes, so it is basically the same as the write path.
Then there is replication, which is a very different path into the system.
Maybe this problem is better not solved in ContainerSet, but at the higher level in the dispatcher where the writes are being handled?
I feel we should track all the containers getting added to the system. This flow also accommodates replication for both RATIS and EC. Barring the exception where we allow the replication job to overwrite missing containers, we should prevent overwriting any container coming through any flow. If we have this check in a central place, any future code changes would be foolproof. At least things would break loudly rather than silently ending up writing something wrong.
For me, this change is too big, and it's going to cause conflicts and be difficult to backport. It's also hard to review, as the number of changed files is massive, although many are test files and simple one-line changes. It feels like this change could be largely contained within the ContainerSet class, without changing a lot of generic classes and causing a ripple effect into the wider system. All it has to do is create a new overloaded constructor and not change the existing one, making the existing one call the new one with the correct runtime requirements. Then we just need to check whether a container has previously existed or not, and add new containers to the "previously exists" list. This shouldn't require changing 90-something files in a single PR.
All of the tests use this containerSet. If we pass a null value, it would make the container set a read-only data structure (I have put guardrails in place so that we don't end up inadvertently writing containers without persisting the data). Since we have made the table a mandatory argument, we have to pass a mocked table in all the existing tests - another big change. We didn't have a pre-existing RocksDB on the master volume; DatanodeStore was only defined for data volumes. I wanted to reuse the code for creating a master volume, which ended up adding generics everywhere once a base interface was created.
```java
StorageContainerException {
    Preconditions.checkNotNull(container, "container cannot be null");

    long containerId = container.getContainerData().getContainerID();
    State containerState = container.getContainerData().getState();
    if (!overwriteMissingContainers && missingContainerSet.contains(containerId)) {
      throw new StorageContainerException(String.format("Container with container Id %d is missing in the DN " +
```
KeyValueHandler has similar logic - why is it needed in both places? It feels like containerSet (i.e. here) may not need to check this, if it's checked in the handler that creates the containers on the write path?
KeyValueHandler catches the exception and wraps it as an error response.
Yeah you are right. This is redundant. Actually let me shift this logic to a single function so that we don't have redundant code. But we should perform the missing container check on add container.
There is a function which doesn't overwrite containers. We should be performing this check on every container add. In KeyValueHandler we have to do it because creating the KeyValueContainerData creates the required directory structure in the data volume, so we have to provide this check there. But whenever we are adding to the data structure from other paths, we have to ensure the container is not missing by default, unless we explicitly intend to overwrite the missing container.
```diff
   /**
    * Add Container to container map.
    * @param container container to be added
    * @return If container is added to containerMap returns true, otherwise
    * false
    */
-  public boolean addContainer(Container<?> container) throws
+  public boolean addContainer(Container<?> container, boolean overwriteMissingContainers) throws
```
I would suggest adding a new method here - addContainerIgnoringMissing (or something similar), rather than changing the existing interface. That way, it avoids changes in existing calling code, and only code that explicitly needs to pass the boolean can be changed to call the new method. I suspect this change would reduce the number of files impacted by this PR.
There is already a method for that.
Lines 106 to 108 in af0f757
```java
public boolean addContainer(Container<?> container) throws StorageContainerException {
  return addContainer(container, false);
}
```
```diff
 import java.io.Closeable;
 
 /**
  * DB handle abstract class.
  */
-public abstract class DBHandle implements Closeable {
+public abstract class DBHandle<STORE extends AbstractStore> implements Closeable {
```
Can we avoid making changes to this class and get the desired functionality another way (extending another base class, something else)? Changing this has a knock-on effect on many other classes, and this feels like refactoring that perhaps isn't needed.
Yeah, I have limited the changes to only the AbstractStore changes. I don't see a point in reducing the changes further, and I don't think there is going to be a lot of conflict while backporting the AbstractStore changes.
```diff
   @Override
   public Void call() throws Exception {
-    OzoneConfiguration conf = createOzoneConfiguration();
+    try {
+      OzoneConfiguration conf = createOzoneConfiguration();
 
-    final Collection<String> datanodeStorageDirs =
-        HddsServerUtil.getDatanodeStorageDirs(conf);
+      final Collection<String> datanodeStorageDirs =
+          HddsServerUtil.getDatanodeStorageDirs(conf);
```
This change is just wrapping existing code in try-finally, with new code added only in finally. You can reduce the change by:
- rename existing call() to something else (using oldCall() here as example)
- add new call() as:

```java
public Void call() throws Exception {
  try {
    oldCall();
  } finally {
    ...
  }
}
```

This way indentation is unchanged, diff is reduced.
done
```diff
@@ -21,7 +21,7 @@ OZONE-SITE.XML_ozone.scm.datanode.ratis.volume.free-space.min=10MB
 OZONE-SITE.XML_ozone.scm.pipeline.creation.interval=30s
 OZONE-SITE.XML_ozone.scm.pipeline.owner.container.count=1
 OZONE-SITE.XML_ozone.scm.names=scm
-OZONE-SITE.XML_ozone.scm.datanode.id.dir=/data
+OZONE-SITE.XML_ozone.scm.datanode.id.dir=/data/metadata
```
Can you please explain this part of the change?
Creation of the RocksDB under /data fails because of a permission issue, so I had to create it under a metadata directory.
So why do we need to create the DB in ozone.scm.datanode.id.dir, which is for the datanode ID file? Can we use ozone.metadata.dirs, which is already configured to /data/metadata?
We want the datanode.id file & the rocksdb to be present in the same volume. Keeping them in the same directory becomes logical in that case.
```java
public static <KEY, VALUE> Table<KEY, VALUE> getInMemoryTableForTest() {
  return new Table<KEY, VALUE>() {
    private final Map<KEY, VALUE> map = new ConcurrentHashMap<>();
```
Please create a non-anonymous class for the Table implementation, and drop DBTestUtils, which has no other code.
done
```java
 * Defines a functional interface to call void returning function.
 */
@FunctionalInterface
public interface VoidCallable<EXCEPTION_TYPE extends Exception> {
  void call() throws EXCEPTION_TYPE;
```
Please use CheckedRunnable.
done
```java
import java.util.Map;

/**
 * Class for defining the schema for master volume in a datanode.
```
What is a master volume?
Master volume is defined in this doc:
https://docs.google.com/document/d/1k4Tc7P-Jszdwn4dooRQZiKctlcGWNlnjZO1fh_TBneM/edit?usp=sharing . It is the same volume where we have datanode.id file.
```java
 * Abstract Interface for interacting with datanode databases.
 */
public interface AbstractStore extends Closeable {
```
Interfaces are already abstract, so both the doc and the interface name are very vague. Please try to find a better name. As far as I see, it's mostly for managing the lifecycle of a DBStore. Also, this parent interface does not seem to be specific to datanodes. Should it be moved to hadoop-hdds/framework?
We need a base interface for initialising a RocksDB in the datanode alone, covering things like loading the datanode profile and everything else that comes with it. I believe there is a duplicate interface for OM and SCM as well. We should pick this up as a follow-up task and unify these things in one place.
```java
case UNSUPPORTED_REQUEST:
case CONTAINER_MISSING:// Blame client for sending unsupported request.
```
The comment about the unsupported request belongs to the UNSUPPORTED_REQUEST case...
done
@adoroszlai I have addressed all the comments. Do you have any more review comments?
Why is the logic embedded into hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/impl/ContainerSet.java?
A cleaner approach would be to expand the container set logic via aggregation:
- Create a new Java class (WitnessedContainers.java) that reads and writes to the DB and keeps a concurrent copy in memory to handle concurrency (the current code does this) - see the sketch after this comment.
- Do not bother with missing-container logic at the level of WitnessedContainers.java. When the replication manager tries to create a container, it checks whether it is in the witnessed list and then adds it. So we only need two methods for addition: AddContainer and AddContainerIfNotPresent.
Do we need reference-counted code to be introduced? Can we get away with just embedding it within the ContainerSet?
I think this code change can go through more than one round of simplification for what it is trying to solve.
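A rough sketch of the aggregation being suggested, under the assumption that the table is keyed and valued by container ID; class and method shapes here are hypothetical, not code from this PR:

```java
import java.io.IOException;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.hadoop.hdds.utils.db.Table;
import org.apache.hadoop.hdds.utils.db.TableIterator;

// Sketch of a WitnessedContainers aggregate: owns the persisted table plus an in-memory mirror.
public class WitnessedContainers {
  private final Table<Long, Long> table;              // persisted container-id table (value type assumed)
  private final Set<Long> inMemory = ConcurrentHashMap.newKeySet();

  public WitnessedContainers(Table<Long, Long> table) throws IOException {
    this.table = table;
    try (TableIterator<Long, ? extends Table.KeyValue<Long, Long>> it = table.iterator()) {
      while (it.hasNext()) {
        inMemory.add(it.next().getKey());             // load previously witnessed ids into memory
      }
    }
  }

  public void addContainer(long containerId) throws IOException {
    table.put(containerId, containerId);
    inMemory.add(containerId);
  }

  public boolean addContainerIfNotPresent(long containerId) throws IOException {
    if (!inMemory.add(containerId)) {
      return false;                                   // already witnessed on this datanode
    }
    table.put(containerId, containerId);
    return true;
  }
}
```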
Then what is the purpose of the missing container set? We shouldn't have two data structures solving the same problem. We should remove the missing container set and rely only on the witnessed container list. If we have two out of the three sets, we can compute the third one. With the approach proposed above, the logic would be spread across all three sets, which would complicate things further in the future. We shouldn't add redundant data structures if they aren't required.
I had to do the reference-counted DB implementation just because of the test cases; the PR would have gotten much bigger to fix them otherwise. A simpler approach was to use reference counting, so we can have only one instance of a writable DB.
MissingContainerSet in the ContainerSet was added as part of HDDS-935, which tried to solve the same problem we are tackling here (avoid recreating an already created container on a datanode in case of disk removal followed by datanode restart). Unfortunately, that solution only accounted for ratis containers, because that was the only container type back then. It adds all the containers that are present in the ratis snapshot's containerBCSIdMap but not found on disk after scanning through all the data volumes in the node.
This looks better now. I want to go over the DB code refactoring and test code.
```diff
   /**
    * Removes the Container matching with specified containerId.
    * @param containerId ID of the container to remove
    * @return If container is removed from containerMap returns true, otherwise
    * false
    */
-  public boolean removeContainer(long containerId) {
+  private boolean removeContainer(long containerId, boolean markMissing, boolean removeFromDB)
```
removeFromDB - do we expect it to be true outside of testing?
- false, false: This happens on finding a duplicate container on startup, where the container info in memory gets swapped by removing and re-adding the updated container info.
- true, false: This happens on volume failure, where we have to mark the container as missing. This is a plain in-memory operation: we remove the container from containerMap and add the containerId to the missingContainerSet.
- false, true: This is the regular delete flow, which happens when SCM sends a delete container command. It only happens after deleting the entire container data from the disk.
- true, true: This is an invalid case; it doesn't happen anywhere. (A sketch of these combinations follows below.)
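A minimal sketch of what those combinations translate to, assuming ContainerSet-style fields (containerMap, missingContainerSet, containerIdsTable) and ignoring locking; the field names and types are assumptions, not the PR's exact code:

```java
// Assumed fields: Map<Long, Container<?>> containerMap, Set<Long> missingContainerSet,
// Table<Long, Long> containerIdsTable. The (true, true) combination is never used.
private boolean removeContainer(long containerId, boolean markMissing, boolean removeFromDB)
    throws IOException {
  if (markMissing) {
    missingContainerSet.add(containerId);     // volume failure: remember the id as missing
  }
  if (removeFromDB) {
    containerIdsTable.delete(containerId);    // SCM delete: container data already removed from disk
  }
  return containerMap.remove(containerId) != null;  // in every case, drop the in-memory entry
}
```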
I was under the impression that once a container is created, it will remain in the DB forever.
We don't have any use of that container information. Why keep it if it is not required?
```java
    }
  }, 1000, 30000);

  // Read Chunk
```
Should we add a second write chunk as well?
done
I have added a loop for testing creation of an open container. Then, after restarting, I am creating a recovering container and performing the same set of operations, expecting them to succeed.
Recovering containers accepting writes should be for EC only? Do we want a write chunk coming for a non-EC container to fail? In this test this is a single-node cluster:

```java
cluster = MiniOzoneCluster.newBuilder(conf)
    .setNumDatanodes(1)
    .build();
cluster.waitForClusterToBeReady();
```
I am testing both creating an open container and a recovering container. Regardless of whether it is an EC key or a ratis key, we land in the same dispatcher flow, so we don't have to test this separately.
Lines 253 to 256 in 827bc86

```java
// Create Container
request = ContainerTestHelper.getCreateContainerRequest(testContainerID, pipeline);
response = client.sendCommand(request);
assertEquals(ContainerProtos.Result.CONTAINER_MISSING, response.getResult());
```
Lines 292 to 296 in 827bc86

```java
// Create Recovering Container
request = ContainerTestHelper.getCreateContainerRequest(testContainerID, pipeline,
    ContainerProtos.ContainerDataProto.State.RECOVERING);
response = client.sendCommand(request);
assertEquals(ContainerProtos.Result.SUCCESS, response.getResult());
```
In a follow-up jira, please accept write chunks only for EC pipelines if the container is recovering.
I have created a follow-up jira for performing this validation:
https://issues.apache.org/jira/browse/HDDS-11761
The change seems reasonable now. Please ensure we are testing the APIs we care about for a container after the container goes missing. I think we can generate more test cases via fault injection. We can discuss whether it needs to be part of this PR or a separate PR (to keep this PR small).
Should we keep the scope of this PR limited? We can look into adding a follow-up jira for performing fault injection tests. The PR is already big as it is.
Thanks for the review on the patch @kerneltime @sodonnel @adoroszlai
What changes were proposed in this pull request?
We need to persist the list of containers created on a datanode to ensure that volume failures don't lead to recreation of containers on write chunk, which would leave partial, inconsistent data on datanodes.
For ratis containers this is taken care of by persisting the container-to-BCSID map in the ratis snapshot file, but such a list is not present for EC containers.
Details to follow in a doc.
https://docs.google.com/document/d/1k4Tc7P-Jszdwn4dooRQZiKctlcGWNlnjZO1fh_TBneM/edit?usp=sharing
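As a rough illustration, here is a sketch of the guard this enables on the create path; the class and method names are hypothetical, while missingContainerSet, the overwriteMissingContainers flag, and the CONTAINER_MISSING result come from the PR discussion above:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.hadoop.hdds.protocol.datanode.proto.ContainerProtos;
import org.apache.hadoop.hdds.scm.container.common.helpers.StorageContainerException;

// Sketch: refuse to (re)create a container that was already witnessed on this datanode
// but is now missing from disk, unless the caller explicitly asks to overwrite it.
class CreateContainerGuard {
  private final Set<Long> missingContainerSet = ConcurrentHashMap.newKeySet();

  void ensureNotMissing(long containerId, boolean overwriteMissingContainers)
      throws StorageContainerException {
    if (!overwriteMissingContainers && missingContainerSet.contains(containerId)) {
      throw new StorageContainerException("Container " + containerId
          + " is missing on this datanode; refusing to recreate it",
          ContainerProtos.Result.CONTAINER_MISSING);
    }
  }
}
```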
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-11650
How was this patch tested?
Adding more unit tests & integration tests