-
Notifications
You must be signed in to change notification settings - Fork 549
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
rbd: include trashed parent images while calculating the clone depth #4029
base: devel
Are you sure you want to change the base?
Conversation
/test ci/centos/mini-e2e/k8s-1.27 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This pr looks good to me.
This pull request now has conflicts with the target branch. Could you please resolve conflicts and force push the corrected changes? 🙏 |
/test ci/centos/mini-e2e/k8s-1.27 |
/test ci/centos/mini-e2e/k8s-1.27 |
Manual stress testing with the scripts from rook/rook#12312 passes. Need to fix the golangci-lint issues and have the go-ceph rebase merged before this is completely ready. |
internal/rbd/rbd_util.go
Outdated
// | ||
// This function re-uses the ioctx of the image to open all images in the | ||
// chain. There is no need to open new ioctx's for every image. | ||
func (ri *rbdImage) getCloneDepth() (uint, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hey @nixpanic,
Can we add another arguement like maxDepthToTraverse
which can be used to exit and not traverse further unnecessarily?
We can pass hard rbdHardMaxCloneDepth+1
to the function call then.
wdyt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure, I'll add that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a 2nd commit to address this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks Niels
@Mergifyio rebase |
✅ Branch has been successfully rebased |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, i will let @Rakshith-R to approve it as had some suggestions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks !
@Mergifyio refresh |
✅ Pull request refreshed |
@Mergifyio queue |
🛑 The pull request has been removed from the queue
|
/test ci/centos/upgrade-tests-cephfs |
It seemed that my deployment did not include |
/test ci/centos/mini-e2e/k8s-1.28 |
/test ci/centos/mini-e2e/k8s-1.28 |
For anyone watching this, I have now noticed that the ImageID of a |
Dropped the commit
as it seems that the ImageID is not always set to the correct value, but to the ImageID of a cloned image. Without this commit, e2e should pass. |
/test ci/centos/mini-e2e/k8s-1.28 |
/test ci/centos/mini-e2e/k8s-1.28 |
/test ci/centos/mini-e2e/k8s-1.28 |
Deleting the RBD image seems to have resulted in a (spurious?) error. The |
/test ci/centos/mini-e2e/k8s-1.28 |
`librbd.OpenImageById()` works if the image is in the trash, so it makes it possible to get the parent of the image. Signed-off-by: Niels de Vos <[email protected]>
When a new volume is not created yet, the ImageID should not be set to the ID of the snapshot. Signed-off-by: Niels de Vos <[email protected]>
In some places the ImageID is used as the ID of the parent. That is very confusing and prone to errors. Instead, fetch the right ImageID where possible, and set ParentID for referencing to parent images. Signed-off-by: Niels de Vos <[email protected]>
Signed-off-by: Niels de Vos <[email protected]>
If the RBD-image is deleted already, the DeleteVolume CSI procedure is expected to report success (as it should be idempotent). In case the returned error indicates "RBD image not found", the error is ignored and the DeleteVolume procedure continues. Signed-off-by: Niels de Vos <[email protected]>
The `getCloneDepth()` function did not account for images that are in the trash. A trashed image can only be opened by the image-id, and not by name anymore. Closes: ceph#4013 Signed-off-by: Niels de Vos <[email protected]>
The `getCloneDepth()` function does not need to traverse the whole chain of parents when a certain max-limit is configured. The traversing can be aborted once the hard-limit is reached. This makes the procedure a little more efficient, as unnecessary traversing is prevented. Signed-off-by: Niels de Vos <[email protected]>
By returned `ABORTED` or `PRECONDITION_FAILED` in the right places, the CO will retry with the same arguments until the snapshot is `ReadyToUse`. This causes restoring a volume from a snapshot to be delayed, until the snapshot can be used. Signed-off-by: Niels de Vos <[email protected]>
/test ci/centos/mini-e2e/k8s-1.27 |
From the logs it looks like the flattening of images is in progress. I have not found messages that suggest something is blocked, but it is well possible that there is a deadlock on the PVC-restore and snapshot-delete somewhere (I have seen these in manual testing too, but though I addressed them). Maybe #1883 (comment) contains more hints on things that I have missed in the current PR. |
This pull request now has conflicts with the target branch. Could you please resolve conflicts and force push the corrected changes? 🙏 |
The
getCloneDepth()
function did not account for images that are inthe trash. A trashed image can only be opened by the image-id, and not
by name anymore.
Closes: #4013
Depends-on: #4064 #4273
Show available bot commands
These commands are normally not required, but in case of issues, leave any of
the following bot commands in an otherwise empty comment in this PR:
/retest ci/centos/<job-name>
: retest the<job-name>
after unrelatedfailure (please report the failure too!)