Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent dataloss due to the concurrent RPC calls (occurrence is very low) #4970

Merged
merged 4 commits into from
Nov 21, 2024

Conversation

Madhu-1
Copy link
Collaborator

@Madhu-1 Madhu-1 commented Nov 20, 2024

This PR includes series for commits for the following actions

  • Introduce in memory lock for NodePublish and UnPublish (we should not be dependent on the CO and make assumption that CO makes the calls serial for the same volID and the targetPath)
  • Use os.Remove instead of os.Removeall to remove the empty directory, os.Removeall ends up removing everything in the path

Thanks to @martin-helmich @hensur and @Lucaber for the detailed analysis

fixes: #4966

@Rakshith-R
Copy link
Contributor

/test ci/centos/mini-e2e/k8s-1.31

@Madhu-1 Madhu-1 force-pushed the fix-4966 branch 2 times, most recently from cafad3a to 8c455bd Compare November 20, 2024 10:25
@Madhu-1
Copy link
Collaborator Author

Madhu-1 commented Nov 20, 2024

/test ci/centos/mini-e2e/k8s-1.31

Copy link

@martin-helmich martin-helmich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, that was quick, thanks. 🤩🚀 This is also pretty much the fix that we had in mind for this. 👍


return nil, status.Errorf(codes.Aborted, util.TargetPathOperationAlreadyExistsFmt, targetPath)
}
defer ns.VolumeLocks.Release(targetPath)

// Check if that target path exists properly
notMnt, err := ns.createTargetMountPath(ctx, targetPath, isBlock)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I see it correctly, createTargetMountPath still uses the (possibly unreliable?) IsLikelyNotMountPoint under the hood; should this be adjusted to the same IsMountPoint that the cephfs implementation uses? Or can this remain as-is, as discussed in #4966 (comment)?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes thats correct, it checks the sources, The main problem is with lock and the os.RemoveAll, if os.Remove fails we will get another RPC for unpublish which will trigger umount if its actually a mount. For now lets keep it simple with what we have today.

@nixpanic nixpanic changed the title Fix data lose due to the concurrent RPC calls (occurrence is very low) Prevent dataloss due to the concurrent RPC calls (occurrence is very low) Nov 21, 2024
Copy link
Member

@nixpanic nixpanic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

It would be good to backport this to active releases, I think?

@Madhu-1 Madhu-1 added the backport-to-release-v3.12 Label to backport from devel to release-v3.12 branch label Nov 21, 2024
@Madhu-1
Copy link
Collaborator Author

Madhu-1 commented Nov 21, 2024

@Mergifyio queue

Copy link
Contributor

mergify bot commented Nov 21, 2024

queue

✅ The pull request has been merged automatically

The pull request has been merged automatically at 00d252e

We should not be dependent on the CO to ensure
that it will serialize the request instead of
that we need to have own internal locks to ensure
that we dont do concurrent operations for same
request.

Signed-off-by: Madhu Rajanna <[email protected]>
We should not be dependent on the CO to ensure
that it will serialize the request instead of
that we need to have own internal locks to ensure
that we dont do concurrent operations for same
request.

Signed-off-by: Madhu Rajanna <[email protected]>
using os.RemoveAll will remove everything
in the director after the Umount we should
be using os.Remove only to remove the empty
directory

Signed-off-by: Madhu Rajanna <[email protected]>
using os.RemoveAll will remove everything
in the director after the Umount we should
be using os.Remove only to remove the empty
directory

Signed-off-by: Madhu Rajanna <[email protected]>
@mergify mergify bot added the ok-to-test Label to trigger E2E tests label Nov 21, 2024
@ceph-csi-bot
Copy link
Collaborator

/test ci/centos/k8s-e2e-external-storage/1.30

@ceph-csi-bot
Copy link
Collaborator

/test ci/centos/mini-e2e-helm/k8s-1.30

@ceph-csi-bot
Copy link
Collaborator

/test ci/centos/k8s-e2e-external-storage/1.31

@ceph-csi-bot
Copy link
Collaborator

/test ci/centos/mini-e2e/k8s-1.30

@ceph-csi-bot
Copy link
Collaborator

/test ci/centos/upgrade-tests-cephfs

@ceph-csi-bot
Copy link
Collaborator

/test ci/centos/mini-e2e-helm/k8s-1.31

@ceph-csi-bot
Copy link
Collaborator

/test ci/centos/k8s-e2e-external-storage/1.29

@ceph-csi-bot
Copy link
Collaborator

/test ci/centos/mini-e2e/k8s-1.31

@ceph-csi-bot
Copy link
Collaborator

/test ci/centos/upgrade-tests-rbd

@ceph-csi-bot
Copy link
Collaborator

/test ci/centos/mini-e2e-helm/k8s-1.29

@ceph-csi-bot
Copy link
Collaborator

/test ci/centos/mini-e2e/k8s-1.29

@ceph-csi-bot ceph-csi-bot removed the ok-to-test Label to trigger E2E tests label Nov 21, 2024
@mergify mergify bot merged commit 00d252e into ceph:devel Nov 21, 2024
38 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-to-release-v3.12 Label to backport from devel to release-v3.12 branch
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Race condition in RBD node plugin may lead to total data loss while unmounting
5 participants