Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

prov/efa: Disable resource management when user set FI_OPT_EFA_RNR_RETRY #10560

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

jiaxiyan
Copy link
Contributor

If a user is explicitly asking for a retry count that means they also want to manage the resources themselves and the EFA provider should disable resource management to prevent implicit retries.

Also remove the unused function.

@jiaxiyan jiaxiyan requested a review from a team November 19, 2024 21:52
@shijin-aws
Copy link
Contributor

Changes LGTM, would u mind adding a unit test?

@shijin-aws
Copy link
Contributor

It seems rdm_rnr_queue_resend test conflicts with your PR. This test assumes RM is enabled after setopt. We need to update this test accordingly

If a user is explicitly asking for a retry count that means they
also want to manage the resources themselves and the EFA provider
should disable resource management to prevent implicit retries.

Also remove the unused function.

Signed-off-by: Jessie Yang <[email protected]>
if (ret) {
FT_PRINTERR("fi_setopt", -ret);
FT_PRINTERR("setenv", -ret);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think setenv works here because fi_param_get happens at fi_getinfo where we already have a efa_env.rnr_retry set as 3.

Also, this is shared among the rnr_cq_read_error and queue_resend test. The former intends to turn off RM, the latter one intends to turn on RM. It's totally fine for rnr_cq_read_error to call fi_setopt, but not queue_resend.

So we should have a bool somewhere to call fi_setopt only for cq_read_error test. And make queue_resend_test setenv before initing any resources

Copy link
Contributor

@shijin-aws shijin-aws Nov 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the code again, Nah. rnr_retry is not really environment variable even though it is in efa_env so setenv won't work at all.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants