Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

shell: add extra info to doom exceptions #6453

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

chu11
Copy link
Member

@chu11 chu11 commented Nov 21, 2024

Problem: Exceptions from the doom plugin do not specify what job failed. It can be confusing to users which job failed amongst multiple in a batch job, or if the broker from a batch job itself failed.

Add the job executable name and jobid to the doom exception outputs.

Fixes #6357


we could obviously debate the message format, I went back and forth and just ended up at this. Tweak suggestions welcome.

@grondo
Copy link
Contributor

grondo commented Nov 22, 2024

Tweak suggestions welcome

We might be able to shorten it by just prefixing the existing exception note with jobid: arg0: . Also, I wonder if we should just include the basename of arg0? (This is just a hint so the user knows at what level the exception occurred).

Now that I'm thinking about it, I wonder if flux job attach should be the thing printing the jobid for job exceptions. Logging the jobid in the eventlog that's specific to a jobid seems redundant in retrospect.

@chu11
Copy link
Member Author

chu11 commented Nov 22, 2024

Also, I wonder if we should just include the basename of arg0?

Oh that's a good idea, that just slipped my mind.

Problem: Exceptions from the doom plugin do not specify what
job failed.  It can be confusing to users which job failed amongst
multiple in a batch job, or if the broker from a batch job itself
failed.

Add the job executable name to the doom exception outputs.

Fixes flux-framework#6357
@chu11
Copy link
Member Author

chu11 commented Nov 22, 2024

re-pushed with tweaks per comments above

  • shell: only add basename to exception note
  • flux job attach: add jobid to output message

so now it looks like this (example from t2614-job-shell-doom)

1.105s: job.exception type=exec severity=0 jobid=f4m65MP5 test2.sh: shell rank 1 on host corona211 exited and exit-timeout=1s has expired

@grondo
Copy link
Contributor

grondo commented Nov 22, 2024

1.105s: job.exception type=exec severity=0 jobid=f4m65MP5 test2.sh: shell rank 1 on host corona211 exited and exit-timeout=1s has expired

Again, it would be shorter to just put the jobid as a prefix:

1.105s job.exception f4m65MP5 type=exec severity=0 test2.sh: shell rank1 on host corona211 exited and exit-timeout=1s has expired

Since there's no mistaking that's the jobid. However, that's pretty minor and probably not worth quibbling about.

Edit: moved jobid after job.exception in example output.

@chu11 chu11 force-pushed the issue6357_doom_exception branch 2 times, most recently from 9449739 to 69e5ceb Compare November 22, 2024 23:10
@chu11
Copy link
Member Author

chu11 commented Nov 22, 2024

ok, re-pushed with the output message tweak.

Also had to tweak two tests that had to be updated for the expected output.

Edit: output is now

0.106s: job.exception f5MVaghD type=exec severity=0 test2.sh: shell rank 1 on host corona211 failed and exit-on-error is set

Comment on lines +102 to +107
static char buf[PATH_MAX+1] = {0};
json_t *s = json_array_get (doom->shell->info->jobspec->command, 0);
const char *path = json_string_value (s);
char *rv;
snprintf (buf, PATH_MAX, "%s", path);
rv = basename (buf);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be simpler to just do something like

const char *p = strrchr (s, "/");
const char *base = p ? p + 1 : s;

And avoid the copy? Maybe we should have our own implementation of basename() that doesn't modify its argument (I think some places in the codebase may already assume GNU basename(3), which doesn't modify its arg, but since we support alpine and someday BSD, that's probably not safe).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, that would be better. When I read the manpage for basename(3) I was like "huh, it needs to modify the buffer?" Lets go with your simpler approach.

@grondo
Copy link
Contributor

grondo commented Nov 22, 2024

Ah, bummer. The change I suggested breaks some flux-sched checks which are matching flux job attach output exactly :-( (Since it affects all job exceptions, not just the exit-timeout)

Sorry if that was a bad suggestion.

@chu11
Copy link
Member Author

chu11 commented Nov 23, 2024

Ah, bummer. The change I suggested breaks some flux-sched checks which are matching flux job attach output exactly :-(

those tests are easily fixable (as is one I missed in t2304-sched-simple-alloc-check.t). Thinking we should keep the new style? Or are you thinking we should tweak?

Problem: In a batch job, it may be difficult to discern which
job had an exception.

When flux job attach outputs an exception, also output the jobid
of the job.

Update tests in t2608-job-shell-log.t and t2304-sched-simple-alloc-check.t
for change in expected output.
@grondo
Copy link
Contributor

grondo commented Nov 23, 2024

Up to you if you don't mind submitting a flux-sched PR

Copy link

codecov bot commented Nov 23, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 83.63%. Comparing base (8015f88) to head (31800a8).

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6453      +/-   ##
==========================================
+ Coverage   83.61%   83.63%   +0.02%     
==========================================
  Files         523      523              
  Lines       87547    87555       +8     
==========================================
+ Hits        73199    73228      +29     
+ Misses      14348    14327      -21     
Files with missing lines Coverage Δ
src/cmd/job/attach.c 85.75% <100.00%> (+0.15%) ⬆️
src/shell/doom.c 88.34% <100.00%> (+0.60%) ⬆️

... and 10 files with indirect coverage changes

---- 🚨 Try these New Features:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

shell: doom timeout exception should include argv[0], jobid of affected job
2 participants