add example for job info #229

Open
wants to merge 3 commits into
base: master

vsoch
Member

@vsoch vsoch commented Apr 26, 2023

Problem: We do not have good examples for replicating flux job info in Python
Solution: Add an interactive demo

Ping @xorJane !

@chu11
Member

chu11 commented Apr 26, 2023

I'm curious what the request and/or need was for this. flux job info is a "plumbing" command and we don't really advertise it, i.e. we decided not to put flux job info in the manpage for flux-job(1).

If a user has requested that they really want or need to get the jobspec or something for their job, and they specifically want to do it in Python, perhaps we need to discuss how best to do that in Python, i.e. if a new function needs to be added, etc.

@vsoch
Member Author

vsoch commented Apr 26, 2023

@xorJane asked me directly on Mattermost for this example, and given that Python interactions might want this complete metadata, it’s worth adding. If we wind up changing the UI we can just update the example.

TLDR: a real request for this exact example to demonstrate it’s needed and should be provided.

Also, I’ve mentioned before (and asked multiple times) for getting job info on the command line and in Python and disagree about it being a “plumbing” command. Getting job info back is hugely useful in many contexts.

@vsoch
Member Author

vsoch commented Apr 26, 2023

See flux-framework/flux-core#4761 as just one of the times, and a clearly laid out user interaction.

@chu11
Member

chu11 commented Apr 26, 2023

See flux-framework/flux-core#4761 as just one of the times, and a clearly laid out user interaction.

The data provided by that feature is different from what is provided by flux job info. That data is more closely related to what is provided by flux jobs, and we added a function into the same area of the Python API that flux jobs uses.

Also, I’ve mentioned before (and asked multiple times) for getting job info on the command line and in Python and disagree about it being a “plumbing” command.

Maybe we have different definitions of "plumbing". As you note, the keys output by flux job info are:

    J
    R
    eventlog
    jobspec
    guest.exec.eventlog
    guest.input
    guest.output

The eventlogs cannot be understood without reading the RFCs, and the stdin/stdout cannot be grokked without understanding our stdin/stdout protocol and encoding. That's sort of why we don't advertise it: a normal user would not have any understanding of what is being returned.

@xorJane can you provide more detail on what was the request and need? Perhaps there is a specific piece of data that was desired, and it just isn't provided via job-list/flux jobs at the moment? Update: Or maybe they were just interested in getting the jobspec?

Edit:

Getting job info back is hugely useful in many contexts.

I'm not disagreeing with "job information" in general, you are correct. I'm speaking specifically of the information provided by flux job info. You show the eventlog output from guest.output in your PR. Does a user really want to parse and decode that information to get stdout? Granted we have issue flux-framework/flux-core#4854 open and don't have a good API for it yet, but would calling flux job attach from within Python make more sense?

@grondo
Contributor

grondo commented Apr 26, 2023

FWIW, I think it is fine if users use flux job info or the job-info module interface to grab the jobspec or R for the job. For example you can't view the original job environment or the full command line arguments, or the exact resource spec without it. Sometimes it is fine to use a plumbing command or API if necessary. 🤷

That being said we do have an open issue on providing a more user appropriate interface to fetch job output, and we probably need a nice Python API for getting the original jobspec, since that requires fetching the signed J from job-info and decoding it.

Note that the jobspec you are fetching directly from the KVS is the version with its environment (and possibly other keys) redacted and modified by the instance for its use. Also, only the instance owner can fetch directly from the KVS like this, so this will only work for the instance owner (i.e. a normal user couldn't use these examples in a multi-user instance)

flux job info --original jobspec takes care of fetching J and "unwrapping" it for the user, and in this way it is probably less of a plumbing command, and for now is the suggested method to get the original jobspec.
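
To make the "unwrapping" step concrete, here is a minimal sketch in plain Python. It assumes that J is a dot-separated HEADER.PAYLOAD.SIGNATURE string whose payload is base64-encoded JSON; that layout is an assumption here and should be checked against the Flux RFCs. The function name `unwrap_payload` and the sample data are hypothetical, the signature is not verified, and `flux job info --original jobspec` remains the suggested method.

```python
import base64
import json


def unwrap_payload(signed):
    """Return the decoded JSON payload of a HEADER.PAYLOAD.SIGNATURE string.

    Assumption: the payload is base64-encoded JSON. The signature is
    ignored, so this is for inspection only, never for trust decisions.
    """
    _header, payload, _signature = signed.split(".", 2)
    return json.loads(base64.b64decode(payload))


# Hypothetical jobspec and a hand-built stand-in for J (not real Flux output).
sample_jobspec = {
    "version": 1,
    "tasks": [{"command": ["hostname"]}],
    "attributes": {"system": {"cwd": "/tmp", "environment": {"HOME": "/home/user"}}},
}
fake_j = ".".join([
    base64.b64encode(b"placeholder-header").decode(),
    base64.b64encode(json.dumps(sample_jobspec).encode()).decode(),
    "none",
])

original = unwrap_payload(fake_j)
print(original["attributes"]["system"]["cwd"])
```

The point of the sketch is only that the original environment and working directory live inside the signed payload, which is why they are not visible in the redacted jobspec stored in the KVS.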

@chu11
Member

chu11 commented Apr 26, 2023

FWIW, I think it is fine if users use flux job info or the job-info module interface to grab the jobspec or R for the job. For example you can't view the original job environment or the full command line arguments, or the exact resource spec without it. Sometimes it is fine to use a plumbing command or API if necessary. 🤷

I recall that we decided not to advertise it in the flux-job(1) manpage. I'm not sure if that was generically because it was "plumbing", or because we thought flux job info might confuse users into thinking it serves the same purpose as flux jobs. Or possibly for both reasons.

Should we start to advertise it? Or perhaps flux job info --original jobspec only? Or should we slightly tweak it? Perhaps just outputting something to the effect:

    general:
    jobspec (consider using w/ --original)
    R
    
    advanced - only if you know what you're doing:
    J
    eventlog
    guest.exec.eventlog
    guest.input
    guest.output

would make it better.

🤷

Edit: oh here's an idea, only list the advanced ones with a --verbose or similar. If we only advertise retrieving jobspec/R, it begins to look more user facing.
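
The --verbose idea above could be sketched as follows. This is only an illustration of the proposal: the grouping comes from the list in this comment, while the function name `list_keys` and the `verbose` parameter are hypothetical and do not describe how flux job info behaves today.

```python
# Key grouping taken from the proposal above; names are illustrative only.
GENERAL_KEYS = ["jobspec", "R"]
ADVANCED_KEYS = ["J", "eventlog", "guest.exec.eventlog", "guest.input", "guest.output"]


def list_keys(verbose=False):
    """Return the keys to advertise; advanced keys appear only when verbose."""
    keys = list(GENERAL_KEYS)
    if verbose:
        keys.extend(ADVANCED_KEYS)
    return keys


print("default:", ", ".join(list_keys()))
print("verbose:", ", ".join(list_keys(verbose=True)))
```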

@chu11
Member

chu11 commented Apr 26, 2023

re-reading this PR's contents, I think maybe what confused me is that it's really 3 different examples lumped into one, all under the heading of "job info", which I think maybe isn't the best way to organize this. I think what might be better is:

  • one example that covers getting general job information about a job, could list jobs for a user, get job information via job id, the basic "job list" kinda stuff. This is what 90% of users using python w/ flux would care about.

  • one example would cover reading stdout from a job. I think reading from guest.output is way too complex to document on the website and would be better documented as a workaround in an issue for the time being. But if there is strong opinion to publish an example on the website, that can be changed when flux-framework/flux-core#4854 (libflux/python: convenience function to read stdio) is completed.

  • a third example would be reading the jobspec, which goes into the kvs api space, etc. and some of the caveats that @grondo mentions above can also be covered. If/when we come up with a better API for this, we can update that example.

@vsoch
Member Author

vsoch commented Apr 26, 2023

FWIW, I think it is fine if users use flux job info or the job-info module interface to grab the jobspec or R for the job,

@grondo thank you for hearing me.

one example that covers getting general job information about a job, could list jobs for a user, get job information via job id, the basic "job list" kinda stuff. This is what 90% of users using python w/ flux would care about.

I again respectfully disagree. As a user I don't place things into these same categories based on design (that a developer would be biased to see). When I want "job information" or go looking for a tutorial to show me how to do that, I want the whole gamut of things, from the original jobspec, to the output contents, to the core info like status / return code. I don't want to have to know there are three different tutorials because (in the mind of the developer) "but they are different!" I am a layperson, I submit a job, and I want to know everything about it.

@xorJane
Contributor

xorJane commented Apr 26, 2023

Hi all! Just chiming in late with the context for my request.

Post TOSS 4 updates on LC, users have asked about the disappearance of the checkjob command on systems running Slurm, which is a handy Moab utility for seeing all sorts of job stats. Since we're not supporting Moab anymore, Jeff Long & I put together a tool that nicely formats some of the output you can get with squeue and sacct. The result looks something like

    janeh@pascal83:~$ slurm_jobinfo.py -j 16436_9
    JobID        :  <JobID>
    JobName      :  Symmetric
    State        :  PENDING
    User         :  <user>
    Group        :  <user's group>
    Account      :  <bank>
    Partition    :  pbatch
    QOS          :  normal
    Timelimit    :  1-00:00:00
    Submit       :  Tue 04/25 08:49:06
    Eligible     :  Tue 04/25 08:49:07
    Start        :  Unknown
    End          :  Unknown
    Elapsed      :  00:00:00
    Priority     :  124135
    NNodes       :  4
    NCPUS        :  144
    MinCPUNode   :
    NodeList     :  None assigned
    WorkDir      :  /p/lustre1/<user directory>
    SubmitLine   :  sbatch --array=0-12 Submit_4_nodes.sh
    Dependency   :  (null)
    EstStart     :  N/A
    Reason       :  (AssocMaxJobsLimit)

A user requested a similar utility for Flux (as well as more tutorials on how to get job stats with native Flux commands), and I was trying to use the Python APIs to get there. What I have so far:

    janeh@tioga10:~$ flux_jobinfo.py -j fnwSSLATbTM
    Getting info for Flux JobID: fnwSSLATbTM
    JobID        :  fnwSSLATbTM
    JobName      :  flux
    State        :  RUN
    User         :  <user>
    Partition    :  pdebug
    Timelimit    :  4:00:00
    Submit       :  Wed 04/26/2023 09:20:44
    Elapsed      :  2:14:53
    Start        :  Wed 04/26/2023 09:20:44
    End          :  Wed 04/26/2023 13:20:44
    NNodes       :  1
    NCores       :  64
    NTasks       :  1
    NodeList     :  tioga26
    Dependencies :  []
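
For readers who want to reproduce this kind of report, the formatting step might look like the sketch below. It is not the actual flux_jobinfo.py: the record is a hand-written dictionary with hypothetical field names, and fetching the values from Flux is deliberately left out. Durations are assumed to arrive as seconds.

```python
from datetime import timedelta

# Hypothetical job record; values mirror the sample output above.
job = {
    "JobID": "fnwSSLATbTM",
    "JobName": "flux",
    "State": "RUN",
    "Timelimit": 4 * 60 * 60,  # seconds (assumed unit)
    "Elapsed": 8093,           # seconds (assumed unit)
    "NNodes": 1,
    "NodeList": "tioga26",
}

# Keys whose values are durations in seconds and should print as H:MM:SS.
DURATION_KEYS = {"Timelimit", "Elapsed"}


def format_job(record, width=13):
    """Render a record as aligned 'Key : value' lines."""
    lines = []
    for key, value in record.items():
        if key in DURATION_KEYS:
            value = str(timedelta(seconds=int(value)))
        lines.append(f"{key:<{width}}:  {value}")
    return "\n".join(lines)


report = format_job(job)
print(report)
```

Keeping the formatting separate from the data lookup means the same function can render records gathered from any source.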

Thanks so much for helping me out @vsoch!

@chu11
Member

chu11 commented Apr 26, 2023

I again respectfully disagree. As a user I don't place things into these same categories based on design (that a developer would be biased to see). When I want "job information" or go looking for a tutorial to show me how to do that, I want the whole gamut of things, from the original jobspec, to the output contents, to the core info like status / return code. I don't want to have to know there are three different tutorials because (in the mind of the developer) "but they are different!". I am a layperson, I submit a job, and I want to know everything about it.

I disagree with this a bit.

  1. Part of this is information overload. The average user mostly wants status / return code / stuff from flux jobs. Showing them a whole bunch of more advanced information can be confusing.

  2. It's not just because of "developer division", it's about what a person is looking for. If a person asks, "I'd like to look at stdout for my job", I don't think the first place they will look is for something under the header "job info". They would look under something related to "stdio" or "stdout".

FWIW I have been a user before (https://github.com/LLNL/magpie). So my opinions aren't coming solely from developer land. It's coming from that experience as well as experience aiding users from that project. I suspect you might think I am trying to "baby" users, and that is sort of what I lean towards based on my experiences.

That said, my experiences can be very biased to the users I was helping. So I don't know what collective opinion would be here on this topic.

@vsoch
Member Author

vsoch commented Apr 26, 2023

Part of this is information overload.

But it's not. The tutorial itself isn't that long, and it's neatly packaged under a single label that makes sense. When a person finds what they need, they are good. Information overload would be giving them three separate tutorials that seem to be for similar things, and requiring them to read through all of them to put together a single, cohesive picture. The larger issue right now is that they can't currently find information on how to do this. This is why I've had to come into the Flux Slack umpteen times and ask "How do I get output? How do I get a return code?" I couldn't find it. It's a signal when @xorJane comes to me and asks how to do something: it tells me that the docs don't do a good enough job of giving her that information. It means it cannot be found where someone went looking for it. It might be obvious to a core developer, but it's not obvious to a developer user.

"I'd like to look at stdout for my job", I don't think the first place they will look is for something under the header "job info". They would look under something related to "stdio" or "stdout".

Having worked on many client tools for many years, I respectfully again disagree. In fact, the user doesn't even know what they are looking for, so "job info" would click in their head as "information about my job" - yes! Many users don't even think in terms of stdin and stdout, those are more advanced concepts (maybe for power users, which maybe the lab is biased to have, but not most centers).

FWIW I have been a user before (https://github.com/LLNL/magpie). So my opinions aren't coming solely from developer land. It's coming from that experience as well as experience aiding users from that project. I suspect you might think I am trying to "baby" users, and that is sort of what I lean towards based on my experiences.

I don't think you are trying to baby users, I'm just not sure you are putting yourself in all of the different shoes you might! I've worked in different contexts sitting within labs and also providing support for users at Duke, Stanford, and (not much here) but several fairly large open source communities. My bias / perspective comes from both being a user, and learning over time how to put myself in their mental map and then best derive a piece of documentation or similar to make it understandable.

And to be clear, I would be totally in support of refactor / change of the interactions themselves, but until that happens, this is currently how someone would do this, and I think we should provide it as a simple tutorial for those that come looking for it. It can be updated later if needed. It doesn't make sense to me to split it up, or continue to hide information just because of personal opinions about labeling it plumbing or not.

My 0.02.

@chu11
Member

chu11 commented Apr 26, 2023

I guess we'll just agree to disagree. As a side question, what part of @xorJane's script needed the jobspec? From the above, it looks like everything is from job list.

Additional comments.

  • As I said above, flux job info was intentionally left out of flux-job(1) in the past, so if this is becoming user facing, that needs to be corrected. So we should make that a TODO.

  • I would recommend you decode the guest.output into actual stdout in your example. I'm not sure showing the raw eventlog output as "standard output" is very useful, and if someone does want to get stdout from a job, they'll need to know how to decode it.
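
As a rough illustration of what that decoding involves, the sketch below parses a hand-written eventlog and reassembles stdout. It assumes the log is one JSON object per line, that output arrives in events named "data" whose context carries "stream" and "data" fields, and that an optional "encoding" of "base64" marks encoded chunks. Those details are assumptions to be checked against the Flux RFCs; the sample log and the function name `decode_output` are hypothetical.

```python
import base64
import json

# Hand-written stand-in for the contents of guest.output (not real Flux output).
SAMPLE_EVENTLOG = "\n".join(json.dumps(event) for event in [
    {"timestamp": 1682526044.0, "name": "header", "context": {"version": 1}},
    {"timestamp": 1682526044.1, "name": "data",
     "context": {"stream": "stdout", "rank": "0", "data": "hello\n"}},
    {"timestamp": 1682526044.2, "name": "data",
     "context": {"stream": "stdout", "rank": "0", "encoding": "base64",
                 "data": base64.b64encode(b"world\n").decode()}},
    {"timestamp": 1682526044.3, "name": "data",
     "context": {"stream": "stderr", "rank": "0", "data": "warning\n"}},
    {"timestamp": 1682526044.4, "name": "data",
     "context": {"stream": "stdout", "rank": "0", "eof": True}},
])


def decode_output(eventlog, stream="stdout"):
    """Concatenate the data chunks of one stream from an output eventlog."""
    chunks = []
    for line in eventlog.splitlines():
        if not line.strip():
            continue
        event = json.loads(line)
        context = event.get("context", {})
        # Skip non-data events, other streams, and EOF markers without data.
        if event.get("name") != "data" or context.get("stream") != stream:
            continue
        if "data" not in context:
            continue
        data = context["data"]
        if context.get("encoding") == "base64":
            data = base64.b64decode(data).decode("utf-8", errors="replace")
        chunks.append(data)
    return "".join(chunks)


stdout_text = decode_output(SAMPLE_EVENTLOG)
stderr_text = decode_output(SAMPLE_EVENTLOG, stream="stderr")
print(stdout_text, end="")
```

Even this small version has to deal with event filtering, per-stream separation, and encodings, which is a fair picture of why the raw eventlog is hard to present as "standard output".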

@grondo
Contributor

grondo commented Apr 26, 2023

@xorJane, would you mind if I copied your comment into a new flux-core issue (or perhaps a discussion)? I think many of the items you are looking for are already available from the JobInfo objects returned from JobList, but it might be nice if we kept our suggestions in a separate issue so it isn't mixed up with the discussion about documentation.

Then as we find holes in the API where data is missing, we can open up separate issues if necessary (I do think we should offer a high-level Python API call to get the original or redacted jobspec for example)

@grondo
Contributor

grondo commented Apr 26, 2023

what part of @xorJane's script needed the jobspec? From the above, it looks like everything is from job list.

Well there's WorkDir (cwd)...

@chu11
Member

chu11 commented Apr 26, 2023

Well there's WorkDir (cwd)...

Ahhh I was only looking at the flux output, and I now see that it's an in-progress work.

@xorJane
Contributor

xorJane commented Apr 26, 2023

@grondo I don't mind at all! Also, does Flux offer job start prediction, at least soon before the job actually starts? I noticed that the reported start time is often 12/31/1969 before a job starts, but I'm wondering if that reporting changes closer to runtime.

@grondo
Contributor

grondo commented Apr 26, 2023

Also, does Flux offer job start prediction

With Fluxion (flux-sched) there is an optional t_estimate scheduler annotation (i.e. sched.t_estimate). However, I've only seen it working on the highest priority pending job. There's an issue open on that here: flux-framework/flux-sched#1015

noticed that the reported start time is often 12/31/1969 before a job starts

Yes, the t_* fields are initialized to zero, so a job that hasn't started yet will have a start time of 0 seconds since epoch or 12/31/1969... Same for other time fields, e.g. t_cleanup etc.
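
A short sketch of why a zeroed timestamp prints as 12/31/1969, and one way a reporting script could show "Unknown" instead. The helper name `format_time` and the fixed UTC-7 offset are assumptions made for illustration; a real script would use the local timezone.

```python
from datetime import datetime, timedelta, timezone

# Fixed offset standing in for a US Pacific daylight-time clock (assumption).
PDT = timezone(timedelta(hours=-7))


def format_time(t, tz=PDT):
    """Format seconds since the epoch; treat an unset (zero) value as Unknown."""
    if not t:
        return "Unknown"
    return datetime.fromtimestamp(t, tz=tz).strftime("%a %m/%d/%Y %H:%M:%S")


# Epoch zero is 1970-01-01 00:00 UTC, which is still 12/31/1969 west of UTC.
raw = datetime.fromtimestamp(0, tz=PDT).strftime("%m/%d/%Y")
print(raw)                      # 12/31/1969
print(format_time(0))           # Unknown
print(format_time(1682526044))  # Wed 04/26/2023 09:20:44
```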

@vsoch
Member Author

vsoch commented Apr 26, 2023

I would like to point out that we have two people (myself and Jane) saying "we noticed the absence of a small bit of documentation we want/need" (and we've put in the work to figure it out and provide it for others) and the first response is "you don't really need/want it."

Feels a bit... off. I certainly hope this is not how a new contributor would be received here.

@chu11
Member

chu11 commented Apr 26, 2023

I would like to point out that we have two people (myself and Jane) saying "we noticed the absence of a small bit of documentation we want/need" (and we've put in the work to figure it out and provide it for others) and the first response is "you don't really need/want it."

Apologies if that is how it came off. The issue was not the job list stuff, that was absolutely important and necessary.

The subtlety was there was a conscious decision to not document flux job info in the past. Maybe that was a bad decision by flux-core people, but it was a decision that was made. So I was trying to understand what need there was, and subsequently if there was a better way, or a new way should be created.

@vsoch
Member Author

vsoch commented Apr 26, 2023

The subtlety was there was a conscious decision to not document flux job info in the past. Maybe that was a bad decision by flux-core people, but it was a decision that was made. So I was trying to understand what need there was, and subsequently if there was a better way, or a new way should be created.

I think that's totally valid! I think my high level observation (and suggestion for the future) might be a slight tweak to how this is communicated. I'll also say there are no hard feelings - I've learned in my open source experience it's important to have very thick skin on these issue threads (cue memory of me biking home early in my OSS experience completely sobbing because of a conversation, lol). If it helps, what I try to do when there is a new contributor (or someone that I am more formal with because I don't know them super well yet) would be something of the following pattern:

  • Thank them for the contribution + additional commentary to restate their goals so they see I understand
  • Clarification of the use case / problem they were having (this also tells them that I'm hearing them properly)
  • Mention of discussion to tweak the work and best address the problem
  • And then technical discussion / details and follow up

So (as a quick example) for the PR here (and there are many ways to skin a cat) but one approach might be like:

Hey <name> - thanks for adding this tutorial! So that I best understand, you were trying to get information about a job using the Flux Python API - and that includes output, job info/metadata, and the jobspec? In the past we had declared this "job info" group to be "plumbing" or "for developer users only" but it sounds like based on your need, we might need to adjust this view. Let's figure out the best way to add these examples, and we can work together to adjust the code here appropriately. Does that sound OK? <ping other project devs> should we rethink how the underlying API is working here, and then if so, should we release this example in the meantime (and update later) since it's the current way to go about it?

That's just one example, but in the above I've thanked the contributor, asked for clarity about their problem, and then explained my view / opinion (hopefully without making them feel like they have to go on the defense with "yes I really want/need this!"). The issue itself (or PR in this case) should be sufficient for that. I'll also emphasize just saying "let's work together on this" to set the original tone. And then after all that (when the contributor feels heard, and involved) I bring in the other devs to start the more technical discussion.

Again, totally no hard feelings - I can't tell you how many times I've messed up with interactions in issues - it's really hard. A lot of times I'll also have negative experiences, and see patterns, or maybe wake up the next morning and realize something isn't sitting right. A lot of it is really subtle, so that's why it's hard. It's important we can talk about it, definitely between the core team here, so when a new contributor does show up, we don't scare them away! 😆

@vsoch
Member Author

vsoch commented May 3, 2023

Just for reference I'm coming here now to reference this tutorial to remember how to fully interact with jobs :)

@vsoch
Member Author

vsoch commented Jun 9, 2023

Another time I'm visiting this PR to copy paste this code for another script that I need to get job info for!

@grondo
Contributor

grondo commented Jun 9, 2023

Looks like this has some conflicts. Also the PR branch is pushed to this repo instead of a fork, and as we saw that seems to break mergify. I guess you'll have to create a new PR or we can merge this one manually. However, I'd suggest creating all PRs from a personal fork in the future.

vsoch added 3 commits June 9, 2023 18:10
Problem: We do not have good examples for replicating
flux job info in Python
Solution: Add an interactive demo

Signed-off-by: vsoch <[email protected]>
@vsoch vsoch force-pushed the add/flux-python-info-example branch from 92e212b to da924eb Compare June 10, 2023 00:14
@vsoch
Member Author

vsoch commented Jun 10, 2023

All set!

4 participants