
fix: always request total needed memory, as snakemake seems to count as single process #4

Merged
BEFH merged 3 commits into BEFH:main from dlaehnemann:patch-1 on Mar 13, 2024

Conversation

dlaehnemann
Contributor

We have a cluster with the following setting:

        LSB_JOB_MEMLIMIT = N

For this case, the LSF docs on LSB_JOB_MEMLIMIT specify:

When LSB_JOB_MEMLIMIT is N or not defined, the LSF-enforced per-job limit is disabled, and the OS-enforced per-process limit is enabled.

So your previous interpretation was that each requested CPU is its own process. However, I think that the whole job is always a single process, because it simply runs snakemake again inside the main bsub job. So I think we always need to request the full amount of needed memory via -R rusage[mem={mem_}], no matter what the LSB_JOB_MEMLIMIT setting is.

As a bit of backup / reasoning, here's what I saw for a rule that has threads: 8 specified and resources: mem_mb = lambda wc, threads: threads * 4000, when run with the executor-plugin-lsf without the change proposed here. TL;DR: it only reserves the memory for one thread/CPU, but the process requires the total memory.
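
For context, such a rule would look roughly like the following sketch (input/output paths and the shell command are placeholders, not the actual workflow rule):

    rule bwa_mem:
        input:
            reads="reads/{sample}.fastq.gz",   # placeholder path
            idx="index/genome",                # placeholder path
        output:
            "mapped/{sample}.bam",             # placeholder path
        threads: 8
        resources:
            # 4000 MB per thread, i.e. 32000 MB in total for the whole job
            mem_mb=lambda wc, threads: threads * 4000,
            lsf_queue="long",
        shell:
            "bwa mem -t {threads} {input.idx} {input.reads} | samtools sort -o {output} -"

Here's the logging output (shortened and redacted a bit):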

TERM_MEMLIMIT: job killed after reaching LSF memory usage limit.
Exited with exit code 1.

Resource usage summary:

    CPU time :                                   15.81 sec.
    Max Memory :                                 4000 MB
    Average Memory :                             1403.00 MB
    Total Requested Memory :                     4000.00 MB
    Delta Memory :                               0.00 MB
    Max Swap :                                   -
    Max Processes :                              10
    Max Threads :                                26
    Run time :                                   57 sec.
    Turnaround time :                            57 sec.

The output (if any) follows:

[...]
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided remote nodes: 1
Provided resources: mem_mb=32000, mem_mib=30518, disk_mb=11452, disk_mib=10922
Select jobs to execute...
Execute 1 jobs...

[Tue Mar 12 23:20:42 2024]
rule bwa_mem:
[...]
    jobid: 0
    reason: Forced execution
[...]
    threads: 8
    resources: mem_mb=32000, mem_mib=30518, disk_mb=11452, disk_mib=10922, tmpdir=<TBD>, lsf_queue=long

[...]
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 8
Rules claiming more threads will be scaled down.
Provided resources: mem_mb=32000, mem_mib=30518, disk_mb=11452, disk_mib=10922
Select jobs to execute...
Execute 1 jobs...

[Tue Mar 12 23:20:51 2024]
localrule bwa_mem:
[...]
    jobid: 0
    reason: Forced execution
[...]
    threads: 8
    resources: mem_mb=32000, mem_mib=30518, disk_mb=11452, disk_mib=10922, tmpdir=/local/<user>/cluster_tmp, lsf_queue=long

Activating conda environment: .snakemake/conda/5dbb9e4bb1f45b4c08879cdb19603237_
Activating conda environment: .snakemake/conda/5dbb9e4bb1f45b4c08879cdb19603237_
Traceback (most recent call last):
  File "/omics/odcf/analysis/OE0377_projects/ec_dna/analyses/colo_320_dm/.snakemake/scripts/tmpdkme9zmf.wrapper.py", line 68, in <module>
    shell(
  File "/omics/groups/OE0377/internal/laehnemann/mambaforge/envs/snakemake_8/lib/python3.12/site-packages/snakemake/shell.py", line 297, in __new__
    raise sp.CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'set -euo pipefail;  (<command>' returned non-zero exit status 1.
RuleException:
CalledProcessError in file https://raw.githubusercontent.com/snakemake-workflows/dna-seq-short-read-circle-map/v1.2.3/workflow/rules/mapping.smk, line 18:
Command '<command>' returned non-zero exit status 1.
  File "https://raw.githubusercontent.com/snakemake-workflows/dna-seq-short-read-circle-map/v1.2.3/workflow/rules/mapping.smk", line 18, in __rule_bwa_mem
[Tue Mar 12 23:21:17 2024]
Error in rule bwa_mem:
    jobid: 0
[...]

Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Storing output in storage.
WorkflowError:
At least one job did not complete successfully.

[Tue Mar 12 23:21:17 2024]
Error in rule bwa_mem:
    jobid: 0
[...]

Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Storing output in storage.
WorkflowError:
At least one job did not complete successfully.

@dlaehnemann
Contributor Author

P.S.: With the fix, it reserves the memory as I would expect and runs through. Forgot to clearly state this... 😅

@BEFH
Owner

BEFH commented Mar 13, 2024

I'm going to need to take a look at this. The issue is that on some clusters, -R rusage[mem={mem_}] is per job, and on some clusters, it's per thread. On our cluster, it is per-thread.

The previous LSF profile actually used the Snakemake mem_mb value in -R rusage[mem=MEM], which went against the Snakemake definitions for most LSF clusters. This is an attempt to address that, which I believe was working on our cluster.

If you have provided per-thread memory requests, I have a helper function to dynamically correct that.
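
Purely to illustrate what I mean (this is not the plugin's actual helper, and the per-thread flag is a placeholder), the correction could look roughly like this:

    # Hypothetical sketch, not the actual plugin code: scale a per-thread
    # memory request up to the total that the whole job needs.
    def total_mem_mb(mem_mb: int, threads: int, mem_is_per_thread: bool) -> int:
        """Return the memory (in MB) to request for the entire job."""
        # If the workflow declared mem_mb per thread, multiply by the thread
        # count; if it already declared the job total, pass it through.
        return mem_mb * threads if mem_is_per_thread else mem_mb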

Could you please let me know if -R rusage[mem=MEM] is per job or per thread on your cluster?

@dlaehnemann
Contributor Author

I amended my PR after some further digging. Here's the reasoning behind this, with the relevant documentation being:
https://www.ibm.com/docs/en/spectrum-lsf/10.1.0?topic=o-r#bsub_R_description__title__3

From what I understand, the -R rusage[mem={mem_}/job] setting will overwrite any existing cluster-wide settings and ensure that memory is always reserved for the entire job (not per host or task). I think this is what we want snakemake to always do in this context, as this always submits one job and we know how much memory the entire job should use.
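
For the example rule above, the relevant part of the resource request would then, as far as I understand it, be assembled roughly like this (a sketch, not the exact plugin code):

    # Illustrative only: the rusage string with the "/job" reservation method
    # for a job that needs 32000 MB in total.
    mem_mb = 32000
    rusage = f"-R 'rusage[mem={mem_mb}/job]'"
    # -> "-R 'rusage[mem=32000/job]'"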

Does this make sense? Could you try this "at home"?

@BEFH
Owner

BEFH commented Mar 13, 2024

I have confirmed that -R rusage[mem={mem_}/job] works! We now need to simplify the code to get the LSF settings, since I do not believe we are using that setting.

@dlaehnemann
Contributor Author

Good point, I think I have now done this. But please double-check in your code review.

@BEFH
Owner

BEFH commented Mar 13, 2024

This looks good from my reading of it, but we don't have any real unit tests in the repo other than code quality checks. I am going to manually change the pyproject.toml and make a new release later today, and then I will do some more testing. (I don't have Release Please set up right now.)

Thanks so much for the help!

@BEFH merged commit ffc0c8b into BEFH:main on Mar 13, 2024
3 checks passed
@dlaehnemann deleted the patch-1 branch on March 13, 2024 at 17:56
@dlaehnemann
Contributor Author

Many thanks for setting up this executor. It's a huge head start for the snakemake 8 transition.

And thanks for being so responsive. I'll keep filing PRs as further stuff comes up. And should you want to share the maintenance burden at some point, let me know.

@dlaehnemann
Contributor Author

Also, I'll watch out for the bioconda bot auto-update of the respective recipe, and shepherd that along. Unless you beat me to it... ;)

@dlaehnemann
Contributor Author

It seems like this /job syntax actually runs into an LSF bug on the system that I work on. They check for a correct setting of the LSB_SUB_MEM_USAGE parameter mentioned here:
https://www.ibm.com/docs/en/spectrum-lsf/10.1.0?topic=controls-configuration-enable-job-submission-execution

And this does not get parsed correctly by LSF if you add a resource reservation method to the rusage string. So currently this breaks our setup, but I haven't found a way to check for this more generally. So this is basically just a ping, in case you have another idea for how to achieve a more general behaviour. Otherwise, we'll have to wait for a workaround on our system, or a fix of the bug in LSF...

@BEFH
Owner

BEFH commented Mar 19, 2024 via email

@dlaehnemann
Contributor Author

That's a good idea, we could try that.

An alternative thought: Do you know which exact setting determines that memory is requested per thread? If so, we could query that setting and only append the /job specifier in this case. That way, we would not have it in there unnecessarily...
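
Something along those lines is what I have in mind (just a sketch, since we don't yet know which lsf.conf parameter to query; the setting name below is a placeholder):

    # Hypothetical sketch: only append the "/job" reservation method when the
    # cluster reserves rusage[mem] per thread/slot. "PER_SLOT_MEM_SETTING" is a
    # placeholder for whichever lsf.conf parameter actually controls this.
    def rusage_mem(mem_mb: int, lsf_config: dict) -> str:
        per_slot = lsf_config.get("PER_SLOT_MEM_SETTING", "n").lower() == "y"
        suffix = "/job" if per_slot else ""
        return f"rusage[mem={mem_mb}{suffix}]"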

@BEFH
Owner

BEFH commented Mar 19, 2024

I have looked and not been able to find this. I thought I had, but no luck.

@dlaehnemann
Contributor Author

So I got some extra pointers from our IT. From what I understand, this is how they set up things so that memory requests are set and enforced per job (and per host):

  1. LSB_JOB_MEMLIMIT = N in lsf.conf (to disable per-job memory limits)
  2. LSB_MEMLIMIT_ENFORCE=N in lsf.conf (to disable per-process memory limits)

And instead they enforce it via setting:

  1. LSB_RESOURCE_ENFORCE="memory" in lsf.conf (where memory is one of several possible space-separated entries)

There is a docs page for LSB_RESOURCE_ENFORCE and a page providing more detail on this setting for memory.

Do these variables point to any differences that you can see in your local setup that would explain the per-thread behaviour?

Otherwise, I think this command-line option sounds like a good alternative to the /job syntax:

https://www.ibm.com/docs/en/spectrum-lsf/10.1.0?topic=options-hl

So I will open a PR with this, and then you can also test this in your setting.
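
Roughly, the idea would be to add that flag to the submission call, something like the following (only a sketch; the other options shown are just for illustration):

    # Illustrative only: pass "-hl" so that memory limits are enforced per job
    # on the execution host, instead of tagging the rusage string with "/job".
    # The surrounding options and the trailing "..." are placeholders.
    mem_mb = 32000
    bsub_cmd = f"bsub -hl -M {mem_mb} -R 'rusage[mem={mem_mb}]' ..."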

@BEFH
Owner

BEFH commented Mar 19, 2024

I was previously using and maybe misunderstanding mem_perjob = self.lsf_config.get("LSB_JOB_MEMLIMIT", "n").lower(). This seems to be a custom config edge case on your server, and I wonder if other sysadmins are doing the same thing in different ways.

@dlaehnemann
Contributor Author

Just to link this up, the new PR is #5.

@dlaehnemann
Contributor Author

I was previously using and maybe misunderstanding mem_perjob = self.lsf_config.get("LSB_JOB_MEMLIMIT", "n").lower(). This seems to be a custom config edge case on your server, and I wonder if other sysadmins are doing the same thing in different ways.

With the state of the docs, my working theory is that any server setup will end up being a config edge case... 👀

Also, there is a lot to misunderstand in all these docs, with things seemingly happening implicitly before and during submission that are not transparent to the user. But let's try to at least get a solution that works for both our setups. 🙈
