Frequently asked questions about mmlong2

Will my sequenced dataset work with mmlong2?

  • Whether mmlong2 is a good fit for your samples or project is highly situational, but in general the workflow is intended for highly complex metagenomes (e.g. soil, sewage sludge, human gut) and is not optimal for samples with very low microbial diversity (e.g. pure cultures, Zymo Mock DNA Standard).
  • Please keep in mind that mmlong2 is a long-reads-only workflow, designed to work with Nanopore (about 1 % read error rate) or with PacBio HiFi (about 0.1 % read error rate) datasets. Short-read datasets can be used for mapping to improve genome recovery via differential coverage binning, but the workflow is not designed for short-read metagenomic assembly.
  • It is also recommended that the input for mmlong2 consist of at least 1 GB of sequenced data containing multiple prokaryotic organisms (see the sketch after this list for a quick way to check the total yield).
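
A quick way to verify that a dataset roughly meets this guideline is to summarise the reads before launching the workflow. The sketch below assumes seqkit is available; it is not part of mmlong2, and any equivalent read-statistics tool can be used instead.

```bash
# Summarise read count, total bases and length/quality distribution
# of a long-read dataset (seqkit is an assumed helper tool, not a requirement).
# The "sum_len" column reports the total number of sequenced bases.
seqkit stats -a reads_nanopore.fastq.gz
```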

Are there special hardware or software requirements?

  • In general, mmlong2 is designed to be used on HPC clusters with ≥100 threads and ≥300 GB of RAM allocated per workflow run.
  • The metagenomic binning part of the workflow is compute-intensive (optimized for MAG yield) and might take several days to weeks to complete.
  • The mmlong2 workflow has been developed and tested on HPC nodes (Slurm cluster and bare metal) running Ubuntu 22.04.

Is read data pre-processing required by the workflow?

  • It is highly recommended to perform read quality filtering before running mmlong2 (e.g. remove reads below Phred Q10 for Nanopore, and below Phred Q20 for PacBio HiFi and short reads); see the sketch after this list.
  • Trimming off adapter and barcode sequences, as well as filtering out very short reads (e.g. below 200 bp for Nanopore or PacBio data), might also improve genome recovery.
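
As a rough illustration of such a pre-processing pass for Nanopore reads, the sketch below uses Porechop for adapter/barcode trimming and chopper for quality and length filtering; both tools are assumptions here rather than requirements of mmlong2, and the thresholds simply mirror the recommendations above.

```bash
# Trim adapter and barcode sequences (Porechop is one example tool).
porechop -i reads_nanopore.fastq.gz -o reads_trimmed.fastq.gz

# Keep reads with mean quality >= Q10 and length >= 200 bp;
# for PacBio HiFi data a Q20 cut-off would be used instead.
gunzip -c reads_trimmed.fastq.gz \
    | chopper -q 10 -l 200 \
    | gzip > reads_filtered.fastq.gz
```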

Is there a way to test mmlong2 without installing over 100 GB of databases?

  • If you are only interested in recovering the genomes, check out mmlong2-lite, a lightweight version of the pipeline that uses an identical prokaryotic genome recovery procedure and does not require the installation of large databases.

Is it possible to avoid the temporary files generated by the workflow?

  • During a workflow run, temporary files might be generated and not deleted by Snakemake when the run finishes.
  • By default, the current working directory is used to store these temporary files; it is therefore recommended to set up a dedicated directory for temporary files and provide it to mmlong2 through the --temporary_dir option (see the sketch below).
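
As a minimal sketch, the only mmlong2 flag shown below is the --temporary_dir option mentioned above; the scratch path is hypothetical and the remaining arguments are left as a placeholder rather than guessed.

```bash
# Create a dedicated scratch directory for intermediate files
# (/scratch/$USER is a hypothetical path; adjust to your system).
mkdir -p /scratch/$USER/mmlong2_tmp

# Point the workflow to it; [input/output options ...] stands in for
# the rest of your usual mmlong2 command line.
mmlong2 [input/output options ...] --temporary_dir /scratch/$USER/mmlong2_tmp
```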

What should be done when the mmlong2 workflow crashes?

  • If the workflow crashes, start by inspecting the stdout and Snakemake logs for troubleshooting.
  • The workflow can usually be resumed by re-running the same commands.
  • If you want to resume the workflow from a new installation of mmlong2, it is highly recommended to first run the workflow with the --touch option, which marks the already-generated files as up to date and protects them from deletion (see the sketch after this list).
  • If the workflow keeps crashing after several retries, feel free to post the error logs in the GitHub Issues section.
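
Below is a hedged outline of the resume procedure described above; apart from the --touch option, the original command-line arguments are shown only as a placeholder.

```bash
# After reinstalling mmlong2, first mark the existing outputs as up to date
# so the resumed run does not regenerate or delete them ...
mmlong2 [original options ...] --touch

# ... then resume the workflow with the same command as before.
mmlong2 [original options ...]
```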

Can I run the automated analysis modules with custom genomes?

  • Although it is possible to run the genome analysis section with a custom set of genomes by mimicking the workflow directory structure, this is quite technical to achieve and might lead to compatibility issues.
  • A more streamlined method for providing custom genomes to the workflow will be part of a future release.

What about eukaryotic or viral genomes?

  • At the moment, mmlong2 does not support genome recovery for viruses or eukaryotes.
  • Expansion of the binning features, however, is planned for future releases.