Add samtools-depth-single process #1311

Merged: 1 commit, Mar 11, 2024

Contributor

@marcellevstek marcellevstek commented Feb 9, 2024

REID-2234

  • Update CHANGELOG.rst for each commit separately:
    • Pay attention to write entries under the "Unreleased" section.
    • Mark all breaking changes as "BACKWARD INCOMPATIBLE:" and put them
      before non-breaking changes.
    • If a commit modifies a feature listed under "Unreleased" section,
      it might be sufficient to modify the existing CHANGELOG entry from previous
      commit(s).
  • Bump the process version:
    • MAJOR version (first number): Backward incompatible changes (changes
      that break the api/interface). Examples: renaming the input/output, adding
      mandatory input, removing input/output...
    • MINOR version (middle number): add functionality or changes in a
      backwards-compatible manner. Examples: add output field, add non-mandatory
      input parameter, use a different tool that produces same results...
    • PATCH version (last number): changes/bug fixes that do not affect
      the api/interface. Examples: typo fix, change/add warning messages...
  • All inputs are used in process.
  • All output fields have a value assigned to them.
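
The version-bump rules in the checklist follow the usual MAJOR.MINOR.PATCH scheme. A minimal sketch of that rule, assuming a plain three-number version string; the helper name `bump_version` is hypothetical and not part of resolwe-bio, and resetting the lower parts to zero is the common semantic-versioning convention rather than something the checklist states:

```python
def bump_version(version: str, part: str) -> str:
    """Return `version` with one part bumped (hypothetical helper).

    Assumes a "MAJOR.MINOR.PATCH" string; lower parts are reset to 0,
    which is an assumed convention, not stated in the checklist.
    """
    major, minor, patch = (int(x) for x in version.split("."))
    if part == "major":  # backward incompatible change (e.g. renamed input)
        return f"{major + 1}.0.0"
    if part == "minor":  # backward compatible addition (e.g. new output field)
        return f"{major}.{minor + 1}.0"
    if part == "patch":  # fix that does not affect the api/interface
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown part: {part!r}")
```

For example, adding a non-mandatory input to a process at version 1.2.3 would call `bump_version("1.2.3", "minor")`.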

resolwe_bio/processes/samtools/samtools_depth.py
@@ -18,6 +21,13 @@
)


def prune_zero_depth(stdout):
    """Prune zero depth entries from the samtools depth output."""
    df = pd.read_csv(StringIO(stdout), sep="\t", names=["chrom", "pos", "depth"])
Contributor

To avoid loading the entire output into a single DataFrame, you can read it in chunks and keep only the non-zero values, perhaps something along the lines of (untested):

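from io import StringIO

import pandas as pd

# Read the output in 5000-row chunks and keep only rows with non-zero depth.
chunkz = pd.read_csv(StringIO(stdout), sep="\t", names=["chrom", "pos", "depth"], chunksize=5000)
df = pd.concat((x.query("depth > 0") for x in chunkz))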

This works, though:

from io import StringIO

import pandas as pd

file_name = "https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv"
df = pd.read_csv(file_name)

# Write the CSV into an in-memory buffer, then rewind it before reading it back.
flow = StringIO()
df.to_csv(flow)
flow.seek(0)

# Filter each 5-row chunk and concatenate the matching rows.
chunkz = pd.read_csv(flow, chunksize=5)
pd.concat((x.query("species == 'setosa'") for x in chunkz))

Unnamed: 0	sepal_length	sepal_width	petal_length	petal_width	species
0	0	5.1	3.5	1.4	0.2	setosa
1	1	4.9	3.0	1.4	0.2	setosa
2	2	4.7	3.2	1.3	0.2	setosa
3	3	4.6	3.1	1.5	0.2	setosa
4	4	5.0	3.6	1.4	0.2	setosa
5	5	5.4	3.9	1.7	0.4	setosa
6	6	4.6	3.4	1.4	0.3	setosa
...
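
Applied to the three-column samtools depth output from the diff above, the chunked filter might look like the sketch below. This is not the merged implementation: the function name `prune_zero_depth_chunked`, the `chunksize` default, and the sample text are assumptions made for illustration.

```python
from io import StringIO

import pandas as pd


def prune_zero_depth_chunked(stdout: str, chunksize: int = 5000) -> pd.DataFrame:
    """Return non-zero depth rows of tab-separated chrom/pos/depth text.

    Hypothetical variant of prune_zero_depth; empty input is not handled.
    """
    chunks = pd.read_csv(
        StringIO(stdout),
        sep="\t",
        names=["chrom", "pos", "depth"],
        chunksize=chunksize,
    )
    # Filter each chunk before concatenating so zero-depth rows are dropped early.
    return pd.concat(
        (chunk[chunk["depth"] > 0] for chunk in chunks), ignore_index=True
    )


# Made-up sample in the chrom/pos/depth layout (not real samtools output).
sample = "chr1\t1\t0\nchr1\t2\t5\nchr1\t3\t0\nchr2\t1\t7\n"
pruned = prune_zero_depth_chunked(sample, chunksize=2)
```

Only one chunk plus the rows kept so far live in DataFrames at any time, although the `stdout` string itself is still held in memory in full.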

Member

@jkokosar jkokosar left a comment

LGTM. In the next iteration, consider adding additional outputs:

  • HDF5 file format
  • coverage data in BED file format + associated index for genome browsers (example, should be reimplemented for Python)
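
One possible shape for the BED-style coverage output is sketched below. It is not the linked example and not part of this PR: the function name `depth_to_bed` and the sample rows are hypothetical, and it assumes the input is sorted by chromosome and position, with 1-based positions converted to BED's 0-based, half-open intervals. Indexing and the HDF5 output are not covered.

```python
def depth_to_bed(rows):
    """Merge runs of adjacent positions with equal depth into BED-style intervals.

    `rows` yields (chrom, pos, depth) with 1-based `pos`, assumed sorted by
    chrom and pos. Returns (chrom, start, end, depth) tuples using 0-based,
    half-open coordinates. Hypothetical helper for illustration only.
    """
    intervals = []
    for chrom, pos, depth in rows:
        last = intervals[-1] if intervals else None
        if last and last[0] == chrom and last[2] == pos - 1 and last[3] == depth:
            # Same chromosome, directly adjacent, same depth: extend the run.
            intervals[-1] = (chrom, last[1], pos, depth)
        else:
            # Start a new one-base interval.
            intervals.append((chrom, pos - 1, pos, depth))
    return intervals


# Made-up rows in the chrom/pos/depth layout.
rows = [("chr1", 2, 5), ("chr1", 3, 5), ("chr1", 4, 8), ("chr2", 1, 7)]
bed = depth_to_bed(rows)
```

Merging equal-depth runs keeps the file much smaller than one line per base, which matters for genome-browser tracks.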

@jkokosar jkokosar merged commit 9a763f1 into genialis:master Mar 11, 2024
3 checks passed