CLI tool for aggregating single cell data #1264

Closed
7 of 10 tasks
arteymix opened this issue Oct 23, 2024 · 6 comments

@arteymix
Member

arteymix commented Oct 23, 2024

TODO

  • aggregate any single cell vectors or the preferred ones if unspecified
  • aggregate by a given cell type assignment or the preferred one if unspecified
  • aggregate by any cell-level characteristics (optional)
  • aggregate by multiple characteristics (very optional)
  • add a switch to make the resulting raw QT preferred
  • add appropriate audit trail events describing how the aggregation was done and which data vectors and annotations were used
  • support deleting aggregated data
  • indicate that the data is in CPM
  • switch COUNT to LINEAR or maybe compute log2cpm outright
  • investigate if we could include normalization factors (optional)
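
A minimal sketch of the aggregation step in the list above, assuming that aggregating means summing per-cell counts within each cell type (the thread does not state the aggregation function). The function name `aggregate_by_cell_type` and the input layout are hypothetical and do not reflect the actual CLI or data model.

```python
from collections import defaultdict


def aggregate_by_cell_type(counts, cell_types):
    """Sum per-cell counts into one value per cell type, for each gene.

    counts: dict mapping gene -> list of per-cell counts
    cell_types: one cell type label per cell, in the same order as the counts
    Summation is an assumption; it is not confirmed by the issue.
    """
    aggregated = {}
    for gene, values in counts.items():
        if len(values) != len(cell_types):
            raise ValueError(
                f"{gene}: expected {len(cell_types)} cells, got {len(values)}"
            )
        sums = defaultdict(float)
        for value, cell_type in zip(values, cell_types):
            sums[cell_type] += value
        aggregated[gene] = dict(sums)
    return aggregated


# Hypothetical example data: four cells, two cell types.
counts = {"GeneA": [1, 0, 3, 2], "GeneB": [0, 0, 5, 1]}
cell_types = ["neuron", "neuron", "astrocyte", "astrocyte"]
pseudo_bulk = aggregate_by_cell_type(counts, cell_types)
```

Aggregating by another cell-level characteristic would only change the labels passed as `cell_types`.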
@arteymix arteymix added cli Issues affecting the CLI single cell Issues related to single-cell data support labels Oct 23, 2024
@arteymix arteymix self-assigned this Oct 23, 2024
@arteymix
Member Author

This is done, I'm just doing a little bit more testing at this point.

@arteymix
Member Author

I'm almost done with deleting aggregated data. There are a few caveats to consider, such as whether to remove the dimension and reset the single-cell metrics (see #1273), and whether to also delete the generated data files for that QT.

@arteymix
Member Author

Good, now we can reliably aggregate and delete aggregated vectors! I'm looking into some -Infinity slipping through the aggregation process and causing the processed vectors to be filled with NaNs...

I've also made some improvements to which files get deleted when regenerating platform annotations and pre-processing an experiment.

@arteymix
Member Author

OK, got the NaN situation figured out. We need to adjust the data for library size and add a pseudocount, just like we do for log2cpm of RNA-Seq data.
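
As an illustration of why a pseudocount helps: the logarithm of a zero count is -Infinity in IEEE arithmetic, while a small pseudocount keeps every value finite. The sketch below uses the common formula log2((count + 0.5) / (library size + 1) × 10^6); the exact pseudocount and formula used in this project are not stated in the thread, so both are assumptions, and `log2cpm` is a hypothetical helper.

```python
import math


def log2cpm(counts, pseudocount=0.5):
    """Library-size-normalized log2 counts per million for one sample.

    counts: aggregated counts for every gene in a single sample
    pseudocount: assumed value; keeps zero counts from producing -Infinity
    """
    library_size = sum(counts)
    return [
        math.log2((c + pseudocount) / (library_size + 1.0) * 1e6)
        for c in counts
    ]


# Hypothetical aggregated counts for three genes; the first one is zero.
aggregated = [0, 10, 90]
values = log2cpm(aggregated)

# Every value is finite, including the one for the zero count.
all_finite = all(math.isfinite(v) for v in values)
```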

@arteymix
Member Author

Counting data would become linear after library size normalization.

Linear data would technically not be CPM, but I don't think that is important.

We also need to look into allowing count data to use a logarithmic scale type. We might find counting data out there that is unfortunately already log-transformed.

@arteymix
Member Author

Another thing to include in the tests is non-integer counting data.

This happens with some methods that regress out ambient RNA or other contaminants from the data. In that case we would get a general type COUNT and a scale type LINEAR, or something similar. We can generate such vectors for the tests by adding a little bit of multiplicative Gaussian noise to integer counts.
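
One possible way to generate such test vectors, sketched under assumptions: each count is multiplied by a factor drawn from a Gaussian centered on 1. The helper name `add_multiplicative_noise`, the standard deviation, and the clamping at zero are all illustrative choices, not the project's test code.

```python
import random


def add_multiplicative_noise(counts, sd=0.05, seed=None):
    """Turn integer counts into non-integer counts for testing.

    Each count is scaled by a factor drawn from N(1, sd). Zeros stay zero,
    and results are clamped at zero so no count becomes negative.
    """
    rng = random.Random(seed)
    noisy = []
    for c in counts:
        factor = rng.gauss(1.0, sd)
        noisy.append(max(0.0, c * factor))
    return noisy


# Hypothetical integer counts for a handful of cells.
integer_counts = [0, 3, 12, 100, 7]
noisy_counts = add_multiplicative_noise(integer_counts, sd=0.05, seed=42)
```

Because the noise is multiplicative, zero counts remain exactly zero, which preserves the sparsity pattern of the original data.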

@arteymix arteymix closed this as completed Nov 1, 2024