stats redesign #216

golobor · 2024-03-17T16:34:09Z

stats are getting out of control.

A non-comprehensive list of issues:

We have several stats, each generated by a different tool at runtime (parse/stats, dedup, filterbycov). This is probably not an issue in itself, but we have no standard for what stats are, what they should look like, or what the API should look like.
We lack a good naming scheme for stat modules and their output. The canonical "stats" is the summary produced by pairtools parse. It contains many unrelated sections and is called, confusingly, stats. Stats produced by other methods are called more specifically, e.g. "dedup.stats". Furthermore, some tools, e.g. dedup, produce multiple stats (e.g. duplication-by-tile).
Currently, stats-calculating code is interwoven with the main code, resulting in spaghetti code.
Stats-calculating code is duplicated: one version takes one pair at a time (PairCounter) and the other processes entire dataframes with many pairs.
We have implemented summaries, but they may not be sufficiently documented/separated in the API. We may need further categories of stats: "counts" (i.e. additive statistics), "summaries" and "const" (i.e. non-additive stats, e.g. genome assembly properties). These categories should then probably be split right at the root level of the stats module.
Saving stats is optional, but some (e.g. dedup) cannot be regenerated after the run.
Stats can require many flags of their own (e.g. dedup now has five stats-related flags!), which makes the CLI confusing.
We do not have any way to track which flags were applied to generate stats (e.g. --filter!!), which makes them less useful. CORRECTION: the filter expression seems to be stored:

pairtools/pairtools/lib/stats.py

Line 66 in b3c2ce3

self._stat[key]["filter_expression"] = self.filters[key]
Pairtools stats API has many unrelated sections and is thus hard to navigate. It also has partially overlapping functionality with P(s).
In the future, we may need even more stats, e.g. restrict (e.g. histogram of distances from cut site to restriction sites), scaling (P(s) as a stat, convergence point of scalings of different directionalities with high precision), phase.
Stats are undocumented. As stats get more and more complicated, it becomes harder and harder to parse them - and to understand what some keys mean. Ideally, (a) .stats outputs should contain auto-generated comments explaining the content of each field, and (b) in the code, the text of these comments should be stored somewhere next to the code that generates them so it's easy to check the presence and the validity of the comment.
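
To make the points about categories and field-level comments concrete, here is a minimal sketch, not a proposal for the actual pairtools API. Everything in it is hypothetical: the `STAT_SPECS` registry, the `Stats` class, the `frac_dups` summary and the comment layout in the output are assumptions made for illustration. It shows counts that merge additively, a summary derived from counts, and descriptions stored next to the stat definitions so the output comments are generated from them.

```python
from dataclasses import dataclass, field

# Hypothetical registry: each stat is declared once, with its category and a
# description kept next to its name, so output comments cannot drift from code.
STAT_SPECS = {
    "total": ("counts", "Total number of pairs processed"),
    "total_dups": ("counts", "Number of pairs marked as duplicates"),
    "frac_dups": ("summaries", "Fraction of duplicated pairs (total_dups / total)"),
    "genome_assembly": ("const", "Name of the genome assembly"),
}


@dataclass
class Stats:
    counts: dict = field(default_factory=dict)  # additive statistics
    const: dict = field(default_factory=dict)   # non-additive properties

    def add(self, key, n=1):
        """Increment an additive count; only registered 'counts' keys are allowed."""
        if STAT_SPECS[key][0] != "counts":
            raise ValueError(f"{key} is not an additive count")
        self.counts[key] = self.counts.get(key, 0) + n

    def merge(self, other):
        """Merge two stats objects: counts are summed, const values must agree."""
        if self.const != other.const:
            raise ValueError("cannot merge stats with different const values")
        merged = Stats(counts=dict(self.counts), const=dict(self.const))
        for key, value in other.counts.items():
            merged.counts[key] = merged.counts.get(key, 0) + value
        return merged

    def summaries(self):
        """Summaries are always recomputed from counts, never merged directly."""
        total = self.counts.get("total", 0)
        dups = self.counts.get("total_dups", 0)
        return {"frac_dups": dups / total if total else float("nan")}

    def dump(self):
        """Render key<TAB>value lines, each preceded by an auto-generated comment."""
        lines = []
        values = {**self.const, **self.counts, **self.summaries()}
        for key, value in values.items():
            category, description = STAT_SPECS[key]
            lines.append(f"# [{category}] {description}")
            lines.append(f"{key}\t{value}")
        return "\n".join(lines)


a = Stats(const={"genome_assembly": "hg38"})
a.add("total", 10)
a.add("total_dups", 2)
b = Stats(const={"genome_assembly": "hg38"})
b.add("total", 30)
b.add("total_dups", 8)
merged = a.merge(b)
print(merged.dump())
```

Because only counts are merged and summaries are recomputed afterwards, merging per-chunk or per-sample stats cannot produce an incorrectly averaged fraction.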

agalitsyna · 2024-03-17T17:49:30Z

Excellent points. I would also add:

Some stats are reused in MultiQC, which has to stay up to date with changes in format/reporting. This adds another layer of restrictions: we at least need to keep track of which stats are used there and which can be safely modified.
There are currently two ways of reporting scalings, which ideally should be a single table. (related to p. 8)
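
One possible way to keep track of what downstream reports depend on is to declare those keys explicitly and check them in a test. The sketch below is hypothetical: the `DOWNSTREAM_KEYS` set is illustrative and not taken from the MultiQC module, and the parser assumes a simple key<TAB>value text layout with optional `#` comment lines.

```python
# Illustrative only: keys a downstream report is assumed to read.
DOWNSTREAM_KEYS = {"total", "total_mapped", "total_dups", "cis", "trans"}


def parse_stats(text):
    """Parse key<TAB>value lines into a dict, skipping blank and '#' lines."""
    stats = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        key, value = line.split("\t", 1)
        stats[key] = value
    return stats


def missing_downstream_keys(stats, required=DOWNSTREAM_KEYS):
    """Return the required keys that are absent from a parsed stats dict."""
    return sorted(required - stats.keys())


example = "total\t100\ntotal_mapped\t90\n# a comment line\ncis\t50\n"
print(missing_downstream_keys(parse_stats(example)))
```

Running such a check in CI against a freshly generated stats file would flag a format change before it breaks the report.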

Phlya · 2024-03-18T09:25:31Z

Thank you for bringing this up and writing up such a detailed list! I agree, many of these things also irked me.

Re --filter - indeed, the expression is stored!
