You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
We have several stats, each generated by a different tool during their runtime (parse/stats, dedup, filterbycov). This is probably not an issue, but we do not have a standard for what stats are, how they should look like, and how should API look like.
We lack a good naming scheme for stat modules and their output. The canonical "stats" is the summary produced by pairtools parse. It contains many unrelated sections and is called, confusingly, stats. Stats produced by other methods are called more specifically, e.g. "dedup.stats". Furthermore, some tools, e.g. dedup, produce multiple stats (e.g. duplication-by-tile).
Currently, stats-calculating codes are interweaved with the main code, resulting in spaghetti code.
Stats-calculating codes are duplicated: one version takes one pair at a time (PairCounter) and the other processes entire dataframes with many pairs.
We have implemented summaries, but they may not be sufficiently documented/separated in the API. We may need a further category of stats: "counts" (i.e. additive statistics), "summaries" and "const" (i.e. non-additive stats, i.e. genome assembly properties). These categories should then probably be split right at the root level of the stats module.
Saving stats is optional, but some (e.g. dedup) may not be regenerated after running.
Stats can require many flags of their own (e.g. dedup now has five stats-related flags!), which makes the CLI confusing.
We do not have any way to track which flags were applied to generate stats (e.g. --filter!!), which makes them less useful. CORRECTION: filter expression seems to be stored:
Pairtools stats API has many unrelated sections and is thus hard to navigate. It also has partially overlapping functionality with P(s).
In the future, we may need even more stats, e.g. restrict (e.g. histogram of distances from cut site to restriction sites), scaling (P(s) as a stat, convergence point of scalings of different directionalities with high precision), phase.
Stats are undocumented. As stats get more and more complicated, it becomes harder and harder to parse them - and to understand what some keys mean. Ideally, (a) .stats outputs should contain auto-generated comments explaining the content of each field, and (b) in the code, the text of these comments should be stored somewhere next to the code that generates them so it's easy to check the presence and the validity of the comment.
The text was updated successfully, but these errors were encountered:
Some stats are re-used in MultiQC that has to be up-to-date with the changes in format/reporting, which adds another layer of restrictions - we at least need to keep track of what types of stats are used there, and what can be safely modified.
There are currently two variants to report scalings, which ideally can be a single table. (related to p. 8)
stats are getting out of control.
A non-comprehensive list of issues:
We have several stats, each generated by a different tool during their runtime (parse/stats, dedup, filterbycov). This is probably not an issue, but we do not have a standard for what stats are, how they should look like, and how should API look like.
We lack a good naming scheme for stat modules and their output. The canonical "stats" is the summary produced by pairtools parse. It contains many unrelated sections and is called, confusingly, stats. Stats produced by other methods are called more specifically, e.g. "dedup.stats". Furthermore, some tools, e.g. dedup, produce multiple stats (e.g. duplication-by-tile).
Currently, stats-calculating codes are interweaved with the main code, resulting in spaghetti code.
Stats-calculating codes are duplicated: one version takes one pair at a time (PairCounter) and the other processes entire dataframes with many pairs.
We have implemented summaries, but they may not be sufficiently documented/separated in the API. We may need a further category of stats: "counts" (i.e. additive statistics), "summaries" and "const" (i.e. non-additive stats, i.e. genome assembly properties). These categories should then probably be split right at the root level of the stats module.
Saving stats is optional, but some (e.g. dedup) may not be regenerated after running.
Stats can require many flags of their own (e.g. dedup now has five stats-related flags!), which makes the CLI confusing.
We do not have any way to track which flags were applied to generate stats (e.g. --filter!!), which makes them less useful. CORRECTION: filter expression seems to be stored:
pairtools/pairtools/lib/stats.py
Line 66 in b3c2ce3
Pairtools stats API has many unrelated sections and is thus hard to navigate. It also has partially overlapping functionality with P(s).
In the future, we may need even more stats, e.g. restrict (e.g. histogram of distances from cut site to restriction sites), scaling (P(s) as a stat, convergence point of scalings of different directionalities with high precision), phase.
Stats are undocumented. As stats get more and more complicated, it becomes harder and harder to parse them - and to understand what some keys mean. Ideally, (a) .stats outputs should contain auto-generated comments explaining the content of each field, and (b) in the code, the text of these comments should be stored somewhere next to the code that generates them so it's easy to check the presence and the validity of the comment.
The text was updated successfully, but these errors were encountered: