get-enriched-functions-per-pan-group, R issue? #1383

TonyMane · 2020-03-19T22:45:59Z

Anvi'o version ...............................: esther (v6.1-master)
Profile DB version ...........................: 31
Contigs DB version ...........................: 14
Pan DB version ...............................: 13
Genome data storage version ..................: 6
Auxiliary data storage version ...............: 2
Structure DB version .........................: 1

Having an issue with 'get-enriched-functions-per-pan-group'. I have ran this function successfully on the Prochlorococcus tutorial. But can't seem to get it work on data-set i have been working on. Below is the specific command I have run:

(anvio-master) (anvio-master) -bash-4.2$ anvi-get-enriched-functions-per-pan-group -p marine-PAN.db -g marine-GENOMES.db --category clade --annotation COG_FUNCTION -o OMZ.txt
Genomes storage .............................................: Initialized (storage hash: hash46f728cc)                                                                                                                                                       
Num genomes in storage ......................................: 20
Num genomes will be used ....................................: 20
Pan DB ......................................................: Initialized: marine-PAN.db (v. 13)
Gene cluster homogeneity estimates ..........................: Functional: [YES]; Geometric: [YES]; Combined: [YES]
                                                                                                                                                                                                                                                              
* Gene clusters are initialized for all 1733 gene clusters in the database.

Category ....................................................: clade                                                                                                                                                                                          
Functional annotation source ................................: COG_FUNCTION
Exclude ungrouped ...........................................: False
Functional occurrence summary ...............................: /tmp/tmpkphxhqfo                                                                                                                                                                               
                                                                                                                                                                                                                                                              

Config Error: It looks like something went wrong during the functional enrichment analysis. We
              don't know what happened, but this log file could contain some clues:           
              /tmp/tmpi9qoa3k5

github_0319202.tar.gz

I have attached the marine-PAN.db , marine-GENOMES.db, the text file i used to populate the layers (marine_layer.txt, with anvi-import-misc-data) and tmp/tmpi9q0a3k5 (all in github_0319202.tar.gz).

Any commments would be useful!!!

The text was updated successfully, but these errors were encountered:

meren · 2020-03-19T23:09:36Z

I think this is an @adw96 or @mooreryan question (whoever sees / or has time for this first, as I am not qualified to debug in R).

But to make testing easier, I created the following two files:

input-GOOD.txt.gz
input-BAD.txt.gz

Input good is coming from the Infant Gut Dataset. When I run the R script anvi-script-run-functional-enrichment-stats on input-GOOD.txt, I get what I expect and everything works smoothly:

anvi-script-run-functional-enrichment-stats --input input-GOOD.txt --output output-GOOD.txt

When I run it on input-BAD.txt,

anvi-script-run-functional-enrichment-stats --input input-BAD.txt --output output-BAD.txt

it gives this error:

Error in smooth.spline(lambda, pi0, df = smooth.df) :
  missing or infinite values in inputs are not allowed
Calls: %>% ... mutate_impl -> <Anonymous> -> pi0est -> smooth.spline
Execution halted

This may be due to a user error, but I couldn't find anything structurally wrong with the file. I feel like a test and exception would have been more helpful. Both input-GOOD.txt and input-BAD.txt are reported by anvi'o, so if there is a way to test the problem also in upstream and never even get to the anvi-script-run-functional-enrichment-stats stage, I'd be happy to solve it too.

Thanks for your input :)

mooreryan · 2020-03-20T00:34:51Z

There was a similar problem in #1320. Some issue in the qvalue function. (for reference, see
#1320 (comment))

The qvalue code calls smooth.spline in a couple of places and one of those seems to be blowing up.

Will try and look at this more tomorrow!

adw96 · 2020-03-25T00:37:50Z

FYI this is on my calendar for tomorrow afternoon

TonyMane · 2020-03-25T03:43:24Z

Greetings, so i toyed around some of the original files, specifically the "external-genomes.txt" file to create the GENOMES.db and the layer text file to add the metadata. I noticed in the Prochlorococcus tutorial the names/metadata indices were a lot simpler, so i just made my genome names a lot simpler (2 characters, instead of 12) and changed metadata variables to single letters instead of several. This seemed to work out. SO, from my end, this is resolved. But thank you!!!

TonyMane · 2020-03-25T03:56:59Z

and here are modified files that worked!
github03242020.tar.gz

meren · 2020-03-25T13:58:23Z

Thank you, @TonyMane! I'm glad it worked for you. But in an ideal world your analysis should have never came all the way down to a smooth.spline error, and you should have been warned if there was something wrong with your data in earlier stages of the process :)

It is good to know though if no one else can figure it out we can close the issue.

adw96 · 2020-03-25T22:54:20Z

Hi folks -- this is related to an issue with qvalue, see, e.g., StoreyLab/qvalue#9

I'm currently navigating how to best robust-ify this for us. More soon.

@meren I am completely with you on having unit tests that check for this. Since I am new to unit tests for python/big code bases, can I have a Meren Lab buddy who can help me with this?

adw96 · 2020-03-26T00:07:10Z

Hi folks,

TL;DR There is an issue that we need to fix, but this problem is very unusual and is only going to affect small datasets with poorly conserved functions/genomes that share very little similarity.

Here's the deal as I see it:

Wayyyyyy under the hood, qvalue::qvalue (run on default settings) expects the largest p-value in the set to exceed 0.95 (e.g., try qvalue::qvalue(c(0.04, 0.5, 0.8, 0.3, 0.94)) and qvalue::qvalue(c(0.04, 0.5, 0.8, 0.3, 0.96))). Because there are an enormous number of functions in most pangenomes, the largest p-value almost always is going exceed 0.95. For example, a core function would likely be observed in all genomes (obvious caveats like depth aside), and have p-value 1. From a really quick look at input-GOOD, I can see that a bunch of ribosomal proteins have p-values of 1 (such as Ribosomal proteins S3, L21, S11...).

We are also likely to have largest p-value greater than 1 when we have a lot of functions represented in our genome collection. e.g., input-GOOD had 1549 functions with maximum p-value of 1, input-BAD had 691 functions with a maximum p-value of 0.9476791.

So @TonyMane had a very bad luck of working with a set of genomes with a maximum p-value of 0.9476791. So when qvalue uses its spline approach, it failed. If we change that pesky 0.9476791 to 0.9500001, the error disappears.

I've investigated a couple of different ways to robust-ify our script to prevent this from happening in future, I think the following is the best: if the largest p-value is less than 0.95, set lambda = seq(0.05, max_lambda, 0.05) where max_lambda <- min(0.95, max(unadjusted_p_value, na.rm=T)). So the lambda grid would shrink to (0.05, 0.1, ..., 0.85, 0.9) for @TonyMane's data, but wouldn't change the result of input-GOOD.txt, or any other dataset that has previously run without issue.

I have a fix already implemented for this but I need someone's help testing it and figuring out all this git branch magic. 🤔

Also, I really can't see how changing the names of the data files would have changed anything here. @meren if you're able to give me the input file post-name change I can look into what happened on R's end, since this is really bizarre.

Many thanks @TonyMane for bringing this to our attention and I'm sorry that my first pass wasn't robust enough for your data!

Amy

adw96 · 2020-03-26T00:18:37Z

Here's my fixed Rscript (in case something happens to me and this issue goes unresolved!)

anvi-script-run-functional-enrichment-stats-fixed.R.txt

and my baby testing approach

Rscript anvi-script-run-functional-enrichment-stats-fixed.R --input input-BAD.txt --output output-BAD-fixed.txt
Rscript anvi-script-run-functional-enrichment-stats-fixed.R --input input-GOOD.txt --output output-GOOD-fixed.txt
diff output-GOOD.txt output-GOOD-fixed.txt

ivagljiva · 2022-06-28T19:27:43Z

Hi @adw96! Hope you are well!

It's been a long time since this issue was closed, but today I ran into a similar problem while I was working on #1935. The long story short behind #1935 is that, for pathway annotation sources like COG20_PATHWAY, we seem to be putting very few entries into the functional occurrence text file (the input to the enrichment script). In my test case, I get a functional occurrence file with only 4 COG20_PATHWAY entries.

That's it's own bug, which I still need to figure out.

But the derivative problem which stems from those short input files is that the maximum p-value can be very, very low - in my case, it was 0.006. Then, when lambdas <- seq(0.05, max_lambda, 0.05) runs with a max_lambda less than 0.05, we get the following error:

Error in seq.default(0.05, max_lambda, 0.05) :
  wrong sign in 'by' argument

because you can't make an increasing sequence of numbers when the max value is less than the min value.

I managed to fix this by changing the max_lambda calculation to always yield a number >= 0.05, like so:

max_lambda <- max(0.05, min(0.95, max(pvalues_df$unadjusted_p_value, na.rm=T)))

When I do this, the seq command works, and the script actually fails downstream with a much nicer, clearer error:

Error: Doh! We still can't estimate the proportion of features that are truly null. It's Amy's fault, so please let her know ASAP so she can fix it!

Obviously this is another weird edge case, and one that might go away if #1935 turns out to be fixable :) But I wanted to ask you if you think the above is the right way to fix this edge case? Or should we not even try to fix it at all (because maybe the statistical tests don't make sense when the input size is that small?) and just fail with a "No can do" error when max_lambda < 0.05?

Thanks for your input!

adw96 · 2022-06-28T20:55:07Z

Great to hear from you and thanks so much for the detailed info, @ivagljiva!

From the looks of this, it's only the q-value procedure that is causing problems. The p-values are correctly calculated.

While there shouldn't theoretically be an issue with calculating FDR significance in the small number of COGs cases, it's hard to estimate the overall proportion of null p-values (lambda) here. Some options that come to my mind include

the user chooses lambda (there is no widely used "standard")
we pick a default (I really don't know what we tend to see, but you and I could look at some datasets together to pick one)
we say "no can do q-values! You will have to rely on p-values (and effect sizes!) instead."

Let me know how you want to proceed!

ivagljiva · 2022-06-28T21:01:24Z

Oh that's good to hear! Thanks Amy!

In that case, I think

the user chooses lambda

is probably the best choice. Though if it turns out not to be easy to implement, I think

we say "no can do q-values!

is an acceptable option as well :)

Would you like to take care of this, or should I give it a go?

adw96 · 2022-06-28T21:10:34Z

Would you mind giving it a go? 🙏 Happy to help/troubleshoot if needed (and if I can!)

ivagljiva · 2022-07-06T15:40:12Z

Hey @adw96! I implemented the fix here, would you mind testing it out and making sure it looks sound? You can run git checkout qvalue_lambda to get the right anvi'o branch for testing :)

Thanks for your help!

meren assigned adw96 Mar 26, 2020

meren added the bug label Mar 26, 2020

adw96 closed this as completed in 176b5e5 Mar 26, 2020

ts-dangelo mentioned this issue Apr 30, 2021

[FEATURE] Adding prevalence/abundance filter for gene clusters to functional enrichment script #1731

Open

ivagljiva mentioned this issue Jul 6, 2022

Add lambda parameter for q-value estimation in enrichment script #1953

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

get-enriched-functions-per-pan-group, R issue? #1383

get-enriched-functions-per-pan-group, R issue? #1383

TonyMane commented Mar 19, 2020 •

edited by meren

Loading

meren commented Mar 19, 2020

mooreryan commented Mar 20, 2020

adw96 commented Mar 25, 2020

TonyMane commented Mar 25, 2020

TonyMane commented Mar 25, 2020

meren commented Mar 25, 2020 •

edited

Loading

adw96 commented Mar 25, 2020

adw96 commented Mar 26, 2020

adw96 commented Mar 26, 2020 •

edited

Loading

ivagljiva commented Jun 28, 2022

adw96 commented Jun 28, 2022

ivagljiva commented Jun 28, 2022

adw96 commented Jun 28, 2022

ivagljiva commented Jul 6, 2022

get-enriched-functions-per-pan-group, R issue? #1383

get-enriched-functions-per-pan-group, R issue? #1383

Comments

TonyMane commented Mar 19, 2020 • edited by meren Loading

meren commented Mar 19, 2020

mooreryan commented Mar 20, 2020

adw96 commented Mar 25, 2020

TonyMane commented Mar 25, 2020

TonyMane commented Mar 25, 2020

meren commented Mar 25, 2020 • edited Loading

adw96 commented Mar 25, 2020

adw96 commented Mar 26, 2020

adw96 commented Mar 26, 2020 • edited Loading

ivagljiva commented Jun 28, 2022

adw96 commented Jun 28, 2022

ivagljiva commented Jun 28, 2022

adw96 commented Jun 28, 2022

ivagljiva commented Jul 6, 2022

TonyMane commented Mar 19, 2020 •

edited by meren

Loading

meren commented Mar 25, 2020 •

edited

Loading

adw96 commented Mar 26, 2020 •

edited

Loading