Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

get-enriched-functions-per-pan-group, R issue? #1383

Closed
TonyMane opened this issue Mar 19, 2020 · 14 comments
Closed

get-enriched-functions-per-pan-group, R issue? #1383

TonyMane opened this issue Mar 19, 2020 · 14 comments
Assignees
Labels

Comments

@TonyMane
Copy link

TonyMane commented Mar 19, 2020

Anvi'o version ...............................: esther (v6.1-master)
Profile DB version ...........................: 31
Contigs DB version ...........................: 14
Pan DB version ...............................: 13
Genome data storage version ..................: 6
Auxiliary data storage version ...............: 2
Structure DB version .........................: 1

Having an issue with 'get-enriched-functions-per-pan-group'. I have ran this function successfully on the Prochlorococcus tutorial. But can't seem to get it work on data-set i have been working on. Below is the specific command I have run:

(anvio-master) (anvio-master) -bash-4.2$ anvi-get-enriched-functions-per-pan-group -p marine-PAN.db -g marine-GENOMES.db --category clade --annotation COG_FUNCTION -o OMZ.txt
Genomes storage .............................................: Initialized (storage hash: hash46f728cc)                                                                                                                                                       
Num genomes in storage ......................................: 20
Num genomes will be used ....................................: 20
Pan DB ......................................................: Initialized: marine-PAN.db (v. 13)
Gene cluster homogeneity estimates ..........................: Functional: [YES]; Geometric: [YES]; Combined: [YES]
                                                                                                                                                                                                                                                              
* Gene clusters are initialized for all 1733 gene clusters in the database.

Category ....................................................: clade                                                                                                                                                                                          
Functional annotation source ................................: COG_FUNCTION
Exclude ungrouped ...........................................: False
Functional occurrence summary ...............................: /tmp/tmpkphxhqfo                                                                                                                                                                               
                                                                                                                                                                                                                                                              

Config Error: It looks like something went wrong during the functional enrichment analysis. We
              don't know what happened, but this log file could contain some clues:           
              /tmp/tmpi9qoa3k5

github_0319202.tar.gz

I have attached the marine-PAN.db , marine-GENOMES.db, the text file i used to populate the layers (marine_layer.txt, with anvi-import-misc-data) and tmp/tmpi9q0a3k5 (all in github_0319202.tar.gz).

Any commments would be useful!!!

@meren
Copy link
Member

meren commented Mar 19, 2020

I think this is an @adw96 or @mooreryan question (whoever sees / or has time for this first, as I am not qualified to debug in R).

But to make testing easier, I created the following two files:

input-GOOD.txt.gz
input-BAD.txt.gz

Input good is coming from the Infant Gut Dataset. When I run the R script anvi-script-run-functional-enrichment-stats on input-GOOD.txt, I get what I expect and everything works smoothly:

anvi-script-run-functional-enrichment-stats --input input-GOOD.txt --output output-GOOD.txt

When I run it on input-BAD.txt,

anvi-script-run-functional-enrichment-stats --input input-BAD.txt --output output-BAD.txt

it gives this error:

Error in smooth.spline(lambda, pi0, df = smooth.df) :
  missing or infinite values in inputs are not allowed
Calls: %>% ... mutate_impl -> <Anonymous> -> pi0est -> smooth.spline
Execution halted

This may be due to a user error, but I couldn't find anything structurally wrong with the file. I feel like a test and exception would have been more helpful. Both input-GOOD.txt and input-BAD.txt are reported by anvi'o, so if there is a way to test the problem also in upstream and never even get to the anvi-script-run-functional-enrichment-stats stage, I'd be happy to solve it too.

Thanks for your input :)

@mooreryan
Copy link
Contributor

There was a similar problem in #1320. Some issue in the qvalue function. (for reference, see
#1320 (comment))

The qvalue code calls smooth.spline in a couple of places and one of those seems to be blowing up.

Will try and look at this more tomorrow!

@adw96
Copy link
Contributor

adw96 commented Mar 25, 2020

FYI this is on my calendar for tomorrow afternoon

@TonyMane
Copy link
Author

Greetings, so i toyed around some of the original files, specifically the "external-genomes.txt" file to create the GENOMES.db and the layer text file to add the metadata. I noticed in the Prochlorococcus tutorial the names/metadata indices were a lot simpler, so i just made my genome names a lot simpler (2 characters, instead of 12) and changed metadata variables to single letters instead of several. This seemed to work out. SO, from my end, this is resolved. But thank you!!!

@TonyMane
Copy link
Author

and here are modified files that worked!
github03242020.tar.gz

@meren
Copy link
Member

meren commented Mar 25, 2020

Thank you, @TonyMane! I'm glad it worked for you. But in an ideal world your analysis should have never came all the way down to a smooth.spline error, and you should have been warned if there was something wrong with your data in earlier stages of the process :)

It is good to know though if no one else can figure it out we can close the issue.

@adw96
Copy link
Contributor

adw96 commented Mar 25, 2020

Hi folks -- this is related to an issue with qvalue, see, e.g., StoreyLab/qvalue#9

I'm currently navigating how to best robust-ify this for us. More soon.

@meren I am completely with you on having unit tests that check for this. Since I am new to unit tests for python/big code bases, can I have a Meren Lab buddy who can help me with this?

@adw96
Copy link
Contributor

adw96 commented Mar 26, 2020

Hi folks,

TL;DR There is an issue that we need to fix, but this problem is very unusual and is only going to affect small datasets with poorly conserved functions/genomes that share very little similarity.

Here's the deal as I see it:

Wayyyyyy under the hood, qvalue::qvalue (run on default settings) expects the largest p-value in the set to exceed 0.95 (e.g., try qvalue::qvalue(c(0.04, 0.5, 0.8, 0.3, 0.94)) and qvalue::qvalue(c(0.04, 0.5, 0.8, 0.3, 0.96))). Because there are an enormous number of functions in most pangenomes, the largest p-value almost always is going exceed 0.95. For example, a core function would likely be observed in all genomes (obvious caveats like depth aside), and have p-value 1. From a really quick look at input-GOOD, I can see that a bunch of ribosomal proteins have p-values of 1 (such as Ribosomal proteins S3, L21, S11...).

We are also likely to have largest p-value greater than 1 when we have a lot of functions represented in our genome collection. e.g., input-GOOD had 1549 functions with maximum p-value of 1, input-BAD had 691 functions with a maximum p-value of 0.9476791.

So @TonyMane had a very bad luck of working with a set of genomes with a maximum p-value of 0.9476791. So when qvalue uses its spline approach, it failed. If we change that pesky 0.9476791 to 0.9500001, the error disappears.

I've investigated a couple of different ways to robust-ify our script to prevent this from happening in future, I think the following is the best: if the largest p-value is less than 0.95, set lambda = seq(0.05, max_lambda, 0.05) where max_lambda <- min(0.95, max(unadjusted_p_value, na.rm=T)). So the lambda grid would shrink to (0.05, 0.1, ..., 0.85, 0.9) for @TonyMane's data, but wouldn't change the result of input-GOOD.txt, or any other dataset that has previously run without issue.

I have a fix already implemented for this but I need someone's help testing it and figuring out all this git branch magic. 🤔

Also, I really can't see how changing the names of the data files would have changed anything here. @meren if you're able to give me the input file post-name change I can look into what happened on R's end, since this is really bizarre.

Many thanks @TonyMane for bringing this to our attention and I'm sorry that my first pass wasn't robust enough for your data!

Amy

@adw96
Copy link
Contributor

adw96 commented Mar 26, 2020

Here's my fixed Rscript (in case something happens to me and this issue goes unresolved!)

anvi-script-run-functional-enrichment-stats-fixed.R.txt

and my baby testing approach

Rscript anvi-script-run-functional-enrichment-stats-fixed.R --input input-BAD.txt --output output-BAD-fixed.txt
Rscript anvi-script-run-functional-enrichment-stats-fixed.R --input input-GOOD.txt --output output-GOOD-fixed.txt
diff output-GOOD.txt output-GOOD-fixed.txt

@ivagljiva
Copy link
Contributor

Hi @adw96! Hope you are well!

It's been a long time since this issue was closed, but today I ran into a similar problem while I was working on #1935. The long story short behind #1935 is that, for pathway annotation sources like COG20_PATHWAY, we seem to be putting very few entries into the functional occurrence text file (the input to the enrichment script). In my test case, I get a functional occurrence file with only 4 COG20_PATHWAY entries.

That's it's own bug, which I still need to figure out.

But the derivative problem which stems from those short input files is that the maximum p-value can be very, very low - in my case, it was 0.006. Then, when lambdas <- seq(0.05, max_lambda, 0.05) runs with a max_lambda less than 0.05, we get the following error:

Error in seq.default(0.05, max_lambda, 0.05) :
  wrong sign in 'by' argument

because you can't make an increasing sequence of numbers when the max value is less than the min value.

I managed to fix this by changing the max_lambda calculation to always yield a number >= 0.05, like so:

max_lambda <- max(0.05, min(0.95, max(pvalues_df$unadjusted_p_value, na.rm=T)))

When I do this, the seq command works, and the script actually fails downstream with a much nicer, clearer error:

Error: Doh! We still can't estimate the proportion of features that are truly null. It's Amy's fault, so please let her know ASAP so she can fix it!

Obviously this is another weird edge case, and one that might go away if #1935 turns out to be fixable :) But I wanted to ask you if you think the above is the right way to fix this edge case? Or should we not even try to fix it at all (because maybe the statistical tests don't make sense when the input size is that small?) and just fail with a "No can do" error when max_lambda < 0.05?

Thanks for your input!

@adw96
Copy link
Contributor

adw96 commented Jun 28, 2022

Great to hear from you and thanks so much for the detailed info, @ivagljiva!

From the looks of this, it's only the q-value procedure that is causing problems. The p-values are correctly calculated.

While there shouldn't theoretically be an issue with calculating FDR significance in the small number of COGs cases, it's hard to estimate the overall proportion of null p-values (lambda) here. Some options that come to my mind include

  • the user chooses lambda (there is no widely used "standard")
  • we pick a default (I really don't know what we tend to see, but you and I could look at some datasets together to pick one)
  • we say "no can do q-values! You will have to rely on p-values (and effect sizes!) instead."

Let me know how you want to proceed!

@ivagljiva
Copy link
Contributor

Oh that's good to hear! Thanks Amy!

In that case, I think

the user chooses lambda

is probably the best choice. Though if it turns out not to be easy to implement, I think

we say "no can do q-values!

is an acceptable option as well :)

Would you like to take care of this, or should I give it a go?

@adw96
Copy link
Contributor

adw96 commented Jun 28, 2022

Would you mind giving it a go? 🙏 Happy to help/troubleshoot if needed (and if I can!)

@ivagljiva
Copy link
Contributor

Hey @adw96! I implemented the fix here, would you mind testing it out and making sure it looks sound? You can run git checkout qvalue_lambda to get the right anvi'o branch for testing :)

Thanks for your help!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

5 participants