get-enriched-functions-per-pan-group, R issue? #1383
I think this is an @adw96 or @mooreryan question (whoever sees / or has time for this first, as I am not qualified to debug in R). But to make testing easier, I created the following two files: input-GOOD.txt.gz Input good is coming from the Infant Gut Dataset. When I run the R script anvi-script-run-functional-enrichment-stats on
When I run it on
it gives this error:
This may be due to a user error, but I couldn't find anything structurally wrong with the file. I feel like a test and exception would have been more helpful. Both Thanks for your input :) |
There was a similar problem in #1320. Some issue in the The Will try and look at this more tomorrow! |
FYI this is on my calendar for tomorrow afternoon |
Greetings, so I toyed around with some of the original files, specifically the "external-genomes.txt" file used to create the GENOMES.db and the layer text file used to add the metadata. I noticed that in the Prochlorococcus tutorial the names/metadata indices were a lot simpler, so I made my genome names a lot simpler (2 characters instead of 12) and changed the metadata variables to single letters instead of several. This seemed to work out. So, from my end, this is resolved. But thank you!!! |
and here are modified files that worked! |
Thank you, @TonyMane! I'm glad it worked for you. But in an ideal world your analysis should never have come all the way down to a It is good to know, though, that if no one else can figure it out we can close the issue. |
Hi folks -- this is related to an issue with I'm currently navigating how to best robust-ify this for us. More soon. @meren I am completely with you on having unit tests that check for this. Since I am new to unit tests for python/big code bases, can I have a Meren Lab buddy who can help me with this? |
Hi folks, TL;DR: There is an issue that we need to fix, but this problem is very unusual and is only going to affect small datasets with poorly conserved functions/genomes that share very little similarity. Here's the deal as I see it: Wayyyyyy under the hood, We are also likely to have a largest p-value greater than 0.95 when we have a lot of functions represented in our genome collection, e.g., input-GOOD had 1549 functions with a maximum p-value of 1, while input-BAD had 691 functions with a maximum p-value of 0.9476791. So @TonyMane had the very bad luck of working with a set of genomes with a maximum p-value of 0.9476791. I've investigated a couple of different ways to robust-ify our script to prevent this from happening in future, and I think the following is the best: if the largest p-value is less than 0.95, set I have a fix already implemented for this but I need someone's help testing it and figuring out all this Also, I really can't see how changing the names of the data files would have changed anything here. @meren if you're able to give me the input file post-name change I can look into what happened on R's end, since this is really bizarre. Many thanks @TonyMane for bringing this to our attention and I'm sorry that my first pass wasn't robust enough for your data! Amy |
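For anyone following along without R at hand, here is a rough Python sketch of why a largest p-value just under 0.95 can be a problem. It is not the code in the enrichment script. It assumes the q-value step estimates the null proportion over a grid of tuning values running from 0.05 to 0.95 (the grid bounds and step are assumptions suggested by the 0.95 threshold above), and it uses the standard Storey-style estimate, the fraction of p-values above a tuning value divided by one minus that value. All function names are made up for illustration, and the p-value lists are toy data where only the two maxima come from this thread.

```python
from typing import List, Sequence


def lambda_grid(start: float = 0.05, stop: float = 0.95, step: float = 0.05) -> List[float]:
    """Evenly spaced tuning values from start to stop, inclusive (assumed defaults)."""
    n = int(round((stop - start) / step)) + 1
    return [round(start + i * step, 10) for i in range(n)]


def pi0_at(p_values: Sequence[float], lam: float) -> float:
    """Storey-style null-proportion estimate at a single tuning value."""
    m = len(p_values)
    return sum(p > lam for p in p_values) / (m * (1.0 - lam))


def grid_is_covered(p_values: Sequence[float], grid: Sequence[float]) -> bool:
    """True when at least one p-value reaches the top of the grid."""
    return max(p_values) >= max(grid)


# Toy p-values: only the maxima (1 and 0.9476791) are taken from the thread.
good = [0.01, 0.30, 0.70, 1.0]
bad = [0.01, 0.30, 0.70, 0.9476791]

grid = lambda_grid()
for name, pvals in (("good", good), ("bad", bad)):
    top = max(grid)
    print(name, "covered:", grid_is_covered(pvals, grid), "pi0 at top of grid:", pi0_at(pvals, top))
```

With the "bad" toy data no p-value lies above the last grid point, so the estimate there drops to zero, which is one plausible way the downstream fit ends up in trouble. A check like `grid_is_covered` run before the q-value step would at least allow a clear error message.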
Here's my fixed Rscript (in case something happens to me and this issue goes unresolved!) anvi-script-run-functional-enrichment-stats-fixed.R.txt and my baby testing approach
Hi @adw96! Hope you are well! It's been a long time since this issue was closed, but today I ran into a similar problem while I was working on #1935. The long story short behind #1935 is that, for pathway annotation sources like COG20_PATHWAY, we seem to be putting very few entries into the functional occurrence text file (the input to the enrichment script). In my test case, I get a functional occurrence file with only 4 COG20_PATHWAY entries. That's its own bug, which I still need to figure out. But the derivative problem which stems from those short input files is that the maximum p-value can be very, very low - in my case, it was 0.006. Then, when
because you can't make an increasing sequence of numbers when the max value is less than the min value. I managed to fix this by changing the
When I do this, the
Obviously this is another weird edge case, and one that might go away if #1935 turns out to be fixable :) But I wanted to ask you if you think the above is the right way to fix this edge case? Or should we not even try to fix it at all (because maybe the statistical tests don't make sense when the input size is that small?) and just fail with a "No can do" error when Thanks for your input! |
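To make the edge case concrete, here is a small Python sketch of the "max value is less than the min value" situation. It is not the change made to the R script. It assumes the grid of tuning values starts at 0.05, steps by 0.05, and is capped at the largest p-value or 0.95, whichever is smaller (all of these numbers are assumptions for illustration), and the function name `capped_grid` is hypothetical. It takes the "fail with a clear message" route when no increasing sequence can be built.

```python
import math
from typing import List


def capped_grid(max_p: float, start: float = 0.05,
                ceiling: float = 0.95, step: float = 0.05) -> List[float]:
    """Grid from start up to min(ceiling, max_p); raises when nothing fits."""
    stop = min(ceiling, max_p)
    if stop < start:
        # This is the case from the comment above: the upper end of the
        # sequence would lie below its lower end.
        raise ValueError(
            f"largest p-value ({max_p}) is below the grid start ({start}); "
            "no increasing sequence can be built"
        )
    # Small tolerance so floating-point noise does not drop the last point.
    n = int(math.floor((stop - start) / step + 1e-9)) + 1
    return [round(start + i * step, 10) for i in range(n)]


for max_p in (1.0, 0.9476791, 0.006):
    try:
        grid = capped_grid(max_p)
        print(max_p, "->", len(grid), "grid points, last =", grid[-1])
    except ValueError as err:
        print(max_p, "-> cannot build grid:", err)
```

The three values tried are the maxima mentioned in this thread. Whether raising an error or falling back to some other procedure is the better behaviour for such small inputs is exactly the open question here.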
Great to hear from you and thanks so much for the detailed info, @ivagljiva! From the looks of this, it's only the q-value procedure that is causing problems. The p-values are correctly calculated. While there shouldn't theoretically be an issue with calculating FDR significance in the small number of COGs cases, it's hard to estimate the overall proportion of null p-values (lambda) here. Some options that come to my mind include
Let me know how you want to proceed! |
Oh that's good to hear! Thanks Amy! In that case, I think
is probably the best choice. Though if it turns out not to be easy to implement, I think
is an acceptable option as well :) Would you like to take care of this, or should I give it a go? |
Would you mind giving it a go? 🙏 Happy to help/troubleshoot if needed (and if I can!) |
Having an issue with 'get-enriched-functions-per-pan-group'. I have run this function successfully on the Prochlorococcus tutorial, but can't seem to get it to work on the data set I have been working on. Below is the specific command I have run:
github_0319202.tar.gz
I have attached the marine-PAN.db, marine-GENOMES.db, the text file I used to populate the layers (marine_layer.txt, with anvi-import-misc-data) and tmp/tmpi9q0a3k5 (all in github_0319202.tar.gz).
Any comments would be useful!!!