Taxon-term table option to switch valid/satisfiable criteria; emulate gain/loss of a term for taxa #81

Open
dustine32 opened this issue Feb 19, 2021 · 10 comments

Comments

@dustine32

Currently, the gaferencer taxa command generates a taxon-term table that marks whether a term is "valid" for a taxon, given the taxon constraints present in the ontology. Valid is denoted by 1 and not valid by 0. But we might need to expand the possible values and maybe define what we mean by "valid"?

The use case for this issue is looking at the taxon-term row for GO:0006954 (inflammatory response) and finding that the column for taxon Bilateria (NCBITaxon:33213) is marked valid (1), despite the taxon constraint of GO:0006954 only_in_taxon Vertebrata. With Bilateria being an ancestor of Vertebrata (NCBITaxon:7742), one would think this should be marked invalid 0 since Bilateria is not "underneath" Vertebrata on the species tree:
image
On this tree (taken from PANTHER), Vertebrata is roughly equivalent to Euteleostomi (NCBITaxon:117571). I also marked the current taxon-term value of GO:0006954 for most species: a green check for '1' or a red X for '0'. I stopped after several and just drew that green line, which denotes that every species below it is valid for GO:0006954.

So, what does "valid" mean? Given the above results, it looks like it means the term can be annotated to some taxon (e.g. Bilateria descendant taxon Vertebrata can use GO:0006954). But for PAINT annotation purposes (the consumer of this awesome table) we really need this "valid" to mean the term can be annotated to all taxa (e.g. GO:0006954 can be in all Vertebrata, so 1, but not all Bilateria, so 0).

@balhoff As we discussed, to solve this we could use a third value to denote "partial validity", a taxon with some but not all subtaxons valid for the term. Are we still thinking empty string or blank?

  • 1 - Taxon and all descendant taxons are VALID for term
  • "" - Taxon and only some descendant taxons are VALID for term
  • 0 - Taxon and all descendant taxons are INVALID for term

In our example, Vertebrata and its descendants would get 1, all ancestors of Vertebrata would get "", and all other taxons would get 0. The PAINT software would then only accept the 1's as valid.

Another solution could be to still have only 1 or 0 values but add a switch/arg to the gaferencer taxa command that changes the criteria for 1 to exclude the "partially valid" taxons (the "" category) and set those to 0 instead.
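
To make the two proposals above concrete, here is a minimal Python sketch. It is not gaferencer code: the toy tree, the `three_valued` function, and the `strict` argument are all hypothetical names made up for illustration, and the tree is heavily simplified. It assumes validity is already known for each living species and derives the value for any taxon from its leaf descendants.

```python
# Toy species tree (child -> parent). Heavily simplified, for illustration only.
PARENT = {
    "Bilateria": None,
    "Protostomia": "Bilateria",
    "Drosophila melanogaster": "Protostomia",
    "Vertebrata": "Bilateria",
    "Eutheria": "Vertebrata",
    "Homo sapiens": "Eutheria",
    "Danio rerio": "Vertebrata",
}


def children(taxon):
    """Direct children of a taxon in the toy tree."""
    return [child for child, parent in PARENT.items() if parent == taxon]


def living_species(taxon):
    """Leaf taxa at or below the given taxon (a leaf returns itself)."""
    kids = children(taxon)
    if not kids:
        return {taxon}
    leaves = set()
    for kid in kids:
        leaves |= living_species(kid)
    return leaves


def three_valued(taxon, valid_species, strict=False):
    """Return 1 (all descendants valid), 0 (none valid), or "" (only some).

    With strict=True (a hypothetical switch), partial validity collapses to 0.
    """
    leaves = living_species(taxon)
    n_valid = len(leaves & valid_species)
    if n_valid == len(leaves):
        return 1
    if n_valid == 0:
        return 0
    return 0 if strict else ""


# Assumed input: species where GO:0006954 is valid under only_in_taxon Vertebrata.
VALID = {"Homo sapiens", "Danio rerio"}

for name in ("Vertebrata", "Bilateria", "Protostomia"):
    print(name, repr(three_valued(name, VALID)), repr(three_valued(name, VALID, strict=True)))
```

In this sketch Bilateria gets "" under the three-value scheme and 0 under the switch, which is the behavior PAINT would rely on either way.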

@balhoff Apologies if I just ruined our earlier discussion by adding to the confusion here.

@dustine32
Author

Tagging @thomaspd @huaiyumi @mugitty

@thomaspd

For PAINT, we really need something that treats a taxon ID as representing a species, rather than a group of species. For living species like humans and Drosophila, these are the same thing. But for a taxonomy group, like vertebrates, we need it to represent the common ancestor species of all living vertebrates. We want to interpret the taxon constraint as indicating whether that term is valid in a given ancestral species, not a group of living species.

I'm not sure how your code works, Jim, but I'm assuming it might do something like this. Following the example above, the constraint is only_in_taxon vertebrata (corresponding to Euteleostomi in PANTHER). To see if the term is valid for human, I assume you check to see if human is a member/subclass of the NCBITaxonomy group vertebrata. If so, you could do the same thing for ancestral species (other taxa). For example, if you wanted to check Eutheria, you'd check to see if Eutheria in NCBITaxonomy is a subclass of vertebrata. If so, the term would be valid according to that particular constraint.

To do this, we'd need to have Taxonomy ID's for all the ancestral species nodes in PANTHER. If we don't have that, we can make it.
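
The subclass check described above could be sketched roughly as follows. This is only an illustration of the idea, not how gaferencer actually works (it reasons over OWL class expressions); the toy parent map and the function names are assumptions, and the hierarchy is simplified.

```python
# Toy slice of a taxonomy (child -> parent). Simplified, for illustration only.
PARENT = {
    "Bilateria": None,
    "Vertebrata": "Bilateria",
    "Eutheria": "Vertebrata",
    "Homo sapiens": "Eutheria",
}


def is_subclass_of(taxon, group):
    """True if `taxon` is `group` itself or sits below it in the hierarchy."""
    current = taxon
    while current is not None:
        if current == group:
            return True
        current = PARENT.get(current)
    return False


def valid_for_only_in(taxon, constraint_taxon):
    """Treat the taxon as a single (possibly ancestral) species:
    an only_in_taxon constraint holds only if the taxon falls under it."""
    return is_subclass_of(taxon, constraint_taxon)


for name in ("Homo sapiens", "Eutheria", "Vertebrata", "Bilateria"):
    print(name, valid_for_only_in(name, "Vertebrata"))
```

Under this reading, the same check serves living and ancestral species alike: Eutheria passes because it sits under Vertebrata, while Bilateria fails because it sits above it.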

@dustine32
Author

@thomaspd Thankfully, we do have NCBITaxon IDs mapped to most of the PANTHER species (256 out of 288) including for the ancestral species. These NCBITaxon IDs get passed into the gaferencer tool to make the taxon-term table. Just double-checked and Bilateria, Eutheria, and Euteleostomi NCBITaxon IDs are all mapped. So the ancestor-to-ancestor checks should be using the NCBITaxonomy class hierarchy.

@balhoff This <taxon-list> I'm using is available here.

@dustine32
Author

@thomaspd Yep, it just clicked for me!

Considering this as the gain and loss of functions, it wouldn't make sense to say that the Bilateria ancestor species itself (as opposed to its set of descendant species) had inflammatory response (GO:0006954) just because its descendant Vertebrata gained it later. So this should be '0' for Bilateria and "partial" wouldn't make sense either.

@kltm
Member

kltm commented Mar 2, 2021

Tagging self on.

@dustine32 changed the title from "Taxon-term table value to represent partial validity" to "Taxon-term table option to switch valid/satisfiable criteria; emulate gain/loss of a term for taxa" on Mar 25, 2021
@dustine32
Author

@balhoff I believe this line here contains the meat:

def toExpression: OWLClassExpression = term and (InTaxon some taxon)

We could pass an option/flag into gaferencer to use an altered version of this. Though I don't yet know the syntax to express what we really need. We might need to talk it through on a call.

@balhoff
Member

balhoff commented Apr 1, 2021

We discussed changing the table format:

  • 0 == most broad taxa "never in"
  • 1 == most specific taxon "always in"
  • blank == satisfiable to use with taxon (could be superclass, or subclass, of taxon with 1 value)

@balhoff
Member

balhoff commented Apr 1, 2021

That scenario (above) will match the output of the taxon-constraints Protégé plugin.

@dustine32
Author

For our PAINT purposes, I'll add an extra step in the post-processing of this table, which we handle here, that fills in the blank cells. This final, complete taxon-term table will then be what is consumed by the PAINT tool.

In the Bilateria inflammatory response example, this taxon-term cell will be blank coming out of the new gaferencer. My post-step logic would assign it 1 if any ancestor species of Bilateria have a 1 for this term. But since only a descendant of Bilateria (Euteleostomi) has a 1, Bilateria would be assigned 0. All descendant species of Euteleostomi would be assigned 1 by this logic.

For other terms that have a never_in_taxon, the logic will ask if the species has an ancestor with 0 for this term and, if yes, the species will inherit the 0. Perhaps I should pseudocode all of this:

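def fill_cell(term, taxon, species_tree, gaferencer_result):
    # Only blank cells get filled; a 0 or 1 from gaferencer is kept as-is
    value = gaferencer_result(term, taxon)
    if value in (0, 1):
        return value
    ancestors = species_tree.get_ancestors(taxon)
    # Check for "never_in" ancestors first
    if any(gaferencer_result(term, anc) == 0 for anc in ancestors):
        return 0
    # If no "never_in" ancestors, check for "always_in" ancestors
    if any(gaferencer_result(term, anc) == 1 for anc in ancestors):
        return 1
    return 0  # This is where Bilateria inflammatory response will go, so must be '0'

# Applied to every taxon cell in the term's row
for taxon in table.row(term).cells:
    filled[taxon] = fill_cell(term, taxon, species_tree, gaferencer_result)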

@thomaspd @huaiyumi What should the default value be if a term doesn't have any taxon constraints defined? Should the PAINT tool allow curators to make annotations to these terms? If yes, then the default should be 1, resulting in the whole term row being all 1s for all taxons. Are you OK with this?

@dustine32
Author

Answer from @thomaspd and @huaiyumi is that yes, terms not having any constraints in the gaferencer taxa output (all-blank row) will be assigned 1 for all taxons by the PAINT update's post-processing script.

I'll make a ticket in that repo for implementing the change.
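
For reference, the post-processing agreed on in this thread (fill blank cells from ancestors, and default an all-blank row to 1) might look something like the Python sketch below. It is an assumption-laden illustration, not the actual PAINT script: the toy tree, the row contents, and the names `get_ancestors` and `fill_row` are invented here, and blank is represented as an empty string.

```python
BLANK = ""

# Toy species tree (child -> parent). Simplified, for illustration only.
PARENT = {
    "Bilateria": None,
    "Euteleostomi": "Bilateria",
    "Eutheria": "Euteleostomi",
    "Homo sapiens": "Eutheria",
}


def get_ancestors(taxon):
    """All ancestors of a taxon, nearest first."""
    ancestors = []
    current = PARENT.get(taxon)
    while current is not None:
        ancestors.append(current)
        current = PARENT.get(current)
    return ancestors


def fill_row(row):
    """Fill the blank cells of one term's row (dict: taxon -> 0, 1, or BLANK)."""
    # A term with no constraints comes out all blank: allow it everywhere.
    if all(value == BLANK for value in row.values()):
        return {taxon: 1 for taxon in row}
    filled = {}
    for taxon, value in row.items():
        if value != BLANK:
            filled[taxon] = value  # keep gaferencer's explicit 0 or 1
            continue
        ancestor_values = [row[a] for a in get_ancestors(taxon) if a in row]
        if 0 in ancestor_values:      # inherit "never_in" first
            filled[taxon] = 0
        elif 1 in ancestor_values:    # then inherit "always_in"
            filled[taxon] = 1
        else:
            filled[taxon] = 0         # e.g. Bilateria for inflammatory response
    return filled


# Assumed gaferencer output for GO:0006954 in the new table format.
inflammatory_response = {
    "Bilateria": BLANK,
    "Euteleostomi": 1,
    "Eutheria": BLANK,
    "Homo sapiens": BLANK,
}
print(fill_row(inflammatory_response))
print(fill_row({taxon: BLANK for taxon in PARENT}))
```

With the assumed input row, Bilateria ends up 0 while Euteleostomi and everything under it end up 1, and the all-blank row becomes all 1s.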


4 participants