Skip to content

gene ontology ingest from gene association formatted files

License

Notifications You must be signed in to change notification settings

monarch-initiative/go-ingest

Repository files navigation

go-ingest

| Documentation |

modularized go ingest

Requirements

Setting Up a New Project

Upon creating a new project from the cookiecutter-monarch-ingest template, you can install and test the project:

cd go-ingest
make install
make test

There are a few additional steps to complete before the project is ready for use.

GitHub Repository

  1. Create a new repository on GitHub.

  2. Enable GitHub Actions to read and write to the repository (required to deploy the project to GitHub Pages).

    • in GitHub, go to Settings -> Action -> General -> Workflow permissions and choose read and write permissions
  3. Initialize the local repository and push the code to GitHub. For example:

    cd go-ingest
    git init
    git remote add origin https://github.com/<username>/<repository>.git
    git add -A && git commit -m "Initial commit"
    git push -u origin main

Transform Code and Configuration

  1. Edit the download.yaml, transform.py, transform.yaml, and metadata.yaml files to suit your needs.
  2. Add any additional dependencies to the pyproject.toml file.
  3. Adjust the contents of the tests directory to test the functionality of your transform.

Documentation

  1. Update this README.md file with any additional information about the project.
  2. Add any appropriate documentation to the docs directory.

Note: After the GitHub Actions for deploying documentation runs, the documentation will be automatically deployed to GitHub Pages.
However, you will need to go to the repository settings and set the GitHub Pages source to the gh-pages branch, using the /docs directory.

GitHub Actions

This project is set up with several GitHub Actions workflows.
You should not need to modify these workflows unless you want to change the behavior.
The workflows are located in the .github/workflows directory:

  • test.yaml: Run the pytest suite.
  • create-release.yaml: Create a new release once a week, or manually.
  • deploy-docs.yaml: Deploy the documentation to GitHub Pages (on pushes to main).
  • update-docs.yaml: After a release, update the documentation with node/edge reports.

Once you have completed these steps, you can remove the Setting Up a New Project section from this README.md file.

Installation

cd go-ingest
make install
# or
poetry install

Note that the make install command is just a convenience wrapper around poetry install.

Once installed, you can check that everything is working as expected:

# Run the pytest suite
make test
# Download the data and run the Koza transform
make download
make run

Usage

This project is set up with a Makefile for common tasks.
To see available options:

make help

Download and Transform

Download the data for the go_ingest transform:

poetry run go_ingest download

To run the Koza transform for go-ingest:

poetry run go_ingest transform

To see available options:

poetry run go_ingest download --help
# or
poetry run go_ingest transform --help

Testing

To run the test suite:

make test

Gene Ontology (GO) Annotation Database

The Gene Ontology Annotation Database compiles high-quality Gene Ontology (GO) annotations to proteins in the UniProt Knowledgebase (UniProtKB), RNA molecules from RNACentral and protein complexes from the Complex Portal.

Manual annotation is the direct assignment of GO terms to proteins, ncRNA and protein complexes by curators from evidence extracted during the review of published scientific literature, with an appropriate evidence code assigned to give an assessment of the strength of the evidence. GOA files contain a mixture of manual annotation supplied by members of the Gene Ontology Consortium and computationally assigned GO terms describing gene products. Annotation type is clearly indicated by associated evidence codes and there are links to the source data.

There is a ReadMe.txt file that explains the different annotation files available. The ingested Gene Annotation File (GAF) is a 17 column tab-delimited file. The file format conforms to the specifications demanded by the GO Consortium and therefore GO IDs and not GO term names are shown.

Biolink captured

Subject Concept Node (Gene)
  • biolink:Gene
    • id (NCBIGene Entrez ID)
Object Concept Node (Gene Ontology Terms)
  • biolink:MolecularActivity

    • id (GO ID)
  • biolink:BiologicalProcess

    • id (GO ID)
  • biolink:CellularComponent

    • id (GO ID)
Additional Gene Ontology Term Concept Nodes for possible use?
  • biolink:Pathway

    • id (GO ID)
  • biolink:PhysiologicalProcess

    • id (GO ID)

Associations

  • biolink:FunctionalAssociation
    • id (random uuid)
    • subject (gene.id)
    • predicate (related_to)
    • object (go_term.id)
    • negated
    • has_evidence
    • publications
    • aggregating_knowledge_source (["infores:monarchinitiative"])
    • primary_knowledge_source

OR

  • biolink:MacromolecularMachineToMolecularActivityAssociation:

    • id (random uuid)
    • subject (gene.id)
    • predicate (related_to)
    • object (go_term.id)
    • negated
    • has_evidence
    • publications
    • aggregating_knowledge_source (["infores:monarchinitiative"])
    • primary_knowledge_source
  • biolink:MacromolecularMachineToBiologicalProcessAssociation:

    • id (random uuid)
    • subject (gene.id)
    • predicate (participates_in)
    • object (go_term.id)
    • negated
    • has_evidence
    • publications
    • aggregating_knowledge_source (["infores:monarchinitiative"])
    • primary_knowledge_source
  • biolink:MacromolecularMachineToCellularComponentAssociation:

    • id (random uuid)
    • subject (gene.id)
    • predicate (located_in)
    • object (go_term.id)
    • negated
    • has_evidence
    • publications
    • aggregating_knowledge_source (["infores:monarchinitiative"])
    • primary_knowledge_source

Possible Additional Gene to Gene Ontology Term Association?

  • biolink:GeneToGoTermAssociation:
    • id (random uuid)
    • subject (gene.id)
    • predicate (related_to)
    • object (go_term.id)
    • negated
    • has_evidence
    • publications
    • aggregating_knowledge_source (["infores:monarchinitiative"])
    • primary_knowledge_source

Citation

Ashburner et al. Gene ontology: tool for the unification of biology. Nat Genet. 2000 May;25(1):25-9. The Gene Ontology Consortium. The Gene Ontology knowledgebase in 2023. Genetics. 2023 May 4;224(1):iyad031


This project was generated using monarch-initiative/cookiecutter-monarch-ingest.
Keep this project up to date using cruft by occasionally running in the project directory:

cruft update

For more information, see the cruft documentation