Explore adding the "complicated" (full provenance) GAF 2.2 header to the GPAD 2.0 #387

Open
kltm opened this issue Sep 30, 2024 · 7 comments

@kltm
Member

kltm commented Sep 30, 2024

Talking to @pgaudet, we were wondering whether the "complicated"/full provenance GAF 2.2 header could also be emitted into the GPAD 2.0. This is not technically a show-stopper, since the GPAD 2.0 is going to be more of an "internal" format moving forward, but it would be nice to have and good for consistency.

@kltm
Member Author

kltm commented Sep 30, 2024

Tagging @sierra-moxon

@sierra-moxon sierra-moxon self-assigned this Sep 30, 2024
@kltm
Member Author

kltm commented Nov 4, 2024

Talking to @pgaudet, @kltm will follow up with @sierra-moxon

@sierra-moxon
Member

sierra-moxon commented Nov 5, 2024

I think the ask here is to include the source file(s) headers in addition to the gpad-version, date-generated, and generated-by header values it already has.

the source headers include (when available):

  • the noctua gpad file header
  • the upstream GAF header(s) (sometimes more than one, incl PAINT, etc.)

this will require two changes: one to the GpadWriter itself, so that it accepts a source param, and one in validate.py (the script called from the mega make step to produce GPAD).
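A minimal sketch of how the source headers could be folded into the GPAD header, assuming the writer is handed the upstream header lines as lists of strings. The function name `build_gpad_header` and its parameters are hypothetical, for illustration only; this is not the actual GpadWriter or validate.py code.

```python
from datetime import datetime
from typing import Iterable, List, Optional


def build_gpad_header(
    source_headers: Iterable[Iterable[str]],
    generated_by: str = "GOC",
    date_generated: Optional[datetime] = None,
) -> List[str]:
    """Return GPAD 2.0 header lines followed by any source-file headers.

    Hypothetical helper; names and signature are assumptions, not an existing API.
    """
    stamp = (date_generated or datetime.now()).strftime("%Y-%m-%dT%H:%M")
    lines = [
        "!gpad-version: 2.0",
        f"!generated-by: {generated_by}",
        f"!date-generated: {stamp}",
    ]
    for header in source_headers:
        for raw in header:
            text = raw.rstrip("\n")
            # Keep every emitted line a "!" comment so it stays part of the header.
            lines.append(text if text.startswith("!") else "!" + text)
    return lines


# Usage: one upstream header, with values taken from the PANTHER example in this thread.
panther = ["!gaf-version: 2.2", "!generated-by: PANTHER", "!date-generated: 2024-09-23"]
header = build_gpad_header([panther], date_generated=datetime(2024, 11, 11, 15, 35))
print("\n".join(header))
```

Passing the source headers in as plain lists keeps the sketch independent of how the noctua GPAD and upstream GAF files are actually read.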

@kltm
Member Author

kltm commented Nov 5, 2024

@sierra-moxon I'm happy to talk more about it, but the idea is to preserve as much of the provenance as we can in our output files, letting the final consumer know where things are coming from and giving them a better idea how to debug (this is our intent). Right now, the "best" example we have of this is the GAF headers. (They themselves have some issues, but are considered fine for now.)

@sierra-moxon
Member

sierra-moxon commented Nov 12, 2024

This is what the header looks like in my development branch for a test set, GOA-chicken:

!gpad-version: 2.0
!generated-by: GOC
!date-generated: 2024-11-11T15:35
!Header from source GAF file(s)
!=================================
!gaf-version: 2.2
!
!generated-by: GOC
!
!date-generated: 2024-11-11T15:33
!
!Header from goa_chicken source association file:
!=================================
!
!The set of protein accessions included in this file is based on UniProt reference proteomes, which provide one protein per gene.
!They include the protein sequences annotated in Swiss-Prot or the longest TrEMBL transcript if there is no Swiss-Prot record.
!In addition this file included Swiss-Prot Isoforms, RNA and ComplexPortal annotations data
!
!date-generated: 2024-10-20 10:37
!generated-by: UniProt
!go-version: http://purl.obolibrary.org/obo/go/releases/2024-10-11/extensions/go-plus.owl
!
!Header from source GAF file(s)
!=================================
!gaf-version: 2.2
!Created on Mon Sep 23 12:29:06 2024.
!generated-by: PANTHER
!date-generated: 2024-09-23
!PANTHER version: v.19.0.
!GO version: 2024-06-17.

Does this seem reasonable and/or does it need any changes? I think it's a bit strange to see the GAF bits in the GPAD header, but I'm not quite sure. @kltm, @pgaudet

@sierra-moxon sierra-moxon assigned kltm and sierra-moxon and unassigned sierra-moxon and kltm Nov 12, 2024
@kltm
Member Author

kltm commented Nov 12, 2024

@sierra-moxon I think this looks good to me? WRT seeing the GAF header: technically, these files are being derived from GAF (source GAF file), so I think that's fine. Moreover, the GPAD is more "internal" anyways.
@pgaudet Would this all work for you?

@pgaudet
Contributor

pgaudet commented Nov 13, 2024

Looks good! thanks @sierra-moxon !
