Track provenance information for all modeling inputs #976
Prior to actually having atom::read_pdb pull in provenance information about PDB files, describe what will be necessary in the documentation. Relates #976.
These allow a structure to be tagged with a tree of provenance nodes that explain how the structure was created. Relates salilab/imp#976.
This adds basic support for provenance tracking to IMP and RMF. Relates #976. Operations that alter parts of the structure (such as reading in from a PDB file, sampling, filtering, clustering) can now be recorded directly in the Model itself by means of provenance decorators, attached to atom::Hierarchy nodes. This provenance information is also stored in RMF files. Where possible, IMP and PMI should fill in this information automatically.
For tracking provenance of most experimental information, some additional information needs to be stored in the RMF file, namely the set of restraints, which particles they act on, and which restraints were used in each sampling step. Proposal: RMF already stores basic information about decomposed restraints. Make each set of decomposed restraints children of the 'real' restraint, which holds serialized information on the restraint itself (e.g. filename where the EM map was read from, cross correlation information, total score). The |
This adds a new IMP.mmcif module, which is similar in concept to the IMP.rmf module - it adds support for the mmCIF file format. It is intended to be used to convert sets of IMP models (generally read from intermediate RMF files) into a single mmCIF file, for deposition in PDB-Dev, and relies on provenance information (see #976) being present in the models. Relates #968.
IMP currently takes input files in a variety of formats, but doesn't care where those files originate. This becomes a problem when we come to publish a modeling study and deposit the files (e.g. at PDB-Dev). It's a lot of work to backtrack and try to figure out where such files came from. It would be much simpler if IMP tracked this information from day one, reading it in some standardized way from the files themselves (or the Python script), storing it in the Model, and also storing it in RMF files.
Since this is prerequisite information for outputting mmCIF files, solving this issue would be a step towards addressing #968. Much of this information is currently stored outside of the Model, mostly in PMI 1 data structures, and so currently outputting mmCIF requires PMI 1.
Only input atomic models are explicitly considered here, but similar considerations should apply to restraints (e.g. where an EM map comes from), sequences (e.g. UniProt identifier), etc. More generally, any transformation of the model, such as sampling, filtering or clustering, should also be recorded.
Input files
atom::read_pdb
[...] .mrc or .pgm files) the metadata will need to be stored somewhere else - one solution would be an accompanying JSON file (e.g. foo.mrc is described by foo.mrc.json) with domain-specific metadata.
Storage in Model
atom::read_pdb should add suitable decorators to the created Hierarchy particles to identify their source. atom::StructureSource (see Add a Source decorator #894) is one example, although more data should be stored here (e.g. path to the file, PDB ID, version, descriptive text).
atom::create_simplified_along_backbone and PMI's generation of initial models should copy or otherwise preserve this information in the newly-generated hierarchies, while adding additional information about the simplification applied (e.g. the resolution).
core::ProvenanceHierarchy decorator to track a separate hierarchy from the atom::Hierarchy. The root of this tree is the current state of the object, while children are inputs or previous states (and so will be both core::ProvenanceHierarchy and some other decorator such as StructureSource). A particle can be decorated as both core::ProvenanceHierarchy and atom::Hierarchy. Example hierarchies include:
System (atom::Hierarchy), also core::ProvenanceHierarchy root
  output.1/rmfs/0.rmf3
  foo.pgm
  foo.csv
  output.2/rmfs/0.rmf3
Chain (atom::Hierarchy), also core::ProvenanceHierarchy root
  foo.pdb, chain A, heavy atoms only
  1xyzA with alignment foo.ali
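The tree shape described above (root is the current state, children are inputs or previous states) can be sketched in plain Python. This is only an illustration, not IMP code: ProvenanceNode, add_child, leaves and format_tree are hypothetical names invented here, and attaching both inputs directly under the Chain root is an assumption, since the nesting of the example is not fully specified.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class ProvenanceNode:
    """Hypothetical stand-in for a particle decorated as core::ProvenanceHierarchy."""
    description: str
    children: List["ProvenanceNode"] = field(default_factory=list)

    def add_child(self, child: "ProvenanceNode") -> "ProvenanceNode":
        # Children are inputs or previous states of this node.
        self.children.append(child)
        return child

    def leaves(self) -> Iterator["ProvenanceNode"]:
        # Leaves are the original inputs (nodes with no earlier state).
        if not self.children:
            yield self
        for child in self.children:
            yield from child.leaves()


def format_tree(node: ProvenanceNode, depth: int = 0) -> List[str]:
    """Return one indented line per node, root first."""
    lines = ["  " * depth + node.description]
    for child in node.children:
        lines.extend(format_tree(child, depth + 1))
    return lines


# The Chain example from the text, with both inputs assumed to be direct children.
chain = ProvenanceNode("Chain (current state)")
chain.add_child(ProvenanceNode("foo.pdb, chain A, heavy atoms only"))
chain.add_child(ProvenanceNode("1xyzA with alignment foo.ali"))

print("\n".join(format_tree(chain)))
```

Walking leaves() from the root recovers the original inputs, which is the information a deposition step would need.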
core::ProvenanceHierarchy should be static (i.e. the same for all frames in a trajectory) so that it doesn't need to be updated during a simulation, and can be stored efficiently in an RMF file. Care also needs to be taken to avoid unnecessary duplication (e.g. each ensemble contributing to a cluster likely has the exact same set of inputs).
Storage in RMF
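One possible way to avoid storing the same set of inputs once per ensemble is to intern each distinct set and keep only a small integer reference per ensemble. The sketch below is a plain-Python illustration under that assumption. ProvenanceTable and its methods are hypothetical, and nothing here uses or reflects the actual RMF API.

```python
from typing import Dict, FrozenSet, Iterable, List, Tuple


class ProvenanceTable:
    """Hypothetical store that keeps one copy of each distinct set of inputs."""

    def __init__(self) -> None:
        self._index: Dict[FrozenSet[str], int] = {}
        self._records: List[Tuple[str, ...]] = []

    def intern(self, inputs: Iterable[str]) -> int:
        """Return the ID for this set of inputs, storing it only if unseen."""
        key = frozenset(inputs)
        if key not in self._index:
            self._index[key] = len(self._records)
            self._records.append(tuple(sorted(key)))
        return self._index[key]

    def lookup(self, ident: int) -> Tuple[str, ...]:
        """Return the stored (sorted) inputs for a previously issued ID."""
        return self._records[ident]

    def __len__(self) -> int:
        return len(self._records)


table = ProvenanceTable()
# Three ensembles contributing to one cluster, all built from the same inputs.
ensemble_inputs = ["foo.pgm", "foo.csv"]
ids = [table.intern(ensemble_inputs) for _ in range(3)]
```

Each ensemble then carries only its ID. Because the provenance is static across frames, the table would need to be written once per file rather than once per frame.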