-
Notifications
You must be signed in to change notification settings - Fork 41
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Table spec proposal #64
Conversation
I think this PR now reflects our thoughts from the hackathon. What is the next step, post it in the image.sc thread for broader comments? cc @joshmoore |
@kevinyamauchi certainly good to get broader feedback, especially if all cc'd above are happy. |
I agree, if it is not too much work it would be great to have a minimal use-case demonstration. Unfortunately, on the Java side I don't think that's easily possible (at least if we want to use N5-ZARR to read and write the tabular data, in particular String values, there is quite a bit of work to be done). @bogovicj, do you know if there could be anyone that would like to take this on? As mentioned, @constantinpape and I hacked something together, but I think it would need someone more into the N5 (and imglib2) core development to implement something nice. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
X
MUST not be a complex type (i.e., MUST be a single type)
What is meant by "complex" vs "single" type, exactly? I assume "complex" would include objects, arrays, dictionaries, ..., and "single" would be primitive types. Are variable-length strings "complex" or "single" ?
Thanks @kevinyamauchi for the nice work and @tischi for the mention. Agree, this would be very nice to have in Java -land, and will ask around to see who might have bandwidth +1 to having some minimal use-cases / containers with the functionality described here with expected behavior. I'm unlikely to start work without such examples.
Would you mind sending a pointer if its convenient?
Do you mean storing variable-length strings in the data matrix |
@bogovicj Here's the work for padded string support in n5-zarr. It's very preliminary and hacky, I am sure there are more elegant ways of doing this. Note that the current solution is not fully working (strings read back are not quite what's expected), but I think the only thing missing is to change the endianess of the characters in the strings (characters are encoded in utf64; this took us some time to figure out): |
By storing column names as the actual file names of arrays ( For example hdf5, parquet and feather seem to allow column names like "µm", "m/z" or containing colons and even backslashes. If we decide to support a subset of possible Unicode names, it should be mentioned in the spec. |
regarding points: see also - https://mobile.twitter.com/howardbutler/status/1453395016592789509 |
This pull request has been mentioned on Image.sc Forum. There might be relevant details there: https://forum.image.sc/t/ome-zarr-hackathon-january-2022/60771/1 |
Cross-referencing relating models for geospatial with Arrow for geometry data: https://github.com/jorisvandenbossche/geo-arrow-spec/blob/geo-arrow-format-initial/format.md |
This pull request has been mentioned on Image.sc Forum. There might be relevant details there: https://forum.image.sc/t/storage-format-for-super-resolution-point-cloud-data/61393/3 |
This pull request has been mentioned on Image.sc Forum. There might be relevant details there: https://forum.image.sc/t/morpholibj-measurements-output/62614/3 |
Hello everyone! I am so sorry for the radio silence from my side. I've been struggling to carve out the time to work on this. However, thanks to the OME-NGFF hackathon this week, I will finally get to work on this some more! I think @will-moore 's feedback to add an example implementation is a good one, so that will be my first goal this week. I'll have a look at that checklist PR and see what else we should add. Sorry again for the delay, but I am looking forward to pushing this forward a bit this week. |
This pull request has been mentioned on Image.sc Forum. There might be relevant details there: https://forum.image.sc/t/next-call-on-next-gen-bioimaging-data-tools-2022-01-27/60885/11 |
Hello everyone! After some great discussion and hacking with @ivirshup, @LucaMarconato, @giovp, and @joshmoore. I think we have settled on the proposed spec for tables. Spec overviewWe have designed the specification with two broad use cases in mind:
In the long run, we imagine there would be multiple types of tables derived from the base table specification. These include:
We propose to extend full AnnData support. In the initial proposal, we thought constraining to the minimal subset of AnnData (i.e., X, obs, var) would ease the implementation. While I still think this is true, there are many spatial omics users who would benefit from the full AnnData table as it would allow them to save and load full analysis results from the core python omics tools (see scverse). This includes data such as coordinates for low dimensional embeddings (e.g., UMAP, T-SNE), neighbor graphs, etc. For more details, see the Metadata
Examples
Scope for this PRIn this PR, I would like to propose the base table specification (i.e., the layout of Follow up workIn a follow up, we can add specifications for other types of tables that are based on the finalized
ImplementationThe
|
Automated Review URLs |
| # `.zattrs` MUST contain "region", which is the path to the data the table is annotating. | ||
| # "region" MUST be a single path (single region) or an array of paths (multiple regions). | ||
| # "region" paths MUST be objects with a key "path" and the path value MUST be a string. | ||
| # `.zattrs` MUST contain "region_key" if "region" is an array. "region_key" is the key in `obs` denoting which region a given row corresponds to. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All these MUST
rules apply to regions table, but don't apply to the use-case of storing points/tracking data in tables as in ome/napari-ome-zarr#81
Presumably we do want to allow that use case (and others)? So these rules could be MAY
instead of MUST
?
I am a little confused about the scope of this addition. Should all applications that aim to support OME-NGFF for displaying images begin implementing support for tables? And in terms of the content of the spec, it seems that it's written as commented inline JSON? I would prefer a formatted set of paragraphs that explain the motivation / history for this data model (why anndata? are there alternatives?) and explain the spec in detail, followed by a potentially comment-annotated example. Although, ideally, the JSON examples should not require any comments at all -- everything that would be conveyed in those comments should have been previously conveyed in the description of the spec. |
"All" is of course a big ask, but I think any application that wants to show additional "properties" that are attached to an image should consider supporting the tables. |
I think I would benefit from some clarification of the relationship between this spec and the Anndata spec. For example, why is this PR necessary if the Anndata spec + toolchain already exists? AFAIK there is no restriction on extra metadata keys in OME-NGFF, so I don't understand why OME-NGFF metadata and Anndata metadata can't co-exist inside the same zarr container. |
For me, this PR has been useful to see what a AnnData file looks like "on disk", whereas https://anndata.readthedocs.io/en/latest/fileformat-prose.html gives more of a Python API view of the spec. Also the specification states where to look for multiple AnnData tables under an Image It also includes the addition of "keys" for listing groups that are not part of the AnnData spec but are useful if reading from the web, or anywhere you can't list the contents of a directory. |
This pull request has been mentioned on Image.sc Forum. There might be relevant details there: https://forum.image.sc/t/faim-hcs-functions-to-work-with-hcs-data/78868/9 |
A heads up for those that have been following along: over the past several weeks there have been a number of offline discussions after feedback on the Table spec (e.g. #178). In part this stems from a lack of clarity on the overall process applied to each NGFF PR. That's a separate issue that we as a community will need to address soon (#132). But with that lack of clarity, the state of the PR after a year and a half was also in question. In an effort to bring these conversations back into the open, @minnerbe will be opening an "addendum" to the tables spec for continued discussion. |
As announced by @joshmoore, I submitted a PR to this branch on @kevinyamauchi's fork with an addendum of this table spec proposal. My proposal should be understood as refactoring of this proposal from the point of view of someone who wants to read and write tables from Java. As such, it can provide a basis for further iterations and improvements on tables in OME-NGFF. In this spirit, I am very much looking forward to feedback and comments on my proposal! |
I think this PR makes sense. I have been uneasy about AnnData as spec because I have the impression it is too tied to the spatial transcriptomics use case and the python way of doing things. I didn't have the time to propose an alternative so thanks @minnerbe for doing it. |
Thanks for the PR, @minnerbe ! Also, thanks for the clear description of your motivation, concerns, and considerations. At first glance, this seems like it could be a viable alternative. I understand your concern that overfitting the on-disk representation to the python in-memory representation could make interoperability more difficult in some cases. I think we are fully aligned in the desire to get a spec accepted as soon as is reasonable that the community is generally happy with. I'll have a look this week and comment directly on your PR. |
To @jkh1 and @minnerbe's points, I'll add before digging into kevinyamauchi#3 is that if there is an overfitting on the side of AnnData to Python (a) I imagine that's something that the scverse community would (and should) be part of discussing and (b) we might not place the bar on NGFF of having that solved before we can move forward but can state that there is a more optimal situation that we would like to achieve. |
Seeing that this discussion has lost some of its momentum, I want to review the state of the present PR and this related PR, which we have been calling the 'minimal proposal'. In October 2021, @kevinyamauchi et al. proposed this Tables spec. Janelia made some new hires and new investments at the start of 2023, so several of us became interested in OME-NGFF long after Kevin et al.’s PR was first proposed. When we discovered the PR, we felt it was out of scope for a universal bioimaging metadata schema. We felt it was a long, highly detailed specification for one software solution to a specialized scientific problem, whose implementation was very Pythonic. It lacked design principles that could be extended to a variety of programming languages or scientific applications. Additionally, putting AnnData into OME-NGFF meant OME-NGFF would be forever catching up to AnnData updates. Instead, we felt that Kevin et al.’s proposal belonged in the AnnData documentation. That would give the AnnData maintainers complete control over the AnnData-Zarr specification, instead of making it OME’s responsibility. Kevin et al. responded proactively to an urgent need, and they worked hard to produce a useful standard. I don’t want to disparage their efforts, and I don’t believe their work should be discarded. I simply believe that the OME-NGFF specification is not the proper home for the product they’ve produced. @minnerbe and I (mostly Michael) wrote a new, alternative PR, with the goal of providing generic, extensible building blocks for storing tabular (meta)data in an OME-Zarr dataset. We wanted a specification that would accommodate existing AnnData-Zarr datasets if possible, without limiting scientists' tabular metadata to AnnData concepts and data structures. In truth, we’re not sure whether tables belong in OME-NGFF at all, but we did not want to criticize other peoples’ methods without providing a reasonable alternative solution. Now, as the OME-NGFF spec says, we are at a crossroads. Michael and I still see some things we would like to change about our spec:
We’re grateful that OME-NGFF exists, and we’re excited to have the opportunity to contribute to it. However, at this point, Michael and I are reluctant to put any more effort into our “minimal proposal” without a clear path forward. What’s the procedure for resolving these sorts of stalemates? Who is in charge of managing and ultimately merging or closing PRs (see also #134, #168, #170)? It seems we can either pause work on this spec and talk about governance procedures, or we can resolve this Tables dispute with an arbitrary procedure. In any case, let's talk about the plan. |
Hey @virginiascarlett and @minnerbe . I am sorry I haven't been very active on your PR (especially after you engaged earnestly). The summer got unexpectedly busy and I dropped the ball 😞 I am about to head on some much-needed holiday for the next week, but I wanted to acknowledge your message. Thank you for the summary and your perspective.
This resonates with me. I have been feeling similarly. It might be time to have a governance conversation. I have a few questions about the NGFF that would help me in thinking about how to can move towards a decision on tables in the NGFF:
In any case, I wanted to close by thanking everybody for being collegial in their messages. Personally, I have never felt attacked or disparaged and I hope the same is true from your side. Have a great weekend! |
Hello everyone! I have been thinking more about this and had a nice conversation with @joshmoore . I think there are some good points raised in kevinyamauchi#3. I also agree that without a process in place, the best way to a decision on tables in the NGFF is not clear. Thus, I think the best path forward would be to close this PR and focus on establishing a process for submitting, reviewing, and approving new specs. Once the governance and spec proposal process has been outlined, we can re-evaluate submitting the table spec proposals under that framework. Thank you all for your effort on these pull requests. I appreciate that everybody is trying to work towards making the NGFF a format that will serve the needs of the community. I am sorry we weren't able to get this across the line. However, I think taking a step back to clarify governance and the spec approval process will pay off in the long run. |
Thanks for the status update, @kevinyamauchi, as well as all of your efforts here. I’ll close the PR now. Most immediately for those following along, this will mean that 0.5 will focus primarily on the transformation work from @bogovicj along with any backlog of PRs which have accrued. Regarding @virginiascarlett’s onpoint questions regarding resolution of PRs that have lost their momentum, I’ll propose that @kevinyamauchi and I kick that off with a post mortem including the various offline attempts to resolve the issues and the lessons learned that you need encoding in the governance process. I’d also propose that that be posted on image.sc for a wider audience where we can discuss next steps. However, as there are a number of projects depending on the table spec, I’d suggest once we feel we’ve made progress with the process, that we come back quickly to identifying the parts of what’s been discussed here at such length that we can agree on. |
This pull request has been mentioned on Image.sc Forum. There might be relevant details there: |
updated June 29, 2022
This PR is a proposal for adding a spec for tabular data from the OME NGFF spatial omics hackathon and a few subsequent hackathons. This is a result of collaborative work and feedback with @joshmoore , @constantinpape , @aeisenbarth, @sebgoti , @VolkerH, @oeway, @tischi, @ilan-gold, @ctr26, @ivirshup , @giovp , @LucaMarconato, @DrKenHo-crick, and others.
Proposed specification
We are proposing to add a specification for a table to annotate label images. That is each row in the table provides annotations for a given label value in a label image. These may be computed features such as area or graphs encoding connectivity. We have modeled the table after the AnnData table, which is a workhorse table format for single-cell omics data. The rationale is that this table provides convenient data structures for (e.g., data frame, dense arrays, and sparse arrays). Further, using the
AnnData
table will make it easier to integrate imaging and spatial omics data and take advantage of the great tools for analyzing high dimensional features from the single-cell omics community (e.g. scanpy and squidpy).Layout
Tables are contained in a zarr group that may either be in root or contained within another group. See below for a graphical representation of the layout of an AnnData table.
X
the matrix containing the measured feature valueslayers
alternate matrices of the same size ofX
. Typically used to represent different normalizations of the data (e.g."counts"
,"log1p"
)obs
: a DataFrame annotating each row. This may contain categorical values such as name of the cell typeobsm
: dense matrices annotating each row. For example, the coordinates of the centroid of each labeled object (Nx3 array for N cells in 3D) or output dimensions of a dimentionality reduction algorithm.obsp
: sparse matrices annotating each row. For example, a connectivity graph encoding neighbor relationships between labeled objects.var
: a DataFrame annotating each column inX
. For example, the name of each measured value in X or summary statistics (e.g., mean, standard deviation) for the values in each columnvarm
: dense matrices annotating the columns inX
. This can include variable loadings from a dimensionality reduction or a table containing extra annotations on the variables.varp
sparse matrices annotating the columns inX
. For example, this could be used to store variable correlation networks.uns
: a dictionary containing scalars, dense matrices, sparse matrices, and dictionaries pairs.Linking tables to labels
To specify which labels image the table is annotating, the table group
.zattrs
must contain "region", which is the path to the data the table is annotating. To annotate multiple lable images, a list of paths can be provided. In the case of annotating multiple label images, the table group.zattrs
MUST contain "region_key", which is the key inobs
denoting which region a given row annotates.Example workflow
We have made a repo with a couple of example datasets. These demos both write a dataset out into an OME-Zarr file with the proposed table spec and read the data back into a napari viewer. The reading code is a quick prototype to allow the data to be viewed and doesn't represent the final implementation.
https://github.com/kevinyamauchi/ome-ngff-tables-prototype
Design trade-offs
The proposed table is more complex than a standard table. However, the additional data structures provide the ability to store important data related to analysis of the spatial features we measure from our images (e.g., adjacency graph). Further implementing the table as compatible with AnnData improves compatibility with established analysis tools for the analysis of spatial features (e.g. squidpy, PathML, vitessce).
Current limitations
Paths
The proposed specification for specifying the location to the data being annotated by the table is modeled after the Label image specification, but does have some limitations. For example, there is not a straightforward way to refer to images that are outside of the zarr file containing the table. However, we believe path handling should be addressed in a separate issue / PR.
Types of tables
The current proposal only specifies a table for annotating label images. In the long run, we imagine there would be multiple types of tables related to this table specification. These include: