Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Table spec proposal #64

Closed
wants to merge 35 commits into from
Closed

Conversation

kevinyamauchi
Copy link
Contributor

@kevinyamauchi kevinyamauchi commented Oct 1, 2021

updated June 29, 2022

This PR is a proposal for adding a spec for tabular data from the OME NGFF spatial omics hackathon and a few subsequent hackathons. This is a result of collaborative work and feedback with @joshmoore , @constantinpape , @aeisenbarth, @sebgoti , @VolkerH, @oeway, @tischi, @ilan-gold, @ctr26, @ivirshup , @giovp , @LucaMarconato, @DrKenHo-crick, and others.

Proposed specification

We are proposing to add a specification for a table to annotate label images. That is each row in the table provides annotations for a given label value in a label image. These may be computed features such as area or graphs encoding connectivity. We have modeled the table after the AnnData table, which is a workhorse table format for single-cell omics data. The rationale is that this table provides convenient data structures for (e.g., data frame, dense arrays, and sparse arrays). Further, using the AnnData table will make it easier to integrate imaging and spatial omics data and take advantage of the great tools for analyzing high dimensional features from the single-cell omics community (e.g. scanpy and squidpy).

Layout

Tables are contained in a zarr group that may either be in root or contained within another group. See below for a graphical representation of the layout of an AnnData table.

anndata layout

  • X the matrix containing the measured feature values
  • layers alternate matrices of the same size of X. Typically used to represent different normalizations of the data (e.g. "counts", "log1p")
  • obs: a DataFrame annotating each row. This may contain categorical values such as name of the cell type
  • obsm: dense matrices annotating each row. For example, the coordinates of the centroid of each labeled object (Nx3 array for N cells in 3D) or output dimensions of a dimentionality reduction algorithm.
  • obsp: sparse matrices annotating each row. For example, a connectivity graph encoding neighbor relationships between labeled objects.
  • var: a DataFrame annotating each column in X. For example, the name of each measured value in X or summary statistics (e.g., mean, standard deviation) for the values in each column
  • varm: dense matrices annotating the columns in X. This can include variable loadings from a dimensionality reduction or a table containing extra annotations on the variables.
  • varp sparse matrices annotating the columns in X. For example, this could be used to store variable correlation networks.
  • uns: a dictionary containing scalars, dense matrices, sparse matrices, and dictionaries pairs.

Linking tables to labels

To specify which labels image the table is annotating, the table group .zattrs must contain "region", which is the path to the data the table is annotating. To annotate multiple lable images, a list of paths can be provided. In the case of annotating multiple label images, the table group .zattrs MUST contain "region_key", which is the key in obs denoting which region a given row annotates.

Example workflow

We have made a repo with a couple of example datasets. These demos both write a dataset out into an OME-Zarr file with the proposed table spec and read the data back into a napari viewer. The reading code is a quick prototype to allow the data to be viewed and doesn't represent the final implementation.

https://github.com/kevinyamauchi/ome-ngff-tables-prototype

Design trade-offs

The proposed table is more complex than a standard table. However, the additional data structures provide the ability to store important data related to analysis of the spatial features we measure from our images (e.g., adjacency graph). Further implementing the table as compatible with AnnData improves compatibility with established analysis tools for the analysis of spatial features (e.g. squidpy, PathML, vitessce).

Current limitations

Paths

The proposed specification for specifying the location to the data being annotated by the table is modeled after the Label image specification, but does have some limitations. For example, there is not a straightforward way to refer to images that are outside of the zarr file containing the table. However, we believe path handling should be addressed in a separate issue / PR.

Types of tables

The current proposal only specifies a table for annotating label images. In the long run, we imagine there would be multiple types of tables related to this table specification. These include:

  • PointTable: a table where each row is a point (i.e., has coordinates, radii, annotations for each point)
  • RegionTable: a table where each row is a series of annotations on a regions of interest (region of interest could be definied by a polygon, etc.)
  • ImageTable: a table where each row is a series of annotations on an image. These could be summary statistics, etc

latest/index.bs Outdated Show resolved Hide resolved
@kevinyamauchi
Copy link
Contributor Author

kevinyamauchi commented Oct 20, 2021

I think this PR now reflects our thoughts from the hackathon. What is the next step, post it in the image.sc thread for broader comments?

cc @joshmoore

@will-moore
Copy link
Member

@kevinyamauchi certainly good to get broader feedback, especially if all cc'd above are happy.
One thing that is always useful is some sample data in the proposed format (or a way to generate it), and implementations that can read / display the data.
There's the makings of a checklist at https://github.com/ome/ngff/pull/52/files

@tischi
Copy link

tischi commented Oct 21, 2021

One thing that is always useful is some sample data in the proposed format (or a way to generate it), and implementations that can read / display the data.

I agree, if it is not too much work it would be great to have a minimal use-case demonstration. Unfortunately, on the Java side I don't think that's easily possible (at least if we want to use N5-ZARR to read and write the tabular data, in particular String values, there is quite a bit of work to be done). @bogovicj, do you know if there could be anyone that would like to take this on? As mentioned, @constantinpape and I hacked something together, but I think it would need someone more into the N5 (and imglib2) core development to implement something nice.

Copy link
Contributor

@bogovicj bogovicj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

X MUST not be a complex type (i.e., MUST be a single type)

What is meant by "complex" vs "single" type, exactly? I assume "complex" would include objects, arrays, dictionaries, ..., and "single" would be primitive types. Are variable-length strings "complex" or "single" ?

@bogovicj
Copy link
Contributor

Thanks @kevinyamauchi for the nice work and @tischi for the mention.

Agree, this would be very nice to have in Java -land, and will ask around to see who might have bandwidth
to help on this. Maybe I end up taking lead here, it'll just be slow if it's me :-/

+1 to having some minimal use-cases / containers with the functionality described here with expected behavior. I'm unlikely to start work without such examples.

@constantinpape and I hacked something together,

Would you mind sending a pointer if its convenient?

in particular String values, there is quite a bit of work to be done

Do you mean storing variable-length strings in the data matrix X? Is that allowed / desired? Related to my question above about what types are allowed.

@constantinpape
Copy link
Contributor

@bogovicj Here's the work for padded string support in n5-zarr. It's very preliminary and hacky, I am sure there are more elegant ways of doing this. Note that the current solution is not fully working (strings read back are not quite what's expected), but I think the only thing missing is to change the endianess of the characters in the strings (characters are encoded in utf64; this took us some time to figure out):
https://github.com/tischi/n5/tree/readStrings
https://github.com/tischi/n5-zarr/tree/readStrings
Here's our test zarr container with string data:
https://github.com/ome/spatial-omics-hackathon-2021/tree/main/zarr-string-table/table.zarr

@aeisenbarth
Copy link

By storing column names as the actual file names of arrays (col_0 etc.), we limit column names to restrictions of the underlaying file system. Some table formats may support column names that can not be represented as file names on some systems.

For example hdf5, parquet and feather seem to allow column names like "µm", "m/z" or containing colons and even backslashes.

If we decide to support a subset of possible Unicode names, it should be mentioned in the spec.

@joshmoore
Copy link
Member

regarding points: see also - https://mobile.twitter.com/howardbutler/status/1453395016592789509

@imagesc-bot
Copy link

This pull request has been mentioned on Image.sc Forum. There might be relevant details there:

https://forum.image.sc/t/ome-zarr-hackathon-january-2022/60771/1

@thewtex
Copy link
Contributor

thewtex commented Dec 7, 2021

Cross-referencing relating models for geospatial with Arrow for geometry data:

https://github.com/jorisvandenbossche/geo-arrow-spec/blob/geo-arrow-format-initial/format.md

@imagesc-bot
Copy link

This pull request has been mentioned on Image.sc Forum. There might be relevant details there:

https://forum.image.sc/t/storage-format-for-super-resolution-point-cloud-data/61393/3

@imagesc-bot
Copy link

This pull request has been mentioned on Image.sc Forum. There might be relevant details there:

https://forum.image.sc/t/morpholibj-measurements-output/62614/3

@kevinyamauchi
Copy link
Contributor Author

kevinyamauchi commented Jan 31, 2022

Hello everyone! I am so sorry for the radio silence from my side. I've been struggling to carve out the time to work on this. However, thanks to the OME-NGFF hackathon this week, I will finally get to work on this some more!

I think @will-moore 's feedback to add an example implementation is a good one, so that will be my first goal this week. I'll have a look at that checklist PR and see what else we should add.

Sorry again for the delay, but I am looking forward to pushing this forward a bit this week.

@imagesc-bot
Copy link

This pull request has been mentioned on Image.sc Forum. There might be relevant details there:

https://forum.image.sc/t/next-call-on-next-gen-bioimaging-data-tools-2022-01-27/60885/11

@kevinyamauchi
Copy link
Contributor Author

Hello everyone! After some great discussion and hacking with @ivirshup, @LucaMarconato, @giovp, and @joshmoore. I think we have settled on the proposed spec for tables.

Spec overview

We have designed the specification with two broad use cases in mind:

  • defining regions of interest (e.g., points, polygons)
  • annotating data (e.g., summary statistics about images, measurements of regions of interest)

In the long run, we imagine there would be multiple types of tables derived from the base table specification. These include:

  • PointsTable: a table where each row is a point (i.e., has coordinates, radii, annotations for each point)
  • RegionTable: a table where each row is a series of annotations on a region of interest (region of interest could be definied by a label image, by a polygon, etc.)
  • ImageTable: a table where each row is a series of annotations on an image. These could be summary statistics, etc.

We propose to extend full AnnData support. In the initial proposal, we thought constraining to the minimal subset of AnnData (i.e., X, obs, var) would ease the implementation. While I still think this is true, there are many spatial omics users who would benefit from the full AnnData table as it would allow them to save and load full analysis results from the core python omics tools (see scverse). This includes data such as coordinates for low dimensional embeddings (e.g., UMAP, T-SNE), neighbor graphs, etc.

For more details, see the AnnData paper.

Metadata

  • To define the type of table, each table should have a type metadata field (e.g., @type:nfgg:label_region.
  • To link the annotations in the table to the region of interest, the table should have a coordinate_space field that contains the name of the coordinate space the table is linked to.

Examples

Scope for this PR

In this PR, I would like to propose the base table specification (i.e., the layout of AnnData in OME-NGFF. If folks agree conceptually, I will extend this PR to include the full AnnData layout.

Follow up work

In a follow up, we can add specifications for other types of tables that are based on the finalized BaseTable spec. For example:

  • points
  • polygons
  • ImageTable

Implementation

The AnnData python library already has a writer for zarr, so it is fairly straight forward to add an implementation to the python ome-zarr library. I propose we can implement the spec in a staged manner.

  1. Stage 1:
    • python: full support in ome-zar via anndata dependency
    • javascript: vitessce supported anndata fields
    • java: ?
  2. Stage 2:
    • python: full support in ome-zar via anndata dependency
    • javascript: full support
    • java: ?
  3. Stage 3:
    • python: full support in ome-zar via anndata dependency
    • javascript: full support
    • java: full support

@github-actions
Copy link
Contributor

github-actions bot commented Jun 19, 2022

Automated Review URLs

| # `.zattrs` MUST contain "region", which is the path to the data the table is annotating.
| # "region" MUST be a single path (single region) or an array of paths (multiple regions).
| # "region" paths MUST be objects with a key "path" and the path value MUST be a string.
| # `.zattrs` MUST contain "region_key" if "region" is an array. "region_key" is the key in `obs` denoting which region a given row corresponds to.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All these MUST rules apply to regions table, but don't apply to the use-case of storing points/tracking data in tables as in ome/napari-ome-zarr#81
Presumably we do want to allow that use case (and others)? So these rules could be MAY instead of MUST?

@d-v-b
Copy link
Contributor

d-v-b commented Mar 8, 2023

I am a little confused about the scope of this addition. Should all applications that aim to support OME-NGFF for displaying images begin implementing support for tables?

And in terms of the content of the spec, it seems that it's written as commented inline JSON? I would prefer a formatted set of paragraphs that explain the motivation / history for this data model (why anndata? are there alternatives?) and explain the spec in detail, followed by a potentially comment-annotated example. Although, ideally, the JSON examples should not require any comments at all -- everything that would be conveyed in those comments should have been previously conveyed in the description of the spec.

@joshmoore
Copy link
Member

Should all applications that aim to support OME-NGFF for displaying images begin implementing support for tables?

"All" is of course a big ask, but I think any application that wants to show additional "properties" that are attached to an image should consider supporting the tables.

@d-v-b
Copy link
Contributor

d-v-b commented Mar 10, 2023

I think I would benefit from some clarification of the relationship between this spec and the Anndata spec. For example, why is this PR necessary if the Anndata spec + toolchain already exists? AFAIK there is no restriction on extra metadata keys in OME-NGFF, so I don't understand why OME-NGFF metadata and Anndata metadata can't co-exist inside the same zarr container.

@will-moore
Copy link
Member

For me, this PR has been useful to see what a AnnData file looks like "on disk", whereas https://anndata.readthedocs.io/en/latest/fileformat-prose.html gives more of a Python API view of the spec.
Since I'm trying to consume this data from the web, rather than python.

Also the specification states where to look for multiple AnnData tables under an Image image.zarr/tables/.zattrs.

It also includes the addition of "keys" for listing groups that are not part of the AnnData spec but are useful if reading from the web, or anywhere you can't list the contents of a directory.

@imagesc-bot
Copy link

This pull request has been mentioned on Image.sc Forum. There might be relevant details there:

https://forum.image.sc/t/faim-hcs-functions-to-work-with-hcs-data/78868/9

@joshmoore
Copy link
Member

A heads up for those that have been following along: over the past several weeks there have been a number of offline discussions after feedback on the Table spec (e.g. #178). In part this stems from a lack of clarity on the overall process applied to each NGFF PR. That's a separate issue that we as a community will need to address soon (#132). But with that lack of clarity, the state of the PR after a year and a half was also in question. In an effort to bring these conversations back into the open, @minnerbe will be opening an "addendum" to the tables spec for continued discussion.

@minnerbe
Copy link

As announced by @joshmoore, I submitted a PR to this branch on @kevinyamauchi's fork with an addendum of this table spec proposal.

My proposal should be understood as refactoring of this proposal from the point of view of someone who wants to read and write tables from Java. As such, it can provide a basis for further iterations and improvements on tables in OME-NGFF. In this spirit, I am very much looking forward to feedback and comments on my proposal!

@jkh1
Copy link

jkh1 commented Jun 20, 2023

I think this PR makes sense. I have been uneasy about AnnData as spec because I have the impression it is too tied to the spatial transcriptomics use case and the python way of doing things. I didn't have the time to propose an alternative so thanks @minnerbe for doing it.

@kevinyamauchi
Copy link
Contributor Author

Thanks for the PR, @minnerbe ! Also, thanks for the clear description of your motivation, concerns, and considerations. At first glance, this seems like it could be a viable alternative. I understand your concern that overfitting the on-disk representation to the python in-memory representation could make interoperability more difficult in some cases. I think we are fully aligned in the desire to get a spec accepted as soon as is reasonable that the community is generally happy with. I'll have a look this week and comment directly on your PR.

@joshmoore
Copy link
Member

To @jkh1 and @minnerbe's points, I'll add before digging into kevinyamauchi#3 is that if there is an overfitting on the side of AnnData to Python (a) I imagine that's something that the scverse community would (and should) be part of discussing and (b) we might not place the bar on NGFF of having that solved before we can move forward but can state that there is a more optimal situation that we would like to achieve.

@virginiascarlett
Copy link
Contributor

Seeing that this discussion has lost some of its momentum, I want to review the state of the present PR and this related PR, which we have been calling the 'minimal proposal'.

In October 2021, @kevinyamauchi et al. proposed this Tables spec. Janelia made some new hires and new investments at the start of 2023, so several of us became interested in OME-NGFF long after Kevin et al.’s PR was first proposed. When we discovered the PR, we felt it was out of scope for a universal bioimaging metadata schema. We felt it was a long, highly detailed specification for one software solution to a specialized scientific problem, whose implementation was very Pythonic. It lacked design principles that could be extended to a variety of programming languages or scientific applications. Additionally, putting AnnData into OME-NGFF meant OME-NGFF would be forever catching up to AnnData updates.

Instead, we felt that Kevin et al.’s proposal belonged in the AnnData documentation. That would give the AnnData maintainers complete control over the AnnData-Zarr specification, instead of making it OME’s responsibility.

Kevin et al. responded proactively to an urgent need, and they worked hard to produce a useful standard. I don’t want to disparage their efforts, and I don’t believe their work should be discarded. I simply believe that the OME-NGFF specification is not the proper home for the product they’ve produced.

@minnerbe and I (mostly Michael) wrote a new, alternative PR, with the goal of providing generic, extensible building blocks for storing tabular (meta)data in an OME-Zarr dataset. We wanted a specification that would accommodate existing AnnData-Zarr datasets if possible, without limiting scientists' tabular metadata to AnnData concepts and data structures. In truth, we’re not sure whether tables belong in OME-NGFF at all, but we did not want to criticize other peoples’ methods without providing a reasonable alternative solution.

Now, as the OME-NGFF spec says, we are at a crossroads. Michael and I still see some things we would like to change about our spec:

  • Some of our concepts need clarification (e.g., What is a table?).
  • It’s unclear how to implement multi-dimensional columns in Python (and whether we definitely want multidimensional columns).
  • It’s unclear where should the AnnData user store “X”.
  • For that matter, what is “X”? Is it a table? Is it an array? Is it comparable to a label image, and if so, should OME-NGFF provide some generic concepts and definitions for these kinds of ancillary data?
  • Our spec essentially allows the user to create a relational database schema where only one-to-one relations are allowed. Is this what we want? Does this framing change how we write about it?

We’re grateful that OME-NGFF exists, and we’re excited to have the opportunity to contribute to it. However, at this point, Michael and I are reluctant to put any more effort into our “minimal proposal” without a clear path forward. What’s the procedure for resolving these sorts of stalemates? Who is in charge of managing and ultimately merging or closing PRs (see also #134, #168, #170)? It seems we can either pause work on this spec and talk about governance procedures, or we can resolve this Tables dispute with an arbitrary procedure. In any case, let's talk about the plan.

@kevinyamauchi
Copy link
Contributor Author

Hey @virginiascarlett and @minnerbe . I am sorry I haven't been very active on your PR (especially after you engaged earnestly). The summer got unexpectedly busy and I dropped the ball 😞

I am about to head on some much-needed holiday for the next week, but I wanted to acknowledge your message. Thank you for the summary and your perspective.

However, at this point, Michael and I are reluctant to put any more effort into our “minimal proposal” without a clear path forward.

This resonates with me. I have been feeling similarly. It might be time to have a governance conversation.

I have a few questions about the NGFF that would help me in thinking about how to can move towards a decision on tables in the NGFF:

  • Like @virginiascarlett , I am curious about who makes the "final decision" about a spec being approved under the current NGFF governance model?
  • What is the expected timeline for implementation in the three languages (java, javascript, python) after a spec is accepted?
  • Who is responsible for recruiting/identifying implementers for a proposed spec (i.e., does a proposal need to have a clear implementation plan to be accepted)?

In any case, I wanted to close by thanking everybody for being collegial in their messages. Personally, I have never felt attacked or disparaged and I hope the same is true from your side. Have a great weekend!

@kevinyamauchi
Copy link
Contributor Author

Hello everyone! I have been thinking more about this and had a nice conversation with @joshmoore . I think there are some good points raised in kevinyamauchi#3. I also agree that without a process in place, the best way to a decision on tables in the NGFF is not clear.

Thus, I think the best path forward would be to close this PR and focus on establishing a process for submitting, reviewing, and approving new specs. Once the governance and spec proposal process has been outlined, we can re-evaluate submitting the table spec proposals under that framework.

Thank you all for your effort on these pull requests. I appreciate that everybody is trying to work towards making the NGFF a format that will serve the needs of the community. I am sorry we weren't able to get this across the line. However, I think taking a step back to clarify governance and the spec approval process will pay off in the long run.

@joshmoore
Copy link
Member

Thanks for the status update, @kevinyamauchi, as well as all of your efforts here. I’ll close the PR now. Most immediately for those following along, this will mean that 0.5 will focus primarily on the transformation work from @bogovicj along with any backlog of PRs which have accrued.

Regarding @virginiascarlett’s onpoint questions regarding resolution of PRs that have lost their momentum, I’ll propose that @kevinyamauchi and I kick that off with a post mortem including the various offline attempts to resolve the issues and the lessons learned that you need encoding in the governance process. I’d also propose that that be posted on image.sc for a wider audience where we can discuss next steps.

However, as there are a number of projects depending on the table spec, I’d suggest once we feel we’ve made progress with the process, that we come back quickly to identifying the parts of what’s been discussed here at such length that we can agree on.

@joshmoore joshmoore closed this Sep 25, 2023
@imagesc-bot
Copy link

This pull request has been mentioned on Image.sc Forum. There might be relevant details there:

https://forum.image.sc/t/ngff-updates-oct-2023/86957/1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.