Update introduction and references
joshmoore committed Nov 20, 2020
1 parent 7312c3c commit 4cc2535
Showing 1 changed file with 79 additions and 43 deletions.
122 changes: 79 additions & 43 deletions index.bs
@@ -10,7 +10,9 @@ Repository: https://github.com/joshmoore/ngff
Issue Tracking: Forums https://forum.image.sc/tag/ome-ngff
Logo: http://www.openmicroscopy.org/img/logos/ome-logomark.svg
Local Boilerplate: header yes
Local Boilerplate: copyright yes
Boilerplate: style-darkmode off
Markup Shorthands: markdown yes
Editor: Josh Moore, Open Microscopy Environment (OME) https://www.openmicroscopy.org
Abstract: This document contains next-generation file format (NGFF)
Abstract: specifications for storing bioimaging data in the cloud.
@@ -28,11 +30,11 @@ larger, preciser spatial measurements is unfortunately at odds with our ability
to structure and share those measurements with others. During a global pandemic
more than ever, we believe fervently that global, collaborative discovery as
opposed to the post-publication, "data-on-request" mode of operation is the
path forward. Bioimages should be shareable via open and commercial cloud
path forward. Bioimaging data should be shareable via open and commercial cloud
resources without the need to download entire datasets.

At the moment, that is not the norm. The plethora of data formats produced by
imaging systems are ill-suited to the remote sharing. Individual scientists
imaging systems are ill-suited to remote sharing. Individual scientists
typically lack the infrastructure they need to host these data themselves. When
they acquire images from elsewhere, time-consuming translations and data
cleaning are needed to interpret findings. Those same costs are multiplied when
@@ -41,53 +43,63 @@ factor before publication is possible. Without a common effort, each lab or
resource is left building the tools they need and maintaining that
infrastructure often without dedicated funding.

This document assumes that there are three keys to a workable solution:
This document defines a specification for bioimaging data that enables
the conversion of proprietary formats into a common, cloud-ready one.
Such next-generation file formats lay out data so that individual portions, or
"chunks", of large datasets can be referenced directly, eliminating the need
to download entire datasets.

1. Converting all data out of proprietary formats rather than trying to translate data on every access.
2. Chunking the data so that manageable areas of large data are reference-able online rather than downloading them entirely.
3. Collaborating on a small number of container formats and conventions for metadata rather than developing new versions to meet each individual requirement.

This document specifies one layout for images within Zarr files. The APIs and
scripts provided by this repository will support one or more versions of this
file, but they should all be considered internal investigations, not intended
for public re-use.

Why "next generation"? {#ngff}
------------------------------
Why "<dfn export="true"><abbr title="Next-generation file-format">NGFF</abbr></dfn>"? {#why-ngff}
-------------------------------------------------------------------------------------------------

A short description of what is needed for an imaging format is "a hierarchy
of n-dimensional (dense) arrays with metadata". This combination of features
is certainly provided by <dfn><abbr title="Hierarchical Data Format 5">HDF5</abbr></dfn>
is certainly provided by <dfn export="true"><abbr title="Hierarchical Data Format 5">HDF5</abbr></dfn>
from the <a href="https://www.hdfgroup.org">HDF Group</a>, which a number of
bioimaging formats do use. HDF5 and other larger binary structures, however,
are ill-suited for storage in the cloud where accessing individual segments,
or "chunks", of data by name rather than seeking through a large file is at
the heart of parallelization.
are ill-suited for storage in the cloud, where accessing individual chunks
of data by name, rather than seeking through a large file, is at the heart of
parallelization.

As a result, a number of formats have been developed more recently which provide
the basic data structure of an HDF5 file, but do so in a more cloud-friendly way.

<!--
Zarr {#zarr}

N5 {#n5}

Eventually, of course, these files will no longer be next-generation and we will
need to change the name ...


The specification
https://github.com/saalfeldlab/n5#file-system-specification

https://zarr-specs.readthedocs.io/en/core-protocol-v3.0-dev/protocol/core/v3.0.html
assumes a file format
-->
In the [PyData](https://pydata.org/) community, the Zarr [[zarr]] format was developed
for easily storing collections of [NumPy](https://numpy.org/) arrays. In the
[ImageJ](https://imagej.net/) community, N5 [[n5]] was developed to work around
the limitations of HDF5 ("N5" was originally short for "Not-HDF5").
Both of these formats permit storing individual chunks of data either locally in
separate files or in cloud-based object stores as separate keys.
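
As a rough illustration of what "separate keys" means, the sketch below maps an array index to the key of the chunk that holds it. It assumes the Zarr v2 convention of joining chunk-grid indices with a `.` separator; the function names and the example shape are hypothetical, and other layouts may use a different separator such as `/`.

```python
from itertools import product
from math import ceil


def chunk_key(index, chunk_shape, separator="."):
    """Return the key of the chunk holding the element at `index`.

    Assumes chunk-grid indices joined by `separator` (the "." default
    follows the Zarr v2 convention; this helper is illustrative only).
    """
    return separator.join(str(i // c) for i, c in zip(index, chunk_shape))


def all_chunk_keys(shape, chunk_shape, separator="."):
    """List the keys of every chunk of an array with the given shape."""
    # Number of chunks along each axis, rounding up for partial edge chunks.
    grid = [range(ceil(s / c)) for s, c in zip(shape, chunk_shape)]
    return [separator.join(map(str, idx)) for idx in product(*grid)]


# A hypothetical 1024 x 1024 image stored in 256 x 256 chunks.
shape, chunks = (1024, 1024), (256, 256)
print(chunk_key((300, 900), chunks))       # key of the chunk holding pixel (300, 900)
print(len(all_chunk_keys(shape, chunks)))  # number of separately addressable chunks
```

Because each key names one chunk, a client that needs only a small region can request just the matching files or object-store keys instead of reading the whole array.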

An [effort](https://zarr-specs.readthedocs.io/en/core-protocol-v3.0-dev/protocol/core/v3.0.html)
is underway to unify the two similar specifications into a single binary
specification. The editor's draft will soon enter a [request for comments (RFC)](https://github.com/zarr-developers/zarr-specs/issues/101) phase, with the goal of releasing a first version early in 2021. As that
process comes to an end, this document will be updated.

OME-NGFF {#ome-ngff}
--------------------

The conventions and specifications defined in this document are designed to
enable next-generation file formats to represent the same bioimaging data
that can be represented in \[OME-TIFF](http://www.openmicroscopy.org/ome-files/)
and beyond. However, the conventions will also be usable by HDF5 and other sufficiently advanced
binary containers. Eventually, we hope, the moniker "next-generation" will no longer be
applicable, and this will simply be the most efficient, common, and useful representation
of bioimaging data, whether during acquisition or sharing in the cloud.

Note: The following text makes use of OME-Zarr [[ome-zarr-py]], the current prototype implementation,
for all examples.

On-disk (or in-cloud) layout {#on-disk}
=======================================

An overview of the layout of an OME-Zarr fileset should make
understanding the following metadata sections easier. The hierarchy
is represented here as it would appear locally but could equally
be stored on a web server to be accessed via HTTP or in object storage
like S3 or GCS.

```
. # Root folder, potentially in S3,
│ # with a flat list of images by image ID.
@@ -130,8 +142,6 @@ On-disk (or in-cloud) layout {#on-disk}
├── 0 # Each multiscale level is stored as a separate Zarr array, as above, but only integer values
│ ... # are supported.
└── n


```

Metadata {#metadata}
@@ -312,6 +322,17 @@ above).

<pre class="biblio">
{
"blogNov2020": {
"href": "https://blog.openmicroscopy.org/file-formats/community/2020/11/04/zarr-data/",
"title": "Public OME-Zarr data (Nov. 2020)",
"authors": [
"OME Team"
],
"status": "Informational",
"publisher": "OME",
"id": "blogNov2020",
"date": "04 November 2020"
},
"imagesc26952": {
"href": "https://forum.image.sc/t/ome-s-position-regarding-file-formats/26952",
"title": "OME’s position regarding file formats",
@@ -323,16 +344,31 @@ above).
"id": "imagesc26952",
"date": "19 June 2020"
},
"n5": {
"id": "n5",
"href": "https://github.com/saalfeldlab/n5/issues/62",
"title": "N5---a scalable Java API for hierarchies of chunked n-dimensional tensors and structured meta-data",
"status": "Informational",
"authors": [
"John A. Bogovic",
"Igor Pisarev",
"Philipp Hanslovsky",
"Neil Thistlethwaite",
"Stephan Saalfeld"
],
"date": "2020"
},
"ome-zarr-py": {
"id": "ome-zarr-py",
"href": "https://doi.org/10.5281/zenodo.4113931",
"title": "ome-zarr-py: Experimental implementation of next-generation file format (NGFF) specifications for storing bioimaging data in the cloud.",
"status": "Informational",
"publisher": "Zenodo",
"authors": [
"OME",
"et al"
],
"date": "06 October 2020"
},
},
"zarr": {
"id": "zarr",
Expand Down
