From f6979682933b06b4418a1e913ef7dab3bf64ce1f Mon Sep 17 00:00:00 2001
From: Norman Fomferra

Contributions
corresponding issue.
Setup development environment:
-pip install -r requirements.txt
-pip install -r requirements-dev.txt
-pip install -r requirements-docs.txt
-
+pip install -r requirements.txt
+pip install -r requirements-dev.txt
+pip install -r requirements-docs.txt
+
zappend
uses pytest for unit-level testing
and code coverage analysis.
pytest --cov=zappend tests
-
+pytest --cov=zappend tests
+
zappend
source code is formatted using the black tool.
black zappend
-
+black zappend
+
zappend
documentation is built using the mkdocs tool.
pip install -r requirements-doc.txt
+pip install -r requirements-doc.txt
-mkdocs build
-mkdocs serve
-mkdocs gh-deploy
-
+mkdocs build
+mkdocs serve
+mkdocs gh-deploy
+
License
zappend
is open source made available under the terms and conditions of the
MIT License.
diff --git a/api/index.html b/api/index.html
index 9fb3900..82f095f 100755
--- a/api/index.html
+++ b/api/index.html
@@ -193,11 +193,11 @@
-
+
- Introduction
+ Overview
diff --git a/cli/index.html b/cli/index.html
index fceb95f..2425bd5 100755
--- a/cli/index.html
+++ b/cli/index.html
@@ -193,11 +193,11 @@
-
+
- Introduction
+ Overview
@@ -370,7 +370,7 @@
Command Line Interface Reference
-Usage: zappend [OPTIONS] [SLICES]...
+Usage: zappend [OPTIONS] [SLICES]...
Create or update a Zarr dataset TARGET from slice datasets SLICES.
@@ -379,12 +379,12 @@ Command Line Interface Reference
passed, subsequent configurations are incremental to
the previous ones.
-t, --target TARGET Target Zarr dataset path or URI. Overrides the
- 'target_dir' configuration field.
+ 'target_dir' configuration field.
--dry-run Run the tool without creating, changing, or deleting
any files.
--help-config json|md Show configuration help and exit.
--help Show this message and exit.
-
+
diff --git a/config/index.html b/config/index.html
index 82cc92b..bbd56c1 100755
--- a/config/index.html
+++ b/config/index.html
@@ -193,11 +193,11 @@
-
+
- Introduction
+ Overview
diff --git a/guide/index.html b/guide/index.html
index 2d8f41f..a38d77c 100755
--- a/guide/index.html
+++ b/guide/index.html
@@ -193,11 +193,11 @@
-
+
- Introduction
+ Overview
@@ -298,9 +298,9 @@
-
-
+
- Variable Outline
+ Variable Dimensions
@@ -335,27 +335,27 @@
-
-
+
- Missing Values
+ Missing Data
-
-
+
- Compression
+ Data Packing
-
-
+
- Data Packing
+ Compression
@@ -556,9 +556,9 @@
-
-
+
- Variable Outline
+ Variable Dimensions
@@ -593,27 +593,27 @@
-
-
+
- Missing Values
+ Missing Data
-
-
+
- Compression
+ Data Packing
-
-
+
- Data Packing
+ Compression
@@ -701,64 +701,60 @@
User Guide
slice dataset paths that contribute to the datacube to be generated. The target dataset
path must point to a directory that will contain a Zarr group to be created and
updated. The slice dataset paths may be provided as Zarr as well or in other data
-formats supported by the
-xarray.open_dataset()
-function. The target and slice dataset are allowed to live in different filesystems.
+formats supported by the xarray.open_dataset() function.
+The target and slice datasets are allowed to live in different filesystems.
Additional filesystem storage options may be specified via the tool's configuration.
The tool takes care of generating the target dataset from slice datasets, but doesn't
care how the slice datasets are created. Hence, when using the Python zappend()
function, the slice datasets can be provided in various forms. More on this below.
-
-[!NOTE]
-We use the term Dataset in the same way xarray
does: A dataset
-comprises any number of multidimensional Data Variables, and
-usually 1-dimensional Coordinate Variables that provide the labels for
-the dimensions used by the data variables. A variable comprises the actual
-data array as well as metadata describing the data dimensions,
-units, and encoding, such as chunking and compression.
-
+
+Note
+We use the term Dataset in the same way xarray
does: A dataset comprises any
+number of multidimensional Data Variables, and usually 1-dimensional
+Coordinate Variables that provide the labels for the dimensions used by the data
+variables. A variable comprises the actual data array as well as metadata describing
+the data dimensions, units, and encoding, such as chunking and compression.
+
Dataset Outline
-If no further configuration is supplied, then the target dataset's outline
-and data encoding is fully prescribed by the first slice dataset provided.
-By default, the dimension along subsequent slice datasets are concatenated
-is time
. If you use a different append dimension, the append_dim
-setting can be used to specify its name:
-{
- "append_dim": "depth"
-}
-
+If no further configuration is supplied, then the target dataset's outline and data
+encoding is fully prescribed by the first slice dataset provided. By default, the
+dimension along which subsequent slice datasets are concatenated is time
. If you use a
+different append dimension, the append_dim
setting can be used to specify its name:
+{
+ "append_dim": "depth"
+}
+
All other non-variadic dimensions can and should be specified using the
fixed_dims
setting which is a mapping from dimension name to the
fixed dimension sizes, e.g.:
-{
- "fixed_dims": {
- "x": 16384,
- "y": 8192
- }
-}
-
+{
+ "fixed_dims": {
+ "x": 16384,
+ "y": 8192
+ }
+}
+
By default, without further configuration, all data variables seen in the first
dataset slice will be included in the target dataset. If only a subset of
variables shall be used from the slice dataset, they can be specified using the
included_variables
setting, which is a list of names of variables that will
be included:
-{
- "included_variables": [
- "time", "y", "x",
- "chl",
- "tsm"
- ]
-}
-
+{
+ "included_variables": [
+ "time", "y", "x",
+ "chl",
+ "tsm"
+ ]
+}
+
Often, it is easier to tell which variables should be excluded:
-{
- "excluded_variables": ["GridCellId"]
-}
-
+{
+ "excluded_variables": ["GridCellId"]
+}
+
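The effect of the included_variables and excluded_variables settings can be sketched as a simple name filter. This is an illustration only, not zappend internals, and the variable names are the hypothetical ones from the examples above:

```python
# Illustrative sketch of the selection implied by the two settings
# (not zappend internals; variable names are hypothetical).
slice_variables = ["time", "y", "x", "chl", "tsm", "GridCellId"]

def select(variables, included=None, excluded=None):
    # Keep only names listed in included_variables, if given ...
    if included is not None:
        variables = [name for name in variables if name in included]
    # ... then drop names listed in excluded_variables, if given.
    if excluded is not None:
        variables = [name for name in variables if name not in excluded]
    return variables

assert select(slice_variables, excluded=["GridCellId"]) == \
    ["time", "y", "x", "chl", "tsm"]
```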
Variable Metadata
-Without any additional configuration, zappend
uses the outline, attributes,
-and encoding information of data variables for the target dataset from the
-data variables of the first slice dataset.
+
Without any additional configuration, zappend
uses the dimensions, attributes,
+and encoding information from the data variables of the first slice dataset.
Encoding information is used only to the extent applicable to the Zarr format.
Non-applicable encoding information will be reported by a warning log record
but is otherwise ignored.
@@ -770,96 +766,162 @@ Variable Metadata
first dataset slice.
A special "variable name" is the wildcard *
that can be used to define default
values for all variables:
-{
- "variables": {
- "*": { }
- }
-}
-
+{
+ "variables": {
+ "*": {
+ }
+ }
+}
+
If *
is specified, the effective variable metadata is obtained by merging the given
variable-specific metadata into the common metadata given by *
, which is in turn
merged into the metadata of the variable in the first dataset slice.
-
-[!NOTE]
-The metadata of variables from subsequent slice datasets is ignored!
-
-Variable Outline
-To ensure a slice variable has the expected dimensionality, the dims
-setting is used. The following example defines the dimensions of the data variable
+
+Note
+Only metadata from the first slice dataset is used; metadata of variables from
+subsequent slice datasets is ignored entirely.
+
+Variable Dimensions
+To ensure a slice variable has the expected dimensionality and shape, the dims
+setting is used. The following example defines the dimensions of a data variable
named chl
(Chlorophyll):
-{
- "variables": {
- "chl": {
- "dims": ["time", "y", "x"]
- }
- }
-}
-
+{
+ "variables": {
+ "chl": {
+ "dims": ["time", "y", "x"]
+ }
+ }
+}
+
An error will be raised if a variable from a subsequent slice has different dimensions.
Variable Attributes
Extra variable attributes can be provided using the attrs
setting:
-{
- "variables": {
- "chl": {
- "attrs": {
- "units": "mg/m^3",
- "long_name": "chlorophyll_concentration"
- }
- }
- }
-}
-
+{
+ "variables": {
+ "chl": {
+ "attrs": {
+ "units": "mg/m^3",
+ "long_name": "chlorophyll_concentration"
+ }
+ }
+ }
+}
+
Variable Encoding
-Encoding metadata specifies how array data is stored in the target dataset and
-includes storage data type, packing, chunking, and compression.
-Encoding metadata for a given variable is provided by the encoding
setting.
-Since the encoding is often shared by multiple variables the wildcard
-variable name *
can often be of help.
+Encoding metadata specifies how array data is stored in the target dataset and includes
+storage data type, packing, chunking, and compression. Encoding metadata for a given
+variable is provided by the encoding
setting. Since the encoding is often shared by
+multiple variables the wildcard variable name *
can often be of help.
+
+Verify encoding is as expected
+To verify that zappend
uses the expected encoding for your variables, create a
+target dataset for testing from your first slice dataset and open it using
+ds = xarray.open_zarr(target_dir, decode_cf=False)
. Then inspect dataset ds
+using the Python console or Jupyter Notebook (attribute ds.<var>.encoding
).
+You can also inspect the Zarr directly by opening the <target_dir>/<var>/.zarray
+or <target_dir>/.zmetadata
metadata JSON files.
+
Chunking
By default, the chunking of the coordinate variable corresponding to the append
-dimension will be its dimension in the first slice dataset. Often, this will
-be one or a small number. Since xarray
loads coordinates eagerly when opening
-a dataset, this can lead to performance issues if the target dataset is served
-from object storage such as S3. This is because, a separate HTTP request is
-required for every single chunk. It is therefore very advisable to set the
-chunks of that variable to a larger number using the chunks
setting.
-For other variables, the chunking within the append dimension may stay small
-if desired:
-{
- "variables": {
- "time": {
- "dims": ["time"],
- "encoding": {
- "chunks": [1024]
- }
- },
- "chl": {
- "dims": ["time", "y", "x"],
- "encoding": {
- "chunks": [1, 2048, 2048]
- }
- }
- }
-}
-
-Missing Values
-This section is a work in progress.
-Compression
-This section is a work in progress.
+dimension will be the size of that dimension in the first slice dataset. Often, this will be one or
+a small number. Since xarray
loads coordinates eagerly when opening a dataset, this
+can lead to performance issues if the target dataset is served from object storage such
+as S3. This is because a separate HTTP request is required for every single chunk. It
+is therefore highly advisable to set the chunks of that variable to a larger number using
+the chunks
setting. For other variables, the chunking within the append dimension may
+stay small if desired:
+{
+ "variables": {
+ "time": {
+ "dims": ["time"],
+ "encoding": {
+ "chunks": [1024]
+ }
+ },
+ "chl": {
+ "dims": ["time", "y", "x"],
+ "encoding": {
+ "chunks": [1, 2048, 2048]
+ }
+ }
+ }
+}
+
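The cost described above can be estimated with back-of-the-envelope arithmetic, assuming one HTTP request per chunk and a hypothetical dataset of 10,000 time steps:

```python
# Rough sketch of why larger time chunks help when the target lives in
# object storage: one HTTP request is needed per chunk of the eagerly
# loaded "time" coordinate. The step count below is a hypothetical example.
n_time_steps = 10_000

# Ceiling division: number of chunks (= requests) needed to read "time".
requests_small_chunks = -(-n_time_steps // 1)     # chunks=[1]
requests_large_chunks = -(-n_time_steps // 1024)  # chunks=[1024]

print(requests_small_chunks, requests_large_chunks)  # 10000 10
```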
+Missing Data
+To indicate missing data in a variable data array, a dedicated no-data or missing value
+can be specified by the fill_value
setting. The value is given in a variable's storage
+type and storage units; see the next section, Data Packing.
+{
+ "variables": {
+ "chl": {
+ "encoding": {
+ "fill_value": -999
+ }
+ }
+ }
+}
+
+If the fill_value
is not specified, the default is NaN
(given as string "NaN"
+in JSON) if the storage data type is floating point; it is None
(null
in JSON)
+if the storage data type is integer, which effectively means that no fill value is used.
+You can also explicitly set fill_value
to null
(None
in Python) to not use one.
+Setting the fill_value
for a variable can be important for saving storage space and
+improving data I/O performance in many cases, because zappend
does not write empty
+array chunks - chunks that comprise missing data only, i.e.,
+slice.to_zarr(target_dir, write_empty_chunks=False, ...)
.
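For illustration, the substitution that fill_value implies on reading can be sketched in plain Python. The values below are hypothetical; xarray performs this decoding itself when a dataset is opened with decode_cf=True:

```python
import math

# Hypothetical encoded chunk in storage units: -999 marks missing samples,
# matching the fill_value from the example configuration above.
FILL_VALUE = -999
storage_values = [12, -999, 7, -999]

# On decoding, fill values are replaced by NaN in the in-memory array;
# this list comprehension reproduces that substitution for illustration.
decoded = [math.nan if v == FILL_VALUE else float(v) for v in storage_values]

print(decoded)  # [12.0, nan, 7.0, nan]
```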
Data Packing
-This section is a work in progress.
-
--
-
The target encoding should also allow for packing floating point data into
- integer data with fewer bits using scaling factor and offset.
-
--
-
If the target exists, the slice will be appended. Check if the slice to be
- appended is last. If not, refuse to append (alternative: insert but this is
- probably difficult or error prone).
-
-- Slices are appended in the order they are provided.
-
+Data packing refers to a simple lossy data compression method where 32- or 64-bit
+floating point values are linearly scaled so that their value range can be fully or
+partially represented by a lower precision integer data type. Packed values usually
+also give higher compression rates when using a compressor
; see the next section.
+Data packing is specified using the scale_factor
and add_offset
settings together
+with the storage data type setting dtype
. The settings should be given as a triple:
+{
+ "variables": {
+ "chl": {
+ "encoding": {
+ "dtype": "int16",
+ "scale_factor": 0.005,
+ "add_offset": 0.0
+ }
+ }
+ }
+}
+
+The in-memory value in its physical units for a given encoded value in storage is
+computed according to
+memory_value = scale_factor * storage_value + add_offset
+
+Hence, the encoded value is computed from an in-memory value in physical units as
+storage_value = (memory_value - add_offset) / scale_factor
+
+You can compute scale_factor
and add_offset
from a given data range in physical units
+according to
+ add_offset = memory_value_min
+ scale_factor = (memory_value_max - memory_value_min) / (2 ** num_bits - 1)
+
+with num_bits
being the number of bits for the integer type to be used.
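The formulas above can be checked with a small, self-contained sketch. The chlorophyll value range used here is an assumption chosen for illustration, not taken from any real dataset:

```python
# Sketch applying the packing formulas above for an int16 storage type.
# The physical value range is an assumed example.
num_bits = 16
memory_value_min, memory_value_max = 0.0, 25.0

add_offset = memory_value_min
scale_factor = (memory_value_max - memory_value_min) / (2 ** num_bits - 1)

def pack(memory_value):
    # storage_value = (memory_value - add_offset) / scale_factor
    return round((memory_value - add_offset) / scale_factor)

def unpack(storage_value):
    # memory_value = scale_factor * storage_value + add_offset
    return scale_factor * storage_value + add_offset

# The round-trip error is bounded by half the quantization step.
value = 3.14
assert abs(unpack(pack(value)) - value) <= scale_factor / 2
```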
+Compression
+Data compression is specified by the compressor
setting, optionally paired with the
+filters
setting:
+{
+ "variables": {
+ "chl": {
+ "encoding": {
+ "compressor": {},
+ "filters": []
+ }
+ }
+ }
+}
+
+By default, zappend uses the default blosc compressor of Zarr, if not
+specified. To explicitly disable compression you must set the compressor
to None
+(null
in JSON).
+The usage of compressors and filters is best explained in dedicated sections of the
+Zarr Tutorial, namely
+Compressors and
+Filters.
Data I/O
This section is a work in progress.
@@ -924,37 +986,36 @@ Transactions
Slice Data Types
This section is a work in progress.
Logging
-The zappend
logging configuration follows exactly the
-Python dictionary schema of the Python module logging.config
.
-The logger used by the zappend
tool is named zappend
.
-Note that you can also configure the logger of other Python modules, e.g.,
+
The zappend
logging configuration follows exactly the dictionary schema of the
+Python module logging.config
. The logger used by the zappend
tool is named
+zappend
. Note that you can also configure the logger of other Python modules, e.g.,
xarray
or dask
here.
Given here is an example that logs zappend
's output to the console using
the INFO level:
-
{
- "logging": {
- "version": 1,
- "formatters": {
- "normal": {
- "format": "%(asctime)s %(levelname)s %(message)s",
- "style": "%"
- }
- },
- "handlers": {
- "console": {
- "class": "logging.StreamHandler",
- "formatter": "normal"
- }
- },
- "loggers": {
- "zappend": {
- "level": "INFO",
- "handlers": ["console"]
- }
- }
- }
-}
-
+{
+ "logging": {
+ "version": 1,
+ "formatters": {
+ "normal": {
+ "format": "%(asctime)s %(levelname)s %(message)s",
+ "style": "%"
+ }
+ },
+ "handlers": {
+ "console": {
+ "class": "logging.StreamHandler",
+ "formatter": "normal"
+ }
+ },
+ "loggers": {
+ "zappend": {
+ "level": "INFO",
+ "handlers": ["console"]
+ }
+ }
+ }
+}
+
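The "logging" dictionary shown above can be passed directly to Python's standard logging.config.dictConfig(), which defines the schema it follows. A standalone sketch, outside any zappend configuration file:

```python
import logging
import logging.config

# The same structure as the "logging" setting above, fed straight into
# the standard library's dictConfig().
config = {
    "version": 1,
    "formatters": {
        "normal": {
            "format": "%(asctime)s %(levelname)s %(message)s",
            "style": "%",
        }
    },
    "handlers": {
        "console": {
            "class": "logging.StreamHandler",
            "formatter": "normal",
        }
    },
    "loggers": {
        "zappend": {
            "level": "INFO",
            "handlers": ["console"],
        }
    },
}

logging.config.dictConfig(config)
logger = logging.getLogger("zappend")
logger.info("logging configured")  # emitted to the console at INFO level
```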
diff --git a/intro/index.html b/index.html
similarity index 89%
rename from intro/index.html
rename to index.html
index b4bba72..872fe8f 100755
--- a/intro/index.html
+++ b/index.html
@@ -10,19 +10,19 @@
-
+
-
+
- Introduction - zappend
+ zappend
-
+
@@ -41,9 +41,9 @@
-
+
-
+
@@ -78,7 +78,7 @@