1 The NetCDF NCZarr Implementation
2 ============================
3 <!-- double header is needed to workaround doxygen bug -->
5 # The NetCDF NCZarr Implementation {#nczarr_head}
9 # NCZarr Introduction {#nczarr_introduction}
Beginning with netCDF version 4.8.0, the Unidata NetCDF group has extended the netcdf-c library to support data stored using the Zarr data model and storage format [4,6]. As part of this work, netCDF also supports accessing data stored in cloud storage (e.g. Amazon S3 <a href="#ref_aws">[1]</a>).
13 The goal of this project, then, is to provide maximum interoperability between the netCDF Enhanced (netcdf-4) data model and the Zarr version 2 <a href="#ref_zarr">[4]</a><!-- or Version 3 <a href="#ref_zarrv3">[13]</a>--> data model. This is embodied in the netcdf-c library so that it is possible to use the netcdf API to read and write Zarr formatted datasets.
15 In order to better support the netcdf-4 data model, the netcdf-c library implements a limited set of extensions to the *Zarr* data model.
16 This extended model is referred to as *NCZarr*.
17 Additionally, another goal is to ensure interoperability between *NCZarr*
18 formatted files and standard (aka pure) *Zarr* formatted files.
This means that (1) an *NCZarr* file can be read by any other *Zarr* library (and especially the Zarr-python library), and (2) a standard *Zarr* file can be read by netCDF. Of course, there are limitations: other *Zarr* libraries will not use the extra *NCZarr* meta-data, and netCDF will have to "fake" meta-data not provided by a pure *Zarr* file.
21 As a secondary -- but equally important -- goal, it must be possible to use
22 the NCZarr library to read and write datasets that are pure Zarr,
23 which means that none of the NCZarr extensions are used. This feature does come
24 with some costs, namely that information contained in the netcdf-4
25 data model may be lost in the pure Zarr dataset.
27 Notes on terminology in this document.
28 * The term "dataset" is used to refer to all of the Zarr objects constituting
29 the meta-data and data.
* NCZarr is not currently thread-safe, so any attempt to use it with parallelism, including MPI-IO, is likely to fail.
32 # The NCZarr Data Model {#nczarr_data_model}
34 NCZarr uses a data model that, by design, extends the Zarr Version 2 Specification <!--or Version 3 Specification-->.
36 __Note Carefully__: a legal _NCZarr_ dataset is expected to also be a legal _Zarr_ dataset.
The converse is also expected to hold: a legal _Zarr_ dataset is also a legal _NCZarr_ dataset, where "legal" means it conforms to the Zarr specification(s).
In addition, certain features not defined by the Zarr specification are allowed and used;
specifically, the XArray [7] ''\_ARRAY\_DIMENSIONS'' attribute is one such feature.
There are two other, secondary assumptions:
43 1. The actual storage format in which the dataset is stored -- a zip file, for example -- can be read by the _Zarr_ implementation.
44 2. The compressors (aka filters) used by the dataset can be encoded/decoded by the implementation. NCZarr uses HDF5-style filters, so ensuring access to such filters is somewhat complicated. See [the companion document on
45 filters](./md_filters.html "filters") for details.
47 Briefly, the data model supported by NCZarr is netcdf-4 minus
48 the user-defined types and full String type support.
49 However, a restricted form of String type
50 is supported (see Appendix D).
51 As with netcdf-4, chunking is supported. Filters and compression
52 are also [supported](./md_filters.html "filters").
54 Specifically, the model supports the following.
55 - "Atomic" types: char, byte, ubyte, short, ushort, int, uint, int64, uint64, string.
56 - Shared (named) dimensions
57 - Unlimited dimensions
58 - Attributes with specified types -- both global and per-variable
62 - N-Dimensional variables
64 - Per-variable endianness (big or little)
65 - Filters (including compression)
67 With respect to full netCDF-4, the following concepts are
68 currently unsupported.
69 - User-defined types (enum, opaque, VLEN, and Compound)
70 - Contiguous or compact storage
Note that contiguous and compact storage are not actually supported
because they are HDF5 specific.
When specified, they are treated as chunked, with the variable's data stored as a single chunk.
This means that testing for contiguous or compact storage is not possible; the _nc_inq_var_chunking_ function will always return NC_CHUNKED, and the chunk sizes will be the same as the sizes of the variable's dimensions.
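
As a minimal sketch of this behavior (error checking omitted):

    #include <netcdf.h>

    /* Sketch: NCZarr always reports chunked storage. For a variable that
       was declared contiguous, the reported chunk sizes default to the
       sizes of the variable's dimensions. Error checking omitted. */
    void check_storage(int ncid, int varid)
    {
        int storage;
        size_t chunksizes[NC_MAX_VAR_DIMS];
        nc_inq_var_chunking(ncid, varid, &storage, chunksizes);
        /* storage == NC_CHUNKED, never NC_CONTIGUOUS or NC_COMPACT */
    }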
77 Additionally, it should be noted that NCZarr supports scalar variables,
78 but Zarr Version 2 does not; Zarr V2 only supports dimensioned variables.
79 In order to support interoperability, NCZarr V2 does the following.
80 1. A scalar variable is recorded in the Zarr metadata as if it has a shape of **[1]**.
81 2. A note is stored in the NCZarr metadata that this is actually a netCDF scalar variable.
83 These actions allow NCZarr to properly show scalars in its API while still
84 maintaining compatibility with Zarr.
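
As a sketch of the scalar case (the variable name is illustrative; error checking omitted):

    #include <netcdf.h>

    /* Sketch: a variable defined with zero dimensions is a netCDF scalar.
       NCZarr records it in the Zarr metadata with shape [1], together with
       an NCZarr annotation that it is really a scalar. */
    void def_scalar(int ncid)
    {
        int varid;
        nc_def_var(ncid, "t_scalar", NC_INT, 0 /* ndims: scalar */, NULL, &varid);
    }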
86 # Enabling NCZarr Support {#nczarr_enable}
88 NCZarr support is enabled by default.
89 If the _--disable-nczarr_ option is used with './configure', then NCZarr (and Zarr) support is disabled.
90 If NCZarr support is enabled, then support for datasets stored as files in a directory tree is provided as the only guaranteed mechanism for storing datasets.
However, several additional storage mechanisms are available if additional libraries are installed.
93 1. Zip format -- if _libzip_ is installed, then it is possible to directly read and write datasets stored in zip files.
94 2. If one of the supported AWS SDKs is installed, then it is possible to directly read and write datasets stored in the Amazon S3 cloud storage.
# Accessing Data Using the NCZarr Protocol {#nczarr_accessing_data}
In order to access an NCZarr data source through the netCDF API, the file name normally used is replaced with a URL in a specific format.
Note specifically that there is no NC_NCZARR flag for the mode argument of _nc_create_ or _nc_open_;
instead, the NCZarr format is indicated by the URL itself.
The URL has the usual format.
    protocol://host:port/path?query#fragment
See the document "quickstart_paths" for details about the supported forms of paths and URLs.
110 There are, however, some details that are important.
- Protocol: this should be _https_, _s3_, or _file_.
112 The _s3_ scheme is equivalent to "https" plus setting "mode=s3".
113 Specifying "file" is mostly used for testing, but also for directory tree or zipfile format storage.
The fragment part of the URL specifies what data format is to be used, as well as additional controls for that data format.
119 For reading, _key=value_ pairs are provided for specifying the storage format.
122 Additional pairs are provided to specify the Zarr version.
125 Additional pairs are provided to specify the storage medium: Amazon S3 vs File tree vs Zip file.
Note that when reading, an attempt will be made to infer the
format, Zarr version, and storage medium by probing the
file. If inferencing fails, then it is reported. In this case,
the client may need to add specific mode flags to avoid incorrect inferences.
Typically one will specify two mode flags: one to indicate what format
to use and one to specify the way the dataset is to be stored<!--, and one to specify the Zarr format version-->.
136 For example, a common one is "mode=zarr,file<!--,v2-->"
137 <!--If not specified, the version will be the default specified when
138 the netcdf-c library was built.-->
Obviously, when creating a file, inferring the type of file to create
is not possible, so the mode flags must be set explicitly.
142 This means that both the storage medium and the exact storage
143 format must be specified.
144 Using _mode=nczarr_ causes the URL to be interpreted as a
145 reference to a dataset that is stored in NCZarr format.
The _zarr_ mode tells the library to use NCZarr, but to restrict its operation to pure Zarr.
147 <!--The _v2_ mode specifies Version 2 and _v3_mode specifies Version 3.
148 If the version is not specified, it will default to the value specified when the netcdf-c library was built.-->
The modes _s3_, _file_, and _zip_ tell the library what storage medium to use:
152 * The _s3_ driver stores data using Amazon S3 or some equivalent.
153 * The _file_ driver stores data in a directory tree.
154 * The _zip_ driver stores data in a local zip file.
156 As an aside, it should be the case that zipping a _file_
157 format directory tree will produce a file readable by the
158 _zip_ storage format, and vice-versa.
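
A minimal sketch of creation with explicit mode flags (the path is illustrative; error checking omitted):

    #include <netcdf.h>

    /* Sketch: when creating, the storage medium and format cannot be
       inferred, so both are given in the URL fragment. NC_NETCDF4 selects
       the enhanced model, matching the -4 flag used with ncgen in the
       examples later in this document. */
    int create_nczarr(int* ncidp)
    {
        return nc_create("file:///home/user/example.file#mode=nczarr,file",
                         NC_NETCDF4 | NC_CLOBBER, ncidp);
    }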
160 By default, the XArray convention is supported for Zarr Version 2
161 and used for both NCZarr files and pure Zarr files.
162 <!--It is not needed for Version 3 and is ignored.-->
163 This means that every variable in the root group whose named dimensions
164 are also in the root group will have an attribute called
165 *\_ARRAY\_DIMENSIONS* that stores those dimension names.
166 The _noxarray_ mode tells the library to disable the XArray support.
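
For example, a minimal sketch of opening a pure Zarr zip file with XArray support disabled (the path is illustrative; error checking omitted):

    #include <netcdf.h>

    /* Sketch: open a pure Zarr dataset stored in a zip file, telling the
       library not to read or write _ARRAY_DIMENSIONS attributes. */
    int open_pure_zarr(int* ncidp)
    {
        return nc_open("file:///home/user/example.zip#mode=zarr,zip,noxarray",
                       0, ncidp);
    }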
168 # NCZarr Map Implementation {#nczarr_mapimpl}
170 Internally, the nczarr implementation has a map abstraction that allows different storage formats to be used.
171 This is closely patterned on the same approach used in the Python Zarr implementation, which relies on the Python _MutableMap_ <a href="#ref_python">[5]</a> class.
173 In NCZarr, the corresponding type is called _zmap_.
174 The __zmap__ API essentially implements a simplified variant
175 of the Amazon S3 API.
As with Amazon S3, __keys__ are UTF-8 strings with a specific structure:
that of a Unix-style path, with '/' as the
separator for the segments of the path.
181 As with Unix, all keys have this BNF syntax:
184 keypath: '/' segment | keypath '/' segment ;
185 segment: <sequence of UTF-8 characters except control characters and '/'>
187 Obviously, one can infer a tree structure from this key structure.
188 A containment relationship is defined by key prefixes.
189 Thus one key is "contained" (possibly transitively)
190 by another if one key is a prefix (in the string sense) of the other.
191 So in this sense the key "/x/y/z" is contained by the key "/x/y".
193 In this model all keys "exist" but only some keys refer to
194 objects containing content -- aka _content bearing_.
195 An important restriction is placed on the structure of the tree,
196 namely that keys are only defined for content-bearing objects.
197 Further, all the leaves of the tree are these content-bearing objects.
198 This means that the key for one content-bearing object should not
199 be a prefix of any other key.
There are several other concepts of note.
202 1. __Dataset__ - a dataset is the complete tree contained by the key defining
203 the root of the dataset. The term __File__ will often be used as a synonym.
204 Technically, the root of the tree is the key <dataset>/.zgroup, where .zgroup can be considered the _superblock_ of the dataset.
205 2. __Object__ - equivalent of the S3 object; Each object has a unique key
206 and "contains" data in the form of an arbitrary sequence of 8-bit bytes.
208 The zmap API defined here isolates the key-value pair mapping
209 code from the Zarr-based implementation of NetCDF-4.
It wraps an internal C dispatch table that implements the
zmap key/object model as an abstract data structure.
212 Of special note is the "search" function of the API.
214 __Search__: The search function has two purposes:
215 1. Support reading of pure zarr datasets (because they do not explicitly track their contents).
216 2. Debugging to allow raw examination of the storage. See zdump for example.
218 The search function takes a prefix path which has a key syntax (see above).
219 The set of legal keys is the set of keys such that the key references a content-bearing object -- e.g. /x/y/.zarray or /.zgroup.
220 Essentially this is the set of keys pointing to the leaf objects of the tree of keys constituting a dataset.
221 This set potentially limits the set of keys that need to be examined during search.
223 The search function returns a limited set of names, where the set of names are immediate suffixes of a given prefix path.
That is, if _<prefix>_ is the prefix path, then search returns all _<name>_ such that _<prefix>/<name>_ is itself a prefix of a "legal" key.
This can be used to implement glob-style searches such as "/x/y/*" or "/x/y/**".
These semantics were chosen because they appear to be the minimum required to implement all other kinds of search using recursion.
They were also chosen to limit the number of names returned from a search; specifically, the goals are as follows (see the sketch after this list).
230 1. Avoid returning keys that are not a prefix of some legal key.
231 2. Avoid returning all the legal keys in the dataset because that set may be very large; although the implementation may still have to examine all legal keys to get the desired subset.
232 3. Allow for use of partial read mechanisms such as iterators, if available.
233 This can support processing a limited set of keys for each iteration.
This is a straightforward trade-off of space for time.
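
The following sketch shows how recursion over the immediate suffixes can enumerate every legal key. The types and the _zmap_search_ entry point are hypothetical stand-ins for the internal zmap API, whose actual signatures may differ:

    #include <stdio.h>

    /* All types and the zmap_search() entry point below are hypothetical
       stand-ins for the internal zmap API; only the search semantics
       described above are taken from this document. */
    typedef struct ZMAP ZMAP;
    typedef struct NameSet { size_t count; char** names; } NameSet;
    extern NameSet zmap_search(ZMAP* map, const char* prefix);

    /* Enumerate every legal key below `prefix` -- a "/x/y/**" style walk --
       by recursing on the immediate suffixes returned by each search. */
    static void walk(ZMAP* map, const char* prefix)
    {
        NameSet set = zmap_search(map, prefix);
        for (size_t i = 0; i < set.count; i++) {
            char key[1024];
            snprintf(key, sizeof(key), "%s/%s", prefix, set.names[i]);
            printf("%s\n", key); /* visit this key */
            walk(map, key);      /* then recurse below it */
        }
    }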
236 As a side note, S3 supports this kind of search using common prefixes with a delimiter of '/', although its use is a bit tricky.
237 For the file system zmap implementation, the legal search keys can be obtained one level at a time, which directly implements the search semantics.
238 For the zip file implementation, this semantics is not possible, so the whole
239 tree must be obtained and searched.
Two S3-related details deserve note.
1. S3 limits key lengths to 1024 bytes.
Some deeply nested netcdf files will almost certainly exceed this limit.
245 2. Besides content, S3 objects can have an associated small set
246 of what may be called tags, which are themselves of the form of
247 key-value pairs, but where the key and value are always text.
248 As far as it is possible to determine, Zarr never uses these tags,
249 so they are not included in the zmap data structure.
251 __A Note on Error Codes:__
The zmap API returns some distinguished error codes:
1. NC_NOERR if an operation succeeded.
2. NC_EEMPTY is returned when accessing a key that has no content.
3. NC_EOBJECT is returned when an object is found that should not exist.
4. NC_ENOOBJECT is returned when an object that should exist is not found.
This does not preclude other errors being returned, such as NC_EACCESS, NC_EPERM, or NC_EINVAL, if there are permission errors or illegal function arguments, for example.
260 It also does not preclude the use of other error codes internal to the zmap implementation.
261 So zmap_file, for example, uses NC_ENOTFOUND internally because it is possible to detect the existence of directories and files.
262 But this does not propagate outside the zmap_file implementation.
## Zmap Implementations
The primary zmap implementation is _s3_ (i.e. _mode=nczarr,s3_) and indicates that the Amazon S3 cloud storage -- or some related appliance -- is to be used.
267 Another storage format uses a file system tree of directories and files (_mode=nczarr,file_).
268 A third storage format uses a zip file (_mode=nczarr,zip_).
269 The latter two are used mostly for debugging and testing.
270 However, the _file_ and _zip_ formats are important because they are intended to match corresponding storage formats used by the Python Zarr implementation.
Hence they should provide interoperability between NCZarr and the Python Zarr implementation, although this interoperability has had only limited testing.
273 Examples of the typical URL form for _file_ and _zip_ are as follows.
275 file:///xxx/yyy/testdata.file#mode=nczarr,file
276 file:///xxx/yyy/testdata.zip#mode=nczarr,zip
279 Note that the extension (e.g. ".file" in "testdata.file")
280 is arbitrary, so this would be equally acceptable.
282 file:///xxx/yyy/testdata.anyext#mode=nczarr,file
As with other URLs (e.g. DAP), these kinds of URLs can be passed as the path argument to, for example, __ncdump__.
# NCZarr versus Pure Zarr {#nczarr_purezarr}
The NCZarr format extends the pure Zarr format by adding extra attributes such as ''\_nczarr\_array'' inside the ''.zattr'' object.
It is possible to suppress the use of these extensions so that the netcdf library writes a pure-zarr formatted file.
This is probably unnecessary, since the extra attributes should be readable by any other Zarr implementation,
but they might be seen as clutter, so they can be suppressed
when writing by specifying *mode=zarr*.
294 Reading of pure Zarr files created using other implementations is a necessary
295 compatibility feature of NCZarr.
This requirement imposes some constraints on the reading of Zarr datasets using the NCZarr implementation.
297 1. Zarr allows some primitive types not recognized by NCZarr.
298 Over time, the set of unrecognized types is expected to diminish.
299 Examples of currently unsupported types are as follows:
300 * "c" -- complex floating point
303 2. The Zarr dataset may reference filters and compressors unrecognized by NCZarr.
304 3. The Zarr dataset may store data in column-major order instead of row-major order. The effect of encountering such a dataset is to output the data in the wrong order.
306 Again, this list should diminish over time.
308 # Notes on Debugging NCZarr Access {#nczarr_debug}
310 The NCZarr support has a trace facility.
311 Enabling this can sometimes give important, but voluminous information.
312 Tracing can be enabled by setting the environment variable NCTRACING=n,
313 where _n_ indicates the level of tracing.
314 A good value of _n_ is 9.
316 # Zip File Support {#nczarr_zip}
318 In order to use the _zip_ storage format, the libzip [3] library must be installed.
319 Note that this is different from zlib.
# Amazon S3 Storage {#nczarr_s3}

The notion of "addressing style" may need some expansion. Amazon S3 accepts two forms for specifying the endpoint for accessing the data (see the document "quickstart_paths").
325 1. Virtual -- the virtual addressing style places the bucket in the host part of a URL.
    https://<bucketname>.s3.<region>.amazonaws.com/
2. Path -- the path addressing style places the bucket at the front of the path part of a URL.
336 https://s3.<region>.amazonaws.com/<bucketname>/
The NCZarr code will accept either form, although internally it standardizes on the path style.
The reason for this is that the bucket name then forms the initial segment of the keys.
342 # Zarr vs NCZarr {#nczarr_vs_zarr}
The NCZarr storage format is almost identical to that of the standard Zarr format.
347 The data model differs as follows.
349 1. Zarr only supports anonymous dimensions -- NCZarr supports only shared (named) dimensions.
350 2. Zarr attributes are untyped -- or perhaps more correctly characterized as of type string.
351 3. Zarr does not explicitly support unlimited dimensions -- NCZarr does support them.
355 Consider both NCZarr and Zarr, and assume S3 notions of bucket and object.
356 In both systems, Groups and Variables (Array in Zarr) map to S3 objects.
357 Containment is modeled using the fact that the dataset's key is a prefix of the variable's key.
So, for example, if variable _v1_ is contained in the top-level group _g1_ (whose key is _/g1_), then the key for _v1_ is _/g1/v1_.
Additional meta-data information is stored in special objects whose names start with ".z".
361 In Zarr Version 2, the following special objects exist.
362 1. Information about a group is kept in a special object named _.zgroup_;
363 so for example the object _/g1/.zgroup_.
364 2. Information about an array is kept as a special object named _.zarray_;
365 so for example the object _/g1/v1/.zarray_.
366 3. Group-level attributes and variable-level attributes are stored in a special object named _.zattr_;
367 so for example the objects _/g1/.zattr_ and _/g1/v1/.zattr_.
4. Chunk data is stored in objects named "<n1>.<n2>...<nr>", where each <ni> is a non-negative integer representing the chunk index for the ith dimension; for example, the first chunk of a two-dimensional array is stored in the object named "0.0".
370 The first three contain meta-data objects in the form of a string representing a JSON-formatted dictionary.
The NCZarr format uses the same objects as Zarr, but inserts NCZarr-specific
attributes in the *.zattr* object to hold NCZarr-specific information.
The value of each of these attributes is a JSON dictionary containing a variety
of NCZarr-specific information.
376 These NCZarr-specific attributes are as follows:
378 _\_nczarr_superblock\__ -- this is in the top level group's *.zattr* object.
379 It is in effect the "superblock" for the dataset and contains
380 any netcdf specific dataset level information.
381 It is also used to verify that a given key is the root of a dataset.
382 Currently it contains keys that are ignored and exist only to ensure that
383 older netcdf library versions do not crash.
384 * "version" -- the NCZarr version defining the format of the dataset (deprecated).
386 _\_nczarr_group\__ -- this key appears in every group's _.zattr_ object.
387 It contains any netcdf specific group information.
388 Specifically it contains the following keys:
389 * "dimensions" -- the name and size of shared dimensions defined in this group, as well an optional flag indictating if the dimension is UNLIMITED.
390 * "arrays" -- the name of variables defined in this group.
391 * "groups" -- the name of sub-groups defined in this group.
392 These lists allow walking the NCZarr dataset without having to use the potentially costly search operation.
394 _\_nczarr_array\__ -- this key appears in the *.zattr* object associated
395 with a _.zarray_ object.
396 It contains netcdf specific array information.
397 Specifically it contains the following keys:
398 * dimension_references -- the fully qualified names of the shared dimensions referenced by the variable.
399 * storage -- indicates if the variable is chunked vs contiguous in the netcdf sense. Also signals if a variable is scalar.
401 _\_nczarr_attr\__ -- this attribute appears in every _.zattr_ object.
402 Specifically it contains the following keys:
403 * types -- the types of all attributes in the _.zattr_ object.
405 ## Translation {#nczarr_translation}
407 With some loss of netcdf-4 information, it is possible for an nczarr library to read the pure Zarr format and for other zarr libraries to read the nczarr format.
The latter case, zarr reading nczarr, is trivial because all of the nczarr metadata is stored as ordinary, string-valued (but JSON-syntax) attributes.
The former case, nczarr reading zarr, is possible assuming the nczarr code can simulate or infer the contents of the missing _\_nczarr\_xxx_ attributes.
412 As a rule this can be done as follows.
413 1. _\_nczarr_group\__ -- The list of contained variables and sub-groups can be computed using the search API to list the keys "contained" in the key for a group.
414 The search looks for occurrences of _.zgroup_, _.zattr_, _.zarray_ to infer the keys for the contained groups, attribute sets, and arrays (variables).
415 Constructing the set of "shared dimensions" is carried out
416 by walking all the variables in the whole dataset and collecting
417 the set of unique integer shapes for the variables.
For each such dimension length, a top-level dimension is created
named "_Anonymous_Dimension_<len>", where <len> is the integer length (see the sketch after this list).
2. _\_nczarr_array\__ -- The dimension references are inferred by using the shape in _.zarray_ and creating references to the simulated shared dimensions.
422 3. _\_nczarr_attr\__ -- The type of each attribute is inferred by trying to parse the first attribute value string.
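
As a minimal sketch, this is what the netCDF API then reports for a pure Zarr array of shape [10] that carries no dimension names (the file name is illustrative; error checking omitted):

    #include <stdio.h>
    #include <netcdf.h>

    /* Sketch: open a pure Zarr file and inspect the dimension that NCZarr
       fabricates for an array of shape [10]. */
    void show_inferred_dim(void)
    {
        int ncid;
        char name[NC_MAX_NAME + 1];
        size_t len;
        nc_open("file:///home/user/pure.file#mode=zarr,file", 0, &ncid);
        nc_inq_dim(ncid, 0 /* first dimension id */, name, &len);
        printf("%s = %zu\n", name, len); /* "_Anonymous_Dimension_10 = 10" */
        nc_close(ncid);
    }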
424 # Compatibility {#nczarr_compatibility}
In order to accommodate existing implementations, certain mode tags are provided to tell the NCZarr code to look for information used by specific implementations.
430 The Xarray [7] Zarr implementation uses its own mechanism for specifying shared dimensions.
431 It uses a special attribute named ''_ARRAY_DIMENSIONS''.
432 The value of this attribute is a list of dimension names (strings).
433 An example might be ````["time", "lon", "lat"]````.
It is almost equivalent to the ````dimension_references```` list in the ````_nczarr_array```` attribute, except that the latter uses fully qualified names so the referenced dimensions can be anywhere in the dataset.
The Xarray dimension list differs from the netcdf-4 shared dimensions in two ways.
435 1. Specifying Xarray in a non-root group has no meaning in the current Xarray specification.
436 2. A given name can be associated with different lengths, even within a single array. This is considered an error in NCZarr.
438 The Xarray ''_ARRAY_DIMENSIONS'' attribute is supported for both NCZarr and pure Zarr.
439 If possible, this attribute will be read/written by default,
440 but can be suppressed if the mode value "noxarray" is specified.
441 If detected, then these dimension names are used to define shared dimensions.
442 The following conditions will cause ''_ARRAY_DIMENSIONS'' to not be written.
443 * The variable is not in the root group,
444 * Any dimension referenced by the variable is not in the root group.
446 Note that this attribute is not needed for Zarr Version 3, and is ignored.
448 # Examples {#nczarr_examples}
450 Here are a couple of examples using the _ncgen_ and _ncdump_ utilities.
452 1. Create an nczarr file using a local directory tree as storage.
454 ncgen -4 -lb -o "file:///home/user/dataset.file#mode=nczarr,file" dataset.cdl
456 2. Display the content of an nczarr file using a zip file as storage.
458 ncdump "file:///home/user/dataset.zip#mode=nczarr,zip"
460 3. Create an nczarr file using S3 as storage.
462 ncgen -4 -lb -o "s3://s3.us-west-1.amazonaws.com/datasetbucket" dataset.cdl
464 4. Create an nczarr file using S3 as storage and keeping to the pure zarr format.
    ncgen -4 -lb -o "s3://s3.us-west-1.amazonaws.com/datasetbucket#mode=zarr" dataset.cdl
468 5. Create an nczarr file using the s3 protocol with a specific profile
    ncgen -4 -lb -o "s3://datasetbucket/rootkey#mode=nczarr&awsprofile=unidata" dataset.cdl
Note that the URL is internally translated to this form:

    "https://s3.<region>.amazonaws.com/datasetbucket/rootkey#mode=nczarr&awsprofile=unidata"
476 # Appendix A. Building NCZarr Support {#nczarr_build}
478 Currently the following build cases are known to work.
479 Note that this does not include S3 support.
480 A separate tabulation of S3 support is in the document _cloud.md_.
<table>
<tr><td><u>Operating System</u><td><u>Build System</u><td><u>NCZarr</u>
<tr><td>Linux <td> Automake <td> yes
<tr><td>Linux <td> CMake <td> yes
<tr><td>Cygwin <td> Automake <td> yes
<tr><td>Cygwin <td> CMake <td> yes
<tr><td>OSX <td> Automake <td> yes
<tr><td>OSX <td> CMake <td> yes
<tr><td>Visual Studio <td> CMake <td> yes
</table>
495 The relevant ./configure options are as follows.
497 1. *--disable-nczarr* -- disable the NCZarr support.
501 The relevant CMake flags are as follows.
503 1. *-DNETCDF_ENABLE_NCZARR=off* -- equivalent to the Automake *--disable-nczarr* option.
504 ## Testing NCZarr S3 Support {#nczarr_testing_S3_support}
506 The relevant tests for S3 support are in the _nczarr_test_ directory.
507 Currently, by default, testing of S3 with NCZarr is supported only for Unidata members of the NetCDF Development Group.
508 This is because it uses a Unidata-specific bucket that is inaccessible to the general user.
In order to build netcdf-c with S3 sdk support,
the following option must be specified for ./configure.

    --enable-s3
517 If you have access to the Unidata bucket on Amazon, then you can
518 also test S3 support with this option.
520 --with-s3-testing=yes
523 ### NetCDF CMake Build
525 Enabling S3 support is controlled by this cmake option:
527 -DNETCDF_ENABLE_S3=ON
529 However, to find the aws sdk libraries,
530 the following environment variables must be set:
532 AWSSDK_ROOT_DIR="c:/tools/aws-sdk-cpp"
533 AWSSDKBIN="/cygdrive/c/tools/aws-sdk-cpp/bin"
534 PATH="$PATH:${AWSSDKBIN}"
536 Then the following options must be specified for cmake.
538 -DAWSSDK_ROOT_DIR=${AWSSDK_ROOT_DIR}
539 -DAWSSDK_DIR=${AWSSDK_ROOT_DIR}/lib/cmake/AWSSDK
542 # Appendix B. Amazon S3 Imposed Limits {#nczarr_s3limits}
544 The Amazon S3 cloud storage imposes some significant limits that are inherited by NCZarr (and Zarr also, for that matter).
546 Some of the relevant limits are as follows:
1. The maximum size of a single S3 object is 5 Terabytes, and the largest object that can be uploaded in a single operation is 5 Gigabytes.
548 2. S3 key names can be any UNICODE name with a maximum length of 1024 bytes.
549 Note that the limit is defined in terms of bytes and not (Unicode) characters.
550 This affects the depth to which groups can be nested because the key encodes the full path name of a group.
552 # Appendix C. JSON Attribute Convention. {#nczarr_json}
554 The Zarr V2 <!--(and V3)--> specification is somewhat vague on what is a legal
555 value for an attribute. The examples all show one of two cases:
1. A simple JSON scalar atomic value (e.g. int, float, char, etc.), or
557 2. A JSON array of such values.
559 However, the Zarr specification can be read to infer that the value
560 can in fact be any legal JSON expression.
561 This "convention" is currently used routinely to help support various
562 attributes created by other packages where the attribute is a
563 complex JSON expression. An example is the GDAL Driver
convention <a href='#ref_gdal'>[12]</a>, where the value is a complex JSON expression.
567 In order for NCZarr to be as consistent as possible with Zarr,
568 it is desirable to support this convention for attribute values.
569 This means that there must be some way to handle an attribute
570 whose value is not either of the two cases above. That is, its value
571 is some more complex JSON expression. Ideally both reading and writing
572 of such attributes should be supported.
574 One more point. NCZarr attempts to record the associated netcdf
575 attribute type (encoded in the form of a NumPy "dtype") for each
576 attribute. This information is stored as NCZarr-specific
metadata. Note that pure Zarr makes no attempt to record such type information.
The current algorithm to support JSON valued attributes is as follows.
583 ## Writing an attribute:
There are multiple cases to consider.
586 1. The netcdf attribute **is not** of type NC_CHAR and its value is a single atomic value.
587 * Convert to an equivalent JSON atomic value and write that JSON expression.
588 * Compute the Zarr equivalent dtype and store in the NCZarr metadata.
590 2. The netcdf attribute **is not** of type NC_CHAR and its value is a vector of atomic values.
591 * Convert to an equivalent JSON array of atomic values and write that JSON expression.
592 * Compute the Zarr equivalent dtype and store in the NCZarr metadata.
594 3. The netcdf attribute **is** of type NC_CHAR and its value – taken as a single sequence of characters –
595 **is** parseable as a legal JSON expression.
596 * Parse to produce a JSON expression and write that expression.
* Use "|J0" as the dtype and store in the NCZarr metadata (see the sketch after this list).
599 4. The netcdf attribute **is** of type NC_CHAR and its value – taken as a single sequence of characters –
600 **is not** parseable as a legal JSON expression.
* Convert to a JSON string and write that expression.
602 * Use ">S1" as the dtype and store in the NCZarr metadata.
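
As a sketch of case 3, consider writing an NC_CHAR attribute whose text happens to parse as JSON (the attribute name is illustrative; error checking omitted):

    #include <string.h>
    #include <netcdf.h>

    /* Sketch of case 3: the attribute text parses as JSON, so NCZarr writes
       it as a JSON expression and records the dtype "|J0". The attribute
       name "driver_metadata" is hypothetical. */
    void write_json_attr(int ncid, int varid)
    {
        const char* json = "{\"scale\": 2.0, \"offsets\": [1, 2, 3]}";
        nc_put_att_text(ncid, varid, "driver_metadata", strlen(json), json);
    }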
604 ## Reading an attribute:
606 The process of reading and interpreting an attribute value requires two
607 pieces of information.
608 * The value of the attribute as a JSON expression, and
609 * The optional associated dtype of the attribute; note that this may not exist
610 if, for example, the file is pure zarr.
612 Given these two pieces of information, the read process is as follows.
614 1. The JSON expression is a simple JSON atomic value.
615 * If the dtype is defined, then convert the JSON to that type of data,
616 and then store it as the equivalent netcdf vector of size one.
* If the dtype is not defined, then infer the dtype based on the JSON value,
618 and then store it as the equivalent netcdf vector of size one.
620 2. The JSON expression is an array of simple JSON atomic values.
621 * If the dtype is defined, then convert each JSON value in the array to that type of data,
622 and then store it as the equivalent netcdf vector.
623 * If the dtype is not defined, then infer the dtype based on the first JSON value in the array,
624 and then store it as the equivalent netcdf vector.
626 3. The attribute is any other JSON structure.
* Un-parse the expression to an equivalent sequence of characters, and then store it as type NC_CHAR (see the sketch after this list).
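
As a sketch of case 3 on the read side, a structured JSON attribute comes back through the netCDF API as NC_CHAR text (error checking omitted):

    #include <stdlib.h>
    #include <netcdf.h>

    /* Sketch of case 3 when reading: the attribute is reported as NC_CHAR
       and its value is the un-parsed JSON expression as text. */
    void read_json_attr(int ncid, int varid, const char* name)
    {
        nc_type xtype;
        size_t len;
        char* text;
        nc_inq_att(ncid, varid, name, &xtype, &len); /* xtype == NC_CHAR */
        text = (char*)malloc(len + 1);
        nc_get_att_text(ncid, varid, name, text);
        text[len] = '\0'; /* the JSON expression, as a string */
        free(text);
    }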
This algorithm has two consequences worth noting.
1. If a character-valued attribute's value can be parsed as a legal JSON expression, then it will be stored as such.
632 2. Reading and writing are *almost* idempotent in that the sequence of
633 actions "read-write-read" is equivalent to a single "read" and "write-read-write" is equivalent to a single "write".
634 The "almost" caveat is necessary because (1) whitespace may be added or lost during the sequence of operations,
635 and (2) numeric precision may change.
# Appendix D. Support for string types {#nczarr_strings}
639 Zarr supports a string type, but it is restricted to
640 fixed size strings. NCZarr also supports such strings,
641 but there are some differences in order to interoperate
642 with the netcdf-4/HDF5 variable length strings.
The primary issue to be addressed is to provide a way for the user
to specify the maximum size of the fixed-length strings. This is
handled by providing the following new attributes:
647 1. **_nczarr_default_maxstrlen** —
648 This is an attribute of the root group. It specifies the default
649 maximum string length for string types. If not specified, then
650 it has the value of 128 characters.
651 2. **_nczarr_maxstrlen** —
652 This is a per-variable attribute. It specifies the maximum
653 string length for the string type associated with the variable.
If not specified, then it is assigned the value of
**_nczarr_default_maxstrlen** (see the sketch after this list).
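
A minimal sketch of setting both attributes through the netCDF API (the values are illustrative; error checking omitted):

    #include <netcdf.h>

    /* Sketch: set the dataset-wide default on the root group and a
       per-variable override. */
    void set_max_string_lengths(int ncid, int varid)
    {
        int dfaltlen = 64;
        int varlen = 16;
        nc_put_att_int(ncid, NC_GLOBAL, "_nczarr_default_maxstrlen",
                       NC_INT, 1, &dfaltlen);
        nc_put_att_int(ncid, varid, "_nczarr_maxstrlen",
                       NC_INT, 1, &varlen);
    }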
657 Note that when accessing a string through the netCDF API, the
658 fixed length strings appear as variable length strings. This
659 means that they are stored as pointers to the string
660 (i.e. **char\***) and with a trailing nul character.
661 One consequence is that if the user writes a variable length
662 string through the netCDF API, and the length of that string
663 is greater than the maximum string length for a variable,
664 then the string is silently truncated.
665 Another consequence is that the user must reclaim the string storage.
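
A minimal sketch of reading a one-dimensional NC_STRING variable and reclaiming the storage (error checking omitted):

    #include <stdlib.h>
    #include <netcdf.h>

    /* Sketch: read a 1-D NC_STRING variable of length n, then reclaim the
       per-string storage, which belongs to the caller. */
    void read_strings(int ncid, int varid, size_t n)
    {
        char** strs = (char**)calloc(n, sizeof(char*));
        nc_get_var_string(ncid, varid, strs);
        /* ... use strs[0..n-1]; each is nul-terminated ... */
        nc_free_string(n, strs); /* free the strings themselves */
        free(strs);              /* free the pointer array */
    }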
667 Adding strings also requires some hacking to handle the existing
668 netcdf-c NC_CHAR type, which does not exist in Zarr. The goal
was to choose NumPy types for both the netcdf-c NC_STRING type
670 and the netcdf-c NC_CHAR type such that if a pure zarr
671 implementation reads them, it will still work.
673 For writing variables and NCZarr attributes, the type mapping is as follows:
675 * "|S1" for NC_STRING && MAXSTRLEN==1
676 * "|Sn" for NC_STRING && MAXSTRLEN==n
678 Admittedly, this encoding is a bit of a hack.
So when reading data with a pure zarr implementation,
681 the above types should always appear as strings,
682 and the type that signals NC_CHAR (in NCZarr)
683 would be handled by Zarr as a string of length 1.
686 # Appendix E. Zarr Version 3: NCZarr Version 3 Meta-Data Representation. {#nczarr_version3}
688 For Zarr version 3, the added NCZarr specific metadata is stored
as attributes, in much the same way as for Version 2.
690 Specifically, the following Netcdf-4 meta-data information needs to be captured by NCZarr:
691 1. Shared dimensions: name and size.
692 2. Unlimited dimensions: which dimensions are unlimited.
694 4. Netcdf types not included in Zarr: currently "char" and "string".
5. Zarr types not included in Netcdf: currently only "complex(32|64)".
This extra netcdf-4 meta-data is stored in attributes so as not to interfere with existing implementations.
699 Zarr version 3 supports the following "atomic" types:
700 bool, int8, uint8, int16, uint16, int32, uint32, int64, uint64, float32, float64.
It also defines two structured types: complex64 and complex128.
703 NCZarr supports all of the atomic types.
704 Specialized support is provided for the following
705 Netcdf types: char, string.
706 The Zarr types bool and complex64 are not yet supported, but will be added shortly.
707 The type complex128 is not supported at all.
709 The Zarr type "bool" will appear in the netcdf types as
the enum type *_bool_t*, whose netcdf declaration is as follows:
712 ubyte enum _bool_t {FALSE=0, TRUE=1};
The type complex64 will be supported by defining this compound type:
716 compound _Complex64_t { float64 i; float64 j;}
719 Strings present a problem because there is a proposal
720 to add variable length strings to the Zarr version 3 specification;
721 fixed-length strings would not be supported at all.
722 But strings are important in Netcdf, so a forward compatible
723 representation is provided where the type is string
724 and its maximum size is specified.
726 For arrays, the Netcdf types "char" and "string" are stored
in the Zarr file as of type "uint8" and "r<8*n>" respectively,
where _n_ is the maximum length of the string in bytes (not characters).
729 The fact that they represent "char" and "string" is encoded in the "_nczarr_array" attribute (see below).
The *_nczarr_superblock* attribute serves as a marker to signal that a file is in fact NCZarr as opposed to pure Zarr.
This attribute is stored in the attributes of the *zarr.json* object in the root group of the Zarr file.
The relevant attribute has the following format:

    "_nczarr_superblock": {"version": "<version>"}
742 The optional *_nczarr_group* attribute is stored in the attributes of a Zarr group within
743 the *zarr.json* object in that group.
The relevant attribute has the following format:

    "_nczarr_group": {
        "dimensions": [{"name": "<dimname>", "size": <integer>, "unlimited": 1|0}, ...],
        "arrays": ["<name>", ...],
        "subgroups": ["<name>", ...]
    }
752 Its purpose is two-fold:
753 1. record the objects immediately within that group
2. define netcdf-4 dimension objects within that group.
757 In order to support Netcdf concepts in Zarr, it may be necessary
758 to annotate a Zarr array with extra information.
759 The optional *_nczarr_array* attribute is stored in the attributes of a Zarr array within
760 the *zarr.json* object in that array.
The relevant attribute has the following format:

    "_nczarr_array": {
        "dimension_references": ["/g1/g2/d1", "/d2", ...],
        "type_alias": "<string indicating special type aliasing>" // optional
    }
768 The *dimension_references* key is an expansion of the "dimensions" key
769 found in the *zarr.json* object for an array.
770 The problem with "dimensions" is that it specifies a simple name for each
771 dimension, whereas netcdf-4 requires that the array references dimension objects
772 that may appear in groups anywhere in the file. These references are encoded
as FQNs "pointing" to a specific dimension declaration (see the *_nczarr_group* attribute above).
776 FQN is an acronym for "Fully Qualified Name".
777 It is a series of names separated by the "/" character, much
778 like a file system path.
779 It identifies the group in which the dimension is ostensibly "defined" in the Netcdf sense.
For example, ````/d1```` refers to a dimension "d1" defined in the root group.
Similarly, ````/g1/g2/d2```` refers to a dimension "d2" defined in the
group g2, which in turn is a subgroup of group g1, which is a subgroup of the root group.
785 The *type_alias* key is used to annotate the type of an array
786 to allow discovery of netcdf-4 specific types.
787 Specifically, there are three current cases:
| dtype | type_alias |
| ----- | ---------- |
| uint8 | char |
| r<8*n> | string |
| uint8 | json |
794 If, for example, an array's dtype is specified as *uint8*, then it may be that
795 it is actually of unsigned 8-bit integer type. But it may actually be of some
netcdf-4 type that is encoded as *uint8* in order to be recognized by other -- pure zarr --
797 implementations. So, for example, if the netcdf-4 type is *char*, then the array's
798 dtype is *uint8*, but its type alias is *char*.
800 ## Attribute Type Annotation
In Zarr version 3, group and array attributes are stored inside
the corresponding _zarr.json_ object under the dictionary key "attributes".
Note that this decision is still under discussion, and it may be changed
to store attributes in an object separate from _zarr.json_.
806 Regardless of where the attributes are stored, and in order to
807 support netcdf-4 typed attributes, the per-attribute information
808 is stored as a special attribute called _\_nczarr_attrs\__ defined to hold
809 NCZarr specific attribute information. Currently, it only holds
810 the attribute typing information.
811 It can appear in any *zarr.json* object: group or array.
817 {"name": "attr1", "configuration": {"type": "<dtype>"}},
There is one entry for every attribute (including itself) giving the type of that attribute.
It should be noted that Zarr allows the value of an attribute to be an arbitrary
JSON-encoded structure. In order to support this in netcdf-4, if such a structure
is encountered as an attribute value, then it is typed as *json* (see the previous
discussion in Appendix C).
829 ## Codec Specification
830 The Zarr version 3 representation of codecs is slightly different
831 than that used by Zarr version 2.
832 In version 2, the codec is represented by this JSON template.
834 {"id": "<codec name>" "<param>": "<value>", "<param>": "<value>", ...}
836 In version 3, the codec is represented by this JSON template.
838 {"name": "<codec name>" "configuration": {"<param>": "<value>", "<param>": "<value>", ...}}
842 # References {#nczarr_bib}
844 <a name="ref_aws">[1]</a> [Amazon Simple Storage Service Documentation](https://docs.aws.amazon.com/s3/index.html)<br>
845 <a name="ref_awssdk">[2]</a> [Amazon Simple Storage Service Library](https://github.com/aws/aws-sdk-cpp)<br>
846 <a name="ref_libzip">[3]</a> [The LibZip Library](https://libzip.org/)<br>
847 <a name="ref_nczarr">[4]</a> [NetCDF ZARR Data Model Specification](https://www.unidata.ucar.edu/blogs/developer/en/entry/netcdf-zarr-data-model-specification)<br>
<a name="ref_python">[5]</a> [Python Documentation: 8.3.
collections — High-performance container datatypes](https://docs.python.org/2/library/collections.html)<br>
850 <a name="ref_zarrv2">[6]</a> [Zarr Version 2 Specification](https://zarr.readthedocs.io/en/stable/spec/v2.html)<br>
851 <a name="ref_xarray">[7]</a> [XArray Zarr Encoding Specification](http://xarray.pydata.org/en/latest/internals.html#zarr-encoding-specification)<br>
852 <a name="dynamic_filter_loading">[8]</a> [Dynamic Filter Loading](https://support.hdfgroup.org/HDF5/doc/Advanced/DynamicallyLoadedFilters/HDF5DynamicallyLoadedFilters.pdf)<br>
853 <a name="official_hdf5_filters">[9]</a> [Officially Registered Custom HDF5 Filters](https://portal.hdfgroup.org/display/support/Registered+Filter+Plugins)<br>
854 <a name="blosc-c-impl">[10]</a> [C-Blosc Compressor Implementation](https://github.com/Blosc/c-blosc)<br>
855 <a name="ref_awssdk_conda">[11]</a> [Conda-forge packages / aws-sdk-cpp](https://anaconda.org/conda-forge/aws-sdk-cpp)<br>
856 <a name="ref_gdal">[12]</a> [GDAL Zarr](https://gdal.org/drivers/raster/zarr.html)<br>
<a name="ref_nczarrv3">[13]</a> [Zarr Version 3 Specification](https://zarr-specs.readthedocs.io/en/latest/specs.html)
861 # Change Log {#nczarr_changelog}
862 [Note: minor text changes are not included.]
864 Note, this log was only started as of 8/11/2022 and is not
865 intended to be a detailed chronology. Rather, it provides highlights
866 that will be of interest to NCZarr users. In order to see exact changes,
it is necessary to use the 'git diff' command.
1. Document the change to V2 to use attributes to hold NCZarr metadata.
873 1. Add description of support for Zarr version 3 as an appendix.
876 1. Move most of the S3 text to the cloud.md document.
879 1. Zarr fixed-size string types are now supported.
882 1. The NCZarr specific keys have been converted to lower-case
883 (e.g. "_nczarr_attr" instead of "_NCZARR_ATTR"). Upper case is
884 accepted for back compatibility.
2. The legal values of an attribute have been extended to
include arbitrary JSON expressions; see Appendix C for more details.
889 # Point of Contact {#nczarr_poc}
891 __Author__: Dennis Heimbigner<br>
892 __Email__: dmh at ucar dot edu<br>
893 __Initial Version__: 4/10/2020<br>
894 __Last Revised__: 4/02/2024