Big docs reorganise and expand. #109

Open
wants to merge 22 commits into base: main

Commits
a51f251
Big rework and expand docs.
pp-mo Jan 16, 2025
5e81543
Lots more improvements + move sections.
pp-mo Jan 16, 2025
8b3c52a
More fixes to correctness, consistency, readability. Add example for…
pp-mo Jan 25, 2025
0e83165
Overhaul all API docstrings.
pp-mo Feb 6, 2025
dce4b72
Update docs/userdocs/user_guide/data_objects.rst
pp-mo Feb 6, 2025
de38b89
Update docs/userdocs/user_guide/data_objects.rst
pp-mo Feb 6, 2025
10a6bee
Update docs/userdocs/user_guide/common_operations.rst
pp-mo Feb 6, 2025
cf79296
Update docs/userdocs/user_guide/common_operations.rst
pp-mo Feb 6, 2025
2356d12
Update docs/userdocs/user_guide/common_operations.rst
pp-mo Feb 6, 2025
872aa19
Update docs/userdocs/user_guide/common_operations.rst
pp-mo Feb 6, 2025
33232da
Update docs/userdocs/user_guide/general_topics.rst
pp-mo Feb 6, 2025
28b3ca3
Update docs/userdocs/user_guide/general_topics.rst
pp-mo Feb 6, 2025
eea69fb
Update docs/userdocs/user_guide/general_topics.rst
pp-mo Feb 6, 2025
a1fa515
Review changes: links, indents, rewording.
pp-mo Feb 7, 2025
3433c29
Completion of original review comments (mostly, a few from new set).
pp-mo Feb 11, 2025
06cd859
Fixes to data types documentation.
pp-mo Feb 12, 2025
4e563c1
Fix external link.
pp-mo Feb 12, 2025
41701f9
Fix list of core object container properties.
pp-mo Feb 12, 2025
e5007f1
Fix bad formatting on installation page.
pp-mo Feb 12, 2025
a9afc60
More review changes + tweaks.
pp-mo Feb 12, 2025
d526b0c
Include basic changelog update in the release process docs.
pp-mo Feb 12, 2025
12eb3a2
Fix code blocks in introduction.
pp-mo Feb 12, 2025
15 changes: 10 additions & 5 deletions docs/change_log.rst
@@ -1,22 +1,27 @@
.. _change_log:

Versions and Change Notes
=========================

Project Status
--------------
.. _development_status:

Project Development Status
--------------------------
We intend to follow `PEP 440 <https://peps.python.org/pep-0440/>`_,
or (older) `SemVer <https://semver.org/>`_ versioning principles.
This means the version string has the basic form **"major.minor.bugfix[special-types]"**.

Current release version is at **"v0.1"**.
Current release version is at **"v0.2"**.

This is a first complete implementation,
with functional operational of all public APIs.
This is a complete implementation, with functional operation of all public APIs.
The code is however still experimental, and APIs are not stable
(hence no major version yet).

.. _change_notes:

Change Notes
------------
Summary of key features by release number

Unreleased
^^^^^^^^^^
61 changes: 61 additions & 0 deletions docs/details/character_handling.rst
@@ -0,0 +1,61 @@
.. _string-and-character-data:

Character and String Data Handling
----------------------------------
NetCDF can contain string and character data in at least 3 different contexts :

Characters in Data Component Names
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
That is, names of groups, variables, attributes or dimensions.
Component names in the API are just native Python strings.

Since NetCDF version 4, the names of components within files are fully unicode
compliant, using UTF-8.

These names can use virtually **any** characters, with the exception of the forward
slash "/", since in some technical cases a component name needs to be specified as a
"path-like" compound.


Characters in Attribute Values
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Character data in string *attribute* values can likewise be read and written simply as
Python strings.

However, they are actually *stored* in an :class:`~ncdata.NcAttribute`'s
``.value`` as a character array of dtype "<U??" (where "??" stands for some definite
length). These are returned by :meth:`ncdata.NcAttribute.as_python_value` as a simple
Python string.

A vector of strings is also a permitted attribute value, but bear in mind that
**a vector of strings is not currently supported in netCDF4 implementations**.
Thus, you cannot have an array or list of strings as an attribute value in an actual file,
and if stored to a file such an attribute will be concatenated into a single string value.

In actual files, Unicode is again supported via UTF-8, and seamlessly encoded/decoded.
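
A small sketch of this behaviour (the exact repr of the stored value is indicative only):

.. code-block:: python

    from ncdata import NcAttribute

    attr = NcAttribute("title", "surface temperature")

    # The stored form is a numpy array of a "<U.." string dtype ...
    print(repr(attr.value))          # e.g. array('surface temperature', dtype='<U19')

    # ... but it reads back as a plain Python string.
    print(attr.as_python_value())    # --> 'surface temperature'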


Characters in Variable Data
^^^^^^^^^^^^^^^^^^^^^^^^^^^
Character data in variable *data* arrays is generally stored as fixed-length arrays of
characters (i.e. fixed-width strings), and no unicode interpretation is applied by the
libraries (neither netCDF4 nor ncdata). In this case, the strings appear in Python as
numpy character arrays of dtype "<U1". All elements have the same fixed length, but
may contain zero bytes, so that they convert to variable-width (Python) strings up to a
maximum width. Trailing characters are padded with "NUL", i.e. the "\\0" character,
aka the "zero byte". The (maximum) string length is a separate dimension, which is
recorded as a normal netCDF file dimension like any other.
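
For example, a plain numpy sketch of how such character data converts to Python strings
(this is generic numpy handling, not a specific ncdata API):

.. code-block:: python

    import numpy as np

    # A (2, 4) array of single characters, as char variable data might appear;
    # the trailing positions of the shorter value are NUL ("\0") padded.
    chars = np.array([list("abcd"), list("xy\0\0")], dtype="<U1")

    # Join each row and strip the NUL padding to get variable-width strings.
    strings = np.array(["".join(row).rstrip("\0") for row in chars])
    print(strings)    # --> ['abcd' 'xy']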

.. note::

Although it is not tested, it has proved possible (and useful) at present to load
files with variables containing variable-length string data, but it is
necessary to supply an explicit user chunking to work around limitations in Dask.
Please see the :ref:`howto example <howto_load_variablewidth_strings>`.

.. warning::

The netCDF4 package will perform automatic character encoding/decoding of a
character variable if it has a special ``_Encoding`` attribute. Ncdata does not
currently allow for this. See : :ref:`known-issues`

5 changes: 5 additions & 0 deletions docs/details/details_index.rst
@@ -1,9 +1,14 @@
Detail Topics
=============
Detail reference topics

.. toctree::
:maxdepth: 2

../change_log
./known_issues
./interface_support
./character_handling
./threadlock_sharing
./developer_notes

6 changes: 6 additions & 0 deletions docs/details/developer_notes.rst
@@ -28,6 +28,12 @@ Documentation build
Release actions
---------------

#. Update the :ref:`change_log` page in the details section

#. ensure all major changes + PRs are referenced in the :ref:`change_notes` section

#. update the "latest version" stated in the :ref:`development_status` section

#. Cut a release on GitHub : this triggers a new docs version on [ReadTheDocs](https://readthedocs.org/projects/ncdata/)

#. Build the distribution
53 changes: 35 additions & 18 deletions docs/details/interface_support.rst
@@ -14,43 +14,59 @@ Datatypes
^^^^^^^^^
Ncdata supports all the regular datatypes of netcdf, but *not* the
variable-length and user-defined datatypes.
Please see : :ref:`data-types`.

This means, notably, that all string variables will have the basic numpy type
'S1', equivalent to netcdf 'NC_CHAR'. Thus, multi-character string variables
must always have a definite "string-length" dimension.

Attribute values, by contrast, are treated as Python strings with the normal
variable length support. Their basic dtype can be any numpy string dtype,
but will be converted when required.

The NetCDF C library and netCDF4-python do not support arrays of strings in
attributes, so neither does NcData.


Data Scaling, Masking and Compression
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Ncdata does not implement scaling and offset within data arrays : The ".data"
Data Scaling and Masking
^^^^^^^^^^^^^^^^^^^^^^^^
Ncdata does not implement scaling and offset within variable data arrays : The ".data"
array has the actual variable dtype, and the "scale_factor" and
"add_offset" attributes are treated like any other attribute.

The existence of a "_FillValue" attribute controls how.. TODO
Likewise, Ncdata does not use masking within its variable data arrays, so that variable
data arrays contain "raw" data, which includes any "fill" values -- i.e. at any missing
data point you will have a "fill" value rather than a masked point.

The use of "scale_factor", "add_offset" and "_FillValue" attributes are standard
conventions described in the NetCDF documentation itself, and implemented by NetCDF
library software including the Python netCDF4 library. To ignore these default
interpretations, ncdata has to actually turn these features "off". The rationale for
this, however, is that the low-level unprocessed data content, equivalent to actual
file storage, may be more likely to form a stable common basis of equivalence, particularly
between different system architectures.
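
To illustrate what "raw" data means here, the same view can be obtained from the netCDF4
package directly by disabling its automatic conversions (the file and variable names
below are hypothetical):

.. code-block:: python

    import netCDF4 as nc

    ds = nc.Dataset("packed_data.nc")        # hypothetical file with packed int16 data
    var = ds.variables["air_temperature"]

    # netCDF4 default behaviour: data is scaled/offset and masked on read.
    scaled = var[:]

    # Ncdata's view corresponds to the raw stored values, fill values included.
    var.set_auto_maskandscale(False)
    raw = var[:]

    # The packing attributes themselves are just ordinary attributes.
    print(var.getncattr("scale_factor"), var.getncattr("add_offset"))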


.. _file-storage:

File storage control
^^^^^^^^^^^^^^^^^^^^
The :func:`ncdata.netcdf4.to_nc4` function cannot control the compression or storage
options provided by :meth:`netCDF4.Dataset.createVariable`, which means you cannot
control the data compression and translation facilities of the NetCDF file
library.
If required, you should use :mod:`iris` or :mod:`xarray` for this.
If required, you should use :mod:`iris` or :mod:`xarray` for this, i.e. use
:meth:`xarray.Dataset.to_netcdf` or :func:`iris.save` instead of
:func:`ncdata.netcdf4.to_nc4`, as these provide additional options for controlling
netcdf file creation.
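
For example, a sketch of saving via Xarray instead (the ``to_xarray`` conversion call is
assumed here -- see the :mod:`ncdata.xarray` module for the actual interface):

.. code-block:: python

    from ncdata.netcdf4 import from_nc4
    from ncdata.xarray import to_xarray

    ncdata = from_nc4("input.nc")
    xrds = to_xarray(ncdata)

    # Xarray exposes per-variable storage/compression options which to_nc4 does not.
    xrds.to_netcdf(
        "output.nc",
        encoding={"air_temperature": {"zlib": True, "complevel": 4}},
    )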

File-specific storage aspects, such as chunking, data-paths or compression
strategies, are not recorded in the core objects. However, array representations in
variable and attribute data (notably dask lazy arrays) may hold such information.

The concept of "unlimited" dimensions is also, you might think, outside the abstract
model of NetCDF data and not of concern to Ncdata . However, in fact this concept is
present as a core property of dimensions in the classic NetCDF data model (see
"Dimension" in the `NetCDF Classic Data Model`_), so that is why it **is** an essential
property of an NcDimension also.
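
A small sketch of that property (the exact constructor signature shown is an assumption):

.. code-block:: python

    from ncdata import NcDimension

    time_dim = NcDimension("time", size=10, unlimited=True)
    lat_dim = NcDimension("latitude", size=73)

    print(time_dim.unlimited, lat_dim.unlimited)   # --> True False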


Dask chunking control
^^^^^^^^^^^^^^^^^^^^^
Loading from netcdf files generates variables whose data arrays are all Dask
lazy arrays. These are created with the "chunks='auto'" setting.
There is currently no control for this : If required, load via Iris or Xarray
instead.

However, there is a simple per-dimension chunking control available on loading.
See :func:`ncdata.netcdf4.from_nc4`.
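
For example, a sketch of per-dimension chunk control on load (the keyword name shown is
illustrative only -- check the :func:`ncdata.netcdf4.from_nc4` docstring for the actual
argument):

.. code-block:: python

    from ncdata.netcdf4 import from_nc4

    # Default load: variable data become Dask arrays with chunks="auto".
    ncdata = from_nc4("input.nc")

    # Hypothetical per-dimension chunking control.
    ncdata = from_nc4("input.nc", dim_chunks={"time": 100, "latitude": -1})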


Xarray Compatibility
@@ -94,3 +110,4 @@ see : `support added in v3.7.0 <https://scitools-iris.readthedocs.io/en/stable/w


.. _Continuous Integration testing on GitHub: https://github.com/pp-mo/ncdata/blob/main/.github/workflows/ci-tests.yml
.. _NetCDF Classic Data Model: https://docs.unidata.ucar.edu/netcdf-c/current/netcdf_data_model.html#classic_model
docs/details/known_issues.rst
@@ -1,3 +1,5 @@
.. _known-issues:

Outstanding Issues
==================

@@ -21,6 +23,19 @@ To be fixed

* `issue#66 <https://github.com/pp-mo/ncdata/issues/66>`_

* in conversion to/from netCDF4 files

* netCDF4 performs automatic encoding/decoding of byte data to characters, triggered
by the existence of an ``_Encoding`` attribute on a character type variable.
Ncdata does not currently account for this, and may fail to read/write correctly.
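
For reference, a sketch of the netCDF4 behaviour concerned (the file and variable names
are hypothetical):

.. code-block:: python

    import netCDF4 as nc

    ds = nc.Dataset("file_with_encoded_chars.nc")
    var = ds.variables["station_name"]   # a char variable with an ``_Encoding`` attribute

    # With _Encoding present, netCDF4 auto-decodes character arrays to strings on read ...
    decoded = var[:]

    # ... which can be disabled to see the underlying raw character data.
    var.set_auto_chartostring(False)
    raw_chars = var[:]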


.. _todo:

Incomplete Documentation
^^^^^^^^^^^^^^^^^^^^^^^^
(PLACEHOLDER: documentation is incomplete, please fix me !)


Identified Design Limitations
-----------------------------
@@ -36,7 +51,7 @@ There are no current plans to address these, but could be considered in future
* notably, includes compound and variable-length types

* ..and especially **variable-length strings in variables**.
see : :ref:`string_and_character_data`
see : :ref:`string-and-character-data`, :ref:`data-types`


Features planned
67 changes: 49 additions & 18 deletions docs/details/threadlock_sharing.rst
@@ -1,30 +1,23 @@
.. _thread-safety:

NetCDF Thread Locking
=====================
Ncdata includes support for "unifying" the thread-safety mechanisms between
ncdata and the format packages it supports (Iris and Ncdata).
Ncdata provides the :mod:`ncdata.threadlock_sharing` module, which can ensure that all
the relevant data-format packages use a "unified" thread-safety mechanism to
prevent them from disturbing each other.

This concerns the safe use of the common NetCDF library by multiple threads.
Such multi-threaded access usually occurs when your code has Dask arrays
created from netcdf file data, which it is either computing or storing to an
output netcdf file.

The netCDF4 package (and the underlying C library) does not implement any
threadlock, neither is it thread-safe (re-entrant) by design.
Thus contention is possible unless controlled by the calling packages.
*Each* of the data-format packages (Ncdata, Iris and Xarray) defines its own
locking mechanism to prevent overlapping calls into the netcdf library.

All 3 data-format packages can map variable data into Dask lazy arrays. Iris and
Xarray can also create delayed write operations (but ncdata currently does not).

However, those mechanisms cannot protect an operation of that package from
overlapping with one in *another* package.
In short, this is not needed when all your data is loaded with only **one** of the data
packages (Iris, Xarray or ncdata). The problem only occurs when you try to
realise/calculate/save results which combine data loaded from a mixture of sources.

The :mod:`ncdata.threadlock_sharing` module can ensure that all of the relevant
packages use the *same* thread lock,
so that they can safely co-operate in parallel operations.
sample code:

sample code::
.. code-block:: python

from ncdata.threadlock_sharing import enable_lockshare, disable_lockshare
from ncdata.xarray import from_xarray
@@ -40,11 +33,49 @@ sample code::

disable_lockshare()

or::
... *or* ...

.. code-block:: python

    with lockshare_context(iris=True):
        ncdata = NcData(source_filepath)
        ncdata.variables['x'].attributes['units'] = 'K'
        cubes = ncdata.iris.to_iris(ncdata)
        iris.save(cubes, output_filepath)


Background
^^^^^^^^^^
In practice, Iris, Xarray and Ncdata are all capable of scanning netCDF files and interpreting their metadata, while
not reading all the core variable data contained in them.

This generates objects containing `Dask arrays <https://docs.dask.org/en/stable/array.html>`_ with deferred access

Collaborator comment:

You have Intersphinx for Dask, so I recommend using it.

To achieve a link to this specific page, you can use this syntax (not sure about correct way to pluralise):

Suggested change
This generates objects containing `Dask arrays <https://docs.dask.org/en/stable/array.html>`_ with deferred access
This generates objects containing Dask :external+dask:doc:`array` s with deferred access


to bulk file data for later access, with certain key benefits :

* no data loading or calculation happens until needed
* the work is divided into sectional ‘tasks’, of which only some may ultimately be needed
* it may be possible to perform multiple sections of calculation (including data fetch) in parallel
* it may be possible to localise operations (fetch or calculate) near to data distributed across a cluster

Usually, the most efficient parallelisation of array operations is by multi-threading,
since threads can share large data arrays in memory.

However, the Python netCDF4 library (and the underlying C library) is not thread-safe
(re-entrant) by design, nor does it implement any thread locking itself, so the
"netcdf fetch" call in each input operation must be guarded by a mutex.
Thus, contention is possible unless controlled by the calling packages.

Each of Xarray, Iris and ncdata creates input data tasks to fetch sections of data from
the input files. Each uses a mutex lock around netcdf accesses in those tasks, to stop
them from accessing the netCDF4 interface at the same time as any of the others.

This works beautifully until ncdata connects (for example) lazy data loaded *with Iris*
with lazy data loaded *from Xarray*. These would then unfortunately each be using their
own *separate* mutexes to protect the same netcdf library. So, if we then attempt to
calculate or save the result, which combines data from both sources, we could get
sporadic and unpredictable system-level errors, even a core-dump type failure.

So, the function of :mod:`ncdata.threadlock_sharing` is to connect the thread-locking
schemes of the separate libraries, so that they cannot accidentally overlap an access
call in a different thread *from the other package*, just as they already cannot
overlap *one of their own*.
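
For example, a minimal sketch of the enable/disable form covering both Iris and Xarray
(the ``xarray=True`` keyword is assumed here, by analogy with ``iris=True`` above):

.. code-block:: python

    from ncdata.threadlock_sharing import enable_lockshare, disable_lockshare

    enable_lockshare(iris=True, xarray=True)
    try:
        # ... combine, compute or save lazy data loaded via ncdata, Iris and Xarray ...
        pass
    finally:
        disable_lockshare()
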
5 changes: 3 additions & 2 deletions docs/index.rst
@@ -38,8 +38,9 @@ User Documentation
User Guide <./userdocs/user_guide/user_guide>


Reference
---------
Reference Documentation
-----------------------

.. toctree::
:maxdepth: 2

21 changes: 18 additions & 3 deletions docs/userdocs/getting_started/installation.rst
@@ -4,13 +4,28 @@ Ncdata is available on PyPI and conda-forge

Install from conda-forge with conda
-----------------------------------
Like this::
conda install -c conda-forge ncdata
Like this:

.. code-block:: bash

$ conda install -c conda-forge ncdata


Install from PyPI with pip
--------------------------
Like this::
Like this:

.. code-block:: bash

pip install ncdata


Check install
^^^^^^^^^^^^^

.. code-block:: bash

$ python -c "from ncdata import NcData; print(NcData())"
<NcData: <'no-name'>
>
