Big docs reorganise and expand. #109

Open
wants to merge 22 commits into base: main

Commits
a51f251
Big rework and expand docs.
pp-mo Jan 16, 2025
5e81543
Lots more improvements + move sections.
pp-mo Jan 16, 2025
8b3c52a
More fixes to correctness, consistency, readability. Add example for…
pp-mo Jan 25, 2025
0e83165
Overhaul all API docstrings.
pp-mo Feb 6, 2025
dce4b72
Update docs/userdocs/user_guide/data_objects.rst
pp-mo Feb 6, 2025
de38b89
Update docs/userdocs/user_guide/data_objects.rst
pp-mo Feb 6, 2025
10a6bee
Update docs/userdocs/user_guide/common_operations.rst
pp-mo Feb 6, 2025
cf79296
Update docs/userdocs/user_guide/common_operations.rst
pp-mo Feb 6, 2025
2356d12
Update docs/userdocs/user_guide/common_operations.rst
pp-mo Feb 6, 2025
872aa19
Update docs/userdocs/user_guide/common_operations.rst
pp-mo Feb 6, 2025
33232da
Update docs/userdocs/user_guide/general_topics.rst
pp-mo Feb 6, 2025
28b3ca3
Update docs/userdocs/user_guide/general_topics.rst
pp-mo Feb 6, 2025
eea69fb
Update docs/userdocs/user_guide/general_topics.rst
pp-mo Feb 6, 2025
a1fa515
Review changes: links, indents, rewording.
pp-mo Feb 7, 2025
3433c29
Completion of original review comments (mostly, a few from new set).
pp-mo Feb 11, 2025
06cd859
Fixes to data types documentation.
pp-mo Feb 12, 2025
4e563c1
Fix external link.
pp-mo Feb 12, 2025
41701f9
Fix list of core object container properties.
pp-mo Feb 12, 2025
e5007f1
Fix bad formatting on installation page.
pp-mo Feb 12, 2025
a9afc60
More review changes + tweaks.
pp-mo Feb 12, 2025
d526b0c
Include basic changelog update in the release process docs.
pp-mo Feb 12, 2025
12eb3a2
Fix code blocks in introduction.
pp-mo Feb 12, 2025
15 changes: 10 additions & 5 deletions docs/change_log.rst
@@ -1,22 +1,27 @@
.. _change_log:

Versions and Change Notes
=========================

Project Status
--------------
.. _development_status:

Project Development Status
--------------------------
We intend to follow `PEP 440 <https://peps.python.org/pep-0440/>`_,
or (older) `SemVer <https://semver.org/>`_ versioning principles.
This means the version string has the basic form **"major.minor.bugfix[special-types]"**.

Current release version is at **"v0.1"**.
Current release version is at **"v0.2"**.

This is a first complete implementation,
with functional operational of all public APIs.
This is a complete implementation, with functional operation of all public APIs.
The code is however still experimental, and APIs are not stable
(hence no major version yet).

.. _change_notes:

Change Notes
------------
Summary of key features by release number

Unreleased
^^^^^^^^^^
61 changes: 61 additions & 0 deletions docs/details/character_handling.rst
@@ -0,0 +1,61 @@
.. _string-and-character-data:

Character and String Data Handling
----------------------------------
NetCDF can contain string and character data in at least 3 different contexts :

Characters in Data Component Names
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
That is, names of groups, variables, attributes or dimensions.
Component names in the API are just native Python strings.

Since NetCDF version 4, the names of components within files are fully unicode
compliant, using UTF-8.

These names can use virtually **any** characters, with the exception of the forward
slash "/", since in some technical cases a component name needs to be specified as a
"path-like" compound.


Characters in Attribute Values
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Character data in string *attribute* values can likewise be read and written simply as
Python strings.

However, they are actually *stored* in an :class:`~ncdata.NcAttribute`'s
``.value`` as a character array of dtype "<U??" (where "??" stands for some definite
length). These are returned by :meth:`ncdata.NcAttribute.as_python_value` as a simple
Python string.

A vector of strings is also a permitted attribute value, but bear in mind that
**a vector of strings is not currently supported in netCDF4 implementations**.
Thus, you cannot have an array or list of strings as an attribute value in an actual file,
and if stored to a file such an attribute will be concatenated into a single string value.

In actual files, Unicode is again supported via UTF-8, and seamlessly encoded/decoded.
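
A small sketch of this behaviour (the exact repr of the stored value is indicative only):

.. code-block:: python

    from ncdata import NcAttribute

    attr = NcAttribute("title", "surface temperature")

    # The stored form is a numpy array of a "<U.." string dtype ...
    print(repr(attr.value))          # e.g. array('surface temperature', dtype='<U19')

    # ... but it reads back as a plain Python string.
    print(attr.as_python_value())    # --> 'surface temperature'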


Characters in Variable Data
^^^^^^^^^^^^^^^^^^^^^^^^^^^
Character data in variable *data* arrays is generally stored as fixed-length arrays of
characters (i.e. fixed-width strings), and no unicode interpretation is applied by the
libraries (neither netCDF4 nor ncdata). In this case, the strings appear in Python as
numpy character arrays of dtype "<U1". All elements have the same fixed length, but
may contain zero bytes, so that they convert to variable-width (Python) strings up to a
maximum width. Trailing characters are padded with "NUL", i.e. the "\\0" character,
aka the "zero byte". The (maximum) string length is a separate dimension, which is
recorded as a normal netCDF file dimension like any other.
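
For example, a plain numpy sketch of how such character data converts to Python strings
(this is generic numpy handling, not a specific ncdata API):

.. code-block:: python

    import numpy as np

    # A (2, 4) array of single characters, as char variable data might appear;
    # the trailing positions of the shorter value are NUL ("\0") padded.
    chars = np.array([list("abcd"), list("xy\0\0")], dtype="<U1")

    # Join each row and strip the NUL padding to get variable-width strings.
    strings = np.array(["".join(row).rstrip("\0") for row in chars])
    print(strings)    # --> ['abcd' 'xy']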

.. note::

Although it is not tested, it has proved possible (and useful) at present to load
files with variables containing variable-length string data, but it is
necessary to supply an explicit user chunking to work around limitations in Dask.
Please see the :ref:`howto example <howto_load_variablewidth_strings>`.

.. warning::

The netCDF4 package will perform automatic character encoding/decoding of a
character variable if it has a special ``_Encoding`` attribute. Ncdata does not
currently allow for this. See : :ref:`known-issues`

5 changes: 5 additions & 0 deletions docs/details/details_index.rst
@@ -1,9 +1,14 @@
Detail Topics
=============
Detail reference topics

.. toctree::
:maxdepth: 2

../change_log
./known_issues
./interface_support
./character_handling
./threadlock_sharing
./developer_notes

6 changes: 6 additions & 0 deletions docs/details/developer_notes.rst
@@ -28,6 +28,12 @@ Documentation build
Release actions
---------------

#. Update the :ref:`change_log` page in the details section

#. ensure all major changes + PRs are referenced in the :ref:`change_notes` section

#. update the "latest version" stated in the :ref:`development_status` section

#. Cut a release on GitHub : this triggers a new docs version on [ReadTheDocs](https://readthedocs.org/projects/ncdata/)

#. Build the distribution
53 changes: 35 additions & 18 deletions docs/details/interface_support.rst
@@ -14,43 +14,59 @@ Datatypes
^^^^^^^^^
Ncdata supports all the regular datatypes of netcdf, but *not* the
variable-length and user-defined datatypes.
Please see : :ref:`data-types`.

This means, notably, that all string variables will have the basic numpy type
'S1', equivalent to netcdf 'NC_CHAR'. Thus, multi-character string variables
must always have a definite "string-length" dimension.

Attribute values, by contrast, are treated as Python strings with the normal
variable length support. Their basic dtype can be any numpy string dtype,
but will be converted when required.

The NetCDF C library and netCDF4-python do not support arrays of strings in
attributes, so neither does NcData.


Data Scaling, Masking and Compression
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Ncdata does not implement scaling and offset within data arrays : The ".data"
Data Scaling and Masking
^^^^^^^^^^^^^^^^^^^^^^^^
Ncdata does not implement scaling and offset within variable data arrays : The ".data"
array has the actual variable dtype, and the "scale_factor" and
"add_offset" attributes are treated like any other attribute.

The existence of a "_FillValue" attribute controls how.. TODO
Likewise, Ncdata does not use masking within its variable data arrays, so that variable
data arrays contain "raw" data, which includes any "fill" values -- i.e. at any missing
data point you will have a "fill" value rather than a masked point.

The use of "scale_factor", "add_offset" and "_FillValue" attributes are standard
conventions described in the NetCDF documentation itself, and implemented by NetCDF
library software including the Python netCDF4 library. To ignore these default
interpretations, ncdata has to actually turn these features "off". The rationale for
this, however, is that the low-level unprocessed data content, equivalent to actual
file storage, may be more likely to form a stable common basis of equivalence, particularly
between different system architectures.
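
To illustrate what "raw" data means here, the same view can be obtained from the netCDF4
package directly by disabling its automatic conversions (the file and variable names
below are hypothetical):

.. code-block:: python

    import netCDF4 as nc

    ds = nc.Dataset("packed_data.nc")        # hypothetical file with packed int16 data
    var = ds.variables["air_temperature"]

    # netCDF4 default behaviour: data is scaled/offset and masked on read.
    scaled = var[:]

    # Ncdata's view corresponds to the raw stored values, fill values included.
    var.set_auto_maskandscale(False)
    raw = var[:]

    # The packing attributes themselves are just ordinary attributes.
    print(var.getncattr("scale_factor"), var.getncattr("add_offset"))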


.. _file-storage:

File storage control
^^^^^^^^^^^^^^^^^^^^
The :func:`ncdata.netcdf4.to_nc4` function cannot control the compression or storage
options provided by :meth:`netCDF4.Dataset.createVariable`, which means you cannot
control the data compression and translation facilities of the NetCDF file
library.
If required, you should use :mod:`iris` or :mod:`xarray` for this.
If required, you should use :mod:`iris` or :mod:`xarray` for this, i.e. use
:meth:`xarray.Dataset.to_netcdf` or :func:`iris.save` instead of
:func:`ncdata.netcdf4.to_nc4`, as these provide additional options for controlling
netcdf file creation.
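
For example, a sketch of saving via Xarray instead (the ``to_xarray`` conversion call is
assumed here -- see the :mod:`ncdata.xarray` module for the actual interface):

.. code-block:: python

    from ncdata.netcdf4 import from_nc4
    from ncdata.xarray import to_xarray

    ncdata = from_nc4("input.nc")
    xrds = to_xarray(ncdata)

    # Xarray exposes per-variable storage/compression options which to_nc4 does not.
    xrds.to_netcdf(
        "output.nc",
        encoding={"air_temperature": {"zlib": True, "complevel": 4}},
    )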

File-specific storage aspects, such as chunking, data-paths or compression
strategies, are not recorded in the core objects. However, array representations in
variable and attribute data (notably dask lazy arrays) may hold such information.

The concept of "unlimited" dimensions is also, you might think, outside the abstract
model of NetCDF data and not of concern to Ncdata . However, in fact this concept is
present as a core property of dimensions in the classic NetCDF data model (see
"Dimension" in the `NetCDF Classic Data Model`_), so that is why it **is** an essential
property of an NcDimension also.
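
A small sketch of that property (the exact constructor signature shown is an assumption):

.. code-block:: python

    from ncdata import NcDimension

    time_dim = NcDimension("time", size=10, unlimited=True)
    lat_dim = NcDimension("latitude", size=73)

    print(time_dim.unlimited, lat_dim.unlimited)   # --> True False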


Dask chunking control
^^^^^^^^^^^^^^^^^^^^^
Loading from netcdf files generates variables whose data arrays are all Dask
lazy arrays. These are created with the "chunks='auto'" setting.
There is currently no control for this : If required, load via Iris or Xarray
instead.

However, there is a simple per-dimension chunking control available on loading.
See :func:`ncdata.netcdf4.from_nc4`.
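
For example, a sketch of per-dimension chunk control on load (the keyword name shown is
illustrative only -- check the :func:`ncdata.netcdf4.from_nc4` docstring for the actual
argument):

.. code-block:: python

    from ncdata.netcdf4 import from_nc4

    # Default load: variable data become Dask arrays with chunks="auto".
    ncdata = from_nc4("input.nc")

    # Hypothetical per-dimension chunking control.
    ncdata = from_nc4("input.nc", dim_chunks={"time": 100, "latitude": -1})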


Xarray Compatibility
@@ -94,3 +110,4 @@ see : `support added in v3.7.0 <https://scitools-iris.readthedocs.io/en/stable/w


.. _Continuous Integration testing on GitHub: https://github.com/pp-mo/ncdata/blob/main/.github/workflows/ci-tests.yml
.. _NetCDF Classic Data Model: https://docs.unidata.ucar.edu/netcdf-c/current/netcdf_data_model.html#classic_model
docs/details/known_issues.rst
@@ -1,3 +1,5 @@
.. _known-issues:

Outstanding Issues
==================

@@ -21,6 +23,19 @@ To be fixed

* `issue#66 <https://github.com/pp-mo/ncdata/issues/66>`_

* in conversion to/from netCDF4 files

* netCDF4 performs automatic encoding/decoding of byte data to characters, triggered
by the existence of an ``_Encoding`` attribute on a character type variable.
Ncdata does not currently account for this, and may fail to read/write correctly.
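
For reference, a sketch of the netCDF4 behaviour concerned (the file and variable names
are hypothetical):

.. code-block:: python

    import netCDF4 as nc

    ds = nc.Dataset("file_with_encoded_chars.nc")
    var = ds.variables["station_name"]   # a char variable with an ``_Encoding`` attribute

    # With _Encoding present, netCDF4 auto-decodes character arrays to strings on read ...
    decoded = var[:]

    # ... which can be disabled to see the underlying raw character data.
    var.set_auto_chartostring(False)
    raw_chars = var[:]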


.. _todo:

Incomplete Documentation
^^^^^^^^^^^^^^^^^^^^^^^^
(PLACEHOLDER: documentation is incomplete, please fix me !)


Identified Design Limitations
-----------------------------
@@ -36,7 +51,7 @@ There are no current plans to address these, but could be considered in future
* notably, includes compound and variable-length types

* ..and especially **variable-length strings in variables**.
see : :ref:`string_and_character_data`
see : :ref:`string-and-character-data`, :ref:`data-types`


Features planned
67 changes: 49 additions & 18 deletions docs/details/threadlock_sharing.rst
@@ -1,30 +1,23 @@
.. _thread-safety:

NetCDF Thread Locking
=====================
Ncdata includes support for "unifying" the thread-safety mechanisms between
ncdata and the format packages it supports (Iris and Ncdata).
Ncdata provides the :mod:`ncdata.threadlock_sharing` module, which can ensure that all
the relevant data-format packages use a "unified" thread-safety mechanism to
prevent them from disturbing each other.

This concerns the safe use of the common NetCDF library by multiple threads.
Such multi-threaded access usually occurs when your code has Dask arrays
created from netcdf file data, which it is either computing or storing to an
output netcdf file.

The netCDF4 package (and the underlying C library) does not implement any
threadlock, neither is it thread-safe (re-entrant) by design.
Thus contention is possible unless controlled by the calling packages.
*Each* of the data-format packages (Ncdata, Iris and Xarray) defines its own
locking mechanism to prevent overlapping calls into the netcdf library.

All 3 data-format packages can map variable data into Dask lazy arrays. Iris and
Xarray can also create delayed write operations (but ncdata currently does not).

However, those mechanisms cannot protect an operation of that package from
overlapping with one in *another* package.
In short, this is not needed when all your data is loaded with only **one** of the data
packages (Iris, Xarray or ncdata). The problem only occurs when you try to
realise/calculate/save results which combine data loaded from a mixture of sources.

The :mod:`ncdata.threadlock_sharing` module can ensure that all of the relevant
packages use the *same* thread lock,
so that they can safely co-operate in parallel operations.
sample code:

sample code::
.. code-block:: python

from ncdata.threadlock_sharing import enable_lockshare, disable_lockshare
from ncdata.xarray import from_xarray
@@ -40,11 +33,49 @@ sample code::

disable_lockshare()

or::
... *or* ...

.. code-block:: python

    with lockshare_context(iris=True):
        ncdata = NcData(source_filepath)
        ncdata.variables['x'].attributes['units'] = 'K'
        cubes = ncdata.iris.to_iris(ncdata)
        iris.save(cubes, output_filepath)


Background
^^^^^^^^^^
In practice, Iris, Xarray and Ncdata are all capable of scanning netCDF files and interpreting their metadata, while
not reading all the core variable data contained in them.

This generates objects containing `Dask arrays <https://docs.dask.org/en/stable/array.html>`_ with deferred access

Collaborator comment:

You have Intersphinx for Dask, so I recommend using it.

To achieve a link to this specific page, you can use this syntax (not sure about correct way to pluralise):

Suggested change
This generates objects containing `Dask arrays <https://docs.dask.org/en/stable/array.html>`_ with deferred access
This generates objects containing Dask :external+dask:doc:`array` s with deferred access


to bulk file data for later access, with certain key benefits :

* no data loading or calculation happens until needed
* the work is divided into sectional ‘tasks’, of which only some may ultimately be needed
* it may be possible to perform multiple sections of calculation (including data fetch) in parallel
* it may be possible to localise operations (fetch or calculate) near to data distributed across a cluster

Usually, the most efficient parallelisation of array operations is by multi-threading,
since threads can share large data arrays in memory.

However, the Python netCDF4 library (and the underlying C library) is not thread-safe
(re-entrant) by design, nor does it implement any thread locking itself, so the
"netcdf fetch" call in each input operation must be guarded by a mutex.
Thus, contention is possible unless controlled by the calling packages.

Each of Xarray, Iris and ncdata creates input data tasks to fetch sections of data from
the input files. Each uses a mutex lock around netcdf accesses in those tasks, to stop
them from accessing the netCDF4 interface at the same time as any of the others.

This works beautifully until ncdata connects (for example) lazy data loaded *with Iris*
with lazy data loaded *from Xarray*. These would then unfortunately each be using their
own *separate* mutexes to protect the same netcdf library. So, if we then attempt to
calculate or save the result, which combines data from both sources, we could get
sporadic and unpredictable system-level errors, even a core-dump type failure.

So, the function of :mod:`ncdata.threadlock_sharing` is to connect the thread-locking
schemes of the separate libraries, so that they cannot accidentally overlap an access
call in a different thread *from the other package*, just as they already cannot
overlap *one of their own*.
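
For example, a minimal sketch of the enable/disable form covering both Iris and Xarray
(the ``xarray=True`` keyword is assumed here, by analogy with ``iris=True`` above):

.. code-block:: python

    from ncdata.threadlock_sharing import enable_lockshare, disable_lockshare

    enable_lockshare(iris=True, xarray=True)
    try:
        # ... combine, compute or save lazy data loaded via ncdata, Iris and Xarray ...
        pass
    finally:
        disable_lockshare()
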
5 changes: 3 additions & 2 deletions docs/index.rst
@@ -38,8 +38,9 @@ User Documentation
User Guide <./userdocs/user_guide/user_guide>


Reference
---------
Reference Documentation
-----------------------

.. toctree::
:maxdepth: 2

21 changes: 18 additions & 3 deletions docs/userdocs/getting_started/installation.rst
@@ -4,13 +4,28 @@ Ncdata is available on PyPI and conda-forge

Install from conda-forge with conda
-----------------------------------
Like this::
conda install -c conda-forge ncdata
Like this:

.. code-block:: bash

$ conda install -c conda-forge ncdata


Install from PyPI with pip
--------------------------
Like this::
Like this:

.. code-block:: bash

pip install ncdata


Check install
^^^^^^^^^^^^^

.. code-block:: bash

$ python -c "from ncdata import NcData; print(NcData())"
<NcData: <'no-name'>
>
