Merge remote-tracking branch 'upstream/develop' into fix/nodeweights
OpheliaMiralles committed Jan 6, 2025
2 parents 1b7c729 + c9fa231 commit baa6e1f
Showing 26 changed files with 663 additions and 123 deletions.
34 changes: 25 additions & 9 deletions CHANGELOG.md
@@ -8,45 +8,58 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
Please add your functional changes to the appropriate section in the PR.
Keep it human-readable, your future self will thank you!

## [Unreleased](https://github.com/ecmwf/anemoi-training/compare/0.3.1...HEAD)
## [Unreleased](https://github.com/ecmwf/anemoi-training/compare/0.3.2...HEAD)

## [0.3.2 - Multiple Fixes, Checkpoint updates, Stretched-grid/LAM updates](https://github.com/ecmwf/anemoi-training/compare/0.3.1...0.3.2) - 2024-12-19

### Fixed

- Do not update the NaN-weight mask for the loss function when using a remapper and no imputer [#178](https://github.com/ecmwf/anemoi-training/pull/178)
- Don't crash when using the profiler if certain environment variables aren't set [#180](https://github.com/ecmwf/anemoi-training/pull/180)
- Remove saving of metadata to training checkpoint [#190](https://github.com/ecmwf/anemoi-training/pull/190)
- Fixes to callback plots [#182](https://github.com/ecmwf/anemoi-training/pull/182) (power spectrum error with large NumPy arrays, and precipitation colormap for cases where precipitation is prognostic).
- GraphTrainableParameters callback will log a warning when no trainable parameters are specified [#173](https://github.com/ecmwf/anemoi-training/pull/173)
- Fixes to checkpoint saving: ensure the last checkpoint is saved when using max_steps [#191](https://github.com/ecmwf/anemoi-training/pull/191)
- Identify stretched grid models based on graph rather than configuration file [#204](https://github.com/ecmwf/anemoi-training/pull/204)

### Added

- Introduced a new config variable `transfer_learning` (bool): set it to `True` when loading a checkpoint in a transfer-learning setting.
- <b>Transfer learning</b>: enabled new functionality. You can now load checkpoints from different models and different training runs.
- Effective batch size: `(config.dataloader.batch_size["training"] * config.hardware.num_gpus_per_node * config.hardware.num_nodes) // config.hardware.num_gpus_per_model`.
  Used for experiment reproducibility across different computing configurations (see the worked example after this list).
- Added a check for the variable sorting on pre-trained/finetuned models [#120](https://github.com/ecmwf/anemoi-training/pull/120)
- Added default configuration files for stretched grid and limited area model experiments [#173](https://github.com/ecmwf/anemoi-training/pull/173)
- Added new metrics for stretched grid models to track losses inside/outside the regional domain [#199](https://github.com/ecmwf/anemoi-training/pull/199)
- Add supporting arrays (NumPy) to the checkpoint
- Support for masking out unconnected nodes in LAM [#171](https://github.com/ecmwf/anemoi-training/pull/171)
- Improved validation metrics, allow 'all' to be scaled [#202](https://github.com/ecmwf/anemoi-training/pull/202)
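
A minimal sketch of the effective-batch-size arithmetic referenced above; the values are illustrative stand-ins for the Hydra config entries, not defaults:

```python
# Illustrative values standing in for the config entries named above.
batch_size_training = 2   # config.dataloader.batch_size["training"]
num_gpus_per_node = 4     # config.hardware.num_gpus_per_node
num_nodes = 8             # config.hardware.num_nodes
num_gpus_per_model = 2    # config.hardware.num_gpus_per_model

# Each model instance is sharded across num_gpus_per_model GPUs, so the
# number of data-parallel replicas is (total GPUs) // num_gpus_per_model,
# and each replica contributes one local batch per step.
effective_batch_size = (
    batch_size_training * num_gpus_per_node * num_nodes
) // num_gpus_per_model

print(effective_batch_size)  # 32
```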

### Changed

### Removed
- Removed the resolution config entry [#120](https://github.com/ecmwf/anemoi-training/pull/120)

## [0.3.1 - AIFS v0.3 Compatibility](https://github.com/ecmwf/anemoi-training/compare/0.3.0...0.3.1) - 2024-11-28

### Changed

- Perform full shuffle of training dataset [#153](https://github.com/ecmwf/anemoi-training/pull/153)

### Fixed
- Update `n_pixel` used by datashader to better adapt across resolutions #152

- Update `n_pixel` used by datashader to better adapt across resolutions [#152](https://github.com/ecmwf/anemoi-training/pull/152)
- Fixed bug in power spectra plotting for the n320 resolution.
- Allow histogram and spectrum plot for one variable [#165](https://github.com/ecmwf/anemoi-training/pull/165)

### Added

- Introduce variable to configure the (Cosine Annealing) optimizer warm-up [#155](https://github.com/ecmwf/anemoi-training/pull/155)
- Add reader groups to reduce CPU memory usage and increase dataloader throughput [#76](https://github.com/ecmwf/anemoi-training/pull/76)
- Bump `anemoi-graphs` version to 0.4.1 [#159](https://github.com/ecmwf/anemoi-training/pull/159)


## [0.3.0 - Loss & Callback Refactors](https://github.com/ecmwf/anemoi-training/compare/0.2.2...0.3.0) - 2024-11-14

### Fixed
@@ -55,8 +68,10 @@ Keep it human-readable, your future self will thank you!
- Refactored callbacks. [#60](https://github.com/ecmwf/anemoi-training/pull/60)
- Updated docs [#115](https://github.com/ecmwf/anemoi-training/pull/115)
- Fix enabling LearningRateMonitor [#119](https://github.com/ecmwf/anemoi-training/pull/119)

- Refactored rollout [#87](https://github.com/ecmwf/anemoi-training/pull/87)
  - Enable longer validation rollout than training

- Expand iterables in logging [#91](https://github.com/ecmwf/anemoi-training/pull/91)
  - Save entire config in mlflow

@@ -66,8 +81,10 @@ Keep it human-readable, your future self will thank you!
- Included more loss functions and allowed configuration [#70](https://github.com/ecmwf/anemoi-training/pull/70)
- Include option to use datashader and optimised asynchronous callbacks [#102](https://github.com/ecmwf/anemoi-training/pull/102)
- Fix that applies the metric_ranges in the post-processed variable space [#116](https://github.com/ecmwf/anemoi-training/pull/116)

- Allow updates to scalars [#137](https://github.com/ecmwf/anemoi-training/pull/137)
  - Add without subsetting in `ScaleTensor`

- Sub-hour datasets [#63](https://github.com/ecmwf/anemoi-training/pull/63)
- Add synchronisation workflow [#92](https://github.com/ecmwf/anemoi-training/pull/92)
- Feat: Anemoi Profiler compatible with mlflow and using the PyTorch (Kineto) Profiler for the memory report [#38](https://github.com/ecmwf/anemoi-training/pull/38/)
@@ -77,7 +94,6 @@ Keep it human-readable, your future self will thank you!
- Functionality to change the weight attribute of nodes in the graph at the start of training without re-generating the graph. [#136](https://github.com/ecmwf/anemoi-training/pull/136)
- Custom System monitor for Nvidia and AMD GPUs [#147](https://github.com/ecmwf/anemoi-training/pull/147)


### Changed

- Renamed frequency keys in callbacks configuration. [#118](https://github.com/ecmwf/anemoi-training/pull/118)
10 changes: 10 additions & 0 deletions README.md
@@ -1,5 +1,15 @@
> [!IMPORTANT]
> **Repository Migration Notice**
>
> This repository has been migrated to our new consolidated Anemoi Core mono-repository. All future development, including new features and bug fixes, will take place in the new repository. Please update your references to use the new location:
>
> 🔗 [@ecmwf/anemoi-core](https://github.com/ecmwf/anemoi-core)
# anemoi-training

[![Documentation Status](https://readthedocs.org/projects/anemoi-training/badge/?version=latest)](https://anemoi-training.readthedocs.io/en/latest/?badge=latest)


**DISCLAIMER**
This project is **BETA** and will be **Experimental** for the foreseeable future.
Interfaces and functionality are likely to change, and the project itself may be scrapped.
2 changes: 1 addition & 1 deletion docs/conf.py
@@ -42,7 +42,7 @@

author = "Anemoi contributors"

year = datetime.datetime.now(tz="UTC").year
year = datetime.datetime.now(tz=datetime.timezone.utc).year
years = "2024" if year == 2024 else f"2024-{year}"

copyright = f"{years}, Anemoi contributors" # noqa: A001
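
The `tz` change above fixes a real bug: `datetime.datetime.now()` accepts a `tzinfo` instance (or `None`), not a string. A minimal reproduction:

```python
import datetime

# Old code: a string is not a tzinfo subclass, so this raises TypeError.
try:
    datetime.datetime.now(tz="UTC")
except TypeError as err:
    print(err)  # tzinfo argument must be None or of a tzinfo subclass

# New code: datetime.timezone.utc is a proper tzinfo instance.
year = datetime.datetime.now(tz=datetime.timezone.utc).year
print(year)
```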
27 changes: 26 additions & 1 deletion docs/modules/losses.rst
@@ -73,11 +73,36 @@ Currently, the following scalars are available for use:
********************

Validation metrics as defined in the config file at
``config.training.validation_metrics`` follow the same initialise
``config.training.validation_metrics`` follow the same initialisation
behaviour as the loss function, but can be a list. In this case, all
losses are calculated and logged as a dictionary with the corresponding
name.

Scaling Validation Losses
=========================

By default, validation metrics can **not** be scaled by scalars across
the variable dimension, but they can be scaled by all other scalars. If
you want to scale a validation metric by the variable weights, it must
be added to ``config.training.scale_validation_metrics``.

These metrics are then kept in the normalised, preprocessed space, and
thus the indexing of scalars aligns with the indexing of the tensors.

By default, only ``all`` is kept in the normalised space and scaled.

.. code:: yaml

   # List of validation metrics to keep in normalised space, and scalars to be applied
   # Use '*' to reference all metrics, or a list of metric names.
   # Unlike above, variable scaling is possible due to these metrics being
   # calculated in the same way as the training loss, within the internal model space.
   scale_validation_metrics:
     scalars_to_apply: ['variable']
     metrics:
       - 'all'
       # - "*"

***********************
Custom Loss Functions
***********************
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -42,7 +42,7 @@ dynamic = [ "version" ]
dependencies = [
"anemoi-datasets>=0.5.2",
"anemoi-graphs>=0.4.1",
"anemoi-models>=0.3",
"anemoi-models>=0.4.1",
"anemoi-utils[provenance]>=0.4.4",
"datashader>=0.16.3",
"einops>=0.6.1",
11 changes: 5 additions & 6 deletions src/anemoi/training/config/graph/limited_area.yaml
@@ -10,7 +10,7 @@ nodes:
    node_builder:
      _target_: anemoi.graphs.nodes.ZarrDatasetNodes
      dataset: ${dataloader.training.dataset}
    attributes: ${graph.attributes.data_nodes}
    attributes: ${graph.attributes.nodes}
  # Hidden nodes
  hidden:
    node_builder:
@@ -26,8 +26,8 @@
    edge_builders:
      - _target_: anemoi.graphs.edges.CutOffEdges # options: KNNEdges, CutOffEdges
        cutoff_factor: 0.6 # only for cutoff method
      - _target_: anemoi.graphs.edges.CutOffEdges # options: KNNEdges, CutOffEdges
        cutoff_factor: 2 # only for cutoff method
      - _target_: anemoi.graphs.edges.CutOffEdges # connects only boundary nodes
        cutoff_factor: 1.5 # only for cutoff method
        source_mask_attr_name: boundary_mask
    attributes: ${graph.attributes.edges}
  # Processor configuration
@@ -46,16 +46,15 @@
        num_nearest_neighbours: 3 # only for knn method
    attributes: ${graph.attributes.edges}

post_processors:
  - _target_: anemoi.graphs.processors.RemoveUnconnectedNodes
    nodes_name: data
    ignore: cutout_mask # optional
    save_mask_indices_to_attr: indices_connected_nodes # optional

attributes:
  data_nodes:
  nodes:
    # Attributes for data nodes
    area_weight:
      _target_: anemoi.graphs.nodes.attributes.AreaWeights # options: Area, Uniform
      norm: unit-max # options: l1, l2, unit-max, unit-sum, unit-std
18 changes: 8 additions & 10 deletions src/anemoi/training/config/graph/stretched_grid.yaml
Original file line number Diff line number Diff line change
@@ -11,12 +11,7 @@ nodes:
    node_builder:
      _target_: anemoi.graphs.nodes.ZarrDatasetNodes
      dataset: ${dataloader.training.dataset}
    attributes:
      area_weight:
        _target_: anemoi.graphs.nodes.attributes.AreaWeights
        norm: unit-max
      cutout:
        _target_: anemoi.graphs.nodes.attributes.CutOutMask
    attributes: ${graph.attributes.nodes}
  hidden:
    node_builder:
      _target_: anemoi.graphs.nodes.StretchedTriNodes
@@ -25,10 +20,6 @@
      reference_node_name: ${graph.data}
      mask_attr_name: cutout
      margin_radius_km: 11
    attributes:
      area_weights:
        _target_: anemoi.graphs.nodes.attributes.AreaWeights
        norm: unit-max

edges:
  # Encoder
@@ -54,6 +45,13 @@
    attributes: ${graph.attributes.edges}

attributes:
  nodes:
    # Attributes for data nodes
    area_weight:
      _target_: anemoi.graphs.nodes.attributes.AreaWeights
      norm: unit-max
    cutout:
      _target_: anemoi.graphs.nodes.attributes.CutOutMask
  edges:
    edge_length:
      _target_: anemoi.graphs.edges.attributes.EdgeLength
36 changes: 36 additions & 0 deletions src/anemoi/training/config/lam.yaml
@@ -0,0 +1,36 @@
defaults:
  - data: zarr
  - dataloader: native_grid
  - diagnostics: evaluation
  - hardware: example
  - graph: limited_area
  - model: graphtransformer
  - training: default
  - _self_


### This file is for local experimentation.
## When you commit your changes, assign the new features and keywords
## to the correct defaults.
# For example to change from default GPU count:
# hardware:
#   num_gpus_per_node: 1

dataloader:
  dataset:
    cutout:
      - dataset: ${hardware.paths.data}/${hardware.files.dataset}
        thinning: ???
      - dataset: ${hardware.paths.data}/${hardware.files.forcing_dataset}
    adjust: all
    min_distance_km: 0
  grid_indices:
    _target_: anemoi.training.data.grid_indices.MaskedGrid
    nodes_name: data
    node_attribute_name: indices_connected_nodes
model:
  output_mask: cutout_mask # it must be a node attribute of the output nodes
hardware:
  files:
    dataset: ???
    forcing_dataset: ???
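
For readers unfamiliar with the `???` entries above: they are OmegaConf/Hydra mandatory-value markers, so the config refuses to resolve until the user supplies them. A minimal sketch of the behaviour, using OmegaConf directly with keys copied from the file above:

```python
from omegaconf import OmegaConf
from omegaconf.errors import MissingMandatoryValue

yaml_snippet = """\
hardware:
  files:
    dataset: ???          # mandatory: must be provided by the user
    forcing_dataset: ???
"""
cfg = OmegaConf.create(yaml_snippet)

# Accessing a mandatory '???' value before it is set raises an error.
try:
    print(cfg.hardware.files.dataset)
except MissingMandatoryValue:
    print("hardware.files.dataset must be set before use")
```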
37 changes: 37 additions & 0 deletions src/anemoi/training/config/stretched.yaml
@@ -0,0 +1,37 @@
defaults:
  - data: zarr
  - dataloader: native_grid
  - diagnostics: evaluation
  - hardware: example
  - graph: stretched_grid
  - model: graphtransformer
  - training: default
  - _self_


### This file is for local experimentation.
## When you commit your changes, assign the new features and keywords
## to the correct defaults.
# For example to change from default GPU count:
# hardware:
#   num_gpus_per_node: 1

dataloader:
  dataset:
    cutout:
      - dataset: ${hardware.paths.data}/${hardware.files.dataset}
        thinning: ???
      - dataset: ${hardware.paths.data}/${hardware.files.forcing_dataset}
    adjust: all
    min_distance_km: 0
training:
  loss_scaling:
    spatial:
      _target_: anemoi.training.data.scaling.ReweightedGraphAttribute
      target_nodes: ${graph.data}
      scaled_attribute: area_weight # it must be a node attribute of the output nodes
      cutout_weight_frac_of_global: ???
hardware:
  files:
    dataset: ???
    forcing_dataset: ???
20 changes: 18 additions & 2 deletions src/anemoi/training/config/training/default.yaml
@@ -58,16 +58,32 @@ loss_gradient_scaling: False

# Validation metrics calculation,
# This may be a list, in which case all metrics will be calculated
# and logged according to their name
# and logged according to their name.
# These metrics are calculated in the output model space, and thus
# have undergone postprocessing.
validation_metrics:
  # loss class to initialise
  - _target_: anemoi.training.losses.mse.WeightedMSELoss
    # Scalars to include in loss calculation
    # Available scalars include, 'variable'
    # Cannot scale over the variable dimension due to possible remappings.
    # Available scalars include:
    # - 'loss_weights_mask': Giving imputed NaNs a zero weight in the loss function
    # Use the `scale_validation_metrics` section to scale by variable.
    scalars: []
    # other kwargs
    ignore_nans: True

# List of validation metrics to keep in normalised space, and scalars to be applied
# Use '*' to reference all metrics, or a list of metric names.
# Unlike above, variable scaling is possible due to these metrics being
# calculated in the same way as the training loss, within the internal model space.
scale_validation_metrics:
  scalars_to_apply: ['variable']
  metrics:
    - 'all'
    # - "*"


# length of the "rollout" window (see Keisler's paper)
rollout:
  start: 1
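
For context, the `_target_` keys in this config are resolved by Hydra's `instantiate` utility, which imports the dotted path and calls it with the remaining keys as keyword arguments. A minimal sketch with an illustrative target (not the anemoi loss class the config actually uses):

```python
from hydra.utils import instantiate
from omegaconf import OmegaConf

cfg = OmegaConf.create({
    # Illustrative target; the config above targets
    # anemoi.training.losses.mse.WeightedMSELoss instead.
    "_target_": "torch.nn.MSELoss",
    "reduction": "mean",
})

loss = instantiate(cfg)  # equivalent to torch.nn.MSELoss(reduction="mean")
print(type(loss).__name__)  # MSELoss
```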
4 changes: 4 additions & 0 deletions src/anemoi/training/data/datamodule.py
@@ -77,6 +77,10 @@ def statistics(self) -> dict:
    def metadata(self) -> dict:
        return self.ds_train.metadata

    @cached_property
    def supporting_arrays(self) -> dict:
        return self.ds_train.supporting_arrays | self.grid_indices.supporting_arrays

    @cached_property
    def data_indices(self) -> IndexCollection:
        return IndexCollection(self.config, self.ds_train.name_to_index)
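
A note on the `|` in `supporting_arrays` above: it is the Python 3.9+ dict-union operator, where keys from the right-hand operand win on collisions. A small illustration with made-up keys:

```python
# Hypothetical supporting arrays from the dataset and the grid indices.
dataset_arrays = {"latitudes": [10.0, 20.0], "cutout_mask": [True, False]}
grid_index_arrays = {"grid_indices": [0, 1], "cutout_mask": [False, True]}

merged = dataset_arrays | grid_index_arrays
print(sorted(merged))         # ['cutout_mask', 'grid_indices', 'latitudes']
print(merged["cutout_mask"])  # [False, True] -- right-hand side wins
```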
5 changes: 5 additions & 0 deletions src/anemoi/training/data/dataset.py
@@ -109,6 +109,11 @@ def metadata(self) -> dict:
"""Return dataset metadata."""
return self.data.metadata()

@cached_property
def supporting_arrays(self) -> dict:
"""Return dataset supporting_arrays."""
return self.data.supporting_arrays()

@cached_property
def name_to_index(self) -> dict:
"""Return dataset statistics."""