Merge remote-tracking branch 'upstream/develop' into fix/nodeweights
OpheliaMiralles committed Jan 6, 2025
2 parents 1b7c729 + c9fa231 commit baa6e1f
Showing 26 changed files with 663 additions and 123 deletions.
34 changes: 25 additions & 9 deletions CHANGELOG.md
@@ -8,45 +8,58 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
Please add your functional changes to the appropriate section in the PR.
Keep it human-readable, your future self will thank you!

## [Unreleased](https://github.com/ecmwf/anemoi-training/compare/0.3.1...HEAD)
## [Unreleased](https://github.com/ecmwf/anemoi-training/compare/0.3.2...HEAD)

## [0.3.2 - Multiple Fixes, Checkpoint updates, Stretched-grid/LAM updates](https://github.com/ecmwf/anemoi-training/compare/0.3.1...0.3.2) - 2024-12-19

### Fixed

- Do not update the NaN-weight mask for the loss function when using a remapper and no imputer [#178](https://github.com/ecmwf/anemoi-training/pull/178)
- Don't crash when using the profiler if certain environment variables aren't set [#180](https://github.com/ecmwf/anemoi-training/pull/180)
- Remove saving of metadata to training checkpoint [#190](https://github.com/ecmwf/anemoi-training/pull/190)
- Fixes to callback plots [#182](https://github.com/ecmwf/anemoi-training/pull/182) (power spectrum error with large NumPy arrays, and precipitation colormap for cases where precipitation is prognostic).
- GraphTrainableParameters callback will log a warning when no trainable parameters are specified [#173](https://github.com/ecmwf/anemoi-training/pull/173)
- Fixes to checkpoint saving: ensure the last checkpoint is saved when using max_steps [#191](https://github.com/ecmwf/anemoi-training/pull/191)
- Identify stretched grid models based on graph rather than configuration file [#204](https://github.com/ecmwf/anemoi-training/pull/204)

### Added

- Introduced a new config variable `transfer_learning` (bool): set it to `True` when loading a checkpoint in a transfer-learning setting.
- <b>Transfer learning</b>: enabled new functionality. You can now load checkpoints from different models and different training runs.
- Effective batch size: `(config.dataloader.batch_size["training"] * config.hardware.num_gpus_per_node * config.hardware.num_nodes) // config.hardware.num_gpus_per_model`.
  Used for experiment reproducibility across different computing configurations (see the worked example after this list).
- Added a check for the variable sorting on pre-trained/finetuned models [#120](https://github.com/ecmwf/anemoi-training/pull/120)
- Added default configuration files for stretched grid and limited area model experiments [#173](https://github.com/ecmwf/anemoi-training/pull/173)
- Added new metrics for stretched grid models to track losses inside/outside the regional domain [#199](https://github.com/ecmwf/anemoi-training/pull/199)
- Add supporting arrays (NumPy) to the checkpoint
- Support for masking out unconnected nodes in LAM [#171](https://github.com/ecmwf/anemoi-training/pull/171)
- Improved validation metrics, allow 'all' to be scaled [#202](https://github.com/ecmwf/anemoi-training/pull/202)
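
A minimal sketch of the effective-batch-size arithmetic referenced above; the values are illustrative stand-ins for the Hydra config entries, not defaults:

```python
# Illustrative values standing in for the config entries named above.
batch_size_training = 2   # config.dataloader.batch_size["training"]
num_gpus_per_node = 4     # config.hardware.num_gpus_per_node
num_nodes = 8             # config.hardware.num_nodes
num_gpus_per_model = 2    # config.hardware.num_gpus_per_model

# Each model instance is sharded across num_gpus_per_model GPUs, so the
# number of data-parallel replicas is (total GPUs) // num_gpus_per_model,
# and each replica contributes one local batch per step.
effective_batch_size = (
    batch_size_training * num_gpus_per_node * num_nodes
) // num_gpus_per_model

print(effective_batch_size)  # 32
```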

### Changed

### Removed
- Removed the resolution config entry [#120](https://github.com/ecmwf/anemoi-training/pull/120)

## [0.3.1 - AIFS v0.3 Compatibility](https://github.com/ecmwf/anemoi-training/compare/0.3.0...0.3.1) - 2024-11-28

### Changed

- Perform full shuffle of training dataset [#153](https://github.com/ecmwf/anemoi-training/pull/153)

### Fixed
- Update `n_pixel` used by datashader to better adapt across resolutions #152

- Update `n_pixel` used by datashader to better adapt across resolutions [#152](https://github.com/ecmwf/anemoi-training/pull/152)
- Fixed bug in power spectra plotting for the n320 resolution.
- Allow histogram and spectrum plot for one variable [#165](https://github.com/ecmwf/anemoi-training/pull/165)

### Added

- Introduce variable to configure the (Cosine Annealing) optimizer warm-up [#155](https://github.com/ecmwf/anemoi-training/pull/155)
- Add reader groups to reduce CPU memory usage and increase dataloader throughput [#76](https://github.com/ecmwf/anemoi-training/pull/76)
- Bump `anemoi-graphs` version to 0.4.1 [#159](https://github.com/ecmwf/anemoi-training/pull/159)


## [0.3.0 - Loss & Callback Refactors](https://github.com/ecmwf/anemoi-training/compare/0.2.2...0.3.0) - 2024-11-14

### Fixed
@@ -55,8 +68,10 @@ Keep it human-readable, your future self will thank you!
- Refactored callbacks. [#60](https://github.com/ecmwf/anemoi-training/pull/60)
- Updated docs [#115](https://github.com/ecmwf/anemoi-training/pull/115)
- Fix enabling LearningRateMonitor [#119](https://github.com/ecmwf/anemoi-training/pull/119)

- Refactored rollout [#87](https://github.com/ecmwf/anemoi-training/pull/87)
  - Enable longer validation rollout than training

- Expand iterables in logging [#91](https://github.com/ecmwf/anemoi-training/pull/91)
  - Save entire config in mlflow

@@ -66,8 +81,10 @@ Keep it human-readable, your future self will thank you!
- Included more loss functions and allowed configuration [#70](https://github.com/ecmwf/anemoi-training/pull/70)
- Include option to use datashader and optimised asynchronous callbacks [#102](https://github.com/ecmwf/anemoi-training/pull/102)
- Fix that applies the metric_ranges in the post-processed variable space [#116](https://github.com/ecmwf/anemoi-training/pull/116)

- Allow updates to scalars [#137](https://github.com/ecmwf/anemoi-training/pull/137)
  - Add without subsetting in `ScaleTensor`

- Sub-hour datasets [#63](https://github.com/ecmwf/anemoi-training/pull/63)
- Add synchronisation workflow [#92](https://github.com/ecmwf/anemoi-training/pull/92)
- Feat: Anemoi Profiler compatible with mlflow and using the PyTorch (Kineto) Profiler for the memory report [#38](https://github.com/ecmwf/anemoi-training/pull/38/)
@@ -77,7 +94,6 @@ Keep it human-readable, your future self will thank you!
- Functionality to change the weight attribute of nodes in the graph at the start of training without re-generating the graph. [#136](https://github.com/ecmwf/anemoi-training/pull/136)
- Custom System monitor for Nvidia and AMD GPUs [#147](https://github.com/ecmwf/anemoi-training/pull/147)


### Changed

- Renamed frequency keys in callbacks configuration. [#118](https://github.com/ecmwf/anemoi-training/pull/118)
10 changes: 10 additions & 0 deletions README.md
@@ -1,5 +1,15 @@
> [!IMPORTANT]
> **Repository Migration Notice**
>
> This repository has been migrated to our new consolidated Anemoi Core mono-repository. All future development, including new features and bug fixes, will take place in the new repository. Please update your references to use the new location:
>
> 🔗 [@ecmwf/anemoi-core](https://github.com/ecmwf/anemoi-core)
# anemoi-training

[![Documentation Status](https://readthedocs.org/projects/anemoi-training/badge/?version=latest)](https://anemoi-training.readthedocs.io/en/latest/?badge=latest)


**DISCLAIMER**
This project is **BETA** and will be **Experimental** for the foreseeable future.
Interfaces and functionality are likely to change, and the project itself may be scrapped.
2 changes: 1 addition & 1 deletion docs/conf.py
@@ -42,7 +42,7 @@

author = "Anemoi contributors"

year = datetime.datetime.now(tz="UTC").year
year = datetime.datetime.now(tz=datetime.timezone.utc).year
years = "2024" if year == 2024 else f"2024-{year}"

copyright = f"{years}, Anemoi contributors" # noqa: A001
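
The `tz` change above fixes a real bug: `datetime.datetime.now()` accepts a `tzinfo` instance (or `None`), not a string. A minimal reproduction:

```python
import datetime

# Old code: a string is not a tzinfo subclass, so this raises TypeError.
try:
    datetime.datetime.now(tz="UTC")
except TypeError as err:
    print(err)  # tzinfo argument must be None or of a tzinfo subclass

# New code: datetime.timezone.utc is a proper tzinfo instance.
year = datetime.datetime.now(tz=datetime.timezone.utc).year
print(year)
```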
27 changes: 26 additions & 1 deletion docs/modules/losses.rst
@@ -73,11 +73,36 @@ Currently, the following scalars are available for use:
********************

Validation metrics as defined in the config file at
``config.training.validation_metrics`` follow the same initialise
``config.training.validation_metrics`` follow the same initialisation
behaviour as the loss function, but can be a list. In this case, all
losses are calculated and logged as a dictionary with the corresponding
name.

Scaling Validation Losses
=========================

By default, validation metrics can **not** be scaled by scalars across
the variable dimension, but they can be scaled by all other scalars. If
you want to scale a validation metric by the variable weights, it must
be added to ``config.training.scale_validation_metrics``.

These metrics are then kept in the normalised, preprocessed space, and
thus the indexing of scalars aligns with the indexing of the tensors.

By default, only ``all`` is kept in the normalised space and scaled.

.. code:: yaml

   # List of validation metrics to keep in normalised space, and scalars to be applied
   # Use '*' to reference all metrics, or a list of metric names.
   # Unlike above, variable scaling is possible due to these metrics being
   # calculated in the same way as the training loss, within the internal model space.
   scale_validation_metrics:
     scalars_to_apply: ['variable']
     metrics:
       - 'all'
       # - "*"

***********************
Custom Loss Functions
***********************
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -42,7 +42,7 @@ dynamic = [ "version" ]
dependencies = [
"anemoi-datasets>=0.5.2",
"anemoi-graphs>=0.4.1",
"anemoi-models>=0.3",
"anemoi-models>=0.4.1",
"anemoi-utils[provenance]>=0.4.4",
"datashader>=0.16.3",
"einops>=0.6.1",
11 changes: 5 additions & 6 deletions src/anemoi/training/config/graph/limited_area.yaml
@@ -10,7 +10,7 @@ nodes:
    node_builder:
      _target_: anemoi.graphs.nodes.ZarrDatasetNodes
      dataset: ${dataloader.training.dataset}
    attributes: ${graph.attributes.data_nodes}
    attributes: ${graph.attributes.nodes}
  # Hidden nodes
  hidden:
    node_builder:
@@ -26,8 +26,8 @@
    edge_builders:
      - _target_: anemoi.graphs.edges.CutOffEdges # options: KNNEdges, CutOffEdges
        cutoff_factor: 0.6 # only for cutoff method
      - _target_: anemoi.graphs.edges.CutOffEdges # options: KNNEdges, CutOffEdges
        cutoff_factor: 2 # only for cutoff method
      - _target_: anemoi.graphs.edges.CutOffEdges # connects only boundary nodes
        cutoff_factor: 1.5 # only for cutoff method
        source_mask_attr_name: boundary_mask
    attributes: ${graph.attributes.edges}
  # Processor configuration
@@ -46,16 +46,15 @@
        num_nearest_neighbours: 3 # only for knn method
    attributes: ${graph.attributes.edges}

post_processors:
  - _target_: anemoi.graphs.processors.RemoveUnconnectedNodes
    nodes_name: data
    ignore: cutout_mask # optional
    save_mask_indices_to_attr: indices_connected_nodes # optional

attributes:
  data_nodes:
  nodes:
    # Attributes for data nodes
    area_weight:
      _target_: anemoi.graphs.nodes.attributes.AreaWeights # options: Area, Uniform
      norm: unit-max # options: l1, l2, unit-max, unit-sum, unit-std
18 changes: 8 additions & 10 deletions src/anemoi/training/config/graph/stretched_grid.yaml
Original file line number Diff line number Diff line change
@@ -11,12 +11,7 @@ nodes:
    node_builder:
      _target_: anemoi.graphs.nodes.ZarrDatasetNodes
      dataset: ${dataloader.training.dataset}
    attributes:
      area_weight:
        _target_: anemoi.graphs.nodes.attributes.AreaWeights
        norm: unit-max
      cutout:
        _target_: anemoi.graphs.nodes.attributes.CutOutMask
    attributes: ${graph.attributes.nodes}
  hidden:
    node_builder:
      _target_: anemoi.graphs.nodes.StretchedTriNodes
@@ -25,10 +20,6 @@
      reference_node_name: ${graph.data}
      mask_attr_name: cutout
      margin_radius_km: 11
    attributes:
      area_weights:
        _target_: anemoi.graphs.nodes.attributes.AreaWeights
        norm: unit-max

edges:
  # Encoder
@@ -54,6 +45,13 @@
    attributes: ${graph.attributes.edges}

attributes:
  nodes:
    # Attributes for data nodes
    area_weight:
      _target_: anemoi.graphs.nodes.attributes.AreaWeights
      norm: unit-max
    cutout:
      _target_: anemoi.graphs.nodes.attributes.CutOutMask
  edges:
    edge_length:
      _target_: anemoi.graphs.edges.attributes.EdgeLength
36 changes: 36 additions & 0 deletions src/anemoi/training/config/lam.yaml
@@ -0,0 +1,36 @@
defaults:
  - data: zarr
  - dataloader: native_grid
  - diagnostics: evaluation
  - hardware: example
  - graph: limited_area
  - model: graphtransformer
  - training: default
  - _self_


### This file is for local experimentation.
## When you commit your changes, assign the new features and keywords
## to the correct defaults.
# For example to change from default GPU count:
# hardware:
#   num_gpus_per_node: 1

dataloader:
  dataset:
    cutout:
      - dataset: ${hardware.paths.data}/${hardware.files.dataset}
        thinning: ???
      - dataset: ${hardware.paths.data}/${hardware.files.forcing_dataset}
    adjust: all
    min_distance_km: 0
  grid_indices:
    _target_: anemoi.training.data.grid_indices.MaskedGrid
    nodes_name: data
    node_attribute_name: indices_connected_nodes
model:
  output_mask: cutout_mask # it must be a node attribute of the output nodes
hardware:
  files:
    dataset: ???
    forcing_dataset: ???
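
For readers unfamiliar with the `???` entries above: they are OmegaConf/Hydra mandatory-value markers, so the config refuses to resolve until the user supplies them. A minimal sketch of the behaviour, using OmegaConf directly with keys copied from the file above:

```python
from omegaconf import OmegaConf
from omegaconf.errors import MissingMandatoryValue

yaml_snippet = """\
hardware:
  files:
    dataset: ???          # mandatory: must be provided by the user
    forcing_dataset: ???
"""
cfg = OmegaConf.create(yaml_snippet)

# Accessing a mandatory '???' value before it is set raises an error.
try:
    print(cfg.hardware.files.dataset)
except MissingMandatoryValue:
    print("hardware.files.dataset must be set before use")
```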
37 changes: 37 additions & 0 deletions src/anemoi/training/config/stretched.yaml
@@ -0,0 +1,37 @@
defaults:
  - data: zarr
  - dataloader: native_grid
  - diagnostics: evaluation
  - hardware: example
  - graph: stretched_grid
  - model: graphtransformer
  - training: default
  - _self_


### This file is for local experimentation.
## When you commit your changes, assign the new features and keywords
## to the correct defaults.
# For example to change from default GPU count:
# hardware:
#   num_gpus_per_node: 1

dataloader:
  dataset:
    cutout:
      - dataset: ${hardware.paths.data}/${hardware.files.dataset}
        thinning: ???
      - dataset: ${hardware.paths.data}/${hardware.files.forcing_dataset}
    adjust: all
    min_distance_km: 0
training:
  loss_scaling:
    spatial:
      _target_: anemoi.training.data.scaling.ReweightedGraphAttribute
      target_nodes: ${graph.data}
      scaled_attribute: area_weight # it must be a node attribute of the output nodes
      cutout_weight_frac_of_global: ???
hardware:
  files:
    dataset: ???
    forcing_dataset: ???
20 changes: 18 additions & 2 deletions src/anemoi/training/config/training/default.yaml
@@ -58,16 +58,32 @@ loss_gradient_scaling: False

# Validation metrics calculation,
# This may be a list, in which case all metrics will be calculated
# and logged according to their name
# and logged according to their name.
# These metrics are calculated in the output model space, and thus
# have undergone postprocessing.
validation_metrics:
  # loss class to initialise
  - _target_: anemoi.training.losses.mse.WeightedMSELoss
    # Scalars to include in loss calculation
    # Available scalars include, 'variable'
    # Cannot scale over the variable dimension due to possible remappings.
    # Available scalars include:
    # - 'loss_weights_mask': Giving imputed NaNs a zero weight in the loss function
    # Use the `scale_validation_metrics` section to scale by variable.
    scalars: []
    # other kwargs
    ignore_nans: True

# List of validation metrics to keep in normalised space, and scalars to be applied
# Use '*' to reference all metrics, or a list of metric names.
# Unlike above, variable scaling is possible due to these metrics being
# calculated in the same way as the training loss, within the internal model space.
scale_validation_metrics:
  scalars_to_apply: ['variable']
  metrics:
    - 'all'
    # - "*"


# length of the "rollout" window (see Keisler's paper)
rollout:
  start: 1
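
For context, the `_target_` keys in this config are resolved by Hydra's `instantiate` utility, which imports the dotted path and calls it with the remaining keys as keyword arguments. A minimal sketch with an illustrative target (not the anemoi loss class the config actually uses):

```python
from hydra.utils import instantiate
from omegaconf import OmegaConf

cfg = OmegaConf.create({
    # Illustrative target; the config above targets
    # anemoi.training.losses.mse.WeightedMSELoss instead.
    "_target_": "torch.nn.MSELoss",
    "reduction": "mean",
})

loss = instantiate(cfg)  # equivalent to torch.nn.MSELoss(reduction="mean")
print(type(loss).__name__)  # MSELoss
```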
4 changes: 4 additions & 0 deletions src/anemoi/training/data/datamodule.py
@@ -77,6 +77,10 @@ def statistics(self) -> dict:
    def metadata(self) -> dict:
        return self.ds_train.metadata

    @cached_property
    def supporting_arrays(self) -> dict:
        return self.ds_train.supporting_arrays | self.grid_indices.supporting_arrays

    @cached_property
    def data_indices(self) -> IndexCollection:
        return IndexCollection(self.config, self.ds_train.name_to_index)
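
A note on the `|` in `supporting_arrays` above: it is the Python 3.9+ dict-union operator, where keys from the right-hand operand win on collisions. A small illustration with made-up keys:

```python
# Hypothetical supporting arrays from the dataset and the grid indices.
dataset_arrays = {"latitudes": [10.0, 20.0], "cutout_mask": [True, False]}
grid_index_arrays = {"grid_indices": [0, 1], "cutout_mask": [False, True]}

merged = dataset_arrays | grid_index_arrays
print(sorted(merged))         # ['cutout_mask', 'grid_indices', 'latitudes']
print(merged["cutout_mask"])  # [False, True] -- right-hand side wins
```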
5 changes: 5 additions & 0 deletions src/anemoi/training/data/dataset.py
@@ -109,6 +109,11 @@ def metadata(self) -> dict:
"""Return dataset metadata."""
return self.data.metadata()

@cached_property
def supporting_arrays(self) -> dict:
"""Return dataset supporting_arrays."""
return self.data.supporting_arrays()

@cached_property
def name_to_index(self) -> dict:
"""Return dataset statistics."""