A spinup acceleration tool for the ORCHIDEE family of land surface models (LSMs).
Concept: The proposed machine-learning (ML) enabled spin-up acceleration procedure (MLA) predicts the steady state of any land pixel of the full model domain after training on a representative subset of pixels. As the computational cost of the current generation of LSMs scales linearly with the number of pixels and years simulated, MLA reduces the computation time quasi-linearly with the number of pixels whose steady state is predicted by ML rather than simulated.
The aims, concepts and workflows are documented in Sun et al. (2022).
The SPINacc package includes:
- main.py - The main Python module that steers the execution of SPINacc.
- DEF_*/ - Directories with configuration files for each of the supported ORCHIDEE versions:
  - config.py - Settings to configure the machine learning performance.
  - varlist.json - Configures paths to ORCHIDEE forcing output and climate data.
  - varlist-explained.md - Documentation of data sources used in SPINacc.
- Tools/* - Modules called by main.py.
- AuxilaryTools/SteadyState_checker.py - Tool to assess the state of equilibration in ORCHIDEE simulations.
- tests/ - Reproducibility and regression tests.
- ORCHIDEE_cecill.txt - ORCHIDEE's license file.
- job - Job file for a bash environment.
- job_tcsh - Job file for a tcsh environment.
Here are the steps to launch SPINacc end-to-end, including the optional tests.
SPINacc has been tested and developed using Python==3.9.*.
- Navigate to the location in which you wish to install and clone the repo as follows:

  git clone git@github.com:CALIPSO-project/SPINacc.git

- Create a virtual environment and activate it:

  python3 -m venv ./venv3
  source ./venv3/bin/activate

- Install all relevant dependencies:

  cd SPINacc
  pip install -r requirements.txt
These instructions are applicable regardless of the system you work on; however, if you already have access to datasets on the Obelix supercomputer, it is likely that SPINacc will run with minimal modification (see Running on Obelix if you believe this is the case). We provide a ZENODO repository that contains forcing data as well as reference output for reproducibility testing.
It includes:
- ORCHIDEE_forcing_data - Explained in DEF_Trunk/varlist-explained.md
- reference data - necessary to run the reproducibility checks (now OUTDATED; see Reproducibility tests).
The setup-data.sh script has been provided to automate the download of the associated ZENODO repository and to set the paths to the forcing data and climate data in DEF_Trunk/varlist.json. The ZENODO repository does not include the climate data files (variable name twodeg); without these, initialisation will fail and SPINacc will be unable to proceed. The climate data will be made available upon request to Daniel Goll (https://www.lsce.ipsl.fr/en/pisp/daniel-goll/).
To ensure the script works without error, set the MYTWODEG and MYFORCING paths appropriately. The MYFORCING path points to where you want the forcing data to be extracted. The default location is ORCHIDEE_forcing_data in the project root.
The script runs the sed command to replace all occurrences of /home/surface5/vbastri/ in DEF_Trunk/varlist.json with the path to the downloaded and extracted ORCHIDEE_forcing_data, e.g. /your/path/to/forcing/vlad_files/vlad_files/. This can be done manually if desired.
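If you prefer to make the change by hand, the substitution can be sketched in a few lines of Python (this assumes the placeholder prefix appears as plain strings inside DEF_Trunk/varlist.json; adjust new_prefix to wherever you extracted the forcing data):

from pathlib import Path

varlist = Path("DEF_Trunk/varlist.json")
old_prefix = "/home/surface5/vbastri/"                       # placeholder prefix shipped in varlist.json
new_prefix = "/your/path/to/forcing/vlad_files/vlad_files/"  # adjust to your extraction path
varlist.write_text(varlist.read_text().replace(old_prefix, new_prefix))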
These instructions are designed to get you up and running with SPINacc quickly and then run the accompanying tests. See the section below on Obtaining 'best' performance for a more detailed overview of how to optimally adjust ML performance.
- In DEF_Trunk/config.py, modify the results_dir variable to point to a different path if desired. To run SPINacc from end-to-end, ensure that the steps are set as follows:

  tasks = [
      1,
      2,
      4,
      5,
  ]
  # 1 = test clustering
  # 2 = clustering
  # 3 = compress forcing
  # 4 = ML
  # 5 = evaluation / visualisation

  If running from scratch, ensure that start_from_scratch is set to True in config.py. The start_from_scratch step creates a packdata.nc file and only needs to be done once for a given version of ORCHIDEE. It is also possible to run just a single task, if desired (see the example below).
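  For example, to re-run only the ML step once clustering has already completed, the task list can be reduced accordingly (illustrative only; task numbering as listed above):

  tasks = [4]  # 4 = ML; assumes the outputs of tasks 1 and 2 already exist in results_dir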
- Then run:

  python main.py DEF_Trunk/

  By default, main.py will look for the DEF_Trunk directory. SPINacc supports passing other configuration / job directories as arguments to main.py (e.g. python main.py DEF_CNP2/). It is helpful to create copies of the default configurations and then modify them for your own purposes, to avoid continuously stashing work.

- Results are located in your output directory under MLacc_results.csv. Visualisations of R2, Slope and dNRMSE can be found for each component in Eval_all_biomassCpool.png, Eval_all_litterCpool.png and Eval_all_somCpool.png. For other versions of ORCHIDEE, i.e. CNP2, outputs will be structured similarly.
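  To take a quick look at the metrics table after a run, something like the following can be used (a sketch only; it assumes pandas is installed and that the path matches your results_dir):

  import pandas as pd

  results = pd.read_csv("/path/to/your/results_dir/MLacc_results.csv")  # adjust to your results_dir
  print(results.head())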
It is possible to run a set of baseline checks that compare the code's output to the reference output. As of January 2025, the reference dataset has been updated and is now stored in https://github.com/ma595/SPINacc-results for CNP2 and Trunk. We are working towards a new Zenodo release. These tests are useful to ensure that regressions have not been unexpectedly introduced during development.
- Begin by downloading the reference output from GitHub:

  git clone https://github.com/ma595/SPINacc-results

- In DEF_Trunk/config.py, set the reference_dir variable to point to SPINacc-results/Trunk.

- [Optional] To execute the reproducibility checks at runtime, ensure that True values are set in all relevant steps in DEF_Trunk/config.py.

- Alternatively, the tests can be executed after the successful completion of a run as follows:

  pytest --trunk=DEF_Trunk/ -v --capture=sys

  It is possible to point to different output directories with the --trunk flag. To run a single test do:

  pytest --trunk=DEF_Trunk -v --capture=sys ./tests/test_task4.py

  The command line arguments -v and --capture=sys make test output more visible to users.
- The configuration config.py in branch main should already be configured correctly; if not, ensure that the following assignments have been made:

  kmeans_clusters = 4
  max_kmeans_clusters = 9
  random_seed = 1000
  algorithms = ['bt',]
  take_year_average = True
  take_unique = False
  smote_bat = True
  sel_most_PFTs = False

  The SPINacc-results repo also contains the DEF_Trunk settings used to obtain the reference output (https://github.com/ma595/SPINacc-results/tree/main/jobs/DEF_Trunk).
- The checks are as follows:
  - test_init.py: Computes a recursive compare of packdata.nc against the reference packdata.nc (a sketch of this kind of comparison is shown after this list).
  - test_task1.py: Checks dist_all.npy against the reference.
  - test_task2.py: Checks IDloc.npy, IDSel.npy and IDx.npy against the reference.
  - test_task3.py: Currently not checked.
  - test_task4.py: Compares the new MLacc_results.csv across all components. Tolerance is 1e-2.
  - test_task4_2.py: Compares the updated restart file SBG_FGSPIN.340Y.ORC22v8034_22501231_stomate_rest.nc against the reference.
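For orientation, the kind of recursive comparison performed by test_init.py can be approximated with xarray (a sketch only; the actual tests have their own logic and tolerances, and the paths here are assumptions):

import xarray as xr

new = xr.open_dataset("/path/to/your/results_dir/packdata.nc")  # output of the initialisation step
ref = xr.open_dataset("SPINacc-results/Trunk/packdata.nc")      # reference from the SPINacc-results repo
xr.testing.assert_allclose(new, ref)  # raises an AssertionError if any variable differs beyond tolerance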
An automated test that runs the entire DEF_Trunk pipeline from end-to-end is executed when a release is tagged. It can be forced to run using GitHub's command line tool gh (see the official documentation for how to install it on your system). Runs of this workflow can then be listed as follows:

gh run list --workflow=build-and-run.yml
The following settings can change the performance of SPINacc:

- algorithms: ML algorithms. Multiple can be selected for any given run; the results will be stacked in MLacc_results.csv. Options include:
  - bt: Bagging tree
  - rf: Random forest
  - nn: Neural network
  - ridge: Ridge regression
  - best: A 'shotgun' approach that selects the best performing ML algorithm for the given target variable. This is assessed based on the performance on a subset of the data (see select_best_model in train.py), so worse performance may be exhibited on some variables compared to selecting bt directly.
- take_year_average (required): If True, all annual data is averaged into a single year's worth of data. If False, all years are used; this has the effect of multiplying the quantity of training data, X, for a given target variable, Y, by the number of years.
- smote_bat (required): Synthetic minority oversampling.
- take_unique (default - True): Take unique pixels only from the output of the Clustering step; this reduces the number of selected pixels by removing duplicates. This option was kept to maintain correspondence with a previous implementation of SPINacc.
- old_cluster (default - True): If True, the clustering step uses the old clustering method, i.e. it randomly samples Nc examples, or takes all samples if the number of samples is less than Nc. If old_cluster = False, the new clustering method takes max(Nc, 20% of the available locations) (see the sketch after this list).
- sel_most_PFT_sites (default - False): If True and old_cluster = False, it will preferentially select samples that contain more PFTs, using the 20% rule detailed previously. If old_cluster = True and sel_most_PFT_sites = True, an error is thrown.
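The pixel-selection rule described for old_cluster can be summarised as follows (an illustrative sketch of the rule as described above, not the actual implementation in Tools/):

# Illustrative only: number of pixels retained for a given PFT/cluster
def n_selected(n_locations: int, nc: int, old_cluster: bool) -> int:
    if old_cluster:
        # old method: randomly sample Nc pixels, or take all of them if fewer than Nc are available
        return min(nc, n_locations)
    # new method: at least Nc pixels, or 20% of the available locations, whichever is larger
    return max(nc, int(0.2 * n_locations))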
We recommend always setting parallel = True in config.py to speed up the execution of SPINacc. The serial and parallel executions give exactly the same results; however, it may sometimes be useful to turn parallelism off for debugging purposes.
The following settings are recommended to obtain the best machine learning performance with SPINacc. Note that training time will be longer with take_year_average set to False.
algorithms = ["best"]
take_year_average = False # this will take much longer to finish.
take_unique = True
smote_bat = True
A new clustering approach is still being tested to see if performance is improved. See PR #93. To test the new implementation set the following:
sel_most_PFTs = True
old_cluster = False
If you are already using the Obelix supercomputer, it is likely that SPINacc will work without much adjustment to the varlist.json file.
Jobs can be submitted using the provided PBS scripts, e.g. job:
- In job, set: setenv dirpython '/your/path/to/SPINacc/' and setenv dirdef 'DEF_Trunk/'
- Then launch your first job using qsub -q short job, for task 1
- For tasks 3 and 4, it is better to use qsub -q medium job
An overview of the tasks is provided as follows:
Extracts climatic variables over 11 years and stores them in a packdata.nc file. Subsequent steps are unable to proceed unless this step completes successfully.
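A quick way to confirm that the file was written correctly is to open it and list its contents (a sketch only; it assumes xarray is available and that packdata.nc sits in your results_dir):

import xarray as xr

packdata = xr.open_dataset("/path/to/your/results_dir/packdata.nc")  # adjust to your results_dir
print(packdata)  # prints dimensions, coordinates and the extracted climatic variables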
Evaluates the impact of varying the number of K-means clusters on model performance, setting a default of 4 clusters and producing a ‘dist_all.png’ graph.
Performs the clustering using a K-means algorithm and saves information on the location of the selected pixels (files starting with 'ID'). The locations of the selected pixels (red) for a given PFT, and all pixels with a cover fraction exceeding 'cluster_thres' [defined in varlist.json] (grey), are plotted in the figures 'ClustRes_PFT**.png'. An example for PFT2 is shown here:
Creates compressed forcing files for ORCHIDEE, containing data for selected pixels only, aligned on a global pseudo-grid for efficient pixel-level simulations, with file specifications listed in varlist.json.
- Performs the ML training on results from an ORCHIDEE simulation using the compressed forcing (production mode: resp-format=compressed) or global forcing (debug mode: resp-format=global).
- Extrapolates to a global grid.
- Writes the state variables into global restart files for ORCHIDEE. For Trunk, this is SBG_FGSPIN.340Y.ORC22v8034_22501231_stomate_rest.nc.
- Evaluates ML training outputs vs real model outputs and writes performance metrics to MLacc_results.csv.
This visualises ML performance from Task 4, offering two evaluation modes: global pixel evaluation, and leave-one-out cross-validation (LOOCV) for training sites. It generates plots for various state variables at the PFT level, including comparisons of ML predictions with conventional spin-up data.