A spinup acceleration tool for the ORCHIDEE family of land surface models (LSMs).
Concept: The proposed machine-learning (ML) enabled spin-up acceleration procedure (MLA) predicts the steady state of any land pixel of the full model domain after training on a representative subset of pixels. As the computational cost of the current generation of LSMs scales linearly with the number of pixels and years simulated, MLA reduces the computation time quasi-linearly with the number of pixels whose steady state is predicted by ML rather than simulated.
The aims, concepts and workflows are documented in Sun et al. (2022).
The SPINacc package includes:
- main.py - The main Python module that steers the execution of SPINacc.
- DEF_*/ - Directories with configuration files for each of the supported ORCHIDEE versions:
  - config.py - Settings to configure the machine learning performance.
  - varlist.json - Configures paths to ORCHIDEE forcing output and climate data.
  - varlist-explained.md - Documentation of data sources used in SPINacc.
- Tools/* - Modules called by main.py.
- AuxilaryTools/SteadyState_checker.py - Tool to assess the state of equilibration in ORCHIDEE simulations.
- tests/ - Reproducibility and regression tests.
- ORCHIDEE_cecill.txt - ORCHIDEE's license file.
- job - Job file for a bash environment.
- job_tcsh - Job file for a tcsh environment.
Here are the steps to launch SPINacc end-to-end, including the optional tests.
SPINacc has been tested and developed using Python==3.9.*.
- Navigate to the location in which you wish to install and clone the repo as follows:

  git clone git@github.com:CALIPSO-project/SPINacc.git

- Create a virtual environment and activate it:

  python3 -m venv ./venv3
  source ./venv3/bin/activate

- Install all relevant dependencies:

  cd SPINacc
  pip install -r requirements.txt
These instructions are applicable regardless of the system you work on; however, if you already have access to datasets on the Obelix supercomputer, it is likely that SPINacc will run with minimal modification (see Running on Obelix if you believe this is the case). We provide a ZENODO repository that contains forcing data as well as reference output for reproducibility testing.
It includes:
- ORCHIDEE_forcing_data - Explained in DEF_Trunk/varlist-explained.md
- reference data - necessary to run the reproducibility checks (now OUTDATED; see Reproducibility tests).
The setup-data.sh script has been provided to automate the download of the associated ZENODO repository and to set the paths to the forcing data and climate data in DEF_Trunk/varlist.json. The ZENODO repository does not include the climate data files (variable name twodeg); without these, initialisation will fail and SPINacc will be unable to proceed. The climate data will be made available upon request to Daniel Goll (https://www.lsce.ipsl.fr/en/pisp/daniel-goll/).
To ensure the script works without error, set the MYTWODEG and MYFORCING paths appropriately. The MYFORCING path points to where you want the forcing data to be extracted. The default location is ORCHIDEE_forcing_data in the project root.
The script runs the sed command to replace all occurrences of /home/surface5/vbastri/ in DEF_Trunk/varlist.json with the path to the downloaded and extracted ORCHIDEE_forcing_data, e.g. /your/path/to/forcing/vlad_files/vlad_files/. This can be done manually if desired.
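If you prefer to make the change by hand, the substitution can be sketched in a few lines of Python (this assumes the placeholder prefix appears as plain strings inside DEF_Trunk/varlist.json; adjust new_prefix to wherever you extracted the forcing data):

from pathlib import Path

varlist = Path("DEF_Trunk/varlist.json")
old_prefix = "/home/surface5/vbastri/"                       # placeholder prefix shipped in varlist.json
new_prefix = "/your/path/to/forcing/vlad_files/vlad_files/"  # adjust to your extraction path
varlist.write_text(varlist.read_text().replace(old_prefix, new_prefix))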
These instructions are designed to get you up and running with SPINacc quickly and then run the accompanying tests. See the section below on Obtaining 'best' performance for a more detailed overview of how to optimally adjust ML performance.
- In DEF_Trunk/config.py, modify the results_dir variable to point to a different path if desired. To run SPINacc from end-to-end, ensure that the steps are set as follows:

  tasks = [
      1,
      2,
      4,
      5,
  ]
  # 1 = test clustering
  # 2 = clustering
  # 3 = compress forcing
  # 4 = ML
  # 5 = evaluation / visualisation

  If running from scratch, ensure that start_from_scratch is set to True in config.py. The start_from_scratch step creates a packdata.nc file and only needs to be done once for a given version of ORCHIDEE. It is also possible to run just a single task, if desired (see the example below).
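  For example, to re-run only the ML step once clustering has already completed, the task list can be reduced accordingly (illustrative only; task numbering as listed above):

  tasks = [4]  # 4 = ML; assumes the outputs of tasks 1 and 2 already exist in results_dir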
- Then run:

  python main.py DEF_Trunk/

  By default, main.py will look for the DEF_Trunk directory. SPINacc supports passing other configuration / job directories as arguments to main.py (e.g. python main.py DEF_CNP2/). It is helpful to create copies of the default configurations and then modify them for your own purposes, to avoid continuously stashing work.

- Results are located in your output directory under MLacc_results.csv. Visualisations of R2, Slope and dNRMSE can be found for each component in Eval_all_biomassCpool.png, Eval_all_litterCpool.png and Eval_all_somCpool.png. For other versions of ORCHIDEE, i.e. CNP2, outputs will be structured similarly.
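  To take a quick look at the metrics table after a run, something like the following can be used (a sketch only; it assumes pandas is installed and that the path matches your results_dir):

  import pandas as pd

  results = pd.read_csv("/path/to/your/results_dir/MLacc_results.csv")  # adjust to your results_dir
  print(results.head())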
It is possible to run a set of baseline checks that compare the code's output to the reference output. As of January 2025, the reference dataset has been updated and is now stored in https://github.com/ma595/SPINacc-results for CNP2 and Trunk. We are working towards a new Zenodo release. These tests are useful to ensure that regressions have not been unexpectedly introduced during development.
- Begin by downloading the reference output from GitHub:

  git clone https://github.com/ma595/SPINacc-results

- In DEF_Trunk/config.py, set the reference_dir variable to point to SPINacc-results/Trunk.

- [Optional] To execute the reproducibility checks at runtime, ensure that True values are set in all relevant steps in DEF_Trunk/config.py.

- Alternatively, the tests can be executed after the successful completion of a run as follows:

  pytest --trunk=DEF_Trunk/ -v --capture=sys

  It is possible to point to different output directories with the --trunk flag. To run a single test do:

  pytest --trunk=DEF_Trunk -v --capture=sys ./tests/test_task4.py

  The command line arguments -v and --capture=sys make test output more visible to users.
- The configuration config.py in branch main should already be configured correctly; if not, ensure that the following assignments have been made:

  kmeans_clusters = 4
  max_kmeans_clusters = 9
  random_seed = 1000
  algorithms = ['bt',]
  take_year_average = True
  take_unique = False
  smote_bat = True
  sel_most_PFTs = False

  The SPINacc-results repo also contains the DEF_Trunk settings used to obtain the reference output (https://github.com/ma595/SPINacc-results/tree/main/jobs/DEF_Trunk).
- The checks are as follows:
  - test_init.py: Computes a recursive compare of packdata.nc against the reference packdata.nc (a sketch of this kind of comparison is shown after this list).
  - test_task1.py: Checks dist_all.npy against the reference.
  - test_task2.py: Checks IDloc.npy, IDSel.npy and IDx.npy against the reference.
  - test_task3.py: Currently not checked.
  - test_task4.py: Compares the new MLacc_results.csv across all components. Tolerance is 1e-2.
  - test_task4_2.py: Compares the updated restart file SBG_FGSPIN.340Y.ORC22v8034_22501231_stomate_rest.nc against the reference.
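For orientation, the kind of recursive comparison performed by test_init.py can be approximated with xarray (a sketch only; the actual tests have their own logic and tolerances, and the paths here are assumptions):

import xarray as xr

new = xr.open_dataset("/path/to/your/results_dir/packdata.nc")  # output of the initialisation step
ref = xr.open_dataset("SPINacc-results/Trunk/packdata.nc")      # reference from the SPINacc-results repo
xr.testing.assert_allclose(new, ref)  # raises an AssertionError if any variable differs beyond tolerance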
An automated test that runs the entire DEF_Trunk pipeline from end-to-end is executed when a release is tagged. It can be forced to run using GitHub's command line tool gh (see the official documentation for how to install it on your system). Runs of this workflow can then be listed as follows:

gh run list --workflow=build-and-run.yml
The following settings can change the performance of SPINacc:

- algorithms: ML algorithms. Multiple can be selected for any given run; the results will be stacked in MLacc_results.csv. Options include:
  - bt: Bagging tree
  - rf: Random forest
  - nn: Neural network
  - ridge: Ridge regression
  - best: A 'shotgun' approach that selects the best performing ML algorithm for the given target variable. This is assessed based on the performance on a subset of the data (see select_best_model in train.py), so worse performance may be exhibited on some variables compared to selecting bt directly.
- take_year_average (required): If True, all annual data is averaged into a single year's worth of data. If False, all years are used; this has the effect of multiplying the quantity of training data, X, for a given target variable, Y, by the number of years.
- smote_bat (required): Synthetic minority oversampling.
- take_unique (default - True): Take unique pixels only from the output of the Clustering step; this reduces the number of selected pixels by removing duplicates. This option was kept to maintain correspondence with a previous implementation of SPINacc.
- old_cluster (default - True): If True, the clustering step uses the old clustering method, i.e. it randomly samples Nc examples, or takes all samples if the number of samples is less than Nc. If old_cluster = False, the new clustering method takes max(Nc, 20% of the available locations) (see the sketch after this list).
- sel_most_PFT_sites (default - False): If True and old_cluster = False, it will preferentially select samples that contain more PFTs, using the 20% rule detailed previously. If old_cluster = True and sel_most_PFT_sites = True, an error is thrown.
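The pixel-selection rule described for old_cluster can be summarised as follows (an illustrative sketch of the rule as described above, not the actual implementation in Tools/):

# Illustrative only: number of pixels retained for a given PFT/cluster
def n_selected(n_locations: int, nc: int, old_cluster: bool) -> int:
    if old_cluster:
        # old method: randomly sample Nc pixels, or take all of them if fewer than Nc are available
        return min(nc, n_locations)
    # new method: at least Nc pixels, or 20% of the available locations, whichever is larger
    return max(nc, int(0.2 * n_locations))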
We recommend always setting parallel = True in config.py to speed up the execution of SPINacc. The serial and parallel executions give exactly the same results; however, it may sometimes be useful to turn parallelism off for debugging purposes.
The following settings are recommended to obtain the best machine learning performance with SPINacc. Note that training time will be longer with take_year_average set to False.
algorithms = ["best"]
take_year_average = False # this will take much longer to finish.
take_unique = True
smote_bat = True
A new clustering approach is still being tested to see if performance is improved. See PR #93. To test the new implementation set the following:
sel_most_PFTs = True
old_cluster = False
If you are already using the Obelix supercomputer, it is likely that SPINacc will work without much adjustment to the varlist.json file.
Jobs can be submitted using the provided PBS scripts, e.g. job:
- In job, set: setenv dirpython '/your/path/to/SPINacc/' and setenv dirdef 'DEF_Trunk/'
- Then launch your first job using qsub -q short job, for task 1
- For tasks 3 and 4, it is better to use qsub -q medium job
An overview of the tasks is provided as follows:
Extracts climatic variables over 11 years and stores them in a packdata.nc file. Subsequent steps are unable to proceed unless this step completes successfully.
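A quick way to confirm that the file was written correctly is to open it and list its contents (a sketch only; it assumes xarray is available and that packdata.nc sits in your results_dir):

import xarray as xr

packdata = xr.open_dataset("/path/to/your/results_dir/packdata.nc")  # adjust to your results_dir
print(packdata)  # prints dimensions, coordinates and the extracted climatic variables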
Evaluates the impact of varying the number of K-means clusters on model performance, setting a default of 4 clusters and producing a ‘dist_all.png’ graph.
Performs the clustering using a K-means algorithm and saves information on the location of the selected pixels (files starting with 'ID'). The locations of the selected pixels (red) for a given PFT, and all pixels with a cover fraction exceeding 'cluster_thres' [defined in varlist.json] (grey), are plotted in the figures 'ClustRes_PFT**.png'. An example for PFT2 is shown here:
Creates compressed forcing files for ORCHIDEE, containing data for selected pixels only, aligned on a global pseudo-grid for efficient pixel-level simulations, with file specifications listed in varlist.json.
- Performs the ML training on results from an ORCHIDEE simulation using the compressed forcing (production mode: resp-format=compressed) or global forcing (debug mode: resp-format=global).
- Extrapolates to a global grid.
- Writes the state variables into global restart files for ORCHIDEE. For Trunk, this is SBG_FGSPIN.340Y.ORC22v8034_22501231_stomate_rest.nc.
- Evaluates ML training outputs vs real model outputs and writes performance metrics to MLacc_results.csv.
This visualises ML performance from Task 4, offering two evaluation modes: global pixel evaluation, and leave-one-out cross-validation (LOOCV) for training sites. It generates plots for various state variables at the PFT level, including comparisons of ML predictions with conventional spin-up data.