
6 glm for cue #7

Open · wants to merge 3 commits into main
Conversation

@hrlai (Collaborator) commented Feb 20, 2025

This PR addresses #6 (stemming from ImperialCollegeLondon/virtual_ecosystem#746), which is to set up a generalised linear model (GLM) to estimate parameters for temperature-dependent microbial carbon use efficiency (CUE) such that predicted CUE does not go out of bounds (it stays between 0 and 1).

There are two aspects to review: (1) model choice and (2) folder structure of this repo going forward.

Model choice:

  • @davidorme suggested that we might look at multiple approaches, any further thoughts? Just within a beta-distribution GLM, there are multiple link function choices (logit, probit and cloglog) but I think you had something else in mind?
  • @jacobcook1995 I'm not sure when we should consider this done enough to update the soil constants. Probably after a few more discussions here?

Folder structure:

  • This is where more people could chime in
  • For data, I stored it in a data directory, but it wasn't committed because the csv files etc. are gitignored. How do we envision data download? Do we always include the URL in the code for others to download it manually...?
  • For code, I set it up like code/soil/cue for my case, so the code directory is a bit like the models directory in VE. Not sure if this is best.
  • @arne-exe I would like your review too. @jacobcook1995 told me that you're working on some RStudio notes for this repo? I'm writing from an RStudio user's perspective too.
  • @vgro I have created a bib directory to address Bibliography #1 and followed the same filename as virtual_ecosystem. The bibliography is supposed to include data sources and refs used in the html reports that I create with RMarkdown, which leads to:
  • Do we always want to include a report with the data analyses? Initially I was going to set up just an .R script, but decided to trial an .Rmd to also generate a report at the end. It is an html report intended for anyone to jump right in without having to worry about the details. But the html file easily goes >500 kb (the lintr limit), and mine was 900 kb due to figures, so I didn't commit it.

That's all off the top of my head for now. Looking forward to pinning down a folder structure for this repo :)

@hrlai hrlai linked an issue Feb 20, 2025 that may be closed by this pull request
@hrlai hrlai requested a review from arne-exe February 20, 2025 07:05
@jacobcook1995 (Collaborator) commented Feb 20, 2025

> @davidorme suggested that we might look at multiple approaches, any further thoughts? Just within a beta-distribution GLM, there are multiple link function choices (logit, probit and cloglog) but I think you had something else in mind?

I guess it does make sense to try different distributions on this side, but the issue is that we don't currently have any way to utilise alternative functional implementations within the virtual_ecosystem code. Unless I figure out a way of implementing CUE as a generalised linear model, I would need a different function for each distribution.

@davidorme (Collaborator) commented

We could do it cheaply with a config option that chooses the function. This might be a case where we hard-code two competing theoretical approaches rather than having to build a registry. But one option for now is fine!

Also, you don't need to go anywhere near the GLM itself. The parameterisation does that, and you just need to use an appropriate expression to turn the parameters into predictions.

@jacobcook1995 (Collaborator) commented Feb 20, 2025

Ahh, I meant that I would have to write the mechanistic model in such a way that it could accept arbitrary GLM parameters, which a mechanistic representation of a particular distribution wouldn't do (e.g. there's no parameterisation by which you can get y = mx + c to represent an exponential decay process).

But yes, for now I would probably only want to implement a specific distribution, because the alternative is not a scalable approach to soil model parameterisation.

@davidorme (Collaborator) left a comment

I've chatted to Rob and @jacobcook1995 about notebook formats. We think it is probably going to be better to use Jupyter notebooks rather than RMarkdown. The main reason is that these outputs render better on GitHub.

I've uploaded a couple of files. One is MyST Markdown - the only real advantage there is that GitHub knows it is Markdown and tries to render it. It doesn't know that .Rmd is Markdown. The .ipynb format is rendered properly, though, and includes the binary data for the images. That makes them a bit bulky, but they are also much easier to read and review.

None of this is nailed down yet, but Jupyter is what we'd use for Python notebooks and having the same framework and proper rendering is a real advantage.

Comment on lines +97 to +112
The model was fitted with Bayesian inference using the `brms` package in `R`.

```{r Model}
m <- brm(
  CUE ~ 1 + Temp_centered + (1 | Author),
  data = dat_cue,
  family = Beta(),
  prior = prior(normal(0, 0.5), ub = 0, class = b),
  warmup = 3000,
  iter = 4000,
  cores = 4,
  file = "out/model_cue",
  file_refit = "on_change"
)
```
A collaborator commented:

Are you using Bayes rather than the frequentist betareg package here because of the random effect of Author?

@hrlai (Collaborator, Author) replied Feb 20, 2025

Using Bayes only because it's my routine these days... For a frequentist approach there is glmmTMB, which has a beta family. I think if we want speed and ease of running on everyone's computer, then we might want frequentist / maximum likelihood as a default going forward?

A collaborator replied:

I think broadly we should use the "best" analysis - and it's great to have a Bayesian viewpoint as I'm very frequentist (but only from habit, really). I don't think everybody has to run every analysis - although we want the repo set up so people can.

@hrlai (Collaborator, Author) commented Feb 20, 2025

> Ahh, I meant that I would have to write the mechanistic model in such a way that it could accept arbitrary GLM parameters, which a mechanistic representation of a particular distribution wouldn't do (e.g. there's no parameterisation by which you can get y = mx + c to represent an exponential decay process).

Good point. In addition to the config option @davidorme suggested, a simpler (but less general?) approach is to add an (inverse) link function to a `calculate_*` function in the soil module. I don't know Python well, but here's a crude draft. Following the nomenclature from some R packages and using `calculate_carbon_use_efficiency()` as an example, there could be another argument `transform`, an inverse link function that backtransforms the output to the observation scale depending on the GLM:

return transform(reference_cue - cue_with_temperature * (soil_temp - cue_reference_temp))

And `transform` would be inverse-logit for a logit link (my current implementation), exp for a log link, and we would also include an identity link (no transformation) for a Gaussian / normal GLM.

We do not need to worry about generalising this if we only pick one GLM for now, so these are more like notes for the future.
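As a concrete sketch of the idea above (the function and argument names mirror the crude draft in this comment, not the actual virtual_ecosystem API, and the parameter values in the usage line are made up):

```python
import math


def expit(x: float) -> float:
    """Inverse-logit: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def calculate_carbon_use_efficiency(
    soil_temp: float,
    reference_cue: float,
    cue_with_temperature: float,
    cue_reference_temp: float,
    transform=expit,
) -> float:
    """Hypothetical draft of the transform idea.

    `transform` is the inverse link: expit for a logit link, math.exp
    for a log link, or `lambda x: x` (identity) for a Gaussian GLM.
    """
    # Linear predictor on the link scale, following the draft expression above
    eta = reference_cue - cue_with_temperature * (soil_temp - cue_reference_temp)
    # Back-transform to the observation scale
    return transform(eta)


# With a logit link, the output is guaranteed to stay between 0 and 1
# (made-up parameter values, for illustration only)
cue = calculate_carbon_use_efficiency(30.0, 0.5, 0.01, 25.0)
```

With the identity `transform`, this reduces to the current linear form, so the existing behaviour would be one special case of the same function.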

@hrlai (Collaborator, Author) commented Feb 20, 2025

> better to use Jupyter notebooks rather than RMarkdown. The main reason is that these outputs render better on GitHub.

Thanks @davidorme , I'm all for integration, so happy to favour Jupyter over RMarkdown. That said, would it be possible to do a bit of both by building an auto-conversion into the git workflow? There seems to be `jupytext --to ipynb example.Rmd`. Did you convert the RMarkdown manually, or did you do it this way? The conversion I got looks a bit different from your .ipynb file.

It would be ideal if anyone on the data team like me could stick to the RStudio routine but still deliver Jupyter notebooks as an end product. But if this is too clunky then I'm still happy to switch over 😄

@jacobcook1995 (Collaborator) commented

> Ahh, I meant that I would have to write the mechanistic model in such a way that it could accept arbitrary GLM parameters, which a mechanistic representation of a particular distribution wouldn't do (e.g. there's no parameterisation by which you can get y = mx + c to represent an exponential decay process).
>
> Good point. In addition to the config option @davidorme suggested, a simpler (but less general?) approach is to add an (inverse) link function to a `calculate_*` function in the soil module. [...] there could be another argument `transform`, an inverse link function that backtransforms the output to the observation scale depending on the GLM:
>
> return transform(reference_cue - cue_with_temperature * (soil_temp - cue_reference_temp))
>
> And `transform` would be inverse-logit for a logit link (my current implementation), exp for a log link, and we would also include an identity link (no transformation) for a Gaussian / normal GLM.
>
> We do not need to worry about generalising this if we only pick one GLM for now, so these are more like notes for the future.

Yes, these are definitely useful notes for the future! For mainly academic background reasons (only just learning what GLMs are 🫠), I'm pretty strongly in favour of trying to use solely mechanistically derived process representations for the soil model. But something Rob has talked about before is trying to implement alternative empirically derived representations, so down the road we could split the soil model into two (e.g. soil_mechanistic and soil_empirical) and compare performance (my guess is that which one works out best would be pretty metric dependent).

@jacobcook1995 (Collaborator) commented Feb 21, 2025

On the data side, I think using @qiao2019 is pretty much a no-brainer (vs data drawn from a single system at only 3 temperatures!). The lack of tropical study sites is obviously an issue, but I think this is something we are going to have to learn to generally accept (both the tropics and soils are generally understudied; in combo it's really bad).

Looking at the plotted distribution, I do prefer your model to the linear models shown. Staying within the range of 0.2 < CUE < 0.8 for a wide range of temperatures is good in my book, because values outside that have never felt biologically reasonable to me. Can cells really bring their respiration down to as low as 5% of total carbon intake, with the other 95% being assimilated as biomass? (Okay, there are literally data points saying that they can, but I remain sceptical.)

If we want to implement your model + parameterisation, I would need to change the functional form of calculate_carbon_use_efficiency(). I'm happy to do this, but I need to check what form would fit with the statistical model used here. Would it be something like

cue = 1 / (1 + exp(-(cue_ref + (temperature - reference_temperature))))?

(Obviously cue_ref would need to be renamed as it's no longer the carbon use efficiency at the reference temperature)

@hrlai (Collaborator, Author) commented Feb 21, 2025

> split the soil model into two (e.g. soil_mechanistic and soil_empirical)

Agreed, and now I can more clearly see where you were coming from. I think our recent discussion is quite relevant in this regard, i.e., on one hand we have Arrhenius equations that represent the mechanistic / process-based part, and on the other hand we have the CUE line, which is a curve-fitting "empirical" model.

At some point, it would be great for the data team to chat about our "default" approach / model choice:

  • empirical, curve-fitting, phenomenological, vs.
  • process-based, mechanistic

> would it be something like
>
> cue = 1 / (1 + exp(-(cue_ref + (temperature - reference_temperature))))?
>
> (Obviously cue_ref would need to be renamed as it's no longer the carbon use efficiency at the reference temperature)

Yep, it would be what Python calls expit and what R tends to call inverse-logit or logistic. But the slope is missing:

cue = 1 / (1 + exp(-(cue_ref + cue_T * (temperature - reference_temperature))))

For renaming there are a few options:

  • cue_ref can be log-odds of CUE at ref temp
  • cue_T is the slope of log-odds of CUE with respect to temperature...
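As a quick numerical check of the slope-included expression (the parameter values here are illustrative, not fitted estimates; `cue_ref` is the log-odds of CUE at the reference temperature, and a negative `cue_T` matches the upper-bounded-at-zero slope prior in the brms model):

```python
import math


def cue_logit_model(temperature, cue_ref, cue_T, reference_temperature):
    # expit (inverse-logit) of the linear predictor keeps CUE inside (0, 1)
    eta = cue_ref + cue_T * (temperature - reference_temperature)
    return 1.0 / (1.0 + math.exp(-eta))


# Illustrative (made-up) values: at the reference temperature the linear
# predictor reduces to cue_ref, so CUE = expit(cue_ref)
print(cue_logit_model(25.0, 0.0, -0.05, 25.0))  # prints 0.5
```

Because the expit is bounded, CUE stays in (0, 1) even for temperatures far outside the data range, which is the whole point of the beta GLM over the linear form.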

@arne-exe (Collaborator) commented

Hey Hao Ran, below are some comments:

  • Folder structure: Please have a look at the "data repository structure" document on the VE notebook (go to VE Sharepoint > click Notebook > go to Data repository > click "data repository structure"). There are some useful notes there regarding the folder structure.
  • Script structure: I noticed your script has some metadata (title, author, etc.) and is well organised using different sections. It would be good to discuss with everyone whether we want a common structure that is similar across all of the scripts.
  • Data download: The abovementioned notebook doc has some good info on this. I also tried out a method where we can facilitate the data download within the R script (using RSelenium). If we were to include this in our scripts we could extract the datasets straight from Zenodo. I'll create a different issue on this with some more details.

Successfully merging this pull request may close these issues.

GLM for carbon use efficiency to avoid fitted line out of bound
4 participants