Commit

m
rhijmans committed Oct 3, 2024
1 parent 539d126 commit 45d97da
Showing 2 changed files with 62 additions and 155 deletions.
6 changes: 3 additions & 3 deletions source/contribute/example.rst
@@ -10,7 +10,7 @@ Example script
</div>
<div style="visibility: visible;">

In *Carob*, we standardize datasets that can be automatically downloaded. Each original dataset gets its own *R* script. On this page we discuss an example script that standardizes the dataset `doi:10.21421/D2/STACVA <https://dataverse.icrisat.org/dataset.xhtml?persistentId=doi:10.21421/D2/STACVA>`__ by Hakeem Ayinde Ajeigbe and colleagues. This particular dataset is published on the `ICRISAT dataverse <https://dataverse.icrisat.org/>`__, which is based on the `Harvard Dataverse <https://dataverse.harvard.edu/>`__. Dataverse is the most common platform used for sharing open agricultural research data.

We use this dataset in the tutorial because its processing is not very complex. The full script is `available here <https://raw.githubusercontent.com/reagro/carob/refs/heads/master/scripts/agronomy/doi_10.21421_D2_STACVA.R>`__; please have a look at it now, before we explain it in detail.

@@ -88,7 +88,7 @@ Now use the ``carobiner::get_data()`` function. It will download the to retriev
ff <- carobiner::get_data(uri, path, group)
Metadata
--------

The metadata section describes the dataset and enriches it with additional information that is useful for *Carob*. Most of the metadata (authors, dataset title) is extracted with the ``carobiner::read_metadata`` function. Other metadata needs to be added manually. Of particular importance for experimental data is ``treatment_vars``, which needs to list the variable(s) that capture the experimental treatment. It is also important to include the publication associated with the dataset, if there is one. Here is the metadata section for this dataset.
@@ -112,7 +112,7 @@ The metadata section contains the descriptions of the dataset enriching it with
In this particular example, there is no publication linked to the dataset (``publication="NA"``), but it is important to check if there is one. An associated publication often provides additional data that can be extracted.


Data
----

Now that we have downloaded the data and created the metadata, we start processing the actual data. The goal is to create a single data.frame in which rows are experimental units (or their equivalent in a survey), columns represent variables, and cell values are measurements. The data.frame should have standard variable names and values (for character variables) or units (for numeric variables), as prescribed by the `terminag <https://github.com/reagro/terminag>`__ controlled vocabulary. Some datasets do not easily fit in a single data.frame, for example because there are multiple observations over time for an experimental unit; we will describe these elsewhere.
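
As a minimal sketch of the target shape, the example below builds a small standardized data.frame from a made-up raw table. The raw column names (``Country``, ``Crop``, ``Yield_t_ha``) and all values are hypothetical and are not taken from the example dataset. The standardized names ``country``, ``crop`` and ``yield`` (in kg/ha) follow the guidelines, but whether a value such as "groundnut" is accepted should be checked against terminag.

```r
# Hypothetical raw data, as it might look after reading a downloaded file
raw <- data.frame(
  Country = c("Nigeria", "Nigeria"),
  Crop = c("Groundnut", "Groundnut"),
  Yield_t_ha = c(1.2, 1.5),
  stringsAsFactors = FALSE
)

# One row per experimental unit, one column per standardized variable
d <- data.frame(
  country = raw$Country,
  crop = tolower(raw$Crop),
  stringsAsFactors = FALSE
)

# Convert from t/ha to the standard unit, kg/ha
d$yield <- raw$Yield_t_ha * 1000

print(d)
```

Building a new data.frame, rather than renaming columns in place, keeps only the variables that were deliberately standardized.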
211 changes: 59 additions & 152 deletions source/contribute/guidelines.rst
@@ -7,174 +7,81 @@ Guidelines

.. raw:: html

</div>
<div style="visibility: visible;">
</div>
<div style="visibility: visible;">


Before you start
----------------

- Anyone is welcome to contribute to *Carob*. It is not easy to standardize research data. Please try your best to follow the guidelines provided here. But we won’t get angry if you make mistakes, as long as you are willing to learn from them.

- When looking at a dataset you want to process, first carefully read the description provided. If there is a related publication, read the abstract and scan the Methods and Results sections. The Methods section often provides data for (constant) management variables that are not treatments. For example, if all treatments received the same amount of fertilizer, these numbers are frequently omitted from the dataset.

- Consider what the data are about. With experimental data ask: “what are the treatments”, “how are they captured”, “what are the important response variables”? All treatments (factors) must be included as one or more standard variables. There is a variable called “treatment” that may hold a combination of treatments (e.g., “NP”, “PK”), but they must also be specified in separate variables such as “N_fertilizer” and “P_fertilizer”. This seems obvious, but in many datasets the treatments are not explicitly provided as variables, and you may need to do some work. For example, you may need to translate a treatment code into multiple variables.

- To contribute to *Carob* you need to install the “carobiner” R package. You can do that with ``remotes::install_github("reagro/carobiner")``. Update the package regularly. The package contains helper functions and functions that check for compliance with the standard.

- Carob scripts are normally contributed and/or improved via a GitHub pull request (PR). Before creating a pull request, make sure that your fork is synced and that there are no conflicts. We strongly prefer PRs for a single file at a time.
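
Translating a treatment code into multiple variables can be sketched with a small lookup table, as below. The treatment codes and the application rates are invented for illustration (in practice they would come from the dataset description or the Methods section), and ``K_fertilizer`` is assumed to be the standard name by analogy with “N_fertilizer” and “P_fertilizer”.

```r
# Hypothetical records with only a combined treatment code
d <- data.frame(
  treatment = c("NP", "PK", "control", "XYZ"),
  stringsAsFactors = FALSE
)

# Hypothetical lookup table: one row per known treatment code (rates in kg/ha)
rates <- data.frame(
  treatment = c("control", "NP", "PK"),
  N_fertilizer = c(0, 60, 0),
  P_fertilizer = c(0, 30, 30),
  K_fertilizer = c(0, 0, 40),
  stringsAsFactors = FALSE
)

# Position of each record's code in the lookup table (NA if the code is unknown)
i <- match(d$treatment, rates$treatment)

# Add one standard variable per nutrient
d$N_fertilizer <- rates$N_fertilizer[i]
d$P_fertilizer <- rates$P_fertilizer[i]
d$K_fertilizer <- rates$K_fertilizer[i]

print(d)
```

Because ``match`` returns ``NA`` for codes that are not in the table, an unexpected code such as "XYZ" yields missing values instead of a guessed rate.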


Scripts
-------

- Carob scripts download and re-organize the data to make them compliant with the standard, and save the standardized data and relevant metadata on disk. See `\_template.R <https://github.com/reagro/carob/blob/master/scripts/_template.R>`__ in the *scripts* folder for the general structure of such a script.

- All original data should be downloaded from a URI (uniform resource identifier) such as a DOI or HDL. For example, ``"doi:10.7910/DVN/UNLRGC"`` is a valid URI. It is important to use this specific notation; do *not* use an HTTP address such as ``https://doi.org/10.7910/DVN/UNLRGC``. Data that does not have a URI but does have a URL (Internet address) can also be used. Data that cannot be downloaded from the Internet should be hosted somewhere. We can host it on the `carob dataverse <https://dataverse.harvard.edu/dataverse/carob/>`__. We can make exceptions for especially valuable datasets that cannot be (easily) downloaded directly.

- Each original dataset gets its own script. The script file should be named ``<nuri>.R``, where ``<nuri>`` is a normalized URI, that is, a URI without a colon or slashes. You can create these with ``carobiner::simple_uri``. For example, ``carobiner::simple_uri("doi:10.7910/DVN/UNLRGC")`` returns ``"doi_10.7910_DVN_UNLRGC"`` and the filename for the script should be ``"doi_10.7910_DVN_UNLRGC.R"``.

- Scripts and data are grouped by domain. These groups include “fertilizer”, “crop_cuts”, “maize_trials”, “rice_trials”, “wheat_trials”, “survey” and “conservation_agriculture”. There are additional requirements/checks for different groups. For example, the records in the “fertilizer” group must have the variables that specify fertilizer application rates. Group membership can be somewhat arbitrary because the groups partly overlap. That is not a problem, as the aggregated data may include records from multiple groups.

- Test your script. The last line in a script should always be a call to ``carobiner::write_files``. This function checks whether you are using the controlled vocabulary, among other things. Fix any errors or warnings to the extent possible (without guessing things you do not know or suppressing warnings). It is OK to leave some warnings if you believe you cannot fix them (perhaps because the controlled vocabulary needs to be expanded). You can also use ``carobiner::check_terms`` to evaluate compliance with the standard.

- If a dataset is associated with a publication, you can often get important additional information from the Methods section (for example, on location, fertilizer used, plant spacing). If you get values from a related publication, or derive them by your own reasoning, document where you got these values by adding comments in the script.
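
The file-naming rule can be illustrated in base R. The function ``normalize_uri`` below is a hypothetical stand-in written for this example; it is not the implementation of ``carobiner::simple_uri``, which should be used in real scripts. It only applies the rule as stated above: replace the colon and slashes with underscores.

```r
# Hypothetical stand-in for carobiner::simple_uri, for illustration only
normalize_uri <- function(uri) {
  # Replace every colon and slash with an underscore
  gsub("[:/]", "_", uri)
}

uri <- "doi:10.7910/DVN/UNLRGC"
nuri <- normalize_uri(uri)

# File name of the script for this dataset
script_name <- paste0(nuri, ".R")

print(script_name)
```

For the URI used in the guidelines, this gives the script name quoted there.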


Standardization
---------------

- The aim is to standardize all relevant variables in the original data. We generally omit variables that are measured in the field to compute a variable of interest but are not of much interest themselves. For example, we include yield (kg/ha), but not the mass of a sample that was taken to estimate it.

- We use the `terminag <https://github.com/reagro/terminag>`__ standard. If you think that terms (concepts) are missing from this standard, just add the new terms that you propose to your script and ignore the warning messages. An editor will look at your pull request (PR) and decide whether the standard needs to be expanded or changed.

- Variation between datasets is expected; they do not all have the same variables. But all records should have common variables such as country, crop, yield, longitude and latitude, even if some of their values are missing (``NA``).

- Some variables, such as ``country``, ``crop``, and ``fertilizer_type``, have controlled vocabularies that you need to use (or for which you can suggest additional terms). If there are multiple values (for example, two crops or fertilizer types), separate these with a semicolon (``;``).

- Apart from using the standard variable names, you also need to express all data in standard units. See the `terminag variables <https://github.com/reagro/terminag/tree/master/variables>`__.

- Check character variables for spelling variations (use ``unique`` and ``table``) and standardize them. You can use ``carobiner::fix_name`` in some cases.

- Make sure that variables that should be numeric are not stored as text.

- Do not guess values to make things work (for example, because a value is required, or because it needs to match a vocabulary). Instead, submit the script with warnings/errors so that we can discuss the best way to handle these.

- Store dates as text after first creating dates (``as.character(as.Date(x))``) so that they are in a standard format. You can also store years (e.g., “2023”) or year-months (e.g., “2023-06”) if that is all the information available.

- All experimental treatment variables need to be included. These variables should also be specified at the dataset level under “treatment_vars”.
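
Several of the points above (spelling variation, numbers stored as text, dates) can be sketched in base R as follows. The input values are made up, the day/month/year input format is an assumption about this hypothetical source, and ``planting_date`` is assumed to be the standard variable name.

```r
# Hypothetical, messy input
d <- data.frame(
  crop = c("Maize", "maize ", "MAIZE"),
  yield = c("1200", "1,350", "n/a"),
  planting_date = c("03/06/2023", "04/06/2023", "05/06/2023"),
  stringsAsFactors = FALSE
)

# Inspect spelling variation before changing anything
print(table(d$crop))

# Standardize case and remove stray whitespace
d$crop <- tolower(trimws(d$crop))

# Remove thousands separators and make the missing-value code explicit
y <- gsub(",", "", d$yield)
y[y == "n/a"] <- NA

# Store yield as a number, not as text
d$yield <- as.numeric(y)

# Parse the dates (assumed to be day/month/year), then store them as text
dates <- as.Date(d$planting_date, format = "%d/%m/%Y")
d$planting_date <- as.character(dates)

print(d)
```

Setting the missing-value code to ``NA`` before calling ``as.numeric`` avoids a coercion warning, so any warning that does appear points to a value that still needs attention.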


R coding style
--------------

- We rely as much as we can on base R to keep code simple and dependencies low.

- While we may use some functions from, e.g., ``dplyr`` and ``stringr``, we otherwise avoid the tidyverse dialect.

- To make code easy to read and debug, avoid ``|>`` or use it sparingly. Never use more than 2 in one statement.

- Avoid nesting function calls. Do not nest more than 2 function calls. For example, instead of nested ``ifelse`` calls, use ``%in%``, ``match`` or ``merge``.

- When using ``ifelse``, do not use the default last condition for a known case (unless it is obvious). Instead, use ``NA`` as the default for all other, unexpected, conditions. Do not indent nested ``ifelse`` statements.

- ``#Comment your code``. Document your assumptions. Document where you got numbers introduced in the script (from a publication, for example). Comments start on the line above the code that is commented on (not on the same line).
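
The point about default conditions can be illustrated with a short sketch. The season codes below are hypothetical; the example contrasts an ``ifelse`` whose last argument silently absorbs unexpected values with a flat alternative in which only known codes are assigned.

```r
# Hypothetical season codes; "X" is an unexpected value
code <- c("W", "D", "W", "X")

# Discouraged: every code other than "W", including "X", becomes "dry"
bad <- ifelse(code == "W", "wet", "dry")

# Preferred: start from NA and assign only the known cases
season <- rep(NA_character_, length(code))
season[code %in% "W"] <- "wet"
season[code %in% "D"] <- "dry"

print(data.frame(code, bad, season))
```

With the second approach, the unexpected code shows up as ``NA`` and can be investigated, rather than being stored as a plausible but unverified value.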


*Carob* is under active development. To stay current, you should frequently pull the *Carob* repo and update the ``carobiner`` package.


