file_structure_data_metadata.qmd

# File structure (data & metadafiles)

Within the [folder structure](file_organisation.html) of a `traits.build` database repository, the folder `data` contains the raw data from individual studies included in a `traits.build` database.

Records within the `data` folder are organised as coming from a particular study, defined by the `dataset_id`. Data from each study are organised into a separate folder, with two files:

- `data.csv`: a table containing the actual trait data.
- `metadata.yml`: a file that contains study metadata (source, methods, locations, and context), maps trait names and units onto standard types, and lists any substitutions applied to the data in processing.

## `data.csv`

The file `data.csv` contains raw measurements and can be in either long or wide format.

Required columns include the taxon name, the trait name (column in long format, header in wide format), units (column in long format, part of header in wide format), location (if applicable), context (if applicable), date (if available), and trait values.

It is important that all trait measurements made on the same individual or that are the mean of a species' measurements from the same location are kept linked.

- If the data is in wide format, each row should include measurements made on a single individual at a single point in time or a single species-by-location mean, with different trait values as consecutive columns.

- If the data is in long format, an additional column, `individual_id`, is required to ensure multiple trait measurements made on the same individual, or the mean of a species' measurements from the same location, are linked. If the data is in wide format and there are multiple rows of data for the same individual, an `individual_id` column should be included. These `individual_id` columns ensure that related data values remain linked.

We aim to keep the data file in the rawest form possible (i.e. with as few changes as possible) but it must be a single csv file. Additional custom R code may be required to make the file exactly compatible with the `traits.build` format, but these changes should be executed as the trait database is compiled and should be in the `metadata.yml` file under `dataset/custom_R_code` (see below). Any files used to create the submitted `data.csv` file (e.g. Excel ...) should be archived in a sub-folder within the study folder named `raw`.

## `metadata.yml`

The metadata is compiled in a `.yml` file, a structured data file where information is presented in a hierarchical format (see [Appendix for details](yaml.html)).  There are `r length(schema$metadata$elements)` values at the top hierarchical level: `r sprintf("%s", schema$metadata$elements %>% names()) %>% paste(collapse = ", ")`. These are each described below.

As a start, you may want to check out some examples from [existing studies in Austraits](https://github.com/traitecoevo/traits.build/tree/master/data), e.g. [Angevin_2010](https://github.com/traitecoevo/traits.build/blob/master/data/Angevin_2011/metadata.yml) or [Wright_2009](https://github.com/traitecoevo/traits.build/blob/master/data/Wright_2009/metadata.yml).

The [tutorial section](tutorial_datasets.html) of this book leads a new traits.build user through the process of creating `metadata.yml` files, with a lengthy [adding data](adding_data_long.html) chapter offering explanations for how to propagate the information in each of the sections.

### source

This section provides `r tolower(schema$metadata$elements$source$description)` In general we aim to reference the primary source. References are written in structured yml format, under the category `source` and then under sub-groupings `primary`, `secondary`, and `original`. A reference is designated as `secondary` if it is a second publication by the data collector that analyses the data. When the `primary` reference is a compilation of multiple sources for a meta-analysis, the original references are designated as `original`.

General guidelines for describing a source include:

- A maximum of one primary source allowed.
- Elements are names as in [bibtex format](https://en.wikipedia.org/wiki/BibTeX).
- Keys should be named in the format `Surname_year` and the primary source is almost always identical to the name given to the dataset folder. A second instance of the identical Surname_year should have the key Surname_year_2.
- One or more secondary source may be included if traits from a single dataset were presented in two different manuscripts. Multiple sources are also appropriate if an author has compiled data from a number of sources, which are not individually in the trait database, for a published or unpublished compilation.
- If your data is from an unpublished study, only include the elements that are applicable.
- If someone has transcribed a published source, the primary source will be the published work and the person who has completed the transcription will be acknowledged as the `contributor` of the dataset.

An example of a primary source that is a journal article is:

```
source:
  primary:
    key: Falster_2005_1
    bibtype: Article
    author: Daniel S. Falster, Mark Westoby
    year: 2005
    title: Alternative height strategies among 45 dicot rain forest species from tropical Queensland, Australia
    journal: Journal of Ecology
    volume: 93
    pages: 521--535
    publisher: Wiley-Blackwell
    doi: 10.1111/j.0022-0477.2005.00992.x
```


If a secondary source is included it may look like:

```
  primary:
    key: Choat_2006
    bibtype: Article
    year: '2006'
    author: B. Choat and M. C. Ball and J. G. Luly and C. F. Donnelly and J. A. M.
      Holtum
    journal: Tree Physiology
    title: Seasonal patterns of leaf gas exchange and water relations in dry rain
      forest trees of contrasting leaf phenology
    volume: '26'
    number: '5'
    pages: 657--664
    doi: 10.1093/treephys/26.5.657
  secondary:
    key: Choat_2005
    bibtype: Article
    year: '2005'
    author: Brendan Choat and Marilyn C. Ball and Jon G. Luly and Joseph A. M. Holtum
    journal: Trees
    title: Hydraulic architecture of deciduous and evergreen dry rainforest tree species
      from north-eastern Australia
    volume: '19'
    number: '3'
    pages: 305--311
    doi: 10.1007/s00468-004-0392-1
```

### contributors

This section provides `r tolower(schema$metadata$elements$contributors$description)` The following information is recorded for each data contributor:

```{r, echo=FALSE, results="show"}
schema$metadata$elements$contributors$elements$data_collectors$elements %>%
  austraits::convert_list_to_df1() %>%
  my_kable_styling()
```

An example is as follows:

```
 data_collectors:
  - last_name: Falster
    given_name: Daniel
    ORCID: 0000-0002-9814-092X
    affiliation: Evolution & Ecology Research Centre, School of Biological, Earth,
      and Environmental Sciences, UNSW Sydney, Australia
    additional_role: contact
  - last_name: Westoby
    given_name: Mark
    ORCID: 0000-0001-7690-4530
    affiliation: Department of Biological Sciences, Macquarie University, Australia
```

Note that only the database custodians should have the contributors' e-mail addresses. This information should not be made directly available to all database users or new contributors via Github.

Additional fields within contributors are:

- `Assistants`, `r tolower(schema$metadata$elements$contributors$elements$assistants$description)`
- `dataset_curators`, `r tolower(schema$metadata$elements$contributors$elements$dataset_curators$description)`

### dataset

This section includes `r tolower(schema$metadata$elements$dataset$description)`

The following elements are included under the element `dataset`:

```{r}
values <- schema$metadata$elements$dataset$values

values <- values[!(names(values) %in% c("observation_id", "entity_type", "plot_context_id", "temporal_context_id", "treatment_context_id", "replicates", "basis_of_value", "value_type"))]

for (value in names(values)) {
  sprintf("- **%s**: %s", value, values[[value]]) %>% writeLines()
}
```


Of these, the fields `collection_date`, `life_stage`, `basis_of_record`, and `measurement_remarks` can all be specified at the dataset level or the traits level (which overrides a dataset-level entry) or location level (which also overrides a dataset-level entry). In each case, they can be a fixed text value or indicate a column within the data.csv file (or generated through `custom_R_code`) that includes the relevant information.

- `life_stage`, `basis_of_record`, and `collection_date` are usually included under `metadata$dataset` unless they vary by trait.

- `entity_type`, `replicates`, `basis_of_value`, and `value_type` are usually different across traits and are usually mapped under the `metadata$traits` section (see below), but are allowed to be specified for the entire dataset in this section.

- `traits` and `value` are only specified in metadata$dataset for **long-format** datasets.

- `measurement_remarks` and `individual_id` are only included if required. They are absent from the majority of datasets.

An example is as follows:

```
  data_is_long_format: no
  custom_R_code: '
    data %>%
      mutate(
        location_name = "Howard River catchment",
        date = date %>% mdy()
      ) %>%
      arrange(date) %>%
      group_by(Tree) %>%
        mutate(observation_number = dplyr::row_number()) %>%
      ungroup()
  '
  collection_date: date
  taxon_name: species
  context_name: context
  location_name: location_name
  individual_id: Tree
  description: Measurements of stem CO2 efflux and leaf gas exchange in a tropical
    savanna ecosystem in northern Australia, and assessed the impact of fire on these
    processes.
  basis_of_record: field
  life_stage: adult
  sampling_strategy: The stem CO2 efflux was initially measured at two locations,
    each of which was nested within a 3 km 2 plot...
  original_file: leaf_summary.xls, Rbranch summary2.xls, and Rstem summary6.xls submitted
    by Lucas Cernusak and archived in the raw data folder and GoogleDrive folder.
  notes: none
```

A common use of the `custom_R_code` is to automate the conversion of a verbal description of flowering or fruiting periods into the supported trait values. It might also be used if values for a single trait are expressed across multiple columns and need to be merged. See `Catford_2014` as an example of this. The [adding data](adding_data.html) vignette provides additional examples of code regularly implemented in `custom_R_code`, including functions specifically that were developed for data manipulations within the AusTraits database that are now in the file `scripts\custom.R` available at the [traits.build-template](https://github.com/traitecoevo/traits.build-template/blob/master/R/custom_R_code.R) repository.

### locations
This section provides `r stringr::str_replace(schema$metadata$elements$locations$description,"A","a")`

Although the properties listed under each location are not part of a controlled vocabulary, it is best practice to align with in-use properties whenever possible. These can be identified by running `database$locations %>% distinct(location_property)`.

An example of how a location and its properties, and the value of each property are listed (modified from Vesk_2019 in the AusTraits database), is:
```
  Round Hill-Nombinnie Nature Reserve:
    latitude (deg): -32.965
    longitude (deg): 146.161
    precipitation, MAP (mm): 370
    temperature, summer mean (C): 32.5
    temperature, winter mean (C): 14.2
    soil type: loamy red sands light red clays and light red browns earths
    description: predominantly open Callitris glaucophylla - Eucalyptus populnea woodland
      and Eucalyptus dumosa - E. socialis shrub mallee woodland
    fire frequency (years): 5-20 years
```

### contexts
This section provides `r stringr::str_replace(schema$metadata$elements$contexts$description,"C","c")`

Within the context section is a list of contextual properties, each encapsulating information read in through a different column or created through `custom_R_code` or as elements within specific `traits` (see below).

```{r}
values <- schema$metadata$elements$contexts$elements
for (value in names(values)) {
  sprintf("- **%s**: %s", value, values[[value]]) %>% writeLines()
}
```

If the contextual values read in are appropriate and no substitutions are required, the field `find` can be omitted, with the values from the data.csv column entered under the field `value`. The field `description` can likewise be omitted if it is redundant; for instance, if the values are simply sequential observation numbers, times of day, or taxon names (e.g. insect host plants).

As with location, the context properties are not part of a controlled vocabulary, but it is best practice to align syntax with in-use properties whenever possible. These can be identified by running `database$contexts %>% distinct(context_property)`.

An example of how the contexts for a study are formatted (modified from Crous_2013 in the AusTraits database), is:

```
contexts:
- context_property: sampling season
  category: temporal_context
  var_in: month
  values:
  - find: AUG
    value: August
    description: August (late winter)
  - find: DEC
    value: December
    description: December (early summer)
  - find: FEB
    value: February
    description: February (late summer)
- context_property: temperature treatment
  category: treatment_context
  var_in: Temp-trt
  values:
  - value: ambient
    description: Plants grown at ambient temperatures; Jan average max = 29.4 dec
      C / July average min = 3.2 dec C.
  - value: elevated
    description: Plants grown 3 deg C above ambient temperatures.
- context_property: CO2 treatment
  category: treatment_context
  var_in: CO2_Treat
  values:
  - find: ambient CO2
    value: 400 ppm
    description: Plants grown at ambient CO2 (400 ppm).
  - find: added CO2
    value: 640 ppm
    description: Plants grown at elevated CO2 (640 ppm); 240 ppm above ambient.
- context_property: measurement temperature
  category: method_context
  var_in: method_context
  values:
  - find: Measurement made at 20°C
    value: 20°C
    description: Measurement made at 20°C
  - find: Measurement made at 25°C
    value: 25°C
    description: Measurement made at 25°C
```


### traits

This section provides `r stringr::str_replace(schema$metadata$elements$traits$description,"A","a")`

For each trait included in the trait dictionary in a specific trait database, there is the following information:

```{r}
values <- schema$metadata$elements$traits$elements
for (value in names(values)) {
  sprintf("- **%s**: %s", value, values[[value]]) %>% writeLines()
}
```

The elements `trait_name`, `entity_type`, `value_type`, `basis_of_record`, and `basis of value` are controlled vocabularies; the values for these elements must be from the list of allowable values. Those for traits are listed in the `traits.yml` [file](https://github.com/traitecoevo/traits.build/blob/master/config/traits.yml) or [vignette](trait_definitions.html). For the other elements, see the [database structure](database_structure.html) vignette.

The fields `replicates`, `basis_of_value`, `value_type`, `life_stage`, `basis_of_record`, and `measurement_remarks` can all be specified at the dataset level or the traits level (which overrides a dataset-level entry). In each case, they can be a fixed text value or indicate a column (within the `data.csv` file or generated through `custom_R_code`) that includes the relevant information. In addition, fields can be added to specify a specific context (most commonly a `method context`, but occasionally a `temporal context`). If such a field is added, the same name must appear in both the contexts section and for some (or all) of the traits.

Two examples from the AusTraits database are as follows:

```
- var_in: LeafP.m
  unit_in: mg/g
  trait_name: leaf_P_per_dry_mass
  entity_type: individual                   # fixed value
  value_type: value_type_column             # referencing a column
  basis_of_value: measurement               # fixed value
  replicates: count                         # referencing a column
  methods: Oven-dried leaf material was used for determination of total leaf nitrogen
    and phosphorus. Dried ground leaf material was hot-digested in acid-peroxide before
    colorimetric analysis using a flow injection system (QuikChem 8500, Lachat Instruments,
    Loveland, Colorado, USA).

```
 and

```
- var_in: Jmax25
  unit_in: umol/m2/s
  trait_name: Jmax_per_area
  entity_type: individual                    # fixed value
  value_type: raw                            # fixed value
  basis_of_value: measurement                # fixed value
  replicates: 1                              # fixed value
  method_context: 25C                        # optional field
  methods: Controlled photosynthetic CO2 response curve measurements were made using
    Li-Cor 6400 portable infrared gas analysers (LiCor Inc., Lincoln, NE, USA). CO2
    response curves of net CO2 assimilation (Anet) were developed at a constant temperature
    (termed 'Anet-Ci curves') for intact leaves within each tree chamber. These Anet-Ci
    curve measurements progressed at four to five specified leaf temperatures for
    the same leaf (i.e. one leaf per chamber) in each of three seasons (early summer,
    December 2010; late summer, February 2011...

```


### substitutions


This section provides `r tolower(schema$metadata$elements$substitutions$description)`

Substitutions are required whenever the exact word(s) used to describe a categorical trait value in a database's trait dictionary is different from the vocabulary used by the author in the `data.csv` file. It is preferable to align vocabulary using `substitutions` rather than changing the `data.csv` file. Take a look at the [trait definitions file](https://github.com/traitecoevo/austraits.build/blob/master/config/traits.yml) in AusTraits to see the list of supported values for each trait.

Each substitution is documented using the following elements:

```{r}
values <- schema$metadata$elements$substitutions$values
for (value in names(values)) {
  sprintf("- **%s**: %s", value, values[[value]]) %>% writeLines()
}
```

An example from the AusTraits database is as follows:

```
substitutions:
- trait_name: life_history
  find: p
  replace: perennial
- trait_name: plant_growth_form
  find: s
  replace: shrub
- ...
```

### taxonomic_updates

This section provides `r tolower(schema$metadata$elements$taxonomic_updates$description)`

Each substitution is documented using the following elements:

```{r}
values <- schema$metadata$elements$taxonomic_updates$values
for (value in names(values)) {
  sprintf("- **%s**: %s", value, values[[value]]) %>% writeLines()
}
```

Some examples of taxonomic updates in AusTraits are as follows:

```
taxonomic_updates:
- find: Drummondita rubroviridis
  replace: Drummondita rubriviridis
  reason: match_07_fuzzy. Fuzzy alignment with accepted canonical name in APC (2022-11-21)
  taxonomic_resolution: Species
- find: Acacia ancistrophylla/sclerophylla
  replace: Acacia sp. [Acacia ancistrophylla/sclerophylla; White_2020]
  reason: match_04. Rewording taxon where `/` indicates uncertain species identification
    to align with `APC accepted` genus (2022-11-10)
  taxonomic_resolution: genus
- find: Polyalthia (Wyvur)
  replace: Polyalthia sp. (Wyvuri B.P.Hyland RFK2632)
  reason: match_15_fuzzy. Fuzzy match alignment with species-level canonical name
    in `APC known` when everything except first 2 words ignored (2022-11-10)
  taxonomic_resolution: Species
```

In AusTraits, algorithms generate a base taxon_list that includes the information required to automatically align outdated taxonomy and taxonomic synonyms to their currently accepted scientific name, so such adjustments are not documented as substitutions in AusTraits. As taxonomic references vary greatly across the Tree of Life, each `traits.build` database will have their own scripts to build a taxon list.

### questions

This section provides `r tolower(schema$metadata$elements$questions)`

An example is as follows:

```
questions:
  questions for author: Triglochin procera has very different seed masses in the main traits spreadsheet and the field seeds worksheet. Which is correct? There are a number of species with values in the field leaves worksheet that are absent in the main traits worksheet - we have included this data into Austraits; please advise if this was inappropriate.
  austraits: need to map aquatic_terrestrial onto an actual trait once one is created.
```
## `R/custom_R_code.R`

The  [`austraits.build`](https://github.com/traitecoevo/austraits.build/) compilation contains an extra folder, `R` containing a file `custom_R_code.R`. This file documents any custom functions used in the compilation, called as part of the [`custom_R_code` section](#dataset) of metadata files. These functions are also available in the [`traits.build-template` repo](https://github.com/traitecoevo/traits.build-template/R/)

It includes functions to:

- Replace duplicate trait values with NA's

- Convert various month formats into a string of 12 NY's (to document flowering, fruiting, recruitment times)

- Move specific categorical trait values to a second trait

For instance, there are many datasets where a species-level trait measurement is repeated across many rows of data but should only be incorporated into the dataset a single time:

```
custom_R_code: `
  data %>%
    mutate(
      across(c(`plant_growth_form`, `leaf_shape`), replace_duplicates_with_NA)
    )
`
```