# File structure (data & metadata files)
Within the [folder structure](file_organisation.html) of a `traits.build` database repository, the folder `data` contains the raw data from individual studies included in a `traits.build` database.
Records within the `data` folder are organised as coming from a particular study, defined by the `dataset_id`. Data from each study are organised into a separate folder, with two files:
- `data.csv`: a table containing the actual trait data.
- `metadata.yml`: a file that contains study metadata (source, methods, locations, and context), maps trait names and units onto standard types, and lists any substitutions applied to the data in processing.
## `data.csv`
The file `data.csv` contains raw measurements and can be in either long or wide format.
Required columns include the taxon name, the trait name (column in long format, header in wide format), units (column in long format, part of header in wide format), location (if applicable), context (if applicable), date (if available), and trait values.
It is important that all trait measurements made on the same individual or that are the mean of a species' measurements from the same location are kept linked.
- If the data is in wide format, each row should include measurements made on a single individual at a single point in time or a single species-by-location mean, with different trait values as consecutive columns.
- If the data is in long format, an additional column, `individual_id`, is required to ensure multiple trait measurements made on the same individual, or the mean of a species' measurements from the same location, are linked. If the data is in wide format and there are multiple rows of data for the same individual, an `individual_id` column should be included. These `individual_id` columns ensure that related data values remain linked.
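To illustrate why `individual_id` matters, the following minimal sketch uses base R and made-up values (the identifiers, taxon, trait names, and numbers are all hypothetical) to pivot long-format records into one row per individual. Without the identifier there would be no way to tell which measurements belong together.

```r
# Hypothetical long-format data: one row per trait measurement
long_data <- data.frame(
  individual_id = c("tree_1", "tree_1", "tree_2", "tree_2"),
  taxon_name = rep("Acacia aneura", 4),
  trait_name = c("leaf_area", "leaf_mass", "leaf_area", "leaf_mass"),
  value = c(12.5, 0.21, 10.1, 0.18),
  stringsAsFactors = FALSE
)

# Pivot to wide format: measurements sharing an individual_id end up on one row
wide_data <- reshape(
  long_data,
  idvar = c("individual_id", "taxon_name"),
  timevar = "trait_name",
  direction = "wide"
)

print(wide_data)
```

Each individual now occupies a single row, with columns `value.leaf_area` and `value.leaf_mass` holding its linked measurements.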
We aim to keep the data file in the rawest form possible (i.e. with as few changes as possible) but it must be a single csv file. Additional custom R code may be required to make the file exactly compatible with the `traits.build` format, but these changes should be executed as the trait database is compiled and should be in the `metadata.yml` file under `dataset/custom_R_code` (see below). Any files used to create the submitted `data.csv` file (e.g. Excel ...) should be archived in a sub-folder within the study folder named `raw`.
## `metadata.yml`
The metadata is compiled in a `.yml` file, a structured data file where information is presented in a hierarchical format (see [Appendix for details](yaml.html)). There are `r length(schema$metadata$elements)` values at the top hierarchical level: `r sprintf("%s", schema$metadata$elements %>% names()) %>% paste(collapse = ", ")`. These are each described below.
As a start, you may want to check out some examples from [existing studies in AusTraits](https://github.com/traitecoevo/traits.build/tree/master/data), e.g. [Angevin_2011](https://github.com/traitecoevo/traits.build/blob/master/data/Angevin_2011/metadata.yml) or [Wright_2009](https://github.com/traitecoevo/traits.build/blob/master/data/Wright_2009/metadata.yml).
The [tutorial section](tutorial_datasets.html) of this book leads a new `traits.build` user through the process of creating `metadata.yml` files, with a lengthy [adding data](adding_data_long.html) chapter explaining how to populate the information in each of the sections.
### source
This section provides `r tolower(schema$metadata$elements$source$description)` In general we aim to reference the primary source. References are written in structured yml format, under the category `source` and then under sub-groupings `primary`, `secondary`, and `original`. A reference is designated as `secondary` if it is a second publication by the data collector that analyses the data. When the `primary` reference is a compilation of multiple sources for a meta-analysis, the original references are designated as `original`.
General guidelines for describing a source include:
- A maximum of one primary source is allowed.
- Elements are named as in [bibtex format](https://en.wikipedia.org/wiki/BibTeX).
- Keys should be named in the format `Surname_year`, and the primary source key is almost always identical to the name given to the dataset folder. A second instance of an identical `Surname_year` should have the key `Surname_year_2`.
- One or more secondary sources may be included if traits from a single dataset were presented in two different manuscripts. Multiple sources are also appropriate if an author has compiled data from a number of sources, which are not individually in the trait database, for a published or unpublished compilation.
- If your data is from an unpublished study, only include the elements that are applicable.
- If someone has transcribed a published source, the primary source will be the published work and the person who has completed the transcription will be acknowledged as the `contributor` of the dataset.
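The key-naming convention above can be sketched as a small helper. This is only an illustration: `make_source_key()` is a hypothetical function, not part of `traits.build`, and it assumes that the existing keys are available as a character vector.

```r
# Hypothetical helper: build a `Surname_year` key, adding a numeric suffix
# (`_2`, `_3`, ...) when the plain key is already in use
make_source_key <- function(surname, year, existing_keys = character()) {
  key <- paste(surname, year, sep = "_")
  if (!(key %in% existing_keys)) {
    return(key)
  }
  n <- 2
  while (paste(key, n, sep = "_") %in% existing_keys) {
    n <- n + 1
  }
  paste(key, n, sep = "_")
}

make_source_key("Falster", 2005)
make_source_key("Falster", 2005, existing_keys = "Falster_2005")
```

The first call returns the plain key; the second returns the key with the `_2` suffix because the plain key is already taken.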
An example of a primary source that is a journal article is:
```
source:
  primary:
    key: Falster_2005_1
    bibtype: Article
    author: Daniel S. Falster, Mark Westoby
    year: 2005
    title: Alternative height strategies among 45 dicot rain forest species from tropical Queensland, Australia
    journal: Journal of Ecology
    volume: 93
    pages: 521--535
    publisher: Wiley-Blackwell
    doi: 10.1111/j.0022-0477.2005.00992.x
```
If a secondary source is included it may look like:
```
primary:
  key: Choat_2006
  bibtype: Article
  year: '2006'
  author: B. Choat and M. C. Ball and J. G. Luly and C. F. Donnelly and J. A. M.
    Holtum
  journal: Tree Physiology
  title: Seasonal patterns of leaf gas exchange and water relations in dry rain
    forest trees of contrasting leaf phenology
  volume: '26'
  number: '5'
  pages: 657--664
  doi: 10.1093/treephys/26.5.657
secondary:
  key: Choat_2005
  bibtype: Article
  year: '2005'
  author: Brendan Choat and Marilyn C. Ball and Jon G. Luly and Joseph A. M. Holtum
  journal: Trees
  title: Hydraulic architecture of deciduous and evergreen dry rainforest tree species
    from north-eastern Australia
  volume: '19'
  number: '3'
  pages: 305--311
  doi: 10.1007/s00468-004-0392-1
```
### contributors
This section provides `r tolower(schema$metadata$elements$contributors$description)` The following information is recorded for each data contributor:
```{r, echo=FALSE, results="show"}
schema$metadata$elements$contributors$elements$data_collectors$elements %>%
  austraits::convert_list_to_df1() %>%
  my_kable_styling()
```
An example is as follows:
```
data_collectors:
- last_name: Falster
  given_name: Daniel
  ORCID: 0000-0002-9814-092X
  affiliation: Evolution & Ecology Research Centre, School of Biological, Earth,
    and Environmental Sciences, UNSW Sydney, Australia
  additional_role: contact
- last_name: Westoby
  given_name: Mark
  ORCID: 0000-0001-7690-4530
  affiliation: Department of Biological Sciences, Macquarie University, Australia
```
Note that only the database custodians should have the contributors' e-mail addresses. This information should not be made directly available to all database users or new contributors via GitHub.
Additional fields within contributors are:
- `assistants`, `r tolower(schema$metadata$elements$contributors$elements$assistants$description)`
- `dataset_curators`, `r tolower(schema$metadata$elements$contributors$elements$dataset_curators$description)`
### dataset
This section includes `r tolower(schema$metadata$elements$dataset$description)`
The following elements are included under the element `dataset`:
```{r}
values <- schema$metadata$elements$dataset$values
values <- values[!(names(values) %in% c("observation_id", "entity_type", "plot_context_id", "temporal_context_id", "treatment_context_id", "replicates", "basis_of_value", "value_type"))]
for (value in names(values)) {
  sprintf("- **%s**: %s", value, values[[value]]) %>% writeLines()
}
```
Of these, the fields `collection_date`, `life_stage`, `basis_of_record`, and `measurement_remarks` can all be specified at the dataset level, at the traits level, or at the location level; a traits-level or location-level entry overrides a dataset-level entry. In each case, they can be a fixed text value or indicate a column within the `data.csv` file (or generated through `custom_R_code`) that includes the relevant information.
- `life_stage`, `basis_of_record`, and `collection_date` are usually included under `metadata$dataset` unless they vary by trait.
- `entity_type`, `replicates`, `basis_of_value`, and `value_type` are usually different across traits and are usually mapped under the `metadata$traits` section (see below), but are allowed to be specified for the entire dataset in this section.
- `traits` and `value` are only specified in `metadata$dataset` for **long-format** datasets.
- `measurement_remarks` and `individual_id` are only included if required. They are absent from the majority of datasets.
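The override and "fixed value or column name" rules described above can be sketched as follows. This is a simplified illustration and not the `traits.build` implementation: `resolve_field()` is a hypothetical helper, the metadata lists and data are made up, and the location-level override is left out for brevity.

```r
# Hypothetical helper: resolve a metadata field to one value per data row
resolve_field <- function(field, dataset_meta, trait_meta, data) {
  # A trait-level entry takes precedence over a dataset-level entry
  setting <- trait_meta[[field]]
  if (is.null(setting)) {
    setting <- dataset_meta[[field]]
  }
  # Field not specified at either level
  if (is.null(setting)) {
    return(rep(NA_character_, nrow(data)))
  }
  if (setting %in% names(data)) {
    # The entry names a column in the data: read values row by row
    as.character(data[[setting]])
  } else {
    # Otherwise treat the entry as a fixed text value
    rep(as.character(setting), nrow(data))
  }
}

# Made-up inputs for illustration
data <- data.frame(
  species = c("A", "B"),
  stage = c("adult", "sapling"),
  stringsAsFactors = FALSE
)
dataset_meta <- list(life_stage = "stage", basis_of_record = "field")
trait_meta <- list(basis_of_record = "lab")

resolve_field("life_stage", dataset_meta, trait_meta, data)
resolve_field("basis_of_record", dataset_meta, trait_meta, data)
```

Here `life_stage` is read from the `stage` column, while `basis_of_record` takes the trait-level fixed value rather than the dataset-level one.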
An example is as follows:
```
data_is_long_format: no
custom_R_code: '
  data %>%
    mutate(
      location_name = "Howard River catchment",
      date = date %>% mdy()
    ) %>%
    arrange(date) %>%
    group_by(Tree) %>%
    mutate(observation_number = dplyr::row_number()) %>%
    ungroup()
  '
collection_date: date
taxon_name: species
context_name: context
location_name: location_name
individual_id: Tree
description: Measurements of stem CO2 efflux and leaf gas exchange in a tropical
  savanna ecosystem in northern Australia, and assessed the impact of fire on these
  processes.
basis_of_record: field
life_stage: adult
sampling_strategy: The stem CO2 efflux was initially measured at two locations,
  each of which was nested within a 3 km 2 plot...
original_file: leaf_summary.xls, Rbranch summary2.xls, and Rstem summary6.xls submitted
  by Lucas Cernusak and archived in the raw data folder and GoogleDrive folder.
notes: none
```
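To make the effect of the `custom_R_code` in this example concrete, here is a rough base R equivalent run on a tiny made-up table. The tree identifiers and dates are hypothetical, and `as.Date()` stands in for `mdy()` on the assumption that dates arrive as month/day/year strings.

```r
# Made-up raw data: two trees, each measured on two dates
data <- data.frame(
  Tree = c("T1", "T2", "T1", "T2"),
  date = c("03/15/2010", "01/20/2010", "01/20/2010", "03/15/2010"),
  stringsAsFactors = FALSE
)

# Add a fixed location name and parse the month/day/year strings
data$location_name <- "Howard River catchment"
data$date <- as.Date(data$date, format = "%m/%d/%Y")

# Sort by date, then number the observations within each tree
data <- data[order(data$date), ]
data$observation_number <- ave(seq_len(nrow(data)), data$Tree, FUN = seq_along)

print(data)
```

After this step each row carries the tree it belongs to (`Tree`, used as `individual_id`) and the order in which that tree was measured (`observation_number`).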
A common use of `custom_R_code` is to automate the conversion of a verbal description of flowering or fruiting periods into the supported trait values. It might also be used if values for a single trait are spread across multiple columns and need to be merged; see `Catford_2014` for an example. The [adding data](adding_data.html) vignette provides additional examples of code regularly implemented in `custom_R_code`, including functions that were developed specifically for data manipulations within the AusTraits database and are now in the file `R/custom_R_code.R`, available at the [traits.build-template](https://github.com/traitecoevo/traits.build-template/blob/master/R/custom_R_code.R) repository.
### locations
This section provides `r stringr::str_replace(schema$metadata$elements$locations$description,"A","a")`
Although the properties listed under each location are not part of a controlled vocabulary, it is best practice to align with in-use properties whenever possible. These can be identified by running `database$locations %>% distinct(location_property)`.
An example of how a location, its properties, and the value of each property are listed (modified from Vesk_2019 in the AusTraits database) is:
```
Round Hill-Nombinnie Nature Reserve:
  latitude (deg): -32.965
  longitude (deg): 146.161
  precipitation, MAP (mm): 370
  temperature, summer mean (C): 32.5
  temperature, winter mean (C): 14.2
  soil type: loamy red sands light red clays and light red browns earths
  description: predominantly open Callitris glaucophylla - Eucalyptus populnea woodland
    and Eucalyptus dumosa - E. socialis shrub mallee woodland
  fire frequency (years): 5-20 years
```
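The nested structure above (location, then property, then value) maps naturally onto a long table with one row per location-property pair. The sketch below shows one way to do that in base R, using three of the properties from the example. It is an illustration only and does not reproduce how `traits.build` builds its locations table.

```r
# A subset of the location example above, written as a nested R list
locations <- list(
  "Round Hill-Nombinnie Nature Reserve" = list(
    "latitude (deg)" = -32.965,
    "longitude (deg)" = 146.161,
    "precipitation, MAP (mm)" = 370
  )
)

# Flatten to one row per location-property pair
locations_long <- do.call(rbind, lapply(names(locations), function(site) {
  props <- locations[[site]]
  data.frame(
    location_name = site,
    location_property = names(props),
    value = unname(vapply(props, as.character, character(1))),
    stringsAsFactors = FALSE
  )
}))

# List the properties in use
unique(locations_long$location_property)
```

Listing the distinct properties this way makes it easier to spot near-duplicates (for example, two spellings of the same property) before adding a new dataset.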
### contexts
This section provides `r stringr::str_replace(schema$metadata$elements$contexts$description,"C","c")`
Within the context section is a list of contextual properties, each encapsulating information read in through a different column or created through `custom_R_code` or as elements within specific `traits` (see below).
```{r}
values <- schema$metadata$elements$contexts$elements
for (value in names(values)) {
  sprintf("- **%s**: %s", value, values[[value]]) %>% writeLines()
}
```
If the contextual values read in are appropriate and no substitutions are required, the field `find` can be omitted, with the values from the data.csv column entered under the field `value`. The field `description` can likewise be omitted if it is redundant; for instance, if the values are simply sequential observation numbers, times of day, or taxon names (e.g. insect host plants).
As with location, the context properties are not part of a controlled vocabulary, but it is best practice to align syntax with in-use properties whenever possible. These can be identified by running `database$contexts %>% distinct(context_property)`.
An example of how the contexts for a study are formatted (modified from Crous_2013 in the AusTraits database), is:
```
contexts:
- context_property: sampling season
  category: temporal_context
  var_in: month
  values:
  - find: AUG
    value: August
    description: August (late winter)
  - find: DEC
    value: December
    description: December (early summer)
  - find: FEB
    value: February
    description: February (late summer)
- context_property: temperature treatment
  category: treatment_context
  var_in: Temp-trt
  values:
  - value: ambient
    description: Plants grown at ambient temperatures; Jan average max = 29.4 deg
      C / July average min = 3.2 deg C.
  - value: elevated
    description: Plants grown 3 deg C above ambient temperatures.
- context_property: CO2 treatment
  category: treatment_context
  var_in: CO2_Treat
  values:
  - find: ambient CO2
    value: 400 ppm
    description: Plants grown at ambient CO2 (400 ppm).
  - find: added CO2
    value: 640 ppm
    description: Plants grown at elevated CO2 (640 ppm); 240 ppm above ambient.
- context_property: measurement temperature
  category: method_context
  var_in: method_context
  values:
  - find: Measurement made at 20°C
    value: 20°C
    description: Measurement made at 20°C
  - find: Measurement made at 25°C
    value: 25°C
    description: Measurement made at 25°C
```
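The `find` and `value` pairs act as a lookup table applied to the column named in `var_in`. A minimal base R sketch of that idea, using the `sampling season` entries above and a made-up `month` column, might look like this (it is not the `traits.build` code itself):

```r
# Lookup table taken from the `sampling season` context above
context_values <- data.frame(
  find = c("AUG", "DEC", "FEB"),
  value = c("August", "December", "February"),
  stringsAsFactors = FALSE
)

# Made-up contents of the `month` column in data.csv
month <- c("AUG", "FEB", "AUG", "DEC")

# Replace each raw entry with its standardised value
recoded <- context_values$value[match(month, context_values$find)]
recoded
```

When `find` is omitted, as in the `temperature treatment` context, no lookup is needed because the raw column entries already equal the values listed under `value`.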
### traits
This section provides `r stringr::str_replace(schema$metadata$elements$traits$description,"A","a")`
For each trait included in the trait dictionary in a specific trait database, there is the following information:
```{r}
values <- schema$metadata$elements$traits$elements
for (value in names(values)) {
  sprintf("- **%s**: %s", value, values[[value]]) %>% writeLines()
}
```
The elements `trait_name`, `entity_type`, `value_type`, `basis_of_record`, and `basis_of_value` are controlled vocabularies; the values for these elements must be from the list of allowable values. Those for traits are listed in the `traits.yml` [file](https://github.com/traitecoevo/traits.build/blob/master/config/traits.yml) or [vignette](trait_definitions.html). For the other elements, see the [database structure](database_structure.html) vignette.
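A controlled vocabulary can be checked with a simple membership test. The sketch below is a hypothetical helper, not a `traits.build` function, and the allowed values shown are an illustrative subset rather than the real list, which lives in the schema and trait dictionary.

```r
# Hypothetical helper: return the entries that are not in the allowed set
check_vocabulary <- function(values, allowed) {
  unique(values[!is.na(values) & !(values %in% allowed)])
}

# Illustrative subset only; consult the schema for the actual allowed values
allowed_value_type <- c("raw", "mean")

check_vocabulary(c("raw", "mean", "average", NA, "raw"), allowed_value_type)
```

The call returns only `"average"`, the one entry that would need to be corrected or mapped before the dataset is compiled.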
The fields `replicates`, `basis_of_value`, `value_type`, `life_stage`, `basis_of_record`, and `measurement_remarks` can all be specified at the dataset level or the traits level (which overrides a dataset-level entry). In each case, they can be a fixed text value or indicate a column (within the `data.csv` file or generated through `custom_R_code`) that includes the relevant information. In addition, fields can be added to specify a specific context (most commonly a `method_context`, but occasionally a `temporal_context`). If such a field is added, the same name must appear both in the contexts section and for some (or all) of the traits.
Two examples from the AusTraits database are as follows:
```
- var_in: LeafP.m
  unit_in: mg/g
  trait_name: leaf_P_per_dry_mass
  entity_type: individual # fixed value
  value_type: value_type_column # referencing a column
  basis_of_value: measurement # fixed value
  replicates: count # referencing a column
  methods: Oven-dried leaf material was used for determination of total leaf nitrogen
    and phosphorus. Dried ground leaf material was hot-digested in acid-peroxide before
    colorimetric analysis using a flow injection system (QuikChem 8500, Lachat Instruments,
    Loveland, Colorado, USA).
```
and
```
- var_in: Jmax25
  unit_in: umol/m2/s
  trait_name: Jmax_per_area
  entity_type: individual # fixed value
  value_type: raw # fixed value
  basis_of_value: measurement # fixed value
  replicates: 1 # fixed value
  method_context: 25C # optional field
  methods: Controlled photosynthetic CO2 response curve measurements were made using
    Li-Cor 6400 portable infrared gas analysers (LiCor Inc., Lincoln, NE, USA). CO2
    response curves of net CO2 assimilation (Anet) were developed at a constant temperature
    (termed 'Anet-Ci curves') for intact leaves within each tree chamber. These Anet-Ci
    curve measurements progressed at four to five specified leaf temperatures for
    the same leaf (i.e. one leaf per chamber) in each of three seasons (early summer,
    December 2010; late summer, February 2011...
```
### substitutions
This section provides `r tolower(schema$metadata$elements$substitutions$description)`
Substitutions are required whenever the exact word(s) used to describe a categorical trait value in a database's trait dictionary is different from the vocabulary used by the author in the `data.csv` file. It is preferable to align vocabulary using `substitutions` rather than changing the `data.csv` file. Take a look at the [trait definitions file](https://github.com/traitecoevo/austraits.build/blob/master/config/traits.yml) in AusTraits to see the list of supported values for each trait.
Each substitution is documented using the following elements:
```{r}
values <- schema$metadata$elements$substitutions$values
for (value in names(values)) {
  sprintf("- **%s**: %s", value, values[[value]]) %>% writeLines()
}
```
An example from the AusTraits database is as follows:
```
substitutions:
- trait_name: life_history
  find: p
  replace: perennial
- trait_name: plant_growth_form
  find: s
  replace: shrub
- ...
```
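Conceptually, each substitution rewrites a value only where both the trait name and the original text match. The following base R sketch applies the two substitutions above to a small made-up table; it illustrates the idea and is not the code `traits.build` runs.

```r
# The two substitutions from the example above
substitutions <- data.frame(
  trait_name = c("life_history", "plant_growth_form"),
  find = c("p", "s"),
  replace = c("perennial", "shrub"),
  stringsAsFactors = FALSE
)

# Made-up trait records as they might appear after reading data.csv
traits <- data.frame(
  trait_name = c("life_history", "plant_growth_form", "plant_growth_form"),
  value = c("p", "s", "tree"),
  stringsAsFactors = FALSE
)

# Apply each substitution only to rows matching both trait_name and find
for (i in seq_len(nrow(substitutions))) {
  hit <- traits$trait_name == substitutions$trait_name[i] &
    traits$value == substitutions$find[i]
  traits$value[hit] <- substitutions$replace[i]
}

traits$value
```

Matching on `trait_name` as well as `find` matters because the same abbreviation can mean different things for different traits.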
### taxonomic_updates
This section provides `r tolower(schema$metadata$elements$taxonomic_updates$description)`
Each taxonomic update is documented using the following elements:
```{r}
values <- schema$metadata$elements$taxonomic_updates$values
for (value in names(values)) {
  sprintf("- **%s**: %s", value, values[[value]]) %>% writeLines()
}
```
Some examples of taxonomic updates in AusTraits are as follows:
```
taxonomic_updates:
- find: Drummondita rubroviridis
  replace: Drummondita rubriviridis
  reason: match_07_fuzzy. Fuzzy alignment with accepted canonical name in APC (2022-11-21)
  taxonomic_resolution: Species
- find: Acacia ancistrophylla/sclerophylla
  replace: Acacia sp. [Acacia ancistrophylla/sclerophylla; White_2020]
  reason: match_04. Rewording taxon where `/` indicates uncertain species identification
    to align with `APC accepted` genus (2022-11-10)
  taxonomic_resolution: genus
- find: Polyalthia (Wyvur)
  replace: Polyalthia sp. (Wyvuri B.P.Hyland RFK2632)
  reason: match_15_fuzzy. Fuzzy match alignment with species-level canonical name
    in `APC known` when everything except first 2 words ignored (2022-11-10)
  taxonomic_resolution: Species
```
In AusTraits, algorithms generate a base `taxon_list` that includes the information required to automatically align outdated taxonomy and taxonomic synonyms to their currently accepted scientific name, so such adjustments are not documented as substitutions in AusTraits. As taxonomic references vary greatly across the Tree of Life, each `traits.build` database will have its own scripts to build a taxon list.
### questions
This section provides `r tolower(schema$metadata$elements$questions)`
An example is as follows:
```
questions:
  questions for author: Triglochin procera has very different seed masses in the main traits spreadsheet and the field seeds worksheet. Which is correct? There are a number of species with values in the field leaves worksheet that are absent in the main traits worksheet - we have included this data into Austraits; please advise if this was inappropriate.
  austraits: need to map aquatic_terrestrial onto an actual trait once one is created.
```
## `R/custom_R_code.R`
The [`austraits.build`](https://github.com/traitecoevo/austraits.build/) compilation contains an extra folder, `R`, containing a file `custom_R_code.R`. This file documents any custom functions used in the compilation, called as part of the [`custom_R_code` section](#dataset) of metadata files. These functions are also available in the [`traits.build-template` repo](https://github.com/traitecoevo/traits.build-template/R/).
It includes functions to:
- Replace duplicate trait values with `NA`
- Convert various month formats into a 12-character string of `N`s and `Y`s (to document flowering, fruiting, and recruitment times)
- Move specific categorical trait values to a second trait
For instance, there are many datasets where a species-level trait measurement is repeated across many rows of data but should only be incorporated into the dataset a single time:
```
custom_R_code: '
  data %>%
    mutate(
      across(c(`plant_growth_form`, `leaf_shape`), replace_duplicates_with_NA)
    )
  '
```
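To give a feel for what helpers like those listed above do, here are two minimal base R sketches. They are written for illustration under simple assumptions and are not the versions shipped in `custom_R_code.R`. The names `replace_duplicates_sketch()` and `months_to_yn_sketch()` are hypothetical, and the second assumes months are supplied as three-letter English abbreviations.

```r
# Sketch: keep the first occurrence of each value, set later repeats to NA
replace_duplicates_sketch <- function(x) {
  x[duplicated(x)] <- NA
  x
}

# Sketch: turn a set of month abbreviations into a 12-character Y/N string,
# with one character per month from January to December
months_to_yn_sketch <- function(active_months) {
  paste(ifelse(month.abb %in% active_months, "Y", "N"), collapse = "")
}

replace_duplicates_sketch(c("shrub", "shrub", "tree", "shrub"))
months_to_yn_sketch(c("Sep", "Oct", "Nov"))
```

The first call returns `"shrub"`, `NA`, `"tree"`, `NA`; the second returns `"NNNNNNNNYYYN"`, marking September to November as the active months.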