-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathtutorial_dataset_5.qmd
299 lines (204 loc) · 13.4 KB
/
tutorial_dataset_5.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
# Tutorial 5: Multiple columns for a trait
## Overview
This is the fifth tutorial on adding datasets to your `traits.build` database.
Before you begin this tutorial, ensure you have installed traits.build, cloned the traits.build-template repository, and have successfully build a database from the example datasets in `traits.build-template`. Instructions are available at [Tutorial: Example compilation](tutorial_compilation.html).\
It is also recommended that you first work through some of the earlier tutorials, as many steps for adding datasets to a `traits.build` database are only thoroughly described in the early tutorials.
### Goals
- Learn how to map [context properties into the metadata traits section](#trait_contexts)
- Learn how to add [measurement remarks](#measurement_remarks)
### New functions introduced
- none.
------------------------------------------------------------------------
## Adding tutorial_dataset_5
This dataset is a subset of data from Geange_2017 in AusTraits.
This tutorial focuses on how to input a dataset where there are multiple columns for the same trait, with each column indicating measurements made under different context conditions.
Before you begin creating the metadata file, take a look at the data.csv file. Note that there are two columns for each photosynthesis and conductance. For this dataset these represent repeat measurements made on the same individuals under different experimental treatments. For other studies, these may be multiple columns if the same trait was measured using separate methods.
### Ensure the dataset folder contains the correct data files
In the traits.build-template repository, there is a folder titled `tutorial_dataset_5` within the data folder.
- Ensure that this folder exists on your computer.
- The file `data.csv` exists within the `tutorial_dataset_5` folder.
- There is a folder `raw` nested within the `tutorial_dataset_5` folder, that contains one file, `tutorial_dataset_5_notes.txt`.
### source necessary functions
- If you have restarted R Studio since last adding a dataset, ensure all functions are loaded from both the `traits.build` package and the custom functions file:
```{r, eval=FALSE}
library(traits.build)
source("R/custom_R_code.R")
```
------------------------------------------------------------------------
### Create a metadata.yml file
#### **Create a metadata template**
To create the metadata template, run:
```{r, eval=FALSE}
metadata_create_template("tutorial_dataset_5")
```
As with in the previous tutorials, this function leads you through a series of menus requiring user input. Ensure you select:
[data format:]{style="color:blue;"} [**wide**]{style="color:red;"}\
[taxon_name column:]{style="color:blue;"} [**1: species_name**]{style="color:red;"}\
[location_name column:]{style="color:blue;"} [**1: NA**]{style="color:red;"}\
[individual_id column:]{style="color:blue;"} [**1: NA**]{style="color:red;"}\
[collection_date column:]{style="color:blue;"} [**21: date**]{style="color:red;"}\
[Do all traits need `repeat_measurements_id`'s?]{style="color:blue;"} [**2: No**]{style="color:red;"}\
Notes:\
- There is no location column to map in automatically, so that must be added later. You enter `NA` for now.\
- The column `Abbrev!` appears to be a unique identifier for each individual that can be mapped in to identify individual plants. Mapping in an `individual_id` column is essential if multiple rows include measurements for the same individual, as might happen if individuals are measured repeatedly across time. For this dataset, each row includes measurements on a separate individual, so it isn't required to map individual_id. Moreover, if you were to look closely at the values in the column `Abbrev!` you would notice there are 3 instances of duplication. Were you to map in `individual_id: Abbrev!` you would end up with an error - as occurred initially when this dataset was added to AusTraits.\
*Navigate to the dataset's folder and open the metadata.yml file in Visual Studio Code, to ensure information is added to the expected sections as you work through the tutorial.*
------------------------------------------------------------------------
#### **Propagate source information into the metadata.yml file**
Use the function `metadata_add_source_doi` to add the source.\
The reference doi is `10.1186/s40665-017-0033-8`.\
#### **Add measurement remarks** {#measurement_remarks}
There is a free-form comments column called `measurement_remarks` that can be mapped in at the dataset level (i.e. for all measurements) or under specific traits.
This column is **not** used to generate any of the identifiers and therefore cannot be used as a location to document information that is a context property, source, location, method, etc. However, there can be minor notes that have been documented about specific observations or trait measurements that should be retained in the `traits.build` output and if this information if available in a column it can be mapped into measurement remarks.\
For instance, in this dataset, the column `Mother` documents the maternal lineage of each individual. This could be recorded as an official context property, or, alternatively could simply be added as a measurement remark, first mutating a column:\
```{r, eval=FALSE}
custom_R_code: '
data %>%
mutate(
measurement_remarks = paste0("maternal lineage ", Mother)
)
'
```
and then adding `measurement_remarks: measurement_remarks` to the dataset section of the metadata file, below `life_stage`\
#### **Add location details**
There isn't a location name specified in the data.csv file, so use `custom_R_code` to mutate a new column, location_name.\
```{r, eval=FALSE}
custom_R_code: '
data %>%
mutate(
measurement_remarks = paste0("maternal lineage ", Mother),
location = "Australian National University glasshouse"
)
'
```
And then specify this column as the source of `location_name` in the dataset section of the metadata file.\
And manually add the location details to the location section of the metadata file\
```{r, eval=FALSE}
Australian National University glasshouse:
latitude (deg): -35.283
longitude (deg): 149.1167
precipitation, MAP (mm): 622
description: Australian National University glasshouses
```
#### **Add traits**
To select columns in the `data.csv` file that include trait data, run:
```{r, eval=FALSE}
metadata_add_traits(dataset_id = "tutorial_dataset_5")
```
Select columns [**13 14 15 16 17 18 19**]{style="color:red;"}, as these contain trait data.\
Then fill in the details for each trait column in the traits section of the metadata file.\
Remember, the `trait_name` must match a trait concept within the [traits dictionary](https://github.com/traitecoevo/traits.build-template/blob/master/config/traits.yml). For this example:
| column in dataset | trait concept | units_in | entity_type | value_type | basis_of_ value | replicates |
|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
| Photo | leaf_photosynthetic_rate_per_area_saturated | umol{CO2}/m2/s | individual | raw | measurement | 1 |
| Cond | leaf_stomatal_conductance_per_area_at_Asat | mol{H2O}/m2/s | individual | raw | measurement | 1 |
| Photo_D | leaf_photosynthetic_rate_per_area_saturated | umol{CO2}/m2/s | individual | raw | measurement | 1 |
| Cond_D | leaf_stomatal_conductance_per_area_at_Asat | mol{H2O}/m2/s | individual | raw | measurement | 1 |
| area_mm2 | leaf_area | mm2 | individual | raw | measurement | 1 |
| SLA_cm_g:4 | leaf_mass_per_area | cm2/g | individual | raw | measurement | 1 |
| %N:1 | leaf_N_per_dry_mass | '%' | individual | raw | measurement | 1 |
#### **Add contexts**
##### **Contexts from columns**
There are two columns in the `data.csv` file that specify contexts, `Elevation` (seed provenance) and `Treatment` (drought treatment).
To add these contexts to the metadata file, run:
```{r, eval=FALSE}
metadata_add_contexts(dataset_id = "tutorial_dataset_5")
```
Select columns [**6 7**]{style="color:red;"} as these contain context properties
The category for both of these is `treatment_context`.
And as the values for both are abbreviations, it is recommended to replace the values for both context properties with proper terms and descriptions.
Therefore, the metadata template will now have the following section:
```{r, eval=FALSE}
contexts:
- context_property: unknown
category: treatment_context
var_in: Elevation
values:
- find: LoElev
value: unknown
description: unknown
- find: HiElev
value: unknown
description: unknown
- context_property: unknown
category: treatment_context
var_in: Treatment
values:
- find: LoWat
value: unknown
description: unknown
- find: HiWat
value: unknown
description: unknown
```
Which will be filled in as:
```{r, eval=FALSE}
- context_property: seed provenance
category: treatment_context
var_in: Elevation
values:
- find: LoElev
value: low elevation
description: Seeds sourced from low elevation populations.
- find: HiElev
value: high elevation
description: Seeds sourced from hight elevation populations.
- context_property: drought treatment
category: treatment_context
var_in: Treatment
values:
- find: LoWat
value: low water
description: Plants assigned to low water treatment.
- find: HiWat
value: high water
description: Plants assigned to high water treatment.
```
##### **Contexts manually added** {#trait_contexts}
As background, at the point in the `traits.build` workflow where the trait metadata is read in, the trait data has been converted to `long` format, with each trait measurement is its own row. This allows columns such as methods, entity_type, and units to be added, which are inherently unique to a specific trait. It also means that context properties can now be added, with different values assigned to different traits.
For this study, in addition to the two contexts that are documented as columns, there is a context property that is documented across columns, the time from last watering to gas exchange measurements. For photosynthesis and conductance, the columns `Photo` and `Cond` document measurements made just after a watering cycle, while the columns `Photo_D` and `Cond_D` document measurements made at the very end of a watering cycle.
For such situations, you add a line to the traits section of the metadata for each of these traits.
For instance, for the column `Photo`, you would add:
```{r, eval=FALSE}
replicates: 1
time_since_watering: start of watering cycle
methods: Gas exchange was measured using...
```
This creates a new column, `time_since_watering` and for the trait column `Photo`, the value is `start of watering cycle`.
You add an identical line to the trait column `Cond`, while for the trait columns `Photo_D` and `Cond_D` instead insert a line `time_since_watering: end of watering cycle`.
For the other traits, no lines are added, as these the context property `time_since_watering` doesn't apply to them.
Since the trait metadata is read in long after the `custom_R_code` code is executed, this context property cannot be read in using a function. Instead it must be manually added as a context_property.
```{r, eval=FALSE}
- context_property: time since watering
category: temporal
var_in: time_since_watering
values:
- value: start of watering cycle
description: Measurements made on the morning following a watering event when the plants were at their least water-limited.
- value: end of watering cycle
description: Measurements made on the final day of a watering cycle when the plants were at the driest point in the cycle.
```
#### **Adding contributors**
The file `data/tutorial_dataset_5/raw/tutorial_dataset_5_notes.txt` indicates the main data_contributor for this study.\
#### **Dataset fields**
The file `data/tutorial_dataset_5/raw/tutorial_dataset_5_notes.txt` indicates how to fill in the `unknown` dataset fields for this study.
### Testing, error fixes, and report building
At this point, run the dataset tests, rebuild the dataset, and check for excluded data:
```{r, eval=FALSE}
dataset_test("tutorial_dataset_5")
build_setup_pipeline(method = "base", database_name = "traits.build_database")
source("build.R")
traits.build_database$excluded_data %>%
filter(dataset_id == "tutorial_dataset_5") %>% View()
```
There should be no errors.\
There are a handful of excluded values, including both negative photosynthetic rates and negative conductance rates and two instances where `leaf_area = 0`. The `leaf_area = 0` values need to be removed using `custom_R_code`.\
```{r, eval=FALSE}
mutate(across(c("area_mm2"), ~na_if(.x,0)))
```
Then remake the database and again check the excluded data table.
If the only excluded values remaining are the negative gas exchange rates, build a report for the study:
```{r, eval=FALSE}
traits.build_database$build_info$version <- "5.0.0"
# a fix because the function was built around specific AusTraits versions
dataset_report("tutorial_dataset_5", traits.build_database, overwrite = TRUE)
```