---
title: "Tying It All Together"
---
## Learning Objectives
After completing this module, you will be able to:
- **Identify** the three primary "products" that come out of synthesis groups.
- **Understand** the metadata and other features that make published datasets useful.
- **Evaluate** the reach and reproducibility of an ecological synthesis project's outputs.
- **Create** a plan for your synthesis team's research products that **applies** contribution, publishing, and citation practices that will benefit the team.
## Introduction
So far, we've made the point that ecological synthesis research is collaborative and inclusive, and that it integrates a wide range of data. Synthesis research is also intended to be **influential** and **useful**. There are many definitions of "influential and useful" to consider here, but successful synthesis research tends to expand the boundaries of knowledge and aims to improve human lives or the environment.
<img src="images/mod3_fig1_tying.png" alt="Three circles labeled 'data', 'results' and 'analytical workflow' with arrows connecting each pair pointing in both directions" style="float: right" width="45%" padding="10px"/>
The ability to accomplish this in synthesis research frequently depends on what knowledge or products are created, and how the synthesis team disseminates and communicates them to the outside world.
There are three interconnected, publishable products that are the most common outputs from a synthesis project (or potentially any research project, really): **the data**, **analytical workflows** (code for data cleaning or statistics, for example), and **research results**. Each of these elements is a valuable product of synthesis science, and each one should reference the others. In this module we'll discuss the mechanics of publishing each one, and then how they can be connected and made accessible for the long term.
## Designing and Publishing Datasets
**Estimated time: 12 min**
In Module 2 we discussed some considerations for creating and formatting harmonized data files useful for synthesis research. We also introduced the importance of metadata for describing data and making it more usable. Publishing harmonized data files and descriptive metadata together as a **dataset** ensures that the data products produced by a synthesis team are **findable**, **accessible**, **interoperable**, and **reusable** (FAIR). FAIR data are an important output for almost any ecological synthesis project.
::: {.callout-tip collapse="true"}
### More about Findable, Accessible, Interoperable, Reusable (FAIR) data
The FAIR principles, standing for Findability, Accessibility, Interoperability, and Reusability, are a community-standard set of guidelines for evaluating the quality and utility of published research data. Making an effort to meet the FAIR criteria promotes both human and machine usability of data, and is a worthy objective when preparing to publish data from a synthesis research project.
The FAIR principles were first defined in the paper by Wilkinson et al. (2016)[^1]. Since then, many resources have arisen to guide the implementation of the FAIR principles[^2][^3], and to quantify FAIR data successes and failures in the research and publishing communities[^4][^5].
[^1]: Wilkinson, M., Dumontier, M., Aalbersberg, I. et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data 3, 160018 (2016). https://doi.org/10.1038/sdata.2016.18
[^2]: [GoFAIR initiative](https://www.go-fair.org/how-to-go-fair/)
[^3]: [The FAIR Cookbook](https://faircookbook.elixir-europe.org)
[^4]: Bahim, C., Casorrán-Amilburu, C., Dekkers, M., Herczog, E., Loozen, N., Repanas, K., Russell, K. and Stall, S. (2020) ‘The FAIR Data Maturity Model: An Approach to Harmonise FAIR Assessments’, Data Science Journal, 19(1), p. 41. Available at: https://doi.org/10.5334/dsj-2020-041.
[^5]: Gries, Corinna, et al. "The environmental data Initiative: Connecting the past to the future through data reuse." Ecology and Evolution 13.1 (2023): e9592. https://doi.org/10.1002/ece3.9592
:::
### Activity 1: Evaluate published datasets
Let's start our journey to publishing datasets by looking at some that are already published. Form breakout groups, and course instructors will assign each group a dataset (a DOI) for evaluation. With your group, answer these questions about the dataset:
1. Where were the data collected?
2. What variables were measured and in what units?
3. What is the origin of the data and how have they been altered since collection?
4. Were the first three questions easy to answer? Why or why not?
:::{.panel-tabset}
### Group 1
**Example dataset:** Jarzyna, M.A., K.E. Norman, J.M. LaMontagne, M.R. Helmus, D. Li, S.M. Parker, M. Perez Rocha, S. Record, E.R. Sokol, P. Zarnetske, and T.D. Surasinghe. 2021. temporalNEON: Repository containing raw and cleaned-up organismal data from the National Ecological Observatory Network (NEON) useful for evaluating the links between change in biodiversity and ecosystem stability ver 1. Environmental Data Initiative. <https://doi.org/10.6073/pasta/7f0e0598132e3fea1bfd36a4257af643>.
::: {.callout-caution collapse="true" icon="false"}
### Observations from the instructors
This dataset is published in the EDI repository and comes from a synthesis group that harmonized NEON data to examine synchrony and stability in communities.
- Raw data, processed data, and the R code for the processing steps are all published together here.
- Useful metadata on the geographic, temporal, and spatial coverage of the dataset are included.
- A couple of areas for improvement: the Abstract and Methods are a bit short on details, and no ORCIDs have been included for contributors.
:::
### Group 2
**Example dataset:** Craine, Joseph M. et al. (2019). Data from: Isotopic evidence for oligotrophication of terrestrial ecosystems [Dataset]. Dryad. <https://doi.org/10.5061/dryad.v2k2607>
::: {.callout-caution collapse="true" icon="false"}
### Observations from the instructors
This dataset contains a harmonized dataset from a global study of oligotrophication (declining nutrient availability) in terrestrial ecosystems. Included in the dataset are a combined data file with values from many data sources, and the R code to analyze the combined data and create figures used in a related paper.
- It is easy to find the "related work" from the Dryad dataset because the DOI appears on the landing page. The DOI (<https://doi.org/10.1038/s41559-018-0694-0>) points to the paper that used and cited this dataset, and this paper has some detail about the data sources and methods for assembling the combined dataset.
- In theory it should be possible to reproduce the figures presented in the text using the R code provided.
- Without access to the related paper above, it would be difficult, or potentially impossible, to understand and use the CSV published in Dryad because few descriptive metadata are provided there. Though this appears to be a harmonized dataset, without the paper the origin of the source data, and how they were changed to generate the harmonized CSV, are also unclear. Knowing important information like units, categorical code meanings, original data sources, and methods of harmonization might require contacting the authors.
:::
### Group 3
**Example dataset:** Wieder, W.R., D. Pierson, S.R. Earl, K. Lajtha, S. Baer, F. Ballantyne, A.A. Berhe, S. Billings, L.M. Brigham, S.S. Chacon, J. Fraterrigo, S.D. Frey, K. Georgiou, M. de Graaff, A.S. Grandy, M.D. Hartman, S.E. Hobbie, C. Johnson, J. Kaye, E. Snowman, M.E. Litvak, M.C. Mack, A. Malhotra, J.A.M. Moore, K. Nadelhoffer, C. Rasmussen, W.L. Silver, B.N. Sulman, X. Walker, and S. Weintraub. 2020. SOils DAta Harmonization database (SoDaH): an open-source synthesis of soil data from research networks ver 1. Environmental Data Initiative. <https://doi.org/10.6073/pasta/9733f6b6d2ffd12bf126dc36a763e0b4>
::: {.callout-caution collapse="true" icon="false"}
### Observations from the instructors
This EDI dataset is from the SoDaH (Soil Data Harmonization) LTER working group. It provides a nice example of a harmonized data product that also includes **provenance metadata**. We'll talk a little more about this later, but note that all the original data sources are linked to this dataset on the landing page.
- Data provenance is clear and extensive, and the geographic coverage adds useful detail.
- There are many, many variables available in the dataset, most having to do with soil data and where that data came from.
- The methods provide adequate information about how the data were harmonized. Links to related journal articles should provide examples of how those data can be used.
- The flattened database table format probably isn't very space efficient, but should be fairly easy to use once you understand the columns.
:::
### Group 4
**Example dataset:** Woods, B., Trebilco, R., Walters, A., Hindell, M., Duhamel, G., Flores, H., Moteki, M., Pruvost, P., Reiss, C., Saunders, R., Sutton, C., & Van de Putte, A. (2021). Myctobase (1.1) [Data set]. Zenodo. <https://doi.org/10.5281/zenodo.6131579>
::: {.callout-caution collapse="true" icon="false"}
### Observations from the instructors
Myctobase is a global database of mesopelagic fish data.
- Link to a descriptive data paper in Scientific Data is clear.
- There are three tables, which are described in the metadata; the metadata are provided in a separate Excel file.
- Taxonomic names have been standardized and checked.
- Data provenance is not very clear. Did the data come from other published sources like those in the reference list, or are there contributed data too?
:::
### Group 5
**Example dataset:** Ross, C.W., L. Prihodko, J.Y. Anchang, S.S. Kumar, W. Ji, and N.P. Hanan. 2018. Global Hydrologic Soil Groups (HYSOGs250m) for Curve Number-Based Runoff Modeling. ORNL DAAC, Oak Ridge, Tennessee, USA. <https://doi.org/10.3334/ORNLDAAC/1566>
::: {.callout-caution collapse="true" icon="false"}
### Observations from the instructors
This dataset seems like a reasonably good example of a published, global-scale spatial data synthesis product.
- The R code is published with the data.
- Provenance of the source data is reasonably clear in the User Guide.
- The ORNL DAAC repository provides extensive metadata and good tools to preview and access the data. Access requires a login.
:::
:::
### Metadata
One thing that Activity 1 introduces is the importance of **metadata**. Metadata are data about the data. As a general rule, metadata should describe:
* **Who** collected the data
* **What** was observed or measured
* **When** the data were collected
* **Where** the data were collected
* **How** the data were collected (methods, instruments, etc.)
* Sometimes, stating **why** the data were collected can help future users understand data context and evaluate fitness for use.
Including metadata of this nature makes data more usable, and helps prevent the deterioration of information about data over time, as illustrated in the figure below (from Michener et al. 1997[^6]).
[^6]: Michener, W.K., Brunt, J.W., Helly, J.J., Kirchner, T.B. and Stafford, S.G. (1997), NONGEOSPATIAL METADATA FOR THE ECOLOGICAL SCIENCES. Ecological Applications, 7: 330-342. https://doi.org/10.1890/1051-0761(1997)007[0330:NMFTES]2.0.CO;2
![Example of the normal degradation in information content associated with data and metadata over time ("information entropy"). Accidents or changes in technology (dashed line) may eliminate access to remaining raw data and metadata at any time (Michener et al. 1997).](images/michener97_information_loss.png){width=60% fig-alt="A graphic demonstrating the loss of data and metadata over time with particular events that precipitate more dramatic loss indicated with arrows (e.g., retirement of key personnel, fading memory of those involved, etc.)"}
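The who/what/when/where/how checklist above can be turned into a simple completeness check that a team runs before publishing. The sketch below is only an illustration: the field names, the `missing_metadata` helper, and the example record are all hypothetical, and a real repository will have its own metadata standard with many more fields.

```python
# Hypothetical field names that mirror the checklist above; this is not a
# community metadata standard.
REQUIRED_FIELDS = ("who", "what", "when", "where", "how")


def missing_metadata(record):
    """Return the required fields that are absent or blank in a metadata record."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if value is None or not str(value).strip():
            missing.append(field)
    return missing


# Illustrative record only; the values are made up for this example.
example_record = {
    "who": "Example synthesis working group",
    "what": "Plant cover (percent) by species and plot",
    "when": "2010-2020, sampled annually",
    "where": "",  # left blank on purpose so the check reports it
    "how": "Visual estimates in 1 m2 quadrats",
    "why": "Assess long-term change in community composition",  # optional
}

print(missing_metadata(example_record))  # prints ['where']
```

A check like this only confirms that something was written for each question. Whether the descriptions are detailed enough for reuse still needs a human reader.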
#### Data Provenance Metadata
<img style="float: right;" src="images/provenance.png" alt="Provenance: where did your data come from?" width="25%" padding="10px"/>
Provenance metadata deserves special attention for ecological data synthesis projects. **Data provenance** refers to information detailing the origin of the values in a dataset, which is particularly important for synthesis projects that bring together data from many different sources. Synthesis activities typically produce new data products that are **derived** from the original source data after they have been cleaned, harmonized, and analyzed. Provenance metadata should be included with the derived products to point back to the original source data, similar to the way bibliographic references point to the source material for a book or scholarly article.
A few other notes on provenance:
* At its simplest, documenting data sources as you collect and analyze the source data is a great start on provenance metadata.
* Many data repositories provide guidelines, tools, and features for data provenance metadata[^7].
* Provenance metadata can become very detailed if the software and computing environment are also taken into account. This is an active area of study[^8][^9].
[^7]: [Provenance metadata at the EDI repository](https://edirepository.org/resources/provenance-metadata)
[^8]: Lerner, et al., "Making Provenance Work for You", The R Journal, 2023. https://journal.r-project.org/articles/RJ-2023-003/
[^9]: [End-to-End Provenance](https://end-to-end-provenance.github.io/)
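As a minimal sketch of the "document data sources as you go" advice above, a team could keep a small provenance table alongside its derived data. Everything here is an assumption for illustration: the column names, the `provenance_csv` helper, and the two source entries are hypothetical, and the identifiers are placeholders rather than real DOIs. Repositories with provenance features will have their own formats.

```python
import csv
import io

# Hypothetical columns for a simple provenance table.
PROVENANCE_FIELDS = ["source_id", "identifier", "date_accessed", "changes_made"]

# Placeholder entries; replace with the real source datasets and their DOIs.
sources = [
    {
        "source_id": "site_a_cover",
        "identifier": "<DOI or URL of source dataset A>",
        "date_accessed": "2024-01-15",
        "changes_made": "renamed columns; converted cover to percent",
    },
    {
        "source_id": "site_b_cover",
        "identifier": "<DOI or URL of source dataset B>",
        "date_accessed": "2024-01-16",
        "changes_made": "dropped plots with missing coordinates",
    },
]


def provenance_csv(rows):
    """Render provenance records as CSV text with one row per source dataset."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=PROVENANCE_FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


print(provenance_csv(sources))
```

Keeping the table in plain CSV means it can be versioned with the code and later copied into whatever provenance fields the chosen repository provides.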
#### Licensing
Published datasets should include a license in every copy of the metadata that defines who has what rights to use, reproduce, or distribute the data. Licensing decisions should be made in consultation with the synthesis team after considering the nature of the data (does it contain human subject data, for instance?), its origin (including restrictions on source data, if applicable), and the requirements of the funders and institutions associated with the project. For publicly funded environmental research data, it is generally appropriate to use open licenses; the Creative Commons CC-BY (attribution) and CC0 (public domain) licenses are probably good choices for most ecological synthesis data. This is not legal advice and your mileage may vary.
#### Metadata Creation and Management
Assembling metadata should be an integral part of the data synthesis activities discussed in Module 2, and can even be built-in to the workflow and project management practices of a project. **Make sure to plan for and start creating metadata early** in a synthesis project. Below are a few ways to do that.
1. **Keep a detailed project log and populate it with metadata for the project, including information like:**
a. what source data the team is using and where they came from.
b. how data are being analyzed and methods used to create derived products.
c. who is doing what.
2. **Start creating distinct publishable datasets (data plus metadata) as data are processed and analyzed.** The team can do this:
a. locally, using a labeled directory for the cleaned, harmonized, or derived data, along with related code and metadata files. Metadata files may be plain text, or use [a metadata template](https://github.com/jornada-im/documentation/raw/main/templates/Jornada_metadata_template.docx).
b. with a repository-based metadata editor, such as [ezEML](https://ezeml.edirepository.org) from the Environmental Data Initiative (EDI) repository.
3. **Get a professional data manager or data curator involved with the synthesis project**. For example, the LTER Network has a community of "Information Managers" [^10] trained in data management, metadata creation, and data publishing. Research data repositories[^11] and associated data curators[^12] may also be a good resource.
[^10]: [List of LTER Information Managers](https://lternet.edu/using-lter-data/#im)
[^11]: [The Registry of Research Data Repositories (re3data.org)](https://www.re3data.org/)
[^12]: [Data curation network](https://datacurationnetwork.org/)
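One way to act on the "labeled directory" suggestion above is to create the same folder skeleton for every dataset the team intends to publish. The sketch below assumes a hypothetical layout (`data/`, `code/`, `metadata/`) and a hypothetical plain-text metadata stub. Neither is a repository requirement, so adjust both to your team's conventions.

```python
import tempfile
from pathlib import Path

# Hypothetical plain-text metadata stub to be filled in as work proceeds.
METADATA_STUB = "Title:\nCreators:\nSource data:\nMethods:\nLicense:\n"


def create_dataset_skeleton(root, name):
    """Create data/, code/, and metadata/ folders for one publishable dataset."""
    dataset_dir = Path(root) / name
    for subfolder in ("data", "code", "metadata"):
        (dataset_dir / subfolder).mkdir(parents=True, exist_ok=True)
    readme = dataset_dir / "metadata" / "README.txt"
    if not readme.exists():  # never overwrite metadata someone already wrote
        readme.write_text(METADATA_STUB, encoding="utf-8")
    return dataset_dir


# Demonstrate in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    dataset_dir = create_dataset_skeleton(tmp, "harmonized_cover_v1")
    print(sorted(p.name for p in dataset_dir.iterdir()))
```

Starting each dataset from the same skeleton makes it harder to forget the metadata file, and the folder can be handed to a data manager more or less as is.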
::: {.callout-note}
#### Key insight
**Reproducibility and the creation of metadata are closely related.** Your team's detailed documentation of the research process allows for reproducible science, and can be mined as a source of metadata during data publication.
:::
### Deciding What to Publish
<img style="float: right;" src="images/what_data.png" alt="What data should you publish" width="20%" padding="10px"/>
The overall design of the dataset to be published is often difficult to imagine, particularly for people new to using or creating datasets. One of the most common questions data managers hear is "*What should we publish?*" This is usually a question about what files to include in the published dataset, or what data will be useful as a published dataset.
::: {.panel-tabset}
### Discussion question
> **What should be included in a published dataset?**
### Some general rules
As we learned in Activity 1, every dataset is different, but the answer to "What should we publish?" usually comes down to:
- Publish any data used to generate research results.
- Publish any data that will be used by others (scientists, managers, public stakeholders), including raw data.
- If reproducibility is of interest or concern, publish the workflow.
- Usually this means publishing code, such as scripts written in R, python, or a shell language.
- What code? Any scripts used to process or analyze the data, or to generate research results like figures, are fair game.
- Sometimes code, especially detailed, reusable workflows like an R package, can stand alone as an independent publication. We discuss that in a later section.
And of course... **always publish descriptive metadata about any of the above.**
These are general rules, but you can also look at advice from repositories like [EDI](https://edirepository.org/resources/designing-a-data-package) and [BCO-DMO](https://guide.bco-dmo.org/prepare/what-is-a-dataset), or from a research network like [NEON](https://www.neonscience.org/data-samples/guidelines-policies/publishing-research-outputs). Asking a data manager, especially one involved with the synthesis group's work, can also be helpful, as will discussion among the full synthesis team.
:::
### Choosing and Publishing to a Repository
There are many, many research data repositories available to researchers now[^11], making the choice of where to publish data fairly challenging. A few basic data repository features are essential when publishing a synthesis dataset. First, the repository should issue persistent, internet-resolvable, unique identifiers for every dataset published. Generally this will be a [Digital Object Identifier](https://doi.org), or DOI, that can be cited every time the dataset is used after publication. Second, repositories should require, and provide the means to create/publish, metadata describing each dataset. Without requiring at least minimal metadata, no repository can ensure that published data are FAIR. Finally, research data repositories should be stable and well supported so that data remain available and usable in perpetuity. Choosing a repository from the [CoreTrustSeal certified repository list](https://amt.coretrustseal.org/certificates) is one way to assess this. Beyond this, asking a few questions about the dataset will help with repository selection:
1. Who are the likely users for this data? Will they belong to a specific scientific discipline, research network, or community of stakeholders?
2. How specialized are your data? Do they fall into a common data type or follow a special formatting standard?
3. Will the data be updated regularly?
4. Does the repository charge for publication?
5. **Will the dataset benefit from some level of peer review?**
![A limited slice from the broad spectrum of research data repositories available for publishing synthesis data. These repositories are weighted towards those based in the U.S.A. ([re3data.org](https://www.re3data.org) has a comprehensive list). Also note that the FAIR spectrum below refers primarily to repository requirements. It is possible, but not always required, to include detailed, community-standard metadata in generalist repositories.](images/repository_spectrum.png){width=90% fig-alt="A graphic containing the logos for many different data repositories arranged along a gradient of 'less FAIR' to 'more FAIR' where FAIRness is defined as 'Metadata/formatting standards'"}
More specialized repositories tend to offer enhanced documentation, custom software tools, and **data curation staff that will review submitted data and assist users with data publication**. Selecting a data repository with metadata requirements or standards, and a review and curation process for submissions, will help ensure that you are publishing a more FAIR data product. Consulting a project data manager if one is available to the synthesis team will also help with repository selection. After making a choice, the process of publishing data varies from repository to repository.
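Because a DOI is meant to be cited and resolved consistently, it can help to normalize however a DOI was written down (bare, with a `doi:` prefix, or as a full URL) into one resolvable form. The sketch below is a rough illustration: `doi_to_url` is a hypothetical helper, and the pattern is a loose sanity check on the general "10.prefix/suffix" shape, not an authoritative DOI validator. The example DOI is the Myctobase dataset from Activity 1.

```python
import re

# Loose shape check: "10.", a numeric registrant code, "/", then a suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

# Common ways a DOI gets written in notes and spreadsheets.
KNOWN_PREFIXES = ("https://doi.org/", "http://doi.org/", "doi:")


def doi_to_url(doi):
    """Normalize a DOI string into an https://doi.org/ URL."""
    cleaned = doi.strip()
    for prefix in KNOWN_PREFIXES:
        if cleaned.lower().startswith(prefix):
            cleaned = cleaned[len(prefix):]
            break
    if not DOI_PATTERN.match(cleaned):
        raise ValueError(f"Does not look like a DOI: {doi!r}")
    return "https://doi.org/" + cleaned


print(doi_to_url("doi:10.5281/zenodo.6131579"))
# prints https://doi.org/10.5281/zenodo.6131579
```

Passing the pattern check does not mean the DOI is registered. Only resolving it through doi.org confirms that it points at a real dataset.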
### Additional Data Publishing Resources
- [NEON's derived data publishing guide](https://www.neonscience.org/data-samples/guidelines-policies/publishing-research-outputs)
- [EDI repository data authorship guide](https://edirepository.org/resources/resources-for-data-authors)
- [BCO-DMO repository data publishing guide](https://guide.bco-dmo.org)
## Sharing the Team's Workflow
One of the most valuable, shareable outputs of synthesis research is the analytical workflow used to derive datasets and produce scientific results. Most often, these workflows are written in computer code, such as R, Python, or another language. The code may consist of a collection of scripts, or it may be organized into stand-alone modules or libraries. The latter is easier to share and re-use, but requires more advanced knowledge of software design. Code can be published in a repository (see the options below), with a DOI issued for particular versions of the code, which allows the workflow and code to be cited by the research products that they were used to generate. Ecological forecasting projects are one good place to look for examples,[^13] but this practice is also generally applicable to synthesis research.
[^13]: White EP, Yenni GM, Taylor SD, et al. Developing an automated iterative near-term forecasting system for an ecological study. Methods Ecol Evol. 2019; 10: 332–344. https://doi.org/10.1111/2041-210X.13104
Sharing and citing workflows and code is an essential element of reproducible science because they:
1. Cite the exact process used to prepare and analyze data
2. Create a repeatable method to integrate or analyze new data
3. Allow other scientists to verify results
Even for data users or interested parties who will not directly use the code, a published workflow provides information about:
* The origin of the data.
* Methods for data cleaning, harmonization, analysis, and presentation of results (figures), which may be adaptable to future work
* How the workflow was developed or changed over time
* The contributions made by the team
::: {.panel-tabset}
### Discussion question
> **What features of published code would let you assess whether it is useful for your purposes?**
### Some ideas
* Clear documentation and examples provided
* Commenting in the code
* Tests and build indicators
* Publication in a repository that provides review (more on this below)
:::
In other parts of the course, we have strongly recommended using version control and collaboration platforms, particularly GitHub. GitHub's platform provides several options for sharing and publishing code, but let's explore some others too.
::: {.panel-tabset}
### GitHub
[GitHub](https://github.com) is huge and widely used for sharing code (among many other services). In combination with other software and services, GitHub can be reliably used to publish scientific code in a reproducible way.
**Some features:**
* Zenodo integration is already included in GitHub,[^14] which can make it fairly easy to publish a repository with a DOI.
* Large array of project management features.
[^14]: [GitHub documentation for referencing and citing content](https://docs.github.com/en/repositories/archiving-a-github-repository/referencing-and-citing-content)
### NEON Code Hub
The [NEON Code Hub](https://www.neonscience.org/resources/code-hub) is a good example of a research network focused code repository.
**Some features:**
* Focus is on code useful for working with NEON data.
* Review and placement of submitted code.
### ROpenSci
[ROpenSci](https://ropensci.org) publishes R packages for scientific applications.
**Some features:**
* Wide array of R packages useful for working with scientific data.
* Team provides review and vetting of the code before publication.
* Most packages also go to CRAN.
### PyPI
The [Python Package Index](https://pypi.org) (PyPI) is the most widely used venue for publishing Python packages.
**Some features:**
* Python compatibility checks are performed and metadata about the code resource are required.
### CRAN
The [Comprehensive R Archive Network](https://cran.r-project.org) is a widely used resource for publishing R packages.
**Some features:**
* R compatibility checks are performed and metadata about the code resource are required.
:::
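Whichever venue the team chooses, published code is easier to cite when the repository states how to cite it. One common convention on GitHub is a `CITATION.cff` file (Citation File Format) in the repository root. The sketch below generates a minimal one. The `build_citation_cff` helper, the title, and the author name are hypothetical placeholders, and only a handful of the format's fields are shown, so check the format's own documentation before relying on it. A `doi:` line can be added once a DOI has been issued for a release.

```python
import json


def build_citation_cff(title, authors, version, date_released):
    """Build minimal CITATION.cff text.

    authors is a list of (family_name, given_name) tuples.
    """
    quote = json.dumps  # JSON string quoting is also valid YAML double-quoting
    lines = [
        "cff-version: 1.2.0",
        'message: "If you use this software, please cite it as below."',
        f"title: {quote(title)}",
        "authors:",
    ]
    for family, given in authors:
        lines.append(f"  - family-names: {quote(family)}")
        lines.append(f"    given-names: {quote(given)}")
    lines.append(f"version: {quote(version)}")
    lines.append(f"date-released: {quote(date_released)}")
    return "\n".join(lines) + "\n"


# Placeholder values for illustration only.
cff_text = build_citation_cff(
    title="Example synthesis workflow",
    authors=[("Doe", "Jane")],
    version="1.0.0",
    date_released="2024-01-01",
)
print(cff_text)
```

Saving the printed text as `CITATION.cff` at the top level of the repository keeps the citation information under version control with the code it describes.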
::: {.callout-note}
#### Key insight
**Peer review is valuable for all research outputs.** We expect a peer review process for journal articles, but published datasets and code can undergo peer review as well. As with manuscripts, the review process for data and code leads to higher quality, more useful products.
:::
## Communicating Research Results
One of the primary goals of synthesis research is to find useful, generalizable research results about the system under study. Most often this means writing scientific journal articles. While we aren't going to go into full detail about what constitutes, or how to write, a manuscript for a journal, there are some unique features of writing articles for synthesis projects. First, **data papers** are often an important product for synthesis groups, and these are somewhat different from standard research journal articles. Second, given the large size and cooperative nature of most synthesis teams, a **collaborative writing process** is called for. An appropriate collaborative writing method, and some team norms and contribution guidelines, should be in place to reduce the potential for conflict or mistakes.
### Data papers
A data descriptor article, usually known as a data paper, is a peer-reviewed journal article written to introduce and describe a (usually) new dataset. For synthesis teams, who are often producing a harmonized dataset as their first major research product, writing a data paper to accompany the dataset makes sense as a way to introduce the data, demonstrate their utility, and get the word out about the dataset. Data papers also lay the groundwork for any future papers that will answer the science questions of interest to the synthesis team.
Data papers may be simpler and shorter than research articles (not always though), but there are still a few gotchas that can arise. Below are some recommendations, and the rationale behind them.
1. **Publish the dataset described by the data paper in a reputable data repository.**
- Although some data paper publishers host data themselves, the data are usually published only as supplementary material for the article, or are only held for review. Most data-focused journals require that accepted data papers describe and reference a dataset published in a research data repository. Follow the guidance above to select a repository and prepare the dataset for publication.
2. **Be sure to cite the data paper and the dataset properly.**
- The existence of a data paper and a dataset, each describing the same data and each with its own DOI, can create confusion about what to cite in related works. If the novelty and utility of the dataset, or the methods used to assemble it, are being referenced by a related work, then it may be most appropriate to cite the data paper. If the actual data are being used (analyzed, interpreted, etc.) in a related work, then cite the published dataset. In many cases it is expected to cite both.
3. **Don't shortchange the metadata in the published dataset just because there is also a data paper.**
- Consider the future usability of the data the data paper describes, and ensure that the associated published dataset contains detailed, community-standard metadata. Not all users will see the data paper, and data paper publishers may have incomplete or quirky requirements for metadata.
::: {.callout-tip collapse="true"}
### Data paper examples and publication venues
**Some examples of data papers related to synthesis projects:**
- Komatsu, Kimberly J., et al. "CoRRE Trait Data: A dataset of 17 categorical and continuous traits for 4079 grassland species worldwide." Scientific Data 11.1 (2024): 795. <https://doi.org/10.1038/s41597-024-03637-x>
- ...
**A few suggested venues for publishing data papers:**
- [*Scientific Data*](http://www.nature.com/sdata/) (Nature Publishing Group)
- [*Data*](https://www.mdpi.com/journal/data) (MDPI)
- [*PLOS ONE*](https://journals.plos.org/plosone/) (usually termed "database papers")
- The ESA journal [*Ecology*](https://www.esa.org/publications/be-an-author/), and quite a few other disciplinary journals, now publish data papers.
GBIF also maintains a helpful list of [data paper journals](https://www.gbif.org/data-papers).
:::
### Writing collaboratively
Writing a paper with a large team can be a challenge. It is important to encourage team members to contribute in a way they are comfortable with, but there is the potential for technical, editorial, and personal conflict without some prior planning. Practically, there are two models for writing a manuscript with a bunch of contributors.
::: {.panel-tabset}
### Cloud-based collaborative writing
In this model, manuscripts live mainly in web-based writing platforms managed by a cloud service provider (e.g. Google Docs), and all contributors write and edit the document within that platform. Contributions may be asynchronous or synchronous since version control and conflict resolution are generally built into the platform. Most platforms have additional collaboration features, such as user account management, suggested edits, and commenting systems.
**Software platform:** Google Docs, Microsoft 365 Online, Overleaf (LaTeX)
**Pros:** Strong collaboration features (user/permission management, contribution tracking, comments and suggestions). No need to distribute copies and then merge contributions.
**Cons:** Can be unfamiliar to senior contributors. Easy to lose track of links. Limited formatting features compared to local word processors. Privacy/tracking concerns.
### "Pass the manuscript"
This model relies on word processing software installed on contributors' local machines. Copies of the manuscript are distributed to contributors for asynchronous writing and editing assignments, and contributions are then merged together into a synchronized version of the manuscript. In large teams, it may be best to have one person managing the copy/merge process.
**Software platform:** Microsoft Word (usually), email
**Pros:** Familiar to most. Integrates with local data management practices. Most word processors have powerful collaboration and versioning features now. Advanced formatting and editing. Less reliance on cloud providers.
**Cons:** License pricing and institutional availability may be limited. Multiple versions in use, and the copy/merge workflow can easily generate conflicts or become unmanageable in large groups.
:::
In addition to these practical considerations, there are some team considerations as well:
1. **Make the expectations for contributing to a manuscript clear.**
- How, when, and where should contributions be made
- Authorship expectations discussed in advance
2. **Make space for new or early-career team members to contribute.**
- Efficiency and experience level aren't good reasons to exclude contributors
- Synthesis papers are a great learning experience and career opportunity
3. **Team discussions are preferable to unilateral editorial decisions.**
- This can help avoid hurt feelings during the editing process.
4. **It can be beneficial to have a manuscript coordinator.**
- The coordinator can help split up writing and editing tasks equitably
- Someone needs to manage conflicts, check for consistency, etc.
- Often this is the lead author
## Connecting the Pieces
We've now covered how a synthesis team should approach creating and publishing its main research outputs (data, code, results). Now we'll discuss how to begin making these useful to the world, which starts with making sure the products of synthesis research point to each other. Let's begin with an activity.
### Activity 2: Synthesis project detective
**Estimated time: 12 min**
Form breakout groups and course instructors will assign each one a link to a product from a synthesis project (the code, a paper, a dataset, etc.). Using any means necessary (metadata, web search, etc.) figure out what other products are related (other publications, source/derived data, etc.) and who is involved in the synthesis team. Answer these questions as a group:
1. If your group received a link to a paper, were you able to find datasets and a code repository (for an analytical workflow)?
2. If your group received a link to a code repository, were you able to find papers and datasets?
3. If your group received a link to a dataset, were you able to find papers and a code repository?
4. Who was involved in the synthesis project?
5. Could you understand the overall scope and impact of the synthesis project? Why or why not?
::: {.panel-tabset}
### Group 1
**Clue:** <https://cran.r-project.org/web/packages/codyn/index.html>
::: {.callout-caution collapse="true" icon="false"}
### Cracking the case
This is the "community dynamics" synthesis working group that was at least partly supported by the LTER Network.
*Papers*
There is a paper describing the R package
- Hallett, L.M., Jones, S.K., MacDonald, A.A.M., Jones, M.B., Flynn, D.F.B., Ripplinger, J., Slaughter, P., Gries, C. and Collins, S.L. (2016), codyn: An r package of community dynamics metrics. Methods Ecol Evol, 7: 1146-1151. <https://doi.org/10.1111/2041-210X.12569>
And maybe this
- Avolio, M. L., I. T. Carroll, S. L. Collins, G. R. Houseman, L. M. Hallett, F. Isbell, S. E. Koerner, K. J. Komatsu, M. D. Smith, and K. R. Wilcox. 2019. A comprehensive approach to analyzing community dynamics using rank abundance curves. Ecosphere 10(10):e02881. <https://doi.org/10.1002/ecs2.2881>
*Workflows*
The original clue is a link to the `codyn` package on CRAN.
*Datasets*
- Data are included in the R package... but not sure of the provenance.
*Other*
- ?
:::
### Group 2
**Clue:** <https://corredata.weebly.com/>
::: {.callout-caution collapse="true" icon="false"}
### Cracking the case
This is the CoRRE synthesis working group, which has been supported by iDiv and LTER (and possibly others). The group's website lays out many of the products fairly clearly, though it may not be perfectly up-to-date.
*Papers*
- Avolio, M. L., Pierre, K. J. L., Houseman, G. R., Koerner, S. E., Grman, E., Isbell, F., ... & Wilcox, K. R. (2015). A framework for quantifying the magnitude and variability of community responses to global change drivers. Ecosphere, 6(12), 1-14.
- Wilcox, K. R., Tredennick, A. T., Koerner, S. E., Grman, E., Hallett, L. M., Avolio, M. L., ... & Zhang, Y. (2017). Asynchrony among local communities stabilises ecosystem function of metacommunities. Ecology letters, 20(12), 1534-1545.
- Langley, J. A., Chapman, S. K., La Pierre, K. J., Avolio, M., Bowman, W. D., Johnson, D. S., ... & Tilman, D. (2018). Ambient changes exceed treatment effects on plant species abundance in global change experiments. Global Change Biology, 24(12), 5668-5679.
- Komatsu, K. J., Avolio, M. L., Lemoine, N. P., Isbell, F., Grman, E., Houseman, G. R., ... & Zhang, Y. (2019). Global change effects on plant communities are magnified by time and the number of global change factors imposed. Proceedings of the National Academy of Sciences, 116(36), 17867-17873. <https://doi.org/10.1073/pnas.1819027116>
- and quite a few more....
A recent data paper
- Komatsu, K.J., Avolio, M.L., Padullés Cubino, J. et al. CoRRE Trait Data: A dataset of 17 categorical and continuous traits for 4079 grassland species worldwide. Sci Data 11, 795 (2024). <https://doi.org/10.1038/s41597-024-03637-x>
*Workflows*
- Nothing public found so far... There may be scripts provided as supplementary files with some publications.
*Datasets*
- Other than the data paper mentioned above and some [associated data in EDI](), the data appear to be by request only.
*Other*
- ?
:::
### Group 3
**Clue:** <https://doi.org/10.1029/2022GB007678>
::: {.callout-caution collapse="true" icon="false"}
### Cracking the case
This is a paper from the LTER-supported Silica exports working group. We talked in Module 2 about their repositories and project management practices.
*Papers*
The original clue is a paper from this group. Note that the data availability statement links to a dataset published with the USGS, and the acknowledgements give some clue as to the LTER support received.
*Workflows*
There are several GitHub repositories. The first one listed is a guide to others.
- <https://github.com/lter/lterwg-silica-data>
- <https://github.com/SwampThingPaul/SiSyn>
- <https://github.com/lsethna/NCEAS_SiSyn_CQ>
- <https://github.com/lter/lterwg-silica-spatial>
- <https://github.com/njlyon0/lter_silica-high-latitude>
*Datasets*
- Jankowski, K.J., Carey, J.C., Julian, P., Johnson, K., Sethna, L.R., Thomas, P.K., Wymore, A.S., Shogren, A.J., McKnight, D.M., McDowell, W.H., Heindel, R.C., Sullivan, P.L., and Jones, J. B., 2023, Dissolved silicon concentration and yield estimates from streams and rivers in North America and Antarctica, 1964-2021: U.S. Geological Survey data release, <https://doi.org/10.5066/P951UKQB>
*Other*
- ?
:::
### Group 4
**Clue:** <https://doi.org/10.3389/fmars.2021.724913>
::: {.callout-caution collapse="true" icon="false"}
### Cracking the case
This is a paper describing an early effort to create a harmonized, global ocean oxygen product. It was published in 2021, and there is currently not much other information about progress on the effort.
*Papers*
The original clue is the primary paper describing this effort.
*Workflows*
- ???
*Datasets*
- ???
*Other*
- ???
:::
### Group 5
**Clue:** <https://github.com/sokole/ltermetacommunities>
::: {.callout-caution collapse="true" icon="false"}
### Cracking the case
This is the LTER Synthesis group called "Metacommunities"
*Papers*
- Wisnoski, Nathan I., Riley Andrade, Max C. N. Castorani, Christopher P. Catano, Aldo Compagnoni, Thomas Lamy, Nina K. Lany, et al. 2023. “Diversity–Stability Relationships across Organism Groups and Ecosystem Types Become Decoupled across Spatial Scales.” Ecology 104(9): e4136. <https://doi.org/10.1002/ecy.4136>
*Workflows*
- The original "clue" is a link to GitHub where code is stored.
- This is one of the LTER-supported synthesis groups that led to the creation of the [ecocomDP](https://ediorg.github.io/ecocomDP/) data model and R package [^15]
*Datasets*
There is a dataset in Zenodo
- Nathan I. Wisnoski. (2023). nwisnoski/dsr-metacom: Code for Ecology publication (v1.0.4). Zenodo. <https://doi.org/10.5281/zenodo.8067504>
*Other*
- Eric may know more!
[^15]: O'Brien, Margaret, et al. "ecocomDP: a flexible data design pattern for ecological community survey data." Ecological Informatics 64 (2021): 101374.
:::
:::
### More ways to synthesize
<img src="images/mod3_more_products.png" alt="Three circles labeled 'data', 'results' and 'analytical workflow', plus many more possible products" style="float: right" width="50%"/>
We've talked about the three most common products of synthesis: papers, datasets, and workflows. But we've also seen that there are plenty of other ways to share synthesis research! Education and outreach can become an important goal for some synthesis teams, and providing access to data and actionable research results, such as forecasts, can be very useful to stakeholders. As time goes on, synthesis teams may produce many things that meet these goals and needs, moving well beyond the three kinds of products we've already talked about. See below for a few ideas and examples.
:::{.panel-tabset}
### Teaching materials
Synthesis research produces new scientific knowledge that other researchers, students, or stakeholders can learn and build on. Synthesis can also generate applied-science tools and methods that others need to learn how to use for themselves. Teaching modules are an important way of sharing both of these outcomes, and of broadening the reach of a synthesis project.
Examples:
- The [EDDIE project](https://serc.carleton.edu/eddie/teaching_materials/index.html) is a clearinghouse of contributed teaching materials for the earth and environmental sciences.
- This website is an example of teaching materials produced by a synthesis team.
### Web apps
Interactive web applications can provide users with easy access to scientific datasets (especially large ones), analytical results, visualizations, interpretation, and many other things. Creating web apps is not necessarily an easy task, but if your synthesis team has the expertise, or access to web developers, web apps may be useful for outreach, or as tools the synthesis team itself can use. Frameworks like [Shiny](https://shiny.posit.co/) (for R), [Streamlit](https://streamlit.io), or [Flask](https://flask.palletsprojects.com) (both for Python), and services like [Shinyapps.io](https://www.shinyapps.io/) and [Plotly](https://plotly.com), can make creation of apps relatively painless.
Examples:
- An app for finding and exploring [ecocomDP data](https://ecocomdp.neonscience.org/) in the NEON and EDI repositories.
- A [dashboard app](https://projects.ecoforecast.org/neon4cast-dashboard/phenology) for the NEON ecological forecasting challenge.
- The [Jornada LTER interactive viewer](https://jornada-data.shinyapps.io/jrn_dataviewer/) for weather station data.
### Automation
Some research efforts have developed automation systems for research data processing, analytics, and publishing. These often fall into the "continuous integration/continuous deployment" class of web-enabled software and data pipelines, in which one software process (data processing, analytics, publication, etc.) may be automatically triggered by events that occur in another, connected software service (such as adding new data to a GitHub repository). These technologies enable researchers to build software pipelines that can be useful for quality control of new data, updating forecasts, and rapid deployment of data or analysis products.
Examples:
- The Portal Project in southeast Arizona has developed a well-described [near-term ecological forecasting pipeline](https://portal.naturecast.org/).[^14]
- [Automated quality control](https://github.com/SCBI-ForestGEO/Dendrobands) of dendrometer band data.[^16]
- [Forecasting Lake and Reservoir Ecosystems](https://flare-forecast.org/) (FLARE) project.
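To make the quality-control idea above more concrete, here is a minimal Python sketch of the kind of check an automated pipeline might run whenever new data are added. It is not taken from any of the projects listed above. The column names (`site`, `date`, `value`) and the plausible-value range are hypothetical and chosen only for illustration.

```python
import csv
import io

# Hypothetical column names and plausible-value range, for illustration only.
REQUIRED_COLUMNS = ("site", "date", "value")
VALUE_RANGE = (0.0, 500.0)


def check_rows(csv_text):
    """Return a list of human-readable problems found in the CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        return ["missing columns: " + ", ".join(missing)]

    problems = []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        raw = row["value"]
        try:
            value = float(raw)
        except (TypeError, ValueError):
            problems.append(f"line {line_no}: value {raw!r} is not numeric")
            continue
        if not VALUE_RANGE[0] <= value <= VALUE_RANGE[1]:
            problems.append(f"line {line_no}: value {value} outside {VALUE_RANGE}")
    return problems


def main(csv_text):
    """Print any problems and return an exit code (nonzero means the check failed)."""
    problems = check_rows(csv_text)
    for problem in problems:
        print(problem)
    return 1 if problems else 0


# A small made-up sample: one good row, one non-numeric value, one out-of-range value.
sample = "site,date,value\nA,2024-01-01,12.5\nB,2024-01-02,abc\nC,2024-01-03,900\n"
exit_code = main(sample)
```

In a real pipeline, a script like this would typically read the newly added file and pass the exit code to `sys.exit()`, since continuous integration services generally treat a nonzero exit code as a failed check.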
[^16]: Kim, A. Y., Herrmann, V., Barreto, R., Calkins, B., Gonzalez-Akre, E., Johnson, D. J., Jordan, J. A., Magee, L., McGregor, I. R., Montero, N., Novak, K., Rogers, T., Shue, J., & Anderson-Teixeira, K. J. (2022). Implementing GitHub Actions continuous integration to reduce error rates in ecological data collection. Methods in Ecology and Evolution, 13, 2572–2585. https://doi.org/10.1111/2041-210X.13982
### Project websites
At a certain point, the outputs of a synthesis project can become numerous and challenging to present to the public in an organized way. Project websites can serve as a gateway to an entire synthesis project by providing comprehensive listings of project outputs (papers, datasets, GitHub repositories, etc.), a narrative for the research, appealing images or graphics for outreach, and links to related projects, funders, or institutions. [GitHub Pages](https://pages.github.com/) sites are a common solution for creating simple, cost-effective (free, usually) project websites nowadays, but there are other options. A good project website can become a cohesive, engaging clearinghouse for information about a synthesis project, but it can be laborious to create and keep up-to-date.
Examples:
- [The Portal Project](https://portal.weecology.org/)
- [The SoDaH project]().
- [The CoRRE project](https://corredata.weebly.com/)
:::
### Linking synthesis products together
Reflecting on all the information above, we can see one common feature of the many different products of a synthesis team: they exist primarily as digital objects on the internet. The internet may seem fluid, but fortunately there are ways to identify and connect these digital objects in a stable way.
#### Persistent identifiers
Persistent identifiers, or [PIDs](https://en.wikipedia.org/wiki/Persistent_identifier), are references to digital objects that are intended to last a long time. For objects on the internet, they are intended to be unique, i.e. having a 1:1 relationship between the PID and the digital object, and machine actionable, meaning they can be understood by software like web browsers. There are many different types of PIDs, but the most useful ones in the context of publishing research products are:
- [Digital Object Identifiers](https://www.doi.org/) (DOI), used to identify digital publications like journal articles, datasets, or government reports.
- [Open Researcher and Contributor ID](https://orcid.org/) (ORCID), used to identify individuals, usually in the context of research or publishing activities.
- [Research Organization Registry](https://ror.org/) (ROR), used to identify organizations, also in the context of research and publishing, primarily.
These identifiers can and should be associated with all journal articles and published datasets resulting from synthesis projects. DOIs and ORCIDs can easily be associated with GitHub and other code repositories as well.
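As a small illustration of what "machine actionable" can mean in practice, the Python sketch below turns a DOI into a resolvable URL and checks the final character of an ORCID iD using the ISO 7064 MOD 11-2 checksum that ORCID iDs carry. The function names are hypothetical helpers written for this example, and the check only confirms that an iD is well formed, not that it is registered to a real person.

```python
import re

# ORCID iDs are written as four hyphen-separated groups; the last character may be X.
ORCID_PATTERN = re.compile(r"\d{4}-\d{4}-\d{4}-\d{3}[\dX]")


def orcid_check_digit(base_digits):
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid):
    """Return True if the iD has the expected format and a matching check character."""
    if not ORCID_PATTERN.fullmatch(orcid):
        return False
    digits = orcid.replace("-", "")
    return orcid_check_digit(digits[:15]) == digits[15]


def doi_to_url(doi):
    """Strip common prefixes from a DOI and return its https://doi.org/ URL."""
    doi = doi.strip()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    return "https://doi.org/" + doi


# 0000-0002-1825-0097 is the sample iD used in ORCID's own documentation.
print(is_valid_orcid("0000-0002-1825-0097"))
print(doi_to_url("doi:10.1038/s41597-024-03637-x"))
```

A check like this can be handy when assembling contributor lists or dataset metadata, where a mistyped identifier would otherwise silently break the link between a person and their work.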
#### Citing synthesis products
The best way to ensure that use of a research product is recognized is through proper citation. This is already common practice for journal articles, but is only recently being adopted for published datasets. The most logical place in an article to cite a published dataset is in the Methods section and in the Data Availability Statement, which most reputable journals now require. Be sure to check journal data sharing requirements well in advance so that data publication preparation can begin early enough. **When citing datasets, be sure that the full bibliographic entry is correctly included in the article's References list.** Citation of code is not as widely practiced, but some journals require it and it is a best practice.
:::{.panel-tabset}
### A useful data availability statement
From Currier and Sala 2022[^17]. Note that source datasets are properly cited in the Data Availability Statement, meaning an in-text citation is given and the full bibliographic entry is provided in the article reference list (not shown). The DOIs included here are helpful for quickly finding the data.
> All original and derived phenology data produced by the authors, and R scripts for data processing, statistical analyses, and figure production are publicly available in the Environmental Data Initiative (EDI) repository. EDI package knb-lter-jrn.210574001.2 (Currier & Sala, 2022a) contains daily phenocam image data, derived timeseries and associated scripts for processing and is available at <https://doi.org/10.6073/pasta/836360dce9311130383c9672e836d640>. EDI package knb-lter-jrn.210574002.2 (Currier & Sala, 2022b) contains observed phenological indicators and environmental drivers as well as associated scripts for final analyses and figure construction presented in this manuscript and these data are available at <https://doi.org/10.6073/pasta/d327a77f6474131db8aa589011e29c29>. No novel code was generated by the authors of this manuscript. The precipitation data used in all analyses are derived from G-BASN data in EDI package knb-lter-jrn.210520001 (Yao et al., 2020) available at <https://doi.org/10.6073/pasta/cf3c45e5480551453f1f9041d664a28f>. Daily air temperature summaries from 4 June 1914 to the present for the Jornada Experimental Range Headquarters (NOAA station GHCND:USC00294426) are freely available upon request via the National Ocean and Atmospheric Administration (<https://www.ncdc.noaa.gov/cdo-web/datasets/GHCND/stations/GHCND:USC00294426/detail>).
[^17]: Currier, Courtney M., and Osvaldo E. Sala. 2022. “Precipitation versus Temperature as Phenology Controls in Drylands.” Ecology 103(11): e3793. https://doi.org/10.1002/ecy.3793
### Not as helpful
> Data used in the figures are included in the supplementary material. The full dataset will be provided upon reasonable request to the corresponding author.
:::
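One way to double-check that every dataset mentioned in a Data Availability Statement also has a full entry in the References list is to compare the DOIs found in each. The Python sketch below is a rough illustration of that idea, not a complete tool. The DOI pattern is a simplified one that will not match every valid DOI, and the DOIs and reference text in the example are made-up placeholders.

```python
import re

# Simplified DOI pattern; an assumption for this sketch that will miss some valid DOIs.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[-._;()/:A-Za-z0-9]+")


def extract_dois(text):
    """Return the set of DOIs found in text, lowercased, with trailing punctuation removed."""
    return {match.rstrip(".,;").lower() for match in DOI_PATTERN.findall(text)}


def missing_from_references(statement, references):
    """Return DOIs that appear in the statement but not in the reference list."""
    return sorted(extract_dois(statement) - extract_dois(references))


# Placeholder text with made-up DOIs, for illustration only.
statement = (
    "Derived data are available at https://doi.org/10.5555/example-dataset-a. "
    "Source data are available at https://doi.org/10.5555/example-dataset-b."
)
references = (
    "Author, A. 2024. Example dataset A. Example Repository. "
    "https://doi.org/10.5555/example-dataset-a"
)

print(missing_from_references(statement, references))
```

Here the second dataset is reported as missing because its DOI appears in the statement but has no matching entry in the reference text.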
## Maintaining Momentum
As we discussed in Module 1, starting a synthesis project benefits from motivating scientific questions, a well-planned foundation for team science, and significant activation energy from the team. When successful, synthesis projects gather enough momentum to be productive for many years. Below are a few ideas on how to maintain this momentum.
### Give everyone credit
Everyone deserves credit for the work they do, and in academic environments this is too often overlooked. Synthesis working groups commonly begin without any dedicated personnel support, which means that some participants, usually early-career scientists, will be contributing unpaid time to the project. In the absence of pay, leaders of a synthesis team should take the initiative to make sure everyone receives appropriate credit and opportunities for career advancement when they contribute to the project. Below are a few thoughts on how to do that.
:::{.panel-tabset}
### **Do's**
- Discuss and define in advance some of the contributions team members will make.
- This is particularly important for deciding authorship of journal articles.
- The [CRediT framework](https://credit.niso.org/) is a good starting point.
- More detail on this is in [Module 1](module1.qmd).
- Be willing to credit participants for a wide variety of contributions.
- This includes writing code, cleaning data, taking meeting notes, and more.
- Make sure all contributors have an [ORCID](https://orcid.org/register). They are easy to obtain and widely used.
- Use ORCIDs to associate contributors with a research product whenever possible.
- List contributors on websites, GitHub repositories, and other public-facing team materials.
- It's nice to include affiliations, bios, links to profile pages, and other information too.
### **Don'ts**
- Don't rely on any one metric for valuing contributions to the team.
- Code commits in GitHub, for example, may reflect the input of many people besides the one who actually wrote and committed the code.
- Don't forget students, technicians, early-career scientists, and others.
- Don't forget to put your name on your work!
### **Discuss**
> **What are we missing here?**
:::
### Encourage new contributions
Interests and commitment to synthesis projects change over time. To sustain active research contributions by the team, and continued use of the data, make sure new people can find a way to participate.
- **Provide a path for new data contributions.**
- This follows from making the data preparation/harmonization workflow reproducible.
- **Have open meetings when possible.**
- This helps bring in new team members that are interested and willing to contribute.
- **Give all team members the freedom and support to lead analyses, papers, and other valuable project activities.**
### Find support
Maintaining momentum for a synthesis project over the long term is highly dependent on the ability to keep scientists engaged and find support for dedicated personnel time. Usually this means getting monetary support in the form of grants.
- **Explore and apply to the funding sources presented in [Module 1](module1.qmd).**
- Personnel support may need to come from larger grants since working group funding often provides only meeting support.
- **Think creatively about how to get students and postdocs participating in synthesis projects.**
- If student/postdoc research interests and plans overlap with the project, dedicating some time to synthesis group work can lead to career-building opportunities (networking, high-impact papers).
- **Promote the synthesis team's work!**
- It is difficult to attract interest from new participants and new resources for a project without doing this.
### HAVE FUN!
When done correctly, ecological synthesis research means having lots of fun doing science with a great team.