The recommended exchange format for data to and from the CoL Clearinghouse is a tabular text format with a fixed set of files and columns.
The format is a single ZIP archive that bundles various delimited text files described below together with a metadata.yaml file providing basic metadata about the entire dataset. Each file holds records for the same class of things shown in this diagram:
The ColDP format was developed to overcome limitations of the formats currently used for sharing taxonomic information, namely Darwin Core Archives (DwC-A) and the Catalogue of Life submission format, also known as ACEF (Annual Checklist Exchange Format). Darwin Core Archives and ACEF can still be used for exchanging data to and from the Catalogue of Life Clearinghouse, but the ColDP format supports the most features. The following table gives an overview of the features supported by each of the three formats:
Feature | ACEF | DwC-A | ColDP |
---|---|---|---|
Linnean classification (KPCOFG) | x | x | x |
Extended Linnean classification (subranks) | - | - | x |
Flexible Parent-child classification | - | x | x |
Unrestricted ranks | - | x | x |
Higher taxon details | - | x | x |
Infraspecific taxa | x | x | x |
Nested infraspecific taxa | - | x | x |
Basionyms | - | x | x |
Synonyms | x | x | x |
Synonyms for higher taxa | - | x | x |
Name identifier | - | x | x |
Nomenclatural status | x | x | x |
Fossils/extinction flags | x | x | x |
Name & taxon separation | - | - | x |
Species interactions | - | - | x |
Species estimates | - | - | x |
Structured references | x | - | x |
Microcitations | - | - | x |
Nomenclatural relations | - | - | x |
Type species | - | x | x |
Type specimen | - | x | x |
Taxon concepts | - | x | x |
Taxon concept relations | - | x | x |
Vernacular names | x | x | x |
Structured distributions | x | x | x |
Taxon descriptions | - | x | x |
Multimedia metadata | - | x | x |
x = supported, - = not supported
The filename for an entity in the above diagram is a case insensitive version of the class name, with any number of hyphens or underscores ignored, followed by a known tabular text suffix. The suffix specifies one of the two supported tabular flavours, comma separated or tab separated files:
- csv: a comma separated, optionally quoted CSV file as per RFC 4180
- tsv, tab or txt: a tab separated file without quoting
Valid examples are Taxon.tsv or vernacular-name.csv.
It is recommended to place all data files in a subfolder called data, but having them on the root level is allowed.
All files should be encoded in UTF-8.
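The file naming rule above can be sketched in Python as follows. This is only an illustration, not the Clearinghouse's own implementation: the function name `classify_data_file` is hypothetical, and the entity list is simply taken from the tabular entities described in this document.

```python
import re
from pathlib import PurePosixPath

# Entity class names as described in this specification.
ENTITIES = [
    "Name", "NameRelation", "Taxon", "Synonym", "NameUsage",
    "TaxonConceptRelation", "SpeciesInteraction", "SpeciesEstimate",
    "Reference", "NameReference", "TypeMaterial", "Distribution",
    "Media", "VernacularName",
]
_BY_KEY = {name.lower(): name for name in ENTITIES}

# Known tabular suffixes mapped to the delimiter they imply.
DELIMITERS = {"csv": ",", "tsv": "\t", "tab": "\t", "txt": "\t"}


def classify_data_file(path):
    """Return (entity, delimiter) for a data file path, or None if unrecognised."""
    p = PurePosixPath(path)
    suffix = p.suffix[1:].lower()
    if suffix not in DELIMITERS:
        return None
    # Hyphens and underscores are ignored and the match is case insensitive.
    key = re.sub(r"[-_]", "", p.stem).lower()
    entity = _BY_KEY.get(key)
    if entity is None:
        return None
    return entity, DELIMITERS[suffix]
```

Only the last path segment is inspected, so the same function works whether the files sit in the data subfolder or on the root level.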
- metadata.yaml
- Name
- NameRelation
- Taxon
- Synonym
- NameUsage
- TaxonConceptRelation
- SpeciesInteraction
- SpeciesEstimate
- Reference
- Reference JSON-CSL
- Reference BIBTEX
- NameReference
- TypeMaterial
- Distribution
- Media
- VernacularName
- Treatment documents
A YAML file with metadata about the entire data package should be included. The file consists mostly of key value pairs such as title; see the comments in metadata.yaml for all available keys. The exceptions are the contact and authorsAndEditors properties, which take a simple person object; see the YAML example. Additional entries in the YAML file are allowed to express non-standard properties.
All data files should contain a header row that specifies the name of the columns as given below. In the absence of a header row it is expected that all columns exist in the exact order given below. With headers given it is allowed to share additional columns which are not part of the standard as listed below.
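As a rough sketch of how a consumer might read such a file, the snippet below uses Python's standard csv module. It assumes a header row is present; the function name `read_records` is hypothetical, and Python's default CSV dialect is used here as an approximation of RFC 4180.

```python
import csv
import io


def read_records(text, delimiter):
    """Parse the text of a data file with a header row into a list of dicts."""
    source = io.StringIO(text, newline="")
    if delimiter == "\t":
        # Tab separated files carry no quoting, so quote characters stay literal.
        reader = csv.DictReader(source, delimiter="\t", quoting=csv.QUOTE_NONE)
    else:
        # Comma separated files may quote values that contain commas.
        reader = csv.DictReader(source)
    return list(reader)
```

Because records are keyed by the header names, additional non-standard columns are simply carried along and can be ignored by the consumer.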
See NAMES.md for examples and rationales.
Unique name identifier that is referred to elsewhere via nameID.
Identifier of the name which is the original combination of this name, also known as the basionym or protonym. Contrary to the strict basionym definition, it is recommended to populate this field for original names as well, in which case it should point to the name itself.
Note there is an alternative way to share the information about an original name by using a NameRelation with type=basionym.
The field basionymID exists for simplicity and because it is an important piece of information to share.
Required scientific name excluding the authorship
Authorship of the scientificName
type: rank enum
The rank of the name, preferably given in case insensitive English. The recommended vocabulary is included in rank_enum.
The single-word name of generic or higher rank names.
The genus part of a bi/trinomial. Note that for generic names the uninomial field should be used, not genus!
The infrageneric epithet in case of bi/trinomials. In zoological names often the subgenus.
The specific epithet in case of bi/trinomials.
The infraspecific epithet in case of bi/trinomials.
The name of the cultivar for names governed by the cultivar code.
type: code enum
The nomenclatural code the name falls under.
type: nomStatus enum
The broad nomenclatural status of the name. For the exact status note, e.g. nomen nudum, the remarks field should additionally be used. Alternatively a URI or simple name from a class of the NOMEN ontology can be used.
A pointer to a Reference indicating the original publication of the name in its given combination, not the basionym.
The effective year the name was published, given as a 4 digit integer. It is the year that is nomenclaturally relevant for the given combination. In most cases this will be the same as the publication year given in the linked reference record via referenceID. But in some cases this might be different.
The exact single page number where the name was published. If the description spans multiple pages, the first page should be given.
A URL to the exact page where the name was published. If the description spans multiple pages, the link to the first page should be given.
A link to a webpage provided by the source depicting the name.
Additional nomenclatural remarks about the name. Often indicating its status or relevant rules in the code.
A directed nomenclatural name relation. See NAMES.md#name-relations for examples.
The subject name this relation originates from. Refers to an existing Name.ID or NameUsage.ID within this data package.
The object name this relation relates to. Refers to an existing Name.ID or NameUsage.ID within this data package.
type: enum
The kind of directed nomenclatural relation.
The reference or nomenclatural act where this nomenclatural relation was established.
Remarks about the relation.
Type material designated to names. Type material should only be associated with the original name, not with a recombination.
Optional unique identifier for the specimen. If possible use the existing specimen identifier, e.g. the collection/institution code and catalogue number. If coming from a Darwin Core world dwc:occurrenceID is a great fit.
Pointer to the typified name referring to an existing Name.ID within this data package. Type material should only be associated with an original name, not with recombinations.
Material citation of the type material, i.e. type specimen. The citation is ideally given in the verbatim form as it was used in the original publication of the name or the subsequent designation. Otherwise it is recommended to follow the material citation guidelines published by European Journal of Taxonomy. If atomized fields below are given a citation is not needed. Otherwise it is required.
type: type status enum
The status of the type material, e.g. holotype.
A referenceID pointing to the Reference table indicating the publication of the type designation. Most often this is equivalent to the original name's referenceID, but for subsequent designations a later reference should be cited.
The type locality. Ideally from largest area to smallest.
The country of the type locality. Preferably as ISO codes.
Decimal latitude of the type locality given in WGS84
Decimal longitude of the type locality given in WGS84
Altitude of the type locality. Ideally given as meters above mean sea level. Depth should be given as negative altitudes.
Indicates the host organism from which the type specimen was obtained (symbiotype).
Date the type material was gathered. Recommended to be given as ISO 8601 dates.
The collector's name.
A link to further information about the specimen, e.g. as provided by the institute holding the collection.
Any further remarks on the type material.
An accepted name with a taxonomic classification given either as a parent-child relation or as a flat, denormalized record.
Unique taxon identifier that is referred to elsewhere via taxonID.
The direct parent taxon's ID in the classification. This is the preferred way of exchanging a hierarchy and takes precedence over any classification given in the denormalized fields.
Pointer to the accepted name referring to an existing Name.ID within this data package.
An optional, unrestricted, loose phrase appended to the name just for this taxon. E.g. the phrase "sensu lato" may be added to the name to describe this taxon more precisely.
A reference ID to the publication that established the taxonomic concept used by this taxon.
The author & year of the reference will be used to qualify the name with sensu AUTHOR, YEAR.
The ID must refer to an existing Reference.ID within this data package.
Name of the person who is the latest scrutinizer who revised or reviewed the taxonomic concept.
Identifier for the scrutinizer. Highly recommended are ORCID ids.
type: ISO8601 date
The date when the taxonomic concept was last revised or reviewed by the scrutinizer.
type: boolean
A flag indicating that the taxon is only provisionally accepted and should be handled with care.
A comma concatenated list of reference IDs supporting the taxonomic concept that has been reviewed by the scrutinizer. Each ID must refer to an existing Reference.ID within this data package.
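A check of such a comma concatenated list against the known references could look like the following sketch. The function name `missing_reference_ids` is hypothetical, and the snippet assumes that whitespace around the individual IDs is not significant and that IDs themselves contain no commas.

```python
def missing_reference_ids(value, known_ids):
    """Return the IDs from a comma concatenated list that are not in known_ids."""
    ids = [part.strip() for part in (value or "").split(",")]
    # Empty parts (e.g. from an empty field or a trailing comma) are skipped.
    return [ref_id for ref_id in ids if ref_id and ref_id not in known_ids]
```

An empty result means every listed ID resolves to an existing Reference.ID.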
type: boolean
Nullable flag indicating that the taxon is extinct (true) or extant (false). This includes species that died out very recently.
type: enum
Earliest appearance of the taxon in the geological time scale. Recommended values are geochronological names from the official International Commission on Stratigraphy (ICS).
type: enum
Latest appearance of the taxon in the geological time scale. Recommended values are geochronological names from the official International Commission on Stratigraphy (ICS).
type: enum[]
A comma delimited list of environments this taxon is known to exist in.
The species binomial the taxon is classified in. If parentID is given this field is ignored.
The (botanical) section the taxon is classified in. It is considered a botanical rank below subgenus, not the zoological section above family. If parentID is given this field is ignored.
The subgenus the taxon is classified in. If parentID is given this field is ignored.
The genus the taxon is classified in. If parentID is given this field is ignored.
The subtribe the taxon is classified in. If parentID is given this field is ignored.
The tribe the taxon is classified in. If parentID is given this field is ignored.
The subfamily the taxon is classified in. If parentID is given this field is ignored.
The family the taxon is classified in. If parentID is given this field is ignored.
The superfamily the taxon is classified in. If parentID is given this field is ignored.
The suborder the taxon is classified in. If parentID is given this field is ignored.
The order the taxon is classified in. If parentID is given this field is ignored.
The subclass the taxon is classified in. If parentID is given this field is ignored.
The class the taxon is classified in. If parentID is given this field is ignored.
The subphylum the taxon is classified in. If parentID is given this field is ignored.
The phylum the taxon is classified in. If parentID is given this field is ignored.
The kingdom the taxon is classified in. If parentID is given this field is ignored.
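The precedence of parentID over the denormalized fields can be illustrated with a short sketch. The column names in `FLAT_RANKS` are assumed from the field descriptions above, and the function name `classification` as well as the shape of its return value are hypothetical.

```python
# Denormalized rank columns, ordered from highest to lowest rank (assumed names).
FLAT_RANKS = [
    "kingdom", "phylum", "subphylum", "class", "subclass", "order",
    "suborder", "superfamily", "family", "subfamily", "tribe", "subtribe",
    "genus", "subgenus", "section", "species",
]


def classification(taxon, taxa_by_id):
    """Return (source, path) describing the higher classification of a taxon.

    With a parentID the path is the list of ancestor IDs, root first, and the
    denormalized columns are ignored. Otherwise the path is the list of
    (rank, name) pairs found in the denormalized columns.
    """
    if taxon.get("parentID"):
        lineage = []
        seen = {taxon["ID"]}  # guards against cyclic parent links
        parent_id = taxon["parentID"]
        while parent_id and parent_id in taxa_by_id and parent_id not in seen:
            seen.add(parent_id)
            parent = taxa_by_id[parent_id]
            lineage.append(parent["ID"])
            parent_id = parent.get("parentID")
        return "parentID", list(reversed(lineage))
    flat = [(rank, taxon[rank]) for rank in FLAT_RANKS if taxon.get(rank)]
    return "flat", flat
```

A record that carries both a parentID and, say, a family value is therefore resolved through its parents only.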
A link to a webpage provided by the source depicting the taxon.
Any further taxonomic remarks.
A synonymous name for a taxon. Note that the same name can be linked to multiple taxa by having several Synonym records to model pro parte synonyms.
Optional unique identifier for the synonym. If given it should not clash with the taxon ids.
Pointer to the taxon that this synonym is used for. For pro parte synonyms with multiple accepted names several synonym records sharing the same name but having different taxonIDs should be created. Refers to an existing Taxon.ID within this data package.
Pointer to the synonymous name referring to an existing Name.ID within this data package.
An optional, unrestricted, loose phrase appended to the name just for this synonym.
E.g. the phrase "sensu lato" may be added to the name to describe this synonym more precisely.
Or "auct. mult." or "auct. amer." for misapplied names that cannot refer to a single publication.
Misapplied names that refer to a single publication should use accordingToID instead.
A reference ID to the publication that established the taxonomic concept used by this taxon.
The author & year of the reference will be used to qualify the name with sensu AUTHOR, YEAR.
Strongly recommended in case of misapplied names.
The ID must refer to an existing Reference.ID within this data package.
type: enum
The kind of synonym. One of synonym, ambiguous synonym or misapplied.
A comma concatenated list of reference IDs supporting the synonym status of the name. Each ID must refer to an existing Reference.ID within this data package.
A link to a webpage provided by the source depicting the synonym.
Any further taxonomic remarks.
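Since pro parte synonyms are modelled as several Synonym records sharing one nameID, they can be detected by grouping on that column. The following is a minimal sketch; the function name `pro_parte_names` is hypothetical.

```python
from collections import defaultdict


def pro_parte_names(synonyms):
    """Map each nameID that is a synonym of several taxa to its sorted taxonIDs."""
    taxa_by_name = defaultdict(set)
    for synonym in synonyms:
        taxa_by_name[synonym["nameID"]].add(synonym["taxonID"])
    # Only names attached to more than one accepted taxon are pro parte.
    return {name: sorted(taxa) for name, taxa in taxa_by_name.items() if len(taxa) > 1}
```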
As a simpler alternative to the 3 entities Name, Taxon and Synonym a single NameUsage entity can be supplied.
A NameUsage record can either be an accepted Taxon or a Synonym and is easily distinguished by its status.
A NameUsage.ID acts both as a taxonID and nameID if referred to from other tables, e.g. TypeMaterial or VernacularName.
For synonyms the parentID field is used to link to the accepted taxon.
All properties available in the individual entities can also be used for the single NameUsage:
Some properties exist on both Name and Taxon/Synonym with slightly different meanings. The following NameUsage properties therefore deviate slightly from their usage in the individual entities:
- parentID: for taxa it points to the next higher taxon's ID to form the classification, for synonyms it points at the accepted taxon.
- status: is the taxonomic name usage status which includes Synonym.status and the Taxon.provisional flag.
- nameStatus: corresponds to the nomenclatural name status.
- genus: is the taxonomic classification of a name usage and corresponds to Taxon.genus. For synonyms it often is not the same as the genus part of the name
- genericName: corresponds to the genus field of a name and represents the atomized genus of a scientificName.
- referenceID: corresponds to the taxonomic reference(s) otherwise given in Taxon/Synonym.referenceID.
- nameReferenceID: corresponds to the nomenclatural reference otherwise given in Name.referenceID.
- namePublishedInYear: corresponds to Name.publishedInYear.
- namePublishedInPage: corresponds to Name.publishedInPage.
- namePublishedInPageLink: corresponds to Name.publishedInPageLink.
If a single NameUsage entity is given, no Name, Taxon or Synonym entity may exist alongside it.
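How a consumer might split NameUsage records back into taxa and synonyms is sketched below. The function name `split_name_usages` is hypothetical. The snippet assumes that the synonym statuses are spelled as in the Synonym entity and compared case insensitively, and that any other status denotes an accepted (possibly provisional) taxon.

```python
# Synonym statuses as listed for Synonym.status.
SYNONYM_STATUSES = {"synonym", "ambiguous synonym", "misapplied"}


def split_name_usages(usages):
    """Split NameUsage records into (taxa, synonyms) based on their status."""
    taxa, synonyms = [], []
    for usage in usages:
        status = (usage.get("status") or "").strip().lower()
        if status in SYNONYM_STATUSES:
            # For synonyms parentID points at the accepted taxon.
            synonyms.append({
                "nameID": usage["ID"],
                "taxonID": usage.get("parentID"),
                "status": status,
            })
        else:
            # For taxa parentID points at the next higher taxon.
            taxa.append({
                "ID": usage["ID"],
                "nameID": usage["ID"],
                "parentID": usage.get("parentID"),
            })
    return taxa, synonyms
```

The NameUsage.ID is reused as both taxonID and nameID, mirroring the rule that it acts as either when referred to from other tables.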
A directed taxon relation representing RCC5 taxon concept assertions.
The subject taxon this relation originates from.
The object taxon this relation relates to.
type: enum
The kind of directed RCC5 relation that specifies how the two taxon concepts are related.
A reference where this relation was documented or who asserted it.
Remarks about the concept relation.
A directed taxon relation representing species interactions. Unlike a TaxonConceptRelation, a species interaction can also point to a species (name) outside of the local dataset.
The subject taxon the species interaction is about. Always required to point to an existing taxonID in the local dataset.
The related taxon this interaction is describing. If given it must refer to a local taxonID from the dataset. If missing, the 'relatedTaxonScientificName' must be given instead.
The scientificName of the related taxon this interaction is describing. Includes the authorship if known. It is mutually exclusive with relatedTaxonID and if given no relatedTaxonID should exist. The relatedTaxonScientificName can be used to document species interactions without the need to have full blown name and taxon records.
type: enum
The kind of directed species interaction. Each interaction exists also in reverse to allow the alternative relatedTaxonScientificName field to be used. Species interaction types are heavily inspired by https://www.globalbioticinteractions.org and the OBO Relation Ontology http://www.ontobee.org/ontology/RO to which all entries are mapped.
A reference where the interaction was documented.
Remarks about the species interaction.
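The rules for the related taxon (either a local relatedTaxonID or a relatedTaxonScientificName, never both) can be expressed as a small validation sketch. The function name `interaction_errors` and the message texts are hypothetical.

```python
def interaction_errors(record, taxon_ids):
    """Return a list of rule violations for one SpeciesInteraction record."""
    errors = []
    if record.get("taxonID") not in taxon_ids:
        errors.append("taxonID must refer to a taxon in this dataset")
    related_id = record.get("relatedTaxonID")
    related_name = record.get("relatedTaxonScientificName")
    if related_id and related_name:
        errors.append("relatedTaxonID and relatedTaxonScientificName are mutually exclusive")
    elif related_id:
        if related_id not in taxon_ids:
            errors.append("relatedTaxonID must refer to a taxon in this dataset")
    elif not related_name:
        errors.append("relatedTaxonID or relatedTaxonScientificName is required")
    return errors
```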
An estimation of the number of species for a given higher taxon, e.g. a family. The estimation must be based on a reference and should give the number of species according to a certain "type" that is expected to exist.
The ID of the higher taxon that the estimate refers to.
type: [integer]
The estimated number of species.
type: enum
The exact kind of estimation, e.g. number of described living species or total estimated species including yet to be described organisms. If none is given the type defaults to 'described species living'.
A mandatory reference ID that supports the estimate and also provides a temporal context.
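A sketch of how the default type and the mandatory reference might be applied when reading an estimate record follows. The column names `estimate` and `type` are assumptions inferred from the descriptions above, and the function name `normalise_estimate` is hypothetical.

```python
DEFAULT_ESTIMATE_TYPE = "described species living"


def normalise_estimate(record):
    """Return a copy of a SpeciesEstimate record with an integer estimate and a type."""
    if not record.get("referenceID"):
        raise ValueError("a species estimate requires a referenceID")
    result = dict(record)
    result["estimate"] = int(record["estimate"])
    # A missing or empty type falls back to the documented default.
    result["type"] = record.get("type") or DEFAULT_ESTIMATE_TYPE
    return result
```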
Structured bibliographic references with a unique id to refer to from other entities. References can be given in various degrees of atomization:
- A simple citation string is the minimum required
- Semi structured using the field list below mostly corresponding to Dublin Core
- Fully parsed references in the well established BibTeX or CSL-JSON format. See the sections below for how to share these alternative formats, which do not conform to tabular CSV/TSV files.
The local identifier for the reference as used in referenceID in other entities.
Full bibliographic citation as one single string as an alternative to the rest of the more structured fields. If individual fields are given the full citation can be ignored.
The author(s) of the work. If multiple authors use a style that can safely be parsed. Recommended is to list authors by comma and prefix their surname with initials. If a comma is used to separate surname, firstname please use a semicolon to delimit individual authors.
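The recommended author styles can be parsed with a very small heuristic, sketched here under the assumption that a semicolon, when present, is always the author delimiter. The function name `split_authors` is hypothetical.

```python
def split_authors(author):
    """Split an author string into individual authors.

    A semicolon, if present, delimits authors (allowing "Surname, Firstname").
    Otherwise authors are assumed to be separated by commas.
    """
    separator = ";" if ";" in author else ","
    return [part.strip() for part in author.split(separator) if part.strip()]
```

Note the limitation: a single author written as "Surname, Firstname" without any semicolon would be split in two, which is why the semicolon style is requested in that case.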
The title of the work. In case of journal articles the article title, not the journal itself.
The year of the publication.
The title of the journal or book the work was published in. The source should exclude volume, edition, pages and other specifics.
All details to locate the work within the source, sometimes also referred to as collation. That can include journal volume, edition, pages, pointer to illustrations or anything else.
The DOI of the reference
A URL link to the reference
Additional comments about the reference.
Instead of the main reference file a reference.json file can be added to provide a JSON array of highly structured references in the CSL-JSON format, e.g. as provided by CrossRef:
curl --location --silent --header "Accept: application/vnd.citationstyles.csl+json" https://doi.org/10.1126/science.169.3946.635
The id field in each record of the array is used as the primary key and referred to from referenceID fields elsewhere.
[
{
"id": "science.169.3946.635",
"publisher": "American Association for the Advancement of Science (AAAS)",
"issue": "3946",
"published-print": {
"date-parts": [
[
1970,
8,
14
]
]
},
"DOI": "10.1126/science.169.3946.635",
"type": "article-journal",
"created": {
"date-parts": [
[
2006,
10,
5
]
],
"date-time": "2006-10-05T12:56:56Z",
"timestamp": 1160053016000
},
"page": "635-641",
"source": "Crossref",
"title": "The Structure of Ordinary Water: New data and interpretations are yielding new insights into this fascinating substance",
"prefix": "10.1126",
"volume": "169",
"author": [
{
"given": "H. S.",
"family": "Frank",
"sequence": "first",
"affiliation": []
}
],
"container-title": "Science",
"original-title": [],
"language": "en",
"link": [
{
"URL": "https://syndication.highwire.org/content/doi/10.1126/science.169.3946.635",
"content-type": "unspecified",
"content-version": "vor",
"intended-application": "similarity-checking"
}
],
"deposited": {
"date-parts": [
[
2020,
2,
5
]
],
"date-time": "2020-02-05T16:15:06Z",
"timestamp": 1580919306000
},
"subtitle": [],
"short-title": [],
"issued": {
"date-parts": [
[
1970,
8,
14
]
]
},
"journal-issue": {
"published-print": {
"date-parts": [
[
1970,
8,
14
]
]
},
"issue": "3946"
},
"URL": "http://dx.doi.org/10.1126/science.169.3946.635",
"ISSN": [
"0036-8075",
"1095-9203"
],
"subject": [
"Multidisciplinary"
],
"container-title-short": "Science"
}
]
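Reading such a file and indexing it by the id field could be sketched as follows. The function name `index_csl_references` is hypothetical, and rejecting missing or duplicate ids is an assumption that follows from treating id as a primary key.

```python
import json


def index_csl_references(text):
    """Index the records of a CSL-JSON array by their id field."""
    index = {}
    for record in json.loads(text):
        ref_id = record.get("id")
        if ref_id is None:
            raise ValueError("CSL-JSON record without an id")
        ref_id = str(ref_id)
        if ref_id in index:
            raise ValueError(f"duplicate reference id: {ref_id}")
        index[ref_id] = record
    return index
```

The keys of the resulting dictionary are the values that referenceID fields in other entities are expected to use.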
As an alternative to CSL-JSON a BibTeX file reference.bib can be given to provide highly structured citations.
The id field following the opening curly bracket is used as the primary key and referred to from referenceID fields elsewhere.
@article{Droege_2016,
title={The Global Genome Biodiversity Network (GGBN) Data Standard specification},
volume={2016},
ISSN={1758-0463},
url={http://dx.doi.org/10.1093/database/baw125},
DOI={10.1093/database/baw125},
journal={Database},
publisher={Oxford University Press (OUP)},
author={Droege, G. and Barker, K. and Seberg, O. and Coddington, J. and Benson, E. and Berendsohn, W. G. and Bunk, B. and Butler, C. and Cawsey, E. M. and Deck, J. and et al.},
year={2016},
pages={baw125}
}
@article{Frank_1970,
title = {The Structure of Ordinary Water: New data and interpretations are yielding new insights into this fascinating substance},
volume = {169},
ISSN = {1095-9203},
url = {http://dx.doi.org/10.1126/science.169.3946.635},
DOI = {10.1126/science.169.3946.635},
number = {3946},
journal = {Science},
publisher = {American Association for the Advancement of Science (AAAS)},
author = {Frank, H. S.},
year = {1970},
month = {Aug},
pages = {635–641}
}
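The entry keys of a reference.bib file, which serve as the reference IDs, can be listed with a regular expression. This is a heuristic sketch and not a full BibTeX parser; the function name `bibtex_keys` is hypothetical.

```python
import re

# Matches "@type{key," at the start of an entry and captures type and key.
_ENTRY = re.compile(r"@(\w+)\s*\{\s*([^,\s]+)\s*,")

# BibTeX blocks that are not bibliographic entries.
_NON_ENTRIES = {"comment", "string", "preamble"}


def bibtex_keys(text):
    """Return the entry keys found in BibTeX text, in order of appearance."""
    return [key for kind, key in _ENTRY.findall(text) if kind.lower() not in _NON_ENTRIES]
```

Applied to the two entries above, this yields the keys Droege_2016 and Frank_1970.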
References are usually classic bibliographic citations on the article level. In many cases it is desirable to point and link to specific pages, e.g. where exactly a name has first been published.
To avoid highly redundant references, ColDP allows sharing NameReference records. These are microcitations that point to the specific page, within a given Reference record, where a name was mentioned. Using NameReferences is optional.
Pointer to the name that is mentioned in the given reference. Refers to an existing Name.ID or NameUsage.ID within this data package.
Pointer to the reference that includes the name. Refers to an existing Reference.ID within this data package.
The exact page the microcitation is pointing to within the reference
A URL link to the exact page of the reference. If only a link for the entire reference is available this should only be included in the main reference, not here again.
Opportunity to indicate the context, e.g. "Plate VIII: Genitalia drawings" or "Original description of Abies", etc.
A structured distribution record for a taxon in a given area.
Pointer to the taxon referring to an existing Taxon.ID within this data package.
The geographic area this distribution record is about.
type: enum
The geographic gazetteer the area is defined in.
type: enum
Distribution status.
Pointer to the reference that supports this distribution. Refers to an existing Reference.ID within this data package.
An optional microcitation to a specific page within the reference given by referenceID. Multiple page references can be given as a comma concatenated list.
Remarks about the distribution.
Multimedia items for a taxon such as an image, audio or video.
Pointer to the taxon referring to an existing Taxon.ID within this data package.
The URL that resolves to the media item itself, not a webpage that depicts it.
The MIME-type of the media item the url identifies.
Preferably the full type/subtype combination, e.g. image/jpeg, but the primary type alone is sufficient (image, video, audio).
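A simple check of the format value might look like this sketch. Restricting the primary type to image, video and audio is an assumption based on the media kinds named above, and the function name `is_valid_media_format` is hypothetical.

```python
PRIMARY_TYPES = {"image", "video", "audio"}


def is_valid_media_format(value):
    """Accept a full type/subtype MIME type or one of the primary types alone."""
    primary, slash, subtype = value.strip().lower().partition("/")
    if primary not in PRIMARY_TYPES:
        return False
    # Either no slash at all, or a slash followed by a non-empty subtype.
    return not slash or bool(subtype)
```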
Optional title for the item.
type: ISO8601 date
Date the media item was recorded.
Author of the media item.
type: license
Optional webpage from the source this media item is shown on.
A vernacular or common name for a taxon.
Pointer to the taxon referring to an existing Taxon.ID within this data package.
The vernacular name in the original script.
An optional transliteration of the vernacular name into the latin script.
Language of the vernacular name given as an ISO 639-3 letter code.
Country this vernacular name is used in, given as a two-letter ISO 3166-1 alpha-2 code.
Optional area describing the geographic use of the vernacular name in free text within the given country.
type: enum
Optional sex of the organism this vernacular name is restricted to.
Pointer to the reference that supports this vernacular name. Refers to an existing Reference.ID within this data package.
An optional microcitation to a specific page within the reference given by referenceID. Multiple page references can be given as a comma concatenated list.
Treatments are parts of publications that "treat" a single taxon. They can be an original description for a new species, but also subsequent taxonomic works and usually include several sections such as a diagnosis, description, material examined, distribution, etc.
ColDP captures an entire treatment as a TXT, HTML or XML document that lives as an individual file in a subfolder treatments and is named by the corresponding taxonID of the name usage it describes. The taxon's accordingToID should always point to the reference the treatment is published in.
Example: treatments/19854332.html would be an HTML document which is the marked up treatment for the taxon with ID 19854332.
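Mapping a treatment file back to its taxon can be sketched as below. The accepted file suffixes (.txt, .html, .xml) are an assumption derived from the three document types named above, and the function name `treatment_taxon_id` is hypothetical.

```python
from pathlib import PurePosixPath

# Assumed file suffixes for TXT, HTML and XML treatment documents.
TREATMENT_SUFFIXES = {".txt", ".html", ".xml"}


def treatment_taxon_id(path):
    """Return the taxonID a treatment file belongs to, or None if it is not one."""
    p = PurePosixPath(path)
    if p.parent.name != "treatments" or p.suffix.lower() not in TREATMENT_SUFFIXES:
        return None
    # The file name without its suffix is the taxonID.
    return p.stem
```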