Configuration file

John J Czaplewski edited this page Aug 1, 2016 · 1 revision

The file config.yml is the primary configuration file for your application. It contains the following fields:

app_name: The name of your application. It should reflect your research question and target terms, and it must not contain spaces.

description: A brief description of what your application does and what kind of result it produces.

user: Your full name.

email: The email address we should use to contact you.

language: The primary programming language your application is written in. It is used to run the proper dependency install command on the GeoDeepDive infrastructure. For a list of currently supported languages, please see the supported languages page.
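As a rough illustration, the five fields above might be filled in as shown here. Every value is a made-up placeholder, and `python` as the value for `language` is an assumption; check the supported languages page for the values that are actually accepted.

```yaml
# Hypothetical values, for illustration only.
app_name: coffee_mentions    # relevant to the research question, no spaces
description: Finds sentences that mention coffee and outputs a table of matches.
user: Jane Doe
email: jane.doe@example.org
language: python             # assumed value; see the supported languages page
```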

The following two fields are used to cull the corpus down to the documents most relevant to your application. In most cases, applications run against a high-signal subset of the corpus produce better results than those run against the entire corpus. For example, if you are interested in "coffee", but "coffee" occurs in only 1% of all documents, the application will run much faster and produce a better result if only that 1% of documents is used.

The idea is not to completely eliminate noise, but rather to increase signal. Choose terms whose presence in a document is a good indication that the document may contain content of interest.

dictionaries: One or more comma-separated dictionaries to use for culling the corpus. GeoDeepDive contains categorized lists of preindexed terms to make subsetting the corpus easier. Current dictionaries include a list of all taxa from the Paleobiology Database and all stratigraphic names from Macrostrat. For a list of all available dictionaries, please see https://geodeepdive.org/api/dictionaries?all.

terms: One or more comma-separated terms to use for culling the corpus.
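A minimal sketch of the two culling fields, following the "comma-separated" wording above. The dictionary names are hypothetical placeholders, not real GeoDeepDive dictionaries; take actual names from the dictionaries listing. Whether the values should be written as a plain comma-separated string, as assumed here, or in some other YAML form is not stated on this page.

```yaml
# Hypothetical values; replace with real dictionary names and your own terms.
dictionaries: example_dictionary_one, example_dictionary_two
terms: coffee, espresso
```

A document that matches any of these entries would be a candidate for the culled subset, so prefer terms whose presence strongly suggests relevant content.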
