DataStream API helper in R

License

MIT (LICENSE.md); a second file, LICENSE, has an unrecognized licence type.

DataStream R Package

This tool is useful for those who want to extract large volumes of data from DataStream. The R package lets users call the DataStream Public API using built-in R functions and specific search queries. It includes several functions that accept a selection of filtering queries and return a data frame with the requested data from DataStream.

You might use this tool, for example, if you:

  • Want to download all available DataStream pH data in Ontario
  • Want to count how many sites in New Brunswick have cesium data on DataStream

Note: DataStream's Custom Download tool is another option that allows users to download CSV data from across datasets in a particular DataStream hub using basic filters. This tool has fewer filtering options than the API, but works well for basic searches. You can find it via 'Explore Data' in the header menu from any DataStream regional hub.

An API token is required to call the API. To have full API permissions, users must request one.

Installation

To install the most recent version in R:

# install.packages("devtools")
devtools::install_github("datastreamapp/datastreamr")

Attribution/Citation

Thank you in advance for using this data responsibly and for providing appropriate citations when presenting work to external parties. Dataset citations must be accompanied by a link to the DOI (https://doi.org/{value}). The dataset licence, citation, and DOI can be retrieved from the /Metadata endpoint.
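As a small illustration of the DOI link format, the sketch below builds the link in base R. `doi_url()` is a hypothetical helper and not part of the package; the DOI is the one used in the examples further down.

```r
# Hypothetical helper: build the DOI link that accompanies a dataset citation.
doi_url <- function(doi) {
  paste0("https://doi.org/", doi)
}

doi_url("10.25976/1q5q-zy55")
```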

Licence representations

The API returns the URL for a dataset's licence. This URL should be mapped to the full licence name, with a link to the full licence details.
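One way to do this mapping is a named lookup vector. The sketch below is illustrative only: the three Open Data Commons URLs and names are assumptions about what the Licence field may contain, and `licence_label()` is a hypothetical helper, so check the values actually returned before relying on it.

```r
# Lookup table from licence URL to full licence name.
# ASSUMPTION: these entries are examples; confirm them against the
# Licence values returned by the /Metadata endpoint.
licence_names <- c(
  "https://opendatacommons.org/licenses/by/1-0/" =
    "Open Data Commons Attribution License (ODC-By) v1.0",
  "https://opendatacommons.org/licenses/pddl/1-0/" =
    "Open Data Commons Public Domain Dedication and License (PDDL) v1.0",
  "https://opendatacommons.org/licenses/odbl/1-0/" =
    "Open Data Commons Open Database License (ODbL) v1.0"
)

# Hypothetical helper: return "name (url)" for known licences and
# fall back to the bare URL for anything not in the table.
licence_label <- function(url) {
  name <- unname(licence_names[url])
  ifelse(is.na(name), url, paste0(name, " (", url, ")"))
}

licence_label("https://opendatacommons.org/licenses/by/1-0/")
```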

The Functions

The following functions call the DataStream API and pull the desired information.

metadata():

Description
Pulls only the dataset level metadata information including dataset name, citation, licence, abstract, etc.

Usage

metadata(list(`$select` = "Id, DatasetName"))
metadata( 
  list(
    `$select` = NULL,
    `$filter` = NULL,
    `$top` = NULL,
    `$count` = "false"
  )
)

locations():

Description
Pulls only the location data including Location ID, Location Name, Latitude, and Longitude.

Usage

locations( 
  list(
    `$select` = NULL,
    `$filter` = NULL,
    `$top` = NULL,
    `$count` = "false"
  )
)

records():

Description
Pulls data formatted the same as the downloaded DataStream CSVs, including all columns listed in the DataStream schema.

Usage

  • This function will take longer than observations(), but provides all available columns in one request.
  • Use this function if you aim to pull all location and parameter data in one call.
records( 
  list(
    `$select` = NULL,
    `$filter` = NULL,
    `$top` = NULL,
    `$count` = "false"
  )
)

observations():

Description
Pulls data in a condensed format that must be joined with other endpoints to create a full dataset with all the DataStream columns. Specifically, location details are not pulled; instead, a LocationId is pulled for each observation, which can then be used in combination with locations().

Usage

  • This function will be quicker than records(), but if location specifics are needed, it must be paired with locations().
  • Use this function either if you do not need specific location coordinates, or in combination with locations() when you plan to pull millions of rows of data.
observations( 
  list(
    `$select` = NULL,
    `$filter` = NULL,
    `$top` = NULL,
    `$count` = "false"
  )
)
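A minimal sketch of pairing the two results in base R. The data frames below are mock data standing in for the output of observations() and locations(), and the sketch assumes that LocationId in the observations corresponds to Id in the locations.

```r
# Mock data standing in for the results of observations() and locations().
obs <- data.frame(
  Id = c(101, 102, 103),
  LocationId = c(2, 1, 2),
  CharacteristicName = "pH",
  ResultValue = c(7.1, 6.8, 7.4)
)
loc <- data.frame(
  Id = c(1, 2),
  Name = c("Site A", "Site B"),
  Latitude = c(45.1, 46.2),
  Longitude = c(-66.0, -64.5)
)

# Rename the location Id so it does not collide with the observation Id.
names(loc)[names(loc) == "Id"] <- "LocationId"

# Left join: keep every observation and attach its location columns.
joined <- merge(obs, loc, by = "LocationId", all.x = TRUE)
joined
```

Selecting only the location columns you need before the join keeps the merged data frame small when the observation table is large.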

Function Inputs

All of the functions above accept query parameters. The ones supported are:

  • api_token: A character string containing your unique API key

    • Click here to request an api token

  • select: The columns to return

    • Fields to be selected are entered as a single comma-separated string.
    • Example: select="DatasetName,Abstract"
    • Default: All columns available.

Note: refer to the Allowed Values section below for available select fields

  • filter: A list of conditions to filter by

    • Available operators:
    • eq: Used for exact matches.
    • ne: Used for not equal to.
    • gt: Used for greater than.
    • lt: Used for less than.
    • ge: Used for greater than or equal to.
    • le: Used for less than or equal to.
    • and: Used to combine multiple filters with an “and” condition.
    • Grouping: filter="CharacteristicName eq 'Dissolved oxygen saturation' and DOI eq '10.25976/n02z-mm23'"
    • Temporal (Dataset creation): filter="CreateTimestamp gt 2020-03-23"
    • Temporal (Data date-range): filter="ActivityStartYear gt '2019'"
    • Spatial: filter="RegionId eq 'hub.atlantic'"
      • RegionId Values (these values are subject to change):
      • DataStream Hubs: hub.{atlantic,lakewinnipeg,mackenzie,greatlakes,pacific}
      • Countries: admin.2.{ca}
      • Provinces/Territories: admin.4.ca.{ab,bc,mb,nb,nl,ns,nt,nu,on,pe,qc,sk,yt}

Note: refer to the Allowed Values section below for available filter fields

  • top: Number of rows to return

    • Maximum: 10000
    • Example: top=10

  • count: When true, returns the number of matching rows rather than the data itself

    • Returns only the count for the request. When the count is large enough, it becomes an estimate (~0.0005% accurate).
    • Example: count=true
    • Default: false

    Performance Tips

    • Use select to request only the parameters you need. This will decrease the amount of data needed to process and transfer.
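The query parameters described above can be assembled in plain R before being passed to any of the functions. This is a sketch only: `build_filter()` is a hypothetical helper, not part of the package, that joins conditions with the `and` operator.

```r
# Hypothetical helper: combine individual filter conditions with `and`.
build_filter <- function(...) {
  paste(c(...), collapse = " and ")
}

# Query list in the same shape as the Usage sections above.
qs <- list(
  `$select` = "DOI, CharacteristicName, ResultValue",
  `$filter` = build_filter(
    "RegionId eq 'hub.atlantic'",
    "CharacteristicName eq 'pH'",
    "ActivityStartYear gt '2019'"
  ),
  `$top` = "10"
)

qs[["$filter"]]
```

Keeping each condition as its own string makes it easier to add or drop a condition without editing one long filter expression.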

Allowed Values

The allowed select and filter options for each of the functions are listed below.

Note: When using the filter field, a useful resource is the "allowed values" tab of our upload template. This will give you available strings for:

  • MonitoringLocationType

  • ActivityMediaName

  • CharacteristicName

  • metadata
select: 'DOI', 'Version', 'DatasetName', 'DataStewardEmail', 'DataCollectionOrganization', 
'DataUploadOrganization', 'ProgressCode', 'MaintenanceFrequencyCode', 'Abstract', 
'DataCollectionInformation', 'DataProcessing', 'FundingSources', 'DataSourceURL', 
'OtherDataSources', 'Citation', 'Licence', 'Disclaimer', 'TopicCategoryCode', 'Keywords', 
'CreateTimestamp'

filter: 'DOI', 'DatasetName', 'RegionId', 'Latitude', 'Longitude', 'CreateTimestamp'
  • locations
select: 'Id', 'DOI', 'NameId', 'Name', 'Latitude', 'Longitude', 
'HorizontalCoordinateReferenceSystem', 'HorizontalAccuracyMeasure',
'HorizontalAccuracyUnit', 'VerticalMeasure', 'VerticalUnit', 'MonitoringLocationType'

filter: 'Id', 'DOI', 'MonitoringLocationType', 'ActivityStartYear', 
'ActivityMediaName', 'CharacteristicName', 'RegionId', 'Name'
  • records
select: 'Id', 'DOI', 'DatasetName', 'MonitoringLocationID', 'MonitoringLocationName', 
'MonitoringLocationLatitude','MonitoringLocationLongitude', 
'MonitoringLocationHorizontalCoordinateReferenceSystem', 
'MonitoringLocationHorizontalAccuracyMeasure', 'MonitoringLocationHorizontalAccuracyUnit',
'MonitoringLocationVerticalMeasure', 'MonitoringLocationVerticalUnit', 'MonitoringLocationType', 
'ActivityType', 'ActivityMediaName', 'ActivityStartDate', 'ActivityStartTime', 'ActivityEndDate', 
'ActivityEndTime', 'ActivityDepthHeightMeasure', 'ActivityDepthHeightUnit', 
'SampleCollectionEquipmentName', 'CharacteristicName', 'MethodSpeciation', 'ResultSampleFraction', 
'ResultValue', 'ResultUnit', 'ResultValueType', 'ResultDetectionCondition', 
'ResultDetectionQuantitationLimitMeasure','ResultDetectionQuantitationLimitUnit', 
'ResultDetectionQuantitationLimitType','ResultStatusID', 'ResultComment', 
'ResultAnalyticalMethodID', 'ResultAnalyticalMethodContext', 'ResultAnalyticalMethodName', 
'AnalysisStartDate', 'AnalysisStartTime', 'AnalysisStartTimeZone', 'LaboratoryName', 
'LaboratorySampleID'

filter: 'DOI', 'MonitoringLocationType', 'ActivityStartDate', 'ActivityMediaName', 
'CharacteristicName', 'RegionId'
  • observations
select: 'Id', 'DOI', 'LocationId', 'ActivityType', 'ActivityStartDate', 'ActivityStartTime', 
'ActivityEndDate', 'ActivityEndTime', 'ActivityDepthHeightMeasure', 'ActivityDepthHeightUnit', 
'SampleCollectionEquipmentName', 'CharacteristicName', 'MethodSpeciation', 'ResultSampleFraction', 
'ResultValue', 'ResultUnit', 'ResultValueType','ResultDetectionCondition', 
'ResultDetectionQuantitationLimitUnit', 'ResultDetectionQuantitationLimitMeasure',
'ResultDetectionQuantitationLimitType', 'ResultStatusId', 'ResultComment', 'ResultAnalyticalMethodId',
'ResultAnalyticalMethodContext', 'ResultAnalyticalMethodName', 'AnalysisStartDate', 'AnalysisStartTime', 
'AnalysisStartTimeZone', 'LaboratoryName', 'LaboratorySampleId', 'CreateTimestamp'

filter: 'DOI', 'MonitoringLocationType', 'ActivityStartDate', 'ActivityMediaName', 
'CharacteristicName', 'RegionId', 'LocationId'
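Because a misspelled field name is an easy mistake, it can help to check the requested columns against these lists before sending a request. The sketch below uses the locations select list from above; `make_select()` is a hypothetical helper and not part of the package.

```r
# Allowed select fields for locations(), copied from the list above.
locations_select <- c(
  "Id", "DOI", "NameId", "Name", "Latitude", "Longitude",
  "HorizontalCoordinateReferenceSystem", "HorizontalAccuracyMeasure",
  "HorizontalAccuracyUnit", "VerticalMeasure", "VerticalUnit",
  "MonitoringLocationType"
)

# Hypothetical helper: stop early if a requested column is not allowed,
# otherwise return the comma-separated string used for `$select`.
make_select <- function(cols, allowed) {
  bad <- setdiff(cols, allowed)
  if (length(bad) > 0) {
    stop("Unknown select field(s): ", paste(bad, collapse = ", "))
  }
  paste(cols, collapse = ", ")
}

make_select(c("Name", "Latitude", "Longitude"), locations_select)
```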

Authentication

By default, the API key is read from the environment variable "DATASTREAM_API_KEY". The API key can also be set with:

setAPIKey('xxxxxxxxxx') 
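For the environment-variable route, a minimal sketch using base R is shown below. The key value is a placeholder, and exactly when the package reads the variable is an assumption not confirmed here, so set it before calling any of the functions.

```r
# Set the key for the current R session only (placeholder value shown).
Sys.setenv(DATASTREAM_API_KEY = "xxxxxxxxxx")

# Confirm that the variable is visible to R without printing the key itself.
nzchar(Sys.getenv("DATASTREAM_API_KEY"))
```

To keep the key across sessions and out of scripts, the same variable can be placed in a `~/.Renviron` file, which R reads at startup.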

Full examples

Get the citation and licence for a dataset:

setAPIKey(YOUR_API_KEY)
qs <- list(
    `$select` = "DOI, DatasetName, Licence, Citation, Version",
    `$filter` = "DOI eq '10.25976/1q5q-zy55'"
  )
metadata(qs)

Get all pH observations in Alberta:

setAPIKey(YOUR_API_KEY)
qs <- list(
    `$select` = "Id, DOI, LocationId, CharacteristicName, ActivityType, ActivityStartDate, ActivityStartTime, ActivityEndDate, ActivityEndTime, ActivityDepthHeightMeasure, ActivityDepthHeightUnit, SampleCollectionEquipmentName, MethodSpeciation, ResultSampleFraction, ResultValue, ResultUnit, ResultValueType, ResultDetectionCondition, ResultDetectionQuantitationLimitUnit, ResultDetectionQuantitationLimitMeasure, ResultDetectionQuantitationLimitType, ResultStatusId, ResultComment, ResultAnalyticalMethodId, ResultAnalyticalMethodContext, ResultAnalyticalMethodName, AnalysisStartDate, AnalysisStartTime, AnalysisStartTimeZone, LaboratoryName, LaboratorySampleId",
    `$filter` = "CharacteristicName eq 'pH' and RegionId eq 'admin.4.ca.ab'"
  )
observations(qs)

More Examples:

# Pull all metadata for all datasets in the Atlantic DS Hub
setAPIKey(YOUR_API_KEY)
qs <- list(
    `$filter` = "RegionId eq 'hub.atlantic'"
  )
Example01 = metadata(qs)

# Pull all metadata for all datasets in BC
setAPIKey(YOUR_API_KEY)
qs <- list(
    `$filter` = "RegionId eq 'admin.4.ca.bc'"
  )
Example02 = metadata(qs)

# Pull only the DOI's and contact emails for all datasets in the Great Lakes Hub 
setAPIKey(YOUR_API_KEY)
qs <- list(
    `$select` = "DOI, DataStewardEmail",
    `$filter` = "RegionId eq 'hub.greatlakes'"
  )
Example03 = metadata(qs)

# Pull all location information for sites in Ontario (only pulling top 1000)
setAPIKey(YOUR_API_KEY)
qs <- list(
    `$filter` = "RegionId eq 'admin.4.ca.on'",
    `$top` = "1000"
  )
Example04 = locations(qs)

# Pull the site names and lat/lon coordinates for a particular dataset 
setAPIKey(YOUR_API_KEY)
qs <- list(
    `$select` = "Name, Latitude, Longitude",
    `$filter` = "DOI eq '10.25976/1q5q-zy55'"
  )
Example05 = locations(qs)

# Pull all pH data available in the Atlantic DS Hub (only pulling top 1000)
setAPIKey(YOUR_API_KEY)
qs <- list(
    `$filter` = "RegionId eq 'hub.atlantic' and CharacteristicName eq 'pH'",
    `$top` = "1000"
  )
Example06 = records(qs)

# Now, only select desired columns 
setAPIKey(YOUR_API_KEY)
qs <- list(
    `$select` = "DOI, DatasetName, MonitoringLocationName, MonitoringLocationLatitude",
    `$filter` = "RegionId eq 'hub.atlantic' and CharacteristicName eq 'pH'",
    `$top` = "1000"
  )
Example07 = records(qs)

# Now, only pull data before 2015 
setAPIKey(YOUR_API_KEY)
qs <- list(
    `$select` = "DOI, DatasetName, MonitoringLocationName, MonitoringLocationLatitude",
    `$filter` = "RegionId eq 'hub.atlantic' and CharacteristicName eq 'pH' and ActivityStartYear lt '2015'",
    `$top` = "1000"
  )
Example08 = records(qs)

# Try observations()
setAPIKey(YOUR_API_KEY)
qs <- list(
    `$select` = "ResultValue",
    `$filter` = "CharacteristicName eq 'pH' and ActivityStartYear gt '2019'",
    `$top` = "1000"
  )
Example09 = observations(qs)

# Use the count option
setAPIKey(YOUR_API_KEY)
qs <- list(
    `$select` = "ResultValue",
    `$filter` = "RegionId eq 'hub.atlantic' and CharacteristicName eq 'Ammonia' and ActivityStartYear gt '2019'",
    `$count` = "true"
  )
Example10 = observations(qs)
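The count option also covers the site-counting use case mentioned in the introduction. The sketch below only builds the query list. The spelling 'Cesium' for CharacteristicName is an assumption; check the "allowed values" tab of the upload template for the exact string.

```r
# Query list for counting locations in New Brunswick with cesium data.
# ASSUMPTION: 'Cesium' is the exact CharacteristicName string.
qs <- list(
  `$filter` = "RegionId eq 'admin.4.ca.nb' and CharacteristicName eq 'Cesium'",
  `$count` = "true"
)

# With the package loaded and an API key set, the request would be:
# Example11 = locations(qs)
```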

Tests

A Dockerfile is provided to run the unit tests and the integration tests. To build the Docker image for running tests and other debugging purposes, run:

docker build -t datastreamr .

To run the unit tests:

docker run --rm -e DATASTREAM_API_KEY=$(cat api_key.txt) datastreamr R -e "library(testthat); test_file('tests/testthat/test_unit.R')"

To run the integration tests:

docker run --rm -e DATASTREAM_API_KEY=$(cat api_key.txt) datastreamr R -e "library(testthat); test_file('tests/testthat/test_integration.R')"
