Reduce disk space requirement for eddy covariance download #131

Open

s-kganz opened this issue Apr 8, 2024 · 2 comments

Comments

@s-kganz commented Apr 8, 2024

Is your feature request related to a problem? Please describe.

Requesting several seasons of eddy covariance data can require a large amount of storage because all levels of the data product are bundled together. In my case, I only work with the level 4 products. Downloading the raw data takes about 60 GB of storage on disk before stacking. After running stackEddy, I am left with a 33 MB table with the NSAE data and QC flags I care about.

This discourages reproducibility because (1) downloading takes a long time, (2) it is antisocial to download tens of GB onto a collaborator's machine, and (3) it encourages hosting a processed data table outside of NEON to get around (1) and (2).

Describe the solution you'd like

The optimal solution would be to let users download eddy covariance data directly in FLUXNET format. I know this partially exists already on the AmeriFlux data portal, but many sites don't have any FLUXNET-formatted data. That happens to affect my main study site (WREF), so here I am (note this also means I have to run REddyProc myself, potentially with different settings than the site managers would prefer).

Another option is to download only the desired data level. But I imagine this would require backend changes to the API that are not feasible.

A third option is to modify the zipsByProduct -> stackEddy workflow to operate one site-month at a time, instead of processing all site-months together as done in this tutorial. This works, but deleting files is error-prone (unlink doesn't even raise a warning if it fails; see the sketch below) and you still have to wait for 60 GB to download.
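For what it's worth, unlink does signal failure, just invisibly: it returns 0 on success and 1 on failure, so the status is easy to miss unless you capture it. A minimal sketch of a checked delete; the helper name safe_unlink is hypothetical, not part of neonUtilities:

# Hypothetical helper: delete a directory and fail loudly if it survives.
# unlink() returns 0 on success and 1 on failure, but does so invisibly,
# so the status is easy to miss unless captured explicitly.
safe_unlink <- function(path) {
  status <- unlink(path, recursive=TRUE)
  if (status != 0 || dir.exists(path)) {
    stop("failed to delete: ", path)
  }
  invisible(TRUE)
}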

Describe alternatives you've considered

Right now I'm running zipsByProduct and stackEddy one site-month at a time, deleting any intermediate products along the way so that only ~250 MB of disk space is needed at any one time. A brief reprex:

library(neonUtilities)
library(foreach)
library(dplyr)

tdir <- tempdir()
fpath <- file.path(tdir, "filesToStack00200")

# Download five site-months of H2O/CO2 NSAE
site_mos <- paste0("2019-0", seq(5, 9))

# Columns to keep from the stacked level 4 output
vars <- c(
  "timeBgn", "timeEnd",
  "data.fluxCo2.nsae.flux",
  "qfqm.fluxCo2.nsae.qfFinl",
  "data.fluxH2o.nsae.flux",
  "qfqm.fluxH2o.nsae.qfFinl"
)

wref_nsae <- foreach(sm=site_mos, .combine=rbind) %do% {
  # Download one site-month of the bundled EC product
  zipsByProduct(
    "DP4.00200.001",
    site="WREF",
    startdate=sm,
    enddate=sm,
    savepath=tdir,
    check.size=FALSE
  )
  
  # Stack, then keep only the NSAE fluxes and final QC flags
  myeddy <- stackEddy(fpath)[["WREF"]] %>%
    select(all_of(vars))
  
  # Delete intermediates so disk usage stays bounded
  unlink(fpath, recursive=TRUE)
  stopifnot(!dir.exists(fpath))
  
  myeddy
}

On my machine and connection, this takes about 2 hours to download all the flux data I work with.

Additional context

I think this package fills a really important role in the research community. I'd love to be able to write a paper and link a script that runs the entire analysis, all the way through generating the figures that appear in the manuscript. Having more flexibility in how flux data are downloaded would make this goal much more achievable.

@cklunch (Collaborator) commented Apr 10, 2024

@s-kganz Thanks for your suggestions! As you noted, this is a challenge rooted in the way the eddy covariance files are stored, and there are limited options within neonUtilities itself. For your use case, I think your script for iterating over the files to be downloaded and deleting as you go is the best option available. Also, keep an eye on AmeriFlux for reformatted files to appear there.

And I do expect that eventually we'll have more options for working with the H5 files, or more options for file formatting, but at this point I can't give an estimated timeline; we're still in the exploration phase. We've been experimenting with cloud-based methods for working with H5 files, which would avoid the download entirely, and we've talked about possible file format alternatives. I'll post updates here, but it may be a while.
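In the meantime, for anyone who wants to experiment, here is a minimal sketch of reading just the level 4 tables out of a single downloaded H5 file with rhdf5 instead of stacking the whole bundle. The file name and group paths are guesses inferred from the stacked column names in the reprex above (data.fluxCo2.nsae.flux and friends), not a confirmed NEON layout; check the h5ls() output against a real file first.

# Sketch only: pull the dp04 NSAE tables from one NEON H5 file.
# Group paths are inferred from the stacked column names
# (data.fluxCo2.nsae.*) and may not match the actual file layout.
library(rhdf5)

h5file <- "NEON.D16.WREF.DP4.00200.001.nsae.2019-05.basic.h5"  # hypothetical name

# Inspect the file to confirm the group structure before reading
contents <- h5ls(h5file)
head(contents)

# Read the CO2 NSAE flux table and its final QC flags
co2_nsae <- h5read(h5file, "/WREF/dp04/data/fluxCo2/nsae")
co2_qf   <- h5read(h5file, "/WREF/dp04/qfqm/fluxCo2/nsae")

H5close()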

@s-kganz (Author) commented Apr 10, 2024

Thanks for your comments @cklunch! I'm glad this is on your radar, and I look forward to hearing any updates.
