Update all scripts and data #1
Comments
If only the location has changed, then this may not be such a big deal. Otherwise it might be.
The URLs listed in https://github.com/avoindata/mml/blob/master/rscripts/Kapsi/kapsi2rdata.R are still valid, so fortunately there is no need for a bigger update. It would be good to have all data in per-year directories; right now there seems to be some redundancy in the repo, e.g. years 2012 and 2016 separately and then e.g. |
I see, e.g. |
Yes, the 2012 folder was added after someone requested the old versions once I had removed them, but that was years ago. I doubt anyone really needs the 2012 folder any more, so it could be removed for clarity as well. I am not aware of any apps other than the ropengov/louhos packages that use this data resource, so I think even the file/folder structure can be improved or changed if necessary.
Is the consensus now that the data will be fetched directly from GitHub, or shall we find another host or even release a separate data package?
I would keep the old versions if there's space. In fact, it's a shame we haven't actively collected these. AFAIK, nobody keeps a (public) record of changing datasets, which may actually be very useful in studies.
Right. We can try to collect these from now on. I'm not sure how often the data is updated. Collecting annually might be sufficient.
You may already know my take on the hosting issue 😉 I don't know about the consensus. I don't think a separate data package is really necessary, although hosting one using |
A data package would potentially reduce network traffic and speed up execution in some cases, but I'm not sure how essential this would be. GitHub is certainly fine until proven otherwise.
Since the data doesn't update so often (max annually I guess), a data package would simplify things at least because:
However, there would be a bit of conceptual shift for the packages using |
The MML license should perfectly allow data packaging as far as I can see. Even the current solution is not really an "API package", since it is based on our pre-processed and independently distributed RData files rather than the MML service. To achieve conceptual clarity, the package functions should download the data straight from Kapsi/MML and perform preprocessing on the fly. But that is not practical. If we rely on our own preprocessed data anyway, then I am not sure it makes a big difference whether the data is hosted on GitHub or in a data package. Any pragmatic solution is fine.
Yes, it does. Ideally the data provider deals with packaging the data, but in this case we could be just as good (or better!).
Yep, but the data is still loaded on an as-needed basis. It's worth noting that Kapsi is not an MML service either, which makes this even less API-like.
I agree that downloading the data every time is not practical. However, as a user I would like the package to do as little filtering as possible (i.e. subsetting data). Value-adding pre-processing (fixing strings, setting types etc) is great, as long as it's clear what was done. In this sense a data package might be a very good solution as it enables good documentation, versioning and provenance (i.e. distributing the code). Currently, it's a bit unclear where the data is coming from ( |
If we package the data using |
+1 for a data package, in other words.
Yes, an R data package is starting to seem like a good solution. I do not know whether a CRAN data package or a GitHub drat package is better. If there is no added value from hosting on CRAN, then drat might be optimal, as updates will be easier. Or we could ask Kapsi to add our scripts to their pipeline and host the data files/packages (for MML this is not realistic, I think). But that would add some overhead, and any changes/updates would be heavier to make. As a side note, Feather would work great for sharing data frames, but not in this case (shapefiles), I guess.
Obviously, having the package on CRAN wouldn't hurt as long as 1) the size limit (~5 MB) is not an issue, and 2) one is willing to get an angry email from BDR 😄.
Might be a bit overkill, yes.
Well, if we switch completely to |
R data packages hosted on CRAN can exceed the typical 5 MB size limit if we can justify the need to BDR. Perhaps it would be good to start with GitHub + drat and move to CRAN later if that seems useful. I'm not sure whether RData or Rds files are any better for long-term storage than Feather, except that Feather is still under development and may hence be less stable. OK, perhaps Rds files would be the best option here for now (saveRDS / readRDS).
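For reference, a minimal sketch of what the saveRDS / readRDS round trip could look like. The object `municipalities` and the file name are hypothetical placeholders for illustration only, not the actual preprocessed MML data or the repo's naming scheme.

```r
# Toy stand-in for one preprocessed dataset (hypothetical, not real MML output).
municipalities <- data.frame(
  code = c("001", "002"),
  name = c("Area A", "Area B"),
  stringsAsFactors = FALSE
)

# Hypothetical per-year file name; written to a temp dir so nothing is left behind.
rds_path <- file.path(tempdir(), "municipalities_2016.rds")

# saveRDS() serializes a single object without storing its name.
saveRDS(municipalities, rds_path)

# readRDS() returns the object, so the caller picks the variable name
# (unlike load() on an .RData file, which restores the original names).
restored <- readRDS(rds_path)

# Check that the round trip preserved the object.
round_trip_ok <- identical(municipalities, restored)
```

One file per object also makes it easy to keep one file per dataset per year, if we go that way.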
+1 to everything.
Things have changed at Kapsi and this repo should be updated accordingly.