-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Move datasets to delete first in line #14
Conversation
We have reports of datasets that get re-harvested with an extra `1` in the URL. We have confirmed these reports. It seems the harvest is doing the best it can to diagnose if this is a new dataset or not; but still failing in some circumstances. This probably won't fix the bug; however it will mitigate it. By hopefully running through the datasets removal first, if the spatial harvester is essentially doing a "delete and add" when it should be replacing, then the name of the new dataset won't collide with the one that is marked for deleted but still in the system.
Codecov Report
@@ Coverage Diff @@
## datagov #14 +/- ##
========================================
Coverage 42.16% 42.16%
========================================
Files 46 46
Lines 3166 3166
========================================
Hits 1335 1335
Misses 1831 1831
Continue to review full report at Codecov.
|
Do we want to write a test to make sure that the dataset url doesn't have a '1' in it after this sequence of operations? Or do we suspect this is a short term fix with upstream fixing the real problem of replacing a dataset? |
The issue is with the data provided itself, especially CSDGM/FGDC or ISO without a unique identifier. These WAF harvest sources can't assume the unique identifier is there because it's not required, so it uses a combination of things to test if it's "seen" this harvest object before. If the URL of the source/WAF changes, or if the title changes, then the harvester can't track it and assumes it's a "new" dataset, and doesn't find the "old" dataset and removes it. The problem is the order; since it currently removes data last then the "new" dataset has a name collision with the old one and we get a URL change for downstream users of data.gov. If we fully remove the dataset before adding the new one, the downtime is minimal and the URL should stay the same. |
@FuhuXia @jbrown-xentity Did this fix the name changing issue? |
Unknown. We could do an analysis to see how many of these exist out there, using the api (checking if last character is a 1, and scanning for how many entries, when they were created, etc). Or we could test this manually by harvesting CSDGM locally, changing the file name, and then re-harvesting to see what happens. |
See upstream PR and comments here: ckan#261 |
We have reports of datasets that get re-harvested with an extra
1
in the URL. We have confirmed these reports.It seems the harvest is doing the best it can to diagnose if this is a new dataset or not; but still failing in some circumstances.
This probably won't fix the bug; however it will mitigate it. By hopefully running through the datasets removal first, if the spatial harvester is essentially doing a "delete and add" when it should be replacing, then the name of the new dataset won't collide with the one that is marked for deleted but still in the system.