Add coverage data to status.json #11
I would like to add (from discussion with @kindly & @Bjwebb) -
By this I refer to the case where a single grant might have multiple locations, for example. If we only had a count of the instances of each field then we might miss the chance to see whether some grants are more fully populated than others, e.g. a dataset with 1000 grants could have 1000 locations, but all in a single grant.
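To make that distinction concrete, here is a minimal hypothetical sketch (illustrative only, not the datagetter implementation) of the two counts being discussed: one that counts every occurrence of a field, and one that counts grants containing the field at least once.

```python
# Hypothetical illustration of the two counts discussed above;
# not the actual datagetter code.
from collections import defaultdict

def count_coverage(grants):
    """Return {path: (grants_with_field, total_fields)} for a list of grants."""
    totals = defaultdict(int)        # every occurrence of a field
    grants_with = defaultdict(int)   # grants containing the field at least once

    def walk(value, path, seen):
        if isinstance(value, dict):
            for key, child in value.items():
                child_path = f"{path}/{key}"
                totals[child_path] += 1
                seen.add(child_path)
                walk(child, child_path, seen)
        elif isinstance(value, list):
            for item in value:
                walk(item, path, seen)

    for grant in grants:
        seen = set()
        walk(grant, "/grants", seen)
        for path in seen:
            grants_with[path] += 1

    return {p: (grants_with[p], totals[p]) for p in totals}

# One grant with two locations: total_fields counts both occurrences,
# grants_with_field counts that grant only once.
grants = [{"beneficiaryLocation": [{"name": "A"}, {"name": "B"}]},
          {"beneficiaryLocation": [{"name": "C"}]}]
print(count_coverage(grants)["/grants/beneficiaryLocation/name"])  # (2, 3)
```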
There's now a pull request at ThreeSixtyGiving/datagetter#22.

What's done:

What's not done:
Here's the coverage.json generated by that branch: https://storage.googleapis.com/datagetter-360giving-output/branch/11-coverage/coverage.json

Here's a snippet of what the coverage data looks like:

```json
"datagetter_coverage": {
  "/grants": {
    "grants_with_field": null,
    "standard": true,
    "total_fields": 1
  },
  "/grants/Grant Application Scheme": {
    "grants_with_field": 66,
    "standard": false,
    "total_fields": 66
  },
  "/grants/amountAwarded": {
    "grants_with_field": 228,
    "standard": true,
    "total_fields": 228
  },
  "/grants/awardDate": {
    "grants_with_field": 228,
    "standard": true,
    "total_fields": 228
  },
  "/grants/beneficiaryLocation": {
    "grants_with_field": 228,
    "standard": true,
    "total_fields": 228
  },
  "/grants/beneficiaryLocation/countryCode": {
    "grants_with_field": 210,
    "standard": true,
    "total_fields": 221
  },
  "/grants/beneficiaryLocation/name": {
    "grants_with_field": 228,
    "standard": true,
    "total_fields": 239
  },
```

@stevieflow @drkane How does this look?
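Reading the numbers in the snippet: for /grants/beneficiaryLocation/name, total_fields (239) is higher than grants_with_field (228), which suggests some grants carry more than one beneficiary location, each contributing a name; for countryCode, only 210 of the 228 grants include it at all, across 221 occurrences. That appears to be exactly the multiple-locations-per-grant case raised at the top of the thread.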
Looks great. Would it always be a separate file, or would it populate the status.json file? We would presumably then end up with a three-tiered approach:

Something that could cause confusion is that the fields are identified using the proper schema representation, but in the file they are likely to be in the
Just looked at
I'd be interested to know if this is the direction we take, as I have made similar comments elsewhere. @drkane - as a User (!) - would you find this useful?
@drkane wrote:
Yeah, this would be reasonably involved to do. Options are to reconstruct the title by looking in the schema, or to get flatten-tool to produce its "heading source map" (this is how CoVE knows the real name of the column for validation errors).
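A rough sketch of the first option (reconstructing a heading by walking the schema). The mini-schema and the colon-joining convention here are assumptions for illustration, not the real 360Giving schema or flatten-tool's heading source map:

```python
# Hypothetical sketch of recovering a human-readable heading from a JSON
# schema for a path like "/grants/beneficiaryLocation/countryCode".
# This mini-schema is illustrative, not the real 360Giving schema.
schema = {
    "properties": {
        "beneficiaryLocation": {
            "title": "Beneficiary Location",
            "type": "array",
            "items": {
                "properties": {
                    "countryCode": {"title": "Country Code"},
                    "name": {"title": "Name"},
                }
            },
        }
    }
}

def title_for_path(schema, pointer):
    """Join the `title` of each property along a JSON-pointer style path."""
    titles, node = [], schema
    for part in pointer.strip("/").split("/")[1:]:  # skip the leading "grants"
        node = node.get("properties", {}).get(part, {})
        titles.append(node.get("title", part))
        if node.get("type") == "array":  # step into list-item schemas
            node = node.get("items", {})
    return ":".join(titles)

print(title_for_path(schema, "/grants/beneficiaryLocation/countryCode"))
# -> "Beneficiary Location:Country Code"
```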
@drkane It would be useful to know how well you think it works. The separation is partly based on what you said on a call about maybe keeping the data in
For me personally, I'd be using the most detailed version of the data, which is currently

Where it might make sense to have 3 separate files is if there is a
Bit of a newcomer to this issue, but the work referenced in the related pull request has now been merged, so I'm closing this issue. If this is in error, please feel free to re-open.
Great that this has now been merged @michaelwood. One issue I've noticed since it went live is that the coverage data is missing for some files. The example I found was the large National Lottery Community Fund CSV file, so presumably it's related to the caching that happens to avoid processing that file every time. (It would also be good if a cached version of
Looks like equivalent code for
I think this is an ordering issue specific to the way Travis is used to run the test. At the moment it is set up like this:

This means that all the data JSON documents created via the run.sh Python scripts will be missing the National Lottery data, which as you mentioned gets inserted later. As an experiment I'll see if we can get Travis to re-run
Had to make a couple of modifications to coverage to make sure it doesn't trip up on failed downloads. The test branch seems to output as expected: https://storage.googleapis.com/datagetter-360giving-output/branch/mw/test_issue_11_coverage/coverage.json

Here is a diff between the master branch and my modifications with the ordering issue fixed: https://gist.github.com/michaelwood/48697cd3bbfa7df892ee76f905dcde30

Think that looks right?
I don't think the output is quite right - the big file is being skipped rather than included (here in the diff). I'm not entirely sure why - it's in the
Might need to check the file exists here: https://github.com/ThreeSixtyGiving/datatester/blob/master/coverage.py#L35
@drkane yeah, that was my thought; if you look at the branch, that's what I added: https://github.com/ThreeSixtyGiving/datatester/blob/mw/test_issue_11_coverage/coverage.py#L36 I'll check in with @Bjwebb
Ah sorry, missed that. Maybe it's a case of doing
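The guard being discussed is essentially skipping entries whose download never produced a file. A minimal sketch of the idea, with illustrative names rather than the actual coverage.py code:

```python
# Hypothetical sketch of skipping files that failed to download;
# names are illustrative, not the actual datatester code.
import json
import os

def iter_existing_json(paths):
    """Yield (path, parsed JSON) only for files that exist on disk."""
    for path in paths:
        if not os.path.isfile(path):
            # A failed download leaves no file behind; skip it rather
            # than letting open() raise and abort the whole coverage run.
            continue
        with open(path) as f:
            yield path, json.load(f)
```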
I've run it locally instead of on Travis (which was being hit by some timeout issues yesterday) and I get a much larger result. Diff is here: https://gist.github.com/michaelwood/48697cd3bbfa7df892ee76f905dcde30 and the full file is here: https://gist.github.com/michaelwood/1390d9dad228730a072c95a8f3fcfe5a

As we now have a new server with the data on it, it might be better to move away from using Travis and use the new server with a cron job; that way we avoid any issues that might be introduced by the Travis environment, such as memory, disk space and execution time restrictions.
@michaelwood given that the server has been set up without using Salt, we're going to have to rebuild it at least once before we can use it for production infrastructure. The daily data test is used by 360Giving Insights (at least), so we need to make sure that we can rebuild the server that provides it quickly and easily if it ever goes down, and we can only really have confidence in that process if we've done it! So, while using the new server for the DDT in the future (as it has loads more resources than Travis!) is probably the right thing to do, I don't want us to end up in a state where we can't do the Salt development because we're using the server for production services!
Suggested by @drkane in ThreeSixtyGiving/test_registry#5