This repository has been archived by the owner on Jun 17, 2021. It is now read-only.

Add coverage data to status.json #11

Open

Bjwebb opened this issue Jul 12, 2018 · 19 comments
@Bjwebb
Contributor

Bjwebb commented Jul 12, 2018

Suggested by @drkane in ThreeSixtyGiving/test_registry#5

  • for each field in the file:
    • field name
    • number of null/blank values
    • (future development) - number of values matching the required format (eg conform to regex)
    • (future development) - for fields that can be externally validated (eg identifiers) - number of values that are externally verified
  • whether the field appears in the 360Giving standard (plus a guess at the field type if a non-standard field)
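The per-field stats suggested above could be sketched along these lines. This is a minimal illustration, not the datagetter implementation: it assumes a flat list of grant objects and only looks at top-level keys, and the example grants are invented.

```python
def field_stats(grants):
    """Tally occurrences and null/blank values for each top-level field.

    Sketch only: assumes `grants` is the list under the "grants" key of a
    360Giving JSON file; nested objects are not descended into here.
    """
    stats = {}
    for grant in grants:
        for field, value in grant.items():
            entry = stats.setdefault(field, {"count": 0, "null_or_blank": 0})
            entry["count"] += 1
            if value is None or value == "":
                entry["null_or_blank"] += 1
    return stats

# Invented example grants:
grants = [
    {"id": "360G-ex-1", "title": "", "amountAwarded": 1000},
    {"id": "360G-ex-2", "title": "A grant", "amountAwarded": None},
]
print(field_stats(grants)["title"])
```

The regex and external-validation checks marked "future development" would slot into the same loop as extra per-field counters.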
@stevieflow

I would like to add (from discussion with @kindly & @Bjwebb) -

  • Number of fields per unique identifier

By this I refer to the case where a single grant might have multiple locations, for example. If we only had a count of the instances of each field, we might miss whether some grants are more fully populated than others.

eg: a dataset with 1,000 grants could have 1,000 locations, but all of them in a single grant
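The distinction above — total field instances versus grants containing the field — can be shown with a toy example. A sketch assuming each grant is a dict with an optional "beneficiaryLocation" list:

```python
def location_coverage(grants):
    """Return (total location instances, grants containing any location).

    Illustrates why a raw instance count alone can mislead: 1000
    locations could all sit in one grant. Sketch, not datagetter code.
    """
    total = sum(len(g.get("beneficiaryLocation", [])) for g in grants)
    grants_with = sum(1 for g in grants if g.get("beneficiaryLocation"))
    return total, grants_with

# One grant holding 1000 locations among 1000 grants:
grants = [{"beneficiaryLocation": [{}] * 1000}] + [{} for _ in range(999)]
print(location_coverage(grants))  # total is high, per-grant coverage is low
```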

@Bjwebb
Contributor Author

Bjwebb commented Aug 8, 2018

There's now a pull request at ThreeSixtyGiving/datagetter#22

What's done:

  • field name
  • whether the field appears in the 360Giving standard
  • Number of fields per unique identifier - actually I count the number of grant JSON objects with the field, which is almost always the same thing

What's not done:

  • number of null/blank values
  • (guess at the field type if a non-standard field)
  • (future development) - number of values matching the required format (eg conform to regex)
  • (future development) - for fields that can be externally validated (eg identifiers) - number of values that are externally verified

Here's the coverage.json generated by that branch https://storage.googleapis.com/datagetter-360giving-output/branch/11-coverage/coverage.json

Here's a snippet of what the coverage data looks like:

    "datagetter_coverage": {
      "/grants": {
        "grants_with_field": null,
        "standard": true,
        "total_fields": 1
      },
      "/grants/Grant Application Scheme": {
        "grants_with_field": 66,
        "standard": false,
        "total_fields": 66
      },
      "/grants/amountAwarded": {
        "grants_with_field": 228,
        "standard": true,
        "total_fields": 228
      },
      "/grants/awardDate": {
        "grants_with_field": 228,
        "standard": true,
        "total_fields": 228
      },
      "/grants/beneficiaryLocation": {
        "grants_with_field": 228,
        "standard": true,
        "total_fields": 228
      },
      "/grants/beneficiaryLocation/countryCode": {
        "grants_with_field": 210,
        "standard": true,
        "total_fields": 221
      },
      "/grants/beneficiaryLocation/name": {
        "grants_with_field": 228,
        "standard": true,
        "total_fields": 239
      },

@stevieflow @drkane How does this look?
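For reference, the `grants_with_field` / `total_fields` split in the snippet above could be computed with a recursive path walk along these lines. This is a sketch of the idea, not the actual datagetter code; the path strings are JSON-pointer-style, matching the keys shown in the snippet.

```python
from collections import defaultdict

def count_paths(obj, counts, path, seen):
    """Tally every occurrence of each path (total_fields) and record
    which paths appear at least once in the current grant (seen)."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            sub = f"{path}/{key}"
            counts[sub] += 1
            seen.add(sub)
            count_paths(value, counts, sub, seen)
    elif isinstance(obj, list):
        # List items share their parent's path, so repeated entries
        # push total_fields above grants_with_field.
        for item in obj:
            count_paths(item, counts, path, seen)

def coverage(grants):
    total_fields = defaultdict(int)
    grants_with_field = defaultdict(int)
    for grant in grants:
        seen = set()
        count_paths(grant, total_fields, "/grants", seen)
        for p in seen:
            grants_with_field[p] += 1
    return total_fields, grants_with_field

# Invented example: 2 grants, 3 location names in total
grants = [
    {"beneficiaryLocation": [{"name": "A"}, {"name": "B"}]},
    {"beneficiaryLocation": [{"name": "C"}]},
]
totals, with_field = coverage(grants)
print(totals["/grants/beneficiaryLocation/name"])      # counts every instance
print(with_field["/grants/beneficiaryLocation/name"])  # counts grants
```

This reproduces the pattern visible in the real output, e.g. `/grants/beneficiaryLocation/name` with `total_fields: 239` but `grants_with_field: 228`.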

@drkane

drkane commented Aug 8, 2018

Looks great.

Would it always be a separate file or would it populate the status.json file? We would presumably then end up with a three-tiered approach: data.json has the main registry in DCAT format, status.json has the additional datagetter_metadata and datagetter_aggregates and then coverage.json also includes datagetter_coverage.

Something that could cause confusion is that the fields are identified using the proper schema representation, but in the file they are likely to be in the Recipient Org:Name format. It would be helpful to have the original field name, perhaps as a name key in the object for each field, to help map these fields to the data itself. Not sure if that's possible with how the coverage file is generated at the moment though.

@drkane

drkane commented Aug 8, 2018

Just looked at coverage.py, looks like it's probably not possible.

@stevieflow

We would presumably then end up with a three-tiered approach: data.json has the main registry in DCAT format, status.json has the additional datagetter_metadata and datagetter_aggregates and then coverage.json also includes datagetter_coverage.

I'd be interested to know if this is the direction we take, as I have made similar comments elsewhere. @drkane - as a User (!) - would you find this useful?

@Bjwebb
Contributor Author

Bjwebb commented Aug 14, 2018

@drkane wrote:

Something that could cause confusion is that the fields are identified using the proper schema representation, but in the file they are likely to be in the Recipient Org:Name format. It would be helpful to have the original field name, perhaps as a name key in the object for each field, to help map these fields to the data itself. Not sure if that's possible with how the coverage file is generated at the moment though.

Just looked at coverage.py, looks like it's probably not possible.

Yeah this would be reasonably involved to do. Options are to reconstruct the title by looking in the schema, or to get flatten-tool to produce its "heading source map" (this is how CoVE knows the real name of the column for validation errors).
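The first option — reconstructing the spreadsheet-style title from the schema — could look roughly like this. A hedged sketch: it assumes a JSON-Schema-like dict with `properties`, `items`, and `title` keywords, and the schema fragment below is invented. The real 360Giving schema (and flatten-tool's heading source map) handles cases this ignores, such as `$ref`s and titles that differ from the headings actually used in a file.

```python
def title_for_path(schema, path):
    """Map a JSON-pointer-ish path to a colon-joined spreadsheet title,
    e.g. /recipientOrganization/name -> "Recipient Org:Name"."""
    node = schema
    titles = []
    for part in path.strip("/").split("/"):
        prop = node.get("properties", {}).get(part, {})
        titles.append(prop.get("title", part))  # fall back to the key name
        # Descend into list items for array-valued fields
        node = prop.get("items", {}) if prop.get("type") == "array" else prop
    return ":".join(titles)

# Invented schema fragment for illustration:
schema = {
    "properties": {
        "recipientOrganization": {
            "title": "Recipient Org",
            "type": "array",
            "items": {
                "properties": {"name": {"title": "Name"}},
            },
        },
    },
}
print(title_for_path(schema, "/recipientOrganization/name"))
```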

@Bjwebb
Contributor Author

Bjwebb commented Aug 14, 2018

We would presumably then end up with a three-tiered approach: data.json has the main registry in DCAT format, status.json has the additional datagetter_metadata and datagetter_aggregates and then coverage.json also includes datagetter_coverage.

I'd be interested to know if this is the direction we take, as I have made similar comments elsewhere. @drkane - as a User (!) - would you find this useful?

@drkane Would be useful to know how well you think it works. The separation is partly based on what you said on a call about maybe keeping the data in status.json small enough that you can read through the file more easily.

@drkane

drkane commented Aug 15, 2018

@drkane Would be useful to know how well you think it works. The separation is partly based on what you said on a call about maybe keeping the data in status.json small enough that you can read through the file more easily.

For me personally, I'd be using the most detailed version of the data, which is currently coverage.json. Not sure if file size will become a problem - I was thinking aloud when we were talking about it, and actually the file size doesn't get that big.

Where it might make sense to have 3 separate files is that if there is a data.json > status.json > coverage.json separation you can tell what processes have been run on each file, and so what state you can expect it to be in. If it overwrites the file you wouldn't be able to tell what status.json contains at a glance, depending on which script has been run.

@michaelwood
Member

Bit of a newcomer to this issue, but the work referenced in the related pull request has now been merged, so I'm closing this issue. If this is in error, please feel free to re-open.

@drkane

drkane commented Aug 6, 2019

Great that this has now been merged @michaelwood.

One issue I've noticed since it went live is that coverage.json doesn't build on the exact same data as status.json, so there are inconsistencies between them.

The example I found was in the large National Lottery Community Fund CSV file, so presumably it's related to the caching that happens to avoid processing that file every time. The datagetter_metadata section is available in status.json but not in coverage.json. Presumably the cached data is brought in at a point after coverage.json is produced.

(It would also be good if a cached version of datagetter_coverage was available for this file.)

@drkane drkane reopened this Aug 6, 2019
@drkane

drkane commented Aug 6, 2019

Looks like equivalent code for coverage.json would need to be added to https://github.com/ThreeSixtyGiving/datatester/blob/master/big_files_hardcoded/insert.py

@michaelwood
Member

I think this is an ordering issue specific to the way Travis is used to run the test. At the moment it is set up like this:

script:
  # Don't convert big files because Travis doesn't have the RAM
  - travis_wait 60 ./run.sh --no-convert-big-files
  - python big_files_hardcoded/insert.py

This means that all the data JSON documents created by the run.sh Python scripts will be without the National Lottery data, which as you mentioned gets inserted later.

As an experiment I'll see if we can get Travis to re-run coverage.py after insert.py to regenerate the coverage.json
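That experiment would amount to adding one step after the insert, something like the following. This is a hypothetical reordering, not the committed config:

```shell
# Sketch of the reordered Travis script phase (hypothetical):
travis_wait 60 ./run.sh --no-convert-big-files  # build data without big files
python big_files_hardcoded/insert.py            # splice cached big-file data into status.json
python coverage.py                              # regenerate coverage.json so it sees the inserted data
```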

@michaelwood
Member

Had to make a couple of modifications to coverage.py to make sure it doesn't trip up on failed downloads. The test branch seems to output as expected: https://storage.googleapis.com/datagetter-360giving-output/branch/mw/test_issue_11_coverage/coverage.json Here is a diff between the master branch and my modifications with the ordering issue fixed: https://gist.github.com/michaelwood/48697cd3bbfa7df892ee76f905dcde30 Does that look right?

@drkane

drkane commented Aug 7, 2019

I don't think the output is quite right - the big file is being skipped rather than included (here in the diff). I'm not entirely sure why - it's in the status.json file and the code looks like it should just pass it through if it's not found in coverage.py.

@drkane

drkane commented Aug 7, 2019

Might need to check the file exists here: https://github.com/ThreeSixtyGiving/datatester/blob/master/coverage.py#L35

@michaelwood
Member

@drkane yeah, that was my thought - if you look at the branch, that's what I added: https://github.com/ThreeSixtyGiving/datatester/blob/mw/test_issue_11_coverage/coverage.py#L36

I'll check in with @Bjwebb

@drkane

drkane commented Aug 7, 2019

Ah sorry, missed that. Maybe it's a case of doing stats.append(dataset) if the exception is triggered.
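The pass-through idea being discussed could look roughly like this. A sketch only, not the actual coverage.py: the `json_path` key and `compute_coverage` helper are invented names for illustration.

```python
import json
import os

def compute_coverage(data):
    # Placeholder for the real per-field counting (hypothetical helper)
    return {"total_grants": len(data.get("grants", []))}

def add_coverage(stats):
    """Keep datasets whose converted JSON file is missing (e.g. big files
    skipped on Travis) in the output unchanged, instead of dropping them."""
    out = []
    for dataset in stats:
        json_path = dataset.get("json_path", "")  # hypothetical key name
        if not os.path.exists(json_path):
            out.append(dataset)  # pass the entry through without coverage data
            continue
        with open(json_path) as f:
            data = json.load(f)
        dataset["datagetter_coverage"] = compute_coverage(data)
        out.append(dataset)
    return out
```

The key point is that a missing file appends the dataset untouched rather than raising, so status.json and coverage.json keep the same set of entries.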

@michaelwood
Member

I've run it locally instead of on Travis (which was being hit by some timeout issues yesterday) and I get a much larger result. The diff is here: https://gist.github.com/michaelwood/48697cd3bbfa7df892ee76f905dcde30 and the full file is here: https://gist.github.com/michaelwood/1390d9dad228730a072c95a8f3fcfe5a

As we now have a new server with the data on it, it might be better to move away from Travis and run this on the new server with a cron job. That way we can avoid issues introduced by the Travis environment, such as memory, disk space, and execution-time restrictions.

@robredpath
Contributor

@michaelwood given that the server has been set up without using Salt, we're going to have to rebuild it at least once before we can use it for production infrastructure. The daily data test is used by 360Giving Insights (at least) so we need to make sure that we can rebuild the server that provides it quickly and easily if it ever goes down, and we can only really have confidence in that process if we've done it! So, while using the new server for the DDT in the future (as it has loads more resources than Travis!) is probably the right thing to do, I don't want us to end up in a state where we can't do the Salt development because we're using the server for production services!
