Replace DCAT-US validator #3503

jbrown-xentity · 2021-10-27T17:42:05Z

User Story

In order to ensure the dashboard can be deprecated, data.gov admins want a validator of DCAT-US files built into catalog (ckanext-datajson extension).

Acceptance Criteria

[ACs should be clearly demoable/verifiable whenever possible. Try specifying them using BDD.]

GIVEN an endpoint exists in catalog to validate DCAT-US files
WHEN a URL to a file (or file) is provided with a DCAT-US format
THEN the user is provided with meaningful output of valid and invalid format
Redirect Dashboard Data.json Validator #4001
THEN my browser is redirected to the new validator location

Background

Current dashboard validator exists here.
An API route already exists (but is never used/tested) to validate pod files here.

Security Considerations (required)

Need to ensure the SSRF protections currently in the dashboard are also in this implementation.

Sketch

Would involve investigating the following:

Current implementation doesn't seem to work, should investigate why locally and check logs.
What security is in place on the /pod/validate route?
- Try converting some of the existing SSRF tests over
What options does the /pod/validate route provide?
Can we create a basic, simple html page to hit the above api?
Do we need to provide a "prettified" format of the code? Or would we need to implement the inverse (a JSON response), as the code seems to return html?
- The previous implementation does have a JSON output option which is used for tests; not clear whether there's other clients of that functionality.

The text was updated successfully, but these errors were encountered:

nickumia-reisys · 2022-10-04T18:25:00Z

The route does work on a local version of CKAN

nickumia-reisys · 2022-10-04T22:34:06Z

So... the current state of this is that the route works in the datajson extension, but isn't getting recognized by the catalog main app. We will eventually fix this. However, the more pressing issue is that the main do_validation function is not checking the right schema.

The general flow steps are:

[LINK] Hit the flask route at /pod/validate and get redirected to the validator function
[LINK] Parse the data.json link from the POST request.
[LINK] Download the file and make sure it is a valid JSON.
[LINK] Validate it is a proper DATA.JSON.
- [LINK] The function that does the "validation"

The problem is in the do_validation function. It tries to make sure the data.json is a list (which is wrong). It should be a dict.

@FuhuXia pointed me to the validate_schema function. But this function isn't actually doing the validation of the schema. It is just doing the validation that the schema is valid name of a schema.

The closest thing that is doing the validation is the _validata_dataset function which is part of the DatasetHarvesterBase class. However, I don't know if we want to use the Harvester class to do data.json validation.

@FuhuXia @jbrown-xentity

Do we need the old do_validation function if it is validating the wrong things?
Can you please help me find the right function that should be doing the data.json validation?

jbrown-xentity · 2022-10-04T22:40:37Z

You're probably right, the validation is embedded in the harvester class.
We do not need the old do_validation if it is wrong.
I would consider pulling out the validation from the harvester in some way, and having the harvester code and the validate code reference the same function. If that is difficult because of the way the validation has to be implemented for the harvester, maybe there is a way to "break out" the validation into small chunks that can be shared in some way.
For the record, this "feature" of ckanext-datajson is poorly documented and not sure how well it ever worked (certainly not since ckan2.8), so it's entirely possible that most of the current logic will need to be thrown out...

jbrown-xentity · 2022-10-04T22:50:55Z

What I didn't say is: the validation that occurs in the harvester logic does seem to be solid, and has worked fine in the past. So the logic there should be able to be kept.

nickumia-reisys · 2022-10-05T14:16:46Z

Got it! That all makes sense to me. I like the idea of throwing out a bunch of stuff. My tentative plan:

Throw out the old do_validation code
Strip out the validation code from the harvester (and add unit tests)
Integrate with Inventory and make sure the export data.json function still works

🚧

nickumia-reisys · 2022-10-05T16:01:12Z

Just for the record ➡️ I removed the do_validation function and all references to it and the tests are still passing which means that we were never testing it (no surprise, but still just saying haha..)

GSA/ckanext-datajson@4b69c66

nickumia-reisys · 2022-10-05T18:08:37Z

Scratch that... the current logic is very very .... very complicated. Turns out the do_validation function was doing the right thing, just at the dataset level and not the catalog level.

nickumia-reisys · 2022-10-05T18:09:03Z

I'm going to start documenting and cleaning up things.

nickumia-reisys · 2022-10-05T21:23:38Z

Okay. I didn't do much documentation (...yet). But here's the updates:

The /pod/validate route works
The original validation function "works". It validates the datasets, but it doesn't validate the head matter.
Unit tests have been written for good and bad cases.

TODO...

Testing on catalog develop
@jbrown-xentity @FuhuXia Do we want to validate the top head matter? The stuff below ⬇️

  "@type": "dcat:Catalog",
  "describedBy": "https://project-open-data.cio.gov/v1.1/schema/catalog.json",
  "conformsTo": "https://project-open-data.cio.gov/v1.1/schema",
  "@context": "https://project-open-data.cio.gov/v1.1/schema/catalog.jsonld",

nickumia-reisys · 2022-10-05T21:27:12Z

Also, do want want to change the route? @jbrown-xentity @FuhuXia @hkdctol @Jin-Sun-tts @btylerburton @jbrown-xentity

As it stands right now, the public route (not available yet) would be https://catalog.data.gov/pod/validate... do we want to make it something more meaningful?

nickumia-reisys · 2022-10-05T21:38:24Z

Deployed to `develop`

nickumia-reisys · 2022-10-05T21:40:09Z

GSA data.json seemed to process extremely fast.

It's only 336 datasets; but still. Will test a larger data.json to see if there are memory issues or anything like that.

nickumia-reisys · 2022-10-05T22:08:01Z

Answering the questions in the ticket:

Current implementation doesn't seem to work, should investigate why locally and check logs.

Per previous comments, the current implementation does work. The datajson plugin just wasn't being loaded into production (along with small bugs in the ckanext-datajson extension).

What security is in place on the /pod/validate route?

Try converting some of the existing SSRF tests over

Most of the old tests are useless now because they were testing that curl wouldn't go do something stupid. The datajson implementation uses the built-in urllib python library which isn't as susceptible to the same things. Tests have been added as necessary.

What options does the /pod/validate route provide?

The [GET] route displays an empty form page (with some helper text) asking for url input. The [POST] route takes only one input (i.e. 'url') and does the work to validate it.

Can we create a basic, simple html page to hit the above api?

The page already exists and it is not an api route.

Do we need to provide a "prettified" format of the code? Or would we need to implement the inverse (a JSON response), as the code seems to return html?

The previous implementation does have a JSON output option which is used for tests; not clear whether there's other clients of that functionality.

There is no JSON output for the results. If we want to output that, that would be a new feature of datajson.

jbrown-xentity · 2022-10-06T13:25:55Z

Also, do want want to change the route? @jbrown-xentity @FuhuXia @hkdctol @Jin-Sun-tts @btylerburton @jbrown-xentity

As it stands right now, the public route (not available yet) would be https://catalog.data.gov/pod/validate... do we want to make it something more meaningful?

https://catalog.data.gov/datajson/validator

https://catalog.data.gov/datajson/validate

https://catalog.data.gov/datajsonvalidator

I would use /dcat-us/validator. I don't like the branding has become datajson, I hate that. It's like a naming convention merged with a file type, not a standard...

nickumia-reisys · 2022-10-06T14:19:30Z

TODO:

Deprecate old do_validation function AND Replace all references with the new jsonschema pod validation stuff
Change route from pod/validator to dcat-us/validator
~~Change route from data.json to dcat-us/data.json~~
Move validation route to new plugin, datajson_validator, in datajson extension

nickumia-reisys · 2022-10-07T18:27:55Z

Form is live at: https://catalog.data.gov/dcat-us/validator

jbrown-xentity · 2022-10-07T18:35:48Z

Note that it failed to handle https://data.doi.gov/data.json. The dashboard was inconsistent on whether it could handle it or not, it contains around 30K records and file size is around 80M. So consider anything above that to be untestable at this point, unless we really scale up.

nickumia-reisys · 2022-10-08T18:08:23Z

Increasing RAM to `1200M` allows the above `data.json` to be processed.

jbrown-xentity · 2022-10-10T14:09:36Z

We will make follow up tickets to match the functionality of dashboard, and decide how important/useful those features are... Looks great!

hkdctol added this to data.gov team board May 18, 2022

hkdctol moved this to Icebox in data.gov team board May 18, 2022

hkdctol moved this to Product Backlog in data.gov team board Sep 29, 2022

hkdctol moved this from Product Backlog to Sprint Backlog [7] in data.gov team board Sep 29, 2022

nickumia-reisys self-assigned this Oct 3, 2022

nickumia-reisys moved this from Sprint Backlog [7] to In Progress [8] in data.gov team board Oct 3, 2022

nickumia-reisys mentioned this issue Oct 4, 2022

Datajson Validator Route GSA/ckanext-datajson#122

Merged

nickumia-reisys moved this from In Progress [8] to Blocked in data.gov team board Oct 4, 2022

nickumia-reisys moved this from Blocked to Needs Review [2] in data.gov team board Oct 6, 2022

nickumia-reisys mentioned this issue Oct 7, 2022

Datajson Validator Documentation GSA/ckanext-datajson#123

Merged

nickumia-reisys moved this from Needs Review [2] to In Progress [8] in data.gov team board Oct 7, 2022

nickumia-reisys mentioned this issue Oct 7, 2022

Add Data.json Validation Route GSA/catalog.data.gov#567

Merged

nickumia-reisys moved this from In Progress [8] to Needs Review [2] in data.gov team board Oct 7, 2022

This was referenced Oct 7, 2022

Fix MANIFEST.in includes GSA/ckanext-datajson#124

Merged

Data.json Validator - Missing Template GSA/catalog.data.gov#570

Merged

nickumia-reisys mentioned this issue Oct 7, 2022

Redirect Dashboard Data.json Validator #4001

Closed

3 tasks

nickumia-reisys closed this as completed Oct 7, 2022

Repository owner moved this from Needs Review [2] to Done in data.gov team board Oct 7, 2022

This was referenced Oct 10, 2022

DCAT-US Validator Feature: Load File #4002

Open

DCAT-US Validator Feature: Raw JSON #4003

Open

DCAT-US Validator Feature: Select Schema #4004

Open

jbrown-xentity mentioned this issue Nov 21, 2022

CSV Error Logs GSA/project-open-data-dashboard#406

Closed

nickumia-reisys mentioned this issue May 25, 2023

Redirect Dashboard Homepage #4331

Closed

3 tasks

nickumia-reisys added component/catalog Related to catalog component playbooks/roles Feature labels Oct 7, 2023

nickumia-reisys moved this from ✔ Done to 🗄 Closed in data.gov team board Oct 11, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Replace DCAT-US validator #3503

Replace DCAT-US validator #3503

jbrown-xentity commented Oct 27, 2021 •

edited

Loading

nickumia-reisys commented Oct 4, 2022

nickumia-reisys commented Oct 4, 2022 •

edited

Loading

jbrown-xentity commented Oct 4, 2022

jbrown-xentity commented Oct 4, 2022

nickumia-reisys commented Oct 5, 2022

nickumia-reisys commented Oct 5, 2022

nickumia-reisys commented Oct 5, 2022

nickumia-reisys commented Oct 5, 2022

nickumia-reisys commented Oct 5, 2022 •

edited

Loading

nickumia-reisys commented Oct 5, 2022

nickumia-reisys commented Oct 5, 2022

nickumia-reisys commented Oct 5, 2022

nickumia-reisys commented Oct 5, 2022

jbrown-xentity commented Oct 6, 2022

nickumia-reisys commented Oct 6, 2022 •

edited

Loading

nickumia-reisys commented Oct 7, 2022

jbrown-xentity commented Oct 7, 2022

nickumia-reisys commented Oct 8, 2022

jbrown-xentity commented Oct 10, 2022

Replace DCAT-US validator #3503

Replace DCAT-US validator #3503

Comments

jbrown-xentity commented Oct 27, 2021 • edited Loading

User Story

Acceptance Criteria

Background

Security Considerations (required)

Sketch

nickumia-reisys commented Oct 4, 2022

The route does work on a local version of CKAN

nickumia-reisys commented Oct 4, 2022 • edited Loading

jbrown-xentity commented Oct 4, 2022

jbrown-xentity commented Oct 4, 2022

nickumia-reisys commented Oct 5, 2022

nickumia-reisys commented Oct 5, 2022

nickumia-reisys commented Oct 5, 2022

nickumia-reisys commented Oct 5, 2022

nickumia-reisys commented Oct 5, 2022 • edited Loading

nickumia-reisys commented Oct 5, 2022

nickumia-reisys commented Oct 5, 2022

Deployed to develop

nickumia-reisys commented Oct 5, 2022

GSA data.json seemed to process extremely fast.

nickumia-reisys commented Oct 5, 2022

Answering the questions in the ticket:

jbrown-xentity commented Oct 6, 2022

nickumia-reisys commented Oct 6, 2022 • edited Loading

nickumia-reisys commented Oct 7, 2022

Form is live at: https://catalog.data.gov/dcat-us/validator

jbrown-xentity commented Oct 7, 2022

nickumia-reisys commented Oct 8, 2022

Increasing RAM to 1200M allows the above data.json to be processed.

jbrown-xentity commented Oct 10, 2022

jbrown-xentity commented Oct 27, 2021 •

edited

Loading

nickumia-reisys commented Oct 4, 2022 •

edited

Loading

nickumia-reisys commented Oct 5, 2022 •

edited

Loading

Deployed to `develop`

nickumia-reisys commented Oct 6, 2022 •

edited

Loading

Increasing RAM to `1200M` allows the above `data.json` to be processed.