Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace DCAT-US validator #3503

Closed
2 tasks done
jbrown-xentity opened this issue Oct 27, 2021 · 19 comments
Closed
2 tasks done

Replace DCAT-US validator #3503

jbrown-xentity opened this issue Oct 27, 2021 · 19 comments
Assignees
Labels
component/catalog Related to catalog component playbooks/roles Feature

Comments

@jbrown-xentity
Copy link
Contributor

jbrown-xentity commented Oct 27, 2021

User Story

In order to ensure the dashboard can be deprecated, data.gov admins want a validator of DCAT-US files built into catalog (ckanext-datajson extension).

Acceptance Criteria

[ACs should be clearly demoable/verifiable whenever possible. Try specifying them using BDD.]

  • GIVEN an endpoint exists in catalog to validate DCAT-US files
    WHEN a URL to a file (or file) is provided with a DCAT-US format
    THEN the user is provided with meaningful output of valid and invalid format
  • Redirect Dashboard Data.json Validator #4001
    THEN my browser is redirected to the new validator location

Background

Current dashboard validator exists here.
An API route already exists (but is never used/tested) to validate pod files here.

Security Considerations (required)

Need to ensure the SSRF protections currently in the dashboard are also in this implementation.

Sketch

Would involve investigating the following:

  • Current implementation doesn't seem to work, should investigate why locally and check logs.
  • What security is in place on the /pod/validate route?
  • What options does the /pod/validate route provide?
  • Can we create a basic, simple html page to hit the above api?
  • Do we need to provide a "prettified" format of the code? Or would we need to implement the inverse (a JSON response), as the code seems to return html?
    • The previous implementation does have a JSON output option which is used for tests; not clear whether there's other clients of that functionality.
@hkdctol hkdctol moved this to Icebox in data.gov team board May 18, 2022
@hkdctol hkdctol moved this to Product Backlog in data.gov team board Sep 29, 2022
@hkdctol hkdctol moved this from Product Backlog to Sprint Backlog [7] in data.gov team board Sep 29, 2022
@nickumia-reisys nickumia-reisys self-assigned this Oct 3, 2022
@nickumia-reisys nickumia-reisys moved this from Sprint Backlog [7] to In Progress [8] in data.gov team board Oct 3, 2022
@nickumia-reisys
Copy link
Contributor

The route does work on a local version of CKAN

image

@nickumia-reisys
Copy link
Contributor

nickumia-reisys commented Oct 4, 2022

So... the current state of this is that the route works in the datajson extension, but isn't getting recognized by the catalog main app. We will eventually fix this. However, the more pressing issue is that the main do_validation function is not checking the right schema.

The general flow steps are:

  • [LINK] Hit the flask route at /pod/validate and get redirected to the validator function
  • [LINK] Parse the data.json link from the POST request.
  • [LINK] Download the file and make sure it is a valid JSON.
  • [LINK] Validate it is a proper DATA.JSON.
    • [LINK] The function that does the "validation"

The problem is in the do_validation function. It tries to make sure the data.json is a list (which is wrong). It should be a dict.

@FuhuXia pointed me to the validate_schema function. But this function isn't actually doing the validation of the schema. It is just doing the validation that the schema is valid name of a schema.

The closest thing that is doing the validation is the _validata_dataset function which is part of the DatasetHarvesterBase class. However, I don't know if we want to use the Harvester class to do data.json validation.


@FuhuXia @jbrown-xentity

  • Do we need the old do_validation function if it is validating the wrong things?
  • Can you please help me find the right function that should be doing the data.json validation?

@nickumia-reisys nickumia-reisys moved this from In Progress [8] to Blocked in data.gov team board Oct 4, 2022
@jbrown-xentity
Copy link
Contributor Author

You're probably right, the validation is embedded in the harvester class.
We do not need the old do_validation if it is wrong.
I would consider pulling out the validation from the harvester in some way, and having the harvester code and the validate code reference the same function. If that is difficult because of the way the validation has to be implemented for the harvester, maybe there is a way to "break out" the validation into small chunks that can be shared in some way.
For the record, this "feature" of ckanext-datajson is poorly documented and not sure how well it ever worked (certainly not since ckan2.8), so it's entirely possible that most of the current logic will need to be thrown out...

@jbrown-xentity
Copy link
Contributor Author

What I didn't say is: the validation that occurs in the harvester logic does seem to be solid, and has worked fine in the past. So the logic there should be able to be kept.

@nickumia-reisys
Copy link
Contributor

Got it! That all makes sense to me. I like the idea of throwing out a bunch of stuff. My tentative plan:

  • Throw out the old do_validation code
  • Strip out the validation code from the harvester (and add unit tests)
  • Integrate with Inventory and make sure the export data.json function still works

🚧

@nickumia-reisys
Copy link
Contributor

Just for the record ➡️ I removed the do_validation function and all references to it and the tests are still passing which means that we were never testing it (no surprise, but still just saying haha..)

@nickumia-reisys
Copy link
Contributor

Scratch that... the current logic is very very .... very complicated. Turns out the do_validation function was doing the right thing, just at the dataset level and not the catalog level.

image

@nickumia-reisys
Copy link
Contributor

I'm going to start documenting and cleaning up things.

@nickumia-reisys
Copy link
Contributor

nickumia-reisys commented Oct 5, 2022

Okay. I didn't do much documentation (...yet). But here's the updates:

  • The /pod/validate route works
  • The original validation function "works". It validates the datasets, but it doesn't validate the head matter.
  • Unit tests have been written for good and bad cases.

TODO...

  • Testing on catalog develop
  • @jbrown-xentity @FuhuXia Do we want to validate the top head matter? The stuff below ⬇️
  "@type": "dcat:Catalog",
  "describedBy": "https://project-open-data.cio.gov/v1.1/schema/catalog.json",
  "conformsTo": "https://project-open-data.cio.gov/v1.1/schema",
  "@context": "https://project-open-data.cio.gov/v1.1/schema/catalog.jsonld",

@nickumia-reisys
Copy link
Contributor

Also, do want want to change the route? @jbrown-xentity @FuhuXia @hkdctol @Jin-Sun-tts @btylerburton @jbrown-xentity

As it stands right now, the public route (not available yet) would be https://catalog.data.gov/pod/validate... do we want to make it something more meaningful?

@nickumia-reisys
Copy link
Contributor

Deployed to develop

image

@nickumia-reisys
Copy link
Contributor

GSA data.json seemed to process extremely fast.

It's only 336 datasets; but still. Will test a larger data.json to see if there are memory issues or anything like that.

image

@nickumia-reisys
Copy link
Contributor

Answering the questions in the ticket:

Per previous comments, the current implementation does work. The datajson plugin just wasn't being loaded into production (along with small bugs in the ckanext-datajson extension).

Most of the old tests are useless now because they were testing that curl wouldn't go do something stupid. The datajson implementation uses the built-in urllib python library which isn't as susceptible to the same things. Tests have been added as necessary.

  • What options does the /pod/validate route provide?

The [GET] route displays an empty form page (with some helper text) asking for url input. The [POST] route takes only one input (i.e. 'url') and does the work to validate it.

  • Can we create a basic, simple html page to hit the above api?

The page already exists and it is not an api route.

  • Do we need to provide a "prettified" format of the code? Or would we need to implement the inverse (a JSON response), as the code seems to return html?
    • The previous implementation does have a JSON output option which is used for tests; not clear whether there's other clients of that functionality.

There is no JSON output for the results. If we want to output that, that would be a new feature of datajson.

@jbrown-xentity
Copy link
Contributor Author

Also, do want want to change the route? @jbrown-xentity @FuhuXia @hkdctol @Jin-Sun-tts @btylerburton @jbrown-xentity

As it stands right now, the public route (not available yet) would be https://catalog.data.gov/pod/validate... do we want to make it something more meaningful?

I would use /dcat-us/validator. I don't like the branding has become datajson, I hate that. It's like a naming convention merged with a file type, not a standard...

@nickumia-reisys
Copy link
Contributor

nickumia-reisys commented Oct 6, 2022

TODO:

  • Deprecate old do_validation function AND Replace all references with the new jsonschema pod validation stuff
  • Change route from pod/validator to dcat-us/validator
  • Change route from data.json to dcat-us/data.json
  • Move validation route to new plugin, datajson_validator, in datajson extension

@nickumia-reisys
Copy link
Contributor

Form is live at: https://catalog.data.gov/dcat-us/validator

image

Repository owner moved this from Needs Review [2] to Done in data.gov team board Oct 7, 2022
@jbrown-xentity
Copy link
Contributor Author

Note that it failed to handle https://data.doi.gov/data.json. The dashboard was inconsistent on whether it could handle it or not, it contains around 30K records and file size is around 80M. So consider anything above that to be untestable at this point, unless we really scale up.

@nickumia-reisys
Copy link
Contributor

Increasing RAM to 1200M allows the above data.json to be processed.

image

@jbrown-xentity
Copy link
Contributor Author

We will make follow up tickets to match the functionality of dashboard, and decide how important/useful those features are... Looks great!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component/catalog Related to catalog component playbooks/roles Feature
Projects
Archived in project
Development

No branches or pull requests

2 participants