Catalog harvest does not report error when data.json is invalid #3658

Closed
FuhuXia opened this issue Jan 25, 2022 · 7 comments
Labels
bug Software defect or bug

Comments

@FuhuXia
Member

FuhuXia commented Jan 25, 2022

When a data.json harvest source contains entries that are invalid under DCAT-US Schema v1.1, the harvest reports a successful job with 0 changes. This was reported by rrb and confirmed on job https://admin-catalog-next.data.gov/harvest/rrb-json/job/2f728b7a-d954-469e-9b3a-81a2364a17b6.

How to reproduce

Create a data.json source containing entries that have no keyword and no identifier, then harvest it.
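
A minimal sketch of such a source, assuming the usual DCAT-US v1.1 catalog layout with a top-level `dataset` array. The `write_catalog` helper, the file name, and all field values are placeholders for illustration, not the actual rrb source.

```python
import json

# Hypothetical catalog for reproducing the issue; all values are placeholders.
catalog = {
    "conformsTo": "https://project-open-data.cio.gov/v1.1/schema",
    "dataset": [
        {
            # Carries both fields under test.
            "title": "Dataset with keyword and identifier",
            "description": "Placeholder description.",
            "keyword": ["example"],
            "identifier": "example-001",
        },
        {
            # Invalid on purpose: no "keyword" and no "identifier".
            "title": "Dataset without keyword and identifier",
            "description": "Placeholder description.",
        },
    ],
}


def write_catalog(path="data.json"):
    """Write the catalog to disk so it can be served as a harvest source."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(catalog, fh, indent=2)
    return path
```

The resulting file still has to be hosted at a URL and registered as a harvest source before it can be harvested.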

Expected behavior

The harvest job reports the specific errors.

Actual behavior

Both the UI and the email notification show a successful harvest.

Context

If a data.json file fails validation with the tool at https://dashboard.data.gov/validate, the gather process should pick up the error. This worked before, but I am not sure when it broke.
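
One way the gather process could pick up such errors is a pre-check that runs before any dataset field is read. The sketch below is an illustration only: `REQUIRED_FIELDS` is a hand-written subset of the fields DCAT-US v1.1 requires, and `find_missing_fields` is a hypothetical helper, not part of ckanext-datajson or the dashboard validator.

```python
# Illustrative subset of the fields DCAT-US v1.1 requires on every dataset.
REQUIRED_FIELDS = ("title", "description", "keyword", "modified", "identifier")


def find_missing_fields(catalog):
    """Return one error message per dataset that lacks a required field."""
    errors = []
    for index, dataset in enumerate(catalog.get("dataset", [])):
        missing = [f for f in REQUIRED_FIELDS if f not in dataset]
        if missing:
            label = dataset.get("title", "<untitled>")
            errors.append(
                f"dataset #{index} ({label}): missing {', '.join(missing)}"
            )
    return errors
```

An empty list means every dataset carries the checked fields. A full schema validation would also check types and formats, which this sketch does not attempt.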

@FuhuXia FuhuXia added the bug Software defect or bug label Jan 25, 2022
@FuhuXia
Member Author

FuhuXia commented Jan 25, 2022

This may be related to another bug, #3532.

@FuhuXia
Member Author

FuhuXia commented Jul 8, 2022

The doc-gov JSON harvest failed for the same reason. The file is not valid DCAT-US Schema JSON, but the gather stage processes it without validating it first and then fails with an error in the gather log. That error is not reported in the job report.

image

@hkdctol hkdctol moved this to Product Backlog in data.gov team board Aug 2, 2022
@FuhuXia
Member Author

FuhuXia commented Nov 7, 2022

The doc-gov harvest failed at the gather stage because a record does not have the required identifier field. The error is not reported.

Traceback (most recent call last):
  File "/home/vcap/deps/1/bin/ckan", line 8, in <module>
    sys.exit(ckan())
  File "/home/vcap/deps/1/python/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/vcap/deps/1/python/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/vcap/deps/1/python/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/vcap/deps/1/python/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/vcap/deps/1/python/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/vcap/deps/1/python/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/vcap/deps/1/src/ckanext-harvest/ckanext/harvest/cli.py", line 241, in gather_consumer
    utils.gather_consumer()
  File "/home/vcap/deps/1/src/ckanext-harvest/ckanext/harvest/utils.py", line 340, in gather_consumer
    gather_callback(consumer, method, header, body)
  File "/home/vcap/deps/1/src/ckanext-harvest/ckanext/harvest/queue.py", line 374, in gather_callback
    harvest_object_ids = gather_stage(harvester, job)
  File "/home/vcap/deps/1/src/ckanext-harvest/ckanext/harvest/queue.py", line 432, in gather_stage
    harvest_object_ids = harvester.gather_stage(job)
  File "/home/vcap/deps/1/python/lib/python3.7/site-packages/ckanext/datajson/datajson.py", line 231, in gather_stage
    if dataset['identifier'] in unique_datasets:
KeyError: 'identifier'
Exit status 1
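
The last frame can be reproduced outside CKAN. The sketch below is not the ckanext-datajson code: `dedupe_unguarded` is a hypothetical stand-in that mirrors only the `dataset['identifier'] in unique_datasets` lookup from the traceback, to show that a single record without `identifier` aborts the whole loop.

```python
def dedupe_unguarded(datasets):
    """Collect unique identifiers using the same unguarded lookup pattern."""
    unique_datasets = set()
    for dataset in datasets:
        # Raises KeyError as soon as a record has no "identifier" key.
        if dataset["identifier"] in unique_datasets:
            continue
        unique_datasets.add(dataset["identifier"])
    return unique_datasets


records = [{"identifier": "a"}, {"title": "no identifier"}, {"identifier": "b"}]

try:
    dedupe_unguarded(records)
except KeyError as exc:
    # The third record is never reached because the exception ends the loop.
    failure = f"KeyError: {exc}"
```

Because nothing in this path catches the exception, it propagates up to the process entry point, which matches the `Exit status 1` at the end of the log.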

@FuhuXia FuhuXia moved this from 📔 Product Backlog to 📟 Sprint Backlog [7] in data.gov team board Nov 7, 2022
@FuhuXia FuhuXia removed the status in data.gov team board Nov 7, 2022
@hkdctol hkdctol moved this to 📔 Product Backlog in data.gov team board Nov 17, 2022
@nickumia-reisys
Contributor

@FuhuXia Is this being captured now because of #3532?

@FuhuXia
Member Author

FuhuXia commented Nov 29, 2022

No, this is still an issue.

@Jin-Sun-tts Jin-Sun-tts self-assigned this Mar 13, 2023
@Jin-Sun-tts Jin-Sun-tts moved this from 📔 Product Backlog to 🏗 In Progress [8] in data.gov team board Mar 13, 2023
@Jin-Sun-tts
Contributor

Before schema validation runs, gather_stage and import_stage access the dataset's identifier and title. If either field is missing from the dataset, the resulting error stops the harvest without being reported. To make sure these errors are reported, we added checks for the dataset's title and identifier before they are accessed.
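
The kind of check described here could look like the following sketch. It is not the actual patch: `gather_identifiers` and the `report_error` callback are hypothetical names standing in for whatever the harvester uses to attach an error to the job report.

```python
def gather_identifiers(datasets, report_error):
    """Return unique identifiers, reporting incomplete datasets instead of crashing."""
    unique_datasets = []
    seen = set()
    for index, dataset in enumerate(datasets):
        # Check both fields before anything reads them.
        # An empty value is treated the same as a missing key.
        missing = [f for f in ("identifier", "title") if not dataset.get(f)]
        if missing:
            report_error(f"dataset #{index} is missing: {', '.join(missing)}")
            continue  # Skip this record but keep processing the rest.
        if dataset["identifier"] in seen:
            continue
        seen.add(dataset["identifier"])
        unique_datasets.append(dataset["identifier"])
    return unique_datasets


errors = []
result = gather_identifiers(
    [
        {"identifier": "a", "title": "A"},
        {"title": "No identifier"},
        {"identifier": "c"},
    ],
    errors.append,
)
```

Skipping the incomplete record and continuing is one possible design choice in this sketch; a harvester could equally report the error and stop the job.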

@Jin-Sun-tts
Contributor

Tested on DEV with one dataset missing its identifier and one dataset missing its title. The log reports the errors as expected, as shown below:

Image

Projects
Archived in project
Development

No branches or pull requests

4 participants