harvest job stuck when the json source has non-string title #4172

Closed
Jin-Sun-tts opened this issue Jan 24, 2023 · 2 comments
Labels
bug Software defect or bug CKAN component/catalog Related to catalog component playbooks/roles component/harvest Testing

Comments

Jin-Sun-tts commented Jan 24, 2023

How to reproduce

Stuck job reported here: (GSA/catalog.data.gov#745)

Run harvest from this source http://data.ferndalemi.gov/data.json

Expected behavior

Report an error when a dataset title has a non-string value.

Actual behavior

The harvest job gets stuck.

The following error appears in the log when harvesting the above source:

   2023-01-24T11:08:13.11-0500 [APP/PROC/WEB/0] OUT 2023-01-24 16:08:13,111 DEBUG [ckanext.harvest.queue] Received harvest job id: 505ed0fa-3fe0-416a-a82f-d0bc48a6c6c3
   2023-01-24T11:08:13.11-0500 [APP/PROC/WEB/0] OUT 2023-01-24 16:08:13,118 DEBUG [ckanext.datajson.datajson] In <Plugin DataJsonHarvester 'datajson_harvest'> gather_stage (https://rpm.tigbox.com/test/JSON-stuck/data.json)
   2023-01-24T11:08:13.40-0500 [APP/PROC/WEB/0] OUT 2023-01-24 16:08:13,405 INFO  [ckanext.harvest.queue] Received harvest object id: c11874fb-c987-449e-bb1d-914943659a43
   2023-01-24T11:08:13.43-0500 [APP/PROC/WEB/0] OUT 2023-01-24 16:08:13,430 DEBUG [ckanext.datajson.datajson] In <Plugin DataJsonHarvester 'datajson_harvest'> import_stage
   2023-01-24T11:08:13.43-0500 [APP/PROC/WEB/0] ERR Traceback (most recent call last):
   2023-01-24T11:08:13.43-0500 [APP/PROC/WEB/0] ERR File "/home/vcap/deps/1/bin/ckan", line 8, in <module>
   2023-01-24T11:08:13.43-0500 [APP/PROC/WEB/0] ERR sys.exit(ckan())
   2023-01-24T11:08:13.43-0500 [APP/PROC/WEB/0] ERR File "/home/vcap/deps/1/python/lib/python3.8/site-packages/click/core.py", line 829, in __call__
   2023-01-24T11:08:13.43-0500 [APP/PROC/WEB/0] ERR return self.main(*args, **kwargs)
   2023-01-24T11:08:13.43-0500 [APP/PROC/WEB/0] ERR File "/home/vcap/deps/1/python/lib/python3.8/site-packages/click/core.py", line 782, in main
   2023-01-24T11:08:13.43-0500 [APP/PROC/WEB/0] ERR rv = self.invoke(ctx)
   2023-01-24T11:08:13.43-0500 [APP/PROC/WEB/0] ERR File "/home/vcap/deps/1/python/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
   2023-01-24T11:08:13.43-0500 [APP/PROC/WEB/0] ERR return _process_result(sub_ctx.command.invoke(sub_ctx))
   2023-01-24T11:08:13.43-0500 [APP/PROC/WEB/0] ERR File "/home/vcap/deps/1/python/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
   2023-01-24T11:08:13.43-0500 [APP/PROC/WEB/0] ERR return _process_result(sub_ctx.command.invoke(sub_ctx))
   2023-01-24T11:08:13.43-0500 [APP/PROC/WEB/0] ERR File "/home/vcap/deps/1/python/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
   2023-01-24T11:08:13.43-0500 [APP/PROC/WEB/0] ERR return ctx.invoke(self.callback, **ctx.params)
   2023-01-24T11:08:13.43-0500 [APP/PROC/WEB/0] ERR File "/home/vcap/deps/1/python/lib/python3.8/site-packages/click/core.py", line 610, in invoke
   2023-01-24T11:08:13.43-0500 [APP/PROC/WEB/0] ERR return callback(*args, **kwargs)
   2023-01-24T11:08:13.43-0500 [APP/PROC/WEB/0] ERR File "/home/vcap/deps/1/src/ckanext-harvest/ckanext/harvest/cli.py", line 249, in fetch_consumer
   2023-01-24T11:08:13.43-0500 [APP/PROC/WEB/0] ERR utils.fetch_consumer()
   2023-01-24T11:08:13.43-0500 [APP/PROC/WEB/0] ERR File "/home/vcap/deps/1/src/ckanext-harvest/ckanext/harvest/utils.py", line 355, in fetch_consumer
   2023-01-24T11:08:13.43-0500 [APP/PROC/WEB/0] ERR fetch_callback(consumer, method, header, body)
   2023-01-24T11:08:13.43-0500 [APP/PROC/WEB/0] ERR File "/home/vcap/deps/1/src/ckanext-harvest/ckanext/harvest/queue.py", line 497, in fetch_callback
   2023-01-24T11:08:13.43-0500 [APP/PROC/WEB/0] ERR fetch_and_import_stages(harvester, obj)
   2023-01-24T11:08:13.43-0500 [APP/PROC/WEB/0] ERR File "/home/vcap/deps/1/src/ckanext-harvest/ckanext/harvest/queue.py", line 515, in fetch_and_import_stages
   2023-01-24T11:08:13.43-0500 [APP/PROC/WEB/0] ERR success_import = harvester.import_stage(obj)
   2023-01-24T11:08:13.43-0500 [APP/PROC/WEB/0] ERR File "/home/vcap/deps/1/python/lib/python3.8/site-packages/ckanext/datajson/datajson.py", line 505, in import_stage
   2023-01-24T11:08:13.43-0500 [APP/PROC/WEB/0] ERR title_to_check = self.make_package_name(dataset.get('title'), harvest_object.guid)
   2023-01-24T11:08:13.43-0500 [APP/PROC/WEB/0] ERR File "/home/vcap/deps/1/python/lib/python3.8/site-packages/ckanext/datajson/datajson.py", line 838, in make_package_name
   2023-01-24T11:08:13.43-0500 [APP/PROC/WEB/0] ERR name = munge_title_to_name(title).replace('_', '-')
   2023-01-24T11:08:13.43-0500 [APP/PROC/WEB/0] ERR File "/home/vcap/deps/1/python/lib/python3.8/site-packages/ckan/lib/munge.py", line 47, in munge_title_to_name
   2023-01-24T11:08:13.43-0500 [APP/PROC/WEB/0] ERR name = re.sub('[ .:/]', '-', name)
   2023-01-24T11:08:13.43-0500 [APP/PROC/WEB/0] ERR File "/home/vcap/deps/1/python/lib/python3.8/re.py", line 210, in sub
   2023-01-24T11:08:13.43-0500 [APP/PROC/WEB/0] ERR return _compile(pattern, flags).sub(repl, string, count)
   2023-01-24T11:08:13.43-0500 [APP/PROC/WEB/0] ERR TypeError: expected string or bytes-like object
   2023-01-24T11:08:13.34-0500 [APP/PROC/WEB/0] OUT 2023-01-24 16:08:13,340 DEBUG [ckanext.datajson.datajson] SOURCE CONFIG from DB {'private_datasets': 'False', 'validator_schema': 'non-federal'}
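
The last frames of the traceback can be reproduced in isolation. The following is a minimal sketch, assuming only the `re.sub` call shown at `munge.py` line 47; `munge_like` and `try_munge` are hypothetical stand-ins, not CKAN functions.

```python
import re


def munge_like(name):
    # Hypothetical stand-in for the failing line in the traceback:
    # re.sub is applied directly to whatever the title field contains.
    return re.sub('[ .:/]', '-', name)


def try_munge(title):
    # Return the munged name, or a description of the TypeError raised
    # when the title is not a string (for example, the JSON number 707).
    try:
        return munge_like(title)
    except TypeError as exc:
        return f"TypeError: {exc}"


if __name__ == "__main__":
    print(try_munge("707 Main St."))
    print(try_munge(707))
```

A string title passes through, while an integer title raises the same `TypeError` seen in the log, since `re.sub` only accepts strings or bytes-like objects.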

Sketch

Capture errors for the above data issue.
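
One way to picture "capturing" such errors is to record a failure per dataset and keep going, so that a single bad record cannot stall the rest of the job. This is only an illustrative sketch in plain Python: `import_all` and `import_one` are hypothetical names and do not correspond to the ckanext-harvest or ckanext-datajson APIs.

```python
def import_all(datasets, import_one):
    """Run import_one on each dataset and collect errors instead of raising."""
    imported, errors = [], []
    for index, dataset in enumerate(datasets):
        try:
            imported.append(import_one(dataset))
        except Exception as exc:
            # Broad on purpose: any failure is recorded against the
            # offending record so the remaining records still run.
            errors.append((index, f"{type(exc).__name__}: {exc}"))
    return imported, errors


if __name__ == "__main__":
    sample = [{"title": "Parks"}, {"title": 707}, {"title": "Roads"}]
    done, failed = import_all(sample, lambda d: d["title"].lower())
    print(done)
    print(failed)
```

With the sample input, the two string titles are processed and the numeric title is reported as an error at index 1 rather than aborting the loop.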

@Jin-Sun-tts Jin-Sun-tts added the bug Software defect or bug label Jan 24, 2023
@hkdctol hkdctol moved this to 📔 Product Backlog in data.gov team board Jan 26, 2023
@hkdctol hkdctol moved this from 📔 Product Backlog to 📟 Sprint Backlog [7] in data.gov team board Jan 26, 2023
@hkdctol hkdctol moved this from 📟 Sprint Backlog [7] to 📔 Product Backlog in data.gov team board Jan 26, 2023
@nickumia-reisys nickumia-reisys moved this from 📔 Product Backlog to 🏗 In Progress [8] in data.gov team board Jan 27, 2023
@nickumia-reisys nickumia-reisys self-assigned this Jan 27, 2023
@nickumia-reisys
Contributor

The DCAT-US standard requires `title` to be a string. However, some JSON builders may convert a purely numeric string title such as "707" into the number 707. Although this is bad metadata (and, by extension, an agency issue), we have decided that it is not an error worth reporting and will simply cast the number back into a string. See the above PRs for details.
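
The cast described here could look roughly like the sketch below. It is not the code from the PRs: `coerce_title` is a hypothetical helper, and rejecting booleans and other non-numeric types is an assumption made for the example.

```python
def coerce_title(title):
    """Return the title as a string, casting numeric values back to text."""
    if isinstance(title, str):
        return title
    # bool is a subclass of int in Python, so exclude it explicitly
    # (assumption: True/False is never a meaningful title).
    if isinstance(title, (int, float)) and not isinstance(title, bool):
        return str(title)
    raise ValueError(f"title must be a string or number, got {type(title).__name__}")


if __name__ == "__main__":
    print(coerce_title("Parks"))
    print(coerce_title(707))
```

Applying a cast like this before the title reaches `munge_title_to_name` would mean `re.sub` always receives a string.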

@nickumia-reisys
Contributor

If we were to report this, it should be taken care of as part of the following issue and is out-of-scope for this particular harvesting issue:

@nickumia-reisys nickumia-reisys moved this from 🏗 In Progress [8] to 👀 Needs Review [2] in data.gov team board Jan 31, 2023
@nickumia-reisys nickumia-reisys moved this from 👀 Needs Review [2] to ✔ Done in data.gov team board Feb 1, 2023
@nickumia-reisys nickumia-reisys added component/catalog Related to catalog component playbooks/roles component/harvest Testing CKAN labels Oct 9, 2023
@nickumia-reisys nickumia-reisys moved this from ✔ Done to 🗄 Closed in data.gov team board Oct 11, 2023