catalog harvest error not reported #3532

Closed
2 tasks
FuhuXia opened this issue Nov 10, 2021 · 13 comments
Assignees
Labels
bug Software defect or bug CKAN component/catalog Related to catalog component playbooks/roles Notifications support Issues from agency requests or affecting users

Comments

@FuhuXia
Member

FuhuXia commented Nov 10, 2021

Some harvesting errors in the gather stage are silently ignored. The harvest report shows 0 changes and 0 errors.

How to reproduce

Run a harvest job on DOL: https://admin-catalog-next.data.gov/harvest/about/dol-json
There are some changes in the harvest source data.json file http://www.dol.gov/data.json

Expected behavior

Harvest report should report some updates

Actual behavior

Harvest reports
0 added 0 updated 0 deleted 0 not modified

Saw this error in /var/log/gather-consumer.log:

2021-11-10 14:28:59,080 INFO  [ckanext.datajson.datajson_ckan_28] Datajson creates a HO: ETA-5-012:003-517
2021-11-10 14:28:59,097 WARNI [ckanext.datajson.datajson_ckan_28] deleting package office-of-health-plan-standards-and-compliance-assistance-case-tracking-system-ohpsca-data-6e2a5 (022857d4-4531-446a-a14e-76500623746d) because it is no longer in http://www.dol.gov/data.json
2021-11-10 14:28:59,107 DEBUG [ckanext.geodatagov.logic] Search backend solr

Traceback (most recent call last):
  File "/usr/bin/ckan", line 45, in <module>
    load_entry_point('PasteScript', 'console_scripts', 'paster')()
  File "/usr/lib/ckan/lib/python2.7/site-packages/paste/script/command.py", line 102, in run
    invoke(command, command_name, options, args[1:])
  File "/usr/lib/ckan/lib/python2.7/site-packages/paste/script/command.py", line 141, in invoke
    exit_code = runner.run(args)
  File "/usr/lib/ckan/lib/python2.7/site-packages/paste/script/command.py", line 236, in run
    result = self.command()
  File "/usr/lib/ckan-new/src/ckanext-harvest/ckanext/harvest/commands/harvester.py", line 235, in command
    utils.gather_consumer()
  File "/usr/lib/ckan-new/src/ckanext-harvest/ckanext/harvest/utils.py", line 336, in gather_consumer
    gather_callback(consumer, method, header, body)
  File "/usr/lib/ckan-new/src/ckanext-harvest/ckanext/harvest/queue.py", line 368, in gather_callback
    harvest_object_ids = gather_stage(harvester, job)
  File "/usr/lib/ckan-new/src/ckanext-harvest/ckanext/harvest/queue.py", line 426, in gather_stage
    harvest_object_ids = harvester.gather_stage(job)
  File "/usr/lib/ckan-new/src/ckanext-datajson/ckanext/datajson/datajson_ckan_28.py", line 296, in gather_stage
    get_action('package_update')(self.context(), pkg)
  File "/usr/lib/ckan-new/src/ckan/ckan/logic/__init__.py", line 498, in wrapped
    result = _action(context, data_dict, **kw)
  File "/usr/lib/ckan-new/src/ckanext-geodatagov/ckanext/geodatagov/logic.py", line 503, in package_update
    return up_func(context, data_dict)
  File "/usr/lib/ckan-new/src/ckan/ckan/logic/action/update.py", line 301, in package_update
    raise ValidationError(errors)
ckan.logic.ValidationError: {'extras_validation': [u'Duplicate key "harvest_object_id"']}
2021-11-10 14:29:48,335 DEBUG [ckanext.harvest.model] Harvest tables defined in memory
...
...
2021-11-10 14:29:49,057 DEBUG [ckanext.harvest.queue] Gather queue consumer registered

Sketch

@FuhuXia FuhuXia added the bug Software defect or bug label Nov 10, 2021
@FuhuXia
Member Author

FuhuXia commented Nov 10, 2021

The direct cause of the [u'Duplicate key "harvest_object_id"'] error is that a dataset has two harvest_object_id values in SOLR. The package_update function in the gather stage fails on that when it tries to delete the dataset.

image

After a SOLR reindex of that dataset, harvest_object_id was back to normal and the harvest job ran normally.
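
To spot this condition before package_update rejects it, one option is to check a package's extras for repeated keys. The following is a minimal sketch, assuming the package is a plain dict whose "extras" entry is a list of {"key": ..., "value": ...} dicts; the function name duplicate_extra_keys is hypothetical and not part of CKAN.

```python
from collections import Counter


def duplicate_extra_keys(package_dict):
    """Return the sorted extras keys that appear more than once."""
    keys = [extra["key"] for extra in package_dict.get("extras", [])]
    return sorted(key for key, count in Counter(keys).items() if count > 1)


# Illustrative data only: two harvest_object_id extras on one dataset.
pkg = {
    "extras": [
        {"key": "harvest_object_id", "value": "first-id"},
        {"key": "harvest_object_id", "value": "second-id"},
        {"key": "identifier", "value": "ETA-5-012:003-517"},
    ]
}
duplicates = duplicate_extra_keys(pkg)
```

A non-empty result would be a hint that the dataset needs a reindex before the harvester touches it.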

@mogul
Contributor

mogul commented Nov 15, 2021

@FuhuXia will do some digging to figure out how often this is happening so we have a better idea how to prioritize it.

@FuhuXia
Member Author

FuhuXia commented Nov 16, 2021

A survey of the last 60 days found four types of errors, 83 occurrences in total, that halt the gather process without any error being reported.

38 occurrences of ValidationError

ckan.logic.ValidationError: {'extras_validation': [u'Duplicate key "harvest_object_id"']}

Detailed error message and DOL sample given above.

21 occurrences of ValidationError

Traceback (most recent call last):
  File "/usr/bin/ckan", line 45, in <module>
    load_entry_point('PasteScript', 'console_scripts', 'paster')()
  File "/usr/lib/ckan/lib/python2.7/site-packages/paste/script/command.py", line 102, in run
    invoke(command, command_name, options, args[1:])
  File "/usr/lib/ckan/lib/python2.7/site-packages/paste/script/command.py", line 141, in invoke
    exit_code = runner.run(args)
  File "/usr/lib/ckan/lib/python2.7/site-packages/paste/script/command.py", line 236, in run
    result = self.command()
  File "/usr/lib/ckan-new/src/ckanext-harvest/ckanext/harvest/commands/harvester.py", line 235, in command
    utils.gather_consumer()
  File "/usr/lib/ckan-new/src/ckanext-harvest/ckanext/harvest/utils.py", line 336, in gather_consumer
    gather_callback(consumer, method, header, body)
  File "/usr/lib/ckan-new/src/ckanext-harvest/ckanext/harvest/queue.py", line 368, in gather_callback
    harvest_object_ids = gather_stage(harvester, job)
  File "/usr/lib/ckan-new/src/ckanext-harvest/ckanext/harvest/queue.py", line 426, in gather_stage
    harvest_object_ids = harvester.gather_stage(job)
  File "/usr/lib/ckan-new/src/ckanext-datajson/ckanext/datajson/datajson_ckan_28.py", line 296, in gather_stage
    get_action('package_update')(self.context(), pkg)
  File "/usr/lib/ckan-new/src/ckan/ckan/logic/__init__.py", line 498, in wrapped
    result = _action(context, data_dict, **kw)
  File "/usr/lib/ckan-new/src/ckanext-geodatagov/ckanext/geodatagov/logic.py", line 503, in package_update
    return up_func(context, data_dict)
  File "/usr/lib/ckan-new/src/ckan/ckan/logic/action/update.py", line 332, in package_update
    item.edit(pkg)
  File "/usr/lib/ckan-new/src/ckanext-spatial/ckanext/spatial/plugin.py", line 93, in edit
    self.check_spatial_extra(package)
  File "/usr/lib/ckan-new/src/ckanext-spatial/ckanext/spatial/plugin.py", line 115, in check_spatial_extra
    raise p.toolkit.ValidationError(error_dict, error_summary=package_error_summary(error_dict))
ckan.logic.ValidationError: {'spatial': [u'Error decoding JSON object: Expecting value: line 1 column 39 (char 38)']}

Sample harvest report:
https://admin-catalog-next.data.gov/harvest/usda-json/job/98201711-61b2-4706-a122-a671f1b12c95

22 occurrences of TypeError

Traceback (most recent call last):
  File "/usr/bin/ckan", line 45, in <module>
    load_entry_point('PasteScript', 'console_scripts', 'paster')()
  File "/usr/lib/ckan/lib/python2.7/site-packages/paste/script/command.py", line 102, in run
    invoke(command, command_name, options, args[1:])
  File "/usr/lib/ckan/lib/python2.7/site-packages/paste/script/command.py", line 141, in invoke
    exit_code = runner.run(args)
  File "/usr/lib/ckan/lib/python2.7/site-packages/paste/script/command.py", line 236, in run
    result = self.command()
  File "/usr/lib/ckan-new/src/ckanext-harvest/ckanext/harvest/commands/harvester.py", line 235, in command
    utils.gather_consumer()
  File "/usr/lib/ckan-new/src/ckanext-harvest/ckanext/harvest/utils.py", line 336, in gather_consumer
    gather_callback(consumer, method, header, body)
  File "/usr/lib/ckan-new/src/ckanext-harvest/ckanext/harvest/queue.py", line 368, in gather_callback
    harvest_object_ids = gather_stage(harvester, job)
  File "/usr/lib/ckan-new/src/ckanext-harvest/ckanext/harvest/queue.py", line 426, in gather_stage
    harvest_object_ids = harvester.gather_stage(job)
  File "/usr/lib/ckan-new/src/ckanext-datajson/ckanext/datajson/datajson_ckan_28.py", line 119, in gather_stage
    source_datasets, catalog_values = self.load_remote_catalog(harvest_job)
  File "/usr/lib/ckan-new/src/ckanext-datajson/ckanext/datajson/harvester_datajson.py", line 49, in load_remote_catalog
    datasets = json.loads(data, 'iso-8859-1')
  File "/usr/local/lib/python2.7.16/lib/python2.7/json/__init__.py", line 352, in loads
    return cls(encoding=encoding, **kw).decode(s)
  File "/usr/local/lib/python2.7.16/lib/python2.7/json/decoder.py", line 364, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/local/lib/python2.7.16/lib/python2.7/json/decoder.py", line 380, in raw_decode
    obj, end = self.scan_once(s, idx)
  File "/usr/local/lib/python2.7.16/lib/python2.7/json/decoder.py", line 36, in errmsg
    lineno, colno = linecol(doc, pos)
  File "/usr/local/lib/python2.7.16/lib/python2.7/json/decoder.py", line 30, in linecol
    colno = pos - doc.rindex('\n', 0, pos)
TypeError: unsupported operand type(s) for -: 'int' and 'NoneType'

Sample harvest report:
https://admin-catalog-next.data.gov/harvest/hud-json/job/d7705cf5-cc3d-4126-881e-539a30e4b170
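
Whatever the root cause inside the decoder, the parse step can be wrapped so that a failure becomes a message the harvester can report rather than an uncaught exception. This is a sketch in Python 3 syntax under that assumption; parse_catalog is a hypothetical helper, not the function in harvester_datajson.py.

```python
import json


def parse_catalog(raw_bytes, encoding="iso-8859-1"):
    """Decode and parse a data.json payload.

    Returns (datasets, None) on success or (None, message) on failure,
    so the caller can record the message instead of crashing.
    """
    try:
        text = raw_bytes.decode(encoding)
        return json.loads(text), None
    except (ValueError, TypeError) as exc:
        # ValueError covers JSON and Unicode decode errors; TypeError is
        # included because that is what the traceback above ends with.
        return None, "Could not parse catalog: {!r}".format(exc)
```

The caller would then decide how to store the returned message on the harvest job.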

2 occurrences of TypeError

Traceback (most recent call last):
  File "/usr/bin/ckan", line 45, in <module>
    load_entry_point('PasteScript', 'console_scripts', 'paster')()
  File "/usr/lib/ckan/lib/python2.7/site-packages/paste/script/command.py", line 102, in run
    invoke(command, command_name, options, args[1:])
  File "/usr/lib/ckan/lib/python2.7/site-packages/paste/script/command.py", line 141, in invoke
    exit_code = runner.run(args)
  File "/usr/lib/ckan/lib/python2.7/site-packages/paste/script/command.py", line 236, in run
    result = self.command()
  File "/usr/lib/ckan-new/src/ckanext-harvest/ckanext/harvest/commands/harvester.py", line 235, in command
    utils.gather_consumer()
  File "/usr/lib/ckan-new/src/ckanext-harvest/ckanext/harvest/utils.py", line 336, in gather_consumer
    gather_callback(consumer, method, header, body)
  File "/usr/lib/ckan-new/src/ckanext-harvest/ckanext/harvest/queue.py", line 368, in gather_callback
    harvest_object_ids = gather_stage(harvester, job)
  File "/usr/lib/ckan-new/src/ckanext-harvest/ckanext/harvest/queue.py", line 426, in gather_stage
    harvest_object_ids = harvester.gather_stage(job)
  File "/usr/lib/ckan-new/src/ckanext-geodatagov/ckanext/geodatagov/harvesters/arcgis.py", line 157, in gather_stage
    url = urllib.parse.urljoin(source_url, search_path)
  File "/usr/lib/ckan/lib/python2.7/site-packages/future/backports/urllib/parse.py", line 418, in urljoin
    base, url, _coerce_result = _coerce_args(base, url)
  File "/usr/lib/ckan/lib/python2.7/site-packages/future/backports/urllib/parse.py", line 115, in _coerce_args
    raise TypeError("Cannot mix str and non-str arguments")
TypeError: Cannot mix str and non-str arguments

Sample harvest report:
https://admin-catalog-next.data.gov/harvest/https-services3-arcgis-com-x6xo7eazaw454atz-arcgis-rest-services-usboundary-featureserver-0/job/dd04b549-5220-4506-9229-81af9717f113
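
The "Cannot mix str and non-str arguments" error is raised when urljoin receives one text argument and one bytes argument. A minimal sketch of coercing both sides first, written for Python 3 (the traceback above is from Python 2.7 with the future backports); to_text and safe_urljoin are hypothetical names, and UTF-8 is an assumed encoding.

```python
from urllib.parse import urljoin


def to_text(value, encoding="utf-8"):
    """Return value as str, decoding bytes if necessary."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return str(value)


def safe_urljoin(base, path):
    """Join two URL parts after coercing both to the same text type."""
    return urljoin(to_text(base), to_text(path))
```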

@FuhuXia
Member Author

FuhuXia commented May 2, 2022

Another scenario caught. When a record in a data.json harvest source contains a non-ASCII character in the identifier (such as pinaleño), the gather process raises an error and exits without processing any remaining records. The error does not show in the harvest report.

Error message in gather log.

Traceback (most recent call last):
  File "/usr/bin/ckan", line 45, in <module>
    load_entry_point('PasteScript', 'console_scripts', 'paster')()
  File "/usr/lib/ckan/lib/python2.7/site-packages/paste/script/command.py", line 102, in run
    invoke(command, command_name, options, args[1:])
  File "/usr/lib/ckan/lib/python2.7/site-packages/paste/script/command.py", line 141, in invoke
    exit_code = runner.run(args)
  File "/usr/lib/ckan/lib/python2.7/site-packages/paste/script/command.py", line 236, in run
    result = self.command()
  File "/usr/lib/ckan-new/src/ckanext-harvest/ckanext/harvest/commands/harvester.py", line 235, in command
    utils.gather_consumer()
  File "/usr/lib/ckan-new/src/ckanext-harvest/ckanext/harvest/utils.py", line 336, in gather_consumer
    gather_callback(consumer, method, header, body)
  File "/usr/lib/ckan-new/src/ckanext-harvest/ckanext/harvest/queue.py", line 368, in gather_callback
    harvest_object_ids = gather_stage(harvester, job)
  File "/usr/lib/ckan-new/src/ckanext-harvest/ckanext/harvest/queue.py", line 426, in gather_stage
    harvest_object_ids = harvester.gather_stage(job)
  File "/usr/lib/ckan-new/src/ckanext-datajson/ckanext/datajson/datajson_ckan_28.py", line 229, in gather_stage
    log.info('Check existing dataset: {}'.format(dataset['identifier']))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf1' in position 117: ordinal not in range(128)

@FuhuXia
Member Author

FuhuXia commented Jun 1, 2022

Another scenario caught when harvesting an arcgis source on sandbox. https://soa-dnr.maps.arcgis.com/sharing/search?f=pjson&q=test&num=1&start=0

Error message in gather log.

Traceback (most recent call last):
  File "/usr/bin/ckan", line 45, in <module>
    load_entry_point('PasteScript', 'console_scripts', 'paster')()
  File "/usr/lib/ckan/lib/python2.7/site-packages/paste/script/command.py", line 102, in run
    invoke(command, command_name, options, args[1:])
  File "/usr/lib/ckan/lib/python2.7/site-packages/paste/script/command.py", line 141, in invoke
    exit_code = runner.run(args)
  File "/usr/lib/ckan/lib/python2.7/site-packages/paste/script/command.py", line 236, in run
    result = self.command()
  File "/usr/lib/ckan-new/src/ckanext-harvest/ckanext/harvest/commands/harvester.py", line 247, in command
    utils.fetch_consumer()
  File "/usr/lib/ckan-new/src/ckanext-harvest/ckanext/harvest/utils.py", line 355, in fetch_consumer
    fetch_callback(consumer, method, header, body)
  File "/usr/lib/ckan-new/src/ckanext-harvest/ckanext/harvest/queue.py", line 497, in fetch_callback
    fetch_and_import_stages(harvester, obj)
  File "/usr/lib/ckan-new/src/ckanext-harvest/ckanext/harvest/queue.py", line 515, in fetch_and_import_stages
    success_import = harvester.import_stage(obj)
  File "/usr/lib/ckan-new/src/ckanext-geodatagov/ckanext/geodatagov/harvesters/arcgis.py", line 310, in import_stage
    package_id = get_action('package_create')(context, package_dict)
  File "/usr/lib/ckan-new/src/ckan/ckan/logic/__init__.py", line 498, in wrapped
    result = _action(context, data_dict, **kw)
  File "/usr/lib/ckan-new/src/ckanext-geodatagov/ckanext/geodatagov/logic.py", line 509, in package_create
    data_dict = fix_dataset(data_dict)
  File "/usr/lib/ckan-new/src/ckanext-geodatagov/ckanext/geodatagov/logic.py", line 521, in fix_dataset
    log.info('extra tags found\n\t{}'.format(extra_tags))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 26: ordinal not in range(128)

@hkdctol hkdctol moved this from 📔 Product Backlog to 📟 Sprint Backlog [7] in data.gov team board Oct 13, 2022
@nickumia-reisys
Contributor

@FuhuXia A few questions:

  • Are the errors in question from both gather and fetch? The original issue mentions only gather, but there are some references to fetch as well.
  • Are the errors from a specific type of harvest source? Or is it any type?

My understanding of the harvesting error catching:

  • There is a HarvestGatherError that captures gather errors.
  • There is a HarvestObjectError that captures fetch errors.
  • These are implemented in HarvesterBase.
  • For each special harvester implemented, it is up to the developer to use this functionality so that errors do not cause a hard failure. [Example]
  • For the original error in the issue, it is not using this feature.
  • For the error in the second to last comment above, it is a python error that is also not going to fall under this error catching mechanism.

Overall, what is the desired solution?
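
The capture pattern described in the bullets above can be sketched as a wrapper that turns any exception from a gather step into a recorded error. This is only an illustration: gather and record_error are hypothetical callables standing in for a harvester's gather logic and for whatever helper saves a HarvestGatherError, and the real ckanext-harvest signatures are not assumed here.

```python
import logging

log = logging.getLogger(__name__)


def run_gather_safely(gather, job, record_error):
    """Run gather(job); on any exception, record it and return None.

    `gather` and `record_error` are hypothetical stand-ins, not
    ckanext-harvest APIs.
    """
    try:
        return gather(job)
    except Exception as exc:  # deliberately broad: nothing should escape silently
        message = "Gather stage failed: {}: {}".format(type(exc).__name__, exc)
        log.exception(message)
        record_error(message, job)
        return None
```

The broad except is a design choice for this sketch: the issue is precisely that unexpected exception types (ValidationError, TypeError, UnicodeEncodeError) escaped the narrower handling.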

@nickumia-reisys
Contributor

The UnicodeEncodeError errors are an issue with how the log statement was written. It does not support anything other than ASCII characters and would need to be rewritten in whichever extensions they are appearing in.
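
One way such a log statement could be rewritten is to use a unicode template and pass the value as a lazy logging argument, so no byte-string .format() call has to encode the identifier as ASCII. A minimal sketch that runs on Python 3 and keeps the u"" prefix for Python 2 compatibility; the logger name and log_identifier are hypothetical.

```python
import logging

log = logging.getLogger("harvest.sketch")


def log_identifier(identifier):
    """Log a dataset identifier without forcing an ASCII encode."""
    # Unicode template plus %-style arguments: the logging module does
    # the interpolation, and no byte-string .format() call is involved.
    log.info(u"Check existing dataset: %s", identifier)
```

Calling log_identifier(u"pinale\xf1o") produces the message with the ñ intact.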

@nickumia-reisys
Contributor

nickumia-reisys commented Oct 19, 2022

Errors to be fixed:

@nickumia-reisys
Contributor

@nickumia-reisys nickumia-reisys self-assigned this Oct 19, 2022
@nickumia-reisys nickumia-reisys moved this from 📟 Sprint Backlog [7] to 🏗 In Progress [8] in data.gov team board Oct 19, 2022
@hkdctol
Contributor

hkdctol commented Oct 20, 2022

@FuhuXia will try to find some harvest sources to verify that it's fixed

@FuhuXia
Member Author

FuhuXia commented Nov 3, 2022

The error ckan.lib.search.common.SearchError in the fetch process, as described in #4040, is not captured.

Traceback (most recent call last):
   File "/home/vcap/deps/1/bin/ckan", line 8, in <module>
     sys.exit(ckan())
   File "/home/vcap/deps/1/python/lib/python3.7/site-packages/click/core.py", line 829, in __call__
     return self.main(*args, **kwargs)
   File "/home/vcap/deps/1/python/lib/python3.7/site-packages/click/core.py", line 782, in main
     rv = self.invoke(ctx)
   File "/home/vcap/deps/1/python/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
     return _process_result(sub_ctx.command.invoke(sub_ctx))
   File "/home/vcap/deps/1/python/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
     return _process_result(sub_ctx.command.invoke(sub_ctx))
   File "/home/vcap/deps/1/python/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
     return ctx.invoke(self.callback, **ctx.params)
   File "/home/vcap/deps/1/python/lib/python3.7/site-packages/click/core.py", line 610, in invoke
     return callback(*args, **kwargs)
   File "/home/vcap/deps/1/src/ckanext-harvest/ckanext/harvest/cli.py", line 249, in fetch_consumer
     utils.fetch_consumer()
   File "/home/vcap/deps/1/src/ckanext-harvest/ckanext/harvest/utils.py", line 355, in fetch_consumer
     fetch_callback(consumer, method, header, body)
   File "/home/vcap/deps/1/src/ckanext-harvest/ckanext/harvest/queue.py", line 497, in fetch_callback
     fetch_and_import_stages(harvester, obj)
   File "/home/vcap/deps/1/src/ckanext-harvest/ckanext/harvest/queue.py", line 515, in fetch_and_import_stages
     success_import = harvester.import_stage(obj)
   File "/home/vcap/deps/1/python/lib/python3.7/site-packages/ckanext/datajson/datajson.py", line 462, in import_stage
     parent = self.is_part_of_to_package_id(parent_identifier, harvest_object)
   File "/home/vcap/deps/1/python/lib/python3.7/site-packages/ckanext/datajson/datajson.py", line 400, in is_part_of_to_package_id
     results = ps(self.context(), {"fq": query})
   File "/home/vcap/deps/1/python/lib/python3.7/site-packages/ckan/logic/__init__.py", line 504, in wrapped
     result = _action(context, data_dict, **kw)
   File "/home/vcap/deps/1/python/lib/python3.7/site-packages/ckan/logic/action/get.py", line 1869, in package_search
     query.run(data_dict, permission_labels=labels)
   File "/home/vcap/deps/1/python/lib/python3.7/site-packages/ckan/lib/search/query.py", line 393, in run
     (query, e))
ckan.lib.search.common.SearchError: SOLR returned an error running query: {'fq': ['+capacity:public extras_identifier:https://www.cdc.gov/nchs/data-linkage/index.htm AND extras_collection_metadata:true -collection_package_id:["" TO *] -dataset_type:harvest', '+site_id:"datagov_catalog"', '+state:active'], 'rows': 11, 'df': 'text', 'sort': 'views_recent desc', 'fl': 'id validated_data_dict', 'q': '*:*', 'facet': 'true', 'facet.limit': '50', 'facet.mincount': 1, 'wt': 'json', 'q.op': 'AND'} Error: SolrError('Solr responded with an error (HTTP 400): [Reason: org.apache.solr.search.SyntaxError: Cannot parse \'+capacity:public extras_identifier:https://www.cdc.gov/nchs/data-linkage/index.htm AND extras_collection_metadata:true -collection_package_id:["" TO *] -dataset_type:harvest\': Encountered " ":" ": "" at line 1, column 40.\nWas expecting one of:\n    <EOF> \n    <AND> ...\n    <OR> ...\n    <NOT> ...\n    "+" ...\n    "-" ...\n    <BAREOPER> ...\n    "(" ...\n    "*" ...\n    "^" ...\n    <QUOTED> ...\n    <TERM> ...\n    <FUZZY_SLOP> ...\n    <PREFIXTERM> ...\n    <WILDTERM> ...\n    <REGEXPTERM> ...\n    "[" ...\n    "{" ...\n    <LPARAMS> ...\n    "filter(" ...\n    <NUMBER> ...\n    ]')
Exit status 1
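
The SOLR syntax error above appears to come from the unescaped colon in the URL that follows extras_identifier:. A common way to avoid this is to send the value as a quoted phrase. The sketch below assumes only that backslashes and double quotes need escaping inside a quoted phrase; solr_quote and identifier_filter are hypothetical helpers, not the code in datajson.py.

```python
def solr_quote(value):
    """Wrap value in double quotes, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return '"{}"'.format(escaped)


def identifier_filter(identifier):
    """Build an fq clause matching one identifier as a quoted phrase."""
    return "extras_identifier:{}".format(solr_quote(identifier))


clause = identifier_filter("https://www.cdc.gov/nchs/data-linkage/index.htm")
```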

@FuhuXia
Member Author

FuhuXia commented Nov 3, 2022

The error RemoteDisconnected in the gather process is not captured, as in this harvest job.

Traceback (most recent call last):
   File "/home/vcap/deps/1/bin/ckan", line 8, in <module>
     sys.exit(ckan())
   File "/home/vcap/deps/1/python/lib/python3.7/site-packages/click/core.py", line 829, in __call__
     return self.main(*args, **kwargs)
   File "/home/vcap/deps/1/python/lib/python3.7/site-packages/click/core.py", line 782, in main
     rv = self.invoke(ctx)
   File "/home/vcap/deps/1/python/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
     return _process_result(sub_ctx.command.invoke(sub_ctx))
   File "/home/vcap/deps/1/python/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
     return _process_result(sub_ctx.command.invoke(sub_ctx))
   File "/home/vcap/deps/1/python/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
     return ctx.invoke(self.callback, **ctx.params)
   File "/home/vcap/deps/1/python/lib/python3.7/site-packages/click/core.py", line 610, in invoke
     return callback(*args, **kwargs)
   File "/home/vcap/deps/1/src/ckanext-harvest/ckanext/harvest/cli.py", line 241, in gather_consumer
     utils.gather_consumer()
   File "/home/vcap/deps/1/src/ckanext-harvest/ckanext/harvest/utils.py", line 340, in gather_consumer
     gather_callback(consumer, method, header, body)
   File "/home/vcap/deps/1/src/ckanext-harvest/ckanext/harvest/queue.py", line 374, in gather_callback
     harvest_object_ids = gather_stage(harvester, job)
   File "/home/vcap/deps/1/src/ckanext-harvest/ckanext/harvest/queue.py", line 432, in gather_stage
     harvest_object_ids = harvester.gather_stage(job)
   File "/home/vcap/deps/1/python/lib/python3.7/site-packages/ckanext/datajson/datajson.py", line 116, in gather_stage
     source_datasets, catalog_values = self.load_remote_catalog(harvest_job)
   File "/home/vcap/deps/1/python/lib/python3.7/site-packages/ckanext/datajson/harvester_datajson.py", line 33, in load_remote_catalog
     response = urllib.request.urlopen(req)
   File "/home/vcap/deps/1/python/lib/python3.7/urllib/request.py", line 222, in urlopen
     return opener.open(url, data, timeout)
   File "/home/vcap/deps/1/python/lib/python3.7/site-packages/newrelic/hooks/external_urllib.py", line 41, in _nr_wrapper
     return wrapped(*args, **kwargs)
   File "/home/vcap/deps/1/python/lib/python3.7/urllib/request.py", line 531, in open
     response = meth(req, response)
   File "/home/vcap/deps/1/python/lib/python3.7/urllib/request.py", line 641, in http_response
     'http', request, response, code, msg, hdrs)
   File "/home/vcap/deps/1/python/lib/python3.7/urllib/request.py", line 563, in error
     result = self._call_chain(*args)
   File "/home/vcap/deps/1/python/lib/python3.7/urllib/request.py", line 503, in _call_chain
     result = func(*args)
   File "/home/vcap/deps/1/python/lib/python3.7/urllib/request.py", line 755, in http_error_302
     return self.parent.open(new, timeout=req.timeout)
   File "/home/vcap/deps/1/python/lib/python3.7/site-packages/newrelic/hooks/external_urllib.py", line 41, in _nr_wrapper
     return wrapped(*args, **kwargs)
   File "/home/vcap/deps/1/python/lib/python3.7/urllib/request.py", line 525, in open
     response = self._open(req, data)
   File "/home/vcap/deps/1/python/lib/python3.7/urllib/request.py", line 543, in _open
     '_open', req)
   File "/home/vcap/deps/1/python/lib/python3.7/urllib/request.py", line 503, in _call_chain
     result = func(*args)
   File "/home/vcap/deps/1/python/lib/python3.7/urllib/request.py", line 1393, in https_open
     context=self._context, check_hostname=self._check_hostname)
   File "/home/vcap/deps/1/python/lib/python3.7/urllib/request.py", line 1353, in do_open
     r = h.getresponse()
   File "/home/vcap/deps/1/python/lib/python3.7/site-packages/newrelic/hooks/external_httplib.py", line 77, in httplib_getresponse_wrapper
     return wrapped(*args, **kwargs)
   File "/home/vcap/deps/1/python/lib/python3.7/http/client.py", line 1373, in getresponse
     response.begin()
   File "/home/vcap/deps/1/python/lib/python3.7/http/client.py", line 319, in begin
     version, status, reason = self._read_status()
   File "/home/vcap/deps/1/python/lib/python3.7/http/client.py", line 288, in _read_status
     raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response
Exit status 1
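
A dropped connection like the one above could be converted into a single, reportable exception at the download step. This is a sketch for Python 3, assuming the caller catches the hypothetical CatalogLoadError and records it on the job; load_remote_bytes and its opener parameter are illustrative names, not the load_remote_catalog implementation.

```python
import http.client
import urllib.error
import urllib.request


class CatalogLoadError(Exception):
    """Hypothetical exception: the remote catalog could not be downloaded."""


def load_remote_bytes(url, opener=urllib.request.urlopen, timeout=30):
    """Download url, converting network failures into CatalogLoadError."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.read()
    except (urllib.error.URLError, http.client.HTTPException, OSError) as exc:
        # RemoteDisconnected is an http.client.HTTPException subclass,
        # so it is caught here along with URL and socket errors.
        raise CatalogLoadError("Unable to load {}: {!r}".format(url, exc)) from exc
```

The opener parameter exists only so the function can be exercised without network access.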

@FuhuXia
Member Author

FuhuXia commented Nov 3, 2022

Documenting the steps to locate unhandled harvesting errors in New Relic.

For the gather process, use this query:

space_name:prod "app_name":"catalog-gather" "Exit status 1" -143 -137

For the fetch process, use the same query to get the count for a given timeframe. The count should equal the sum of all known captured errors.

space_name:prod app_name:catalog-fetch "Exit status 1" -137 -143
231

space_name:prod app_name:catalog-fetch "ckan.lib.search.common.SearchError"
98


space_name:prod app_name:catalog-fetch "ckanext.datajson.exceptions.ParentNotHarvestedException"
133
================
231 = 98 + 133
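
The same reconciliation can be written as a small check: any difference between the total "Exit status 1" count and the sum of the known error counts points to an error type that is not yet accounted for. A sketch using the numbers above; unexplained_exits is a hypothetical helper.

```python
def unexplained_exits(total_exits, known_error_counts):
    """Return how many 'Exit status 1' events no known error accounts for."""
    return total_exits - sum(known_error_counts.values())


counts = {
    "ckan.lib.search.common.SearchError": 98,
    "ckanext.datajson.exceptions.ParentNotHarvestedException": 133,
}
remaining = unexplained_exits(231, counts)
```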

Alternatively, in the UI or DB, if a harvest job lasts for 24 hours and is then force-finished by the CKAN timeout setting ckan.harvest.timeout = 1440, we should examine it to see which error it ran into.
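
Jobs that ran up to the timeout can be picked out by comparing each job's duration with the configured limit. A minimal sketch, assuming each job is available as a dict with the hypothetical keys "id", "created" and "finished" (datetime objects); how those rows are read from the UI or DB is not shown.

```python
from datetime import datetime, timedelta


def timed_out_jobs(jobs, timeout_minutes=1440):
    """Return ids of finished jobs whose run time reached the timeout."""
    limit = timedelta(minutes=timeout_minutes)
    return [
        job["id"]
        for job in jobs
        if job["finished"] is not None and job["finished"] - job["created"] >= limit
    ]


# Illustrative data only.
start = datetime(2022, 11, 1, 0, 0)
sample_jobs = [
    {"id": "short", "created": start, "finished": start + timedelta(hours=2)},
    {"id": "stuck", "created": start, "finished": start + timedelta(hours=24, minutes=3)},
    {"id": "running", "created": start, "finished": None},
]
suspects = timed_out_jobs(sample_jobs)
```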

@hkdctol hkdctol moved this from 👀 Needs Review [2] to 🏗 In Progress [8] in data.gov team board Nov 3, 2022
@nickumia-reisys nickumia-reisys moved this from 🏗 In Progress [8] to 📡 Blocked in data.gov team board Nov 4, 2022
nickumia-reisys added a commit to GSA/ckanext-datajson that referenced this issue Nov 11, 2022
nickumia-reisys added a commit to GSA/ckanext-datajson that referenced this issue Nov 11, 2022
@nickumia-reisys nickumia-reisys moved this from 📡 Blocked to 🏗 In Progress [8] in data.gov team board Nov 11, 2022
@nickumia-reisys nickumia-reisys moved this from 🏗 In Progress [8] to ✔ Done in data.gov team board Nov 11, 2022
@nickumia-reisys nickumia-reisys moved this from ✔ Done to 🗄 Closed in data.gov team board Oct 11, 2023
4 participants