dataset lost harvest object when xml file timestamp changed without xml content change #4505

Closed
FuhuXia opened this issue Oct 20, 2023 · 6 comments
Assignees
Labels
bug Software defect or bug component/harvest component/solr-service Related to Solr-as-a-Service, a brokered Solr offering

Comments

@FuhuXia
Member

FuhuXia commented Oct 20, 2023

Changes to a WAF source file's timestamp without any real content change are causing other issues such as #4425. They also make the dataset lose its harvest object on the UI, and they are potentially the biggest contributor to the db-solr-sync workload.

How to reproduce

Modify an XML file's timestamp on a WAF source, then reharvest.
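For a WAF served from a local directory, the timestamp-only change can be simulated roughly as follows. This is a minimal sketch, not part of the original report: the file name `record.xml`, the helper names, and the one-hour offset are assumptions chosen for illustration. It moves the modification time forward with `os.utime` and checks by hash that the bytes are untouched.

```python
import hashlib
import os
import tempfile


def sha256_of(path):
    """Return the SHA-256 hex digest of the file's bytes."""
    with open(path, "rb") as handle:
        return hashlib.sha256(handle.read()).hexdigest()


def bump_mtime(path, seconds=3600):
    """Move the modification time forward without touching the content."""
    stat = os.stat(path)
    new_mtime = stat.st_mtime + seconds
    os.utime(path, (stat.st_atime, new_mtime))
    return new_mtime


def demo():
    """Create a throwaway XML file, bump its mtime, and report what changed."""
    with tempfile.TemporaryDirectory() as waf_dir:
        # Hypothetical file standing in for one record on the WAF.
        path = os.path.join(waf_dir, "record.xml")
        with open(path, "w", encoding="utf-8") as handle:
            handle.write("<metadata><title>Example</title></metadata>")

        digest_before = sha256_of(path)
        mtime_before = os.stat(path).st_mtime
        bump_mtime(path)

        return {
            "content_unchanged": sha256_of(path) == digest_before,
            "mtime_changed": os.stat(path).st_mtime > mtime_before,
        }
```

On a real WAF the same effect would come from whatever process rewrites or re-uploads the file with identical content; the reharvest step itself is not covered by this sketch.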

Expected behavior

No change to the dataset. The UI stays the same, with no additional workload for db-solr-sync.

Actual behavior

See the error in the fetch log

Document with GUID ### unchanged, skipping...

On the UI, the dataset lost its harvest source metadata info.


@FuhuXia FuhuXia added the bug Software defect or bug label Oct 20, 2023
@hkdctol hkdctol moved this to 📔 Product Backlog in data.gov team board Oct 26, 2023
@btylerburton btylerburton added component/harvest component/solr-service Related to Solr-as-a-Service, a brokered Solr offering labels Jan 10, 2024
@gujral-rei gujral-rei moved this from 📔 Product Backlog to 📟 Sprint Backlog [7] in data.gov team board Jan 18, 2024
@hkdctol hkdctol assigned hkdctol and Jin-Sun-tts and unassigned hkdctol Jan 24, 2024
@Jin-Sun-tts Jin-Sun-tts moved this from 📟 Sprint Backlog [7] to 🏗 In Progress [8] in data.gov team board Jan 25, 2024
@Jin-Sun-tts
Contributor

Reproduced this issue in a local test:

Image

The metadata update date was changed to the new timestamp from the file, but the dataset lost its harvest source metadata info.

@Jin-Sun-tts
Contributor

After local testing, I identified the code responsible for the issue: https://github.com/ckan/ckanext-spatial/blob/master/ckanext/spatial/harvesters/base.py#L709C1-L710C70.

In testing, the rebuild-index call above did not refresh Solr from the DB. Calling package_update or package_patch instead of package_index.index_package resolved the issue and displayed the metadata info in the UI.

Will investigate alternative methods for rebuilding the index to address this problem.

@Jin-Sun-tts
Contributor

Tested the most recent version of CKAN with only two extensions, harvest and spatial, and no additional customizations. I was able to replicate the same problem.

@Jin-Sun-tts
Contributor

Additional tests showed that adding model.Session.commit() before invoking package_index.index_package makes the process work as expected.

Furthermore, an examination of the package dictionary transmitted to Solr showed the same contents regardless of whether a commit was performed.

Further investigation is required to determine why package_index.index_package needs a database commit to refresh Solr.

@Jin-Sun-tts
Contributor

Created upstream issue ckan/ckanext-spatial#324.

@Jin-Sun-tts Jin-Sun-tts moved this from 🏗 In Progress [8] to 📡 Blocked in data.gov team board Mar 5, 2024
@FuhuXia FuhuXia moved this from 📡 Blocked to 🏗 In Progress [8] in data.gov team board Mar 27, 2024
@FuhuXia FuhuXia self-assigned this Mar 27, 2024
FuhuXia added a commit to GSA/ckanext-harvest that referenced this issue Apr 11, 2024
FuhuXia added a commit to GSA/ckanext-harvest that referenced this issue Apr 11, 2024
@FuhuXia
Member Author

FuhuXia commented Apr 11, 2024

The root cause of the issue is illustrated in the above PR/commit. It shows that the query with current=True is getting results with current=False. This could be a CKAN core bug in which querying updated records in an uncommitted transaction returns wrong results.
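One way a query for current=True can return a row that has already been marked current=False is when the reader does not see another connection's uncommitted writes. The sketch below illustrates only that general behavior, using the standard-library `sqlite3` module. It is not the CKAN or ckanext-harvest code, the `harvest_object` table here is a simplified hypothetical stand-in, and the thread does not establish that this is the actual mechanism behind the bug.

```python
import os
import sqlite3
import tempfile


def demo_uncommitted_visibility():
    """Return what a second connection sees before and after the writer commits."""
    fd, path = tempfile.mkstemp(suffix=".db")
    os.close(fd)
    writer = sqlite3.connect(path)
    reader = sqlite3.connect(path)
    try:
        # Hypothetical, simplified stand-in for a harvest object table.
        writer.execute(
            "CREATE TABLE harvest_object (id TEXT PRIMARY KEY, current INTEGER)"
        )
        writer.execute("INSERT INTO harvest_object VALUES ('old', 1)")
        writer.commit()

        # Mark the old object as not current and add a new current one,
        # but do not commit yet.
        writer.execute("UPDATE harvest_object SET current = 0 WHERE id = 'old'")
        writer.execute("INSERT INTO harvest_object VALUES ('new', 1)")

        query = "SELECT id FROM harvest_object WHERE current = 1"
        # The reader still sees the last committed state.
        before = [row[0] for row in reader.execute(query).fetchall()]

        writer.commit()
        # After the commit, the reader sees the updated rows.
        after = [row[0] for row in reader.execute(query).fetchall()]
        return before, after
    finally:
        writer.close()
        reader.close()
        os.remove(path)
```

In this sketch the reader gets the old object before the commit and the new one afterwards, which is consistent with the earlier observation that adding model.Session.commit() before reindexing made the harvest info appear. Whether CKAN's session handling behaves this way in the harvester is an open question in the thread.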

@FuhuXia FuhuXia moved this from 🏗 In Progress [8] to 👀 Needs Review [2] in data.gov team board Apr 11, 2024
@FuhuXia FuhuXia moved this from 👀 Needs Review [2] to ✔ Done in data.gov team board Apr 15, 2024
@hkdctol hkdctol moved this from ✔ Done to 🗄 Closed in data.gov team board May 2, 2024
@github-project-automation github-project-automation bot moved this from 🗄 Closed to ✔ Done in data.gov team board Sep 3, 2024
@btylerburton btylerburton reopened this Sep 3, 2024
@github-project-automation github-project-automation bot moved this from ✔ Done to 📟 Sprint Backlog [7] in data.gov team board Sep 3, 2024
@Bagesary Bagesary moved this from ✔ Done to 🗄 Closed in data.gov team board Sep 11, 2024
4 participants