Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CMR-10194: Exception During Collection Draft Publish (Delete) #2204

Merged
merged 11 commits into from
Jan 8, 2025

Conversation

jmaeng72
Copy link
Contributor

@jmaeng72 jmaeng72 commented Jan 3, 2025

Overview

What is the feature/fix?

High rate of errors where the Collection Draft instance has been deleted from Oracle but is still indexed in Elasticsearch.

What is the Solution?

Error was caused by race condition between two queues with same concept-delete msg that indexer picks up and uses to find concept in db and delete from 2 es indices. But if one index instance deletes from the db first, the other will not delete from second es index.
Current solution is to delete from both indexes at same time from 1 queue and ignoring the 'concept not found in db' when read again from 2nd queue.

What areas of the application does this impact?

Delete any draft concept

Checklist

  • I have updated/added unit and int tests that prove my fix is effective or that my feature works
  • New and existing unit and int tests pass locally and remotely
  • clj-kondo has been run locally and all errors corrected
  • I have removed unnecessary/dead code and imports in files I have changed
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have cleaned up integration tests by doing one or more of the following:
    • migrated any are2 tests to are3 in files I have changed
    • de-duped, consolidated, removed dead int tests
    • transformed applicable int tests into unit tests
    • refactored to reduce number of system state resets by updating fixtures (use-fixtures :each (ingest/reset-fixture {})) to be :once instead of :each

@jmaeng72 jmaeng72 requested a review from eereiter January 3, 2025 15:02
@jmaeng72 jmaeng72 self-assigned this Jan 3, 2025
@codecov-commenter
Copy link

codecov-commenter commented Jan 3, 2025

Codecov Report

Attention: Patch coverage is 11.53846% with 23 lines in your changes missing coverage. Please review.

Project coverage is 58.26%. Comparing base (553d1ab) to head (30f44c4).

Files with missing lines Patch % Lines
...xer-app/src/cmr/indexer/services/index_service.clj 11.53% 23 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2204      +/-   ##
==========================================
- Coverage   58.27%   58.26%   -0.02%     
==========================================
  Files        1056     1056              
  Lines       71082    71096      +14     
  Branches     2033     2030       -3     
==========================================
  Hits        41422    41422              
- Misses      27765    27778      +13     
- Partials     1895     1896       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@TylerHeald1
Copy link
Contributor

TylerHeald1 commented Jan 3, 2025

If the solution is to allow the first item off the queue to handle both deletes, then I think you should only push onto one queue from the publish action. The way it is set up currently feels like just repackaging the race condition Not repackaging, accepting that it will happen and having a route of execution that doesn't matter if it runs twice

Copy link
Contributor

@TylerHeald1 TylerHeald1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't believe this addresses the race condition misread a when as an if

@jmaeng72
Copy link
Contributor Author

jmaeng72 commented Jan 6, 2025

If the solution is to allow the first item off the queue to handle both deletes, then I think you should only push onto one queue from the publish action. The way it is set up currently feels like just repackaging the race condition

Even if we publish to one queue from the publish side, we'll still need this separation of funcs with a delete-draft-concept func. The only difference would be that it would no longer need to catch an exception and ignore it.

I can make the publish API write to one queue instead, but that will require a lot of changes to the funcs it calls. Currently, the publish endpoint calls the metadata /concepts API that eventually writes to the two queues.

I can create new funcs within metadata that only writes to one queue instead and then redo all the other funcs that call it. It's about 6 func calls.

I'll need to create new instead of updating the current funcs, because a lot of other flows rely on the current delete concept flow as it is and it needs to remain the same

I originally decided against it because we might do a full re-design of this entire flow anyway. But I am open to changing the publish func to write to one queue and then writing new funcs for the new draft flow.

@jmaeng72
Copy link
Contributor Author

jmaeng72 commented Jan 6, 2025

I don't believe this addresses the race condition

Do you mean that you don't believe this solves the race condition? If so, I agree. I think the only solution is to do a re-design, rather than a hot-fix solution like this

@TylerHeald1
Copy link
Contributor

Discussed offline, turns out I had misread part of the PR

@jmaeng72 jmaeng72 merged commit 657fa29 into master Jan 8, 2025
6 checks passed
@jmaeng72 jmaeng72 deleted the CMR-10194 branch January 8, 2025 15:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants