
fix: local commit detection not thread safe #120

Merged

Conversation

tracemeyers
Contributor

synchronizeWithRemoteRepository did not handle the scenario where other threads may modify the local repository between the merge operation and when hasLocalCommits is set to false. This can cause the local view of the repo to become out of sync with the remote. At best it is resolved when another OSB operation, e.g. a provision request, creates a new commit with the side effect that it signals there are local changes. At worst the commits are lost if the repo storage is ephemeral and is destroyed before the commits are pushed.

@JohannesRudolph
Member

Hey @tracemeyers, thank you so much for your pull request and sorry for leaving this open without a response for so long (I didn't get a notification about it, which is my fault for not setting them up on GitHub).

I appreciate you tackling a change in this intricate part of the code. To help us address the issue, can you provide a test case that fails before and passes after the change? We already have some similar test cases, for example in ConcurrentGitInteractionsTest.

If you also rebase your changes against main (we recently changed from master to main) the GH actions should also run the tests for your changes.

@JohannesRudolph
Member

synchronizeWithRemoteRepository did not handle the scenario where other threads may modify the local repository between the merge operation and when hasLocalCommits is set to false.

On the service broker side, all git interactions should go through GitHandlerService, which is a @Service and so a singleton. I would therefore assume that there is no other thread working on the git repo. Can you clarify what you mean here?

@tracemeyers
Contributor Author

tracemeyers commented Nov 7, 2023

On the service broker side, all git interactions should go through GitHandlerService, which is a @Service and so a singleton. I would therefore assume that there is no other thread working on the git repo. Can you clarify what you mean here?

It's been a while now, but I believe the sequence of events that resulted in a race condition was something like this...

  • Assume there are two threads: A, B
  • Initial state of hasLocalCommits is true. (Update: originally stated as false; it needs to be true so synchronizeWithRemoteRepository will push something.)
  1. A - calls synchronizeWithRemoteRepository which in turn calls fetchMergePush. Let's assume thread A is preempted at an inopportune time, like just after the push near the end.
  2. B - starts running (e.g. due to a provision request), calls commitAllChanges and completes. This has the side effect of setting hasLocalCommits to true. (Update: it was already true.)
  3. A - resumes execution, returns from fetchMergePush with a return value of true, and executes this code which results in hasLocalCommits being switched from true to false.
      val pushedSuccessfully = fetchMergePush()
      if (pushedSuccessfully) {
        hasLocalCommits.set(false)
      }

At this point whenever synchronizeWithRemoteRepository is called it will exit early because hasLocalCommits is false. This results in the pending commit going unnoticed by unipipe until an external event (e.g. a (de)provision API call) results in the hasLocalCommits flag being updated. However, if an external event never happens and the unipipe instance is torn down and its git storage is ephemeral, then the commit is permanently lost.

It is a relatively rare race condition outcome but one I encountered often enough that I had to resolve it to get the occasional (de)provision flowing without manual intervention. Also the risk of data loss is pretty serious.
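
To make the failure mode concrete: the boolean flag can only say "dirty or clean", so thread A's late set(false) erases the fact that thread B committed in between. Below is a minimal sketch of one way to express this without losing information, using a commit generation counter instead of a flag. It is illustrative only, not the change in this PR.

    import java.util.concurrent.atomic.AtomicLong

    // Illustrative sketch: track a monotonically increasing commit generation instead
    // of a boolean, so a late "mark as pushed" can never hide a commit made in between.
    class CommitTracker {
        private val commitGeneration = AtomicLong(0) // bumped by every local commit
        private val pushedGeneration = AtomicLong(0) // highest generation known to be pushed

        fun onLocalCommit() {
            commitGeneration.incrementAndGet()
        }

        fun hasLocalCommits(): Boolean =
            commitGeneration.get() > pushedGeneration.get()

        fun synchronize(fetchMergePush: () -> Boolean) {
            if (!hasLocalCommits()) return
            val generationBeforePush = commitGeneration.get()
            if (fetchMergePush()) {
                // Only acknowledge what we observed before the push. A commit made while
                // we were pushing keeps hasLocalCommits() true, so at worst we push once
                // more than strictly necessary - we never silently drop a pending commit.
                pushedGeneration.updateAndGet { maxOf(it, generationBeforePush) }
            }
        }
    }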

@tracemeyers
Contributor Author

To enable us to tackle the issue, can you provide a test case that fails before and passes after the change?

That's fair. It might take me a while to work on this, but I'm still keen on getting this merged so I don't have to maintain a fork 😬

@tracemeyers force-pushed the fix-local-commit-detection branch from 46d676d to a75dd65 on November 8, 2023
@tracemeyers
Contributor Author

@JohannesRudolph added a test. Had to rework the implementation a bit in order to make it testable.

I have zero experience with kotlin so don't be shy with suggestions to improve things.

@JohannesRudolph
Member

@tracemeyers now I understand what you mean, thanks a lot for the test case!

I can see that scenario happening clearly under one condition: there must be two threads performing (write) operations on the git repo at the same time. HTTP requests can come in from different threads concurrently, so that fits the scenario.

However, we do have GitOperationContextFactory to handle locking the git repo, with the idea explained in a comment at the top. In theory that should solve it, as long as there's no code path that bypasses this check (and e.g. gets a GitHandlerService directly from Spring...).

/**
 * Design: Provides a simple means to synchronize/serialize all git repository interactions.
 * Only one [GitOperationContext] can be active at a single time.
 *
 * Typical git interactions should be fairly short-lived and therefore queuing them should be fine.
 * Of course this limits the theoretical throughput of the system. However a typical unipipe deployment does not have to
 * handle a lot of requests per second so this should not be much of a problem. If it becomes a problem, we can optimize
 * this further at the cost of some additional complexity (e.g. separate read/write paths).
 */
@Service
class GitOperationContextFactory(

This invariant also ensures that even under concurrency we will have one clean git commit per broker operation, which would otherwise be hard to maintain.

Of course I don't want to rule out the possibility that there's a bug in there somewhere! I'll need to take a second look, especially around deprovisioning as you say.
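
For context, the serialization idea from that comment boils down to something like the sketch below. Names and structure are simplified; the real GitOperationContextFactory hands out GitOperationContext objects rather than taking a lambda, but the "one git interaction at a time" idea is the same.

    import java.util.concurrent.locks.ReentrantLock

    // Simplified sketch of the serialization design described in the KDoc above.
    class GitOperationSerializer {
        private val lock = ReentrantLock(true) // fair lock: queue operations in arrival order

        /** Runs [operation] while holding the single repo-wide lock. */
        fun <T> withExclusiveRepoAccess(operation: () -> T): T {
            lock.lock()
            try {
                return operation()
            } finally {
                lock.unlock()
            }
        }
    }

    // Usage: if every read/write path goes through the serializer, the fetch/merge/push
    // sequence can never interleave with a concurrent commitAllChanges().
    // serializer.withExclusiveRepoAccess { commitAllChanges(); synchronizeWithRemoteRepository() }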

@tracemeyers
Contributor Author

The invariant seems reasonable and I can't find any fault with it from a cursory glance at the code base.

Ignoring the invariant, isn't it better to check the actual state of the repo than to check a flag that has the possibility to be inaccurate? Obviously the cost is higher, but I'd rather have assurance of correct behavior without the risk of data loss than save a few ms.

I can only speak from experience where we have frequently encountered an OSB action that is not pushed to the remote and the logs clearly show unipipe does not think there are local changes. The simple workaround is to start another OSB action (e.g. (de)provision) that creates a new commit which ends up pushing both the existing commit for the missing OSB action and the commit for the new OSB action. However, there's still the possibility that if the container is torn down before another OSB action occurs then the missing OSB action is lost.
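
For what it's worth, "check the actual state of the repo" can be fairly cheap. Here is a minimal sketch, assuming JGit and a branch with a configured upstream; it shows the general idea, not necessarily the exact code in this PR.

    import org.eclipse.jgit.lib.BranchTrackingStatus
    import org.eclipse.jgit.lib.Repository

    // Ask the repository itself whether there are commits the remote does not have,
    // instead of trusting an in-memory flag that can go stale under concurrency.
    fun hasUnpushedCommits(repository: Repository, branch: String): Boolean {
        val status = BranchTrackingStatus.of(repository, branch)
            ?: return true // no tracking information - be conservative and attempt a sync
        return status.aheadCount > 0
    }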

I doubt I'll have the time to try to identify the root cause of the above. Feel free to close the MR if my argument isn't convincing enough 😄

@JohannesRudolph
Member

@tracemeyers I've had a better look at this and I think you have a very reasonable argument that your changes will make the implementation more resilient. So I will go ahead and merge in your changes, thanks so much for your contribution!

I've found a few potential issues where the locking code could have been bypassed, though from just reviewing the code I still could not figure out a scenario where that was the case. I will close those loopholes as well.

@JohannesRudolph added this pull request to the merge queue on Jun 21, 2024
Merged via the queue into meshcloud:main with commit e0df96d Jun 21, 2024
2 checks passed