[META] Filesync fails silently and breaks watches #147
Comments
As an alternative to building unison-restart logic into the container (assuming it's a crash of the server process, not the client, that we need to be concerned with), we could build a smarter healthcheck into our unison container image and set something up to auto-restart the container when it reports unhealthy. This has some answers on how we might do that: https://stackoverflow.com/questions/47088261/restarting-an-unhealthy-docker-container-based-on-healthcheck
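A minimal sketch of that idea, based on the linked Stack Overflow answer. Everything here is illustrative, not rig's actual invocation: the container name is made up, and the probe assumes the image ships `pgrep`. The first command marks the sync container healthy only while a unison process is alive; the second runs autoheal as a sidecar that restarts any container labeled `autoheal=true` once Docker marks it unhealthy.

```sh
# Hypothetical sync container with a liveness probe on the unison process.
docker run -d \
  --name myproject-sync \
  --label autoheal=true \
  --health-cmd 'pgrep unison || exit 1' \
  --health-interval 30s \
  --health-timeout 5s \
  --health-retries 3 \
  outrigger/unison

# autoheal watches Docker events and restarts labeled containers
# that report unhealthy.
docker run -d \
  --name autoheal \
  --restart always \
  -v /var/run/docker.sock:/var/run/docker.sock \
  willfarrell/autoheal
```

In a real setup the healthcheck would likely live in the image's Dockerfile rather than on the `docker run` line, but the flags shown are the equivalent runtime form.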
I'm working on a […]
Collecting some research avenues: […]
This has happened to me and other developers countless times. It is incredibly frustrating and wastes untold hours going down the wrong debugging paths, when the real problem is simply that your code changes are not making it into the container. This is by far the highest-priority issue with rig today.
I've run into this quite frequently when working on client work; in my case it definitely seems tied to my host machine going to sleep and waking up. It is definitely frustrating.
Note: This issue now covers a support request, problem research, "doctor" research, and autoheal research. I will probably split it apart in the next few days. I'm breaking the "doctor" angle out to #163.
I have converted this issue to a METABUG; please re-read the issue summary for details on what we are doing so far and what this issue should continue to be used for.
Further discussion with afflicted users has pointed out that one of the major error cases is resume-from-sleep. Improved handling of sleep/suspend/hibernation may go a long way toward addressing this problem.
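One hedged sketch of what "handling resume-from-sleep" could look like on the host side: a watchdog that notices far more wall-clock time elapsed than its own sleep interval (a telltale sign the machine was suspended) and proactively restarts filesync. The restart command here is assumed to be `rig project sync:start`, per the workaround discussed below; the interval and slack values are arbitrary.

```sh
#!/bin/sh
# Hypothetical resume-from-sleep watchdog (not part of rig).
INTERVAL=30   # seconds between checks
SLACK=30      # extra gap that suggests the host was suspended

last=$(date +%s)
while true; do
  sleep "$INTERVAL"
  now=$(date +%s)
  # If far more time passed than we slept, the host likely suspended.
  if [ $((now - last)) -gt $((INTERVAL + SLACK)) ]; then
    echo "Possible resume from sleep detected; restarting filesync..."
    rig project sync:start
  fi
  last=$now
done
```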
Some of us have gotten in the habit of just assuming it's broken, both when starting dev (for the day or after a break) and whenever something unexpected happens, and running `sync:start` proactively before doing anything else.
@mikecrittenden Does that approach of always running `sync:start` proactively handle it for you, or do you still see it fail in the middle of work?
I don't run into this with sleep, but I do run into it when sleeping plus changing networks, such as going from office to home and back. In my experience doing […]

This is different from unison quitting because there are too many file changes; changing the max_user_watches "might" help there. This often happens when doing something that seems simple, like "mv vendor vendor_old".

I'm not sure I'm in favor of something trying to auto-restart unison processes, since there have been cases where I've shut down unison on purpose. But a tool to detect a problem and notify would be useful.

Education and docs on this are definitely the most useful. Once this has happened to somebody a few times they stop going down hour-long debugging rabbit holes and start checking unison more often, so even making it part of "rig doctor" would be helpful. But it would also help for devs to think more about what is happening when they do stuff like "mv vendor vendor_old" and why they might be doing that.
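For the too-many-file-changes case specifically, raising the inotify watch limit on the Linux side (the docker-machine VM or native host, depending on the setup) is the standard mitigation; the value below is just a commonly used ceiling, not a rig-specific recommendation.

```sh
# Check the current inotify watch ceiling
cat /proc/sys/fs/inotify/max_user_watches

# Raise it for the running kernel
sudo sysctl fs.inotify.max_user_watches=524288

# Persist the change across reboots
echo 'fs.inotify.max_user_watches=524288' | sudo tee -a /etc/sysctl.conf
```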
@febbraro Yeah, that seems to handle it for me. Typically if I see issues now it's because I just forgot to run that command. I don't usually see it crash in the middle of doing something, but I might just be lucky.
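For anyone adopting the "assume it's broken" habit, it can be folded into a small wrapper so starting work always restarts sync first. A hypothetical convenience alias, assuming the `rig project sync:start` command referenced above:

```sh
# Restart filesync before starting dev work for the day.
alias devup='rig project sync:start && echo "filesync restarted; safe to work"'
```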
Problem
@illepic reports that filesync's silent failures leave developers working for indefinite stretches, uncertain whether a troubleshooting fix is failing because the fix is wrong or because filesync isn't carrying their changes into the container.
As a result, developers waste time continuing to troubleshoot after a working fix has already been found, and distrust of the filesync system grows.
This has been seconded by a number of other developers, raising this issue's profile to seemingly the largest source of trouble for Outrigger users. Thank you to everyone who has spoken up about this problem.
Solution
With such a sweeping problem statement, it is impossible to declare a single solution. Instead, we will treat this as a "meta" issue: a bug that will require multiple changes to fully address. The definition of done is that this problem stops being encountered for a reasonable length of time.
Related Issues
Here are the issues identified so far to support this goal: […]
Use of This Issue