
[META] Filesync fails silently and breaks watches #147

Open
grayside opened this issue Mar 2, 2018 · 12 comments

@grayside
Contributor

grayside commented Mar 2, 2018

Problem

@illepic reports that filesync's silent failures leave developers working for an indefinite amount of time, unsure whether a troubleshooting fix is failing because it is wrong or because filesync isn't carrying their changes into the container.

As a result, we waste time continuing to troubleshoot after a working fix has already been found, and distrust of the filesync system is growing.

A number of other developers have seconded this report, raising the profile of this issue to seemingly the largest source of trouble for Outrigger users. Thank you to everyone who has spoken up about this problem.

Solution

With such a sweeping problem statement, it is impossible to declare a single solution. Rather, we will treat this as a "meta" issue: a bug that will require multiple changes to fully address. The definition of done is that this problem stops being encountered for a reasonable length of time.

Related Issues

Here are the issues identified so far in support of this goal:

Use of This Issue

  1. Report specific reproduction steps that cause Unison to crash
  2. Report any steps/upgrades to rig that make your problem go away.
  3. Suggest changes to rig or the documentation here so they can be coordinated with efforts underway.
@grayside
Contributor Author

grayside commented Mar 2, 2018

As an alternative to building restart handling for the unison process inside the container (assuming it is the server crashing, not the client, that we need to be concerned with), we could build a smarter healthcheck into our unison container image and set something up to auto-restart the container when it reports unhealthy. This thread has some answers on how we might do that: https://stackoverflow.com/questions/47088261/restarting-an-unhealthy-docker-container-based-on-healthcheck
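A minimal sketch of that approach, assuming the unison image ships pgrep and that running a companion autoheal container (one of the options suggested in the linked thread) is acceptable; the image and container names below are placeholders:

```sh
# Run the sync container with a basic process-liveness healthcheck.
# Health command and intervals are illustrative, not tuned values.
docker run -d --name myproject-sync \
  --label autoheal=true \
  --health-cmd='pgrep unison || exit 1' \
  --health-interval=30s \
  --health-retries=3 \
  outrigger/unison   # placeholder image name

# Companion watcher that restarts any container labeled autoheal=true
# once Docker marks it unhealthy (willfarrell/autoheal, per the linked thread).
docker run -d --name autoheal \
  -e AUTOHEAL_CONTAINER_LABEL=autoheal \
  -v /var/run/docker.sock:/var/run/docker.sock \
  willfarrell/autoheal
```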

@grayside
Contributor Author

grayside commented Mar 6, 2018

I'm working on a rig project sync:check command to operate as a sort of doctor check of the unison process.
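To make the intent concrete, here is a rough sketch of the kinds of checks such a command could run, assuming the sync container follows a "<project>-sync" naming scheme and syncs into /var/www; none of these names reflect the actual rig implementation:

```sh
#!/bin/sh
# Illustrative doctor-style checks only; not the rig source.
SYNC_CONTAINER="myproject-sync"   # assumed container name
SYNC_TARGET="/var/www"            # assumed sync destination inside the container

# 1. Is the sync container running?
docker inspect --format '{{.State.Status}}' "$SYNC_CONTAINER" | grep -q running \
  || { echo "sync container is not running"; exit 1; }

# 2. Is the unison server process alive inside it? (assumes pgrep exists in the image)
docker exec "$SYNC_CONTAINER" pgrep unison > /dev/null \
  || { echo "unison is not running inside $SYNC_CONTAINER"; exit 1; }

# 3. Round trip: touch a file in the synced project root and confirm it shows up
#    in the container after a sync cycle.
PROBE=".sync-check-$$"
touch "$PROBE"
sleep 5   # allow a sync cycle to run
if docker exec "$SYNC_CONTAINER" test -f "$SYNC_TARGET/$PROBE"; then
  echo "filesync OK"
else
  echo "filesync appears to be broken"
fi
rm -f "$PROBE"
```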

@grayside
Contributor Author

grayside commented Mar 6, 2018

Collecting some research avenues:

  • rig project sync defaults fs.inotify.max_user_watches to 100,000 on the docker-machine. Is this enough? Probably. (A sketch of how to inspect and raise the limit follows this list.)
    • Do we need to increase this number inside the container as well? Maybe.
  • When lots of files change, or a file changes again while it is still being synced, it can cause the high CPU spikes mentioned in the referenced issue. https://github.com/EugenMayer/docker-image-unison/pull/11/files demonstrates using Monit to watch performance and supervisord to restart the unison server process.
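For reference, one way the watch limit could be inspected and raised, assuming a docker-machine named "dev" and a sync container named "myproject-sync" (both names are placeholders); the sysctl is kernel-wide, so the container normally just inherits the VM's value:

```sh
# Check the current limit on the docker-machine VM.
docker-machine ssh dev "cat /proc/sys/fs/inotify/max_user_watches"

# Raise it on the VM (the value is only an example; sysctl -w does not persist across reboots).
docker-machine ssh dev "sudo sysctl -w fs.inotify.max_user_watches=524288"

# Confirm what the sync container sees; it shares the VM kernel, so this should
# match the VM value rather than needing a separate in-container change.
docker exec myproject-sync cat /proc/sys/fs/inotify/max_user_watches
```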

@mkochendorfer

This has happened to me and other developers countless times. It is incredibly frustrating and wastes untold hours going down the wrong debugging paths when the real problem is simply that your code changes are not making it into the container. This is by far the highest-priority issue with rig today.

@srjosh

srjosh commented Mar 8, 2018

I've run into this quite frequently on client work; in my case it seems tied to my host machine going to sleep and waking up. It is definitely frustrating.

@grayside
Contributor Author

grayside commented Mar 8, 2018

Note: this issue now covers support requests, problem research, "doctor" research, and autoheal research. I will probably split it apart in the next few days. I'm breaking the "doctor" angle out to #163.

grayside changed the title from "Filesync fails silently and breaks watches" to "[META] Filesync fails silently and breaks watches" on Mar 12, 2018
@grayside
Contributor Author

I have converted this issue to a METABUG; please re-read the issue summary for details on what we are doing so far and what this issue should continue to be used for.

@grayside
Contributor Author

Further discussion with afflicted users has pointed out that one of the major error cases is resume-from-sleep. Improved handling of sleep/suspend/hibernation may go a long way toward addressing this problem.

@crittermike

Some of us have gotten into the habit of just assuming it's broken, both when starting dev (for the day or after a break) and whenever something unexpected happens, and running sync:start proactively before doing anything else.
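For anyone who wants to bake that habit in, a trivial shell alias along these lines works, assuming the full command is rig project sync:start (the alias name and path are just examples):

```sh
# Always (re)start the sync before touching anything else in the project.
alias devstart='cd ~/Projects/myproject && rig project sync:start'
```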

@febbraro
Member

@mikecrittenden Does that approach of always running sync:start more or less alleviate any of the unison problems?

@potterme

I don't run into this with sleep, but I do run into it when sleeping+changing-networks, such as going from office to home and back. In my experience doing sync:start always fixes it.

This is different from unison quitting because there are too many file changes. Changing max_user_watches "might" help with that. It often happens when doing something that seems simple, like mv vendor vendor_old or rm -rf node_modules. Deletions seem to cause the most issues; when doing a "mv", unison sees it as both a file deletion and a file addition.

I'm not sure I'm in favor of something trying to auto-restart unison processes, since there have been cases where I've shut down unison on purpose. But a tool to detect a problem and notify would be useful.
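In that detect-and-notify spirit, a minimal watcher could be as small as the loop below; the container name is a placeholder and this is only a sketch, not a proposed rig feature:

```sh
# Warn (rather than auto-restart) when the unison process looks dead.
SYNC_CONTAINER="myproject-sync"   # placeholder name
while true; do
  if ! docker exec "$SYNC_CONTAINER" pgrep unison > /dev/null 2>&1; then
    echo "WARNING: unison appears to be down in $SYNC_CONTAINER" >&2
  fi
  sleep 60
done
```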

Education and docs on this are definitely the most useful. Once this has happened to somebody a few times, they stop going down hour-long debugging rabbit holes and start checking unison more often, so making it even part of "rig doctor" would be helpful. It would also help for devs to think more about what is happening when they do things like "mv vendor vendor_old" and why they might be doing that.

@crittermike

@febbraro yeah that seems to handle it for me. Typically if I see issues now it's because I just forgot to run that command. I don't usually see it crash in the middle of doing something, but I might just be lucky.
