Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Restart unhealthy containers #124

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

MRuecklCC
Copy link
Contributor

Resolves #123 via autoheal sidecar.

@MRuecklCC MRuecklCC requested a review from RobertMeissner June 23, 2022 11:04
docker-compose.yml Outdated Show resolved Hide resolved
docker-compose.yml Outdated Show resolved Hide resolved
@@ -103,6 +103,8 @@ async def extract(self, message: Input, extra: bool) -> Output:
self.logger.info(f"Built WebsiteData object in {t():5.2f}s.")
except ClientConnectorError as e:
raise HTTPException(status_code=502, detail=f"Could not get HAR from splash: {e}")
except asyncio.exceptions.TimeoutError:
Copy link
Contributor Author

@MRuecklCC MRuecklCC Jun 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

eventually we want to catch an even broader exception here to avoid further crashes?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which broader exception do you have in mind? Exception itself?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that would probably be the only broader option (asyncio doesn't provide its internal top-level Exception base class)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you implementing it now or in another PR?

@MRuecklCC MRuecklCC changed the title 123 restart unhealthy containers Restart unhealthy containers Jun 23, 2022
Copy link
Collaborator

@RobertMeissner RobertMeissner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR feels half-done and slightly surprising to me.

@@ -1,6 +1,20 @@
version: '3.4'

services:

# As docker-compose does not provide an automatic mechanism for restarting unhealthy containers
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What? Are you sure about this? What about this "restart: always" option mentioned in all containers? And I had the impression unhealthy containers were indeed restarted due to that option

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, I'm sure about it, docker(-compose) only checks the health but does nothing in case of an unhealthy container. In fact, I think the restart option probably can be removed (should be managed via the labels)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the restart: always option is meant for another scenario: I assume it controls the behavior if e.g. the host or docker-daemon is restarted, or the container really terminates for some unknown reason. However, in our case, we want the container to restart if it becomes unhealthy which is not exactly the same thing :-/

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will it still restart if the service terminates, e.g., due to some internal error?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will leave the restart: always as it should cover scenarios where the container terminates (because the entrypoint process has terminated for whatever reason).
However, the autoheal sidecar will now also gracefully handle the scenario, where the container has not terminated (e.g. because the process did not cleanly terminate, e.g. because of deadlock, dangling non-daemon processes, etc).

docker-compose.yml Outdated Show resolved Hide resolved
docker-compose.yml Outdated Show resolved Hide resolved
@@ -1,6 +1,7 @@
version: '3.3'

services:
# fixme: Add auto-heal container similar to dev docker-compose.yml
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are you not doing this in the current branch? Other suggestion: can you use metadata-picker instead of docker-compose to have one file only?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thats why it's not fixed! I wasn't sure why there are two docker-compose files doing more or less the same thing!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed this file (together with a bunch of other no longer used docker related stuff)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please double-check with the production instance of MetaLookup - as far as I know that machine uses e.g. metadata-picker yml and the restart_from_hook.sh.

@@ -1,4 +0,0 @@
#!/bin/bash
echo "Restarting from hook"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you delete this file?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Becasue it wasn't used anywhere!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please double-check with the production instance of MetaLookup - as far as I know that machine uses e.g. metadata-picker yml and the restart_from_hook.sh.

@@ -103,6 +103,8 @@ async def extract(self, message: Input, extra: bool) -> Output:
self.logger.info(f"Built WebsiteData object in {t():5.2f}s.")
except ClientConnectorError as e:
raise HTTPException(status_code=502, detail=f"Could not get HAR from splash: {e}")
except asyncio.exceptions.TimeoutError:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which broader exception do you have in mind? Exception itself?

@MRuecklCC MRuecklCC force-pushed the 123-restart-unhealthy-containers branch 3 times, most recently from 42b1d32 to 3eee73f Compare June 29, 2022 14:13
@MRuecklCC MRuecklCC force-pushed the 123-restart-unhealthy-containers branch from 3eee73f to a193bf4 Compare July 18, 2022 10:27
@MRuecklCC
Copy link
Contributor Author

I will shelve this PR for now, as it is unclear to me how the production deployment would work (traefik etc). Once we get to deploying the new changes to prod, we can probably use this PR and adapt it to also suit the production needs.

@RobertMeissner
Copy link
Collaborator

I will shelve this PR for now, as it is unclear to me how the production deployment would work (traefik etc). Once we get to deploying the new changes to prod, we can probably use this PR and adapt it to also suit the production needs.

Sounds good. What needs to be done to deploy to production? Do you need insights into traefik? Is something else missing?

@MRuecklCC MRuecklCC force-pushed the 123-restart-unhealthy-containers branch from a193bf4 to c119499 Compare July 19, 2022 14:48
@MRuecklCC MRuecklCC changed the base branch from dev to main July 19, 2022 15:42
@MRuecklCC
Copy link
Contributor Author

Part of this is now present in #144 . Once #144 is merged, rebase this PR and then we can do another review!

@MRuecklCC MRuecklCC force-pushed the 123-restart-unhealthy-containers branch 3 times, most recently from a5eb358 to 8392c3f Compare July 26, 2022 11:44
- Use a sidecar container (autoheal) that has access to the hosts docker
  socket to restart unhealthy containers.
- Remove obsolete restart_from_hook files
@MRuecklCC MRuecklCC force-pushed the 123-restart-unhealthy-containers branch from 8392c3f to 9974377 Compare August 8, 2022 11:37
@MRuecklCC MRuecklCC force-pushed the main branch 6 times, most recently from 4c8c95c to 5d83f67 Compare September 20, 2022 08:48
@MRuecklCC MRuecklCC force-pushed the main branch 10 times, most recently from 3da3841 to b8869e0 Compare September 21, 2022 08:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Unhealthy containers do not restart automatically
2 participants