Restart unhealthy containers #124
base: main
Conversation
@@ -103,6 +103,8 @@ async def extract(self, message: Input, extra: bool) -> Output:
            self.logger.info(f"Built WebsiteData object in {t():5.2f}s.")
        except ClientConnectorError as e:
            raise HTTPException(status_code=502, detail=f"Could not get HAR from splash: {e}")
        except asyncio.exceptions.TimeoutError:
Eventually we want to catch an even broader exception here to avoid further crashes?
Which broader exception do you have in mind? Exception itself?
Yeah, that would probably be the only broader option (asyncio doesn't expose a top-level base class for its internal exceptions).
Are you implementing it now or in another PR?
This PR feels half-done and slightly surprising to me.
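For illustration, a minimal sketch of where such a catch-all could sit, assuming the try/except structure from the diff above; the wrapper function, the guarded call, and the status codes and messages are hypothetical, not this PR's actual code:

```python
import asyncio

from aiohttp import ClientConnectorError
from fastapi import HTTPException


async def extract_guarded(build_website_data) -> None:
    """Hypothetical wrapper mirroring the error handling discussed above."""
    try:
        # Stand-in for the awaited WebsiteData construction in extract().
        await build_website_data()
    except ClientConnectorError as e:
        # Splash could not be reached at all.
        raise HTTPException(status_code=502, detail=f"Could not get HAR from splash: {e}")
    except asyncio.exceptions.TimeoutError:
        # Splash was reached but did not answer in time.
        raise HTTPException(status_code=504, detail="Timed out waiting for splash")
    except Exception as e:
        # The broad catch-all discussed here: asyncio exposes no common base
        # class for its exceptions, so Exception is the only broader option.
        raise HTTPException(status_code=500, detail=f"Unexpected error during extraction: {e}")
```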
docker-compose.yml
Outdated
@@ -1,6 +1,20 @@
version: '3.4'

services:

  # As docker-compose does not provide an automatic mechanism for restarting unhealthy containers
What? Are you sure about this? What about the "restart: always" option mentioned in all containers? I had the impression unhealthy containers were indeed restarted due to that option.
Yeah, I'm sure about it: docker(-compose) only checks the health but does nothing in case of an unhealthy container. In fact, I think the restart option can probably be removed (it should be managed via the labels).
I think the restart: always option is meant for another scenario: I assume it controls the behavior if e.g. the host or the Docker daemon is restarted, or the container really terminates for some unknown reason. However, in our case, we want the container to restart if it becomes unhealthy, which is not exactly the same thing :-/
Will it still restart if the service terminates, e.g., due to some internal error?
I will leave the restart: always in place, as it should cover scenarios where the container terminates (because the entrypoint process has exited for whatever reason). However, the autoheal sidecar will now also gracefully handle the scenario where the container has not terminated, e.g. because the process did not cleanly shut down due to a deadlock, dangling non-daemon processes, etc.
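As a reference, a minimal sketch of the sidecar pattern described in this thread, built on the willfarrell/autoheal image; the application service name, image, port, and health check command are assumptions, not the PR's actual compose file:

```yaml
version: '3.4'

services:
  # Sidecar that watches container health via the host's Docker socket and
  # restarts any container carrying the configured label.
  autoheal:
    image: willfarrell/autoheal
    restart: always
    environment:
      - AUTOHEAL_CONTAINER_LABEL=autoheal
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock

  # Hypothetical application service; name, image, and endpoint are assumed.
  extractor:
    image: metalookup/extractor
    restart: always  # still covers the case where the entrypoint process exits
    labels:
      - autoheal=true  # opt in to restarts by the sidecar
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/_ping"]
      interval: 30s
      timeout: 5s
      retries: 3
```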
metadata-picker.yml
Outdated
@@ -1,6 +1,7 @@
version: '3.3'

services:
  # fixme: Add auto-heal container similar to dev docker-compose.yml
Why are you not doing this in the current branch? Another suggestion: can you use metadata-picker.yml instead of docker-compose.yml to have one file only?
That's why it's not fixed! I wasn't sure why there are two docker-compose files doing more or less the same thing!
Removed this file (together with a bunch of other no-longer-used Docker-related files).
Please double-check with the production instance of MetaLookup - as far as I know that machine uses e.g. metadata-picker.yml and the restart_from_hook.sh.
restart_from_hook.sh
Outdated
@@ -1,4 +0,0 @@
#!/bin/bash
echo "Restarting from hook"
Why did you delete this file?
Because it wasn't used anywhere!
Please double-check with the production instance of MetaLookup - as far as I know that machine uses e.g. metadata-picker.yml and the restart_from_hook.sh.
Force-pushed from 42b1d32 to 3eee73f
Force-pushed from 3eee73f to a193bf4
I will shelve this PR for now, as it is unclear to me how the production deployment would work (traefik etc.). Once we get to deploying the new changes to prod, we can probably use this PR and adapt it to also suit the production needs.
Sounds good. What needs to be done to deploy to production? Do you need insights into traefik? Is something else missing?
Force-pushed from a193bf4 to c119499
Force-pushed from a5eb358 to 8392c3f
- Use a sidecar container (autoheal) that has access to the host's Docker socket to restart unhealthy containers.
- Remove obsolete restart_from_hook files.
Force-pushed from 8392c3f to 9974377
Force-pushed from 4c8c95c to 5d83f67
Force-pushed from 3da3841 to b8869e0
Resolves #123 via autoheal sidecar.
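For anyone verifying the behavior locally, a hypothetical spot-check; Docker itself reports the health state that the autoheal sidecar acts on (the container name below is assumed):

```bash
# List containers Docker currently considers unhealthy.
docker ps --filter health=unhealthy

# Inspect the health state of a single container (name is hypothetical).
docker inspect --format '{{.State.Health.Status}}' metalookup_extractor_1
```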