Randomly getting stuck at 100% CPU #754

Open
cristianocca opened this issue Mar 20, 2018 · 9 comments

@cristianocca

Hello,

I'm facing an issue with a Docker-deployed Splash: the python3 process randomly gets stuck at 100% CPU, and any request coming in afterwards eventually times out (I'm using a 70s timeout right now).

This only happens occasionally, in a processing pipeline that handles about a hundred pages per hour.

I can't really tell whether this is related to specific pages causing a processing bottleneck or to some bad settings.

The call looks like: GET /render.jpg with params:

  • render_all: 0
  • width: 1024
  • quality: 50
  • timeout: 60 (the actual HTTP request is set to time out at 70)
  • wait: 3
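
For context, the request is roughly equivalent to this sketch (the Splash address and target URL below are just placeholders, not the real setup):

```python
# Minimal sketch of the render.jpg call; "localhost:8050" and the
# target URL are placeholders.
import requests

SPLASH = "http://localhost:8050"

resp = requests.get(
    f"{SPLASH}/render.jpg",
    params={
        "url": "http://example.com",  # placeholder page
        "render_all": 0,
        "width": 1024,
        "quality": 50,
        "wait": 3,
        "timeout": 60,   # Splash-side timeout
    },
    timeout=70,          # HTTP client timeout
)
resp.raise_for_status()
with open("page.jpg", "wb") as f:
    f.write(resp.content)
```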

I don't really care whether a bad page errors out, but it's critical that it doesn't make everything else get stuck. Is there anything that can be done to either improve this or debug the issue?

Thanks.

@landoncope

Could be a similar issue to this: TeamHG-Memex/aquarium#1

@kmike
Member

kmike commented Mar 21, 2018

Hey! In production we do health checks (checking whether the /_ping endpoint is responsive) and restart containers that don't respond; this is done via Mesos, not docker-compose, and unfortunately that setup is not open source. Aquarium does something similar, but it looks like it's less reliable in some cases. A lot of problems were solved in the Splash 3.2 release, where HTML5 video and audio are disabled by default; that change decreased our crash/restart count by up to 10x. What Splash version are you using?
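
The check itself is simple; here's a minimal sketch, assuming Splash listens on localhost:8050 (the restart side depends on your orchestrator):

```python
# Minimal /_ping health check; the base URL is an assumption for a
# locally published Splash port. /_ping answers with HTTP 200 while
# the instance is responsive.
import requests

def splash_is_healthy(base_url="http://localhost:8050", timeout=5):
    try:
        return requests.get(f"{base_url}/_ping", timeout=timeout).status_code == 200
    except requests.RequestException:
        return False
```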

@cristianocca
Author

3.1
Will give 3.2 a try

@cristianocca
Author

The issue is still happening, if not worse. I decided to restart the Docker container only once a day instead of once an hour, to give that a try.
It worked fine for about a day, until it got stuck as usual. However, this time a docker restart command wasn't even enough to unstick it; I had to completely restart the machine.
Docker not being able to restart the container might have just been a one-off, but I'll keep an eye on it now.

@landoncope

landoncope commented Mar 24, 2018

I've had Splash containers do that a few times (maybe once a month), where the only solution was to restart the host. That particular problem is likely a Docker issue; make sure you're running the latest Docker.

My current plan regarding the getting-stuck issue is to test the /_ping endpoint as soon as a container goes rogue. Usually one or two go rogue daily (I'm running around 20 containers at the moment), but it's been a couple of days and they have all been fine. Go figure.

Anyway, if the /_ping endpoint does in fact show that a container is unresponsive, my plan is to write a simple script that checks the endpoint for each container on the host and restarts containers as necessary, run every 15 minutes via cron.
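
Roughly something like this (the container names and host ports below are placeholders, not my actual setup):

```python
#!/usr/bin/env python3
# Watchdog sketch, intended to run from cron every 15 minutes.
# Assumes each Splash container publishes its HTTP port on the host;
# the (name, port) pairs below are placeholders.
import subprocess
import requests

CONTAINERS = {
    "splash1": 8050,
    "splash2": 8051,
}

def responsive(port, timeout=10):
    try:
        return requests.get(f"http://127.0.0.1:{port}/_ping", timeout=timeout).ok
    except requests.RequestException:
        return False

for name, port in CONTAINERS.items():
    if not responsive(port):
        # docker restart is the standard CLI command; swap in whatever
        # your setup uses (compose, swarm service update, etc.).
        subprocess.run(["docker", "restart", name], check=False)
```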

@jhart-r7

jhart-r7 commented Apr 4, 2018

FYI, I encounter this on a regular basis when running Splash inside Docker. Every time I've debugged it, the sockets Splash was using to communicate with the hosts in question were in CLOSE_WAIT and stayed that way. This is on Splash 3.2.
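
If anyone wants to check for the same symptom inside their container, here's a quick sketch using psutil (assuming it's installed; you may need root to see sockets owned by other processes):

```python
# Count TCP sockets stuck in CLOSE_WAIT; requires `pip install psutil`.
import psutil

close_wait = [
    c for c in psutil.net_connections(kind="tcp")
    if c.status == psutil.CONN_CLOSE_WAIT
]
print(f"{len(close_wait)} sockets in CLOSE_WAIT")
for c in close_wait:
    print(c.laddr, "->", c.raddr)
```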

@sloth2012

I'm hitting the same issue!

@exic
Contributor

exic commented Sep 3, 2019

We seem to be running into the same issue, running Splash 3.3.1 in a Docker stack with 5 instances on Docker Swarm.

@davidkong0987

Same issue on 3.5.
