Buggy behaviour on rabbitmq-server restart #20

Open
fmonjalet opened this issue Jan 15, 2016 · 8 comments

@fmonjalet

Hi,

The IRMA architecture (particularly the brain) falls into an incoherent state when applying the following scenario:

  • stop rabbitmq-server
  • start rabbitmq-server 10 seconds later (a simple restart does not seem to trigger the bug)

After that, the frontend still accepts scans and the celery.scan service seems to see the probes, but the celery.result service never starts any task (even though its logs say it is correctly connected to rabbitmq).

The end result is that every service is up and happy and the frontend accepts scans, but the scans never finish. Do you have a fix for that?

Do not hesitate to ask if you need more info. Thanks!

Florent

@ch0k0bn

ch0k0bn commented Jan 15, 2016

Did you also restart the celery daemon? Which version of celery are you using?

@fmonjalet

My bad, I forgot to mention that restarting celery.result solves the problem (IRMA works again). Still, a manual restart of celery.result should not be necessary when rabbitmq-server is down for a few seconds (that is the issue I am raising here, sorry for being unclear). This is very reproducible on our setup.

Celery is 3.1.18.
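
(Side note, in case it is relevant: in Celery 3.1 the worker's behaviour when the broker goes away is governed by a handful of settings. A minimal sketch of those settings, not taken from IRMA's configuration and with an illustrative broker URL:)

```python
# Celery 3.1 style settings that govern broker (re)connection.
# Values here are illustrative, not IRMA's shipped configuration.
BROKER_URL = 'amqp://guest:guest@localhost:5672//'  # hypothetical broker URL

# Keep retrying the broker connection if it is lost or unavailable.
BROKER_CONNECTION_RETRY = True
# None means "retry forever" instead of giving up after the default 100 tries.
BROKER_CONNECTION_MAX_RETRIES = None
# Heartbeats help a worker notice a dead connection after a broker restart.
BROKER_HEARTBEAT = 10
```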

@fmonjalet

Hi, any update on the subject? Were you able to reproduce it? We are currently using IRMA 1.2.0.

@ch0k0bn

ch0k0bn commented Feb 3, 2016

Sorry @fmonjalet, I haven't had time to test it yet. To be continued.

@fmonjalet

Hi, any updates? We are still experiencing the bug here:

  • restart rabbitmq
  • the celery workers end up reconnecting
  • the celery scan worker raises BrokenPipe errors when a new request comes in
  • after some lost tasks, it seems to get back on its feet if the downtime wasn't too long

Maybe some persistence configuration in rabbitmq/celery (I don't really know how celery works internally) could help it recover from restarts: https://www.rabbitmq.com/persistence-conf.html. I don't think it would fix the real problem here, though.
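
A rough sketch of what the durability knobs look like on the Celery 3.1 side (a sketch only; the queue and exchange names are made up and this is not IRMA's actual configuration):

```python
# Celery 3.1 style durability settings (illustrative names and values).
from kombu import Exchange, Queue

# Durable queues survive a rabbitmq-server restart; persistent delivery
# (delivery_mode=2) makes the messages themselves survive as well.
CELERY_QUEUES = (
    Queue('brain_results',
          Exchange('brain_results', type='direct', durable=True),
          routing_key='brain_results',
          durable=True),
)
CELERY_DEFAULT_DELIVERY_MODE = 'persistent'

# Retry publishing a task if the broker connection drops mid-publish.
CELERY_TASK_PUBLISH_RETRY = True
CELERY_TASK_PUBLISH_RETRY_POLICY = {
    'max_retries': 5,
    'interval_start': 0,
    'interval_step': 0.5,
    'interval_max': 2,
}
```

Persistent delivery only helps if the queues themselves are durable; otherwise the queue definitions disappear with the broker restart anyway.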

Thanks,

Florent

@ch0k0bn

ch0k0bn commented Mar 1, 2016

Hi Florent,

I tried these two scenarios:

  • launched a 50-file scan and stopped rabbitmq-server for 30s during the scan, then restarted it: after a few reconnection attempts, both brain celery apps reconnected and the scan finished normally.
  • stopped rabbitmq-server for 60s while IRMA was inactive, restarted it, and once the celery daemons had reconnected, tried to launch a scan.
    It raised this exception:
    SIGPIPE: writing to a closed pipe/socket/fd (probably the client disconnected) on request /api/v1.1/scans/8668befd-79e8-4983-b683-28de0f325d0d/launch (ip 172.16.1.1) !!!

Next scan was ok.

We are using AMQP as the result backend and results are kept for 5 minutes (http://docs.celeryproject.org/en/latest/internals/reference/celery.backends.amqp.html#celery.backends.amqp.AMQPBackend.Exchange.delivery_mode).
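
(For reference, in a Celery 3.1 configuration this corresponds roughly to the following; a sketch with illustrative values, not the exact brain configuration:)

```python
# Sketch of the AMQP result backend setup described above (Celery 3.1 names).
CELERY_RESULT_BACKEND = 'amqp'
# Result messages expire after 5 minutes, matching the behaviour mentioned above.
CELERY_TASK_RESULT_EXPIRES = 300
# Results are transient by default with the AMQP backend; setting this to True
# publishes them as persistent messages so they can survive a broker restart.
CELERY_RESULT_PERSISTENT = True
```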

I need more info to help you further.

@fmonjalet

Hi,

Thanks for the quick answer, and sorry for my very slow one. We are still using IRMA v1.2, which may explain the different behaviour. If it's OK with the current release, I will update ASAP and keep you posted.

Just for the record, I conducted some more tests, both with the web interface and the REST API.

When I stop rabbitmq-server for 10 sec before starting a transfer:

  • The first two transfers fail, no matter how long I wait after
    rabbitmq has restarted.
  • The next transfer hangs forever (the web UI says 'running', with 0/0 tasks
    and the progress bar never moves).
  • The next transfer has the right number of tasks, but one job is dropped. We
    have two probes; their logs say they analyzed all files, but one result
    from one probe never appears in the results worker logs.
  • The next one is ok!

As I said, I will upgrade and see how it goes afterwards. Thanks again for
testing on your side!

Florent

@ch0k0bn

ch0k0bn commented Mar 21, 2016

If you upgrade to 1.3.2, you will be able to add "debug = 1" to the [log] section of brain.ini and see if that helps identify this bug.
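
Something like this (a sketch; the rest of brain.ini is left out):

```ini
; brain.ini (1.3.2) -- enable debug logging; other sections and keys omitted
[log]
debug = 1
```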
