Shinken core services are crashing with no visible health issues on service state #2008

Open
maltesh opened this issue Feb 4, 2021 · 1 comment

maltesh commented Feb 4, 2021

Hardware:

CPU : 24 Core
RAM : 24 GB
Shinken version: 2.0.3
Python version: 2.6.6
OS: CentOS 6.10

Hosts Monitored: 409
Total Services : 14600

About 60% of the service checks are health checks (WMI or WinRM) with a check interval of 5 to 15 minutes.
About 3 to 5% of the service checks are HTTP health checks for RabbitMQ with a check interval of 1 minute and a notification interval of 1 minute.

It's a standalone machine; the setup is not scaled out.
We are running:
a) a poller with min_worker set to 6 and max_worker set to 16
b) a reactionner with min_worker set to 4 and max_worker set to 12

Commonly seen in logs:

Reactionner Log:

  File "/usr/lib/python2.6/site-packages/shinken/action.py", line 125, in execute
    return self.execute__() ## OS specific part
  File "/usr/lib/python2.6/site-packages/shinken/action.py", line 311, in execute__
    preexec_fn=os.setsid)
  File "/usr/lib64/python2.6/subprocess.py", line 642, in __init__
    errread, errwrite)
  File "/usr/lib64/python2.6/subprocess.py", line 1238, in _execute_child
    raise child_exception
TypeError: execve() arg 2 must contain only strings
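
For context on the traceback above: in CPython 2 the second argument to `execve()` is the argument list, so this message usually points at a command element that is not a usable string (for example `None`, or a string with an embedded NUL byte). Below is a minimal sketch of a pre-flight check that could be run on a command's argument list before it is handed to `subprocess`. It is not part of Shinken, and `find_bad_args` is a hypothetical helper name. It is written for Python 3; on Python 2 the type check would need to accept `basestring` instead.

```python
def find_bad_args(argv):
    """Return (index, reason) pairs for arguments execve() would likely reject."""
    problems = []
    for index, arg in enumerate(argv):
        if not isinstance(arg, (str, bytes)):
            # e.g. None or an int where a string was expected
            problems.append((index, "not a string (%s)" % type(arg).__name__))
            continue
        # Pick a NUL marker of the same type as the argument.
        nul = b"\x00" if isinstance(arg, bytes) else "\x00"
        if nul in arg:
            problems.append((index, "contains a NUL byte"))
    return problems
```

Logging the result of such a check next to the notification command may help identify which command triggers the `TypeError`.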

Broker Log:

Error :   Back trace of this error: Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/shinken/daemon.py", line 864, in http_daemon_thread
    self.http_daemon.run()
  File "/usr/lib/python2.6/site-packages/shinken/http_daemon.py", line 283, in run
    self.srv.run()
  File "/usr/lib/python2.6/site-packages/shinken/http_daemon.py", line 123, in run
    raise PortNotFree(msg)
PortNotFree: Error: Sorry, the port 7772 is not free: No socket could be created
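
To check whether the port from the error above can be bound at a given moment, a small probe like the following may help. This is a sketch using only the standard `socket` module; `port_is_free` is a hypothetical helper, not a Shinken function, and it does not reproduce whatever socket options the Shinken HTTP daemon itself sets.

```python
import socket

def port_is_free(port, host="0.0.0.0"):
    """Return True if a TCP socket can be bound to (host, port) right now."""
    probe = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        probe.bind((host, port))
        return True
    except socket.error:
        # Another socket still holds the address, or binding is not permitted.
        return False
    finally:
        probe.close()
```

Calling `port_is_free(7772)` before restarting the daemons shows whether the bind would succeed. Because the probe does not set `SO_REUSEADDR`, it may report a port as busy while old connections are still closing.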

Poller Log:

[1606292549] Error : [Livestatus Query] Error: 'Hosts' object has no attribute 'itersorted'
[1606292744] Error : [broker-master] The external module livestatus goes down unexpectedly!
[1606292744] Error : [broker-master] The external module npcdmod goes down unexpectedly!
[1606292744] Warning : [broker-master] Connection problem to the scheduler scheduler-master: Connexion error to http://localhost:7768/ : couldn't connect to host
[1606292747] Warning : [broker-master] Connection problem to the poller poller-master: Connexion error to http://localhost:7771/ : Operation timed out after 3000

Dmesg:

TCP: too many of orphaned sockets
__ratelimit: 192 callbacks suppressed
TCP: too many of orphaned sockets
TCP: too many of orphaned sockets
TCP: too many of orphaned sockets
TCP: too many of orphaned sockets
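
The orphan count behind this kernel message can be watched over time through `/proc/net/sockstat`, whose `TCP:` line includes an `orphan` field. The following is a minimal sketch, assuming a Linux host; the helper names are hypothetical.

```python
import re

def tcp_orphans(sockstat_text):
    """Extract the orphan count from the 'TCP:' line of /proc/net/sockstat."""
    for line in sockstat_text.splitlines():
        if line.startswith("TCP:"):
            match = re.search(r"\borphan (\d+)", line)
            if match:
                return int(match.group(1))
    return None  # no TCP line or no orphan field found

def current_tcp_orphans(path="/proc/net/sockstat"):
    """Read the live orphan count from the kernel (Linux only)."""
    with open(path) as handle:
        return tcp_orphans(handle.read())
```

Sampling `current_tcp_orphans()` every few seconds and logging it alongside the Shinken daemon logs may show whether the count climbs steadily or jumps right before a crash.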

Netstat:

netstat -anp | grep 7772
The connections on this port are in either FIN_WAIT1 or FIN_WAIT2 state.
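
To quantify this rather than eyeball it, the `netstat -an` output can be summarised per TCP state. This is a sketch that assumes the usual Linux net-tools column layout (proto, recv-q, send-q, local address, foreign address, state); `count_states` is a hypothetical helper. It uses `collections.Counter`, which needs Python 2.7 or later.

```python
from collections import Counter

def count_states(netstat_output, port):
    """Count TCP states for connections whose local or remote port matches."""
    suffix = ":%d" % port
    states = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip headers, non-TCP lines, and anything too short to hold a state.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        if fields[3].endswith(suffix) or fields[4].endswith(suffix):
            states[fields[5]] += 1
    return dict(states)
```

Feeding it the text captured from `netstat -an` with `port=7772` gives a per-state tally that can be compared between a healthy period and a crash.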

Currently we run sysctl -w net.ipv4.tcp_max_orphans=0, then kill and restart all Shinken services to get everything up and running again.
This happens 2 or 3 times a day.


Please help us overcome this problem.
Would upgrading to Shinken 2.4.3 fix the problem? Or would tuning kernel parameters such as net.ipv4.tcp_mem, net.ipv4.tcp_fin_timeout, etc. help further?
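
Before changing any of these kernel parameters, it can help to record their current values so that each change is comparable. The following is a minimal sketch that reads them from `/proc/sys`, where a sysctl name such as `net.ipv4.tcp_fin_timeout` maps to the path `net/ipv4/tcp_fin_timeout`. The helper names are hypothetical.

```python
import os

def read_sysctl(name, root="/proc/sys"):
    """Return the raw value of a sysctl such as 'net.ipv4.tcp_fin_timeout'."""
    path = os.path.join(root, *name.split("."))
    with open(path) as handle:
        return handle.read().strip()

def snapshot(names, root="/proc/sys"):
    """Map each sysctl name to its current value, or None if unreadable."""
    values = {}
    for name in names:
        try:
            values[name] = read_sysctl(name, root)
        except (IOError, OSError):
            values[name] = None
    return values
```

For example, `snapshot(["net.ipv4.tcp_max_orphans", "net.ipv4.tcp_fin_timeout", "net.ipv4.tcp_mem"])` returns the values currently in effect on a Linux host.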

@geektophe
Collaborator

Hello, the issue you're facing is strange. I'm running a Shinken platform with more than 2k hosts and more than 45k services, and I have never had such problems.

You are running a fairly old Shinken release. It would be a good idea to try upgrading anyway. I doubt the latest release will run on Python 2.6, though.
