
openvas stuck with 100% CPU in destroy_scan_globals #1271

Open
dhedberg opened this issue Dec 19, 2022 · 7 comments
Labels
bug Something isn't working

Comments

@dhedberg

We're on the latest greenbone/ospd-openvas:stable, dated 2022-12-12T14:39:05.303393489Z, and we're seeing an issue where openvas processes hang around after their tasks have been marked as Done, each taking up 100% of a CPU core.

The container is running in Kubernetes, with a deployment converted (with some small adaptations) from the community container docker-compose file.

Example:

root        1860 60.9  0.1 11574221292 29444 ?   tNL  Dec18 510:51 openvas --scan-start f59ff7de-7f35-4435-affe-bc7ed162b5bd

The task is already marked as finished. Attaching with gdb gives:

(gdb) bt
#0  0x00007f39f099ef1e in resize_set (reallocated_buckets_bitmap=0x7472682d5010, old_size=2660584736, hash_table=0x563b9e954760) at ../../../glib/ghash.c:834
#1  g_hash_table_resize (hash_table=0x563b9e954760) at ../../../glib/ghash.c:888
#2  0x00007f39f09a0d3e in g_hash_table_destroy (hash_table=0x563b9e954760) at ../../../glib/ghash.c:1488
#3  0x00007f39f0aa18c0 in destroy_scan_globals () from /usr/local/lib/libopenvas_misc.so.22
#4  0x0000563b9d621107 in openvas ()
#5  0x00007f39f0591d0a in __libc_start_main (main=0x563b9d61cd50 <main>, argc=3, argv=0x7ffe314163d8, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffe314163c8)
    at ../csu/libc-start.c:308
#6  0x0000563b9d61cd8a in _start ()

It seems to be stuck in:

=> 0x00007f39f099ef00 <+960>:	add    $0x1,%ecx
   0x00007f39f099ef03 <+963>:	add    %ecx,%edx
   0x00007f39f099ef05 <+965>:	and    %r12d,%edx
   0x00007f39f099ef08 <+968>:	mov    %edx,%eax
   0x00007f39f099ef0a <+970>:	mov    %edx,%r13d
   0x00007f39f099ef0d <+973>:	shr    $0x5,%eax
   0x00007f39f099ef10 <+976>:	and    $0x1f,%r13d
   0x00007f39f099ef14 <+980>:	lea    (%rdi,%rax,4),%r11
   0x00007f39f099ef18 <+984>:	mov    (%r11),%eax
   0x00007f39f099ef1b <+987>:	bt     %edx,%eax
   0x00007f39f099ef1e <+990>:	jb     0x7f39f099ef00 <g_hash_table_resize+960>

... which I think corresponds to

          while (get_status_bit (reallocated_buckets_bitmap, hash_val)) \
            {                                                           \
              step++;                                                   \
              hash_val += step;                                         \
              hash_val &= hash_table->mask;                             \
            }                                                           \

in glib/ghash.c

That's as far as I have dug; I'm guessing it might be some sort of memory corruption or concurrency issue?
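
For illustration, here is a minimal standalone model of that probing loop (my own sketch, not the actual glib code). If the bitmap, the hash value, or hash_table->mask were corrupted so that every slot the probe sequence visits reads as set, the loop would never find a free slot and would spin exactly like the process above:

/* Minimal model of the probing loop quoted above -- a sketch for
 * illustration, not the real glib implementation. */
#include <stdint.h>

static int
get_status_bit (const uint32_t *bitmap, uint32_t index)
{
  return (bitmap[index / 32] >> (index % 32)) & 1;
}

static uint32_t
find_free_slot (const uint32_t *reallocated_buckets_bitmap,
                uint32_t hash_val, uint32_t mask)
{
  uint32_t step = 0;

  /* If corruption makes every probed bit read as 1 (for example a bogus mask
   * that keeps hash_val inside a fully-set region of the bitmap), this loop
   * never exits and the process spins at 100% CPU, as in the backtrace. */
  while (get_status_bit (reallocated_buckets_bitmap, hash_val))
    {
      step++;
      hash_val += step;
      hash_val &= mask;
    }
  return hash_val;
}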

dhedberg added the bug label on Dec 19, 2022
@ArnoStiefvater (Member)

Hey @dhedberg ,

Thanks for the bug report and sorry for the late reply.

We tried to replicate the issue, but without success. Our guess is that this may be a bug in glib.

Does this issue still happen or have you found a solution since then? Do you have some minimal reproducer, so we can replicate the issue?

@dhedberg (Author)

dhedberg commented May 9, 2023

Thanks for looking into this.

As far as I can tell from our logs and CPU usage monitoring, the issue still persists. We're working around it for now with a cron job that regularly kills any lingering openvas --scan-start processes matching tasks marked as Done in the API. We also automatically update to the latest stable images (at least) once a week.
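
Roughly, the cleanup job does something like the following (a simplified sketch rather than our exact script; task_is_done() is a hypothetical placeholder for the API lookup that checks whether the corresponding task is marked Done):

/* Simplified sketch of the cleanup job -- not our exact script.
 * task_is_done() is a hypothetical placeholder for the API status lookup. */
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>

static int
task_is_done (const char *scan_id)
{
  (void) scan_id;
  return 0; /* placeholder: would query the API for the task status */
}

int
main (void)
{
  char line[4096];
  /* pgrep -af prints "<pid> <full command line>" for every matching process */
  FILE *p = popen ("pgrep -af 'openvas --scan-start'", "r");

  if (!p)
    return 1;

  while (fgets (line, sizeof line, p))
    {
      long pid;
      char scan_id[64];

      if (sscanf (line, "%ld openvas --scan-start %63s", &pid, scan_id) == 2
          && task_is_done (scan_id))
        kill ((pid_t) pid, SIGKILL);
    }
  pclose (p);

  return 0;
}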

I don't have a minimal reproducer at hand. We don't do much customization: we create a custom port list from the IANA one with one port removed, a custom scan config from "Full and fast" with around 10 OIDs disabled, and a few tasks that scan a number of IP ranges with SSH authentication enabled. Usually only one task is running at a time.

@vedrixNL

vedrixNL commented Sep 11, 2023

Hi all, I seem to have the same issue.

OSPD[7] 2023-09-11 11:04:31,470: INFO: (ospd.command.command) Scan 3b93c961-e15f-460e-859a-15fa12c26368 added to the queue in position 2.
OSPD[7] 2023-09-11 11:04:40,670: INFO: (ospd.ospd) Currently 1 queued scans.
OSPD[7] 2023-09-11 11:04:41,325: INFO: (ospd.ospd) Starting scan 3b93c961-e15f-460e-859a-15fa12c26368.
OSPD[7] 2023-09-11 11:47:09,521: INFO: (ospd.ospd) 3b93c961-e15f-460e-859a-15fa12c26368: Host scan finished.
OSPD[7] 2023-09-11 11:47:09,528: INFO: (ospd.ospd) 3b93c961-e15f-460e-859a-15fa12c26368: Scan finished.

root@ospd-openvas:/ospd-openvas# ps fax
PID TTY STAT TIME COMMAND
33270 pts/0 Ss 0:00 bash
33286 pts/0 R+ 0:00 _ ps fax
1 ? Ss 0:00 /sbin/docker-init -- /usr/bin/tini -- /usr/local/bin/entrypoint ospd-openvas -f --config /etc/gvm/ospd-openvas.conf --mqtt-broker
6 ? S 0:00 /usr/bin/tini -- /usr/local/bin/entrypoint ospd-openvas -f --config /etc/gvm/ospd-openvas.conf --mqtt-broker-address mqtt-broker
7 ? Sl 3:19 _ /usr/bin/python3 /usr/local/bin/ospd-openvas -f --config /etc/gvm/ospd-openvas.conf --mqtt-broker-address mqtt-broker --notus
20 ? Sl 0:09 _ /usr/bin/python3 /usr/local/bin/ospd-openvas -f --config /etc/gvm/ospd-openvas.conf --mqtt-broker-address mqtt-broker --n
1133 ? RNL 3:50 openvas --scan-start 3b93c961-e15f-460e-859a-15fa12c26368

The openvas process seems to stay in memory and never gets cleaned up. Restarting the Docker container helps, but somewhere ospd-openvas is not releasing it.

R = running or runnable (on run queue)
N = low-priority (nice to other users)
L = has pages locked into memory (for real-time and custom IO)

The L flag in particular applies here. Is this a bug in ospd-openvas?
I never experienced this issue on Debian 11.5 with the same specs as below; it only appears on Debian 12.

Debian 12 (bookworm)
8 vCPU
16 GB RAM
Using the Docker solution from OpenVAS
Latest images pulled and installed

Extra information:

Tasks: 169 total, 4 running, 165 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.8 us, 0.2 sy, 70.7 ni, 23.3 id, 0.0 wa, 0.0 hi, 2.5 si, 2.6 st
MiB Mem : 15823.6 total, 249.8 free, 2185.4 used, 13855.3 buff/cache
MiB Swap: 8192.0 total, 8075.0 free, 117.0 used. 13638.3 avail Mem

PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND

1153249 docker 30 10 10.8t 16228 7556 R 100.0 0.1 9,57 openvas --scan-start 57278eb0-2a35-456b-8661-ec049570ea49
1153257 docker 30 10 10.8t 16240 7568 R 99.7 0.1 9,53 openvas --scan-start c0ce0ed3-6856-48d1-9321-86d0593afcd1
1153253 docker 30 10 10.8t 16220 7552 R 99.3 0.1 9,55 openvas --scan-start 3b420acc-947c-4fd3-a4f8-2e0b94465c9c

The scans keep running indefinitely. Running strace on the PID shows no output. The process is not a zombie; somewhere it is simply not getting killed.

@vedrixNL

vedrixNL commented Sep 12, 2023

I've also tried the default docker-compose file on a Debian 12 server. The VM runs on Amazon, by the way, but that shouldn't make a difference: I previously ran Debian 11.5 on Amazon (same architecture, same docker-compose file) and that worked just fine.

@jonathon2nd

I am seeing the same thing with our deployment using our helm chart: https://github.com/fpm-git/Greenbone-Community-Edition-Helm

@co60ca

co60ca commented Nov 25, 2024

We've been investigating this and it appears to be the same issue as the OP's, on a more recent build of OpenVAS.

The relevant backtrace is quoted below.

(gdb) bt
#0  0x00007f39f099ef1e in resize_set (reallocated_buckets_bitmap=0x7472682d5010, old_size=2660584736, hash_table=0x563b9e954760) at ../../../glib/ghash.c:834
#1  g_hash_table_resize (hash_table=0x563b9e954760) at ../../../glib/ghash.c:888
#2  0x00007f39f09a0d3e in g_hash_table_destroy (hash_table=0x563b9e954760) at ../../../glib/ghash.c:1488
#3  0x00007f39f0aa18c0 in destroy_scan_globals () from /usr/local/lib/libopenvas_misc.so.22
#4  0x0000563b9d621107 in openvas ()
#5  0x00007f39f0591d0a in __libc_start_main (main=0x563b9d61cd50 <main>, argc=3, argv=0x7ffe314163d8, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffe314163c8)
    at ../csu/libc-start.c:308
#6  0x0000563b9d61cd8a in _start ()

We seem to see it only when authenticated scans are being used.

@shanmukh-awn

Hi @ArnoStiefvater, would you be able to debug this if given a core dump of the openvas process? We will provide the openvas version so the source can be synced to the correct revision to build debug symbols.
