[BUG] salt-call hangs after highstate #62550
Closed · OrangeDog opened this issue Aug 26, 2022 · 35 comments
Labels: Bug (broken, incorrect, or confusing behavior) · Core (relates to code central or existential to Salt) · Phosphorus v3005.0 (release code name and version) · Regression (the issue is a bug that breaks functionality known to work in previous releases)

@OrangeDog (Contributor):

Description
salt-call fails to terminate after a rendering error during highstate. Must be manually killed.

Setup

{{ salt["does.not.exist"]() }}

Steps to Reproduce the behavior

# salt-call state.apply
local:
    Data failed to compile:
----------
    Rendering SLS 'base:test' failed: Jinja variable 'salt.utils.templates.AliasedLoader object' has no attribute 'does.not.exist'; line 1

---
{{ salt["does.not.exist"]() }}    <======================
---

Expected behavior
Process should terminate.
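As a stopgap for unattended runs, the hung call can be bounded with coreutils timeout until the root cause is fixed (a workaround sketch; the durations are arbitrary):

timeout --kill-after=60 900 salt-call state.apply   # SIGTERM after 900 s, SIGKILL 60 s later if still alive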

Versions Report
Master is 3004.2, minion is 3005 (classic package).

salt-call --versions
Salt Version:
          Salt: 3005

Dependency Versions:
          cffi: 1.14.5
      cherrypy: Not Installed
      dateutil: 2.7.3
     docker-py: Not Installed
         gitdb: Not Installed
     gitpython: Not Installed
        Jinja2: 2.10.1
       libgit2: 1.5.0
      M2Crypto: 0.31.0
          Mako: Not Installed
       msgpack: 0.6.2
  msgpack-pure: Not Installed
  mysql-python: Not Installed
     pycparser: 2.20
      pycrypto: Not Installed
  pycryptodome: 3.6.1
        pygit2: 1.10.0
        Python: 3.8.10 (default, Jun 22 2022, 20:18:18)
  python-gnupg: 0.4.5
        PyYAML: 5.3.1
         PyZMQ: 18.1.1
         smmap: Not Installed
       timelib: Not Installed
       Tornado: 4.5.3
           ZMQ: 4.3.2

System Versions:
          dist: ubuntu 20.04 focal
        locale: utf-8
       machine: x86_64
       release: 5.4.0-125-generic
        system: Linux
       version: Ubuntu 20.04 focal

Additional context
There's no debug logging after printing the result either. Last messages (after all the rendering) are:

[DEBUG] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'csrf-stag2-vm.eng.cam.ac.uk', 'tcp://129.169.140.30:4506')
[DEBUG] Closing AsyncReqChannel instance
[DEBUG] The functions from module 'highstate' are being loaded by dir() on the loaded module
[DEBUG] LazyLoaded highstate.output

OrangeDog added the Bug (broken, incorrect, or confusing behavior), Regression (the issue is a bug that breaks functionality known to work in previous releases), needs-triage, and Phosphorus v3005.0 (release code name and version) labels Aug 26, 2022
@OrangeDog (Contributor, Author) commented Aug 26, 2022:

I probably should have checked first, but the rendering error has nothing to do with it.
It also hangs after a successful highstate.

It does not hang after e.g. test.ping.

OrangeDog changed the title from "[BUG] salt-call hangs after SLS rendering failure" to "[BUG] salt-call hangs after highstate" Aug 26, 2022
@Deshke commented Aug 29, 2022:

Same here after updating to 3005 and manually calling salt-call state.sls some.state:

Summary for local
--------------
Succeeded: 170 (changed=11)
Failed:      0
--------------
Total states run:     170
Total run time:    69.927 s
 


^C

The state run finishes, but the salt-call process keeps running. Same behavior when executed from the master.

@SGA-max-faxalv commented:

Able to reproduce. After upgrading the minion to 3005, state.highstate hangs after printing the result.
Even with -l all there is not a single log line after the result has been printed; the salt-call process has to be killed.

@d--j (Contributor) commented Aug 29, 2022:

I'm affected, too: Ubuntu 20.04 with Salt minion 3005 (classic).

Switching the salt-minion to the onedir installation fixes this problem. But since onedir has other issues (e.g. I cannot get pip install mysqlclient to work), it is currently not an option for me.

Here is an excerpt of strace salt-call -l all state.highstate for the classic install:

…
Summary for local
--------------
Succeeded: 504 (changed=5)
Failed:      0
--------------
Total states run:     504
Total run time:    82.333 s) = 78377
write(1, "\n", 1
)                       = 1
close(3)                                = 0
rt_sigaction(SIGINT, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f62fb003090}, {sa_handler=0x62b020, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f62fb003090}, 8) = 0
rt_sigaction(SIGTERM, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f62fb003090}, {sa_handler=0x62b020, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f62fb003090}, 8) = 0
munmap(0x7f62fa180000, 262144)          = 0
close(5)                                = 0
close(6)                                = 0
close(25)                               = 0
close(26)                               = 0
close(29)                               = 0
close(30)                               = 0
munmap(0x7f62ce197000, 262144)          = 0
munmap(0x7f62ce157000, 262144)          = 0
munmap(0x7f62ce117000, 262144)          = 0
munmap(0x7f62ce0d7000, 262144)          = 0
munmap(0x7f62ce097000, 262144)          = 0
munmap(0x7f62ce217000, 262144)          = 0
munmap(0x7f62ce257000, 262144)          = 0
munmap(0x7f62ce297000, 262144)          = 0
munmap(0x7f62ce2d7000, 262144)          = 0
munmap(0x7f62ce317000, 262144)          = 0
munmap(0x7f62ce357000, 262144)          = 0
munmap(0x7f62ce397000, 262144)          = 0
munmap(0x7f62cea97000, 262144)          = 0
munmap(0x7f62cea57000, 262144)          = 0
munmap(0x7f62ce9d7000, 262144)          = 0
munmap(0x7f62ce497000, 262144)          = 0
munmap(0x7f62ce457000, 262144)          = 0
munmap(0x7f62ce417000, 262144)          = 0
getpid()                                = 3235856
close(27)                               = 0
close(4)                                = 0
getpid()                                = 3235856
write(8, "\1\0\0\0\0\0\0\0", 8)         = 8
getpid()                                = 3235856
write(8, "\1\0\0\0\0\0\0\0", 8)         = 8
poll([{fd=7, events=POLLIN}], 1, -1)    = 1 ([{fd=7, revents=POLLIN}])
read(7, "\1\0\0\0\0\0\0\0", 8)          = 8
write(10, "\1\0\0\0\0\0\0\0", 8)        = 8
munmap(0x7f62f44b4000, 8392704)         = 0
close(11)                               = 0
close(10)                               = 0
munmap(0x7f62f4cb5000, 8392704)         = 0
close(9)                                = 0
close(8)                                = 0
close(7)                                = 0
close(28)                               = 0
munmap(0x7f62f8453000, 262144)          = 0
close(24)                               = 0
munmap(0x7f62ce1d7000, 262144)          = 0
close(23)                               = 0
munmap(0x7f62cd25b000, 262144)          = 0
munmap(0x7f62cd19b000, 262144)          = 0
munmap(0x7f62cce5b000, 262144)          = 0
munmap(0x7f62cd29b000, 262144)          = 0
munmap(0x7f62cce9b000, 262144)          = 0
munmap(0x7f62ceb97000, 262144)          = 0
getpid()                                = 3235856
write(39, "\1\0\0\0\0\0\0\0", 8)        = 8
poll([{fd=31, events=POLLIN}], 1, -1
^Cstrace: Process 3235856 detached
 <detached ...>

Here is an excerpt of strace salt-call -l all state.highstate for the onedir install:

…
Summary for local
--------------
Succeeded: 258 (changed=1)
Failed:      0
--------------
Total states run:     258
Total run time:    28.174 s) = 36375
write(1, "\n", 1
)                       = 1
getpid()                                = 736097
write(2, "[DEBUG   ] Restoring sys.argv to"..., 98[DEBUG   ] Restoring sys.argv to: ['/opt/saltstack/salt/run/run', '-l', 'all', 'state.highstate']
) = 98
close(4)                                = 0
rt_sigaction(SIGINT, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f1890e8a420}, {sa_handler=0x7f1890b15220, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f1890e8a420}, 8) = 0
rt_sigaction(SIGTERM, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f1890e8a420}, {sa_handler=0x7f1890b15220, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f1890e8a420}, 8) = 0
close(15)                               = 0
close(6)                                = 0
close(7)                                = 0
close(28)                               = 0
close(29)                               = 0
close(27)                               = 0
close(31)                               = 0
munmap(0x7f188a0b6000, 2119584)         = 0
munmap(0x7f1889eb4000, 2104392)         = 0
close(5)                                = 0
munmap(0x7f188a8c1000, 2099624)         = 0
munmap(0x7f188b710000, 2100408)         = 0
munmap(0x7f188b30b000, 2107624)         = 0
munmap(0x7f188c198000, 2099832)         = 0
munmap(0x7f188bf96000, 2101848)         = 0
munmap(0x7f188bd54000, 2101864)         = 0
munmap(0x7f188bb53000, 2100672)         = 0
munmap(0x7f188b911000, 2102048)         = 0
munmap(0x7f188a2bc000, 2104288)         = 0
munmap(0x7f1887c2c000, 2137704)         = 0
munmap(0x7f1887a22000, 2137992)         = 0
munmap(0x7f1889df0000, 537024)          = 0
munmap(0x7f188b50e000, 2102896)         = 0
munmap(0x7f188b0c7000, 2110528)         = 0
munmap(0x7f188aec5000, 2104520)         = 0
munmap(0x7f188acc3000, 2101840)         = 0
munmap(0x7f188aac2000, 2100856)         = 0
munmap(0x7f1888b09000, 262144)          = 0
munmap(0x7f1888ac9000, 262144)          = 0
close(26)                               = 0
munmap(0x7f188a4be000, 2101168)         = 0
munmap(0x7f188a6bf000, 2101840)         = 0
munmap(0x7f1889371000, 262144)          = 0
close(3)                                = 0
munmap(0x7f186cac9000, 262144)          = 0
munmap(0x7f186ca89000, 262144)          = 0
munmap(0x7f186ce49000, 262144)          = 0
munmap(0x7f186d049000, 262144)          = 0
munmap(0x7f186cf89000, 262144)          = 0
munmap(0x7f186cec9000, 262144)          = 0
munmap(0x7f186cfc9000, 262144)          = 0
munmap(0x7f186d009000, 262144)          = 0
munmap(0x7f186cdc9000, 262144)          = 0
munmap(0x7f186ce09000, 262144)          = 0
munmap(0x7f186d209000, 262144)          = 0
munmap(0x7f186cd89000, 262144)          = 0
munmap(0x7f186c949000, 262144)          = 0
munmap(0x7f186c9c9000, 262144)          = 0
munmap(0x7f186cb49000, 262144)          = 0
munmap(0x7f186ca09000, 262144)          = 0
munmap(0x7f186c8c9000, 262144)          = 0
munmap(0x7f186cc09000, 262144)          = 0
exit_group(0)                           = ?
+++ exited with 0 +++

To my untrained eye, it looks like an event loop is not terminated properly.
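For illustration of that hypothesis, this is the generic Python pattern that matches the strace above: the main thread finishes its work and closes descriptors, but a non-daemon thread still blocked in poll() keeps the interpreter from ever reaching exit_group(). A minimal sketch of the mechanism, not Salt's actual code:

import threading
import time

def io_loop():
    # stand-in for a ZMQ/Tornado event loop that was never told to stop
    while True:
        time.sleep(1)

threading.Thread(target=io_loop).start()  # daemon=False is the default
print("Summary printed, work done")
# main() returns here, but CPython joins non-daemon threads at shutdown,
# so the process hangs, much like the poll([...], 1, -1) seen above.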

`salt-call --versions` for classic:
Salt Version:
          Salt: 3005
 
Dependency Versions:
          cffi: Not Installed
      cherrypy: Not Installed
      dateutil: 2.7.3
     docker-py: Not Installed
         gitdb: Not Installed
     gitpython: Not Installed
        Jinja2: 2.10.1
       libgit2: Not Installed
      M2Crypto: Not Installed
          Mako: Not Installed
       msgpack: 0.6.2
  msgpack-pure: Not Installed
  mysql-python: 1.4.4
     pycparser: Not Installed
      pycrypto: 2.6.1
  pycryptodome: 3.6.1
        pygit2: Not Installed
        Python: 3.8.10 (default, Jun 22 2022, 20:18:18)
  python-gnupg: 0.4.5
        PyYAML: 5.3.1
         PyZMQ: 18.1.1
         smmap: Not Installed
       timelib: Not Installed
       Tornado: 4.5.3
           ZMQ: 4.3.2
 
System Versions:
          dist: ubuntu 20.04 focal
        locale: utf-8
       machine: x86_64
       release: 5.4.0-125-generic
        system: Linux
       version: Ubuntu 20.04 focal
 
`salt-call --versions` for onedir:
Salt Version:
          Salt: 3005
 
Dependency Versions:
          cffi: 1.14.6
      cherrypy: 18.6.1
      dateutil: 2.8.1
     docker-py: Not Installed
         gitdb: Not Installed
     gitpython: Not Installed
        Jinja2: 3.1.0
       libgit2: Not Installed
      M2Crypto: Not Installed
          Mako: Not Installed
       msgpack: 1.0.2
  msgpack-pure: Not Installed
  mysql-python: Not Installed
     pycparser: 2.21
      pycrypto: Not Installed
  pycryptodome: 3.9.8
        pygit2: Not Installed
        Python: 3.9.13 (main, Aug 23 2022, 18:31:04)
  python-gnupg: 0.4.8
        PyYAML: 5.4.1
         PyZMQ: 23.2.0
         smmap: Not Installed
       timelib: 0.2.4
       Tornado: 4.5.3
           ZMQ: 4.3.4
 
System Versions:
          dist: ubuntu 20.04 focal
        locale: utf-8
       machine: x86_64
       release: 5.4.0-125-generic
        system: Linux
       version: Ubuntu 20.04 focal
 

@Ch3LL (Contributor) commented Aug 29, 2022:

@d--j for the onedir issue, did you try running salt-pip install <package name>?

Since you stated it works on onedir, this makes me think it is a dependency issue; there are differences in the dependency versions.

Here's the diff between the classic and onedir packages:

(heist-3.8-1)  ✘ ch3ll@megan-precision5550  ~/git/salt   fix_pkg ±  diff -u /tmp/classic /tmp/onedir
--- /tmp/classic        2022-08-29 09:58:18.101369271 -0600
+++ /tmp/onedir 2022-08-29 09:58:31.714589389 -0600
@@ -2,31 +2,31 @@
           Salt: 3005
  
 Dependency Versions:
-          cffi: Not Installed
-      cherrypy: Not Installed
-      dateutil: 2.7.3
+          cffi: 1.14.6
+      cherrypy: 18.6.1
+      dateutil: 2.8.1
      docker-py: Not Installed
          gitdb: Not Installed
      gitpython: Not Installed
-        Jinja2: 2.10.1
+        Jinja2: 3.1.0
        libgit2: Not Installed
       M2Crypto: Not Installed
           Mako: Not Installed
-       msgpack: 0.6.2
+       msgpack: 1.0.2
   msgpack-pure: Not Installed
-  mysql-python: 1.4.4
-     pycparser: Not Installed
-      pycrypto: 2.6.1
-  pycryptodome: 3.6.1
+  mysql-python: Not Installed
+     pycparser: 2.21
+      pycrypto: Not Installed
+  pycryptodome: 3.9.8
         pygit2: Not Installed
-        Python: 3.8.10 (default, Jun 22 2022, 20:18:18)
-  python-gnupg: 0.4.5
-        PyYAML: 5.3.1
-         PyZMQ: 18.1.1
+        Python: 3.9.13 (main, Aug 23 2022, 18:31:04)
+  python-gnupg: 0.4.8
+        PyYAML: 5.4.1
+         PyZMQ: 23.2.0
          smmap: Not Installed
-       timelib: Not Installed
+       timelib: 0.2.4
        Tornado: 4.5.3
-           ZMQ: 4.3.2
+           ZMQ: 4.3.4
  
 System Versions:
           dist: ubuntu 20.04 focal

If anyone is willing to test each dependency difference by installing the newer versions on the classic packages and checking whether the hang goes away, that would be great. Otherwise, whoever is assigned to this issue can try that.

@d--j (Contributor) commented Aug 29, 2022:

@d--j for the onedir issue did you try running salt-pip install

Yes, I tried:

/opt/saltstack/salt/run/run pip install mysqlclient

=> Python.h not found

and

/opt/saltstack/salt/run/run pip install --global-option=build_ext --global-option="-I/opt/saltstack/salt/run/include/python/" mysqlclient

=> ld not found?!

Packages without native extensions (e.g. IPy) can be installed this way, but at least for the MySQL package (which I need on nearly all minions) it does not work.

Some modules also expect sys.executable to be a Python interpreter rather than the wrapper /opt/saltstack/salt/run/run, and fail because of that (e.g. cmd.run with the runas parameter).

These problems are probably unrelated to this issue; I will open bug reports for them when I get the time.

And since you stated it works on onedir, it makes me think this is a dependency issue.

Yes, it might very well be. Or maybe use of a specific state module causes this: since in my case the state run on onedir had some errors, that state module might simply not have been executed.

If anyone is willing to test out each dependency difference and install the newer versions on the classical packages and see if it works that would be great.

I will try to do that.

@Ch3LL (Contributor) commented Aug 29, 2022:

Thanks. Yes, if you could open a separate issue, I would appreciate it.

@Deshke commented Aug 29, 2022:

Still strange that it hangs after apt update; apt install salt-minion with salt-call from https://repo.saltproject.io/py3/ubuntu/20.04/amd64/latest focal main.

@d--j (Contributor) commented Aug 29, 2022:

It looks like it is indeed a dependency problem (python3-zmq 18.1.1 from Ubuntu focal is the culprit).

This works for me:

apt install python3-pip
pip3 install pyzmq==23.2.0 
Resulting salt-call --versions:
Salt Version:
          Salt: 3005
 
Dependency Versions:
          cffi: Not Installed
      cherrypy: Not Installed
      dateutil: 2.7.3
     docker-py: 4.1.0
         gitdb: Not Installed
     gitpython: Not Installed
        Jinja2: 2.10.1
       libgit2: Not Installed
      M2Crypto: Not Installed
          Mako: Not Installed
       msgpack: 0.6.2
  msgpack-pure: Not Installed
  mysql-python: 1.4.4
     pycparser: Not Installed
      pycrypto: 2.6.1
  pycryptodome: 3.6.1
        pygit2: Not Installed
        Python: 3.8.10 (default, Jun 22 2022, 20:18:18)
  python-gnupg: 0.4.5
        PyYAML: 5.3.1
         PyZMQ: 23.2.0
         smmap: Not Installed
       timelib: Not Installed
       Tornado: 4.5.3
           ZMQ: 4.3.4
 
System Versions:
          dist: ubuntu 20.04 focal
        locale: utf-8
       machine: x86_64
       release: 5.4.0-125-generic
        system: Linux
       version: Ubuntu 20.04 focal
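To verify that Salt's interpreter now imports the pip wheel rather than the distro python3-zmq, a check along these lines should work (a suggested verification, not part of the original comment):

python3 -c "import zmq; print(zmq.pyzmq_version(), zmq.__file__)"
# expect 23.2.0 and a path under /usr/local/lib/python3.8/dist-packages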

OrangeDog added the Core (relates to code central or existential to Salt) label Aug 30, 2022
Ch3LL added this to the Phosphorus v3005.0 milestone Aug 30, 2022
@dmurphy18 (Contributor) commented:

Wondering if something else is the problem here: I have tried with the shipped pyzmq 18.1.1 and with a pip3-installed pyzmq 23.2.0, and both produce the same error:

[DEBUG   ] The functions from module 'highstate' are being loaded by dir() on the loaded module
[DEBUG   ] LazyLoaded highstate.output
[TRACE   ] data = {'local': ['Pillar failed to render with the following messages:', "Rendering SLS 'base' failed. Please see master log for details."]}
local:
    Data failed to compile:
----------
    Pillar failed to render with the following messages:
----------
    Rendering SLS 'base' failed. Please see master log for details.
root@david-VirtualBox:/home/david# 

For the state file:

root@david-VirtualBox:/home/david# cat /srv/salt/test.sls 

mytest:
  file.directory:
    - name: /home/david/test
    - user: david
    - group: david
    - dir_mode: 755
root@david-VirtualBox:/home/david# 

However, when I place the test.sls in /srv/salt on the salt-master, it all works, even with pyzmq==18.1.1, and I don't see any hang.
My salt-master is 3005; I will try with 3004.2.

@d--j (Contributor) commented Aug 30, 2022:

@dmurphy18 the error message says that your pillar data failed to compile. What does your pillar/top.sls look like, and what about the pillar/base.sls? Have a look at the master's log file to see the reason why the pillar failed.

For me, after upgrading the master to 3005, this code in pillar/top.sls failed:

{% set reverse_id_dot = grains.id.split('.') | reverse() | join('.') %}
{% set reverse_id_slash = grains.id.split('.') | reverse() | join('/') %}
{% if salt['pillar.file_exists']('minions/'+reverse_id_slash+'.sls') %}
  {{ grains.id }}:
    - minions.{{ reverse_id_dot }}
{% endif %}

(could not find pillar.file_exists)

But after a reboot of the master the unchanged code started working again.

Anyway, this looks like an unrelated problem; if it persists you should probably create a new issue for it.

@dmurphy18 (Contributor) commented:

@OrangeDog I'm not able to reproduce this with a CentOS 7 salt-master running 3005 or 3004.2.
@d--j I am puzzled that switching to onedir made everything work for you, given that the pillar setup etc. should have been the same.

@dmurphy18 (Contributor) commented:

@d--j My pillar top.sls is cleared out; the commented-out stuff is left over from old proxy testing.

root@Unknown:/srv/salt# cat /srv/pillar/top.sls 

## base:
##     '*':
##     - base
##   'p7000':
##     - p7000
##   'p8000':
##     - p8000
##   'p9000':
##     - p9000
root@Unknown:/srv/salt# 

dmurphy18 added the cannot-reproduce (cannot be replicated with info/context provided) label Aug 30, 2022
@d--j (Contributor) commented Aug 30, 2022:

@dmurphy18 the minion needs to run 3005; I do not think the master's version matters for this issue. With the minion on 3005, execute salt-call state.highstate on the minion.

@dmurphy18 (Contributor) commented Aug 30, 2022:

@d--j The minion is 3005 classic

local:
    Salt Version:
              Salt: 3005
     
    Dependency Versions:
              cffi: Not Installed
          cherrypy: Not Installed
          dateutil: 2.7.3
         docker-py: Not Installed
             gitdb: Not Installed
         gitpython: Not Installed
            Jinja2: 2.10.1
           libgit2: Not Installed
          M2Crypto: Not Installed
              Mako: 1.1.0
           msgpack: 1.0.4
      msgpack-pure: Not Installed
      mysql-python: Not Installed
         pycparser: Not Installed
          pycrypto: Not Installed
      pycryptodome: 3.6.1
            pygit2: Not Installed
            Python: 3.8.10 (default, Jun 22 2022, 20:18:18)
      python-gnupg: 0.4.5
            PyYAML: 5.3.1
             PyZMQ: 18.1.1
             smmap: Not Installed
           timelib: Not Installed
           Tornado: 4.5.3
               ZMQ: 4.3.2
     
    System Versions:
              dist: ubuntu 20.04 focal
            locale: utf-8
           machine: x86_64
           release: 5.15.0-46-generic
            system: Linux
           version: Ubuntu 20.04 focal

state.apply without params results in a highstate, but I will call state.highstate separately :)

@dmurphy18 (Contributor) commented:

Successful:

[TRACE   ] data = {'local': {'file_|-test_dgm_|-/home/david/test_|-directory': {'name': '/home/david/test', 'changes': {}, 'result': True, 'comment': 'The directory /home/david/test is in the correct state', '__sls__': 'test_dgm', '__run_num__': 0, 'start_time': '16:50:53.059235', 'duration': '11.084 ms', '__id__': 'test_dgm'}}}
local:
----------
          ID: test_dgm
    Function: file.directory
        Name: /home/david/test
      Result: True
     Comment: The directory /home/david/test is in the correct state
     Started: 16:50:53.059235
    Duration: 11.084 ms
     Changes:   

Summary for local
------------
Succeeded: 1
Failed:    0
------------
Total states run:     1
Total run time:  11.084 ms
root@david-VirtualBox:/home/david# 

@dmurphy18 (Contributor) commented:

I think there must be something else going on for the hang to occur.
@d--j @OrangeDog Is there any other information you could supply that might help in duplicating the issue?
I am using pretty clean, basic VMs in VirtualBox that are up to date.

@d--j (Contributor) commented Aug 30, 2022:

@dmurphy18 Hm, then it probably matters which states are in the highstate/apply, how many there are (500+ in my case), or how long the state.apply takes (1+ minute in my case). My minions are running Ubuntu 20.04 (up to date) on various hypervisors (VMware/KVM).

@dmurphy18 (Contributor) commented:

@d--j So this might be a scale issue. I am using a simple master-and-minion setup and don't have the resources to set up 500+ machines for this.
Can you try to reproduce the issue with a smaller number of resources, or even a simple master and minion? There was an issue with pyzmq before release, but that was corrected with the shipped pyzmq v18.1.1.

If there is a logic problem, it will usually show up even with a single master and minion.
With trace logging disabled:

root@david-VirtualBox:/home/david# time salt-call state.highstate
local:
----------
          ID: test_dgm
    Function: file.directory
        Name: /home/david/test
      Result: True
     Comment: The directory /home/david/test is in the correct state
     Started: 17:18:07.203682
    Duration: 7.486 ms
     Changes:   

Summary for local
------------
Succeeded: 1
Failed:    0
------------
Total states run:     1
Total run time:   7.486 ms

real	0m3.900s
user	0m1.803s
sys	0m0.250s
root@david-VirtualBox:/home/david# 

@d--j (Contributor) commented Aug 31, 2022:

This is a minimal sls file that produces the issue on my test minion:

system/package-manager/init.sls:

apt:
  pkg.installed:
  - pkgs:
    - apt
    - apt-transport-https
    - unattended-upgrades
/tmp/sources.list: # originally /etc/apt/sources.list of course but this should not matter
  file.managed:
  - source: salt://system/package-manager/apt/sources.list
  - template: jinja

with system/package-manager/apt/sources.list being:

{% set oscodename = grains.get('lsb_codename') or grains.get('oscodename') %}
#
# {{ grains['os'] }} {{ oscodename }} - Generated by salt
# Modification will be overwritten - add other sources to /etc/apt/sources.list.d
#

{% if grains['os'] == 'Ubuntu' %}
deb {{ pillar.get('ubuntu_mirror','http://fr.archive.ubuntu.com/ubuntu') }} {{ oscodename }} main restricted universe multiverse
deb-src {{ pillar.get('ubuntu_mirror','http://fr.archive.ubuntu.com/ubuntu') }} {{ oscodename }} main restricted universe multiverse

deb {{ pillar.get('ubuntu_mirror','http://fr.archive.ubuntu.com/ubuntu') }} {{ oscodename }}-updates main restricted universe multiverse
deb-src {{ pillar.get('ubuntu_mirror','http://fr.archive.ubuntu.com/ubuntu') }} {{ oscodename }}-updates main restricted universe multiverse

deb {{ pillar.get('ubuntu_mirror','http://fr.archive.ubuntu.com/ubuntu') }} {{ oscodename }}-backports main restricted universe multiverse
deb-src {{ pillar.get('ubuntu_mirror','http://fr.archive.ubuntu.com/ubuntu') }} {{ oscodename }}-backports main restricted universe multiverse

#
# Security updates
#
deb http://security.ubuntu.com/ubuntu {{ oscodename }}-security main restricted universe multiverse
deb-src http://security.ubuntu.com/ubuntu {{ oscodename }}-security main restricted universe multiverse

{% elif grains['os'] == 'Debian' %}

{% if oscodename == 'jessie' %}
deb http://deb.debian.org/debian/ jessie main contrib non-free
deb-src http://deb.debian.org/debian/ jessie main contrib non-free

deb http://security.debian.org/ jessie/updates main contrib non-free
deb-src http://security.debian.org/ jessie/updates main contrib non-free
{% else %}

deb {{ pillar.get('debian_mirror','http://ftp.fr.debian.org/debian') }} {{ oscodename }} main contrib non-free
deb-src {{ pillar.get('debian_mirror','http://ftp.fr.debian.org/debian') }} {{ oscodename }} main contrib non-free

deb http://security.debian.org/ {{ oscodename }}/updates main contrib non-free
deb-src http://security.debian.org/ {{ oscodename }}/updates main contrib non-free

deb {{ pillar.get('debian_mirror','http://ftp.fr.debian.org/debian') }} {{ oscodename }}-updates main contrib
deb-src {{ pillar.get('debian_mirror','http://ftp.fr.debian.org/debian') }} {{ oscodename }}-updates main contrib
{% endif %}

{% if pillar.get('is_proxmox', False) %}
# PVE packages provided by proxmox.com
deb http://download.proxmox.com/debian {{ oscodename }} pve
{% endif %}

{% endif %}

With that, executing salt-call state.sls system.package-manager will hang.

@OrangeDog (Contributor, Author) commented Aug 31, 2022:

I don't know whether everyone else here with the same problem has the same setup, but:

  • I'm using gitfs (pygit2)
  • I'm using a kvm VM
  • The master is 3004.2
  • A single minion with a single state is sufficient to reproduce (though it has a few files worth of pillar)
  • state.single does not have the same problem, only state.apply

@d--j (Contributor) commented Aug 31, 2022:

  • No gitfs
  • VMware and/or KVM
  • master 3005 (onedir or classic)
  • All minions I tested have this issue – but they all include the sls file from above so that's not a surprise
  • Applying some sls files will work. Applying others will hang.
  • It is not a timing issue or an issue of how many states need applying. I tried with test sls files like this one (which works):
#!pydsl

__pydsl__.set(ordered=True)

state('/tmp/test').file.directory()
state().cmd.run('sleep 30')
for i in range(5000):
  state('/tmp/test/file%s' % i).file.managed(contents='%s'%i)

@OrangeDog I went through my states/top.sls and applied each sls individually (salt-call state.sls sls.path.in.dot.notation); that's how I found my minimal example. Maybe you could do the same, and we might find commonalities?
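A loop along these lines can automate that bisection, assuming the sls names from top.sls are listed one per line in a states.txt file (a sketch only; the 300 s timeout is arbitrary):

while read -r s; do
  echo "== $s"
  timeout 300 salt-call state.sls "$s" >/dev/null || echo "hung or failed: $s"
done < states.txt
# timeout exits with 124 when salt-call had to be killed, so hangs are flagged too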

@OrangeDog (Contributor, Author) commented:

It looks like it's usage of salt:// sources that triggers it. Does that agree with what you are seeing?

@d--j (Contributor) commented Aug 31, 2022:

It looks like it's usage of salt:// sources that triggers it. Does that agree with what you are seeing?

Not all of them. I tried with a static salt:// source in a file.managed state and that works just fine. … And leaving the pkg.installed state out of the above minimal test case also stops the sls file from hanging.

@OrangeDog (Contributor, Author) commented:

Hmm, that seemed to be the pattern with my states. E.g. this hangs:

client-pkgs:
  pkg.installed:
    - pkgs:
        - krb5-user
        - libpam-krb5

/etc/krb5.conf.d:
  file.directory: []

/etc/krb5.conf:
  file.managed:
    - source: salt://{{ tpldir }}/krb5.conf
    - require:
      - file: /etc/krb5.conf.d

But this doesn't:

vm.swappiness:
  sysctl.present:
    - value: 0

net.ipv4.tcp_timestamps:
  sysctl.present:
    - value: 0

net.ipv4.tcp_keepalive_time:
  sysctl.present:
    - value: 600

{% for pkg in ('htop', 'jq', 'unzip') %}
{{ pkg }}:
  pkg.installed: []
{% endfor %}

en_GB.UTF-8:
  locale.system: []

qemu-guest-agent:
{% if grains['virtual'] == 'kvm' %}
  pkg.installed: []
{% else %}
  pkg.purged: []
{% endif %}

{% if grains['os_family'] == 'Debian' %}
debian-goodies:
  pkg.installed: []
{% endif %}

{% if salt['file.file_exists']('/etc/apt/trusted.gpg') and salt['file.stats']('/etc/apt/trusted.gpg')['size'] <= 32 %}
# an empty keyring causes invalid format warnings
/etc/apt/trusted.gpg:
  file.absent: []
{% endif %}

@d--j (Contributor) commented Aug 31, 2022:

So in both our cases this fails:

install-some-packages:
  pkg.installed:
    - pkgs:
        - pkg1
        - pkg2

/tmp/manage-a-file-with-source:
  file.managed:
    - source: salt://path/to/source

Since your sls does not use template: jinja for the file.managed state, it looks like this is not a problem with template evaluation.

@dmurphy18 (Contributor) commented:

@d--j @OrangeDog Thanks for that; I used the last test case and got the hang.

dmurphy18 removed the cannot-reproduce (cannot be replicated with info/context provided) label Aug 31, 2022
@dmurphy18 (Contributor) commented:

Yup, using pyzmq 23.2.1 fixes the issue.
The hang occurs after the work of the state file is done.
Looking through the changelogs of later pyzmq versions, I don't see much that could explain the issue, apart from the following entry for 20.0:

Hold GIL while calling zmq_curve_ functions, which may fix apparent threadsafety issues.

And indeed, 19.0.2 hangs while 20.0 doesn't.
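For reference, the way an un-terminated ZMQ teardown usually shows up in pyzmq is zmq_ctx_term() blocking until every socket is closed and its pending messages have lingered. A minimal sketch of that blocking behavior (generic pyzmq, not Salt's code; the endpoint is arbitrary):

import zmq

ctx = zmq.Context()
sock = ctx.socket(zmq.REQ)
sock.connect("tcp://127.0.0.1:4506")
sock.send(b"ping")        # queued; nothing is listening, so it never drains
# With the default infinite LINGER, ctx.term() would now block forever.
# Closing with linger=0 discards the queued message and lets term() return:
sock.close(linger=0)
ctx.term()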

@dmurphy18 (Contributor) commented:

We are using Ubuntu 20.04's python3-zmq v18.1.1-3; it seems we are exercising it differently now and hitting the threading issue that was fixed in pyzmq v20.0.

@dmurphy18 (Contributor) commented Aug 31, 2022:

I don't see the hang on the following with classic packaging:
Centos 7 pyzmq 18.0.1
Centos 8 pyzmq 20.0.0
Debian 10 pyzmq 17.1.2
Debian 11 pyzmq 20.0.0
Ubuntu 18.04 pyzmq 17.1.2

Note: all onedir builds on Linux use pyzmq 23.2.0, so they don't have the hang issue.

It looks like Ubuntu 20.04 is the only version affected. I will fix the packaging for Ubuntu 20.04; Debian 11 has pyzmq 20.0.0, which should also work on Ubuntu 20.04.

@dos533 commented Sep 1, 2022:

Is there a guide for how best to install pyzmq 20.0.0 on Ubuntu 20.04?

@d--j (Contributor) commented Sep 1, 2022:

@dos533 see #62550 (comment)

@dmurphy18 (Contributor) commented:

Fixed by saltstack/salt-pack-py3#255
The updated classic packaging for Salt 3005 on Ubuntu 20.04 with python3-pyzmq 20.0.0 is undergoing testing and will be available once testing is complete.

@dmurphy18 (Contributor) commented:

Closing, as the fix is now available from https://repo.saltproject.io/py3/ubuntu/20.04/amd64/latest/ and https://repo.saltproject.io/py3/ubuntu/20.04/arm64/latest/.
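On an affected Ubuntu 20.04 minion, the fix should then arrive as a normal package upgrade from that repository (a command sketch; this assumes the rebuilt python3-pyzmq 20.0.0 mentioned above is pulled in as a dependency of salt-minion):

apt-get update
apt-get dist-upgrade   # picks up the repackaged pyzmq along with salt-minion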

@TheBigBear (Contributor) commented Nov 15, 2022:

I still get this on a CentOS 8.7 look-alike (AlmaLinux 8.7): 3005.1 on it ships with PyZMQ 22.0.3, not yet PyZMQ 23.2.0.
See the issue over here: #63061. Could that be the cause?
How could I get a Windows 'onedir' minion installed and tested so that it has and uses the newer version 23.2.0? If I read the comments correctly, the Windows onedir minion will not pick up a manually updated or installed newer version; as I understand it, that only works on the old 'classic' installs?
