Adviser tests are failing because of allocated CPU time exceeded #266

Open
mayaCostantini opened this issue Feb 25, 2022 · 18 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. sig/stack-guidance Categorizes an issue or PR as relevant to SIG Stack Guidance.

Comments

@mayaCostantini
Contributor

Describe the bug
Tests for the thamos_advise feature are producing the following error in stage:

ERROR    thoth.adviser.run:155: Resolver was killed as allocated CPU time was exceeded - https://thoth-station.ninja/j/cpu_time_exceeded

To Reproduce
Steps to reproduce the behavior:
See last integration tests report for stage environment.

Expected behavior
Tests complete successfully.

@mayaCostantini
Contributor Author

/priority critical-urgent

@sesheta sesheta added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Feb 25, 2022
@mayaCostantini
Contributor Author

/kind bug

@sesheta sesheta added the kind/bug Categorizes issue or PR as related to a bug. label Feb 27, 2022
@fridex
Contributor

fridex commented Feb 28, 2022

To test the resolver in such cases, I create a lock file using Pipenv and submit an advise request with the lock file as generated by Pipenv. The resolver then reports why it removes the packages that Pipenv resolved:

it might be a good idea to experiment with requirements (and possibly constraints as well) to narrow down to the issue one wants to debug. An example can be a failure when adviser was not able to find a resolution that would satisfy requirements. In such a case, it might be good to generate a lock file with expected pinned set of packages using other tools (e.g. Pipenv, pip-tools) and submit the lock file to the recommender system. The logs produced during the resolution and stack level justifications might give hints why the given resolution was rejected.

See docs.
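
When following this approach, it can help to list what Pipenv actually pinned so that it can be compared with the packages the resolver reports as removed. A minimal sketch, assuming the usual `Pipfile.lock` JSON layout (`default` and `develop` sections that map package names to entries with a `version` field); the helper name `pinned_packages` and the sample lock content are illustrative, not taken from the failing repositories:

```python
import json


def pinned_packages(lock_content: str) -> dict:
    """Return {package_name: version} for all packages pinned in a Pipfile.lock."""
    lock = json.loads(lock_content)
    pinned = {}
    for section in ("default", "develop"):
        for name, entry in lock.get(section, {}).items():
            # Pipenv stores versions as "==X.Y.Z"; strip the operator.
            version = entry.get("version", "")
            pinned[name] = version[2:] if version.startswith("==") else version
    return pinned


# Illustrative lock file content (hypothetical, for demonstration only).
SAMPLE_LOCK = json.dumps({
    "_meta": {"requires": {"python_version": "3.8"}},
    "default": {"jupyter-tensorboard": {"version": "==0.2.0"}},
    "develop": {},
})

print(pinned_packages(SAMPLE_LOCK))
```

Any package that shows up in this listing and also in a "Removing package" warning is a candidate for closer inspection.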

@fridex
Contributor

fridex commented Mar 1, 2022

/sig stack-guidance
/priority critical-urgent

@sesheta sesheta added the sig/stack-guidance Categorizes an issue or PR as relevant to SIG Stack Guidance. label Mar 1, 2022
@fridex
Contributor

fridex commented Mar 7, 2022

Failing tests:

  • runtime environment ps-cv-pytorch , without user stack supplied and without static analysis

Failure:

2022-03-07 15:24:20,728  23 INFO     thoth.adviser.resolver:1175: Scoring user's stack - see https://thoth-station.ninja/j/user_stack
2022-03-07 15:24:20,733  23 INFO     thoth.adviser.resolver:612: Scoring user's stack based on the lock file submitted - see https://thoth-station.ninja/j/user_stack
2022-03-07 15:24:21,467  23 WARNING  thoth.adviser.sieves.solved:127: Removing package ('jupyter-tensorboard', '0.2.0', 'https://pypi.org/simple') due to installation time error in the software environment - see https://thoth-station.ninja/j/install_error
2022-03-07 15:24:21,468  23 INFO     thoth.adviser.resolver:624: User's stack was removed based on sieves - see https://thoth-station.ninja/j/rm_user_stack
  • runtime environment ps-nlp-tensorflow , without user stack supplied and without static analysis

Failure:

2022-03-07 15:20:19,411  22 INFO     thoth.adviser.resolver:612: Scoring user's stack based on the lock file submitted - see https://thoth-station.ninja/j/user_stack
2022-03-07 15:20:20,288  22 WARNING  thoth.adviser.sieves.solved:127: Removing package ('jupyter-tensorboard', '0.2.0', 'https://pypi.org/simple') due to installation time error in the software environment - see https://thoth-station.ninja/j/install_error
2022-03-07 15:20:20,288  22 INFO     thoth.adviser.resolver:624: User's stack was removed based on sieves - see https://thoth-station.ninja/j/rm_user_stack
  • runtime environment ps-nlp-pytorch , without user stack supplied and without static analysis

Failure:

2022-03-07 15:22:28,279  22 INFO     thoth.adviser.resolver:1175: Scoring user's stack - see https://thoth-station.ninja/j/user_stack
2022-03-07 15:22:28,284  22 INFO     thoth.adviser.resolver:612: Scoring user's stack based on the lock file submitted - see https://thoth-station.ninja/j/user_stack
2022-03-07 15:22:28,979  22 WARNING  thoth.adviser.sieves.solved:127: Removing package ('jupyter-tensorboard', '0.2.0', 'https://pypi.org/simple') due to installation time error in the software environment - see https://thoth-station.ninja/j/install_error
2022-03-07 15:22:28,979  22 INFO     thoth.adviser.resolver:624: User's stack was removed based on sieves - see https://thoth-station.ninja/j/rm_user_stack
  • runtime environment ps-nlp-tensorflow-gpu , without user stack supplied and without static analysis

Failure:

2022-03-07 15:24:20,728  23 INFO     thoth.adviser.resolver:1175: Scoring user's stack - see https://thoth-station.ninja/j/user_stack
2022-03-07 15:24:20,733  23 INFO     thoth.adviser.resolver:612: Scoring user's stack based on the lock file submitted - see https://thoth-station.ninja/j/user_stack
2022-03-07 15:24:21,467  23 WARNING  thoth.adviser.sieves.solved:127: Removing package ('jupyter-tensorboard', '0.2.0', 'https://pypi.org/simple') due to installation time error in the software environment - see https://thoth-station.ninja/j/install_error
2022-03-07 15:24:21,468  23 INFO     thoth.adviser.resolver:624: User's stack was removed based on sieves - see https://thoth-station.ninja/j/rm_user_stack

Based on the lock file we use in the repos, it looks like thoth-solver was not able to solve jupyter-tensorboard==0.2.0 in the given runtime environment.

However, for some scenarios the adviser was able to resolve application dependencies when triggered manually. I've created a new integration-tests job to confirm whether these tests are still failing. Nevertheless, it would be great to check why thoth-solver did not solve jupyter-tensorboard in the given runtime environment.
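
To see at a glance which packages the sieves dropped across several runs, the warnings can be pulled out of the adviser logs. A rough sketch, assuming the log line format shown in the excerpts above; the regular expression and the helper name `removed_packages` are illustrative and not part of adviser:

```python
import re

# Matches the sieve warning emitted when a package is dropped from the stack.
REMOVED_RE = re.compile(
    r"Removing package \('(?P<name>[^']+)', '(?P<version>[^']+)', '(?P<index>[^']+)'\)"
    r" due to (?P<reason>.+?) - see (?P<link>\S+)"
)


def removed_packages(log_text: str) -> list:
    """Collect (name, version, reason) tuples, de-duplicated, in order of appearance."""
    seen = []
    for match in REMOVED_RE.finditer(log_text):
        item = (match["name"], match["version"], match["reason"])
        if item not in seen:
            seen.append(item)
    return seen


# Log lines copied from the ps-nlp-tensorflow failure above.
LOG = """\
2022-03-07 15:20:19,411  22 INFO     thoth.adviser.resolver:612: Scoring user's stack based on the lock file submitted - see https://thoth-station.ninja/j/user_stack
2022-03-07 15:20:20,288  22 WARNING  thoth.adviser.sieves.solved:127: Removing package ('jupyter-tensorboard', '0.2.0', 'https://pypi.org/simple') due to installation time error in the software environment - see https://thoth-station.ninja/j/install_error
2022-03-07 15:20:20,288  22 INFO     thoth.adviser.resolver:624: User's stack was removed based on sieves - see https://thoth-station.ninja/j/rm_user_stack
"""

print(removed_packages(LOG))
```

Running this over the logs of all four failing runtime environments should report the same single package, which matches the observation that jupyter-tensorboard 0.2.0 is the common cause.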

@fridex
Contributor

fridex commented Mar 7, 2022

thoth-solver fails to install jupyter-tensorboard with the following error:

Command exited with non-zero status code (1):     ERROR: Command errored out with exit status 1:
     command: /opt/app-root/src/solver-venv/bin/python3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-0i94_y48/jupyter-tensorboard/setup.py'"'"'; __file__='"'"'/tmp/pip-install-0i94_y48/jupyter-tensorboard/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-vgv1i21f/install-record.txt --single-version-externally-managed --compile --install-headers /opt/app-root/src/solver-venv/include/site/python3.8/jupyter-tensorboard
         cwd: /tmp/pip-install-0i94_y48/jupyter-tensorboard/
    Complete output (71 lines):
    running install
    running build
    running build_py
    creating build
    creating build/lib
    creating build/lib/jupyter_tensorboard
    copying jupyter_tensorboard/application.py -> build/lib/jupyter_tensorboard
    copying jupyter_tensorboard/tensorboard_manager.py -> build/lib/jupyter_tensorboard
    copying jupyter_tensorboard/api_handlers.py -> build/lib/jupyter_tensorboard
    copying jupyter_tensorboard/__init__.py -> build/lib/jupyter_tensorboard
    copying jupyter_tensorboard/handlers.py -> build/lib/jupyter_tensorboard
    creating build/lib/jupyter_tensorboard/static
    copying jupyter_tensorboard/static/tensorboardlist.js -> build/lib/jupyter_tensorboard/static
    copying jupyter_tensorboard/static/style.css -> build/lib/jupyter_tensorboard/static
    copying jupyter_tensorboard/static/tree.js -> build/lib/jupyter_tensorboard/static
    running build_scripts
    creating build/scripts-3.8
    copying scripts/jupyter-tensorboard -> build/scripts-3.8
    changing mode of build/scripts-3.8/jupyter-tensorboard from 644 to 755
    running install_lib
    creating /opt/app-root/src/solver-venv/lib/python3.8/site-packages/jupyter_tensorboard
    copying build/lib/jupyter_tensorboard/application.py -> /opt/app-root/src/solver-venv/lib/python3.8/site-packages/jupyter_tensorboard
    copying build/lib/jupyter_tensorboard/tensorboard_manager.py -> /opt/app-root/src/solver-venv/lib/python3.8/site-packages/jupyter_tensorboard
    copying build/lib/jupyter_tensorboard/api_handlers.py -> /opt/app-root/src/solver-venv/lib/python3.8/site-packages/jupyter_tensorboard
    copying build/lib/jupyter_tensorboard/__init__.py -> /opt/app-root/src/solver-venv/lib/python3.8/site-packages/jupyter_tensorboard
    copying build/lib/jupyter_tensorboard/handlers.py -> /opt/app-root/src/solver-venv/lib/python3.8/site-packages/jupyter_tensorboard
    creating /opt/app-root/src/solver-venv/lib/python3.8/site-packages/jupyter_tensorboard/static
    copying build/lib/jupyter_tensorboard/static/tensorboardlist.js -> /opt/app-root/src/solver-venv/lib/python3.8/site-packages/jupyter_tensorboard/static
    copying build/lib/jupyter_tensorboard/static/style.css -> /opt/app-root/src/solver-venv/lib/python3.8/site-packages/jupyter_tensorboard/static
    copying build/lib/jupyter_tensorboard/static/tree.js -> /opt/app-root/src/solver-venv/lib/python3.8/site-packages/jupyter_tensorboard/static
    byte-compiling /opt/app-root/src/solver-venv/lib/python3.8/site-packages/jupyter_tensorboard/application.py to application.cpython-38.pyc
    byte-compiling /opt/app-root/src/solver-venv/lib/python3.8/site-packages/jupyter_tensorboard/tensorboard_manager.py to tensorboard_manager.cpython-38.pyc
    byte-compiling /opt/app-root/src/solver-venv/lib/python3.8/site-packages/jupyter_tensorboard/api_handlers.py to api_handlers.cpython-38.pyc
    byte-compiling /opt/app-root/src/solver-venv/lib/python3.8/site-packages/jupyter_tensorboard/__init__.py to __init__.cpython-38.pyc
    byte-compiling /opt/app-root/src/solver-venv/lib/python3.8/site-packages/jupyter_tensorboard/handlers.py to handlers.cpython-38.pyc
    running install_egg_info
    running egg_info
    writing jupyter_tensorboard.egg-info/PKG-INFO
    writing dependency_links to jupyter_tensorboard.egg-info/dependency_links.txt
    writing entry points to jupyter_tensorboard.egg-info/entry_points.txt
    writing requirements to jupyter_tensorboard.egg-info/requires.txt
    writing top-level names to jupyter_tensorboard.egg-info/top_level.txt
    reading manifest file 'jupyter_tensorboard.egg-info/SOURCES.txt'
    writing manifest file 'jupyter_tensorboard.egg-info/SOURCES.txt'
    Copying jupyter_tensorboard.egg-info to /opt/app-root/src/solver-venv/lib/python3.8/site-packages/jupyter_tensorboard-0.2.0-py3.8.egg-info
    running install_scripts
    copying build/scripts-3.8/jupyter-tensorboard -> /opt/app-root/src/solver-venv/bin
    changing mode of /opt/app-root/src/solver-venv/bin/jupyter-tensorboard to 755
    Installing jupyter-tensorboard script to /opt/app-root/src/solver-venv/bin
    writing list of installed files to '/tmp/pip-record-vgv1i21f/install-record.txt'
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-install-0i94_y48/jupyter-tensorboard/setup.py", line 52, in <module>
        setup(
      File "/opt/app-root/src/solver-venv/lib64/python3.8/site-packages/setuptools/__init__.py", line 145, in setup
        return distutils.core.setup(**attrs)
      File "/usr/lib64/python3.8/distutils/core.py", line 148, in setup
        dist.run_commands()
      File "/usr/lib64/python3.8/distutils/dist.py", line 966, in run_commands
        self.run_command(cmd)
      File "/usr/lib64/python3.8/distutils/dist.py", line 985, in run_command
        cmd_obj.run()
      File "/tmp/pip-install-0i94_y48/jupyter-tensorboard/setup.py", line 47, in run
        enable_extension_after_install()
      File "/tmp/pip-install-0i94_y48/jupyter-tensorboard/setup.py", line 30, in enable_extension_after_install
        from jupyter_tensorboard.application import (
      File "/tmp/pip-install-0i94_y48/jupyter-tensorboard/jupyter_tensorboard/__init__.py", line 3, in <module>
        from .handlers import load_jupyter_server_extension   # noqa
      File "/tmp/pip-install-0i94_y48/jupyter-tensorboard/jupyter_tensorboard/handlers.py", line 3, in <module>
        from tornado import web
    ModuleNotFoundError: No module named 'tornado'
    ----------------------------------------
ERROR: Command errored out with exit status 1: /opt/app-root/src/solver-venv/bin/python3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-0i94_y48/jupyter-tensorboard/setup.py'"'"'; __file__='"'"'/tmp/pip-install-0i94_y48/jupyter-tensorboard/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-vgv1i21f/install-record.txt --single-version-externally-managed --compile --install-headers /opt/app-root/src/solver-venv/include/site/python3.8/jupyter-tensorboard Check the logs for full command output.

The issue here is that jupyter-tensorboard executes code after installation that expects tornado to be present in the environment. Because we install jupyter-tensorboard without its dependencies, the post-install procedure that registers the extension fails.
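
The failure mode can be illustrated in isolation. The sketch below is a hypothetical stand-in for the post-install hook seen in the traceback (`enable_extension_after_install`), not the actual code from jupyter-tensorboard's `setup.py`; the module name used to simulate the missing dependency is made up:

```python
import importlib


def enable_extension_after_install(required_module: str = "tornado") -> str:
    """Hypothetical stand-in for a post-install hook that imports the installed package.

    Importing the package pulls in its runtime dependencies, so the hook can only
    succeed when those dependencies are already installed.
    """
    try:
        importlib.import_module(required_module)
    except ModuleNotFoundError as exc:
        # This mirrors the solver's situation: the package is installed
        # without dependencies, so the import inside the hook fails.
        return f"post-install hook failed: {exc}"
    return "post-install hook succeeded"


# A module name assumed not to exist, to simulate a missing dependency.
print(enable_extension_after_install("dependency_that_is_not_installed"))
```

In the real traceback the `ModuleNotFoundError` is not caught, so the whole `setup.py install` command exits with a non-zero status.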

@fridex
Contributor

fridex commented Mar 8, 2022

In the most recent report, the adviser was able to find a resolution to this issue: it uses an older version of jupyter-tensorboard that does not perform any post-install procedure.

Closing this as integration tests are green. Nevertheless, we should report this upstream and see what their opinion is on this one.

/close

@sesheta sesheta closed this as completed Mar 8, 2022
@sesheta
Member

sesheta commented Mar 8, 2022

@fridex: Closing this issue.


@fridex
Contributor

fridex commented May 3, 2022

/reopen

@sesheta sesheta reopened this May 3, 2022
@sesheta
Member

sesheta commented May 3, 2022

@fridex: Reopened this issue.


@codificat
Member

Today's aws-prod tests show green.
It is worth ensuring that the stage tests are also green.

@codificat
Member

/assign @fridex
/lifecycle active

@sesheta sesheta added the lifecycle/active Indicates that an issue or PR is actively being worked on by a contributor. label May 9, 2022
@fridex
Contributor

fridex commented May 9, 2022

I scheduled integration-tests for stage; we should receive an email report after the integration tests finish.

@codificat
Member

Right now, integration tests in stage are not running (thoth-station/thoth-application#2599)
/remove-lifecycle active
until this is addressed

@sesheta sesheta removed the lifecycle/active Indicates that an issue or PR is actively being worked on by a contributor. label Jun 27, 2022
@codificat
Member

In yesterday's run of the integration tests in aws-prod, one of the adviser tests failed (ps-cv-pytorch):

... Then I ask for an advise for the cloned application for runtime environment ps-cv-pytorch , without user stack supplied and without static analysis (965.794s) 
...
2022-06-28 03:17:46,572 thoth.adviser.run           ERROR: Resolver was killed as allocated CPU time was exceeded - https://thoth-station.ninja/j/cpu_time_exceeded

Captured logging:
INFO:thamos.lib:Using 'latest' recommendation type - see https://thoth-station.ninja/recommendation-types/
WARNING:thamos.lib:The user stack found in the lock file will not be supplied as requested
INFO:thamos.lib:Successfully submitted advise analysis 'adviser-220628030145-f174942db191749e' to 'https://api.prod.thoth-station.ninja/api/v1'
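
How adviser enforces its CPU time budget is not shown in this thread. As a general illustration of what "killed as allocated CPU time was exceeded" can mean on a Unix system, the sketch below caps a child process at one second of CPU time with the standard `resource` module and checks that it was terminated by `SIGXCPU`. This is an assumption-laden example, not adviser's implementation; the names `CHILD_CODE` and `run_cpu_bound_child` are made up:

```python
import signal
import subprocess
import sys

# Child program: cap its own CPU time at 1 second (soft limit), then spin forever.
CHILD_CODE = """
import resource
_, hard = resource.getrlimit(resource.RLIMIT_CPU)
resource.setrlimit(resource.RLIMIT_CPU, (1, hard))
while True:
    pass
"""


def run_cpu_bound_child() -> int:
    """Run the child and return its exit status (negative means killed by a signal)."""
    return subprocess.run([sys.executable, "-c", CHILD_CODE]).returncode


if __name__ == "__main__":
    status = run_cpu_bound_child()
    if status == -signal.SIGXCPU:
        print("child was killed: allocated CPU time exceeded")
```

The limit counts CPU time rather than wall-clock time, so a resolver that spends most of its budget computing (rather than waiting on I/O) hits it sooner.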

@codificat
Member

Another anecdotal update: yesterday's aws-prod integration test runs have 2 tests failing due to allocated CPU time exceeded: ps-cv-pytorch and ps-cv-tensorflow

@codificat codificat moved this to 🏗 In progress in Planning Board Aug 8, 2022
@sesheta
Member

sesheta commented Oct 3, 2022

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@sesheta sesheta added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 3, 2022
@harshad16
Member

/remove-lifecycle stale
/lifecycle frozen

@sesheta sesheta added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Oct 4, 2022
@mayaCostantini mayaCostantini removed their assignment Dec 2, 2022
@codificat codificat removed their assignment Jan 16, 2023
@codificat codificat moved this from 🏗 In progress to 📋 Backlog in Planning Board Jan 16, 2023
Projects
Status: 📋 Backlog
Development

No branches or pull requests

5 participants