Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

optimadeclient stuck in loading #76

Closed
unkcpz opened this issue Jun 29, 2022 · 12 comments
Closed

optimadeclient stuck in loading #76

unkcpz opened this issue Jun 29, 2022 · 12 comments

Comments

@unkcpz
Copy link
Contributor

unkcpz commented Jun 29, 2022

Hi @dou-du

I got following error message when I found optimadeclient is stuck in loading. There is template related information in tracklog.

2022-06-29T18:27:09.904617195Z app[web.1]: Traceback (most recent call last):                                                                                                                                                              
2022-06-29T18:27:09.904621217Z app[web.1]:   File "/app/.heroku/python/lib/python3.8/site-packages/tornado/web.py", line 1704, in _execute                                                                                                 
2022-06-29T18:27:09.904624340Z app[web.1]:     result = await result                                                                                                                                                                       
2022-06-29T18:27:09.904627185Z app[web.1]:   File "/app/.heroku/python/lib/python3.8/site-packages/voila/handler.py", line 225, in get
2022-06-29T18:27:09.904630068Z app[web.1]:     async for html in gen:                                                                                                                                                                      
2022-06-29T18:27:09.904632749Z app[web.1]:   File "/app/.heroku/python/lib/python3.8/site-packages/voila/handler.py", line 142, in get_generator
2022-06-29T18:27:09.904635723Z app[web.1]:     rendered, rendering = await render_task                                                                                                                                                     
2022-06-29T18:27:09.904638655Z app[web.1]:   File "/app/.heroku/python/lib/python3.8/asyncio/futures.py", line 263, in __await__
2022-06-29T18:27:09.904641468Z app[web.1]:     return self.result()  # May raise too.                                                                                                                                                      
2022-06-29T18:27:09.904644247Z app[web.1]:   File "/app/.heroku/python/lib/python3.8/asyncio/futures.py", line 178, in result
2022-06-29T18:27:09.904647112Z app[web.1]:     raise self._exception                                                                                                                                                                       
2022-06-29T18:27:09.904649728Z app[web.1]:   File "/app/.heroku/python/lib/python3.8/asyncio/tasks.py", line 280, in __step
2022-06-29T18:27:09.904652492Z app[web.1]:     result = coro.send(None)                                                                                                                                                                    
2022-06-29T18:27:09.904655121Z app[web.1]:   File "/app/.heroku/python/lib/python3.8/site-packages/voila/notebook_renderer.py", line 185, in generate_content_hybrid                                                                       
2022-06-29T18:27:09.904658065Z app[web.1]:     async for html_snippet, _ in generator:                                                                                                                                                     
2022-06-29T18:27:09.904660706Z app[web.1]:   File "/app/.heroku/python/lib/python3.8/site-packages/voila/exporter.py", line 101, in generate_from_notebook_node
2022-06-29T18:27:09.904663547Z app[web.1]:     async for output in self.template.generate_async(nb=nb_copy, resources=resources, **extra_context, static_url=self.static_url):
2022-06-29T18:27:09.904666434Z app[web.1]:   File "/app/.heroku/python/lib/python3.8/site-packages/jinja2/environment.py", line 1373, in generate_async
2022-06-29T18:27:09.904669299Z app[web.1]:     yield self.environment.handle_exception()                                                                                                                                                   
2022-06-29T18:27:09.904672262Z app[web.1]:   File "/app/.heroku/python/lib/python3.8/site-packages/jinja2/environment.py", line 936, in handle_exception                                                                                   
2022-06-29T18:27:09.904675209Z app[web.1]:     raise rewrite_traceback_stack(source=source)                                                                                                                                                
2022-06-29T18:27:09.904677865Z app[web.1]:   File "/app/.heroku/python/share/jupyter/voila/templates/materialscloud-iframe/index.html.j2", line 1, in top-level template code                                                              
2022-06-29T18:27:09.904680708Z app[web.1]:     {%- extends 'nbconvert/templates/materialscloud-iframe/index.html.j2' -%}                                        
2022-06-29T18:27:09.904684511Z app[web.1]:   File "/app/.heroku/python/share/jupyter/nbconvert/templates/materialscloud-iframe/index.html.j2", line 7, in top-level template code
2022-06-29T18:27:09.904687556Z app[web.1]:     {% set nb_title = nb.metadata.get('title', '') or resources['metadata']['name'] %}                                 
2022-06-29T18:27:09.904690436Z app[web.1]:   File "/app/.heroku/python/share/jupyter/voila/templates/lab/index.html.j2", line 4, in top-level template code                             
2022-06-29T18:27:09.904693266Z app[web.1]:     {% from 'voila_setup.macro.html.j2' import voila_setup_helper_functions, voila_setup_nbextensions with context %}
2022-06-29T18:27:09.904696072Z app[web.1]:   File "/app/.heroku/python/share/jupyter/nbconvert/templates/lab/index.html.j2", line 3, in top-level template code                                                                            
2022-06-29T18:27:09.904703048Z app[web.1]:     {% from 'jupyter_widgets.html.j2' import jupyter_widgets %}                                                                                                                                 
2022-06-29T18:27:09.904706085Z app[web.1]:   File "/app/.heroku/python/share/jupyter/nbconvert/templates/lab/base.html.j2", line 260, in top-level template code
2022-06-29T18:27:09.904708955Z app[web.1]:     {% set div_id = uuid4() %}                                         
2022-06-29T18:27:09.904711640Z app[web.1]:   File "/app/.heroku/python/share/jupyter/nbconvert/templates/base/display_priority.j2", line 1, in top-level template code                                                                     
2022-06-29T18:27:09.904714536Z app[web.1]:     {%- extends 'base/null.j2' -%}                                                                                                                                                              
2022-06-29T18:27:09.904717273Z app[web.1]:   File "/app/.heroku/python/share/jupyter/nbconvert/templates/base/null.j2", line 26, in top-level template code
2022-06-29T18:27:09.904720094Z app[web.1]:     {%- block body -%}                                                 
2022-06-29T18:27:09.904722741Z app[web.1]:   File "/app/.heroku/python/share/jupyter/nbconvert/templates/base/null.j2", line 29, in block 'body'
2022-06-29T18:27:09.904725536Z app[web.1]:     {%- block body_loop -%}
2022-06-29T18:27:09.904728207Z app[web.1]:   File "/app/.heroku/python/share/jupyter/nbconvert/templates/materialscloud-iframe/index.html.j2", line 395, in block 'body_loop'
2022-06-29T18:27:09.904731351Z app[web.1]:     {{ super() }}
2022-06-29T18:27:09.904733956Z app[web.1]:   File "/app/.heroku/python/lib/python3.8/site-packages/jinja2/async_utils.py", line 65, in auto_await
2022-06-29T18:27:09.904736768Z app[web.1]:     return await t.cast("t.Awaitable[V]", value)
2022-06-29T18:27:09.904739617Z app[web.1]:   File "/app/.heroku/python/lib/python3.8/site-packages/jinja2/runtime.py", line 366, in <listcomp>
2022-06-29T18:27:09.904743322Z app[web.1]:     [x async for x in self._stack[self._depth](self._context)]  # type: ignore
2022-06-29T18:27:09.904746087Z app[web.1]:   File "/app/.heroku/python/share/jupyter/voila/templates/lab/index.html.j2", line 47, in block 'body_loop'
2022-06-29T18:27:09.904748924Z app[web.1]:     {%- block body_loop -%}
2022-06-29T18:27:09.904751556Z app[web.1]:   File "/app/.heroku/python/lib/python3.8/site-packages/jinja2/async_utils.py", line 65, in auto_await
2022-06-29T18:27:09.904754659Z app[web.1]:     return await t.cast("t.Awaitable[V]", value)
2022-06-29T18:27:09.904757422Z app[web.1]:   File "/app/.heroku/python/lib/python3.8/site-packages/voila/notebook_renderer.py", line 152, in inner_kernel_start
2022-06-29T18:27:09.904760286Z app[web.1]:     return await self._jinja_kernel_start(nb, kernel_id, kernel_future)
2022-06-29T18:27:09.904763547Z app[web.1]:   File "/app/.heroku/python/lib/python3.8/site-packages/voila/notebook_renderer.py", line 213, in _jinja_kernel_start
2022-06-29T18:27:09.904766514Z app[web.1]:     self.executor.kc = await self.executor.async_start_new_kernel_client() 
2022-06-29T18:27:09.904769232Z app[web.1]:   File "/app/.heroku/python/lib/python3.8/site-packages/nbclient/client.py", line 532, in async_start_new_kernel_client
2022-06-29T18:27:09.904772073Z app[web.1]:     await ensure_async(self.kc.wait_for_ready(timeout=self.startup_timeout))
2022-06-29T18:27:09.904774803Z app[web.1]:   File "/app/.heroku/python/lib/python3.8/site-packages/nbclient/util.py", line 96, in ensure_async
2022-06-29T18:27:09.904784041Z app[web.1]:   File "/app/.heroku/python/lib/python3.8/site-packages/jupyter_client/client.py", line 205, in _async_wait_for_ready
2022-06-29T18:27:09.904786972Z app[web.1]:     raise RuntimeError("Kernel didn't respond in %d seconds" % timeout)
2022-06-29T18:27:09.904789892Z app[web.1]: RuntimeError: Kernel didn't respond in 60 seconds
2022-06-29T18:27:09.904792584Z app[web.1]: ERROR:tornado.general:Cannot send error response after headers written
@giovannipizzi
Copy link

Is this reproducible? If so, to understand if the issue is with the template or not, I suggest to deploy another app (e.g. on the dev machine) with a standard template.
If it's the template, check what changed (maybe newer versions of voila require changes to the template?).
Or if it's not the template, change the code simplifying it to see when it stops timing out?

@unkcpz
Copy link
Contributor Author

unkcpz commented Jul 8, 2022

@giovannipizzi thanks a lot for the suggestions.

Is this reproducible?

I won't say it is fully reproducible. It appears more frequently when I use less pre-heat kernel. On the dev server, since the load is not big, it rarely happened. If there is a tool to intentionally try to request the service and force it to open more kernels, we can give it a try.
But I still do not fully understand how voila allocate kernels. It is based on IP or every time a user opens a browser session it will open a kernel? @dou-du Do you know how it works under the hood?

I think it will be also useful if we have some tool that can conduct a stress test for any dokku app. @eimrek Is that possible to have this as what the monitor server is doing now?

@eimrek
Copy link
Member

eimrek commented Jul 8, 2022

Hi, could this be related to gliderlabs/herokuish#659?

And also this fix might be related:

https://github.com/materialscloud-org/openstack-ansible/blob/master/roles/mc_dokku/tasks/fix-dockerfile.yml

Edit: regarding the stress test, yes, in principle we could use something similar as we use in the monitor server to access the app many times: https://github.com/materialscloud-org/openstack-ansible/blob/master/roles/mc_monitor_server/files/check_httpjs.py

@unkcpz
Copy link
Contributor Author

unkcpz commented Jul 14, 2022

@eimrek thanks a lot! I think they are related. The change in fix-dockerfile.yml made by @ltalirz was to change the permission of /dev/random to 640 to make it inaccessible. However, neither the /dev/random on host or on the optimadeclient container have 444. I am not pretty sure what is this gliderlabs/herokuish image used for but there is a new one with the tag latest which I go into image and the /dev/random is 444.

Maybe @dou-du ever build the image by hand? I remember you mentioned you have updated the version? Can you confirm with it?

I think to fix this, I need first figure out what is this gliderlabs/herokuish used for and check if the /dev/random is modified to workaround the issue. If that is the case, we can tear down all the app and delete the gliderlabs/herokuish images and redeploy the server by running the whole ansible role. Any idea and comment on this? @giovannipizzi @eimrek

@unkcpz
Copy link
Contributor Author

unkcpz commented Jul 14, 2022

We can now use python 3.8 for optimade-client which means dokku or herokuish (I need to spend some time to understand how these things work under the hood, sorry for the vagueness of the terminologies here.) is updated as mentioned at materialscloud-org/tools-optimade-client#42

@dou-du
Copy link
Collaborator

dou-du commented Jul 14, 2022

@eimrek thanks a lot! I think they are related. The change in fix-dockerfile.yml made by @ltalirz was to change the permission of /dev/random to 640 to make it inaccessible. However, neither the /dev/random on host or on the optimadeclient container have 444. I am not pretty sure what is this gliderlabs/herokuish image used for but there is a new one with the tag latest which I go into image and the /dev/random is 444.

Maybe @dou-du ever build the image by hand? I remember you mentioned you have updated the version? Can you confirm with it?

I think to fix this, I need first figure out what is this gliderlabs/herokuish used for and check if the /dev/random is modified to workaround the issue. If that is the case, we can tear down all the app and delete the gliderlabs/herokuish images and redeploy the server by running the whole ansible role. Any idea and comment on this? @giovannipizzi @eimrek

I updated the version number from ansible.

@ltalirz
Copy link
Member

ltalirz commented Jul 15, 2022

Indeed, just start from the latest version of the image and apply the fix to prevent the "stuck due to lack of entropy" issue.

@unkcpz
Copy link
Contributor Author

unkcpz commented Jul 15, 2022

Thanks @ltalirz. Could you elaborate a bit on how the image gliderlabs/herokuish:latest is used in the dokku server? I using docker run -it gliderlabs/herokuish:latest to go into the container and find the /start is the modified one with the change you applied. But the /dev/random permission is not changed. Is that because /start script as entrypoint is not executed or is that the /dev/random is not able to apply changes for permission?

@ltalirz
Copy link
Member

ltalirz commented Jul 15, 2022

If you just run the image with the entry point bash, then the permission won't be changed.

herokuish is built on top of the heroku docker image and basically does what they do (detect what kind of app the user is trying to run, install dependencies, etc.)

@unkcpz
Copy link
Contributor Author

unkcpz commented Jul 15, 2022

If you just run the image with the entry point bash, then the permission won't be changed.

Then how do I test if this change is taking effect? I just can not find where the image built by ansible task https://github.com/materialscloud-org/openstack-ansible/blob/master/roles/mc_dokku/tasks/fix-dockerfile.yml is used in the server.

@ltalirz
Copy link
Member

ltalirz commented Jul 15, 2022

Then how do I test if this change is taking effect?

Well, if you exec into a container started by dokku, you will see.

I just can not find where the image built by ansible task https://github.com/materialscloud-org/openstack-ansible/blob/master/roles/mc_dokku/tasks/fix-dockerfile.yml is used in the server.

Well, dokku should be using this image (you may need to look at the dokku source code).

In any case, you can simply look at the containers running in one of the dokku instances and see whether they are using the image

@unkcpz
Copy link
Contributor Author

unkcpz commented Jul 25, 2022

@ltalirz thanks and sorry for the late reply, I was on vacation last week.

Well, if you exec into a container started by dokku, you will see.

I did this and it is for sure not using have the /dev/random set to correct permission.

In any case, you can simply look at the containers running in one of the dokku instances and see whether they are using the image

I use docker inspect to trace the image of the optimadeclient and then inspect the image. What I found are 1) the sha256 id of Parent image can not be found in the list of docker ps and the herokuish's dockerhub https://hub.docker.com/r/gliderlabs/herokuish/tags?page=1. 2) There are maybe related useful label information to debug, I attached below.

            "Env": [                                                                                                                                                                                                                       
                "CACHE_PATH=/cache",                                                                                                                                                                                                       
                "USER=herokuishuser",                                                                                                                                                                                                      
                "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",                                                                                                                                                       
                "STACK=heroku-20",                                                                                                                                                                                                         
                "DEBIAN_FRONTEND=noninteractive"                                                                                                                                                                                           
            ],                                                                                                                                                                                                                             
            "Cmd": [                                                                                                                                                                                                                       
                "/bin/sh",                                                                                                                                                                                                                 
                "-c",                                                                                                                                                                                                                      
                "#(nop) ",                                                                                                                                                                                                                 
                "LABEL org.label-schema.vendor=dokku"                                                                                                                                                                                      
            ],                                                                                                                                                                                                                             
            "Image": "sha256:832c50abf474930c7cee20461729e54fc30280bf55f9b0f3a40136637a2bbfe1",                                                                                                                                            
            "Volumes": null,                                                                                                                                                                                                               
            "WorkingDir": "",                                                                                                                                                                                                              
            "Entrypoint": null,                                                                                                                                                                                                            
            "OnBuild": null,                                                                                                                                                                                                               
            "Labels": {                                                                                                                                                                                                                    
                "com.dokku.app-name": "optimadeclient",                                                                                                                                                                                    
                "com.dokku.image-stage": "release",                                                                                                                                                                                        
                "com.gliderlabs.herokuish/stack": "heroku-20",                                                                                                                                                                             
                "dokku": "",                                                                                                                                                                                                               
                "org.label-schema.schema-version": "1.0",                                                                                                                                                                                  
                "org.label-schema.vendor": "dokku"                                                                                                                                                                                         
            }     

I suspect the new version of dokku use the new tag point to the image, I am curious what this heroku-20 means. Do you have any idea what to look further?

@eimrek eimrek closed this as completed Jul 26, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants