Basic Workflow examples requires another service? #406
Comments
Have you configured merlin per the instructions in the tutorial? https://merlin.readthedocs.io/en/latest/modules/installation/installation.html#id3 |
I didn't know there was a tutorial! 😆 I see redis now. Would it make sense to have a link to that on the workflows page? Like "before you do this, you should have followed the instructions here." I started at the workflows page, then went to the install page to pip install, and didn't realize there was an intermediate step. |
Is it possible to run redis just locally, or via an external service? I'm planning to run this on kubernetes, and the redis service will be another container in a pod, meaning the service will be available (and I don't want merlin to launch singularity or docker or podman). I'm reading here https://merlin.readthedocs.io/en/latest/merlin_server.html. |
Yes, you can start your own redis server locally without a container, or have one running on an external server and point your configuration file at it. This can be complicated, so merlin server aims to make this easier. You can also run with the --local flag to bypass a server and run serially, which is sufficient for the basic example workflows. |
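Before pointing the configuration file at a locally started or external redis, it can help to confirm that the service is reachable at all. The following is a minimal sketch using only the Python standard library; the helper name `service_reachable` and the host/port values are illustrative assumptions and are not part of merlin.

```python
import socket


def service_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a plain TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    # 6379 is the default redis port; adjust host and port to your setup.
    print("redis reachable:", service_reachable("localhost", 6379))
```

This only checks that something is listening on the port. It does not verify TLS settings or credentials.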
Here are some instructions if you want to configure your container to run your own server: https://merlin.readthedocs.io/en/latest/modules/installation/installation.html#id5 |
Advanced configuration docs are here: https://merlin.readthedocs.io/en/latest/merlin_config.html |
Running via Kubernetes would be pretty cool. If you figure out the configuration and/or docs, we should add it! |
The docker-compose example should work! For testing I'm just spinning up redis locally, but for the Flux Operator we have separate containers that provide services: https://flux-framework.org/flux-operator/tutorials/services.html. There are two modes (and likely I'll test and think about both). For one - the service is provided as a sidecar container. So an indexed job with N containers running merlin would have one sidecar per container. That wouldn't be ideal if all nodes running merlin (via flux) need access to the same database. The second design (which I haven't tested yet, but technically it's simpler than the above) is to bring up a single (shared) service for all the worker nodes to access. Glancing at these files, the hardest thing (for the Flux Operator) will be to generate the shared config files and volumes in advance. Likely I'll just do this manually for now and make a tutorial, but (if you like this idea generally) this could be an opportunity to make a simple Kubernetes operator just for running these merlin workflows. They are fairly fun to make! I'm also thinking about if it might be possible to have the concept of a Flux Operator plugin - e.g., adding an ability to use "Flux Operator + Merlin" and then having some of the complexity handled by the plugin. I'll do more research/reading about this today - editing audio first but should have some time later. Happy Saturday! |
Heyo! I have my docker-compose setup and I think I'm fairly close - but a question. Is redis requested to use ssl? I haven't generated certificates for it, and I'm wondering if this is redis or rabbitmq (or possibly I have a configuration error). I also had some trouble matching the environment variables for the various cert files (they generate with very different names) so I might have messed that up. Here is the current debug output:
Here is my app.yaml
and my docker-compose:

version: '3.9'
services:
  # This can also be set up with TLS
  # https://merlin.readthedocs.io/en/latest/modules/installation/installation.html#id7
  # see 2.4.2 "Redis TLS Service"
  redis:
    restart: always
    hostname: redis
    container_name: redis
    image: 'redis:latest'
    ports:
      - "6379:6379"
    networks:
      - rabbitmq
  rabbitmq:
    restart: always
    # The hostname should not be necessary - didn't work either way
    hostname: rabbitmq
    container_name: rabbitmq
    image: rabbitmq:3.8-management
    ports:
      - "15672:15672"
      - "15671:15671"
      - "5672:5672"
      - "5671:5671"
    environment:
      - RABBITMQ_SSL_CACERTFILE=/cert_rabbitmq/client_rabbitmq_certificate.pem
      - RABBITMQ_SSL_KEYFILE=/cert_rabbitmq/server_rabbitmq_key.pem
      - RABBITMQ_SSL_CERTFILE=/cert_rabbitmq/server_rabbitmq_certificate.pem
      - RABBITMQ_SSL_VERIFY=verify_none
      - RABBITMQ_SSL_FAIL_IF_NO_PEER_CERT=false
      - RABBITMQ_DEFAULT_USER=merlinu
      - RABBITMQ_DEFAULT_VHOST=/merlinu
      - RABBITMQ_DEFAULT_PASS=guest
    volumes:
      - ./merlinu/cert_rabbitmq:/cert_rabbitmq
    networks:
      - rabbitmq
  # Yer a weezard Harry!
  merlin:
    build: .
    container_name: merlin
    networks:
      - rabbitmq
    # I only added these because they weren't showing up (didn't change anything)
    # You can try with them removed
    links:
      - rabbitmq
      - redis
networks:
  rabbitmq:
    driver: bridge

Also note that we now have to pin the rabbitmq management container to that version - if you go higher you'll get an error because they changed it to not accept envars (and only accept a config file). Thanks for your help - I think I'm close (and am excited to at least try this out in the flux operator, once it's working in compose!) |
@lucpeterson I just tested removing conda/mamba from the container (and installing to system python alongside flux) and I reproduced the same ssl error, if that helps. |
I’ve seen a similar error with incompatibilities between versions of celery/kombu and certifi. Trying different versions of certifi could maybe work? |
The broker lists as amqps, so it is expecting an ssl config. You can try setting the broker to amqp or setting cert_reqs: none. The rabbitmq server is also configured with certs, so it may need to have ssl configured for the initial handshake. The TLS configuration is definitely the most complicated component of the server system, and we have seen issues with specific versions of openssl. |
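To make the scheme distinction concrete: a trailing "s" on the URL scheme (amqps, rediss) signals that the client will attempt a TLS connection, so a certificate configuration is expected. Here is a minimal sketch that inspects a connection URL. The helper name `expects_tls` and the example URLs are hypothetical, loosely built from the compose values in this thread, and do not reflect how merlin itself parses its configuration.

```python
from urllib.parse import urlparse

# URL schemes that imply a TLS-wrapped connection.
TLS_SCHEMES = {"amqps", "rediss"}


def expects_tls(url: str) -> bool:
    """Return True if the URL scheme implies TLS (e.g. amqps, rediss)."""
    return urlparse(url).scheme.lower() in TLS_SCHEMES


if __name__ == "__main__":
    # Hypothetical URLs for illustration only.
    for url in (
        "amqps://merlinu:guest@rabbitmq:5671//merlinu",
        "amqp://merlinu:guest@rabbitmq:5672//merlinu",
        "rediss://redis:6379/0",
        "redis://redis:6379/0",
    ):
        print(url, "->", "TLS" if expects_tls(url) else "plain")
```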
I haven't added the redis ssl yet - it was presented like it was optional in the docs. I can do that now. |
I think the rabbitmq broker is giving you the exception based on the amqp transport.py file emitting the exception. |
okay trying this new way - have you seen this before? Here is what redis sees inside the container:
When I use the entrypoint command:
I get permission denied
And that doesn't make sense because the user in the container is root. When I change ownership of that directory to uid 0 it doesn't change the error. |
No - this would be entirely external to lab resources. Is there someone I can contact that has the Dockerfiles / Kubernetes configs that are driving the working variant? |
Great news! I banged on this a bit more and checked all the things I thought might be wrong (user ids, permissions, configs) and rabbitmq and redis are working! The next bug I hit is with respect to running the demo - it looks like there should be a samples file (but I don't see it):
I think maybe there was a silent error - possibly me forgetting to install something called spellbook? I found this in the feature demo YAML file and tried running it on my own:
I think I found it here: https://github.com/LLNL/merlin-spellbook
Then after I installed that, it made some output:
I think that might have worked? Do I look at the output file?
I'm not sure if that is right because the workflow description says it will launch 10 hello worlds (I can't find them!) I then found that I can define the batch type as "flux":

batch:
- type: local
+ type: flux

And then start the flux instance:

$ # sudo -u fluxuser -E PATH=$PATH -E PYTHONPATH=$PYTHONPATH -E LD_LIBRARY_PATH=$LD_LIBRARY_PATH flux start --test-size=4
$ whoami
fluxuser

And I ran it!

$ merlin run feature_demo/feature_demo.yaml

And that launched flux jobs!
And the job attach (output) showed me a warning that (bad me!) I should not be running as root. So this is a good stopping point for tonight, but next I need to go back and redo the build so ownership and the expected user is fluxuser (and not root) otherwise I get that warning:
But this is great progress - we have our services! Here are the changes I needed and my WIP demo here. I'll update the group email thread, and tomorrow will start figuring out the fluxuser port. If I can get this running here with the fluxuser, that should be enough to try in the flux operator. Some feedback for the docs - the rabbitmq config parameters are different, and the certs had to be bound to the merlin container too. What really tripped me up was the app.yaml because there were so many options, and I didn't realize "rediss" means "redis with ssl" as opposed to "Did a snake write this?" 😆 I'll ping again after some more work tomorrow. Thanks for the help and connecting me to the larger team today! |
okay went a little further - when I try the flux example workflows, it's still trying to use srun:
The cmd.sh in the studies / merlin_info looks OK:
python3 /workflow/flux/scripts/make_samples.py -dims 2 -n 10 -outfile=/workflow/studies/flux_test_20230322-053246/merlin_info/samples.npy
I did some debugging, and in And then I realized the pip installed version doesn't even have that logic! So I reinstalled directly from the branch here - that actually seemed to run! I had to look here Lines 331 to 338 in 060826f
Restart: None
Scheduled?: True
[2023-03-22 05:56:58,151: INFO] Executing step 'runs' in '/workflow/studies/flux_test_20230322-055655/runs/09'...
[2023-03-22 05:56:58,337: INFO] Execution returned status OK.
[2023-03-22 05:56:58,337: INFO] Step 'runs' in '/workflow/studies/flux_test_20230322-055655/runs/09' finished successfully.
[2023-03-22 05:56:58,394: INFO] Task merlin.common.tasks.merlin_step[4838c7da-7ea5-4aad-bd0f-719f83b94ede] succeeded in 0.24390821799170226s: ReturnCode.OK
[2023-03-22 05:56:58,396: INFO] Task merlin:chordfinisher[4a7c517b-76c9-4410-a108-2de2a3634019] received
[2023-03-22 05:56:58,408: INFO] Task merlin:chordfinisher[4a7c517b-76c9-4410-a108-2de2a3634019] succeeded in 0.01129624602617696s: 'SYNC'
[2023-03-22 05:56:58,409: INFO] Task merlin.common.tasks.expand_tasks_with_samples[1d54784b-e69f-4a1e-a53e-2c7f1c2404b4] received
[2023-03-22 05:56:58,421: INFO] Task merlin.common.tasks.expand_tasks_with_samples[1d54784b-e69f-4a1e-a53e-2c7f1c2404b4] succeeded in 0.011088147992268205s: None
[2023-03-22 05:56:58,422: INFO] Task merlin.common.tasks.merlin_step[5432e58a-a375-4388-a2b7-3e10766ba722] received
[2023-03-22 05:56:58,423: INFO] Directory does not exist. Creating directories to /workflow/studies/flux_test_20230322-055655/data
[2023-03-22 05:56:58,423: INFO] Generating script for data into /workflow/studies/flux_test_20230322-055655/data
[2023-03-22 05:56:58,423: INFO] Running workflow step 'data' locally.
[2023-03-22 05:56:58,423: INFO] Script: /workflow/studies/flux_test_20230322-055655/data/data.slurm.sh
Restart: None
Scheduled?: True
[2023-03-22 05:56:58,423: INFO] Executing step 'data' in '/workflow/studies/flux_test_20230322-055655/data'...
[2023-03-22 05:56:58,503: WARNING] Unrecognized Merlin Return code: 1, returning SOFT_FAIL
[2023-03-22 05:56:58,503: WARNING] *** Step 'data' in '/workflow/studies/flux_test_20230322-055655/data' soft failed. Continuing with workflow.
[2023-03-22 05:56:58,514: INFO] Task merlin.common.tasks.merlin_step[5432e58a-a375-4388-a2b7-3e10766ba722] succeeded in 0.09122752398252487s: ReturnCode.SOFT_FAIL
[2023-03-22 05:56:58,515: INFO] Task merlin:chordfinisher[bc64e65c-f988-4401-85d8-d22e710a9517] received
[2023-03-22 05:56:58,516: INFO] Task merlin:chordfinisher[bc64e65c-f988-4401-85d8-d22e710a9517] succeeded in 0.0006944899796508253s: 'SYNC'
Also note that "flux mini" is getting deprecated - so should update that eventually (not soon if you want backwards compatibility).
I'm also wondering if it always makes sense to run merlin via an allocation? E.g., for my use case, I'm going to be giving the merlin command to flux start. Theoretically it will already be inside a flux instance, so it could just do flux submit. Maybe that launch command for flux should be more customizable? I also think the merlin run -> merlin run-workers command is a bit confusing for a new user - my expectation is that "run" actually runs the workflow. Perhaps there could be two avenues:
My branch is now updated with the changes I needed to run as the fluxuser. https://github.com/rse-ops/flux-hpc/tree/add/merlin/merlin-demos. Will try out the other flux examples tomorrow! |
Based on other issues I've been helping users with and now these suggestions too, I think it's time for us to update our script adapters so that they're more consistent with Maestro and up-to-date with flux/slurm/lsf (this would also include updates to the docs to make everything more clear for users). I appreciate your recommendations here and I'll be bookmarking this for when I go take a look at making these updates |
Thanks for beta testing the new flux native interface. We don't have the same version you used installed yet so the interface change is news to me, we will get a fix in for that. The run and run-workers are generally separate because you can start workers on a different machine than the study was submitted. That way you can spin-up more workers independent from the study, the producer-consumer model. Separating the scheduler batch/run configuration from the code would make these api changes much easier to implement. |
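The producer-consumer split described above can be illustrated with a toy example. This is only a sketch using the Python standard library: `producer` stands in for the role of `merlin run` (queueing tasks) and `worker` for the role of `merlin run-workers` (consuming them). All names here are hypothetical, and an in-process queue replaces the real broker, so this does not show merlin's actual implementation.

```python
import queue
import threading


def producer(tasks: queue.Queue, n_tasks: int) -> None:
    """Stand-in for the 'run' side: only enqueues work, executes nothing."""
    for i in range(n_tasks):
        tasks.put(i)


def worker(tasks: queue.Queue, results: list, lock: threading.Lock) -> None:
    """Stand-in for one worker: drains the queue until it is empty."""
    while True:
        try:
            item = tasks.get_nowait()
        except queue.Empty:
            return
        with lock:
            results.append(item * item)  # placeholder for real step execution
        tasks.task_done()


def run_study(n_tasks: int, n_workers: int) -> list:
    """Queue n_tasks, then start n_workers consumers and collect results."""
    tasks: queue.Queue = queue.Queue()
    results: list = []
    lock = threading.Lock()
    producer(tasks, n_tasks)  # work is queued before any worker exists
    threads = [
        threading.Thread(target=worker, args=(tasks, results, lock))
        for _ in range(n_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(results)


if __name__ == "__main__":
    print(run_study(n_tasks=5, n_workers=2))
    print(run_study(n_tasks=5, n_workers=0))  # queued, but nothing executes
```

With zero workers the tasks stay queued and no results are produced, which mirrors why queueing a study by itself does not execute it.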
The new pr #407 should fix the deprecation messages. |
Gotcha! And that makes sense. For the operator we have a commands -> pre block where I can run it before the official "launch the jobs!" command. I'm pretty far into that now - hopefully will have an update soon. I kind of cheated for the redis/rabbitmq containers because I just built them already with the certs they need for the demo. 😆 In a real production sense you'd want to generate them dynamically and then have read only config map volumes. You could even have a merlin operator to handle this! |
I sent this update via email, but will post here too! I got it mostly working in the flux operator - I had to do an interactive submit mode because (as far as I can tell) there is no single command to give to, for example, flux start that will generate some DAG, submit and wait for all jobs, and exit only when that's done. I have a lot of questions about design (and am wondering if things might be simplified) so I'm hoping there is interest to have a meeting so you can show me some of the design internals / interaction with Flux. |
Longer term a cleaner interface might be to create a flux transport channel that celery can hook into directly. Currently celery can do rabbitmq, redis, sqs and zookeeper (although we only have merlin hooks for the first two) as brokers (there are lots more backends). https://github.com/celery/kombu/tree/a3de6f66c1c62cba5008f078c2df20d97f32dcbe/kombu/transport |
@lucpeterson I like that idea - but how could kombu accept a contribution for a different kind of transport that explicitly is to a job queue (and isn't a general message or event?) Would we try to make some kind of additional plugin to work with it (or similar?) |
There are other abstractions to think about too - e.g., a celery "backend" is more of the database. Here is a random example I found for a custom one. https://github.com/pilwon/celery-backends-rethinkdb/blob/master/rethinkdb_backend.py Arguably if we submit a job to flux, it would serve as running the task and be able to give us the result. I've never developed for kombu / celery so apologies as I try to get my head around the different components (and what we are interested in). |
Hi! I'm trying to follow the tutorial - I've installed it, done merlin config (not included in the docs there) and also init'd the workflow. When I run the command as shown here https://merlin.readthedocs.io/en/latest/merlin_workflows.html I get
I think I'm possibly missing something - was I supposed to configure a service? Is there a complete tutorial with step by step instructions I could do somewhere? Thanks!