RDMA Hadoop/Spark not working with Slurm submission scripts #276
I have never tested with RDMA Hadoop, so I don't know if Magpie works with it. Obviously, any number of changes in RDMA Hadoop could make it not work with Magpie, as Magpie assumes the Hadoop scripts work in a certain way, that its patches apply cleanly, that the same configuration & tool options exist, etc. etc.

Without any knowledge of your situation, here's a guess at the problem. Magpie assumes the node's hostname (as it is configured in Slurm), such as "foo[1-10]", is the hostname to use for network communication, i.e. the Hadoop NodeManager works off the node & port of

If your cluster is not like this, then perhaps the InfiniBand portion of RDMA Hadoop is confused, because it's trying to connect to the host/IP that Magpie configured for it, which is not the host/IP it wants.

Other than that, I think good old-fashioned log/conf file debugging is the way to go. I'm glad to help. If you use the script
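To see which address the Slurm-scheduled hostname actually resolves to on a node, here is a rough diagnostic sketch (the helper name `show_node_identity` is my own, not part of Magpie or RDMA Hadoop):

```shell
# Hypothetical diagnostic, not part of Magpie: print the short hostname
# this node is scheduled under and what the resolver returns for it, so
# you can see whether the name maps to the Ethernet or InfiniBand address.
show_node_identity() {
  node=$(hostname -s)
  echo "slurm hostname: $node"
  # getent shows the address(es) the resolver associates with that name
  getent hosts "$node" || echo "no resolver entry for $node"
}
```

If the name resolves to the Ethernet IP while RDMA Hadoop expects the InfiniBand interface (often a separate name like `foo1-ib`), the daemons would try to communicate over the wrong interface.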
I'm trying to launch the job with the script you recommended, but I'm getting some errors like these in the slurm-jobid.out file:

Also showing this one:

I have tried to modify some config files provided by Magpie, adding some options used in my own files, but it still doesn't work.
I'm unsure of your setup, but something core/basic seems to be wrong. Unclear what it could be. For

the error is this line

The environment variable MAGPIE_CLUSTER_NODERANK isn't defined, leading to the script error. But this is defined by Magpie in

Perhaps you can try a stupid test. If you run a job, can you output the environment variables SLURM_NODEID, SLURM_NNODES, SLURM_JOB_NODELIST, SLURM_JOB_NAME, and SLURM_JOB_ID on each node of your allocation? Magpie needs these, and I believe at the moment it simply assumes Slurm always provides them.
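A minimal sketch of that test (the helper name `check_slurm_env` is hypothetical, not something Magpie provides):

```shell
# Hypothetical helper, not part of Magpie: report each Slurm variable
# Magpie depends on, flagging any that are unset or empty.
check_slurm_env() {
  missing=0
  for v in SLURM_NODEID SLURM_NNODES SLURM_JOB_NODELIST SLURM_JOB_NAME SLURM_JOB_ID; do
    eval "val=\${$v-}"
    if [ -z "$val" ]; then
      echo "$v: MISSING"
      missing=1
    else
      echo "$v=$val"
    fi
  done
  return $missing
}

# Inside a job, run one copy per allocated node, e.g.:
#   srun --ntasks-per-node=1 sh -c '. ./check_slurm_env.sh && check_slurm_env'
```

If any of these come back MISSING on some node, that would explain Magpie's scripts failing the way yours do.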
I have configured RDMA Hadoop and Spark myself on an InfiniBand cluster and it works, but when I try to use the submission script magpie.sbatch-srun-spark-with-yarn-and-hdfs (just testing Hadoop for now), it allocates the nodes in Slurm perfectly but doesn't work properly. The ResourceManager appears in the jps output but doesn't start, showing an InfiniBand error in the ResourceManager .out file while showing no errors in the .log file; accordingly, the NodeManager .log files show a connection problem to the ResourceManager node.
It seems these scripts are not ready for this RDMA version of Hadoop and Spark, because I can make it work fine on my own with the conf files provided in the Hadoop guide that I followed. Any suggestions?
I would really appreciate any help you can provide.