Fixing SLURM environment parsing #2895
This patch fixes parsing of the SLURM environment when different SLURM steps run with different numbers of tasks and nodes.
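IIUC, there are situations where the SLURM_STEP_... are not set (depending on whether salloc or srun have been used, possibly other criteria..., see here <https://groups.google.com/forum/#!topic/slurm-devel/3tLPgShGM9A>). I think we should always check the SLURM_... env variables if the SLURM_STEP_... variable is not set.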
On 11.09.2017 at 5:28 PM, "Hartmut Kaiser" <notifications@github.com> wrote:
IIUC, there are situations where the SLURM_STEP_... are not set (depending
whether salloc or srun have been used, possibly other criteria..., see here
<https://groups.google.com/forum/#!topic/slurm-devel/3tLPgShGM9A>). I think
we should always check the SLURM_... env variables if the SLURM_STEP_...
variable is not set.
On the slurm installations I checked, they are both set. The post you
linked to concerns the `SLURM_JOB` variables, which seem to be the same as
the ones without `_JOB`.
The build at
http://rostam.cct.lsu.edu/builders/hpx_gcc_7_boost_1_65_centos_x86_64_debug/builds/19
shows that the assertion, which was a symptom of the bug, seems to be gone.
The link is related to the [...] Would it be a problem to play it safe and to allow for a fallback in case the [...]
I don't doubt that things now work for our setup. Other environments may still break.
No, shouldn't be a problem to add it as fallback to be on the safe side.
On 11.09.2017 at 6:49 PM, "Thomas Heller" <thom.heller@gmail.com> wrote:
No, shouldn't be a problem to add it as fallback to be on the safe side.
Reading through the post again, I think what we have now is correct. The
case where no `_STEP_` variable is set is when you have an
allocation but have not yet started the actual parallel application. I
think what's missing is detection of that case, with a proper diagnostic
and a single-locality fallback.
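
The detection described above could look roughly like the following. This is only a sketch of the idea, not HPX's actual code: the helper name `localities_from_slurm_step` is hypothetical, and it assumes that `SLURM_STEP_NUM_TASKS` is the variable that marks a running job step and that `SLURM_JOB_ID` marks an allocation.

```python
import os
import sys


def localities_from_slurm_step(env=None):
    """Return the number of localities implied by the current SLURM job step.

    Hypothetical helper: if no step variable is present, report it and
    fall back to a single locality instead of reading job-level variables.
    """
    env = os.environ if env is None else env
    raw = env.get("SLURM_STEP_NUM_TASKS")
    if raw is None:
        if "SLURM_JOB_ID" in env:
            # Inside an allocation, but not started through srun.
            print(
                "warning: SLURM allocation detected but no job step; "
                "running as a single locality",
                file=sys.stderr,
            )
        return 1
    return int(raw)
```

Passing the environment as a dictionary keeps the function easy to try out without a SLURM installation.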
I just checked the SLURM environments I have access to again. Having the fallback as you suggested would lead to similar failures. The PR is correct as is.
My understanding of SLURM is that the SLURM_XXX env vars are set inside a job allocation, whereas the SLURM_STEP_XXX env vars are set inside an srun (job step). A job step can be a subset of the larger allocation (any sbatch script can in fact launch multiple srun job steps). Since SLURM always launches work inside srun, this patch is correct. If a job is launched manually from inside an allocation, without using srun, it is effectively a single-node job, and falling back to the SLURM_XXX vars would give incorrect node counts etc.
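
To make the difference concrete, here is a small sketch that compares the two node counts. The helper `node_counts` and the environment values are made up for illustration. It assumes `SLURM_JOB_NUM_NODES` carries the allocation's node count and `SLURM_STEP_NUM_NODES` the job step's node count.

```python
def node_counts(env):
    """Return (allocation_nodes, step_nodes).

    step_nodes is None when the process was not started as a job step.
    """
    alloc = int(env["SLURM_JOB_NUM_NODES"])
    step = env.get("SLURM_STEP_NUM_NODES")
    return alloc, (int(step) if step is not None else None)


# Illustrative values only: a 4-node allocation in which srun started a 2-node step.
inside_srun = {"SLURM_JOB_NUM_NODES": "4", "SLURM_STEP_NUM_NODES": "2"}

# Same allocation, but the program was started directly, without srun.
without_srun = {"SLURM_JOB_NUM_NODES": "4"}

print(node_counts(inside_srun))
print(node_counts(without_srun))
```

In the second case the process runs on a single node, so taking the allocation's count of 4 as a fallback would overstate the number of nodes actually in use.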