

Fixing SLURM environment parsing #2895

Merged: 1 commit merged into master on Sep 12, 2017

Conversation

sithhell (Member)

This patch fixes parsing of the slurm environment in the presence of running
different SLURM steps with different number of tasks and nodes.

hkaiser (Member) commented Sep 11, 2017

IIUC, there are situations where the SLURM_STEP_... variables are not set (depending on whether salloc or srun was used, and possibly other criteria; see here). I think we should always fall back to the SLURM_... env variables if the SLURM_STEP_... variables are not set.

sithhell (Member, Author) commented Sep 11, 2017 via email

hkaiser (Member) commented Sep 11, 2017

> > IIUC, there are situations where the SLURM_STEP_... are not set (depending whether salloc or srun have been used, possibly other criteria..., see here). I think we should always check the SLURM_... env variables if the SLURM_STEP_... variable is not set.
>
> On the slurm installations I checked, they are both set. The post you linked to concerns the SLURM_JOB variables, which seem to be the same as the ones without _JOB.

The link is related to the _JOB_ variables, yes, but if you read it you'll see that for the same job the _STEP_ variables are not always present, depending on the way things are run.

Would it be a problem to play it safe and allow for a fallback in case the _STEP_ variables are not present?

> http://rostam.cct.lsu.edu/builders/hpx_gcc_7_boost_1_65_centos_x86_64_debug/builds/19
> shows that the assertion which was a symptom of the bug seems to be gone.

I don't doubt that for our setup things work now. Other environments may still break.

sithhell (Member, Author) commented Sep 11, 2017 via email


sithhell (Member, Author)
I just checked the SLURM environments I have access to again. Having the fallback as you suggested would lead to similar failures. The PR is correct as is.

biddisco (Contributor) left a comment

My understanding of SLURM is that the SLURM_XXX env vars are set inside a job allocation, while the SLURM_STEP_XXX env vars are set inside an srun (job step); a step can use a subset of the larger allocation (any sbatch script can in fact launch multiple srun job steps). Since SLURM always launches work inside srun, this patch is correct. If a job is manually launched from inside an allocation, without using srun, then it is effectively a single-node job, and falling back to the SLURM_XXX vars would give incorrect node counts etc.

@sithhell sithhell merged commit 61eefd2 into master Sep 12, 2017
@sithhell sithhell deleted the fix_slurm branch September 12, 2017 07:32