sbatch won't work while srun runs fine #sbatch #slurm


rodrigoceccatodefreitas@...
 

Hello,

I am having problems with sbatch on the cluster I am working on.
As shown in the image below, srun runs just fine (above the yellow line), while the script jobtest.sh will not run with sbatch.


Our slurm.conf file:


#
# Example slurm.conf file. Please run configurator.html
# (in doc/html) to build a configuration file customized
# for your environment.
#
#
# slurm.conf file generated by configurator.html.
#
# See the slurm.conf man page for more information.
#
ClusterName=sorgan-cluster
ControlMachine=sorgan
#ControlAddr=
#BackupController=
#BackupAddr=
#
SlurmUser=slurm
SlurmdUser=root
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
StateSaveLocation=/tmp
SlurmdSpoolDir=/tmp/slurmd
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
ProctrackType=proctrack/pgid
#PluginDir=
#FirstJobId=
ReturnToService=2
#MaxJobCount=
#PlugStackConfig=
#PropagatePrioProcess=
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#Prolog=
#Epilog=
#SrunProlog=
SrunPortRange=60001-63000
#SrunEpilog=
#TaskProlog=
#TaskEpilog=
#TaskPlugin=
#TrackWCKey=no
#TreeWidth=50
#TmpFS=
#UsePAM=
#
# TIMERS
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
#
# SCHEDULING
SchedulerType=sched/backfill
#SchedulerAuth=
#SchedulerPort=
#SchedulerRootFilter=
SelectType=select/cons_res
SelectTypeParameters=CR_CPU
FastSchedule=0
#PriorityType=priority/multifactor
#PriorityDecayHalfLife=14-0
#PriorityUsageResetPeriod=14-0
#PriorityWeightFairshare=100000
#PriorityWeightAge=1000
#PriorityWeightPartition=10000
#PriorityWeightJobSize=1000
#PriorityMaxAge=1-0
#
# LOGGING
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurmd.log
JobCompType=jobcomp/none
#JobCompLoc=
#
# ACCOUNTING
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=30
#
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=sorgan
#AccountingStorageLoc=/var/log/slurmacct.log
AccountingStorageLoc=slurmdb
AcctGatherNodeFreq=30
AccountingStorageEnforce=associations,limits
AccountingStoragePort=7031
#AccountingStoragePass=
#AccountingStorageUser=
#
#GENERAL RESOURCE
GresTypes=""
#
#EXAMPLE CONFIGURATION - copy,comment out, and edit
#
#COMPUTE NODES
#NodeName=gpu-compute-1 Gres=gpu:gtx_TitanX:4 Sockets=2 CoresPerSocket=8 State=UNKNOWN
NodeName=compute-1 Sockets=1 CoresPerSocket=16 State=UNKNOWN
# PARTITIONS
#PartitionName=high Nodes=compute-[0-1] Default=YES MaxTime=INFINITE State=UP PriorityTier=10
#PartitionName=gpu  Nodes=gpu-compute-1 Default=YES MaxTime=INFINITE State=UP PriorityTier=5 AllowGroups=slurmusers
PartitionName=low  Nodes=compute-1 Default=YES MaxTime=2-00:00:00 State=UP

If I missed any relevant information, I will gladly post it.

Thanks in advance,

Rodrigo

Additional info:
I installed Slurm using these Ansible roles: https://github.com/XSEDE/CRI_XCBC (stateless nodes).


Brian Andrus
 

That is because your compute node does not have the same user IDs as the node you submitted the job from.
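
A quick way to confirm it (the username and host below are placeholders - substitute the account that runs sbatch and your compute node name) is to compare the user entry on both machines:

# On the head node (sorgan): note the UID/GID of the submitting user
id rodrigo
getent passwd rodrigo

# On the compute node (compute-1 here, assuming you can ssh to it):
ssh compute-1 'id rodrigo; getent passwd rodrigo'

# If the user is missing on the compute node, or the UID/GID numbers differ,
# slurmd cannot start the batch job under that user even though submission
# from the head node looks fine.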

Brian Andrus



rodrigoceccatodefreitas@...
 

Oh, that was it!

I used "$wwsh file sync" to pass the /etc/passwd, /etc/group and /etc/group to the compute nodes and then sbatch worked :)

Thank you very much!