sbatch won't work while srun runs fine #sbatch #slurm
rodrigoceccatodefreitas@...
Hello,
I am having problems with sbatch on the cluster I am working on. As shown in the image below, srun runs just fine (above the yellow line), while the script jobtest.sh will not run with sbatch.

Our slurm.conf file:

#
# Example slurm.conf file. Please run configurator.html
# (in doc/html) to build a configuration file customized
# for your environment.
#
#
# slurm.conf file generated by configurator.html.
#
# See the slurm.conf man page for more information.
#
ClusterName=sorgan-cluster
ControlMachine=sorgan
#ControlAddr=
#BackupController=
#BackupAddr=
#
SlurmUser=slurm
SlurmdUser=root
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
StateSaveLocation=/tmp
SlurmdSpoolDir=/tmp/slurmd
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
ProctrackType=proctrack/pgid
#PluginDir=
#FirstJobId=
ReturnToService=2
#MaxJobCount=
#PlugStackConfig=
#PropagatePrioProcess=
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#Prolog=
#Epilog=
#SrunProlog=
SrunPortRange=60001-63000
#SrunEpilog=
#TaskProlog=
#TaskEpilog=
#TaskPlugin=
#TrackWCKey=no
#TreeWidth=50
#TmpFS=
#UsePAM=
#
# TIMERS
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
#
# SCHEDULING
SchedulerType=sched/backfill
#SchedulerAuth=
#SchedulerPort=
#SchedulerRootFilter=
SelectType=select/cons_res
SelectTypeParameters=CR_CPU
FastSchedule=0
#PriorityType=priority/multifactor
#PriorityDecayHalfLife=14-0
#PriorityUsageResetPeriod=14-0
#PriorityWeightFairshare=100000
#PriorityWeightAge=1000
#PriorityWeightPartition=10000
#PriorityWeightJobSize=1000
#PriorityMaxAge=1-0
#
# LOGGING
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurmd.log
JobCompType=jobcomp/none
#JobCompLoc=
#
# ACCOUNTING
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=30
#
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=sorgan
#AccountingStorageLoc=/var/log/slurmacct.log
AccountingStorageLoc=slurmdb
AcctGatherNodeFreq=30
AccountingStorageEnforce=associations,limits
AccountingStoragePort=7031
#AccountingStoragePass=
#AccountingStorageUser=
#
#GENERAL RESOURCE
GresTypes=""
#
#EXAMPLE CONFIGURATION - copy, comment out, and edit
#
#COMPUTE NODES
#NodeName=gpu-compute-1 Gres=gpu:gtx_TitanX:4 Sockets=2 CoresPerSocket=8 State=UNKNOWN
NodeName=compute-1 Sockets=1 CoresPerSocket=16 State=UNKNOWN
# PARTITIONS
#PartitionName=high Nodes=compute-[0-1] Default=YES MaxTime=INFINITE State=UP PriorityTier=10
#PartitionName=gpu Nodes=gpu-compute-1 Default=YES MaxTime=INFINITE State=UP PriorityTier=5 AllowGroups=slurmusers
PartitionName=low Nodes=compute-1 Default=YES MaxTime=2-00:00:00 State=UP
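For context, jobtest.sh is a minimal test script along these lines (a simplified sketch, not the exact file):

#!/bin/bash
#SBATCH --job-name=jobtest        # job name shown in squeue
#SBATCH --output=jobtest-%j.out   # stdout/stderr file (%j = job id)
#SBATCH --ntasks=1                # single task
#SBATCH --time=00:05:00           # wall-time limit

hostname                          # just report which compute node ran the job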
If I missed any relevant information, I will gladly post it. Thanks in advance, Rodrigo.

Additional info: I have installed Slurm using these Ansible roles: https://github.com/XSEDE/CRI_XCBC (stateless nodes).
Brian Andrus
That is because your compute node does not have the same user IDs as the node that submitted it. Brian Andrus
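You can confirm that quickly by comparing what both machines report for the account (assuming here the user name is rodrigo and the compute node is compute-1, as in your slurm.conf):

# On the head/submit node
id rodrigo

# On the compute node
ssh compute-1 id rodrigo

If the second command says "no such user" or shows different uid/gid numbers, the account databases are out of sync.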
rodrigoceccatodefreitas@...
Oh, that was it!
I used "wwsh file sync" to push /etc/passwd and /etc/group to the compute nodes, and then sbatch worked :) Thank you very much!
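For anyone hitting the same thing on Warewulf-provisioned stateless nodes, the overall flow was roughly the following (the import step is an assumption on my part and exact wwsh subcommands may vary by Warewulf version; only the sync command is what I actually ran above):

# Register the credential files with Warewulf, if not already imported (assumed step)
wwsh file import /etc/passwd
wwsh file import /etc/group
# Push the current versions out to the stateless compute nodes
wwsh file sync

# Then resubmit from the head node
sbatch jobtest.sh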