MPI runs 3x faster on master than stateless node #sbatch #slurm


Per Jørgensen
 

Hi,

I'm trying to configure a SLURM/Warewulf OHPC cluster, but I can't get the nodes to perform.
I ran a small OpenFOAM MPI program: it took 88s on the master and 250s on the compute nodes.
It takes around 250s on the nodes no matter how I run it - srun --pty bash, sbatch, ssh to a node, console on a node.
Changing the master to an i7-8700K with 10G Ethernet reduced the runtime to 220s, but that is still much longer than running directly on the master.

Could stateless provisioning impact OpenFOAM MPI that much? Or could there be a difference between the MPI in the provisioned image and the devel MPI?
Any ideas for what the bottleneck is?

The system is
- 19 compute nodes: Xeon Phi 7210 KNL with 64 cores on 1 socket, 1G Ethernet
- master node: Xeon Phi 7250 KNL with 68 cores on 1 socket
- alternative master node on i7-8700K with 10G Ethernet
- stateless provisioning
- CentOS 7.7 kernel 3.10.0-1062.el7.x86_64
- Basically I have followed the OpenHPC recipe for Warewulf/SLURM on CentOS 7.6 (I checked that others had used CentOS 7.7)

Best regards,

Per


 

Hi,

Can you confirm the numbers are as follows?
 - Xeon Phi 7210 = 250s
 - Xeon Phi 7250 = 88s
 - i7-8700K = 220s

I would naively assume the i7 and 7250 results to be the other way around (i.e. the 7250 should be similar to, but a bit faster than, the 7210), but that's not what your mail implies.

Can you also share the full MPI command (number of cores in each job) and how OMP_NUM_THREADS is set?

It would also probably be better to start by benchmarking something more predictable like HPL before moving on to OpenFOAM.

Generally speaking, there are a bunch of other considerations I would be aiming to rule out (KNL memory mode, presence/absence of I/O, etc) before concluding that this has anything to do with the provisioning method.
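
For example, something along these lines run on both the master and a provisioned node, then diffed - the exact paths and tools here are assumptions on my part, not taken from your setup, so adjust for your image:

```shell
# Sketch of a master-vs-node comparison; paths/tools are assumptions.

# In flat mode the 16G MCDRAM shows up as a separate NUMA node,
# so the node lists should look the same on master and compute node.
numa_info=$(numactl -H 2>/dev/null || echo "numactl not installed")
echo "$numa_info"

# A powersave governor baked into the node image is a classic slowdown.
governor=$(cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor \
    2>/dev/null || echo "unknown")
echo "governor: $governor"

echo "checks complete"
```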

Regards,
Chris

On Wed, 30 Oct 2019 at 11:52, Per Jørgensen <per@...> wrote:


Per Jørgensen
 

Hi Chris,

Thanks for the answer. Good point - I will try HPL and see if it behaves the same way.

The program is compiled for Xeon Phi, so it can't run on the i7-8700K in the cluster (but the normal build runs in around 250s on 6 cores there).
250s and 220s are both for the Xeon Phi 7210 nodes, just with different masters (Xeon Phi 7250 and i7-8700K).
If I run a non-cluster installation on a Xeon Phi 7210, the runtime is 66s (slightly faster because I used Intel MPI, which runs faster than OpenMPI on Xeon Phi).

All runs are in the 16G MCDRAM; OMP_NUM_THREADS is not set.
mpirun -np 64 numactl -m 1 simpleFoam -parallel > log.simpleFoam_d1 2>&1
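
For what it's worth, -m 1 binds memory to NUMA node 1 (the MCDRAM in flat mode). A rough way I could double-check where a process's pages actually land (numastat ships with the numactl package; the shell below just stands in for a simpleFoam rank):

```shell
# Report per-NUMA-node memory usage of one process; $$ (this shell) is
# only a stand-in for a real simpleFoam rank's PID. If the pages show up
# under node 0 (DDR) instead of node 1 (MCDRAM), the binding didn't take.
mem_report=$(numastat -p $$ 2>/dev/null || echo "numastat not installed")
echo "$mem_report"
```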

When I monitor the network with nload, there is not much traffic.

Best regards,

Per


Per Jørgensen
 

Hi,

Now I have tried HPL, and it looks worse:
- on the master node I get 1741 GFlops
- on the provisioned node I get 145 GFlops
- with a normal stand-alone installation on a node I get 1752 GFlops

Best regards,

Per


chris.collins@...
 

Hi Per,

Have you tried running on the provisioned node but outside Slurm, to rule out anything there (cgroup / affinity or other Slurm config - been there, seen that, see my post here a few weeks back!)?
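
A quick sketch of what I mean - run this on the node both inside and outside a Slurm step and compare (tool availability is an assumption on my part):

```shell
# Which CPUs is this shell confined to? Under a cgroup-confined Slurm
# step this list can be much shorter than the node's 64 cores.
affinity=$(taskset -cp $$ 2>/dev/null || echo "taskset unavailable")
echo "$affinity"

# Which cgroups is it in? (Linux-only; the path format varies by distro.)
cat /proc/self/cgroup 2>/dev/null || echo "no cgroup info"
```

Open MPI's mpirun also has a --report-bindings option you could add to your existing command line to see which core each rank actually gets pinned to.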

Cheers,

Chris


Per Jørgensen
 

Hi Chris,

I logged directly on to the node before I ran it.
The performance drop is consistent across all the ways I have run it.

Best regards,

Per