#sbatch An ORTE daemon error


Afelete
 

Dear All,
I kindly need some urgent help to fix the issue below.

I'm testing a newly configured HPC cluster that uses OpenHPC with Slurm for scheduling. Interactive single-node jobs submitted with srun, with all required resources on one node, execute successfully. However, batch MPI jobs submitted with sbatch are not scheduled at all and instead produce the error message below:


[:67445] pmix_mca_base_component_repository_open: unable to open mca_pnet_opa: libpsm2.so.2: cannot open shared object file: No such file or directory (ignored)
[:67445] mca_base_component_repository_open: unable to open mca_oob_ud: libosmcomp.so.3: cannot open shared object file: No such file or directory (ignored)
--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
--------------------------------------------------------------------------

I understand that the first two lines of the error message relate to InfiniBand, which we are not using (we use Ethernet), so that is not the reason the jobs fail.
The sbatch script is below:
#!/bin/bash
#SBATCH -A aworwui
#SBATCH -J raxml_example   # Job name, you can change it to whatever you want
#SBATCH -o slurm-%A_%a.out        # Standard output will be written here
#SBATCH -e %N.%j.err        # Standard error will be written here
#SBATCH -n 12               # number of tasks
#SBATCH -N 2              # Number of nodes
#SBATCH -p bigmem          # Slurm partition, where you want the job to be queued
 
module load raxml/8.2.12
 
echo "Starting at `date`"
echo "Running on hosts: $SLURM_NODELIST"
echo "Running on $SLURM_NNODES nodes."
echo "Running $SLURM_NTASKS tasks."
echo "Current working directory is `pwd`"
 
mpirun raxmlHPC-MPI-AVX -m GTRCAT -s /home/mrc.gm/aworwui/At.TRX.domain.MUSCLEf.phy -n test_RAXML -N 24 -m PROTGAMMAWAG -p 9567154133 -T 4
echo "Program finished with exit code $? at: `date`"



John Hearns
 

It looks like some shared libraries cannot be found.
I suspect that you need a "module load openmpi" (or intelmpi, or whatever MPI you use) in your batch job file.

Please log into a compute node using your own account, i.e. the account which is running the jobs.
Then run: ldd raxmlHPC-MPI-AVX
Tell us what the output is.
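
For example, a minimal sketch of that check (the openmpi3 and raxml/8.2.12 module names come from later in this thread and are assumptions; load whatever your site actually provides):

$ ssh <compute-node>                                  # log in as the account that runs the jobs
$ module load openmpi3 raxml/8.2.12                   # assumption: the modules raxml was built against
$ which raxmlHPC-MPI-AVX                              # confirm which binary the module puts on PATH
$ ldd $(which raxmlHPC-MPI-AVX) | grep "not found"    # any output here is a missing shared library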




John Hearns
 

I should ask - is this an Omni-Path cluster?




Karl W. Schulz
 

Hello,

Are you also using the openmpi build from OpenHPC (and raxml built against that variant)? If so, I would first suggest updating your job launch line and replacing "mpirun" with "prun", which should squash the missing library messages.
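
For reference, a sketch of how the launch portion of the job script might change (the openmpi3 module name is an assumption; load whichever MPI module raxml was actually built against):

module load openmpi3          # assumption: the OpenHPC MPI variant raxml was built with
module load raxml/8.2.12

# prun is the OpenHPC launch wrapper; it selects the appropriate mpirun/srun
# invocation for the currently loaded MPI family
prun raxmlHPC-MPI-AVX <same raxml arguments as in the original script>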

-k





John Hearns
 

Also, please log into one of the compute nodes and run ibv_devinfo.
You probably have to do this as root.




Worwui Archibald <warchibald@...>
 

Hello K,
Yes, I'm using the openmpi3/3.1.0 build from OpenHPC, and raxml was built against that variant.

Regards
Archie




Worwui Archibald <warchibald@...>
 

Hello
No, it is a 10Gb Ethernet cluster.

Archie



Karl W. Schulz
 

On Sep 6, 2018, at 7:54 AM, Worwui Archibald <warchibald@...> wrote:

Hello K,
Yes, I'm using the openmpi3/3.1.0 build from OpenHPC, and raxml was built against that variant.

Regards
Archie
Ok, thanks for the confirmation. If you haven’t already, I might suggest confirming that you can run simple hello world type jobs on your system across two nodes with your module environment. Assuming you have the ‘examples-ohpc’ package installed, you could do this interactively as follows:

$ mpicc /opt/ohpc/pub/examples/mpi/hello.c

Then, submit an interactive job similar to your job script and test:

$ srun -N 2 -n 12 -A aworwui -p bigmem --pty /bin/bash

Once scheduled, issue the job launch on assigned compute hosts:

test@c1:~> prun ./a.out
[prun] Master compute host = c1
[prun] Resource manager = slurm
[prun] Launch cmd = mpirun ./a.out (family=openmpi3)

Hello, world (12 procs total)
--> Process # 1 of 12 is alive. -> c1
--> Process # 2 of 12 is alive. -> c1
--> Process # 3 of 12 is alive. -> c1
--> Process # 4 of 12 is alive. -> c1
--> Process # 5 of 12 is alive. -> c1
--> Process # 0 of 12 is alive. -> c1
--> Process # 6 of 12 is alive. -> c2
--> Process # 7 of 12 is alive. -> c2
--> Process # 8 of 12 is alive. -> c2
--> Process # 9 of 12 is alive. -> c2
--> Process # 10 of 12 is alive. -> c2
--> Process # 11 of 12 is alive. -> c2


Can you confirm this works?

-k







Worwui Archibald <warchibald@...>
 

Hello
When I run prun with resources on two nodes it gives the same error,
but a request for only one node runs successfully.

srun -N 2 -n 12 -A aworwui -p bigmem --pty /bin/bash
[aworwui@fmem1 ~]$ prun ./a.out
[prun] Master compute host = fmem1
[prun] Resource manager = slurm
[prun] Launch cmd = mpirun ./a.out (family=openmpi3)
--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
--------------------------------------------------------------------------


srun -n 12 -A aworwui -p bigmem --pty /bin/bash
[aworwui@fmem1 ~]$ prun ./a.out
[prun] Master compute host = fmem1
[prun] Resource manager = slurm
[prun] Launch cmd = mpirun ./a.out (family=openmpi3)

Hello, world (12 procs total)
--> Process # 0 of 12 is alive. -> fmem1
--> Process # 1 of 12 is alive. -> fmem1
--> Process # 2 of 12 is alive. -> fmem1
--> Process # 3 of 12 is alive. -> fmem1
--> Process # 4 of 12 is alive. -> fmem1
--> Process # 5 of 12 is alive. -> fmem1
--> Process # 6 of 12 is alive. -> fmem1
--> Process # 7 of 12 is alive. -> fmem1
--> Process # 8 of 12 is alive. -> fmem1
--> Process # 9 of 12 is alive. -> fmem1
--> Process # 10 of 12 is alive. -> fmem1
--> Process # 11 of 12 is alive. -> fmem1



Worwui Archibald <warchibald@...>
 

Hello,
ibv_devinfo is not installed on the compute nodes because this is not an InfiniBand cluster.

Archie



Karl W. Schulz
 

On Sep 6, 2018, at 8:55 AM, Worwui Archibald <warchibald@...> wrote:

Hello
When I run prun with resources on two nodes it gives the same error,
but a request for only one node runs successfully.

OK, this is good info, and we know it's not unique to raxml. Next question: by chance are you configuring Slurm to schedule jobs to your head node? Open MPI can get confused if the network interfaces available on the assigned nodes are not consistent, which can happen when a job spans a typical head node and a compute node (the compute node likely has only one active Ethernet interface, whereas the head node has several).

Assuming this might be the case, can you repeat the test restricted to two compute nodes that have the same Ethernet device configuration? I don't know the naming scheme for your compute nodes, but if you had two identical compute hosts named "c1" and "c2", you could augment your Slurm submission to request those nodes specifically via the -w option:

$ srun -N 2 -n 12 -A aworwui -w c1,c2 -p bigmem --pty /bin/bash
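
If an interface mismatch does turn out to be the cause, one common workaround (a sketch, not something prescribed in this thread; the interface name eth0 is an assumption, so check "ip addr" on the compute nodes first) is to tell Open MPI explicitly which interface to use for its wire-up and TCP traffic:

$ mpirun --mca oob_tcp_if_include eth0 --mca btl_tcp_if_include eth0 ./a.out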

-k


Afelete
 

$ srun -N 2 -n 12 -A aworwui -w fmem1,fmem2 -p bigmem --pty /bin/bash

[aworwui@fmem1 ~]$ prun ./a.out
[prun] Master compute host = fmem1
[prun] Resource manager = slurm
[prun] Launch cmd = mpirun ./a.out (family=openmpi3)
--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
--------------------------------------------------------------------------


Brian Andrus
 

Have you ensured your SSH keys are in place so you can get between nodes without a password? Also, make sure your nodes run the same image and have the same software locations/versions, etc.

Try running it on a single node; that will tell you whether it works with the self BTL, at least.
And, of course, ensure there is no firewall running on any node.
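
A minimal sketch of those checks, run from the first assigned compute node (the host name fmem2, the firewalld service, and the OpenHPC install prefix are assumptions; adjust to your setup):

$ ssh fmem2 hostname                        # should print "fmem2" without a password prompt
$ ssh fmem2 systemctl is-active firewalld   # "inactive" (or "unknown") means firewalld is not running
$ ssh fmem2 ls /opt/ohpc/pub/mpi            # compare installed MPI stacks with the local node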

Brian Andrus

