Slurm: #SBATCH --mail not working


Patrick Goetz
 

Before I get into more serious debugging, I thought I'd check with the list on this. I'm writing up some simple documentation for my OpenHPC users and discovered that Slurm will email you when your job finishes, dies, etc. This seemed useful, so I tested it:

---------------------------------
#!/bin/bash
#
#SBATCH --job-name=pg-test
#SBATCH --output=pg.txt
#
#SBATCH --mail-user=pgoetz@...
#SBATCH --mail-type=ALL
#
#SBATCH --ntasks=1
#SBATCH --time=3:00
#SBATCH --mem-per-cpu=100

srun hostname
---------------------------------


but it doesn't seem to work; i.e. I don't get email. So, for the purposes of short-circuiting erroneous suggestions:

- I am able to send mail to myself using mailx from the SMS, so it's not an SMTP problem

- When I run

journalctl -f | grep postfix

to monitor what postfix is doing when the Slurm script runs, I get nothing; i.e. it doesn't look like a message is being handed off to the SMTP daemon.

- Finally, man slurm.conf gives:

MailProg
Fully qualified pathname to the program used to send email per user request. The default value is "/bin/mail".


[root@lakers slurm]# which mail
/bin/mail


so the default should be working. Any thoughts?


 

Hi Patrick,

Does mail work from the command line? i.e. ....

$ mail -s Testing pgoetz@...
Test email
. (or ^D)
$

Does mailq, or /var/log/maillog show anything? (p.s. I'm a sendmail
person, so... may not be _exact_ with the postfix thing)

-J

Patrick Goetz
 

On 05/26/2017 10:17 AM, Jason Stover wrote:
Hi Patrick,
Does mail work from the command line? i.e. ....
$ mail -s Testing pgoetz@...
Test email
. (or ^D)
$
Does mailq, or /var/log/maillog show anything? (p.s. I'm a sendmail
person, so... may not be _exact_ with the postfix thing)

Hi Jason -

Yes, I mentioned both of those in my message. I can do this:


$ mail pgoetz@...

and I receive the message no problem. And when I monitor the logs for SMTP activity like so:

# journalctl -f | grep postfix

Nothing appears when I run the Slurm script with the SBATCH command directing it to send me mail when the job finishes. (And yes, the log files do show the normal transaction activity when I use the mail command from the command line.)

If this is working for you, do you have the Slurm MailProg key explicitly configured in /etc/slurm/slurm.conf, or maybe there is some other Slurm configuration parameter that I don't have set up properly?
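One quick way to answer the MailProg question is to look for the key in slurm.conf and fall back to the documented default. The following is a minimal sketch, assuming the config lives at /etc/slurm/slurm.conf as mentioned above; the function name mailprog_from_conf is made up for illustration. On a live system, `scontrol show config | grep -i MailProg` should report the value slurmctld is actually using.

```shell
#!/bin/bash
# Print the MailProg value from a slurm.conf-style file, or the documented
# default (/bin/mail, per the man page excerpt quoted earlier) if it is unset.
# mailprog_from_conf is a hypothetical helper name, not a Slurm tool.
mailprog_from_conf() {
    local conf="${1:-/etc/slurm/slurm.conf}"
    local value=""
    if [ -r "$conf" ]; then
        # Keys in slurm.conf are case-insensitive; this sketch keeps the last match.
        value=$(awk -F= '
            tolower($1) ~ /^[ \t]*mailprog[ \t]*$/ {
                v = $2
                sub(/#.*/, "", v)        # drop a trailing comment
                gsub(/[ \t]/, "", v)     # drop surrounding whitespace
            }
            END { print v }' "$conf")
    fi
    printf '%s\n' "${value:-/bin/mail}"
}
```

Calling `mailprog_from_conf` with no argument reads /etc/slurm/slurm.conf; a commented-out `#MailProg=...` line is ignored, so the default is reported in that case.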


 

Hi Patrick,


Yes, I mentioned both of those in my message. I can do this:
Sorry, missed that. I was going just on general testing...

I'm not sure how SLURM sends mail. Never tested it... Is postfix
listening on all addresses, or only loopback? Meaning, if SLURM is
trying to send the mail out via the 'sms_node' IP, but it's only
listening on 127.0.0.1, then that would fail (unless you have loopback
set as the host name for some reason).

What does: lsof -i:25 -- give you for what it's listening on?

Are there any SLURM logs that have anything about a failure?

And I doubt this, but is there a 'set smtp' entry in either
/etc/nail.rc or ~/.mailrc ?

-J


Ryan Novosielski
 




On May 26, 2017, at 13:05, Jason Stover <jason.stover@...> wrote:

Hi Patrick,


Yes, I mentioned both of those in my message.  I can do this:


Sorry, missed that. I was going just on general testing...

I'm not sure how SLURM sends mail. Never tested it... Is postfix
listening on all addresses, or only loopback? Meaning, if SLURM is
trying to send the mail out via the 'sms_node' IP, but it's only
listening on 127.0.0.1, then that would fail (unless you have loopback
set as the host name for some reason).

What does:  lsof -i:25   -- give you for what it's listening on?

Are there any SLURM logs that have anything about a failure?

And I doubt this, but is there a 'set smtp' entry in either
/etc/nail.rc or ~/.mailrc ?

This is very likely to be the cause. Good catch, Jason. Postfix does not default to listening on non-loopback interfaces on CentOS.

--
____
|| \\UTGERS,       |---------------------------*O*---------------------------
||_// the State     |         Ryan Novosielski - novosirj@...
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\    of NJ     | Office of Advanced Research Computing - MSB C630, Newark
    `'


Meij, Henk
 

postfix can be instructed to behave as a relay host in general, so in sms:/etc/postfix/main.cf


relayhost=[mail-internal-for-example.utexas.edu]


will send all mail on all interfaces along to your institutional mail server for delivery.


-Henk






Patrick Goetz
 

On 05/26/2017 12:05 PM, Jason Stover wrote:
What does: lsof -i:25 -- give you for what it's listening on?
It's listening on localhost, which AFAIK is all that's required:

[root@lakers ~]# lsof -i:25
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
master 4309 root 13u IPv4 54881699 0t0 TCP localhost:smtp (LISTEN)
master 4309 root 14u IPv6 54881700 0t0 TCP localhost:smtp (LISTEN)

Again, it would have to be listening on localhost because the mail command works; i.e. I can send mail from the host.
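For anyone who wants to script this check, the lsof output above can be summarized mechanically. This is only a sketch: smtp_listen_scope is a hypothetical helper name, and it assumes the NAME column looks like the output shown here (localhost:smtp for loopback, *:smtp for all interfaces).

```shell
#!/bin/bash
# Read `lsof -i:25` output on stdin and summarize where the SMTP daemon listens.
# Prints "loopback-only", "all-interfaces", "specific-address", or "not-listening".
# smtp_listen_scope is a hypothetical helper name.
smtp_listen_scope() {
    awk '
        /\(LISTEN\)/ {
            seen = 1
            name = $(NF - 1)                 # e.g. localhost:smtp or *:smtp
            if (name ~ /^\*:/) wild = 1
            else if (name !~ /^(localhost|127\.0\.0\.1|\[::1\]):/) other = 1
        }
        END {
            if (!seen) print "not-listening"
            else if (wild) print "all-interfaces"
            else if (other) print "specific-address"
            else print "loopback-only"
        }'
}

# Usage (not run here):
#   lsof -i:25 | smtp_listen_scope
```

A loopback-only result is not a problem by itself for mail submitted on the same host, which matches the observation that the mail command works.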

Also, the MailProg Slurm configuration parameter seems to indicate that Slurm is just using /bin/mail. I thought I figured out the issue, as when I run mail, it defaults to /usr/bin/mail instead of /bin/mail, but I just used /bin/mail explicitly and was still able to send mail to myself, so it's not that.


Are there any SLURM logs that have anything about a failure?
I hadn't thought to check the Slurm logs, but tail -f'ing /var/log/slurmctld reveals nothing out of the ordinary when I run the script.


And I doubt this, but is there a 'set smtp' entry in either
/etc/nail.rc or ~/.mailrc ?

I was surprised that there actually is an /etc/mail.rc file, but there's nothing in there and the account I'm using to send mail doesn't have a .mailrc file at all.

Looks like I'm still at square one with this and am not even sure how to debug further....
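One way to debug further is to point MailProg at a small wrapper that records every invocation before handing off to the real mailer. The sketch below is illustrative only: the log path, the MAILPROG_LOG and REAL_MAILPROG variable names, and the assumption that slurmctld's arguments can simply be passed through are not taken from the Slurm documentation.

```shell
#!/bin/bash
# Hypothetical MailProg wrapper: record every invocation, then hand off to the
# real mailer. The log path, the REAL_MAILPROG and MAILPROG_LOG variable names,
# and the idea that the arguments can be passed straight through are
# assumptions for illustration.
log_and_mail() {
    local log="${MAILPROG_LOG:-/tmp/slurm-mailprog.log}"
    local real="${REAL_MAILPROG:-/bin/mail}"
    # One line per call: timestamp, calling user, and the arguments received.
    printf '%s user=%s args=%s\n' "$(date '+%F %T')" "$(id -un)" "$*" >> "$log"
    "$real" "$@"
}

# Only dispatch when arguments were passed, i.e. when called as a mail program.
if [ "$#" -gt 0 ]; then
    log_and_mail "$@"
fi
```

With MailProg set to the full path of such a wrapper in slurm.conf (and once slurmctld has picked up the change), an empty log after a job finishes suggests the problem is upstream of the mailer: Slurm never tried to send anything.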


Prentice Bisbal <pbisbal@...>
 


On 05/26/2017 01:55 PM, Meij, Henk wrote:

postfix can be instructed to behave as a relay host in general, so in sms:/etc/postfix/main.cf


relayhost=[mail-internal-for-example.utexas.edu]


will send all mail on all interfaces along to your institutional mail server for delivery.




This can be problematic for a number of reasons. Typically, the mail server will do various checks to protect against spam. Having your cluster nodes behind a NAT-ing router can lead to problems (IP of system contacting the mail server doesn't match hostname in mail headers, etc.). I forget all the details, but this has happened to me before.

The fix was to come up with the proper rewriting rules for the message and/or envelope. In my case, I think I had all the mail from the clients sent to the 'head' node, and then that postfix server did the necessary rewriting and then relayed the mail to the main mail server. I wish I could offer you more details from my experience, but that was about 8-9 years ago at another employer. It took me a little while to get it working properly, but then it worked perfectly for the rest of that cluster's life.

Prentice


 

Hi Patrick,

It's listening on localhost, which AFAIK is all that's required:
IIRC, the mail binary spawns off a call to the sendmail binary
(postfix should have its own sendmail command) to send the mail, not
necessarily contacting the smtp server unless there's a configuration
setting to do so. So, the actual submission line would be something
like:

sendmail to@... < /tmp/tmp.message

So, it really is going to depend on how exactly mail gets called, and
if SLURM is setting the smtp option itself... or if it's doing
something else.
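To test the sendmail hand-off directly, without going through mail or mailx, a message can be piped into `sendmail -t`, which reads the recipients from the headers. A minimal sketch follows; build_message is a hypothetical helper, the address is a placeholder, and /usr/sbin/sendmail is assumed to be the Postfix-provided binary.

```shell
#!/bin/bash
# Build a minimal message on stdout suitable for piping into `sendmail -t`,
# which reads the recipients from the headers. build_message is a hypothetical
# helper name; the addresses below are placeholders.
build_message() {
    local to="$1" subject="$2" body="$3"
    printf 'To: %s\n' "$to"
    printf 'Subject: %s\n' "$subject"
    printf '\n'                      # blank line separates headers from body
    printf '%s\n' "$body"
}

# Usage (not run here):
#   build_message user@example.com "Testing" "Test email" | /usr/sbin/sendmail -t
```

If this shows up in the mail log but a Slurm job notification does not, the difference lies in whether and how Slurm invokes the mail program rather than in Postfix.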

I agree, that it *should* just be working from the looks of it... but
something in how SLURM is sending the mail is different. I'll try
messing on a SLURM cluster and see if I can find anything.

-J







Meij, Henk
 

The SMS is receiving the mails in question from a private internal network in this scenario, so no spam, no issues. The SMS, I assume, is behind a VPN/firewall. Any unresolved addresses on the SMS will be delivered locally.

-Henk



Patrick Goetz
 

On 05/26/2017 05:56 PM, Jason Stover wrote:
I agree, that it *should* just be working from the looks of it... but
something in how SLURM is sending the mail is different. I'll try
messing on a SLURM cluster and see if I can find anything.
Thanks!


 

Hi Patrick,

I was just able to test. On our setup I didn't have an issue with
slurm sending mail. The mail was sent from the scheduler to an
external address. It's running postfix, and only listening on the
loopback address as well (though, I think in this specific case that
should be irrelevant).

We don't have MailProg set in slurm.conf, so it's defaulting to
/bin/mail ... which eventually links to /bin/mailx. And when the job ran,
/var/log/maillog showed the mails being sent out.

My file included the same commands yours does... but one thing I
didn't think of.... Can you remove the empty '#' lines so the #SBATCH
lines are all contiguous rather than separated? I've seen some issues where a
blank line like that stops the processing of the #SBATCH directives...
Also, try putting --mail-type before --mail-user.
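For reference, the sbatch documentation says that #SBATCH directives are read until the first line that is neither a comment nor whitespace, so a stray command early in the script is what cuts processing short. The sketch below approximates that rule to list which directives would be honored; list_sbatch_directives is a hypothetical helper, not part of Slurm, and it does not reproduce sbatch's parser exactly.

```shell
#!/bin/bash
# Approximate which #SBATCH directives sbatch would honor: scanning stops at
# the first line that is neither blank nor a comment. list_sbatch_directives
# is a hypothetical helper, not part of Slurm.
list_sbatch_directives() {
    awk '
        /^[ \t]*$/  { next }                 # blank line: keep scanning
        /^#SBATCH[ \t]/ { sub(/^#SBATCH[ \t]+/, ""); print; next }
        /^#/        { next }                 # shebang or plain comment
        { exit }                             # first real command ends the scan
    ' "$1"
}

# Usage (not run here):
#   list_sbatch_directives job.sh
```

Any #SBATCH line that appears after the first command (for example after `srun hostname`) is not printed, since under that rule it is just a comment.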

-J






Patrick Goetz
 

Hi -

So, I have Slurm emailing, and once firewall and basic postfix configuration issues are straightened out, it does just work as advertised. It turns out that neither the empty # lines nor having the --mail-type after the email address makes a difference.

This issue was originally prompted by a user reporting that email wasn't working; after that, it was a rather embarrassing PEBKAC (problem exists between keyboard and chair). I was so focused on looking at the mail pipeline that I was just absent-mindedly running the submission shell script by hand rather than running it through the sbatch command. (Run directly by bash, the #SBATCH lines are just comments, so Slurm never sees the mail request.) Super dumb mistake made by an admin who is not a regular user. However, now that I'm qualified to be the POTUS, I will retire from the list and move on to start my campaign covfefe.

Thanks for everyone's help and sorry to waste valuable list bandwidth on my really dumb mistakes!
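The mistake described above, running the job script by hand instead of through sbatch, can be caught with a small guard at the top of the script. This is a sketch, not a Slurm feature: require_slurm_job is a hypothetical helper name, and it relies on Slurm setting SLURM_JOB_ID in the environment of the jobs it starts.

```shell
#!/bin/bash
# Refuse to continue when the script is executed directly instead of through
# sbatch. Slurm sets SLURM_JOB_ID in the environment of every job it starts;
# require_slurm_job is a hypothetical helper name.
require_slurm_job() {
    if [ -z "${SLURM_JOB_ID:-}" ]; then
        echo "Not running under Slurm; submit with: sbatch <script>" >&2
        return 1
    fi
    return 0
}

# In a job script, after the #SBATCH lines:
#   require_slurm_job || exit 1
```

Note that SLURM_JOB_ID is also set inside an interactive allocation, so the guard only distinguishes "inside some Slurm job" from "plain shell", which is enough to catch this particular slip.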
