Problems PXE booting #warewulf #pxe #ipxe #dhcp


SimCenter Admin
 

Guys,

I'm desperately trying to get a cluster to boot and have had no luck. This is old hardware (Dell R620s for storage and Dell C6220s for compute) which needs to boot statelessly. My problem feels a lot like the one outlined in this OpenHPC issue, but I have not been able to fix it. The group at UAB was using a VirtualBox environment, whereas I'm using actual hardware (Dell, Ethernet for provisioning, IB for fabric if I can get it working/configured).

I get the same errors that Mike and John-Paul were getting in their issue, but I have no idea how to fix it. 

  • DHCP is working; there are records of this in /var/log/messages, see [1]. Note that this image contains what I think are hints, like "Error code 0: TFTP Aborted".
  • I've added -vvv to the server_args setting in /etc/xinetd.d/tftp, but this doesn't really seem to give any extra information.
  • Also, see [2] and [3] for what the client shows when it tries to PXE boot.
  • Also note that the DHCP config appears to be 'falling through' to /warewulf/ipxe/bin-i386-pcbios/undionly.kpxe, just like John-Paul observed here, but I cannot figure out why. The file(s) referenced in the first line of this file (filename "http://192.168.222.1/WW/ipxe/cfg/${mac}") do exist.
  • I have updated the firmware on these machines to the newest possible (BIOS, network, etc.) and the behavior does not change.
  • I have attached salient output from wwsh in the warewulf.script file and I'd appreciate it if somebody could take a look. I've tried this several dozen times and ways, so at this point I'm not sure I remember everything I've tried.
  • I try to remember to restart DHCP (even though Warewulf seems to do that itself most of the time) as well as update PXE (wwsh pxe update) every time I make changes.
  • The provisioning interface is named 'bond1' and it is, as its name suggests, a bond (2x1GbE links). I have done what the install recipe suggests to support this: wwsh provision set --postnetdown=1 --kargs=net.ifnames=1,biosdevname=1 beegfs*

Please ask me for any relevant updates/information that I may have left out. 
I don't pretend to understand what PXE/iPXE is doing nor how it works, but maybe some experts out there can help me figure this out? What side of the equation is this likely on? Could it be firmware issues? Software config on the SMS side (warewulf, DHCP, etc.)? What other logs could I look at or how could I generate more verbose logs?

I've Googled endlessly, to no avail. It feels like many people have similar problems, but I cannot find a solution for mine in any posts I've come across. If I cannot make headway on this within the next couple of days, it's likely going to be too late, as I'll run out of time to finish this. It's frustrating to get stuck so early on; I spent most of this weekend trying anything I could think of simply to get these hosts to PXE boot, to no avail.

Thank you all very much for any help you can provide. Have a great evening.


[1] 

[2]


[3]


 

Your problem may be due to your choice of BeeGFS as your volume format. We have experimented with various file system choices and have only been able to get past this problem by starting out with a boot volume formatted as ext4, then adding other disks (xfs in our case) once the system has booted.

We would also be interested in finding ways around this restriction and understanding the reason for it, but for now, we set up the partitions with an ext4 file system boot volume and add other differently formatted volumes once it has been booted.

Alan



 

Looking at the iPXE error ( http://ipxe.org/err/280860 ), it appears that it can't open the network device. According to that page, the error is being thrown from this code block:

------
/* Avoid calling transmit() on unopened network devices */
if ( ! netdev_is_open ( netdev ) ) {
        rc = -ENETUNREACH;
        goto err;
}
------

You can try replacing undionly.kpxe with: http://boot.ipxe.org/undionly.kpxe

That _may_ help.

-J




Aaron Blakeman
 

On Sun, Sep 8, 2019 at 05:03 PM, Alan Sill wrote:
> Your problem may be due to your choice of BeeGFS as your volume format.

I agree that this is worth checking out. I've hit similar issues with non-ext4 filesystems.

One other thing to try is changing the boot mode.  If I recall, there was someone else facing a similar issue and the root of the issue was that their machine was trying to boot in legacy mode but it needed to be using UEFI.

Thanks,
Aaron


jose_d
 

Is the relevant switch port in PortFast / access mode?

1) In my experience, some Supermicro servers could not iPXE-boot because STP negotiation delayed the port coming up. So I'd disable all fancy features on the switch port.
2) The PXE vs. UEFI confusion was already mentioned here.

> how could I generate more verbose logs?

In the end, one can always run tcpdump on the management interface with a host filter to understand/guess what's wrong.
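
A capture along those lines could be sketched roughly as below. The interface name bond1 is taken from the original post; the MAC address in the usage line is a placeholder, and the helper name pxe_capture_filter is made up for this example. The helper only builds the pcap filter expression; tcpdump itself is started separately.

```sh
#!/bin/sh
# Build a pcap filter that matches all traffic to or from one node,
# identified by the MAC address of its provisioning NIC.
# Prints the filter on stdout; returns 1 if the argument is not a MAC.
pxe_capture_filter() {
    mac=$(printf '%s' "$1" | tr 'A-F' 'a-f')
    if ! printf '%s\n' "$mac" | grep -Eq '^([0-9a-f]{2}:){5}[0-9a-f]{2}$'; then
        echo "not a MAC address: $1" >&2
        return 1
    fi
    printf 'ether host %s\n' "$mac"
}

# Usage on the provisioning server (placeholder MAC, run as root):
#   tcpdump -nn -e -i bond1 "$(pxe_capture_filter AA:BB:CC:DD:EE:FF)"
```

Filtering on the Ethernet address rather than an IP keeps the DHCP exchange in the capture, since the node has no address yet when it sends its first request.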

cheers

Josef Dvoracek
Institute of Physics @ Czech Academy of Sciences
cell: +420 608 563 558 | office: +420 266 052 669 | fzu phone nr. : 2669

UTC, SimCenterAdmin <SimCenter.Admin@...>
 

I'll be looking into this more soon; thank you for your answers. I wonder if the comment here about using http in lieu of tftp is relevant here and how I can check/confirm which I'm using and which I'm supposed to be using? I've checked in the BIOS settings but am not able to find anything obvious about this.

Josef, thank you for your note too, I am not a network expert and this network is a little complicated. The servers are on a VLAN that isolates them behind the login node, also, their management network is on an ancient Dell PowerConnect 3348 (which isn't even Gigabit) which has a single uplink to the main network infrastructure on which the VLAN is configured. I don't even know how to access or manage the switch. Is there any way to infer this from outside the switch? I'll have to loop in my networking team for help, and maybe see if we cannot find another switch and simplify this infrastructure.

I'll be following up just as soon as I'm able; I've a lot of work to do!



 

>  The servers are on a VLAN that isolates them behind the login node

I was also going to ask about the network setup. Specifically, from the network the nodes are on, can you actually run telnet 192.168.222.1 80 (I think that's the right IP...) and get a connection? Or are DHCP requests routed over the network while there isn't actually a route back to the provisioner, for other traffic, from the segment the compute nodes are on?

> I wonder if the comment here about using http in lieu of tftp is relevant here and how I can check/confirm which I'm using

TFTP (in.tftpd) is trying to send the undionly.kpxe file.
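
To check both transfers from a machine on the node network, something like the following may help. It is only a sketch: it assumes curl is installed there and was built with TFTP support, and it assumes the usual iPXE chainloading arrangement, where the PXE ROM first pulls undionly.kpxe over TFTP and iPXE then pulls its per-node config over HTTP. The helper names are invented for this example, the server address and paths are the ones quoted earlier in the thread, and the MAC in the usage lines is a placeholder.

```sh
#!/bin/sh
# Stage 1: the PXE ROM fetches the iPXE binary over TFTP.
tftp_url() {
    # $1: server address, $2: path as given in the dhcpd "filename" option
    printf 'tftp://%s/%s\n' "$1" "${2#/}"
}

# Stage 2: iPXE fetches its per-node config over HTTP.
ipxe_cfg_url() {
    # $1: server address, $2: node MAC address (colon-separated)
    printf 'http://%s/WW/ipxe/cfg/%s\n' "$1" "$2"
}

# Usage from a host on the node network (placeholder MAC):
#   curl -sS -o /dev/null "$(tftp_url 192.168.222.1 /warewulf/ipxe/bin-i386-pcbios/undionly.kpxe)" && echo "TFTP ok"
#   curl -sSf -o /dev/null "$(ipxe_cfg_url 192.168.222.1 aa:bb:cc:dd:ee:ff)" && echo "HTTP ok"
```

If the TFTP fetch works but the HTTP one does not, that would point at routing or the web server rather than at DHCP or the node firmware.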

-J



UTC, SimCenterAdmin <SimCenter.Admin@...>
 

OK, I'm not sure what the actual problem was, but eliminating the old crappy switch from the picture has (mostly) fixed my problem. I am now able to boot a server using PXE as expected!

A couple of caveats, I think: the PXE boot appears to fail initially (I see the same DHCP logs that I had before), but it now seems to fall back on something that works. See [1]. Furthermore, the console on the compute node also seems to indicate that something with PXE is wrong; then, like I said, it actually boots. For what I'm talking about, see [2]. Is this normal, or should I do some additional tweaking?

Thank you all for the help; it didn't even occur to me that the switch could be the problem!



[1]



[2]




UTC, SimCenterAdmin <SimCenter.Admin@...>
 

I already have a follow up question! I currently have several nodes booting which is step one! I need to provide an xfs filesystem on at least one of these hosts, however, the kernel I seem to be getting doesn't support xfs. I get: "mount: unknown filesystem type 'xfs'"

I think this may have to do with the fact that I'm again 'falling through' to the /warewulf/ipxe/bin-i386-pcbios/undionly.kpxe file? But, I'm not sure about this. There are no xfs module files located in the /lib/modules/ directory.
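
For anyone wanting to repeat that check, a small helper like this could confirm whether a module file is present under a given modules directory. The function name is made up for this sketch, and it only looks for module files on disk, so a filesystem driver compiled directly into the kernel would not show up.

```sh
#!/bin/sh
# Report whether a kernel module file (name.ko, name.ko.xz, ...) exists
# anywhere below a modules directory. Returns 0 if found, 1 otherwise.
has_kernel_module() {
    moddir=$1
    name=$2
    [ -n "$(find "$moddir" -type f -name "$name.ko*" 2>/dev/null | head -n 1)" ]
}

# Usage on a booted node:
#   has_kernel_module "/lib/modules/$(uname -r)" xfs && echo "xfs module present"
```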

Can anybody help me figure out why this is? What other information do you need from me?



UTC, SimCenterAdmin <SimCenter.Admin@...>
 

Guys, I'm sorry for the noise, I have very little real experience with this technology and I'm trying to figure it out as I go! I figured my xfs issue out. For those out there who might come across this, I simply had to add xfs to the /etc/warewulf/bootstrap.conf file (I added it to the end of the line that also contains ext{2,3,4} etc) and rebuild the bootstrap image.

I'm still not convinced that my servers are PXE booting correctly, but they are booting, which is a step in the right direction!



 

This is exactly what we ran into. Adding xfs to the conf file is not enough. Tried many experiments but no good so far. 

We’re thus also interested in the answer to this question if one is available. 

Alan 



UTC, SimCenterAdmin <SimCenter.Admin@...>
 

So, to be clear, what I am saying is that what I've done does seem to work. Let me elaborate:

  1. I installed xfs into the chroot: yum install --installroot=/opt/ohpc/admin/images/centos7.6-beegfs-storage/ xfsprogs (this is required to provide mkfs.xfs in the image, which I needed; however, it was not sufficient to provide the kernel module for xfs, and I was not able to mount the xfs filesystem after I created it). You must rebuild the VNFS image after making this change.
  2. Then I added xfs to the /etc/warewulf/bootstrap.conf file (I added it to the end of the line containing the ext* drivers/modules for what it's worth) and rebuilt the bootstrap image: wwbootstrap $(uname -r).
  3. Just to be safe, I rebuilt the VNFS images, restarted dhcpd and updated the PXE stuff: wwsh pxe update.
Again, I'm just following my nose, but, unless it was a happy accident, the steps above were enough to allow me to mount an xfs filesystem in a stateless image successfully.
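
The steps above can be written down as a dry-run helper that only prints the commands, so they can be reviewed before anything is run. This is a sketch, not a tested recipe: the function name is invented, the wwvnfs --chroot and systemctl restart dhcpd lines are my assumption of how the "rebuild the VNFS" and "restart dhcpd" steps are usually done on an OpenHPC/CentOS 7 master, and the bootstrap.conf edit is left as a manual step because the exact line differs between installs.

```sh
#!/bin/sh
# Print (do not run) the rebuild sequence described above, one command per line.
# $1: chroot path of the node image; $2: kernel version passed to wwbootstrap.
xfs_rebuild_plan() {
    chroot_dir=$1
    kernel=$2
    printf 'yum install --installroot=%s xfsprogs\n' "$chroot_dir"
    printf '# edit /etc/warewulf/bootstrap.conf by hand: add xfs to the ext* line\n'
    printf 'wwbootstrap %s\n' "$kernel"
    printf 'wwvnfs --chroot %s\n' "$chroot_dir"    # assumed VNFS rebuild command
    printf 'systemctl restart dhcpd\n'             # assumed service name
    printf 'wwsh pxe update\n'
}

# Usage on the master:
#   xfs_rebuild_plan /opt/ohpc/admin/images/centos7.6-beegfs-storage "$(uname -r)"
```

The VNFS is rebuilt once here, after both the package install and the bootstrap rebuild, which covers the two rebuilds mentioned in steps 1 and 3.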

I'm curious if I've missed anything important, or if I'm making this too complicated.

From: OpenHPC-users@groups.io <OpenHPC-users@groups.io> on behalf of Alan Sill via Groups.Io <Alan.Sill@...>
Sent: Tuesday, September 10, 2019 6:05 PM
To: via Groups.Io <SimCenter.Admin@...>
Cc: OpenHPC-users@groups.io <OpenHPC-users@groups.io>
Subject: Re: [openhpc-users] Problems PXE booting #dhcp #warewulf #pxe #ipxe
 
This is exactly what we ran into. Adding xfs to the conf file is not enough. Tried many experiments but no good so far. 

We’re thus also interested in the answer to this question if one is available. 

Alan 



Meij, Henk
 

Very interesting. Here is what I do: I export my local XFS filesystem from the home directory file server (10.11.103.42), then mount it via NFS on the nodes.

/etc/fstab
/dev/mapper/VolGroup00-lvhome             /home                          xfs        defaults,usrquota,grpquota              0 1
/etc/exports
/home                                   10.11.0.0/16(rw,no_root_squash,async)

On the compute nodes, install the following (the latter is not needed unless v4.1 is used):
yum -y install nfs-utils nfs-utils-lib nfs4-acl-tools

On the nodes, mount /home with vers=3 so that we see human-readable users/groups:
/etc/fstab
10.11.103.42:/home     /home   nfs  defaults,vers=3 0 0

-Henk
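
A rough sketch of applying and checking that NFS setup by hand, before relying on /etc/fstab. The exportfs, showmount, and mount commands are standard NFS utilities but are not mentioned in the message; the run helper, the DRY_RUN switch, and the function names are hypothetical. By default the script only prints the commands.

```shell
#!/bin/sh
set -eu

SERVER="${SERVER:-10.11.103.42}"   # file server address from the message above
DRY_RUN="${DRY_RUN:-1}"            # 1 = only print the commands (default), 0 = run them

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# On the file server: re-read /etc/exports and list what is exported.
apply_exports() {
    run exportfs -ra
    run exportfs -v
}

# On a node: confirm the export is visible, then mount it with NFSv3.
mount_home() {
    run showmount -e "$SERVER"
    run mount -t nfs -o vers=3 "$SERVER:/home" /home
}

apply_exports
mount_home
```

With this approach the nodes never need the xfs kernel module, since only the file server touches the XFS filesystem directly.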
