Problems PXE booting #warewulf #pxe #ipxe #dhcp
SimCenter Admin
Guys,
I'm desperately trying to get a cluster to boot and have had no luck. This is old hardware (Dell R620s for storage and Dell C6220s for compute) which needs to boot statelessly. I have a problem that feels a lot like the one outlined in this OpenHPC issue; however, I am not able to fix it. The group at UAB was using a VirtualBox environment, and I'm using actual hardware (Dell, Ethernet for provisioning, IB for fabric if I can get it working/configured). I get the same errors that Mike and John-Paul were getting in their issue, but I have no idea how to fix them.
Please ask me for any relevant updates/information that I may have left out. I don't pretend to understand what PXE/iPXE is doing or how it works, but maybe some experts out there can help me figure this out? Which side of the equation is this likely on? Could it be firmware issues? Software config on the SMS side (Warewulf, DHCP, etc.)? What other logs could I look at, or how could I generate more verbose logs? I've Googled endlessly, to no avail. It feels like many people have similar problems, but I cannot find a solution for mine in any posts I've come across. If I cannot make headway on this within the next couple of days, it's likely going to be too late, as I'll run out of time to finish this. It's frustrating to get stuck so early on; I spent most of this weekend trying anything I could think of simply to get these hosts to PXE boot, without success. Thank you all very much for any help you can provide. Have a great evening. [1] [2] [3]
Your problem may be due to your choice of BeeGFS as your volume format. We have experimented with various file system choices and have only been able to get past this problem by starting out with a boot volume formatted in ext4, then adding other disks (xfs in our case) once the system has booted.
We would also be interested in finding ways around this restriction and understanding the reason for it, but for now, we set up the partitions with an ext4 boot volume and add other differently formatted volumes once the system has booted.
Alan
On Sep 8, 2019, at 6:30 PM, SimCenter Admin via Groups.Io <simcenter.admin@...> wrote:
Looking at the iPXE error (http://ipxe.org/err/280860), it appears that it can't open the network device. From the iPXE error, the error is being thrown from this code block:
------
}
------
You can try replacing undionly.kpxe with: http://boot.ipxe.org/undionly.kpxe
That _may_ help.
-J
Aaron Blakeman
On Sun, Sep 8, 2019 at 05:03 PM, Alan Sill wrote:
> Your problem may be due to your choice of BeeGFS as your volume format.
I agree that this is worth checking out. I've hit similar issues with non-ext4 filesystems. One other thing to try is changing the boot mode. If I recall correctly, someone else faced a similar issue, and the root cause was that their machine was trying to boot in legacy mode when it needed to be using UEFI.
Thanks,
Aaron
jose_d
Is the relevant switch port in PortFast / access mode?
1) In my experience, some Supermicro servers were not able to iPXE-boot because STP negotiation delayed bringing the port up. So I'd disable all fancy features on the switch port.
> how could I generate more verbose logs?
In the end, one can always run tcpdump on the management interface with a host filter to understand/guess what's wrong.
cheers
Josef Dvoracek
Institute of Physics @ Czech Academy of Sciences
cell: +420 608 563 558 | office: +420 266 052 669 | fzu phone nr. : 2669
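As a rough illustration of the tcpdump suggestion, the sketch below builds a capture filter for one node's PXE traffic. It matches on the MAC address because the node has no IP address until DHCP completes. The helper name pxe_filter, the interface name eth1, and the MAC address are placeholders, not values from this thread.

```shell
#!/bin/sh
# Build a tcpdump filter for one node's PXE traffic.
# $1 = MAC address of the node's provisioning NIC (placeholder below).
# Ports 67/68 are DHCP; port 69 is the initial TFTP request.
pxe_filter() {
    printf 'ether host %s and (port 67 or port 68 or port 69)' "$1"
}

# Usage on the provisioning server, as root (eth1 is an assumed interface name):
#   tcpdump -n -e -i eth1 "$(pxe_filter 00:11:22:33:44:55)"
```

Note that the TFTP data transfer moves to other UDP ports after the first request, so if the capture shows the request but no file transfer, retrying with only the "ether host" part of the filter should show the rest of the exchange.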
UTC, SimCenterAdmin <SimCenter.Admin@...>
I'll be looking into this more soon; thank you for your answers. I wonder if the comment here about using HTTP in lieu of TFTP is relevant, and how I can check/confirm which one I'm using and which one I'm supposed to be using? I've checked the BIOS settings but am not able to find anything obvious about this.
Josef, thank you for your note too. I am not a network expert, and this network is a little complicated. The servers are on a VLAN that isolates them behind the login node; also, their management network is on an ancient Dell PowerConnect 3348 (which isn't even Gigabit) which has a single uplink to the main network infrastructure on which the VLAN is configured. I don't even know how to access or manage the switch. Is there any way to infer this from outside the switch? I'll have to loop in my networking team for help, and maybe see if we can find another switch and simplify this infrastructure.
I'll be following up just as soon as I'm able; I've a lot of work to do!
> The servers are on a VLAN that isolates them behind the login node
I was also going to ask about the network setup. Meaning, from the network the nodes are on, can you actually do:
telnet 192.168.222.1 80
(I think that's the right IP...) and get a connection? Are DHCP requests routed over the network, but there isn't actually a route back to the provisioner from the segment the compute nodes are on for other traffic?
> I wonder if the comment here about using http in lieu of tftp is relevant here and how I can check/confirm which I'm using
TFTP (in.tftpd) is trying to send the undionly.kpxe file.
-J
On Mon, Sep 9, 2019 at 10:00 AM UTC, SimCenterAdmin <SimCenter.Admin@...> wrote:
UTC, SimCenterAdmin <SimCenter.Admin@...>
OK, I'm not sure what the actual problem was, but eliminating the old crappy switch from the picture has (mostly) fixed my problem. I am now able to boot a server using PXE as expected!
A couple of caveats, I think: the PXE boot appears to fail initially, and I see the same DHCP logs that I had before, but it now seems to fall back on something that works. See [1]. Furthermore, the console on the compute node also seems to indicate that something is wrong with PXE; then, like I said, it actually boots. For what I'm talking about, see [2]. Is this normal, or should I do some additional tweaking?
Thank you all for the help; it didn't even occur to me that the switch could be the problem!
[1]
[2]
UTC, SimCenterAdmin <SimCenter.Admin@...>
I already have a follow-up question! I currently have several nodes booting, which is step one! I need to provide an xfs filesystem on at least one of these hosts; however, the kernel I seem to be getting doesn't support xfs. I get: "mount: unknown filesystem type 'xfs'"
I think this may have to do with the fact that I'm again 'falling through' to the /warewulf/ipxe/bin-i386-pcbios/undionly.kpxe file? But I'm not sure about this. There are no xfs module files located in the /lib/modules/ directory.
Can anybody help me figure out why this is? What other information do you need from me?
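One way to confirm this symptom is to look for the xfs module file under the kernel's modules tree on the booted node. The sketch below does only that; the helper name module_present is made up for illustration, and it assumes modules are shipped as files named like xfs.ko or xfs.ko.xz (xfs compiled directly into the kernel would not show up this way).

```shell
#!/bin/sh
# Succeeds if a module file such as xfs.ko or xfs.ko.xz exists under a modules tree.
# $1 = module name
# $2 = modules directory (optional; defaults to the running kernel's tree)
module_present() {
    dir="${2:-/lib/modules/$(uname -r)}"
    find "$dir" -name "$1.ko*" 2>/dev/null | grep -q .
}

# Usage on a booted compute node:
#   module_present xfs && echo "xfs module found" || echo "xfs module missing"
```

If the module is missing on the node but present on the provisioning server, the image the node booted from is the likely place to look.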
UTC, SimCenterAdmin <SimCenter.Admin@...>
Guys, I'm sorry for the noise; I have very little real experience with this technology and I'm trying to figure it out as I go! I figured my xfs issue out. For those out there who might come across this, I simply had to add xfs to the /etc/warewulf/bootstrap.conf file (I added it to the end of the line that also contains ext{2,3,4} etc.) and rebuild the bootstrap image.
I'm still not convinced that my servers are PXE booting correctly, but they are booting, which is a step in the right direction!
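For anyone following the same route, a small check like the one below can confirm that xfs is actually listed in the bootstrap configuration before rebuilding. This is only a sketch: the helper name has_xfs is hypothetical, and it just looks for the word xfs on any non-comment line rather than parsing the file's real syntax.

```shell
#!/bin/sh
# Succeeds if the given config file mentions "xfs" on a non-comment line.
# $1 = path to the config file, e.g. /etc/warewulf/bootstrap.conf
has_xfs() {
    grep -v '^[[:space:]]*#' "$1" | grep -qw 'xfs'
}

# Usage on the SMS:
#   has_xfs /etc/warewulf/bootstrap.conf || echo "xfs is not listed yet"
```

After the edit, the bootstrap image has to be rebuilt for the kernel the nodes boot; the OpenHPC install recipe does this with wwbootstrap `uname -r`, which assumes the nodes use the same kernel the SMS is running. Nodes need a reboot to pick up the new bootstrap.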
This is exactly what we ran into. Adding xfs to the conf file is not enough.
Tried many experiments but no good so far.
We’re thus also interested in the answer to this question if one is available.
Alan
UTC, SimCenterAdmin <SimCenter.Admin@...>
So, to be clear, what I am saying is that what I've done does seem to work. Let me elaborate:
Again, I'm just following my nose, but, unless it was a happy accident, the steps above were enough to allow me to mount an xfs filesystem in a stateless image successfully.
I'm curious if I've missed anything important, or if I'm making this too complicated.
Meij, Henk
Very interesting. Here is what I do. Export my local XFS filesystem on the home directory file server 10.11.103.42 then mount it via NFS on the nodes.
/etc/fstab
/dev/mapper/VolGroup00-lvhome /home xfs defaults,usrquota,grpquota 0 1
/etc/exports
/home 10.11.0.0/16(rw,no_root_squash,async)
On the compute node install (latter not needed unless v4.1 is used)
yum -y install nfs-utils nfs-utils-lib nfs4-acl-tools
On the nodes mount /home, vers=3 so we observe human readable users/groups
/etc/fstab
10.11.103.42:/home /home nfs defaults,vers=3 0 0
-Henk