xcat efi stateful on instdisk=nvme0n1 stuck at "Generate the repository for the installation" #pxe #openhpc


Jesse Stacey
 

Hi all. I have already installed four stateful nodes using a partitionfile that uses instdisk="/dev/nvme0n1", and now I'm having trouble adding more. Not sure what I have changed, but when I ssh to a provisioning node, the xcat log shows it's stuck here:
xcatprobe osdeploy -n n10 [-V]
[n10] 17:21:18 INFO Found /dev/sdf, generate partition file...
[n10] 17:21:18 INFO Found /dev/sdf, generate partition file...
[n10] 17:21:18 INFO Generate the repository for the installation
[n10] 17:21:18 INFO Generate the repository for the installation

I see both /tmp/partitionfile and /tmp/xcat.install_disk on the node, and xcat.install_disk always names a seemingly random, incorrect drive on each node. It seems to be ignoring my partitionfile and trying to install to another disk, but it never makes it past generating the repository. I don't know what has changed since I provisioned the first four nodes successfully. I suspect nodeset is not making the necessary changes to /install/autoinst. Where should I look for clues? Thanks!
lsdef -t site clustersite
Object name: clustersite
    SNsyncfiledir=/var/xcat/syncfiles
    auditnosyslog=0
    auditskipcmds=ALL
    blademaxp=64
    cleanupxcatpost=no
    consoleondemand=yes
    databaseloc=/var/lib
    db2installloc=/mntdb2
    dhcpinterfaces=eno1
    dhcplease=43200
    dnshandler=ddns
    domain=local
    enableASMI=no
    forwarders=132.206.44.21,8.8.8.8
    fsptimeout=0
    installdir=/install
    ipmimaxp=64
    ipmiretries=3
    ipmitimeout=2
    master=10.1.1.1
    maxssh=8
    nameservers=10.1.1.1
    nodesyncfiledir=/var/xcat/node/syncfiles
    powerinterval=0
    ppcmaxp=64
    ppcretry=3
    ppctimeout=0
    sharedtftp=1
    sshbetweennodes=1
    syspowerinterval=0
    tftpdir=/tftpboot
    timezone=America/Montreal
    useNmapfromMN=no
    vsftp=n
    xcatconfdir=/etc/xcat
    xcatdebugmode=2
    xcatdport=3001
    xcatiport=3002



Jesse Stacey
 

I had a look inside /install/autoinst and noticed that the nodeset command isn't updating the files in there. I'm thinking nodeset is broken, or perhaps my partitionfile is breaking nodeset so that it can't finish writing the xnba and boot files:

root@cerebra:/install/autoinst # nodeset n6-10 osimage=centos7.6-x86_64-install-compute
root@cerebra:/install/autoinst # ls /install/autoinst/
bak  bak2  n2  n3  n4  n5
root@cerebra:/install/autoinst #


Jesse Stacey
 

I just discovered that the variable $NEXTSERVER was incorrect. The file pre.rh.rhel7 (actually, all the pre.rh files in that folder) needs to be modified to correct how it fetches the NEXTSERVER value; after that, the install should continue on to generate the repository and run the kickstart.

vi /opt/xcat/share/xcat/install/scripts/pre.rh.rhel7
#NEXTSERVER=`cat /proc/cmdline | grep http | head -n 1`
NEXTSERVER=`cat /proc/cmdline | grep http | head -n 1 | cut -d / -f 3 | cut -d : -f 1`
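
To see what the two cut calls are doing, here is a minimal sketch that runs the same pipeline against a string instead of /proc/cmdline. The function name extract_nextserver and the sample command line are made up for illustration; the real command line on a node will look different.

```shell
#!/bin/sh
# Hypothetical helper: same pipeline as the modified NEXTSERVER line,
# but reading the command line from an argument instead of /proc/cmdline.
extract_nextserver() {
    # cut -d / -f 3 : the text between "http://" and the next "/" (host:port)
    # cut -d : -f 1 : drop the ":port" suffix, leaving only the host
    printf '%s\n' "$1" | grep http | head -n 1 | cut -d / -f 3 | cut -d : -f 1
}

# Made-up sample; on a live node you would pass "$(cat /proc/cmdline)".
sample='quiet inst.repo=http://10.1.1.1:80/install/centos7.6/x86_64 inst.ks=http://10.1.1.1:80/install/autoinst/n10'
extract_nextserver "$sample"
```

Note that this only works when no "/" appears on the command line before the first http:// URL, since the third "/"-separated field is taken from the start of the line.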




I found the tip here, which fixes it for RHEL6. Why is it still an issue for RHEL7?!


Jesse Stacey
 

Hmm, not much interest in this thread, but for completeness, and for anyone who prefers to go the OpenHPC / xCAT stateful route, here is how this finally turned out.

First of all, the NEXTSERVER line might be the cause of some issues since it is grepping through output that is subject to change, but the real issue was my partitionfile.
Debian/Ubuntu uses a partitionfile.sh script, while RHEL / CentOS prefers a kickstart-formatted partitionfile. I just went to a working node that had already kickstarted properly and grabbed the partitioning from /root/anaconda-ks.cfg. It looks like this:
part swap --fstype="swap" --ondisk=nvme0n1 --size=4096
part /boot --fstype="xfs" --ondisk=nvme0n1 --size=512
part / --fstype="xfs" --ondisk=nvme0n1 --size=911056
part /boot/efi --fstype="efi" --ondisk=nvme0n1 --size=50 --fsoptions="defaults,uid=0,gid=0,umask=0077,shortname=winnt"
I saved this to /install/custom/my-partitions and attached it to the install image with: 
chdef -t osimage centos7.6-x86_64-install-compute -p partitionfile=/install/custom/my-partitions
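
Before running nodeset, it may be worth checking that every part line in the file names the disk you expect. A minimal sketch, assuming the --ondisk=NAME form used above; parts_off_disk is a hypothetical helper, not an xCAT command:

```shell
#!/bin/sh
# Hypothetical helper: print every "part" line in a kickstart-style
# partition file that does NOT carry --ondisk=<expected disk>.
parts_off_disk() {
    # $1 = partition file, $2 = expected disk name, e.g. nvme0n1
    awk -v want="--ondisk=$2" '
        $1 == "part" {
            ok = 0
            for (i = 2; i <= NF; i++) if ($i == want) ok = 1
            if (!ok) print
        }' "$1"
}

# Example (prints nothing when every part line is on nvme0n1):
#   parts_off_disk /install/custom/my-partitions nvme0n1
```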
Then regenerate the xcat file for the node(s): 
nodeset compute osimage=centos7.6-x86_64-install-compute
This regenerates the files inside /install/autoinst. You can empty this folder before issuing nodeset to make sure they are regenerated.
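
Since the symptom earlier in this thread was nodeset silently leaving /install/autoinst untouched, a quick check after nodeset can confirm that each node got a file. A minimal sketch; missing_autoinst is a hypothetical helper, and it assumes one file per node name as in the ls output earlier in the thread:

```shell
#!/bin/sh
# Hypothetical helper: list the nodes that have no file in the autoinst dir.
missing_autoinst() {
    # $1 = autoinst directory, remaining args = node names
    dir=$1
    shift
    for node in "$@"; do
        [ -e "$dir/$node" ] || printf '%s\n' "$node"
    done
}

# Example (prints the nodes nodeset did not write a file for):
#   missing_autoinst /install/autoinst n6 n7 n8 n9 n10
```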

Then tell xcat to reboot the node(s) for reinstall:
rsetboot compute net
rpower compute boot
If you tail the xcat log (or ssh to the node being installed), just ignore the "Found /dev/sdX" message; it should use the drive you specified in the partitionfile for your osimage regardless.

For more verbose output of what's happening during the xcat kickstart, start xcatprobe (xcatprobe osdeploy -n <node> -V, as in my first post) before flagging the node(s) for reinstall. I usually do this from a screen session. Make sure xcatdebugmode=2 is set in the site table.

That's it. I would have loved to continue using Warewulf, but it gives so many headaches to anyone wanting to use NVMe boot drives in stateful mode that I had no choice but to go with xCAT.