Infiniband boot problem #100

Closed
macdems opened this issue Jan 25, 2018 · 53 comments

@macdems
Contributor

macdems commented Jan 25, 2018

Warewulf 3.8 does not boot nodes over Infiniband. It does not even bootstrap!

3.7 works fine.

After some investigation I found out that the problem is with DHCP and the new
IPXE boot. This is what happens:

  1. Mellanox FlexBoot is requesting DHCP, with a fixed dhcp-client-identifier.
    This is properly handled if the correct HWPREFIX and HWADDR are set for the
    node.

  2. After assigning the IP address, Warewulf 3.7 simply loads bootstrap and
    continues. However, in 3.8 there is an intermediate IPXE (it's IPXE, right?)
    which once again asks for the DHCP address -- this time without providing
    the dhcp-client-identifier. And this is the point where the boot fails!

The only solution I see is to somehow force the second-stage IPXE to provide
the same dhcp-client-identifier as the Mellanox FlexBoot does.
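
A minimal sketch of how the identifier in step 1 appears to be composed, judging from the dhcpd.conf host entries posted later in this thread: the HWPREFIX followed by the 64-bit port GUID stored as HWADDR. The function name ib_client_id is hypothetical and not part of Warewulf; the sample values are the ones reported below for gpunode-0-0.

```shell
#!/bin/sh
# Hypothetical helper (not part of Warewulf): join HWPREFIX and the
# 64-bit GUID into the colon-separated dhcp-client-identifier string.
ib_client_id() {
    prefix=$1   # e.g. ff:00:00:00:00:00:02:00:00:02:c9:00
    guid=$2     # e.g. ec:0d:9a:03:00:9c:81:e2
    printf '%s:%s\n' "$prefix" "$guid"
}

# Print the identifier for the sample node from this thread.
ib_client_id ff:00:00:00:00:00:02:00:00:02:c9:00 ec:0d:9a:03:00:9c:81:e2
```

A second-stage loader that sends only the 48-bit MAC cannot match a host entry keyed on this longer string, which would explain the failure described in step 2.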

@bensallen
Member

bensallen commented Jan 25, 2018

Two things:

@macdems
Contributor Author

macdems commented Jan 25, 2018

I remember I noticed that undionly.kpxe was loaded.

Right now I switched back to 3.7 and it works.

@macdems
Contributor Author

macdems commented Jan 25, 2018

One more thing:

I remember I did the following:

I added a range parameter to dhcpd-template.conf, with some unused values from the local network, and set the node HWADDR in Warewulf to the shortened MAC. It was booting (the original boot was getting an address from the range and the second one got the configured IP). However, the bootstrap had problems configuring the network interfaces, as https://github.com/warewulf/warewulf3/blob/master/provision/initramfs/init#L73 could not find the interface with the shortened HWADDR matching /sys/class/net/$i/address.

@bensallen
Member

I remember I noticed that undionly.kpxe was loaded.

Sounds like it was indeed chainloading then, which is problematic. I would need a tcpdump -s0 from the master during the boot of the node to figure out what's going on. Also need to know which FlexBoot version is in use, and what dhcpd-template.conf looks like.

could not find the interface with the shortened HWADDR matching /sys/class/net/$i/address.

Yes init specifically expects the 64-bit HWADDR when it finds an IB interface.

As an aside, looking at ConnectX-3 FlexBoot releases, I'm hopeful that 3.4.752 (http://www.mellanox.com/related-docs/prod_software/FlexBoot-3_4_752_for_ConnectX3_release_notes.pdf) includes support for the new ${hwaddr} variable. Curiously, it appears this FlexBoot version and ConnectX-3 firmware are now only available via mymellanox instead of the normal website.

@macdems
Contributor Author

macdems commented Jan 25, 2018

I will try my best to investigate this. However, I will not be able to look at it within the next two weeks, and I am not sure our cluster will be free for such tests after that (the people who paid for the hardware are getting more and more impatient to actually see it working). So if I find a time window I will do the tests; however, maybe Irek Porebski, who apparently has a similar issue, will be able to help sooner.

@Irekporeb

Irekporeb commented Jan 26, 2018

Hi,
Here is my dhcpd-template.conf:

[root@headnode warewulf]# cat dhcpd-template.conf
allow booting;
allow bootp;
ddns-update-style interim;
authoritative;

option space ipxe;

# Tell iPXE to not wait for ProxyDHCP requests to speed up boot.
option ipxe.no-pxedhcp code 176 = unsigned integer 8;
option ipxe.no-pxedhcp 1;

option architecture-type   code 93  = unsigned integer 16;

if exists user-class and option user-class = "iPXE" {
    filename "http://%{IPADDR}/WW/ipxe/cfg/${mac}";
} else {
    if option architecture-type = 00:0B {
        filename "/warewulf/ipxe/bin-arm64-efi/snp.efi";
    } elsif option architecture-type = 00:0A {
        filename "/warewulf/ipxe/bin-arm32-efi/placeholder.efi";
    } elsif option architecture-type = 00:09 {
        filename "/warewulf/ipxe/bin-x86_64-efi/ipxe.efi";
    } elsif option architecture-type = 00:07 {
        filename "/warewulf/ipxe/bin-x86_64-efi/ipxe.efi";
    } elsif option architecture-type = 00:06 {
        filename "/warewulf/ipxe/bin-i386-efi/ipxe.efi";
    } elsif option architecture-type = 00:00 {
        filename "/warewulf/ipxe/bin-i386-pcbios/undionly.kpxe";
    }
}

subnet %{NETWORK} netmask %{NETMASK} {
   not authoritative;
   # option interface-mtu 9000;
   option subnet-mask %{NETMASK};
}

# Node entries will follow below

The iPXE version is 1.0.0+. I am using a ConnectX-5.
During booting I can only get to the debug mode shell, not the bootstrap shell.
The last two entries in the "dmesg" are

mlx5_ib: Mellanox Connect-IB Infiniband Driver v2.2-1 (Feb 2014)
random: crng init done

There is no variable $hwaddr, but there is $wwhwaddr, which is the MAC.
I can do the tcpdump for you; just let me know how to upload it here.

Thanks,
Irek

@bensallen
Member

Are you using RHEL's included OFED or Mellanox's OFED?

@Irekporeb

Hi
I didn't notice that I can just drop the file here.

Here is the tcpdump from the session:
warewulf.pcap.001.zip
warewulf.pcap.003.zip
warewulf.pcap.004.zip
warewulf.pcap.002.zip

@Irekporeb

I am using CentOS, so it is the Red Hat one.

@bensallen
Member

So you've not for example installed: http://www.mellanox.com/page/products_dyn?product_family=26&mtag=linux_sw_drivers ?

@Irekporeb

I was trying to use Mellanox OFED but the result was the same.

@Irekporeb

No, I didn't install this OFED.

@bensallen
Member

Rebuild the bootstrap with: wwbootstrap --chroot=/var/chroots/rhel 3.10.0-693.el7.x86_64, replacing chroot path and kernel version as appropriate.

Find the ID of your bootstrap: wwsh object dump -t bootstrap 3.10.0-693.el7.x86_64

Look at the contents of your bootstrap and grep for mlx modules: zcat /var/warewulf/bootstrap/x86_64/<ID>/initfs.gz | cpio -tdv | grep mlx
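
The listing check can also be scripted. The following is a rough sketch, assuming the cpio listing has first been saved to a file (for example by redirecting the output of zcat ... | cpio -t); the function name missing_modules is hypothetical and not a Warewulf command.

```shell
#!/bin/sh
# Hypothetical helper: print each module name that has no .ko file
# (plain or compressed, e.g. .ko.xz) in a saved cpio listing.
missing_modules() {
    listing=$1
    shift
    for mod in "$@"; do
        # Match "/<module>.ko" so that mlx5_ib does not match mlx5_ib_foo.
        if ! grep -q "/${mod}\.ko" "$listing"; then
            printf '%s\n' "$mod"
        fi
    done
}
```

Called as missing_modules /tmp/initfs.list mlx5_core mlx5_ib ib_ipoib, it prints only the modules that are absent from the bootstrap image, so empty output means all of them are present.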

@Irekporeb

Here is the result of the zcat:

[root@headnode 15]# zcat initfs.gz | cpio -tdv | grep mlx
drwxr-xr-x 2 root root 0 Jan 26 13:24 lib/modules/3.10.0-693.11.6.el7.x86_64/kernel/drivers/net/ethernet/mellanox/mlxsw
-rw-r--r-- 1 root root 31988 Jan 26 13:24 lib/modules/3.10.0-693.11.6.el7.x86_64/kernel/drivers/net/ethernet/mellanox/mlxsw/mlxsw_core.ko.xz
-rw-r--r-- 1 root root 19512 Jan 26 13:24 lib/modules/3.10.0-693.11.6.el7.x86_64/kernel/drivers/net/ethernet/mellanox/mlxsw/mlxsw_pci.ko.xz
-rw-r--r-- 1 root root 4040 Jan 26 13:24 lib/modules/3.10.0-693.11.6.el7.x86_64/kernel/drivers/net/ethernet/mellanox/mlxsw/mlxsw_i2c.ko.xz
-rw-r--r-- 1 root root 8764 Jan 26 13:24 lib/modules/3.10.0-693.11.6.el7.x86_64/kernel/drivers/net/ethernet/mellanox/mlxsw/mlxsw_switchib.ko.xz
-rw-r--r-- 1 root root 20876 Jan 26 13:24 lib/modules/3.10.0-693.11.6.el7.x86_64/kernel/drivers/net/ethernet/mellanox/mlxsw/mlxsw_switchx2.ko.xz
-rw-r--r-- 1 root root 95548 Jan 26 13:24 lib/modules/3.10.0-693.11.6.el7.x86_64/kernel/drivers/net/ethernet/mellanox/mlxsw/mlxsw_spectrum.ko.xz
-rw-r--r-- 1 root root 1668 Jan 26 13:24 lib/modules/3.10.0-693.11.6.el7.x86_64/kernel/drivers/net/ethernet/mellanox/mlxsw/mlxsw_minimal.ko.xz
drwxr-xr-x 3 root root 0 Jan 26 13:24 lib/modules/3.10.0-693.11.6.el7.x86_64/updates/drivers/net/ethernet/mellanox/mlx5
drwxr-xr-x 2 root root 0 Jan 26 13:24 lib/modules/3.10.0-693.11.6.el7.x86_64/updates/drivers/net/ethernet/mellanox/mlx5/core
-rw-r--r-- 1 root root 17901552 Jan 26 13:24 lib/modules/3.10.0-693.11.6.el7.x86_64/updates/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.ko
drwxr-xr-x 2 root root 0 Jan 26 13:24 lib/modules/3.10.0-693.11.6.el7.x86_64/updates/drivers/net/ethernet/mellanox/mlx4
-rw-r--r-- 1 root root 5040520 Jan 26 13:24 lib/modules/3.10.0-693.11.6.el7.x86_64/updates/drivers/net/ethernet/mellanox/mlx4/mlx4_en.ko
-rw-r--r-- 1 root root 9019832 Jan 26 13:24 lib/modules/3.10.0-693.11.6.el7.x86_64/updates/drivers/net/ethernet/mellanox/mlx4/mlx4_core.ko
drwxr-xr-x 2 root root 0 Jan 26 13:24 lib/modules/3.10.0-693.11.6.el7.x86_64/updates/drivers/infiniband/hw/mlx5
-rw-r--r-- 1 root root 6363616 Jan 26 13:24 lib/modules/3.10.0-693.11.6.el7.x86_64/updates/drivers/infiniband/hw/mlx5/mlx5_ib.ko
drwxr-xr-x 2 root root 0 Jan 26 13:24 lib/modules/3.10.0-693.11.6.el7.x86_64/updates/drivers/infiniband/hw/mlx4
-rw-r--r-- 1 root root 6214784 Jan 26 13:24 lib/modules/3.10.0-693.11.6.el7.x86_64/updates/drivers/infiniband/hw/mlx4/mlx4_ib.ko
299026 blocks

@bensallen
Member

bensallen commented Jan 26, 2018

Please try using the updated bootstrap.conf in PR #103. Uncomment the modprobe line for ib_ipoib. Rebuild the bootstrap via wwbootstrap .... Attempt a boot of the node. If it still cannot bring up ib0, please show us the lsmod and ip link show ib0 output from a debug shell.

Then try running modprobe mlx5_ib ib_ipoib. If successful see if ib0 exists. If it returns an error please report it.

@Irekporeb

This helps: the node starts booting, but it stops at the VNFS stage.

Jan 26 15:24:30 headnode dhcpd: DHCPOFFER on 192.168.0.100 to ec:0d:9a:9c:81:e2 via ib0
Jan 26 15:24:30 headnode dhcpd: DHCPREQUEST for 192.168.0.100 (192.168.0.1) from ec:0d:9a:9c:81:e2 via ib0
Jan 26 15:24:30 headnode dhcpd: DHCPACK on 192.168.0.100 to ec:0d:9a:9c:81:e2 via ib0
Jan 26 05:26:11 gpunode-0-0.localdomain wwlogger: Starting the provision handler
Jan 26 05:26:11 gpunode-0-0.localdomain wwlogger: Running provision script: adhoc-pre
Jan 26 05:26:12 gpunode-0-0.localdomain wwlogger: Running provision script: filesystems
Jan 26 05:26:12 gpunode-0-0.localdomain wwlogger: Running provision script: getvnfs
Jan 26 05:26:12 gpunode-0-0.localdomain wwlogger: vnfs , , undef
Jan 26 05:26:14 gpunode-0-0.localdomain wwlogger: vnfs , , undef
Jan 26 05:26:18 gpunode-0-0.localdomain wwlogger: vnfs , , undef
Jan 26 05:26:24 gpunode-0-0.localdomain wwlogger: vnfs , , undef

[root@headnode log]# wwsh provision print gpunode-0-0

gpunode-0-0

gpunode-0-0: BOOTSTRAP        = 3.10.0-693.11.6.el7.x86_64
gpunode-0-0: VNFS             = centos7.4
gpunode-0-0: FILES            = dynamic_hosts,group,ifcfg-ib0.ww,munge.key,network,passwd,shadow,slurm.conf
gpunode-0-0: PRESHELL         = FALSE
gpunode-0-0: POSTSHELL        = FALSE
gpunode-0-0: CONSOLE          = UNDEF
gpunode-0-0: PXELINUX         = UNDEF
gpunode-0-0: SELINUX          = DISABLED
gpunode-0-0: KARGS            = "net.ifnames=0 biosdevname=0 "
gpunode-0-0: BOOTLOCAL        = FALSE

I think I have some node name problem now.

@Irekporeb

But this only works if I disable the onboard NIC, so that only IB is available.
I have also noticed that the boot messages changed. Before, when it was failing, it was:

Checking for network device: eth0 (ib0)

Now it is:
Checking for network device: ib0 (ib0)

So there is something which is causing Warewulf to use the wrong network interface.

@Irekporeb

Hi Ben, I still have a problem. I thought it was all right now, but it is not.
I can boot the bootstrap, but at the end I get the message "No network device available" and this drops me to the debug shell. I can see the modules loaded, and in "ip a" the ib0 interface is up but has no IP assigned.
Could you help me to fix that?

Thanks,
Irek

@bensallen
Member

I can boot the bootstrap, but at the end I get the message "No network device available" and this drops me to the debug shell.

@Irekporeb From the debug shell can you show us the contents of cat /proc/cmdline and ip addr show ib0.

Then on the Warewulf server, wwsh object dump gpunode-0-0. Please be cognizant that your IPMI password (if stored) will be in plaintext in this output. Also please find the ipxe cfg under /var/warewulf/ipxe/cfg/ for the IB interface, and the ethernet onboard interface if you have it configured.

Also, from your screenshot above it looks like the bootstrap got past IP addressing but failed to fetch the VNFS. Is this the current state, or is this some intermediate workaround?

@Irekporeb

Irekporeb commented Jan 29, 2018

I have run "wwsh pxe update" and this looks to have changed something.
Now I can't get past the IP addressing.

[root@headnode dhcp]# wwsh object dump gpunode-0-0
Object #0:  OBJECT REF Warewulf::Node=HASH(0x24a3d20) {
    "ARCH" (4) => "x86_64" (6)
    "BOOTSTRAPID" (11) => 15
    "FILEIDS" (7) => ARRAY REF ARRAY(0x24a4128) {
        0:  9
        1:  1
        2:  2
        3:  3
        4:  4
        5:  5
        6:  6
        7:  21
    }
    "NAME" (4) => ARRAY REF ARRAY(0x24a3e58) {
        0:  "gpunode-0-0" (11)
    }
    "NETDEVS" (7) => OBJECT REF Warewulf::ObjectSet=HASH(0x24a3fc0) {
        "ARRAY" (5) => ARRAY REF ARRAY(0x24a3ff0) {
            0:  OBJECT REF Warewulf::Object=HASH(0x24a4038) {
                "HWADDR" (6) => "ec:0d:9a:9c:81:e2" (17)
                "HWPREFIX" (8) => "ff:00:00:00:00:00:02:00:00:02:c9:00" (35)
                "IPADDR" (6) => "192.168.0.100" (13)
                "NAME" (4) => "ib0" (3)
            }
        }
    }
    "NODENAME" (8) => "gpunode-0-0" (11)
    "VNFSID" (6) => 12
    "_HWADDR" (7) => ARRAY REF ARRAY(0x24a3ee8) {
        0:  "ec:0d:9a:9c:81:e2" (17)
    }
    "_HWPREFIX" (9) => ARRAY REF ARRAY(0x24a3e10) {
        0:  "ff:00:00:00:00:00:02:00:00:02:c9:00" (35)
    }
    "_ID" (3) => 22
    "_IPADDR" (7) => ARRAY REF ARRAY(0x24a3f48) {
        0:  "192.168.0.100" (13)
    }
    "_TIMESTAMP" (10) => 1517205815
    "_TYPE" (5) => "node" (4)
}

I had to modify DHCP config to this:

   host gpunode-0-0-ib0 {
      option host-name gpunode-0-0;
      hardware ethernet ec:0d:9a:9c:81:e2;
      option dhcp-client-identifier = ff:00:00:00:00:00:02:00:00:02:c9:00:ec:0d:9a:03:00:9c:81:e2;
      #option dhcp-client-identifier = ff:00:00:00:00:00:02:00:00:02:c9:00:ec:0d:9a:9c:81:e2;
      fixed-address 192.168.0.100;
      next-server 192.168.0.1;
   }

During the booting the node is using both MAC and client-identifier.

Jan 29 11:21:32 headnode dhcpd: DHCPDISCOVER from ff:00:00:00:00:00:02:00:00:02:c9:00:ec:0d:9a:03:00:9c:81:e2 via ib0
Jan 29 11:21:32 headnode dhcpd: DHCPOFFER on 192.168.0.100 to ec:0d:9a:9c:81:e2 via ib0
Jan 29 11:21:48 headnode dhcpd: DHCPREQUEST for 192.168.0.100 (192.168.0.1) from ff:00:00:00:00:00:02:00:00:02:c9:00:ec:0d:9a:03:00:9c:81:e2 via ib0
Jan 29 11:21:48 headnode dhcpd: DHCPACK on 192.168.0.100 to ec:0d:9a:9c:81:e2 via ib0
Jan 29 11:21:49 headnode in.tftpd[234117]: Client 192.168.0.100 finished /warewulf/ipxe/bin-i386-pcbios/undionly.kpxe
Jan 29 11:21:59 headnode dhcpd: DHCPDISCOVER from ec:0d:9a:9c:81:e2 via ib0
Jan 29 11:21:59 headnode dhcpd: DHCPOFFER on 192.168.0.100 to ec:0d:9a:9c:81:e2 via ib0
Jan 29 11:21:59 headnode dhcpd: DHCPREQUEST for 192.168.0.100 (192.168.0.1) from ec:0d:9a:9c:81:e2 via ib0
Jan 29 11:21:59 headnode dhcpd: DHCPACK on 192.168.0.100 to ec:0d:9a:9c:81:e2 via ib0

However, I am not able to register the node with the full client-identifier, as it will use the GUID instead of the MAC, the links to the kernel will be incorrect, and the node will not boot at all.
I am telling you this in case it is relevant.

I have configured only IB for this node.

[root@headnode cfg]# cat  ec:0d:9a:9c:81:e2
#!ipxe
# Configuration for Warewulf node: gpunode-0-0
# Warewulf data store ID: 22
echo Now booting gpunode-0-0 with Warewulf bootstrap (3.10.0-693.11.6.el7.x86_64)
set base http://192.168.0.1/WW/bootstrap
initrd ${base}/x86_64/15/initfs.gz
kernel ${base}/x86_64/15/kernel ro initrd=initfs.gz wwhostname=gpunode-0-0 net.ifnames=0 biosdevname=0  wwmaster=192.168.0.1 wwipaddr=192.168.0.100 wwnetmask=255.255.255.0 wwnetdev=ib0 wwhwaddr=ec:0d:9a:9c:81:e2
boot

My network file looks like this so it should have static IP after bootstrap

[root@headnode cfg]# wwsh file show  ifcfg-ib0.ww
DEVICE=ib0
BOOTPROTO=static
IPADDR=%{NETDEVS::IB0::IPADDR}
NETMASK=%{NETDEVS::IB0::NETMASK}
ONBOOT=yes
NM_CONTROLLED=no
DEVTIMEOUT=5

Sorry for late response but I am in different timezone

boot-problem3

@bensallen
Member

bensallen commented Jan 29, 2018

Please set the hwaddr of the ib0 interface to ec:0d:9a:03:00:9c:81:e2, via wwsh node set --netdev=ib0 --hwaddr ec:0d:9a:03:00:9c:81:e2 gpunode-0-0

Warewulf works with the full 64-bit GUID of IB interfaces, not the chopped 48-bit version (eg. two middle bytes 03:00 dropped). This change will fix your option dhcp-client-identifier problem. It will then generate /var/warewulf/ipxe/cfg/ec:0d:9a:03:00:9c:81:e2. This cfg path will however not quite work with our current dhcpd-template.conf. Specifically https://github.com/warewulf/warewulf3/blob/master/provision/etc/dhcpd-template.conf#L15. ${mac} is the iPXE shell variable for the 48-bit address, in your case ec:0d:9a:9c:81:e2. For this testing please symlink ln -s /var/warewulf/ipxe/cfg/ec:0d:9a:03:00:9c:81:e2 /var/warewulf/ipxe/cfg/ec:0d:9a:9c:81:e2.

The reason for the symlink is we're still waiting on a firmware fix from Mellanox in Flexboot that will let us change that line in dhcpd-template.conf to:

filename "http://%{IPADDR}/WW/ipxe/cfg/${hwaddr}";

Where ${hwaddr} is an iPXE shell variable that represents the 64-bit GUID on IB and the 48-bit MAC on Ethernet.

There shouldn't be any other workarounds needed right now.
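
As an illustration of the workaround described above: for the cards in this thread, the 48-bit address is the 64-bit GUID with its 4th and 5th bytes (03:00) removed, so the symlink name can be derived rather than typed by hand. This is a sketch only; the function name guid_to_mac is hypothetical, and the mapping is assumed from the addresses shown in this issue rather than taken from Warewulf.

```shell
#!/bin/sh
# Hypothetical helper: derive the 48-bit address that iPXE exposes as
# ${mac} from a 64-bit IB port GUID by dropping the 4th and 5th bytes.
guid_to_mac() {
    printf '%s\n' "$1" | cut -d: -f1-3,6-8
}

guid=ec:0d:9a:03:00:9c:81:e2
mac=$(guid_to_mac "$guid")

# Only print the command; run it by hand once the names look right.
echo "ln -s /var/warewulf/ipxe/cfg/$guid /var/warewulf/ipxe/cfg/$mac"
```

Printing the ln command instead of executing it keeps the sketch side-effect free on a machine that has no /var/warewulf tree.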

@Irekporeb

Unfortunately this is not working. This creates a dhcpd.conf like this:

host gpunode-0-0-ib0 {
option host-name gpunode-0-0;
option dhcp-client-identifier = ff:00:00:00:00:00:02:00:00:02:c9:00:ec:0d:9a:03:00:9c:81:e2;
fixed-address 192.168.0.100;
next-server 192.168.0.1;
}

Jan 30 10:27:43 headnode dhcpd: DHCPREQUEST for 192.168.0.100 (192.168.0.1) from ff:00:00:00:00:00:02:00:00:02:c9:00:ec:0d:9a:03:00:9c:81:e2 via ib0
Jan 30 10:27:43 headnode dhcpd: DHCPACK on 192.168.0.100 to ec:0d:9a:9c:81:e2 via ib0
Jan 30 10:27:44 headnode in.tftpd[87431]: Client 192.168.0.100 finished /warewulf/ipxe/bin-i386-pcbios/undionly.kpxe
Jan 30 10:27:54 headnode dhcpd: DHCPDISCOVER from ec:0d:9a:9c:81:e2 via ib0: network 192.168.0.0/24: no free leases

If I add
"hardware ethernet ec:0d:9a:9c:81:e2;"

I can boot but the network is not recognised. I will try to create a new bootstrap like you suggested before and see if this helps.

boot-problem4

@Irekporeb

The new bootstrap didn't change anything. It changed the wwhwaddr.
I have a script from Mellanox which can disable the prefix and use only the MAC in the firmware of the card. But this will be difficult for me to reverse, so I am trying to work with you before I mess with the card firmware.

boot-problem5

@bensallen
Member

bensallen commented Jan 30, 2018

@Irekporeb

I'm trying to figure out why it's grabbing /warewulf/ipxe/bin-i386-pcbios/undionly.kpxe, which is the version of iPXE for a legacy BIOS (before UEFI, or a UEFI in legacy mode). The ConnectX-5 card should be running FlexBoot, its own version of iPXE. Upstream iPXE (undionly.kpxe in this case) doesn't send the dhcp-client-identifier with the prefix and GUID, which is why the chainloaded undionly.kpxe cannot get a DHCP lease without the added hardware ethernet ... line. However, FlexBoot should never need to chainload undionly.kpxe.

This looks like FlexBoot:

Jan 30 10:27:43 headnode dhcpd: DHCPREQUEST for 192.168.0.100 (192.168.0.1) from ff:00:00:00:00:00:02:00:00:02:c9:00:ec:0d:9a:03:00:9c:81:e2 via ib0
Jan 30 10:27:43 headnode dhcpd: DHCPACK on 192.168.0.100 to ec:0d:9a:9c:81:e2 via ib0

This is the iPXE compiled for a legacy BIOS:

Jan 30 10:27:44 headnode in.tftpd[87431]: Client 192.168.0.100 finished /warewulf/ipxe/bin-i386-pcbios/undionly.kpxe
Jan 30 10:27:54 headnode dhcpd: DHCPDISCOVER from ec:0d:9a:9c:81:e2 via ib0: network 192.168.0.0/24: no free leases

Questions:

  1. Is this node in legacy BIOS mode or UEFI mode?
  2. What firmware version / flexboot version is on the ConnectX-5?
  3. With the line hardware ethernet ... removed from dhcpd.conf. Can you grab screenshots of the boot process? Most interested in any screens of once the ConnectX-5 attempts to start DHCP. Trying to figure out what is actually loading undionly.kpxe and if FlexBoot is actually being loaded or something else.
  4. Can you grab a tcpdump -s0 of a boot without hardware ethernet ... as well?

It's quite possible that FlexBoot no longer identifies itself as "iPXE" in its DHCP request, so this line, https://github.com/warewulf/warewulf3/blob/master/provision/etc/dhcpd-template.conf#L14, no longer evaluates true for FlexBoot.

@bensallen
Member

bensallen commented Jan 30, 2018

Actually that gives me another idea to test my very last point. With the previous symlink workaround in place, eg ln -s /var/warewulf/ipxe/cfg/ec:0d:9a:03:00:9c:81:e2 /var/warewulf/ipxe/cfg/ec:0d:9a:9c:81:e2

Try a dhcpd-template.conf like:

allow booting;
allow bootp;
ddns-update-style interim;
authoritative;

option space ipxe;

# Tell iPXE to not wait for ProxyDHCP requests to speed up boot.
option ipxe.no-pxedhcp code 176 = unsigned integer 8;
option ipxe.no-pxedhcp 1;

option architecture-type   code 93  = unsigned integer 16;

filename "http://%{IPADDR}/WW/ipxe/cfg/${mac}";

subnet %{NETWORK} netmask %{NETMASK} {
   not authoritative;
   # option interface-mtu 9000;
   option subnet-mask %{NETMASK};
}

# Node entries will follow below

@Irekporeb

  1. This node is running in BIOS mode. There is a bug with Dell UEFI mode and the ConnectX-5 card, which doesn't empty the buffer. Mellanox informed me that they support only BIOS PXE, but we have both opened a case with Dell to fix the UEFI buffer problem. In the meantime Mellanox says it should work in BIOS mode, as this is their recommended way to do it. FlexBoot is part of the firmware but is different from the UEFI FlexBoot. Below are both versions of the firmware.

  2. I am running the latest firmware (16.21.2010) but with UEFI disabled.
    [root@gpunode00 ~]# flint -d /dev/mst/mt4119_pciconf0 q
    Image type: FS4
    FW Version: 16.21.2010
    FW Release Date: 27.11.2017
    Product Version: rel-16_21_2010
    Rom Info: type=PXE version=3.5.305 devid=4119 cpu=AMD64
    Description: UID GuidsNumber
    Base GUID: ec0d9a03009c81e2 4
    Base MAC: 0000ec0d9a9c81e2 4
    Image VSD: N/A
    Device VSD: N/A
    PSID: MT_0000000010
    Security Attributes: N/A

Headnode with FlexBoot UEFI
[root@headnode warewulf]# flint -d /dev/mst/mt4119_pciconf0 q
Image type: FS4
FW Version: 16.21.2010
FW Release Date: 27.11.2017
Product Version: rel-16_21_2010
Rom Info: type=UEFI version=14.14.22 cpu=AMD64
type=PXE version=3.5.305 devid=4119 cpu=AMD64
Description: UID GuidsNumber
Base GUID: ec0d9a03009bc342 4
Base MAC: 0000ec0d9a9bc342 4
Image VSD: N/A
Device VSD: N/A
PSID: MT_0000000010
Security Attributes: N/A

  1. I am already using this link:
    -rw-r--r-- 1 root root 465 Jan 30 10:01 ec:0d:9a:03:00:9b:c1:a2
    -rw-r--r-- 1 root root 465 Jan 30 10:01 ec:0d:9a:03:00:9c:81:e2
    lrwxrwxrwx 1 root root 23 Jan 23 13:31 ec:0d:9a:9b:c1:a2 -> ec:0d:9a:03:00:9b:c1:a2
    lrwxrwxrwx 1 root root 23 Jan 30 10:03 ec:0d:9a:9c:81:e2 -> ec:0d:9a:03:00:9c:81:e2

  2. do you still want me to run the test with new dhcp template?

@Irekporeb

Irekporeb commented Jan 30, 2018

If you want to take this conversation offline, I am happy to do it.

@Irekporeb

I just got confirmation that Dell will not support UEFI PXE for ConnectX-5 cards. So I have only the BIOS option now.

@bensallen
Member

bensallen commented Jan 30, 2018

  1. do you still want me to run the test with new dhcp template?

Yes, I think this is still a useful test. Please take a tcpdump -s0 from the master during the boot of this node.

@bensallen
Member

If you want to take this conversation offline, I am happy to do it.

The Github issue is fine with me for this discussion. It's nice and searchable.

@Irekporeb

Here are the screenshots and tcpdump:

warewulf-2.zip
Screenshots: 21, 22, 23

This was done with no modification to dhcpd-template.conf, as we have already established that I am using BIOS PXE.

@bensallen
Member

bensallen commented Jan 30, 2018

Screen shots are helpful.

This was done with no modification to dhcpd-template.conf, as we have already established that I am using BIOS PXE.

It's not so much that we care whether it's in BIOS or UEFI mode; we care why it's chainloading undionly.kpxe instead of going directly to loading the ipxe/cfg/... file over http at this point.

Please try a boot with the modified dhcpd-template.conf above, or alternatively adding filename "http://%{IPADDR}/WW/ipxe/cfg/${mac}"; into the node's host stanza.

The idea is to ignore/replace the logic that loads the iPXE config only if user-class = iPXE, thus giving FlexBoot the ipxe/cfg/${mac} filename option directly.

@Irekporeb

Irekporeb commented Jan 30, 2018

I have modified dhcpd-template.conf. This time it didn't need the MAC to boot.

[root@headnode warewulf]# vi dhcpd-template.conf
[root@headnode warewulf]# wwsh dhcp update
Rebuilding the DHCP configuration
Done.
[root@headnode warewulf]# cat /etc/dhcp/dhcpd.conf
# DHCPD Configuration written by Warewulf. Do not edit this file, rather
# edit the template: /etc/warewulf/dhcpd-template.conf

allow booting;
allow bootp;
ddns-update-style interim;
authoritative;

option space ipxe;

# Tell iPXE to not wait for ProxyDHCP requests to speed up boot.
option ipxe.no-pxedhcp code 176 = unsigned integer 8;
option ipxe.no-pxedhcp 1;

option architecture-type   code 93  = unsigned integer 16;

filename "http://192.168.0.1/WW/ipxe/cfg/${mac}";

subnet 192.168.0.0 netmask 255.255.255.0 {
   not authoritative;
   # option interface-mtu 9000;
   option subnet-mask 255.255.255.0;
}

# Node entries will follow below

group {
   # Evaluating Warewulf node: gpunode-0-0 (DB ID:22)
   # Adding host entry for gpunode-0-0-ib0
   host gpunode-0-0-ib0 {
      option host-name gpunode-0-0;
      option dhcp-client-identifier = ff:00:00:00:00:00:02:00:00:02:c9:00:ec:0d:9a:03:00:9c:81:e2;
      fixed-address 192.168.0.100;
      next-server 192.168.0.1;
   }
   # Evaluating Warewulf node: gpunode-0-1 (DB ID:19)
   # Adding host entry for gpunode-0-1-ib0
   host gpunode-0-1-ib0 {
      option host-name gpunode-0-1;
      option dhcp-client-identifier = ff:00:00:00:00:00:02:00:00:02:c9:00:ec:0d:9a:03:00:9b:c1:a2;
      fixed-address 192.168.0.101;
      next-server 192.168.0.1;
   }
}

warewulf-3.pcap.001.zip
warewulf-3.pcap.002.zip
warewulf-3.pcap.003.zip
warewulf-3.pcap.004.zip
warewulf-3.pcap.005.zip
warewulf-3.pcap.006.zip

@bensallen
Member

Great. That solves the first mystery at least. Did the bootstrap fully provision the node?

@Irekporeb

No, this gets me to the debug shell with the same message, "Network hardware not recognized".

@bensallen
Member

bensallen commented Jan 30, 2018

Could you step through this shell loop from the debug shell:

https://github.com/warewulf/warewulf3/blob/master/provision/initramfs/init#L235

Double check some input variables before starting:

  • echo $WWHWADDR - should be 'ec:0d:9a:03:00:9c:81:e2'
  • echo $WWNETDEV - should be 'ib0'

for i in `echo /sys/class/net/* | xargs -n1 /usr/bin/basename | grep -v lo`; do
    HWADDR=`cat /sys/class/net/$i/address`
    # Use GUID if IB
    if [ ${#HWADDR} -eq 59 ]; then
        HWADDR=`expr substr $HWADDR 37 23`
    fi
    if [ $HWADDR == $WWHWADDR ]; then
        if ifup $i $WWHWADDR $WWNETDEV; then
            echo "$i" > /tmp/wwdev
        fi
        break
    fi
done
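
For reference, the two magic numbers in this loop follow from the address format: an IPoIB link-layer address is 20 bytes, which prints as 59 characters (40 hex digits plus 19 colons), and the port GUID is the last 8 bytes, i.e. the 23 characters starting at position 37. Below is a minimal sketch of just that extraction, using cut instead of expr so that it runs in any POSIX shell. The function name and the sample address (a typical-looking IPoIB address built around the GUID from this thread) are assumptions for illustration.

```shell
#!/bin/sh
# Hypothetical helper mirroring the "Use GUID if IB" step of the loop:
# a 59-character address is treated as IPoIB and reduced to its last
# 8 bytes (characters 37-59); anything else is returned unchanged.
guid_from_address() {
    addr=$1
    if [ "${#addr}" -eq 59 ]; then
        printf '%s\n' "$addr" | cut -c37-59
    else
        printf '%s\n' "$addr"
    fi
}

# Sample 20-byte IPoIB address (illustrative, not captured from the node).
guid_from_address 80:00:02:08:fe:80:00:00:00:00:00:00:ec:0d:9a:03:00:9c:81:e2
```

This is also why WWHWADDR must hold the full 64-bit GUID: the comparison is made against these 23 characters, and a 17-character Ethernet-style MAC can never equal them.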

@Irekporeb

The variables are OK. There are only ib0 and lo in the /sys/class/net/ folder.

boot-problem7

@macdems
Contributor Author

macdems commented Jan 30, 2018

Just my 3 cents:

The unnecessary loading of the undionly.kpxe seems to be the main problem here. I suggest you carefully examine differences between WW 3.8 and 3.7 — the older version works perfectly for me: the nodes are up, configured, and running.

(I am on a vacation right now and I have no access to server, so I cannot do any tests before I return: until then I can only comment on the tests I have already done).

@bensallen
Member

@Irekporeb Yep looks good. ifup is a shell function within /init, not Busybox's ifup, so the last failed line when you run ifup in your screenshot is expected. https://github.com/warewulf/warewulf3/blob/master/provision/initramfs/init#L73.

  1. Is there a message "Checking for network device: ib0 (ib0)" before the message "ERROR: Network hardware was not recognized!"
  2. Or does it still look exactly as shown in Infiniband boot problem #100 (comment)?

This will tell us if ifup is being called at all.

  1. It would also be useful to set wwsh provision set --kargs='wwdebug=2' gpunode-0-0 and reboot the node. wwdebug=2 causes set -x in /etc/functions, so we might be able to get an idea of what's failing. We'll be looking for the loop of code above, to see whether ifup is being called and, if so, where ifup is failing.

@jmstover
Contributor

jmstover commented Jan 30, 2018

There's been some cards I've needed to wait to initialize (add sleep 10 or so, at the end of openibd init script), and it's possible that that card isn't fully initialized by the time it's hitting this point.

We should see this from the wwdebug=2 output.

Edit:
Meaning... from this, it doesn't look like the device is showing up under /sys/class/net/

@bensallen
Member

@macdems The unnecessary loading of undionly.kpxe is certainly a problem; I've opened #104 to track it. We know why it's happening at this point (lack of the user-class option, DHCP option 77, in the DHCP discover/request from FlexBoot), and it's unrelated to the network interface not being configured once in the bootstrap.

@Irekporeb

  1. No, there is no message about checking the network device.
  2. The screen still looks like in the mentioned comment: "Network Hardware not recognized".
  3. I have set up wwdebug mode for gpunode-0-0. I can't see "ifup" being called at all.
    I recorded the whole boot session, as it goes by fast, but I still can't see it.

boot-problem9

boot-problem8

boot-problem-sys_class_net

@bensallen
Member

@Irekporeb Do you have a screenshot of the for i in `echo /sys/class/net/* | xargs -n1 /usr/bin/basename | grep -v lo`; do loop running directly before the ERROR: Network hardware was not recognized message, or do the above screenshots show the continuous output between detect running and the error?

I'm interested in if /sys/class/net/ib0 exists at the point when this loop runs. If so you should see HWADDR being populated. If not then we have an annoying race condition on the interface bring-up like Jason mentioned.

@jmstover
Contributor

jmstover commented Jan 30, 2018 via email

@bensallen
Member

@jmstover The debug shell that you get dumped into with wwdebug=3, is directly before the debug shell due to /tmp/wwdev not existing, so there won't be much difference in timing.

I'm thinking we might need to extend the WWNETRETRY functionality to loop around

for i in `echo /sys/class/net/* | xargs -n1 /usr/bin/basename | grep -v lo`; do
    HWADDR=`cat /sys/class/net/$i/address`
    # Use GUID if IB
    if [ ${#HWADDR} -eq 59 ]; then
        HWADDR=`expr substr $HWADDR 37 23`
    fi
    if [ $HWADDR == $WWHWADDR ]; then
        if ifup $i $WWHWADDR $WWNETDEV; then
            echo "$i" > /tmp/wwdev
        fi
        break
    fi
done
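
One possible shape for such a retry, as a rough sketch only: poll for the interface's sysfs entry before running the matching loop. The function name wait_for_netdev, its arguments, and its defaults are hypothetical; this is not how WWNETRETRY is implemented in Warewulf.

```shell
#!/bin/sh
# Hypothetical helper: wait until <sysnet>/<dev>/address exists.
# Arguments: device name, number of attempts (default 10),
# delay in seconds between attempts (default 1),
# sysfs network directory (default /sys/class/net).
wait_for_netdev() {
    dev=$1
    retries=${2:-10}
    delay=${3:-1}
    sysnet=${4:-/sys/class/net}
    n=0
    while [ "$n" -lt "$retries" ]; do
        # Succeed as soon as the driver has registered the interface.
        if [ -e "$sysnet/$dev/address" ]; then
            return 0
        fi
        n=$((n + 1))
        sleep "$delay"
    done
    return 1
}
```

In init this could be called as wait_for_netdev "$WWNETDEV" before the for loop, falling through to the existing error path when it returns non-zero. The last argument exists only so the function can be exercised against a scratch directory instead of the real /sys.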

@Irekporeb

This is the message which I was able to catch just before the error message. I have also attached the boot video below.

boot-problem10

Video.zip

@Irekporeb

@jmstover Here is the output of your command:

boot-problem11

@bensallen
Member

bensallen commented Jan 30, 2018

@Irekporeb Thanks for the video, it's super helpful. Yep, the echo /sys/class/net/* glob is only finding lo and not ib0. Since you can see ib0 from the debug shell, it appears we have a race condition.

Bit of a long shot, but in bootstrap.conf add mlx5_core and mlx5_ib back into a modprobe line, eg:

modprobe += mlx5_core, mlx5_ib, ib_ipoib

Rebuild the bootstrap via wwbootstrap .... These modprobe lines are evaluated and acted on slightly earlier than detect and modules specified via wwkmods in kargs.

I doubt this will work, and if it doesn't then we'll need to add some code into provision/initramfs/init to allow for some retries before giving up.

@Irekporeb

I have added the other modules to the bootstrap and disabled debug mode. But adding more modules caused "ib0" to disappear.
Also, I can see that the "ib_ipoib" module is not loaded.

boot-problem12

@bensallen
Member

Double checking, the line you added to bootstrap.conf looks exactly like: modprobe += mlx5_core, mlx5_ib, ib_ipoib ?

@Irekporeb

Irekporeb commented Jan 31, 2018

Oh yes, the "," between the modules were missing. After fixing that I was able to boot into the node OS.

@bensallen
Member

Closing. I don't believe we have anything outstanding in this issue. #104 is still open to figure out how to distinguish FlexBoot's iPXE in dhcpd.conf.
