I have blocks of hosts that I'm provisioning with Puppet in exactly the same way. They have identical hardware (same blade chassis) and are connected in the same way, yet the interfaces on some of them are not working the same as on others. These are all InfiniBand interfaces, so I'm able to test them with commands like ibping and ibsysstat, which show that they have working UVERBS/RDMA connections. For example:
master# ibsysstat 29
sysstat ping succeeded
The node with that LID, which is one of the hosts that isn't working quite right, reports:
node10# ibstat
CA 'mlx4_0'
    CA type: MT4099
    Number of ports: 1
    Firmware version: 2.11.1250
    Hardware version: 1
    Node GUID: 0x...
    System image GUID: 0x...
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 40
        Base lid: 29
        LMC: 0
        SM lid: 26
        Capability mask: 0x02594868
        Port GUID: 0x...
        Link layer: InfiniBand
But when I run a simple ping to the IPoIB IP address, it just sits there without connecting. Other commands like ibping are definitely passing traffic, and data shows up in the debug output when adding -d. When I watch the interface with tcpdump, I can see the pings go out but nothing coming back in. Meanwhile, right next to it is a host with the same everything that works just fine. The routing tables all look right to me as well, and match those on hosts that work. On a host that doesn't work:
default via 10.10.0.1 dev em1 proto dhcp metric 100
10.10.0.0/24 dev em1 proto kernel scope link src 10.10.0.110 metric 100
10.11.0.0/24 dev ib0 proto kernel scope link src 10.11.0.110
169.254.0.0/16 dev ib0 scope link metric 1005
and on one that does:
default via 10.10.0.1 dev em1 proto dhcp metric 100
10.10.0.0/24 dev em1 proto kernel scope link src 10.10.0.108 metric 100
10.11.0.0/24 dev ib0 proto kernel scope link src 10.11.0.108
169.254.0.0/16 dev ib0 scope link metric 1004
The only thing different is the metric in the last route, but that shouldn't matter. Also of note, these hosts worked before they were reprovisioned. So I'm almost positive it's not hardware.
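To make the "only the metric differs" comparison less error-prone than reading the two tables by eye, the host-specific fields can be stripped before diffing. This is only a sketch: the function name normalize_routes and the hostnames in the usage comment are placeholders, and it assumes the plain `ip route show` format shown above.

```shell
# normalize_routes: read `ip route show` output on stdin and drop the fields
# that are expected to differ per host (source address and route metric),
# then sort so the two tables can be compared line by line.
normalize_routes() {
    sed -e 's/ src [0-9.]*//' \
        -e 's/ metric [0-9]*//' \
        -e 's/ *$//' | sort
}

# Usage (hostnames are placeholders):
#   ssh node08 ip route show | normalize_routes > good.txt
#   ssh node10 ip route show | normalize_routes > bad.txt
#   diff good.txt bad.txt && echo "tables match apart from src/metric"
```

If diff prints nothing, the tables differ only in source address and metric, which supports ruling out routing as the cause.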
I'm at a bit of a loss now and any ideas would be appreciated.
Edit: Update with dmesg error
I found something in the output of dmesg for the interface in question that appears only on the hosts that don't work. The error is:
ib0: failed to modify QP to RTR: -22
Unfortunately this isn't very helpful, and searches turn up very little that's related.
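Since the message shows up only on the broken hosts, it is a convenient marker for sorting hosts into working and non-working. A minimal sketch, assuming the log lines look like the one above (optionally with a leading timestamp); the function name find_qp_rtr_failures is made up for illustration. The -22 is most likely a negated errno, and errno 22 on Linux is EINVAL.

```shell
# find_qp_rtr_failures: read kernel log text on stdin and print one line per
# distinct "<interface> <error code>" pair that hit the QP-to-RTR failure.
find_qp_rtr_failures() {
    grep 'failed to modify QP to RTR' |
        sed 's/^.*\(ib[0-9][0-9]*\): failed to modify QP to RTR: \(-[0-9][0-9]*\).*$/\1 \2/' |
        sort -u
}

# Usage on a host (reading the kernel log may require root):
#   dmesg | find_qp_rtr_failures
```

No output means the message was not found in the log that was fed in.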
Perhaps also worth noting, the hosts in question can ping the switch IP address, and the switch can ping the hosts on their associated IPs.
This is a known issue in kernel 3.10.0-862.11.1 to 3.10.0-862.11.6 (see here and here).
Essentially, if you update the kernel to any release from 862.11.1 through 862.11.6, a bug in drivers/infiniband/core/verbs.c, where a semicolon was left out, causes all reliable connected (RC) messages to fail, while unreliable datagram (UD) messages still work. You can either patch the driver or boot an earlier kernel as a workaround until an updated kernel resolves the issue.
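To check a batch of hosts against the range given above, something like the following could be run on each one. It is a sketch only: the function name is_affected_kernel is hypothetical, and it assumes `uname -r` prints a release string of the form 3.10.0-862.11.N followed by a distribution suffix.

```shell
# is_affected_kernel RELEASE
# Returns 0 if RELEASE is 3.10.0-862.11.N with N between 1 and 6 (the range
# named in the answer), and 1 for any other release string.
is_affected_kernel() {
    case "$1" in
        3.10.0-862.11.*) ;;
        *) return 1 ;;
    esac
    _rest=${1#3.10.0-862.11.}   # e.g. "6.el7.x86_64"
    _patch=${_rest%%.*}         # e.g. "6"
    case "$_patch" in
        [1-6]) return 0 ;;
        *) return 1 ;;
    esac
}

if is_affected_kernel "$(uname -r)"; then
    echo "kernel $(uname -r) is in the affected range"
else
    echo "kernel $(uname -r) is outside the affected range"
fi
```

A host reported as being in the affected range is a candidate for booting the earlier kernel as described above.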