[Users] Linux kernel: Crash of IB peer in RC mode is not detected

Discussion:

Jack Wang

2014-10-23 13:50:09 UTC

Cc'ing linux-rdma, which is the more appropriate list for this kind of question.

Hello,
we are implementing Linux kernel modules that transfer data
with RDMA-Write operations over an RC connection between two hosts.
After the RDMA connection between the hosts is established, we trigger a
kernel Oops on one of them with "echo c > /proc/sysrq-trigger".
The other peer of the RC connection doesn't notice the crash.
RDMA-Write operations still complete successfully, with a WC event, 10 min
after the crash.
We have registered the following handlers:
- CQ ib_event_handler,
- QP ib_event_handler,
- device ib_event_handler,
- connection manager event handler.
But we don't receive any events that indicate a connection abort.
I expected RDMA-Write operations to fail if the other peer crashes.
I also hoped that an event would be generated when a host crashes. The subnet
manager should notice it and notify every other device in the network.
Are we missing something in our modules?
Is there a way to determine that an RC peer crashed without implementing a
ping-pong mechanism?
Our setup:
- Linux 3.14.13
- Mellanox Technologies MT27500 Family [ConnectX-3], mlx4_core driver
- both peers are directly connected, no switch in between
- OpenSM 3.2.6 is running on both hosts
Thanks in advance,
Fabian
_______________________________________________
Users mailing list
http://lists.openfabrics.org/mailman/listinfo/users

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-***@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Roland Dreier

2014-10-23 18:43:56 UTC

> I expected RDMA-Write operations to fail if the other peer crashes.
> I also hoped that an event would be generated when a host crashes. The subnet
> manager should notice it and notify every other device in the network.
> Are we missing something in our modules?
> Is there a way to determine that an RC peer crashed without implementing a
> ping-pong mechanism?

If the remote system crashes then any memory regions, QPs, etc. are
still valid with the remote HCA, and RDMA read/write operations will
continue to succeed. (Unless the system reboots and reinitializes the
adapter or something like that).

There isn't a way to detect a remote crash unless that remote crash
disconnects your QP or otherwise affects the HCA on the crashed
system.

Jack Wang

2014-10-24 10:01:25 UTC

Thanks, Roland, for clearing up our confusion.

So it looks like a ping-pong mechanism is the way to go.

Regards,
Jack
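
For readers following the thread, the ping-pong approach can be sketched as a small liveness tracker. This is only an illustration under stated assumptions, not code from anyone in this discussion: the struct and function names (peer_liveness, liveness_init, and so on) are hypothetical, time is passed in as plain milliseconds, and the actual posting of ping and pong messages on the QP is left out. The important assumption is that the pong is produced by software on the remote host (for example, the remote module answering a received message), because, as Roland explains, a completed RDMA Write only shows that the remote HCA is still responding.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical liveness state for one RC peer; all names are illustrative. */
struct peer_liveness {
    uint64_t last_ping_ms;  /* when we last decided to send a ping */
    uint64_t last_pong_ms;  /* when remote software last answered */
    uint64_t interval_ms;   /* how often a ping should be sent */
    uint64_t timeout_ms;    /* silence longer than this marks the peer dead */
};

/* Start tracking; the peer is assumed alive at connection setup. */
static void liveness_init(struct peer_liveness *p, uint64_t now_ms,
                          uint64_t interval_ms, uint64_t timeout_ms)
{
    /* The timeout must allow at least one full ping interval. */
    assert(interval_ms > 0 && timeout_ms > interval_ms);
    p->last_ping_ms = now_ms;
    p->last_pong_ms = now_ms;
    p->interval_ms = interval_ms;
    p->timeout_ms = timeout_ms;
}

/* Returns true when the caller should post a ping now, and records it. */
static bool liveness_ping_due(struct peer_liveness *p, uint64_t now_ms)
{
    if (now_ms - p->last_ping_ms < p->interval_ms)
        return false;
    p->last_ping_ms = now_ms;
    return true;
}

/* Call when a pong generated by the remote software has been received. */
static void liveness_on_pong(struct peer_liveness *p, uint64_t now_ms)
{
    p->last_pong_ms = now_ms;
}

/* True once no pong has arrived for longer than the timeout. */
static bool liveness_peer_dead(const struct peer_liveness *p, uint64_t now_ms)
{
    return now_ms - p->last_pong_ms > p->timeout_ms;
}
```

In a kernel module, a periodic timer or delayed work item would typically call liveness_ping_due() and liveness_peer_dead(), and the receive-completion path would call liveness_on_pong(). The interval and timeout are tuning choices: a short timeout detects a crash sooner but is more likely to misreport a busy peer as dead.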
