Discussion:
[PATCH RFC for-next] net/mlx4_core: Fix racy flow in the driver CQ completion handler
Or Gerlitz
2012-08-30 10:38:05 UTC
From: Yishai Hadas <yishaih-VPRAkNaXOzVWk0Htik3J/***@public.gmane.org>

The mlx4 CQ completion handler, mlx4_cq_completion(), doesn't lock the
radix tree which is used to manage the table of CQs, nor does it increase
the reference count of the CQ before invoking the user-provided callback.

This is racy and can cause various crashes: use after free, null pointer
dereference, etc. Fix that by taking the table lock around the radix tree
lookup, and by holding a reference on the CQ for the duration of the callback.

Signed-off-by: Yishai Hadas <yishaih-VPRAkNaXOzVWk0Htik3J/***@public.gmane.org>
Signed-off-by: Or Gerlitz <ogerlitz-VPRAkNaXOzVWk0Htik3J/***@public.gmane.org>

---

Hi Roland,

Doing some code review on the related parts, I didn't see any way
the handler can be protected for the tree lookup and the user callback
invocation other than using these methods, e.g. as done by the "event"
callback. Looking at mthca, I see the same behaviour -- mthca_cq_completion
takes no locks and doesn't increase the refcount, while mthca_cq_event does
go by the book and does whatever is needed. Anything we missed here? It's
there from day one...

Or.

drivers/net/ethernet/mellanox/mlx4/cq.c | 7 +++++++
1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/cq.c b/drivers/net/ethernet/mellanox/mlx4/cq.c
index 7e64033..05d9529 100644
--- a/drivers/net/ethernet/mellanox/mlx4/cq.c
+++ b/drivers/net/ethernet/mellanox/mlx4/cq.c
@@ -56,9 +56,14 @@
 void mlx4_cq_completion(struct mlx4_dev *dev, u32 cqn)
 {
 	struct mlx4_cq *cq;
+	struct mlx4_cq_table *cq_table = &mlx4_priv(dev)->cq_table;
 
+	spin_lock(&cq_table->lock);
 	cq = radix_tree_lookup(&mlx4_priv(dev)->cq_table.tree,
 			       cqn & (dev->caps.num_cqs - 1));
+	if (cq)
+		atomic_inc(&cq->refcount);
+	spin_unlock(&cq_table->lock);
 	if (!cq) {
 		mlx4_dbg(dev, "Completion event for bogus CQ %08x\n", cqn);
 		return;
@@ -67,6 +72,8 @@ void mlx4_cq_completion(struct mlx4_dev *dev, u32 cqn)
 	++cq->arm_sn;
 
 	cq->comp(cq);
+	if (atomic_dec_and_test(&cq->refcount))
+		complete(&cq->free);
 }

void mlx4_cq_event(struct mlx4_dev *dev, u32 cqn, int event_type)
--
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-***@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Or Gerlitz
2012-08-30 22:17:54 UTC
Can you be explicit about the race you're worried about?
A few:

1. At the time CQ A is deleted, an interrupt that relates to CQ B
takes place, and a radix tree lookup runs while an element is being
deleted from the tree. Looking at the radix tree API, I don't see
that this is allowed.

2. While a CQ is being freed, an interrupt takes place and the driver
attempts to run the comp handler, which can turn into use after free,
null pointer deref, etc. This can happen even if the ULP made sure to
consume all the WCs (flushed, etc.), e.g. an "empty" interrupt.

BTW, your email missed the list somehow

Or.
Roland Dreier
2012-08-30 22:35:58 UTC
Post by Or Gerlitz
Can you be explicit about the race you're worried about?
A few:
1. At the time CQ A is deleted, an interrupt that relates to CQ B
takes place, and a radix tree lookup runs while an element is being
deleted from the tree. Looking at the radix tree API, I don't see
that this is allowed.

I don't think this is a real problem; the radix tree code is
explicitly designed for RCU use, and the data structure is pretty
clearly safe for looking up one slot while another slot is being
cleared. In fact it's hard to see how this could screw up.
Post by Or Gerlitz
2. While a CQ is being freed, an interrupt takes place and the driver
attempts to run the comp handler, which can turn into use after free,
null pointer deref, etc. This can happen even if the ULP made sure to
consume all the WCs (flushed, etc.), e.g. an "empty" interrupt.
So in mlx4_cq_free() we do

mlx4_HW2SW_CQ(dev, NULL, cq->cqn);
//...
synchronize_irq(priv->eq_table.eq[cq->vector].irq);

before we touch the cq table. I don't think we should get a CQ
completion event for the CQ we're freeing after we've done HW2SW_CQ on
it and then waited for any outstanding completion interrupts to
finish.

Also we know that there are no QPs attached to this CQ so there
shouldn't be any completion events anyway...

- R.
Or Gerlitz
2012-08-31 06:53:50 UTC
Post by Roland Dreier
Post by Or Gerlitz
1. At the time CQ A is deleted, an interrupt that relates to CQ B
takes place, and a radix tree lookup runs while an element is being
deleted from the tree. Looking at the radix tree API, I don't see
that this is allowed.
I don't think this is a real problem; the radix tree code is
explicitly designed for RCU use, and the data structure is pretty
clearly safe for looking up one slot while another slot is being
cleared. In fact it's hard to see how this could screw up.
OK, I'll give it a second look, thanks for elaborating on the design. One
thing which probably confused us is the driver CQ event callback,
which does take the lock before searching the radix tree and does increase
the refcount before invoking the user event handler. Any reason for
the event callback to use a different practice vs. the comp one?
Post by Roland Dreier
Post by Or Gerlitz
2. While a CQ is being freed, an interrupt takes place and the driver
attempts to run the comp handler, which can turn into use after free,
null pointer deref, etc. This can happen even if the ULP made sure to
consume all the WCs (flushed, etc.), e.g. an "empty" interrupt.
So in mlx4_cq_free() we do
mlx4_HW2SW_CQ(dev, NULL, cq->cqn);
//...
synchronize_irq(priv->eq_table.eq[cq->vector].irq);
before we touch the cq table. I don't think we should get a CQ
completion event for the CQ we're freeing after we've done HW2SW_CQ on
it and then waited for any outstanding completion interrupts to finish.
Yes, makes sense. This wouldn't handle use cases where the user does a
context switch, e.g. from hard IRQ to softirq/tasklet, and lets that
handler touch the CQ, but the refcount in the driver wouldn't help
in that case either.
Post by Roland Dreier
Also we know that there are no QPs attached to this CQ so there
shouldn't be any completion events anyway...
All in all, your reasoning makes much sense; we probably need to look
deeper into the crash report that triggered this RFC, see next email.

Yishai - were you thinking of other possible races that the patch
could address? Also, can you double-check the point Roland made on
HW2SW_CQ?


Or.
Or Gerlitz
2012-08-31 07:12:53 UTC
I don't think this is a real problem; the radix tree code is [...]
So maybe this patch won't land in 3.7, we'll see. However, so far
no other patch sits in the for-next branch of your tree for the next
window... any plan to start reviewing / picking patches anytime soon?

Also, there are three more fixes for 3.6 I sent you earlier this week,
which need to get in as they address real bugs (a deadlock in IPoIB,
crashes in mlx4 when attempting to register > 512GB, and more).

Or.
Or Gerlitz
2012-08-31 07:02:34 UTC
Can you be explicit about the race you're worried about?
Roland,

This patch was made after we got the below report from the field.

Dotan, can we get Roland access to the full vmcore image?

Here's the report I got:

"[...] box panic'ed when trying to find a completion queue. There is
no corruption, but there is a possible race which could result in
mlx4_cq_completion getting the wrong height of the radix tree and
following a bit too deep into the chains. In the other code which uses
this radix tree the access is protected by the lock, but
mlx4_cq_completion runs in interrupt context and cannot
take locks, so instead it runs without any protection whatsoever."

Below is the stack trace; the kernel is an RHEL5
2.6.18-274.something, so in that respect, maybe your comment on the
radix tree RCU doesn't apply there? You can see the problem was
somewhere in the radix tree code.

crash> bt
PID: 8178 TASK: ffff810b91a52400 CPU: 11 COMMAND: "universal_consu"
#0 [ffff81182fe3bc70] crash_kexec at ffffffff800ba60c
#1 [ffff81182fe3bd30] __die at ffffffff80069047
#2 [ffff81182fe3bd70] die at ffffffff8007075b
#3 [ffff81182fe3bda0] do_general_protection at ffffffff80069375
#4 [ffff81182fe3bde0] error_exit at ffffffff80060e9d

[exception RIP: radix_tree_lookup+38]
RIP: ffffffff8016405f RSP: ffff81182fe3be90 RFLAGS: 00210002
RAX: 6b6b6b6b6b6b6b8b RBX: ffff810c2fb90000 RCX: 0000000000000006
RDX: 0000000000000001 RSI: 00000000000000c0 RDI: 6b6b6b6b6b6b6b6b
RBP: 00000000000000c0 R8: 0000000000010000 R9: 000000000920ea94
R10: 00000000000000b3 R11: ffffffff800c7ea5 R12: 00000000000000b3
R13: ffff810c2b15a280 R14: ffff810b7a98ff58 R15: 0000000000000000
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0000

#5 [ffff81182fe3be90] mlx4_cq_completion at ffffffff886f2367
#6 [ffff81182fe3beb0] mlx4_eq_int at ffffffff886f2bd3
#7 [ffff81182fe3bf10] mlx4_msi_x_interrupt at ffffffff886f3078
#8 [ffff81182fe3bf20] handle_IRQ_event at ffffffff8001167c
#9 [ffff81182fe3bf50] __do_IRQ at ffffffff800c7f40
#10 [ffff81182fe3bf90] do_IRQ at ffffffff80071659

--- <IRQ stack> ---

#11 [ffff810b7a98ff58] ret_from_intr at ffffffff80060652
RIP: 00000000f5cd8859 RSP: 00000000f3eafa28 RFLAGS: 00200202
RAX: 0000000000000000 RBX: 00000000f3eafaf8 RCX: 000000000b6d4f00
RDX: 000000000000001c RSI: 000000000000001c RDI: 0000000000000000
RBP: ffff810bf1a0a120 R8: 0000000000000000 R9: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
R13: 0000000000000000 R14: ffff810b9e7d5bf0 R15: 0000000000000000
ORIG_RAX: ffffffffffffff4c CS: 0023 SS: 002b
crash>