Discussion:
rdma_create_qp() and max_send_wr
Yann Droneaud
2011-04-21 16:44:24 UTC
Hi,

I have a problem with rdma_create_qp() when I set
qp_init_attr.cap.max_send_wr to something higher than 16351:
it returns -1 and errno is set to ENOMEM "Cannot allocate memory".

strace doesn't show anything related to memory, but the last write()
syscall returns EINVAL "Invalid Argument".

I'm using a Mellanox ConnectX MT26428 (v2.6.100 / a0) HCA, under Debian
6.0 (kernel 2.6.32-5, librdmacm 1.0.10-1, libibverbs 1.1.3-2, libmlx4
1.0-1).

According to the information returned by ibv_query_device(),
ibv_device_attr.max_qp_wr is 16384. So one might think that 16383
outstanding WRs should be OK.

I've also tried to increase the length of the associated CQ, but it
doesn't change anything.

So what's the limit on the number of WRs in the QP's SQ?
Is it ibv_device_attr.max_qp_wr - 32?
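
A minimal sketch of the failing call, assuming an already-established
rdma_cm_id and PD (setup and error handling are abbreviated, and the
function name is illustrative):

    #include <stdio.h>
    #include <rdma/rdma_cma.h>

    /* Create an RC QP with max_send_wr set to the device's reported
     * max_qp_wr; on this HCA the call fails with errno = ENOMEM. */
    static int try_max_send_wr(struct rdma_cm_id *id, struct ibv_pd *pd)
    {
            struct ibv_device_attr dev_attr;
            struct ibv_qp_init_attr attr = { 0 };

            if (ibv_query_device(id->verbs, &dev_attr))
                    return -1;

            attr.qp_type          = IBV_QPT_RC;
            attr.cap.max_send_wr  = dev_attr.max_qp_wr;  /* 16384 here */
            attr.cap.max_recv_wr  = 1;
            attr.cap.max_send_sge = 1;
            attr.cap.max_recv_sge = 1;

            if (rdma_create_qp(id, pd, &attr)) {
                    perror("rdma_create_qp");  /* ENOMEM */
                    return -1;
            }
            return 0;
    }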


Regards.
--
Yann Droneaud
OPTEYA


Roland Dreier
2011-04-21 18:35:57 UTC
Post by Yann Droneaud
I have a problem with rdma_create_qp() when I set
qp_init_attr.cap.max_send_wr to something higher than 16351:
it returns -1 and errno is set to ENOMEM "Cannot allocate memory".
strace doesn't show anything related to memory, but the last write()
syscall returns EINVAL "Invalid Argument".
I'm using a Mellanox ConnectX MT26428 (v2.6.100 / a0) HCA, under Debian
6.0 (kernel 2.6.32-5, librdmacm 1.0.10-1, libibverbs 1.1.3-2, libmlx4
1.0-1).
According to the information returned by ibv_query_device(),
ibv_device_attr.max_qp_wr is 16384. So one might think that 16383
outstanding WRs should be OK.
Getting exactly the right value for max_qp_wr is kind of tricky because
of complicated allocation rules. I guess this is just a mlx4 bug in
reporting not quite the right value from ibv_query_device().
Or Gerlitz
2011-05-22 06:46:21 UTC
Post by Roland Dreier
Getting exactly the right value for max_qp_wr is kind of tricky because
of complicated allocation rules. I guess this is just a mlx4 bug in
reporting not quite the right value from ibv_query_device().
Maybe the correct way to go for mlx4 is to go min/max, that is, report
the --minimal-- of the max(send) and max(recv) values, i.e. a value that
would work whether an app uses it for its send or its recv WR number.
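
As a sketch, the proposal amounts to something like this (hypothetical
names, min-of-maxima):

    /* Sketch of the proposal, with hypothetical inputs: advertise the
     * smaller of the send-side and recv-side maxima, so the reported
     * max_qp_wr is safe for either queue of a QP. */
    static int safe_max_qp_wr(int max_send_hw, int max_recv_hw)
    {
            return max_send_hw < max_recv_hw ? max_send_hw : max_recv_hw;
    }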

Or.
Eli Cohen
2011-05-22 08:21:07 UTC
Post by Or Gerlitz
Maybe the correct way to go for mlx4 is to go min/max, that is, report
the --minimal-- of the max(send) and max(recv) values, i.e. a value that
would work whether an app uses it for its send or its recv WR number.
I see that OFED already contains a fix for this that Jack pushed. So I
guess Jack will send it for upstream submission.
Yann Droneaud
2011-04-22 10:20:45 UTC
Hi,
An ENOMEM return does not mean that the subsystem *just* failed to
allocate system memory: the memory that could not be allocated could be
device memory.

I'm also having some difficulties with system memory allocation.

In my test, a user is allowed to lock 4 MBytes of memory, but not all of
this memory is available to ibv_reg_mr(), since ibv_create_cq() and
ibv_create_qp()/rdma_create_qp() lock memory for the CQ and QP
respectively. The question is how much memory is needed for the CQ and
QP queues?

In my case, the maximum message size is 4 MBytes - 20 KBytes, for CQ and
QP (half duplex) queue lengths of 1.

Using message sizes of 128 bytes and less hits the QP WR limit of 16351
entries.

When using messages of 256 bytes, I'm only able to register 2609152
bytes; the CQ and QP (half duplex) queues are then 10192 entries long.
So they seem to require about 1585152 bytes. Taking into account a fixed
amount of reserved memory of 20 KBytes, this gives about 154 bytes per
(CQ + QP (half duplex)) entry.

When doing the same math with sizes 512, 1024 and 2048, the per-entry
size of (CQ + QP (half duplex)) goes down:

    msg size    registered memory    queue length
    512         3395584              6632
    1024        3788800              3700
    2048        3985408              1946
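
The arithmetic behind those numbers, as a small sketch assuming the
4 MBytes lock limit and the 20 KBytes fixed reservation estimated above:

    #include <stdio.h>

    /* Sketch of the accounting above: whatever part of the 4 MBytes
     * RLIMIT_MEMLOCK budget ibv_reg_mr() could not register is
     * attributed to the CQ + QP (half duplex) queues, minus an
     * estimated fixed reservation of 20 KBytes; dividing by the queue
     * length gives the apparent per-entry cost. */
    int main(void)
    {
            const long limit = 4L * 1024 * 1024;
            const long fixed = 20L * 1024;
            const int  msg[]        = { 256,     512,     1024,    2048 };
            const long registered[] = { 2609152, 3395584, 3788800, 3985408 };
            const long length[]     = { 10192,   6632,    3700,    1946 };

            for (int i = 0; i < 4; i++) {
                    long queues = limit - registered[i] - fixed;
                    printf("msg %4d: ~%ld bytes per (CQ + QP) entry\n",
                           msg[i], queues / length[i]);
            }
            return 0;
    }

(For 256-byte messages this prints ~153 bytes per entry, matching the
estimate above.)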

Note that the memory used for the messages is allocated as one big
aligned chunk and registered as a whole, and then sliced to be posted in
WRs.

But the memory required for the CQ and QP elements (and others) is also
subject to page-size alignment.

At least, I know that the CQ / QP "overhead" is not going to hurt users
if they are given "modern" memory limits, let's say 1 GBytes ;)

Regards.

--
Yann Droneaud
OPTEYA



Yann Droneaud
2011-04-22 10:37:49 UTC
And I forgot to mention:

Post by Yann Droneaud
I'm also having some difficulties with system memory allocation.
In this case of failure, strace shows the last write() syscall returning
ENOMEM.

Regards.

--
Yann Droneaud
OPTEYA


Eli Cohen
2011-05-19 06:07:04 UTC
Hi Yann,
it appears that you're using quite old firmware. Could you upgrade the
firmware to the latest version and check the failure to create a QP with
the max depth again? FW and burning tools can be downloaded from
www.mellanox.com.

Another possible reason for the failures you were seeing is the 4 MB
limit on locking memory from userspace. Could you repeat the experiment
as root?

P.S.: I checked on my setup and was able to create a QP with the max
size.
Post by Yann Droneaud
An ENOMEM return does not mean that the subsystem *just* failed to
allocate system memory: the memory that could not be allocated could be
device memory.
[...]
Yann Droneaud
2011-05-19 09:17:16 UTC
Hi,
Post by Eli Cohen
Hi Yann,
it appears that you're using quite old firmware. Could you upgrade the
firmware to the latest version and check the failure to create a QP with
the max depth again? FW and burning tools can be downloaded from
www.mellanox.com.
After upgrading to firmware 2.8.0600, I have the same problem.
Post by Eli Cohen
Another possible reason for the failures you were seeing is the 4 MB
limit on locking memory from userspace. Could you repeat the experiment
as root?
The '16351' limit also hits root on my testbench.
I have also raised the memory limit to 1 GBytes, and it doesn't work
either.
Post by Eli Cohen
P.S.: I checked on my setup and was able to create a QP with the max
size.
Do you have some test code for me to try?

Regards.

--
Yann Droneaud
OPTEYA


Eli Cohen
2011-05-19 09:34:02 UTC
Post by Yann Droneaud
Do you have some test code for me to try?
I used ibv_rc_pingpong, which is part of libibverbs. The '-r' option
allows you to define the queue depth. Please try it and let me know.
Yann Droneaud
2011-05-19 10:50:13 UTC
Post by Eli Cohen
Post by Yann Droneaud
Do you have some test code for me to try?
I used ibv_rc_pingpong, which is part of libibverbs. The '-r' option
allows you to define the queue depth. Please try it and let me know.
ibv_rc_pingpong works well with the max value, e.g. 16384, but
ib_rdma_bw doesn't:

ib_rdma_bw -t 16384 doesn't work, but ib_rdma_bw -t 16351 does.

Creating the QP through ibv_create_qp() seems to allow use of the
maximum QP WR count, but rdma_create_qp() limits it to 16351.

Regards.

--
Yann Droneaud
OPTEYA


Hefty, Sean
2011-05-19 14:25:12 UTC
Post by Yann Droneaud
Creating the QP through ibv_create_qp() seems to allow use of the
maximum QP WR count, but rdma_create_qp() limits it to 16351.
rdma_create_qp() passes the QP creation call straight through to
ibv_create_qp(). It will also allocate CQs, if those don't already
exist, using the same limits as specified for the QP, which could be the
reason for the failure.
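
One way to rule that out is to create the CQs explicitly and pass them
in; a minimal sketch, with illustrative sizes, an assumed established
rdma_cm_id, and cleanup on error omitted:

    #include <rdma/rdma_cma.h>

    /* Sketch: pre-create the send and recv CQs so rdma_create_qp()
     * only creates the QP itself. */
    static int create_qp_with_own_cqs(struct rdma_cm_id *id, struct ibv_pd *pd)
    {
            struct ibv_qp_init_attr attr = { 0 };
            struct ibv_cq *scq, *rcq;

            scq = ibv_create_cq(id->verbs, 16351, NULL, NULL, 0);
            rcq = ibv_create_cq(id->verbs, 16384, NULL, NULL, 0);
            if (!scq || !rcq)
                    return -1;

            attr.qp_type          = IBV_QPT_RC;
            attr.send_cq          = scq;
            attr.recv_cq          = rcq;
            attr.cap.max_send_wr  = 16351;
            attr.cap.max_recv_wr  = 16384;
            attr.cap.max_send_sge = 1;
            attr.cap.max_recv_sge = 1;

            return rdma_create_qp(id, pd, &attr);  /* uses the CQs above */
    }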

- Sean
Yann Droneaud
2011-05-19 14:45:20 UTC
Hi,
Post by Hefty, Sean
Post by Yann Droneaud
Creating the QP through ibv_create_qp() seems to allow use of the
maximum QP WR count, but rdma_create_qp() limits it to 16351.
rdma_create_qp() passes the QP creation call straight through to
ibv_create_qp().

That's what I see when tracing the code in GDB.
Post by Hefty, Sean
It will also allocate CQs, if those don't already exist, using the same
limits as specified for the QP, which could be the reason for the
failure.
The CQs are already allocated; I've also checked that with the debugger.

So I'm a bit puzzled: why does it work in ibv_rc_pingpong but not in
rdma_bw?

--
Yann Droneaud
OPTEYA


Yann Droneaud
2011-05-19 15:03:45 UTC
Hi,
Post by Yann Droneaud
So I'm a bit puzzled: why does it work in ibv_rc_pingpong but not in
rdma_bw?
Because ibv_rc_pingpong -r modifies the max_recv_wr attribute, while
rdma_bw -t modifies max_send_wr instead.

After modifying ibv_rc_pingpong to set the max_send_wr attribute as
well, it fails to create the QP when trying to use more than 16351 WRs
in the SQ.

To sum up (a probe sketch follows below):

- ibv_qp_init_attr.cap.max_recv_wr can be set to
  ibv_device_attr.max_qp_wr, 16384 in my case,

- ibv_qp_init_attr.cap.max_send_wr *cannot* be set to
  ibv_device_attr.max_qp_wr, but it can be set to 16351.
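
A small sketch probing that asymmetry (error handling abbreviated;
"ctx" and "pd" are assumed to be a valid device context and PD):

    #include <stdio.h>
    #include <string.h>
    #include <infiniband/verbs.h>

    /* Sketch: try max_qp_wr first on the recv side, then on the send
     * side, of an otherwise minimal RC QP. */
    static void probe_wr_limits(struct ibv_context *ctx, struct ibv_pd *pd)
    {
            struct ibv_device_attr dev;
            struct ibv_qp_init_attr attr;
            struct ibv_cq *cq;
            struct ibv_qp *qp;

            if (ibv_query_device(ctx, &dev))
                    return;
            cq = ibv_create_cq(ctx, dev.max_qp_wr, NULL, NULL, 0);
            if (!cq)
                    return;

            for (int side = 0; side < 2; side++) {
                    memset(&attr, 0, sizeof attr);
                    attr.qp_type          = IBV_QPT_RC;
                    attr.send_cq          = cq;
                    attr.recv_cq          = cq;
                    attr.cap.max_send_sge = 1;
                    attr.cap.max_recv_sge = 1;
                    attr.cap.max_send_wr  = side ? dev.max_qp_wr : 1;
                    attr.cap.max_recv_wr  = side ? 1 : dev.max_qp_wr;

                    qp = ibv_create_qp(pd, &attr);
                    printf("%s side at max_qp_wr (%d): %s\n",
                           side ? "send" : "recv", dev.max_qp_wr,
                           qp ? "OK" : "failed");
                    if (qp)
                            ibv_destroy_qp(qp);
            }
            ibv_destroy_cq(cq);
    }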

Regards.

--=20
Yann Droneaud
OPTEYA


--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" i=
n
the body of a message to majordomo-***@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Eli Cohen
2011-05-19 15:46:50 UTC
Post by Yann Droneaud
- ibv_qp_init_attr.cap.max_recv_wr can be set to
  ibv_device_attr.max_qp_wr, 16384 in my case,
- ibv_qp_init_attr.cap.max_send_wr *cannot* be set to
  ibv_device_attr.max_qp_wr, but it can be set to 16351.
Thanks for investigating this. We'll check and send a fix. Meanwhile
you can work around this in your code.
Yann Droneaud
2011-05-19 16:59:23 UTC
Post by Eli Cohen
Post by Yann Droneaud
- ibv_qp_init_attr.cap.max_recv_wr can be set to
  ibv_device_attr.max_qp_wr, 16384 in my case,
- ibv_qp_init_attr.cap.max_send_wr *cannot* be set to
  ibv_device_attr.max_qp_wr, but it can be set to 16351.
Thanks for investigating this. We'll check and send a fix. Meanwhile
you can work around this in your code.
I understand the problem (which is probably not really a problem):

In libmlx4/src/verbs.c:mlx4_create_qp():

    mlx4_calc_sq_wqe_size(&attr->cap, attr->qp_type, qp);

    /*
     * We need to leave 2 KB + 1 WQE of headroom in the SQ to
     * allow HW to prefetch.
     */
    qp->sq_spare_wqes = (2048 >> qp->sq.wqe_shift) + 1;
    qp->sq.wqe_cnt = align_queue_size(attr->cap.max_send_wr +
                                      qp->sq_spare_wqes);
    qp->rq.wqe_cnt = align_queue_size(attr->cap.max_recv_wr);

For a requested SQ length of 1:

    qp->sq.wqe_shift      : 6
    qp->sq_spare_wqes     : 33
    attr->cap.max_send_wr : 1
    qp->sq.wqe_cnt        : 64

For a requested SQ length of 16351:

    qp->sq.wqe_shift      : 6
    qp->sq_spare_wqes     : 33
    attr->cap.max_send_wr : 16351
    qp->sq.wqe_cnt        : 16384

For a requested SQ length of 16384:

    qp->sq.wqe_shift      : 6
    qp->sq_spare_wqes     : 33
    attr->cap.max_send_wr : 16384
    qp->sq.wqe_cnt        : 32768

32768 is clearly above the ibv_device_attr.max_qp_wr limit, so the
creation of the QP fails later:

    ret = ibv_cmd_create_qp(pd, &qp->ibv_qp, attr, &cmd.ibv_cmd,
                            sizeof cmd, &resp, sizeof resp);

ibv_cmd_create_qp() fails and returns 22 (EINVAL).


If spare WQEs are taken into account here, they should also be taken
into account in the value reported by ibv_query_device().
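
To make the rounding concrete, a small sketch assuming
align_queue_size() rounds up to the next power of two, as the wqe_cnt
values above suggest:

    #include <stdio.h>

    /* Assumed behaviour of libmlx4's align_queue_size(): round the
     * requested count up to the next power of two. */
    static unsigned align_queue_size(unsigned req)
    {
            unsigned n = 1;

            while (n < req)
                    n <<= 1;
            return n;
    }

    int main(void)
    {
            const unsigned spare = (2048 >> 6) + 1;  /* 33 spare WQEs */
            const unsigned reqs[] = { 1, 16351, 16384 };

            for (int i = 0; i < 3; i++)
                    printf("max_send_wr %5u -> wqe_cnt %5u\n",
                           reqs[i], align_queue_size(reqs[i] + spare));
            /* 16384 + 33 rounds up to 32768, above max_qp_wr (16384),
             * hence the effective SQ limit of 16384 - 33 = 16351. */
            return 0;
    }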

Regards.

--
Yann Droneaud
OPTEYA


Yann Droneaud
2011-05-19 17:13:07 UTC
Hi,
Post by Yann Droneaud
In libmlx4/src/verbs.c:mlx4_create_qp():

    /*
     * We need to leave 2 KB + 1 WQE of headroom in the SQ to
     * allow HW to prefetch.
     */
This headroom was introduced by commit 561da8d1 ("Handle new FW
requirement for send request prefetching"):

http://git.kernel.org/?p=libs/infiniband/libmlx4.git;a=commit;h=561da8d1

Regards.

--
Yann Droneaud
OPTEYA



Hefty, Sean
2011-05-19 18:06:03 UTC
Post by Yann Droneaud
If spare WQEs are taken into account here, they should also be taken
into account in the value reported by ibv_query_device().
max_qp_wr does not distinguish between max send and receive, or
indicate whether those values should be the same. IMO, setting

max_qp_wr = max(send wr, recv wr)

makes more sense than

max_qp_wr = min(send wr, recv wr)

The maximum values reported are not meant to be guaranteed minimums.

- Sean
Eli Cohen
2011-05-19 19:37:48 UTC
Post by Hefty, Sean
max_qp_wr does not distinguish between max send and receive, or
indicate whether those values should be the same. IMO, setting
max_qp_wr = max(send wr, recv wr)
makes more sense than
max_qp_wr = min(send wr, recv wr)
The maximum values reported are not meant to be guaranteed minimums.
Here's what the spec says:

    The maximum number of outstanding work requests on any Work Queue
    supported by this HCA.

I think the text should be interpreted as you said:
max_qp_wr = max(send wr, recv wr)

But I also don't find much use for the value returned by Query HCA.


