Discussion:
[PATCH V2 for-next 0/9] Peer-Direct support
Yishai Hadas
2014-10-23 12:02:48 UTC
The following set of patches implements Peer-Direct support over the RDMA
stack.

Peer-Direct technology allows RDMA operations to directly target
memory in external hardware devices, such as GPU cards, SSD based
storage, dedicated ASIC accelerators, etc.

This technology allows RDMA-based (over InfiniBand/RoCE) applications
to avoid unneeded data copying when sharing data between peer hardware
devices.

To implement this technology, we defined an API to securely expose the
memory of a hardware device (peer memory) to an RDMA hardware device.

The API defined for Peer-Direct is described in this cover letter.
The required implementation for a hardware device to expose memory
buffers over Peer-Direct is also detailed in this letter.

Finally, the cover letter includes a description of the flow and the
API that IB core and low level IB hardware drivers implement to
support the technology.

Flow:
-----------------
Each peer memory client should register itself with the IB core (ib_core)
module and provide a set of callbacks to manage its basic memory
functionality.

The required functionality includes getting page descriptors based
upon a user space virtual address, DMA mapping these pages, getting the
memory page size, removing the DMA mapping of the pages, and releasing
the page descriptors.
Those callbacks are quite similar to the kernel API used to pin normal
host memory and expose it to the hardware.
A description of the API is included later in this cover
letter.

The peer direct controller, implemented as part of the IB core
services, provides registry and brokering services between peer memory
providers and low level IB hardware drivers.
This makes the usage of peer-direct almost completely transparent to
the individual hardware drivers. The only change required in the low
level IB hardware drivers is support for an interface for immediate
invalidation of registered memory regions.

The IB hardware driver should call ib_umem_get with an extra flag signaling
that the requested memory may reside on a peer memory. When a given
user space virtual memory address is found to belong to a peer memory
client, an ib_umem is built using the callbacks provided by the peer
memory client. If the IB hardware driver supports invalidation
on that ib_umem, it must signal this as part of ib_umem_get; otherwise,
if the peer memory requires invalidation support, the registration will
be rejected.

After getting the ib_umem, if it is residing on a peer memory that requires
invalidation support, the low level IB hardware driver must register the
invalidation callback for this ib_umem.
If this callback is called, the driver should ensure that no access to
the memory mapped by the umem will happen once the callback returns.
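
The registration decision described above can be sketched in plain
userspace C. This is only an illustration of the rule, not the code in
the patches: demo_classify_umem, its boolean parameters and the -EINVAL
return value are all assumptions made for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical outcome of a registration attempt. */
enum demo_umem_kind { DEMO_UMEM_HOST, DEMO_UMEM_PEER };

/*
 * Decide how a user address range would be registered.
 * Returns 0 and sets *kind on success, a negative errno on rejection.
 * The error code is chosen for illustration only.
 */
static int demo_classify_umem(bool peer_allowed, bool peer_owns_range,
			      bool peer_needs_invalidation,
			      bool driver_supports_invalidation,
			      enum demo_umem_kind *kind)
{
	/* No peer involved: take the normal host-memory pinning path. */
	if (!peer_allowed || !peer_owns_range) {
		*kind = DEMO_UMEM_HOST;
		return 0;
	}

	/* The peer insists on invalidation but the driver cannot do it. */
	if (peer_needs_invalidation && !driver_supports_invalidation)
		return -EINVAL;

	*kind = DEMO_UMEM_PEER;
	return 0;
}
```

A driver without invalidation support can therefore still use peers that
never revoke memory, while peers that do require invalidation are only
paired with drivers that can honor it.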

Information and statistics regarding the registered peer memory
clients are exported to the user space at:
/sys/kernel/infiniband/memory_peers/<peer_name>/.
===============================================================================
Peer memory API
===============================================================================
Peer client structure:
-------------------------------------------------------------------------------
struct peer_memory_client {
char name[IB_PEER_MEMORY_NAME_MAX];
char version[IB_PEER_MEMORY_VER_MAX];
int (*acquire) (unsigned long addr, size_t size, void *peer_mem_private_data,
char *peer_mem_name, void **client_context);
int (*get_pages) (unsigned long addr,
size_t size, int write, int force,
struct sg_table *sg_head,
void *client_context, void *core_context);
int (*dma_map) (struct sg_table *sg_head, void *client_context,
struct device *dma_device, int dmasync, int *nmap);
int (*dma_unmap) (struct sg_table *sg_head, void *client_context,
struct device *dma_device);
void (*put_pages) (struct sg_table *sg_head, void *client_context);
unsigned long (*get_page_size) (void *client_context);
void (*release) (void *client_context);

};

A detailed description of the above callbacks is provided as part of the
first patch, in the peer_mem.h header file.
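
To give a feel for how a client might fill in such a callback table,
here is a reduced userspace sketch. struct demo_peer_client is a
hypothetical stand-in with only three of the callbacks and simplified
signatures; the address window, page size and all demo_* names are made
up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Reduced, hypothetical stand-in for struct peer_memory_client. */
struct demo_peer_client {
	const char *name;
	/* Returns 1 if [addr, addr + size) belongs to this peer, else 0. */
	int (*acquire)(unsigned long addr, size_t size, void **client_context);
	unsigned long (*get_page_size)(void *client_context);
	void (*release)(void *client_context);
};

/* Pretend device window owned by the demo peer (made-up values). */
#define DEMO_BASE 0x100000UL
#define DEMO_SIZE 0x10000UL

struct demo_ctx {
	unsigned long addr;
	size_t size;
};

static int demo_acquire(unsigned long addr, size_t size, void **client_context)
{
	struct demo_ctx *ctx;

	/* Overflow-safe check that the range lies inside the window. */
	if (addr < DEMO_BASE || size > DEMO_SIZE ||
	    addr - DEMO_BASE > DEMO_SIZE - size)
		return 0;	/* not ours */

	ctx = malloc(sizeof(*ctx));
	if (!ctx)
		return 0;
	ctx->addr = addr;
	ctx->size = size;
	*client_context = ctx;	/* handed back to later callbacks */
	return 1;
}

static unsigned long demo_get_page_size(void *client_context)
{
	(void)client_context;
	return 4096;
}

static void demo_release(void *client_context)
{
	free(client_context);
}

static const struct demo_peer_client demo_client = {
	.name = "demo",
	.acquire = demo_acquire,
	.get_page_size = demo_get_page_size,
	.release = demo_release,
};
```

The context allocated in acquire is what ties the later calls together:
every other callback receives it back, and release frees it.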
-----------------------------------------------------------------------------------
void *ib_register_peer_memory_client(struct peer_memory_client *peer_client,
invalidate_peer_memory *invalidate_callback);

Description:
Each peer memory should use this function to register as an available
peer memory client during its initialization. The callbacks provided
as part of the peer_client may be used later on by the IB core when
registering and unregistering its memory.
----------------------------------------------------------------------------------

void ib_unregister_peer_memory_client(void *reg_handle);

Description:
On unload, the peer memory client must unregister itself, to prevent
any additional callbacks to the unloaded module.

----------------------------------------------------------------------------------
typedef int (*invalidate_peer_memory)(void *reg_handle,
void *core_context);

Description:
A callback function to be called by the peer driver when an allocation
should be invalidated. When the invalidation callback returns, the user
of the allocation is guaranteed not to access it.
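
The guarantee stated above can be illustrated with a small userspace
model. It assumes, purely for the example, that core_context points
straight at a umem-like object and that the driver's teardown is a
single flag flip; struct demo_umem and the demo_* functions are
hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical umem-like object tracked by the core. */
struct demo_umem {
	bool mapped;	/* true while the HW may access the memory */
	void (*driver_invalidate)(struct demo_umem *umem);
};

/* Stand-in for the driver callback that tears down HW translations. */
static void demo_driver_invalidate(struct demo_umem *umem)
{
	umem->mapped = false;
}

/* Same shape as invalidate_peer_memory; the body is an assumption. */
static int demo_invalidate_peer_memory(void *reg_handle, void *core_context)
{
	struct demo_umem *umem = core_context;

	(void)reg_handle;
	if (!umem || !umem->driver_invalidate)
		return -EINVAL;

	/* Once this returns, no further access may reach the memory. */
	umem->driver_invalidate(umem);
	return 0;
}

/* Models a HW access attempt against the umem. */
static int demo_hw_access(const struct demo_umem *umem)
{
	return umem->mapped ? 0 : -EFAULT;
}
```

The point of the model is the ordering: the peer may reclaim its memory
only after the callback has returned.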

-------------------------------------------------------------------------------

The structure of the patchset

First, the patches apply against the for-next branch in the
roland/infiniband.git tree, based upon commit ID
3bdad2d13fa62bcb59ca2506e74ce467ea436586 having subject: "Merge
branches 'core', 'ipoib', 'iser', 'mlx4', 'ocrdma' and 'qib' into
for-next"

Patches 1-3:
This set of patches introduces the API, adds the required support to
the IB core layer, allowing Peers to be registered and be part of the
flow. The first patch introduces the API, the next two patches add the
infrastructure to manage peer client and use its registration
callbacks.

Patches 4-5:
Those patches allow peers to notify IB core that a specific
registration should be invalidated.

Patch 6:
This patch exposes some information and statistics for a given peer
memory by using the sysfs mechanism.

Patches 7-8:
Those patches add the required functionality needed by mlx4 & mlx5 to
work with peer clients that require invalidation support. Currently,
that support has been added only for MRs.

Patch 9:
This patch is an example peer memory client that uses host memory;
it can serve as a good reference for peer client writers.

Changes from V0:
- fixed coding style issues.
- changed core ticket from (void *) to u64. Removed all wraparound handling.
- documented the sysfs interface and added missing counters.

Changes from V1:
- reformatted the documentation to look nice for nanodoc.
- changed the sysfs interface to be under infiniband subsystem instead of mm one.

Yishai Hadas (9):
IB/core: Introduce peer client interface
IB/core: Get/put peer memory client
IB/core: Umem tunneling peer memory APIs
IB/core: Infrastructure to manage peer core context
IB/core: Invalidation support for peer memory
IB/core: Sysfs support for peer memory
IB/mlx4: Invalidation support for MR over peer memory
IB/mlx5: Invalidation support for MR over peer memory
Samples: Peer memory client example

Documentation/infiniband/peer_memory.txt | 64 ++++
drivers/infiniband/core/Makefile | 3 +-
drivers/infiniband/core/core_priv.h | 2 +
drivers/infiniband/core/peer_mem.c | 526 ++++++++++++++++++++++++++
drivers/infiniband/core/sysfs.c | 6 +
drivers/infiniband/core/umem.c | 119 ++++++-
drivers/infiniband/core/uverbs_cmd.c | 2 +
drivers/infiniband/hw/amso1100/c2_provider.c | 2 +-
drivers/infiniband/hw/cxgb3/iwch_provider.c | 2 +-
drivers/infiniband/hw/cxgb4/mem.c | 2 +-
drivers/infiniband/hw/ehca/ehca_mrmw.c | 2 +-
drivers/infiniband/hw/ipath/ipath_mr.c | 2 +-
drivers/infiniband/hw/mlx4/cq.c | 2 +-
drivers/infiniband/hw/mlx4/doorbell.c | 2 +-
drivers/infiniband/hw/mlx4/main.c | 3 +-
drivers/infiniband/hw/mlx4/mlx4_ib.h | 5 +
drivers/infiniband/hw/mlx4/mr.c | 90 ++++-
drivers/infiniband/hw/mlx4/qp.c | 2 +-
drivers/infiniband/hw/mlx4/srq.c | 2 +-
drivers/infiniband/hw/mlx5/cq.c | 5 +-
drivers/infiniband/hw/mlx5/doorbell.c | 2 +-
drivers/infiniband/hw/mlx5/main.c | 3 +-
drivers/infiniband/hw/mlx5/mlx5_ib.h | 10 +
drivers/infiniband/hw/mlx5/mr.c | 84 ++++-
drivers/infiniband/hw/mlx5/qp.c | 2 +-
drivers/infiniband/hw/mlx5/srq.c | 2 +-
drivers/infiniband/hw/mthca/mthca_provider.c | 2 +-
drivers/infiniband/hw/nes/nes_verbs.c | 2 +-
drivers/infiniband/hw/ocrdma/ocrdma_verbs.c | 2 +-
drivers/infiniband/hw/qib/qib_mr.c | 2 +-
include/rdma/ib_peer_mem.h | 59 +++
include/rdma/ib_umem.h | 36 ++-
include/rdma/ib_verbs.h | 5 +-
include/rdma/peer_mem.h | 247 ++++++++++++
samples/Kconfig | 10 +
samples/Makefile | 3 +-
samples/peer_memory/Makefile | 1 +
samples/peer_memory/example_peer_mem.c | 260 +++++++++++++
38 files changed, 1535 insertions(+), 40 deletions(-)
create mode 100644 Documentation/infiniband/peer_memory.txt
create mode 100644 drivers/infiniband/core/peer_mem.c
create mode 100644 include/rdma/ib_peer_mem.h
create mode 100644 include/rdma/peer_mem.h
create mode 100644 samples/peer_memory/Makefile
create mode 100644 samples/peer_memory/example_peer_mem.c

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-***@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Yishai Hadas
2014-10-23 12:02:51 UTC
Builds umem over peer memory client functionality.
It tries to get a peer client for a given address range; if one is found,
further memory calls are tunneled to that peer client.
ib_umem_get was extended with an indication of whether this umem can
be part of a peer client. As a result, callers of
ib_umem_get were updated accordingly.
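
The tunneled sequence (get pages, query the page size, DMA map, and
unwind on failure) can be sketched in userspace as follows. struct
demo_peer and the demo_* helpers are hypothetical mocks; returning
-EINVAL for a bad page size is a choice made in this sketch, not
something taken from the diff.

```c
#include <assert.h>
#include <errno.h>

/* Mock peer state, used only to observe what the sketch did. */
struct demo_peer {
	int fail_dma_map;		/* set to make dma_map fail */
	unsigned long page_size;	/* 0 simulates a bad page size */
	int pages_held;			/* 1 between get_pages and put_pages */
	int dma_mapped;			/* 1 after a successful dma_map */
};

static int demo_get_pages(struct demo_peer *p)
{
	p->pages_held = 1;
	return 0;
}

static void demo_put_pages(struct demo_peer *p)
{
	p->pages_held = 0;
}

static int demo_dma_map(struct demo_peer *p)
{
	if (p->fail_dma_map)
		return -ENOMEM;
	p->dma_mapped = 1;
	return 0;
}

/* Returns 0 and fills *offset on success, a negative errno otherwise. */
static int demo_peer_umem_get(struct demo_peer *p, unsigned long addr,
			      unsigned long *offset)
{
	int ret;

	ret = demo_get_pages(p);
	if (ret)
		goto out;

	if (p->page_size == 0) {
		ret = -EINVAL;	/* set an error before unwinding */
		goto put_pages;
	}

	/* Offset of addr within its (power-of-two sized) page. */
	*offset = addr & (p->page_size - 1);

	ret = demo_dma_map(p);
	if (ret)
		goto put_pages;

	return 0;

put_pages:
	demo_put_pages(p);
out:
	return ret;
}
```

Each failure label undoes exactly the steps that had succeeded before
it, so a failed registration leaves no pages held on the peer.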

Signed-off-by: Yishai Hadas <yishaih-VPRAkNaXOzVWk0Htik3J/***@public.gmane.org>
Signed-off-by: Shachar Raindel <raindel-VPRAkNaXOzVWk0Htik3J/***@public.gmane.org>
---
drivers/infiniband/core/umem.c | 77 +++++++++++++++++++++++++-
drivers/infiniband/hw/amso1100/c2_provider.c | 2 +-
drivers/infiniband/hw/cxgb3/iwch_provider.c | 2 +-
drivers/infiniband/hw/cxgb4/mem.c | 2 +-
drivers/infiniband/hw/ehca/ehca_mrmw.c | 2 +-
drivers/infiniband/hw/ipath/ipath_mr.c | 2 +-
drivers/infiniband/hw/mlx4/cq.c | 2 +-
drivers/infiniband/hw/mlx4/doorbell.c | 2 +-
drivers/infiniband/hw/mlx4/mr.c | 11 +++-
drivers/infiniband/hw/mlx4/qp.c | 2 +-
drivers/infiniband/hw/mlx4/srq.c | 2 +-
drivers/infiniband/hw/mlx5/cq.c | 5 +-
drivers/infiniband/hw/mlx5/doorbell.c | 2 +-
drivers/infiniband/hw/mlx5/mr.c | 2 +-
drivers/infiniband/hw/mlx5/qp.c | 2 +-
drivers/infiniband/hw/mlx5/srq.c | 2 +-
drivers/infiniband/hw/mthca/mthca_provider.c | 2 +-
drivers/infiniband/hw/nes/nes_verbs.c | 2 +-
drivers/infiniband/hw/ocrdma/ocrdma_verbs.c | 2 +-
drivers/infiniband/hw/qib/qib_mr.c | 2 +-
include/rdma/ib_peer_mem.h | 4 +
include/rdma/ib_umem.h | 13 +++-
22 files changed, 119 insertions(+), 25 deletions(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index df0c4f6..f3e445c 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -42,6 +42,66 @@

#include "uverbs.h"

+static struct ib_umem *peer_umem_get(struct ib_peer_memory_client *ib_peer_mem,
+ struct ib_umem *umem, unsigned long addr,
+ int dmasync)
+{
+ int ret;
+ const struct peer_memory_client *peer_mem = ib_peer_mem->peer_mem;
+
+ umem->ib_peer_mem = ib_peer_mem;
+ /*
+ * We always request write permissions to the pages, to force breaking of any CoW
+ * during the registration of the MR. For read-only MRs we use the "force" flag to
+ * indicate that CoW breaking is required but the registration should not fail if
+ * referencing read-only areas.
+ */
+ ret = peer_mem->get_pages(addr, umem->length,
+ 1, !umem->writable,
+ &umem->sg_head,
+ umem->peer_mem_client_context,
+ 0);
+ if (ret)
+ goto out;
+
+ umem->page_size = peer_mem->get_page_size
+ (umem->peer_mem_client_context);
+ if (umem->page_size <= 0)
+ goto put_pages;
+
+ umem->offset = addr & ((unsigned long)umem->page_size - 1);
+ ret = peer_mem->dma_map(&umem->sg_head,
+ umem->peer_mem_client_context,
+ umem->context->device->dma_device,
+ dmasync,
+ &umem->nmap);
+ if (ret)
+ goto put_pages;
+
+ return umem;
+
+put_pages:
+ peer_mem->put_pages(&umem->sg_head,
+ umem->peer_mem_client_context);
+out:
+ ib_put_peer_client(ib_peer_mem, umem->peer_mem_client_context);
+ kfree(umem);
+ return ERR_PTR(ret);
+}
+
+static void peer_umem_release(struct ib_umem *umem)
+{
+ const struct peer_memory_client *peer_mem =
+ umem->ib_peer_mem->peer_mem;
+
+ peer_mem->dma_unmap(&umem->sg_head,
+ umem->peer_mem_client_context,
+ umem->context->device->dma_device);
+ peer_mem->put_pages(&umem->sg_head,
+ umem->peer_mem_client_context);
+ ib_put_peer_client(umem->ib_peer_mem, umem->peer_mem_client_context);
+ kfree(umem);
+}

static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int dirty)
{
@@ -74,9 +134,11 @@ static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int d
* @size: length of region to pin
* @access: IB_ACCESS_xxx flags for memory being pinned
* @dmasync: flush in-flight DMA when the memory region is written
+ * @peer_mem_flags: IB_PEER_MEM_xxx flags for memory being used
*/
struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
- size_t size, int access, int dmasync)
+ size_t size, int access, int dmasync,
+ unsigned long peer_mem_flags)
{
struct ib_umem *umem;
struct page **page_list;
@@ -114,6 +176,15 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
* "MW bind" can change permissions by binding a window.
*/
umem->writable = !!(access & ~IB_ACCESS_REMOTE_READ);
+ if (peer_mem_flags & IB_PEER_MEM_ALLOW) {
+ struct ib_peer_memory_client *peer_mem_client;
+
+ peer_mem_client = ib_get_peer_client(context, addr, size,
+ &umem->peer_mem_client_context);
+ if (peer_mem_client)
+ return peer_umem_get(peer_mem_client, umem, addr,
+ dmasync);
+ }

/* We assume the memory is from hugetlb until proved otherwise */
umem->hugetlb = 1;
@@ -234,6 +305,10 @@ void ib_umem_release(struct ib_umem *umem)
struct mm_struct *mm;
struct task_struct *task;
unsigned long diff;
+ if (umem->ib_peer_mem) {
+ peer_umem_release(umem);
+ return;
+ }

__ib_umem_release(umem->context->device, umem, 1);

diff --git a/drivers/infiniband/hw/amso1100/c2_provider.c b/drivers/infiniband/hw/amso1100/c2_provider.c
index 2d5cbf4..e88d222 100644
--- a/drivers/infiniband/hw/amso1100/c2_provider.c
+++ b/drivers/infiniband/hw/amso1100/c2_provider.c
@@ -444,7 +444,7 @@ static struct ib_mr *c2_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
return ERR_PTR(-ENOMEM);
c2mr->pd = c2pd;

- c2mr->umem = ib_umem_get(pd->uobject->context, start, length, acc, 0);
+ c2mr->umem = ib_umem_get(pd->uobject->context, start, length, acc, 0, 0);
if (IS_ERR(c2mr->umem)) {
err = PTR_ERR(c2mr->umem);
kfree(c2mr);
diff --git a/drivers/infiniband/hw/cxgb3/iwch_provider.c b/drivers/infiniband/hw/cxgb3/iwch_provider.c
index 811b24a..aa9c142 100644
--- a/drivers/infiniband/hw/cxgb3/iwch_provider.c
+++ b/drivers/infiniband/hw/cxgb3/iwch_provider.c
@@ -635,7 +635,7 @@ static struct ib_mr *iwch_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,

mhp->rhp = rhp;

- mhp->umem = ib_umem_get(pd->uobject->context, start, length, acc, 0);
+ mhp->umem = ib_umem_get(pd->uobject->context, start, length, acc, 0, 0);
if (IS_ERR(mhp->umem)) {
err = PTR_ERR(mhp->umem);
kfree(mhp);
diff --git a/drivers/infiniband/hw/cxgb4/mem.c b/drivers/infiniband/hw/cxgb4/mem.c
index ec7a298..506ddd2 100644
--- a/drivers/infiniband/hw/cxgb4/mem.c
+++ b/drivers/infiniband/hw/cxgb4/mem.c
@@ -705,7 +705,7 @@ struct ib_mr *c4iw_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,

mhp->rhp = rhp;

- mhp->umem = ib_umem_get(pd->uobject->context, start, length, acc, 0);
+ mhp->umem = ib_umem_get(pd->uobject->context, start, length, acc, 0, 0);
if (IS_ERR(mhp->umem)) {
err = PTR_ERR(mhp->umem);
kfree(mhp);
diff --git a/drivers/infiniband/hw/ehca/ehca_mrmw.c b/drivers/infiniband/hw/ehca/ehca_mrmw.c
index 3488e8c..d5bbbc0 100644
--- a/drivers/infiniband/hw/ehca/ehca_mrmw.c
+++ b/drivers/infiniband/hw/ehca/ehca_mrmw.c
@@ -359,7 +359,7 @@ struct ib_mr *ehca_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
}

e_mr->umem = ib_umem_get(pd->uobject->context, start, length,
- mr_access_flags, 0);
+ mr_access_flags, 0, 0);
if (IS_ERR(e_mr->umem)) {
ib_mr = (void *)e_mr->umem;
goto reg_user_mr_exit1;
diff --git a/drivers/infiniband/hw/ipath/ipath_mr.c b/drivers/infiniband/hw/ipath/ipath_mr.c
index 5e61e9b..d6641be 100644
--- a/drivers/infiniband/hw/ipath/ipath_mr.c
+++ b/drivers/infiniband/hw/ipath/ipath_mr.c
@@ -198,7 +198,7 @@ struct ib_mr *ipath_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
}

umem = ib_umem_get(pd->uobject->context, start, length,
- mr_access_flags, 0);
+ mr_access_flags, 0, 0);
if (IS_ERR(umem))
return (void *) umem;

diff --git a/drivers/infiniband/hw/mlx4/cq.c b/drivers/infiniband/hw/mlx4/cq.c
index 1066eec..23aaf77 100644
--- a/drivers/infiniband/hw/mlx4/cq.c
+++ b/drivers/infiniband/hw/mlx4/cq.c
@@ -142,7 +142,7 @@ static int mlx4_ib_get_cq_umem(struct mlx4_ib_dev *dev, struct ib_ucontext *cont
int cqe_size = dev->dev->caps.cqe_size;

*umem = ib_umem_get(context, buf_addr, cqe * cqe_size,
- IB_ACCESS_LOCAL_WRITE, 1);
+ IB_ACCESS_LOCAL_WRITE, 1, IB_PEER_MEM_ALLOW);
if (IS_ERR(*umem))
return PTR_ERR(*umem);

diff --git a/drivers/infiniband/hw/mlx4/doorbell.c b/drivers/infiniband/hw/mlx4/doorbell.c
index c517409..71e7b66 100644
--- a/drivers/infiniband/hw/mlx4/doorbell.c
+++ b/drivers/infiniband/hw/mlx4/doorbell.c
@@ -62,7 +62,7 @@ int mlx4_ib_db_map_user(struct mlx4_ib_ucontext *context, unsigned long virt,
page->user_virt = (virt & PAGE_MASK);
page->refcnt = 0;
page->umem = ib_umem_get(&context->ibucontext, virt & PAGE_MASK,
- PAGE_SIZE, 0, 0);
+ PAGE_SIZE, 0, 0, IB_PEER_MEM_ALLOW);
if (IS_ERR(page->umem)) {
err = PTR_ERR(page->umem);
kfree(page);
diff --git a/drivers/infiniband/hw/mlx4/mr.c b/drivers/infiniband/hw/mlx4/mr.c
index 8f9325c..ad4cdfd 100644
--- a/drivers/infiniband/hw/mlx4/mr.c
+++ b/drivers/infiniband/hw/mlx4/mr.c
@@ -147,7 +147,8 @@ struct ib_mr *mlx4_ib_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
/* Force registering the memory as writable. */
/* Used for memory re-registeration. HCA protects the access */
mr->umem = ib_umem_get(pd->uobject->context, start, length,
- access_flags | IB_ACCESS_LOCAL_WRITE, 0);
+ access_flags | IB_ACCESS_LOCAL_WRITE, 0,
+ IB_PEER_MEM_ALLOW);
if (IS_ERR(mr->umem)) {
err = PTR_ERR(mr->umem);
goto err_free;
@@ -226,12 +227,18 @@ int mlx4_ib_rereg_user_mr(struct ib_mr *mr, int flags,
int err;
int n;

+ /* Peer memory isn't supported */
+ if (mmr->umem->ib_peer_mem) {
+ err = -ENOTSUPP;
+ goto release_mpt_entry;
+ }
+
mlx4_mr_rereg_mem_cleanup(dev->dev, &mmr->mmr);
ib_umem_release(mmr->umem);
mmr->umem = ib_umem_get(mr->uobject->context, start, length,
mr_access_flags |
IB_ACCESS_LOCAL_WRITE,
- 0);
+ 0, 0);
if (IS_ERR(mmr->umem)) {
err = PTR_ERR(mmr->umem);
/* Prevent mlx4_ib_dereg_mr from free'ing invalid pointer */
diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c
index 577b477..15d6430 100644
--- a/drivers/infiniband/hw/mlx4/qp.c
+++ b/drivers/infiniband/hw/mlx4/qp.c
@@ -721,7 +721,7 @@ static int create_qp_common(struct mlx4_ib_dev *dev, struct ib_pd *pd,
goto err;

qp->umem = ib_umem_get(pd->uobject->context, ucmd.buf_addr,
- qp->buf_size, 0, 0);
+ qp->buf_size, 0, 0, IB_PEER_MEM_ALLOW);
if (IS_ERR(qp->umem)) {
err = PTR_ERR(qp->umem);
goto err;
diff --git a/drivers/infiniband/hw/mlx4/srq.c b/drivers/infiniband/hw/mlx4/srq.c
index 62d9285..e05c772 100644
--- a/drivers/infiniband/hw/mlx4/srq.c
+++ b/drivers/infiniband/hw/mlx4/srq.c
@@ -114,7 +114,7 @@ struct ib_srq *mlx4_ib_create_srq(struct ib_pd *pd,
}

srq->umem = ib_umem_get(pd->uobject->context, ucmd.buf_addr,
- buf_size, 0, 0);
+ buf_size, 0, 0, IB_PEER_MEM_ALLOW);
if (IS_ERR(srq->umem)) {
err = PTR_ERR(srq->umem);
goto err_srq;
diff --git a/drivers/infiniband/hw/mlx5/cq.c b/drivers/infiniband/hw/mlx5/cq.c
index e405627..a968a54 100644
--- a/drivers/infiniband/hw/mlx5/cq.c
+++ b/drivers/infiniband/hw/mlx5/cq.c
@@ -628,7 +628,8 @@ static int create_cq_user(struct mlx5_ib_dev *dev, struct ib_udata *udata,

cq->buf.umem = ib_umem_get(context, ucmd.buf_addr,
entries * ucmd.cqe_size,
- IB_ACCESS_LOCAL_WRITE, 1);
+ IB_ACCESS_LOCAL_WRITE, 1,
+ IB_PEER_MEM_ALLOW);
if (IS_ERR(cq->buf.umem)) {
err = PTR_ERR(cq->buf.umem);
return err;
@@ -958,7 +959,7 @@ static int resize_user(struct mlx5_ib_dev *dev, struct mlx5_ib_cq *cq,
return -EINVAL;

umem = ib_umem_get(context, ucmd.buf_addr, entries * ucmd.cqe_size,
- IB_ACCESS_LOCAL_WRITE, 1);
+ IB_ACCESS_LOCAL_WRITE, 1, IB_PEER_MEM_ALLOW);
if (IS_ERR(umem)) {
err = PTR_ERR(umem);
return err;
diff --git a/drivers/infiniband/hw/mlx5/doorbell.c b/drivers/infiniband/hw/mlx5/doorbell.c
index ece028f..5d7f427 100644
--- a/drivers/infiniband/hw/mlx5/doorbell.c
+++ b/drivers/infiniband/hw/mlx5/doorbell.c
@@ -64,7 +64,7 @@ int mlx5_ib_db_map_user(struct mlx5_ib_ucontext *context, unsigned long virt,
page->user_virt = (virt & PAGE_MASK);
page->refcnt = 0;
page->umem = ib_umem_get(&context->ibucontext, virt & PAGE_MASK,
- PAGE_SIZE, 0, 0);
+ PAGE_SIZE, 0, 0, IB_PEER_MEM_ALLOW);
if (IS_ERR(page->umem)) {
err = PTR_ERR(page->umem);
kfree(page);
diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
index 80b3c63..55c6649 100644
--- a/drivers/infiniband/hw/mlx5/mr.c
+++ b/drivers/infiniband/hw/mlx5/mr.c
@@ -884,7 +884,7 @@ struct ib_mr *mlx5_ib_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
mlx5_ib_dbg(dev, "start 0x%llx, virt_addr 0x%llx, length 0x%llx\n",
start, virt_addr, length);
umem = ib_umem_get(pd->uobject->context, start, length, access_flags,
- 0);
+ 0, IB_PEER_MEM_ALLOW);
if (IS_ERR(umem)) {
mlx5_ib_dbg(dev, "umem get failed\n");
return (void *)umem;
diff --git a/drivers/infiniband/hw/mlx5/qp.c b/drivers/infiniband/hw/mlx5/qp.c
index 8c574b6..d6856c6 100644
--- a/drivers/infiniband/hw/mlx5/qp.c
+++ b/drivers/infiniband/hw/mlx5/qp.c
@@ -584,7 +584,7 @@ static int create_user_qp(struct mlx5_ib_dev *dev, struct ib_pd *pd,

if (ucmd.buf_addr && qp->buf_size) {
qp->umem = ib_umem_get(pd->uobject->context, ucmd.buf_addr,
- qp->buf_size, 0, 0);
+ qp->buf_size, 0, 0, IB_PEER_MEM_ALLOW);
if (IS_ERR(qp->umem)) {
mlx5_ib_dbg(dev, "umem_get failed\n");
err = PTR_ERR(qp->umem);
diff --git a/drivers/infiniband/hw/mlx5/srq.c b/drivers/infiniband/hw/mlx5/srq.c
index 70bd131..4bca523 100644
--- a/drivers/infiniband/hw/mlx5/srq.c
+++ b/drivers/infiniband/hw/mlx5/srq.c
@@ -103,7 +103,7 @@ static int create_srq_user(struct ib_pd *pd, struct mlx5_ib_srq *srq,
srq->wq_sig = !!(ucmd.flags & MLX5_SRQ_FLAG_SIGNATURE);

srq->umem = ib_umem_get(pd->uobject->context, ucmd.buf_addr, buf_size,
- 0, 0);
+ 0, 0, IB_PEER_MEM_ALLOW);
if (IS_ERR(srq->umem)) {
mlx5_ib_dbg(dev, "failed umem get, size %d\n", buf_size);
err = PTR_ERR(srq->umem);
diff --git a/drivers/infiniband/hw/mthca/mthca_provider.c b/drivers/infiniband/hw/mthca/mthca_provider.c
index 415f8e1..599ee1f 100644
--- a/drivers/infiniband/hw/mthca/mthca_provider.c
+++ b/drivers/infiniband/hw/mthca/mthca_provider.c
@@ -1002,7 +1002,7 @@ static struct ib_mr *mthca_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
return ERR_PTR(-ENOMEM);

mr->umem = ib_umem_get(pd->uobject->context, start, length, acc,
- ucmd.mr_attrs & MTHCA_MR_DMASYNC);
+ ucmd.mr_attrs & MTHCA_MR_DMASYNC, 0);

if (IS_ERR(mr->umem)) {
err = PTR_ERR(mr->umem);
diff --git a/drivers/infiniband/hw/nes/nes_verbs.c b/drivers/infiniband/hw/nes/nes_verbs.c
index fef067c..5b70588 100644
--- a/drivers/infiniband/hw/nes/nes_verbs.c
+++ b/drivers/infiniband/hw/nes/nes_verbs.c
@@ -2333,7 +2333,7 @@ static struct ib_mr *nes_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
u8 stag_key;
int first_page = 1;

- region = ib_umem_get(pd->uobject->context, start, length, acc, 0);
+ region = ib_umem_get(pd->uobject->context, start, length, acc, 0, 0);
if (IS_ERR(region)) {
return (struct ib_mr *)region;
}
diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
index 8f5f257..a90c88b 100644
--- a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
+++ b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
@@ -794,7 +794,7 @@ struct ib_mr *ocrdma_reg_user_mr(struct ib_pd *ibpd, u64 start, u64 len,
mr = kzalloc(sizeof(*mr), GFP_KERNEL);
if (!mr)
return ERR_PTR(status);
- mr->umem = ib_umem_get(ibpd->uobject->context, start, len, acc, 0);
+ mr->umem = ib_umem_get(ibpd->uobject->context, start, len, acc, 0, 0);
if (IS_ERR(mr->umem)) {
status = -EFAULT;
goto umem_err;
diff --git a/drivers/infiniband/hw/qib/qib_mr.c b/drivers/infiniband/hw/qib/qib_mr.c
index 9bbb553..aadce11 100644
--- a/drivers/infiniband/hw/qib/qib_mr.c
+++ b/drivers/infiniband/hw/qib/qib_mr.c
@@ -242,7 +242,7 @@ struct ib_mr *qib_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
}

umem = ib_umem_get(pd->uobject->context, start, length,
- mr_access_flags, 0);
+ mr_access_flags, 0, 0);
if (IS_ERR(umem))
return (void *) umem;

diff --git a/include/rdma/ib_peer_mem.h b/include/rdma/ib_peer_mem.h
index 3353ae7..98056c5 100644
--- a/include/rdma/ib_peer_mem.h
+++ b/include/rdma/ib_peer_mem.h
@@ -13,6 +13,10 @@ struct ib_peer_memory_client {
struct completion unload_comp;
};

+enum ib_peer_mem_flags {
+ IB_PEER_MEM_ALLOW = 1,
+};
+
struct ib_peer_memory_client *ib_get_peer_client(struct ib_ucontext *context, unsigned long addr,
size_t size, void **peer_client_context);

diff --git a/include/rdma/ib_umem.h b/include/rdma/ib_umem.h
index a2bf41e..a22dde0 100644
--- a/include/rdma/ib_umem.h
+++ b/include/rdma/ib_umem.h
@@ -36,6 +36,7 @@
#include <linux/list.h>
#include <linux/scatterlist.h>
#include <linux/workqueue.h>
+#include <rdma/ib_peer_mem.h>

struct ib_ucontext;

@@ -53,12 +54,17 @@ struct ib_umem {
struct sg_table sg_head;
int nmap;
int npages;
+ /* peer memory that manages this umem */
+ struct ib_peer_memory_client *ib_peer_mem;
+ /* peer memory private context */
+ void *peer_mem_client_context;
};

#ifdef CONFIG_INFINIBAND_USER_MEM

struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
- size_t size, int access, int dmasync);
+ size_t size, int access, int dmasync,
+ unsigned long peer_mem_flags);
void ib_umem_release(struct ib_umem *umem);
int ib_umem_page_count(struct ib_umem *umem);

@@ -67,8 +73,9 @@ int ib_umem_page_count(struct ib_umem *umem);
#include <linux/err.h>

static inline struct ib_umem *ib_umem_get(struct ib_ucontext *context,
- unsigned long addr, size_t size,
- int access, int dmasync) {
+ unsigned long addr, size_t size,
+ int access, int dmasync,
+ unsigned long peer_mem_flags) {
return ERR_PTR(-EINVAL);
}
static inline void ib_umem_release(struct ib_umem *umem) { }
--
1.7.1

Yishai Hadas
2014-10-23 12:02:50 UTC
Supplies an API to get/put a peer client.
It hides the details of how to acquire/release a peer client from
its callers and lets them get the required peer client if one exists.

The 'get' call iterates over registered peer clients looking for an
owner of a given address range by calling peer's 'acquire' call.
In case an owner is found the loop is stopped.

The 'put' call does the opposite, lets peer release its resources for
that given address range.

A reference counting/completion mechanism is used to prevent a peer
memory client from going down once there are active users for its memory.
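
The get/put pairing and the reference counting described above can be
modeled in single-threaded userspace C. Plain integers stand in for the
kref and the completion, so this only illustrates the counting, not the
locking or the blocking wait; every demo_* name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical registered client with a plain-integer refcount. */
struct demo_ref_client {
	unsigned long base, size;	/* address range this peer owns */
	int refs;			/* starts at 1 for the registration */
	int unload_complete;		/* set when refs drops to 0 */
};

/* Stand-in for the peer's 'acquire': non-zero means "this is mine". */
static int demo_owns_range(const struct demo_ref_client *c,
			   unsigned long addr, size_t size)
{
	return addr >= c->base && size <= c->size &&
	       addr - c->base <= c->size - size;
}

/* 'get': walk the registered clients and stop at the first owner. */
static struct demo_ref_client *
demo_get_peer_client(struct demo_ref_client *clients, size_t n,
		     unsigned long addr, size_t size)
{
	for (size_t i = 0; i < n; i++) {
		if (demo_owns_range(&clients[i], addr, size) > 0) {
			clients[i].refs++;	/* one more active user */
			return &clients[i];
		}
	}
	return NULL;
}

/* 'put': drop one reference; the last one signals completion. */
static void demo_put_peer_client(struct demo_ref_client *c)
{
	if (--c->refs == 0)
		c->unload_complete = 1;
}

/*
 * Unregister drops the registration reference. Returns 1 if unload may
 * proceed at once, 0 if it would have to wait for active users.
 */
static int demo_unregister(struct demo_ref_client *c)
{
	demo_put_peer_client(c);
	return c->unload_complete;
}
```

With one active user, demo_unregister reports that unload has to wait;
the matching put is what finally lets it complete.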

In addition:
- ib_ucontext was extended to let peers set their private
context and get it back via the 'acquire' call, so that they can recognize their memory. A given
ucontext can be served only by the one peer it belongs to.
- an extra device capability named IB_DEVICE_PEER_MEMORY was introduced,
to be used by low level drivers to mark that they support this functionality.

Signed-off-by: Yishai Hadas <yishaih-VPRAkNaXOzVWk0Htik3J/***@public.gmane.org>
Signed-off-by: Shachar Raindel <raindel-VPRAkNaXOzVWk0Htik3J/***@public.gmane.org>
---
drivers/infiniband/core/peer_mem.c | 49 ++++++++++++++++++++++++++++++++++
drivers/infiniband/core/uverbs_cmd.c | 2 +
include/rdma/ib_peer_mem.h | 10 +++++++
include/rdma/ib_verbs.h | 5 +++-
4 files changed, 65 insertions(+), 1 deletions(-)

diff --git a/drivers/infiniband/core/peer_mem.c b/drivers/infiniband/core/peer_mem.c
index c00af39..cc6e9e1 100644
--- a/drivers/infiniband/core/peer_mem.c
+++ b/drivers/infiniband/core/peer_mem.c
@@ -70,6 +70,14 @@ static int ib_memory_peer_check_mandatory(const struct peer_memory_client
return 0;
}

+static void complete_peer(struct kref *kref)
+{
+ struct ib_peer_memory_client *ib_peer_client =
+ container_of(kref, struct ib_peer_memory_client, ref);
+
+ complete(&ib_peer_client->unload_comp);
+}
+
void *ib_register_peer_memory_client(const struct peer_memory_client *peer_client,
invalidate_peer_memory *invalidate_callback)
{
@@ -82,6 +90,8 @@ void *ib_register_peer_memory_client(const struct peer_memory_client *peer_clien
if (!ib_peer_client)
return NULL;

+ init_completion(&ib_peer_client->unload_comp);
+ kref_init(&ib_peer_client->ref);
ib_peer_client->peer_mem = peer_client;
/* Once peer supplied a non NULL callback it's an indication that invalidation support is
* required for any memory owning.
@@ -107,6 +117,45 @@ void ib_unregister_peer_memory_client(void *reg_handle)
list_del(&ib_peer_client->core_peer_list);
mutex_unlock(&peer_memory_mutex);

+ kref_put(&ib_peer_client->ref, complete_peer);
+ wait_for_completion(&ib_peer_client->unload_comp);
kfree(ib_peer_client);
}
EXPORT_SYMBOL(ib_unregister_peer_memory_client);
+
+struct ib_peer_memory_client *ib_get_peer_client(struct ib_ucontext *context, unsigned long addr,
+ size_t size, void **peer_client_context)
+{
+ struct ib_peer_memory_client *ib_peer_client;
+ int ret;
+
+ mutex_lock(&peer_memory_mutex);
+ list_for_each_entry(ib_peer_client, &peer_memory_list, core_peer_list) {
+ ret = ib_peer_client->peer_mem->acquire(addr, size,
+ context->peer_mem_private_data,
+ context->peer_mem_name,
+ peer_client_context);
+ if (ret > 0)
+ goto found;
+ }
+
+ ib_peer_client = NULL;
+
+found:
+ if (ib_peer_client)
+ kref_get(&ib_peer_client->ref);
+
+ mutex_unlock(&peer_memory_mutex);
+ return ib_peer_client;
+}
+EXPORT_SYMBOL(ib_get_peer_client);
+
+void ib_put_peer_client(struct ib_peer_memory_client *ib_peer_client,
+ void *peer_client_context)
+{
+ if (ib_peer_client->peer_mem->release)
+ ib_peer_client->peer_mem->release(peer_client_context);
+
+ kref_put(&ib_peer_client->ref, complete_peer);
+}
+EXPORT_SYMBOL(ib_put_peer_client);
diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
index 0600c50..3f5d754 100644
--- a/drivers/infiniband/core/uverbs_cmd.c
+++ b/drivers/infiniband/core/uverbs_cmd.c
@@ -326,6 +326,8 @@ ssize_t ib_uverbs_get_context(struct ib_uverbs_file *file,
INIT_LIST_HEAD(&ucontext->xrcd_list);
INIT_LIST_HEAD(&ucontext->rule_list);
ucontext->closing = 0;
+ ucontext->peer_mem_private_data = NULL;
+ ucontext->peer_mem_name = NULL;

resp.num_comp_vectors = file->device->num_comp_vectors;

diff --git a/include/rdma/ib_peer_mem.h b/include/rdma/ib_peer_mem.h
index fac37b7..3353ae7 100644
--- a/include/rdma/ib_peer_mem.h
+++ b/include/rdma/ib_peer_mem.h
@@ -3,10 +3,20 @@

#include <rdma/peer_mem.h>

+struct ib_ucontext;
+
struct ib_peer_memory_client {
const struct peer_memory_client *peer_mem;
struct list_head core_peer_list;
int invalidation_required;
+ struct kref ref;
+ struct completion unload_comp;
};

+struct ib_peer_memory_client *ib_get_peer_client(struct ib_ucontext *context, unsigned long addr,
+ size_t size, void **peer_client_context);
+
+void ib_put_peer_client(struct ib_peer_memory_client *ib_peer_client,
+ void *peer_client_context);
+
#endif
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index ed44cc0..685e0b9 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -123,7 +123,8 @@ enum ib_device_cap_flags {
IB_DEVICE_MEM_WINDOW_TYPE_2A = (1<<23),
IB_DEVICE_MEM_WINDOW_TYPE_2B = (1<<24),
IB_DEVICE_MANAGED_FLOW_STEERING = (1<<29),
- IB_DEVICE_SIGNATURE_HANDOVER = (1<<30)
+ IB_DEVICE_SIGNATURE_HANDOVER = (1<<30),
+ IB_DEVICE_PEER_MEMORY = (1<<31)
};

enum ib_signature_prot_cap {
@@ -1131,6 +1132,8 @@ struct ib_ucontext {
struct list_head xrcd_list;
struct list_head rule_list;
int closing;
+ void *peer_mem_private_data;
+ char *peer_mem_name;
};

struct ib_uobject {
--
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-***@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Yishai Hadas
2014-10-23 12:02:56 UTC
Adds the functionality required to work with peer memory
clients that require invalidation support.

It includes:

- A umem invalidation callback. Once called, it should free any HW
resources assigned to that umem, then free the peer resources
corresponding to that umem.
- The MR object related to that umem stays alive until dereg_mr is
called.
- Synchronization between dereg_mr and the invalidate callback.
- Advertising the P2P device capability.
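
The synchronization between dereg_mr and the invalidate callback rests on an
atomic counter: whichever path increments "invalidated" first performs the
teardown, and the other path backs off. Below is a minimal userspace sketch of
that arbitration, assuming C11 atomics in place of the kernel's atomic_t. The
toy_mr type and all toy_ helpers are hypothetical, and the completion is
modeled as a plain flag rather than a blocking wait.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the driver MR; not the kernel structure. */
struct toy_mr {
	atomic_int invalidated;   /* models the atomic_t invalidated counter */
	int teardown_count;       /* how many times HW resources were freed */
	bool must_wait;           /* models blocking on invalidation_comp */
};

static void toy_mr_init(struct toy_mr *mr)
{
	atomic_init(&mr->invalidated, 0);
	mr->teardown_count = 0;
	mr->must_wait = false;
}

/* Returns true when the caller is the first to claim the teardown.
 * atomic_fetch_add returns the old value, so 0 means "nobody before me";
 * this matches the kernel test "atomic_inc_return(...) > 1" for losers.
 */
static bool toy_claim_teardown(struct toy_mr *mr)
{
	return atomic_fetch_add(&mr->invalidated, 1) == 0;
}

/* Models the invalidate callback: tear down only if it won the race. */
static void toy_invalidate(struct toy_mr *mr)
{
	if (toy_claim_teardown(mr))
		mr->teardown_count++;
}

/* Models dereg_mr: tear down, or wait for the inflight invalidation. */
static void toy_dereg(struct toy_mr *mr)
{
	if (toy_claim_teardown(mr))
		mr->teardown_count++;
	else
		mr->must_wait = true; /* the real code blocks on a completion */
}
```

Whichever order the two paths run in, the teardown happens exactly once; only
the dereg path ever needs to wait.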

Signed-off-by: Yishai Hadas <yishaih-VPRAkNaXOzVWk0Htik3J/***@public.gmane.org>
Signed-off-by: Shachar Raindel <raindel-VPRAkNaXOzVWk0Htik3J/***@public.gmane.org>
---
drivers/infiniband/hw/mlx5/main.c | 3 +-
drivers/infiniband/hw/mlx5/mlx5_ib.h | 10 ++++
drivers/infiniband/hw/mlx5/mr.c | 84 ++++++++++++++++++++++++++++++++--
3 files changed, 91 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
index d8907b2..4185531 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -182,7 +182,8 @@ static int mlx5_ib_query_device(struct ib_device *ibdev,
props->device_cap_flags = IB_DEVICE_CHANGE_PHY_PORT |
IB_DEVICE_PORT_ACTIVE_EVENT |
IB_DEVICE_SYS_IMAGE_GUID |
- IB_DEVICE_RC_RNR_NAK_GEN;
+ IB_DEVICE_RC_RNR_NAK_GEN |
+ IB_DEVICE_PEER_MEMORY;
flags = dev->mdev->caps.flags;
if (flags & MLX5_DEV_CAP_FLAG_BAD_PKEY_CNTR)
props->device_cap_flags |= IB_DEVICE_BAD_PKEY_CNTR;
diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index 386780f..bae7338 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -85,6 +85,8 @@ enum mlx5_ib_mad_ifc_flags {
MLX5_MAD_IFC_NET_VIEW = 4,
};

+struct mlx5_ib_peer_id;
+
struct mlx5_ib_ucontext {
struct ib_ucontext ibucontext;
struct list_head db_page_list;
@@ -267,6 +269,14 @@ struct mlx5_ib_mr {
struct mlx5_ib_dev *dev;
struct mlx5_create_mkey_mbox_out out;
struct mlx5_core_sig_ctx *sig;
+ struct mlx5_ib_peer_id *peer_id;
+ atomic_t invalidated;
+ struct completion invalidation_comp;
+};
+
+struct mlx5_ib_peer_id {
+ struct completion comp;
+ struct mlx5_ib_mr *mr;
};

struct mlx5_ib_fast_reg_page_list {
diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
index 55c6649..390b149 100644
--- a/drivers/infiniband/hw/mlx5/mr.c
+++ b/drivers/infiniband/hw/mlx5/mr.c
@@ -38,6 +38,9 @@
#include <linux/delay.h>
#include <rdma/ib_umem.h>
#include "mlx5_ib.h"
+static void mlx5_invalidate_umem(void *invalidation_cookie,
+ struct ib_umem *umem,
+ unsigned long addr, size_t size);

enum {
MAX_PENDING_REG_MR = 8,
@@ -880,16 +883,32 @@ struct ib_mr *mlx5_ib_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
int ncont;
int order;
int err;
+ struct ib_peer_memory_client *ib_peer_mem;
+ struct mlx5_ib_peer_id *mlx5_ib_peer_id = NULL;

mlx5_ib_dbg(dev, "start 0x%llx, virt_addr 0x%llx, length 0x%llx\n",
start, virt_addr, length);
umem = ib_umem_get(pd->uobject->context, start, length, access_flags,
- 0, IB_PEER_MEM_ALLOW);
+ 0, IB_PEER_MEM_ALLOW | IB_PEER_MEM_INVAL_SUPP);
if (IS_ERR(umem)) {
mlx5_ib_dbg(dev, "umem get failed\n");
return (void *)umem;
}

+ ib_peer_mem = umem->ib_peer_mem;
+ if (ib_peer_mem) {
+ mlx5_ib_peer_id = kzalloc(sizeof(*mlx5_ib_peer_id), GFP_KERNEL);
+ if (!mlx5_ib_peer_id) {
+ err = -ENOMEM;
+ goto error;
+ }
+ init_completion(&mlx5_ib_peer_id->comp);
+ err = ib_umem_activate_invalidation_notifier(umem, mlx5_invalidate_umem,
+ mlx5_ib_peer_id);
+ if (err)
+ goto error;
+ }
+
mlx5_ib_cont_pages(umem, start, &npages, &page_shift, &ncont, &order);
if (!npages) {
mlx5_ib_warn(dev, "avoid zero region\n");
@@ -927,11 +946,21 @@ struct ib_mr *mlx5_ib_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
spin_unlock(&dev->mr_lock);
mr->ibmr.lkey = mr->mmr.key;
mr->ibmr.rkey = mr->mmr.key;
+ atomic_set(&mr->invalidated, 0);
+ if (ib_peer_mem) {
+ init_completion(&mr->invalidation_comp);
+ mlx5_ib_peer_id->mr = mr;
+ mr->peer_id = mlx5_ib_peer_id;
+ complete(&mlx5_ib_peer_id->comp);
+ }

return &mr->ibmr;

error:
+ if (mlx5_ib_peer_id)
+ complete(&mlx5_ib_peer_id->comp);
ib_umem_release(umem);
+ kfree(mlx5_ib_peer_id);
return ERR_PTR(err);
}

@@ -968,7 +997,7 @@ error:
return err;
}

-int mlx5_ib_dereg_mr(struct ib_mr *ibmr)
+static int mlx5_ib_invalidate_mr(struct ib_mr *ibmr)
{
struct mlx5_ib_dev *dev = to_mdev(ibmr->device);
struct mlx5_ib_mr *mr = to_mmr(ibmr);
@@ -990,7 +1019,6 @@ int mlx5_ib_dereg_mr(struct ib_mr *ibmr)
mlx5_ib_warn(dev, "failed unregister\n");
return err;
}
- free_cached_mr(dev, mr);
}

if (umem) {
@@ -1000,9 +1028,32 @@ int mlx5_ib_dereg_mr(struct ib_mr *ibmr)
spin_unlock(&dev->mr_lock);
}

- if (!umred)
- kfree(mr);
+ return 0;
+}
+
+int mlx5_ib_dereg_mr(struct ib_mr *ibmr)
+{
+ struct mlx5_ib_dev *dev = to_mdev(ibmr->device);
+ struct mlx5_ib_mr *mr = to_mmr(ibmr);
+ int ret = 0;
+ int umred = mr->umred;

+ if (atomic_inc_return(&mr->invalidated) > 1) {
+ /* An invalidation call is in flight; wait for it to complete */
+ wait_for_completion(&mr->invalidation_comp);
+ } else {
+ ret = mlx5_ib_invalidate_mr(ibmr);
+ if (ret)
+ return ret;
+ }
+ kfree(mr->peer_id);
+ mr->peer_id = NULL;
+ if (umred) {
+ atomic_set(&mr->invalidated, 0);
+ free_cached_mr(dev, mr);
+ } else {
+ kfree(mr);
+ }
return 0;
}

@@ -1122,6 +1173,29 @@ int mlx5_ib_destroy_mr(struct ib_mr *ibmr)
return err;
}

+static void mlx5_invalidate_umem(void *invalidation_cookie,
+ struct ib_umem *umem,
+ unsigned long addr, size_t size)
+{
+ struct mlx5_ib_mr *mr;
+ struct mlx5_ib_peer_id *peer_id = (struct mlx5_ib_peer_id *)invalidation_cookie;
+
+ wait_for_completion(&peer_id->comp);
+ if (peer_id->mr == NULL)
+ return;
+
+ mr = peer_id->mr;
+ /* This function is called under client peer lock so its resources are race protected */
+ if (atomic_inc_return(&mr->invalidated) > 1) {
+ umem->invalidation_ctx->inflight_invalidation = 1;
+ return;
+ }
+
+ umem->invalidation_ctx->peer_callback = 1;
+ mlx5_ib_invalidate_mr(&mr->ibmr);
+ complete(&mr->invalidation_comp);
+}
+
struct ib_mr *mlx5_ib_alloc_fast_reg_mr(struct ib_pd *pd,
int max_page_list_len)
{
--
1.7.1

Yishai Hadas
2014-10-23 12:02:55 UTC
Adds the functionality required to work with peer memory
clients that require invalidation support.

It includes:

- A umem invalidation callback. Once called, it should free any HW
resources assigned to that umem, then free the peer resources
corresponding to that umem.
- The MR object related to that umem stays alive until dereg_mr is
called.
- Synchronization between dereg_mr and the invalidate callback.
- Advertising the P2P device capability.

Signed-off-by: Yishai Hadas <yishaih-VPRAkNaXOzVWk0Htik3J/***@public.gmane.org>
Signed-off-by: Shachar Raindel <raindel-VPRAkNaXOzVWk0Htik3J/***@public.gmane.org>
---
drivers/infiniband/hw/mlx4/main.c | 3 +-
drivers/infiniband/hw/mlx4/mlx4_ib.h | 5 ++
drivers/infiniband/hw/mlx4/mr.c | 81 +++++++++++++++++++++++++++++++---
3 files changed, 81 insertions(+), 8 deletions(-)

diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c
index c7586a1..2f349a2 100644
--- a/drivers/infiniband/hw/mlx4/main.c
+++ b/drivers/infiniband/hw/mlx4/main.c
@@ -162,7 +162,8 @@ static int mlx4_ib_query_device(struct ib_device *ibdev,
IB_DEVICE_PORT_ACTIVE_EVENT |
IB_DEVICE_SYS_IMAGE_GUID |
IB_DEVICE_RC_RNR_NAK_GEN |
- IB_DEVICE_BLOCK_MULTICAST_LOOPBACK;
+ IB_DEVICE_BLOCK_MULTICAST_LOOPBACK |
+ IB_DEVICE_PEER_MEMORY;
if (dev->dev->caps.flags & MLX4_DEV_CAP_FLAG_BAD_PKEY_CNTR)
props->device_cap_flags |= IB_DEVICE_BAD_PKEY_CNTR;
if (dev->dev->caps.flags & MLX4_DEV_CAP_FLAG_BAD_QKEY_CNTR)
diff --git a/drivers/infiniband/hw/mlx4/mlx4_ib.h b/drivers/infiniband/hw/mlx4/mlx4_ib.h
index 6eb743f..4b3dc70 100644
--- a/drivers/infiniband/hw/mlx4/mlx4_ib.h
+++ b/drivers/infiniband/hw/mlx4/mlx4_ib.h
@@ -116,6 +116,11 @@ struct mlx4_ib_mr {
struct ib_mr ibmr;
struct mlx4_mr mmr;
struct ib_umem *umem;
+ atomic_t invalidated;
+ struct completion invalidation_comp;
+ /* lock protects the live indication */
+ struct mutex lock;
+ int live;
};

struct mlx4_ib_mw {
diff --git a/drivers/infiniband/hw/mlx4/mr.c b/drivers/infiniband/hw/mlx4/mr.c
index ad4cdfd..ddc9530 100644
--- a/drivers/infiniband/hw/mlx4/mr.c
+++ b/drivers/infiniband/hw/mlx4/mr.c
@@ -59,7 +59,7 @@ struct ib_mr *mlx4_ib_get_dma_mr(struct ib_pd *pd, int acc)
struct mlx4_ib_mr *mr;
int err;

- mr = kmalloc(sizeof *mr, GFP_KERNEL);
+ mr = kzalloc(sizeof *mr, GFP_KERNEL);
if (!mr)
return ERR_PTR(-ENOMEM);

@@ -130,6 +130,31 @@ out:
return err;
}

+static void mlx4_invalidate_umem(void *invalidation_cookie,
+ struct ib_umem *umem,
+ unsigned long addr, size_t size)
+{
+ struct mlx4_ib_mr *mr = (struct mlx4_ib_mr *)invalidation_cookie;
+
+ mutex_lock(&mr->lock);
+ /* This function is called under client peer lock so its resources are race protected */
+ if (atomic_inc_return(&mr->invalidated) > 1) {
+ umem->invalidation_ctx->inflight_invalidation = 1;
+ mutex_unlock(&mr->lock);
+ return;
+ }
+ if (!mr->live) {
+ mutex_unlock(&mr->lock);
+ return;
+ }
+
+ mutex_unlock(&mr->lock);
+ umem->invalidation_ctx->peer_callback = 1;
+ mlx4_mr_free(to_mdev(mr->ibmr.device)->dev, &mr->mmr);
+ ib_umem_release(umem);
+ complete(&mr->invalidation_comp);
+}
+
struct ib_mr *mlx4_ib_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
u64 virt_addr, int access_flags,
struct ib_udata *udata)
@@ -139,28 +164,54 @@ struct ib_mr *mlx4_ib_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
int shift;
int err;
int n;
+ struct ib_peer_memory_client *ib_peer_mem;

- mr = kmalloc(sizeof *mr, GFP_KERNEL);
+ mr = kzalloc(sizeof *mr, GFP_KERNEL);
if (!mr)
return ERR_PTR(-ENOMEM);

+ mutex_init(&mr->lock);
/* Force registering the memory as writable. */
/* Used for memory re-registeration. HCA protects the access */
mr->umem = ib_umem_get(pd->uobject->context, start, length,
access_flags | IB_ACCESS_LOCAL_WRITE, 0,
- IB_PEER_MEM_ALLOW);
+ IB_PEER_MEM_ALLOW | IB_PEER_MEM_INVAL_SUPP);
if (IS_ERR(mr->umem)) {
err = PTR_ERR(mr->umem);
goto err_free;
}

+ ib_peer_mem = mr->umem->ib_peer_mem;
+ if (ib_peer_mem) {
+ err = ib_umem_activate_invalidation_notifier(mr->umem, mlx4_invalidate_umem, mr);
+ if (err)
+ goto err_umem;
+ }
+
+ mutex_lock(&mr->lock);
+ if (atomic_read(&mr->invalidated))
+ goto err_locked_umem;
+
+ if (ib_peer_mem) {
+ if (access_flags & IB_ACCESS_MW_BIND) {
+ /* Prevent binding MW on peer clients, mlx4_invalidate_umem is a void
+ * function and must succeed, however, mlx4_mr_free might fail when MW
+ * are used.
+ */
+ err = -ENOSYS;
+ pr_err("MW is not supported with peer memory client");
+ goto err_locked_umem;
+ }
+ init_completion(&mr->invalidation_comp);
+ }
+
n = ib_umem_page_count(mr->umem);
shift = ilog2(mr->umem->page_size);

err = mlx4_mr_alloc(dev->dev, to_mpd(pd)->pdn, virt_addr, length,
convert_access(access_flags), n, shift, &mr->mmr);
if (err)
- goto err_umem;
+ goto err_locked_umem;

err = mlx4_ib_umem_write_mtt(dev, &mr->mmr.mtt, mr->umem);
if (err)
@@ -171,12 +222,16 @@ struct ib_mr *mlx4_ib_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
goto err_mr;

mr->ibmr.rkey = mr->ibmr.lkey = mr->mmr.key;
-
+ mr->live = 1;
+ mutex_unlock(&mr->lock);
return &mr->ibmr;

err_mr:
(void) mlx4_mr_free(to_mdev(pd->device)->dev, &mr->mmr);

+err_locked_umem:
+ mutex_unlock(&mr->lock);
+
err_umem:
ib_umem_release(mr->umem);

@@ -284,11 +339,23 @@ int mlx4_ib_dereg_mr(struct ib_mr *ibmr)
struct mlx4_ib_mr *mr = to_mmr(ibmr);
int ret;

+ if (atomic_inc_return(&mr->invalidated) > 1) {
+ wait_for_completion(&mr->invalidation_comp);
+ goto end;
+ }
+
ret = mlx4_mr_free(to_mdev(ibmr->device)->dev, &mr->mmr);
- if (ret)
+ if (ret) {
+ /* Error is not expected here, except when memory windows
+ * are bound to MR which is not supported with
+ * peer memory clients.
+ */
+ atomic_set(&mr->invalidated, 0);
return ret;
+ }
if (mr->umem)
ib_umem_release(mr->umem);
+end:
kfree(mr);

return 0;
@@ -365,7 +432,7 @@ struct ib_mr *mlx4_ib_alloc_fast_reg_mr(struct ib_pd *pd,
struct mlx4_ib_mr *mr;
int err;

- mr = kmalloc(sizeof *mr, GFP_KERNEL);
+ mr = kzalloc(sizeof *mr, GFP_KERNEL);
if (!mr)
return ERR_PTR(-ENOMEM);
--
1.7.1

Yishai Hadas
2014-10-23 12:02:57 UTC
Adds an example peer memory client that implements the peer memory
API defined in include/rdma/peer_mem.h.
It uses host memory functionality to implement the API and
can serve as a reference for peer memory client writers.

Usage:
- It's built as a kernel module.
- The sample peer memory client takes ownership of a virtual memory area
defined using module parameters.
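
The ownership test that the sample client applies in its acquire callback can
be modeled in userspace. This is a rough sketch, assuming a 4 KiB page size and
using hypothetical toy_ names; it mirrors the comparison and page rounding done
in example_mem_acquire but is not the module code itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_PAGE_SIZE 4096UL                  /* assumption: 4 KiB pages */
#define TOY_PAGE_MASK (~(TOY_PAGE_SIZE - 1))

/* Mirrors the check in example_mem_acquire: the buffer must start at or
 * after the range start, and its end is compared strictly against the
 * range end.
 */
static bool toy_owns_range(unsigned long start, unsigned long end,
			   unsigned long addr, size_t size)
{
	return addr >= start && addr + size < end;
}

/* Number of bytes covered once the buffer is expanded to page boundaries. */
static unsigned long toy_mapped_size(unsigned long addr, size_t size)
{
	unsigned long first = addr & TOY_PAGE_MASK;
	unsigned long last = (addr + size + TOY_PAGE_SIZE - 1) & TOY_PAGE_MASK;

	return last - first;
}
```

Because the end comparison is strict, a buffer that ends exactly at the end of
the configured range is treated as not owned by the sample client.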

Signed-off-by: Yishai Hadas <yishaih-VPRAkNaXOzVWk0Htik3J/***@public.gmane.org>
Signed-off-by: Shachar Raindel <raindel-VPRAkNaXOzVWk0Htik3J/***@public.gmane.org>
---
samples/Kconfig | 10 ++
samples/Makefile | 3 +-
samples/peer_memory/Makefile | 1 +
samples/peer_memory/example_peer_mem.c | 260 ++++++++++++++++++++++++++++++++
4 files changed, 273 insertions(+), 1 deletions(-)
create mode 100644 samples/peer_memory/Makefile
create mode 100644 samples/peer_memory/example_peer_mem.c

diff --git a/samples/Kconfig b/samples/Kconfig
index 6181c2c..b75b771 100644
--- a/samples/Kconfig
+++ b/samples/Kconfig
@@ -21,6 +21,16 @@ config SAMPLE_KOBJECT

If in doubt, say "N" here.

+config SAMPLE_PEER_MEMORY_CLIENT
+ tristate "Build peer memory sample client -- loadable modules only"
+ depends on INFINIBAND_USER_MEM && m
+ help
+ This config option will allow you to build a peer memory
+ example module that can be a very good reference for
+ peer memory client plugin writers.
+
+ If in doubt, say "N" here.
+
config SAMPLE_KPROBES
tristate "Build kprobes examples -- loadable modules only"
depends on KPROBES && m
diff --git a/samples/Makefile b/samples/Makefile
index 1a60c62..b42117a 100644
--- a/samples/Makefile
+++ b/samples/Makefile
@@ -1,4 +1,5 @@
# Makefile for Linux samples code

obj-$(CONFIG_SAMPLES) += kobject/ kprobes/ trace_events/ \
- hw_breakpoint/ kfifo/ kdb/ hidraw/ rpmsg/ seccomp/
+ hw_breakpoint/ kfifo/ kdb/ hidraw/ rpmsg/ seccomp/ \
+ peer_memory/
diff --git a/samples/peer_memory/Makefile b/samples/peer_memory/Makefile
new file mode 100644
index 0000000..f498125
--- /dev/null
+++ b/samples/peer_memory/Makefile
@@ -0,0 +1 @@
+obj-$(CONFIG_SAMPLE_PEER_MEMORY_CLIENT) += example_peer_mem.o
diff --git a/samples/peer_memory/example_peer_mem.c b/samples/peer_memory/example_peer_mem.c
new file mode 100644
index 0000000..b76013c
--- /dev/null
+++ b/samples/peer_memory/example_peer_mem.c
@@ -0,0 +1,260 @@
+/*
+ * Copyright (c) 2014, Mellanox Technologies. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses. You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ * - Redistributions of source code must retain the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer.
+ *
+ * - Redistributions in binary form must reproduce the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer in the documentation and/or other materials
+ * provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+#include <linux/mm.h>
+#include <linux/dma-mapping.h>
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/slab.h>
+#include <linux/errno.h>
+#include <linux/export.h>
+#include <linux/sched.h>
+#include <rdma/peer_mem.h>
+
+#define DRV_NAME "example_peer_mem"
+#define DRV_VERSION "1.0"
+#define DRV_RELDATE __DATE__
+
+MODULE_AUTHOR("Yishai Hadas");
+MODULE_DESCRIPTION("Example peer memory");
+MODULE_LICENSE("Dual BSD/GPL");
+MODULE_VERSION(DRV_VERSION);
+static unsigned long example_mem_start_range;
+static unsigned long example_mem_end_range;
+
+module_param(example_mem_start_range, ulong, 0444);
+MODULE_PARM_DESC(example_mem_start_range, "peer example start memory range");
+module_param(example_mem_end_range, ulong, 0444);
+MODULE_PARM_DESC(example_mem_end_range, "peer example end memory range");
+
+static void *reg_handle;
+
+struct example_mem_context {
+ u64 core_context;
+ u64 page_virt_start;
+ u64 page_virt_end;
+ size_t mapped_size;
+ unsigned long npages;
+ int nmap;
+ unsigned long page_size;
+ int writable;
+ int dirty;
+};
+
+static void example_mem_put_pages(struct sg_table *sg_head, void *context);
+
+/* acquire return code: 1 mine, 0 - not mine */
+static int example_mem_acquire(unsigned long addr, size_t size, void *peer_mem_private_data,
+ char *peer_mem_name, void **client_context)
+{
+ struct example_mem_context *example_mem_context;
+
+ if (!(addr >= example_mem_start_range) ||
+ !(addr + size < example_mem_end_range))
+ /* peer is not the owner */
+ return 0;
+
+ example_mem_context = kzalloc(sizeof(*example_mem_context), GFP_KERNEL);
+ if (!example_mem_context)
+ /* Error case handled as not mine */
+ return 0;
+
+ example_mem_context->page_virt_start = addr & PAGE_MASK;
+ example_mem_context->page_virt_end = (addr + size + PAGE_SIZE - 1) & PAGE_MASK;
+ example_mem_context->mapped_size = example_mem_context->page_virt_end - example_mem_context->page_virt_start;
+
+ /* 1 means mine */
+ *client_context = example_mem_context;
+ __module_get(THIS_MODULE);
+ return 1;
+}
+
+static int example_mem_get_pages(unsigned long addr, size_t size, int write, int force,
+ struct sg_table *sg_head, void *client_context, u64 core_context)
+{
+ int ret;
+ unsigned long npages;
+ unsigned long cur_base;
+ struct page **page_list;
+ struct scatterlist *sg, *sg_list_start;
+ int i;
+ struct example_mem_context *example_mem_context;
+
+ example_mem_context = (struct example_mem_context *)client_context;
+ example_mem_context->core_context = core_context;
+ example_mem_context->page_size = PAGE_SIZE;
+ example_mem_context->writable = write;
+ npages = example_mem_context->mapped_size >> PAGE_SHIFT;
+
+ if (npages == 0)
+ return -EINVAL;
+
+ ret = sg_alloc_table(sg_head, npages, GFP_KERNEL);
+ if (ret)
+ return ret;
+
+ page_list = (struct page **)__get_free_page(GFP_KERNEL);
+ if (!page_list) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ sg_list_start = sg_head->sgl;
+ cur_base = addr & PAGE_MASK;
+
+ while (npages) {
+ ret = get_user_pages(current, current->mm, cur_base,
+ min_t(unsigned long, npages, PAGE_SIZE / sizeof(struct page *)),
+ write, force, page_list, NULL);
+
+ if (ret < 0)
+ goto out;
+
+ example_mem_context->npages += ret;
+ cur_base += ret * PAGE_SIZE;
+ npages -= ret;
+
+ for_each_sg(sg_list_start, sg, ret, i)
+ sg_set_page(sg, page_list[i], PAGE_SIZE, 0);
+
+ /* preparing for next loop */
+ sg_list_start = sg;
+ }
+
+out:
+ if (page_list)
+ free_page((unsigned long)page_list);
+
+ if (ret < 0) {
+ example_mem_put_pages(sg_head, client_context);
+ return ret;
+ }
+ /* mark that pages were exposed from the peer memory */
+ example_mem_context->dirty = 1;
+ return 0;
+}
+
+static int example_mem_dma_map(struct sg_table *sg_head, void *context,
+ struct device *dma_device, int dmasync,
+ int *nmap)
+{
+ DEFINE_DMA_ATTRS(attrs);
+ struct example_mem_context *example_mem_context =
+ (struct example_mem_context *)context;
+
+ if (dmasync)
+ dma_set_attr(DMA_ATTR_WRITE_BARRIER, &attrs);
+ example_mem_context->nmap = dma_map_sg_attrs(dma_device, sg_head->sgl,
+ example_mem_context->npages,
+ DMA_BIDIRECTIONAL, &attrs);
+ if (example_mem_context->nmap <= 0)
+ return -ENOMEM;
+
+ *nmap = example_mem_context->nmap;
+ return 0;
+}
+
+static int example_mem_dma_unmap(struct sg_table *sg_head, void *context,
+ struct device *dma_device)
+{
+ struct example_mem_context *example_mem_context =
+ (struct example_mem_context *)context;
+
+ dma_unmap_sg(dma_device, sg_head->sgl,
+ example_mem_context->nmap,
+ DMA_BIDIRECTIONAL);
+ return 0;
+}
+
+static void example_mem_put_pages(struct sg_table *sg_head, void *context)
+{
+ struct scatterlist *sg;
+ struct page *page;
+ int i;
+
+ struct example_mem_context *example_mem_context =
+ (struct example_mem_context *)context;
+
+ for_each_sg(sg_head->sgl, sg, example_mem_context->npages, i) {
+ page = sg_page(sg);
+ if (example_mem_context->writable && example_mem_context->dirty)
+ set_page_dirty_lock(page);
+ put_page(page);
+ }
+
+ sg_free_table(sg_head);
+}
+
+static void example_mem_release(void *context)
+{
+ struct example_mem_context *example_mem_context =
+ (struct example_mem_context *)context;
+
+ kfree(example_mem_context);
+ module_put(THIS_MODULE);
+}
+
+static unsigned long example_mem_get_page_size(void *context)
+{
+ struct example_mem_context *example_mem_context =
+ (struct example_mem_context *)context;
+
+ return example_mem_context->page_size;
+}
+
+static const struct peer_memory_client example_mem_client = {
+ .name = DRV_NAME,
+ .version = DRV_VERSION,
+ .acquire = example_mem_acquire,
+ .get_pages = example_mem_get_pages,
+ .dma_map = example_mem_dma_map,
+ .dma_unmap = example_mem_dma_unmap,
+ .put_pages = example_mem_put_pages,
+ .get_page_size = example_mem_get_page_size,
+ .release = example_mem_release,
+};
+
+static int __init example_mem_client_init(void)
+{
+ reg_handle = ib_register_peer_memory_client(&example_mem_client, NULL);
+ if (!reg_handle)
+ return -EINVAL;
+
+ return 0;
+}
+
+static void __exit example_mem_client_cleanup(void)
+{
+ ib_unregister_peer_memory_client(reg_handle);
+}
+
+module_init(example_mem_client_init);
+module_exit(example_mem_client_cleanup);
--
1.7.1

Yishai Hadas
2014-10-23 12:02:52 UTC
Adds infrastructure to manage the core context for a given umem;
it is needed for the invalidation flow.

The core context is supplied to peer clients as opaque data for the
memory pages represented by a umem.

If the peer client needs to invalidate memory it provided through the peer memory callbacks,
it should call the invalidation callback, supplying the relevant core context.
IB core will use this context to invalidate the relevant memory.

To handle inflight invalidation calls that run in parallel with
releasing this memory (e.g. by dereg_mr), we must ensure that the context
is valid before accessing it, which is why we cannot use the core context
pointer directly. For that reason we added a lookup table that maps
a ticket id to a core context. The peer client gets and supplies the ticket
id, and the core checks whether the ticket exists before accessing its corresponding
context.

The ticket id is provided to the peer memory client as part of the
get_pages API. The only "remote" party using it is the peer memory
client. It is used in the invalidation flow to specify which memory
registration should be invalidated. This flow might be called
asynchronously, in parallel with an ongoing dereg_mr operation. As such,
the invalidation flow might be called after the memory registration
has been completely released. Relying on a pointer-based or IDR-based
ticket value can result in spurious invalidation of unrelated memory
regions. Internally, we carefully lock the data structures and
synchronize as needed when extracting the context from the
ticket. This ensures a proper, synchronized release of the memory
mapping. The ticket mechanism allows us to safely ignore inflight
invalidation calls that arrived too late.
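
The ticket lookup can be illustrated with a small userspace model. This is a
sketch under simplifying assumptions: a fixed-size array stands in for the
kernel linked list, locking is omitted, and all toy_ names are hypothetical.
It shows why a monotonically increasing ticket lets a late invalidation be
ignored: once a ticket is removed its value is never handed out again, so a
stale lookup finds nothing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_TICKETS 8

struct toy_ticket {
	uint64_t key;     /* 0 marks a free slot; real tickets start at 1 */
	void *context;
};

struct toy_client {
	struct toy_ticket slots[TOY_MAX_TICKETS];
	uint64_t last_ticket;  /* next ticket to hand out, never reused */
};

static void toy_client_init(struct toy_client *c)
{
	for (size_t i = 0; i < TOY_MAX_TICKETS; i++) {
		c->slots[i].key = 0;
		c->slots[i].context = NULL;
	}
	c->last_ticket = 1;
}

/* Returns the new ticket, or 0 if the table is full. */
static uint64_t toy_insert(struct toy_client *c, void *context)
{
	for (size_t i = 0; i < TOY_MAX_TICKETS; i++) {
		if (c->slots[i].key == 0) {
			c->slots[i].key = c->last_ticket++;
			c->slots[i].context = context;
			return c->slots[i].key;
		}
	}
	return 0;
}

/* Returns the context for a live ticket, or NULL if it is gone. */
static void *toy_search(struct toy_client *c, uint64_t key)
{
	if (key == 0)
		return NULL;
	for (size_t i = 0; i < TOY_MAX_TICKETS; i++)
		if (c->slots[i].key == key)
			return c->slots[i].context;
	return NULL;
}

/* Returns 0 on success, 1 if the ticket was not found. */
static int toy_remove(struct toy_client *c, uint64_t key)
{
	if (key == 0)
		return 1;
	for (size_t i = 0; i < TOY_MAX_TICKETS; i++) {
		if (c->slots[i].key == key) {
			c->slots[i].key = 0;
			c->slots[i].context = NULL;
			return 0;
		}
	}
	return 1;
}
```

A raw pointer used as the handle could be freed and reallocated for an
unrelated registration, which is the spurious invalidation the ticket scheme
is meant to rule out.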

Signed-off-by: Yishai Hadas <yishaih-VPRAkNaXOzVWk0Htik3J/***@public.gmane.org>
Signed-off-by: Shachar Raindel <raindel-VPRAkNaXOzVWk0Htik3J/***@public.gmane.org>
---
drivers/infiniband/core/peer_mem.c | 84 ++++++++++++++++++++++++++++++++++++
include/rdma/ib_peer_mem.h | 18 ++++++++
include/rdma/ib_umem.h | 6 +++
3 files changed, 108 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/core/peer_mem.c b/drivers/infiniband/core/peer_mem.c
index cc6e9e1..2f34552 100644
--- a/drivers/infiniband/core/peer_mem.c
+++ b/drivers/infiniband/core/peer_mem.c
@@ -42,6 +42,87 @@ static int ib_invalidate_peer_memory(void *reg_handle, u64 core_context)
return -ENOSYS;
}

+static int ib_peer_insert_context(struct ib_peer_memory_client *ib_peer_client,
+ void *context,
+ u64 *context_ticket)
+{
+ struct core_ticket *core_ticket = kzalloc(sizeof(*core_ticket), GFP_KERNEL);
+
+ if (!core_ticket)
+ return -ENOMEM;
+
+ mutex_lock(&ib_peer_client->lock);
+ core_ticket->key = ib_peer_client->last_ticket++;
+ core_ticket->context = context;
+ list_add_tail(&core_ticket->ticket_list,
+ &ib_peer_client->core_ticket_list);
+ *context_ticket = core_ticket->key;
+ mutex_unlock(&ib_peer_client->lock);
+
+ return 0;
+}
+
+/* Caller should be holding the peer client lock, specifically, the caller should hold ib_peer_client->lock */
+static int ib_peer_remove_context(struct ib_peer_memory_client *ib_peer_client,
+ u64 key)
+{
+ struct core_ticket *core_ticket;
+
+ list_for_each_entry(core_ticket, &ib_peer_client->core_ticket_list,
+ ticket_list) {
+ if (core_ticket->key == key) {
+ list_del(&core_ticket->ticket_list);
+ kfree(core_ticket);
+ return 0;
+ }
+ }
+
+ return 1;
+}
+
+/**
+ * ib_peer_create_invalidation_ctx - creates invalidation context for a given umem
+ * @ib_peer_mem: peer client to be used
+ * @umem: umem struct that belongs to that context
+ * @invalidation_ctx: output context
+ */
+int ib_peer_create_invalidation_ctx(struct ib_peer_memory_client *ib_peer_mem, struct ib_umem *umem,
+ struct invalidation_ctx **invalidation_ctx)
+{
+ int ret;
+ struct invalidation_ctx *ctx;
+
+ ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
+ if (!ctx)
+ return -ENOMEM;
+
+ ret = ib_peer_insert_context(ib_peer_mem, ctx,
+ &ctx->context_ticket);
+ if (ret) {
+ kfree(ctx);
+ return ret;
+ }
+
+ ctx->umem = umem;
+ umem->invalidation_ctx = ctx;
+ *invalidation_ctx = ctx;
+ return 0;
+}
+
+/**
+ * ib_peer_destroy_invalidation_ctx - destroy a given invalidation context
+ * @ib_peer_mem: peer client to be used
+ * @invalidation_ctx: context to be invalidated
+ */
+void ib_peer_destroy_invalidation_ctx(struct ib_peer_memory_client *ib_peer_mem,
+ struct invalidation_ctx *invalidation_ctx)
+{
+ mutex_lock(&ib_peer_mem->lock);
+ ib_peer_remove_context(ib_peer_mem, invalidation_ctx->context_ticket);
+ mutex_unlock(&ib_peer_mem->lock);
+
+ kfree(invalidation_ctx);
+}
static int ib_memory_peer_check_mandatory(const struct peer_memory_client
*peer_client)
{
@@ -90,9 +171,12 @@ void *ib_register_peer_memory_client(const struct peer_memory_client *peer_clien
if (!ib_peer_client)
return NULL;

+ INIT_LIST_HEAD(&ib_peer_client->core_ticket_list);
+ mutex_init(&ib_peer_client->lock);
init_completion(&ib_peer_client->unload_comp);
kref_init(&ib_peer_client->ref);
ib_peer_client->peer_mem = peer_client;
+ ib_peer_client->last_ticket = 1;
/* Once peer supplied a non NULL callback it's an indication that invalidation support is
* required for any memory owning.
*/
diff --git a/include/rdma/ib_peer_mem.h b/include/rdma/ib_peer_mem.h
index 98056c5..8b28bfe 100644
--- a/include/rdma/ib_peer_mem.h
+++ b/include/rdma/ib_peer_mem.h
@@ -4,6 +4,8 @@
#include <rdma/peer_mem.h>

struct ib_ucontext;
+struct ib_umem;
+struct invalidation_ctx;

struct ib_peer_memory_client {
const struct peer_memory_client *peer_mem;
@@ -11,16 +13,32 @@ struct ib_peer_memory_client {
int invalidation_required;
struct kref ref;
struct completion unload_comp;
+ /* lock is used via the invalidation flow */
+ struct mutex lock;
+ struct list_head core_ticket_list;
+ u64 last_ticket;
};

enum ib_peer_mem_flags {
IB_PEER_MEM_ALLOW = 1,
};

+struct core_ticket {
+ unsigned long key;
+ void *context;
+ struct list_head ticket_list;
+};
+
struct ib_peer_memory_client *ib_get_peer_client(struct ib_ucontext *context, unsigned long addr,
size_t size, void **peer_client_context);

void ib_put_peer_client(struct ib_peer_memory_client *ib_peer_client,
void *peer_client_context);

+int ib_peer_create_invalidation_ctx(struct ib_peer_memory_client *ib_peer_mem, struct ib_umem *umem,
+ struct invalidation_ctx **invalidation_ctx);
+
+void ib_peer_destroy_invalidation_ctx(struct ib_peer_memory_client *ib_peer_mem,
+ struct invalidation_ctx *invalidation_ctx);
+
#endif
diff --git a/include/rdma/ib_umem.h b/include/rdma/ib_umem.h
index a22dde0..3352b14 100644
--- a/include/rdma/ib_umem.h
+++ b/include/rdma/ib_umem.h
@@ -40,6 +40,11 @@

struct ib_ucontext;

+struct invalidation_ctx {
+ struct ib_umem *umem;
+ u64 context_ticket;
+};
+
struct ib_umem {
struct ib_ucontext *context;
size_t length;
@@ -56,6 +61,7 @@ struct ib_umem {
int npages;
/* peer memory that manages this umem */
struct ib_peer_memory_client *ib_peer_mem;
+ struct invalidation_ctx *invalidation_ctx;
/* peer memory private context */
void *peer_mem_client_context;
};
--
1.7.1

Yishai Hadas
2014-10-23 12:02:53 UTC
Adds the required functionality to invalidate a given peer
memory represented by some core context.

Each umem that was built over peer memory and supports invalidation has
some invalidation context assigned to it with the required data to
manage, once peer will call the invalidation callback below actions are
taken:

1) Taking lock on peer client to sync with inflight dereg_mr on that
memory.
2) Once lock is taken have a lookup for ticket id to find the matching
core context.
3) In case found will call umem invalidation function, otherwise call is
returned.

Some notes:
1) The peer invalidate callback is defined to be blocking, so it must
return only once the pages are guaranteed not to be accessed any more. For
that reason ib_invalidate_peer_memory waits for a completion event when
another inflight call is coming in as part of dereg_mr.

2) The peer memory API assumes that a peer client may take a lock to
protect its memory operations. Specifically, its invalidate callback may
be called under that lock, which can lead to an AB/BA deadlock if IB core
calls the get/put pages APIs while holding the IB core peer's lock. For
that reason ib_umem_activate_invalidation_notifier takes the lock and then
checks for an inflight invalidation state before activating the notifier.

3) Once a peer client declares, as part of its registration, that it may
require invalidation support, it cannot own a memory range that does not
support it.
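
The interplay between an early invalidation and the later activation of the
notifier (note 2) can be modelled with a small single-threaded sketch. The
names struct inval_ctx, peer_invalidate(), activate_notifier() and
mark_invalidated() are hypothetical simplifications. In the patch both paths
run under the peer client lock, which is omitted here.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

typedef void (*invalidate_fn)(void *cookie);

/* Hypothetical, reduced model of struct invalidation_ctx. */
struct inval_ctx {
	invalidate_fn func;	/* NULL until the driver activates the notifier */
	void *cookie;
	int peer_invalidated;	/* peer invalidated before activation */
};

/* Peer side: models the "context is not ready yet" branch. */
static void peer_invalidate(struct inval_ctx *ctx)
{
	assert(ctx);
	if (!ctx->func) {
		ctx->peer_invalidated = 1;	/* remembered for activation */
		return;
	}
	ctx->func(ctx->cookie);
}

/* Driver side: models ib_umem_activate_invalidation_notifier(). */
static int activate_notifier(struct inval_ctx *ctx, invalidate_fn func,
			     void *cookie)
{
	assert(ctx && func);
	if (ctx->peer_invalidated)
		return -EINVAL;		/* pages were already invalidated */
	ctx->func = func;
	ctx->cookie = cookie;
	return 0;
}

/* Example notifier: records that the region was invalidated. */
static void mark_invalidated(void *cookie)
{
	*(int *)cookie = 1;
}
```

With this split, a driver that loses the race learns about it from the
return value of the activation call and can fail the registration, instead
of ending up with a memory region whose pages are already gone.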

Signed-off-by: Yishai Hadas <yishaih-VPRAkNaXOzVWk0Htik3J/***@public.gmane.org>
Signed-off-by: Shachar Raindel <raindel-VPRAkNaXOzVWk0Htik3J/***@public.gmane.org>
---
drivers/infiniband/core/peer_mem.c | 83 +++++++++++++++++++++++++++++++++---
drivers/infiniband/core/umem.c | 50 ++++++++++++++++++---
include/rdma/ib_peer_mem.h | 4 +-
include/rdma/ib_umem.h | 17 +++++++
4 files changed, 140 insertions(+), 14 deletions(-)

diff --git a/drivers/infiniband/core/peer_mem.c b/drivers/infiniband/core/peer_mem.c
index 2f34552..d4cf31c 100644
--- a/drivers/infiniband/core/peer_mem.c
+++ b/drivers/infiniband/core/peer_mem.c
@@ -37,9 +37,55 @@
static DEFINE_MUTEX(peer_memory_mutex);
static LIST_HEAD(peer_memory_list);

+/* Caller should be holding the peer client lock, ib_peer_client->lock */
+static struct core_ticket *ib_peer_search_context(struct ib_peer_memory_client *ib_peer_client,
+ u64 key)
+{
+ struct core_ticket *core_ticket;
+
+ list_for_each_entry(core_ticket, &ib_peer_client->core_ticket_list,
+ ticket_list) {
+ if (core_ticket->key == key)
+ return core_ticket;
+ }
+
+ return NULL;
+}
+
static int ib_invalidate_peer_memory(void *reg_handle, u64 core_context)
{
- return -ENOSYS;
+ struct ib_peer_memory_client *ib_peer_client = reg_handle;
+ struct invalidation_ctx *invalidation_ctx;
+ struct core_ticket *core_ticket;
+ int need_unlock = 1;
+
+ mutex_lock(&ib_peer_client->lock);
+ core_ticket = ib_peer_search_context(ib_peer_client, core_context);
+ if (!core_ticket)
+ goto out;
+
+ invalidation_ctx = (struct invalidation_ctx *)core_ticket->context;
+ /* If context is not ready yet, mark it to be invalidated */
+ if (!invalidation_ctx->func) {
+ invalidation_ctx->peer_invalidated = 1;
+ goto out;
+ }
+ invalidation_ctx->func(invalidation_ctx->cookie,
+ invalidation_ctx->umem, 0, 0);
+ if (invalidation_ctx->inflight_invalidation) {
+ /* init the completion to wait on before letting other thread to run */
+ init_completion(&invalidation_ctx->comp);
+ mutex_unlock(&ib_peer_client->lock);
+ need_unlock = 0;
+ wait_for_completion(&invalidation_ctx->comp);
+ }
+
+ kfree(invalidation_ctx);
+out:
+ if (need_unlock)
+ mutex_unlock(&ib_peer_client->lock);
+
+ return 0;
}

static int ib_peer_insert_context(struct ib_peer_memory_client *ib_peer_client,
@@ -117,11 +163,30 @@ int ib_peer_create_invalidation_ctx(struct ib_peer_memory_client *ib_peer_mem, s
void ib_peer_destroy_invalidation_ctx(struct ib_peer_memory_client *ib_peer_mem,
struct invalidation_ctx *invalidation_ctx)
{
- mutex_lock(&ib_peer_mem->lock);
- ib_peer_remove_context(ib_peer_mem, invalidation_ctx->context_ticket);
- mutex_unlock(&ib_peer_mem->lock);
+ int peer_callback;
+ int inflight_invalidation;

- kfree(invalidation_ctx);
+ /* If we are under peer callback lock was already taken.*/
+ if (!invalidation_ctx->peer_callback)
+ mutex_lock(&ib_peer_mem->lock);
+ ib_peer_remove_context(ib_peer_mem, invalidation_ctx->context_ticket);
+ /* Make sure to check the inflight flag only after taking the lock and removing
+ * the context from the ticket list. From this point on use local copies of
+ * peer_callback and inflight_invalidation: after complete() the invalidation_ctx
+ * must not be accessed any more, as it may be freed by the callback.
+ */
+ peer_callback = invalidation_ctx->peer_callback;
+ inflight_invalidation = invalidation_ctx->inflight_invalidation;
+ if (inflight_invalidation)
+ complete(&invalidation_ctx->comp);
+
+ /* On peer callback lock is handled externally */
+ if (!peer_callback)
+ mutex_unlock(&ib_peer_mem->lock);
+
+ /* in case under callback context or callback is pending let it free the invalidation context */
+ if (!peer_callback && !inflight_invalidation)
+ kfree(invalidation_ctx);
}
static int ib_memory_peer_check_mandatory(const struct peer_memory_client
*peer_client)
@@ -208,13 +273,19 @@ void ib_unregister_peer_memory_client(void *reg_handle)
EXPORT_SYMBOL(ib_unregister_peer_memory_client);

struct ib_peer_memory_client *ib_get_peer_client(struct ib_ucontext *context, unsigned long addr,
- size_t size, void **peer_client_context)
+ size_t size, unsigned long peer_mem_flags,
+ void **peer_client_context)
{
struct ib_peer_memory_client *ib_peer_client;
int ret;

mutex_lock(&peer_memory_mutex);
list_for_each_entry(ib_peer_client, &peer_memory_list, core_peer_list) {
+ /* In case peer requires invalidation it can't own memory which doesn't support it */
+ if (ib_peer_client->invalidation_required &&
+ (!(peer_mem_flags & IB_PEER_MEM_INVAL_SUPP)))
+ continue;
+
ret = ib_peer_client->peer_mem->acquire(addr, size,
context->peer_mem_private_data,
context->peer_mem_name,
diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index f3e445c..6655d12 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -44,12 +44,19 @@

static struct ib_umem *peer_umem_get(struct ib_peer_memory_client *ib_peer_mem,
struct ib_umem *umem, unsigned long addr,
- int dmasync)
+ int dmasync, unsigned long peer_mem_flags)
{
int ret;
const struct peer_memory_client *peer_mem = ib_peer_mem->peer_mem;
+ struct invalidation_ctx *invalidation_ctx = NULL;

umem->ib_peer_mem = ib_peer_mem;
+ if (peer_mem_flags & IB_PEER_MEM_INVAL_SUPP) {
+ ret = ib_peer_create_invalidation_ctx(ib_peer_mem, umem, &invalidation_ctx);
+ if (ret)
+ goto end;
+ }
+
/*
* We always request write permissions to the pages, to force breaking of any CoW
* during the registration of the MR. For read-only MRs we use the "force" flag to
@@ -60,7 +67,8 @@ static struct ib_umem *peer_umem_get(struct ib_peer_memory_client *ib_peer_mem,
1, !umem->writable,
&umem->sg_head,
umem->peer_mem_client_context,
- 0);
+ invalidation_ctx ?
+ invalidation_ctx->context_ticket : 0);
if (ret)
goto out;

@@ -84,6 +92,9 @@ put_pages:
peer_mem->put_pages(umem->peer_mem_client_context,
&umem->sg_head);
out:
+ if (invalidation_ctx)
+ ib_peer_destroy_invalidation_ctx(ib_peer_mem, invalidation_ctx);
+end:
ib_put_peer_client(ib_peer_mem, umem->peer_mem_client_context);
kfree(umem);
return ERR_PTR(ret);
@@ -91,15 +102,19 @@ out:

static void peer_umem_release(struct ib_umem *umem)
{
- const struct peer_memory_client *peer_mem =
- umem->ib_peer_mem->peer_mem;
+ struct ib_peer_memory_client *ib_peer_mem = umem->ib_peer_mem;
+ const struct peer_memory_client *peer_mem = ib_peer_mem->peer_mem;
+ struct invalidation_ctx *invalidation_ctx = umem->invalidation_ctx;
+
+ if (invalidation_ctx)
+ ib_peer_destroy_invalidation_ctx(ib_peer_mem, invalidation_ctx);

peer_mem->dma_unmap(&umem->sg_head,
umem->peer_mem_client_context,
umem->context->device->dma_device);
peer_mem->put_pages(&umem->sg_head,
umem->peer_mem_client_context);
- ib_put_peer_client(umem->ib_peer_mem, umem->peer_mem_client_context);
+ ib_put_peer_client(ib_peer_mem, umem->peer_mem_client_context);
kfree(umem);
}

@@ -127,6 +142,27 @@ static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int d

}

+int ib_umem_activate_invalidation_notifier(struct ib_umem *umem,
+ umem_invalidate_func_t func,
+ void *cookie)
+{
+ struct invalidation_ctx *invalidation_ctx = umem->invalidation_ctx;
+ int ret = 0;
+
+ mutex_lock(&umem->ib_peer_mem->lock);
+ if (invalidation_ctx->peer_invalidated) {
+ pr_err("ib_umem_activate_invalidation_notifier: pages were invalidated by peer\n");
+ ret = -EINVAL;
+ goto end;
+ }
+ invalidation_ctx->func = func;
+ invalidation_ctx->cookie = cookie;
+ /* from that point any pending invalidations can be called */
+end:
+ mutex_unlock(&umem->ib_peer_mem->lock);
+ return ret;
+}
+EXPORT_SYMBOL(ib_umem_activate_invalidation_notifier);
/**
* ib_umem_get - Pin and DMA map userspace memory.
* @context: userspace context to pin memory for
@@ -179,11 +215,11 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
if (peer_mem_flags & IB_PEER_MEM_ALLOW) {
struct ib_peer_memory_client *peer_mem_client;

- peer_mem_client = ib_get_peer_client(context, addr, size,
+ peer_mem_client = ib_get_peer_client(context, addr, size, peer_mem_flags,
&umem->peer_mem_client_context);
if (peer_mem_client)
return peer_umem_get(peer_mem_client, umem, addr,
- dmasync);
+ dmasync, peer_mem_flags);
}

/* We assume the memory is from hugetlb until proved otherwise */
diff --git a/include/rdma/ib_peer_mem.h b/include/rdma/ib_peer_mem.h
index 8b28bfe..58e0f99 100644
--- a/include/rdma/ib_peer_mem.h
+++ b/include/rdma/ib_peer_mem.h
@@ -21,6 +21,7 @@ struct ib_peer_memory_client {

enum ib_peer_mem_flags {
IB_PEER_MEM_ALLOW = 1,
+ IB_PEER_MEM_INVAL_SUPP = (1<<1),
};

struct core_ticket {
@@ -30,7 +31,8 @@ struct core_ticket {
};

struct ib_peer_memory_client *ib_get_peer_client(struct ib_ucontext *context, unsigned long addr,
- size_t size, void **peer_client_context);
+ size_t size, unsigned long peer_mem_flags,
+ void **peer_client_context);

void ib_put_peer_client(struct ib_peer_memory_client *ib_peer_client,
void *peer_client_context);
diff --git a/include/rdma/ib_umem.h b/include/rdma/ib_umem.h
index 3352b14..6cf433b 100644
--- a/include/rdma/ib_umem.h
+++ b/include/rdma/ib_umem.h
@@ -39,10 +39,21 @@
#include <rdma/ib_peer_mem.h>

struct ib_ucontext;
+struct ib_umem;
+
+typedef void (*umem_invalidate_func_t)(void *invalidation_cookie,
+ struct ib_umem *umem,
+ unsigned long addr, size_t size);

struct invalidation_ctx {
struct ib_umem *umem;
u64 context_ticket;
+ umem_invalidate_func_t func;
+ void *cookie;
+ int peer_callback;
+ int inflight_invalidation;
+ int peer_invalidated;
+ struct completion comp;
};

struct ib_umem {
@@ -73,6 +84,9 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
unsigned long peer_mem_flags);
void ib_umem_release(struct ib_umem *umem);
int ib_umem_page_count(struct ib_umem *umem);
+int ib_umem_activate_invalidation_notifier(struct ib_umem *umem,
+ umem_invalidate_func_t func,
+ void *cookie);

#else /* CONFIG_INFINIBAND_USER_MEM */

@@ -87,6 +101,9 @@ static inline struct ib_umem *ib_umem_get(struct ib_ucontext *context,
static inline void ib_umem_release(struct ib_umem *umem) { }
static inline int ib_umem_page_count(struct ib_umem *umem) { return 0; }

+static inline int ib_umem_activate_invalidation_notifier(struct ib_umem *umem,
+ umem_invalidate_func_t func,
+ void *cookie) { return 0; }
#endif /* CONFIG_INFINIBAND_USER_MEM */

#endif /* IB_UMEM_H */
--
1.7.1

Yishai Hadas
2014-10-23 12:02:49 UTC
Introduce an API between IB core and peer memory clients (e.g. GPU cards)
that gives the HCA read/write access to GPU memory.

As a result, RDMA-based applications can use GPU computing power and the
RDMA interconnect at the same time without copying data between the P2P
devices.

Each peer memory client should register with IB core. In the registration
request it should supply callbacks for its basic memory functionality, such
as get/put pages, get_page_size and dma map/unmap.

The client can optionally require the ability to invalidate memory it
provided by requesting an invalidation callback.

Upon successful registration, IB core provides the client with a unique
registration handle and, if the peer required it, an invalidate callback
function.

The handle should be used when unregistering the client. The callback
function can be used by the client, in later patches, to request the
immediate release of pinned pages.

Each peer must be able to recognize whether it is the owner of a specific
virtual address range. If it is, further calls for memory functionality are
tunneled to that peer.

The recognition is done via the 'acquire' call. The call arguments provide
the address and size of the memory requested. If peer-direct context
information is available from the user verbs context, it is provided as
well. Upon recognition, the acquire call returns a peer direct client
specific context. The peer direct controller passes that context to the
peer direct client callbacks when referring to the specific address range.
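
The registration check and the ownership query described above can be
sketched in userspace roughly as follows. This is a simplified model under
stated assumptions: struct peer_client is a trimmed-down, hypothetical
version of struct peer_memory_client, and check_mandatory(), find_owner()
and the demo_* functions (including the made-up address window) do not
exist in the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, trimmed-down version of struct peer_memory_client. */
struct peer_client {
	const char *name;
	int (*acquire)(unsigned long addr, size_t size, void **client_context);
	int (*get_pages)(unsigned long addr, size_t size, void *client_context);
	void (*put_pages)(void *client_context);
};

/* Returns 0 when every mandatory callback is present, -1 otherwise. */
static int check_mandatory(const struct peer_client *c)
{
	return (c->acquire && c->get_pages && c->put_pages) ? 0 : -1;
}

/* Ask each client in turn; the first one whose acquire returns 1 owns the range. */
static const struct peer_client *
find_owner(const struct peer_client *const *clients, size_t n,
	   unsigned long addr, size_t size, void **client_context)
{
	size_t i;

	for (i = 0; i < n; i++) {
		assert(clients[i]->acquire);	/* guaranteed by check_mandatory() */
		if (clients[i]->acquire(addr, size, client_context) == 1)
			return clients[i];
	}
	return NULL;	/* no peer owns it: treat as ordinary host memory */
}

/* Example peer that claims the made-up window [0x1000, 0x2000). */
static int demo_acquire(unsigned long addr, size_t size, void **client_context)
{
	static int demo_ctx;

	if (addr < 0x1000UL || addr > 0x2000UL || size > 0x2000UL - addr)
		return 0;
	*client_context = &demo_ctx;	/* opaque context handed back later */
	return 1;
}

static int demo_get_pages(unsigned long addr, size_t size, void *client_context)
{
	(void)addr; (void)size; (void)client_context;
	return 0;
}

static void demo_put_pages(void *client_context)
{
	(void)client_context;
}

static const struct peer_client demo_client = {
	"demo", demo_acquire, demo_get_pages, demo_put_pages
};
```

Returning 0 from acquire for any range the peer does not recognize keeps the
fallback path simple: the caller only has to distinguish "some peer owns
this" from "nobody does".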

Signed-off-by: Yishai Hadas <yishaih-VPRAkNaXOzVWk0Htik3J/***@public.gmane.org>
Signed-off-by: Shachar Raindel <raindel-VPRAkNaXOzVWk0Htik3J/***@public.gmane.org>
---
drivers/infiniband/core/Makefile | 3 +-
drivers/infiniband/core/peer_mem.c | 112 ++++++++++++++++
include/rdma/ib_peer_mem.h | 12 ++
include/rdma/peer_mem.h | 247 ++++++++++++++++++++++++++++++++++++
4 files changed, 373 insertions(+), 1 deletions(-)
create mode 100644 drivers/infiniband/core/peer_mem.c
create mode 100644 include/rdma/ib_peer_mem.h
create mode 100644 include/rdma/peer_mem.h

diff --git a/drivers/infiniband/core/Makefile b/drivers/infiniband/core/Makefile
index ffd0af6..e541ff0 100644
--- a/drivers/infiniband/core/Makefile
+++ b/drivers/infiniband/core/Makefile
@@ -9,7 +9,8 @@ obj-$(CONFIG_INFINIBAND_USER_ACCESS) += ib_uverbs.o ib_ucm.o \
$(user_access-y)

ib_core-y := packer.o ud_header.o verbs.o sysfs.o \
- device.o fmr_pool.o cache.o netlink.o
+ device.o fmr_pool.o cache.o netlink.o \
+ peer_mem.o
ib_core-$(CONFIG_INFINIBAND_USER_MEM) += umem.o

ib_mad-y := mad.o smi.o agent.o mad_rmpp.o
diff --git a/drivers/infiniband/core/peer_mem.c b/drivers/infiniband/core/peer_mem.c
new file mode 100644
index 0000000..c00af39
--- /dev/null
+++ b/drivers/infiniband/core/peer_mem.c
@@ -0,0 +1,112 @@
+/*
+ * Copyright (c) 2014, Mellanox Technologies. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses. You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ * - Redistributions of source code must retain the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer.
+ *
+ * - Redistributions in binary form must reproduce the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer in the documentation and/or other materials
+ * provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#include <rdma/ib_peer_mem.h>
+#include <rdma/ib_verbs.h>
+#include <rdma/ib_umem.h>
+
+static DEFINE_MUTEX(peer_memory_mutex);
+static LIST_HEAD(peer_memory_list);
+
+static int ib_invalidate_peer_memory(void *reg_handle, u64 core_context)
+{
+ return -ENOSYS;
+}
+
+static int ib_memory_peer_check_mandatory(const struct peer_memory_client
+ *peer_client)
+{
+#define PEER_MEM_MANDATORY_FUNC(x) { offsetof(struct peer_memory_client, x), #x }
+ static const struct {
+ size_t offset;
+ char *name;
+ } mandatory_table[] = {
+ PEER_MEM_MANDATORY_FUNC(acquire),
+ PEER_MEM_MANDATORY_FUNC(get_pages),
+ PEER_MEM_MANDATORY_FUNC(put_pages),
+ PEER_MEM_MANDATORY_FUNC(get_page_size),
+ PEER_MEM_MANDATORY_FUNC(dma_map),
+ PEER_MEM_MANDATORY_FUNC(dma_unmap)
+ };
+ int i;
+
+ for (i = 0; i < ARRAY_SIZE(mandatory_table); ++i) {
+ if (!*(void **)((void *)peer_client + mandatory_table[i].offset)) {
+ pr_err("Peer memory %s is missing mandatory function %s\n",
+ peer_client->name, mandatory_table[i].name);
+ return -EINVAL;
+ }
+ }
+
+ return 0;
+}
+
+void *ib_register_peer_memory_client(const struct peer_memory_client *peer_client,
+ invalidate_peer_memory *invalidate_callback)
+{
+ struct ib_peer_memory_client *ib_peer_client;
+
+ if (ib_memory_peer_check_mandatory(peer_client))
+ return NULL;
+
+ ib_peer_client = kzalloc(sizeof(*ib_peer_client), GFP_KERNEL);
+ if (!ib_peer_client)
+ return NULL;
+
+ ib_peer_client->peer_mem = peer_client;
+ /* A non-NULL callback pointer supplied by the peer indicates that invalidation
+ * support is required for any memory it owns.
+ */
+ if (invalidate_callback) {
+ *invalidate_callback = ib_invalidate_peer_memory;
+ ib_peer_client->invalidation_required = 1;
+ }
+
+ mutex_lock(&peer_memory_mutex);
+ list_add_tail(&ib_peer_client->core_peer_list, &peer_memory_list);
+ mutex_unlock(&peer_memory_mutex);
+
+ return ib_peer_client;
+}
+EXPORT_SYMBOL(ib_register_peer_memory_client);
+
+void ib_unregister_peer_memory_client(void *reg_handle)
+{
+ struct ib_peer_memory_client *ib_peer_client = reg_handle;
+
+ mutex_lock(&peer_memory_mutex);
+ list_del(&ib_peer_client->core_peer_list);
+ mutex_unlock(&peer_memory_mutex);
+
+ kfree(ib_peer_client);
+}
+EXPORT_SYMBOL(ib_unregister_peer_memory_client);
diff --git a/include/rdma/ib_peer_mem.h b/include/rdma/ib_peer_mem.h
new file mode 100644
index 0000000..fac37b7
--- /dev/null
+++ b/include/rdma/ib_peer_mem.h
@@ -0,0 +1,12 @@
+#if !defined(IB_PEER_MEM_H)
+#define IB_PEER_MEM_H
+
+#include <rdma/peer_mem.h>
+
+struct ib_peer_memory_client {
+ const struct peer_memory_client *peer_mem;
+ struct list_head core_peer_list;
+ int invalidation_required;
+};
+
+#endif
diff --git a/include/rdma/peer_mem.h b/include/rdma/peer_mem.h
new file mode 100644
index 0000000..8368f7f
--- /dev/null
+++ b/include/rdma/peer_mem.h
@@ -0,0 +1,247 @@
+/*
+ * Copyright (c) 2014, Mellanox Technologies. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses. You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ * - Redistributions of source code must retain the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer.
+ *
+ * - Redistributions in binary form must reproduce the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer in the documentation and/or other materials
+ * provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#if !defined(PEER_MEM_H)
+#define PEER_MEM_H
+
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/slab.h>
+#include <linux/errno.h>
+#include <linux/export.h>
+#include <linux/scatterlist.h>
+
+#define IB_PEER_MEMORY_NAME_MAX 64
+#define IB_PEER_MEMORY_VER_MAX 16
+
+/**
+ * struct peer_memory_client - registration information for peer client.
+ * @name: peer client name
+ * @version: peer client version
+ * @acquire: callback function to be used by IB core to detect whether a
+ * virtual address is under the responsibility of a specific peer client.
+ * @get_pages: callback function to be used by IB core asking the peer client to pin
+ * the physical pages of the given address range and return that information.
+ * It is equivalent to the kernel API get_user_pages(), but targets peer memory.
+ * @dma_map: callback function to be used by IB core asking the peer client to fill
+ * the dma address mapping for a given address range.
+ * @dma_unmap: callback function to be used by IB core asking the peer client to take
+ * relevant actions to unmap the memory.
+ * @put_pages: callback function to be used by IB core asking the peer client to remove the
+ * pinning from the given memory.
+ * It's the peer-direct equivalent of the kernel API put_page.
+ * @get_page_size: callback function to be used by IB core to query the peer client for
+ * the page size for the given allocation.
+ * @release: callback function to be used by IB core asking peer client to release all
+ * resources associated with previous acquire call. The call will be performed
+ * only for contexts that have been successfully acquired (i.e. acquire returned a non-zero value).
+ * Additionally, IB core guarantees that there will be no pages pinned through this context when the callback is called.
+ *
+ * The subsections in this description contain detailed description
+ * of the callback arguments and expected return values for the
+ * callbacks defined in this struct.
+ *
+ * acquire:
+ *
+ * Callback function to be used by IB core to detect
+ * whether a virtual address is under the responsibility
+ * of a specific peer client.
+ *
+ * addr [IN] - virtual address to be checked whether belongs to peer.
+ *
+ * size [IN] - size of memory area starting at addr.
+ *
+ * peer_mem_private_data [IN] - The contents of ib_ucontext-> peer_mem_private_data.
+ * This parameter allows usage of the peer-direct
+ * API in implementations where it is impossible
+ * to detect if the memory belongs to the device
+ * based upon the virtual address alone. In such
+ * cases, the peer device can create a special
+ * ib_ucontext, which will be associated with the
+ * relevant peer memory.
+ *
+ * peer_mem_name [IN] - The contents of ib_ucontext-> peer_mem_name.
+ * Used to identify the peer memory client that
+ * initialized the ib_ucontext.
+ * This parameter is normally used along with
+ * peer_mem_private_data.
+ * client_context [OUT] - peer opaque data which holds a peer context for
+ * the acquired address range, will be provided
+ * back to the peer memory in subsequent
+ * calls for that given memory.
+ *
+ * If the peer takes responsibility for the given address range, further calls for memory
+ * management will be directed to the callbacks of this peer client.
+ *
+ * Return - 1 if the peer client takes responsibility for that range, otherwise 0.
+ * Any peer internal error should result in a zero answer. If the address range
+ * really belongs to the peer, no owner will be found and the application will get
+ * an error from IB core, as expected.
+ *
+ * get_pages:
+ *
+ * Callback function to be used by IB core asking the
+ * peer client to pin the physical pages of the given
+ * address range and return that information. It is
+ * equivalent to the kernel API get_user_pages(), but
+ * targets peer memory.
+ *
+ * addr [IN] - start virtual address of that given allocation.
+ *
+ * size [IN] - size of memory area starting at addr.
+ *
+ * write [IN] - indicates whether the pages will be written to by the caller.
+ * Same meaning as of kernel API get_user_pages, can be
+ * ignored if not relevant.
+ *
+ * force [IN] - indicates whether to force write access even if user
+ * mapping is read only. Same meaning as of kernel API
+ * get_user_pages, can be ignored if not relevant.
+ *
+ * sg_head [IN/OUT] - pointer to head of struct sg_table.
+ * The peer client should allocate a table big
+ * enough to store all of the required entries. This
+ * function should fill the table with physical addresses
+ * and sizes of the memory segments composing this
+ * memory mapping.
+ * The table allocation can be done using sg_alloc_table.
+ * Filling in the physical memory addresses and size can
+ * be done using sg_set_page.
+ *
+ * client_context [IN] - peer context for the given allocation, as received from
+ * the acquire call.
+ *
+ * core_context [IN] - IB core context. If the peer client wishes to
+ * invalidate any of the pages pinned through this API,
+ * it must provide this context as an argument to the
+ * invalidate callback.
+ *
+ * Return - 0 success, otherwise errno error code.
+ *
+ * dma_map:
+ *
+ * Callback function to be used by IB core asking the peer client to fill
+ * the dma address mapping for a given address range.
+ *
+ * sg_head [IN/OUT] - pointer to head of struct sg_table. The peer memory
+ * should fill the dma_address & dma_length for
+ * each scatter gather entry in the table.
+ *
+ * client_context [IN] - peer context for the allocation mapped.
+ *
+ * dma_device [IN] - the RDMA capable device which requires access to the
+ * peer memory.
+ *
+ * dmasync [IN] - flush in-flight DMA when the memory region is written.
+ * Same meaning as with host memory mapping, can be ignored if not relevant.
+ *
+ * nmap [OUT] - number of mapped/set entries.
+ *
+ * Return - 0 success, otherwise errno error code.
+ *
+ * dma_unmap:
+ *
+ * Callback function to be used by IB core asking the peer client to take
+ * relevant actions to unmap the memory.
+ *
+ * sg_head [IN] - pointer to head of struct sg_table, holding the
+ * dma_address & dma_length that were filled in for
+ * each scatter gather entry in the table.
+ *
+ * client_context [IN] - peer context for the allocation mapped.
+ *
+ * dma_device [IN] - the RDMA capable device which requires access to the
+ * peer memory.
+ *
+ * Return - 0 success, otherwise errno error code.
+ *
+ * put_pages:
+ *
+ * Callback function to be used by IB core asking the peer client to remove the
+ * pinning from the given memory.
+ * It's the peer-direct equivalent of the kernel API put_page.
+ *
+ * sg_head [IN] - pointer to head of struct sg_table.
+ *
+ * client_context [IN] - peer context for that given allocation.
+ *
+ * get_page_size:
+ *
+ * Callback function to be used by IB core to query the
+ * peer client for the page size for the given
+ * allocation.
+ *
+ * sg_head [IN] - pointer to head of struct sg_table.
+ *
+ * client_context [IN] - peer context for that given allocation.
+ *
+ * Return - Page size in bytes
+ *
+ * release:
+ *
+ * Callback function to be used by IB core asking peer
+ * client to release all resources associated with
+ * previous acquire call. The call will be performed only
+ * for contexts that have been successfully acquired
+ * (i.e. acquire returned a non-zero value).
+ * Additionally, IB core guarantees that there will be no
+ * pages pinned through this context when the callback is
+ * called.
+ *
+ * client_context [IN] - peer context for the given allocation.
+ *
+ **/
+struct peer_memory_client {
+ char name[IB_PEER_MEMORY_NAME_MAX];
+ char version[IB_PEER_MEMORY_VER_MAX];
+ int (*acquire)(unsigned long addr, size_t size, void *peer_mem_private_data,
+ char *peer_mem_name, void **client_context);
+ int (*get_pages)(unsigned long addr,
+ size_t size, int write, int force,
+ struct sg_table *sg_head,
+ void *client_context, u64 core_context);
+ int (*dma_map)(struct sg_table *sg_head, void *client_context,
+ struct device *dma_device, int dmasync, int *nmap);
+ int (*dma_unmap)(struct sg_table *sg_head, void *client_context,
+ struct device *dma_device);
+ void (*put_pages)(struct sg_table *sg_head, void *client_context);
+ unsigned long (*get_page_size)(void *client_context);
+ void (*release)(void *client_context);
+};
+
+typedef int (*invalidate_peer_memory)(void *reg_handle, u64 core_context);
+
+void *ib_register_peer_memory_client(const struct peer_memory_client *peer_client,
+ invalidate_peer_memory *invalidate_callback);
+void ib_unregister_peer_memory_client(void *reg_handle);
+
+#endif
--
1.7.1

Yishai Hadas
2014-10-23 12:02:54 UTC
Supply the functionality required to expose information and statistics
over sysfs for a given peer memory client.

This mechanism enables a userspace application to check which peers are
available (based on name & version) and, based on that, decide whether it
can run successfully.

The root sysfs directory is /sys/kernel/infiniband/memory_peers/<peer_name>;
under that directory reside the files that represent the statistics for
that peer.
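
As an illustration of how an application might consume these files, the
sketch below reads one counter and derives a live count from a pair of
monotonic counters (for example num_alloc_mrs and num_dealloc_mrs, as
described in the peer_memory.txt file added by this patch). The helper names
parse_counter(), read_counter() and live_count() are hypothetical, and the
sketch assumes each attribute prints a single decimal number.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/* Parse one decimal counter in the form a sysfs attribute prints ("123\n"). */
static int parse_counter(const char *text, unsigned long long *out)
{
	char *end;

	assert(text && out);
	errno = 0;
	*out = strtoull(text, &end, 10);
	if (errno || end == text)
		return -1;	/* overflow or no digits at all */
	return 0;
}

/* Read a counter file; returns 0 on success, -1 if it is missing or malformed. */
static int read_counter(const char *path, unsigned long long *out)
{
	char buf[32];
	FILE *f = fopen(path, "r");
	int ret = -1;

	if (!f)
		return -1;
	if (fgets(buf, sizeof(buf), f))
		ret = parse_counter(buf, out);
	fclose(f);
	return ret;
}

/* Both counters only grow, so the live value is their difference. */
static unsigned long long live_count(unsigned long long allocated,
				     unsigned long long released)
{
	return allocated - released;
}
```

A caller would read num_alloc_mrs and num_dealloc_mrs for a given peer and
pass both values to live_count(). Because the two files are read separately,
the result is only a snapshot and may be slightly stale on a busy system.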

Signed-off-by: Yishai Hadas <yishaih-VPRAkNaXOzVWk0Htik3J/***@public.gmane.org>
Signed-off-by: Shachar Raindel <raindel-VPRAkNaXOzVWk0Htik3J/***@public.gmane.org>
---
Documentation/infiniband/peer_memory.txt | 64 +++++++++
drivers/infiniband/core/core_priv.h | 2 +
drivers/infiniband/core/peer_mem.c | 212 +++++++++++++++++++++++++++++-
drivers/infiniband/core/sysfs.c | 6 +
drivers/infiniband/core/umem.c | 6 +
include/rdma/ib_peer_mem.h | 13 ++
6 files changed, 302 insertions(+), 1 deletions(-)
create mode 100644 Documentation/infiniband/peer_memory.txt

diff --git a/Documentation/infiniband/peer_memory.txt b/Documentation/infiniband/peer_memory.txt
new file mode 100644
index 0000000..be5e416
--- /dev/null
+++ b/Documentation/infiniband/peer_memory.txt
@@ -0,0 +1,64 @@
+Peer-Direct technology allows RDMA operations to directly target
+memory in external hardware devices, such as GPU cards, SSD based
+storage, dedicated ASIC accelerators, etc.
+
+This technology allows RDMA-based (over InfiniBand/RoCE) application
+to avoid unneeded data copying when sharing data between peer hardware
+devices.
+
+This file contains documentation for the sysfs interface provided by
+the feature. For documentation of the kernel level interface that peer
+memory clients should implement, please refer to the API documentation
+in include/rdma/peer_mem.h
+
+From the user application's perspective, it is free to perform memory
+registration using pointers and handles provided by peer memory
+clients (e.g. OpenCL, CUDA, FPGA-specific handles, etc.). The kernel
+will transparently select the appropriate peer memory client to
+perform the memory registration, as needed.
+
+
+The peer-memory subsystem allows the user to monitor the current usage
+of the technology through a basic sysfs interface. For each peer
+memory client (e.g. GPU type, FPGA, etc.), the following files are
+created:
+
+* /sys/kernel/infiniband/memory_peers/<peer_name>/version - the version string
+ of the peer memory client
+
+* /sys/kernel/infiniband/memory_peers/<peer_name>/num_alloc_mrs - the number
+ of memory regions allocated using this peer's memory. Note that this
+ counter is not decreased during de-registration of memory regions,
+ it is monotonically increasing. To get the number of memory regions
+ currently allocated on this peer, subtract the value of
+ num_dealloc_mrs from this counter.
+
+* /sys/kernel/infiniband/memory_peers/<peer_name>/num_dealloc_mrs - the number
+ of de-allocated memory regions that were originally using peer
+ memory.
+
+* /sys/kernel/infiniband/memory_peers/<peer_name>/num_reg_pages - the amount
+ of peer_name's memory pages that have been mapped through peer
+ direct. Note that this is a monotonically increasing counter. To get
+ the number of pages currently mapped, subtract the value of
+ num_dereg_pages from this counter. Also, pay attention to the fact
+ that this counter is using device pages, which might differ in size
+ from the host memory page size.
+
+* /sys/kernel/infiniband/memory_peers/<peer_name>/num_dereg_pages - the amount
+ of peer memory pages that have been unmapped through peer direct for
+ peer_name.
+
+* /sys/kernel/infiniband/memory_peers/<peer_name>/num_reg_bytes - the number
+ of bytes that have been mapped through peer direct from
+ peer_name. Note that this is a monotonically increasing counter. To
+ get the number of bytes currently mapped, subtract the value of
+ num_dereg_bytes from this counter.
+
+* /sys/kernel/infiniband/memory_peers/<peer_name>/num_dereg_bytes - the number
+ of bytes that have been unmapped through peer direct from peer_name.
+
+* /sys/kernel/infiniband/memory_peers/<peer_name>/num_free_callbacks - the
+ number of times the peer used the "invalidate" callback to free a
+ memory region before the application de-registered the memory
+ region.
diff --git a/drivers/infiniband/core/core_priv.h b/drivers/infiniband/core/core_priv.h
index 87d1936..b404699 100644
--- a/drivers/infiniband/core/core_priv.h
+++ b/drivers/infiniband/core/core_priv.h
@@ -38,6 +38,8 @@

#include <rdma/ib_verbs.h>

+extern struct kobject *infiniband_kobj;
+
int ib_device_register_sysfs(struct ib_device *device,
int (*port_callback)(struct ib_device *,
u8, struct kobject *));
diff --git a/drivers/infiniband/core/peer_mem.c b/drivers/infiniband/core/peer_mem.c
index d4cf31c..bf987aa 100644
--- a/drivers/infiniband/core/peer_mem.c
+++ b/drivers/infiniband/core/peer_mem.c
@@ -33,9 +33,211 @@
#include <rdma/ib_peer_mem.h>
#include <rdma/ib_verbs.h>
#include <rdma/ib_umem.h>
+#include "core_priv.h"

static DEFINE_MUTEX(peer_memory_mutex);
static LIST_HEAD(peer_memory_list);
+static struct kobject *peers_kobj;
+
+static void complete_peer(struct kref *kref);
+static struct ib_peer_memory_client *get_peer_by_kobj(void *kobj);
+static ssize_t version_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf)
+{
+ struct ib_peer_memory_client *ib_peer_client = get_peer_by_kobj(kobj);
+
+ if (ib_peer_client) {
+ sprintf(buf, "%s\n", ib_peer_client->peer_mem->version);
+ kref_put(&ib_peer_client->ref, complete_peer);
+ return strlen(buf);
+ }
+ /* not found - nothing is returned */
+ return 0;
+}
+
+static ssize_t num_alloc_mrs_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf)
+{
+ struct ib_peer_memory_client *ib_peer_client = get_peer_by_kobj(kobj);
+
+ if (ib_peer_client) {
+ sprintf(buf, "%llu\n", (u64)atomic64_read(&ib_peer_client->stats.num_alloc_mrs));
+ kref_put(&ib_peer_client->ref, complete_peer);
+ return strlen(buf);
+ }
+ /* not found - nothing is returned */
+ return 0;
+}
+
+static ssize_t num_dealloc_mrs_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf)
+{
+ struct ib_peer_memory_client *ib_peer_client = get_peer_by_kobj(kobj);
+
+ if (ib_peer_client) {
+ sprintf(buf, "%llu\n", (u64)atomic64_read(&ib_peer_client->stats.num_dealloc_mrs));
+ kref_put(&ib_peer_client->ref, complete_peer);
+ return strlen(buf);
+ }
+ /* not found - nothing is returned */
+ return 0;
+}
+
+static ssize_t num_reg_pages_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf)
+{
+ struct ib_peer_memory_client *ib_peer_client = get_peer_by_kobj(kobj);
+
+ if (ib_peer_client) {
+ sprintf(buf, "%llu\n", (u64)atomic64_read(&ib_peer_client->stats.num_reg_pages));
+ kref_put(&ib_peer_client->ref, complete_peer);
+ return strlen(buf);
+ }
+ /* not found - nothing is returned */
+ return 0;
+}
+
+static ssize_t num_dereg_pages_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf)
+{
+ struct ib_peer_memory_client *ib_peer_client = get_peer_by_kobj(kobj);
+
+ if (ib_peer_client) {
+ sprintf(buf, "%llu\n", (u64)atomic64_read(&ib_peer_client->stats.num_dereg_pages));
+ kref_put(&ib_peer_client->ref, complete_peer);
+ return strlen(buf);
+ }
+ /* not found - nothing is returned */
+ return 0;
+}
+
+static ssize_t num_reg_bytes_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf)
+{
+ struct ib_peer_memory_client *ib_peer_client = get_peer_by_kobj(kobj);
+
+ if (ib_peer_client) {
+ sprintf(buf, "%llu\n", (u64)atomic64_read(&ib_peer_client->stats.num_reg_bytes));
+ kref_put(&ib_peer_client->ref, complete_peer);
+ return strlen(buf);
+ }
+ /* not found - nothing is returned */
+ return 0;
+}
+
+static ssize_t num_dereg_bytes_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf)
+{
+ struct ib_peer_memory_client *ib_peer_client = get_peer_by_kobj(kobj);
+
+ if (ib_peer_client) {
+ sprintf(buf, "%llu\n", (u64)atomic64_read(&ib_peer_client->stats.num_dereg_bytes));
+ kref_put(&ib_peer_client->ref, complete_peer);
+ return strlen(buf);
+ }
+ /* not found - nothing is returned */
+ return 0;
+}
+
+static ssize_t num_free_callbacks_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf)
+{
+ struct ib_peer_memory_client *ib_peer_client = get_peer_by_kobj(kobj);
+
+ if (ib_peer_client) {
+ sprintf(buf, "%lu\n", ib_peer_client->stats.num_free_callbacks);
+ kref_put(&ib_peer_client->ref, complete_peer);
+ return strlen(buf);
+ }
+ /* not found - nothing is returned */
+ return 0;
+}
+
+static struct kobj_attribute version_attr = __ATTR_RO(version);
+static struct kobj_attribute num_alloc_mrs = __ATTR_RO(num_alloc_mrs);
+static struct kobj_attribute num_dealloc_mrs = __ATTR_RO(num_dealloc_mrs);
+static struct kobj_attribute num_reg_pages = __ATTR_RO(num_reg_pages);
+static struct kobj_attribute num_dereg_pages = __ATTR_RO(num_dereg_pages);
+static struct kobj_attribute num_reg_bytes = __ATTR_RO(num_reg_bytes);
+static struct kobj_attribute num_dereg_bytes = __ATTR_RO(num_dereg_bytes);
+static struct kobj_attribute num_free_callbacks = __ATTR_RO(num_free_callbacks);
+
+static struct attribute *peer_mem_attrs[] = {
+ &version_attr.attr,
+ &num_alloc_mrs.attr,
+ &num_dealloc_mrs.attr,
+ &num_reg_pages.attr,
+ &num_dereg_pages.attr,
+ &num_reg_bytes.attr,
+ &num_dereg_bytes.attr,
+ &num_free_callbacks.attr,
+ NULL,
+};
+
+static void destroy_peer_sysfs(struct ib_peer_memory_client *ib_peer_client)
+{
+ kobject_put(ib_peer_client->kobj);
+ if (list_empty(&peer_memory_list))
+ kobject_put(peers_kobj);
+}
+
+static int create_peer_sysfs(struct ib_peer_memory_client *ib_peer_client)
+{
+ int ret;
+
+ if (list_empty(&peer_memory_list)) {
+ /* creating under /sys/kernel/infiniband */
+ peers_kobj = kobject_create_and_add("memory_peers", infiniband_kobj);
+ if (!peers_kobj)
+ return -ENOMEM;
+ }
+
+ ib_peer_client->peer_mem_attr_group.attrs = peer_mem_attrs;
+ /* Dir was already created explicitly to get its kernel object for further usage */
+ ib_peer_client->peer_mem_attr_group.name = NULL;
+ ib_peer_client->kobj = kobject_create_and_add(ib_peer_client->peer_mem->name,
+ peers_kobj);
+
+ if (!ib_peer_client->kobj) {
+ ret = -EINVAL;
+ goto free;
+ }
+
+ /* Create the files associated with this kobject */
+ ret = sysfs_create_group(ib_peer_client->kobj,
+ &ib_peer_client->peer_mem_attr_group);
+ if (ret)
+ goto peer_free;
+
+ return 0;
+
+peer_free:
+ kobject_put(ib_peer_client->kobj);
+
+free:
+ if (list_empty(&peer_memory_list))
+ kobject_put(peers_kobj);
+
+ return ret;
+}
+
+static struct ib_peer_memory_client *get_peer_by_kobj(void *kobj)
+{
+ struct ib_peer_memory_client *ib_peer_client;
+
+ mutex_lock(&peer_memory_mutex);
+ list_for_each_entry(ib_peer_client, &peer_memory_list, core_peer_list) {
+ if (ib_peer_client->kobj == kobj) {
+ kref_get(&ib_peer_client->ref);
+ goto found;
+ }
+ }
+
+ ib_peer_client = NULL;
+found:
+ mutex_unlock(&peer_memory_mutex);
+ return ib_peer_client;
+}

/* Caller should be holding the peer client lock, ib_peer_client->lock */
static struct core_ticket *ib_peer_search_context(struct ib_peer_memory_client *ib_peer_client,
@@ -60,6 +262,7 @@ static int ib_invalidate_peer_memory(void *reg_handle, u64 core_context)
int need_unlock = 1;

mutex_lock(&ib_peer_client->lock);
+ ib_peer_client->stats.num_free_callbacks += 1;
core_ticket = ib_peer_search_context(ib_peer_client, core_context);
if (!core_ticket)
goto out;
@@ -251,9 +454,15 @@ void *ib_register_peer_memory_client(const struct peer_memory_client *peer_clien
}

mutex_lock(&peer_memory_mutex);
+ if (create_peer_sysfs(ib_peer_client)) {
+ kfree(ib_peer_client);
+ ib_peer_client = NULL;
+ goto end;
+ }
list_add_tail(&ib_peer_client->core_peer_list, &peer_memory_list);
- mutex_unlock(&peer_memory_mutex);
+end:

+ mutex_unlock(&peer_memory_mutex);
return ib_peer_client;
}
EXPORT_SYMBOL(ib_register_peer_memory_client);
@@ -264,6 +473,7 @@ void ib_unregister_peer_memory_client(void *reg_handle)

mutex_lock(&peer_memory_mutex);
list_del(&ib_peer_client->core_peer_list);
+ destroy_peer_sysfs(ib_peer_client);
mutex_unlock(&peer_memory_mutex);

kref_put(&ib_peer_client->ref, complete_peer);
diff --git a/drivers/infiniband/core/sysfs.c b/drivers/infiniband/core/sysfs.c
index cbd0383..eae6fb0 100644
--- a/drivers/infiniband/core/sysfs.c
+++ b/drivers/infiniband/core/sysfs.c
@@ -40,6 +40,8 @@

#include <rdma/ib_mad.h>

+struct kobject *infiniband_kobj;
+
struct ib_port {
struct kobject kobj;
struct ib_device *ibdev;
@@ -913,10 +915,14 @@ void ib_device_unregister_sysfs(struct ib_device *device)

int ib_sysfs_setup(void)
{
+ infiniband_kobj = kobject_create_and_add("infiniband", kernel_kobj);
+ if (!infiniband_kobj)
+ return -ENOMEM;
return class_register(&ib_class);
}

void ib_sysfs_cleanup(void)
{
+ kobject_put(infiniband_kobj);
class_unregister(&ib_class);
}
diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 6655d12..1fa5447 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -86,6 +86,9 @@ static struct ib_umem *peer_umem_get(struct ib_peer_memory_client *ib_peer_mem,
if (ret)
goto put_pages;

+ atomic64_add(umem->nmap, &ib_peer_mem->stats.num_reg_pages);
+ atomic64_add(umem->nmap * umem->page_size, &ib_peer_mem->stats.num_reg_bytes);
+ atomic64_inc(&ib_peer_mem->stats.num_alloc_mrs);
return umem;

put_pages:
@@ -114,6 +117,9 @@ static void peer_umem_release(struct ib_umem *umem)
umem->context->device->dma_device);
peer_mem->put_pages(&umem->sg_head,
umem->peer_mem_client_context);
+ atomic64_add(umem->nmap, &ib_peer_mem->stats.num_dereg_pages);
+ atomic64_add(umem->nmap * umem->page_size, &ib_peer_mem->stats.num_dereg_bytes);
+ atomic64_inc(&ib_peer_mem->stats.num_dealloc_mrs);
ib_put_peer_client(ib_peer_mem, umem->peer_mem_client_context);
kfree(umem);
}
diff --git a/include/rdma/ib_peer_mem.h b/include/rdma/ib_peer_mem.h
index 58e0f99..1b865c8 100644
--- a/include/rdma/ib_peer_mem.h
+++ b/include/rdma/ib_peer_mem.h
@@ -3,6 +3,16 @@

#include <rdma/peer_mem.h>

+struct ib_peer_memory_statistics {
+ atomic64_t num_alloc_mrs;
+ atomic64_t num_dealloc_mrs;
+ atomic64_t num_reg_pages;
+ atomic64_t num_dereg_pages;
+ atomic64_t num_reg_bytes;
+ atomic64_t num_dereg_bytes;
+ unsigned long num_free_callbacks;
+};
+
struct ib_ucontext;
struct ib_umem;
struct invalidation_ctx;
@@ -17,6 +27,9 @@ struct ib_peer_memory_client {
struct mutex lock;
struct list_head core_ticket_list;
u64 last_ticket;
+ struct kobject *kobj;
+ struct attribute_group peer_mem_attr_group;
+ struct ib_peer_memory_statistics stats;
};

enum ib_peer_mem_flags {
--
1.7.1
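
As a rough illustration of how the sysfs counters documented in this patch might be consumed from user space, here is a minimal sketch in C. It is not part of the patch. The helper names (read_counter_file, read_peer_counter, read_peer_in_use) are invented for this example; only the directory layout and the counter file names come from the documentation above. The sketch reads the "dereg" style counter before the matching "reg" style counter, on the assumption that both only ever increase, so the subtraction should not underflow even if registrations happen between the two reads.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Root directory taken from the documentation in the patch. */
#define PEER_SYSFS_ROOT "/sys/kernel/infiniband/memory_peers"

/* Read one unsigned decimal counter from a file. Returns 0 on success, -1 on error. */
static int read_counter_file(const char *path, uint64_t *value)
{
	FILE *f;
	int ok;

	assert(path != NULL && value != NULL);

	f = fopen(path, "r");
	if (!f)
		return -1;
	ok = (fscanf(f, "%" SCNu64, value) == 1);
	fclose(f);
	return ok ? 0 : -1;
}

/* Read <root>/<peer>/<counter>. Returns 0 on success, -1 on error. */
static int read_peer_counter(const char *root, const char *peer,
			     const char *counter, uint64_t *value)
{
	char path[512];
	int n = snprintf(path, sizeof(path), "%s/%s/%s", root, peer, counter);

	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;
	return read_counter_file(path, value);
}

/*
 * Compute the "currently in use" value for a pair of monotonically
 * increasing counters, e.g. num_alloc_mrs and num_dealloc_mrs.
 * The dereg counter is read first so that reg >= dereg is expected to hold
 * (an assumption of this sketch, based on the counters being monotonic).
 */
static int read_peer_in_use(const char *root, const char *peer,
			    const char *reg, const char *dereg,
			    uint64_t *in_use)
{
	uint64_t up, down;

	if (read_peer_counter(root, peer, dereg, &down))
		return -1;
	if (read_peer_counter(root, peer, reg, &up))
		return -1;
	*in_use = up - down;
	return 0;
}
```

A caller could, for instance, pass PEER_SYSFS_ROOT, a peer directory name (whatever name the peer memory client registered with), and the pair "num_reg_pages" / "num_dereg_pages" to obtain the number of device pages currently mapped. The same helper works for the "_mrs" and "_bytes" pairs.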

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-***@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Roland Dreier
2014-10-23 17:29:02 UTC
Post by Yishai Hadas
The API defined for Peer-Direct is described in this cover letter.
The required implementation for a hardware device to expose memory
buffers over Peer-Direct is also detailed in this letter.

I don't see how I can justify merging this (for now at least), given
that there are no actual users of all this (fairly complex) new code,
besides a sample that doesn't actually do anything useful. Is there
any actual consumer that might go upstream someday that we can at
least review now?

Post by Yishai Hadas
This makes the usage of peer-direct almost completely transparent to
the individual hardware drivers. The only changes required in the low
level IB hardware drivers is supporting an interface for immediate
invalidation of registered memory regions.

Why do we need immediate invalidation of memory regions?

- R.