Notes on high-performance interconnect

Acronyms in the high-performance interconnect and MPI world

ACM: Communication Management Assistant. It provides address and route (path) resolution services to InfiniBand fabrics, much like the Address Resolution Protocol (ARP).
ADI: Abstract Device Interface (MPICH specific)
AH: Address Handle (InfiniBand specific)
An object which describes the path to the remote side, used with UD QPs.
AL: Access Layer (InfiniBand specific)
Low-level operating system infrastructure (plumbing) used for accessing the interconnect fabric (VPI, InfiniBand, Ethernet, FCoE). It includes all basic transport services needed to support upper-level network protocols, middleware, and management agents.
ALPS: Application Level Placement Scheduler (Cray MPI specific)
APM: Automatic Path Migration (InfiniBand specific)
ARMCI: Aggregate Remote Memory Copy Interface
BML: BTL Management Layer. A thin layer between the PML and the BTLs in Open MPI; it provides multiplexing between multiple PMLs and multiple BTLs.
BTL: Byte Transfer Layer. The fabric-specific layer in Open MPI, used by the OB1 and DR PMLs.
CH: Channel (MPICH specific)
CM: Connection/Communication Manager (InfiniBand specific)
An entity responsible for establishing, maintaining, and releasing communication for the RC and UC QP service types. The Service ID Resolution Protocol enables users of the UD service to locate QPs supporting their desired service. There is a CM on every IB port of the end nodes.

In the context of Open MPI, CM is a PML implementation whose name comes from Connor MacLeod of the movie Highlander. Unlike OB1, CM is designed to use directly the native, MPI-like tag-matching interface exposed by certain fabrics, e.g. Myrinet MX, QLogic PSM, and Cray Portals.

CMA: Communication Manager Abstraction, another name for the RDMA CM (OpenFabrics specific)
CNO: Consumer Notification Object (DAPL specific)
CQE: Completion Queue Entry (InfiniBand specific)
An entry in the Completion Queue that describes information about a completed WR (status, size, etc.)
CR: Connection Request
CRCP: Checkpoint/Restart Coordination Protocol; a PML implementation in Open MPI.
CXGB: Chelsio T3/T4 10 Gigabit Ethernet
DAPL: Direct Access Programming Library
DAT: Direct Access Transport
DBL: Data Bypass Layer, a user-level, kernel-bypass messaging software package for Myricom 10-Gigabit Ethernet.
DDP: Direct Data Placement
DDR: Double Data Rate (InfiniBand specific)
DPM: Dynamic Process Management (Open MPI specific)
DR: Data Reliability. One of the PML implementations in Open MPI; unlike OB1, which relies on error codes propagated from lower-level communication libraries, DR can better handle network failures.
DTC: Data Transfer Completion (DAPL specific)
DTO: Data Transfer Operation (DAPL specific)
EE, EEC: End-to-End Context (InfiniBand specific)
eHCA: IBM eServer GX-based InfiniBand HCA
EP: Endpoint
EVD: Event Dispatcher (DAPL specific)
FCA: Voltaire's (now Mellanox's) Fabric Collective Accelerator
FCoE: Fibre Channel over Ethernet
FMS: Fabric Management System (Myrinet specific)
GbE: Gigabit Ethernet
GM: Myrinet Generic Messages protocol
GRU: Global Reference Unit (SGI Altix specific)
GSI: General Services Interface
HCA: Host Channel Adapter (InfiniBand specific)
IA: Interface Adapter
IB: InfiniBand
IBAL: InfiniBand Access Layer
IBTA: InfiniBand Trade Association
IPoIB: IP over InfiniBand
iPath: QLogic InfiniPath HCA
iSER: iSCSI Extensions for RDMA
ITAPI: Interconnect Transport API, one of the APIs for InfiniBand.
iWARP: Internet Wide Area RDMA Protocol, a competing protocol to InfiniBand
kDAPL: Kernel-level DAPL
LID: Local Identifier (InfiniBand specific)
A 16-bit address assigned to end nodes by the Subnet Manager. Each LID is unique within its subnet.
LMR: Local Memory Region
MAD: Management Datagram (InfiniBand specific)
MCA: Modular Component Architecture (Open MPI specific)
MCP: Myrinet Control Program
MLX4: Mellanox ConnectX adapter
MPD: Multi-Purpose Daemon (MPI specific)
MPID: A namespace in the MPICH source tree. Functions in this namespace are hardware-dependent (D for Device).
MPIR: A namespace in the MPICH source tree. Functions in this namespace are hardware-independent (R for Runtime).
MPT: Message Passing Toolkit
MR: Memory Region (InfiniBand specific)
A set of memory buffers which have already been registered with access permissions. These buffers need to be registered in order for the network adapter to make use of them. During registration an lkey and an rkey are created and associated with the memory region.
MSI: Message Signaled Interrupt
MTHCA: Mellanox InfiniHost device
MTL: Matching Transport Layer (used by the "CM" PML implementation in Open MPI)
MX: Myrinet Express protocol
MXM: MellanoX Messaging
NES: Intel NetEffect Ethernet Cluster Server adapter
OB1: A PML implementation in Open MPI; its focus is high performance, and it will use RDMA if available.
The name is inspired by Star Wars's Obi-Wan. Open MPI also has a BML implementation called R2.
ODLS: ORTE Daemon Launch System (Open MPI specific)
OFA: OpenFabrics Alliance (formerly OpenIB)
OFED: OpenFabrics Enterprise Distribution
OMPI: Open MPI
OPA: Open Portable Atomics (MPICH specific)
OPAL: Open Portability Access Layer (Open MPI specific)
OpenPA: Open Portable Atomics (MPICH specific)
ORTE: Open Run-Time Environment (Open MPI specific)
OpenIB: former name of the OpenFabrics Alliance
OSC: One-Sided Communication (Open MPI specific)
P4: Portable Programs for Parallel Processors protocol (MPICH specific)
PD: Protection Domain (InfiniBand specific)
An object whose components can interact only with each other: AHs interact with QPs, and MRs interact with WQs.
PIO: Programmed I/O
PLS: Process Launch System (Open MPI specific)
PMA: Performance Management Agent (InfiniBand specific)
PMI: Process Management Interface (MPICH2 specific)
PML: Point-to-Point Management Layer. As opposed to the BTL, the PML is the fabric-independent layer in Open MPI.
PP: Per peer.

In Open MPI, each QP can be configured to use either an SRQ or per-peer receive buffers posted directly to the QP.

PSM: QLogic Performance Scaled Messaging
PSP: Public Service Point (DAPL specific)
PTL: Portals, the lowest-level network transport layer on Cray platforms
PZ: Protection Zone (DAPL specific)
QDR: Quad Data Rate (InfiniBand specific)
QP: Queue Pair (InfiniBand specific)
The pair of independent WQs (a Send Queue and a Receive Queue) packed together in one object for the purpose of transferring data between nodes of a network. Posts are used to initiate the sending or receiving of data. There are three types of QP: UD, UC, and RC.
RAS: Resource Allocation Subsystem, which queries the batch job system to determine what resources have been allocated to the MPI job; an MCA component in Open MPI
RC: Reliable Connection (InfiniBand specific)
A QP transport service type based on a connection-oriented protocol. A QP is associated with another single QP. Messages are sent reliably (in terms of ordering and content).
RCache: Registration Cache, which allows memory pools to cache registered memory for later operations.
RD: Reliable Datagram (InfiniBand specific)
RDMA: Remote Direct Memory Access
RDS: Reliable Datagram Socket
RMA: Remote Memory Access
RMAPS: Resource MAPping Subsystem, which maps MPI processes to specific nodes/cores; an MCA component in Open MPI
RMC: Reliable Multicast (InfiniBand specific)
RML: Runtime Messaging Layer communication interface, which provides basic point-to-point communication between ORTE processes; an MCA component in Open MPI
RMR: Remote Memory Region
RNR: Receiver Not Ready (InfiniBand specific)
The flow in an RC QP where there is a connection between the sides but a Receive Request is not present on the receive side.
RO: Remote Operation (InfiniBand specific)
RoCEE: RDMA over Converged Enhanced Ethernet
RQ: Request Queue
RSP: Reserved Service Point (DAPL specific)
RTR: Ready to Receive (InfiniBand specific)
RTS: Ready to Send (InfiniBand specific)
SA: Subnet Administrator (InfiniBand specific)
SCI: Scalable Coherent Interface
SCM: uDAPL socket-based CM (DAPL/OpenFabrics specific)
SDP: Sockets Direct Protocol, a protocol over an RDMA fabric (usually InfiniBand) to support stream sockets (SOCK_STREAM)
SGE: Scatter/Gather Element (InfiniBand specific)
An entry pointing to all or part of a locally registered memory block. The element holds the start address of the block, its size, and the lkey (with its associated permissions).
SHMEM: Shared Memory
SM: Shared Memory (Intel MPI specific), or Subnet Manager (InfiniBand specific)
SMA: Subnet Manager Agent (InfiniBand specific)
SMI: Subnet Management Interface
SNAPC: Snapshot Coordination; an MCA component in Open MPI
SR: Send Request (InfiniBand specific)
A WR posted to an SQ which describes how much data is going to be transferred, its direction, and the way it is transferred (specified by the opcode).
SRP: SCSI RDMA Protocol
SRQ: Shared Receive Queue (InfiniBand specific)
A queue which holds WQEs for incoming messages from any RC/UD QP associated with it. More than one QP can be associated with the same SRQ.
TCA: Target Channel Adapter (InfiniBand specific)
A Channel Adapter that is not required to support Verbs, usually used in I/O devices.
TMI: Tag Matching Interface, e.g. QLogic PSM and Myricom MX
TOE: TCP Offload Engine
UC: Unreliable Connection (InfiniBand specific)
A QP transport service type based on a connection-oriented protocol, where a QP is associated with another single QP. The QPs do not execute a reliable protocol and messages can be lost.
UCM: uDAPL unreliable-datagram-based CM (DAPL/OpenFabrics specific)
UD: Unreliable Datagram (InfiniBand specific)
A QP transport service type in which messages can be one packet long and every UD QP can send/receive messages to/from any other UD QP in the subnet. Messages can be lost and ordering is not guaranteed. UD is the only QP type which supports multicast messages.
UDA: Mellanox's Unstructured Data Accelerator
uDAPL: User-level DAPL
ULP: Upper Layer Protocol (InfiniBand specific)
V: Message logging and replay protocol; a PML implementation in Open MPI based on MPICH-V (V for Volatility).
VAPI: Mellanox's Verbs API for InfiniBand.
VI: Virtual Interface
VIA: Virtual Interface Architecture
VIPL: VI Provider Library
VMA: Voltaire Messaging Accelerator
VPI: Virtual Protocol Interconnect
It allows the user to change the layer-2 protocol of the port.
WQE: Work Queue Element (InfiniBand specific)
WR: Work Request (InfiniBand specific)
A request posted by a user to a work queue
XRC: eXtended Reliable Connection.

A transport layer introduced by Mellanox to improve scalability in multi-core environments. It allows a single receive QP to be shared by multiple SRQs across one or more processes.

XPMEM: Cross Partition Memory (SGI Altix specific)

Environment variables for Myrinet MX

(Their details can be found in the MX source tree, in mx-1.*/doc/libmx-options.txt and mx-1.*/README)

MX_STATS: Set to 1 to enable reporting of statistics when the endpoint is closed (libmyriexpress must be compiled with --enable-stats when running the configure script).

If the MPI library does not close the endpoint, nothing will be printed out.

MX_DEBUG_FILE: Output statistics and all debugging information to this file
MX_VERBOSE: Set to a non-zero value to display debugging information. The bigger the value, the more verbose the output.
MX_MONOTHREAD: Set to 1 to use a single thread. Default is 0
MX_ZOMBIE_SEND: Set to 0 to ensure a message is only reported as completed when it has safely reached its destination.
MX_NO_MYRINET: Set to 1 to allow intra-node communication only
MX_DISABLE_SELF: Set to 1 to disable the self-communication channel
MX_DISABLE_SHMEM: Set to 1 to disable the shared-memory channel
MX_RCACHE: Set to 0 to disable the registration cache.

This is related to bandwidth performance tuning.

MX_CPUS: Set to a non-zero value to enable CPU-process affinity
MX_IMM_ACK: Set to 1 to enable immediate ACK. Default is 4
MX_CSUM: Set to 1 to enable checksums

Obtaining Myrinet MX software

Go to ftp://ftp.myri.com/pub/MX and use the login name comrade and password ruckus to log in.

To compile libmyriexpress, one needs to run the configure script as:

./configure --with-linux=/usr/src/kernels/2.6.xx --with-fms-run=/var/run/fms --with-fms-server=<hostname>
There are other features one can enable, such as --enable-stats, --enable-thread-safety, etc.

Analysis of libmyriexpress

Myrinet MX's API closely mirrors MPI's. It has two send functions, mx_isend and mx_issend; their implementations can be found in libmyriexpress/mx_isend.c and libmyriexpress/mx_issend.c

Both mx_isend and mx_issend use an internal function mx__post_send (in libmyriexpress/mx__post_send.c) to post the send request to the request queue. The messages are classified according to their sizes: MX__REQUEST_TYPE_SEND_TINY, MX__REQUEST_TYPE_SEND_SMALL, MX__REQUEST_TYPE_SEND_MEDIUM, and MX__REQUEST_TYPE_SEND_LARGE.
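
From the user's side, posting a nonblocking send and completing it looks roughly like the sketch below. This is only a sketch: the prototypes follow the public myriexpress.h API (mx_init, mx_open_endpoint, mx_connect, mx_isend, mx_test), but the exact parameter details, the peer NIC id/endpoint id, and the tag value are assumptions chosen for illustration.

#include <stdint.h>
#include "myriexpress.h"

/* Hedged sketch: post a nonblocking MX send and poll it to completion.
 * Peer NIC id, endpoint id, filter, and the 0x1234 tag are placeholders. */
int send_hello(uint64_t peer_nic_id, uint32_t peer_eid, uint32_t filter)
{
    mx_endpoint_t ep;
    mx_endpoint_addr_t dest;
    mx_request_t req;
    mx_status_t status;
    uint32_t result = 0;
    char buf[] = "hello";
    mx_segment_t seg = { buf, sizeof(buf) };    /* pointer + length */

    mx_init();
    mx_open_endpoint(0, MX_ANY_ENDPOINT, filter, NULL, 0, &ep);
    mx_connect(ep, peer_nic_id, peer_eid, filter, MX_INFINITE, &dest);

    mx_isend(ep, &seg, 1, dest, 0x1234, NULL, &req);  /* goes through mx__post_send */

    do {                          /* mx_test also drives internal progress */
        mx_test(ep, &req, &status, &result);
    } while (!result);

    mx_close_endpoint(ep);
    mx_finalize();
    return status.code == MX_STATUS_SUCCESS ? 0 : -1;
}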

MX also has an internal function, mx__luigi, that drives progress of the request queues; the corresponding public API is mx_progress.

The actual send is done via memory-mapped I/O: the message is copied to a memory region by the function mx_copy_ureq, and the device driver in the kernel picks it up from there.

There is also an ioctl interface for libmyriexpress to communicate with the device driver. First, mx__open tries to open the following device files with O_RDWR access:

/dev/mxp*
/dev/mx*
(mxp* are privileged devices)

Then MX uses mx__ioctl to send commands to the device driver. These low-level commands can be found in libmyriexpress/mx__driver_interface.c and common/mx_io_impl.h (a sketch of the open/ioctl pattern follows the command list below). The commands commonly used by Intel MPI are:

MX_APP_WAIT                    MX_GET_MEDIUM_MESSAGE_THRESHOLD
MX_ARM_TIMER                   MX_GET_NIC_ID
MX_GET_BOARD_TYPE              MX_GET_SMALL_MESSAGE_THRESHOLD
MX_GET_COPYBLOCKS              MX_GET_VERSION
MX_GET_INSTANCE_COUNT          MX_NIC_ID_TO_PEER_INDEX
MX_GET_MAX_ENDPOINTS           MX_PEER_INDEX_TO_NIC_ID
MX_GET_MAX_PEERS               MX_SET_ENDPOINT
MX_GET_MAX_RDMA_WINDOWS        MX_WAKE
MX_GET_MAX_SEND_HANDLES
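
As promised above, here is an illustrative sketch of that open/ioctl pattern. The command codes come from common/mx_io_impl.h in the MX source; the argument type passed to ioctl here is an assumption made purely for illustration.

#include <stdint.h>
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include "mx_io_impl.h"   /* command codes such as MX_GET_MAX_ENDPOINTS */

/* Illustrative only: open an unprivileged MX device and issue one command.
 * The real library does this inside mx__open() and mx__ioctl(). */
int query_max_endpoints(void)
{
    int fd = open("/dev/mx0", O_RDWR);
    if (fd < 0) {
        perror("open /dev/mx0");
        return -1;
    }

    uint32_t max_endpoints = 0;     /* argument layout is assumed here */
    if (ioctl(fd, MX_GET_MAX_ENDPOINTS, &max_endpoints) == 0)
        printf("max endpoints per board: %u\n", max_endpoints);

    close(fd);
    return 0;
}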

Myrinet MX: Life of a request

In libmyriexpress/mx__lib_types.h there is a description of "life of a request". It is copied here:

Send Tiny

  1. If enough resources, go to 4
  2. Mark MX__REQUEST_STATE_SEND_QUEUED, enqueue in send_reqq
  3. mx__process_requests removes MX__REQUEST_STATE_SEND_QUEUED and dequeues
  4. mx__post_send posts the request, marks MX__REQUEST_STATE_BUFFERED and MX__REQUEST_STATE_MCP and enqueues in buffered_sendq
  5. mx__process_events dequeues from buffered_sendq, removes MCP, enqueues in resend_list
  6. mx__process_resend_list adds MX__REQUEST_STATE_SEND_QUEUED, dequeues from resend_list and queues in resend_reqq
  7. mx__process_requests removes MX__REQUEST_STATE_SEND_QUEUED and dequeues
  8. mx__post_send posts the request, marks MX__REQUEST_STATE_MCP and enqueues in buffered_sendq
  9. mx__process_events dequeues from buffered_sendq, removes MX__REQUEST_STATE_MCP, enqueues in resend_list
  10. Go to 6 until MX__REQUEST_STATE_ACKED
  11. When acked, mx__mark_request_acked dequeues from resend_list and calls mx__send_acked_and_mcp_complete
  12. mx__send_complete queues in doneq/ctxid
  13. mx_test/wait/ipeek/peek/test_any/wait_any dequeues from doneq/ctxid

Send Small

As Tiny, except:

4,8: mx__post_send calls mx__ptr_stack_pop to get room to write the data.
5,9: mx__process_events calls mx__ptr_stack_push to release room.

Send Medium

As Tiny, except:

Skip 1, do 2,3
4: mx__post_send calls mx__memory_pool_alloc to get room in the send copyblock
11: mx__send_acked_and_mcp_complete calls mx__release_send_medium which calls mx__memory_pool_free

Send Large

  1. Mark MX__REQUEST_STATE_SEND_QUEUED, enqueue in send_reqq
  2. mx__process_requests removes MX__REQUEST_STATE_SEND_QUEUED and dequeues
  3. mx__post_send removes MX__REQUEST_STATE_SEND_QUEUED, adds MX__REQUEST_STATE_MCP, allocates an rdma window, posts the rndv and enqueues to large_sendq
  4. mx__process_events removes MX__REQUEST_STATE_MCP, dequeues from large_sendq, enqueues to resend_list
  5. mx__process_resend_list adds MX__REQUEST_STATE_SEND_QUEUED, dequeues from resend_list and queues in resend_reqq
  6. mx__process_requests removes MX__REQUEST_STATE_SEND_QUEUED and dequeues
  7. mx__post_send posts the request, marks MX__REQUEST_STATE_MCP and enqueues in large_sendq
  8. mx__process_events removes MX__REQUEST_STATE_MCP, dequeues from large_sendq, enqueues to resend_list
  9. Go to 5 until MX__REQUEST_STATE_ACKED
  10. mx__send_acked_and_mcp_complete enqueues notifying_large_sendq
  11. mx__process_receives gets a MX_MCP_UEVT_RECV_NOTIFY, dequeues from notifying_large_sendq, calls mx__process_recv_notify which calls mx__rndv_got_notify
  12. mx__release_send_large releases the rdma window
  13. mx__send_complete queues in doneq/ctxid
  14. mx_test/wait/ipeek/peek/test_any/wait_any dequeues from doneq/ctxid

Incoming message

  1. mx__process_events calls mx__process_recvs with the callback associated to the event type
  2. mx__process_recvs checks whether the message is obsolete and, if so, drops it
  3. mx__process_recvs checks whether the message is early and, if so, stores it in the early queue
  4. mx__process_ordered_evt is called on the message if it is the expected seqnum
  5. mx__endpoint_match_receive tries to match the message in the receive queue

    If unexpected (no match found), mx__create_unexp_for_evt is called to generate an unexpected request and store it in the unexp queue

    If expected (match found), the corresponding receive request is filled with the message info, and MX__REQUEST_STATE_RECV_MATCHED is set

  6. The callback is called

    For tiny messages, mx__process_recv_tiny copies the data from the event and calls mx__recv_complete if the message is expected

    For small messages, mx__process_recv_small copies the data from the copy block and calls mx__recv_complete if the message is expected

    For medium messages, mx__process_recv_medium moves the request into the multifrag_recvq if the message is expected, inserts it in the partner partialq if it is not the last fragment, and calls mx__process_recv_copy_frag
    1: mx__process_recv_copy_frag copies the fragment from the copy block and calls mx__received_last_frag if it was the last fragment
    2: mx__received_last_frag removes the partner partialq if there's more than one fragment, removes from the multifrag_recvq if expected, and calls mx__recv_complete

    For large message, mx__process_recv_large calls mx__queue_large_recv if the message is expected (see Recv Large below)

  7. mx__process_early is called to process the next seqnum messages if already arrived

Recv posting

  1. mx_irecv calls mx__endpoint_match_unexpected to try to match the new receive against the unexp queue; if there is no match, the recv is enqueued in the recv_reqq, otherwise go on
  2. The receive request gets info from the unexpected requests, and MX__REQUEST_STATE_RECV_MATCHED is set
  3. If not large, the data is copied from the unexpected
  4. If medium and not entirely received, the unexpected request is replaced with the receive request in the partner's partialq, and the receive request is enqueued in the multifrag_recvq

    If large, mx__queue_large_recv is called (see Recv Large below)

  5. The unexpected request is released

Recv large

  1. mx__queue_large_recv changes the type from MX__REQUEST_TYPE_RECV to MX__REQUEST_TYPE_RECV_LARGE, and sets MX__REQUEST_STATE_SEND_QUEUED
  2. Enqueue to send_rreq and if length>0 then mark notifying and goto 5)
  3. mx__process_requests removes MX__REQUEST_STATE_SEND_QUEUED, adds MX__REQUEST_STATE_MCP, allocates rdma window, calls mx__post_large_recv, and enqueues to large_getq
  4. mx__process_events removes MX__REQUEST_STATE_MCP, adds MX__REQUEST_STATE_SEND_QUEUED, sets notifying, removes from large_getq and enqueues in send_rreq
  5. mx__process_requests calls mx__post_send and enqueues to large_getq
  6. mx__process_events removes MX__REQUEST_STATE_MCP, removes from large_getq and calls mx__send_acked_and_mcp_complete if MX__REQUEST_STATE_ACKED or go back to 4
  7. mx__release_recv_large releases rdma window
  8. mx__recv_complete enqueues in doneq/ctxid

Troubleshooting Myrinet MX

What is "warning: regcache incompatible with malloc"?

Myrinet MX uses a registration cache (see the acronyms table above) to achieve higher performance. When the registration cache feature is enabled, Myrinet MX manages all memory allocations by itself, i.e. it has its own implementation of malloc, free, realloc, mremap, munmap, sbrk, etc. (see mx__regcache.c in the libmyriexpress package).

The warning message in question pops up when mx__regcache_works returns 0. On Linux, this means that calling a malloc/free pair did not trigger the variable mx__hook_triggered.

Registration cache checks can be disabled by setting the environment variable MX_RCACHE to 2.

The registration cache can sometimes cause strange errors. It can be disabled entirely by setting the environment variable MX_RCACHE to 0.

Obtaining InfiniBand/OpenFabrics software

Go to http://www.openfabrics.org/downloads/. For developers, check the Verbs and RDMACM links.

InfiniBand diagnostic utilities

If the OFED software stack is installed properly, there should be a number of InfiniBand tools under:
/usr/bin/ib*
/usr/sbin/ib*
ibstatus, ibstat, and ibv_devinfo print out the device info (all of them query /sys/class/infiniband/). ibv_devinfo is particularly useful for checking the health of the InfiniBand card.

ibnetdiscover should print out the whole InfiniBand fabric.

perfquery prints out the performance counters and error counters of the InfiniBand card.

To test point-to-point RDMA bandwidth, use ib_read_bw or ib_send_bw. First, log in to node A and run ib_read_bw, which starts a server and waits for a connection. Then log in to node B and run

$ ib_read_bw A
and it will connect to the server on node A and report the bandwidth.

Similarly, use ib_read_lat or ib_send_lat to measure point-to-point RDMA latency.

The infiniband-diags package contains many other InfiniBand diagnostic utilities.

InfiniBand concepts

InfiniBand supports two communication modes: the traditional send/receive (a la MPI-1) and the get/put RDMA (a la MPI-2). In both modes, the two participating processes each create a channel endpoint associated with a designated memory region in the process's address space. The endpoints of the channel are called queue pairs (QP) in InfiniBand parlance. A QP can have different transport service types:
IBV_QPT_RC     // Reliable Connection
IBV_QPT_UC     // Unreliable Connection
IBV_QPT_UD     // Unreliable Datagram, which supports multicast
InfiniBand actually has four Transport Service types XY where X can be R (Reliable, meaning "acknowledged") or U (Unreliable) and Y can be C (Connection, meaning "connection-oriented") or D (Datagram).
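
For illustration, a minimal sketch of creating an RC QP with the verbs API is shown below; it assumes a protection domain pd and a completion queue cq have already been created with ibv_alloc_pd and ibv_create_cq, and it omits error handling.

#include <string.h>
#include <infiniband/verbs.h>

/* Create a Reliable Connection (RC) QP on an existing PD and CQ. */
struct ibv_qp *create_rc_qp(struct ibv_pd *pd, struct ibv_cq *cq)
{
    struct ibv_qp_init_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.send_cq = cq;               /* where send completions are reported      */
    attr.recv_cq = cq;               /* where receive completions are reported   */
    attr.qp_type = IBV_QPT_RC;       /* IBV_QPT_UC or IBV_QPT_UD also possible   */
    attr.cap.max_send_wr  = 16;      /* max outstanding WRs on the send queue    */
    attr.cap.max_recv_wr  = 16;      /* max outstanding WRs on the receive queue */
    attr.cap.max_send_sge = 1;       /* scatter/gather elements per WR           */
    attr.cap.max_recv_sge = 1;

    return ibv_create_qp(pd, &attr); /* NULL on failure */
}
The newly created QP still has to be transitioned through the INIT, RTR, and RTS states with ibv_modify_qp (exchanging LIDs and QP numbers with the peer, typically via a connection manager; see the connection management section below) before it can actually send or receive.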

To transmit a message, the sender places a work request (WR) on the send queue of the QP via the ibv_post_send function. The supported WR opcodes (as defined in infiniband/verbs.h) depend on the QP's transport service type, and they are

IBV_WR_SEND
IBV_WR_SEND_WITH_IMM
IBV_WR_RDMA_READ
IBV_WR_RDMA_WRITE
IBV_WR_RDMA_WRITE_WITH_IMM
IBV_WR_ATOMIC_CMP_AND_SWP
IBV_WR_ATOMIC_FETCH_AND_ADD
and only the IBV_QPT_RC service type supports all of the above.
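
A minimal sketch of posting a send WR follows; it assumes the buffer has already been registered with ibv_reg_mr (yielding mr) and that the QP is connected and in the RTS state.

#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

/* Post a single IBV_WR_SEND work request referencing a registered buffer. */
int post_one_send(struct ibv_qp *qp, struct ibv_mr *mr, void *buf, uint32_t len)
{
    struct ibv_sge sge;
    struct ibv_send_wr wr, *bad_wr = NULL;

    memset(&sge, 0, sizeof(sge));
    sge.addr   = (uintptr_t) buf;      /* start of the registered region   */
    sge.length = len;
    sge.lkey   = mr->lkey;             /* local key from ibv_reg_mr()      */

    memset(&wr, 0, sizeof(wr));
    wr.wr_id      = 1;                 /* user cookie, returned in the CQE */
    wr.sg_list    = &sge;
    wr.num_sge    = 1;
    wr.opcode     = IBV_WR_SEND;       /* see the opcode list above        */
    wr.send_flags = IBV_SEND_SIGNALED; /* ask for a completion entry       */

    return ibv_post_send(qp, &wr, &bad_wr);   /* 0 on success */
}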

In addition to QPs, the participating processes must also create completion queues (CQ) so they can poll the status of their WRs.
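
Completions are then harvested by polling the CQ, e.g. in a busy loop like the sketch below (alternatively, ibv_req_notify_cq together with a completion channel gives event-driven notification).

#include <stdio.h>
#include <infiniband/verbs.h>

/* Busy-poll a CQ until one work completion arrives; return 0 on success. */
int wait_one_completion(struct ibv_cq *cq)
{
    struct ibv_wc wc;
    int n;

    do {
        n = ibv_poll_cq(cq, 1, &wc);    /* number of CQEs reaped, <0 on error */
    } while (n == 0);

    if (n < 0)
        return -1;

    if (wc.status != IBV_WC_SUCCESS) {
        fprintf(stderr, "WR %llu failed: %s\n",
                (unsigned long long) wc.wr_id, ibv_wc_status_str(wc.status));
        return -1;
    }
    return 0;
}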

One of the idiosyncrasies of InfiniBand is that, for the IBV_QPT_RC service type, the receiver must post a receive request before the sender posts its send request; otherwise the incoming message will be rejected and the sender will get a Receiver Not Ready (RNR) error.
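
In practice the receiver therefore pre-posts a batch of receive WRs before the connection is brought up. A minimal sketch of posting one receive (again assuming a buffer registered with ibv_reg_mr) looks like this:

#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

/* Pre-post one receive WR so that an incoming SEND has a buffer to land in. */
int post_one_recv(struct ibv_qp *qp, struct ibv_mr *mr, void *buf, uint32_t len)
{
    struct ibv_sge sge;
    struct ibv_recv_wr wr, *bad_wr = NULL;

    memset(&sge, 0, sizeof(sge));
    sge.addr   = (uintptr_t) buf;
    sge.length = len;
    sge.lkey   = mr->lkey;

    memset(&wr, 0, sizeof(wr));
    wr.wr_id   = 2;                    /* user cookie, returned in the CQE */
    wr.sg_list = &sge;
    wr.num_sge = 1;

    return ibv_post_recv(qp, &wr, &bad_wr);   /* 0 on success */
}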

InfiniBand connection management

There are two ways for processes to establish a connection, i.e. create a channel.

The first approach is the low-level, InfiniBand-specific InfiniBand Communication Manager (IBCM). IBCM is based on the InfiniBand connection state machine defined by the IB Architecture Specification, and it handles both connection establishment and service ID resolution. The receiver first needs to open an InfiniBand device via the ib_cm_open_device function, which opens the device file

/dev/infiniband/ucm*
and associates it with a service ID via the ib_cm_create_id function. The receiver can then listen on this service ID with the ib_cm_listen function.

One can look at the ibcm_component_query function in btl_openib_connect_ibcm.c in Open MPI's source tree for a complete example.

The userspace IBCM library is libibcm.

The second approach is the RDMA Communication Manager (RDMACM), also referred to as the CMA (Communication Manager Abstraction), a higher-level CM that operates on IP addresses. RDMACM provides an interface that is closer to the one defined for TCP/IP sockets, and it is transport independent, allowing user access to multiple RDMA devices such as InfiniBand HCAs and iWARP RNICs. For most developers, the RDMA CM provides a simpler interface that is sufficient for most applications. As in IBCM's case, the receiver process accesses the device file

/dev/infiniband/rdma_cm
during the connection establishment.

The userspace RDMACM library is librdmacm; its man pages document the RDMACM APIs.

One can check the rdmacm_component_query function in btl_openib_connect_rdmacm.c in Open MPI's source tree for a complete example.
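
For orientation, here is a hedged sketch of the active (client) side of an RDMACM connection using the synchronous helper calls rdma_getaddrinfo and rdma_create_ep from librdmacm; the server address and port, the queue depths, and the choice of this simplified API (rather than the rdma_create_id/rdma_resolve_addr/rdma_resolve_route event loop used by older code such as Open MPI's openib BTL) are assumptions made for brevity.

#include <string.h>
#include <rdma/rdma_cma.h>

/* Hedged sketch: resolve an IP address/port to an RDMA device, create a
 * cm_id with an attached RC QP, and connect. Error handling is minimal. */
int connect_to_server(char *ip, char *port)
{
    struct rdma_addrinfo hints, *res = NULL;
    struct ibv_qp_init_attr qp_attr;
    struct rdma_cm_id *id = NULL;

    memset(&hints, 0, sizeof(hints));
    hints.ai_port_space = RDMA_PS_TCP;          /* connection-oriented */
    if (rdma_getaddrinfo(ip, port, &hints, &res))
        return -1;

    memset(&qp_attr, 0, sizeof(qp_attr));
    qp_attr.qp_type = IBV_QPT_RC;
    qp_attr.cap.max_send_wr = qp_attr.cap.max_recv_wr = 4;
    qp_attr.cap.max_send_sge = qp_attr.cap.max_recv_sge = 1;

    /* Creates the cm_id bound to the resolved device, plus its QP. */
    if (rdma_create_ep(&id, res, NULL, &qp_attr))
        goto out_free;

    /* Underneath, librdmacm talks to /dev/infiniband/rdma_cm as noted above. */
    if (rdma_connect(id, NULL))
        goto out_destroy;

    /* ... exchange data on id->qp, then tear down ... */
    rdma_disconnect(id);
    rdma_destroy_ep(id);
    rdma_freeaddrinfo(res);
    return 0;

out_destroy:
    rdma_destroy_ep(id);
out_free:
    rdma_freeaddrinfo(res);
    return -1;
}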

Analysis of libibverbs

libibverbs is the userspace InfiniBand verbs library, which implements the ibv_* APIs mentioned earlier. In reality, libibverbs is just a wrapper which loads and sets up a userspace InfiniBand RDMA device driver lib*-rdmav2.so, such as libmlx4-rdmav2.so for the Mellanox ConnectX adapter, libnes-rdmav2.so for the NetEffect Ethernet adapter, etc.

After the connection has been established, the user program calls ibv_get_device_list which is implemented internally as __ibv_get_device_list in device.c. __ibv_get_device_list will do the following:

  1. Call ibv_fork_init if either of the environment variables RDMAV_FORK_SAFE or IBV_FORK_SAFE is present
  2. Check ABI version by looking at sysfs file
    /sys/class/infiniband_verbs/abi_version
  3. Read userspace InfiniBand RDMA device driver configuration files under
    /etc/libibverbs.d/
    Note that the /etc prefix is configurable when one runs the configure script.

    Each configuration file usually contains only one line, e.g. "driver mlx4". In this case mlx4 is the deviceName, which will be used later.

  4. Get ABI version info of installed InfiniBand RDMA devices by reading the following sysfs files
    /sys/class/infiniband_verbs/uverbs*/ibdev
    /sys/class/infiniband_verbs/uverbs*/abi_version
    
  5. For each deviceName obtained from Step 3, try to initialize its corresponding driver shared library. The try_driver function will call ibv_driver.init_func if it is not pointing to NULL. If the driver is not loaded yet, then ibv_driver.init_func is NULL.

    The try_driver function then examines the node type by looking at the

    /sys/class/infiniband/driverName/node_type
    file. In this file, CA means an InfiniBand Channel Adapter and RNIC means an iWARP adapter.

  6. Load the device driver shared library (the load_drivers function). When a device driver shared library is loaded for the first time, a function (usually named deviceName_register_driver) inside that shared library is invoked automatically, because it carries the __attribute__((constructor)) attribute (see the sketch after this list).

  7. The deviceName_register_driver function of the device driver shared library calls ibv_register_driver of libibverbs to register itself. In particular, ibv_driver.init_func is set to point to deviceName_driver_init.

  8. Execute Step 5 again. This time at least one device should have a valid ibv_driver.init_func. What ibv_driver.init_func does is device-driver dependent. For example, MLX4's mlx4_driver_init will read info from
    /sys/class/infiniband_verbs/uverbs*/device/vendor
    /sys/class/infiniband_verbs/uverbs*/device/device
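
The registration trick in steps 6 and 7 relies on a standard GCC/ELF feature. The toy example below (hypothetical names, not the actual mlx4 or libibverbs code) shows how a constructor function inside a shared library runs automatically when the library is loaded and can call back into the loader to register itself.

/* toy_driver.c -- illustrative only; names are hypothetical.
 * Build as a shared library: gcc -shared -fPIC toy_driver.c -o libtoy-rdmav2.so */
#include <stdio.h>

/* Stand-in for libibverbs' ibv_register_driver(). */
static void toy_register_driver(const char *name)
{
    printf("registering driver '%s'\n", name);
    /* A real driver would hand libibverbs its init function here,
     * i.e. make ibv_driver.init_func point at its own init routine. */
}

/* Runs automatically when the .so is loaded by load_drivers()/dlopen(),
 * which is exactly the __attribute__((constructor)) mechanism of step 6. */
static void __attribute__((constructor)) toy_driver_register(void)
{
    toy_register_driver("toy");
}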
    
After the user program has a list of devices (i.e. after the ibv_get_device_list call), it calls ibv_open_device, which is implemented internally as __ibv_open_device in device.c. __ibv_open_device will do the following (a short usage sketch of these two calls appears after the list):
  1. Open the InfiniBand Verb device at
    /dev/infiniband/uverb*
    for read/write access.

  2. Create a corresponding ibv_context struct by calling the ibv_driver.ops.alloc_context function. This ibv_driver.ops struct is populated when the device driver shared library is loaded. In the MLX4 device driver's source file mlx4.c one can find the following code snippet:
    static struct ibv_device_ops mlx4_dev_ops = {
        .alloc_context = mlx4_alloc_context,
        .free_context  = mlx4_free_context
    };
    
    and the following assignment in mlx4_alloc_context function
        context->ibv_ctx.ops = mlx4_ctx_ops;
    
  3. Return the ibv_context struct just created.
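
From the application's point of view, the two steps described above reduce to a short call sequence; a minimal sketch that opens the first device found is shown below.

#include <stdio.h>
#include <infiniband/verbs.h>

/* Open the first available verbs device and return its context, or NULL. */
struct ibv_context *open_first_device(void)
{
    int num = 0;
    struct ibv_device **list = ibv_get_device_list(&num);  /* __ibv_get_device_list */
    struct ibv_context *ctx = NULL;

    if (!list || num == 0) {
        fprintf(stderr, "no RDMA devices found\n");
        return NULL;
    }

    printf("opening %s\n", ibv_get_device_name(list[0]));
    ctx = ibv_open_device(list[0]);   /* ends up in the driver's alloc_context */

    ibv_free_device_list(list);       /* the open context remains valid */
    return ctx;
}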

The ibv_context struct contains an important member called ops (of type struct ibv_context_ops). The InfiniBand Verbs APIs are implemented inside the device driver shared library, and struct ibv_context_ops contains function pointers to those implementations.

For example, the ibv_post_send Verbs API is simply defined as

static inline int ibv_post_send(struct ibv_qp *qp, struct ibv_send_wr *wr, struct ibv_send_wr **bad_wr)
{
    return qp->context->ops.post_send(qp, wr, bad_wr);
}
One can also grep for context->ops. in verbs.c of libibverbs to see how these function pointers are used.

So how is ibv_context.ops populated? In MLX4's case, ibv_context.ops is populated by simply copying mlx4_ctx_ops, which is fixed at compile time. In the source file mlx4.c one can find the following code snippet:

static struct ibv_context_ops mlx4_ctx_ops = {
    .query_device  = mlx4_query_device,
    .query_port    = mlx4_query_port,
              . . . .
    .attach_mcast  = mlx4_attach_mcast,
    .detach_mcast  = mlx4_detach_mcast
};
Documentation of the InfiniBand Verbs API and the RDMACM API can be found in the RDMA Aware Networks Programming User Manual from Mellanox.