Notes on high-performance interconnect

Acronyms in the high-performance interconnect and MPI world

ACM: Communication Management Assistant. It provides address and route (path) resolution services to InfiniBand fabrics, much like the Address Resolution Protocol (ARP).
ADI: Abstract Device Interface (MPICH specific)
AH: Address Handle (InfiniBand specific)
An object which describes the path to the remote side, used with UD QPs.
AL: Access Layer (InfiniBand specific)
Low-level operating system infrastructure (plumbing) used for accessing the interconnect fabric (VPI, InfiniBand, Ethernet, FCoE). It includes all basic transport services needed to support upper-level network protocols, middleware, and management agents.
ALPS: Application Level Placement Scheduler (Cray MPI specific)
APM: Automatic Path Migration (InfiniBand specific)
ARMCI: Aggregate Remote Memory Copy Interface
BML: BTL Management Layer. A thin layer between the PML and the BTLs in Open MPI; it provides multiplexing between multiple PMLs and multiple BTLs.
BTL: Byte Transfer Layer. The fabric-specific layer in Open MPI, used by the OB1 and DR PMLs.
CH: Channel (MPICH specific)
CM: Connection/Communication Manager (InfiniBand specific)
An entity responsible for establishing, maintaining, and releasing communication for the RC and UC QP service types. The Service ID Resolution Protocol enables users of the UD service to locate QPs supporting their desired service. There is a CM on every IB port of the end nodes.

In the context of Open MPI, CM is a PML implementation whose name comes from Connor MacLeod of the movie Highlander. Unlike OB1, CM is designed to use directly the native, MPI-like tag-matching interface exposed by certain fabrics, e.g. Myrinet MX, QLogic PSM, and Cray Portals.

CMA: Communication Manager Abstraction, another name for the RDMA CM (OpenFabrics specific)
CNO: Consumer Notification Object (DAPL specific)
CQE: Completion Queue Entry (InfiniBand specific)
An entry in the Completion Queue that describes information about a completed WR (status, size, etc.)
CR: Connection Request
CRCP: Checkpoint/Restart Coordination Protocol; a PML implementation in Open MPI.
CXGB: Chelsio T3/T4 10 Gigabit Ethernet
DAPL: Direct Access Programming Library
DAT: Direct Access Transport
DBL: Data Bypass Layer, a user-level, kernel-bypass messaging software package for Myricom 10-Gigabit Ethernet.
DDP: Direct Data Placement
DDR: Double Data Rate (InfiniBand specific)
DPM: Dynamic Process Management (Open MPI specific)
DR: Data Reliability. One of the PML implementations in Open MPI; unlike OB1, which relies on error codes propagated from lower-level communication libraries, DR can better handle network failures.
DTC: Data Transfer Completion (DAPL specific)
DTO: Data Transfer Operation (DAPL specific)
EE, EEC: End-to-End Context (InfiniBand specific)
eHCA: IBM eServer GX-based InfiniBand HCA
EP: Endpoint
EVD: Event Dispatcher (DAPL specific)
FCA: Voltaire's (now Mellanox's) Fabric Collective Accelerator
FCoE: Fibre Channel over Ethernet
FMS: Fabric Management System (Myrinet specific)
GbE: Gigabit Ethernet
GM: Myrinet Generic Messages protocol
GRU: Global Reference Unit (SGI Altix specific)
GSI: General Services Interface
HCA: Host Channel Adapter (InfiniBand specific)
IA: Interface Adapter
IB: InfiniBand
IBAL: InfiniBand Access Layer
IBTA: InfiniBand Trade Association
IPoIB: IP over InfiniBand
iPath: QLogic InfiniPath HCA
iSER: iSCSI Extensions for RDMA
ITAPI: Interconnect Transport API, one of the APIs for InfiniBand.
iWARP: Internet Wide Area RDMA Protocol, a competing protocol to InfiniBand
kDAPL: Kernel-level DAPL
LID: Local Identifier (InfiniBand specific)
A 16-bit address assigned to end nodes by the Subnet Manager. Each LID is unique within its subnet.
LMR: Local Memory Region
MAD: Management Datagram (InfiniBand specific)
MCA: Modular Component Architecture (Open MPI specific)
MCP: Myrinet Control Program
MLX4: Mellanox ConnectX adapter
MPD: Multi-Purpose Daemon (MPI specific)
MPID: A namespace in the MPICH source tree. Functions in this namespace are hardware-dependent (D for Device).
MPIR: A namespace in the MPICH source tree. Functions in this namespace are hardware-independent (R for Runtime).
MPT: Message Passing Toolkit
MR: Memory Region (InfiniBand specific)
A set of memory buffers which have already been registered with access permissions. These buffers need to be registered in order for the network adapter to make use of them. During registration an lkey and an rkey are created and associated with the memory region.
MSI: Message Signaled Interrupt
MTHCA: Mellanox InfiniHost device
MTL: Matching Transport Layer (used by the "CM" PML implementation in Open MPI)
MX: Myrinet Express protocol
MXM: MellanoX Messaging
NES: Intel NetEffect Ethernet Cluster Server adapter
OB1: A PML implementation in Open MPI; its focus is high performance, and it will use RDMA if available.
The name is inspired by Star Wars's Obi-Wan. Open MPI also has a BML implementation called R2.
ODLS: ORTE Daemon Launch System (Open MPI specific)
OFA: OpenFabrics Alliance (formerly OpenIB)
OFED: OpenFabrics Enterprise Distribution
OMPI: Open MPI
OPA: Open Portable Atomics (MPICH specific)
OPAL: Open Portability Access Layer (Open MPI specific)
OpenPA: Open Portable Atomics (MPICH specific)
ORTE: Open Run-Time Environment (Open MPI specific)
OpenIB: former name of the OpenFabrics Alliance
OSC: One-Sided Communication (Open MPI specific)
P4: Portable Programs for Parallel Processors protocol (MPICH specific)
PD: Protection Domain (InfiniBand specific)
An object whose components can interact only with each other: AHs interact with QPs, and MRs interact with WQs.
PIO: Programmed I/O
PLS: Process Launch System (Open MPI specific)
PMA: Performance Management Agent (InfiniBand specific)
PMI: Process Management Interface (MPICH2 specific)
PML: Point-to-Point Management Layer. As opposed to the BTL, the PML is the fabric-independent layer in Open MPI.
PP: Per peer.

In Open MPI, each QP can be configured to use either an SRQ or per-peer receive buffers posted directly to the QP.

PSM: QLogic Performance Scaled Messaging
PSP: Public Service Point (DAPL specific)
PTL: Portals, the lowest-level network transport layer on Cray platforms
PZ: Protection Zone (DAPL specific)
QDR: Quad Data Rate (InfiniBand specific)
QP: Queue Pair (InfiniBand specific)
The pair of independent WQs (a Send Queue and a Receive Queue) packed together in one object for the purpose of transferring data between nodes of a network. Posts are used to initiate the sending or receiving of data. There are three types of QP: UD, UC, and RC.
RAS: Resource Allocation Subsystem, which queries the batch job system to determine what resources have been allocated to the MPI job; an MCA component in Open MPI
RC: Reliable Connection (InfiniBand specific)
A QP transport service type based on a connection-oriented protocol. A QP is associated with another single QP. Messages are sent reliably (in terms of ordering and content).
RCache: Registration Cache, which allows memory pools to cache registered memory for later operations.
RD: Reliable Datagram (InfiniBand specific)
RDMA: Remote Direct Memory Access
RDS: Reliable Datagram Socket
RMA: Remote Memory Access
RMAPS: Resource MAPping Subsystem, which maps MPI processes to specific nodes/cores; an MCA component in Open MPI
RMC: Reliable Multicast (InfiniBand specific)
RML: Runtime Messaging Layer communication interface, which provides basic point-to-point communication between ORTE processes; an MCA component in Open MPI
RMR: Remote Memory Region
RNR: Receiver Not Ready (InfiniBand specific)
The flow in an RC QP where there is a connection between the sides but a Receive Request is not present on the receive side.
RO: Remote Operation (InfiniBand specific)
RoCEE: RDMA over Converged Enhanced Ethernet
RQ: Request Queue
RSP: Reserved Service Point (DAPL specific)
RTR: Ready to Receive (InfiniBand specific)
RTS: Ready to Send (InfiniBand specific)
SA: Subnet Administrator (InfiniBand specific)
SCI: Scalable Coherent Interface
SCM: uDAPL socket-based CM (DAPL/OpenFabrics specific)
SDP: Sockets Direct Protocol, a protocol over an RDMA fabric (usually InfiniBand) to support stream sockets (SOCK_STREAM)
SGE: Scatter/Gather Element (InfiniBand specific)
An entry pointing to all or part of a locally registered memory block. The element holds the start address of the block, its size, and the lkey (with its associated permissions).
SHMEM: Shared Memory
SM: Shared Memory (Intel MPI specific), or Subnet Manager (InfiniBand specific)
SMA: Subnet Manager Agent (InfiniBand specific)
SMI: Subnet Management Interface
SNAPC: Snapshot Coordination; an MCA component in Open MPI
SR: Send Request (InfiniBand specific)
A WR posted to an SQ which describes how much data is going to be transferred, its direction, and the way it is transferred (specified by the opcode).
SRP: SCSI RDMA Protocol
SRQ: Shared Receive Queue (InfiniBand specific)
A queue which holds WQEs for incoming messages from any RC/UD QP associated with it. More than one QP can be associated with the same SRQ.
TCA: Target Channel Adapter (InfiniBand specific)
A Channel Adapter that is not required to support Verbs, usually used in I/O devices.
TMI: Tag Matching Interface, e.g. QLogic PSM and Myricom MX
TOE: TCP Offload Engine
UC: Unreliable Connection (InfiniBand specific)
A QP transport service type based on a connection-oriented protocol, where a QP is associated with another single QP. The QPs do not execute a reliable protocol and messages can be lost.
UCM: uDAPL unreliable-datagram-based CM (DAPL/OpenFabrics specific)
UD: Unreliable Datagram (InfiniBand specific)
A QP transport service type in which messages can be one packet long and every UD QP can send/receive messages to/from any other UD QP in the subnet. Messages can be lost and ordering is not guaranteed. UD is the only QP type which supports multicast messages.
UDA: Mellanox's Unstructured Data Accelerator
uDAPL: User-level DAPL
ULP: Upper Layer Protocol (InfiniBand specific)
V: Message logging and replay protocol; a PML implementation in Open MPI based on MPICH-V (V for Volatility).
VAPI: Mellanox's Verbs API for InfiniBand.
VI: Virtual Interface
VIA: Virtual Interface Architecture
VIPL: VI Provider Library
VMA: Voltaire Messaging Accelerator
VPI: Virtual Protocol Interconnect
It allows the user to change the layer-2 protocol of the port.
WQE: Work Queue Element (InfiniBand specific)
WR: Work Request (InfiniBand specific)
A request posted by a user to a work queue
XRC: eXtended Reliable Connection.

A transport layer introduced by Mellanox to improve scalability in multi-core environments. It allows a single receive QP to be shared by multiple SRQs across one or more processes.

XPMEM: Cross Partition Memory (SGI Altix specific)

Environment variables for Myrinet MX

(Their details can be found in the MX source tree, in mx-1.*/doc/libmx-options.txt and mx-1.*/README)

MX_STATS: Set to 1 to enable reporting of statistics when the endpoint is closed (libmyriexpress must be compiled with --enable-stats when running the configure script).

If the MPI library does not close the endpoint, nothing will be printed out.

MX_DEBUG_FILE: Output statistics and all debugging information to this file
MX_VERBOSE: Set to a non-zero value to display debugging information. The bigger the value, the more verbose the output.
MX_MONOTHREAD: Set to 1 to use a single thread. Default is 0
MX_ZOMBIE_SEND: Set to 0 to ensure a message is only reported as completed when it has safely reached its destination.
MX_NO_MYRINET: Set to 1 to allow intra-node communication only
MX_DISABLE_SELF: Set to 1 to disable the self-communication channel
MX_DISABLE_SHMEM: Set to 1 to disable the shared-memory channel
MX_RCACHE: Set to 0 to disable the registration cache.

This is related to bandwidth performance tuning.

MX_CPUS: Set to a non-zero value to enable CPU-process affinity
MX_IMM_ACK: Set to 1 to enable immediate ACK. Default is 4
MX_CSUM: Set to 1 to enable checksums

Obtaining Myrinet MX software

Go to ftp://ftp.myri.com/pub/MX and use the login name comrade and password ruckus to log in.

To compile libmyriexpress, one needs to run the configure script as:

./configure --with-linux=/usr/src/kernels/2.6.xx --with-fms-run=/var/run/fms --with-fms-server=<hostname>
There are other features one can enable, such as --enable-stats, --enable-thread-safety, etc.

Analysis of libmyriexpress

Myrinet MX's API closely mirrors MPI's. It has two send functions, mx_isend and mx_issend; their implementations can be found in libmyriexpress/mx_isend.c and libmyriexpress/mx_issend.c

Both mx_isend and mx_issend use an internal function mx__post_send (in libmyriexpress/mx__post_send.c) to post the send request to the request queue. The messages are classified according to their sizes: MX__REQUEST_TYPE_SEND_TINY, MX__REQUEST_TYPE_SEND_SMALL, MX__REQUEST_TYPE_SEND_MEDIUM, and MX__REQUEST_TYPE_SEND_LARGE.
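
From the user's side, posting a nonblocking send and completing it looks roughly like the sketch below. This is only a sketch: the prototypes follow the public myriexpress.h API (mx_init, mx_open_endpoint, mx_connect, mx_isend, mx_test), but the exact parameter details, the peer NIC id/endpoint id, and the tag value are assumptions chosen for illustration.

#include <stdint.h>
#include "myriexpress.h"

/* Hedged sketch: post a nonblocking MX send and poll it to completion.
 * Peer NIC id, endpoint id, filter, and the 0x1234 tag are placeholders. */
int send_hello(uint64_t peer_nic_id, uint32_t peer_eid, uint32_t filter)
{
    mx_endpoint_t ep;
    mx_endpoint_addr_t dest;
    mx_request_t req;
    mx_status_t status;
    uint32_t result = 0;
    char buf[] = "hello";
    mx_segment_t seg = { buf, sizeof(buf) };    /* pointer + length */

    mx_init();
    mx_open_endpoint(0, MX_ANY_ENDPOINT, filter, NULL, 0, &ep);
    mx_connect(ep, peer_nic_id, peer_eid, filter, MX_INFINITE, &dest);

    mx_isend(ep, &seg, 1, dest, 0x1234, NULL, &req);  /* goes through mx__post_send */

    do {                          /* mx_test also drives internal progress */
        mx_test(ep, &req, &status, &result);
    } while (!result);

    mx_close_endpoint(ep);
    mx_finalize();
    return status.code == MX_STATUS_SUCCESS ? 0 : -1;
}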

MX also has an internal function, mx__luigi, that drives progress of the request queues; the corresponding public API is mx_progress.

The actual send is done via memory-mapped I/O: the message is copied to a memory region by the function mx_copy_ureq, and the device driver in the kernel picks it up from there.

There is also an ioctl interface for libmyriexpress to communicate with the device driver. First, mx__open tries to open the following device files with O_RDWR access:

/dev/mxp*
/dev/mx*
(mxp* are privileged devices)

Then MX uses mx__ioctl to send commands to the device driver. These low-level commands can be found in libmyriexpress/mx__driver_interface.c and common/mx_io_impl.h (a sketch of the open/ioctl pattern follows the command list below). The commands commonly used by Intel MPI are:

MX_APP_WAIT                    MX_GET_MEDIUM_MESSAGE_THRESHOLD
MX_ARM_TIMER                   MX_GET_NIC_ID
MX_GET_BOARD_TYPE              MX_GET_SMALL_MESSAGE_THRESHOLD
MX_GET_COPYBLOCKS              MX_GET_VERSION
MX_GET_INSTANCE_COUNT          MX_NIC_ID_TO_PEER_INDEX
MX_GET_MAX_ENDPOINTS           MX_PEER_INDEX_TO_NIC_ID
MX_GET_MAX_PEERS               MX_SET_ENDPOINT
MX_GET_MAX_RDMA_WINDOWS        MX_WAKE
MX_GET_MAX_SEND_HANDLES
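
As promised above, here is an illustrative sketch of that open/ioctl pattern. The command codes come from common/mx_io_impl.h in the MX source; the argument type passed to ioctl here is an assumption made purely for illustration.

#include <stdint.h>
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include "mx_io_impl.h"   /* command codes such as MX_GET_MAX_ENDPOINTS */

/* Illustrative only: open an unprivileged MX device and issue one command.
 * The real library does this inside mx__open() and mx__ioctl(). */
int query_max_endpoints(void)
{
    int fd = open("/dev/mx0", O_RDWR);
    if (fd < 0) {
        perror("open /dev/mx0");
        return -1;
    }

    uint32_t max_endpoints = 0;     /* argument layout is assumed here */
    if (ioctl(fd, MX_GET_MAX_ENDPOINTS, &max_endpoints) == 0)
        printf("max endpoints per board: %u\n", max_endpoints);

    close(fd);
    return 0;
}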

Myrinet MX: Life of a request

In libmyriexpress/mx__lib_types.h there is a description of "life of a request". It is copied here:

Send Tiny

  1. If enough resources, go to 4
  2. Mark MX__REQUEST_STATE_SEND_QUEUED, enqueue in send_reqq
  3. mx__process_requests removes MX__REQUEST_STATE_SEND_QUEUED and dequeues
  4. mx__post_send posts the request, marks MX__REQUEST_STATE_BUFFERED and MX__REQUEST_STATE_MCP and enqueues in buffered_sendq
  5. mx__process_events dequeues from buffered_sendq, removes MCP, enqueues in resend_list
  6. mx__process_resend_list adds MX__REQUEST_STATE_SEND_QUEUED, dequeues from resend_list and queues in resend_reqq
  7. mx__process_requests removes MX__REQUEST_STATE_SEND_QUEUED and dequeues
  8. mx__post_send posts the request, marks MX__REQUEST_STATE_MCP and enqueues in buffered_sendq
  9. mx__process_events dequeues from buffered_sendq, removes MX__REQUEST_STATE_MCP, enqueues in resend_list
  10. Go to 6 until MX__REQUEST_STATE_ACKED
  11. When acked, mx__mark_request_acked dequeues from resend_list and calls mx__send_acked_and_mcp_complete
  12. mx__send_complete queues in doneq/ctxid
  13. mx_test/wait/ipeek/peek/test_any/wait_any dequeues from doneq/ctxid

Send Small

As Tiny, except:

4,8: mx__post_send calls mx__ptr_stack_pop to get room to write the data.
5,9: mx__process_events calls mx__ptr_stack_push to release room.

Send Medium

As Tiny, except:

Skip 1, do 2,3
4: mx__post_send calls mx__memory_pool_alloc to get room in the send copyblock
11: mx__send_acked_and_mcp_complete calls mx__release_send_medium which calls mx__memory_pool_free

Send Large

  1. Mark MX__REQUEST_STATE_SEND_QUEUED, enqueue in send_reqq
  2. mx__process_requests removes MX__REQUEST_STATE_SEND_QUEUED and dequeues
  3. mx__post_send removes MX__REQUEST_STATE_SEND_QUEUED, adds MX__REQUEST_STATE_MCP, allocates an rdma window, posts the rndv and enqueues to large_sendq
  4. mx__process_events removes MX__REQUEST_STATE_MCP, dequeues from large_sendq, enqueues to resend_list
  5. mx__process_resend_list adds MX__REQUEST_STATE_SEND_QUEUED, dequeues from resend_list and queues in resend_reqq
  6. mx__process_requests removes MX__REQUEST_STATE_SEND_QUEUED and dequeues
  7. mx__post_send posts the request, marks MX__REQUEST_STATE_MCP and enqueues in large_sendq
  8. mx__process_events removes MX__REQUEST_STATE_MCP, dequeues from large_sendq, enqueues to resend_list
  9. Go to 5 until MX__REQUEST_STATE_ACKED
  10. mx__send_acked_and_mcp_complete enqueues notifying_large_sendq
  11. mx__process_receives gets a MX_MCP_UEVT_RECV_NOTIFY, dequeues from notifying_large_sendq, calls mx__process_recv_notify which calls mx__rndv_got_notify
  12. mx__release_send_large releases the rdma window
  13. mx__send_complete queues in doneq/ctxid
  14. mx_test/wait/ipeek/peek/test_any/wait_any dequeues from doneq/ctxid

Incoming message

  1. mx__process_events calls mx__process_recvs with the callback associated to the event type
  2. mx__process_recvs checks whether the message is obsolete and, if so, drops it
  3. mx__process_recvs checks whether the message is early and, if so, stores it in the early queue
  4. mx__process_ordered_evt is called on the message if it is the expected seqnum
  5. mx__endpoint_match_receive tries to match the message in the receive queue

    If unexpected (no match found), mx__create_unexp_for_evt is called to generate an unexpected request and store it in the unexp queue

    If expected (match found), the corresponding receive request is filled with the message info, and MX__REQUEST_STATE_RECV_MATCHED is set

  6. The callback is called

    For tiny messages, mx__process_recv_tiny copies the data from the event and calls mx__recv_complete if the message is expected

    For small messages, mx__process_recv_small copies the data from the copy block and calls mx__recv_complete if the message is expected

    For medium messages, mx__process_recv_medium moves the request into the multifrag_recvq if the message is expected, inserts it in the partner partialq if it is not the last fragment, and calls mx__process_recv_copy_frag
    1: mx__process_recv_copy_frag copies the fragment from the copy block and calls mx__received_last_frag if it was the last fragment
    2: mx__received_last_frag removes the partner partialq if there's more than one fragment, removes from the multifrag_recvq if expected, and calls mx__recv_complete

    For large message, mx__process_recv_large calls mx__queue_large_recv if the message is expected (see Recv Large below)

  7. mx__process_early is called to process the next seqnum messages if already arrived

Recv posting

  1. mx_irecv calls mx__endpoint_match_unexpected to try to match the new receive against the unexp queue; if there is no match, the recv is enqueued in the recv_reqq, otherwise go on
  2. The receive request gets info from the unexpected requests, and MX__REQUEST_STATE_RECV_MATCHED is set
  3. If not large, the data is copied from the unexpected
  4. If medium and not entirely received, the unexpected request is replaced with the receive request in the partner's partialq, and the receive request is enqueued in the multifrag_recvq

    If large, mx__queue_large_recv is called (see Recv Large below)

  5. The unexpected request is released

Recv large

  1. mx__queue_large_recv changes the type from MX__REQUEST_TYPE_RECV to MX__REQUEST_TYPE_RECV_LARGE, and sets MX__REQUEST_STATE_SEND_QUEUED
  2. Enqueue to send_rreq and if length>0 then mark notifying and goto 5)
  3. mx__process_requests removes MX__REQUEST_STATE_SEND_QUEUED, adds MX__REQUEST_STATE_MCP, allocates rdma window, calls mx__post_large_recv, and enqueues to large_getq
  4. mx__process_events removes MX__REQUEST_STATE_MCP, adds MX__REQUEST_STATE_SEND_QUEUED, sets notifying, removes from large_getq and enqueues in send_rreq
  5. mx__process_requests calls mx__post_send and enqueues to large_getq
  6. mx__process_events removes MX__REQUEST_STATE_MCP, removes from large_getq and calls mx__send_acked_and_mcp_complete if MX__REQUEST_STATE_ACKED or go back to 4
  7. mx__release_recv_large releases rdma window
  8. mx__recv_complete enqueues in doneq/ctxid

Troubleshooting Myrinet MX

What is "warning: regcache incompatible with malloc"?

Myrinet MX uses a registration cache (see the acronyms table above) to achieve higher performance. When the registration cache feature is enabled, Myrinet MX manages all memory allocations by itself, i.e. it has its own implementation of malloc, free, realloc, mremap, munmap, sbrk, etc. (see mx__regcache.c in the libmyriexpress package).

The warning message in question pops up when mx__regcache_works returns 0. On Linux, this means that calling a malloc/free pair did not trigger the variable mx__hook_triggered.

Registration cache checks can be disabled by setting the environment variable MX_RCACHE to 2.

The registration cache can sometimes cause strange errors. It can be disabled entirely by setting the environment variable MX_RCACHE to 0.

Obtaining InfiniBand/OpenFabrics software

Go to http://www.openfabrics.org/downloads/. For developers, check the Verbs and RDMACM links.

InfiniBand diagnostic utilities

If the OFED software stack is installed properly, there should be a number of InfiniBand tools under:
/usr/bin/ib*
/usr/sbin/ib*
ibstatus, ibstat, and ibv_devinfo print out the device info (all of them query /sys/class/infiniband/). ibv_devinfo is particularly useful for checking the health of the InfiniBand card.

ibnetdiscover should print out the whole InfiniBand fabric.

perfquery prints out the performance counters and error counters of the InfiniBand card.

To test point-to-point RDMA bandwidth, use ib_read_bw or ib_send_bw. First, log in to node A and run ib_read_bw, which starts a server and waits for a connection. Then log in to node B and run

$ ib_read_bw A
and it will connect to the server on node A and report the bandwidth.

Similarly, use ib_read_lat or ib_send_lat to measure point-to-point RDMA latency.

The infiniband-diags package contains many other InfiniBand diagnostic utilities.

InfiniBand concepts

InfiniBand supports two communication modes: the traditional send/receive (a la MPI-1) and the get/put RDMA (a la MPI-2). In both modes, the two participating processes each create a channel endpoint associated with a designated memory region in the process's address space. The endpoints of the channel are called queue pairs (QP) in InfiniBand parlance. A QP can have different transport service types:
IBV_QPT_RC     // Reliable Connection
IBV_QPT_UC     // Unreliable Connection
IBV_QPT_UD     // Unreliable Datagram, which supports multicast
InfiniBand actually has four Transport Service types XY where X can be R (Reliable, meaning "acknowledged") or U (Unreliable) and Y can be C (Connection, meaning "connection-oriented") or D (Datagram).
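
For illustration, a minimal sketch of creating an RC QP with the verbs API is shown below; it assumes a protection domain pd and a completion queue cq have already been created with ibv_alloc_pd and ibv_create_cq, and it omits error handling.

#include <string.h>
#include <infiniband/verbs.h>

/* Create a Reliable Connection (RC) QP on an existing PD and CQ. */
struct ibv_qp *create_rc_qp(struct ibv_pd *pd, struct ibv_cq *cq)
{
    struct ibv_qp_init_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.send_cq = cq;               /* where send completions are reported      */
    attr.recv_cq = cq;               /* where receive completions are reported   */
    attr.qp_type = IBV_QPT_RC;       /* IBV_QPT_UC or IBV_QPT_UD also possible   */
    attr.cap.max_send_wr  = 16;      /* max outstanding WRs on the send queue    */
    attr.cap.max_recv_wr  = 16;      /* max outstanding WRs on the receive queue */
    attr.cap.max_send_sge = 1;       /* scatter/gather elements per WR           */
    attr.cap.max_recv_sge = 1;

    return ibv_create_qp(pd, &attr); /* NULL on failure */
}
The newly created QP still has to be transitioned through the INIT, RTR, and RTS states with ibv_modify_qp (exchanging LIDs and QP numbers with the peer, typically via a connection manager; see the connection management section below) before it can actually send or receive.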

To transmit a message, the sender places a work request (WR) on the send queue of the QP via the ibv_post_send function. The supported WR opcodes (as defined in infiniband/verbs.h) depend on the QP's transport service type, and they are

IBV_WR_SEND
IBV_WR_SEND_WITH_IMM
IBV_WR_RDMA_READ
IBV_WR_RDMA_WRITE
IBV_WR_RDMA_WRITE_WITH_IMM
IBV_WR_ATOMIC_CMP_AND_SWP
IBV_WR_ATOMIC_FETCH_AND_ADD
and only the IBV_QPT_RC service type supports all of the above.
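
A minimal sketch of posting a send WR follows; it assumes the buffer has already been registered with ibv_reg_mr (yielding mr) and that the QP is connected and in the RTS state.

#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

/* Post a single IBV_WR_SEND work request referencing a registered buffer. */
int post_one_send(struct ibv_qp *qp, struct ibv_mr *mr, void *buf, uint32_t len)
{
    struct ibv_sge sge;
    struct ibv_send_wr wr, *bad_wr = NULL;

    memset(&sge, 0, sizeof(sge));
    sge.addr   = (uintptr_t) buf;      /* start of the registered region   */
    sge.length = len;
    sge.lkey   = mr->lkey;             /* local key from ibv_reg_mr()      */

    memset(&wr, 0, sizeof(wr));
    wr.wr_id      = 1;                 /* user cookie, returned in the CQE */
    wr.sg_list    = &sge;
    wr.num_sge    = 1;
    wr.opcode     = IBV_WR_SEND;       /* see the opcode list above        */
    wr.send_flags = IBV_SEND_SIGNALED; /* ask for a completion entry       */

    return ibv_post_send(qp, &wr, &bad_wr);   /* 0 on success */
}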

In addition to QPs, the participating processes must also create completion queues (CQ) so they can poll the status of their WRs.
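
Completions are then harvested by polling the CQ, e.g. in a busy loop like the sketch below (alternatively, ibv_req_notify_cq together with a completion channel gives event-driven notification).

#include <stdio.h>
#include <infiniband/verbs.h>

/* Busy-poll a CQ until one work completion arrives; return 0 on success. */
int wait_one_completion(struct ibv_cq *cq)
{
    struct ibv_wc wc;
    int n;

    do {
        n = ibv_poll_cq(cq, 1, &wc);    /* number of CQEs reaped, <0 on error */
    } while (n == 0);

    if (n < 0)
        return -1;

    if (wc.status != IBV_WC_SUCCESS) {
        fprintf(stderr, "WR %llu failed: %s\n",
                (unsigned long long) wc.wr_id, ibv_wc_status_str(wc.status));
        return -1;
    }
    return 0;
}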

One of the idiosyncrasies of InfiniBand is that, for the IBV_QPT_RC service type, the receiver must post a receive request before the sender posts its send request; otherwise the incoming message will be rejected and the sender will get a Receiver Not Ready (RNR) error.
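
In practice the receiver therefore pre-posts a batch of receive WRs before the connection is brought up. A minimal sketch of posting one receive (again assuming a buffer registered with ibv_reg_mr) looks like this:

#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

/* Pre-post one receive WR so that an incoming SEND has a buffer to land in. */
int post_one_recv(struct ibv_qp *qp, struct ibv_mr *mr, void *buf, uint32_t len)
{
    struct ibv_sge sge;
    struct ibv_recv_wr wr, *bad_wr = NULL;

    memset(&sge, 0, sizeof(sge));
    sge.addr   = (uintptr_t) buf;
    sge.length = len;
    sge.lkey   = mr->lkey;

    memset(&wr, 0, sizeof(wr));
    wr.wr_id   = 2;                    /* user cookie, returned in the CQE */
    wr.sg_list = &sge;
    wr.num_sge = 1;

    return ibv_post_recv(qp, &wr, &bad_wr);   /* 0 on success */
}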

InfiniBand connection management

There are two ways for processes to establish a connection, i.e. create a channel.

The first approach is the low-level, InfiniBand-specific InfiniBand Communication Manager (IBCM). IBCM is based on the InfiniBand connection state machine defined by the IB Architecture Specification, and it handles both connection establishment and service ID resolution. The receiver first needs to open an InfiniBand device via the ib_cm_open_device function, which opens the device file

/dev/infiniband/ucm*
and associates it with a service ID via the ib_cm_create_id function. The receiver can then listen on this service ID with the ib_cm_listen function.

One can look at the ibcm_component_query function in btl_openib_connect_ibcm.c in Open MPI's source tree for a complete example.

The userspace IBCM library is libibcm.

The second approach is the RDMA Communication Manager (RDMACM), also referred to as the CMA (Communication Manager Abstraction), a higher-level CM that operates on IP addresses. RDMACM provides an interface that is closer to the one defined for TCP/IP sockets, and it is transport independent, allowing user access to multiple RDMA devices such as InfiniBand HCAs and iWARP RNICs. For most developers, the RDMA CM provides a simpler interface that is sufficient for most applications. As in IBCM's case, the receiver process accesses the device file

/dev/infiniband/rdma_cm
during the connection establishment.

The userspace RDMACM library is librdmacm; its man pages document the RDMACM APIs.

One can check the rdmacm_component_query function in btl_openib_connect_rdmacm.c in Open MPI's source tree for a complete example.
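
For orientation, here is a hedged sketch of the active (client) side of an RDMACM connection using the synchronous helper calls rdma_getaddrinfo and rdma_create_ep from librdmacm; the server address and port, the queue depths, and the choice of this simplified API (rather than the rdma_create_id/rdma_resolve_addr/rdma_resolve_route event loop used by older code such as Open MPI's openib BTL) are assumptions made for brevity.

#include <string.h>
#include <rdma/rdma_cma.h>

/* Hedged sketch: resolve an IP address/port to an RDMA device, create a
 * cm_id with an attached RC QP, and connect. Error handling is minimal. */
int connect_to_server(char *ip, char *port)
{
    struct rdma_addrinfo hints, *res = NULL;
    struct ibv_qp_init_attr qp_attr;
    struct rdma_cm_id *id = NULL;

    memset(&hints, 0, sizeof(hints));
    hints.ai_port_space = RDMA_PS_TCP;          /* connection-oriented */
    if (rdma_getaddrinfo(ip, port, &hints, &res))
        return -1;

    memset(&qp_attr, 0, sizeof(qp_attr));
    qp_attr.qp_type = IBV_QPT_RC;
    qp_attr.cap.max_send_wr = qp_attr.cap.max_recv_wr = 4;
    qp_attr.cap.max_send_sge = qp_attr.cap.max_recv_sge = 1;

    /* Creates the cm_id bound to the resolved device, plus its QP. */
    if (rdma_create_ep(&id, res, NULL, &qp_attr))
        goto out_free;

    /* Underneath, librdmacm talks to /dev/infiniband/rdma_cm as noted above. */
    if (rdma_connect(id, NULL))
        goto out_destroy;

    /* ... exchange data on id->qp, then tear down ... */
    rdma_disconnect(id);
    rdma_destroy_ep(id);
    rdma_freeaddrinfo(res);
    return 0;

out_destroy:
    rdma_destroy_ep(id);
out_free:
    rdma_freeaddrinfo(res);
    return -1;
}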

Analysis of libibverbs

libibverbs is the userspace InfiniBand verbs library, which implements the ibv_* APIs mentioned earlier. In reality, libibverbs is just a wrapper which loads and sets up a userspace InfiniBand RDMA device driver lib*-rdmav2.so, such as libmlx4-rdmav2.so for the Mellanox ConnectX adapter, libnes-rdmav2.so for the NetEffect Ethernet adapter, etc.

After the connection has been established, the user program calls ibv_get_device_list which is implemented internally as __ibv_get_device_list in device.c. __ibv_get_device_list will do the following:

  1. Call ibv_fork_init if either of the environment variables RDMAV_FORK_SAFE or IBV_FORK_SAFE is present
  2. Check ABI version by looking at sysfs file
    /sys/class/infiniband_verbs/abi_version
  3. Read userspace InfiniBand RDMA device driver configuration files under
    /etc/libibverbs.d/
    Note that the /etc prefix is configurable when one runs the configure script.

    Each configuration file usually contains only one line, e.g. "driver mlx4". In this case mlx4 is the deviceName, which will be used later.

  4. Get ABI version info of installed InfiniBand RDMA devices by reading the following sysfs files
    /sys/class/infiniband_verbs/uverbs*/ibdev
    /sys/class/infiniband_verbs/uverbs*/abi_version
    
  5. For each deviceName obtained from Step 3, try to initialize its corresponding driver shared library. The try_driver function will call ibv_driver.init_func if it is not pointing to NULL. If the driver is not loaded yet, then ibv_driver.init_func is NULL.

    The try_driver function then examines the node type by looking at the

    /sys/class/infiniband/driverName/node_type
    file. In this file, CA means an InfiniBand Channel Adapter and RNIC means an iWARP adapter.

  6. Load the device driver shared library (the load_drivers function). When a device driver shared library is loaded for the first time, a function (usually named deviceName_register_driver) inside that shared library is invoked automatically, because it carries the __attribute__((constructor)) attribute (see the sketch after this list).

  7. The deviceName_register_driver function of the device driver shared library calls ibv_register_driver of libibverbs to register itself. In particular, ibv_driver.init_func is set to point to deviceName_driver_init.

  8. Execute Step 5 again. This time at least one device should have a valid ibv_driver.init_func. What ibv_driver.init_func does is device-driver dependent. For example, MLX4's mlx4_driver_init will read info from
    /sys/class/infiniband_verbs/uverbs*/device/vendor
    /sys/class/infiniband_verbs/uverbs*/device/device
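
The registration trick in steps 6 and 7 relies on a standard GCC/ELF feature. The toy example below (hypothetical names, not the actual mlx4 or libibverbs code) shows how a constructor function inside a shared library runs automatically when the library is loaded and can call back into the loader to register itself.

/* toy_driver.c -- illustrative only; names are hypothetical.
 * Build as a shared library: gcc -shared -fPIC toy_driver.c -o libtoy-rdmav2.so */
#include <stdio.h>

/* Stand-in for libibverbs' ibv_register_driver(). */
static void toy_register_driver(const char *name)
{
    printf("registering driver '%s'\n", name);
    /* A real driver would hand libibverbs its init function here,
     * i.e. make ibv_driver.init_func point at its own init routine. */
}

/* Runs automatically when the .so is loaded by load_drivers()/dlopen(),
 * which is exactly the __attribute__((constructor)) mechanism of step 6. */
static void __attribute__((constructor)) toy_driver_register(void)
{
    toy_register_driver("toy");
}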
    
After the user program has a list of devices (i.e. after the ibv_get_device_list call), it calls ibv_open_device, which is implemented internally as __ibv_open_device in device.c. __ibv_open_device will do the following (a short usage sketch of these two calls appears after the list):
  1. Open the InfiniBand Verb device at
    /dev/infiniband/uverb*
    for read/write access.

  2. Create a corresponding ibv_context struct by calling the ibv_driver.ops.alloc_context function. This ibv_driver.ops struct is populated when the device driver shared library is loaded. In the MLX4 device driver's source file mlx4.c one can find the following code snippet:
    static struct ibv_device_ops mlx4_dev_ops = {
        .alloc_context = mlx4_alloc_context,
        .free_context  = mlx4_free_context
    };
    
    and the following assignment in mlx4_alloc_context function
        context->ibv_ctx.ops = mlx4_ctx_ops;
    
  3. Return the ibv_context struct just created.
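
From the application's point of view, the two steps described above reduce to a short call sequence; a minimal sketch that opens the first device found is shown below.

#include <stdio.h>
#include <infiniband/verbs.h>

/* Open the first available verbs device and return its context, or NULL. */
struct ibv_context *open_first_device(void)
{
    int num = 0;
    struct ibv_device **list = ibv_get_device_list(&num);  /* __ibv_get_device_list */
    struct ibv_context *ctx = NULL;

    if (!list || num == 0) {
        fprintf(stderr, "no RDMA devices found\n");
        return NULL;
    }

    printf("opening %s\n", ibv_get_device_name(list[0]));
    ctx = ibv_open_device(list[0]);   /* ends up in the driver's alloc_context */

    ibv_free_device_list(list);       /* the open context remains valid */
    return ctx;
}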

The ibv_context struct contains an important member called ops (of type struct ibv_context_ops). The InfiniBand Verbs APIs are implemented inside the device driver shared library, and struct ibv_context_ops contains function pointers to those implementations.

For example, the ibv_post_send Verbs API is simply defined as

static inline int ibv_post_send(struct ibv_qp *qp, struct ibv_send_wr *wr, struct ibv_send_wr **bad_wr)
{
    return qp->context->ops.post_send(qp, wr, bad_wr);
}
One can also grep for context->ops. in verbs.c of libibverbs to see how these function pointers are used.

So how is ibv_context.ops populated? In MLX4's case, ibv_context.ops is populated by simply copying mlx4_ctx_ops, which is fixed at compile time. In the source file mlx4.c one can find the following code snippet:

static struct ibv_context_ops mlx4_ctx_ops = {
    .query_device  = mlx4_query_device,
    .query_port    = mlx4_query_port,
              . . . .
    .attach_mcast  = mlx4_attach_mcast,
    .detach_mcast  = mlx4_detach_mcast
};
Documentation of the InfiniBand Verbs API and the RDMACM API can be found in the RDMA Aware Networks Programming User Manual from Mellanox.