Acronyms in the high-performance interconnect and MPI world
ACM | Communication Management Assistant. It provides address and route (path) resolution services to the InfiniBand fabrics, just like the Address Resolution Protocol (ARP). |
ADI | Abstract Device Interface (MPICH specific) |
AH | Address handle (InfiniBand specific) An object which describes the path to the remote side used in UD QP. |
AL | Access layer (InfiniBand specific) Low level operating system infrastructure (plumbing) used for accessing the interconnect fabric (VPI, InfiniBand, Ethernet, FCoE). It includes all basic transport services needed to support upper level network protocols, middleware, and management agents. |
ALPS | Application Level Placement Scheduler (Cray MPI specific) |
APM | Automatic Path Migration (InfiniBand specific) |
ARMCI | Aggregate Remote Memory Copy Interface |
BML | BTL Management Layer. It is a thin layer between PML and BTL in Open MPI and it provides multiplexing between multiple PMLs and multiple BTLs. |
BTL | Byte Transfer Layer. This is the fabric-specific layer in Open MPI and it is used by OB1 & DR PML. |
CH | Channel (MPICH specific) |
CM | Connection/Communication Manager (InfiniBand specific) An entity responsible for establishing, maintaining, and releasing communication for RC and UC QP service types. The Service ID Resolution Protocol enables users of UD service to locate QPs supporting their desired service. There is a CM in every IB port of the end nodes. In the context of Open MPI, CM is a PML implementation, and its name comes from Connor MacLeod of the movie Highlander. Unlike OB1, CM is designed to directly utilize the native, MPI-like tagged matching interface exposed by certain fabrics, e.g. Myrinet MX, QLogic PSM, and Cray Portals. |
CMA | Communication Manager Abstraction, which is another name of RDMA CM (OpenFabrics specific) |
CNO | Consumer Notification Object (DAPL specific) |
CQE | Completion queue entry (InfiniBand specific) An entry in the Completion Queue that describes the information about the completed WR (status, size etc.) |
CR | Connection Request |
CRCP | Checkpoint/Restart Coordination Protocol; it is a PML implementation in Open MPI. |
CXGB | Chelsio T3/T4 10 Gigabit Ethernet |
DAPL | Direct Access Programming Library |
DAT | Direct Access Transport |
DBL | Data Bypass Layer, a user-level, kernel-bypass, messaging software package for Myricom 10-Gigabit Ethernet. |
DDP | Direct Data Placement |
DDR | Double data rate (InfiniBand specific) |
DPM | Dynamic Process Management (Open MPI specific) |
DR | Data Reliability. It is one of the PML implementations in Open MPI; unlike OB1, which relies on error codes propagated from lower-level communication libraries, DR can better handle network failures. |
DTC | Data Transfer Completion (DAPL specific) |
DTO | Data Transfer Operation (DAPL specific) |
EE, EEC | End-to-end context (InfiniBand specific) |
eHCA | IBM eServer GX-based InfiniBand HCA |
EP | Endpoint |
EVD | Event Dispatcher (DAPL specific) |
FCA | Voltaire's (now Mellanox) Fabric Collective Accelerator |
FCoE | Fibre Channel over Ethernet |
FMS | Fabric Management System (Myrinet specific) |
GbE | Gigabit Ethernet |
GM | Myrinet Generic Messages protocol |
GRU | Global Reference Unit (SGI Altix specific) |
GSI | General Services Interface |
HCA | Host Channel Adapter (InfiniBand specific) |
IA | Interface adapter |
IB | InfiniBand |
IBAL | InfiniBand Access Layer |
IBTA | InfiniBand Trade Association |
IPoIB | IP over InfiniBand |
iPath | QLogic InfiniPath HCA |
iSER | iSCSI Extensions for RDMA |
ITAPI | Interconnect Transport API, one of the APIs for InfiniBand. |
iWARP | Internet Wide Area RDMA Protocol, a competing protocol of InfiniBand |
kDAPL | Kernel level DAPL |
LID | Local identifier (InfiniBand specific) A 16 bit address assigned to end nodes by the Subnet Manager. Each LID is unique within its subnet. |
LMR | Local Memory Region |
MAD | Management Datagram (InfiniBand specific) |
MCA | Modular Component Architecture (Open MPI specific) |
MCP | Myrinet Control Program |
MLX4 | Mellanox ConnectX adapter |
MPD | Multi-Purpose Daemon (MPI specific) |
MPID | A name space in MPICH source code tree. Functions in this name space are hardware-dependent (D for Device) |
MPIR | A name space in MPICH source code tree. Functions in this name space are hardware-independent (R for Runtime) |
MPT | Message Passing Toolkit |
MR | Memory region (InfiniBand specific) A set of memory buffers which have already been registered with access permissions. These buffers need to be registered in order for the network adapter to make use of them. During registration an lkey and rkey are created associated with the created memory region. |
MSI | Message signaled interrupt |
MTHCA | Mellanox InfiniHost device |
MTL | Matching Transport Layer (used by the "CM" PML implementation in Open MPI) |
MX | Myrinet Express protocol |
MXM | MellanoX Messaging |
NES | Intel-NetEffect Ethernet Cluster Server adapter |
OB1 | A PML implementation in Open MPI; it focuses on high performance and uses RDMA when available. The name is inspired by Star Wars's Obi-Wan. Open MPI also has a BML implementation called R2. |
ODLS | ORTE Daemon Launch System (Open MPI specific) |
OFA | Open Fabrics Alliance (was OpenIB) |
OFED | Open Fabrics Enterprise Distribution |
OMPI | Open MPI |
OPA | Open Portable Atomics (MPICH specific) |
OPAL | Open Portability Access Layer (Open MPI specific) |
OpenPA | Open Portable Atomics (MPICH specific) |
ORTE | Open Run-Time Environment (Open MPI specific) |
OpenIB | Open Fabrics Alliance |
OSC | One Sided Communication (Open MPI specific) |
P4 | Portable Programs for Parallel Processors protocol (MPICH specific) |
PD | Protection domain (InfiniBand specific) Object whose components can interact only with each other. AHs interact with QPs, and MRs interact with WQs. |
PIO | Programmed I/O |
PLS | Process Launch System (Open MPI specific) |
PMA | Performance Management Agent (InfiniBand specific) |
PMI | Process Management Interface (MPICH2 specific) |
PML | Point-To-Point Management Layer. As opposed to BTL, PML is the fabric-independent layer in Open MPI. |
PP | Per peer. In Open MPI, each QP can be configured to use either an SRQ or per-peer (PP) receive buffers posted directly to the QP. |
PSM | QLogic Performance Scaled Messaging |
PSP | Public Service Point (DAPL specific) |
PTL | Portals, the lowest-level network transport layer on Cray platforms |
PZ | Protection Zone (DAPL specific) |
QDR | Quad data rate (InfiniBand specific) |
QP | Queue pair (InfiniBand specific) The pair (Send Queue and Receive Queue) of independent WQs packed together in one object for the purpose of transferring data between nodes of a network. Posts are used to initiate the sending or receiving of data. There are three types of QP: UD, UC, and RC. |
RAS | Resource Allocation Subsystem, which queries the batch job system to determine what resources have been allocated to the MPI job; it is an MCA component in Open MPI |
RC | Reliable Connection (InfiniBand specific) A QP Transport service type based on a connection oriented protocol. A QP is associated with another single QP. The messages are sent in a reliable way (in terms of the correctness of order and the information.) |
RCache | Registration Cache, which allows memory pools to cache registered memory for later operations. |
RD | Reliable Datagram (InfiniBand specific) |
RDMA | Remote Direct Memory Access |
RDS | Reliable Datagram Socket |
RMA | Remote Memory Access |
RMAPS | Resource MAPping Subsystem, which maps MPI processes to specific nodes/cores; it is an MCA component in Open MPI |
RMC | Reliable Multicast (InfiniBand specific) |
RML | Runtime Messaging Layer communication interface, which provides basic point-to-point communication between ORTE processes; it is an MCA component in Open MPI |
RMR | Remote Memory Region |
RNR | Receiver not ready (InfiniBand specific) The situation in an RC QP where a connection exists between the two sides but no Receive Request is present on the receive side. |
RO | Remote Operation (InfiniBand specific) |
RoCEE | RDMA over Converged Enhanced Ethernet |
RQ | Request Queue |
RSP | Reserved Service Point (DAPL specific) |
RTR | Ready to Receive (InfiniBand specific) |
RTS | Ready to Send (InfiniBand specific) |
SA | Subnet Administrator (InfiniBand specific) |
SCI | Scalable Coherent Interface |
SCM | uDAPL socket based CM (DAPL/OpenFabrics specific) |
SDP | Sockets Direct Protocol, which is a protocol over an RDMA fabric (usually InfiniBand) to support stream sockets (SOCK_STREAM) |
SGE | Scatter/Gather Elements (InfiniBand specific) An entry pointing to all or part of a locally registered memory block. Each element holds the block's start address, its size, and the lkey (with its associated permissions). |
SHMEM | Shared Memory |
SM | Shared Memory (Intel MPI specific), or Subnet Manager (InfiniBand specific) |
SMA | Subnet Manager Agent (InfiniBand specific) |
SMI | Subnet Management Interface |
SNAPC | Snapshot Coordination; an MCA component in Open MPI |
SR | Send Request (InfiniBand specific) A WR posted to a Send Queue (SQ); it describes how much data is going to be transferred and in which direction, while the opcode specifies the exact transfer operation. |
SRP | SCSI RDMA Protocol |
SRQ | Shared Receive Queue (InfiniBand specific) A queue which holds WQEs for incoming messages from any RC/UD QP which is associated with it. More than one QP can be associated with one SRQ. |
TCA | Target Channel Adapter (InfiniBand specific) A Channel Adapter that is not required to support Verbs, usually used in I/O devices |
TMI | Tag Matching Interface, e.g. QLogic PSM and Myricom MX |
TOE | TCP Offload Engine |
UC | Unreliable Connection (InfiniBand specific) A QP transport service type based on a connection oriented protocol, where a QP is associated with another single QP. The QPs do not execute a reliable protocol and messages can be lost. |
UCM | uDAPL unreliable datagram based CM (DAPL/OpenFabrics specific) |
UD | Unreliable Datagram (InfiniBand specific) A QP transport service type in which messages can be at most one packet long, and every UD QP can send messages to or receive messages from any other UD QP in the subnet. Messages can be lost and their order is not guaranteed. UD is the only QP type which supports multicast messages. |
UDA | Mellanox's Unstructured Data Accelerator |
uDAPL | User level DAPL |
ULP | Upper Layer Protocol (InfiniBand specific) |
V | Message logging and replay protocol; it is a PML implementation in Open MPI and is based on MPICH-V (V for Volatility). |
VAPI | Mellanox's Verbs API for InfiniBand. |
VI | Virtual Interface |
VIA | Virtual Interface Architecture |
VIPL | VI Provider Library |
VMA | Voltaire Messaging Accelerator |
VPI | Virtual Protocol Interconnect. It allows the user to change the layer 2 protocol of the port. |
WQE | Work Queue Element (InfiniBand specific) |
WR | Work request (InfiniBand specific) A request which was posted by a user to a work queue |
XRC | eXtended Reliable Connection. It is a new transport layer introduced by Mellanox to improve scalability in multi-core environments. It allows one single receive QP to be shared by multiple SRQs across one or more processes. |
XPMEM | Cross Partition Memory (SGI Altix specific) |
Environment variables for Myrinet MX
(One can find their details in the MX source tree at mx-1.*/doc/libmx-options.txt and mx-1.*/README)
MX_STATS | Set to 1 to enable reporting of statistics when the endpoint is closed (libmyriexpress must be compiled with --enable-stats when running the configure script). If the MPI library does not close the endpoint, nothing will be printed out. |
MX_DEBUG_FILE | Output statistics and all debugging information to this file |
MX_VERBOSE | Set to a non-zero value to display debugging information. The bigger the value, the more verbose. |
MX_MONOTHREAD | Set to 1 to use single thread. Default is 0 |
MX_ZOMBIE_SEND | Set to 0 to ensure a message is only reported as completed when it has safely reached its destination. |
MX_NO_MYRINET | Set to 1 to allow intra-node only communication |
MX_DISABLE_SELF | Set to 1 to disable self-communication channel |
MX_DISABLE_SHMEM | Set to 1 to disable shared-memory channel |
MX_RCACHE | Set to 0 to disable registration cache. This is related to bandwidth performance tuning. |
MX_CPUS | Set to a non-zero value to enable CPU-process affinity |
MX_IMM_ACK | Set to 1 to enable immediate ACK. Default is 4 |
MX_CSUM | Set to 1 to enable checksum |
Obtaining Myrinet MX software
Go to ftp://ftp.myri.com/pub/MX and use the login name comrade and password ruckus to log in.
To compile libmyriexpress, one needs to run the configure script as:
./configure --with-linux=/usr/src/kernels/2.6.xx --with-fms-run=/var/run/fms --with-fms-server=<hostname>
There are other features one can enable, such as --enable-stats, --enable-thread-safety, etc.
Analysis of libmyriexpress
Myrinet MX's API closely mirrors MPI's. It has two send functions, mx_isend and mx_issend, whose implementations can be found in libmyriexpress/mx_isend.c and libmyriexpress/mx_issend.c. Both mx_isend and mx_issend use an internal function mx__post_send (in libmyriexpress/mx__post_send.c) to post the send request to the request queue. The messages are classified according to their sizes: MX__REQUEST_TYPE_SEND_TINY, MX__REQUEST_TYPE_SEND_SMALL, MX__REQUEST_TYPE_SEND_MEDIUM, and MX__REQUEST_TYPE_SEND_LARGE.
MX also has an internal function mx__luigi that drives progress of the request queues; the corresponding public API is mx_progress.
The actual send is done via memory-mapped I/O: the message is copied to a memory region using the function mx_copy_ureq, and the device driver in the kernel picks it up.
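To make the flow concrete, here is a minimal sketch of the user-facing side of this path: open an endpoint, connect to a peer, and post a non-blocking send that ends up in mx__post_send. The signatures, the mx_segment_t layout, and constants such as MX_ANY_ENDPOINT and MX_INFINITE are paraphrased from memory of myriexpress.h, and the peer NIC id is a made-up placeholder; treat this as an illustration to be checked against the headers, not as reference code.

/* Hedged sketch of the MX send path; verify all names against myriexpress.h. */
#include <stdio.h>
#include <stdint.h>
#include "myriexpress.h"

int main(void)
{
    mx_endpoint_t      ep;
    mx_endpoint_addr_t dest;
    mx_request_t       req;
    mx_status_t        status;
    uint32_t           done = 0;
    char               buf[] = "hello over MX";
    mx_segment_t       seg = { buf, sizeof(buf) };   /* pointer + length */

    mx_init();
    /* board 0, let MX pick the endpoint id, arbitrary filter key 0x12345 */
    mx_open_endpoint(0, MX_ANY_ENDPOINT, 0x12345, NULL, 0, &ep);

    /* The peer's NIC id (0x1234 here) and endpoint id would normally come
     * from the FMS or an out-of-band exchange; they are placeholders. */
    mx_connect(ep, 0x1234ULL, 0, 0x12345, MX_INFINITE, &dest);

    /* Lands in mx_isend.c and eventually mx__post_send, where the request
     * is classified as TINY/SMALL/MEDIUM/LARGE according to its size. */
    mx_isend(ep, &seg, 1, dest, 0x42 /* match info */, NULL, &req);

    while (!done)                       /* drive progress (mx__luigi) */
        mx_test(ep, &req, &status, &done);

    mx_close_endpoint(ep);
    mx_finalize();
    return 0;
}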
There is also an ioctl interface for libmyriexpress to communicate with the device driver. First, mx__open tries to open the following device files with O_RDWR access:
/dev/mxp*
/dev/mx*
(the mxp* devices are privileged)
Then MX uses mx__ioctl to send commands to the device driver. These low-level commands can be found in libmyriexpress/mx__driver_interface.c and common/mx_io_impl.h. The common commands used by Intel MPI are:
MX_APP_WAIT, MX_ARM_TIMER, MX_GET_BOARD_TYPE, MX_GET_COPYBLOCKS, MX_GET_INSTANCE_COUNT, MX_GET_MAX_ENDPOINTS, MX_GET_MAX_PEERS, MX_GET_MAX_RDMA_WINDOWS, MX_GET_MAX_SEND_HANDLES, MX_GET_MEDIUM_MESSAGE_THRESHOLD, MX_GET_NIC_ID, MX_GET_SMALL_MESSAGE_THRESHOLD, MX_GET_VERSION, MX_NIC_ID_TO_PEER_INDEX, MX_PEER_INDEX_TO_NIC_ID, MX_SET_ENDPOINT, MX_WAKE
Myrinet MX: Life of a request
In libmyriexpress/mx__lib_types.h there is a description of "life of a request". It is copied here:
Send Tiny
1. If enough resources, go to 4
2. Mark MX__REQUEST_STATE_SEND_QUEUED, enqueue in send_reqq
3. mx__process_requests removes MX__REQUEST_STATE_SEND_QUEUED and dequeues
4. mx__post_send posts the request, marks MX__REQUEST_STATE_BUFFERED and MX__REQUEST_STATE_MCP and enqueues in buffered_sendq
5. mx__process_events dequeues from buffered_sendq, removes MCP, enqueues in resend_list
6. mx__process_resend_list adds MX__REQUEST_STATE_SEND_QUEUED, dequeues from resend_list and queues in resend_reqq
7. mx__process_requests removes MX__REQUEST_STATE_SEND_QUEUED and dequeues
8. mx__post_send posts the request, marks MX__REQUEST_STATE_MCP and enqueues in buffered_sendq
9. mx__process_events dequeues from buffered_sendq, removes MX__REQUEST_STATE_MCP, enqueues in resend_list
10. Go to 6 until MX__REQUEST_STATE_ACKED
11. When acked, mx__mark_request_acked dequeues from resend_list and calls mx__send_acked_and_mcp_complete
12. mx__send_complete queues in doneq/ctxid
13. mx_test/wait/ipeek/peek/test_any/wait_any dequeues from doneq/ctxid
Send Small
As Tiny, except:
4,8: mx__post_send calls mx__ptr_stack_pop to get room to write the data.
5,9: mx__process_events calls mx__ptr_stack_push to release room.
Send Medium
As Tiny, except:
Skip 1, do 2,3
4: mx__post_send calls mx__memory_pool_alloc to get room in the send copyblock
11: mx__send_acked_and_mcp_complete calls mx__release_send_medium which calls mx__memory_pool_free
Send Large
1. Mark MX__REQUEST_STATE_SEND_QUEUED, enqueue in send_reqq
2. mx__process_requests removes MX__REQUEST_STATE_SEND_QUEUED and dequeues
3. mx__post_send removes MX__REQUEST_STATE_SEND_QUEUED, adds MX__REQUEST_STATE_MCP, allocates the rdma window, posts the rndv and enqueues to large_sendq
4. mx__process_events removes MX__REQUEST_STATE_MCP, dequeues from large_sendq, enqueues to resend_list
5. mx__process_resend_list adds MX__REQUEST_STATE_SEND_QUEUED, dequeues from resend_list and queues in resend_reqq
6. mx__process_requests removes MX__REQUEST_STATE_SEND_QUEUED and dequeues
7. mx__post_send posts the request, marks MX__REQUEST_STATE_MCP and enqueues in large_sendq
8. mx__process_events removes MX__REQUEST_STATE_MCP, dequeues from large_sendq, enqueues to resend_list
9. Go to 5 until MX__REQUEST_STATE_ACKED
10. mx__send_acked_and_mcp_complete enqueues in notifying_large_sendq
11. mx__process_receives gets a MX_MCP_UEVT_RECV_NOTIFY, dequeues from notifying_large_sendq, calls mx__process_recv_notify which calls mx__rndv_got_notify
12. mx__release_send_large releases the rdma window
13. mx__send_complete queues in doneq/ctxid
14. mx_test/wait/ipeek/peek/test_any/wait_any dequeues from doneq/ctxid
Incoming message
1. mx__process_events calls mx__process_recvs with the callback associated with the event type
2. mx__process_recvs checks whether the message is obsolete and, if so, drops it
3. mx__process_recvs checks whether the message is early and, if so, stores it in the early queue
4. mx__process_ordered_evt is called on the message if it has the expected seqnum
5. mx__endpoint_match_receive tries to match the message in the receive queue
If unexpected (no match found), mx__create_unexp_for_evt is called to generate an unexpected request and store it in the unexp queue
If expected (match found), the corresponding receive request is filled with the message info, and MX__REQUEST_STATE_RECV_MATCHED is set
6. The callback is called
For tiny messages, mx__process_recv_tiny copies the data from the event and calls mx__recv_complete if the message is expected
For small messages, mx__process_recv_small copies the data from the copy block and calls mx__recv_complete if the message is expected
For medium messages, mx__process_recv_medium moves the request into the multifrag_recvq if the message is expected, inserts it in the partner partialq if it is not the last fragment, and calls mx__process_recv_copy_frag
1: mx__process_recv_copy_frag copies the fragment from the copy block and calls mx__received_last_frag if it was the last fragment
2: mx__received_last_frag removes the partner partialq if there is more than one fragment, removes from the multifrag_recvq if expected, and calls mx__recv_complete
For large messages, mx__process_recv_large calls mx__queue_large_recv if the message is expected (see Recv Large below)
7. mx__process_early is called to process the next seqnum messages if they have already arrived
Recv posting
1. mx_irecv calls mx__endpoint_match_unexpected to try to match the new receive in the unexp queue; if no match is found, the recv is enqueued in the recv_reqq, otherwise go on
2. The receive request gets its info from the unexpected request, and MX__REQUEST_STATE_RECV_MATCHED is set
3. If not large, the data is copied from the unexpected request
4. If medium and not entirely received, the unexpected request is replaced with the receive request in the partner's partialq, and the receive request is enqueued in the multifrag_recvq
If large, mx__queue_large_recv is called (see Recv Large below)
5. The unexpected request is released
Recv large
1. mx__queue_large_recv changes the type from MX__REQUEST_TYPE_RECV to MX__REQUEST_TYPE_RECV_LARGE, and sets MX__REQUEST_STATE_SEND_QUEUED
2. Enqueue to send_rreq and, if length>0, mark notifying and go to 5
3. mx__process_requests removes MX__REQUEST_STATE_SEND_QUEUED, adds MX__REQUEST_STATE_MCP, allocates the rdma window, calls mx__post_large_recv, and enqueues to large_getq
4. mx__process_events removes MX__REQUEST_STATE_MCP, adds MX__REQUEST_STATE_SEND_QUEUED, sets notifying, removes from large_getq and enqueues in send_rreq
5. mx__process_requests calls mx__post_send and enqueues to large_getq
6. mx__process_events removes MX__REQUEST_STATE_MCP, removes from large_getq and calls mx__send_acked_and_mcp_complete if MX__REQUEST_STATE_ACKED, otherwise goes back to 4
7. mx__release_recv_large releases the rdma window
8. mx__recv_complete enqueues in doneq/ctxid
Troubleshooting Myrinet MX
What is the warning "regcache incompatible with malloc"?
Myrinet MX uses a registration cache (see the "Acronyms in the high-performance interconnect and MPI world" table above) to achieve higher performance. When the registration cache feature is enabled, Myrinet MX manages all memory allocations by itself, i.e. it has its own implementation of malloc, free, realloc, mremap, munmap, sbrk, etc. (see mx__regcache.c in the libmyriexpress package). The warning message in question pops up when mx__regcache_works returns 0. On Linux, this means that when a malloc/free pair is called, the variable mx__hook_triggered is not triggered, i.e. MX's own allocator hooks were not the ones invoked.
The registration cache check can be disabled by setting the environment variable MX_RCACHE to 2.
The registration cache itself can sometimes cause strange errors; it can be disabled entirely by setting the environment variable MX_RCACHE to 0.
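The check that mx__regcache_works performs can be illustrated with a small, self-contained sketch; this is not MX's actual code, only the idea. We provide our own malloc that sets a flag, call a malloc/free pair, and see whether the flag was set. If some other allocator won the symbol-interposition race, the flag stays 0, which is exactly the situation the warning reports.

/* Conceptual sketch of the malloc-hook check (not MX's code). Relies on
 * glibc's __libc_malloc for the real allocation, so it is Linux-specific. */
#include <stdio.h>
#include <stdlib.h>

extern void *__libc_malloc(size_t size);   /* glibc's underlying allocator */

static volatile int hook_triggered = 0;    /* plays the role of mx__hook_triggered */

void *malloc(size_t size)                  /* our interposed malloc */
{
    hook_triggered = 1;
    return __libc_malloc(size);
}

static int regcache_works(void)
{
    hook_triggered = 0;
    free(malloc(16));                      /* a malloc/free pair, as MX does */
    return hook_triggered;                 /* 0 => our hooks are not in effect */
}

int main(void)
{
    if (!regcache_works())
        fprintf(stderr, "warning: regcache incompatible with malloc\n");
    else
        printf("malloc hook is in effect; the registration cache can be used\n");
    return 0;
}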
Obtaining InfiniBand/OpenFabrics software
Go to http://www.openfabrics.org/downloads/. For developers, check the Verbs and RDMACM links.
InfiniBand diagnostic utilities
If the OFED software stack is installed properly, there should be a number of InfiniBand tools at:
/usr/bin/ib*
/usr/sbin/ib*
ibstatus, ibstat, and ibv_devinfo print out the device info (all of them query /sys/class/infiniband/). ibv_devinfo is particularly useful for checking the health of the InfiniBand card.
ibnetdiscover should print out the whole InfiniBand fabric.
perfquery should print out the performance counters and the error counts of the InfiniBand card.
To test point-to-point RDMA bandwidth, use ib_read_bw or ib_send_bw. First, log in to node A and run ib_read_bw, which will start a server and wait for a connection. Then log in to node B and run
$ ib_read_bw A
and it will connect to the server at node A and report the bandwidth.
Similarly, use ib_read_lat or ib_send_lat to measure point-to-point RDMA latency.
The infiniband-diags package contains many other InfiniBand diagnostic utilities.
InfiniBand concepts
InfiniBand supports two communication modes: the traditional send/receive (a la MPI-1) and get/put RDMA (a la MPI-2). In both modes, the two participating processes both create channels; a channel is a designated memory region in the process's address space, and its endpoints are called queue pairs (QPs) in InfiniBand parlance. A QP can have different transport service types:
IBV_QPT_RC  // Reliable Connection
IBV_QPT_UC  // Unreliable Connection
IBV_QPT_UD  // Unreliable Datagram, which supports multicast
InfiniBand actually has four transport service types XY, where X can be R (Reliable, meaning "acknowledged") or U (Unreliable), and Y can be C (Connection, meaning "connection-oriented") or D (Datagram).
To transmit a message, the sender places a work request (WR) on the send queue of the QP via the ibv_post_send function. Currently, the supported WR types (as defined in infiniband/verbs.h) depend on the QP's transport service type, and they can be
IBV_WR_SEND
IBV_WR_SEND_WITH_IMM
IBV_WR_RDMA_READ
IBV_WR_RDMA_WRITE
IBV_WR_RDMA_WRITE_WITH_IMM
IBV_WR_ATOMIC_CMP_AND_SWP
IBV_WR_ATOMIC_FETCH_AND_ADD
and only the IBV_QPT_RC service type supports all of the above.
In addition to QPs, the participating processes must also create completion queues (CQs) so they can poll the status of WRs.
One of the idiosyncrasies of InfiniBand is that, for the IBV_QPT_RC service, the receiver must post a receive request before the sender posts its send request; otherwise the incoming message will be rejected and the sender will get a Receiver-Not-Ready (RNR) error.
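As a concrete, heavily simplified sketch: assuming a QP that is already connected, a CQ, and a buffer registered with ibv_reg_mr have been set up elsewhere, posting a send and waiting for its completion looks roughly like the following (error handling and the setup itself are omitted).

/* Sketch only: qp, cq and mr are assumed to be created and connected elsewhere. */
#include <stddef.h>
#include <stdint.h>
#include <infiniband/verbs.h>

int post_send_and_wait(struct ibv_qp *qp, struct ibv_cq *cq,
                       struct ibv_mr *mr, void *buf, size_t len)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t) buf,
        .length = (uint32_t) len,
        .lkey   = mr->lkey,             /* from ibv_reg_mr() */
    };
    struct ibv_send_wr wr = {
        .wr_id      = 1,
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_SEND,      /* plain send: peer must have posted a receive (else RNR) */
        .send_flags = IBV_SEND_SIGNALED,
    };
    struct ibv_send_wr *bad_wr = NULL;
    struct ibv_wc wc;
    int n;

    if (ibv_post_send(qp, &wr, &bad_wr))
        return -1;

    do {                                /* busy-poll the completion queue */
        n = ibv_poll_cq(cq, 1, &wc);
    } while (n == 0);

    return (n < 0 || wc.status != IBV_WC_SUCCESS) ? -1 : 0;
}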
InfiniBand connection management
There are two ways for processes to establish a connection, i.e. create a channel.
The first approach is the low-level, InfiniBand-specific InfiniBand Communication Manager (IBCM). IBCM is based on the InfiniBand connection state machine as defined by the IB Architecture Specification, and it handles both connection establishment and service ID resolution. The receiver first needs to open an InfiniBand device via the ib_cm_open_device function, which opens the device file
/dev/infiniband/ucm*
and associates it with a service ID via the ib_cm_create_id function. Then the receiver can listen on this service ID with the ib_cm_listen function.
One can look at the ibcm_component_query function in btl_openib_connect_ibcm.c in Open MPI's source tree for a complete example.
The userspace IBCM implementation is the libibcm library.
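For a flavor of how these calls fit together on the listening side, here is a heavily simplified sketch. The ib_cm_* signatures are paraphrased and may not match every libibcm version exactly, so check infiniband/cm.h and the ibcm_component_query example mentioned above before relying on it.

/* Hedged sketch of the IBCM listen path; signatures paraphrased from
 * libibcm's infiniband/cm.h and possibly different in your version. */
#include <stdio.h>
#include <stdint.h>
#include <infiniband/verbs.h>
#include <infiniband/cm.h>

int listen_on_service(struct ibv_context *verbs_ctx, uint64_t service_id)
{
    struct ib_cm_device *cm_dev;
    struct ib_cm_id *cm_id;

    cm_dev = ib_cm_open_device(verbs_ctx);       /* opens /dev/infiniband/ucm* */
    if (!cm_dev)
        return -1;

    if (ib_cm_create_id(cm_dev, &cm_id, NULL))   /* NULL: no private context */
        return -1;

    /* Listen for connection requests addressed to this service ID
     * (0 as the service mask; check the header for its exact meaning). */
    if (ib_cm_listen(cm_id, service_id, 0))
        return -1;

    printf("listening on service id 0x%llx\n", (unsigned long long) service_id);
    /* Connection events would then be retrieved with ib_cm_get_event(). */
    return 0;
}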
The second approach is the RDMA Communication Manager (RDMACM), also referred to as the CMA (Communication Manager Abstraction), a higher-level CM that operates on IP addresses. RDMACM provides an interface closer to that defined for TCP/IP sockets and is transport independent, allowing user access to multiple RDMA devices such as InfiniBand HCAs and iWARP RNICs. For most developers, RDMACM provides a simpler interface that is sufficient for most applications. As in IBCM's case, the receiver process will access the device file
/dev/infiniband/rdma_cm
during connection establishment.
The userspace RDMACM implementation is the librdmacm library, which ships with man pages for its APIs.
One can check the rdmacm_component_query function in btl_openib_connect_rdmacm.c in Open MPI's source tree for a complete example.
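To give a feel for the RDMACM flow, here is a minimal client-side sketch that resolves an address, resolves a route, creates a QP, and connects. Passing a NULL event channel to rdma_create_id puts the id into synchronous mode, so each call blocks until the corresponding CM event completes; the host and port are whatever the peer advertises, and most error handling is omitted.

/* Minimal RDMACM client sketch (synchronous mode, error handling trimmed). */
#include <string.h>
#include <netdb.h>
#include <rdma/rdma_cma.h>

int rdmacm_client_connect(const char *host, const char *port)
{
    struct rdma_cm_id *id = NULL;
    struct addrinfo *res = NULL;
    struct ibv_qp_init_attr qp_attr;
    struct rdma_conn_param conn;

    /* NULL event channel => synchronous mode: calls block until the event. */
    if (rdma_create_id(NULL, &id, NULL, RDMA_PS_TCP))    /* opens /dev/infiniband/rdma_cm */
        return -1;
    if (getaddrinfo(host, port, NULL, &res))
        return -1;

    rdma_resolve_addr(id, NULL, res->ai_addr, 2000);     /* -> RDMA_CM_EVENT_ADDR_RESOLVED */
    rdma_resolve_route(id, 2000);                        /* -> RDMA_CM_EVENT_ROUTE_RESOLVED */

    memset(&qp_attr, 0, sizeof(qp_attr));
    qp_attr.qp_type = IBV_QPT_RC;
    qp_attr.cap.max_send_wr = qp_attr.cap.max_recv_wr = 16;
    qp_attr.cap.max_send_sge = qp_attr.cap.max_recv_sge = 1;
    rdma_create_qp(id, NULL, &qp_attr);                  /* NULL PD/CQs: library allocates them */

    memset(&conn, 0, sizeof(conn));
    rdma_connect(id, &conn);                             /* -> RDMA_CM_EVENT_ESTABLISHED */

    freeaddrinfo(res);
    return 0;          /* id->qp is now ready for ibv_post_send/ibv_post_recv */
}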
Analysis of libibverbs
libibverbs is the userspace InfiniBand verbs library, which implements the ibv_* APIs mentioned earlier. In reality, libibverbs is just a wrapper that loads and sets up a userspace InfiniBand RDMA device driver library lib*-rdmav2.so, such as libmlx4-rdmav2.so for Mellanox ConnectX adapters or libnes-rdmav2.so for NetEffect Ethernet adapters. After the connection has been established, the user program calls ibv_get_device_list, which is implemented internally as __ibv_get_device_list in device.c. __ibv_get_device_list does the following (a sketch of the same flow from the application's point of view follows the list):
1. Call ibv_fork_init if either the RDMAV_FORK_SAFE or IBV_FORK_SAFE environment variable is set.
2. Check the ABI version by looking at the sysfs file
/sys/class/infiniband_verbs/abi_version
3. Read the userspace InfiniBand RDMA device driver configuration files under
/etc/libibverbs.d/
Note that the /etc prefix is configurable when one runs the configure script. Each configuration file usually contains only one line, e.g. driver mlx4. In this case, mlx4 is the deviceName, which will be used later.
4. Get the ABI version info of the installed InfiniBand RDMA devices by reading the following sysfs files
/sys/class/infiniband_verbs/uverbs*/ibdev
/sys/class/infiniband_verbs/uverbs*/abi_version
5. For each deviceName obtained in Step 3, try to initialize its corresponding driver shared library. The try_driver function will call ibv_driver.init_func if it is not NULL; if the driver has not been loaded yet, ibv_driver.init_func is NULL. try_driver will then examine the node type by looking at the file
/sys/class/infiniband/driverName/node_type
in which CA means InfiniBand Channel Adapter and RNIC means iWARP adapter.
6. Load the device driver shared library (the load_drivers function). When a device driver shared library is loaded for the first time, a function inside it (usually named deviceName_register_driver) is invoked automatically, because it carries the __attribute__((constructor)) attribute.
7. deviceName_register_driver of the device driver shared library calls ibv_register_driver of libibverbs to register itself. In particular, ibv_driver.init_func is set to point to deviceName_driver_init.
8. Execute Step 5 again. This time at least one device should have a valid ibv_driver.init_func. What ibv_driver.init_func does is device-driver dependent. For example, MLX4's mlx4_driver_init reads info from
/sys/class/infiniband_verbs/uverbs*/device/vendor
/sys/class/infiniband_verbs/uverbs*/device/device
9. Open the InfiniBand verbs device at
/dev/infiniband/uverb*
for read/write access.
10. Create a corresponding ibv_context struct by calling the ibv_driver.ops.alloc_context function. The ibv_driver.ops struct is populated when the device driver shared library is loaded. In the MLX4 device driver's source file mlx4.c one can find the following code snippet:
static struct ibv_device_ops mlx4_dev_ops = {
    .alloc_context = mlx4_alloc_context,
    .free_context  = mlx4_free_context
};
and the following assignment in the mlx4_alloc_context function:
context->ibv_ctx.ops = mlx4_ctx_ops;
11. Return the ibv_context struct just created.
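From the application's point of view, the whole sequence above is hidden behind a handful of calls. The following minimal sketch, roughly what ibv_devinfo does, enumerates the devices, opens each one (which triggers the driver's alloc_context), and queries it; error handling is trimmed.

/* Enumerate and query verbs devices; the application-side view of the
 * library internals described above. */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num = 0, i;
    struct ibv_device **list = ibv_get_device_list(&num);   /* __ibv_get_device_list */

    for (i = 0; i < num; i++) {
        struct ibv_context *ctx = ibv_open_device(list[i]);  /* driver's alloc_context */
        struct ibv_device_attr attr;

        if (!ctx)
            continue;
        if (!ibv_query_device(ctx, &attr))                    /* dispatches via ctx->ops */
            printf("%s: %d port(s), max_qp=%d\n",
                   ibv_get_device_name(list[i]),
                   attr.phys_port_cnt, attr.max_qp);
        ibv_close_device(ctx);
    }
    ibv_free_device_list(list);
    return 0;
}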
Once the ibv_context is created, every Verbs call simply dispatches through its ops table. For example, the ibv_post_send Verbs API is defined as
static inline int ibv_post_send(struct ibv_qp *qp, struct ibv_send_wr *wr,
                                struct ibv_send_wr **bad_wr)
{
    return qp->context->ops.post_send(qp, wr, bad_wr);
}
One can also grep for context->ops. in verbs.c of libibverbs to see how these function pointers are used.
So how is ibv_context.ops populated? In MLX4's case, ibv_context.ops is populated by simply copying mlx4_ctx_ops, which is fixed at compile time. In the source file mlx4.c one can find the following code snippet:
static struct ibv_context_ops mlx4_ctx_ops = {
    .query_device = mlx4_query_device,
    .query_port   = mlx4_query_port,
    ...
    .attach_mcast = mlx4_attach_mcast,
    .detach_mcast = mlx4_detach_mcast
};
Documentation of the InfiniBand Verbs API and the RDMACM API can be found in the RDMA Aware Networks Programming User Manual from Mellanox.