Notes on MPI (Message Passing Interface)

MPI compiler wrappers

mpicc

mpigcc
mpiicc

Script to compile and link MPI programs in C.

mpigcc and mpiicc are Intel MPI specific (they let you choose between the GNU and Intel compilers).

mpicxx
mpiCC

mpigxx
mpiicpc

Script to compile and link MPI programs in C++.

mpigxx and mpiicpc are Intel MPI specific (they let you choose between the GNU and Intel compilers).

mpif77
mpif90

mpiifort

Script to compile and link MPI programs in Fortran.

mpiifort is Intel MPI specific.

mpichversion: [MPICH] Display the MPICH version.
mpiname: [MVAPICH] Display MPI & compiler information.

Use the -a option to print all information.

mpitune: [IntelMPI] Automated tuning utility. It is used during Intel MPI library installation or after a cluster configuration change.
mpivars.sh
mpivars.csh
[IntelMPI] Set up Intel MPI related environmental variables (e.g. PATH, LD_LIBRARY_PATH).
logconverter: [MPICH] Convert between different MPE log formats (if the program is compiled with the -mpilog option; see below). Some of the logs are for Jumpshot visualization.

See here for more information.

parkill: [MVAPICH] Kill all processes of a program (parallel kill).

MPI compiler wrapper specific switches

-mpilog
-mpitrace
-mpianim
[MPICH] Compile the program so that it will generate logs/traces. Only one of these switches can be used at a time.

-mpilog will generate MPE log. It uses -llmpe -lmpe linking options.
-mpitrace will generate MPE trace. It uses -ltmpe -lmpe linking options.
-mpianim will generate MPE real-time animation. It uses -lampe -lmpe linking options.

-t
-t=log
-dynamic_log
[IntelMPI] Compile the program so that it will generate logs/traces. The resulting executable is linked against the Intel Trace Collector library.

The -t=log option will also trace the internals of the Intel MPI library.

-mpe=<option>: [MVAPICH] Compile the program so that it will generate logs/traces. <option> can be mpilog, mpitrace, mpianim, mpicheck, etc.

Use -mpe=help to see all available options and which libraries each one links against.

-lpmpich
-lpmpich++
Link with the MPI profiling interface.
-cc=pgm
-CC=pgm
-fc=pgm
-f90=pgm
Use pgm to compile and link MPI programs.

This is the same as setting the environmental variables CC, CLINKER, CCC, CCLINKER, F77, F77LINKER, F90, or F90LINKER.

-compile-info
-link-info
Show the compiler and command-line options which will be used in compiling/linking MPI programs.
-echo: [MPICH, IntelMPI] Show exactly what the script is doing.
Basically it runs the mpicc script with the shell's -x option.
-fast: [IntelMPI] Maximize speed across the entire program.
-mt_mpi: [IntelMPI] Use the thread-safe version of the Intel MPI Library.
-shlib
-noshlib
[MPICH] Whether to use shared libraries.

This is the same as setting the environmental variable MPICH_USE_SHLIB to yes.

-static_mpi
-static
[IntelMPI] Use static linking with Intel MPI library.
-show: [MPICH, IntelMPI, OpenMPI] Show commands that would be used without running them.
-showme:command
-showme:compile
-showme:link
[OpenMPI] Show commands and compiler flags that would be used for compiling and linking.

Running MPI programs

MPICH 1, Open MPI

The old-school mpirun:
mpirun -v -machinefile <machine file name> -np <#processes> ./a.out
Open MPI's mpirun (and mpiexec) is a symbolic link to its own job launcher, orterun, so it accepts additional options such as -npernode (specify the number of processes per node), -x ENV1 (pass the environmental variable ENV1 to the MPI job), and -bind-to-core (CPU-process pinning/affinity). In addition, Open MPI can obtain the host names of nodes from many batch job schedulers (PBS, LSF, etc.), so the -machinefile and -np flags are generally not needed.

MPICH 2, MVAPICH 2

The mpd (multi-purpose daemon) approach:
mpdallexit
mpdboot -v -n <#mpd's> -f <machine file name> -r <ssh or rsh>
mpiexec -np <#processes> ./a.out
mpdallexit
where <#mpd's> is usually the number of unique host names in <machine file name>, e.g. `sort machinefile|uniq|wc -l`

In many cases one might add -genvall flag to mpiexec ... command line to pass all environmental variables in the current environment to the MPI job, or use -genvlist=ENV1,ENV2,.. to pass only environmental variables ENV1, ENV2, ... to the MPI job. These options are especially useful when the MPI executable needs environmental variables such as LD_LIBRARY_PATH to function properly.

Having the ability to pass environmental variables to the MPI job is one of the major differences between the old-school mpirun and the newer mpiexec.

Intel MPI

Intel MPI is based on MPICH 2 & mpd:
source $INTEL_MPI/bin/mpivars.sh
mpdallexit
mpdboot -v -n <#mpd's> -f <machine file name> -r <ssh or rsh>
mpiexec -ppn <#processes per machine> -np <#processes> ./a.out
mpdallexit

Intel MPI can automatically choose the appropriate fabric to use. By default, it uses DAPL and will fall back to TCP if DAPL is not available. However, in some cases one might want to override this setting. To force Intel MPI to use InfiniBand only, add -IB in front of the -n option; to use Myrinet MX only, use -MX.

If -MX is used, then Intel MPI will use TMI (Tag Matching Interface) as opposed to DAPL. Intel MPI's libtmi.so will then seek configuration info from /etc/tmi.conf or from whatever the environmental variable TMI_CONFIG (see below) points to.

Troubleshooting mpdboot

  • Make sure ~/.mpd.conf exists and the first line of ~/.mpd.conf is MPD_SECRETWORD=<secret word of your choice>.
  • Use -d and -v command-line options to enable debugging information and verbosity mode.
  • Make sure /tmp is not full on any machine listed in <machine file name>. mpdboot will create temporary files or UNIX domain sockets such as /tmp/mpd2.*<username> under /tmp.

    Alternatively, use --tmpdir to specify a different location when running mpdboot.

    For Intel MPI, use the environmental variable I_MPI_MPD_TMPDIR

  • Similarly, make sure /var/run/ is not full. mpdboot will create a temporary file containing the PID of mpd under this directory.

    Alternatively, use --pid to specify a different location when running mpdboot.

  • The Python version on every machine in <machine file name> must match the one required by the mpdboot script.
  • Choose the correct remote shell: ssh or rsh. For ssh, make sure ~/.ssh/authorized_keys is correct and has the correct read/write permissions. For rsh, make sure ~/.rhosts is correct and has the correct read/write permissions.

mpirun.ch_mx command-line options

mpirun.ch_mx accepts Myrinet-specific command-line options:

--mx-no-shmem: Disable the shared memory based intra-node communication.
--mx-recv <mode>: Set <mode> to blocking so that processes wait for incoming messages without polling.

MPICH2 mpiexec environmental variables

Some of mpiexec's command-line options can also be specified via environmental variables:

MPICH_DBG: Display debugging information.

To use this feature, MPICH2 must be configured and compiled with the --enable-g option.

MPICH_DBG_LEVEL: Debugging verbosity level; can be TERSE, TYPICAL, or VERBOSE.
MPICH_NO_LOCAL: Set to 1 to not use shared memory for intra-node communication.
MPICHD_DBG_PG: Set to YES to enable MPD to display debugging information.
MPIEXEC_BNR: Set to 1 to use the MPICH1 legacy BNR interface to spawn and manage processes.

Trivia: BNR stands for Bill, Brian, Nick, Ralph, and Rusty, who were the initial MPICH2 developers of the interface.

MPIEXEC_GDB: Set to 1 to run the MPI job in GDB.
MPIEXEC_MACHINEFILE: Set to the file name containing the list of hosts.
MPIEXEC_MERGE_OUTPUT: Set to 1 to merge output when running the MPI job in GDB.
MPIEXEC_RSH: The remote shell command to use; default is ssh -x.
MPIEXEC_SHOW_LINE_LABELS: Set to 1 to show the MPI rank in each line of the output.
MPIEXEC_TIMEOUT: Job timeout (in seconds).
MPIEXEC_TOTALVIEW: Set to 1 to run the MPI job in TotalView.

MVAPICH2 mpiexec environmental variables (version 1.5)

Some of mpiexec's command-line options can also be specified via environmental variables:

MV2_CPU_MAPPING: Provide more fine-grained control of CPU-process pinning/affinity.

Note that neither the MV2_ENABLE_AFFINITY nor the MV2_USE_SHARED_MEM environmental variable should be set to 0 if MV2_CPU_MAPPING is used.

MV2_CPU_BINDING_POLICY: Provide fine-grained control of CPU-process pinning/affinity.

Its value is either bunch (default) or scatter.

MV2_DAPL_PROVIDER: If your MVAPICH2 is built with uDAPL-CH3, then this variable specifies the uDAPL-CH3 library to use. It can take the following values:
  • ib0
  • ccil
  • gmg2
  • ibd0
MV2_ENABLE_AFFINITY: Set to 0 to disable CPU-process pinning/affinity. Default is 1.
MV2_USE_XRC: Set to 1 to enable eXtended Reliable Connection (XRC), a new transport layer in Mellanox HCAs that improves scalability in multi-core environments. It allows a single receive QP to be shared by multiple SRQs across one or more processes.

(Your MVAPICH2 must be built with XRC to use this feature)

Intel MPI environmental variables

(See here for special features and known issues)

I_MPI_ADJUST_<coll>: Select the algorithm for the collective operation <coll> and also the message size ranges over which that algorithm takes effect.
I_MPI_PLATFORM: Select the set of algorithms and their parameters for collective operations. This is an undocumented environmental variable added in version 4.0 and it could change again in the future. (In 4.1 it is documented.) To see its effect, set the I_MPI_DEBUG environmental variable to 6 and run the MPI program (if you recompile your MPI program with -g, you can see even more verbose output); one can then see something like this:
[0] MPI startup(): Recognition level=0. Platform code=1. Device=5
and the algorithms chosen for collective operations (check the Intel MPI reference manual for their meanings):
 ....
[0] MPI startup(): Allreduce: 3: 2097152-2147483647 & 512-2147483647
[0] MPI startup(): Allreduce: 4: 524288-2147483647 & 32-2147483647
[0] MPI startup(): Allreduce: 5: 0-2147483647 & 0-2147483647
 ....
    
For version 4.0.0, the value of this variable can be
  • auto: Recognition level=2.
  • trust: Recognition level=1, Platform code=1
  • generic: Recognition level=0, Platform code=1
  • zero: Recognition level=0, Platform code=-1 (What this really means is that regardless of message sizes or MPI ranks, the same baseline algorithm is used.)
  • htn: (Harpertown) The same as generic
  • nhm: (Nehalem) Recognition level=0, Platform code=3
For version 4.0.1, the value of this variable can be
  • auto: Recognition level=2. Use this value when running on heterogeneous nodes, e.g. Westmere and Harpertown nodes, as suggested here.
  • generic: Recognition level=0, Platform code=1
  • zero: Recognition level=0, Platform code=-1
  • harpertown: The same as generic
  • nehalem: Recognition level=0, Platform code=4
  • westmere: Recognition level=0, Platform code=2
For version 4.0.3, the value of this variable can be
  • generic
  • harpertown
  • hetero
  • homo
  • htn (Harpertown)
  • nehalem
  • nhm (Nehalem)
  • snb (Sandy Bridge)
  • westmere
  • wsm (Westmere)
For version 4.1.0, the value of this variable can be
  • auto: (Default)
  • auto:max: For the newest supported Intel architecture processors across all nodes.
  • auto:most: For the most numerous Intel architecture processors across all nodes.
  • generic: Select the algorithm optimized for Harpertown.
  • harpertown
  • hetero
  • homo
  • hsw (Haswell)
  • htn (Harpertown)
  • ivb (Ivy Bridge)
  • knc (Knight's Corner)
  • nehalem
  • nhm (Nehalem)
  • none: Like zero above.
  • sandybridge
  • snb (Sandy Bridge)
  • uniform: Select algorithm optimized for local processors. The behavior is unpredictable if nodes are heterogeneous.
  • westmere
  • wsm (Westmere)
I_MPI_DAPL_PROVIDER: Select the particular DAPL provider.

See here for details.

By default, Intel MPI will try to load DAPL info from

/etc/dat.conf
/etc/ofed/compat-dapl/dat.conf
I_MPI_DEBUG: Set to 6 to display most debugging information; set to 9 to display the values of many undocumented environmental variables (those with the I_MPI_INFO_ prefix).

Compiling the code with the -g compiler option enables even more detailed debugging information (because the Intel MPI compiler wrapper links the user program against libmpi_dbg.so instead of libmpi.so).

I_MPI_PRINT_ENV: (Version 4.0.1) Set to 1 to display the values of Intel MPI environmental variables.
I_MPI_DEVICE: Select the particular network to be used, e.g. shm, rdma, rdssm, sock, ssm.
I_MPI_DYNAMIC_CONNECTION: Set to 1 so that all connections are established at the time of the first communication between each pair of processes.

Default is 1 if the number of processes is greater than 64, and 0 otherwise.

I_MPI_EAGER_THRESHOLD
I_MPI_INTRANODE_EAGER_THRESHOLD
Set to the message size in bytes that serves as the threshold between the eager and the rendezvous protocols.

Default is 262144 (256 KB)

I_MPI_EXTRA_FILESYSTEM
I_MPI_EXTRA_FILESYSTEM_LIST
Select the parallel file system to use in MPI-IO. Set I_MPI_EXTRA_FILESYSTEM to on and I_MPI_EXTRA_FILESYSTEM_LIST to one of the following values:
  • panfs (for Panasas)
  • pvfs2 (for PVFS2)
  • lustre (for Lustre)
I_MPI_FALLBACK: Set to 0 to not fall back to the first available network; i.e. if an attempt to initialize the specified network fails, the job terminates immediately. Default is 1.
I_MPI_FABRICS: Select the particular network fabrics to be used, e.g. shm, dapl, tcp, tmi (Tag Matching Interface, e.g. QLogic PSM and Myricom MX), ofa.
I_MPI_PIN: Set to 0 to disable CPU-process pinning/affinity. Default is 1.
I_MPI_PIN_DOMAIN
I_MPI_PIN_PROCS
Provide more fine-grained control of CPU-process pinning/affinity and interoperability with OpenMP.
I_MPI_SCALABLE_OPTIMIZATION: Set to 0 to disable optimization.

Default is 1 if the number of processes is greater than 16, and 0 otherwise.

I_MPI_SHM_BYPASS: Set to 1 to disable the shared memory based intra-node communication.

Default is 0.

I_MPI_SPIN_COUNT: Set the loop spin count used when polling the network.

Default value is equal to 1 when more than one process runs per processor/core. Otherwise the value equals 250.

The loop for polling the network spins this many times before freeing the processes if no incoming messages are received for processing. Smaller values cause the MPI library to release the processor more frequently.

I_MPI_STATS
I_MPI_STATS_SCOPE
I_MPI_STATS_BUCKETS
I_MPI_STATS_FILE
Use the built-in statistics gathering facility to collect performance data.
I_MPI_TIMER_KIND: Select the timer routine for MPI_Wtime and MPI_Wtick calls, e.g. gettimeofday (default) or rdtsc.
I_MPI_WAIT_MODE: Set to 1 so that processes wait for incoming messages without polling.

Default is 0 (i.e. use polling).

Caveat: Unless one has a good reason, do not change the default setting, especially on Myrinet MX, or MPI collective operations could get stuck. The fix is to use TMI as opposed to the (default) DAPL. See the TMI_CONFIG environmental variable below.

TMI_CONFIG: If one selects the Tag Matching Interface (e.g. by specifying the -MX or -PSM switch on the mpiexec command line), then by default Intel MPI seeks TMI configuration info from
/etc/tmi.conf
and this environmental variable can be set to point to the TMI config file one wishes to use.

One can find an example TMI config file at $I_MPI_ROOT/etc64/tmi.conf, which can be used without modification in most cases. See here for details.

TMI_DEBUG: Set to a positive integer to display TMI debugging information.

Open MPI environmental variables

Open MPI has many MCA (Modular Component Architecture) parameters, which can be set in the following ways; each approach overrides the ones listed after it:
  • mpirun command-line:
    mpirun --mca param1 "value1"
  • Environmental variable:
    export OMPI_MCA_param1="value1"
    
  • MCA parameter files ($HOME/.openmpi/mca-params.conf or $OMPI_HOME/etc/openmpi-mca-params.conf), which contain lines of the form
    param1 = value1
    

For example, to use the Checksum component in PML (Point-To-Point Management Layer) framework, use:

mpirun --mca pml csum ...

To see the default MCA parameters:

ompi_info --param all all

Some useful environmental variables:

OPAL_PREFIX: Specify the Open MPI installation root directory.
OMPI_MCA_mca_verbose: Set to 1 for verbose mode.
OMPI_MCA_mpi_show_mca_params: Set to 1 to show all MCA parameter values.
OMPI_MCA_mpi_yield_when_idle: Set to any nonzero value for Degraded performance mode, or 0 for Aggressive performance mode.
OMPI_MCA_mpi_paffinity_alone: Set to any nonzero value to enable CPU-process pinning/affinity.
OMPI_MCA_mpi_leave_pinned: Set to 1 to enable aggressive pre-registration of user message buffers for RDMA-enabled network fabrics.

See here for details.

OMPI_MCA_mpi_leave_pinned_pipeline: Set to 1 to enable pre-registration of user message buffers for the RDMA pipeline protocol.

See the article High performance RDMA protocols in HPC by T.S. Woodall et al for details.

OMPI_MCA_btl: Open MPI can detect the network fabric and choose the best communication protocols for it. If you want to override this, you can set OMPI_MCA_btl to the following values. You can specify multiple values (separated by commas), e.g. export OMPI_MCA_btl='gm,sm,self'
  • gm: Myrinet GM
  • mx: Myrinet Express (MX)
  • ofud: OpenFabrics unreliable datagrams
  • openib: OpenFabrics InfiniBand. See the FAQ for details.
  • portals: Cray Portals
  • self: Loop-back communication. This one should always be specified, or your Open MPI can behave erratically.
  • sm: Shared memory. See the FAQ for details.
  • tcp: TCP/IP. See the FAQ for details.
  • udapl: User level DAPL. To use this, make sure you have /etc/dat.conf ready. See the FAQ for details.

SGI MPI environmental variables

MPI_DISPLAY_SETTINGS: Set to any value to display the values of SGI MPI environmental variables.
MPI_SLAVE_DEBUG_ATTACH: Set to the MPI rank of the process to which you want to attach your debugger. When this environmental variable is set, the MPI job will wait 20 seconds for you to launch the debugger. You will see something like this:
MPI rank 1 sleeping for 20 seconds while you attach the debugger.
You can use this debugger command:
     gdb /proc/30766/exe 30766
or
     idb -pid 30766 /proc/30766/exe
MPI_DAEMON_DEBUG_ATTACH: Similar to the MPI_SLAVE_DEBUG_ATTACH environmental variable, except the point of attachment is earlier. If MPI_SLAVE_DEBUG_ATTACH is set, the attachment occurs at:
#0  0x7f8b794fe060 in __nanosleep_nocancel () from /lib64/libc.so.6
#1  0x7f8b794fde9c in sleep () from /lib64/libc.so.6
#2  0x7f8b79c663df in slave_init () at adi.c:178
#3  fork_slaves () at adi.c:295
#4  MPI_SGI_create_slaves () at adi.c:419
#5  MPI_SGI_init () at adi.c:644
#6  0x7f8b7a165df8 in call_init () from /lib64/ld-linux-x86-64.so.2
On the other hand, for MPI_DAEMON_DEBUG_ATTACH, the attachment point occurs at an earlier moment:
#0  0x7fb97e8a1060 in __nanosleep_nocancel () from /lib64/libc.so.6
#1  0x7fb97e8a0e9c in sleep () from /lib64/libc.so.6
#2  0x7fb97f008a16 in MPI_SGI_init () at adi.c:450
#3  0x7fb97f508df8 in call_init () from /lib64/ld-linux-x86-64.so.2

MPI point-to-point communications

  • Do not confuse eager/rendezvous protocols with MPI_Rsend/MPI_Ssend

    Eager/rendezvous protocols are just implementation strategies for managing internal buffers. Eager means the sender will transfer the data without the receiver's acknowledgement. The rendezvous protocol is the opposite: a request-acknowledgement handshake must occur before data can be transferred.

    Thus, eager/rendezvous protocols have nothing to do with whether there is any posted matching receive or not. For example, even without a posted matching receive, an MPI implementation can still use Eager protocol for small messages, and the receiver would just buffer them (internally).

  • Do not confuse blocking with synchronous.

    Blocking means the MPI call does not return until the operation is locally complete, e.g. the send buffer is safe to reuse; non-blocking calls return immediately.

    Synchronous has a special meaning in MPI. It is a communication mode, meaning a send will not complete until a matching receive is posted.

Thus, there are three levels of perspectives:

User program: Blocking? Deadlock? Buffer reuse?
MPI protocol: Matching receive posted (synchronous)? Matching receive must be posted first? Buffer reuse?
MPI implementation: Eager or rendezvous?

Send modes

Mode | Blocked until? | Synchronous? | Buffer reuse? | Matching receive must be posted first? | Analysis
MPI_Send | Buffer is safe to reuse (which does not mean the message is received or a matching receive is posted) | N/A | N/A | No | Strikes a balance between MPI_Ssend and MPI_Rsend.
MPI_Ssend | A matching receive is posted (S = synchronous) | Yes | No | No | Greatest overhead, but safe.
MPI_Bsend | Data is copied to the buffer attached with MPI_Buffer_attach (B = buffered) | N/A | Yes (must call MPI_Buffer_attach first) | No | Decouples sender and receiver.
MPI_Rsend | Data transfer completes | Yes | N/A | Yes (R = ready) | Least overhead, but erroneous if the matching receive has not been posted yet.
MPI_Isend | Never blocks (I = immediate) | N/A | No | No |
MPI_Issend | Never blocks | Yes | No | No |
MPI_Ibsend | Never blocks | N/A | Not until tested by MPI_Wait/MPI_Test | No |
MPI_Irsend | Never blocks | Yes | No | Yes |
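
For instance, the buffered mode requires attaching a user buffer before calling MPI_Bsend. A minimal sketch (assuming two processes; the buffer size, tag, and variable names are arbitrary choices for illustration, not from the original notes):
/* Illustrative sketch: buffered-mode send. The sender attaches its own
 * buffer first, so MPI_Bsend returns as soon as the data is copied there. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, data = 42;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        int bufsize = sizeof(int) + MPI_BSEND_OVERHEAD;
        void *buf = malloc(bufsize);
        MPI_Buffer_attach(buf, bufsize);       /* must precede MPI_Bsend */
        MPI_Bsend(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        MPI_Buffer_detach(&buf, &bufsize);     /* blocks until buffered data is sent */
        free(buf);
    } else if (rank == 1) {
        MPI_Recv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}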

The receiver side is simple: There are only four APIs:

MPI_Recv
MPI_Probe
MPI_Irecv
MPI_Iprobe
The purpose of MPI_Irecv is really to post a matching receive.

MPI_Probe can check if there is any receivable message that matches the specified source, tag, and comm.
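
For example, MPI_Probe combined with MPI_Get_count lets a receiver size its buffer before posting the actual receive. A minimal sketch (assuming two processes; the tag and array length are arbitrary choices for illustration):
/* Illustrative sketch: rank 0 sends an array whose length rank 1 does not
 * know in advance; rank 1 probes first, queries the size, allocates, then
 * receives. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        int data[5] = {1, 2, 3, 4, 5};
        MPI_Send(data, 5, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Status status;
        int count;
        MPI_Probe(0, 0, MPI_COMM_WORLD, &status);   /* a matching message is now receivable */
        MPI_Get_count(&status, MPI_INT, &count);    /* how many MPI_INTs arrived? */
        int *buf = malloc(count * sizeof(int));
        MPI_Recv(buf, count, MPI_INT, status.MPI_SOURCE, status.MPI_TAG,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("received %d ints\n", count);
        free(buf);
    }

    MPI_Finalize();
    return 0;
}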

Test for communication completion

Each of the MPI_I* APIs returns an MPI_Request handle. Both the sender and the receiver can then use the following APIs on this handle to test for communication completion:
MPI_Test       MPI_Wait
MPI_Testsome   MPI_Waitsome
MPI_Testall    MPI_Waitall
MPI_Testany    MPI_Waitany
MPI_Test* are non-blocking while MPI_Wait* are blocking.

MPI_*some falls between the two extremes of MPI_{Test|Wait} (a single request) and MPI_*all (all requests).

Caveat: For the sender, completion of a send operation only means the buffer is ready to reuse; it does not mean the data have been received.
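
A minimal sketch of the non-blocking pattern (the ring exchange, tag, and variable names are arbitrary choices for illustration): each rank posts MPI_Irecv and MPI_Isend, then calls MPI_Waitall on both requests.
/* Illustrative sketch: non-blocking ring exchange completed with MPI_Waitall. */
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, sendval, recvval;
    MPI_Request reqs[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int peer = (rank + 1) % size;   /* right-hand ring neighbour */
    sendval = rank;

    MPI_Irecv(&recvval, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(&sendval, 1, MPI_INT, peer, 0, MPI_COMM_WORLD, &reqs[1]);
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    /* Completion of the send only means sendval can be reused,
     * not that the peer has received the data. */

    MPI_Finalize();
    return 0;
}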

Persistent communication mode

One can set up point-to-point communications once as persistent requests and then initiate the transfers repeatedly, individually with MPI_Start or as a batch in one shot with MPI_Startall. The relevant APIs are:
MPI_Send_init  MPI_Bsend_init MPI_Rsend_init MPI_Ssend_init
MPI_Recv_init
MPI_Start      MPI_Startall
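
A minimal sketch of persistent requests (the ring pattern, tag, and iteration count are arbitrary choices for illustration): the requests are created once, restarted every iteration with MPI_Startall, and completed with MPI_Waitall.
/* Illustrative sketch: persistent send/receive requests reused in a loop. */
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, sendbuf, recvbuf;
    MPI_Request reqs[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int peer = (rank + 1) % size;

    MPI_Recv_init(&recvbuf, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Send_init(&sendbuf, 1, MPI_INT, peer, 0, MPI_COMM_WORLD, &reqs[1]);

    for (int iter = 0; iter < 10; iter++) {
        sendbuf = rank + iter;
        MPI_Startall(2, reqs);                     /* initiate both transfers */
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE); /* requests remain allocated */
    }

    MPI_Request_free(&reqs[0]);
    MPI_Request_free(&reqs[1]);
    MPI_Finalize();
    return 0;
}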

Dead lock scenarios and avoidance

           | Process 1        | Process 2
Deadlock 1 | Recv/Send        | Recv/Send
Deadlock 2 | Send/Recv        | Send/Recv
Solution 1 | Send/Recv        | Recv/Send
Solution 2 | Sendrecv         | Sendrecv
Solution 3 | Irecv/Send, Wait | Irecv/Send, Wait
Solution 4 | Bsend/Recv       | Bsend/Recv
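
As a sketch of Solution 2 (the ring pattern and tag are arbitrary choices for illustration), MPI_Sendrecv performs the send and the receive in one call, so neither process can block the other:
/* Illustrative sketch: deadlock-free neighbour exchange with MPI_Sendrecv. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, sendval, recvval;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int right = (rank + 1) % size;
    int left  = (rank - 1 + size) % size;
    sendval = rank;

    /* Each rank sends to its right neighbour and receives from its left one. */
    MPI_Sendrecv(&sendval, 1, MPI_INT, right, 0,
                 &recvval, 1, MPI_INT, left, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    printf("rank %d received %d\n", rank, recvval);
    MPI_Finalize();
    return 0;
}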

MPI-2 one-sided point-to-point communications

The above point-to-point communications are two-sided, i.e. they require matching operations on both sender and receiver sides. MPI-2 introduces one-sided point-to-point communications to facilitate remote memory accesses (RMA), which are essential for partitioned global address space languages. Unfortunately, MPI-2's RMA semantics are unintuitive and difficult to use.

Initialization

First, each process needs to designate a memory region (called a window in MPI-2 parlance) for RMA. This is achieved by calling a collective function:
MPI_Win_create
This function associates a memory region with a communicator. The sizes of the memory regions can differ from process to process, but it is important for all participating processes to call it, even if a process only offers memory for others to access and never accesses other processes' memory.

The "destructor" of MPI_Win_create is

MPI_Win_free
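
A minimal sketch of window creation and destruction (the window size and element type are arbitrary choices for illustration):
/* Illustrative sketch: every rank exposes a small array as an RMA window,
 * then frees it. Both calls are collective over the communicator. */
#include <mpi.h>

int main(int argc, char **argv)
{
    int buf[16] = {0};
    MPI_Win win;

    MPI_Init(&argc, &argv);

    /* All ranks must call this, even those that only expose memory
     * and never access remote windows. */
    MPI_Win_create(buf, 16 * sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    /* ... RMA epochs go here (see the synchronization section below) ... */

    MPI_Win_free(&win);   /* the collective "destructor" */
    MPI_Finalize();
    return 0;
}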

Communication

Three RMA operations are supported
MPI_Get
MPI_Put
MPI_Accumulate
MPI_Accumulate is an atomic operation for remote update (Get-Modify-Put). However, the only Modify operations it supports are the predefined MPI_Reduce operations such as MPI_SUM, MPI_PROD, MPI_MAX, MPI_LAND, etc. User-defined reduction operations are not supported.

It is important to note that these operations are non-blocking. It makes sense that MPI_Put and MPI_Accumulate are non-blocking, but to understand why MPI_Get is also non-blocking, read on.

Synchronization

Now comes the most criticized and confusing part of MPI-2 RMA. The inclusion of synchronization in MPI-2 RMA seems designed to benefit MPI implementers rather than MPI users.

As mentioned earlier, all RMA operations are non-blocking, which means that, like the MPI_I* functions, they must be accompanied by synchronization functions (analogous to MPI_Wait). And as in the MPI_I* case, the RMA operations cannot appear just anywhere in the program; they must be surrounded by these synchronization functions. The period between a pair of consecutive synchronization calls is called an epoch in MPI-2 parlance.

Passive target

This synchronization mode is the easiest to understand and the most intuitive (closest to a shared-memory model). Passive means the target process (the one whose memory others will access) does not need to do anything; it makes no call to any synchronization function. Only the originating processes need to synchronize, using the following pair of calls:
MPI_Win_lock(lockType, targetRank, hint, assert, win)
MPI_Win_unlock(targetRank, win)
where lockType can be MPI_LOCK_SHARED or MPI_LOCK_EXCLUSIVE.

assert is usually 0, and it can be set to MPI_MODE_NOCHECK, which means no other processes will hold or attempt to acquire a conflicting lock.

It is important to note that the lock granularity is the entire window.
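
A minimal sketch of a passive-target epoch (assuming at least two processes; the lock type, value, and displacement are arbitrary choices for illustration): rank 0 writes into rank 1's window while rank 1 makes no synchronization call at all.
/* Illustrative sketch: passive-target RMA with lock/unlock. */
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, winbuf = 0, value = 123;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Win_create(&winbuf, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    if (rank == 0) {
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 1, 0, win);  /* open epoch on target 1 */
        MPI_Put(&value, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
        MPI_Win_unlock(1, win);                       /* close epoch; Put now complete */
    }

    MPI_Win_free(&win);   /* collective */
    MPI_Finalize();
    return 0;
}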

Active target, coarse-grained

Active means the target process must also call synchronization functions, even if it does not itself access other processes' windows. In the coarse-grained mode, all involved processes must make the same number of calls to the following function:
MPI_Win_fence(assert, win)
Recall that a window is associated with a communicator, so all processes in that communicator, whether they are target or originating processes, must perform the same number of MPI_Win_fence calls. The epochs are the periods between consecutive MPI_Win_fence calls, and RMA operations within the same epoch are then synchronized (a sketch follows the list of assert values below).

assert is usually 0, but it can be set to

  • MPI_MODE_NOSTORE: No local stores or self-MPI_Put/MPI_Accumulate to local window between this fence and the prior fence.
  • MPI_MODE_NOPUT: No MPI_Put/MPI_Accumulate to local window between this fence and next fence.
  • MPI_MODE_NOPRECEDE: No RMA between this fence and the prior fence.
  • MPI_MODE_NOSUCCEED: No RMA between this fence and the next fence.
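
A minimal sketch of fence synchronization (the use of MPI_Accumulate with MPI_SUM and the single-int window are arbitrary choices for illustration): every rank atomically increments a counter that lives in rank 0's window.
/* Illustrative sketch: coarse-grained active-target synchronization.
 * All ranks call MPI_Win_fence the same number of times; the RMA calls
 * between two fences form one epoch. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, winbuf = 0, one = 1;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    MPI_Win_create(&winbuf, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);   /* open the epoch (every rank) */
    /* Every rank atomically adds 1 to the counter in rank 0's window. */
    MPI_Accumulate(&one, 1, MPI_INT, 0, 0, 1, MPI_INT, MPI_SUM, win);
    MPI_Win_fence(0, win);   /* close the epoch (every rank) */

    if (rank == 0)
        printf("counter = %d (expected %d)\n", winbuf, size);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}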

Active target, fine-grained

This is the most complicated synchronization mode. The target process (only one) calls the following pair of functions to open/close an epoch for the specified originating processes (plural):
MPI_Win_post(group, assert, win)
MPI_Win_wait(win)
where group is an MPI_Group object consisting of ranks of the specified originating processes.
assert can be 0, MPI_MODE_NOCHECK, MPI_MODE_NOSTORE, or MPI_MODE_NOPUT.

The originating processes (plural) then use the following pair to open/close the corresponding epoch:

MPI_Win_start(group, assert, win)
MPI_Win_complete(win)
Within the epoch the originating processes can then perform RMA operations.
(assert can be 0 or MPI_MODE_NOCHECK)
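
A minimal sketch of post/start/complete/wait (assuming ranks 0 and 1 play the origin and target roles; building the peer groups with MPI_Group_incl is one possible approach, chosen for illustration):
/* Illustrative sketch: fine-grained active-target synchronization.
 * Rank 1 exposes its window to rank 0 (post/wait); rank 0 opens an access
 * epoch on rank 1 (start/complete) and performs the RMA inside it. */
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, winbuf = 0, value = 7;
    MPI_Group world_group, peer_group;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_group(MPI_COMM_WORLD, &world_group);

    MPI_Win_create(&winbuf, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    if (rank == 1) {                        /* target */
        int origin_rank = 0;
        MPI_Group_incl(world_group, 1, &origin_rank, &peer_group);
        MPI_Win_post(peer_group, 0, win);   /* expose window to rank 0 */
        MPI_Win_wait(win);                  /* block until rank 0 completes */
        MPI_Group_free(&peer_group);
    } else if (rank == 0) {                 /* origin */
        int target_rank = 1;
        MPI_Group_incl(world_group, 1, &target_rank, &peer_group);
        MPI_Win_start(peer_group, 0, win);  /* open access epoch on rank 1 */
        MPI_Put(&value, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
        MPI_Win_complete(win);              /* close epoch; Put now complete */
        MPI_Group_free(&peer_group);
    }

    MPI_Group_free(&world_group);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}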

Update semantics and erroneous behaviors

The following behaviors are considered erroneous (as defined in the MPI-2 specification) and should be avoided:
  • Conflicting accesses to the same memory location in the same window in an epoch. It does not matter whether the accesses are initiated by local or remote processes.

    For example, a local load and an MPI_Put (from a remote process) are erroneous (regardless of their order) if they access the same memory location.

    The only exception is multiple MPI_Accumulate's (using the same reduction operation) to the same memory location.

  • Local stores and MPI_Accumulate/MPI_Put (from remote processes) to different memory locations in the same window in an epoch.

    The rationale is that on some architectures, explicit coherence-restoring operations may be needed at synchronization points, and the restoring operations can differ depending on whether the memory is updated locally or remotely. Keeping track of the update type for each memory address is prohibitively expensive.