MPI compiler wrappers
mpicc mpigcc mpiicc | Script to compile and link MPI programs in C. mpigcc and mpiicc are Intel MPI specific (they let you choose between the GNU and Intel compilers). |
mpicxx mpiCC mpigxx mpiicpc | Script to compile and link MPI programs in C++. mpigxx and mpiicpc are Intel MPI specific (they let you choose between the GNU and Intel compilers). |
mpif77 mpif90 mpiifort | Script to compile and link MPI programs in Fortran. mpiifort is Intel MPI specific. |
mpichversion | [MPICH] Display MPICH version |
mpiname | [MVAPICH] Display MPI & compiler information. Use the -a option to print all information. |
mpitune | [IntelMPI] Automated tuning utility. It is used during Intel MPI library installation or after a cluster configuration change. |
mpivars.sh mpivars.csh | [IntelMPI] Set up Intel MPI related environmental variables (e.g. PATH, LD_LIBRARY_PATH.) |
logconverter | [MPICH] Convert between different MPE log formats (if the program is compiled with the -mpilog option, see below). Some of the logs are for Jumpshot visualization. See here for more information. |
parkill | [MVAPICH] Kill all processes of a program (parallel kill). |
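A minimal MPI program such as the following sketch (file and program names are illustrative) is a convenient sanity check for the wrappers above, e.g. mpicc -o hello_mpi hello_mpi.c followed by mpirun -np 4 ./hello_mpi:

```c
/* hello_mpi.c -- minimal program for testing the MPI compiler wrappers. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello from rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
```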
MPI compiler wrapper specific switches
-mpilog -mpitrace -mpianim | [MPICH] Compile the program so that it generates logs/traces. Only one of these switches can be used at a time. -mpilog generates MPE logs; it uses the -llmpe -lmpe linking options. |
-t -t=log -dynamic_log | [IntelMPI] Compile the program so that it generates logs/traces. The resulting executable is linked against the Intel Trace Collector library. The -t=log option also traces the internals of the Intel MPI library. |
-mpe=option | [MVAPICH] Compile the program so that it generates logs/traces. option can be mpilog, mpitrace, mpianim, mpicheck, etc. Use -mpe=help to see all available options and which libraries each one links to. |
-lpmpich -lpmpich++ | Link with the MPI profiling interface. |
-cc=pgm -CC=pgm -fc=pgm -f90=pgm | Use pgm to compile and link MPI programs. This is the same as setting the environmental variables CC, CLINKER, CCC, CCLINKER, F77, F77LINKER, F90, F90LINKER. |
-compile-info -link-info | Show the compiler and command-line options which will be used in compiling/linking MPI programs. |
-echo | [MPICH, IntelMPI] Show exactly what the script is doing (essentially the wrapper script is run with the shell -x option). |
-fast | [IntelMPI] Maximize speed across the entire program. |
-mt_mpi | [IntelMPI] Use the thread-safe version of the Intel MPI Library. |
-shlib -noshlib | [MPICH] Whether to use shared libraries. This is the same as setting the environmental variable MPICH_USE_SHLIB to yes. |
-static_mpi -static | [IntelMPI] Use static linking with Intel MPI library. |
-show | [MPICH, IntelMPI, OpenMPI] Show commands that would be used without running them. |
-showme:command -showme:compile -showme:link | [OpenMPI] Show commands and compiler flags that would be used for compiling and linking. |
Running MPI programs
MPICH 1, Open MPI
The old-school mpirun:
mpirun -v -machinefile <machine file name> -np <#processes> ./a.out
Open MPI's mpirun (and mpiexec) is a symbolic link to its own job launcher, orterun, so it accepts other options, such as -npernode (specify the number of processes per node), -x ENV1 (pass the environmental variable ENV1 to the MPI job), and -bind-to-core (CPU-process pinning/affinity). In addition, Open MPI can obtain the host names of nodes from many batch job schedulers (PBS, LSF, etc.), so the -machinefile and -np flags are generally not needed.
MPICH 2, MVAPICH 2
The mpd (multi-purpose daemon) approach:
mpdallexit
mpdboot -v -n <#mpd's> -f <machine file name> -r <ssh or rsh>
mpiexec -np <#processes> ./a.out
mpdallexit
where <#mpd's> is usually the number of unique host names in <machine file name>, e.g. `sort machinefile|uniq|wc -l`.
In many cases one might add the -genvall flag to the mpiexec ... command line to pass all environmental variables in the current environment to the MPI job, or use -genvlist=ENV1,ENV2,... to pass only the environmental variables ENV1, ENV2, ... to the MPI job. These options are especially useful when the MPI executable needs environmental variables such as LD_LIBRARY_PATH to function properly.
Having the ability to pass environmental variables to the MPI job is one of the major differences between the old-school mpirun and the new mpiexec.
Intel MPI
Intel MPI is based on MPICH 2 and mpd:
source $INTEL_MPI/bin/mpivars.sh
mpdallexit
mpdboot -v -n <#mpd's> -f <machine file name> -r <ssh or rsh>
mpiexec -ppn <#processes per machine> -np <#processes> ./a.out
mpdallexit
Intel MPI can automatically choose the appropriate fabric to use. By default, it uses DAPL and falls back to TCP if DAPL is not available. However, in some cases one might want to override this setting. To force Intel MPI to use InfiniBand only, add -IB in front of the -n option; to use Myrinet MX only, use -MX.
If -MX is used, then Intel MPI will use TMI (Tag Matching Interface) as opposed to DAPL. Intel MPI's libtmi.so will then seek configuration info from /etc/tmi.conf or from whatever the environmental variable TMI_CONFIG (see below) points to.
Troubleshooting mpdboot
- Make sure ~/.mpd.conf exists and the first line of ~/.mpd.conf is MPD_SECRETWORD=<secret word of your choice>.
- Use -d and -v command-line options to enable debugging information and verbosity mode.
- Make sure /tmp is not full on every machine in <machine file name>. mpdboot creates temporary files or UNIX domain sockets such as /tmp/mpd2.*<username> under /tmp. Alternatively, use --tmpdir to specify a different location when running mpdboot. For Intel MPI, use the environmental variable I_MPI_MPD_TMPDIR.
- Similarly, make sure /var/run is not full. mpdboot creates a temporary file under this directory containing the PID of mpd. Alternatively, use --pid to specify a different location when running mpdboot.
- The Python version on every machine in <machine file name> must match the version expected by the mpdboot script.
- Choose the correct remote shell: ssh or rsh. For ssh, make sure ~/.ssh/authorized_keys is correct and has correct read/write permissions. For rsh, make sure ~/.rhosts is correct and has correct read/write permissions.
mpirun.ch_mx command-line options
mpirun.ch_mx accepts Myrinet-specific command-line options:
--mx-no-shmem | Disable the shared memory based intra-node communication. |
--mx-recv mode | Set mode to blocking to let processes wait for incoming messages without polling. |
MPICH2 mpiexec environmental variables
Some of mpiexec's command-line options can also be specified via environmental variables:
MPICH_DBG | Display debugging information. To use this feature, MPICH2 must be configured and compiled with --enable-g option. |
MPICH_DBG_LEVEL | Debugging verbosity level; can be TERSE, TYPICAL, VERBOSE |
MPICH_NO_LOCAL | Set to 1 to not use shared memory for intra-node communication |
MPICHD_DBG_PG | Set to YES to enable MPD to display debugging information |
MPIEXEC_BNR | Set to 1 to use the MPICH1 legacy BNR interface to spawn and manage processes. Trivia: BNR stands for Bill, Brian, Nick, Ralph, and Rusty, who were the initial MPICH2 developers of the interface. |
MPIEXEC_GDB | Set to 1 to run the MPI job in GDB |
MPIEXEC_MACHINEFILE | Set to the file name containing the list of hosts |
MPIEXEC_MERGE_OUTPUT | Set to 1 to merge output when running the MPI job in GDB |
MPIEXEC_RSH | rsh command to use; default is ssh -x |
MPIEXEC_SHOW_LINE_LABELS | Set to 1 to show MPI rank in each line of the output |
MPIEXEC_TIMEOUT | Job time out (in seconds) |
MPIEXEC_TOTALVIEW | Set to 1 to run the MPI job in TotalView |
MVAPICH2 mpiexec environmental variables (version 1.5)
Some of mpiexec's command-line options can also be specified via environmental variables:
MV2_CPU_MAPPING | Provide more fine-grained control of CPU-process pinning/affinity. Note that neither MV2_ENABLE_AFFINITY nor MV2_USE_SHARED_MEM should be set to 0 if MV2_CPU_MAPPING is used. |
MV2_CPU_BINDING_POLICY | Provide fine-grained control of CPU-process pinning/affinity. Its value is either bunch (default) or scatter. |
MV2_DAPL_PROVIDER | If your MVAPICH2 is built with uDAPL-CH3, then this variable specifies the uDAPL library to use. It can take the following values: |
MV2_ENABLE_AFFINITY | Set to 0 to disable CPU-process pinning/affinity. Default is 1. |
MV2_USE_XRC | Set to 1 to enable eXtended Reliable Connection (XRC), a transport layer in Mellanox HCAs that improves scalability in multi-core environments. It allows a single receive QP to be shared by multiple SRQs across one or more processes. (Your MVAPICH2 must be built with XRC support to use this feature.) |
Intel MPI environmental variables
(See here for special features and known issues)
I_MPI_ADJUST_coll | Select the algorithm for collective operation coll and also the message size ranges for that algorithm to take effect. |
I_MPI_PLATFORM | Select the set of algorithms and their parameters for collective operations. This is an undocumented environmental variable added in version 4.0 and it could change in the future (in 4.1 it is documented). To see its effect, set the I_MPI_DEBUG environmental variable to 6 and run the MPI program (if you recompile your MPI program with -g, you can see even more verbose output); one can then see something like this:
[0] MPI startup(): Recognition level=0. Platform code=1. Device=5
and the algorithms for collective operations (check the Intel MPI reference manual for their meanings):
....
[0] MPI startup(): Allreduce: 3: 2097152-2147483647 & 512-2147483647
[0] MPI startup(): Allreduce: 4: 524288-2147483647 & 32-2147483647
[0] MPI startup(): Allreduce: 5: 0-2147483647 & 0-2147483647
....
For version 4.0.0, the value of this variable can be one of the following: |
I_MPI_DAPL_PROVIDER | Select the particular DAPL provider. See here for details. By default, Intel MPI will try to load DAPL info from /etc/dat.conf or /etc/ofed/compat-dapl/dat.conf. |
I_MPI_DEBUG | Set to 6 to display most debugging information; set to 9 to display the values of many undocumented environmental variables (those with the I_MPI_INFO_ prefix). Compiling the code with the -g compiler option enables even more detailed debugging information (because the Intel MPI compiler wrapper links the user program against libmpi_dbg.so instead of libmpi.so). |
I_MPI_PRINT_ENV | (Version 4.0.1) Set to 1 to display the values of Intel MPI environmental variables. |
I_MPI_DEVICE | Select the particular network to be used, e.g. shm, rdma, rdssm, sock, ssm |
I_MPI_DYNAMIC_CONNECTION | Set to 1 so that connections are established at the time of the first communication between each pair of processes. Default is 1 if the number of processes is greater than 64, and 0 otherwise. |
I_MPI_EAGER_THRESHOLD I_MPI_INTRANODE_EAGER_THRESHOLD | Set to the number of bytes that is the threshold between the eager and rendezvous protocols. Default is 262144 (256 KB). |
I_MPI_EXTRA_FILESYSTEM I_MPI_EXTRA_FILESYSTEM_LIST | Select the parallel file system to use in MPI-IO. Set I_MPI_EXTRA_FILESYSTEM to on and I_MPI_EXTRA_FILESYSTEM_LIST to one of the following values: |
I_MPI_FALLBACK | Set to 0 to not fall back to the first available network, i.e. if an attempt to initialize the specified network fails, the job terminates immediately. Default is 1. |
I_MPI_FABRICS | Select the particular network fabrics to be used, e.g. shm, dapl, tcp, tmi (Tag Matching Interface, e.g. QLogic PSM and Myricom MX), ofa |
I_MPI_PIN | Set to 0 to disable CPU-process pinning/affinity. Default is 1 |
I_MPI_PIN_DOMAIN I_MPI_PIN_PROCS | Provide more fine-grained control of CPU-process pinning/affinity and interoperability with OpenMP. |
I_MPI_SCALABLE_OPTIMIZATION | Set to 0 to disable optimization. Default is 1 if the number of processes is greater than 16, and 0 otherwise. |
I_MPI_SHM_BYPASS | Set to 1 to disable the shared memory based intra-node communication. Default is 0. |
I_MPI_SPIN_COUNT | Set the loop spin count used when polling the network. Default is 1 when more than one process runs per processor/core, and 250 otherwise. The polling loop spins this many times before releasing the processes if no incoming messages are received for processing. Smaller values cause the MPI library to release the processor more frequently. |
I_MPI_STATS I_MPI_STATS_SCOPE I_MPI_STATS_BUCKETS I_MPI_STATS_FILE | Use built-in statistics gathering facility to collect performance data. |
I_MPI_TIMER_KIND | Select the timer routine for MPI_Wtime and MPI_Wtick calls, e.g. gettimeofday (default) or rdtsc |
I_MPI_WAIT_MODE | Set to 1 to let processes wait for incoming messages without polling. Default is 0 (i.e. use polling). Caveat: unless one has a good reason, do not change the default setting, especially on Myrinet MX, or MPI collective operations could get stuck. The fix is to use TMI as opposed to the (default) DAPL. See the TMI_CONFIG environmental variable below. |
TMI_CONFIG | If one selects the Tag Matching Interface (e.g. by specifying the -MX or -PSM switch on the mpiexec command line), then by default Intel MPI reads the TMI configuration from /etc/tmi.conf; this environmental variable can be set to point to the TMI config file one wishes to use instead. An example TMI config file can be found at $I_MPI_ROOT/etc64/tmi.conf, which can be used without modification in most cases. See here for details. |
TMI_DEBUG | Set to a positive integer to display TMI debugging information. |
Open MPI environmental variables
Open MPI has many MCA (Modular Component Architecture) parameters, which can be set in the following ways, and each approach overrides the later ones:
- mpirun command line:
mpirun --mca param1 "value1"
- Environmental variable:
export OMPI_MCA_param1="value1"
- MCA parameter files ($HOME/.openmpi/mca-params.conf or $OMPI_HOME/etc/openmpi-mca-params.conf) containing a line of the form
param1 = value1
For example, to use the Checksum component in PML (Point-To-Point Management Layer) framework, use:
mpirun --mca pml csum ...
To see the default MCA parameters:
ompi_info --param all all
Some useful environmental variables:
OPAL_PREFIX | Specify the Open MPI installation root directory. |
OMPI_MCA_mca_verbose | Set to 1 for verbose mode. |
OMPI_MCA_mpi_show_mca_params | Set to 1 to show all MCA parameter values. |
OMPI_MCA_mpi_yield_when_idle | Set to any nonzero value for Degraded performance mode, or 0 for Aggressive performance mode. |
OMPI_MCA_mpi_paffinity_alone | Set to any nonzero value to enable CPU-process pinning/affinity. |
OMPI_MCA_mpi_leave_pinned | Set to 1 to enable aggressive pre-registration of user message buffers for RDMA-enabled network fabrics. See here for details. |
OMPI_MCA_mpi_leave_pinned_pipeline | Set to 1 to enable pre-registration of user message buffers for the RDMA pipeline protocol. See the article "High performance RDMA protocols in HPC" by T.S. Woodall et al. for details. |
OMPI_MCA_btl | Open MPI can detect the network fabric and choose the best communication protocols for it. If you want to override this, you can set OMPI_MCA_btl to the following values. You can specify multiple values (separated by commas), e.g. export OMPI_MCA_btl='gm,sm,self' |
SGI MPI environmental variables
MPI_DISPLAY_SETTINGS | Set to any value to display the values of SGI MPI environmental variables. |
MPI_SLAVE_DEBUG_ATTACH | Set to the MPI rank of the process to which you want to attach your debugger. When this environmental variable is set, the MPI job will wait for 20 seconds for you to launch the debugger. You will see something like this:
MPI rank 1 sleeping for 20 seconds while you attach the debugger.
You can use this debugger command:
gdb /proc/30766/exe 30766
or
idb -pid 30766 /proc/30766/exe |
MPI_DAEMON_DEBUG_ATTACH | Similar to the MPI_SLAVE_DEBUG_ATTACH environmental variable, except that the point of attachment is earlier. If MPI_SLAVE_DEBUG_ATTACH is set, the attachment occurs at:
#0 0x7f8b794fe060 in __nanosleep_nocancel () from /lib64/libc.so.6
#1 0x7f8b794fde9c in sleep () from /lib64/libc.so.6
#2 0x7f8b79c663df in slave_init () at adi.c:178
#3 fork_slaves () at adi.c:295
#4 MPI_SGI_create_slaves () at adi.c:419
#5 MPI_SGI_init () at adi.c:644
#6 0x7f8b7a165df8 in call_init () from /lib64/ld-linux-x86-64.so.2
For MPI_DAEMON_DEBUG_ATTACH, on the other hand, the attachment occurs at an earlier point:
#0 0x7fb97e8a1060 in __nanosleep_nocancel () from /lib64/libc.so.6
#1 0x7fb97e8a0e9c in sleep () from /lib64/libc.so.6
#2 0x7fb97f008a16 in MPI_SGI_init () at adi.c:450
#3 0x7fb97f508df8 in call_init () from /lib64/ld-linux-x86-64.so.2 |
MPI point-to-point communications
- Do not confuse eager/rendezvous protocols with MPI_Rsend/MPI_Ssend
Eager/rendezvous protocols are just possible MPI implementation strategies for managing internal buffers. Eager means the sender transfers the data without the receiver's acknowledgement. The rendezvous protocol is the opposite: a request-acknowledgment handshake must occur before data can be transferred.
Thus, eager/rendezvous protocols have nothing to do with whether a matching receive has been posted. For example, even without a posted matching receive, an MPI implementation can still use the eager protocol for small messages, and the receiver will just buffer them internally.
- Do not confuse blocking with synchronous.
Blocking means the MPI call does not return until the operation is locally complete (e.g. the buffer is safe to reuse); a non-blocking call returns immediately.
Synchronous in MPI protocol has a special meaning. It is a communication mode, meaning a send will not complete until a matching receive is posted.
Level | Concerns |
---|---|
User program | Blocking? Deadlock? Buffer reuse? |
MPI protocol | Matching receive posted (synchronous)? Matching receive must be posted first? Buffer reuse? |
MPI implementation | Eager or rendezvous? |
Send modes
Mode | Blocked until ? | Synchronous ? | Buffer reuse ? | Matching receive must be posted first? | Analysis |
---|---|---|---|---|---|
MPI_Send | Buffer is safe to reuse (which does not mean the message has been received or that a matching receive has been posted) | N/A | N/A | No | Strikes a balance between MPI_Ssend and MPI_Rsend |
MPI_Ssend | A matching receive is posted (S=Synchronous) | Yes | No | No | Greatest overhead, but safe. |
MPI_Bsend | Data is copied to the buffer attached with MPI_Buffer_attach (B=Buffered) | N/A | Yes (must call MPI_Buffer_attach first) | No | Decouples sender and receiver. |
MPI_Rsend | Data transfer completes | Yes | N/A | Yes (R=Ready) | Least overhead, but erroneous if the matching receive has not been posted yet. |
MPI_Isend | Never blocks (I=Immediate) | N/A | No | No | |
MPI_Issend | Never blocks | Yes | No | No | |
MPI_Ibsend | Never blocks | N/A | Not until tested by MPI_Wait/MPI_Test | No | |
MPI_Irsend | Never blocks | Yes | No | Yes | |
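A minimal C sketch (function and buffer names are illustrative) contrasting three of the send modes above:

```c
/* Sketch: standard, synchronous, and non-blocking sends to a peer rank. */
#include <mpi.h>
#include <string.h>

void send_examples(int peer)
{
    char buf[64];
    MPI_Request req;

    strcpy(buf, "hello");

    /* Standard send: returns once buf is safe to reuse (message may only be buffered). */
    MPI_Send(buf, 64, MPI_CHAR, peer, 0, MPI_COMM_WORLD);

    /* Synchronous send: completes only after a matching receive has been posted. */
    MPI_Ssend(buf, 64, MPI_CHAR, peer, 1, MPI_COMM_WORLD);

    /* Non-blocking send: returns immediately; do not touch buf until
     * MPI_Wait/MPI_Test reports completion. */
    MPI_Isend(buf, 64, MPI_CHAR, peer, 2, MPI_COMM_WORLD, &req);
    MPI_Wait(&req, MPI_STATUS_IGNORE);
}
```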
The receiver side is simple: there are only four APIs:
MPI_Recv MPI_Probe MPI_Irecv MPI_Iprobe
The purpose of MPI_Irecv is really to post a matching receive.
MPI_Probe can check if there is any receivable message that matches the specified source, tag, and comm.
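A small sketch (names are illustrative, and an int message is assumed) of the common MPI_Probe + MPI_Get_count + MPI_Recv pattern for receiving a message of unknown size:

```c
/* Sketch: probe for a message of unknown size, then receive it. */
#include <mpi.h>
#include <stdlib.h>

void recv_unknown_size(int src, int tag)
{
    MPI_Status status;
    int count;
    int *buf;

    /* Block until a matching message is available, without receiving it. */
    MPI_Probe(src, tag, MPI_COMM_WORLD, &status);
    MPI_Get_count(&status, MPI_INT, &count);

    buf = malloc(count * sizeof(int));
    MPI_Recv(buf, count, MPI_INT, src, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    /* ... use buf ... */
    free(buf);
}
```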
Test for communication completion
For all those MPI_I* APIs, one gets an MPI_Request handle. Both the sender and the receiver can then use the following APIs on this handle to test for communication completion:
MPI_Test MPI_Wait MPI_Testsome MPI_Waitsome MPI_Testall MPI_Waitall MPI_Testany MPI_Waitany
MPI_Test* are non-blocking while MPI_Wait* are blocking.
MPI_{Test|Wait}some is between the two extremes MPI_{Test|Wait}any and MPI_{Test|Wait}all.
Caveat: For sender, send operation completion only means the buffer is ready to reuse; it doesn't mean the data have been received.
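A minimal sketch (buffer layout and names are illustrative) of posting several non-blocking receives and waiting for all of them:

```c
/* Sketch: post one MPI_Irecv per peer, then block until all complete. */
#include <mpi.h>

void wait_for_all(int npeers, double *bufs /* npeers blocks of 100 doubles */)
{
    MPI_Request reqs[16];   /* assumes npeers <= 16 in this sketch */
    int i;

    for (i = 0; i < npeers; i++)
        MPI_Irecv(&bufs[i * 100], 100, MPI_DOUBLE, i, 0, MPI_COMM_WORLD, &reqs[i]);

    /* Blocking: returns only when every request has completed. */
    MPI_Waitall(npeers, reqs, MPI_STATUSES_IGNORE);
}
```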
Persistent communication mode
One can group many point-to-point communications into a batch and initiate the transfers in one shot. The relevant APIs are:
MPI_Send_init MPI_Bsend_init MPI_Rsend_init MPI_Ssend_init MPI_Recv_init MPI_Start MPI_Startall
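A minimal sketch (names and counts are illustrative) of a persistent send/receive pair reused over many iterations:

```c
/* Sketch: set up persistent requests once, start and complete them repeatedly. */
#include <mpi.h>

void persistent_exchange(double *sendbuf, double *recvbuf, int n,
                         int peer, int iterations)
{
    MPI_Request reqs[2];
    int i;

    MPI_Send_init(sendbuf, n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Recv_init(recvbuf, n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[1]);

    for (i = 0; i < iterations; i++) {
        MPI_Startall(2, reqs);                      /* initiate both transfers */
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);  /* wait for this round */
    }

    MPI_Request_free(&reqs[0]);
    MPI_Request_free(&reqs[1]);
}
```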
Deadlock scenarios and avoidance
Scenario | Process 1 | Process 2 |
---|---|---|
Deadlock 1 | Recv/Send | Recv/Send |
Deadlock 2 | Send/Recv | Send/Recv |
Solution 1 | Send/Recv | Recv/Send |
Solution 2 | SendRecv | SendRecv |
Solution 3 | IRecv/Send, Wait | IRecv/Send, Wait |
Solution 4 | BSend/Recv | BSend/Recv |
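Minimal C sketches of Solutions 2 and 3 (function and buffer names are illustrative); both processes call the same function with each other's rank as peer:

```c
/* Solution 2: MPI_Sendrecv pairs the send and receive so neither side can deadlock. */
#include <mpi.h>

void exchange_sendrecv(double *sendbuf, double *recvbuf, int n, int peer)
{
    MPI_Sendrecv(sendbuf, n, MPI_DOUBLE, peer, 0,
                 recvbuf, n, MPI_DOUBLE, peer, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}

/* Solution 3: post the receive first (non-blocking), then send, then wait. */
void exchange_irecv(double *sendbuf, double *recvbuf, int n, int peer)
{
    MPI_Request req;

    MPI_Irecv(recvbuf, n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &req);
    MPI_Send(sendbuf, n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD);
    MPI_Wait(&req, MPI_STATUS_IGNORE);
}
```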
MPI-2 one-sided point-to-point communications
The above point-to-point communications are two-sided, i.e. they require matching operations on both the sender and receiver sides. MPI-2 introduces one-sided point-to-point communications to facilitate remote memory accesses (RMA), which are essential for partitioned global address space languages. Unfortunately, MPI-2's RMA semantics are unintuitive and difficult to use.
Initialization
First, each process needs to designate a memory region (called a window in MPI-2 parlance) for RMA. This is achieved by calling a collective function:
MPI_Win_create
This function associates a memory region with a communicator. The sizes of the memory regions can differ from process to process, but it is important for all participating processes to call it, even if a process only offers memory for others to access and does not access other processes' memory.
The "destructor" of MPI_Win_create is
MPI_Win_free
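A minimal sketch (names are illustrative) of creating and later freeing a window over a local array; disp_unit is set to sizeof(double) so the RMA sketches below can address the window by element index:

```c
/* Sketch: expose a local array of n doubles as an RMA window. */
#include <mpi.h>

void make_window(double *local, int n, MPI_Win *win)
{
    /* Collective: every process in the communicator must call this,
     * even one that exposes no memory (it could pass size 0). */
    MPI_Win_create(local, (MPI_Aint)(n * sizeof(double)), (int)sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, win);
}

void destroy_window(MPI_Win *win)
{
    MPI_Win_free(win);   /* also collective */
}
```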
Communication
Three RMA operations are supported:
MPI_Get MPI_Put MPI_Accumulate
MPI_Accumulate is an atomic operation for remote update (Get-Modify-Put) operations. However, the only Modify operations it supports are the predefined MPI_Reduce operations such as MPI_SUM, MPI_PROD, MPI_MAX, MPI_LAND, etc.; user-defined reduction operations are not supported.
It is important to note that these operations are non-blocking. It makes sense that MPI_Put and MPI_Accumulate are non-blocking, but to understand why MPI_Get is also non-blocking, read on.
Synchronization
Now comes the most criticized and confusing part of MPI-2 RMA. The inclusion of synchronization in MPI-2 RMA seems designed to facilitate MPI implementers, not MPI users. As mentioned earlier, all RMA operations are non-blocking, which means that, as with the MPI_I* functions, there must be accompanying synchronization functions like MPI_Wait. And as in the MPI_I* case, the RMA operations cannot appear just anywhere in the program; they must be surrounded by these synchronization functions. The span between a pair of consecutive synchronizations is called an epoch in MPI-2 parlance.
Passive target
MPI_Win_lock(lockType, targetRank, assert, win)
MPI_Win_unlock(targetRank, win)
where lockType can be MPI_LOCK_SHARED or MPI_LOCK_EXCLUSIVE.
assert is usually 0, and it can be set to MPI_MODE_NOCHECK, which means no other processes will hold or attempt to acquire a conflicting lock.
It is important to note that the lock granularity is the entire window.
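A minimal passive-target sketch (names are illustrative), assuming the window was created with disp_unit equal to sizeof(double) as in the earlier sketch:

```c
/* Sketch: lock the target's window, put one value, unlock. */
#include <mpi.h>

void put_one(MPI_Win win, int target, double value, int index)
{
    MPI_Win_lock(MPI_LOCK_EXCLUSIVE, target, 0, win);
    /* Write 'value' at element 'index' of the target's window. */
    MPI_Put(&value, 1, MPI_DOUBLE, target, (MPI_Aint)index, 1, MPI_DOUBLE, win);
    MPI_Win_unlock(target, win);   /* the put is guaranteed complete after this */
}
```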
Active target, coarse-grained
MPI_Win_fence(assert, win)
Recall that a window is associated with a communicator, so all processes in the communicator, whether they are target or originating processes, must perform the same number of MPI_Win_fence calls. The epochs are the periods between consecutive MPI_Win_fence calls, and RMA operations in the same epoch are then synchronized.
assert is usually 0, but it can be set to
- MPI_MODE_NOSTORE: No local stores or self-MPI_Put/MPI_Accumulate to local window between this fence and the prior fence.
- MPI_MODE_NOPUT: No MPI_Put/MPI_Accumulate to local window between this fence and next fence.
- MPI_MODE_NOPRECEDE: No RMA between this fence and the prior fence.
- MPI_MODE_NOSUCCEED: No RMA between this fence and the next fence.
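A minimal fence-based sketch (names are illustrative; assert is left at 0) of a single epoch around an MPI_Get:

```c
/* Sketch: every process reads n doubles from its right neighbor's window. */
#include <mpi.h>

void get_from_right_neighbor(MPI_Win win, double *dest, int n,
                             int rank, int size)
{
    int right = (rank + 1) % size;

    MPI_Win_fence(0, win);   /* collective: opens the epoch on all processes */
    MPI_Get(dest, n, MPI_DOUBLE, right, 0, n, MPI_DOUBLE, win);
    MPI_Win_fence(0, win);   /* closes the epoch; dest is valid only after this */
}
```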
Active target, fine-grained
MPI_Win_post(group, assert, win)
MPI_Win_wait(win)
where group is an MPI_Group object consisting of the ranks of the specified originating processes.
assert can be 0, MPI_MODE_NOCHECK, MPI_MODE_NOSTORE, or MPI_MODE_NOPUT.
The originating processes (plural) then use the following pair of calls to open/close the corresponding epoch:
MPI_Win_start(group, assert, win)
MPI_Win_complete(win)
Within the epoch, the originating processes can then perform RMA operations.
(assert can be 0 or MPI_MODE_NOCHECK)
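A minimal sketch of the post/wait and start/complete pairing (the MPI_Group arguments would be built by the caller, e.g. with MPI_Group_incl; their construction, the target rank, and all names here are illustrative):

```c
/* Sketch: fine-grained active-target synchronization. */
#include <mpi.h>

void target_side(MPI_Win win, MPI_Group origins)
{
    MPI_Win_post(origins, 0, win);   /* expose the window to the origin group */
    /* ... the originating processes perform their RMA operations now ... */
    MPI_Win_wait(win);               /* block until they have all completed */
}

void origin_side(MPI_Win win, MPI_Group targets, double *val)
{
    MPI_Win_start(targets, 0, win);
    MPI_Put(val, 1, MPI_DOUBLE, /* target rank */ 0, 0, 1, MPI_DOUBLE, win);
    MPI_Win_complete(win);           /* RMA operations complete at the origin */
}
```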
Update semantics and erroneous behaviors
The following behaviors are considered erroneous (per the MPI-2 specification) and should be avoided:
- Conflicting accesses to the same memory location in the same window in an epoch.
It does not matter whether the accesses are initiated by local or remote processes.
For example, a local load and an MPI_Put (from a remote process) are erroneous (regardless of their order) if they access the same memory location.
The only exception is multiple MPI_Accumulate's (using the same reduction operation) to the same memory location.
- Local stores and MPI_Accumulate/MPI_Put (from remote processes) to different memory locations in the same window in an epoch.
The rationale is that on some architectures, explicit coherence-restoring operations may be needed at synchronization points, and those operations can differ depending on whether the memory was updated locally or remotely. Keeping track of the update type for each memory address would be prohibitively expensive.