Performance Measurements



GM 1.6.4 API Performance
with PCI64B and PCI64C Myrinet/PCI NICs
April 2003

The three principal performance metrics ("micro-benchmarks") of a cluster interconnect are data rate, short-message latency, and host-CPU utilization.

The following table summarizes performance measurements for M3F-PCI64B and M3F-PCI64C Myrinet NICs in 2-GHz Pentium-IV Xeon Supermicro P4DL6 hosts, which have 64-bit, 66MHz, PCI slots, and good performance between the PCI bus and system memory:

Performance Metric                                  PCI64B                  PCI64C
                                                    (133MHz RISC & Memory)  (200MHz RISC & Memory)
Sustained one-way data rate for large messages      240 MByte/s             246 MByte/s
Sustained two-way data rate for large messages      340 MByte/s             421 MByte/s
Latency for short messages                          8.5 µs                  6.7 µs
Host-CPU utilization per message (send + receive)   0.54 µs                 0.54 µs

These results are representative of message-passing performance in application programs. They are not "marketing benchmarks." The performance measurements are between user processes, with full protection (safe in multi-user, multiprogramming environments), and with end-to-end data-integrity checking. Other host computers may show better or worse performance. However, because the operating system is "bypassed" for the GM API, there is little performance variation between different operating systems.

Sustained one-way data rate

Hosts with 64-bit, 66MHz PCI slots (or PCI-X slots): As shown in the following graph, the sustained, one-way, data rate for GM closely approaches the 250MB/s (2Gb/s) data rate of a Myrinet-2000 link for long messages. The data rate reaches half of this asymptotic value with message lengths of ~1.3KB (PCI64B) or ~900B (PCI64C).
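The half-rate message lengths quoted above suggest a simple way to approximate the curve. The sketch below is not from the measurements themselves: it assumes the common two-parameter model rate(n) = R * n / (n + n_half), takes the asymptotes R from the summary table, and uses 1300 and 900 bytes for the half-rate lengths. The function name is hypothetical.

```python
def data_rate_mb_s(msg_bytes, asymptotic_mb_s, half_length_bytes):
    """Modeled sustained data rate (MB/s) for messages of msg_bytes bytes.

    Assumed model (not taken from the source): the rate rises smoothly
    toward asymptotic_mb_s and equals half of it at half_length_bytes.
    """
    return asymptotic_mb_s * msg_bytes / (msg_bytes + half_length_bytes)


# Asymptotes from the summary table; half-rate lengths of ~1.3KB (PCI64B)
# and ~900B (PCI64C) as quoted in the text.
for name, r_inf, n_half in (("PCI64B", 240.0, 1300.0), ("PCI64C", 246.0, 900.0)):
    for n in (64, 1024, 4096, 65536):
        print(f"{name} {n:6d} B: {data_rate_mb_s(n, r_inf, n_half):6.1f} MB/s")
```

A smooth model like this captures only the overall shape; it does not reproduce the jagged region of the measured plot.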

The test was performed by one host sending and another host receiving messages of each plotted length repeatedly. The plot is qualitatively similar to the data-rate performance of other types of interconnect, e.g., Ethernet. For small message sizes, the data-rate performance is limited by the number of DMA transfers and packets the NIC can handle per unit time. Longer messages convey bytes in larger units, hence, more efficiently.

The reason for the jagged patterns in the central part of this log-scale plot is that GM fragments long messages into packets of at most 4KB at the sender, and reassembles the packets into messages at the receiver. This fragmentation and reassembly is performed in order to limit the packet size in the network, so that a long message will not block a channel for an extended period, but will allow other packets to be interleaved on the channel.
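The fragmentation rule can be illustrated with a short sketch. It assumes that "4KB" means 4096 bytes of payload and that a message is split into full packets followed by one remainder packet; GM's actual packet layout is not described here, and the function name is hypothetical.

```python
MAX_PACKET_BYTES = 4096  # assumption: "at most 4KB" read as 4096 payload bytes


def fragment_sizes(msg_bytes, max_packet=MAX_PACKET_BYTES):
    """Return the payload sizes of the packets carrying one message."""
    if msg_bytes < 1:
        raise ValueError("message length must be at least 1 byte")
    full, rest = divmod(msg_bytes, max_packet)
    # Full-size packets first, then one short packet for any remainder.
    return [max_packet] * full + ([rest] if rest else [])


for n in (4096, 4097, 8192, 10000):
    sizes = fragment_sizes(n)
    print(f"{n:6d} B -> {len(sizes)} packet(s): {sizes}")
```

Under this assumption, a message one byte longer than a multiple of 4KB needs an extra, nearly empty packet, which is one plausible contributor to the sawtooth shape in the plot.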

Hosts with 64-bit, 33MHz PCI slots: The asymptotic data rate plotted above is roughly equal to the peak DMA transfer rates of 64-bit, 33MHz PCI slots. In hosts with good 64-bit, 33MHz PCI implementations, the GM data-rate performance is limited to 200-240 MB/s (1.6-1.9 Gb/s).

Hosts with 32-bit PCI slots: The asymptotic data rate plotted above exceeds the 133 MB/s peak DMA transfer rates of 32-bit, 33MHz PCI slots. In hosts with good 32-bit, 33MHz PCI implementations, the GM data-rate performance is limited to 100-130 MB/s (0.8-1.0 Gb/s).
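The peak PCI rates referred to above follow from bus width times clock rate. The sketch below assumes one transfer per clock and the nominal PCI clocks of 33.33MHz and 66.67MHz; the function name is hypothetical. Sustained DMA rates on real hosts are lower than these peaks.

```python
def pci_peak_mb_s(width_bits, clock_mhz):
    """Peak PCI transfer rate in MB/s, assuming one transfer per clock."""
    return width_bits / 8 * clock_mhz


for width_bits, clock_mhz in ((32, 100 / 3), (64, 100 / 3), (64, 200 / 3)):
    peak = pci_peak_mb_s(width_bits, clock_mhz)
    print(f"{width_bits}-bit, {clock_mhz:.0f}MHz: {peak:.0f} MB/s peak")
```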

Sustained two-way (summed-bidirectional) data rate

In data-rearrangement phases of distributed computations, it is common for a host to be sending and receiving messages simultaneously. Bidirectional performance is thus important to application performance, but it is demanding of local-memory data rate in the NIC and of transfers across the PCI bus. This microbenchmark illustrates the benefits of the higher instruction rate and the higher local-memory data rate of the PCI64C NICs (200MHz, 1600 MBytes/s) versus the PCI64B NICs (133MHz, 1067 MBytes/s). The asymptotic performance of the PCI64C NIC, approximately 421MB/s, is limited by the PCI bus. (The "gm_debug" PCI-DMA rates for this host with PCI64C NICs are 388MB/s read and 491MB/s write. The PCI-bus limit on the summed-bidirectional data rate is 2/(1/388 + 1/491) = 433MB/s.)
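The PCI-bus limit in the parenthetical above can be checked directly. One reading of the formula, offered here as an interpretation, is that DMA reads and writes share the bus, so moving one megabyte in each direction takes 1/388 + 1/491 seconds and delivers two megabytes in total. The function name is hypothetical.

```python
def summed_bidirectional_limit(read_mb_s, write_mb_s):
    """Summed two-way data-rate limit (MB/s) for a bus shared by reads and writes."""
    return 2.0 / (1.0 / read_mb_s + 1.0 / write_mb_s)


# "gm_debug" PCI-DMA rates quoted in the text for this host with PCI64C NICs.
limit = summed_bidirectional_limit(388.0, 491.0)
print(f"PCI-bus limit: {limit:.0f} MB/s")
```

The measured asymptote of approximately 421MB/s sits just below this bound.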

This excellent bidirectional performance requires 64-bit, 66MHz, PCI slots with good performance between the PCI bus and system memory. The bidirectional performance with 64-bit, 33MHz or 32-bit, 33MHz PCI slots will be limited by the PCI bus.

Short-message latency

This measurement is performed as a repetitive "ping-pong" exchange of messages between processes in different hosts, with the one-way latency for each message length plotted as half of the average round-trip time (RTT).
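The ping-pong method itself is easy to sketch. The example below does not use the GM API: it bounces a message between two threads over a local socket pair purely to show how one-way latency is derived as half of the average round-trip time. All names are hypothetical, and the numbers it prints reflect the local machine, not Myrinet.

```python
import socket
import threading
import time


def _recv_exact(sock, n):
    """Receive exactly n bytes from sock."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        buf.extend(chunk)
    return bytes(buf)


def _echo(sock, msg_len, iterations):
    """The "pong" side: return every message to its sender."""
    for _ in range(iterations):
        sock.sendall(_recv_exact(sock, msg_len))


def one_way_latency_us(msg_len=8, iterations=1000):
    """Half of the average round-trip time, in microseconds."""
    ping, pong = socket.socketpair()
    echo = threading.Thread(target=_echo, args=(pong, msg_len, iterations))
    echo.start()
    payload = b"x" * msg_len
    start = time.perf_counter()
    for _ in range(iterations):
        ping.sendall(payload)
        _recv_exact(ping, msg_len)
    elapsed = time.perf_counter() - start
    echo.join()
    ping.close()
    pong.close()
    return elapsed / iterations / 2 * 1e6


print(f"one-way latency: {one_way_latency_us():.1f} us")
```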

The short-message latency, a critical metric for many distributed-computing applications, is ~6.7µs for PCI64C NICs and ~8.5µs for PCI64B NICs under GM 1.6.4. Each latency is the sum of two components: host, PCI, and link-port latencies totaling ~3.1µs, and an instruction-execution component. The higher instruction rate of the PCI64C NICs reduces the instruction-execution component from ~5.4µs (PCI64B) to ~3.6µs (PCI64C).
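The latency breakdown above can be verified with a few lines of arithmetic. The sketch also checks an observation that is an inference, not a claim from the measurements: scaling the PCI64B instruction-execution time by the clock ratio 133/200 gives roughly the PCI64C figure. Function and constant names are hypothetical.

```python
FIXED_US = 3.1  # host + PCI + link-port latencies, as quoted in the text


def short_message_latency_us(instruction_us):
    """Total one-way latency: fixed component plus instruction execution."""
    return FIXED_US + instruction_us


def scaled_instruction_us(instruction_us, old_mhz, new_mhz):
    """Instruction time if it scaled inversely with the NIC clock (assumption)."""
    return instruction_us * old_mhz / new_mhz


print(f"PCI64B: {short_message_latency_us(5.4):.1f} us")
print(f"PCI64C: {short_message_latency_us(3.6):.1f} us")
print(f"5.4 us scaled from 133MHz to 200MHz: {scaled_instruction_us(5.4, 133, 200):.2f} us")
```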

The latency for transferring a short message between processes in different hosts is relatively insensitive to whether the host has 32-bit, 33MHz; 64-bit, 33MHz; or 64-bit, 66MHz PCI slots. The range for GM's one-way, short-message latency with PCI64B or PCI64C NICs on a wide variety of hosts is from ~6 µs to ~16 µs.

Hardware latencies. These latency measurements were between NICs with Myrinet-Fiber ports connected through short fiber cables to a switch. The latency includes ~0.5µs total hardware latency in the circuitry between the Myrinet-SAN ports of the LANai chips in the two NICs. This hardware latency is introduced by circuitry that converts to and from the Fiber Physical layer, and by the switch itself. Note that the hardware latency is small compared with the software latency.

The "time of flight" latency of a fiber cable is ~0.0065µs per meter, e.g., ~1.3µs for a maximal-length, 200m fiber cable.
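The time-of-flight figure scales linearly with cable length, as this small sketch shows. It uses only the per-meter value quoted above; the function name is hypothetical.

```python
FIBER_US_PER_METER = 0.0065  # per-meter figure quoted in the text


def fiber_delay_us(length_m):
    """One-way propagation delay (microseconds) over a fiber of length_m meters."""
    return FIBER_US_PER_METER * length_m


for length_m in (3, 25, 200):
    print(f"{length_m:4d} m: {fiber_delay_us(length_m):.3f} us")
```

For the short cables typical inside a machine room, this delay is small compared with the ~0.5µs hardware latency and the software latency.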

Host-CPU utilization

Minimizing the host-CPU overhead was one of the principal design objectives of GM, and GM does indeed exhibit a very low host-CPU utilization. For example, between the hosts used in the performance graphs above, the sum of the measured overheads for sending and receiving a message is 0.54 µs.
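To put the 0.54 µs figure in perspective, the sketch below converts a message rate into the fraction of one host CPU spent on send and receive overhead. It assumes the overhead is constant per message and simply adds up, which is a simplification; the function name is hypothetical.

```python
OVERHEAD_US = 0.54  # measured send + receive overhead per message, from the text


def host_cpu_fraction(messages_per_second):
    """Fraction of one CPU consumed by messaging overhead at a given rate."""
    return messages_per_second * OVERHEAD_US * 1e-6


for rate in (10_000, 100_000, 1_000_000):
    print(f"{rate:9,d} msg/s: {host_cpu_fraction(rate) * 100:.1f}% of one CPU")
```

Under this assumption, a host exchanging 100,000 messages per second would spend about 5.4% of one CPU on messaging overhead.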


Last updated: 02 June 2006