GM Performance Measurements


GM 2.1 API Performance
with M3F2-PCIXE Myrinet/PCI-X NICs
March 2005

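The three principal performance metrics ("micro-benchmarks") of a cluster interconnect are the sustained data rate for large messages (one-way and two-way), the latency for short messages, and the host-CPU utilization per message.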

The following table summarizes performance measurements for M3F2-PCIXE Myrinet NICs under GM 2.1.9 in dual-2.4-GHz Pentium-4 Xeon hosts with the Serverworks "Grand Champion" (GC-LE) chipset. These hosts have 64-bit, 133MHz, PCI-X slots with good PCI-DMA performance (780 MB/s bus read, 1044 MB/s bus write, and 899 MB/s combined bus read/write, according to the GM-2 "gm_debug" utility).

Performance Metric | M3F2-PCIXE (64-bit, 133MHz PCI-X; 333MHz RISC & Memory)
Sustained one-way data rate for large messages | 495 MByte/s
Sustained two-way data rate for large messages | 838 MByte/s
Latency for short messages | 5.05 µs
Host-CPU utilization per message (send + receive) | less than 0.5 µs

These results are representative of message-passing performance in application programs. They are not "marketing benchmarks." The performance measurements are between user processes, with full protection (safe in multi-user, multiprogramming environments), and with end-to-end data-integrity checking. Other host computers may show better or worse performance. However, because the operating system is "bypassed" for the GM API, there is little performance variation between different operating systems.

Sustained one-way data rate

As shown in the following graph, the sustained, one-way, data rate for GM 2.1 closely approaches the 500 MB/s unidirectional data rate of dual Myrinet links for long messages. The data rate reaches half of the 495 MB/s asymptotic value with message lengths of ~1486B.

[Graph: GM 2.1 sustained one-way data rate versus message length]

The test was performed by one host sending and another host receiving messages of each plotted length repeatedly. The plot is qualitatively similar to the data-rate performance of other types of interconnect, e.g., Ethernet. For small message sizes, the data-rate performance is limited by the number of DMA transfers and packets the NIC can handle per unit time. Longer messages convey bytes in larger units, hence, more efficiently.
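
The shape described above is often approximated by a simple two-parameter model in which each message costs a fixed time plus a time proportional to its length. The following Python sketch assumes that model (it is not taken from GM itself) and uses only the 495 MB/s asymptote and the ~1486 B half-rate length quoted above. The function name `modeled_data_rate` is illustrative, and the measured curve is jagged, so the output should be read as a smooth approximation only.

```python
def modeled_data_rate(n_bytes, r_inf, n_half):
    """Modeled sustained data rate, in the same units as r_inf.

    Assumes time per message = fixed cost + n_bytes / r_inf, which gives
    rate(n) = r_inf * n / (n + n_half). This is an illustrative model,
    not a formula documented for GM.
    """
    if n_bytes <= 0:
        raise ValueError("message length must be positive")
    return r_inf * n_bytes / (n_bytes + n_half)


R_INF = 495.0   # MB/s, asymptotic one-way data rate quoted in the text
N_HALF = 1486   # bytes, message length at which half of R_INF is reached

# By construction, the model returns R_INF / 2 at n_bytes == N_HALF.
for n in (64, 1486, 4096, 65536, 1048576):
    print(f"{n:>8} B  {modeled_data_rate(n, R_INF, N_HALF):6.1f} MB/s")
```

In this model a single number, the half-rate length, summarizes how quickly the interconnect approaches its peak rate as messages grow.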

The jagged patterns in the central part of this log-scale plot arise because GM fragments long messages into packets of at most 4KB at the sender, and reassembles the packets into messages at the receiver. This fragmentation and reassembly limits the packet size in the network, so that a long message does not block a channel for an extended period and other packets can be interleaved on the channel. The same effect is present but less prominent in the plot of the two-way data rate.
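
As a rough illustration of the fragmentation described above, the following Python sketch splits a message into packet payloads of at most 4KB. It assumes that 4KB means 4096 bytes and ignores packet headers; the function name `fragment_sizes` is hypothetical and does not correspond to any GM call.

```python
MAX_PACKET_PAYLOAD = 4096  # bytes; "at most 4KB" in the text, taken here as 4096


def fragment_sizes(message_len, max_payload=MAX_PACKET_PAYLOAD):
    """Return the payload size of each packet for a message of message_len bytes."""
    if message_len <= 0:
        raise ValueError("message length must be positive")
    full_packets, remainder = divmod(message_len, max_payload)
    sizes = [max_payload] * full_packets
    if remainder:
        sizes.append(remainder)  # final, partially filled packet
    return sizes


print(fragment_sizes(4096))   # [4096]
print(fragment_sizes(4097))   # [4096, 1]
print(fragment_sizes(10000))  # [4096, 4096, 1808]
```

A message one byte longer than a multiple of the packet size needs an extra, nearly empty packet, which is one plausible reading of why the measured data rate dips just past each packet boundary.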

Sustained two-way (summed-bidirectional) data rate

In data-rearrangement phases of distributed computations, it is common for a host to be sending and receiving messages simultaneously, so the summed-bidirectional data rate is the key metric for the performance of many applications. Bidirectional performance places heavy demands on the local-memory data rate in the NIC and on transfers across the PCI bus. With the 2.7GB/s local-memory data rate of the PCIXE NIC and the high throughput of the 133MHz PCI-X bus (limited to 899 MB/s bidirectional bandwidth on this motherboard), the summed-bidirectional performance reaches 838 MB/s, approaching the 500+500 MB/s summed-bidirectional data rate of the dual Myrinet links.

Note that the bus_read_write speed reported in the output of gm_debug -L indicates the maximum bidirectional bandwidth that the PCI bus on the motherboard can achieve. However, the computational overhead in the GM-2 firmware limits the achieved rate to approximately 838 MB/s of bidirectional bandwidth.
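
The relationship between the measured rate and the two ceilings mentioned above can be checked with a few lines of arithmetic. The Python sketch below uses only figures quoted in this section; the helper name `fraction_of` is illustrative.

```python
def fraction_of(measured, ceiling):
    """Return measured / ceiling as a fraction between 0 and 1."""
    if ceiling <= 0:
        raise ValueError("ceiling must be positive")
    return measured / ceiling


measured_two_way = 838.0        # MB/s, summed-bidirectional rate under GM 2.1
link_ceiling = 500.0 + 500.0    # MB/s, dual Myrinet links in both directions
bus_ceiling = 899.0             # MB/s, bus_read_write reported by gm_debug

print(f"fraction of link ceiling: {fraction_of(measured_two_way, link_ceiling):.1%}")  # 83.8%
print(f"fraction of bus ceiling:  {fraction_of(measured_two_way, bus_ceiling):.1%}")   # 93.2%
```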

[Graph: GM 2.1 sustained two-way data rate versus message length]

Short-message latency

This measurement is performed as a repetitive "ping-pong" exchange of messages between processes in different hosts, with the one-way latency for each message length plotted as half of the average round-trip time (RTT).

[Graph: GM 2.1 one-way latency versus message length]

The short-message latency, a critical metric for many distributed-computing applications, is 5.05 µs.
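
The ping-pong method described above can be sketched in Python as follows. The `round_trip` argument is a hypothetical callable that sends one message and blocks until the reply arrives; it stands in for whatever messaging layer is under test and is not a GM function. Timing the whole loop, rather than each exchange, keeps timer overhead out of the result.

```python
import time


def ping_pong_latency(round_trip, iterations=1000):
    """Estimate one-way latency in seconds as half the mean round-trip time.

    round_trip: hypothetical callable performing one send-and-wait-for-reply.
    """
    if iterations <= 0:
        raise ValueError("iterations must be positive")
    start = time.perf_counter()
    for _ in range(iterations):
        round_trip()
    elapsed = time.perf_counter() - start
    return elapsed / iterations / 2.0


# Stand-in exchange that does nothing, so this measures only loop overhead.
latency_s = ping_pong_latency(lambda: None, iterations=10000)
print(f"one-way latency estimate: {latency_s * 1e6:.3f} µs")
```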

Host-CPU utilization

Minimizing the host-CPU overhead was one of the principal design objectives of GM, and GM does indeed exhibit a very low host-CPU utilization. For example, between the hosts used in the performance graphs above, the sum of the measured overheads for sending and receiving a message is less than 0.5 µs.
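
To put the 0.5 µs bound in perspective, the following Python sketch converts a per-message overhead into the fraction of one CPU consumed at a given message rate. The message rate in the example is hypothetical, chosen only for illustration, and the function name `host_cpu_fraction` is not part of GM.

```python
def host_cpu_fraction(messages_per_second, overhead_us_per_message):
    """Fraction of one CPU spent on messaging overhead.

    overhead_us_per_message is the summed send + receive overhead, so this
    applies to a host that both sends and receives messages_per_second
    messages each second.
    """
    if messages_per_second < 0 or overhead_us_per_message < 0:
        raise ValueError("arguments must be non-negative")
    return messages_per_second * overhead_us_per_message * 1e-6


# Hypothetical load of 100,000 messages per second at the 0.5 µs upper bound.
print(f"{host_cpu_fraction(100_000, 0.5):.1%}")  # 5.0%
```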


GM 2.0 API Performance
with M3F-PCIXD and M3F-PCIXF Myrinet/PCI-X NICs
March 2005

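The three principal performance metrics ("micro-benchmarks") of a cluster interconnect are the sustained data rate for large messages (one-way and two-way), the latency for short messages, and the host-CPU utilization per message.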

The following tables summarize performance measurements for M3F-PCIXD and M3F-PCIXF Myrinet NICs under GM 2.0.19 in dual-2.4-GHz Pentium-4 Xeon hosts with the Serverworks "Grand Champion" (GC-LE) chipset. These hosts have 64-bit, 133MHz, PCI-X slots with good PCI-DMA performance (842 MB/s bus read, 1044 MB/s bus write, according to the GM-2 "gm_debug" utility).

Performance Metric | M3F-PCIXD (64-bit, 133MHz PCI-X; 225MHz RISC & Memory)
Sustained one-way data rate for large messages | 248 MByte/s
Sustained two-way data rate for large messages | 489 MByte/s
Latency for short messages | 6.3 µs
Host-CPU utilization per message (send + receive) | less than 0.5 µs

Performance Metric | M3F-PCIXF (64-bit, 133MHz PCI-X; 333MHz RISC & Memory)
Sustained one-way data rate for large messages | 248 MByte/s
Sustained two-way data rate for large messages | 489 MByte/s
Latency for short messages | 4.5 µs
Host-CPU utilization per message (send + receive) | less than 0.5 µs

These results are representative of message-passing performance in application programs. They are not "marketing benchmarks." The performance measurements are between user processes, with full protection (safe in multi-user, multiprogramming environments), and with end-to-end data-integrity checking. Other host computers may show better or worse performance. However, because the operating system is "bypassed" for the GM API, there is little performance variation between different operating systems.

Sustained one-way data rate

As shown in the following graph, the sustained, one-way, data rate for GM 2.0 closely approaches the 250MB/s (2Gb/s) unidirectional data rate of a Myrinet link for long messages. The data rate reaches half of the 248 MB/s asymptotic value with message lengths of ~650B with PCIXF NICs and ~900B with PCIXD NICs.

[Graph: GM 2.0 sustained one-way data rate versus message length, PCIXD and PCIXF NICs]

The test was performed by one host sending and another host receiving messages of each plotted length repeatedly. The plot is qualitatively similar to the data-rate performance of other types of interconnect, e.g., Ethernet. For small message sizes, the data-rate performance is limited by the number of DMA transfers and packets the NIC can handle per unit time. Longer messages convey bytes in larger units, hence, more efficiently.
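
If each message is assumed to cost a fixed time plus a time proportional to its length, the half-rate message length equals that fixed time multiplied by the asymptotic rate. The Python sketch below inverts this relationship for the two NICs, using the ~650 B and ~900 B half-rate lengths and the 248 MB/s asymptote quoted above. The model and the function name `implied_fixed_cost_us` are assumptions made for illustration. The result is a per-message cost in a streaming test and should not be confused with the ping-pong latency.

```python
def implied_fixed_cost_us(n_half_bytes, r_inf_mb_per_s):
    """Fixed per-message cost in microseconds implied by the half-rate length.

    With 1 MB taken as 10**6 bytes, a rate in MB/s equals bytes per
    microsecond, so the cost is simply n_half / r_inf.
    """
    if r_inf_mb_per_s <= 0:
        raise ValueError("asymptotic rate must be positive")
    return n_half_bytes / r_inf_mb_per_s


R_INF = 248.0  # MB/s, asymptotic one-way data rate under GM 2.0

for nic, n_half in (("PCIXF", 650), ("PCIXD", 900)):
    print(f"{nic}: {implied_fixed_cost_us(n_half, R_INF):.2f} µs per message")
# PCIXF: 2.62 µs per message
# PCIXD: 3.63 µs per message
```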

Sustained two-way (summed-bidirectional) data rate

In data-rearrangement phases of distributed computations, it is common for a host to be sending and receiving messages simultaneously, so the summed-bidirectional data rate is the key metric for the performance of many applications. Bidirectional performance places heavy demands on the local-memory data rate in the NIC and on transfers across the PCI bus. With the 1.8GB/s local-memory data rate of the PCIXD NIC, the 2.7GB/s local-memory data rate of the PCIXF NIC, and the high throughput of the 133MHz PCI-X bus, the summed-bidirectional performance reaches 489 MB/s, closely approaching the 250+250 MB/s summed-bidirectional data rate of the Myrinet link.

[Graph: GM 2.0 sustained two-way data rate versus message length, PCIXD and PCIXF NICs]

The jagged patterns in the central part of this log-scale plot arise because GM fragments long messages into packets of at most 4KB at the sender, and reassembles the packets into messages at the receiver. This fragmentation and reassembly limits the packet size in the network, so that a long message does not block a channel for an extended period and other packets can be interleaved on the channel. The same effect is present but less prominent in the plot of the one-way data rate.

Short-message latency

This measurement is performed as a repetitive "ping-pong" exchange of messages between processes in different hosts, with the one-way latency for each message length plotted as half of the average round-trip time (RTT).

[Graph: GM 2.0 one-way latency versus message length, PCIXD and PCIXF NICs]

The short-message latency, a critical metric for many distributed-computing applications, is 6.3 µs for a PCIXD NIC, and 4.5 µs for a PCIXF NIC.

Host-CPU utilization

Minimizing the host-CPU overhead was one of the principal design objectives of GM, and GM does indeed exhibit a very low host-CPU utilization. For example, between the hosts used in the performance graphs above, the sum of the measured overheads for sending and receiving a message is less than 0.5 µs.

Last updated: 02 June 2006