Ethernet Emulation (TCP/IP and UDP/IP) Performance for GM-1

In addition to its OS-bypass features, GM also presents itself to the host operating system as an ethernet interface. This "ethernet emulation" feature of GM allows Myrinet to carry any packet traffic and protocols that can be carried on ethernet, including TCP/IP and UDP/IP.

It is helpful to understand that when using ethernet emulation over GM, traffic goes from the application through the OS kernel to the GM driver, following the same path as it would for a "real" ethernet interface; traffic does not go directly from the application to the interface, as it does when using GM in its OS-bypass mode. Thus, the TCP/IP and UDP/IP performance over GM depends primarily on the host-CPU performance and the host-OS's IP protocol stack. This performance varies widely for different hosts and operating systems. Also, unlike GM's OS-bypass mode, which exhibits a very small host-CPU overhead, TCP/IP and UDP/IP protocol processing at high data-transfer rates may use a significant fraction of the host-CPU cycles.
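To make the point concrete, the sketch below is an ordinary sockets program: a 1-Byte UDP request/response loop in the style of netperf's UDP_RR test. Nothing in it is GM-specific, which is what ethernet emulation implies for applications. It runs entirely over the loopback address, so it exercises only the kernel path and not the Myrinet hardware. The function names are hypothetical, and this is not netperf's implementation.

```python
import socket
import threading
import time


def echo_server(sock: socket.socket, count: int) -> None:
    """Echo `count` datagrams back to whoever sent them."""
    for _ in range(count):
        data, addr = sock.recvfrom(64)
        sock.sendto(data, addr)


def udp_request_response(count: int = 1000) -> float:
    """Return transactions per second for 1-Byte UDP request/response.

    Every request and reply passes through the kernel's IP stack,
    the same path that ethernet-emulation traffic follows.
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the kernel pick a free port
    server.settimeout(5.0)
    addr = server.getsockname()

    worker = threading.Thread(target=echo_server, args=(server, count), daemon=True)
    worker.start()

    client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    client.settimeout(5.0)
    start = time.perf_counter()
    for _ in range(count):
        client.sendto(b"x", addr)  # 1-Byte request
        client.recvfrom(64)        # wait for the 1-Byte response
    elapsed = time.perf_counter() - start

    worker.join()
    client.close()
    server.close()
    return count / elapsed


if __name__ == "__main__":
    print(f"{udp_request_response():.0f} transactions per second")
```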

The GM developers have streamlined ethernet emulation over GM wherever practical. For example, in operating systems that support it (Linux, FreeBSD, Mac OS X, Tru64 5.1), the ethernet-emulation code uses the PCI64 DMA engines to offload the receive-side IP-checksum computation for TCP/IP and UDP/IP, so the host-OS kernel touches less data. GM supports 9000-Byte jumbo frames in addition to the standard 1500-Byte ethernet frames; indeed, the MTU (Maximum Transmission Unit) can be set to any value between 64 Bytes and 9000 Bytes. Larger frames mean fewer packets, and therefore less per-packet processing, to transfer the same amount of data. GM-2 adds an optimization not provided in GM-1: interrupt coalescing, which reduces host overhead by batching multiple transmitted and received packets together so that the host services fewer interrupts.
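The effect of frame size on packet count can be estimated with a few lines of arithmetic. The sketch below assumes TCP over IPv4 with no header options (20 Bytes of IP header plus 20 Bytes of TCP header per packet) and a hypothetical 1 GB transfer; neither figure comes from the measurements on this page.

```python
IP_HEADER_BYTES = 20   # assumption: IPv4 header without options
TCP_HEADER_BYTES = 20  # assumption: TCP header without options


def packets_needed(payload_bytes: int, mtu: int) -> int:
    """Number of TCP/IP packets needed to carry payload_bytes at a given MTU."""
    per_packet = mtu - IP_HEADER_BYTES - TCP_HEADER_BYTES
    if per_packet <= 0:
        raise ValueError("MTU too small to carry any TCP payload")
    # Integer ceiling division: a partially filled last packet still counts.
    return (payload_bytes + per_packet - 1) // per_packet


if __name__ == "__main__":
    transfer = 1_000_000_000  # hypothetical transfer size in Bytes
    for mtu in (1500, 9000):
        print(f"MTU {mtu}: {packets_needed(transfer, mtu)} packets")
```

Under these assumptions the 9000-Byte MTU needs roughly one sixth as many packets as the 1500-Byte MTU for the same transfer.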

We report here the ethernet-emulation (TCP/IP and UDP/IP) performance of GM-1.5.2.1 between a pair of dual 2.0 GHz Intel Pentium-4 Xeon hosts that use the ServerWorks Grand Champion chipset. The test machines were running Red Hat 7.3 with the Red Hat 2.4.18-4smp Linux kernel. The GM driver was configured to use a 9000-Byte MTU for ethernet emulation.

The standard netperf 2.2pl2 benchmark gave the following bandwidth results for TCP and UDP. The TCP test uses 256K socket buffers, as reported by netperf (the command line requests 131072 Bytes); the UDP test uses an 8K (8192-Byte) message size.

NIC     Protocol  Bandwidth   CPU Utilization
                              Sender  Receiver
PCI64C  TCP       1853 Mb/s   50%     74%
        UDP       1962 Mb/s   41%     45%
PCI64B  TCP       1771 Mb/s   48%     66%
        UDP       1864 Mb/s   40%     35%
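The UDP bandwidth figures can be cross-checked against the message counts in the raw netperf output reproduced later on this page. The sketch below is a sanity check, not netperf's own calculation; it assumes that throughput is the number of messages sent, times the 8192-Byte message size, divided by the elapsed time as printed (rounded to two decimals), so small differences in the last digit are expected.

```python
def udp_throughput_mbps(messages: int, message_bytes: int, elapsed_s: float) -> float:
    """Throughput in 10^6 bits/s from a message count, message size, and elapsed time."""
    return messages * message_bytes * 8 / elapsed_s / 1e6


if __name__ == "__main__":
    # (messages sent, message size in Bytes, elapsed seconds) from the raw output
    runs = {
        "PCI64C": (1796374, 8192, 60.00),
        "PCI64B": (1706739, 8192, 59.99),
    }
    for nic, (messages, size, elapsed) in runs.items():
        print(f"{nic}: {udp_throughput_mbps(messages, size, elapsed):.1f} Mb/s")
```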

The following table shows the one-way (half-round-trip) latency for a 1-Byte message. The netperf request/response tests report this data as a transaction rate (transactions per second), so we divide 1 second by that rate to get the full round-trip latency, then divide that by 2 to obtain the results below.

NIC     Protocol  One-way Latency   CPU Utilization
                                    Sender  Receiver
PCI64C  TCP       32 µs             24%     23%
        UDP       31 µs             26%     26%
PCI64B  TCP       34 µs             24%     20%
        UDP       32 µs             27%     20%
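The conversion from netperf's transaction rate to the one-way latencies above can be sketched as follows. The rates are the "Trans. Rate per sec" values from the raw netperf output on this page; the function name is hypothetical.

```python
def one_way_latency_us(transactions_per_sec: float) -> float:
    """Half-round-trip latency in microseconds from a request/response rate."""
    round_trip_s = 1.0 / transactions_per_sec  # one transaction = one round trip
    return round_trip_s / 2 * 1e6


if __name__ == "__main__":
    # Transaction rates taken from the raw netperf TCP_RR and UDP_RR output.
    rates = {
        ("PCI64C", "TCP"): 15499.25,
        ("PCI64C", "UDP"): 16065.29,
        ("PCI64B", "TCP"): 14687.71,
        ("PCI64B", "UDP"): 15442.63,
    }
    for (nic, proto), rate in rates.items():
        print(f"{nic} {proto}: {one_way_latency_us(rate):.1f} us")
```

Rounded to whole microseconds, these values match the table.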

The "raw" netperf output for these tests is reproduced below.


Raw netperf output for PCI64C NICs:

% netperf -H10.0.0.9 -l60 -c -C -- -S131072 -s131072
TCP STREAM TEST to 10.0.0.9
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

262142 262142 262142    59.99      1852.66   50.41    74.36    4.458   6.576

% netperf -H10.0.0.9 -l60 -c -C -tUDP_STREAM -- -m 8192
UDP UNIDIRECTIONAL SEND TEST to 10.0.0.9
Socket  Message  Elapsed      Messages                   CPU      Service
Size    Size     Time         Okay Errors   Throughput   Util     Demand
bytes   bytes    secs            #      #   10^6bits/sec % SS     us/KB

 65535    8192   60.00     1796374      0     1962.2     41.94    3.502
 65535           60.00     1796368            1962.2     45.83    3.826

% netperf -H10.0.0.9 -l60 -c -C -tTCP_RR
TCP REQUEST/RESPONSE TEST to 10.0.0.9
Local /Remote
Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
Send   Recv   Size    Size   Time    Rate     local  remote local   remote
bytes  bytes  bytes   bytes  secs.   per sec  % S    % S    us/Tr   us/Tr

16384  87380  1       1      59.99   15499.25  24.25  22.55  31.287  29.100
16384  87380

% netperf -H10.0.0.9 -l60 -c -C -tUDP_RR
UDP REQUEST/RESPONSE TEST to 10.0.0.9
Local /Remote
Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
Send   Recv   Size    Size   Time    Rate     local  remote local   remote
bytes  bytes  bytes   bytes  secs.   per sec  % S    % S    us/Tr   us/Tr

65535  65535  1       1      60.00   16065.29   27.43  25.66  34.146  31.943
65535  65535

Raw netperf output for PCI64B NICs:

% netperf -H10.0.1.9 -l60 -c -C -- -S131072 -s131072
TCP STREAM TEST to 10.0.1.9
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

262142 262142 262142    60.00      1771.09   47.95    65.53    4.436   6.062


% netperf -H10.0.1.9 -l60 -c -C -tUDP_STREAM -- -m 8192
UDP UNIDIRECTIONAL SEND TEST to 10.0.1.9
Socket  Message  Elapsed      Messages                   CPU      Service
Size    Size     Time         Okay Errors   Throughput   Util     Demand
bytes   bytes    secs            #      #   10^6bits/sec % SS     us/KB

 65535    8192   59.99     1706739      0     1864.4     40.30    3.542
 65535           59.99     1706739            1864.4     35.59    3.128

% netperf -H10.0.1.9 -l60 -c -C -tTCP_RR
TCP REQUEST/RESPONSE TEST to 10.0.1.9
Local /Remote
Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
Send   Recv   Size    Size   Time    Rate     local  remote local   remote
bytes  bytes  bytes   bytes  secs.   per sec  % S    % S    us/Tr   us/Tr

16384  87380  1       1      59.99   14687.71  23.47  20.03  31.952  27.276
16384  87380

% netperf -H10.0.1.9 -l60 -c -C -tUDP_RR
UDP REQUEST/RESPONSE TEST to 10.0.1.9
Local /Remote
Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
Send   Recv   Size    Size   Time    Rate     local  remote local   remote
bytes  bytes  bytes   bytes  secs.   per sec  % S    % S    us/Tr   us/Tr

65535  65535  1       1      60.00   15442.63   26.58  20.11  34.422  26.051
65535  65535


02 June 2006