Wednesday, September 10, 2014

Receive side scaling: testing UDP throughput

I think it's about time I shared some more details about the RSS stuff going into FreeBSD and how I'm testing it.

For now I'm focusing on IPv4 + UDP on the Intel 10GE NICs. The TCP side of things is done (and the IPv6 side of things works too!) but enough of the performance walls show up in the IPv4 UDP case that it's worth sticking to it for now.

I'm testing on a pair of 4-core boxes at home. They're not special - and they're very specifically not trying to be server-class hardware. I'd like to see where these bottlenecks are even at low core count.

The test setup in question:

Testing software:

  • http://github.com/erikarn/freebsd-rss
  • It requires libevent2 - an updated copy; previous versions of libevent2 didn't handle FreeBSD specific errors gracefully and would early error out of the IO loop.

Server:

  • CPU: Intel(R) Core(TM) i5-3550 CPU @ 3.30GHz (3292.59-MHz K8-class CPU)
  • There's no SMT/HTT, but I've disabled it in the BIOS just to be sure
  • 4GB RAM
  • FreeBSD-HEAD, amd64
  • NIC:  '82599EB 10-Gigabit SFI/SFP+ Network Connection
  • ix0: 10.11.2.1/24
/etc/sysctl.conf:

# for now redirect processing just makes the lock overhead suck even more.
# disable it.
net.inet.ip.redirect=0
net.inet.icmp.drop_redirect=1

/boot/loader.conf:

hw.ix.num_queues=8

# experiment with deferred dispatch for RSS
net.isr.numthreads=4
net.isr.maxthreads=4
net.isr.bindthreads=1
 

kernel config:

include GENERIC
ident HACKBOX

device netmap
options RSS
options PCBGROUP

# in-system lock profiling
options LOCK_PROFILING

# Flowtable - the rtentry locking is a bit .. slow.
options   FLOWTABLE

# This debugging code has too much overhead to do accurate
# testing with.
nooptions         INVARIANTS
nooptions         INVARIANT_SUPPORT
nooptions         WITNESS
nooptions         WITNESS_SKIPSPIN


The server runs the "rss-udp-srv" process, which behaves like a multi-threaded UDP echo server on port 8080.

Client

The client box is slightly more powerful to compensate for (currently) not using completely affinity-aware RSS UDP transmit code.

  • CPU: Intel(R) Core(TM) i5-4460  CPU @ 3.20GHz (3192.68-MHz K8-class CPU)
  • SMT/HTT: Disabled in BIOS
  • 8GB RAM
  • FreeBSD-HEAD amd64
  • Same kernel config, loader and sysctl config as the server
  • ix0: configured as 10.11.2.2/24, 10.11.2.3/32, 10.11.2.4/32, 10.11.2.32/32, 10.11.2.33/32
The client runs 'udp-clt' programs to source and sink traffic to the server.

Running things

The server-side simply runs the listen server, configured to respond to each frame:

$ rss-udp-srv 1 10.11.2.1

The client-side runs four couples of udp-clt, each from different IP addresses. These are run in parallel (i do it in different screens, so I can quickly see what's going on):

$ ./udp-clt -l 10.11.2.3 -r 10.11.2.1 -p 8080 -n 10000000000 -s 510
$ ./udp-clt -l 10.11.2.4 -r 10.11.2.1 -p 8080 -n 10000000000 -s 510
$ ./udp-clt -l 10.11.2.32 -r 10.11.2.1 -p 8080 -n 10000000000 -s 510
$ ./udp-clt -l 10.11.2.33 -r 10.11.2.1 -p 8080 -n 10000000000 -s 510

The IP addresses are chosen so that the 2-tuple topelitz hash using the default Microsoft key hash to different RSS buckets that live on individual CPUs.

Results: Round one

When the server is responding to each frame, the following occurs. The numbers are "number of frames generated by the client (netstat)", "number of frames received by the server (netstat)", "number of frames seen by udp-rss-srv", "number of responses transmitted from udp-rss-srv", "number of frames seen by the server (netstat)"
  • 1 udp-clt process: 710,000; 710,000; 296,000; 283,000; 281,000
  • 2 udp-clt processes: 1,300,000; 1,300,000; 592,000; 592,000; 575,000
  • 3 udp-clt processes: 1,800,000; 1,800,000; 636,000; 636,000; 600,000
  • 4 udp-clt processes: 2,100,000; 2,100,000; 255,000; 255,000; 255,000
So, it's not actually linear past two cores. The question here is: why?

There are a couple of parts to this.

Firstly - I had left turbo boost on. What this translated to:

  • One core active: ~ 30% increase in clock speed
  • Two cores active: ~ 30% increase in clock speed
  • Three cores active: ~ 25% increase in clock speed
  • Four cores active: ~ 15% increase in clock speed.
Secondly and more importantly - I had left flow control enabled. This made a world of difference.

The revised results are mostly linear - with more active RSS buckets (and thus CPUs) things seem to get slightly more efficient:
  • 1 udp-clt process: 710,000; 710,000; 266,000; 266,000; 266,000
  • 2 udp-clt processes: 1,300,000; 1,300,000; 512,000; 512,000; 512,000
  • 3 udp-clt processes: 1,800,000; 1,800,000; 810,000; 810,000; 810,000
  • 4 udp-clt processes: 2,100,000; 2,100,000; 1,120,000; 1,120,000; 1,120,000

Finally, let's repeat the process but only receiving instead also echoing back the packet to the client:

$ rss-udp-srv 0 10.11.2.1
  • 1 udp-clt process: 710,000; 710,000; 204,000
  • 2 udp-clt processes: 1,300,000; 1,300,000; 378,000
  • 3 udp-clt processes: 1,800,000; 1,800,000; 645,000
  • 4 udp-clt processes: 2,100,000; 2,100,000; 900,000
The receive-only workload is actually worse off versus the transmit + receive workload!

What's going on here?

Well, a little digging shows that in both instances - even with a single udp-clt thread running which means only one CPU on the server side is actually active! - there's active lock contention.

Here's an example dtrace output for measuring lock contention with only one active process, where one CPU is involved (and the other three are idle):

Receive only, 5 seconds:

root@adrian-hackbox:/home/adrian/git/github/erikarn/freebsd-rss # dtrace -n 'lockstat:::adaptive-block { @[stack()] = sum(arg1); }'
dtrace: description 'lockstat:::adaptive-block ' matched 1 probe
^C


              kernel`udp_append+0x11c
              kernel`udp_input+0x8cc
              kernel`ip_input+0x116
              kernel`netisr_dispatch_src+0x1cb
              kernel`ether_demux+0x123
              kernel`ether_nh_input+0x34d
              kernel`netisr_dispatch_src+0x61
              kernel`ether_input+0x26
              kernel`ixgbe_rxeof+0x2f7
              kernel`ixgbe_msix_que+0xb6
              kernel`intr_event_execute_handlers+0x83
              kernel`ithread_loop+0x96
              kernel`fork_exit+0x71
              kernel`0xffffffff80cd19de
         46729281

Transmit + receive, 5 seconds:

dtrace: description 'lockstat:::adaptive-block ' matched 1 probe
^C


              kernel`knote+0x7e
              kernel`sowakeup+0x65
              kernel`udp_append+0x14a
              kernel`udp_input+0x8cc
              kernel`ip_input+0x116
              kernel`netisr_dispatch_src+0x1cb
              kernel`ether_demux+0x123
              kernel`ether_nh_input+0x34d
              kernel`netisr_dispatch_src+0x61
              kernel`ether_input+0x26
              kernel`ixgbe_rxeof+0x2f7
              kernel`ixgbe_msix_que+0xb6
              kernel`intr_event_execute_handlers+0x83
              kernel`ithread_loop+0x96
              kernel`fork_exit+0x71
              kernel`0xffffffff80cd19de
             3793

              kernel`udp_append+0x11c
              kernel`udp_input+0x8cc
              kernel`ip_input+0x116
              kernel`netisr_dispatch_src+0x1cb
              kernel`ether_demux+0x123
              kernel`ether_nh_input+0x34d
              kernel`netisr_dispatch_src+0x61
              kernel`ether_input+0x26
              kernel`ixgbe_rxeof+0x2f7
              kernel`ixgbe_msix_que+0xb6
              kernel`intr_event_execute_handlers+0x83
              kernel`ithread_loop+0x96
              kernel`fork_exit+0x71
              kernel`0xffffffff80cd19de
          3823793

              kernel`ixgbe_msix_que+0xd3
              kernel`intr_event_execute_handlers+0x83
              kernel`ithread_loop+0x96
              kernel`fork_exit+0x71
              kernel`0xffffffff80cd19de
          9918140

Somehow it seems there's less lock contention / blocking going on when both transmit and receive is running!

So then I dug into it using the lock profiling suite. This is for 5 seconds with receive-only traffic on a single RSS bucket / CPU (all other CPUs are idle):

# sysctl debug.lock.prof.enable = 1; sleep 5 ; sysctl debug.lock.prof.enable=0

root@adrian-hackbox:/home/adrian/git/github/erikarn/freebsd-rss # sysctl debug.lock.prof.enable=1 ; sleep 5 ; sysctl debug.lock.prof.enable=0
debug.lock.prof.enable: 1 -> 1
debug.lock.prof.enable: 1 -> 0

root@adrian-hackbox:/home/adrian/git/github/erikarn/freebsd-rss # sysctl debug.lock.prof.stats | head -2 ; sysctl debug.lock.prof.stats | sort -nk4 | tail -10
debug.lock.prof.stats: 
     max  wait_max       total  wait_total       count    avg wait_avg cnt_hold cnt_lock name
    1496         0       10900           0          28    389      0  0      0 /usr/home/adrian/work/freebsd/head/src/sys/dev/usb/usb_device.c:2755 (sx:USB config SX lock)
debug.lock.prof.stats: 
       0         0          31           1          67      0      0  0      4 /usr/home/adrian/work/freebsd/head/src/sys/kern/sched_ule.c:888 (spin mutex:sched lock 2)
       0         0        2715           1       49740      0      0  0      7 /usr/home/adrian/work/freebsd/head/src/sys/dev/random/random_harvestq.c:294 (spin mutex:entropy harvest mutex)
       1         0          51           1         131      0      0  0      2 /usr/home/adrian/work/freebsd/head/src/sys/kern/sched_ule.c:1179 (spin mutex:sched lock 1)
       0         0          69           2         170      0      0  0      8 /usr/home/adrian/work/freebsd/head/src/sys/kern/sched_ule.c:886 (spin mutex:sched lock 2)
       0         0       40389           2      287649      0      0  0      8 /usr/home/adrian/work/freebsd/head/src/sys/kern/kern_intr.c:1359 (spin mutex:sched lock 2)
       0         2           2           4          12      0      0  0      2 /usr/home/adrian/work/freebsd/head/src/sys/dev/usb/usb_device.c:2762 (sleep mutex:Giant)
      15        20        6556         520        2254      2      0  0    105 /usr/home/adrian/work/freebsd/head/src/sys/dev/acpica/Osd/OsdSynch.c:535 (spin mutex:ACPI lock (0xfffff80002b10f00))
       4         5      195967       65888     3445501      0      0  0  28975 /usr/home/adrian/work/freebsd/head/src/sys/netinet/udp_usrreq.c:369 (sleep mutex:so_rcv)

Notice the lock contention for the so_rcv (socket receive buffer) handling? What's going on here is pretty amusing - it turns out that because there's so much receive traffic going on, the userland process receiving the data is being preempted by the NIC receive thread very often - and when this happens, there's a good chance it's going to be within the small window that the receive socket buffer lock is held. Once this happens, the NIC receive thread processes frames until it gets to one that requires it to grab the same sock buffer lock that is already held by userland - and it fails - so the NIC thread sleeps until the userland thread finishes consuming a packet. Then the CPU flips back to the NIC thread and continues processing a packet.

When the userland code is also transmitting frames it's increasing the amount of time in between socket receives and decreasing the probability of hitting the lock contention condition above.

Note there's no contention between CPUs here - this is entirely contention within a single CPU.

So for now I'm happy that the UDP IPv4 path is scaling well enough with RSS on a single core. The main performance problem here is the socket receive buffer locking (and, yes, copyin() / copyout().)

Next!

No comments:

Post a Comment