SQM on Turris: flent benchmarks?

another good comparison is with BBR.

modprobe tcp_bbr

Add this to the command line:

--test-parameter cc_algos=bbr,bbr,bbr,bbr
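
Put together, a full invocation might look like the sketch below. This is only an illustration: the host names and test name are the ones used later in this thread, and the exact placement of the option is an assumption.

modprobe tcp_bbr
flent -x --socket-stats --step-size=.05 --test-parameter cc_algos=bbr,bbr,bbr,bbr -H de.starlink.taht.net -H london.starlink.taht.net -H singapore.starlink.taht.net -H fremont.starlink.taht.net -t rtt_fair4be-bbr rtt_fair4be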

1 Like

I’m on 1Gbps symmetric (Fiber 7) with a Turris Omnia running Turris OS 6.0.

I configured SQM following the instructions at How to use the cake queue management system on the Turris Omnia - SW tweaks - Turris forum. I’m not sure I did it right, as my performance was worse than the ~600 Mbps mentioned in the first post. I installed flent on my laptop, which was connected via Gigabit Ethernet to the Turris Omnia. A browser-based speed test to a server in the same city, but outside my provider’s network, resulted in 1 ms latency, 721.22 Mbps download and 817.95 Mbps upload.
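
(For reference only, a rough sketch of what such a setup can look like rather than what the linked guide actually prescribes: with the sqm-scripts package on an OpenWrt-based system, a cake configuration shaping eth2 to the 900/900 Mbit values from the test title below would go in /etc/config/sqm roughly like this. The interface name and shaper rates are assumptions taken from this thread.)

# /etc/config/sqm -- hypothetical example, not necessarily what the guide uses
config queue 'wan'
        option enabled '1'
        option interface 'eth2'        # WAN interface on this Omnia
        option download '900000'       # ingress shaper rate in kbit/s
        option upload '900000'         # egress shaper rate in kbit/s
        option qdisc 'cake'
        option script 'piece_of_cake.qos'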

I failed to run tests with BBR on flent 2.0.1. If you have an example command line, I’m happy to try again.

Command line (only test name was changed):

flent -x --socket-stats --step-size=.05 -H de.starlink.taht.net -H london.starlink.taht.net -H singapore.starlink.taht.net -H fremont.starlink.taht.net -t rtt_fair4be-SQM_cake_upload900000_download900000_1Gfibre_Omnia rtt_fair4be

Results with SQM:

Summary of rtt_fair4be test run from 2022-04-28 18:33:37.548043
  Title: 'rtt_fair4be-SQM_cake_upload900000_download900000_1Gfibre_Omnia'

                                                         avg       median          # data pts
 Ping (ms) ICMP1 de.starlink.taht.net         :        15.71        16.60 ms             1399
 Ping (ms) ICMP2 london.starlink.taht.net     :        26.55        27.70 ms             1399
 Ping (ms) ICMP3 singapore.starlink.taht.net  :       246.76       248.00 ms             1399
 Ping (ms) ICMP4 fremont.starlink.taht.net    :       161.56       163.00 ms             1399
 Ping (ms) avg                                :       112.65          N/A ms             1399
 TCP download BE1 de.starlink.taht.net        :        58.72        57.37 Mbits/s        1399
 TCP download BE2 london.starlink.taht.net    :        52.64        52.19 Mbits/s        1399
 TCP download BE3 singapore.starlink.taht.net :        33.20        34.81 Mbits/s        1399
 TCP download BE4 fremont.starlink.taht.net   :         3.12         2.26 Mbits/s        1399
 TCP download avg                             :        36.92          N/A Mbits/s        1399
 TCP download fairness                        :         0.74          N/A Mbits/s        1399
 TCP download sum                             :       147.68          N/A Mbits/s        1399
 TCP upload BE1 de.starlink.taht.net          :       221.19       219.27 Mbits/s        1399
 TCP upload BE2 london.starlink.taht.net      :       204.03       202.76 Mbits/s        1399
 TCP upload BE3 singapore.starlink.taht.net   :        22.29        22.61 Mbits/s        1399
 TCP upload BE4 fremont.starlink.taht.net     :        19.54        19.40 Mbits/s        1399
 TCP upload avg                               :       116.76          N/A Mbits/s        1399
 TCP upload fairness                          :         0.60          N/A Mbits/s        1399
 TCP upload sum                               :       467.05          N/A Mbits/s        1399

Results without SQM:

Summary of rtt_fair4be test run from 2022-04-28 18:37:28.240879
  Title: 'rtt_fair4be-no_SQM_1Gfibre_Omnia'

                                                         avg       median          # data pts
 Ping (ms) ICMP1 de.starlink.taht.net         :         9.77         9.57 ms             1400
 Ping (ms) ICMP2 london.starlink.taht.net     :        20.57        20.50 ms             1400
 Ping (ms) ICMP3 singapore.starlink.taht.net  :       240.98       241.00 ms             1400
 Ping (ms) ICMP4 fremont.starlink.taht.net    :       155.85       156.00 ms             1400
 Ping (ms) avg                                :       106.79          N/A ms             1400
 TCP download BE1 de.starlink.taht.net        :       170.24       167.42 Mbits/s        1400
 TCP download BE2 london.starlink.taht.net    :       147.30       134.62 Mbits/s        1400
 TCP download BE3 singapore.starlink.taht.net :        22.91        23.19 Mbits/s        1400
 TCP download BE4 fremont.starlink.taht.net   :        38.33        29.72 Mbits/s        1400
 TCP download avg                             :        94.70          N/A Mbits/s        1400
 TCP download fairness                        :         0.68          N/A Mbits/s        1400
 TCP download sum                             :       378.78          N/A Mbits/s        1400
 TCP upload BE1 de.starlink.taht.net          :       509.68       522.90 Mbits/s        1400
 TCP upload BE2 london.starlink.taht.net      :       137.92       142.44 Mbits/s        1400
 TCP upload BE3 singapore.starlink.taht.net   :        50.08        53.03 Mbits/s        1400
 TCP upload BE4 fremont.starlink.taht.net     :        67.82        71.19 Mbits/s        1400
 TCP upload avg                               :       191.38          N/A Mbits/s        1400
 TCP upload fairness                          :         0.51          N/A Mbits/s        1400
 TCP upload sum                               :       765.50          N/A Mbits/s        1400

I’m happy to run further tests and to share the .gz files if this would be useful, but I’d appreciate some guidance (links to detailed instructions?) on how to run the tests and how to configure things.

P.S. Thanks for the years of work on queuing and the entertaining talks. I was happy to contribute data points after seeing the Turris tests linked in an article shared on LWN.net.

1 Like

That isn’t very symmetric, is it? An ideal result in this case would be about 870 Mbit up and 870 Mbit down simultaneously. Even without SQM the total is closer to just a gigabit.

In both cases you are most likely running out of CPU. Very few tests “out there” try to test both directions at the same time, and fewer vendors do (“a gigabit download! Ship it!”). You might be able to get better performance by increasing the size of the RX rings on the Omnia (using ethtool), and/or by trying BBR, which is less sensitive to packet loss. I’m very interested in what goes wrong on this hardware, even without SQM on, but that involves various other tracing and optimization tools. irqbalance sometimes helps.
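
(A quick sketch of the ethtool commands involved, assuming eth2 is the WAN interface; the value is only an example and must stay within the driver’s reported maximum.)

ethtool -g eth2          # show pre-set maximums and current ring sizes
ethtool -G eth2 rx 512   # raise the RX ring towards the reported maximum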

In either case, the nearly ruler-flat latency line indicates you have no bufferbloat at these speeds with this hardware, because you are losing too many packets.

There are tests in the flent suite that test just up or just down.

1 Like

Some years ago I tested SQM against a local netperf server through my Omnia configured as a router (NAT, firewall, but no PPPoE). After enabling packet steering for all CPUs (changing the default config to include the interrupt-processing CPUs in the set of eligible CPUs), I managed to get at best 550/550 through RRUL. But again, that was with a local server with really low RTT, and nothing fancy running on the Omnia (WiFi was activated but no station was connected).

1 Like

I confirm there are CPU limitations on the Omnia during the flent test, even with SQM disabled. Top shows a single core at 99% (and the other mostly idle…).

Ring buffers seem to be already at the maximum:

root@turris:~# ethtool -g eth2
Ring parameters for eth2:
Pre-set maximums:
RX:		512
RX Mini:	0
RX Jumbo:	0
TX:		1024
Current hardware settings:
RX:		512
RX Mini:	0
RX Jumbo:	0
TX:		1024

Interesting that the TX ring is bigger and that TX performance is better.

I installed and started irqbalance on the Omnia, but I saw no change in behaviour. All network-related IRQs seem to be serviced on CPU0. (eth2 is the external interface, eth1 is connected to the ethernet switch: About - Turris Documentation)

root@turris:~# cat /proc/interrupts
           CPU0       CPU1
 17:          0          0     GIC-0  27 Edge      gt
 18:    3261296    1231854     GIC-0  29 Edge      twd
 19:          0          0      MPIC   5 Level     armada_370_xp_per_cpu_tick
 20:          0          0      MPIC   3 Level     arm-pmu
 21:     767272          0     GIC-0  34 Level     mv64xxx_i2c
 22:         30          0     GIC-0  44 Level     ttyS0
 33:          0          0     GIC-0  41 Level     f1020300.watchdog
 37:          0          0     GIC-0  96 Level     f1020300.watchdog
 38:         52          0      MPIC   8 Level     eth0
 39:   12361593          0      MPIC  10 Level     eth1
 40:    5626500          1      MPIC  12 Level     eth2
 41:          0          0     GIC-0  50 Level     ehci_hcd:usb1
 42:          0          0     GIC-0  51 Level     f1090000.crypto
 43:          0          0     GIC-0  52 Level     f1090000.crypto
 44:          0          0     GIC-0  53 Level     f10a3800.rtc
 45:          0          0     GIC-0  58 Level     ahci-mvebu[f10a8000.sata]
 46:       6526          0     GIC-0  57 Level     mmc0
 47:          0          0     GIC-0  48 Level     xhci-hcd:usb2
 48:          0          0     GIC-0  49 Level     xhci-hcd:usb4
 50:          2          0     GIC-0  54 Level     f1060800.xor
 51:          2          0     GIC-0  97 Level     f1060900.xor
 52:          7          0  f1018140.gpio  13 Level     f1072004.mdio-mii:10
 56:          0          0  mv88e6xxx-g1   3 Edge      mv88e6xxx-g1-atu-prob
 58:          0          0  mv88e6xxx-g1   5 Edge      mv88e6xxx-g1-vtu-prob
 60:          0          7  mv88e6xxx-g1   7 Edge      mv88e6xxx-g2
 62:          0          7  mv88e6xxx-g2   0 Edge      mv88e6xxx-1:00
 63:          0          0  mv88e6xxx-g2   1 Edge      mv88e6xxx-1:01
 64:          0          0  mv88e6xxx-g2   2 Edge      mv88e6xxx-1:02
 65:          0          0  mv88e6xxx-g2   3 Edge      mv88e6xxx-1:03
 66:          0          0  mv88e6xxx-g2   4 Edge      mv88e6xxx-1:04
 77:          0          0  mv88e6xxx-g2  15 Edge      mv88e6xxx-watchdog
 78:          0          0  f1018140.gpio  14 Level     8-0071
 79:          0          0    8-0071   4 Edge      sfp
 80:          0          0    8-0071   3 Edge      sfp
 81:          0          0    8-0071   0 Edge      sfp
 83:     478715          0  MPIC MSI 1048576 Edge      ath10k_pci
 84:    1416655          0     GIC-0  61 Level     ath9k
IPI0:          0          1  CPU wakeup interrupts
IPI1:          0          0  Timer broadcast interrupts
IPI2:      81390     366527  Rescheduling interrupts
IPI3:         76       1402  Function call interrupts
IPI4:          0          0  CPU stop interrupts
IPI5:          0          0  IRQ work interrupts
IPI6:          0          0  completion interrupts
Err:          0

Hmm, I wonder whether (and how) I can configure eth2 to use CPU1 instead. And I need to look at “enabling packet steering for all CPUs (changing the default config to include the interrupt processing CPUs in the set of eligible CPUs)” as mentioned by moeller0.
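
(An untested sketch of both knobs, assuming IRQ 40 is eth2 as shown in the /proc/interrupts output above: the hardware interrupt can be pinned with an smp_affinity bitmask, and receive packet steering can be spread over both cores by writing a CPU bitmask into each RX queue’s rps_cpus file.)

# pin eth2's hardware IRQ (IRQ 40 above) to CPU1 (bitmask 0x2)
echo 2 > /proc/irq/40/smp_affinity

# let both CPUs (bitmask 0x3) do receive packet steering for eth2
for q in /sys/class/net/eth2/queues/rx-*/rps_cpus; do echo 3 > "$q"; done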

To exclude path limitations: bidirectional iperf to a speedtest server inside my ISP’s network gives 893 Mbps up and 912 Mbps down; a single direction is ~920 Mbps.

I’m not sure how to configure BBR with flent. man flent points me at --test-parameter tcp_cong_control=bbr, but execution fails no matter where in the flent command line I place it (ERROR: Found no hostnames on lookup of --test-parameter if placed at the end; flent: error: unrecognized arguments: rtt_fair4be if placed after -x).

I think I forced bbr on the machine running flent using sysctl, but I saw no difference in performance:

# sysctl net.ipv4.tcp_allowed_congestion_control
net.ipv4.tcp_allowed_congestion_control = bbr
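
(Note that net.ipv4.tcp_allowed_congestion_control only lists the algorithms unprivileged sockets may select; the system-wide default is a separate sysctl. A quick check and set on the flent host would be:)

sysctl net.ipv4.tcp_congestion_control          # show the current default
sysctl -w net.ipv4.tcp_congestion_control=bbr   # make bbr the default for new connections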

Oh, all tests were using IPv6 (no NAT). There are some idle devices connected to WiFi. I have the old 1 GB Omnia.

I’ll look at the interrupt servicing / packet steering changes to allow multi-core usage and will report back.

1 Like

For anyone else interested in more background:

I enabled packet steering, and after rebooting the device I can see it’s enabled:

root@turris:~# grep "" /sys/class/net/eth2/queues/*/rps_cpus
/sys/class/net/eth2/queues/rx-0/rps_cpus:3
/sys/class/net/eth2/queues/rx-1/rps_cpus:3
/sys/class/net/eth2/queues/rx-2/rps_cpus:3
/sys/class/net/eth2/queues/rx-3/rps_cpus:3
/sys/class/net/eth2/queues/rx-4/rps_cpus:3
/sys/class/net/eth2/queues/rx-5/rps_cpus:3
/sys/class/net/eth2/queues/rx-6/rps_cpus:3
/sys/class/net/eth2/queues/rx-7/rps_cpus:3
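
(For reference, on recent OpenWrt-based builds packet steering can typically be toggled through the network config rather than by writing the sysfs files directly; a hedged sketch, since Turris OS may expose this differently:)

uci set network.globals.packet_steering='1'
uci commit network
/etc/init.d/network reload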

With packet steering enabled, I see 80-90% usage on CPU0 and 50-60% on CPU1 on the Omnia when running flent on a machine connected via Ethernet.

Performance doesn’t look much better, though.

Summary of rtt_fair4be test run from 2022-05-01 08:18:13.053230
  Title: 'flent-test'

                                                         avg       median          # data pts
 Ping (ms) ICMP1 de.starlink.taht.net         :         8.86         8.63 ms             1399
 Ping (ms) ICMP2 london.starlink.taht.net     :        19.67        19.50 ms             1399
 Ping (ms) ICMP3 singapore.starlink.taht.net  :       240.22       240.00 ms             1399
 Ping (ms) ICMP4 fremont.starlink.taht.net    :       154.82       155.00 ms             1399
 Ping (ms) avg                                :       105.89          N/A ms             1399
 TCP download BE1 de.starlink.taht.net        :       188.19       170.28 Mbits/s        1399
 TCP download BE2 london.starlink.taht.net    :       128.17       118.57 Mbits/s        1399
 TCP download BE3 singapore.starlink.taht.net :        15.74         8.62 Mbits/s        1399
 TCP download BE4 fremont.starlink.taht.net   :        57.33        37.57 Mbits/s        1399
 TCP download avg                             :        97.36          N/A Mbits/s        1399
 TCP download fairness                        :         0.68          N/A Mbits/s        1399
 TCP download sum                             :       389.43          N/A Mbits/s        1399
 TCP upload BE1 de.starlink.taht.net          :       459.53       460.31 Mbits/s        1399
 TCP upload BE2 london.starlink.taht.net      :       296.28       296.67 Mbits/s        1399
 TCP upload BE3 singapore.starlink.taht.net   :        69.64        79.65 Mbits/s        1399
 TCP upload BE4 fremont.starlink.taht.net     :        81.51        93.69 Mbits/s        1399
 TCP upload avg                               :       226.74          N/A Mbits/s        1399
 TCP upload fairness                          :         0.66          N/A Mbits/s        1399
 TCP upload sum                               :       906.96          N/A Mbits/s        1399

(That’s with default congestion control parameters again.)

The softirq distribution doesn’t look entirely even, but that might be OK:

root@turris:~# cat /proc/softirqs
                    CPU0       CPU1
          HI:          0          0
       TIMER:     195008     127437
      NET_TX:     300123       7943
      NET_RX:    9506939    2409559
       BLOCK:        979       1207
    IRQ_POLL:          0          0
     TASKLET:     258078         67
       SCHED:     187214     152848
     HRTIMER:          0          0
         RCU:      47275      46815
root@turris:~# cat /proc/net/softnet_stat
0095ba30 00000000 000000d3 00000000 00000000 00000000 00000000 00000000 00000000 00000002 00000000
00d42f83 00000238 00000000 00000000 00000000 00000000 00000000 00000000 00000000 0015ef49 00000000

The tenth value shows the number of times a CPU was woken up due to receive packet steering (RPS).

As a next step, I doubled rx-usecs and rx-frames to reduce the number of interrupts:

root@turris:~# ethtool -c eth2
Coalesce parameters for eth2:
Adaptive RX: off  TX: off
stats-block-usecs: 0
sample-interval: 0
pkt-rate-low: 0
pkt-rate-high: 0

rx-usecs: 200
rx-frames: 64
[...]
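
(The values above were presumably applied with ethtool’s coalescing setter, something along these lines, using eth2 and the doubled values shown:)

ethtool -C eth2 rx-usecs 200 rx-frames 64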

This seems to have increased download performance, although CPU usage in top was similar:

Summary of rtt_fair4be test run from 2022-05-01 08:50:34.552201
  Title: 'flent-test'

                                                         avg       median          # data pts
 Ping (ms) ICMP1 de.starlink.taht.net         :         8.70         8.66 ms             1399
 Ping (ms) ICMP2 london.starlink.taht.net     :        19.46        19.50 ms             1399
 Ping (ms) ICMP3 singapore.starlink.taht.net  :       239.79       240.00 ms             1399
 Ping (ms) ICMP4 fremont.starlink.taht.net    :       154.71       155.00 ms             1399
 Ping (ms) avg                                :       105.66          N/A ms             1399
 TCP download BE1 de.starlink.taht.net        :       246.63       245.22 Mbits/s        1399
 TCP download BE2 london.starlink.taht.net    :       153.43       150.81 Mbits/s        1399
 TCP download BE3 singapore.starlink.taht.net :        10.39         7.84 Mbits/s        1399
 TCP download BE4 fremont.starlink.taht.net   :        77.37        61.75 Mbits/s        1399
 TCP download avg                             :       121.95          N/A Mbits/s        1399
 TCP download fairness                        :         0.66          N/A Mbits/s        1399
 TCP download sum                             :       487.82          N/A Mbits/s        1399
 TCP upload BE1 de.starlink.taht.net          :       426.68       416.51 Mbits/s        1399
 TCP upload BE2 london.starlink.taht.net      :       289.02       299.59 Mbits/s        1399
 TCP upload BE3 singapore.starlink.taht.net   :        75.79        78.82 Mbits/s        1399
 TCP upload BE4 fremont.starlink.taht.net     :       116.15       121.60 Mbits/s        1399
 TCP upload avg                               :       226.91          N/A Mbits/s        1399
 TCP upload fairness                          :         0.72          N/A Mbits/s        1399
 TCP upload sum                               :       907.64          N/A Mbits/s        1399

I’ll look at the impact of raising these values further. IRQ coalescing trades off latency for throughput, so it’s not something I’d configure persistently.

1 Like

Raising to 400/128 did not increase the download sum, but it changed the CPU usage: CPU0 showed higher usage (a ~90%/~30% split between CPU0/CPU1 instead of the previous 80/60) and there were periods with both cores at 99% instead of a fairly stable distribution. There might be better values, but since I’ll return to the defaults after testing, I won’t spend time on finding them.

I’ve exhausted what I found in a limited search on the topic. If anyone has ideas on what else to try, please don’t hesitate to suggest tests to run.

1 Like

Quick note: the best achievable TCP/IPv6 throughput for unidirectional traffic over gigabit Ethernet (assuming MTU 1500) is
1000 * ((1500-40-20)/(1500+38)) = 936.28 Mbps
In bidirectional tests, the reverse ACK traffic obviously eats into that. For TCP Reno, the reverse ACK traffic is roughly 1/40 of the forward data traffic in volume (assuming MTU 1500), so 1000/40 = 25 Mbps; more modern TCPs typically emit fewer ACK packets. That means for a link pegged to line rate bidirectionally with TCP Reno, you can expect at best something around 911/911…
On your link, which uses TCPs other than Reno and does not saturate the line, the calculation becomes a bit more challenging, but the point remains that goodput in bidirectional saturating traffic ignores the capacity used by ACK packets…
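
(For anyone wanting to reproduce the arithmetic, a one-liner with the same assumptions: MTU 1500, 38 bytes of Ethernet framing overhead, and Reno-style ACK volume.)

awk 'BEGIN { goodput = 1000*(1500-40-20)/(1500+38); acks = 1000/40; printf "unidirectional: %.2f Mbps, bidirectional: ~%.2f Mbps per direction\n", goodput, goodput - acks }'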

1 Like