Bufferbloat & SQM & Omnia

I use TOS 5.2.1 on a Turris Omnia 2019/2020 (I still don’t understand what year it is). My line is an asymmetric FTTH 900/100 Mb/s. After a clean flash of the system from a USB key, I checked the level of bufferbloat on https://www.waveform.com/tools/bufferbloat; no additional software is installed on the router, WiFi is disabled, and the test runs over an Ethernet cable to a PC. The result of the load test is +40 ms download delay and +50 ms upload delay. At this point I installed the SQM module with its LuCI interface and set it up following this guide: https://openwrt.org/docs/guide-user/network/traffic-shaping/sqm-details.
The upload delay immediately dropped to +0 ms after setting the shaper to 95% of the line’s maximum 100 Mb/s. The download delay dropped significantly, settling at +9 ms.
I wanted better, so I kept lowering the download shaper, eventually shaving 45% off the maximum speed of 900 Mb/s. That got me a download delay of +3-5 ms. I couldn’t do better.
My question is: does this happen due to the Omnia’s CPU limitations, or is there some other kind of problem? Suggestions welcome.


It would be helpful if you could post the “share your results” link from after the test; here is an example:
https://www.waveform.com/tools/bufferbloat?test-id=8670bf71-99c1-4f52-9eec-fe8a42b5c22d

SQM’s AQMs all use a 5 ms delay target, and the observation is that under load you can empirically expect roughly twice the target as added delay (and theoretically at least one times the target), i.e. roughly 5-10 ms. So with its default configuration (which is suitable for use over the wider internet), SQM will not give you much better than what you saw.

Download shaping is a bit approximate (since it typically happens on the wrong side of the bottleneck, there is always the chance that too many packets rush in during a given time interval, queue up in the under-managed and over-sized upstream buffers, and cause increased bufferbloat). That said, for bidirectionally saturating traffic the Omnia tops out at around 500/500 Mbps (and only after adding option packet_steering '1' to the global section in /etc/config/network).
So you could try the latter, but I guess you will still end up with >= ~5 ms delay unless you set a lower target/interval (how to do this depends on which AQM you use, cake or HTB+fq_codel).
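For what it’s worth, here is a rough sketch of how that could be done via SQM’s per-direction qdisc option strings in /etc/config/sqm (untested on TOS 5.2; the queue index and the rtt value are placeholders you would need to adapt):

uci set sqm.@queue[0].iqdisc_opts='rtt 50ms'   # ingress; cake derives target/interval from rtt
uci set sqm.@queue[0].eqdisc_opts='rtt 50ms'   # egress
uci commit sqm && /etc/init.d/sqm restart

For HTB+fq_codel the option strings would instead carry fq_codel’s explicit keywords, e.g. 'target 2.5ms interval 50ms'.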


Sorry, due to very limited knowledge in that area I can’t provide any answers, but I think this is an interesting topic. Are the Turris OS defaults not very good in terms of buffer management and shaping?
A bit of documentation would be great, at least on how to test one’s connection and how to make improvements for some particular use cases…

I can’t imagine the Turris OS defaults being optimal out of the box; you always have to fine-tune SQM/cake in relation to the parameters of your connection.


As @Comodore125 wrote, no; a proper shaper needs two to three parameters to work reliably, and these are hard to measure quickly for a given link:

  1. shaper rate: the gross rate of the link (or, to get a bit more slack, something like 95% of the relevant gross bottleneck rate)
  2. per-packet overhead: how much additional data each packet with a given payload size requires
  3. MPU: the minimal packet size; some link layers like Ethernet pad small packets, so the smallest payload size is 46 bytes (IIRC)

So for SQM to actually work, it requires user intervention, and hence even if you install SQM it should be disabled by default, because it requires sensible configuration before activating it.
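As an illustration (a hedged sketch only; the interface name, rates, and overhead below are placeholders for a 900/100 line, not recommendations), the three parameters map onto /etc/config/sqm roughly like this:

uci set sqm.@queue[0].interface='eth2'     # WAN interface, device specific
uci set sqm.@queue[0].download='855000'    # ~95% of 900 Mb/s, in kbit/s
uci set sqm.@queue[0].upload='95000'       # ~95% of 100 Mb/s, in kbit/s
uci set sqm.@queue[0].linklayer='ethernet'
uci set sqm.@queue[0].overhead='34'        # per-packet overhead, link dependent
uci set sqm.@queue[0].enabled='1'
uci commit sqm && /etc/init.d/sqm restart

The MPU can additionally be passed to cake via its option string (e.g. 'mpu 64' in iqdisc_opts/eqdisc_opts).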

Have a look at the links @lucenera posted:

https://openwrt.org/docs/guide-user/network/traffic-shaping/sqm

and:

https://openwrt.org/docs/guide-user/network/traffic-shaping/sqm-details

If you still have specific questions afterwards, just post here (or in a new thread).


Thank you for the replies - much appreciated.

Hi, this might be useful:
Speedtest:
https://www.speedtest.net/result/c/98587582-06ba-47ba-974e-a741834785b7
Bufferbloat without SQM (cake + piece_of_cake.qos):
https://www.waveform.com/tools/bufferbloat?test-id=265cf84d-1735-4eed-82e5-e2cca04bd38c
Bufferbloat with SQM (cake + piece_of_cake.qos):
https://www.waveform.com/tools/bufferbloat?test-id=d129b4e9-8f9f-48ec-8ce7-431b952441a9

packet_steering is only available on HBD or crashlab (OpenWrt 21.02-rcX or master). Unfortunately I can’t use either of them at the moment, because on the Omnia they break the WiFi, which is vital for me.

To achieve the best download result I removed 60% of the maximum speed.
This is the SQM configuration: https://termbin.com/kjynr.


Ah, okay, you can manually set this:

## TURRIS omnia receive side scaling:
for file in /sys/class/net/*
do
    echo 3 > "$file/queues/rx-0/rps_cpus"   # hex mask 3 = CPU0 + CPU1
    echo 3 > "$file/queues/tx-0/xps_cpus"
done

and also read back the current state:

for file in /sys/class/net/*
do
    cat "$file/queues/rx-0/rps_cpus"
    cat "$file/queues/tx-0/xps_cpus"
done

While not hotplug robust, the setting part can be added to /etc/rc.local…

Alternatively/additionally it might be helpful to install and enable irqbalance…
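On an OpenWrt-based system that would look roughly like this (a sketch; depending on the package version, irqbalance may additionally need to be enabled in /etc/config/irqbalance):

opkg update && opkg install irqbalance
/etc/init.d/irqbalance enable    # start on every boot
/etc/init.d/irqbalance start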

irqbalance already installed and activated.

What exactly does it do? Also, I can activate it by running it from the command line, but if I add the commands to rc.local they are not executed at startup.

See, e.g., https://doc.opensuse.org/documentation/leap/archive/42.1/tuning/html/book.sle.tuning/cha.tuning.network.html#sec.tuning.network.rps; this tells the system which CPUs to use for packet processing. It allows excluding the CPU that handles a network interface’s interrupts from further processing, which is what the default value of 2 does: it pushes all packet receive processing to CPU1 while CPU0 handles the interrupts. It just turns out that, at least with SQM, the Omnia performs better when both CPUs are allowed to do the processing; otherwise CPU1 gets fully loaded while CPU0 still has cycles to spare.

Yes, you can copy and paste these into a terminal window on the Omnia and run them from the shell…

The commands work from the CLI, but the values revert to the previous ones after a few seconds. So either something resets them cyclically every x seconds, or the setting is irrelevant. Even irqbalance on the Omnia’s Marvell processor doesn’t seem to have much effect; it does a little, as it pleases. Perhaps the test result is also skewed by the laptop I use (Intel i7 10th gen), by the operating system (Arch Linux), or by the browser (Firefox). There are too many factors that can affect everything. It would take a test run directly on the router, without going through any other interface or device. The fact is that to get the best result, according to my measurements, I have to go down from ~900 Mb/s download to ~300 Mb/s; that is the only way to get an A+.

Ah, yes, you are right; this is because on old OpenWrt versions (like the one TOS 5.2 is based on)
/etc/hotplug.d/net/20-smp-tune
is the script responsible for trying to do the right thing, but it fails on the Omnia. Maybe change:

        for q in ${dev}/queues/rx-*; do
                set_hex_val "$q/rps_cpus" "$(($PROC_MASK & ~$irq_cpu_mask))"
        done

to

        for q in ${dev}/queues/rx-*; do
                set_hex_val "$q/rps_cpus" "3"
        done

That might work until the next TOS update…

I think the network interface interrupts are hardcoded on CPU0, so irqbalance will not be able to move those…
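You can sanity-check that from the shell (the exact interrupt names vary by device, so treat this as a rough check):

cat /proc/interrupts    # the ethernet lines should only show growing counts in the CPU0 column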

But that will not really test whether your router can actually route at the desired rates… (also, measurement code like iperf and friends is relatively costly and will compete with SQM and friends for CPU cycles).
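If you do want a routing test, something like iperf3 between two hosts on opposite sides of the Omnia, with the router only forwarding, might work (a rough sketch; 192.0.2.10 is a placeholder for the WAN-side host):

# on the WAN-side host:
iperf3 -s
# on the LAN host, through the router:
iperf3 -c 192.0.2.10 -t 30       # upstream through the router
iperf3 -c 192.0.2.10 -t 30 -R    # downstream (reverse direction)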

Well, that sounds harsh; what do you get with SQM set to 500 Mb/s?

If I try 500 Mb/s I get an A, but with problems in games that require low latency when the network is under load.

Are you running anything CPU-intensive like pakon?

The only thing I have installed is the dynamic firewall, nothing else, and that’s just a set of exclusion rules, nothing CPU-heavy. I know how heavy Pakon is. It seems so strange to me that I have to shave off so much to get a decent result on https://www.waveform.com/tools/bufferbloat. When I was able to use HBD (OpenWrt 21.02), in early builds where WiFi still worked, I got a lot more performance. It must be the kernel or the system in general. Evidently the version jump will do the Omnia a lot of good, as well as letting me use the SFP module again, which has not been compatible since the Turris OS 3.x releases.

Similar for me: I have to reduce the download from ~900 Mb/s to 400 Mb/s to get rid of bufferbloat. I also set both CPUs for processing.

Interesting; I am on a 100/40 link and hence cannot easily test this (also, I run pakon, but I could disable that for a test).

Are you both on GPON?

I am on GPON. At the speed @moeller0 is at, I too am able to achieve excellent results. Before, I had VDSL2 and a Linksys with OpenWrt and SQM with cake + piece_of_cake.qos, and I achieved excellent results by shaving the famous 5% off the maximum speed. But as you get closer to Gigabit, everything worsens in proportion: both the bufferbloat part and the WiFi part; the hardware shows its limitations. At least the processor and routing part already improves a lot with OpenWrt 21.02. We are confidently waiting for that to land in HBS (it will take a while longer, I think).


Sure, for one it is hard to get over WiFi the ~940-950 Mbps TCP/IP goodput that ethernet/GPON can deliver, and WiFi adds considerably to the bufferbloat. OpenWrt on ath9k and ath10k has some mitigation techniques, but these still add around 10-20 ms (or even 20-40 ms, I do not recall exactly) to latency under load. The point is that WiFi is half-duplex, so if WiFi is saturated you will immediately see queueing for both up- and downstream, which already doubles the latency under load; plus the latency target for WiFi is larger (to allow for the normal levels of aggregation, otherwise throughput would tank)…

In short, if you measure over WiFi, 5 ms latency under load is only achievable if you shape hard enough that WiFi is never saturated…