Omnia drops a lot of packets over ethernet

yorik · February 23, 2021, 3:07pm

I noticed that roughly every 2 minutes packets get dropped for about 3 s, sometimes for as long as 15 s. During that time the si% (software interrupts) of one CPU goes above 60% and then stays at 30% for 30 s. This affects both WAN<->LAN traffic and LAN<->LAN traffic through the router.

Any ideas what could cause this and how to fix it? It’s very annoying that all the clients sometimes lose connectivity.

I tried to monitor local processes but failed to see any correlation.
Version: TurrisOS 5.1.9, Turris Omnia, but I think it started even before 5.1.8.
It happens on all connected ports; eth2 is SFP.

Drops:

3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 532
    RX: bytes  packets  errors  dropped overrun mcast   
    17199186   61323    0       355     0       0       
    TX: bytes  packets  errors  dropped carrier collsns 
    31419324   69177    0       0       0       0       
4: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 532
    RX: bytes  packets  errors  dropped overrun mcast   
    337254821  360870   0       2409    0       0       
    TX: bytes  packets  errors  dropped carrier collsns 
    16888669   72436    0       0       0       0       
5: lan0@eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br-lan state UP group default qlen 1000
    RX: bytes  packets  errors  dropped overrun mcast   
    1251559    11348    0       5       0       0       
    TX: bytes  packets  errors  dropped carrier collsns 
    1925363    12598    0       0       0       0       
6: lan1@eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br-lan state UP group default qlen 1000
    RX: bytes  packets  errors  dropped overrun mcast   
    11695058   36241    0       151     0       0       
    TX: bytes  packets  errors  dropped carrier collsns 
    25397109   41635    0       0       0       0

Interrupts:

           CPU0       CPU1       
 17:          0          0     GIC-0  27 Edge      gt
 18:     308556     217714     GIC-0  29 Edge      twd
 19:          0          0      MPIC   5 Level     armada_370_xp_per_cpu_tick
 20:          0          0      MPIC   3 Level     arm-pmu
 21:     169478          0     GIC-0  34 Level     mv64xxx_i2c
 22:        145          0     GIC-0  44 Level     ttyS0
 37:          7          0      MPIC   8 Level     eth0
 38:     198292          0      MPIC  10 Level     eth1
 39:     325130          2      MPIC  12 Level     eth2
 40:          0          0     GIC-0  50 Level     ehci_hcd:usb1
 41:          0          0     GIC-0  51 Level     f1090000.crypto
 42:          0          0     GIC-0  52 Level     f1090000.crypto
 43:          0          0     GIC-0  53 Level     f10a3800.rtc
 44:          0          0     GIC-0  58 Level     ahci-mvebu[f10a8000.sata]
 45:      10462          0     GIC-0  57 Level     mmc0
 46:          0          0     GIC-0  48 Level     xhci-hcd:usb2
 47:          0          0     GIC-0  49 Level     xhci-hcd:usb4
 49:          2          0     GIC-0  54 Level     f1060800.xor
 50:          2          0     GIC-0  97 Level     f1060900.xor
 58:          0          8  mv88e6xxx-g1   7 Edge      mv88e6xxx-g2
 60:          0          4  mv88e6xxx-g2   0 Edge      mv88e6xxx-1:00
 61:          0          4  mv88e6xxx-g2   1 Edge      mv88e6xxx-1:01
 62:          0          4  mv88e6xxx-g2   2 Edge      mv88e6xxx-1:02
 63:          0          0  mv88e6xxx-g2   3 Edge      mv88e6xxx-1:03
 64:          0          0  mv88e6xxx-g2   4 Edge      mv88e6xxx-1:04
 75:          0          0  mv88e6xxx-g2  15 Edge      mv88e6xxx-watchdog
 76:          1          0  f1018140.gpio  14 Level     8-0071
 77:          0          0   pca953x   4 Edge      sfp
 78:          0          0   pca953x   3 Edge      sfp
 79:          1          0   pca953x   0 Edge      sfp
 81:     159058          0  MPIC MSI 1048576 Edge      ath10k_pci
 82:     187909          0     GIC-0  61 Level     ath9k
IPI0:          0          1  CPU wakeup interrupts
IPI1:          0          0  Timer broadcast interrupts
IPI2:      39735      40759  Rescheduling interrupts
IPI3:         57     217064  Function call interrupts
IPI4:          0          0  CPU stop interrupts
IPI5:          0          0  IRQ work interrupts
IPI6:          0          0  completion interrupts
Err:          0
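
In the interrupt table above, the eth1 and eth2 counts sit almost entirely in the CPU0 column. A minimal sketch for quantifying that from /proc/interrupts text, assuming Python 3 is available on whatever machine the output is pasted into; the function name irq_cpu_share and the embedded SAMPLE excerpt are made up for illustration:

```python
# Excerpt of the /proc/interrupts output posted above.
SAMPLE = """\
           CPU0       CPU1
 18:     308556     217714     GIC-0  29 Edge      twd
 38:     198292          0      MPIC  10 Level     eth1
 39:     325130          2      MPIC  12 Level     eth2
IPI3:         57     217064  Function call interrupts
Err:          0
"""


def irq_cpu_share(text):
    """Map IRQ label -> (last column of the row, per-CPU share of the count)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    ncpu = len(lines[0].split())  # header row: CPU0 CPU1 ...
    result = {}
    for ln in lines[1:]:
        label, _, rest = ln.partition(":")
        fields = rest.split()
        # Rows such as "Err:" carry fewer columns than there are CPUs.
        if len(fields) <= ncpu or not all(f.isdigit() for f in fields[:ncpu]):
            continue
        counts = [int(f) for f in fields[:ncpu]]
        total = sum(counts)
        if total:  # skip IRQs that never fired
            result[label.strip()] = (fields[-1], [c / total for c in counts])
    return result


for label, (name, shares) in irq_cpu_share(SAMPLE).items():
    print(label, name, " ".join(f"{s:.1%}" for s in shares))
```

This only summarizes what the table already shows; it does not by itself say whether the uneven split is related to the drops.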

yorik · February 24, 2021, 11:54am

Rolling back to TurrisOS 5.1.6 didn’t help, but it looks like rolling back to 4.0.5 did.

ChrisDeath · March 17, 2021, 11:26pm

Just a guess: I had a similar problem with CPU load when my Apple TV 4K was on the network. As soon as I removed it, all was fine again. It may be worth a try. I actually suspected the Apple TV’s IPv6 … but in the end I just don’t use it anymore (Shield TV now).

yorik · March 18, 2021, 11:39am

I don’t have an Apple TV on my network, and there wasn’t a huge load on the network during the problem.
Also, that isn’t an excuse: I think a border router should be stable whatever external or internal clients do.

I have now upgraded my Omnia back to TurrisOS 5.1.10 and got the same problem, but when I switched to the alternative wifi driver it looks like it went away. However, that driver isn’t stable and I’m losing wifi sometimes.

I also removed all services that aren’t strictly necessary (netdata, all data collection, etc.). I’m going to add them back one by one when I have time.

ChrisDeath · March 19, 2021, 2:30pm

Hi, I didn’t write “network load”, just “CPU load”!!! The load on the CPU was high while the network was almost idle. CPU went up to 60%, sometimes 90%, and all other clients got slow while surfing or had issues with streaming.
As you know, I am not part of Turris and just wanted to help, so directing any anger about the router at me is a waste of time.
I am using 5.1.9 with the standard drivers without any issues now. And if I need to use my Apple TV, I use ethernet, as the problem seems to happen less often than with wifi.
If you can reproduce your dropped packets, try it with only one client active…then go on…but as I said, I am not a professional router developer, nor do I want to become one.

ChrisDeath · March 19, 2021, 4:59pm

Anyway, here are my statistics:

3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 532
    RX: bytes  packets  errors  dropped overrun mcast
    36826838164 126304830 0       1239412 0       0
    TX: bytes  packets  errors  dropped carrier collsns
    470462366918 211239335 0       0       0       0
4: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 532
    RX: bytes  packets  errors  dropped overrun mcast
    488072251983 390737590 0       821349  0       0
    TX: bytes  packets  errors  dropped carrier collsns
    34280288935 121987313 0       0       0       0

So you see it is similar for me…but I do not have any issues with connectivity (I play CS…so I would notice quite fast). I also have an SFP module installed (eth2).

Btw, most packets are dropped on the RX side from the internet…so it may also be a routing problem at your provider.
And maybe the firewall is responsible for dropping them.

yorik · March 19, 2021, 5:47pm

Chris, I’m sorry if it sounded like an angry attack. I didn’t mean it. Yes, I’m full of anger at the router and at myself for buying it. It almost works, but that “almost” costs me a lot of time.

Your number of dropped packets on eth1 is huge. I’d investigate that. I have only 1257 in 6 days of uptime (5.1.10 with the alternative wifi driver).

I’m reading https://blog.packagecloud.io/eng/2016/06/22/monitoring-tuning-linux-networking-stack-receiving-data/ to understand which packets could be counted as dropped. Unfortunately the driver and/or the ethernet hardware don’t support enhanced features like large NIC queue sizes (the maximum is only 128 for RX) or control of network flows. I hope to get a better understanding of what’s going on, or maybe I’ll give up and start using one of my servers as the border router instead.

I don’t think firewall drops are counted as eth dropped packets. Try using top and check software interrupts (si): it should be quite low (<3%) and only go to ~50% when you are pushing a full gigabit through the Omnia’s CPU.
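
The si figure that top displays can also be derived per CPU from two reads of /proc/stat. A minimal sketch, assuming Python 3 and the usual field order of the cpuN rows (user, nice, system, idle, iowait, irq, softirq, steal, ...); the function name softirq_percent is hypothetical and the two snapshots below are invented numbers, not measurements from an Omnia:

```python
# Invented /proc/stat snapshots for illustration only.
BEFORE = """\
cpu  200 0 100 1650 10 0 40 0 0 0
cpu0 100 0 50 800 10 0 40 0 0 0
cpu1 100 0 50 850 0 0 0 0 0 0
"""
AFTER = """\
cpu  230 0 115 1775 10 0 70 0 0 0
cpu0 120 0 60 840 10 0 70 0 0 0
cpu1 110 0 55 935 0 0 0 0 0 0
"""


def parse_cpu_rows(text):
    """Return {"cpu0": [counters...], ...}, skipping the aggregate "cpu" row."""
    rows = {}
    for ln in text.splitlines():
        parts = ln.split()
        if parts and parts[0].startswith("cpu") and parts[0] != "cpu":
            rows[parts[0]] = [int(v) for v in parts[1:]]
    return rows


def softirq_percent(before, after):
    """Per-CPU percentage of time spent in softirq between two snapshots."""
    old_rows, new_rows = parse_cpu_rows(before), parse_cpu_rows(after)
    result = {}
    for cpu, new in new_rows.items():
        old = old_rows.get(cpu)
        if old is None or len(new) < 7 or len(old) < 7:
            continue
        delta = [n - o for n, o in zip(new, old)]
        # Sum user..steal only; guest time is already counted inside user/nice.
        total = sum(delta[:8])
        if total > 0:
            result[cpu] = 100.0 * delta[6] / total  # index 6 is softirq
    return result


print(softirq_percent(BEFORE, AFTER))
```

On a live system the two snapshots would come from reading /proc/stat a few seconds apart, which makes it possible to log si over time instead of watching top.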

ChrisDeath · March 20, 2021, 12:55am

0.2% of packets dropped on the external interface is okay…and uptime was 16 d.
And I think it is the firewall, or they are dropped for good reason, because my WireGuard device shows this:

wg0: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1420 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/none
    RX: bytes  packets  errors  dropped overrun mcast
    0          0        0       0       0       0
    TX: bytes  packets  errors  dropped carrier collsns
    0          0        2       45231   0       0

So no traffic but many drops and 2 errors.
Also note that drops are not errors.
https://www.cyberciti.biz/faq/linux-show-dropped-packets-per-interface-command/ says that the drop count includes the firewall. In netdata I only see a drop on lan0 every 15 min, which is my Mac in standby…
You may also check the SUSE support article “ifconfig shows dropped rx packets”.
CPU is low at 0.5…and 40% on a full 1 Gbit/s download…so all fine…even with that drop rate. So I have no issues, and dropped packets are normal and working as designed. So maybe your connection issues are not related to the drops, or the drops are only a symptom. Many newer features are also missing in 4.0.5, so it may also be that one of the clients is not compatible with one of the new features. So again…test with only one client on the network and post the result.

PS: since my first post the drops have not increased on eth2…so they are time-dependent…which would fit the firewall having stopped an attack.
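
The 0.2% figure quoted above can be reproduced from the posted counters by dividing RX dropped by RX packets. A minimal sketch, assuming Python 3 and the column layout of ip -s link shown in this thread (bytes, packets, errors, dropped, ...); the function name rx_drop_percent is hypothetical:

```python
import re

# RX part of the ip -s link output posted earlier in the thread.
SAMPLE = """\
3:eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 532
   RX: bytes  packets  errors  dropped overrun mcast
   36826838164 126304830 0       1239412 0       0
 4: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 532
   RX: bytes  packets  errors  dropped overrun mcast
   488072251983 390737590 0       821349  0       0
"""


def rx_drop_percent(text):
    """Map interface name -> RX dropped as a percentage of RX packets."""
    result = {}
    iface = None
    lines = text.splitlines()
    for i, ln in enumerate(lines):
        header = re.match(r"\s*\d+:\s*([^:@\s]+)", ln)
        if header:
            iface = header.group(1)  # "lan0@eth1" is reported as "lan0"
        elif ln.strip().startswith("RX:") and iface and i + 1 < len(lines):
            values = lines[i + 1].split()
            packets, dropped = int(values[1]), int(values[3])
            if packets:
                result[iface] = 100.0 * dropped / packets
    return result


for name, pct in rx_drop_percent(SAMPLE).items():
    print(f"{name}: {pct:.2f}% of RX packets dropped")
```

For these counters this gives about 0.21% on eth2 and about 0.98% on eth1. A ratio over the whole uptime says nothing about whether the drops arrive evenly or in bursts.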

yorik · March 22, 2021, 11:38am

I’ve checked, and the firewall isn’t related to interface drops. I added a rule with -j DROP to different tables and tried sending packets, and that didn’t change any of the dropped counters (ip -s l or ethtool -S eth1). Also, different network card drivers report drops and errors differently. I think the truth is only in the source code.

I’m on 5.1.10 right now. The drops are few and infrequent, but still noticeable. I hope to find some time for proper tests during the upcoming holidays.
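
The before/after check described above can be sketched as a small helper that reads a counter, runs some action, and reads the counter again. This assumes Python 3 and the standard Linux sysfs layout /sys/class/net/<iface>/statistics/<counter>; the function names read_counter and counter_delta and the root parameter (there only so the helper can be pointed at a test directory) are hypothetical:

```python
from pathlib import Path


def read_counter(iface, counter="rx_dropped", root="/sys/class/net"):
    """Read one interface statistics counter from sysfs as an integer."""
    path = Path(root) / iface / "statistics" / counter
    return int(path.read_text())


def counter_delta(iface, action, counter="rx_dropped", root="/sys/class/net"):
    """Return how much the counter changed while action() was running."""
    before = read_counter(iface, counter, root)
    action()  # e.g. send the test packets that the DROP rule should match
    return read_counter(iface, counter, root) - before


# Example use on the router (not executed here):
#   import time
#   print(counter_delta("eth1", lambda: time.sleep(10)))
```

A delta of zero while the firewall rule is demonstrably discarding packets would match the observation above. The sysfs value may not match every driver-specific counter that ethtool -S reports.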

AreYouLoco · March 23, 2021, 5:31pm

Post your full ethtool -S output for the interface on which you experience drops, and also try to monitor with htop which process is using the CPU at the time.

yorik · May 31, 2021, 1:27pm

I’ve fixed the problem by finally migrating to a custom-built router (an IBM server with Debian). Now everything works well and reliably for me.

Good luck to all of you struggling with Turris.

niall · December 2, 2021, 3:44pm

I’ve had trouble with dropped packets on eth2 (WAN) just in the last couple of weeks, on an Omnia running 5.3.1 HBT. Unpredictably, my connection changes from stable to “yo-yo mode” or back again. When my uplink is usable, I cannot count on its continued availability; when it is not usable, I have no idea when it will work again.

RIPE Atlas gives an idea of the trouble.

AreYouLoco · December 2, 2021, 4:25pm

I would consider asking your ISP whether they are having some trouble. Most likely it is a dirty RJ-45 plug.

niall · December 3, 2021, 11:44am

ISP has been as helpful as possible. They have detected and repaired a fault in the copper connection to the FTTC cabinet (some 400m away) and have also given me a new CPE unit (DSL modem/router), leaving the old one with me as well. I use it in bridge mode, as I don’t want an extra router in cascade with my Turris.

I have seen no difference in behaviour after I do any of the following:

swap CPE units,
use a different RJ45 outlet on the downstream side of the CPE unit,
simply re-seat the RJ45 cable at the WAN port of the Turris,
use a fresh RJ45 cable between CPE unit and Turris,
reboot the Turris.

Besides, when I connect a laptop directly downstream of the CPE unit, in parallel with the Turris, and while the Turris has unusable connectivity, the laptop has a beautifully clean connection.

If this all isn’t enough to localize the trouble to the Turris, what am I missing?

AreYouLoco · December 3, 2021, 11:58am

Check ethtool -S eth2 to see if there are any errors on the eth2 interface.

You could also check with the mtr tool whether there are connectivity issues, and where.

niall · December 3, 2021, 12:16pm

Thanks for these suggestions; I’ve just installed the packages.

As the Turris is operating normally at the moment, I plan to save a reference copy of the output from ethtool -S and wait until the trouble recurs to take another one, and then post both (when possible). Does that seem reasonable?

I’ll explore the use of mtr also.
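
The plan of saving a reference copy of the ethtool -S output and comparing it with a later one can be sketched as follows. This assumes Python 3 and the usual "name: value" layout of ethtool -S; the function names and the counter names in the two snapshots are placeholders, not the actual counters of the Omnia's driver:

```python
# Placeholder snapshots; real ones would be saved output of ethtool -S eth2.
REFERENCE = """\
NIC statistics:
     good_frames_received: 1000
     rx_discard: 3
     rx_overrun: 0
"""
LATER = """\
NIC statistics:
     good_frames_received: 5000
     rx_discard: 250
     rx_overrun: 0
"""


def parse_ethtool_stats(text):
    """Parse "name: value" rows into a dict, ignoring non-numeric rows."""
    stats = {}
    for ln in text.splitlines():
        name, sep, value = ln.rpartition(":")
        value = value.strip()
        if sep and value.lstrip("-").isdigit():
            stats[name.strip()] = int(value)
    return stats


def changed_counters(before, after):
    """Return {counter: increase} for counters whose value changed."""
    old, new = parse_ethtool_stats(before), parse_ethtool_stats(after)
    return {k: new[k] - old[k] for k in new if k in old and new[k] != old[k]}


for name, diff in changed_counters(REFERENCE, LATER).items():
    print(f"{name}: +{diff}")
```

Posting only the counters that changed between a good period and a bad one keeps the output short and points at the counters worth looking up in the driver source.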