So, you have broken my Omnia again (migration from 3 to 5.x)

radomirpolach · December 1, 2021, 9:04pm

Why can’t you make a quality update for once and not fuck up my router?
My WAN connection no longer receives an address via DHCP. I can’t test with dhclient because it is not installed, and a static IP doesn’t work either. The link is up. In front of it is a UPC router.

Nothing relevant there.

radomirpolach · December 1, 2021, 10:23pm

No DHCP on eth2 (wan).
When I set a static IP, tcpdump shows no traffic on eth2.

4: eth2: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 532
link/ether d8:58:d7:00:43:27 brd ff:ff:ff:ff:ff:ff
inet 192.168.0.2/24 brd 192.168.0.255 scope global eth2
   valid_lft forever preferred_lft forever

# cat /sys/class/net/eth2/operstate 
down
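
The two checks above (the NO-CARRIER flag in the `ip addr` header line and the `operstate` file) can be scripted. Below is a minimal Python sketch, assuming the usual Linux sysfs layout under /sys/class/net; the helper names `link_flags`, `has_carrier` and `operstate` are made up for illustration and are not part of any Turris tool.

```python
import re
from pathlib import Path


def link_flags(ip_line):
    """Return the flag set from an `ip addr` header line, e.g. '4: eth2: <NO-CARRIER,UP> ...'."""
    match = re.search(r"<([^>]*)>", ip_line)
    return set(match.group(1).split(",")) if match else set()


def has_carrier(ip_line):
    """True when the header line does not report NO-CARRIER."""
    return "NO-CARRIER" not in link_flags(ip_line)


def operstate(iface, sysfs_root="/sys/class/net"):
    """Read the operstate file for an interface; None if it cannot be read."""
    try:
        return (Path(sysfs_root) / iface / "operstate").read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    header = "4: eth2: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN"
    print("carrier:", has_carrier(header))
    print("operstate:", operstate("eth2"))
```

An interface that is administratively UP but reports NO-CARRIER, as here, points at the physical layer (cable, PHY or SFP selection) rather than at DHCP.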

radomirpolach · December 1, 2021, 11:21pm

Are these configs correct? Shouldn’t there be eth2 (wan):

/etc/config/system

config led 'led_wan'
        option name 'Auto-configuration for WAN'
        option sysfs 'omnia-led:wan'
        option default '1'
        option trigger 'netdev'
        option dev 'eth1'
        option mode 'link tx rx'

/etc/config/sqm

config queue 'eth2'
        option enabled '0'
        option interface 'eth2'
        option download '85000'
        option upload '10000'
        option qdisc 'fq_codel'
        option script 'simple.qos'
        option qdisc_advanced '0'
        option ingress_ecn 'ECN'
        option egress_ecn 'ECN'
        option qdisc_really_really_advanced '0'
        option itarget 'auto'
        option etarget 'auto'
        option linklayer 'none'

WAN LED is always lit and blinks sometimes (https://docs.turris.cz/hw/omnia/omnia-manual-en.pdf).

Well, this is properly fucked up.

# grep -nr eth2 * 2>/dev/null | grep -v board.d
board.json:25:			"device": "eth2",
board.json:56:			"ifname": "eth2",
config/network:34:	option ifname 'eth2'
suricata/suricata.yaml:1568: - interface: eth2
suricata/suricata.yaml:1601:   #copy-iface: eth2
suricata-pakon/suricata.yaml:1146: - interface: eth2
suricata-pakon/suricata.yaml:1179:   #copy-iface: eth2

# grep -nr eth1 * 2>/dev/null | grep -v board.d
config/bcp38:3:	option interface 'eth1'
config/system:56:	option dev 'eth1'
config/sqm:2:config queue 'eth1'
config/sqm:4:        option interface 'eth1'
suricata/suricata.yaml:621:    #copy-iface: eth1
suricata/suricata.yaml:1635:  #- interface: eth1
suricata-pakon/suricata.yaml:206:    #copy-iface: eth1
suricata-pakon/suricata.yaml:1213:  #- interface: eth1

einar · December 2, 2021, 6:07am

@cynerd That’s exactly the problem I was having, which I sent you data for. No carrier on the interface. Perhaps the migration switched WAN to SFP?

@OP: Frustration or not, no need to insult.

cynerd · December 2, 2021, 9:30am

It is the switch to SFP, judging from @radomirpolach’s post, but I still couldn’t figure out why that would be the case. @einar, I originally put this down to some sort of fluke (it is an input from the hardware, where in very rare cases induction could bring that pin to the trigger level if it is stronger than the pull-up), because I was unable to find the root cause or even reproduce it. But with another user reporting the same… I went through all of the code again and… oh god. Stupid error. tos3to4: fix pin direction for SFP detection (!853) · Merge requests · Turris / Turris OS / Turris OS packages · GitLab
Thank you for reporting this, and I am sorry for the inconvenience. Unfortunately, this never came up in our internal testing and was reported only once, by @einar, so I considered it an unlikely issue.

radomirpolach · December 2, 2021, 11:52am

Sadly, my swearing doesn’t even match my frustration. If this were the first, second or third time something like this happened, I would be pretty calm, and back then I was. Just off the top of my head:

Some kernel modules weren’t there and I had to fix it through the serial console because it wouldn’t even boot.
They removed old repositories and factory reset basically broke the device completely.
They removed packages that were handling VPN packet traffic through conntrack. I have spent several days debugging why my VPN is not working.
Now this.

And there were several more times like these. I postpone updates like these for as long as I can. But still, even their official migration broke my device. And as you can see, I am not making it up: the direction of the SFP GPIO was reversed, and there was a report from you that this had happened. If it had been mentioned in:

3.x migration - Turris Documentation

something like “there was a strange issue, we don’t know why, but it is possible that SFP will be enabled instead of PHY; do this to recover”, I would have been OK with it and wouldn’t have said a peep. But nada.
They basically covered the issue up with full knowledge that it happened to you.
They knew the issue was possible. They weren’t able to find the problem. OK, it happens. But not warning users about a known issue… this is either malice or incompetence.

The VPN issue was similar: I had to dig through their own commits to see what they broke. It took a week to officially fix it.

I would even be willing to beta-test these changes, but no one ever contacted me after any issue. They just break my device with a scheduled update as usual.

radomirpolach · December 2, 2021, 11:54am

Can anybody confirm that these occurrences of eth1 and eth2 in the above post are correct on TurrisOS 5.x?

cynerd · December 2, 2021, 12:23pm

The correct one is eth2 for WAN. I am going to add a note about that to the documentation, as we can’t reliably switch it in every single place.

radomirpolach · December 2, 2021, 12:26pm

radomirpolach:

/etc/config/system

config led 'led_wan'
        option name 'Auto-configuration for WAN'
        option sysfs 'omnia-led:wan'
        option default '1'
        option trigger 'netdev'
        option dev 'eth1'
        option mode 'link tx rx'

/etc/config/sqm

config queue 'eth2'
        option enabled '0'
        option interface 'eth2'
        option download '85000'
        option upload '10000'
        option qdisc 'fq_codel'
        option script 'simple.qos'
        option qdisc_advanced '0'
        option ingress_ecn 'ECN'
        option egress_ecn 'ECN'
        option qdisc_really_really_advanced '0'
        option itarget 'auto'
        option etarget 'auto'
        option linklayer 'none'

So, should these have eth2 and eth1 swapped? I was comparing with my backup from TurrisOS 3.x. It seems that led_wan should be for eth2 and not eth1, and sqm on eth1 and not eth2?

I am not sure if I should go through every single config and check everything? But I don’t really know where sqm should run? Or bcp38?

cynerd · December 2, 2021, 12:31pm

In reality we do not cover LEDs, because by default the automatic rainbow functionality should work that out. If the user specified something in the configuration, it is just an LED configuration…

The SQM is potentially harmful. You most likely want to configure the bridge br-lan there instead of eth2. In short: eth1 and eth2 got swapped in the kernel. You probably want to go through the configuration and verify that everything is as it should be. We cannot cover every single package. The migration covers only the primary router functionality that can be configured from Foris (with a few exceptions).
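
Going through the configuration for leftover eth1/eth2 references can be partly automated. The following is a minimal Python sketch, assuming the plain UCI file syntax shown earlier in this thread (`config <type> '<name>'` followed by `option`/`list` lines); the function name `uci_iface_refs` is hypothetical. It only reports candidates and cannot decide whether a given reference is correct after the swap.

```python
import shlex


def uci_iface_refs(text, ifaces=("eth1", "eth2")):
    """Return (section, option, value) for every option whose value names one of ifaces."""
    refs = []
    section = None
    for raw in text.splitlines():
        try:
            parts = shlex.split(raw, comments=True)
        except ValueError:
            continue  # skip lines with unbalanced quotes instead of guessing
        if not parts:
            continue
        if parts[0] == "config":
            # Named section: use the name; anonymous section: fall back to the type.
            section = parts[-1] if len(parts) > 1 else None
        elif parts[0] in ("option", "list") and len(parts) >= 3:
            # A value may hold several interfaces separated by spaces.
            if any(token in ifaces for token in parts[2].split()):
                refs.append((section, parts[1], parts[2]))
    return refs


if __name__ == "__main__":
    import sys
    for path in sys.argv[1:]:
        with open(path) as handle:
            for section, option, value in uci_iface_refs(handle.read()):
                print(f"{path}: section {section!r}: {option} = {value!r}")
```

Saved under a name of your choosing, it could be run against files in /etc/config, for example the `system`, `sqm` and `bcp38` files mentioned in this thread.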

radomirpolach · December 2, 2021, 12:34pm

I do not think any of these two were user-configured, that’s the issue. If these were my custom configurations I would know what to do with them. So I don’t really know where they should be pointing.

cynerd · December 2, 2021, 12:38pm

I am pretty sure that SQM is used only for the guest network and configured on that bridge, so it had to be. To my knowledge, there is no way SQM would be configured for eth2 without the user doing that. The same applies to BCP38, which is not used by default in Turris. The only way to get either is for the user to enable or configure it in LuCI. At the same time, it is certainly possible that you did it in LuCI without even knowing. LuCI is not nice in that respect if you are unsure what a specific setting does.

einar · December 2, 2021, 12:38pm

The reason nothing came out of the first investigation was because I did not save a snapshot of the broken Omnia, and used a medkit to reflash /, as I had configuration backups. So no one could poke in the actual state of the system that got me to the problem in the first place.

@cynerd The weird thing about this is that my other Omnia did not suffer from this at all (bought early 2017, unlike the one that broke, which was a late Indiegogo unit). When does that script trigger?

cynerd · December 2, 2021, 12:43pm

During the migration, in one of the last steps. From the fix it seems to me that it depends on whether sfpswitch was running. If sfpswitch had previously configured that GPIO as an input, the script switching it to output preserved the value, so reading it was not an issue. If the script had to export the pin itself because sfpswitch wasn’t enabled, the default is a zero output, which matches a plugged-in SFP. In other words, in most cases the bug was hidden, but if the “unnecessary” sfpswitch service is disabled, the detection gives an invalid result. That is just a theory, but a likely one. Either way, it was invalid to set that GPIO to output.
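
The theory above can be illustrated with a toy model. This Python sketch is not the migration script or the kernel GPIO behaviour; it only encodes the assumptions stated in the post: the detect line reads 0 when an SFP is plugged in, a freshly exported pin set to output drives 0, and switching an existing input to output keeps the value it had. The class and function names are invented for the example.

```python
class ToyGpio:
    """Toy model of a single GPIO line (assumptions only, not real hardware)."""

    def __init__(self, external_level):
        self.external_level = external_level  # level the hardware puts on the pin
        self.direction = None                 # None: pin not exported yet
        self.latch = 0                        # output register, assumed to default to 0

    def set_direction(self, direction):
        if direction == "out" and self.direction == "in":
            # Assumed: going from input to output preserves the current value.
            self.latch = self.external_level
        self.direction = direction

    def read(self):
        # An output reads back its own latch; an input reads the hardware level.
        return self.latch if self.direction == "out" else self.external_level


def sfp_detected(pin):
    """Assumed active-low detection: 0 means a module is plugged in."""
    return pin.read() == 0


NO_SFP = 1  # pulled up when no module is present

# sfpswitch ran earlier and set the pin to input; the buggy "out" still reads correctly.
with_sfpswitch = ToyGpio(NO_SFP)
with_sfpswitch.set_direction("in")
with_sfpswitch.set_direction("out")

# sfpswitch disabled: the script exports the pin itself and sets it to output.
without_sfpswitch = ToyGpio(NO_SFP)
without_sfpswitch.set_direction("out")

# The fixed behaviour: read the pin as an input.
fixed = ToyGpio(NO_SFP)
fixed.set_direction("in")

print(sfp_detected(with_sfpswitch), sfp_detected(without_sfpswitch), sfp_detected(fixed))
```

In this model only the second case reports an SFP that is not there, which is consistent with the bug showing up on some routers and staying hidden on others.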

radomirpolach · December 2, 2021, 12:46pm

I didn’t have BCP38 config in my pre-migration backup at all. It appeared after the migration.
I may have some guest network configured, but never actually used it, so I doubt that I installed any special component unless it was in Foris or something and sounded important.

I generally installed only a few terminal tools using opkg.
But I have had Omnia for a long time since the initial campaign.

cynerd · December 2, 2021, 12:52pm

That is probably right. It is now an optional part of the LuCI controls package list. We enabled all optionals for anyone who previously had this package list enabled, because we can’t decide which options should be enabled and which not (this is not part of the migration from 3.x but part of the upgrade from 4.0, so it was not written with 3.x in mind). I had to look into its default configuration: it simply uses eth1 unconditionally, but it is disabled. I am sorry for misleading you into thinking that it was something you did. Since it is disabled, you do not have to care about it.

The same applies to SQM, now that I have looked at it again. It is just a default in the configuration file, and it is disabled.

radomirpolach · December 2, 2021, 12:53pm

So that’s probably all safe, and I changed led_wan in /etc/config/system to eth2.

radomirpolach · December 2, 2021, 12:54pm

Otherwise everything seems to be working, even LXC containers.

cynerd · December 2, 2021, 12:57pm

You can even drop that if you want: if the WAN LED is set to automatic in rainbow, it should already work without any configuration.

I am happy to hear that everything else works.

moeller0 · December 2, 2021, 1:48pm

I respectfully disagree: for SQM to work as expected, it needs to be in control of the bottleneck link (which most often is the WAN interface, so eth2 seems about right). IIRC, for the guest network TOS does not use SQM but only HTB, no?

Running SQM on br-lan is rather unexpected because it will:
a) limit download from the internet to 10 Mbps
b) limit upload to the internet to 85 Mbps
c) limit traffic between WiFi and LAN ports to either 10 or 85 Mbps as well (depending on whether it traverses the ingress or egress side)

br-lan can be the desired and ideal interface for SQM (e.g. on a wired only router), but generally it is not due to the unexpected side-effects.
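
The reversal in a) and b) follows from how the limits attach to an interface: as the list above implies, SQM applies the `upload` value to traffic leaving the interface (egress) and the `download` value to traffic entering it (ingress). A minimal Python sketch of that mapping, with a hypothetical function name and using the values from the /etc/config/sqm excerpt earlier in the thread:

```python
def limits_seen_from_internet(download_kbit, upload_kbit, faces_internet):
    """Map per-interface SQM settings to the limits a LAN client experiences.

    Assumes `upload` shapes the interface's egress and `download` its ingress.
    """
    egress_limit, ingress_limit = upload_kbit, download_kbit
    if faces_internet:
        # WAN interface: egress leaves towards the internet, ingress arrives from it.
        return {"internet_download": ingress_limit, "internet_upload": egress_limit}
    # LAN-facing interface such as br-lan: egress goes to the clients,
    # so it carries what they download from the internet.
    return {"internet_download": egress_limit, "internet_upload": ingress_limit}


if __name__ == "__main__":
    print("WAN:   ", limits_seen_from_internet(85000, 10000, faces_internet=True))
    print("br-lan:", limits_seen_from_internet(85000, 10000, faces_internet=False))
```

With the same numbers, the WAN placement gives 85 Mbps down and 10 Mbps up, while the br-lan placement swaps them. Point c), shaping of traffic that stays local, is not modelled here.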

But in that case would you not use a VLAN for the guest network? In which case SQM should be on ethX.N, no?

But all of this is pretty moot:

This SQM instance is not doing anything one way or the other

EDIT: you already found out before my post