Switch ports offline if plugged during boot

adminX · December 11, 2016, 5:02pm

I mean the one generated by the SoC. It is automatically pulled down when the SoC reset is pulled down, but it can also be influenced in part by some settings.

fik · December 13, 2016, 10:45am

Hi guys, could it be that the problem is related only to the Broadcom Tigon (tg3) NIC? I have two wired computers running the same software (Gentoo), each getting a reserved IP from the Turris via DHCP. When I restart the Turris, there is no network on the computer with tg3, while there is no problem on the other computer with the Realtek r8169.

I see this in dmesg:

[Dec13 08:52] r8169 0000:04:00.0 enp4s0: link down
[ +11.479235] r8169 0000:04:00.0 enp4s0: link down
[ +3.351464] r8169 0000:04:00.0 enp4s0: link up

While on the other computer there is just:
[Dec13 08:52] tg3 0000:07:00.0 enp7s0: Link is down

and I get this only when I restart the network manually (/etc/init.d/net.enp7s0 restart):
tg3 0000:07:00.0 enp7s0: Link is up at 1000 Mbps, full duplex
tg3 0000:07:00.0 enp7s0: Flow control is on for TX and on for RX

It seems that the Apple Thunderbolt to Gigabit Ethernet adapter (jan357cz) uses the Broadcom BCM57762, which is also tg3 related…

There was no problem with the tg3 NIC and other routers in the past.
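For anyone comparing NICs the same way, here is a minimal Python sketch that pulls the last reported link state per interface out of dmesg text. The regular expression is only an assumption based on the two message styles quoted above (r8169 "link up" and tg3 "Link is up"); other drivers may word it differently. The function name is made up for illustration.

```python
import re

# Assumed pattern, derived only from the r8169 and tg3 lines quoted above.
LINK_RE = re.compile(
    r"(?P<driver>\w+) (?P<pci>[0-9a-f:.]+) (?P<iface>\w+): "
    r"[Ll]ink (?:is )?(?P<state>up|down)"
)


def last_link_states(dmesg_text):
    """Return {interface: (driver, last reported link state)}."""
    states = {}
    for line in dmesg_text.splitlines():
        match = LINK_RE.search(line)
        if match:
            # Later lines overwrite earlier ones, so the final state wins.
            states[match.group("iface")] = (
                match.group("driver"),
                match.group("state"),
            )
    return states


sample = """\
[Dec13 08:52] r8169 0000:04:00.0 enp4s0: link down
[ +11.479235] r8169 0000:04:00.0 enp4s0: link down
[ +3.351464] r8169 0000:04:00.0 enp4s0: link up
[Dec13 08:52] tg3 0000:07:00.0 enp7s0: Link is down
"""

for iface, (driver, state) in last_link_states(sample).items():
    print(f"{iface} ({driver}): link {state}")
```

Fed with the output of dmesg after a Turris restart, an interface that stays at "down" is one that never renegotiated the link.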

dandys · December 13, 2016, 11:35am

So far I’ve tried with:
switch TPLINK TL-SG1016 - happens virtually every time.
integrated NIC in my PC Marvell sky2 - happened about 3 times
comtrend vr-3026e v2, unknown 100Mb/s chipset - never happened

I’d like to test more combinations, but my possibilities are limited because the Omnia sits in a dirty, hard-to-reach place and my room-mates (and I too) don’t like being without internet for too long. Maybe I will purchase another Omnia just for testing and playing, hehe.

fik · December 13, 2016, 1:19pm

it could be Broadcom based, I found SG2216 is

davey · December 13, 2016, 1:50pm

I have not had any problem with offline ports or link down.

My setup has been:

But after I discovered duplicates when pinging the server, I thought that something might be up with the unmanaged Netgear gigabit switches and perhaps some old VLANs. I did reset it, but it made no difference.
64 bytes from 192.168.21.199: icmp_seq=0 ttl=64 time=1.492 ms
64 bytes from 192.168.0.239: icmp_seq=0 ttl=255 time=8.215 ms (DUP!)

I then moved the server from the switch directly to the TO and duplicates disappeared.
I then pinged a client and got duplicates… moved it to TO and no more duplicates.

But suddenly I was not able to connect to either Foris or LuCI. “Server aborted connection” all the time.

As I was on SSH, I did a reboot…

And now all of the LAN ports on the TO were offline!
I physically disconnected the ethernet cable, waited 4 seconds, reconnected it, and got a link up.
I had to do the same with all clients and the switch. I have never needed to do this before.

I did a reboot and got a link right away for the switch - something I did not get when I had the switch, server and client connected.

But the duplicates are back, although now the duplicate is from a different IP range than before. This time it’s from the TO’s own range.

64 bytes from 192.168.21.199: icmp_seq=0 ttl=64 time=1.799 ms
64 bytes from 192.168.21.79: icmp_seq=0 ttl=255 time=8.121 ms (DUP!)

The server does not run DHCP. Only the TO does that.
The server and one of the clients are on static leases set up in TO.

I then connected my MacBook Air via Thunderbolt ethernet to the switch that the server and clients are connected to, and pinged the MacBook Air from the server.
I then got duplicates again.

PING 192.168.21.60 (192.168.21.60): 56 data bytes
64 bytes from 192.168.21.60: icmp_seq=0 ttl=64 time=0.553 ms
64 bytes from 192.168.21.79: icmp_seq=0 ttl=255 time=4.794 ms (DUP!)

When I looked at the IPs, I found a ghost MAC address, and I have no idea where it comes from.

2016-12-13T14:01:15+01:00 info dnsmasq-dhcp[25833]: DHCPREQUEST(br-lan) 192.168.21.79 c0:3f:0e:3c:29:f5
2016-12-13T14:01:15+01:00 info dnsmasq-dhcp[25833]: DHCPACK(br-lan) 192.168.21.79 c0:3f:0e:3c:29:f5

My original MAC address looks like this:
2016-12-13T14:09:57+01:00 info dnsmasq-dhcp[25833]: DHCPDISCOVER(br-lan) xx:xx:xx:11:ea:8e
2016-12-13T14:09:57+01:00 info dnsmasq-dhcp[25833]: DHCPOFFER(br-lan) 192.168.21.60 xx:xx:xx:11:ea:8e
2016-12-13T14:09:58+01:00 info dnsmasq-dhcp[25833]: DHCPREQUEST(br-lan) 192.168.21.60 xx:xx:xx:11:ea:8e
2016-12-13T14:09:58+01:00 info dnsmasq-dhcp[25833]: DHCPACK(br-lan) 192.168.21.60 xx:xx:xx:11:ea:8e Air-18-i5

These ghost addresses are listed in the TO’s ARP table.

Any ideas?

Have I messed something up, or is all of this related somehow?
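To tie the DUP replies to a MAC address, something like the following Python sketch could help. It assumes the ping and dnsmasq line formats exactly as pasted above. It lists which IPs answered with (DUP!) and looks up the MAC that dnsmasq last acknowledged for each of them. The function names are made up for illustration.

```python
import re

REPLY_RE = re.compile(
    r"bytes from (?P<ip>\d+(?:\.\d+){3}): icmp_seq=(?P<seq>\d+) ttl=(?P<ttl>\d+)"
)
ACK_RE = re.compile(r"DHCPACK\(\S+\) (?P<ip>\d+(?:\.\d+){3}) (?P<mac>\S+)")


def duplicate_responders(ping_output):
    """Return sorted (ip, ttl) pairs of replies that ping marked (DUP!)."""
    dups = set()
    for line in ping_output.splitlines():
        if "(DUP!)" not in line:
            continue
        match = REPLY_RE.search(line)
        if match:
            dups.add((match.group("ip"), int(match.group("ttl"))))
    return sorted(dups)


def leases_from_log(log_text):
    """Map each IP to the MAC in its most recent DHCPACK line."""
    leases = {}
    for line in log_text.splitlines():
        match = ACK_RE.search(line)
        if match:
            leases[match.group("ip")] = match.group("mac")
    return leases


ping_sample = """\
PING 192.168.21.60 (192.168.21.60): 56 data bytes
64 bytes from 192.168.21.60: icmp_seq=0 ttl=64 time=0.553 ms
64 bytes from 192.168.21.79: icmp_seq=0 ttl=255 time=4.794 ms (DUP!)
"""
log_sample = """\
2016-12-13T14:01:15+01:00 info dnsmasq-dhcp[25833]: DHCPACK(br-lan) 192.168.21.79 c0:3f:0e:3c:29:f5
2016-12-13T14:09:58+01:00 info dnsmasq-dhcp[25833]: DHCPACK(br-lan) 192.168.21.60 xx:xx:xx:11:ea:8e Air-18-i5
"""

leases = leases_from_log(log_sample)
for ip, ttl in duplicate_responders(ping_sample):
    print(ip, "ttl", ttl, "MAC", leases.get(ip, "no lease seen"))
```

With the samples above, the only duplicate responder is 192.168.21.79, which is the address leased to the ghost MAC c0:3f:0e:3c:29:f5. The differing TTL (255 instead of 64) also suggests the duplicate comes from a different device than the one being pinged.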

fik · December 13, 2016, 2:03pm

what kind of switch do you have? if you completely remove the switch, do you still get the DUP pings?

another thing is the link-down after TO reset, this is what some people in this thread see (and some don’t)

the Apple ethernet-thunderbolt has Broadcom inside (is it mac client1?)

what NIC do you have in the other mac server and mac client 2? they were all offline after TO reboot, right?

davey · December 13, 2016, 3:00pm

The Mac server and clients on ethernet use the BCM5701.
The thunderbolt adapter uses a broadcom as well.

With the TOs LAN side connected to an older Netgear ProSafe GS105 V3.
https://www.netgear.com/support/product/GS105v5.aspx#docs (link is to updated v5)

No DUP occurred.
Everything ran perfect.
After I rebooted the TO, the lan0 port was still offline and the port on the ProSafe switch did not light up.
I disconnected the Cat7 ethernet cable from the ProSafe GS105v3, waited 6-7 seconds, and reconnected it; the gigabit link LEDs lit up again right away.
No DUP PINGs.

With the TOs LAN side connected to a newer Netgear ProSafe Unmanaged Plus GS105E

Constant DUPs began immediately.
After a few moments the TO’s Foris and LuCI died.
Had to SSH into the TO and reboot it.
After the reboot the lan0 port to the ProSafe lit up and connected directly.
Constant DUP PINGs.
Now even the TO gets DUPs when pinged.

Without any switch I do not get any DUPs, but on the other hand all LAN ports are stuck offline after a reboot.

I’ll dig up another switch and test later tonight.

fik · December 13, 2016, 3:10pm

So again Broadcom here, as I thought

The Dup problem is a different problem and it probably is due to the switch

jszakmeister · December 14, 2016, 10:11am

I hate adding another “me too” post, but I’m also seeing this issue. @tohojo’s suggestion did work for me:

So it seems like some sort of configuration-related issue.

I’m not ready to put the Turris into permanent use yet and am just trying to get to know the unit, so I have it plugged directly into the second ethernet adapter on my Mac Pro. That also means there’s a Broadcom chipset involved: a Broadcom 57762, to be specific. I’m a bit wary of saying that it’s a defining characteristic of the problem though. I have a USB Ethernet adapter that I can try later to see if it exhibits the same problem, and I’ll try plugging a switch into one of the LAN ports to see if the problem reproduces with that too.

chrislea · December 17, 2016, 5:42pm

I will also (sadly) put in another “me too” post. I am seeing this same behavior with a Linksys SE2800 dumb switch plugged into my Turris.

I am happy to provide any diagnostic info, just let me know what commands to run. Thanks!

chrislea · December 17, 2016, 6:22pm

Quick update: I replaced the Linksys SE2800 switch with a Netgear GS105 switch. The Netgear switch does NOT have the same problem.

Meaning (to be totally clear) that when the Linksys switch is plugged into the TO, and I reboot the TO, the connection to the Linksys switch does not come back up. When the Netgear switch is plugged into the TO, and I reboot the TO, the connection to the Netgear does come back up.

Again, I am happy to run any commands to help compare the two situations if that would be helpful for anybody troubleshooting this.

white · December 17, 2016, 11:26pm

You can test if adminX’s fix-switch binary helps you. And if it does you can configure it to be run on Omnia’s startup.

chrislea · December 18, 2016, 12:18am

I can confirm that the fix-switch binary does resolve the issue for me. Steps:

Get fix-switch onto the TO with something like scp fix-switch root@192.168.0.1:./
ssh root@192.168.0.1
chmod 755 fix-switch && mv fix-switch /usr/bin
Edit /etc/rc.local and add /usr/bin/fix-switch before the exit 0 line
Reboot
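The rc.local edit in the steps above can also be scripted. This is only a sketch: it works on the text of the file, so reading and writing /etc/rc.local is left to you, and the function name is made up. It puts the command before the exit 0 line and does nothing if the command is already there.

```python
def add_before_exit(rc_local_text, command="/usr/bin/fix-switch"):
    """Return rc.local text with `command` inserted before 'exit 0'.

    If the command is already present, the text is returned unchanged.
    If there is no 'exit 0' line, the command is appended at the end.
    """
    lines = rc_local_text.splitlines()
    if any(line.strip() == command for line in lines):
        return rc_local_text
    for index, line in enumerate(lines):
        if line.strip() == "exit 0":
            lines.insert(index, command)
            break
    else:
        lines.append(command)
    return "\n".join(lines) + "\n"


original = "# Put your custom commands here\n\nexit 0\n"
print(add_before_exit(original), end="")
```

Running it twice on the same text gives the same result, so it is safe to re-run after a system update that leaves rc.local in place.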

adminX · December 19, 2016, 7:30am

It looks like my omnia wants a holiday trip with me

fik · December 20, 2016, 4:05pm

cool, fix-switch works for me also

dandys · December 20, 2016, 4:35pm

I can confirm fix-switch seems to work reliably - kudos to adminX! I’ve just finished a power-cycling stress test (about 30 reboots) of my setup while hunting an unrelated bug, and fix-switch recovered the ethernet connection every time (before that, there was about an 80% probability that the ethernet ports would NOT work after a reboot).
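As a rough sanity check on those numbers, here is a short Python sketch. It assumes each reboot is independent and that the old failure rate really was about 80%; both are simplifications.

```python
def chance_of_clean_run(failure_rate, reboots):
    """Probability that every reboot succeeds purely by luck,
    assuming independent reboots with the given failure rate."""
    return (1.0 - failure_rate) ** reboots


# Figures taken from the post: ~80% failures before the fix, ~30 reboots.
p = chance_of_clean_run(0.80, 30)
print(f"chance of 30 good reboots without any fix: {p:.1e}")
```

The result is on the order of 1e-21, so 30 clean reboots in a row are practically impossible to explain by luck alone under these assumptions.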

A warning for anyone who is using SFP modules like me: fix-switch will not work until you stop the sfpswitch.py service and temporarily force eth1 to copper mode (you can use the sfpswitch.py code with the variable force_mode='phy-def'). That’s because the MDIO bus is not accessible when eth1 is in SFP mode.

PS: I’ve also tried to fix things more permanently by writing to the EEPROM attached to the switch, but it turned out to be write protected (R130). Desoldering the resistor would need physical access to the router, which is currently not easily possible, maybe someday in spring…

adminX · December 20, 2016, 6:14pm

The problem with the MDIO bus is fixable. The switch driver has a reference to the MDIO bus. Moving the functions to the kernel is simple. I will create a patch in the next few days. It will also be some kind of permanent fix.

You could short PIN 7 to GND. The resistor will limit the current to about 1 milliampere. Physical access would still be required.

marcerlser · December 23, 2016, 7:21am

Hi,

I just found this thread because my LAN port 2 (connected to a Netgear GS108E) often does not come up at reboot. I just added the fix-switch binary to my TO and rebooted it a couple of times. Before, port 2 had a 50-60% chance of coming up correctly; now, at the end of the boot process, fix-switch makes all ports come online even if the port 2 LED was dark at the beginning of the boot process…

It’s good to have a temporary fix at least for this problem whereas 5G card not coming up often is a different story.

Thanks adminX for creating the temporary fix for this, and I hope Turris will fix this soon, as it seems that the TO has problems with various switches.

tomesekjaroslav · December 26, 2016, 6:38pm

Hi,

originally I thought that I had the same problem with the LAN ports not starting after reboot, so I tried fix-switch, but without result.
After a while of testing and approximately twenty reboots (both software and hardware), I found that the ports do activate, but DHCP assigns only an IPv6 address.

The command dhclient ethXX (or, in newer distros, dhclient enpXXXX), as well as disconnecting and reconnecting the cable, gets an IPv4 address assigned.

Here are ifconfig outputs for network states after omnia reboot
rebooting omnia:
enp0s25 Link encap:Ethernet HWaddr f0:de:f1:2e:14:b4
UP BROADCAST MULTICAST MTU:1500 Metric:1
RX packets:567044 errors:0 dropped:6 overruns:0 frame:0
TX packets:237771 errors:0 dropped:0 overruns:0 carrier:0

omnia back online:
enp0s25 Link encap:Ethernet HWaddr f0:de:f1:2e:14:b4
inet6 addr: fd1b:da9b:9280::2ac/128 Scope:Global
inet6 addr: fe80::4a1:7c7e:94ba:ff3c/64 Scope:Link
inet6 addr: fd1b:da9b:9280:0:4729:3d35:76a5:c93a/64 Scope:Global
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:567109 errors:0 dropped:6 overruns:0 frame:0
TX packets:237869 errors:0 dropped:0 overruns:0 carrier:0

after dhclient command execution:
enp0s25 Link encap:Ethernet HWaddr f0:de:f1:2e:14:b4
inet addr:192.168.10.102 Bcast:192.168.10.255 Mask:255.255.255.0
inet6 addr: fd1b:da9b:9280::2ac/128 Scope:Global
inet6 addr: fe80::4a1:7c7e:94ba:ff3c/64 Scope:Link
inet6 addr: fd1b:da9b:9280:0:4729:3d35:76a5:c93a/64 Scope:Global
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:567372 errors:0 dropped:6 overruns:0 frame:0
TX packets:238047 errors:0 dropped:0 overruns:0 carrier:0

And now I’m not sure whether the Omnia assigns only IPv6 addresses at first and IPv4 only after a reconnection or dhclient request, or whether my computers (running Lubuntu 16.04) just accept IPv6 as enough and do “not bother” with the fact that IPv6 is useless in my network. Windows has not been tested yet.
The problem is that I don’t know how to determine which scenario is correct, but since reconnecting or issuing the dhclient command solves the problem, I think that the Omnia for some reason provides only the IPv6 config.
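A quick way to see what kind of addresses an interface ended up with is to classify them. This is a small Python sketch using the standard ipaddress module and the addresses from the ifconfig output above. The /24 prefix is derived from the 255.255.255.0 mask, and the helper names are made up.

```python
import ipaddress


def has_ipv4(addresses):
    """True if any address (CIDR notation) is an IPv4 address."""
    return any(ipaddress.ip_interface(a).version == 4 for a in addresses)


def describe(addresses):
    """Return (address, IP version, kind) for each address."""
    rows = []
    for text in addresses:
        ip = ipaddress.ip_interface(text).ip
        if ip.is_link_local:
            kind = "link-local"
        elif ip.is_private:
            kind = "private/ULA"
        else:
            kind = "global"
        rows.append((str(ip), ip.version, kind))
    return rows


after_reboot = [
    "fd1b:da9b:9280::2ac/128",
    "fe80::4a1:7c7e:94ba:ff3c/64",
    "fd1b:da9b:9280:0:4729:3d35:76a5:c93a/64",
]
after_dhclient = ["192.168.10.102/24"] + after_reboot

print("IPv4 after reboot:  ", has_ipv4(after_reboot))
print("IPv4 after dhclient:", has_ipv4(after_dhclient))
for row in describe(after_reboot):
    print(row)
```

Note that the fd1b:… addresses are unique local addresses (fc00::/7), not globally routable ones, even though ifconfig prints their scope as global. That fits the observation that IPv6 alone gives no usable connectivity in this network. It does not by itself answer whether the DHCPv4 request or the offer went missing.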

davidhaluska · December 28, 2016, 10:43am

Hi All,

I finally have an official answer from CZ.NIC through the RMA. I listed the devices I have issues with when connected to the Omnia; one of them was a Netgear GS108, so the Turris team bought this switch and confirmed that the issue is really there. They found that this is a software-related issue and that it will be fixed in a system update; according to them, development is already working on it. I received the router back from RMA yesterday, but I will only be able to test it in January, to see whether there is a new update and whether it fixes our issue. I will let you know.

David