Fiber7 (Switzerland) SFP Compatibility

dev-zero · December 1, 2016, 9:21am

Ok, I retried again. When I disable the wireless during the installation wizard, the Turris boots with it.

I also tried the different modes, with and without the change to the GPIO rate-select pin, no luck so far.

sECuRE · December 11, 2016, 10:46am

Any news? Still eager to get this working.

dev-zero · December 12, 2016, 8:28am

Just to summarize my case: I have an SFP module directly from flexoptix with the “Turris Omnia” compatibility, a Turris Omnia with the OS version 3.3 (kernel 4.4.35-34abcd5e548fc8ed5390269f3a31d173-15, using the omnia-stable full medkit). I verified the SFP module itself in the TP-Link media converter yesterday.
I will try to systematically go through all the settings again (rate-select pin and force-modes) by Wednesday, but last time I tried them, none of them worked.

Anyway, when “watching” the device using ethtool eth1 I have the following cycle of stages:

Settings for eth1:
Supported ports: [ TP MII ]
Supported link modes:   1000baseT/Half 1000baseT/Full 
Supported pause frame use: No
Supports auto-negotiation: Yes
Advertised link modes:  1000baseT/Half 1000baseT/Full 
Advertised pause frame use: No
Advertised auto-negotiation: Yes
Speed: 10Mb/s
Duplex: Half
Port: MII
PHYAD: 2
Transceiver: external
Auto-negotiation: on
Link detected: no

Settings for eth1:
Supported ports: [ TP MII ]
Supported link modes:   1000baseT/Half 1000baseT/Full 
Supported pause frame use: No
Supports auto-negotiation: Yes
Advertised link modes:  1000baseT/Half 1000baseT/Full 
Advertised pause frame use: No
Advertised auto-negotiation: Yes
Link partner advertised link modes:  1000baseT/Full 
Link partner advertised pause frame use: No
Link partner advertised auto-negotiation: No
Speed: 1000Mb/s
Duplex: Full
Port: MII
PHYAD: 2
Transceiver: external
Auto-negotiation: on
Link detected: yes

Settings for eth1:
Supported ports: [ TP MII ]
Supported link modes:   1000baseT/Half 1000baseT/Full 
Supported pause frame use: No
Supports auto-negotiation: Yes
Advertised link modes:  1000baseT/Half 1000baseT/Full 
Advertised pause frame use: No
Advertised auto-negotiation: Yes
Speed: 10Mb/s
Duplex: Half
Port: MII
PHYAD: 2
Transceiver: external
Auto-negotiation: on
Link detected: yes
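
The three snapshots above are easier to compare when each one is reduced to the few fields that actually change. A minimal sketch in Python, assuming the snapshots were saved as plain text from `ethtool eth1`; the helper names `parse_ethtool` and `summarize` are made up for illustration and are not part of any tool:

```python
def parse_ethtool(text):
    """Turn one 'Settings for ethX:' block into a dict of field -> value."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("Settings for"):
            continue
        key, sep, value = line.partition(":")
        if sep:  # lines without a colon (continuation lines) are skipped
            fields[key.strip()] = value.strip()
    return fields


def summarize(fields):
    """Keep only the fields that differ between the snapshots."""
    return (
        fields.get("Speed"),
        fields.get("Duplex"),
        fields.get("Link detected"),
        "Link partner advertised link modes" in fields,
    )


# Shortened copy of the second snapshot from the post above.
snapshot = """Settings for eth1:
Link partner advertised link modes:  1000baseT/Full
Speed: 1000Mb/s
Duplex: Full
Link detected: yes
"""
print(summarize(parse_ethtool(snapshot)))
```

Running `summarize` over each captured block in order would show the cycle as a short list of tuples instead of three near-identical dumps.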

dev-zero · December 12, 2016, 10:34pm

Hmm, something is not quite right with the mvneta driver. As soon as I start playing with ethtool and its autoneg option, the only speed I get advertised from the link partner is 10baseT/Full. Also, setting it to speed 1000 duplex full autoneg off does not help.
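
For going through the forced-mode settings systematically, the `ethtool -s` invocations can be generated rather than typed by hand. A small sketch in Python, assuming the interface is `eth1` as in the output above; the helper names are hypothetical, and by default the commands are only printed, not executed:

```python
import itertools
import subprocess


def ethtool_set_args(dev, speed=None, duplex=None, autoneg=None):
    """Build the argv for `ethtool -s`, skipping options left as None."""
    args = ["ethtool", "-s", dev]
    if speed is not None:
        args += ["speed", str(speed)]
    if duplex is not None:
        args += ["duplex", duplex]
    if autoneg is not None:
        args += ["autoneg", "on" if autoneg else "off"]
    return args


def all_combinations(dev):
    """Yield one argv per duplex/autoneg combination at 1000 Mb/s."""
    for duplex, autoneg in itertools.product(("half", "full"), (True, False)):
        yield ethtool_set_args(dev, 1000, duplex, autoneg)


def apply(args, dry_run=True):
    """Print the command; run it (as root, on the router) only if dry_run is False."""
    print(" ".join(args))
    if not dry_run:
        subprocess.run(args, check=True)


for args in all_combinations("eth1"):
    apply(args)
```

Checking `ethtool eth1` after each command and noting the result gives a complete table of which combinations were tried.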

adminX · December 12, 2016, 11:21pm

For fibre the meanings may be different, with 10baseT full duplex meaning 1000BASE-X half-duplex. I stumbled upon this while working with the switch. 10baseT half duplex would be 1000BASE-X full-duplex.

I think I know what is going on. The driver gets set to 1000BASE-T full duplex and the other side wants 1000BASE-X full duplex. So this means the NIC is in the wrong mode.

sECuRE · December 17, 2016, 10:00pm

…so, how can we put the NIC into the correct mode?

adminX · December 20, 2016, 7:05am

I think my assumptions are wrong, so ignore my post above.

I still think the basic direction is right but it will take quite a bit of time to understand what may be going on. Finding bugs in drivers without having the manuals and specs is no fun.

Short question at the end: The MC220L media converters have a switch. Is it set to AUTO or FORCE?

yorik · December 20, 2016, 7:39am

The switch has to be set to auto. It didn’t work in force mode.

brill · December 20, 2016, 1:23pm

Hi!

Sorry for the long delay; we have a bunch of problems to work on in parallel, which increases latency. Anyway, I contacted people from Init7 and FlexOptix and they were so kind as to send me information about their equipment and a few pieces of their configuration. They use Cisco 4500 switches, so I tried to test the issue with the nearest Cisco Catalyst derivative that I have here, which was a Cisco 7600 (almost the same as a 6500) with the WS-6724-SFP linecard. I managed to trigger the problem even with the simplest configuration: a freshly flashed Omnia on one side and this on the other:

interface GigabitEthernet1/23
 description Omnia Test
 switchport
 switchport access vlan 911
 switchport mode access

Then I tried adding “speed nonegotiate” to the configuration, and the link came up and started working. So I think that this issue boils down to the interpretation of IEEE 802.3-2005 clause 37. I will 1) try to ask Marvell for support, since I think they might know the proper settings for the NIC to be compatible with Cisco, 2) possibly try to flip some bits in the NIC configuration registers that might be the right ones, and 3) try to obtain access to an optical Ethernet analyzer to actually look at the negotiation frames, so that I can tell which side is doing something wrong. (If any of you could help with any of these steps I would be grateful.)

However, I am aware that the Cisco (most likely proprietary) chips that implement L2 in Catalyst-based switches might be quite old, and I can’t make any guarantees whether we will ever be able to accommodate their implementation.

sECuRE · December 20, 2016, 5:01pm

Unfortunately I neither have the equipment nor the knowledge to help, but I’d like to say thank you very much for your work on this issue, @brill, it’s much appreciated!

dev-zero · December 21, 2016, 1:48pm

@brill: thank you very much for the news.

Out of curiosity: where are the kernel sources for the Turris Omnia kernel, respectively which patches are applied for the mvneta driver on top of vanilla linux-4.4.x?

Googling around reveals that it wouldn’t be the first time Cisco implements the autonegotiation differently than anybody else.

adminX · December 21, 2016, 5:13pm

[PATCH RFC 00/26] Phylink & SFP support has some hints.

SGMII mode, where the in-band status indicates the speed, duplex and flow control settings of the link partner.

1000base-X mode, where the in-band status indicates only duplex and flow control settings (different, incompatible bit layout from SGMII.)

Linux seems to have some problems in the drivers with this.
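
The two layouts quoted above can be put side by side. A rough sketch in Python; the bit positions follow IEEE 802.3 clause 37 for 1000BASE-X and the commonly documented SGMII layout, and the names (`decode_c37`, `decode_sgmii`, the constants) are made up for this illustration rather than taken from the mvneta driver:

```python
# 1000BASE-X (IEEE 802.3 clause 37) base page bits.
C37_FULL_DUPLEX = 1 << 5
C37_HALF_DUPLEX = 1 << 6
C37_PAUSE = 1 << 7
C37_ASYM_PAUSE = 1 << 8

# SGMII word as sent by the PHY side.
SGMII_SPEED_SHIFT = 10
SGMII_SPEED_MASK = 0b11 << SGMII_SPEED_SHIFT
SGMII_FULL_DUPLEX = 1 << 12
SGMII_LINK_UP = 1 << 15
SGMII_SPEEDS = {0b00: 10, 0b01: 100, 0b10: 1000}


def decode_c37(word):
    """Interpret a 16-bit in-band word as a 1000BASE-X base page."""
    return {
        "speed": 1000,  # fixed; speed is not carried in this layout
        "full_duplex": bool(word & C37_FULL_DUPLEX),
        "half_duplex": bool(word & C37_HALF_DUPLEX),
        "pause": bool(word & C37_PAUSE),
        "asym_pause": bool(word & C37_ASYM_PAUSE),
    }


def decode_sgmii(word):
    """Interpret the same 16-bit word using the SGMII layout."""
    speed_bits = (word & SGMII_SPEED_MASK) >> SGMII_SPEED_SHIFT
    return {
        "speed": SGMII_SPEEDS.get(speed_bits),  # None for the reserved value
        "full_duplex": bool(word & SGMII_FULL_DUPLEX),
        "link": bool(word & SGMII_LINK_UP),
    }


word = 0x0020  # a 1000BASE-X partner advertising full duplex only
print(decode_c37(word))
print(decode_sgmii(word))
```

Read with the SGMII layout, the word `0x0020` comes out as 10 Mb/s, half duplex, link bit clear, which shows why the two modes cannot be mixed. Whether this is what actually happens on the Omnia is not established in this thread.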

Ondrej_Caletka · December 22, 2016, 1:40pm

The patches are in the TurrisOS repository, here:

LukasTribus · December 24, 2016, 12:32pm

Yes, the 6500 is very similar to the 7600 platform, and some linecards interoperate (the WS-6724-SFP, for example).
However, the 4500 is completely different, it cannot at all be compared to the 7600/6500 series. Very different architecture, especially regarding linecards.

So the issue is easily reproducible on 2 very different platforms, both of which have been around for at least a decade and operate fine in a huge number of networks with a whole lot of other players. The “dumb” TP-Link media-converter, AVM’s Fritzbox, Mikrotik boxes, etc all work fine with init7’s Catalyst 4500.

Against which vendors/platforms does the Omnia work fine with 1GBit/s BiDI SFPs and 1000Base-X based negotiation enabled?

I don’t have any evidence that the problem is on the Omnia side as much as I don’t have any evidence that the problem is the Cisco switch.

This is an interoperability problem; pointing fingers without having data to back it up doesn’t help anybody. Research and hard technical facts are what matters.

Please keep an open mind about this.

dev-zero · December 27, 2016, 12:54pm

Would it therefore make sense to pull from the LEDE project again?
It seems that they added the mentioned patchset to their Kernel 4.4: https://github.com/lede-project/source/tree/master/target/linux/mvebu/patches-4.4
Or did someone already try with that patchset?

brill · January 2, 2017, 11:57am

Well, the 1000BASE-BX (BiDi) modules in question work for me in Cisco 2960G and 3560G switches with auto-negotiation enabled.

LukasTribus · January 2, 2017, 2:40pm

Good. What I’m saying is that this is not proof that the root cause is on the 4500 side as much as the fact that “other vendors work fine with the 4500” isn’t proof that the problem is on the Omnia.

Unless the exact root cause is known, and it is confirmed that the Cisco device violates IEEE or SFF specifications, you would be better off not assuming that, as it will only cloud your judgement.

I have handled a lot of interoperability issues between telco vendors and the reaction is always the same. Without conducting any kind of RCA they would blindly assume the interop issue is caused by the other party and then they would twist some stats to convince themselves that their assumption is actually a fact. Please don’t do this is all I’m saying.

dev-zero · January 14, 2017, 4:07pm

Ok, so, is there someone still working on this? If yes, any progress?
Is there anything a user like me could help with?
@LukasTribus: would you have any hints on what to look at/for? Any ideas why the Turris Omnia behaves differently with different backend switches?

LukasTribus · January 16, 2017, 1:50pm

Your guess is as good as mine (I’m not a developer and I don’t know low level details of 1000BaseX negotiation). Could be a timing issue. You posted the LEDE patchset earlier, I’m not sure what patch is relevant here, but I can see for example “sfp: retry phy probe if unsuccessful” [1], which seems interesting.

I don’t have an Omnia in my hands, but I would certainly try current LEDE and an Ubuntu (ARM) installation with different mainline kernels from [2]. Not sure how hard it is to install those on the Omnia.

The reason is that if this can be reproduced with a mainline linux kernel, we can ask for help on netdev. The SFP handling in current linux kernels is pretty young, so we should be able to find the guys that just wrote this code for the vanilla linux kernel.
I’m not saying the Turris Team can’t figure it out on their own, I’m just saying there are developers in the linux community that spend a lot of time on this specific aspect of the linux kernel and its integration with other hardware, and we may benefit from their experience. Apart from that, if a fix (or even a workaround for a problem that is specific to bogus implementations on the other side) is committed to the mainline linux kernel, every linux user benefits from it.

Another approach could be to get a Marvell-based SFP NIC, like the PEX1000SFP2, and try this in an x86 box. But this is at least 150€, and there are no guarantees that the problem can be reproduced this way.

[1] https://github.com/lede-project/source/blob/master/target/linux/mvebu/patches-4.4/300-reprobe_sfp_phy.patch
[2] http://kernel.ubuntu.com/~kernel-ppa/mainline/

brill · January 20, 2017, 2:08pm

Not me… But I definitely want to revisit this problem once somebody confirms that it does the same thing with Cisco 4500 as what I have experienced with Cisco 6500/7600 - I observed that it didn’t work with autonegotiation and it worked for me without.

I wrote this to Init7 and FlexOptix, waiting for further info. Or I can try to test it on some 4500 here if I have time and if I manage to get my hands on one…