Link aggregation

Problem statement

The new switch supports 2.5 Gbit/s per port. The computer connected to it has multiple network ports, but each of them only does 1 Gbit/s. We want more throughput than a single port offers, so we connect several of the computer's ports to the switch.

Nomenclature

This is sometimes referred to as “link aggregation” and sometimes as “Trunk Group Setting”. The latter is confusing, because Cisco also uses the word “trunk” for links carrying 802.1Q-tagged VLANs, which is unrelated to link aggregation.

Standards

To negotiate the details of the links automatically there is LACP (the Link Aggregation Control Protocol, originally IEEE 802.3ad, later moved to 802.1AX), which lets both ends agree on which links belong to the group and when they may carry traffic. While it's tempting to assume that a protocol like that would make such connections more stable, I have observed the opposite, because as it turns out, different vendors implement the protocol in different ways. Some forget implementation details, others add vendor-specific extras. In short: while this protocol may be the best option for connecting two devices from the same vendor, it may cause headaches when the device on the other end uses a different implementation.

How do you know?

After I configured my switch for active LACP and my FreeBSD machine for LACP (there seems to be no active/passive setting for LACP on FreeBSD), I observed regular network outages: every 1-2 minutes the connection dropped for a few seconds. In the system log I found:

kernel: lagg0: link state changed to DOWN
kernel: lagg0: link state changed to UP
kernel: igb1: Interface stopped DISTRIBUTING, possible flapping
kernel: igb0: Interface stopped DISTRIBUTING, possible flapping

Setting net.link.lagg.lacp.debug=1 produced more context alongside these messages (which I forgot to copy). From that output I saw that my FreeBSD system was expecting a specific field in the LACP packets and instead received data it did not understand. Each time that happened, the link was reset and then started working again, which explains the regularity of the outages.
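To see how often each port is affected, the flapping messages can be counted per interface. The following is only a minimal shell sketch: the function name count_lacp_flaps is made up for illustration, and it assumes the kernel messages look like the ones quoted above and are written to /var/log/messages, the default location on FreeBSD.

```shell
#!/bin/sh
# Hypothetical helper: count "stopped DISTRIBUTING" warnings per interface.
# Reads kernel log lines on stdin, e.g.
#   kernel: igb1: Interface stopped DISTRIBUTING, possible flapping
# and prints one "<interface> <count>" line per port, sorted by name.
count_lacp_flaps() {
    awk '/Interface stopped DISTRIBUTING/ {
             # The interface name is the field right before "Interface",
             # with its trailing colon removed.
             for (i = 2; i <= NF; i++)
                 if ($i == "Interface") {
                     name = $(i - 1)
                     sub(/:$/, "", name)
                     count[name]++
                 }
         }
         END { for (name in count) print name, count[name] }' | sort
}

# Usage on the FreeBSD machine (assumed log location):
#   count_lacp_flaps < /var/log/messages
```

If the counts keep growing at a steady rate on all member ports, that points to the periodic resets described above rather than to a single bad cable or port.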

How to fix it?

I have now put the trunk group on my switch into the so-called “static mode” (as in: not “active”, not “passive”, but “static”!) and configured my FreeBSD system to use laggproto loadbalance. That seems to be stable.
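On the FreeBSD side, a persistent configuration along these lines could look like the following /etc/rc.conf fragment. This is a sketch under assumptions, not a copy of the actual setup: the port names igb0 and igb1 are taken from the log messages above, and DHCP is only a placeholder for whatever addressing the machine really uses.

```shell
# /etc/rc.conf fragment (sketch)

# Bring up the physical member ports without assigning addresses to them.
ifconfig_igb0="up"
ifconfig_igb1="up"

# Create the aggregate interface at boot.
cloned_interfaces="lagg0"

# Use the static "loadbalance" protocol instead of "lacp" and add both ports.
# DHCP is an assumption; a static address could be given here instead.
ifconfig_lagg0="laggproto loadbalance laggport igb0 laggport igb1 DHCP"
```

The switch ports that igb0 and igb1 are plugged into need to be members of the same static trunk group, otherwise the switch treats them as two independent links.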