[Hosting] Post-mortem: Network connectivity issues during edge router upgrade

Hi everyone,

Impact: Intermittent IPv4 and IPv6 connectivity for some hosted services for approximately 3 hours and 20 minutes beyond the planned maintenance window.

Today, OSL performed scheduled maintenance to bring our second edge router (sw-edge1) into active service alongside our existing edge router (sw-edge2). The goals were to provide active-active routing redundancy at our network edge, eliminate long-standing traffic asymmetry, and enable future edge router maintenance without service interruption.

The maintenance hit two issues:

1. An upstream LACP issue with our ISP (LinkOregon).

Stale configuration on the upstream interface facing our new switch (left over from a pseudo-wire used during our data center migration earlier this year) prevented the new uplink from forming an active LACP bundle. Because we had already activated sw-edge1 as a Layer 3 router, traffic that hashed to sw-edge1 had no clean path out and was disrupted until the bundle came up. LinkOregon’s team identified and removed the legacy configuration once it was noticed.

2. An ARP and IPv6 neighbor synchronization issue between our two edge switches.

After we resolved the LACP issues, some hosted services experienced intermittent connectivity — some hosts were reachable, others were not, with the pattern shifting over time. The root cause was a subtle platform-specific behavior on our Arista switches: by default, MLAG (the technology bonding our two edge routers into an active-active pair) does not synchronize ARP and IPv6 neighbor state between peer switches unless an additional software agent is active. We had been operating under the assumption that this synchronization happened automatically — a widespread assumption that turned out to be incorrect for our hardware platform.
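
For anyone who wants an intuition for why reachability was partial rather than total, here is a deliberately simplified Python sketch. It is a toy model, not Arista's implementation: the EdgeSwitch class, the simulate function, and the random choice standing in for link hashing are all hypothetical names and assumptions made for illustration. Each host's neighbor entry is learned by only one of the two peers; without synchronization, traffic arriving at the other peer has no entry for that host.

```python
import random


class EdgeSwitch:
    """Toy stand-in for one gateway peer with its own neighbor table."""

    def __init__(self, name):
        self.name = name
        self.neighbors = set()  # hosts this peer has an ARP/ND entry for

    def learn(self, host):
        self.neighbors.add(host)

    def can_deliver(self, host):
        return host in self.neighbors


def simulate(hosts, sync, seed=0):
    """Return {host: (deliverable via sw-edge1, deliverable via sw-edge2)}."""
    rng = random.Random(seed)
    edge1, edge2 = EdgeSwitch("sw-edge1"), EdgeSwitch("sw-edge2")
    for host in hosts:
        # Assumption: each host's ARP/ND reply lands on exactly one peer.
        # The coin flip stands in for whatever hashing picks the link.
        learner, other = (edge1, edge2) if rng.random() < 0.5 else (edge2, edge1)
        learner.learn(host)
        if sync:
            # With synchronization active, the peer learns the entry too.
            other.learn(host)
    return {h: (edge1.can_deliver(h), edge2.can_deliver(h)) for h in hosts}


if __name__ == "__main__":
    hosts = ["host-%d" % i for i in range(6)]
    for sync in (False, True):
        print("sync enabled:", sync)
        for host, (via1, via2) in simulate(hosts, sync).items():
            print("  %s  via sw-edge1=%s  via sw-edge2=%s" % (host, via1, via2))
```

In this model, every host is deliverable through exactly one peer when synchronization is off, and through both when it is on. Real switches also age out and relearn entries, which the sketch does not model, but which is consistent with the shifting pattern we observed.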

We had reviewed the migration plan with both LinkOregon and Arista beforehand, and neither of these failure modes was anticipated by anyone involved. We’re grateful that Arista’s engineer was able to join us on short notice: they had a meeting in ten minutes when we reached out, and still provided a working fix within about twenty. Their fix involved enabling a VXLAN configuration between our edge switches (used purely to activate the synchronization agent, not to carry traffic) and changing our IPv6 gateway addressing model to give each switch a unique IPv6 address alongside the shared gateway. From the host perspective, gateway addresses are unchanged.

The IPv4 fix was in place by 2:20 PM PDT; IPv6 SLAAC was fully restored by approximately 3:20 PM PDT.

Thanks to Arista’s engineer for the quick response, to LinkOregon’s network team for the fast turnaround, and to our hosted projects for their patience. If you observed connectivity issues you’d like us to verify against our timeline, please reach out via support@osuosl.org.

Thanks-