[SAC] [Hosting] RESOLVED: Core network switch reboot

All,

It appears that one of our core network switches had a kernel panic and rebooted, which caused widespread outages throughout our infrastructure. As of right now, everything appears to be back to normal, but please let me know if that isn’t the case by sending an email to support@osuosl.org.

Apologies for the outage. We’ll be looking into why this switch had a kernel panic in the first place.

Thanks-

Lance Albertson

Director
Oregon State University | Open Source Lab

Unfortunately, this just happened again overnight. We may need to schedule another outage to perform a software upgrade on this switch so that this stops happening. We’ll send an announcement out once we have everything in place to do that upgrade.

Thanks-

Lance Albertson

Director
Oregon State University | Open Source Lab

This happened again at approximately 10AM PDT. Since we moved our uplink to this switch, everything went down while the switch rebooted.

We’re still planning on doing an upgrade but don’t have a date yet for that. We’ll hopefully get that going soon.

Thanks for your patience.

Lance Albertson

Director
Oregon State University | Open Source Lab

Sadly this just happened again about 50 minutes ago. We may need to do some emergency firmware patching tomorrow. As a backup plan, I’m also formulating a plan to add another switch to try and minimize the impact of this troublesome switch.

Once I gather some additional information tomorrow morning, I’ll send an update on what we’re planning to do.

Thanks again for your patience.

Lance Albertson

Director
Oregon State University | Open Source Lab

All,

I wanted to pass along more information on where we’re at and our current plans to try and work around this issue.

Without going deep into the history of our core network infrastructure, we have two core “routers” that are both aging and we’re in the process of replacing them with something newer.

Previously, our uplink was connected through our Cisco 6509. This switch has several 1G line cards that half of our servers are directly connected to.

The other core switch is a Cisco Nexus 6001 which has three fabric extenders which provide 1G connectivity to the other half of our servers. When we migrated over to the LinkOregon network, we moved the uplink over to this Nexus 6k as it was much easier to get LR optics for it.

Unfortunately, this Nexus 6k has kernel panicked and rebooted multiple times over the past several months, causing these outages. Many of our downlink 10G switches are connected to this Nexus 6k, which means there’s a larger impact when it goes down.

A few years ago, a high-speed trading company donated a pallet full of Arista switches to us, and I’ve been slowly adding them to our infrastructure. Even though they are EOL, they still work very well and we haven’t had any problems with them. And since I have a lot of them, I can easily replace one if it goes bad.

My current plan is to set up one of these Arista switches and move all of the current 10G connections to it. This way, we can at least reduce the impact if/when this Nexus 6k switch reboots again. In theory, a reboot should then only affect the servers directly connected to the FEX switches.
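As a rough illustration of that reasoning, the topology can be modeled as a small graph, and a reachability check shows what loses upstream connectivity while the Nexus 6k is down, before and after the move. This is only a sketch: the node names and the `reachable` and `cut_off` helpers are hypothetical, and the links between the core switches themselves are assumptions, since they are not spelled out here.

```python
from collections import deque


def reachable(links, start, failed):
    """Return the nodes reachable from `start` while `failed` is down."""
    adjacency = {}
    for a, b in links:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    if start == failed:
        return set()
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor != failed and neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


def cut_off(links, start, failed):
    """Return the nodes (other than `failed`) that lose their path to `start`."""
    nodes = {n for link in links for n in link}
    return nodes - reachable(links, start, failed) - {failed}


# Before the change: the uplink and the 10G switches hang off the Nexus 6k.
BEFORE = [
    ("uplink", "nexus6k"),
    ("nexus6k", "fex-hosts"),
    ("nexus6k", "10g-switches"),
    ("nexus6k", "cisco6509"),  # assumption: the 6509 reaches the uplink via the Nexus
    ("cisco6509", "6509-hosts"),
]

# After the change: the uplink and the 10G switches move to the Arista switch.
AFTER = [
    ("uplink", "arista"),
    ("arista", "10g-switches"),
    ("arista", "nexus6k"),    # assumption: the Nexus now sits behind the Arista
    ("arista", "cisco6509"),  # assumption: the 6509 now sits behind the Arista
    ("nexus6k", "fex-hosts"),
    ("cisco6509", "6509-hosts"),
]

print(sorted(cut_off(BEFORE, "uplink", "nexus6k")))
# prints ['10g-switches', '6509-hosts', 'cisco6509', 'fex-hosts']
print(sorted(cut_off(AFTER, "uplink", "nexus6k")))
# prints ['fex-hosts']
```

Under these assumptions, a Nexus 6k reboot after the move only cuts off the hosts behind the fabric extenders, which matches the expectation described above.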

I reached out to the OSU IT community and they graciously donated two 10G-LR optical modules so that I can put this plan in place without having to wait to ship modules.

Current plan for today:

  • Set up the new Arista switch
  • Move the upstream LinkOregon connection to it
  • Move all downstream 10G links to this switch

I will send another email with the timing of the actual outages for the cutover.

Longer-term plans:

  • Work with vendors to replace our aging core network infrastructure with something that’s still supported and that we can afford
  • Look into getting redundancy put into place so that we don’t have this issue anymore
  • Migrate off of the older equipment

If anyone on this list has connections to Arista or any other major edge networking vendor, please let me know. That will certainly help our situation in the long term!

I had already started working on a plan to replace these systems but it seems my time may have run out (at least for the Nexus 6k switch).

Thanks all for your patience and support!

Lance Albertson

Director
Oregon State University | Open Source Lab

I have the “new” switch set up and ready to go. I’m currently planning on doing the cutover in about 20 minutes (3pm PDT). You will see a set of outages as I do the following:

  1. Move LinkOregon uplink to “new” switch
  2. Move oslsw3 uplink to “new” switch
  3. Move oslsw1 uplink to “new” switch
  4. Move remaining backend 10G switches

If anything goes wrong, I should be able to quickly revert the change.

Lance Albertson

Director
Oregon State University | Open Source Lab

This has been completed and everything seems to be working fine.

Now keep in mind, the troublesome switch could reboot again until we figure out why it’s happening. If it does, its impact should at least be smaller than before.

Thanks!

Lance Albertson

Director
Oregon State University | Open Source Lab

Hi All,

Unfortunately, it looks like this switch decided to reboot again last night at around 1AM PDT. Thankfully, the impact was smaller than before thanks to all of the adjustments we made in recent weeks.

I wanted to send another update on how we’re going to permanently fix this moving forward.

I have racked two “new” Arista 1G switches which will replace two of the three Cisco Nexus fabric extenders where the majority of our hosts are. Once I have those plumbed into the new switches, I can start moving hosts over to these switches one by one. I’ll send out another email with a list of hosts this will impact in a few weeks once it’s ready.

Before that happens we need to finish running fiber to our second core switch and finish the MLAG configuration and backend upstream connection. Once this is finished, we’ll have more redundancy in our network. There will be another brief outage when we switch over to the new “core” switches with MLAG.

Thanks again for your patience. Hopefully I can get these all done before this switch decides to reboot again!

On Mon, Sep 12, 2022 at 11:54 PM Lance Albertson <lance@osuosl.org> wrote:

Sadly this just happened again about 50 minutes ago. We may need to do some emergency firmware patching tomorrow. As a backup plan, I’m also formulating a plan to add another switch to try and minimize the impact of this troublesome switch.

Once I gather some additional information tomorrow morning, I’ll send an update on what we’re planning to do.

Thanks again for your patience.

On Mon, Sep 12, 2022 at 3:14 PM Lance Albertson <lance@osuosl.org> wrote:

This happened again at approximately 10AM PDT. Since we moved our uplink to this switch, everything went down while the switch rebooted.

We’re still planning on doing an upgrade but don’t have a date yet for that. We’ll hopefully get that going soon.

Thanks for your patience.

On Wed, Aug 24, 2022 at 7:40 AM Lance Albertson <lance@osuosl.org> wrote:

Unfortunately, this just happened again overnight. We may need to schedule another outage to perform a software upgrade on this switch so that this stops happening. We’ll send an announcement out once we have everything in place to do that upgrade.

Thanks-

On Wed, May 25, 2022 at 11:22 PM Lance Albertson <lance@osuosl.org> wrote:

All,

It appears that one of our core network switches had a kernel panic and rebooted, which caused widespread outages throughout our infrastructure. As of right now, everything appears to be back to normal, but please let me know if that isn’t the case by sending an email to support@osuosl.org.

Apologies for the outage. We’ll be looking into why this switch had a kernel panic in the first place.

Thanks-

Lance Albertson

Director
Oregon State University | Open Source Lab
