[SAC] [Hosting] [OUTAGE]: Backbone switch reboot - Dec 10, 2022 5:01-5:08AM PST (1301-1308 UTC)

All,

Early this morning, at around 5:01AM PST (1301 UTC), one of our backbone switches (oslsw1 - Cisco Nexus) had a kernel panic and rebooted itself. Unfortunately, this is an ongoing problem that I’ve described before. Most services came back online at around 5:08AM PST (1308 UTC); a few services impacted by the outage stayed down until I fixed them a few minutes ago.

As I mentioned before, our long-term plan is to migrate off of this switch completely.

Here is where we stand currently with that:

  • We have two (2) “new” edge switches ready to replace two (2) of the three (3) edge switches connected to this troublesome switch
  • We will start migrating internal OSL hosts to these switches next week (which will eliminate the secondary issues we currently have to fix manually whenever this switch reboots)
  • After we’ve completed migrating our internal hosts, we’ll start migrating project hosts to these switches

Longer term, we still need to either purchase some Arista 48-port 1G edge switches or have them donated to complete this process. We will need an additional eight (8) of these switches to fully replace our aging network backbone. This would also get us off of an ancient Cisco 6509 system that we’re still paying a support contract for!

Related to all of the above, we need another network maintenance window in the coming weeks to fully migrate to the new set of core switches. I’ll send out another email about that once I’m ready to schedule it.

Thanks and Happy Holidays everyone!

Lance Albertson

Director
Oregon State University | Open Source Lab