[SAC] [Hosting] Unplanned Power Event

It appears we had an unplanned power event in our primary data center early this morning at 3:03 AM PDT (10:03 UTC) that affected one of the two power feeds. Virtually every system with dual power supplies should have remained online. The exceptions are systems in one row that is fed only by the affected power feed, which include:

  • power8-aix
  • pieta.debian.org
  • gcc2-power8
  • All Buildbot/RTEMS systems
  • gcc113
  • gcc114
  • gcc115
  • gcc116
  • gcc117
  • gcc118

I believe every system that we monitor is back online, but there might be others we aren’t monitoring that are still down. If that’s the case, please send an email to support and we’ll take a look as soon as possible.

I’m still waiting to hear what happened and why, and will pass that information along once I learn more.

Thanks for your patience.

···

Lance Albertson

Director
Oregon State University | Open Source Lab

I got word that this outage was campus-wide and also impacted the OpenCompute hosts. I went through those hosts and made sure they are back online, but let me know if I missed anything.

OSU will be sending in a tech in a few days to see why the UPS didn’t fail over properly in our primary data center, which caused the power event. I’m also going to spot check the power on a few hosts when I go in on Tuesday to ensure it is split properly between the two power feeds. If any of your hosts with dual power went down, please let me know ASAP so I can add them to the list of hosts to check.

Thanks for your patience!

···

Lance Albertson

Director
Oregon State University | Open Source Lab

I received an update on the issues we had in the primary data center. It appears that there was a battery cell problem on one of the UPSes. Before the outage, OSU issued a purchase order for replacement batteries and is waiting for them to arrive before scheduling the installation. The projected arrival date for the batteries is September 10th. Once they arrive, the installation will be scheduled as a priority.

In the meantime, this may happen again. However, I did fix a few systems whose problems were related to how their power was configured.

If you have any questions or concerns please let me know.

Thank you!

···

Lance Albertson

Director
Oregon State University | Open Source Lab

FYI: It looks like we had another power event that impacted our primary data center along with our OpenCompute hosts in another data center. I’m taking a look to see what might be down, but this time it seems to be far less widespread. I don’t think we had any issues with any of the OSL managed services.

Please let me know if you do have any issues.

Thanks-

···

Lance Albertson

Director
Oregon State University | Open Source Lab