[SAC] [Hosting] Partial datacenter outage this morning

All,

It seems we had some kind of power event at approximately 6:21 AM PDT (13:21 UTC) that affected some (but not all) of our hosts. At this point I’m not entirely sure what happened, but my guess is that one of the power circuits went down and then came back online. This is confusing since the UPS should have prevented that. I’m going to be heading into the datacenter soon to do a visual inspection.

If you have any hosts that are offline and need me to help bring them back, please send an email to support@osuosl.org and I will take a look. Feel free to also reach out on IRC at #osuosl.

Thanks-

Lance Albertson

Director
Oregon State University | Open Source Lab

FYI: I got the following regarding the power event on Saturday morning.

---------- Forwarded message ----------

Lance Albertson

Director
Oregon State University | Open Source Lab

Looks like we had another power event while they were trying to fix the UPS today. We didn’t have any outages except for one project machine, which was a single-PSU host. Apologies if this affected anyone’s hosts. Hopefully this is the last of this!

Thanks-

---------- Forwarded message ----------

On Mon, Jul 2, 2018 at 10:29 AM, Lance Albertson <lance@osuosl.org> wrote:

FYI: I got the following regarding the power event on Saturday morning.

---------- Forwarded message ----------
From: Fowler, Stephen Lee <steve.fowler@oregonstate.edu>
Date: Mon, Jul 2, 2018 at 10:26 AM
Subject: [Kerr_b210-announce] Saturday Power issue

All,

I learned after the fact that we had a power event on Saturday that affected power in B210. I did see that the generator came online, but I did not get any alerts from the other units in that power chain. Further investigation revealed that one of the UPS units suffered an inverter fault, which is likely the cause of some systems losing power. While we monitor the systems in B210, we did not receive any errors from the UPS units themselves, so I was not aware there had been an issue.

What is happening:

I have engaged the UPS maintenance service to investigate and repair the faulty UPS. I will also be talking with them about the logging and notification failure of both units.

Lance Albertson

Director
Oregon State University | Open Source Lab