[SAC] [Hosting] Unplanned outage: Hypervisor issue with gprod1 on primary Ganeti cluster


At approximately 2:36AM PDT (0936 UTC), one of the hypervisors (gprod1) in our primary Ganeti cluster started having hardware issues. This took down all of the instances running on that node. I attempted to bring the node back online, but the hardware issue prevented it from coming back up. At that point I failed all of the VM instances over to their secondary nodes and forced another node to become the Ganeti master (since gprod1 had been the master). All of the instances were back online by around 7:40AM PDT (1440 UTC).

Everything at this point seems to be back to normal (except for gprod1). I will look into bringing gprod1 back online later today.

Thank you, and sorry for the outages this caused.


Lance Albertson

Oregon State University | Open Source Lab