[SAC] Issues with osgeo3

I was going to do some simple upgrades on osgeo3 this morning and unfortunately caused it to go offline, which affected all VMs running on that node. We’re working on fixing it right now.

Apologies for the outage.

Lance Albertson

Director
Oregon State University | Open Source Lab

Hi All,

Thanks again for your patience. Everything appears to be back online, which Jeff McKenna also confirmed on IRC. I wanted to fill you in on what happened and how I’ll prevent it from happening again.

We’ve been working on upgrading our Ganeti servers to be managed via Chef on CentOS 7. Part of that process is upgrading Ganeti itself to a version that works on CentOS 7, and then eventually scheduling a time to rebuild the nodes with CentOS.

In order to upgrade Ganeti, I needed to upgrade some of the packages on these systems (which currently run Gentoo). During that process I ran into an issue with incompatible versions of ncurses and eventually incompatible versions of glibc. In an attempt to resolve the issue, I tried to manually update glibc, which immediately caused all the processes on the machine to fail. To fix the problem, I had to boot the machine from a LiveCD and manually copy the original versions of the affected libraries back onto the system.

I had tried this same upgrade on another node without any of these issues, but apparently I didn’t do the steps in the correct order this time. Moving forward, I’m going to document the process better and make local backup copies of the currently installed system packages. Also, I won’t try to manually upgrade glibc on the system, as that was silly of me to do :frowning:
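As a rough illustration of the "local backup copies" idea, and not the procedure actually used on these nodes, here is a minimal Python sketch that copies a set of library files into a backup directory and records their SHA-256 checksums so they can be verified before being copied back. The function name `backup_libraries`, the backup directory, and the example file names are all hypothetical. On Gentoo, `quickpkg` may be a more natural fit, since it builds a binary package from what is currently installed.

```python
import hashlib
import shutil
from pathlib import Path


def backup_libraries(lib_paths, backup_dir):
    """Copy each file into backup_dir and return {file name: sha256 hex digest}.

    Hypothetical helper for illustration only. Files are stored under their
    base name, so two inputs with the same name would overwrite each other.
    """
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    manifest = {}
    for src in map(Path, lib_paths):
        dest = backup_dir / src.name
        # copy2 keeps permissions and timestamps and follows symlinks,
        # so the backup holds the real library contents.
        shutil.copy2(src, dest)
        manifest[src.name] = hashlib.sha256(dest.read_bytes()).hexdigest()
    # Write the checksums in the "<hash>  <name>" layout that `sha256sum -c` reads.
    lines = [f"{digest}  {name}" for name, digest in sorted(manifest.items())]
    (backup_dir / "SHA256SUMS").write_text("\n".join(lines) + "\n")
    return manifest
```

A call such as `backup_libraries(["/lib64/libc.so.6"], "/root/lib-backup")` (both paths are assumptions) would be run before touching the packages. From a rescue environment, the checksum file lets you confirm the saved copies are intact before restoring them.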

I’ll likely be working on this again later today, but I will be more careful about what I’m doing. It shouldn’t have caused an issue like it did today, and I don’t expect it to happen again.

If you have any further questions please let me know.

Thanks-

Lance Albertson

Director
Oregon State University | Open Source Lab

Thanks Lance. We appreciate the quick fix and explanation. (sounds pretty tricky to me indeed) -jeff

--
Jeff McKenna
President Emeritus, OSGeo Foundation
http://wiki.osgeo.org/wiki/Jeff_McKenna

Lance,

Thanks for the description of what happened and bringing things back up. Not too long ago, I managed to utterly hose a BSD machine by installing the “compat” libc instead of the normal one trying to make a package work. Dependency hell always strikes at the most annoying times.

Harrison


Sac mailing list
Sac@lists.osgeo.org
https://lists.osgeo.org/mailman/listinfo/sac