[OSGeo] #3170: osgeo7 went down

#3170: osgeo7 went down
----------------------+--------------------------------------
Reporter: robe | Owner: sac-tickets@…
     Type: task | Status: new
Priority: normal | Milestone: Sysadmin Contract 2024-I
Component: SysAdmin | Keywords:
----------------------+--------------------------------------
As some may have noticed, osgeo7 went down this morning.

It appears there may be a disk failure on the Samsung SSD.
I don't think that drive is used for anything important.

I did a hardware reset this morning and that seemed to have brought it
back up, but then it went down again shortly after.

This time I did a full power down and power up. It's back at the moment,
but I'm in the middle of moving some critical services, such as LDAP, to
other hosts.

I have alerted OSUOSL to the situation.
--
Ticket URL: <https://trac.osgeo.org/osgeo/ticket/3170>
OSGeo
OSGeo committee and general foundation issue tracker.

#3170: osgeo7 went down
----------------------+---------------------------------------
Reporter: robe | Owner: sac-tickets@…
     Type: task | Status: new
Priority: normal | Milestone: Sysadmin Contract 2024-I
Component: SysAdmin | Resolution:
Keywords: |
----------------------+---------------------------------------
Comment (by robe):

I plan to move secure and tracsvn to osgeo9, so I'm taking another
snapshot of each.

tracsvn, however, needs an extra IP for the SSH port, so I might hold off
on it until I confirm I can use the extra IP on osgeo9.

If anyone has hardcoded the IP of secure anywhere, let me know. Everything
should be accessing it via the ldap.osgeo.org domain name.
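
For anyone who wants to check their own configs for a hardcoded address, here is a minimal sketch in Python. The function name and the sample config are made up for illustration, it only looks for IPv4 literals, and 192.0.2.10 is a documentation-only address, not the real IP of secure.

```python
import ipaddress
import re

# Loose pattern for dotted-quad candidates; ipaddress does the real validation.
CANDIDATE = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")


def find_hardcoded_ips(text):
    """Return (line_number, address) pairs for each IPv4 literal in text."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in CANDIDATE.findall(line):
            try:
                ipaddress.IPv4Address(match)
            except ipaddress.AddressValueError:
                continue  # looks like a dotted quad but is not valid, e.g. 999.1.1.1
            hits.append((lineno, match))
    return hits


# Hypothetical config snippet (assumption, not taken from any OSGeo host).
sample = """\
uri ldaps://ldap.osgeo.org
uri ldaps://192.0.2.10
"""
print(find_hardcoded_ips(sample))
```

Any line it reports is a candidate to switch over to the ldap.osgeo.org name before the move.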

#3170: osgeo7 went down
----------------------+---------------------------------------
Reporter: robe | Owner: sac-tickets@…
     Type: task | Status: new
Priority: normal | Milestone: Sysadmin Contract 2024-I
Component: SysAdmin | Resolution:
Keywords: |
----------------------+---------------------------------------
Comment (by robe):

I was a little concerned about the SMART error I received:

{{{
The following warning/error was logged by the smartd daemon:

Device: /dev/nvme0, number of Error Log entries increased from 6 to 7

Device info:
SAMSUNG MZVKW512HMJP-00000, S/N:S316NX0JB03810, FW:CXA7500Q, 512 GB

For details see host's SYSLOG.
}}}

But running

{{{
  smartctl -a /dev/nvme0

}}}

shows a PASS:

{{{
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

Error Information (NVMe Log 0x01, max 64 entries)
Num ErrCount SQId CmdId Status PELoc LBA NSID VS
   0 9 0 0x101a 0x4004 - 0 0 -
   1 8 0 0x1011 0x4004 - 0 0 -
   2 7 0 0x1019 0x4004 - 0 0 -
   3 6 0 0x501d 0x4004 - 0 0 -
   4 5 0 0x0004 0x4202 0x028 0 0 -
   5 4 0 0x0004 0x4202 0x028 0 0 -
   6 3 0 0x0004 0x4202 0x028 0 0 -
   7 2 0 0x0004 0x4202 0x028 0 0 -
   8 1 0 0x0004 0x4202 0x028 0 0 -

}}}
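
One way to read the error table above is to group the rows by their Status value. A minimal sketch in Python, assuming the nine-column layout shown in that output. The function name is made up for illustration, and it only counts rows; it does not try to interpret what the status codes mean.

```python
from collections import Counter

# Error-log rows copied from the smartctl output above.
ERROR_LOG = """\
Num ErrCount SQId CmdId Status PELoc LBA NSID VS
  0 9 0 0x101a 0x4004 - 0 0 -
  1 8 0 0x1011 0x4004 - 0 0 -
  2 7 0 0x1019 0x4004 - 0 0 -
  3 6 0 0x501d 0x4004 - 0 0 -
  4 5 0 0x0004 0x4202 0x028 0 0 -
  5 4 0 0x0004 0x4202 0x028 0 0 -
  6 3 0 0x0004 0x4202 0x028 0 0 -
  7 2 0 0x0004 0x4202 0x028 0 0 -
  8 1 0 0x0004 0x4202 0x028 0 0 -
"""


def tally_by_status(log_text):
    """Count error-log rows per value of the Status column."""
    counts = Counter()
    for line in log_text.splitlines():
        fields = line.split()
        # Data rows have nine columns and start with a numeric index;
        # this skips the header row and blank lines.
        if len(fields) != 9 or not fields[0].isdigit():
            continue
        counts[fields[4]] += 1  # Status is the fifth column
    return counts


status_counts = tally_by_status(ERROR_LOG)
for status, n in status_counts.most_common():
    print(status, n)
```

On the rows above this gives five entries with status 0x4202 (the oldest ones, all with the same CmdId and PELoc) and four with 0x4004 (the most recent ones).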

At any rate, I would still like to move some critical services off osgeo7,
at least temporarily, so I can upgrade it without worrying about them.

osgeo7 is the only host that hasn't been upgraded to Ubuntu 22 yet (it is
still on Ubuntu 20).

#3170: osgeo7 went down
----------------------+---------------------------------------
Reporter: robe | Owner: sac-tickets@…
     Type: task | Status: closed
Priority: normal | Milestone: Sysadmin Contract 2024-I
Component: SysAdmin | Resolution: fixed
Keywords: |
----------------------+---------------------------------------
Changes (by robe):

* status: new => closed
* resolution: => fixed

Comment:

Closing this out. OSUOSL added osgeo7 to their syslog monitoring, so if it
happens again they may be able to provide more information.

At the moment I still think it has something to do with the SSD, which we
don't use for hosting any containers: when the host comes back up it
complains about that drive, and I don't think the SSD tests that OSUOSL ran
completed.