[SAC] [OSGeo] #2771: osgeo7 snapshot failing and secure can't restart

#2771: osgeo7 snapshot failing and secure can't restart
---------------------------+---------------------------------------
 Reporter:  robe           |      Owner:  sac@…
     Type:  task           |     Status:  new
 Priority:  normal         |  Milestone:  Sysadmin Contract 2022-II
Component:  Systems Admin  |   Keywords:
---------------------------+---------------------------------------
Running into technical difficulties.

It looks like a lot of the containers on osgeo7 can't snapshot.
I tried to fix secure and made the mistake of shutting it down.

Still troubleshooting. I have the 5/30/2022 snapshot of it running at the
moment, and I will shut down id.osgeo.org to prevent new accounts from being
created while I resolve this.
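
To see which containers still snapshot cleanly, something along these lines
works (a sketch, not what I actually ran; the pre-fix-test snapshot name is
just an example):

{{{
# try a test snapshot on every container and report the ones that fail
# (pre-fix-test is just an example snapshot name)
for c in $(lxc list -c n --format csv); do
    if lxc snapshot "$c" pre-fix-test 2>/dev/null; then
        lxc delete "$c/pre-fix-test"   # clean up the test snapshot again
    else
        echo "snapshot FAILED for: $c"
    fi
done
}}}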
--
Ticket URL: <https://trac.osgeo.org/osgeo/ticket/2771>
OSGeo <https://osgeo.org/>
OSGeo committee and general foundation issue tracker.

#2771: osgeo7 snapshot failing and secure can't restart
---------------------------+----------------------------------------
 Reporter:  robe           |       Owner:  sac@…
     Type:  task           |      Status:  new
 Priority:  normal         |   Milestone:  Sysadmin Contract 2022-II
Component:  Systems Admin  |  Resolution:
 Keywords:                 |
---------------------------+----------------------------------------
Comment (by robe):

For clarification: when I said snapshot, I meant the snapshot of secure
(5/30/2022) that I have running, which is the last successful snapshot. The
other containers are still running as they were, even though some of them are
in a state where osgeo7 can't create a snapshot of them.

I have disabled https://id.osgeo.org to prevent new registrations while I
sort this issue out.
--
Ticket URL: <https://trac.osgeo.org/osgeo/ticket/2771#comment:1>

#2771: osgeo7 snapshot failing and secure can't restart
---------------------------+----------------------------------------
 Reporter:  robe           |       Owner:  sac@…
     Type:  task           |      Status:  new
 Priority:  normal         |   Milestone:  Sysadmin Contract 2022-II
Component:  Systems Admin  |  Resolution:
 Keywords:                 |
---------------------------+----------------------------------------
Comment (by robe):

The first issue was that after I shut down secure, it wouldn't start.
It gave an error to the effect of:

{{{
Failed to run: zfs set mountpoint
}}}
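
(I didn't keep the exact text; if anyone needs the full error, this is where I
would look for it, assuming the snap-packaged LXD we run here:)

{{{
# last startup log for the container
lxc info --show-log secure

# recent messages from the LXD daemon itself
sudo journalctl -u snap.lxd.daemon -n 100 --no-pager
}}}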

To fix I did:

{{{
sudo zfs set mountpoint=/var/snap/lxd/common/lxd/storage-pools/default/containers/secure canmount=noauto osgeo7/containers/secure
zfs umount osgeo7/containers/secure
zfs mount osgeo7/containers/secure
}}}
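
To confirm the properties actually took, something like this should show the
new mountpoint and canmount=noauto (a verification sketch, not part of the fix
itself):

{{{
# the two properties that were just changed
sudo zfs get -o name,property,value mountpoint,canmount osgeo7/containers/secure

# confirm the dataset is mounted where LXD expects it
sudo zfs list -o name,mountpoint,mounted osgeo7/containers/secure
}}}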

live was having a similar issue, so I did the same for it and started it up.

secure had an additional issue, one I couldn't find documented anywhere:

This was a complicated one; I documented my change here:
https://discuss.linuxcontainers.org/t/lxc-snapshot-and-lxc-start-error-instance-snapshot-record-count-doesnt-match-instance-snapshot-volume-record-count/14245/3

More detail here:

First, I made a backup of the LXD database so I could inspect it:

{{{
  sudo cp /var/snap/lxd/common/lxd/database/global/db.bin lxd-global-220601
}}}

Then I inspected the SQLite backup as follows:

{{{
sudo apt install sqlite3
sqlite3 lxd-global-220601
}}}
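
(Since db.bin is copied from a live system, a quick integrity check on the
copy before trusting it doesn't hurt; just a sanity step, I'm not certain it's
strictly needed:)

{{{
sqlite3 lxd-global-220601 "PRAGMA integrity_check;"
}}}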

Then, in the sqlite console:
{{{
.tables
.mode column
.headers on

SELECT count(*) FROM instances AS v INNER JOIN instances_snapshots AS vs
ON v.id = vs.instance_id WHERE v.name = 'secure';
}}}

output: 32

{{{
SELECT count(*) FROM storage_volumes AS v INNER JOIN
storage_volumes_snapshots AS vs ON v.id = vs.storage_volume_id WHERE
v.name = 'secure';
}}}

output: 37

{{{
SELECT v.id
FROM (SELECT vs.*
      FROM storage_volumes AS v
      INNER JOIN storage_volumes_snapshots AS vs ON v.id = vs.storage_volume_id
      WHERE v.name = 'secure') AS v
LEFT JOIN (SELECT vs.*
           FROM instances AS v
           INNER JOIN instances_snapshots AS vs ON v.id = vs.instance_id
           WHERE v.name = 'secure') AS i
       ON i.name = v.name
WHERE i.name IS NULL;
}}}

which returned these ids from storage_volumes_snapshots:

{{{
4701
4714
4737
4761
4779
}}}
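
Before deleting anything, it's worth eyeballing those rows to make sure they
really are the orphans (a sanity-check sketch using the same columns the
queries above rely on):

{{{
lxd sql global "SELECT id, storage_volume_id, name FROM storage_volumes_snapshots
                WHERE id IN (4701,4714,4737,4761,4779)"
}}}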

Then I ran this:

{{{
lxd sql global "DELETE FROM storage_volumes_snapshots WHERE id IN (4701,4714,4737,4761,4779)"
}}}

Then I was able to run:

{{{
lxc snapshot secure
lxc start secure
}}}
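
To double-check, the two count queries from above can be re-run through lxd
sql; they should now return the same number for secure:

{{{
lxd sql global "SELECT count(*) FROM instances AS v INNER JOIN instances_snapshots AS vs
                ON v.id = vs.instance_id WHERE v.name = 'secure'"
lxd sql global "SELECT count(*) FROM storage_volumes AS v INNER JOIN storage_volumes_snapshots AS vs
                ON v.id = vs.storage_volume_id WHERE v.name = 'secure'"
}}}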

I'll close this ticket out once I've fixed the other affected containers.
--
Ticket URL: <https://trac.osgeo.org/osgeo/ticket/2771#comment:2>

#2771: osgeo7 snapshot failing and secure can't restart
---------------------------+----------------------------------------
 Reporter:  robe           |       Owner:  sac@…
     Type:  task           |      Status:  new
 Priority:  normal         |   Milestone:  Sysadmin Contract 2022-II
Component:  Systems Admin  |  Resolution:
 Keywords:                 |
---------------------------+----------------------------------------
Comment (by robe):

Okay, I filed a ticket upstream: https://github.com/lxc/lxd/issues/10501

For the time being I downgraded to 5.1 using:

{{{
sudo snap revert lxd
}}}

This unfortunately had the side effect of shutting down all the containers
and restarting them. I think all are up now, but I'm double-checking.
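
The double-check itself is just along these lines (a sketch of what I'm
looking at, nothing fancy):

{{{
# confirm which LXD version the revert landed on
snap list lxd
lxc --version

# name and status of every container; anything not RUNNING stands out
lxc list -c ns --format csv
}}}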
--
Ticket URL: <https://trac.osgeo.org/osgeo/ticket/2771#comment:3>

#2771: osgeo7 snapshot failing and secure can't restart
---------------------------+----------------------------------------
 Reporter:  robe           |       Owner:  sac@…
     Type:  task           |      Status:  closed
 Priority:  normal         |   Milestone:  Sysadmin Contract 2022-II
Component:  Systems Admin  |  Resolution:  fixed
 Keywords:                 |
---------------------------+----------------------------------------
Changes (by robe):

* status: new => closed
* resolution: => fixed

Comment:

This is all fixed. I had to make some extra fixes for tracsvn, as it had an
additional issue, so I had to delete some orphan volumes there as well. We
are now on LXD 5.3.
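
For anyone hitting the same thing on another container, the same first check
as for secure applies; e.g. for tracsvn (a sketch reusing the earlier queries,
not the exact statements I ran; the orphans there turned out to be volumes
rather than snapshots):

{{{
# compare instance snapshot count vs storage volume snapshot count;
# a mismatch means there is something orphaned to clean up
lxd sql global "SELECT count(*) FROM instances AS v INNER JOIN instances_snapshots AS vs
                ON v.id = vs.instance_id WHERE v.name = 'tracsvn'"
lxd sql global "SELECT count(*) FROM storage_volumes AS v INNER JOIN storage_volumes_snapshots AS vs
                ON v.id = vs.storage_volume_id WHERE v.name = 'tracsvn'"
}}}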
--
Ticket URL: <https://trac.osgeo.org/osgeo/ticket/2771#comment:4>