[SAC] [OSGeo] #2958: woodie-server issue with disk space

#2958: woodie-server issue with disk space
---------------------------+--------------------------------------
Reporter: robe | Owner: sac@…
     Type: task | Status: new
Priority: normal | Milestone: Sysadmin Contract 2023-I
Component: Systems Admin | Keywords:
---------------------------+--------------------------------------
Ran into an issue today with woodie-server: 800GB of its space was taken,
so much that nothing could be done with the server and I couldn't even
increase the space.

I was able to make a new copy of it from the old, and after reboot of
osgeo8 I was able to delete the old container.

But still the new container claims it's using 200GB of space, even after
doing

{{{
docker system prune -a
}}}

and deleting all the backup snapshots. I haven't figured out where all
this space is going, or whether something is just off with the lxd disk
visibility.

Even the backup of it on osgeo4 claims it's 200GB in size.

/var/log, the other likely culprit, is only 254M.
--
Ticket URL: <https://trac.osgeo.org/osgeo/ticket/2958>
OSGeo <https://osgeo.org/>
OSGeo committee and general foundation issue tracker.

#2958: woodie-server issue with disk space
---------------------------+---------------------------------------
Reporter: robe | Owner: sac@…
     Type: task | Status: new
Priority: normal | Milestone: Sysadmin Contract 2023-I
Component: Systems Admin | Resolution:
Keywords: |
---------------------------+---------------------------------------
Comment (by robe):

Okay, there seems to be a lot of stuff in /var/lib/docker/vfs/dir, and
running du on it is taking a long time, so I suspect that might be what's
occupying the remaining space.
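
A minimal sketch of how the search for the big directory could be narrowed
down one level at a time, instead of waiting for a full du of the whole
tree. The function name largest_dirs is hypothetical, not something that
was run on the server.

```shell
#!/bin/sh
# largest_dirs DIR [COUNT]: print the COUNT largest entries one level below
# DIR (sizes in KiB), biggest first. The first line is the total for DIR itself.
largest_dirs() {
    dir=$1
    count=${2:-10}
    # -x stays on one filesystem, -k reports KiB, -d 1 limits depth to one level
    du -x -k -d 1 "$dir" 2>/dev/null | sort -rn | head -n "$count"
}

# Example (as root inside the container):
#   largest_dirs /var/lib/docker 5
```

Repeating the call on the largest entry it reports walks down to the
directory that holds most of the space.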
--
Ticket URL: <https://trac.osgeo.org/osgeo/ticket/2958#comment:1>

#2958: woodie-server issue with disk space
---------------------------+---------------------------------------
Reporter: robe | Owner: sac@…
     Type: task | Status: closed
Priority: normal | Milestone: Sysadmin Contract 2023-I
Component: Systems Admin | Resolution: fixed
Keywords: |
---------------------------+---------------------------------------
Changes (by robe):

* status: new => closed
* resolution: => fixed

Comment:

I ended up taking the harsh route below. It took a couple of hours, and I
stopped it once disk usage went from 260GB to 2GB.
{{{
cd /var/lib/docker/vfs/dir
rm -r *
}}}

The above preserved the docker volumes of data the woodie server writes
to, but I think it damaged the agent, as it tried to use a file that was
in that folder. After rebooting the container I did:

{{{
su woodie
cd ~/
# pull fixed up the images and pulled woodpecker-server 1.0
# (so yah the screen looks quite different now)
docker compose pull
docker compose up -d
}}}

Things now seem to be back to normal, except for a hiccup I had logging in
and authenticating. After clearing out my browser cache and logging in
again, things seemed fine, and I see all the run logs from before in the
interface, so the db must be fine too.

I also reduced the number of procs we run on this server from 5 to 2 in
the docker-compose.yml (in /home/woodie).

I'm going to set up another docker image, first on osgeo8, that has 4 procs.

I suspect the reason /var/lib/docker/vfs/dir was not being cleared by
docker system prune is that somewhere along the line, with all that
shaking we've been doing in postgis lately, it ran out of space before it
could figure out what to delete.
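
If running out of space really is what stopped the cleanup, a free-space
check could flag the problem before the disk is full. The following is
only a sketch: the function names and the 10 GiB threshold are
assumptions, not part of the setup described here.

```shell
#!/bin/sh
# free_kib DIR: print the free space, in KiB, on the filesystem holding DIR.
free_kib() {
    # -P forces the POSIX one-line-per-filesystem format; column 4 is "Available".
    # Assumes the filesystem name contains no spaces.
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# low_on_space DIR MIN_KIB: succeed (exit 0) when DIR has less than MIN_KIB free.
low_on_space() {
    [ "$(free_kib "$1")" -lt "$2" ]
}

# Example: warn when less than 10 GiB remain (threshold is an arbitrary assumption).
#   if low_on_space /var/lib/docker 10485760; then
#       echo "low disk space on /var/lib/docker" >&2
#   fi
```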

I set up a cron job to run docker system prune nightly, so that should
keep things clean. Also, having the agent on a separate container will
allow us to be more reckless, since agents can be thrown away without
risking damage to the history.
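
For reference, a nightly prune could be wrapped in a small script along
these lines. This is a sketch, not the actual cron entry on the server:
the file name docker-prune, the DOCKER override, and the RUN_PRUNE guard
are all hypothetical.

```shell
#!/bin/sh
# Hypothetical nightly cleanup script, e.g. saved as /usr/local/bin/docker-prune.
# DOCKER may be overridden (DOCKER=echo) to print the command instead of running it.
nightly_prune() {
    # -a also removes unused images; -f skips the prompt that cron cannot answer.
    # Volumes are left alone because --volumes is not passed.
    "${DOCKER:-docker}" system prune -a -f
}

# Run only when explicitly asked, so sourcing this file has no side effects.
# A crontab line could then look like:
#   0 3 * * * RUN_PRUNE=1 /usr/local/bin/docker-prune
if [ "${RUN_PRUNE:-0}" = 1 ]; then
    nightly_prune
fi
```

Calling it once by hand with DOCKER=echo shows the exact command cron
would run without touching any images.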
--
Ticket URL: <https://trac.osgeo.org/osgeo/ticket/2958#comment:2>