[SAC] [OSGeo] #2318: dronie.osgeo.org is down (502 Bad Gateway)

#2318: dronie.osgeo.org is down (502 Bad Gateway)
---------------------------+-------------------
Reporter: strk | Owner: sac@…
     Type: task | Status: new
Priority: blocker | Milestone:
Component: Systems Admin | Keywords:
---------------------------+-------------------
nginx reports 502 Bad Gateway when going to dronie.osgeo.org

--
Ticket URL: <https://trac.osgeo.org/osgeo/ticket/2318>
OSGeo <https://osgeo.org/>
OSGeo committee and general foundation issue tracker.

#2318: dronie.osgeo.org is down (502 Bad Gateway)
---------------------------+--------------------
Reporter: strk | Owner: sac@…
     Type: task | Status: new
Priority: blocker | Milestone:
Component: Systems Admin | Resolution:
Keywords: |
---------------------------+--------------------

Comment (by strk):

Looking at lxc list I see that dronie-server is being assigned 6 internal
IPs. Why is that?
The wiki doesn't mention any detail about it. Could the multiple IPs be a
reason for the failure?

--
Ticket URL: <https://trac.osgeo.org/osgeo/ticket/2318#comment:1>

#2318: dronie.osgeo.org is down (502 Bad Gateway)
---------------------------+--------------------
Reporter: strk | Owner: sac@…
     Type: task | Status: new
Priority: blocker | Milestone:
Component: Systems Admin | Resolution:
Keywords: |
---------------------------+--------------------

Comment (by strk):

From https://git.osgeo.org/gitea/sac/osgeo7/wiki/Dronie-Server-container
it looks like the startup script for the server does not exist, and
everything is done manually. If confirmed, I'd recommend turning it into a
script instead, because doing it only manually is very fragile.

--
Ticket URL: <https://trac.osgeo.org/osgeo/ticket/2318#comment:2>

#2318: dronie.osgeo.org is down (502 Bad Gateway)
---------------------------+--------------------
Reporter: strk | Owner: sac@…
     Type: task | Status: new
Priority: blocker | Milestone:
Component: Systems Admin | Resolution:
Keywords: |
---------------------------+--------------------

Comment (by robe):

strk, the Dronie server is running docker, and docker has its own internal
network. So those are all drone agents.

--
Ticket URL: <https://trac.osgeo.org/osgeo/ticket/2318#comment:3>

#2318: dronie.osgeo.org is down (502 Bad Gateway)
---------------------------+--------------------
Reporter: strk | Owner: sac@…
     Type: task | Status: new
Priority: blocker | Milestone:
Component: Systems Admin | Resolution:
Keywords: |
---------------------------+--------------------

Comment (by robe):

drone agents and drone server

--
Ticket URL: <https://trac.osgeo.org/osgeo/ticket/2318#comment:4>

#2318: dronie.osgeo.org is down (502 Bad Gateway)
---------------------------+--------------------
Reporter: strk | Owner: sac@…
     Type: task | Status: new
Priority: blocker | Milestone:
Component: Systems Admin | Resolution:
Keywords: |
---------------------------+--------------------

Comment (by strk):

So what happened to the service ? Do you have any idea ?

--
Ticket URL: <https://trac.osgeo.org/osgeo/ticket/2318#comment:5>

#2318: dronie.osgeo.org is down (502 Bad Gateway)
---------------------------+--------------------
Reporter: strk | Owner: sac@…
     Type: task | Status: new
Priority: blocker | Milestone:
Component: Systems Admin | Resolution:
Keywords: |
---------------------------+--------------------

Comment (by strk):

How does `lxc` know which IP addresses to assign to which container ? Is
there an external script (on the host) mentioning which IPs to assign ?
Should this be discussed in a private ticket ?

--
Ticket URL: <https://trac.osgeo.org/osgeo/ticket/2318#comment:6>

#2318: dronie.osgeo.org is down (502 Bad Gateway)
---------------------------+--------------------
Reporter: strk | Owner: sac@…
     Type: task | Status: new
Priority: blocker | Milestone:
Component: Systems Admin | Resolution:
Keywords: |
---------------------------+--------------------

Comment (by strk):

nginx reports: {{{ no live upstreams while connecting to upstream }}}

--
Ticket URL: <https://trac.osgeo.org/osgeo/ticket/2318#comment:7>

#2318: dronie.osgeo.org is down (502 Bad Gateway)
---------------------------+--------------------
Reporter: strk | Owner: sac@…
     Type: task | Status: new
Priority: blocker | Milestone:
Component: Systems Admin | Resolution:
Keywords: |
---------------------------+--------------------

Comment (by strk):

From the {{{ nginx }}} machine: {{{ Host dronie-server.lxd not found:
3(NXDOMAIN) }}} -- that would explain it. Did the internal DNS go down?
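
The lookup above can be repeated from the nginx host with a small helper. This is only a minimal sketch, assuming `getent` is available there; the function name `check_upstream` is hypothetical, and only the hostname `dronie-server.lxd` comes from the error quoted above.

```shell
#!/bin/sh
# Hypothetical helper: report whether a hostname resolves on this machine.
# getent goes through the system resolver, so /etc/hosts entries count too.
check_upstream() {
    host="$1"
    if getent hosts "$host" >/dev/null 2>&1; then
        echo "$host resolves"
    else
        echo "$host does NOT resolve"
        return 1
    fi
}

# Example: check_upstream dronie-server.lxd
```

A non-zero exit status makes the helper usable from a cron job or monitoring check.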

--
Ticket URL: <https://trac.osgeo.org/osgeo/ticket/2318#comment:8>

#2318: dronie.osgeo.org is down (502 Bad Gateway)
---------------------------+---------------------------------------
Reporter: strk | Owner: sac@…
     Type: task | Status: new
Priority: blocker | Milestone: Sysadmin Contract 2019-I
Component: Systems Admin | Resolution:
Keywords: |
---------------------------+---------------------------------------
Changes (by strk):

* milestone: => Sysadmin Contract 2019-I

Comment:

For easy checking: https://dronie.osgeo.org/ (still down at time of
writing)
NOTE: shouldn't `Sysadmin Contract 2019-I` milestone be closed ?

--
Ticket URL: <https://trac.osgeo.org/osgeo/ticket/2318#comment:9>

#2318: dronie.osgeo.org is down (502 Bad Gateway)
---------------------------+---------------------------------------
Reporter: strk | Owner: sac@…
     Type: task | Status: new
Priority: blocker | Milestone: Sysadmin Contract 2019-I
Component: Systems Admin | Resolution:
Keywords: |
---------------------------+---------------------------------------

Comment (by robe):

strk, is it still down? I can get to dronie.osgeo.org, and could when
you complained about this.

However when I try to log in I get this error:

{{{
Get https://git.osgeo.org/gitea/api/v1/users/robe/tokens: dial tcp: lookup
git.osgeo.org on 10.88.1.1:53: read udp 172.17.0.2:51440->10.88.1.1:53:
i/o timeout
}}}

--
Ticket URL: <https://trac.osgeo.org/osgeo/ticket/2318#comment:10>

#2318: dronie.osgeo.org is down (502 Bad Gateway)
---------------------------+---------------------------------------
Reporter: strk | Owner: sac@…
     Type: task | Status: new
Priority: blocker | Milestone: Sysadmin Contract 2019-I
Component: Systems Admin | Resolution:
Keywords: |
---------------------------+---------------------------------------

Comment (by robe):

To answer your question: there is no external script in use. When you
set up lxd, it internally sets up a DHCP server and DNS and assigns IPs
accordingly. The IPs rarely change. The server is asking for a reboot, so
maybe I can do that over the weekend.

--
Ticket URL: <https://trac.osgeo.org/osgeo/ticket/2318#comment:11>

#2318: dronie.osgeo.org is down (502 Bad Gateway)
---------------------------+---------------------------------------
Reporter: strk | Owner: sac@…
     Type: task | Status: new
Priority: blocker | Milestone: Sysadmin Contract 2019-I
Component: Systems Admin | Resolution:
Keywords: |
---------------------------+---------------------------------------

Comment (by robe):

Looking at it now. I tried shutting down the dronie-server, but it refuses
to go down.

--
Ticket URL: <https://trac.osgeo.org/osgeo/ticket/2318#comment:12>

#2318: dronie.osgeo.org is down (502 Bad Gateway)
---------------------------+---------------------------------------
Reporter: strk | Owner: sac@…
     Type: task | Status: new
Priority: blocker | Milestone: Sysadmin Contract 2019-I
Component: Systems Admin | Resolution:
Keywords: |
---------------------------+---------------------------------------

Comment (by robe):

The old container still won't shut off, but I was able to create a new
container from the 6-09-2019 snapshot of the dronie-server. At a glance
the new one seems fine, and I then upgraded docker on it.

I've shut it off since I can't rename it to the old name without
renaming/destroying the old one.

I suppose I could just tell nginx to use the new name and then switch back
to the old name once I have removed the old one, as the old one is not
accessible anymore (it seems to have lost its IPs on the failed shutdown).
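
The temporary nginx switch described above might look like the following. It is a minimal sketch only: the function name and the config path are hypothetical, and it assumes the nginx site file refers to the container by its `.lxd` hostname, as the `dronie-server.lxd` lookup earlier in this ticket suggests.

```shell
#!/bin/sh
# Hypothetical helper: point an nginx site file at a different container name.
# Usage: switch_upstream OLD_NAME NEW_NAME CONF_FILE
switch_upstream() {
    old="$1"; new="$2"; conf="$3"
    # Keep a backup so the change is easy to revert later.
    cp "$conf" "$conf.bak" || return 1
    # Replace every occurrence of OLD_NAME.lxd with NEW_NAME.lxd.
    sed "s/$old\.lxd/$new.lxd/g" "$conf.bak" > "$conf"
}

# Example (the path is an assumption, not the real location):
#   switch_upstream dronie-server dronie-server-2 /etc/nginx/sites-enabled/dronie.conf
#   nginx -t && nginx -s reload
```

Calling it again with the two names reversed switches back once the old container has been removed.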

--
Ticket URL: <https://trac.osgeo.org/osgeo/ticket/2318#comment:13>

#2318: dronie.osgeo.org is down (502 Bad Gateway)
---------------------------+---------------------------------------
Reporter: strk | Owner: sac@…
     Type: task | Status: new
Priority: blocker | Milestone: Sysadmin Contract 2019-I
Component: Systems Admin | Resolution:
Keywords: |
---------------------------+---------------------------------------

Comment (by strk):

Concerning situation. Can we count on stability of this new
containerization architecture ?

What happened to data ? Are old builds still accessible ?

--
Ticket URL: <https://trac.osgeo.org/osgeo/ticket/2318#comment:14>

#2318: dronie.osgeo.org is down (502 Bad Gateway)
---------------------------+---------------------------------------
Reporter: strk | Owner: sac@…
     Type: task | Status: new
Priority: blocker | Milestone: Sysadmin Contract 2019-I
Component: Systems Admin | Resolution:
Keywords: |
---------------------------+---------------------------------------

Comment (by robe):

I was able to stop the dronie-server container by killing the process
attached to it.

{{{
ps -faux | grep dronie-server
}}}

There still seems to be something clinging to the name though, because
this is what happened when I tried to swap the containers:

{{{
lxc mv dronie-server dronie-server-bad #went fine

#get into the container
lxc exec dronie-server-2 bash
#in dronie-server-2 force graceful shutdown
shutdown -P -H now

#now back in osgeo7
lxc mv dronie-server-2 dronie-server # went fine

#but this failed
lxc start dronie-server

#so I had to rename it back to
lxc mv dronie-server dronie-server-2
}}}
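
The manual sequence above, including the rename back after a failed start, could be wrapped in a function. This is a sketch under stated assumptions, not the procedure actually used on osgeo7: the function name `swap_container` and the `-bad` suffix convention are illustrative, and it assumes the replacement container is currently running.

```shell
#!/bin/sh
# LXC can be overridden (e.g. LXC="echo lxc") to preview the commands
# without touching any container.
LXC="${LXC:-lxc}"

# Hypothetical helper: give container NEW the name OLD, and restore the
# name NEW if the renamed container refuses to start.
# Usage: swap_container OLD NEW
swap_container() {
    old="$1"; new="$2"
    $LXC stop "$new" || return 1             # a container must be stopped before renaming
    $LXC mv "$old" "$old-bad" || return 1    # park the broken container
    $LXC mv "$new" "$old" || return 1        # replacement takes over the old name
    if ! $LXC start "$old"; then
        echo "start failed, restoring name $new" >&2
        $LXC mv "$old" "$new"
        return 1
    fi
}
```

Running it first with `LXC="echo lxc"` prints the four commands in order, which makes it easy to review before doing it for real.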

As far as data goes, the data as of the 6/9 snapshot is fine. I suspect if I
went with the 6-11 snapshot I would see the data there too.

I feel like the server needs to be rebooted (since it does say system
restart required).

So there could be an underlying funkiness with the network causing this
that rebooting will resolve.

Now is not a good time to bring everything down for this though as other
things are working fine.

Once we get osgeo4 reformatted, we'll be in much better condition as we
can replicate containers between the two and this server really should be
moved to the new osgeo4.

--
Ticket URL: <https://trac.osgeo.org/osgeo/ticket/2318#comment:15>

#2318: dronie.osgeo.org is down (502 Bad Gateway)
---------------------------+---------------------------------------
Reporter: strk | Owner: sac@…
     Type: task | Status: new
Priority: blocker | Milestone: Sysadmin Contract 2019-I
Component: Systems Admin | Resolution:
Keywords: |
---------------------------+---------------------------------------

Comment (by robe):

Replying to [comment:2 strk]:
> From https://git.osgeo.org/gitea/sac/osgeo7/wiki/Dronie-Server-container
> it looks like the startup script for the server does not exist, and
> everything is done manually. If confirmed I'd recommend turning it into a
> script instead because it is very fragile to only do it manually

strk, I'm lost as to what you mean here -- that is for starting up the docker
drone server, and it gets started on bootup because it's just the docker
configuration. When would the start-up script ever be run?

It's not like drone.osgeo.org, which runs on the server; the dronie server
is running in a docker container.

--
Ticket URL: <https://trac.osgeo.org/osgeo/ticket/2318#comment:16>

#2318: dronie.osgeo.org is down (502 Bad Gateway)
---------------------------+---------------------------------------
Reporter: strk | Owner: sac@…
     Type: task | Status: new
Priority: blocker | Milestone: Sysadmin Contract 2019-I
Component: Systems Admin | Resolution:
Keywords: |
---------------------------+---------------------------------------

Comment (by strk):

Even the docker startup command would be good to have
in a script. Because things can go bad, docker might need
a reinstall, you may want to move the service to another machine.
What I'm saying is you don't want to rely on docker daemon
keeping the info on how you started it in place of you.

What we want (and do have even!) is a git repository with
the scripts to start the server. I didn't find a clone of
that repository on the server actually running drone.
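
A start script of the kind described above might look roughly like this. It is a sketch, not the actual dronie configuration: the container name, port mapping, data path, and env-file path are placeholders, and the `DRONE_*` variable names assume a Drone 1.x server backed by Gitea, which may not match the deployed version.

```shell
#!/bin/sh
# DOCKER can be overridden (e.g. DOCKER="echo docker") to print the
# command instead of running it.
DOCKER="${DOCKER:-docker}"

# Hypothetical start function for a Drone server container.
# Every value below is a placeholder; secrets are kept out of the script
# and read from an env file (path is an assumption).
start_drone_server() {
    $DOCKER run --detach \
        --name drone-server \
        --restart always \
        --publish 8000:80 \
        --volume /var/lib/drone:/data \
        --env DRONE_GITEA_SERVER=https://git.osgeo.org/gitea \
        --env DRONE_SERVER_HOST=dronie.osgeo.org \
        --env DRONE_SERVER_PROTO=https \
        --env-file /etc/drone/secrets.env \
        drone/drone:1
}
```

Keeping a file like this in the git repository means the service can be recreated on another machine without relying on what the docker daemon remembers.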

--
Ticket URL: <https://trac.osgeo.org/osgeo/ticket/2318#comment:17>

#2318: dronie.osgeo.org is down (502 Bad Gateway)
---------------------------+---------------------------------------
Reporter: strk | Owner: sac@…
     Type: task | Status: closed
Priority: blocker | Milestone: Sysadmin Contract 2019-I
Component: Systems Admin | Resolution: fixed
Keywords: |
---------------------------+---------------------------------------
Changes (by robe):

* status: new => closed
* resolution: => fixed

Comment:

I think this is done.

--
Ticket URL: <https://trac.osgeo.org/osgeo/ticket/2318#comment:18>