[SAC] [OSGeo] #2706: Set up load balancing configuration for download.osgeo.org

#2706: Set up load balancing configuration for download.osgeo.org
---------------------------+--------------------------------------
Reporter: robe | Owner: sac@…
     Type: task | Status: new
Priority: normal | Milestone: Sysadmin Contract 2022-I
Component: Systems Admin | Keywords:
---------------------------+--------------------------------------
One of the things on my list was to set up some sort of CDN for
download.osgeo.org.

We do have ftp.osuosl.org which we can push traffic to.
In theory we should be able to set this up on nginx.

as detailed here:

https://nginx.org/en/docs/http/load_balancing.html

Though I'm not sure how well that works for balancing download traffic.

--
Ticket URL: <https://trac.osgeo.org/osgeo/ticket/2706>
OSGeo <https://osgeo.org/>
OSGeo committee and general foundation issue tracker.

Comment (by robe):

This, as I feared, doesn't work for our needs. I tested using
bottle.download.osgeo.org
with the download backup on osgeo4.

osgeo4 upload is around 10-13 MB/s, much faster than osgeo7, perhaps
because it isn't pounded on so much.

But after setting things up as noted in load_balancing, I still ended up
with the slow osgeo7 speed.

My nginx script for bottle.download.osgeo.org on osgeo7-nginx looked
something like this:

{{{
upstream bottle-app {
   #least_conn;
   #server download.lxd;
   server bottle.staging.osgeo.org;
}
server {
     server_name bottle.download.osgeo.org;
     listen 80 proxy_protocol; # managed by Certbot
     set_real_ip_from 140.211.15.0/24;
     real_ip_header proxy_protocol;

     access_log /var/log/nginx/bottle.download.osgeo.org.access_log pcombined;
     error_log /var/log/nginx/bottle.download.osgeo.org.error_log info;
     location / {
                 # First attempt to serve request as file, then
                 # as directory, then fall back to displaying a 404.
                 #try_files $uri $uri/ =404;
                 client_max_body_size 0;
                 include /etc/nginx/proxy_protocol_params;
                 proxy_pass http://bottle-app/;
                 proxy_redirect off;
         }

     #listen 80 proxy_protocol; # managed by Certbot

     listen 443 ssl proxy_protocol; # managed by Certbot
    :

}

}}}

--
Ticket URL: <https://trac.osgeo.org/osgeo/ticket/2706#comment:1>

Comment (by robe):

I just had an even crazier thought on this.

I think the speed between the servers is very fast. It's the push out
to the world that is constrained.

That said, if I set up a round robin in DNS for download, and simply
give osgeo3, osgeo4, and osgeo9 a redundant nginx config for download
(so that download accepts all of them as proxies for it), pointing back
to osgeo7, then that might work. I'm going to give that a try with
bottle.download.osgeo.org.

This of course still requires work at the DNS level (setting up the
round robin), and folks will still need to use upload.osgeo.org for
uploading, since that will be the only name with the ssh port open.
Depending on which server you hit with download.osgeo.org, the ssh port
might or might not be open.
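
A minimal sketch of what the redundant proxy config on each round-robin
host might look like, assuming plain HTTP and no PROXY protocol in front
of nginx. The name origin.example.org is a placeholder (not a real OSGeo
host) for whatever name reaches the download container on osgeo7:

{{{
# Hypothetical edge config for one round-robin host (osgeo3, osgeo4 or osgeo9).
server {
    listen 80;
    server_name download.osgeo.org;

    location / {
        # origin.example.org is a placeholder for the osgeo7 download origin
        proxy_pass http://origin.example.org;
        # Forward the requested host name and the client address to the origin
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_redirect off;
    }
}
}}}

For the origin to log the forwarded client address rather than the
proxy's, it would also need the realip directives (set_real_ip_from,
real_ip_header) trusting each proxy host.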

--
Ticket URL: <https://trac.osgeo.org/osgeo/ticket/2706#comment:2>

#2706: Set up load balancing configuration for download.osgeo.org
---------------------------+----------------------------------------
Reporter: robe | Owner: sac@…
     Type: task | Status: new
Priority: normal | Milestone: Sysadmin Contract 2022-II
Component: Systems Admin | Resolution:
Keywords: |
---------------------------+----------------------------------------
Changes (by robe):

* milestone: Sysadmin Contract 2022-I => Sysadmin Contract 2022-II

Comment:

I've started to work on this -- a lot of the notes are on #2705.

So I have set up a round-robin for download.osgeo.org and notified via
project and discuss to use upload.osgeo.org for sftp. upload.osgeo.org
will remain connected only to osgeo7-download.

I have download-cache.osgeo.org for testing, which consists of osgeo4 and
osgeo9, both pulling directly from upload.osgeo.org.

I have download.osgeo.org, which consists of osgeo7 pulling via
download.lxd and osgeo9 pulling via upload.osgeo.org. Note both
ultimately go thru the nginx on osgeo7, so nginx itself is not the cause
of the slow downloads on osgeo7.

All osgeo9 does is proxy straight to upload.osgeo.org (nginx) ->
osgeo7-download, and yet when this is active the speed can be anywhere
from 6MB/s to 20MB/s.

How is this possible? My guess is that the connectivity between the hosts
is at least 100GB/s but the throughput out to the world is much lower,
and since osgeo7's outbound network is heavily taxed, its outbound speed
is crippled. osgeo9 only caches the current request, pulling at
100-1000GB/s from download, and since it is not taxed with as many
requests it can push out much faster.
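
One mechanism that would fit this guess is nginx response buffering on
the proxy. The fragment below is only a sketch: it spells out stock nginx
defaults, not settings read from osgeo9, and origin.example.org is a
placeholder origin name:

{{{
# Hypothetical proxy fragment; values shown are nginx defaults.
server {
    listen 80;
    server_name download.osgeo.org;

    location / {
        proxy_pass http://origin.example.org;
        # Read the upstream response as fast as the origin can send it
        proxy_buffering on;
        # When memory buffers fill, spill up to 1 GB per request to a temp file
        proxy_max_temp_file_size 1024m;
    }
}
}}}

With buffering on, the proxy can drain a file from the origin over the
fast internal link and then feed it to the client at whatever rate the
client's connection allows, so the origin connection is not held open for
the whole download.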

Putting this in place immediately ballooned osgeo9 traffic.

Here are stats from osgeo9:

osgeo9 vnstat output as of now - note I turned it on 2 days ago, so that
2022-03: 7.58 TiB is just for the 2 days. The traffic though, I think,
includes copying from upload.osgeo.org (so really half of that).

Anyway it's huge and I can't believe how huge it is.

On osgeo9 as of now

vnstat output
{{{

                         rx  /          tx  /       total  /   estimated
  enp2s0f0:
        2022-02    5.44 GiB  /  425.10 GiB  /  430.54 GiB
        2022-03    3.60 TiB  /    3.98 TiB  /    7.58 TiB  /    8.42 TiB
      yesterday    1.27 TiB  /    1.24 TiB  /    2.51 TiB
          today    2.16 TiB  /    2.11 TiB  /    4.27 TiB  /    4.73 TiB

}}}

vnstat -d 5 #for last 5 days

# note late 3/26 is when I added it to round robin
{{{
  enp2s0f0 / daily

             day           rx |          tx |       total |     avg. rate
      ------------------------+-------------+-------------+---------------
      2022-03-24   191.36 MiB |   21.29 GiB |   21.48 GiB |   2.14 Mbit/s
      2022-03-25   471.22 MiB |   21.44 GiB |   21.90 GiB |   2.18 Mbit/s
      2022-03-26   160.08 GiB |  174.21 GiB |  334.29 GiB |  33.24 Mbit/s
      2022-03-27     1.27 TiB |    1.24 TiB |    2.51 TiB | 255.63 Mbit/s
      2022-03-28     2.23 TiB |    2.17 TiB |    4.40 TiB | 483.15 Mbit/s
      ------------------------+-------------+-------------+---------------
       estimated     2.40 TiB |    2.34 TiB |    4.75 TiB |

}}}

----

Now on osgeo7:

vnstat
{{{
                         rx  /          tx  /       total  /   estimated
  eno1:
        2022-02    1.54 TiB  /  104.72 TiB  /  106.26 TiB
        2022-03    1.76 TiB  /  115.49 TiB  /  117.25 TiB  /  130.25 TiB
      yesterday   27.14 GiB  /    2.95 TiB  /    2.97 TiB
          today   44.84 GiB  /    4.18 TiB  /    4.22 TiB  /    4.66 TiB

}}}

vnstat -d 5 #for last 5 days

{{{
  eno1 / daily

             day           rx |          tx |       total |     avg. rate
      ------------------------+-------------+-------------+---------------
      2022-03-24    75.43 GiB |    4.45 TiB |    4.52 TiB | 460.09 Mbit/s
      2022-03-25    70.75 GiB |    4.40 TiB |    4.47 TiB | 454.90 Mbit/s
      2022-03-26    46.99 GiB |    4.24 TiB |    4.29 TiB | 436.80 Mbit/s
      2022-03-27    27.14 GiB |    2.95 TiB |    2.97 TiB | 302.77 Mbit/s
      2022-03-28    45.75 GiB |    4.26 TiB |    4.31 TiB | 473.15 Mbit/s
      ------------------------+-------------+-------------+---------------
       estimated    49.35 GiB |    4.60 TiB |    4.65 TiB |

}}}

So how do we solve this issue?

1. Finish setting up osgeo8 to also act as a proxy. This one can be a
true cache since it has much more disk space than osgeo9, so it can do a
full rsync of download. This is a short term solution. One issue I am
working out is that all the traffic coming thru osgeo9 to osgeo7 is being
logged as osgeo9 on the download container, which is both good and bad.
Good in that it's easy to see how much traffic osgeo9 is picking up, but
bad in that I don't have a single authoritative log (then again we
wouldn't anyway with a true round-robin). osgeo9 logs are showing the
true identity of the traffic it is handling.

2. Curb traffic - I'm investigating nginx settings to, say, limit each
user to 1 or 2 requests per second, or to limit bandwidth. I've been
trying https://www.nginx.com/blog/rate-limiting-nginx/ but my settings
seem to be ignored or not working as expected. There is a lot of bot
traffic that we really don't need hogging resources. I still need to
break up the stats to figure out the low hanging fruit that should just
be killed off.

3. Set up a true CDN for download around the world. This is a future
plan and could be costly. Something like keycdn comes to mind, as someone
had suggested it a while back, since they offer an open source plan -
https://www.keycdn.com/open-source-cdn. Though given how much traffic
this is, I suspect we'll quickly run out or not be able to use
download.osgeo.org for the name, which would make it worse than just
adding some extra round robin vms on commercial cloud hosters (hetzner,
atlantic, digital ocean come to mind). Keycdn commercial pricing is
$0.01/GB per month for NA/Europe for over 100 TB/month - which would be
the bulk of our traffic. Given we are doing about 105-130 TB a month, if
my math is right that would be up to about $1300/mth (130,000 GB x
$0.01/GB) -- way too much.
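
One way the limits in item 2 could be expressed, as a rough sketch rather
than a tested config. The zone names, sizes, rates and origin.example.org
are all illustrative placeholders; limit_req_zone and limit_conn_zone
belong in the http context:

{{{
# http context: shared-memory zones keyed by client address (names are made up)
limit_req_zone  $binary_remote_addr zone=dl_req:10m rate=2r/s;
limit_conn_zone $binary_remote_addr zone=dl_conn:10m;

server {
    listen 80;
    server_name download.osgeo.org;

    location / {
        limit_req  zone=dl_req burst=5 nodelay;  # about 2 requests/s per client
        limit_conn dl_conn 2;                    # at most 2 concurrent connections
        limit_rate_after 50m;                    # first 50 MB at full speed
        limit_rate 2m;                           # then 2 MB/s per connection
        # origin.example.org is a placeholder for the download origin
        proxy_pass http://origin.example.org;
    }
}
}}}

Both zones key on $binary_remote_addr, which is the address nginx sees
for the connection. On a host that sits behind another proxy, that is the
proxy's address unless the realip module has rewritten it, so the limits
would then apply to the proxy as a whole. Also note that limit_req only
counts requests; for large downloads it is limit_conn and limit_rate that
bound concurrency and bandwidth.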

--
Ticket URL: <https://trac.osgeo.org/osgeo/ticket/2706#comment:3>