[SAC] OSGeo7 Server Config Quote

Here's the latest quote for us to discuss server configuration for OSGeo7.

https://drive.google.com/open?id=1X-z66jXXBUZuPqh6EP0d43g2NUCL7xcL

The plan, based on discussions, is to manage KVM virtual machines on LVM
volumes with libvirt. If we later feel we need something more advanced
because we are managing multiple physical machines, we could convert to
Ganeti or OpenStack (I'm less sure about how we would convert to
OpenStack).

The idea is up to 4 virtual machines, each with someone designated to
make sure it stays updated, along with use of unattended upgrades for
security patches.
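
(For reference, a minimal sketch of enabling those unattended security
upgrades on a Debian/Ubuntu guest -- exact configuration may vary:)

```
# install the package and enable the periodic security-upgrade runs
sudo apt-get install unattended-upgrades
sudo dpkg-reconfigure -plow unattended-upgrades
# answering yes writes /etc/apt/apt.conf.d/20auto-upgrades to enable the daily run
```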

As quoted, I've done RAID 5 SSD and RAID 5 traditional, 3 drives each.
That gives us fast storage and large storage (think downloads and
FOSS4G archives).

I included redundant power to maximize uptime.

RAM is only 64 GB, which is up to 16 GB for each of the virtual machines.

Please discuss and ask questions so we can possibly vote this week at
the meeting.

Thanks,
Alex

Alex,

This looks good to me, +1. Really excited to have a new box in place.

I'm also thinking that with the new box, we could start off-loading osgeo3 and osgeo4 and allow Lance to upgrade the Ganeti on them.
Since we won't have anything mission-critical -- after we migrate the mission-critical stuff to osgeo7 -- if hardware on osgeo4 fails during the upgrade, I assume it wouldn't be a big deal.
As I recall, was it only osgeo4 that had a hardware issue?

Thanks,
Regina


Hi Alex,

Overall, this looks like a solid machine, but I do have a few suggestions regarding the details of the hardware configuration.

-First, a RAID5 array for the spinning rust pool may leave the pool unduly susceptible to complete failure during recovery from a single drive failure and replacement, because the extreme load on all disks while recreating the data on the replaced disk tends to trigger a subsequent failure. Also, no hot spare is available, leaving the pool running in degraded mode until someone can physically swap the drive. A RAID6 (or ZFS RAIDZ2) configuration, having two drives' worth of recovery data, greatly reduces that risk.

--Suggest that all 4 hot-swap bays be provisioned with the HGST models listed (512b emulated sector size) in the quote if the 4k native sector size drives are not available -- this can be worked around at the FS level (ashift=12) with minimal performance impact.
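
As a rough sketch (pool name and device paths below are placeholders), that
RAIDZ2 layout with the 4k workaround would look something like:

```
# double-parity pool across the four hot-swap bays, forcing 4k-aligned
# writes (ashift=12) so 512e drives behave like 4k-native ones
zpool create -o ashift=12 tank raidz2 \
    /dev/disk/by-id/ata-HGST_1 /dev/disk/by-id/ata-HGST_2 \
    /dev/disk/by-id/ata-HGST_3 /dev/disk/by-id/ata-HGST_4
zpool status tank
```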

-Second, RAID5 will seriously reduce the performance of the SSDs and, especially on writes, increase latency, which somewhat defeats the purpose of utilizing SSDs. A simple mirror array would perform much better and have the same level of redundancy, while a stripe could be much faster when used as a cache for hot data from the HDDs rather than as the primary storage. For heavy write loads, such as databases, MLC SSDs really aren't suitable because of the wear rate, and they usually lack power-loss protection. A smaller-capacity but higher-IOPS NVMe SSD on the PCIe bus would be much more effective for those workloads.

--Suggest identifying workloads needing high-speed storage and determining read vs. write requirements before final selection of SSDs. Use two SATA SSDs in a mirror or stripe configuration for bulk storage or cache. Consider a PCIe-connected NVMe drive if there will be a large number of writes and transactions.
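
E.g., a simple mirrored SSD pool (again just a sketch; device names are
placeholders):

```
# mirrored SSD pool for the fast storage; the same two devices could
# instead be attached to the HDD pool as cache/log devices later
zpool create -o ashift=12 fast mirror \
    /dev/disk/by-id/ata-SSD_1 /dev/disk/by-id/ata-SSD_2
```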

-Third, the memory really is the biggest bottleneck and resource limit, so I would favor increasing that as much as possible over the size of the SSD pool. Unused memory is used to cache filesystem contents in RAM, which is orders of magnitude faster than an SSD, yet it remains available to your workloads when needed.

--Suggest 128 GB RAM, making trade-offs against SSD capacity if budget requires.

Some general comments on filesystems, software stack, and virtualization, in reverse order.

For most of the needs I have seen discussed, full virtualization is far more heavy-handed than necessary -- a container solution such as LXC/LXD would be much more appropriate and allow for much better granularity with lower overhead. A few VMs may be useful for particular projects that need to run their own kernel, run low-level services, or be suspended and moved to another host for some reason, but those are the exception, not the rule. Many tools for managing VMs can also manage containers, and provisioning many containers off the same base template is both very easy and consumes very little additional disk space when used on a CoW (copy-on-write) filesystem that supports cloning; additionally, backups are both instantaneous and only take up as much space as the changed files.

My personal preference is to use ZFS for the filesystem because it covers all levels of the storage stack -- from disk to filesystem to snapshots and remote backup -- in a single tool, and thus can detect and correct data corruption anywhere in the stack before it is persisted. LVM2 and associated tools provide mostly similar functionality, but I find them much less intuitive and more difficult to administer - that may well be just a matter of personal taste and experience.
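
To illustrate the clone/snapshot point (a sketch only -- the dataset names
are invented):

```
# snapshot a base template dataset and clone it for a new container;
# the clone starts out consuming almost no additional space
zfs snapshot tank/containers/base@template
zfs clone tank/containers/base@template tank/containers/wiki

# a "backup" is just another snapshot, taking only the space of changed blocks
zfs snapshot tank/containers/wiki@2018-02-13
```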

I hope this helps with the purchasing and provisioning decisions.

Take care,



Thanks for the feedback, some comments inline. - Alex

Quick note: all mention of RAID is software, not hardware.

On 02/13/2018 12:13 AM, Chris Giorgi wrote:

Hi Alex,

Overall, this looks like a solid machine, but I do have a few suggestions
considering the details of the hardware configuration.

-First, a RAID5 array for the spinning rust pool may leave the pool unduly
susceptible to complete failure during recovery from a single drive failure
and replacement due to the extreme load on all discs while recreating the
data on the replaced disk tending to trigger a subsequent failure. Also, no
hot spare is available, leaving the pool running in degraded mode until
someone can physically swap the drive. A RAID6 (or ZFS RAIDZ2)
configuration having two drives worth of recovery data greatly minimizes
such risk.
--Suggest that all 4 hot-swap bays be provisioned with the HGST models
listed (512b emulated sector size) in the quote if the 4k native sector
size drives are not available -- this can be worked around at the FS level
(ashift=12) with minimal performance impact.

We actually don't need that much space; 2 TB drives would have sufficed,
and I picked the smallest size they offered, which was 8 TB. What about
just going with a mirror on 2x HGST drives? Note we do have a backup
server. Normally I would also use 4 drives, but the machines just don't
have that many bays.

-Second, RAID5 will seriously reduce the performance of the SSDs, and,
especially on writes, increases latency, which somewhat defeats the purpose
of utilizing SSDs. A simple mirror array would be much better performing
and have the same level of redundancy, while a stripe could be much faster
when used as a cache for hot data from the HDDs rather than the primary
storage. For heavy write loads, such as databases, MLC SSDs really aren't
suitable because of the wear rate and they usually lack of
power-loss-protection. A smaller capacity but higher iops NVMe type SSD on
the PCI bus would be much more effective for those workloads.
--Suggest identifying workloads needing high-speed storage and determine
read vs. write requirements before final selection of SSDs. Use two SATA
SSDs in mirror or stripe configuration for bulk storage or cache. Consider
PCIe connected NVMe if large number of writes and transactions.

Speed isn't a huge issue; SSDs in any form seem to perform fast enough.
We tend to use SSD storage for everything. Having slow spinning disks at
all is new, and is only suggested so we can start holding larger archives
in a publicly accessible way.

We could go to 4 drives and do a 2x2 mirror. We want the ability to keep
going when a drive drops and to get a new drive in within a couple of days.

Note OSGeo6 is 6x SSD with, I believe, 2 RAID 5 arrays.

Can you verify the type of SSDs? There are other options - also make
sure to note these are not the consumer models.

-Third, the memory really is the biggest bottleneck and resource limit, so
I would favor increasing that as much as possible over the size of the SSD
pool. Unused memory is used to cache filesystem contents to RAM, which is
orders of magnitude faster than a SSD, but is there for your workloads when
needed.
--Suggest 128GB RAM, making trade-offs against SSD capacity if budget
requires.

If you look at OSGeo6, I'm not sure we're really utilizing all the RAM we
bought:
http://webextra.osgeo.osuosl.org/munin/osgeo.org/osgeo6.osgeo.org/index.html

Though really, in this case it's about $900 to add the additional RAM to
get up to 128 GB. I'm on the fence about this, since I'd prefer to buy
cheaper machines more often than to load up expensive ones.

Some general comments on filesystems, software stack, and virtualization,
in reverse order.

For most of the needs I have seen discussed, full virtualization is far
more heavy-handed than necessary -- a container solution such as LXC/LXD
would be much more appropriate and allow for much better granularity with
lower overhead. A few VMs may be useful for particular projects that need
to run their own kernel, low-level services, or suspend and move to another
host for some reason, but those are the exception, not the rule. Many tools
for managing VMs can also manage containers, and provisioning many
containers off the same base template is both very easy and consumes very
little additional disk space when used on a CoW filesystem (Copy-on-write)
that supports cloning; additionally, backups are both instantaneous and
only take up as much space as the changed files. My personal preference is
to use ZFS for a filesystem because it supports all levels of the storage
stack from disk to filesystem to snapshots and remote backup in a single
tool and thus can detect and correct data corruption anywhere in the stack
before it can be persisted. LVM2 and associated tools provide mostly
similar functionality, but I find them much less intuitive and more
difficult to administrate - that's certainly may be just a matter of
personal taste and experience.

We've already decided to go VM, so we can migrate existing services. In
our case administering a VM can be delegated easily. We do plan to try
out containers on OSGeo6 (existing). But for now we really just need to
move existing VMs from OSGeo3 so we can retire the hardware. These
include Downloads, Wiki, Trac/SVN, and Webextra (FOSS4G).


On Mon, Feb 12, 2018 at 04:27:58PM -0500, Regina Obe wrote:

As I recall, was it only osgeo4 that had a hardware issue?

I don't remember having read about hardware failures
on the wiki section "osgeo4" [1]. But it does mention
it has to be retired.

[1] https://wiki.osgeo.org/wiki/SAC_Service_Status#osgeo4

--strk;

On Tue, Feb 13, 2018 at 08:57:14AM -0800, Alex M wrote:

Note OSGeo6 is 6xSSD with I believe 2 RAID5.

Could we add hardware specs to the wiki ?

This page doesn't mention RAIDS or SSDs:
  https://wiki.osgeo.org/wiki/Osgeo6
It does link to an "Hardware Plan 2014" page
but from there you cannot tell which hardware
ended up becoming "osgeo6".

--strk;

On Tue, Feb 13, 2018 at 6:29 PM, Sandro Santilli <strk@kbt.io> wrote:

On Tue, Feb 13, 2018 at 08:57:14AM -0800, Alex M wrote:

Note OSGeo6 is 6xSSD with I believe 2 RAID5.

Could we add hardware specs to the wiki ?

Yes, please.

This page doesn't mention RAIDS or SSDs:
  https://wiki.osgeo.org/wiki/Osgeo6
It does link to an "Hardware Plan 2014" page
but from there you cannot tell which hardware
ended up becoming "osgeo6".

Probably this thread:
https://lists.osgeo.org/pipermail/sac/2015-April/005564.html

Markus

On 02/13/2018 09:29 AM, Sandro Santilli wrote:

On Tue, Feb 13, 2018 at 08:57:14AM -0800, Alex M wrote:

Note OSGeo6 is 6xSSD with I believe 2 RAID5.

Could we add hardware specs to the wiki ?

This page doesn't mention RAIDS or SSDs:
  https://wiki.osgeo.org/wiki/Osgeo6
It does link to an "Hardware Plan 2014" page
but from there you cannot tell which hardware
ended up becoming "osgeo6".

--strk;

Uh, yes, the SSDs are listed under hardware:
http://wiki.osgeo.org/wiki/Osgeo6#Hardware
They are all Intel DC S3500 Series MLC.

Agreed, the RAID configuration does not seem to be on that page. Martin
S. set up the machine and can confirm. Based on the drives, I think it's
RAID 1 for the 2x 80 GB drives with the OS, and RAID 5 with the 4x 480 GB
drives.
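
If someone with shell access to osgeo6 wants to double-check, the software
RAID layout should be visible with something like:

```
# list the assembled md arrays and their member drives
cat /proc/mdstat
# level, members, and state for a specific array
sudo mdadm --detail /dev/md0
```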

Thanks,
Alex

Chris,

Want to take a try at configuring some options? Just make some quotes on
the Silicon Mechanics website. I can get us the non-profit discount
(3-5% typically) on any quote through my rep. See what you can do for
under $6000.

Some bigger questions to discuss:
What services should be on what type of disk?
Which services should be easy to containerize?

Major services:
Downloads - no database, frequent writes only to the Maven stuff?
Container: easy.

Trac/SVN - Postgres DB, not sure how heavy on writes. Container: tricky.

Webextra - mostly static archives. Container: easy.

Wiki - MediaWiki php/mysql, lots of writes. Container: moderate.

Mailman - frequent writes. Container: ?

Projects & Adhoc - a variety of php and other types of sites, mostly
reads, not a lot of writes. Container: would make life easier if every
project were its own container, but we don't want that many db instances.

Why VMs:
Mostly because 1 bad VM can easily be restarted, and CPU/RAM allocation
can be somewhat isolated. Yes, I know there is a way to quota RAM on
containers. ACLs are relatively easy; not everyone needs access to the
host. Yes, I know SSH can be put inside containers, but that means
network routing is more complicated. That said, I agree we should move
some things to containers when it's easy.
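
(For example, in LXD that's roughly the following, if I have the syntax
right -- the container name is made up:)

```
# cap a container's memory and CPU much like a VM allocation
lxc config set wiki limits.memory 16GB
lxc config set wiki limits.cpu 4
```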

Minor notes:
Our experience with RAID 6 was terrible, but that was a hardware RAID
card; rebuilds on even small SAS drives took days.
ZFS - doesn't ZFS require a lot of RAM?

Thanks,
Alex

On 02/13/2018 02:09 PM, Chris Giorgi wrote:

On Tue, Feb 13, 2018 at 8:57 AM, Alex M <tech_dev@wildintellect.com> wrote:

Thanks for the feedback, some comments inline. - Alex

Quick Note, all mention of RAID is software not hardware.


We actually don't need that much space, 2 TB drives would have sufficed,
I picked the smallest size they offered which was 8 TB. What about just
going with mirror on 2x HGST drives. Note we do have a backup server.
Normally I would also use 4 Drives, but the machines just don't have
that many bays.

If bulk storage isn't at a premium, a simple mirror would work fine using
either 2 or 3 drives (which would allow having the hot spare online in the
mirror, while improving read performance for free).

Having 4 drives does double the capacity while tolerating two drives
failing before losing any data, and the extra space can be used for storing
snapshots locally, allowing a particular dataset to be rolled back to a
previous working state without having to recover from backups -- ZFS allows
access to files in previous snapshots simply by reading them from the
hidden .zfs directories, which is a life-saver for accidental file
overwrites or deletions.


Speed isn't a huge issue, SSDs in any form seem to perform fast enough.
We tend to use SSD storage for everything. Having slow spinning disks at
all is new, and only suggested to start holding larger archives in a
publicly accessible way.

Looking at
http://webextra.osgeo.osuosl.org/munin/osgeo.org/osgeo6.osgeo.org/diskstats_latency/index.html
shows that write latency is the largest bottleneck, while
http://webextra.osgeo.osuosl.org/munin/osgeo.org/osgeo6.osgeo.org/diskstats_iops/index.html
indicates that home, www, and mailman have the highest IOPS needs, with
most I/O being small writes. Those are poorly suited to the SSDs indicated
in the quote (DC S4500), which are intended for low-write, high-read use
and will quickly wear out under heavy write loads.

DC S3600 or S3700 series drives are designed for the higher-write
environment and would be a better choice.

Another option that may be better yet would be to purchase one of the
lower-cost PCIe Optane SSDs (900P) to provide a very high-performance cache
for hot data, and a pair of small, high-write-volume SSDs to use for
mirrored write caching -- in ZFS terms, this would allow a portion of the
PCIe SSD to be used for the L2ARC, and the mirrored SATA SSDs for the SLOG.
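
In zpool terms that split would be roughly (device names are placeholders):

```
# mirrored SATA SSDs as the separate intent log (SLOG) for synchronous writes
zpool add tank log mirror /dev/disk/by-id/ata-S3700_1 /dev/disk/by-id/ata-S3700_2
# a partition of the PCIe/Optane SSD as L2ARC read cache
zpool add tank cache /dev/nvme0n1p1
```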

We could go to 4 drives and do a mirror 2x2. We want the ability to keep
going when a drive drops and get a new drive in within a couple of days.

Note OSGeo6 is 6xSSD with I believe 2 RAID5.

Can you verify the type of SSDs, there are other options - also make
sure to note these are not the consumer models.


If you look at OSGeo6 I'm not sure we're really utilizing all the ram we
bought.
http://webextra.osgeo.osuosl.org/munin/osgeo.org/osgeo6.osgeo.org/index.html

Though really in this case it's about $900 to add the additional ram up
to 128. I'm on the fence about this. Since I'd prefer to buy more often
cheaper machines than load up expensive ones.

Looking at the stats, it appears that it's fully utilizing the ~128G
allocated, with
most of it acting as a page cache for the filesystems.

There do appear to be instances when swap was getting hit, but I can't tell
why from just the graphs.

Increasing the memory will almost always be worth it compared to processor
or disk speed upgrades, especially in a multi-tenant environment -- In unix,
RAM is never wasted, it just gets allocated to various subsystems as needed.


We've already decided to go VM, so we can migrate existing services. In
our case administering a VM can be delegated easily. We do plan to try
our containers on OSGeo6 (existing). But for now we really just need to
move existing VMs from OSGeo3 so we can retire the hardware. These
include Downloads, Wiki, Trac/SVN, and Webextra (Foss4g)

As a migration path, it's certainly easy enough to spin up a couple of VMs,
but in the long run, those services really should be split up in a more
fine-grained manner to make them both easier to admin and to reduce resource
usage for services when they aren't in active use.

The current setup with many services running in each VM both increases
resource contention and decreases performance, while also requiring
more administrative overhead and making upgrading of a portion of the
software stack very difficult without upgrading the entire system.

I would be happy to go over the various services needed by which users
to help plan a proper migration away from the all-in-one VMs, and there
is no problem with having both VMs and containers in various combinations,
including containers within VMs where that makes sense.

I hope this helps with the purchasing and provisioning decisions.

Take care,
   ~~~Chris Giorgi~~~


All,

With Chris's help, here are updated quotes -- 4 variants that just
change the size of the large spinning disks, which, if I understand
correctly, it's suggested we run in a mirror, so n-2 storage.

Chris, is the mirror RAID-based, and if so which mode? Or some other way?

https://drive.google.com/file/d/1X-z66jXXBUZuPqh6EP0d43g2NUCL7xcL/view?usp=sharing

Opinion: I'm leaning towards the 4 TB drives instead of the 8 TB. That
will give over 7 TB of usable space with dual redundancy, and saves us
$500, which goes towards the +$800 of RAM to get to 128 GB, or the Optane
card, depending on how you look at it.
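
(If I understand the suggestion right, the mirrored layout would be ZFS
striped mirrors (2x2), roughly like the sketch below -- device names are
placeholders -- but correct me if you mean something else.)

```
# two mirrored pairs striped together: about half the raw capacity usable,
# and it survives one failed drive per pair
zpool create -o ashift=12 tank \
    mirror /dev/disk/by-id/ata-HDD_1 /dev/disk/by-id/ata-HDD_2 \
    mirror /dev/disk/by-id/ata-HDD_3 /dev/disk/by-id/ata-HDD_4
```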

Thanks,
Alex

On 02/14/2018 01:03 PM, Chris Giorgi wrote:

On Wed, Feb 14, 2018 at 9:29 AM, Alex M <tech_dev@wildintellect.com> wrote:

Chris,

Want to take a try at configuring some options? Just make some quotes on
the silicon mechanics website. I can get us the Non-profit discount
(3-5% typically) on any quote through my Rep. See what you can do for
under $6000.

Targeting < $6000 using that base configuration is a bit tricky, but the
changes I would make are the following:
+$848: 64G -> 128G RAM
-$30: Hot-swap 4: S4500 -> HGST 8he SATA (+$212 for 4x 10he or +$540 for 4x 12he)
-~$150: 2x S3700 200G - 10 DWPD, preferred for frequent writes (not listed by Silicon Mechanics - open market price)
  (or -$210: 2x S4500 480G - 1 DWPD -> 2x S4600 240G - 3 DWPD, moderate writes okay)
  (or +$212: 2x S4600 480G - 3 DWPD)
  (or +~$350: 2x S3700 400G - 10 DWPD, preferred for frequent writes; not listed by Silicon Mechanics - open market price)
  (or +$586: 2x Micron 5100 Max 960G - 5 DWPD)
+~$650: 1x Optane SSD 900P 480GB PCIe - 10 DWPD (not listed by Silicon Mechanics - open market price)

- Additional RAM will make the largest improvement in overall performance
and responsiveness under load.

- Configured with dual redundancy for the HDD array, this gives a capacity
of 16TB (20TB +$212, 24TB +$540).

- The SATA SSDs would be configured as a mirror, used primarily for write
caching; 2x 200G S3700 is more than enough and can support 2TB of writes
daily without premature failure (vs. 480GB/day with S4500s). Extra space
may be used for DB tables having high transaction rates.

- A PCIe-connected Optane SSD would provide blistering-fast caching for hot
data, with orders of magnitude lower latency and higher IOPS than SATA SSDs
can provide.

Some bigger questions to discuss:
What services should be on what type of disk?
Which services should be easy to containerize?

Major services:
Downloads - no database, frequent writes only to the maven stuff?
Container Easy

Downloads can reside entirely on spinning platters with no problems.
Frequently accessed files will be cached to memory by the filesystem, and
the initial latency of accessing cold data is still reasonable.

Trac/SVN - Postgres DB, not sure how heavy on writes. Container, tricky

-Trac/SVN will have no qualms about sitting in containers, preferably one
for each Trac instance.
-A proxying web server configuration would allow a single web server to
handle connections for all instances.
-Trac itself should have low write requirements, with moderate reads.
-SVN/git I/O load scales with commit volume: high-frequency, small random
writes.
-A single PostgreSQL installation should handle all of them with ease.
-PostgreSQL will be the highest transaction load, but it is log-structured,
so block rewrites are not out of hand.
-A VM may be appropriate for PostgreSQL, but it also works fine in
containers, possibly better due to less overhead.

Webextra - mostly static archives. Container easy.

Same story as Downloads.

Wiki - MediaWiki php/mysql, lots of writes. Container moderate.

Write caching + plenty of RAM should make this load perform well.
No difficulty with containerization, but possibly use VM to limit
resources if needed.

Mailman - frequent writes. Container ?

Mail services may use either containers or VMs comfortably. Write-caching
will help immensely here. Long term storage on spinning rust is fine.

Projects & Adhoc - a variety of php and other type sites, mostly read,
not a lot of writes. Container - would make life easier if every project
was it's own containers, but we don't want that many db instances.

A container for every project is easily doable with a single PostgreSQL
instance serving all of them -- if a few projects need their own dedicated
instances, that's easy to set up using another container that they control.

Why VMs:
Mostly because 1 bad VM could easily be restarted and CPU/RAM allocation
could be somewhat isolated. Yes I know there is a way to quota ram on
containers. ACL is relatively easy, not everyone needs access to the
host. Yes I know ssh can be put inside containers, but that means
network routing is more complicated. That said I agree we should move
some things to containers when easy.

VMs are best used where unbounded resource consumption is likely or
where services need to be suspended and resumed elsewhere. Most
hypervisor based container technologies (such as LXC/LXD) can fully
compartmentalize resource usage, and containers can be stopped and
restarted just like VMs. The networking with ssh connections into the
containers is SOP these days and is the primary means used by
provisioning tools like Vagrant.

Minor notes:
Our experience with RAID 6 was terrible, but that was a hardware raid
card. Rebuild on even small SAS drives was days.
ZFS - doesn't ZFS require a lot of RAM?

Hardware RAID 6 suffers badly when you have to resilver because the
computational load must be handled by the card while remaining reasonably
transparent to the operational state of the machine. A good software
RAID is generally much faster, considering modern CPU power. ZFS
RAIDZ2 doesn't suffer from the same issues because the filesystem
understands the data duplication strategy directly, rather than having
a translation layer between the physical disks and the filesystem.

ZFS does like a lot of RAM to increase the amount of cached data,
but unless you're trying to use the deduplication feature (DON'T!),
it will happily get by on a modest amount. In general, all filesystems
will perform better if given lots of RAM for caching. ZFS can also very
effectively use both read (L2ARC) and write (SLOG) caches on fast
drives to improve performance on spinning rust arrays to the point
only cold-data has any noticeable latency. When using ZFS, the
entire pool can be compressed with lz4, which both increases
useful storage and bandwidth -- files which are already compressed
are recognized, so there is no drawback to using it on mixed data.
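
For example (a sketch -- the 32 GiB cap and pool name are arbitrary):

```
# cap the ARC at 32 GiB so the rest of RAM stays free for VMs/containers
# (takes effect after the zfs module is reloaded or at next boot)
echo "options zfs zfs_arc_max=34359738368" | sudo tee /etc/modprobe.d/zfs.conf
# enable lz4 compression pool-wide; already-compressed files see no penalty
zfs set compression=lz4 tank
```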

Take a look at the drives listed above and perhaps inquire
with Silicon Mechanics about the availability of the
DC S3700 200GB SATA SSDs and the Optane 900p 480G
PCIe SSD (workstation class, not the insanely expensive one!)
and let me know how far off the price target we end up.

Thanks,
    ~~~Chris~~~


Alex M wrote:

The plan based on discussions is to manage KVM virtual machines, lvm
drives, with libvirt.

Of course I'm not in a position to veto this conclusion; anyhow, I'd
like to emphasize that I'd rather not support another virtualization setup
having a boot loader inside the respective VMs. From my perspective this
adds maintenance overhead without benefit.

The last infrastructure transition was driven by people who abdicated
their responsibilities shortly after making far-reaching technical
decisions, letting others suffer from the trouble they caused. Thus,
whoever supports the above conclusion should be prepared to stand by their
responsibilities for a couple of years, fixing stuff *themselves* whenever
anything goes wrong!!

Cheers,
  Martin.
--
Unix _IS_ user friendly - it's just selective about who its friends are !
--------------------------------------------------------------------------

I concur with Martin: having more layers of complexity than necessary
makes things both fragile and difficult to administer.
I would like to propose the following stack as an alternative to the
proposed md + LVM2 + filesystem + KVM + libvirt (while still allowing
KVM/libvirt when desired):

Operating system - Ubuntu?:
(https://www.ubuntu.com/)
  - It seems most SAC members are comfortable with the Debian-based tools.
  - ZFS and LXD are actively supported by Canonical; LXD is available
with commercial support.
  - Base Debian does not include ZFS; you must build it from source in
the extras repo using DKMS.

Use ZFS instead of md software RAID + LVM2 volume management + filesystem:
(http://zfsonlinux.org/)
(https://wiki.ubuntu.com/ZFS)
  - ZFS is a logged CoW (Copy on Write) filesystem, which reduces random
I/O and eliminates most data-loss risks from unexpected power loss.
  - ZFS handles all levels of the storage stack, and thus can ensure data
integrity remains intact.
  - There are only a couple of commands to learn to handle almost all
storage related tasks:
    - `zpool` handles everything from the raw disks up to the 'pool' level; and
    - `zfs` handles everything related to 'datasets', which encompass
both logical volume and filesystem semantics.
  - Mount points for the datasets are managed by ZFS, no fstab editing
is required.
  - Each dataset can have its options tuned individually, inheriting its
parent's options by default.
  - Redundancy is available both at the device and dataset levels.
  - Snapshots are easy, instantaneous, and only consume the space
needed to store changes.
  - Backup and restore functions are built in and very easy to use;
`zfs send` and `zfs receive` work locally, to/from a file, or over the
network (see the sketch after this list).
  - Datasets can be cloned instantly, with each clone growing in size
only as it changes from the source; this is ideal for containers.
  - Caching is handled at the block level, allowing all unchanged
portions of clones to be cached only once.
  - Caching can be distributed across additional fast devices for both
read (L2ARC) and write (ZIL SLOG) caches.
  - Unallocated system memory is used as the primary cache (ARC),
making hot data access nearly instant.
  - Utilizing a fast SLOG allows write latency to pools on spinning
HDDs to be reduced to the write latency of the SLOG device.
  - Read latency from a pool of spinning HDDs is only appreciable the
first time cold data is accessed, after it has been cached, it is read
from the ARC (RAM) or L2ARC (Fast storage).
  - Virtual raw block devices called zvols can be created to act as
backing stores for swap, VMs, or other low-level block device needs,
and can be cloned and snapshotted just like datasets.
  - Quotas and ACLs can be set at the dataset level.
  - Several container/virtualization management tools support using
ZFS's clones and snapshots to quickly create, copy, provision, and
backup containers/VMs.
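
A minimal sketch of that send/receive backup flow (pool, dataset, and host
names are placeholders):

```
# full initial replication of a dataset to another machine
zfs snapshot tank/www@2018-02-16
zfs send tank/www@2018-02-16 | ssh backuphost zfs receive backup/www
# later, send only what changed since the previous snapshot
zfs snapshot tank/www@2018-02-17
zfs send -i tank/www@2018-02-16 tank/www@2018-02-17 | ssh backuphost zfs receive backup/www
```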

Use LXC/LXD containers in place of most VMs:
(https://linuxcontainers.org/lxd/)
(https://help.ubuntu.com/lts/serverguide/lxd.html)
(https://www.ubuntu.com/containers/lxd)
  - Administration is simpler (see the workflow sketch after this list).
  - Containers don't require their own kernel or filesystem.
  - Resource utilization is much lower.
  - By not fragmenting resources, they can be allocated more
efficiently and a single copy of duplicate data shared across multiple
containers.
  - Networking can be passed through without adding another layer of
device drivers.
  - Containers can nest with other containers or VMs; you can run
Docker inside an LXC container inside a VM if you really want to, as
well as running a couple of VMs inside a container if needed.
  - Containers can be set up as privileged and resources of the host
exposed where needed.
  - Containers can be stopped, started, and migrated in much the same
way as VMs.
  - For scaling up, OpenStack works with containers as well as VMs.
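
A rough sketch of the day-to-day workflow (image and container names are
only examples):

```
lxc launch ubuntu:16.04 base       # create a template container from a stock image
lxc stop base
lxc copy base wiki                 # near-instant CoW copy on a ZFS-backed pool
lxc start wiki
lxc snapshot wiki before-upgrade   # instant snapshot; roll back with `lxc restore`
```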

Please look this over and provide comments and concerns.

Take care,
   ~~~Chris~~~
