[SAC] [support.osuosl.org #23649] [OSGeo] Failed disk in osgeo4.osuosl.bak

Resolving ticket.

On Mon Apr 21 10:35:17 2014, jldugger wrote:

Looks like this disk rebuild finished moments ago, so your RAID array
should be healthy and full of IOPS now. This should wrap up the disk
replacements in osgeo3 for a while.

As a reminder, osgeo4 also has a failed Power Supply.

Justin

On Fri Apr 18 10:41:17 2014, jldugger wrote:
> I'll mark it on my calendar then :wink:
>
> Justin
>
> On Thu Apr 17 21:15:27 2014, tech@wildintellect.com wrote:
> > Friday 1pm PST?
> >
> > Unless I hear screams from the community about some event happening,
> > let's plan for that. We'll also plan to shut down most if not all of
> > the VMs to make it go faster.
> >
> > Thanks,
> > Alex
> >
> > On 04/17/2014 02:56 PM, Justin Dugger via RT wrote:
> > > I've received a pair of drives today ATTN: OSGEO. Let me know when
> > > you'd like to take a downtime and we'll get that in.
> > >
> > > Justin
> > >
> > > On Fri Apr 11 14:29:56 2014, tech@wildintellect.com wrote:
> > >> Justin,
> > >>
> > >> Thanks, we suspected this when we did the battery replacement. 2
> > >> drives have been ordered and should arrive early next week. Yes, we
> > >> got a spare this time.
> > >>
> > >> When they come in we should plan an outage window to turn off VMs to
> > >> make the rebuild go faster.
> > >>
> > >> Thanks,
> > >> Alex
> > >>
> > >> On 04/11/2014 01:08 PM, Justin Dugger via RT wrote:
> > >>> Hey, an important followup!
> > >>>
> > >>> It appears that osgeo4 has lost another drive (this time in slot 0)
> > >>> around 5:30AM:
> > >>>
> > >>> 05:39 PROBLEM: osgeo4.osuosl.bak/Dell RAID Array is CRITICAL,
> > >>> CRITICAL: 0:BBU Charged (100%):0:RAID-6:6 drives:557.75GB:Partially
> > >>> Drives:6 1 Bad Drives (88 Errors), Apr 11, 12:39 UTC
> > >>>
> > >>> This will affect I/O performance in the obvious ways: any read
> > >>> involving the affected disk will require reading all the other drives
> > >>> to calculate what the block should be.
> > >>>
> > >>> On Tue Apr 08 09:20:21 2014, jldugger wrote:
> > >>>> And another followup to document the results of the repair:
> > >>>>
> > >>>> osgeo4 came back online complaining that one of its Power Supply
> > >>>> units has failed. It also took quite a while for the VM qgis to
> > >>>> fsck, and that ended up requiring a manual fsck to repair.
> > >>>>
> > >>>> We've agreed to delay osgeo3's battery replacement until next week.
> > >>>>
> > >>>> On Mon Apr 07 11:30:29 2014, jldugger wrote:
> > >>>>> Just to confirm/document what was discussed on IRC:
> > >>>>>
> > >>>>> The RAID array was rebuilt last week, but we discovered the cause
> > >>>>> of the low throughput was that the RAID card on osgeo4 detected a
> > >>>>> weak battery state and transitioned to a slower, safer
> > >>>>> WriteThrough policy.
> > >>>>>
> > >>>>> We've received a pair of batteries and will be taking a planned
> > >>>>> downtime to install them.
> > >>>>>
> > >>>>> On Thu Apr 03 09:25:58 2014, jldugger wrote:
> > >>>>>> On Thu Apr 03 08:28:55 2014, ramereth wrote:
> > >>>>>>> On Thu, Apr 3, 2014 at 12:04 AM, tech@wildintellect.com via RT
> > >>>>>>> <support@osuosl.org> wrote:
> > >>>>>>>
> > >>>>>>>> Something seems amiss. The ProjectsVM stopped responding, high
> > >>>>>>>> disk latency and iowait ( 10-11pm PST
> > >>>>>>>
> > >>>>>>> Rebuild Progress on Device at Enclosure 32, Slot 3 Completed 82%
> > >>>>>>> in 200 Minutes.
> > >>>>>>>
> > >>>>>>> I've never seen a rebuild take this long before but this hardware
> > >>>>>>> is starting to show its age a little.
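
For a rough sense of scale, the progress line quoted above (82% in 200 minutes) can be turned into an estimate of the remaining time. This is only a sketch: it assumes the rebuild proceeds at a constant rate, which a busy array will not guarantee, and the function name estimate_rebuild is made up for illustration.

```python
def estimate_rebuild(percent_done, minutes_elapsed):
    """Return (total_minutes, remaining_minutes), assuming a constant rebuild rate."""
    if not 0 < percent_done <= 100:
        raise ValueError("percent_done must be in the range (0, 100]")
    # Linear extrapolation: elapsed time scales with the fraction completed.
    total = minutes_elapsed * 100.0 / percent_done
    return total, total - minutes_elapsed


# Figures taken from the controller's progress message.
total, remaining = estimate_rebuild(82, 200)
print(f"estimated total: {total:.0f} min, remaining: {remaining:.0f} min")
```

Under that assumption the whole rebuild comes to roughly four hours, with a little under 45 minutes still to go at the time of the message.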
> > >>>>>>
> > >>>>>> The only time I've seen things go this slowly was the time I
> > >>>>>> forgot to take our (very busy) FTP mirror out of rotation for the
> > >>>>>> duration of a rebuild. Under RAID 5, recalculating a block on the
> > >>>>>> replacement drive requires reading a block from every other
> > >>>>>> drive. So rebuilds can 'steal' a lot of I/O from a system that was
> > >>>>>> already down 1 disk worth of I/O requests per second. While you
> > >>>>>> can sometimes tune the RAID firmware to rebuild at a lower
> > >>>>>> priority, there's a balancing act between service latency and
> > >>>>>> repairing the RAID array before a second drive fails.
> > >>>>>>
> > >>>>>> TL;DR: sorry this is taking so long; I didn't realize the services
> > >>>>>> depending on it were quite so IO bound.
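
To illustrate why a rebuild has to touch every surviving drive, here is a minimal sketch of single-parity (RAID 5 style) reconstruction using XOR. The four-drive stripe and its byte values are hypothetical, and this is not how the Dell controller firmware is implemented; the RAID-6 array reported in the alert also keeps a second, differently computed parity block, which this sketch does not cover.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_block(stripe, failed_index):
    """Recompute the block of the failed drive from every surviving drive."""
    survivors = [blk for i, blk in enumerate(stripe) if i != failed_index]
    return xor_blocks(survivors)


# Hypothetical stripe: three data blocks plus one parity block.
data = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]
parity = xor_blocks(data)
stripe = data + [parity]

# Pretend drive 1 failed and rebuild its block from the other three.
rebuilt = rebuild_block(stripe, 1)
print(rebuilt == data[1])
```

Every rebuilt block costs one read on each remaining drive, which is the I/O the rebuild takes away from the VMs.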
> > >>>>>
> > >>>>>
> > >>>>
> > >>>>
> > >>>
> > >>>
> > >>>
> > >>> _______________________________________________
> > >>> Sac mailing list
> > >>> Sac@lists.osgeo.org
> > >>> http://lists.osgeo.org/mailman/listinfo/sac
> > >>>
> > >>
> > >
> > >
> > >
> > >
> >
>
>