[SAC] [OSGeo] #1982: osgeo6 disk full: 100% /var

#1982: osgeo6 disk full: 100% /var
---------------------------+--------------------
Reporter: neteler | Owner: sac@…
     Type: defect | Status: new
Priority: blocker | Milestone:
Component: Systems Admin | Keywords: osgeo6
---------------------------+--------------------
Disk full!

{{{
Aug 14 02:39:01 osgeoroot@osgeo6:/var/log# df -h .
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/osgeo6-var 19G 19G 20K 100% /var
}}}

It seems that our monitoring services are not being looked at, nor do they
send warnings by email to the SAC list.
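
A threshold check of the kind the monitoring is expected to perform can be sketched in a few lines of shell. This is only an illustration, not the munin setup in use on osgeo6: the function names and the 90% default limit are assumptions made for this example.

```shell
# Extract the Use% column (as a bare integer) from `df -P` output on stdin.
# Assumes the POSIX output format: one header line, then one data line.
parse_use_percent() {
    awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeed when the filesystem holding PATH is at or above LIMIT percent.
# The function name and the default limit of 90 are hypothetical.
disk_over_limit() {
    path=$1
    limit=${2:-90}
    used=$(df -P "$path" | parse_use_percent)
    [ "$used" -ge "$limit" ]
}

# Example: warn when /var is at least 90% full.
# disk_over_limit /var 90 && echo "WARNING: /var is nearly full"
```

Any alert built on such a check would need a delivery path that still works when the mail server's own disk is full.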

--
Ticket URL: <https://trac.osgeo.org/osgeo/ticket/1982>
OSGeo <http://www.osgeo.org/>
OSGeo committee and general foundation issue tracker.


Comment (by neteler):

As an emergency measure, I have removed the two largest rotated mail logs:

{{{
root@osgeo6:/var/log# du -hs * | grep G
4.8G mail.info.1
4.8G mail.log.1

root@osgeo6:/var/log# rm -f mail.info.1 mail.log.1
}}}
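
Note that `du -hs * | grep G` also matches any entry whose name contains a capital G. A slightly more robust way to list the biggest files is sketched below. It assumes GNU find (for `-printf`), and `largest_files` is a name made up for this example.

```shell
# List the N largest regular files under DIR, biggest first.
# Output: size in bytes, a tab, then the path. Requires GNU find (-printf).
largest_files() {
    dir=$1
    n=${2:-10}
    find "$dir" -type f -printf '%s\t%p\n' | sort -rn | head -n "$n"
}

# Example: largest_files /var/log 5
```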

--
Ticket URL: <https://trac.osgeo.org/osgeo/ticket/1982#comment:1>

Changes (by neteler):

* cc: strk (added)

Comment:

Since osgeo6 is also the mailman list server, there is a notable backlog
of currently deferred emails:

{{{
root@osgeo6:/var/log# mailq | grep mailman-bounces@lists.osgeo.org | wc -l
143069
}}}
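
To see which senders dominate the queue, rather than counting one known address, the `mailq` output can be tallied per sender. A minimal sketch, assuming the traditional `mailq` listing in which each entry line starts with the queue ID and carries the sender in the seventh field; `queue_senders` is a hypothetical helper name.

```shell
# Read `mailq` output on stdin and print "count sender", most frequent first.
# Entry lines start with a queue ID (optionally followed by * or !);
# header, reason, recipient and summary lines are skipped.
queue_senders() {
    awk '/^[0-9A-Za-z]+[*!]?[ \t]/ && NF >= 7 { count[$7]++ }
         END { for (s in count) print count[s], s }' | sort -rn
}

# Example: mailq | queue_senders | head
```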

This is bloating the log files: because our sending rate is too high,
receiving servers temporarily reject our emails, and every rejection is logged:

{{{
ls -lart | grep 'mail.log\|mail.info'
...
-rw-r----- 1 root adm 959388006 Aug 14 03:45 mail.log
-rw-r----- 1 root adm 959408486 Aug 14 03:45 mail.info

tail -f /var/log/mail.info
...
Aug 14 03:47:41 osgeo6 postfix/smtp[11659]: 6CD906099BF8: host
smx00.udag.de[62.146.106.132] refused to talk to me: 421 Too many
concurrent SMTP connections from this IP address; please try again later.
...
}}}

These huge log files eventually fill the disk to 100%.
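
To find out which remote hosts produce most of these rejections, the log can be tallied per host. The sketch below assumes each log entry is a single line (the excerpts in this ticket are wrapped by the mail client) and only matches the two message forms quoted in this ticket; `deferring_hosts` is a hypothetical name.

```shell
# Read Postfix log lines on stdin and print "count host" for lines of the
# form "host NAME[IP] said: ..." or "host NAME[IP] refused to talk to me: ...",
# most frequent host first. Other log lines are ignored.
deferring_hosts() {
    sed -nE 's/.* host ([^[ ]+)\[[^]]*\] (said|refused to talk to me):.*/\1/p' |
        awk '{ count[$1]++ } END { for (h in count) print count[h], h }' |
        sort -rn
}

# Example: deferring_hosts < /var/log/mail.info | head
```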

--
Ticket URL: <https://trac.osgeo.org/osgeo/ticket/1982#comment:2>


Comment (by wildintellect):

So clearly the log files contribute, but I'm not sure that's the root
cause. I think mailman was starting to throw errors and that cascaded
into filling its own logs, which then caused more errors. Anyone know if
there's a way to limit how big a mailman log will get before it cycles,
rather than waiting for logrotate? Can someone try to extract the head of
the logs to figure out what caused the initial issue?
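
On the size question: logrotate's size-based directives only take effect when logrotate itself runs, which is typically once a day. One possible workaround is a small size check that runs more often, for example from an hourly cron job. The sketch below is an assumption, not something configured on osgeo6: the function name, the 1 GiB threshold and the logrotate config path are all hypothetical.

```shell
# Succeed when FILE exists and is larger than MAX_BYTES.
log_too_big() {
    file=$1
    max_bytes=$2
    [ -f "$file" ] || return 1
    size=$(( $(wc -c < "$file") ))
    [ "$size" -gt "$max_bytes" ]
}

# Example (hypothetical threshold and config path), e.g. from hourly cron:
# if log_too_big /var/log/mail.log 1073741824; then
#     logrotate -f /etc/logrotate.d/rsyslog
# fi
```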

I'll check on the munin email setup, I'm not sure what it's set to do when
it hits a limit.

--
Ticket URL: <https://trac.osgeo.org/osgeo/ticket/1982#comment:3>


Comment (by neteler):

Replying to [comment:3 wildintellect]:
> So clearly the log files contribute, but I'm not sure that's the root
> cause. I think mailman was starting to throw errors and that cascaded
> into filling its own logs, which then caused more errors. Anyone know if
> there's a way to limit how big a mailman log will get before it cycles,
> rather than waiting for logrotate?

AFAIK it is postfix, not mailman, that writes those two log files.

> Can someone try to extract the head of the logs to figure out what
> caused the initial issue?

Most messages are like this:

{{{
Aug 14 14:01:15 osgeo6 postfix/smtp[9601]: 3FFB36332EB7: host
aspmx.l.google.com[74.125.28.26] said: 450-4.2.1 The user you are trying
to contact is receiving mail at a rate that 450-4.2.1 prevents additional
messages from being delivered. Please resend your 450-4.2.1 message at a
later time. If the user is able to receive mail at that 450-4.2.1 time,
your message will be delivered. For more information, please 450-4.2.1
visit 450 4.2.1 https://support.google.com/mail/?p=ReceivingRate
d11si5262351pln.414 - gsmtp (in reply to RCPT TO command)
}}}

Apparently we are hammering some other servers too much. At the moment it
is a kind of endless retry loop with those servers (i.e. for some recipients).

For now, I have added some postfix throttling to /etc/postfix/main.cf:
{{{
# throttle, see https://trac.osgeo.org/osgeo/ticket/1982
smtp_destination_concurrency_limit = 2
smtp_destination_rate_delay = 1s
smtp_extra_recipient_limit = 10
}}}

Maybe that will help to calm down the Google and dtag mail servers. We can
comment out these lines once the mail queue is closer to empty. At the moment:

{{{
mailq | grep mailman-bounces@lists.osgeo.org | wc -l
111064
}}}

which is about 32,000 fewer than 11 hours ago.
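
As a rough extrapolation from the two queue counts in this ticket, and assuming the drain rate stays constant (which it may not, given the new throttling), the time to empty the queue can be estimated with integer arithmetic. `drain_eta_hours` is a hypothetical helper.

```shell
# Estimate the hours needed to drain REMAINING messages, given that DRAINED
# messages left the queue over the last HOURS hours (integer arithmetic).
drain_eta_hours() {
    remaining=$1
    drained=$2
    hours=$3
    [ "$drained" -gt 0 ] || return 1
    echo $(( remaining * hours / drained ))
}

# Figures from this ticket: 143069 - 111064 = 32005 messages in about 11 hours.
drain_eta_hours 111064 $(( 143069 - 111064 )) 11   # prints 38
```

So at the observed rate the remaining backlog would take on the order of a day and a half to clear.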

> I'll check on the munin email setup, I'm not sure what it's set to do
> when it hits a limit.

ok thx

--
Ticket URL: <https://trac.osgeo.org/osgeo/ticket/1982#comment:4>


Comment (by neteler):

Mailman emails are no longer arriving; we should reboot "osgeo6".

--
Ticket URL: <https://trac.osgeo.org/osgeo/ticket/1982#comment:5>


Comment (by martin):

To me it looks like more than 95% of the Postfix message queue consists of
bounces. I'm now going to remove these from the queue, and I hope I don't
catch any relevant email.
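
One common way to do this kind of cleanup, not necessarily the method used here, is to collect the queue IDs for a given envelope sender and hand them to `postsuper -d -`, which reads queue IDs from standard input. The helper name `queue_ids_from` and the example sender are placeholders, and the sketch assumes the traditional `mailq` listing format.

```shell
# Read `mailq` output on stdin and print the queue IDs of all entries whose
# envelope sender is exactly SENDER, with the * / ! status markers removed.
queue_ids_from() {
    sender=$1
    awk -v sender="$sender" '
        /^[0-9A-Za-z]+[*!]?[ \t]/ && NF >= 7 && $7 == sender {
            sub(/[*!]$/, "", $1)
            print $1
        }'
}

# Review the list first, then (as root) feed it to postsuper:
# mailq | queue_ids_from 'sender@example.org' > /tmp/ids.txt
# postsuper -d - < /tmp/ids.txt
```

Writing the IDs to a file before deleting gives a chance to check that no relevant mail is caught.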

--
Ticket URL: <https://trac.osgeo.org/osgeo/ticket/1982#comment:6>


Comment (by wildintellect):

Looking at the munin graph, the queue growth started on Aug 7th. Logs from
that date would be the best thing to check to figure out how to avoid a
repeat in the future.
http://webextra.osgeo.osuosl.org/munin/osgeo.org/osgeo6.osgeo.org/postfix_mailqueue.html

I updated munin to email sysadmin@osgeo (Alex, Martin and Sandro get
this). I would add SAC, but I'm not sure how to put multiple emails into
munin notify, or how to whitelist munin so it can send to the SAC list.
Does anyone else know those parts?

--
Ticket URL: <https://trac.osgeo.org/osgeo/ticket/1982#comment:7>


Comment (by strk):

1) Have the mail sent to the list
2) Access the list moderation and whitelist the sender

--
Ticket URL: <https://trac.osgeo.org/osgeo/ticket/1982#comment:8>


Comment (by neteler):

Replying to [comment:7 wildintellect]:
> I updated munin to email the sysadmin@osgeo (Alex, Martin and Sandro get
> this)

Keep in mind that such alarms won't arrive if osgeo6's disk is full, as in
this ticket. I guess that the individual addresses should be added directly
in munin.

--
Ticket URL: <https://trac.osgeo.org/osgeo/ticket/1982#comment:9>

#1982: osgeo6 disk full: 100% /var
---------------------------+---------------------
Reporter: neteler | Owner: sac@…
     Type: defect | Status: closed
Priority: blocker | Milestone:
Component: Systems Admin | Resolution: fixed
Keywords: osgeo6 |
---------------------------+---------------------
Changes (by neteler):

* status: new => closed
* resolution: => fixed

--
Ticket URL: <https://trac.osgeo.org/osgeo/ticket/1982#comment:10>