[SAC] Mailman Postfix tuning - Urgent

Alex_M · April 19, 2014, 1:03am

Disk rebuild on osgeo4 is taking longer than expected. Not sure why, but
even with all the VMs off it was going at 2%/hour.

After a couple of hours I turned on Mail so that we wouldn't impact the
mailing lists for too long. The problem here is that Mail is the biggest
i/o user on osgeo4.

Poking around and reading up, I think there are some things that can be
done to make Mail behave better but would really prefer someone who
knows postfix step in and help out here.

Ideas and observations:
Ram-disk for the queues? (Yes we can allocate more ram to the instance)

We seem to have a large deferred queue. Do we need to clean out bad
addresses? Perhaps increase the retry delay, or spin the deferred
queue off to another "graveyard" server?

Is apache being hammered by bots scanning the archives?

Are there bot-like things trying to subscribe? I see some odd @wp.pl
addresses in the apache logs, trying to hit many lists in a few seconds
and then rotating the username.

Are we using a local DNS cache?

http://www.postfix.org/QSHAPE_README.html#deferred_queue
http://www.postfix.org/postconf.5.html#queue_run_delay
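
To answer the "bad addresses" question, it helps to see which recipient domains dominate the deferred queue. The qshape tool linked above reports this per domain, bucketed by message age. As a rough alternative, here is a minimal Python sketch that counts recipient domains in text laid out like `mailq` / `postqueue -p` output. The sample data and the function name are made up for illustration, and the parser assumes recipient lines are indented and contain only an address.

```python
import re
from collections import Counter

# Hypothetical excerpt in the classic `mailq` / `postqueue -p` text layout.
SAMPLE_MAILQ = """\
-Queue ID- --Size-- ----Arrival Time---- -Sender/Recipient-------
3A1F02C0D1     2841 Fri Apr 18 17:12:44  sac-bounces@lists.example.org
(connect to mx.wp.pl[192.0.2.10]:25: Connection timed out)
                                         abc123@wp.pl

7B22E4C0D2     2790 Fri Apr 18 17:13:02  sac-bounces@lists.example.org
(host mx.wp.pl[192.0.2.10] said: 450 mailbox unavailable)
                                         xyz789@wp.pl

9C03D5C0D3     5120 Fri Apr 18 17:20:31  dev-bounces@lists.example.org
(connect to mail.example.net[198.51.100.7]:25: Connection refused)
                                         someone@example.net

-- 10 Kbytes in 3 Requests.
"""

# Assumption: a recipient line is indented and holds a single address token.
RECIPIENT_LINE = re.compile(r"^\s+(\S+@(\S+))\s*$")


def deferred_domains(mailq_text):
    """Count queued recipients per domain."""
    counts = Counter()
    for line in mailq_text.splitlines():
        match = RECIPIENT_LINE.match(line)
        if match:
            counts[match.group(2).lower()] += 1
    return counts


if __name__ == "__main__":
    for domain, n in deferred_domains(SAMPLE_MAILQ).most_common():
        print(f"{n:6d}  {domain}")
```

On the real server the input would come from `postqueue -p` rather than the sample string.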

Who's up for the challenge?

Thanks,
Alex

PS: I'll turn QGIS and Projects back on in a few hours. Hopefully this
can be tuned before that. Adhoc will likely stay off for the weekend.

Alex_M · April 19, 2014, 1:36am

http://wiki.centos.org/HowTos/postfix_restrictions

HELO and sender restrictions might help by checking up front that
senders have valid domains and follow standard practices.

We do have recipient restrictions enabled already.

FYI, I just tried upping queue_run_delay to 600 from the default 300,
so that deferred mail is retried less often.

Thanks,
Alex

_______________________________________________
Sac mailing list
Sac@lists.osgeo.org
http://lists.osgeo.org/mailman/listinfo/sac

Alex_M · April 19, 2014, 2:21am

More progress: based on some advice, I looked at Apache.
It turns out Apache is the cause of much of the i/o, along with bouncing
emails. I added buffered logs, but that didn't change much.

I've added a robots.txt with a crawl delay, since google and bing are all
over the logs. But there are also strange subscription patterns. For now
apache is off on Mail. I'll probably leave it off until the disk rebuild
finishes (ETA end of weekend). So no new subscriptions till then. Any
volunteers to look more into this?
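
For anyone who wants to check what a crawl-delay robots.txt looks like to a well-behaved crawler, here is a small sketch using Python's standard `urllib.robotparser`. The file contents and the 30-second value are assumptions for illustration, not what is actually deployed on Mail. Also note that, as far as I know, Bing honors Crawl-delay while Googlebot ignores it, so this alone may not slow Google down.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; the real file on the server may differ.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 30
"""

parser = RobotFileParser()
# parse() takes an iterable of lines, so no network access is needed.
parser.parse(ROBOTS_TXT.splitlines())

# Delay (in seconds) that a crawler matching "*" is asked to respect.
delay = parser.crawl_delay("bingbot")
# Crawl-delay only slows crawlers down; nothing is disallowed here.
allowed = parser.can_fetch("bingbot", "/pipermail/sac/")

if __name__ == "__main__":
    print("crawl delay:", delay)
    print("archives allowed:", allowed)
```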

Looks like mail queue is now catching up to real time...

FYI, I started modifying grub to include elevator=noop to change the
i/o scheduler (it makes a huge difference on QGIS and Projects), and I'm
setting swappiness between 10 and 30 on various machines since we have
no shortage of RAM.
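
To confirm after a reboot that the scheduler change took effect, the kernel exposes the active scheduler in sysfs, with the active one shown in square brackets (for example `noop deadline [cfq]`). A minimal sketch for reading it, assuming the block device is called `sda` (on a VM it may well be `vda` or something else) and using helper names I made up:

```python
import re
from pathlib import Path


def active_scheduler(sysfs_text):
    """Return the scheduler shown in [brackets], or None if none is marked."""
    match = re.search(r"\[([^\]]+)\]", sysfs_text)
    return match.group(1) if match else None


def report(device="sda"):
    """Print the active i/o scheduler and current swappiness (Linux only)."""
    # Assumption: the device name; adjust to whatever the guest actually uses.
    sched = Path(f"/sys/block/{device}/queue/scheduler").read_text()
    swappiness = Path("/proc/sys/vm/swappiness").read_text().strip()
    print(f"{device}: scheduler={active_scheduler(sched)} swappiness={swappiness}")
```

Calling `report("vda")` on a guest would print one line per device checked.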

Thanks,
Alex

PS: All the tips I'm putting in emails are coming from various sources
other than me, they are appreciated, so keep them coming.


Alex_M · April 20, 2014, 6:34pm

So Apache on Mail is still off. I tested a couple of times for a few
minutes, and the second I turn it on we get hammered by bing/google and
by someone using wp.pl and hotmail addresses to attempt mass
subscriptions to all our lists. This drives the load to 20+ instantly
and keeps it there. For those wondering, this is the smoking-gun
performance killer that causes trouble for all the other VMs on osgeo4
(not that some other things don't contribute a little).
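
One way to put numbers on the mass-subscribe pattern is to scan the Apache access log for clients that touch many different lists within a few seconds. The sketch below assumes the combined log format and the Mailman 2 URL layout `/mailman/subscribe/<list>`; the thresholds, sample lines, and function name are invented for illustration.

```python
import re
from collections import defaultdict
from datetime import datetime

# Assumption: Apache combined log format, Mailman 2 subscribe URLs.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?:GET|POST) /mailman/subscribe/(?P<list>[^/\s?"]+)'
)


def burst_subscribers(lines, max_lists=3, window_seconds=10):
    """Return IPs that hit more than max_lists distinct lists within the window."""
    hits = defaultdict(list)  # ip -> [(timestamp, list name), ...]
    for line in lines:
        match = LOG_LINE.match(line)
        if match:
            # %b expects English month names (C locale).
            ts = datetime.strptime(match.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
            hits[match.group("ip")].append((ts, match.group("list")))

    flagged = set()
    for ip, events in hits.items():
        events.sort()
        start = 0
        for end in range(len(events)):
            # Shrink the window until it spans at most window_seconds.
            while (events[end][0] - events[start][0]).total_seconds() > window_seconds:
                start += 1
            if len({name for _, name in events[start:end + 1]}) > max_lists:
                flagged.add(ip)
                break
    return flagged


# Hypothetical log lines (documentation IP ranges, made-up list names).
SAMPLE_LOG = [
    '192.0.2.55 - - [19/Apr/2014:01:02:03 +0000] "POST /mailman/subscribe/list-a HTTP/1.1" 200 512 "-" "-"',
    '192.0.2.55 - - [19/Apr/2014:01:02:04 +0000] "POST /mailman/subscribe/list-b HTTP/1.1" 200 512 "-" "-"',
    '192.0.2.55 - - [19/Apr/2014:01:02:05 +0000] "POST /mailman/subscribe/list-c HTTP/1.1" 200 512 "-" "-"',
    '192.0.2.55 - - [19/Apr/2014:01:02:06 +0000] "POST /mailman/subscribe/list-d HTTP/1.1" 200 512 "-" "-"',
    '198.51.100.9 - - [19/Apr/2014:01:02:10 +0000] "POST /mailman/subscribe/list-a HTTP/1.1" 200 512 "-" "-"',
    '198.51.100.9 - - [19/Apr/2014:02:30:00 +0000] "GET /mailman/listinfo/list-a HTTP/1.1" 200 900 "-" "-"',
]

if __name__ == "__main__":
    print(sorted(burst_subscribers(SAMPLE_LOG)))
```

A single person subscribing to one or two lists stays under the threshold, while a client walking through every list in seconds is flagged.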

** Until we fix this the list subscription pages will be unavailable **

Why? Because it's blocking i/o on the whole machine.

The only ideas I've seen online so far are:
1. modsecurity - behavioral blocking of known-bad requests using the
   OWASP rules
2. mod_evasive - behavioral blocking by IP activity, i.e. too many hits
   from the same IP in a given time
3. a honeypot field in the subscription pages that prevents successful
   submission by a bot
4. some other QoS-type throttling of apache by IP
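
To make idea 3 concrete: a honeypot is a form field hidden from humans (for example via CSS) that naive bots fill in anyway, so any submission with a value in it can be rejected. The sketch below only shows the server-side check on a URL-encoded form body. The field name `website` and the function name are hypothetical, and wiring this into Mailman's subscribe handler would need a patch that is not shown here.

```python
from urllib.parse import parse_qs

# Hypothetical name of the hidden field added to the subscribe form.
HONEYPOT_FIELD = "website"


def looks_like_bot(form_body):
    """True if the hidden honeypot field was submitted with a non-blank value."""
    fields = parse_qs(form_body, keep_blank_values=True)
    return any(value.strip() for value in fields.get(HONEYPOT_FIELD, []))


if __name__ == "__main__":
    human = "email=user%40example.org&website="
    bot = "email=abc123%40wp.pl&website=http%3A%2F%2Fspam.example"
    print(looks_like_bot(human), looks_like_bot(bot))
```

This only stops bots that blindly fill every field; a bot written specifically against Mailman would need one of the throttling options instead.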

I've also noticed that, with the disks no longer blocking, our active
mail queue is maxed out, the deferred queue is shrinking over time, and
the deferred addresses appear to all be @wp.pl.
If we leave apache off, I think this will clear in a day or so and the
active queue should drop back down to real mail levels.

Helo and Sender restrictions might help prevent us from being used by
spammers. http://wiki.centos.org/HowTos/postfix_restrictions

I really need help on this one, everything I've posted the last few days
is stuff I've learned about mail servers since Friday. I have 0
experience configuring mail servers.

Thanks,
Alex


neteler · April 20, 2014, 7:56pm

On Sun, Apr 20, 2014 at 8:34 PM, Alex Mandel <tech_dev@wildintellect.com> wrote:

So Apache on Mail is still off. I tested a couple of times for a few
minutes, and the second I turn it on we get hammered by bing/google and
someone using wp.pl and hotmail addresses to mass subscribe attempt to
all our lists. This drives the load to 20+ instantly and keeps it there.
For those wondering, this is the smoking gun performance killer that
causes trouble for all the other VMs on osgeo4 (not that some other
things don't contribute a little).

For the record: I have implemented a fail2ban regex to block these
subscription attempts.
Apache is running again; it remains to be seen whether it performs OK now...

Markus

neteler · April 22, 2014, 8:32pm

On Sun, Apr 20, 2014 at 9:56 PM, Markus Neteler <neteler@osgeo.org> wrote:

On Sun, Apr 20, 2014 at 8:34 PM, Alex Mandel <tech_dev@wildintellect.com> wrote:

So Apache on Mail is still off.

Apache has been running for more than 24h and seems to behave OK.

...

I have implemented a fail2ban regex to turn down these
subscription attempts.

[alias hotmail spam]

Yesterday I added two more regexes to
/etc/fail2ban/filter.d/apache-mailman-bot.conf
They seem to catch 'em nicely:

mail:/root # grep apache-mailman-bot /var/log/fail2ban.log | grep Ban | wc -l
391

BTW: I used a no-mercy setting: the attacking IP is temporarily blocked
at the first attempt to abuse mailman for spamming.
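
The actual regexes in apache-mailman-bot.conf are not shown in this thread, so the following is only a guess at the general shape of such a filter, with a way to try it against sample log lines before deploying it. fail2ban failregex lines are Python regular expressions in which `<HOST>` marks the client address; the pattern, the IPv4-only `<HOST>` stand-in, and the sample lines below are all assumptions. In a real filter file a literal `%` would also need to be written as `%%`.

```python
import re

# Hypothetical failregex: a subscribe request whose email parameter is @wp.pl.
FAILREGEX = (
    r'^<HOST> -.*"(?:GET|POST) /mailman/subscribe/\S*'
    r'[?&]email=[^&\s"]*(?:@|%40)wp\.pl'
)


def compile_failregex(failregex):
    """Expand <HOST> with a simplified IPv4-only group (fail2ban's own is broader)."""
    host_group = r"(?P<host>\d{1,3}(?:\.\d{1,3}){3})"
    return re.compile(failregex.replace("<HOST>", host_group))


def matched_host(line, pattern):
    """Return the offending IP if the line matches, else None."""
    match = pattern.search(line)
    return match.group("host") if match else None


PATTERN = compile_failregex(FAILREGEX)

# Made-up access log lines for a quick check.
BOT_LINE = (
    '192.0.2.55 - - [21/Apr/2014:10:00:00 +0000] '
    '"GET /mailman/subscribe/list-a?email=abc123%40wp.pl&fullname=x HTTP/1.1" 200 512 "-" "-"'
)
NORMAL_LINE = (
    '198.51.100.9 - - [21/Apr/2014:10:00:05 +0000] '
    '"GET /mailman/listinfo/list-a HTTP/1.1" 200 900 "-" "-"'
)

if __name__ == "__main__":
    print(matched_host(BOT_LINE, PATTERN))
    print(matched_host(NORMAL_LINE, PATTERN))
```

Blocking at the first attempt presumably corresponds to a maxretry of 1 in the jail definition, so a pattern like this has to be narrow enough that a legitimate subscriber never matches it.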

cheers,
Markus