Postfix queue health analysis (Part 1)

qidizi 2023-04-28 22:21:41 Category: Postfix

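© Copyright belongs to the author: this is an original work by 51CTO blog author qidizi. Contact the author for permission to reprint; unauthorized reproduction may result in legal action.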

Introducing the qshape tool

When mail is draining slowly or the queue is unexpectedly large, run qshape(1) as the super-user (root) to help zero in on the problem. The qshape(1) program displays a tabular view of the Postfix queue contents.

On the horizontal axis, it displays the queue age with fine granularity for recent messages and (geometrically) less fine granularity for older messages.
The vertical axis displays the destination (or with the "-s" switch the sender) domain. Domains with the most messages are listed first.

For example, in the output below we see the top 10 lines of the (mostly forged) sender domain distribution for captured spam in the "hold" queue:

$ qshape -s hold | head
                         T  5 10 20 40 80 160 320 640 1280 1280+
                 TOTAL 486  0  0  1  0  0   2   4  20   40   419
             yahoo.com  14  0  0  1  0  0   0   0   1    0    12
  extremepricecuts.net  13  0  0  0  0  0   0   0   2    0    11
        ms35.hinet.net  12  0  0  0  0  0   0   0   0    1    11
      winnersdaily.net  12  0  0  0  0  0   0   0   2    0    10
           hotmail.com  11  0  0  0  0  0   0   0   0    1    10
           worldnet.fr   6  0  0  0  0  0   0   0   0    0     6
        ms41.hinet.net   6  0  0  0  0  0   0   0   0    0     6
                osn.de   5  0  0  0  0  0   1   0   0    0     4

The "T" column shows the total (in this case sender) count for each domain. The columns with numbers above them show counts for messages aged fewer than that many minutes, but not younger than the age limit for the previous column. The row labeled "TOTAL" shows the total count for all domains.
In this example, there are 14 messages allegedly from yahoo.com: 1 between 10 and 20 minutes old, 1 between 320 and 640 minutes old, and 12 older than 1280 minutes (there are 1440 minutes in a day).
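The column rule described above can be sketched in a few lines of shell. This is only an illustration of the doubling bucket limits, not code taken from qshape itself; the function name qshape_bucket is hypothetical, and the sketch assumes the default layout shown in the table (first bucket 5 minutes, nine numbered columns, one overflow column) and whole-minute ages.

```shell
# Hypothetical helper (not part of qshape): print the column label that a
# message of the given age, in whole minutes, would be counted under.
# Assumes the default layout: first bucket 5 minutes, each later bucket
# limit twice the previous one, plus a final overflow column.
qshape_bucket() {
    age=$1
    limit=5
    last=$limit
    i=1
    while [ "$i" -le 9 ]; do            # nine numbered columns: 5 .. 1280
        if [ "$age" -lt "$limit" ]; then
            echo "$limit"
            return 0
        fi
        last=$limit
        limit=$((limit * 2))
        i=$((i + 1))
    done
    echo "${last}+"                     # at or beyond the last limit
}

qshape_bucket 15      # a 15-minute-old message
qshape_bucket 1440    # a day-old message
```

The first call prints 20 and the second prints 1280+, matching the way the yahoo.com row is read in the text.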

When the output is a terminal, intermediate results showing the top 20 domains (-n option) are displayed after every 1000 messages (-N option), and the final output also shows only the top 20 domains. This makes qshape useful even when the deferred queue is very large and it may otherwise take prohibitively long to read the entire deferred queue.

By default, qshape shows statistics for the union of both the incoming and active queues, which are the most relevant queues to look at when analyzing performance.

One can request an alternate list of queues:

$ qshape deferred
$ qshape incoming active deferred

This will show the age distribution of the deferred queue, or of the union of the incoming, active and deferred queues.

Command line options control the number of display "buckets", the age limit for the smallest bucket, display of parent domain counts and so on. The "-h" option outputs a summary of the available switches.

Troubleshooting with qshape

Large numbers in the qshape output represent a large number of messages that are destined to (or alleged to come from) a particular domain. It should be possible to tell at a glance which domains dominate the queue sender or recipient counts, approximately when a burst of mail started, and when it stopped.

The problem destinations or sender domains appear near the top left corner of the output table. Remember that the active queue can accommodate up to 20000 ($qmgr_message_active_limit) messages. To check whether this limit has been reached, use:

$ qshape -s active

(show sender statistics)

If the total sender count is below 20000, the active queue is not yet saturated; any high volume sender domains show near the top of the output.

With oqmgr(8) the active queue is also limited to at most 20000 recipient addresses ($qmgr_message_recipient_limit). To check for exhaustion of this limit use:

$ qshape active

(show recipient statistics)
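The two limit checks above can be scripted roughly as follows. This is a sketch under stated assumptions: the helper names qshape_total and check_limit are hypothetical, and the sketch relies only on the layout shown in this article, where the TOTAL row carries the overall count in its second field. The demo feeds it two rows copied from the "hold" queue example rather than live output.

```shell
# Hypothetical helper: print the T (total) value from the TOTAL row of
# qshape output read on standard input.
qshape_total() {
    awk '$1 == "TOTAL" { print $2; exit }'
}

# Hypothetical helper: report whether COUNT has reached LIMIT.
# Usage: check_limit COUNT LIMIT
check_limit() {
    if [ "$1" -ge "$2" ]; then
        echo "limit reached"
    else
        echo "below limit"
    fi
}

# Demo on two rows copied from the "hold" queue example in this article.
sample='                         T  5 10 20 40 80 160 320 640 1280 1280+
                 TOTAL 486  0  0  1  0  0   2   4  20   40   419'
total=$(printf '%s\n' "$sample" | qshape_total)
check_limit "$total" 20000
```

On a live system the same helpers could be fed from "qshape -s active" (sender count) or "qshape active" (recipient count), with the limit read via "postconf -h qmgr_message_active_limit" or "postconf -h qmgr_message_recipient_limit" instead of the hard-coded 20000.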

Having found the high volume domains, it is often useful to search the logs for recent messages pertaining to the domains in question.

# Find deliveries to example.com
#
$ tail -10000 /var/log/maillog |
        egrep -i ': to=<.*@example\.com>,' |
        less

# Find messages from example.com
#
$ tail -10000 /var/log/maillog |
        egrep -i ': from=<.*@example\.com>,' |
        less

You may want to drill in on some specific queue ids:

# Find all messages for a specific queue id.
#
$ tail -10000 /var/log/maillog | egrep ': 2B2173FF68: '

Also look for queue manager warning messages in the log. These warnings can suggest strategies to reduce congestion.

$ egrep 'qmgr.*(panic|fatal|error|warning):' /var/log/maillog
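The log searches above can be taken one step further by tallying delivery outcomes for a domain. The sketch below is an assumption-laden illustration: the function name status_counts is hypothetical, and it assumes the usual Postfix delivery log lines that contain both "to=<user@domain>," and "status=...". Log formats differ between systems, so treat it as a starting point.

```shell
# Hypothetical helper: count delivery statuses (sent, deferred, bounced, ...)
# for recipients in DOMAIN, reading the given log file.
# Usage: status_counts DOMAIN LOGFILE
status_counts() {
    domain=$1
    logfile=$2
    # Escape dots so that they match literally in the regular expression.
    pattern=$(printf '%s' "$domain" | sed 's/\./\\./g')
    grep -E -i ": to=<[^>]*@${pattern}>," "$logfile" |
        sed -n 's/.* status=\([a-z]*\).*/\1/p' |
        sort | uniq -c | awk '{ print $2, $1 }'
}

# Example call on a live system (not run here):
#   status_counts example.com /var/log/maillog
```

Each output line holds a status name followed by its count. A large "deferred" count next to a small "sent" count for one domain points at the same destinations that stand out in the qshape table.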

When all else fails, try the Postfix mailing list for help, but please don't forget to include the top 10 or 20 lines of qshape(1) output.

Example 1: Healthy queue

When looking at just the incoming and active queues, under normal conditions (no congestion) the incoming and active queues are nearly empty. Mail leaves the system almost as quickly as it comes in or is deferred without congestion in the active queue.

$ qshape    (show incoming and active queue status)

                 T  5 10 20 40 80 160 320 640 1280 1280+
          TOTAL  5  0  0  0  1  0   0   0   1    1     2
  meri.uwasa.fi  5  0  0  0  1  0   0   0   1    1     2

If one looks at the two queues separately, the incoming queue is empty or perhaps briefly has one or two messages, while the active queue holds more messages and for a somewhat longer time:

$ qshape incoming
                 T  5 10 20 40 80 160 320 640 1280 1280+
          TOTAL  0  0  0  0  0  0   0   0   0    0     0

$ qshape active
                 T  5 10 20 40 80 160 320 640 1280 1280+
          TOTAL  5  0  0  0  1  0   0   0   1    1     2
  meri.uwasa.fi  5  0  0  0  1  0   0   0   1    1     2

Example 2: Deferred queue full of dictionary attack bounces

This is from a server where recipient validation is not yet available for some of the hosted domains. Dictionary attacks on the unvalidated domains result in bounce backscatter. The bounces dominate the queue, but with proper tuning they do not saturate the incoming or active queues. The high volume of deferred mail is not a direct cause for alarm.

$ qshape deferred | head
                          T  5 10 20 40 80 160 320 640 1280 1280+
                 TOTAL 2234  4  2  5  9 31  57 108 201  464  1353
   heyhihellothere.com  207  0  0  1  1  6   6   8  25   68    92
   pleazerzoneprod.com  105  0  0  0  0  0   0   0   5   44    56
        groups.msn.com   63  2  1  2  4  4  14  14  14    8     0
     orion.toppoint.de   49  0  0  0  1  0   2   4   3   16    23
           kali.com.cn   46  0  0  0  0  1   0   2   6   12    25
         meri.uwasa.fi   44  0  0  0  0  1   0   2   8   11    22
     gjr.paknet.com.pk   43  1  0  0  1  1   3   3   6   12    16
  aristotle.algonet.se   41  0  0  0  0  0   1   2  11   12    15

The domains shown are mostly bulk-mailers and all the volume is at the tail end of the time distribution, showing that short term arrival rates are moderate. Larger numbers and lower message ages are more indicative of current trouble. Old mail still going nowhere is largely harmless so long as the active and incoming queues are short. We can also see that the groups.msn.com undeliverables are a low rate steady stream rather than a concentrated dictionary attack that is now over.

$ qshape -s deferred | head
                   T  5 10 20 40 80 160 320 640 1280 1280+
          TOTAL 2193  4  4  5  8 33  56 104 205  465  1309
  MAILER-DAEMON 1709  4  4  5  8 33  55 101 198  452   849
    example.com  263  0  0  0  0  0   0   0   0    2   261
    example.org  209  0  0  0  0  0   1   3   6   11   188
    example.net    6  0  0  0  0  0   0   0   0    0     6
    example.edu    3  0  0  0  0  0   0   0   0    0     3
    example.gov    2  0  0  0  0  0   0   0   1    0     1
    example.mil    1  0  0  0  0  0   0   0   0    0     1

Looking at the sender distribution, we see that, as expected, most of the messages are bounces.

Example 3: Congestion in the active queue

This example is taken from a Feb 2004 discussion on the Postfix Users list. Congestion was reported with the active and incoming queues large and not shrinking despite very large delivery agent process limits. The thread is archived at: http://groups.google.com/groups?threadm=c0b7js$2r65$1@FreeBSD.csie.NCTU.edu.tw and http://archives.neohapsis.com/archives/postfix/2004-02/thread.html#1371

Using an older version of qshape(1) it was quickly determined that all the messages were for just a few destinations:

$ qshape    (show incoming and active queue status)

                            T    A  5 10 20 40 80 160 320 320+
                  TOTAL 11775 9996  0  0  1  1 42  94 221 1420
   user.sourceforge.net  7678 7678  0  0  0  0  0   0   0    0
  lists.sourceforge.net  2313 2313  0  0  0  0  0   0   0    0
         gzd.gotdns.com   102    0  0  0  0  0  0   0   2  100

The "A" column showed the count of messages in the active queue, and the numbered columns showed totals for the deferred queue. At 10000 messages (Postfix 1.x active queue size limit) the active queue is full. The incoming queue was growing rapidly.

With the trouble destinations clearly identified, the administrator quickly found and fixed the problem. It is substantially harder to glean the same information from the logs. While a careful reading of mailq(1) output should yield similar results, it is much harder to gauge the magnitude of the problem by looking at the queue one message at a time.

Example 4: High volume destination backlog

When a site you send a lot of email to is down or slow, mail messages will rapidly build up in the deferred queue, or worse, in the active queue. The qshape output will show large numbers for the destination domain in all age buckets that overlap the starting time of the problem:

$ qshape deferred | head
                    T   5  10  20  40   80  160 320 640 1280 1280+
           TOTAL 5000 200 200 400 800 1600 1000 200 200  200   200
  highvolume.com 4000 160 160 320 640 1280 1440   0   0    0     0
             ...

Here the "highvolume.com" destination is continuing to accumulate deferred mail. The incoming and active queues are fine, but the deferred queue started growing some time between 1 and 2 hours ago and continues to grow.
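Spotting a destination that dominates the queue, as in the output above, can be automated with a small filter. This is a sketch only: the function name dominant_domains is hypothetical, and it assumes the qshape layout shown in this article (a header row starting with "T", a TOTAL row, then one row per domain with the total in the second field).

```shell
# Hypothetical helper: read qshape output on standard input and print each
# domain whose share of the TOTAL count is at least MIN_PERCENT, followed
# by that share as a whole percentage.
# Usage: qshape deferred | dominant_domains MIN_PERCENT
dominant_domains() {
    awk -v min="$1" '
        $1 == "T"     { next }                 # header row
        $1 == "TOTAL" { total = $2; next }     # remember the overall count
        total > 0 && NF >= 2 && 100 * $2 / total >= min {
            printf "%s %d\n", $1, 100 * $2 / total
        }
    '
}
```

Fed the table above with a threshold of 50, it would print "highvolume.com 80", since 4000 of the 5000 deferred messages are for that destination.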

If the high volume destination is not down, but is instead slow, one might see similar congestion in the active queue. Active queue congestion is a greater cause for alarm; one might need to take measures to ensure that the mail is deferred instead or even add an access(5) rule asking the sender to try again later.

If a high volume destination exhibits frequent bursts of consecutive connections refused by all MX hosts or "421 Server busy" errors, it is possible for the queue manager to mark the destination as "dead" despite the transient nature of the errors. The destination will be retried again after the expiration of a $minimal_backoff_time timer. If the error bursts are frequent enough it may be that only a small quantity of email is delivered before the destination is again marked "dead". In some cases enabling static (not on demand) connection caching by listing the appropriate nexthop domain in a table included in "smtp_connection_cache_destinations" may help to reduce the error rate, because most messages will re-use existing connections.

The MTA that has been observed most frequently to exhibit such bursts of errors is Microsoft Exchange, which refuses connections under load. Some proxy virus scanners in front of the Exchange server propagate the refused connection to the client as a "421" error.

Note that it is now possible to configure Postfix to exhibit similarly erratic behavior by misconfiguring the anvil(8) service. Do not use anvil(8) for steady-state rate limiting; its purpose is (unintentional) DoS prevention and the rate limits set should be very generous!

If one finds oneself needing to deliver a high volume of mail to a destination that exhibits frequent brief bursts of errors and connection caching does not solve the problem, there is a subtle workaround.

Postfix version 2.5 and later:

In master.cf set up a dedicated clone of the "smtp" transport for the destination in question. In the example below we will call it "fragile".
In master.cf configure a reasonable process limit for the cloned smtp transport (a number in the 10-20 range is typical).
IMPORTANT!!! In main.cf configure a large per-destination pseudo-cohort failure limit for the cloned smtp transport.

    /etc/postfix/main.cf:
        transport_maps = hash:/etc/postfix/transport
        fragile_destination_concurrency_failed_cohort_limit = 100
        fragile_destination_concurrency_limit = 20

    /etc/postfix/transport:
        example.com  fragile:

    /etc/postfix/master.cf:
        # service type  private unpriv  chroot  wakeup  maxproc command
        fragile   unix  -       -       n       -       20      smtp

See also the documentation for default_destination_concurrency_failed_cohort_limit and default_destination_concurrency_limit.

Earlier Postfix versions:

In master.cf set up a dedicated clone of the "smtp" transport for the destination in question. In the example below we will call it "fragile".
In master.cf configure a reasonable process limit for the transport (a number in the 10-20 range is typical).
IMPORTANT!!! In main.cf configure a very large initial and destination concurrency limit for this transport (say 2000).

    /etc/postfix/main.cf:
        transport_maps = hash:/etc/postfix/transport
        initial_destination_concurrency = 2000
        fragile_destination_concurrency_limit = 2000

    /etc/postfix/transport:
        example.com  fragile:

    /etc/postfix/master.cf:
        # service type  private unpriv  chroot  wakeup  maxproc command
        fragile   unix  -       -       n       -       20      smtp

See also the documentation for default_destination_concurrency_limit.

The effect of this configuration is that up to 2000 consecutive errors are tolerated without marking the destination dead, while the total concurrency remains reasonable (10-20 processes). This trick is only for a very specialized situation: high volume delivery into a channel with multi-error bursts that is capable of high throughput, but is repeatedly throttled by the bursts of errors.

When a destination is unable to handle the load even after the Postfix process limit is reduced to 1, a desperate measure is to insert brief delays between delivery attempts.

Postfix version 2.5 and later:

In master.cf set up a dedicated clone of the "smtp" transport for the problem destination. In the example below we call it "slow".
In main.cf configure a short delay between deliveries to the same destination.

    /etc/postfix/main.cf:
        transport_maps = hash:/etc/postfix/transport
        slow_destination_rate_delay = 1
        slow_destination_concurrency_failed_cohort_limit = 100

    /etc/postfix/transport:
        example.com  slow:

    /etc/postfix/master.cf:
        # service type  private unpriv  chroot  wakeup  maxproc command
        slow      unix  -       -       n       -       -       smtp

See also the documentation for default_destination_rate_delay.

This solution forces the Postfix smtp(8) client to wait for $slow_destination_rate_delay seconds between deliveries to the same destination.

IMPORTANT!! The large slow_destination_concurrency_failed_cohort_limit value is needed. This prevents Postfix from deferring all mail for the same destination after only one connection or handshake error (the reason for this is that non-zero slow_destination_rate_delay forces a per-destination concurrency of 1).
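Because a non-zero rate delay forces a per-destination concurrency of 1, the delay also puts a floor under the time needed to clear a backlog. The arithmetic can be sketched as below; the function name min_drain_seconds is hypothetical, and the estimate assumes one message per delivery and ignores the time each delivery itself takes, so real drain times will be longer.

```shell
# Hypothetical helper: lower bound, in seconds, on the time needed to work
# through BACKLOG messages for one destination when deliveries are spaced
# DELAY seconds apart at a concurrency of 1.
# Usage: min_drain_seconds BACKLOG DELAY
min_drain_seconds() {
    backlog=$1
    delay=$2
    echo $((backlog * delay))
}

min_drain_seconds 4000 1    # 4000 queued messages, 1 second delay
```

The call prints 4000, that is, more than an hour for a 4000-message backlog at the 1 second delay used in the example configuration. This is worth keeping in mind before applying the delay to a destination that receives a lot of mail.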

Earlier Postfix versions:

In the transport map entry for the problem destination, specify a dead host as the primary nexthop.
In the master.cf entry for the transport specify the problem destination as the fallback_relay and specify a small smtp_connect_timeout value.

    /etc/postfix/main.cf:
        transport_maps = hash:/etc/postfix/transport

    /etc/postfix/transport:
        example.com  slow:[dead.host]

    /etc/postfix/master.cf:
        # service type  private unpriv  chroot  wakeup  maxproc command
        slow      unix  -       -       n       -       1       smtp
            -o fallback_relay=problem.example.com
            -o smtp_connect_timeout=1
            -o smtp_connection_cache_on_demand=no

This solution forces the Postfix smtp(8) client to wait for $smtp_connect_timeout seconds between deliveries. The connection caching feature is disabled to prevent the client from skipping over the dead host.

Postfix queue directories

The following sections describe Postfix queues: their purpose, what normal behavior looks like, and how to diagnose abnormal behavior.

The "maildrop" queue

Messages that have been submitted via the Postfix sendmail(1) command, but not yet brought into the main Postfix queue by the pickup(8) service, await processing in the "maildrop" queue. Messages can be added to the "maildrop" queue even when the Postfix system is not running. They will begin to be processed once Postfix is started.

The "maildrop" queue is drained by the single threaded pickup(8) service scanning the queue directory periodically or when notified of new message arrival by the postdrop(1) program. The postdrop(1) program is a setgid helper that allows the unprivileged Postfix sendmail(1) program to inject mail into the "maildrop" queue and to notify the pickup(8) service of its arrival.

All mail that enters the main Postfix queue does so via the cleanup(8) service. The cleanup service is responsible for envelope and header rewriting, header and body regular expression checks, automatic bcc recipient processing, milter content processing, and reliable insertion of the message into the Postfix "incoming" queue.

In the absence of excessive CPU consumption in cleanup(8) header or body regular expression checks or other software consuming all available CPU resources, Postfix performance is disk I/O bound. The rate at which the pickup(8) service can inject messages into the queue is largely determined by disk access times, since the cleanup(8) service must commit the message to stable storage before returning success. The same is true of the postdrop(1) program writing the message to the "maildrop" directory.

As the pickup service is single threaded, it can only deliver one message at a time at a rate that does not exceed the reciprocal disk I/O latency (+ CPU if not negligible) of the cleanup service.
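The "reciprocal of the latency" bound above is easy to put into numbers. The sketch below is illustrative only: the function name max_pickup_rate is hypothetical, and the 10 ms figure in the demo is an assumed per-message commit latency, not a measurement from any system.

```shell
# Hypothetical helper: upper bound on messages per second that a single
# threaded service can handle when each message costs LATENCY_MS
# milliseconds (must be greater than zero) of disk I/O plus CPU time.
# Usage: max_pickup_rate LATENCY_MS
max_pickup_rate() {
    awk -v ms="$1" 'BEGIN { printf "%d\n", 1000 / ms }'
}

max_pickup_rate 10    # assumed 10 ms per message
```

The call prints 100: at an assumed 10 ms per message, the pickup(8) service cannot inject more than about 100 messages per second, however many delivery agents are available downstream.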

Congestion in this queue is indicative of an excessive local message submission rate or perhaps excessive CPU consumption in the cleanup(8) service due to excessive body_checks, or (Postfix ≥ 2.3) high latency milters.

Note that once the active queue is full, the cleanup service will attempt to slow down message injection by pausing $in_flow_delay for each message. In this case "maildrop" queue congestion may be a consequence of congestion downstream, rather than a problem in its own right.

Note: you should not attempt to deliver large volumes of mail via the pickup(8) service. High volume sites should avoid using "simple" content filters that re-inject scanned mail via Postfix sendmail(1) and postdrop(1).

A high arrival rate of locally submitted mail may be an indication of an uncaught forwarding loop, or a run-away notification program. Try to keep the volume of local mail injection to a moderate level.

The "postsuper -r" command can place selected messages into the "maildrop" queue for reprocessing. This is most useful for resetting any stale content_filter settings. Requeuing a large number of messages using "postsuper -r" can clearly cause a spike in the size of the "maildrop" queue.
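To watch for the kind of "maildrop" spike described above, one can simply count the files in the queue directory. The sketch below makes several assumptions: the function name count_queue_files is hypothetical, each queued message is taken to be one regular file under the directory, and reading the real "maildrop" directory normally requires root privileges.

```shell
# Hypothetical helper: count regular files under a queue directory.
# Usage: count_queue_files DIRECTORY
count_queue_files() {
    find "$1" -type f | wc -l | tr -d ' '
}

# Example call on a live system (not run here), using the configured
# queue directory rather than a hard-coded path:
#   count_queue_files "$(postconf -h queue_directory)/maildrop"
```

Sampling this count every few seconds shows whether the queue is draining or growing. A count that keeps rising while local submission is moderate suggests looking downstream, at the cleanup(8) service or a full active queue.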