[Mailman-Users] Stuck OutgoingRunner

Sebastian Hagedorn Hagedorn at uni-koeln.de
Fri Mar 16 07:37:04 EDT 2018


It happened again yesterday. Details below.

--On 7 February 2018 at 12:43:18 +0900 Yasuhito FUTATSUKI 
<futatuki at poem.co.jp> wrote:

> In fact,
>
> On 02/02/18 19:26, Sebastian Hagedorn wrote:
>> [root at mailman3/usr/lib/mailman/bin]$ strace -p 1677
>> Process 1677 attached
>> recvfrom(10, ^CProcess 1677 detached
>
> indicates the OutGoingRunner process 1677 was still in recvfrom(2)
> system call (perhaps called from recv(2)) for FD 10, and
>
>> [root at mailman3/usr/lib/mailman/bin]$ lsof -p 1677
>> COMMAND    PID    USER   FD   TYPE   DEVICE SIZE/OFF   NODE NAME
>> python2.7 1677 mailman  cwd    DIR    253,0     4096 173998 /usr/lib/mailman
>> python2.7 1677 mailman  rtd    DIR    253,0     4096      2 /
>> ...
>> python2.7 1677 mailman   10u  IPv6 46441320      0t0    TCP mailman3.rrz.uni-koeln.de:55764->smtp-out.rrz.uni-koeln.de:smtp (ESTABLISHED)
>
> indicates its FD 10 was ESTABLISHED connection to the MTA.

The situation was exactly the same as before. This time we confirmed on the 
MTA that no trace of that connection remained. At the time of the incident 
the MTA was once again under high load and delaying commands, which 
definitely seems to be a contributing factor. We didn't find any evidence 
that the MTA had dropped a connection, but with four OutgoingRunners 
we couldn't determine which transaction belonged to which runner.
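
One possible way to tie a runner to an MTA-side transaction is the client's 
source port from the lsof NAME column, assuming the MTA logs the remote port 
of each connection (we have not verified that ours does). A minimal sketch; 
the helper name parse_lsof_tcp_name is made up for illustration:

```python
import re

# Matches the NAME column of an lsof TCP row, for example
# "client.example:55764->mta.example:smtp (ESTABLISHED)".
_NAME_RE = re.compile(
    r"(?P<lhost>\S+):(?P<lport>\d+)->(?P<rhost>\S+):(?P<rport>\S+)"
    r"(?:\s+\((?P<state>\w+)\))?"
)


def parse_lsof_tcp_name(name):
    """Split an lsof TCP NAME field into its parts, or return None."""
    m = _NAME_RE.search(name)
    if m is None:
        return None
    return {
        "local_host": m.group("lhost"),
        "local_port": int(m.group("lport")),  # the port to look for in MTA logs
        "remote_host": m.group("rhost"),
        "remote_port": m.group("rport"),      # may be a service name like "smtp"
        "state": m.group("state"),            # None if lsof printed no state
    }


info = parse_lsof_tcp_name(
    "mailman3.rrz.uni-koeln.de:55764->smtp-out.rrz.uni-koeln.de:smtp (ESTABLISHED)"
)
print(info["local_port"], info["state"])
```

Running lsof with -P would print numeric ports instead of service names, 
which may make matching against logs easier.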

> If the MTA hangs (or makes very slow progress) at the application layer while
> keeping the TCP connection alive at the lower layer, a client that uses smtplib
> without specifying a timeout, like the current SMTPDirect handler in Mailman,
> must wait until the MTA either responds or dies.
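
To illustrate the point about timeouts, here is a minimal sketch of smtplib 
talking to a peer that accepts the TCP connection but never sends a greeting. 
It is not Mailman's SMTPDirect code; start_silent_server and try_connect are 
hypothetical helpers, and the silent server is only a stand-in for an 
overloaded MTA:

```python
import smtplib
import socket
import threading


def start_silent_server():
    """Accept one TCP connection on localhost and then say nothing.

    Stands in for an MTA that holds the connection open but never sends
    its SMTP greeting. Returns (listening_socket, accepted_sockets).
    """
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    srv.listen(1)
    accepted = []  # keep a reference so the connection stays open

    def accept_one():
        try:
            conn, _ = srv.accept()
            accepted.append(conn)
        except socket.error:
            pass  # listening socket was closed before anyone connected

    t = threading.Thread(target=accept_one)
    t.daemon = True
    t.start()
    return srv, accepted


def try_connect(host, port, timeout):
    """Return "ok" if the greeting arrives, else a short failure description."""
    try:
        conn = smtplib.SMTP(host, port, timeout=timeout)
    except (smtplib.SMTPServerDisconnected, socket.timeout) as exc:
        # smtplib wraps a read timeout in SMTPServerDisconnected; a timeout
        # during the TCP connect itself surfaces as socket.timeout.
        return "gave up: %s" % exc
    conn.quit()
    return "ok"


if __name__ == "__main__":
    srv, accepted = start_silent_server()
    print(try_connect("127.0.0.1", srv.getsockname()[1], timeout=2.0))
    srv.close()
```

Without the timeout argument (and with no global default set via 
socket.setdefaulttimeout), the same call would block in the read for as long 
as the peer stays silent.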

If I understood Mark correctly, when the MTA dropped the connection, that 
should have raised socket.error regardless of timeouts. The question is why 
it didn't. I suppose that could be a bug either in our version of the 
Python libraries or in the OS. Any ideas on how we should proceed to determine 
the root cause?
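
For reference, a blocking recv() only returns once data, an end-of-file, or 
an error reaches the client's kernel. If the peer closes cleanly the client 
sees EOF, but if nothing arrives at all the call stays blocked unless a 
timeout is set. The sketch below shows the two cases with a local socketpair 
rather than a real TCP connection to an MTA; read_reply is a made-up helper, 
and whether the second case is what happened to us is only a guess:

```python
import socket


def read_reply(sock, timeout):
    """Classify what one recv() on sock does within timeout seconds."""
    sock.settimeout(timeout)
    try:
        data = sock.recv(1024)
    except socket.timeout:
        return "timeout"            # peer still connected but silent
    except socket.error as exc:
        return "error: %s" % exc    # for example a connection reset
    if not data:
        return "eof"                # peer closed the connection cleanly
    return "data"


# Case 1: the peer closes its end, so recv() returns an empty string at once.
client, server = socket.socketpair()
server.close()
print(read_reply(client, 0.2))
client.close()

# Case 2: the peer stays open and silent, so only the timeout ends the wait.
client, server = socket.socketpair()
print(read_reply(client, 0.2))
client.close()
server.close()
```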
-- 
    .:.Sebastian Hagedorn - Weyertal 121 (Gebäude 133), Zimmer 2.02.:.
                 .:.Regionales Rechenzentrum (RRZK).:.
   .:.Universität zu Köln / Cologne University - ✆ +49-221-470-89578.:.