[Mailman-Developers] Faulty Member Subscribe/Unsubscribes

Mark Sapiro mark at msapiro.net
Thu Sep 29 18:44:03 CEST 2011


On 9/28/2011 11:52 PM, Andrew Case wrote:
> Thanks Mark, see inline comments.
> 
>>> [mailman at myhost] ~/logs |> grep testlist subscribe | grep acase
>>> Sep 28 17:15:14 2011 (4401) testlist: new acase at example.com, admin mass sub
>>> Sep 28 17:19:36 2011 (5821) testlist: deleted acase at example.com; member mgt page
>>> [mailman at myhost] ~/logs |> ../bin/list_members testlist | grep acase
>>> acase at example.com
>>> [mailman at myhost] ~/logs |>
>>
>>
>> There is a bug in the Mailman 2.1 branch, but the above is not it. The
>> above log shows that acase at example.com was added by admin mass subscribe
>> at 17:15:14 and then, a bit more than 4 minutes later, was removed by
>> checking the unsub box on the admin Membership List and submitting.
> 
> I was trying to show that even after the user was removed, they're still
> listed as a member.


Sorry, I missed that. You are correct and this does appear to be a
manifestation of the same issue.


[...]
>> The thread you point to above is relevant, but it is not a locking
>> issue. The problem is due to list caching in Mailman/Queue/Runner.py
>> and/or nearly concurrent processes which first load the list unlocked
>> and later lock it. The issue is that the resolution of the config.pck
>> timestamp is 1 second, and if a process has a list object and that list
>> object is updated by another process within the same second as the
>> timestamp on the first process's object, the first process won't load
>> the updated list when it locks it. This can result in things like a
>> subscribe being done and logged and then silently reversed.
> 
> The result sounds the same, but would this happen even if I'm loading the
> page with more than a second in between each step outlined above?


It is tricky. Each add_members, remove_members and web CGI post is a
separate process. If these processes run sequentially, there should not
be any problem, because each process will load the list, lock it,
update it and save it before the next process loads it.

The problem occurs when processes run concurrently. The scenario is
this: process A loads the list unlocked; process B locks the list and
updates it; process A tries to lock the list and gets the lock after
process B relinquishes it. If the timestamp on the config.pck from
process B's update is in the same second as the timestamp of process
A's initial load, process A thinks the list hasn't been updated and
doesn't reload it after obtaining the lock. Thus, when process A saves
the list, process B's changes are reversed.
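
To make the timing concrete, here is a minimal sketch of a whole-second
timestamp check of the kind described above. It is illustrative only --
the class and method names below are mine, not Mailman's MailList code
-- but it shows how an update saved in the same second as the reader's
last load goes unnoticed:

import os
import pickle

# Illustrative sketch only -- mimics the shape of the problem, not
# Mailman's actual MailList.Load()/Lock() implementation.

class CachedList:
    def __init__(self, path):
        self.path = path
        self.data = None
        self.timestamp = 0      # whole seconds, like the config.pck check

    def load(self):
        fp = open(self.path, 'rb')
        self.data = pickle.load(fp)
        fp.close()
        # The recorded timestamp has one-second resolution.
        self.timestamp = int(os.path.getmtime(self.path))

    def refresh_if_stale(self):
        # The flawed check: if another process saved the file within
        # the same second as our last load, mtime == self.timestamp
        # and we keep our stale in-memory copy.
        if int(os.path.getmtime(self.path)) > self.timestamp:
            self.load()

    def save(self):
        fp = open(self.path, 'wb')
        pickle.dump(self.data, fp)
        fp.close()
        self.timestamp = int(os.path.getmtime(self.path))

Process A load()s the list, process B save()s a change within the same
wall-clock second, and process A's refresh_if_stale() after taking the
lock sees equal timestamps, so A's later save() writes its stale data
back over B's change.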

This is complicated by list caching in the qrunners: each qrunner may
have a cached copy of the list, so it can act as process A in the above
scenario, with its cached copy playing the role of the initially loaded
list. To complicate things further, the qrunners get involved even in
the simple scenario with sequential commands, because add_members,
remove_members and the CGIs result in notices being sent, and the
qrunner processes that send those notices run concurrently. This is why
the stress test will fail even though the commands are run
sequentially.
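
The qrunner side has the same shape. Again as a sketch only (the names
below are mine, not Runner.py's API, and CachedList is the class from
the previous sketch): a per-process cache keyed by list name hands back
the same object, carrying whatever timestamp it had when it was last
loaded, so the cached copy plays process A's part in the race above:

# Sketch of per-process list caching in the spirit described above;
# nothing here is Mailman's real API.

_listcache = {}

def open_list(name, path):
    mlist = _listcache.get(name)
    if mlist is None:
        mlist = CachedList(path)
        mlist.load()
        _listcache[name] = mlist
    return mlist                # may have been loaded long ago

def send_notice(name, path):
    mlist = open_list(name, path)
    # ... acquire the list lock here ...
    mlist.refresh_if_stale()    # misses an update saved in the same second
    # ... build and queue the notice ...
    mlist.save()                # can silently undo that update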


[...]
> I applied the patch but it doesn't seem to have made a difference.


As you later report, restarting the qrunners did seem to fix it.


[...]
>> The post at
>> <->
>> contains a "stress test" that will probably reproduce the problem.
> 
> Correct.  Only one subscriber was subscribed to each test list.  Keep in
> mind that in the stress test given, if you use a sleep counter of 5 with 6
> lists, that means you're waiting _30 seconds_ before the next add_members
> command is run for that list (I'm assuming the timing issue is per-list, not
> per run of add_members).  Even if you set the timer down to 1, that's a
> 6-second sleep.  This shouldn't affect a cache that we're comparing for the
> given second.  Anyway, my script ran fine with the 5-second sleep (30
> seconds per list add), but showed discrepancies with a 3-second sleep.


So you are adding 'sleep' commands after each add_members? I'm not sure
what you're doing. Is there a different test elsewhere in the thread?

I have used a couple of tests, as attached. They are the same except
for list order and are very similar to the one in the original thread.
Note that they contain only one sleep, after all the add_members, just
to allow things to settle before running list_members.
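
For reference, the general shape of those tests is: run add_members
against several test lists back to back, sleep once at the end, and
then compare list_members output against what was added. A rough
Python 2 equivalent (the bin/ path, list names and address below are
placeholders, and the attached scripts themselves are ksh, so this is
only a sketch of the idea):

import subprocess
import time

MAILMAN_BIN = '/usr/local/mailman/bin'  # placeholder; adjust to your install
LISTS = ['testlist1', 'testlist2', 'testlist3',
         'testlist4', 'testlist5', 'testlist6']
ADDRESS = 'acase@example.com'           # placeholder test address

# Add the address to each list in quick succession via bin/add_members,
# reading the new member from stdin ('-r -').
for name in LISTS:
    p = subprocess.Popen([MAILMAN_BIN + '/add_members', '-r', '-', name],
                         stdin=subprocess.PIPE)
    p.communicate(ADDRESS + '\n')

# One sleep after all the add_members, just to let the qrunners settle.
time.sleep(5)

# Check that the address really ended up on every list.
for name in LISTS:
    p = subprocess.Popen([MAILMAN_BIN + '/list_members', name],
                         stdout=subprocess.PIPE)
    members = p.communicate()[0].split()
    if ADDRESS not in members:
        print('%s is MISSING from %s' % (ADDRESS, name))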


>> I suspect your Mailman server must be very busy for you to see this bug
>> that frequently. However, it looks like I need to install the fix for
>> Mailman 2.1.15.


Actually, I don't think the issue is a busy server. I think it is more
likely that NFS causes timing issues among add_members, VirginRunner
and OutgoingRunner that just make the bug more likely to trigger.

-- 
Mark Sapiro <mark at msapiro.net>        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: list_cache_stress_test
URL: <http://mail.python.org/pipermail/mailman-developers/attachments/20110929/fcc4222a/attachment-0002.ksh>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: list_cache_stress_test_2
URL: <http://mail.python.org/pipermail/mailman-developers/attachments/20110929/fcc4222a/attachment-0003.ksh>

