[Mailman-Developers] Faulty Member Subscribe/Unsubscribes

Andrew Case acase at cims.nyu.edu
Thu Sep 29 10:13:46 CEST 2011


Mark,

I realized I hadn't restarted my QRunners after this patch.  It looks like
its working perfectly now!  Even with a sleep of 0.  Thanks so much!

--
Drew

On Thu, September 29, 2011 2:52 am, Andrew Case wrote:
> Thanks Mark, see inline comments.
>
>>> [mailman at myhost] ~/logs |> grep testlist subscribe | grep acase
>>> Sep 28 17:15:14 2011 (4401) testlist: new acase at example.com, admin mass
>>> sub
>>> Sep 28 17:19:36 2011 (5821) testlist: deleted acase at example.com; member
>>> mgt page
>>> [mailman at myhost] ~/logs |> ../bin/list_members testlist | grep acase
>>> acase at example.com
>>> [mailman at myhost] ~/logs |>
>>
>>
>> There is a bug in the Mailman 2.1 branch, but the above is not it. The
>> above log shows that acase at example.com was added by admin mass subscribe
>> at 17:15:14 and then a bit more than 4 minutes later, was removed by
>> checking the unsub box on the admin Membership List and submitting.
>
> I was trying to show that even after the user was removed, they're still
> listed as a member.
>
>> If you check your web server logs, you will find POST transactions to
>> the admin page for both these events.
>
> Agreed.
>
>>> The same also happens when subscribing.  I will mass subscribe users
>>> (or
>>> when users confirm subscription via email/web), the logs indicated that
>>> they have been subscribed successfully, but then when I go look them
>>> up,
>>> they are not listed on the members list.
>>>
>>> This happens sporadically, but I am generally able to reproduce the
>>> error
>>> if I do it a couple times in a row.
>>
>>
>> This is possibly a manifestation of the bug, but I'm surprised it is
>> happening that frequently.
>
> Easiest way for me to replicated the problem is:
> 1) check the unsubscribe box for user A then hit submit
> 2) after reload check the unsubscribe box for user B then hit submit
> 3) reaload the "membership list" page and user B is back on the list
>
> This happens even after I wait a couple seconds in between each step.
>
>>> I'm suspicious there may be a locking issue and config.pck is reverting
>>> to
>>> config.pck.last.  I found this thread rather helpful in analyzing
>>> potential problems, but I have yet to figure anything out:
>>>   http://web.archiveorange.com/archive/v/IezAOgEQf7xEYCSEJTbD
>>
>>
>> The thread you point to above is relevant, but it is not a locking
>> issue. The problem is due to list caching in Mailman/Queue/Runner.py
>> and/or nearly concurrent processes which first load the list unlocked
>> and later lock it. The issue is that the resolution of the config.pck
>> timestamp is 1 second, and if a process has a list object and that list
>> object is updated by another process within the same second as the
>> timestamp on the first process's object, the first process won't load
>> the updated list when it locks it. This can result in things like a
>> subscribe being done and logged and then silently reversed.
>
> The result sounds the same, but would this happen even if I'm loading the
> page with more than a second in between each step outlined above?
>
>> List locking is working as it should. The issue is that the first
>> process doesn't reload the updated list when it acquires the lock
>> because it thinks it already has the latest version.
>>
>> I thought I had fixed this on the 2.1 branch, but it seems I only fixed
>> it for the now defunct 2.2 branch.
>>
>> A relevant thread starts at
>> <http://mail.python.org/pipermail/mailman-users/2008-August/062862.html>
>> and continues at
>> <http://mail.python.org/pipermail/mailman-developers/2008-August/020329.html>
>>
>> The patch in the attached cache.patch file should fix it.
>
> I applied the patch but it doesn't seem to have made a difference.
>
>
>>> In addition if I just run the following commands over and over, then
>>> the
>>> bug never seems to come up.  This is part of why I am worrying about
>>> locking:
>>>   bin/add_members ...
>>>   bin/remove_members ...
>>
>>
>> That won't do it. bin/add_members alone will do it, but only if there is
>> a nearly concurrent process updating the same list.
>>
>>
>>> Is there a good way to test locking between servers?  I've run the
>>> tests/test_lockfile.py, but it reports it is OK.
>>>
>>> Any and all help would be GREATLY appreciated.  We've been trying to
>>> triage this bug for weeks and it is terribly disruptive for our users.
>>
>>
>> The post at
>> <http://mail.python.org/pipermail/mailman-users/2008-August/062862.html>
>> contains a "stress test" that will probably reproduce the problem.
>
> Correct.  Only one subscriber was subscribed to each test list.  Keep in
> mind that in the stress test given if you use a sleep counter of 5 with 6
> lists, that means you're waiting _30 seconds_ before the next add_member
> command is run for that list (I'm assume the timing issue is per-list, not
> per run of add_members).  Even if you set the timer down to 1 that's a 6
> second sleep.  This shouldn't effect a cache that we're comparing for the
> given second.  Anyway, my script ran fine with the 5 second sleep (30
> seconds per list add), but showed discrepancies with a 3 second sleep.
>
>> I suspect your Mailman server must be very busy for you to see this bug
>> that frequently. However, it looks like I need to install the fix for
>> Mailman 2.1.15.
>
> We run about 600 different mailing lists for our department and this has
> been a continues headache.  I appreciate all the hard work you guys do.
>
>> It is also curious that the only reports of this that I can recall both
>> come from solaris users. There may be complications in your case due to
>> NFS, but locking shouldn't be the issue. Run the stress test and see if
>> it fails. If it does, try the patch.
>
> Patch didn't seem to help.  Is there an easy way to omit the caching in
> this?
>
> Thanks,
> --
> Drew
>
>>
>> Let us know what happens.
>>
>> --
>> Mark Sapiro <mark at msapiro.net>        The highway is for gamblers,
>> San Francisco Bay Area, California    better use your sense - B. Dylan
>>
>>
>
>


Andrew Case
Systems Administrator
Courant Institute of Mathematical Sciences
New York University
251 Mercer St., Room 1023
New York, NY 10012-1110
Phone: 212-998-3147




More information about the Mailman-Developers mailing list