[Mailman-Developers] Faulty Member Subscribe/Unsubscribes

Andrew Case acase at cims.nyu.edu
Thu Sep 29 19:30:12 CEST 2011


[...]

> It is tricky. Each add_members, remove_members and web CGI post is a
> separate process. If these processes are run sequentially, there should
> not be any problem, because each process will load the list, lock it,
> update it, and save it before the next process loads it.
>
> The problem occurs when processes run concurrently. The scenario is:
> process A loads the list unlocked; process B locks the list and updates
> it; process A tries to lock the list and gets the lock after process B
> relinquishes it; if the timestamp on the config.pck from process B's
> update is in the same second as the timestamp of process A's initial
> load, process A thinks the list hasn't been updated and doesn't reload
> it after obtaining the lock. Thus, when process A saves the list,
> process B's changes are reversed.
>
> This is complicated by list caching in the qrunners because each qrunner
> may have a cached copy of the list, so it can act as process A in the
> above scenario with its cached copy playing the role of the initially
> loaded list. To complicate this further, the qrunners get involved even
> in the simple scenario with sequential commands because add_members,
> remove_members and CGIs result in notices being sent, and the qrunner
> processes that send the notices are running concurrently. This is why
> the stress test will fail even though commands are run sequentially.

Thank you for that explanation.  I was confused about when the qrunners
cache and/or update these config.pck files, and when the
add/remove_members commands do so; there did seem to be some sort of
conflict between the two.
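
To check my understanding, here's a minimal sketch of the kind of
stale-timestamp check that gets fooled (the names are mine, not
Mailman's actual internals):

import os

class CachedList:
    """Illustrative model of one process's in-memory copy of a list.
    It trusts a whole-second timestamp to decide whether to reload."""

    def __init__(self, path):
        self.path = path
        self.data = None
        self.loaded_at = None  # whole seconds, like a coarse st_mtime

    def load(self):
        with open(self.path, 'rb') as f:
            self.data = f.read()
        # int() drops sub-second precision, so two saves within the
        # same second produce identical timestamps.
        self.loaded_at = int(os.stat(self.path).st_mtime)

    def reload_if_changed(self):
        mtime = int(os.stat(self.path).st_mtime)
        if self.loaded_at is None or mtime > self.loaded_at:
            self.load()
        # else: the copy is assumed current.  This is the branch that
        # goes wrong when another process saved in the same second.

So if process A's reload_if_changed() runs after process B's save, and
both timestamps land in the same second, the mtime > loaded_at test is
false, A keeps its stale copy, and A's save clobbers B's changes.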

[...]

>>> The post at
>>> <->
>>> contains a "stress test" that will probably reproduce the problem.
>>
>> Correct.  Only one subscriber was subscribed to each test list.  Keep in
>> mind that in the stress test given, if you use a sleep counter of 5 with
>> 6 lists, that means you're waiting _30 seconds_ before the next
>> add_members command is run for that list (I'm assuming the timing issue
>> is per-list, not per run of add_members).  Even if you set the timer
>> down to 1, that's a 6-second sleep.  This shouldn't affect a cache that
>> we're comparing for the given second.  Anyway, my script ran fine with
>> the 5-second sleep (30 seconds per list add), but showed discrepancies
>> with a 3-second sleep.
>
> So you are adding 'sleep' commands after each add_members?

Yes, I was.  Without a sleep in between add_members calls, it was failing
for ~50% of the calls to add_members.  With a 5-second sleep it would tend
to work most of the time.
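
Roughly, the loop looked like this (a Python rendering of the pattern;
the paths, list names, and batch files here are placeholders, not my
actual script):

import subprocess
import time

MAILMAN_BIN = '/usr/local/mailman/bin'  # placeholder install path
LISTS = ['test1', 'test2', 'test3', 'test4', 'test5', 'test6']
SLEEP = 5  # seconds to sleep after each add_members call

# With 6 lists and a 5-second sleep, any single list only sees a new
# add_members run about every 30 seconds.
for batch in ['batch1.txt', 'batch2.txt']:
    for listname in LISTS:
        subprocess.check_call(
            [MAILMAN_BIN + '/add_members', '-r', batch, listname])
        time.sleep(SLEEP)

# Give the qrunners a moment to settle, then compare each list's
# membership against what was fed in.
time.sleep(30)
for listname in LISTS:
    out = subprocess.check_output(
        [MAILMAN_BIN + '/list_members', listname])
    print(listname, out.decode().split())

Dropping SLEEP to 3 (or removing it entirely) is what made the
discrepancies show up.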

> I'm not sure what you're doing. Is there a different test elsewhere in
> the thread?

See my updated stress test that I sent you in my last email.

> I have used a couple of tests as attached. They are the same except for
> list order and are very similar to the one in the original thread. Note
> that they contain only one sleep after all the add_members just to allow
> things to settle before running list_members.

That makes sense.

>>> I suspect your Mailman server must be very busy for you to see this bug
>>> that frequently. However, it looks like I need to install the fix for
>>> Mailman 2.1.15.
>
> Actually, I don't think the issue is the busy server. I think it is more
> likely that NFS causes timing issues between add_members and
> VirginRunner and OutgoingRunner that just make the bug more likely to
> trigger.

I think you hit the nail on the head here.  It explains a lot.
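
In case it's useful to anyone else chasing this, a quick probe for
timestamp granularity on a given mount (a hypothetical helper, not
anything from Mailman) might look like:

import os

def same_second_saves_collide(directory):
    """Write a file twice in quick succession and report whether the
    two mtimes are distinguishable.  If they compare equal, a
    timestamp-based "has the file changed?" test will miss the second
    save, which is exactly the window the race above needs."""
    path = os.path.join(directory, 'mtime_probe.tmp')
    with open(path, 'w') as f:
        f.write('first')
    t1 = os.stat(path).st_mtime
    with open(path, 'w') as f:
        f.write('second')
    t2 = os.stat(path).st_mtime
    os.remove(path)
    return t1 == t2

On a local filesystem with sub-second mtimes this usually returns
False; on a mount that rounds mtimes to whole seconds, or an NFS client
returning cached attributes, it can return True, which widens the
window for the bug.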

Thanks,

--
Drew


