From barry at list.org  Mon Sep 26 00:26:28 2011
From: barry at list.org (Barry Warsaw)
Date: Sun, 25 Sep 2011 18:26:28 -0400
Subject: [Mailman-Developers] RELEASED: Mailman 3.0 alpha 8
Message-ID: <20110925182628.30cd0a0c@resist.wooz.org>

I am very happy to announce the release of the eighth alpha for Mailman 3.0,
code named "Where's My Thing?".  This is the last planned alpha release, as I
want to work toward the first beta in order to meet my goal of an 11/11/11
final release (of the core engine at least).

If you've been holding off looking at Mailman 3, I invite you to do so now.
Once beta 1 is released I will not be adding any new features.  I do hope to
put up a few live test lists soon, so stay tuned.

There have been a large number of fixes and new features, especially in the
REST API.  My thanks go to Stephen Goss who has contributed greatly to this
release, with bug reports, wish list items, patches, and merge proposals.
Full details of what's new in 3.0a8 is available here:

    http://tinyurl.com/6yxgclf

The tarball can be downloaded from Launchpad or the Cheeseshop:

    https://launchpad.net/mailman
    http://pypi.python.org/pypi/mailman/3.0.0a8

The full documentation is also online:

    http://packages.python.org/mailman/

Enjoy,
-Barry
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 836 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/mailman-developers/attachments/20110925/5642593a/attachment.pgp>

From barry at python.org  Tue Sep 27 03:03:38 2011
From: barry at python.org (Barry Warsaw)
Date: Mon, 26 Sep 2011 21:03:38 -0400
Subject: [Mailman-Developers] [Bug 860159] Re: Mailman 3.0 support for
	Postgres
In-Reply-To: <20110927004418.10717.68620.malone@gac.canonical.com>
References: <20110927001849.9391.31768.malonedeb@gac.canonical.com>
	<20110927004418.10717.68620.malone@gac.canonical.com>
Message-ID: <20110926210338.05782ba4@resist.wooz.org>

On Sep 27, 2011, at 12:44 AM, Stephen A. Goss wrote:

>I've attached a branch with the Postgres support code that I've cooked
>up.

Fantastic, thanks!  Do be sure to add a merge proposal.

>3. There is now an alternate mailman_pg.sql file which is used to create
>the tables. Currently, two foreign key constraints are commented out
>because those are violated in Mailman 3 (apparently this doesn't bother
>SQLite). Some column TYPES are different from mailman.sql. Classes are
>created in a slightly different order due to FK constraint creation
>requires the referenced table to actually exist. The primary key indexes
>defined after each class are probably redundant, as those are created
>automatically for SERIAL columns defined as PRIMARY KEY in Postgres.

Do you have any thoughts on whether the two .sql files can possibly be shared?
My biggest concern is that it will be difficult-ish to keep them in sync as I
add or modify the SQLite version.  If it's not possible, so be it.

I took a quick look at the Python changes, and I think I'm going to refactor
the code to not hardcode so much in stock.py.  For example, I'll probably
rename StockDatabase to SQLiteDatabase and add a PostgresDatabase class,
adding a common super class.  That way, you'd only need to put this in your
mailman.cfg file:

[database]
class: mailman.database.postgres.PostgresDatabase

Don't worry about that too much, I can make that change when I merge your
branch.

>4. Probably more FK constraint violations exist that my tests haven't
>uncovered.

I'd definitely like to be able to run the test suite against Postgres, if even
for now it's a manual select (e.g. because Postgres would obviously have to be
installed and configured in order to work).

From omacneil at thecsl.org  Tue Sep 27 03:11:54 2011
From: omacneil at thecsl.org (Dan MacNeil)
Date: Mon, 26 Sep 2011 21:11:54 -0400
Subject: [Mailman-Developers] [Bug 860159] Re: Mailman 3.0 support for
 Postgres
In-Reply-To: <20110926210338.05782ba4@resist.wooz.org>
References: <20110927001849.9391.31768.malonedeb@gac.canonical.com>	<20110927004418.10717.68620.malone@gac.canonical.com>
	<20110926210338.05782ba4@resist.wooz.org>
Message-ID: <4E8122DA.7030607@thecsl.org>

On 09/26/2011 09:03 PM, Barry Warsaw wrote:
>   Currently, two foreign key constraints are commented out
>>because those are violated in Mailman 3 (apparently this doesn't bother
>>SQLite).

Foreign key constraints are available in sqlite 3.6.19 and above. They 
are turned off by default. They can be enabled with:

	PRAGMA foreign_keys = ON;

More details at:
		http://www.sqlite.org/foreignkeys.html	

From barry at python.org  Tue Sep 27 03:27:13 2011
From: barry at python.org (Barry Warsaw)
Date: Mon, 26 Sep 2011 21:27:13 -0400
Subject: [Mailman-Developers] [Bug 860159] Re: Mailman 3.0 support for
 Postgres
In-Reply-To: <4E8122DA.7030607@thecsl.org>
References: <20110927001849.9391.31768.malonedeb@gac.canonical.com>
	<20110927004418.10717.68620.malone@gac.canonical.com>
	<20110926210338.05782ba4@resist.wooz.org>
	<4E8122DA.7030607@thecsl.org>
Message-ID: <20110926212713.3329062f@resist.wooz.org>

On Sep 26, 2011, at 09:11 PM, Dan MacNeil wrote:

>Foreign key constraints are available in sqlite 3.6.19 and above. They are
>turned off by default. They can be enabled with:
>
>	PRAGMA foreign_keys = ON;
>
>More details at:
>		http://www.sqlite.org/foreignkeys.html	

Thanks.  AFAICT, Ubuntu's sqlite3 package is compiled with this enabled, and
adding this line to mailman.sql still passes all the tests.

-Barry

From felipe at felipegasper.com  Tue Sep 27 21:20:37 2011
From: felipe at felipegasper.com (Felipe Gasper)
Date: Tue, 27 Sep 2011 14:20:37 -0500
Subject: [Mailman-Developers] Hello
Message-ID: <4E822205.2040504@felipegasper.com>

Hi all,

	Barry said to email this list with an interest in helping with UI for MM3.

	I?ve done UI development for cPanel, Inc. for the past couple years. 
I?m fluent with Perl and JS/CSS/HTML/etc. I messed with Python a tiny 
bit about 8 years ago; I?m hoping to beef up my skills in that area as 
part of helping out with MM3. I?d say I?m reasonably well-versed in SMTP.

	I?ve used Git and SVN; Bazaar will be new to me.

	I?m looking forward to helping out.

cheers,
-Felipe Gasper
Houston, TX

From barry at list.org  Wed Sep 28 22:35:28 2011
From: barry at list.org (Barry Warsaw)
Date: Wed, 28 Sep 2011 16:35:28 -0400
Subject: [Mailman-Developers] Hello
In-Reply-To: <4E822205.2040504@felipegasper.com>
References: <4E822205.2040504@felipegasper.com>
Message-ID: <20110928163528.67436857@resist.wooz.org>

Hi Felipe,

On Sep 27, 2011, at 02:20 PM, Felipe Gasper wrote:

> I?ve done UI development for cPanel, Inc. for the past couple years. I?m
> fluent with Perl and JS/CSS/HTML/etc. I messed with Python a tiny bit about
> 8 years ago; I?m hoping to beef up my skills in that area as part of helping
> out with MM3. I?d say I?m reasonably well-versed in SMT P.
>
> I?ve used Git and SVN; Bazaar will be new to me.
>
> I?m looking forward to helping out.

Welcome!  I think you'll have no problems with Python and Bazaar, but do feel
free to ask any questions, either here or on freenode #mailman.   Florian and
Terri will probably be able to better answer questions about the web ui part
of the project.

Cheers,
-Barry

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 836 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/mailman-developers/attachments/20110928/75396309/attachment.pgp>

From acase at cims.nyu.edu  Thu Sep 29 00:08:59 2011
From: acase at cims.nyu.edu (Andrew Case)
Date: Wed, 28 Sep 2011 18:08:59 -0400
Subject: [Mailman-Developers] Faulty Member Subscribe/Unsubscribes
Message-ID: <95e21858fd84b36004ac74e3847d32e1.squirrel@webmail.cims.nyu.edu>

My configuration:
  Mailman: 2.1.14
  OS: Solaris 10
  Python: 2.4.5
  PREFIX = '/usr/mailman'
  Server setup: 1 server for web management, 1 server for MTA/qrunner. 
/usr/mailman is NFS mounted on both servers


I've been having the following issue my mailman lists:

A user is either subscribed or unsubscribed according to the logs, but
then if I look at the member list, the action has not been done (or has
been undone).  For example, here is where I remove a subscriber and then
look at the list members and they are still in the list:

[mailman at myhost] ~/logs |> grep testlist subscribe | grep acase
Sep 28 17:15:14 2011 (4401) testlist: new acase at example.com, admin mass sub
Sep 28 17:19:36 2011 (5821) testlist: deleted acase at example.com; member
mgt page
[mailman at myhost] ~/logs |> ../bin/list_members testlist | grep acase
acase at example.com
[mailman at myhost] ~/logs |>

The same also happens when subscribing.  I will mass subscribe users (or
when users confirm subscription via email/web), the logs indicated that
they have been subscribed successfully, but then when I go look them up,
they are not listed on the members list.

This happens sporadically, but I am generally able to reproduce the error
if I do it a couple times in a row.

I'm suspicious there may be a locking issue and config.pck is reverting to
config.pck.last.  I found this thread rather helpful in analyzing
potential problems, but I have yet to figure anything out:
  http://web.archiveorange.com/archive/v/IezAOgEQf7xEYCSEJTbD

In addition if I just run the following commands over and over, then the
bug never seems to come up.  This is part of why I am worrying about
locking:
  bin/add_members ...
  bin/remove_members ...

Is there a good way to test locking between servers?  I've run the
tests/test_lockfile.py, but it reports it is OK.

Any and all help would be GREATLY appreciated.  We've been trying to
triage this bug for weeks and it is terribly disruptive for our users.

Thanks,

--
Drew


From mark at msapiro.net  Thu Sep 29 04:48:19 2011
From: mark at msapiro.net (Mark Sapiro)
Date: Wed, 28 Sep 2011 19:48:19 -0700
Subject: [Mailman-Developers] Faulty Member Subscribe/Unsubscribes
In-Reply-To: <95e21858fd84b36004ac74e3847d32e1.squirrel@webmail.cims.nyu.edu>
References: <95e21858fd84b36004ac74e3847d32e1.squirrel@webmail.cims.nyu.edu>
Message-ID: <4E83DC73.90708@msapiro.net>

On 9/28/2011 3:08 PM, Andrew Case wrote:
> My configuration:
>   Mailman: 2.1.14
>   OS: Solaris 10
>   Python: 2.4.5
>   PREFIX = '/usr/mailman'
>   Server setup: 1 server for web management, 1 server for MTA/qrunner. 
> /usr/mailman is NFS mounted on both servers
> 
> 
> I've been having the following issue my mailman lists:
> 
> A user is either subscribed or unsubscribed according to the logs, but
> then if I look at the member list, the action has not been done (or has
> been undone).  For example, here is where I remove a subscriber and then
> look at the list members and they are still in the list:
> 
> [mailman at myhost] ~/logs |> grep testlist subscribe | grep acase
> Sep 28 17:15:14 2011 (4401) testlist: new acase at example.com, admin mass sub
> Sep 28 17:19:36 2011 (5821) testlist: deleted acase at example.com; member
> mgt page
> [mailman at myhost] ~/logs |> ../bin/list_members testlist | grep acase
> acase at example.com
> [mailman at myhost] ~/logs |>


There is a bug in the Mailman 2.1 branch, but the above is not it. The
above log shows that acase at example.com was added by admin mass subscribe
at 17:15:14 and then a bit more than 4 minutes later, was removed by
checking the unsub box on the admin Membership List and submitting.

If you check your web server logs, you will find POST transactions to
the admin page for both these events.


> The same also happens when subscribing.  I will mass subscribe users (or
> when users confirm subscription via email/web), the logs indicated that
> they have been subscribed successfully, but then when I go look them up,
> they are not listed on the members list.
> 
> This happens sporadically, but I am generally able to reproduce the error
> if I do it a couple times in a row.


This is possibly a manifestation of the bug, but I'm surprised it is
happening that frequently.


> I'm suspicious there may be a locking issue and config.pck is reverting to
> config.pck.last.  I found this thread rather helpful in analyzing
> potential problems, but I have yet to figure anything out:
>   http://web.archiveorange.com/archive/v/IezAOgEQf7xEYCSEJTbD


The thread you point to above is relevant, but it is not a locking
issue. The problem is due to list caching in Mailman/Queue/Runner.py
and/or nearly concurrent processes which first load the list unlocked
and later lock it. The issue is that the resolution of the config.pck
timestamp is 1 second, and if a process has a list object and that list
object is updated by another process within the same second as the
timestamp on the first process's object, the first process won't load
the updated list when it locks it. This can result in things like a
subscribe being done and logged and then silently reversed.

List locking is working as it should. The issue is that the first
process doesn't reload the updated list when it acquires the lock
because it thinks it already has the latest version.

I thought I had fixed this on the 2.1 branch, but it seems I only fixed
it for the now defunct 2.2 branch.

A relevant thread starts at
<http://mail.python.org/pipermail/mailman-users/2008-August/062862.html>
and continues at
<http://mail.python.org/pipermail/mailman-developers/2008-August/020329.html>

The patch in the attached cache.patch file should fix it.


> In addition if I just run the following commands over and over, then the
> bug never seems to come up.  This is part of why I am worrying about
> locking:
>   bin/add_members ...
>   bin/remove_members ...


That won't do it. bin/add_members alone will do it, but only if there is
a nearly concurrent process updating the same list.


> Is there a good way to test locking between servers?  I've run the
> tests/test_lockfile.py, but it reports it is OK.
> 
> Any and all help would be GREATLY appreciated.  We've been trying to
> triage this bug for weeks and it is terribly disruptive for our users.


The post at
<http://mail.python.org/pipermail/mailman-users/2008-August/062862.html>
contains a "stress test" that will probably reproduce the problem.

I suspect your Mailman server must be very busy for you to see this bug
that frequently. However, it looks like I need to install the fix for
Mailman 2.1.15.

It is also curious that the only reports of this that I can recall both
come from solaris users. There may be complications in your case due to
NFS, but locking shouldn't be the issue. Run the stress test and see if
it fails. If it does, try the patch.

Let us know what happens.

-- 
Mark Sapiro <mark at msapiro.net>        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: cache.patch
URL: <http://mail.python.org/pipermail/mailman-developers/attachments/20110928/3b4713f1/attachment.ksh>

From acase at cims.nyu.edu  Thu Sep 29 08:52:59 2011
From: acase at cims.nyu.edu (Andrew Case)
Date: Thu, 29 Sep 2011 02:52:59 -0400
Subject: [Mailman-Developers] Faulty Member Subscribe/Unsubscribes
In-Reply-To: <4E83DC73.90708@msapiro.net>
References: <95e21858fd84b36004ac74e3847d32e1.squirrel@webmail.cims.nyu.edu>
	<4E83DC73.90708@msapiro.net>
Message-ID: <ab74412b1aa0634b8432320baa69a4fa.squirrel@webmail.cims.nyu.edu>

Thanks Mark, see inline comments.

>> [mailman at myhost] ~/logs |> grep testlist subscribe | grep acase
>> Sep 28 17:15:14 2011 (4401) testlist: new acase at example.com, admin mass
>> sub
>> Sep 28 17:19:36 2011 (5821) testlist: deleted acase at example.com; member
>> mgt page
>> [mailman at myhost] ~/logs |> ../bin/list_members testlist | grep acase
>> acase at example.com
>> [mailman at myhost] ~/logs |>
>
>
> There is a bug in the Mailman 2.1 branch, but the above is not it. The
> above log shows that acase at example.com was added by admin mass subscribe
> at 17:15:14 and then a bit more than 4 minutes later, was removed by
> checking the unsub box on the admin Membership List and submitting.

I was trying to show that even after the user was removed, they're still
listed as a member.

> If you check your web server logs, you will find POST transactions to
> the admin page for both these events.

Agreed.

>> The same also happens when subscribing.  I will mass subscribe users (or
>> when users confirm subscription via email/web), the logs indicated that
>> they have been subscribed successfully, but then when I go look them up,
>> they are not listed on the members list.
>>
>> This happens sporadically, but I am generally able to reproduce the
>> error
>> if I do it a couple times in a row.
>
>
> This is possibly a manifestation of the bug, but I'm surprised it is
> happening that frequently.

Easiest way for me to replicated the problem is:
1) check the unsubscribe box for user A then hit submit
2) after reload check the unsubscribe box for user B then hit submit
3) reaload the "membership list" page and user B is back on the list

This happens even after I wait a couple seconds in between each step.

>> I'm suspicious there may be a locking issue and config.pck is reverting
>> to
>> config.pck.last.  I found this thread rather helpful in analyzing
>> potential problems, but I have yet to figure anything out:
>>   http://web.archiveorange.com/archive/v/IezAOgEQf7xEYCSEJTbD
>
>
> The thread you point to above is relevant, but it is not a locking
> issue. The problem is due to list caching in Mailman/Queue/Runner.py
> and/or nearly concurrent processes which first load the list unlocked
> and later lock it. The issue is that the resolution of the config.pck
> timestamp is 1 second, and if a process has a list object and that list
> object is updated by another process within the same second as the
> timestamp on the first process's object, the first process won't load
> the updated list when it locks it. This can result in things like a
> subscribe being done and logged and then silently reversed.

The result sounds the same, but would this happen even if I'm loading the
page with more than a second in between each step outlined above?

> List locking is working as it should. The issue is that the first
> process doesn't reload the updated list when it acquires the lock
> because it thinks it already has the latest version.
>
> I thought I had fixed this on the 2.1 branch, but it seems I only fixed
> it for the now defunct 2.2 branch.
>
> A relevant thread starts at
> <http://mail.python.org/pipermail/mailman-users/2008-August/062862.html>
> and continues at
> <http://mail.python.org/pipermail/mailman-developers/2008-August/020329.html>
>
> The patch in the attached cache.patch file should fix it.

I applied the patch but it doesn't seem to have made a difference.


>> In addition if I just run the following commands over and over, then the
>> bug never seems to come up.  This is part of why I am worrying about
>> locking:
>>   bin/add_members ...
>>   bin/remove_members ...
>
>
> That won't do it. bin/add_members alone will do it, but only if there is
> a nearly concurrent process updating the same list.
>
>
>> Is there a good way to test locking between servers?  I've run the
>> tests/test_lockfile.py, but it reports it is OK.
>>
>> Any and all help would be GREATLY appreciated.  We've been trying to
>> triage this bug for weeks and it is terribly disruptive for our users.
>
>
> The post at
> <http://mail.python.org/pipermail/mailman-users/2008-August/062862.html>
> contains a "stress test" that will probably reproduce the problem.

Correct.  Only one subscriber was subscribed to each test list.  Keep in
mind that in the stress test given if you use a sleep counter of 5 with 6
lists, that means you're waiting _30 seconds_ before the next add_member
command is run for that list (I'm assume the timing issue is per-list, not
per run of add_members).  Even if you set the timer down to 1 that's a 6
second sleep.  This shouldn't effect a cache that we're comparing for the
given second.  Anyway, my script ran fine with the 5 second sleep (30
seconds per list add), but showed discrepancies with a 3 second sleep.

> I suspect your Mailman server must be very busy for you to see this bug
> that frequently. However, it looks like I need to install the fix for
> Mailman 2.1.15.

We run about 600 different mailing lists for our department and this has
been a continues headache.  I appreciate all the hard work you guys do.

> It is also curious that the only reports of this that I can recall both
> come from solaris users. There may be complications in your case due to
> NFS, but locking shouldn't be the issue. Run the stress test and see if
> it fails. If it does, try the patch.

Patch didn't seem to help.  Is there an easy way to omit the caching in this?

Thanks,
--
Drew

>
> Let us know what happens.
>
> --
> Mark Sapiro <mark at msapiro.net>        The highway is for gamblers,
> San Francisco Bay Area, California    better use your sense - B. Dylan
>
>


From acase at cims.nyu.edu  Thu Sep 29 09:06:00 2011
From: acase at cims.nyu.edu (Andrew Case)
Date: Thu, 29 Sep 2011 03:06:00 -0400
Subject: [Mailman-Developers] Faulty Member Subscribe/Unsubscribes
In-Reply-To: <4E83DC73.90708@msapiro.net>
References: <95e21858fd84b36004ac74e3847d32e1.squirrel@webmail.cims.nyu.edu>
	<4E83DC73.90708@msapiro.net>
Message-ID: <b15f8b24a544ad281c0310fd2aa9ef96.squirrel@webmail.cims.nyu.edu>

Mark,

Another question below...

>> I'm suspicious there may be a locking issue and config.pck is reverting
>> to
>> config.pck.last.  I found this thread rather helpful in analyzing
>> potential problems, but I have yet to figure anything out:
>>   http://web.archiveorange.com/archive/v/IezAOgEQf7xEYCSEJTbD
>
>
> The thread you point to above is relevant, but it is not a locking
> issue. The problem is due to list caching in Mailman/Queue/Runner.py
> and/or nearly concurrent processes which first load the list unlocked
> and later lock it. The issue is that the resolution of the config.pck
> timestamp is 1 second, and if a process has a list object and that list
> object is updated by another process within the same second as the
> timestamp on the first process's object, the first process won't load
> the updated list when it locks it. This can result in things like a
> subscribe being done and logged and then silently reversed.

You think it should be okay though if my qrunners are all running on my
mta server instead of my webserver though.  This wouldn't be causing a
problem with the caching right?

Thanks,

--
Drew


From acase at cims.nyu.edu  Thu Sep 29 10:13:46 2011
From: acase at cims.nyu.edu (Andrew Case)
Date: Thu, 29 Sep 2011 04:13:46 -0400
Subject: [Mailman-Developers] Faulty Member Subscribe/Unsubscribes
In-Reply-To: <ab74412b1aa0634b8432320baa69a4fa.squirrel@webmail.cims.nyu.edu>
References: <95e21858fd84b36004ac74e3847d32e1.squirrel@webmail.cims.nyu.edu>
	<4E83DC73.90708@msapiro.net>
	<ab74412b1aa0634b8432320baa69a4fa.squirrel@webmail.cims.nyu.edu>
Message-ID: <b6fde274052300f818caa82458dc2631.squirrel@webmail.cims.nyu.edu>

Mark,

I realized I hadn't restarted my QRunners after this patch.  It looks like
its working perfectly now!  Even with a sleep of 0.  Thanks so much!

--
Drew

On Thu, September 29, 2011 2:52 am, Andrew Case wrote:
> Thanks Mark, see inline comments.
>
>>> [mailman at myhost] ~/logs |> grep testlist subscribe | grep acase
>>> Sep 28 17:15:14 2011 (4401) testlist: new acase at example.com, admin mass
>>> sub
>>> Sep 28 17:19:36 2011 (5821) testlist: deleted acase at example.com; member
>>> mgt page
>>> [mailman at myhost] ~/logs |> ../bin/list_members testlist | grep acase
>>> acase at example.com
>>> [mailman at myhost] ~/logs |>
>>
>>
>> There is a bug in the Mailman 2.1 branch, but the above is not it. The
>> above log shows that acase at example.com was added by admin mass subscribe
>> at 17:15:14 and then a bit more than 4 minutes later, was removed by
>> checking the unsub box on the admin Membership List and submitting.
>
> I was trying to show that even after the user was removed, they're still
> listed as a member.
>
>> If you check your web server logs, you will find POST transactions to
>> the admin page for both these events.
>
> Agreed.
>
>>> The same also happens when subscribing.  I will mass subscribe users
>>> (or
>>> when users confirm subscription via email/web), the logs indicated that
>>> they have been subscribed successfully, but then when I go look them
>>> up,
>>> they are not listed on the members list.
>>>
>>> This happens sporadically, but I am generally able to reproduce the
>>> error
>>> if I do it a couple times in a row.
>>
>>
>> This is possibly a manifestation of the bug, but I'm surprised it is
>> happening that frequently.
>
> Easiest way for me to replicated the problem is:
> 1) check the unsubscribe box for user A then hit submit
> 2) after reload check the unsubscribe box for user B then hit submit
> 3) reaload the "membership list" page and user B is back on the list
>
> This happens even after I wait a couple seconds in between each step.
>
>>> I'm suspicious there may be a locking issue and config.pck is reverting
>>> to
>>> config.pck.last.  I found this thread rather helpful in analyzing
>>> potential problems, but I have yet to figure anything out:
>>>   http://web.archiveorange.com/archive/v/IezAOgEQf7xEYCSEJTbD
>>
>>
>> The thread you point to above is relevant, but it is not a locking
>> issue. The problem is due to list caching in Mailman/Queue/Runner.py
>> and/or nearly concurrent processes which first load the list unlocked
>> and later lock it. The issue is that the resolution of the config.pck
>> timestamp is 1 second, and if a process has a list object and that list
>> object is updated by another process within the same second as the
>> timestamp on the first process's object, the first process won't load
>> the updated list when it locks it. This can result in things like a
>> subscribe being done and logged and then silently reversed.
>
> The result sounds the same, but would this happen even if I'm loading the
> page with more than a second in between each step outlined above?
>
>> List locking is working as it should. The issue is that the first
>> process doesn't reload the updated list when it acquires the lock
>> because it thinks it already has the latest version.
>>
>> I thought I had fixed this on the 2.1 branch, but it seems I only fixed
>> it for the now defunct 2.2 branch.
>>
>> A relevant thread starts at
>> <http://mail.python.org/pipermail/mailman-users/2008-August/062862.html>
>> and continues at
>> <http://mail.python.org/pipermail/mailman-developers/2008-August/020329.html>
>>
>> The patch in the attached cache.patch file should fix it.
>
> I applied the patch but it doesn't seem to have made a difference.
>
>
>>> In addition if I just run the following commands over and over, then
>>> the
>>> bug never seems to come up.  This is part of why I am worrying about
>>> locking:
>>>   bin/add_members ...
>>>   bin/remove_members ...
>>
>>
>> That won't do it. bin/add_members alone will do it, but only if there is
>> a nearly concurrent process updating the same list.
>>
>>
>>> Is there a good way to test locking between servers?  I've run the
>>> tests/test_lockfile.py, but it reports it is OK.
>>>
>>> Any and all help would be GREATLY appreciated.  We've been trying to
>>> triage this bug for weeks and it is terribly disruptive for our users.
>>
>>
>> The post at
>> <http://mail.python.org/pipermail/mailman-users/2008-August/062862.html>
>> contains a "stress test" that will probably reproduce the problem.
>
> Correct.  Only one subscriber was subscribed to each test list.  Keep in
> mind that in the stress test given if you use a sleep counter of 5 with 6
> lists, that means you're waiting _30 seconds_ before the next add_member
> command is run for that list (I'm assume the timing issue is per-list, not
> per run of add_members).  Even if you set the timer down to 1 that's a 6
> second sleep.  This shouldn't effect a cache that we're comparing for the
> given second.  Anyway, my script ran fine with the 5 second sleep (30
> seconds per list add), but showed discrepancies with a 3 second sleep.
>
>> I suspect your Mailman server must be very busy for you to see this bug
>> that frequently. However, it looks like I need to install the fix for
>> Mailman 2.1.15.
>
> We run about 600 different mailing lists for our department and this has
> been a continues headache.  I appreciate all the hard work you guys do.
>
>> It is also curious that the only reports of this that I can recall both
>> come from solaris users. There may be complications in your case due to
>> NFS, but locking shouldn't be the issue. Run the stress test and see if
>> it fails. If it does, try the patch.
>
> Patch didn't seem to help.  Is there an easy way to omit the caching in
> this?
>
> Thanks,
> --
> Drew
>
>>
>> Let us know what happens.
>>
>> --
>> Mark Sapiro <mark at msapiro.net>        The highway is for gamblers,
>> San Francisco Bay Area, California    better use your sense - B. Dylan
>>
>>
>
>


Andrew Case
Systems Administrator
Courant Institute of Mathematical Sciences
New York University
251 Mercer St., Room 1023
New York, NY 10012-1110
Phone: 212-998-3147


From f at state-of-mind.de  Thu Sep 29 13:36:07 2011
From: f at state-of-mind.de (Florian Fuchs)
Date: Thu, 29 Sep 2011 13:36:07 +0200
Subject: [Mailman-Developers] Hello
In-Reply-To: <20110928163528.67436857@resist.wooz.org>
References: <4E822205.2040504@felipegasper.com>
	<20110928163528.67436857@resist.wooz.org>
Message-ID: <8E6C58F5-5DE3-4F92-A439-6FCA9E966A48@state-of-mind.de>

Hi Felipe,

as for the UI I think a good starting point would be to check out the work Benedict Stein and Anna Granudd have done during the last two Google Summers of Code. You can find it here:

https://launchpad.net/~mailmanwebgsoc2011

The web UI is based on the Django framework and communicates with the Mailman core via a REST API. In order to do so it depends on a client library which translates some of the HTTP API calls into Python object logic. The library code can be found here:

https://launchpad.net/mailman.client

Benedict has also created a detailed step by step installation page for mailman3a7 and the current web UI (I'm not sure if MM3's current Alpha8 is fully compatible with the client library as it's brand new... I will try to do any necessary adjustments as soon as possible...)

http://wiki.list.org/pages/viewpage.action?pageId=11960560

As Barry said, feel free to ask any questions!

Cheers 
Florian


Am 28.09.2011 um 22:35 schrieb Barry Warsaw <barry at list.org>:

> Hi Felipe,
> 
> On Sep 27, 2011, at 02:20 PM, Felipe Gasper wrote:
> 
>> I?ve done UI development for cPanel, Inc. for the past couple years. I?m
>> fluent with Perl and JS/CSS/HTML/etc. I messed with Python a tiny bit about
>> 8 years ago; I?m hoping to beef up my skills in that area as part of helping
>> out with MM3. I?d say I?m reasonably well-versed in SMT P.
>> 
>> I?ve used Git and SVN; Bazaar will be new to me.
>> 
>> I?m looking forward to helping out.
> 
> Welcome!  I think you'll have no problems with Python and Bazaar, but do feel
> free to ask any questions, either here or on freenode #mailman.   Florian and
> Terri will probably be able to better answer questions about the web ui part
> of the project.
> 
> Cheers,
> -Barry
> 
> _______________________________________________
> Mailman-Developers mailing list
> Mailman-Developers at python.org
> http://mail.python.org/mailman/listinfo/mailman-developers
> Mailman FAQ: http://wiki.list.org/x/AgA3
> Searchable Archives: http://www.mail-archive.com/mailman-developers%40python.org/
> Unsubscribe: http://mail.python.org/mailman/options/mailman-developers/f%40state-of-mind.de
> 
> Security Policy: http://wiki.list.org/x/QIA9


From acase at cims.nyu.edu  Thu Sep 29 17:58:54 2011
From: acase at cims.nyu.edu (Andrew Case)
Date: Thu, 29 Sep 2011 11:58:54 -0400
Subject: [Mailman-Developers] Faulty Member Subscribe/Unsubscribes
In-Reply-To: <b6fde274052300f818caa82458dc2631.squirrel@webmail.cims.nyu.edu>
References: <95e21858fd84b36004ac74e3847d32e1.squirrel@webmail.cims.nyu.edu>
	<4E83DC73.90708@msapiro.net>
	<ab74412b1aa0634b8432320baa69a4fa.squirrel@webmail.cims.nyu.edu>
	<b6fde274052300f818caa82458dc2631.squirrel@webmail.cims.nyu.edu>
Message-ID: <bc19d8c12bb0c37882cde710bf1a05b3.squirrel@webmail.cims.nyu.edu>

Mark,

I don't know if you'll find this worth trying to fix, but I revised the
stress test this morning to be a bit more stressful.  The biggest change I
made (besides upping the number of subscribers) was that instead of adding
each member to all the lists before adding the second member, it now adds
all the members to one list successively before moving on the the next
list.  This causes a much lower amount of time between subscriptions per
list, causing the config.pck files to be updated in a much smaller
timeframe.

When I did this I saw that maybe 1-5% of the time a user was still omitted
from the list (user was silently removed).  I think that because these are
processed by the queue runner on a different host and because the
timestamp check is being done on an NFS stored file, there is potential
that the qrunner for this doesn't yet have an updated mtime for that file
(or even a small ntp time drift could cause this).  When I commented out
the caching part of the code in MailList.py this bug never seems to show
up:
            #if mtime < self.__timestamp:
            #    # File is not newer
            #    return None, None

So I think there may still be a race condition here, but the chances of it
are unlikely that human interaction would trigger this.  If however, you
have a script that is subscribing users (one after another), this could
still come up.  I actually happen to have such a script, but I run it on
the same host as the qrunners, so I haven't experienced this before.

In my case I think it's probably not worth keeping the performance gain
that the caching adds for sake of consistency.

Attached is the modified stress test I'm using.

Thanks again,

--
Drew


On Thu, September 29, 2011 4:13 am, Andrew Case wrote:
> Mark,
>
> I realized I hadn't restarted my QRunners after this patch.  It looks like
> its working perfectly now!  Even with a sleep of 0.  Thanks so much!
>
> --
> Drew
>
> On Thu, September 29, 2011 2:52 am, Andrew Case wrote:
>> Thanks Mark, see inline comments.
>>
>>>> [mailman at myhost] ~/logs |> grep testlist subscribe | grep acase
>>>> Sep 28 17:15:14 2011 (4401) testlist: new acase at example.com, admin
>>>> mass
>>>> sub
>>>> Sep 28 17:19:36 2011 (5821) testlist: deleted acase at example.com;
>>>> member
>>>> mgt page
>>>> [mailman at myhost] ~/logs |> ../bin/list_members testlist | grep acase
>>>> acase at example.com
>>>> [mailman at myhost] ~/logs |>
>>>
>>>
>>> There is a bug in the Mailman 2.1 branch, but the above is not it. The
>>> above log shows that acase at example.com was added by admin mass
>>> subscribe
>>> at 17:15:14 and then a bit more than 4 minutes later, was removed by
>>> checking the unsub box on the admin Membership List and submitting.
>>
>> I was trying to show that even after the user was removed, they're still
>> listed as a member.
>>
>>> If you check your web server logs, you will find POST transactions to
>>> the admin page for both these events.
>>
>> Agreed.
>>
>>>> The same also happens when subscribing.  I will mass subscribe users
>>>> (or
>>>> when users confirm subscription via email/web), the logs indicated
>>>> that
>>>> they have been subscribed successfully, but then when I go look them
>>>> up,
>>>> they are not listed on the members list.
>>>>
>>>> This happens sporadically, but I am generally able to reproduce the
>>>> error
>>>> if I do it a couple times in a row.
>>>
>>>
>>> This is possibly a manifestation of the bug, but I'm surprised it is
>>> happening that frequently.
>>
>> Easiest way for me to replicated the problem is:
>> 1) check the unsubscribe box for user A then hit submit
>> 2) after reload check the unsubscribe box for user B then hit submit
>> 3) reaload the "membership list" page and user B is back on the list
>>
>> This happens even after I wait a couple seconds in between each step.
>>
>>>> I'm suspicious there may be a locking issue and config.pck is
>>>> reverting
>>>> to
>>>> config.pck.last.  I found this thread rather helpful in analyzing
>>>> potential problems, but I have yet to figure anything out:
>>>>   http://web.archiveorange.com/archive/v/IezAOgEQf7xEYCSEJTbD
>>>
>>>
>>> The thread you point to above is relevant, but it is not a locking
>>> issue. The problem is due to list caching in Mailman/Queue/Runner.py
>>> and/or nearly concurrent processes which first load the list unlocked
>>> and later lock it. The issue is that the resolution of the config.pck
>>> timestamp is 1 second, and if a process has a list object and that list
>>> object is updated by another process within the same second as the
>>> timestamp on the first process's object, the first process won't load
>>> the updated list when it locks it. This can result in things like a
>>> subscribe being done and logged and then silently reversed.
>>
>> The result sounds the same, but would this happen even if I'm loading
>> the
>> page with more than a second in between each step outlined above?
>>
>>> List locking is working as it should. The issue is that the first
>>> process doesn't reload the updated list when it acquires the lock
>>> because it thinks it already has the latest version.
>>>
>>> I thought I had fixed this on the 2.1 branch, but it seems I only fixed
>>> it for the now defunct 2.2 branch.
>>>
>>> A relevant thread starts at
>>> <http://mail.python.org/pipermail/mailman-users/2008-August/062862.html>
>>> and continues at
>>> <http://mail.python.org/pipermail/mailman-developers/2008-August/020329.html>
>>>
>>> The patch in the attached cache.patch file should fix it.
>>
>> I applied the patch but it doesn't seem to have made a difference.
>>
>>
>>>> In addition if I just run the following commands over and over, then
>>>> the
>>>> bug never seems to come up.  This is part of why I am worrying about
>>>> locking:
>>>>   bin/add_members ...
>>>>   bin/remove_members ...
>>>
>>>
>>> That won't do it. bin/add_members alone will do it, but only if there
>>> is
>>> a nearly concurrent process updating the same list.
>>>
>>>
>>>> Is there a good way to test locking between servers?  I've run the
>>>> tests/test_lockfile.py, but it reports it is OK.
>>>>
>>>> Any and all help would be GREATLY appreciated.  We've been trying to
>>>> triage this bug for weeks and it is terribly disruptive for our users.
>>>
>>>
>>> The post at
>>> <http://mail.python.org/pipermail/mailman-users/2008-August/062862.html>
>>> contains a "stress test" that will probably reproduce the problem.
>>
>> Correct.  Only one subscriber was subscribed to each test list.  Keep in
>> mind that in the stress test given if you use a sleep counter of 5 with
>> 6
>> lists, that means you're waiting _30 seconds_ before the next add_member
>> command is run for that list (I'm assume the timing issue is per-list,
>> not
>> per run of add_members).  Even if you set the timer down to 1 that's a 6
>> second sleep.  This shouldn't effect a cache that we're comparing for
>> the
>> given second.  Anyway, my script ran fine with the 5 second sleep (30
>> seconds per list add), but showed discrepancies with a 3 second sleep.
>>
>>> I suspect your Mailman server must be very busy for you to see this bug
>>> that frequently. However, it looks like I need to install the fix for
>>> Mailman 2.1.15.
>>
>> We run about 600 different mailing lists for our department and this has
>> been a continues headache.  I appreciate all the hard work you guys do.
>>
>>> It is also curious that the only reports of this that I can recall both
>>> come from solaris users. There may be complications in your case due to
>>> NFS, but locking shouldn't be the issue. Run the stress test and see if
>>> it fails. If it does, try the patch.
>>
>> Patch didn't seem to help.  Is there an easy way to omit the caching in
>> this?
>>
>> Thanks,
>> --
>> Drew
>>
>>>
>>> Let us know what happens.
>>>
>>> --
>>> Mark Sapiro <mark at msapiro.net>        The highway is for gamblers,
>>> San Francisco Bay Area, California    better use your sense - B. Dylan
>>>
>>>
>>
>>
>
>
> Andrew Case
> Systems Administrator
> Courant Institute of Mathematical Sciences
> New York University
> 251 Mercer St., Room 1023
> New York, NY 10012-1110
> Phone: 212-998-3147
>
>


Andrew Case
Systems Administrator
Courant Institute of Mathematical Sciences
New York University
251 Mercer St., Room 1023
New York, NY 10012-1110
Phone: 212-998-3147
-------------- next part --------------
A non-text attachment was scrubbed...
Name: test_subscribe
Type: application/octet-stream
Size: 759 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/mailman-developers/attachments/20110929/1b233f3d/attachment.obj>

From mark at msapiro.net  Thu Sep 29 18:44:03 2011
From: mark at msapiro.net (Mark Sapiro)
Date: Thu, 29 Sep 2011 09:44:03 -0700
Subject: [Mailman-Developers] Faulty Member Subscribe/Unsubscribes
In-Reply-To: <ab74412b1aa0634b8432320baa69a4fa.squirrel@webmail.cims.nyu.edu>
References: <95e21858fd84b36004ac74e3847d32e1.squirrel@webmail.cims.nyu.edu>
	<4E83DC73.90708@msapiro.net>
	<ab74412b1aa0634b8432320baa69a4fa.squirrel@webmail.cims.nyu.edu>
Message-ID: <4E84A053.6040900@msapiro.net>

On 9/28/2011 11:52 PM, Andrew Case wrote:
> Thanks Mark, see inline comments.
> 
>>> [mailman at myhost] ~/logs |> grep testlist subscribe | grep acase
>>> Sep 28 17:15:14 2011 (4401) testlist: new acase at example.com, admin mass
>>> sub
>>> Sep 28 17:19:36 2011 (5821) testlist: deleted acase at example.com; member
>>> mgt page
>>> [mailman at myhost] ~/logs |> ../bin/list_members testlist | grep acase
>>> acase at example.com
>>> [mailman at myhost] ~/logs |>
>>
>>
>> There is a bug in the Mailman 2.1 branch, but the above is not it. The
>> above log shows that acase at example.com was added by admin mass subscribe
>> at 17:15:14 and then a bit more than 4 minutes later, was removed by
>> checking the unsub box on the admin Membership List and submitting.
> 
> I was trying to show that even after the user was removed, they're still
> listed as a member.


Sorry, I missed that. You are correct and this does appear to be a
manifestation of the same issue.


[...]
>> The thread you point to above is relevant, but it is not a locking
>> issue. The problem is due to list caching in Mailman/Queue/Runner.py
>> and/or nearly concurrent processes which first load the list unlocked
>> and later lock it. The issue is that the resolution of the config.pck
>> timestamp is 1 second, and if a process has a list object and that list
>> object is updated by another process within the same second as the
>> timestamp on the first process's object, the first process won't load
>> the updated list when it locks it. This can result in things like a
>> subscribe being done and logged and then silently reversed.
> 
> The result sounds the same, but would this happen even if I'm loading the
> page with more than a second in between each step outlined above?


It is tricky. Each add_members, remove_members and web CGI post is a
separate process. If these processes are run sequentially, there should
not be any problem because each process will load the list, lock it
update it and save it before the next process loads it.

The problem occurs when processes run concurrently. The scenario is
process A loads the list unlocked; process B locks the list and updates
it; process A tries to lock the list and gets the lock after process B
relinquishes it; if the timestamp on the config.pck from process B's
update is in the same second as the timestamp of process A's initial
load, process A thinks the list hasn't been updated and doesn't reload
it after obtaining the lock. Thus, when process A saves the list,
process B's changes are reversed.

This is complicated by list caching in the qrunners because each qrunner
may have a cached copy of the list, so it can act as process A in the
above scenario with its cached copy playing the role of the initially
loaded list. To complicate this further, the qrunners get involved even
in the simple scenario with sequential commands because add_members,
remove_members and CGIs result in notices being sent, and the qrunner
processes that send the notices are running concurrently. This is why
the stress test will fail even though commands are run sequentially.


[...]
> I applied the patch but it doesn't seem to have made a difference.


As you later report, restarting the qrunners did seem to fix it.


[...]
>> The post at
>> <->
>> contains a "stress test" that will probably reproduce the problem.
> 
> Correct.  Only one subscriber was subscribed to each test list.  Keep in
> mind that in the stress test given if you use a sleep counter of 5 with 6
> lists, that means you're waiting _30 seconds_ before the next add_member
> command is run for that list (I'm assume the timing issue is per-list, not
> per run of add_members).  Even if you set the timer down to 1 that's a 6
> second sleep.  This shouldn't effect a cache that we're comparing for the
> given second.  Anyway, my script ran fine with the 5 second sleep (30
> seconds per list add), but showed discrepancies with a 3 second sleep.


So you are adding 'sleep' commands after each add_members? I'm not sure
what you're doing. Is there a different test elsewhere in the thread?

I have used a couple of tests as attached. They are the same except for
list order and are very similar to the one in the original thread. Note
that they contain only one sleep after all the add_members just to allow
things to settle before running list_members.


>> I suspect your Mailman server must be very busy for you to see this bug
>> that frequently. However, it looks like I need to install the fix for
>> Mailman 2.1.15.


Actually, I don't think the issue is the busy server. I think it is more
likely that NFS causes timing issues between add_members and
VirginRunner and OutgoingRunner that just make the bug more likely to
trigger.

-- 
Mark Sapiro <mark at msapiro.net>        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: list_cache_stress_test
URL: <http://mail.python.org/pipermail/mailman-developers/attachments/20110929/fcc4222a/attachment-0002.ksh>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: list_cache_stress_test_2
URL: <http://mail.python.org/pipermail/mailman-developers/attachments/20110929/fcc4222a/attachment-0003.ksh>

From mark at msapiro.net  Thu Sep 29 18:47:02 2011
From: mark at msapiro.net (Mark Sapiro)
Date: Thu, 29 Sep 2011 09:47:02 -0700
Subject: [Mailman-Developers] Faulty Member Subscribe/Unsubscribes
In-Reply-To: <b15f8b24a544ad281c0310fd2aa9ef96.squirrel@webmail.cims.nyu.edu>
References: <95e21858fd84b36004ac74e3847d32e1.squirrel@webmail.cims.nyu.edu>
	<4E83DC73.90708@msapiro.net>
	<b15f8b24a544ad281c0310fd2aa9ef96.squirrel@webmail.cims.nyu.edu>
Message-ID: <4E84A106.9060004@msapiro.net>

On 9/29/2011 12:06 AM, Andrew Case wrote:
> 
> You think it should be okay though if my qrunners are all running on my
> mta server instead of my webserver though.  This wouldn't be causing a
> problem with the caching right?


As long as Mailman's locks directory is a single NFS shared directory,
there should be no problem. The problem you have is due to the bug, and
the patch should fix it.

-- 
Mark Sapiro <mark at msapiro.net>        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan


From acase at cims.nyu.edu  Thu Sep 29 19:30:12 2011
From: acase at cims.nyu.edu (Andrew Case)
Date: Thu, 29 Sep 2011 13:30:12 -0400
Subject: [Mailman-Developers] Faulty Member Subscribe/Unsubscribes
In-Reply-To: <4E84A053.6040900@msapiro.net>
References: <95e21858fd84b36004ac74e3847d32e1.squirrel@webmail.cims.nyu.edu>
	<4E83DC73.90708@msapiro.net>
	<ab74412b1aa0634b8432320baa69a4fa.squirrel@webmail.cims.nyu.edu>
	<4E84A053.6040900@msapiro.net>
Message-ID: <9c547f8afb5dfdf51f27786b8a07143c.squirrel@webmail.cims.nyu.edu>

[...]

> It is tricky. Each add_members, remove_members and web CGI post is a
> separate process. If these processes are run sequentially, there should
> not be any problem because each process will load the list, lock it
> update it and save it before the next process loads it.
>
> The problem occurs when processes run concurrently. The scenario is
> process A loads the list unlocked; process B locks the list and updates
> it; process A tries to lock the list and gets the lock after process B
> relinquishes it; if the timestamp on the config.pck from process B's
> update is in the same second as the timestamp of process A's initial
> load, process A thinks the list hasn't been updated and doesn't reload
> it after obtaining the lock. Thus, when process A saves the list,
> process B's changes are reversed.
>
> This is complicated by list caching in the qrunners because each qrunner
> may have a cached copy of the list, so it can act as process A in the
> above scenario with its cached copy playing the role of the initially
> loaded list. To complicate this further, the qrunners get involved even
> in the simple scenario with sequential commands because add_members,
> remove_members and CGIs result in notices being sent, and the qrunner
> processes that send the notices are running concurrently. This is why
> the stress test will fail even though commands are run sequentially.

Thank you for that explanation.  I did seem to have confusion as to when
the qrunners cache and/or update these config.pck files and when the
add/remove_members commands did as well.  There seemed to be some sort of
conflict between the two.

[...]

>>> The post at
>>> <->
>>> contains a "stress test" that will probably reproduce the problem.
>>
>> Correct.  Only one subscriber was subscribed to each test list.  Keep in
>> mind that in the stress test given if you use a sleep counter of 5 with
>> 6
>> lists, that means you're waiting _30 seconds_ before the next add_member
>> command is run for that list (I'm assume the timing issue is per-list,
>> not
>> per run of add_members).  Even if you set the timer down to 1 that's a 6
>> second sleep.  This shouldn't effect a cache that we're comparing for
>> the
>> given second.  Anyway, my script ran fine with the 5 second sleep (30
>> seconds per list add), but showed discrepancies with a 3 second sleep.
>
>
> So you are adding 'sleep' commands after each add_members?

Yes I was.  Without a sleep in between add_member calls, it was failing
for ~50% of the calls to add_members.  With a 5 second sleep it would tend
to work most of the time.

> I'm not sure what you're doing. Is there a different test elsewhere in
> the thread?

See my updated stress test that I sent you in my last email.

> I have used a couple of tests as attached. They are the same except for
> list order and are very similar to the one in the original thread. Note
> that they contain only one sleep after all the add_members just to allow
> things to settle before running list_members.

That makes sense.

>>> I suspect your Mailman server must be very busy for you to see this bug
>>> that frequently. However, it looks like I need to install the fix for
>>> Mailman 2.1.15.
>
>
> Actually, I don't think the issue is the busy server. I think it is more
> likely that NFS causes timing issues between add_members and
> VirginRunner and OutgoingRunner that just make the bug more likely to
> trigger.

I think you hit the nail on the head here.  It explains a lot.

Thanks,

--
Drew


From mark at msapiro.net  Thu Sep 29 19:30:35 2011
From: mark at msapiro.net (Mark Sapiro)
Date: Thu, 29 Sep 2011 10:30:35 -0700
Subject: [Mailman-Developers] Faulty Member Subscribe/Unsubscribes
In-Reply-To: <bc19d8c12bb0c37882cde710bf1a05b3.squirrel@webmail.cims.nyu.edu>
References: <95e21858fd84b36004ac74e3847d32e1.squirrel@webmail.cims.nyu.edu>
	<4E83DC73.90708@msapiro.net>
	<ab74412b1aa0634b8432320baa69a4fa.squirrel@webmail.cims.nyu.edu>
	<b6fde274052300f818caa82458dc2631.squirrel@webmail.cims.nyu.edu>
	<bc19d8c12bb0c37882cde710bf1a05b3.squirrel@webmail.cims.nyu.edu>
Message-ID: <4E84AB3B.2090204@msapiro.net>

On 9/29/2011 8:58 AM, Andrew Case wrote:
> 
> When I did this I saw that maybe 1-5% of the time a user was still omitted
> from the list (user was silently removed).  I think that because these are
> processed by the queue runner on a different host and because the
> timestamp check is being done on an NFS stored file, there is potential
> that the qrunner for this doesn't yet have an updated mtime for that file
> (or even a small ntp time drift could cause this).  When I commented out
> the caching part of the code in MailList.py this bug never seems to show
> up:
>             #if mtime < self.__timestamp:
>             #    # File is not newer
>             #    return None, None


Actually, that is not the cache. It is just the test for whether the
current list object, cached or whatever, needs to be reloaded from disk.

I think that your configuration with NFS and possible time jitter
between servers makes the bug more likely.


> So I think there may still be a race condition here, but the chances of it
> are unlikely that human interaction would trigger this.  If however, you
> have a script that is subscribing users (one after another), this could
> still come up.  I actually happen to have such a script, but I run it on
> the same host as the qrunners, so I haven't experienced this before.


It can happen even where everything is on a single host, but as I said,
I think your configuration makes it more likely.


> In my case I think it's probably not worth keeping the performance gain
> that the caching adds for sake of consistency.


Attached is a patch to remove list caching from the qrunners. This patch
has the additional advantage of limiting the growth of the qrunners over
time. Old entries were supposed to be freed from the cache, but a self
reference in the default MemberAdaptor prevented this from occurring.

For reasons of trying not to be disruptive this patch and the bug fix I
sent earlier were never applied to the 2.1 branch. I think this was a
mistake, and I will apply them for Mailman 2.1.15.


> Attached is the modified stress test I'm using.


Thanks.

-- 
Mark Sapiro <mark at msapiro.net>        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: disable_cache.patch
URL: <http://mail.python.org/pipermail/mailman-developers/attachments/20110929/43bad8ad/attachment.ksh>

From acase at cims.nyu.edu  Thu Sep 29 20:36:26 2011
From: acase at cims.nyu.edu (Andrew Case)
Date: Thu, 29 Sep 2011 14:36:26 -0400
Subject: [Mailman-Developers] Faulty Member Subscribe/Unsubscribes
In-Reply-To: <4E84AB3B.2090204@msapiro.net>
References: <95e21858fd84b36004ac74e3847d32e1.squirrel@webmail.cims.nyu.edu>
	<4E83DC73.90708@msapiro.net>
	<ab74412b1aa0634b8432320baa69a4fa.squirrel@webmail.cims.nyu.edu>
	<b6fde274052300f818caa82458dc2631.squirrel@webmail.cims.nyu.edu>
	<bc19d8c12bb0c37882cde710bf1a05b3.squirrel@webmail.cims.nyu.edu>
	<4E84AB3B.2090204@msapiro.net>
Message-ID: <e8328800b3b99f280f088c49d73a0089.squirrel@webmail.cims.nyu.edu>

Hey Mark,

Take a look at these results that are with the 2 patches you've sent me
(and qrunners restarted).  I'm adding 10 users to 6 lists:

[... creating list output cut ...]

 + Subscribing to testlist1
Subscribed: foo1 at www.cs.nyu.edu
Subscribed: foo2 at www.cs.nyu.edu
Subscribed: foo3 at www.cs.nyu.edu
Subscribed: foo4 at www.cs.nyu.edu
Subscribed: foo5 at www.cs.nyu.edu
Subscribed: foo6 at www.cs.nyu.edu
Subscribed: foo7 at www.cs.nyu.edu
Subscribed: foo8 at www.cs.nyu.edu
Subscribed: foo9 at www.cs.nyu.edu
Subscribed: foo10 at www.cs.nyu.edu
 + Subscribing to testlist2
Subscribed: foo1 at www.cs.nyu.edu
Subscribed: foo2 at www.cs.nyu.edu
Subscribed: foo3 at www.cs.nyu.edu
Traceback (most recent call last):
  File "/usr/mailman/bin/add_members", line 258, in <module>
    main()
  File "/usr/mailman/bin/add_members", line 238, in main
    addall(mlist, nmembers, 0, send_welcome_msg, s)
  File "/usr/mailman/bin/add_members", line 135, in addall
    mlist.ApprovedAddMember(userdesc, ack, 0)
  File "/usr/mailman/Mailman/MailList.py", line 948, in ApprovedAddMember
    assert self.Locked()
AssertionError
Subscribed: foo5 at www.cs.nyu.edu
Subscribed: foo6 at www.cs.nyu.edu
Subscribed: foo7 at www.cs.nyu.edu
Subscribed: foo8 at www.cs.nyu.edu
Subscribed: foo9 at www.cs.nyu.edu
Subscribed: foo10 at www.cs.nyu.edu
 + Subscribing to testlist3
Subscribed: foo1 at www.cs.nyu.edu

[... subscribing users to all other lists went fine ...]


 + Subscribers for testlist1:
foo10 at www.cs.nyu.edu
foo1 at www.cs.nyu.edu
foo2 at www.cs.nyu.edu
foo3 at www.cs.nyu.edu
foo4 at www.cs.nyu.edu
foo5 at www.cs.nyu.edu
foo6 at www.cs.nyu.edu
foo7 at www.cs.nyu.edu
foo8 at www.cs.nyu.edu
foo9 at www.cs.nyu.edu
 + Removing list testlist1
 + Subscribers for testlist2:
foo1 at www.cs.nyu.edu
foo2 at www.cs.nyu.edu
foo3 at www.cs.nyu.edu
foo5 at www.cs.nyu.edu
foo6 at www.cs.nyu.edu
foo7 at www.cs.nyu.edu
foo8 at www.cs.nyu.edu
foo9 at www.cs.nyu.edu
 + Removing list testlist2

[... the rest were all fine ...]


There was a locking issue with testlist2 foo4, which is fine since it
doesn't report back as successful.  But you'll also notice that foo10
wasn't listed as a subscriber even though it appeared as though that
subscribe was successful.


Here's some errors on the very next run where I'm subscribing 10 people to
each list as well:

[... cut expected results ...]
 + Subscribers for testlist4:
foo10 at www.cs.nyu.edu
foo1 at www.cs.nyu.edu
foo2 at www.cs.nyu.edu
foo3 at www.cs.nyu.edu
foo4 at www.cs.nyu.edu
foo5 at www.cs.nyu.edu
foo6 at www.cs.nyu.edu [** no foo8 **]
foo7 at www.cs.nyu.edu
foo9 at www.cs.nyu.edu
 + Removing list testlist4
 + Subscribers for testlist5:
foo10 at www.cs.nyu.edu
foo1 at www.cs.nyu.edu
foo2 at www.cs.nyu.edu
foo3 at www.cs.nyu.edu
foo4 at www.cs.nyu.edu
foo5 at www.cs.nyu.edu
foo6 at www.cs.nyu.edu
foo7 at www.cs.nyu.edu [** no foo8 or foo9 **]
 + Removing list testlist5
[... cut expected results ...]


That's my (1-5%) failure.  But when I also comment out the following:

>>             #if mtime < self.__timestamp:
>>             #    # File is not newer
>>             #    return None, None

It seems to work each time (I ran 3 tests in a row, all with expected
results).

Let me know what you think.

Thanks,

--
Drew


On Thu, September 29, 2011 1:30 pm, Mark Sapiro wrote:
> On 9/29/2011 8:58 AM, Andrew Case wrote:
>>
>> When I did this I saw that maybe 1-5% of the time a user was still
>> omitted
>> from the list (user was silently removed).  I think that because these
>> are
>> processed by the queue runner on a different host and because the
>> timestamp check is being done on an NFS stored file, there is potential
>> that the qrunner for this doesn't yet have an updated mtime for that
>> file
>> (or even a small ntp time drift could cause this).  When I commented out
>> the caching part of the code in MailList.py this bug never seems to show
>> up:
>>             #if mtime < self.__timestamp:
>>             #    # File is not newer
>>             #    return None, None
>
>
> Actually, that is not the cache. It is just the test for whether the
> current list object, cached or whatever, needs to be reloaded from disk.
>
> I think that your configuration with NFS and possible time jitter
> between servers makes the bug more likely.
>
>
>> So I think there may still be a race condition here, but the chances of
>> it
>> are unlikely that human interaction would trigger this.  If however, you
>> have a script that is subscribing users (one after another), this could
>> still come up.  I actually happen to have such a script, but I run it on
>> the same host as the qrunners, so I haven't experienced this before.
>
>
> It can happen even where everything is on a single host, but as I said,
> I think your configuration makes it more likely.
>
>
>> In my case I think it's probably not worth keeping the performance gain
>> that the caching adds for sake of consistency.
>
>
> Attached is a patch to remove list caching from the qrunners. This patch
> has the additional advantage of limiting the growth of the qrunners over
> time. Old entries were supposed to be freed from the cache, but a self
> reference in the default MemberAdaptor prevented this from occurring.
>
> For reasons of trying not to be disruptive this patch and the bug fix I
> sent earlier were never applied to the 2.1 branch. I think this was a
> mistake, and I will apply them for Mailman 2.1.15.
>
>
>> Attached is the modified stress test I'm using.
>
>
> Thanks.
>
> --
> Mark Sapiro <mark at msapiro.net>        The highway is for gamblers,
> San Francisco Bay Area, California    better use your sense - B. Dylan
>
>


Andrew Case
Systems Administrator
Courant Institute of Mathematical Sciences
New York University
251 Mercer St., Room 1023
New York, NY 10012-1110
Phone: 212-998-3147


From mark at msapiro.net  Thu Sep 29 22:06:46 2011
From: mark at msapiro.net (Mark Sapiro)
Date: Thu, 29 Sep 2011 13:06:46 -0700
Subject: [Mailman-Developers] Faulty Member Subscribe/Unsubscribes
In-Reply-To: <e8328800b3b99f280f088c49d73a0089.squirrel@webmail.cims.nyu.edu>
Message-ID: <PC1952011092913064602029334a038@MSAPIRO>

Andrew Case wrote:
>
>Traceback (most recent call last):
>  File "/usr/mailman/bin/add_members", line 258, in <module>
>    main()
>  File "/usr/mailman/bin/add_members", line 238, in main
>    addall(mlist, nmembers, 0, send_welcome_msg, s)
>  File "/usr/mailman/bin/add_members", line 135, in addall
>    mlist.ApprovedAddMember(userdesc, ack, 0)
>  File "/usr/mailman/Mailman/MailList.py", line 948, in ApprovedAddMember
>    assert self.Locked()
>AssertionError


The above is definitely a problem, but I can't see how it can occur
unless there is some race condition at the level of the file system.


[...]
>There was a locking issue with testlist2 foo4, which is fine since it
>doesn't report back as successful.  But you'll also notice that foo10
>wasn't listed as a subscriber even though it appeared as though that
>subscribe was successful.


Yes, I see that and both are cause for concern. I am more concerned
about the AssertionError with testlist2 foo4, because I can't see how
that can happen without some file system anomaly.


>Here's some errors on the very next run where I'm subscribing 10 people to
>each list as well:
[...]
>That's my (1-5%) failure.  But when I also comment out the following:
>
>>>             #if mtime < self.__timestamp:
>>>             #    # File is not newer
>>>             #    return None, None
>
>It seems to work each time (I ran 3 tests in a row, all with expected
>results).


The code above is a decision about whether we need to reload the list
object from the file system based on the file system time stamp vs.
our internal time stamp.

The original code said "if mtime <= self.__timestamp:". Since these
time stamps are in whole seconds, that test (to skip loading) could
succeed even if the file time stamp was a fraction of a second newer
than the internal time stamp. Thus, the bug.

The fix is to make the test "if mtime < self.__timestamp:" meanin we
only skip loading if the file time stamp is strictly less than the
internal time stamp, but in your case, if the clock on the MTA/qrunner
machine is a bit faster than that on the machine running the script,
the internal time stamp of the qrunner process could be a second ahead
of the file time stamp set by the add_members process even though on
an absolute scale it is older.

So, I think we're concerned here about clock skew between the machines,
and in that case, commenting out the code completely as you have done
makes sense.

I have filed bug reports at
<https://bugs.launchpad.net/mailman/+bug/862675> and
<https://bugs.launchpad.net/mailman/+bug/862683> about these issues in
preparation to fixing them.

Now I don't know whether to stick with the "if mtime <
self.__timestamp:" test which will work on a single server or to
reload unconditionally as you have done which seems to be necessary in
a shared file system situation with possible time skews.

As far as the AssertionError is concerned, I don't know what to make of
it. It appears to be a file system issue outside of Mailman, so I
don't know how to deal with it. I think the code in Mailman's LockFile
module is correct. If you are willing to try debugging this further,
you could set

LIST_LOCK_DEBUGGING = True

in mm_cfg.py and restart your qrunners and try to reproduce the
exception. This will log copious information to Mailman's 'locks' log
which may help to understand what happened.

-- 
Mark Sapiro <mark at msapiro.net>        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan