From barry at list.org Mon Sep 26 00:26:28 2011 From: barry at list.org (Barry Warsaw) Date: Sun, 25 Sep 2011 18:26:28 -0400 Subject: [Mailman-Developers] RELEASED: Mailman 3.0 alpha 8 Message-ID: <20110925182628.30cd0a0c@resist.wooz.org> I am very happy to announce the release of the eighth alpha for Mailman 3.0, code named "Where's My Thing?". This is the last planned alpha release, as I want to work toward the first beta in order to meet my goal of an 11/11/11 final release (of the core engine at least). If you've been holding off looking at Mailman 3, I invite you to do so now. Once beta 1 is released I will not be adding any new features. I do hope to put up a few live test lists soon, so stay tuned. There have been a large number of fixes and new features, especially in the REST API. My thanks go to Stephen Goss who has contributed greatly to this release, with bug reports, wish list items, patches, and merge proposals. Full details of what's new in 3.0a8 is available here: http://tinyurl.com/6yxgclf The tarball can be downloaded from Launchpad or the Cheeseshop: https://launchpad.net/mailman http://pypi.python.org/pypi/mailman/3.0.0a8 The full documentation is also online: http://packages.python.org/mailman/ Enjoy, -Barry -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 836 bytes Desc: not available URL: From barry at python.org Tue Sep 27 03:03:38 2011 From: barry at python.org (Barry Warsaw) Date: Mon, 26 Sep 2011 21:03:38 -0400 Subject: [Mailman-Developers] [Bug 860159] Re: Mailman 3.0 support for Postgres In-Reply-To: <20110927004418.10717.68620.malone@gac.canonical.com> References: <20110927001849.9391.31768.malonedeb@gac.canonical.com> <20110927004418.10717.68620.malone@gac.canonical.com> Message-ID: <20110926210338.05782ba4@resist.wooz.org> On Sep 27, 2011, at 12:44 AM, Stephen A. Goss wrote: >I've attached a branch with the Postgres support code that I've cooked >up. Fantastic, thanks! Do be sure to add a merge proposal. >3. There is now an alternate mailman_pg.sql file which is used to create >the tables. Currently, two foreign key constraints are commented out >because those are violated in Mailman 3 (apparently this doesn't bother >SQLite). Some column TYPES are different from mailman.sql. Classes are >created in a slightly different order due to FK constraint creation >requires the referenced table to actually exist. The primary key indexes >defined after each class are probably redundant, as those are created >automatically for SERIAL columns defined as PRIMARY KEY in Postgres. Do you have any thoughts on whether the two .sql files can possibly be shared? My biggest concern is that it will be difficult-ish to keep them in sync as I add or modify the SQLite version. If it's not possible, so be it. I took a quick look at the Python changes, and I think I'm going to refactor the code to not hardcode so much in stock.py. For example, I'll probably rename StockDatabase to SQLiteDatabase and add a PostgresDatabase class, adding a common super class. That way, you'd only need to put this in your mailman.cfg file: [database] class: mailman.database.postgres.PostgresDatabase Don't worry about that too much, I can make that change when I merge your branch. >4. Probably more FK constraint violations exist that my tests haven't >uncovered. I'd definitely like to be able to run the test suite against Postgres, if even for now it's a manual select (e.g. 
because Postgres would obviously have to be installed and configured in order to work). From omacneil at thecsl.org Tue Sep 27 03:11:54 2011 From: omacneil at thecsl.org (Dan MacNeil) Date: Mon, 26 Sep 2011 21:11:54 -0400 Subject: [Mailman-Developers] [Bug 860159] Re: Mailman 3.0 support for Postgres In-Reply-To: <20110926210338.05782ba4@resist.wooz.org> References: <20110927001849.9391.31768.malonedeb@gac.canonical.com> <20110927004418.10717.68620.malone@gac.canonical.com> <20110926210338.05782ba4@resist.wooz.org> Message-ID: <4E8122DA.7030607@thecsl.org> On 09/26/2011 09:03 PM, Barry Warsaw wrote: >> Currently, two foreign key constraints are commented out >> because those are violated in Mailman 3 (apparently this doesn't bother >> SQLite). Foreign key constraints are available in sqlite 3.6.19 and above. They are turned off by default. They can be enabled with: PRAGMA foreign_keys = ON; More details at: http://www.sqlite.org/foreignkeys.html From barry at python.org Tue Sep 27 03:27:13 2011 From: barry at python.org (Barry Warsaw) Date: Mon, 26 Sep 2011 21:27:13 -0400 Subject: [Mailman-Developers] [Bug 860159] Re: Mailman 3.0 support for Postgres In-Reply-To: <4E8122DA.7030607@thecsl.org> References: <20110927001849.9391.31768.malonedeb@gac.canonical.com> <20110927004418.10717.68620.malone@gac.canonical.com> <20110926210338.05782ba4@resist.wooz.org> <4E8122DA.7030607@thecsl.org> Message-ID: <20110926212713.3329062f@resist.wooz.org> On Sep 26, 2011, at 09:11 PM, Dan MacNeil wrote: >Foreign key constraints are available in sqlite 3.6.19 and above. They are >turned off by default. They can be enabled with: > > PRAGMA foreign_keys = ON; > >More details at: > http://www.sqlite.org/foreignkeys.html Thanks. AFAICT, Ubuntu's sqlite3 package is compiled with this enabled, and adding this line to mailman.sql still passes all the tests. -Barry From felipe at felipegasper.com Tue Sep 27 21:20:37 2011 From: felipe at felipegasper.com (Felipe Gasper) Date: Tue, 27 Sep 2011 14:20:37 -0500 Subject: [Mailman-Developers] Hello Message-ID: <4E822205.2040504@felipegasper.com> Hi all, Barry said to email this list with an interest in helping with UI for MM3. I've done UI development for cPanel, Inc. for the past couple years. I'm fluent with Perl and JS/CSS/HTML/etc. I messed with Python a tiny bit about 8 years ago; I'm hoping to beef up my skills in that area as part of helping out with MM3. I'd say I'm reasonably well-versed in SMTP. I've used Git and SVN; Bazaar will be new to me. I'm looking forward to helping out. cheers, -Felipe Gasper Houston, TX From barry at list.org Wed Sep 28 22:35:28 2011 From: barry at list.org (Barry Warsaw) Date: Wed, 28 Sep 2011 16:35:28 -0400 Subject: [Mailman-Developers] Hello In-Reply-To: <4E822205.2040504@felipegasper.com> References: <4E822205.2040504@felipegasper.com> Message-ID: <20110928163528.67436857@resist.wooz.org> Hi Felipe, On Sep 27, 2011, at 02:20 PM, Felipe Gasper wrote: > I've done UI development for cPanel, Inc. for the past couple years. I'm > fluent with Perl and JS/CSS/HTML/etc. I messed with Python a tiny bit about > 8 years ago; I'm hoping to beef up my skills in that area as part of helping > out with MM3. I'd say I'm reasonably well-versed in SMTP. > > I've used Git and SVN; Bazaar will be new to me. > > I'm looking forward to helping out. Welcome! I think you'll have no problems with Python and Bazaar, but do feel free to ask any questions, either here or on freenode #mailman. 
Florian and Terri will probably be able to better answer questions about the web ui part of the project. Cheers, -Barry -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 836 bytes Desc: not available URL: From acase at cims.nyu.edu Thu Sep 29 00:08:59 2011 From: acase at cims.nyu.edu (Andrew Case) Date: Wed, 28 Sep 2011 18:08:59 -0400 Subject: [Mailman-Developers] Faulty Member Subscribe/Unsubscribes Message-ID: <95e21858fd84b36004ac74e3847d32e1.squirrel@webmail.cims.nyu.edu> My configuration: Mailman: 2.1.14 OS: Solaris 10 Python: 2.4.5 PREFIX = '/usr/mailman' Server setup: 1 server for web management, 1 server for MTA/qrunner. /usr/mailman is NFS mounted on both servers I've been having the following issue my mailman lists: A user is either subscribed or unsubscribed according to the logs, but then if I look at the member list, the action has not been done (or has been undone). For example, here is where I remove a subscriber and then look at the list members and they are still in the list: [mailman at myhost] ~/logs |> grep testlist subscribe | grep acase Sep 28 17:15:14 2011 (4401) testlist: new acase at example.com, admin mass sub Sep 28 17:19:36 2011 (5821) testlist: deleted acase at example.com; member mgt page [mailman at myhost] ~/logs |> ../bin/list_members testlist | grep acase acase at example.com [mailman at myhost] ~/logs |> The same also happens when subscribing. I will mass subscribe users (or when users confirm subscription via email/web), the logs indicated that they have been subscribed successfully, but then when I go look them up, they are not listed on the members list. This happens sporadically, but I am generally able to reproduce the error if I do it a couple times in a row. I'm suspicious there may be a locking issue and config.pck is reverting to config.pck.last. I found this thread rather helpful in analyzing potential problems, but I have yet to figure anything out: http://web.archiveorange.com/archive/v/IezAOgEQf7xEYCSEJTbD In addition if I just run the following commands over and over, then the bug never seems to come up. This is part of why I am worrying about locking: bin/add_members ... bin/remove_members ... Is there a good way to test locking between servers? I've run the tests/test_lockfile.py, but it reports it is OK. Any and all help would be GREATLY appreciated. We've been trying to triage this bug for weeks and it is terribly disruptive for our users. Thanks, -- Drew From mark at msapiro.net Thu Sep 29 04:48:19 2011 From: mark at msapiro.net (Mark Sapiro) Date: Wed, 28 Sep 2011 19:48:19 -0700 Subject: [Mailman-Developers] Faulty Member Subscribe/Unsubscribes In-Reply-To: <95e21858fd84b36004ac74e3847d32e1.squirrel@webmail.cims.nyu.edu> References: <95e21858fd84b36004ac74e3847d32e1.squirrel@webmail.cims.nyu.edu> Message-ID: <4E83DC73.90708@msapiro.net> On 9/28/2011 3:08 PM, Andrew Case wrote: > My configuration: > Mailman: 2.1.14 > OS: Solaris 10 > Python: 2.4.5 > PREFIX = '/usr/mailman' > Server setup: 1 server for web management, 1 server for MTA/qrunner. > /usr/mailman is NFS mounted on both servers > > > I've been having the following issue my mailman lists: > > A user is either subscribed or unsubscribed according to the logs, but > then if I look at the member list, the action has not been done (or has > been undone). 
For example, here is where I remove a subscriber and then > look at the list members and they are still in the list: > > [mailman at myhost] ~/logs |> grep testlist subscribe | grep acase > Sep 28 17:15:14 2011 (4401) testlist: new acase at example.com, admin mass sub > Sep 28 17:19:36 2011 (5821) testlist: deleted acase at example.com; member > mgt page > [mailman at myhost] ~/logs |> ../bin/list_members testlist | grep acase > acase at example.com > [mailman at myhost] ~/logs |> There is a bug in the Mailman 2.1 branch, but the above is not it. The above log shows that acase at example.com was added by admin mass subscribe at 17:15:14 and then a bit more than 4 minutes later, was removed by checking the unsub box on the admin Membership List and submitting. If you check your web server logs, you will find POST transactions to the admin page for both these events. > The same also happens when subscribing. I will mass subscribe users (or > when users confirm subscription via email/web), the logs indicated that > they have been subscribed successfully, but then when I go look them up, > they are not listed on the members list. > > This happens sporadically, but I am generally able to reproduce the error > if I do it a couple times in a row. This is possibly a manifestation of the bug, but I'm surprised it is happening that frequently. > I'm suspicious there may be a locking issue and config.pck is reverting to > config.pck.last. I found this thread rather helpful in analyzing > potential problems, but I have yet to figure anything out: > http://web.archiveorange.com/archive/v/IezAOgEQf7xEYCSEJTbD The thread you point to above is relevant, but it is not a locking issue. The problem is due to list caching in Mailman/Queue/Runner.py and/or nearly concurrent processes which first load the list unlocked and later lock it. The issue is that the resolution of the config.pck timestamp is 1 second, and if a process has a list object and that list object is updated by another process within the same second as the timestamp on the first process's object, the first process won't load the updated list when it locks it. This can result in things like a subscribe being done and logged and then silently reversed. List locking is working as it should. The issue is that the first process doesn't reload the updated list when it acquires the lock because it thinks it already has the latest version. I thought I had fixed this on the 2.1 branch, but it seems I only fixed it for the now defunct 2.2 branch. A relevant thread starts at and continues at The patch in the attached cache.patch file should fix it. > In addition if I just run the following commands over and over, then the > bug never seems to come up. This is part of why I am worrying about > locking: > bin/add_members ... > bin/remove_members ... That won't do it. bin/add_members alone will do it, but only if there is a nearly concurrent process updating the same list. > Is there a good way to test locking between servers? I've run the > tests/test_lockfile.py, but it reports it is OK. > > Any and all help would be GREATLY appreciated. We've been trying to > triage this bug for weeks and it is terribly disruptive for our users. The post at contains a "stress test" that will probably reproduce the problem. I suspect your Mailman server must be very busy for you to see this bug that frequently. However, it looks like I need to install the fix for Mailman 2.1.15. 
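To make the timing concrete, here is a much simplified sketch of the kind of freshness check involved (the class and method names here are invented for illustration; the actual check lives in MailList.py and differs in detail):

    import os

    class CachedList:
        # Illustration only -- not the real Mailman code.
        def __init__(self, path):
            self.path = path
            self.timestamp = 0      # whole seconds, like the config.pck check
            self.data = None

        def load_if_newer(self):
            mtime = int(os.path.getmtime(self.path))
            if mtime <= self.timestamp:
                # Looks "not newer", so skip the reload.  If another process
                # saved the file within this same second, its changes are
                # invisible here and get clobbered by our next save.
                return
            self.data = open(self.path, 'rb').read()
            self.timestamp = mtime

Because the comparison has only one second of resolution, an update that lands in the same second as the cached timestamp is treated as stale, and the process that still holds the old in-memory copy wins when it saves. 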
It is also curious that the only reports of this that I can recall both come from solaris users. There may be complications in your case due to NFS, but locking shouldn't be the issue. Run the stress test and see if it fails. If it does, try the patch. Let us know what happens. -- Mark Sapiro The highway is for gamblers, San Francisco Bay Area, California better use your sense - B. Dylan -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: cache.patch URL: From acase at cims.nyu.edu Thu Sep 29 08:52:59 2011 From: acase at cims.nyu.edu (Andrew Case) Date: Thu, 29 Sep 2011 02:52:59 -0400 Subject: [Mailman-Developers] Faulty Member Subscribe/Unsubscribes In-Reply-To: <4E83DC73.90708@msapiro.net> References: <95e21858fd84b36004ac74e3847d32e1.squirrel@webmail.cims.nyu.edu> <4E83DC73.90708@msapiro.net> Message-ID: Thanks Mark, see inline comments. >> [mailman at myhost] ~/logs |> grep testlist subscribe | grep acase >> Sep 28 17:15:14 2011 (4401) testlist: new acase at example.com, admin mass >> sub >> Sep 28 17:19:36 2011 (5821) testlist: deleted acase at example.com; member >> mgt page >> [mailman at myhost] ~/logs |> ../bin/list_members testlist | grep acase >> acase at example.com >> [mailman at myhost] ~/logs |> > > > There is a bug in the Mailman 2.1 branch, but the above is not it. The > above log shows that acase at example.com was added by admin mass subscribe > at 17:15:14 and then a bit more than 4 minutes later, was removed by > checking the unsub box on the admin Membership List and submitting. I was trying to show that even after the user was removed, they're still listed as a member. > If you check your web server logs, you will find POST transactions to > the admin page for both these events. Agreed. >> The same also happens when subscribing. I will mass subscribe users (or >> when users confirm subscription via email/web), the logs indicated that >> they have been subscribed successfully, but then when I go look them up, >> they are not listed on the members list. >> >> This happens sporadically, but I am generally able to reproduce the >> error >> if I do it a couple times in a row. > > > This is possibly a manifestation of the bug, but I'm surprised it is > happening that frequently. Easiest way for me to replicated the problem is: 1) check the unsubscribe box for user A then hit submit 2) after reload check the unsubscribe box for user B then hit submit 3) reaload the "membership list" page and user B is back on the list This happens even after I wait a couple seconds in between each step. >> I'm suspicious there may be a locking issue and config.pck is reverting >> to >> config.pck.last. I found this thread rather helpful in analyzing >> potential problems, but I have yet to figure anything out: >> http://web.archiveorange.com/archive/v/IezAOgEQf7xEYCSEJTbD > > > The thread you point to above is relevant, but it is not a locking > issue. The problem is due to list caching in Mailman/Queue/Runner.py > and/or nearly concurrent processes which first load the list unlocked > and later lock it. The issue is that the resolution of the config.pck > timestamp is 1 second, and if a process has a list object and that list > object is updated by another process within the same second as the > timestamp on the first process's object, the first process won't load > the updated list when it locks it. This can result in things like a > subscribe being done and logged and then silently reversed. 
The result sounds the same, but would this happen even if I'm loading the page with more than a second in between each step outlined above? > List locking is working as it should. The issue is that the first > process doesn't reload the updated list when it acquires the lock > because it thinks it already has the latest version. > > I thought I had fixed this on the 2.1 branch, but it seems I only fixed > it for the now defunct 2.2 branch. > > A relevant thread starts at > > and continues at > > > The patch in the attached cache.patch file should fix it. I applied the patch but it doesn't seem to have made a difference. >> In addition if I just run the following commands over and over, then the >> bug never seems to come up. This is part of why I am worrying about >> locking: >> bin/add_members ... >> bin/remove_members ... > > > That won't do it. bin/add_members alone will do it, but only if there is > a nearly concurrent process updating the same list. > > >> Is there a good way to test locking between servers? I've run the >> tests/test_lockfile.py, but it reports it is OK. >> >> Any and all help would be GREATLY appreciated. We've been trying to >> triage this bug for weeks and it is terribly disruptive for our users. > > > The post at > > contains a "stress test" that will probably reproduce the problem. Correct. Only one subscriber was subscribed to each test list. Keep in mind that in the stress test given if you use a sleep counter of 5 with 6 lists, that means you're waiting _30 seconds_ before the next add_member command is run for that list (I'm assume the timing issue is per-list, not per run of add_members). Even if you set the timer down to 1 that's a 6 second sleep. This shouldn't effect a cache that we're comparing for the given second. Anyway, my script ran fine with the 5 second sleep (30 seconds per list add), but showed discrepancies with a 3 second sleep. > I suspect your Mailman server must be very busy for you to see this bug > that frequently. However, it looks like I need to install the fix for > Mailman 2.1.15. We run about 600 different mailing lists for our department and this has been a continues headache. I appreciate all the hard work you guys do. > It is also curious that the only reports of this that I can recall both > come from solaris users. There may be complications in your case due to > NFS, but locking shouldn't be the issue. Run the stress test and see if > it fails. If it does, try the patch. Patch didn't seem to help. Is there an easy way to omit the caching in this? Thanks, -- Drew > > Let us know what happens. > > -- > Mark Sapiro The highway is for gamblers, > San Francisco Bay Area, California better use your sense - B. Dylan > > From acase at cims.nyu.edu Thu Sep 29 09:06:00 2011 From: acase at cims.nyu.edu (Andrew Case) Date: Thu, 29 Sep 2011 03:06:00 -0400 Subject: [Mailman-Developers] Faulty Member Subscribe/Unsubscribes In-Reply-To: <4E83DC73.90708@msapiro.net> References: <95e21858fd84b36004ac74e3847d32e1.squirrel@webmail.cims.nyu.edu> <4E83DC73.90708@msapiro.net> Message-ID: Mark, Another question below... >> I'm suspicious there may be a locking issue and config.pck is reverting >> to >> config.pck.last. I found this thread rather helpful in analyzing >> potential problems, but I have yet to figure anything out: >> http://web.archiveorange.com/archive/v/IezAOgEQf7xEYCSEJTbD > > > The thread you point to above is relevant, but it is not a locking > issue. 
The problem is due to list caching in Mailman/Queue/Runner.py > and/or nearly concurrent processes which first load the list unlocked > and later lock it. The issue is that the resolution of the config.pck > timestamp is 1 second, and if a process has a list object and that list > object is updated by another process within the same second as the > timestamp on the first process's object, the first process won't load > the updated list when it locks it. This can result in things like a > subscribe being done and logged and then silently reversed. You think it should be okay though if my qrunners are all running on my mta server instead of my webserver though. This wouldn't be causing a problem with the caching right? Thanks, -- Drew From acase at cims.nyu.edu Thu Sep 29 10:13:46 2011 From: acase at cims.nyu.edu (Andrew Case) Date: Thu, 29 Sep 2011 04:13:46 -0400 Subject: [Mailman-Developers] Faulty Member Subscribe/Unsubscribes In-Reply-To: References: <95e21858fd84b36004ac74e3847d32e1.squirrel@webmail.cims.nyu.edu> <4E83DC73.90708@msapiro.net> Message-ID: Mark, I realized I hadn't restarted my QRunners after this patch. It looks like its working perfectly now! Even with a sleep of 0. Thanks so much! -- Drew On Thu, September 29, 2011 2:52 am, Andrew Case wrote: > Thanks Mark, see inline comments. > >>> [mailman at myhost] ~/logs |> grep testlist subscribe | grep acase >>> Sep 28 17:15:14 2011 (4401) testlist: new acase at example.com, admin mass >>> sub >>> Sep 28 17:19:36 2011 (5821) testlist: deleted acase at example.com; member >>> mgt page >>> [mailman at myhost] ~/logs |> ../bin/list_members testlist | grep acase >>> acase at example.com >>> [mailman at myhost] ~/logs |> >> >> >> There is a bug in the Mailman 2.1 branch, but the above is not it. The >> above log shows that acase at example.com was added by admin mass subscribe >> at 17:15:14 and then a bit more than 4 minutes later, was removed by >> checking the unsub box on the admin Membership List and submitting. > > I was trying to show that even after the user was removed, they're still > listed as a member. > >> If you check your web server logs, you will find POST transactions to >> the admin page for both these events. > > Agreed. > >>> The same also happens when subscribing. I will mass subscribe users >>> (or >>> when users confirm subscription via email/web), the logs indicated that >>> they have been subscribed successfully, but then when I go look them >>> up, >>> they are not listed on the members list. >>> >>> This happens sporadically, but I am generally able to reproduce the >>> error >>> if I do it a couple times in a row. >> >> >> This is possibly a manifestation of the bug, but I'm surprised it is >> happening that frequently. > > Easiest way for me to replicated the problem is: > 1) check the unsubscribe box for user A then hit submit > 2) after reload check the unsubscribe box for user B then hit submit > 3) reaload the "membership list" page and user B is back on the list > > This happens even after I wait a couple seconds in between each step. > >>> I'm suspicious there may be a locking issue and config.pck is reverting >>> to >>> config.pck.last. I found this thread rather helpful in analyzing >>> potential problems, but I have yet to figure anything out: >>> http://web.archiveorange.com/archive/v/IezAOgEQf7xEYCSEJTbD >> >> >> The thread you point to above is relevant, but it is not a locking >> issue. 
The problem is due to list caching in Mailman/Queue/Runner.py >> and/or nearly concurrent processes which first load the list unlocked >> and later lock it. The issue is that the resolution of the config.pck >> timestamp is 1 second, and if a process has a list object and that list >> object is updated by another process within the same second as the >> timestamp on the first process's object, the first process won't load >> the updated list when it locks it. This can result in things like a >> subscribe being done and logged and then silently reversed. > > The result sounds the same, but would this happen even if I'm loading the > page with more than a second in between each step outlined above? > >> List locking is working as it should. The issue is that the first >> process doesn't reload the updated list when it acquires the lock >> because it thinks it already has the latest version. >> >> I thought I had fixed this on the 2.1 branch, but it seems I only fixed >> it for the now defunct 2.2 branch. >> >> A relevant thread starts at >> >> and continues at >> >> >> The patch in the attached cache.patch file should fix it. > > I applied the patch but it doesn't seem to have made a difference. > > >>> In addition if I just run the following commands over and over, then >>> the >>> bug never seems to come up. This is part of why I am worrying about >>> locking: >>> bin/add_members ... >>> bin/remove_members ... >> >> >> That won't do it. bin/add_members alone will do it, but only if there is >> a nearly concurrent process updating the same list. >> >> >>> Is there a good way to test locking between servers? I've run the >>> tests/test_lockfile.py, but it reports it is OK. >>> >>> Any and all help would be GREATLY appreciated. We've been trying to >>> triage this bug for weeks and it is terribly disruptive for our users. >> >> >> The post at >> >> contains a "stress test" that will probably reproduce the problem. > > Correct. Only one subscriber was subscribed to each test list. Keep in > mind that in the stress test given if you use a sleep counter of 5 with 6 > lists, that means you're waiting _30 seconds_ before the next add_member > command is run for that list (I'm assume the timing issue is per-list, not > per run of add_members). Even if you set the timer down to 1 that's a 6 > second sleep. This shouldn't effect a cache that we're comparing for the > given second. Anyway, my script ran fine with the 5 second sleep (30 > seconds per list add), but showed discrepancies with a 3 second sleep. > >> I suspect your Mailman server must be very busy for you to see this bug >> that frequently. However, it looks like I need to install the fix for >> Mailman 2.1.15. > > We run about 600 different mailing lists for our department and this has > been a continues headache. I appreciate all the hard work you guys do. > >> It is also curious that the only reports of this that I can recall both >> come from solaris users. There may be complications in your case due to >> NFS, but locking shouldn't be the issue. Run the stress test and see if >> it fails. If it does, try the patch. > > Patch didn't seem to help. Is there an easy way to omit the caching in > this? > > Thanks, > -- > Drew > >> >> Let us know what happens. >> >> -- >> Mark Sapiro The highway is for gamblers, >> San Francisco Bay Area, California better use your sense - B. 
Dylan >> >> > > Andrew Case Systems Administrator Courant Institute of Mathematical Sciences New York University 251 Mercer St., Room 1023 New York, NY 10012-1110 Phone: 212-998-3147 From f at state-of-mind.de Thu Sep 29 13:36:07 2011 From: f at state-of-mind.de (Florian Fuchs) Date: Thu, 29 Sep 2011 13:36:07 +0200 Subject: [Mailman-Developers] Hello In-Reply-To: <20110928163528.67436857@resist.wooz.org> References: <4E822205.2040504@felipegasper.com> <20110928163528.67436857@resist.wooz.org> Message-ID: <8E6C58F5-5DE3-4F92-A439-6FCA9E966A48@state-of-mind.de> Hi Felipe, as for the UI I think a good starting point would be to check out the work Benedict Stein and Anna Granudd have done during the last two Google Summers of Code. You can find it here: https://launchpad.net/~mailmanwebgsoc2011 The web UI is based on the Django framework and communicates with the Mailman core via a REST API. In order to do so it depends on a client library which translates some of the HTTP API calls into Python object logic. The library code can be found here: https://launchpad.net/mailman.client Benedict has also created a detailed step by step installation page for mailman3a7 and the current web UI (I'm not sure if MM3's current Alpha8 is fully compatible with the client library as it's brand new... I will try to do any necessary adjustments as soon as possible...) http://wiki.list.org/pages/viewpage.action?pageId=11960560 As Barry said, feel free to ask any questions! Cheers Florian On 28.09.2011, at 22:35, Barry Warsaw wrote: > Hi Felipe, > > On Sep 27, 2011, at 02:20 PM, Felipe Gasper wrote: > >> I've done UI development for cPanel, Inc. for the past couple years. I'm >> fluent with Perl and JS/CSS/HTML/etc. I messed with Python a tiny bit about >> 8 years ago; I'm hoping to beef up my skills in that area as part of helping >> out with MM3. I'd say I'm reasonably well-versed in SMTP. >> >> I've used Git and SVN; Bazaar will be new to me. >> >> I'm looking forward to helping out. > > Welcome! I think you'll have no problems with Python and Bazaar, but do feel > free to ask any questions, either here or on freenode #mailman. Florian and > Terri will probably be able to better answer questions about the web ui part > of the project. > > Cheers, > -Barry > > _______________________________________________ > Mailman-Developers mailing list > Mailman-Developers at python.org > http://mail.python.org/mailman/listinfo/mailman-developers > Mailman FAQ: http://wiki.list.org/x/AgA3 > Searchable Archives: http://www.mail-archive.com/mailman-developers%40python.org/ > Unsubscribe: http://mail.python.org/mailman/options/mailman-developers/f%40state-of-mind.de > > Security Policy: http://wiki.list.org/x/QIA9 From acase at cims.nyu.edu Thu Sep 29 17:58:54 2011 From: acase at cims.nyu.edu (Andrew Case) Date: Thu, 29 Sep 2011 11:58:54 -0400 Subject: [Mailman-Developers] Faulty Member Subscribe/Unsubscribes In-Reply-To: References: <95e21858fd84b36004ac74e3847d32e1.squirrel@webmail.cims.nyu.edu> <4E83DC73.90708@msapiro.net> Message-ID: Mark, I don't know if you'll find this worth trying to fix, but I revised the stress test this morning to be a bit more stressful. The biggest change I made (besides upping the number of subscribers) was that instead of adding each member to all the lists before adding the second member, it now adds all the members to one list successively before moving on to the next list. 
This causes a much lower amount of time between subscriptions per list, causing the config.pck files to be updated in a much smaller timeframe. When I did this I saw that maybe 1-5% of the time a user was still omitted from the list (user was silently removed). I think that because these are processed by the queue runner on a different host and because the timestamp check is being done on an NFS stored file, there is potential that the qrunner for this doesn't yet have an updated mtime for that file (or even a small ntp time drift could cause this). When I commented out the caching part of the code in MailList.py this bug never seems to show up: #if mtime < self.__timestamp: # # File is not newer # return None, None So I think there may still be a race condition here, but the chances of it are unlikely that human interaction would trigger this. If however, you have a script that is subscribing users (one after another), this could still come up. I actually happen to have such a script, but I run it on the same host as the qrunners, so I haven't experienced this before. In my case I think it's probably not worth keeping the performance gain that the caching adds for sake of consistency. Attached is the modified stress test I'm using. Thanks again, -- Drew On Thu, September 29, 2011 4:13 am, Andrew Case wrote: > Mark, > > I realized I hadn't restarted my QRunners after this patch. It looks like > its working perfectly now! Even with a sleep of 0. Thanks so much! > > -- > Drew > > On Thu, September 29, 2011 2:52 am, Andrew Case wrote: >> Thanks Mark, see inline comments. >> >>>> [mailman at myhost] ~/logs |> grep testlist subscribe | grep acase >>>> Sep 28 17:15:14 2011 (4401) testlist: new acase at example.com, admin >>>> mass >>>> sub >>>> Sep 28 17:19:36 2011 (5821) testlist: deleted acase at example.com; >>>> member >>>> mgt page >>>> [mailman at myhost] ~/logs |> ../bin/list_members testlist | grep acase >>>> acase at example.com >>>> [mailman at myhost] ~/logs |> >>> >>> >>> There is a bug in the Mailman 2.1 branch, but the above is not it. The >>> above log shows that acase at example.com was added by admin mass >>> subscribe >>> at 17:15:14 and then a bit more than 4 minutes later, was removed by >>> checking the unsub box on the admin Membership List and submitting. >> >> I was trying to show that even after the user was removed, they're still >> listed as a member. >> >>> If you check your web server logs, you will find POST transactions to >>> the admin page for both these events. >> >> Agreed. >> >>>> The same also happens when subscribing. I will mass subscribe users >>>> (or >>>> when users confirm subscription via email/web), the logs indicated >>>> that >>>> they have been subscribed successfully, but then when I go look them >>>> up, >>>> they are not listed on the members list. >>>> >>>> This happens sporadically, but I am generally able to reproduce the >>>> error >>>> if I do it a couple times in a row. >>> >>> >>> This is possibly a manifestation of the bug, but I'm surprised it is >>> happening that frequently. >> >> Easiest way for me to replicated the problem is: >> 1) check the unsubscribe box for user A then hit submit >> 2) after reload check the unsubscribe box for user B then hit submit >> 3) reaload the "membership list" page and user B is back on the list >> >> This happens even after I wait a couple seconds in between each step. >> >>>> I'm suspicious there may be a locking issue and config.pck is >>>> reverting >>>> to >>>> config.pck.last. 
I found this thread rather helpful in analyzing >>>> potential problems, but I have yet to figure anything out: >>>> http://web.archiveorange.com/archive/v/IezAOgEQf7xEYCSEJTbD >>> >>> >>> The thread you point to above is relevant, but it is not a locking >>> issue. The problem is due to list caching in Mailman/Queue/Runner.py >>> and/or nearly concurrent processes which first load the list unlocked >>> and later lock it. The issue is that the resolution of the config.pck >>> timestamp is 1 second, and if a process has a list object and that list >>> object is updated by another process within the same second as the >>> timestamp on the first process's object, the first process won't load >>> the updated list when it locks it. This can result in things like a >>> subscribe being done and logged and then silently reversed. >> >> The result sounds the same, but would this happen even if I'm loading >> the >> page with more than a second in between each step outlined above? >> >>> List locking is working as it should. The issue is that the first >>> process doesn't reload the updated list when it acquires the lock >>> because it thinks it already has the latest version. >>> >>> I thought I had fixed this on the 2.1 branch, but it seems I only fixed >>> it for the now defunct 2.2 branch. >>> >>> A relevant thread starts at >>> >>> and continues at >>> >>> >>> The patch in the attached cache.patch file should fix it. >> >> I applied the patch but it doesn't seem to have made a difference. >> >> >>>> In addition if I just run the following commands over and over, then >>>> the >>>> bug never seems to come up. This is part of why I am worrying about >>>> locking: >>>> bin/add_members ... >>>> bin/remove_members ... >>> >>> >>> That won't do it. bin/add_members alone will do it, but only if there >>> is >>> a nearly concurrent process updating the same list. >>> >>> >>>> Is there a good way to test locking between servers? I've run the >>>> tests/test_lockfile.py, but it reports it is OK. >>>> >>>> Any and all help would be GREATLY appreciated. We've been trying to >>>> triage this bug for weeks and it is terribly disruptive for our users. >>> >>> >>> The post at >>> >>> contains a "stress test" that will probably reproduce the problem. >> >> Correct. Only one subscriber was subscribed to each test list. Keep in >> mind that in the stress test given if you use a sleep counter of 5 with >> 6 >> lists, that means you're waiting _30 seconds_ before the next add_member >> command is run for that list (I'm assume the timing issue is per-list, >> not >> per run of add_members). Even if you set the timer down to 1 that's a 6 >> second sleep. This shouldn't effect a cache that we're comparing for >> the >> given second. Anyway, my script ran fine with the 5 second sleep (30 >> seconds per list add), but showed discrepancies with a 3 second sleep. >> >>> I suspect your Mailman server must be very busy for you to see this bug >>> that frequently. However, it looks like I need to install the fix for >>> Mailman 2.1.15. >> >> We run about 600 different mailing lists for our department and this has >> been a continues headache. I appreciate all the hard work you guys do. >> >>> It is also curious that the only reports of this that I can recall both >>> come from solaris users. There may be complications in your case due to >>> NFS, but locking shouldn't be the issue. Run the stress test and see if >>> it fails. If it does, try the patch. >> >> Patch didn't seem to help. 
Is there an easy way to omit the caching in >> this? >> >> Thanks, >> -- >> Drew >> >>> >>> Let us know what happens. >>> >>> -- >>> Mark Sapiro The highway is for gamblers, >>> San Francisco Bay Area, California better use your sense - B. Dylan >>> >>> >> >> > > > Andrew Case > Systems Administrator > Courant Institute of Mathematical Sciences > New York University > 251 Mercer St., Room 1023 > New York, NY 10012-1110 > Phone: 212-998-3147 > > Andrew Case Systems Administrator Courant Institute of Mathematical Sciences New York University 251 Mercer St., Room 1023 New York, NY 10012-1110 Phone: 212-998-3147 -------------- next part -------------- A non-text attachment was scrubbed... Name: test_subscribe Type: application/octet-stream Size: 759 bytes Desc: not available URL: From mark at msapiro.net Thu Sep 29 18:44:03 2011 From: mark at msapiro.net (Mark Sapiro) Date: Thu, 29 Sep 2011 09:44:03 -0700 Subject: [Mailman-Developers] Faulty Member Subscribe/Unsubscribes In-Reply-To: References: <95e21858fd84b36004ac74e3847d32e1.squirrel@webmail.cims.nyu.edu> <4E83DC73.90708@msapiro.net> Message-ID: <4E84A053.6040900@msapiro.net> On 9/28/2011 11:52 PM, Andrew Case wrote: > Thanks Mark, see inline comments. > >>> [mailman at myhost] ~/logs |> grep testlist subscribe | grep acase >>> Sep 28 17:15:14 2011 (4401) testlist: new acase at example.com, admin mass >>> sub >>> Sep 28 17:19:36 2011 (5821) testlist: deleted acase at example.com; member >>> mgt page >>> [mailman at myhost] ~/logs |> ../bin/list_members testlist | grep acase >>> acase at example.com >>> [mailman at myhost] ~/logs |> >> >> >> There is a bug in the Mailman 2.1 branch, but the above is not it. The >> above log shows that acase at example.com was added by admin mass subscribe >> at 17:15:14 and then a bit more than 4 minutes later, was removed by >> checking the unsub box on the admin Membership List and submitting. > > I was trying to show that even after the user was removed, they're still > listed as a member. Sorry, I missed that. You are correct and this does appear to be a manifestation of the same issue. [...] >> The thread you point to above is relevant, but it is not a locking >> issue. The problem is due to list caching in Mailman/Queue/Runner.py >> and/or nearly concurrent processes which first load the list unlocked >> and later lock it. The issue is that the resolution of the config.pck >> timestamp is 1 second, and if a process has a list object and that list >> object is updated by another process within the same second as the >> timestamp on the first process's object, the first process won't load >> the updated list when it locks it. This can result in things like a >> subscribe being done and logged and then silently reversed. > > The result sounds the same, but would this happen even if I'm loading the > page with more than a second in between each step outlined above? It is tricky. Each add_members, remove_members and web CGI post is a separate process. If these processes are run sequentially, there should not be any problem because each process will load the list, lock it update it and save it before the next process loads it. The problem occurs when processes run concurrently. 
The scenario is process A loads the list unlocked; process B locks the list and updates it; process A tries to lock the list and gets the lock after process B relinquishes it; if the timestamp on the config.pck from process B's update is in the same second as the timestamp of process A's initial load, process A thinks the list hasn't been updated and doesn't reload it after obtaining the lock. Thus, when process A saves the list, process B's changes are reversed. This is complicated by list caching in the qrunners because each qrunner may have a cached copy of the list, so it can act as process A in the above scenario with its cached copy playing the role of the initially loaded list. To complicate this further, the qrunners get involved even in the simple scenario with sequential commands because add_members, remove_members and CGIs result in notices being sent, and the qrunner processes that send the notices are running concurrently. This is why the stress test will fail even though commands are run sequentially. [...] > I applied the patch but it doesn't seem to have made a difference. As you later report, restarting the qrunners did seem to fix it. [...] >> The post at >> <-> >> contains a "stress test" that will probably reproduce the problem. > > Correct. Only one subscriber was subscribed to each test list. Keep in > mind that in the stress test given if you use a sleep counter of 5 with 6 > lists, that means you're waiting _30 seconds_ before the next add_member > command is run for that list (I'm assume the timing issue is per-list, not > per run of add_members). Even if you set the timer down to 1 that's a 6 > second sleep. This shouldn't effect a cache that we're comparing for the > given second. Anyway, my script ran fine with the 5 second sleep (30 > seconds per list add), but showed discrepancies with a 3 second sleep. So you are adding 'sleep' commands after each add_members? I'm not sure what you're doing. Is there a different test elsewhere in the thread? I have used a couple of tests as attached. They are the same except for list order and are very similar to the one in the original thread. Note that they contain only one sleep after all the add_members just to allow things to settle before running list_members. >> I suspect your Mailman server must be very busy for you to see this bug >> that frequently. However, it looks like I need to install the fix for >> Mailman 2.1.15. Actually, I don't think the issue is the busy server. I think it is more likely that NFS causes timing issues between add_members and VirginRunner and OutgoingRunner that just make the bug more likely to trigger. -- Mark Sapiro The highway is for gamblers, San Francisco Bay Area, California better use your sense - B. Dylan -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: list_cache_stress_test URL: -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: list_cache_stress_test_2 URL: From mark at msapiro.net Thu Sep 29 18:47:02 2011 From: mark at msapiro.net (Mark Sapiro) Date: Thu, 29 Sep 2011 09:47:02 -0700 Subject: [Mailman-Developers] Faulty Member Subscribe/Unsubscribes In-Reply-To: References: <95e21858fd84b36004ac74e3847d32e1.squirrel@webmail.cims.nyu.edu> <4E83DC73.90708@msapiro.net> Message-ID: <4E84A106.9060004@msapiro.net> On 9/29/2011 12:06 AM, Andrew Case wrote: > > You think it should be okay though if my qrunners are all running on my > mta server instead of my webserver though. 
This wouldn't be causing a > problem with the caching right? As long as Mailman's locks directory is a single NFS shared directory, there should be no problem. The problem you have is due to the bug, and the patch should fix it. -- Mark Sapiro The highway is for gamblers, San Francisco Bay Area, California better use your sense - B. Dylan From acase at cims.nyu.edu Thu Sep 29 19:30:12 2011 From: acase at cims.nyu.edu (Andrew Case) Date: Thu, 29 Sep 2011 13:30:12 -0400 Subject: [Mailman-Developers] Faulty Member Subscribe/Unsubscribes In-Reply-To: <4E84A053.6040900@msapiro.net> References: <95e21858fd84b36004ac74e3847d32e1.squirrel@webmail.cims.nyu.edu> <4E83DC73.90708@msapiro.net> <4E84A053.6040900@msapiro.net> Message-ID: <9c547f8afb5dfdf51f27786b8a07143c.squirrel@webmail.cims.nyu.edu> [...] > It is tricky. Each add_members, remove_members and web CGI post is a > separate process. If these processes are run sequentially, there should > not be any problem because each process will load the list, lock it > update it and save it before the next process loads it. > > The problem occurs when processes run concurrently. The scenario is > process A loads the list unlocked; process B locks the list and updates > it; process A tries to lock the list and gets the lock after process B > relinquishes it; if the timestamp on the config.pck from process B's > update is in the same second as the timestamp of process A's initial > load, process A thinks the list hasn't been updated and doesn't reload > it after obtaining the lock. Thus, when process A saves the list, > process B's changes are reversed. > > This is complicated by list caching in the qrunners because each qrunner > may have a cached copy of the list, so it can act as process A in the > above scenario with its cached copy playing the role of the initially > loaded list. To complicate this further, the qrunners get involved even > in the simple scenario with sequential commands because add_members, > remove_members and CGIs result in notices being sent, and the qrunner > processes that send the notices are running concurrently. This is why > the stress test will fail even though commands are run sequentially. Thank you for that explanation. I did seem to have confusion as to when the qrunners cache and/or update these config.pck files and when the add/remove_members commands did as well. There seemed to be some sort of conflict between the two. [...] >>> The post at >>> <-> >>> contains a "stress test" that will probably reproduce the problem. >> >> Correct. Only one subscriber was subscribed to each test list. Keep in >> mind that in the stress test given if you use a sleep counter of 5 with >> 6 >> lists, that means you're waiting _30 seconds_ before the next add_member >> command is run for that list (I'm assume the timing issue is per-list, >> not >> per run of add_members). Even if you set the timer down to 1 that's a 6 >> second sleep. This shouldn't effect a cache that we're comparing for >> the >> given second. Anyway, my script ran fine with the 5 second sleep (30 >> seconds per list add), but showed discrepancies with a 3 second sleep. > > > So you are adding 'sleep' commands after each add_members? Yes I was. Without a sleep in between add_member calls, it was failing for ~50% of the calls to add_members. With a 5 second sleep it would tend to work most of the time. > I'm not sure what you're doing. Is there a different test elsewhere in > the thread? See my updated stress test that I sent you in my last email. 
> I have used a couple of tests as attached. They are the same except for > list order and are very similar to the one in the original thread. Note > that they contain only one sleep after all the add_members just to allow > things to settle before running list_members. That makes sense. >>> I suspect your Mailman server must be very busy for you to see this bug >>> that frequently. However, it looks like I need to install the fix for >>> Mailman 2.1.15. > > > Actually, I don't think the issue is the busy server. I think it is more > likely that NFS causes timing issues between add_members and > VirginRunner and OutgoingRunner that just make the bug more likely to > trigger. I think you hit the nail on the head here. It explains a lot. Thanks, -- Drew From mark at msapiro.net Thu Sep 29 19:30:35 2011 From: mark at msapiro.net (Mark Sapiro) Date: Thu, 29 Sep 2011 10:30:35 -0700 Subject: [Mailman-Developers] Faulty Member Subscribe/Unsubscribes In-Reply-To: References: <95e21858fd84b36004ac74e3847d32e1.squirrel@webmail.cims.nyu.edu> <4E83DC73.90708@msapiro.net> Message-ID: <4E84AB3B.2090204@msapiro.net> On 9/29/2011 8:58 AM, Andrew Case wrote: > > When I did this I saw that maybe 1-5% of the time a user was still omitted > from the list (user was silently removed). I think that because these are > processed by the queue runner on a different host and because the > timestamp check is being done on an NFS stored file, there is potential > that the qrunner for this doesn't yet have an updated mtime for that file > (or even a small ntp time drift could cause this). When I commented out > the caching part of the code in MailList.py this bug never seems to show > up: > #if mtime < self.__timestamp: > # # File is not newer > # return None, None Actually, that is not the cache. It is just the test for whether the current list object, cached or whatever, needs to be reloaded from disk. I think that your configuration with NFS and possible time jitter between servers makes the bug more likely. > So I think there may still be a race condition here, but the chances of it > are unlikely that human interaction would trigger this. If however, you > have a script that is subscribing users (one after another), this could > still come up. I actually happen to have such a script, but I run it on > the same host as the qrunners, so I haven't experienced this before. It can happen even where everything is on a single host, but as I said, I think your configuration makes it more likely. > In my case I think it's probably not worth keeping the performance gain > that the caching adds for sake of consistency. Attached is a patch to remove list caching from the qrunners. This patch has the additional advantage of limiting the growth of the qrunners over time. Old entries were supposed to be freed from the cache, but a self reference in the default MemberAdaptor prevented this from occurring. For reasons of trying not to be disruptive this patch and the bug fix I sent earlier were never applied to the 2.1 branch. I think this was a mistake, and I will apply them for Mailman 2.1.15. > Attached is the modified stress test I'm using. Thanks. -- Mark Sapiro The highway is for gamblers, San Francisco Bay Area, California better use your sense - B. Dylan -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... 
Name: disable_cache.patch URL: From acase at cims.nyu.edu Thu Sep 29 20:36:26 2011 From: acase at cims.nyu.edu (Andrew Case) Date: Thu, 29 Sep 2011 14:36:26 -0400 Subject: [Mailman-Developers] Faulty Member Subscribe/Unsubscribes In-Reply-To: <4E84AB3B.2090204@msapiro.net> References: <95e21858fd84b36004ac74e3847d32e1.squirrel@webmail.cims.nyu.edu> <4E83DC73.90708@msapiro.net> <4E84AB3B.2090204@msapiro.net> Message-ID: Hey Mark, Take a look at these results that are with the 2 patches you've sent me (and qrunners restarted). I'm adding 10 users to 6 lists: [... creating list output cut ...] + Subscribing to testlist1 Subscribed: foo1 at www.cs.nyu.edu Subscribed: foo2 at www.cs.nyu.edu Subscribed: foo3 at www.cs.nyu.edu Subscribed: foo4 at www.cs.nyu.edu Subscribed: foo5 at www.cs.nyu.edu Subscribed: foo6 at www.cs.nyu.edu Subscribed: foo7 at www.cs.nyu.edu Subscribed: foo8 at www.cs.nyu.edu Subscribed: foo9 at www.cs.nyu.edu Subscribed: foo10 at www.cs.nyu.edu + Subscribing to testlist2 Subscribed: foo1 at www.cs.nyu.edu Subscribed: foo2 at www.cs.nyu.edu Subscribed: foo3 at www.cs.nyu.edu Traceback (most recent call last): File "/usr/mailman/bin/add_members", line 258, in main() File "/usr/mailman/bin/add_members", line 238, in main addall(mlist, nmembers, 0, send_welcome_msg, s) File "/usr/mailman/bin/add_members", line 135, in addall mlist.ApprovedAddMember(userdesc, ack, 0) File "/usr/mailman/Mailman/MailList.py", line 948, in ApprovedAddMember assert self.Locked() AssertionError Subscribed: foo5 at www.cs.nyu.edu Subscribed: foo6 at www.cs.nyu.edu Subscribed: foo7 at www.cs.nyu.edu Subscribed: foo8 at www.cs.nyu.edu Subscribed: foo9 at www.cs.nyu.edu Subscribed: foo10 at www.cs.nyu.edu + Subscribing to testlist3 Subscribed: foo1 at www.cs.nyu.edu [... subscribing users to all other lists went fine ...] + Subscribers for testlist1: foo10 at www.cs.nyu.edu foo1 at www.cs.nyu.edu foo2 at www.cs.nyu.edu foo3 at www.cs.nyu.edu foo4 at www.cs.nyu.edu foo5 at www.cs.nyu.edu foo6 at www.cs.nyu.edu foo7 at www.cs.nyu.edu foo8 at www.cs.nyu.edu foo9 at www.cs.nyu.edu + Removing list testlist1 + Subscribers for testlist2: foo1 at www.cs.nyu.edu foo2 at www.cs.nyu.edu foo3 at www.cs.nyu.edu foo5 at www.cs.nyu.edu foo6 at www.cs.nyu.edu foo7 at www.cs.nyu.edu foo8 at www.cs.nyu.edu foo9 at www.cs.nyu.edu + Removing list testlist2 [... the rest were all fine ...] There was a locking issue with testlist2 foo4, which is fine since it doesn't report back as successful. But you'll also notice that foo10 wasn't listed as a subscriber even though it appeared as though that subscribe was successful. Here's some errors on the very next run where I'm subscribing 10 people to each list as well: [... cut expected results ...] + Subscribers for testlist4: foo10 at www.cs.nyu.edu foo1 at www.cs.nyu.edu foo2 at www.cs.nyu.edu foo3 at www.cs.nyu.edu foo4 at www.cs.nyu.edu foo5 at www.cs.nyu.edu foo6 at www.cs.nyu.edu [** no foo8 **] foo7 at www.cs.nyu.edu foo9 at www.cs.nyu.edu + Removing list testlist4 + Subscribers for testlist5: foo10 at www.cs.nyu.edu foo1 at www.cs.nyu.edu foo2 at www.cs.nyu.edu foo3 at www.cs.nyu.edu foo4 at www.cs.nyu.edu foo5 at www.cs.nyu.edu foo6 at www.cs.nyu.edu foo7 at www.cs.nyu.edu [** no foo8 or foo9 **] + Removing list testlist5 [... cut expected results ...] That's my (1-5%) failure. 
But when I also comment out the following: >> #if mtime < self.__timestamp: >> # # File is not newer >> # return None, None It seems to work each time (I ran 3 tests in a row, all with expected results). Let me know what you think. Thanks, -- Drew On Thu, September 29, 2011 1:30 pm, Mark Sapiro wrote: > On 9/29/2011 8:58 AM, Andrew Case wrote: >> >> When I did this I saw that maybe 1-5% of the time a user was still >> omitted >> from the list (user was silently removed). I think that because these >> are >> processed by the queue runner on a different host and because the >> timestamp check is being done on an NFS stored file, there is potential >> that the qrunner for this doesn't yet have an updated mtime for that >> file >> (or even a small ntp time drift could cause this). When I commented out >> the caching part of the code in MailList.py this bug never seems to show >> up: >> #if mtime < self.__timestamp: >> # # File is not newer >> # return None, None > > > Actually, that is not the cache. It is just the test for whether the > current list object, cached or whatever, needs to be reloaded from disk. > > I think that your configuration with NFS and possible time jitter > between servers makes the bug more likely. > > >> So I think there may still be a race condition here, but the chances of >> it >> are unlikely that human interaction would trigger this. If however, you >> have a script that is subscribing users (one after another), this could >> still come up. I actually happen to have such a script, but I run it on >> the same host as the qrunners, so I haven't experienced this before. > > > It can happen even where everything is on a single host, but as I said, > I think your configuration makes it more likely. > > >> In my case I think it's probably not worth keeping the performance gain >> that the caching adds for sake of consistency. > > > Attached is a patch to remove list caching from the qrunners. This patch > has the additional advantage of limiting the growth of the qrunners over > time. Old entries were supposed to be freed from the cache, but a self > reference in the default MemberAdaptor prevented this from occurring. > > For reasons of trying not to be disruptive this patch and the bug fix I > sent earlier were never applied to the 2.1 branch. I think this was a > mistake, and I will apply them for Mailman 2.1.15. > > >> Attached is the modified stress test I'm using. > > > Thanks. > > -- > Mark Sapiro The highway is for gamblers, > San Francisco Bay Area, California better use your sense - B. Dylan > > Andrew Case Systems Administrator Courant Institute of Mathematical Sciences New York University 251 Mercer St., Room 1023 New York, NY 10012-1110 Phone: 212-998-3147 From mark at msapiro.net Thu Sep 29 22:06:46 2011 From: mark at msapiro.net (Mark Sapiro) Date: Thu, 29 Sep 2011 13:06:46 -0700 Subject: [Mailman-Developers] Faulty Member Subscribe/Unsubscribes In-Reply-To: Message-ID: Andrew Case wrote: > >Traceback (most recent call last): > File "/usr/mailman/bin/add_members", line 258, in > main() > File "/usr/mailman/bin/add_members", line 238, in main > addall(mlist, nmembers, 0, send_welcome_msg, s) > File "/usr/mailman/bin/add_members", line 135, in addall > mlist.ApprovedAddMember(userdesc, ack, 0) > File "/usr/mailman/Mailman/MailList.py", line 948, in ApprovedAddMember > assert self.Locked() >AssertionError The above is definitely a problem, but I can't see how it can occur unless there is some race condition at the level of the file system. [...] 
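For context, the command line scripts hold the list lock for the entire update, roughly like this (a simplified sketch, not the literal bin/add_members code; the address is just the one from your test output):

    from Mailman import MailList
    from Mailman.UserDesc import UserDesc

    mlist = MailList.MailList('testlist2')      # acquires the list lock
    try:
        userdesc = UserDesc(address='foo4@www.cs.nyu.edu')
        # ApprovedAddMember() contains the `assert self.Locked()` shown in
        # the traceback, so the AssertionError means the lock was not held
        # at that moment even though it was acquired above and never
        # explicitly released.
        mlist.ApprovedAddMember(userdesc, 0, 0)
        mlist.Save()
    finally:
        mlist.Unlock()

Since the lock is taken before any subscribes happen and only released at the end, Locked() returning false mid-run points at the lock file itself disappearing or changing underneath the process, which is why this looks like a file system (NFS) problem rather than a logic error in Mailman. 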
>There was a locking issue with testlist2 foo4, which is fine since it >doesn't report back as successful. But you'll also notice that foo10 >wasn't listed as a subscriber even though it appeared as though that >subscribe was successful. Yes, I see that and both are cause for concern. I am more concerned about the AssertionError with testlist2 foo4, because I can't see how that can happen without some file system anomaly. >Here's some errors on the very next run where I'm subscribing 10 people to >each list as well: [...] >That's my (1-5%) failure. But when I also comment out the following: > >>> #if mtime < self.__timestamp: >>> # # File is not newer >>> # return None, None > >It seems to work each time (I ran 3 tests in a row, all with expected >results). The code above is a decision about whether we need to reload the list object from the file system based on the file system time stamp vs. our internal time stamp. The original code said "if mtime <= self.__timestamp:". Since these time stamps are in whole seconds, that test (to skip loading) could succeed even if the file time stamp was a fraction of a second newer than the internal time stamp. Thus, the bug. The fix is to make the test "if mtime < self.__timestamp:", meaning we only skip loading if the file time stamp is strictly less than the internal time stamp. But in your case, if the clock on the MTA/qrunner machine is a bit faster than that on the machine running the script, the internal time stamp of the qrunner process could be a second ahead of the file time stamp set by the add_members process even though on an absolute scale it is older. So, I think we're concerned here about clock skew between the machines, and in that case, commenting out the code completely as you have done makes sense. I have filed bug reports at and about these issues in preparation for fixing them. Now I don't know whether to stick with the "if mtime < self.__timestamp:" test, which will work on a single server, or to reload unconditionally as you have done, which seems to be necessary in a shared file system situation with possible time skews. As far as the AssertionError is concerned, I don't know what to make of it. It appears to be a file system issue outside of Mailman, so I don't know how to deal with it. I think the code in Mailman's LockFile module is correct. If you are willing to try debugging this further, you could set LIST_LOCK_DEBUGGING = True in mm_cfg.py and restart your qrunners and try to reproduce the exception. This will log copious information to Mailman's 'locks' log which may help to understand what happened. -- Mark Sapiro The highway is for gamblers, San Francisco Bay Area, California better use your sense - B. Dylan