Nabitel information broadcast mail

From tim.one@comcast.net Sun Sep 1 08:04:44 2002
From: tim.one@comcast.net (Tim Peters)
Date: Sun, 01 Sep 2002 03:04:44 -0400
Subject: [Python-Dev] Re: [Python-checkins] python/nondist/sandbox/spambayes GBayes.py,1.7,1.8
In-Reply-To: <20020824183542.GA22248@glacier.arctrix.com>
Message-ID:

[Neil Schemenauer]
> ...
> For whatever reason, setting HAMBIAS to 1.0 seems to produce worse results.

It's remarkable. Graham's scheme is pasted together out of all sorts of
things that shouldn't work, but this one seems the most mysterious. It has
a huge effect in my 5x5 c.l.py test grid. Combining all unique msgs
identified as false negative or false positive across all 20 test runs,

At HAMBIAS = 1.0
    total false negatives goes down by a factor of 2 (337 -> 166)
    total false positives goes up by a factor of 7.6 (23 -> 174)

and some of the false positives are just amazing -- David Ascher announcing
a Python conference, Laura Creighton pontificating about the GPL, ... it's
hard to fathom! One innocuous example:

"""
Hello,
I love all these speed debates but if speed were our only concern we would
all be writing in assembly for all non internet based programs...!
Thank you,
Vincent A. Primavera

prob = 0.99918657946
prob('only') = 0.645419
prob('would') = 0.349237
prob('hello,') = 0.342435
prob('assembly') = 0.34891
prob('thank') = 0.819611
prob('these') = 0.677099
prob('all') = 0.709966
prob('you,') = 0.803672
prob('concern') = 0.225352
prob('our') = 0.951928
prob('internet') = 0.942274
prob('speed') = 0.305927
prob('but') = 0.229635
prob('love') = 0.736116
prob('non') = 0.885065
prob('writing') = 0.150994
"""

There's not a lot going on in that msg! *Perhaps* the primary effect of
boosting HAMBIAS is to take common glue words (like 'these' and 'all') out
of this uniquely "only look at smoking guns" scoring scheme altogether? I
don't know what "sense" there is in letting 'these' vote in favor of spam,
for example.

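It may help to see how sixteen mostly mild clues can end up at 0.999. The
sketch below assumes the overall score is the naive-Bayes-style combination
from Graham's "A Plan for Spam", P = prod(p) / (prod(p) + prod(1 - p)),
applied to exactly the word probabilities listed above. The function name
combine is made up for illustration and is not taken from GBayes.py.

```python
def combine(probs):
    """Graham-style combination: prod(p) / (prod(p) + prod(1 - p))."""
    spam = 1.0
    ham = 1.0
    for p in probs:
        spam *= p          # evidence that the msg is spam
        ham *= 1.0 - p     # evidence that the msg is ham
    return spam / (spam + ham)

# Per-word probabilities copied from the example message above.
clues = [
    ("only", 0.645419), ("would", 0.349237), ("hello,", 0.342435),
    ("assembly", 0.34891), ("thank", 0.819611), ("these", 0.677099),
    ("all", 0.709966), ("you,", 0.803672), ("concern", 0.225352),
    ("our", 0.951928), ("internet", 0.942274), ("speed", 0.305927),
    ("but", 0.229635), ("love", 0.736116), ("non", 0.885065),
    ("writing", 0.150994),
]

score = combine(p for _, p in clues)
print("prob = %.6f" % score)
```

Under that assumption the result lands at roughly 0.9992, in line with the
prob reported for the message. Two strong clues ('our' and 'internet') plus
a handful of moderately spammy glue words are enough to swamp the seven
hammy ones, because the odds multiply.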
At HAMBIAS = 3.0
    total false negatives goes up by a factor of 2.08 (337 -> 702)
    total false positives goes down by a factor of 4.6 (23 -> 5)

Somebody else think about this. It's certainly the easiest knob to twiddle
to make a false-positive versus false-negative rate tradeoff.

From sholden@holdenweb.com Sun Sep 1 10:14:25 2002
From: sholden@holdenweb.com (Steve Holden)
Date: Sun, 1 Sep 2002 05:14:25 -0400
Subject: [Python-Dev] tiny optimization in ceval mainloop
References: <15726.52313.734491.272985@gargle.gargle.HOWL>
    <0ED9227E-BBF1-11D6-B9DE-0030655234CE@cwi.nl>
    <15727.31272.80804.453415@gargle.gargle.HOWL>
    <200208301413.g7UEDqZ07890@pcp02138704pcs.reston01.va.comcast.net>
    <15727.33074.324120.988215@gargle.gargle.HOWL>
    <200208301429.g7UETqQ08033@pcp02138704pcs.reston01.va.comcast.net>
    <15727.33451.698048.657655@slothrop.zope.com>
    <2m3csw5qu9.fsf@starship.python.net>
    <055301c2503a$e1cfea60$6300000a@holdenweb.com>
    <2mfzwwiaud.fsf@starship.python.net>
Message-ID: <003301c25197$f8522600$6300000a@holdenweb.com>

[Michael Hudson]
> "Steve Holden" writes:
>
> > > A bunch of 0.5% improvements add up. If there's not much cost in
> > > complexity, why not go for it?
> > >
> >
> > Yeah, right, we just need 200 of them and we're laughing. Computation in
> > infinitesimal time.
>
> Multiply up doesn't have the same ring to it, does it?
>

Indeed not. I try to keep my pedantry in control, but it escapes from time
to time.

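The arithmetic behind the "add up" versus "multiply up" exchange can be
checked in a few lines. This is only a sketch using the numbers from the
thread (200 improvements of 0.5% each); the variable names are made up.

```python
# Hypothetical numbers taken from the exchange above.
n, gain = 200, 0.005

# Naive sum: 200 * 0.5% = 100% saved, i.e. "infinitesimal time".
additive = 1.0 - n * gain

# Compounding: each speedup only applies to the time that is left.
compounded = (1.0 - gain) ** n

print("additive:   %.3f of the original running time" % additive)
print("compounded: %.3f of the original running time" % compounded)
```

Compounded, about 0.367 of the original running time remains, which is
still a worthwhile speedup but some way short of infinitesimal.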
regards
-----------------------------------------------------------------------
Steve Holden                                  http://www.holdenweb.com/
Python Web Programming                        pydish.holdenweb.com/pwp/
Previous .sig file retired to                    www.homeforoldsigs.com
-----------------------------------------------------------------------

From skip@manatee.mojam.com Sun Sep 1 13:00:23 2002
From: skip@manatee.mojam.com (Skip Montanaro)
Date: Sun, 1 Sep 2002 07:00:23 -0500
Subject: [Python-Dev] Weekly Python Bug/Patch Summary
Message-ID: <200209011200.g81C0NSH019331@manatee.mojam.com>

Bug/Patch Summary
-----------------

282 open / 2810 total bugs (+7)
119 open / 1676 total patches (+10)

New Bugs
--------

textwrap has problems wrapping hyphens (2002-08-17)
    http://python.org/sf/596434
Another dealloc stack killer (2002-08-25)
    http://python.org/sf/600007
Installing w/o admin generates key error (2002-08-27)
    http://python.org/sf/600952
bug in new execvpe (2002-08-27)
    http://python.org/sf/601077
weird header wrapping in email.Generator (2002-08-28)
    http://python.org/sf/601392
xmlrpclib ignores CDATA (2002-08-28)
    http://python.org/sf/601534
some int results that should be bool (2002-08-29)
    http://python.org/sf/601775
smtplib mishandles empty sender (2002-08-29)
    http://python.org/sf/602029
configure finds c++ w/o --with-cxx (2002-08-29)
    http://python.org/sf/602102
os.popen() negative error code IOError (2002-08-29)
    http://python.org/sf/602245
3rd parameter for Tkinter.scan_dragto (2002-08-30)
    http://python.org/sf/602259
Bgen should learn about booleans (2002-08-30)
    http://python.org/sf/602291
option for not writing .py[co] files (2002-08-30)
    http://python.org/sf/602345
Jaguar "install" does not overwrite (2002-08-30)
    http://python.org/sf/602398
non greedy match bug (2002-08-30)
    http://python.org/sf/602444
pydoc -g dumps core on Solaris 2.8 (2002-08-30)
    http://python.org/sf/602627
cgitb tracebacks not accessible (2002-08-31)
    http://python.org/sf/602893

New Patches
-----------

test_commands test fails under Cygwin (2002-04-16)
    http://python.org/sf/544740
email: RFC 2231 parameters encoding (2002-08-26)
    http://python.org/sf/600096
IDLE [Open module]: import submodules (2002-08-26)
    http://python.org/sf/600152
Robustness tweak to httplib.py (2002-08-26)
    http://python.org/sf/600488
Refactoring of difflib.Differ (2002-08-27)
    http://python.org/sf/600984
build_ext forgets libraries par w MSVC (2002-08-28)
    http://python.org/sf/601314
obmalloc,structmodule: 64bit, big endian (2002-08-28)
    http://python.org/sf/601369
expose PYTHON_API_VERSION via sys (2002-08-28)
    http://python.org/sf/601456
replace_header method for Message class (2002-08-29)
    http://python.org/sf/601959
sys.path in user.py (2002-08-29)
    http://python.org/sf/602005
improper use of strncpy in getpath (2002-08-29)
    http://python.org/sf/602108
single shared ticker (2002-08-29)
    http://python.org/sf/602191

Closed Bugs
-----------

test_commands test fails under Cygwin (2002-04-16)
    http://python.org/sf/544740
Various Playstation 2 Linux Test Errors (2002-06-12)
    http://python.org/sf/567892
Core dump when using mmap. (2002-08-20)
    http://python.org/sf/597938
execfile() not show filename when IOErro (2002-08-23)
    http://python.org/sf/599163
SocketServer wrong about allow_reuse_add (2002-08-24)
    http://python.org/sf/599681
sub[n] not working as expected. (2002-08-24)
    http://python.org/sf/599757
httplib.connect broken in 2.1 branch (2002-08-25)
    http://python.org/sf/599838
NameError value is not the name error (2002-08-25)
    http://python.org/sf/599869

Closed Patches
--------------

"simplification" to ceval.c (2002-08-19)
    http://python.org/sf/597221
Failure building the documentation (2002-08-22)
    http://python.org/sf/598996

From martin@v.loewis.de Sun Sep 1 22:25:39 2002
From: martin@v.loewis.de (Martin v.
Loewis)
Date: 01 Sep 2002 23:25:39 +0200
Subject: [Python-Dev] mimetypes patch #554192
In-Reply-To: <3D5F9C2D.8010209@livinglogic.de>
References: <3D5BEBB8.7080904@livinglogic.de>
    <15707.61612.844119.819432@anthem.wooz.org>
    <3D5CE38D.9080905@livinglogic.de>
    <3D5F9C2D.8010209@livinglogic.de>
Message-ID:

Walter Dörwald writes:

> >>Even better would be, if we could assign priorities to the mappings,
> >>so that for e.g. image/jpeg the preferred extension is .jpeg.
> >>Then guess_type() and guess_extension() would return the preferred
> >>mimetype/extension.
> > Do you have a specific application for that in mind? It sounds like
> > overkill.
>
> I'm using a web mirror script which uses the extensions from
> guess_extension to save all downloaded resources, and I hate it
> when the HTML files are named .htm and JPEG images are named .jpe.

Then this is your preference - others might prefer jpg, just because
their file system can deal better with that.

If you can agree that this is your preference, you should put the
preference mechanism into the application. Maybe your preference can
be expressed algorithmically? It might be that you always want the
longest known extension (it is unlikely that you prefer "jpeg" over
"jpg" just because that contains a vowel :-).

Regards,
Martin

From martin@v.loewis.de Sun Sep 1 22:31:26 2002
From: martin@v.loewis.de (Martin v. Loewis)
Date: 01 Sep 2002 23:31:26 +0200
Subject: [Python-Dev] PyString_DecodeEscape and PEP293
In-Reply-To: <3D60EA3B.7030008@livinglogic.de>
References: <3D60EA3B.7030008@livinglogic.de>
Message-ID:

Walter Dörwald writes:

> A recent checkin added a function PyString_DecodeEscape()
> to stringobject.c. To make this function PEP293 compatible
> it would need access to unicode_decode_call_errorhandler
> which is defined static in unicodeobject.c. Does
> PyString_DecodeEscape() really need an errors argument?

What do you mean, "really need"?
The callers of this function pass the argument, in particular
escape_decode. Is that "real"?

> If yes, we could either move it to unicodeobject.c

No. It has to do little with Unicode.

> or make unicode_decode_call_errorhandler externally visible.

I don't know this function. What does this have to do with Unicode?

> Another problem that I noticed is that string-escape can't
> be used for encoding Unicode objects:

That is a feature. string-escape has nothing to do with Unicode.

Regards,
Martin

From martin@v.loewis.de Sun Sep 1 22:22:29 2002
From: martin@v.loewis.de (Martin v. Loewis)
Date: 01 Sep 2002 23:22:29 +0200
Subject: [Python-Dev] PEP 277 (unicode filenames): please review
In-Reply-To:
References:
Message-ID:

Matthias Urlichs writes:

> Linux and MacOSX use UTF-8 and should probably be treated as such,
> i.e. I want to open("äöü"), not open("äöü".encode("utf-8")).

What would be "äöü" in this context? Your message was encoded as
Latin-1 - was that deliberate?

You could expect that open(u"äöü") works well; for the way you write
it, somebody needs to know what encoding the string has.

Linux does *not* "use" UTF-8. On the file system API, it treats
arbitrary byte sequences as-is, i.e. when you pass "äöü" as Latin-1,
it will put those bytes on disk - if you later use "äöü" in UTF-8,
Linux won't find the file.

Instead, the convention seems to be that file names are in the
locale's encoding - which might be UTF-8, if you use a UTF-8 locale.

> Byte strings are perfectly OK if they have a common encoding (meaning
> UTF-8, in some accepted normal form).

Unfortunately, that precondition is false. There is no common encoding
on Linux.

Regards,
Martin

From martin@v.loewis.de Sun Sep 1 22:57:32 2002
From: martin@v.loewis.de (Martin v.
Loewis)
Date: 01 Sep 2002 23:57:32 +0200
Subject: [Python-Dev] To commit or not to commit
In-Reply-To: <200208261847.g7QIlI806850@pcp02138704pcs.reston01.va.comcast.net>
References: <3D6A7742.1030005@livinglogic.de>
    <200208261847.g7QIlI806850@pcp02138704pcs.reston01.va.comcast.net>
Message-ID:

Guido van Rossum writes:

> > Any objections against committing the patch?
>
> What do MvL and MAL say?

I'm still concerned about the massive amounts of C code, most of which
could be expressed way more compact in Python code.

Walter convinced me that this (the aspect that I picked in a
discussion) does have a real performance impact for real data, so I
guess I have to live with that.

Because of the size, I'm sure there are still bugs in it. I couldn't
spot any by inspection, so I think the patch is ready to be installed.

Regards,
Martin

From tdelaney@avaya.com Sun Sep 1 23:53:39 2002
From: tdelaney@avaya.com (Delaney, Timothy)
Date: Mon, 2 Sep 2002 08:53:39 +1000
Subject: [Python-Dev] The first trustworthy GBayes results
Message-ID:

> From: Tim Peters [mailto:tim.one@comcast.net]
>
> Training GBayes is cheap, and the more you feed it the less need to do
> information-destroying transformations (like folding case or ignoring
> punctuation).

Speaking of which, I had a thought this morning (in the shower of
course ;) about a slightly more intelligent tokeniser.

Split on whitespace, then runs of punctuation at the end of "words" are
split off as a separate word. So:

    a.b.c -> 'a.b.c' (main use: keeps file extensions with filenames)

    A phrase. -> 'A', 'phrase', '.'

    WTF??? -> 'WTF', '???'

    >>> import module -> '>>>', 'import', 'module'

Might this be useful?
No code of course ;)

Tim Delaney

From drifty@bigfoot.com Sun Sep 1 23:57:53 2002
From: drifty@bigfoot.com (Brett Cannon)
Date: Sun, 1 Sep 2002 15:57:53 -0700 (PDT)
Subject: [Python-Dev] Python-dev summary for 2002-08-15 - 2002-09-01
Message-ID:

Yes, with Michael's permission, I am attempting to start up the
Python-dev summaries again. Below is my attempt at summarizing the last
half of August. It's longer than normal summaries, but that is because I
bothered to include discussions on threads that were not directly
relating to the Python core but are interesting nonetheless (e.g., the
whole spambayes thread).

I am posting to Python-dev first before posting to c.l.py, c.l.py.a
(also lwn.net and probably Slashdot) because I want to get the general
okay from the list that I have done a good enough job to send this out;
I don't want to have a summary that represents the goings-on here
without the general populace (or just the BDFL since he can overrule =)
being okay with it. I am also curious as to whether I should go into
more or less detail, leave out the summaries that do not directly
pertain to the Python core, etc.

So please read the summary and let me know if you are okay with it. If
so I will try to do semi-monthly summaries from now on.

Oh, and I am on vacation right now and will be doing a lot of travelling
in the next two months, so I can't guarantee summaries will be this
quick to come out for a while. I will do them, though, even if they are
a week late. =)

Oh, and if I do get the okay to do this, expect a lot of dumb questions
from me in the future in terms of clarifying things. Just remember, it
is for the good of the Python community. =)

=======================================

This is a summary of traffic on the python-dev mailing list between
August 16, 2002 and September 1, 2002 (exclusive). It is intended to
inform the wider Python community of ongoing developments.
To comment, just post to python-list@python.org or comp.lang.python in
the usual way. Give your posting a meaningful subject line, and if it's
about a PEP, include the PEP number (e.g. Subject: PEP 201 - Lockstep
iteration). All python-dev members are interested in seeing ideas
discussed by the community, so don't hesitate to take a stance on a PEP
if you have an opinion.

This is the first summary written by Brett Cannon.

Summaries are archived nowhere at the moment. =) They will be, though,
so stay tuned for the URL in future summaries.

Posting distribution (with apologies to mbm, but thanks to mwh for the
code)

Number of articles in summary: 585

    Day      Articles
    Fri 16      71
    Sat 17      25
    Sun 18      12
    Mon 19      42
    Tue 20      63
    Wed 21      84
    Thu 22      30
    Fri 23      21
    Sat 24      39
    Sun 25       9
    Mon 26      47
    Tue 27      27
    Wed 28      33
    Thu 29      41
    Fri 30      36
    Sat 31       5

================
Type Categories
================

This VERY long thread was sparked by Andrew Koenig asking if a
discussion of making type categories more explicit had ever occurred
(Andrew meant for category to mean "the set of all types that implement
a particular marker interface"). As Andrew later pointed out, he was
asking about "a way of making notions such as 'file-like object' more
formal and/or automatic".
The discussion quickly started using the term interface to mean a way to
specify that an object implements certain methods (think of it in terms
of Java's 'implements' mechanism). Once that was out of the way, the
discussion took off.

Zope's implementation was pointed out
(http://cvs.zope.org/Zope3/lib/python/Interface/) very quickly. PEP 245
(Python Interface Syntax) was also brought to the attention of the list.

The idea of using inheritance to handle interfaces was brought up. Guido
said that he hasn't "given up the hope that inheritance and interfaces
could use the same mechanisms. But Jim Fulton, based on years of
experience in Zope, claims they really should be different" in terms of
how interfaces should be handled in objects. Jeremy Hylton tried to
channel Jim's opinion by pointing out that "We'd like to use interfaces
to make fairly strong claims. If a class A implements an interface I,
then we should be able to use an instance of A anywhere that an I is
needed." But "the inheritance mechanism is too general": if a class A
implements interface I and a class B, which does not implement I,
subclasses A, we end up with a class B that claims to have an interface
it doesn't actually have. Guido understood the point, but still thought
inheritance could be used "if there was a way to "shut off" inheritance
as far as isinstance() (or issubclass())" is concerned.

Guido asked the simple question, "Why do I keep arguing for inheritance?
(a) the need to deny inheritance from an interface, while essential, is
relatively rare IMO, and in *most* cases the inheritance rules work just
fine; (b) having two separate but similar mechanisms makes the language
larger."

Samuele Pedroni asked that any implementation "allow also for referring
to anonymous super-interfaces of an interface in terms of the interface
plus a subset of its signatures, also e.g. FileLike and just 'write'.
[that means an interface can be thought to correspond to a set of
(tag,signature) tuples, where tag identifies the interface, and one can
also just consider subsets of it]".

The thread seems to have finally stopped (for now) with Guido saying he
is mulling the whole thing over in the back of his head. This is a very
sticky topic because of the number of design decisions required and how
it might change the way people program in Python.

There was also a partial sub-thread in this whole discussion about
multimethods; basically a way to overload methods based on parameter
signature. Most of the discussion was over syntax and how to handle
resolution order. It then seemed to fall by the wayside when the main
part of the thread took over again.

==============================
type categories -- an example
==============================

This thread was started when Andrew Koenig said that the reason he
brought up his type category question was that he wanted a way to
identify members of a type easily. He now had an example in a program he
was writing where the type of the argument varied, and what needed to be
done to the data changed accordingly.

Jeremy Hylton suggested the isinstance(obj, type(re.compile('')))
idiom. Andrew asked if this was guaranteed to work, to which Jeremy said
no. I asked why this was not guaranteed, and Fredrik Lundh said it is
because re.compile() is a factory function and a future version could
return a different object based on the pattern.

===============================================
Python build trouble with the new gcc/binutils
===============================================

Andrew Koenig said that he couldn't compile Python using the newest gcc
(this was the day after the latest release hit servers). With help from
Zack Weinberg of Code Sourcery (who also recently rewrote the tempfile
module), the problem was tracked down to binutils 2.13; it was not
Python's fault.

===================================
Last call: mortal interned strings
===================================

The patch python.org/sf/576101 removes the default immortality of
interned strings. I believe it was in early August (possibly spilled
over from late July) when Oren Tirosh proposed the idea and wrote the
above mentioned patch. There had been some discussion over whether any
3rd party code was reliant upon interned strings being immortal; none
was found (MacPython was reliant upon it, but since it is under Python
core control it was considered a moot point since it could be changed).

It has been checked in. With the patch the way to make a string immortal
is to call PyString_InternImmortal(); no code in the core uses this
function.

=====================================
PEP 218 (sets); moving set.py to Lib
=====================================

Thanks to Greg Wilson (for writing the PEP), Alex Martelli (for writing
the module initially), and Guido (for refactoring Alex's code) the
stdlib has now gained a sets module. It has both the notion of mutable
and immutable sets (the latter used when you have a set of sets). There
was discussion about how sets should print (sorted or not; unsorted is
the default but the option is there to print sorted) and what operators
should be overloaded for working on sets (| and & were chosen). The
module is a beautiful chunk of code and I highly recommend reading its
source.

===========================================
A few lessons from the tempfile.py rewrite
===========================================

Zack Weinberg, after rewriting the tempfile module, brought up three
points:

1) lack of dummy threads,
2) lack of a pthread_once equivalent, and
3) lack of a way to skip tests from unittest.py via some built-in
   method.
Guido responded accordingly:

1) Since some code uses the idiom of trying to import thread and
   catching the exception if it fails, Guido said he would be willing to
   accept a dummy_thread.py that would allow

       try:
           import thread as _thread
       except ImportError:
           import dummy_thread as _thread

   to work. No word on whether this is being written at the moment.

2) Guido said the method was, in his opinion, overkill. He said to "be
   Pythonic, live dangerously, accept the risk that a ^C can screw you.
   It can anyway. :-)".

3) Guido referred Zack to the PyUnit list and Steve Purcell since Python
   just tracks Steve's code (pyunit.sf.net). Guido's suggestion was to
   put tests that rely on some other code in a separate test suite that
   is run only when that code is available.

===========================
Standard datetime objects?
===========================

Kevin Jacobs asked what stage the new datetime object was at. Guido said
it is in python/nondist/sandbox/datetime/ in CVS which also has comments
pointing to a wiki containing the current work on it. Fred L. Drake, Jr.
is working on the C re-implementation and Guido expects a checkin at any
moment (hasn't happened as of this writing).

===================
PEP 269 versus 283
===================

Jonathan Riehl noticed that PEP 283 said PEP 269 was dead; not good
considering he was close to having a patch for PEP 269 (pgen module to
interface with the C version). Guido said he will revive the PEP. The
patch has since been put on SF at python.org/sf/599331 .

==============================
What is a backport candidate?
==============================

Since Python 2.2 is going to be around for a long time, the question was
brought up of what constitutes code that should be backported.
Guido made the following three points:

1) code trivial to backport should always be backported
2) code patching 2.3 code should obviously not be backported
3) 2.2 code requires changes to use patch, but applies; gradients of
   this exist.

So please, when submitting patches, mention whether you think the patch
should be backported to the 2.2 tree and any possible dependencies it
might have in a backport.

=================================
python/nondist/sandbox/spambayes
=================================

In response to Paul Graham's spam filter written using Bayes' Rule
(Slashdot post on it is at
http://developers.slashdot.org/article.pl?sid=02/08/16/1428238&tid=156),
a thread spawned around this checkin of code that followed that paper's
suggestions. This thread quickly jumped into discussions on data
structures, Bayes' Rule, and a whole lot of talk about spam. Very
interesting if spam filtering interests you. Tim Peters has been leading
the drive on this chunk of code (and thanks to an illness that befell
him in late August, which he has subsequently gotten over, he had a few
days of major hacking on it; Tim showed he is a performance stats
whore).

A very cool quote came out of this thread from Eric S. Raymond when
discussing the spam filter he has been working on: "This is actually the
first new program I've coded in C (rather than Python) in a good four
years or so".

====================
Parsing vs. lexing.
====================

In response to a question by Aahz about what the differences were
between a lexer, parser, and tokenizer, Eric Raymond posted a good
overview of the differences. Guido later commented in an email
mentioning SPARK and about how Python's parser generator (pgen) works
and why he wrote it. He also made some other comments on lexers.

Jeremy Hylton pointed out a "neat new paper about an old algorithm for
recursive descent parsers with backtracking and unlimited lookahead" by
Bryan Ford at http://www.brynosaurus.com/pub.html .
Alex Martelli pointed out that this discussion reminded him of "a
long-ago interview with Borland's techies" in which they said they were
able to make Borland PASCAL fit on a floppy while MS PASCAL took
multiple floppies. Their trick was "we just did everything by the Dragon
Book -- except that the parser is a hand-written recursive descent
parser [Aho &c being adamant defenders of Yacc & the like], which buys
us a lot".

Someone named Noah also posted a discussion of lexers and parsers that
pulled in Finite State Machines, Pushdown Automata, and Turing Machines.

Martin Sj?n says that Haskell's pattern matching and lazy evaluation
make lexers easy (even a recursive-descent parser), but unfortunately
Haskell does not play nicely with other languages. Haskell is where
Python got its list comprehension idea.

=========================================
[Python-Dev] Fw: Security hole in rexec?
=========================================

It was brought to the attention of the list that deleting __builtins__
allowed a compromise in rexec. Guido pointed out that
python.org/sf/577530 reports this. He also said don't trust rexec. A
patch is going to be submitted to document the view that rexec is really
not that safe.

=================
A `cogen' module
=================

Francois Pinard asked about Cartesian products using the new sets
module. Guido didn't think people would in general need it. Francois
quickly started this thread to discuss a cogen module for generating
Cartesian products and other ways of operating on sets.

=================
Mersenne Twister
=================

Raymond Hettinger volunteered to implement the Mersenne Twister
algorithm (one in Python exists at
www.math.keio.ac.jp/~matumoto/emt.html). While discussing whether to
implement it in C or Python, Guido noticed that random.Random
re-implements whrandom.
Guido then came up with the idea of writing a base random class that is
subclassed where .random() can be implemented; Tim Peters agreed and
suggested more methods to subclass.

=================================
New PEP Format: reStructuredText
=================================

David Goodger and Barry Warsaw have now gotten reST as a usable syntax
for PEPs. Read the PEPs on the subject to learn more:

- PEP 12 -- Sample reStructuredText PEP Template
  (http://www.python.org/peps/pep-0012.html)
- PEP 258 -- Docutils Design Specification
  (http://www.python.org/peps/pep-0258.html)
- PEP 287 -- reStructuredText Docstring Format
  (http://www.python.org/peps/pep-0287.html)

====================================
tiny optimization in ceval mainloop
====================================

Jeremy Hylton noticed that in ceval there is a test of whether the
ticker is 0 or things_to_do is set to true (an explanation of the
ticker, checkinterval, and the GIL follows this paragraph). Jeremy
wondered if we could just drop the ticker to 0 when things_to_do is
true. Jack Jansen, though, pointed out that clearing it is not
guaranteed since there may be an interrupt routine when "we fiddle
things_to_do". Skip Montanaro then pointed out that since neither ticker
nor things_to_do is fiddled with unless the GIL is held, they could be
made globals instead of having each thread execute this test; he did a
patch that implements this (python.org/sf/602191). Guido then said that
if there wasn't a decent speed improvement, then no patch would be
checked in. He changed his mind when it was pointed out that it actually
simplified the code. Skip tested anyway, though, and there is a speed
improvement.

This also brought up whether the default value of 10 for checkinterval
was reasonable. It was agreed to bump it up to 100. Jack ran some code
and said he noticed a definite improvement.

Python's version of threading is not like in C.
There is something called the GIL (Global Interpreter Lock) which any
thread wishing to execute Python code or play with Python objects must
hold. This means that when you have Python threads running (using the
thread or threading module) they are usually all waiting in line to get
the GIL. Now for Python to decide when to release the GIL for another
thread to grab it, it uses the ticker. This variable counts down to zero
by being decremented every time a Python opcode is executed (originally
defaulted to 10, now defaulted to 100). The ticker's starting value
after each release of the GIL is what sys.setcheckinterval() sets. To
get a better understanding of threading under Python I recommend reading
Aahz's tutorials on threading.

From tim.one@comcast.net Mon Sep 2 00:40:38 2002
From: tim.one@comcast.net (Tim Peters)
Date: Sun, 01 Sep 2002 19:40:38 -0400
Subject: [Python-Dev] The first trustworthy GBayes results
In-Reply-To:
Message-ID:

[Delaney, Timothy]
> Speaking of which, I had a thought this morning (in the shower of
> course ;) about a slightly more intelligent tokeniser.

"Intelligence" isn't necessarily helpful with a statistical scheme, and
always makes it harder to adapt to other languages.

> Split on whitespace, then runs of punctuation at the end of "words" are
> split off as a separate word.

For example, "free!!" never appears in a ham msg in my corpora, but
appears often in the spam samples. OTOH, plain "free" is a weak spam
indicator on c.l.py, given the frequent supposedly on-topic arguments
about free beer versus free speech, etc.

> a.b.c -> 'a.b.c' (main use: keeps file extensions with filenames)
>
> A phrase. -> 'A', 'phrase', '.'
>
> WTF??? -> 'WTF', '???'
>
> >>> import module -> '>>>', 'import', 'module'

The first and last are the same as just splitting on whitespace. The
2nd-last may lose the distinction between WTF??? and a solicitation to
join the World Trade Federation; WTF isn't likely to make it into a list
of smoking guns regardless.
Hard to guess about the 2nd. The database isn't large enough to worry
about reducing its size, btw -- the only gimmicks I care about are those
that increase accuracy.

> Might this be useful? No code of course ;)

It takes about an hour to run and evaluate tests for one change. If you
want to motivate me to try, supply a patch against timtest.py (in the
sandbox), else I've already got far more ideas than time to test them
properly. Anyone else want to test this one?

From tdelaney@avaya.com Mon Sep 2 01:04:39 2002
From: tdelaney@avaya.com (Delaney, Timothy)
Date: Mon, 2 Sep 2002 10:04:39 +1000
Subject: [Python-Dev] The first trustworthy GBayes results
Message-ID:

> From: Tim Peters [mailto:tim.one@comcast.net]
>
> For example, "free!!" never appears in a ham msg in my
> corpora, but
> appears often in the spam samples. OTOH, plain "free" is a weak spam
> indicator on c.l.py, given the frequent supposedly on-topic
> arguments about
> free beer versus free speech, etc.

I'd actually thought of this limitation, and how it could be avoided.
This so-called "more intelligent" tokeniser would probably work best in
a system which scored word pairs as well as single words. For example:

    "I want free beer!!!"

would be split as

    'I'
    'want'
    'free'
    'beer'
    '!!!'

This might then be scored as

    'I'          0.5
    'want'       0.5
    'free'       0.5
    'beer'       0.1   (beer is unlikely to be a spam indicator ;)
    '!!!'        0.9
    'I want'     0.3
    'want free'  0.99  (do you want free hot ...?)
    'free beer'  0.01  (free beer is never a spam indicator ;)
    'beer !!!'   0.5

Whether any weighting should be applied to single words or word pairs I
don't know - my gut feeling is that they should be weighted the same,
but guts are no replacement for empirical evidence.

I just brought CVS python down at home and tried compiling with MinGW
(no success so far ...) but I'll have a look at the GBayes stuff
sometime soon and see if the above helps at all. Unfortunately, I just
started my work day ...

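The splitting rule and the word-pair idea from this thread can be
sketched roughly as follows. This is only an illustration of the rule as
described in the messages; the names tokenize and word_pairs are made up
here, and nothing below comes from GBayes.py or timtest.py.

```python
import string

def tokenize(text):
    """Split on whitespace, then split a trailing run of punctuation
    off each word as a separate token."""
    tokens = []
    for word in text.split():
        core = word.rstrip(string.punctuation)
        tail = word[len(core):]
        if not core:
            # The word is all punctuation (e.g. '>>>'): keep it whole.
            tokens.append(word)
            continue
        tokens.append(core)
        if tail:
            tokens.append(tail)
    return tokens

def word_pairs(tokens):
    """Adjacent token pairs, to be scored alongside the single tokens."""
    return list(zip(tokens, tokens[1:]))

for sample in ["a.b.c", "A phrase.", "WTF???", ">>> import module",
               "I want free beer!!!"]:
    toks = tokenize(sample)
    print(sample, "->", toks, word_pairs(toks))
```

Interior punctuation is left alone, so 'a.b.c' stays in one piece, and a
token with no letters at all is not split. Whether pairs should replace
or supplement the single-word scores is left open, as in the thread.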
Tim Delaney From tdelaney@avaya.com Mon Sep 2 01:38:10 2002 From: tdelaney@avaya.com (Delaney, Timothy) Date: Mon, 2 Sep 2002 10:38:10 +1000 Subject: [Python-Dev] The first trustworthy GBayes results Message-ID: > From: Delaney, Timothy [mailto:tdelaney@avaya.com] > > Whether any weighting should be applied to single words or > word pairs I > don't know - my gut feeling is that they should be weighted > the same, but > guts are no replacement for empirical evidence. On second thought - if a word-pair appears, then the separate parts should not be checked as separate words. So, If I had scores: 'free' 0.1 'beer' 0.1 ('want', 'free',) 0.9 ('free', 'beer',) 0.01 ('free', '!!!',) 0.99 then the following phrases would match (case-folding) as: 'I want free beer!!!': ('want', 'free',) 0.9 ('free', 'beer',) 0.01 'Get *** for free!!!' ('free', '!!!',) 0.99 'I want free beer. Free the beer!!!' ('want', 'free',) 0.9 ('free', 'beer',) 0.01 'free' 0.1 'beer' 0.1 Damn I wish I was at home to try this out ... :( Tim Delaney From skip@pobox.com Mon Sep 2 03:29:09 2002 From: skip@pobox.com (Skip Montanaro) Date: Sun, 1 Sep 2002 21:29:09 -0500 Subject: [Python-Dev] Python-dev summary for 2002-08-15 - 2002-09-01 In-Reply-To: References: Message-ID: <15730.52469.604124.730029@localhost.localdomain> Brett> I am posting to Python-dev first before posting to c.l.py, Brett> c.l.py.a ... because I want to get the general okay from the Brett> list... Looks good to me. The only trivial nit I would like to raise is that any URLs you embed in the text be true URLs. I'd also prefer they be encased in <...>, but that's slightly less important and generally only matters when URLs are immediately followed by punctuation. So, instead of Brett> Guido said he will revive the PEP. The patch has since been put Brett> on SF at python.org/sf/599331 . you'd have Brett> Guido said he will revive the PEP. The patch has since been put Brett> on SF at . 
The two changes make it much more likely that email readers will be able to successfully highlight such URLs correctly. Skip From skip@pobox.com Mon Sep 2 03:34:24 2002 From: skip@pobox.com (Skip Montanaro) Date: Sun, 1 Sep 2002 21:34:24 -0500 Subject: [Python-Dev] The first trustworthy GBayes results In-Reply-To: References: Message-ID: <15730.52784.407584.441515@localhost.localdomain> Tim> It takes about an hour to run and evaluate tests for one change. Tim> If you want to motivate me to try, supply a patch against Tim> timtest.py (in the sandbox), else I've already got far more ideas Tim> than time to test them properly. Anyone else want to test this Tim> one? Care to identify some of those ideas? Skip From tim.one@comcast.net Mon Sep 2 03:43:01 2002 From: tim.one@comcast.net (Tim Peters) Date: Sun, 01 Sep 2002 22:43:01 -0400 Subject: [Python-Dev] spambayes status Message-ID: This is a multi-part message in MIME format. --Boundary_(ID_Uqs78No0Dj49zOTKCoTyzA) Content-type: text/plain; charset=iso-8859-1 Content-transfer-encoding: 7BIT I spent an enormous amount of time this weekend running tests against various changes -- a "1% inspiration, 99% perspiration" kind of thing. There are lots of words about the changes (both good and bad) in the comment blocks and checkin msgs. The biggest "conceptual" change is that I'm now using (but only using) the Subject and From lines from the headers (my earlier belief that the ham corpora Subject lines were too corrupted by Mailman decorations turned out to be wrong). Adding Subject lines gave a remarkably small improvement, btw. Most changes I tried either didn't matter, or hurt. Approximately 70 more blatant spams in the ham corpora were identified and replaced with (randomly selected) legitimate msgs. The f-p rate is too low now to measure changes with confidence. Best guess I can make from the evidence is that it's below 0.05% now. 
The false negative rate has improved more, and there's still plenty of those (so it's still easy to be confident about whether changes do or don't help that). Across all 20 runs (each training on 4000 ham + about 2750 spam, then predicting against a different set with the same number of each), these are the false positive and negative rates now (percentages; note that 0.025% is a single message in the f-p column; a single msg in the f-n column is about 0.036%):

    f-p     f-n
    0.000   1.236
    0.000   1.164
    0.050   1.454
    0.000   1.599
    0.025   1.527
    0.025   1.236
    0.050   1.163
    0.025   1.309
    0.025   1.891
    0.000   1.418
    0.075   1.745
    0.050   1.708
    0.025   1.491
    0.000   0.836
    0.050   1.091
    0.025   1.309
    0.025   1.491
    0.000   1.127
    0.025   1.309
    0.050   1.636

The aggregate number of unique f-p across all runs is down to 8. The aggregate number of unique f-n across all runs is 336. The 8 ham messages for which at least one run claimed it was spam are attached. Note that I finally removed the "If AOL were a car" spam from the good corpus; while it may or may not be amusing, it *was* automated bulk email, even to the extent of including large blocks of random characters at the end. The message consisting almost entirely of quoting a Nigerian scam message looks like it would be a "false positive" under any scheme worth using, but I left it in the good corpus (so it's still an f-p here), because it wasn't bulk email (the original msg was, but the reply was not).
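[Editorial sketch] The granularity remark above ("0.025% is a single message in the f-p column") is plain arithmetic on the test-set sizes. The sketch below assumes exactly 4000 ham and 2750 spam per test set (the message says "about 2750"); the function names are illustrative only. It also converts the f-p column back into whole-message counts.

```python
HAM_PER_RUN = 4000    # stated in the message
SPAM_PER_RUN = 2750   # "about 2750" in the message; treated as exact here

# f-p column from the message, in percent, one entry per run.
FP_RATES = [0.000, 0.000, 0.050, 0.000, 0.025, 0.025, 0.050, 0.025, 0.025, 0.000,
            0.075, 0.050, 0.025, 0.000, 0.050, 0.025, 0.025, 0.000, 0.025, 0.050]


def pct_per_message(n_messages):
    """Percentage-point change caused by a single misclassified message."""
    return 100.0 / n_messages


def implied_counts(rates_pct, n_messages):
    """Turn reported percentages back into whole-message counts."""
    return [round(rate * n_messages / 100.0) for rate in rates_pct]


if __name__ == "__main__":
    print("one ham  msg = %.3f%%" % pct_per_message(HAM_PER_RUN))
    print("one spam msg = %.3f%%" % pct_per_message(SPAM_PER_RUN))
    counts = implied_counts(FP_RATES, HAM_PER_RUN)
    print("f-p decisions per run:", counts, "total:", sum(counts))
```

Under these assumptions the column adds up to more false-positive decisions than the 8 unique messages reported, which is what one would expect if some of the same messages are flagged in more than one run.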
--Boundary_(ID_Uqs78No0Dj49zOTKCoTyzA) Content-type: application/x-zip-compressed; name=fp.zip Content-transfer-encoding: base64 Content-disposition: attachment; filename=fp.zip [base64-encoded contents of fp.zip omitted] --Boundary_(ID_Uqs78No0Dj49zOTKCoTyzA)--
From tim.one@comcast.net Mon Sep 2 03:50:36 2002 From: tim.one@comcast.net (Tim Peters) Date: Sun, 01 Sep 2002 22:50:36 -0400 Subject: [Python-Dev] The first trustworthy GBayes results In-Reply-To: <15730.52784.407584.441515@localhost.localdomain> Message-ID: > Tim> It takes about an hour to run and evaluate tests for one change. > Tim> If you want to motivate me to try, supply a patch against > Tim> timtest.py (in the sandbox), else I've already got far more ideas > Tim> than time to test them properly. Anyone else want to test this > Tim> one? [Skip Montanaro] > Care to identify some of those ideas? Nope, I'm puking sick of this topic now. Look for XXX comments in timtest.py for some of them. You can infer others from places where XXX comments aren't. The f-p rate can't be improved anymore (meaning that it's too low for me to measure an improvement if one were made). The f-n rate is still high, but adding more headers is likely the most effective way to cut f-n, and my testing corpora won't allow me to test that (the header lines are too damned different since my ham and spam came from entirely different sources). It's somebody else's turn now ... and thank Barry for the email pkg! It's been a joy to use. From oren-py-d@hishome.net Mon Sep 2 05:22:05 2002 From: oren-py-d@hishome.net (Oren Tirosh) Date: Mon, 2 Sep 2002 00:22:05 -0400 Subject: [Python-Dev] Python-dev summary for 2002-08-15 - 2002-09-01 In-Reply-To: References: Message-ID: <20020902042205.GA29553@hishome.net> Nice work!
Some other threads you may want to include in your summary : The 'str' in 'string' feature: http://mail.python.org/pipermail/python-dev/2002-August/027354.html PEP 237 deprecation warnings and hex constants: http://mail.python.org/pipermail/python-dev/2002-August/027783.html PEP 277 - unicode filenames http://mail.python.org/pipermail/python-dev/2002-August/027651.html Oren From tim.one@comcast.net Mon Sep 2 07:54:35 2002 From: tim.one@comcast.net (Tim Peters) Date: Mon, 02 Sep 2002 02:54:35 -0400 Subject: [Python-Dev] The first trustworthy GBayes results In-Reply-To: Message-ID: [Delaney, Timothy] > On second thought - if a word-pair appears, then the separate parts should > not be checked as separate words. > > So, If I had scores: > > 'free' 0.1 > 'beer' 0.1 > ('want', 'free',) 0.9 > ('free', 'beer',) 0.01 > ('free', '!!!',) 0.99 > > then the following phrases would match (case-folding) as: > > 'I want free beer!!!': > > ('want', 'free',) 0.9 > ('free', 'beer',) 0.01 > > 'Get *** for free!!!' > > ('free', '!!!',) 0.99 > > 'I want free beer. Free the beer!!!' > > ('want', 'free',) 0.9 > ('free', 'beer',) 0.01 > 'free' 0.1 > 'beer' 0.1 > > Damn I wish I was at home to try this out ... :( I'm going to say a lot of stuff here, and then shut up . I want to move on to other things, but there's an opportunity to pass on some darned good advice for those who can hear. Combining pairs of words is called "word bigrams". My intuition at the start was that it would do better. OTOH, my intuition also was that character n-grams for a relatively large n would do better still. The latter may be so for "foreign" languages, but for this particular task using Graham's scheme on the c.l.py tests, turns out they sucked. A comment block in timtest.py explains why. I didn't try word bigrams because the f-p rate is already supernaturally low, so there doesn't seem anything left to be gained there. 
This echoes what Graham sez on his web page: One idea that I haven't tried yet is to filter based on word pairs, or even triples, rather than individual words. This should yield a much sharper estimate of the probability. My comment with benefit of hindsight: it doesn't. Because the scoring scheme throws away everything except about a dozen extremes, the "probabilities" that come out are almost always very near 0 or very near 1; only very short or (or especially "and") very bland msgs come out in between. This outcome is largely independent of the tokenization scheme -- the scoring scheme forces it, provided only that the tokenization scheme produces stuff *some* of which *does* vary in frequency between spam and ham. For example, in my current database, the word "offers" has a probability of .96. If you based the probabilities on word pairs, you'd end up with "special offers" and "valuable offers" having probabilities of .99 and, say, "approach offers" (as in "this approach offers") having a probability of .1 or less. The theory is indeed appealing . The reason I haven't done this is that filtering based on individual words already works so well. Which is also the reason I didn't pursue it. But it does mean that there is room to tighten the filters if spam gets harder to detect. I expect it would also need a different scoring scheme then. OK, I ran a full test using word bigrams. It gets one strike against it at the start because the database size grows by a factor between 2 and 3. That's only justified if the results are better. 
Before-and-after f-p (false positive) percentages:

    before  bigrams
    0.000   0.025
    0.000   0.025
    0.050   0.050
    0.000   0.025
    0.025   0.050
    0.025   0.100
    0.050   0.075
    0.025   0.025
    0.025   0.050
    0.000   0.025
    0.075   0.050
    0.050   0.000
    0.025   0.050
    0.000   0.025
    0.050   0.075
    0.025   0.025
    0.025   0.025
    0.000   0.000
    0.025   0.050
    0.050   0.025

    Lost on 12 runs
    Tied on  5 runs
    Won  on  3 runs

    total # of unique fps across all runs rose from 8 to 17

The f-n percentages on the same runs:

    before  bigrams
    1.236   1.091
    1.164   1.091
    1.454   1.708
    1.599   1.563
    1.527   1.491
    1.236   1.127
    1.163   1.345
    1.309   1.309
    1.891   1.927
    1.418   1.382
    1.745   1.927
    1.708   1.963
    1.491   1.782
    0.836   0.800
    1.091   1.127
    1.309   1.309
    1.491   1.709
    1.127   1.018
    1.309   1.018
    1.636   1.672

    Lost on 9 runs
    Tied on 2 runs
    Won  on 9 runs

    total # of unique fns across all runs rose from 336 to 350

This doesn't need deep analysis: it costs more, and on the face of it either doesn't help, or helps so little it's not worth the cost. Now I'll tell you in confidence that the way to make a scheme like this excellent is to keep your ego out of it and let the data *tell* you what works: getting the best test setup you can is the most important thing you can possibly do, which must include multiple training and test corpora (e.g., if I had used only one pair, I would have had a 3/20 chance of erroneously concluding that bigrams might help the f-p rate, when running across 20 pairs shows that they almost certainly do it harm; while I would have had an even chance of drawing a wrong conclusion -- in either direction -- about the effect on the f-n rate). The second most important thing is to run a fat test all the way to the end before concluding anything.
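[Editorial sketch] The lost/tied/won summaries above can be recomputed directly from the two pairs of columns, which is a cheap sanity check when comparing any two variants run by run. The data below is copied from the message; the function name `tally` is illustrative only. "Lost" means the bigram run had the higher error rate.

```python
# Per-run error rates (percent) copied from the message: unigrams vs. word bigrams.
FP_BEFORE = [0.000, 0.000, 0.050, 0.000, 0.025, 0.025, 0.050, 0.025, 0.025, 0.000,
             0.075, 0.050, 0.025, 0.000, 0.050, 0.025, 0.025, 0.000, 0.025, 0.050]
FP_BIGRAMS = [0.025, 0.025, 0.050, 0.025, 0.050, 0.100, 0.075, 0.025, 0.050, 0.025,
              0.050, 0.000, 0.050, 0.025, 0.075, 0.025, 0.025, 0.000, 0.050, 0.025]
FN_BEFORE = [1.236, 1.164, 1.454, 1.599, 1.527, 1.236, 1.163, 1.309, 1.891, 1.418,
             1.745, 1.708, 1.491, 0.836, 1.091, 1.309, 1.491, 1.127, 1.309, 1.636]
FN_BIGRAMS = [1.091, 1.091, 1.708, 1.563, 1.491, 1.127, 1.345, 1.309, 1.927, 1.382,
              1.927, 1.963, 1.782, 0.800, 1.127, 1.309, 1.709, 1.018, 1.018, 1.672]


def tally(before, after):
    """Return (lost, tied, won) for the 'after' variant; lower error is better."""
    lost = sum(1 for b, a in zip(before, after) if a > b)
    tied = sum(1 for b, a in zip(before, after) if a == b)
    won = sum(1 for b, a in zip(before, after) if a < b)
    return lost, tied, won


if __name__ == "__main__":
    print("f-p (lost, tied, won):", tally(FP_BEFORE, FP_BIGRAMS))
    print("f-n (lost, tied, won):", tally(FN_BEFORE, FN_BIGRAMS))
    # Looking at a single randomly chosen run, the chance of seeing bigrams
    # "win" on f-p is won/20, which is the 3/20 figure mentioned in the text.
```

A per-run tally like this says nothing about statistical significance; it only reproduces the bookkeeping behind the summary lines.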
A subtler point is that you should never keep a change that doesn't *prove* itself a winner: neutral changes bloat your code with proven irrelevancies that will come back to make your life harder later, in part because they'll randomly interfere with future changes in ways that make it harder to recognize a significant change when you stumble into one. Most things you try won't help -- indeed, many of them will deliver worse results. I dare say my intuition for this kind of classification task is better than most programmers' (in part because I had years of professional experience in a related field), and most of the things I tried I had to throw away. BFD -- then you try something else. When I find something that works I can rationalize it, but when I try something that doesn't, no amount of argument can change that the data said it sucked. Two things about *this* task have fooled me repeatedly:
1. The "only look at smoking guns" nature of the scoring step makes many kinds of "on average" intuitions worthless: "on average" almost everything is thrown away! For example, you're not going to find bad results reported for n-grams (neither character- nor word-based) in the literature, because most scoring schemes throw much less away. Graham's scheme strikes me as brilliant in this specific respect: it's worth enduring the ego humiliation to get such a spectacularly low f-p rate from such simple and fast code.
2. Most mailing-list messages are much shorter than this one. This systematically frustrates "well, averaged over enough words" intuitions too.
Cute: In particular, word bigrams systematically hate conference announcements. The current word one-gram scheme hated them too, until I started folding case. Then their SCREAMING stopped acting against them. But they're still using the language of advertisement, and word bigrams can't help but notice that more strongly than individual words do.
Here from the TOOLS Europe '99 announcement:

    prob('more information') = 0.916003
    prob('web site') = 0.895518
    prob('please write') = 0.99
    prob('you wish') = 0.984494
    prob('our web') = 0.985578
    prob('visit our') = 0.99

Here from the XP2001 - FINAL CALL FOR PAPERS:

    prob('web site:') = 0.926174
    prob('receive this') = 0.945813
    prob('you receive') = 0.987542
    prob('most exciting') = 0.99
    prob('alberta, canada') = 0.99
    prob('e-mail to:') = 0.99

Here from the XP2002 - CALL FOR PRACTITIONER'S REPORTS ('BOM' is an artificial token I made up for "beginning of message", to give something for the first word in the message to pair up with):

    prob('web site:') = 0.926174
    prob('this announcement') = 0.94359
    prob('receive this') = 0.945813
    prob('forward this') = 0.99
    prob('e-mail to:') = 0.99
    prob('BOM *****') = 0.99
    prob('you receive') = 0.987542

Here from the TOOLS Europe 2000 announcement:

    prob('visit the') = 0.96
    prob('you receive') = 0.967805
    prob('accept our') = 0.99
    prob('our apologies') = 0.99
    prob('quality and') = 0.99
    prob('receive more') = 0.99
    prob('asia and') = 0.99

A vanilla f-p showing where bigrams can hurt was a short msg about setting up a Python user's group. Bigrams gave it large penalties for phrases like "fully functional" (most often seen in spams for bootleg software, but here applied to the proposed user group's web site -- and "web site" is also a strong spam indicator!). OTOH, the poster also said "Aahz rocks". As a bigram, that neither helped nor hurt (that 2-word phrase is unique in the corpus); but as an individual word, "Aahz" is a strong non-spam indicator on c.l.py (and will probably remain so until he starts spamming).
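[Editorial sketch] To see why a handful of clues like the ones listed above is enough to push a message's score to nearly 1, here is a minimal sketch of the combining rule from Paul Graham's "A Plan for Spam": keep only the clues farthest from 0.5 and combine them as prod(p) / (prod(p) + prod(1 - p)). This is not the GBayes.py or timtest.py code. The function name is made up, and the default of 15 clues is the figure from Graham's essay as best I recall it, so treat it as an assumption (the thread itself only says "about a dozen extremes").

```python
from math import prod  # Python 3.8+


def graham_score(probs, max_clues=15):
    """Combine per-token spam probabilities the way Graham's essay describes.

    Assumption: max_clues=15 follows the essay; the thread only says
    "about a dozen". Probabilities are assumed to be clamped away from
    exactly 0 and 1 (the listings above top out at 0.99), so the
    denominator cannot be zero.
    """
    # "Only look at smoking guns": keep the clues farthest from neutral (0.5).
    clues = sorted(probs, key=lambda p: abs(p - 0.5), reverse=True)[:max_clues]
    spam = prod(clues)
    ham = prod(1.0 - p for p in clues)
    return spam / (spam + ham)


# Bigram clue probabilities from the TOOLS Europe '99 listing above.
TOOLS_99 = [0.916003, 0.895518, 0.99, 0.984494, 0.985578, 0.99]

if __name__ == "__main__":
    print(graham_score(TOOLS_99))    # very close to 1
    print(graham_score([0.5, 0.5]))  # neutral clues stay neutral
```

A real message would also contribute ham-leaning clues, which are not listed in the thread, so the first figure is only an upper-end illustration of how quickly agreeing clues saturate the score.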
It did find one spam hiding in a ham corpus: """ NNTP-Posting-Host: 212.64.45.236 Newsgroups: comp.lang.python,comp.lang.rexx Date: Thu, 21 Oct 1999 10:18:52 -0700 Message-ID: <67821AB23987D311ADB100A0241979E5396955@news.ykm.com> From: znblrn@hetronet.com Subject: Rudolph The Rednose Hooters Here Lines: 4 Path: news!uunet!ffx.uu.net!newsfeed.fast.net!howland.erols.net!newsfeed.cwix.com! news.cfw.com!paxfeed.eni.net!DAIPUB.DataAssociatesInc..com Xref: news comp.lang.python:74468 comp.lang.rexx:31946 To: python-list@python.org THis IS it: The site where they talk about when you are 50 years old. http://huizen.dds.nl/~jansen20 """ there's-no-substitute-for-experiment-except-drugs-ly y'rs - tim From tdelaney@avaya.com Mon Sep 2 08:43:06 2002 From: tdelaney@avaya.com (Delaney, Timothy) Date: Mon, 2 Sep 2002 17:43:06 +1000 Subject: [Python-Dev] The first trustworthy GBayes results Message-ID: > From: Tim Peters [mailto:tim.one@comcast.net] > > I'm going to say a lot of stuff here, and then shut up > . I want to > move on to other things, but there's an opportunity to pass > on some darned > good advice for those who can hear. Pretty darned good advice too ... but you won't object if I waste some time playing with this stuff anyway I hope. Only one way to accumulate experience after all ;) Personally, I considered that you were already well past the point of diminishing returns, and anything further was of academic interest to those who felt a desire to tinker ... (i.e. the hard work has been done, and everything else is just fun and games :) If enough people (or just one dedicated person) waste enough time, who knows what may come out. Hey - it worked for timsort didn't it ...? ;) Tim Delaney From mal@lemburg.com Mon Sep 2 09:02:27 2002 From: mal@lemburg.com (M.-A. 
Lemburg) Date: Mon, 02 Sep 2002 10:02:27 +0200 Subject: [Python-Dev] To commit or not to commit References: <3D6A7742.1030005@livinglogic.de> <200208261847.g7QIlI806850@pcp02138704pcs.reston01.va.comcast.net> Message-ID: <3D731B13.9090909@lemburg.com> Martin v. Loewis wrote: > Guido van Rossum writes: > > >>>Any objections against committing the patch? >> >>What do MvL and MAL say? > > Because of the size, I'm sure there are still bugs in it. I couldn't > spot any by inspection, so I think the patch is ready to be installed. +1. -- Marc-Andre Lemburg CEO eGenix.com Software GmbH _______________________________________________________________________ eGenix.com -- Makers of the Python mx Extensions: mxDateTime,mxODBC,... Python Consulting: http://www.egenix.com/ Python Software: http://www.egenix.com/files/python/ From tim.one@comcast.net Mon Sep 2 09:09:54 2002 From: tim.one@comcast.net (Tim Peters) Date: Mon, 02 Sep 2002 04:09:54 -0400 Subject: [Python-Dev] The first trustworthy GBayes results In-Reply-To: Message-ID: [Delaney, Timothy] > Pretty darned good advice too ... but you won't object if I waste > some time playing with this stuff anyway I hope. Only one way to accumulate > experience after all ;) Not at all! Knock yourself out -- it's really a lot of fun, except when it gets so tedious you start punching the wall just to watch your knuckles bleed . > Personally, I considered that you were already well past the point of > diminishing returns, Not yet -- false positives are a horrible thing, and the false negative rate still lets a lot of spam through. Cutting the f-n rate, e.g., in half, would mean half as much spam to deal with; generalization left to the reader. > and anything further was of academic interest to those who felt a desire to > tinker ... 
The best hope for reducing f-n lies in exploiting more header lines than I can test with my mixed corpora, and there's *tons* of room for improvement there (note that the f-n rate is more than 20x greater than the f-p rate now). Anyone who wants to tackle that with tedious experiment should first pick Neil Schemenauer's brain: he had a good start on that early last week. > (i.e. the hard work has been done, and everything else is just fun and > games :) If enough people (or just one dedicated person) waste enough time, > who knows what may come out. Hey - it worked for timsort didn't it ...? ;) Indeed so, and it works for this too -- never underestimate the power of working yourself sick. If you also *write* about it, you can make everyone else ill too by proxy. sharing-the-pain-ly y'rs - tim From walter@livinglogic.de Mon Sep 2 12:21:22 2002 From: walter@livinglogic.de (=?ISO-8859-15?Q?Walter_D=F6rwald?=) Date: Mon, 02 Sep 2002 13:21:22 +0200 Subject: [Python-Dev] PyString_DecodeEscape and PEP293 References: <3D60EA3B.7030008@livinglogic.de> Message-ID: <3D7349B2.8010706@livinglogic.de> Martin v. Loewis wrote: > Walter Dörwald writes: > > >>A recent checkin added a function PyString_DecodeEscape() >>to stringobject.c. To make this function PEP293 compatible >>it would need access to unicode_decode_call_errorhandler >>which is defined static in unicodeobject.c. Does >>PyString_DecodeEscape() really need an errors argument? > > > What do you mean, "really need"? The callers of this function pass the > argument, in particular escape_decode. Is that "real"? So does escape_decode need an errors argument? AFAICT escape_decode is used only in the context of reading pickles. Will there ever be a need to call escape_decode with anything other than errors="strict"? >>If yes, we could either move it to unicodeobject.c > > > No. It has to do little with Unicode. > > >>or make unicode_decode_call_errorhandler externally visible. > > > I don't know this function.
It's a static function in unicodeobject.c in the PEP293 patch that does the complete error handling for decoding. > What does this have to do with Unicode? I expected all codecs to do unicode<->8bit coding/decoding; "string-escape" seems to be an exception. >>Another problem that I noticed is that string-escape can't >>be used for encoding Unicode objects: > > > That is a feature. string-escape has nothing to do with Unicode. So it doesn't need the new PEP293 error handling? Bye, Walter Dörwald From walter@livinglogic.de Mon Sep 2 12:22:25 2002 From: walter@livinglogic.de (=?ISO-8859-1?Q?Walter_D=F6rwald?=) Date: Mon, 02 Sep 2002 13:22:25 +0200 Subject: [Python-Dev] To commit or not to commit References: <3D6A7742.1030005@livinglogic.de> <200208261847.g7QIlI806850@pcp02138704pcs.reston01.va.comcast.net> <3D731B13.9090909@lemburg.com> Message-ID: <3D7349F1.4090100@livinglogic.de> M.-A. Lemburg wrote: > Martin v. Loewis wrote: > >> Guido van Rossum writes: >> >>>> Any objections against committing the patch? >>> >>> What do MvL and MAL say? >> >> Because of the size, I'm sure there are still bugs in it. I couldn't >> spot any by inspection, so I think the patch is ready to be installed. > > +1. OK, I'll check it in then. Bye, Walter Dörwald From pinard@iro.umontreal.ca Mon Sep 2 13:02:55 2002 From: pinard@iro.umontreal.ca (=?iso-8859-1?q?Fran=E7ois?= Pinard) Date: Mon, 02 Sep 2002 08:02:55 -0400 Subject: [Python-Dev] Re: The first trustworthy GBayes results In-Reply-To: (Tim Peters's message of "Mon, 02 Sep 2002 02:54:35 -0400") References: Message-ID: [Tim Peters] [... extremely good work and stuff and comments, for a good while now ...] Hi, Tim. I read your messages, witnessing your work and progress in that area, with great interest, and also saved them for later contemplation!
:-) Spam always annoyed me, as most of us, and despite many efforts I did, it is increasingly successful at traversing my filters -- so this idea of Graham or Bayesian filters is timely and welcome. Most previous filters I observed are based on various (random) tests or events (you surely know all this), and `procmail'-based filters, or even the popular SpamAssassin, are either very slow or at least slow. The tool I use since 1998 is much faster, especially after I rewrote it in Python!, it is also based on various tests or events. Your works concentrated on tuning the statistical formulas and lexical analysis, and building operational data from preset corpora. I'm sure all the knowledge gleaned there will make its way everywhere, and reach me. For a tiny share, I decided to experiment with day-to-day user aspects of using such a filter, and built a Gnus interface over Eric Raymond's Bogofilter. There are two functions to this program, one is about learning from messages known to be ham or spam, the other is about classification of incoming messages. By the way, if there are Gnus users among you, just ask me for the recipe... It goes pretty well for me, so far. The principle, put forward by Paul Graham, is to let the user have two delete commands: delete-as-ham or delete-as-spam. Eric pushed this idea a bit further by postponing learning until the user quits the mail reader, `mutt' in his case. As Gnus allows me to have many mailgroups and folders and shuffle between them, I postpone learning until the user switches mailgroups or quit, and only for the _final_ disposition of a message: that is, when a message is merely saved into another folder, the decision will be taken when leaving that other folder, and not the current one. Messages marked as "saved" are _not_ sent, so to avoid double learning. The fact is that ham messages are more likely to be postponed than spam, because ham is more often filed here and there. 
Even if many or most ham messages are deleted, this introduce a short term bias in the learning statistics by which the percentage of spam seems to be higher (in my case, 1157 messages have been learned in about three days, 20% of which were spam), but this percentage will later be lowered as filed messages get reprocessed. Another effect is that the delay itself in ham learning may have a slight effect on classification, but since both ham and spam are well represented, the effect is likely negligible. Tim corpora are surely very clean, at least by now, while day-to-day learning may yield slightly tainted learning. In my case, when a thread does not interest me, I often kill all articles it contains in one command, without opening each of them to see if it would not be spam: the threading itself makes it unlikely. But nevertheless possible, you surely noticed that bad guys now fetch and re-use already published subjects as a way to get through. That means that if big corpora are thinkable in case of mailing lists having existed for a while, those are probably not very usable for individual users. GBayes, Bogofilter and others should ideally resist some amount of ham-tainted-as-spam or spam-tainted-as-ham at learning time. After adding Graham filtering as a supplementary method to my spam detection tool, I gladly observe that it successfully detects many spam messages which would otherwise fall in the cracks, so it really brings something to me. But I also see many spam cases (are they?) it does not detect and that it would hardly: one simple example is that _for me_, invalidly structured MIME is indicative of an un-interesting message, as interesting people know better! One particular problem I observed are Tim messages themselves, which are undoubtedly very miummy ham messages, but discussing and quoting many spam inside them. Should these be registered as ham or spam? :-) Would not these defeat the learning to some extent? 
Where should Tim add his own messages in the corpora he uses, and what changes would result in `GBayes' effectiveness? -- Fran�ois Pinard http://www.iro.umontreal.ca/~pinard From guido@python.org Mon Sep 2 15:01:45 2002 From: guido@python.org (Guido van Rossum) Date: Mon, 02 Sep 2002 10:01:45 -0400 Subject: [Python-Dev] Python-dev summary for 2002-08-15 - 2002-09-01 In-Reply-To: Your message of "Sun, 01 Sep 2002 21:29:09 CDT." <15730.52469.604124.730029@localhost.localdomain> References: <15730.52469.604124.730029@localhost.localdomain> Message-ID: <200209021401.g82E1k030628@pcp02138704pcs.reston01.va.comcast.net> > Looks good to me. The only trivial nit I would like to raise is that any > URLs you embed in the text be true URLs. I'd also prefer they be encased in > <...>, but that's slightly less important and generally only matters when > URLs are immediately followed by punctuation. So, instead of > > Brett> Guido said he will revive the PEP. The patch has since been put > Brett> on SF at python.org/sf/599331 . > > you'd have > > Brett> Guido said he will revive the PEP. The patch has since been put > Brett> on SF at . > > The two changes make it much more likely that email readers will be able to > successfully highlight such URLs correctly. I think adding http:// alone should be sufficient. Despite all the official recommendations, I've always hated the <...> form. However, do keep a space after the URL if punctuation were to follow (which you already did). --Guido van Rossum (home page: http://www.python.org/~guido/) From guido@python.org Mon Sep 2 15:06:05 2002 From: guido@python.org (Guido van Rossum) Date: Mon, 02 Sep 2002 10:06:05 -0400 Subject: [Python-Dev] To commit or not to commit In-Reply-To: Your message of "Mon, 02 Sep 2002 10:02:27 +0200." 
<3D731B13.9090909@lemburg.com> References: <3D6A7742.1030005@livinglogic.de> <200208261847.g7QIlI806850@pcp02138704pcs.reston01.va.comcast.net> <3D731B13.9090909@lemburg.com> Message-ID: <200209021406.g82E65b30667@pcp02138704pcs.reston01.va.comcast.net> > >>>Any objections against committing the patch? > >> > >>What do MvL and MAL say? > > > > Because of the size, I'm sure there are still bugs in it. I couldn't > > spot any by inspection, so I think the patch is ready to be installed. > > +1. OK, anchors away then! :-) --Guido van Rossum (home page: http://www.python.org/~guido/) From pinard@iro.umontreal.ca Mon Sep 2 16:15:54 2002 From: pinard@iro.umontreal.ca (=?iso-8859-1?q?Fran=E7ois?= Pinard) Date: Mon, 02 Sep 2002 11:15:54 -0400 Subject: [Python-Dev] Re: Python-dev summary for 2002-08-15 - 2002-09-01 In-Reply-To: <200209021401.g82E1k030628@pcp02138704pcs.reston01.va.comcast.net> (Guido van Rossum's message of "Mon, 02 Sep 2002 10:01:45 -0400") References: <15730.52469.604124.730029@localhost.localdomain> <200209021401.g82E1k030628@pcp02138704pcs.reston01.va.comcast.net> Message-ID: >> you'd have >> >> Brett> Guido said he will revive the PEP. The patch has since been put >> Brett> on SF at . >> >> The two changes make it much more likely that email readers will be able to >> successfully highlight such URLs correctly. > > I think adding http:// alone should be sufficient. Despite all the > official recommendations, I've always hated the <...> form. Gnus highlights correctly with the `http://', and adds clickability. The `<' and '>' are not needed. I do not know what other mail readers do. To get the same effects with email addresses, I often prefer using `mailto:' as a prefix over writing `<' and `>' around a quoted address in a message body, even if not fully systematic about this. In the message header itself, `<' and '>' are the proper way to go, of course. 
-- François Pinard http://www.iro.umontreal.ca/~pinard

From tim.one@comcast.net Mon Sep 2 16:41:00 2002 From: tim.one@comcast.net (Tim Peters) Date: Mon, 02 Sep 2002 11:41:00 -0400 Subject: [Python-Dev] Re: Python-dev summary for 2002-08-15 - 2002-09-01 In-Reply-To: Message-ID:

I usually add <> to http thingies when I remember to. A couple people yelled at me, claiming their readers couldn't recognize http thingies otherwise. This seems particularly odd, since I almost always put them on their own line:

    http://www.python.org

OTOH, *my* reader doesn't recognize them in the style, neither with nor without <>.

From barry@python.org Mon Sep 2 16:48:47 2002 From: barry@python.org (Barry A. Warsaw) Date: Mon, 2 Sep 2002 11:48:47 -0400 Subject: [Python-Dev] To commit or not to commit References: <3D6A7742.1030005@livinglogic.de> <200208261847.g7QIlI806850@pcp02138704pcs.reston01.va.comcast.net> <3D731B13.9090909@lemburg.com> <3D7349F1.4090100@livinglogic.de> Message-ID: <15731.34911.231999.691324@anthem.wooz.org>

>>>>> "WD" == Walter Dörwald writes:

WD> OK, I'll check it in then.

Does that mean it's time to mark PEP 293 as Final and move it to the Finished PEPs category in PEP 0?

-Barry

From walter@livinglogic.de Mon Sep 2 17:29:14 2002 From: walter@livinglogic.de (=?ISO-8859-1?Q?Walter_D=F6rwald?=) Date: Mon, 02 Sep 2002 18:29:14 +0200 Subject: [Python-Dev] To commit or not to commit References: <3D6A7742.1030005@livinglogic.de> <200208261847.g7QIlI806850@pcp02138704pcs.reston01.va.comcast.net> <3D731B13.9090909@lemburg.com> <3D7349F1.4090100@livinglogic.de> <15731.34911.231999.691324@anthem.wooz.org> Message-ID: <3D7391DA.6010306@livinglogic.de>

Barry A. Warsaw wrote:

>>>>>>"WD" == Walter Dörwald writes:
>>>>>
>
> WD> OK, I'll check it in then.
>
> Does that mean it's time to mark PEP 293 as Final and move it to the
> Finished PEPs category in PEP 0?

Guido already changed PEP 283, so: yes.
Only a few cleanup tasks remain (Neal's comments, LaTeX documentation for the rest of the C functions).

Bye, Walter Dörwald

From barry@python.org Mon Sep 2 17:48:47 2002 From: barry@python.org (Barry A. Warsaw) Date: Mon, 2 Sep 2002 12:48:47 -0400 Subject: [Python-Dev] Re: Python-dev summary for 2002-08-15 - 2002-09-01 References: Message-ID: <15731.38511.160332.641594@anthem.wooz.org>

>>>>> "TP" == Tim Peters writes:

TP> I usually add <> to http thingies when I remember to. A
TP> couple people yelled at me, claiming their readers couldn't
TP> recognize http thingies otherwise. This seems particularly
TP> odd, since I almost always put them on their own line:

TP> http://www.python.org

As do I.

TP> OTOH, *my* reader doesn't recognize them in the
TP>
TP> style, neither with nor without <>.

Mine does too, but it's not the <> that is the distinguishing feature, AFAIK. The <> seem to be most useful for inline urls where trailing punctuation gets incorrectly attached to the url.

-Barry

From drifty@bigfoot.com Mon Sep 2 18:11:40 2002 From: drifty@bigfoot.com (Brett Cannon) Date: Mon, 2 Sep 2002 10:11:40 -0700 (PDT) Subject: [Python-Dev] Python-dev summary for 2002-08-15 - 2002-09-01 In-Reply-To: <200209021401.g82E1k030628@pcp02138704pcs.reston01.va.comcast.net> Message-ID:

[Guido van Rossum]
> > you'd have
> >
> > Brett> Guido said he will revive the PEP. The patch has since been put
> > Brett> on SF at .
> >
> > The two changes make it much more likely that email readers will be able to
> > successfully highlight such URLs correctly.
>
> I think adding http:// alone should be sufficient. Despite all the
> official recommendations, I've always hated the <...> form. However,
> do keep a space after the URL if punctuation were to follow (which you
> already did).
>

I think I will go with adding http:// to all addresses and putting them on their own line.

-Brett

From martin@v.loewis.de Mon Sep 2 21:31:56 2002 From: martin@v.loewis.de (Martin v.
Loewis) Date: 02 Sep 2002 22:31:56 +0200 Subject: [Python-Dev] PyString_DecodeEscape and PEP293 In-Reply-To: <3D7349B2.8010706@livinglogic.de> References: <3D60EA3B.7030008@livinglogic.de> <3D7349B2.8010706@livinglogic.de> Message-ID:

Walter Dörwald writes:

> So does escape_decode need an errors argument. AFAICT
> escape_decode is used only in the context of reading pickles.
> Will there ever be a need to call escape_decode with anything
> other than errors="strict"?

It's a codec, so anybody is entitled to write

    "foo".decode("string-escape", "replace")

if they chose to. If you are suggesting that this is not supported is only acceptable if you also suggest how it should fail. Silently ignoring the "replace" argument is not acceptable.

> > What does this have to do with Unicode?
>
> I expected that all codecs to unicode<->8bit coding/decoding
> "string-escape" seems to be an exception.

That was my original expectation as well. By now, I have accepted things like

    >>> "foo".encode("base64")
    'Zm9v\n'

So codecs can do way more things than converting between unicode<->byte strings. Whether it is a good thing that they are that flexible is still open to debate, however, it was convenient for string-escape.

> So it doesn't need the new PEP293 error handling?

Probably not - just supporting "strict", "replace", "ignore", and failing for any other error handling would be sufficient. If you manage to make it fail for anything but "strict", that would be acceptable as well (IMO).

Regards, Martin

From tdelaney@avaya.com Tue Sep 3 00:25:19 2002 From: tdelaney@avaya.com (Delaney, Timothy) Date: Tue, 3 Sep 2002 09:25:19 +1000 Subject: [Python-Dev] Python-dev summary for 2002-08-15 - 2002-09-01 Message-ID:

> From: Brett Cannon [mailto:bac@OCF.Berkeley.EDU]
>
> I think I will go with adding http:// to all addresses and
> putting them on their own line.

May I suggest that this may be a good test document for reStructuredText?
Especially if it is going to such places as slashdot ...

Tim Delaney

From skip@pobox.com Tue Sep 3 02:37:48 2002 From: skip@pobox.com (Skip Montanaro) Date: Mon, 2 Sep 2002 20:37:48 -0500 Subject: [Python-Dev] Re: Python-dev summary for 2002-08-15 - 2002-09-01 In-Reply-To: References: <15730.52469.604124.730029@localhost.localdomain> <200209021401.g82E1k030628@pcp02138704pcs.reston01.va.comcast.net> Message-ID: <15732.4716.629172.615326@12-248-11-90.client.attbi.com>

>> I think adding http:// alone should be sufficient. Despite all the
>> official recommendations, I've always hated the <...> form.

François> Gnus highlights correctly with the `http://', and adds
François> clickability. The `<' and '>' are not needed.

I use VM. It highlights correctly as long as the leading "http://" is there, provided the URL isn't followed immediately by punctuation which can occur in a URL. In Brett's summary he avoids that problem by adding a space between the URL and the ambiguous punctuation. I think that looks odder than the <...> notation.

Skip

From barry@python.org Tue Sep 3 05:16:32 2002 From: barry@python.org (Barry A. Warsaw) Date: Tue, 3 Sep 2002 00:16:32 -0400 Subject: [Python-Dev] Re: Python-dev summary for 2002-08-15 - 2002-09-01 References: <15730.52469.604124.730029@localhost.localdomain> <200209021401.g82E1k030628@pcp02138704pcs.reston01.va.comcast.net> <15732.4716.629172.615326@12-248-11-90.client.attbi.com> Message-ID: <15732.14240.941982.728027@anthem.wooz.org>

>>>>> "SM" == Skip Montanaro writes:

>> I think adding http:// alone should be sufficient. Despite all
>> the official recommendations, I've always hated the <...> form.

SM> In Brett's summary he avoids that problem by adding a space
SM> between the URL and the ambiguous punctuation. I think that
SM> looks odder than the <...> notation.

I agree (and also use VM :), but putting the url on a separate line looks fine, unless that greatly increases the vertical whitespace.
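The trailing-punctuation problem described in this sub-thread can be sketched in a few lines of Python. This is only an illustration, assuming a deliberately naive recognizer; the regular expressions and function names below are made up for the example and are not what Gnus, VM or Outlook actually use.

```python
import re

# Naive recognizer: everything up to the next whitespace is taken as the URL.
NAIVE_URL = re.compile(r"http://\S+")
# Bracketed recognizer: the closing '>' ends the URL unambiguously.
BRACKETED_URL = re.compile(r"<(http://[^>\s]+)>")


def naive_urls(text):
    """Return URLs found by the whitespace-delimited pattern."""
    return NAIVE_URL.findall(text)


def bracketed_urls(text):
    """Return URLs found inside <...> delimiters."""
    return BRACKETED_URL.findall(text)


inline = "The patch is at http://python.org/sf/599331."
spaced = "The patch is at http://python.org/sf/599331 ."
wrapped = "The patch is at <http://python.org/sf/599331>."

print(naive_urls(inline))       # the sentence-ending '.' is swallowed
print(naive_urls(spaced))       # the extra space keeps the '.' out
print(bracketed_urls(wrapped))  # the brackets keep the '.' out
```

Putting the URL on its own line sidesteps the issue in the same way the extra space does: nothing ambiguous follows the URL before the next whitespace.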
-Barry From akim@epita.fr Tue Sep 3 07:26:52 2002 From: akim@epita.fr (Akim Demaille) Date: 03 Sep 2002 08:26:52 +0200 Subject: [Python-Dev] Re: HAVE_CONFIG_H In-Reply-To: References: <200207291930.g6TJUYi05460@pcp02138704pcs.reston01.va.comcast.net>

<200207301539.g6UFdUS09930@odiug.zope.com> <200207301622.g6UGMBl17143@odiug.zope.com>

Message-ID:

>>>>> "François" == François Pinard writes:

François> [Akim Demaille]
>> I'm not sure I completely understand the question here: if
>> HAVE_CONFIG_H is specified, it means config.h is created. So if
>> you use a config.h, why does it matter not to define HAVE_CONFIG_H?

François> Hi, Akim. I hope life is still good to you! :-)

Hi François! The new (scholar) year is starting now, so life is still good, but I'm a bit afraid of what it might be done in the near future :)

François> In the beginnings of Autoconf, the `config.h' file did not
François> exist. David MacKenzie added it as a way to reduce the
François> `make' output clutter. Nowadays, I suspect almost all
François> packages of at least moderate size uses it.

Agreed.

François> Our traditional `lib/' modules have to work in many
François> packages, whether `config.h' has been created or not, this
François> being decided on a per package basis, and that is why there
François> is a conditional inclusion of `config.h' in each of these
François> `lib/' modules. He took a good while before we got
François> stabilised on the exact stanza of this inclusion (I
François> especially remember the massive unilateral changes by Roland
François> McGrath introducing the BROKEN_BROKET define, or something
François> like that, and all the doing it later took to clean this
François> out.)

I understand.

François> Python (the distribution, which is what is in question here)
François> does not use any of our `lib/' things, it is not going to
François> use them, and it is not going to provide new such modules,
François> so the distribution includes `config.h' everywhere, by
François> permanent choice, without any need to use `HAVE_CONFIG_H' to
François> decide if that inclusion is needed or not. So, even
François> `-DHAVE_CONFIG_H' is useless `make' clutter in this case,
François> and that's why the Python packagers wanted to get rid of it.
François> In fact, in practice `-DHAVE_CONFIG_H' is only needed for
François> packages using those common `lib/' modules, but many
François> packages do not. Now that Autoconf is used with projects
François> who have a life outside GNU, this is less necessary. Guido
François> found, and got me to remember, that `@DEFS@' is the culprit:
François> people just do not have to use it in their hand-crafted
François> Makefiles, which is the case for Python. For away-from-GNU
François> packages using Automake, some Automake option might exist so
François> `@DEFS@' does not get generated? The only goal here is to
François> get a cleaner `make' output.

I understand the goal, but much of the effort is devoted to having the thing work cleanly, not being beautiful. Another goal is to have it being easy to maintain, i.e., not having too much to document, too much to support, too much to test etc. So, although I don't know what the Automake team might think of this idea, I suspect they'll want to focus on other features :(

From sholden@holdenweb.com Tue Sep 3 11:52:36 2002 From: sholden@holdenweb.com (Steve Holden) Date: Tue, 3 Sep 2002 06:52:36 -0400 Subject: [Python-Dev] Re: Python-dev summary for 2002-08-15 - 2002-09-01 References: Message-ID: <00cb01c25338$05d87120$6300000a@holdenweb.com>

-----
> I usually add <> to http thingies when I remember to. A couple people
> yelled at me, claiming their readers couldn't recognize http thingies
> otherwise. This seems particularly odd, since I almost always put them on
> their own line:
>
> http://www.python.org
>
> OTOH, *my* reader doesn't recognize them in the
>
>
>
> style, neither with nor without <>.
>

But it nevertheless sends out something that *it* will recognise as a URL. Both your references were correctly represented as hyperlinks in OE when I read your message!
regards ----------------------------------------------------------------------- Steve Holden http://www.holdenweb.com/ Python Web Programming pydish.holdenweb.com/pwp/ Previous .sig file retired to www.homeforoldsigs.com ----------------------------------------------------------------------- From greg@python.org Tue Sep 3 14:41:12 2002 From: greg@python.org (Greg Ward) Date: Tue, 3 Sep 2002 09:41:12 -0400 Subject: [Python-Dev] The first trustworthy GBayes results In-Reply-To: References: <20020828194248.GA16407@cthulhu.gerg.ca> Message-ID: <20020903134112.GC1227@cthulhu.gerg.ca> [Tim, last week] > What's an acceptable false positive rate? [my response] > Speaking as one of the people who reviews suspected spam for python.org > and rescues false positives, I would say that the more relevant figure > is: how much suspected spam do I have to review every morning? < 10 > messages would be peachy; right now it's around 5-20 messages per day. [Tim again] > I must be missing something. I would *hope* that you review *all* messages > claimed to be spam, in which case the number of msgs to be reviewed would, > in a perfectly accurate system, be equal to the number of spams received. Good lord, certainly not! Remember that Exim rejects a couple hundred messages a day that never get near SpamAssassin -- that's mostly Chinese/Korean junk that's rejected on the basis of 8-bit chars or banned charsets in the headers. Then, probably 50-75% of what SA gets its hands on scores >= 10.0, so it too is rejected at SMTP time. Only messages that score < 10 are accepted, and those that score >= 5.0 are set aside in /var/mail/spam for review. That's 10-30 messages/day. (I do occasionally scan Exim's reject log on mail.python.org to see what's getting rejected today -- Exim kindly logs the full headers of every message that is rejected after the DATA command. I usually make it to about 11am of a given day's logfile before my eyes glaze over from the endless stream of spam and viruses.) 
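The three-way handling of SpamAssassin scores described above can be written down as a small function. This is a minimal sketch, assuming only the two thresholds quoted in this message (5.0 and 10.0); the function and constant names are invented for illustration and are not taken from the mail.python.org configuration.

```python
REJECT_AT = 10.0  # score >= 10.0: rejected at SMTP time
REVIEW_AT = 5.0   # 5.0 <= score < 10.0: set aside for manual review


def triage(score):
    """Map a SpamAssassin score to one of three dispositions."""
    if score >= REJECT_AT:
        return "reject"
    if score >= REVIEW_AT:
        return "review"
    return "accept"


for score in (1.2, 5.0, 9.9, 10.0, 23.5):
    print(score, triage(score))
```

Only the "review" band costs a human any time, which is why the size of that band, rather than the total spam volume, is the figure that matters to the reviewer.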
Note that we *used* to accept messages before passing them to SpamAssassin, so never rejected anything on the basis of its SA score. Back then, we saved and reviewed probably 50-70 messages/day. Very, very, very few (if any) false positives scored >= 10.0, which is why that's the threshold for SMTP-time rejection. > OTOH, the false positive rate doesn't have anything to do with the number of > spams received, it has to do with the number of non-spams received. Err, yeah, good point. I make a point of talking about "suspected spam", which is any message that scores between 5.0 and 10.0. IMHO, the true nature of those messages can only be determined by manual inspection. > Maybe you don't want this kind of approach at all. The classifier doesn't > have "gray areas" in practice: it tends to give probabilites near 1, or > near 0, and there's very little in between -- a msg either has a > preponderance of spam indicators, or a preponderance of non-spam indicators. That's a great improvement over SpamAssassin then: with SA, the grey area (IMHO) is scores from 3 to 10... which is why several python.org lists now have a little bit of Mailman configuration magic that makes MM set aside messages with an SA score >= 3 for list admin review. (It's probably worth getting the list admin to do a bit more work in order to avoid sending low-scoring spam to the list.) However, as long as "very little" != "nothing", we still need to worry a bit about that grey area. What do you think we should do with a message whose spam probability is between (say) 0.1 and 0.9? Send it on, reject it, or set it aside? Just how many messages fall in that grey area anyways? Greg -- Greg Ward http://www.gerg.ca/ MTV -- get off the air! 
-- Dead Kennedys From mcherm@destiny.com Tue Sep 3 15:11:45 2002 From: mcherm@destiny.com (Michael Chermside) Date: Tue, 03 Sep 2002 10:11:45 -0400 Subject: [Python-Dev] Re: PEP 218 (sets); moving set.py to Lib Message-ID: <3D74C321.7070103@destiny.com> >> Hmm, I intended to have s1.refresh() return a new object for use in >> s2 while leaving s1 alone (being immutable and all). Now, I wonder >> if that was the right thing to do. The answer lies in use cases for >> algorithms that need sets of sets. If anyone knows off the top of >> their head that would be great; otherwise, I seem to remember that >> some of that business was found in compiler algorithms and graph >> packages. > > Let's call YAGNI on this one. > Furthermore, what if I create a BIG set like this: s = ImmutableSet( range(2**x) ) Now, not only do I use lots of memory for s, I ALSO keep around lots of memory to preserve a temporary list which I never wanted to keep anyhow! -- Michael Chermside From walter@livinglogic.de Tue Sep 3 17:05:21 2002 From: walter@livinglogic.de (=?ISO-8859-15?Q?Walter_D=F6rwald?=) Date: Tue, 03 Sep 2002 18:05:21 +0200 Subject: [Python-Dev] PyString_DecodeEscape and PEP293 References: <3D60EA3B.7030008@livinglogic.de> <3D7349B2.8010706@livinglogic.de> Message-ID: <3D74DDC1.7040609@livinglogic.de> Martin v. Loewis wrote: > Walter D�rwald writes: > > >>So does escape_decode need an errors argument. AFAICT >>escape_decode is used only in the context of reading pickles. >>Will there ever be a need to call escape_decode with anything >>other than errors="strict"? > > > It's a codec, so anybody is entitled to write > > "foo".decode("string-escape", "replace") > > if they chose to. If you are suggesting that this is not supported is > only acceptable if you also suggest how it should fail. Silently > ignoring the "replace" argument is not acceptable. I won't suggest that. Let's keep PyString_DecodeEscape as it is now. 
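The error-handling names in this exchange ("strict", "replace", "ignore", plus PEP 293 callbacks) can be tried out in present-day Python 3. This is a sketch under stated assumptions: Python 3 has no "string-escape" codec, so the ASCII codec stands in for it, and the handler name "demo-hexescape" is made up for the example.

```python
import codecs

raw = b"caf\xe9"  # 0xE9 is not valid ASCII, so decoding hits an error

# "strict" raises on the first undecodable byte.
try:
    raw.decode("ascii", "strict")
    strict_result = "decoded"
except UnicodeDecodeError:
    strict_result = "error"

# "replace" substitutes U+FFFD; "ignore" drops the offending byte.
replaced = raw.decode("ascii", "replace")
ignored = raw.decode("ascii", "ignore")


def hexescape(exc):
    """PEP 293 style callback: return (replacement text, resume position)."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    return "".join("<%02x>" % byte for byte in bad), exc.end


# Hypothetical handler name, registered only for this demonstration.
codecs.register_error("demo-hexescape", hexescape)
custom = raw.decode("ascii", "demo-hexescape")

# An unregistered name fails loudly once an error actually needs handling.
try:
    raw.decode("ascii", "no-such-handler")
    unknown_result = "decoded"
except LookupError:
    unknown_result = "lookup-error"

print(strict_result, repr(replaced), repr(ignored), custom, unknown_result)
```

The last case is the behaviour asked for in the thread: an unsupported error-handling name is reported rather than silently ignored.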
It should not be a problem for encoding, because encoding can't fail, so there is no need for using "xmlcharrefreplace" etc. as the error handling. Decoding can fail, but let's add custom error handling only when the need for it arises (which hopefully won't).

> [...]
>>So it doesn't need the new PEP293 error handling?
>
> Probably not - just supporting "strict", "replace", "ignore", and
> failing for any other error handling would be sufficient. If you
> manage to make it fail for anything but "strict", that would be
> acceptable as well (IMO).

OK, let's keep PyString_DecodeEscape as it is now (i.e. "strict", "ignore", "replace" implemented inline with no custom error handling).

Bye, Walter Dörwald

From tim.one@comcast.net Tue Sep 3 17:27:57 2002 From: tim.one@comcast.net (Tim Peters) Date: Tue, 03 Sep 2002 12:27:57 -0400 Subject: [Python-Dev] Re: Python-dev summary for 2002-08-15 - 2002-09-01 In-Reply-To: <00cb01c25338$05d87120$6300000a@holdenweb.com> Message-ID:

[Tim]
> OTOH, *my* reader doesn't recognize them in the
>
>
>
> style, neither with nor without <>.

[Steve Holden]
> But it nevertheless sends out something that *it* will recognise as a URL.

I think you're assuming I use Outlook Express. I don't; "my reader" is usually Outlook 2000.

> Both your references were correctly represented as hyperlinks in OE when I
> read your message!

Yes, OE and Outlook differ in this respect.

From guido@python.org Tue Sep 3 17:53:45 2002 From: guido@python.org (Guido van Rossum) Date: Tue, 03 Sep 2002 12:53:45 -0400 Subject: [Python-Dev] Python-dev summary for 2002-08-15 - 2002-09-01 In-Reply-To: Your message of "Sun, 01 Sep 2002 15:57:53 PDT." References: Message-ID: <200209031653.g83GrjQ01929@odiug.zope.com>

> Yes, with Michael's permission, I am attempting to start up the Python-dev
> summaries again. Below is my attempt at summarizing the last half of
> August.
It's longer then normal summaries, but that is because I bothered > to include discussions on threads that were not directly relating to the > Python core but are interesting nonetheless (e.g., the whole spambayes > thread). > > I am posting to Python-dev first before posting to c.l.py, c.l.py.a (also > lwn.net and probably Slashdot) because I want to get the general okay from > the list that I have done a good enough of a job to send this out; I don't > want to have a summary that represents the going-ons here without the > general populace (or just the BDFL since he can overrule =) being okay > with it. I am also curious as to whether I should go into more or less > detail, leave out the summaries that do not directly pertain to the Python > core, etc. > > So please read the summary and let me know if you are okay with it. If so > I will try to do semi-monthly summaries from now on. Oh, and I am on > vacation right now and will be doing a lot of travelling in the next two > months, so I can't guarantee summaries will be this quick to come out for > a while. I will do them, though, even if they are a week late. =) > > Oh, and if I do get the okay to do this, expect a lot of dumb questions > from me in the future in terms of clarifying things. Just remember, it is > for the good of the Python community. =) Thanks, Brett. Minor comments ahead; but basically, go ahead -- don't let striving for perfection keep you from posting something good! > > ======================================= > > > This is a summary of traffic on the python-dev mailing list between August > 16, 2002 and September 1, 2002 (exclusive). It is intended to inform the > wider Python community of ongoing developments. To comment, just post to > python-list@python.org or comp.lang.python in the usual way. Give your > posting a meaningful subject line, and if it's about a PEP, include the > PEP number (e.g. 
Subject: PEP 201 - Lockstep iteration) All python-dev > members are interested in seeing ideas discussed by the community, so > don't hesitate to take a stance on a PEP if you have an opinion. > > This is the first summary written by Brett Cannon. > Summaries are archived no where at the moment. =) They will be, though, > so stay tuned for the URL in future summaries. > > > > Posting distribution (with apologies to mbm, but thanks to mwh for the > code) > > Number of articles in summary: 585 > > 80 | [|] > | [|] > | [|] > | [|] > | [|] [|] > 60 | [|] [|] [|] > | [|] [|] [|] > | [|] [|] [|] > | [|] [|] [|] > | [|] [|] [|] [|] > 40 | [|] [|] [|] [|] [|] > | [|] [|] [|] [|] [|] [|] [|] > | [|] [|] [|] [|] [|] [|] [|] [|] > | [|] [|] [|] [|] [|] [|] [|] [|] [|] [|] > | [|] [|] [|] [|] [|] [|] [|] [|] [|] [|] [|] > 20 | [|] [|] [|] [|] [|] [|] [|] [|] [|] [|] [|] [|] [|] > | [|] [|] [|] [|] [|] [|] [|] [|] [|] [|] [|] [|] [|] > | [|] [|] [|] [|] [|] [|] [|] [|] [|] [|] [|] [|] [|] > | [|] [|] [|] [|] [|] [|] [|] [|] [|] [|] [|] [|] [|] [|] [|] > | [|] [|] [|] [|] [|] [|] [|] [|] [|] [|] [|] [|] [|] [|] [|] [|] > 0 +-071-025-012-042-063-084-030-021-039-009-047-027-033-041-036-005 > Fri 16| Sun 18| Tue 20| Thu 22| Sat 24| Mon 26| Wed 28| Fri 30| > Sat 17 Mon 19 Wed 21 Fri 23 Sun 25 Tue 27 Thu 29 Sat 31 I'm not sure I care about this diagram. It's also kind of hard to read. I would mind less if it was at the end of the summary. > > > ================ > Type Categories > ================ > This VERY long thread was sparked by Andrew Koenig asking if a discussion > of making type categories more explicit had ever occured (Andrew meant for > category to mean "the set of all types that implement a particular marker > interface"). As Andrew later pointed out, he was asking about "a way of > making notions such as 'file-like object' more formal and/or automatic". 
> The discussion quickly started using the term interface to mean defining a > way to specify that an object implemented certain methods (think of it in > terms of Java's 'implements' mechanism). Once that was out of the way, > the discussion took off. Zope's implementation was pointed out > (http://cvs.zope.org/Zope3/lib/python/Interface/) very quickly. PEP 245 > (Python Interface Syntax) was also brought to the attention of the list. > The idea of using inheritance to handle interfaces was brought up. Guido > said that he hasn't "given up the hope that inheritance and interfaces > could use the same mechanisms. But Jim Fulton, based on years of > experience in Zope, claims they really should be different" in terms of > how interfaces should be handled in objects. Jeremy Hylton tried to > channel Jim's opinion by pointing out that "We'd like to use interfaces to > make fairly strong claims. If a class A implements an interface I, then > we should be able to use an instance of A anywhere that an I is needed." > But "the inheritance mechanism is too general" because if a class A > implements interface I and then a class B, which does not implement I, > subclasses class A we end up with a class B that claims it has a certain > interface which it doesn't actually have. Guido understood the point, but > still thought inheritence could be used "if there was a way to "shut off" > inheritance as far as isinstance() (or issubclass())" is concerned. Guido > asked the simple question, "Why do keep arguing for inheritance? (a) the > need to deny inheritance from an interface, while essential, is relatively > rare IMO, and in *most* cases the inheritance rules work just fine; (b) > having two separate but similar mechanisms makes the language larger." > Samuele Pedroni asked that any implementation "allow also for refering to > anonymous super-interfaces of an interface in terms of the interface plus > a subset of its signatures, also e.g. FileLike and just 'write'. 
[that
> means an interface can be thought to correspond to a set of
> (tag,signature) tuples, where tag identifies the interface, and one can
> also just consider subsets of it]". The thread finally seems to have
> stopped (for now) with Guido saying he is mulling the whole thing in the
> back of his head. This is a very sticky topic because of the number of
> design decisions required and how it might change the way people program
> in Python.

Please break up that paragraph into pieces shorter than 12 lines each. :-)

> There was also a partial sub-thread in this whole discussion about
> multimethods; basically a way to do overloading of methods based on
> parameter signature. Most of the discussion was over syntax and
> how to handle resolution order. It then seemed to fall by the wayside when
> the main part of the thread took over again.
>
> ==============================
> type categories -- an example
> ==============================
> This thread was started when Andrew Koenig said that the reason he
> brought up his type category question was that he wanted a way to
> identify members of a type easily. He now had an example in a
> program he was writing where the type of the argument varied, and
> what needed to be done to the data changed accordingly. Jeremy
> Hylton suggested the isinstance(obj, type(re.compile(''))) idiom. Andrew
> asked if this was guaranteed to work, and Jeremy said no. I asked why
> this was not guaranteed, and Fredrik Lundh said it is because re.compile() is
> a factory function and it is possible that a future version could return a
> different object based on the pattern.
>
> ===============================================
> Python build trouble with the new gcc/binutils
> ===============================================
> Andrew Koenig said that he couldn't compile Python using the newest gcc
> (this was the day after the latest release hit servers).
With help from > Zack Weinberg of Code Sourcery (who also recently rewrote the tempfile > module), the problem was tracked down to binutils 2.13. being the culprit > and was not Python's fault. > > =================================== > Last call: mortal interned strings > =================================== > The patch python.org/sf/576101 removes the default immortality of interned > strings. I believe it was in early August (possibly spilled over from > late July) when Oren Tirosh proposed the idea and wrote the above > mentioned patch. There had been some discussion over whether any 3rd > party code was reliant upon interned strings being immortal; none was > found (MacPython was reliant upon it, but since it is under Python core > control it was considered a moot point since it could be changed). It has > been checked in. With the patch the way to make a string immortal is to > call PyString_InternImmortal(); no code in the core uses this function. > > ===================================== > PEP 218 (sets); moving set.py to Lib > ===================================== > Thanks to Greg Wilson (for writing the PEP), Alex Martelli (for writing > the module initially), and Guido (for refactoring Alex's code) the stdlib You might add Raymond Hettinger who wrote the docs and did significant work on the code after me. Also Tim Peters who added some good speedups. > has now gained a sets module. It has both the notion of mutable and > immutable sets (the latter used when you have a set of sets). There was > discussion about how sets should print (sorted or not; unsorted is default > but option is there to print sorted) This option is no longer documented though. It may yet disappear. > and what operators should be > overloaded for working on sets (| and & were chosen). The module is a > beautiful chunk of code and I highly recommend reading its source. Thanks. 
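The operators chosen for sets (| for union, & for intersection) and the mutable/immutable split can be sketched as follows. This is only an illustration, and it assumes a current Python with the built-in set and frozenset types rather than the 2002 sets module discussed in the summary; the variable names are made up for the example.

```python
# Assumption: built-in set/frozenset (modern Python), not the 2002 sets module.
evens = {0, 2, 4}
small = {0, 1, 2}

union = evens | small     # elements in either set
common = evens & small    # elements in both sets

# A set of sets needs immutable (hashable) members, hence frozenset.
family = {frozenset(evens), frozenset(small)}

# A mutable set cannot itself be a member of a set.
try:
    {evens}
except TypeError:
    mutable_member_ok = False
else:
    mutable_member_ok = True
```

The last few lines show why an immutable flavour is needed at all: only hashable objects can be set members.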
> ===========================================
> A few lessons from the tempfile.py rewrite
> ===========================================
> Zack Weinberg, after rewriting the tempfile module, brought up three
> points:
> 1) Lack of dummy threads, 2) lack of a pthread_once equivalent, and 3)
> lack of a way to skip tests from unittest.py via some built-in method.
> Guido responded accordingly: 1) since some code uses the idiom of trying
> to import thread and catching the exception if it fails, Guido said he
> would be willing to accept a dummy_thread.py that would allow:
>
>     try:
>         import thread as _thread
>     except ImportError:
>         import dummy_thread as _thread
>
> to work. No word on whether this is being written at the moment. 2)
> Guido said the method was, in his opinion, overkill. He said to "be
> Pythonic, live dangerously, accept the risk that a ^C can screw you. It
> can anyway. :-)". And as for 3) Guido referred Zack to the PyUnit list
> and Steve Purcell since Python just tracks Steve's code (pyunit.sf.net).
> Guido's suggestion was to stick tests that rely on some other code
> in a separate test suite that is only run when that code is
> available.
>
> ===========================
> Standard datetime objects?
> ===========================
> Kevin Jacobs asked what stage the new datetime object was at. Guido said
> it is in python/nondist/sandbox/datetime/ in CVS which also has comments
> pointing to a wiki containing the current work on it. Fred L. Drake, Jr.
> is working on the C re-implementation and Guido expects a checkin at any
> moment (hasn't happened as of this writing).

Has now, in the sandbox (more to come).

> ===================
> PEP 269 versus 283
> ===================
> Jonathan Riehl noticed that PEP 283 said PEP 269 was dead; not good
> considering he was close to having a patch for PEP 269 (pgen module to
> interface with the C version). Guido said he will revive the PEP.
The
> patch has since been put on SF at python.org/sf/599331 .
>
> ==============================
> What is a backport candidate?
> ==============================
> Since Python 2.2 is going to be around for a long time, the question was
> brought up of what constitutes code that should be backported. Guido made
> the following three points:
>
> 1) code trivial to backport should always be backported
>
> 2) code patching 2.3 code should obviously not be backported
>
> 3) the 2.2 code requires changes before the patch can be used, but the
> patch still applies; gradations of this exist.
>
> So please, when submitting patches, mention whether you think the patch
> should be backported to the 2.2 tree and any possible dependencies it
> might have in a backport.
>
> =================================
> python/nondist/sandbox/spambayes
> =================================
> In response to Paul Graham's spam filter written using Bayes' Rule
> (Slashdot post on it is at
> http://developers.slashdot.org/article.pl?sid=02/08/16/1428238&tid=156), a
> thread spawned around this checkin of code that followed that paper's
> suggestions. This thread quickly jumped into discussions on data
> structures, Bayes' Rule, and a whole lot of talk about spam. Very
> interesting if spam filtering interests you. Tim Peters has been leading
> the drive on this chunk of code (and thanks to the illness that befell
> him in late August, which he has subsequently gotten over, he had a few days
> of major hacking on it; Tim showed he is a performance stats whore
> ).
>
> A very cool quote came out of this thread from Eric S. Raymond when
> discussing the spam filter he has been working on: "This is actually the
> first new program I've coded in C (rather than
> Python) in a good four years or so".

(Several of us think even this didn't have to be coded in C after all. :-)

> ====================
> Parsing vs. lexing.
> ====================
> In response to a question by Aahz about what the differences were between
> a lexer, parser, and tokenizer, Eric Raymond posted a good overview of the
> differences. Guido later commented in an email mentioning SPARK and about
> how Python's parser generator (pgen) works and why he wrote it. He also made some
> other comments on lexers. Jeremy Hylton pointed out a "neat new paper
> about an old algorithm for recursive descent parsers with backtracking and
> unlimited lookahead" by Bryan Ford at http://www.brynosaurus.com/pub.html
> . Alex Martelli pointed out that this discussion reminded him of "a
> long-ago interview with Borland's techies" in which they said they were
> able to make Borland PASCAL fit on a floppy while MS PASCAL took multiple
> floppies. Their trick was "we just did everything by the Dragon Book --
> except that the parser is a hand-written recursive descent parser [Aho &c
> being adamant defenders of Yacc & the like], which buys us a lot".
> Someone named Noah also emailed a discussion of lexers and parsers that
> pulled in Finite State Machines, Pushdown Automata, and Turing Machines.
>
> Martin Sj?n says that Haskell's pattern matching and lazy evaluation makes

Come on, you know his real name is Sjögren. :-)

> lexers easy (even a Recursive-Descent parser), but unfortunately Haskell
> does not play with other languages nicely. Haskell is where Python got
> its list comprehension idea.
>
> =========================================
> [Python-Dev] Fw: Security hole in rexec?
> =========================================
> It was brought to the attention of the list that deleting __builtins__
> allowed a compromise in rexec. Guido pointed out that
> python.org/sf/577530 reports this. He also said don't trust rexec.
>
> A patch is going to be submitted to document the view that rexec is really
> not that safe.

It was checked in.
> =================
> A `cogen' module
> =================
> Francois Pinard asked about Cartesian products using the new sets module.
> Guido didn't think people would in general need it. Francois quickly
> started this thread of discussing a cogen module to generate Cartesian
> products and other ways of operating on sets.

Tim Peters quickly posted *his* elaborate state-of-the-art code, which ended the discussion (as usual, posting code is a good way to stop discussion :-).

> =================
> Mersenne Twister
> =================
> Raymond Hettinger volunteered to implement the Mersenne Twister algorithm
> (one in Python exists at www.math.keio.ac.jp/~matumoto/emt.html). While
> discussing whether to implement it in C or Python, Guido noticed that
> random.Random re-implements whrandom. Guido then came up with the idea of
> writing a base random class that is subclassed where .random() can be
> implemented; Tim Peters agreed and suggested more methods to subclass.
>
> =================================
> New PEP Format: reStructuredText
> =================================
> David Goodger and Barry Warsaw have now gotten reST as a usable syntax for
> PEPs. Read the PEPs on the subject to learn more:
>
> - PEP 12 -- Sample reStructuredText PEP Template
> (http://www.python.org/peps/pep-0012.html)
>
> - PEP 258 -- Docutils Design Specification
> (http://www.python.org/peps/pep-0258.html)
>
> - PEP 287 -- reStructuredText Docstring Format
> (http://www.python.org/peps/pep-0287.html)
>
> ====================================
> tiny optimization in ceval mainloop
> ====================================
> Jeremy Hylton noticed that in ceval there is a test of whether the
> ticker was 0 or if things_to_do was set to true (explanation of the
> ticker, checkinterval, and the GIL follows this paragraph). Jeremy
> wondered if we could just drop the ticker to 0 when things_to_do is true.
> Jack Jansen, though, pointed out that clearing it is not guaranteed since
> there may be an interrupt routine when "we fiddle things_to_do". Skip
> Montanaro then pointed out that since neither ticker nor things_to_do is
> fiddled with unless the GIL is held, instead of causing each thread to
> execute this test they could be made globals; he did a patch
> that implements this (python.org/sf/602191). Guido then said that if
> there wasn't a decent speed improvement, then no patch would be checked
> in. He then changed his mind when it was pointed out that it actually
> simplified the code. Skip tested anyway, though, and there is a speed
> improvement. This also brought up whether the default value of 10 for
> checkinterval was reasonable. It was then agreed to be bumped up to 100.
> Jack ran some code and said he noticed a definite improvement.
>
> Python's version of threading is not like in C. There is something called
> the GIL (Global Interpreter Lock) which any thread wishing to execute
> Python code or play with Python objects must hold. This means that when
> you have Python threads running (using the thread or threading module)
> they are usually all waiting in line to get the GIL. Now for Python to
> decide when to release the GIL for another thread to grab it, it uses the
> ticker. This variable counts down to zero by being decremented every time
> a Python opcode is executed (originally defaulted to 10, now defaulted to
> 100). The ticker's starting value after each release of the GIL is what
> sys.setcheckinterval() sets.
>
> To get a better understanding of threading under Python I recommend
> reading Aahz's tutorials on threading.
>
>
>
> _______________________________________________
> Python-Dev mailing list
> Python-Dev@python.org
> http://mail.python.org/mailman/listinfo/python-dev

All in all, please keep this up!!!
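The ticker countdown described in the summary can be pictured with a toy model. This is a sketch only: it is plain Python, not the C code in ceval, the function name is invented for the example, and it ignores things_to_do entirely. It just counts how often a countdown that starts at checkinterval reaches zero.

```python
def count_periodic_checks(n_opcodes, checkinterval):
    """Toy model (hypothetical helper): how often does the ticker hit zero?

    Assumes checkinterval >= 1. Each loop iteration stands for one
    executed opcode; reaching zero stands for the point where the
    interpreter would offer the GIL to other threads.
    """
    ticker = checkinterval
    checks = 0
    for _ in range(n_opcodes):
        ticker -= 1                  # one decrement per opcode
        if ticker == 0:
            checks += 1              # periodic check would happen here
            ticker = checkinterval   # restart the countdown
    return checks
```

Under this model, raising the interval from 10 to 100 cuts the number of periodic checks per 1000 opcodes by a factor of ten, which is the trade-off the thread was weighing.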
--Guido van Rossum (home page: http://www.python.org/~guido/) From tim.one@comcast.net Tue Sep 3 18:53:36 2002 From: tim.one@comcast.net (Tim Peters) Date: Tue, 03 Sep 2002 13:53:36 -0400 Subject: [Python-Dev] The first trustworthy GBayes results In-Reply-To: <20020903134112.GC1227@cthulhu.gerg.ca> Message-ID: [Tim again] >> I must be missing something. I would *hope* that you review >> *all* messages claimed to be spam, in which case the number of msgs >> to be reviewed would, in a perfectly accurate system, be equal to the >> number of spams received. [Greg Ward] > Good lord, certainly not! Remember that Exim rejects a couple hundred > messages a day that never get near SpamAssassin -- that's mostly > Chinese/Korean junk that's rejected on the basis of 8-bit chars or > banned charsets in the headers. Then, probably 50-75% of what SA gets > its hands on scores >= 10.0, so it too is rejected at SMTP time. Only > messages that score < 10 are accepted, and those that score >= 5.0 are > set aside in /var/mail/spam for review. That's 10-30 messages/day. > > (I do occasionally scan Exim's reject log on mail.python.org to see > what's getting rejected today -- Exim kindly logs the full headers of > every message that is rejected after the DATA command. I usually make > it to about 11am of a given day's logfile before my eyes glaze over from > the endless stream of spam and viruses.) I get about 200 spams per day on my own email accounts, and look at all of them. I don't look at the headers at all, I just look at the msgs in a capable HTML-aware mail reader, as a matter of course while dealing with all the day's email. It's rare that it takes more than a second to recognize a spam by eyeball and hit the delete key. At about 200 per day, it's just now reaching my "hmm, this is becoming a nuisance sometimes" threshold. Our tolerance levels for manual review seem to differ by a factor of 100 or more . 
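Greg's description of the mail.python.org setup amounts to a three-way split on the SpamAssassin score. A minimal sketch, using only the thresholds he quotes (reject at 10.0 or above, set aside for review at 5.0 or above); the function name and return strings are invented for illustration and are not part of Exim or SpamAssassin:

```python
def triage(sa_score, reject_at=10.0, review_at=5.0):
    """Hypothetical helper mirroring the thresholds quoted above."""
    if sa_score >= reject_at:
        return "reject"    # refused at SMTP time
    if sa_score >= review_at:
        return "review"    # accepted, but set aside for a human to check
    return "accept"        # delivered normally
```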
> Note that we *used* to accept messages before passing them to > SpamAssassin, so never rejected anything on the basis of its SA score. > Back then, we saved and reviewed probably 50-70 messages/day. Very, > very, very few (if any) false positives scored >= 10.0, which is why > that's the threshold for SMTP-time rejection. I can tell you the mean false negative and false positive rates on what I've been working on, and even measure their variance across both training and prediction sets. (The fn rate is well under 2% now (adding in more headers should improve that a lot), and the fp rate under 0.05% (but I doubt that adding in more headers will improve this)). So long as we don't know the rates for the scheme you're using now, there's no objective basis for comparison. ... >> Maybe you don't want this kind of approach at all. The classifier doesn't >> have "gray areas" in practice: it tends to give probabilites near 1, or >> near 0, and there's very little in between -- a msg either has a >> preponderance of spam indicators, or a preponderance of non-spam >> indicators. > That's a great improvement over SpamAssassin then: with SA, the grey > area (IMHO) is scores from 3 to 10... which is why several python.org > lists now have a little bit of Mailman configuration magic that makes MM > set aside messages with an SA score >= 3 for list admin review. (It's > probably worth getting the list admin to do a bit more work in order to > avoid sending low-scoring spam to the list.) > > However, as long as "very little" != "nothing", we still need to worry a > bit about that grey area. What do you think we should do with a message > whose spam probability is between (say) 0.1 and 0.9? Send it on, reject > it, or set it aside? Under Graham's scheme, send it on. 
It doesn't have grey areas in a useful sense, because the scoring step only looks at a handful of extremes: extremes in, extremes out, and when it's wrong it's *spectacularly* wrong (e.g., the very rare (< 0.05%) false positives generally have "probabilities" exceeding 0.99, and a false negative often has a "probability" less than 0.01).

> Just how many messages fall in that grey area anyways?

I can't get at my testing setup now and don't know the answer offhand. I'll try to make time tonight to determine the answer. I guess the interesting stats are what percent of hams have probs in (0.1, 0.9), and what percent of spams. In general, it's only very brief messages that don't score near 0.0 or 1.0, so this *may* turn out to be the same thing as asking what percentages of hams and spams are very brief.

Note too that adding the headers in *should* catch a lot more spam under this scheme. But, even as is, and even if I strip all the HTML tags out of spam, fewer than 1 spam in 50 scores less than 0.9. The ones that are passed on now include all spams with empty bodies (a message with an empty body scores 0.5).

From tismer@tismer.com Tue Sep 3 19:26:01 2002
From: tismer@tismer.com (Christian Tismer)
Date: Tue, 03 Sep 2002 20:26:01 +0200
Subject: [Python-Dev] Get rid of etype struct
Message-ID: <3D74FEB9.5060406@tismer.com>

This is a multi-part message in MIME format.
--------------080306050703000101060801
Content-Type: text/plain; charset=us-ascii; format=flowed
Content-Transfer-Encoding: 7bit

Hi Guido,

I think I have a solution for this one, see the attached diff. I did what you suggested: Make the addressing of the members dependent on the metatype. The etype struct has lost its members[1] field, to make it easier to extend the structure. Instead, the allocator always adds one to the size, to have the sentinel in place.

I did not yet publish the etype structure, since I didn't find a good name and place for it. Testing was also not very thorough.
I just checked that types work from Python and that I can add __slots__ to them. Will re-port this stuff to my Py2.2 Stackless base and try it out as base type for my own C types.

It took me the whole day to understand how it must work, and then just an hour to get it to work. This is quite some stuff :-)

Can somebody please have a look to see if there are subtle errors?

ciao - chris

----------------------------------------------------------------------
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=591586&group_id=5470

--------------080306050703000101060801
Content-Type: text/plain; charset=us-ascii; name="typeobject.diff"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline; filename="typeobject.diff"

cvs -z9 diff -u dist/src/Objects/typeobject.c
Index: dist/src/Objects/typeobject.c
===================================================================
RCS file: /cvsroot/python/python/dist/src/Objects/typeobject.c,v
retrieving revision 2.179
diff -u -r2.179 typeobject.c
--- dist/src/Objects/typeobject.c 16 Aug 2002 17:01:08 -0000 2.179
+++ dist/src/Objects/typeobject.c 3 Sep 2002 18:04:39 -0000
@@ -20,9 +20,12 @@
        see add_operators() below. */
     PyBufferProcs as_buffer;
     PyObject *name, *slots;
-    PyMemberDef members[1];
+    /* here are optional user slots, followed by the members. */
 } etype;

+#define GET_MEMBERS(etype) \
+    ((PyMemberDef *)(((char *)etype) + (etype)->type.ob_type->tp_basicsize)-1)
+
 static PyMemberDef type_members[] = {
     {"__basicsize__", T_INT, offsetof(PyTypeObject,tp_basicsize),READONLY},
     {"__itemsize__", T_INT, offsetof(PyTypeObject, tp_itemsize), READONLY},
@@ -213,7 +216,8 @@
 PyType_GenericAlloc(PyTypeObject *type, int nitems)
 {
     PyObject *obj;
-    const size_t size = _PyObject_VAR_SIZE(type, nitems);
+    const size_t size = _PyObject_VAR_SIZE(type, nitems+1);
+    /* note that we need to add one, for the sentinel */

     if (PyType_IS_GC(type))
         obj = _PyObject_GC_Malloc(size);
@@ -253,7 +257,7 @@
     PyMemberDef *mp;

     n = type->ob_size;
-    mp = ((etype *)type)->members;
+    mp = GET_MEMBERS((etype *)type);
     for (i = 0; i < n; i++, mp++) {
         if (mp->type == T_OBJECT_EX) {
             char *addr = (char *)self + mp->offset;
@@ -318,7 +322,7 @@
     PyMemberDef *mp;

     n = type->ob_size;
-    mp = ((etype *)type)->members;
+    mp = GET_MEMBERS((etype *)type);
     for (i = 0; i < n; i++, mp++) {
         if (mp->type == T_OBJECT_EX && !(mp->flags & READONLY)) {
             char *addr = (char *)self + mp->offset;
@@ -1125,7 +1129,8 @@

         /* Are slots allowed? */
         nslots = PyTuple_GET_SIZE(slots);
-        if (nslots > 0 && base->tp_itemsize != 0) {
+        if (nslots > 0 && base->tp_itemsize != 0 && !PyType_Check(base)) {
+            /* for the special case of meta types, allow slots */
             PyErr_Format(PyExc_TypeError,
                      "nonempty __slots__ "
                      "not supported for subtype of '%s'",
@@ -1334,7 +1339,7 @@
     }

     /* Add descriptors for custom slots from __slots__, or for __dict__ */
-    mp = et->members;
+    mp = GET_MEMBERS(et);
     slotoffset = base->tp_basicsize;
     if (slots != NULL) {
         for (i = 0; i < nslots; i++, mp++) {
@@ -1366,7 +1371,7 @@
     }
     type->tp_basicsize = slotoffset;
     type->tp_itemsize = base->tp_itemsize;
-    type->tp_members = et->members;
+    type->tp_members = GET_MEMBERS(et);
     type->tp_getset = subtype_getsets;

     /* Special case some slots */
*****CVS exited normally with code 1*****

--------------080306050703000101060801--

From nas@python.ca Tue Sep 3 19:34:47 2002
From: nas@python.ca (Neil Schemenauer)
Date: Tue, 3 Sep 2002 11:34:47 -0700
Subject: [Python-Dev] The first trustworthy GBayes results
In-Reply-To:
References: <20020903134112.GC1227@cthulhu.gerg.ca>
Message-ID: <20020903183447.GA13310@glacier.arctrix.com>

Tim Peters wrote:
> Under Graham's scheme, send it on. It doesn't have grey areas in a useful
> sense, because the scoring step only looks at a handful of extremes:
> extremes in, extremes out, and when it's wrong it's *spectacularly* wrong
> (e.g., the very rare (< 0.05%) false positives generally have "probabilities"
> exceeding 0.99, and a false negative often has a "probability" less than
> 0.01).

I noticed that as well. When the classifier goes wrong it goes badly wrong and using different thresholds would not help. It seems that increasing the number of discriminators doesn't really help either. Too bad because otherwise you could flag those messages for human classification. On the bright side, based on the number of mis-classified messages in my corpus, it looks like a human would have a very hard time doing a better job.
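The "extremes in, extremes out" behaviour can be illustrated with a small sketch. It assumes the combining rule from Paul Graham's "A Plan for Spam" essay, the product of the clue probabilities divided by that product plus the product of their complements; the GBayes code itself is not shown in this thread, so treat the function below as an illustration rather than the actual scorer. The function name and the clue values are made up.

```python
def graham_combine(probs):
    """Combine per-word spam probabilities (assumed Graham-style rule)."""
    spam = 1.0
    ham = 1.0
    for p in probs:
        spam *= p          # evidence for spam
        ham *= 1.0 - p     # evidence for ham
    return spam / (spam + ham)


# Hypothetical clue sets: a bare majority of extreme clues decides everything.
mostly_spam = [0.99] * 8 + [0.01] * 7   # 8 spammy clues, 7 hammy ones
evenly_split = [0.99] * 8 + [0.01] * 8  # a perfect tie
```

With these inputs the 8-to-7 split scores about 0.99 and the tie scores about 0.5: opposing extreme clues cancel pairwise, so one surplus clue is enough to produce a near-certain verdict, and there is very little room for a score in the middle.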
Perhaps all that is needed is a bypass mechanism for that small fraction of non-spammers. That way if their initial message is rejected they would still have some way of getting through. Erik Naggum made an interesting comment. He said that spam should be handled at the transport level. Greg's work on doing filtering at SMTP time accomplishes this and makes a lot of sense. When a message is rejected, the sending mail server is the one that has to deal with it. In the case of spam, the sending server is often an open rely. Letting it handle the bounces is sweet justice. :-) I bring this up because "STMP time filtering" makes a bypass mechanism work much better. With a system like TMDA, confirmation notices usually generate double-bounces. Instead, we could reject the message with a 5xx error that includes instructions on how to bypass the filter (e.g. include a cookie in the body of the message). Neil From python@discworld.dyndns.org Tue Sep 3 19:39:14 2002 From: python@discworld.dyndns.org (Charles Cazabon) Date: Tue, 3 Sep 2002 12:39:14 -0600 Subject: [Python-Dev] The first trustworthy GBayes results In-Reply-To: ; from tim.one@comcast.net on Tue, Sep 03, 2002 at 01:53:36PM -0400 References: <20020903134112.GC1227@cthulhu.gerg.ca> Message-ID: <20020903123914.B30532@twoflower.internal.do> Tim Peters wrote: > > Under Graham's scheme, send it on. It doesn't have grey areas in a useful > sense, becuase the scoring step only looks at a handful of extremes: > extremes in, extremes out, and when it's wrong it's *spectacularly* wrong > (e.g., the very rare (< 0.05%) false positives generally have "probabilties" > exceeding 0.99, and a false negative often has a "probability" less then > 0.01). I would love to see how the results would be affected by applying the scoring scheme to the entire content of the message, instead of just the 15 (or 16 in your case) most extreme samples. 
By the way, you never said why you increased that number by one; did it make that much difference? Charles -- ----------------------------------------------------------------------- Charles Cazabon GPL'ed software available at: http://www.qcc.ca/~charlesc/software/ ----------------------------------------------------------------------- From guido@python.org Tue Sep 3 18:50:31 2002 From: guido@python.org (Guido van Rossum) Date: Tue, 03 Sep 2002 13:50:31 -0400 Subject: [Python-Dev] Proposed Mixins for Wide Interfaces In-Reply-To: Your message of "Sat, 31 Aug 2002 12:44:06 EDT." <001101c2510d$9fce0920$5f66accf@othello> References: <001101c2510d$9fce0920$5f66accf@othello> Message-ID: <200209031750.g83HoVq05812@odiug.zope.com> > How about adding some mixins to simplify the > implementation of some of the fatter interfaces? Can you suggest implementations for these, to be absolutely clear what you mean? > class CompareMixin: > """ > Given an __eq__ method in a subclass, adds a __ne__ method > Given __eq__ and __lt__, adds !=, <=, >, >=. > """ What if the "natural" thing to implement is __le__ instead of __lt__? That's the case for sets. Or __gt__ (less likely)? > class MappingMixin: > """ > Given __setitem__, __getitem__, and keys, > implements values, items, update, get, setdefault, len, > iterkeys, iteritems, itervalues, has_key, and __contains__. > > If __delitem__ is also supplied, implements clear, pop, > and popitem. > > Takes advantage of __iter__ if supplied (recommended). Does that mean that if you have __iter__, you don't use keys()? In that case it should implement keys() out of __iter__. Maybe this should be required. > Takes advantage of __contains__ or has_key if supplied > (recommended). > """ Let's standardize on __contains__, not has_key(). 
I guess you could provide __contains__ as follows:

    def __contains__(self, key):
        try:
            self[key]
        except KeyError:
            return 0
        else:
            return 1

I don't mind if there are some recursions amongst the various implementations; if you don't supply the minimum, the implementation will raise "RuntimeError: maximum recursion depth exceeded".

> The idea is to make it easier to implement these interfaces.
> Also, if the interfaces get expanded, the clients are automatically
> updated.

A similar thing for sequences would be useful too, right?

--Guido van Rossum (home page: http://www.python.org/~guido/)

From tim.one@comcast.net Tue Sep 3 20:08:57 2002
From: tim.one@comcast.net (Tim Peters)
Date: Tue, 03 Sep 2002 15:08:57 -0400
Subject: [Python-Dev] The first trustworthy GBayes results
In-Reply-To: <20020903123914.B30532@twoflower.internal.do>
Message-ID:

[Charles Cazabon]
> I would love to see how the results would be affected by applying
> the scoring scheme to the entire content of the message, instead of
> just the 15 (or 16 in your case) most extreme samples.

Then it would be close to a classic Bayesian classifier, and like any such would need entirely different scoring code to avoid catastrophic floating-point errors (right now an intermediate result can't become smaller than 0.01**16 = 1e-32, so fp troubles are impossible; raise the exponent to a measly 200 and you're already out of the range of IEEE double precision; classic classifiers work in logarithm space instead for this reason). You can read lots of papers on how those do; all evidence suggests they do worse than this scheme on the spam versus non-spam task.

> By the way, you never said why you increased that number by one;

It's explained in the comment block preceding the MAX_DISCRIMINATORS definition. BTW, in an unreported experiment I boosted MAX_DISCRIMINATORS to 36. I don't recall what happened now, but it was a disaster for at least one of the error rates.

> did it make that much difference?

Not on average.
It helped eliminate a narrow class of false positives, where previously the first 15 extremes the classifier saw had 8 probs of .99 and 7 of .01. That works out to "spam". Making the # of classifiers even instead allowed for graceful ties, which favor ham in this scheme.

All previous decisions "should be" revisited after each new change, though, and in this particular case it could well be that stripping HTML tags out of plain-text messages also addressed the same narrow issue but in a more effective way (without some special gimmick, virtually every message including so much as an example of HTML got scored as spam).

From guido@python.org Tue Sep 3 20:41:10 2002
From: guido@python.org (Guido van Rossum)
Date: Tue, 03 Sep 2002 15:41:10 -0400
Subject: [Python-Dev] Should KeyError use repr() on its argument?
Message-ID: <200209031941.g83JfAK07542@odiug.zope.com>

(SF bug 598451.)

The KeyError exception doesn't apply repr() to its argument. That's annoying in cases like this:

    >>> a = {}
    >>> a['']
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    KeyError
    >>>

Should this be fixed? How? (I guess we could add a KeyError__str__ method to exceptions.c that applies repr().)

I've got a feeling this is a feature, but not a very useful one.

--Guido van Rossum (home page: http://www.python.org/~guido/)

From guido@python.org Tue Sep 3 20:54:48 2002
From: guido@python.org (Guido van Rossum)
Date: Tue, 03 Sep 2002 15:54:48 -0400
Subject: [Python-Dev] The first trustworthy GBayes results
In-Reply-To: Your message of "Tue, 03 Sep 2002 11:34:47 PDT." <20020903183447.GA13310@glacier.arctrix.com>
References: <20020903134112.GC1227@cthulhu.gerg.ca> <20020903183447.GA13310@glacier.arctrix.com>
Message-ID: <200209031954.g83Jsmw07797@odiug.zope.com>

> Erik Naggum made an interesting comment. He said that spam should be
> handled at the transport level. Greg's work on doing filtering at SMTP
> time accomplishes this and makes a lot of sense.
When a message is > rejected, the sending mail server is the one that has to deal with it. > In the case of spam, the sending server is often an open rely. Letting > it handle the bounces is sweet justice. :-) In the case of a false positive, it has the added advantage that at least the poor sender, falsely accused of sending spam, gets a bounce and may try to try again. > I bring this up because "STMP time filtering" makes a bypass mechanism > work much better. With a system like TMDA, confirmation notices usually > generate double-bounces. Instead, we could reject the message with a > 5xx error that includes instructions on how to bypass the filter (e.g. > include a cookie in the body of the message). Do you still believe that TMDA is the only answer to spam? --Guido van Rossum (home page: http://www.python.org/~guido/) From guido@python.org Tue Sep 3 20:57:00 2002 From: guido@python.org (Guido van Rossum) Date: Tue, 03 Sep 2002 15:57:00 -0400 Subject: [Python-Dev] Should KeyError use repr() on its argument? In-Reply-To: Your message of "Tue, 03 Sep 2002 15:41:10 EDT." <200209031941.g83JfAK07542@odiug.zope.com> References: <200209031941.g83JfAK07542@odiug.zope.com> Message-ID: <200209031957.g83Jv0k07810@odiug.zope.com> > The KeyError exception doesn't apply repr() to its argument. That's > annoying in cases like this: > > >>> a = {} > >>> a[''] > Traceback (most recent call last): > File "", line 1, in ? > KeyError > >>> > > Should this be fixed? How? (I guess we could add a KeyError__str__ > method to exceptions.c that applies repr().) > > I've got a feeling this is a feature, but not a very useful one. I take it back. args[0] being the actual key that failed is a feature. str() not using repr() on args[0] is a bug. I'll fix it. 
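The distinction drawn here, that args[0] keeps the real key while str() should show its repr(), can be checked with a short sketch. It assumes a current Python 3 interpreter, where this behaviour is in place; the helper name is invented for the example.

```python
def missing_key_report(mapping, key):
    """Hypothetical helper: return (failed key, message text), or None."""
    try:
        mapping[key]
    except KeyError as exc:
        # args[0] is the key object itself; str(exc) is its repr().
        return exc.args[0], str(exc)
    return None
```

For an empty-string key the message text comes out as two quote characters instead of nothing at all, which is exactly the case that made the old traceback confusing.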
--Guido van Rossum (home page: http://www.python.org/~guido/) From pinard@iro.umontreal.ca Tue Sep 3 20:54:27 2002 From: pinard@iro.umontreal.ca (=?iso-8859-1?q?Fran=E7ois?= Pinard) Date: Tue, 03 Sep 2002 15:54:27 -0400 Subject: [Python-Dev] Re: Python-dev summary for 2002-08-15 - 2002-09-01 In-Reply-To: <200209031653.g83GrjQ01929@odiug.zope.com> (Guido van Rossum's message of "Tue, 03 Sep 2002 12:53:45 -0400") References: <200209031653.g83GrjQ01929@odiug.zope.com> Message-ID: [Guido van Rossum] >> 80 | [|] >> | [|] >> | [|] >> | [|] >> | [|] [|] >> 60 | [|] [|] [|] >> | [|] [|] [|] >> | [|] [|] [|] >> | [|] [|] [|] >> | [|] [|] [|] [|] >> 40 | [|] [|] [|] [|] [|] >> | [|] [|] [|] [|] [|] [|] [|] >> | [|] [|] [|] [|] [|] [|] [|] [|] >> | [|] [|] [|] [|] [|] [|] [|] [|] [|] [|] >> | [|] [|] [|] [|] [|] [|] [|] [|] [|] [|] [|] >> 20 | [|] [|] [|] [|] [|] [|] [|] [|] [|] [|] [|] [|] [|] >> | [|] [|] [|] [|] [|] [|] [|] [|] [|] [|] [|] [|] [|] >> | [|] [|] [|] [|] [|] [|] [|] [|] [|] [|] [|] [|] [|] >> | [|] [|] [|] [|] [|] [|] [|] [|] [|] [|] [|] [|] [|] [|] [|] >> | [|] [|] [|] [|] [|] [|] [|] [|] [|] [|] [|] [|] [|] [|] [|] [|] >> 0 +-071-025-012-042-063-084-030-021-039-009-047-027-033-041-036-005 >> Fri 16| Sun 18| Tue 20| Thu 22| Sat 24| Mon 26| Wed 28| Fri 30| >> Sat 17 Mon 19 Wed 21 Fri 23 Sun 25 Tue 27 Thu 29 Sat 31 > > [...] It's also kind of hard to read. [...] True. But not so difficult to improve. 
Adding a bit of simplicity yields: | 84 80 | [] | [] | [] | 71 [] | [] 63 [] 60 | [] [] [] | [] [] [] | [] [] [] | [] [] [] 47 | [] 42 [] [] [] 40 | [] [] [] [] 39 [] 41 | [] [] [] [] [] [] [] 36 | [] [] [] [] 30 [] [] 33 [] [] | [] [] [] [] [] [] [] 27 [] [] [] | [] 25 [] [] [] [] 21 [] [] [] [] [] [] 20 | [] [] [] [] [] [] [] [] [] [] [] [] [] | [] [] [] [] [] [] [] [] [] [] [] [] [] | [] [] 12 [] [] [] [] [] [] 9 [] [] [] [] [] | [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] 5 | [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] 0 +---------------------------------------------------------------- Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 -- Fran�ois Pinard http://www.iro.umontreal.ca/~pinard From pinard@iro.umontreal.ca Tue Sep 3 20:57:50 2002 From: pinard@iro.umontreal.ca (=?iso-8859-1?q?Fran=E7ois?= Pinard) Date: Tue, 03 Sep 2002 15:57:50 -0400 Subject: [Python-Dev] Re: Python-dev summary for 2002-08-15 - 2002-09-01 In-Reply-To: <200209031653.g83GrjQ01929@odiug.zope.com> (Guido van Rossum's message of "Tue, 03 Sep 2002 12:53:45 -0400") References: <200209031653.g83GrjQ01929@odiug.zope.com> Message-ID: [Guido van Rossum] >> ================= >> A `cogen' module >> ================= >> Francois Pinard asked about Cartesian products using the new sets module. >> Guido didn't think people would in general need it. Francois quickly >> started this thread of discussing a cogen module to generate Cartesian >> products and other ways of operating on sets. > > Tim Peters quickly posted *his* elaborate state-of-the-art code, which > ended the discussion (as usual, posting code is a good way to stop > discussion :-). I'll be back! (Not that I especially look like Arnold Schwartzeneger!) 
-- Fran�ois Pinard http://www.iro.umontreal.ca/~pinard From guido@python.org Tue Sep 3 21:18:03 2002 From: guido@python.org (Guido van Rossum) Date: Tue, 03 Sep 2002 16:18:03 -0400 Subject: [Python-Dev] Re: Python-dev summary for 2002-08-15 - 2002-09-01 In-Reply-To: Your message of "Tue, 03 Sep 2002 15:54:27 EDT." References: <200209031653.g83GrjQ01929@odiug.zope.com> Message-ID: <200209032018.g83KI3q08343@odiug.zope.com> > > [...] It's also kind of hard to read. [...] > > True. But not so difficult to improve. Adding a bit of simplicity yields: > > | 84 > 80 | [] > | [] > | [] > | 71 [] > | [] 63 [] > 60 | [] [] [] > | [] [] [] > | [] [] [] > | [] [] [] 47 > | [] 42 [] [] [] > 40 | [] [] [] [] 39 [] 41 > | [] [] [] [] [] [] [] 36 > | [] [] [] [] 30 [] [] 33 [] [] > | [] [] [] [] [] [] [] 27 [] [] [] > | [] 25 [] [] [] [] 21 [] [] [] [] [] [] > 20 | [] [] [] [] [] [] [] [] [] [] [] [] [] > | [] [] [] [] [] [] [] [] [] [] [] [] [] > | [] [] 12 [] [] [] [] [] [] 9 [] [] [] [] [] > | [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] 5 > | [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] > 0 +---------------------------------------------------------------- > Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat > 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 Ooh, much better. Still, put this at the end instead of at the top of the message. It's not *that* interesting. --Guido van Rossum (home page: http://www.python.org/~guido/) From tim.one@comcast.net Tue Sep 3 21:32:55 2002 From: tim.one@comcast.net (Tim Peters) Date: Tue, 03 Sep 2002 16:32:55 -0400 Subject: [Python-Dev] The first trustworthy GBayes results In-Reply-To: <20020903183447.GA13310@glacier.arctrix.com> Message-ID: [Neil Schemenauer] > I noticed that as well. When the classifier goes wrong it goes badly > wrong and using different thresholds would not help. It seems that > increasing the number of discriminators doesn't really help either. 
Too > bad because otherwise you could flag those messages for human > classification. I think it's worse than just that: suppose any scheme says "OK, this is spam, with probability 0.9995". If it's reporting accurate probabilities, then another way to read that claim is "On average, one time in 2000 this message actually isn't spam". In real life we have to accept that there's no scheme with a 0% false positive rate-- not even human review --short of the scheme that never calls anything spam. Since deciding on the largest acceptable false positive rate is far more a social than a technical issue, a group of nerds will do anything rather than face it . From David Abrahams" <200209031653.g83GrjQ01929@odiug.zope.com> <200209032018.g83KI3q08343@odiug.zope.com> Message-ID: <17d001c2538d$f82650f0$1c86db41@boostconsulting.com> Turn it sideways and it'll get smaller... From: "Guido van Rossum" > > > [...] It's also kind of hard to read. [...] > > > > True. But not so difficult to improve. Adding a bit of simplicity yields: > > > > | 84 > > 80 | [] > > | [] > > | [] > > | 71 [] > > | [] 63 [] > > 60 | [] [] [] > > | [] [] [] > > | [] [] [] > > | [] [] [] 47 > > | [] 42 [] [] [] > > 40 | [] [] [] [] 39 [] 41 > > | [] [] [] [] [] [] [] 36 > > | [] [] [] [] 30 [] [] 33 [] [] > > | [] [] [] [] [] [] [] 27 [] [] [] > > | [] 25 [] [] [] [] 21 [] [] [] [] [] [] > > 20 | [] [] [] [] [] [] [] [] [] [] [] [] [] > > | [] [] [] [] [] [] [] [] [] [] [] [] [] > > | [] [] 12 [] [] [] [] [] [] 9 [] [] [] [] [] > > | [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] 5 > > | [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] > > 0 +---------------------------------------------------------------- > > Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat > > 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 > > Ooh, much better. Still, put this at the end instead of at the top of > the message. It's not *that* interesting. 
> > --Guido van Rossum (home page: http://www.python.org/~guido/) > > _______________________________________________ > Python-Dev mailing list > Python-Dev@python.org > http://mail.python.org/mailman/listinfo/python-dev From skip@pobox.com Tue Sep 3 22:39:01 2002 From: skip@pobox.com (Skip Montanaro) Date: Tue, 3 Sep 2002 16:39:01 -0500 Subject: [Python-Dev] Two random and nearly unrelated ideas Message-ID: <15733.11253.743055.864572@12-248-11-90.client.attbi.com> While adding a blurb to Misc/NEWS about the change to the thread ticker and check interval, it occurred to me that perhaps Misc/NEWS would benefit from conversion to ReST format. You could pump an HTML version out to the website periodically. Second (also considered during the above edit), it would be nice to get rid of the ticker altogether in systems with proper signal support. On those platforms couldn't an alarm replace polling for the ticker? I know signals are tricky devils, but it still seems it would be a win if you could use it. You'd have to install a SIGALRM handler which would trip periodically. It would also have to keep track of any alarm handler the programmer installed. Just for the heck of it I recompiled ceval.c with the (--_Py_Ticker < 0) block ifdef'd out. Got a 1.7% increase in pystones over the now default checkinterval == 100 situation. Skip From nas@python.ca Tue Sep 3 22:52:51 2002 From: nas@python.ca (Neil Schemenauer) Date: Tue, 3 Sep 2002 14:52:51 -0700 Subject: [Python-Dev] The first trustworthy GBayes results In-Reply-To: References: <20020903183447.GA13310@glacier.arctrix.com> Message-ID: <20020903215251.GA14101@glacier.arctrix.com> Tim Peters wrote: > Since deciding on the largest acceptable false positive rate is far > more a social than a technical issue, a group of nerds will do > anything rather than face it . I think we pretty much ran out of things to do. :-) Still, I think the acceptable rate depends heavily on what happens to the rejects. 
If they go to /dev/null then it would have to be very low. If there are bounces and a way for the innocent victims to bypass the filter then I consider 0.5% good enough for most situations. The major remaining problem would be handling legitimate automated email. For mailing lists that probably isn't an issue. I'm probably not the guy to listen to about acceptable rates, though. I currently use TMDA and therefore am a heartless bastard. :-) Neil From jeremy@alum.mit.edu Tue Sep 3 22:53:46 2002 From: jeremy@alum.mit.edu (Jeremy Hylton) Date: Tue, 3 Sep 2002 17:53:46 -0400 Subject: [Python-Dev] mysterious hangs in socket code Message-ID: <15733.12138.568668.562013@slothrop.zope.com> I've been running a small, multi-threaded program to retrieve web pages today. The entire program appears to hang when I perform a slow DNS operation, even though there is no application-level coordination between the threads. The motivation comes from http://www.python.org/sf/591349, but I ended up writing a similar small test script, which I've attached. When I run this program with Python 2.1, it produces a steady stream of output -- urls and the time it took to load them. Most of the pages take less than a second, but some take a very long time. If I run this program with Python 2.2 or 2.3, it produces little bursts of output, then pauses for a long time, then repeats. I believe that the problem relates to DNS lookups, but not in a way I fully understand. If I connect gdb to any of the threads while the program is hung, it is always inside getaddrinfo(). My first realization was that the socketmodule stopped wrapping DNS lookups in Py_BEGIN/END_ALLOW_THREADS calls when the IPv6 changes were integrated. But if I restore these calls -- see http://www.python.org/sf/604210 -- I don't see any change in behavior. The program still hangs periodically. One possibility is that the Linux getaddrinfo() is thread-safe, but only by way of a lock that only allows one request to be outstanding at a time.
Not sure what the other possibilities are, but the current behavior is awful. Jeremy

---------------------------------------------------------------------
import httplib
import Queue
import random
import sys
import threading
import time
import traceback
import urlparse

headers = {"Accept": "text/plain, text/html, image/jpeg, image/jpg, "
           "image/gif, image/png, */*"}

class URLThread(threading.Thread):

    def __init__(self, queue):
        threading.Thread.__init__(self)
        self._queue = queue
        self._stopevent = threading.Event()

    def stop(self):
        self._stopevent.set()

    def run(self):
        while not self._stopevent.isSet():
            self.fetch()

    def fetch(self):
        url = self._queue.get()
        t0 = time.time()
        try:
            self._fetch(url)
        except:
            etype, value, tb = sys.exc_info()
            L = ["Error occurred fetching %s\n" % url,
                 "%s: %s\n" % (etype, value),
                 ]
            L += traceback.format_tb(tb)
            sys.stderr.write("".join(L))
        t1 = time.time()
        print url, round(t1 - t0, 2)

    def _fetch(self, url):
        parts = urlparse.urlparse(url)
        host = parts[1]
        path = parts[2]
        h = httplib.HTTPConnection(host)
        h.connect()
        h.request("GET", path, headers=headers)
        r = h.getresponse()
        r.read()
        h.close()

urls = """\
http://www.andersen.com/
http://www.google.com/
http://www.google.com/images/logo.gif
http://www.microsoft.com/
http://www.microsoft.com/homepage/gif/bnr-microsoft.gif
http://www.microsoft.com/homepage/gif/1ptrans.gif
http://www.microsoft.com/library/toolbar/images/curve.gif
http://www.yahoo.com/
http://www.sourceforge.net/
http://www.slashdot.org/
http://www.kuro5hin.org/
http://www.intel.com/
http://www.aol.com/
http://www.amazon.com/
http://www.cnn.com/
http://money.cnn.com/
http://www.expedia.com/
http://www.tripod.com/
http://www.hotmail.com/
http://www.angelfire.com/
http://www.excite.com/
http://www.verisign.com/
http://www.riaa.com/
http://www.enron.com/
http://www.securityspace.com/
http://www.directv.com/
http://www.att.com/
http://www.qwest.com/
http://www.covad.com/
http://www.sprint.com/
http://www.mci.com/
http://www.worldcom.com/
"""
urls = [u for u in urls.split("\n") if u]

REPEAT = 10
THREADS = 8

class RandomQueue:

    def __init__(self, L):
        self.list = L

    def get(self):
        return random.choice(self.list)

if __name__ == "__main__":
    urlq = RandomQueue(urls)
    sys.setcheckinterval(10)
    threads = []
    for i in range(THREADS):
        t = URLThread(urlq)
        t.start()
        threads.append(t)
    while 1:
        try:
            time.sleep(30)
        except:
            break
    print "Shutting down threads..."
    for t in threads:
        t.stop()
    for t in threads:
        t.join()

From drifty@bigfoot.com Wed Sep 4 00:00:52 2002 From: drifty@bigfoot.com (Brett Cannon) Date: Tue, 3 Sep 2002 16:00:52 -0700 (PDT) Subject: [Python-Dev] Re: Python-dev summary for 2002-08-15 - 2002-09-01 In-Reply-To: <200209032018.g83KI3q08343@odiug.zope.com> Message-ID: [Guido van Rossum] > > > [...] It's also kind of hard to read. [...] > > > > True. But not so difficult to improve. Adding a bit of simplicity yields: > > > > | 84 > > 80 | [] > > | [] > > | [] > > | 71 [] > > | [] 63 [] > > 60 | [] [] [] > > | [] [] [] > > | [] [] [] > > | [] [] [] 47 > > | [] 42 [] [] [] > > 40 | [] [] [] [] 39 [] 41 > > | [] [] [] [] [] [] [] 36 > > | [] [] [] [] 30 [] [] 33 [] [] > > | [] [] [] [] [] [] [] 27 [] [] [] > > | [] 25 [] [] [] [] 21 [] [] [] [] [] [] > > 20 | [] [] [] [] [] [] [] [] [] [] [] [] [] > > | [] [] [] [] [] [] [] [] [] [] [] [] [] > > | [] [] 12 [] [] [] [] [] [] 9 [] [] [] [] [] > > | [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] 5 > > | [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] > > 0 +---------------------------------------------------------------- > > Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat > > 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 > > Ooh, much better. Still, put this at the end instead of at the top of > the message. It's not *that* interesting. > How about I just get rid of it? It is only in there because Michael had it in his summaries. Actually, the entire header (from the first line to first summary) is there just because Michael had it there.
I personally am happy keeping the header as it is sans this count; I know I had to read a lot of emails but I don't think anyone else cares. =) -Brett From jason-exp-1031786493.04d3ca@mastaler.com Wed Sep 4 00:28:24 2002 From: jason-exp-1031786493.04d3ca@mastaler.com (jason-exp-1031786493.04d3ca@mastaler.com) Date: Tue, 03 Sep 2002 17:28:24 -0600 Subject: [Python-Dev] Re: The first trustworthy GBayes results References: <20020903134112.GC1227@cthulhu.gerg.ca> <20020903183447.GA13310@glacier.arctrix.com> Message-ID: Neil Schemenauer writes: > I bring this up because "SMTP time filtering" makes a bypass > mechanism work much better. With a system like TMDA, confirmation > notices usually generate double-bounces. Instead, we could reject > the message with a 5xx error that includes instructions on how to > bypass the filter (e.g. include a cookie in the body of the > message). TMDA doesn't do this because it would make more work for the sender to get his message delivered. Because TMDA stores the incoming messages in a local queue, the sender just has to reply to a confirmation request, and his original message gets delivered, as opposed to having to cut and paste his message from the body of a bounce and then resend it. So, not operating at the transport level saves your correspondents some work at the expense of some bandwidth. -- (http://tmda.net/) From aahz@pythoncraft.com Wed Sep 4 00:49:01 2002 From: aahz@pythoncraft.com (Aahz) Date: Tue, 3 Sep 2002 19:49:01 -0400 Subject: [Python-Dev] mysterious hangs in socket code In-Reply-To: <15733.12138.568668.562013@slothrop.zope.com> References: <15733.12138.568668.562013@slothrop.zope.com> Message-ID: <20020903234901.GA29756@panix.com> On Tue, Sep 03, 2002, Jeremy Hylton wrote: > > I've been running a small, multi-threaded program to retrieve web > pages today. The entire program appears to hang when I perform a slow > DNS operation, even though there is no application-level coordination between > the threads.
gethostbyname() IIRC has frequently been non-reentrant. It might be related. -- Aahz (aahz@pythoncraft.com) <*> http://www.pythoncraft.com/ Project Vote Smart: http://www.vote-smart.org/ From pinard@iro.umontreal.ca Wed Sep 4 01:31:50 2002 From: pinard@iro.umontreal.ca (=?iso-8859-1?q?Fran=E7ois?= Pinard) Date: Tue, 03 Sep 2002 20:31:50 -0400 Subject: [Python-Dev] Nit about `setdefault' documentation Message-ID: Quite a small nit. Reading: ----------------------------------------------------------------------> >>> help({}.setdefault) Help on built-in function setdefault: setdefault(...) D.setdefault(k[,d]) -> D.get(k,d), also set D[k]=d if not D.has_key(k) ----------------------------------------------------------------------< I wonder if writing the last line as: ----------------------------------------------------------------------> D.setdefault(k[,d]) -> D.get(k,d), also set D[k]=d if k not in D ----------------------------------------------------------------------< would not better represent current Python fashion. :-) -- François Pinard http://www.iro.umontreal.ca/~pinard From sholden@holdenweb.com Wed Sep 4 01:31:52 2002 From: sholden@holdenweb.com (Steve Holden) Date: Tue, 3 Sep 2002 20:31:52 -0400 Subject: [Python-Dev] Re: Python-dev summary for 2002-08-15 - 2002-09-01 References: <200209031653.g83GrjQ01929@odiug.zope.com> <200209032018.g83KI3q08343@odiug.zope.com> <17d001c2538d$f82650f0$1c86db41@boostconsulting.com> Message-ID: <008201c253aa$780144d0$6300000a@holdenweb.com> [Guido] > > > > Ooh, much better. Still, put this at the end instead of at the top of > > the message. It's not *that* interesting. > > [David] > Turn it sideways and it'll get smaller... > ... but no more interesting. Couldn't we just have a web page where this statistic was available sliced and diced according to requirements? It looks especially bad in my standard mailreader variable-pitch font. The summary itself, however, looks excellent.
regards ----------------------------------------------------------------------- Steve Holden http://www.holdenweb.com/ Python Web Programming pydish.holdenweb.com/pwp/ Previous .sig file retired to www.homeforoldsigs.com ----------------------------------------------------------------------- From tim.one@comcast.net Wed Sep 4 02:06:43 2002 From: tim.one@comcast.net (Tim Peters) Date: Tue, 03 Sep 2002 21:06:43 -0400 Subject: [Python-Dev] The first trustworthy GBayes results In-Reply-To: <20020903134112.GC1227@cthulhu.gerg.ca> Message-ID: [Greg Ward] > ... > Just how many messages fall in that grey area anyways? Heh. Here's the probability distribution for the 4000 ham messages in my first test pair:

Ham distribution for this pair:
* = 67 items
 0.00  4000 ************************************************************
 2.50     0
 5.00     0
 7.50     0
10.00     0
12.50     0
15.00     0
17.50     0
20.00     0
22.50     0
25.00     0
27.50     0
30.00     0
32.50     0
35.00     0
37.50     0
40.00     0
42.50     0
45.00     0
47.50     0
50.00     0
52.50     0
55.00     0
57.50     0
60.00     0
62.50     0
65.00     0
67.50     0
70.00     0
72.50     0
75.00     0
77.50     0
80.00     0
82.50     0
85.00     0
87.50     0
90.00     0
92.50     0
95.00     0
97.50     0

That is, they *all* got a "probability score" less than 2.5% (0.025). Here's the spam probability distribution across the same run:

Spam distribution for this pair:
* = 46 items
 0.00     5 *
 2.50     2 *
 5.00     1 *
 7.50     0
10.00     0
12.50     0
15.00     1 *
17.50     0
20.00     1 *
22.50     0
25.00     2 *
27.50     1 *
30.00     0
32.50     1 *
35.00     0
37.50     0
40.00     0
42.50     0
45.00     1 *
47.50     1 *
50.00     1 *
52.50     0
55.00     0
57.50     1 *
60.00     3 *
62.50     0
65.00     2 *
67.50     0
70.00     0
72.50     0
75.00     1 *
77.50     1 *
80.00     0
82.50     0
85.00     0
87.50     0
90.00     3 *
92.50     1 *
95.00     6 *
97.50  2715 ************************************************************

IOW, a spam usually scored at least 0.975 on this run, but some spams scored under 0.025. There's very little "in the middle". I've got 19 more sets like this if you care a lot.
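For readers who want to produce listings in the same shape, here is a rough sketch. It is not the test driver used in this thread, which is not shown; the function name `render_histogram` and the scaling rule are assumptions. The rule (one star stands for ceil(largest bucket / 60) items, and every non-empty bucket is rounded up to at least one star) was inferred from the "* = N items" lines in the listings and happens to reproduce them.

```python
def render_histogram(scores, nbuckets=40, width=60):
    """Bucket scores in [0.0, 1.0] into equal-width bins and return text lines.

    Hypothetical helper: the scaling rule is inferred from the listings,
    not taken from the original test driver.
    """
    counts = [0] * nbuckets
    for score in scores:
        # A score of exactly 1.0 is folded into the last bucket.
        index = min(int(score * nbuckets), nbuckets - 1)
        counts[index] += 1

    # Integer ceiling division: -(-a // b) == ceil(a / b) for positive b.
    per_star = max(1, -(-max(counts) // width))

    lines = ["* = %d items" % per_star]
    for i, count in enumerate(counts):
        stars = "*" * -(-count // per_star)  # non-empty buckets get >= 1 star
        label = 100.0 * i / nbuckets         # bucket lower bound, in percent
        lines.append(("%5.2f %5d %s" % (label, count, stars)).rstrip())
    return lines


# 4000 scores of 0.0 mimic the ham listing for the first test pair.
for line in render_histogram([0.0] * 4000)[:3]:
    print(line)
```

Rounding every non-empty bucket up to one star is what keeps a handful of stragglers visible next to a bucket holding thousands of messages.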
Here's the aggregate across all 20 runs (each msg is counted 4 times here, once for each of the runs in which it served in the prediction set against training on one of the 4 spam+ham collection pairs it doesn't belong to):

Ham distribution for all runs:
* = 1333 items
 0.00 79938 ************************************************************
 2.50     8 *
 5.00     3 *
 7.50     0
10.00     3 *
12.50     1 *
15.00     3 *
17.50     1 *
20.00     1 *
22.50     0
25.00     0
27.50     0
30.00     1 *
32.50     4 *
35.00     2 *
37.50     0
40.00     2 *
42.50     0
45.00     1 *
47.50     1 *
50.00     1 *
52.50     0
55.00     0
57.50     0
60.00     0
62.50     1 *
65.00     0
67.50     0
70.00     2 *
72.50     0
75.00     1 *
77.50     1 *
80.00     0
82.50     0
85.00     1 *
87.50     1 *
90.00     0
92.50     1 *
95.00     1 *
97.50    21 *

Spam distribution for all runs:
* = 905 items
 0.00   215 *
 2.50    18 *
 5.00     8 *
 7.50    12 *
10.00     6 *
12.50     6 *
15.00    14 *
17.50     6 *
20.00    10 *
22.50     8 *
25.00     9 *
27.50     9 *
30.00     3 *
32.50     3 *
35.00     5 *
37.50     3 *
40.00     7 *
42.50    24 *
45.00     3 *
47.50    29 *
50.00    34 *
52.50     8 *
55.00     6 *
57.50    18 *
60.00    64 *
62.50    12 *
65.00     7 *
67.50     5 *
70.00     3 *
72.50     7 *
75.00     4 *
77.50    18 *
80.00    10 *
82.50    23 *
85.00    13 *
87.50    20 *
90.00    27 *
92.50    18 *
95.00    57 *
97.50 54256 ************************************************************

In percentage terms, very little lives outside the tips of the tail ends. Note that calling the spam cutoff 0.975 instead of 0.90 would save 2 false positives, at the expense of letting an additional 27+18+57 = 102 spams go thru. Here's the first example of a low-prob spam: """ Low prob spam!
0.0133104753792 Data/Spam/Set2/8007.txt prob('from:email name:') = 0.0488301 prob('thanks,') = 0.0300188 prob('subject:Hey') = 0.99 prob('today') = 0.852792 Return-Path: Delivered-To: bruce-spam@localhost Received: (qmail 14409 invoked by alias); 6 Mar 2002 20:07:42 -0000 Delivered-To: spam@bruce-guenter.dyndns.org Received: (qmail 14405 invoked from network); 6 Mar 2002 20:07:42 -0000 Received: from agamemnon.bfsmedia.com (204.83.201.2) by lorien.untroubled.org (192.168.1.3) with SMTP; 06 Mar 2002 20:07:42 -0000 Received: (qmail 13063 invoked by uid 500); 6 Mar 2002 20:02:05 -0000 Delivered-To: em-ca-spam@em.ca Received: (qmail 13057 invoked by uid 502); 6 Mar 2002 20:02:05 -0000 Delivered-To: bfsmedia-goose.kennels@bfsmedia.com Received: (qmail 13051 invoked from network); 6 Mar 2002 20:02:05 -0000 Received: from unknown (HELO smtp2.forserve.com) (63.170.11.221) by agamemnon.bfsmedia.com with SMTP; 6 Mar 2002 20:02:05 -0000 Date: Wed, 6 Mar 2002 15:12:41 -0500 Message-Id: <200203062012.g26KCfn08192@smtp2.forserve.com> X-Mailer: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:0.9.1) Gecko/20010607 Reply-To: From: To: Subject: Hey Fred Content-Length: 95 Lines: 9 Fred, It was nice to talk to you today I will send the proposal tonight. Thanks, Heidi """ You figure it out . I suspect bfsmedia would have added a high spam score if I looked at Received lines, but even several additional strong spam indicators wouldn't be enough to nail this one. BTW, this msg shows up many times in the spam corpora, varying the "Fred" and "Heidi" with other male and female names; I assume this is a harvester that's trying to provoke the recipient into replying. Several others are damaged in ways such that the email pkg can't create a msg out of them. I could easily enough add code to force such a msg to be considered spam. Some are wildly embarrassing failures: """ Low prob spam! 
0.000102019995919 Data/Spam/Set3/681.txt prob('common,') = 0.01 prob('definately') = 0.01 prob('logic') = 0.01 prob('hell,') = 0.01 prob('it".') = 0.01 prob('obvious.') = 0.01 prob('theory') = 0.01 prob('whilst') = 0.01 prob('earning') = 0.99 prob('same,') = 0.01 prob('$500,000') = 0.99 prob('"bull",') = 0.99 prob('year!!!') = 0.99 prob('internet!') = 0.99 prob('tv:') = 0.99 prob('*this') = 0.99 Return-Path: Delivered-To: em-ca-bruceg@em.ca Received: (qmail 25721 invoked from network); 17 Aug 2002 01:05:07 -0000 Received: from unknown (HELO 65.102.48.161) (65.102.48.161) by churchill.factcomp.com with SMTP; 17 Aug 2002 01:05:07 -0000 Received: from unknown (149.89.93.47) by rly-xr02.mx.aol.com with NNFMP; Aug, 17 2002 1:50:22 AM -0800 Received: from anther.webhostingtalk.com ([88.58.121.118]) by da001d2020.lax-ca.osd.concentric.net with QMQP; Aug, 17 2002 12:40:13 AM -0700 Received: from 34.57.158.148 ([34.57.158.148]) by rly-xr02.mx.aol.com with local; Aug, 17 2002 12:02:05 AM +0300 From: rnpyjohn To: Undisclosed Recipients Cc: Subject: Please read this letter carefully, it works 100% Sender: rnpyjohn Mime-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Date: Sat, 17 Aug 2002 02:03:28 +0100 X-Mailer: The Bat! (v1.52f) Business X-Priority: 1 Content-Length: 15985 *This is a one time mailing and this list will never be used again.* Hi, SEEN THIS MAIL BEFORE?, SICK OF FINDING IT IN YOUR INBOX? ME TOO, HONEST I was exactly the same, till one day whilst i was complaining about how tired i was of seeing ... """ The first 16 most extreme indicators are split 9 highly in favor of ham (.01) and 7 highly in favor of spam (.99). If I hadn't folded case away to let stinking conference announcements through , I expect it would have latched on to the SCREAMING at the start instead of looking deeper. Looking at the To: line probably would nail this one too, as "Undisclosed Recipients" has two 0.99 spam indicators right there. 
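The arithmetic behind that score can be checked with a short sketch, assuming the clues are combined with the rule from Paul Graham's "A Plan for Spam": P / (P + Q), where P is the product of the clue probabilities and Q is the product of their complements. The function name `combine` is hypothetical. With 9 clues at 0.01 and 7 at 0.99 the ratio P/Q collapses to (0.01/0.99)^2, so the result is 1/9802, which agrees with the 0.000102019995919 reported for this message.

```python
def combine(probs):
    """Graham-style combination: P / (P + Q), P = prod(p), Q = prod(1 - p).

    Hypothetical helper name; the rule itself is an assumption about the
    scoring scheme being tested in this thread.
    """
    spam_product = 1.0
    ham_product = 1.0
    for p in probs:
        spam_product *= p
        ham_product *= 1.0 - p
    return spam_product / (spam_product + ham_product)


# 16 extreme clues split 9 ham / 7 spam, as in the message above.
print(combine([0.01] * 9 + [0.99] * 7))   # about 0.000102, i.e. 1/9802

# The earlier 15-clue case in this thread: 8 at .99 and 7 at .01.
print(combine([0.99] * 8 + [0.01] * 7))   # about 0.99, i.e. "spam"

# An even split cancels out completely.
print(combine([0.99] * 8 + [0.01] * 8))   # about 0.5, a tie
```

Because opposing extreme clues cancel pairwise, only the surplus on one side matters: a surplus of two 0.01 clues is already enough to drive the score to roughly one in ten thousand.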
Whatever, you *don't* want to look at msgs with a mix of just 0.99 and 0.01 thingies: it's not all that unusual to get such an extreme mix, in spam or ham. this-isn't-your-father's-idea-of-probability-ly y'rs - tim From barry@python.org Wed Sep 4 02:35:27 2002 From: barry@python.org (Barry A. Warsaw) Date: Tue, 3 Sep 2002 21:35:27 -0400 Subject: [Python-Dev] mysterious hangs in socket code References: <15733.12138.568668.562013@slothrop.zope.com> Message-ID: <15733.25439.461968.51583@anthem.wooz.org> >>>>> "JH" == Jeremy Hylton writes: JH> I've been running a small, multi-threaded program to retrieve JH> web pages today. The entire program appears to hang when I JH> perform a slow DNS operation, even there is no JH> application-level coordinate between the threads. Does strace'ing the program provide any clues? Also, if it's a DNS thing, you should definitely try to run it on different networks (or at least pointing to different DNS servers). Ok, running it now as "strace python foo.py" (Py2.2.1) and I see similar behavior. It seems to mostly be sitting in select() calls and rt_sigsuspend() which I guess is a wrapper around sigsuspend(). When I use Python 2.1.3 I never see it sit in sigsuspend(). -Barry From pinard@iro.umontreal.ca Wed Sep 4 02:39:44 2002 From: pinard@iro.umontreal.ca (=?iso-8859-1?q?Fran=E7ois?= Pinard) Date: Tue, 03 Sep 2002 21:39:44 -0400 Subject: [Python-Dev] Re: Python-dev summary for 2002-08-15 - 2002-09-01 In-Reply-To: <008201c253aa$780144d0$6300000a@holdenweb.com> ("Steve Holden"'s message of "Tue, 3 Sep 2002 20:31:52 -0400") References: <200209031653.g83GrjQ01929@odiug.zope.com> <200209032018.g83KI3q08343@odiug.zope.com> <17d001c2538d$f82650f0$1c86db41@boostconsulting.com> <008201c253aa$780144d0$6300000a@holdenweb.com> Message-ID: [Steve Holden] > It looks especially bad in my standard mailreader variable-pitch font. Oh! You are touching a sensible nerve! 
:-) There are many cases where people do ASCII art in messages, and I'm not speaking of signatures here. People often insert ASCII tables or simple explicative drawings, these capabilities are useful enough for not being dismissed. You should use fixed width fonts when receiving, and even when sending email. (And people should limit their messages to 79 columns.) If something looks bad because of your variable-pitch fonts, the problem is emphatically _not_ in the sent message, and does not justify any alteration to the format of those messages. Another example is the fact that many fonts nowadays decided to improve over ASCII, and have an apostrophe which is not symmetrical to a grave accent. By design and since ASCII 1, long ago, they should be symmetrical. A few people push for everybody to stop `quoting' like this. I strongly believe that for displaying ASCII text, people should use ASCII fonts. If fonts are wrong, and many fonts are wrong, this should not be seen as the sender's problem. The push is sometimes accompanied by the suggestion of switching to Unicode all over, as a way to avoid the problem. It is surely a good idea, but we are not there yet. In the meantime, ASCII stays ASCII. -- François Pinard http://www.iro.umontreal.ca/~pinard From fredrik@pythonware.com Wed Sep 4 06:41:42 2002 From: fredrik@pythonware.com (Fredrik Lundh) Date: Wed, 4 Sep 2002 07:41:42 +0200 Subject: [Python-Dev] Re: Python-dev summary for 2002-08-15 - 2002-09-01 References: <200209031653.g83GrjQ01929@odiug.zope.com><200209032018.g83KI3q08343@odiug.zope.com><17d001c2538d$f82650f0$1c86db41@boostconsulting.com><008201c253aa$780144d0$6300000a@holdenweb.com> Message-ID: <005601c253d5$d0a63c50$ced241d5@hagrid> François Pinard wrote: > [Steve Holden] > > > It looks especially bad in my standard mailreader variable-pitch font. > > Oh! You are touching a sensible nerve!
:-) > > There are many cases where people do ASCII art in messages, and I'm not > speaking of signatures here. People often insert ASCII tables or simple > explicative drawings, these capabilities are useful enough for not being > dismissed. You should use fixed width fonts when receiving, and even when > sending email. loser. if python really was all about "everything computers did when I learned to use them will always be the best way to do it", it would probably never have been invented. and this mailing list is about python. From oren-py-d@hishome.net Wed Sep 4 10:49:47 2002 From: oren-py-d@hishome.net (Oren Tirosh) Date: Wed, 4 Sep 2002 05:49:47 -0400 Subject: [Python-Dev] Two random and nearly unrelated ideas In-Reply-To: <15733.11253.743055.864572@12-248-11-90.client.attbi.com> References: <15733.11253.743055.864572@12-248-11-90.client.attbi.com> Message-ID: <20020904094947.GA56953@hishome.net> On Tue, Sep 03, 2002 at 04:39:01PM -0500, Skip Montanaro wrote: > Second (also considered during the above edit), it would be nice to get rid > of the ticker altogether in systems with proper signal support. On those > platforms couldn't an alarm replace polling for the ticker? Not before all Python I/O calls are converted to be EINTR-safe. After running into some problems with I/O interrupted by signals I tried to fix it myself but it requires a lot of work in some of the hairiest places in the Python codebase. Oren From fredrik@pythonware.com Wed Sep 4 12:22:26 2002 From: fredrik@pythonware.com (Fredrik Lundh) Date: Wed, 4 Sep 2002 13:22:26 +0200 Subject: [Python-Dev] Two random and nearly unrelated ideas References: <15733.11253.743055.864572@12-248-11-90.client.attbi.com> <20020904094947.GA56953@hishome.net> Message-ID: <001b01c25405$5a6da520$0900a8c0@spiff> oren wrote: > Not before all Python I/O calls are converted to be EINTR-safe.
> > After running into some problems with I/O interrupted by signals I tried to > fix it myself but it requires a lot of work in some of the hairiest places > in the Python codebase. sounds like a good topic for a "here's what I learned when trying to fix this problem" PEP. From guido@python.org Wed Sep 4 12:24:16 2002 From: guido@python.org (Guido van Rossum) Date: Wed, 04 Sep 2002 07:24:16 -0400 Subject: [Python-Dev] Should KeyError use repr() on its argument? In-Reply-To: Your message of "Tue, 03 Sep 2002 16:29:32 PDT." References: Message-ID: <200209041124.g84BOHY03377@pcp02138704pcs.reston01.va.comcast.net> > > > The KeyError exception doesn't apply repr() to its argument. That's > > > annoying in cases like this: > > > > > > >>> a = {} > > > >>> a[''] > > > Traceback (most recent call last): > > > File "", line 1, in ? > > > KeyError > > > >>> > > > > > > Should this be fixed? How? (I guess we could add a KeyError__str__ > > > method to exceptions.c that applies repr().) > > > > > > I've got a feeling this is a feature, but not a very useful one. > > > > I take it back. args[0] being the actual key that failed is a > > feature. str() not using repr() on args[0] is a bug. I'll fix it. > > > > What is args[0]? args is the name of the instance variable that most exceptions use to store the arguments that were passed to them in the raise statement (or equivalent C API). It is a tuple. Examples:

>>> a = KeyError()
>>> a.args
()
>>> a = KeyError(1)
>>> a.args
(1,)
>>> a = KeyError(1,2,3)
>>> a.args
(1, 2, 3)
>>> try: {}['']
except KeyError, k: print k.args

('',)
>>>

> Are you saying that dicts use repr() instead of str() to > get the key value when accessing? No, I'm saying that str(KeyError('foo')) should return repr('foo') rather than 'foo' as it does now. See current CVS.
:-) --Guido van Rossum (home page: http://www.python.org/~guido/) From guido@python.org Wed Sep 4 12:44:32 2002 From: guido@python.org (Guido van Rossum) Date: Wed, 04 Sep 2002 07:44:32 -0400 Subject: [Python-Dev] Two random and nearly unrelated ideas In-Reply-To: Your message of "Wed, 04 Sep 2002 05:49:47 EDT." <20020904094947.GA56953@hishome.net> References: <15733.11253.743055.864572@12-248-11-90.client.attbi.com> <20020904094947.GA56953@hishome.net> Message-ID: <200209041144.g84BiXZ05244@pcp02138704pcs.reston01.va.comcast.net> > > Second (also considered during the above edit), it would be nice to get rid > > of the ticker altogether in systems with proper signal support. On those > > platforms couldn't an alarm replace polling for the ticker? > > Not before all all Python I/O calls are converted to be EINTR-safe. > > After running into some problems with I/O interrupted by signals I tried to > fix it myself but it requires a lot of work in some of the hairiest places > in the Python codebase. Signals: just say no. It is impossible to write correct code in the presence of signals. --Guido van Rossum (home page: http://www.python.org/~guido/) From guido@python.org Wed Sep 4 12:49:15 2002 From: guido@python.org (Guido van Rossum) Date: Wed, 04 Sep 2002 07:49:15 -0400 Subject: [Python-Dev] mysterious hangs in socket code In-Reply-To: Your message of "Tue, 03 Sep 2002 17:53:46 EDT." <15733.12138.568668.562013@slothrop.zope.com> References: <15733.12138.568668.562013@slothrop.zope.com> Message-ID: <200209041149.g84BnFV05659@pcp02138704pcs.reston01.va.comcast.net> > One possibility is that the Linux getaddrinfo() is thread-safe, but > only by way of a lock that only allows one request to be outstanding > at a time. The next step should be to get the getaddrinfo() source code from glibc and see what it does. It's open source, hey. 
:-) --Guido van Rossum (home page: http://www.python.org/~guido/) From guido@python.org Wed Sep 4 12:51:10 2002 From: guido@python.org (Guido van Rossum) Date: Wed, 04 Sep 2002 07:51:10 -0400 Subject: [Python-Dev] Two random and nearly unrelated ideas In-Reply-To: Your message of "Tue, 03 Sep 2002 16:39:01 CDT." <15733.11253.743055.864572@12-248-11-90.client.attbi.com> References: <15733.11253.743055.864572@12-248-11-90.client.attbi.com> Message-ID: <200209041151.g84BpAg05683@pcp02138704pcs.reston01.va.comcast.net> > While adding a blurb to Misc/NEWS about the change to the thread > ticker and check interval, it occurred to me that perhaps Misc/NEWS > would benefit from conversion to ReST format. You could pump an > HTML version out to the website periodically. Nice idea. How much additional mark-up would this add to quote the occasional reST meta-character? Can you convert a section for test and show me? > Second (also considered during the above edit), it would be nice to > get rid of the ticker altogether in systems with proper signal > support. On those platforms couldn't an alarm replace polling for > the ticker? I know signals are tricky devils, but it still seems it > would be a win if you could use it. You'd have to install a SIGALRM > handler which would trip periodically. It would also have to keep > track of any alarm handler the programmer installed. -1,000,000. --Guido van Rossum (home page: http://www.python.org/~guido/) From praveen.patil@silver-software.com Wed Sep 4 13:31:00 2002 From: praveen.patil@silver-software.com (Praveen Patil) Date: Wed, 4 Sep 2002 13:31:00 +0100 Subject: [Python-Dev] Please help in calling python fucntion from 'c' Message-ID: This is a multi-part message in MIME format. ------=_NextPart_000_0011_01C25417.4EC8F910 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Hi, I have written 'C' dll(MY_DLL.DLL) . I am importing 'C' dll in python file(example.py). 
I want to call python function from 'c' function.
For your reference I have attached 'c' and python files to this mail.

In my pc:
python code is under the directory D:\test\example.py
dll is under the directory C:\Program Files\Python\DLLs\MY_DLL.pyd

Here are the steps I am following.

step(1): I am calling 'C' function(RECEIVE_FROM_IL_S) from python.
         This 'C' function is existing imported dll(MY_DLL).
step(2): I want to call python function(TestFunction) from 'C'
         function(RECEIVE_FROM_IL_S).

Python code is(example.py) :-
----------------------------
import MY_DLL

G_Logfile = None

def TestFunction():
    G_Logfile = open('Pytestfile.txt', 'w')
    G_Logfile.write("%s \n"%'I am writing python created text file')
    G_Logfile.close
    G_Logfile = None
#end def TestFunction

if __name__ == "__main__":
    MY_DLL.RECEIVE_FROM_IL_S(10,50)

'C' code is (MY_DLL.c) :-
---------------------
#include
#include
#include

PyObject* _wrap_RECEIVE_FROM_IL_S(PyObject *self, PyObject *args)
{
    FILE* fp;
    PyObject* _resultobj;
    int i,j;

    if( !(PyArg_ParseTuple(args, "ii",&i,&j)))
    {
        return NULL;
    }
    fp= fopen("RECEIVE_IL_S.txt", "w");
    fprintf(fp, "i=%d j=%d" , i,j);
    fclose(fp);
    /* Here I want to call python function(TestFunction). Please suggest me some solution*/
    _resultobj = Py_None;
    return _resultobj;
}

static PyMethodDef MY_DLL_methods[] = {
    { "RECEIVE_FROM_IL_S", _wrap_RECEIVE_FROM_IL_S, METH_VARARGS },
    { NULL , NULL}
};

__declspec(dllexport) void __cdecl initMY_DLL(void)
{
    Py_InitModule("MY_DLL",MY_DLL_methods);
}

Please anybody help me solving the problem.

Cheers,

Praveen.
------=_NextPart_000_0011_01C25417.4EC8F910--

From oren-py-d@hishome.net Wed Sep 4 13:46:46 2002
From: oren-py-d@hishome.net (Oren Tirosh)
Date: Wed, 4 Sep 2002 08:46:46 -0400
Subject: [Python-Dev] Signal-resistant code (was: Two random and nearly unrelated ideas)
In-Reply-To: <200209041144.g84BiXZ05244@pcp02138704pcs.reston01.va.comcast.net>
References: <15733.11253.743055.864572@12-248-11-90.client.attbi.com> <20020904094947.GA56953@hishome.net> <200209041144.g84BiXZ05244@pcp02138704pcs.reston01.va.comcast.net>
Message-ID: <20020904124646.GA79746@hishome.net>

On Wed, Sep 04, 2002 at 07:44:32AM -0400, Guido van Rossum wrote:
> > > Second (also considered during the above edit), it would be nice to get rid
> > > of the ticker altogether in systems with proper signal support. On those
> > > platforms couldn't an alarm replace polling for the ticker?
> >
> > Not before all all Python I/O calls are converted to be EINTR-safe.
> >
> > After running into some problems with I/O interrupted by signals I tried to
> > fix it myself but it requires a lot of work in some of the hairiest places
> > in the Python codebase.
>
> Signals: just say no. It is impossible to write correct code in the
> presence of signals.

Wrapping all I/O calls with PyOS_ wrappers would be a good start. After that
the wrappers can be modified to retry the call on EINTR.

This should solve all the problems I have encountered with interference to
Python code by signals. Any other problems I should be aware of?
Oren From guido@python.org Wed Sep 4 14:25:01 2002 From: guido@python.org (Guido van Rossum) Date: Wed, 04 Sep 2002 09:25:01 -0400 Subject: [Python-Dev] Re: Signal-resistant code (was: Two random and nearly unrelated ideas) In-Reply-To: Your message of "Wed, 04 Sep 2002 08:46:46 EDT." <20020904124646.GA79746@hishome.net> References: <15733.11253.743055.864572@12-248-11-90.client.attbi.com> <20020904094947.GA56953@hishome.net> <200209041144.g84BiXZ05244@pcp02138704pcs.reston01.va.comcast.net> <20020904124646.GA79746@hishome.net> Message-ID: <200209041325.g84DP1o06695@pcp02138704pcs.reston01.va.comcast.net> > > Signals: just say no. It is impossible to write correct code in the > > presence of signals. > > Wrapping all I/O calls with PyOS_ wrappers would be a good start. And what should those wrappers do? > After that the wrappers can be modified to retry the call on EINTR. But that's not always what you want to happen! E.g. if an app is blocked on a read and uses an alarm to bail out of the read. > This should solve all the problems I have encountered with > interference to Python code by signals. Any other problems I should > be aware of? There's no way to sufficiently test a program that uses signals. The signal handler cannot touch *any* data, which makes it pretty useless. --Guido van Rossum (home page: http://www.python.org/~guido/) From skip@pobox.com Wed Sep 4 15:45:51 2002 From: skip@pobox.com (Skip Montanaro) Date: Wed, 4 Sep 2002 09:45:51 -0500 Subject: [Python-Dev] Two random and nearly unrelated ideas In-Reply-To: <20020904094947.GA56953@hishome.net> References: <15733.11253.743055.864572@12-248-11-90.client.attbi.com> <20020904094947.GA56953@hishome.net> Message-ID: <15734.7327.163001.51042@12-248-11-90.client.attbi.com> >> On those platforms couldn't an alarm replace polling for the ticker? Oren> Not before all all Python I/O calls are converted to be Oren> EINTR-safe. Ah, yes. Thanks for pointing out that little stumbling block... 
Skip From oren-py-d@hishome.net Wed Sep 4 17:01:43 2002 From: oren-py-d@hishome.net (Oren Tirosh) Date: Wed, 4 Sep 2002 12:01:43 -0400 Subject: [Python-Dev] Re: Signal-resistant code (was: Two random and nearly unrelated ideas) In-Reply-To: <200209041325.g84DP1o06695@pcp02138704pcs.reston01.va.comcast.net> References: <15733.11253.743055.864572@12-248-11-90.client.attbi.com> <20020904094947.GA56953@hishome.net> <200209041144.g84BiXZ05244@pcp02138704pcs.reston01.va.comcast.net> <20020904124646.GA79746@hishome.net> <200209041325.g84DP1o06695@pcp02138704pcs.reston01.va.comcast.net> Message-ID: <20020904160143.GA1483@hishome.net> On Wed, Sep 04, 2002 at 09:25:01AM -0400, Guido van Rossum wrote: > > After that the wrappers can be modified to retry the call on EINTR. > > But that's not always what you want to happen! E.g. if an app is > blocked on a read and uses an alarm to bail out of the read. If I use a module that spawns an external process and uses SIGCHLD to be informed of its termination why should my innocent code that just reads lines from a file suddenly break? In C I can at least restart the operation after an EINTR but file.readline cannot even be properly restarted because the buffering and file position is all messed up. The example you gave of bailing out of a read with a signal can be done using other techniques such as non-blocking I/O (which is, IMHO, a much cleaner way to do it). Getting an notification of a child process terminating or other asynchronous events can only be done using signals and is currently dangerous because it will break code using I/O. > > interference to Python code by signals. Any other problems I should > > be aware of? > > There's no way to sufficiently test a program that uses signals. The > signal handler cannot touch *any* data, which makes it pretty useless. In order to be useful a signal handler needs to be able to set one bit. The next time the ticker expires this bit will be checked. 
If an I/O operation was interrupted the Python signal handler can be
executed immediately from the wrapper. When it returns the wrapper will
resume the interrupted operation.

Oren

I/O, I/O, it's off to work we go...
The seven dwarfs

From oren-py-d@hishome.net Wed Sep 4 19:51:31 2002
From: oren-py-d@hishome.net (Oren Tirosh)
Date: Wed, 4 Sep 2002 21:51:31 +0300
Subject: [Python-Dev] Signal-resistant code (was: Two random and nearly unrelated ideas)
In-Reply-To: ; from bac@OCF.Berkeley.EDU on Wed, Sep 04, 2002 at 11:04:27AM -0700
References: <20020904094947.GA56953@hishome.net>
Message-ID: <20020904215131.A12898@hishome.net>

On Wed, Sep 04, 2002 at 11:04:27AM -0700, Brett Cannon wrote:
> [Oren Tirosh]
>
> > >
> > Not before all all Python I/O calls are converted to be EINTR-safe.
>
> what is EINTER-safe?

When an I/O operation is interrupted by an unmasked signal it returns
with errno==EINTR. The state of the file is not affected and repeating
the operation should recover and continue with no loss of data.

Here is an EINTR-safe version of read:

ssize_t safe_read(int fd, void *buf, size_t count) {
    ssize_t result;
    do {
        result = read(fd, buf, count);
    } while (result == -1 && errno == EINTR);
    return result;
}

When exposing the C I/O calls to Python you can either:

1. Use EINTR-safe I/O and hide this from the user.
2. Pass on EINTR to the user.

Python currently does #2 with a big caveat - the internal buffering
of functions like file.read or file.readline is messed up and cannot be
cleanly restarted. This makes signals unusable for delivery of asynchronous
events in the background without affecting the state of the main program.

Oren

From guido@python.org Wed Sep 4 20:10:15 2002
From: guido@python.org (Guido van Rossum)
Date: Wed, 04 Sep 2002 15:10:15 -0400
Subject: [Python-Dev] Signal-resistant code (was: Two random and nearly unrelated ideas)
In-Reply-To: Your message of "Wed, 04 Sep 2002 21:51:31 +0300."
<20020904215131.A12898@hishome.net> References: <20020904094947.GA56953@hishome.net> <20020904215131.A12898@hishome.net> Message-ID: <200209041910.g84JAGR08004@pcp02138704pcs.reston01.va.comcast.net> From guido@python.org Wed Sep 4 20:16:25 2002 From: guido@python.org (Guido van Rossum) Date: Wed, 04 Sep 2002 15:16:25 -0400 Subject: [Python-Dev] Signal-resistant code (was: Two random and nearly unrelated ideas) In-Reply-To: Your message of "Wed, 04 Sep 2002 21:51:31 +0300." <20020904215131.A12898@hishome.net> References: <20020904094947.GA56953@hishome.net> <20020904215131.A12898@hishome.net> Message-ID: <200209041916.g84JGPd08031@pcp02138704pcs.reston01.va.comcast.net> > > what is EINTER-safe? > > When an I/O operation is interrupted by an unmasked signal it returns > with errno==EINTR. The state of the file is not affected and repeating > the operation should recover and continue with no loss of data. What if the operation is a select() call? Is restarting the right thing? How to take into account the consumed portion of the timeout, if given? > Here is an EINTR-safe version of read: > > ssize_t safe_read(int fd, void *buf, size_t count) { > ssize_t result; > do { > result = read(fd, buf, count); > } while (result == -1 && errno == EINTR); > return result; > } > > When exposing the C I/O calls to Python you can either: > > 1. Use EINTR-safe I/O and hide this from the user. > 2. Pass on EINTR to the user. > > Python currently does #2 with a big caveat - the internal buffering > of functions like file.read or file.readline is messed up and cannot be > cleanly restarted. This makes signals unusable for delivery of asynchronous > events in the background without affecting the state of the main program. Can you point to a place in the code where this is happening? Or is this a stdio problem? I believe that calls like fgets() and getchar() don't lose data, but maybe I misunderstand your observation. 
As I said before, I'm very skeptical that making the I/O ops EINTR-safe would be enough to allow the use of signals as siggested by Skip, but that might still be useful for other purposes, *if* we can decide when to honor EINTR and when not. --Guido van Rossum (home page: http://www.python.org/~guido/) From nas@python.ca Wed Sep 4 20:22:47 2002 From: nas@python.ca (Neil Schemenauer) Date: Wed, 4 Sep 2002 12:22:47 -0700 Subject: [Python-Dev] Signal-resistant code (was: Two random and nearly unrelated ideas) In-Reply-To: <200209041916.g84JGPd08031@pcp02138704pcs.reston01.va.comcast.net> References: <20020904094947.GA56953@hishome.net> <20020904215131.A12898@hishome.net> <200209041916.g84JGPd08031@pcp02138704pcs.reston01.va.comcast.net> Message-ID: <20020904192247.GA16797@glacier.arctrix.com> Guido van Rossum wrote: > What if the operation is a select() call? Is restarting the right > thing? How to take into account the consumed portion of the timeout, > if given? I think you would not restart select(). It's only a hint anyhow. Neil From oren-py-d@hishome.net Wed Sep 4 21:07:09 2002 From: oren-py-d@hishome.net (Oren Tirosh) Date: Wed, 4 Sep 2002 23:07:09 +0300 Subject: [Python-Dev] Signal-resistant code (was: Two random and nearly unrelated ideas) In-Reply-To: <200209041916.g84JGPd08031@pcp02138704pcs.reston01.va.comcast.net>; from guido@python.org on Wed, Sep 04, 2002 at 03:16:25PM -0400 References: <20020904094947.GA56953@hishome.net> <20020904215131.A12898@hishome.net> <200209041916.g84JGPd08031@pcp02138704pcs.reston01.va.comcast.net> Message-ID: <20020904230709.A24623@hishome.net> On Wed, Sep 04, 2002 at 03:16:25PM -0400, Guido van Rossum wrote: > > When an I/O operation is interrupted by an unmasked signal it returns > > with errno==EINTR. The state of the file is not affected and repeating > > the operation should recover and continue with no loss of data. > > What if the operation is a select() call? Is restarting the right > thing? 
How to take into account the consumed portion of the timeout,
> if given?

Some versions of select update the timeout structure to the remainder if
they are interrupted by a signal. It's probably not a good idea to rely
on this so gettimeofday could be used to calculate the remainder.

> Or is this a stdio problem? I believe that calls like fgets() and
> getchar() don't lose data, but maybe I misunderstand your observation.

This is not the point - even if Python I/O calls were fully restartable
would you actually expect people to check for EINTR and restart for
*every* I/O operation in the program just in case some module happens to
use signals?

Instead of

    for line in file:
        do_something_with(line)

we would need to write

    while 1:
        try:
            line = file.next()
        except IOError, exc:
            if exc.errno == errno.EINTR:
                continue
            else:
                raise
        except StopIteration:
            break
        do_something_with(line)

> As I said before, I'm very skeptical that making the I/O ops
> EINTR-safe would be enough to allow the use of signals as suggested by
> Skip

If it's good enough for other purposes it should be good enough for
Skip's proposal, too.

> Skip, but that might still be useful for other purposes, *if* we can
> decide when to honor EINTR and when not.

Only low-level functions like os.read and os.write that map directly to
stdio functions should ever return EINTR. To make Python signal-safe all
other calls that can return EINTR should have a retry loop. On EINTR they
should check if there are things to do and if so grab the GIL, make pending
calls, release the GIL and retry the operation (unless an exception has
been raised by the signal handler, of course).
This way I could finally write a Python daemon that reloads its configuration files on getting the customary SIGHUP :-) Oren From guido@python.org Wed Sep 4 21:05:22 2002 From: guido@python.org (Guido van Rossum) Date: Wed, 04 Sep 2002 16:05:22 -0400 Subject: [Python-Dev] Re: Signal-resistant code (was: Two random and nearly unrelated ideas) In-Reply-To: Your message of "Wed, 04 Sep 2002 12:01:43 EDT." <20020904160143.GA1483@hishome.net> References: <15733.11253.743055.864572@12-248-11-90.client.attbi.com> <20020904094947.GA56953@hishome.net> <200209041144.g84BiXZ05244@pcp02138704pcs.reston01.va.comcast.net> <20020904124646.GA79746@hishome.net> <200209041325.g84DP1o06695@pcp02138704pcs.reston01.va.comcast.net> <20020904160143.GA1483@hishome.net> Message-ID: <200209042005.g84K5Ms08177@pcp02138704pcs.reston01.va.comcast.net> > If I use a module that spawns an external process and uses SIGCHLD to be > informed of its termination why should my innocent code that just reads > lines from a file suddenly break? In C I can at least restart the > operation after an EINTR but file.readline cannot even be properly > restarted because the buffering and file position is all messed up. I have never understood why a child dying should send a signal. You can poll for the child with waitpid() instead. But if you have a suggestion for how to fix this particular issue, I'd be happy to look it over, since this *is* something some people do. > The example you gave of bailing out of a read with a signal can be done > using other techniques such as non-blocking I/O (which is, IMHO, a much > cleaner way to do it). Yes. > Getting an notification of a child process terminating or other > asynchronous events can only be done using signals and is currently > dangerous because it will break code using I/O. See above. I see half your point; people wanting this tend to use signals and it causes breakage. > > > interference to Python code by signals. Any other problems I should > > > be aware of? 
> > > > There's no way to sufficiently test a program that uses signals. The > > signal handler cannot touch *any* data, which makes it pretty useless. > > In order to be useful a signal handler needs to be able to set one bit. > The next time the ticker expires this bit will be checked. OK. > If an I/O operation was interrupted the Python signal handler can be > executed immediately from the wrapper. When it returns the wrapper > will resume the interrupted operation. Is calling the Python signal handler from the wrapper always safe? What if the Python signal handler e.g. closes the file or reads from it? --Guido van Rossum (home page: http://www.python.org/~guido/) From guido@python.org Wed Sep 4 21:24:04 2002 From: guido@python.org (Guido van Rossum) Date: Wed, 04 Sep 2002 16:24:04 -0400 Subject: [Python-Dev] Signal-resistant code (was: Two random and nearly unrelated ideas) In-Reply-To: Your message of "Wed, 04 Sep 2002 23:07:09 +0300." <20020904230709.A24623@hishome.net> References: <20020904094947.GA56953@hishome.net> <20020904215131.A12898@hishome.net> <200209041916.g84JGPd08031@pcp02138704pcs.reston01.va.comcast.net> <20020904230709.A24623@hishome.net> Message-ID: <200209042024.g84KO4G08242@pcp02138704pcs.reston01.va.comcast.net> > > What if the operation is a select() call? Is restarting the right > > thing? How to take into account the consumed portion of the timeout, > > if given? > > Some versions of select update the timeout structure to the remainder if > they are interrupted by a signal. It's probably not a good idea to rely > on this so gettimeofday could be used to calculate the remainder. I like Neil's suggestion: simply return. The timeout is a hint. > > Or is this a stdio problem? I believe that calls like fgets() and > > getchar() don't lose data, but maybe I misunderstand your observation. 
>
> This is not the point - even if Python I/O calls were fully restartable
> would you actually expect people to check for EINTR and restart for
> *every* I/O operation in the program just in case some module happens to
> use signals?
>
> Instead of
>
>     for line in file:
>         do_something_with(line)
>
> we would need to write
>
>     while 1:
>         try:
>             line = file.next()
>         except IOError, exc:
>             if exc.errno == errno.EINTR:
>                 continue
>             else:
>                 raise
>         except StopIteration:
>             break
>         do_something_with(line)

OK, but you're changing your tune here. I agree that this is bad, but
I still don't believe (or understand) your previous remark about
readline losing track of buffering. But let's forget about this, I
trust that you really meant what you showed here.

> > As I said before, I'm very skeptical that making the I/O ops
> > EINTR-safe would be enough to allow the use of signals as
> > suggested by Skip
>
> If it's good enough for other purposes it should be good enough for
> Skip's proposal, too.

Well, it has to be *perfect* for Skip's proposal, since it means we'd
be generating signals probably at a rate of 100 per second.

> > Skip, but that might still be useful for other purposes, *if* we can
> > decide when to honor EINTR and when not.
>
> Only low-level functions like os.read and os.write that map directly
> to stdio functions should ever return EINTR.

Um, os.read/write are the ones that *don't* map to stdio. Maybe you
meant "that map directly to file descriptors"? But I doubt this would
be acceptable -- if we were generating 100 signals per second,
os.read/write become much harder to use if they could raise EINTR
(currently they only raise EINTR if the app uses signal handlers,
which isn't that common).

> To make Python signal-safe all other calls that can return EINTR
> should have a retry loop.
On EINTR they should check if there are > things to do and if so grab the GIL, make pending calls, release the > GIL and retry the operation (unless an exception has been raised by > the signal handler, of course). > > This way I could finally write a Python daemon that reloads its > configuration files on getting the customary SIGHUP :-) If you really want that, maybe you could see if you can produce a working design and patch? Even if it's not perfect enough to use signals to replace the ticker, people who like to use signals would probably be happy. --Guido van Rossum (home page: http://www.python.org/~guido/) From Jack.Jansen@oratrix.com Wed Sep 4 21:45:30 2002 From: Jack.Jansen@oratrix.com (Jack Jansen) Date: Wed, 4 Sep 2002 13:45:30 -0700 Subject: [Python-Dev] Signal-resistant code (was: Two random and nearly unrelated ideas) In-Reply-To: <20020904215131.A12898@hishome.net> Message-ID: <3FE2540C-C047-11D6-89C6-000A27B19B96@oratrix.com> On woensdag, sep 4, 2002, at 11:51 US/Pacific, Oren Tirosh wrote: > When an I/O operation is interrupted by an unmasked signal it returns > with errno==EINTR. The state of the file is not affected and repeating > the operation should recover and continue with no loss of data. > I'm not sure about modern unixen (it's been a long time since I was interested in such lowlevel details) but historically this has been one complete mess. Aside from some unix variations that basically didn't do restart at all there have always been problems with signal restart semantics. For sockets and various devices (raw ttys, I think) you could definitely lose data. Hmm, and when I think of it I don't think it's even possible to restart safely. What if I do a read() on a socket, and I request more bytes than the available physical memory (but less than VM, of course)? The kernel simply doesn't have anywhere to store the bytes other than my buffer, and if it has to return EINTR then >POOF< these bytes are gone forever. 
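
The retry loop discussed in this thread can be restated in Python roughly as follows. This is a sketch under stated assumptions, not anything from the Python source: the name read_retrying and its _read parameter are invented for the example, and newer Python 3 releases already retry interrupted system calls inside os.read, so the loop is shown for illustration only.

```python
import errno
import os


def read_retrying(fd, count, _read=os.read):
    """Read up to count bytes from fd, retrying if interrupted by a signal.

    The _read parameter exists only so the loop can be exercised
    without delivering real signals.
    """
    while True:
        try:
            # A short read (fewer bytes than requested) is returned as-is.
            return _read(fd, count)
        except OSError as exc:
            if exc.errno != errno.EINTR:
                raise
            # Interrupted before any data was transferred: try again.
```

Because the call is only repeated when it failed with EINTR, a read that was cut short after transferring some bytes is handed back to the caller unchanged rather than restarted.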
From guido@python.org Wed Sep 4 21:48:11 2002
From: guido@python.org (Guido van Rossum)
Date: Wed, 04 Sep 2002 16:48:11 -0400
Subject: [Python-Dev] Signal-resistant code (was: Two random and nearly unrelated ideas)
In-Reply-To: Your message of "Wed, 04 Sep 2002 13:45:30 PDT." <3FE2540C-C047-11D6-89C6-000A27B19B96@oratrix.com>
References: <3FE2540C-C047-11D6-89C6-000A27B19B96@oratrix.com>
Message-ID: <200209042048.g84KmCK08365@pcp02138704pcs.reston01.va.comcast.net>

[Jack]
> Hmm, and when I think of it I don't think it's even possible to restart
> safely. What if I do a read() on a socket, and I request more bytes
> than the available physical memory (but less than VM, of course)? The
> kernel simply doesn't have anywhere to store the bytes other than my
> buffer, and if it has to return EINTR then >POOF< these bytes are gone
> forever.

I think that if any bytes have already been copied into your buffer,
you don't get an EINTR, you get a short read.

--Guido van Rossum (home page: http://www.python.org/~guido/)

From walter@livinglogic.de Wed Sep 4 22:21:40 2002
From: walter@livinglogic.de (=?ISO-8859-1?Q?Walter_D=F6rwald?=)
Date: Wed, 04 Sep 2002 23:21:40 +0200
Subject: [Python-Dev] mimetypes patch #554192
References: <3D5BEBB8.7080904@livinglogic.de> <15707.61612.844119.819432@anthem.wooz.org> <3D5CE38D.9080905@livinglogic.de> <3D5F9C2D.8010209@livinglogic.de>
Message-ID: <3D767964.4090405@livinglogic.de>

Martin v. Loewis wrote:

> Walter Dörwald writes:
>
>
>>>>Even better would be, if we could assign priorities to the mappings,
>>>>so that for e.g. image/jpeg the preferred extension is .jpeg.
>>>>Then guess_type() and guess_extension() would return the preferred
>>>>mimetype/extension.
>>>
>>>Do you have a specific application for that in mind? It sounds like
>>>overkill.
>>
>>I'm using a web mirror script which uses the extensions from
>>guess_extension to save all downloaded resources, and I hate it
>>when the HTML files are named .htm and JPEG images are named .jpe.
>
> Then this is your preference - others might prefer jpg, just because
> their file system can deal better with that. If you can agree that
> this is your preference, you should put the preference mechanism into
> the application.

Agreed, other applications might have other priorities.

> Maybe your preference can be expressed algorithmically? It might be
> that you always want the longest known extension (it is unlikely that
> you prefer "jpeg" over "jpg" just because that contains a vowel :-).

I guess it's "longest one" or "the one most unencumbered by
filesystem limitations".

OK, so lets drop the priority idea. What do we do with the patch
as it is now?

Bye,
   Walter Dörwald

From pinard@iro.umontreal.ca Wed Sep 4 22:21:44 2002
From: pinard@iro.umontreal.ca (=?iso-8859-1?q?Fran=E7ois?= Pinard)
Date: Wed, 04 Sep 2002 17:21:44 -0400
Subject: [Python-Dev] Re: Signal-resistant code (was: Two random and nearly unrelated ideas)
In-Reply-To: <200209042048.g84KmCK08365@pcp02138704pcs.reston01.va.comcast.net> (Guido van Rossum's message of "Wed, 04 Sep 2002 16:48:11 -0400")
References: <3FE2540C-C047-11D6-89C6-000A27B19B96@oratrix.com> <200209042048.g84KmCK08365@pcp02138704pcs.reston01.va.comcast.net>
Message-ID:

[Guido van Rossum]

> [Jack]
>> Hmm, and when I think of it I don't think it's even possible to restart
>> safely. What if I do a read() on a socket, and I request more bytes
>> than the available physical memory (but less than VM, of course)? The
>> kernel simply doesn't have anywhere to store the bytes other than my
>> buffer, and if it has to return EINTR then >POOF< these bytes are gone
>> forever.
>
> I think that if any bytes have already been copied into your buffer,
> you don't get an EINTR, you get a short read.
I'm not fully familiar with all the details of this problem; it surely has
been in the air for quite a long time now (I might have first heard of it
while Taylor UUCP was being developed).  It might be dependent on the
underlying system.  If I'm not mistaken, it was Ian Taylor who introduced
the following Autoconf macro:

 - Macro: AC_SYS_RESTARTABLE_SYSCALLS
     If the system automatically restarts a system call that is
     interrupted by a signal, define `HAVE_RESTARTABLE_SYSCALLS'.

In GNU file utilities (now merged within the new GNU coreutils), Jim
Meyering uses restart wrappers for many I/O functions, so the idea of
wrappers has been maturing for a while, and is used in basic, heavily used
programs.  However, I did not look at such wrappers recently.  Python might
probably wrap calls when these are restartable, or transmit the error
upwards for systems where calls are not restartable.

--
François Pinard   http://www.iro.umontreal.ca/~pinard

From python@rcn.com Wed Sep 4 22:40:34 2002
From: python@rcn.com (Raymond Hettinger)
Date: Wed, 4 Sep 2002 17:40:34 -0400
Subject: [Python-Dev] Proposed Mixins for Wide Interfaces
References: <001101c2510d$9fce0920$5f66accf@othello> <200209031750.g83HoVq05812@odiug.zope.com>
Message-ID: <001801c2545b$b43aba60$e8ea7ad1@othello>

[RH]
> > How about adding some mixins to simplify the
> > implementation of some of the fatter interfaces?

[GvR]
> Can you suggest implementations for these, to be absolutely clear what
> you mean?
 -- snip --
> What if the "natural" thing to implement is __le__ instead of __lt__?
> That's the case for sets.  Or __gt__ (less likely)?

Yes.  Here is some code
---------------------------

class CompareMixin:
    """
    Given an __eq__ method in a subclass, adds a __ne__ method.
    Given __eq__ and __lt__, adds !=, <=, >, >=.
    If supplied, takes advantage of __le__ for speed.
    """
    def __eq__(self, other):
        raise NotImplementedError
    def __ne__(self, other):
        return not (self == other)
    def __lt__(self, other):
        raise NotImplementedError
    def __le__(self, other):
        return self < other or self == other
    def __gt__(self, other):
        return not (self <= other)
    def __ge__(self, other):
        return not (self < other)

## Example from sets
import mixins
class BaseSet(object, mixins.CompareMixin):
    """Common base class for mutable and immutable sets."""
    __slots__ = ['_data']
    # . . .
    def issubset(self, other):
        """Report whether another set contains this set."""
        self._binary_sanity_check(other)
        if len(self) > len(other):  # Fast check for obvious cases
            return False
        otherdata = other._data
        for elt in self:
            if elt not in otherdata:
                return False
        return True
    def __eq__(self, other):
        self._binary_sanity_check(other)
        return self._data == other._data
    def __lt__(self, other):
        self._binary_sanity_check(other)
        return len(self) < len(other) and self.issubset(other)
    __le__ = issubset    # optional, but recommended for speed.

# Example where gt is the most natural implementation
class Anyhoo(CompareMixin):
    __eq__ = someBigEqualityTest
    __gt__ = someBigComplexOrderingFunction
    def __lt__(self, other):
        return not (self > other or self == other)

[RH]
> > class MappingMixin:
> >     """
> >     Given __setitem__, __getitem__, and keys,
> >     implements values, items, update, get, setdefault, len,
> >     iterkeys, iteritems, itervalues, has_key, and __contains__.
> >
> >     If __delitem__ is also supplied, implements clear, pop,
> >     and popitem.
> >
> >     Takes advantage of __iter__ if supplied (recommended).

[GvR]
> Does that mean that if you have __iter__, you don't use keys()?  In
> that case it should implement keys() out of __iter__.  Maybe this
> should be required.

Not really.  keys() is always required.  If __iter__ is supplied,
then things like iterkeys(), iteritems(), and itervalues() get computed
from __iter__ rather than keys().
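A minimal sketch of what such a mapping mixin could look like, written for current Python rather than the 2.2-era dialect used in this thread.  It is not Raymond's implementation: the method bodies and the `TinyStore` example class are hypothetical, and has_key, clear, pop and popitem are left out for brevity.  It only illustrates the stated rule that keys() is the required minimum while the iter*() methods go through __iter__, which a subclass may override.

```python
class MappingMixin:
    """Hypothetical sketch: widen __getitem__, __setitem__ and keys()
    into a larger mapping interface."""

    def __iter__(self):
        # Fallback built on keys(); a subclass may supply a cheaper __iter__.
        return iter(self.keys())

    def iterkeys(self):
        return iter(self)

    def itervalues(self):
        for key in self:
            yield self[key]

    def iteritems(self):
        for key in self:
            yield key, self[key]

    def values(self):
        return list(self.itervalues())

    def items(self):
        return list(self.iteritems())

    def __len__(self):
        return len(self.keys())

    def __contains__(self, key):
        try:
            self[key]
        except KeyError:
            return False
        return True

    def get(self, key, default=None):
        try:
            return self[key]
        except KeyError:
            return default

    def setdefault(self, key, default=None):
        try:
            return self[key]
        except KeyError:
            self[key] = default
            return default

    def update(self, other):
        for key in other.keys():
            self[key] = other[key]


class TinyStore(MappingMixin):
    """Hypothetical narrow class: only the three required methods."""

    def __init__(self):
        self._pairs = []

    def __getitem__(self, key):
        for k, v in self._pairs:
            if k == key:
                return v
        raise KeyError(key)

    def __setitem__(self, key, value):
        for i, (k, _) in enumerate(self._pairs):
            if k == key:
                self._pairs[i] = (key, value)
                return
        self._pairs.append((key, value))

    def keys(self):
        return [k for k, _ in self._pairs]


store = TinyStore()
store["a"] = 1
store.update({"b": 2})
print(store.items(), "a" in store, store.get("z", 0), len(store))
```

In today's standard library, `collections.abc.MutableMapping` fills a similar role, although it asks for `__iter__` and `__len__` rather than `keys()` as the minimum.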
My thought on using keys() as part of the minimum specification is that database style interfaces always supply some type of list method. For instance, shelve can be instantly widened with the mixin, no other coding is required. OTOH, I'm not glued to the idea of using keys() as part of the minimum spec. [RH] > > Takes advantage of __contains__ or has_key if supplied > > (recommended). > > """ [GvR] > Let's standardize on __contains__, not has_key(). I guess you could > provide __contains__ as follows: Makes sense. [RH] > > The idea is to make it easier to implement these interfaces. > > Also, if the interfaces get expanded, the clients automatically > > updated. [GvR] > A similar thing for sequences would be useful too, right? Hmm, listing and concatenation beget repetition; len() and __getitem__() beget slicing. iteration and __cmp__ beget min(), max() For mutable sequences, supplying __setitem__ begets appending, extending, and slice assignment. Supplying __delitem__ begets pop(), remove() and slice deletion. For overachivers, the above are all that are needed for sort(), reverse(), index(), insert(), and count() Would you like me to create a mixin module and put it in the sandbox? Raymond Hettinger From pinard@iro.umontreal.ca Wed Sep 4 23:25:24 2002 From: pinard@iro.umontreal.ca (=?iso-8859-1?q?Fran=E7ois?= Pinard) Date: Wed, 04 Sep 2002 18:25:24 -0400 Subject: [Python-Dev] Re: Python-dev summary for 2002-08-15 - 2002-09-01 In-Reply-To: <005601c253d5$d0a63c50$ced241d5@hagrid> ("Fredrik Lundh"'s message of "Wed, 4 Sep 2002 07:41:42 +0200") References: <200209031653.g83GrjQ01929@odiug.zope.com> <200209032018.g83KI3q08343@odiug.zope.com> <17d001c2538d$f82650f0$1c86db41@boostconsulting.com> <008201c253aa$780144d0$6300000a@holdenweb.com> <005601c253d5$d0a63c50$ced241d5@hagrid> Message-ID: [Fredrik Lundh] > [...] and this mailing list is about python. Why did you reply to the mailing list, then? 
:-) > Fran�ois Pinard wrote: >> [Steve Holden] >> > It looks especially bad in my standard mailreader variable-pitch font. >> [...] People often insert ASCII tables or simple explicative drawings, >> these capabilities are useful enough for not being dismissed. You should >> use fixed width fonts [...] > > loser. > > if python really was all about "everything computers did when I learned to > use them will always be the best way to do it", it would probably never have > been invented. Python did not build its success by trying to convince people that every else is wrong. It rather offered an environment in which participants happily considered they were gaining a lot. If someone breaks its screen appearance through selection of inappropriate fonts, he might gain some pleasure indeed while loosing the ability to read many existing messages. That's really his choice and preferences, he has to live with the drawbacks, without trying to convince senders that they are all wrong. Considering others as losers does not efficiently trigger progress. -- Fran�ois Pinard http://www.iro.umontreal.ca/~pinard From hu.peress@mail.mcgill.ca Sat Sep 7 23:30:59 2002 From: hu.peress@mail.mcgill.ca (Hunter Peress) Date: 07 Sep 2002 17:30:59 -0500 Subject: [Python-Dev] Call for clarity Message-ID: <1031437860.636.29.camel@HillCountryPeress> I've been using python for a good few months. And im really bothered by some aspects of the documentation. I think that there should be a clear effort to provide API style information, rather than the mixed state that things currently are. There are tools for C++/Java...that are part of the official distributions that provide API style docs. Here's what gets me: when u look up something in pydoc, you have no idea what it returns/expects in terms of types. Now, since python is not an explicitely typed language, I ask rhetorically, how can u have good docs that tell u the return/input types without making the language explicitely typed? 
Make the documenation system explictely typed. The clarification needs to happen somewhere along the lines, and I really think that the world would rather not have it happening at runtime. This could clear up a lot of confusion and further python's effectiveness. -Hunter. From bkc@murkworks.com Wed Sep 4 23:39:01 2002 From: bkc@murkworks.com (Brad Clements) Date: Wed, 04 Sep 2002 18:39:01 -0400 Subject: [Python-Dev] Getting started with GBayes testing Message-ID: <3D7653AD.14352.14F391B6@localhost> Hi, I'm interested in contributing to GBayes .. I'm thinking of trying word stemming and adding other types of token indicators. How can I contribute? Btw, I have been saving up my spam for a year or so.. I have about 31,238 spam messages saved up now. These are categorized as spam based on my reading of the subject, or examining the body when in doubt. There are probably 10% dups in the corpus. Some of them have viruses, likely klez. I'd like to replicate Tim's test rig so I can compare my results with existing ones. My spam isn't in mbox format, but I can convert it.. I'm particularly intersted in how to allow html only messages (reduce false positives). I'm getting a lot of personal mail in that format, unfortunately. Brad Clements, bkc@murkworks.com (315)268-1000 http://www.murkworks.com (315)268-9812 Fax AOL-IM: BKClements From martin@v.loewis.de Wed Sep 4 23:56:30 2002 From: martin@v.loewis.de (Martin v. Loewis) Date: 05 Sep 2002 00:56:30 +0200 Subject: [Python-Dev] Call for clarity In-Reply-To: <1031437860.636.29.camel@HillCountryPeress> References: <1031437860.636.29.camel@HillCountryPeress> Message-ID: Hunter Peress writes: > This could clear up a lot of confusion and further python's > effectiveness. It's not clear, to me, from reading your message, what kind of change you are requesting (that you are requesting a change, rather than offering help, or asking for advice, appears to be clear). 
Could you kindly provide a small patch that gives an idea of what you would like to see changed, and how? TIA, Martin From drifty@bigfoot.com Thu Sep 5 00:25:29 2002 From: drifty@bigfoot.com (Brett Cannon) Date: Wed, 4 Sep 2002 16:25:29 -0700 (PDT) Subject: [Python-Dev] Proposed Mixins for Wide Interfaces In-Reply-To: <001801c2545b$b43aba60$e8ea7ad1@othello> Message-ID: [Raymond Hettinger] > [RH] > > > How about adding some mixins to simplify the > > > implementation of some of the fatter interfaces? > This is a spur-of-the-moment thought, so it might not be a reasonable comment, but do we care that all of these methods will show up in when using dir() or any other introspective check? While I think the idea is great, it might give this sense that they are really, truly implemented for the class instead of reliant on the other implementations; the side effects of changing one of the required methods might have unexpected consequences for the user. But since I think this is a great idea, I don't want to see it disappear because of this; I guess I better solve my issue. =) Perhaps we can just make sure that this gets documented in both the API and in the doc strings saying that it is from the mixin and what methods it is dependent upon. That should be enough to squash my worry. And yes, if this gets into the core and Raymond does not want to do it, I will help with the doc patches. -Brett C. From hu.peress@mail.mcgill.ca Sun Sep 8 00:47:43 2002 From: hu.peress@mail.mcgill.ca (Hunter Peress) Date: 07 Sep 2002 18:47:43 -0500 Subject: [Python-Dev] Call for clarity ( clarification ;-) ) In-Reply-To: References: <1031437860.636.29.camel@HillCountryPeress> Message-ID: <1031442464.644.68.camel@HillCountryPeress> Ok heres some more detail. I have no idea how pydoc works right now. I assume you call some program on a python file, and it simply looks for all """ """. 
It seems to do SOME lexical/scoping analysis of where to look for """ """, and consequently, how to display that information in the final,doc form; but I'm asking for more. As I said, python methods/functions are not explcitely typed. So what I propose is this: When the pydoc generator comes accross a function/method, there should remain a normal """ """ area for any comments. I'm asking now, that when the generator sees its in a method/function, it does a NEW check for a set of docs that document the type of each input argument, and the output. EG (theoretical, and off the top of my head): in a file you have a function: def something(a,b,c="lalal"): """This will find its way into the pydocs because its a comment""" ##Here is the new stuff Im proposing ##note, a clearer sytnax can surely be devised. """file""" #documents the type of the first arg """string""" # "" second """list""" # "" third """string""" #documents the return type. Then the pydoc generator will do a check on the # arguments to the func/meth, verify that the correct amount of these new comments (which only supply the type) are provided. I do think that it would help to actually enforce this. I think its fine that doc's NOT be generated if they don't supply this information. This provides for better docs and shouldnt get that many complaints. Then: If the docs are generated into webpages, links to the known types that are checked are provided. And if the docs are going into shell format then i dont know if links are necessary. There are lots of cases and issues that I havent discussed for this proposed implemenation. So I would like to continue this thread for the purposes of detailing this idea further. > > This could clear up a lot of confusion and further python's > > effectiveness. 
As we know, python is not an explicitely typed language, but enforcing some level of typing at the documentation level will see a lot of people falling into line (depending on how rigidly its enforced, and i do suggest a pretty rigid level). I have no patch ATM because I tend to design software before writing it, and im looking for support from the developers first. PS whats TIA mean? On Wed, 2002-09-04 at 17:56, Martin v. Loewis wrote: > Hunter Peress writes: > > > This could clear up a lot of confusion and further python's > > effectiveness. > > It's not clear, to me, from reading your message, what kind of change > you are requesting (that you are requesting a change, rather than > offering help, or asking for advice, appears to be clear). > > Could you kindly provide a small patch that gives an idea of what you > would like to see changed, and how? > > TIA, > Martin > > From python@rcn.com Thu Sep 5 01:19:04 2002 From: python@rcn.com (Raymond Hettinger) Date: Wed, 4 Sep 2002 20:19:04 -0400 Subject: [Python-Dev] Call for clarity ( clarification ;-) ) References: <1031437860.636.29.camel@HillCountryPeress> <1031442464.644.68.camel@HillCountryPeress> Message-ID: <003d01c25471$d83fe960$2fd8accf@othello> From: "Hunter Peress" > def something(a,b,c="lalal"): > """This will find its way into the pydocs because its a comment""" > ##Here is the new stuff Im proposing > ##note, a clearer sytnax can surely be devised. > """file""" #documents the type of the first arg > """string""" # "" second > """list""" # "" third > """string""" #documents the return type. > > Then the pydoc generator will do a check on the # arguments to the > func/meth, verify that the correct amount of these new comments (which > only supply the type) are provided. I do think that it would help to > actually enforce this. I think its fine that doc's NOT be generated if > they don't supply this information. This provides for better docs and > shouldnt get that many complaints. 
Thanks for the clarification. I see what you're trying to do; however, I think that any gains are more than offset by the new level of complexity and lengthier code. The current docs make a pretty good effort at describing what is needed for each argument. At the same time, they allow flexibility for dynamic arguments that share a similar interface (such as substituting a StringIO object for a File object. In your example, the docs strings could be made clear using existing tools: def something(file, promptstring, optionlist): """Returns a string extracted from the file for any line matching the promptstring. The optionlist can include any of the following: IGNORECASE, VERBOSE. MULTILINE, or ADDLINENUMBER.""" I can't see that a tool like you described would add any more clarity than the above docstring. > PS whats TIA mean? "Thanks In Advance" Do you have any examples of current python docstrings that are not clear enough? Raymond Hettinger From greg@cosc.canterbury.ac.nz Thu Sep 5 01:30:40 2002 From: greg@cosc.canterbury.ac.nz (Greg Ewing) Date: Thu, 05 Sep 2002 12:30:40 +1200 (NZST) Subject: [Python-Dev] Re: Signal-resistant code (was: Two random and nearly unrelated ideas) In-Reply-To: Message-ID: <200209050030.g850UeI2026648@kuku.cosc.canterbury.ac.nz> pinard@iro.umontreal.ca: > - Macro: AC_SYS_RESTARTABLE_SYSCALLS > If the system automatically restarts a system call that is > interrupted by a signal, define `HAVE_RESTARTABLE_SYSCALLS'. > > Python might probably wrap calls when > these are restartable, or transmit the error upwards for systems where calls > are not restartable. I think that macro means that you *don't* have to use a wrapper to restart syscalls, because it happens automatically. So if it's not defined it means you have to restart them manually, not that they can't be restarted at all. 
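A minimal sketch of the manual restart described above, in Python.  The helper name `retry_on_eintr` and the `flaky_read` stand-in are hypothetical; the sketch assumes the interrupted call reports EINTR as an `OSError` and that nothing was transferred before the interruption, so calling it again is safe.

```python
import errno


def retry_on_eintr(func, *args):
    """Call func(*args), restarting it whenever it fails with EINTR."""
    while True:
        try:
            return func(*args)
        except OSError as exc:
            if exc.errno != errno.EINTR:
                raise  # any other error is passed upwards unchanged
            # Interrupted by a signal before completing: try again.


# Stand-in for a system call that is interrupted twice before succeeding.
attempts = []

def flaky_read():
    attempts.append(1)
    if len(attempts) < 3:
        raise OSError(errno.EINTR, "Interrupted system call")
    return b"data"


result = retry_on_eintr(flaky_read)
print(result, len(attempts))
```

Since Python 3.5 (PEP 475) the interpreter itself retries most system calls that fail with EINTR, so a wrapper like this is mainly of illustrative value there.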
Greg Ewing, Computer Science Dept, +--------------------------------------+ University of Canterbury, | A citizen of NewZealandCorp, a | Christchurch, New Zealand | wholly-owned subsidiary of USA Inc. | greg@cosc.canterbury.ac.nz +--------------------------------------+ From guido@python.org Thu Sep 5 01:24:29 2002 From: guido@python.org (Guido van Rossum) Date: Wed, 04 Sep 2002 20:24:29 -0400 Subject: [Python-Dev] Getting started with GBayes testing In-Reply-To: Your message of "Wed, 04 Sep 2002 18:39:01 EDT." <3D7653AD.14352.14F391B6@localhost> References: <3D7653AD.14352.14F391B6@localhost> Message-ID: <200209050024.g850OTd08824@pcp02138704pcs.reston01.va.comcast.net> > I'm interested in contributing to GBayes .. > > I'm thinking of trying word stemming and adding other types of token > indicators. How can I contribute? Pretty soon, a SF propject will be created (Barry has already gotten the request in). We'll gladly add you to the list of developers. > Btw, I have been saving up my spam for a year or so.. I have about > 31,238 spam messages saved up now. These are categorized as spam > based on my reading of the subject, or examining the body when in > doubt. There are probably 10% dups in the corpus. Some of them have > viruses, likely klez. Cool. > I'd like to replicate Tim's test rig so I can compare my results > with existing ones. My spam isn't in mbox format, but I can convert > it.. If you can't wait for the SF project, you can find all the code in the Python CVS tree: http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/python/python/nondist/sandbox/spambayes/ > I'm particularly intersted in how to allow html only messages > (reduce false positives). I'm getting a lot of personal mail in > that format, unfortunately. You train it with an equal number of spam and non-spam ("ham") that you received. Just make sure the ham training messages contain enough representatives of the html-only mail. 
--Guido van Rossum (home page: http://www.python.org/~guido/) From greg@cosc.canterbury.ac.nz Thu Sep 5 01:36:18 2002 From: greg@cosc.canterbury.ac.nz (Greg Ewing) Date: Thu, 05 Sep 2002 12:36:18 +1200 (NZST) Subject: [Python-Dev] Signal-resistant code (was: Two random and nearly unrelated ideas) In-Reply-To: <3FE2540C-C047-11D6-89C6-000A27B19B96@oratrix.com> Message-ID: <200209050036.g850aIkU026656@kuku.cosc.canterbury.ac.nz> Jack Jansen : > Aside from some unix variations that basically didn't do restart at all > there have always been problems with signal restart semantics. For > sockets and various devices (raw ttys, I think) you could definitely > lose data. Sockets? Are you sure? I find it unlikely that such a severe problem could persist in many Unix variants for so long. I've never heard of any mention of such a thing. Greg Ewing, Computer Science Dept, +--------------------------------------+ University of Canterbury, | A citizen of NewZealandCorp, a | Christchurch, New Zealand | wholly-owned subsidiary of USA Inc. | greg@cosc.canterbury.ac.nz +--------------------------------------+ From greg@cosc.canterbury.ac.nz Thu Sep 5 01:38:09 2002 From: greg@cosc.canterbury.ac.nz (Greg Ewing) Date: Thu, 05 Sep 2002 12:38:09 +1200 (NZST) Subject: [Python-Dev] Re: Signal-resistant code (was: Two random and nearly unrelated ideas) In-Reply-To: <200209042005.g84K5Ms08177@pcp02138704pcs.reston01.va.comcast.net> Message-ID: <200209050038.g850c9mY026662@kuku.cosc.canterbury.ac.nz> Guido van Rossum : > I have never understood why a child dying should send a signal. You > can poll for the child with waitpid() instead. Because child termination might not be the only thing you want to wait for. Greg Ewing, Computer Science Dept, +--------------------------------------+ University of Canterbury, | A citizen of NewZealandCorp, a | Christchurch, New Zealand | wholly-owned subsidiary of USA Inc. 
| greg@cosc.canterbury.ac.nz +--------------------------------------+ From guido@python.org Thu Sep 5 01:32:21 2002 From: guido@python.org (Guido van Rossum) Date: Wed, 04 Sep 2002 20:32:21 -0400 Subject: [Python-Dev] Proposed Mixins for Wide Interfaces In-Reply-To: Your message of "Wed, 04 Sep 2002 16:25:29 PDT." References: Message-ID: <200209050032.g850WLG08875@pcp02138704pcs.reston01.va.comcast.net> > This is a spur-of-the-moment thought, so it might not be a > reasonable comment, but do we care that all of these methods will > show up in when using dir() or any other introspective check? While > I think the idea is great, it might give this sense that they are > really, truly implemented for the class instead of reliant on the > other implementations; the side effects of changing one of the > required methods might have unexpected consequences for the user. dir() *intends* to show methods regardless of whether they are implemented in the class or in a base class. So this doesn't sound like a valid objection. Pydoc shows inherited methods separately. --Guido van Rossum (home page: http://www.python.org/~guido/) From whisper@oz.net Thu Sep 5 01:48:07 2002 From: whisper@oz.net (David LeBlanc) Date: Wed, 4 Sep 2002 17:48:07 -0700 Subject: [Python-Dev] Getting started with GBayes testing In-Reply-To: <200209050024.g850OTd08824@pcp02138704pcs.reston01.va.comcast.net> Message-ID: I would like to be in on that project too please. David LeBlanc Seattle, WA USA > -----Original Message----- > From: python-dev-admin@python.org [mailto:python-dev-admin@python.org]On > Behalf Of Guido van Rossum > Sent: Wednesday, September 04, 2002 17:24 > To: bkc@murkworks.com > Cc: python-dev@python.org > Subject: Re: [Python-Dev] Getting started with GBayes testing > > > > I'm interested in contributing to GBayes .. > > > > I'm thinking of trying word stemming and adding other types of token > > indicators. How can I contribute? 
> > Pretty soon, a SF propject will be created (Barry has already gotten > the request in). We'll gladly add you to the list of developers. > > > Btw, I have been saving up my spam for a year or so.. I have about > > 31,238 spam messages saved up now. These are categorized as spam > > based on my reading of the subject, or examining the body when in > > doubt. There are probably 10% dups in the corpus. Some of them have > > viruses, likely klez. > > Cool. > > > I'd like to replicate Tim's test rig so I can compare my results > > with existing ones. My spam isn't in mbox format, but I can convert > > it.. > > If you can't wait for the SF project, you can find all the code in the > Python CVS tree: > > > http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/python/python/nondi > st/sandbox/spambayes/ > > > I'm particularly intersted in how to allow html only messages > > (reduce false positives). I'm getting a lot of personal mail in > > that format, unfortunately. > > You train it with an equal number of spam and non-spam ("ham") that > you received. Just make sure the ham training messages contain enough > representatives of the html-only mail. > > --Guido van Rossum (home page: http://www.python.org/~guido/) > > _______________________________________________ > Python-Dev mailing list > Python-Dev@python.org > http://mail.python.org/mailman/listinfo/python-dev From barry@python.org Thu Sep 5 01:48:48 2002 From: barry@python.org (Barry A. Warsaw) Date: Wed, 4 Sep 2002 20:48:48 -0400 Subject: [Python-Dev] Re: [Python-checkins] python/nondist/sandbox/spambayes classifier.py,1.8,1.9 References: <004301c25472$78f62f40$2fd8accf@othello> Message-ID: <15734.43504.641800.957590@anthem.wooz.org> >>>>> "RH" == Raymond Hettinger writes: >> A now-rare pure win, changing spamprob() to work harder to find >> more evidence when competing 0.01 and 0.99 clues appear RH> I hope these victories make it back to the world outside of RH> Python (assuming there is one). 
The world needs good spam RH> filters. Indeed, I too hope they will. I just got approved for a SF project called "spambayes" and plan to move the code there. I'll try to coordinate that with Tim, and then make a more detailed announcement tomorrow. -Barry From tim.one@comcast.net Thu Sep 5 01:52:14 2002 From: tim.one@comcast.net (Tim Peters) Date: Wed, 04 Sep 2002 20:52:14 -0400 Subject: [Python-Dev] The first trustworthy GBayes results In-Reply-To: Message-ID: [Tim] > ... > The first 16 most extreme indicators are split 9 highly in favor of ham > (.01) and 7 highly in favor of spam (.99). If I hadn't folded > case away to let stinking conference announcements through , I > expect it would have latched on to the SCREAMING at the start instead of > looking deeper. Looking at the To: line probably would nail this one too, > as "Undisclosed Recipients" has two 0.99 spam indicators right there. > > Whatever, you *don't* want to look at msgs with a mix of just > 0.99 and 0.01 thingies: it's not all that unusual to get such an > extreme mix, in spam or ham. I should have added that it usually gets the right result when this happens. It's the exceptions to that rule that are mondo embarrassing, because it's making a mistake then while sitting on a mountain of strong evidence (albeit pointing as extremely as possible in both directions at once ). "A problem" is that when a MIN_SPAMPROB and MAX_SPAMPROB clue both appear, the math is such that they cancel out exactly. It's *almost* as if neither existed, but not quite: they also keep two lower-probability words *out* of the computation (only a grand total of the MAX_DISCRIMINATORS most extreme clues are retained). So I changed spamprob() to keep accepting more clues when MIN/MAX cancellations are inevitable, and to use the best of those in lieu of the cancelling extremes. 
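The cancellation can be checked with a few lines of Python.  This is only a sketch: it assumes the Graham-style combining rule P = prod(p) / (prod(p) + prod(1 - p)), the function name `combine` and the sample probabilities are made up for illustration, and it does not reproduce the actual spamprob() code or its clue-selection change.

```python
MIN_SPAMPROB = 0.01
MAX_SPAMPROB = 0.99


def combine(probs):
    """Combine per-word spam probabilities, Graham style."""
    prod_p = 1.0
    prod_q = 1.0
    for p in probs:
        prod_p *= p
        prod_q *= 1.0 - p
    return prod_p / (prod_p + prod_q)


middling = [0.80, 0.30, 0.60]        # illustrative clue values only

# A 0.01 clue contributes 0.01 to the numerator product and 0.99 to the
# complement product; a 0.99 clue does the reverse.  Together they scale
# both products by the same factor, so the score does not move.
with_pair = combine(middling + [MIN_SPAMPROB, MAX_SPAMPROB])
without_pair = combine(middling)
print(with_pair, without_pair)
```

The pair still costs two of the MAX_DISCRIMINATORS slots, which is why letting further clues in when such a pair shows up can change the outcome.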
This turned out to be a pure win:

false positive percentages
    0.000  0.000  tied
    0.000  0.000  tied
    0.050  0.050  tied
    0.000  0.000  tied
    0.025  0.025  tied
    0.025  0.025  tied
    0.050  0.050  tied
    0.025  0.025  tied
    0.025  0.025  tied
    0.025  0.025  tied
    0.075  0.075  tied
    0.025  0.025  tied
    0.025  0.025  tied
    0.025  0.025  tied
    0.075  0.025  won
    0.025  0.025  tied
    0.025  0.025  tied
    0.000  0.000  tied
    0.025  0.025  tied
    0.050  0.050  tied

won   1 times
tied 19 times
lost  0 times

total unique fp went from 9 to 7

false negative percentages
    0.909  0.764  won
    0.800  0.691  won
    1.091  0.981  won
    1.381  1.309  won
    1.491  1.418  won
    1.055  0.873  won
    0.945  0.800  won
    1.236  1.163  won
    1.564  1.491  won
    1.200  1.200  tied
    1.454  1.381  won
    1.599  1.454  won
    1.236  1.164  won
    0.800  0.655  won
    0.836  0.655  won
    1.236  1.163  won
    1.236  1.200  won
    1.055  0.982  won
    1.127  0.982  won
    1.381  1.236  won

won  19 times
tied  1 times
lost  0 times

total unique fn went from 284 to 260

From sholden@holdenweb.com Thu Sep 5 01:55:59 2002
From: sholden@holdenweb.com (Steve Holden)
Date: Wed, 4 Sep 2002 20:55:59 -0400
Subject: [Python-Dev] Re: Python-dev summary for 2002-08-15 - 2002-09-01
References: <200209031653.g83GrjQ01929@odiug.zope.com><200209032018.g83KI3q08343@odiug.zope.com><17d001c2538d$f82650f0$1c86db41@boostconsulting.com><008201c253aa$780144d0$6300000a@holdenweb.com><005601c253d5$d0a63c50$ced241d5@hagrid>
Message-ID: <006901c25477$01b9cb30$6300000a@holdenweb.com>

[François Pinard]
> [Fredrik Lundh]
>
> > [...] and this mailing list is about python.
>
> Why did you reply to the mailing list, then? :-)
>
The effbot is a law unto itself :-)

> > François Pinard wrote:
> >> [Steve Holden]
>
> >> > It looks especially bad in my standard mailreader variable-pitch font.
> [...]
>
> Python did not build its success by trying to convince people that every else
> is wrong.  It rather offered an environment in which participants happily
> considered they were gaining a lot.
>
erm, ...

> If someone breaks its screen appearance through selection of inappropriate
> fonts, he might gain some pleasure indeed while loosing the ability to read
> many existing messages.  That's really his choice and preferences, he has to
> live with the drawbacks, without trying to convince senders that they are all
> wrong.  Considering others as losers does not efficiently trigger progress.
>
I don't really consider """It looks especially bad in my standard mailreader
variable-pitch font""" to be sufficiently evangelical to deserve this
rebuke, but then I didn't really consider your rebuke deserved either, so I
guess we should just terminate this thread now.

regards
-----------------------------------------------------------------------
Steve Holden                                 http://www.holdenweb.com/
Python Web Programming                 pydish.holdenweb.com/pwp/
Previous .sig file retired to                    www.homeforoldsigs.com
-----------------------------------------------------------------------

From oren-py-d@hishome.net Thu Sep 5 06:27:37 2002
From: oren-py-d@hishome.net (Oren Tirosh)
Date: Thu, 5 Sep 2002 08:27:37 +0300
Subject: [Python-Dev] Signal-resistant code (was: Two random and nearly unrelated ideas)
In-Reply-To: <200209042048.g84KmCK08365@pcp02138704pcs.reston01.va.comcast.net>; from guido@python.org on Wed, Sep 04, 2002 at 04:48:11PM -0400
References: <3FE2540C-C047-11D6-89C6-000A27B19B96@oratrix.com> <200209042048.g84KmCK08365@pcp02138704pcs.reston01.va.comcast.net>
Message-ID: <20020905082737.A31267@hishome.net>

On Wed, Sep 04, 2002 at 04:48:11PM -0400, Guido van Rossum wrote:
> [Jack]
> > Hmm, and when I think of it I don't think it's even possible to restart
> > safely.  What if I do a read() on a socket, and I request more bytes
> > than the available physical memory (but less than VM, of course)?  The
> > kernel simply doesn't have anywhere to store the bytes other than my
> > buffer, and if it has to return EINTR then >POOF< these bytes are gone
> > forever.
>
> I think that if any bytes have already been copied into your buffer,
> you don't get an EINTR, you get a short read.

From read(2) man page:

       EINTR  The call was interrupted by a signal before any data was read.

Same applies to write, recv, fcntl with locks, semop, etc.  They're all
designed to be restartable.  The keyword in all cases is "before".

    Oren

From goodger@users.sourceforge.net Thu Sep 5 03:44:23 2002
From: goodger@users.sourceforge.net (David Goodger)
Date: Wed, 04 Sep 2002 22:44:23 -0400
Subject: [Python-Dev] Misc/NEWS (was: Two random and nearly unrelated ideas)
In-Reply-To: <200209041151.g84BpAg05683@pcp02138704pcs.reston01.va.comcast.net>
Message-ID:

[Skip]
>> While adding a blurb to Misc/NEWS about the change to the thread
>> ticker and check interval, it occurred to me that perhaps Misc/NEWS
>> would benefit from conversion to ReST format.  You could pump an
>> HTML version out to the website periodically.

I have the Docutils site auto-regenerated via a small cron script.
Any time any of the source text files change, within an hour the site
reflects the change.  It makes site maintenance easy.

(BTW, Skip, thanks for the bug report.  I'll be looking into it ASAP.)

[Guido]
> Nice idea.  How much additional mark-up would this add to quote the
> occasional reST meta-character?

Very little, depending on the desired effect.  The extreme case would
be if you want to mark up everything possible.  The result may look
too busy in the source text form though, especially because there are
so many Python identifiers, expressions, code snippets, and file names
that *could* be marked up.  It's a trade-off.

The nice thing is that Misc/NEWS is already almost valid
reStructuredText (which shouldn't be surprising, since
reStructuredText is based on common usage).  In fact, most (if not
all) of the standalone text files are almost there: README, PLAN.txt,
etc.  It wouldn't be much work to bring them up to spec.
Here are the areas of Misc/NEWS that would require editing: * Sections: The two-line titles aren't supported. Either they should be combined into one line, or the "Release date" line should become part of the section body. Either:: What's New in Python 2.2 final? Release date: 21-Dec-2001 ========================================================== or:: What's New in Python 2.2 final? =============================== Release date: 21-Dec-2001 * Subsections (like "Core and builtins", "Library", "Extension modules", etc.): These could be made into true subsections by underlining them with dashes (and changing to title case):: Core and Builtins ----------------- I notice that there are many headers for empty subsections (such as "Tools/Demos" and "Build" in "What's New in Python 2.2 final?"). Should they be removed? * Inline literals (filenames, identifiers, expressions and code snippets): Surround with double-backquotes to get monospaced, uninterpreted text (like HTML TT tags). There are so many of these that it may be best to be selective. * Literal blocks: Example code should be indented and prefaced with double-colons ("::" at the end of the preceding paragraph). Doctest blocks (interactive sessions, begin with ">>> " and end with a blank line) don't need this, although it wouldn't hurt. > Can you convert a section for test and show me? I'll be happy to help. Hmm. Looking at the 2.2.1 Misc/NEWS file, I see sections for 2.2.1 final, 2.2.1c2, etc., but they're missing from the CVS Misc/NEWS file. Is this normal because of separate development branches or is something amiss? Following is a converted section from the current Misc/NEWS. Minimally marked up: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> What's New in Python 2.3 alpha 1? 
=================================

XXX Release date: DD-MMM-2002 XXX

Type/class unification and new-style classes
--------------------------------------------

- Assignment to __class__ is disallowed if either the old and the new
  class is a statically allocated type object (such as defined by an
  extension module). This prevents anomalies like
  ``2 .__class__ = bool``.

- New-style object creation and deallocation have been sped up
  significantly; they are now faster than classic instance creation
  and deallocation.

- The __slots__ variable can now mention "private" names, and the
  right thing will happen (e.g. ``__slots__ = ["__foo"]``).

- The built-ins slice() and buffer() are now callable types. The
  types classobj (formerly class), code, function, instance, and
  instancemethod (formerly instance-method), which have no built-in
  names but are accessible through the types module, are now also
  callable. The type dict-proxy is renamed to dictproxy.

- Cycles going through the __class__ link of a new-style instance are
  now detected by the garbage collector.

- Classes using __slots__ are now properly garbage collected.
  [SF bug 519621]

- Tightened the __slots__ rules: a slot name must be a valid Python
  identifier.

- The constructor for the module type now requires a name argument and
  takes an optional docstring argument. Previously, this constructor
  ignored its arguments. As a consequence, deriving a class from a
  module (not from the module type) is now illegal; previously this
  created an unnamed module, just like invoking the module type did.
  [SF bug 563060]

- A new type object, 'basestring', is added. This is a common base
  type for 'str' and 'unicode', and can be used instead of
  ``types.StringTypes``, e.g. to test whether something is "a string":
  ``isinstance(x, basestring)`` is True for Unicode and 8-bit strings.
  This is an abstract base class and cannot be instantiated directly.

- Changed new-style class instantiation so that when C's __new__
  method returns something that's not a C instance, its __init__ is
  not called. [SF bug #537450]

- Fixed super() to work correctly with class methods.
  [SF bug #535444]

- If you try to pickle an instance of a class that has __slots__ but
  doesn't define or override __getstate__, a TypeError is now raised.
  This is done by adding a bozo __getstate__ to the class that always
  raises TypeError. (Before, this would appear to be pickled, but the
  state of the slots would be lost.)

<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

Maximally marked up:

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

What's New in Python 2.3 alpha 1?
=================================

XXX Release date: DD-MMM-2002 XXX

Type/class unification and new-style classes
--------------------------------------------

- Assignment to ``__class__`` is disallowed if either the old and the
  new class is a statically allocated type object (such as defined by
  an extension module). This prevents anomalies like
  ``2 .__class__ = bool``.

- New-style object creation and deallocation have been sped up
  significantly; they are now faster than classic instance creation
  and deallocation.

- The ``__slots__`` variable can now mention "private" names, and the
  right thing will happen (e.g. ``__slots__ = ["__foo"]``).

- The built-ins ``slice()`` and ``buffer()`` are now callable types.
  The types classobj (formerly class), code, function, instance, and
  instancemethod (formerly instance-method), which have no built-in
  names but are accessible through the ``types`` module, are now also
  callable. The type dict-proxy is renamed to dictproxy.

- Cycles going through the ``__class__`` link of a new-style instance
  are now detected by the garbage collector.

- Classes using ``__slots__`` are now properly garbage collected.
  [SF bug 519621]

- Tightened the ``__slots__`` rules: a slot name must be a valid
  Python identifier.

- The constructor for the module type now requires a name argument and
  takes an optional docstring argument. Previously, this constructor
  ignored its arguments. As a consequence, deriving a class from a
  module (not from the module type) is now illegal; previously this
  created an unnamed module, just like invoking the module type did.
  [SF bug 563060]

- A new type object, ``basestring``, is added. This is a common base
  type for ``str`` and ``unicode``, and can be used instead of
  ``types.StringTypes``, e.g. to test whether something is "a string":
  ``isinstance(x, basestring)`` is ``True`` for Unicode and 8-bit
  strings. This is an abstract base class and cannot be instantiated
  directly.

- Changed new-style class instantiation so that when C's ``__new__``
  method returns something that's not a C instance, its ``__init__``
  is not called. [SF bug #537450]

- Fixed ``super()`` to work correctly with class methods.
  [SF bug #535444]

- If you try to pickle an instance of a class that has ``__slots__``
  but doesn't define or override ``__getstate__``, a ``TypeError`` is
  now raised. This is done by adding a bozo ``__getstate__`` to the
  class that always raises ``TypeError``. (Before, this would appear
  to be pickled, but the state of the slots would be lost.)

<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

--
David Goodger    Open-source projects:
  - Python Docutils: http://docutils.sourceforge.net/
    (includes reStructuredText: http://docutils.sf.net/rst.html)
  - The Go Tools Project: http://gotools.sourceforge.net/

From mwh@python.net Thu Sep 5 10:34:34 2002
From: mwh@python.net (Michael Hudson)
Date: 05 Sep 2002 10:34:34 +0100
Subject: [Python-Dev] Please help in calling python fucntion from 'c'
In-Reply-To: "Praveen Patil"'s message of "Wed, 4 Sep 2002 13:31:00 +0100"
References:
Message-ID: <2m8z2gok45.fsf@starship.python.net>

"Praveen Patil" writes:

> This is a multi-part message in MIME format.
> ------=_NextPart_000_0011_01C25417.4EC8F910 > Content-Type: text/plain; > charset="iso-8859-1" > Content-Transfer-Encoding: 7bit > > Hi, > > I have written 'C' dll(MY_DLL.DLL) . I am importing 'C' dll in python > file(example.py). > I want to call python function from 'c' function. > For your reference I have attached 'c' and python files to this mail. > In my pc: > python code is under the directory D:\test\example.py > dll is under the directory C:\Program Files\Python\DLLs\MY_DLL.pyd > > Here are the steps I am following. > > step(1): I am calling 'C' function(RECEIVE_FROM_IL_S) from python. > This 'C' function is existing imported dll(MY_DLL). > step(2): I want to call python function(TestFunction) from 'C' > function(RECEIVE_FROM_IL_S). > > > Python code is(example.py) :- > ---------------------------- > import MY_DLL > > G_Logfile = None > > def TestFunction(): > G_Logfile = open('Pytestfile.txt', 'w') > G_Logfile.write("%s \n"%'I am writing python created text file') > G_Logfile.close > G_Logfile = None > #end def TestFunction > > if __name__ == "__main__": > > MY_DLL.RECEIVE_FROM_IL_S(10,50) > > > 'C' code is (MY_DLL.c) :- > --------------------- > #include > #include > #include > > PyObject* _wrap_RECEIVE_FROM_IL_S(PyObject *self, PyObject *args) > { > FILE* fp; > PyObject* _resultobj; > int i,j; > > if( !(PyArg_ParseTuple(args, "ii",&i,&j))) > { > return NULL; > } > fp= fopen("RECEIVE_IL_S.txt", "w"); > fprintf(fp, "i=%d j=%d" , i,j); > fclose(fp); > > /* Here I want to call python function(TestFunction). Please suggest me > some solution*/ > > _resultobj = Py_None; > return _resultobj; > } > > > static PyMethodDef MY_DLL_methods[] = { > { "RECEIVE_FROM_IL_S", _wrap_RECEIVE_FROM_IL_S, METH_VARARGS }, > { NULL , NULL} > }; > > __declspec(dllexport) void __cdecl initMY_DLL(void) > { > Py_InitModule("MY_DLL",MY_DLL_methods); > } > > > Please anybody help me solving the problem. > > > Cheers, > > Praveen. 

--

From mwh@python.net Thu Sep 5 10:33:04 2002
From: mwh@python.net (Michael Hudson)
Date: 05 Sep 2002 10:33:04 +0100
Subject: [Python-Dev] Signal-resistant code (was: Two random and nearly unrelated ideas)
In-Reply-To: Oren Tirosh's message of "Wed, 4 Sep 2002 08:46:46 -0400"
References: <15733.11253.743055.864572@12-248-11-90.client.attbi.com> <20020904094947.GA56953@hishome.net> <200209041144.g84BiXZ05244@pcp02138704pcs.reston01.va.comcast.net> <20020904124646.GA79746@hishome.net>
Message-ID: <2mbs7cok6n.fsf@starship.python.net>

Oren Tirosh writes:

> Any other problems I should be aware of?

Wildly unpredictable x-platform behaviour in the presence of threads.

M.

--
  Indeed, when I design my killer language, the identifiers "foo" and
  "bar" will be reserved words, never used, and not even mentioned in
  the reference manual. Any program using one will simply dump core
  without comment. Multitudes will rejoice.
                                       -- Tim Peters, 29 Apr 1998

From mgilfix@eecs.tufts.edu Thu Sep 5 05:45:03 2002
From: mgilfix@eecs.tufts.edu (Michael Gilfix)
Date: Thu, 5 Sep 2002 00:45:03 -0400
Subject: [apug] Re: [Python-Dev] Call for clarity ( clarification ;-) )
In-Reply-To: <1031451760.644.97.camel@HillCountryPeress>; from hu.peress@mail.mcgill.ca on Sat, Sep 07, 2002 at 09:22:40PM -0500
References: <1031437860.636.29.camel@HillCountryPeress> <1031442464.644.68.camel@HillCountryPeress> <003d01c25471$d83fe960$2fd8accf@othello> <1031451760.644.97.camel@HillCountryPeress>
Message-ID: <20020905004503.A9680@eecs.tufts.edu>

While I understand what you're trying to do here (and think it would
be quite nice), I'm not sure how you're going to accomplish it. How
will parsing python using a syntax-tree help? It's not going to tell
you what the function does in all cases or the various types it could
handle. Perhaps you could make educated guesses by looking at the
types of operations on the objects (a 'has_key' is a sure indicator of
a hash), but that would be sketchy at best.

For a ready example, imagine having a module that contains useful
helper functions. How are you going to identify the type requirements
of those functions if you don't have context? How can you be sure that
you've covered all contexts (including conversions)?

Such is the nature of dynamic languages. It's very hard to do
what you'd like to do here.

-- Mike

On Sat, Sep 07 @ 21:22, Hunter Peress wrote:
> I think its easier to enforce this from the level i describe, than have
> guido saying "ok guys please be more explicit in your documentation". I
> mean, both of those documents above are somewhat explicit, but they are
> not COMPLETE.
>
> Could you provide me with some linkage on parsing python (from a
> compilation/ syntax-tree analysis POV). SO that i can get to work on
> writing a patch for the pydoc generation program.

--
Michael Gilfix
mgilfix@eecs.tufts.edu

For my gpg public key:
http://www.eecs.tufts.edu/~mgilfix/contact.html

From hu.peress@mail.mcgill.ca Sun Sep 8 08:11:32 2002
From: hu.peress@mail.mcgill.ca (Hunter Peress)
Date: 08 Sep 2002 02:11:32 -0500
Subject: [apug] Re: [Python-Dev] Call for clarity ( clarification ;-) )
In-Reply-To: <20020905004503.A9680@eecs.tufts.edu>
References: <1031437860.636.29.camel@HillCountryPeress> <1031442464.644.68.camel@HillCountryPeress> <003d01c25471$d83fe960$2fd8accf@othello> <1031451760.644.97.camel@HillCountryPeress> <20020905004503.A9680@eecs.tufts.edu>
Message-ID: <1031469093.644.196.camel@HillCountryPeress>

Actually all of the thinking I did WAS taking into account the "dynamic"
nature of python. But it's not like the actual code is being rewritten
fast enough to make this unfeasible or unnecessary.

I'm glad to get all of this feedback as it's helping me formulate, and
further specify my plans (or eventually healthily debunk them (as the
past 3 responders have helped do)).

Instead of just thinking: "arguments are not explicitly anything,
therefore it makes no sense to even attempt to document them
explicitly", I think this: simply add the capability for multiple
definitions per each argument. E.g. going back to my original sample,
here is an updated version:

    def something(a,b,c="lalal"):
        """This will find its way into the pydocs because its a comment"""
        ##Here is the new stuff I'm proposing
        ##note, a clearer syntax can surely be devised.
        """file,socket"""    #documents the type(s) of the first arg
        """string,list"""    # "" second
        """list,hash"""      # "" third
        """string,hash"""    #documents the return type(s).

That's quite a simple solution, and still provides worlds better
exactness and clarity than the current system allows.

Onto more of your concerns:

On Wed, 2002-09-04 at 23:45, Michael Gilfix wrote:
> While I understand what you're trying to do here (and think it would
> be quite nice), I'm not sure how you're going to accomplish it.
> How will parsing python using a syntax-tree help? It's not going to tell
> you what the function does in all cases or the various types it could
> handle. Perhaps you could make educated guesses by looking at the
> types of operations on the objects (a 'has_key' is a sure indicator of
> a hash), but that would be sketchy at best.

Actually I wasn't suggesting this AT ALL wrt intelligent guesses, and
for now this proposal leans away from it. Rather there are only 2 simple
things that I wanted to obtain from the parse-tree: the number of
arguments, and if possible to see if there

Assume for now that my whole proposal will simply be another option
(instead of the default) to the pydoc-generator program. If invoked, it
will fail (if the super strict option is specified) if you don't supply
definitions for number of args for a given method.

This brings up your "dynamic" language issue again. When u have lots of
args being used as different things, my program then introduces another
level of complexity to deciphering the docs in a meaningful way. Eg: a
sample output of this program based on my example:

------------output-----------------
method: something(a (file,socket),b (string,list),c="lalal" (list,hash))
return type :string,hash

This will find its way into the pydocs because its a comment
-----------------------------------

Now in html format it would be even nicer as there will be links to the
types listed. And now looking at it, I think it's much clearer than
nothing at all.

Of course there is going to be that type of code where u have no need of
documenting every method because their names are self-explanatory, and
such explicit documentation isn't necessary; that's not what this is
really intended for.

If the specific argument arises that "since python is a dynamic language
your approach doesn't make sense" say, then I have to respond: an
attempt at specifying things is FAR better than nothing, and moreover,
this is only my first attempt.
Allowing it to become a part of the generator as an option will open it up to user input, and hence improvement, AND! *** it might just turn out that a "dynamic" approach will be necessary to document a "dynamic" language. *** So im still looking for more design tips, and a place where I could find out how to get into the meat of the python parser, but i think the "http://python.org/doc/2.2/lib/module-parser.html" is probably what I'll be using. > > For a ready example, imagine having a module that contains useful > helper functions. How are you going to identify the type requirements > of those functions if you don't have context? How can you be sure that > you've convered all contexts (including conversions). > > Such is the nature of dynamic languages. It's very hard to do > what you'd like to do here. > > -- Mike > > On Sat, Sep 07 @ 21:22, Hunter Peress wrote: > > I think its easier to enforce this from the level i describe, than have > > guido saying "ok guys please be more explicit in your documentation". I > > mean, both of those documents above are somewhat explicit, but they are > > not COMPLETE. > > > > Could you provide me with some linkage on parsing python (from a > > compilation/ syntax-tree analysis POV). SO that i can get to work on > > writing a patch for the pydoc generation program. > > -- > Michael Gilfix > mgilfix@eecs.tufts.edu > > For my gpg public key: > http://www.eecs.tufts.edu/~mgilfix/contact.html" > From mal@egenix.com Thu Sep 5 10:14:06 2002 From: mal@egenix.com (M.-A. 
Lemburg)
Date: Thu, 05 Sep 2002 11:14:06 +0200
Subject: [Python-Dev] utf8 issue
References: <200208232105.g7NL5RE16863@pcp02138704pcs.reston01.va.comcast.net> <2mznv9c1k4.fsf@starship.python.net> <200208261405.g7QE5Of05199@pcp02138704pcs.reston01.va.comcast.net>
Message-ID: <3D77205E.8080103@lemburg.com>

Guido van Rossum wrote:
>>Guido van Rossum writes:
>>
>>
>>>This might belong on SF, except it's already been solved in Python
>>>2.3, and I need guidance about what to do for Python 2.2.2.
>>>
>>>In 2.2.1, a lone surrogate encoded into utf8 gives a utf8 string that
>>>cannot be decoded back. In 2.3, this is fixed. Should this be fixed
>>>in 2.2.2 as well?
>>
>>I think this was discussed really quite a long time ago, like six
>>months or so.
>>
>>
>>>I'm asking because it caused problems with reading .pyc files: if
>>>there's a Unicode literal containing a lone surrogate, reading the
>>>.pyc file causes an exception:
>>>
>>>UnicodeError: UTF-8 decoding error: unexpected code byte
>>>
>>>It looks like revision 2.128 fixed this for 2.3, but that patch
>>>doesn't cleanly apply to the 2.2 maintenance branch. Can someone
>>>help?
>>
>>I think the reason this didn't get fixed in 2.2.1 is that it
>>necessitates bumping MAGIC.
>>
>>I can probably dig up more references if you want.
>
>
> Please do. Bumping MAGIC is a no-no between dot releases. But I
> don't understand why that is necessary?

It would be necessary since marshal uses UTF-8 for storing
Unicode literals. Even though it's highly unlikely that the
problem cases are used in Python Unicode literals, there's
a tiny chance. Without the MAGIC change this could result in
PYC files failing to load.

--
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
_______________________________________________________________________
eGenix.com -- Makers of the Python mx Extensions: mxDateTime,mxODBC,...
Python Consulting:                               http://www.egenix.com/
Python Software:                    http://www.egenix.com/files/python/

From guido@python.org Thu Sep 5 14:51:49 2002
From: guido@python.org (Guido van Rossum)
Date: Thu, 05 Sep 2002 09:51:49 -0400
Subject: [Python-Dev] utf8 issue
In-Reply-To: Your message of "Thu, 05 Sep 2002 11:14:06 +0200." <3D77205E.8080103@lemburg.com>
References: <200208232105.g7NL5RE16863@pcp02138704pcs.reston01.va.comcast.net> <2mznv9c1k4.fsf@starship.python.net> <200208261405.g7QE5Of05199@pcp02138704pcs.reston01.va.comcast.net> <3D77205E.8080103@lemburg.com>
Message-ID: <200209051351.g85Dpnk12649@odiug.zope.com>

> > Please do. Bumping MAGIC is a no-no between dot releases. But I
> > don't understand why that is necessary?
>
> It would be necessary since marshal uses UTF-8 for storing
> Unicode literals.

Do you mean that in 2.2 it doesn't?

> Even though it's highly unlikely that the problem cases are used in
> Python Unicode literals, there's a tiny chance. Without the MAGIC
> change this could result in PYC files failing to load.

Ha. You may have missed the start of this thread, but the whole
problem was that a PYC file *did* fail to load! (The .py file had a
lone surrogate in it.) So I'm not sure this argument holds much
water.

Can someone please explain what change would be necessary to what part
of the code to prevent a lone surrogate in a string literal from
creating a PYC file that blows up?
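
[Editor's sketch, not part of the thread: the failure mode described above can be illustrated in a present-day Python 3 interpreter, which behaves differently from the 2.2 and 2.3 code under discussion. It assumes the ``surrogatepass`` error handler of the standard UTF-8 codec and the standard ``marshal`` module; the variable names are illustrative only.]

```python
import marshal

lone = "\ud800"  # a lone high surrogate, not a valid character on its own

# The strict UTF-8 codec refuses to encode a lone surrogate at all.
try:
    lone.encode("utf-8")
except UnicodeEncodeError as exc:
    strict_encode_error = exc
else:
    strict_encode_error = None

# The "surrogatepass" handler emits the three-byte form and can read it back.
raw = lone.encode("utf-8", "surrogatepass")
round_tripped = raw.decode("utf-8", "surrogatepass")

# Those same bytes are rejected by a strict decoder, which is the shape of
# the "encodes, but cannot be decoded back" problem in the thread.
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    strict_decode_error = exc
else:
    strict_decode_error = None

# marshal (the format behind .pyc files) round-trips the string today.
via_marshal = marshal.loads(marshal.dumps(lone))
```

The point of the sketch is only that the encoder and decoder must agree on how surrogates are treated; whether a given release needs a MAGIC bump to get there is the question the thread leaves open.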
--Guido van Rossum (home page: http://www.python.org/~guido/) From oren-py-d@hishome.net Thu Sep 5 05:54:14 2002 From: oren-py-d@hishome.net (Oren Tirosh) Date: Thu, 5 Sep 2002 00:54:14 -0400 Subject: [Python-Dev] Re: Signal-resistant code (was: Two random and nearly unrelated ideas) In-Reply-To: <200209042005.g84K5Ms08177@pcp02138704pcs.reston01.va.comcast.net> References: <15733.11253.743055.864572@12-248-11-90.client.attbi.com> <20020904094947.GA56953@hishome.net> <200209041144.g84BiXZ05244@pcp02138704pcs.reston01.va.comcast.net> <20020904124646.GA79746@hishome.net> <200209041325.g84DP1o06695@pcp02138704pcs.reston01.va.comcast.net> <20020904160143.GA1483@hishome.net> <200209042005.g84K5Ms08177@pcp02138704pcs.reston01.va.comcast.net> Message-ID: <20020905045414.GA26104@hishome.net> On Wed, Sep 04, 2002 at 04:05:22PM -0400, Guido van Rossum wrote: > > If I use a module that spawns an external process and uses SIGCHLD to be > > informed of its termination why should my innocent code that just reads > > lines from a file suddenly break? In C I can at least restart the > > operation after an EINTR but file.readline cannot even be properly > > restarted because the buffering and file position is all messed up. > > I have never understood why a child dying should send a signal. You > can poll for the child with waitpid() instead. You're assuming too much about the structure of the program using child processes. The code that starts the child process may not be in control of the Python program counter by the time it ends. It's useful to be able to leave a signal handler to clean up the zombie process by waitpid(). > But if you have a suggestion for how to fix this particular issue, I'd > be happy to look it over, since this *is* something some people do. Of course people do it - it's documented and it works. Signal handling may have had some historical problems on some Unixes but I've never had any problem with it under Linux. 

My previous messages more or less outline my suggestion. I'll write a
better summary.

> > Getting a notification of a child process terminating or other
> > asynchronous events can only be done using signals and is currently
> > dangerous because it will break code using I/O.
>
> See above. I see half your point; people wanting this tend to use
> signals and it causes breakage.

Polling is not what I'd call "getting notification of asynchronous
events". If it causes breakage it could be because people either use it
incorrectly or the signal support on the underlying system is broken.
In Linux it isn't broken. If it's broken on other Python platforms I
don't see why it shouldn't be well-supported on the platforms that
aren't.

Has anyone here actually tried to use signal.signal ?

> > > > interference to Python code by signals. Any other problems I should
> > > > be aware of?
> > >
> > > There's no way to sufficiently test a program that uses signals. The
> > > signal handler cannot touch *any* data, which makes it pretty useless.
> >
> > In order to be useful a signal handler needs to be able to set one bit.
> > The next time the ticker expires this bit will be checked.
>
> OK.
>
> > If an I/O operation was interrupted the Python signal handler can be
> > executed immediately from the wrapper. When it returns the wrapper
> > will resume the interrupted operation.
>
> Is calling the Python signal handler from the wrapper always safe?
> What if the Python signal handler e.g. closes the file or reads from
> it?

Code in signal handlers is executed at some arbitrary point in the
program and the programmer should be aware of this and only do simple
things like setting a flag or appending to a list.
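
[Editor's sketch, not part of the original message: the "handler only sets a flag" discipline described above, written for a present-day Python 3 on a Unix-like system. ``SIGUSR1`` stands in for whatever signal is of interest, and the names ``got_signal``, ``on_signal`` and ``wait_for_flag`` are hypothetical. It must run in the main thread, because ``signal.signal`` refuses to install handlers anywhere else.]

```python
import os
import signal
import time

got_signal = False  # the single "bit" the handler is allowed to touch


def on_signal(signum, frame):
    """Do the bare minimum here: record that the signal arrived."""
    global got_signal
    got_signal = True


def wait_for_flag(timeout=2.0):
    """Poll the flag from ordinary code; all real work belongs here."""
    deadline = time.monotonic() + timeout
    while not got_signal and time.monotonic() < deadline:
        time.sleep(0.01)
    return got_signal


previous = signal.signal(signal.SIGUSR1, on_signal)
try:
    os.kill(os.getpid(), signal.SIGUSR1)  # deliver the signal to ourselves
    seen = wait_for_flag()
finally:
    signal.signal(signal.SIGUSR1, previous)  # restore the old handler
```

Nothing in the language enforces this restraint; it is a convention that the handler author has to follow.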

    Oren

From bkc@murkworks.com Thu Sep 5 15:13:50 2002
From: bkc@murkworks.com (Brad Clements)
Date: Thu, 05 Sep 2002 10:13:50 -0400
Subject: [Python-Dev] Getting started with GBayes testing
In-Reply-To: <200209050024.g850OTd08824@pcp02138704pcs.reston01.va.comcast.net>
References: Your message of "Wed, 04 Sep 2002 18:39:01 EDT." <3D7653AD.14352.14F391B6@localhost>
Message-ID: <3D772EC2.30217.184B6C78@localhost>

On 4 Sep 2002 at 20:24, Guido van Rossum wrote:

> Pretty soon, a SF project will be created (Barry has already gotten
> the request in). We'll gladly add you to the list of developers.

I look forward to it.

> > I'm particularly interested in how to allow html only messages
> > (reduce false positives). I'm getting a lot of personal mail in
> > that format, unfortunately.
>
> You train it with an equal number of spam and non-spam ("ham") that
> you received. Just make sure the ham training messages contain enough
> representatives of the html-only mail.

This is one way to do it, but I was planning on experimenting with
tokenizer methods that strip out HTML tags, leaving only the text.

My feeling is that the presentation of "the message" is independent of
the message itself, so if I get a message in Text, HTML, RTF only the
actual content is important, not the markup method.

Though I suppose using lots of red and large fonts might be an indicator
of spam, the text of the message should still suffice.

Tim's comments in timtest.py hint that stripping tags isn't a
catastrophe for f-n's, but he's not planning on doing that for use on
technical lists. I would like to pursue general client-side filtering of
spam, so I do need to contend with that.

btw, Tim's comment:

> # So if a message is multipart/alternative with both text/plain and text/html
> # branches, we ignore the latter, else newbies would never get a message
> # through. If a message is just HTML, it has virtually no chance of getting
> # through

Tells me (spammer hat on) that I can send a message with a non-spammish
text only part, and a spam html part since most "non-techie" email
client users automatically display the html version when available,
however Tim's implementation will ignore it.

Most "average users" never even see the text-only part of multipart
messages. In Tim's application, that's okay since he's going to use the
text-only part anyway. But for my purposes, I need to consider both
portions. So it's simpler for me to strip html and combine that text
with the text-only part and then "test" the combined parts.

Well these are just musings, I'll be looking for the SF project.

-Brad

Brad Clements,                bkc@murkworks.com   (315)268-1000
http://www.murkworks.com                          (315)268-9812 Fax
AOL-IM: BKClements

From Anthony Baxter Thu Sep 5 15:28:25 2002
From: Anthony Baxter (Anthony Baxter)
Date: Fri, 06 Sep 2002 00:28:25 +1000
Subject: [Python-Dev] Getting started with GBayes testing
In-Reply-To: <3D772EC2.30217.184B6C78@localhost>
Message-ID: <200209051428.g85ESPR24749@localhost.localdomain>

>>> "Brad Clements" wrote
> This is one way to do it, but I was planning on experimenting with tokenizer methods
> that strip out HTML tags, leaving only the text.

The set I'm working with, I found I needed to strip out everything but
for src="" and href="" attributes of tags. Too much goodness in them for
the system to get its teeth into.

> Tells me (spammer hat on) that I can send message with a non-spammish text
> only part, and a spam html part since most "non-techie" email client users
> automatically display the html version when available, however Tim's
> implementation will ignore it.

I've actually got a bunch of spam like that. The text/plain is something
like

    **This is a HTML message**

and nothing else.

Anthony
--
Anthony Baxter
It's never too late to have a happy childhood.
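
[Editor's sketch, not part of the thread: one possible shape for the tokenizing idea discussed in the two messages above, i.e. drop the HTML markup but keep the text plus the values of ``src`` and ``href`` attributes. It uses the standard ``html.parser`` module of present-day Python 3; the class ``TextAndLinks`` and the function ``tokenize_html`` are hypothetical names and are not taken from timtest.py or GBayes.py.]

```python
from html.parser import HTMLParser


class TextAndLinks(HTMLParser):
    """Collect whitespace-separated words and src/href attribute values."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # Keep only the attributes reported as useful clues in the thread.
        for name, value in attrs:
            if name in ("src", "href") and value:
                self.tokens.append(f"{name}={value}")

    def handle_data(self, data):
        # Everything between tags is treated as plain text.
        self.tokens.extend(data.split())


def tokenize_html(markup):
    parser = TextAndLinks()
    parser.feed(markup)
    parser.close()
    return parser.tokens


sample = '<p><font color="red">Buy <a href="http://example.com/x">now</a></font></p>'
tokens = tokenize_html(sample)
```

Whether discarding presentation clues such as colour and font size helps or hurts the error rates is exactly what the test grid would have to show.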
From guido@python.org Thu Sep 5 15:33:34 2002 From: guido@python.org (Guido van Rossum) Date: Thu, 05 Sep 2002 10:33:34 -0400 Subject: [Python-Dev] Proposed Mixins for Wide Interfaces In-Reply-To: Your message of "Wed, 04 Sep 2002 17:40:34 EDT." <001801c2545b$b43aba60$e8ea7ad1@othello> References: <001101c2510d$9fce0920$5f66accf@othello> <200209031750.g83HoVq05812@odiug.zope.com> <001801c2545b$b43aba60$e8ea7ad1@othello> Message-ID: <200209051433.g85EXY612883@odiug.zope.com> > [RH] > > > How about adding some mixins to simplify the > > > implementation of some of the fatter interfaces? On second thought, I don't think there's enough here to warrant putting this in the standard library. E.g. the example from BaseSet actually strikes me as indirect: because <= is the natural operation to provide for sets, hanging everything off __lt__ looks forced. Maybe this could go into the Demo directory or in some example or HOWTO. We'll revise this issue when we are going to introduce a standard type or interface hierarchy (not for Python 2.3). --Guido van Rossum (home page: http://www.python.org/~guido/) From python@rcn.com Thu Sep 5 15:43:50 2002 From: python@rcn.com (Raymond Hettinger) Date: Thu, 5 Sep 2002 10:43:50 -0400 Subject: [Python-Dev] Proposed Mixins for Wide Interfaces References: <001101c2510d$9fce0920$5f66accf@othello> <200209031750.g83HoVq05812@odiug.zope.com> <001801c2545b$b43aba60$e8ea7ad1@othello> <200209051433.g85EXY612883@odiug.zope.com> Message-ID: <003901c254ea$a752d820$f6eb7ad1@othello> > > [RH] > > > > How about adding some mixins to simplify the > > > > implementation of some of the fatter interfaces? [GvR] > On second thought, I don't think there's enough here to warrant > putting this in the standard library. E.g. the example from BaseSet > actually strikes me as indirect: because <= is the natural operation > to provide for sets, hanging everything off __lt__ looks forced. Agreed. How about the MappingMixin and SequenceMixin? 
These both provide much more meat and have more natural attach points (getitem, setitem, delitem). From guido@python.org Thu Sep 5 15:53:08 2002 From: guido@python.org (Guido van Rossum) Date: Thu, 05 Sep 2002 10:53:08 -0400 Subject: [Python-Dev] Proposed Mixins for Wide Interfaces In-Reply-To: Your message of "Thu, 05 Sep 2002 10:43:50 EDT." <003901c254ea$a752d820$f6eb7ad1@othello> References: <001101c2510d$9fce0920$5f66accf@othello> <200209031750.g83HoVq05812@odiug.zope.com> <001801c2545b$b43aba60$e8ea7ad1@othello> <200209051433.g85EXY612883@odiug.zope.com> <003901c254ea$a752d820$f6eb7ad1@othello> Message-ID: <200209051453.g85Er8j12983@odiug.zope.com> > How about the MappingMixin and SequenceMixin? These both > provide much more meat and have more natural attach points > (getitem, setitem, delitem). I'd much rather have a howto that explains all the issues. This stuff is vastly underdocumented. --Guido van Rossum (home page: http://www.python.org/~guido/) From guido@python.org Thu Sep 5 16:01:14 2002 From: guido@python.org (Guido van Rossum) Date: Thu, 05 Sep 2002 11:01:14 -0400 Subject: [Python-Dev] Re: Signal-resistant code (was: Two random and nearly unrelated ideas) In-Reply-To: Your message of "Thu, 05 Sep 2002 00:54:14 EDT." <20020905045414.GA26104@hishome.net> References: <15733.11253.743055.864572@12-248-11-90.client.attbi.com> <20020904094947.GA56953@hishome.net> <200209041144.g84BiXZ05244@pcp02138704pcs.reston01.va.comcast.net> <20020904124646.GA79746@hishome.net> <200209041325.g84DP1o06695@pcp02138704pcs.reston01.va.comcast.net> <20020904160143.GA1483@hishome.net> <200209042005.g84K5Ms08177@pcp02138704pcs.reston01.va.comcast.net> <20020905045414.GA26104@hishome.net> Message-ID: <200209051501.g85F1EY13017@odiug.zope.com> > > I have never understood why a child dying should send a signal. > > You can poll for the child with waitpid() instead. > > You're assuming too much about the structure of the program using > child processes. 
The code that starts the child process may not be > in control of the Python program counter by the time it ends. It's > useful to be able to leave a signal handler to clean up the zombie > process by waitpid(). I admit that I hate signals so badly that whenever I needed to wait for a child to finish I would always structure the program around this need (even when coding in C). > > But if you have a suggestion for how to fix this particular issue, I'd > > be happy to look it over, since this *is* something some people do. > > Of course people do it - it's documented and it works. Barely. This thread started when you pointed out the problems with using signals. I've always been reluctant about the fact that we had a signal module at all -- it's not portable (no non-Unix system supports it well), doesn't interact well with threads, etc., etc.; however, C programmers have demanded some sort of signal support and I caved in long ago when someone contributed a reasonable approach. I don't regret it like lambda, but I think it should only be used by people who really know about the caveats. > > See above. I see half your point; people wanting this tend to use > > signals and it causes breakage. > > Polling is not what I'd call "getting notification of asynchronous events". > If it causes breakage it could be because people either use it incorrectly > or the signal support on the underlying system is broken. In Linux it isn't > broken. If it's broken on other Python platforms I don't see why it > shouldn't be well-supported on the platforms that aren't. I meant in Python. The I/O problems make signals hard to use. > Has anyone here actually tried to use signal.signal ? Yes. > Code in signal handlers is executed at some arbitrary point in the > program and the programmer should be aware of this and only do so > simple things like setting a flag or appending to a list. Unfortunately the mechanism doesn't enforce this. 
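A minimal sketch of the "only set a flag or append to a list" discipline quoted above. The names `pending`, `note_signal`, `drain_pending` and `install` are hypothetical and not part of the signal module; the handler records the signal number and nothing else, and the program looks at the backlog later at a point of its own choosing.

```python
import signal

# Hypothetical names, for illustration only.
pending = []  # signal numbers that arrived but have not been processed yet


def note_signal(signum, frame):
    """Signal handler: record the signal number and return immediately."""
    pending.append(signum)


def drain_pending():
    """Return the signals recorded so far and remove them from the backlog.

    Only the entries that were copied are deleted, so a signal arriving
    between the copy and the delete stays queued for the next call.
    """
    seen = pending[:]
    del pending[:len(seen)]
    return seen


def install(signum=signal.SIGINT):
    """Route one signal to the flag-only handler (main thread only)."""
    signal.signal(signum, note_signal)
```

A main loop would call install() once and then drain_pending() wherever it is safe to react. Nothing in the mechanism prevents a handler from doing more than this, which is the complaint being made here.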
I wish we could invent a Python signal API that only lets you do one of these simple things. --Guido van Rossum (home page: http://www.python.org/~guido/) From tim.one@comcast.net Thu Sep 5 02:34:21 2002 From: tim.one@comcast.net (Tim Peters) Date: Wed, 04 Sep 2002 21:34:21 -0400 Subject: [Python-Dev] Getting started with GBayes testing In-Reply-To: <3D7653AD.14352.14F391B6@localhost> Message-ID: Guido addressed most points, so I'll just cover a few: [Brad Clements] > ... > I'd like to replicate Tim's test rig so I can compare my results > with existing ones. My spam isn't in mbox format, but I can convert it. Mine isn't either . Barry gave me mboxes, but the spam corpus I got off the web had one spam per file, and it only took two days of extreme pain to realize that one msg per file is enormously easier to work with when testing: you want to split these at random into random collections, you may need to replace some at random when testing reveals spam mistakenly called ham (and vice versa), etc -- even pasting examples into email is much easier when it's one msg per file (and the test driver makes it easy to print a msg's file path). My test driver and tokenizer are checked in (timtest.py), and also a little utility or two. The directory structure under my spambayes directory looks like so: Data/ Spam/ Set1/ (contains 2750 spam .txt files) Set2/ "" Set3/ "" Set4/ "" Set5/ "" Ham/ Set1/ (contains 4000 ham .txt files) Set2/ "" Set3/ "" Set4/ "" Set5/ "" reservoir/ (contains "backup ham") If you use the same names and structure, huge mounds of the tedious testing code will work as-is. The more Set directories the merrier, although you'll hit a point of diminishing returns if you exceed 10. The "reservoir" directory contains a few thousand other random hams. When a ham is found that's really spam, I delete it, and then the rebal.py utility moves in a message at random from the reservoir to replace it. 
If I had it to do over again, I think I'd move such spam into a Spam set (chosen at random), instead of deleting it. > I'm particularly interested in how to allow html only messages > (reduce false positives). I'm getting a lot of personal mail in that > format, unfortunately. It will learn about that -- not a problem. It's a problem in *my* tests because HTML mail is so strongly hated on tech lists, but newbies use it there anyway, and it would be horrid to block newbies just because they're normal people who enjoy creating visually attractive messages <0.9 wink>. Read the "What about HTML?" section in timtest.py. You may also wish to remove the guard from if part.get_content_type() == "text/plain": text = html_re.sub(' ', text) in tokenize(). Once you have a good test setup, you can try it both ways, and the data will tell you which way works best for your normal mix. Details of runs both ways on my c.l.py corpora are given in the "What about HTML?" section mentioned before, and even there stripping HTML decorations out of HTML-only messages had an insignificant effect on the f-p rate. It increased the f-n rate, though, and precisely because HTML messages are so very rare on c.l.py that they're *almost* certainly spam. From python@rcn.com Thu Sep 5 16:43:20 2002 From: python@rcn.com (Raymond Hettinger) Date: Thu, 5 Sep 2002 11:43:20 -0400 Subject: [Python-Dev] GBayes design Message-ID: <002b01c254f2$f6c7c020$71b53bd0@othello> Is it too late to challenge a core design decision? Instead of multiplying probabilities, use fuzzy logic methods. Classify the indicators into damning, strong, weak, neutral, ... After counting the number of indicators in each class, make a spam/ham decision that can be easily tweaked. This would make it easy to implement variations of Tim's recent clear win, where additional indicators are gathered until the balance shifts sharply to one side. Some other advantages are: -- easily interpreted score vectors (6 damning, 7 strong, 4 weak, ...
) -- avoids mathematical issues with indicators not being independent -- allows the addition of non-token based indicators. for instance, a preponderance of caps would be a weak indicator. the presence of caps separated by spaces would be a strong indicator. -- the decision logic would be more intuitive -- avoids the issue of having equal amounts of spam and ham in the sample The core concept would stay the same -- it's really just a shift from continuous to discrete. of-course-this-is-entirely-outside-my-fields-of-knowledge-ly yours, Raymond Hettinger From mgilfix@eecs.tufts.edu Thu Sep 5 18:23:05 2002 From: mgilfix@eecs.tufts.edu (Michael Gilfix) Date: Thu, 5 Sep 2002 13:23:05 -0400 Subject: [apug] Re: [Python-Dev] Call for clarity ( clarification ;-) ) In-Reply-To: <1031469093.644.196.camel@HillCountryPeress>; from hu.peress@mail.mcgill.ca on Sun, Sep 08, 2002 at 02:11:32AM -0500 References: <1031437860.636.29.camel@HillCountryPeress> <1031442464.644.68.camel@HillCountryPeress> <003d01c25471$d83fe960$2fd8accf@othello> <1031451760.644.97.camel@HillCountryPeress> <20020905004503.A9680@eecs.tufts.edu> <1031469093.644.196.camel@HillCountryPeress> Message-ID: <20020905132305.A19681@eecs.tufts.edu> Ok. I think I understand better what you're trying to accomplish. I got the impression earlier (and I think others did as well) that you were hoping to have pydoc automatically label types on the function call. A new convention might very well be welcomed. You might want to post a couple of examples and the corresponding documentation for feedback here before you start the hard work on the patch :) More below... On Sun, Sep 08 @ 02:11, Hunter Peress wrote: > Actually all of the thinking i did WAS taking into account the "dynamic" > nature of python. > > But its not like the actual code is being rewritten fast enough to make > this unfeasible or unneccesary. 
> > Im glad to get all of this feedback as its helping me formulate, and > further specify my plans (or eventually healthily debunk them (as the > past 3 responders have helped do)). > > Instead of just thinking: > > "arguments are not explicitely anything, therefore it makes no sense to > even attempt to document them explicitely". > > I think this: simply add the capability for multiple definitions per > each argument. eg going back to my original sample here is an updated > version: > > def something(a,b,c="lalal"): > """This will find its way into the pydocs because its a comment""" > ##Here is the new stuff Im proposing > ##note, a clearer sytnax can surely be devised. > """file,socket""" #documents the type(s) of the first arg > """string,list""" # "" second > """list,hash""" # "" third > """string,hash""" #documents the return type(s). > > Thats quite a simple solution, and still provides worlds better > exactness and clarity than the current system allows. > > Onto more of your concerns: > On Wed, 2002-09-04 at 23:45, Michael Gilfix wrote: > > While I understand what you're trying to do here (and think it would > > be quite nice), I'm not sure how you're going to accomplish it. How > > will parsing python using a syntax-tree help? It's not going to tell > > you what the function does in all cases or the various types it could > > handle.Perhaps you could make educated guesses by looking at the > > types of operations on the objects (a 'has_key' is a sure indicator of > > a hash), but that would be sketchy at best. > Actually I wasnt suggesting this AT ALL wrt intelligent guesses, and for > now this proposal leans away from it. > > Rather there are only 2 simple things that I wanted to obtain from the > parse-tree: the number of arguments, and if possible to see if there Agreed now that things are clearer. > Assume for now that my whole proposal will simply be another option > (instead of the default) to the pydoc-generator program. 
If invoked, it > will fail (if the super strict option is specified) if you don't supply > definitions for number of args for a given method. > > This brings up your "dynamic" language issue again. > > When u have lots of args being used as different things, my program then > introduces another level of complexity to deciphering the docs in a > meaningful way. > > Eg: a sample output of this program based on my example: > > ------------output----------------- > method: something(a (file,socket),b (string,list),c="lalal" > (list,hash)) > return type :string,hash > > This will find its way into the pydocs because its a comment > ----------------------------------- > > Now in html format it would be even nicer as there will be links to > the types listed. I agree. I know that I'd welcome an extra added option to enable some extra pydoc functionality. Developing a schema is tricky though and you should probably engage in some more debate first :) > And now looking at it, I think its much clearer than nothing at al. > > Of course there is going to be that type of code where u have no need of > documenting every method because their names are self explantory, and > such explicit documentation isnt necessary, thats not what this is > really intended for. > > If the specific argument arises that "since python is a dynamic language > your approach doesnt make sense" say, then I have to respond: Of course it applies. Because of my misunderstanding, I was under the impression that you wanted to generate the equivalent of function calls, not develop a scheme like javadoc. The dynamic nature of Python means that such specifications become even more important as project sizes increase. > an attempt at specifiying things is FAR better than nothing, and > moreover, this is only my first attempt. Allowing it to become a part of > the generator as an option will open it up to user input, and hence > improvement, AND! 
> > *** > it might just turn out that a "dynamic" approach will be necessary to > document a "dynamic" language. > *** > > So im still looking for more design tips, and a place where I could find > out how to get into the meat of the python parser, but i think the > "http://python.org/doc/2.2/lib/module-parser.html" is probably what I'll > be using. Shouldn't there be code in the existing pydoc to do much of what you want for you? It seems like it might be nice to re-engineer pydoc to take some handlers that allow you to do further customization after it's done it's thing. That way, we can add extensions into the existing code and all that integration stuff might be a little easier. Good luck n' keep us posted :) -- Mike -- Michael Gilfix mgilfix@eecs.tufts.edu For my gpg public key: http://www.eecs.tufts.edu/~mgilfix/contact.html" From spambayes@python.org Thu Sep 5 18:57:17 2002 From: spambayes@python.org (Tim Peters) Date: Thu, 05 Sep 2002 13:57:17 -0400 Subject: [Python-Dev] Getting started with GBayes testing In-Reply-To: <3D772EC2.30217.184B6C78@localhost> Message-ID: [Followups directed to spambayes@python.org http://mail.python.org/mailman-21/listinfo/spambayes ] [Brad Clements] > ... > My feeling is that the presentation of "the message" is independent of the > message itself, so if I get a message in Text, HTML, RTF only the actual > content is important, not the markup method. Everything's A Clue. Everything that gets ignored partly blinds the classifier, so the question isn't whether there's a difference, it's how much of a difference it makes. > Though I suppose using lots of red and large fonts might be an > indicator of spam, the text of the message should still suffice. Indeed, Graham reported that the hex color code for bright red was one of the strongest spam indicators in his database. > Tim's comments in timtest.py hint that stripping tags isn't a > catastrophe for f-n's, but he's not planning on doing that for use on > technical lists. 
When HTML-only email is a 99.99% spam indicator on a tech list, it would be crazy to ignore that clue. But note that the comments *also* say I'd be delighted to remove HTML tags even there if some other way of slashing the f-n rate is proven to work (and most people who have tried it say that mining more header lines does do it -- but then I haven't seen anything from them about how they do when they ignore the header lines. I was happy to ignore header lines in order to get *some* kind of handle on how well could be done on "pure content", and turned out that works remarkably well). >> # So if a message is multipart/alternative with both text/plain >> # and text/html branches, we ignore the latter, else newbies would never >> # get a message through. If a message is just HTML, it has virtually no >> # chance of getting through > Tells me (spammer hat on) that I can send message with a > non-spammish text only part, and a spam html part since most > "non-techie" email client users automatically display the html > version when available, however Tim's implementation will ignore it. Sure. It *certainly* isn't a problem on my test data (as witnessed by the measured error rates). If the nature of the world changes, the code has to adapt along with it. But 90% of the spam I receive (and I get a lot) is still trivial to recognize from a mere glance at the subject line, and I don't buy that spammers are a class of ubergeek with formidable skill. Response rates are a percentage game, and more so than anti-spammers I expect spammers are keen to go for high-percentage wins at the expense of esoterica. > Most "average users" never even see the text-only part of > multipart messages. In Tim's application, that's okay since he's going > to use the text-only part anyway. But for my purposes, I need to consider > both portions. So it's simpler for me to strip html and combine that text > with the text-only part and then "test" the combined parts. 
Not unreasonable, but testing remains the only way to decide. It's rare you can out-think a fraction of a percent! From oren-py-d@hishome.net Thu Sep 5 10:30:02 2002 From: oren-py-d@hishome.net (Oren Tirosh) Date: Thu, 5 Sep 2002 05:30:02 -0400 Subject: [Python-Dev] Re: Signal-resistant code (was: Two random and nearly unrelated ideas) In-Reply-To: References: <3FE2540C-C047-11D6-89C6-000A27B19B96@oratrix.com> <200209042048.g84KmCK08365@pcp02138704pcs.reston01.va.comcast.net> Message-ID: <20020905093002.GA61136@hishome.net> On Wed, Sep 04, 2002 at 05:21:44PM -0400, François Pinard wrote: > I'm not fully familiar with all the details of this problem, it surely has > been in the air for quite a long time now (I might have first heard of it > while Taylor UUCP was being developed). It might be dependent on the > underlying system. If I'm not mistaken, this is Ian Taylor who introduced the > following Autoconf macro: > > > - Macro: AC_SYS_RESTARTABLE_SYSCALLS > If the system automatically restarts a system call that is > interrupted by a signal, define `HAVE_RESTARTABLE_SYSCALLS'. The name of this macro is misleading. It doesn't check whether system calls are restartABLE but whether they are restartED automatically by libc. It forks a subprocess that sends a signal to the parent. The parent waits for the child and checks if the wait() was interrupted. If this macro is defined you will never get EINTR so there's no need to worry about this. If it isn't defined you need to restart system calls yourself. If a platform really has interruptible I/O calls that cannot be continued or restarted without data loss there is no way to use signal handlers on that system. I doubt that such totally broken platforms are common these days. > In GNU file utilities (now merged within the new GNU coreutils), Jim Meyering > uses restart wrappers for many I/O functions, so the idea of wrappers has been > maturing for a while, and is used in basic, heavily used programs.
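The restart-wrapper idea in the quoted text can be sketched in Python roughly as follows. `call_restarting` is a hypothetical helper, not an existing library function, and the sketch assumes an interrupted call surfaces as an OSError whose errno is EINTR. Recent Python versions already retry many system calls internally, so this is mainly illustrative.

```python
import errno


def call_restarting(func, *args, **kwargs):
    """Call func(*args, **kwargs), retrying while it fails with EINTR.

    Any other OSError is propagated unchanged.
    """
    while True:
        try:
            return func(*args, **kwargs)
        except OSError as exc:
            if exc.errno != errno.EINTR:
                raise
            # Interrupted by a signal before completing: try again.
```

Usage would look like call_restarting(os.read, fd, 4096). The wrapper is only safe for calls that can be repeated without losing data, which is exactly the condition the message above is asking about.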
I'll check the sources. Oren From spambayes@python.org Thu Sep 5 18:39:36 2002 From: spambayes@python.org (Tim Peters) Date: Thu, 05 Sep 2002 13:39:36 -0400 Subject: [Python-Dev] Getting started with GBayes testing In-Reply-To: <200209051428.g85ESPR24749@localhost.localdomain> Message-ID: [Followups directed to spambayes@python.org http://mail.python.org/mailman-21/listinfo/spambayes ] [Anthony Baxter] > ... > I've actually got a bunch of spam like that. The text/plain is something > like > > **This is a HTML message** > > and nothing else. Are you sure that's in a text/plain MIME section? I've seen that many times myself, but it's always been in the prologue (*between* MIME sections -- so it's something a non-MIME aware reader will show you). From spambayes@python.org Thu Sep 5 19:30:03 2002 From: spambayes@python.org (Tim Peters) Date: Thu, 05 Sep 2002 14:30:03 -0400 Subject: [Python-Dev] GBayes design In-Reply-To: <002b01c254f2$f6c7c020$71b53bd0@othello> Message-ID: [Followups directed to spambayes@python.org http://mail.python.org/mailman-21/listinfo/spambayes ] [Raymond Hettinger] > Is it too late to challenge a core design decision? Never too late, but somebody has to do real work to prove that a change is justified. Plausible ideas are cheaper than dirt, alas. > Instead of multiplying probablities, use fuzzy logic methods. > Classify the indicators into damning, strong, weak, neautral, ... Think about how that differs from 0.99, 0.80, 0.20 and 0.50. Does it? > After counting the number of indicators in each class, make > a spam/ham decision that can be easily tweaked. This would > make it easy to implement variations of Tim's recent clear > win, where additional indicators are gathered until the > balance shifts sharply to one side. > > Some other advantages are: > -- easily interpreted score vectors (6 damning, 7 strong, 4 weak, ... ) I've seen people see the current prob("TV") = 0.99 style cold and pick it up at once. 
With character n-grams I think it's frustrating, but word-like tokenization gives easily recognized clues. > -- avoids mathematical issues with indicators not being independent How do you know this? > -- allows the addition of non-token based indicators. for instance, > a preponderance of caps would be a weak indicator. the presence > of caps separated by spaces would be a strong indicator. As far as the current classifier is concerned, "a token" is any Python object usable as a dict key. There are already several ways in which the current tokenization scheme in timtest.py uses strings to *represent* non-textual indicators. For example, if the headers lack an Organization line, a 'bool:noorg' "token" is generated. For large blobs of text that get skipped, a token is generated that records both the first character in that blob and the number of bytes skipped (chopped to the nearest multiple of 10). And so on -- you can inject anything you like into the scheme, including stuff like "number of caps separated by spaces: more than 10" (BTW, I happen to know that this particular "clue" acts to block relevant conference announcements, not just spam) I got some interesting results by injecting a crude characters/word statistic: yield "cpw:%.1g" % (float(len(text)) / len(text.split())) There are certain values of that statistic that turned out to be killer-strong spam indicators, but there's a potential problem I've mentioned before: if you have an unbounded number of free parameters you can fiddle, you can train a system to fit any given dataset exactly. That's in part why replication of results by others is necessary to make schemes like this superb (I can only make one merely excellent on my own ). > -- the decision logic would be more intuitive > -- avoids the issue of having equal amounts of spam and ham in > the sample It's not clear that this matters; some results of preliminary experiments are written up in the code comments. 
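For readers following the equal-amounts question, here is a sketch of a ratio-based per-word estimate in the style of Graham's "A Plan for Spam". It is written from the paper's description, not copied from GBayes.py; the function name, the parameter names, the minimum count of 5 and the 0.01/0.99 clamps are assumptions for illustration. `hambias` plays the role of the HAMBIAS knob discussed earlier in the thread.

```python
def word_spam_prob(ham_count, spam_count, nham, nspam, hambias=2.0):
    """Estimate P(spam | word) from per-corpus ratios.

    ham_count / spam_count: how often the word was seen in each corpus.
    nham / nspam: number of training messages in each corpus (both > 0).
    Returns None when the word has been seen too rarely to judge.
    """
    g = hambias * ham_count      # ham sightings are counted with a bias
    b = spam_count
    if g + b < 5:                # assumed minimum evidence threshold
        return None
    ham_ratio = min(1.0, g / nham)
    spam_ratio = min(1.0, b / nspam)
    # Ratios are used directly, with no prior for how common spam is
    # overall, as if spam and ham were equally likely.
    p = spam_ratio / (ham_ratio + spam_ratio)
    return max(0.01, min(0.99, p))
```

With the 4000/2750 training split mentioned in this thread, a word seen 10 times in each corpus scores about 0.42 at hambias=2.0 and about 0.59 at hambias=1.0, which shows the direction in which the knob moves the false-positive versus false-negative tradeoff.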
The way Graham computes P(Spam | Word) is via ratios, *as if* there were an equal number of each; and that's consistent with the other bogus equality assumption in the scorer. I haven't yet changed all these guys at the same time to take P(Spam) and P(Ham) into account. BTW, note that all the results I've reported had a ham/spam training ratio of 4000/2750. I left that non-unity on purpose. > The core concept would stay the same -- it's really just a shift from > continuous to discrete. Let us know how it turns out . From barry@zope.com Thu Sep 5 19:06:59 2002 From: barry@zope.com (Barry A. Warsaw) Date: Thu, 5 Sep 2002 14:06:59 -0400 Subject: [Python-Dev] New `spambayes' project on SourceForge Message-ID: <15735.40259.117828.402419@anthem.wooz.org> There's been a ton of press about applying Bayesian classifiers to spam detection lately, spurred on by Paul Graham's recent paper "A Plan for Spam" http://www.paulgraham.com/spam.html Tim Peters has done an incredible amount of work on our Python implementation of this idea. Some of the reasons why I think Tim's work is so cool is that he's brought along his deep knowledge of speech recognition's related issues, and his obsessive devotion to reducing the amount of spam I ultimately have to delete . In order to encourage more participation from the wider open source community, we've moved the code from a backwater of the Python cvs tree to its own project on SourceForge. The hope is that more people will be able to contribute to ideas, testing, and integration of the basic algorithms with other systems such as mail daemons, mailing list managers, and mail clients. The project is called "spambayes" (for lack of creativity on our part :) and is hosted here: http://sf.net/projects/spambayes If you're interested in becoming a developer on the project, let me know. Otherwise you can of course get anonymous checkouts of the code. There are also two mailing lists related to the spambayes project. 
The first is a general discussion list: http://mail.python.org/mailman-21/listinfo/spambayes and the other is a list for cvs checkin message notices: http://mail.python.org/mailman-21/listinfo/spambayes-checkins Feel free to join those lists (and help be a guinea pig for Mailman 2.1 :). Enjoy, -Barry PS to Python-devers: the code has been removed from nondist/sandbox/spambayes, so you won't be able to hack on it there. Also, please move discussion about this from python-dev@python.org to spambayes@python.org From nas@python.ca Thu Sep 5 19:52:28 2002 From: nas@python.ca (Neil Schemenauer) Date: Thu, 5 Sep 2002 11:52:28 -0700 Subject: [Python-Dev] Re: Signal-resistant code (was: Two random and nearly unrelated ideas) In-Reply-To: <20020905093002.GA61136@hishome.net> References: <3FE2540C-C047-11D6-89C6-000A27B19B96@oratrix.com> <200209042048.g84KmCK08365@pcp02138704pcs.reston01.va.comcast.net> <20020905093002.GA61136@hishome.net> Message-ID: <20020905185228.GA19726@glacier.arctrix.com> Oren Tirosh wrote: > If this macro is defined you will never get EINTR so there's no need to > worry about this. If it isn't defined you need to restart system calls > yourself. I don't think that is correct. Only certain systems calls will be restarted (for BSD 4.2 it's ioctl, read, readv, write, writev, wait, and waitpid). I think the system calls restarted varies depending on the OS. Signals are a gigantic mess. I'm starting to doubt that you realize the extent of the brain damage. While I would be pleased if there was some way Python could hide the mess, I'm not convinced it is possible. Neil From guido@python.org Thu Sep 5 19:19:02 2002 From: guido@python.org (Guido van Rossum) Date: Thu, 05 Sep 2002 14:19:02 -0400 Subject: [Python-Dev] Re: Signal-resistant code (was: Two random and nearly unrelated ideas) In-Reply-To: Your message of "Thu, 05 Sep 2002 05:30:02 EDT." 
<20020905093002.GA61136@hishome.net> References: <3FE2540C-C047-11D6-89C6-000A27B19B96@oratrix.com> <200209042048.g84KmCK08365@pcp02138704pcs.reston01.va.comcast.net> <20020905093002.GA61136@hishome.net> Message-ID: <200209051819.g85IJ2113867@odiug.zope.com> > > - Macro: AC_SYS_RESTARTABLE_SYSCALLS > > If the system automatically restarts a system call that is > > interrupted by a signal, define `HAVE_RESTARTABLE_SYSCALLS'. > > The name of this macro is misleading. It doesn't check whether system calls > are restartABLE but whether they are restartED automatically by libc. It > forks a subprocess that sends a signal to the parent. The parent waits for > the child and checks if the wait() was interrupted. > > If this macro is defined you will never get EINTR so there's no need to > worry about this. If it isn't defined you need to restart system calls > yourself. This was a feature introduced by BSD Unix in a distant past, as a change from v7 Unix (which had only the EINTR behavior). For b/w compatibility, BSD had a system call to disable the restart feature. I'm guessing that over the years the feature has been found less than helpful, so POSIX defaults to off. POSIX sigaction() has a flag SA_RESTART to enable restarting. --Guido van Rossum (home page: http://www.python.org/~guido/) From guido@python.org Thu Sep 5 20:15:54 2002 From: guido@python.org (Guido van Rossum) Date: Thu, 05 Sep 2002 15:15:54 -0400 Subject: [Python-Dev] Re: Signal-resistant code (was: Two random and nearly unrelated ideas) In-Reply-To: Your message of "Thu, 05 Sep 2002 11:52:28 PDT." <20020905185228.GA19726@glacier.arctrix.com> References: <3FE2540C-C047-11D6-89C6-000A27B19B96@oratrix.com> <200209042048.g84KmCK08365@pcp02138704pcs.reston01.va.comcast.net> <20020905093002.GA61136@hishome.net> <20020905185228.GA19726@glacier.arctrix.com> Message-ID: <200209051915.g85JFsR14171@odiug.zope.com> > Signals are a gigantic mess. 
I'm starting to doubt that you realize the > extent of the brain damage. While I would be pleased if there was some > way Python could hide the mess, I'm not convinced it is possible. Thanks for the support Neil. That's exactly how I think about it. --Guido van Rossum (home page: http://www.python.org/~guido/) From skip@pobox.com Thu Sep 5 15:57:45 2002 From: skip@pobox.com (Skip Montanaro) Date: Thu, 5 Sep 2002 09:57:45 -0500 Subject: [Python-Dev] Getting started with GBayes testing In-Reply-To: <3D772EC2.30217.184B6C78@localhost> References: <3D7653AD.14352.14F391B6@localhost> <3D772EC2.30217.184B6C78@localhost> Message-ID: <15735.28905.730200.821228@12-248-11-90.client.attbi.com> Brad> My feeling is that the presentation of "the message" is Brad> independent of the message itself, so if I get a message in Text, Brad> HTML, RTF only the actual content is important, not the markup Brad> method. Though I suppose using lots of red and large fonts might Brad> be an indicator of spam, the text of the message should still Brad> suffice. You might be surprised. In Paul Graham's "A Plan for Spam" he writes: I don't know why I avoided trying the statistical approach for so long. I think it was because I got addicted to trying to identify spam features myself, as if I were playing some kind of competitive game with the spammers. (Nonhackers don't often realize this, but most hackers are very competitive.) When I did try statistical analysis, I found immediately that it was much cleverer than I had been. It discovered, of course, that terms like "virtumundo" and "teens" were good indicators of spam. But it also discovered that "per" and "FL" and "ff0000" are good indicators of spam. In fact, "ff0000" (html for bright red) turns out to be as good an indicator of spam as any pornographic term. As Tim has pointed out several times, intuition and hunches about this stuff often turn out to be incorrect.
Skip From jason-exp-1031947065.5eb24b@mastaler.com Thu Sep 5 21:01:37 2002 From: jason-exp-1031947065.5eb24b@mastaler.com (jason-exp-1031947065.5eb24b@mastaler.com) Date: Thu, 05 Sep 2002 14:01:37 -0600 Subject: [Python-Dev] Re: New `spambayes' project on SourceForge References: <15735.40259.117828.402419@anthem.wooz.org> Message-ID: barry@zope.com (Barry A. Warsaw) writes: > There are also two mailing lists related to the spambayes project. > The first is a general discussion list: > > http://mail.python.org/mailman-21/listinfo/spambayes > > and the other is a list for cvs checkin message notices: > > http://mail.python.org/mailman-21/listinfo/spambayes-checkins These lists have now been added to Gmane (http://gmane.org) as well: spambayes@python.org <==> news://news.gmane.org/gmane.mail.spam.spambayes.general spambayes-checkins@python.org <==> news://news.gmane.org/gmane.mail.spam.spambayes.cvs -- (http://tmda.net/) From oren-py-d@hishome.net Thu Sep 5 21:27:16 2002 From: oren-py-d@hishome.net (Oren Tirosh) Date: Thu, 5 Sep 2002 23:27:16 +0300 Subject: [Python-Dev] Re: Signal-resistant code In-Reply-To: <200209051501.g85F1EY13017@odiug.zope.com>; from guido@python.org on Thu, Sep 05, 2002 at 11:01:14AM -0400 References: <15733.11253.743055.864572@12-248-11-90.client.attbi.com> <20020904094947.GA56953@hishome.net> <200209041144.g84BiXZ05244@pcp02138704pcs.reston01.va.comcast.net> <20020904124646.GA79746@hishome.net> <200209041325.g84DP1o06695@pcp02138704pcs.reston01.va.comcast.net> <20020904160143.GA1483@hishome.net> <200209042005.g84K5Ms08177@pcp02138704pcs.reston01.va.comcast.net> <20020905045414.GA26104@hishome.net> <200209051501.g85F1EY13017@odiug.zope.com> Message-ID: <20020905232716.A8225@hishome.net> On Thu, Sep 05, 2002 at 11:01:14AM -0400, Guido van Rossum wrote: > > > I have never understood why a child dying should send a signal. > > > You can poll for the child with waitpid() instead. 
> > > > You're assuming too much about the structure of the program using > > child processes. The code that starts the child process may not be > > in control of the Python program counter by the time it ends. It's > > useful to be able to leave a signal handler to clean up the zombie > > process by waitpid(). > > I admit that I hate signals so badly that whenever I needed to wait > for a child to finish I would always structure the program around this > need (even when coding in C). Ummm... if you really hate signals that much perhaps you ought to step aside from this particular discussion? Naturally, you will get to pronounce on the results that come out of it (if any ;-) Westley: No, no. We have already succeeded. I mean, what are the three terrors of the fire swamp? One, the flame spurt - no problem - there's a popping sound preceding each. We can avoid that. Two, the lightning sand which you were clever enough to discover what that looks like, so in the future we can avoid that too. (from "The Princess Bride" by William Goldman) So what are the three problems of signals? One - what calls are allowed by the platform inside a signal handler. No problem. Nobody suggested actually executing Python code inside a signal handler so we don't need to be worried about user code. The C handler doesn't call anything unusual, just sets flags. This should work on all platforms. Two - Interruptible system calls. If all Python I/O calls are wrapped inside restarting wrappers this should be solved. If the system's libc wraps them it can be disabled by SA_RESTART (posix) or siginterrupt (BSD). On some systems read and recv return a short buffer instead of EINTR. This can be safely ignored because it only happens for pipes and sockets where this is a valid result. AFAIR it's guaranteed not to happen on regular files so we won't be tricked into believing they reached EOF. Are there any systems where system calls are interruptible but not restartable in any way without data loss?
Three - Threads playing "who gets the signal". The Python signal module has a hack that appears to work on all relevant platform - ignore the signal if getpid() isn't the main thread. Oren Buttercup: Westley, what about the R.O.U.S.'s? Westley: Rodents Of Unusual Size? I don't think they exist. ... From stephen@ixokai.net Thu Sep 5 21:27:47 2002 From: stephen@ixokai.net (Stephen Hansen) Date: 05 Sep 2002 13:27:47 -0700 Subject: [Python-Dev] SF patch#555779, "import user" and Apache... *humble* Message-ID: <1031257667.16739.5.camel@jeremy> *cough* So. Hi. Python-Gods. Um. So. Anyways. *embarassed* I submitted a really tiny patch to SF awhile back, #555779, which would make "import user" actually useful in a certain specific CGI situation. The BDFL seemed to have no problems and said anyone could commit it.. no one has. :) Now, i'm not impatient at all, its already patched into all the machines i'm working on... however, i'm just sending this little reminder in the hopes that it won't be forgotten until after 2.3 comes out. :) I don't want to re-patch everything again later, i've got quite a few machines currently using it. :) *cough* So. Yes. Well. Thank you for your time. :) *runs away* --Stephen From mal@egenix.com Thu Sep 5 18:19:57 2002 From: mal@egenix.com (M.-A. Lemburg) Date: Thu, 05 Sep 2002 19:19:57 +0200 Subject: [Python-Dev] GBayes design References: <002b01c254f2$f6c7c020$71b53bd0@othello> Message-ID: <3D77923D.8060108@lemburg.com> Raymond Hettinger wrote: > Is it too late to challenge a core design decision? > > Instead of multiplying probablities, use fuzzy logic methods. > Classify the indicators into damning, strong, weak, neautral, ... > > After counting the number of indicators in each class, make > a spam/ham decision that can be easily tweaked. This would > make it easy to implement variations of Tim's recent clear > win, where additional indicators are gathered until the > balance shifts sharply to one side. 
> > Some other advantages are: > -- easily interpreted score vectors (6 damning, 7 strong, 4 weak, ... ) > -- avoids mathematical issues with indicators not being independent > -- allows the addition of non-token based indicators. for instance, > a preponderance of caps would be a weak indicator. the presence > of caps separated by spaces would be a strong indicator. > -- the decision logic would be more intuitive > -- avoids the issue of having equal amounts of spam and ham in > the sample > > The core concept would stay the same -- it's really just a shift from > continuous to discrete. Hmm, there's nothing discrete about fuzzy logic (ok, this claim is 0.65% true ;-) The problem is more about multi-dimensional optimization where you are interested in distilling several different inputs into one value. A weighted average is the simplest form to use here and there are various multi-dimensional optimization algorithms around to aid in finding the "optimal" weights. Another approach would be using a shallow neural network. The only "problem" with these is that Tim generates a variable number of inputs, AFAICT, so that you'd have to use some preprocessing to make the number of inputs constant. Would make a nice internship project, I guess :-) -- Marc-Andre Lemburg CEO eGenix.com Software GmbH _______________________________________________________________________ eGenix.com -- Makers of the Python mx Extensions: mxDateTime,mxODBC,... 
Python Consulting: http://www.egenix.com/ Python Software: http://www.egenix.com/files/python/ From hu.peress@mail.mcgill.ca Sun Sep 8 03:22:40 2002 From: hu.peress@mail.mcgill.ca (Hunter Peress) Date: 07 Sep 2002 21:22:40 -0500 Subject: [Python-Dev] Call for clarity ( clarification ;-) ) In-Reply-To: <003d01c25471$d83fe960$2fd8accf@othello> References: <1031437860.636.29.camel@HillCountryPeress> <1031442464.644.68.camel@HillCountryPeress> <003d01c25471$d83fe960$2fd8accf@othello> Message-ID: <1031451760.644.97.camel@HillCountryPeress> On Wed, 2002-09-04 at 19:19, Raymond Hettinger wrote: > From: "Hunter Peress" > > > def something(a,b,c="lalal"): > > """This will find its way into the pydocs because its a comment""" > > ##Here is the new stuff Im proposing > > ##note, a clearer sytnax can surely be devised. > > """file""" #documents the type of the first arg > > """string""" # "" second > > """list""" # "" third > > """string""" #documents the return type. > > > > Then the pydoc generator will do a check on the # arguments to the > > func/meth, verify that the correct amount of these new comments (which > > only supply the type) are provided. I do think that it would help to > > actually enforce this. I think its fine that doc's NOT be generated if > > they don't supply this information. This provides for better docs and > > shouldnt get that many complaints. > > Thanks for the clarification. I see what you're trying to do; > however, I think that any gains are more than offset by the new > level of complexity and lengthier code. > > The current docs make a pretty good effort at describing what is > needed for each argument. At the same time, they allow flexibility > for dynamic arguments that share a similar interface (such as > substituting a StringIO object for a File object. 
> > In your example, the docs strings could be made clear > using existing tools: > > def something(file, promptstring, optionlist): > """Returns a string extracted from the file > for any line matching the promptstring. > The optionlist can include any of the > following: IGNORECASE, VERBOSE. > MULTILINE, or ADDLINENUMBER.""" > > I can't see that a tool like you described would add any > more clarity than the above docstring. > > > PS whats TIA mean? > > "Thanks In Advance" > > Do you have any examples of current python docstrings that are > not clear enough? this was the impetus behind my whole thinking here. I need not search far. example 1) pydoc os.fork Python Library Documentation: built-in function fork in os fork(...) fork() -> pid Fork a child process. Return 0 to child process and PID of child to parent process. example2) pydoc string.index Python Library Documentation: function index in string index(s, *args) index(s, sub [,start [,end]]) -> int Like find but raises ValueError when the substring is not found. >From these two, I have no idea what BOTH the input and return types are. I found those examples in 10 seconds (literally). The state of the python documentation is caca. And your complacency is a cause for concern. I think its easier to enforce this from the level i describe, than have guido saying "ok guys please be more explicit in your documentation". I mean, both of those documents above are somewhat explicit, but they are not COMPLETE. Could you provide me with some linkage on parsing python (from a compilation/ syntax-tree analysis POV). SO that i can get to work on writing a patch for the pydoc generation program. > > > Raymond Hettinger > > From guido@python.org Thu Sep 5 21:46:27 2002 From: guido@python.org (Guido van Rossum) Date: Thu, 05 Sep 2002 16:46:27 -0400 Subject: [Python-Dev] Re: Signal-resistant code In-Reply-To: Your message of "Thu, 05 Sep 2002 23:27:16 +0300." 
<20020905232716.A8225@hishome.net> References: <15733.11253.743055.864572@12-248-11-90.client.attbi.com> <20020904094947.GA56953@hishome.net> <200209041144.g84BiXZ05244@pcp02138704pcs.reston01.va.comcast.net> <20020904124646.GA79746@hishome.net> <200209041325.g84DP1o06695@pcp02138704pcs.reston01.va.comcast.net> <20020904160143.GA1483@hishome.net> <200209042005.g84K5Ms08177@pcp02138704pcs.reston01.va.comcast.net> <20020905045414.GA26104@hishome.net> <200209051501.g85F1EY13017@odiug.zope.com> <20020905232716.A8225@hishome.net> Message-ID: <200209052046.g85KkR714802@odiug.zope.com> > > I admit that I hate signals so badly that whenever I needed to wait > > for a child to finish I would always structure the program around this > > need (even when coding in C). > > Ummm... if you really hate signals that much perhaps you to step aside > from this particular discussion? Naturally, you will get to pronounce on > the results that come out of it (if any ;-) Why? I don't think hating signals disqualifies me from understanding their problems. > So what are the three problems of signals? > > One - what calls are allowed by the platform inside a signal handler. > No problem. Nobody suggested actually executing Python code inside a > signal handler so we don't need to be worried about user code. The C > handler doesn't call anything unusual, just sets flags. This should work > on all platforms. > > Two - Interruptible system calls. If all Python I/O calls are wrapped > inside restarting wrappers this should be solved. I asked what the Python code called by the wrapper when a signal arrives is allowed to do (e.g. close the file?). If you replied to that, I missed it. > If the system's libc wraps them it can be disabled by SA_RESTART > (posix) or siginterrupt (BSD). On some systems read and recv return > a short buffer instead of EINTR. This latter sentence shows that you don't understand signals, or you're being very sloppy. 
You get *either* a short buffer *or* EINTR depending on whether some data was already transferred to user space. > This can be safely ignored because it only happens for pipes and > sockets where this is a valid result. AFAIR it's guaranteed not to > happen on regular files so we won't be tricked into believing they > reached EOF. I don't believe that a short read on a regular file can be used reliably to infer EOF anyway. The file could be growing while we read. > Are there any systems where system calls are interruptible but not > restartable in any way without data loss? Not AFAIK. > Three - Threads playing "who gets the signal". The Python signal module > has a hack that appears to work on all relevant platform - ignore the > signal if getpid() isn't the main thread. Doesn't that make signals unreliable? What if thread 4 has forked a child, and the child exist? Won't the SIGCHLD be sent to thread 4? AFAIK there's no standard for this, or if there is, not all systems comply. --Guido van Rossum (home page: http://www.python.org/~guido/) From oren-py-d@hishome.net Thu Sep 5 21:45:52 2002 From: oren-py-d@hishome.net (Oren Tirosh) Date: Thu, 5 Sep 2002 16:45:52 -0400 Subject: [Python-Dev] Re: Signal-resistant code (was: Two random and nearly unrelated ideas) In-Reply-To: <20020905185228.GA19726@glacier.arctrix.com> References: <3FE2540C-C047-11D6-89C6-000A27B19B96@oratrix.com> <200209042048.g84KmCK08365@pcp02138704pcs.reston01.va.comcast.net> <20020905093002.GA61136@hishome.net> <20020905185228.GA19726@glacier.arctrix.com> Message-ID: <20020905204552.GA51795@hishome.net> On Thu, Sep 05, 2002 at 11:52:28AM -0700, Neil Schemenauer wrote: > > Signals are a gigantic mess. I'm starting to doubt that you realize the > extent of the brain damage. While I would be pleased if there was some > way Python could hide the mess, I'm not convinced it is possible. > > Neil Ah... I can almost hear the pain, frustration and despair in your voice. 
Obviously Guido and you got burned by this. I know other old-time Unix hackers with the same attitude. From my experience signals on Linux work just fine - I don't carry any signal scars. I can show off my Oracle scars, though. They're really gnarly. I can't hear that name mentioned without turning completely irrational about it. Certain embedded software and hardware makers also make me want to scream. Oren From guido@python.org Thu Sep 5 21:51:57 2002 From: guido@python.org (Guido van Rossum) Date: Thu, 05 Sep 2002 16:51:57 -0400 Subject: [Python-Dev] Re: Signal-resistant code (was: Two random and nearly unrelated ideas) In-Reply-To: Your message of "Thu, 05 Sep 2002 16:45:52 EDT." <20020905204552.GA51795@hishome.net> References: <3FE2540C-C047-11D6-89C6-000A27B19B96@oratrix.com> <200209042048.g84KmCK08365@pcp02138704pcs.reston01.va.comcast.net> <20020905093002.GA61136@hishome.net> <20020905185228.GA19726@glacier.arctrix.com> <20020905204552.GA51795@hishome.net> Message-ID: <200209052059.g85Kxop14949@odiug.zope.com> > From my experience signals on Linux work just fine - I don't carry > any signal scars. That just shows you haven't written enough signal code. :-) Seriously, let's please not confuse Linux with portable. The issues here are about the cross-platform viability of your suggested approach. If you've only used signals on Linux, maybe you should withdraw yourself on account of lack of experience with the real issues. --Guido van Rossum (home page: http://www.python.org/~guido/) From paul-python@svensson.org Thu Sep 5 22:08:54 2002 From: paul-python@svensson.org (Paul Svensson) Date: Thu, 5 Sep 2002 17:08:54 -0400 (EDT) Subject: [Python-Dev] Re: Signal-resistant code In-Reply-To: <20020905232716.A8225@hishome.net> Message-ID: On Thu, 5 Sep 2002, Oren Tirosh wrote: >So what are the three problems of signals? >Two - Interruptible system calls. If all Python I/O calls are wrapped >inside restarting wrappers this should be solved. 
If the system's libc >wraps them it can be disabled by SA_RESTART (posix) or siginterrupt (BSD). >On some systems read and recv return a short buffer instead of EINTR. This >can be safely ignored because it only happens for pipes and sockets where >this is a valid result. AFAIR it's guaranteed not to happen on regular >files so we won't be tricked into believing they reached EOF. Are there >any systems where system calls are interruptible but not restartable >in any way without data loss? I don't see any guarantee against short reads in my documentation (Linux, HP-UX); indeed both state explicitly that only a 0 return from read() indicates EOF. /Paul From neal@metaslash.com Thu Sep 5 22:10:10 2002 From: neal@metaslash.com (Neal Norwitz) Date: Thu, 05 Sep 2002 17:10:10 -0400 Subject: [Python-Dev] SF patch#555779, "import user" and Apache... *humble* References: <1031257667.16739.5.camel@jeremy> Message-ID: <3D77C832.FD342FCD@metaslash.com> Stephen Hansen wrote: > > I submitted a really tiny patch to SF awhile back, #555779, which would > make "import user" actually useful in a certain specific CGI situation. > The BDFL seemed to have no problems and said anyone could commit it.. Done. Neal From python@rcn.com Thu Sep 5 21:49:21 2002 From: python@rcn.com (Raymond Hettinger) Date: Thu, 5 Sep 2002 16:49:21 -0400 Subject: [Python-Dev] SF patch#555779, "import user" and Apache... *humble* References: <1031257667.16739.5.camel@jeremy> Message-ID: <006801c2551d$b6e0e600$3961accf@othello> I'll check it in for you when I get back from class this evening. Raymond Hettinger BTW, no need for humility around here. ----- Original Message ----- From: "Stephen Hansen" To: Sent: Thursday, September 05, 2002 4:27 PM Subject: [Python-Dev] SF patch#555779, "import user" and Apache... *humble* > *cough* > > So. Hi. Python-Gods. Um. So. Anyways. 
*embarassed* > > I submitted a really tiny patch to SF awhile back, #555779, which would > make "import user" actually useful in a certain specific CGI situation. > The BDFL seemed to have no problems and said anyone could commit it.. no > one has. :) Now, i'm not impatient at all, its already patched into all > the machines i'm working on... however, i'm just sending this little > reminder in the hopes that it won't be forgotten until after 2.3 comes > out. :) I don't want to re-patch everything again later, i've got quite > a few machines currently using it. :) > > *cough* So. Yes. Well. Thank you for your time. :) > > *runs away* > > --Stephen > > > _______________________________________________ > Python-Dev mailing list > Python-Dev@python.org > http://mail.python.org/mailman/listinfo/python-dev > From fredrik@pythonware.com Thu Sep 5 23:06:22 2002 From: fredrik@pythonware.com (Fredrik Lundh) Date: Fri, 6 Sep 2002 00:06:22 +0200 Subject: [Python-Dev] Call for clarity ( clarification ;-) ) References: <1031437860.636.29.camel@HillCountryPeress><1031442464.644.68.camel@HillCountryPeress> <003d01c25471$d83fe960$2fd8accf@othello> <1031451760.644.97.camel@HillCountryPeress> Message-ID: <004701c25528$7c8b4530$ced241d5@hagrid> hunter wrote: > I need not search far. > example 1) pydoc os.fork > Python Library Documentation: built-in function fork in os > fork(...) > fork() -> pid > Fork a child process. > > Return 0 to child process and PID of child to parent process. why do you care about the type of a PID object? in most cases, all you need to know is that a PID isn't 0, which is exactly what the documentation says. and if you know what a PID is, you already know what type it is... > example2) pydoc string.index > Python Library Documentation: function index in string > index(s, *args) > index(s, sub [,start [,end]]) -> int > > Like find but raises ValueError when the substring is not found. 
> > From these two, I have no idea what BOTH the input and return > types are. the index documentation refers to the documentation for "find", which tells you that: >>> help(string.find) Help on function find in module string: find(s, *args) find(s, sub [,start [,end]]) -> in Return the lowest index in s where substring sub is found, such that sub is contained within s[start,end]. Optional arguments start and end are interpreted as in slice notation. Return -1 on failure. which, given that you know how indexes and slices work in python, is all you need to know. > I found those examples in 10 seconds (literally). The state of the > python documentation is caca. how long have you been using Python? From oren-py-d@hishome.net Thu Sep 5 23:23:30 2002 From: oren-py-d@hishome.net (Oren Tirosh) Date: Fri, 6 Sep 2002 01:23:30 +0300 Subject: [Python-Dev] Re: Signal-resistant code In-Reply-To: <200209052046.g85KkR714802@odiug.zope.com>; from guido@python.org on Thu, Sep 05, 2002 at 04:46:27PM -0400 References: <20020904094947.GA56953@hishome.net> <200209041144.g84BiXZ05244@pcp02138704pcs.reston01.va.comcast.net> <20020904124646.GA79746@hishome.net> <200209041325.g84DP1o06695@pcp02138704pcs.reston01.va.comcast.net> <20020904160143.GA1483@hishome.net> <200209042005.g84K5Ms08177@pcp02138704pcs.reston01.va.comcast.net> <20020905045414.GA26104@hishome.net> <200209051501.g85F1EY13017@odiug.zope.com> <20020905232716.A8225@hishome.net> <200209052046.g85KkR714802@odiug.zope.com> Message-ID: <20020906012330.A10575@hishome.net> On Thu, Sep 05, 2002 at 04:46:27PM -0400, Guido van Rossum wrote: > > > I admit that I hate signals so badly that whenever I needed to wait > > > for a child to finish I would always structure the program around this > > > need (even when coding in C). > > > > Ummm... if you really hate signals that much perhaps you to step aside > > from this particular discussion? Naturally, you will get to pronounce on > > the results that come out of it (if any ;-) > > Why? 
I don't think hating signals disqualifies me from understanding > their problems. In the past I have disqualified myself from making technical decisions on issues where I have been burned and knew that my opinion would be a calm rational decision. > I asked what the Python code called by the wrapper when a signal > arrives is allowed to do (e.g. close the file?). If you replied to > that, I missed it. Anything that a Python thread is allowed to do without grabbing a lock, i.e. anything that involves only exclusive resources or atomic Python operations on shared resources like setting a variable (but not read-modify-write). A signal handler is also allowed to raise an exception that will get delivered to the main thread. > > If the system's libc wraps them it can be disabled by SA_RESTART > > (posix) or siginterrupt (BSD). On some systems read and recv return > > a short buffer instead of EINTR. > > This latter sentence shows that you don't understand signals, or > you're being very sloppy. You get *either* a short buffer *or* EINTR > depending on whether some data was already transferred to user space. Did I say that you get *both* a short buffer *and* EINTR? What I meant is that it's really quite simple - if errno==EINTR I retry and if I get a short buffer I continue from whatever I got and ask for the remainder and this should work regardless of the differences in behavior between different systems, sockets and files, etc. > > This can be safely ignored because it only happens for pipes and > > sockets where this is a valid result. AFAIR it's guaranteed not to > > happen on regular files so we won't be tricked into believing they > > reached EOF. > > I don't believe that a short read on a regular file can be used > reliably to infer EOF anyway. The file could be growing while we read. You're right, only a zero result on read should be interpreted as EOF, not a short result. I got confused by fread where a short read does mark an end of file condition. 
I don't see how the growing file case is relevant, though. > > Three - Threads playing "who gets the signal". The Python signal module > > has a hack that appears to work on all relevant platform - ignore the > > signal if getpid() isn't the main thread. > > Doesn't that make signals unreliable? What if thread 4 has forked a > child, and the child exist? Won't the SIGCHLD be sent to thread 4? > AFAIK there's no standard for this, or if there is, not all systems > comply. I've never actually tried this one. I just went by the comments in signalmodule.c which claim that this works for all cases of how different implementations deliver signals to threads. I guess that was a bit hasty. Oren From list-python@ccraig.org Fri Sep 6 07:20:51 2002 From: list-python@ccraig.org (Christopher A. Craig) Date: 06 Sep 2002 02:20:51 -0400 Subject: [Python-Dev] Documentation inconsistency in re Message-ID: >From the Library Reference (2.2.1): \b Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of alphanumeric characters, so the end of a word is indicated by whitespace or a non-alphanumeric character. Inside a character range, \b represents the backspace character, for compatibility with Python's string literals. Now reality: Python 2.2.1 (#2, Apr 22 2002, 17:53:10) [GCC 2.95.4 20011002 (Debian prerelease)] on linux2 Type "help", "copyright", "credits" or "license" for more information. 
>>> import re >>> t = re.compile(r'\bbag\b') >>> t.search('test bag') <_sre.SRE_Match object at 0x812aad0> >>> t.search('test+bag') <_sre.SRE_Match object at 0x815d528> >>> t.search('test_bag') >>> [ chr(i) for i in xrange(256) if not t.search('test' + chr(i) + 'bag') ] ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '_', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z'] >>> So the implementation appears to define a word as a sequence of alphanumeric characters or underscores, which means either the documentation, or the library is wrong. Now it happens that this was found while a friend of mine and I were looking to get the exact behavior that is implemented, so I'd prefer it if the documentation were updated to meet the implementation <.8 wink>. -- Christopher A. Craig I develop for Linux for a living, I used to develop for DOS. Going from DOS to Linux is like trading a glider for an F117. - Lawrence Foard From fredrik@pythonware.com Fri Sep 6 07:47:12 2002 From: fredrik@pythonware.com (Fredrik Lundh) Date: Fri, 6 Sep 2002 08:47:12 +0200 Subject: [Python-Dev] Documentation inconsistency in re References: Message-ID: <00e401c25571$3e6bc1f0$ced241d5@hagrid> Christopher A. Craig wrote: > >From the Library Reference (2.2.1): > > \b Matches the empty string, but only at the beginning or end of a > word. A word is defined as a sequence of alphanumeric characters, so > the end of a word is indicated by whitespace or a non-alphanumeric > character. Inside a character range, \b represents the backspace > character, for compatibility with Python's string literals. as you suspected, the documentation is flawed: \b is defined in terms of \w and \W. From mal@lemburg.com Fri Sep 6 08:55:13 2002 From: mal@lemburg.com (M.-A. 
Lemburg) Date: Fri, 06 Sep 2002 09:55:13 +0200 Subject: [Python-Dev] utf8 issue References: <200208232105.g7NL5RE16863@pcp02138704pcs.reston01.va.comcast.net> <2mznv9c1k4.fsf@starship.python.net> <200208261405.g7QE5Of05199@pcp02138704pcs.reston01.va.comcast.net> <3D77205E.8080103@lemburg.com> <200209051351.g85Dpnk12649@odiug.zope.com> Message-ID: <3D785F61.1090301@lemburg.com> Guido van Rossum wrote: >>>Please do. Bumping MAGIC is a no-no between dot releases. But I >>>don't understand why that is necessary? >> >>It would be necessary since marshal uses UTF-8 for storing >>Unicode literals. > > > Do you mean that in 2.2 it doesn't? Marshal uses it since 1.6. The point is that the fix to the lone surrogate problem resulted in a change of the UTF codec output. PYCs from unpatched and patched versions wouldn't interop if they use lone surrogates in Unicode literals. We usually bump the PYC magic in such a case, to avoid these issues. Since it's not possible for a patch level release, we have two choices: 1. leave things as they are 2. apply the fix and live with the consequences of having to regenerate PYCs by hand Just to give an example of the problem: Python 2.2: ------------- u'\ud800'.encode('utf-8') == '\xa0\x80' >>> unicode('\xa0\x80', 'utf-8') Traceback (most recent call last): File "", line 1, in ? UnicodeError: UTF-8 decoding error: unexpected code byte >>> unicode('\xed\xa0\x80', 'utf-8') Traceback (most recent call last): File "", line 1, in ? UnicodeError: UTF-8 decoding error: illegal encoding Current CVS Python: --------------------- u'\ud800'.encode('utf-8') == '\xed\xa0\x80' >>> unicode('\xed\xa0\x80', 'utf-8') u'\ud800' >>Even though it's highly unlikely that the problem cases are used in >>Python Unicode literals, there's a tiny chance. Without the MAGIC >>change this could result in PYC files failing to load. > > > Ha. You may have missed the start of this thread, but the whole > problem was that a PYC file *did* fail to load! 
(The .py file had a > lone surrogate in it.) So I'm not sure this argument holds much > water. Interesting. I wouldn't have expected that. > Can someone please explain what change would be necessary to what part > of the code to prevent a lone surrogate in a string literal from > creating a PYC file from blowing up? One possibility would be to: 1. change the UTF-8 encoder in Python 2.2 to produce correct output 2. let the UTF-8 decoder in Python 2.2 accept the correct output *and* the maformed output I am not sure whether 2. would introduce a security problem. Perhaps there is a way to restrict the work-around so that we don't run into UTF-8 encoding attack problems. -- Marc-Andre Lemburg CEO eGenix.com Software GmbH _______________________________________________________________________ eGenix.com -- Makers of the Python mx Extensions: mxDateTime,mxODBC,... Python Consulting: http://www.egenix.com/ Python Software: http://www.egenix.com/files/python/ From guido@python.org Fri Sep 6 15:06:21 2002 From: guido@python.org (Guido van Rossum) Date: Fri, 06 Sep 2002 10:06:21 -0400 Subject: [Python-Dev] utf8 issue In-Reply-To: Your message of "Fri, 06 Sep 2002 09:55:13 +0200." <3D785F61.1090301@lemburg.com> References: <200208232105.g7NL5RE16863@pcp02138704pcs.reston01.va.comcast.net> <2mznv9c1k4.fsf@starship.python.net> <200208261405.g7QE5Of05199@pcp02138704pcs.reston01.va.comcast.net> <3D77205E.8080103@lemburg.com> <200209051351.g85Dpnk12649@odiug.zope.com> <3D785F61.1090301@lemburg.com> Message-ID: <200209061406.g86E6Lu14230@pcp02138704pcs.reston01.va.comcast.net> [MAL, on UTF-8 for unicode] > Marshal uses it since 1.6. The point is that the fix to the > lone surrogate problem resulted in a change of the UTF codec > output. PYCs from unpatched and patched versions wouldn't > interop if they use lone surrogates in Unicode literals. We > usually bump the PYC magic in such a case, to avoid these > issues. 
Since it's not possible for a patch level release, > we have two choices: > > 1. leave things as they are > > 2. apply the fix and live with the consequences of having > to regenerate PYCs by hand [but then later] > One possibility would be to: > > 1. change the UTF-8 encoder in Python 2.2 to produce correct > output > > 2. let the UTF-8 decoder in Python 2.2 accept the correct > output *and* the maformed output This sounds like the right solution. I hope you can produce a patch against the release22-maint branch. > I am not sure whether 2. would introduce a security problem. > Perhaps there is a way to restrict the work-around so that > we don't run into UTF-8 encoding attack problems. I don't see what this vulnerability (if it is one) adds to the already laughable security of marshal and .pyc files. If someone you don't trust can write your .pyc files, they can cause your interpreter to crash by inserting bogus bytecode. So I'd say this is a non-issue. --Guido van Rossum (home page: http://www.python.org/~guido/) From loewis@informatik.hu-berlin.de Fri Sep 6 15:12:25 2002 From: loewis@informatik.hu-berlin.de (Martin v. Löwis) Date: 06 Sep 2002 16:12:25 +0200 Subject: [Python-Dev] Subsecond time stamps Message-ID: A number of systems provide subsecond time stamp resolution for files. In particular: - NFS v3 has nanosecond time stamps. - Solaris 9 has nanosecond time stamps in stat(2), and microsecond time stamps in utimes(2). In addition, they have microsecond time stamps on ufs. It appears that other Unices have also extended stat(2), as does OS X. - NTFS has 100ns resolution for time stamps. I'd like to expose at least the stat extensions to Python. Adding new fields to stat_result is easy enough, but there are a number of alternatives: A. Add an additional field to hold the nanoseconds, i.e. st_mtimensec, st_atimensec, st_ctimensec. This is the BSD Posix extension. B. Follow the Unix API (Solaris and others). 
They define a struct timespec_t { time_t tv_sec; unsigned long tv_nsec; }; and fields st_mtim, st_ctim, st_atim of timespec_t. For compatibility, they #define st_mtime st_mtim.tv_sec So to get at the seconds, you can write either st_mtim.tv_sec, or st_mtime. For the nanoseconds, you write st_mtim.tv_nsec. This requires to add a new type. C. Make st_mtime a floating point number. This won't offer nanosecond resolution, as C doubles are not dense enough. What do you think? Regards, Martin From paul-python@svensson.org Fri Sep 6 15:31:25 2002 From: paul-python@svensson.org (Paul Svensson) Date: Fri, 6 Sep 2002 10:31:25 -0400 (EDT) Subject: [Python-Dev] Subsecond time stamps In-Reply-To: Message-ID: On 6 Sep 2002, Martin v. Löwis wrote: >A number of systems provide subsecond time stamp resolution for >files. In particular: > >- NFS v3 has nanosecond time stamps. > >- Solaris 9 has nanosecond time stamps in stat(2), and microsecond > time stamps in utimes(2). In addition, they have microsecond time > stamps on ufs. It appears that other Unices have also extended > stat(2), as does OS X. > >- NTFS has 100ns resolution for time stamps. (---) >C. Make st_mtime a floating point number. This won't offer nanosecond > resolution, as C doubles are not dense enough. This seems to me the most Pythonic way. Are C doubles dense enough to offer 100 ns resolution ? /Paul From skip@pobox.com Fri Sep 6 15:39:02 2002 From: skip@pobox.com (Skip Montanaro) Date: Fri, 6 Sep 2002 09:39:02 -0500 Subject: [Python-Dev] Documentation inconsistency in re In-Reply-To: References: Message-ID: <15736.48646.910216.93578@12-248-11-90.client.attbi.com> Christopher> So the implementation appears to define a word as a Christopher> sequence of alphanumeric characters or underscores, which Christopher> means either the documentation, or the library is wrong. Documentation has been fixed. 
Skip From erik@pythonware.com Fri Sep 6 15:44:16 2002 From: erik@pythonware.com (erik heneryd) Date: Fri, 06 Sep 2002 16:44:16 +0200 Subject: [Python-Dev] Call for clarity ( clarification ;-) ) References: <1031437860.636.29.camel@HillCountryPeress> <1031442464.644.68.camel@HillCountryPeress> <003d01c25471$d83fe960$2fd8accf@othello> <1031451760.644.97.camel@HillCountryPeress> Message-ID: <3D78BF40.1030609@pythonware.com> Hunter Peress wrote: >example 1) pydoc os.fork >Python Library Documentation: built-in function fork in os >fork(...) > fork() -> pid > Fork a child process. > > Return 0 to child process and PID of child to parent process. > > my only objection is that the case where fork fails isn't documented. with a c background one expects a negative number, when in fact an exception is raised... erik From guido@python.org Fri Sep 6 15:41:54 2002 From: guido@python.org (Guido van Rossum) Date: Fri, 06 Sep 2002 10:41:54 -0400 Subject: [Python-Dev] Call for clarity ( clarification ;-) ) In-Reply-To: Your message of "Fri, 06 Sep 2002 16:44:16 +0200." <3D78BF40.1030609@pythonware.com> References: <1031437860.636.29.camel@HillCountryPeress> <1031442464.644.68.camel@HillCountryPeress> <003d01c25471$d83fe960$2fd8accf@othello> <1031451760.644.97.camel@HillCountryPeress> <3D78BF40.1030609@pythonware.com> Message-ID: <200209061441.g86EfsV14529@pcp02138704pcs.reston01.va.comcast.net> > my only objection is that the case where fork fails isn't documented. > with a C background one expects a negative number, when in fact an > exception is raised... Ah jeez. Even with only half a day of Python you should've figured out that Python nearly always raises an exception where the corresponding C code returns an error value. 
--Guido van Rossum (home page: http://www.python.org/~guido/) From fredrik@pythonware.com Fri Sep 6 16:03:06 2002 From: fredrik@pythonware.com (Fredrik Lundh) Date: Fri, 6 Sep 2002 17:03:06 +0200 Subject: [Python-Dev] Call for clarity ( clarification ;-) ) References: <1031437860.636.29.camel@HillCountryPeress> <1031442464.644.68.camel@HillCountryPeress> <003d01c25471$d83fe960$2fd8accf@othello> <1031451760.644.97.camel@HillCountryPeress> <3D78BF40.1030609@pythonware.com> <200209061441.g86EfsV14529@pcp02138704pcs.reston01.va.comcast.net> Message-ID: <00f201c255b6$82458c40$0900a8c0@spiff> guido wrote: > > my only objection is that the case where fork fails isn't documented. > > with a C background one expects a negative number, when in fact an > > exception is raised... > > Ah jeez. Even with only half a day of Python you should've figured > out that Python nearly always raises an exception where the > corresponding C code returns an error value. otoh, it doesn't hurt to spell it out for functions like fork which almost always succeeds... (can you write a portable test that is guaranteed to raise an exception, and does that without locking up the system?) From Jack.Jansen@oratrix.com Fri Sep 6 16:09:16 2002 From: Jack.Jansen@oratrix.com (Jack Jansen) Date: Fri, 6 Sep 2002 17:09:16 +0200 Subject: [Python-Dev] Re: Signal-resistant code (was: Two random and nearly unrelated ideas) In-Reply-To: <200209051501.g85F1EY13017@odiug.zope.com> Message-ID: <9CA26C2C-C1AA-11D6-8D51-003065517236@oratrix.com> On Thursday, September 5, 2002, at 05:01, Guido van Rossum wrote: >> Code in signal handlers is executed at some arbitrary point in the >> program and the programmer should be aware of this and only do so >> simple things like setting a flag or appending to a list. > > Unfortunately the mechanism doesn't enforce this. I wish we could > invent a Python signal API that only lets you do one of these simple > things.
Could we connect signals to semaphores or locks or something like that? That would allow you to do the two things that i think are worth doing in a signal handler: setting a flag and/or making some other part of the code wake up. Only problem is that for completeness you would really want to wire up select-like functionality too, so that you could really have a single waiting mechanism. -- - Jack Jansen http://www.cwi.nl/~jack - - If I can't dance I don't want to be part of your revolution -- Emma Goldman - From md9ms@mdstud.chalmers.se Fri Sep 6 16:10:21 2002 From: md9ms@mdstud.chalmers.se (Martin Sjögren) Date: 06 Sep 2002 17:10:21 +0200 Subject: [Python-Dev] Call for clarity ( clarification ;-) ) In-Reply-To: <200209061441.g86EfsV14529@pcp02138704pcs.reston01.va.comcast.net> References: <1031437860.636.29.camel@HillCountryPeress> <1031442464.644.68.camel@HillCountryPeress> <003d01c25471$d83fe960$2fd8accf@othello> <1031451760.644.97.camel@HillCountryPeress> <3D78BF40.1030609@pythonware.com> <200209061441.g86EfsV14529@pcp02138704pcs.reston01.va.comcast.net> Message-ID: <1031325022.587.1.camel@winterfell> On Fri 2002-09-06 at 16.41, Guido van Rossum wrote: > > my only objection is that the case where fork fails isn't documented. > > with a C background one expects a negative number, when in fact an > > exception is raised... > > Ah jeez. Even with only half a day of Python you should've figured > out that Python nearly always raises an exception where the > corresponding C code returns an error value. It would, however, be extremely useful if the documentation spelled out *which* exceptions can be raised! Kind of hard to write a decent try/except clause if you don't know what to expect.
Regards, Martin From guido@python.org Fri Sep 6 16:12:14 2002 From: guido@python.org (Guido van Rossum) Date: Fri, 06 Sep 2002 11:12:14 -0400 Subject: [Python-Dev] Re: Signal-resistant code (was: Two random and nearly unrelated ideas) In-Reply-To: Your message of "Fri, 06 Sep 2002 17:09:16 +0200." <9CA26C2C-C1AA-11D6-8D51-003065517236@oratrix.com> References: <9CA26C2C-C1AA-11D6-8D51-003065517236@oratrix.com> Message-ID: <200209061512.g86FCF314849@pcp02138704pcs.reston01.va.comcast.net> > Could we connect signals to semaphores or locks or something > like that? That would allow you to do the two things that i > think are worth doing in a signal handler: setting a flag and/or > making some other part of the code wake up. But that mixes signals with threads, which is even more poorly standardized than signals in general. --Guido van Rossum (home page: http://www.python.org/~guido/) From guido@python.org Fri Sep 6 16:13:22 2002 From: guido@python.org (Guido van Rossum) Date: Fri, 06 Sep 2002 11:13:22 -0400 Subject: [Python-Dev] Call for clarity ( clarification ;-) ) In-Reply-To: Your message of "Fri, 06 Sep 2002 17:10:21 +0200."
<1031325022.587.1.camel@winterfell> References: <1031437860.636.29.camel@HillCountryPeress> <1031442464.644.68.camel@HillCountryPeress> <003d01c25471$d83fe960$2fd8accf@othello> <1031451760.644.97.camel@HillCountryPeress> <3D78BF40.1030609@pythonware.com> <200209061441.g86EfsV14529@pcp02138704pcs.reston01.va.comcast.net> <1031325022.587.1.camel@winterfell> Message-ID: <200209061513.g86FDXi14877@pcp02138704pcs.reston01.va.comcast.net> > It would, however, be extremely useful if the documentation spelled out > *which* exceptions can be raised! Kind of hard to write a decent > try/except clause if you don't know what to expect. Yes, *this* is a deficiency in the Python docs that ought to be fixed. It's a lot of work though, and it's not always clear what to document (e.g. *everything* can raise MemoryError -- so it's not useful to mention that everywhere). --Guido van Rossum (home page: http://www.python.org/~guido/) From guido@python.org Fri Sep 6 16:30:40 2002 From: guido@python.org (Guido van Rossum) Date: Fri, 06 Sep 2002 11:30:40 -0400 Subject: [Python-Dev] Subsecond time stamps In-Reply-To: Your message of "Fri, 06 Sep 2002 16:12:25 +0200." References: Message-ID: <200209061530.g86FUeq15029@pcp02138704pcs.reston01.va.comcast.net> > C. Make st_mtime a floating point number. This won't offer nanosecond > resolution, as C doubles are not dense enough. This is the most Pythonic approach. --Guido van Rossum (home page: http://www.python.org/~guido/) From neal@metaslash.com Fri Sep 6 16:36:40 2002 From: neal@metaslash.com (Neal Norwitz) Date: Fri, 06 Sep 2002 11:36:40 -0400 Subject: [Python-Dev] Re: Signal-resistant code (was: Two random and nearly unrelated ideas) References: <9CA26C2C-C1AA-11D6-8D51-003065517236@oratrix.com> <200209061512.g86FCF314849@pcp02138704pcs.reston01.va.comcast.net> Message-ID: <3D78CB88.9E642F78@metaslash.com> Guido van Rossum wrote: > > > Could we connect signals to semaphores or locks or something > > like that? 
That would allow you to do the two things that i > > think are worth doing in a signal handler: setting a flag and/or > > making some other part of the code wake up. > > But that mixes signals with threads, which is even more poorly > standardized than signals in general. Python can open a pipe to itself. When a signal arrives, write a character on the pipe in addition to setting a flag. Then select() on the pipe. I doubt this is worth the effort, though. Neal From martin@v.loewis.de Fri Sep 6 16:40:51 2002 From: martin@v.loewis.de (Martin v. Loewis) Date: 06 Sep 2002 17:40:51 +0200 Subject: [Python-Dev] Subsecond time stamps In-Reply-To: References: Message-ID: Paul Svensson writes: > This seems to me the most Pythonic way. > Are C doubles dense enough to offer 100 ns resolution ? It looks like they are: >>> time.time() 1031326478.373606 >>> 1031326478 + 1e-6 1031326478.000001 >>> 1031326478 + 1e-7 1031326478.0000001 >>> 1031326478 + 1e-8 1031326478.0 but only just so: >>> 1031326478 + 2e-7 1031326478.0000002 >>> 1031326478 + 3e-7 1031326478.0000004 >>> 1031326478 + 4e-7 1031326478.0000004 I admit that this looks tempting, but I'm worried about applications that break because they expect time stamps in struct stat to be integers. Regards, Martin From guido@python.org Fri Sep 6 16:42:33 2002 From: guido@python.org (Guido van Rossum) Date: Fri, 06 Sep 2002 11:42:33 -0400 Subject: [Python-Dev] Subsecond time stamps In-Reply-To: Your message of "Fri, 06 Sep 2002 17:40:51 +0200." References:

Message-ID: <200209061542.g86FgXt15105@pcp02138704pcs.reston01.va.comcast.net> > > This seems to me the most Pythonic way. > > I admit that this looks tempting, but I'm worried about applications > that break because they expect time stamps in struct stat to be > integers. Hm, so maybe new field names is still the way to go. E.g. st_mtime gives an int, st_mtimef gives a float. The tuple version only gives the int. If the system doesn't support subsecond resolution, the st_mtimef field still exists but is an int (no point allocating a float and converting the int). --Guido van Rossum (home page: http://www.python.org/~guido/) From tim.one@comcast.net Fri Sep 6 16:50:45 2002 From: tim.one@comcast.net (Tim Peters) Date: Fri, 06 Sep 2002 11:50:45 -0400 Subject: [Python-Dev] Re: Signal-resistant code (was: Two random and nearly unrelated ideas) In-Reply-To: <3D78CB88.9E642F78@metaslash.com> Message-ID: [Neal Norwitz] > Python can open a pipe to itself. When a signal arrives, write > a character on the pipe in addition to setting a flag. > Then select() on the pipe. Of course you meant to say it should do WaitForSingleObject(), so that this scheme is portable . > I doubt this is worth the effort, though. Few things are. From tim.one@comcast.net Fri Sep 6 17:01:43 2002 From: tim.one@comcast.net (Tim Peters) Date: Fri, 06 Sep 2002 12:01:43 -0400 Subject: [Python-Dev] Subsecond time stamps In-Reply-To: Message-ID: [Paul Svensson] > Are C doubles dense enough to offer 100 ns resolution ? The question can't be answered unless you also specify how many years you want to cover. It takes about 25 bits to distinguish a year's worth of seconds, and an IEEE double has 53 bits to play with. So if you were only interested in representing one year, you've got about 28 bits left to play with. If you want to cover an N-year span, you've got about 28 - log2(N) bits to play with. 
It takes a bit over 23 bits to distinguish the number of 100 ns slices in a second, so N has to be small enough that 5 - log2(N) doesn't go negative. So if you count the start of the epoch at 1970, you've just created a year 2003 problem . From oren-py-d@hishome.net Fri Sep 6 17:54:49 2002 From: oren-py-d@hishome.net (Oren Tirosh) Date: Fri, 6 Sep 2002 19:54:49 +0300 Subject: [Python-Dev] Re: Signal-resistant code (was: Two random and nearly unrelated ideas) In-Reply-To: <9CA26C2C-C1AA-11D6-8D51-003065517236@oratrix.com>; from Jack.Jansen@oratrix.com on Fri, Sep 06, 2002 at 05:09:16PM +0200 References: <200209051501.g85F1EY13017@odiug.zope.com> <9CA26C2C-C1AA-11D6-8D51-003065517236@oratrix.com> Message-ID: <20020906195449.A23347@hishome.net> On Fri, Sep 06, 2002 at 05:09:16PM +0200, Jack Jansen wrote: > Could we connect signals to semaphores or locks or something > like that? That would allow you to do the two things that i > think are worth doing in a signal handler: setting a flag and/or > making some other part of the code wake up. Signal handlers and locks don't mix well. A signal handler can't grab a lock. The signal handler can't wait for the lock to be released because it has interrupted the code holding it. The traditional way this has been handled is with a global "interrupt enable" flag. Just like the good old days of 8 bit micros and DOS when any application could clear the interrupt flag :-) If Queue.Queue sets up a signal critical section as well as getting the queue lock a signal could write to a Queue and wake up a thread waiting on the other end. > Only problem is that for completeness you would really want to > wire up select-like functionality too, so that you could really > have a single waiting mechanism. If the program uses select as the central dispatcher you can set up a pipe. The signal handler writes to one end and the other end is listed in the select socket map. 
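The pipe arrangement just described can be sketched roughly as follows, in present-day Python 3 on a POSIX system (both are assumptions; the thread predates them). The function name install_self_pipe is made up for illustration. The handler only writes one byte; everything else happens in the code that selects on the read end.

```python
import os
import select
import signal


def install_self_pipe(signum):
    """Route signum to a pipe; return (read_fd, restore).

    Must be called from the main thread, because signal.signal() requires it.
    """
    read_fd, write_fd = os.pipe()
    # The handler must never block, even if nobody has drained the pipe.
    os.set_blocking(write_fd, False)

    def handler(received_signum, frame):
        try:
            os.write(write_fd, b"\0")
        except BlockingIOError:
            pass  # pipe is full, so a wake-up is already pending

    previous = signal.signal(signum, handler)

    def restore():
        # Put the old handler back and release both ends of the pipe.
        signal.signal(signum, previous)
        os.close(read_fd)
        os.close(write_fd)

    return read_fd, restore


if __name__ == "__main__":
    fd, restore = install_self_pipe(signal.SIGUSR1)
    os.kill(os.getpid(), signal.SIGUSR1)      # simulate an incoming signal
    ready, _, _ = select.select([fd], [], [], 5.0)
    print("woken by signal:", bool(ready))
    restore()
```

The standard library's signal.set_wakeup_fd() offers a similar mechanism at the C level, which may be preferable to a Python-level handler.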
It's a simple way to handle an occasional event like a child process dying or a SIGHUP telling you to reload the configuration file. Do you want to use signals for more intensive tasks like asynchronous I/O? Oren From zack@codesourcery.com Fri Sep 6 18:28:03 2002 From: zack@codesourcery.com (Zack Weinberg) Date: Fri, 6 Sep 2002 10:28:03 -0700 Subject: [Python-Dev] Re: Signal-resistant code (was: Two random and nearly unrelated ideas) In-Reply-To: <20020906195449.A23347@hishome.net> References: <200209051501.g85F1EY13017@odiug.zope.com> <9CA26C2C-C1AA-11D6-8D51-003065517236@oratrix.com> <20020906195449.A23347@hishome.net> Message-ID: <20020906172803.GP6886@codesourcery.com> On Fri, Sep 06, 2002 at 07:54:49PM +0300, Oren Tirosh wrote: > Signal handlers and locks don't mix well. A signal handler can't grab a > lock. The signal handler can't wait for the lock to be released because > it has interrupted the code holding it. The traditional way this has been > handled is with a global "interrupt enable" flag. Just like the good old > days of 8 bit micros and DOS when any application could clear the > interrupt flag :-) > > If Queue.Queue sets up a signal critical section as well as getting the > queue lock a signal could write to a Queue and wake up a thread waiting > on the other end. Would this be an appropriate place to complain about how KeyboardInterrupt won't wake up a thread stuck waiting on a Queue? zw From guido@python.org Fri Sep 6 18:53:22 2002 From: guido@python.org (Guido van Rossum) Date: Fri, 06 Sep 2002 13:53:22 -0400 Subject: [Python-Dev] Re: Signal-resistant code (was: Two random and nearly unrelated ideas) In-Reply-To: Your message of "Fri, 06 Sep 2002 10:28:03 PDT." 
<20020906172803.GP6886@codesourcery.com> References: <200209051501.g85F1EY13017@odiug.zope.com> <9CA26C2C-C1AA-11D6-8D51-003065517236@oratrix.com> <20020906195449.A23347@hishome.net> <20020906172803.GP6886@codesourcery.com> Message-ID: <200209061753.g86HrMx15903@pcp02138704pcs.reston01.va.comcast.net> > Would this be an appropriate place to complain about how > KeyboardInterrupt won't wake up a thread stuck waiting on a Queue? No, unless you have a real proposal on how to fix it (not just a vague idea -- we've all had those, and they don't work). Working code or shut up. :-) --Guido van Rossum (home page: http://www.python.org/~guido/) From zack@codesourcery.com Fri Sep 6 21:52:31 2002 From: zack@codesourcery.com (Zack Weinberg) Date: Fri, 6 Sep 2002 13:52:31 -0700 Subject: [Python-Dev] Re: Signal-resistant code (was: Two random and nearly unrelated ideas) In-Reply-To: <200209061753.g86HrMx15903@pcp02138704pcs.reston01.va.comcast.net> References: <200209051501.g85F1EY13017@odiug.zope.com> <9CA26C2C-C1AA-11D6-8D51-003065517236@oratrix.com> <20020906195449.A23347@hishome.net> <20020906172803.GP6886@codesourcery.com> <200209061753.g86HrMx15903@pcp02138704pcs.reston01.va.comcast.net> Message-ID: <20020906205231.GQ6886@codesourcery.com> On Fri, Sep 06, 2002 at 01:53:22PM -0400, Guido van Rossum wrote: > > Would this be an appropriate place to complain about how > > KeyboardInterrupt won't wake up a thread stuck waiting on a Queue? > > No, unless you have a real proposal on how to fix it (not just a vague > idea -- we've all had those, and they don't work). Working code or > shut up. :-) Fair enough. The underlying problem is that KeyboardInterrupt does not abort acquire() called on a thread lock. This is only noticeable when it was the main thread that called acquire -- if it's some other thread, the KeyboardInterrupt will still be delivered to the main thread. 
Compare the behavior of these two test programs:

-- test1.py --
import time, thread

lock = thread.allocate_lock()
lock.acquire()

def child_thread():
    print "Acquiring lock"
    lock.acquire()
    print "Have lock (can't happen)"
    lock.release()

thread.start_new_thread(child_thread, ())
print "Hit ^C now"
time.sleep(3600)

-- test2.py --
import time, thread

lock = thread.allocate_lock()

def child_thread():
    print "Acquiring lock"
    lock.acquire()
    print "Have lock"
    time.sleep(3600)
    lock.release()

thread.start_new_thread(child_thread, ())
time.sleep(1)   # give child a chance to acquire lock
print "Hit ^C now"
lock.acquire()

I'm going to look only at the pthread-based thread support; presumably similar changes to the ones I will propose, need to be made to the others. There are two cases of PyThread_acquire_lock in thread_pthread.h: using semaphores, and using condition variables. Let's look at the condition variable one first:

	/* mut must be locked by me -- part of the condition
	 * protocol */
	status = pthread_mutex_lock( &thelock->mut );
	CHECK_STATUS("pthread_mutex_lock[2]");
	while ( thelock->locked ) {
		status = pthread_cond_wait(&thelock->lock_released,
					   &thelock->mut);
		CHECK_STATUS("pthread_cond_wait");
	}
	thelock->locked = 1;
	status = pthread_mutex_unlock( &thelock->mut );

Naively, we'd like to shove a check of PyOS_InterruptOccurred in that loop so we can bail out if it's true. It is part of the spec for pthread_cond_wait that any signal which is handled (as SIGINT is) will not interrupt its execution.
So in order to get a chance to check for interrupts we need to change this to a repeated timed wait, like so:

	while ( thelock->locked && !interrupted ) {
		timeout.tv_sec = time(0) + 1;
		status = pthread_cond_timedwait(&thelock->lock_released,
						&thelock->mut,
						&timeout);
		if (status != ETIMEDOUT)
			CHECK_STATUS("pthread_cond_wait");
		interrupted = PyOS_InterruptOccurred();
	}
	thelock->locked = 1;
	status = pthread_mutex_unlock( &thelock->mut );

Then we do a bit of fiddling in the return path to reset the interrupt flag and make sure the caller sees a failure.

In the semaphore case, life is theoretically simpler: there is no mutex, and sem_wait is interrupted by a handled signal, assuming SA_RESTART was not set for that signal (which it isn't, in Python).

	do {
		if (waitflag)
			status = fix_status(sem_wait(thelock));
		else
			status = fix_status(sem_trywait(thelock));
	} while (status == EINTR); /* Retry if interrupted by a signal */

becomes

	do {
		if (waitflag)
			status = fix_status(sem_wait(thelock));
		else
			status = fix_status(sem_trywait(thelock));
		if (status == EINTR && PyOS_InterruptOccurred())
			goto interrupted;
	} while (status == EINTR); /* Retry if interrupted by a signal */
	...
 interrupted:
	PyErr_SetInterrupt();
	dprintf(("PyThread_acquire_lock(%p, %d) interrupted by user\n",
		 lock, waitflag));
	return 0;

However, the Linux semaphore implementation is buggy and will not actually return EINTR from sem_wait, ever. I'll take this up with the libc maintainers; at the Python level, the thing to do is assume it works. Hence, the appended patch. (While I was at it I fixed CHECK_STATUS so that it actually prints the relevant system error, instead of whatever junk happens to be in errno.)
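The repeated timed wait can also be illustrated at the Python level. This is only a sketch of the same idea, not the patch itself: it assumes present-day Python 3, where threading.Lock.acquire() accepts a timeout, and the names acquire_interruptibly and should_stop are invented for the example. The should_stop callable plays the role that PyOS_InterruptOccurred() plays in the C loop.

```python
import threading


def acquire_interruptibly(lock, should_stop, poll_interval=1.0):
    """Acquire lock with bounded waits, re-checking should_stop() between them.

    Returns True once the lock is held, or False if should_stop() became
    true first (in which case the lock is not held by this call).
    """
    while not should_stop():
        # Each wait is bounded, so control returns here at least once per
        # poll_interval and a pending stop request is noticed.
        if lock.acquire(timeout=poll_interval):
            return True
    return False


if __name__ == "__main__":
    lock = threading.Lock()
    lock.acquire()                      # simulate a lock held elsewhere
    stop = threading.Event()
    threading.Timer(0.2, stop.set).start()
    print(acquire_interruptibly(lock, stop.is_set, poll_interval=0.05))
```

The cost is the same one the C version accepts: up to one poll interval of latency before a stop request is seen, in exchange for never blocking indefinitely.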
zw

===================================================================
Index: thread_pthread.h
--- thread_pthread.h	17 Mar 2002 17:19:00 -0000	2.40
+++ thread_pthread.h	6 Sep 2002 20:51:31 -0000
@@ -128,7 +128,12 @@ typedef struct {
 	pthread_mutex_t mut;
 } pthread_lock;
 
-#define CHECK_STATUS(name) if (status != 0) { perror(name); error = 1; }
+#define CHECK_STATUS(name) do { \
+	if (status != 0) { \
+		fprintf(stderr, "%s: %s\n", name, strerror(status)); \
+		error = 1; \
+	} \
+} while (0)
 
 /*
  * Initialization.
@@ -387,6 +392,8 @@ PyThread_acquire_lock(PyThread_type_lock
 			status = fix_status(sem_wait(thelock));
 		else
 			status = fix_status(sem_trywait(thelock));
+		if (status == EINTR && PyOS_InterruptOccurred())
+			goto interrupted;
 	} while (status == EINTR); /* Retry if interrupted by a signal */
 
 	if (waitflag) {
@@ -399,6 +406,12 @@ PyThread_acquire_lock(PyThread_type_lock
 
 	dprintf(("PyThread_acquire_lock(%p, %d) -> %d\n", lock, waitflag, success));
 	return success;
+
+ interrupted:
+	PyErr_SetInterrupt();
+	dprintf(("PyThread_acquire_lock(%p, %d) interrupted by user\n",
+		 lock, waitflag));
+	return 0;
 }
 
 void
@@ -472,8 +485,10 @@ int
 PyThread_acquire_lock(PyThread_type_lock lock, int waitflag)
 {
 	int success;
+	int interrupted = 0;
 	pthread_lock *thelock = (pthread_lock *)lock;
 	int status, error = 0;
+	struct timespec timeout;
 
 	dprintf(("PyThread_acquire_lock(%p, %d) called\n", lock, waitflag));
 
@@ -491,10 +506,15 @@ PyThread_acquire_lock(PyThread_type_lock
 		 * protocol */
 		status = pthread_mutex_lock( &thelock->mut );
 		CHECK_STATUS("pthread_mutex_lock[2]");
-		while ( thelock->locked ) {
-			status = pthread_cond_wait(&thelock->lock_released,
-						   &thelock->mut);
-			CHECK_STATUS("pthread_cond_wait");
+		timeout.tv_nsec = 0;
+		while ( thelock->locked && !interrupted ) {
+			timeout.tv_sec = time(0) + 1;
+			status = pthread_cond_timedwait(&thelock->lock_released,
+							&thelock->mut,
+							&timeout);
+			if (status != ETIMEDOUT)
+				CHECK_STATUS("pthread_cond_wait");
+			interrupted = PyOS_InterruptOccurred();
 		}
 		thelock->locked = 1;
 		status = pthread_mutex_unlock( &thelock->mut );
@@ -502,6 +522,10 @@ PyThread_acquire_lock(PyThread_type_lock
 		success = 1;
 	}
 	if (error) success = 0;
+	if (interrupted) {
+		PyErr_SetInterrupt();
+		success = 0;
+	}
 	dprintf(("PyThread_acquire_lock(%p, %d) -> %d\n", lock, waitflag, success));
 	return success;
 }

From martin@v.loewis.de Sat Sep 7 08:35:26 2002 From: martin@v.loewis.de (Martin v. Loewis) Date: 07 Sep 2002 09:35:26 +0200 Subject: [Python-Dev] Subsecond time stamps In-Reply-To: <200209061542.g86FgXt15105@pcp02138704pcs.reston01.va.comcast.net> References:

<200209061542.g86FgXt15105@pcp02138704pcs.reston01.va.comcast.net> Message-ID: Guido van Rossum writes: > Hm, so maybe new field names is still the way to go. E.g. st_mtime > gives an int, st_mtimef gives a float. The tuple version only gives > the int. If the system doesn't support subsecond resolution, the > st_mtimef field still exists but is an int (no point allocating a > float and converting the int). OTOH, I just found that the time values are already floats on the Mac. Did the change in return value for time.time() cause any problems at the time it was made? Regards, Martin From aahz@pythoncraft.com Sat Sep 7 22:44:09 2002 From: aahz@pythoncraft.com (Aahz) Date: Sat, 7 Sep 2002 17:44:09 -0400 Subject: [Python-Dev] Re: Python-dev summary for 2002-08-15 - 2002-09-01 In-Reply-To: References: <15730.52469.604124.730029@localhost.localdomain> <200209021401.g82E1k030628@pcp02138704pcs.reston01.va.comcast.net> Message-ID: <20020907214409.GA1939@panix.com> On Mon, Sep 02, 2002, François Pinard wrote: > > To get the same effects with email addresses, I often prefer using > `mailto:' as a prefix over writing `<' and `>' around a quoted address > in a message body, even if not fully systematic about this. In the > message header itself, `<' and '>' are the proper way to go, of > course. Ewww. I hate "mailto:" because it interferes with cut'n'paste. -- Aahz (aahz@pythoncraft.com) <*> http://www.pythoncraft.com/ Project Vote Smart: http://www.vote-smart.org/ From Jack.Jansen@oratrix.com Sat Sep 7 23:11:36 2002 From: Jack.Jansen@oratrix.com (Jack Jansen) Date: Sun, 8 Sep 2002 00:11:36 +0200 Subject: [Python-Dev] Subsecond time stamps In-Reply-To: Message-ID: On Saturday, September 7, 2002, at 09:35, Martin v. Loewis wrote: > Guido van Rossum writes: > >> Hm, so maybe new field names is still the way to go. E.g. st_mtime >> gives an int, st_mtimef gives a float. The tuple version only gives >> the int.
If the system doesn't support subsecond resolution, the >> st_mtimef field still exists but is an int (no point allocating a >> float and converting the int). > > OTOH, I just found that the time values are already floats on the > Mac. Did the change in return value for time.time() cause any problems > at the time it was made? It's been causing me headaches in the form of failing test suites about once a year:-) But if I break down the time problems I have on the Mac (100% of which are due to people having a completely unix-centric idea of what a timestamp is) I would say 90% are due to the Mac epoch being in 1904 in stead of in 1970, 9% are due to mac timestamps being localtime in stead of GMT and only 1% are due to the timestamps being floats. And the latter are the easiest to fix, too. The localtime/gmt issues are the hardest, especially because of DST. My preference would be that st_mtime and all other such values are defined to be cookies (sort of similar to lseek values). You would then invoke one of the mythical Python datetime routines to convert the cookie into something guaranteed to be of your liking. (and this specific datetime routine would be platform dependent). If you use the cookie as-is you have a good chance of it working, but you're living dangerously (an analogy would be opening a binary file without "rb"). But this isn't very friendly for backwards compatibility... 
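As a rough illustration of the conversion routine imagined above, here is a sketch in present-day Python 3 (an assumption; no such routine existed in the library under discussion). It handles the two issues named for classic Mac OS timestamps: the 1904 epoch and the use of local time. The names mac_to_unix and utc_offset_seconds are invented for the example, and the caller has to supply the UTC offset because the raw value does not carry it.

```python
from datetime import datetime, timezone

MAC_EPOCH = datetime(1904, 1, 1, tzinfo=timezone.utc)
UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Seconds between the two epochs: 24107 days, i.e. 2082844800 seconds.
EPOCH_OFFSET = (UNIX_EPOCH - MAC_EPOCH).total_seconds()


def mac_to_unix(mac_seconds, utc_offset_seconds=0):
    """Convert a classic Mac OS timestamp to seconds since 1970-01-01 UTC.

    mac_seconds counts from 1904-01-01 in local time. utc_offset_seconds is
    the local zone's offset from UTC (east positive) at that moment.
    """
    # local = UTC + offset, so subtract the offset to get back to UTC.
    return mac_seconds - EPOCH_OFFSET - utc_offset_seconds


if __name__ == "__main__":
    print(mac_to_unix(2082844800))                            # 0.0
    print(mac_to_unix(2082844800 + 3600, utc_offset_seconds=3600))  # 0.0
```

Picking the right offset for a given moment is the hard part once DST is involved, which this sketch leaves entirely to the caller.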
-- - Jack Jansen http://www.cwi.nl/~jack - - If I can't dance I don't want to be part of your revolution -- Emma Goldman - From pinard@iro.umontreal.ca Sat Sep 7 23:50:20 2002 From: pinard@iro.umontreal.ca (François Pinard) Date: Sat, 07 Sep 2002 18:50:20 -0400 Subject: [Python-Dev] Re: Python-dev summary for 2002-08-15 - 2002-09-01 In-Reply-To: <20020907214409.GA1939@panix.com> (Aahz's message of "Sat, 7 Sep 2002 17:44:09 -0400") References: <15730.52469.604124.730029@localhost.localdomain> <200209021401.g82E1k030628@pcp02138704pcs.reston01.va.comcast.net> <20020907214409.GA1939@panix.com> Message-ID: [Aahz] > Ewww. I hate "mailto:" because it interferes with cut'n'paste. I read that you cannot cut and paste a string preceded by `mailto:'? Is it what you meant? What is this interference you mention? What I like in `mailto:' for text or message bodies, is that my editor and mail user agent highlights it and makes it clickable. I would be tempted to guess that other editors do this too, but the truth is that I do not know. Maybe we should not let the strengths and drawbacks of the various editors we use drive us into religious feelings for or against a specific markup. Yet, such comparisons let us have an overall feeling on the usefulness of a particular approach. As long as we resist editor wars, it may be useful. If reStructuredText is going to gain popularity in the Python developers community, maybe we should bet in that direction, and prefer the conventions it proposes for Python-dev summaries and other simple documents. The bet to be taken, here, is that our editors and tools would eventually better support reST, or be supplemented with a dependable set of programs to do so. On the other hand, it seems that not everybody is comfortable with reST yet, this might be a problem if there is strong resistance.
For one, I rather liked what I saw so far, and without knowing how much time or effort it would take before I use reST fluently, I would probably be happy to share the bet! -- François Pinard http://www.iro.umontreal.ca/~pinard From guido@python.org Sun Sep 8 00:24:54 2002 From: guido@python.org (Guido van Rossum) Date: Sat, 07 Sep 2002 19:24:54 -0400 Subject: [Python-Dev] Subsecond time stamps In-Reply-To: Your message of "Sun, 08 Sep 2002 00:11:36 +0200." References: Message-ID: <200209072324.g87NOsG15613@pcp02138704pcs.reston01.va.comcast.net> > >> Hm, so maybe new field names is still the way to go. E.g. st_mtime > >> gives an int, st_mtimef gives a float. The tuple version only gives > >> the int. If the system doesn't support subsecond resolution, the > >> st_mtimef field still exists but is an int (no point allocating a > >> float and converting the int). > > > > OTOH, I just found that the time values are already floats on the > > Mac. Did the change in return value for time.time() cause any problems > > at the time it was made? > > It's been causing me headaches in the form of failing test > suites about once a year:-) But if I break down the time > problems I have on the Mac (100% of which are due to people > having a completely unix-centric idea of what a timestamp is) I > would say 90% are due to the Mac epoch being in 1904 in stead of > in 1970, 9% are due to mac timestamps being localtime in stead > of GMT and only 1% are due to the timestamps being floats. And > the latter are the easiest to fix, too. The localtime/gmt issues > are the hardest, especially because of DST. I'm not sure if this can be used as an argument for making st_mtime and friends floats and be done with it. I wish it could be, because in the long run that's a much nicer API than adding new fields. > My preference would be that st_mtime and all other such values > are defined to be cookies (sort of similar to lseek values).
You > would then invoke one of the mythical Python datetime routines > to convert the cookie into something guaranteed to be of your > liking. (and this specific datetime routine would be platform > dependent). If you use the cookie as-is you have a good chance > of it working, but you're living dangerously (an analogy would > be opening a binary file without "rb"). But this isn't very > friendly for backwards compatibility... There's at least one place I know of in Python that assumes the epoch being 1970: calendar.timegm() -- note the line "EPOCH = 1970" right in front of it. :-) Would it make sense if the portable Python APIs translated everything to an epoch of 1970 and UTC? That's what the Windows C library does. Very helpful. (Or is this a problem that's going to disappear with MacOS X? I presume it uses UTC and I hope its epoch is 1970?) --Guido van Rossum (home page: http://www.python.org/~guido/) From aahz@pythoncraft.com Sun Sep 8 05:02:24 2002 From: aahz@pythoncraft.com (Aahz) Date: Sun, 8 Sep 2002 00:02:24 -0400 Subject: [Python-Dev] Re: Python-dev summary for 2002-08-15 - 2002-09-01 In-Reply-To: References: <15730.52469.604124.730029@localhost.localdomain> <200209021401.g82E1k030628@pcp02138704pcs.reston01.va.comcast.net> <20020907214409.GA1939@panix.com> Message-ID: <20020908040224.GA27302@panix.com> On Sat, Sep 07, 2002, François Pinard wrote: > [Aahz] >> >> Ewww. I hate "mailto:" because it interferes with cut'n'paste. > > I read that you cannot cut and paste a string preceded by `mailto:'? Is it > what you meant? What is this interference you mention? xterm does a nifty job usually of figuring out what to highlight when I double-click on a word. It fails with mailto: because normally when I cut'n'paste an address, I *don't* want to include the "mailto:" portion.
-- Aahz (aahz@pythoncraft.com) <*> http://www.pythoncraft.com/ Project Vote Smart: http://www.vote-smart.org/ From skip@manatee.mojam.com Sun Sep 8 13:00:23 2002 From: skip@manatee.mojam.com (Skip Montanaro) Date: Sun, 8 Sep 2002 07:00:23 -0500 Subject: [Python-Dev] Weekly Python Bug/Patch Summary Message-ID: <200209081200.g88C0N5Z008526@manatee.mojam.com> Bug/Patch Summary ----------------- 278 open / 2830 total bugs (-6) 115 open / 1686 total patches (-4) New Bugs -------- setting file buffer size is unreliable (2002-09-02) http://python.org/sf/603724 spurious SyntaxWarning (2002-09-03) http://python.org/sf/604036 time.struct_time undocumented (2002-09-03) http://python.org/sf/604128 long list in Pythonwin -> weird text (2002-09-03) http://python.org/sf/604387 faster [None]*n or []*n (2002-09-04) http://python.org/sf/604716 pre bug (2002-09-04) http://python.org/sf/604803 python-mode.el replaces function on f1 (2002-09-06) http://python.org/sf/605818 python-mode kills arrow in gdb (gud.el) (2002-09-08) http://python.org/sf/606250 elisp: doesn't recognize comment-syntax (2002-09-08) http://python.org/sf/606251 py-electric-colon & delete-selection-mod (2002-09-08) http://python.org/sf/606254 New Patches ----------- ccompiler argument checking too strict (2002-09-02) http://python.org/sf/603831 release GIL around getaddrinfo() (2002-09-03) http://python.org/sf/604210 For Bug [ 490168 ] shutil.copy(path, pat (2002-09-04) http://python.org/sf/604600 nntplib: group descriptions and RFC2980 (2002-09-05) http://python.org/sf/605370 Tweaks to calls to AH/Help (2002-09-07) http://python.org/sf/606067 fast dictionary lookup by name (2002-09-07) http://python.org/sf/606098 Mac OS X keydefs (2002-09-07) http://python.org/sf/606132 install_IDLE target in Mac/OSX/Makefile (2002-09-07) http://python.org/sf/606134 Closed Bugs ----------- Unicode in sys.path not supported (2001-10-30) http://python.org/sf/476326 PDB single steps list comprehensions (2002-02-28) 
http://python.org/sf/523995 surprise overriding __radd__ in subclass of complex (2002-03-18) http://python.org/sf/531355 import user doesn't work with CGIs (2002-05-14) http://python.org/sf/555779 whatsnew explains noargs incorrectly (2002-06-11) http://python.org/sf/567607 Invalid mmap crashes Python interpreter (2002-07-24) http://python.org/sf/585792 spawn*() doesn't handle errors well (2002-08-20) http://python.org/sf/597795 The KeyError message doesn't use repr on the key value reported (2002-08-21) http://python.org/sf/598451 Method resolution order in Py 2.2 - 2.3 (2002-08-23) http://python.org/sf/599452 bug in new execvpe (2002-08-27) http://python.org/sf/601077 xmlrpclib ignores CDATA (2002-08-28) http://python.org/sf/601534 some int results that should be bool (2002-08-29) http://python.org/sf/601775 smtplib mishandles empty sender (2002-08-29) http://python.org/sf/602029 configure finds c++ w/o --with-cxx (2002-08-29) http://python.org/sf/602102 Closed Patches -------------- unicode encoding error callbacks (2001-06-12) http://python.org/sf/432401 Pure Python strptime() (PEP 42) (2001-10-23) http://python.org/sf/474274 mimetypes: all extensions for a type (2002-05-09) http://python.org/sf/554192 socketmodule.[ch] downgrade (2002-08-09) http://python.org/sf/593069 email: RFC 2231 parameters encoding (2002-08-26) http://python.org/sf/600096 IDLE [Open module]: import submodules (2002-08-26) http://python.org/sf/600152 Robustness tweak to httplib.py (2002-08-26) http://python.org/sf/600488 obmalloc,structmodule: 64bit, big endian (2002-08-28) http://python.org/sf/601369 expose PYTHON_API_VERSION via sys (2002-08-28) http://python.org/sf/601456 replace_header method for Message class (2002-08-29) http://python.org/sf/601959 sys.path in user.py (2002-08-29) http://python.org/sf/602005 single shared ticker (2002-08-29) http://python.org/sf/602191 From Jack.Jansen@oratrix.com Sun Sep 8 22:51:59 2002 From: Jack.Jansen@oratrix.com (Jack Jansen) Date: Sun, 8 Sep 
2002 23:51:59 +0200 Subject: [Python-Dev] Subsecond time stamps In-Reply-To: <200209072324.g87NOsG15613@pcp02138704pcs.reston01.va.comcast.net> Message-ID: <33811F04-C375-11D6-9BF8-003065517236@oratrix.com> On Sunday, September 8, 2002, at 01:24, Guido van Rossum wrote: > Would it make sense if the portable Python APIs translated everything > to an epoch of 1970 and UTC? That's what the Windows C library does. > Very helpful. (Or is this a problem that's going to disappear with > MacOS X? I presume it uses UTC and I hope its epoch is 1970?) On MacOSX (if you use unix-based Python, not if you use old MacPython) the problem is gone. At least, if you ignore the timestamps returned by mac-specific filesystem routines, but I think we can do that safely. Changing the APIs to return unix-style timestamps is what the GUSI unix-compatible socket and I/O library used by MacPython did originally, but I had to rip it out. The problem was that GUSI did provide all the unix system calls, but not the other library routines that handled timestamps. So these were provided by the Metrowerks C library, which assumes localtime. So ctime() and gmtime() and all its friends did the wrong thing, and I didn't cherish the idea of finding replacements for them. If your suggestion is that every timestamp goes through a conversion routine before being passed from C to Python and through a reverse conversion when it goes from Python to C: yes, that would definitely make sense. -- - Jack Jansen http://www.cwi.nl/~jack - - If I can't dance I don't want to be part of your revolution -- Emma Goldman - From pinard@iro.umontreal.ca Mon Sep 9 14:59:10 2002 From: pinard@iro.umontreal.ca (François Pinard) Date: Mon, 09 Sep 2002 09:59:10 -0400 Subject: [Python-Dev] Codecs lookup order Message-ID: Hi, people.
Happily playing with codecs (using Python 2.2.1), I found out that one should be careful about _not_ naming a module after the encoding name, when closely following the documentation in the Library Reference manual. Here is what I guess is happening. `codecs.register()' appends the search function from the new codec module at end of existing search functions. `codecs.lookup()' tries the search functions in the same order in which they were declared. Consequently, `encodings.lookup()' is tried first. If the encoding does not exist in the cache, `encodings.lookup()' tries to import a module by the name of the encoding, slightly transformed, and will indeed import the new user codec module, because that module has the name of the encoding, and is on the module search path. But now, `encodings.lookup()' expects a `getregentry' function in that module, does not find it, and raises a CodecRegistryError, not leaving a chance to subsequent codec search functions to be used. On the user side, a mere renaming the user module holding the new codec solves the problem. I'm not sure what should best be done. The documentation might be modified to explain the limitation, so other users do not trip up on it. `encoding.lookup()' might merely return None in case `getregentry' is not defined in the imported module, or else, it could make sure that it imports modules exclusively from within the `encodings' package. The best and simplest might be to lookup the code search functions in reverse order of their registration. `encoding.lookup()' would be called last instead of first. It would be easier for the user to override an encoding bundled with the Python distribution, if there is a need to do so. Because the Python Library Reference does not specify yet in which order codec search functions are tried, the order is not frozen yet and it might be easier to change it. 
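The registration order described above can be sketched in a few lines. This is only an illustration written for a current Python 3 interpreter, not the 2.2.1 code under discussion; the encoding name `demo_upper`, the function `demo_search` and the `calls` list are made up for the example. It assumes that the bundled `encodings` search function is registered first, so a user-registered function is consulted only after the bundled one returns None.

```python
import codecs

calls = []  # names our search function was asked about (illustration only)


def demo_search(name):
    """Hypothetical user search function for a made-up 'demo_upper' encoding."""
    calls.append(name)
    if name != "demo_upper":
        return None  # not ours: let any later search function try
    utf8 = codecs.lookup("utf-8")

    def encode(text, errors="strict"):
        # Upper-case the text, then encode it as UTF-8.
        return utf8.encode(text.upper(), errors)

    return codecs.CodecInfo(name="demo_upper", encode=encode, decode=utf8.decode)


# register() appends: the bundled search function keeps its place at the front.
codecs.register(demo_search)

builtin_info = codecs.lookup("utf-8")   # answered before demo_search is reached
encoded = "spam".encode("demo_upper")   # bundled lookup misses, demo_search answers
print(builtin_info.name, encoded, calls)
```

Because the bundled function goes first, a user codec registered this way cannot shadow an encoding shipped with Python, which is the limitation the message points out.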
-- François Pinard http://www.iro.umontreal.ca/~pinard From rledwith@cas.org Mon Sep 9 17:34:18 2002 From: rledwith@cas.org (ledwith@cas.org) Date: Mon, 9 Sep 2002 12:34:18 -0400 (EDT) Subject: [Python-Dev] 64-bit process optimization 1 Message-ID: <20020909123418.AAB25999@cas.org> Hello, This is my first post to Python-Dev. As requested by the list manager I am supplying a brief personal introduction before getting to the topic of this message: I am a Senior Research Scientist at CAS, a branch of the American Chemical Society. I have used Python as my programming language of choice for the last four years. I typically work with large collections of text documents performing analyses of text, computer indexing of text, and information retrieval. I use Python as (1) a general purpose programming language, and (2) a high-level programming language to invoke high-performance C and C++ modules (including Numeric). If I examine my programs by data structures, I would find that they contain mostly: 1. Very large dictionaries using tuples and strings as keys. Guido's essay on Implementing Graphs was the inspiration for my using dictionaries to create very large directed acyclic graphs. 2. Specialized C++ objects to represent inverted lists. 3. Numeric objects for representing vectors and tables of floating point values. My primary computing platforms are four dedicated Sun servers, containing 30 processors, 88GB of RAM and 2TB of DASD. Most of the programs I write require between 1 hour and 27 days to complete. (Obviously, I am an atypical Python user!) During the last three months, I have been forced to migrate from 32-bit python processes to 64-bit processes due to the large number of data points I am analyzing within a single program run. It is my experiences while migrating from 32-bit to 64-bit code that prompted this message.
It is with some trepidation that as the subject of my first posting I am suggesting that Python 2.3 should use a different layout of all Python objects than is defined in Python 2.2.1. Specifically, I have found that changing lines 63-74 of Include/object.h from: #ifdef Py_TRACE_REFS #define PyObject_HEAD \ struct _object *_ob_next, *_ob_prev; \ int ob_refcnt; \ struct _typeobject *ob_type; #define PyObject_HEAD_INIT(type) 0, 0, 1, type, #else /* !Py_TRACE_REFS */ #define PyObject_HEAD \ int ob_refcnt; \ struct _typeobject *ob_type; #define PyObject_HEAD_INIT(type) 1, type, #endif /* !Py_TRACE_REFS */ to: #ifdef Py_TRACE_REFS #define PyObject_HEAD \ struct _object *_ob_next, *_ob_prev; \ struct _typeobject *ob_type; \ int ob_refcnt; #define PyObject_HEAD_INIT(type) 0, 0, type, 1, #else /* !Py_TRACE_REFS */ #define PyObject_HEAD \ struct _typeobject *ob_type; \ int ob_refcnt; #define PyObject_HEAD_INIT(type) type, 1, #endif /* !Py_TRACE_REFS */ significantly improved the performance of my 64-bit processes. Basically, I have just changed the order of the items in PyObject and PyVarObject to avoid gaps due to an "int" being a 4-byte long and aligned types, while "long" and pointers are 8-byte long and aligned types (on 64-bit platforms that conform to the LP64 guideline). For the ILP32 guideline, such as Intel x86 and AMD CPUs, this should have no effect. On the Sun platform on which I live, the changes work for both ILP32 and LP64. For the very large programs I run, the modification saved me 40% execution time. This was probably due to the increased number of Python objects that would fit into the L2 cache, so I don't believe that others would necessarily see as large as a difference with this coding change. Please consider this change for inclusion in the upcoming Python release.
- Bob From aahz@pythoncraft.com Mon Sep 9 18:03:02 2002 From: aahz@pythoncraft.com (Aahz) Date: Mon, 9 Sep 2002 13:03:02 -0400 Subject: [Python-Dev] 64-bit process optimization 1 In-Reply-To: <20020909123418.AAB25999@cas.org> References: <20020909123418.AAB25999@cas.org> Message-ID: <20020909170301.GA8457@panix.com> Without commenting on the merits of your proposal, I can tell you that it'll get lost unless you file a bug report on SourceForge. -- Aahz (aahz@pythoncraft.com) <*> http://www.pythoncraft.com/ Project Vote Smart: http://www.vote-smart.org/ From guido@python.org Mon Sep 9 18:55:21 2002 From: guido@python.org (Guido van Rossum) Date: Mon, 09 Sep 2002 13:55:21 -0400 Subject: [Python-Dev] 64-bit process optimization 1 In-Reply-To: Your message of "Mon, 09 Sep 2002 12:34:18 EDT." <20020909123418.AAB25999@cas.org> References: <20020909123418.AAB25999@cas.org> Message-ID: <200209091755.g89HtLV30441@pcp02138704pcs.reston01.va.comcast.net> > I am suggesting that Python 2.3 should use a different layout of > all Python objects than is defined in Python 2.2.1. > Specifically, I have found that changing lines 63-74 of > Include/object.h from: > > #ifdef Py_TRACE_REFS > #define PyObject_HEAD \ > struct _object *_ob_next, *_ob_prev; \ > int ob_refcnt; \ > struct _typeobject *ob_type; > #define PyObject_HEAD_INIT(type) 0, 0, 1, type, > #else /* !Py_TRACE_REFS */ > #define PyObject_HEAD \ > int ob_refcnt; \ > struct _typeobject *ob_type; > #define PyObject_HEAD_INIT(type) 1, type, > #endif /* !Py_TRACE_REFS */ > > to: > > #ifdef Py_TRACE_REFS > #define PyObject_HEAD \ > struct _object *_ob_next, *_ob_prev; \ > struct _typeobject *ob_type; \ > int ob_refcnt; > #define PyObject_HEAD_INIT(type) 0, 0, type, 1, > #else /* !Py_TRACE_REFS */ > #define PyObject_HEAD \ > struct _typeobject *ob_type; \ > int ob_refcnt; > #define PyObject_HEAD_INIT(type) type, 1, > #endif /* !Py_TRACE_REFS */ > > significantly improved the performance of my 64-bit processes. 
> > Basically, I have just changed the order of the items in > PyObject and PyVarObject to avoid gaps due to an "int" being a > 4-byte long and aligned types, while "long" and pointers are > 8-byte long and aligned types (on 64-bit platforms that conform > to the LP64 guideline). For the ILP32 guideline, such as Intel > x86 and AMD CPUs, this should have no effect. On the Sun > platform on which I live, the changes work for both ILP32 and > LP64. For the very large programs I run, the modification saved > me 40% execution time. This was probably due to the increased > number of Python objects that would fit into the L2 cache, so I > don't believe that others would necessarily see as large as a > difference with this coding change. Interesting! I can see why this makes sense. Strings, lists and tuples all have an int (ob_size) directly following the standard HEAD, and after that something that requires pointer alignment, so that these object types would all save 8 bytes! To wit: string int refcnt, ptr type, int size, long hash, ... ^gap ^gap list int refcnt, ptr type, int size, ptr item* ^gap ^gap tuple int refcnt, ptr type, int size, ptr item[] ^gap ^gap By swapping the first two fields, these gaps would all disappear. The dict object doesn't use ob_size, but starts with an odd number of ints, so the same reasoning shows it would also save 8 bytes. I don't have access to a 64-bit platform to experiment with this. Unfortunately, one problem is binary compatibility. We try to make it possible to link newer Python versions with extension modules (like Numeric, which you use) compiled for older versions. This requires that the binary lay-out of objects remains the same, and swapping ob_refcnt and ob_type would cause immediate crashes in this case. It may be that there are other reasons why binary incompatibilities exist between 2.2 and 2.3 that make this impractical, so perhaps I'm being too conservative here.
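The gap arithmetic above can be checked without touching object.h. The following is a rough sketch using ctypes structures that imitate the two field orders; the class names are invented for the illustration and the structures only stand in for a list-like object (header, ob_size, one pointer), they are not CPython's real definitions. Sizes depend on the platform's ABI, so no particular output is assumed here.

```python
import ctypes


class RefcntFirst(ctypes.Structure):
    # Imitates the 2.2.1 order: int refcount, type pointer, int size, item pointer.
    _fields_ = [
        ("ob_refcnt", ctypes.c_int),
        ("ob_type", ctypes.c_void_p),
        ("ob_size", ctypes.c_int),
        ("ob_item", ctypes.c_void_p),
    ]


class TypeFirst(ctypes.Structure):
    # Imitates the proposed order: the two ints end up adjacent.
    _fields_ = [
        ("ob_type", ctypes.c_void_p),
        ("ob_refcnt", ctypes.c_int),
        ("ob_size", ctypes.c_int),
        ("ob_item", ctypes.c_void_p),
    ]


pointer_size = ctypes.sizeof(ctypes.c_void_p)
old_size = ctypes.sizeof(RefcntFirst)
new_size = ctypes.sizeof(TypeFirst)
print("pointer:", pointer_size, "refcnt first:", old_size, "type first:", new_size)
```

With 4-byte ints and 8-byte, 8-aligned pointers the first layout needs padding after each int, which is where the 8 bytes per object come from; with 4-byte pointers the two layouts come out the same size.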
Another issue is that at least theoretically, on a 64-bit platform, there could be more than 2 billion references to a particular object. E.g. if you have enough memory, the following allocates 3 lists each containing a billion references to None, causing the reference count of None to go negative: A = [] for i in range(3): A.append([None]*1000000000) So perhaps the refcnt should have been a long in the first place. A similar argument may hold for the length of e.g. strings and lists: one could wish to have a list of more than 2 billion elements, or a string containing more than 2 gigabytes (that much RAM is easily found on the larger 64-bit servers, I believe). Opinions? --Guido van Rossum (home page: http://www.python.org/~guido/) From tim.one@comcast.net Mon Sep 9 19:16:18 2002 From: tim.one@comcast.net (Tim Peters) Date: Mon, 09 Sep 2002 14:16:18 -0400 Subject: [Python-Dev] 64-bit process optimization 1 In-Reply-To: <200209091755.g89HtLV30441@pcp02138704pcs.reston01.va.comcast.net> Message-ID: [Guido] > ... > So perhaps the refcnt should have been a long in the first place. We agreed to that years ago, but never bothered to change it. In fact, you used to tell people it *was* a long until I beat that out of you . Do note that a long is still only 4 bytes on Win64. The type we really want here is what pyport.h calls Py_intptr_t (a Python spelling of the appropriate C99 type; C99 introduced ways to say what you really mean in these cases). > A similar argument may hold for the length of e.g. strings and lists: > one could wish to have a list of more than 2 billion elements, or a > string containing more than 2 gigabytes (that much RAM is easily found > on the larger 64-bit servers, I believe). > > Opinions? Those are more naturally addressed by size_t, since strlen and malloc are constrained to that type. 
I generally declare string-slinging code as using size_t vars now, and endure the pain of casting back and forth to int to talk with Python's idea of a string size. Whether it's worth the pain to change this stuff depends on whether we think 64-bit boxes are just another passing fad like the Internet . From python-dev@liveevil.com Mon Sep 9 20:42:56 2002 From: python-dev@liveevil.com (john spurling) Date: Mon, 9 Sep 2002 12:42:56 -0700 Subject: [Python-Dev] raw headers in rfc822.Message Message-ID: <20020909194256.GA13424@c7c8.colobox.com> greetings, since the raw headers don't seem to be available in an rfc822.Message, i added a quick two line hack to populate a rawheaders member. attached is a patch to rfc822.py from the python 2.2.1 distribution. if you don't like my two line hack, consider this a request to provide the raw headers in some way in an rfc822.Message. thanks, john spurling -- "nothing brings people together like doom." --sarah vowell From python-dev@liveevil.com Mon Sep 9 20:52:34 2002 From: python-dev@liveevil.com (john spurling) Date: Mon, 9 Sep 2002 12:52:34 -0700 Subject: [Python-Dev] Re: raw headers in rfc822.Message Message-ID: <20020909195234.GA18807@c7c8.colobox.com> --OXfL5xGRrasGEqWY Content-Type: text/plain; charset=us-ascii Content-Disposition: inline maybe it would help if i actually attached the diff... 
--OXfL5xGRrasGEqWY Content-Type: text/plain; charset=us-ascii Content-Disposition: attachment; filename="rfc822.diff" 139d138 < self.rawheaders = '' 158,159d156 < # Add line to the raw input < self.rawheaders += line --OXfL5xGRrasGEqWY-- From aahz@pythoncraft.com Mon Sep 9 20:52:08 2002 From: aahz@pythoncraft.com (Aahz) Date: Mon, 9 Sep 2002 15:52:08 -0400 Subject: [Python-Dev] raw headers in rfc822.Message In-Reply-To: <20020909194256.GA13424@c7c8.colobox.com> References: <20020909194256.GA13424@c7c8.colobox.com> Message-ID: <20020909195208.GA1662@panix.com> On Mon, Sep 09, 2002, john spurling wrote: > > since the raw headers don't seem to be available in an rfc822.Message, > i added a quick two line hack to populate a rawheaders member. > attached is a patch to rfc822.py from the python 2.2.1 > distribution. File a bug report on SourceForge. -- Aahz (aahz@pythoncraft.com) <*> http://www.pythoncraft.com/ Project Vote Smart: http://www.vote-smart.org/ From zack@codesourcery.com Mon Sep 9 21:09:34 2002 From: zack@codesourcery.com (Zack Weinberg) Date: Mon, 9 Sep 2002 13:09:34 -0700 Subject: [Python-Dev] Re: raw headers in rfc822.Message In-Reply-To: <20020909195234.GA18807@c7c8.colobox.com> References: <20020909195234.GA18807@c7c8.colobox.com> Message-ID: <20020909200934.GA17001@codesourcery.com> On Mon, Sep 09, 2002 at 12:52:34PM -0700, john spurling wrote: > maybe it would help if i actually attached the diff... > > 139d138 > < self.rawheaders = '' > 158,159d156 > < # Add line to the raw input > < self.rawheaders += line You've generated this patch backward, and in a format which makes it useless to us. Please regenerate it with diff -c or diff -u (either is acceptable) and put the newer file _second_ on the command line: diff -u OLD_FILE NEW_FILE. zw From barry@python.org Mon Sep 9 21:11:46 2002 From: barry@python.org (Barry A. 
Warsaw) Date: Mon, 9 Sep 2002 16:11:46 -0400 Subject: [Python-Dev] raw headers in rfc822.Message References: <20020909194256.GA13424@c7c8.colobox.com> Message-ID: <15741.130.736249.914221@anthem.wooz.org> >>>>> "js" == john spurling writes: js> since the raw headers don't seem to be available in an js> rfc822.Message, i added a quick two line hack to populate a js> rawheaders member. attached is a patch to rfc822.py from the js> python 2.2.1 distribution. js> if you don't like my two line hack, consider this a request to js> provide the raw headers in some way in an rfc822.Message. Why not just use email.Message.Message? You can get the original headers from it, and the email package tries really hard to produce output identical to the input. -Barry From barry@barrys-emacs.org Mon Sep 9 23:21:45 2002 From: barry@barrys-emacs.org (Barry Scott) Date: Mon, 9 Sep 2002 23:21:45 +0100 Subject: [Python-Dev] Re: Python-dev summary for 2002-08-15 - 2002-09-01 In-Reply-To: <20020908040224.GA27302@panix.com> Message-ID: <002001c2584f$480a4b10$070210ac@LAPDANCE> > xterm does a nifty job usually of figuring out what to highlight when I > double-click on a word. It fails with mailto: because normally when I > cut'n'paste an address, I *don't* want to include the "mailto:" portion. You can configure xterm to treat : as punctuation and not a word char. See man xterm. BArry From bsder@mail.allcaps.org Mon Sep 9 23:21:49 2002 From: bsder@mail.allcaps.org (Andrew P. Lentvorski) Date: Mon, 9 Sep 2002 15:21:49 -0700 (PDT) Subject: [Python-Dev] Subsecond time stamps In-Reply-To: <200209061530.g86FUeq15029@pcp02138704pcs.reston01.va.comcast.net> Message-ID: <20020909150038.W79275-100000@mail.allcaps.org> On Fri, 6 Sep 2002, Guido van Rossum wrote: > > C. Make st_mtime a floating point number. This won't offer nanosecond > > resolution, as C doubles are not dense enough. > > This is the most Pythonic approach. 
-1 This then locks Python into a specific bit-description notion of a double in order to get the appropriate number of significant digits to describe time sufficiently. Embedded/portable processors may not support the notion of an IEEE double. In addition, timers get increasingly dense as computers get faster. Thus, doubles may work for nanoseconds, but will not be sufficient for picoseconds. If the goal is a field which never has to be changed to support any amount of time, the value should be "infinite precision". At that point, a Python Long used in some tuple representation of fixed-point arithmetic springs to mind. ie. (, ) -a From martin@v.loewis.de Mon Sep 9 23:26:55 2002 From: martin@v.loewis.de (Martin v. Loewis) Date: 10 Sep 2002 00:26:55 +0200 Subject: [Python-Dev] Codecs lookup order In-Reply-To: References: Message-ID: pinard@iro.umontreal.ca (François Pinard) writes: > I'm not sure what should best be done. The documentation might be > modified to explain the limitation, so other users do not trip up on > it. `encoding.lookup()' might merely return None in case > `getregentry' is not defined in the imported module, or else, it > could make sure that it imports modules exclusively from within the > `encodings' package. This is what Python 2.3, and Python 2.2.2 will do. Regards, Martin From martin@v.loewis.de Mon Sep 9 23:33:20 2002 From: martin@v.loewis.de (Martin v. Loewis) Date: 10 Sep 2002 00:33:20 +0200 Subject: [Python-Dev] 64-bit process optimization 1 In-Reply-To: <200209091755.g89HtLV30441@pcp02138704pcs.reston01.va.comcast.net> References: <20020909123418.AAB25999@cas.org> <200209091755.g89HtLV30441@pcp02138704pcs.reston01.va.comcast.net> Message-ID: Guido van Rossum writes: > So perhaps the refcnt should have been a long in the first place. A > similar argument may hold for the length of e.g.
strings and lists: > one could wish to have a list of more than 2 billion elements, or a > string containing more than 2 gigabytes (that much RAM is easily found > on the larger 64-bit servers, I believe). > > Opinions? I agree with that position, and Tim's, that those fields should widen to 64 bits on a 64-bit system. I disagree that size_t is suitable for ob_size, since some types put negative values into ob_size. The signed version of that, ssize_t, is not universally available, so we'd need to add Py_ssize_t. Regards, Martin From aahz@pythoncraft.com Tue Sep 10 00:07:13 2002 From: aahz@pythoncraft.com (Aahz) Date: Mon, 9 Sep 2002 19:07:13 -0400 Subject: [Python-Dev] Cut'n'paste In-Reply-To: <002001c2584f$480a4b10$070210ac@LAPDANCE> References: <20020908040224.GA27302@panix.com> <002001c2584f$480a4b10$070210ac@LAPDANCE> Message-ID: <20020909230713.GA5338@panix.com> On Mon, Sep 09, 2002, Barry Scott wrote: >Aahz: >> >> xterm does a nifty job usually of figuring out what to highlight when I >> double-click on a word. It fails with mailto: because normally when I >> cut'n'paste an address, I *don't* want to include the "mailto:" portion. > > You can configure xterm to treat : as punctuation and not a word > char. See man xterm. Then it would fail with regular URLs. You can't win. ;-) -- Aahz (aahz@pythoncraft.com) <*> http://www.pythoncraft.com/ Project Vote Smart: http://www.vote-smart.org/ From guido@python.org Tue Sep 10 00:06:30 2002 From: guido@python.org (Guido van Rossum) Date: Mon, 09 Sep 2002 19:06:30 -0400 Subject: [Python-Dev] Subsecond time stamps In-Reply-To: Your message of "Mon, 09 Sep 2002 15:21:49 PDT." <20020909150038.W79275-100000@mail.allcaps.org> References: <20020909150038.W79275-100000@mail.allcaps.org> Message-ID: <200209092306.g89N6V806944@pcp02138704pcs.reston01.va.comcast.net> > > > C. Make st_mtime a floating point number. This won't offer nanosecond > > > resolution, as C doubles are not dense enough. 
> > > > This is the most Pythonic approach. > > -1 > > This then locks Python into a specific bit-description notion of a double > in order to get the appropriate number of significant digits to describe > time sufficiently. Embedded/portable processors may not support the > notion of an IEEE double. > > In addition, timers get increasingly dense as computers get faster. Thus, > doubles may work for nanoseconds, but will not be sufficient for > picoseconds. > > If the goal is a field which never has to be changed to support any amount > of time, the value should be "infinite precision". At that point, a > Python Long used in some tuple representation of fixed-point arithmetic > springs to mind. ie. (, ) I'm sorry, but I really don't see the point of wanting to record file mtimes all the way up to nanosecond precision. What would it mean? Most clocks are off by a few seconds at least anyway. Python has represented time as Python floats (implemented as C doubles) all its life long and it has served us well. --Guido van Rossum (home page: http://www.python.org/~guido/) From martin@v.loewis.de Tue Sep 10 00:34:12 2002 From: martin@v.loewis.de (Martin v. Loewis) Date: 10 Sep 2002 01:34:12 +0200 Subject: [Python-Dev] Subsecond time stamps In-Reply-To: <20020909150038.W79275-100000@mail.allcaps.org> References: <20020909150038.W79275-100000@mail.allcaps.org> Message-ID: "Andrew P. Lentvorski" writes: > This then locks Python into a specific bit-description notion of a double > in order to get the appropriate number of significant digits to describe > time sufficiently. Embedded/portable processors may not support the > notion of an IEEE double. That's not true. Suppose you have two fields, tv_sec and tv_nsec. Then the resulting float expression is tv_sec + 1e-9 * tv_nsec; This expression works on all systems that support floating point numbers - be it IEEE or not. > In addition, timers get increasingly dense as computers get faster.
> Thus, doubles may work for nanoseconds, but will not be sufficient > for picoseconds. At the same time, floating point numbers get increasingly more accurate as computer registers widen. In a 64-bit float, you can just barely express 1e-7s (if you base the era at 1970); with a 128-bit float, you can express 1e-20s easily. > If the goal is a field which never has to be changed to support any amount > of time, the value should be "infinite precision". No, just using floating point numbers is sufficient. Notice that time.time() also returns a floating point number. > At that point, a Python Long used in some tuple representation of > fixed-point arithmetic springs to mind. ie. (, fractional point>) Yes, when/if Python gets rational numbers, or decimal fixed-or-floating point numbers, those data types might represent the value that the system reports more accurately. At that time, there will be a transition plan to introduce those numbers at all places where it is reasonable, with as little impact on applications as possible. Regards, Martin From brian@sweetapp.com Tue Sep 10 00:55:15 2002 From: brian@sweetapp.com (Brian Quinlan) Date: Mon, 09 Sep 2002 16:55:15 -0700 Subject: [Python-Dev] Subsecond time stamps In-Reply-To: Message-ID: <01b501c2585c$584b4a80$df7e4e18@brianspiv1700> MvL wrote: > That's not true. Suppose you have two fields, tv_sec and tv_nsec. Then > the resulting float expression is > > tv_sec + 1e-9 * tv_nsec; > > This expression works on all systems that support floating point > numbers - be it IEEE or not. Don't you have to truncate tv_sec for that to work? i.e. Truncate(tv_sec, 9) + 1e-9 * tv_nsec Cheers, Brian From drifty@bigfoot.com Tue Sep 10 01:25:56 2002 From: drifty@bigfoot.com (Brett Cannon) Date: Mon, 9 Sep 2002 17:25:56 -0700 (PDT) Subject: [Python-Dev] Cut'n'paste In-Reply-To: <20020909230713.GA5338@panix.com> Message-ID: [Aahz] > > You can configure xterm to treat : as punctuation and not a word > > char. See man xterm.
> > Then it would fail with regular URLs. You can't win. ;-) I am now officially ignoring any more comments on how to format URLs and email addresses in the summary. Aahz is right, "You can't win" and thus I am not going to bother to try to please everyone. I will just do it the way I feel like it and if someone doesn't like it they can just reformat the code with a regex to make themselves happy. Now I know how Guido must have felt with everyone and their mother throwing in their opinion about booleans. =) -Brett From tim.one@comcast.net Tue Sep 10 02:21:11 2002 From: tim.one@comcast.net (Tim Peters) Date: Mon, 09 Sep 2002 21:21:11 -0400 Subject: [Python-Dev] Cut'n'paste In-Reply-To: Message-ID: [Brett Cannon] > I am now officially ignoring any more comments on how to format URLs > and email addresses in the summary. What?! I didn't get around to insisting that you use XML for this, with one UTF8-encoded character per element thingie. > Aahz is right, "You can't win" and thus I am not going to bother to try > to please everyone. I will just do it the way I feel like it and if > someone doesn't like it they can just reformat the code with a regex > to make themselves happy. Such a small-minded attitude, Brett. OTOH, it may preserve a bit of your life for something enjoyable. > Now I know how Guido must have felt with everyone and their mother > throwing in their opinion about booleans. =) Not until you're accused of destroying all that's good about Python, going out of your way to make it impossible to teach programming, and most likely breaking every important Python program ever written. It will take several years for you to earn that level of abuse . 
no-good-deed-goes-unpunished-ly y'rs - tim From python@rcn.com Tue Sep 10 04:07:11 2002 From: python@rcn.com (Raymond Hettinger) Date: Mon, 9 Sep 2002 23:07:11 -0400 Subject: [Python-Dev] Cut'n'paste References: Message-ID: <001501c25877$29c19460$a661accf@othello> From: "Brett Cannon" > I am now officially ignoring any more comments on how to format URLs and > email addresses in the summary. Aahz is right, "You can't win" and thus I > am not going to bother to try to please everyone. I will just do it the > way I feel like it and if someone doesn't like it they can just reformat > the code with a regex to make themselves happy. > > Now I know how Guido must have felt with everyone and their mother > throwing in their opinion about booleans. =) BTW, my mother would have wanted spaces as delimiters. Raymond Hettinger From martin@v.loewis.de Tue Sep 10 07:30:02 2002 From: martin@v.loewis.de (Martin v. Loewis) Date: 10 Sep 2002 08:30:02 +0200 Subject: [Python-Dev] Subsecond time stamps In-Reply-To: <01b501c2585c$584b4a80$df7e4e18@brianspiv1700> References: <01b501c2585c$584b4a80$df7e4e18@brianspiv1700> Message-ID: Brian Quinlan writes: > > tv_sec + 1e-9 * tv_nsec; > > [...] > Don't you have to truncate tv_sec for that to work? i.e. > > Truncate(tv_sec, 9) + 1e-9 * tv_nsec What is Truncate, and why would I need it? Regards, Martin From brian@sweetapp.com Tue Sep 10 07:50:09 2002 From: brian@sweetapp.com (Brian Quinlan) Date: Mon, 09 Sep 2002 23:50:09 -0700 Subject: [Python-Dev] Subsecond time stamps In-Reply-To: Message-ID: <01cd01c25896$4e301d70$df7e4e18@brianspiv1700> > What is Truncate, and why would I need it? You wouldn't need it because I misunderstood the problem. Sorry. 
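For the expression tv_sec + 1e-9 * tv_nsec discussed in this thread, the resolution a double offers near a 2002 timestamp can be measured directly. The sketch below is illustrative only: the helper names `spacing` and `as_float_seconds` and the sample values are made up for the example, and it assumes the platform float is the usual 53-bit-mantissa double.

```python
import math
import sys


def spacing(x):
    """Gap between x and the next larger float (x assumed positive and normal)."""
    _, exponent = math.frexp(x)  # x == m * 2**exponent with 0.5 <= m < 1
    return math.ldexp(1.0, exponent - sys.float_info.mant_dig)


def as_float_seconds(tv_sec, tv_nsec):
    """Combine the two integer fields exactly as in the expression above."""
    return tv_sec + 1e-9 * tv_nsec


# Hypothetical sample values: a timestamp in September 2002 plus a nanosecond part.
tv_sec, tv_nsec = 1031600000, 123456789

stamp = as_float_seconds(tv_sec, tv_nsec)
resolution = spacing(float(tv_sec))
recovered_nsec = round((stamp - tv_sec) * 1e9)

print("float resolution near tv_sec, in seconds:", resolution)
print("nanoseconds lost in the round trip:", abs(recovered_nsec - tv_nsec))
```

No truncation of tv_sec is needed; the only loss is that the nanosecond part is rounded to the nearest representable step, which is on the order of 1e-7 s for timestamps of this size.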
Cheers, Brian From fredrik@pythonware.com Tue Sep 10 09:30:37 2002 From: fredrik@pythonware.com (Fredrik Lundh) Date: Tue, 10 Sep 2002 10:30:37 +0200 Subject: [Python-Dev] 64-bit process optimization 1 References: <20020909123418.AAB25999@cas.org> <200209091755.g89HtLV30441@pcp02138704pcs.reston01.va.comcast.net> Message-ID: <001501c258a4$6f889ca0$ced241d5@hagrid> guido wrote: > Unfortunately, one problem is binary compatibility. We try to make it > possible to link newer Python versions with extension modules (like > Numeric, which you use) compiled for older versions. This requires > that the binary lay-out of objects remains the same, and swapping > ob_refcnt and ob_type would cause immediate crashes in this case. a compromise could be to make the swap in 2.3, but only on 64-bit platforms. it's obvious that most people are stuck on 32-bit platforms today, and I think it's safe to say that users on 64-bit plat- forms might be a bit more willing to build everything they need on their local platform. another alternative would be to make it a configuration option, with a platform-dependent default. From Anthony Baxter Tue Sep 10 11:06:31 2002 From: Anthony Baxter (Anthony Baxter) Date: Tue, 10 Sep 2002 20:06:31 +1000 Subject: [Python-Dev] Subsecond time stamps In-Reply-To: <200209092306.g89N6V806944@pcp02138704pcs.reston01.va.comcast.net> Message-ID: <200209101006.g8AA6Vb28742@localhost.localdomain> >>> Guido van Rossum wrote > I'm sorry, but I really don't see the point of wanting to record file > mtimes all the way up to nanosecond precision. What would it mean? > Most clocks are off by a few seconds at least anyway. Not only that, but if you're that precise, are you measuring the time when the modification started, the time when it started hitting the disks, when the write on the disk completed, when the O/S signalled to the application that the modification was complete... questions questions.. 
.:) From guido@python.org Tue Sep 10 14:54:58 2002 From: guido@python.org (Guido van Rossum) Date: Tue, 10 Sep 2002 09:54:58 -0400 Subject: [Python-Dev] 64-bit process optimization 1 In-Reply-To: Your message of "Tue, 10 Sep 2002 10:30:37 +0200." <001501c258a4$6f889ca0$ced241d5@hagrid> References: <20020909123418.AAB25999@cas.org> <200209091755.g89HtLV30441@pcp02138704pcs.reston01.va.comcast.net> <001501c258a4$6f889ca0$ced241d5@hagrid> Message-ID: <200209101354.g8ADswV23058@odiug.zope.com> > > Unfortunately, one problem is binary compatibility. We try to make it > > possible to link newer Python versions with extension modules (like > > Numeric, which you use) compiled for older versions. This requires > > that the binary lay-out of objects remains the same, and swapping > > ob_refcnt and ob_type would cause immediate crashes in this case. > > a compromise could be to make the swap in 2.3, but only > on 64-bit platforms. > > it's obvious that most people are stuck on 32-bit platforms > today, and I think it's safe to say that users on 64-bit plat- > forms might be a bit more willing to build everything they > need on their local platform. > > another alternative would be to make it a configuration option, > with a platform-dependent default. I like all of that. Maybe it should also be a config option whether refcount, sizes etc. should be 32 or 64 bit quantities on 64 bit platforms. --Guido van Rossum (home page: http://www.python.org/~guido/) From mcherm@destiny.com Tue Sep 10 15:18:38 2002 From: mcherm@destiny.com (Michael Chermside) Date: Tue, 10 Sep 2002 10:18:38 -0400 Subject: [Python-Dev] Re: raw headers in rfc822.Message Message-ID: <3D7DFF3E.3030200@destiny.com> Zack Weinberg writes: > You've generated this patch backward, and in a format which makes it > useless to us. Please regenerate it with diff -c or diff -u (either > is acceptable) and put the newer file _second_ on the command line: > diff -u OLD_FILE NEW_FILE. 
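Zack's point about diff direction (old file first, new file second) can be illustrated with a minimal sketch in present-day Python using the standard `difflib` module. The helper name `forward_diff` and the sample file contents are hypothetical, chosen only for illustration; the thread itself refers to the command-line `diff -u OLD_FILE NEW_FILE`, not to this code.

```python
import difflib


def forward_diff(old_text, new_text, old_name="OLD_FILE", new_name="NEW_FILE"):
    """Return a unified diff that turns old_text into new_text.

    The old version goes first and the new version second, mirroring
    the argument order of `diff -u OLD_FILE NEW_FILE`.
    """
    old_lines = old_text.splitlines(keepends=True)
    new_lines = new_text.splitlines(keepends=True)
    return "".join(
        difflib.unified_diff(old_lines, new_lines,
                             fromfile=old_name, tofile=new_name)
    )


# Hypothetical file contents, for illustration only.
old = "spam\neggs\n"
new = "spam\nham\n"
patch = forward_diff(old, new)
print(patch)
```

In a forward diff, lines removed from the old file are prefixed with `-` and lines added in the new file with `+`; swapping the two arguments would produce the "backward" patch Zack describes.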
It wasn't all that long ago that I submitted my first patch (of documentation, not code) to SourceForge. It took me > 20 minutes of careful web searching to figure out the desired way of submitting files and the correct way to generate that. And I still wasn't 100% sure I was generating the diff in the correct direction. Couldn't Zack's comment be added to the directions found at https://sourceforge.net/tracker/?func=add&group_id=5470&atid=305470 so that anyone submitting a patch would see how to do it. (But of course that wouldn't have helped THIS person, who didn't use sourceforge... :-( ) -- Michael Chermside From guido@python.org Tue Sep 10 15:26:48 2002 From: guido@python.org (Guido van Rossum) Date: Tue, 10 Sep 2002 10:26:48 -0400 Subject: [Python-Dev] Re: raw headers in rfc822.Message In-Reply-To: Your message of "Tue, 10 Sep 2002 10:18:38 EDT." <3D7DFF3E.3030200@destiny.com> References: <3D7DFF3E.3030200@destiny.com> Message-ID: <200209101426.g8AEQm023271@odiug.zope.com> > Zack Weinberg writes: > > You've generated this patch backward, and in a format which makes it > > useless to us. Please regenerate it with diff -c or diff -u (either > > is acceptable) and put the newer file _second_ on the command line: > > diff -u OLD_FILE NEW_FILE. [Michael Chermside] > It wasn't all that long ago that I submitted my first patch (of > documentation, not code) to SourceForge. It took me > 20 minutes of > careful web searching to figure out the desired way of submitting files > and the correct way to generate that. And I still wasn't 100% sure I was > generating the diff in the correct direction. > > Couldn't Zack's comment be added to the directions found at > https://sourceforge.net/tracker/?func=add&group_id=5470&atid=305470 > so that anyone submitting a patch would see how to do it. I guess we're assuming that even people who aren't familiar with SourceForge are familiar with diff. Is that not a reasonable assumption any more? 
There's also the developer FAQ, which has careful instructions for patch generation at http://www.python.org/dev/devfaq.html#patches and in addition points to http://www.python.org/patches/ which has everything you need (except the hint about forward diffs; I'll add that). --Guido van Rossum (home page: http://www.python.org/~guido/) From Jack.Jansen@cwi.nl Tue Sep 10 15:39:37 2002 From: Jack.Jansen@cwi.nl (Jack Jansen) Date: Tue, 10 Sep 2002 16:39:37 +0200 Subject: [Python-Dev] Re: raw headers in rfc822.Message In-Reply-To: <200209101426.g8AEQm023271@odiug.zope.com> Message-ID: <2167CDEB-C4CB-11D6-911E-0030655234CE@cwi.nl> On Tuesday, September 10, 2002, at 04:26 , Guido van Rossum wrote: >> Couldn't Zack's comment be added to the directions found at >> https://sourceforge.net/tracker/?func=add&group_id=5470&atid=305470 >> so that anyone submitting a patch would see how to do it. > > I guess we're assuming that even people who aren't familiar with > SourceForge are familiar with diff. Is that not a reasonable > assumption any more? Not cross-platform. I've had patches for MacPython in rather outlandish diff-like formats, so a note that tells people to use the unix diff program wouldn't hurt. -- - Jack Jansen http://www.cwi.nl/~jack - - If I can't dance I don't want to be part of your revolution -- Emma Goldman - From guido@python.org Tue Sep 10 15:41:40 2002 From: guido@python.org (Guido van Rossum) Date: Tue, 10 Sep 2002 10:41:40 -0400 Subject: [Python-Dev] Re: raw headers in rfc822.Message In-Reply-To: Your message of "Tue, 10 Sep 2002 16:39:37 +0200." <2167CDEB-C4CB-11D6-911E-0030655234CE@cwi.nl> References: <2167CDEB-C4CB-11D6-911E-0030655234CE@cwi.nl> Message-ID: <200209101441.g8AEfeW23387@odiug.zope.com> > > I guess we're assuming that even people who aren't familiar with > > SourceForge are familiar with diff. Is that not a reasonable > > assumption any more? > > Not cross-platform. 
I've had patches for MacPython in rather > outlandish diff-like formats, so a note that tells people to use the > unix diff program wouldn't hurt. But what good does a reference to "the unix diff program" do a Mac developer? --Guido van Rossum (home page: http://www.python.org/~guido/) From aahz@pythoncraft.com Tue Sep 10 15:48:19 2002 From: aahz@pythoncraft.com (Aahz) Date: Tue, 10 Sep 2002 10:48:19 -0400 Subject: [Python-Dev] Writing patches In-Reply-To: <200209101426.g8AEQm023271@odiug.zope.com> References: <3D7DFF3E.3030200@destiny.com> <200209101426.g8AEQm023271@odiug.zope.com> Message-ID: <20020910144818.GA13037@panix.com> On Tue, Sep 10, 2002, Guido van Rossum wrote: > > There's also the developer FAQ, which has careful instructions for > patch generation at > > http://www.python.org/dev/devfaq.html#patches > > and in addition points to http://www.python.org/patches/ which has > everything you need (except the hint about forward diffs; I'll add > that). Perhaps the "patches" link at http://www.python.org/ should point at either DevFAQ#patches or the patches page. (That was my original intention in not linking directly to SF -- you're the one who added the direct links.) The question IMO is whether those links are for the benefit of core developers or newbies. I'm +1 on the latter. -- Aahz (aahz@pythoncraft.com) <*> http://www.pythoncraft.com/ Project Vote Smart: http://www.vote-smart.org/ From skip@pobox.com Tue Sep 10 15:52:53 2002 From: skip@pobox.com (Skip Montanaro) Date: Tue, 10 Sep 2002 09:52:53 -0500 Subject: [Python-Dev] Re: raw headers in rfc822.Message In-Reply-To: <3D7DFF3E.3030200@destiny.com> References: <3D7DFF3E.3030200@destiny.com> Message-ID: <15742.1861.180590.431080@12-248-11-90.client.attbi.com> Michael> Couldn't Zack's comment be added to the directions found at Michael> https://sourceforge.net/tracker/?func=add&group_id=5470&atid=305470 Michael> so that anyone submitting a patch would see how to do it. 
On that page there's a link entitled "See our hints on how to create a patch." This links to http://www.python.org/patches/ which has, I think, the required details. -- Skip Montanaro skip@pobox.com consulting: http://manatee.mojam.com/~skip/resume.html From guido@python.org Tue Sep 10 15:59:13 2002 From: guido@python.org (Guido van Rossum) Date: Tue, 10 Sep 2002 10:59:13 -0400 Subject: [Python-Dev] Re: raw headers in rfc822.Message In-Reply-To: Your message of "Tue, 10 Sep 2002 09:52:53 CDT." <15742.1861.180590.431080@12-248-11-90.client.attbi.com> References: <3D7DFF3E.3030200@destiny.com> <15742.1861.180590.431080@12-248-11-90.client.attbi.com> Message-ID: <200209101459.g8AExD323473@odiug.zope.com> > Michael> Couldn't Zack's comment be added to the directions found at > Michael> https://sourceforge.net/tracker/?func=add&group_id=5470&atid=305470 > Michael> so that anyone submitting a patch would see how to do it. > > On that page there's a link entitled "See our hints on how to create a > patch." This links to > > http://www.python.org/patches/ > > which has, I think, the required details. I added that link a few minutes ago. :-) --Guido van Rossum (home page: http://www.python.org/~guido/) From mcherm@destiny.com Tue Sep 10 16:02:23 2002 From: mcherm@destiny.com (Michael Chermside) Date: Tue, 10 Sep 2002 11:02:23 -0400 Subject: [Python-Dev] Re: raw headers in rfc822.Message Message-ID: <3D7E097F.7000003@destiny.com> >> On that page there's a link entitled "See our hints on how to create a >> patch." This links to >> >> http://www.python.org/patches/ >> >> which has, I think, the required details. > > I added that link a few minutes ago. :-) > > --Guido van Rossum (home page: http://www.python.org/~guido/) I think that's a great fix. Thanks! 
-- Michael Chermside From thomas@xs4all.net Tue Sep 10 16:12:53 2002 From: thomas@xs4all.net (Thomas Wouters) Date: Tue, 10 Sep 2002 17:12:53 +0200 Subject: [Python-Dev] Re: [Python-checkins] python/dist/src/Mac/Include getapplbycreator.h,1.3,1.4 macdefs.h,1.11,1.12 macglue.h,1.61,1.62 pythonresources.h,1.27,1.28 In-Reply-To: References: Message-ID: <20020910151252.GA830@xs4all.nl> On Tue, Sep 10, 2002 at 05:32:49AM -0700, jackjansen@users.sourceforge.net wrote: > Modified Files: > getapplbycreator.h macdefs.h macglue.h pythonresources.h > Log Message: > Added include guards and C++ extern "C" {} constructs. Partial fix for #607253. > Bugfix candidate. [..] > *** getapplbycreator.h 19 May 2001 12:32:39 -0000 1.3 > --- getapplbycreator.h 10 Sep 2002 12:32:47 -0000 1.4 [..] > ******************************************************************/ > + #ifndef Py_GETAPPLBYCREATOR_H > + #define Py_GETALLPBYCREATOR_H This looks suspiciously like a bug. If you really do intend to #define something different than you just checked against, you should add a comment stating that this really isn't a typo of a very common idiom :) I'm-not-dead--I-feel-fine-ly y'rs, -- Thomas Wouters Hi! I'm a .signature virus! copy me into your .signature file to help me spread! From pinard@iro.umontreal.ca Tue Sep 10 16:26:56 2002 From: pinard@iro.umontreal.ca (=?iso-8859-1?q?Fran=E7ois?= Pinard) Date: Tue, 10 Sep 2002 11:26:56 -0400 Subject: [Python-Dev] Re: Codecs lookup order In-Reply-To: (martin@v.loewis.de's message of "10 Sep 2002 00:26:55 +0200") References:

Message-ID: [Martin v. Loewis] > pinard@iro.umontreal.ca (François Pinard) writes: >> I'm not sure what should best be done. 1) The documentation might be >> modified to explain the limitation, so other users do not trip up on it. >> 2) `encoding.lookup()' might merely return None in case `getregentry' is >> not defined in the imported module, or else, 3) it could make sure that it >> imports modules exclusively from within the `encodings' package. > This is what Python 2.3, and Python 2.2.2 will do. Hi, Martin. I added "1)", "2)" and "3)" in the original text for clarity. Will Python 2.2.2 and 2.3 do "3)", or all of "1)", "2)" and "3)"? If the codec search order is not changed, how one proceeds if s/he wants to override a bundled codec, with a provided other with the same encoding name? -- François Pinard http://www.iro.umontreal.ca/~pinard From xscottg@yahoo.com Tue Sep 10 17:15:04 2002 From: xscottg@yahoo.com (Scott Gilbert) Date: Tue, 10 Sep 2002 09:15:04 -0700 (PDT) Subject: [Python-Dev] 64-bit process optimization 1 In-Reply-To: <200209101354.g8ADswV23058@odiug.zope.com> Message-ID: <20020910161504.10967.qmail@web40110.mail.yahoo.com> --- Guido wrote: > > > > a compromise could be to make the swap in 2.3, but only > > on 64-bit platforms. > > > > it's obvious that most people are stuck on 32-bit platforms > > today, and I think it's safe to say that users on 64-bit plat- > > forms might be a bit more willing to build everything they > > need on their local platform. > > > > another alternative would be to make it a configuration option, > > with a platform-dependent default. > > I like all of that. Maybe it should also be a config option whether > refcount, sizes etc. should be 32 or 64 bit quantities on 64 bit > platforms. > +1 from this 64 bit user. __________________________________________________ Yahoo! 
- We Remember 9-11: A tribute to the more than 3,000 lives lost http://dir.remember.yahoo.com/tribute From barry@python.org Tue Sep 10 18:00:32 2002 From: barry@python.org (Barry A. Warsaw) Date: Tue, 10 Sep 2002 13:00:32 -0400 Subject: [Python-Dev] The first trustworthy GBayes results References: <15726.13053.111171.335483@12-248-11-90.client.attbi.com> <200208291631.g7TGVgd28718@localhost.localdomain> Message-ID: <15742.9520.698662.836695@anthem.wooz.org> >>>>> "AB" == Anthony Baxter writes: >> Skip Montanaro wrote >> One thing worth noting before everybody starts using it to >> massage their mailboxes is that the email package contains a >> bug which causes it to occasionally delete whitespace when >> reformatting headers. BTW, I fixed Greg's problem but not Skip's. I'm still looking at this one... AB> There's one other known problem - seriously misformatted MIME AB> (as seen in spam, and email from Microsoft Entourage) causes AB> the email package to barf out. I plan, at some point, to try AB> and make a "if it fails, just leave the body as one chunk of AB> text" mode, but it's a long long way down my list of AB> priorities. I just checked this into cvs. -Barry From martin@v.loewis.de Tue Sep 10 19:25:16 2002 From: martin@v.loewis.de (Martin v. Loewis) Date: 10 Sep 2002 20:25:16 +0200 Subject: [Python-Dev] Subsecond time stamps In-Reply-To: <200209101006.g8AA6Vb28742@localhost.localdomain> References: <200209101006.g8AA6Vb28742@localhost.localdomain> Message-ID: Anthony Baxter writes: > Not only that, but if you're that precise, are you measuring the time > when the modification started, the time when it started hitting the > disks, when the write on the disk completed, when the O/S signalled > to the application that the modification was complete... questions > questions.. .:) For Python, these questions are easy to answer: We just report to the application what the system reports to us. 
It is the file system implementor's job to define the notion of modification time. Regards, Martin From martin@v.loewis.de Tue Sep 10 19:26:06 2002 From: martin@v.loewis.de (Martin v. Loewis) Date: 10 Sep 2002 20:26:06 +0200 Subject: [Python-Dev] Re: Codecs lookup order In-Reply-To: References:

Sorry for interrupting you - click refuse for no more mail...
- Welcome to NabiTel's software products and portal services -
Software Products
	Web Robot: also called web spider or web crawler, collects useful web page information by navigating world wide web sites. Download free trial version now!
	eMail ID Collector: Collects email IDs publicly opened on various web pages, with good intention. Download free trial version now!
Portal Services
	Web Portal: Do you have your own home page and want to broadcast it all over the world? Register your home page to NabiTel Portal Now!! (nabi=a butterfly) Register your home page now, it's free!
	Automobiles: Do you want to sell or buy automobiles? Cars, trucks, limos, airplanes, ships,.... All That Cars are here! Register your vehicles now, it's free!
	Computers: Do you want to sell or buy computers? PCs, printers, scanners, servers, mainframes, ... All That Computers are here! Register your computers now, it's free!
	Food & Restaurants: Are you seeking for a nice place to eat? Or do you run a restaurant? Foods of the world, restaurants of the world, .... All That Foods are here! Register your restaurant now, it's free!
Have a nice day. Thank you.