From guido at python.org Fri Sep 1 00:04:50 2006 From: guido at python.org (Guido van Rossum) Date: Thu, 31 Aug 2006 15:04:50 -0700 Subject: [Python-3000] Making more effective use of slice objects in Py3k In-Reply-To: <1cb725390608311313h4eac0f98x85a0690d3082b533@mail.gmail.com> References: <20060827184941.1AE8.JCARLSON@uci.edu> <20060829102307.1B0F.JCARLSON@uci.edu> <6a36e7290608302056v4b0e68abrfe0c5b1fc927ff@mail.gmail.com> <20060831044354.GH6257@performancedrivers.com> <44F72E75.2050204@acm.org> <1cb725390608311313h4eac0f98x85a0690d3082b533@mail.gmail.com> Message-ID: (Adding back py3k list assuming you just forgot it) On 8/31/06, Paul Prescod wrote: > On 8/31/06, Guido van Rossum wrote: > > > > (The difference between UCS-2 and UTF-16 is that UCS-2 is always 2 bytes > > > per character, and doesn't support the supplemental characters above > > > 0xffff, whereas UTF-16 characters can be either 2 or 4 bytes.) > > > > I think we should also support UTF-16, since Java and .NET (and > > Win32?) appear to be using effectively; making surrogate handling an > > application issue doesn't seem *too* big of a burden for many apps. > > I think that the reason that UTF-16 seems "not too big of a burden" is > because people just ignore the UTF-16-ness of the data and hope that people > don't use those characters. In effect they trade correctness and > internationalization for simplicity and performance. It seems like it may > become a bigger issue as time goes by. Well there's a large class of apps that don't do anything for which surrogates matter, since they just copy strings around and only split them at specific characters. E.g. parsing XML would often fall in this category. > Plus, it sounds like you're proposing that the encodings of the underlying > data would leak through to the application. As I understood Fredrick's > model, the intention was to treat the encoding as an implementation detail. > If it works well, this could be an important differentiator for Python > (versus Java) as Unicode already is (versus Ruby). *Only* for UTF-16, which I consider a necessary evil since we can't rewrite the Java and .NET standards. > So my basic feeling is that if we're going to hide UTF-8 from the programmer > then we might as well go the extra mile and hide UTF-16 as well. I don't think the issues are the same. -- --Guido van Rossum (home page: http://www.python.org/~guido/) From brett at python.org Fri Sep 1 00:18:17 2006 From: brett at python.org (Brett Cannon) Date: Thu, 31 Aug 2006 15:18:17 -0700 Subject: [Python-3000] Exception Expressions In-Reply-To: <76fd5acf0608311450r6fbddd44n28ab6f83741b8699@mail.gmail.com> References: <76fd5acf0608311042k231fb36w1bf5d1e7e4eebe0c@mail.gmail.com> <76fd5acf0608311450r6fbddd44n28ab6f83741b8699@mail.gmail.com> Message-ID: On 8/31/06, Calvin Spealman wrote: > > On 8/31/06, Brett Cannon wrote: > > So this feels like the Perl idiom of using die: ``open(file) or die`` > (or > > something like that; I have never been a Perl guy so I could be off). > > > > > ... > > > > The problem I have with this whole proposal is that catching exceptions > > should be very obvious in the source code. This proposal does not help > with > > that ideal. So I am -1 on the whole idea. > > > > -Brett > > "Ouch" on the associated my idea with perl! =) The truth hurts. Although I agree that it is good to be obvious about exceptions, there > are some cases when they are simply less than exceptional. 
For > example, you can do d.get(key, default) if you know something is a > dictionary, but for general mappings you can't rely on that, and may > often use exceptions as a kind of logic control. No, that doesn't sync > with the purity of exceptions, but sometimes practicality and > real-world usage trumps theory. Practicality most definitely beats purity, but I don't see the practicality of this over what we already have. Only allowing a single expression, it shouldn't be able to get ugly. Famous last words. Remember a big argument against the 'if' expressions was about them getting too unwieldy in terms of length and obscuring the fact that it is a conditional. I have used 'if' expressions and they have been hard to keep very readable unless you are willing to use parentheses and make them unreadable. I would be afraid of this happening here, but to an even more important construct that should always be easy to spot in source code. -Brett -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.python.org/pipermail/python-3000/attachments/20060831/f41c8b7d/attachment.htm From walter at livinglogic.de Fri Sep 1 00:24:35 2006 From: walter at livinglogic.de (=?ISO-8859-1?Q?Walter_D=F6rwald?=) Date: Fri, 01 Sep 2006 00:24:35 +0200 Subject: [Python-3000] Comment on iostack library In-Reply-To: <1d85506f0608311443s108822c1n31682ba765b2f3e0@mail.gmail.com> References: <1d85506f0608311443s108822c1n31682ba765b2f3e0@mail.gmail.com> Message-ID: <44F761A3.5060009@livinglogic.de> tomer filiba wrote: > [...] > besides, encoding suffers from many issues. suppose you have a > damaged UTF8 file, which you read char-by-char. when we reach the > damaged part, you'll never be able to "skip" it, as we'll just keep > read()ing bytes, hoping to make a character out of it, until we > reach EOF, i.e.: > > def read_char(self): > buf = "" > while not self._stream.eof: > buf += self._stream.read(1) > try: > return buf.decode("utf8") > except ValueError: > pass > > which leads me to the following thought: maybe we should have > an "enhanced" encoding library for py3k, which would report > *incomplete* data differently from *invalid* data. today it's just a > ValueError: suppose decode() would raise IncompleteDataError > when the given data is not sufficient to be decoded successfully, > and ValueError when the data is just corrupted. > > that could aid iostack greatly. We *do* have that functionality in Python 2.5: incremental decoders can retain incomplete byte sequences on the call to the decode() method until the next call. Only when final=True is passed in the decode() call will it treat incomplete and invalid data in the same way: by raising an exception.
Incomplete input: >>> import codecs >>> d = codecs.lookup("utf-8").incrementaldecoder() >>> d.decode("\xe1") u'' >>> d.decode("\x88") u'' >>> d.decode("\xb4") u'\u1234' Invalid input: >>> import codecs >>> d = codecs.lookup("utf-8").incrementaldecoder() >>> d.decode("\x80") Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/var/home/walter/checkouts/Python/test/Lib/codecs.py", line 256, in decode (result, consumed) = self._buffer_decode(data, self.errors, final) UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: unexpected code byte Incomplete input with final=True: >>> import codecs >>> d = codecs.lookup("utf-8").incrementaldecoder() >>> d.decode("\xe1", final=True) Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/var/home/walter/checkouts/Python/test/Lib/codecs.py", line 256, in decode (result, consumed) = self._buffer_decode(data, self.errors, final) UnicodeDecodeError: 'utf8' codec can't decode byte 0xe1 in position 0: unexpected end of data Servus, Walter From greg.ewing at canterbury.ac.nz Fri Sep 1 04:39:37 2006 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Fri, 01 Sep 2006 14:39:37 +1200 Subject: [Python-3000] Exception Expressions In-Reply-To: <76fd5acf0608311042k231fb36w1bf5d1e7e4eebe0c@mail.gmail.com> References: <76fd5acf0608311042k231fb36w1bf5d1e7e4eebe0c@mail.gmail.com> Message-ID: <44F79D69.6090909@canterbury.ac.nz> Calvin Spealman wrote: > Other example use cases: > > # Fallback on an alternative path > > # Handle divide-by-zero or get by with index() instead of find(): s.index("foo") except -1 if IndexError # :-) > open(filename) except open(filename2) if IOError One problem is that it doesn't seem to chain all that well. Suppose you had three files to try opening: open(name1) except (open(name2) except open(name3) if IOError) if IOError Maybe it would be better if the exception type and alternative expression were swapped over. Then you could write open(name1) except IOError then open(name2) except IOError then open(name3) Still rather unwieldy though. -0.7j, I think (the j to acknowledge that this is an imaginary proposal.:-) -- Greg Ewing, Computer Science Dept, +--------------------------------------+ University of Canterbury, | Carpe post meridiem! | Christchurch, New Zealand | (I'm not a morning person.) | greg.ewing at canterbury.ac.nz +--------------------------------------+ From david.nospam.hopwood at blueyonder.co.uk Fri Sep 1 04:53:21 2006 From: david.nospam.hopwood at blueyonder.co.uk (David Hopwood) Date: Fri, 01 Sep 2006 03:53:21 +0100 Subject: [Python-3000] Comment on iostack library In-Reply-To: <1d85506f0608311443s108822c1n31682ba765b2f3e0@mail.gmail.com> References: <1d85506f0608311443s108822c1n31682ba765b2f3e0@mail.gmail.com> Message-ID: <44F7A0A1.30300@blueyonder.co.uk> tomer filiba wrote: > [Talin] > >>Well, as far as readline goes: In order to split the text into lines, >>you have to decode the text first anyway, which is a layer 3 operation. >>You can't just read bytes until you get a \n, because the file you are >>reading might be encoded in UCS2 or something. > > well, the LineBufferedLayer can be "configured" to split on any > "marker", i.e.: LineBufferedLayer(stream, marker = "\x00\x0a") > and of course layer 3, which creates layer 2, can set this marker > to any byte sequence. note it's a *byte* sequence, not chars, > since this passes down to layer 1 transparently.
That isn't what is required; for big-endian UCS-2 or UTF-16, "\x00\x0a" should only be recognized as LF if it is at an even byte position. -- David Hopwood From talin at acm.org Fri Sep 1 05:13:27 2006 From: talin at acm.org (Talin) Date: Thu, 31 Aug 2006 20:13:27 -0700 Subject: [Python-3000] Making more effective use of slice objects in Py3k In-Reply-To: References: <20060827184941.1AE8.JCARLSON@uci.edu> <20060829102307.1B0F.JCARLSON@uci.edu> <6a36e7290608302056v4b0e68abrfe0c5b1fc927ff@mail.gmail.com> <20060831044354.GH6257@performancedrivers.com> <44F72E75.2050204@acm.org> Message-ID: <44F7A557.2010002@acm.org> Guido van Rossum wrote: > On 8/31/06, Talin wrote: >> One way to handle this efficiently would be to only support the >> encodings which have a constant character size: ASCII, Latin-1, UCS-2 >> and UTF-32. In other words, if the content of your text is plain ASCII, >> use an 8-bit-per-character string; If the content is limited to the >> Unicode BMP (Basic Multilingual Plane) use UCS-2; And if you are using >> Unicode supplementary characters, use UTF-32. >> >> (The difference between UCS-2 and UTF-16 is that UCS-2 is always 2 bytes >> per character, and doesn't support the supplemental characters above >> 0xffff, whereas UTF-16 characters can be either 2 or 4 bytes.) > > I think we should also support UTF-16, since Java and .NET (and > Win32?) appear to be using effectively; making surrogate handling an > application issue doesn't seem *too* big of a burden for many apps. I see that I misspoke - what I meant was, that we would "support" all of the available encodings in the sense that we could translate string objects to and from those encodings. But the internal representations of the string objects themselves would only use those encodings which represented a character in a fixed number of bytes. Moreover, this internal representation should be opaque to users of the string - if you want to write out a string as UTF-8 to a file, go for it, it shouldn't matter what the internal type of the string is. (Although Jython and IronPython should probably use whatever string representation is defined by the underlying VM.) >> By avoiding UTF-8, UTF-16 and other variable-character-length formats, >> you can always insure that character index operations are done in constant time. Index operations would simply require scaling the index >> by the character size, rather than having to scan through the string and >> count characters. >> >> The drawback of this method is that you may be forced to transform the >> entire string into a wider encoding if you add a single character that >> won't fit into the current encoding. > > A way to handle UTF-8 strings and other variable-length encodings > would be to maintain a small cache of index positions with the string > object. Actually, I realized that this drawback isn't really much of an issue at all. For virtually all string operations in Python, it is possible to predict ahead of time what string width will be required - thus you can allocate the proper width object up front, and not have to "widen" the string in mid-operation. So for example, any string operation which produces a subset of the string (such as partition, split, index, slice, etc.) will produce a string of the same width as the original string. Any string operation that involves combining two strings will produce a string that is the same type as the wider of the two strings.
Thus, if I say something like: "Hello World" + chr( 0x8000 ) This will produce a 16-bits wide string, because 'chr( 0x8000 )' can't be represented in ASCII, and thus produces a 16-bit-wide string. Since the first string is plain ASCII (8 bits) and the second is 16 bits, the result of the concatenation is a 16-bit string. Similarly, transformations on strings such as upper / lower yield a string that is the same width as the original. The only case I can think of where you might need to "promote" an entire string is where you are concatenating to a string buffer, in other words you are dealing with a mutable string type. And this case is easily handled by simply making the mutable string buffer type always use UTF-32, and then narrowing the result when str() is called to the narrowest possible representation that can hold the result. So essentially what I am proposing is this: -- That the Python 3000 "str" type can consist of 8-bit, 16-bit, or 32-bit characters, where all characters within a string are the same number of bytes. -- That all 3 types of strings appear identical to Python programmers, such that they need not know what type of string they are using. -- Any operation that returns a string result has the responsibility to insure that the resulting string is wide enough to contain all of the characters produced by the operation. -- That string index operations will always be constant time, with no auxiliary data structures required. -- That all 3 string types can be converted into all of the available encodings, including variable-character-width formats, however the result is a "bytes" object, not a string. An additional, but separate part of the proposal is that for str objects, the contents of the string are always defined in terms of Unicode code points. So if you want to convert to ISO-Latin-1, you can, but the result is a bytes object, not a string. The advantage of this is that it means that you always know what the value of 'ord()' is for a given character. It also means that two strings can always be compared for equality without having to decode them first. >> (Another option is to simply make all strings UTF-32 -- which is not >> that unreasonable, considering that text strings normally make up only a >> small fraction of a program's memory footprint. I am sure that there are >> applications that don't conform to this generalization, however. ) > > Here you are effectively voting against polymorphic strings. I believe > Fredrik has good reasons to doubt this assertion. Yes, that is correct. I'm just throwing it out there as a possibility, as it is by far the simplest solution. Its a question of trading memory use for simplicity of implementation. Having a single, flat, internal representation for all strings would be much less complex than having different string types. -- Talin From ironfroggy at gmail.com Fri Sep 1 05:21:08 2006 From: ironfroggy at gmail.com (Calvin Spealman) Date: Thu, 31 Aug 2006 23:21:08 -0400 Subject: [Python-3000] Exception Expressions In-Reply-To: <44F79D69.6090909@canterbury.ac.nz> References: <76fd5acf0608311042k231fb36w1bf5d1e7e4eebe0c@mail.gmail.com> <44F79D69.6090909@canterbury.ac.nz> Message-ID: <76fd5acf0608312021w1e0cf0f3md00ee5232f3ef9f4@mail.gmail.com> On 8/31/06, Greg Ewing wrote: > One problem is that it doesn't seem to chain > all that well. 
Suppose you had three files to > try opening: > > open(name1) except (open(name2) except open(name3) if IOError) if IOError > > Maybe it would be better if the exception type > and alternative expression were swapped over. > Then you could write > > open(name1) except IOError then open(name2) except IOError then open(name3) > > Still rather unwieldy though. -0.7j, I think > (the j to acknowledge that this is an imaginary > proposal.:-) > > -- > Greg Ewing, Computer Science Dept, +--------------------------------------+ > University of Canterbury, | Carpe post meridiem! | > Christchurch, New Zealand | (I'm not a morning person.) | > greg.ewing at canterbury.ac.nz +--------------------------------------+ I considered the expr1 except exc_type then expr2 syntax, but it adds a keyword without much need to do so. But, I suppose that isn't a problem now that conditional expressions are in and then is already a keyword. I hereby upgrade this from imaginary proposal to real proposal status! From paul at prescod.net Fri Sep 1 05:32:32 2006 From: paul at prescod.net (Paul Prescod) Date: Thu, 31 Aug 2006 20:32:32 -0700 Subject: [Python-3000] UTF-16 Message-ID: <1cb725390608312032t388c250by13befed154b4442d@mail.gmail.com> On 8/31/06, Guido van Rossum wrote: > > (Adding back py3k list assuming you just forgot it) Yes, thanks. Gmail's UI really optimizes the "Reply To" operation of "Reply To All." > Plus, it sounds like you're proposing that the encodings of the underlying > > data would leak through to the application. As I understood Fredrick's > > model, the intention was to treat the encoding as an implementation > detail. > > If it works well, this could be an important differentiator for Python > > (versus Java) as Unicode already is (versus Ruby). > > *Only* for UTF-16, which I consider a necessary evil since we can't > rewrite the Java and .NET standards. I see what you're getting at. I'd say that decoding UTF-16 data in CPython and PyPy should (by default) create true Unicode characters. Jython and IronPython could create surrogates and characters when necessary. When you run the program in CPython you'll get better behaviour than in Jython/IronPython. Maybe there could be a way to make CPython run like Jython and IronPython if you wanted 100% absolute compatibility between the environments. I think that we agree that it would be unfortunate if CPython copied Java and .NET to its own detriment. It's also not inconceivable that Java and .NET might evolve a 4-byte mode in the long term. Paul Prescod -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.python.org/pipermail/python-3000/attachments/20060831/688d3cc1/attachment.html From guido at python.org Fri Sep 1 05:46:55 2006 From: guido at python.org (Guido van Rossum) Date: Thu, 31 Aug 2006 20:46:55 -0700 Subject: [Python-3000] UTF-16 In-Reply-To: <1cb725390608312032t388c250by13befed154b4442d@mail.gmail.com> References: <1cb725390608312032t388c250by13befed154b4442d@mail.gmail.com> Message-ID: On 8/31/06, Paul Prescod wrote: > On 8/31/06, Guido van Rossum wrote: > > (Adding back py3k list assuming you just forgot it) > > Yes, thanks. Gmail's UI really optimizes the "Reply To" operation of "Reply > To All." > > > > Plus, it sounds like you're proposing that the encodings of the > underlying > > > data would leak through to the application. As I understood Fredrick's > > > model, the intention was to treat the encoding as an implementation > detail. 
> > > If it works well, this could be an important differentiator for Python > > > (versus Java) as Unicode already is (versus Ruby). > > > > *Only* for UTF-16, which I consider a necessary evil since we can't > > rewrite the Java and .NET standards. > > I see what you're getting at. > > I'd say that decoding UTF-16 data in CPython and PyPy should (by default) > create true Unicode characters. Jython and IronPython could create > surrogates and characters when necessary. When you run the program in > CPython you'll get better behaviour than in Jython/IronPython. Maybe there > could be a way to make CPython run like Jython and IronPython if you wanted > 100% absolute compatibility between the environments. I think that we agree > that it would be unfortunate if CPython copied Java and .NET to its own > detriment. It's also not inconceivable that Java and .NET might evolve a > 4-byte mode in the long term. I think it would be best to do this as a CPython configuration option just like it's done today. You can choose 4-byte or 2-byte Unicode (essentially UCS-4 or UTF-16) in order to be compatible with other packages on the platform. Yes, 4-byte gives better Unicode support. But 2-bytes may be more compatible with other stuff on the platform. Too bad .NET and Java don't have this option. :-) -- --Guido van Rossum (home page: http://www.python.org/~guido/) From guido at python.org Fri Sep 1 06:13:29 2006 From: guido at python.org (Guido van Rossum) Date: Thu, 31 Aug 2006 21:13:29 -0700 Subject: [Python-3000] Making more effective use of slice objects in Py3k In-Reply-To: <44F7A557.2010002@acm.org> References: <20060827184941.1AE8.JCARLSON@uci.edu> <20060829102307.1B0F.JCARLSON@uci.edu> <6a36e7290608302056v4b0e68abrfe0c5b1fc927ff@mail.gmail.com> <20060831044354.GH6257@performancedrivers.com> <44F72E75.2050204@acm.org> <44F7A557.2010002@acm.org> Message-ID: On 8/31/06, Talin wrote: > > Here you are effectively voting against polymorphic strings. I believe > > Fredrik has good reasons to doubt this assertion. > > Yes, that is correct. I'm just throwing it out there as a possibility, > as it is by far the simplest solution. Its a question of trading memory > use for simplicity of implementation. Having a single, flat, internal > representation for all strings would be much less complex than having > different string types. I think you don't realize the significance of the immediate enthusiastic +1 votes from several OSX developers. These people are quite familiar with ObjectiveC. ObjectiveC has true polymorphic strings, and the internal representation *can* be UTF-8. These developers love that. For most practical purposes the internal representation is abstracted away from the application; *however* it is possible to go below this level, especially for I/O (I believe). The net effect, if I understand correctly, is that you can save yourself a lot of copying if you are mostly just moving whole strings around and doing relatively little slicing and dicing -- it avoids converting from UTF-8 (which is by far the most common external representation) to UCS-2 or UCS-4 and back again. I don't think these advantages are maintained by your "narrowest constant-width encoding that fits all the characters" proposal. I'm not saying that we should definitely adopt this -- it may well be that the ObjectiveC string API is significantly different from Python's (e.g. 
it could have less emphasis on character indices and character counts) so that the benefits would be lost in translation -- but I'm not sure that the added complexity of your proposal is warranted if it still requires encoding and decoding on most I/O operations. BTW, in some sense Python 2.x *has* polymorphic strings -- str and unicode have the same API (99% anyway) but different implementations, and there's even a common abstract base class (basestring). But this clearly isn't what the ObjectiveC folks want to see! -- --Guido van Rossum (home page: http://www.python.org/~guido/) From paul at prescod.net Fri Sep 1 06:24:19 2006 From: paul at prescod.net (Paul Prescod) Date: Thu, 31 Aug 2006 21:24:19 -0700 Subject: [Python-3000] UTF-16 In-Reply-To: References: <1cb725390608312032t388c250by13befed154b4442d@mail.gmail.com> Message-ID: <1cb725390608312124u24d20ec2q27dbe5a69c2440d3@mail.gmail.com> On 8/31/06, Guido van Rossum wrote: > > On 8/31/06, Paul Prescod wrote: > > On 8/31/06, Guido van Rossum wrote: > > > (Adding back py3k list assuming you just forgot it) > > > > Yes, thanks. Gmail's UI really optimizes the "Reply To" operation of > "Reply > > To All." > > > > > > Plus, it sounds like you're proposing that the encodings of the > > underlying > > > > data would leak through to the application. As I understood > Fredrick's > > > > model, the intention was to treat the encoding as an implementation > > detail. > > > > If it works well, this could be an important differentiator for > Python > > > > (versus Java) as Unicode already is (versus Ruby). > > > > > > *Only* for UTF-16, which I consider a necessary evil since we can't > > > rewrite the Java and .NET standards. > > > > I see what you're getting at. > > > > I'd say that decoding UTF-16 data in CPython and PyPy should (by > default) > > create true Unicode characters. Jython and IronPython could create > > surrogates and characters when necessary. When you run the program in > > CPython you'll get better behaviour than in Jython/IronPython. Maybe > there > > could be a way to make CPython run like Jython and IronPython if you > wanted > > 100% absolute compatibility between the environments. I think that we > agree > > that it would be unfortunate if CPython copied Java and .NET to its own > > detriment. It's also not inconceivable that Java and .NET might evolve a > > 4-byte mode in the long term. > > I think it would be best to do this as a CPython configuration option > just like it's done today. You can choose 4-byte or 2-byte Unicode > (essentially UCS-4 or UTF-16) in order to be compatible with other > packages on the platform. Yes, 4-byte gives better Unicode support. > But 2-bytes may be more compatible with other stuff on the platform. > Too bad .NET and Java don't have this option. :-) The current model is a hack (and I wrote the PEP!). If you decide to go to all of the effort and expense of polymorphic strings, I cannot understand why a user should be forced to choose between 16 and 32 bit strings AT BUILD TIME. PEP 261 says that the reason for the build-time solution is: "[The alternate solutions] ... would require a much more complex implementation than the accepted solution. ... Guido is not willing to undertake the implementation right now. ...This PEP represents least-effort solution." Fair enough. A world of finite resources.
But I would be very annoyed if my ISP had installed a Python version that could magically handle 8-bit and 16-bit strings efficiently but I had to ask them to install a special version to handle 32 bit strings at all. Obviously build-time configuration is the least flexible of all available options. Paul Prescod -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.python.org/pipermail/python-3000/attachments/20060831/3dd236f2/attachment.htm From fredrik at pythonware.com Fri Sep 1 07:57:06 2006 From: fredrik at pythonware.com (Fredrik Lundh) Date: Fri, 01 Sep 2006 07:57:06 +0200 Subject: [Python-3000] Making more effective use of slice objects in Py3k In-Reply-To: <44F7A557.2010002@acm.org> References: <20060827184941.1AE8.JCARLSON@uci.edu> <20060829102307.1B0F.JCARLSON@uci.edu> <6a36e7290608302056v4b0e68abrfe0c5b1fc927ff@mail.gmail.com> <20060831044354.GH6257@performancedrivers.com> <44F72E75.2050204@acm.org> <44F7A557.2010002@acm.org> Message-ID: Talin wrote: > So essentially what I am proposing is this: "look at me! look at me!" From fredrik at pythonware.com Fri Sep 1 08:05:18 2006 From: fredrik at pythonware.com (Fredrik Lundh) Date: Fri, 01 Sep 2006 08:05:18 +0200 Subject: [Python-3000] Making more effective use of slice objects in Py3k In-Reply-To: References: <20060827184941.1AE8.JCARLSON@uci.edu> <20060829102307.1B0F.JCARLSON@uci.edu> <6a36e7290608302056v4b0e68abrfe0c5b1fc927ff@mail.gmail.com> <20060831044354.GH6257@performancedrivers.com> <44F72E75.2050204@acm.org> <44F7A557.2010002@acm.org> Message-ID: Guido van Rossum wrote: > BTW, in some sense Python 2.x *has* polymorphic strings -- str and > unicode have the same API (99% anyway) but different implementations, > and there's even a common abstract base class (basestring). But this > clearly isn't what the ObjectiveC folks want to see! on the Python level, absolutely. the "use 8-bit strings for ASCII, Unicode strings for everything else" approach works perfectly well. I'm still a bit worried about C API complexities, but as I mentioned, in today's Python, only 8-bit strings are really simple. and there are standard ways to deal with backing stores; if that's good enough for apple hackers, it should be good enough for pythoneers. most of this can be prototyped and benchmarked under 2.X, and parts of it can be directly useful also for 2.X developers; I think I'll start tinkering. > These people are quite familiar with ObjectiveC. ObjectiveC has true > polymorphic strings, and the internal representation *can* be UTF-8. > These developers love that. you are aware that Objective C does provide B-tree strings under the hood too, I hope ;-) From fredrik at pythonware.com Fri Sep 1 08:22:54 2006 From: fredrik at pythonware.com (Fredrik Lundh) Date: Fri, 01 Sep 2006 08:22:54 +0200 Subject: [Python-3000] Making more effective use of slice objects in Py3k In-Reply-To: References: <20060827184941.1AE8.JCARLSON@uci.edu> <20060829102307.1B0F.JCARLSON@uci.edu> <6a36e7290608302056v4b0e68abrfe0c5b1fc927ff@mail.gmail.com> Message-ID: tjreedy wrote: > These two similar features would be enough, to me, to make Py3 more than > just 2.x with cruft removed. well, it's really only C API issues that keeps us from implementing this in 2.x... (too much code uses PyString_Check and/or PyUnicode_Check and then happily digs into the associated buffers).
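(for concreteness, the Python-level half of that mix-and-match approach does work today; a minimal interpreter sketch, using nothing beyond the standard 2.x str/unicode coercion rules:

>>> a = "hello"            # plain 8-bit str, ASCII only
>>> b = u"w\u00f6rld"      # unicode string
>>> a + " " + b            # the str operand is decoded as ASCII on demand
u'hello w\xf6rld'
>>> type(a.upper())        # ASCII-only operations stay in the 8-bit type
<type 'str'>
>>> a == u"hello"          # mixed comparisons coerce the same way
True

the promotion to unicode happens exactly at the point where ASCII data meets non-ASCII data, which is the behaviour a polymorphic str type would generalize under the hood.)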
From fredrik at pythonware.com Fri Sep 1 08:46:23 2006 From: fredrik at pythonware.com (Fredrik Lundh) Date: Fri, 01 Sep 2006 08:46:23 +0200 Subject: [Python-3000] Making more effective use of slice objects in Py3k In-Reply-To: References: <20060827184941.1AE8.JCARLSON@uci.edu> <20060829102307.1B0F.JCARLSON@uci.edu> <6a36e7290608302056v4b0e68abrfe0c5b1fc927ff@mail.gmail.com> <20060831044354.GH6257@performancedrivers.com> <44F72E75.2050204@acm.org> Message-ID: Guido van Rossum wrote: > A way to handle UTF-8 strings and other variable-length encodings > would be to maintain a small cache of index positions with the string > object. I think just delaying decoding would take us most of the way. the big advantage of storage polymorphism is that you can avoid decoding and encoding (and having to pay for the cycles and bytes needed for that) if you don't have to. the XML case you mentioned is a typical example; just compare the behaviour of a library that does some extra work to keep things small under the hood with more straightforward implementations: http://effbot.org/zone/celementtree.htm#benchmarks (cElementTree uses the "8-bit ascii mixes well with unicode" approach) there are plenty of optimizations you can do when accessing the beginning and end of a string (startswith, endswith, comparisons, slicing, etc), but I think we can deal with that when we get there. I think the NFS sprint showed that you get better results by working with real use cases, rather than spending that theorizing. it also showed that the bottlenecks aren't always where you think they are. From fredrik at pythonware.com Fri Sep 1 08:49:38 2006 From: fredrik at pythonware.com (Fredrik Lundh) Date: Fri, 01 Sep 2006 08:49:38 +0200 Subject: [Python-3000] UTF-16 In-Reply-To: References: <1cb725390608312032t388c250by13befed154b4442d@mail.gmail.com> Message-ID: Guido van Rossum wrote: > I think it would be best to do this as a CPython configuration option > just like it's done today. You can choose 4-byte or 2-byte Unicode > (essentially UCS-4 or UTF-16) in order to be compatible with other > packages on the platform. Yes, 4-byte gives better Unicode support. > But 2-bytes may be more compatible with other stuff on the platform. > Too bad .NET and Java don't have this option. :-) the UCS2/UCS4 linking problems is a minor pain in the ass, though. maybe this is best done via a run-time setting? From fredrik at pythonware.com Fri Sep 1 09:56:52 2006 From: fredrik at pythonware.com (Fredrik Lundh) Date: Fri, 1 Sep 2006 09:56:52 +0200 Subject: [Python-3000] Making more effective use of slice objects in Py3k References: <20060827184941.1AE8.JCARLSON@uci.edu> <20060829102307.1B0F.JCARLSON@uci.edu> <6a36e7290608302056v4b0e68abrfe0c5b1fc927ff@mail.gmail.com><20060831044354.GH6257@performancedrivers.com> <44F72E75.2050204@acm.org> Message-ID: Talin wrote: > (Another option is to simply make all strings UTF-32 -- which is not > that unreasonable, considering that text strings normally make up only a > small fraction of a program's memory footprint. I am sure that there are > applications that don't conform to this generalization, however. ) performance is more than just memory use, though. for some string operations, memory bandwidth is the bottleneck, not memory use. it simply takes more time to process four times as much data.
(running the stringbench.py script in the sandbox on a recent 2.5 should give you some idea of this) From fredrik at pythonware.com Fri Sep 1 10:01:45 2006 From: fredrik at pythonware.com (Fredrik Lundh) Date: Fri, 1 Sep 2006 10:01:45 +0200 Subject: [Python-3000] locale-aware strings ? Message-ID: today's Python supports "locale aware" 8-bit strings; e.g. >>> import locale >>> "åäö".isalpha() False >>> locale.setlocale(locale.LC_ALL, "sv_SE") 'sv_SE' >>> "åäö".isalpha() True to what extent should this be supported by Python 3000 ? From tomerfiliba at gmail.com Fri Sep 1 10:05:10 2006 From: tomerfiliba at gmail.com (tomer filiba) Date: Fri, 1 Sep 2006 10:05:10 +0200 Subject: [Python-3000] Comment on iostack library In-Reply-To: <44F761A3.5060009@livinglogic.de> References: <1d85506f0608311443s108822c1n31682ba765b2f3e0@mail.gmail.com> <44F761A3.5060009@livinglogic.de> Message-ID: <1d85506f0609010105n69e8cdcbw989f861e05ca7a24@mail.gmail.com> very well, i'll use it. thanks. On 9/1/06, Walter Dörwald wrote: > tomer filiba wrote: > > > [...] > > besides, encoding suffers from many issues. suppose you have a > > damaged UTF8 file, which you read char-by-char. when we reach the > > damaged part, you'll never be able to "skip" it, as we'll just keep > > read()ing bytes, hoping to make a character out of it, until we > > reach EOF, i.e.: > > > > def read_char(self): > > buf = "" > > while not self._stream.eof: > > buf += self._stream.read(1) > > try: > > return buf.decode("utf8") > > except ValueError: > > pass > > > > which leads me to the following thought: maybe we should have > > an "enhanced" encoding library for py3k, which would report > > *incomplete* data differently from *invalid* data. today it's just a > > ValueError: suppose decode() would raise IncompleteDataError > > when the given data is not sufficient to be decoded successfully, > > and ValueError when the data is just corrupted. > > > > that could aid iostack greatly. > > We *do* have that functionality in Python 2.5: incremental decoders can > retain incomplete byte sequences on the call to the decode() method > until the next call. Only when final=True is passed in the decode() call > will it treat incomplete and invalid data in the same way: by raising an > exception.
> > Incomplete input: > >>> import codecs > >>> d = codecs.lookup("utf-8").incrementaldecoder() > >>> d.decode("\xe1") > u'' > >>> d.decode("\x88") > u'' > >>> d.decode("\xb4") > u'\u1234' > > Invalid input: > >>> import codecs > >>> d = codecs.lookup("utf-8").incrementaldecoder() > >>> d.decode("\x80") > Traceback (most recent call last): > File "<stdin>", line 1, in <module> > File "/var/home/walter/checkouts/Python/test/Lib/codecs.py", line 256, > in decode > (result, consumed) = self._buffer_decode(data, self.errors, final) > UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: > unexpected code byte > > Incomplete input with final=True: > >>> import codecs > >>> d = codecs.lookup("utf-8").incrementaldecoder() > >>> d.decode("\xe1", final=True) > Traceback (most recent call last): > File "<stdin>", line 1, in <module> > File "/var/home/walter/checkouts/Python/test/Lib/codecs.py", line 256, > in decode > (result, consumed) = self._buffer_decode(data, self.errors, final) > UnicodeDecodeError: 'utf8' codec can't decode byte 0xe1 in position 0: > unexpected end of data > > Servus, > Walter > > From fredrik at pythonware.com Fri Sep 1 13:14:13 2006 From: fredrik at pythonware.com (Fredrik Lundh) Date: Fri, 1 Sep 2006 13:14:13 +0200 Subject: [Python-3000] Making more effective use of slice objects in Py3k References: <20060827184941.1AE8.JCARLSON@uci.edu> <20060829102307.1B0F.JCARLSON@uci.edu> <6a36e7290608302056v4b0e68abrfe0c5b1fc927ff@mail.gmail.com> <20060831044354.GH6257@performancedrivers.com> <44F72E75.2050204@acm.org> Message-ID: > spending that theorizing. make that "spending that time theorizing about what you could, in theory, do." From qrczak at knm.org.pl Fri Sep 1 13:34:42 2006 From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk) Date: Fri, 01 Sep 2006 13:34:42 +0200 Subject: [Python-3000] Comment on iostack library In-Reply-To: <1d85506f0608311443s108822c1n31682ba765b2f3e0@mail.gmail.com> (tomer filiba's message of "Thu, 31 Aug 2006 23:43:44 +0200") References: <1d85506f0608311443s108822c1n31682ba765b2f3e0@mail.gmail.com> Message-ID: <87u03r6crx.fsf@qrnik.zagroda> "tomer filiba" writes: >> Encoding conversion and newline conversion should be performed a >> block at a time, below buffering, so not only I/O syscalls, but >> also invocations of the recoding machinery are amortized by >> buffering. > > you have a good point, which i also stumbled upon when implementing > the TextInterface. but how would you suggest to solve it? I've designed and implemented this for my language, but I'm not sure that you will like it because it's quite different from the Python tradition. The interface of block reading appends data to the end of the supplied buffer, up to the specified size (or infinity), and also it tells whether it reached end of data. The interface of block writing removes data from the beginning of the supplied buffer, up to the supplied size (or the whole buffer), and is told how to flush, which includes information whether this is the end of data. Both functions are allowed to read/write less than requested. The recoding engine moves data from the beginning of an input buffer to the end of an output buffer. The block recoding function has similar size parameters as above, and a flushing parameter. It returns True on output overflow, i.e. when it stopped because it needs more room in the output rather than because it needs more input. It leaves unconverted data at the end of the input buffer if data looks incomplete, unless it is told that this is the last block - in this case it fails.
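A toy sketch of that recoding contract, with invented names (an identity "recoder" over list buffers; a real codec would additionally hold back an incomplete trailing sequence in the input buffer unless the last-block flag is set, and fail if it is):

def recode_identity(inbuf, outbuf, max_out, last):
    # move data from the front of inbuf to the end of outbuf,
    # letting outbuf grow to at most max_out items
    n = max(0, min(len(inbuf), max_out - len(outbuf)))
    outbuf.extend(inbuf[:n])
    del inbuf[:n]
    # True signals output overflow: we stopped for lack of room
    # in outbuf, not for lack of input
    return bool(inbuf) and len(outbuf) >= max_out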
Both decoding input streams and encoding output streams have a persistent buffer in the format corresponding to their low end, i.e. a byte buffer when this is the boundary between bytes and characters. This design allows to plug everything together, including the cases where recoding changes sizes significantly (compression/decompression). It also allows reading/writing process to be interrupted without breaking the consistency of the state of buffers, as long as each primitive reading/writing operation is atomic, i.e. anything it removes from the input buffer is converted and put in the output buffer. Data not yet processed by the remaining layers remains in their respective buffers. For example reading a block from a decoding stream: 1. If there was no overflow previously, read more data from the underlying stream to the internal buffer, up to the supplied maximum size. 2. Decode data from the internal buffer to the supplied output buffer, up to the supplied maximum size. Tell the recoding engine that this is the last piece if there was no overflow previously and reading from the underlying stream reached the end. 3. Return True (i.e. end of input) if there was no overflow now and reading from the underlying stream reached the end. Writing a block to an encoding stream is simpler: 1. Encode data from the supplied input buffer to the internal buffer. 2. Write data from the internal buffer to the output stream. Buffered streams are typically put on the top of the stack. They support reading a line at a time, unlimited lookahead and unlimited unreading, and writing which guarantees that it won't leave anything in the buffer it is writing from. Newlines are converted by a separate layer. The buffered stream assumes "\n" endings. -- __("< Marcin Kowalczyk \__/ qrczak at knm.org.pl ^^ http://qrnik.knm.org.pl/~qrczak/ From fredrik at pythonware.com Fri Sep 1 13:41:00 2006 From: fredrik at pythonware.com (Fredrik Lundh) Date: Fri, 1 Sep 2006 13:41:00 +0200 Subject: [Python-3000] string C API Message-ID: just noticed that PEP 3100 says that PyString_AsEncodedString and PyString_AsDecodedString is to be removed, but it doesn't mention any other PyString (or PyUnicode) functions. how large changes can we make here, really ? (I'm not going to sketch on a concrete proposal here; I'm more interested in general guidelines. the details are best fleshed out in code) From barry at python.org Fri Sep 1 14:14:46 2006 From: barry at python.org (Barry Warsaw) Date: Fri, 1 Sep 2006 08:14:46 -0400 Subject: [Python-3000] UTF-16 In-Reply-To: References: <1cb725390608312032t388c250by13befed154b4442d@mail.gmail.com> Message-ID: <188FAEC3-875D-4AA8-8C66-A1DF6F8A96C6@python.org> On Sep 1, 2006, at 2:49 AM, Fredrik Lundh wrote: > Guido van Rossum wrote: > >> I think it would be best to do this as a CPython configuration option >> just like it's done today. You can choose 4-byte or 2-byte Unicode >> (essentially UCS-4 or UTF-16) in order to be compatible with other >> packages on the platform. Yes, 4-byte gives better Unicode support. >> But 2-bytes may be more compatible with other stuff on the platform. >> Too bad .NET and Java don't have this option. :-) > > the UCS2/UCS4 linking problems is a minor pain in the ass, though. > maybe this is best done via a run-time setting? Yes, the linking problem does crop up from time to time. Recent example: Gentoo Linux is heavily dependent on Python and I recently emerged in several packages. 
I don't remember the exact details, but there was a conflict between UCS2 and UCS4 where two different upstream packages required two different linkages, and the wrapping Python modules were thus incompatible. I basically had to decide which one I cared about most and delete the other to resolve the conflict. The problem was confusing the hell out of several Gentooers until we tracked down all the resources and figured out the (suboptimal) fix. -Barry From fredrik at pythonware.com Fri Sep 1 14:23:10 2006 From: fredrik at pythonware.com (Fredrik Lundh) Date: Fri, 1 Sep 2006 14:23:10 +0200 Subject: [Python-3000] UTF-16 References: <1cb725390608312032t388c250by13befed154b4442d@mail.gmail.com> <188FAEC3-875D-4AA8-8C66-A1DF6F8A96C6@python.org> Message-ID: Barry Warsaw wrote: > I recently emerged in several packages. good thing dictionary.com includes wikipedia articles, or I'd never figured out if that was a typo or a rather odd spiritual phenomenon. From paul at prescod.net Fri Sep 1 16:11:35 2006 From: paul at prescod.net (Paul Prescod) Date: Fri, 1 Sep 2006 07:11:35 -0700 Subject: [Python-3000] Character Set Indepencence Message-ID: <1cb725390609010711p37586956q1ec570a2d8c4196d@mail.gmail.com> I thought that others might find this reference interesting. It is Matz (the inventor of Ruby) talking about why he thinks that Unicode is good for what it does but not sufficient in general, along with some hints of what he plans for multinationalization in Ruby. The translation is rough and is lifted from this email: http://rubyforge.org/pipermail/rhg-discussion/2006-April/000136.html I think that the gist of it is that Unicode will be "just one character set" supported by Ruby. This idea has been kicked around for Python before but you quickly run into questions about how you compare character strings from multiple character sets, to say nothing of the complexity of an character encoding and character set agnostic regular expression engine. I guess Matz is the right guy to experiment with that stuff. Maybe it could be copied in Python 4K. What are your complaints towards Unicode? * it's thoroughly used, isn't it. * resentment towards Han unification? * inferiority complex of Japanese people? -- What are your complaints towards Unicode? * no, no I do not have any complaints about Unicode * in the domains where Unicode is adequate -- Then, why CSI? In most applications, UCS is enough thanks to Unicode. However, there are also applications for which this is not the case. -- Fields for which Unicode is not enough Big character sets * Konjaku-Mojikyo (Japanese encoding which includes many more than Unicode) * TRON code * GB18030 -- Fields for which Unicode is not fitted Legacy encodings * conversion to UCS is useless * big conversion tables * round-trip problem -- If a language chooses the UCS system * you cannot write non-UCS applications * you can't handle text that can't be expressed with Unicode -- If a language chooses the CSI system * CSI is a superset of UCS * Unicode just has to be handled in CSI -- ... is what we can say but * CSI is difficult * can it really be implemented? 
-- That's where comes out Japan's traditional arts Adaptation for the Japanese language of applications * Modification of English language applications to be able to process Japanese -- Adaptation for the Japanese language of applications * What engineers of long ago experienced for sure - Emacs (NEmacs) - Perl (JPerl) - Bash -- Accumulation of know-how In Japan, the know-how of adaptation for the Japanese language (multi-byte text processing) has been accumulated. -- Accumulation of know-how in the first place, just for local use, text using 3 encodings circulate (4 if including UTF-8) -- Based on this know-how * multibyte text encodings * switching between encodings at the string level * processing them at practical speed is finished -- Available encodings euc_tw euc_jp iso8859_* utf-8 utf-32le ascii euc_kr koi8 utf-16le utf-32be big5 gb2312 sjis utf-16be ...and many others If it's a stateless encodings, in principle it can be available. -- It means For applications using only one encoding, code conversion is not needed -- Moreover Applications wanting to handle multiple encodings can choose an internal encoding (generally Unicode) that includes all others -- If you want to * you can also handle multiple encodings without conversion, letting characters as they are * but this is difficult so I do not recommend it -- However, only the basic part is done, it's far from being ready for practical use * code conversion * guessing encoding * etc. -- For the time being, today I want to tell everyone: * UCS is practical * but not all-purpose * CSI is not impossible -- The reason I'm saying that They may add CSI in Perl6 as they had added * Methods called by "." * Continuations from Ruby. Basically, they hate losing. -- Thank you -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.python.org/pipermail/python-3000/attachments/20060901/46576432/attachment.html From jimjjewett at gmail.com Fri Sep 1 16:24:42 2006 From: jimjjewett at gmail.com (Jim Jewett) Date: Fri, 1 Sep 2006 10:24:42 -0400 Subject: [Python-3000] Exception Expressions In-Reply-To: References: <76fd5acf0608311042k231fb36w1bf5d1e7e4eebe0c@mail.gmail.com> <76fd5acf0608311450r6fbddd44n28ab6f83741b8699@mail.gmail.com> Message-ID: On 8/31/06, Brett Cannon wrote: > On 8/31/06, Calvin Spealman wrote: > > On 8/31/06, Brett Cannon wrote: > > > So this feels like the Perl idiom of using die: ``open(file) or die`` > > "Ouch" on the associated my idea with perl! > =) The truth hurts. Isn't this almost the opposite of "or die"? Unless I'm having a very bad day, the die idiom is more like a SystemExit, but this proposal is a way to recover from expected Exceptions. > func(ags) || die(msg) means >>> if not func(args): ... raise SystemExit(msg) This proposal, with the "a non-dict mapping might not have get" use case: >>> ((mymap[k] except KeyError then default) for key in source) means >>> def __temp(): ... for element in source: ... try: ... v=mymap[k] ... except KeyError: ... v=default ... yield v >>> __temp() -jJ From guido at python.org Fri Sep 1 16:59:47 2006 From: guido at python.org (Guido van Rossum) Date: Fri, 1 Sep 2006 07:59:47 -0700 Subject: [Python-3000] Character Set Indepencence In-Reply-To: <1cb725390609010711p37586956q1ec570a2d8c4196d@mail.gmail.com> References: <1cb725390609010711p37586956q1ec570a2d8c4196d@mail.gmail.com> Message-ID: I think in a sense Python *will* continue to support multiple character sets -- as byte streams. IMO that's the only reasonable approach. 
Unlike apparently Matz I've never heard complaints that Python 2 doesn't have enough support for character sets larger than Unicode, and that is effectively what it supports: encoded strings and Unicode string. --Guido On 9/1/06, Paul Prescod wrote: > I thought that others might find this reference interesting. It is Matz (the > inventor of Ruby) talking about why he thinks that Unicode is good for what > it does but not sufficient in general, along with some hints of what he > plans for multinationalization in Ruby. The translation is rough and is > lifted from this email: > > http://rubyforge.org/pipermail/rhg-discussion/2006-April/000136.html > > I think that the gist of it is that Unicode will be "just one character set" > supported by Ruby. This idea has been kicked around for Python before but > you quickly run into questions about how you compare character strings from > multiple character sets, to say nothing of the complexity of an character > encoding and character set agnostic regular expression engine. > > I guess Matz is the right guy to experiment with that stuff. Maybe it could > be copied in Python 4K. > What are your complaints towards Unicode? > * it's thoroughly used, isn't it. > * resentment towards Han unification? > > * inferiority complex of Japanese people? > -- > What are your complaints towards Unicode? > * no, no I do not have any complaints about Unicode > * in the domains where Unicode is adequate > -- > Then, why CSI? > > > In most applications, UCS is enough thanks to Unicode. > However, there are also applications for which this is not the case. > -- > Fields for which Unicode is not enough > Big character sets > * Konjaku-Mojikyo (Japanese encoding which includes many more than Unicode) > > * TRON code > * GB18030 > -- > Fields for which Unicode is not fitted > Legacy encodings > * conversion to UCS is useless > * big conversion tables > * round-trip problem > -- > If a language chooses the UCS system > > * you cannot write non-UCS applications > * you can't handle text that can't be expressed with Unicode > -- > If a language chooses the CSI system > * CSI is a superset of UCS > * Unicode just has to be handled in CSI > > -- > ... is what we can say but > * CSI is difficult > * can it really be implemented? > -- > That's where comes out Japan's traditional arts > > Adaptation for the Japanese language of applications > * Modification of English language applications to be able to process > Japanese > > -- > Adaptation for the Japanese language of applications > > * What engineers of long ago experienced for sure > - Emacs (NEmacs) > - Perl (JPerl) > - Bash > -- > Accumulation of know-how > > In Japan, the know-how of adaptation for the Japanese language > > (multi-byte text processing) > has been accumulated. > -- > Accumulation of know-how > > in the first place, just for local use, > text using 3 encodings circulate > (4 if including UTF-8) > -- > Based on this know-how > > * multibyte text encodings > * switching between encodings at the string level > * processing them at practical speed > is finished > -- > Available encodings > > euc_tw euc_jp iso8859_* utf-8 utf-32le > > ascii euc_kr koi8 utf-16le utf-32be > big5 gb2312 sjis utf-16be > > ...and many others > If it's a stateless encodings, in principle it can be available. 
> -- > It means > For applications using only one encoding, code conversion is not needed > > -- > Moreover > Applications wanting to handle multiple encodings can choose an > internal encoding (generally Unicode) that includes all others > -- > If you want to > * you can also handle multiple encodings without conversion, letting > > characters as they are > * but this is difficult so I do not recommend it > -- > However, > only the basic part is done, > it's far from being ready for practical use > * code conversion > * guessing encoding > > * etc. > -- > For the time being, today > I want to tell everyone: > * UCS is practical > * but not all-purpose > * CSI is not impossible > -- > The reason I'm saying that > They may add CSI in Perl6 as they had added > > * Methods called by "." > * Continuations > from Ruby. > Basically, they hate losing. > -- > Thank you > > > _______________________________________________ > Python-3000 mailing list > Python-3000 at python.org > http://mail.python.org/mailman/listinfo/python-3000 > Unsubscribe: > http://mail.python.org/mailman/options/python-3000/guido%40python.org > > > -- --Guido van Rossum (home page: http://www.python.org/~guido/) From mcherm at mcherm.com Fri Sep 1 17:03:59 2006 From: mcherm at mcherm.com (Michael Chermside) Date: Fri, 01 Sep 2006 08:03:59 -0700 Subject: [Python-3000] Exception Expressions Message-ID: <20060901080359.zsxl30h7bpwswc40@login.werra.lunarpages.com> Calvin Spealman writes: > I thought I felt in the mood for some abuse today, so I'm proposing > something sure to give me plenty of crap, but maybe someone will enjoy > the idea, anyway. [...] > expr1 except expr2 if exc_type This is wonderful! In combination with conditional expressions, list comprehensions, and lambda, I think this would make it possible to write full-powered Python programs on a single line. Actually, putting it on a single line in your text editor would just make things unreadable, but if you wrap parentheses around it, then the entire program can be a single expression, something like this: for entry in entryList: if entry.status() == 'open': try: entry.display() except DisplayError: entry.setStatus('error') entry.hide() else: entry.hide(); would become this: ( ( ( entry.display() except ( entry.setStatus('error'), entry.hide() ) if DisplayError ) if entry.status() == 'open' else entry.hide() ) for entry in entryList ) (Or you *could* choose to compress it as follows:) (((entry.display()except(entry.setStatus('error' ),entry.hide())if DisplayError)if entry.status() =='open' else entry.hide())for entry in entryList) Now, I wouldn't try to claim that this single-expression version is *more* readable than the original, but it has a significant advantage: it makes the language no longer dependent on significant whitespace for demarking lines and blocks! There are places where significant whitespace is a problem, most notably when trying to embed Python code within other documents. Just imagine using this new form to embed Python within HTML to create a new and more powerful form of dynamic page generation:

<*entry.title()*>

    <* "
  • Valid
  • " if entry.isvalid() else "" *> <* "
  • Active
  • " if entry.active else "
  • Inactive
  • " *>

<* entry.showContent() except "No Data Available" if Exception *>

Isn't it amazing? . . . Okay... *everything* above comes with a HUGE wink. It's a joke. Calvin's idea is clever, and readable once you get used to conditional expressions, but I'm still a solid -1 on the proposal. But thanks for giving me something fun to think about. -- Michael Chermside From nnorwitz at gmail.com Fri Sep 1 18:58:49 2006 From: nnorwitz at gmail.com (Neal Norwitz) Date: Fri, 1 Sep 2006 09:58:49 -0700 Subject: [Python-3000] string C API In-Reply-To: References: Message-ID: On 9/1/06, Fredrik Lundh wrote: > just noticed that PEP 3100 says that PyString_AsEncodedString and > PyString_AsDecodedString is to be removed, but it doesn't mention > any other PyString (or PyUnicode) functions. > > how large changes can we make here, really ? I don't know if it was the case here or not, but I added a bunch of APIs to the PEP that were labeled as deprecated or only for backwards compatibility. The sources were the doc, header files, and source files. (There's no single place to look.) n From guido at python.org Fri Sep 1 19:17:39 2006 From: guido at python.org (Guido van Rossum) Date: Fri, 1 Sep 2006 10:17:39 -0700 Subject: [Python-3000] locale-aware strings ? In-Reply-To: References: Message-ID: I say not at all. On 9/1/06, Fredrik Lundh wrote: > today's Python supports "locale aware" 8-bit strings; e.g. > > >>> import locale > >>> "åäö".isalpha() > False > >>> locale.setlocale(locale.LC_ALL, "sv_SE") > 'sv_SE' > >>> "åäö".isalpha() > True > > to what extent should this be supported by Python 3000 ? > > > > > > > > _______________________________________________ > Python-3000 mailing list > Python-3000 at python.org > http://mail.python.org/mailman/listinfo/python-3000 > Unsubscribe: http://mail.python.org/mailman/options/python-3000/guido%40python.org > > > -- --Guido van Rossum (home page: http://www.python.org/~guido/) From g.brandl at gmx.net Fri Sep 1 20:34:09 2006 From: g.brandl at gmx.net (Georg Brandl) Date: Fri, 01 Sep 2006 20:34:09 +0200 Subject: [Python-3000] Ripping out exec Message-ID: Hi, in process of ripping out the exec statement, I stumbled over the following function in symtable.c (line 468ff): ------------------------------------------------------------------------------------ /* Check for illegal statements in unoptimized namespaces */ static int check_unoptimized(const PySTEntryObject* ste) { char buf[300]; const char* trailer; if (ste->ste_type != FunctionBlock || !ste->ste_unoptimized || !(ste->ste_free || ste->ste_child_free)) return 1; trailer = (ste->ste_child_free ?
"contains a nested function with free variables" : "is a nested function"); switch (ste->ste_unoptimized) { case OPT_TOPLEVEL: /* exec / import * at top-level is fine */ case OPT_EXEC: /* qualified exec is fine */ return 1; case OPT_IMPORT_STAR: PyOS_snprintf(buf, sizeof(buf), "import * is not allowed in function '%.100s' " "because it is %s", PyString_AS_STRING(ste->ste_name), trailer); break; case OPT_BARE_EXEC: PyOS_snprintf(buf, sizeof(buf), "unqualified exec is not allowed in function " "'%.100s' it %s", PyString_AS_STRING(ste->ste_name), trailer); break; default: PyOS_snprintf(buf, sizeof(buf), "function '%.100s' uses import * and bare exec, " "which are illegal because it %s", PyString_AS_STRING(ste->ste_name), trailer); break; } PyErr_SetString(PyExc_SyntaxError, buf); PyErr_SyntaxLocation(ste->ste_table->st_filename, ste->ste_opt_lineno); return 0; } -------------------------------------------------------------------------------------- Of course, this check can't be made at compile time if exec() is a function. (You can even outsmart it currently by giving explicit None arguments to the exec statement) So my question is: is this check required, and can it be done at execution time instead? Comparing the exec code to execfile(), only this can be the cause for the extra precaution: (from Python/ceval.c, function exec_statement) if (plain) PyFrame_LocalsToFast(f, 0); Georg From guido at python.org Fri Sep 1 20:37:55 2006 From: guido at python.org (Guido van Rossum) Date: Fri, 1 Sep 2006 11:37:55 -0700 Subject: [Python-3000] Ripping out exec In-Reply-To: References: Message-ID: I would just rip it out. On 9/1/06, Georg Brandl wrote: > Hi, > > in process of ripping out the exec statement, I stumbled over the > following function in symtable.c (line 468ff): > > ------------------------------------------------------------------------------------ > /* Check for illegal statements in unoptimized namespaces */ > static int > check_unoptimized(const PySTEntryObject* ste) { > char buf[300]; > const char* trailer; > > if (ste->ste_type != FunctionBlock || !ste->ste_unoptimized > || !(ste->ste_free || ste->ste_child_free)) > return 1; > > trailer = (ste->ste_child_free ? > "contains a nested function with free variables" : > "is a nested function"); > > switch (ste->ste_unoptimized) { > case OPT_TOPLEVEL: /* exec / import * at top-level is fine */ > case OPT_EXEC: /* qualified exec is fine */ > return 1; > case OPT_IMPORT_STAR: > PyOS_snprintf(buf, sizeof(buf), > "import * is not allowed in function '%.100s' " > "because it is %s", > PyString_AS_STRING(ste->ste_name), trailer); > break; > case OPT_BARE_EXEC: > PyOS_snprintf(buf, sizeof(buf), > "unqualified exec is not allowed in function " > "'%.100s' it %s", > PyString_AS_STRING(ste->ste_name), trailer); > break; > default: > PyOS_snprintf(buf, sizeof(buf), > "function '%.100s' uses import * and bare exec, " > "which are illegal because it %s", > PyString_AS_STRING(ste->ste_name), trailer); > break; > } > > PyErr_SetString(PyExc_SyntaxError, buf); > PyErr_SyntaxLocation(ste->ste_table->st_filename, > ste->ste_opt_lineno); > return 0; > } > -------------------------------------------------------------------------------------- > > Of course, this check can't be made at compile time if exec() is a function. > (You can even outsmart it currently by giving explicit None arguments to the > exec statement) > > So my question is: is this check required, and can it be done at execution time > instead? 
> > Comparing the exec code to execfile(), only this can be the cause for the > extra precaution: > (from Python/ceval.c, function exec_statement) > > if (plain) > PyFrame_LocalsToFast(f, 0); > > Georg > > _______________________________________________ > Python-3000 mailing list > Python-3000 at python.org > http://mail.python.org/mailman/listinfo/python-3000 > Unsubscribe: http://mail.python.org/mailman/options/python-3000/guido%40python.org > -- --Guido van Rossum (home page: http://www.python.org/~guido/) From jcarlson at uci.edu Fri Sep 1 21:20:21 2006 From: jcarlson at uci.edu (Josiah Carlson) Date: Fri, 01 Sep 2006 12:20:21 -0700 Subject: [Python-3000] "string" views Message-ID: <20060901120313.1B5F.JCARLSON@uci.edu> Attached you will find a zip file containing the implementation of a 'stringview' object written against Python 2.3 and Pyrex 0.9.3 . I didn't implement center, decode, encode, ljust, rjust, splitlines, title, translate, zfill, __[r]mod__, slicing with indices != 1, my optimization for view.join(...) doesn't seem to work, and view.split('') is also not implemented. I'm stopping for right now because I'm a bit burnt out on this particular project. If it seems hacked together, it is because it is hacked together. Whenever possible it returns views. It also will generally take anything that supports the buffer protocol as an argument where a string or view would have also made sense. Please remember that this is just a proof-of-concept implementation; I would imagine that an actual view object would likely need to be written in pure C, and though I have tested each method by hand, there may be bugs. I have also included the output file "stringview.c" for those without a working Pyrex installation, which should compile against Python 2.3 headers, and perhaps even 2.4 headers. - Josiah -------------- next part -------------- A non-text attachment was scrubbed... Name: stringview.zip Type: application/x-zip-compressed Size: 27685 bytes Desc: not available Url : http://mail.python.org/pipermail/python-3000/attachments/20060901/46f8ee58/attachment-0001.bin From g.brandl at gmx.net Fri Sep 1 23:28:15 2006 From: g.brandl at gmx.net (Georg Brandl) Date: Fri, 01 Sep 2006 23:28:15 +0200 Subject: [Python-3000] Ripping out exec In-Reply-To: References: Message-ID: Guido van Rossum wrote: > I would just rip it out. It turns out that it's not so easy. The exec statement currently can modify the locals, which means that def f(): exec "a=1" print a succeeds. To make that possible, the compiler flags scopes containing exec statements as unoptimized and does not assume unbound names to be global. With exec being a function, currently the above function won't work because "a" is assumed to be global. I can see only two resolutions: * change exec() semantics so that it cannot modify the locals * do not make exec a function Georg From guido at python.org Fri Sep 1 23:57:18 2006 From: guido at python.org (Guido van Rossum) Date: Fri, 1 Sep 2006 14:57:18 -0700 Subject: [Python-3000] Ripping out exec In-Reply-To: References: Message-ID: On 9/1/06, Georg Brandl wrote: > Guido van Rossum wrote: > > I would just rip it out. > > It turns out that it's not so easy. The exec statement currently can > modify the locals, which means that > > def f(): > exec "a=1" > print a > > succeeds. To make that possible, the compiler flags scopes containing > exec statements as unoptimized and does not assume unbound names to > be global. 
> > With exec being a function, currently the above function won't work > because "a" is assumed to be global. > > I can see only two resolutions: > > * change exec() semantics so that it cannot modify the locals > * do not make exec a function Make it so it can't modify the locals. execfile() has the same limitation. -- --Guido van Rossum (home page: http://www.python.org/~guido/) From g.brandl at gmx.net Sat Sep 2 00:37:20 2006 From: g.brandl at gmx.net (Georg Brandl) Date: Sat, 02 Sep 2006 00:37:20 +0200 Subject: [Python-3000] Ripping out exec In-Reply-To: References: Message-ID: Guido van Rossum wrote: > On 9/1/06, Georg Brandl wrote: >> Guido van Rossum wrote: >> > I would just rip it out. >> >> It turns out that it's not so easy. The exec statement currently can >> modify the locals, which means that >> >> def f(): >> exec "a=1" >> print a >> >> succeeds. To make that possible, the compiler flags scopes containing >> exec statements as unoptimized and does not assume unbound names to >> be global. >> >> With exec being a function, currently the above function won't work >> because "a" is assumed to be global. >> >> I can see only two resolutions: >> >> * change exec() semantics so that it cannot modify the locals >> * do not make exec a function > > Make it so it can't modify the locals. execfile() has the same limitation. > Good. Patch is at python.org/sf/1550800. There's another one at python.org/sf/1550786 implementing the Ellipsis literal. cheers, Georg From greg.ewing at canterbury.ac.nz Sat Sep 2 02:10:55 2006 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Sat, 02 Sep 2006 12:10:55 +1200 Subject: [Python-3000] Making more effective use of slice objects in Py3k In-Reply-To: <44F7A557.2010002@acm.org> References: <20060827184941.1AE8.JCARLSON@uci.edu> <20060829102307.1B0F.JCARLSON@uci.edu> <6a36e7290608302056v4b0e68abrfe0c5b1fc927ff@mail.gmail.com> <20060831044354.GH6257@performancedrivers.com> <44F72E75.2050204@acm.org> <44F7A557.2010002@acm.org> Message-ID: <44F8CC0F.2020004@canterbury.ac.nz> Talin wrote: > So for example, any string operation which produces a subset of the > string (such as partition, split, index, slice, etc.) will produce a > string of the same width as the original string. It might be possible to represent it in a narrower format, however. Perhaps there should be an explicit operation for re-packing a string into the narrowest possible format? Or should one simply encode it as UTF-8 or something and then decode it again to get the same effect? -- Greg From greg.ewing at canterbury.ac.nz Sat Sep 2 02:37:02 2006 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Sat, 02 Sep 2006 12:37:02 +1200 Subject: [Python-3000] Ripping out exec In-Reply-To: References: Message-ID: <44F8D22E.70202@canterbury.ac.nz> Guido van Rossum wrote: > I would just rip it out. I don't understand this business about ripping out exec. I thought that exec had to be a statement so the compiler can tell whether to use fast locals. Do you have a different way of handling that in mind for Py3k? -- Greg From guido at python.org Sat Sep 2 04:26:46 2006 From: guido at python.org (Guido van Rossum) Date: Fri, 1 Sep 2006 19:26:46 -0700 Subject: [Python-3000] Ripping out exec In-Reply-To: <44F8D22E.70202@canterbury.ac.nz> References: <44F8D22E.70202@canterbury.ac.nz> Message-ID: On 9/1/06, Greg Ewing wrote: > Guido van Rossum wrote: > > I would just rip it out. > > I don't understand this business about ripping out > exec. 
I thought that exec had to be a statement so > the compiler can tell whether to use fast locals. > Do you have a different way of handling that in mind > for Py3k? Yes. If we implement the module-level analysis it should be easy enough to track whether 'exec' refers to the built-in function. (We're already planning to add some kind of prohibition against outside modules poking new globals into a module that shadow built-ins.) But I also see no bones in requiring the use of a dict arg if you want to observe the side effects of the exec'ed code. So instead of def f(s): exec s print a # presumably s must contain an assignment to a you'd have to write def f(s): ns = {} exec(s, ns) print ns['a'] This makes it a lot clearer what happens IMO. -- --Guido van Rossum (home page: http://www.python.org/~guido/) From ncoghlan at gmail.com Sat Sep 2 05:42:45 2006 From: ncoghlan at gmail.com (Nick Coghlan) Date: Sat, 02 Sep 2006 13:42:45 +1000 Subject: [Python-3000] Exception Expressions In-Reply-To: <76fd5acf0608311042k231fb36w1bf5d1e7e4eebe0c@mail.gmail.com> References: <76fd5acf0608311042k231fb36w1bf5d1e7e4eebe0c@mail.gmail.com> Message-ID: <44F8FDB5.6000808@gmail.com> An interesting idea, although I suspect a leading try keyword would make things clearer. (try expr1 except expr2 if exc_type) print (try letters[7] except "N/A" if IndexError) f = (try open(filename) except open(filename2) if IOError) print (try eval(expr) except "Can not divide by zero!" if ZeroDivisionError) val = (try db.get(key) except cache.get(key) if TimeoutError) This wouldn't help the chaining problem that Greg pointed out, though: try open(name1) except (try open(name2) except open(name3) if IOError) if IOError Using a different keyword or a comma so expr2 comes last as Greg suggested would fix that: try open(name1) except IOError, (try open(name2) except IOError, open(name3)) I'd be somewhere between -1 and -0 at this point in time. Depending on the results a review of the standard library describing actual use cases that could be made easier to read might be enough to get me to a +0. Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia --------------------------------------------------------------- http://www.boredomandlaziness.org From ncoghlan at gmail.com Sat Sep 2 05:47:57 2006 From: ncoghlan at gmail.com (Nick Coghlan) Date: Sat, 02 Sep 2006 13:47:57 +1000 Subject: [Python-3000] locale-aware strings ? In-Reply-To: References: Message-ID: <44F8FEED.9000600@gmail.com> Fredrik Lundh wrote: > today's Python supports "locale aware" 8-bit strings; e.g. > > >>> import locale > >>> "???".isalpha() > False > >>> locale.setlocale(locale.LC_ALL, "sv_SE") > 'sv_SE' > >>> "???".isalpha() > True > > to what extent should this be supported by Python 3000 ? Since all strings will be Unicode by then: >>> u"???".isalpha() True Cheers, Nick. 
-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
             http://www.boredomandlaziness.org

From qrczak at knm.org.pl  Sat Sep  2 09:57:11 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Sat, 02 Sep 2006 09:57:11 +0200
Subject: [Python-3000] Making more effective use of slice objects in Py3k
In-Reply-To: <44F8CC0F.2020004@canterbury.ac.nz> (Greg Ewing's message of
	"Sat, 02 Sep 2006 12:10:55 +1200")
References: <20060827184941.1AE8.JCARLSON@uci.edu>
	<20060829102307.1B0F.JCARLSON@uci.edu>
	<6a36e7290608302056v4b0e68abrfe0c5b1fc927ff@mail.gmail.com>
	<20060831044354.GH6257@performancedrivers.com>
	<44F72E75.2050204@acm.org> <44F7A557.2010002@acm.org>
	<44F8CC0F.2020004@canterbury.ac.nz>
Message-ID: <87mz9izoo8.fsf@qrnik.zagroda>

Greg Ewing writes:

> It might be possible to represent it in a narrower format,
> however. Perhaps there should be an explicit operation for
> re-packing a string into the narrowest possible format?

I suppose it's better to always normalize a polymorphic string
representation. And always normalize bignums to fixnums (long->int).

It increases chances of using the more compact representation. It
doesn't add any asymptotic cost, it's done when the whole object is to
be allocated anyway (these are immutable objects). It simplifies
equality comparison. The narrow formats should be statistically more
common than wide formats anyway.

Programmers should not be expected to care about explicitly calling a
normalization function.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From tomerfiliba at gmail.com  Sat Sep  2 17:53:59 2006
From: tomerfiliba at gmail.com (tomer filiba)
Date: Sat, 2 Sep 2006 17:53:59 +0200
Subject: [Python-3000] encoding hell
Message-ID: <1d85506f0609020853y6cba1b85h552157c2b2614ca2@mail.gmail.com>

i'm quite finished with the base of iostack (streams and layers), and
have moved to implementing the adapters layer (especially the dreaded
TextAdapter).

as was discussed earlier, streams and layers work with bytes, while
adapters may work with arbitrary objects (be it struct-style records,
serialized objects, characters and whatnot).

the question that arises is -- how far should we stretch this abstraction?
for example, the TextAdapter reads and writes characters to the
stream, after they go encoding or decoding, so from the programmer's
point of view, he's working with *characters*, not *bytes*.
that means the programmer need not be aware of how the characters
are "physically" stored in the underlying stream.

that's all very nice, but what do we do when it comes to seek()ing?
do you want to seek by character position or by byte position?
logically you are working with characters, but it would be impossible
to implement without first decoding the entire stream in-memory...
which is unacceptable of course.

and if seek()ing is byte-oriented, then you must somehow seek
only to the beginning of a multibyte character sequence... how
would you do that?

my solution would be completely leaving seek() and tell() out of the
3rd layer -- it's a byte-level operation.

anyone thinks differently? if so, what's your solution?
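btw, to make the byte-vs-character problem above concrete, here's a
tiny standalone sketch (nothing iostack-specific about it, just plain
python):

# two *characters* but three *bytes*: "a" is one byte in UTF8,
# HEBREW LETTER ALEF is two
data = u"a\u05d0".encode("utf8")
assert len(data) == 3                  # byte length
assert len(data.decode("utf8")) == 2   # character length

# byte-seeking into the middle of a multibyte sequence leaves you
# with bytes that no longer decode:
try:
    data[2:].decode("utf8")            # starts inside the ALEF sequence
except UnicodeError:
    pass                               # invalid start byte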
- - - - you can find the latest sources here (note: i haven't tested it yet, many things are likely to be broken, it's still being redesigned): http://sebulbasvn.googlecode.com/svn/trunk/iostack/ http://sebulbasvn.googlecode.com/svn/trunk/sock2/ -tomer From g.brandl at gmx.net Sat Sep 2 18:36:37 2006 From: g.brandl at gmx.net (Georg Brandl) Date: Sat, 02 Sep 2006 18:36:37 +0200 Subject: [Python-3000] The future of exceptions Message-ID: While looking at the changes necessary to implement the exception related syntax changes (except ... as ..., raise without type), I came across some more substantial things that I think must be discussed. * How should exceptions be represented in C code? Should there still be a (type, value, traceback) triple? * Could the traceback be made an attribute of the exception? * What about exception chaining? Something like this comes to mind:: try: whatever except ValueError as err: raise CustomException("Something went wrong", prev=err) With tracebacks becoming part of the exception, that could be:: raise CustomException(*args, prev=err, tb=traceback) (`prev` and `tb` would be keyword-only arguments) With that, all exception info would be contained in one object, so sys.exc_info() could be renamed to sys.last_exc(). cheers, Georg From qrczak at knm.org.pl Sat Sep 2 20:04:08 2006 From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk) Date: Sat, 02 Sep 2006 20:04:08 +0200 Subject: [Python-3000] The future of exceptions In-Reply-To: (Georg Brandl's message of "Sat, 02 Sep 2006 18:36:37 +0200") References: Message-ID: <87pseew3fr.fsf@qrnik.zagroda> Georg Brandl writes: > * Could the traceback be made an attribute of the exception? > > * What about exception chaining? > > Something like this comes to mind:: > > try: > whatever > except ValueError as err: > raise CustomException("Something went wrong", prev=err) In my language the traceback is materialized from the stack only if needed (typically when an exception escapes from the toplevel), and it includes the history of other exceptions thrown from exception handlers, intermingled with source locations. The stack is not physically unwound until an exception handler completes successfully, so the data is available until then. For example the above (without storing prev) would include: - locations of active functions leading to whatever - the location of whatever when the value error is raised - exception: the ValueError instance - the location of raise CustomException - exception: the CustomException instance Printing the stack trace recognizes when the same exception object is reraised again, and prints this as a propagation instead of repeating the exception description. Of course this design is suitable only if the previous exception is used merely for printing the stack trace, not for unpacking and examining by the program. I don't know how Python stack traces are implemented, so I have no idea whether this would be practical for Python, assuming it would be desirable at all. 
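For what it's worth, the "prev" half of the proposal needs no new
interpreter machinery at all; it can be approximated today with an
ordinary attribute. A rough sketch (the "prev" name and the helper
function are conventions made up here, not an existing API):

import sys

def raise_chained(exc_type, *args):
    # raise a new exception that remembers the one currently
    # being handled
    exc = exc_type(*args)
    exc.prev = sys.exc_info()[1]
    raise exc

try:
    try:
        {}["missing"]
    except KeyError:
        raise_chained(RuntimeError, "lookup failed")
except RuntimeError, err:
    print err, "<- caused by ->", repr(err.prev)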
-- __("< Marcin Kowalczyk \__/ qrczak at knm.org.pl ^^ http://qrnik.knm.org.pl/~qrczak/ From talin at acm.org Sat Sep 2 22:23:32 2006 From: talin at acm.org (Talin) Date: Sat, 02 Sep 2006 13:23:32 -0700 Subject: [Python-3000] encoding hell In-Reply-To: <1d85506f0609020853y6cba1b85h552157c2b2614ca2@mail.gmail.com> References: <1d85506f0609020853y6cba1b85h552157c2b2614ca2@mail.gmail.com> Message-ID: <44F9E844.2020603@acm.org> tomer filiba wrote: > i'm quite finished with the base of iostack (streams and layers), and > have moved to implementing the adpaters layer (especially the dreaded > TextAdapter). > > as was discussed earlier, streams and layers work with bytes, while > adpaters may work with arbitrary objects (be it struct-style records, > serialized objects, characters and whatnot). > > the question that arises is -- how far should we stretch this abstraction? > for example, the TextAdapter reads and writes characters to the > stream, after they go encoding or decoding, so from the programmer's > point of view, he's working with *characters*, not *bytes*. > that means the programmer need not be aware of how the characters > are "physically" stored in the underlying stream. > > that's all very nice, but what do we do when it comes to seek()ing? > do you want to seek by character position or by byte position? > logically you are working with characters, but it would be impossible > to implement without first decoding the entire stream in-memory... > which is unacceptable of course. > > and if seek()ing is byte-oriented, then you must somehow seek > only to the beginning of a multibyte character sequence... how > would you do that? > > my solution would be completely leaving seek() and tell() out of the > 3rd layer -- it's a byte-level operation. > > anyone thinks differently? if so, what's your solution? Well, for comparison with other APIs: The .Net equivalent, System.IO.TextReader, does not have a "seek" method at all. The Java version, Java.io.BufferedReader, has a "skip()" method which only allows seeking forward. Sounds to me like copying the Java model would work. -- Talin From brett at python.org Sat Sep 2 22:44:00 2006 From: brett at python.org (Brett Cannon) Date: Sat, 2 Sep 2006 13:44:00 -0700 Subject: [Python-3000] The future of exceptions In-Reply-To: References: Message-ID: On 9/2/06, Georg Brandl wrote: > > While looking at the changes necessary to implement the exception > related syntax changes (except ... as ..., raise without type), > I came across some more substantial things that I think must be discussed. You have read Ping's PEP 344, right? * How should exceptions be represented in C code? Should there still > be a (type, value, traceback) triple? > > * Could the traceback be made an attribute of the exception? The problem with this is that it keeps the frame alive. This is why this and exception chaining were considered a design issue in Ping's PEP since that is a lot of stuff to keep alive. * What about exception chaining? > > Something like this comes to mind:: > > try: > whatever > except ValueError as err: > raise CustomException("Something went wrong", prev=err) > > With tracebacks becoming part of the exception, that could be:: > > raise CustomException(*args, prev=err, tb=traceback) > > (`prev` and `tb` would be keyword-only arguments) > > With that, all exception info would be contained in one object, > so sys.exc_info() could be renamed to sys.last_exc(). Right, which is why the original suggestion came up in the first place. 
It would be nice to compartmentalize exceptions entirely, but the worry
of keeping a ton of memory alive for it needs to be addressed,
especially if exceptions are to be kept lightweight and usable for
things other than flagging errors.

-Brett
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.python.org/pipermail/python-3000/attachments/20060902/9caf2f96/attachment.html 

From tomerfiliba at gmail.com  Sun Sep  3 00:29:25 2006
From: tomerfiliba at gmail.com (tomer filiba)
Date: Sun, 3 Sep 2006 00:29:25 +0200
Subject: [Python-3000] encoding hell
In-Reply-To: <44F9E844.2020603@acm.org>
References: <1d85506f0609020853y6cba1b85h552157c2b2614ca2@mail.gmail.com>
	<44F9E844.2020603@acm.org>
Message-ID: <1d85506f0609021529o3a83dccbod0a7a643d39da696@mail.gmail.com>

[Talin]
> The Java version, Java.io.BufferedReader, has a "skip()" method which
> only allows seeking forward.
> Sounds to me like copying the Java model would work.

then there's no need for it at all... just read() and discard the
return value. we don't need a special API for that.

on the other hand, the .NET version has a BaseStream attribute holding
the underlying stream over which the StreamReader operates... this
means you *can* change the position if the underlying stream supports
seeking.

i read through the msdn but found no explicit definition for what
happens in the case of seeking in text-encoded streams, but they noted
somewhere they use a "best fit" decoder, which, to the best of my
understanding, may skip some bytes until it's in synch with the
stream. that's a *horrible* design, imho, but that's microsoft.

i say let's leave it below layer 3, at the byte level. if users find
seeking very important, we can come up with a layer-2 ReSyncLayer,
which will attempt to come in synch with a specified encoding.
for example:

f = TextAdapter(
    ReSyncLayer(
        BufferedLayer(
            FileStream("blah", "r")
        ),
        encoding = "utf8"
    ),
    encoding = "utf8"
)

# read 3 UTF8 *characters*
f.read(3)
# this will seek by AT LEAST 7 *bytes*, until resynched
f.substream.seekby(7)
# we can resume reading of UTF8 *characters*
f.read(3)

heck, i even like this idea :)
thanks for the pointers.


-tomer

On 9/2/06, Talin wrote:
> tomer filiba wrote:
> > i'm quite finished with the base of iostack (streams and layers), and
> > have moved to implementing the adapters layer (especially the dreaded
> > TextAdapter).
> >
> > as was discussed earlier, streams and layers work with bytes, while
> > adapters may work with arbitrary objects (be it struct-style records,
> > serialized objects, characters and whatnot).
> >
> > the question that arises is -- how far should we stretch this abstraction?
> > for example, the TextAdapter reads and writes characters to the
> > stream, after they go encoding or decoding, so from the programmer's
> > point of view, he's working with *characters*, not *bytes*.
> > that means the programmer need not be aware of how the characters
> > are "physically" stored in the underlying stream.
> >
> > that's all very nice, but what do we do when it comes to seek()ing?
> > do you want to seek by character position or by byte position?
> > logically you are working with characters, but it would be impossible
> > to implement without first decoding the entire stream in-memory...
> > which is unacceptable of course.
> >
> > and if seek()ing is byte-oriented, then you must somehow seek
> > only to the beginning of a multibyte character sequence... how
> > would you do that?
> > > > my solution would be completely leaving seek() and tell() out of the > > 3rd layer -- it's a byte-level operation. > > > > anyone thinks differently? if so, what's your solution? > > Well, for comparison with other APIs: > > The .Net equivalent, System.IO.TextReader, does not have a "seek" method > at all. > > The Java version, Java.io.BufferedReader, has a "skip()" method which > only allows seeking forward. > > Sounds to me like copying the Java model would work. > > -- Talin > From greg.ewing at canterbury.ac.nz Sun Sep 3 01:06:01 2006 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Sun, 03 Sep 2006 11:06:01 +1200 Subject: [Python-3000] encoding hell In-Reply-To: <1d85506f0609020853y6cba1b85h552157c2b2614ca2@mail.gmail.com> References: <1d85506f0609020853y6cba1b85h552157c2b2614ca2@mail.gmail.com> Message-ID: <44FA0E59.9010302@canterbury.ac.nz> tomer filiba wrote: > my solution would be completely leaving seek() and tell() out of the > 3rd layer -- it's a byte-level operation. That's what I'd recommend, too. Seeking doesn't make sense when the underlying units aren't fixed-length. The best you could do would be to return some kind of opaque object from tell() that could be passed back to seek(). But I'm far from convinced that would be worth the trouble. -- Greg From ironfroggy at gmail.com Sun Sep 3 02:24:22 2006 From: ironfroggy at gmail.com (Calvin Spealman) Date: Sat, 2 Sep 2006 20:24:22 -0400 Subject: [Python-3000] The future of exceptions In-Reply-To: References: Message-ID: <76fd5acf0609021724ha1e0d06s1f362bad5595e820@mail.gmail.com> On 9/2/06, Brett Cannon wrote: > Right, which is why the original suggestion came up in the first place. It > would be nice to compartmentalize exceptions entirely, but the worry of > keeping a ont of memory alive for it needs to be addressed, especially if > exceptions are to be kept lightweight and usable for things other than > flagging errors. > > -Brett So, at issue is attaching tracebacks to exceptions keeps too much alive and thus makes exceptions too heavy? If the traceback was passed to the exception constructor and then held as an attribute of the exception, any exception meant for "light" work (ie., not normal error flagging) could simply decided not to include the traceback, and so it would be destroyed, removing the weight from the exception. Similarly, tracebacks could have some lean() method to drop references to the frames. From brett at python.org Sun Sep 3 03:34:47 2006 From: brett at python.org (Brett Cannon) Date: Sat, 2 Sep 2006 18:34:47 -0700 Subject: [Python-3000] The future of exceptions In-Reply-To: <76fd5acf0609021724ha1e0d06s1f362bad5595e820@mail.gmail.com> References: <76fd5acf0609021724ha1e0d06s1f362bad5595e820@mail.gmail.com> Message-ID: On 9/2/06, Calvin Spealman wrote: > > On 9/2/06, Brett Cannon wrote: > > Right, which is why the original suggestion came up in the first > place. It > > would be nice to compartmentalize exceptions entirely, but the worry of > > keeping a ont of memory alive for it needs to be addressed, especially > if > > exceptions are to be kept lightweight and usable for things other than > > flagging errors. > > > > -Brett > > So, at issue is attaching tracebacks to exceptions keeps too much > alive and thus makes exceptions too heavy? Basically. Memory usage goes up if you do this as it stands now. 
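Here is a contrived sketch of why (the size is made up, but the
mechanism is real: the traceback references the frame, and the frame
references its locals):

import sys

def leaky():
    big = "x" * 10000000        # a large local, roughly 10 MB
    raise ValueError("oops")

try:
    leaky()
except ValueError:
    exc_type, exc, tb = sys.exc_info()
    exc.traceback = tb          # emulate the proposed attribute
    # as long as 'exc' stays alive, so does leaky()'s frame --
    # and with it the big string bound to 'big'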
If the traceback was passed > to the exception constructor and then held as an attribute of the > exception, any exception meant for "light" work (ie., not normal error > flagging) could simply decided not to include the traceback, and so it > would be destroyed, removing the weight from the exception. Similarly, > tracebacks could have some lean() method to drop references to the > frames. > Problem with that is you then lose any API guarantees of the traceback being there, which would mean you would still need to keep around sys.exc_info(). -Brett -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.python.org/pipermail/python-3000/attachments/20060902/060a9cf8/attachment.html From fredrik at pythonware.com Sun Sep 3 11:19:06 2006 From: fredrik at pythonware.com (Fredrik Lundh) Date: Sun, 03 Sep 2006 11:19:06 +0200 Subject: [Python-3000] encoding hell In-Reply-To: <44FA0E59.9010302@canterbury.ac.nz> References: <1d85506f0609020853y6cba1b85h552157c2b2614ca2@mail.gmail.com> <44FA0E59.9010302@canterbury.ac.nz> Message-ID: Greg Ewing wrote: > The best you could do would be to return some kind > of opaque object from tell() that could be passed > back to seek(). that's how seek/tell works on text files in today's Python, of course. if you're writing portable code, you can only seek to the beginning or end of the file, or to a position returned to you by tell. From 2006 at jmunch.dk Sun Sep 3 19:11:27 2006 From: 2006 at jmunch.dk (Anders J. Munch) Date: Sun, 03 Sep 2006 19:11:27 +0200 Subject: [Python-3000] encoding hell In-Reply-To: <1d85506f0609020853y6cba1b85h552157c2b2614ca2@mail.gmail.com> References: <1d85506f0609020853y6cba1b85h552157c2b2614ca2@mail.gmail.com> Message-ID: <44FB0CBF.7070102@jmunch.dk> tomer filiba wrote: > my solution would be completely leaving seek() and tell() out of the > 3rd layer -- it's a byte-level operation. > > anyone thinks differently? if so, what's your solution? seek and tell are a poor mans sequence. I would have nothing by those names. I would have input streams, output streams and sequences, and I wouldn't mix the three. FileReader would be an InputStream, FileWriter would be an OutputStream. FileBytes would support the sequence protocol, mimicking bytes objects. It would support random-access read and write using __getitem__ and __setitem__, allowing slice assignment for slices of equal size. And there would be append() to extend the file, and partial __delitem__ support for truncating. Looking at your iostack2 Stream class, no sooner do you introduce the key methods read and write, than you supplement them with capability queries readable and writable that check whether these methods may even be called. IMO this is a clear indication that these methods really want to be refactored into separate classes. I think you'll find that separating input, output and random access into three separate ADTs will much simplify BufferingLayer (even though you'll need three of them). At least if you intend to take interactions between reads and writes into account. 
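To show the shape I have in mind for FileBytes, here is a rough,
untested sketch over a plain file object (slice handling and
__delitem__ omitted for brevity):

class FileBytes(object):
    """Sequence-style random access to a file (sketch only)."""

    def __init__(self, f):
        self.f = f                 # a file opened in binary mode, e.g. 'r+b'

    def __len__(self):
        self.f.seek(0, 2)          # seek to end-of-file
        return self.f.tell()

    def __getitem__(self, index):
        self.f.seek(index)         # plain non-negative indices only
        return self.f.read(1)

    def __setitem__(self, index, byte):
        self.f.seek(index)         # overwrite in place, same size
        self.f.write(byte)

    def append(self, data):
        self.f.seek(0, 2)          # extend at the end
        self.f.write(data)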
regards, Anders

From tomerfiliba at gmail.com  Sun Sep  3 20:17:39 2006
From: tomerfiliba at gmail.com (tomer filiba)
Date: Sun, 3 Sep 2006 20:17:39 +0200
Subject: [Python-3000] encoding hell
In-Reply-To: <44FB0CBF.7070102@jmunch.dk>
References: <1d85506f0609020853y6cba1b85h552157c2b2614ca2@mail.gmail.com>
	<44FB0CBF.7070102@jmunch.dk>
Message-ID: <1d85506f0609031117h4074b7b4t2402ce665cb147f1@mail.gmail.com>

> FileReader would be an InputStream,
> FileWriter would be an OutputStream

yes, this has been discussed, but that's too java-ish by nature.
besides, how would this model handle a simple operation, such as
file("foo", "w+") ?

opening TWO file descriptors for that purpose, one for reading and
another for writing, is a complete waste of resources: handles are not
cheap. not to mention that opening the same file multiple times may
run you into platform-specific pits, like read-after-write bugs, etc.

so the obvious solution is having an underlying "file-like object",
which is basically like today's file (supports read() AND write()),
over which InputStream and OutputStream just expose different views:

f = file(...)
fr = FileReader(f)
fw = FileWriter(f)
fr.read()
fw.write()

now, this means you start with a "capable" object like file, with all
of the desired operations, and you intentionally CRIPPLE it down into
separate reading and writing front-ends. so what sense does that make?
if you want an InputStream, just be sure you only call read() or
readall(); if you want an OutputStream limit yourself to calling
write(). input-only/output-only streams are just silly and artificial
overhead -- we don't need them.

the java/.NET world relies on interfaces so much that it might make
sense in that context. but that's not the python way.

> no sooner do you introduce the
> key methods read and write, than you supplement them with capability
> queries readable and writable that check whether these methods may
> even be called. IMO this is a clear indication that these methods
> really want to be refactored into separate classes.

the reason is some streams, like pipes or partially shutdown()ed-
sockets may be unidirectional; some (i.e., sockets) may not support
seeking -- but the 2nd layer may augment that. for example, the
BufferingLayer may add seeking (it already supports unreading).

that's why streams are queriable -- iostack has a layered structure
that allows each layer to add more functionality to the underlying
layer. in other words, all streams are NOT born equal, but they can be
made equal later :)

that way, when your function accepts a stream as an argument, it would
just check s.readable or s.seekable, without regard to the *type* of s
itself, or the underlying storage -- it may be a file, it may be a
buffered socket, but as long as you can read from it/seek in it, your
code would work just fine. kinda like duck-typing.

> FileBytes would support the
> sequence protocol, mimicking bytes objects. It would support
> random-access read and write using __getitem__ and __setitem__,
> allowing slice assignment for slices of equal size.

this may be a good direction. i'll try to see how it fits in.


-tomer

On 9/3/06, Anders J. Munch <2006 at jmunch.dk> wrote:
> tomer filiba wrote:
> > my solution would be completely leaving seek() and tell() out of the
> > 3rd layer -- it's a byte-level operation.
> >
> > anyone thinks differently? if so, what's your solution?
>
> seek and tell are a poor mans sequence. I would have nothing by those
> names.
> > I would have input streams, output streams and sequences, and I > wouldn't mix the three. FileReader would be an InputStream, > FileWriter would be an OutputStream. FileBytes would support the > sequence protocol, mimicking bytes objects. It would support > random-access read and write using __getitem__ and __setitem__, > allowing slice assignment for slices of equal size. And there would > be append() to extend the file, and partial __delitem__ support for > truncating. > > Looking at your iostack2 Stream class, no sooner do you introduce the > key methods read and write, than you supplement them with capability > queries readable and writable that check whether these methods may > even be called. IMO this is a clear indication that these methods > really want to be refactored into separate classes. > > I think you'll find that separating input, output and random access > into three separate ADTs will much simplify BufferingLayer (even > though you'll need three of them). At least if you intend to take > interactions between reads and writes into account. > > regards, > Anders > > From qrczak at knm.org.pl Sun Sep 3 22:23:23 2006 From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk) Date: Sun, 03 Sep 2006 22:23:23 +0200 Subject: [Python-3000] encoding hell In-Reply-To: <1d85506f0609031117h4074b7b4t2402ce665cb147f1@mail.gmail.com> (tomer filiba's message of "Sun, 3 Sep 2006 20:17:39 +0200") References: <1d85506f0609020853y6cba1b85h552157c2b2614ca2@mail.gmail.com> <44FB0CBF.7070102@jmunch.dk> <1d85506f0609031117h4074b7b4t2402ce665cb147f1@mail.gmail.com> Message-ID: <87lkp0bsxw.fsf@qrnik.zagroda> "tomer filiba" writes: >> FileReader would be an InputStream, >> FileWriter would be an OutputStream > > yes, this has been discussed, but that's too java-ish by nature. > besides, how would this model handle a simple operation, such as > file("foo", "w+") ? What is a rationale of this operation for a text file? -- __("< Marcin Kowalczyk \__/ qrczak at knm.org.pl ^^ http://qrnik.knm.org.pl/~qrczak/ From aahz at pythoncraft.com Sun Sep 3 22:45:28 2006 From: aahz at pythoncraft.com (Aahz) Date: Sun, 3 Sep 2006 13:45:28 -0700 Subject: [Python-3000] encoding hell In-Reply-To: <87lkp0bsxw.fsf@qrnik.zagroda> References: <1d85506f0609020853y6cba1b85h552157c2b2614ca2@mail.gmail.com> <44FB0CBF.7070102@jmunch.dk> <1d85506f0609031117h4074b7b4t2402ce665cb147f1@mail.gmail.com> <87lkp0bsxw.fsf@qrnik.zagroda> Message-ID: <20060903204528.GA3950@panix.com> On Sun, Sep 03, 2006, Marcin 'Qrczak' Kowalczyk wrote: > "tomer filiba" writes: >> >> file("foo", "w+") ? > > What is a rationale of this operation for a text file? You want to be able to read the file and write data to it. That argues in favor of seek(0) and seek(-1) being the only supported behaviors, though. -- Aahz (aahz at pythoncraft.com) <*> http://www.pythoncraft.com/ I support the RKAB From 2006 at jmunch.dk Mon Sep 4 00:29:43 2006 From: 2006 at jmunch.dk (Anders J. Munch) Date: Mon, 04 Sep 2006 00:29:43 +0200 Subject: [Python-3000] encoding hell In-Reply-To: <1d85506f0609031117h4074b7b4t2402ce665cb147f1@mail.gmail.com> References: <1d85506f0609020853y6cba1b85h552157c2b2614ca2@mail.gmail.com> <44FB0CBF.7070102@jmunch.dk> <1d85506f0609031117h4074b7b4t2402ce665cb147f1@mail.gmail.com> Message-ID: <44FB5757.6070209@jmunch.dk> tomer filiba wrote: >> FileReader would be an InputStream, >> FileWriter would be an OutputStream > > yes, this has been discussed, but that's too java-ish by nature. 
> besides, how would this model handle a simple operation, such as
> file("foo", "w+") ?

You mean, with the intent of both reading and writing to the file in
the same go? That's what I meant FileBytes for.

Do you have a requirement for drop-in compatibility with the current
I/O? In all my programming days I don't believe I have written to and
read from the same file handle even once. Use cases exist, like if
you're implementing a DBMS, or adding to a zip file in-place, but
they're the exception, and by separating that functionality out in a
dedicated class like FileBytes, you avoid having the complexities of
mixed input and output affect your typical use cases.

> the reason is some streams, like pipes or partially shutdown()ed-
> sockets may be unidirectional; some (i.e., sockets) may not support
> seeking -- but the 2nd layer may augment that. for example, the
> BufferingLayer may add seeking (it already supports unreading).

Watch out! There's an essential difference between files and
bidirectional communications channels that you need to take into
account. For a TCP connection, input and output can be seen as
isolated from one another, with each their own stream position, and
each their own contents. For read/write files, it's a whole different
ballgame, because stream position and data are shared.

That means you cannot use the same buffering code for both cases. For
files, whenever you write something, you need to take into account
that that may overlap your read buffer or change read position. You
should take another look at layer.BufferingLayer with that in mind.

regards, Anders

From talin at acm.org  Mon Sep  4 01:04:34 2006
From: talin at acm.org (Talin)
Date: Sun, 03 Sep 2006 16:04:34 -0700
Subject: [Python-3000] encoding hell
In-Reply-To: <44FB5757.6070209@jmunch.dk>
References: <1d85506f0609020853y6cba1b85h552157c2b2614ca2@mail.gmail.com>
	<44FB0CBF.7070102@jmunch.dk>
	<1d85506f0609031117h4074b7b4t2402ce665cb147f1@mail.gmail.com>
	<44FB5757.6070209@jmunch.dk>
Message-ID: <44FB5F82.3070809@acm.org>

Anders J. Munch wrote:
> Watch out! There's an essential difference between files and
> bidirectional communications channels that you need to take into
> account. For a TCP connection, input and output can be seen as
> isolated from one another, with each their own stream position, and
> each their own contents. For read/write files, it's a whole different
> ballgame, because stream position and data are shared.
> 
> That means you cannot use the same buffering code for both cases. For
> files, whenever you write something, you need to take into account
> that that may overlap your read buffer or change read position. You
> should take another look at layer.BufferingLayer with that in mind.
> 
> regards, Anders

This is a better explanation of some of the comments I was raising 
earlier: The choice of buffering strategy depends on a number of 
factors related to how the stream is going to be used, as well as the 
internal implementation of the stream. A buffering strategy that works 
well for a socket won't work very well for a DBMS.

When I stated earlier that 'the OS can do a better job of buffering 
than we can', what I meant to say was somewhat broader than that - 
which is that each layer is, in many cases, a better judge of what 
*kind* of buffering it needs than the person assembling the layers.

This doesn't mean that each layer has to implement its own buffering 
algorithm.
The common buffering algorithms can be factored out into their own objects -- but what I'd suggest is that the choice of buffer algorithm not *normally* be exposed to the person constructing the io stack. Thus, when creating a standard "line reader", instead of having the user call: fh = TextReader( Buffer( File( ... ) ) ) Instead, let the TextReader choose the kind of buffer it wants and supply that part automatically. There are several reasons why I think this would work better: 1) You can't simply stick just any buffer object in the middle there and expect it to work. Different buffer strategies have different interfaces, and trying to meld them all into one uber-interface would make for a very complex interface. 2) The TextReader knows perfectly well what kind of buffer it needs. Depending on how TextReader is implemented, it might want a serial, read-only buffer that allows a limited degree of look-ahead buffering so that it can find the line breaks. Or it might want a pair of buffers - one decoded, one encoded. There's no way that the user can know what kind of buffer to use without knowing the implementation details of TextReader. 3) TextReader can be optimized even more if it is allowed to 'peek' inside the internals of the buffer - something that would not be allowed if it had to conform to calling the buffer through a standard interface. More generally, the choice of buffer depends on the usage pattern for reading / writing to the file - and that usage pattern is embodied in the definition of "TextReader". By creating a "TextReader" object, the user is stating their intention to read the file a certain way, in a certain order, with certain performance characteristics. The choice of buffering derives directly from those usage patterns. So the two go hand in hand. Now, I'm not saying that you can't stick additional layers in-between TextReader and FileStream if you want to. An example might be the "resync" layer that you mentioned, or a journaling layer that insures that all writes are recoverable. I'm merely saying that for the specific issue of buffering, I think that the choice of buffer type is complicated, and requires knowledge that might not be accessible to the person assembling the stack. -- Talin From greg.ewing at canterbury.ac.nz Mon Sep 4 01:04:25 2006 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Mon, 04 Sep 2006 11:04:25 +1200 Subject: [Python-3000] The future of exceptions In-Reply-To: References: <76fd5acf0609021724ha1e0d06s1f362bad5595e820@mail.gmail.com> Message-ID: <44FB5F79.6060507@canterbury.ac.nz> Brett Cannon wrote: > Basically. Memory usage goes up if you do this as it stands now. I'm not sure I follow that. The traceback gets created anyway, so how is it going to use more memory if it's attached to a throwaway exception instead of kept in a sys variable? If you keep the exception around, that would keep the traceback too, but how often are exceptions kept for long periods after being caught? -- Greg From greg.ewing at canterbury.ac.nz Mon Sep 4 01:11:34 2006 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Mon, 04 Sep 2006 11:11:34 +1200 Subject: [Python-3000] encoding hell In-Reply-To: References: <1d85506f0609020853y6cba1b85h552157c2b2614ca2@mail.gmail.com> <44FA0E59.9010302@canterbury.ac.nz> Message-ID: <44FB6126.8030706@canterbury.ac.nz> Fredrik Lundh wrote: > that's how seek/tell works on text files in today's Python, of course. 
> if you're writing portable code, you can only seek to the beginning or > end of the file, or to a position returned to you by tell. True, but with arbitrary stacks of stream-transforming objects the value might need to be even more opaque, since it might need to encapsulate internal states of decoders, etc. Could be very messy. -- Greg From brett at python.org Mon Sep 4 01:19:55 2006 From: brett at python.org (Brett Cannon) Date: Sun, 3 Sep 2006 16:19:55 -0700 Subject: [Python-3000] The future of exceptions In-Reply-To: <44FB5F79.6060507@canterbury.ac.nz> References: <76fd5acf0609021724ha1e0d06s1f362bad5595e820@mail.gmail.com> <44FB5F79.6060507@canterbury.ac.nz> Message-ID: On 9/3/06, Greg Ewing wrote: > > Brett Cannon wrote: > > > Basically. Memory usage goes up if you do this as it stands now. > > I'm not sure I follow that. The traceback gets created anyway, > so how is it going to use more memory if it's attached to a > throwaway exception instead of kept in a sys variable? It won't. If you keep the exception around, that would keep the > traceback too, but how often are exceptions kept for long > periods after being caught? Not very, but I didn't make this argument to begin with, other people did. It was a sticking point when the idea was first put forth. I personally supported adding the attributes, but people kept pushing against it. -Brett -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.python.org/pipermail/python-3000/attachments/20060903/9325b8c2/attachment.htm From jimjjewett at gmail.com Mon Sep 4 01:22:18 2006 From: jimjjewett at gmail.com (Jim Jewett) Date: Sun, 3 Sep 2006 19:22:18 -0400 Subject: [Python-3000] locale-aware strings ? In-Reply-To: <44F8FEED.9000600@gmail.com> References: <44F8FEED.9000600@gmail.com> Message-ID: On 9/1/06, Nick Coghlan wrote: > Fredrik Lundh wrote: > > today's Python supports "locale aware" 8-bit strings ... > > to what extent should this be supported by Python 3000 ? > Since all strings will be Unicode by then: > >>> u"???".isalpha() > True Two followup questions, then ... (1) To what extent should python support files (including stdin, stdout) in local (non-unicode) encodings? (not at all, per-file, settable global default?) (2) To what extent will strings have an opaque (or at least on-demand) backing store, so that decoding/encoding could be delayed? (For example, Swedish text could be stored in single-byte characters, and only converted to standard unicode on the rare occasions when it met strings in an incompatible encoding.) -jJ From jimjjewett at gmail.com Mon Sep 4 02:57:35 2006 From: jimjjewett at gmail.com (Jim Jewett) Date: Sun, 3 Sep 2006 20:57:35 -0400 Subject: [Python-3000] The future of exceptions In-Reply-To: References: <76fd5acf0609021724ha1e0d06s1f362bad5595e820@mail.gmail.com> <44FB5F79.6060507@canterbury.ac.nz> Message-ID: On 9/3/06, Brett Cannon wrote: > On 9/3/06, Greg Ewing wrote: > > The traceback gets created anyway, so how > > is it going to use more memory if it's attached to a > > throwaway exception instead of kept in a sys variable? > > ... how often are exceptions kept for long > > periods after being caught? > It was a sticking point when the idea was first put forth. I think people were really objecting to cyclic garbage in general. Both the garbage collector and weak references have improved since the original discussion. Even today, if a StopIteration() participates in a reference cycle, then it won't be reclaimed until the next gc run. 
I'm not quite sure which direction should be a weakref, but I think it would be reasonable for the cycle to get broken when an catching except block exits without reraising. -jJ From paul at prescod.net Mon Sep 4 03:55:20 2006 From: paul at prescod.net (Paul Prescod) Date: Sun, 3 Sep 2006 18:55:20 -0700 Subject: [Python-3000] locale-aware strings ? In-Reply-To: References: <44F8FEED.9000600@gmail.com> Message-ID: <1cb725390609031855r7258a2e9q2ce2877b45075744@mail.gmail.com> On 9/3/06, Jim Jewett wrote: > > On 9/1/06, Nick Coghlan wrote: > > Fredrik Lundh wrote: > > > today's Python supports "locale aware" 8-bit strings ... > > > to what extent should this be supported by Python 3000 ? > > > Since all strings will be Unicode by then: > > > >>> u"???".isalpha() > > True > > Two followup questions, then ... > > (1) To what extent should python support files (including stdin, > stdout) in local (non-unicode) encodings? (not at all, per-file, > settable global default?) I presume that Python's support of these will not change from today's. I don't think that locale changes file decoding today, nor should it. After all, files are emailed from place to place all the time. (2) To what extent will strings have an opaque (or at least > on-demand) backing store, so that decoding/encoding could be delayed? > (For example, Swedish text could be stored in single-byte characters, > and only converted to standard unicode on the rare occasions when it > met strings in an incompatible encoding.) I don't see this as particularly related to the locale issue either. It is being discussed in other threads under the name "Polymorphic strings." Fredrik Lundh said: "I think just delaying decoding would take us most of the way. the big advantage of storage polymorphism is that you can avoid decoding and encoding (and having to pay for the cycles and bytes needed for that) if you don't do have to." I believe he is working on a prototype. Paul Prescod -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.python.org/pipermail/python-3000/attachments/20060903/12c63525/attachment.htm From guido at python.org Mon Sep 4 04:11:02 2006 From: guido at python.org (Guido van Rossum) Date: Sun, 3 Sep 2006 19:11:02 -0700 Subject: [Python-3000] locale-aware strings ? In-Reply-To: References: <44F8FEED.9000600@gmail.com> Message-ID: On 9/3/06, Jim Jewett wrote: > On 9/1/06, Nick Coghlan wrote: > > Fredrik Lundh wrote: > > > today's Python supports "locale aware" 8-bit strings ... > > > to what extent should this be supported by Python 3000 ? > > > Since all strings will be Unicode by then: > > > >>> u"???".isalpha() > > True > > Two followup questions, then ... > > (1) To what extent should python support files (including stdin, > stdout) in local (non-unicode) encodings? (not at all, per-file, > settable global default?) I've always said (can someone find a quote perhaps?) that there ought to be a sensible default encoding for files (including but not limited to stdin/out/err), perhaps influenced by personalized settings, environment variables, the OS, etc. > (2) To what extent will strings have an opaque (or at least > on-demand) backing store, so that decoding/encoding could be delayed? > (For example, Swedish text could be stored in single-byte characters, > and only converted to standard unicode on the rare occasions when it > met strings in an incompatible encoding.) That seems to be a bit of a leading question. 
Talin is currently championing strings with different fixed-width storage, and others have proposed even more flexible "polymorphic strings". You might want to learn about the NSString type on Apple's ObjectiveC. BTW the term "backing store" is typically used for *disk-based* storage of large amounts of data -- but (despite that your first question is about files) I don't believe this what you're referring to. -- --Guido van Rossum (home page: http://www.python.org/~guido/) From jimjjewett at gmail.com Mon Sep 4 05:14:18 2006 From: jimjjewett at gmail.com (Jim Jewett) Date: Sun, 3 Sep 2006 23:14:18 -0400 Subject: [Python-3000] locale-aware strings ? In-Reply-To: References: <44F8FEED.9000600@gmail.com> Message-ID: On 9/3/06, Guido van Rossum wrote: > On 9/3/06, Jim Jewett wrote: > > (2) To what extent will strings have an opaque > > (or at least on-demand) backing store, so that > > decoding/encoding could be delayed? > That seems to be a bit of a leading question. Yes; I (mis-?)read the original question as asking whether non-English users would still be able to use (faster) 8-bit representations. > BTW the term "backing store" is typically used for > *disk-based* storage of large amounts of data -- > but (despite that your first question is about files) > I don't believe this what you're referring to. You are correct; I had forgotten that meaning, and was taking my usage from the CFString (~= NSString) documentation suggested earlier. There it refers to the underlying (private) real storage, rather than to a disk. Today, python unicode characters are limited to a specific fixed width at compile time, because C extensions can operate directly on the data buffer. If C extensions were required to go through the unicode methods -- or at least to explicitly request a buffer -- then the underlying storage could (often) be far more efficient. This privatization would, however, be a major change to the API. Smaller and faster localized strings are one of the compensatory benefits. -jJ From jack at psynchronous.com Mon Sep 4 09:21:29 2006 From: jack at psynchronous.com (Jack Diederich) Date: Mon, 4 Sep 2006 03:21:29 -0400 Subject: [Python-3000] The future of exceptions In-Reply-To: References: Message-ID: <20060904072129.GC5707@performancedrivers.com> On Sat, Sep 02, 2006 at 06:36:37PM +0200, Georg Brandl wrote: > While looking at the changes necessary to implement the exception > related syntax changes (except ... as ..., raise without type), > I came across some more substantial things that I think must be discussed. > > * How should exceptions be represented in C code? Should there still > be a (type, value, traceback) triple? > > * Could the traceback be made an attribute of the exception? > > * What about exception chaining? > The last time this came up everyone's eyes glazed over and the conversation stopped. That doesn't mean it isn't worth talking about it just means that exceptions are hard and potentially make GC miserable. > Something like this comes to mind:: > > try: > whatever > except ValueError as err: > raise CustomException("Something went wrong", prev=err) > > With tracebacks becoming part of the exception, that could be:: > > raise CustomException(*args, prev=err, tb=traceback) > > (`prev` and `tb` would be keyword-only arguments) > > With that, all exception info would be contained in one object, > so sys.exc_info() could be renamed to sys.last_exc(). > The current system is awkward if you want to do fancy things with exceptions and tracebacks. 
I've never had to do fancy things with exceptions and tracebacks so I'm OK with it. "raise" as a bare word covers all the cases where I need to catch, inspect, and potentially reraise the original. In the above example you are just annotating and reraising an error so a KISS suggestion might go

    try:
        whatever
    except ValueError as err:
        err.also_squawk += 'Kilroy was here'
        raise

Where 'also_squawk' was renamed to something more intuitive and much more international. -Jack From phd at mail2.phd.pp.ru Mon Sep 4 12:24:13 2006 From: phd at mail2.phd.pp.ru (Oleg Broytmann) Date: Mon, 4 Sep 2006 14:24:13 +0400 Subject: [Python-3000] encoding hell In-Reply-To: <20060903204528.GA3950@panix.com> References: <1d85506f0609020853y6cba1b85h552157c2b2614ca2@mail.gmail.com> <44FB0CBF.7070102@jmunch.dk> <1d85506f0609031117h4074b7b4t2402ce665cb147f1@mail.gmail.com> <87lkp0bsxw.fsf@qrnik.zagroda> <20060903204528.GA3950@panix.com> Message-ID: <20060904102413.GC21049@phd.pp.ru> On Sun, Sep 03, 2006 at 01:45:28PM -0700, Aahz wrote: > On Sun, Sep 03, 2006, Marcin 'Qrczak' Kowalczyk wrote: > > "tomer filiba" writes: > >> > >> file("foo", "w+") ? > > > > What is a rationale of this operation for a text file? > > You want to be able to read the file and write data to it. That argues > in favor of seek(0) and seek(-1) being the only supported behaviors, > though. Sometimes programs need tell() + seek(). Two examples (very similar, really). Example 1. I have a program, an email robot that receives email(s) and marks email addresses in a "database" that is actually a text file:

--- email database file ---
phd at phd.pp.ru
phd at oper.med.ru
--- / ---

The program opens the file in "r+" mode, reads it line by line and stores the positions of the first character in every line using tell(). When it needs to mark an email it seek()'s to the stored position and writes a '+' mark so the file looks like

--- email database file ---
+phd at phd.pp.ru
phd at oper.med.ru
--- / ---

Example 2. INN (the NNTP daemon) stores (at least stored when I was using it) information about newsgroups in a text file database. It uses another approach - it stores info using lines of equal length:

--- newsgroups ---
comp.lang.python          000001234567
comp.lang.python.announce 000000abcdef
--- / ---

Probably INN doesn't use tell() - it just calculates the position using line length. But a Python program needs tell() and seek() for such a file. Oleg. -- Oleg Broytmann http://phd.pp.ru/ phd at phd.pp.ru Programmers don't die, they just GOSUB without RETURN. From aahz at pythoncraft.com Mon Sep 4 15:39:52 2006 From: aahz at pythoncraft.com (Aahz) Date: Mon, 4 Sep 2006 06:39:52 -0700 Subject: [Python-3000] encoding hell In-Reply-To: <20060904102413.GC21049@phd.pp.ru> References: <1d85506f0609020853y6cba1b85h552157c2b2614ca2@mail.gmail.com> <44FB0CBF.7070102@jmunch.dk> <1d85506f0609031117h4074b7b4t2402ce665cb147f1@mail.gmail.com> <87lkp0bsxw.fsf@qrnik.zagroda> <20060903204528.GA3950@panix.com> <20060904102413.GC21049@phd.pp.ru> Message-ID: <20060904133951.GA10810@panix.com> On Mon, Sep 04, 2006, Oleg Broytmann wrote: > On Sun, Sep 03, 2006 at 01:45:28PM -0700, Aahz wrote: >> >> You want to be able to read the file and write data to it. That argues >> in favor of seek(0) and seek(-1) being the only supported behaviors, >> though. > > Sometimes programs need tell() + seek(). Two examples (very similar, > really). > > Example 1. 
I have a program, an email robot that receives email(s) and > marks email addresses in a "database" that is actually a text file: [snip examples of file with email addresses and INN control files] My understanding is that those are in fact binary files that are being treated as line-oriented "text" files. I would agree that there needs to be a way to do line-oriented processing on binary files, but anyone who attempts to process these as text files is foolish at best. -- Aahz (aahz at pythoncraft.com) <*> http://www.pythoncraft.com/ I support the RKAB From david.nospam.hopwood at blueyonder.co.uk Mon Sep 4 17:50:51 2006 From: david.nospam.hopwood at blueyonder.co.uk (David Hopwood) Date: Mon, 04 Sep 2006 16:50:51 +0100 Subject: [Python-3000] locale-aware strings ? In-Reply-To: References: <44F8FEED.9000600@gmail.com> Message-ID: <44FC4B5B.9010508@blueyonder.co.uk> Guido van Rossum wrote: > On 9/3/06, Jim Jewett wrote: > >>Two followup questions, then ... >> >>(1) To what extent should python support files (including stdin, >>stdout) in local (non-unicode) encodings? (not at all, per-file, >>settable global default?) Per-file, I hope. > I've always said (can someone find a quote perhaps?) that there ought > to be a sensible default encoding for files (including but not limited > to stdin/out/err), perhaps influenced by personalized settings, > environment variables, the OS, etc. While it should be possible to find out what the OS believes to be the current "system" charset (GetCPInfoEx(CP_ACP, ...) on Windows; LC_CHARSET environment variable on Unix), that does not mean that it is this charset that Python programs should normally use. When defining a new text-based file type, it is simpler to define it to be always UTF-8. >>(2) To what extent will strings have an opaque (or at least >>on-demand) backing store, so that decoding/encoding could be delayed? >>(For example, Swedish text could be stored in single-byte characters, >>and only converted to standard unicode on the rare occasions when it >>met strings in an incompatible encoding.) > > That seems to be a bit of a leading question. Talin is currently > championing strings with different fixed-width storage, and others > have proposed even more flexible "polymorphic strings". You might want > to learn about the NSString type in Apple's Objective-C. Operating on encoded constant strings, and decoding each character on the fly, works fine when the charset is stateless and each character has a 1-1 correspondence with a Unicode character (i.e. code point). In that case the program can operate on the string essentially as if it were Unicode. It still works fine for variable-width charsets (including UTF-8 and UTF-16); that just means that the program has to avoid assuming that a position in the string is the same thing as a character count. For charsets like ISCII and ISO 2022, which are stateful and/or have a different encoding model to Unicode, I don't believe this approach would work very well. But it is fine to support this for some charsets and not others. -- David Hopwood From guido at python.org Mon Sep 4 23:32:12 2006 From: guido at python.org (Guido van Rossum) Date: Mon, 4 Sep 2006 14:32:12 -0700 Subject: [Python-3000] locale-aware strings ? In-Reply-To: <44FC4B5B.9010508@blueyonder.co.uk> References: <44F8FEED.9000600@gmail.com> <44FC4B5B.9010508@blueyonder.co.uk> Message-ID: On 9/4/06, David Hopwood wrote: > Guido van Rossum wrote: > > I've always said (can someone find a quote perhaps?) 
that there ought > > to be a sensible default encoding for files (including but not limited > > to stdin/out/err), perhaps influenced by personalized settings, > > environment variables, the OS, etc. > > While it should be possible to find out what the OS believes to be > the current "system" charset (GetCPInfoEx(CP_ACP, ...) on Windows; > LC_CHARSET environment variable on Unix), that does not mean that it > is this charset that Python programs should normally use. When defining > a new text-based file type, it is simpler to define it to be always UTF-8. In this particular case I don't care what's simpler to implement, but what's most likely to do what the user expects. If on a particular box most files are encoded in encoding X, and the user did whatever is necessary to tell the tools that that's their preferred encoding, I want Python to honor that encoding when opening text files, unless the program makes other arrangements explicitly (such as specifying an explicit encoding as a parameter to open()). -- --Guido van Rossum (home page: http://www.python.org/~guido/) From rasky at develer.com Tue Sep 5 15:17:49 2006 From: rasky at develer.com (Giovanni Bajo) Date: Tue, 5 Sep 2006 15:17:49 +0200 Subject: [Python-3000] have zip() raise exception for sequences of different lengths References: <44F608B6.5010209@ewtllc.com> <44F62745.60006@ewtllc.com> Message-ID: <017401c6d0ed$af903d10$b803030a@trilan> Raymond Hettinger wrote:

> It's a PITA because it precludes all of the use cases where the
> inputs ARE intentionally of different length (like when one argument
> supplies an infinite iterator):
>
>     for lineno, ts, line in zip(count(1), timestamp(), sys.stdin):
>         print 'Line %d, Time %s: %s)' % (lineno, ts, line)

which is a much more complicated way of writing:

    for lineno, line in enumerate(sys.stdin):
        ts = time.time()
        ...

[assuming your "timestamp()" is what I think it is, never heard of it before]. I double-checked my own uses of zip() and they seem to follow the trend of those in Python stdlib: most of the cases are really programming errors if the two sequences do not match in length. I reckon the usage of infinite iterators is generally much less common. -- Giovanni Bajo From paul at prescod.net Tue Sep 5 18:08:47 2006 From: paul at prescod.net (Paul Prescod) Date: Tue, 5 Sep 2006 09:08:47 -0700 Subject: [Python-3000] locale-aware strings ? In-Reply-To: References: <44F8FEED.9000600@gmail.com> <44FC4B5B.9010508@blueyonder.co.uk> Message-ID: <1cb725390609050908m6150c1cas533a98f6a14cbcd0@mail.gmail.com> On 9/4/06, Guido van Rossum wrote: > > In this particular case I don't care what's simpler to implement, but > what's most likely to do what the user expects. If on a particular box > most files are encoded in encoding X, and the user did whatever is > necessary to tell the tools that that's their preferred encoding, I > want Python to honor that encoding when opening text files, unless the > program makes other arrangements explicitly (such as specifying an > explicit encoding as a parameter to open()). It does not strike me as accurate that on a modern computer system, a Swedish person's computer is full of ISO/Swedish encoded files and a Chinese person's computer is full of a specific Chinese encoding etc. Maybe that was true before the notion of variant encodings became so popular. But now Europeans are just as likely to use UTF-8 as a national encoding and Asians each have MANY different encodings to select from (some defined by Unicode, some national). 
I doubt you'll frequently guess correctly except in specialized apps where a user has very explicit control over their file encodings and doesn't depend on applications to choose. The direction over the lifetime of Python 3000 will be AWAY from national, local, locale-predictable encodings and TOWARDS global, standard encodings. Once we get to a place where Unicode encodings are dominant, a local-encodings feature will be useless. In the transition period, it will be actually harmful. Also, only a portion of the text data on a computer is in "documents" where the end-user has control over the encoding. There are also many, many configuration files, emails, saved web pages, chat logs etc. where the encoding was selected by someone else with a potentially different nationality. I would guess that "most" text files on "most" computers in any particular locale are in ASCII/utf-8. Japanese people also have hosts files and .htaccess files and INI files and log files and ... Python can't know whether it is dealing with one of these files or an end-user document. Of the subset of documents that actually have their encoding controlled by the local user's preferences, an increasing portion will be XML and XML documents describe their encoding explicitly. It would be wrong to use the locale to override that. Beyond all of that: It just seems wrong to me that I could send someone a bunch of files and a Python program and their results processing them would be different from mine, despite the fact that we run the same version of Python on the same operating system. Paul Prescod -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.python.org/pipermail/python-3000/attachments/20060905/18f98882/attachment.html From jimjjewett at gmail.com Tue Sep 5 18:35:35 2006 From: jimjjewett at gmail.com (Jim Jewett) Date: Tue, 5 Sep 2006 12:35:35 -0400 Subject: [Python-3000] locale-aware strings ? In-Reply-To: <1cb725390609050908m6150c1cas533a98f6a14cbcd0@mail.gmail.com> References: <44F8FEED.9000600@gmail.com> <44FC4B5B.9010508@blueyonder.co.uk> <1cb725390609050908m6150c1cas533a98f6a14cbcd0@mail.gmail.com> Message-ID: On 9/5/06, Paul Prescod wrote: > On 9/4/06, Guido van Rossum wrote: > > In this particular case I don't care what's simpler to implement, but > > what's most likely to do what the user expects. Good. > But now Europeans are just as likely to use UTF-8 as a national encoding fine; then that will be the locale. > and Asians each have MANY different encodings to select from (some defined by > Unicode, some national). and the one they typically use will be the locale. If notepad (or vi/emacs/less/cat) agree on what a text file is, and Python doesn't, it is Python that will lose. >The direction over > the lifetime of Python 3000 will be AWAY from national, local, > locale-predictable encodings and TOWARDS global, standard encodings. Ruby is not wedding itself to unicode precisely because they have seen the opposite in Japan. It sounded like the "unicode doesn't quite work" problem will be permanent, because there are fundamental differences over which glyphs should be unified when. It isn't just a matter of using a larger set; there are glyphs which should be unified in some contexts but not others. > Also, only a portion of the text data on a computer is in "documents" where > the end-user has control over the encoding. There are also many, many > configuration files, emails, saved web pages, chat logs etc. 
where the > encoding was selected by someone else with a potentially different > nationality. Typically, these either list the encoding explicitly, or stick to something close to ASCII, which is included in most national encodings. > Beyond all of that: It just seems wrong to me that I could send someone a > bunch of files and a Python program and their results processing them would > be different from mine, despite the fact that we run the same version of > Python on the same operating system. So include the charset header. -jJ From guido at python.org Tue Sep 5 18:52:59 2006 From: guido at python.org (Guido van Rossum) Date: Tue, 5 Sep 2006 09:52:59 -0700 Subject: [Python-3000] locale-aware strings ? In-Reply-To: <1cb725390609050908m6150c1cas533a98f6a14cbcd0@mail.gmail.com> References: <44F8FEED.9000600@gmail.com> <44FC4B5B.9010508@blueyonder.co.uk> <1cb725390609050908m6150c1cas533a98f6a14cbcd0@mail.gmail.com> Message-ID: On 9/5/06, Paul Prescod wrote: > Beyond all of that: It just seems wrong to me that I could send someone a > bunch of files and a Python program and their results processing them would > be different from mine, despite the fact that we run the same version of > Python on the same operating system. And it seems just as wrong if Python doesn't do what the user expects. If I were a beginning Python user, I'd hate it if I had prepared a simple data file in vi or notepad and my Python program wouldn't read it right because Python's idea of encoding differs from my editor's. Sorry Paul, I appreciate your standards-driven perspective, but in this area I'd rather build in more flexibility than strictly needed, than too little. If it turns out that on a particular platform all files are in UTF-8, making Python *on that platform* always choose UTF-8 is simple enough. OTOH, if on a particular platform UTF-8 is *not* the norm, Python should not insist on using it anyway. We can remove this feature once everybody uses UTF-8. I don't believe we're there yet, and "it just seems wrong" doesn't count as proof. :-) -- --Guido van Rossum (home page: http://www.python.org/~guido/) From g.brandl at gmx.net Tue Sep 5 19:03:32 2006 From: g.brandl at gmx.net (Georg Brandl) Date: Tue, 05 Sep 2006 19:03:32 +0200 Subject: [Python-3000] have zip() raise exception for sequences of different lengths In-Reply-To: <017401c6d0ed$af903d10$b803030a@trilan> References: <44F608B6.5010209@ewtllc.com> <44F62745.60006@ewtllc.com> <017401c6d0ed$af903d10$b803030a@trilan> Message-ID: Giovanni Bajo wrote:

> Raymond Hettinger wrote:
>
>> It's a PITA because it precludes all of the use cases where the
>> inputs ARE intentionally of different length (like when one argument
>> supplies an infinite iterator):
>>
>>     for lineno, ts, line in zip(count(1), timestamp(), sys.stdin):
>>         print 'Line %d, Time %s: %s)' % (lineno, ts, line)
>
> which is a much more complicated way of writing:
>
>     for lineno, line in enumerate(sys.stdin):
>         ts = time.time()
>         ...

enumerate() starts at 0, count(1) at 1, so you'd have to do a lineno += 1 in the body too. Whether

    for lineno, ts, line in zip(count(1), timestamp(), sys.stdin):

is more complicated than

    for lineno, line in enumerate(sys.stdin):
        ts = time.time()
        lineno += 1

is a stylistic question. (However, enumerate() could grow a second argument specifying the starting index). Georg From brian at sweetapp.com Tue Sep 5 20:12:15 2006 From: brian at sweetapp.com (Brian Quinlan) Date: Tue, 05 Sep 2006 20:12:15 +0200 Subject: [Python-3000] locale-aware strings ? 
In-Reply-To: References: <44F8FEED.9000600@gmail.com> <44FC4B5B.9010508@blueyonder.co.uk> <1cb725390609050908m6150c1cas533a98f6a14cbcd0@mail.gmail.com> Message-ID: <44FDBDFF.7090505@sweetapp.com> Guido van Rossum wrote: > And it seems just as wrong if Python doesn't do what the user expects. > If I were a beginning Python user, I'd hate it if I had prepared a > simple data file in vi or notepad and my Python program wouldn't read > it right because Python's idea of encoding differs from my editor's. As a user, I don't have any expectations regarding non-ASCII text files. I'm using a US-English version of Windows XP (very common) and I haven't changed the default encoding (very common). Python claims that my system encoding is CP436 (from sys.stdin/stdout.encoding). I can assure you that most of the documents that I work with are not in CP436 - they are a combination of ASCII, ISO8859-1, and UTF-8. I would also guess that this is true of many Windows XP (US-English) users. So, for me and users like me, Python is going to silently misinterpret my data. How about using ASCII as the default encoding and raising an exception if non-ASCII text is encountered? Cheers, Brian From guido at python.org Tue Sep 5 21:13:46 2006 From: guido at python.org (Guido van Rossum) Date: Tue, 5 Sep 2006 12:13:46 -0700 Subject: [Python-3000] locale-aware strings ? In-Reply-To: <44FDBDFF.7090505@sweetapp.com> References: <44F8FEED.9000600@gmail.com> <44FC4B5B.9010508@blueyonder.co.uk> <1cb725390609050908m6150c1cas533a98f6a14cbcd0@mail.gmail.com> <44FDBDFF.7090505@sweetapp.com> Message-ID: On 9/5/06, Brian Quinlan wrote: > Guido van Rossum wrote: > > And it seems just as wrong if Python doesn't do what the user expects. > > If I were a beginning Python user, I'd hate it if I had prepared a > > simple data file in vi or notepad and my Python program wouldn't read > > it right because Python's idea of encoding differs from my editor's. > > As a user, I don't have any expectations regarding non-ASCII text files. What tools do you use to edit or view those files? How do those tools know the encoding to use? (Auto-detection from sniffing the data is a perfectly valid answer BTW -- I see no reason why that couldn't be one option, as long as there's a way to disable it.) > I'm using a US-English version of Windows XP (very common) and I haven't > changed the default encoding (very common). Python claims that my system > encoding is CP436 (from sys.stdin/stdout.encoding). I can assure you > that most of the documents that I work with are not in CP436 - they are > a combination of ASCII, ISO8859-1, and UTF-8. I would also guess that > this is true of many Windows XP (US-English) users. So, for me and users > like me, Python is going to silently misinterpret my data. Not to any greater extent than Notepad or whatever other tool you are using. > How about using ASCII as the default encoding and raising an exception > if non-ASCII text is encountered? That would not be doing what the user wants. We have extensive experience with defaulting to ASCII in Python 2.x and it's mostly bad. There should definitely be a way to force ASCII as the default encoding (if only as a debugging aid), both in the program code and in the environment; but it shouldn't be the only default. There should also be a way to force UTF-8 as the default, or ISO-8859-1. But if CP436 is the default encoding set by the OS I don't see why Python shouldn't use that as the default *in the absence of any other preferences*. 
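To make that concrete, the selection could be as small as this sketch (PYTHONTEXTENCODING is an invented name for whichever environment override we would settle on; the locale call is what exists today):

    import locale
    import os

    def default_text_encoding():
        # An explicit environment setting wins -- e.g. "ascii" as a
        # debugging aid -- otherwise use whatever the OS/locale advertises.
        return os.environ.get("PYTHONTEXTENCODING") or locale.getpreferredencoding()

Program code can still bypass this entirely by passing an explicit encoding when opening the file.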
-- --Guido van Rossum (home page: http://www.python.org/~guido/) From paul at prescod.net Tue Sep 5 22:17:47 2006 From: paul at prescod.net (Paul Prescod) Date: Tue, 5 Sep 2006 13:17:47 -0700 Subject: [Python-3000] locale-aware strings ? In-Reply-To: References: <44F8FEED.9000600@gmail.com> <44FC4B5B.9010508@blueyonder.co.uk> <1cb725390609050908m6150c1cas533a98f6a14cbcd0@mail.gmail.com> Message-ID: <1cb725390609051317p9b75fb4l1e3068b0f42f121f@mail.gmail.com> On 9/5/06, Guido van Rossum wrote: > > On 9/5/06, Paul Prescod wrote: > > Beyond all of that: It just seems wrong to me that I could send someone > a > > bunch of files and a Python program and their results processing them > would > > be different from mine, despite the fact that we run the same version of > > Python on the same operating system. > > And it seems just as wrong if Python doesn't do what the user expects. > If I were a beginning Python user, I'd hate it if I had prepared a > simple data file in vi or notepad and my Python program wouldn't read > it right because Python's idea of encoding differs from my editor's. My point is that most textual content in the world is NOT produced in vi or notepad or other applications that read the system encoding. Most content is produced in Word (future Word files will be zipped Unicode, not opaque binary), OpenOffice, DreamWeaver, web services, gmail, Thunderbird, phpbb, etc. I haven't created locale-relevant content in a generic text editor in a very, very long time. Applications like vi and emacs that "help" you to create content that other people can't consume are not really helping at all. After all, we (now!) live in a networked era and people don't just create documents and then print them out on their local printers. Most of the time when I use text editors I am editing HTML, XML or Python and using the default of CP437 is wrong for all of those. Even Python will puke if you take a naive approach to text encodings in creating a Python program.

    sys:1: DeprecationWarning: Non-ASCII character '\xe0' in file
    c:\temp\testencoding.py on line 1, but no encoding declared; see
    http://www.python.org/peps/pep-0263.html for details

Are you going to change the Python interpreter so that it will "just work" with content created in vi and notepad? Otherwise you're saying that Python will take a modern collaboration-oriented approach to text processing but encourage Python programmers to take a naive obsolete approach. It also isn't just a question of flexibility. I think that Brian Quinlan made the good point that most English Windows users do not know what encoding their computer is using. If this represents 25% of the world's Python users, and these users run into UTF-8 data more often than CP437 then Python will guess wrong more often than it will guess right for 25% of its users. This is really dangerous because CP437 will happily read and munge UTF-8 (or even UCS-2 or binary) data. This makes CP437 a terrible default for that 25%. But it's worse than even that. GUI applications on Windows use a different encoding than command line ones. So on the same box, Python-in-Tk and Python-on-command line will answer that the system encoding is "cp437" versus "cp1252". I just tested it. http://blogs.msdn.com/oldnewthing/archive/2005/03/08/389527.aspx Were it not for these issues I would say that it "isn't a big deal" because modern Linux distributions are moving to UTF-8 default anyhow, and the Mac seems to use ASCII. So we're moving to international standards regardless. 
But default encoding on Windows is totally broken. The Mac is not totally consistent either. The console decodes UTF-8 for display. Textedit and vim munge the display in different ways (same GUI versus command-line issue again, I guess) A question: what happens when Python is reading data from a socket or other file-like object? Will that data also be decoded as if it came from the user's locale? I don't think that this discussion really has anything to do with being compatible with "most of the files on a computer". It is about being compatible with a certain set of Unix text processing applications. Paul Prescod -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.python.org/pipermail/python-3000/attachments/20060905/064cd1a7/attachment-0001.html From paul at prescod.net Tue Sep 5 22:21:25 2006 From: paul at prescod.net (Paul Prescod) Date: Tue, 5 Sep 2006 13:21:25 -0700 Subject: [Python-3000] locale-aware strings ? In-Reply-To: References: <44F8FEED.9000600@gmail.com> <44FC4B5B.9010508@blueyonder.co.uk> <1cb725390609050908m6150c1cas533a98f6a14cbcd0@mail.gmail.com> <44FDBDFF.7090505@sweetapp.com> Message-ID: <1cb725390609051321i518d7b4cm607cbae361a55d7d@mail.gmail.com> On 9/5/06, Guido van Rossum wrote: > > > So, for me and users > > like me, Python is going to silently misinterpret my data. > > Not to any greater extent than Notepad or whatever other tool you are > using. Yes. Unicode was invented in large part because people got sick of crappy tools that silently misinterpreted their data. "I see a Euro character here, a happy face there, a stack trace in a third place and my friend says he sees an accented character." Not only do we not want to emulate that (PEP 263 explicitly chooses not to), we don't want to encourage other programmers to do so either. Paul Prescod -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.python.org/pipermail/python-3000/attachments/20060905/18dd86ff/attachment.htm From guido at python.org Tue Sep 5 22:48:27 2006 From: guido at python.org (Guido van Rossum) Date: Tue, 5 Sep 2006 13:48:27 -0700 Subject: [Python-3000] locale-aware strings ? In-Reply-To: <1cb725390609051317p9b75fb4l1e3068b0f42f121f@mail.gmail.com> References: <44F8FEED.9000600@gmail.com> <44FC4B5B.9010508@blueyonder.co.uk> <1cb725390609050908m6150c1cas533a98f6a14cbcd0@mail.gmail.com> <1cb725390609051317p9b75fb4l1e3068b0f42f121f@mail.gmail.com> Message-ID: I have no desire to continue this discussion in every detail. I believe we've both made our point, eloquently enough. The designers of the I/O library will have to come up with the specific rules for deciding on the default encoding. The only thing I'm saying is that hardcoding the default encoding in the language standard (like we did for str<-->unicode in 2.0) would be a mistake. I'm trusting that building the more basic facilities (such as being able to pass an explicit encoding to open()) first will enable us to experiment with different ways of determining a default encoding. That makes more sense to me than trying to settle this argument by raising our voices. (And yes, I am building in the possibility that I'm wrong. But he-said-she-said won't convince me; only actual usage experience.) 
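The most basic of those facilities already exists today as codecs.open(); a sketch of a program making its own arrangements explicitly:

    import codecs

    f = codecs.open("data.txt", "r", encoding="utf-8")
    for line in f:
        pass    # each line arrives as a unicode object, already decoded
    f.close()
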
-- --Guido van Rossum (home page: http://www.python.org/~guido/) From oliphant.travis at ieee.org Wed Sep 6 00:17:49 2006 From: oliphant.travis at ieee.org (Travis Oliphant) Date: Tue, 05 Sep 2006 16:17:49 -0600 Subject: [Python-3000] long/int unification In-Reply-To: <1156470595.44ee57436b03d@www.domainfactory-webmail.de> References: <1156470595.44ee57436b03d@www.domainfactory-webmail.de> Message-ID: martin at v.loewis.de wrote:

> Here is a quick status of the int_unification branch, summarizing what
> I did at the Google sprint in NYC.
>
> - the int type has been dropped; the builtins int and long now both
>   refer to the long type
> - all PyInt_* API is forwarded to the PyLong_* API. Little changes to
>   the C code are necessary; the most common offender is
>   PyInt_AS_LONG((PyIntObject*)v) since I completely removed PyIntObject.
> - Much of the test suite passes, although it still has a number of bugs.
> - There are timing tests for allocation and for addition. On allocation,
>   the current implementation is about a factor of 2 slower; the integer
>   addition is about 1.5 times slower; the initial slowdown was by a
>   factor of 3. The pystones dropped about 10% (pybench fails to run
>   on p3yk).

What impact is this long/int unification going to have on C-based sub-types of the old int-type? Will you be able to sub-class the integer-type in C without carrying around all the extra baggage of the Python long? NumPy has a scalar-type that inherits from the current int-type which allows it to participate in many Python optimizations. Will the ability to do this disappear? I'm just wondering about the C-side view of the int/long unification. I can see benefit to the notion of integer unification, but wonder if strictly throwing out the small integer type on the C-level is actually going too far. In NumPy, we have 10 different integer data-types corresponding to what can be contained in an array. This direction was chosen after years of frustration of trying to fit a square peg (the item from the NumPy array) into a round hole (the limited Python scalar types). -Travis From guido at python.org Wed Sep 6 01:05:22 2006 From: guido at python.org (Guido van Rossum) Date: Tue, 5 Sep 2006 16:05:22 -0700 Subject: [Python-3000] long/int unification In-Reply-To: References: <1156470595.44ee57436b03d@www.domainfactory-webmail.de> Message-ID: On 9/5/06, Travis Oliphant wrote: > What impact is this long/int unification going to have on C-based > sub-types of the old int-type? Will you be able to sub-class the > integer-type in C without carrying around all the extra baggage of the > Python long? This seems unlikely given that the PyInt *type* will go away (though the PyInt *API* methods may well continue to exist). You can subclass the PyLong type just as easily. What baggage are you thinking of? > NumPy has a scalar-type that inherits from the current int-type which > allows it to participate in many Python optimizations. Will the ability > to do this disappear? 
What kind of optimizations are you thinking of? If you're thinking of the current special-casing for e.g. list[int] in ceval.c, that code will likely disappear (although something equivalent will eventually be added). See my message about premature optimization in the Py3k list from about 10 days ago. > I'm just wondering about the C-side view of the int/long unification. I > can see benefit to the notion of integer unification, but wonder if > strictly throwing out the small integer type on the C-level is actually > going too far. In NumPy, we have 10 different integer data-types > corresponding to what can be contained in an array. This direction was > chosen after years of frustration of trying to fit a square peg (the > item from the NumPy array) into a round hole (the limited Python scalar > types). But now that we have __index__, of course, there's less reason to subclass PyInt in the first place -- you can write your own 32-bit integer *without* inheriting from PyInt or PyLong, and it should be usable perfectly whenever an integer is expected. I'd rather make sure *this* property is provided without compromise than attempting to keep random older optimizations alive for nostalgia's sake. -- --Guido van Rossum (home page: http://www.python.org/~guido/) From guido at python.org Wed Sep 6 01:33:19 2006 From: guido at python.org (Guido van Rossum) Date: Tue, 5 Sep 2006 16:33:19 -0700 Subject: [Python-3000] long/int unification In-Reply-To: <44FE0752.9020903@ee.byu.edu> References: <1156470595.44ee57436b03d@www.domainfactory-webmail.de> <44FE0752.9020903@ee.byu.edu> Message-ID: On 9/5/06, Travis Oliphant wrote: > Guido van Rossum wrote: > > > On 9/5/06, Travis Oliphant wrote: > > > >> What impact is this long/int unification going to have on C-based > >> sub-types of the old int-type? Will you be able to sub-class the > >> integer-type in C without carrying around all the extra baggage of the > >> Python long? > > > > > > This seems unlikely given that the PyInt *type* will go away (though > > the PyInt *API* methods may well continue to exist). You can subclass > > the PyLong type just as easily. What baggage are you thinking of? > > Just the extra stuff in the C-structure needed to handle the > arbitrary-length integer. That's just an int length and 15-for-16-bits encoding of the actual value. > > If you're thinking of the current special-casing for e.g. list[int] in > > ceval.c, that code will likely disappear (although something > > equivalent will eventually be added). > > Yes, that's what I'm thinking of. It would be nice if the "something > equivalent" could be extended to other objects. I suppose the > discussion can be held off until then. > > > > > But now that we have __index__, of course, there's less reason to > > subclass PyInt in the first place -- you can write your own 32-bit > > integer *without* inheriting from PyInt or PyLong, and it should be > > usable perfectly whenever an integer is expected. I'd rather make sure > > *this* property is provided without compromise than attempting to keep > > random older optimizations alive for nostalgia's sake. > > > Of course, I agree entirely, so I doubt it will matter at all (except in > optimizations). There is probably going to be an increasing need to > tell whether or not something can handle one of these interfaces. I > know this was already discussed on this list, but was a decision reached > about how to tell if something exposes a specific interface? (I think > the relevant discussion took place under the name "callable"). 
> > I see a lot of > > isinstance(obj, int) > > in scientific Python code where testing for __index__ would be more > appropriate. I wouldn't rip this out just yet. 'int' may become an abstract type yet -- the int/long unification branch isn't the final word (if only because it doesn't pass all the unit tests yet). > Thanks for easing my mind. You're welcome. And how's that PEP coming? :-) -- --Guido van Rossum (home page: http://www.python.org/~guido/) From oliphant at ee.byu.edu Wed Sep 6 01:25:06 2006 From: oliphant at ee.byu.edu (Travis Oliphant) Date: Tue, 05 Sep 2006 17:25:06 -0600 Subject: [Python-3000] long/int unification In-Reply-To: References: <1156470595.44ee57436b03d@www.domainfactory-webmail.de> Message-ID: <44FE0752.9020903@ee.byu.edu> Guido van Rossum wrote: > On 9/5/06, Travis Oliphant wrote: > >> What impact is this long/int unification going to have on C-based >> sub-types of the old int-type? Will you be able to sub-class the >> integer-type in C without carrying around all the extra backage of the >> Python long? > > > This seems unlikely given that the PyInt *type* will go away (though > the PyInt *API* methods may well continue to exist). You can subclass > the PyLong type just as easily. What baggage are you thinking of? Just the extra stuff in the C-structure needed to handle the arbitrary-length integer. > > If you're thinking of the current special-casing for e.g. list[int] in > ceval.c, that code will likely disappear (although something > equivalent will eventually be added). Yes, that's what I'm thinking of. It would be nice if the "something equivalent" could be extended to other objects. I suppose the discussion can be held off until then. > > But now that we have __index__, of course, there's less reason to > subclass PyInt in the first place -- you can write your own 32-bit > integer *without* inheriting from PyInt or PyLong, and it should be > usable perfectly whenever an integer is expected. Id rather make sure > *this* property is provided without compromise than attempting to keep > random older optimizations alive for nostalgia's sake. Of course, I agree entirely, so I doubt it will matter at all (except in optimizations). There is probably going to be an increasing need to tell whether or not something can handle one of these interfaces. I know this was already discussed on this list, but was a decision reached about how to tell if something exposes a specific interface? (I think the relevant discussion took place under the name "callable"). I see a lot of isinstance(obj, int) in scientific Python code where testing for __index__ would be more appropriate. Thanks for easing my mind. -Travis From david.nospam.hopwood at blueyonder.co.uk Wed Sep 6 02:32:28 2006 From: david.nospam.hopwood at blueyonder.co.uk (David Hopwood) Date: Wed, 06 Sep 2006 01:32:28 +0100 Subject: [Python-3000] locale-aware strings ? In-Reply-To: References: <44F8FEED.9000600@gmail.com> <44FC4B5B.9010508@blueyonder.co.uk> <1cb725390609050908m6150c1cas533a98f6a14cbcd0@mail.gmail.com> Message-ID: <44FE171C.1090101@blueyonder.co.uk> Guido van Rossum wrote: > On 9/5/06, Paul Prescod wrote: > >> Beyond all of that: It just seems wrong to me that I could send someone a >> bunch of files and a Python program and their results processing them >> would be different from mine, despite the fact that we run the same version of >> Python on the same operating system. > > And it seems just as wrong if Python doesn't do what the user expects. 
> If I were a beginning Python user, I'd hate it if I had prepared a > simple data file in vi or notepad and my Python program wouldn't read > it right because Python's idea of encoding differs from my editor's. I don't know about vi, but notepad will open and save files that are not in the system ("ANSI") encoding just fine. On opening it checks for a BOM and auto-detects UTF-8 and UTF-16; on saving it will write a BOM if you choose "Unicode" (UTF-16LE), "Unicode big-endian" (UTF-16BE), or UTF-8 in the Encoding drop-down box. This is exactly the behaviour that most users would expect of a well-behaved Unicode-aware app. It should be as easy as possible to match this behaviour in a Python program. > Sorry Paul, I appreciate your standards-driven perspective, but in > this area I'd rather build in more flexibility than strictly needed, > than too little. If it turns out that on a particular platform all > files are in UTF-8, making Python *on that platform* always choose > UTF-8 is simple enough. The problem is not the systems where all files are UTF-8, or all files are another known charset. The problem is the platforms where half of the files are UTF-8 and half are in some other charset, determined either by type or by presence of a UTF-8 BOM. This is a *very* common situation, especially for European users. Such a user cannot set the locale to UTF-8, because that will break all of their non-Unicode-aware applications. The Unicode-aware applications typically have much better support for reading and writing files in charsets that are not the system default. So in practice the locale has to be set to the "old" charset during a migration to UTF-8. (Setting different locales for different applications is far too much hassle. On Windows, although I believe it is technically possible to do the equivalent of selecting a UTF-8 locale, most users don't know how to do it, even if they want to use UTF-8 exclusively.) -- David Hopwood From david.nospam.hopwood at blueyonder.co.uk Wed Sep 6 02:36:10 2006 From: david.nospam.hopwood at blueyonder.co.uk (David Hopwood) Date: Wed, 06 Sep 2006 01:36:10 +0100 Subject: [Python-3000] locale-aware strings ? In-Reply-To: <44FE171C.1090101@blueyonder.co.uk> References: <44F8FEED.9000600@gmail.com> <44FC4B5B.9010508@blueyonder.co.uk> <1cb725390609050908m6150c1cas533a98f6a14cbcd0@mail.gmail.com> <44FE171C.1090101@blueyonder.co.uk> Message-ID: <44FE17FA.6030103@blueyonder.co.uk> David Hopwood wrote: > I don't know about vi, but notepad will open and save files that are not in > the system ("ANSI") encoding just fine. On opening it checks for a BOM and > auto-detects UTF-8 and UTF-16; on saving it will write a BOM if you choose > "Unicode" (UTF-16LE), "Unicode big-endian" (UTF-16BE), or UTF-8 in the > Encoding drop-down box. ... and it also helpfully prompts you to select a Unicode encoding, if you forget and the file contains characters that are not representable in the ANSI encoding. > This is exactly the behaviour that most users would expect of a well-behaved > Unicode-aware app. It should be as easy as possible to match this behaviour > in a Python program. -- David Hopwood From guido at python.org Wed Sep 6 02:44:37 2006 From: guido at python.org (Guido van Rossum) Date: Tue, 5 Sep 2006 17:44:37 -0700 Subject: [Python-3000] locale-aware strings ? 
In-Reply-To: <44FE171C.1090101@blueyonder.co.uk> References: <44F8FEED.9000600@gmail.com> <44FC4B5B.9010508@blueyonder.co.uk> <1cb725390609050908m6150c1cas533a98f6a14cbcd0@mail.gmail.com> <44FE171C.1090101@blueyonder.co.uk> Message-ID: On 9/5/06, David Hopwood wrote: > Guido van Rossum wrote: > > On 9/5/06, Paul Prescod wrote: > > > >> Beyond all of that: It just seems wrong to me that I could send someone a > >> bunch of files and a Python program and their results processing them > >> would be different from mine, despite the fact that we run the same version of > >> Python on the same operating system. > > > > And it seems just as wrong if Python doesn't do what the user expects. > > If I were a beginning Python user, I'd hate it if I had prepared a > > simple data file in vi or notepad and my Python program wouldn't read > > it right because Python's idea of encoding differs from my editor's. > > I don't know about vi, but notepad will open and save files that are not in > the system ("ANSI") encoding just fine. On opening it checks for a BOM and > auto-detects UTF-8 and UTF-16; on saving it will write a BOM if you choose > "Unicode" (UTF-16LE), "Unicode big-endian" (UTF-16BE), or UTF-8 in the > Encoding drop-down box. > > This is exactly the behaviour that most users would expect of a well-behaved > Unicode-aware app. It should be as easy as possible to match this behaviour > in a Python program. And this is exactly why I want the determination of the default encoding (i.e. the encoding to be used when opening a file when no explicit encoding is specified by the Python code that does the opening) to be open-ended, rather than picking some standard default like UTF-8 and saying (like Paul seems to want to say) "this is it". > > Sorry Paul, I appreciate your standards-driven perspective, but in > > this area I'd rather build in more flexibility than strictly needed, > > than too little. If it turns out that on a particular platform all > > files are in UTF-8, making Python *on that platform* always choose > > UTF-8 is simple enough. > > The problem is not the systems where all files are UTF-8, or all files are > another known charset. The problem is the platforms where half of the files > are UTF-8 and half are in some other charset, determined either by type or by > presence of a UTF-8 BOM. This is a *very* common situation, especially for > European users. Right. (And Paul appears to be ignorant of this.) > Such a user cannot set the locale to UTF-8, because that will break all of > their non-Unicode-aware applications. The Unicode-aware applications typically > have much better support for reading and writing files in charsets that are > not the system default. So in practice the locale has to be set to the "old" > charset during a migration to UTF-8. > > (Setting different locales for different applications is far too much hassle. > On Windows, although I believe it is technically possible to do the equivalent > of selecting a UTF-8 locale, most users don't know how to do it, even if they > want to use UTF-8 exclusively.) Right. Of course, "locale" and "encoding" are somewhat orthogonal issues; the encoding may be UTF-8 but that doesn't determine other aspects of the locale (such as language-specific collation order, or culture-specific formatting of numbers, dates and money). Now, some platforms may equate the two somehow, and on those platforms we would have to inspect the locale to tell the encoding; but other platforms may specify the encoding separate from the locale... 
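As for the notepad-style sniffing discussed above, that much needs nothing new; a rough sketch with today's codecs module (UTF-32 BOMs are ignored here, and a real version would check them first, since a UTF-32-LE file starts with the same two bytes as the UTF-16-LE BOM):

    import codecs

    def sniff_encoding(path, fallback):
        # Guess a file's encoding from its BOM, else use the fallback.
        f = open(path, "rb")
        head = f.read(3)    # the longest BOM considered here is 3 bytes
        f.close()
        if head.startswith(codecs.BOM_UTF8):
            return "utf-8-sig"    # decodes and discards the BOM
        if head[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE):
            return "utf-16"       # this codec consumes the BOM itself
        return fallback
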
-- --Guido van Rossum (home page: http://www.python.org/~guido/) From david.nospam.hopwood at blueyonder.co.uk Wed Sep 6 02:46:29 2006 From: david.nospam.hopwood at blueyonder.co.uk (David Hopwood) Date: Wed, 06 Sep 2006 01:46:29 +0100 Subject: [Python-3000] locale-aware strings ? In-Reply-To: References: <44F8FEED.9000600@gmail.com> <44FC4B5B.9010508@blueyonder.co.uk> <1cb725390609050908m6150c1cas533a98f6a14cbcd0@mail.gmail.com> <44FDBDFF.7090505@sweetapp.com> Message-ID: <44FE1A65.7020900@blueyonder.co.uk> Guido van Rossum wrote: > On 9/5/06, Brian Quinlan wrote: > [...] > > That would not be doing what the user wants. We have extensive > experience with defaulting to ASCII in Python 2.x and it's mostly bad. > There should definitely be a way to force ASCII as the default > encoding (if only as a debugging aid), both in the program code and in > the environment; but it shouldn't be the only default. There should > also be a way to force UTF-8 as the default, or ISO-8859-1. But if > CP436 is the default encoding set by the OS I don't see why Python > shouldn't use that as the default *in the absence of any other > preferences*. Cp436 is almost certainly *not* the encoding set by the OS; Python has got it wrong. If Brian is using an English-language variant of Windows XP and has not changed the defaults, the system ("ANSI") encoding will be Cp1252-with-Euro (which is similar enough to ISO-8859-1 if C1 control characters are not used). -- David Hopwood From guido at python.org Wed Sep 6 03:09:21 2006 From: guido at python.org (Guido van Rossum) Date: Tue, 5 Sep 2006 18:09:21 -0700 Subject: [Python-3000] encoding hell In-Reply-To: <20060904102413.GC21049@phd.pp.ru> References: <1d85506f0609020853y6cba1b85h552157c2b2614ca2@mail.gmail.com> <44FB0CBF.7070102@jmunch.dk> <1d85506f0609031117h4074b7b4t2402ce665cb147f1@mail.gmail.com> <87lkp0bsxw.fsf@qrnik.zagroda> <20060903204528.GA3950@panix.com> <20060904102413.GC21049@phd.pp.ru> Message-ID: On 9/4/06, Oleg Broytmann wrote: > On Sun, Sep 03, 2006 at 01:45:28PM -0700, Aahz wrote: > > On Sun, Sep 03, 2006, Marcin 'Qrczak' Kowalczyk wrote: > > > "tomer filiba" writes: > > >> > > >> file("foo", "w+") ? > > > > > > What is a rationale of this operation for a text file? > > > > You want to be able to read the file and write data to it. That argues > > in favor of seek(0) and seek(-1) being the only supported behaviors, > > though. Umm, where he wrote seek(-1) he probably meant seek(0, 2) which is how one seeks to EOF.

> Sometimes programs need tell() + seek(). Two examples (very similar,
> really).
>
> Example 1. I have a program, an email robot that receives email(s) and
> marks email addresses in a "database" that is actually a text file:
>
> --- email database file ---
> phd at phd.pp.ru
> phd at oper.med.ru
> --- / ---
>
> The program opens the file in "r+" mode, reads it line by line and
> stores the positions of the first character in every line using tell().
> When it needs to mark an email it seek()'s to the stored position and
> writes a '+' mark so the file looks like
>
> --- email database file ---
> +phd at phd.pp.ru
> phd at oper.med.ru
> --- / ---

I don't understand how it can insert a character into the file without rewriting everything after that point. But it does remind me of a use case for tell+seek on a read-only text file. An email-reading program may have a text-based multi-message mailbox format (e.g. UNIX mailbox format) and build an in-memory index of seek positions using a quick initial scan (or scanning as it goes). Once it has computed the position of a message it can quickly seek to its start and display that message.
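A sketch of that index for the traditional "From " delimiter -- note that tell() is taken before each readline(), and that the for-loop form is avoided because its internal read-ahead makes tell() meaningless:

    def index_mbox(path):
        # Map message numbers to byte offsets of their "From " lines.
        offsets = []
        f = open(path, "rb")    # byte positions, so binary mode
        while True:
            pos = f.tell()
            line = f.readline()
            if not line:
                break
            if line.startswith("From "):
                offsets.append(pos)
        f.close()
        return offsets

Displaying message n is then just a matter of f.seek(offsets[n]) and reading up to the next recorded offset.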
Granted, typical mailbox formats tend to use ASCII only. But one could easily imagine a similar use case for encoded text files containing multiple application-specific sections. As long as the state of the decoder is "neutral" at the start of a line, it should be possible to do this. I like the idea that tell() returns a "cookie" which is really a byte offset. If one wants to be able to seek to positions with a non-neutral decoder state, the cookie would have to be more abstract. It shouldn't matter; text apps should not do arithmetic on seek/tell positions.

> Example 2. INN (the NNTP daemon) stores (at least stored when I was
> using it) information about newsgroups in a text file database. It uses
> another approach - it stores info using lines of equal length:
>
> --- newsgroups ---
> comp.lang.python          000001234567
> comp.lang.python.announce 000000abcdef
> --- / ---
>
> Probably INN doesn't use tell() - it just calculates the position using
> line length. But a Python program needs tell() and seek() for such a file.

-- --Guido van Rossum (home page: http://www.python.org/~guido/) From david.nospam.hopwood at blueyonder.co.uk Wed Sep 6 03:28:31 2006 From: david.nospam.hopwood at blueyonder.co.uk (David Hopwood) Date: Wed, 06 Sep 2006 02:28:31 +0100 Subject: [Python-3000] locale-aware strings ? In-Reply-To: References: <44F8FEED.9000600@gmail.com> <44FC4B5B.9010508@blueyonder.co.uk> <1cb725390609050908m6150c1cas533a98f6a14cbcd0@mail.gmail.com> <44FE171C.1090101@blueyonder.co.uk> Message-ID: <44FE243F.80203@blueyonder.co.uk> Guido van Rossum wrote: > On 9/5/06, David Hopwood wrote: >> Guido van Rossum wrote: >> > On 9/5/06, Paul Prescod wrote: >> > >> >> Beyond all of that: It just seems wrong to me that I could send >> >> someone a bunch of files and a Python program and their results >> >> processing them would be different from mine, despite the fact that >> >> we run the same version of Python on the same operating system. >> > >> > And it seems just as wrong if Python doesn't do what the user expects. >> > If I were a beginning Python user, I'd hate it if I had prepared a >> > simple data file in vi or notepad and my Python program wouldn't read >> > it right because Python's idea of encoding differs from my editor's. >> >> I don't know about vi, but notepad will open and save files that are >> not in the system ("ANSI") encoding just fine. On opening it checks for >> a BOM and auto-detects UTF-8 and UTF-16; on saving it will write a BOM >> if you choose "Unicode" (UTF-16LE), "Unicode big-endian" (UTF-16BE), or >> UTF-8 in the Encoding drop-down box. >> >> This is exactly the behaviour that most users would expect of a >> well-behaved Unicode-aware app. It should be as easy as possible to >> match this behaviour in a Python program. > > And this is exactly why I want the determination of the default > encoding (i.e. the encoding to be used when opening a file when no > explicit encoding is specified by the Python code that does the > opening) to be open-ended, rather than picking some standard default > like UTF-8 and saying (like Paul seems to want to say) "this is it". The point I was making is that the system encoding *should not* be treated as (or called) a "default" encoding. I can't speak for Paul, but that seemed to also be what he was saying. The whole idea of a default encoding is flawed. 
Ideally there would be no default; programmers should be forced to think about the issue on a case-by-case basis. In some cases they might choose to open a file with the system encoding, but that should be an explicit decision. >> (Setting different locales for different applications is far too much >> hassle. On Windows, although I believe it is technically possible to >> do the equivalent of selecting a UTF-8 locale, most users don't know >> how to do it, even if they want to use UTF-8 exclusively.) > > Right. Of course, "locale" and "encoding" are somewhat orthogonal > issues; the encoding may be UTF-8 but that doesn't determine other > aspects of the locale (such as language-specific collation order, or > culture-specific formatting of numbers, dates and money). The encoding is usually an attribute of the locale. This is certainly the case on POSIX and Windows platforms. -- David Hopwood From paul at prescod.net Wed Sep 6 03:53:53 2006 From: paul at prescod.net (Paul Prescod) Date: Tue, 5 Sep 2006 18:53:53 -0700 Subject: [Python-3000] locale-aware strings ? In-Reply-To: References: <44F8FEED.9000600@gmail.com> <44FC4B5B.9010508@blueyonder.co.uk> <1cb725390609050908m6150c1cas533a98f6a14cbcd0@mail.gmail.com> <44FE171C.1090101@blueyonder.co.uk> Message-ID: <1cb725390609051853p59574772q16ca26d17b52c76f@mail.gmail.com> On 9/5/06, Guido van Rossum wrote: > > On 9/5/06, David Hopwood wrote: > > Guido van Rossum wrote: > > > On 9/5/06, Paul Prescod wrote: > > > > > >> Beyond all of that: It just seems wrong to me that I could send > someone a > > >> bunch of files and a Python program and their results processing them > > >> would be different from mine, despite the fact that we run the same > version of > > >> Python on the same operating system. > > > > > > And it seems just as wrong if Python doesn't do what the user expects. > > > If I were a beginning Python user, I'd hate it if I had prepared a > > > simple data file in vi or notepad and my Python program wouldn't read > > > it right because Python's idea of encoding differs from my editor's. > > > > I don't know about vi, but notepad will open and save files that are not > in > > the system ("ANSI") encoding just fine. On opening it checks for a BOM > and > > auto-detects UTF-8 and UTF-16; on saving it will write a BOM if you > choose > > "Unicode" (UTF-16LE), "Unicode big-endian" (UTF-16BE), or UTF-8 in the > > Encoding drop-down box. > > > > This is exactly the behaviour that most users would expect of a > well-behaved > > Unicode-aware app. It should be as easy as possible to match this > behaviour > > in a Python program. > > And this is exactly why I want the determination of the default > encoding (i.e. the encoding to be used when opening a file when no > explicit encoding is specified by the Python code that does the > opening) to be open-ended, rather than picking some standard default > like UTF-8 and saying (like Paul seems to want to say) "this is it". I never suggested that UTF-8 should be the default. In fact, I think it was very wise of Python 2.x to make ASCII the default and I'm astounded to hear that you regret that decision. "In the face of ambiguity, refuse the temptation to guess." Python 2.x provided an option to allow users to change the default system-wide and ever since then we've (almost unanimously) counselled users against changing it. > > Sorry Paul, I appreciate your standards-driven perspective, but in > > > this area I'd rather build in more flexibility than strictly needed, > > > than too little. 
If it turns out that on a particular platform all
> > > files are in UTF-8, making Python *on that platform* always choose
> > > UTF-8 is simple enough.
> >
> > The problem is not the systems where all files are UTF-8, or all files
> > are another known charset. The problem is the platforms where half of
> > the files are UTF-8 and half are in some other charset, determined
> > either by type or by presence of a UTF-8 BOM. This is a *very* common
> > situation, especially for European users.
>
> Right. (And Paul appears to be ignorant of this.)

I don't see how the fact that an individual system can have half of the
files in one encoding and half in another could argue IN FAVOUR of a
system-global default. I would have thought it strengthens my argument
AGAINST trying to apply a random encoding to files.

You said: "If on a particular box most files are encoded in encoding X,
and the user did whatever is necessary to tell the tools that that's
their preferred encoding, I want Python to honor that encoding when
opening text files, unless the program makes other arrangements
explicitly (such as specifying an explicit encoding as a parameter to
open())."

But there is no such thing that "most users do" to tell tools what
their preferred encoding is. Most users use some random (to them)
operating system default which on Windows is usually wrong and is
different (for no particular reason) on the Macintosh than on Linux.
Long-time Windows users in this thread cannot even agree what is the
default for US English Windows because there is no single default.
There are two.

Can we at least agree that if LC_CHARSET is demonstrably wrong most of
the time on Windows that we should not use it (at least on Windows)?

Paul Prescod
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.python.org/pipermail/python-3000/attachments/20060905/dba9e8d0/attachment.html

From paul at prescod.net Wed Sep 6 04:00:06 2006
From: paul at prescod.net (Paul Prescod)
Date: Tue, 5 Sep 2006 19:00:06 -0700
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <44FE1A65.7020900@blueyonder.co.uk>
References: <44FC4B5B.9010508@blueyonder.co.uk>
	<1cb725390609050908m6150c1cas533a98f6a14cbcd0@mail.gmail.com>
	<44FDBDFF.7090505@sweetapp.com> <44FE1A65.7020900@blueyonder.co.uk>
Message-ID: <1cb725390609051900ua1759feu998fd33aebb77d56@mail.gmail.com>

On 9/5/06, David Hopwood wrote:
>
> Guido van Rossum wrote:
> > On 9/5/06, Brian Quinlan wrote:
> > [...]
> >
> > That would not be doing what the user wants. We have extensive
> > experience with defaulting to ASCII in Python 2.x and it's mostly bad.
> > There should definitely be a way to force ASCII as the default
> > encoding (if only as a debugging aid), both in the program code and in
> > the environment; but it shouldn't be the only default. There should
> > also be a way to force UTF-8 as the default, or ISO-8859-1. But if
> > CP436 is the default encoding set by the OS I don't see why Python
> > shouldn't use that as the default *in the absence of any other
> > preferences*.
>
> Cp436 is almost certainly *not* the encoding set by the OS; Python
> has got it wrong.
http://www.ianywhere.com/developer/product_manuals/sqlanywhere/0902/en/html/dbdaen9/00000376.htm

"There are at least two code pages in use on most PCs. Applications
using the Windows graphical user interface use the Windows code pages.
These code pages are compatible with ISO character sets, and also with
ANSI character sets. They are often referred to as *ANSI code pages*.

Character-mode applications (those using the console or command prompt
window) in Windows 95/98/Me and Windows NT/2000/XP, use code pages that
were used in DOS. These are called *OEM code pages* (Original Equipment
Manufacturer) for historical reasons.

...

Example

Consider the following situation:

- A PC is running a Windows operating system with ANSI code page 1252.
- The code page for character-mode applications is OEM code page 437.
- Text is held in a database created using the collation UTF8.

An upper case A grave in the database is stored as hex bytes C380. In a
Windows application, the same character is represented as hex C0. In a
DOS application, it is represented as hex B7."

Now notice that when we introduce Unicode (and all Python 3K strings
are Unicode), we aren't talking about DISPLAY of characters. We're
talking about INTERPRETATION of characters. So if I read a file and
then merge it with some XML data, an application that uses the Windows
default encoding will produce different output when the Python script
is run from the command line than when it is run from the Windows
desktop. Same app. Same data. Different default encodings. Different
output.

Of course we could arbitrarily choose one of these two encodings as the
"true" one, but the fact that they are ALMOST ALWAYS inconsistent
indicates something about how likely either one is to be correct for a
particular user's goals.

Paul Prescod
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.python.org/pipermail/python-3000/attachments/20060905/8be863ca/attachment.htm

From david.nospam.hopwood at blueyonder.co.uk Wed Sep 6 04:52:28 2006
From: david.nospam.hopwood at blueyonder.co.uk (David Hopwood)
Date: Wed, 06 Sep 2006 03:52:28 +0100
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <1cb725390609051900ua1759feu998fd33aebb77d56@mail.gmail.com>
References: <44FC4B5B.9010508@blueyonder.co.uk>
	<1cb725390609050908m6150c1cas533a98f6a14cbcd0@mail.gmail.com>
	<44FDBDFF.7090505@sweetapp.com> <44FE1A65.7020900@blueyonder.co.uk>
	<1cb725390609051900ua1759feu998fd33aebb77d56@mail.gmail.com>
Message-ID: <44FE37EC.2050504@blueyonder.co.uk>

Paul Prescod wrote:
> On 9/5/06, David Hopwood wrote:
>> Guido van Rossum wrote:
>> > On 9/5/06, Brian Quinlan wrote:
>> > [...]
>> >
>> > That would not be doing what the user wants. We have extensive
>> > experience with defaulting to ASCII in Python 2.x and it's mostly bad.
>> > There should definitely be a way to force ASCII as the default
>> > encoding (if only as a debugging aid), both in the program code and in
>> > the environment; but it shouldn't be the only default. There should
>> > also be a way to force UTF-8 as the default, or ISO-8859-1. But if
>> > CP436 is the default encoding set by the OS I don't see why Python
>> > shouldn't use that as the default *in the absence of any other
>> > preferences*.
>>
>> Cp436 is almost certainly *not* the encoding set by the OS; Python
>> has got it wrong.
>> If Brian is using an English-language variant of
>> Windows XP and has not changed the defaults, the system ("ANSI")
>> encoding will be Cp1252-with-Euro (which is similar enough to ISO-8859-1
>> if C1 control characters are not used).
>
> http://www.ianywhere.com/developer/product_manuals/sqlanywhere/0902/en/html/dbdaen9/00000376.htm
>
> "There are at least two code pages in use on most PCs. Applications using
> the Windows graphical user interface use the Windows code pages. These code
> pages are compatible with ISO character sets, and also with ANSI character
> sets. They are often referred to as *ANSI code pages*.
>
> Character-mode applications (those using the console or command prompt
> window) in Windows 95/98/Me and Windows NT/2000/XP, use code pages that were
> used in DOS. These are called *OEM code pages* (Original Equipment
> Manufacturer) for historical reasons.

True, I oversimplified. In practice, each text file on a Windows system
is somewhat more likely to be encoded in the ANSI charset than in the
OEM charset (unless the user still commonly uses DOS-era applications).
The OEM charset only exists at all as a compatibility hack.

> Of course we could arbitrarily choose one of these two encodings as the
> "true" one, but the fact that they are ALMOST ALWAYS inconsistent indicates
> something about how likely either one is to be correct for a particular
> user's goals.

Right -- it's impossible to make a clear distinction between "files
used by console applications" and "files used by graphical
applications", since any text file can be used by both. This just
supports my assertion that there should not be a "default" encoding at
all.

-- 
David Hopwood

From qrczak at knm.org.pl Wed Sep 6 08:10:44 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Wed, 06 Sep 2006 08:10:44 +0200
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <44FE243F.80203@blueyonder.co.uk> (David Hopwood's message of
	"Wed, 06 Sep 2006 02:28:31 +0100")
References: <44F8FEED.9000600@gmail.com> <44FC4B5B.9010508@blueyonder.co.uk>
	<1cb725390609050908m6150c1cas533a98f6a14cbcd0@mail.gmail.com>
	<44FE171C.1090101@blueyonder.co.uk> <44FE243F.80203@blueyonder.co.uk>
Message-ID: <87bqpttti3.fsf@qrnik.zagroda>

David Hopwood writes:

> The whole idea of a default encoding is flawed. Ideally there would be
> no default; programmers should be forced to think about the issue
> on a case-by-case basis. In some cases they might choose to open a file
> with the system encoding, but that should be an explicit decision.

Perhaps this shows a difference between Unix and Windows culture.

On Unix there is definitely a default encoding; this is what most good
programs operating on text files assume by default. It would be insane
to have to tell each program separately about the encoding. Locale is
the OS mechanism used to provide this information in a uniform way.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From paul at prescod.net Wed Sep 6 12:08:21 2006
From: paul at prescod.net (Paul Prescod)
Date: Wed, 6 Sep 2006 03:08:21 -0700
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <87bqpttti3.fsf@qrnik.zagroda>
References: <44FC4B5B.9010508@blueyonder.co.uk>
	<1cb725390609050908m6150c1cas533a98f6a14cbcd0@mail.gmail.com>
	<44FE171C.1090101@blueyonder.co.uk> <44FE243F.80203@blueyonder.co.uk>
	<87bqpttti3.fsf@qrnik.zagroda>
Message-ID: <1cb725390609060308pfd0924aia44d0338b250f646@mail.gmail.com>

On 9/5/06, Marcin 'Qrczak' Kowalczyk wrote:
> David Hopwood writes:
>
> > The whole idea of a default encoding is flawed. Ideally there would be
> > no default; programmers should be forced to think about the issue
> > on a case-by-case basis. In some cases they might choose to open a file
> > with the system encoding, but that should be an explicit decision.
>
> Perhaps this shows a difference between Unix and Windows culture.
>
> On Unix there is definitely a default encoding; this is what most good
> programs operating on text files assume by default. It would be insane
> to have to tell each program separately about the encoding. Locale is
> the OS mechanism used to provide this information in a uniform way.

Windows users do not "tell each program separately about the encoding."
The encoding varies by file type. It makes no more sense to have a
global variable that says "all of my files are Shift-JIS" than it does
to say "all of my files are PowerPoint files." Because someday somebody
is going to email you a Big-5 file (or a zipfile) and that setting will
be wrong. Once you know that a file is of type Zip then you know that
the "encoding" is zipped binary. Once you know that it is an Office
2007 file, then you know that the encoding is Zipped XML and that the
XML will have its own encoding declaration. Once you know that it is
HTML, then you look for meta tags.

This is how real-world programs work. They shouldn't guess based on
system global variables.

May I ask an empirical question? In your experience, what percentage of
Macintosh users change the default encoding from US-ASCII to something
specific to their culture? What percentage of Ubuntu users change it
from UTF-8 to something specific?

If the answers are "few", then we are talking about a feature that will
break Windows programs and offer little value to Unix and Macintosh
users.

If "many" users change the global system encoding on their modern Unix
distributions then I propose the following. There should be a property
called something like "encodings.recommendedEncoding". On Windows it
should be ASCII. On Unix-like platforms it can be inferred from the
locale. Programmers who know what it means and want to take advantage
of it can do so like this:

    opentext(filename, "r", encoding=encodings.recommendedEncoding)

This is almost exactly how C# does it, though it uses the confusing
term "default encoding" which implies a default behaviour. The lack of
an encoding argument should default to ASCII or perhaps UTF-8. (Either
one is relatively safe about not processing data incorrectly by
accident.)
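To make that concrete, here is a rough sketch using today's modules
(opentext and recommended_encoding are invented names for illustration,
not an existing API):

    import sys, locale, codecs

    def recommended_encoding():
        # invented helper: infer the encoding from the locale on
        # Unix-like platforms; fall back to ASCII on Windows
        if sys.platform != "win32":
            return locale.getpreferredencoding() or "ascii"
        return "ascii"

    def opentext(filename, mode="r", encoding="ascii"):
        # no guessing: the caller either takes the conservative
        # default or passes an encoding explicitly
        return codecs.open(filename, mode, encoding=encoding)

    # explicit opt-in to the locale-derived encoding:
    # f = opentext("data.txt", "r", encoding=recommended_encoding())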
Paul Prescod

From phd at mail2.phd.pp.ru Wed Sep 6 12:37:51 2006
From: phd at mail2.phd.pp.ru (Oleg Broytmann)
Date: Wed, 6 Sep 2006 14:37:51 +0400
Subject: [Python-3000] encoding hell
In-Reply-To: 
References: <1d85506f0609020853y6cba1b85h552157c2b2614ca2@mail.gmail.com>
	<44FB0CBF.7070102@jmunch.dk>
	<1d85506f0609031117h4074b7b4t2402ce665cb147f1@mail.gmail.com>
	<87lkp0bsxw.fsf@qrnik.zagroda> <20060903204528.GA3950@panix.com>
	<20060904102413.GC21049@phd.pp.ru>
Message-ID: <20060906103751.GD30635@phd.pp.ru>

On Tue, Sep 05, 2006 at 06:09:21PM -0700, Guido van Rossum wrote:
> On 9/4/06, Oleg Broytmann wrote:
> >--- email database file ---
> > phd at phd.pp.ru
> > phd at oper.med.ru
> >--- / ---
> >
> >   The program opens the file in "r+" mode, reads it line by line and
> >stores the position of the first character of every line using tell().
> >When it needs to mark an email it seek()'s to the stored position and
> >writes a '+' mark so the file looks like
> >
> >--- email database file ---
> >+phd at phd.pp.ru
> > phd at oper.med.ru
> >--- / ---
>
> I don't understand how it can insert a character into the file without
> rewriting everything after that point.

The essential part of the program is:

    from email.Utils import getaddresses  # spelled email.utils in
                                          # later versions

    results = open("results", "r+")
    name, email = getaddresses([to])[0]
    while 1:
        pos = results.tell()      # remember where this line starts
        line = results.readline()
        if not line:
            break
        if line.strip() == email:
            results.seek(pos)     # jump back to the start of the line
            results.write('+')    # overwrite the leading space in place
            break
    results.close()

Open the "database" file in "r+" mode, find the email, seek to the
beginning of the line, replace the space with '+'.

Oleg.
-- 
Oleg Broytmann     http://phd.pp.ru/     phd at phd.pp.ru
Programmers don't die, they just GOSUB without RETURN.

From phd at mail2.phd.pp.ru Wed Sep 6 12:48:39 2006
From: phd at mail2.phd.pp.ru (Oleg Broytmann)
Date: Wed, 6 Sep 2006 14:48:39 +0400
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <1cb725390609060308pfd0924aia44d0338b250f646@mail.gmail.com>
References: <44FC4B5B.9010508@blueyonder.co.uk>
	<1cb725390609050908m6150c1cas533a98f6a14cbcd0@mail.gmail.com>
	<44FE171C.1090101@blueyonder.co.uk> <44FE243F.80203@blueyonder.co.uk>
	<87bqpttti3.fsf@qrnik.zagroda>
	<1cb725390609060308pfd0924aia44d0338b250f646@mail.gmail.com>
Message-ID: <20060906104839.GE30635@phd.pp.ru>

On Wed, Sep 06, 2006 at 03:08:21AM -0700, Paul Prescod wrote:
> Windows users do not "tell each program separately about the
> encoding." The encoding varies by file type. It makes no more sense to
> have a global variable that says "all of my files are Shift-JIS" than
> it does to say "all of my files are PowerPoint files." Because someday
> somebody is going to email you a Big-5 file (or a zipfile) and that
> setting will be wrong. Once you know that a file is of type Zip then
> you know that the "encoding" is zipped binary. Once you know that it
> is an Office 2007 file, then you know that the encoding is Zipped XML
> and that the XML will have its own encoding declaration. Once you know
> that it is HTML, then you look for meta tags.
>
> This is how real-world programs work. They shouldn't guess based on
> system global variables.

Unfortunately, the real world is a bit worse than that. There are many
protocols and file formats that carry textual information and still
don't provide a hint on encoding.

First, there are text files. Really, there are still text files. A user
can dump a README file onto his/her personal FTP server, and the file
usually is in the local encoding.

MP3 tags. Real nightmare.
Nobody follows the standard - tag editors write tags in the local
encoding, and mp3 players interpret them in the local encoding.

FTP and other dumb protocols that transfer file names in the encoding
local to the server without announcing that encoding in the metadata.

Oleg.
-- 
Oleg Broytmann     http://phd.pp.ru/     phd at phd.pp.ru
Programmers don't die, they just GOSUB without RETURN.

From qrczak at knm.org.pl Wed Sep 6 12:51:55 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Wed, 06 Sep 2006 12:51:55 +0200
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <1cb725390609060308pfd0924aia44d0338b250f646@mail.gmail.com>
	(Paul Prescod's message of "Wed, 6 Sep 2006 03:08:21 -0700")
References: <44FC4B5B.9010508@blueyonder.co.uk>
	<1cb725390609050908m6150c1cas533a98f6a14cbcd0@mail.gmail.com>
	<44FE171C.1090101@blueyonder.co.uk> <44FE243F.80203@blueyonder.co.uk>
	<87bqpttti3.fsf@qrnik.zagroda>
	<1cb725390609060308pfd0924aia44d0338b250f646@mail.gmail.com>
Message-ID: <87k64hl12s.fsf@qrnik.zagroda>

"Paul Prescod" writes:

> Windows users do not "tell each program separately about the
> encoding." The encoding varies by file type.

There are lots of Unix file types which are based on text files
and their encoding is not specified explicitly.

> It makes no more sense to have a global variable that says "all of
> my files are Shift-JIS" than it does to say "all of my files are
> PowerPoint files."

Not all: it's just the default for text files.

> This is how real-world programs work. They shouldn't guess based on
> system global variables.

But they do. It's a fact which is impossible to change with a decree.
There is no place, other than the locale, which would suggest which
encoding is used in /etc files, or in the contents of environment
variables, or on the terminal. You might say that it's unfortunate,
but it's true.

At most you could advocate specifying new file formats with the
encoding in mind, like XML does. This doesn't enrich existing file
formats with that information.

Of course technically these formats are just sequences of bytes, and
most programs pass non-ASCII fragments around without looking into
them deeper. But as long as one tries to treat them as natural
language text, search them case-insensitively, embed text taken from
them in HTML files, then the encoding begins to matter, and there is
a general shift among programming languages to translate it on I/O to
a common format instead of dealing with encoded text on all levels.

> May I ask an empirical question? In your experience, what percentage
> of Macintosh users change the default encoding from US-ASCII to
> something specific to their culture?

I have no experience with Macintoshes at all.

> What percentage of Ubuntu users change it from UTF-8 to something
> specific?

Why would it matter? If most of their programs use UTF-8, and it's
specified by the locale, then fine. My system uses mostly ISO-8859-2,
and it's also fine, as long as there is a way for the program to get
that information.

If a program can't read my text files or filenames or environment
variables or program invocation arguments, while they are encoded
according to the locale, then the program is broken.

If a file is not encoded using the encoding specified by the locale,
and I don't tell the program explicitly about the encoding, then it's
not the program's fault when it can't read that.

If a language requires extra steps in order to make the locale
encoding work, then it's unhelpful.
Most programmers won't bother, and their programs will work most of the
time when they test it, assuming they use it with English texts. Such
programs suddenly break when used in a non-English-speaking country.

> If the answers are "few", then we are talking about a feature that
> will break Windows programs and offer little value to Unix and
> Macintosh users.

How does it break more programs than assuming ASCII does? All encodings
suitable as a system encoding are ASCII supersets, so if a file can't
be read using the locale encoding, it can't be read in ASCII either.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From paul at prescod.net Wed Sep 6 12:55:04 2006
From: paul at prescod.net (Paul Prescod)
Date: Wed, 6 Sep 2006 03:55:04 -0700
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <20060906104839.GE30635@phd.pp.ru>
References: <1cb725390609050908m6150c1cas533a98f6a14cbcd0@mail.gmail.com>
	<44FE171C.1090101@blueyonder.co.uk> <44FE243F.80203@blueyonder.co.uk>
	<87bqpttti3.fsf@qrnik.zagroda>
	<1cb725390609060308pfd0924aia44d0338b250f646@mail.gmail.com>
	<20060906104839.GE30635@phd.pp.ru>
Message-ID: <1cb725390609060355u68c66a72s6d7656a8079ade7b@mail.gmail.com>

But how would a system-wide default encoding help with any of these
situations? These situations are IN FACT caused by system-wide default
encodings used by naive programmers. Python should be part of the
solution, not part of the problem.

On 9/6/06, Oleg Broytmann wrote:
> On Wed, Sep 06, 2006 at 03:08:21AM -0700, Paul Prescod wrote:
> ...
> Unfortunately, the real world is a bit worse than that. There are many
> protocols and file formats that carry textual information and still
> don't provide a hint on encoding.
> First, there are text files. Really, there are still text files. A user
> can dump a README file onto his/her personal FTP server, and the file
> usually is in the local encoding.
> MP3 tags. Real nightmare. Nobody follows the standard - tag editors
> write tags in the local encoding, and mp3 players interpret them in the
> local encoding.
> FTP and other dumb protocols that transfer file names in the encoding
> local to the server without announcing that encoding in the metadata.

From phd at mail2.phd.pp.ru Wed Sep 6 13:16:43 2006
From: phd at mail2.phd.pp.ru (Oleg Broytmann)
Date: Wed, 6 Sep 2006 15:16:43 +0400
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <1cb725390609060355u68c66a72s6d7656a8079ade7b@mail.gmail.com>
Message-ID: <20060906111643.GA4412@phd.pp.ru>

On Wed, Sep 06, 2006 at 03:55:04AM -0700, Paul Prescod wrote:
> But how would a system-wide default encoding help with any of these
> situations? These situations are IN FACT caused by system-wide default
> encodings used by naive programmers. Python should be part of the
> solution, not part of the problem.

These situations are caused because of the lack of metadata or clear
encoding-friendly standards. Ogg, for example, is encoding-friendly -
it clearly states that tags (comments) must be in UTF-8, and all Ogg
Vorbis files I have seen were really in UTF-8, and all tag editors and
players write/use UTF-8.

XML is encoding-friendly - every file specifies its encoding.

HTTP protocol is mostly encoding-friendly with its Content-Type header.
HTML is partially encoding-friendly, but only partially - if one saves
an HTML page to a file it may lack encoding information.

But text files and FTP protocol don't have any metadata, and ID3v2
doesn't specify a universal encoding or encoding metadata. In these
cases programs can either guess the encoding based on the file content
or use the system-global encoding.

I fail to see how Python can help here.

Oleg.
-- 
Oleg Broytmann     http://phd.pp.ru/     phd at phd.pp.ru
Programmers don't die, they just GOSUB without RETURN.

From brian at sweetapp.com Wed Sep 6 13:33:43 2006
From: brian at sweetapp.com (Brian Quinlan)
Date: Wed, 06 Sep 2006 13:33:43 +0200
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <87k64hl12s.fsf@qrnik.zagroda>
References: <44FC4B5B.9010508@blueyonder.co.uk>
	<1cb725390609050908m6150c1cas533a98f6a14cbcd0@mail.gmail.com>
	<44FE171C.1090101@blueyonder.co.uk> <44FE243F.80203@blueyonder.co.uk>
	<87bqpttti3.fsf@qrnik.zagroda>
	<1cb725390609060308pfd0924aia44d0338b250f646@mail.gmail.com>
	<87k64hl12s.fsf@qrnik.zagroda>
Message-ID: <44FEB217.6040507@sweetapp.com>

Marcin 'Qrczak' Kowalczyk wrote:
> Why would it matter? If most of their programs use UTF-8, and it's
> specified by the locale, then fine. My system uses mostly ISO-8859-2,
> and it's also fine, as long as there is a way for the program to get
> that information.

The problem is that blindly using the system encoding is error-prone.
For example, I would imagine that when you type:

% less /usr/lib/python2.4/getopt.py

you see "Peter Ĺstrand" rather than "Peter Åstrand". That happens
because getopt.py is encoded in ISO-8859-1 and you are using ISO-8859-2
as your default encoding. Maybe you don't care about the display glitch
but there are applications where it would be a big deal e.g. you are
populating a database based on the content of text files.

> If a program can't read my text files or filenames or environment
> variables or program invocation arguments, while they are encoded
> according to the locale, then the program is broken.

How can the program know if the file is encoded according to your
locale? Do you think that all of the text files on your system are
encoded using ISO-8859-2? Should Python really just guess for you?

> If a file is not encoded using the encoding specified by the locale,
> and I don't tell the program explicitly about the encoding, then it's
> not the program's fault when it can't read that.
>
> If a language requires extra steps in order to make the locale
> encoding work, then it's unhelpful.

No, it's favoring caution and trying to avoid letting errors slip
through. If the programmer believes that they understand the issues and
wants to use the locale encoding setting, it will cost her <20
characters of typing per file open to do so.
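For example, with the existing locale and codecs modules standing in
for whatever Py3k's open() becomes, the opt-in is a single keyword
argument:

    import codecs, locale

    # explicit, visible choice of the locale encoding
    # (illustrative; the exact Py3k spelling may differ)
    f = codecs.open("data.txt", "r",
                    encoding=locale.getpreferredencoding())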
> Most programmers won't bother,
> and their programs will work most of the time when they test it,
> assuming they use it with English texts. Such programs suddenly break
> when used in a non-English-speaking country.

And that is a great thing! Their program will break in a nice clean
understandable way, instead of proceeding and generating incorrect
results.

Cheers,
Brian

From murman at gmail.com Wed Sep 6 15:18:19 2006
From: murman at gmail.com (Michael Urman)
Date: Wed, 6 Sep 2006 08:18:19 -0500
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <20060906111643.GA4412@phd.pp.ru>
References: <44FE171C.1090101@blueyonder.co.uk>
	<44FE243F.80203@blueyonder.co.uk> <87bqpttti3.fsf@qrnik.zagroda>
	<1cb725390609060308pfd0924aia44d0338b250f646@mail.gmail.com>
	<20060906104839.GE30635@phd.pp.ru>
	<1cb725390609060355u68c66a72s6d7656a8079ade7b@mail.gmail.com>
	<20060906111643.GA4412@phd.pp.ru>
Message-ID: 

On 9/6/06, Oleg Broytmann wrote:
> These situations are caused because of the lack of metadata or clear
> encoding-friendly standards. Ogg, for example, is encoding-friendly - it
> clearly states that tags (comments) must be in UTF-8, and all Ogg Vorbis
> files I have seen were really in UTF-8, and all tag editors and players
> write/use UTF-8.

And yet I've run across vorbiscomments encoded in latin-1. It screws
everyone else up, but there are always going to be applications that
do not play along.

> XML is encoding-friendly - every file specifies its encoding.

And plenty of people use methods to read and write it which cannot
cope with non-ASCII files.

> HTTP protocol is mostly encoding-friendly with its Content-Type
> header. HTML is partially encoding-friendly, but only partially - if one
> saves an HTML page to a file it may lack encoding information.

Right; HTTP has the means to indicate the encoding, but rarely does it
have the means to acquire it.

> But text files and FTP protocol don't have any metadata, and ID3v2
> doesn't specify a universal encoding or encoding metadata. In these
> cases programs can either guess the encoding based on the file content
> or use the system-global encoding.

Actually, ID3v2 offers exactly four encodings: latin1, UTF16, UTF16-BE,
and UTF8. However UTF16 isn't endian-determined, and latin1 has been
abused and holds Windows ACP-encoded text more often than not, so it's
a poor indicator. Another case of applications ignoring the spec and
doing what's easy. (I don't recall exactly when the unicode encoding
options were added, so they may have had little choice; more likely
they were too lazy to use UTF16 or it wouldn't work on their portable
device.)

> I fail to see how Python can help here.

Absolutely agreed. I suspect the best option is some sort of TextFile
constructor that defaults to ASCII (or has no default) but accepts an
easy way to use the "recommended" or system encoding, or any explicit
one. And for more complicated formats, the code will just have to use
a bytestream layer, and decode as necessary. This may be a pain for
mbox files, but unless there's a way to switch encodings on the fly, a
seemingly textual file will have to be treated as binary (newlines
excepted, I hope).

I also hope that, if the "recommended" encoding uses a heuristic on
the file's contents, the file has enough data in the encoding to make
a good guess. Music metadata rarely is that.
:) Michael -- Michael Urman http://www.tortall.net/mu/blog From david.nospam.hopwood at blueyonder.co.uk Tue Sep 5 02:28:54 2006 From: david.nospam.hopwood at blueyonder.co.uk (David Hopwood) Date: Tue, 05 Sep 2006 01:28:54 +0100 Subject: [Python-3000] locale-aware strings ? In-Reply-To: References: <44F8FEED.9000600@gmail.com> <44FC4B5B.9010508@blueyonder.co.uk> Message-ID: <44FCC4C6.9030500@blueyonder.co.uk> Guido van Rossum wrote: > On 9/4/06, David Hopwood wrote: >> Guido van Rossum wrote: >> >> > I've always said (can someone find a quote perhaps?) that there ought >> > to be a sensible default encoding for files (including but not limited >> > to stdin/out/err), perhaps influenced by personalized settings, >> > environment variables, the OS, etc. >> >> While it should be possible to find out what the OS believes to be >> the current "system" charset (GetCPInfoEx(CP_ACP, ...) on Windows; >> LC_CHARSET environment variable on Unix), that does not mean that it >> is this charset that Python programs should normally use. When defining >> a new text-based file type, it is simpler to define it to be always >> UTF-8. > > In this particular case I don't care what's simpler to implement, The issue is not simplicity of implementation; it is what will provide the simplest usage model in the long term. If new files are encoded in X just because most of a user's existing files are encoded in X, then how is the user supposed to migrate to a different encoding? Language specifications can have a significant effect in helping migration to Unicode. > but what's most likely to do what the user expects. In practice, the system charset is often set to the charset that should be used as a fallback *for applications that do not support Unicode*. This is especially true on Windows systems. Using UTF-8 by default for new file types is not only simpler, it's more functional. If a BOM is written at the start of the file, and if the user edits files with a text editor that recognizes this, then everything, including writing text in multiple scripts, will Just Work from the user's point of view. > If on a particular box > most files are encoded in encoding X, and the user did whatever is > necessary to tell the tools that that's their preferred encoding, I > want Python to honor that encoding when opening text files, unless the > program makes other arrangements explicitly (such as specifying an > explicit encoding as a parameter to open()). I would prefer that there is no default. But since that is incompatible with the existing API for open(), I accept that I'm not likely to win that argument. -- David Hopwood From jimjjewett at gmail.com Wed Sep 6 18:50:24 2006 From: jimjjewett at gmail.com (Jim Jewett) Date: Wed, 6 Sep 2006 12:50:24 -0400 Subject: [Python-3000] locale-aware strings ? In-Reply-To: <44FCC4C6.9030500@blueyonder.co.uk> References: <44F8FEED.9000600@gmail.com> <44FC4B5B.9010508@blueyonder.co.uk> <44FCC4C6.9030500@blueyonder.co.uk> Message-ID: On 9/4/06, David Hopwood wrote: > The issue is not simplicity of implementation; it is what will provide > the simplest usage model in the long term. If new files are encoded in X > just because most of a user's existing files are encoded in X, then how is > the user supposed to migrate to a different encoding? ... > In practice, the system charset is often set to the charset that should > be used as a fallback *for applications that do not support Unicode*. 
Are you assuming that most uses of open will be for new files, *and*
that these files will not also be read by such unicode-ignorant
applications?

Since we're only talking about text files that do not have an explicit
encoding, I can barely imagine *either* of these conditions being true.

-jJ

From paul at prescod.net Wed Sep 6 19:15:44 2006
From: paul at prescod.net (Paul Prescod)
Date: Wed, 6 Sep 2006 10:15:44 -0700
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <87k64hl12s.fsf@qrnik.zagroda>
References: <1cb725390609050908m6150c1cas533a98f6a14cbcd0@mail.gmail.com>
	<44FE171C.1090101@blueyonder.co.uk> <44FE243F.80203@blueyonder.co.uk>
	<87bqpttti3.fsf@qrnik.zagroda>
	<1cb725390609060308pfd0924aia44d0338b250f646@mail.gmail.com>
	<87k64hl12s.fsf@qrnik.zagroda>
Message-ID: <1cb725390609061015h1c953b7l765e42cacdff2a71@mail.gmail.com>

On 9/6/06, Marcin 'Qrczak' Kowalczyk wrote:
> "Paul Prescod" writes:
>
> > Windows users do not "tell each program separately about the
> > encoding." The encoding varies by file type.
>
> There are lots of Unix file types which are based on text files
> and their encoding is not specified explicitly.

Of course. But you asserted that the Windows world was insane and I
made the point that it is not. They've just consciously and explicitly
moved away from the situation where the encoding is inferred from the
environment instead of from the file's context.

I'm not starting a Windows versus Unix debate. I'm talking about the
direction in which the world is moving. Python need not move forward in
that direction but it should not move backwards. Today, Python does not
use the locale in inferring a file's type. Python also explicitly chose
not to use the locale in inferring string encodings when Unicode was
added.

I'm not saying that Python programmers should be disallowed from using
the system locale. I'm saying that Python itself should "resist the
urge to guess" encodings. Python programmers who want to guess could
have an easy, one-line way, as C# programmers do.

> But they do. It's a fact which is impossible to change with a
> decree.

I'm not trying to change tools. I'm asking that Python not emulate
their broken behaviour. If a Python programmer wants to do so, then
they should add one line of code.

> > What percentage of Ubuntu users change it from UTF-8 to something
> > specific?
>
> Why would it matter?

I said explicitly why it matters in my first message. If most Unix
users just accept system defaults then the feature is of no value to
them. And the feature actively hurts Windows programmers. So you have
decreasing value on one side and a steady amount of pain on the other.

> If a program can't read my text files or filenames or environment
> variables or program invocation arguments, while they are encoded
> according to the locale, then the program is broken.

Either you are saying that Python is broken today, or you are saying
that Python should allow people to write programs that are "not broken"
according to your definition. In the former case, I disagree. In the
latter case, I agree. The only thing we could disagree on is whether
Python's default behaviour should be to guess the encodings based upon
locale, despite Python's long history of avoiding guessing in general
and guessing encodings in particular.

>...
> If a language requires extra steps in order to make the locale
> encoding work, then it's unhelpful. Most programmers won't bother,
> and their programs will work most of the time when they test it,
> assuming they use it with English texts.
Such programs suddenly break > when used in a non-English speaking country. Loudly and suddenly breaking is better than silently munging data. There are vast application classes where using the system encoding is the wrong thing. For example, an FTP server. An application working with data from a remote socket. An application working with a file from a remote server. An application working with incoming email. Python cannot know whether you are building a client/server application or a script for working with local files. It can't even really know whether a file that it opens is truly local. So it shouldn't guess. > > If the answers are "few", then we are talking about a feature that > > will break Windows programs and offer little value to Unix and > > Macintosh users. > > How does it break more programs than assuming ASCII does? All > encodings suitable as a system encoding are ASCII supersets, so if > a file can't be read using the locale encoding, it can't be read > in ASCII either. If a program expecting ASCII sees an unknown character then it can throw an exception and say: "You haven't thought through the internationalization aspects properly. Read the Python docs for more information." Silently munging data is worse. "In the face of ambiguity, refuse the temptation to guess." Paul Prescod From paul at prescod.net Wed Sep 6 19:21:33 2006 From: paul at prescod.net (Paul Prescod) Date: Wed, 6 Sep 2006 10:21:33 -0700 Subject: [Python-3000] locale-aware strings ? In-Reply-To: <20060906111643.GA4412@phd.pp.ru> References: <44FE171C.1090101@blueyonder.co.uk> <44FE243F.80203@blueyonder.co.uk> <87bqpttti3.fsf@qrnik.zagroda> <1cb725390609060308pfd0924aia44d0338b250f646@mail.gmail.com> <20060906104839.GE30635@phd.pp.ru> <1cb725390609060355u68c66a72s6d7656a8079ade7b@mail.gmail.com> <20060906111643.GA4412@phd.pp.ru> Message-ID: <1cb725390609061021y11e37727kb6c94668392a36f7@mail.gmail.com> On 9/6/06, Oleg Broytmann wrote: > On Wed, Sep 06, 2006 at 03:55:04AM -0700, Paul Prescod wrote: > These situations are caused because of the lack of metadata or clear > encoding-friendly standards. Ogg, for example, is encoding friendly - it > clearly states that tags (comments) must be in UTF-8, and all Ogg Vorbis > files I have saw were really in UTF-8, and all tag editors and players > write/use UTF-8. Michael Urman disagrees with you. He says that he sometimes sees Latin-1 encoded files. Let's trace back how that could have happened. 1. The end-user must have had Latin-1 as their system encoding. 2. The programmer of the ID tagging app had not thought through encoding issues. 3. The programming language either implicitly encoded the data according to the locale or treated it as binary data. (unless the programmer did this on purpose, which would imply that he was VERY confused and not just lazy) > I fail to see how Python can help here. Python can refuse to be the programming language in Step 3 that guesses the appropriate encoding without consulting the programmer or end-user. Paul Prescod From paul at prescod.net Wed Sep 6 19:23:37 2006 From: paul at prescod.net (Paul Prescod) Date: Wed, 6 Sep 2006 10:23:37 -0700 Subject: [Python-3000] locale-aware strings ? 
In-Reply-To: References: <44FE171C.1090101@blueyonder.co.uk> <44FE243F.80203@blueyonder.co.uk> <87bqpttti3.fsf@qrnik.zagroda> <1cb725390609060308pfd0924aia44d0338b250f646@mail.gmail.com> <20060906104839.GE30635@phd.pp.ru> <1cb725390609060355u68c66a72s6d7656a8079ade7b@mail.gmail.com> <20060906111643.GA4412@phd.pp.ru> Message-ID: <1cb725390609061023g6562f11ah7247ef356149a681@mail.gmail.com> On 9/6/06, Michael Urman wrote: > ... I suspect the best option is some sort of TextFile > constructor that defaults to ASCII (or has no default) but accepts an > easy way to use the "recommended" or system encoding, or any explicit > one. That's exactly what I'm asking for. Paul Prescod From paul at prescod.net Wed Sep 6 19:28:12 2006 From: paul at prescod.net (Paul Prescod) Date: Wed, 6 Sep 2006 10:28:12 -0700 Subject: [Python-3000] locale-aware strings ? In-Reply-To: <44FCC4C6.9030500@blueyonder.co.uk> References: <44F8FEED.9000600@gmail.com> <44FC4B5B.9010508@blueyonder.co.uk> <44FCC4C6.9030500@blueyonder.co.uk> Message-ID: <1cb725390609061028g285565dasd03dd58e80602dd9@mail.gmail.com> On 9/4/06, David Hopwood wrote: >... > I would prefer that there is no default. But since that is incompatible > with the existing API for open(), I accept that I'm not likely to win > that argument. First, can you outline how the proposal of no default is incompatible with the existing API for open? print open("Documents/foo.py").encoding Second: the whole IO library is being overhauled. How can backwards compatibility be an issue? Paul Prescod From murman at gmail.com Wed Sep 6 20:28:09 2006 From: murman at gmail.com (Michael Urman) Date: Wed, 6 Sep 2006 13:28:09 -0500 Subject: [Python-3000] locale-aware strings ? In-Reply-To: <1cb725390609061023g6562f11ah7247ef356149a681@mail.gmail.com> References: <44FE243F.80203@blueyonder.co.uk> <87bqpttti3.fsf@qrnik.zagroda> <1cb725390609060308pfd0924aia44d0338b250f646@mail.gmail.com> <20060906104839.GE30635@phd.pp.ru> <1cb725390609060355u68c66a72s6d7656a8079ade7b@mail.gmail.com> <20060906111643.GA4412@phd.pp.ru> <1cb725390609061023g6562f11ah7247ef356149a681@mail.gmail.com> Message-ID: On 9/6/06, Paul Prescod wrote: > On 9/6/06, Michael Urman wrote: > > ... I suspect the best option is some sort of TextFile > > constructor that defaults to ASCII (or has no default) but accepts an > > easy way to use the "recommended" or system encoding, or any explicit > > one. > > That's exactly what I'm asking for. I suspect the difference in attitudes between us and those who don't want explicit encodings is that we've dealt with the mess of extracting information from various sources that use arbitrary encodings, either indicated incorrectly or not at all, and we want Python to help break that cycle. Those who want the ease of a TextFile constructor which magically supplies the "recommended" (local?) encoding might only deal with data in their local encoding, and aren't aware that code like theirs provides the problem case for those who deal with more. Not because a text file in the local encoding is a problem, but because if they're not thinking of encoding there, they won't where it matters. I have to learn more about the Japanese distaste for the Unicode system, but I don't see how that could influence me into accepting, e.g., ms932 as a silently-requested encoding. Do you have any clue if or where that fits in? 
Michael
-- 
Michael Urman  http://www.tortall.net/mu/blog

From david.nospam.hopwood at blueyonder.co.uk Thu Sep 7 02:46:11 2006
From: david.nospam.hopwood at blueyonder.co.uk (David Hopwood)
Date: Thu, 07 Sep 2006 01:46:11 +0100
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: 
References: <44F8FEED.9000600@gmail.com> <44FC4B5B.9010508@blueyonder.co.uk>
	<44FCC4C6.9030500@blueyonder.co.uk>
Message-ID: <44FF6BD3.6060409@blueyonder.co.uk>

Jim Jewett wrote:
> On 9/4/06, David Hopwood wrote:
>
>> The issue is not simplicity of implementation; it is what will provide
>> the simplest usage model in the long term. If new files are encoded in X
>> just because most of a user's existing files are encoded in X, then
>> how is the user supposed to migrate to a different encoding? ...
>
>> In practice, the system charset is often set to the charset that should
>> be used as a fallback *for applications that do not support Unicode*.
>
> Are you assuming that most uses of open will be for new files,

No, I'm refusing to make the assumption that all uses will be for old
files. My position is that there should be no default encoding (not
ASCII either, although I may differ with Paul Prescod on that point).
Note that Py3K is the only opportunity to remove the idea of a default
encoding -- Python 2.5 by default opens text files as US-ASCII, so this
would be an incompatible API change.

If a programmer explicitly chooses to open files with the system
encoding (by adding an "encoding=sys.get_file_content_encoding()"
argument to a file open call), that's absolutely fine. In that case
they must have considered encoding issues for at least a few seconds.
That is the best we can do.

APIs that open files should also be designed to allow auto-detection of
the encoding based on content. This requires that the detected encoding
be returned from the file open call, so that if the file needs to be
rewritten, that can be done in the same encoding that was detected
(which is the behaviour least likely to break existing applications
that may read the same file).
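As a rough sketch of the shape such an API could take (open_detected is
an invented name, and only BOM sniffing is shown; a real implementation
could use better heuristics):

    import codecs

    def open_detected(filename, fallback_encoding):
        # sniff the first bytes for a BOM; otherwise use the caller's
        # explicitly chosen fallback. The detected encoding is returned
        # so the file can later be rewritten in the same encoding.
        head = open(filename, "rb").read(4)
        if head.startswith(codecs.BOM_UTF8):
            encoding = "utf-8-sig"   # decodes and drops the BOM
        elif (head.startswith(codecs.BOM_UTF16_LE) or
              head.startswith(codecs.BOM_UTF16_BE)):
            encoding = "utf-16"      # the BOM tells the codec the
                                     # endianness
        else:
            encoding = fallback_encoding
        return codecs.open(filename, "r", encoding=encoding), encoding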
> *and* that these files will not also be read by such unicode-ignorant
> applications?

I'm not making that assumption either.

-- 
David Hopwood

From tomerfiliba at gmail.com Thu Sep 7 19:30:45 2006
From: tomerfiliba at gmail.com (tomer filiba)
Date: Thu, 7 Sep 2006 19:30:45 +0200
Subject: [Python-3000] iostack, second revision
Message-ID: <1d85506f0609071030o60a02ac5j739027d1825f0e7f@mail.gmail.com>

[Guido]
> As long as the state of the decoder is "neutral" at the start of a
> line, it should be possible to do this. I like the idea that tell()
> returns a "cookie" which is really a byte offset. If one wants to be
> able to seek to positions with a non-neutral decoder state, the cookie
> would have to be more abstract. It shouldn't matter; text apps should
> not do arithmetic on seek/tell positions.

[Andres]
> In all my programming days I don't believe I've written to and read
> from the same file handle even once. Use cases exist, like if you're
> implementing a DBMS, or adding to a zip file in-place, but they're the
> exception, and by separating that functionality out in a dedicated
> class like FileBytes, you avoid having the complexities of mixed input
> and output affect your typical use cases.
[...]
> Watch out! There's an essential difference between files and
> bidirectional communications channels that you need to take into
> account. For a TCP connection, input and output can be seen as
> isolated from one another, with each their own stream position, and
> each their own contents. For read/write files, it's a whole different
> ballgame, because stream position and data are shared.

[Talin]
> Now, I'm not saying that you can't stick additional layers in-between
> TextReader and FileStream if you want to. An example might be the
> "resync" layer that you mentioned, or a journaling layer that insures
> that all writes are recoverable. I'm merely saying that for the specific
> issue of buffering, I think that the choice of buffer type is
> complicated, and requires knowledge that might not be accessible to the
> person assembling the stack.

---

lots of things have been discussed, lots of new ideas came: it's time
to rethink the design of iostack; i'll try to see into it.

there are several key issues:
* splitting streams to separate reading and writing sides.
* the underlying OS resource can be separated into some very low
level abstraction layer, over which streams would operate.
* stateful-seek-cookies sound like the perfect solution

issues with seeking:
being opaque, there's no sense in having the long debated
position property (although i really liked it :)). i.e., there's no sense
in doing s.position += some_opaque_cookie

on the other hand, since streams are byte-oriented, over which the
data abstraction layer (text, etc.) is placed, maybe there's sense in
splitting these into two distinct APIs:

* tell()/seek() for the byte-level stream position: a stream is just a
sequence of bytes in which you can seek.
* data-abstraction-layer "pointers": pointers will be stateful stream
locations of encoded *objects*.

you will not be able to "forge" pointers, you'll first have come across
a valid object location, and only then can you get a "pointer" pointing
to it. of course these pointers should be kept cheap, and for most
situations, plain integers would suffice.

example:

    f = TextAdapter(BufferingLayer(FileStream(...)), encoding = "utf-32")
    f.write("hello world")
    p = f.get_pointer()
    f.write("wide web")
    f.set_pointer(p)

or using a property:

    p = f.pointer
    f.pointer = p

something like that....though i would like to recv comments on
that first, before i go into deeper meditation :)
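for instance, here's a minimal sketch of what a stateful pointer could
look like, assuming the incremental decoder grows getstate()/setstate()
methods (it has no such methods today; all names are illustrative
only):

    import codecs

    class TextReader(object):
        # minimal sketch: wraps a seekable byte stream and hands out
        # stateful pointers (byte offset paired with decoder state)
        def __init__(self, stream, encoding):
            self.stream = stream
            self.decoder = codecs.getincrementaldecoder(encoding)()

        def read_char(self):
            # feed bytes to the decoder until a full character (or
            # EOF, signalled by an empty string) comes out
            while True:
                b = self.stream.read(1)
                ch = self.decoder.decode(b, final=not b)
                if ch or not b:
                    return ch

        def get_pointer(self):
            return (self.stream.tell(), self.decoder.getstate())

        def set_pointer(self, pointer):
            offset, state = pointer
            self.stream.seek(offset)
            self.decoder.setstate(state)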
-tomer

From paul at prescod.net Thu Sep 7 21:21:12 2006
From: paul at prescod.net (Paul Prescod)
Date: Thu, 7 Sep 2006 12:21:12 -0700
Subject: [Python-3000] Help on text editors
Message-ID: <1cb725390609071221x7ffc9d3epc59ea0381369d3c4@mail.gmail.com>

Guido has asked me to do some research in aid of a file encoding
detection/defaulting PEP.

I only have access to a small number of operating systems and language
variants so I need help.

If you have access to "German Windows XP", "Japanese Windows XP",
"Spanish OS X", "Japanese OS X", "German Ubuntu" etc., I would
appreciate answers to the following questions.

1. On US English Windows, Notepad defaults to an encoding called
"ANSI". "ANSI" is not a real encoding at all (and certainly not one
from the American National Standards Institute -- they should sue!).
ANSI is just the default Windows character set for your localization
set. What does "ANSI" map to in European and Asian versions of Windows?

2. On my English Mac, the default character set for TextEdit is "Mac OS
Roman". What is it for foreign-language Macs? What API does an
application use to query this default character set? What setting is it
derived from? The Unix-level locale (seems not!) or some GUI-level
setting (which one)?

3. In general, how do modern versions of Linux and other Unix handle
this issue? In particular: what is your default encoding and how did
your operating system determine it? Did you install a locale-specific
version? Did the installer ask you? Did you edit a configuration file?
Did you change a GUI setting? What is the relationship between your
localization of Gnome/KDE and your default encoding?

Paul Prescod
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.python.org/pipermail/python-3000/attachments/20060907/ae52d46e/attachment.htm

From solipsis at pitrou.net Thu Sep 7 22:13:56 2006
From: solipsis at pitrou.net (Antoine Pitrou)
Date: Thu, 07 Sep 2006 22:13:56 +0200
Subject: [Python-3000] Help on text editors
In-Reply-To: <1cb725390609071221x7ffc9d3epc59ea0381369d3c4@mail.gmail.com>
References: <1cb725390609071221x7ffc9d3epc59ea0381369d3c4@mail.gmail.com>
Message-ID: <1157660036.8533.18.camel@fsol>

Hi,

On Thursday 07 September 2006 at 12:21 -0700, Paul Prescod wrote:
> If you have access to "German Windows XP", "Japanese Windows XP",
> "Spanish OS X", "Japanese OS X", "German Ubuntu" etc., I would
> appreciate answers to the following questions.

French Mandriva (up-to-date development version).

> In particular: what is your default encoding and how did your
> operating system determine it?

My locale is named "fr_FR" and the encoding is iso-8859-15.

> Did you install a locale-specific version? Did the installer ask you?

No, it's the built-in config. I don't remember the installer asking me
anything except the language and keyboard layout.

> What is the relationship between your localization of Gnome/KDE and
> your default encoding?

Ok, I hexdump'ed a few .mo files (the gettext-compatible files which
contain translation strings) and the result is a bit funny:
Gnome/KDE .mo files use utf-8, while .mo files for various command-line
tools (e.g. aspell) use iso-8859-15.

Also, it is interesting to know that Gnome tools like gedit (the Gnome
text editor) normally default to utf-8, however gedit was patched by
Mandriva to use the system encoding by default (which breaks character
set auto-detection because the Mandriva patch is awful:
http://qa.mandriva.com/show_bug.cgi?id=20277).

By the way, you should be aware that filesystems have their own
encodings which can differ from the default system encoding (depending
on how it's declared in /etc/fstab). I don't know of a simple way to
retrieve the encoding for a given directory (except trying to find out
the filesystem mounting point and parsing /etc/fstab... *sigh*). This
can be annoying when handling non-ASCII filenames.

Regards

Antoine.

From qrczak at knm.org.pl Thu Sep 7 23:22:31 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Thu, 07 Sep 2006 23:22:31 +0200
Subject: [Python-3000] Help on text editors
In-Reply-To: <1157660036.8533.18.camel@fsol> (Antoine Pitrou's message of
	"Thu, 07 Sep 2006 22:13:56 +0200")
References: <1cb725390609071221x7ffc9d3epc59ea0381369d3c4@mail.gmail.com>
	<1157660036.8533.18.camel@fsol>
Message-ID: <8764fzgync.fsf@qrnik.zagroda>

Antoine Pitrou writes:

> By the way, you should be aware that filesystems have their own
> encodings which can differ from the default system encoding
> (depending on how it's declared in /etc/fstab). I don't know of a
> simple way to retrieve the encoding for a given directory (except
> trying to find out the filesystem mounting point and parsing
> /etc/fstab... *sigh*). This can be annoying when handling non-ASCII
> filenames.

I believe the intent is to set up all filesystems to use the same
encoding externally. The encoding setting exists only for some
filesystems, especially those which use UTF-16 internally, where it
would be impossible to physically store filenames in the default
system encoding, or where the filesystem is likely to be created on
a different system with a different encoding.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From paul at prescod.net Fri Sep 8 00:41:30 2006
From: paul at prescod.net (Paul Prescod)
Date: Thu, 7 Sep 2006 15:41:30 -0700
Subject: [Python-3000] Help on text editors
In-Reply-To: <1157660036.8533.18.camel@fsol>
References: <1cb725390609071221x7ffc9d3epc59ea0381369d3c4@mail.gmail.com>
	<1157660036.8533.18.camel@fsol>
Message-ID: <1cb725390609071541p308293b0u29d264f619d23d92@mail.gmail.com>

Are you plugged into the Mandriva community? Is there any debate about
the continued use of iso8859-15? Obviously it has the benefit of
backwards compatibility and slightly smaller file sizes. But it also
has very severe limitations and interoperability problems as you
describe below.

On 9/7/06, Antoine Pitrou wrote:
>
>
> Hi,
>
> On Thursday 07 September 2006 at 12:21 -0700, Paul Prescod wrote:
> > If you have access to "German Windows XP", "Japanese Windows XP",
> > "Spanish OS X", "Japanese OS X", "German Ubuntu" etc., I would
> > appreciate answers to the following questions.
>
> French Mandriva (up-to-date development version).
>
> > In particular: what is your default encoding and how did your
> > operating system determine it?
>
> My locale is named "fr_FR" and the encoding is iso-8859-15.
>
> > Did you install a locale-specific version? Did the installer ask you?
>
> No, it's the built-in config. I don't remember the installer asking me
> anything except the language and keyboard layout.
>
> > What is the relationship between your localization of Gnome/KDE and
> > your default encoding?
>
> Ok, I hexdump'ed a few .mo files (the gettext-compatible files which
> contain translation strings) and the result is a bit funny:
> Gnome/KDE .mo files use utf-8, while .mo files for various command-line
> tools (e.g. aspell) use iso-8859-15.
>
> Also, it is interesting to know that Gnome tools like gedit (the Gnome
> text editor) normally default to utf-8, however gedit was patched by
> Mandriva to use the system encoding by default (which breaks character
> set auto-detection because the Mandriva patch is awful:
> http://qa.mandriva.com/show_bug.cgi?id=20277).
>
> By the way, you should be aware that filesystems have their own
> encodings which can differ from the default system encoding (depending
> on how it's declared in /etc/fstab). I don't know of a simple way to
> retrieve the encoding for a given directory (except trying to find out
> the filesystem mounting point and parsing /etc/fstab... *sigh*). This
> can be annoying when handling non-ASCII filenames.
>
> Regards
>
> Antoine.
>
>
> _______________________________________________
> Python-3000 mailing list
> Python-3000 at python.org
> http://mail.python.org/mailman/listinfo/python-3000
> Unsubscribe:
> http://mail.python.org/mailman/options/python-3000/paul%40prescod.net
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.python.org/pipermail/python-3000/attachments/20060907/4004913e/attachment.htm

From guido at python.org Fri Sep 8 01:33:43 2006
From: guido at python.org (Guido van Rossum)
Date: Thu, 7 Sep 2006 16:33:43 -0700
Subject: [Python-3000] iostack, second revision
In-Reply-To: <1d85506f0609071030o60a02ac5j739027d1825f0e7f@mail.gmail.com>
References: <1d85506f0609071030o60a02ac5j739027d1825f0e7f@mail.gmail.com>
Message-ID: 

On 9/7/06, tomer filiba wrote:
> lots of things have been discussed, lots of new ideas came: it's time
> to rethink the design of iostack; i'll try to see into it.
>
> there are several key issues:
> * splitting streams to separate reading and writing sides.
> * the underlying OS resource can be separated into some very low
> level abstraction layer, over which streams would operate.
> * stateful-seek-cookies sound like the perfect solution
>
> issues with seeking:
> being opaque, there's no sense in having the long debated
> position property (although i really liked it :)). i.e., there's no sense
> in doing s.position += some_opaque_cookie
>
> on the other hand, since streams are byte-oriented, over which the
> data abstraction layer (text, etc.) is placed, maybe there's sense in
> splitting these into two distinct APIs:
>
> * tell()/seek() for the byte-level stream position: a stream is just a
> sequence of bytes in which you can seek.
> * data-abstraction-layer "pointers": pointers will be stateful stream
> locations of encoded *objects*.
>
> you will not be able to "forge" pointers, you'll first have come across
> a valid object location, and only then can you get a "pointer" pointing
> to it. of course these pointers should be kept cheap, and for most
> situations, plain integers would suffice.

Using plain ints makes them trivially forgeable though. Not sure I
mind, just noticing.

> example:
>
>     f = TextAdapter(BufferingLayer(FileStream(...)), encoding = "utf-32")
>     f.write("hello world")
>     p = f.get_pointer()
>     f.write("wide web")
>     f.set_pointer(p)

Why not use tell() and seek() instead of get_pointer() and
set_pointer()? Seek should also support several special cases: f.seek(0)
seeks to the start of the file no matter what type is otherwise used
for pointers ("seek cookies" ?), f.seek(0, 1) is a no-op, f.seek(0, 2)
seeks to EOF.

> or using a property:
>     p = f.pointer
>     f.pointer = p

Since the creation of a seek cookie may be relatively expensive (since
it may have to ask the decoder a rather personal question :-) it should
be a method, not a property.

> something like that....though i would like to recv comments on
> that first, before i go into deeper meditation :)

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From hasan.diwan at gmail.com Fri Sep 8 02:41:30 2006
From: hasan.diwan at gmail.com (Hasan Diwan)
Date: Thu, 7 Sep 2006 17:41:30 -0700
Subject: [Python-3000] iostack, second revision
In-Reply-To: 
References: <1d85506f0609071030o60a02ac5j739027d1825f0e7f@mail.gmail.com>
Message-ID: <2cda2fc90609071741t1b7fc4a9gff48836d187367da@mail.gmail.com>

I was thinking about the new IOStack and could not come up with a use
case requiring both line-oriented and record-oriented read/write
functionality -- the general case is the record-oriented one; lines
are just newline-terminated records. Perhaps this has already been
dropped, but I seem to recall the original spec having a readrec,
writerec? Similarly, readline/writeline aren't needed. For example...

    import sys

    class Stream(object):
        def read(self):
            raise Exception('cannot read')

        def readrec(self, terminator):
            # accumulate characters until the record terminator
            # (or EOF) is seen
            ret = ''
            while not ret.endswith(terminator):
                ch = self.read()
                if not ch:  # EOF: avoid looping forever
                    break
                ret = ret + ch
            return ret

        def write(self):
            raise Exception('cannot write')

        def writeRec(self, terminator):
            ''' writeRec returns self as a list split by terminator '''
            return str(self).split(terminator)

    class InputStream(Stream):
        def read(self):
            # Reads 1 byte
            return sys.stdin.read(1)

        def readline(self):
            return self.readrec('\n')  # or whatever constant
                                       # represents the EOL

-- 
Cheers,
Hasan Diwan
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.python.org/pipermail/python-3000/attachments/20060907/2769816f/attachment.htm

From murman at gmail.com Fri Sep 8 03:05:09 2006
From: murman at gmail.com (Michael Urman)
Date: Thu, 7 Sep 2006 20:05:09 -0500
Subject: [Python-3000] Help on text editors
In-Reply-To: <1cb725390609071221x7ffc9d3epc59ea0381369d3c4@mail.gmail.com>
References: <1cb725390609071221x7ffc9d3epc59ea0381369d3c4@mail.gmail.com>
Message-ID: 

On 9/7/06, Paul Prescod wrote:
> 1. On US English Windows, Notepad defaults to an encoding called "ANSI".
> What does "ANSI" map to in European and Asian versions of Windows?

On most Western European configurations, the ANSI Code Page is
historically 1252 (CP1252 or WINDOWS-1252 according to iconv). It may
be something different now for supporting the EURO symbol. Japanese
machines tend to use CP932 (or MS932), also known as SHIFT-JIS (or
close enough). I don't know exactly which ACPs match other languages
off the top of my head.

I expect notepad will default to the ACP encoding whenever a file is
detected as such, or a new file contains only characters representable
via that code page. Otherwise I expect it will default to "Unicode"
(UTF-16 / UCS-2). When editing an existing file, it will default to the
detected encoding, unless "Unicode" is required to save the changes. It
uses BOMs to mark all Unicode encodings, but doesn't require them to be
present in order to detect "Unicode."
http://blogs.msdn.com/michkap/archive/2006/06/14/631016.aspx

> 3. In general, how do modern versions of Linux and other Unix handle this
> issue?

I use en-US.UTF-8, after many years of C or en-US.ISO-8859-1. Due to
the age of my install, this was not the default, but now I use it as
pervasively as possible. I set it via GDM these days, but via my shell
rc file originally.

Michael
-- 
Michael Urman  http://www.tortall.net/mu/blog

From david.nospam.hopwood at blueyonder.co.uk Fri Sep 8 04:03:55 2006
From: david.nospam.hopwood at blueyonder.co.uk (David Hopwood)
Date: Fri, 08 Sep 2006 03:03:55 +0100
Subject: [Python-3000] Help on text editors
In-Reply-To: <1cb725390609071221x7ffc9d3epc59ea0381369d3c4@mail.gmail.com>
References: <1cb725390609071221x7ffc9d3epc59ea0381369d3c4@mail.gmail.com>
Message-ID: <4500CF8B.6040003@blueyonder.co.uk>

Paul Prescod wrote:
> Guido has asked me to do some research in aid of a file encoding
> detection/defaulting PEP.
>
> I only have access to a small number of operating systems and language
> variants so I need help.
>
> If you have access to "German Windows XP", "Japanese Windows XP",

Since Win2K there is actually no such thing, from a technical point of
view -- just Win2K or WinXP with a German or Japanese "language group"
installed, and a corresponding locale selected as the interface locale
for a given user account. The links below should make this clearer.
I believe the intent is to set up all filesystems to use the same encoding externally. The encoding setting exists only for some filesystems, especially those which use UTF-16 internally, where it would be impossible to physically store filenames in the default system encoding, or where the filesystem is likely to be created on a different system with a different encoding. -- __("< Marcin Kowalczyk \__/ qrczak at knm.org.pl ^^ http://qrnik.knm.org.pl/~qrczak/ From paul at prescod.net Fri Sep 8 00:41:30 2006 From: paul at prescod.net (Paul Prescod) Date: Thu, 7 Sep 2006 15:41:30 -0700 Subject: [Python-3000] Help on text editors In-Reply-To: <1157660036.8533.18.camel@fsol> References: <1cb725390609071221x7ffc9d3epc59ea0381369d3c4@mail.gmail.com> <1157660036.8533.18.camel@fsol> Message-ID: <1cb725390609071541p308293b0u29d264f619d23d92@mail.gmail.com> Are you plugged into the Mandriva community? Is there any debate about the continued use of iso8859-15? Obviously it has the benefit of backwards compatibility and slightly smaller file sizes. But it also has very severe limitations and interoperability problems as you describe below. On 9/7/06, Antoine Pitrou wrote: > > > Hi, > > Le jeudi 07 septembre 2006 ? 12:21 -0700, Paul Prescod a ?crit : > > If you have access to "German Windows XP", "Japanese Windows XP", > > "Spanish OS X", "Japanese OS X", "German Ubuntu" etc., I would > > appreciate answers to the following questions. > > French Mandriva (up-to-date development version). > > > In particular: what is your default encoding and how did your > > operating system determine it? > > My locale is named "fr_FR" and the encoding is iso-8859-15. > > > Did you install a locale-specific version? Did the installer ask you? > > No, it's the built-in config. I don't remember the installer asking me > anything except the language and keyboard layout. > > > What is the relationship between your localization of Gnome/KDE and > > your default encoding? > > Ok, I hexdump'ed a few .mo files (the gettext-compatible files which > contain translation strings) and the result is a bit funny: > Gnome/KDE .mo files use utf-8, while .mo files for various command-line > tools (e.g. aspell) use iso-8859-15. > > Also, it is interesting to know that Gnome tools like gedit (the Gnome > text editor) normally default to utf-8, however gedit was patched by > Mandriva to use the system encoding by default (which breaks character > set auto-detection because the Mandriva patch is awful : > http://qa.mandriva.com/show_bug.cgi?id=20277). > > > By the way, you should be aware that filesystems have their own > encodings which can different from the default system encoding > (depending on how it's declared in /etc/fstab). I don't know of a simple > way to retrieve the encoding for a given directory (except trying to > find out the filesystem mounting point and parsing /etc/fstab... > *sigh*). This can be annoying when handling non-ascii filenames. > > Regards > > Antoine. > > > _______________________________________________ > Python-3000 mailing list > Python-3000 at python.org > http://mail.python.org/mailman/listinfo/python-3000 > Unsubscribe: > http://mail.python.org/mailman/options/python-3000/paul%40prescod.net > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://mail.python.org/pipermail/python-3000/attachments/20060907/4004913e/attachment.htm From guido at python.org Fri Sep 8 01:33:43 2006 From: guido at python.org (Guido van Rossum) Date: Thu, 7 Sep 2006 16:33:43 -0700 Subject: [Python-3000] iostack, second revision In-Reply-To: <1d85506f0609071030o60a02ac5j739027d1825f0e7f@mail.gmail.com> References: <1d85506f0609071030o60a02ac5j739027d1825f0e7f@mail.gmail.com> Message-ID: On 9/7/06, tomer filiba wrote: > lots of things have been discussed, lots of new ideas came: > it's time to rethink the design of iostack; i'll try to see into it. > > there are several key issues: > * splitting streams to separate reading and writing sides. > * the underlying OS resource can be separated into some very low > level abstraction layer, over which streams would operate. > * stateful-seek-cookies sound like the perfect solution > > issues with seeking: > being opaque, there's no sense in having the long debated > position property (although i really liked it :)). i.e., there's no sense > in doing s.position += some_opaque_cookie > > on the other hand, since streams are byte-oriented, over which the > data abstraction layer (text, etc.) is placed, maybe there's sense in > splitting these into two distinct APIs: > > * tell()/seek() for the byte-level stream position: a stream is just a > sequence of bytes in which you can seek. > * data-abstraction-layer "pointers": pointers will be stateful stream > locations of encoded *objects*. > > you will not be able to "forge" pointers, you'll first have come across > a valid object location, and then could you get a "pointer" pointing to it. > of course these pointers should be kept cheap, and for most situations, > plain integers would suffice. Using plain ints makes them trivially forgeable though. Not sure I mind, just noticing. > example: > > f = TextAdapter(BufferingLayer(FileStream(...)), encoding = "utf-32") > f.write("hello world") > p = f.get_pointer() > f.write("wide web") > f.set_pointer(p) Why not use tell() and seek() instead of get_pointer() and set_pointer()? Seek should also support several special cases: f.seek(0) seeks to the start of the file no matter what type is otherwise used for pointers ("seek cookies" ?), f.seek(0, 1) is a no-op, f.seek(0, 2) seeks to EOF. > or using a property: > p = f.pointer > f.pointer = p Since the creation of a seek cookie may be relatively expensive (since it may have to ask the decoder a rather personal question :-) it should be a method, not a property. > something like that....though i would like to recv comments on > that first, before i go into deeper meditation :) -- --Guido van Rossum (home page: http://www.python.org/~guido/) From hasan.diwan at gmail.com Fri Sep 8 02:41:30 2006 From: hasan.diwan at gmail.com (Hasan Diwan) Date: Thu, 7 Sep 2006 17:41:30 -0700 Subject: [Python-3000] iostack, second revision In-Reply-To: References: <1d85506f0609071030o60a02ac5j739027d1825f0e7f@mail.gmail.com> Message-ID: <2cda2fc90609071741t1b7fc4a9gff48836d187367da@mail.gmail.com> I was thinking about the new IOStack and could not come up with an use case requiring both a line-oriented and a record-oriented read/write functionality -- the general case is the record-oriented, lines are just new-line terminated records. Perhaps this has already been dropped, but I seem to recall the original spec having a readrec, writerec? Similarly, readline/writeline aren't needed. For example... 
    import sys

    # Base class first, so that InputStream below can subclass it.
    class Stream(object):
        def read(self):
            raise Exception('cannot read')
        def readrec(self, terminator):
            # Accumulate single-byte reads until the record ends with
            # the terminator (or EOF arrives first).
            ret = ''
            while not ret.endswith(terminator):
                ch = self.read()
                if not ch:        # EOF before the terminator
                    break
                ret = ret + ch
            return ret
        def write(self):
            raise Exception('cannot write')
        def writeRec(self, terminator):
            ''' writeRec returns self as a list split by terminator '''
            ret = str(self)
            return str(ret).split(terminator)

    class InputStream(Stream):
        def read(self):
            # Reads 1 byte
            return sys.stdin.read(1)
        def readline(self):
            ret = self.readrec('\n')  # or whatever constant represents the EOL
            return ret

-- 
Cheers,
Hasan Diwan
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.python.org/pipermail/python-3000/attachments/20060907/2769816f/attachment.htm 

From murman at gmail.com  Fri Sep  8 03:05:09 2006
From: murman at gmail.com (Michael Urman)
Date: Thu, 7 Sep 2006 20:05:09 -0500
Subject: [Python-3000] Help on text editors
In-Reply-To: <1cb725390609071221x7ffc9d3epc59ea0381369d3c4@mail.gmail.com>
References: <1cb725390609071221x7ffc9d3epc59ea0381369d3c4@mail.gmail.com>
Message-ID: 

On 9/7/06, Paul Prescod wrote:
> 1. On US English Windows, Notepad defaults to an encoding called "ANSI".
> What does "ANSI" map to in European and Asian versions of Windows?

On most Western European configurations, the ANSI Code Page is
historically 1252 (CP1252 or WINDOWS-1252 according to iconv). It may
be something different now for supporting the EURO symbol. Japanese
machines tend to use CP932 (or MS932), also known as SHIFT-JIS (or
close enough). I don't know exactly which ACPs match other languages
off the top of my head.

I expect notepad will default to the ACP encoding whenever a file is
detected as such, or a new file contains only characters representable
via that code page. Otherwise I expect it will default to "Unicode"
(UTF-16 / UCS-2). When editing an existing file, it will default to
the detected encoding, unless "Unicode" is required to save the
changes. It uses BOMs to mark all unicode encodings, but doesn't
require them to be present in order to detect "Unicode."
http://blogs.msdn.com/michkap/archive/2006/06/14/631016.aspx

> 3. In general, how do modern versions of Linux and other Unix handle this
> issue?

I use en-US.UTF-8, after many years of C or en-US.ISO-8859-1. Due to
the age of my install, this was not the default, but now I use it as
pervasively as possible. I set it via GDM these days, but via my shell
rc file originally.

Michael
-- 
Michael Urman  http://www.tortall.net/mu/blog

From david.nospam.hopwood at blueyonder.co.uk  Fri Sep  8 04:03:55 2006
From: david.nospam.hopwood at blueyonder.co.uk (David Hopwood)
Date: Fri, 08 Sep 2006 03:03:55 +0100
Subject: [Python-3000] Help on text editors
In-Reply-To: <1cb725390609071221x7ffc9d3epc59ea0381369d3c4@mail.gmail.com>
References: <1cb725390609071221x7ffc9d3epc59ea0381369d3c4@mail.gmail.com>
Message-ID: <4500CF8B.6040003@blueyonder.co.uk>

Paul Prescod wrote:
> Guido has asked me to do some research in aid of a file encoding
> detection/defaulting PEP.
> 
> I only have access to a small number of operating systems and language
> variants so I need help.
> 
> If you have access to "German Windows XP", "Japanese Windows XP",

Since Win2K there is actually no such thing, from a technical point of view --
just Win2K or WinXP with a German or Japanese "language group" installed,
and a corresponding locale selected as the interface locale for a given user
account. The links below should make this clearer.
> "Spanish OS X", "Japanese OS X", "German Ubuntu" etc., I would appreciate > answers to the following questions. > > 1. On US English Windows, Notepad defaults to an encoding called "ANSI". > "ANSI" is not a real encoding at all (and certainly not one from the > American National Standards Institute -- they should sue!). ANSI is just > the default Windows character set for your localization set. What does > "ANSI" map to in European and Asian versions of Windows? See , , and . Each "language group" maps to a similarly named "ANSI" code page (and also an "OEM" code page) in the obvious way. -- David Hopwood From david.nospam.hopwood at blueyonder.co.uk Fri Sep 8 04:12:27 2006 From: david.nospam.hopwood at blueyonder.co.uk (David Hopwood) Date: Fri, 08 Sep 2006 03:12:27 +0100 Subject: [Python-3000] Help on text editors In-Reply-To: <4500CF8B.6040003@blueyonder.co.uk> References: <1cb725390609071221x7ffc9d3epc59ea0381369d3c4@mail.gmail.com> <4500CF8B.6040003@blueyonder.co.uk> Message-ID: <4500D18B.1040404@blueyonder.co.uk> David Hopwood wrote: > Paul Prescod wrote: > >>Guido has asked me to do some research in aid of a file encoding >>detection/defaulting PEP. >> >>I only have access to a small number of operating systems and language >>variants so I need help. >> >>If you have access to "German Windows XP", "Japanese Windows XP", > > Since Win2K there is actually no such thing, from a technical point of view -- > just Win2K or WinXP with a German or Japanese "language group" installed, This is right... > and a corresponding locale selected as the interface locale for a given user account. Correction: the "System Locale" is what determines the ANSI and OEM codepages, and this is *not* dependent on the user account. Changing it requires a reboot, so you can assume that it stays constant for the lifetime of a Python process. > The links below should make this clearer. I obviously should have read them more thoroughly myself! :-( > See , > , and > . -- David Hopwood From david.nospam.hopwood at blueyonder.co.uk Fri Sep 8 04:46:40 2006 From: david.nospam.hopwood at blueyonder.co.uk (David Hopwood) Date: Fri, 08 Sep 2006 03:46:40 +0100 Subject: [Python-3000] Help on text editors In-Reply-To: References: <1cb725390609071221x7ffc9d3epc59ea0381369d3c4@mail.gmail.com> Message-ID: <4500D990.3000707@blueyonder.co.uk> Michael Urman wrote: > On 9/7/06, Paul Prescod wrote: > >>1. On US English Windows, Notepad defaults to an encoding called "ANSI". >>What does "ANSI" map to in European and Asian versions of Windows? > > On most Western European configurations, the ANSI Code Page is > historically 1252 (CP1252 or WINDOWS-1252 according to iconv). It may > be something different now for supporting the EURO symbol. None of the Windows-125x code page numbers changed when '?' was added. These are "open" encodings in the Unicode and ISO terminology; i.e. there is an authority (Microsoft) who can assign any previously unassigned code point at any time. > Japanese machines tend to use CP932 (or MS932), also known as SHIFT-JIS (or > close enough). Not close enough, actually. Cp932 is a superset of US-ASCII, whereas Shift-JIS isn't: 0x5C represents '\' and '?' respectively. If you think about how important '\' is as an escaping metacharacter, this is quite a big deal (there are other differences, but they are less important). 
Actual practice in Japan is that 0x5C *can* be used as an escaping
metacharacter with the semantics of '\' (even if it is sometimes displayed
as '¥'), and so Cp932 is the encoding that should be used, even on
non-Microsoft OSes.

> I expect notepad will default to the ACP encoding whenever a file is
> detected as such, or a new file contains only characters representable
> via that code page. Otherwise I expect it will default to "Unicode"
> (UTF-16 / UCS-2). When editing an existing file, it will default to
> the detected encoding, unless "Unicode" is required to save the
> changes. It uses BOMs to mark all unicode encodings, but doesn't
> require them to be present in order to detect "Unicode."
> http://blogs.msdn.com/michkap/archive/2006/06/14/631016.aspx

Yes. However, this is not a good idea for precisely the reason described
on that page (false detection of Unicode), and so any Unicode detection
algorithm in Python should only be based on detecting a BOM, IMHO.

-- 
David Hopwood

From jeff at soft.fujitsu.com  Fri Sep  8 05:09:26 2006
From: jeff at soft.fujitsu.com (Jeff Wilcox)
Date: Fri, 8 Sep 2006 12:09:26 +0900
Subject: [Python-3000] Help on text editors
In-Reply-To: 
Message-ID: 

> From: "Paul Prescod"
> 1. On US English Windows, Notepad defaults to an encoding called "ANSI".
> "ANSI" is not a real encoding at all (and certainly not one from the

On Japanese Windows 2000, Notepad defaults to ANSI as it does in the
English version. It actually writes Shift JIS though.

> 2. On my English Mac, the default character set for textedit is "Mac OS
> Roman". What is it for foreign language macs? What API does an application
> use to query this default character set? What setting is it derived from?
> The Unix-level locale (seems not!) or some GUI-level setting (which one)?

Mac OS X actually doesn't have different language versions of the
operating system. If you change the language setting, the Japanese version
*becomes* the English version and vice versa. (Several of the English
speakers that I work with have purchased Japanese Macs and switched them
over to English, they're indistinguishable from English Macs afterwards.
Similarly, several Macs purchased in the US have been successfully
switched to Japanese, and become indistinguishable from Macs bought in
Japan.)

> 3. In general, how do modern versions of Linux and other Unix handle this
> issue? In particular: what is your default encoding and how did your

On Vine Linux (popular in Japan), the default text encoding is EUC with no
configuration changes.

From solipsis at pitrou.net  Fri Sep  8 09:02:48 2006
From: solipsis at pitrou.net (Antoine Pitrou)
Date: Fri, 08 Sep 2006 09:02:48 +0200
Subject: [Python-3000] iostack, second revision
In-Reply-To: 
References: <1d85506f0609071030o60a02ac5j739027d1825f0e7f@mail.gmail.com>
Message-ID: <1157698968.4636.3.camel@fsol>

Le jeudi 07 septembre 2006 à 16:33 -0700, Guido van Rossum a écrit :
> Why not use tell() and seek() instead of get_pointer() and
> set_pointer()? Seek should also support several special cases:
> f.seek(0) seeks to the start of the file no matter what type is
> otherwise used for pointers ("seek cookies" ?), f.seek(0, 1) is a
> no-op, f.seek(0, 2) seeks to EOF.

Perhaps it would be good to drop those magic numbers (0, 1, 2) for
seek()? They don't really help readability except perhaps for people
who still do a lot of C ;)

Regards

Antoine.
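For readers keeping score, these are the magic numbers in question, shown
against the current file API. The filename here is invented for the
example, and the os.SEEK_* names are the aliases for 0, 1 and 2 that, as
later posts in this thread note, the os module grew in Python 2.5:

    import os

    f = open("data.bin", "rb")
    f.seek(10, 0)            # 10 bytes from the start of the file
    f.seek(-2, 1)            # 2 bytes back from the current position
    f.seek(0, 2)             # to end-of-file
    f.seek(0, os.SEEK_END)   # same as f.seek(0, 2), but self-describing
    f.close()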
From solipsis at pitrou.net  Fri Sep  8 09:08:41 2006
From: solipsis at pitrou.net (Antoine Pitrou)
Date: Fri, 08 Sep 2006 09:08:41 +0200
Subject: [Python-3000] Help on text editors
In-Reply-To: <1cb725390609071541p308293b0u29d264f619d23d92@mail.gmail.com>
References: <1cb725390609071221x7ffc9d3epc59ea0381369d3c4@mail.gmail.com>
	<1157660036.8533.18.camel@fsol>
	<1cb725390609071541p308293b0u29d264f619d23d92@mail.gmail.com>
Message-ID: <1157699321.4636.9.camel@fsol>

Le jeudi 07 septembre 2006 à 15:41 -0700, Paul Prescod a écrit :
> Are you plugged into the Mandriva community?

Not much. I only participate in bug reports ;)

> Is there any debate about the continued use of iso8859-15?

I think there has been some for years. Some people in the community push
for UTF-8 but I guess the problem is related to Mandriva company
management or priority setting policy.

Regards

Antoine.

From hasan.diwan at gmail.com  Fri Sep  8 09:26:55 2006
From: hasan.diwan at gmail.com (Hasan Diwan)
Date: Fri, 8 Sep 2006 00:26:55 -0700
Subject: [Python-3000] iostack, second revision
In-Reply-To: <1157698968.4636.3.camel@fsol>
References: <1d85506f0609071030o60a02ac5j739027d1825f0e7f@mail.gmail.com>
	<1157698968.4636.3.camel@fsol>
Message-ID: <2cda2fc90609080026s7184815fh19345ba764d03c90@mail.gmail.com>

On 08/09/06, Antoine Pitrou wrote:
>
> Perhaps it would be good to drop those magic numbers (0, 1, 2) for
> seek()? They don't really help readability except perhaps for people
> who still do a lot of C ;)
>

+1
If we can't or don't want to eliminate the "magic numbers" entirely,
perhaps we could assign symbolic constants to them?
fileobj.seek(fileobj.START) for instance?
-- 
Cheers,
Hasan Diwan
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.python.org/pipermail/python-3000/attachments/20060908/1477f322/attachment.html 

From tomerfiliba at gmail.com  Fri Sep  8 10:53:33 2006
From: tomerfiliba at gmail.com (tomer filiba)
Date: Fri, 8 Sep 2006 10:53:33 +0200
Subject: [Python-3000] iostack, second revision
In-Reply-To: 
References: <1d85506f0609071030o60a02ac5j739027d1825f0e7f@mail.gmail.com>
Message-ID: <1d85506f0609080153x685c5d5fga6f3352830fa394b@mail.gmail.com>

> Why not use tell() and seek() instead of get_pointer() and
> set_pointer()?

because, at least the way i see it, seek and tell are byte-oriented,
while the upper layers of the stack may be objects-oriented (including,
for instance, characters, struct records, or pickled objects), so
pointers would be a vector of (byte-position, stateful object-layer info).

pointers are different than mere byte-positions, so i thought streams
should have a byte-level API, while the upper layers are more likely
to work with "pointers".

[Guido]
> Seek should also support several special cases:
> f.seek(0) seeks to the start of the file no matter what type is
> otherwise used for pointers ("seek cookies" ?), f.seek(0, 1) is a
> no-op, f.seek(0, 2) seeks to EOF.

[Antoine]
> Perhaps it would be good to drop those magic numbers (0, 1, 2) for
> seek()? They don't really help readability except perhaps for people
> who still do a lot of C ;)

yes, this was discussed some time ago.
we concluded that the new position property should behave similar to negative indexes: f.position = 5 -- absolute seek, from the beginning of the stream f.position += 3 -- relative seek (*) f.position = -1 -- absolute seeking, back from the end (**) (*) it requires two syscalls, so we'll also have a seekby() method (**) like "hello"[-1] imho it's much more simple and intuitive than these magic consts, and it feels more like object indexing. -tomer On 9/8/06, Guido van Rossum wrote: > On 9/7/06, tomer filiba wrote: > > lots of things have been discussed, lots of new ideas came: > > it's time to rethink the design of iostack; i'll try to see into it. > > > > there are several key issues: > > * splitting streams to separate reading and writing sides. > > * the underlying OS resource can be separated into some very low > > level abstraction layer, over which streams would operate. > > * stateful-seek-cookies sound like the perfect solution > > > > issues with seeking: > > being opaque, there's no sense in having the long debated > > position property (although i really liked it :)). i.e., there's no sense > > in doing s.position += some_opaque_cookie > > > > on the other hand, since streams are byte-oriented, over which the > > data abstraction layer (text, etc.) is placed, maybe there's sense in > > splitting these into two distinct APIs: > > > > * tell()/seek() for the byte-level stream position: a stream is just a > > sequence of bytes in which you can seek. > > * data-abstraction-layer "pointers": pointers will be stateful stream > > locations of encoded *objects*. > > > > you will not be able to "forge" pointers, you'll first have come across > > a valid object location, and then could you get a "pointer" pointing to it. > > of course these pointers should be kept cheap, and for most situations, > > plain integers would suffice. > > Using plain ints makes them trivially forgeable though. Not sure I > mind, just noticing. > > > example: > > > > f = TextAdapter(BufferingLayer(FileStream(...)), encoding = "utf-32") > > f.write("hello world") > > p = f.get_pointer() > > f.write("wide web") > > f.set_pointer(p) > > Why not use tell() and seek() instead of get_pointer() and > set_pointer()? Seek should also support several special cases: > f.seek(0) seeks to the start of the file no matter what type is > otherwise used for pointers ("seek cookies" ?), f.seek(0, 1) is a > no-op, f.seek(0, 2) seeks to EOF. > > > or using a property: > > p = f.pointer > > f.pointer = p > > Since the creation of a seek cookie may be relatively expensive (since > it may have to ask the decoder a rather personal question :-) it > should be a method, not a property. > > > something like that....though i would like to recv comments on > > that first, before i go into deeper meditation :) > > -- > --Guido van Rossum (home page: http://www.python.org/~guido/) > From qrczak at knm.org.pl Fri Sep 8 11:17:36 2006 From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk) Date: Fri, 08 Sep 2006 11:17:36 +0200 Subject: [Python-3000] iostack, second revision In-Reply-To: <1d85506f0609080153x685c5d5fga6f3352830fa394b@mail.gmail.com> (tomer filiba's message of "Fri, 8 Sep 2006 10:53:33 +0200") References: <1d85506f0609071030o60a02ac5j739027d1825f0e7f@mail.gmail.com> <1d85506f0609080153x685c5d5fga6f3352830fa394b@mail.gmail.com> Message-ID: <87zmda1zv3.fsf@qrnik.zagroda> "tomer filiba" writes: > yes, this was discussed some time ago. 
we concluded that the new > position property should behave similar to negative indexes: > > f.position = 5 -- absolute seek, from the beginning of the stream > f.position += 3 -- relative seek (*) > f.position = -1 -- absolute seeking, back from the end (**) Seeking to the very end requires a special constant, otherwise it's off by 1. I don't understand so strong desire to push that syntax despite its problems with implementing += in one syscall and with specifying the end point. If it doesn't work well, don't do it that way. Of course magic constants are bad. My language Kogut has three separate functions for seeking, it's simpler than interpreting non-negative and negative numbers differently (and it can even seek past the end if the OS supports that). I can't imagine a case where the origin of seeking is not known statically. -- __("< Marcin Kowalczyk \__/ qrczak at knm.org.pl ^^ http://qrnik.knm.org.pl/~qrczak/ From ncoghlan at gmail.com Fri Sep 8 12:31:48 2006 From: ncoghlan at gmail.com (Nick Coghlan) Date: Fri, 08 Sep 2006 20:31:48 +1000 Subject: [Python-3000] iostack, second revision In-Reply-To: <1d85506f0609080153x685c5d5fga6f3352830fa394b@mail.gmail.com> References: <1d85506f0609071030o60a02ac5j739027d1825f0e7f@mail.gmail.com> <1d85506f0609080153x685c5d5fga6f3352830fa394b@mail.gmail.com> Message-ID: <45014694.2030609@gmail.com> [Guido] >> Why not use tell() and seek() instead of get_pointer() and >> set_pointer()? [tomer] > because, at least the way i see it, seek and tell are byte-oriented, > while the upper layers of the stack may be objects-oriented > (including, for instance, characters, struct records, or pickled objects), > so pointers would be a vector of (byte-position, stateful object-layer info). > > pointers are different than mere byte-positions, so i thought streams > should have a byte-level API, while the upper layers are more likely > to work with "pointers". seek() & tell() aren't necessarily byte-oriented, and a program can get itself in trouble by treating them as if they are. Seeking to an arbitrary byte position on a Windows text file can be a very bad idea :) So -1 on using different names, but +1 on permitting different IO layers to assign a different meaning to exactly what it is that seek() and tell() are indexing. With the IO layer doing a translation, I suggest that the seek/tell cookies should be plain integers, so that doing f.seek(20) on a text file will seek to the 20th character instead of the 20th byte. This approach is backwards compatible with the current rule of 'for text files, arguments to seek() must be previously returned from tell()' and Guido's desire that f.seek(0) always seek to the beginning of the file. > > [Guido] >> Seek should also support several special cases: >> f.seek(0) seeks to the start of the file no matter what type is >> otherwise used for pointers ("seek cookies" ?), f.seek(0, 1) is a >> no-op, f.seek(0, 2) seeks to EOF. > > [Antoine] >> Perhaps it would be good to drop those magic numbers (0, 1, 2) for >> seek() ? They don't really help readibility except perhaps for people >> who still do a lot of C ;) Since I've been playing with string methods lately, I believe a natural name for the 'seek from the end' version is f.rseek(0). And someone else suggested f.seekby(0) as a reasonable name for relative seeking. f.seek(0) # Go to beginning f.seekby(0) # Stay at current position f.rseek(0) # Go to end Cheers, Nick. 
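One concrete reading of that three-method split, sketched against the
existing file API. The wrapper class, its name, and the convention that
rseek() takes a non-negative distance measured back from EOF are
assumptions of this sketch, not anything settled in the thread:

    import os

    class SeekableStream(object):
        """Sketch of the seek()/seekby()/rseek() naming, expressed in
        terms of the existing seek(offset, whence)."""

        def __init__(self, fileobj):
            self._file = fileobj

        def seek(self, pos=0):
            # Absolute seek; seek() alone goes to the beginning.
            self._file.seek(pos, os.SEEK_SET)

        def seekby(self, offset=0):
            # Relative seek; seekby(0) stays at the current position.
            self._file.seek(offset, os.SEEK_CUR)

        def rseek(self, dist=0):
            # Seek backwards from the end; rseek(0) goes to EOF.
            self._file.seek(-dist, os.SEEK_END)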
-- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia --------------------------------------------------------------- http://www.boredomandlaziness.org From exarkun at divmod.com Fri Sep 8 14:29:57 2006 From: exarkun at divmod.com (Jean-Paul Calderone) Date: Fri, 8 Sep 2006 08:29:57 -0400 Subject: [Python-3000] iostack, second revision In-Reply-To: <2cda2fc90609080026s7184815fh19345ba764d03c90@mail.gmail.com> Message-ID: <20060908122957.1717.1038742216.divmod.quotient.42846@ohm> On Fri, 8 Sep 2006 00:26:55 -0700, Hasan Diwan wrote: >On 08/09/06, Antoine Pitrou wrote: >> >>Perhaps it would be good to drop those magic numbers (0, 1, 2) for >>seek() ? They don't really help readibility except perhaps for people >>who still do a lot of C ;) > >+1 >If we can't don't want to eliminate the "magic numbers" entirely, perhaps we >could assign symbolic constants to them? fileobj.seek(fileobj.START) for >instance? Note that Python is _worse_ than C here. C has named constants for these, Python does not expose them. Jean-Paul From ronaldoussoren at mac.com Fri Sep 8 15:37:00 2006 From: ronaldoussoren at mac.com (Ronald Oussoren) Date: Fri, 08 Sep 2006 15:37:00 +0200 Subject: [Python-3000] iostack, second revision In-Reply-To: <20060908122957.1717.1038742216.divmod.quotient.42846@ohm> References: <20060908122957.1717.1038742216.divmod.quotient.42846@ohm> Message-ID: <10397427.1157722620731.JavaMail.ronaldoussoren@mac.com> On Friday, September 08, 2006, at 02:30PM, Jean-Paul Calderone wrote: >On Fri, 8 Sep 2006 00:26:55 -0700, Hasan Diwan wrote: >>On 08/09/06, Antoine Pitrou wrote: >>> >>>Perhaps it would be good to drop those magic numbers (0, 1, 2) for >>>seek() ? They don't really help readibility except perhaps for people >>>who still do a lot of C ;) >> >>+1 >>If we can't don't want to eliminate the "magic numbers" entirely, perhaps we >>could assign symbolic constants to them? fileobj.seek(fileobj.START) for >>instance? > >Note that Python is _worse_ than C here. C has named constants for these, >Python does not expose them. What about os.SEEK_SET, os.SEEK_CUR, os.SEEK_END? The named constants are there, just not at the most convenient location. Ronald > >Jean-Paul >_______________________________________________ >Python-3000 mailing list >Python-3000 at python.org >http://mail.python.org/mailman/listinfo/python-3000 >Unsubscribe: http://mail.python.org/mailman/options/python-3000/ronaldoussoren%40mac.com > > From exarkun at divmod.com Fri Sep 8 15:40:42 2006 From: exarkun at divmod.com (Jean-Paul Calderone) Date: Fri, 8 Sep 2006 09:40:42 -0400 Subject: [Python-3000] iostack, second revision In-Reply-To: <10397427.1157722620731.JavaMail.ronaldoussoren@mac.com> Message-ID: <20060908134042.1717.1143631052.divmod.quotient.42896@ohm> On Fri, 08 Sep 2006 15:37:00 +0200, Ronald Oussoren wrote: > >On Friday, September 08, 2006, at 02:30PM, Jean-Paul Calderone wrote: >> >>Note that Python is _worse_ than C here. C has named constants for these, >>Python does not expose them. > >What about os.SEEK_SET, os.SEEK_CUR, os.SEEK_END? The named constants are there, just not at the most convenient location. 
New in Python 2.5, so Python will finally be caught up with C when 2.5 final is released :) Jean-Paul From ronaldoussoren at mac.com Fri Sep 8 16:06:30 2006 From: ronaldoussoren at mac.com (Ronald Oussoren) Date: Fri, 08 Sep 2006 16:06:30 +0200 Subject: [Python-3000] iostack, second revision In-Reply-To: <20060908134042.1717.1143631052.divmod.quotient.42896@ohm> References: <20060908134042.1717.1143631052.divmod.quotient.42896@ohm> Message-ID: <5355804.1157724390887.JavaMail.ronaldoussoren@mac.com> On Friday, September 08, 2006, at 03:41PM, Jean-Paul Calderone wrote: >On Fri, 08 Sep 2006 15:37:00 +0200, Ronald Oussoren wrote: >> >>On Friday, September 08, 2006, at 02:30PM, Jean-Paul Calderone wrote: >>> >>>Note that Python is _worse_ than C here. C has named constants for these, >>>Python does not expose them. >> >>What about os.SEEK_SET, os.SEEK_CUR, os.SEEK_END? The named constants are there, just not at the most convenient location. > >New in Python 2.5, so Python will finally be caught up with C when 2.5 >final is released :) The same constants are also defined in posixfile, which even according to python 2.3 is deprecated. Sigh... Ronald From guido at python.org Fri Sep 8 18:37:13 2006 From: guido at python.org (Guido van Rossum) Date: Fri, 8 Sep 2006 09:37:13 -0700 Subject: [Python-3000] iostack, second revision In-Reply-To: <1157698968.4636.3.camel@fsol> References: <1d85506f0609071030o60a02ac5j739027d1825f0e7f@mail.gmail.com> <1157698968.4636.3.camel@fsol> Message-ID: On 9/8/06, Antoine Pitrou wrote: > Le jeudi 07 septembre 2006 ? 16:33 -0700, Guido van Rossum a ?crit : > > Why not use tell() and seek() instead of get_pointer() and > > set_pointer()? Seek should also support several special cases: > > f.seek(0) seeks to the start of the file no matter what type is > > otherwise used for pointers ("seek cookies" ?), f.seek(0, 1) is a > > no-op, f.seek(0, 2) seeks to EOF. > > Perhaps it would be good to drop those magic numbers (0, 1, 2) for > seek() ? They don't really help readibility except perhaps for people > who still do a lot of C ;) Maybe (since I fall in that category it doesn't bother me :-), but we shouldn't replace them with symbolic constants. Having to import another module to import names like SEEK_CUR and SEEK_END is not Pythonic. Perhaps the seek() method can grow keyword arguments to indicate the different types of seekage, or there should be three separate methods. -- --Guido van Rossum (home page: http://www.python.org/~guido/) From mcherm at mcherm.com Fri Sep 8 18:45:50 2006 From: mcherm at mcherm.com (Michael Chermside) Date: Fri, 08 Sep 2006 09:45:50 -0700 Subject: [Python-3000] The future of exceptions Message-ID: <20060908094550.zdd30l5x9tk0koow@login.werra.lunarpages.com> Marcin Kowalczyk writes: > In my language the traceback is materialized from the stack only > if needed [...] The stack is not > physically unwound until an exception handler completes successfully, > so the data is available until then. Jim Jewett writes: > Even today, if a StopIteration() participates in a reference cycle, > then it won't be reclaimed until the next gc run. I'm not quite sure > which direction should be a weakref, but I think it would be > reasonable for the cycle to get broken when an catching except block > exits without reraising. When thinking about these things, don't forget that in Python an exception handler can perform complicated actions, including invoking new functions and possibly raising new exceptions. 
Any solution should allow the following code to work "properly":

    # -- WARNING: demo code, not tested
    def logError(msg):
        try:
            errorChannel.write(msg)
        except IOError:
            pass

    try:
        callSomeCode()
    except SomeException as err:
        msg = str(msg)
        logError(msg)
        raise msg

By "properly" I mean that when callSomeCode() raises SomeException, the
uncaught exception will cause the program to print a stack trace which
should correctly show the stack frame of callSomeCode(). This should
happen regardless of whether errorChannel raised an IOError. In the
process, though, we (1) added new frames to the stack, and (2)
successfully exited an error handler (the one for IOError).

It is work to provide this feature but without it Python programmers
cannot freely use any code they like within exception handlers, which I
think is an important feature. It doesn't necessarily imply that the
traceback be materialized immediately upon exception creation (which is
undesirable because we want exceptions lightweight enough to use for
things like for loop control!)... but it might mean that pieces of the
stack frame need to hang around as long as the exception itself does.

-- Michael Chermside

From ncoghlan at gmail.com  Fri Sep  8 19:00:33 2006
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Sat, 09 Sep 2006 03:00:33 +1000
Subject: [Python-3000] iostack, second revision
In-Reply-To: 
References: <1d85506f0609071030o60a02ac5j739027d1825f0e7f@mail.gmail.com>
	<1157698968.4636.3.camel@fsol>
Message-ID: <4501A1B1.5050707@gmail.com>

Guido van Rossum wrote:
> Maybe (since I fall in that category it doesn't bother me :-), but we
> shouldn't replace them with symbolic constants. Having to import
> another module to import names like SEEK_CUR and SEEK_END is not
> Pythonic. Perhaps the seek() method can grow keyword arguments to
> indicate the different types of seekage, or there should be three
> separate methods.

As I mentioned in a different part of the thread, I believe seek(),
seekby() and rseek() would work as names for the 3 different method
approach.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
             http://www.boredomandlaziness.org

From talin at acm.org  Fri Sep  8 19:12:13 2006
From: talin at acm.org (Talin)
Date: Fri, 08 Sep 2006 10:12:13 -0700
Subject: [Python-3000] iostack, second revision
In-Reply-To: <4501A1B1.5050707@gmail.com>
References: <1d85506f0609071030o60a02ac5j739027d1825f0e7f@mail.gmail.com>
	<1157698968.4636.3.camel@fsol> <4501A1B1.5050707@gmail.com>
Message-ID: <4501A46D.20007@acm.org>

Nick Coghlan wrote:
> Guido van Rossum wrote:
>> Maybe (since I fall in that category it doesn't bother me :-), but we
>> shouldn't replace them with symbolic constants. Having to import
>> another module to import names like SEEK_CUR and SEEK_END is not
>> Pythonic. Perhaps the seek() method can grow keyword arguments to
>> indicate the different types of seekage, or there should be three
>> separate methods.
> 
> As I mentioned in a different part of the thread, I believe seek(), seekby()
> and rseek() would work as names for the 3 different method approach.
> 
> Cheers,
> Nick.
> 

One advantage of that approach is that layers which don't support a
particular operation could omit one or more of those functions, or have
differently-named functions that represent what the layer is capable of.
For example, if a layer is only capable of seeking forward, you could use
'skip' like the Java stream does; If a layer can rewind the stream back to
zero, but not to any intermediate position, you could have a 'reset'
method.

By taking this approach, you can come up with an API for a given layer
that fits naturally into the behavior model of that layer, without trying
to cram it into a generic model for seeking that attempts to cover all
cases. For text streams, come up with a model that makes sense for what
kinds of things you want to do with text, and don't try and make it look
like the API for the underlying byte stream.

-- Talin

From aahz at pythoncraft.com  Fri Sep  8 19:21:51 2006
From: aahz at pythoncraft.com (Aahz)
Date: Fri, 8 Sep 2006 10:21:51 -0700
Subject: [Python-3000] The future of exceptions
In-Reply-To: <20060908094550.zdd30l5x9tk0koow@login.werra.lunarpages.com>
References: <20060908094550.zdd30l5x9tk0koow@login.werra.lunarpages.com>
Message-ID: <20060908172151.GA9911@panix.com>

On Fri, Sep 08, 2006, Michael Chermside wrote:
>
>     def logError(msg):
>         try:
>             errorChannel.write(msg)
>         except IOError:
>             pass
>
>     try:
>         callSomeCode()
>     except SomeException as err:
>         msg = str(msg)
>         logError(msg)
>         raise msg

This code is guaranteed to fail in Python 3.0, of course, because string
exceptions aren't allowed. But your point is taken, I think.
-- 
Aahz (aahz at pythoncraft.com)           <*>         http://www.pythoncraft.com/

"LL YR VWL R BLNG T S"  -- www.nancybuttons.com

From fdrake at acm.org  Fri Sep  8 20:03:17 2006
From: fdrake at acm.org (Fred L. Drake, Jr.)
Date: Fri, 8 Sep 2006 14:03:17 -0400
Subject: [Python-3000] iostack, second revision
In-Reply-To: <4501A1B1.5050707@gmail.com>
References: <1d85506f0609071030o60a02ac5j739027d1825f0e7f@mail.gmail.com>
	<4501A1B1.5050707@gmail.com>
Message-ID: <200609081403.18350.fdrake@acm.org>

On Friday 08 September 2006 13:00, Nick Coghlan wrote:
> As I mentioned in a different part of the thread, I believe seek(),
> seekby() and rseek() would work as names for the 3 different method
> approach.

+1, for the reasons discussed.

-Fred

-- 
Fred L. Drake, Jr.

From guido at python.org  Fri Sep  8 20:06:41 2006
From: guido at python.org (Guido van Rossum)
Date: Fri, 8 Sep 2006 11:06:41 -0700
Subject: [Python-3000] iostack, second revision
In-Reply-To: <200609081403.18350.fdrake@acm.org>
References: <1d85506f0609071030o60a02ac5j739027d1825f0e7f@mail.gmail.com>
	<4501A1B1.5050707@gmail.com> <200609081403.18350.fdrake@acm.org>
Message-ID: 

-1 on those particular cryptic names.
Which one of seekby() and rseek() is the relative seek? Where's the seek
relative to EOF?

On 9/8/06, Fred L. Drake, Jr. wrote:
> On Friday 08 September 2006 13:00, Nick Coghlan wrote:
> > As I mentioned in a different part of the thread, I believe seek(),
> > seekby() and rseek() would work as names for the 3 different method
> > approach.
>
> +1, for the reasons discussed.
>
>
> -Fred
>
> --
> Fred L. Drake, Jr.
>

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From solipsis at pitrou.net  Fri Sep  8 20:41:13 2006
From: solipsis at pitrou.net (Antoine Pitrou)
Date: Fri, 08 Sep 2006 20:41:13 +0200
Subject: [Python-3000] iostack, second revision
In-Reply-To: 
References: <1d85506f0609071030o60a02ac5j739027d1825f0e7f@mail.gmail.com>
	<4501A1B1.5050707@gmail.com> <200609081403.18350.fdrake@acm.org>
Message-ID: <1157740873.4979.10.camel@fsol>

Le vendredi 08 septembre 2006 à 11:06 -0700, Guido van Rossum a écrit :
> -1 on those particular cryptic names. Which one of seekby() and
> rseek() is the relative seek? Where's the seek relative to EOF?

What about seek(), seek_relative() and seek_reverse() ?
"rseek" also looks like "relative seek" to me (having been used to
move / rmove for graphic primitives a long time ago).

Regards

Antoine.

From jimjjewett at gmail.com  Fri Sep  8 21:04:50 2006
From: jimjjewett at gmail.com (Jim Jewett)
Date: Fri, 8 Sep 2006 15:04:50 -0400
Subject: [Python-3000] iostack, second revision
In-Reply-To: <1157740873.4979.10.camel@fsol>
References: <1d85506f0609071030o60a02ac5j739027d1825f0e7f@mail.gmail.com>
	<4501A1B1.5050707@gmail.com> <200609081403.18350.fdrake@acm.org>
	<1157740873.4979.10.camel@fsol>
Message-ID: 

On 9/8/06, Antoine Pitrou wrote:
> Le vendredi 08 septembre 2006 à 11:06 -0700, Guido van Rossum a écrit :
> > -1 on those particular cryptic names. Which one of seekby() and
> > rseek() is the relative seek? Where's the seek relative to EOF?

> What about seek(), seek_relative() and seek_reverse() ?

Why not just borrow the standard symbolic names of cur and end?

seek(pos=0)
seek_cur(pos=0)
seek_end(pos=0)

seek_end(-1000)  <==>  1000 units (bytes or chars or records or
...) before the end
seek_cur(50)     <==>  50 units beyond current
seek()           <==>  beginning

-jJ

From qrczak at knm.org.pl  Fri Sep  8 21:21:22 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Fri, 08 Sep 2006 21:21:22 +0200
Subject: [Python-3000] iostack, second revision
In-Reply-To:  (Guido van Rossum's message of
	"Fri, 8 Sep 2006 11:06:41 -0700")
References: <1d85506f0609071030o60a02ac5j739027d1825f0e7f@mail.gmail.com>
	<4501A1B1.5050707@gmail.com> <200609081403.18350.fdrake@acm.org>
Message-ID: <87bqpqqi4t.fsf@qrnik.zagroda>

"Guido van Rossum" writes:

> -1 on those particular cryptic names. Which one of seekby() and
> rseek() is the relative seek? Where's the seek relative to EOF?

I propose seek, seek_by, seek_end.

I suppose in 99% of cases seek_end is used to seek to the very end,
rather than some offset from the end, so it makes sense for the offset
to be optional.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From fdrake at acm.org  Sat Sep  9 00:06:08 2006
From: fdrake at acm.org (Fred L. Drake, Jr.)
Date: Fri, 8 Sep 2006 18:06:08 -0400
Subject: [Python-3000] iostack, second revision
In-Reply-To: 
References: <1d85506f0609071030o60a02ac5j739027d1825f0e7f@mail.gmail.com>
	<200609081403.18350.fdrake@acm.org>
Message-ID: <200609081806.09032.fdrake@acm.org>

On Friday 08 September 2006 14:06, Guido van Rossum wrote:
> -1 on those particular cryptic names. Which one of seekby() and
> rseek() is the relative seek? Where's the seek relative to EOF?

My reading was seekby() as relative, and rseek() was relative to the end.
It could be something like seekposition(), seekforward(), seekfromend().
Long, but unambiguous.

-Fred

-- 
Fred L. Drake, Jr.

From solipsis at pitrou.net  Sat Sep  9 00:24:10 2006
From: solipsis at pitrou.net (Antoine Pitrou)
Date: Sat, 09 Sep 2006 00:24:10 +0200
Subject: [Python-3000] iostack, second revision
In-Reply-To: 
References: <1d85506f0609071030o60a02ac5j739027d1825f0e7f@mail.gmail.com>
	<4501A1B1.5050707@gmail.com> <200609081403.18350.fdrake@acm.org>
	<1157740873.4979.10.camel@fsol>
Message-ID: <1157754250.8948.1.camel@fsol>

Le vendredi 08 septembre 2006 à 15:04 -0400, Jim Jewett a écrit :
> > What about seek(), seek_relative() and seek_reverse() ?
>
> Why not just borrow the standard symbolic names of cur and end?
>
> seek(pos=0)
> seek_cur(pos=0)
> seek_end(pos=0)

You are right, it's clear and shorter than my proposal.

From jackdied at jackdied.com  Sat Sep  9 01:26:04 2006
From: jackdied at jackdied.com (Jack Diederich)
Date: Fri, 8 Sep 2006 19:26:04 -0400
Subject: [Python-3000] iostack, second revision
In-Reply-To: <1157754250.8948.1.camel@fsol>
References: <1d85506f0609071030o60a02ac5j739027d1825f0e7f@mail.gmail.com>
	<4501A1B1.5050707@gmail.com> <200609081403.18350.fdrake@acm.org>
	<1157740873.4979.10.camel@fsol> <1157754250.8948.1.camel@fsol>
Message-ID: <20060908232604.GC6250@performancedrivers.com>

On Sat, Sep 09, 2006 at 12:24:10AM +0200, Antoine Pitrou wrote:
> Le vendredi 08 septembre 2006 à 15:04 -0400, Jim Jewett a écrit :
> > > What about seek(), seek_relative() and seek_reverse() ?
> >
> > Why not just borrow the standard symbolic names of cur and end?
> >
> > seek(pos=0)
> > seek_cur(pos=0)
> > seek_end(pos=0)

I like the C-ish style because I'm used to it. These are OK so long as
seek(n, 2) raises an informative exception. I was initially going to
suggest seek_abs() for the absolute seek but if it remains plain seek()
old users won't have to go searching docs and help() would be ... helpful.

-Jack

From murman at gmail.com  Sat Sep  9 06:32:10 2006
From: murman at gmail.com (Michael Urman)
Date: Fri, 8 Sep 2006 23:32:10 -0500
Subject: [Python-3000] Help on text editors
In-Reply-To: <4500D990.3000707@blueyonder.co.uk>
References: <1cb725390609071221x7ffc9d3epc59ea0381369d3c4@mail.gmail.com>
	<4500D990.3000707@blueyonder.co.uk>
Message-ID: 

On 9/7/06, David Hopwood wrote:
> Yes. However, this is not a good idea for precisely the reason described
> on that page (false detection of Unicode), and so any Unicode detection
> algorithm in Python should only be based on detecting a BOM, IMHO.

Right, except BOMs break tons of Unix applications (and even
occasional Windows ones) which do not expect them. Which leaves us
with Python nearly unable to detect unicode on Unix. This is quite
unfortunate for those of us rooting for UTF-8.

Perhaps there are better heuristics that are worth considering. Perhaps
not. It certainly shouldn't be the default behaviour of a TextFile
constructor.

Michael
-- 
Michael Urman  http://www.tortall.net/mu/blog

From murman at gmail.com  Sat Sep  9 06:39:25 2006
From: murman at gmail.com (Michael Urman)
Date: Fri, 8 Sep 2006 23:39:25 -0500
Subject: [Python-3000] Help on text editors
In-Reply-To: 
References: 
Message-ID: 

On 9/7/06, Jeff Wilcox wrote:
> > From: "Paul Prescod"
> > 1. On US English Windows, Notepad defaults to an encoding called "ANSI".
> > "ANSI" is not a real encoding at all (and certainly not one from the
> On Japanese Windows 2000, Notepad defaults to ANSI as it does in the English
> version. It actually writes Shift JIS though.

ANSI is not an encoding; it is a collective name for various multibyte
encodings, each corresponding to a particular default language of the
machine. Thus ANSI corresponds to cp1252 on English and cp932 on
Japanese machines.

As for whether cp932 is the same as Shift JIS, David and I seem to
disagree. While I lack hard data, the string '\\' round trips through
either on my box.
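A detector restricted to byte-order marks, as David advocates, is only a
few lines. This sketch is illustrative only: the function name is
invented, and a fuller version would check the UTF-32 BOMs before the
UTF-16 ones, since the UTF-32-LE BOM starts with the UTF-16-LE one:

    import codecs

    def sniff_bom(prefix):
        """Given the first few bytes of a file, return the codec name
        implied by a byte-order mark, or None if there is no BOM."""
        if prefix.startswith(codecs.BOM_UTF8):
            return "utf-8-sig"   # this codec strips the BOM on decoding
        if prefix.startswith(codecs.BOM_UTF16_LE) or \
           prefix.startswith(codecs.BOM_UTF16_BE):
            return "utf-16"      # the utf-16 codec consumes the BOM itself
        return None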
-- 
Michael Urman  http://www.tortall.net/mu/blog

From ncoghlan at gmail.com  Sat Sep  9 07:44:59 2006
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Sat, 09 Sep 2006 15:44:59 +1000
Subject: [Python-3000] iostack, second revision
In-Reply-To: 
References: <1d85506f0609071030o60a02ac5j739027d1825f0e7f@mail.gmail.com>
	<4501A1B1.5050707@gmail.com> <200609081403.18350.fdrake@acm.org>
	<1157740873.4979.10.camel@fsol>
Message-ID: <450254DB.3020502@gmail.com>

Jim Jewett wrote:
> On 9/8/06, Antoine Pitrou wrote:
>> Le vendredi 08 septembre 2006 à 11:06 -0700, Guido van Rossum a écrit :
>>> -1 on those particular cryptic names. Which one of seekby() and
>>> rseek() is the relative seek? Where's the seek relative to EOF?
> 
>> What about seek(), seek_relative() and seek_reverse() ?
> 
> Why not just borrow the standard symbolic names of cur and end?
> 
> seek(pos=0)
> seek_cur(pos=0)
> seek_end(pos=0)
> 
> seek_end(-1000)  <==>  1000 units (bytes or chars or records or
> ...) before the end
> seek_cur(50)     <==>  50 units beyond current
> seek()           <==>  beginning

+1 here. Short, to the point, and easy to remember for anyone already
familiar with seek().

Cheers,
Nick.

P.S. on a slightly different topic, it would be nice if f.seek(-1) raised
ValueError instead of IOError. Passing a negative absolute seek value is a
program bug, not an environment problem.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
             http://www.boredomandlaziness.org

From david.nospam.hopwood at blueyonder.co.uk  Sat Sep  9 16:39:17 2006
From: david.nospam.hopwood at blueyonder.co.uk (David Hopwood)
Date: Sat, 09 Sep 2006 15:39:17 +0100
Subject: [Python-3000] Help on text editors
In-Reply-To: 
References: <1cb725390609071221x7ffc9d3epc59ea0381369d3c4@mail.gmail.com>
	<4500D990.3000707@blueyonder.co.uk>
Message-ID: <4502D215.1080407@blueyonder.co.uk>

Michael Urman wrote:
> On 9/7/06, David Hopwood wrote:
> 
>>Yes. However, this is not a good idea for precisely the reason described
>>on that page (false detection of Unicode), and so any Unicode detection
>>algorithm in Python should only be based on detecting a BOM, IMHO.
> 
> Right, except BOMs break tons of Unix applications (and even
> occasional Windows ones) which do not expect them.

This problem is overstated. A BOM anywhere in a text causes no problem
with display, and *should* be treated as an ignorable character for
searching, etc.

Note that there are plenty of other characters that should be treated as
ignorable, so the applications that are broken for BOMs are broken more
generally.

-- 
David Hopwood

From david.nospam.hopwood at blueyonder.co.uk  Sat Sep  9 17:04:44 2006
From: david.nospam.hopwood at blueyonder.co.uk (David Hopwood)
Date: Sat, 09 Sep 2006 16:04:44 +0100
Subject: [Python-3000] Help on text editors
In-Reply-To: 
References: 
Message-ID: <4502D80C.9030908@blueyonder.co.uk>

Michael Urman wrote:
> On 9/7/06, Jeff Wilcox wrote:
> 
>>>From: "Paul Prescod"
>>>1. On US English Windows, Notepad defaults to an encoding called "ANSI".
>>>"ANSI" is not a real encoding at all (and certainly not one from the
>>
>>On Japanese Windows 2000, Notepad defaults to ANSI as it does in the English
>>version. It actually writes Shift JIS though.
> 
> ANSI is not an encoding; it is a collective name for various multibyte
> encodings, each corresponding to a particular default language of the
> machine. Thus ANSI corresponds to cp1252 on English and cp932 on
> Japanese machines.
> 
> As for whether cp932 is the same as Shift JIS, David and I seem to
> disagree. While I lack hard data, the string '\\' round trips through
> either on my box.
> > As for whether cp932 is the same as Shift JIS, David and I seem to > disagree. While I lack hard data, the string '\\' round trips through > either on my box. You may have an implementation that uses Cp932 or similar, but calls it "Shift-JIS". agrees with me, FWIW. Here is a pretty complete mapping table for Shift-JIS + common extensions (as opposed to Cp932): although there is quite a bit of variation in mappings: -- David Hopwood From david.nospam.hopwood at blueyonder.co.uk Sat Sep 9 17:10:38 2006 From: david.nospam.hopwood at blueyonder.co.uk (David Hopwood) Date: Sat, 09 Sep 2006 16:10:38 +0100 Subject: [Python-3000] Help on text editors In-Reply-To: References: Message-ID: <4502D96E.7090607@blueyonder.co.uk> Michael Urman wrote: > As for whether cp932 is the same as Shift JIS, David and I seem to > disagree. While I lack hard data, the string '\\' round trips through > either on my box. I missed this part. On any single implementation, '\\' will usually round-trip from Unicode -> Shift-JIS -> Unicode; the issue is whether it is encoded as 0x5C, or something else like 0x815F. It may very well not round-trip if you use different implementations for encoding and decoding. -- David Hopwood From qrczak at knm.org.pl Sat Sep 9 17:43:13 2006 From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk) Date: Sat, 09 Sep 2006 17:43:13 +0200 Subject: [Python-3000] Help on text editors In-Reply-To: <4502D215.1080407@blueyonder.co.uk> (David Hopwood's message of "Sat, 09 Sep 2006 15:39:17 +0100") References: <1cb725390609071221x7ffc9d3epc59ea0381369d3c4@mail.gmail.com> <4500D990.3000707@blueyonder.co.uk> <4502D215.1080407@blueyonder.co.uk> Message-ID: <87bqpp82r2.fsf@qrnik.zagroda> David Hopwood writes: >> Right, except BOMs break tons of Unix applications (and even >> occasional Windows ones) which do not expect them. > > This problem is overstated. A BOM anywhere in a text causes no > problem with display, and *should* be treated as an ignorable > character for searching, etc. It is not ignorable in most file formats, and it is not automatically ignored by reading functions of most programming languages. > Note that there are plenty of other characters that should be > treated as ignorable, so the applications that are broken for BOMs > are broken more generally. I disagree. UTF-8 BOM should not be used on Unix. It's not a reliable method of encoding detection in general (applies only to Unicode), and it breaks the simplicity of text streams. -- __("< Marcin Kowalczyk \__/ qrczak at knm.org.pl ^^ http://qrnik.knm.org.pl/~qrczak/ From paul at prescod.net Sat Sep 9 19:41:51 2006 From: paul at prescod.net (Paul Prescod) Date: Sat, 9 Sep 2006 10:41:51 -0700 Subject: [Python-3000] Offtopic: declaring encoding Message-ID: <1cb725390609091041h24fc67e0g975429210afcabc8@mail.gmail.com> On 9/9/06, Marcin 'Qrczak' Kowalczyk wrote: > > > Note that there are plenty of other characters that should be > > treated as ignorable, so the applications that are broken for BOMs > > are broken more generally. > > I disagree. UTF-8 BOM should not be used on Unix. It's not a reliable > method of encoding detection in general (applies only to Unicode), > and it breaks the simplicity of text streams. We're offtopic but: treating these decisions as operating-system-specific is a big part of what caused the current mess. e.g with Japanese Windows users and Japanese Unix users using different encodings. 
The Unicode consortium should address the issue of auto-encoding and make a recommendation for how "raw" text files can have their encoding detected. A combination of BOM, coding declaration and fall-back to UTF-8 would cover the vast majority of the world's languages and incorporate many national encodings. Are you defending the status quo wherein text data cannot even be reliably processed on the desktop on which it was created (yes, even on Unix: look back in this thread). Do you have a positive prescription? Paul Prescod -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.python.org/pipermail/python-3000/attachments/20060909/163a90dc/attachment.html From paul at prescod.net Sat Sep 9 19:58:22 2006 From: paul at prescod.net (Paul Prescod) Date: Sat, 9 Sep 2006 10:58:22 -0700 Subject: [Python-3000] Help on text editors In-Reply-To: References: Message-ID: <1cb725390609091058j49ffcdc6h61ce7eb80700f011@mail.gmail.com> On 9/7/06, Jeff Wilcox wrote: > > > From: "Paul Prescod" < paul at prescod.net> > > 1. On US English Windows, Notepad defaults to an encoding called "ANSI". > > "ANSI" is not a real encoding at all (and certainly not one from the > On Japanese Windows 2000, Notepad defaults to ANSI as it does in the > English > version. It actually writes Shift JIS though. > > > 2. On my English Mac, the default character set for textedit is "Mac OS > > Roman". What is it for foreign language macs? What API does an > application > > use to query this default character set? What setting is it derived > from? > > The Unix-level locale (seems not!) or some GUI-level setting (which > one)? > > Mac OS X actually doesn't have different language versions of the > operating > system. If you change the language setting, the Japanese version > *becomes* > the English version and vice versa. (Several of the English speakers that > I > work with have purchased Japanese Macs and switched them over to English, > they're indistinguishable from English Macs afterwards. Similarly, > several > Macs purchased in the US have been successfully switched to Japanese, and > become indistinguishable from Macs bought in Japan.) Great: but what is the default Textedit encoding on a Japanized version of the Mac? Paul Prescod -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.python.org/pipermail/python-3000/attachments/20060909/63de2d1b/attachment.htm From qrczak at knm.org.pl Sat Sep 9 22:00:34 2006 From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk) Date: Sat, 09 Sep 2006 22:00:34 +0200 Subject: [Python-3000] Offtopic: declaring encoding In-Reply-To: <1cb725390609091041h24fc67e0g975429210afcabc8@mail.gmail.com> (Paul Prescod's message of "Sat, 9 Sep 2006 10:41:51 -0700") References: <1cb725390609091041h24fc67e0g975429210afcabc8@mail.gmail.com> Message-ID: <87wt8c4xp9.fsf@qrnik.zagroda> "Paul Prescod" writes: > text data cannot even be reliably processed on the desktop on which > it was created (yes, even on Unix: look back in this thread). Where? > Do you have a positive prescription? New communication protocols and newly created file formats designed for interchange will either specify the text encoding in metadata (if files are expected to be edited by hand and it's still a near future), or use UTF-8 exclusively. Simple file formats expected to be used only locally will continue to have the encoding implicit. The system encoding of Unix boxes will more commonly be UTF-8 as time passes. 
I'm not using UTF-8 on my desktop by default because there are still some applications which don't work with UTF-8 terminals. The situation is much better than it used to be 10 years ago: most applications didn't support UTF-8 back then, now most do. -- __("< Marcin Kowalczyk \__/ qrczak at knm.org.pl ^^ http://qrnik.knm.org.pl/~qrczak/ From paul at prescod.net Sun Sep 10 01:26:15 2006 From: paul at prescod.net (Paul Prescod) Date: Sat, 9 Sep 2006 16:26:15 -0700 Subject: [Python-3000] Offtopic: declaring encoding In-Reply-To: <87wt8c4xp9.fsf@qrnik.zagroda> References: <1cb725390609091041h24fc67e0g975429210afcabc8@mail.gmail.com> <87wt8c4xp9.fsf@qrnik.zagroda> Message-ID: <1cb725390609091626r6104a8a2k7604be7b560e7f2f@mail.gmail.com> On 9/9/06, Marcin 'Qrczak' Kowalczyk wrote: > > "Paul Prescod" writes: > > > text data cannot even be reliably processed on the desktop on which > > it was created (yes, even on Unix: look back in this thread). > > Where? http://mail.python.org/pipermail/python-3000/2006-September/003492.html New communication protocols and newly created file formats designed > for interchange will either specify the text encoding in metadata > (if files are expected to be edited by hand and it's still a near future), > or use UTF-8 exclusively. Simple file formats expected to be used only > locally will continue to have the encoding implicit. > > The system encoding of Unix boxes will more commonly be UTF-8 as time > passes. Okay, thanks for your view of where things are going. I think that it is clear that UTF-8 will replace iso8859-* on Unix over the next few years. It isn't as clear if it (or any other global encoding) will replace EUC. Paul Prescod -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.python.org/pipermail/python-3000/attachments/20060909/8bd90db9/attachment-0001.html From lists at janc.be Sun Sep 10 01:35:32 2006 From: lists at janc.be (Jan Claeys) Date: Sun, 10 Sep 2006 01:35:32 +0200 Subject: [Python-3000] Help on text editors In-Reply-To: <1cb725390609071221x7ffc9d3epc59ea0381369d3c4@mail.gmail.com> References: <1cb725390609071221x7ffc9d3epc59ea0381369d3c4@mail.gmail.com> Message-ID: <1157844932.5109.198.camel@bedsa> Op do, 07-09-2006 te 12:21 -0700, schreef Paul Prescod: > Guido has asked me to do some research in aid of a file encoding > detection/defaulting PEP. > > I only have access to a small number of operating systems and language > variants so I need help. > > If you have access to "German Windows XP", "Japanese Windows XP", > "Spanish OS X", "Japanese OS X", "German Ubuntu" etc., I would > appreciate answers to the following questions. [...] > 3. In general, how do modern versions of Linux and other Unix handle > this issue? In particular: what is your default encoding and how did > your operating system determine it? Did you install a locale-specific > version? Did the installer ask you? Did you edit a configuration file? > Did you change a GUI setting? What is the relationship between your > localization of Gnome/KDE and your default encoding? AFAIK Ubuntu has used UTF-8 as the default encoding for all languages since the 'hoary' release (version 5.04, which was the 2nd Ubuntu release). 
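The per-system answers being collected by hand in this thread can also be
queried from inside Python itself. A quick probe, where the commented
values are examples only and differ per machine:

    import locale, sys

    print sys.getdefaultencoding()       # 'ascii' on a stock 2.x install
    print sys.getfilesystemencoding()    # e.g. 'mbcs' on Windows, 'utf-8' on OS X

    locale.setlocale(locale.LC_ALL, '')  # adopt the user's environment (LANG etc.)
    print locale.getpreferredencoding()  # e.g. 'ISO-8859-15' or 'UTF-8'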
-- 
Jan Claeys

From paul at prescod.net  Sun Sep 10 05:29:05 2006
From: paul at prescod.net (Paul Prescod)
Date: Sat, 9 Sep 2006 20:29:05 -0700
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
Message-ID: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>

PEP: XXX
Title: Easy Text File Decoding
Version: $Revision$
Last-Modified: $Date$
Author: Paul Prescod
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 09-Sep-2006
Post-History: 09-Sep-2006
Python-Version: 3.0

Abstract
========

Python 3000 will use Unicode as the standard string type. This means
that text files read from disk will be "decoded" into Unicode code
points just as binary files might be decoded into integers and
structures. This change brings a few issues to the fore that were
previously ignorable.

For example, in Python 2.x, it was possible to open a text file, read
the data into a Python string, filter some lines and print the remaining
lines to the console without ever considering what "encoding" the text
was in. In Python 3000, the programmer will only get access to Python's
powerful string manipulation functions after decoding the data to
Unicode code points. This means that either the programmer or the Python
runtime must select a decoding algorithm (by naming the encoding
algorithm that was used to encode the data in the first place).

Often the programmer can do so based upon out-of-band knowledge ("this
file format is always UCS-2" or "the protocol header says that this data
is latin-1"). In other cases, the programmer may be more naive or simply
wish to avoid thinking about it and would rather defer the issue to
Python.

This document presents a proposal for algorithms and APIs that Python
can use to simplify the programmer's life.

Issues outside the scope of this PEP
=====================================

Any programmer who wishes to take direct control of the encoding
selection may of course ignore the features described in this PEP and
choose a decoding explicitly. The PEP is not intended to constrain them
in any way.

Bytes received through means other than the file system are not
addressed by this PEP. For example, the PEP does not address data
directly read from a socket or returned from marshal functions.

Rationale
==========

The simplest possible use case for Python text processing involves a
user maintaining some form of simple database (e.g. an address book) as
a text file and processing it with Python. Unfortunately, this use case
is not as simple as it should be because of the variety of encodings in
the universe. For example, the file might be UTF-8, ISO-8859-1 or
ISO-8859-2.

Professional programmers making widely distributed programs probably
have no alternative but to deal with this variability head-on. But
programmers working with data that originates and resides primarily on
their own computer might wish to avoid dealing with it. They would like
Python to just "try to do the right thing" with respect to the file.
They would like to think about encodings if and only if Python failed to
guess appropriately.

Proposal
========

The function to open a text file will tentatively be called textfile(),
though the function name is not an integral part of this PEP. The
function takes three arguments, the filename, the mode ("r", "w", "r+",
etc.) and the type.

The type could be a true encoding or one of a small set of additional
symbolic values. The two main symbolic values are:

* "site" -- the default value, which invokes a site-specific algorithm.
For example, a Japanese school teacher using Windows might default
"site" to Shift-JIS. An organization dealing with a small number of
encodings might default "site" to be equivalent to "guess". An
organization with a strict internationalization policy might default
"site" to "UTF-8".

An important open issue is what Python's out-of-box interpretation of
"site" should be. This is key because "site" is the default value so
Python's out-of-box behaviour is the "default default".

* "guess" -- the value to be used by encoding-inexpert programmers and
experts who feel confident that Python's guessing algorithm will produce
sufficient results for their purposes. The guessing algorithm will
necessarily be complicated and may change over time. It will take into
account the following factors:

 - the conventions dominant on the operating system of choice

 - any localization-relevant settings available

 - a certain number of bytes at the start of the file (perhaps start
   and end?). This sample will likely be on the order of thousands of
   bytes.

 - filesystem metadata attached to the file (in strong preference to
   the above).

* "locale" -- the encoding suggested by the operating system's locale
concept

Other symbolic values might allow the programmer to suggest specific
encoding detection algorithms like XML [#XML-encoding-detection]_, HTML
[#HTML-encoding-detection]_ and the "coding:" comment convention. These
would be specified in separate PEPs.

The Site Decoding Hook
========================

The "sys" module could have a function called "setdefaultfileencoding".
The encoding specified could be a true encoding name or one of the
encoding detection scheme names (e.g. "guess" or "XML").

In addition, it should be possible to register new encoding detection
schemes using a method like "sys.registerencodingdetector". This
function would take two arguments, a string and a callable. The callable
would accept a byte stream argument and return a text stream. The
contract for these detection scheme implementations must allow them to
peek ahead some bytes to use the content as a hint to the encoding.

Alternatives and Open Issues
==============================

1. Guido proposes that the function be called merely "open". His
proposal is that the binary open should be the alternative and should be
invoked explicitly with a "b" mode switch.

The PEP author feels, first, that changing the behaviour of an existing
function is more confusing and disruptive than creating another.
Backporting a change to the "open" function would be difficult and
therefore it would be unnecessarily difficult to create file-manipulating
libraries that work both on Python 2.x and 3.x. Second, the author feels
that "open" is an unnecessarily cryptic name rooted only in Unix/C
history. For a programmer coming from (for example) Javascript, open()
would tend to imply "open window". The PEP author believes that factory
functions should say what they are creating.

2. There is substantial disagreement on the behaviour of the function
when there is no encoding argument passed and no site override (i.e. the
out-of-box default).
Current proposals include ASCII (on the basis that it is a nearly
universal subset of popular encodings), UTF-8 (on the basis that it is
the dominant global standard encompassing all of Unicode), a
locale-derived encoding (on the basis that this is what a naive user
will generate in a text editor) or the guessing algorithm (on the basis
that it is by definition designed to guess right more often than any
more specific encoding name).

The PEP author strongly advocates a strict encoding like ASCII, UTF-8 or
no default at all (in which case the lack of an encoding would raise an
exception). A default like iso-8859-1 (even inferred from the
environment) will result in encodings like UTF-8, UCS-2 and even binary
files being "interpreted" as gibberish strings. This could result in
document or database corruption. An encoding with a "guess" default will
encourage the widespread creation of very unreliable code.

The current proposal is to have no out-of-box default until some point
in the future when a small set of auto-detectable encodings are globally
dominant. UTF-8 has gradually been gaining popularity through W3C and
other standards so it is possible that five years from now it will be
the "no-brainer" default. Until we can guess with substantial
confidence, absence of both an encoding declaration and a site override
should result in a thrown exception.

References
==========

.. [#XML-encoding-detection] XML Encoding Detection algorithm:
   http://www.w3.org/TR/REC-xml/#sec-guessing

.. [#HTML-encoding-detection] HTML Encoding Detection algorithm:
   http://www.w3.org/TR/REC-xml/#sec-guessing

Copyright
=========

This document has been placed in the public domain.

..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   coding: utf-8
   End:
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.python.org/pipermail/python-3000/attachments/20060909/38766c07/attachment-0001.htm

From greg.ewing at canterbury.ac.nz  Sun Sep 10 08:11:05 2006
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Sun, 10 Sep 2006 18:11:05 +1200
Subject: [Python-3000] The future of exceptions
In-Reply-To: <20060908094550.zdd30l5x9tk0koow@login.werra.lunarpages.com>
References: <20060908094550.zdd30l5x9tk0koow@login.werra.lunarpages.com>
Message-ID: <4503AC79.4090601@canterbury.ac.nz>

Michael Chermside wrote:

> It doesn't necessarily imply that
> the traceback be materialized immediately upon exception creation
> (which is undesirable because we want exceptions lightweight enough
> to use for things like for loop control!)... but it might mean that
> pieces of the stack frame need to hang around as long as the exception
> itself does.

With the current implementation, "materialising the traceback" and
"keeping parts of the stack frame hanging around" are pretty much the
same thing, since the traceback is mostly just a linked list of frames
encountered while unwinding the stack looking for a handler. So if
there's a possibility you might want a traceback at all at any point,
it's hard to see how the process could be made any more lightweight.

However, I'm wondering whether it might be worth distinguishing two
different kinds of exceptions: "flow control" exceptions which are used
something like a non-local goto, and full-blown exceptions.
Flow control exceptions typically don't need most of the exception
machinery -- they don't carry data of their own, so you don't need to
instantiate a class every time, and you're not usually interested in a
traceback. So maybe there should be a different form of raise statement
for these that doesn't bother making provision for them.

A problem is that if a flow control exception *doesn't* get caught by
something that's expecting it, you probably do want a traceback in order
to debug the problem. Maybe try-statements could maintain a stack of
handlers, so the raise-control-flow-exception statement could quickly
tell whether there is a handler, and if not, raise an ordinary exception
with a traceback.

Or maybe there should be a different mechanism altogether for non-local
gotos. I'd like to see some kind of "longjmp" object that could be
invoked to cause a jump back to a specific place. That would help
alleviate the problem that exceptions used for control flow can get
caught by the wrong handler. Sometimes you really want something that's
targeted to a specific handler, not just the next enclosing one of some
type.

--
Greg

From qrczak at knm.org.pl  Sun Sep 10 11:11:31 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Sun, 10 Sep 2006 11:11:31 +0200
Subject: [Python-3000] The future of exceptions
In-Reply-To: <4503AC79.4090601@canterbury.ac.nz> (Greg Ewing's message of
	"Sun, 10 Sep 2006 18:11:05 +1200")
References: <20060908094550.zdd30l5x9tk0koow@login.werra.lunarpages.com>
	<4503AC79.4090601@canterbury.ac.nz>
Message-ID: <874pvgozlo.fsf@qrnik.zagroda>

Greg Ewing writes:

> Flow control exceptions typically don't need most of the exception
> machinery -- they don't carry data of their own, so you don't need
> to instantiate a class every time,

It's lazily instantiated today (see PyErr_NormalizeException).

> Or maybe there should be a different mechanism altogether
> for non-local gotos. I'd like to see some kind of "longjmp"
> object that could be invoked to cause a jump back to
> a specific place.

Any non-local exit should be hookable by active function calls between
the raising point and the catching point, especially by things like
try...finally.

> Sometimes you really want something that's targeted to a specific
> handler, not just the next enclosing one of some type.

Indeed, but this can still use an exception internally. My language
Kogut has a function for that ('?' is lambda, the whole thing is an
argument of 'WithExit'):

WithExit ?exit {
   some code which can at some point call the 'exit' function
   introduced above, even from another function, and the control flow
   will return to this WithExit call
};

I think it can be exposed as something used with 'with' in Python.
'WithExit' constructs a unique exception object and catches precisely
this object.

Implementing it with an exception makes the semantics of expression
evaluation more uniform: an expression either evaluates to a value, or
fails with an exception, and there is no other possibility which would
have to be accounted for in generic wrappers which call unknown code
(e.g. my bridge between two languages, or running a computation by
another thread).

There are other kinds of non-local exits, like exiting the program or
thread cancellation, which can be implemented with exceptions and I
think it's better than inventing a separate mechanism for each.
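A rough sketch of how a 'WithExit' equivalent might look as a Python
context manager (the names WithExit and _NonLocalExit are invented for
illustration; this is not an existing API):

    class _NonLocalExit(Exception):
        pass

    class WithExit(object):
        """Targeted non-local exit: calling the function returned by
        __enter__ unwinds precisely to this 'with' block, not merely
        to the nearest handler of some type."""
        def __enter__(self):
            self.token = _NonLocalExit()  # unique object per block
            return self.exit

        def exit(self):
            raise self.token

        def __exit__(self, exc_type, exc_value, tb):
            # Swallow only our own token; anything else propagates.
            return exc_value is self.token

    with WithExit() as escape:
        for item in [1, 2, 3]:
            if item == 2:
                escape()  # control resumes after the with block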
-- __("< Marcin Kowalczyk \__/ qrczak at knm.org.pl ^^ http://qrnik.knm.org.pl/~qrczak/ From solipsis at pitrou.net Sun Sep 10 12:31:24 2006 From: solipsis at pitrou.net (Antoine Pitrou) Date: Sun, 10 Sep 2006 12:31:24 +0200 Subject: [Python-3000] Pre-PEP: Easy Text File Decoding In-Reply-To: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com> References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com> Message-ID: <1157884285.4246.41.camel@fsol> Le samedi 09 septembre 2006 ? 20:29 -0700, Paul Prescod a ?crit : > The type could be a true encoding or one of a small set of additional > symbolic values. The two main symbolic values are: Actually your proposal has three ;) > For example, a Japanese school teacher using Windows might default > "site" to Shift-JIS. I think a Japanese school teacher using Windows shouldn't have to configure anything specifically in Python, encoding-wise. I've never seen a tool (e.g. text editor) refuse to work before you had explicitly configured an encoding *for the tool*. Those tools either choose system-wide default aka "locale" (if they want to play fair with other apps) or their own (if they think utf-8 is the future). I see two cases where refusing to use a default is even more unhelpful: - on the growing number of systems which have utf-8 as default - when the programmer simply wants to open a pure-ascii text file (e.g. configuration file), and opening it as text allows him to read it line-by-line, or use whatever other facilities text files provide that binary files don't So, here is an alternative proposal : Make it so that textfile() doesn't recognize system-wide defaults (as in your proposal), but also provide autotextfile() which would recognize those defaults (with a by_content=False optional argument to enable content-based guessing). textfile() being clearly marked for use by large well thought-out applications, and autotextfile() for small scripts and the like. Different names make it clear that they are for different uses, and allow to spot them easily when looking at source code (either by a human reader or a quality measurement tool). Regards Antoine. From phd at mail2.phd.pp.ru Sun Sep 10 12:35:00 2006 From: phd at mail2.phd.pp.ru (Oleg Broytmann) Date: Sun, 10 Sep 2006 14:35:00 +0400 Subject: [Python-3000] Pre-PEP: Easy Text File Decoding In-Reply-To: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com> References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com> Message-ID: <20060910103500.GA13412@phd.pp.ru> On Sat, Sep 09, 2006 at 08:29:05PM -0700, Paul Prescod wrote: > "the protocol header says that this data is latin-1"). "Protocol metadata" if you allow me to suggest the word. Oleg. -- Oleg Broytmann http://phd.pp.ru/ phd at phd.pp.ru Programmers don't die, they just GOSUB without RETURN. From solipsis at pitrou.net Sun Sep 10 13:02:57 2006 From: solipsis at pitrou.net (Antoine Pitrou) Date: Sun, 10 Sep 2006 13:02:57 +0200 Subject: [Python-3000] Pre-PEP: Easy Text File Decoding In-Reply-To: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com> References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com> Message-ID: <1157886177.4246.59.camel@fsol> > The Site Decoding Hook > ======================== > > The "sys" module could have a function called > "setdefaultfileencoding". The encoding specified could be a true > encoding name or one of the encoding detection scheme names ( e.g. > "guess" or "XML"). 
Isn't it more intuitive to gather functions based on what their
high-level purpose is ("text" or "textfile") than implementation details
of where the information comes from ("sys", "locale")?

That function could be "textfile.set_default_encoding" (with
underscores), or even "text.textfile.set_default_encoding" (if all this
resides in a "text" module).

Regards

Antoine.

From ncoghlan at gmail.com  Sun Sep 10 13:58:00 2006
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Sun, 10 Sep 2006 21:58:00 +1000
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <1157884285.4246.41.camel@fsol>
References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>
	<1157884285.4246.41.camel@fsol>
Message-ID: <4503FDC8.2030608@gmail.com>

Antoine Pitrou wrote:
> So, here is an alternative proposal:
> Make it so that textfile() doesn't recognize system-wide defaults (as in
> your proposal), but also provide autotextfile() which would recognize
> those defaults (with a by_content=False optional argument to enable
> content-based guessing).
>
> textfile() being clearly marked for use by large well thought-out
> applications, and autotextfile() for small scripts and the like.
> Different names make it clear that they are for different uses, and make
> them easy to spot when looking at source code (either by a human reader
> or a quality measurement tool).

How does your "autotextfile('myfile.txt')" differ from Paul's
"textfile('myfile.txt', encoding='guess')"?

Cheers,
Nick.

-- 
Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia
---------------------------------------------------------------
http://www.boredomandlaziness.org

From ncoghlan at gmail.com  Sun Sep 10 14:05:35 2006
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Sun, 10 Sep 2006 22:05:35 +1000
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>
References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>
Message-ID: <4503FF8F.6070801@gmail.com>

Paul Prescod wrote:
> The function to open a text file will tentatively be called textfile(),
> though the function name is not an integral part of this PEP. The
> function takes three arguments, the filename, the mode ("r", "w", "r+",
> etc.) and the type.
>
> The type could be a true encoding or one of a small set of additional
> symbolic values.

The 'additional symbolic values' should be implemented as true encodings
(i.e., it should be possible to look up 'site', 'guess' and 'locale' in
the codecs registry, and replace them there as well).

I also agree with Guido that the right spelling for the factory function
is to incorporate this into the existing open() builtin. The signature
of open() is already going to change to accept an encoding argument in
Py3k, and the special encodings proposed in the PEP are just that:
special encodings that happen to take environmental information into
account when deciding how to decode or encode data.
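To make that concrete, here is a minimal sketch of how one of the
symbolic names could be hooked into the existing registry
(codecs.register() and codecs.lookup() are real APIs; wiring 'locale' up
this way is the proposal, not current behaviour):

    import codecs
    import locale

    def _search(name):
        # Resolve the symbolic name 'locale' to whatever encoding the
        # user's locale settings suggest, reusing that codec wholesale.
        if name == 'locale':
            return codecs.lookup(locale.getpreferredencoding())
        return None

    codecs.register(_search)

    # After registration 'locale' behaves like any other codec name, so
    # open(fname, encoding='locale') would pick up the user's setting.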
Cheers,
Nick.

-- 
Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia
---------------------------------------------------------------
http://www.boredomandlaziness.org

From solipsis at pitrou.net  Sun Sep 10 14:47:15 2006
From: solipsis at pitrou.net (Antoine Pitrou)
Date: Sun, 10 Sep 2006 14:47:15 +0200
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <4503FDC8.2030608@gmail.com>
References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>
	<1157884285.4246.41.camel@fsol> <4503FDC8.2030608@gmail.com>
Message-ID: <1157892435.4246.107.camel@fsol>

On Sunday 10 September 2006 at 21:58 +1000, Nick Coghlan wrote:
> Antoine Pitrou wrote:
> > So, here is an alternative proposal:
> > Make it so that textfile() doesn't recognize system-wide defaults (as in
> > your proposal), but also provide autotextfile() which would recognize
> > those defaults (with a by_content=False optional argument to enable
> > content-based guessing).
> >
> > textfile() being clearly marked for use by large well thought-out
> > applications, and autotextfile() for small scripts and the like.
> > Different names make it clear that they are for different uses, and make
> > them easy to spot when looking at source code (either by a human reader
> > or a quality measurement tool).
>
> How does your "autotextfile('myfile.txt')" differ from Paul's
> "textfile('myfile.txt', encoding='guess')"?

Paul's "encoding='guess'" specifies a complicated and dangerous guessing
algorithm.

However, autotextfile('myfile.txt') would mean:
- use Paul's "site" if such a thing is defined
- otherwise, use Paul's "locale"
(no content-based guessing)

On the other hand "autotextfile('myfile.txt', by_content=True)" would
enable content-based guessing, thus be equivalent to Paul's
"encoding='guess'".

To sum up the API:
- textfile("filename.txt", mode, encoding=None): fails without an
explicit "encoding" argument if no "site" algorithm has been explicitly
configured.
- autotextfile("filename.txt", mode, by_content=False): selects either
the "site"-configured encoding or the locale fallback, unless
"by_content" is True in which case it tries to detect based on actual
content.

In short, my proposal is just a naming proposal to achieve the following
goals:
- the textfile() function is "clean", and satisfies the ideal that it is
Wrong to not specify an encoding when retrieving text from on-disk bytes
- the autotextfile() function makes it easy to write simple scripts with
an easy to remember function with an explicit name (instead of a magic
value in an optional string argument)
- the autotextfile() function makes it easy to spot those abusive uses
of the quick-and-dirty way in apps which strive for interoperability and
portability (in French we say "ne pas mélanger les torchons et les
serviettes": don't mix towels and rags :-))

All this can be in a module, no need to pollute the top-level namespace:
from text import textfile
from text import autotextfile

> The 'additional symbolic values' should be implemented as true
> encodings (i.e., it should be possible to look up 'site', 'guess' and
> 'locale' in the codecs registry, and replace them there as well).

Treating different things as "true encodings" does not help
understandability IMHO. "guess", "site" and "locale" are not encodings
in themselves, they are decision algorithms. In particular, "guess" has
to look at big chunks of existing text contents before deciding (which
may or may not have side-effects such as unexpected buffering).
Really, while "iso-8859-1" or "utf-8" is always the same encoding, "guess" will not always result in the same encoding being used: it depends on actual data fed to it. "guess" will not even allow the same set of characters to be used: if "guess" results in "iso-8859-1", then I can't use all the (Unicode) characters that I can use when "guess" results in "utf-8". This variability/unpredictability is a fundamental difference in behaviour compared to a "true encoding", for which you can always be sure what set of (textual) data can be represented. Regards Antoine. From solipsis at pitrou.net Sun Sep 10 15:21:14 2006 From: solipsis at pitrou.net (Antoine Pitrou) Date: Sun, 10 Sep 2006 15:21:14 +0200 Subject: [Python-3000] encoding='guess' ? In-Reply-To: <1157892435.4246.107.camel@fsol> References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com> <1157884285.4246.41.camel@fsol> <4503FDC8.2030608@gmail.com> <1157892435.4246.107.camel@fsol> Message-ID: <1157894475.4246.130.camel@fsol> Hi, Let me add that 'guess' should probably be forbidden as an encoding parameter (instead, a separate function argument should be used as in my proposal). Here is a schematic example to show why : def append_text(filename, encoding): src = textfile(filename, "r", encoding) my_text = src.read() src.close() dst = textfile("textlist.txt", "r+", encoding) dst.seek_end(0) dst.write(my_text + "\n") dst.close() With Paul's current proposal three cases can arise : - "encoding" is a real encoding name like iso-8859-1 or utf-8. There should be no problems, since we assume this encoding has been configured once and for all in the application. - "encoding" is either "site" or "locale". This should result in the same value run after run, since we assume the site or locale encoding value has been configured once and for all. - "encoding" is "guess". In this case anything can happen. A possible occurence is that for the first file, it will result in utf-8 being detected (or Shift-JIS, or whatever), and for the second file it will be iso-8859-1. This will lead to a crash in the likely case that some characters in the source file can't be represented using the character encoding auto-detected for the destination file. Yet the append_text() function does look correct, doesn't it? We shouldn't hide a contextual encoding-detection algorithm under an encoding name. It leads to semantic uncertainty. Regards Antoine. From ncoghlan at gmail.com Sun Sep 10 15:44:16 2006 From: ncoghlan at gmail.com (Nick Coghlan) Date: Sun, 10 Sep 2006 23:44:16 +1000 Subject: [Python-3000] encoding='guess' ? In-Reply-To: <1157894475.4246.130.camel@fsol> References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com> <1157884285.4246.41.camel@fsol> <4503FDC8.2030608@gmail.com> <1157892435.4246.107.camel@fsol> <1157894475.4246.130.camel@fsol> Message-ID: <450416B0.4050109@gmail.com> Antoine Pitrou wrote: > Hi, > > Let me add that 'guess' should probably be forbidden as an encoding > parameter (instead, a separate function argument should be used as in my > proposal). > > Here is a schematic example to show why : > > def append_text(filename, encoding): > src = textfile(filename, "r", encoding) > my_text = src.read() > src.close() > dst = textfile("textlist.txt", "r+", encoding) > dst.seek_end(0) > dst.write(my_text + "\n") > dst.close() > > With Paul's current proposal three cases can arise : > - "encoding" is a real encoding name like iso-8859-1 or utf-8. 
> There should be no problems, since we assume this encoding has been
> configured once and for all in the application.
> - "encoding" is either "site" or "locale". This should result in the
> same value run after run, since we assume the site or locale encoding
> value has been configured once and for all.
> - "encoding" is "guess". In this case anything can happen. A possible
> occurrence is that for the first file, it will result in utf-8 being
> detected (or Shift-JIS, or whatever), and for the second file it will be
> iso-8859-1. This will lead to a crash in the likely case that some
> characters in the source file can't be represented using the character
> encoding auto-detected for the destination file.
>
> Yet the append_text() function does look correct, doesn't it?
>
> We shouldn't hide a contextual encoding-detection algorithm under an
> encoding name. It leads to semantic uncertainty.

Interesting. This goes back more towards the model of "no default
encoding, but provide the right tools to make it easy for a program to
choose one in the absence of any metadata".

So perhaps there should just be an explicit function "guessencoding()"
that accepts a filename and returns a codec name. So if you want to
guess, you would do something like:

f = open(fname, 'r', string.guessencoding(fname))

The PEP's other suggestions would then be spelled something like:

f = open(fname, 'r', string.getlocaleencoding())
f = open(fname, 'r', string.getsiteencoding())

Cheers,
Nick.

-- 
Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia
---------------------------------------------------------------
http://www.boredomandlaziness.org

From david.nospam.hopwood at blueyonder.co.uk  Sun Sep 10 15:52:44 2006
From: david.nospam.hopwood at blueyonder.co.uk (David Hopwood)
Date: Sun, 10 Sep 2006 14:52:44 +0100
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <1157892435.4246.107.camel@fsol>
References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>
	<1157884285.4246.41.camel@fsol> <4503FDC8.2030608@gmail.com>
	<1157892435.4246.107.camel@fsol>
Message-ID: <450418AC.2010400@blueyonder.co.uk>

Antoine Pitrou wrote:
> On Sunday 10 September 2006 at 21:58 +1000, Nick Coghlan wrote:
>>Antoine Pitrou wrote:
>>
>>>So, here is an alternative proposal:
>>>Make it so that textfile() doesn't recognize system-wide defaults (as in
>>>your proposal), but also provide autotextfile() which would recognize
>>>those defaults (with a by_content=False optional argument to enable
>>>content-based guessing).
>>>
>>>textfile() being clearly marked for use by large well thought-out
>>>applications, and autotextfile() for small scripts and the like.
>>>Different names make it clear that they are for different uses, and
>>>make them easy to spot when looking at source code (either by a human
>>>reader or a quality measurement tool).
>>
>>How does your "autotextfile('myfile.txt')" differ from Paul's
>>"textfile('myfile.txt', encoding='guess')"?
>
> Paul's "encoding='guess'" specifies a complicated and dangerous guessing
> algorithm.

Indeed, to the extent that it specifies anything. However, guessing
algorithms can differ greatly in how complicated and dangerous they are.

Here is a very simple, reasonably (although not completely) safe, and
much more predictable guessing algorithm, based on a generalization of
:

Let A, B, C, and D be the first 4 bytes of the stream, or None if the
corresponding byte is past end-of-stream.
Let other be any encoding which is to be used as a default if no
specific UTF is detected.

if A == 0xEF and B == 0xBB and C == 0xBF: return UTF8
if B == None: return other
if A == 0 and B == 0 and D != None: return UTF32BE
if C == 0 and D == 0: return UTF32LE
if A == 0xFE and B == 0xFF: return UTF16BE
if A == 0xFF and B == 0xFE: return UTF16LE
if A != 0 and B != 0: return other
if A == 0: return UTF16BE
return UTF16LE

This would normally be used with 'other' as the system encoding, as an
alternative to just assuming that the file is in the system encoding.

There is very little chance of this algorithm misdetecting a file in a
non-Unicode encoding as Unicode. For that to happen, either the first
two or three bytes would have to be encoded in exactly the same way as a
UTF-16 or UTF-8 BOM, or one of the first three characters would have to
be NUL.

However, if the file *is* Unicode and it starts with a BOM, then its UTF
will always be correctly detected. Furthermore, UTF-16 and UTF-32 will
be correctly detected if the file starts with a character from U+0001 to
U+00FF (i.e. non-NUL and in the ISO-8859-1 range).

Another advantage of this algorithm is that it always reads only 4
bytes.

> However, autotextfile('myfile.txt') would mean:
> - use Paul's "site" if such a thing is defined
> - otherwise, use Paul's "locale"
> (no content-based guessing)
>
> On the other hand "autotextfile('myfile.txt', by_content=True)" would
> enable content-based guessing, thus be equivalent to Paul's
> "encoding='guess'".

As I pointed out earlier, any file open function that guesses the
encoding should return which encoding has been guessed.

Alternatively, it could be possible to allow the encoding to be set
after the file has been opened, in which case a separate function could
do the guessing.

>>The 'additional symbolic values' should be implemented as true
>>encodings (i.e., it should be possible to look up 'site', 'guess' and
>>'locale' in the codecs registry, and replace them there as well).
>
> Treating different things as "true encodings" does not help
> understandability IMHO. "guess", "site" and "locale" are not encodings
> in themselves, they are decision algorithms.

+1.

-- 
David Hopwood

From solipsis at pitrou.net  Sun Sep 10 16:00:06 2006
From: solipsis at pitrou.net (Antoine Pitrou)
Date: Sun, 10 Sep 2006 16:00:06 +0200
Subject: [Python-3000] encoding='guess' ?
In-Reply-To: <450416B0.4050109@gmail.com>
References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>
	<1157884285.4246.41.camel@fsol> <4503FDC8.2030608@gmail.com>
	<1157892435.4246.107.camel@fsol> <1157894475.4246.130.camel@fsol>
	<450416B0.4050109@gmail.com>
Message-ID: <1157896806.4246.138.camel@fsol>

On Sunday 10 September 2006 at 23:44 +1000, Nick Coghlan wrote:
> Interesting. This goes back more towards the model of "no default encoding,
> but provide the right tools to make it easy for a program to choose one in the
> absence of any metadata".

In the "clean" API, yes. But it would be nice to also have an easy API
for small scripts, hence my "autotextfile" proposal.
(and it would also avoid making life too hard for beginners trying to
learn the language)

> f = open(fname, 'r', string.guessencoding(fname))

This one is inefficient because it results in opening the file twice:
once in string.guessencoding(), and once in open(). This does not happen
if there is a special argument instead, like "by_content=True" in my
proposal.
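Since David's table only needs four bytes, it can also be run over an
already-open binary stream, which sidesteps the opened-twice objection.
A runnable sketch (Python 3 style bytes indexing; sniff_encoding is an
invented name, 'other' is the caller's fallback, and the returned
strings are labels mirroring David's table rather than a statement about
which codecs Python ships):

    import codecs

    def sniff_encoding(stream, other='utf-8'):
        # Assumes a seekable binary stream; peek at four bytes and put
        # the stream back where we found it.
        head = stream.read(4)
        stream.seek(0)
        a = head[0] if len(head) > 0 else None
        b = head[1] if len(head) > 1 else None
        c = head[2] if len(head) > 2 else None
        d = head[3] if len(head) > 3 else None
        if head.startswith(codecs.BOM_UTF8):
            return 'utf-8'
        if b is None:
            return other
        if a == 0 and b == 0 and d is not None:
            return 'utf-32-be'
        if c == 0 and d == 0:
            return 'utf-32-le'
        if head.startswith(codecs.BOM_UTF16_BE):
            return 'utf-16-be'
        if head.startswith(codecs.BOM_UTF16_LE):
            return 'utf-16-le'
        if a != 0 and b != 0:
            return other
        if a == 0:
            return 'utf-16-be'
        return 'utf-16-le'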
Cheers

Antoine.

From solipsis at pitrou.net  Sun Sep 10 16:04:47 2006
From: solipsis at pitrou.net (Antoine Pitrou)
Date: Sun, 10 Sep 2006 16:04:47 +0200
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <450418AC.2010400@blueyonder.co.uk>
References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>
	<1157884285.4246.41.camel@fsol> <4503FDC8.2030608@gmail.com>
	<1157892435.4246.107.camel@fsol> <450418AC.2010400@blueyonder.co.uk>
Message-ID: <1157897087.4246.143.camel@fsol>

On Sunday 10 September 2006 at 14:52 +0100, David Hopwood wrote:
> > On the other hand "autotextfile('myfile.txt', by_content=True)" would
> > enable content-based guessing, thus be equivalent to Paul's
> > "encoding='guess'".
>
> As I pointed out earlier, any file open function that guesses the
> encoding should return which encoding has been guessed.

Since open files are objects, the encoding can just be a read-only
property:

# replace autotextfile by whatever API is finally chosen ;)
f = autotextfile('myfile.txt', by_content=True)
enc = f.encoding

> Alternatively, it could be possible to allow the encoding to be set
> after the file has been opened, in which case a separate function
> could do the guessing.

Yes, sounds like a nice alternative.

Regards

Antoine.

From solipsis at pitrou.net  Sun Sep 10 16:27:12 2006
From: solipsis at pitrou.net (Antoine Pitrou)
Date: Sun, 10 Sep 2006 16:27:12 +0200
Subject: [Python-3000] sys.stdin and sys.stdout with textfile
Message-ID: <1157898432.4246.161.camel@fsol>

Hi,

Another aspect of the textfile discussion.

sys.stdin and sys.stdout are for now, concretely, byte streams (AFAIK,
at least under Unix). Yet it must be possible to read/write text to and
from them. So two questions:
- Is there a builtin text.stdin / text.stdout counterpart to sys.stdin /
sys.stdout (the former being text versions, the latter raw bytes
versions)? Or a way to write: my_input_file = textfile(sys.stdin)?
- How is the default encoding handled? Does Python mandate setting an
encoding before calling print() or raw_input()?

Also, consider a "script.py" beginning with:

import sys, text
if len(sys.argv) > 1:
    f = textfile(sys.argv[1], "r")
else:
    f = text.stdin

Should encoding policy be chosen differently depending on whether the
script is called with:
python script.py in.txt
or with:
python script.py < in.txt?

Regards

Antoine.

From qrczak at knm.org.pl  Sun Sep 10 18:08:14 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Sun, 10 Sep 2006 18:08:14 +0200
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>
	(Paul Prescod's message of "Sat, 9 Sep 2006 20:29:05 -0700")
References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>
Message-ID: <87d5a366xd.fsf@qrnik.zagroda>

"Paul Prescod" writes:

> The type could be a true encoding or one of a small set of additional
> symbolic values. The two main symbolic values are:

Here is a counter-proposal.

There is a variable sys.default_encoding. It's used by file opening
functions when the encoding is not specified explicitly, among others.
Its initial value is set in site.py with a site-specific algorithm.

Two variants of the proposal:

1. The default site-specific algorithm queries the locale on Unix, uses
"mbcs" on Windows (which is a special encoding which causes
MultiByteToWideChar to be used as the decoding function), and something
appropriate on other systems.

2.
The default initial value is "locale" (or "system" or "default" or
whatever, but the spelling is fixed), which is a special encoding name
which means to use the system-specific encoding, as above.

I prefer variant 1: it's simpler and it allows programs to examine the
choice on Unix. A Python-specific environment variable could be defined
to override the system-specific choice.

If MultiByteToWideChar on Windows doesn't handle UTF-8 even with a BOM
(I don't know whether it does), then the Windows default could be an
encoding which assumes UTF-8 when a UTF-8 BOM is present, and uses
MultiByteToWideChar otherwise. This applies only to Windows; Unix rarely
uses a BOM, OTOH on Unix you can have UTF-8 locales which Windows
doesn't have as far as I know.

Other than that, guessing the encoding from the contents of the text
stream, especially statistical guessing based on well-formed UTF-8
non-ASCII characters, shouldn't be encouraged, because its effect is not
predictable. There can be a separate function which guesses the encoding
for those who really want to do this.

If Python ever has dynamically-scoped variables, sys.default_encoding
should be dynamically scoped, so it's possible to set it for the context
of a block of code.

sys.default_encoding also applies to filenames, to names and values of
environment variables, to program invocation parameters (both sys.argv
and os.exec*), to pwd.struct_passwd.pw_gecos, etc. There are a number of
Unix interfaces which don't specify the encoding of texts they exchange
(and of course pw_gecos doesn't contain a BOM if it's UTF-8).

Antoine Pitrou writes:

> sys.stdin and sys.stdout are for now, concretely, byte streams (AFAIK,
> at least under Unix). Yet it must be possible to read/write text to and
> from them.

Here is how my language Kogut does this:

RawStdIn etc. are the underlying raw files (thin wrappers over file
descriptors). StdIn etc. are text files with encoding, buffering etc.
They are initialized the first time they are used, i.e. the first time
the StdIn variable is read. They are constructed with the default
encoding from that time. This allows a script to set the default
encoding before accessing standard text streams.

I don't know whether Python typically accesses stdin/stdout during
initialization, before the first line of the script is executed. If it
does, this design can't be used until this is changed.

> Also, consider a "script.py" beginning with:
>
> import sys, text
> if len(sys.argv) > 1:
>     f = textfile(sys.argv[1], "r")
> else:
>     f = text.stdin
>
> Should encoding policy be chosen differently depending on whether the
> script is called with:
> python script.py in.txt
> or with:
> python script.py < in.txt?

With my design it's the same. It's also the same if the script does
sys.default_encoding = 'ISO-8859-1' at the beginning.

Note: in my design sys.argv is also initialized lazily (in fact each
time it is accessed, until it's assigned to, at which point it starts to
behave as a normal variable).

-- 
   __("<     Marcin Kowalczyk
   \__/    qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From guido at python.org  Sun Sep 10 19:04:56 2006
From: guido at python.org (Guido van Rossum)
Date: Sun, 10 Sep 2006 10:04:56 -0700
Subject: [Python-3000] sys.stdin and sys.stdout with textfile
In-Reply-To: <1157898432.4246.161.camel@fsol>
References: <1157898432.4246.161.camel@fsol>
Message-ID:

On 9/10/06, Antoine Pitrou wrote:
> Another aspect of the textfile discussion.
> sys.stdin and sys.stdout are for now, concretely, byte streams (AFAIK,
> at least under Unix).

No, they are conceptually text streams, because that's what they are on
Windows, which is the only remaining platform where you can currently
experience the difference between text and byte streams.

> Yet it must be possible to read/write text to and
> from them.

I'd turn it around. If you want to read bytes from stdin (sometimes a
useful thing for filters), in Py3k you better dig out the underlying
byte stream and use that.

> So two questions:
> - Is there a builtin text.stdin / text.stdout counterpart to
> sys.stdin / sys.stdout (the former being text versions, the latter raw
> bytes versions)?

You've got it backwards.

> Or a way to write: my_input_file = textfile(sys.stdin)?
> - How is the default encoding handled?
> Does Python mandate setting an encoding before calling print() or
> raw_input()?

Not in my view of the future. :-)

> Also, consider a "script.py" beginning with:
>
> import sys, text
> if len(sys.argv) > 1:
>     f = textfile(sys.argv[1], "r")
> else:
>     f = text.stdin
>
> Should encoding policy be chosen differently depending on whether the
> script is called with:
> python script.py in.txt
> or with:
> python script.py < in.txt?

All sorts of things are different when reading stdin vs. opening a
filename. E.g. stdin may be a pipe.

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From guido at python.org  Sun Sep 10 19:09:17 2006
From: guido at python.org (Guido van Rossum)
Date: Sun, 10 Sep 2006 10:09:17 -0700
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <4503FF8F.6070801@gmail.com>
References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>
	<4503FF8F.6070801@gmail.com>
Message-ID:

On 9/10/06, Nick Coghlan wrote:
> The 'additional symbolic values' should be implemented as true encodings
> (i.e., it should be possible to look up 'site', 'guess' and 'locale' in the
> codecs registry, and replace them there as well).

That's hard to do since guessing, at least, may require inspection of a
large portion of the input data before settling upon a specific choice.
The decoding API doesn't have a way to do this AFAIK. And for encoding
(output) it's even more iffy -- if possible I'd like the guessing
function to have access to what was in the file before it was emptied by
the "create" function, or what's at the start before appending to the
end,

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From guido at python.org  Sun Sep 10 19:11:46 2006
From: guido at python.org (Guido van Rossum)
Date: Sun, 10 Sep 2006 10:11:46 -0700
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <87d5a366xd.fsf@qrnik.zagroda>
References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>
	<87d5a366xd.fsf@qrnik.zagroda>
Message-ID:

On 9/10/06, Marcin 'Qrczak' Kowalczyk wrote:
> Here is a counter-proposal.
>
> There is a variable sys.default_encoding. It's used by file opening
> functions when the encoding is not specified explicitly, among others.
> Its initial value is set in site.py with a site-specific algorithm.

This doesn't seem to allow guessing based on the file's contents. That
seems intentional on your part, but I believe it makes for way too many
disappointing user experiences.

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From 2006 at jmunch.dk  Sun Sep 10 19:53:09 2006
From: 2006 at jmunch.dk (Anders J.
Munch)
Date: Sun, 10 Sep 2006 19:53:09 +0200
Subject: [Python-3000] iostack, second revision
In-Reply-To: <450254DB.3020502@gmail.com>
References: <1d85506f0609071030o60a02ac5j739027d1825f0e7f@mail.gmail.com>
	<4501A1B1.5050707@gmail.com> <200609081403.18350.fdrake@acm.org>
	<1157740873.4979.10.camel@fsol> <450254DB.3020502@gmail.com>
Message-ID: <45045105.5040209@jmunch.dk>

Nick Coghlan wrote:
> Jim Jewett wrote:
>> Why not just borrow the standard symbolic names of cur and end?
>>
>> seek(pos=0)
>> seek_cur(pos=0)
>> seek_end(pos=0)

I say drop seek_cur and seek_end altogether, and keep only absolute
seek.

The C library caters for archaic file systems, that are record-based or
otherwise not well modelled as an array of bytes. That's where the
ftell/fseek/fpos_t system comes from: An fpos_t might be a composite
data type containing a record number and a within-record offset; but as
long as it's used as an opaque token, you'd never notice. That was a
nice design for backward-compatibility back in the early 1970's. Thirty
years later, do we still need it? POSIX and Win32 have array-of-bytes
files. Does CPython even run on any OS where binary files are not seen
as arrays of bytes?

I'm saying _binary_ files because a gander through the standard library
shows that seeking is never done on text files. Even mailbox.py opens
Unix mailbox files as binary.

The majority of f.seek(.., 2) calls in the library use it for computing
the length of file. How's that for an "opaque token": f.tell() is taken
to be the length of the file after f.seek(0,2).

As for seeking to the end with only an absolute .seek available:
Surely, any file that supports seeking to the end will also support
reporting the file size. Thus f.seek(f.length) should suffice, and what
could be clearer? Also, there are the "a"/"a+" modes for appending, no
seeks required.

Having just a single method/mode will not only ease file-protocol
implementation, but IMO client code will be easier to read as well.

- Anders

PS: I'm working on that FileBytes object, Tomer, as a wrapper over an
object that supports seek to absolute position, with integrated
buffering.

From jcarlson at uci.edu  Sun Sep 10 20:08:41 2006
From: jcarlson at uci.edu (Josiah Carlson)
Date: Sun, 10 Sep 2006 11:08:41 -0700
Subject: [Python-3000] The future of exceptions
In-Reply-To: <4503AC79.4090601@canterbury.ac.nz>
References: <20060908094550.zdd30l5x9tk0koow@login.werra.lunarpages.com>
	<4503AC79.4090601@canterbury.ac.nz>
Message-ID: <20060910110221.F8EA.JCARLSON@uci.edu>

Greg Ewing wrote:
> Or maybe there should be a different mechanism altogether
> for non-local gotos. I'd like to see some kind of "longjmp"
> object that could be invoked to cause a jump back to
> a specific place. That would help alleviate the problem
> that exceptions used for control flow can get caught by
> the wrong handler. Sometimes you really want something
> that's targeted to a specific handler, not just the next
> enclosing one of some type.

I imagine you mean something like this...

try:
    for ....:
        try:
            dosomething()
        except Exception:
            ...
except FlowException1:
    ...

And the answer I've always heard is:

try:
    for ....:
        try:
            dosomething()
        except FlowException1:
            raise
        except Exception:
            ...
except FlowException1:
    ...

That really only works if you have control over the entire stack of
possible exception handlers, but it is also really the only way it makes
sense, unless I'm misunderstanding what you are asking for.
If I am misunderstanding, please provide some sample code showing what
needs to be done now, and what you would like to be possible.

- Josiah

From paul at prescod.net  Sun Sep 10 20:09:39 2006
From: paul at prescod.net (Paul Prescod)
Date: Sun, 10 Sep 2006 11:09:39 -0700
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <20060910103500.GA13412@phd.pp.ru>
References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>
	<20060910103500.GA13412@phd.pp.ru>
Message-ID: <1cb725390609101109q52578a84o5382fd4e97e40ba0@mail.gmail.com>

Suggestion accepted.

On 9/10/06, Oleg Broytmann wrote:
>
> On Sat, Sep 09, 2006 at 08:29:05PM -0700, Paul Prescod wrote:
> > "the protocol header says that this data is latin-1").
>
>    "Protocol metadata" if you allow me to suggest the word.
>
> Oleg.
> -- 
> Oleg Broytmann http://phd.pp.ru/ phd at phd.pp.ru
> Programmers don't die, they just GOSUB without RETURN.
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.python.org/pipermail/python-3000/attachments/20060910/b9d2bca6/attachment.htm

From paul at prescod.net  Sun Sep 10 20:14:07 2006
From: paul at prescod.net (Paul Prescod)
Date: Sun, 10 Sep 2006 11:14:07 -0700
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <1157886177.4246.59.camel@fsol>
References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>
	<1157886177.4246.59.camel@fsol>
Message-ID: <1cb725390609101114i60f2c8bei5c67885dff7b3f0a@mail.gmail.com>

I based the naming on the current setdefaultencoding. But it seems that
we will accumulate 3 or 4 related functions so I'm persuaded that there
should be a module.

encodingdetection.setdefaultfileencoding
encodingdetection.registerencodingdetector
encodingdetection.guessfileencoding(filename)
encodingdetection.guessfileencoding(bytestream)

Suggestion accepted.

On 9/10/06, Antoine Pitrou wrote:
>
> > The Site Decoding Hook
> > ========================
> >
> > The "sys" module could have a function called
> > "setdefaultfileencoding". The encoding specified could be a true
> > encoding name or one of the encoding detection scheme names (e.g.
> > "guess" or "XML").
>
> Isn't it more intuitive to gather functions based on what their
> high-level purpose is ("text" or "textfile") than implementation details
> of where the information comes from ("sys", "locale")?
>
> That function could be "textfile.set_default_encoding" (with
> underscores), or even "text.textfile.set_default_encoding" (if all this
> resides in a "text" module).
>
> Regards
>
> Antoine.
>
>
> _______________________________________________
> Python-3000 mailing list
> Python-3000 at python.org
> http://mail.python.org/mailman/listinfo/python-3000
> Unsubscribe:
> http://mail.python.org/mailman/options/python-3000/paul%40prescod.net
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.python.org/pipermail/python-3000/attachments/20060910/27244e91/attachment.html

From jcarlson at uci.edu  Sun Sep 10 20:25:43 2006
From: jcarlson at uci.edu (Josiah Carlson)
Date: Sun, 10 Sep 2006 11:25:43 -0700
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <450418AC.2010400@blueyonder.co.uk>
References: <1157892435.4246.107.camel@fsol> <450418AC.2010400@blueyonder.co.uk>
Message-ID: <20060910111814.F8ED.JCARLSON@uci.edu>

David Hopwood wrote:
> Here is a very simple, reasonably (although not completely) safe, and much
> more predictable guessing algorithm, based on a generalization of
> :
>
> Let A, B, C, and D be the first 4 bytes of the stream, or None if the
> corresponding byte is past end-of-stream.
>
> Let other be any encoding which is to be used as a default if no specific
> UTF is detected.
>
> if A == 0xEF and B == 0xBB and C == 0xBF: return UTF8
> if B == None: return other
> if A == 0 and B == 0 and D != None: return UTF32BE
> if C == 0 and D == 0: return UTF32LE
> if A == 0xFE and B == 0xFF: return UTF16BE
> if A == 0xFF and B == 0xFE: return UTF16LE
> if A != 0 and B != 0: return other
> if A == 0: return UTF16BE
> return UTF16LE
>
> This would normally be used with 'other' as the system encoding, as an alternative
> to just assuming that the file is in the system encoding.

Using the XML guessing mechanism is fine, as long as you get it right.
A first pass with BOM detection and a second pass to "guess" based on
content in the case that a BOM isn't detected seems to make sense.

Note that the above algorithm returns UTF32BE for files beginning with
4 null bytes.

- Josiah

From paul at prescod.net  Sun Sep 10 20:25:07 2006
From: paul at prescod.net (Paul Prescod)
Date: Sun, 10 Sep 2006 11:25:07 -0700
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <4503FF8F.6070801@gmail.com>
References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>
	<4503FF8F.6070801@gmail.com>
Message-ID: <1cb725390609101125q26fba051ya086e5ed005e08c5@mail.gmail.com>

On 9/10/06, Nick Coghlan wrote:
>
> Paul Prescod wrote:
> > The function to open a text file will tentatively be called textfile(),
> > though the function name is not an integral part of this PEP. The
> > function takes three arguments, the filename, the mode ("r", "w", "r+",
> > etc.) and the type.
> >
> > The type could be a true encoding or one of a small set of additional
> > symbolic values.
>
> The 'additional symbolic values' should be implemented as true encodings
> (i.e., it should be possible to look up 'site', 'guess' and 'locale' in
> the
> codecs registry, and replace them there as well).

I don't believe that these are "true" encodings because when you query a
stream for its encoding you will never find these names nor an alias for
them.

I also agree with Guido that the right spelling for the factory function
> is to
> incorporate this into the existing open() builtin. The signature of open()
> is
> already going to change to accept an encoding argument in Py3k, and the
> special encodings proposed in the PEP are just that: special encodings
> that
> happen to take environmental information into account when deciding how to
> decode or encode data.

Yes, well I disagree that the open function should get a new argument. I
think it should either be deprecated or used to open byte streams. The
function name is a holdover from Unix/C which has no resonance with a
Java, C#, or JavaScript programmer.
Plus I would like to ease the writing of code that is both valid Python
2.x and 3.x. I'd advocate the strategy that we should try to have a
large enough behavioural overlap that modules can be written to run on
both. Subtle changes in semantics make this difficult. To the extent
that this is unavoidable (e.g. behaviour of very core syntax) I guess
we'll have to live with it. But we can easily add a function called
textfile() to both Python 2.x and Python 3.x and ease the transition.

Paul Prescod
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.python.org/pipermail/python-3000/attachments/20060910/645b0c28/attachment.htm

From paul at prescod.net  Sun Sep 10 20:30:14 2006
From: paul at prescod.net (Paul Prescod)
Date: Sun, 10 Sep 2006 11:30:14 -0700
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <1157892435.4246.107.camel@fsol>
References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>
	<1157884285.4246.41.camel@fsol> <4503FDC8.2030608@gmail.com>
	<1157892435.4246.107.camel@fsol>
Message-ID: <1cb725390609101130w73f7a12bs243d8d6548b7b2d8@mail.gmail.com>

I don't mind your name of autotextfile but I think that your by_content
argument defeats the goal of having a very simple API for quick and
dirty stuff. If content detection is a good idea (usually right) then we
should do it. If it isn't, we shouldn't. I don't see a need for an
argument to turn it on and off. The programmer is not likely to have a
lot more understanding than we do of whether it is effective or not.

Also, there are two different levels of content detection (as someone
later in the thread pointed out). There is looking at BOMs, and there is
a statistical approach of looking for high characters and inferring
their encoding. I can't see an argument for ever turning off the BOM
detection.

Paul Prescod
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.python.org/pipermail/python-3000/attachments/20060910/40d9ff83/attachment.html

From paul at prescod.net  Sun Sep 10 21:02:44 2006
From: paul at prescod.net (Paul Prescod)
Date: Sun, 10 Sep 2006 12:02:44 -0700
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <450418AC.2010400@blueyonder.co.uk>
References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>
	<1157884285.4246.41.camel@fsol> <4503FDC8.2030608@gmail.com>
	<1157892435.4246.107.camel@fsol> <450418AC.2010400@blueyonder.co.uk>
Message-ID: <1cb725390609101202x644fe40fm1e234b564890c8d5@mail.gmail.com>

On 9/10/06, David Hopwood wrote:
>
> Here is a very simple, reasonably (although not completely) safe, and much
> more predictable guessing algorithm, based on a generalization of
> :

Your algorithm is more predictable but will confuse BOM-less UTF-8 with
the system encoding frequently. I haven't decided in my own mind whether
that trade-off is worth making.

It will work well for:

* Windows users, who will often find a BOM in their UTF-8

* Western Unix/Linux users who will increasingly use UTF-8 as their
system encoding

It will not work well for:

* Eastern Unix/Linux users using UTF-8 apps like gedit or apps "saving
as" UTF-8

* Mac users using UTF-8 apps or saving as UTF-8.

I still haven't decided how I feel about that trade-off. Maybe the
guessing algorithm should read the WHOLE FILE. After all, we've said
repeatedly that it isn't for production use so making it a bit
inefficient is not a big problem and might even emphasize that point.

Modern I/O is astonishingly fast anyhow.
On my computer it takes five seconds to decode a quarter gigabyte of UTF-8 text through Python. That would be a totally unacceptable waste for a production program, but for a quick hack it wouldn't be bad. And it would guarantee that you would never get an exception half-way through your parsing because of a bad character. Paul Prescod -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.python.org/pipermail/python-3000/attachments/20060910/0c19d780/attachment.html From solipsis at pitrou.net Sun Sep 10 21:36:48 2006 From: solipsis at pitrou.net (Antoine Pitrou) Date: Sun, 10 Sep 2006 21:36:48 +0200 Subject: [Python-3000] Pre-PEP: Easy Text File Decoding In-Reply-To: <1cb725390609101202x644fe40fm1e234b564890c8d5@mail.gmail.com> References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com> <1157884285.4246.41.camel@fsol> <4503FDC8.2030608@gmail.com> <1157892435.4246.107.camel@fsol> <450418AC.2010400@blueyonder.co.uk> <1cb725390609101202x644fe40fm1e234b564890c8d5@mail.gmail.com> Message-ID: <1157917008.4257.8.camel@fsol> Le dimanche 10 septembre 2006 à 12:02 -0700, Paul Prescod a écrit : > Your algorithm is more predictable but will confuse BOM-less UTF-8 > with the system encoding frequently. I don't think it is desirable to acknowledge only some kinds of UTF-8. It will confuse the hell out of programmers, and users. I'm not sure full-blown statistical analysis is necessary anyway. There should be an ordered list of detectable encodings, which realistically would be [all Unicode variants, system default]. Then if you have a file which is syntactically valid UTF-8, it most likely /is/ UTF-8 and not ISO-8859-1 (for example). > Modern I/O is astonishingly fast anyhow. On my computer it takes five > seconds to decode a quarter gigabyte of UTF-8 text through Python. Maybe we shouldn't be that presumptuous. Modern I/O is fast but memory is not infinite. That quarter gigabyte will have swapped out other data/code in order to make room in the filesystem cache. Also, Python is often used on more modest hardware. Regards Antoine. From tjd at sfu.ca Sun Sep 10 21:46:46 2006 From: tjd at sfu.ca (Toby Donaldson) Date: Sun, 10 Sep 2006 12:46:46 -0700 Subject: [Python-3000] educational aspects of Python 3000 Message-ID: Hello, There's been an explosion of discussion on the EDU-SIG list recently about the removal of raw_input and input from Python 3000. For teaching purposes, many educators report that they like raw_input (and input). The basic argument is that, for beginners, code like name = raw_input('Morbo demands your name! ') is clearer and easier than using sys.stdin.readline(). Some fear that a big mistake is being made here. Others just fear getting bogged down in EDU-SIG discussions. :-) Any suggestions for how educators interested in the educational/learning aspects of Python 3000 could more fruitfully participate? For instance, would there be interest in the inclusion of a standard educational library, a la the Java ACM library (http://www-cs-faculty.stanford.edu/~eroberts//jtf/index.html)? Toby -- Dr. 
Toby Donaldson School of Computing Science Simon Fraser University (Surrey) From solipsis at pitrou.net Sun Sep 10 21:57:56 2006 From: solipsis at pitrou.net (Antoine Pitrou) Date: Sun, 10 Sep 2006 21:57:56 +0200 Subject: [Python-3000] content-based detection In-Reply-To: <1cb725390609101130w73f7a12bs243d8d6548b7b2d8@mail.gmail.com> References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com> <1157884285.4246.41.camel@fsol> <4503FDC8.2030608@gmail.com> <1157892435.4246.107.camel@fsol> <1cb725390609101130w73f7a12bs243d8d6548b7b2d8@mail.gmail.com> Message-ID: <1157918276.4257.30.camel@fsol> Le dimanche 10 septembre 2006 à 11:30 -0700, Paul Prescod a écrit : > I don't mind your name of autotextfile but I think that your > by_content argument defeats the goal of having a very simple API for > quick and dirty stuff. If content detection is a good idea (usually > right) then we should do it. Using the system or locale default is trustworthy and reproducible. Content-based detection is wilder, especially if the algorithm isn't fully refined in the first Py3k releases. > I can't see an argument for ever turning off the BOM detection. Perhaps, but having a subset of it still running behind your back while you disabled it is misleading. Also, I think having BOM detection as the only test in content-based detection would be uninteresting. The common use case for encoding detection is to guess between one of the Unicode variants (mostly UTF-8 *with or without BOM*) and the non-Unicode encoding which is popular for a given language (e.g. ISO-8859-15). I doubt many people have to discriminate between UTF-16LE, UCS-4 and UTF-8. Are there real cases like that for text files? Regards Antoine. From david.nospam.hopwood at blueyonder.co.uk Sun Sep 10 22:01:10 2006 From: david.nospam.hopwood at blueyonder.co.uk (David Hopwood) Date: Sun, 10 Sep 2006 21:01:10 +0100 Subject: [Python-3000] Pre-PEP: Easy Text File Decoding In-Reply-To: <20060910111814.F8ED.JCARLSON@uci.edu> References: <1157892435.4246.107.camel@fsol> <450418AC.2010400@blueyonder.co.uk> <20060910111814.F8ED.JCARLSON@uci.edu> Message-ID: <45046F06.5090502@blueyonder.co.uk> Josiah Carlson wrote: > David Hopwood wrote: > >>Here is a very simple, reasonably (although not completely) safe, and much >>more predictable guessing algorithm, based on a generalization of >>: >> >> Let A, B, C, and D be the first 4 bytes of the stream, or None if the >> corresponding byte is past end-of-stream. >> >> Let other be any encoding which is to be used as a default if no specific >> UTF is detected. >> >> if A == 0xEF and B == 0xBB and C == 0xBF: return UTF8 >> if B == None: return other >> if A == 0 and B == 0 and D != None: return UTF32BE >> if C == 0 and D == 0: return UTF32LE >> if A == 0xFE and B == 0xFF: return UTF16BE >> if A == 0xFF and B == 0xFE: return UTF16LE >> if A != 0 and B != 0: return other >> if A == 0: return UTF16BE >> return UTF16LE >> >>This would normally be used with 'other' as the system encoding, as an alternative >>to just assuming that the file is in the system encoding. > > Using the xml guessing mechanism is fine, as long as you get it right. > A first pass with BOM detection and a second pass to "guess" based on > content in the case that a BOM isn't detected seems to make sense. ... if you think that guessing based on content is a good idea -- I don't. In any case, such guessing necessarily depends on the expected file format, so it should be done by the application itself, or by a library that knows more about the format. 
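For concreteness, the pseudocode quoted above transcribes almost line for line into Python. A sketch (the function name and the byte handling are mine; the return values are Hopwood's labels, not registered codec names):

    def sniff_signature(head, other):
        # 'head' holds the first bytes of the stream; 'other' is the
        # fallback encoding. Bytes past end-of-stream become None,
        # exactly as in the pseudocode.
        first = [ord(c) for c in head[:4]]
        A, B, C, D = first + [None] * (4 - len(first))
        if A == 0xEF and B == 0xBB and C == 0xBF: return 'UTF8'
        if B is None: return other
        if A == 0 and B == 0 and D is not None: return 'UTF32BE'
        if C == 0 and D == 0: return 'UTF32LE'
        if A == 0xFE and B == 0xFF: return 'UTF16BE'
        if A == 0xFF and B == 0xFE: return 'UTF16LE'
        if A != 0 and B != 0: return other
        if A == 0: return 'UTF16BE'
        return 'UTF16LE'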
If the encoding of a text stream were settable after it had been opened, then it would be easy for anyone to implement whatever guessing algorithm they needed, without having to write an encoding implementation or include any other support for guessing in the I/O library itself. (This also requires the ability to seek back to the beginning of the stream after reading the data needed for the guess.) > Note that the above algorithm returns UTF32BE for files beginning with > 4 null bytes. Yes. But such a thing probably isn't a text file at all -- in which case there will be subsequent decoding errors when most of the code units are not in the range 0 to 0x10FFFF. -- David Hopwood From david.nospam.hopwood at blueyonder.co.uk Sun Sep 10 23:12:34 2006 From: david.nospam.hopwood at blueyonder.co.uk (David Hopwood) Date: Sun, 10 Sep 2006 22:12:34 +0100 Subject: [Python-3000] Pre-PEP: Easy Text File Decoding In-Reply-To: <1cb725390609101202x644fe40fm1e234b564890c8d5@mail.gmail.com> References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com> <1157884285.4246.41.camel@fsol> <4503FDC8.2030608@gmail.com> <1157892435.4246.107.camel@fsol> <450418AC.2010400@blueyonder.co.uk> <1cb725390609101202x644fe40fm1e234b564890c8d5@mail.gmail.com> Message-ID: <45047FC2.40904@blueyonder.co.uk> Paul Prescod wrote: > Maybe the guessing algorithm should read the WHOLE FILE. That wouldn't work for streams (e.g. stdin). The algorithm I gave does work for streams, provided that they have a push-back buffer of at least 4 bytes. -- David Hopwood From jcarlson at uci.edu Sun Sep 10 23:47:13 2006 From: jcarlson at uci.edu (Josiah Carlson) Date: Sun, 10 Sep 2006 14:47:13 -0700 Subject: [Python-3000] Pre-PEP: Easy Text File Decoding In-Reply-To: <45046F06.5090502@blueyonder.co.uk> References: <20060910111814.F8ED.JCARLSON@uci.edu> <45046F06.5090502@blueyonder.co.uk> Message-ID: <20060910143817.F8F9.JCARLSON@uci.edu> David Hopwood wrote: > Josiah Carlson wrote: [snip] > > Using the xml guessing mechanism is fine, as long as you get it right. > > A first pass with BOM detection and a second pass to "guess" based on > > content in the case that a BOM isn't detected seems to make sense. > > ... if you think that guessing based on content is a good idea -- I don't. > In any case, such guessing necessarily depends on the expected file format, > so it should be done by the application itself, or by a library that knows > more about the format. I'm keeping my hat out of the ring for whether guessing is a good idea. However, if one is going to have a guessing mechanism, starting with UTF BOMs is a good start, which is what I was trying to say. > If the encoding of a text stream were settable after it had been opened, > then it would be easy for anyone to implement whatever guessing algorithm > they needed, without having to write an encoding implementation or include > any other support for guessing in the I/O library itself. That is true. But the fact that you, presumably an experienced programmer with regard to Unicode, have provided an algorithm with an obvious hole that I was able to discover in a few moments suggests that guessing algorithms are not easy to write. > (This also requires the ability to seek back to the beginning of the stream > after reading the data needed for the guess.) > > > Note that the above algorithm returns UTF32BE for files beginning with > > 4 null bytes. > > Yes. 
But such a thing probably isn't a text file at all -- in which case > there will be subsequent decoding errors when most of the code units are > not in the range 0 to 0x10FFFF. A file starting with 4 nulls almost certainly implies a non-text file of some kind, but presuming that "most" code points would not be in the 0...0x10ffff range is a bit of an assumption about the content of a file. I thought you didn't want to guess. - Josiah From paul at prescod.net Mon Sep 11 06:09:47 2006 From: paul at prescod.net (Paul Prescod) Date: Sun, 10 Sep 2006 21:09:47 -0700 Subject: [Python-3000] Pre-PEP: Easy Text File Decoding In-Reply-To: <1157917008.4257.8.camel@fsol> References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com> <1157884285.4246.41.camel@fsol> <4503FDC8.2030608@gmail.com> <1157892435.4246.107.camel@fsol> <450418AC.2010400@blueyonder.co.uk> <1cb725390609101202x644fe40fm1e234b564890c8d5@mail.gmail.com> <1157917008.4257.8.camel@fsol> Message-ID: <1cb725390609102109j6e4c1e7bk18087b5319928abf@mail.gmail.com> On 9/10/06, Antoine Pitrou wrote: > > ... > > Modern I/O is astonishingly fast anyhow. On my computer it takes five > > seconds to decode a quarter gigabyte of UTF-8 text through Python. > > Maybe we shouldn't be that presumptuous. Modern I/O is fast but memory > is not infinite. That quarter gigabyte will have swapped out other > data/code in order to make room in the filesystem cache. Not really. It works in 16k chunks. > Also, Python is often used on more modest hardware. People writing programs to deal with vast amounts of data on modest computers are trying to do something advanced and should not use the quick and dirty guessing algorithms. We're not trying to hit 100% of programmers and situations. Not even close. The PEP was very explicit about that fact. Paul Prescod From paul at prescod.net Mon Sep 11 06:11:11 2006 From: paul at prescod.net (Paul Prescod) Date: Sun, 10 Sep 2006 21:11:11 -0700 Subject: [Python-3000] Pre-PEP: Easy Text File Decoding In-Reply-To: <45047FC2.40904@blueyonder.co.uk> References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com> <1157884285.4246.41.camel@fsol> <4503FDC8.2030608@gmail.com> <1157892435.4246.107.camel@fsol> <450418AC.2010400@blueyonder.co.uk> <1cb725390609101202x644fe40fm1e234b564890c8d5@mail.gmail.com> <45047FC2.40904@blueyonder.co.uk> Message-ID: <1cb725390609102111u44287761i8b509729aa6f5ce1@mail.gmail.com> The PEP doesn't deal with streams. It is about files. On 9/10/06, David Hopwood wrote: > Paul Prescod wrote: > > Maybe the guessing algorithm should read the WHOLE FILE. > > That wouldn't work for streams (e.g. stdin). The algorithm I gave > does work for streams, provided that they have a push-back buffer of > at least 4 bytes. 
> > -- > David Hopwood From paul at prescod.net Mon Sep 11 06:31:00 2006 From: paul at prescod.net (Paul Prescod) Date: Sun, 10 Sep 2006 21:31:00 -0700 Subject: [Python-3000] Pre-PEP: Easy Text File Decoding In-Reply-To: <45046F06.5090502@blueyonder.co.uk> References: <1157892435.4246.107.camel@fsol> <450418AC.2010400@blueyonder.co.uk> <20060910111814.F8ED.JCARLSON@uci.edu> <45046F06.5090502@blueyonder.co.uk> Message-ID: <1cb725390609102131h59b75866ha66395bf55181da@mail.gmail.com> On 9/10/06, David Hopwood wrote: > Josiah Carlson wrote: > ... if you think that guessing based on content is a good idea -- I don't. > In any case, such guessing necessarily depends on the expected file format, > so it should be done by the application itself, or by a library that knows > more about the format. I disagree. If a non-trivial file can be decoded as a UTF-* encoding it probably is that encoding. I don't see how it matters whether the file represents Latex or an .htaccess file. XML is a special case because it is specially designed to make encoding detection (not guessing, but detection) easy. > If the encoding of a text stream were settable after it had been opened, > then it would be easy for anyone to implement whatever guessing algorithm > they needed, without having to write an encoding implementation or include > any other support for guessing in the I/O library itself. But this defeats the whole purpose of the PEP which is to accelerate the writing of quick and dirty text processing scripts. Paul Prescod From paul at prescod.net Mon Sep 11 06:42:01 2006 From: paul at prescod.net (Paul Prescod) Date: Sun, 10 Sep 2006 21:42:01 -0700 Subject: [Python-3000] Pre-PEP: Easy Text File Decoding In-Reply-To: <87d5a366xd.fsf@qrnik.zagroda> References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com> <87d5a366xd.fsf@qrnik.zagroda> Message-ID: <1cb725390609102142m3a1dd33ha8400ec8e2005ba@mail.gmail.com> On 9/10/06, Marcin 'Qrczak' Kowalczyk wrote: >... > Other than that, guessing the encoding from the contents of the text > stream, especially statistical guessing based on well-formed UTF-8 > non-ASCII characters, shouldn't be encouraged, because its effect is > not predictable. My thinking has evolved. The "guess" mode should "virtually" try different decodings until one succeeds. In the worst case this might involve decoding the whole file twice (once for detection and once for application processing). In general, your proposal is too far from the goals that were given to me by Guido for me to really evaluate it as an alternative. Guido's goal was that quick and dirty text processing should "just work" for newbies and encoding-disinterested expert programmers. I don't think that your proposal achieves that. Paul Prescod From jeff at soft.fujitsu.com Mon Sep 11 06:54:03 2006 From: jeff at soft.fujitsu.com (Jeff Wilcox) Date: Mon, 11 Sep 2006 13:54:03 +0900 Subject: [Python-3000] Help on text editors In-Reply-To: <1cb725390609091058j49ffcdc6h61ce7eb80700f011@mail.gmail.com> Message-ID: > Great: but what is the default Textedit encoding on a Japanized version of the Mac? > Paul Prescod I'm fairly sure that the settings on the computer I looked at this on are default, but I borrowed the machine so I can't guarantee it. 
In textpad with OS X set to Japanese there were three choices of encoding: EUC-JP, ISO-2022-JP and Shift_JIS. The dropdown defaulted to Shift_JIS. The (reversible) procedure that I used to change the language back and forth is: System Preferences > International > Language Drag the language you wish to use to the top of the list. Log out, then back in again and it should be in the language you chose. If only one language is listed, then the language pack(s) are most likely not installed. They can be installed from the original OS X install CD/DVD. From walter at livinglogic.de Mon Sep 11 12:00:38 2006 From: walter at livinglogic.de (Walter =?iso-8859-1?Q?D=F6rwald?=) Date: Mon, 11 Sep 2006 12:00:38 +0200 (CEST) Subject: [Python-3000] Pre-PEP: Easy Text File Decoding In-Reply-To: <1cb725390609101114i60f2c8bei5c67885dff7b3f0a@mail.gmail.com> References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com> <1157886177.4246.59.camel@fsol> <1cb725390609101114i60f2c8bei5c67885dff7b3f0a@mail.gmail.com> Message-ID: <61353.89.54.51.133.1157968838.squirrel@isar.livinglogic.de> Paul Prescod wrote: > I went based on the current setdefaultencoding. But it seems that we will > accumulate 3 or 4 related functions so I'm persuaded that there should be a > module. > > encodingdetection.setdefaultfileencoding > encodingdetection.registerencodingdetector > encodingdetection.guessfileencoding(filename) > encodingdetection.guessfileencoding(bytestream) > > Suggestion accepted. There's no need for implementing a separate infrastructure for encoding detection. This can be implemented as a "meta codec". See http://styx.livinglogic.de/~walter/xml_codec/xml_codec.py for a codec that autodetects the XML encoding. Servus, Walter From phd at phd.pp.ru Mon Sep 11 12:30:31 2006 From: phd at phd.pp.ru (Oleg Broytmann) Date: Mon, 11 Sep 2006 14:30:31 +0400 Subject: [Python-3000] Pre-PEP: Easy Text File Decoding In-Reply-To: <1cb725390609101202x644fe40fm1e234b564890c8d5@mail.gmail.com> References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com> <1157884285.4246.41.camel@fsol> <4503FDC8.2030608@gmail.com> <1157892435.4246.107.camel@fsol> <450418AC.2010400@blueyonder.co.uk> <1cb725390609101202x644fe40fm1e234b564890c8d5@mail.gmail.com> Message-ID: <20060911103031.GD29600@phd.pp.ru> On Sun, Sep 10, 2006 at 12:02:44PM -0700, Paul Prescod wrote: > * Eastern Unix/Linux users using UTF-8 apps like gedit or apps "saving as" > UTF-8 Finally I've got the definitive answer for "is Russia Europe or Asia?" It is an Eastern country! At last! ;) > Maybe the guessing algorithm should read the WHOLE FILE. Zen: "In the face of ambiguity, refuse the temptation to guess." Unfortunately this contradicts not only the idea of how much to read but the whole idea of guessing the encoding. So maybe we are going in the wrong direction. IMHO the right direction is to include a guessing script in the Tools directory. Oleg. -- Oleg Broytmann http://phd.pp.ru/ phd at phd.pp.ru Programmers don't die, they just GOSUB without RETURN. 
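To make the competing proposals in this thread concrete: the standalone guessing helper Oleg suggests could be prototyped in about a dozen lines, combining BOM detection with the whole-file trial decode discussed earlier. A sketch only (UTF-32 is ignored for brevity, and the fallback encoding is a placeholder):

    import codecs

    _BOMS = [(codecs.BOM_UTF8, 'utf-8-sig'),
             (codecs.BOM_UTF16_BE, 'utf-16-be'),
             (codecs.BOM_UTF16_LE, 'utf-16-le')]

    def guess_encoding(filename, fallback='iso-8859-1'):
        # BOM detection first; then decode the WHOLE FILE as UTF-8
        # (cheap enough for quick-and-dirty scripts); finally fall
        # back to an 8-bit encoding that cannot fail to decode.
        data = open(filename, 'rb').read()
        for bom, name in _BOMS:
            if data.startswith(bom):
                return name
        try:
            data.decode('utf-8')
            return 'utf-8'
        except UnicodeDecodeError:
            return fallback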
From qrczak at knm.org.pl Mon Sep 11 12:38:49 2006 From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk) Date: Mon, 11 Sep 2006 12:38:49 +0200 Subject: [Python-3000] Pre-PEP: Easy Text File Decoding In-Reply-To: <1cb725390609102142m3a1dd33ha8400ec8e2005ba@mail.gmail.com> (Paul Prescod's message of "Sun, 10 Sep 2006 21:42:01 -0700") References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com> <87d5a366xd.fsf@qrnik.zagroda> <1cb725390609102142m3a1dd33ha8400ec8e2005ba@mail.gmail.com> Message-ID: <8764fu3cxy.fsf@qrnik.zagroda> "Paul Prescod" writes: > Guido's goal was that quick and dirty text processing should "just > work" for newbies and encoding-disinterested expert programmers. What does 'guess' mean for creating files? Consider a program which reads one file and writes data extracted from it (e.g. with lines beginning with '#' removed) to another file. With my proposal it will work if the encoding of the file is the same as the locale encoding (or if they can be harmlessly confused). It will just work most of the time. It will not work in general if the encodings are different. In this case the user of the script can override the encoding assumption by temporarily changing the locale or by changing an environment variable. OTOH when the encoding is guessed from file contents, what happens depends on how it's designed. If the locale is ISO-8859-x: 1. Files are created in the locale encoding. Then some UTF-8 files will be silently recoded to a different encoding, and for other UTF-8 files writing will fail (if they contain characters not expressible in the locale encoding). 2. Files are created in UTF-8. Then files encoded with the locale encoding will be silently recoded to UTF-8, causing trouble for further work with the file (it can't even be typed to the terminal). If the locale is UTF-8, but the reader assumes e.g. ISO-8859-1 when it can't decode as UTF-8, there will be a silent recoding for these files. If the file is in fact encoded in ISO-8859-2, the result will be nonsensical: looking like UTF-8 but with characters substituted according to ISO-8859-2/1 differences. In either case it's not clear what the user of the script can do to preserve the encoding in the output file. I claim that in my design the result is more easily predictable and easier to fix when it goes wrong. I've implemented a hack which allows simple programs to "just work" in case of UTF-8. It's a modified encoder/decoder which escapes malformed UTF-8 sequences with '\0' bytes, and thus allows arbitrary byte sequences to round-trip UTF-8 decoding and encoding. It's not used by default and it's never used when "UTF-8" is specified explicitly, because it's not the true UTF-8, but I have an environment variable which says "if the locale is UTF-8, use the modified UTF-8 as the default encoding". -- __("< Marcin Kowalczyk \__/ qrczak at knm.org.pl ^^ http://qrnik.knm.org.pl/~qrczak/ From barry at python.org Mon Sep 11 13:35:51 2006 From: barry at python.org (Barry Warsaw) Date: Mon, 11 Sep 2006 07:35:51 -0400 Subject: [Python-3000] iostack, second revision In-Reply-To: <45045105.5040209@jmunch.dk> References: <1d85506f0609071030o60a02ac5j739027d1825f0e7f@mail.gmail.com> <4501A1B1.5050707@gmail.com> <200609081403.18350.fdrake@acm.org> <1157740873.4979.10.camel@fsol> <450254DB.3020502@gmail.com> <45045105.5040209@jmunch.dk> Message-ID: <0D784B1A-DB20-4D27-A11E-4AED4B76152B@python.org> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On Sep 10, 2006, at 1:53 PM, Anders J. 
Munch wrote: > I say drop seek_cur and seek_end altogether, and keep only absolute > seek. I was just looking through some of our elf/dwarf parsing code and we use seek_cur quite a bit. Not that it couldn't be rewritten to use absolute seek, but it's also not the most natural interface. I'd opt for keeping those interfaces for binary files since there are use-cases where they are useful. - -Barry -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.5 (Darwin) iQCVAwUBRQVKGHEjvBPtnXfVAQL0GwP+KG8NflbbSoUxHLIkCyMd+NFj2fR1GAU5 dfu7cIc/oJpx25VxqgcDqM3IdKqp5CyJLG7AjtPXm8SuWGba3YmunHAcvnPPmP6Z qdxAI8KD+Sf/imEuB7te29AUGlFteh+6IGKJKBMjxiXSjjqw2lwhDQphyhVPKuHp 3j+oly6uZ8E= =/1N6 -----END PGP SIGNATURE----- From paul at prescod.net Mon Sep 11 15:58:42 2006 From: paul at prescod.net (Paul Prescod) Date: Mon, 11 Sep 2006 06:58:42 -0700 Subject: [Python-3000] Pre-PEP: Easy Text File Decoding In-Reply-To: <20060911103031.GD29600@phd.pp.ru> References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com> <1157884285.4246.41.camel@fsol> <4503FDC8.2030608@gmail.com> <1157892435.4246.107.camel@fsol> <450418AC.2010400@blueyonder.co.uk> <1cb725390609101202x644fe40fm1e234b564890c8d5@mail.gmail.com> <20060911103031.GD29600@phd.pp.ru> Message-ID: <1cb725390609110658t40cbafc8q66f93b1b03b7eabc@mail.gmail.com> On 9/11/06, Oleg Broytmann wrote: > > On Sun, Sep 10, 2006 at 12:02:44PM -0700, Paul Prescod wrote: > > * Eastern Unix/Linux users using UTF-8 apps like gedit or apps "saving > as" > > UTF-8 > > Finally I've got the definitive answer for "is Russia Europe or Asia?" > It is an Eastern country! At last! ;) For these purposes, Russia is European, isn't it? Russian text can be subsumed by UTF-8 with relatively minor expansion, right? If so, then I would guess that UTF-8 would replace KOI8-R and iso8859-? for Russian eventually. > Maybe the guessing algorithm should read the WHOLE FILE. > > Zen: "In the face of ambiguity, refuse the temptation to guess." > > Unfortunately this contradicts not only the idea of how much to read > but the whole idea of guessing the encoding. So maybe we are going in the > wrong direction. IMHO the right direction is to include a guessing script > in the Tools directory. That was the position I started with. Guido wanted a guessing mode. So I designed what seemed to me to be the least dangerous guessing mode possible: 1. Off by default. 2. Turned on by the keyword "guess". 3. Decodes the full text to check for encoding correctness. Given these safeguards, I think that the feature is not only safe enough but also helpful. Moving it to a script would not meet the central goal that it be easily usable by people who do not know much about encodings or Python. Paul Prescod -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://mail.python.org/pipermail/python-3000/attachments/20060911/9b38a31b/attachment.html From paul at prescod.net Mon Sep 11 16:15:07 2006 From: paul at prescod.net (Paul Prescod) Date: Mon, 11 Sep 2006 07:15:07 -0700 Subject: [Python-3000] Pre-PEP: Easy Text File Decoding In-Reply-To: <8764fu3cxy.fsf@qrnik.zagroda> References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com> <87d5a366xd.fsf@qrnik.zagroda> <1cb725390609102142m3a1dd33ha8400ec8e2005ba@mail.gmail.com> <8764fu3cxy.fsf@qrnik.zagroda> Message-ID: <1cb725390609110715t4caca46bya5f9b2508d216c7@mail.gmail.com> On 9/11/06, Marcin 'Qrczak' Kowalczyk wrote: > > "Paul Prescod" writes: > > > Guido's goal was that quick and dirty text processing should "just > > work" for newbies and encoding-disinterested expert programmers. > > What does 'guess' mean for creating files? I wasn't sure about this one. But on Windows and Mac it seems safe to generate UTF-8-with-BOM. Textedit, VIM and notepad all auto-detect the UTF-8 BOM and do the right thing. 2. Files are created in UTF-8. > > Then files encoded with the locale encoding will be silently > recoded to UTF-8, causing trouble for further work with the file > (it can't even be typed to the terminal). It can on the terminal on the Mac. And on the increasing number of UTF-8 defaulted Linux distributions. Perhaps it should by default use the Unix locale for output, but only on Unix and not on Mac/Windows. I've implemented a hack which allows simple programs to "just work" in > case of UTF-8. It's a modified encoder/decoder which escapes malformed > UTF-8 sequences with '\0' bytes, and thus allows arbitrary byte > sequences to round-trip UTF-8 decoding and encoding. It's not used by > default and it's never used when "UTF-8" is specified explicitly, > because it's not the true UTF-8, but I have an environment variable > which says "if the locale is UTF-8, use the modified UTF-8 as the > default encoding". That's an interesting idea. I'm not sure if you are proposing it as being applicable to this PEP or not... Paul Prescod -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.python.org/pipermail/python-3000/attachments/20060911/f5e90075/attachment.htm From phd at phd.pp.ru Mon Sep 11 16:23:04 2006 From: phd at phd.pp.ru (Oleg Broytmann) Date: Mon, 11 Sep 2006 18:23:04 +0400 Subject: [Python-3000] Pre-PEP: Easy Text File Decoding In-Reply-To: <1cb725390609110658t40cbafc8q66f93b1b03b7eabc@mail.gmail.com> References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com> <1157884285.4246.41.camel@fsol> <4503FDC8.2030608@gmail.com> <1157892435.4246.107.camel@fsol> <450418AC.2010400@blueyonder.co.uk> <1cb725390609101202x644fe40fm1e234b564890c8d5@mail.gmail.com> <20060911103031.GD29600@phd.pp.ru> <1cb725390609110658t40cbafc8q66f93b1b03b7eabc@mail.gmail.com> Message-ID: <20060911142303.GA12119@phd.pp.ru> On Mon, Sep 11, 2006 at 06:58:42AM -0700, Paul Prescod wrote: > For these purposes, Russia is European, isn't it? If the test is "a BOM in UTF-8 text files on Unices" - then no. :) > Russian text can be subsumed by UTF-8 with relatively minor expansion, right? Sorry, what do you mean? That Russian encodings can be converted to UTF-8? Yes, they can. But the most popular encoding here is cp1251, not UTF-8. Even on Unices there are people who use cp1251 as their main encoding (locale, fonts, keyboard mapping) because they often switch between a number of platforms. 
> If so, then I > would guess that UTF-8 would replace KOI8-R and iso8859-? for Russian > eventually. On Unix? Probably yes, but not in the nearest future. There are some popular tools (for me the most notable is Midnight Commander) that still have problems with UTF-8 locales. > Given these safeguards, I think that the feature is not only safe enough but > also helpful. Ok then. Oleg. -- Oleg Broytmann http://phd.pp.ru/ phd at phd.pp.ru Programmers don't die, they just GOSUB without RETURN. From mcherm at mcherm.com Mon Sep 11 18:44:47 2006 From: mcherm at mcherm.com (Michael Chermside) Date: Mon, 11 Sep 2006 09:44:47 -0700 Subject: [Python-3000] Pre-PEP: Easy Text File Decoding Message-ID: <20060911094447.hjabms9i7sgs4g8s@login.werra.lunarpages.com> Paul Prescod writes: [... Pre-PEP proposal ...] Quick thoughts: * I like it. Good work. * I agree with Guido: "open" is the right spelling for this. * I agree with Paul: mandatory specification is the way to go. 10,000 different blog entries, tutorials, and cookbook recipes can recommend using "guess" if you don't know what you're doing. Or they can all recommend using "site". Or they can all recommend using "utf-8". I'm not sure what they'll all recommend, and that's enough reason for me to require the user to say. If we later decide that one is an acceptable default, we could make it optional in release 3.1, 3.2, or 3.3... but if we make it optional from the start then it can never become required. Other thoughts after reading everyone else's replies: * Guessing. Hmm. Broad topic. If the option for guessing were spelled "guess" (rather than, say "autodetect") then I would have been scared off from using it in "production code" but I would still feel free to use it in quick-and-dirty scripting. On the other hand, I'm not sure I'm a good "typical programmer". Fortunately, your PEP works fine whether or not "guess" is allowed, so I can support your PEP without having to commit on the idea of having a "guess" option. -- Michael Chermside From mcherm at mcherm.com Mon Sep 11 20:22:15 2006 From: mcherm at mcherm.com (Michael Chermside) Date: Mon, 11 Sep 2006 11:22:15 -0700 Subject: [Python-3000] educational aspects of Python 3000 Message-ID: <20060911112215.odruja6fdgl0kcg4@login.werra.lunarpages.com> Toby Donaldson writes: > Any suggestions for how educators interested in the > educational/learning aspects of Python 3000 could more fruitfully > participate? You're doing pretty well so far! Seriously... just speak up: Pythonistas (including, in particular, Guido) value the fact that Python is an excellent language for beginners, and we'll go out of our way to keep it so. But you might need to speak up. Elsewhere: > For teaching purposes, many educators report that they like raw_input > (and input). The basic argument is that, for beginners, code like > > name = raw_input('Morbo demands your name! ') > > is clearer and easier than using sys.stdin.readline(). [...] > For instance, would there be interest in the inclusion of a standard > educational library... Personally, I think input() should never have existed and must go no matter what. I think raw_input() is worth discussing -- I wouldn't need it, but it's little more than a convenience function. The idea of a standard edu library though is a GREAT one. 
That would provide a standard place for things like raw_input() (with a better name) as well as lots of other "helper functions" useful to beginners and/or students -- and all it would cost is a single line of boilerplate at the top of each program ("from beginnerlib import *" or something like that). I suspect that such a library would be enthusiastically welcomed into the Python core distribution *IF* there was clear consensus about what it should contain. So if the EDU-SIG could do the hard work of obtaining the consensus (and mark my words... it IS hard work), I think you'd be 90% of the way there. -- Michael Chermside From brett at python.org Mon Sep 11 20:26:51 2006 From: brett at python.org (Brett Cannon) Date: Mon, 11 Sep 2006 11:26:51 -0700 Subject: [Python-3000] educational aspects of Python 3000 In-Reply-To: <20060911112215.odruja6fdgl0kcg4@login.werra.lunarpages.com> References: <20060911112215.odruja6fdgl0kcg4@login.werra.lunarpages.com> Message-ID: On 9/11/06, Michael Chermside wrote: > > Toby Donaldson writes: > > Any suggestions for how educators interested in the > > educational/learning aspects of Python 3000 could more fruitfully > > participate? > > You're doing pretty well so far! Seriously... just speak up: Pythonistas > (including, in particular, Guido) value the fact that Python is an > excellent language for beginners, and we'll go out of our way to keep > it so. But you might need to speak up. > > Elsewhere: > > For teaching purposes, many educators report that they like raw_input > > (and input). The basic argument is that, for beginners, code like > > > > name = raw_input('Morbo demands your name! ') > > > > is clearer and easier than using sys.stdin.readline(). > [...] > > For instance, would there be interest in the inclusion of a standard > > educational library... > > Personally, I think input() should never have existed and must go > no matter what. Agreed. Teach the folks eval() quick if you want something like that. I think raw_input() is worth discussing -- I wouldn't > need it, but it's little more than a convenience function. Yeah, but when you are learning it's cool to take input easily. I loved raw_input() when I started out. The idea of a standard edu library though is a GREAT one. That would > provide a standard place for things like raw_input() (with a better > name) as well as lots of other "helper functions" useful to beginners > and/or students -- and all it would cost is a single line of boilerplate > at the top of each program ("from beginnerlib import *" or something > like that). > > I suspect that such a library would be enthusiastically welcomed into > the Python core distribution *IF* there was clear consensus about > what it should contain. So if the EDU-SIG could do the hard work of > obtaining the consensus (and mark my words... it IS hard work), I > think you'd be 90% of the way there. Yeah. Stuff that normally trips up beginners could be put in here with pointers to how to do it properly when they get more advanced. And making the name seem very newbie will (hopefully) discourage people from using it beyond their learning code. -Brett -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://mail.python.org/pipermail/python-3000/attachments/20060911/5656a63b/attachment.html From solipsis at pitrou.net Mon Sep 11 20:42:46 2006 From: solipsis at pitrou.net (Antoine Pitrou) Date: Mon, 11 Sep 2006 20:42:46 +0200 Subject: [Python-3000] educational aspects of Python 3000 In-Reply-To: <20060911112215.odruja6fdgl0kcg4@login.werra.lunarpages.com> References: <20060911112215.odruja6fdgl0kcg4@login.werra.lunarpages.com> Message-ID: <1158000166.4672.33.camel@fsol> Le lundi 11 septembre 2006 à 11:22 -0700, Michael Chermside a écrit : > The idea of a standard edu library though is a GREAT one. That would > provide a standard place for things like raw_input() (with a better > name) as well as lots of other "helper functions" useful to beginners > and/or students -- and all it would cost is a single line of boilerplate > at the top of each program ("from beginnerlib import *" or something > like that). There is a risk with a beginner-specific library: it's the same problem as with user interfaces which have "simple" and "advanced" modes. Often the "simple" mode becomes an excuse for lazy developers to turn the "advanced" mode into a painful mess (under the flawed pretext that advanced users can suffer the pain anyway). And if the helper functions are genuinely useful, why would they be only for beginners and students? IMHO, it would be better to label the module "scripting" rather than "beginnerlib" (and why append "lib" at the end of module names anyway? :-)). It might even contain stuff such as encoding guessing. >>> from scripting import raw_input, autotextfile Regards Antoine. From p.f.moore at gmail.com Mon Sep 11 23:49:34 2006 From: p.f.moore at gmail.com (Paul Moore) Date: Mon, 11 Sep 2006 22:49:34 +0100 Subject: [Python-3000] Pre-PEP: Easy Text File Decoding In-Reply-To: <20060911094447.hjabms9i7sgs4g8s@login.werra.lunarpages.com> References: <20060911094447.hjabms9i7sgs4g8s@login.werra.lunarpages.com> Message-ID: <79990c6b0609111449j1ae99757m35037b27c5727eda@mail.gmail.com> On 9/11/06, Michael Chermside wrote: > Paul Prescod writes: > [... Pre-PEP proposal ...] > > Quick thoughts: My quick thoughts on this whole subject: * Yes, it should be "open". Anything else feels like gratuitous breakage. * There should be a default encoding, and it should be the system default one. If I don't take special steps, most tools I use save in the system default encoding, so Python should follow this approach as well. * I don't mind corrupted characters for unusual cases. Really, I don't. * The bizarre Windows behaviour of using different encodings for console and GUI programs doesn't bother me either. Really. I promise. 99.99% of the time I simply don't care about i18n. All I want is something that runs on the machine(s) I'm using. Using the system locale is fine for that. In the rare cases where I *do* care about international characters, I have no problem doing work and research to get things right. And when I've done that, detecting encodings and specifying the right thing in an open() call is entirely OK. Of course, I'm in the useful position of having an OS default character set which contains ASCII as a subset. I don't know what issues someone with Greek/Russian/Japanese or whatever as an OS default would have (one thought - if your default character set doesn't contain ASCII as a subset, how do you deal with the hosts file? OTOH, I had a real struggle to find an example of an encoding which didn't have ASCII as a subset!) Paul. 
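The system-default behaviour Paul Moore asks for is already spellable in today's Python, which is worth keeping in mind when weighing the proposals. A sketch (the filename is just an example):

    import codecs
    import locale

    # Open a text file in the platform's default encoding -- the same
    # encoding most tools on the machine will have saved it in.
    default = locale.getpreferredencoding()
    text = codecs.open('notes.txt', 'r', encoding=default).read()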
From jimjjewett at gmail.com Mon Sep 11 23:53:40 2006 From: jimjjewett at gmail.com (Jim Jewett) Date: Mon, 11 Sep 2006 17:53:40 -0400 Subject: [Python-3000] Pre-PEP: Easy Text File Decoding In-Reply-To: <1cb725390609101114i60f2c8bei5c67885dff7b3f0a@mail.gmail.com> References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com> <1157886177.4246.59.camel@fsol> <1cb725390609101114i60f2c8bei5c67885dff7b3f0a@mail.gmail.com> Message-ID: On 9/10/06, Paul Prescod wrote: > encodingdetection.setdefaultfileencoding > encodingdetection.registerencodingdetector > encodingdetection.guessfileencoding(filename) > encodingdetection.guessfileencoding(bytestream) This demonstrates two of the problems with requiring an explicit decision. (1) You still won't actually get one; you'll just lose the information that it wasn't even considered. (2) You'll add so much boilerplate that you invite other bugs. Suddenly, >>> f=open("runlist.txt") turns into something more like >>> import encodingdetection ... >>> f=open("runlist.txt", encoding=encodingdetection.guessfileencoding("runlast.txt")) I certainly wouldn't read a line like that without a good reason; I wouldn't even notice that the encoding guess was based on a different file. It will be an annoying amount of typing though, during which time I'll be thinking: "It doesn't really matter what encoding is used; if there is anything outside of ASCII, it is because the user put it there, and all I have to do is copy it around unchanged." For situations like that, if there were *ever* a reason to specify a particular encoding, I *still* wouldn't get it right, because it is something that hasn't occurred to me. I guess the explicitness means that the error is now my fault instead of Python's, but the error is still there, and someone else is more reluctant to fix it. (Well, this *was* an explicit choice -- maybe I had a reason?) But since the encoding is mandatory, I do still have to deal with it, by making my code longer and uglier. In the end, packages will end up distributing their own non-standard convenience wrappers, so that the equivalent of >>> f=open("runlist.txt") can still be used -- but you'll have to read the whole module and the imports (and the star-imports) to figure out what it means/whether it is shadowed. You can't even scan for "open" because someone may have named their convenience wrapper get_file. If packages A and B disagree about the default encoding, it will be even harder to find and fix than it is today. -jJ From paul at prescod.net Tue Sep 12 00:09:02 2006 From: paul at prescod.net (Paul Prescod) Date: Mon, 11 Sep 2006 15:09:02 -0700 Subject: [Python-3000] Pre-PEP: Easy Text File Decoding In-Reply-To: References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com> <1157886177.4246.59.camel@fsol> <1cb725390609101114i60f2c8bei5c67885dff7b3f0a@mail.gmail.com> Message-ID: <1cb725390609111509u39725d5al72294b0d89009eb6@mail.gmail.com> I think that the basis of your concern is a misunderstanding of the proposal (at least as documented in the PEP). On 9/11/06, Jim Jewett wrote: > On 9/10/06, Paul Prescod wrote: > > > encodingdetection.setdefaultfileencoding > > encodingdetection.registerencodingdetector > > encodingdetection.guessfileencoding(filename) > > encodingdetection.guessfileencoding(bytestream) Those last two are helper functions exposing the functionality of the "guess" keyword through a different means. > This demonstrates two of the problems with requiring an explicit decision. 
> > (1) You still won't actually get one; you'll just lose the > information that it wasn't even considered. I frankly don't think that that makes any sense. If there is a default then how can I know whether someone thought about it and decided to use the default or did not think it through and decided to use the default? > (2) You'll add so much boilerplate that you invite other bugs. > > > Suddenly, > > >>> f=open("runlist.txt") > > turns into something more like > > >>> import encodingdetection > ... > >>> f=open("runlist.txt", > encoding=encodingdetection.guessfileencoding("runlast.txt")) No, that was never the proposal. The proposal is: f = open("runlist.txt", "guess") > "It doesn't really matter what encoding is used; if there is anything > outside of ASCII, it is because the user put it there, and all I have > to do is copy it around unchanged." Yes, if you are doing something utterly trivial with the text as opposed to the normal case where you are comparing it with some other input, combining it with some other input, putting it in a database, serving it up over the Web etc. Even Unix "cat" would need to be encoding-aware if it were created today and designed to be i18n-friendly. > For situations like that, if there were *ever* a reason to specify a > particular encoding, I *still* wouldn't get it right, because it is > something that hasn't occurred to me. I guess the explicitness means > that the error is now my fault instead of Python's, but the error is > still there, and someone else is more reluctant to fix it. (Well, > this *was* an explicit choice -- maybe I had a reason?) The documentation for the "guess" keyword will be clear that it is NEVER the correct choice for production-quality software. That's one of the virtues of having an explicit keyword for the quick and dirty mode (as opposed to making it the default as you seem to wish). > But since the encoding is mandatory, I do still have to deal with it, > by making my code longer and uglier. In the end, packages will end up > distributing their own non-standard convenience wrappers, so that the > equivalent of > > >>> f=open("runlist.txt") No, I don't think they'll do that to avoid typing 7 extra characters. Paul Prescod From guido at python.org Tue Sep 12 00:18:32 2006 From: guido at python.org (Guido van Rossum) Date: Mon, 11 Sep 2006 15:18:32 -0700 Subject: [Python-3000] educational aspects of Python 3000 In-Reply-To: <1158000166.4672.33.camel@fsol> References: <20060911112215.odruja6fdgl0kcg4@login.werra.lunarpages.com> <1158000166.4672.33.camel@fsol> Message-ID: On 9/11/06, Antoine Pitrou wrote: > Le lundi 11 septembre 2006 à 11:22 -0700, Michael Chermside a écrit : > > The idea of a standard edu library though is a GREAT one. That would > > provide a standard place for things like raw_input() (with a better > > name) as well as lots of other "helper functions" useful to beginners > > and/or students -- and all it would cost is a single line of boilerplate > > at the top of each program ("from beginnerlib import *" or something > > like that). > > There is a risk with a beginner-specific library: it's the same problem as > with user interfaces which have "simple" and "advanced" modes. Often > the "simple" mode becomes an excuse for lazy developers to turn the > "advanced" mode into a painful mess (under the flawed pretext that > advanced users can suffer the pain anyway). Please give us more credit than that. > And if the helper functions are genuinely useful, why would they be only > for beginners and students? 
DrScheme has several levels for beginners and experts and in between, so they think it is really useful to have different levels. I'm torn; I wish a single level would apply to all but I know that many educators provide some kind of "training wheels" library for their students. > IMHO, it would be better to label the module "scripting" rather than > "beginnerlib" (and why append "lib" at the end of module names > anyway? :-)). > It might even contain stuff such as encoding guessing. > > >>> from scripting import raw_input, autotextfile I'm not so keen on 'scripting' as the name either, but I'm sure we can come up with something. Perhaps easyio, simpleio or basicio? (Not to be confused with vbio. :-) I'm also not completely against revising the decision on killing raw_input(). While input() must definitely go, raw_input() might survive under a new name. Too bad calling it input() would be too confusing from a Python 2.x POV, and I don't want to call it readline() because it strips the trailing newline and raises EOF on error. Unless the educators can live with having to use readline().strip() instead of raw_input()...? Perhaps the educators (less Art :-) can get together and write a PEP? -- --Guido van Rossum (home page: http://www.python.org/~guido/) From paul at prescod.net Tue Sep 12 00:30:15 2006 From: paul at prescod.net (Paul Prescod) Date: Mon, 11 Sep 2006 15:30:15 -0700 Subject: [Python-3000] Pre-PEP: Easy Text File Decoding In-Reply-To: <79990c6b0609111449j1ae99757m35037b27c5727eda@mail.gmail.com> References: <20060911094447.hjabms9i7sgs4g8s@login.werra.lunarpages.com> <79990c6b0609111449j1ae99757m35037b27c5727eda@mail.gmail.com> Message-ID: <1cb725390609111530o3f7b0faah7d87a8cdf532cf39@mail.gmail.com> On 9/11/06, Paul Moore wrote: > On 9/11/06, Michael Chermside wrote: > > Paul Prescod writes: > > [... Pre-PEP proposal ...] > > > > Quick thoughts: > > My quick thoughts on this whole subject: > > * Yes, it should be "open". Anything else feels like gratuitous breakage. > * There should be a default encoding, and it should be the system > default one. If I don't take special steps, most tools I use save in > the system default encoding, so Python should follow this approach as > well. So just to be clear: you want to keep the function name "open" but change its behaviour. For example, the ord() of high characters returned by open will be completely different than today. And the syntax for "open" of binary files will be different (in fact, whether it reads the file or throws an exception will depend on your locale). > The bizarre Windows behaviour of using different > encodings for console and GUI programs doesn't > bother me either. Really. I promise." So according to this philosophy, Windows and Mac users will probably never be able to open UTF-8 documents by default even if every Microsoft app generates and consumes UTF-8 by default, because Microsoft and Apple will probably _never change the default locale_ for backwards compatibility reasons. Their philosophy seems to be that the locale is irrelevant in the age of Unicode and therefore there is no reason to upgrade it at a risk of "breaking" applications that were hard-coded to expect a specific locale. 
Paul Prescod From qrczak at knm.org.pl Tue Sep 12 01:20:28 2006 From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk) Date: Tue, 12 Sep 2006 01:20:28 +0200 Subject: [Python-3000] Pre-PEP: Easy Text File Decoding In-Reply-To: <79990c6b0609111449j1ae99757m35037b27c5727eda@mail.gmail.com> (Paul Moore's message of "Mon, 11 Sep 2006 22:49:34 +0100") References: <20060911094447.hjabms9i7sgs4g8s@login.werra.lunarpages.com> <79990c6b0609111449j1ae99757m35037b27c5727eda@mail.gmail.com> Message-ID: <87d5a2f0sj.fsf@qrnik.zagroda> "Paul Moore" writes: > Of course, I'm in the useful position of having an OS default > character set which contains ASCII as a subset. I don't know what > issues someone with Greek/Russian/Japanese or whatever as an OS > default would have (one thought - if your default character set > doesn't contain ASCII as a subset, how do you deal with the hosts > file? OTOH, I had a real struggle to find an example of an encoding > which didn't have ASCII as a subset!) AFAIK the only encoding which might be used today which is not based on ASCII is EBCDIC. Perl supports it (and it supports Unicode at the same time, via UTF-EBCDIC). Other than that, there are some Japanese encodings with a confusion between \ and the Yen sign, otherwise being ASCII. They are used today. There used to be national ASCII variants with accented letters instead of [\]^{|}~. I don't think they are still used today. -- __("< Marcin Kowalczyk \__/ qrczak at knm.org.pl ^^ http://qrnik.knm.org.pl/~qrczak/ From david.nospam.hopwood at blueyonder.co.uk Tue Sep 12 01:25:15 2006 From: david.nospam.hopwood at blueyonder.co.uk (David Hopwood) Date: Tue, 12 Sep 2006 00:25:15 +0100 Subject: [Python-3000] Pre-PEP: Easy Text File Decoding In-Reply-To: <1cb725390609102131h59b75866ha66395bf55181da@mail.gmail.com> References: <1157892435.4246.107.camel@fsol> <450418AC.2010400@blueyonder.co.uk> <20060910111814.F8ED.JCARLSON@uci.edu> <45046F06.5090502@blueyonder.co.uk> <1cb725390609102131h59b75866ha66395bf55181da@mail.gmail.com> Message-ID: <4505F05B.8070503@blueyonder.co.uk> Paul Prescod wrote: > On 9/10/06, David Hopwood wrote: > >> ... if you think that guessing based on content is a good idea -- I >> don't. In any case, such guessing necessarily depends on the expected file >> format, so it should be done by the application itself, or by a library that >> knows more about the format. > > I disagree. If a non-trivial file can be decoded as a UTF-* encoding > it probably is that encoding. That is quite false for UTF-16, at least. It is also false for short UTF-8 files. > I don't see how it matters whether the > file represents Latex or an .htaccess file. XML is a special case > because it is specially designed to make encoding detection (not > guessing, but detection) easy. Many other frequently used formats also necessarily start with an ASCII character and do not contain NULs, which is at least sufficient to reliably detect UTF-16 and UTF-32. >> If the encoding of a text stream were settable after it had been opened, >> then it would be easy for anyone to implement whatever guessing algorithm >> they needed, without having to write an encoding implementation or >> include any other support for guessing in the I/O library itself. > > But this defeats the whole purpose of the PEP which is to accelerate > the writing of quick and dirty text processing scripts. That doesn't justify making the behaviour of those scripts "dirtier" than necessary. 
I think that the focus should be on solving a set of well-defined problems, for which BOM detection can definitely help: Suppose we have a system in which some of the files are in a potentially non-Unicode 'system' encoding, and some are Unicode. The user of the system needs a reliable way of marking the Unicode files so that the encoding of *those* files can be distinguished. In addition, a provider of portable software or documentation needs a way to encode files for distribution that is independent of the system encoding, since (before run-time) they don't know what encoding that will be on any given system. Use and detection of Byte Order Marks solves both of these problems. You appear to be arguing for the common use of much more ambitious heuristic guessing, which *cannot* be made reliable. I am not opposed to providing support for such guessing in the Python standard library, but only if its limitations are thoroughly documented, and only if it is not the default. -- David Hopwood From david.nospam.hopwood at blueyonder.co.uk Tue Sep 12 01:29:13 2006 From: david.nospam.hopwood at blueyonder.co.uk (David Hopwood) Date: Tue, 12 Sep 2006 00:29:13 +0100 Subject: [Python-3000] Pre-PEP: Easy Text File Decoding In-Reply-To: <1cb725390609102111u44287761i8b509729aa6f5ce1@mail.gmail.com> References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com> <1157884285.4246.41.camel@fsol> <4503FDC8.2030608@gmail.com> <1157892435.4246.107.camel@fsol> <450418AC.2010400@blueyonder.co.uk> <1cb725390609101202x644fe40fm1e234b564890c8d5@mail.gmail.com> <45047FC2.40904@blueyonder.co.uk> <1cb725390609102111u44287761i8b509729aa6f5ce1@mail.gmail.com> Message-ID: <4505F149.8030509@blueyonder.co.uk> Paul Prescod wrote: > The PEP doesn't deal with streams. It is about files. An important part of the Unix design philosophy (partially adopted by Windows) is to make streams and files behave as similarly as possible. It is quite feasible to make *some* detection algorithms work for streams, and this is an advantage over algorithms that don't work for streams. -- David Hopwood From qrczak at knm.org.pl Tue Sep 12 01:44:13 2006 From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk) Date: Tue, 12 Sep 2006 01:44:13 +0200 Subject: [Python-3000] Pre-PEP: Easy Text File Decoding In-Reply-To: <1cb725390609111530o3f7b0faah7d87a8cdf532cf39@mail.gmail.com> (Paul Prescod's message of "Mon, 11 Sep 2006 15:30:15 -0700") References: <20060911094447.hjabms9i7sgs4g8s@login.werra.lunarpages.com> <79990c6b0609111449j1ae99757m35037b27c5727eda@mail.gmail.com> <1cb725390609111530o3f7b0faah7d87a8cdf532cf39@mail.gmail.com> Message-ID: <87r6yij7ea.fsf@qrnik.zagroda> "Paul Prescod" writes: >> The bizarre Windows behaviour of using different >> encodings for console and GUI programs doesn't >> bother me either. Really. I promise." > > So according to this philosophy, Windows and Mac users will probably > never be able to open UTF-8 documents by default even if every > Microsoft app generates and consumes UTF-8 by default, because > Microsoft and Apple will probably _never change the default locale_ > for backwards compatibility reasons. This can be solved for file reading by making a "Windows locale" always consider the UTF-8 BOM and switch to UTF-8 in this case. It's still unclear what to do for writing on Windows. I have no idea what Mac does (does it typically use UTF-8 locales? and does it typically use a BOM in UTF-8?). 
-- __("< Marcin Kowalczyk \__/ qrczak at knm.org.pl ^^ http://qrnik.knm.org.pl/~qrczak/ From paul at prescod.net Tue Sep 12 02:41:59 2006 From: paul at prescod.net (Paul Prescod) Date: Mon, 11 Sep 2006 17:41:59 -0700 Subject: [Python-3000] Pre-PEP: Easy Text File Decoding In-Reply-To: <4505F05B.8070503@blueyonder.co.uk> References: <1157892435.4246.107.camel@fsol> <450418AC.2010400@blueyonder.co.uk> <20060910111814.F8ED.JCARLSON@uci.edu> <45046F06.5090502@blueyonder.co.uk> <1cb725390609102131h59b75866ha66395bf55181da@mail.gmail.com> <4505F05B.8070503@blueyonder.co.uk> Message-ID: <1cb725390609111741v3b7ef92aufec5aa960711768c@mail.gmail.com> On 9/11/06, David Hopwood wrote: > > I disagree. If a non-trivial file can be decoded as a UTF-* encoding > > it probably is that encoding. > > That is quite false for UTF-16, at least. It is also false for short UTF-8 > files. True UTF-16 (as opposed to UTF-16 BE/UTF 16 LE) files have a BOM. Also, you can recognize incorrect ones through misuse of surrogates. > > I don't see how it matters whether the > > file represents Latex or an .htaccess file. XML is a special case > > because it is specially designed to make encoding detection (not > > guessing, but detection) easy. > > Many other frequently used formats also necessarily start with an ASCII > character and do not contain NULs, which is at least sufficient to reliably > detect UTF-16 and UTF-32. Yes, but these are the two easiest ones. > > But this defeats the whole purpose of the PEP which is to accelerate > > the writing of quick and dirty text processing scripts. > > That doesn't justify making the behaviour of those scripts "dirtier" than > necessary. > > I think that the focus should be on solving a set of well-defined problems, > for which BOM detection can definitely help: > > Suppose we have a system in which some of the files are in a potentially > non-Unicode 'system' encoding, and some are Unicode. The user of the system > needs a reliable way of marking the Unicode files so that the encoding of > *those* files can be distinguished. If the user understands the problem and is willing to go to this level of effort then they are not the target user of the feature. > ... In addition, a provider of portable > software or documentation needs a way to encode files for distribution that > is independent of the system encoding, since (before run-time) they don't > know what encoding that will on any given system. Use and detection of > Byte Order Marks solves both of these problems. Sure, that's great. > You appear to be arguing for the common use of much more ambitious heuristic > guessing, which *cannot* be made reliable. First, the word "guess" necessarily implies unreliability. Guido started this whole chain of discussion when he said: "(Auto-detection from sniffing the data is a perfectly valid answer BTW -- I see no reason why that couldn't be one option, as long as there's a way to disable it.)" > ... I am not opposed to providing > support for such guessing in the Python standard library, but only if its > limitations are thoroughly documented, and only if it is not the default. Those are both characteristics of the proposal that started this thread so what are we arguing about? Since writing the PEP, I've noticed that the strategy of trying to decode as UTF-* and falling back to an 8-bit character set is actually pretty common in text editors, which implies that Python's behaviour here can be highly similar to text editors. 
Behaving like existing text editors was the key requirement Guido gave me in an off-list email for the guessing mode. VIM: "fileencodings: This is a list of character encodings considered when starting to edit a file. When a file is read, Vim tries to use the first mentioned character encoding. If an error is detected, the next one in the list is tried. When an encoding is found that works, 'fileencoding' is set to it. " Reading the docs, one can infer that this feature is specifically designed to support UTF-8 sniffing. I would guess that the default configuration has it do UTF-8 sniffing. BBEdit: "If the file contains no other cues to indicate its text encoding, and its contents appear to be valid UTF-8, BBEdit will open it as UTF-8 (No BOM) without recourse to the preferences option." Paul Prescod From paul at prescod.net Tue Sep 12 03:16:15 2006 From: paul at prescod.net (Paul Prescod) Date: Mon, 11 Sep 2006 18:16:15 -0700 Subject: [Python-3000] Pre-PEP: Easy Text File Decoding In-Reply-To: <87r6yij7ea.fsf@qrnik.zagroda> References: <20060911094447.hjabms9i7sgs4g8s@login.werra.lunarpages.com> <79990c6b0609111449j1ae99757m35037b27c5727eda@mail.gmail.com> <1cb725390609111530o3f7b0faah7d87a8cdf532cf39@mail.gmail.com> <87r6yij7ea.fsf@qrnik.zagroda> Message-ID: <1cb725390609111816k7187a15eu91c4029dad65885d@mail.gmail.com> On 9/11/06, Marcin 'Qrczak' Kowalczyk wrote: > "Paul Prescod" writes: > > >> The bizarre Windows behaviour of using different > >> encodings for console and GUI programs doesn't > >> bother me either. Really. I promise." > > > > So according to this philosophy, Windows and Mac users will probably > > never be able to open UTF-8 documents by default even if every > > Microsoft app generates and consumes UTF-8 by default, because > > Microsoft and Apple will probably _never change the default locale_ > > for backwards compatibility reasons. > > This can be solved for file reading by making a "Windows locale" > always consider UTF-8 BOM and switch to UTF-8 in this case. That's fine but I don't see why we would turn that feature off for any platform. Do you have a bunch of files hanging around starting with zero-width non-breaking spaces? > It's still unclear what to do for writing on Windows. UTF-8 with BOM is the Microsoft preferred format. Maybe after experimentation we'll find that there are still apps out there that choke on it, but we should start out trying to be compatible with other apps on the platform. > I have no idea what Mac does (does it typically use UTF-8 locales? > and does it typically use a BOM in UTF-8?). Like Windows, the Mac has backwards-compatible behaviours in some places (TextEdit defaults to a proprietary encoding called Mac Roman) and UTF-8 behaviours in other places (e.g. cut and paste). In some places (on my configuration) it claims its locale is US ASCII. TextEdit can read files with a BOM and auto-detect Unicode with a BOM. It always saves without a BOM, which results in the unfortunate situation that TextEdit will recognize a file's encoding, then save it, then forget its encoding when you reopen it. :( But again, this implies that at least on these two platforms UTF-8 w/BOM is a good default output encoding. On Unix, VIM is also set up to auto-detect UTF-8 (using the BOM or a full decoding attempt). According to Google, XEmacs also has some kind of UTF-8/BOM detector but I don't know the details. GNU Emacs: According to "Emacs wiki": "Auto-detection of UTF-8 is effectively disabled by default in GNU Emacs 21.3 and below." So the situation on Unix is not as clear.
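The detection loop these editors implement amounts to only a few lines. A sketch (guess_decode and the candidate list are illustrative, not a proposed interface):

def guess_decode(data, candidates=('utf-8', 'latin-1')):
    # Mirror vim's 'fileencodings': the first encoding that decodes
    # cleanly wins.  'latin-1' accepts any byte string, so keeping it
    # last makes it the fallback.
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise UnicodeError("none of the candidate encodings fit")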
Paul Prescod From talin at acm.org Tue Sep 12 04:48:39 2006 From: talin at acm.org (Talin) Date: Mon, 11 Sep 2006 19:48:39 -0700 Subject: [Python-3000] educational aspects of Python 3000 In-Reply-To: References: <20060911112215.odruja6fdgl0kcg4@login.werra.lunarpages.com> <1158000166.4672.33.camel@fsol> Message-ID: <45062007.1040909@acm.org> Guido van Rossum wrote: >>>>> from scripting import raw_input, autotextfile > > I'm not so keen on 'scripting' as the name either, but I'm sure we can > come up with something. Perhaps easyio, simpleio or basicio? (Not to > be confused with vbio. :-) > > I'm also not completely against revising the decision on killing > raw_input(). While input() must definitely go, raw_input() might > survive under a new name. Too bad calling it input() would be too > confusing from a Python 2.x POV, and I don't want to call it > readline() because it strips the trailing newline and raises EOF on > error. Unless the educators can live with having to use > readline().strip() instead of raw_input()...? How about calling it 'ask'?

>>> s = ask( "How are you today?" )
--> Fine
>>> s
"Fine"

And as far as the name of a library goes how about "quickstart"? Other possibilities are: quickstudy, kickstart, simplestart, etc. "With the Python quickstart module, programming is as easy as one...two...five!" -- Talin From greg.ewing at canterbury.ac.nz Tue Sep 12 05:20:50 2006 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Tue, 12 Sep 2006 15:20:50 +1200 Subject: [Python-3000] The future of exceptions In-Reply-To: <874pvgozlo.fsf@qrnik.zagroda> References: <20060908094550.zdd30l5x9tk0koow@login.werra.lunarpages.com> <4503AC79.4090601@canterbury.ac.nz> <874pvgozlo.fsf@qrnik.zagroda> Message-ID: <45062792.3040207@canterbury.ac.nz> Marcin 'Qrczak' Kowalczyk wrote: > It's lazily instantiated today (see PyErr_NormalizeException). Only in C code, though, not Python. And if the separate type/value specification when raising goes away, it might not be possible any more even in C. > 'WithExit' constructs a unique exception object and catches precisely > this object. That would fill the bill, yes. So it's really just a matter of making sure we keep the ability to be as lazy as possible with exception processing. -- Greg Ewing, Computer Science Dept, +--------------------------------------+ University of Canterbury, | Carpe post meridiem! | Christchurch, New Zealand | (I'm not a morning person.) | greg.ewing at canterbury.ac.nz +--------------------------------------+ From greg.ewing at canterbury.ac.nz Tue Sep 12 05:34:42 2006 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Tue, 12 Sep 2006 15:34:42 +1200 Subject: [Python-3000] sys.stdin and sys.stdout with textfile In-Reply-To: References: <1157898432.4246.161.camel@fsol> Message-ID: <45062AD2.1090207@canterbury.ac.nz> Guido van Rossum wrote: > All sorts of things are different when reading stdin vs. opening a > filename. e.g. stdin may be a pipe. Which suggests that if anything is going to try to guess the encoding, it would be better for it to start reading from the actual stream you're going to use and buffer the result, rather than rely on being able to open it separately. -- Greg Ewing, Computer Science Dept, +--------------------------------------+ University of Canterbury, | Carpe post meridiem! | Christchurch, New Zealand | (I'm not a morning person.)
| greg.ewing at canterbury.ac.nz +--------------------------------------+ From greg.ewing at canterbury.ac.nz Tue Sep 12 05:37:55 2006 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Tue, 12 Sep 2006 15:37:55 +1200 Subject: [Python-3000] Pre-PEP: Easy Text File Decoding In-Reply-To: References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com> <4503FF8F.6070801@gmail.com> Message-ID: <45062B93.3080207@canterbury.ac.nz> Guido van Rossum wrote: > if possible I'd like the > guessing function to have access to what was in the file before it was > emptied by the "create" function, or what's at the start before > appending to the end, Which further suggests that the encoding-guesser needs to be fairly intimately built into some layer of the i/o stack, and not require calling a separate function (although it could be provided as such in case you want to use it that way). -- Greg Ewing, Computer Science Dept, +--------------------------------------+ University of Canterbury, | Carpe post meridiem! | Christchurch, New Zealand | (I'm not a morning person.) | greg.ewing at canterbury.ac.nz +--------------------------------------+ From greg.ewing at canterbury.ac.nz Tue Sep 12 06:18:37 2006 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Tue, 12 Sep 2006 16:18:37 +1200 Subject: [Python-3000] educational aspects of Python 3000 In-Reply-To: <20060911112215.odruja6fdgl0kcg4@login.werra.lunarpages.com> References: <20060911112215.odruja6fdgl0kcg4@login.werra.lunarpages.com> Message-ID: <4506351D.4040109@canterbury.ac.nz> Michael Chermside wrote: > The idea of a standard edu library though is a GREAT one. That would > provide a standard place for things like raw_input() (with a better > name) as well as lots of other "helper functions" useful to beginners > and/or students -- and all it would cost is a single line of boilerplate > at the top of each program ("from beginnerlib import *" or something > like that). I disagree for two reasons: 1) Even a single line of boilerplate is too much when you're trying to pare things down to the bare minimum for a beginner. 2) It teaches a bad habit right from the beginning (i.e. using 'import *'). This is the wrong foot to start a beginner off on. -- Greg Ewing, Computer Science Dept, +--------------------------------------+ University of Canterbury, | Carpe post meridiem! | Christchurch, New Zealand | (I'm not a morning person.) | greg.ewing at canterbury.ac.nz +--------------------------------------+ From greg.ewing at canterbury.ac.nz Tue Sep 12 06:36:01 2006 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Tue, 12 Sep 2006 16:36:01 +1200 Subject: [Python-3000] iostack, second revision In-Reply-To: <45045105.5040209@jmunch.dk> References: <1d85506f0609071030o60a02ac5j739027d1825f0e7f@mail.gmail.com> <4501A1B1.5050707@gmail.com> <200609081403.18350.fdrake@acm.org> <1157740873.4979.10.camel@fsol> <450254DB.3020502@gmail.com> <45045105.5040209@jmunch.dk> Message-ID: <45063931.3050904@canterbury.ac.nz> Anders J. Munch wrote: > any file that supports seeking to the end will also support > reporting the file size. Thus > f.seek(f.length) > should suffice, Although the micro-optimisation circuit in my brain complains that it will take 2 system calls when it could be done with 1... -- Greg Ewing, Computer Science Dept, +--------------------------------------+ University of Canterbury, | Carpe post meridiem! | Christchurch, New Zealand | (I'm not a morning person.) 
| greg.ewing at canterbury.ac.nz +--------------------------------------+ From solipsis at pitrou.net Tue Sep 12 08:37:58 2006 From: solipsis at pitrou.net (Antoine Pitrou) Date: Tue, 12 Sep 2006 08:37:58 +0200 Subject: [Python-3000] text editors In-Reply-To: <1cb725390609111816k7187a15eu91c4029dad65885d@mail.gmail.com> References: <20060911094447.hjabms9i7sgs4g8s@login.werra.lunarpages.com> <79990c6b0609111449j1ae99757m35037b27c5727eda@mail.gmail.com> <1cb725390609111530o3f7b0faah7d87a8cdf532cf39@mail.gmail.com> <87r6yij7ea.fsf@qrnik.zagroda> <1cb725390609111816k7187a15eu91c4029dad65885d@mail.gmail.com> Message-ID: <1158043078.4276.6.camel@fsol> On Monday 11 September 2006 at 18:16 -0700, Paul Prescod wrote: > On Unix, VIM is also set up to auto-detect UTF-8 (using the BOM or > a full decoding attempt). According to Google, XEmacs also has some > kind of UTF-8/BOM detector but I don't know the details. GNU Emacs: > According to "Emacs wiki": "Auto-detection of UTF-8 is effectively > disabled by default in GNU Emacs 21.3 and below." > > So the situation on Unix is not as clear. gedit has an ordered list of encodings to test for when it opens a file, and it chooses the first encoding which succeeds in decoding the file. The encoding list is stored as a gconf key named "auto_detected" in /apps/gedit-2/preferences/encodings, and its default value is [UTF-8, CURRENT, ISO-8859-15] ("CURRENT" being interpreted as the current locale). I suppose the explicit fallback to iso-8859-15 is for the common case where the user has a Western European language, his user locale is utf-8 and he has some non-Unicode files hanging around... Regards Antoine. From ajm at flonidan.dk Tue Sep 12 09:01:01 2006 From: ajm at flonidan.dk (Anders J. Munch) Date: Tue, 12 Sep 2006 09:01:01 +0200 Subject: [Python-3000] iostack, second revision Message-ID: <9B1795C95533CA46A83BA1EAD4B01030031F52@flonidanmail.flonidan.net> Greg Ewing wrote: > Anders J. Munch wrote: > > any file that supports seeking to the end will also support > > reporting the file size. Thus > > f.seek(f.length) > > should suffice, > > Although the micro-optimisation circuit in my > brain complains that it will take 2 system > calls when it could be done with 1... I don't expect file methods and system calls to map one to one, but you're right, the first time the length is needed, that's an extra system call. My micro-optimisation circuitry blew a fuse when I discovered that seek always implies flush. You won't get good performance out of code that does a lot of seeks, whatever you do. Use my upcoming FileBytes class :) - Anders From tony at PageDNA.com Tue Sep 12 04:27:09 2006 From: tony at PageDNA.com (Tony Lownds) Date: Mon, 11 Sep 2006 19:27:09 -0700 Subject: [Python-3000] educational aspects of Python 3000 In-Reply-To: References: <20060911112215.odruja6fdgl0kcg4@login.werra.lunarpages.com> <1158000166.4672.33.camel@fsol> Message-ID: <0CC6FE11-6DD0-4923-B417-21273679A15A@PageDNA.com> >> IMHO, it would be better to label the module "scripting" rather than >> "beginnerlib" (and why append "lib" at the end of module names >> anyway? :-)). >> It might even contain stuff such as encoding guessing. >> >>>>> from scripting import raw_input, autotextfile > > I'm not so keen on 'scripting' as the name either, but I'm sure we can > come up with something. Perhaps easyio, simpleio or basicio? (Not to > be confused with vbio. :-) > How about simpleui? This is a user interface routine.
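A simpleui module along these lines could be as small as the following sketch (the names are borrowed from the JavaScript trio mentioned just below; everything here is illustrative, not a proposal):

def prompt(message, default=''):
    # Ask the user for one line of text.
    reply = raw_input(message + ' ').strip()
    return reply or default

def confirm(message):
    # Yes/no question; returns True or False.
    return raw_input(message + ' [y/n] ').strip().lower().startswith('y')

def alert(message):
    print message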
> I'm also not completely against revising the decision on killing > raw_input(). While input() must definitely go, raw_input() might > survive under a new name. Too bad calling it input() would be too > confusing from a Python 2.x POV, and I don't want to call it > readline() because it strips the trailing newline and raises EOF on > error. Unless the educators can live with having to use > readline().strip() instead of raw_input()...? > JavaScript provides prompt, confirm, alert. They are all very useful as user interface routines and would be just as useful in Python. Maybe raw_input could survive as "prompt". Alternatively, here is a way to keep the "input" name with a useful extension to the semantics of input(). Add a "converter" argument that defaults to eval in Python 2.6. "eval" is deprecated as an argument in 2.7. In Python 3.0 the default gets changed to "str". The user's input is passed through the converter function. Any exceptions from the converter cause input() to prompt the user again. Code is below. sys.stdin.readline() doesn't use the readline library. That is a nice feature of the current raw_input() and input() builtins. I don't think this feature can even be emulated with the current readline module. -Tony Lownds

import sys

def input(prompt='', converter=eval):
    while 1:
        sys.stdout.write(prompt)
        sys.stdout.flush()
        line = sys.stdin.readline().rstrip('\n\r')
        try:
            return converter(line)
        except (KeyboardInterrupt, SystemExit):
            raise
        except Exception, e:
            print str(e)

if __name__ == '__main__':
    print "Result: %s" % input("Enter string:", str)
    print "Result: %d" % input("Enter integer:", int)
    print "Result: %r" % input("Enter expression:")

Here's how it looks when run:

Enter string:12
Result: 12
Enter integer:1a
invalid literal for int(): 1a
Enter integer:
invalid literal for int():
Enter integer:12
Result: 12
Enter expression:a b c
invalid syntax (line 1)
Enter expression:abc
name 'abc' is not defined
Enter expression:'abc'
Result: 'abc'

From ncoghlan at gmail.com Tue Sep 12 15:45:44 2006 From: ncoghlan at gmail.com (Nick Coghlan) Date: Tue, 12 Sep 2006 23:45:44 +1000 Subject: [Python-3000] educational aspects of Python 3000 In-Reply-To: References: <20060911112215.odruja6fdgl0kcg4@login.werra.lunarpages.com> Message-ID: <4506BA08.7090905@gmail.com> Brett Cannon wrote: > On 9/11/06, *Michael Chermside* wrote: Personally, I think input() should never have existed and must go > no matter what. > > > Agreed. Teach the folks eval() quick if you want something like that. The world would probably be a happier place if you taught them int() and float() instead, though :) > I think raw_input() is worth discussing -- I wouldn't > need it, but it's little more than a convenience function. > > > Yeah, but when you are learning it's cool to take input easily. I loved > raw_input() when I started out. We could always rename raw_input() to input(). Just a thought. . . Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia --------------------------------------------------------------- http://www.boredomandlaziness.org From ncoghlan at gmail.com Tue Sep 12 15:47:15 2006 From: ncoghlan at gmail.com (Nick Coghlan) Date: Tue, 12 Sep 2006 23:47:15 +1000 Subject: [Python-3000] educational aspects of Python 3000 In-Reply-To: <4506BA08.7090905@gmail.com> References: <20060911112215.odruja6fdgl0kcg4@login.werra.lunarpages.com> <4506BA08.7090905@gmail.com> Message-ID: <4506BA63.7040201@gmail.com> Nick Coghlan wrote: > We could always rename raw_input() to input(). Just a thought. . .
D'oh. Guido already said he doesn't like that idea :) Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia --------------------------------------------------------------- http://www.boredomandlaziness.org From rhettinger at ewtllc.com Tue Sep 12 16:25:06 2006 From: rhettinger at ewtllc.com (Raymond Hettinger) Date: Tue, 12 Sep 2006 07:25:06 -0700 Subject: [Python-3000] educational aspects of Python 3000 In-Reply-To: <4506BA63.7040201@gmail.com> References: <20060911112215.odruja6fdgl0kcg4@login.werra.lunarpages.com> <4506BA08.7090905@gmail.com> <4506BA63.7040201@gmail.com> Message-ID: <4506C342.6010202@ewtllc.com> An HTML attachment was scrubbed... URL: http://mail.python.org/pipermail/python-3000/attachments/20060912/8bcdcbc9/attachment.html From jcarlson at uci.edu Tue Sep 12 18:05:50 2006 From: jcarlson at uci.edu (Josiah Carlson) Date: Tue, 12 Sep 2006 09:05:50 -0700 Subject: [Python-3000] iostack, second revision In-Reply-To: <9B1795C95533CA46A83BA1EAD4B01030031F52@flonidanmail.flonidan.net> References: <9B1795C95533CA46A83BA1EAD4B01030031F52@flonidanmail.flonidan.net> Message-ID: <20060912085752.F915.JCARLSON@uci.edu> "Anders J. Munch" wrote: > > Greg Ewing wrote: > > Anders J. Munch wrote: > > > any file that supports seeking to the end will also support > > > reporting the file size. Thus > > > f.seek(f.length) > > > should suffice, > > > > Although the micro-optimisation circuit in my > > brain complains that it will take 2 system > > calls when it could be done with 1... > > I don't expect file methods and system calls to map one to one, but > you're right, the first time the length is needed, that's an extra > system call. Every time the length is needed, a system call is required (you can have multiple writers of the same file)...

>>> import os
>>> a = open('test.txt', 'a')
>>> b = open('test.txt', 'a')
>>> a.write('hello')
>>> b.write('whee!!')
>>> a.flush()
>>> os.fstat(a.fileno()).st_size
5L
>>> b.flush()
>>> os.fstat(b.fileno()).st_size
11L
>>>

> My micro-optimisation circuitry blew a fuse when I discovered that > seek always implies flush. You won't get good performance out of code > that does a lot of seeks, whatever you do. Use my upcoming FileBytes > class :) Flushing during seek is important. By not flushing during seek in your FileBytes object, you are unnecessarily delaying writes, which could cause file corruption. - Josiah From tjd at sfu.ca Tue Sep 12 18:51:18 2006 From: tjd at sfu.ca (Toby Donaldson) Date: Tue, 12 Sep 2006 09:51:18 -0700 Subject: [Python-3000] educational aspects of Python 3000 In-Reply-To: <4506C342.6010202@ewtllc.com> References: <20060911112215.odruja6fdgl0kcg4@login.werra.lunarpages.com> <4506BA08.7090905@gmail.com> <4506BA63.7040201@gmail.com> <4506C342.6010202@ewtllc.com> Message-ID: On 9/12/06, Raymond Hettinger wrote: > > We could always rename raw_input() to input(). Just a thought. . . > > D'oh. Guido already said he doesn't like that idea :) > > > > FWIW, I think it is a good idea. If there is a little 2.x vs 3.0 > confusion, so be it. The use of the input() function is already somewhat rare > (both because of infrequent use cases and because of the stern warnings > about eval's security issues). It is better to bite the bullet and move on > than it would be to avoid the most obvious name. I agree ... "input" is perhaps the best name from a beginner's point of view, and it is only a minor inconvenience for experienced programmers. Toby -- Dr.
Toby Donaldson School of Computing Science Simon Fraser University (Surrey) From tjd at sfu.ca Tue Sep 12 19:11:29 2006 From: tjd at sfu.ca (Toby Donaldson) Date: Tue, 12 Sep 2006 10:11:29 -0700 Subject: [Python-3000] educational aspects of Python 3000 Message-ID: > How about calling it 'ask'? > > >>> s = ask( "How are you today?" ) > --> Fine > >>> s > "Fine" > > And as far as the name of a library goes how about "quickstart"? Other > possibilities are: quickstudy, kickstart, simplestart, etc. > > "With the Python quickstart module, programming is as easy as > one...two...five!" :-) Actually, some educators (not me so much, but I see where they are coming from) have negative reactions to library names like "teach" or "edu" because they feel it sends a message to students that they are not learning "real" Python. A positive-sounding name like "quickstart" would avoid this problem. Toby -- Dr. Toby Donaldson School of Computing Science Simon Fraser University (Surrey) From talin at acm.org Tue Sep 12 19:35:39 2006 From: talin at acm.org (Talin) Date: Tue, 12 Sep 2006 10:35:39 -0700 Subject: [Python-3000] educational aspects of Python 3000 In-Reply-To: References: Message-ID: <4506EFEB.6040001@acm.org> Toby Donaldson wrote: >> How about calling it 'ask'? >> >> >>> s = ask( "How are you today?" ) >> --> Fine >> >>> s >> "Fine" >> >> And as far as the name of a library goes how about "quickstart"? Other >> possibilities are: quickstudy, kickstart, simplestart, etc. >> >> "With the Python quickstart module, programming is as easy as >> one...two...five!" > > :-) > > Actually, some educators (not me so much, but I see where they are > coming from) have negative reactions to library names like "teach" or > "edu" because they feel it sends a message to students that they are > not learning "real" Python. > > A positive-sounding name like "quickstart" would avoid this problem. That was exactly my thinking. -- Talin From nnorwitz at gmail.com Tue Sep 12 19:53:13 2006 From: nnorwitz at gmail.com (Neal Norwitz) Date: Tue, 12 Sep 2006 10:53:13 -0700 Subject: [Python-3000] educational aspects of Python 3000 In-Reply-To: <4506C342.6010202@ewtllc.com> References: <20060911112215.odruja6fdgl0kcg4@login.werra.lunarpages.com> <4506BA08.7090905@gmail.com> <4506BA63.7040201@gmail.com> <4506C342.6010202@ewtllc.com> Message-ID: On 9/12/06, Raymond Hettinger wrote: > > We could always rename raw_input() to input(). Just a thought. . . > > D'oh. Guido already said he doesn't like that idea :) > > FWIW, I think it is a good idea. If there is a little 2.x vs 3.0 > confusion, so be it. The use of the input() function is already somewhat rare > (both because of infrequent use cases and because of the stern warnings > about eval's security issues). It is better to bite the bullet and move on > than it would be to avoid the most obvious name. I agree. Plus we are already doing something similar for {}.keys() etc by changing them in a somewhat subtle way. I also recall something weird when I ripped out input wrt readline or something. I don't recall if I checked in the removal of {raw_,}input or not. This is also something easy to look for and flag. pychecker already catches uses of input and warns about it.
n From rrr at ronadam.com Tue Sep 12 23:03:30 2006 From: rrr at ronadam.com (Ron Adam) Date: Tue, 12 Sep 2006 16:03:30 -0500 Subject: [Python-3000] educational aspects of Python 3000 In-Reply-To: <4506C342.6010202@ewtllc.com> References: <20060911112215.odruja6fdgl0kcg4@login.werra.lunarpages.com> <4506BA08.7090905@gmail.com> <4506BA63.7040201@gmail.com> <4506C342.6010202@ewtllc.com> Message-ID: Raymond Hettinger wrote: > >>> We could always rename raw_input() to input(). Just a thought. . . >>> >> >> D'oh. Guido already said he doesn't like that idea :) >> >> > > FWIW, I think it is a good idea. If there is a little 2.x vs 3.0 > confusion, so be it. The use of the input() function is already somewhat > rare (both because of infrequent use cases and because of the stern > warnings about eval's security issues). It is better to bite the bullet > and move on than it would be to avoid the most obvious name. > > Raymond Maybe "input" can be deprecated in 2.x with a message to use eval(raw_input()) instead. That would limit some of the confusion. From rhettinger at ewtllc.com Tue Sep 12 23:58:19 2006 From: rhettinger at ewtllc.com (Raymond Hettinger) Date: Tue, 12 Sep 2006 14:58:19 -0700 Subject: [Python-3000] educational aspects of Python 3000 In-Reply-To: References: <20060911112215.odruja6fdgl0kcg4@login.werra.lunarpages.com> <4506BA08.7090905@gmail.com> <4506BA63.7040201@gmail.com> <4506C342.6010202@ewtllc.com> Message-ID: <45072D7B.4020202@ewtllc.com> Ron Adam wrote: >Maybe "input" can be deprecated in 2.x with a message to use eval(raw_input()) >instead. That would limit some of the confusion. > > > Let me take this opportunity to articulate a principle that I hope this group will adopt, "Thou shalt not muck-up Py2.x in the name of Py3k." Given that Py3k will not be backwards compatible in many ways, we may expect that tons of code will remain in the 2.x world and it behooves us not to burden that massive codebase with Py3k oriented deprecations, warnings, etc. It's okay to backport compatible feature additions and I expect that a number of people will author third-party transition tools, but let's not gum-up the current, wildly successful strain of Python. Expect that 2.x will continue to live side-by-side with Py3k for a long time. It is a bit premature to read the will and auction off the estate ;-) Any ideas for Py3k that are part of the natural evolution of the 2.x series can of course be done in parallel, but each 2.x proposal needs to be evaluated on its own merits. IOW, "limiting 2.x vs 3k confusion" is NOT a sufficient reason to change 2.x. Raymond
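The third-party transition tools Raymond expects can be tiny. A sketch of one (purely illustrative, not proposed for the stdlib; the warning text is made up):

import warnings

def input(prompt=''):
    # Shadow the builtin: same behaviour, plus a Py3k transition warning.
    warnings.warn("Py3k drops eval-style input(); call raw_input() and "
                  "convert explicitly with int()/float()/eval()",
                  DeprecationWarning, stacklevel=2)
    return eval(raw_input(prompt))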
> > I disagree for two reasons: > > 1) Even a single line of boilerplate is too much > when you're trying to pare things down to the > bare minimum for a beginner. > > 2) It teaches a bad habit right from the > beginning (i.e. using 'import *'). This is the > wrong foot to start a beginner off on. I agree. For an absolute newbie, Python's import semantics are way, WAY down the road long after variables, numbers, strings, comments, control statements, functions etc. A third reason is that if these functions are packaged in a beginnerlib module, then you would have to type "from beginnerlib import *" each and every time you want to use raw_input() from the Python console. -- mvh Björn From steven.bethard at gmail.com Wed Sep 13 04:47:02 2006 From: steven.bethard at gmail.com (Steven Bethard) Date: Tue, 12 Sep 2006 20:47:02 -0600 Subject: [Python-3000] educational aspects of Python 3000 In-Reply-To: <45072D7B.4020202@ewtllc.com> References: <20060911112215.odruja6fdgl0kcg4@login.werra.lunarpages.com> <4506BA08.7090905@gmail.com> <4506BA63.7040201@gmail.com> <4506C342.6010202@ewtllc.com> <45072D7B.4020202@ewtllc.com> Message-ID: On 9/12/06, Raymond Hettinger wrote: > Ron Adam wrote: > >Maybe "input" can be deprecated in 2.x with a message to use eval(raw_input()) > >instead. That would limit some of the confusion. > > Let me take this opportunity to articulate a principle that I hope this > group will adopt, "Thou shalt not muck-up Py2.x in the name of Py3k." I agree 100% with this principle. But "input" could definitely get a warning when the Python 2.x --warn-me-about-python-3-incompatibilities switch is given. Guido's already suggested that, for example, using the result of dict.items() for anything other than iteration should issue such a warning. STeVe -- I'm not *in*-sane. Indeed, I am so far *out* of sane that you appear a tiny blip on the distant coast of sanity. --- Bucky Katt, Get Fuzzy From martin at v.loewis.de Wed Sep 13 06:56:33 2006 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Wed, 13 Sep 2006 06:56:33 +0200 Subject: [Python-3000] locale-aware strings ? In-Reply-To: <1cb725390609051317p9b75fb4l1e3068b0f42f121f@mail.gmail.com> References: <44F8FEED.9000600@gmail.com> <44FC4B5B.9010508@blueyonder.co.uk> <1cb725390609050908m6150c1cas533a98f6a14cbcd0@mail.gmail.com> <1cb725390609051317p9b75fb4l1e3068b0f42f121f@mail.gmail.com> Message-ID: <45078F81.4010506@v.loewis.de> Paul Prescod schrieb: > I haven't created locale-relevant content in a generic text editor in a > very, very long time. You are an atypical user, then. I use plain text files all the time, and I know other people do as well. Regards, Martin
For example, PyString_From{String[AndSize]|Format} would either:
- have to grow an encoding argument
- assume a default encoding (either ASCII or UTF-8)
- change its signature to operate on Py_UNICODE* (although we don't have literals for these) or
- be removed
Likewise, PyString_AsString either goes away or changes its return type. String APIs that operate on PyObject* likely can stay as-is. Regards, Martin From martin at v.loewis.de Wed Sep 13 06:10:38 2006 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Wed, 13 Sep 2006 06:10:38 +0200 Subject: [Python-3000] Character Set Independence In-Reply-To: <1cb725390609010711p37586956q1ec570a2d8c4196d@mail.gmail.com> References: <1cb725390609010711p37586956q1ec570a2d8c4196d@mail.gmail.com> Message-ID: <450784BE.7050802@v.loewis.de> Paul Prescod schrieb: > I think that the gist of it is that Unicode will be "just one character > set" supported by Ruby. This idea has been kicked around for Python > before but you quickly run into questions about how you compare > character strings from multiple character sets, to say nothing of the > complexity of a character encoding and character set agnostic regular > expression engine. As Guido says, the arguments for "CSI (character set independence)" are hardly convincing. Yes, there are cases where Unicode doesn't "round-trip", but they are so obscure that they (IMO) can be ignored safely. There are two problems in this respect with Unicode:
- in some cases, a character set may contain characters that are not included in Unicode. This was a serious problem for Chinese for quite some time, but I believe it is now fixed, with the plane-2 additions. If just round-tripping is the goal, then it is always possible for a codec to map characters to the private-use areas of Unicode. This is not optimal, since a different codec may give a different meaning to the same PUA characters, but there should rarely be a need to use them in the first place.
- in some cases, the input encoding has multiple representations for what becomes the same character in Unicode. For example, in ISO-2022-jp, there are three ways to encode the latin letters (either in ASCII, or in the romaji part of either JIS X 0208-1978 or JIS X 0208-1983). You can switch between these in a single string; if you go back and forth through Unicode, you get a normalized version that .encode gives you. While I have seen people bringing it up now and then, I don't recall anybody claiming that this is a real, practical problem.
There is a third problem that people often associate with Unicode: due to the Han unification, you don't know whether a certain Han character originates from Chinese, Japanese, or Korean. This is a problem when rendering Unicode: you don't know what glyphs to use (as you should use different glyphs depending on the natural language). With CSI, you can use a "language-aware encoding": you use a Japanese encoding for Japanese text, and so on, then use the encoding to determine what the language is. For Unicode, there are several ways to deal with it:
- you could carry language information along with the original text. This is what is commonly done on the web: you put language information into the HTML, and then use that to render the text correctly.
- you could embed language information into the Unicode string, using the plane-14 tag characters.
This should work fairly nicely, since you only need a single piece of information, but has some drawbacks:
* you need four-byte Unicode, or surrogates
* if you slice such a string, the slices won't carry the language tag
* applications today typically don't know how to deal with tag characters
- you could guess the language from the content, based on the frequency of characters (e.g. presence of katakana/hiragana would indicate that it is Japanese). As with all guessing, there are cases where it fails. I believe that web browsers commonly apply that approach, anyway.
Regards, Martin From martin at v.loewis.de Wed Sep 13 06:44:54 2006 From: martin at v.loewis.de (=?ISO-8859-15?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Wed, 13 Sep 2006 06:44:54 +0200 Subject: [Python-3000] locale-aware strings ? In-Reply-To: References: Message-ID: <45078CC6.7050505@v.loewis.de> Fredrik Lundh schrieb:
> today's Python supports "locale aware" 8-bit strings; e.g.
>
> >>> import locale
> >>> "åäö".isalpha()
> False
> >>> locale.setlocale(locale.LC_ALL, "sv_SE")
> 'sv_SE'
> >>> "åäö".isalpha()
> True
>
> to what extent should this be supported by Python 3000 ?
I would like to see locale-aware operations, but with an explicit locale, e.g.

import locale
l = locale.load(locale.LC_ALL, "sv_SE")
print l.isalpha("åäö")

(i.e. character properties become locale methods, not string methods). To implement that, we would have to incorporate ICU, which would be a tough decision to make (or have our own implementation based on the tables that ICU uses). Alternatively, we could try to get such locale objects from system APIs where available (e.g. in glibc), and not provide them on systems that don't have locale objects in their APIs. Regards, Martin From martin at v.loewis.de Wed Sep 13 06:51:30 2006 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Wed, 13 Sep 2006 06:51:30 +0200 Subject: [Python-3000] locale-aware strings ? In-Reply-To: <44FDBDFF.7090505@sweetapp.com> References: <44F8FEED.9000600@gmail.com> <44FC4B5B.9010508@blueyonder.co.uk> <1cb725390609050908m6150c1cas533a98f6a14cbcd0@mail.gmail.com> <44FDBDFF.7090505@sweetapp.com> Message-ID: <45078E52.9080402@v.loewis.de> Brian Quinlan schrieb: > As a user, I don't have any expectations regarding non-ASCII text files. > > I'm using a US-English version of Windows XP (very common) and I haven't > changed the default encoding (very common). Python claims that my system > encoding is CP436 (from sys.stdin/stdout.encoding). You are misinterpreting the data you see. Python makes no claims about your system encoding in sys.stdout.encoding. Instead, it makes a claim about your terminal's encoding, and that is indeed CP436 (just do "type foo.txt" with a document that contains non-ASCII characters, and watch the characters in the terminal look different from the ones in notepad). It is an unfortunate fact that Windows has *two* system encodings: one used for "Windows", and one used for the "OEM". The terminal uses the OEM code page (by default, unless you run chcp.exe). > I can assure you > that most of the documents that I work with are not in CP436 - they are > a combination of ASCII, ISO8859-1, and UTF-8. I would also guess that > this is true of many Windows XP (US-English) users. So, for me and users > like me, Python is going to silently misinterpret my data. No. It will use a different API to determine the system encoding, and it will guess correctly.
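The two Windows encodings can be inspected directly from Python. A sketch using ctypes (bundled with Python 2.5; Windows-only, and the printed labels are mine):

import ctypes

kernel32 = ctypes.windll.kernel32
print "ANSI code page (GUI apps):   cp%d" % kernel32.GetACP()
print "OEM code page (the console): cp%d" % kernel32.GetOEMCP()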
Regards, Martin From martin at v.loewis.de Wed Sep 13 06:20:12 2006 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Wed, 13 Sep 2006 06:20:12 +0200 Subject: [Python-3000] encoding hell In-Reply-To: <1d85506f0609021529o3a83dccbod0a7a643d39da696@mail.gmail.com> References: <1d85506f0609020853y6cba1b85h552157c2b2614ca2@mail.gmail.com> <44F9E844.2020603@acm.org> <1d85506f0609021529o3a83dccbod0a7a643d39da696@mail.gmail.com> Message-ID: <450786FC.2020808@v.loewis.de> tomer filiba schrieb:
> # read 3 UTF8 *characters*
> f.read(3)
>
> # this will seek by AT LEAST 7 *bytes*, until resynched
> f.substream.seekby(7)
>
> # we can resume reading of UTF8 *characters*
> f.read(3)
>
> heck, i even like this idea :)
Notice that resyncing is a really tricky operation, and should not be expected to work for all encodings. For example, for the iso-2022 encodings, you have to know what character set you are "in", and you have to read forward/backward until you find a character-code switching escape sequence. There is an RFC-imposed requirement that each line of input is "neutral" wrt. character set switching, so you can typically synchronize at a line break. Still, this could require skipping an arbitrary amount of text. Regards, Martin From martin at v.loewis.de Wed Sep 13 06:53:52 2006 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Wed, 13 Sep 2006 06:53:52 +0200 Subject: [Python-3000] locale-aware strings ? In-Reply-To: <44FE1A65.7020900@blueyonder.co.uk> References: <44F8FEED.9000600@gmail.com> <44FC4B5B.9010508@blueyonder.co.uk> <1cb725390609050908m6150c1cas533a98f6a14cbcd0@mail.gmail.com> <44FDBDFF.7090505@sweetapp.com> <44FE1A65.7020900@blueyonder.co.uk> Message-ID: <45078EE0.8090906@v.loewis.de> David Hopwood schrieb: > Cp436 is almost certainly *not* the encoding set by the OS; Python > has got it wrong. Just to repeat myself: Python is *not* wrong, the terminal *indeed* uses CP 436. > If Brian is using an English-language variant of > Windows XP and has not changed the defaults, the system ("ANSI") > encoding will be Cp1252-with-Euro (which is similar enough to ISO-8859-1 > if C1 control characters are not used). Yes, and the OEM encoding will be CP 436. It is common to interpret CP_ACP as the system encoding, yet Windows has two of them, and Python knows very well which one to use in which place. Regards, Martin From martin at v.loewis.de Wed Sep 13 06:22:16 2006 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Wed, 13 Sep 2006 06:22:16 +0200 Subject: [Python-3000] encoding hell In-Reply-To: References: <1d85506f0609020853y6cba1b85h552157c2b2614ca2@mail.gmail.com> <44FA0E59.9010302@canterbury.ac.nz> Message-ID: <45078778.6000407@v.loewis.de> Fredrik Lundh schrieb: >> The best you could do would be to return some kind >> of opaque object from tell() that could be passed >> back to seek(). > > that's how seek/tell works on text files in today's Python, of course. > if you're writing portable code, you can only seek to the beginning or > end of the file, or to a position returned to you by tell. The problem is that for character-oriented streams, that position should also incorporate the "shift state" of the codec. To support that, the codec API would need to grow a way to export and import its state into such "tell objects".
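What those "tell objects" might look like, assuming the incremental decoder grows getstate()/setstate() methods (it has no such methods today, so this is exactly the API growth described above; every name here is hypothetical):

class TextPosition(object):
    # Opaque token pairing a byte offset with the codec's shift state.
    def __init__(self, byte_pos, decoder_state):
        self.byte_pos = byte_pos
        self.decoder_state = decoder_state

def tell_text(byte_stream, decoder):
    return TextPosition(byte_stream.tell(), decoder.getstate())

def seek_text(byte_stream, decoder, position):
    byte_stream.seek(position.byte_pos)
    decoder.setstate(position.decoder_state)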
Regards, Martin From paul at prescod.net Wed Sep 13 08:10:29 2006 From: paul at prescod.net (Paul Prescod) Date: Tue, 12 Sep 2006 23:10:29 -0700 Subject: [Python-3000] locale-aware strings ? In-Reply-To: <45078E52.9080402@v.loewis.de> References: <44F8FEED.9000600@gmail.com> <44FC4B5B.9010508@blueyonder.co.uk> <1cb725390609050908m6150c1cas533a98f6a14cbcd0@mail.gmail.com> <44FDBDFF.7090505@sweetapp.com> <45078E52.9080402@v.loewis.de> Message-ID: <1cb725390609122310k44b99f9eqb12aec7c5fadf886@mail.gmail.com> On 9/12/06, "Martin v. Löwis" wrote: > > > > I can assure you > > that most of the documents that I work with are not in CP436 - they are > > a combination of ASCII, ISO8859-1, and UTF-8. I would also guess that > > this is true of many Windows XP (US-English) users. So, for me and users > > like me, Python is going to silently misinterpret my data. > > No. It will use a different API to determine the system encoding, and > it will guess correctly. If Python reports "cp1252" as I expect it to, then it has not "guessed correctly" for Brian's documents as described above. The mistake will be harmless for the ASCII files and often for the ISO8859-1 files, but would be dangerous for the UTF-8 ones. Paul Prescod -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.python.org/pipermail/python-3000/attachments/20060912/fa4bea07/attachment.html From brian at sweetapp.com Wed Sep 13 10:06:41 2006 From: brian at sweetapp.com (Brian Quinlan) Date: Wed, 13 Sep 2006 10:06:41 +0200 Subject: [Python-3000] locale-aware strings ? In-Reply-To: <45078E52.9080402@v.loewis.de> References: <44F8FEED.9000600@gmail.com> <44FC4B5B.9010508@blueyonder.co.uk> <1cb725390609050908m6150c1cas533a98f6a14cbcd0@mail.gmail.com> <44FDBDFF.7090505@sweetapp.com> <45078E52.9080402@v.loewis.de> Message-ID: <4507BC11.8040901@sweetapp.com> Martin v. Löwis wrote: >> I can assure you >> that most of the documents that I work with are not in CP436 - they are >> a combination of ASCII, ISO8859-1, and UTF-8. I would also guess that >> this is true of many Windows XP (US-English) users. So, for me and users >> like me, Python is going to silently misinterpret my data. > > No. It will use a different API to determine the system encoding, and > it will guess correctly. You are addressing a completely different issue. I am saying that Python is going to silently misinterpret my *data* and you are saying that it is going to correctly determine the *system encoding*. As a user, I don't directly care if Python guesses my system encoding correctly or not, but I do care if it tries to interpret my UTF-8 documents as Windows-1252 (which will succeed) and I end up transmitting/storing/displaying incorrect data. Cheers, Brian From ajm at flonidan.dk Wed Sep 13 10:27:31 2006 From: ajm at flonidan.dk (Anders J. Munch) Date: Wed, 13 Sep 2006 10:27:31 +0200 Subject: [Python-3000] iostack, second revision Message-ID: <9B1795C95533CA46A83BA1EAD4B01030031F54@flonidanmail.flonidan.net> Josiah Carlson wrote: > "Anders J. Munch" wrote: > > I don't expect file methods and system calls to map one to one, but > > you're right, the first time the length is needed, that's an extra > > system call. > > Every time the length is needed, a system call is required > (you can have > multiple writers of the same file)... Point taken. It's very rarely a good idea to do so, but the possibility of multiple writers shouldn't be ignored. Still there is no real performance issue.
If anything, replacing f.seek(0,2);f.tell() with f.length in various places might save a few system calls. > > Flushing during seek is important. By not flushing during > seek in your > FileBytes object, you are unnecessarily delaying writes, which could > cause file corruption. That's what the flush method is for. The real reason seek implies flush is to save the library author the bother of getting the interactions between input and output buffering right. Anyway, FileBytes has no seek and no concept of current file position, so I really don't know what you're talking about :) - Anders From john at yates-sheets.org Wed Sep 13 15:24:00 2006 From: john at yates-sheets.org (John S. Yates, Jr.) Date: Wed, 13 Sep 2006 09:24:00 -0400 Subject: [Python-3000] Pre-PEP: Easy Text File Decoding In-Reply-To: <1cb725390609111816k7187a15eu91c4029dad65885d@mail.gmail.com> References: <20060911094447.hjabms9i7sgs4g8s@login.werra.lunarpages.com> <79990c6b0609111449j1ae99757m35037b27c5727eda@mail.gmail.com> <1cb725390609111530o3f7b0faah7d87a8cdf532cf39@mail.gmail.com> <87r6yij7ea.fsf@qrnik.zagroda> <1cb725390609111816k7187a15eu91c4029dad65885d@mail.gmail.com> Message-ID: On Mon, 11 Sep 2006 18:16:15 -0700, "Paul Prescod" wrote: > UTF-8 with BOM is the Microsoft preferred format. I believe this is a gloss. Microsoft uses UTF-16. Because the basic character unit is larger than one byte it is crucial for interoperability to prefix a string of UTF-16 text with an indication of the order of bytes in each two-byte unit. This is the role of the BOM. The BOM is not part of the text. It is a wrapper or envelope. It is a mistake on Microsoft's part to fail to strip the BOM during conversion to UTF-8. There is no MEANINGFUL definition of BOM in a UTF-8 string. But instead of stripping the wrapper and converting only the text payload Microsoft lazily treats both the wrapper and its payload as text. You can see the logical fallacy if you imagine emitting UTF-16 text in an environment of one byte sex, reducing that text to UTF-8, carrying it to an environment of the other byte sex and raising it back to UTF-16. The Unicode.org assumption is that on generation one organizes the bytes of UTF-16 or UTF-32 units according to what is most convenient for a given environment. One prefixes a BOM to text objects to be persisted or passed to differing byte-sex environments. Such an object is not a string but a means of inter-operation. If the BOMs are not stripped during reduction to UTF-8 and are reconstituted during raising to UTF-16 or UTF-32 then raising must honor the BOM and the Unicode.org efficiency objective is subverted. You can take this further and imagine concatenating two UTF-8 strings, one originally UTF-16 generated in a little-endian environment, the other originally UTF-16 generated in a big- endian environment. If the BOMs are not pre-stripped then during raising of the concatenated result to UTF-16 you will get an object with embedded BOMs. This is not meaningful. What does it mean within a UTF-16 string to encounter a BOM that contradicts the wrapper/envelope? Does this mean that any correct UTF-16 utility must cope with a hybrid object whose byte order potentially changes mid-stride? /john, who has written a database loader that has to contend with (and clearly diagnoses) BOM in UTF-8 strings.
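Incidentally, the strip-on-decode behaviour argued for here is available in Python 2.5 as the 'utf-8-sig' codec; a short demonstration (a sketch; the byte strings are Python 2 str objects):

import codecs

data = codecs.BOM_UTF8 + 'payload'    # the BOM is envelope, not text
print repr(data.decode('utf-8'))      # u'\ufeffpayload' -- BOM kept
print repr(data.decode('utf-8-sig'))  # u'payload' -- BOM stripped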
From jimjjewett at gmail.com Wed Sep 13 15:34:47 2006 From: jimjjewett at gmail.com (Jim Jewett) Date: Wed, 13 Sep 2006 09:34:47 -0400 Subject: [Python-3000] string C API In-Reply-To: <45078B46.90408@v.loewis.de> References: <45078B46.90408@v.loewis.de> Message-ID: On 9/13/06, "Martin v. Löwis" wrote: > Fredrik Lundh schrieb: > > just noticed that PEP 3100 says that PyString_AsEncodedString and > > PyString_AsDecodedString are to be removed, but it doesn't mention > > any other PyString (or PyUnicode) functions. > > how large changes can we make here, really ? > All API that refers to the internal representation should be > changed or removed; in theory, that could be all API that has > char* arguments. This is sufficient to allow polymorphic strings -- including strings whose data is implemented as a view into some other object. > For example, PyString_From{String[AndSize]|Format} would either: > - have to grow an encoding argument > - assume a default encoding (either ASCII or UTF-8) > - change its signature to operate on Py_UNICODE* (although > we don't have literals for these) or > - be removed Should encoding be an attribute of the string? If so, should recoding require the creation of a new string (in the new encoding)? -jJ From qrczak at knm.org.pl Wed Sep 13 15:37:05 2006 From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk) Date: Wed, 13 Sep 2006 15:37:05 +0200 Subject: [Python-3000] Pre-PEP: Easy Text File Decoding In-Reply-To: (John S. Yates, Jr.'s message of "Wed, 13 Sep 2006 09:24:00 -0400") References: <20060911094447.hjabms9i7sgs4g8s@login.werra.lunarpages.com> <79990c6b0609111449j1ae99757m35037b27c5727eda@mail.gmail.com> <1cb725390609111530o3f7b0faah7d87a8cdf532cf39@mail.gmail.com> <87r6yij7ea.fsf@qrnik.zagroda> <1cb725390609111816k7187a15eu91c4029dad65885d@mail.gmail.com> Message-ID: <87u03bq45a.fsf@qrnik.zagroda> "John S. Yates, Jr." writes: > It is a mistake on Microsoft's part to fail to strip the BOM > during conversion to UTF-8. There is no MEANINGFUL definition > of BOM in a UTF-8 string. But instead of stripping the wrapper > and converting only the text payload Microsoft lazily treats > both the wrapper and its payload as text. The Unicode standard is at fault too. It specifies UTF-16 and UTF-32 in two variants:
- UTF-{16,32} with an optional BOM (defaulting to big endian if the BOM is not present), where the BOM is mandatory if the first character of the contents is U+FEFF (otherwise it would be mistaken as a BOM).
- UTF-{16,32}{LE,BE} with a fixed endianness and without a BOM; a U+FEFF in UTF-16BE must not be interpreted as a BOM, it's always a part of the text.
The problem is that it's not clear in the case of UTF-8. Formally it doesn't have a BOM, but the standard includes some ambiguous wording noting that various software uses a UTF-8 BOM and that the presence of a BOM should not affect the interpretation. It should clearly distinguish two interpretations of UTF-8: one without the concept of a BOM, and one which permits the BOM (and in fact makes it mandatory if the stream begins with U+FEFF).
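The distinction is easy to see with the codecs themselves (output from a little-endian CPython; only the first line depends on the platform's byte order):

print repr(u'A'.encode('utf-16'))     # '\xff\xfeA\x00': BOM prepended
print repr(u'A'.encode('utf-16-le'))  # 'A\x00': fixed endianness, no BOM

data = '\xfe\xff\x00A'                # big-endian bytes for U+FEFF, 'A'
print repr(data.decode('utf-16'))     # u'A': leading U+FEFF eaten as a BOM
print repr(data.decode('utf-16-be'))  # u'\ufeffA': U+FEFF is part of the text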
-- __("< Marcin Kowalczyk \__/ qrczak at knm.org.pl ^^ http://qrnik.knm.org.pl/~qrczak/ From martin at v.loewis.de Wed Sep 13 17:15:47 2006 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Wed, 13 Sep 2006 17:15:47 +0200 Subject: [Python-3000] string C API In-Reply-To: References: <45078B46.90408@v.loewis.de> Message-ID: <450820A3.4000302@v.loewis.de> Jim Jewett schrieb: >> For example, PyString_From{String[AndSize]|Format} would either: >> - have to grow an encoding argument >> - assume a default encoding (either ASCII or UTF-8) >> - change its signature to operate on Py_UNICODE* (although >> we don't have literals for these) or >> - be removed > > Should encoding be an attribute of the string? No. A Python string is a sequence of Unicode characters. Even if it was created by converting from some other encoding, that original encoding gets lost when doing the conversion (just like integers don't remember which base they were originally represented in). Regards, Martin From jcarlson at uci.edu Wed Sep 13 18:21:52 2006 From: jcarlson at uci.edu (Josiah Carlson) Date: Wed, 13 Sep 2006 09:21:52 -0700 Subject: [Python-3000] iostack, second revision In-Reply-To: <9B1795C95533CA46A83BA1EAD4B01030031F54@flonidanmail.flonidan.net> References: <9B1795C95533CA46A83BA1EAD4B01030031F54@flonidanmail.flonidan.net> Message-ID: <20060913084256.F930.JCARLSON@uci.edu> "Anders J. Munch" wrote: > Josiah Carlson wrote: > > "Anders J. Munch" wrote: > > > I don't expect file methods and systems calls to map one to one, but > > > you're right, the first time the length is needed, that's an extra > > > system call. > > > > Every time the length is needed, a system call is required > > (you can have > > multiple writers of the same file)... > > Point taken. It's very rarely a good idea to do so, but the > possibility of multiple writers shouldn't be ignored. Still there is > no real performance issue. If anything, replacing > f.seek(0,2);f.tell() with f.length in various places might save a few > system calls. Any sane person uses os.stat(f.name) or os.fstat(f.fileno()), unless they want to seek to the end of the file for later writing or expected reading of data yet-to-be-written. Interesting that both of these cases basically read and write to the same file at the same time (perhaps even in the same process), something you yourself said, "In all my programming days I don't believe I written to and read from the same file handle even once. Use cases exist, like if you're implementing a DBMS..." > > Flushing during seek is important. By not flushing during > > seek in your > > FileBytes object, you are unnecessarily delaying writes, which could > > cause file corruption. > > That's what the flush method is for. The real reason seek implies > flush is to save the library author the bother of getting the > interactions between input and output buffering right. > Anyway, FileBytes has no seek and no concept of current file position, > so I really don't know what you're talking about :) I was talking about your earlier statement, which I quoted in my earlier reply to you: > My micro-optimisation circuitry blew a fuse when I discovered that > seek always implies flush. You won't get good performance out of code > that does a lot of seeks, whatever you do. Use my upcoming FileBytes > class :) And with the context of a previous message from you: > FileBytes would support the sequence protocol, mimicking bytes objects. 
> It would support random-access read and write using __getitem__ and > __setitem__, allowing slice assignment for slices of equal size. And > there would be append() to extend the file, and partial __delitem__ > support for truncating. While it doesn't have the methods seek or tell, the underlying implementation needs to use seek and tell (or a memory-mapped file, mmap). You were also talking about buffering writes to reduce the overhead of the underlying seeks and tells because of apparent "optimizations" you wanted to make. Here is a data integrity optimization you can make for me: flush when accessing the file non-sequentially, any other behavior could corrupt the data of users who have been relying on "seek implies flush". I would also mention that your FileBytes class is essentially a fake memory-mapped file, and while I also have implemented an equivalent class (for low-memory testing purposes in a DBMS-like situation), I find that using an mmap to be far faster and generally more reliable (and usable with buffer()) than my FileBytes equivalent, never mind that the vast majority of users don't want a sequence interface to a file, they want a stream interface; which is why you don't see many FileBytes-like objects out in the wild, or really anyone suggesting such a wrapper object be in the standard library. With that said, I'm not sure your FileBytes object is really necessary or desired for the future io library. If people want that kind of an interface, they can use mmap (and push for the various mmap bugs/feature requests to be fixed), otherwise they should be using readable / writable / both streams, something that Tomer has been working towards. - Josiah From jcarlson at uci.edu Wed Sep 13 18:41:01 2006 From: jcarlson at uci.edu (Josiah Carlson) Date: Wed, 13 Sep 2006 09:41:01 -0700 Subject: [Python-3000] Pre-PEP: Easy Text File Decoding In-Reply-To: References: <1cb725390609111816k7187a15eu91c4029dad65885d@mail.gmail.com> Message-ID: <20060913092509.F933.JCARLSON@uci.edu> "John S. Yates, Jr." wrote: > > On Mon, 11 Sep 2006 18:16:15 -0700, "Paul Prescod" wrote: > > > UTF-8 with BOM is the Microsoft preferred format. > > I believe this is a gloss. Microsoft uses UTF-16. Because > the basic character unit is larger than one byte it is crucial > for interoperability to prefix a string of UTF-16 text with an > indication of the order of bytes in each two byte unit. This > is the role of the BOM. The BOM is not part of the text. It > is a wrapper or envelope. > > It is a mistake on Microsoft's part to fail to strip the BOM > during conversion to UTF-8. There is no MEANINGFUL definition > of BOM in a UTF-8 string. But instead of stripping the wrapper > and converting only the text payload Microsoft lazily treats > both the wrapper and its payload as text. I have actually had a variant of this particular discussion with Walter D?rwald. He brought up RCF 3629... [Walter D?rwald] I don't think it does. RFC 3629 isn't that clear about whether an initial 0xEF 0xBB 0xBF sequence is to be interpreted as an encoding signature or a ZWNBSP. But I think the following part of RFC 3629 applies here for Python source code: o A protocol SHOULD also forbid use of U+FEFF as a signature for those textual protocol elements for which the protocol provides character encoding identification mechanisms, when it is expected that implementations of the protocol will be in a position to always use the mechanisms properly. 
This will be the case when the protocol elements are maintained tightly under the control of the implementation from the time of their creation to the time of their (properly labeled) transmission. [My reply, slightly altered for this context] Because not all tools that may manipulate data consumed and/or produced by Python follow the coding: directive, then "the protocol elements" are not 'tightly maintained', so the inclusion of a "BOM" for utf-8 is a necessary "protocol element", at least for .py files, and certainly suggested for other file types that _may not have_ the equivalent of a Python coding: directive. Explicit is better than implicit, and in this case we have the opportunity to be explicit about the "envelope" or "the protocol elements", which will guarantee proper interpretation by non-braindead software. Braindead software that doesn't understand a utf-* BOM should be fixed by the developer or eschewed. > You can take this further and imagine concatenating two UTF-8 > strings, one originally UTF-16 generated in a little-endian > environment, the other originally UTF-16 generated in a big- > endian environment. If the BOMs are not pre-stripped then > during raising of the concatenated result to UTF-16 you will > get an object with embedded BOMs. This is not meaningful. And is generally ignored, as per unicode spec; it's a "zero width non-breaking space" - an invisible character with no effect on wrapping or otherwise. > What does it mean within a UTF-16 string to encounter a BOM > that contradicts the wrapper/envelope? Does this mean that > any correct UTF-16 utility must cope with a hybrid object whose > byte order potentially changes mid-stride? Unless you are doing something wrong (like literally concatenating the byte representations of a utf-16be and utf-16le encoded text), this won't happen. > /john, who has written a database loader that has to contend > with (and clearly diagnoses) BOM in UTF-8 strings. Being that BOMs are only supposed to be seen as a BOM if they are literally the first few bytes in a string, I certainly hope you didn't spend too much time on that support. - Josiah (who has written an editor with support for all UTF variants with BOM, and UTF-8 + all other localized encodings using coding: directives) From paul at prescod.net Wed Sep 13 18:44:18 2006 From: paul at prescod.net (Paul Prescod) Date: Wed, 13 Sep 2006 09:44:18 -0700 Subject: [Python-3000] Pre-PEP: Easy Text File Decoding In-Reply-To: References: <20060911094447.hjabms9i7sgs4g8s@login.werra.lunarpages.com> <79990c6b0609111449j1ae99757m35037b27c5727eda@mail.gmail.com> <1cb725390609111530o3f7b0faah7d87a8cdf532cf39@mail.gmail.com> <87r6yij7ea.fsf@qrnik.zagroda> <1cb725390609111816k7187a15eu91c4029dad65885d@mail.gmail.com> Message-ID: <1cb725390609130944k20c315c1k482ff1bd7cc5a85a@mail.gmail.com> On 9/13/06, John S. Yates, Jr. wrote: > > On Mon, 11 Sep 2006 18:16:15 -0700, "Paul Prescod" wrote: > > > UTF-8 with BOM is the Microsoft preferred format. > > It is a mistake on Microsoft's part to fail to strip the BOM > during conversion to UTF-8. There is no MEANINGFUL definition > of BOM in a UTF-8 string. That is not true. Q: Can a UTF-8 data stream contain the BOM character (in UTF-8 form)? If yes, then can I still assume the remaining UTF-8 bytes are in big-endian order? A: Yes, UTF-8 can contain a BOM. However, it makes *no* difference as to the endianness of the byte stream. UTF-8 always has the same byte order. An initial BOM is *only* used as a signature --
an indication that an otherwise unmarked text file is in UTF-8. This is a very valuable function and applications like Microsoft's Notepad, Apple's TextEdit and VIM take good advantage of it. """ Vim will try to detect what kind of file you are editing. It uses the encoding names in the 'fileencodings' option. When using Unicode , the default value is: "ucs-bom,utf-8,latin1". This means that Vim checks the file to see if it's one of these encodings: ucs-bom File must start with a Byte Order Mark (BOM). This allows detection of 16-bit, 32-bit and utf-8 Unicode encodings. utf-8 utf-8 Unicode . This is rejected when a sequence of bytes is illegal in utf-8 . latin1 The good old 8-bit encoding. """ I'm pretty much proposing this same algorithm for Python's encoding guessing. Paul Prescod -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.python.org/pipermail/python-3000/attachments/20060913/87ce70e9/attachment.html From jimjjewett at gmail.com Wed Sep 13 19:09:27 2006 From: jimjjewett at gmail.com (Jim Jewett) Date: Wed, 13 Sep 2006 13:09:27 -0400 Subject: [Python-3000] string C API In-Reply-To: <450820A3.4000302@v.loewis.de> References: <45078B46.90408@v.loewis.de> <450820A3.4000302@v.loewis.de> Message-ID: On 9/13/06, "Martin v. L?wis" wrote: > > Should encoding be an attribute of the string? > No. A Python string is a sequence of Unicode characters. > Even if it was created by converting from some other encoding, > that original encoding gets lost when doing the conversion > (just like integers don't remember which base they were originally > represented in). Theoretically, it is a sequence of code points. Today, in python 2.x, these are always represented by a specific (wide, fixed-width) concrete encoding, chosen at compile time. This is required so long as outside code can access the data buffer directly. It would no longer be required if all access were through unicode methods. (And it would probably make sense to have a "get-me-the-buffer-in-this-encoding" method.) Several people seem to want more efficient representations when possible. Several people seem to want UTF-8, which makes sense if the rest of the system is UTF8, but complicates the implementation. Simply not encoding/decoding until required would save quite a bit of time and space -- but then the object would need some way of indicating which encoding it is in. -jJ From martin at v.loewis.de Wed Sep 13 19:14:30 2006 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Wed, 13 Sep 2006 19:14:30 +0200 Subject: [Python-3000] string C API In-Reply-To: References: <45078B46.90408@v.loewis.de> <450820A3.4000302@v.loewis.de> Message-ID: <45083C76.8010302@v.loewis.de> Jim Jewett schrieb: > Simply not encoding/decoding until required would save quite a bit of > time and space -- but then the object would need some way of > indicating which encoding it is in. Try implementing that some time. You'll find it will be incredibly complex and unmaintainable. Start with implementing len(s). Regards, Martin From jimjjewett at gmail.com Wed Sep 13 19:27:28 2006 From: jimjjewett at gmail.com (Jim Jewett) Date: Wed, 13 Sep 2006 13:27:28 -0400 Subject: [Python-3000] string C API In-Reply-To: <45083C76.8010302@v.loewis.de> References: <45078B46.90408@v.loewis.de> <450820A3.4000302@v.loewis.de> <45083C76.8010302@v.loewis.de> Message-ID: On 9/13/06, "Martin v. 
L?wis" wrote: > Jim Jewett schrieb: > > Simply not encoding/decoding until required would save quite a bit of > > time and space -- but then the object would need some way of > > indicating which encoding it is in. > Try implementing that some time. You'll find it will be incredibly > complex and unmaintainable. Start with implementing len(s). Simply delegate such methods to a hidden per-encoding subclass. The UTF-8 methods will indeed be complex, unless the solution is simply "someone called indexing/slicing/len, so I have to recode after all." The Latin-1 encoding will have no such problem. -jJ From guido at python.org Wed Sep 13 20:06:05 2006 From: guido at python.org (Guido van Rossum) Date: Wed, 13 Sep 2006 11:06:05 -0700 Subject: [Python-3000] sys.stdin and sys.stdout with textfile In-Reply-To: <45062AD2.1090207@canterbury.ac.nz> References: <1157898432.4246.161.camel@fsol> <45062AD2.1090207@canterbury.ac.nz> Message-ID: On 9/11/06, Greg Ewing wrote: > Guido van Rossum wrote: > > > All sorts of things are different when reading stdin vs. opening a > > filename. e.g. stdin may be a pipe. > > Which suggests that if anything is going to try > to guess the encoding, it would be better for it > to start reading from the actual stream you're > going to use and buffer the result, rather than > rely on being able to open it separately. Right. The filename is useless. The stream may or may not be seekable (sometimes even stdin is!). Having a buffering layer in between would make it possible to peek ahead in the buffer. -- --Guido van Rossum (home page: http://www.python.org/~guido/) From martin at v.loewis.de Wed Sep 13 20:09:42 2006 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Wed, 13 Sep 2006 20:09:42 +0200 Subject: [Python-3000] string C API In-Reply-To: References: <45078B46.90408@v.loewis.de> <450820A3.4000302@v.loewis.de> <45083C76.8010302@v.loewis.de> Message-ID: <45084966.3000608@v.loewis.de> Jim Jewett schrieb: > Simply delegate such methods to a hidden per-encoding subclass. > > The UTF-8 methods will indeed be complex, unless the solution is > simply "someone called indexing/slicing/len, so I have to recode after > all." > > The Latin-1 encoding will have no such problem. I'm not so much worried about UTF-8 or Latin-1; they are fairly trivial. Efficiency of such methods for multi-byte encodings would be dramatically slow. Regards, Martin From jason.orendorff at gmail.com Wed Sep 13 20:23:33 2006 From: jason.orendorff at gmail.com (Jason Orendorff) Date: Wed, 13 Sep 2006 14:23:33 -0400 Subject: [Python-3000] Pre-PEP: Easy Text File Decoding In-Reply-To: References: <20060911094447.hjabms9i7sgs4g8s@login.werra.lunarpages.com> <79990c6b0609111449j1ae99757m35037b27c5727eda@mail.gmail.com> <1cb725390609111530o3f7b0faah7d87a8cdf532cf39@mail.gmail.com> <87r6yij7ea.fsf@qrnik.zagroda> <1cb725390609111816k7187a15eu91c4029dad65885d@mail.gmail.com> Message-ID: On 9/13/06, John S. Yates, Jr. wrote: > It is a mistake on Microsoft's part to fail to strip the BOM > during conversion to UTF-8. John, you're mistaken about the reason this BOM is here. In Notepad at least, the BOM is intentionally generated when writing the file. It's not a "mistake" or "laziness". It's metadata. (I admit the BOM was not originally invented for this purpose.) > There is no MEANINGFUL definition of BOM in a UTF-8 > string. This thread is about files, not strings. At the start of a file, a UTF-8 BOM is meaningful. It means the file is UTF-8. 
On Windows, there's a system default encoding, and it's never UTF-8. Notepad writes the BOM so that later, when you open the file in Notepad again, it can identify the file as UTF-8. > You can see the logical fallacy if you imagine emitting UTF-16 > text in an environment of one byte sex, reducing that text to > UTF-8, carrying it to an environment of the other byte sex and > raising it back to UTF-16. It sounds as if you think this will corrupt the BOM, but it works fine: >>> import codecs # "Emitting UTF-16 text" in little-endian environment >>> s1 = codecs.BOM_UTF16_LE + u'hello world'.encode('utf-16-le') # "Reducing that text to UTF-8" >>> s2 = s1.decode('utf-16-le').encode('utf-8') >>> s2 '\xef\xbb\xbfhello world' # "Raising it back to UTF-16" in big-endian environment >>> s3 = s2.decode('utf-8').encode('utf-16-be') >>> s3[:2] == codecs.BOM_UTF16_BE True The BOM is still correct: the data is UTF-16-BE, and the BOM agrees. A UTF-8 string or file will contain exactly the same bytes (including the BOM, if any) whether it is generated from UTF-16-BE or -LE. All three are lossless representations in bytes of the same abstract ideal, which is a sequence of Unicode codepoints. -j From rasky at develer.com Wed Sep 13 22:09:38 2006 From: rasky at develer.com (Giovanni Bajo) Date: Wed, 13 Sep 2006 22:09:38 +0200 Subject: [Python-3000] educational aspects of Python 3000 References: <20060911112215.odruja6fdgl0kcg4@login.werra.lunarpages.com><4506351D.4040109@canterbury.ac.nz> <740c3aec0609121556x12586796wa0888b8284eb94f@mail.gmail.com> Message-ID: <05df01c6d770$8a97c8f0$43492597@bagio> Björn Lindqvist wrote: >>> The idea of a standard edu library though is a GREAT one. >>> [...] >> I disagree for two reasons: >> >> 1) Even a single line of boilerplate is too much >> when you're trying to pare things down to the >> bare minimum for a beginner. >> >> 2) It teaches a bad habit right from the >> beginning (i.e. using 'import *'). This is the >> wrong foot to start a beginner off on. > > I agree. For an absolute newbie, Python's import semantics are way, WAY > down the road long after variables, numbers, strings, comments, > control statements, functions etc. A third reason is that if these > functions are packaged in a beginnerlib module, then you would have to > type "from beginnerlib import *" each and every time you want to use > raw_input() from the Python console. Another solution would be to have a special "python --edu" command line option which automatically star-imports the beginnerlib before the interactive mode starts. Or a PYTHONEDU=1 env. Or a custom site.py which patches __builtins__. Giovanni Bajo From solipsis at pitrou.net Wed Sep 13 22:33:22 2006 From: solipsis at pitrou.net (Antoine Pitrou) Date: Wed, 13 Sep 2006 22:33:22 +0200 Subject: [Python-3000] BOM handling In-Reply-To: <20060913092509.F933.JCARLSON@uci.edu> References: <1cb725390609111816k7187a15eu91c4029dad65885d@mail.gmail.com> <20060913092509.F933.JCARLSON@uci.edu> Message-ID: <1158179602.4721.24.camel@fsol> On Wednesday 13 September 2006 at 09:41 -0700, Josiah Carlson wrote: > And is generally ignored, as per unicode spec; it's a "zero width > non-breaking space" - an invisible character with no effect on wrapping > or otherwise. Well it would be better if Py3K (with all strings unicode) makes things easy for the programmer and abstracts away those "invisible characters with no textual meaning".
Currently it's not the case: >>> a = "hello".decode("utf-8") >>> b = (codecs.BOM_UTF8 + "hello").decode("utf-8") >>> len(a) 5 >>> len(b) 6 >>> a == b False >>> a = "hello".encode("utf-16le").decode("utf-16le") >>> b = (codecs.BOM_UTF16_LE + "hello".encode("utf-16le")).decode("utf-16le") >>> len(a) 5 >>> len(b) 6 >>> a == b False >>> a u'hello' >>> b u'\ufeffhello' >>> print a hello >>> print b Traceback (most recent call last): File "<stdin>", line 1, in ? File "/usr/lib/python2.4/encodings/iso8859_15.py", line 18, in encode return codecs.charmap_encode(input,errors,encoding_map) UnicodeEncodeError: 'charmap' codec can't encode character u'\ufeff' in position 0: character maps to <undefined> Regards Antoine. From g.brandl at gmx.net Wed Sep 13 22:45:27 2006 From: g.brandl at gmx.net (Georg Brandl) Date: Wed, 13 Sep 2006 22:45:27 +0200 Subject: [Python-3000] BOM handling In-Reply-To: <1158179602.4721.24.camel@fsol> References: <1cb725390609111816k7187a15eu91c4029dad65885d@mail.gmail.com> <20060913092509.F933.JCARLSON@uci.edu> <1158179602.4721.24.camel@fsol> Message-ID: Antoine Pitrou wrote: > On Wednesday 13 September 2006 at 09:41 -0700, Josiah Carlson wrote: >> And is generally ignored, as per unicode spec; it's a "zero width >> non-breaking space" - an invisible character with no effect on wrapping >> or otherwise. > > Well it would be better if Py3K (with all strings unicode) makes things > easy for the programmer and abstracts away those "invisible characters > with no textual meaning". Currently it's not the case: > >>>> a = "hello".decode("utf-8") >>>> b = (codecs.BOM_UTF8 + "hello").decode("utf-8") >>>> len(a) > 5 >>>> len(b) > 6 >>>> a == b > False This behavior is questionable... >>>> a = "hello".encode("utf-16le").decode("utf-16le") >>>> b = (codecs.BOM_UTF16_LE + "hello".encode("utf-16le")).decode("utf-16le") >>>> len(a) > 5 >>>> len(b) > 6 ... while this is IMHO not. UTF-16LE does not have a BOM as byte order is already specified by the encoding. The correct example is b = (codecs.BOM_UTF16_LE + "hello".encode("utf-16le")).decode("utf-16") b then equals u"hello", as it should. "hello".encode("utf-16") prepends a BOM itself. Georg From walter at livinglogic.de Thu Sep 14 00:05:31 2006 From: walter at livinglogic.de (=?ISO-8859-1?Q?Walter_D=F6rwald?=) Date: Thu, 14 Sep 2006 00:05:31 +0200 Subject: [Python-3000] Pre-PEP: Easy Text File Decoding In-Reply-To: References: <20060911094447.hjabms9i7sgs4g8s@login.werra.lunarpages.com> <79990c6b0609111449j1ae99757m35037b27c5727eda@mail.gmail.com> <1cb725390609111530o3f7b0faah7d87a8cdf532cf39@mail.gmail.com> <87r6yij7ea.fsf@qrnik.zagroda> <1cb725390609111816k7187a15eu91c4029dad65885d@mail.gmail.com> Message-ID: <450880AB.1040104@livinglogic.de> Jason Orendorff wrote: > On 9/13/06, John S. Yates, Jr. wrote: >> It is a mistake on Microsoft's part to fail to strip the BOM >> during conversion to UTF-8. > > John, you're mistaken about the reason this BOM is here. > > In Notepad at least, the BOM is intentionally generated when writing > the file. It's not a "mistake" or "laziness". It's metadata. (I > admit the BOM was not originally invented for this purpose.) In theory it's only metadata if external information says that it is, in practice it's unlikely that a charmap encoded file begins with these three bytes. Nevertheless it's only a hint. >> There is no MEANINGFUL definition of BOM in a UTF-8 >> string. > > This thread is about files, not strings. At the start of a file, a > UTF-8 BOM is meaningful. It means the file is UTF-8. ...
and the first "character" in the file is U+FEFF. If you want the codec to drop the BOM on reading, use the UTF-8-Sig codec. > [...] Servus, Walter From jcarlson at uci.edu Thu Sep 14 01:14:29 2006 From: jcarlson at uci.edu (Josiah Carlson) Date: Wed, 13 Sep 2006 16:14:29 -0700 Subject: [Python-3000] BOM handling In-Reply-To: <1158179602.4721.24.camel@fsol> References: <20060913092509.F933.JCARLSON@uci.edu> <1158179602.4721.24.camel@fsol> Message-ID: <20060913153900.F936.JCARLSON@uci.edu> Antoine Pitrou wrote: > > > Le mercredi 13 septembre 2006 ? 09:41 -0700, Josiah Carlson a ?crit : > > And is generally ignored, as per unicode spec; it's a "zero width > > non-breaking space" - an invisible character with no effect on wrapping > > or otherwise. > > Well it would be better if Py3K (with all strings unicode) makes things > easy for the programmer and abstracts away those "invisible characters > with no textual meaning". Currently it's not the case: > >>> a = "hello".decode("utf-8") > >>> b = (codecs.BOM_UTF8 + "hello").decode("utf-8") > >>> len(a) > 5 > >>> len(b) > 6 > >>> a == b > False I had also had this particular discussion with another individual previously (but I can't seem to find it in my archive), and one point brought up was that apparently Python 2.5 was supposed to have a variant codec for utf-8 that automatically stripped at most one \ufeff character from the beginning of decoded output and added it during encoding, similar to how the generic 'utf-16' and 'utf-32' codecs add and strip: >>> u'hello'.encode('utf-16') '\xff\xfeh\x00e\x00l\x00l\x00o\x00' >>> len(u'hello'.encode('utf-16').decode('utf-16')) 5 >>> I'm unable to find that particular utf-8 codec in the version of Python 2.5 I have installed, but I may not be looking in the right places, or spelling it the right way. In any case, I believe that the above behavior is correct for the context. Why? Because utf-8 has no endianness, its 'generic' decoding spelling of 'utf-8' is analagous to all three 'utf-16', 'utf-16-be', and 'utf-16-le' decoding spellings; two of which don't strip. > >>> a = "hello".encode("utf-16le").decode("utf-16le") > >>> b = (codecs.BOM_UTF16_LE + "hello".encode("utf-16le")).decode("utf-16le") > >>> len(a) > 5 > >>> len(b) > 6 > >>> a == b > False Georg Brandl responded to this example already. > >>> a > u'hello' > >>> b > u'\ufeffhello' > >>> print a > hello > >>> print b > Traceback (most recent call last): > File "", line 1, in ? > File "/usr/lib/python2.4/encodings/iso8859_15.py", line 18, in encode > return codecs.charmap_encode(input,errors,encoding_map) > UnicodeEncodeError: 'charmap' codec can't encode character u'\ufeff' in position 0: character maps to There are two answers to this particular "problem". Either that is expected and desireable behavior for all non-utf encodings, or all non-utf encodings need to gain a mapping of the feff code point to the empty string. I think the behavior is expected and desireable. Why? Because none of the non-utf encodings have a valid and round-trip-able representation for the feff code point. Also, if you want to print possibly arbitrary unicode strings to the console, you may consider encoding the unicode string first, offering either 'ignore' or 'replace' as the second argument. 
- Josiah From david.nospam.hopwood at blueyonder.co.uk Thu Sep 14 01:36:50 2006 From: david.nospam.hopwood at blueyonder.co.uk (David Hopwood) Date: Thu, 14 Sep 2006 00:36:50 +0100 Subject: [Python-3000] Pre-PEP: Easy Text File Decoding In-Reply-To: References: <20060911094447.hjabms9i7sgs4g8s@login.werra.lunarpages.com> <79990c6b0609111449j1ae99757m35037b27c5727eda@mail.gmail.com> <1cb725390609111530o3f7b0faah7d87a8cdf532cf39@mail.gmail.com> <87r6yij7ea.fsf@qrnik.zagroda> <1cb725390609111816k7187a15eu91c4029dad65885d@mail.gmail.com> Message-ID: <45089612.4060007@blueyonder.co.uk> Jason Orendorff wrote: > On 9/13/06, John S. Yates, Jr. wrote: > >>It is a mistake on Microsoft's part to fail to strip the BOM >>during conversion to UTF-8. > > John, you're mistaken about the reason this BOM is here. > > In Notepad at least, the BOM is intentionally generated when writing > the file. It's not a "mistake" or "laziness". It's metadata. (I > admit the BOM was not originally invented for this purpose.) > >>There is no MEANINGFUL definition of BOM in a UTF-8 >>string. > > This thread is about files, not strings. At the start of a file, a > UTF-8 BOM is meaningful. It means the file is UTF-8. > > On Windows, there's a system default encoding, and it's never UTF-8. The Windows system encoding can be UTF-8, but only for some locales recently added in Windows 2000/XP, where there was no compatibility constraint to use a non-Unicode encoding. You're correct about the use of a BOM as a signature. All Unicode-conformant applications should accept this use of a BOM in UTF-8 (although they need not generate it); the standard is quite clear on that. -- David Hopwood From solipsis at pitrou.net Thu Sep 14 08:19:00 2006 From: solipsis at pitrou.net (Antoine Pitrou) Date: Thu, 14 Sep 2006 08:19:00 +0200 Subject: [Python-3000] BOM handling In-Reply-To: <20060913153900.F936.JCARLSON@uci.edu> References: <20060913092509.F933.JCARLSON@uci.edu> <1158179602.4721.24.camel@fsol> <20060913153900.F936.JCARLSON@uci.edu> Message-ID: <1158214740.5863.19.camel@fsol> Hi, On Wednesday 13 September 2006 at 16:14 -0700, Josiah Carlson wrote: > In any case, I believe that the above behavior is correct for the > context. Why? Because utf-8 has no endianness, its 'generic' decoding > spelling of 'utf-8' is analogous to all three 'utf-16', 'utf-16-be', and > 'utf-16-le' decoding spellings; two of which don't strip. Your opinion is probably valid from a theoretical point of view. You are more knowledgeable than me. My point was different: most programmers are not at your level (or Paul's level, etc.) when it comes to Unicode knowledge. Py3k's str type is supposed to be an abstracted textual type to make it easy to write unicode-friendly applications (isn't it?). Therefore it should hide the messy issue of superfluous BOMs, unwanted BOMs, etc. Telling the programmer to use a specific UTF-8 variant specialized in BOM-stripping will make eyes roll... "why doesn't the standard UTF-8 do it for me?" Regards Antoine. From walter at livinglogic.de Thu Sep 14 09:12:21 2006 From: walter at livinglogic.de (=?ISO-8859-1?Q?Walter_D=F6rwald?=) Date: Thu, 14 Sep 2006 09:12:21 +0200 Subject: [Python-3000] BOM handling In-Reply-To: <20060913153900.F936.JCARLSON@uci.edu> References: <20060913092509.F933.JCARLSON@uci.edu> <1158179602.4721.24.camel@fsol> <20060913153900.F936.JCARLSON@uci.edu> Message-ID: <450900D5.6050606@livinglogic.de> Josiah Carlson wrote: > Antoine Pitrou wrote: >> >> On Wednesday 13 September 2006 at
09:41 -0700, Josiah Carlson wrote: >>> And is generally ignored, as per unicode spec; it's a "zero width >>> non-breaking space" - an invisible character with no effect on wrapping >>> or otherwise. >> Well it would be better if Py3K (with all strings unicode) makes things >> easy for the programmer and abstracts away those "invisible characters >> with no textual meaning". Currently it's not the case: > >>>>> a = "hello".decode("utf-8") >>>>> b = (codecs.BOM_UTF8 + "hello").decode("utf-8") >>>>> len(a) >> 5 >>>>> len(b) >> 6 >>>>> a == b >> False > > I had also had this particular discussion with another individual > previously (but I can't seem to find it in my archive), and one point > brought up was that apparently Python 2.5 was supposed to have a variant > codec for utf-8 that automatically stripped at most one \ufeff character > from the beginning of decoded output and added it during encoding, > similar to how the generic 'utf-16' and 'utf-32' codecs add and strip: > >>>> u'hello'.encode('utf-16') > '\xff\xfeh\x00e\x00l\x00l\x00o\x00' >>>> len(u'hello'.encode('utf-16').decode('utf-16')) > 5 > > I'm unable to find that particular utf-8 codec in the version of Python > 2.5 I have installed, but I may not be looking in the right places, or > spelling it the right way. It's called "utf-8-sig". > In any case, I believe that the above behavior is correct for the > context. Why? Because utf-8 has no endianness, its 'generic' decoding > spelling of 'utf-8' is analogous to all three 'utf-16', 'utf-16-be', and > 'utf-16-le' decoding spellings; two of which don't strip. Servus, Walter From talin at acm.org Thu Sep 14 10:04:33 2006 From: talin at acm.org (Talin) Date: Thu, 14 Sep 2006 01:04:33 -0700 Subject: [Python-3000] BOM handling In-Reply-To: <1158214740.5863.19.camel@fsol> References: <20060913092509.F933.JCARLSON@uci.edu> <1158179602.4721.24.camel@fsol> <20060913153900.F936.JCARLSON@uci.edu> <1158214740.5863.19.camel@fsol> Message-ID: <45090D11.3060908@acm.org> Antoine Pitrou wrote: > Hi, > > On Wednesday 13 September 2006 at 16:14 -0700, Josiah Carlson wrote: >> In any case, I believe that the above behavior is correct for the >> context. Why? Because utf-8 has no endianness, its 'generic' decoding >> spelling of 'utf-8' is analogous to all three 'utf-16', 'utf-16-be', and >> 'utf-16-le' decoding spellings; two of which don't strip. > > Your opinion is probably valid from a theoretical point of view. You are > more knowledgeable than me. > > My point was different: most programmers are not at your level (or > Paul's level, etc.) when it comes to Unicode knowledge. Py3k's str type > is supposed to be an abstracted textual type to make it easy to write > unicode-friendly applications (isn't it?). > Therefore it should hide the messy issue of superfluous BOMs, unwanted > BOMs, etc. Telling the programmer to use a specific UTF-8 variant > specialized in BOM-stripping will make eyes roll... "why doesn't the > standard UTF-8 do it for me?" I've been reading this thread (and the ones that spawned it), and there's something about it that's been nagging at me for a while, which I am going to attempt to articulate. The basic controversy centers around the various ways in which Python should attempt to deal with character encodings on various platforms, but my question is "for what use cases?" To my mind, trying to ask "how should we handle character encoding" without indicating what we want to use the characters *for* is a meaningless question.
From the standpoint of a programmer writing code to process file contents, there's really no such thing as a "text file" - there are only various text-based file formats. There are XML files, .ini files, email messages and Python source code, all of which need to be processed differently. So when one asks "how do I handle text files", my response is "there ain't no such thing" -- and when you ask "well, ok, how do I handle text-based file formats", my response is "well it depends on the format". Yes, there are some operations which can operate on textual data regardless of file format (i.e. grep), but these generic operations are so basic and uninteresting that one generally doesn't need to write Python code to do them. And even in the case of simple unix utilities such as 'cat', *some* a priori knowledge of the file's encoded meaning is required - you can't just concatenate two XML files and get anything meaningful or valid. Running 'sort' on Python source code is unlikely to increase shareholder value or otherwise hold back the tide of entropy. Any given Python program that I write is going to know *something* about the format of the files that it is supposed to read/write, and the most important consideration is knowledge of what kinds of other programs are going to produce or consume that file. If the file that I am working with conforms to a standard (so that the number of producer/consumer programs can be large without me having to know the specific details of each one) then I need to understand that standard and constraints of what is legal within it. For files with any kind of structure in them, common practice is that we don't treat them as streams of characters, rather we generally have some abstraction layer that sits on top of the character stream and allows us to work with the structure directly. Thus, when dealing with XML one generally uses something like ElementTree, and in fact manipulating XML files as straight text is actively discouraged. So my whole approach to the problem of reading and writing is to come up with a collection of APIs that reflect the common use patterns for the various popular file types. The benefit of doing this is that you don't waste time thinking about all of the various file operations that don't apply to a particular file format. For example, using the ElementTree interface, I don't care whether the underlying file stream supports seek() or not - generally one doesn't seek into the middle of an XML file, so there's no need to support that feature. On the other hand, if one is reading a bdb file, one needs to seek to the location of a record in order to read it - but in such a case, the result of the seek operation is well-defined. I don't have to spend time discussing what will happen if I seek into the middle of an encoded multi-byte character, because with a bdb file, that can't happen. It seems to me that a lot of the conundrums that have been discussed in this thread have to do with hypothetical use cases - 'Well, what if I use operation X on a file of format Y, for which the result is undefined?' My answer is "Don't do that." -- Talin From ncoghlan at gmail.com Thu Sep 14 12:19:46 2006 From: ncoghlan at gmail.com (Nick Coghlan) Date: Thu, 14 Sep 2006 20:19:46 +1000 Subject: [Python-3000] string C API In-Reply-To: <45084966.3000608@v.loewis.de> References: <45078B46.90408@v.loewis.de> <450820A3.4000302@v.loewis.de> <45083C76.8010302@v.loewis.de> <45084966.3000608@v.loewis.de> Message-ID: <45092CC2.4070700@gmail.com> Martin v.
Löwis wrote: > Jim Jewett schrieb: >> Simply delegate such methods to a hidden per-encoding subclass. >> >> The UTF-8 methods will indeed be complex, unless the solution is >> simply "someone called indexing/slicing/len, so I have to recode after >> all." >> >> The Latin-1 encoding will have no such problem. > > I'm not so much worried about UTF-8 or Latin-1; they are fairly trivial. > Efficiency of such methods for multi-byte encodings would be > dramatically slow. Only the first such call on a given string, though - the idea is to use lazy decoding, not to avoid decoding altogether. Most manipulations (len, indexing, slicing, concatenation, etc) would require decoding to at least UCS-2 (or perhaps UCS-4). It's applications that are just schlepping bits around that would benefit from the lazy decoding behaviour. Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia --------------------------------------------------------------- http://www.boredomandlaziness.org From qrczak at knm.org.pl Thu Sep 14 14:44:28 2006 From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk) Date: Thu, 14 Sep 2006 14:44:28 +0200 Subject: [Python-3000] string C API In-Reply-To: <45092CC2.4070700@gmail.com> (Nick Coghlan's message of "Thu, 14 Sep 2006 20:19:46 +1000") References: <45078B46.90408@v.loewis.de> <450820A3.4000302@v.loewis.de> <45083C76.8010302@v.loewis.de> <45084966.3000608@v.loewis.de> <45092CC2.4070700@gmail.com> Message-ID: <8764fq8vo3.fsf@qrnik.zagroda> Nick Coghlan writes: > Only the first such call on a given string, though - the idea > is to use lazy decoding, not to avoid decoding altogether. > Most manipulations (len, indexing, slicing, concatenation, etc) > would require decoding to at least UCS-2 (or perhaps UCS-4). Silently optimizing string recoding might change the way recoding errors are reported. i.e. they might not be reported at all even if the string is malformed. Optimizations which change the semantics are bad. I imagine only a few cases where lazy decoding would be beneficial: 1. A whole input stream is copied to an output stream which uses the same encoding. Here the application might choose to copy binary streams instead. 2. A file name, user name, or similar token is obtained from the OS in one place and used in another place. Especially on Unix where they use byte encodings (Windows prefers UTF-16). These cases can be optimized by other means: - Sometimes representing the token as a Python string can be avoided. For example executing an action in a different directory and then returning to the original directory might choose to represent the saved directory as a byte array. - Under the assumption that the system encoding is ASCII-compatible, calling the recoding machinery can be omitted for ASCII-only strings. This applies only to strings exchanged with the OS etc., not to stream contents which can use non-ASCII-compatible encodings. My language implementation has only two string representations: ISO-8859-1 and UTF-32 (the narrow representation is used for all strings where it's possible). This is completely transparent to the high level semantics, like the fixnum/bignum split. I'm happy with this choice. My text I/O buffers and recoding buffers use UTF-32 exclusively. It would be too complicated to try to use a narrow representation when the string is not processed as a whole. This makes the ASCII-only optimization significant I believe (but I haven't measured it).
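For illustration, the choice of representation amounts to nothing more than this (choose_repr is not a real API, just a sketch in Python):

def choose_repr(codepoints):
    # ISO-8859-1 covers exactly the code points 0..255, so the narrow
    # one-byte-per-character representation is usable iff every code
    # point fits in one byte; otherwise fall back to UTF-32.
    for cp in codepoints:
        if cp > 0xFF:
            return 'UTF-32'
    return 'ISO-8859-1'
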
-- __("< Marcin Kowalczyk \__/ qrczak at knm.org.pl ^^ http://qrnik.knm.org.pl/~qrczak/ From solipsis at pitrou.net Thu Sep 14 14:48:56 2006 From: solipsis at pitrou.net (Antoine) Date: Thu, 14 Sep 2006 14:48:56 +0200 (CEST) Subject: [Python-3000] string C API In-Reply-To: <45092CC2.4070700@gmail.com> References: <45078B46.90408@v.loewis.de> <450820A3.4000302@v.loewis.de> <45083C76.8010302@v.loewis.de> <45084966.3000608@v.loewis.de> <45092CC2.4070700@gmail.com> Message-ID: <51134.62.39.9.251.1158238136.squirrel@webmail.nerim.net> > Only the first such call on a given string, though - the idea is to use > lazy > decoding, not to avoid decoding altogether. Most manipulations (len, > indexing, > slicing, concatenation, etc) would require decoding to at least UCS-2 (or > perhaps UCS-4). My two cents: For len() you can compute the length at string construction and store it in the string object (which is immutable). For example if the string is constructed by concatenation then computing the resulting length should be trivial. Even when real computation is needed, it plays nicer with the CPU cache since the data has to be there anyway. As for concatenation, recoding can be avoided if the strings to be concatenated use the same internal encoding (assuming it does not hold internal state). Given that in many cases the strings will come from similar sources (thus use the same internal encoding), it may be an interesting optimization. Regards Antoine. From p.f.moore at gmail.com Thu Sep 14 14:50:49 2006 From: p.f.moore at gmail.com (Paul Moore) Date: Thu, 14 Sep 2006 13:50:49 +0100 Subject: [Python-3000] BOM handling In-Reply-To: <45090D11.3060908@acm.org> References: <20060913092509.F933.JCARLSON@uci.edu> <1158179602.4721.24.camel@fsol> <20060913153900.F936.JCARLSON@uci.edu> <1158214740.5863.19.camel@fsol> <45090D11.3060908@acm.org> Message-ID: <79990c6b0609140550j287792ex468ff93407a6d4ac@mail.gmail.com> On 9/14/06, Talin wrote: > I've been reading this thread (and the ones that spawned it), and > there's something about it that's been nagging at me for a while, which > I am going to attempt to articulate. [...] > Any given Python program that I write is going to know *something* about > the format of the files that it is supposed to read/write, and the most > important consideration is knowledge of what kinds of other programs are > going to produce or consume that file. If the file that I am working > with conforms to a standard (so that the number of producer/consumer > programs can be large without me having to know the specific details of > each one) then I need to understand that standard and constraints of > what is legal within it. Well said! There *is* still an issue, which is that Python needs to supply tools to cater for naive users writing naive programs to parse/produce ad-hoc text based file formats. For example, someone sent me this file of data, and I want to parse it and convert it into some other format (load it into a database, generate XML, whaterver). In my experience, in these cases: 1. Nobody tells me the character encoding used. 2. 99.9% of the data is ASCII - so there's very little basis for guessing. 3. The whole process isn't an exact science - I *expect* to have to do a bit of manual tidying up. Or it's all ASCII and it *really* doesn't matter. Those are the bulk of my use cases. For them, I'd be happy with the "system code page" (even though Windows has two, one for console and one for GUI, that wouldn't bother me if it was visible to me). 
I wouldn't mind UTF-8, or latin-1, or anything much. It's only that 0.1% of cases where I expect to need to check and possibly intervene, so no problem. On the other hand, getting an error *would* bother me. In Python 2.x, I get no error because I don't convert to Unicode. In Python 3.x, I fear that I might, because someone expects me to care about that 0.1%. And no, it's not good enough for me to be able to set a global option - that's boilerplate I'd rather do without. Parochially y'rs Paul. From qrczak at knm.org.pl Thu Sep 14 15:01:23 2006 From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk) Date: Thu, 14 Sep 2006 15:01:23 +0200 Subject: [Python-3000] Pre-PEP: Easy Text File Decoding In-Reply-To: <45089612.4060007@blueyonder.co.uk> (David Hopwood's message of "Thu, 14 Sep 2006 00:36:50 +0100") References: <20060911094447.hjabms9i7sgs4g8s@login.werra.lunarpages.com> <79990c6b0609111449j1ae99757m35037b27c5727eda@mail.gmail.com> <1cb725390609111530o3f7b0faah7d87a8cdf532cf39@mail.gmail.com> <87r6yij7ea.fsf@qrnik.zagroda> <1cb725390609111816k7187a15eu91c4029dad65885d@mail.gmail.com> <45089612.4060007@blueyonder.co.uk> Message-ID: <871wqe8uvw.fsf@qrnik.zagroda> David Hopwood writes: > You're correct about the use of a BOM as a signature. All > Unicode-conformant applications should accept this use of a BOM in > UTF-8 (although they need not generate it); the standard is quite > clear on that. When a program generates a list of filenames in a file, and I do xargs -i cp {} some-dir/ References: <20060913092509.F933.JCARLSON@uci.edu> <1158179602.4721.24.camel@fsol> <20060913153900.F936.JCARLSON@uci.edu> <1158214740.5863.19.camel@fsol> <45090D11.3060908@acm.org> Message-ID: <4509556F.4030508@latte.ca> Talin wrote: >> My point was different : most programmers are not at your level (or >> Paul's level, etc.) when it comes to Unicode knowledge. Py3k's str type >> is supposed to be an abstracted textual type to make it easy to write >> unicode-friendly applications (isn't it?). > > The basic controversy centers around the various ways in which Python > should attempt to deal with character encodings on various platforms, > but my question is "for what use cases?" To my mind, trying to ask "how > should we handle character encoding" without indicating what we want to > use the characters *for* is a meaningless question. Contrary to all expectations, this thread has helped me in my day job already. I'm about to start writing a program (in Python, natch) which will take a set of files, and perform simple token substitution on them, replacing tokens of the form %STUFF.format% with the value of the STUFF token looked up in another (XML, thus Unicode by the time it gets to me) file. The files I'll be substituting in will be in various encodings, and I'll be creating new files which must have the same encoding. Sadly, I don't know what all the encodings are. (The Windows Resource Compiler takes in .rc files, but I can't find any suggestion of what encoding those use. Anyone here know?) The first version of the spec naively mentioned nothing about encodings, and so I raised a red flag about that, seeing that we would have problems, and that the right thing to do in this case isn't clear. Um, what more data do we need for this use-case? I'm not going to suggest an API, other than it would be nice if I didn't have to manually figure out/hard code all the encodings. (It's my belief that I will currently have to do that, or at least special-case XML, to read the encoding attribute.) 
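For the XML case I expect I'll end up hand-rolling something like the following sketch (guess_xml_encoding is just my name for it, and it only copes with declarations stored in an ASCII-compatible encoding):

import re

def guess_xml_encoding(path):
    # Look for encoding="..." in the XML declaration on the first line;
    # per the XML spec, fall back to UTF-8 when it is absent.
    first_line = open(path, 'rb').readline()
    match = re.match(r'<\?xml[^>]*encoding=["\']([^"\']+)["\']', first_line)
    if match:
        return match.group(1)
    return 'utf-8'
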
Oh, and it would be particularly horrible if I output a shell script in UTF-8, and it included the BOM, since I believe that would break the "magic number" of "#!". (To test it in vim, set the following options: :set encoding=utf-8 :set bomb ) Jennifer:~ bwinton$ xxd test 0000000: efbb bf23 2120 2f62 696e 2f62 6173 680a ...#! /bin/bash. 0000010: 6563 686f 204a 7573 7420 7465 7374 696e echo Just testin 0000020: 672e 2e2e 0a g.... Jennifer:~ bwinton$ ./test -bash: ./test: cannot execute binary file Jennifer:~ bwinton$ xxd test 0000000: 2321 202f 6269 6e2f 6261 7368 0a65 6368 #! /bin/bash.ech 0000010: 6f20 4a75 7374 2074 6573 7469 6e67 2e2e o Just testing.. 0000020: 2e0a .. Jennifer:~ bwinton$ ./test Just testing... > From the standpoint of a programmer writing code to process file > contents, there's really no such thing as a "text file" - there are only > various text-based file formats. There are XML files, .ini files, email > messages and Python source code, all of which need to be processed > differently. Yeah, see, at a business level, I really need to process those all in the same way, and it would be annoying to have to write code to handle them all differently. > For files with any kind of structure in them, common practice is that we > don't treat them as streams of characters, rather we generally have some > abstraction layer that sits on top of the character stream and allows us > to work with the structure directly. Your common practice, perhaps. I find myself treating them as streams of characters as often as not, because I neither need nor care to process the structure. Heck, even in my source code, I grep more often than I use the fancy "Find Usages" button (if only because PyDev in Eclipse doesn't let me search for all the usages of a function). > So my whole approach to the problem of reading and writing is to come up > with a collection of APIs that reflect the common use patterns for the > various popular file types. That sounds great. Can you also come up with an API for the files that you don't consider to be in common use? And if so, that's the one that everyone is going to use. (I'm not saying that to be contrary, but because I honestly believe that that's what's going to happen. If there's a choice between using one API for all your files, and using n APIs for all your files, my money is always going to be on the one. Maybe XML will have enough traction to make it two, but certainly no more than that.) Later, Blake. From jcarlson at uci.edu Thu Sep 14 18:28:39 2006 From: jcarlson at uci.edu (Josiah Carlson) Date: Thu, 14 Sep 2006 09:28:39 -0700 Subject: [Python-3000] BOM handling In-Reply-To: <4509556F.4030508@latte.ca> References: <45090D11.3060908@acm.org> <4509556F.4030508@latte.ca> Message-ID: <20060914092020.F954.JCARLSON@uci.edu> Blake Winton wrote: [snip] > Um, what more data do we need for this use-case? I'm not going to > suggest an API, other than it would be nice if I didn't have to manually > figure out/hard code all the encodings. (It's my belief that I will > currently have to do that, or at least special-case XML, to read the > encoding attribute.) Oh, and it would be particularly horrible if I > output a shell script in UTF-8, and it included the BOM, since I believe > that would break the "magic number" of "#!". Use the XML tag/attribute "<?xml version="1.0" encoding="..."?>" to discover the encoding and assume utf-8 otherwise as per spec: http://www.w3.org/TR/2000/REC-xml-20001006#NT-EncodingDecl Does bash natively support utf-8? Is there a bash equivalent to Python coding: directives?
You may be attempting to fix a problem that doesn't exist. > Yeah, see, at a business level, I really need to process those all in > the same way, and it would be annoying to have to write code to handle > them all differently. So you, or anyone else, can write a module for discovering the encoding used for a particular file based on XML tags, Python coding: directives, etc. It could include an extensible registry, and if it is used enough, could be included in the Python standard library. - Josiah From jcarlson at uci.edu Thu Sep 14 18:46:06 2006 From: jcarlson at uci.edu (Josiah Carlson) Date: Thu, 14 Sep 2006 09:46:06 -0700 Subject: [Python-3000] string C API In-Reply-To: <8764fq8vo3.fsf@qrnik.zagroda> References: <45092CC2.4070700@gmail.com> <8764fq8vo3.fsf@qrnik.zagroda> Message-ID: <20060914093036.F957.JCARLSON@uci.edu> "Marcin 'Qrczak' Kowalczyk" wrote: > Nick Coghlan writes: > > > Only the first such call on a given string, though - the idea > > is to use lazy decoding, not to avoid decoding altogether. > > Most manipulations (len, indexing, slicing, concatenation, etc) > > would require decoding to at least UCS-2 (or perhaps UCS-4). > > Silently optimizing string recoding might change the way recoding > errors are reported. i.e. they might not be reported at all even > if the string is malformed. Optimizations which change the semantics > are bad. This is not a problem.
During construction of the string, you would > either be recoding the original string to the standard 'compressed' > format, or if they had the same format, you would attempt a decoding, > and on failure, claim that the input wasn't in the encoding originally > specified. > > > Personally though, I'm not terribly inclined to believe that using a > 'compressed' representation of utf-8 is desirable. Why not use latin-1 > when possible, ucs-2 when latin-1 isn't enough, and ucs-4 when ucs-2 > isn't enough? You get a fixed-width character encoding, and aside from > the (annoying) need to write variants of each string function for each > width (macros would help here), or generic versions of each, you never > need to recode the initial string after it has been created. > > Even better, with a slightly modified buffer interface, these characters > can be exposed to C extensions in a somewhat transparent manner (if > desired). The argument for UTF-8 is probably interop efficiency. Lots of C libraries, file formats, and wire protocols use UTF-8 for interchange. Verifying the validity of UTF-8 during string creation isn't that big of a deal. -bob From bwinton at latte.ca Thu Sep 14 19:56:11 2006 From: bwinton at latte.ca (Blake Winton) Date: Thu, 14 Sep 2006 13:56:11 -0400 Subject: [Python-3000] BOM handling In-Reply-To: <20060914092020.F954.JCARLSON@uci.edu> References: <45090D11.3060908@acm.org> <4509556F.4030508@latte.ca> <20060914092020.F954.JCARLSON@uci.edu> Message-ID: <450997BB.6020703@latte.ca> Josiah Carlson wrote: > Blake Winton wrote: >> I'm not going to >> suggest an API, other than it would be nice if I didn't have to manually >> figure out/hard code all the encodings. (It's my belief that I will >> currently have to do that, or at least special-case XML, to read the >> encoding attribute.) > Use the XML tag/attribute "<?xml version="1.0" encoding="..."?>" to discover the > encoding and assume utf-8 otherwise as per spec: > http://www.w3.org/TR/2000/REC-xml-20001006#NT-EncodingDecl Yeah, but now you're requiring me to read and understand the file's contents, which is something I (as someone who doesn't particularly care about all this "encoding" stuff) am trying very hard not to do. Does no-one write generic text processing programs anymore? If I were to write a program which rotated an image using PIL, I wouldn't have to care whether it was a png or a jpeg. (At least, I'm pretty sure I wouldn't. I haven't tried recently.) >> Oh, and it would be particularly horrible if I >> output a shell script in UTF-8, and it included the BOM, since I believe >> that would break the "magic number" of "#!". > Does bash natively support utf-8? A quick Google gives me: ------------------------- About bash utf-8: Bash is the shell, or command language interpreter, that will appear in the GNU operating system. It is default shell for BeOS. By default, GNU bash assumes that every character is one byte long and one column wide. It may cause several problems for all non-english BeOS users, especially with file names using national characters. A patch for bash 2.04, by Marcin 'Qrczak' Kowalczyk and Ricardas Cepas, teaches bash about multibyte characters in UTF-8 encoding, and fixes those problems. Double-width characters, combining characters and bidi are not supported by this patch. ------------------------- which I'm mainly posting here because of the reference to Marcin 'Qrczak' Kowalczyk. Small world, but I wouldn't want to paint it. > Is there a bash equivalent to Python coding: directives?
You may be > attempting to fix a problem that doesn't exist. I don't know if the magic number stuff to determine whether a file is executable or not is bash-specific. Either way, when I save the file in UTF-8, it's fine, but when I save it in UTF-8 with a BOM, it fails. >> Yeah, see, at a business level, I really need to process those all in >> the same way, and it would be annoying to have to write code to handle >> them all differently. > So you, or anyone else, can write a module for discovering the encoding > used for a particular file based on XML tags, Python coding: directives, > etc. It could include an extensible registry, and if it is used enough, > could be included in the Python standard library. Okay, so what will happen for file types which aren't in the registry, like those Windows .rc files? I was lying up above when I said that I don't care about this sort of thing. I do care, but I also believe that I am, and should be, in the minority, and that if we can't ship something that will work for people who don't care about this stuff, then we've failed both them and Python. Later, Blake. From paul at prescod.net Thu Sep 14 20:12:10 2006 From: paul at prescod.net (Paul Prescod) Date: Thu, 14 Sep 2006 11:12:10 -0700 Subject: [Python-3000] BOM handling In-Reply-To: <20060914092020.F954.JCARLSON@uci.edu> References: <45090D11.3060908@acm.org> <4509556F.4030508@latte.ca> <20060914092020.F954.JCARLSON@uci.edu> Message-ID: <1cb725390609141112j6bc22220yd290d43e90c8501@mail.gmail.com> As a somewhat aside: for XML encoding detection: http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/363841 Paul Prescod -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.python.org/pipermail/python-3000/attachments/20060914/99111ff6/attachment.htm From jcarlson at uci.edu Thu Sep 14 20:58:47 2006 From: jcarlson at uci.edu (Josiah Carlson) Date: Thu, 14 Sep 2006 11:58:47 -0700 Subject: [Python-3000] BOM handling In-Reply-To: <450997BB.6020703@latte.ca> References: <20060914092020.F954.JCARLSON@uci.edu> <450997BB.6020703@latte.ca> Message-ID: <20060914112926.F95D.JCARLSON@uci.edu> Blake Winton wrote: > Josiah Carlson wrote: > > Blake Winton wrote: > >> I'm not going to > >> suggest an API, other than it would be nice if I didn't have to manually > >> figure out/hard code all the encodings. (It's my belief that I will > >> currently have to do that, or at least special-case XML, to read the > >> encoding attribute.) > > Use the XML tag/attribute "<?xml version="1.0" encoding="..."?>" to discover the > > encoding and assume utf-8 otherwise as per spec: > > http://www.w3.org/TR/2000/REC-xml-20001006#NT-EncodingDecl > > Yeah, but now you're requiring me to read and understand the file's > contents, which is something I (as someone who doesn't particularly care > about all this "encoding" stuff) am trying very hard not to do. Does > no-one write generic text processing programs anymore? Not too long ago, "generic text processing programs" only had to deal with one of ASCII, EBCDIC, etc., or were written specifically for text encoded for a particular locale. Times have changed, but the tools really haven't. If you want to easily deal with such things, write the module. > If I were to write a program which rotated an image using PIL, I > wouldn't have to care whether it was a png or a jpeg. (At least, I'm > pretty sure I wouldn't. I haven't tried recently.) Right, but gif, png, jpeg, bmp, and scores of other multimedia formats contain the equivalent to a Python coding: directive.
Examine the first dozen or so bytes of basically any kind of image, sound (not mp3s though), or movie, and you will notice an ascii specifier for the type of file. By writing the registry module I described, one would be, in essence, writing a library that understands what kind of media it has been handed, at least as much as the equivalent of "this is a bmp" or "this is a gif". > > Is there a bash equivalent to Python coding: directives? You may be > > attempting to fix a problem that doesn't exist. > > I don't know if the magic number stuff to determine whether a file is > executable or not is bash-specific. Either way, when I save the file in > UTF-8, it's fine, but when I save it in UTF-8 with a BOM, it fails. So don't save it with a BOM and add a Python coding: directive to the second line. Python and bash comments just happen to have the same # delimiter, and if your editor doesn't suck, then it should understand such a directive. With luck, your editor should also allow for the non-writing of the BOM on utf-8 save (given certain conditions). If not, contact the author(s) and request that feature. > > So you, or anyone else, can write a module for discovering the encoding > > used for a particular file based on XML tags, Python coding: directives, > > etc. It could include an extensible registry, and if it is used enough, > > could be included in the Python standard library. > > Okay, so what will happen for file types which aren't in the registry, > like those Windows .rc files? I'm not writing the encoding registry, but if I was, and if no known encoding was found, I'd claim latin-1, if only because it 'succeeds' when decoding character values 128-255. > I was lying up above when I said that I don't care about this sort of > thing. I do care, but I also believe that I am, and should be, in the > minority, and that if we can't ship something that will work for people > who don't care about this stuff, then we've failed both them and Python. Indeed, which is why people who do care should write a registry so that their users don't need to care. - Josiah From p.f.moore at gmail.com Thu Sep 14 22:15:34 2006 From: p.f.moore at gmail.com (Paul Moore) Date: Thu, 14 Sep 2006 21:15:34 +0100 Subject: [Python-3000] BOM handling In-Reply-To: <20060914112926.F95D.JCARLSON@uci.edu> References: <20060914092020.F954.JCARLSON@uci.edu> <450997BB.6020703@latte.ca> <20060914112926.F95D.JCARLSON@uci.edu> Message-ID: <79990c6b0609141315h716ef623y9a67b36c4ac61cd2@mail.gmail.com> On 9/14/06, Josiah Carlson wrote: > So don't save it with a BOM and add a Python coding: directive to the > second line. Python and bash comments just happen to have the same # > delimiter, and if your editor doesn't suck, then it should understand > such a directive. However, vim and emacs use *different* coding directive formats. Python understands both, but (AFAIK) they don't understand each other's. So which editor sucks? :-) :-) :-) (3 smileys is a get-out-of-flamewar-free card :-)) I'm not trying to contradict you - just pointing out that the world isn't as perfect as people here seem to want it to be. Paul.
From jcarlson at uci.edu Thu Sep 14 22:19:03 2006 From: jcarlson at uci.edu (Josiah Carlson) Date: Thu, 14 Sep 2006 13:19:03 -0700 Subject: [Python-3000] string C API In-Reply-To: <6a36e7290609140947s6261456bv4e0f40733f1c0e5f@mail.gmail.com> References: <20060914093036.F957.JCARLSON@uci.edu> <6a36e7290609140947s6261456bv4e0f40733f1c0e5f@mail.gmail.com> Message-ID: <20060914104921.F95A.JCARLSON@uci.edu> "Bob Ippolito" wrote: > The argument for UTF-8 is probably interop efficiency. Lots of C > libraries, file formats, and wire protocols use UTF-8 for interchange. > Verifying the validity of UTF-8 during string creation isn't that big > of a deal. Indeed, UTF-8 validation/creation isn't a big deal. But that wasn't my concern. My concern was Python-only operation efficiency, for which a fixed-length-per-character encoding generally wins (at least for operations involving two strings with the same internal encoding). - Josiah From bob at redivi.com Thu Sep 14 22:34:38 2006 From: bob at redivi.com (Bob Ippolito) Date: Thu, 14 Sep 2006 13:34:38 -0700 Subject: [Python-3000] string C API In-Reply-To: <20060914104921.F95A.JCARLSON@uci.edu> References: <20060914093036.F957.JCARLSON@uci.edu> <6a36e7290609140947s6261456bv4e0f40733f1c0e5f@mail.gmail.com> <20060914104921.F95A.JCARLSON@uci.edu> Message-ID: <6a36e7290609141334x344cf42fpa561275c123c290b@mail.gmail.com> On 9/14/06, Josiah Carlson wrote: > > "Bob Ippolito" wrote: > > The argument for UTF-8 is probably interop efficiency. Lots of C > > libraries, file formats, and wire protocols use UTF-8 for interchange. > > Verifying the validity of UTF-8 during string creation isn't that big > > of a deal. > > Indeed, UTF-8 validation/creation isn't a big deal. But that wasn't my > concern. My concern was Python-only operation efficiency, for which a > fixed-length-per-character encoding generally wins (at least for > operations involving two strings with the same internal encoding). If you need to know the number of characters often you can calculate that when the string's contents are validated. Slice ops may become slower though... but versus UCS-4 the memory and memory bandwidth savings might actually be a net performance win overall for many applications. -bob From jason.orendorff at gmail.com Thu Sep 14 22:53:58 2006 From: jason.orendorff at gmail.com (Jason Orendorff) Date: Thu, 14 Sep 2006 16:53:58 -0400 Subject: [Python-3000] BOM handling In-Reply-To: <1cb725390609141112j6bc22220yd290d43e90c8501@mail.gmail.com> References: <45090D11.3060908@acm.org> <4509556F.4030508@latte.ca> <20060914092020.F954.JCARLSON@uci.edu> <1cb725390609141112j6bc22220yd290d43e90c8501@mail.gmail.com> Message-ID: For what it's worth: in .NET, everything defaults to UTF-8, whether reading or writing. No BOM is generated when creating a new file. http://msdn2.microsoft.com/en-us/library/system.io.file.createtext.aspx Java defaults to a "default character encoding", which on Windows is the system's ANSI encoding. http://java.sun.com/j2se/1.4.2/docs/api/java/io/OutputStreamWriter.html Neither correctly reads the other's output. Pick your poison. 
-j

From martin at v.loewis.de  Thu Sep 14 23:34:34 2006
From: martin at v.loewis.de ("Martin v. Löwis")
Date: Thu, 14 Sep 2006 23:34:34 +0200
Subject: [Python-3000] string C API
In-Reply-To: <45092CC2.4070700@gmail.com>
References: <45078B46.90408@v.loewis.de> <450820A3.4000302@v.loewis.de> <45083C76.8010302@v.loewis.de> <45084966.3000608@v.loewis.de> <45092CC2.4070700@gmail.com>
Message-ID: <4509CAEA.3040108@v.loewis.de>

Nick Coghlan wrote:
> Only the first such call on a given string, though - the idea is to use
> lazy decoding, not to avoid decoding altogether. Most manipulations
> (len, indexing, slicing, concatenation, etc) would require decoding to
> at least UCS-2 (or perhaps UCS-4).

Ok. Then my objection is this: What about errors that occur in decoding?
What happens if the bytes are not meaningful in the presumed encoding?

ISTM that raising the exception lazily (which seems to be necessary)
would be very confusing.

Regards,
Martin

From 2006 at jmunch.dk  Fri Sep 15 01:05:28 2006
From: 2006 at jmunch.dk (Anders J. Munch)
Date: Fri, 15 Sep 2006 01:05:28 +0200
Subject: [Python-3000] iostack, second revision
In-Reply-To: <20060913084256.F930.JCARLSON@uci.edu>
References: <9B1795C95533CA46A83BA1EAD4B01030031F54@flonidanmail.flonidan.net> <20060913084256.F930.JCARLSON@uci.edu>
Message-ID: <4509E038.2030808@jmunch.dk>

Josiah Carlson wrote:
> Any sane person uses os.stat(f.name) or os.fstat(f.fileno()), unless
> they want to seek to the end of the file for later writing or expected
> reading of data yet-to-be-written.

os.fstat(f.fileno()).st_size doesn't work for file-like objects.
Goodbye unit testing with StringIOs. f.seek(0,2);f.tell() is faster,
too. I think the lunatics have a point.

> You were also talking about buffering writes to reduce the overhead of
> the underlying seeks and tells because of apparent "optimizations" you
> wanted to make.  Here is a data integrity optimization you can make for
> me: flush when accessing the file non-sequentially, any other behavior
> could corrupt the data of users who have been relying on "seek implies
> flush".

Again, that's what explicit calls to flush are for. And you can't
violate expectations as to what the seek method does, when there's no
seek method and no concept of a file pointer.

Sprinkling extra flushes out here and there does not help data
integrity: Only a flush that is part of a well-thought-out plan to
recover partially written data in case of a crash will help you do
that. Anything less, and you're just a power failure and a disk that
reorders writes away from unrecoverable corruption.

My class consolidates writes, but doesn't reorder them. That means that
to the extent that the system call for writing is transactional, writes
are not reordered.

I put the code up at http://pastecode.com/4818. As is, extending and
truncating have bugs. If you really want it, it's three lines changed
to disable buffering for non-sequential writes. And an equivalent class
completely without buffering is pretty trivial.

> With that said, I'm not sure your FileBytes object is really necessary
> or desired for the future io library.  If people want that kind of an
> interface, they can use mmap (and push for the various mmap bugs/feature
> requests to be fixed), otherwise they should be using readable /
> writable / both streams, something that Tomer has been working towards.

mmap has limitations that cannot be fixed. It takes up virtual
memory, limiting the size of files you can work with.
You need to
specify the size in advance (note the potential race condition in
f=mmap.mmap(f.fileno(),os.fstat(f.fileno()).st_size)). To what extent
does it work over networked file systems? If you map a file on a file
system that is subsequently unmounted, a core dump may be the result.
All this assuming the operating system supports mmap at all.

mmap is for use where speed is paramount, and pretty much only then.
The reason people don't use sequence-based file interfaces as much is
that robust, portable, practical sequence-based file interfaces aren't
available. Probably most people who would have liked a sequence
interface do what I do: slurp up the whole file in one read and deal
with the string. Or use mmap and live with the fragility.

- Anders

From murman at gmail.com  Fri Sep 15 01:30:09 2006
From: murman at gmail.com (Michael Urman)
Date: Thu, 14 Sep 2006 18:30:09 -0500
Subject: [Python-3000] BOM handling
In-Reply-To: <20060914112926.F95D.JCARLSON@uci.edu>
References: <20060914092020.F954.JCARLSON@uci.edu> <450997BB.6020703@latte.ca> <20060914112926.F95D.JCARLSON@uci.edu>
Message-ID:

On 9/14/06, Josiah Carlson wrote:
> With luck, your editor should also allow for the
> non-writing of the BOM on utf-8 save (given certain conditions).  If not,
> contact the author(s) and request that feature.

And hope they didn't write it in a language that doesn't let them
control when to use a BOM.
--
Michael Urman  http://www.tortall.net/mu/blog

From jcarlson at uci.edu  Fri Sep 15 02:02:01 2006
From: jcarlson at uci.edu (Josiah Carlson)
Date: Thu, 14 Sep 2006 17:02:01 -0700
Subject: [Python-3000] BOM handling
In-Reply-To: <79990c6b0609141315h716ef623y9a67b36c4ac61cd2@mail.gmail.com>
References: <20060914112926.F95D.JCARLSON@uci.edu> <79990c6b0609141315h716ef623y9a67b36c4ac61cd2@mail.gmail.com>
Message-ID: <20060914153932.F967.JCARLSON@uci.edu>

"Paul Moore" wrote:
> On 9/14/06, Josiah Carlson wrote:
> > So don't save it with a BOM and add a Python coding: directive to the
> > second line.  Python and bash comments just happen to have the same #
> > delimiter, and if your editor doesn't suck, then it should understand
> > such a directive.
>
> However, vim and emacs use *different* coding directive formats.
> Python understands both, but (AFAIK) they don't understand each
> other's. So which editor sucks? :-) :-) :-) (3 smileys is a
> get-out-of-flamewar-free card :-))

Single users will be choosing a single tool. Multiple users will likely
use a source repository. Good source repositories will allow for pre-
or post-processing. Or heck, I'm sure that Emacs or Vim can even be
tweaked to understand the other's encoding declarations. If not, there
are more than a dozen source editors that support both, and even some
that offer the features I describe.

 - Josiah

From david.nospam.hopwood at blueyonder.co.uk  Fri Sep 15 02:00:19 2006
From: david.nospam.hopwood at blueyonder.co.uk (David Hopwood)
Date: Fri, 15 Sep 2006 01:00:19 +0100
Subject: [Python-3000] BOM handling
In-Reply-To: <79990c6b0609141315h716ef623y9a67b36c4ac61cd2@mail.gmail.com>
References: <20060914092020.F954.JCARLSON@uci.edu> <450997BB.6020703@latte.ca> <20060914112926.F95D.JCARLSON@uci.edu> <79990c6b0609141315h716ef623y9a67b36c4ac61cd2@mail.gmail.com>
Message-ID: <4509ED13.9040409@blueyonder.co.uk>

Paul Moore wrote:
> On 9/14/06, Josiah Carlson wrote:
>
>>So don't save it with a BOM and add a Python coding: directive to the
>>second line.
Python and bash comments just happen to have the same #
>>delimiter, and if your editor doesn't suck, then it should understand
>>such a directive.
>
> However, vim and emacs use *different* coding directive formats.
> Python understands both, but (AFAIK) they don't understand each
> other's. So which editor sucks?

Both, obviously. It would not have been beyond the wit of those editor
developers to talk to each other, or to just unilaterally support the
other editor's format as well as their own.

--
David Hopwood

From greg.ewing at canterbury.ac.nz  Fri Sep 15 03:32:00 2006
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Fri, 15 Sep 2006 13:32:00 +1200
Subject: [Python-3000] iostack, second revision
In-Reply-To: <4509E038.2030808@jmunch.dk>
References: <9B1795C95533CA46A83BA1EAD4B01030031F54@flonidanmail.flonidan.net> <20060913084256.F930.JCARLSON@uci.edu> <4509E038.2030808@jmunch.dk>
Message-ID: <450A0290.9080204@canterbury.ac.nz>

Anders J. Munch wrote:
> (note the potential race condition in
> f=mmap.mmap(f.fileno(),os.fstat(f.fileno()).st_size)).

Not sure anything could be done about that. Even if there were an
mmap-this-file-however-big-it-is call, the size of the file could
still change *after* you'd mapped it.

--
Greg

From jcarlson at uci.edu  Fri Sep 15 05:01:39 2006
From: jcarlson at uci.edu (Josiah Carlson)
Date: Thu, 14 Sep 2006 20:01:39 -0700
Subject: [Python-3000] iostack, second revision
In-Reply-To: <4509E038.2030808@jmunch.dk>
References: <20060913084256.F930.JCARLSON@uci.edu> <4509E038.2030808@jmunch.dk>
Message-ID: <20060914191243.F972.JCARLSON@uci.edu>

"Anders J. Munch" <2006 at jmunch.dk> wrote:
> Josiah Carlson wrote:
> > You were also talking about buffering writes to reduce the overhead of
> > the underlying seeks and tells because of apparent "optimizations" you
> > wanted to make.  Here is a data integrity optimization you can make for
> > me: flush when accessing the file non-sequentially, any other behavior
> > could corrupt the data of users who have been relying on "seek implies
> > flush".
>
> Again, that's what explicit calls to flush are for.  And you can't
> violate expectations as to what the seek method does, when there's no
> seek method and no concept of a file pointer.

People who have experience using Python 2.x file objects and/or
underlying platform file handles may have come to expect "seek implies
flush". Since you claim that offering an unbuffered version is easy,
I'll pretend that such would be offered to the user as an option.

> Sprinkling extra flushes out here and there does not help data
> integrity: Only a flush that is part of a well-thought-out plan to
> recover partially written data in case of a crash will help you do
> that.  Anything less, and you're just a power failure and a disk that
> reorders writes away from unrecoverable corruption.

Indeed, whether or not extra flushes help data integrity depends on the
file structure. But for those who have the know-how to properly deal
with recovery of structured data files post power outage, not flushing
due to optimization is a larger sin than actively flushing - as data
may very well have a better chance to get to disk when you are flushing
more often.

> > With that said, I'm not sure your FileBytes object is really necessary
> > or desired for the future io library.
If people want that kind of an
> > interface, they can use mmap (and push for the various mmap bugs/feature
> > requests to be fixed), otherwise they should be using readable /
> > writable / both streams, something that Tomer has been working towards.
>
> mmap has limitations that cannot be fixed.  It takes up virtual
> memory, limiting the size of files you can work with.  You need to
> specify the size in advance (note the potential race condition in
> f=mmap.mmap(f.fileno(),os.fstat(f.fileno()).st_size)).  To what extent
> does it work over networked file systems?  If you map a file on a file
> system that is subsequently unmounted, a core dump may be the result.
> All this assuming the operating system supports mmap at all.

Some of your concerns can be addressed with mmap + starting offset, and
length parameter of -1. This results in being able to map arbitrary
portions of the file, as well as a Python-level race-free construction
of an mmap. Then the FileBytes interface essentially becomes...

    import mmap

    class FileBytes(object):
        def __init__(self, fname, mode='r+b'):
            self.f = open(fname, mode)
        def __getitem__(self, key):
            # Relies on the proposed start/stop arguments to mmap.mmap()
            # described above; today's mmap only takes a length.
            start, stop = self._parseposition(key)
            return mmap.mmap(self.f.fileno(), start=start, stop=stop)
        def __setitem__(self, key, value):
            # Write through the mapped region; assigning via
            # self[key] = value here would just recurse into __setitem__.
            self[key][:] = value
        #_parseposition as you specify

With a non-broken platform mmap implementation, multiple identical calls
to __getitem__ will return identical data pointers, or at least the
underlying OS will make sure that the two pointers actually point to the
same physical memory region.

NFS issues are a pain. This and the non-support of mmaps on smaller or
less developed platforms may be the only situations where not using
mmaps could offer superior failure conditions.

> mmap is for use where speed is paramount, and pretty much only then.
> The reason people don't use sequence-based file interfaces as much is
> that robust, portable, practical sequence-based file interfaces aren't
> available.  Probably most people who would have liked a sequence
> interface do what I do: slurp up the whole file in one read and deal
> with the string.  Or use mmap and live with the fragility.

I've found the opposite to be true. Every time I've wanted a
sequence-based file interface, I use an mmap: because it is faster and
far more reliable for all use-cases I've been confronted with (if your
process crashes, all of your writes are flushed). But I suppose I spend
time with 512M and 1G mmaps, for which constant slicing of strings
and/or a file-based interface is about 100 times too slow (and useless
when a C extension wants to write to the file - mmaps do this for free).

 - Josiah

From ncoghlan at gmail.com  Fri Sep 15 15:29:58 2006
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Fri, 15 Sep 2006 23:29:58 +1000
Subject: [Python-3000] string C API
In-Reply-To: <4509CAEA.3040108@v.loewis.de>
References: <45078B46.90408@v.loewis.de> <450820A3.4000302@v.loewis.de> <45083C76.8010302@v.loewis.de> <45084966.3000608@v.loewis.de> <45092CC2.4070700@gmail.com> <4509CAEA.3040108@v.loewis.de>
Message-ID: <450AAAD6.5030901@gmail.com>

Martin v. Löwis wrote:
> Nick Coghlan wrote:
>> Only the first such call on a given string, though - the idea is to use
>> lazy decoding, not to avoid decoding altogether. Most manipulations
>> (len, indexing, slicing, concatenation, etc) would require decoding to
>> at least UCS-2 (or perhaps UCS-4).
>
> Ok. Then my objection is this: What about errors that occur in decoding?
> What happens if the bytes are not meaningful in the presumed encoding?
> ISTM that raising the exception lazily (which seems to be necessary)
> would be very confusing.

Yeah, it appears it would be necessary to at least *scan* the string
when it was first created in order to ensure it can be decoded without
errors later on.

I also realised there is another issue with an internal representation
that can change over the life of a string, which is that of
thread-safety.

Since strings don't currently have any mutable internal state, it's
possible to freely share them between threads (without this property,
the interning behaviour would be doomed).

If strings could change the encoding of their internal buffers then
they'd have to use a read/write lock internally on all operations that
might be affected when the internal representation changes. Blech.

Far, far simpler is the idea of supporting only latin-1, UCS-2 and
UCS-4 as internal representations, and choosing which one to use when
the string is created.

Sure certain applications that are just copying from one data stream to
another (both in the same encoding) may needlessly decode and then
re-encode the data, but if the application *knows* that this might
happen (and has reason to care about optimising the performance of this
case), then the application is free to decouple the "reading" and
"decoding" steps, and just transfer raw bytes between the streams.

Cheers,
Nick.

--
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
http://www.boredomandlaziness.org

From jimjjewett at gmail.com  Fri Sep 15 16:25:08 2006
From: jimjjewett at gmail.com (Jim Jewett)
Date: Fri, 15 Sep 2006 10:25:08 -0400
Subject: [Python-3000] string C API
In-Reply-To: <450AAAD6.5030901@gmail.com>
References: <450820A3.4000302@v.loewis.de> <45083C76.8010302@v.loewis.de> <45084966.3000608@v.loewis.de> <45092CC2.4070700@gmail.com> <4509CAEA.3040108@v.loewis.de> <450AAAD6.5030901@gmail.com>
Message-ID:

On 9/15/06, Nick Coghlan wrote:
> Martin v. Löwis wrote:
> > Nick Coghlan wrote:
> >> Only the first such call on a given string, though - the idea is to use
> >> lazy decoding, not to avoid decoding altogether. Most manipulations
> >> (len, indexing, slicing, concatenation, etc) would require decoding to
> >> at least UCS-2 (or perhaps UCS-4).

Or other workarounds.

> > Ok. Then my objection is this: What about errors that occur in decoding?
> > What happens if the bytes are not meaningful in the presumed encoding?
> > ISTM that raising the exception lazily (which seems to be necessary)
> > would be very confusing.

> Yeah, it appears it would be necessary to at least *scan* the string when it
> was first created in order to ensure it can be decoded without errors later on.

What happens today with strings?  I think the answer is:
    "Nothing.
    They print something odd when printed.
    They may raise errors when explicitly recoded to unicode."
Why is this a problem?

I see nothing wrong with an explicit .validate() method. I see nothing
wrong with a program choosing to recode everything into a known
encoding, which would validate as a side-effect. This would be the
moral equivalent of today's unicode() call.

I'm not so happy about the efficiency implication of the idea that
*all* strings *must* be validated (let alone recoded).

> I also realised there is another issue with an internal representation that
> can change over the life of a string, which is that of thread-safety.
> Since strings don't currently have any mutable internal state, it's possible
> to freely share them between threads (without this property, the interning
> behaviour would be doomed).

Interning may get awkward if multiple encodings are allowed within a
program, regardless of whether they're allowed for single strings. It
might make sense to intern only strings that are in the same encoding
as the source code. (Or whose values are limited to ASCII?)

> If strings could change the encoding of their internal buffers then they'd
> have to use a read/write lock internally on all operations that might be
> affected when the internal representation changes. Blech.

Why?

There should be only one reference to a string until it is constructed,
and after that, its data should be immutable. Recoding that results in
different bytes should not be in-place. Either it returns a new string
(no problem) or it doesn't change the databuffer-and-encoding pointer
until the new databuffer is fully constructed.

Anything keeping its own reference to the old databuffer (and old
encoding) will continue to work, so immutability ==> the two buffers
really are equivalent.

> Sure certain applications that are just copying from one data stream to
> another (both in the same encoding) may needlessly decode and then re-encode
> the data,

Other than text editors, "certain" includes almost any application I
have ever used, let alone written.

> but if the application *knows* that this might happen (and has
> reason to care about optimising the performance of this case), then the
> application is free to decouple the "reading" and "decoding" steps, and just
> transfer raw bytes between the streams.

So adding boilerplate to treat text as bytes "for efficiency" may
become a standard recipe?  Not so good.

-jJ

From ncoghlan at gmail.com  Fri Sep 15 17:15:27 2006
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Sat, 16 Sep 2006 01:15:27 +1000
Subject: [Python-3000] string C API
In-Reply-To:
References: <450820A3.4000302@v.loewis.de> <45083C76.8010302@v.loewis.de> <45084966.3000608@v.loewis.de> <45092CC2.4070700@gmail.com> <4509CAEA.3040108@v.loewis.de> <450AAAD6.5030901@gmail.com>
Message-ID: <450AC38F.4080005@gmail.com>

Jim Jewett wrote:
>> > ISTM that raising the exception lazily (which seems to be necessary)
>> > would be very confusing.
>
>> Yeah, it appears it would be necessary to at least *scan* the string
>> when it
>> was first created in order to ensure it can be decoded without errors
>> later on.
>
> What happens today with strings?  I think the answer is:
> "Nothing.
> They print something odd when printed.
> They may raise errors when explicitly recoded to unicode."
> Why is this a problem?
Unicode strings don't have an encoding - they only store code points.

>> If strings could change the encoding of their internal buffers then
>> they'd
>> have to use a read/write lock internally on all operations that might be
>> affected when the internal representation changes. Blech.
>
> Why?
>
> There should be only one reference to a string until it is constructed,
> and after that, its data should be immutable.  Recoding that results
> in different bytes should not be in-place.  Either it returns a new
> string (no problem) or it doesn't change the databuffer-and-encoding
> pointer until the new databuffer is fully constructed.
>
> Anything keeping its own reference to the old databuffer (and old
> encoding) will continue to work, so immutability ==> the two buffers
> really are equivalent.

I admit that by using a separate Python object for the data buffer
instead of a pointer to raw memory, the incref/decref in the processing
code becomes the moral equivalent of a read lock, but consider the case
where Thread A performs an operation and decides "I need to recode the
buffer to UCS-4" at the same time that Thread B performs an operation
and decides "I need to recode the buffer to UCS-4". To deal with that
you would still want to be very careful with the incref new/reassign/
decref old step for switching in a new data buffer (probably by using
some form of atomic reassignment operation).

And this style has some very serious overhead implications, as each
string would now require:

  - the string object, with a 32- or 64-bit pointer to the data buffer
    object
  - the data buffer object

String memory overhead would double, with an additional 32 or 64 bits
depending on platform. This is a pretty significant increase when it
comes to identifier-length strings.

So still blech, even if you make the data buffer a separate Python
object to avoid the need for an actual read/write lock.

>> Sure certain applications that are just copying from one data stream to
>> another (both in the same encoding) may needlessly decode and then
>> re-encode
>> the data,
>
> Other than text editors, "certain" includes almost any application I
> have ever used, let alone written.

If you're reading text and you *know* it is ASCII data, then you can
just set the encoding to latin-1 (since that can just copy the original
bytes to the string's internal buffer - the actual ascii codec needs to
check each byte to see whether or not the high bit is set, so it would
be slower, and blow up with a DecodingError if the high bit was ever
set). I suspect an awful lot of quick-and-dirty scripts written by
native English speakers will do exactly that.

>> but if the application *knows* that this might happen (and has
>> reason to care about optimising the performance of this case), then the
>> application is free to decouple the "reading" and "decoding" steps,
>> and just
>> transfer raw bytes between the streams.
>
> So adding boilerplate to treat text as bytes "for efficiency" may
> become a standard recipe?  Not so good.

No, the standard recipe becomes "handle bytes as bytes and text as
characters". If you know your source data is 8-bit text (or are happy
to treat it that way, even if it isn't), then use the latin-1 codec to
decode the original bytes directly to 8-bit characters. Or just open
the file in binary and read the data in as bytes instead of characters.

Cheers,
Nick.
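The latin-1 pass-through property Nick relies on here is easy to check
directly in Py3k-style code (a tiny illustration, not part of any
proposed API):

    data = bytes(range(256))
    text = data.decode('latin-1')          # cannot fail: byte n -> code point n
    assert all(ord(c) == b for c, b in zip(text, data))
    assert text.encode('latin-1') == data  # and it round-trips losslessly

This is exactly why latin-1 decoding can be a plain memcpy, while the
ascii codec has to inspect every byte.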
-- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia --------------------------------------------------------------- http://www.boredomandlaziness.org From jason.orendorff at gmail.com Fri Sep 15 18:22:30 2006 From: jason.orendorff at gmail.com (Jason Orendorff) Date: Fri, 15 Sep 2006 12:22:30 -0400 Subject: [Python-3000] string C API In-Reply-To: References: <450820A3.4000302@v.loewis.de> <45083C76.8010302@v.loewis.de> <45084966.3000608@v.loewis.de> <45092CC2.4070700@gmail.com> <4509CAEA.3040108@v.loewis.de> <450AAAD6.5030901@gmail.com> Message-ID: On 9/15/06, Jim Jewett wrote: > There should be only one reference to a string until is constructed, > and after that, its data should be immutable. Recoding that results > in different bytes should not be in-place. Either it returns a new > string (no problem) or it doesn't change the databuffer-and-encoding > pointer until the new databuffer is fully constructed. Yes, but then having, say, a Latin-1 string, and repeatedly using it in places where UTF-16 is needed, causes you to repeat the decoding operation. The optimization becomes a pessimization. Here I'm imagining things like taking len(s) of a UTF-8 string, or s==u where u happens to be UTF-16. You only have to do this once or twice per string to start losing. Also, having two different classes of strings means fewer felicitous cases of x==y, where the result is True, being just a pointer comparison. This might matter in dictionaries: imagine a dictionary created as a literal and then used to look up key strings read from a file. > [Nick Coghlan wrote:] > > [...] the > > application is free to decouple the "reading" and "decoding" steps, and just > > transfer raw bytes between the streams. > > So adding boilerplate to treat text as bytes "for efficiency" may > become a standard recipe? Not so good. I'm sure this will happen to the same degree that it's become a standard recipe in Java and C# (both of which lack polymorphic whatzits). Which is to say, not at all. -j From paul at prescod.net Fri Sep 15 18:33:49 2006 From: paul at prescod.net (Paul Prescod) Date: Fri, 15 Sep 2006 09:33:49 -0700 Subject: [Python-3000] string C API In-Reply-To: References: <45083C76.8010302@v.loewis.de> <45084966.3000608@v.loewis.de> <45092CC2.4070700@gmail.com> <4509CAEA.3040108@v.loewis.de> <450AAAD6.5030901@gmail.com> Message-ID: <1cb725390609150933q43b444f5ne788e9d222a5dcd1@mail.gmail.com> On 9/15/06, Jason Orendorff wrote: > > I'm sure this will happen to the same degree that it's become a > standard recipe in Java and C# (both of which lack polymorphic > whatzits). Which is to say, not at all. I think Jason's point is key. This is probably premature optimization and should not be done if it will complicate the Python user's experience at all (e.g. by delaying exceptions). Polymorphism is interesting to me primarily to support 4-byte characters and therefore go beyond Java and C# in functionality without slowing everything else down. If we gain some speed on them for 8-bit strings, that would be a nice bonus. But delaying UTF-8 decoding has not proven necessary for good performance in the other Unicode-based languages. It just seems like extra complexity for little benefit. Paul Prescod -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://mail.python.org/pipermail/python-3000/attachments/20060915/c0109d27/attachment.htm

From jimjjewett at gmail.com  Fri Sep 15 19:04:08 2006
From: jimjjewett at gmail.com (Jim Jewett)
Date: Fri, 15 Sep 2006 13:04:08 -0400
Subject: [Python-3000] string C API
In-Reply-To: <450AC38F.4080005@gmail.com>
References: <45083C76.8010302@v.loewis.de> <45084966.3000608@v.loewis.de> <45092CC2.4070700@gmail.com> <4509CAEA.3040108@v.loewis.de> <450AAAD6.5030901@gmail.com> <450AC38F.4080005@gmail.com>
Message-ID:

On 9/15/06, Nick Coghlan wrote:
> Jim Jewett wrote:
> >> ... would be necessary to at least *scan* the string when it
> >> was first created in order to ensure it can be decoded without errors

> > What happens today with strings?  I think the answer is:
> > "Nothing.
> > They print something odd when printed.
> > They may raise errors when explicitly recoded to unicode."
> > Why is this a problem?

> We don't have 8-bit strings lying around in Py3k.

Right.  But we do in Py 2.x, and the equivalent delayed errors have
not been a serious problem.  I suppose that might change if everyone
were actually using unicode, so that more stuff got converted
eventually.  On the other hand, I'm not sure how many strings will
*ever* need recoding, if we don't do it on construction.

> To convert bytes to
> characters, they *must* be converted to unicode code points.

A "code point" doesn't exist in actual code; it has to be represented
by some concrete encoding.  The most common encodings are UTF-8 and
the various UTF-16 and UTF-32, but they are still concrete encodings,
rather than the "real" code point.  A bytestream in latin-1 (with
meta-knowledge that it is in latin-1) represents the abstract code
points just as much as a bytestream in UTF-8 would.  For some purposes
(including error detection) it is less efficient, but it is just as
valid.

> > I'm not so happy about the efficiency implication of the idea that
> > *all* strings *must* be validated (let alone recoded).

> Then always define latin-1 as the source encoding for your files - it will
> just pass the bytes straight through.

That would work for skipping validation.  It won't work if Python
insists on recoding everything to an internally privileged encoding.

> > Interning may get awkward if multiple encodings are allowed within a
> > program, regardless of whether they're allowed for single strings.  It
> > might make sense to intern only strings that are in the same encoding
> > as the source code.  (Or whose values are limited to ASCII?)

> Unicode strings don't have an encoding - they only store code points.

But these code points are stored somehow.  In Py 2.x, the decision was
to always use a specific privileged encoding, and to choose that
encoding at compile time.  This decision was not required by Unicode;
it was chosen for implementation reasons.

> I admit that by using a separate Python object for the data buffer instead of
> a pointer to raw memory, the incref/decref in the processing code becomes the
> moral equivalent of a read lock, but consider the case where Thread A performs
> an operation and decides "I need to recode the buffer to UCS-4" at the same
> time that Thread B performs an operation and decides "I need to recode the
> buffer to UCS-4".

Then you end up doing it twice, and wasting even more space.
I expect "never need to change the encoding" will be far more common than (1) Application is multithreaded and (2) Multiple threads happen to be using the same string and (3) Multiple threads need to recode it to the same new encoding at the same time and (4) This recoding need was in some way conditional, so the programmer felt it was sensible to request it both places, instead of just recoding once on creation. > And this style has some very serious overhead implications, as each string > would now require: > The string object, with a 32 or 64 bit pointer to the data buffer object > The data buffer object > String memory overhead would double, with an additional 32 or 64 bits > depending on platform. This is a pretty significant increase when it comes to > identifier-length strings. dicts already have to deal with this. The workaround there was to have a smalltable fastened to the dict, and to waste that smalltable if the dictionary grows too large. strings could do something similar. (Either all strings, keeping the original encoding, or just small strings, so that not too much will ever be wasted.) > >> Sure certain applications that are just copying from one data stream to > >> another (both in the same encoding) may needlessly decode and then > >> re-encode the data, > > Other than text editors, "certain" includes almost any application I > > have ever used, let alone written. > If you're reading text and you *know* it is ASCII data, then you can just set > the encoding to latin-1 Only if latin-1 is a valid encoding for the internal implementation. If it is, then python does have to allow multiple internal implementations, and some way of marking which was used. (Obviously, I think this is the right answer, but this is a change form 2.x, and would require some changes to the C API.) -jJ From jcarlson at uci.edu Fri Sep 15 19:46:52 2006 From: jcarlson at uci.edu (Josiah Carlson) Date: Fri, 15 Sep 2006 10:46:52 -0700 Subject: [Python-3000] string C API In-Reply-To: References: <450AAAD6.5030901@gmail.com> Message-ID: <20060915102433.F980.JCARLSON@uci.edu> "Jim Jewett" wrote: > Interning may get awkward if multiple encodings are allowed within a > program, regardless of whether they're allowed for single strings. It > might make sense to intern only strings that are in the same encoding > as the source code. (Or whose values are limited to ASCII?) Why? If the text hash function is defined on *code points*, then interning, or really any arbitrary dictionary lookup is the same as it has always been. > There should be only one reference to a string until is constructed, > and after that, its data should be immutable. Recoding that results > in different bytes should not be in-place. Either it returns a new > string (no problem) or it doesn't change the databuffer-and-encoding > pointer until the new databuffer is fully constructed. What about never recoding? The benefit of the latin-1/ucs-2/ucs-4 method I previously described is that each of the encodings offer a minimal representation of the code points that the text object contains. Certain operations would require a bit of work to handle the comparison of code points stored in an x-bit-wide representation with code points stored in a y-bit-wide representation. > So adding boilerplate to treat text as bytes "for efficiency" may > become a standard recipe? Not so good. 
Presumably there is going to be a mechanism to open files as bytes
(reads return bytes), and for things like web servers, file servers,
etc., serving the content up as just a bunch of bytes is really the
only thing that makes sense.

 - Josiah

From jcarlson at uci.edu  Fri Sep 15 19:48:06 2006
From: jcarlson at uci.edu (Josiah Carlson)
Date: Fri, 15 Sep 2006 10:48:06 -0700
Subject: [Python-3000] string C API
In-Reply-To:
References:
Message-ID: <20060915102555.F983.JCARLSON@uci.edu>

"Jason Orendorff" wrote:
> On 9/15/06, Jim Jewett wrote:
> > There should be only one reference to a string until it is constructed,
> > and after that, its data should be immutable.  Recoding that results
> > in different bytes should not be in-place.  Either it returns a new
> > string (no problem) or it doesn't change the databuffer-and-encoding
> > pointer until the new databuffer is fully constructed.
>
> Yes, but then having, say, a Latin-1 string, and repeatedly using it
> in places where UTF-16 is needed, causes you to repeat the decoding
> operation.  The optimization becomes a pessimization.
>
> Here I'm imagining things like taking len(s) of a UTF-8 string, or
> s==u where u happens to be UTF-16.  You only have to do this once or
> twice per string to start losing.

This is one of the reasons why I was talking Latin-1, UCS-2, and UCS-4:
If I have a text object X whose internal representation is in UCS-2,
and I have another text object Y whose internal representation is in
UCS-4, then I know X != Y. Why? Because X and Y were created with the
minimal width necessary to support the code points they contain.
Because Y must contain a code point that X cannot contain, X != Y.
When one wants to do things like Y.startswith(X), then you actually
compare the code points.

 - Josiah

From solipsis at pitrou.net  Fri Sep 15 20:04:33 2006
From: solipsis at pitrou.net (Antoine Pitrou)
Date: Fri, 15 Sep 2006 20:04:33 +0200
Subject: [Python-3000] string C API
In-Reply-To: <20060915102555.F983.JCARLSON@uci.edu>
References: <20060915102555.F983.JCARLSON@uci.edu>
Message-ID: <1158343473.4292.14.camel@fsol>

On Friday, 15 September 2006, at 10:48 -0700, Josiah Carlson wrote:
> This is one of the reasons why I was talking Latin-1, UCS-2, and UCS-4:

You could replace "latin-1" with "one-byte system encoding chosen at
interpreter startup depending on locale".
There are lots of 8-bit encodings other than iso-8859-1.
(for example, my current locale uses iso-8859-15)

The algorithm for choosing the one-byte encoding could be:
- if the current locale uses a one-byte encoding, use that encoding
- otherwise, if the current locale's language has a popular one-byte
  encoding (for many languages this would mean iso-8859-<n>), use that
  encoding
- otherwise, no one-byte encoding

This would ensure that, for example, Russian text on a system
configured with a Russian locale does not always end up using two bytes
per character internally.

Regards

Antoine.
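Antoine's start-up algorithm can be prototyped in a few lines. This is
a sketch only: the encoding tables are illustrative stand-ins, and
choose_narrow_encoding is a hypothetical name, not an existing API:

    import codecs
    import locale

    # Normalized names of some one-byte codecs; illustrative, not exhaustive.
    ONE_BYTE = {'iso8859-1', 'iso8859-15', 'cp1251', 'cp1252', 'koi8-r'}

    # A few "popular one-byte encoding for this language" entries.
    POPULAR = {'ru': 'koi8-r', 'el': 'iso8859-7', 'tr': 'iso8859-9'}

    def choose_narrow_encoding():
        lang, enc = locale.getdefaultlocale()   # e.g. ('ru_RU', 'UTF-8')
        if enc:
            try:
                if codecs.lookup(enc).name in ONE_BYTE:
                    return enc          # locale already uses one byte/char
            except LookupError:
                pass
        if lang:
            return POPULAR.get(lang.split('_')[0])  # may be None
        return None                     # no one-byte internal encoding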
From qrczak at knm.org.pl  Fri Sep 15 20:29:55 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Fri, 15 Sep 2006 20:29:55 +0200
Subject: [Python-3000] string C API
In-Reply-To: <1158343473.4292.14.camel@fsol> (Antoine Pitrou's message of "Fri, 15 Sep 2006 20:04:33 +0200")
References: <20060915102555.F983.JCARLSON@uci.edu> <1158343473.4292.14.camel@fsol>
Message-ID: <871wqdhtjw.fsf@qrnik.zagroda>

Antoine Pitrou writes:

>> This is one of the reasons why I was talking Latin-1, UCS-2, and UCS-4:
>
> You could replace "latin-1" with "one-byte system encoding chosen at
> interpreter startup depending on locale".

Latin-1 has the advantage of being trivially decodable to a sequence
of code points. This is convenient for operations like string
concatenation, or string comparison, or taking substrings.

--
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From paul at prescod.net  Fri Sep 15 20:36:21 2006
From: paul at prescod.net (Paul Prescod)
Date: Fri, 15 Sep 2006 11:36:21 -0700
Subject: [Python-3000] string C API
In-Reply-To: <1158343473.4292.14.camel@fsol>
References: <20060915102555.F983.JCARLSON@uci.edu> <1158343473.4292.14.camel@fsol>
Message-ID: <1cb725390609151136j14678530x97bd3dc30e1f6ca4@mail.gmail.com>

On 9/15/06, Antoine Pitrou wrote:
>
> On Friday, 15 September 2006, at 10:48 -0700, Josiah Carlson wrote:
> > This is one of the reasons why I was talking Latin-1, UCS-2, and UCS-4:
>
> You could replace "latin-1" with "one-byte system encoding chosen at
> interpreter startup depending on locale".
> There are lots of 8-bit encodings other than iso-8859-1.
> (for example, my current locale uses iso-8859-15)
>
> The algorithm for choosing the one-byte encoding could be:
> - if the current locale uses a one-byte encoding, use that encoding
> - otherwise, if the current locale's language has a popular one-byte
>   encoding (for many languages this would mean iso-8859-<n>), use that
>   encoding
> - otherwise, no one-byte encoding
>
> This would ensure that, for example, Russian text on a system configured
> with a Russian locale does not always end up using two bytes per
> character internally.

I do not believe that this extra complexity will be valuable in the
long-term because most Europeans will switch to UTF-8 locales over the
next five years. The current situation makes no sense. Think about it
from the end-user's point of view:

"You can use KOI8-R/ISO-8859-? or UTF-8.

Pro for KOI8-R:

 1. text files will use 0.8% instead of 1% of your hard disk space.
 2. backwards compatibility

Pro for UTF-8:

 1. Better compatibility with new software
 2. Easier to share files across geographic boundaries
 3. Ability to encode characters from other character sets
 4. Access to characters like smart quotes, wingdings, fractions and
    so forth."

The result seems obvious to me...8-bit fixed encodings are a terrible
idea and need to just go away. Let's not build them into Python's core
on the basis of a minor and fleeting performance improvement.

Paul Prescod
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.python.org/pipermail/python-3000/attachments/20060915/42867987/attachment.html

From jcarlson at uci.edu  Fri Sep 15 23:16:57 2006
From: jcarlson at uci.edu (Josiah Carlson)
Date: Fri, 15 Sep 2006 14:16:57 -0700
Subject: [Python-3000] string C API
In-Reply-To: <1cb725390609151136j14678530x97bd3dc30e1f6ca4@mail.gmail.com>
References: <1158343473.4292.14.camel@fsol> <1cb725390609151136j14678530x97bd3dc30e1f6ca4@mail.gmail.com>
Message-ID: <20060915133827.F98C.JCARLSON@uci.edu>

"Paul Prescod" wrote:
[snip]
> The result seems obvious to me...8-bit fixed encodings are a terrible idea
> and need to just go away. Let's not build them into Python's core on the
> basis of a minor and fleeting performance improvement.

Variable-width encodings make many operations difficult, not the least
of which being "what is the code point for the ith character?" The
benefit of going with a fixed-width encoding (like Python currently
does for unicode objects with UCS-2) is that so many computations are
merely an iteration over a sequence of chars/shorts/ints. No need to
recode for complicated operations, no need to understand utf-8 for
string operations, etc.

 - Josiah

From jimjjewett at gmail.com  Fri Sep 15 23:37:41 2006
From: jimjjewett at gmail.com (Jim Jewett)
Date: Fri, 15 Sep 2006 17:37:41 -0400
Subject: [Python-3000] string C API
In-Reply-To: <20060915102433.F980.JCARLSON@uci.edu>
References: <450AAAD6.5030901@gmail.com> <20060915102433.F980.JCARLSON@uci.edu>
Message-ID:

On 9/15/06, Josiah Carlson wrote:
> "Jim Jewett" wrote:
> > Interning may get awkward if multiple encodings are allowed within a
> > program, regardless of whether they're allowed for single strings.  It
> > might make sense to intern only strings that are in the same encoding
> > as the source code.  (Or whose values are limited to ASCII?)
>
> Why?  If the text hash function is defined on *code points*, then
> interning, or really any arbitrary dictionary lookup is the same as it
> has always been.

The problem isn't the hash; it is the equality. Which encoding do you
keep interned?

> What about never recoding?  The benefit of the latin-1/ucs-2/ucs-4
> method I previously described is that each of the encodings offer a
> minimal representation of the code points that the text object contains.

There may be some thrashing as

    s += (larger char)
    s[:6]

The three options might well be a sensible choice, but I think it
would already have much of the disadvantage of multiple internal
encodings, and we might eventually regret any specific limits. (Why
not the local 8-bit? Why not UTF-8, if that is the system encoding?)
It is easy enough to answer why not for each specific case, but I'm
not *certain* that it is the right answer -- so why not leave it up to
implementors if they want to do more than the basic three?

> Presumably there is going to be a mechanism to open files as bytes
> (reads return bytes), and for things like web servers, file servers, etc.,
> serving the content up as just a bunch of bytes is really the only thing
> that makes sense.

If someone has to recognize that their document is "text" when they
edit it, but "bytes" when they serve it over the web, and then "text"
again when they view it in the browser ... that is a recipe for
misunderstandings.
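Josiah's "what is the code point for the ith character?" point above is
easy to make concrete: with a variable-width encoding such as UTF-8,
both length and indexing need a scan over the bytes, where a fixed-width
format needs only arithmetic. A rough sketch, assuming valid UTF-8
input (the helper names are illustrative):

    def utf8_length(data):
        # Count characters in one pass: lead bytes are the ones that
        # do not look like 10xxxxxx continuation bytes.
        return sum(1 for b in data if b & 0xC0 != 0x80)

    def utf8_index(data, i):
        # O(n) scan to find character i; with a fixed-width encoding
        # this would be a single multiplication.
        count = -1
        for pos, b in enumerate(data):
            if b & 0xC0 != 0x80:
                count += 1
                if count == i:
                    end = pos + 1
                    while end < len(data) and data[end] & 0xC0 == 0x80:
                        end += 1
                    return data[pos:end].decode('utf-8')
        raise IndexError(i)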
-jJ

From jcarlson at uci.edu  Sat Sep 16 02:13:33 2006
From: jcarlson at uci.edu (Josiah Carlson)
Date: Fri, 15 Sep 2006 17:13:33 -0700
Subject: [Python-3000] string C API
In-Reply-To:
References: <20060915102433.F980.JCARLSON@uci.edu>
Message-ID: <20060915153702.F98F.JCARLSON@uci.edu>

"Jim Jewett" wrote:
> On 9/15/06, Josiah Carlson wrote:
> > "Jim Jewett" wrote:
> > > Interning may get awkward if multiple encodings are allowed within a
> > > program, regardless of whether they're allowed for single strings.  It
> > > might make sense to intern only strings that are in the same encoding
> > > as the source code.  (Or whose values are limited to ASCII?)
>
> > Why?  If the text hash function is defined on *code points*, then
> > interning, or really any arbitrary dictionary lookup is the same as it
> > has always been.
>
> The problem isn't the hash; it is the equality. Which encoding do you
> keep interned?

There is one minimal 'encoding' for any unicode string (in one of
latin-1, ucs-2, or ucs-4), really being an array of minimal-width
char/short/int code points. Because all text objects are internally
represented in their minimal 'encoding', equal text objects will always
be in the same encoding.

> > What about never recoding?  The benefit of the latin-1/ucs-2/ucs-4
> > method I previously described is that each of the encodings offer a
> > minimal representation of the code points that the text object contains.
>
> There may be some thrashing as
>
>     s += (larger char)
>     s[:6]

So there may be thrashing. I don't see this as a problem. String
addition and slicing is known to be linear in the length of the string
being produced for all nontrivial cases. It's still linear. What's the
problem?

> The three options might well be a sensible choice, but I think it
> would already have much of the disadvantage of multiple internal
> encodings, and we might eventually regret any specific limits. (Why
> not the local 8-bit? Why not UTF-8, if that is the system encoding?)
> It is easy enough to answer why not for each specific case, but I'm
> not *certain* that it is the right answer -- so why not leave it up to
> implementors if they want to do more than the basic three?

By "basic three" I presume you mean latin-1, ucs-2, and ucs-4. I'm not
advocating for anything beyond those, in fact, I'm specifically
discouraging using anything other than those three, and I'm
specifically discouraging the idea of recoding internal
representations. Once a text object is created, its internal state is
fixed until it is destroyed.

> > Presumably there is going to be a mechanism to open files as bytes
> > (reads return bytes), and for things like web servers, file servers, etc.,
> > serving the content up as just a bunch of bytes is really the only thing
> > that makes sense.
>
> If someone has to recognize that their document is "text" when they
> edit it, but "bytes" when they serve it over the web, and then "text"
> again when they view it in the browser ... that is a recipe for
> misunderstandings.

They don't need to recognize anything when it is served onto the web.
Just like they don't need to recognize anything right now. The file is
served verbatim off of disk, which is then understood by the browser
because of encoding information built into the format. If the format
doesn't have encoding information built into it, then the user isn't
going to be able to edit it.
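The invariant Josiah describes, always storing a text object at the
smallest width that can hold its largest code point, comes down to a
few lines (minimal_width is a hypothetical helper, shown only for
illustration):

    def minimal_width(code_points):
        # Width in bytes of the narrowest internal form (latin-1 /
        # UCS-2 / UCS-4) that can hold every code point in the string.
        largest = max(code_points) if code_points else 0
        if largest < 0x100:
            return 1        # latin-1
        if largest < 0x10000:
            return 2        # UCS-2
        return 4            # UCS-4

Given that invariant, two strings stored at different widths can be
reported unequal without comparing a single code point, which is the
equality shortcut discussed above; operations like startswith() still
have to compare actual code points across widths.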
 - Josiah

From and at doxdesk.com  Sat Sep 16 03:08:06 2006
From: and at doxdesk.com (Andrew Clover)
Date: Sat, 16 Sep 2006 02:08:06 +0100
Subject: [Python-3000] UTF-16
In-Reply-To: <1cb725390608312124u24d20ec2q27dbe5a69c2440d3@mail.gmail.com>
References: <1cb725390608312032t388c250by13befed154b4442d@mail.gmail.com> <1cb725390608312124u24d20ec2q27dbe5a69c2440d3@mail.gmail.com>
Message-ID: <450B4E76.1010501@doxdesk.com>

On 2006-09-01, Paul Prescod wrote:

> I cannot understand why a user should be forced to choose between 16 and 32
> bit strings AT BUILD TIME.

I strongly agree. This has been troublesome for many, not just people
trying to install binary libs, but also Python code that does actually
need to know the difference between unicode and wide-unicode
characters.

Ideally, implementation work notwithstanding, I would *love* to be able
to have both types at a literal level (as unicode subclasses), along
with retained byte string literals.

    ucs2string= u'\U00010000'  # 2 chars, \ud800\udc00
    ucs4string= w'\U00010000'  # 1 char
    bytestring= b'abc'
    string= 'abc'              # byte in 2.x, ucs2 in 3.0

If these were all subclasses of basestring, and other string type
subclasses could be defined taking advantage of basic string methods,
that could also allow the CSI stuff Matz mentioned in the post you
forwarded. Although I'm personally not at all a fan of non-Unicode
string types and would rather die than put i-mode emoji in a character
set :-)

--
And Clover
mailto:and at doxdesk.com
http://www.doxdesk.com/

From greg.ewing at canterbury.ac.nz  Sat Sep 16 03:07:06 2006
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Sat, 16 Sep 2006 13:07:06 +1200
Subject: [Python-3000] string C API
In-Reply-To: <20060915153702.F98F.JCARLSON@uci.edu>
References: <20060915102433.F980.JCARLSON@uci.edu> <20060915153702.F98F.JCARLSON@uci.edu>
Message-ID: <450B4E3A.7000005@canterbury.ac.nz>

Josiah Carlson wrote:
> Because all text objects are internally
> represented in their minimal 'encoding', equal text objects will always be
> in the same encoding.

That places a burden on all creators of strings to ensure
that they are in the minimal format, which could be
inconvenient for some operations, e.g. taking a substring
could require making an extra pass to re-code the data.

It would also preclude the possibility of representing
a substring as a view.

I don't see any great advantage given by this restriction
anyway. So you could tell two strings were unequal in
some cases if they happened to have different storage
formats, but there would still be plenty of cases
where you did have to compare them. Doesn't look like
a big deal to me.

--
Greg

From ncoghlan at gmail.com  Sat Sep 16 05:14:49 2006
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Sat, 16 Sep 2006 13:14:49 +1000
Subject: [Python-3000] string C API
In-Reply-To:
References: <45083C76.8010302@v.loewis.de> <45084966.3000608@v.loewis.de> <45092CC2.4070700@gmail.com> <4509CAEA.3040108@v.loewis.de> <450AAAD6.5030901@gmail.com> <450AC38F.4080005@gmail.com>
Message-ID: <450B6C29.8060107@gmail.com>

Jim Jewett wrote:
> On 9/15/06, Nick Coghlan wrote:
>> If you're reading text and you *know* it is ASCII data, then you can
>> just set
>> the encoding to latin-1
>
> Only if latin-1 is a valid encoding for the internal implementation.

I think the possible internal encodings should be latin-1, UCS-2 and
UCS-4, with the size for a given string dictated by the largest
codepoint in the string at creation time.
That way the internal representation of a string would only need to grow
one extra field (the one saying how many bytes there are per character),
and the internal state would remain immutable.

For 8-bit source data, 'latin-1' would then be the most efficient
encoding, in that it would be a simple memcpy from the bytes object's
internal buffer to the string object's internal buffer. Other encodings
like 'koi8-r' would be decoded to either latin-1, UCS-2 or UCS-4
depending on the largest code point in the source data.

[Jim]
> If it is, then Python does have to allow multiple internal
> implementations, and some way of marking which was used.  (Obviously,
> I think this is the right answer, but this is a change from 2.x, and
> would require some changes to the C API.)

One of the paragraphs you cut when replying to my message:

[Nick]
>> Far, far simpler is the idea of supporting only latin-1, UCS-2 and UCS-4 as
>> internal representations, and choosing which one to use when the string is
>> created.

I think we might be violently agreeing :)

Cheers,
Nick.

--
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
http://www.boredomandlaziness.org

From ncoghlan at gmail.com  Sat Sep 16 05:46:36 2006
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Sat, 16 Sep 2006 13:46:36 +1000
Subject: [Python-3000] string C API
In-Reply-To: <1158343473.4292.14.camel@fsol>
References: <20060915102555.F983.JCARLSON@uci.edu> <1158343473.4292.14.camel@fsol>
Message-ID: <450B739C.20607@gmail.com>

Antoine Pitrou wrote:
> On Friday, 15 September 2006, at 10:48 -0700, Josiah Carlson wrote:
>> This is one of the reasons why I was talking Latin-1, UCS-2, and UCS-4:
>
> You could replace "latin-1" with "one-byte system encoding chosen at
> interpreter startup depending on locale".
> There are lots of 8-bit encodings other than iso-8859-1.
> (for example, my current locale uses iso-8859-15)

The choice of latin-1 is deliberate and non-arbitrary. The reason for
the choice is that the ordinals 0-255 in latin-1 map to the Unicode code
points 0-255:

    >>> x = range(256)
    >>> xs = ''.join(map(chr, x))
    >>> xu = xs.decode('latin-1')
    >>> all(ord(s)==ord(u) for s, u in zip(xs, xu))
    True

In effect, when creating the string, you would be doing something like
this:

    # eight_bit_data and decode_to_UCS4 are stand-ins for the raw
    # bytes being decoded and a decoding helper
    if encoding == 'latin-1':
        bytes_per_char = 1
        code_points = eight_bit_data
    else:
        code_points, max_code_point = decode_to_UCS4(eight_bit_data, encoding)
        if max_code_point < 256:
            bytes_per_char = 1
        elif max_code_point < 65536:
            bytes_per_char = 2
        else:
            bytes_per_char = 4

    # A width argument to the bytes constructor would be very convenient
    # for being able to consistently deal with endianness issues
    self.internal_buffer = bytes(code_points, width=bytes_per_char)
    self.bytes_per_char = bytes_per_char

Cheers,
Nick.

--
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
http://www.boredomandlaziness.org

From ronaldoussoren at mac.com  Sat Sep 16 07:59:45 2006
From: ronaldoussoren at mac.com (Ronald Oussoren)
Date: Sat, 16 Sep 2006 07:59:45 +0200
Subject: [Python-3000] string C API
In-Reply-To:
References: <45083C76.8010302@v.loewis.de> <45084966.3000608@v.loewis.de> <45092CC2.4070700@gmail.com> <4509CAEA.3040108@v.loewis.de> <450AAAD6.5030901@gmail.com> <450AC38F.4080005@gmail.com>
Message-ID: <58B2A910-3360-4480-A8CF-2D4C95F56981@mac.com>

On Sep 15, 2006, at 7:04 PM, Jim Jewett wrote:
> On 9/15/06, Nick Coghlan wrote:
>> Jim Jewett wrote:
>
>>>> ... would be necessary to at least *scan* the string when it
>>>> was first created in order to ensure it can be decoded without
>>>> errors
>
>>> What happens today with strings?  I think the answer is:
>>> "Nothing.
>>> They print something odd when printed.
>>> They may raise errors when explicitly recoded to unicode."
>>> Why is this a problem?
>
>> We don't have 8-bit strings lying around in Py3k.
>
> Right.  But we do in Py 2.x, and the equivalent delayed errors have
> not been a serious problem.  I suppose that might change if everyone
> were actually using unicode, so that more stuff got converted
> eventually.  On the other hand, I'm not sure how many strings will
> *ever* need recoding, if we don't do it on construction.

Automatic conversion from str to unicode in Py2.x is annoying at times,
mostly because it is easy to miss at development time. Using unicode
throughout (explicit conversion to unicode at the application boundary)
solves that, but that problem would reappear if
unicode(somestr, someencoding) would return a value that might cause a
UnicodeError when you try to access its value.

Another reason for disliking your idea is that unicode/py3k-str is a
sequence of unicode code points and should always behave like one to
the user. A polymorphic string type is an optimization (and an unproven
one at that) and shouldn't complicate the Python-level string API.

Ronald
-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s
Type: application/pkcs7-signature
Size: 2157 bytes
Desc: not available
Url : http://mail.python.org/pipermail/python-3000/attachments/20060916/594b545a/attachment.bin

From martin at v.loewis.de  Sat Sep 16 08:32:37 2006
From: martin at v.loewis.de ("Martin v. Löwis")
Date: Sat, 16 Sep 2006 08:32:37 +0200
Subject: [Python-3000] string C API
In-Reply-To: <450B6C29.8060107@gmail.com>
References: <45083C76.8010302@v.loewis.de> <45084966.3000608@v.loewis.de> <45092CC2.4070700@gmail.com> <4509CAEA.3040108@v.loewis.de> <450AAAD6.5030901@gmail.com> <450AC38F.4080005@gmail.com> <450B6C29.8060107@gmail.com>
Message-ID: <450B9A85.6040904@v.loewis.de>

Nick Coghlan wrote:
> That way the internal representation of a string would only need to grow
> one extra field (the one saying how many bytes there are per character),
> and the internal state would remain immutable.

You could play tricks with ob_size to save this field:

- ob_size < 0: 8-bit data; length is abs(ob_size)
- ob_size > 0, (ob_size & 1)==0: 16-bit data, length is ob_size/2
- ob_size > 0, (ob_size & 1)==1: 32-bit data, length is ob_size/2

The first representation constrains the length of an 8-bit
representation to max_ssize_t, which is also the limit today. For
16-bit strings, the limit is max_ssize_t/2, which means max_ssize_t
bytes; this is technically more constraining, but such a string would
still consume half of the address space, and is unlikely to get
created (*). For 32-bit strings, the limit is also max_ssize_t/2, yet
the maximum string would require more than 2*max_ssize_t (==max_size_t)
bytes, so this isn't a real limitation.

> For 8-bit source data, 'latin-1' would then be the most efficient
> encoding, in that it would be a simple memcpy from the bytes object's
> internal buffer to the string object's internal buffer. Other encodings
> like 'koi8-r' would be decoded to either latin-1, UCS-2 or UCS-4
> depending on the largest code point in the source data.

This might somewhat slow down codecs, which would have to scan the
input string first to find out what the maximum code point is, where
they currently can decode in a single pass. Of course, for multi-byte
codecs, such scanning is a good idea, anyway (some currently
overallocate just to avoid the second pass).

Regards,
Martin

(*) Many systems don't allow such large memory blocks, anyway. E.g. on
32-bit Windows, in the standard configuration, the address space is
"only" 2GB.

From jcarlson at uci.edu  Sat Sep 16 10:22:43 2006
From: jcarlson at uci.edu (Josiah Carlson)
Date: Sat, 16 Sep 2006 01:22:43 -0700
Subject: [Python-3000] string C API
In-Reply-To: <450B4E3A.7000005@canterbury.ac.nz>
References: <20060915153702.F98F.JCARLSON@uci.edu> <450B4E3A.7000005@canterbury.ac.nz>
Message-ID: <20060915183617.F995.JCARLSON@uci.edu>

Greg Ewing wrote:
> Josiah Carlson wrote:
> > Because all text objects are internally
> > represented in their minimal 'encoding', equal text objects will always be
> > in the same encoding.
>
> That places a burden on all creators of strings to ensure
> that they are in the minimal format, which could be
> inconvenient for some operations, e.g. taking a substring
> could require making an extra pass to re-code the data.

If Martin says it's not a big deal, I'm not really all that concerned.

> It would also preclude the possibility of representing
> a substring as a view.

It doesn't preclude views. Every operation works as before, only now
one would need to compare contents even on unequal-width code points.

> I don't see any great advantage given by this restriction
> anyway. So you could tell two strings were unequal in
> some cases if they happened to have different storage
> formats, but there would still be plenty of cases
> where you did have to compare them. Doesn't look like
> a big deal to me.

It is ultimately about space savings, and in the case of names (since
all will be 8-bit), perhaps even a bit faster to look up in the
interning table (I believe it is easier to hash 8 chars than 8 shorts).

 - Josiah

From qrczak at knm.org.pl  Sat Sep 16 11:53:51 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Sat, 16 Sep 2006 11:53:51 +0200
Subject: [Python-3000] string C API
In-Reply-To: <450B4E3A.7000005@canterbury.ac.nz> (Greg Ewing's message of "Sat, 16 Sep 2006 13:07:06 +1200")
References: <20060915102433.F980.JCARLSON@uci.edu> <20060915153702.F98F.JCARLSON@uci.edu> <450B4E3A.7000005@canterbury.ac.nz>
Message-ID: <874pv8i1cg.fsf@qrnik.zagroda>

Greg Ewing writes:

> That places a burden on all creators of strings to ensure
> that they are in the minimal format, which could be
> inconvenient for some operations, e.g. taking a substring
> could require making an extra pass to re-code the data.

Yes, but taking a substring already requires linear time wrt. the
length of the substring. Allocating a string from a C array of wide
characters (which determines the format from the contents) will be
written once and called as a function. Most strings are ASCII, so most
of the time there is no need to check whether the substring could
become even narrower.

> It would also preclude the possibility of representing
> a substring as a view.

If views were implemented on the level of C pointers, then views would
not have the property of being in the canonical representation wrt.
character width. I still think it's valuable to use a more compact
representation if it would affect most strings.

> I don't see any great advantage given by this restriction
> anyway.
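Martin's ob_size packing scheme from earlier in the thread can be
sanity-checked with a few lines of Python arithmetic (an illustration
only; the real bookkeeping would live in C):

    def pack_size(length, width):
        # 8-bit data: negative; 16-bit: even positive; 32-bit: odd positive
        if width == 1:
            return -length
        if width == 2:
            return 2 * length
        return 2 * length + 1

    def unpack_size(ob_size):
        if ob_size < 0:
            return -ob_size, 1
        return ob_size // 2, (2 if ob_size % 2 == 0 else 4)

    # Length 0 is ambiguous between the widths, which is harmless for an
    # empty string; every other (length, width) pair round-trips exactly.
    for length in (1, 7, 1000):
        for width in (1, 2, 4):
            assert unpack_size(pack_size(length, width)) == (length, width)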
>>>> would be necessary to at least *scan* the string when it
>>>> was first created in order to ensure it can be decoded without
>>>> errors
>
>>> What happens today with strings? I think the answer is:
>>> "Nothing.
>>> They print something odd when printed.
>>> They may raise errors when explicitly recoded to unicode."
>>> Why is this a problem?
>
>> We don't have 8-bit strings lying around in Py3k.
>
> Right. But we do in Py 2.x, and the equivalent delayed errors have
> not been a serious problem. I suppose that might change if everyone
> were actually using unicode, so that more stuff got converted
> eventually. On the other hand, I'm not sure how many strings will
> *ever* need recoding, if we don't do it on construction.

Automatic conversion from str to unicode in Py2.x is annoying at times, mostly because it is easy to miss at development time. Using unicode throughout (explicit conversion to unicode at the application boundary) solves that, but that problem would reappear if unicode(somestr, someencoding) could return a value that might cause a UnicodeError when you try to access its value.

Another reason for disliking your idea is that unicode/py3k-str is a sequence of unicode code points and should always behave like one to the user. A polymorphic string type is an optimization (and an unproven one at that) and shouldn't complicate the Python-level string API.

Ronald
-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s
Type: application/pkcs7-signature
Size: 2157 bytes
Desc: not available
Url : http://mail.python.org/pipermail/python-3000/attachments/20060916/594b545a/attachment.bin

From martin at v.loewis.de Sat Sep 16 08:32:37 2006
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Sat, 16 Sep 2006 08:32:37 +0200
Subject: [Python-3000] string C API
In-Reply-To: <450B6C29.8060107@gmail.com>
References: <45083C76.8010302@v.loewis.de> <45084966.3000608@v.loewis.de> <45092CC2.4070700@gmail.com> <4509CAEA.3040108@v.loewis.de> <450AAAD6.5030901@gmail.com> <450AC38F.4080005@gmail.com> <450B6C29.8060107@gmail.com>
Message-ID: <450B9A85.6040904@v.loewis.de>

Nick Coghlan schrieb:
> That way the internal representation of a string would only need to grow
> one extra field (the one saying how many bytes there are per character),
> and the internal state would remain immutable.

You could play tricks with ob_size to save this field:

- ob_size < 0: 8-bit data; length is abs(ob_size)
- ob_size > 0, (ob_size & 1)==0: 16-bit data, length is ob_size/2
- ob_size > 0, (ob_size & 1)==1: 32-bit data, length is ob_size/2

The first representation constrains the length of an 8-bit representation to max_ssize_t, which is also the limit today. For 16-bit strings, the limit is max_ssize_t/2, which means max_ssize_t bytes; this is technically more constraining, but such a string would still consume half of the address space, and is unlikely to get created (*). For 32-bit strings, the limit is also max_ssize_t/2, yet the maximum string would require more than 2*max_ssize_t (==max_size_t) bytes, so this isn't a real limitation.

> For 8-bit source data, 'latin-1' would then be the most efficient
> encoding, in that it would be a simple memcpy from the bytes object's
> internal buffer to the string object's internal buffer. Other encodings
> like 'koi8-r' would be decoded to either latin-1, UCS-2 or UCS-4
> depending on the largest code point in the source data.

This might somewhat slow down codecs, which would have to scan the input string first to find out what the maximum code point is, where they currently can decode in a single pass. Of course, for multi-byte codecs, such scanning is a good idea, anyway (some currently overallocate just to avoid the second pass).

Regards,
Martin

(*) Many systems don't allow such large memory blocks, anyway. E.g. on 32-bit Windows, in the standard configuration, the address space is "only" 2GB.
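(To make the ob_size packing Martin describes concrete, here is a rough Python model of it; the helper names are invented, the real field would of course be a C ssize_t, and a zero-length string would need special-casing:)

    def pack_ob_size(bytes_per_char, length):
        if bytes_per_char == 1:
            return -length           # 8-bit data: negative length
        elif bytes_per_char == 2:
            return 2 * length        # 16-bit data: positive and even
        else:
            return 2 * length + 1    # 32-bit data: positive and odd

    def unpack_ob_size(ob_size):
        if ob_size < 0:
            return 1, -ob_size
        elif ob_size & 1 == 0:
            return 2, ob_size // 2
        else:
            return 4, ob_size // 2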
From jcarlson at uci.edu Sat Sep 16 10:22:43 2006
From: jcarlson at uci.edu (Josiah Carlson)
Date: Sat, 16 Sep 2006 01:22:43 -0700
Subject: [Python-3000] string C API
In-Reply-To: <450B4E3A.7000005@canterbury.ac.nz>
References: <20060915153702.F98F.JCARLSON@uci.edu> <450B4E3A.7000005@canterbury.ac.nz>
Message-ID: <20060915183617.F995.JCARLSON@uci.edu>

Greg Ewing wrote:
>
> Josiah Carlson wrote:
> > Because all text objects are internally
> > represented in its minimal 'encoding', equal text objects will always be
> > in the same encoding.
>
> That places a burden on all creators of strings to ensure
> that they are in the minimal format, which could be
> inconvenient for some operations, e.g. taking a substring
> could require making an extra pass to re-code the data.

If Martin says it's not a big deal, I'm not really all that concerned.

> It would also preclude the possibility of representing
> a substring as a view.

It doesn't preclude views. Every operation works as before, only now one would need to compare contents even on unequal-width code points.

> I don't see any great advantage given by this restriction
> anyway. So you could tell two strings were unequal in
> some cases if they happened to have different storage
> formats, but there would still be plenty of cases
> where you did have to compare them. Doesn't look like
> a big deal to me.

It is ultimately about space savings, and in the case of names (since all will be 8-bit), perhaps even a bit faster to look up in the interning table (I believe it is easier to hash 8 chars than 8 shorts).

- Josiah

From qrczak at knm.org.pl Sat Sep 16 11:53:51 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Sat, 16 Sep 2006 11:53:51 +0200
Subject: [Python-3000] string C API
In-Reply-To: <450B4E3A.7000005@canterbury.ac.nz> (Greg Ewing's message of "Sat, 16 Sep 2006 13:07:06 +1200")
References: <20060915102433.F980.JCARLSON@uci.edu> <20060915153702.F98F.JCARLSON@uci.edu> <450B4E3A.7000005@canterbury.ac.nz>
Message-ID: <874pv8i1cg.fsf@qrnik.zagroda>

Greg Ewing writes:

> That places a burden on all creators of strings to ensure
> that they are in the minimal format, which could be
> inconvenient for some operations, e.g. taking a substring
> could require making an extra pass to re-code the data.

Yes, but taking a substring already requires linear time wrt. the length of the substring. Allocating a string from a C array of wide characters (which determines the format from the contents) will be written once and called as a function. Most strings are ASCII, so most of the time there is no need to check whether the substring could become even narrower.

> It would also preclude the possibility of representing
> a substring as a view.

If views were implemented on the level of C pointers, then views would not have the property of being in the canonical representation wrt. character width. It's still valuable I think to use a more compact representation if it would affect most strings.

> I don't see any great advantage given by this restriction
> anyway.
Keeping the canonical representation is not very important. It just ensures that the advantage of having a more compact representation taken as often as possible, even if the string has been cut from another string which contained a wide character. -- __("< Marcin Kowalczyk \__/ qrczak at knm.org.pl ^^ http://qrnik.knm.org.pl/~qrczak/ From qrczak at knm.org.pl Sat Sep 16 12:02:37 2006 From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk) Date: Sat, 16 Sep 2006 12:02:37 +0200 Subject: [Python-3000] string C API In-Reply-To: <450B9A85.6040904@v.loewis.de> (Martin v. =?iso-8859-2?q?L=F6wis's?= message of "Sat, 16 Sep 2006 08:32:37 +0200") References: <45083C76.8010302@v.loewis.de> <45084966.3000608@v.loewis.de> <45092CC2.4070700@gmail.com> <4509CAEA.3040108@v.loewis.de> <450AAAD6.5030901@gmail.com> <450AC38F.4080005@gmail.com> <450B6C29.8060107@gmail.com> <450B9A85.6040904@v.loewis.de> Message-ID: <87zmd0gmde.fsf@qrnik.zagroda> "Martin v. L?wis" writes: > You could play tricks with ob_size to save this field: > > - ob_size < 0: 8-bit data; length is abs(ob_size) > - ob_size > 0, (ob_size & 1)==0: 16-bit data, length is ob_size/2 > - ob_size > 0, (ob_size & 1)==1: 32-bit data, length is ob_size/2 I wonder whether strings with characters outside ISO-8859-1 are common enough that having a 16-bit representation is worth the trouble. CLISP does have it. My language doesn't. -- __("< Marcin Kowalczyk \__/ qrczak at knm.org.pl ^^ http://qrnik.knm.org.pl/~qrczak/ From martin at v.loewis.de Sat Sep 16 15:43:36 2006 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Sat, 16 Sep 2006 15:43:36 +0200 Subject: [Python-3000] string C API In-Reply-To: <20060915183617.F995.JCARLSON@uci.edu> References: <20060915153702.F98F.JCARLSON@uci.edu> <450B4E3A.7000005@canterbury.ac.nz> <20060915183617.F995.JCARLSON@uci.edu> Message-ID: <450BFF88.3060902@v.loewis.de> Josiah Carlson schrieb: >> That places a burden on all creators of strings to ensure >> that they are in the minimal format, which could be >> inconvenient for some operations, e.g. taking a substring >> could require making an extra pass to re-code the data. > > If Martin says it's not a big deal, I'm not really all that concerned. I was thinking about codecs specifically: they often need to make multiple passes anyway. In general, only measurements can tell the performance impacts of some design decision (e.g. it's non-obvious how often the various string operations occur, and what the performance impact is). There is also an issue of convenience here; however, with three different representations, library functions could be provided to support all cases. > It is ultimately about space savings, and in the case of names (since > all will be 8-bit), perhaps even a bit faster to look up in the > interning table (I believe it is easier to hash 8 chars than 8 shorts). That you need to demonstrate through profiling. First, strings likely continue to keep their hash, and then it seems plausible that the cost for hashing is in the computation and the loop, not in the memory access, and that the computation is carried out in 32-bit registers regardless of character width. 
Regards, Martin From martin at v.loewis.de Sat Sep 16 15:49:29 2006 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Sat, 16 Sep 2006 15:49:29 +0200 Subject: [Python-3000] string C API In-Reply-To: <450B739C.20607@gmail.com> References: <20060915102555.F983.JCARLSON@uci.edu> <1158343473.4292.14.camel@fsol> <450B739C.20607@gmail.com> Message-ID: <450C00E9.6070008@v.loewis.de> Nick Coghlan schrieb: > The choice of latin-1 is deliberate and non-arbitrary. The reason for the > choice is that the ordinals 0-255 in latin-1 map to the Unicode code points 0-255: That's true, but that this makes a good choice for a special case doesn't follow. Instead, frequency of occurrence of the special case makes it a good choice. > In effect, when creating the string, you would be doing something like this: > > if encoding == 'latin-1': > bytes_per_char = 1 > code_points = 8_bit_data > else: > code_points, max_code_point = decode_to_UCS4(8_bit_data, encoding) > if max_code_point < 256: > bytes_per_char = 1 > elif max_code_point < 65536: > bytes_per_char = 2 > else: > bytes_per_char = 4 Hardly. Instead, the codec would have to create the string of the right width; a codec written in C would make two passes, rather than temporarily allocating memory to actually represent the UCS-4 codes. Regards, Martin From martin at v.loewis.de Sat Sep 16 15:55:47 2006 From: martin at v.loewis.de (=?ISO-8859-2?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Sat, 16 Sep 2006 15:55:47 +0200 Subject: [Python-3000] string C API In-Reply-To: <87zmd0gmde.fsf@qrnik.zagroda> References: <45083C76.8010302@v.loewis.de> <45084966.3000608@v.loewis.de> <45092CC2.4070700@gmail.com> <4509CAEA.3040108@v.loewis.de> <450AAAD6.5030901@gmail.com> <450AC38F.4080005@gmail.com> <450B6C29.8060107@gmail.com> <450B9A85.6040904@v.loewis.de> <87zmd0gmde.fsf@qrnik.zagroda> Message-ID: <450C0263.3090506@v.loewis.de> Marcin 'Qrczak' Kowalczyk schrieb: >> You could play tricks with ob_size to save this field: >> >> - ob_size < 0: 8-bit data; length is abs(ob_size) >> - ob_size > 0, (ob_size & 1)==0: 16-bit data, length is ob_size/2 >> - ob_size > 0, (ob_size & 1)==1: 32-bit data, length is ob_size/2 > > I wonder whether strings with characters outside ISO-8859-1 are common > enough that having a 16-bit representation is worth the trouble. > > CLISP does have it. My language doesn't. The design of Unicode is so that all "living" scripts are encoded with the BMP. So four-byte characters would be extremely rare, and one may argue that encoding them with UTF-16 is good enough. So if there is flexibility in the internal representation of strings, I think a two-byte representation should definitely be one of the options; I'd rather debate about the necessity of one-byte and four-byte representations. Regards, Martin From ncoghlan at gmail.com Sat Sep 16 18:49:36 2006 From: ncoghlan at gmail.com (Nick Coghlan) Date: Sun, 17 Sep 2006 02:49:36 +1000 Subject: [Python-3000] string C API In-Reply-To: <450C00E9.6070008@v.loewis.de> References: <20060915102555.F983.JCARLSON@uci.edu> <1158343473.4292.14.camel@fsol> <450B739C.20607@gmail.com> <450C00E9.6070008@v.loewis.de> Message-ID: <450C2B20.4090504@gmail.com> Martin v. L?wis wrote: > Nick Coghlan schrieb: >> The choice of latin-1 is deliberate and non-arbitrary. The reason for the >> choice is that the ordinals 0-255 in latin-1 map to the Unicode code points 0-255: > > That's true, but that this makes a good choice for a special case > doesn't follow. 
Instead, frequency of occurrence of the special case > makes it a good choice. If an 8-bit encoding other than latin-1 is used for the internal buffer, then every comparison operation would have to decode the string to Unicode in order to compare code points. It seems much simpler to me to ensure that what is stored internally is *always* the Unicode code points, with the width (1, 2 or 4 bytes) determined by the largest code point in the string. The latter two are the UCS-2 and UCS-4 formats that are compile-time selectable for unicode strings in Python 2.x, but I'm not aware of any name other than 'latin-1' for the case where all of the code points are less than 256. > Hardly. Instead, the codec would have to create the string of the right > width; a codec written in C would make two passes, rather than > temporarily allocating memory to actually represent the UCS-4 codes. Indeed, that does make more sense - one pass to figure out the number of characters and the largest code point, and a second to copy the characters to the allocated buffer. Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia --------------------------------------------------------------- http://www.boredomandlaziness.org From martin at v.loewis.de Sat Sep 16 20:01:28 2006 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Sat, 16 Sep 2006 20:01:28 +0200 Subject: [Python-3000] string C API In-Reply-To: <450C2B20.4090504@gmail.com> References: <20060915102555.F983.JCARLSON@uci.edu> <1158343473.4292.14.camel@fsol> <450B739C.20607@gmail.com> <450C00E9.6070008@v.loewis.de> <450C2B20.4090504@gmail.com> Message-ID: <450C3BF8.2030901@v.loewis.de> Nick Coghlan schrieb: > If an 8-bit encoding other than latin-1 is used for the internal buffer, > then every comparison operation would have to decode the string to > Unicode in order to compare code points. > > It seems much simpler to me to ensure that what is stored internally is > *always* the Unicode code points, with the width (1, 2 or 4 bytes) > determined by the largest code point in the string. Just try implementing comparison some time. You can end up implementing the same algorithm six times at least, once for each pair (1,1), (1,2), (1,4), (2,2), (2,4), (4,4). If the algorithm isn't symmetric (i.e. you can't reduce (2,1) to (1,2)), you need 9 different versions of the algorithm. That sounds more complicated than always decoding. Regards, Martin From jcarlson at uci.edu Sat Sep 16 20:51:33 2006 From: jcarlson at uci.edu (Josiah Carlson) Date: Sat, 16 Sep 2006 11:51:33 -0700 Subject: [Python-3000] string C API In-Reply-To: <450C3BF8.2030901@v.loewis.de> References: <450C2B20.4090504@gmail.com> <450C3BF8.2030901@v.loewis.de> Message-ID: <20060916114123.F99E.JCARLSON@uci.edu> "Martin v. L?wis" wrote: > > Nick Coghlan schrieb: > > If an 8-bit encoding other than latin-1 is used for the internal buffer, > > then every comparison operation would have to decode the string to > > Unicode in order to compare code points. > > > > It seems much simpler to me to ensure that what is stored internally is > > *always* the Unicode code points, with the width (1, 2 or 4 bytes) > > determined by the largest code point in the string. > > Just try implementing comparison some time. You can end up implementing > the same algorithm six times at least, once for each pair (1,1), (1,2), > (1,4), (2,2), (2,4), (4,4). If the algorithm isn't symmetric (i.e. > you can't reduce (2,1) to (1,2)), you need 9 different versions of the > algorithm. 
> That sounds more complicated than always decoding.

One algorithm. Each character can be "decoded" during runtime.

    long expand(void* buffer, Py_ssize_t posn, int shift)
    {
        /* shift selects the width: 0 -> 1 byte, 1 -> 2 bytes, 2 -> 4 bytes */
        char* p = (char*)buffer + (posn << shift);
        switch (shift) {
        case 0: return ((unsigned char*)p)[0];
        case 1: return ((unsigned short*)p)[0];
        case 2: return ((long*)p)[0];
        default: return -1;
        }
    }

Alternatively, with a little work, the 9 variants can be defined with a prototype system, using macros or otherwise.

- Josiah

From qrczak at knm.org.pl Sat Sep 16 23:20:44 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Sat, 16 Sep 2006 23:20:44 +0200
Subject: [Python-3000] string C API
In-Reply-To: <450C3BF8.2030901@v.loewis.de> (Martin v. =?iso-8859-2?q?L=F6wis's?= message of "Sat, 16 Sep 2006 20:01:28 +0200")
References: <20060915102555.F983.JCARLSON@uci.edu> <1158343473.4292.14.camel@fsol> <450B739C.20607@gmail.com> <450C00E9.6070008@v.loewis.de> <450C2B20.4090504@gmail.com> <450C3BF8.2030901@v.loewis.de>
Message-ID: <87psdvik43.fsf@qrnik.zagroda>

"Martin v. Löwis" writes:

> Just try implementing comparison some time. You can end up implementing
> the same algorithm six times at least, once for each pair (1,1), (1,2),
> (1,4), (2,2), (2,4), (4,4). If the algorithm isn't symmetric (i.e.
> you can't reduce (2,1) to (1,2)), you need 9 different versions of the
> algorithm. That sounds more complicated than always decoding.

That's why I'm proposing only two variants, ISO-8859-1 and UCS-4.

String equality: two variants. Two others are trivial if the representation is always canonical.

String < and <=: 8 variants in total, all generated from a single 20-line piece of C code, parametrized by preprocessor macros.

String !=, >, >=: defined in terms of the above.

String concatenation:

    if both strings are narrow:
        allocate a narrow result
        copy narrow from str1 to result
        copy narrow from str2 to result
    else:
        allocate a wide result
        if str1 is narrow:
            copy narrow->wide from str1 to result
        else:
            copy wide from str1 to result
        if str2 is narrow:
            copy narrow->wide from str2 to result
        else:
            copy wide from str2 to result

__contains__, startswith, index: three variants, one other is trivial.

Seems simple enough for me.

-- 
__("< Marcin Kowalczyk
\__/ qrczak at knm.org.pl
^^ http://qrnik.knm.org.pl/~qrczak/

From greg.ewing at canterbury.ac.nz Sun Sep 17 01:17:35 2006
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Sun, 17 Sep 2006 11:17:35 +1200
Subject: [Python-3000] string C API
In-Reply-To: <450C3BF8.2030901@v.loewis.de>
References: <20060915102555.F983.JCARLSON@uci.edu> <1158343473.4292.14.camel@fsol> <450B739C.20607@gmail.com> <450C00E9.6070008@v.loewis.de> <450C2B20.4090504@gmail.com> <450C3BF8.2030901@v.loewis.de>
Message-ID: <450C860F.9010605@canterbury.ac.nz>

Martin v. Löwis wrote:
> Just try implementing comparison some time. You can end up implementing
> the same algorithm six times at least, once for each pair (1,1), (1,2),
> (1,4), (2,2), (2,4), (4,4).

    #define UnicodeStringComparisonFunction(TYPE1, TYPE2) \
    /* code to implement it here */

    UnicodeStringComparisonFunction(UCS1, UCS1)
    UnicodeStringComparisonFunction(UCS1, UCS2)
    UnicodeStringComparisonFunction(UCS1, UCS4)
    UnicodeStringComparisonFunction(UCS2, UCS2)
    UnicodeStringComparisonFunction(UCS2, UCS4)
    UnicodeStringComparisonFunction(UCS4, UCS4)

-- 
Greg

From meyer at acm.org Sun Sep 17 14:28:08 2006
From: meyer at acm.org (Andre Meyer)
Date: Sun, 17 Sep 2006 14:28:08 +0200
Subject: [Python-3000] Kill GIL?
Message-ID: <7008329d0609170528m5ca38011s292ce21396fa879@mail.gmail.com> Dear Python experts As a heavy user of multi-threading in Python and following the current discussions about Python on multi-processor systems on the python-list I wonder what the plans are for improving MP performance in Py3k. MP systems become more and more common as most modern processors have multiple processing units that could be used in parallel by distributing threads. Unfortunately, the GIL in CPython prevents to use this mechanism. As far as I understand IronPython, Jython and PyPy do not suffer from this. While I understand the difficulties in removing the GIL and the potential negative effect on single-threaded applications I would very much encourage discussion to seriously consider removing the GIL (maybe optionally) in Py3k. If not, what alternatives would you suggest? thanks a lot for your thoughts Andre -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.python.org/pipermail/python-3000/attachments/20060917/a1ee7948/attachment.htm From ncoghlan at gmail.com Sun Sep 17 15:16:30 2006 From: ncoghlan at gmail.com (Nick Coghlan) Date: Sun, 17 Sep 2006 23:16:30 +1000 Subject: [Python-3000] Kill GIL? In-Reply-To: <7008329d0609170528m5ca38011s292ce21396fa879@mail.gmail.com> References: <7008329d0609170528m5ca38011s292ce21396fa879@mail.gmail.com> Message-ID: <450D4AAE.2000805@gmail.com> Andre Meyer wrote: > While I understand the difficulties in removing the GIL and the > potential negative effect on single-threaded applications I would very > much encourage discussion to seriously consider removing the GIL (maybe > optionally) in Py3k. If not, what alternatives would you suggest? Brett Cannon's sandboxing work (which aims to provide first-class support for multiple interpreters in the same process for security reasons) also seems like a potentially fruitful approach to distributing processing to multiple cores: - use threads to perform blocking I/O in parallel - use multiple interpreters to perform Python execution in parallel Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia --------------------------------------------------------------- http://www.boredomandlaziness.org From qrczak at knm.org.pl Sun Sep 17 15:43:50 2006 From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk) Date: Sun, 17 Sep 2006 15:43:50 +0200 Subject: [Python-3000] Kill GIL? In-Reply-To: <7008329d0609170528m5ca38011s292ce21396fa879@mail.gmail.com> (Andre Meyer's message of "Sun, 17 Sep 2006 14:28:08 +0200") References: <7008329d0609170528m5ca38011s292ce21396fa879@mail.gmail.com> Message-ID: <87irjmk3qh.fsf@qrnik.zagroda> "Andre Meyer" writes: > While I understand the difficulties in removing the GIL and the > potential negative effect on single-threaded applications I would > very much encourage discussion to seriously consider removing the > GIL (maybe optionally) in Py3k. I suppose this would require either fundamentally changing the garbage collection algorithm (lots of work and breaking all C extensions), or accompanying all reference count adjustments with memory barriers (a significant performance hit even if a particular object is not shared between threads; many objects like None will be shared anyway). -- __("< Marcin Kowalczyk \__/ qrczak at knm.org.pl ^^ http://qrnik.knm.org.pl/~qrczak/ From solipsis at pitrou.net Sun Sep 17 16:51:12 2006 From: solipsis at pitrou.net (Antoine Pitrou) Date: Sun, 17 Sep 2006 16:51:12 +0200 Subject: [Python-3000] Kill GIL? 
In-Reply-To: <450D4AAE.2000805@gmail.com> References: <7008329d0609170528m5ca38011s292ce21396fa879@mail.gmail.com> <450D4AAE.2000805@gmail.com> Message-ID: <1158504672.28528.82.camel@fsol> Le dimanche 17 septembre 2006 ? 23:16 +1000, Nick Coghlan a ?crit : > Brett Cannon's sandboxing work (which aims to provide first-class support for > multiple interpreters in the same process for security reasons) also seems > like a potentially fruitful approach to distributing processing to multiple cores: > - use threads to perform blocking I/O in parallel OTOH, the Twisted approach avoids all the delicate synchronization issues that arise when using threads to perform concurrent IO tasks. Also, IO is by definition not CPU-intensive, so there is no point in distributing IO to multiple cores (and it could even cause a small decrease in performance because of inter-CPU communication overhead). Regards Antoine. From jcarlson at uci.edu Sun Sep 17 19:56:15 2006 From: jcarlson at uci.edu (Josiah Carlson) Date: Sun, 17 Sep 2006 10:56:15 -0700 Subject: [Python-3000] Kill GIL? In-Reply-To: <7008329d0609170528m5ca38011s292ce21396fa879@mail.gmail.com> References: <7008329d0609170528m5ca38011s292ce21396fa879@mail.gmail.com> Message-ID: <20060917103103.F9A4.JCARLSON@uci.edu> "Andre Meyer" wrote: > Dear Python experts > > As a heavy user of multi-threading in Python and following the current > discussions about Python on multi-processor systems on the python-list I > wonder what the plans are for improving MP performance in Py3k. MP systems > become more and more common as most modern processors have multiple > processing units that could be used in parallel by distributing threads. > Unfortunately, the GIL in CPython prevents to use this mechanism. As far as > I understand IronPython, Jython and PyPy do not suffer from this. > > While I understand the difficulties in removing the GIL and the potential > negative effect on single-threaded applications I would very much encourage > discussion to seriously consider removing the GIL (maybe optionally) in > Py3k. If not, what alternatives would you suggest? Search for 'Python free threading' without quotes in Google to find the discussions about this topic over the years. Personally, I think that the effort to remove the GIL in Py3k (or otherwise) is quite a bit of trouble that we don't want to have to go through; both from an internal redesign, and C-extension perspective. It would be substantially easier if there were a distributed RPC mechanism that auto distributed to the "least-working" process in a set of potential working processes on a single machine. Something with the simplicity of XML-RPC calling (but without servers and clients) and the distribution properties of Linda. Of course then we run into a situation where we need to "pickle" the callable arguments across a connection of some kind. There is a solution to this on a single machine; copying the internal representation of every object in the arguments of a function call to memory shared between all processes (mmap). With such a semantic, only mutable portions need to be copied out into non-mmap memory. With that RPC mechanism and file handle migration (available on BSDs natively, linux with minor work, and Windows via pywin32), most operations would *just work*, would be reasonably fast, and Python could keep its GIL - which would be substantially less work for everyone involved. 
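(A minimal sketch of the shared-mmap building block mentioned above, assuming POSIX fork semantics; Windows would need a named mapping instead, and a real design would layer synchronization on top:)

    import mmap, os

    shared = mmap.mmap(-1, 4096)   # anonymous mapping, shared with children
    pid = os.fork()
    if pid == 0:
        shared[:5] = "hello"       # child writes into the shared page
        os._exit(0)
    os.waitpid(pid, 0)
    print shared[:5]               # parent sees the child's write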
The details are cumbersome (specifically the copying Python objects to/from memory), but they can be made less cumbersome if one only allows builtin objects to be transferred. - Josiah From brett at python.org Sun Sep 17 20:03:34 2006 From: brett at python.org (Brett Cannon) Date: Sun, 17 Sep 2006 11:03:34 -0700 Subject: [Python-3000] Kill GIL? In-Reply-To: <450D4AAE.2000805@gmail.com> References: <7008329d0609170528m5ca38011s292ce21396fa879@mail.gmail.com> <450D4AAE.2000805@gmail.com> Message-ID: On 9/17/06, Nick Coghlan wrote: > > Andre Meyer wrote: > > While I understand the difficulties in removing the GIL and the > > potential negative effect on single-threaded applications I would very > > much encourage discussion to seriously consider removing the GIL (maybe > > optionally) in Py3k. If not, what alternatives would you suggest? > > Brett Cannon's sandboxing work (which aims to provide first-class support > for > multiple interpreters in the same process for security reasons) also seems > like a potentially fruitful approach to distributing processing to > multiple cores: > - use threads to perform blocking I/O in parallel > - use multiple interpreters to perform Python execution in parallel Possibly, but as it stands now interpreters just execute in their own Python thread, so there is no real performance boost. Without the GIL shifting over to per interpreter instead of per process there is going to be the same performance problems as with Python threads. And changing that would be hard since objects can be shared between multiple interpreters. -Brett -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.python.org/pipermail/python-3000/attachments/20060917/8395bdfb/attachment.html From ronaldoussoren at mac.com Sun Sep 17 20:36:40 2006 From: ronaldoussoren at mac.com (Ronald Oussoren) Date: Sun, 17 Sep 2006 20:36:40 +0200 Subject: [Python-3000] Kill GIL? In-Reply-To: <450D4AAE.2000805@gmail.com> References: <7008329d0609170528m5ca38011s292ce21396fa879@mail.gmail.com> <450D4AAE.2000805@gmail.com> Message-ID: On Sep 17, 2006, at 3:16 PM, Nick Coghlan wrote: > Andre Meyer wrote: >> While I understand the difficulties in removing the GIL and the >> potential negative effect on single-threaded applications I would >> very >> much encourage discussion to seriously consider removing the GIL >> (maybe >> optionally) in Py3k. If not, what alternatives would you suggest? > > Brett Cannon's sandboxing work (which aims to provide first-class > support for > multiple interpreters in the same process for security reasons) > also seems > like a potentially fruitful approach to distributing processing to > multiple cores: > - use threads to perform blocking I/O in parallel > - use multiple interpreters to perform Python execution in parallel ... except when you use extensions that use the PyGILState APIs, those don't work with multiple interpreters :-(. Ronald -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 2157 bytes Desc: not available Url : http://mail.python.org/pipermail/python-3000/attachments/20060917/0d132cf9/attachment.bin From rasky at develer.com Sun Sep 17 23:58:57 2006 From: rasky at develer.com (Giovanni Bajo) Date: Sun, 17 Sep 2006 23:58:57 +0200 Subject: [Python-3000] Kill GIL? 
References: <7008329d0609170528m5ca38011s292ce21396fa879@mail.gmail.com> <20060917103103.F9A4.JCARLSON@uci.edu>
Message-ID: <007501c6daa4$79768520$a14c2597@bagio>

Josiah Carlson wrote:

> It would be substantially easier if there were a distributed RPC
> mechanism that auto distributed to the "least-working" process in a set
> of potential working processes on a single machine. [...]

I'm not sure I follow you. Would you mind providing an example of a plausible API for this mechanism (aka how the code would look, compared to the current Python threading classes)?

Giovanni Bajo

From jcarlson at uci.edu Mon Sep 18 03:18:32 2006
From: jcarlson at uci.edu (Josiah Carlson)
Date: Sun, 17 Sep 2006 18:18:32 -0700
Subject: [Python-3000] Kill GIL?
In-Reply-To: <007501c6daa4$79768520$a14c2597@bagio>
References: <20060917103103.F9A4.JCARLSON@uci.edu> <007501c6daa4$79768520$a14c2597@bagio>
Message-ID: <20060917180402.07DB.JCARLSON@uci.edu>

"Giovanni Bajo" wrote:
> Josiah Carlson wrote:
>
> > It would be substantially easier if there were a distributed RPC
> > mechanism that auto distributed to the "least-working" process in a set
> > of potential working processes on a single machine. [...]
>
> I'm not sure I follow you. Would you mind providing an example of a plausible
> API for this mechanism (aka how the code would look, compared to the
> current Python threading classes)?

    import autorpc
    caller = autorpc.init_processes(autorpc.num_processors())

    import callables
    caller.register_module(callables)

    result = caller.fcn1(arg1, arg2, arg3)

The point is not to compare API/etc., with threading, but to compare it with XMLRPC. Because ultimately, what I would like to see, is a mechanic similar to XMLRPC; call a method on an instance, that is automatically executed perhaps in some other thread in some other process, or maybe even in the same thread on the same process (depending on load, etc.), and which returns the result in-place.

It's just much easier to handle (IMO). The above example highlights an example of single call/return. What if you don't care about getting a result back before continuing, or perhaps you have a bunch of things you want to get done?

    ...
    q = Queue.Queue()

    caller.delayed(q.put).fcn1(arg1, arg2, arg3)
    r = q.get()  # will be delayed until q gets something

What to do about exceptions happening in fcn1 remotely? A fellow over in the wxPython mailing list brought up the idea of exception objects; perhaps not stackframes, etc., but perhaps an object with information like exception type and traceback, used for both delayed and non-delayed tracebacks.

- Josiah

From ross at sourcelabs.com Mon Sep 18 06:06:34 2006
From: ross at sourcelabs.com (Ross Jekel)
Date: Sun, 17 Sep 2006 21:06:34 -0700
Subject: [Python-3000] Kill GIL?
In-Reply-To: 
References: 
Message-ID: 

I know it is a bit old, but would Python Object Sharing (POSH) http://poshmodule.sourceforge.net help you? Also, if you think you like the current state-of-the-art threading model, you might not after reading this: http://tinyurl.com/qvcbr

This goes to an article on http://www.computer.org with a long URL entitled "The Problem with Threads." After some initial surprise when I learned about it, I'm now okay with a GIL or even single-threaded python (with async I/O if necessary). In my opinion threaded programs with one big shared data space (like CPython's) are fundamentally untestable and unverifiable, and the GIL was the best solution to reduce risk in that area.
I am happy the GIL exists because it forces me to come up designs for programs and systems that are easier to write, more predictable both in terms of correctness and performance, and easier to maintain and scale. I think there would be significant backlash in the Python development community the first time an intermittent race condition or a deadlock occurs in the CPython interpretor after years of relying on it as a predictable, reliable platform. I'm also happy the GIL exists because it forces alternative ideas like Twisted and stackless to be developed and tried. If you have shared data that really benefits from synchronized access and updates, write an extension, release the GIL at the appropriate places, and do whatever you want in a C data structure. I've done this when necessary and think it is the best of both worlds. I guess I'm assuming this will still be possible in Python 3000 (I haven't been on the list that long, sorry). There has to be a better concurrency model than threads. Let's design for the future with useful packages that implement the best ideas of today for scaling well without threads. Ross From ironfroggy at gmail.com Mon Sep 18 06:50:34 2006 From: ironfroggy at gmail.com (Calvin Spealman) Date: Mon, 18 Sep 2006 00:50:34 -0400 Subject: [Python-3000] Kill GIL? In-Reply-To: <20060917180402.07DB.JCARLSON@uci.edu> References: <20060917103103.F9A4.JCARLSON@uci.edu> <007501c6daa4$79768520$a14c2597@bagio> <20060917180402.07DB.JCARLSON@uci.edu> Message-ID: <76fd5acf0609172150o55e79fddta141e348bffb342@mail.gmail.com> On 9/17/06, Josiah Carlson wrote: > > "Giovanni Bajo" wrote: > > Josiah Carlson wrote: > > > > > It would be substantially easier if there were a distributed RPC > > > mechanism that auto distributed to the "least-working" process in a > > > set > > > of potential working processes on a single machine. [...] > > > > I'm not sure I follow you. Would you mind providing an example of a plausible > > API for this mechanism (aka how the code would look like, compared to the > > current Python threading classes)? > > import autorpc > caller = autorpc.init_processes(autorpc.num_processors()) > > import callables > caller.register_module(callables) > > result = caller.fcn1(arg1, arg2, arg3) > > The point is not to compare API/etc., with threading, but to compare it > with XMLRPC. Because ultimately, what I would like to see, is a > mechanic similar to XMLRPC; call a method on an instance, that is > automatically executed perhaps in some other thread in some other > process, or maybe even in the same thread on the same process (depending > on load, etc.), and which returns the result in-place. > > It's just much easier to handle (IMO). The above example highlights an > example of single call/return. What if you don't care about getting a > result back before continuing, or perhaps you have a bunch of things you > want to get done? > > ... > q = Queue.Queue() > > caller.delayed(q.put).fcn1(arg1, arg2, arg3) > r = q.get() #will be delayed until q gets something > > What to do about exceptions happening in fcn1 remotely? A fellow over > in the wxPython mailing list brought up the idea of exception objects; > perhaps not stackframes, etc., but perhaps an object with information > like exception type and traceback, used for both delayed and non-delayed > tracebacks. > > > - Josiah I would be thrilled to see this kind of api brought into python. It could very likely be implemented in time for Python 2.6, which would be spawning processes to handle the load. 
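(The calling convention under discussion can be tried out today with a toy fork-per-call stand-in; this is only an illustration of the shape of the API, not the proposed mechanism, and call_in_child is an invented name:)

    import os, pickle

    def call_in_child(func, *args):
        # Run func(*args) in a forked child; pickle the result back
        # over a pipe. A real design would keep a pool of workers.
        r, w = os.pipe()
        pid = os.fork()
        if pid == 0:
            os.close(r)
            os.write(w, pickle.dumps(func(*args)))
            os._exit(0)
        os.close(w)
        data = ""
        chunk = os.read(r, 4096)
        while chunk:
            data += chunk
            chunk = os.read(r, 4096)
        os.close(r)
        os.waitpid(pid, 0)
        return pickle.loads(data)

For example, call_in_child(pow, 2, 10) returns 1024 computed in a child process.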
At the very least, a Python 2.4 or older compatible module could be released to test the waters and see what works and doesn't in this idea. I tried to wrap my head around different options on how the GIL might go away, but in the end you just realize you would hate to see it go away. I'm sure Twisted would have a field day with such a facility.

If this kind of thing gets brought into Python, it would almost require some form of MapReduce to come along with it. Of course, with the existing talks about removing map as a built-in in favor of list comprehensions, it makes one consider if listcomps and genexps might have some way to utilize a distributed model natively. Some consideration toward that end would be valuable.

From jcarlson at uci.edu Mon Sep 18 07:14:26 2006
From: jcarlson at uci.edu (Josiah Carlson)
Date: Sun, 17 Sep 2006 22:14:26 -0700
Subject: [Python-3000] Kill GIL?
In-Reply-To: 
References: 
Message-ID: <20060917212744.07DE.JCARLSON@uci.edu>

"Ross Jekel" wrote:
> I know it is a bit old, but would Python Object Sharing (POSH)
> http://poshmodule.sourceforge.net help you? Also, if you think you like the
> current state-of-the-art threading model, you might not after reading this:

The RPC-like mechanism I described earlier could be implemented on top of POSH if desired, though I believe that some of the potential issues that POSH has yet to fix (see its TODO file) aren't as much of a concern (if at all) when only using shared memory as IPC and not as an object store.

> "The Problem with Threads."

Getting *concurrency* right is hard. One way of making it not so hard is to use simple abstractions; like producer/consumer, deferred results, etc. But not everything fits into these abstractions, and sometimes there are data structure manipulations that require locking. In that sense, it's not so much that *concurrency* is hard to get right, as much as locking is hard to get right. But with Python 2.5, we get the 'with' statement and context managers. Add context managers to locks, always use RLocks (so that you can .acquire() a lock multiple times), and while it hasn't gotten easy (one needs to be careful with lock acquisition order to prevent deadlocks, especially when mixing locks with queues), more concurrency tasks have gotten *easier*.

Essentially the article points out that using abstractions like producer/consumer, deferreds, etc., can make concurrent programming not so hard, and that you have to be basically insane to use threads in your concurrent programming (I've been doing it for about 7 years, and am thoroughly insane), but unless I'm missing something (I only skimmed the article when it first came out, so this is quite possible), it's not really saying anything new to the concurrent programmer (of nontrivial systems).

With the API and RPC mechanism I sketched out earlier, threads are a possible underlying implementation detail. Essentially, it tries to force everything into a producer/consumer abstraction; the function I call is a consumer of the arguments I pass, and it produces a result that I (or someone else) later consume. This somewhat limits what kinds of things can be done 'natively', but you can't get everything.

- Josiah

From krstic at solarsail.hcs.harvard.edu Mon Sep 18 07:55:59 2006
From: krstic at solarsail.hcs.harvard.edu (=?UTF-8?B?SXZhbiBLcnN0acSH?=)
Date: Mon, 18 Sep 2006 01:55:59 -0400
Subject: [Python-3000] Kill GIL?
In-Reply-To: <7008329d0609170528m5ca38011s292ce21396fa879@mail.gmail.com>
References: <7008329d0609170528m5ca38011s292ce21396fa879@mail.gmail.com>
Message-ID: <450E34EF.3090202@solarsail.hcs.harvard.edu>

Andre Meyer wrote:
> As a heavy user of multi-threading in Python and following the current
> discussions about Python on multi-processor systems on the python-list I
> wonder what the plans are for improving MP performance in Py3k.

I have four aborted e-mails in my 'Drafts' folder that are asking the same question; each time, I decided that the almost inevitably ensuing "threads suck!" flamewar just isn't worth it. Now that someone else has taken the plunge...

At present, the Python approach to multi-processing sounds a bit like "let's stick our collective heads in the sand and pretend there's no problem". In particular, one oft-parroted argument says that it's not worth changing or optimizing the language for the few people who can afford SMP hardware. In the meantime, dual-core laptops are becoming the standard, with Intel predicting quad-core will become mainstream in the next few years, and the number of server orders for single-core, UP machines is plummeting.

From this, it's obvious to me that we need to do *something* to introduce stronger multi-processing support. Our current abilities are rather bad: we offer no microthreads, which is making elegant concurrency primitives such as Erlang's, ported to Python by the Candygram project [0], unnecessarily expensive. Instead, we only offer heavy threads that each allocate a full-size stack, and there's no actual ability to parallelize thread execution across CPUs. There's also no way to simply fork and coordinate between the forked processes, depending on the nature of the problem being solved, since there's no shared memory primitive in the stdlib (this because shared memory semantics are notoriously different across platforms). On top of it all, any adopted solution needs to be implementable across all the major Python interpreters, which makes finding a solution that much harder.

The way I see it, we have several options:

* Bite the bullet; write and support a stdlib SHM primitive that works wherever possible, and simply doesn't work on completely broken platforms (I understand Windows falls into this category). Utilize it in a lightweight fork-and-coordinate wrapper provided in the stdlib.

* Bite the mortar shell, and remove the GIL.

* Introduce microthreads, declare that Python endorses Erlang's no-sharing approach to concurrency, and incorporate something like candygram into the stdlib.

* Introduce a fork-and-coordinate wrapper in the stdlib, and declare that we're simply not going to support the use case that requires sharing (as opposed to merely passing) objects between processes.

The first option is a Pareto optimization, but having stdlib functionality flat out unavailable on some platforms might be out of the question. It'd be good to hear Guido's longer-term view on concurrency in Python. That discussion might be more appropriate on python-dev, though.

Cheers,

[0] http://candygram.sourceforge.net/

-- 
Ivan Krstić | GPG: 0x147C722D

From bob at redivi.com Mon Sep 18 08:29:44 2006
From: bob at redivi.com (Bob Ippolito)
Date: Sun, 17 Sep 2006 23:29:44 -0700
Subject: [Python-3000] Kill GIL?
In-Reply-To: <450E34EF.3090202@solarsail.hcs.harvard.edu> References: <7008329d0609170528m5ca38011s292ce21396fa879@mail.gmail.com> <450E34EF.3090202@solarsail.hcs.harvard.edu> Message-ID: <6a36e7290609172329s5bf83d2fq911508484e463d2e@mail.gmail.com> On 9/17/06, Ivan Krsti? wrote: > Andre Meyer wrote: > > As a heavy user of multi-threading in Python and following the current > > discussions about Python on multi-processor systems on the python-list I > > wonder what the plans are for improving MP performance in Py3k. > > I have four aborted e-mails in my 'Drafts' folder that are asking the > same question; each time, I decided that the almost inevitably ensuing > "threads suck!" flamewar just isn't worth it. Now that someone else has > taken the plunge... > > At present, the Python approach to multi-processing sounds a bit like > "let's stick our collective hands in the sand and pretend there's no > problem". In particular, one oft-parroted argument says that it's not > worth changing or optimizing the language for the few people who can > afford SMP hardware. In the meantime, dual-core laptops are becoming the > standard, with Intel predicting quad-core will become mainstream in the > next few years, and the number of server orders for single-core, UP > machines is plummeting. > > From this, it's obvious to me that we need to do *something* to > introduce stronger multi-processing support. Our current abilities are > rather bad: we offer no microthreads, which is making elegant > concurrency primitives such as Erlang's, ported to Python by the > Candygram project [0], unnecessarily expensive. Instead, we only offer > heavy threads that each allocate a full-size stack, and there's no > actual ability to parallelize thread execution across CPUs. There's also > no way to simply fork and coordinate between the forked processes, > depending on the nature of the problem being solved, since there's no > shared memory primitive in the stdlib (this because shared memory > semantics are notoriously different across platforms). On top of it all, > any adopted solution needs to be implementable across all the major > Python interpreters, which makes finding a solution that much harder. Candygram is heavyweight by trade-off, not because it has to be. Candygram could absolutely be implemented efficiently in current Python if a Twisted-like style was used. An API that exploits Python 2.5's with blocks and enhanced iterators would make it less verbose than a traditional twisted app and potentially easier to learn. Stackless or greenlets could be used for an even lighter weight API, though not as portably. > The way I see it, we have several options: > > * Bite the bullet; write and support a stdlib SHM primitive that works > wherever possible, and simply doesn't work on completely broken > platforms (I understand Windows falls into this category). Utilize it in > a lightweight fork-and-coordinate wrapper provided in the stdlib. I really don't think that's the right approach. If we're going to bother supporting distributed processing, we might as well support it in a portable way that can scale across machines. > * Bite the mortar shell, and remove the GIL. This really isn't even an option because we're not throwing away the current C Python implementation. The C API would have to change quite a bit for that. > * Introduce microthreads, declare that Python endorses Erlang's > no-sharing approach to concurrency, and incorporate something like > candygram into the stdlib. 
We have cooperatively scheduled microthreads with ugly syntax (yield), or more platform-specific and much less debuggable microthreads with stackless or greenlets. The missing part is the async message passing API and the libraries to go with it. Erlang uses something a lot like pickle for this, but Erlang only has about 8 types that are all immutable (IIRC: function, binary, list, tuple, pid, atom, integer, float). Communication between Erlang nodes requires a cookie (shared secret), which skirts around security issues. You can definitely kill an Erlang node if you have its cookie by flooding the atom table (atoms are like interned strings), but that's not considered to be a problem in most deployment scenarios. > * Introduce a fork-and-coordinate wrapper in the stdlib, and declare > that we're simply not going to support the use case that requires > sharing (as opposed to merely passing) objects between processes. What use case *requires* sharing? In a message passing system, usage of shared memory is an optimization that you shouldn't care much about as a user. Also, sockets are generally very fast over loopback. IIRC, Erlang only does this with binaries > 64 bytes long across processes on the same node (same pid, but not necessarily the same pthread in an SMP build). HiPE might do some more aggressive communication optimizations... but I think the general idea is that sending a really big message to another process is probably the wrong thing to do anyway. -bob From martin at v.loewis.de Mon Sep 18 08:44:41 2006 From: martin at v.loewis.de (=?UTF-8?B?Ik1hcnRpbiB2LiBMw7Z3aXMi?=) Date: Mon, 18 Sep 2006 08:44:41 +0200 Subject: [Python-3000] Kill GIL? In-Reply-To: <7008329d0609170528m5ca38011s292ce21396fa879@mail.gmail.com> References: <7008329d0609170528m5ca38011s292ce21396fa879@mail.gmail.com> Message-ID: <450E4059.1000806@v.loewis.de> Andre Meyer schrieb: > While I understand the difficulties in removing the GIL and the > potential negative effect on single-threaded applications I would very > much encourage discussion to seriously consider removing the GIL (maybe > optionally) in Py3k. If not, what alternatives would you suggest? Encouraging "very much" is probably not good enough to make anything happen. Actual code contributions may, as may offering a bounty (although it probably depends on the size of the bounty whether anybody wants to collect it). The alternatives are very straight-forward: 1. use Python the same way as you did for Python 2.x. I.e. create many threads, and have only one of them run. Use the other processors for something else, or don't use them at all. 2. use Python the same way as many other people do. Don't use threads, instead use multiple processors, and some sort of IPC. 3. don't use Python, at least not for the activities that need to run on multiple processors. If you want to fully use your multiple processors, depending on the application, I'd typically go with option 2 or 3. Option 2 if the code to parallelize is written in Python, option 3 if it is written in C (yes, you can use multiple truly concurrent threads in Python: just release the GIL on the C level; you can't make any calls into Python until you reacquire the GIL). Regards, Martin From jcarlson at uci.edu Mon Sep 18 09:25:57 2006 From: jcarlson at uci.edu (Josiah Carlson) Date: Mon, 18 Sep 2006 00:25:57 -0700 Subject: [Python-3000] Kill GIL? 
In-Reply-To: <450E34EF.3090202@solarsail.hcs.harvard.edu> References: <7008329d0609170528m5ca38011s292ce21396fa879@mail.gmail.com> <450E34EF.3090202@solarsail.hcs.harvard.edu> Message-ID: <20060918001556.07E4.JCARLSON@uci.edu> Ivan Krstic wrote: > * Bite the bullet; write and support a stdlib SHM primitive that works > wherever possible, and simply doesn't work on completely broken > platforms (I understand Windows falls into this category). Utilize it in > a lightweight fork-and-coordinate wrapper provided in the stdlib. Shared memory as an object store, or as IPC? Either way, shared mmaps offer shared memory for most platforms. Which ones? Windows, linux, OSX, solaris, BSDs, ... I would be surprised if Irix, AIX, HP-UX and other "big iron" OSes /didn't/ support shared mmaps. Sure, you don't get it on little embedded machines, but I'm not sure if we want to worry about concurrency libraries there. Alternatively, for platforms that support it, I have found that synchronous unix domain sockets can push about 3x as much as the loopback interface, about 1 GBytes/second on a 3 ghz Xeon, vs. around 350 MBytes/second for loopback tcp/ip. I haven't tried using domain+tcp/ip as a synchronization/"check the mmap at offset X, length Y", but I would imagine that it would be competitive. - Josiah From krstic at solarsail.hcs.harvard.edu Mon Sep 18 09:38:38 2006 From: krstic at solarsail.hcs.harvard.edu (=?UTF-8?B?SXZhbiBLcnN0acSH?=) Date: Mon, 18 Sep 2006 03:38:38 -0400 Subject: [Python-3000] Kill GIL? In-Reply-To: <6a36e7290609172329s5bf83d2fq911508484e463d2e@mail.gmail.com> References: <7008329d0609170528m5ca38011s292ce21396fa879@mail.gmail.com> <450E34EF.3090202@solarsail.hcs.harvard.edu> <6a36e7290609172329s5bf83d2fq911508484e463d2e@mail.gmail.com> Message-ID: <450E4CFE.8080209@solarsail.hcs.harvard.edu> Bob Ippolito wrote: > Candygram is heavyweight by trade-off, not because it has to be. > Candygram could absolutely be implemented efficiently in current > Python if a Twisted-like style was used. Specifically? >> * Bite the bullet; write and support a stdlib SHM primitive that works [..] >> a lightweight fork-and-coordinate wrapper provided in the stdlib. > > I really don't think that's the right approach. If we're going to > bother supporting distributed processing, we might as well support it > in a portable way that can scale across machines. Fork-and-coordinate is a specialized case of distribute-and-coordinate. Other d-a-c mechanisms can be provided, including those that utilize some form of RPC as a transport. SHM is orthogonal to all of this. Note that scaling across machines is only equivalent to scaling across CPUs in the simple case; in more complicated cases, there's a lot of glue involved that grid frameworks like Boinc provide. If we end up shipping any cross-machine abilities in the stdlib, we'd have to make sure it's clear that we're not attempting to provide a grid framework, just the plumbing that someone could use to build one. >> * Bite the mortar shell, and remove the GIL. > > This really isn't even an option because we're not throwing away the > current C Python implementation. The C API would have to change quite > a bit for that. Hence 'mortar shell'. It can be done, but I think Guido's been pretty clear on it not happening anytime soon. > We have cooperatively scheduled microthreads with ugly syntax (yield), > or more platform-specific and much less debuggable microthreads with > stackless or greenlets. Right. 
This is why I'm not sure we want to be recommending either as `the` Python way to do concurrency.

> What use case *requires* sharing?

Strictly speaking, it's always avoidable. But in setup-heavy systems, avoiding SHM is a massive and costly pain. Consider web applications; ideally, you can preload one copy of all of your translations, database information, and other static information, into RAM -- and have worker threads do reads from this table as they're processing individual requests. Without SHM, you'd have to either duplicate the static set in memory for each CPU, or make individual requests for each desired piece of information to the master process that keeps the static set in RAM.

I've seen a number of computationally-bound systems that require an authoritative copy of a (large) dataset in RAM, and are OK with paying the cost of a read waiting on a lock during a write (and since writes only happen at the completion of complex calculations, they generally want to use locking like that provided by brlocks in the Linux kernel). All of this is workable without SHM, but some of it gets really unwieldy.

-- 
Ivan Krstić | GPG: 0x147C722D

From ironfroggy at gmail.com Mon Sep 18 09:45:05 2006
From: ironfroggy at gmail.com (Calvin Spealman)
Date: Mon, 18 Sep 2006 03:45:05 -0400
Subject: [Python-3000] Kill GIL?
In-Reply-To: <450E4CFE.8080209@solarsail.hcs.harvard.edu>
References: <7008329d0609170528m5ca38011s292ce21396fa879@mail.gmail.com> <450E34EF.3090202@solarsail.hcs.harvard.edu> <6a36e7290609172329s5bf83d2fq911508484e463d2e@mail.gmail.com> <450E4CFE.8080209@solarsail.hcs.harvard.edu>
Message-ID: <76fd5acf0609180045r41fa6ef1tc78b70228f3c5fe@mail.gmail.com>

On 9/18/06, Ivan Krstić wrote:
> > What use case *requires* sharing?
>
> Strictly speaking, it's always avoidable. But in setup-heavy systems,
> avoiding SHM is a massive and costly pain. Consider web applications;
> ideally, you can preload one copy of all of your translations, database
> information, and other static information, into RAM -- and have worker
> threads do reads from this table as they're processing individual
> requests. Without SHM, you'd have to either duplicate the static set in
> memory for each CPU, or make individual requests for each desired piece
> of information to the master process that keeps the static set in RAM.
>
> I've seen a number of computationally-bound systems that require an
> authoritative copy of a (large) dataset in RAM, and are OK with paying
> the cost of a read waiting on a lock during a write (and since writes
> only happen at the completion of complex calculations, they generally
> want to use locking like that provided by brlocks in the Linux kernel).
> All of this is workable without SHM, but some of it gets really unwieldy.

So reload the information you want available to worker tasks and pass that information along to them, or provide a mechanism for them to request it from its preloaded housing. Shared memory isn't the only or best way to share resources between the tasks involved. Very rarely would any worker task need more than a few rows of any large preloaded table.

Alternatively, one could say you don't usually want any preloaded data because there is simply too much information to preload and reusable worker tasks can provide their own, more effectively targeted caches. One might even consider some setup whereby worker threads report their cache contents to a controller, which distributes tasks to workers it knows to have the information required already cached.
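(That cache-affinity routing could be sketched roughly as follows; the worker objects and their cached_keys sets are invented for the example:)

    def pick_worker(task_keys, workers):
        # Route the task to the worker whose local cache already holds
        # the most of the keys this task needs.
        best, best_hits = None, -1
        for w in workers:
            hits = len(task_keys & w.cached_keys)
            if hits > best_hits:
                best, best_hits = w, hits
        return best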
But in the end, you have to realize this is all at a higher level than we would really need to even consider for the discussion at hand. From dialtone at divmod.com Mon Sep 18 09:50:26 2006 From: dialtone at divmod.com (Valentino Volonghi aka Dialtone) Date: Mon, 18 Sep 2006 09:50:26 +0200 Subject: [Python-3000] Kill GIL? In-Reply-To: <20060917180402.07DB.JCARLSON@uci.edu> Message-ID: <20060918075026.1717.707665485.divmod.quotient.52675@ohm> On Sun, 17 Sep 2006 18:18:32 -0700, Josiah Carlson wrote: > import autorpc > caller = autorpc.init_processes(autorpc.num_processors()) > > import callables > caller.register_module(callables) > > result = caller.fcn1(arg1, arg2, arg3) > >The point is not to compare API/etc., with threading, but to compare it >with XMLRPC. Because ultimately, what I would like to see, is a >mechanic similar to XMLRPC; call a method on an instance, that is >automatically executed perhaps in some other thread in some other >process, or maybe even in the same thread on the same process (depending >on load, etc.), and which returns the result in-place. I've written something similar taking inspiration from axiom.batch (from divmod.org). And the result is the following code: import sys, os from twisted.internet import reactor from twisted.protocols import amp from twisted.internet import protocol from twisted.python import log from epsilon import process # These are the Commands, they are needed to call remote methods safely. class Sum(amp.Command): arguments = [('a', amp.Integer()), ('b', amp.Integer())] response = [('total', amp.Integer())] class StopReactor(amp.Command): arguments = [('delay', amp.Integer())] response = [('status', amp.String())] # This is the class that tells the RPC exposed methods and their # implementation class JustSum(amp.AMP): def sum(self, a, b): total = a + b log.msg('Did a sum: %d + %d = %d' % (a, b, total)) return {'total': total} Sum.responder(sum) def stop(self, delay): reactor.callLater(delay, reactor.stop) return {'status': 'scheduled'} StopReactor.responder(stop) # Various stuff needed to use AMP over stdin/stdout/stderr with a child # process class AMPConnector(protocol.ProcessProtocol): def __init__(self, proto, controller): self.amp = proto self.controller = controller def connectionMade(self): log.msg("Subprocess started.") self.amp.makeConnection(self) self.controller.childProcessCreated() # Transport disconnecting = False def write(self, data): self.transport.write(data) def writeSequence(self, data): self.transport.writeSequence(data) def loseConnection(self): self.transport.loseConnection() def getPeer(self): return ('omfg what are you talking about',) def getHost(self): return ('seriously it is a process this makes no sense',) def inConnectionLost(self): log.msg("Standard in closed") protocol.ProcessProtocol.inConnectionLost(self) def outConnectionLost(self): log.msg("Standard out closed") protocol.ProcessProtocol.outConnectionLost(self) def errConnectionLost(self): log.msg("Standard err closed") protocol.ProcessProtocol.errConnectionLost(self) def outReceived(self, data): self.amp.dataReceived(data) def errReceived(self, data): log.msg("Received stderr from subprocess: " + repr(data)) def processEnded(self, status): log.msg("Process ended") self.amp.connectionLost(status) self.controller.childProcessTerminated(status) # Here you write the code that uses the commands above. 
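# (A note on the control flow here, since it is easy to miss: each
# callRemote() below returns a Deferred, so the _cb/_eb functions in
# ProcessController fire later, with the response dictionary or the
# failure, once the child has answered over the stdio pipes that
# AMPConnector wires up above.)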
class ProcessController(object): def childProcessCreated(self): def _cb(result): print result d = self.child.callRemote(StopReactor, delay=0) d.addErrback(lambda _: reactor.stop()) def _eb(error): print error d = self.child.callRemote(Sum, a=4, b=5) d.addCallback(_cb) d.addErrback(_eb) def childProcessTerminated(self, status): print status def startProcess(self): executable = '/Library/Frameworks/Python.framework/Versions/2.4/bin/python' env = os.environ env['PYTHONPATH'] = os.pathsep.join(sys.path) self.child = JustSum() self.connector = AMPConnector(self.child, self) args = ( executable, '/usr/bin/twistd', '--logfile=/Users/dialtone/Projects/python/twist/sub.log', '--pidfile=/Users/dialtone/Projects/python/twist/sub.pid', '-noy', '/Users/dialtone/Projects/python/twist/sub.tac') self.process = process.spawnProcess(self.connector, executable, args, env=env) if __name__ == '__main__': p = ProcessController() reactor.callWhenRunning(p.startProcess) reactor.run() If you exclude 'boilerplate' code you end up with a ProcessController class and with the other three classes handling the RPC. Since the parent process is using self.child = JustSum() it will also expose the same API to the child, which will be able to call any of the methods. The style above may not be immediately obvious to people not used to Twisted, but other than that it's not really hard to abstract a bit more to provide an API similar to what you described. HTH From krstic at solarsail.hcs.harvard.edu Mon Sep 18 10:06:59 2006 From: krstic at solarsail.hcs.harvard.edu (=?UTF-8?B?SXZhbiBLcnN0acSH?=) Date: Mon, 18 Sep 2006 04:06:59 -0400 Subject: [Python-3000] Kill GIL? In-Reply-To: <76fd5acf0609180045r41fa6ef1tc78b70228f3c5fe@mail.gmail.com> References: <7008329d0609170528m5ca38011s292ce21396fa879@mail.gmail.com> <450E34EF.3090202@solarsail.hcs.harvard.edu> <6a36e7290609172329s5bf83d2fq911508484e463d2e@mail.gmail.com> <450E4CFE.8080209@solarsail.hcs.harvard.edu> <76fd5acf0609180045r41fa6ef1tc78b70228f3c5fe@mail.gmail.com> Message-ID: <450E53A3.6050309@solarsail.hcs.harvard.edu> Calvin Spealman wrote: > So reload the information you want available to worker tasks and pass > that information along to them, or provide a mechanism for them to > request it from its preloaded housing. With large sets, you can't afford duplicate copies in memory, so there's nothing to reload. I specifically mentioned providing a mechanism for retrieving individual pieces of information from the master process, but if you're doing lots of reads, this introduces complexity and overhead that's best avoided. Maybe it doesn't matter; with an appropriately nice interface for it in the distribute-and-coordinate wrapper, we might be able to hide the complexity from the programmer and use the best available IPC mechanism in the background to ferry the requests. Sync domain sockets are certainly fast enough, even though you're again unnecessarily duplicating parts of memory for each of your workers. > Alternatively, one could say you don't usually want any preloaded data > because there is simply too much information to preload and reusable > worker tasks can provide their own, more effectively targeted caches. I'm talking about real-world problems, where this most often doesn't work. > But in the end, you have to realize this is all at a higher level than > we would really need to even consider for the discussion at hand. I was answering a direct question. -- Ivan Krstić
| GPG: 0x147C722D From paul at prescod.net Mon Sep 18 10:19:40 2006 From: paul at prescod.net (Paul Prescod) Date: Mon, 18 Sep 2006 01:19:40 -0700 Subject: [Python-3000] Kill GIL? In-Reply-To: <450E34EF.3090202@solarsail.hcs.harvard.edu> References: <7008329d0609170528m5ca38011s292ce21396fa879@mail.gmail.com> <450E34EF.3090202@solarsail.hcs.harvard.edu> Message-ID: <1cb725390609180119i548a5a65j841599633cba712f@mail.gmail.com> On 9/17/06, Ivan Krstić wrote: > > At present, the Python approach to multi-processing sounds a bit like > "let's stick our collective hands in the sand and pretend there's no > problem". In particular, one oft-parroted argument says that it's not > worth changing or optimizing the language for the few people who can > afford SMP hardware. In the meantime, dual-core laptops are becoming the > standard, with Intel predicting quad-core will become mainstream in the > next few years, and the number of server orders for single-core, UP > machines is plummeting. I agree with you, Ivan. Even if I won't contribute code or even a design to the solution (because it isn't an area of expertise and I'm still working on encodings stuff) I think that there would be value in saying: "There's a big problem here and we intend to fix it in Python 3000." When you state baldly that something is a problem you encourage the community to experiment with and debate solutions. But I have been in the audience at Python conferences where the majority opinion was that Python had no problem around multi-processor apps because you could just roll your own IPC on top of processes. If you have to roll your own, that's a problem. If you have to select between five solutions with really subtle tradeoffs, that's a problem too. Ivan: why don't you write a PEP about this? > * Bite the bullet; write and support a stdlib SHM primitive that works > wherever possible, and simply doesn't work on completely broken > platforms (I understand Windows falls into this category). Utilize it in > a lightweight fork-and-coordinate wrapper provided in the stdlib. Such a low-level approach will not fly. Not just because of Windows but also because of Jython and IronPython. But maybe I misunderstand it in general. Python does not really have an abstraction as low-level as "memory", and I don't see why we would want to add it. > * Introduce microthreads, declare that Python endorses Erlang's > no-sharing approach to concurrency, and incorporate something like > candygram into the stdlib. > > * Introduce a fork-and-coordinate wrapper in the stdlib, and declare > that we're simply not going to support the use case that requires > sharing (as opposed to merely passing) objects between processes. I'm confused on a few levels. 1. "No sharing" seems to be a feature of both of these options, but the wording you use to describe it is very different. 2. You're conflating API and implementation in a manner that is unclear to me. Why are microthreads important to the Erlang model and what would the API for fork-and-coordinate look like? Since you are fired up about this now, would you consider writing a PEP at least outlining the problem persuasively and championing one of the (feasible) options? This issue has been discussed for more than a decade and the artifacts of previous discussions can be quite hard to find. Paul Prescod -------------- next part -------------- An HTML attachment was scrubbed...
URL: http://mail.python.org/pipermail/python-3000/attachments/20060918/8e60e281/attachment-0001.html From krstic at solarsail.hcs.harvard.edu Mon Sep 18 11:07:56 2006 From: krstic at solarsail.hcs.harvard.edu (=?UTF-8?B?SXZhbiBLcnN0acSH?=) Date: Mon, 18 Sep 2006 05:07:56 -0400 Subject: [Python-3000] Kill GIL? In-Reply-To: <1cb725390609180119i548a5a65j841599633cba712f@mail.gmail.com> References: <7008329d0609170528m5ca38011s292ce21396fa879@mail.gmail.com> <450E34EF.3090202@solarsail.hcs.harvard.edu> <1cb725390609180119i548a5a65j841599633cba712f@mail.gmail.com> Message-ID: <450E61EC.40507@solarsail.hcs.harvard.edu> Paul Prescod wrote: > I think that there would be value in saying: "There's a > big problem here and we intend to fix it in Python 3000." I'm not at all convinced that this is something to be addressed in 3.0. Py3k is about removing cruft, not adding features; a proper MP system represents a feature addition that might be more appropriate for 2.6 or post-3.0 if it gets horribly drawn out. > Ivan: why don't you write a PEP about this? I'd like to hear Guido's overarching thoughts on the matter, if any, and would afterwards be happy to write a PEP. > Python does not really have an abstraction as low-level as > "memory", and I don't see why we would want to add it. You don't need a special abstraction; a library adding primitives like SHMlist and SHMdict would be fully adequate. Arbitrary objects could decide to react to specific getattr/setattr calls by peeking and poking at the primitives. A SHMpickle mechanism could be used to stuff existing objects into SHM, and then create relevant proxies. > 1. "No sharing" seems to be a feature of both of these options, but the > wording you use to describe it is very different. Erlang's shared-nothing is a conviction: "you don't need to share things to get good concurrent operation". The alternative I mention is to declare "well, we recognize the need for shared-something, but are pointedly not providing the functionality". An irrelevant difference for the most part. > 2. You're conflating API and implementation in a manner that is unclear > to me. Why are microthreads important to the Erlang model They're not; Candygram proves as much. But the Erlang model was designed with the idea that threads ("processes") cost almost nothing, and if threads instead cost at least a full stack allocation, it's easy to get into hot water. > and what would > the API for fork-and-coordinate look like? I'm not going to try and design an API at 5:05AM. I'll think about this in the next few days, and stick it in the PEP after Guido chimes in. > Since you are fired up about this now, would you consider writing a PEP > at least outlining the problem persuasively and championing one of the > (feasible) options? Sure. -- Ivan Krstić | GPG: 0x147C722D From ncoghlan at gmail.com Mon Sep 18 12:47:08 2006 From: ncoghlan at gmail.com (Nick Coghlan) Date: Mon, 18 Sep 2006 20:47:08 +1000 Subject: [Python-3000] Kill GIL? In-Reply-To: References: <7008329d0609170528m5ca38011s292ce21396fa879@mail.gmail.com> <450D4AAE.2000805@gmail.com> Message-ID: <450E792C.1070105@gmail.com> Brett Cannon wrote: > On 9/17/06, *Nick Coghlan* wrote: > - use threads to perform blocking I/O in parallel > - use multiple interpreters to perform Python execution in parallel > > > Possibly, but as it stands now interpreters just execute in their own > Python thread, so there is no real performance boost.
Without the GIL > shifting over to per interpreter instead of per process there are going > to be the same performance problems as with Python threads. And > changing that would be hard since objects can be shared between > multiple interpreters. I was thinking it would be easier to split out the Global Interpreter Lock and a per-interpreter Local Interpreter Lock, rather than trying to go to a full free-threading model. Anyone sharing other objects between interpreters would still need their own synchronisation mechanism, but something like threading.Queue should suffice for that. Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia --------------------------------------------------------------- http://www.boredomandlaziness.org From ncoghlan at gmail.com Mon Sep 18 13:04:57 2006 From: ncoghlan at gmail.com (Nick Coghlan) Date: Mon, 18 Sep 2006 21:04:57 +1000 Subject: [Python-3000] Kill GIL? In-Reply-To: <1158504672.28528.82.camel@fsol> References: <7008329d0609170528m5ca38011s292ce21396fa879@mail.gmail.com> <450D4AAE.2000805@gmail.com> <1158504672.28528.82.camel@fsol> Message-ID: <450E7D59.90706@gmail.com> Antoine Pitrou wrote: > On Sunday 17 September 2006 at 23:16 +1000, Nick Coghlan wrote: >> Brett Cannon's sandboxing work (which aims to provide first-class support for >> multiple interpreters in the same process for security reasons) also seems >> like a potentially fruitful approach to distributing processing to multiple cores: >> - use threads to perform blocking I/O in parallel > > OTOH, the Twisted approach avoids all the delicate synchronization > issues that arise when using threads to perform concurrent IO tasks. > > Also, IO is by definition not CPU-intensive, so there is no point in > distributing IO to multiple cores (and it could even cause a small > decrease in performance because of inter-CPU communication overhead). Yeah, I made a mistake. The distinction is whether the CPU-intensive task is written in C/C++ or in Python - threads already work fine for the former, but something new is needed for the latter (either better IPC support or better in-process multi-interpreter support that uses an interpreter-specific lock for interpreter-specific data structures, reserving the GIL for shared state). Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia --------------------------------------------------------------- http://www.boredomandlaziness.org From talin at acm.org Mon Sep 18 15:38:35 2006 From: talin at acm.org (Talin) Date: Mon, 18 Sep 2006 06:38:35 -0700 Subject: [Python-3000] Ruminations on switch, constants, imports, etc. Message-ID: <450EA15B.1050302@acm.org> "Ruminating" is the best word I can think of here - I've been slowly digesting the ideas and discussions over the last couple months. Part of the reason why all this is relevant to me is that I'm working on a couple of side projects, some of which involve "mini-languages" that have similar issues to what has been discussed on the list. Bear in mind that none of what I say here is a recommendation of where Python *should* go - rather, it's a description of where I have *already* gone in some of the work that I am doing. It is merely one possible answer out of many to the suggestions that have been put forward in this forum. (And if this is merely a rehash of something that was discussed long ago, I apologize.) I'll start with the 'switch' discussion (you'll note that the ideas here cut across a bunch of different threads.)
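For context, the workaround in today's Python is dict-based dispatch; a rough sketch follows (the handlers and keys are made up for illustration). Note that the "case values" -- the dict keys -- are evaluated exactly once, when the dict is built:

    def handle_get(path):
        return 'GET ' + path

    def handle_put(path):
        return 'PUT ' + path

    DISPATCH = {
        'GET': handle_get,    # keys evaluated here, at dict-build time
        'PUT': handle_put,
    }

    def handle(method, path):
        try:
            return DISPATCH[method](path)
        except KeyError:
            return 'unsupported method'

    print handle('GET', '/index.html')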
The controversy, for which there seemed to be no final resolution, had to do with when the case values should be evaluated (I won't repeat the descriptions of the various schools - look in the PEP.) As some people pointed out, the truly fundamental issue has to do with the nature of constants in Python, which is to say that currently, there are none - other than literals. What would be entailed in adding a const type? Without fundamentally changing the language, the best you can do is early binding, in other words the const value isn't known at compilation time, but is a variable that is frozen at some point - perhaps at module load time. Adding a true const - that is, a variable whose value is known to the compiler, and can be optimized as such - is somewhat more involved. For one thing, the compiler knows nothing about external modules, or anything outside the current compilation unit. Without the ability to import external definitions, a compile-time 'const' is quite useless. One way around this (which is a little kludgey) would be to add a second type of 'import' - a compile-time import, which could be called something like 'include' or 'using'. An import of this type would act as if it had been textually included in the current source file. It would become part of the current compilation unit, and it would have the same restrictions - such as the inability to access variables imported via 'import' at compile time. Include files can of course include other files - but they can also 'import' as well. The effect of the importing from an include is the same as importing from the primary source file (because of the rule which states that 'include' is equivalent to textual inclusion.) Conversely, imported files can include - however the effect of the inclusion is limited to the imported file only, and does not affect the primary source file (because it's a different compilation unit.) This implies that you can't access constants that are within an imported module (because the constant definitions exist only within the compiler - they are transformed into literals before code generation occurs.) If a source file and an included module need to share constant values, they must each include the definitions of those constants. This can lead to problems if the include files are changing - one file might be compiled with a different version of the include file than another. OTOH, many potential uses for constants would be for things like operating system error codes and flags, which are fairly stable and unchanging -- so even a restricted use of the facility (i.e. don't use it for values which are in flux) might be worth while. Another possibility is to embed include checksums or other version info within the compiled file. So far, it seems like a lot of added complexity for fairly little benefit. However, where it gets interesting is when you realize that once you've given the compiler knowledge of the world outside a single compilation unit, a number of interesting possibilities arise. The one I've been experimenting with in my mini-language is macros. Not macros in the C sense, but in the lisp sense - a function which takes unevaluated arguments. Actually, they more closely resemble Dylan macros, in that they add production rules to the parser. Internally a macro is a small snippet of an AST, which gets spliced into the AST of the calling function at compile time. Macro arguments can be identifiers, expressions, and statements, all of which get substituted into the appropriate point in the AST. 
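The closest runnable analogy in today's Python is a function that receives explicitly unevaluated arguments as thunks -- a sketch, with invented names, that only imitates macros at run time:

    import sys

    def unless(cond, body):
        # both arguments arrive unevaluated; the 'macro' decides
        # whether (and when) to run them
        if not cond():
            body()

    x = 0
    unless(lambda: x > 0,
           lambda: sys.stdout.write('x is not positive\n'))

A real macro system would perform that substitution in the compiler, on the AST itself, with no lambdas or call overhead left at run time.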
This is of course going way past Guido's "Programmable Syntax" prohibition. Good thing I am only talking hypothetically! It seems to me, however, (getting back to 'switch') that supporting a proper 'switch' statement has to address these issues in *some* fashion - the issue of constants isn't going to go away completely under any of the proposed approaches. In fact, I'm actually leaning towards the position of *not* adding a switch statement to Python, simply because I'm not sure that Python *should* deal with all of these issues. It seems to me that adding 'const' to the language opens up a Pandora's box, containing both chaos and hope - and I think that for some language X, it may be a good idea to open that box, but I don't know if X includes Python. -- Talin From rhamph at gmail.com Mon Sep 18 17:48:56 2006 From: rhamph at gmail.com (Adam Olsen) Date: Mon, 18 Sep 2006 09:48:56 -0600 Subject: [Python-3000] Delayed reference counting idea Message-ID: I think all the attempts to expose GIL-less semantics to python code miss the point. Reference counting turns all references into modifications. You can't avoid the GIL without first changing reference counting. There's a few ways to approach this: * atomic INCREF/DECREF using cpu instructions. This would be very expensive, considering how often we do it. * Bolt-on tracing GC such as Boehm-Demers-Weiser. Totally unsupported by the C standards and changes cache characteristics that CPython has been designed with for years, likely with a very large performance penalty. * Tracing GC within C. Would require rewriting every API in CPython, as well as the code that uses them. Alternative implementations (PyPy, et al) can try this, but I think it's clear that it's not worth the effort for CPython, especially given the performance risks. * Delayed reference counting (save 10 or 20 INCREF/DECREF ops to a buffer, then flush them all at once). In theory, it would retain the cache locality while amortizing locking needed for SMP machines. For the most part delayed reference counting should require no changes, since it would use the existing INCREF/DECREF API. Some code does circumvent that API, and would need to be changed. Anyway, my point is that, for those of you out there who want to remove the GIL, here is something you really can experiment with. Even if there was a 20% performance drop on real-world tests you could still make it a configure option, enabled only for people who need many CPUs. (I've tried it myself, but never got past the weird crashes. Probably missed something silly). -- Adam Olsen, aka Rhamphoryncus From jimjjewett at gmail.com Mon Sep 18 17:56:51 2006 From: jimjjewett at gmail.com (Jim Jewett) Date: Mon, 18 Sep 2006 11:56:51 -0400 Subject: [Python-3000] Kill GIL? - to PEP 3099? Message-ID: Guido -- If I'm not mis-stating, this might be a candidate for PEP 3099. On 9/18/06, Ivan Krsti? wrote: > Paul Prescod wrote: > > Ivan: why don't you write a PEP about this? > I'd like to hear Guido's overarching thoughts on the matter, if any, and > would afterwards be happy to write a PEP. IIRC, his most recent statements boiled down to: (1) The GIL works well enough, most of the time. (2) Taking it out is harder than people realize. (3) Therefore, he won't spend too much time rethinking unless/until there is code to evaluate. -jJ From phd at phd.pp.ru Mon Sep 18 18:02:33 2006 From: phd at phd.pp.ru (Oleg Broytmann) Date: Mon, 18 Sep 2006 20:02:33 +0400 Subject: [Python-3000] Kill GIL? - to PEP 3099? 
In-Reply-To: References: Message-ID: <20060918160232.GB30336@phd.pp.ru> On Mon, Sep 18, 2006 at 11:56:51AM -0400, Jim Jewett wrote: > IIRC, his most recent statements boiled down to: > > (1) The GIL works well enough, most of the time. 1a. On multiprocessor/multicore systems use processes, not threads. > (2) Taking it out is harder than people realize. > (3) Therefore, he won't spend too much time rethinking unless/until > there is code to evaluate. Oleg. -- Oleg Broytmann http://phd.pp.ru/ phd at phd.pp.ru Programmers don't die, they just GOSUB without RETURN. From qrczak at knm.org.pl Mon Sep 18 18:27:12 2006 From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk) Date: Mon, 18 Sep 2006 18:27:12 +0200 Subject: [Python-3000] Delayed reference counting idea In-Reply-To: (Adam Olsen's message of "Mon, 18 Sep 2006 09:48:56 -0600") References: Message-ID: <8764fl87j3.fsf@qrnik.zagroda> "Adam Olsen" writes: > * Bolt-on tracing GC such as Boehm-Demers-Weiser. Totally unsupported > by the C standards and changes cache characteristics that CPython has > been designed with for years, likely with a very large performance > penalty. Last time I did some GC benchmarks (unrelated to Python), Boehm GC came up surprisingly fast. I suppose it's faster than malloc + reference counting (not sure how much amortizing malloc calls helps). I don't like the idea of a conservative GC at all in general, but Boehm GC seems to have very good quality, and it's easy to use from the point of view of a C API. -- __("< Marcin Kowalczyk \__/ qrczak at knm.org.pl ^^ http://qrnik.knm.org.pl/~qrczak/ From paul at prescod.net Mon Sep 18 18:45:52 2006 From: paul at prescod.net (Paul Prescod) Date: Mon, 18 Sep 2006 09:45:52 -0700 Subject: [Python-3000] Kill GIL? - to PEP 3099? In-Reply-To: References: Message-ID: <1cb725390609180945r64f2cfafv33a801abe42452b4@mail.gmail.com> The thread subject notwithstanding, the majority of the discussion was about ways to work around the GIL, not remove it. Therefore the thing you might put in PEP 3099 is not the thing under active discussion. On 9/18/06, Jim Jewett wrote: > > Guido -- If I'm not mis-stating, this might be a candidate for PEP 3099. > > On 9/18/06, Ivan Krsti? wrote: > > Paul Prescod wrote: > > > > Ivan: why don't you write a PEP about this? > > > I'd like to hear Guido's overarching thoughts on the matter, if any, and > > would afterwards be happy to write a PEP. > > IIRC, his most recent statements boiled down to: > > (1) The GIL works well enough, most of the time. > (2) Taking it out is harder than people realize. > (3) Therefore, he won't spend too much time rethinking unless/until > there is code to evaluate. > > -jJ > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.python.org/pipermail/python-3000/attachments/20060918/fb9aa398/attachment.htm From rhamph at gmail.com Mon Sep 18 19:11:51 2006 From: rhamph at gmail.com (Adam Olsen) Date: Mon, 18 Sep 2006 11:11:51 -0600 Subject: [Python-3000] Delayed reference counting idea In-Reply-To: <8764fl87j3.fsf@qrnik.zagroda> References: <8764fl87j3.fsf@qrnik.zagroda> Message-ID: On 9/18/06, Marcin 'Qrczak' Kowalczyk wrote: > "Adam Olsen" writes: > > > * Bolt-on tracing GC such as Boehm-Demers-Weiser. Totally unsupported > > by the C standards and changes cache characteristics that CPython has > > been designed with for years, likely with a very large performance > > penalty. > > Last time I did some GC benchmarks (unrelated to Python), Boehm GC > came up surprisingly fast. 
I suppose it's faster than malloc + > reference counting (not sure how much amortizing malloc calls helps). I expect Boehm would do very well in applications suited for it. I just don't think that includes CPython, especially with all the third-party C libraries. -- Adam Olsen, aka Rhamphoryncus From barry at python.org Mon Sep 18 19:40:19 2006 From: barry at python.org (Barry Warsaw) Date: Mon, 18 Sep 2006 13:40:19 -0400 Subject: [Python-3000] Delayed reference counting idea In-Reply-To: <8764fl87j3.fsf@qrnik.zagroda> References: <8764fl87j3.fsf@qrnik.zagroda> Message-ID: <422D230A-22E0-45D4-A5DC-267081CB8FEA@python.org> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On Sep 18, 2006, at 12:27 PM, Marcin 'Qrczak' Kowalczyk wrote: > "Adam Olsen" writes: > >> * Bolt-on tracing GC such as Boehm-Demers-Weiser. Totally >> unsupported >> by the C standards and changes cache characteristics that CPython has >> been designed with for years, likely with a very large performance >> penalty. > > Last time I did some GC benchmarks (unrelated to Python), Boehm GC > came up surprisingly fast. I suppose it's faster than malloc + > reference counting (not sure how much amortizing malloc calls helps). > > I don't like the idea of a conservative GC at all in general, but > Boehm GC seems to have very good quality, and it's easy to use from > the point of view of a C API. What worries me is the unpredictability of gc vs. refcounting. For some class of Python applications it's important that when an object is dereferenced it really goes away right then. I /like/ reference counting! - -Barry -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.5 (Darwin) iQCVAwUBRQ7aCHEjvBPtnXfVAQLziwP+K/lepARPfrRtGoH/7HTUE6oXL+4kF5Ow fEmg7zRPL3p8vrPrdKZi63kW4pZWYbmlsb/ugF+WmSdJIYebdK/p5d4kq5uOcWKi 9qVLtVXo6/f/nsNEeN0pcX/Y5RTRXPSgMy7hwlDH7/x4gT+Rz6uZSCR1I02x5OHa wN4+KiInPSw= =ScRh -----END PGP SIGNATURE----- From solipsis at pitrou.net Mon Sep 18 20:38:00 2006 From: solipsis at pitrou.net (Antoine Pitrou) Date: Mon, 18 Sep 2006 20:38:00 +0200 Subject: [Python-3000] Delayed reference counting idea In-Reply-To: References: Message-ID: <1158604680.4726.9.camel@fsol> On Monday 18 September 2006 at 09:48 -0600, Adam Olsen wrote: > * Bolt-on tracing GC such as Boehm-Demers-Weiser. Totally unsupported > by the C standards and changes cache characteristics that CPython has > been designed with for years, likely with a very large performance > penalty. Has it been measured what cache effects reference counting entails? With reference counting, each object is mutable from the point of view of the CPU cache (refcnt is always incremented and later decremented). This means almost every cache line containing Python objects - including functions, modules... - has to be written back when it is evicted, even if those objects are "constant". > * Delayed reference counting (save 10 or 20 INCREF/DECREF ops to a > buffer, then flush them all at once). In theory, it would retain the > cache locality while amortizing locking needed for SMP machines. You would have to lock the buffer, wouldn't you? Unless you use per-CPU buffers. From meyer at acm.org Mon Sep 18 21:18:29 2006 From: meyer at acm.org (Andre Meyer) Date: Mon, 18 Sep 2006 21:18:29 +0200 Subject: [Python-3000] Kill GIL? In-Reply-To: <450E4059.1000806@v.loewis.de> References: <7008329d0609170528m5ca38011s292ce21396fa879@mail.gmail.com> <450E4059.1000806@v.loewis.de> Message-ID: <7008329d0609181218v64ca1465s14cefe3cc91b67a8@mail.gmail.com> I would love to contribute code for this problem.
Unfortunately, I am not able to do so, but I see the problem myself, as do others. Therefore, I wanted to raise the question in time for Py3k. The number of responses indicates that it is not just me who struggles and there are people who know how to improve the situation. So far, this seems like a fruitful discussion. Thanks Andre On 9/18/06, "Martin v. Löwis" wrote: > > Andre Meyer schrieb: > > While I understand the difficulties in removing the GIL and the > > potential negative effect on single-threaded applications I would very > > much encourage discussion to seriously consider removing the GIL (maybe > > optionally) in Py3k. If not, what alternatives would you suggest? > > Encouraging "very much" is probably not good enough to make anything > happen. Actual code contributions may, as may offering a bounty > (although it probably depends on the size of the bounty whether anybody > wants to collect it). > > The alternatives are very straightforward: > 1. use Python the same way as you did for Python 2.x. I.e. create > many threads, and have only one of them run. Use the other processors > for something else, or don't use them at all. > 2. use Python the same way as many other people do. Don't use threads, > instead use multiple processors, and some sort of IPC. > 3. don't use Python, at least not for the activities that need to > run on multiple processors. > If you want to fully use your multiple processors, depending on the > application, I'd typically go with option 2 or 3. Option 2 if the code > to parallelize is written in Python, option 3 if it is written in C > (yes, you can use multiple truly concurrent threads in Python: just > release the GIL on the C level; you can't make any calls into Python > until you reacquire the GIL). > > Regards, > Martin > -- Dr. Andre P. Meyer http://python.openspace.nl/meyer TNO Defence, Security and Safety http://www.tno.nl/ Delft Cooperation on Intelligent Systems http://www.decis.nl/ Ah, this is obviously some strange usage of the word 'safe' that I wasn't previously aware of. - Douglas Adams -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.python.org/pipermail/python-3000/attachments/20060918/32b0dfae/attachment.htm From jimjjewett at gmail.com Mon Sep 18 21:27:02 2006 From: jimjjewett at gmail.com (Jim Jewett) Date: Mon, 18 Sep 2006 15:27:02 -0400 Subject: [Python-3000] Delayed reference counting idea In-Reply-To: <1158604680.4726.9.camel@fsol> References: <1158604680.4726.9.camel@fsol> Message-ID: On 9/18/06, Antoine Pitrou wrote: > > On Monday 18 September 2006 at 09:48 -0600, Adam Olsen wrote: > > * Bolt-on tracing GC such as Boehm-Demers-Weiser. Totally unsupported > > by the C standards and changes cache characteristics that CPython has > > been designed with for years, likely with a very large performance > > penalty. > Has it been measured what cache effects reference counting entails? Probably not recently. > With reference counting, each object is mutable from the point of view > of the CPU cache (refcnt is always incremented and later decremented). But each object request is only to one piece of memory, not two (obj and header separate).
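(The header write is easy to observe from Python, since sys.getrefcount reads the ob_refcnt field directly; a two-line demonstration:)

    import sys

    x = object()
    print sys.getrefcount(x)   # includes getrefcount's own argument reference
    y = x                      # taking a new reference...
    print sys.getrefcount(x)   # ...meant a write to x's header, not just a read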
Just a reminder about Neil Schemenauer's (old) patch to use Boehm-Demers http://mail.python.org/pipermail/python-list/1999-July/thread.html#7638 http://arctrix.com/nas/python/gc/ http://people.csail.mit.edu/gregs/ll1-discuss-archive-html/threads.html#00056 According to http://codespeak.net/pypy/dist/pypy/doc/getting-started.html PyPy sometimes translates to the use of BDW. I also seem to remember (but can't find a reference) that someone tried using a separate immortal namespace for basic objects like None, but the hassle of deciding what to do on each object ate up the savings. -jJ From rhettinger at ewtllc.com Mon Sep 18 21:56:27 2006 From: rhettinger at ewtllc.com (Raymond Hettinger) Date: Mon, 18 Sep 2006 12:56:27 -0700 Subject: [Python-3000] Delayed reference counting idea Message-ID: [Adam Olsen] > I don't like the idea of a conservative GC at all in general, but > Boehm GC seems to have very good quality, and it's easy to use from > the point of view of a C API. Several thoughts: * An easier C API would significantly benefit the language in terms of more extensions being available and in terms of increased reliability for those extensions. The current refcount scheme results in pervasive refleak bugs and subsequent, interminable bughunts. It adds to code verbosity/complexity and makes it tricky for beginning extension writers to get their first apps done correctly. IOW, I agree that GC without refcounts will make it easier to write good C code. * I doubt the anecdotal comments about Boehm GC with respect to performance. It may be better or it may be worse. While I think the latter is more likely, only an implementation patch will tell the tale. * At my company, we write real-time apps that benefit from the current refcounting scheme. We would have to stick with Py2.x unless Boehm GC can be implemented without periodically killing responsiveness. [Barry Warsaw] > What worries me is the unpredictability of gc vs. refcounting. > For some class of Python applications it's important that when > an object is dereferenced it really goes away right then. > I /like/ reference counting! No doubt that those exist; however, that sort of design is somewhat fragile and bug-prone, leading to endless sessions to find out who or what is keeping an object alive. This situation can only get worse when new-style classes become the norm. Also, IIRC, bugs involving __del__ have been one of the more complex, buggy, and dark corners of the language. Statistics incontrovertibly prove that people who habitually avoid __del__ lead happier lives and spend fewer hours in therapy ;-) Raymond From rhamph at gmail.com Mon Sep 18 21:59:35 2006 From: rhamph at gmail.com (Adam Olsen) Date: Mon, 18 Sep 2006 13:59:35 -0600 Subject: [Python-3000] Delayed reference counting idea In-Reply-To: <1158604680.4726.9.camel@fsol> References: <1158604680.4726.9.camel@fsol> Message-ID: On 9/18/06, Antoine Pitrou wrote: > > On Monday 18 September 2006 at 09:48 -0600, Adam Olsen wrote: > > * Bolt-on tracing GC such as Boehm-Demers-Weiser. Totally unsupported > > by the C standards and changes cache characteristics that CPython has > > been designed with for years, likely with a very large performance > > penalty. > > Has it been measured what cache effects reference counting entails? > > With reference counting, each object is mutable from the point of view > of the CPU cache (refcnt is always incremented and later decremented). > This means almost every cache line containing Python objects - including > functions, modules...
- has to be written back when it is evicted, even > if those objects are "constant". I don't think there's ever been any measuring, just theorizing based on some general benchmarks. For example, it's likely that the cache line containing the refcount is already loaded when the type pointer is loaded. However, delayed reference counting could allow you to remove incref/decref pairs, thereby avoiding the write entirely in some cases. > > > * Delayed reference counting (save 10 or 20 INCREF/DECREF ops to a > > buffer, then flush them all at once). In theory, it would retain the > > cache locality while amortizing locking needed for SMP machines. > > You would have to lock the buffer, wouldn't you? > Unless you use per-CPU buffers. I'm assuming per-CPU buffers. You'd need a global lock to flush them. There are probably some more creative schemes, but they couldn't be implemented quite so simply. -- Adam Olsen, aka Rhamphoryncus From barry at python.org Mon Sep 18 22:15:47 2006 From: barry at python.org (Barry Warsaw) Date: Mon, 18 Sep 2006 16:15:47 -0400 Subject: [Python-3000] Delayed reference counting idea In-Reply-To: References: Message-ID: <8D168D7C-8B62-4522-B63C-0F1023AE85C4@python.org> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On Sep 18, 2006, at 3:56 PM, Raymond Hettinger wrote: > * An easier C API would significantly benefit the language in terms of > more extensions being available and in terms of increased reliability > for those extensions. The current refcount scheme results in > pervasive > refleak bugs and subsequent, interminable bughunts. It adds to code > verbosity/complexity and makes it tricky for beginning extension > writers > to get their first apps done correctly. IOW, I agree that GC without > refcounts will make it easier to write good C code. > > * I doubt the anecdotal comments about Boehm GC with respect to > performance. It may be better or it may be worse. While I think the > latter is more likely, only an implementation patch will tell the > tale. > > * At my company, we write real-time apps that benefit from the current > refcounting scheme. We would have to stick with Py2.x unless Boehm GC > can be implemented without periodically killing responsiveness. We'd be in the same boat. While I agree with Raymond that it can be quite difficult to get C code to be refcount-correct, I wonder if there aren't tools or other debugging aids we can develop that will at least help debug when problems occur. Not that I have any bright ideas here, but as an example, one of the things we do when our app exits (it's potentially long running, but never daemonic) is to stroll through the list of all live objects, checking their refcounts against expected values. Of course we only do this in debug builds, but right now in our dev tree I'm looking at an issue where a central object has a few hundred more refcounts than expected at program exit. The really tricky thing about refcounting is making sure all the exit conditions out of a function are refcount correct. Usually these involve error or exception conditions, and they can be a bear to get right. Makes you want to write the goto-considered-useful rant all over again. :) Would a garbage collection interface make this easier (because you could ignore all that) or would you be trading that off for things like gcpro in Emacs, which can be just as harmful if you screw them up?
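(For reference, a rough sketch of that kind of exit-time audit in pure Python; gc.get_objects() only sees container objects, the counts are approximate because the audit itself takes references, and the expected-counts table is hypothetical:)

    import gc, sys

    def audit_refcounts(expected):
        # walk the live objects the collector knows about and flag any
        # whose refcount drifted from the value recorded earlier
        for obj in gc.get_objects():
            want = expected.get(id(obj))
            if want is not None and sys.getrefcount(obj) != want:
                print 'drift: %s has %d refs, expected %d' % (
                    type(obj), sys.getrefcount(obj), want)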
- -Barry -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.5 (Darwin) iQCVAwUBRQ7+eHEjvBPtnXfVAQKV8QQAkmgd7XEHIyNKRi25LyG+WB9KX9lXsucc dg/1BUNpkAPjyK6jXrXKpSvQtMzfCkPSyRENSy/B/bjom1TRcSPpmQWiFeT73MYm aRgma8L5ahuZkGdu9MaAr9LUCNW4VsPMPJCRBB0vlpkPaaDvgyoCIFpL1SjbRako hh+HAMuEHHY= =YqgR -----END PGP SIGNATURE----- From jimjjewett at gmail.com Mon Sep 18 22:33:09 2006 From: jimjjewett at gmail.com (Jim Jewett) Date: Mon, 18 Sep 2006 16:33:09 -0400 Subject: [Python-3000] Delayed reference counting idea In-Reply-To: <8D168D7C-8B62-4522-B63C-0F1023AE85C4@python.org> References: <8D168D7C-8B62-4522-B63C-0F1023AE85C4@python.org> Message-ID: On 9/18/06, Barry Warsaw wrote: > ... I agree with Raymond that it can be quite difficult to get > C code to be refcount-correct, ... How much of this (particularly for beginners) is remembering the refcount effects of standard functions? Could this be avoided by just always using the more abstract interface? (Sequence instead of List, Mapping instead of Dict) > The really tricky thing about refcounting is making sure all the exit > conditions out of a function are refcount correct. Usually these > involve error or exception conditions, and they can be a bear to get > right. Would it solve this problem if there were a PyTEMPREF that magically treated the refcount as an automatic variable? (It increfed immediately, and decrefed whenever the function exited, without the user having to track this manually.) Would it help enough to justify a pre-processing requirement? -jJ From rhamph at gmail.com Mon Sep 18 22:34:44 2006 From: rhamph at gmail.com (Adam Olsen) Date: Mon, 18 Sep 2006 14:34:44 -0600 Subject: [Python-3000] Delayed reference counting idea In-Reply-To: References: Message-ID: On 9/18/06, Raymond Hettinger wrote: > [Adam Olsen] > > I don't like the idea of a conservative GC at all in general, but > > Boehm GC seems to have very good quality, and it's easy to use from > > the point of view of a C API. This was Marcin, not me ;) > Several thoughts: > > * An easier C API would significantly benefit the language in terms of > more extensions being available and in terms of increased reliability > for those extensions. The current refcount scheme results in pervasive > refleak bugs and subsequent, interminable bughunts. It adds to code > verbosity/complexity and makes it tricky for beginning extension writers > to get their first apps done correctly. IOW, I agree that GC without > refcounts will make it easier to write good C code. > > * I doubt the anecdotal comments about Boehm GC with respect to > performance. It may be better or it may be worse. While I think the > latter is more likely, only an implementation patch will tell the tale. I have played with it before, on the CPython codebase. I really can't imagine it getting more than a minor speed boost, or else we'd already be finding that refcounting was taking up a large portion of our CPU time. (Anybody have actual numbers on the time spent in malloc/free?) The real advantage of Boehm is with threading. Avoiding the locking means you don't get the giant penalty you'd otherwise get. Still not inherently faster than a single-threaded program (which needs no locking). I discount Boehm because of the complexity and non-standardness though. I'd never want to maintain it, especially since it would affect all the libraries we link to as well. Although, with suitable proxying, it may be possible to limit it to just Python objects.
If I were to seriously consider a Python implementation with a tracing GC, I'd want it to be a moving GC, to fix the high-water mark problem of malloc. That seems incompatible with conservative GCs such as Boehm, although, come to think of it, I could do it using standard-conforming C (if any API rewrite were permissible). > * At my company, we write real-time apps that benefit from the current > refcounting scheme. We would have to stick with Py2.x unless Boehm GC > can be implemented without periodically killing responsiveness. Boehm does have options for incremental GC. > [Barry Warsaw] > > What worries me is the unpredictability of gc vs. refcounting. > > For some class of Python applications it's important that when > > an object is dereferenced it really goes away right then. > > I /like/ reference counting! > > No doubt that those exist; however, that sort of design is somewhat > fragile and bug-prone, leading to endless sessions to find out who or what > is keeping an object alive. This situation can only get worse when > new-style classes become the norm. Also, IIRC, bugs involving __del__ > have been one of the more complex, buggy, and dark corners of the > language. Statistics incontrovertibly prove that people who habitually > avoid __del__ lead happier lives and spend fewer hours in therapy ;-) I agree here. I think an executor approach is much better; kill the object, then make a weakref callback do any further cleanups using copies it made in advance. -- Adam Olsen, aka Rhamphoryncus From ronaldoussoren at mac.com Mon Sep 18 22:45:35 2006 From: ronaldoussoren at mac.com (Ronald Oussoren) Date: Mon, 18 Sep 2006 22:45:35 +0200 Subject: [Python-3000] Delayed reference counting idea In-Reply-To: References: Message-ID: On Sep 18, 2006, at 9:56 PM, Raymond Hettinger wrote: > > * I doubt the anecdotal comments about Boehm GC with respect to > performance. It may be better or it may be worse. While I think the > latter is more likely, only an implementation patch will tell the > tale. hear, hear ;-). Other anecdotal evidence says that a GC can be significantly faster than manual allocation, especially a copying collector where allocation can be really, really cheap. Boehm's GC isn't a copying collector, but I wouldn't count it out just because "everybody knows that GC is slow". I'd be more worried about changes in semantics: it's pretty convenient to write 'open(somefile, 'r').read()' to read a file in bulk, currently this will immediately close the file but with a GC system it may be a long time before the file is actually closed. Another reason to be scared of GC is some bad experience I've had with Java's GC, it's rather annoying if you're a sysadmin, get a Java app thrown over the wall and then have to tweak obscure GC-related parameters to get decent performance (or rather, an application that doesn't crash after running for a couple of days). That may have been bad code in the application, but I'm not entirely convinced that Java's GC doesn't deserve to get some of the blame. Ronald -------------- next part -------------- A non-text attachment was scrubbed...
Name: smime.p7s Type: application/pkcs7-signature Size: 2157 bytes Desc: not available Url : http://mail.python.org/pipermail/python-3000/attachments/20060918/a9855a76/attachment.bin From barry at python.org Mon Sep 18 22:56:19 2006 From: barry at python.org (Barry Warsaw) Date: Mon, 18 Sep 2006 16:56:19 -0400 Subject: [Python-3000] Delayed reference counting idea In-Reply-To: References: <8D168D7C-8B62-4522-B63C-0F1023AE85C4@python.org> Message-ID: -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On Sep 18, 2006, at 4:33 PM, Jim Jewett wrote: > On 9/18/06, Barry Warsaw wrote: > >> ... I agree with Raymond that it can be quite difficult to get >> C code to be refcount-correct, ... > > How much of this (particularly for beginners) is remembering the > refcount affects of standard functions? Could this be avoided by just > always using the more abstract interface? (Sequence instead of List, > Mapping instead of Dict) I think that may be part of it (I've mentioned this before), but our C API code wasn't written by beginners, and while we don't have any known refcounting problems in production code, during development one or two can slip through. I don't think that the above is the major contributor. >> The really tricky thing about refcounting is making sure all the exit >> conditions out of a function are refcount correct. Usually these >> involve error or exception conditions, and they can be a bear to get >> right. > > Would it solve this problem if there were a PyTEMPREF that magically > treated the refcount as an automatic variable? (It increfed > immediately, and decrefed whenever the function exited, without the > user having to track this manually.) > > Would it help enough to justify a pre-processing requirement? I don't know, I hate macros. :) It's been a long while since I programmed on the NeXT, so Mac folks here please chime in, but isn't there some Foundation idiom where temporary Objective-C objects didn't need to be explicitly released if their lifetime was exactly the duration of the function in which they were created? ISTR something like the main event loop tracking such refcount=1 objects and deleting them automatically the next time through the loop. Since Python has a main loop, I wonder if the same kind of trick couldn't be done here. IOW, if you're just creating an object temporarily, you never need to explicitly decref it because the main eval loop would do it for you. Dang I wish I could remember the details. Something like that, where you didn't have to track all objects through all exit conditions would probably help. - -Barry -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.5 (Darwin) iQCVAwUBRQ8H83EjvBPtnXfVAQIz5wP+JUJF3fwYIZ6fUmG4PkpyE8K+oOflCQYE vjBSa4vaCkX8fJvAZzwH5VgFoOEJ6WxLwagkJvFmVdCLDNgs2TwJF+cT45qJYCLF cWbcNAtesxMVZIUMjtUDpQLoSw/1CTuGbCdymqEuteF8IRZEJP5Usv1c6ytS5LJK cuLWyArvNeo= =UDIj -----END PGP SIGNATURE----- From martin at v.loewis.de Mon Sep 18 23:11:03 2006 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Mon, 18 Sep 2006 23:11:03 +0200 Subject: [Python-3000] Delayed reference counting idea In-Reply-To: <1158604680.4726.9.camel@fsol> References: <1158604680.4726.9.camel@fsol> Message-ID: <450F0B67.2030405@v.loewis.de> Antoine Pitrou schrieb: > Has it been measured what cache effects reference counting entails ? I don't think so. > With reference counting, each object is mutable from the point of view > of the CPU cache (refcnt is always incremented and later decremented). 
> This means almost every cache line containing Python objects - including > functions, modules... - has to be written back when it is evicted, even > if those objects are "constant". Yes, though this is likely negligible wrt. the overhead that locking operations on refcount changes would have. Regards, Martin From martin at v.loewis.de Mon Sep 18 23:16:01 2006 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Mon, 18 Sep 2006 23:16:01 +0200 Subject: [Python-3000] Delayed reference counting idea In-Reply-To: References: Message-ID: <450F0C91.3090608@v.loewis.de> Raymond Hettinger schrieb: > * An easier C API would significantly benefit the language in terms of > more extensions being available and in terms of increased reliability > for those extensions. The current refcount scheme results in pervasive > refleak bugs and subsequent, interminable bughunts. It adds to code > verbosity/complexity and makes it tricky for beginning extension writers > to get their first apps done correctly. IOW, I agree that GC without > refcounts will make it easier to write good C code. I don't think this will be the case. A garbage collector will likely need to find out what the pointer local and global variables are, as well as the pointers hidden in C structures (at least if the collector is going to be "precise"). So I think a Python with "true" GC will be much more error-prone on the C level, with authors not getting the declarations of variables right, and endless bug hunts because a referenced object is already collected, and its memory overwritten. Regards, Martin From martin at v.loewis.de Mon Sep 18 23:19:05 2006 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Mon, 18 Sep 2006 23:19:05 +0200 Subject: [Python-3000] Delayed reference counting idea In-Reply-To: References: Message-ID: <450F0D49.4090606@v.loewis.de> Ronald Oussoren schrieb: > hear, hear ;-). Other anecdotal evidence says that a GC can be > significantly faster than manual allocation, especially a copying > collector where allocation can be really, really cheap. OTOH, it isn't typically faster than obmalloc (which also allocates in constant time "on average"). Regards, Martin From rhettinger at ewtllc.com Tue Sep 19 03:21:44 2006 From: rhettinger at ewtllc.com (Raymond Hettinger) Date: Mon, 18 Sep 2006 18:21:44 -0700 Subject: [Python-3000] Delayed reference counting idea Message-ID: [Raymond Hettinger] >> * At my company, we write real-time apps that benefit from the current >> refcounting scheme. We would have to stick with Py2.x unless Boehm GC >> can be implemented without periodically killing responsiveness. [Jim Jewett] > Do you effectively turn off cyclic collections (but refcount > reclaims enough), or is the current cyclic collector fast enough? We turn off GC and code carefully. Raymond From greg.ewing at canterbury.ac.nz Tue Sep 19 06:01:10 2006 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Tue, 19 Sep 2006 16:01:10 +1200 Subject: [Python-3000] Kill GIL? In-Reply-To: <450E792C.1070105@gmail.com> References: <7008329d0609170528m5ca38011s292ce21396fa879@mail.gmail.com> <450D4AAE.2000805@gmail.com> <450E792C.1070105@gmail.com> Message-ID: <450F6B86.5050902@canterbury.ac.nz> Nick Coghlan wrote: > I was thinking it would be easier to split out the Global Interpreter Lock and > a per-interpreter Local Interpreter Lock, rather than trying to go to a full > free-threading model. Anyone sharing other objects between interpreters would > still need their own synchronisation mechanism, but something like > threading.Queue should suffice for that. I don't think that using an ordinary Queue object would suffice for that, because it's designed on the assumption that basic refcounting etc. is already protected by a GIL. If nothing else, you'd need some kind of extra locking mechanism to manage the refcount of the Queue object itself. -- Greg Ewing, Computer Science Dept, +--------------------------------------+ University of Canterbury, | Carpe post meridiem! | Christchurch, New Zealand | (I'm not a morning person.) | greg.ewing at canterbury.ac.nz +--------------------------------------+
Anyone sharing other objects between interpreters would > still need their own synchronisation mechanism, but something like > threading.Queue should suffice for that. I don't think that using an ordinary Queue object would suffice for that, because it's designed on the assumption that basic refcounting etc. is already protected by a GIL. If nothing else, you'd need some kind of extra locking mechanism to manage the refcount of the Queue object itself. -- Greg Ewing, Computer Science Dept, +--------------------------------------+ University of Canterbury, | Carpe post meridiem! | Christchurch, New Zealand | (I'm not a morning person.) | greg.ewing at canterbury.ac.nz +--------------------------------------+ From greg.ewing at canterbury.ac.nz Tue Sep 19 06:52:22 2006 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Tue, 19 Sep 2006 16:52:22 +1200 Subject: [Python-3000] Delayed reference counting idea In-Reply-To: References: Message-ID: <450F7786.3060501@canterbury.ac.nz> Adam Olsen wrote: > Reference counting turns all references into > modifications. > > There's a few ways to approach this: I've just thought of another one: Instead of a single refcount per object, each thread has its own set of refcounts. Then the object has a count of the number of threads that currently have nonzero refcounts for it. Most refcount operations would only affect the thread's local refcount for the object. Only when that reached zero would you need to lock the object and update the global refcount. Not sure what kind of data structure you'd use for this, though... -- Greg Ewing, Computer Science Dept, +--------------------------------------+ University of Canterbury, | Carpe post meridiem! | Christchurch, New Zealand | (I'm not a morning person.) | greg.ewing at canterbury.ac.nz +--------------------------------------+ From greg.ewing at canterbury.ac.nz Tue Sep 19 07:20:00 2006 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Tue, 19 Sep 2006 17:20:00 +1200 Subject: [Python-3000] Delayed reference counting idea In-Reply-To: References: Message-ID: <450F7E00.7000304@canterbury.ac.nz> Raymond Hettinger wrote: > * An easier C API would significantly benefit the language in terms of > more extensions being available and in terms of increased reliability > for those extensions. The current refcount scheme results in pervasive > refleak bugs and subsequent, interminable bughunts. It's not clear that a different scheme would be much different, though. If it's not refcounting, there will be some other set of rules that must be followed, with equally obscure bugs if you slip up. Also, at least half of the boilerplate is due to the necessity of checking for errors at each step. A different GC scheme wouldn't help with that. IMO the only way to make writing C extensions truly straightforward and non-error-prone is to use some kind of code generation tool like Pyrex. And then it doesn't matter how complicated the rules are. -- Greg Ewing, Computer Science Dept, +--------------------------------------+ University of Canterbury, | Carpe post meridiem! | Christchurch, New Zealand | (I'm not a morning person.) 
| greg.ewing at canterbury.ac.nz +--------------------------------------+ From greg.ewing at canterbury.ac.nz Tue Sep 19 07:27:18 2006 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Tue, 19 Sep 2006 17:27:18 +1200 Subject: [Python-3000] Delayed reference counting idea In-Reply-To: References: <8D168D7C-8B62-4522-B63C-0F1023AE85C4@python.org> Message-ID: <450F7FB6.90709@canterbury.ac.nz> Jim Jewett wrote: > Would it solve this problem if there were a PyTEMPREF that magically > treated the refcount as an automatic variable? (It increfed > immediately, and decrefed whenever the function exited, without the > user having to track this manually.) This would be wrong, because most functions return new references, which should *not* be increfed when assigned to a variable. How would you implement that in C anyway? (C++ could do it, but we're not going there, as far as I know.) -- Greg Ewing, Computer Science Dept, +--------------------------------------+ University of Canterbury, | Carpe post meridiem! | Christchurch, New Zealand | (I'm not a morning person.) | greg.ewing at canterbury.ac.nz +--------------------------------------+ From greg.ewing at canterbury.ac.nz Tue Sep 19 07:31:43 2006 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Tue, 19 Sep 2006 17:31:43 +1200 Subject: [Python-3000] Delayed reference counting idea In-Reply-To: References: Message-ID: <450F80BF.5050203@canterbury.ac.nz> We seem to have two mechanisms here: refcounting, which incurs a small penalty many times but which we're willing to pay for the benefits it brings, and locking, which in theory should also carry only a small penalty most of the time, not much bigger than refcounting. Yet it seems we're not willing to pay both penalties at once. I'm wondering whether there's some way we could merge the two -- i.e. somehow make the one mechanism serve as both a refcounting *and* a locking mechanism at the same time. A refcount is a count, and a semaphore also has a count... is there some way we can make use of that? -- Greg Ewing, Computer Science Dept, +--------------------------------------+ University of Canterbury, | Carpe post meridiem! | Christchurch, New Zealand | (I'm not a morning person.) | greg.ewing at canterbury.ac.nz +--------------------------------------+ From greg.ewing at canterbury.ac.nz Tue Sep 19 07:36:26 2006 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Tue, 19 Sep 2006 17:36:26 +1200 Subject: [Python-3000] Delayed reference counting idea In-Reply-To: References: Message-ID: <450F81DA.6010204@canterbury.ac.nz> Ronald Oussoren wrote: > I'd be more worried about changes in semantics, it's pretty convenient > to write 'open(somefile, 'r').read()' to read a file in bulk, currently > this will immediately close the file but with a GC system it may be a > long time before the file is actually closed. Another data point in favour of deterministic memory management: I was working on a game recently involving OpenGL and animation, and I found that I couldn't get a smooth frame rate until I disabled cyclic GC, after which everything was fine. So I'd be unhappy if refcounting were removed and not replaced with something equally unobtrusive in the case where you don't create a lot of cycles.
| greg.ewing at canterbury.ac.nz +--------------------------------------+ From ronaldoussoren at mac.com Tue Sep 19 07:46:42 2006 From: ronaldoussoren at mac.com (Ronald Oussoren) Date: Tue, 19 Sep 2006 07:46:42 +0200 Subject: [Python-3000] Delayed reference counting idea In-Reply-To: References: <8D168D7C-8B62-4522-B63C-0F1023AE85C4@python.org> Message-ID: <12CCD752-F18B-4DE0-91FC-D490C2E85421@mac.com> On Sep 18, 2006, at 10:56 PM, Barry Warsaw wrote: > > I don't know, I hate macros. :) > > > It's been a long while since I programmed on the NeXT, so Mac folks > here please chime in, but isn't there some Foundation idiom where > temporary Objective-C objects didn't need to be explicitly released > if their lifetime was exactly the duration of the function in which > they were created? ISTR something like the main event loop tracking > such refcount=1 objects and deleting them automatically the next time > through the loop. Since Python has a main loop, I wonder if the same > kind of trick couldn't be done here. Objective-C, or rather Cocoa, uses reference counting but with a twist. Cocoa has autorelease pools (class NSAutoreleasePool): any object that is inserted in an autorelease pool gets its refcount decreased when the pool is deleted. Furthermore the main event loop creates a new pool at the start of the loop and removes it at the end, cleaning up all autoreleased objects. Most Cocoa methods return borrowed references (which they can do because of autorelease pools). When you know you won't hang onto an object until after the current iteration of the event loop you can safely ignore reference counting. Only when you store a reference to an object somewhere (such as in an instance variable) do you have to worry about the refcount. The annoying part of Cocoa's refcounting scheme is that they, unlike Python, don't have a GC to clean up loops. This causes several parts of the Cocoa framework to ignore refcounts to avoid creating loops, which is rather annoying when you write a bridge to Cocoa and want to hide reference counting details. Ronald P.S. Apple is switching to a non-reference counting GC scheme in OSX 10.5 (http://www.apple.com/macosx/leopard/xcode.html). -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 2157 bytes Desc: not available Url : http://mail.python.org/pipermail/python-3000/attachments/20060919/230fec49/attachment.bin From greg.ewing at canterbury.ac.nz Tue Sep 19 07:49:52 2006 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Tue, 19 Sep 2006 17:49:52 +1200 Subject: [Python-3000] Delayed reference counting idea In-Reply-To: References: <8D168D7C-8B62-4522-B63C-0F1023AE85C4@python.org> Message-ID: <450F8500.5080102@canterbury.ac.nz> Barry Warsaw wrote: > It's been a long while since I programmed on the NeXT, so Mac folks > here please chime in, but isn't there some Foundation idiom where > temporary Objective-C objects didn't need to be explicitly released > if their lifetime was exactly the duration of the function in which > they were created? I think you're talking about the autorelease mechanism. It's a kind of delayed decref, the delay being until execution reaches some safe place, usually the main event loop of the application. It exists because Cocoa mostly manages refcounts on a much coarser-grained scale than Python.
You don't normally count all the temporary references created by parameters and local variables, only "major" ones such as references stored in an instance variable of an object. The problem then is that an object might get released while in the middle of executing one or more of its methods, and there are still references to it in active stack frames. By delaying the decref until returning to the main loop, all these references have hopefully gone away by the time the object gets freed. You couldn't translate this scheme directly into Python, because there are various differences in the way refcounts are used. There's also not really any safe place to do the delayed decrefs. The interpreter loop is *not* a safe place, because there can be nested invocations of it, with C stack frames outside the current one holding references. -- Greg Ewing, Computer Science Dept, +--------------------------------------+ University of Canterbury, | Carpe post meridiem! | Christchurch, New Zealand | (I'm not a morning person.) | greg.ewing at canterbury.ac.nz +--------------------------------------+ From mcherm at mcherm.com Tue Sep 19 14:36:09 2006 From: mcherm at mcherm.com (Michael Chermside) Date: Tue, 19 Sep 2006 05:36:09 -0700 Subject: [Python-3000] Removing __del__ Message-ID: <20060919053609.vp8duwukq7sw4w48@login.werra.lunarpages.com> The following comments got me thinking: Raymond: > Statistics incontrovertibly prove that people who habitually > avoid __del__ lead happier lives and spend fewer hours in therapy ;-) Adam Olsen: > I agree here. I think an executor approach is much better; kill the > object, then make a weakref callback do any further cleanups using > copies it made in advance. And of course similar sentiments have been proposed in many Python discussions by many people over several years. Since we're apparently still in "propose wild ideas" mode for Py3K I'd like to propose that for Py3K we remove __del__. Not "fix" it, not "tweak" it, just remove it and perhaps add a note in the manual pointing people to the weakref module. What'cha think folks? I'd love to hear an opinion from someone who is a current user of __del__ -- I'm not. -- Michael Chermside From qrczak at knm.org.pl Tue Sep 19 16:33:52 2006 From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk) Date: Tue, 19 Sep 2006 16:33:52 +0200 Subject: [Python-3000] Removing __del__ In-Reply-To: <20060919053609.vp8duwukq7sw4w48@login.werra.lunarpages.com> (Michael Chermside's message of "Tue, 19 Sep 2006 05:36:09 -0700") References: <20060919053609.vp8duwukq7sw4w48@login.werra.lunarpages.com> Message-ID: <87irjk7wof.fsf@qrnik.zagroda> Michael Chermside writes: > Adam Olsen: >> I agree here. I think an executor approach is much better; kill the >> object, then make a weakref callback do any further cleanups using >> copies it made in advance. I agree. Objects with finalizers with the semantics of __del__ are inherently unsafe: they can be finalized when they are still in use, namely when they are used by a finalizer of another object. The correct way is to register a finalizer from outside the object, such that it's invoked asynchronously when the associated object has been garbage collected. Everything reachable from a finalizer is considered live. As far as I understand it, Python's weakrefs have mostly correct semantics.
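To make that concrete, here is a minimal sketch of registering a finalizer from outside the object, using only the existing weakref module (the helper name and registry are invented for the example). The cleanup callable and its arguments carry the state the finalizer needs, extracted in advance:

import weakref

_pending = set()   # keeps the weakref objects themselves alive

def register_finalizer(obj, cleanup, *args):
    # cleanup(*args) runs once obj has been garbage collected.
    # Neither cleanup nor args may reference obj itself, or the
    # weakref callback can never fire.
    def callback(ref):
        _pending.discard(ref)
        cleanup(*args)
    _pending.add(weakref.ref(obj, callback))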
Finalizers must be invoked from a separate thread: http://www.hpl.hp.com/techreports/2002/HPL-2002-335.html The finalizer should not access the associated object itself (or it will never be invoked), but it should only access the parts of the object and other objects that it needs. Sometimes it is necessary to split an object into an outer part which triggers finalization, and an inner part which is accessed by the finalizer. Even though this looks inconvenient, this design is necessary for building rock solid finalizable objects. This design allows for the presence of a finalizer to be a private implementation detail. __del__ methods don't have this property because objects with finalizers are unsafe to use from other finalizers. Python documentation contains the following snippet: "Starting with version 1.5, Python guarantees that globals whose name begins with a single underscore are deleted from their module before other globals are deleted; if no other references to such globals exist, this may help in assuring that imported modules are still available at the time when the __del__() method is called." This is clearly a hack which just increases the likelihood that the code works. A correct design allows one to write code that works in 100% of the cases. -- __("< Marcin Kowalczyk \__/ qrczak at knm.org.pl ^^ http://qrnik.knm.org.pl/~qrczak/ From qrczak at knm.org.pl Tue Sep 19 16:42:18 2006 From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk) Date: Tue, 19 Sep 2006 16:42:18 +0200 Subject: [Python-3000] Delayed reference counting idea In-Reply-To: <422D230A-22E0-45D4-A5DC-267081CB8FEA@python.org> (Barry Warsaw's message of "Mon, 18 Sep 2006 13:40:19 -0400") References: <8764fl87j3.fsf@qrnik.zagroda> <422D230A-22E0-45D4-A5DC-267081CB8FEA@python.org> Message-ID: <87eju79aut.fsf@qrnik.zagroda> Barry Warsaw writes: > What worries me is the unpredictability of gc vs. refcounting. For > some class of Python applications it's important that when an object > is dereferenced it really goes away right then. I /like/ reference > counting! This can be solved by explicit freeing of objects whose cleanup must be performed deterministically. Lisp has UNWIND-PROTECT and type-specific macros like WITH-OPEN-FILE. C# has the 'using' keyword. Python has 'with', which can be used for that. Reference counting is inefficient, doesn't by itself handle cycles, and is impractical to combine with threads which run in parallel. The general consensus of modern language implementations is that a tracing GC is the future. I admit that implementing a good GC is hard. It's quite hard to make it incremental, and it's hard to avoid stopping all threads during GC (but it's easier to allow threads to run in parallel between GCs, with no need of forced synchronization each time a reference to an object is created). -- __("< Marcin Kowalczyk \__/ qrczak at knm.org.pl ^^ http://qrnik.knm.org.pl/~qrczak/ From barry at python.org Tue Sep 19 16:53:23 2006 From: barry at python.org (Barry Warsaw) Date: Tue, 19 Sep 2006 10:53:23 -0400 Subject: [Python-3000] Delayed reference counting idea In-Reply-To: <87eju79aut.fsf@qrnik.zagroda> References: <8764fl87j3.fsf@qrnik.zagroda> <422D230A-22E0-45D4-A5DC-267081CB8FEA@python.org> <87eju79aut.fsf@qrnik.zagroda> Message-ID: -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On Sep 19, 2006, at 10:42 AM, Marcin 'Qrczak' Kowalczyk wrote: > Barry Warsaw writes: > >> What worries me is the unpredictability of gc vs. refcounting.
For >> some class of Python applications it's important that when an object >> is dereferenced it really goes away right then. I /like/ reference >> counting! > > This can be solved by explicit freeing of objects whose cleanup must > be performed deterministically. > > Lisp has UNWIND-PROTECT and type-specific macros like WITH-OPEN-FILE. > C# has 'using' keyword. Python has 'with' which can be used for that. I don't see how that helps. I can remove all references to the object but I still have to wait until gc runs to free it. Can you explain your idea in more detail? > Reference counting is inefficient, doesn't by itself handle cycles, > and is impractical to combine with threads which run in parallel. The > general consensus of modern language implementations is that a tracing > GC is the future. > > I admit that implementing a good GC is hard. It's quite hard to make > it incremental, and it's hard to avoid stopping all threads during GC > (but it's easier to allow threads to run in parallel between GCs, with > no need of forced synchronization each time a reference to an object > is created). I just think that it's important to remember that there are use cases that reference counting solves. GC and refcounting both have their pros and cons. I tend to think that Python's current refcounting + cyclic gc is the devil we know, so unless there is a clear, proven better way I'm not eager to change it. - -Barry -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.5 (Darwin) iQCVAwUBRRAEa3EjvBPtnXfVAQIpyAQArWHs0j+yJs5raS4EQgj/v1NXYOqzXLAn eM5eWMMTDY6qZgWa2i7DFciO1MZnX6/HAUsRYSc7lHPEWKMbNoCgPQZP46XoX8/w FYtvuRCdVUlPvTtfZk8ltl/ERXb+vtR4Jtb/dT7+0VxdbGLHvqgMaCrcDXMd2n4C du4cjV+GZ1k= =anX9 -----END PGP SIGNATURE----- From brian at sweetapp.com Tue Sep 19 17:29:00 2006 From: brian at sweetapp.com (Brian Quinlan) Date: Tue, 19 Sep 2006 17:29:00 +0200 Subject: [Python-3000] Delayed reference counting idea In-Reply-To: <87eju79aut.fsf@qrnik.zagroda> References: <8764fl87j3.fsf@qrnik.zagroda> <422D230A-22E0-45D4-A5DC-267081CB8FEA@python.org> <87eju79aut.fsf@qrnik.zagroda> Message-ID: <45100CBC.2060304@sweetapp.com> Marcin 'Qrczak' Kowalczyk wrote: > Reference counting is inefficient, doesn't by itself handle cycles, > and is impractical to combine with threads which run in parallel. The > general consensus of modern language implementations is that a tracing > GC is the future. How is reference counting inefficient? Cheers, Brian From qrczak at knm.org.pl Tue Sep 19 17:29:12 2006 From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk) Date: Tue, 19 Sep 2006 17:29:12 +0200 Subject: [Python-3000] Delayed reference counting idea In-Reply-To: (Barry Warsaw's message of "Tue, 19 Sep 2006 10:53:23 -0400") References: <8764fl87j3.fsf@qrnik.zagroda> <422D230A-22E0-45D4-A5DC-267081CB8FEA@python.org> <87eju79aut.fsf@qrnik.zagroda> Message-ID: <878xkfc1tj.fsf@qrnik.zagroda> Barry Warsaw writes: > I don't see how that helps. I can remove all references to the > object but I still have to wait until gc runs to free it. Can you > explain your idea in more detail? Objects which should be closed deterministically have the closing action decoupled from the lifetime of the object. They are closed explicitly; the object in a "closed" state doesn't take up any sensitive resources. > I just think that it's important to remember that there are use > cases that reference counting solves. GC and refcounting both have > their pros and cons. Unfortunately it's hard to mix the two styles. 
Counting all reference operations in the presence of a real GC would imply paying the costs of both schemes together. > I tend to think that Python's current refcounting + cyclic gc is the > devil we know, so unless there is a clear, proven better way I'm not > eager to change it. They are different sets of tradeoffs; neither is universally better. I claim that a tracing GC is usually better, or better overall, but it can't be proven to be better in all respects. Changing an existing system creates more compatibility obstacles than designing a system from scratch. I'm not convinced that it's practical to change the Python GC now. I only wish it had a tracing GC instead. -- __("< Marcin Kowalczyk \__/ qrczak at knm.org.pl ^^ http://qrnik.knm.org.pl/~qrczak/ From qrczak at knm.org.pl Tue Sep 19 17:50:29 2006 From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk) Date: Tue, 19 Sep 2006 17:50:29 +0200 Subject: [Python-3000] Delayed reference counting idea In-Reply-To: <45100CBC.2060304@sweetapp.com> (Brian Quinlan's message of "Tue, 19 Sep 2006 17:29:00 +0200") References: <8764fl87j3.fsf@qrnik.zagroda> <422D230A-22E0-45D4-A5DC-267081CB8FEA@python.org> <87eju79aut.fsf@qrnik.zagroda> <45100CBC.2060304@sweetapp.com> Message-ID: <87mz8vg8je.fsf@qrnik.zagroda> Brian Quinlan writes: >> Reference counting is inefficient, doesn't by itself handle cycles, >> and is impractical to combine with threads which run in parallel. The >> general consensus of modern language implementations is that a tracing >> GC is the future. > > How is reference counting inefficient? It involves operations every time an object is merely passed around, as references to the object are created or destroyed. It doesn't move objects in memory, and thus free memory is fragmented. Memory allocation can't just chop from a single area of free memory. It can't allocate several objects with the cost of one allocation either. -- __("< Marcin Kowalczyk \__/ qrczak at knm.org.pl ^^ http://qrnik.knm.org.pl/~qrczak/ From jimjjewett at gmail.com Tue Sep 19 17:54:30 2006 From: jimjjewett at gmail.com (Jim Jewett) Date: Tue, 19 Sep 2006 11:54:30 -0400 Subject: [Python-3000] Removing __del__ In-Reply-To: <20060919053609.vp8duwukq7sw4w48@login.werra.lunarpages.com> References: <20060919053609.vp8duwukq7sw4w48@login.werra.lunarpages.com> Message-ID: On 9/19/06, Michael Chermside wrote: > The following comments got me thinking: > Raymond: > > Statistics incontrovertibly prove that people who habitually > > avoid __del__ lead happier lives and spend fewer hours in therapy ;-) > Adam Olsen: > > I agree here. I think an executor approach is much better; kill the > > object, then make a weakref callback do any further cleanups using > > copies it made in advance. > Since we're apparently still in "propose wild ideas" mode for Py3K > I'd like to propose that for Py3K we remove __del__. Not "fix" it, > not "tweak" it, just remove it and perhaps add a note in the manual > pointing people to the weakref module. The various "create a separate closer object instead" recipes all seem to cause a jump in complexity, particularly if you try for a general solution. I do think we should split __del__ into the (rare, problematic) general case and a "special-purpose" lightweight __close__ version that does a better job in the normal case. For the general case, Python refuses to guess about which order to call __del__ cycles in; this has the unfortunate side effect of making them immortal.
Almost all actual __del__ uses are effectively a call to self.close(). The call might be required (Tk would leak if tkinter didn't notify it), or it might just be good housekeeping. The key point is that order doesn't matter. In practice they all seem to already be written defensively, so that they can be called multiple times, or even after teardown has started. So the semantics of __close__ would be just like those of __del__ except that (1) It would be called at least once if the process terminates normally. (2) Call order for linked objects would be arbitrary. FWIW, I couldn't find a single example in the stdlib (outside of tests) that wouldn't work at least as well if converted to a __close__ method. (subprocess and popen2 would be harder if __close__ were a once-only method, like I think generator close ended up becoming.) -jJ From barry at python.org Tue Sep 19 18:01:53 2006 From: barry at python.org (Barry Warsaw) Date: Tue, 19 Sep 2006 12:01:53 -0400 Subject: [Python-3000] Delayed reference counting idea In-Reply-To: <878xkfc1tj.fsf@qrnik.zagroda> References: <8764fl87j3.fsf@qrnik.zagroda> <422D230A-22E0-45D4-A5DC-267081CB8FEA@python.org> <87eju79aut.fsf@qrnik.zagroda> <878xkfc1tj.fsf@qrnik.zagroda> Message-ID: <2CEDFB05-01F4-47C6-A8B7-460A9FFFD369@python.org> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On Sep 19, 2006, at 11:29 AM, Marcin 'Qrczak' Kowalczyk wrote: > Barry Warsaw writes: > >> I don't see how that helps. I can remove all references to the >> object but I still have to wait until gc runs to free it. Can you >> explain your idea in more detail? > > Objects which should be closed deterministically have the closing > action decoupled from the lifetime of the object. They are closed > explicitly; the object in a "closed" state doesn't take up any > sensitive resources. It's not external resources I'm concerned about, it's the physical memory consumed in process by objects which are unreachable but not reclaimed. - -Barry -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.5 (Darwin) iQCVAwUBRRAUdXEjvBPtnXfVAQKFuQP9HmWucjJ//dTiEnEmjCgLNDbFF1J12c5U KwZAbBZw0CFjtZXCF9/cGuZ+KWROJGIB7A6YnqqmuIXhJ82t6Qmvm257pvQkWe/5 HmZbLCPoGKzmL33ince2f5gLxqKzl90B2L24TLlEYvrfOS9KTe2ree3HJXmyuRz3 471OBzViVAA= =WhVp -----END PGP SIGNATURE----- From brian at sweetapp.com Tue Sep 19 18:05:46 2006 From: brian at sweetapp.com (Brian Quinlan) Date: Tue, 19 Sep 2006 18:05:46 +0200 Subject: [Python-3000] Delayed reference counting idea In-Reply-To: <87mz8vg8je.fsf@qrnik.zagroda> References: <8764fl87j3.fsf@qrnik.zagroda> <422D230A-22E0-45D4-A5DC-267081CB8FEA@python.org> <87eju79aut.fsf@qrnik.zagroda> <45100CBC.2060304@sweetapp.com> <87mz8vg8je.fsf@qrnik.zagroda> Message-ID: <4510155A.5010308@sweetapp.com> Marcin 'Qrczak' Kowalczyk wrote: > Brian Quinlan writes: > >>> Reference counting is inefficient, doesn't by itself handle cycles, >>> and is impractical to combine with threads which run in parallel. The >>> general consensus of modern language implementations is that a tracing >>> GC is the future. >> How is reference counting inefficient? Do you somehow know that a tracing GC would be more efficient for typical Python programs, or are you just speculating? > It involves operations every time an object is merely passed around, > as references to the object are created or destroyed. But if the lifetime of most objects is confined to a single function call, isn't reference counting going to be quite efficient? > It doesn't move objects in memory, and thus free memory is fragmented. OK.
Have you had memory fragmentation problems with Python? > Memory allocation can't just chop from a single area of free memory. > It can't allocate several objects with the cost of one allocation either. I'm not sure what you mean here. Cheers, Brian From barry at python.org Tue Sep 19 18:10:35 2006 From: barry at python.org (Barry Warsaw) Date: Tue, 19 Sep 2006 12:10:35 -0400 Subject: [Python-3000] Delayed reference counting idea In-Reply-To: <4510155A.5010308@sweetapp.com> References: <8764fl87j3.fsf@qrnik.zagroda> <422D230A-22E0-45D4-A5DC-267081CB8FEA@python.org> <87eju79aut.fsf@qrnik.zagroda> <45100CBC.2060304@sweetapp.com> <87mz8vg8je.fsf@qrnik.zagroda> <4510155A.5010308@sweetapp.com> Message-ID: <13D26562-CF05-44F3-A855-0CD280BA3140@python.org> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On Sep 19, 2006, at 12:05 PM, Brian Quinlan wrote: > Marcin 'Qrczak' Kowalczyk wrote: >> Brian Quinlan writes: >> >>>> Reference counting is inefficient, doesn't by itself handle cycles, >>>> and is impractical to combine with threads which run in >>>> parallel. The >>>> general consensus of modern language implementations is that a >>>> tracing >>>> GC is the future. >>> How is reference counting inefficient? > > Do you somehow know that a tracing GC would be more efficient for typical > Python programs, or are you just speculating? Also, what does "efficient" mean here? Overall program run time? No user-discernible pauses in operation? Stinginess in overall memory use? There are a lot of different efficiency parameters to consider, and of course different applications will care more about some than others. A u/i-based tool doesn't want noticeable pauses. A long running daemon wants manageable and predictable memory utilization. Etc. - -Barry -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.5 (Darwin) iQCVAwUBRRAWfHEjvBPtnXfVAQJ2oQP/UtBlUbCb74YfnmR6ueyL/DAxe0yT5sK6 0i1bqcStZeTsub1Hor0xYQ8VDTL38lR6L446vw5WehEmaDkK0v5zreNHCEYvaqFC 3nWm/xC9NUFJrONX+YzkBLOuEpW0g08imOsbgPdvEREopvsS5kJ4e9TrNeS+fRu8 x8CIY3r5Vm0= =d/HI -----END PGP SIGNATURE----- From jcarlson at uci.edu Tue Sep 19 18:23:01 2006 From: jcarlson at uci.edu (Josiah Carlson) Date: Tue, 19 Sep 2006 09:23:01 -0700 Subject: [Python-3000] Delayed reference counting idea In-Reply-To: <87mz8vg8je.fsf@qrnik.zagroda> References: <45100CBC.2060304@sweetapp.com> <87mz8vg8je.fsf@qrnik.zagroda> Message-ID: <20060919091147.07F3.JCARLSON@uci.edu> "Marcin 'Qrczak' Kowalczyk" wrote: > > Brian Quinlan writes: > > >> Reference counting is inefficient, doesn't by itself handle cycles, > >> and is impractical to combine with threads which run in parallel. The > >> general consensus of modern language implementations is that a tracing > >> GC is the future. > > > > How is reference counting inefficient? > > It involves operations every time an object is merely passed around, > as references to the object are created or destroyed. Redefine the INC/DECREF macros to assign something like 2**30 as the reference count in INCREF, and make DECREF do nothing. A write of a constant should be measurably faster than an increment. Run some relatively small test program (be concerned about memory!), and compare the results to see if there is a substantial difference in performance. > It doesn't move objects in memory, and thus free memory is fragmented. > Memory allocation can't just chop from a single area of free memory. > It can't allocate several objects with the cost of one allocation either.
It can certainly allocate several objects with the cost of one allocation, but it can't *deallocate* those objects individually. See the various freelists for examples where this is used successfully in Python now. - Josiah From qrczak at knm.org.pl Tue Sep 19 18:37:55 2006 From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk) Date: Tue, 19 Sep 2006 18:37:55 +0200 Subject: [Python-3000] Delayed reference counting idea In-Reply-To: <2CEDFB05-01F4-47C6-A8B7-460A9FFFD369@python.org> (Barry Warsaw's message of "Tue, 19 Sep 2006 12:01:53 -0400") References: <8764fl87j3.fsf@qrnik.zagroda> <422D230A-22E0-45D4-A5DC-267081CB8FEA@python.org> <87eju79aut.fsf@qrnik.zagroda> <878xkfc1tj.fsf@qrnik.zagroda> <2CEDFB05-01F4-47C6-A8B7-460A9FFFD369@python.org> Message-ID: <87r6y7ke1o.fsf@qrnik.zagroda> Barry Warsaw writes: > It's not external resources I'm concerned about, it's the physical > memory consumed in process by objects which are unreachable but not > reclaimed. The rate of garbage collection depends on the rate of allocation. While the objects are not freed at the earliest possible moment, they are freed if the memory is needed for other objects. Brian Quinlan writes: > Do you somehow know that a tracing GC would be more efficient for typical > Python programs, or are you just speculating? I'm mostly speculating. It's hard to measure the difference between garbage collection schemes because most language runtimes are tied to a particular GC implementation, and thus you can't substitute a different GC leaving everything else the same. I've done some experiments with C++-based GCs including reference counting, but they were inconclusive. The effects strongly depend on the kind of program and the amount of memory it uses, and various GC schemes are better or worse in different scenarios. >> It involves operations every time an object is merely passed around, >> as references to the object are created or destroyed. > > But if the lifetime of most objects is confined to a single function > call, isn't reference counting going to be quite efficient? Even if an object begins and ends its lifetime within a particular function call, it's usually passed down to other functions in the meantime. Every time a Python function returns an object, the reference count on the result is incremented, and it's decremented at some time by its caller. Every time a function implemented in Python is called, reference counts of its parameters are incremented, and they are decremented when it returns. Every time a None is stored in a data structure or returned from a function, its reference count is incremented. Every time a list is freed, reference counts of objects it refers to are decremented. Every time two ints are added, the reference count of the result is incremented even if that integer was preallocated. Every time a field is assigned to, two reference counts are manipulated. >> It doesn't move objects in memory, and thus free memory is fragmented. > > OK. Have you had memory fragmentation problems with Python? Indirectly: memory allocation can't be as fast as in some GC schemes. >> Memory allocation can't just chop from a single area of free memory. >> It can't allocate several objects with the cost of one allocation either. > > I'm not sure what you mean here. There are various GCs (including OCaml, and probably Sun's Java and Microsoft's .NET implementations, and my language implementation, and surely others) where the fast path of memory allocation looks like stack allocation with overflow checking.
Moreover, if several objects are to be allocated at once (which I admit is more likely in compiled code), the cost is still the same as allocating one object whose size is the sum of the sizes of the objects (not counting filling the objects with contents). There is no per-object data to fill besides a header which points to a static structure describing the layout. -- __("< Marcin Kowalczyk \__/ qrczak at knm.org.pl ^^ http://qrnik.knm.org.pl/~qrczak/ From qrczak at knm.org.pl Tue Sep 19 18:55:26 2006 From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk) Date: Tue, 19 Sep 2006 18:55:26 +0200 Subject: [Python-3000] Delayed reference counting idea In-Reply-To: <87mz8vg8je.fsf@qrnik.zagroda> (Marcin Kowalczyk's message of "Tue, 19 Sep 2006 17:50:29 +0200") References: <8764fl87j3.fsf@qrnik.zagroda> <422D230A-22E0-45D4-A5DC-267081CB8FEA@python.org> <87eju79aut.fsf@qrnik.zagroda> <45100CBC.2060304@sweetapp.com> <87mz8vg8je.fsf@qrnik.zagroda> Message-ID: <87wt7zzthd.fsf@qrnik.zagroda> "Marcin 'Qrczak' Kowalczyk" writes: > It involves operations every time an object is merely passed around, > as references to the object are created or destroyed. And it does something when it frees an object. In some GCs there is a cost associated with keeping an object alive, but there is no per-object cost when a group of objects die. Most objects die young. This is what I've measured myself. When my compiler runs, the average lifetime of an object is about 1/5 of a GC cycle. This means that 80% of objects have only an allocation cost, while freeing is free. And with a generational GC most of the others are copied only once: major GCs are less frequent than minor GCs. It is true that a given long-lived object has a larger cost, but such objects are a minority, and I believe this scheme pays off. Especially if it was implemented better than I did it; this is the only GC I've implemented so far, and I'm sure that experienced people can tune it better. -- __("< Marcin Kowalczyk \__/ qrczak at knm.org.pl ^^ http://qrnik.knm.org.pl/~qrczak/ From qrczak at knm.org.pl Tue Sep 19 20:48:11 2006 From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk) Date: Tue, 19 Sep 2006 20:48:11 +0200 Subject: [Python-3000] Removing __del__ In-Reply-To: (Jim Jewett's message of "Tue, 19 Sep 2006 11:54:30 -0400") References: <20060919053609.vp8duwukq7sw4w48@login.werra.lunarpages.com> Message-ID: <87lkofn15g.fsf@qrnik.zagroda> "Jim Jewett" writes: > I do think we should split __del__ into the (rare, problematic) > general case and a "special-purpose" lightweight __close__ version > that does a better job in the normal case. A synchronous finalizer which doesn't keep the object it refers to alive, like Python's __del__, is sufficient when the finalizer doesn't use other finalizable objects, and doesn't conflict with the rest of the program in terms of potentially concurrent operations on shared data (read/write or write/write). Note that the concurrency conflict can manifest even in a single-threaded program, because __del__ finalizers are in fact semi-asynchronous: they are invoked when a reference count is decremented and causes the relevant object to become dead, which can happen in lots of places, even on a seemingly innocent variable assignment.
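A tiny demonstration of that semi-asynchrony under CPython's refcounting (a made-up class, just to show where the finalizer fires):

class Noisy(object):
    def __del__(self):
        print "finalizer runs now"

x = Noisy()
x = Noisy()   # rebinding x drops the last reference to the first
              # instance, so its __del__ runs *during* this
              # innocent-looking assignment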
-- __("< Marcin Kowalczyk \__/ qrczak at knm.org.pl ^^ http://qrnik.knm.org.pl/~qrczak/ From mcherm at mcherm.com Tue Sep 19 21:54:15 2006 From: mcherm at mcherm.com (Michael Chermside) Date: Tue, 19 Sep 2006 12:54:15 -0700 Subject: [Python-3000] Delayed reference counting idea Message-ID: <20060919125415.u61ujgr44siscwk0@login.werra.lunarpages.com> Speaking on the speed of GC implementations, Marcin writes: > I'm mostly speculating. It's hard to measure the difference between > garbage collection schemes because most language runtimes are tied > to a particular GC implementation, and thus you can't substitute a > different GC leaving everything else the same. Interestingly, one of the original goals of PyPy was to create a test bed in which it was easy to experiment and answer just this kind of question. Unfortunately, although they have an architecture allowing pluggable GC algorithms (what an incredible concept!) I don't believe that any reliable conclusions can be drawn from things as they now stand. For more details see http://codespeak.net/pypy/dist/pypy/doc/garbage_collection.html -- Michael Chermside From martin at v.loewis.de Tue Sep 19 23:32:07 2006 From: martin at v.loewis.de (=?UTF-8?B?Ik1hcnRpbiB2LiBMw7Z3aXMi?=) Date: Tue, 19 Sep 2006 23:32:07 +0200 Subject: [Python-3000] Kill GIL? In-Reply-To: <1cb725390609180119i548a5a65j841599633cba712f@mail.gmail.com> References: <7008329d0609170528m5ca38011s292ce21396fa879@mail.gmail.com> <450E34EF.3090202@solarsail.hcs.harvard.edu> <1cb725390609180119i548a5a65j841599633cba712f@mail.gmail.com> Message-ID: <451061D7.4010105@v.loewis.de> Paul Prescod schrieb: > Even if I won't contribute code or even a design to the solution > (because it isn't an area of expertise and I'm still working on > encodings stuff) I think that there would be value in saying: "There's a > big problem here and we intend to fix it in Python 3000." This is of value only if "we" really intend to fix it. I don't, and apparently, you don't either. It would be very bad to claim that "we" will fix it, and then not do it. It's much much much much better to acknowledge that "we" aren't going to fix it, not with Python 3.0, and likely not with any release in the foreseeable future. The only exception would be if somebody offered a reasonable solution, which "we" would just have to incorporate (and possibly maintain, although it would be good if the original author would be around for a year or so). Regards, Martin From martin at v.loewis.de Tue Sep 19 23:34:53 2006 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Tue, 19 Sep 2006 23:34:53 +0200 Subject: [Python-3000] Kill GIL? In-Reply-To: <450E792C.1070105@gmail.com> References: <7008329d0609170528m5ca38011s292ce21396fa879@mail.gmail.com> <450D4AAE.2000805@gmail.com> <450E792C.1070105@gmail.com> Message-ID: <4510627D.2080704@v.loewis.de> Nick Coghlan schrieb: > I was thinking it would be easier to split out the Global Interpreter Lock and > a per-interpreter Local Interpreter Lock, rather than trying to go to a full > free-threading model. Anyone sharing other objects between interpreters would > still need their own synchronisation mechanism, but something like > threading.Queue should suffice for that. The challenge with that is "global" (i.e. across-interpreter) objects. There are several of these: the obvious singletons (None, True, False), the non-obvious singletons ((), -2..100 or so), and the extension module globals (types, and in particular exceptions).
Do you want them still to be global, or per-interpreter? Regards, Martin From martin at v.loewis.de Tue Sep 19 23:41:05 2006 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Tue, 19 Sep 2006 23:41:05 +0200 Subject: [Python-3000] Delayed reference counting idea In-Reply-To: <87mz8vg8je.fsf@qrnik.zagroda> References: <8764fl87j3.fsf@qrnik.zagroda> <422D230A-22E0-45D4-A5DC-267081CB8FEA@python.org> <87eju79aut.fsf@qrnik.zagroda> <45100CBC.2060304@sweetapp.com> <87mz8vg8je.fsf@qrnik.zagroda> Message-ID: <451063F1.9050207@v.loewis.de> Marcin 'Qrczak' Kowalczyk schrieb: > It doesn't move objects in memory, and thus free memory is fragmented. That's true, but not a problem. > Memory allocation can't just chop from a single area of free memory. That's not true. Python does it all the time. Allocation is in constant time most of the time (in some applications, it's always constant). Regards, Martin From rasky at develer.com Tue Sep 19 23:42:43 2006 From: rasky at develer.com (Giovanni Bajo) Date: Tue, 19 Sep 2006 23:42:43 +0200 Subject: [Python-3000] Removing __del__ References: <20060919053609.vp8duwukq7sw4w48@login.werra.lunarpages.com> Message-ID: <023701c6dc34$8a79dc50$a14c2597@bagio> Michael Chermside wrote: > Since we're apparently still in "propose wild ideas" mode for Py3K > I'd like to propose that for Py3K we remove __del__. Not "fix" it, > not "tweak" it, just remove it and perhaps add a note in the manual > pointing people to the weakref module. I don't use __del__ much. I use it only in leaf classes, where it surely can't be part of loops. In those rare cases, it's very useful to me. For instance, I have a small class which wraps an existing handle-based C API exported to Python. Something along the lines of:

class Wrapper:
    def __init__(self, *args):
        self.handle = CAPI.init(*args)

    def __del__(self, *args):
        CAPI.close(self.handle)

    def foo(self):
        CAPI.foo(self.handle)

The real class isn't much longer than this (really). How do you propose to write this same code without __del__? Notice that I'd be perfectly fine with the __close__ semantic proposed in this thread (might be called more than once, order within the loop doesn't matter).
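For concreteness, here is roughly what the wrapper might look like under that proposed __close__ semantic (bearing in mind that __close__ is only a proposal in this thread, not an existing hook, so the method has to tolerate repeated calls):

class Wrapper:
    def __init__(self, *args):
        self.handle = CAPI.init(*args)

    def __close__(self):
        # may be called more than once, and in arbitrary order
        # within a cycle, so be defensive
        if self.handle is not None:
            CAPI.close(self.handle)
            self.handle = None

    def foo(self):
        CAPI.foo(self.handle)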
Giovanni Bajo From rhamph at gmail.com Wed Sep 20 00:31:36 2006 From: rhamph at gmail.com (Adam Olsen) Date: Tue, 19 Sep 2006 16:31:36 -0600 Subject: [Python-3000] Removing __del__ In-Reply-To: <023701c6dc34$8a79dc50$a14c2597@bagio> References: <20060919053609.vp8duwukq7sw4w48@login.werra.lunarpages.com> <023701c6dc34$8a79dc50$a14c2597@bagio> Message-ID: On 9/19/06, Giovanni Bajo wrote: > Michael Chermside wrote: > > > Since we're apparently still in "propose wild ideas" mode for Py3K > > I'd like to propose that for Py3K we remove __del__. Not "fix" it, > > not "tweak" it, just remove it and perhaps add a note in the manual > > pointing people to the weakref module. > > > I don't use __del__ much. I use it only in leaf classes, where it surely can't > be part of loops. In those rare cases, it's very useful to me. For instance, I > have a small classes which wraps an existing handle-based C API exported to > Python. Something along the lines of: > > class Wrapper: > def __init__(self, *args): > self.handle = CAPI.init(*args) > > def __del__(self, *args): > CAPI.close(self.handle) > > def foo(self): > CAPI.foo(self.handle) > > The real class isn't much longer than this (really). How do you propose to > write this same code without __del__? I've experimented with using metaclasses to do some fun here. It could look something like this: Class Wrapper(Core): def __init__(self, *args): Core.__init__(self) self.core.handle = CAPI.init(*args) @coremethod def __coredel__(core): CAPI.close(core.handle) def foo(self): CAPI.foo(self.core.handle) Works just fine in 2.x. -- Adam Olsen, aka Rhamphoryncus From exarkun at divmod.com Wed Sep 20 00:40:48 2006 From: exarkun at divmod.com (Jean-Paul Calderone) Date: Tue, 19 Sep 2006 18:40:48 -0400 Subject: [Python-3000] Removing __del__ In-Reply-To: <023701c6dc34$8a79dc50$a14c2597@bagio> Message-ID: <20060919224048.1717.886353737.divmod.quotient.54336@ohm> On Tue, 19 Sep 2006 23:42:43 +0200, Giovanni Bajo wrote: >Michael Chermside wrote: > >> Since we're apparently still in "propose wild ideas" mode for Py3K >> I'd like to propose that for Py3K we remove __del__. Not "fix" it, >> not "tweak" it, just remove it and perhaps add a note in the manual >> pointing people to the weakref module. > > >I don't use __del__ much. I use it only in leaf classes, where it surely can't >be part of loops. In those rare cases, it's very useful to me. For instance, I >have a small classes which wraps an existing handle-based C API exported to >Python. Something along the lines of: > >class Wrapper: > def __init__(self, *args): > self.handle = CAPI.init(*args) > > def __del__(self, *args): > CAPI.close(self.handle) > > def foo(self): > CAPI.foo(self.handle) > >The real class isn't much longer than this (really). How do you propose to >write this same code without __del__? Untested, but roughly: _weakrefs = [] def _cleanup(ref, handle): _weakrefs.remove(ref) CAPI.close(handle) class BetterWrapper: def __init__(self, *args): handle = self.handle = CAPI.init(*args) _weakrefs.append( weakref.ref(self, lambda ref: _cleanup(ref, handle))) def foo(self): CAPI.foo(self.handle) There are probably even better ways too, this is just the first that comes to mind. 
Jean-Paul From greg.ewing at canterbury.ac.nz Wed Sep 20 02:22:47 2006 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Wed, 20 Sep 2006 12:22:47 +1200 Subject: [Python-3000] Delayed reference counting idea In-Reply-To: <878xkfc1tj.fsf@qrnik.zagroda> References: <8764fl87j3.fsf@qrnik.zagroda> <422D230A-22E0-45D4-A5DC-267081CB8FEA@python.org> <87eju79aut.fsf@qrnik.zagroda> <878xkfc1tj.fsf@qrnik.zagroda> Message-ID: <451089D7.6060204@canterbury.ac.nz> Marcin 'Qrczak' Kowalczyk wrote: > Objects which should be closed deterministically have the closing > action decoupled from the lifetime of the object. That doesn't cover the case where the "closing" action you want includes freeing the memory occupied by the object. The game I mentioned earlier is one of those -- I don't need anything "closed", I just want the memory back. From greg.ewing at canterbury.ac.nz Wed Sep 20 02:25:07 2006 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Wed, 20 Sep 2006 12:25:07 +1200 Subject: [Python-3000] Delayed reference counting idea In-Reply-To: <87mz8vg8je.fsf@qrnik.zagroda> References: <8764fl87j3.fsf@qrnik.zagroda> <422D230A-22E0-45D4-A5DC-267081CB8FEA@python.org> <87eju79aut.fsf@qrnik.zagroda> <45100CBC.2060304@sweetapp.com> <87mz8vg8je.fsf@qrnik.zagroda> Message-ID: <45108A63.9090506@canterbury.ac.nz> Marcin 'Qrczak' Kowalczyk wrote: > It doesn't move objects in memory, and thus free memory is fragmented. That's not a feature of refcounting as such. With sufficient indirection, moveable refcounted memory blocks are possible (early Smalltalks worked that way, I believe). -- Greg From greg.ewing at canterbury.ac.nz Wed Sep 20 02:34:06 2006 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Wed, 20 Sep 2006 12:34:06 +1200 Subject: [Python-3000] Delayed reference counting idea In-Reply-To: <20060919125415.u61ujgr44siscwk0@login.werra.lunarpages.com> References: <20060919125415.u61ujgr44siscwk0@login.werra.lunarpages.com> Message-ID: <45108C7E.1080104@canterbury.ac.nz> Michael Chermside wrote: > Interestingly, one of the original goals of PyPy was to create a > test bed in which it was easy to experiment and answer just this > kind of question. A worry about that is whether the architecture required to allow pluggable GC implementations introduces inefficiencies of its own that would skew the results.
-- Greg From bob at redivi.com Wed Sep 20 02:47:05 2006 From: bob at redivi.com (Bob Ippolito) Date: Tue, 19 Sep 2006 17:47:05 -0700 Subject: [Python-3000] Delayed reference counting idea In-Reply-To: <45108C7E.1080104@canterbury.ac.nz> References: <20060919125415.u61ujgr44siscwk0@login.werra.lunarpages.com> <45108C7E.1080104@canterbury.ac.nz> Message-ID: <6a36e7290609191747r215f7184uc2cf2bd5679b821e@mail.gmail.com> On 9/19/06, Greg Ewing wrote: > Michael Chermside wrote: > > > Interestingly, one of the original goals of PyPy was to create a > > test bed in which it was easy to experiment and answer just this > > kind of question. > > A worry about that is whether the architecture required to > allow pluggable GC implementations introduces inefficiencies > of its own that would skew the results. > There's no need to worry about that in the case of PyPy. Those kinds of choices are made way before runtime, so there's no required indirection. -bob From qrczak at knm.org.pl Wed Sep 20 03:01:33 2006 From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk) Date: Wed, 20 Sep 2006 03:01:33 +0200 Subject: [Python-3000] Delayed reference counting idea In-Reply-To: <45108A63.9090506@canterbury.ac.nz> (Greg Ewing's message of "Wed, 20 Sep 2006 12:25:07 +1200") References: <8764fl87j3.fsf@qrnik.zagroda> <422D230A-22E0-45D4-A5DC-267081CB8FEA@python.org> <87eju79aut.fsf@qrnik.zagroda> <45100CBC.2060304@sweetapp.com> <87mz8vg8je.fsf@qrnik.zagroda> <45108A63.9090506@canterbury.ac.nz> Message-ID: <8764fjnyfm.fsf@qrnik.zagroda> Greg Ewing writes: >> It doesn't move objects in memory, and thus free memory is fragmented. > > That's not a feature of refcounting as such. With sufficient > indirection, moveable refcounted memory blocks are possible > (early Smalltalks worked that way, I believe). Yes, but the indirection is a cost in itself. A tracing GC can move objects without such indirection, because it can update all pointers to the given object. -- __("< Marcin Kowalczyk \__/ qrczak at knm.org.pl ^^ http://qrnik.knm.org.pl/~qrczak/ From greg.ewing at canterbury.ac.nz Wed Sep 20 02:59:14 2006 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Wed, 20 Sep 2006 12:59:14 +1200 Subject: [Python-3000] Delayed reference counting idea In-Reply-To: <6a36e7290609191747r215f7184uc2cf2bd5679b821e@mail.gmail.com> References: <20060919125415.u61ujgr44siscwk0@login.werra.lunarpages.com> <45108C7E.1080104@canterbury.ac.nz> <6a36e7290609191747r215f7184uc2cf2bd5679b821e@mail.gmail.com> Message-ID: <45109262.5020308@canterbury.ac.nz> Bob Ippolito wrote: > There's no need to worry about that in the case of PyPy. Those kinds > of choices are made way before runtime, so there's no required > indirection. Even so, we're talking about machine-generated code rather than the sort of hand-crafting you need to get the best out of something critical like GC. There could still be room for inefficiencies. -- Greg From ironfroggy at gmail.com Wed Sep 20 04:54:07 2006 From: ironfroggy at gmail.com (Calvin Spealman) Date: Tue, 19 Sep 2006 22:54:07 -0400 Subject: [Python-3000] Kill GIL? In-Reply-To: <4510627D.2080704@v.loewis.de> References: <7008329d0609170528m5ca38011s292ce21396fa879@mail.gmail.com> <450D4AAE.2000805@gmail.com> <450E792C.1070105@gmail.com> <4510627D.2080704@v.loewis.de> Message-ID: <76fd5acf0609191954p3979ba48v89e520fdf8e3d124@mail.gmail.com> On 9/19/06, "Martin v. 
Löwis" wrote: > Nick Coghlan schrieb: > > I was thinking it would be easier to split out the Global Interpreter Lock and > > a per-interpreter Local Interpreter Lock, rather than trying to go to a full > > free-threading model. Anyone sharing other objects between interpreters would > > still need their own synchronisation mechanism, but something like > > threading.Queue should suffice for that. > > The challenge with that is "global" (i.e. across-interpreter) objects. > There are several of these: the obvious singletons (None, True, False), > the non-obvious singletons ((), -2..100 or so), and the extension module > globals (types, and in particular exceptions). > > Do you want them still to be global, or per-interpreter? > > Regards, > Martin It is one fixable problem among many, but fixable nonetheless. Any solution is going to break the API, but that should be allowed, especially for something as important as this. The obvious and non-obvious singletons don't represent much of a real problem, when you realize that you'll have to change the locking API anyway, at least to specify which interpreter's Local Interpreter Lock to operate on. Should you check every object to see what it is? No, so either don't have cross-interpreter globals, which doesn't save you much anyway, or add a lock pointer to all PyObject structs, which can be a single GIL, a LIL, or something else down the road. The API will need to add a parameter to the locking anyway, so the door is already open and singletons aren't getting in the way. From martin at v.loewis.de Wed Sep 20 08:02:30 2006 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Wed, 20 Sep 2006 08:02:30 +0200 Subject: [Python-3000] Kill GIL? In-Reply-To: <76fd5acf0609191954p3979ba48v89e520fdf8e3d124@mail.gmail.com> References: <7008329d0609170528m5ca38011s292ce21396fa879@mail.gmail.com> <450D4AAE.2000805@gmail.com> <450E792C.1070105@gmail.com> <4510627D.2080704@v.loewis.de> <76fd5acf0609191954p3979ba48v89e520fdf8e3d124@mail.gmail.com> Message-ID: <4510D976.1060600@v.loewis.de> Calvin Spealman schrieb: >> The challenge with that is "global" (i.e. across-interpreter) objects. >> There are several of these: the obvious singletons (None, True, False), >> the non-obvious singletons ((), -2..100 or so), and the extension module >> globals (types, and in particular exceptions). >> >> Do you want them still to be global, or per-interpreter? >> >> Regards, >> Martin > > It is one fixable problem among many, but fixable nonetheless. [...] Your message didn't really answer the question, did it? Regards, Martin From qrczak at knm.org.pl Wed Sep 20 08:25:52 2006 From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk) Date: Wed, 20 Sep 2006 08:25:52 +0200 Subject: [Python-3000] Delayed reference counting idea In-Reply-To: <451089D7.6060204@canterbury.ac.nz> (Greg Ewing's message of "Wed, 20 Sep 2006 12:22:47 +1200") References: <8764fl87j3.fsf@qrnik.zagroda> <422D230A-22E0-45D4-A5DC-267081CB8FEA@python.org> <87eju79aut.fsf@qrnik.zagroda> <878xkfc1tj.fsf@qrnik.zagroda> <451089D7.6060204@canterbury.ac.nz> Message-ID: <87hcz38367.fsf@qrnik.zagroda> Greg Ewing writes: > That doesn't cover the case where the "closing" action > you want includes freeing the memory occupied by the > object. The game I mentioned earlier is one of those -- > I don't need anything "closed", I just want the memory back. Why do you want to free memory at a particular point in time?
-- __("< Marcin Kowalczyk \__/ qrczak at knm.org.pl ^^ http://qrnik.knm.org.pl/~qrczak/ From greg.ewing at canterbury.ac.nz Wed Sep 20 10:53:20 2006 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Wed, 20 Sep 2006 20:53:20 +1200 Subject: [Python-3000] Delayed reference counting idea In-Reply-To: <87hcz38367.fsf@qrnik.zagroda> References: <8764fl87j3.fsf@qrnik.zagroda> <422D230A-22E0-45D4-A5DC-267081CB8FEA@python.org> <87eju79aut.fsf@qrnik.zagroda> <878xkfc1tj.fsf@qrnik.zagroda> <451089D7.6060204@canterbury.ac.nz> <87hcz38367.fsf@qrnik.zagroda> Message-ID: <45110180.4070807@canterbury.ac.nz> Marcin 'Qrczak' Kowalczyk wrote: > Why do you want to free memory at a particular point in time? I don't. However, I *do* want it freed by the time I need it again, and I *don't* want unpredictable pauses to catch up on backed-up memory-freeing, so that my animations run smoothly. -- Greg From ncoghlan at gmail.com Wed Sep 20 12:12:01 2006 From: ncoghlan at gmail.com (Nick Coghlan) Date: Wed, 20 Sep 2006 20:12:01 +1000 Subject: [Python-3000] Kill GIL? In-Reply-To: <4510627D.2080704@v.loewis.de> References: <7008329d0609170528m5ca38011s292ce21396fa879@mail.gmail.com> <450D4AAE.2000805@gmail.com> <450E792C.1070105@gmail.com> <4510627D.2080704@v.loewis.de> Message-ID: <451113F1.2030302@gmail.com> Martin v. Löwis wrote: > Nick Coghlan schrieb: >> I was thinking it would be easier to split out the Global Interpreter Lock and >> a per-interpreter Local Interpreter Lock, rather than trying to go to a full >> free-threading model. Anyone sharing other objects between interpreters would >> still need their own synchronisation mechanism, but something like >> threading.Queue should suffice for that. > > The challenge with that is "global" (i.e. across-interpreter) objects. > There are several of these: the obvious singletons (None, True, False), > the non-obvious singletons ((), -2..100 or so), and the extension module > globals (types, and in particular exceptions). > > Do you want them still to be global, or per-interpreter? The GIL would still exist - the idea would be that most threads would be spending most of their time holding only their local interpreter lock. Only when reading or writing the state shared between interpreters would a thread need to acquire the GIL. Alternatively, the GIL could be turned into a read/write lock instead of a basic mutex, with threads normally holding a read lock which they periodically release & reacquire (in case any other threads are trying to acquire it). The latter approach would probably give better performance (since you wouldn't need to be dropping and reacquiring the GIL in order to access the singleton objects). Cheers, Nick. P.S. Just to be clear, I don't think doing this would be *easy*, but unlike full free-threading, I think it is at least potentially workable.
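P.P.S. For the read/write lock variant, here's a rough pure-Python sketch of the intended semantics (only a sketch - the real thing would live in C, and this naive version lets a steady stream of readers starve a writer):

import threading

class ReadWriteLock(object):
    """Many concurrent readers, or one exclusive writer."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0

    def acquire_read(self):
        self._cond.acquire()
        self._readers += 1
        self._cond.release()

    def release_read(self):
        self._cond.acquire()
        self._readers -= 1
        if not self._readers:
            self._cond.notifyAll()   # wake any waiting writer
        self._cond.release()

    def acquire_write(self):
        self._cond.acquire()         # blocks new readers from entering
        while self._readers:
            self._cond.wait()        # existing readers drain out here

    def release_write(self):
        self._cond.release()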
-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
            http://www.boredomandlaziness.org

From qrczak at knm.org.pl  Wed Sep 20 12:57:59 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Wed, 20 Sep 2006 12:57:59 +0200
Subject: [Python-3000] Delayed reference counting idea
In-Reply-To: <45110180.4070807@canterbury.ac.nz> (Greg Ewing's message of "Wed, 20 Sep 2006 20:53:20 +1200")
References: <8764fl87j3.fsf@qrnik.zagroda>
	<422D230A-22E0-45D4-A5DC-267081CB8FEA@python.org>
	<87eju79aut.fsf@qrnik.zagroda> <878xkfc1tj.fsf@qrnik.zagroda>
	<451089D7.6060204@canterbury.ac.nz> <87hcz38367.fsf@qrnik.zagroda>
	<45110180.4070807@canterbury.ac.nz>
Message-ID: <87psdqztxk.fsf@qrnik.zagroda>

Greg Ewing writes:

>> Why do you want to free memory at a particular point of time?
>
> I don't. However, I *do* want it freed by the time I need it again,

As I said, the rate of GC depends on the rate of allocation. Unreachable
objects are collected when memory is needed for allocation.

> and I *don't* want unpredictable pauses to catch up on backed-up
> memory-freeing,

Incremental GC (e.g. in OCaml) has short pauses. It doesn't scan all
memory at once, but distributes the work among GC cycles.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From ncoghlan at gmail.com  Wed Sep 20 13:18:58 2006
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Wed, 20 Sep 2006 21:18:58 +1000
Subject: [Python-3000] Removing __del__
In-Reply-To: <20060919053609.vp8duwukq7sw4w48@login.werra.lunarpages.com>
References: <20060919053609.vp8duwukq7sw4w48@login.werra.lunarpages.com>
Message-ID: <451123A2.7040701@gmail.com>

Michael Chermside wrote:
> The following comments got me thinking:
>
> Raymond:
>> Statistics incontrovertibly prove that people who habitually
>> avoid __del__ lead happier lives and spend fewer hours in therapy ;-)
>
> Adam Olsen:
>> I agree here. I think an executor approach is much better; kill the
>> object, then make a weakref callback do any further cleanups using
>> copies it made in advance.
>
> What'cha think folks? I'd love to hear an opinion from someone who
> is a current user of __del__ -- I'm not.

How about an API change and a tweak to type.__call__, rather than
complete removal? I've re-used __del__ as the method name below, but a
different name would obviously work too.

1. __del__ would become an automatic static method (like __new__)

2. Make an addition to the end of type.__call__ along the lines of
(stealing from Jean-Paul's example):

    # sys.finalizers would just be a new global set in the sys module
    # that keeps the weakrefs alive until they are needed

    # In definition of type.__call__, after invoking __init__
    if hasattr(cls, '__del__'):
        finalizer = cls.__del__
        if hasattr(self, '__del_arg__'):
            finalizer_arg = self.__del_arg__
        else:
            # Create a class with the same instance attributes
            # as the original
            class attr_holder(object):
                pass
            finalizer_arg = attr_holder()
            finalizer_arg.__dict__ = self.__dict__
        def call_finalizer(ref):
            sys.finalizers.remove(ref)
            finalizer(finalizer_arg)
        sys.finalizers.add(weakref.ref(self, call_finalizer))

3. The __init__ method then simply needs to make sure that the right
argument is passed to __del__.
For example, if the object holds a reference to a file that needs to be
closed when the object goes away:

    class CloseFileOnDel(object):
        def __init__(self, fname):
            self.f = self.__del_arg__ = open(fname)

        def __del__(f):
            f.close()

Alternatively, the class could rely on the pseudo-self that is passed if
__del_arg__ isn't defined:

    class CloseFileOnDel(object):
        def __init__(self, fname):
            self.f = open(fname)

        def __del__(self_attrs):
            self_attrs.f.close()

The only way for __del__ to receive a reference to self is if the
finalizer argument had a reference to it - but that would mean the
object itself was not collectable, so __del__ wouldn't be called in the
first place.

That all seems too simple, though. Since we're talking about gc and
that's never simple, there has to be something wrong with the idea :)

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
            http://www.boredomandlaziness.org

From rhamph at gmail.com  Wed Sep 20 14:55:56 2006
From: rhamph at gmail.com (Adam Olsen)
Date: Wed, 20 Sep 2006 06:55:56 -0600
Subject: [Python-3000] How will unicode get used?
Message-ID:

Before we can decide on the internal representation of our unicode
objects, we need to decide on their external interface. My thoughts
so far:

* Most transformation and testing methods (.lower(), .islower(), etc)
  can be copied directly from 2.x. They require no special
  implementation to perform reasonably.

* Indexing and slicing is the big issue. Do we need constant-time
  integer slicing? .find() could be changed to return a token that
  could be used as a constant-time offset. Incrementing the token would
  have linear costs, but that's no big deal if the offsets are always
  small.

* Grapheme clusters, words, lines, other groupings, do we need/want
  ways to slice based on them too?

* Cheap slicing and concatenation (between O(1) and O(log(n))), do we
  want to support them? Now would be the time.

-- 
Adam Olsen, aka Rhamphoryncus

From krstic at solarsail.hcs.harvard.edu  Wed Sep 20 11:18:22 2006
From: krstic at solarsail.hcs.harvard.edu (Ivan Krstić)
Date: Wed, 20 Sep 2006 05:18:22 -0400
Subject: [Python-3000] Kill GIL?
In-Reply-To: <451061D7.4010105@v.loewis.de>
References: <7008329d0609170528m5ca38011s292ce21396fa879@mail.gmail.com>
	<450E34EF.3090202@solarsail.hcs.harvard.edu>
	<1cb725390609180119i548a5a65j841599633cba712f@mail.gmail.com>
	<451061D7.4010105@v.loewis.de>
Message-ID: <4511075E.6010101@solarsail.hcs.harvard.edu>

Martin v. Löwis wrote:
> The only exception would be if somebody offered a reasonable
> solution, which "we" would just have to incorporate (and possibly
> maintain, although it would be good if the original author would
> be around for a year or so).

I am interested in doing just this. I'm loath to spend time on it,
however, if it turns out that Guido still doesn't think multiprocessing
is a problem, or has a particular solution in mind. So once that clears
up, I'm happy to commit to a PEP, a reference implementation (if it can
be done purely in Python; if it involves diving into CPython, I'll
require assistance), and ongoing maintenance of the same for the
foreseeable future.

-- 
Ivan Krstić
| GPG: 0x147C722D

From mcherm at mcherm.com  Wed Sep 20 15:24:01 2006
From: mcherm at mcherm.com (Michael Chermside)
Date: Wed, 20 Sep 2006 06:24:01 -0700
Subject: [Python-3000] Removing __del__
Message-ID: <20060920062401.70tb7a65nxus0skg@login.werra.lunarpages.com>

Nick Coghlan writes:
[...proposes revision of __del__ rather than removal...]
> The only way for __del__ to receive a reference to self is if the
> finalizer argument had a reference to it - but that would mean the
> object itself was not
> collectable, so __del__ wouldn't be called in the first place.
>
> That all seems too simple, though. Since we're talking about gc and
> that's never simple, there has to be something wrong with the idea :)

Unfortunately you're right... this is all too simple. The existing
mechanism doesn't have a problem with __del__ methods that do not
participate in loops. For those that DO participate in loops I think
it's perfectly plausible for your __del__ to receive a reference to the
actual object being finalized.

Another problem (but less important as it's trivially fixable) is that
you're storing away the values that the object had when it was created,
perhaps missing out on things that got added or initialized later.

-- Michael Chermside

From krstic at solarsail.hcs.harvard.edu  Wed Sep 20 15:32:47 2006
From: krstic at solarsail.hcs.harvard.edu (Ivan Krstić)
Date: Wed, 20 Sep 2006 21:32:47 +0800
Subject: [Python-3000] Kill GIL? - to PEP 3099?
In-Reply-To:
References:
Message-ID: <451142FF.20203@solarsail.hcs.harvard.edu>

Jim Jewett wrote:
>> > Ivan: why don't you write a PEP about this?
>
>> I'd like to hear Guido's overarching thoughts on the matter, if any, and
>> would afterwards be happy to write a PEP.

The `this` and `the matter` referred not to removing the GIL, but
providing some form of sane multiprocessing support that doesn't require
everyone interested in MP to reinvent the wheel. The GIL situation and
Guido's position on it seem pretty clear to me, as I've tried to indicate
in prior messages.

-- 
Ivan Krstić | GPG: 0x147C722D

From rasky at develer.com  Wed Sep 20 15:36:48 2006
From: rasky at develer.com (Giovanni Bajo)
Date: Wed, 20 Sep 2006 15:36:48 +0200
Subject: [Python-3000] Removing __del__
References: <023701c6dc34$8a79dc50$a14c2597@bagio>
	<20060919224048.1717.886353737.divmod.quotient.54336@ohm>
Message-ID: <016801c6dcb9$d2915c40$e303030a@trilan>

Jean-Paul Calderone wrote:

>>> Since we're apparently still in "propose wild ideas" mode for Py3K
>>> I'd like to propose that for Py3K we remove __del__. Not "fix" it,
>>> not "tweak" it, just remove it and perhaps add a note in the manual
>>> pointing people to the weakref module.
>>
>> I don't use __del__ much. I use it only in leaf classes, where it
>> surely can't be part of loops. In those rare cases, it's very useful
>> to me. For instance, I have a small class which wraps an existing
>> handle-based C API exported to Python. Something along the lines of:
>>
>> class Wrapper:
>>     def __init__(self, *args):
>>         self.handle = CAPI.init(*args)
>>
>>     def __del__(self, *args):
>>         CAPI.close(self.handle)
>>
>>     def foo(self):
>>         CAPI.foo(self.handle)
>>
>> The real class isn't much longer than this (really). How do you
>> propose to write this same code without __del__?
>
> Untested, but roughly:
>
>     _weakrefs = []
>
>     def _cleanup(ref, handle):
>         _weakrefs.remove(ref)
>         CAPI.close(handle)
>
>     class BetterWrapper:
>         def __init__(self, *args):
>             handle = self.handle = CAPI.init(*args)
>             _weakrefs.append(
>                 weakref.ref(self,
>                     lambda ref: _cleanup(ref, handle)))
>
>         def foo(self):
>             CAPI.foo(self.handle)
>
> There are probably even better ways too, this is just the first that
> comes to mind.

Thanks for the example. Thus, I believe my example is a good use case
for __del__ with no good enough workaround, which was requested by
Michael in the original post. I believe that it would be a mistake to
remove __del__ unless we provide a graceful alternative (and I don't
consider the code above a graceful alternative). I still like the
__close__ method being proposed. I'd love to see a PEP for it.
-- 
Giovanni Bajo

From fredrik at pythonware.com  Wed Sep 20 15:38:59 2006
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Wed, 20 Sep 2006 15:38:59 +0200
Subject: [Python-3000] Kill GIL? - to PEP 3099?
In-Reply-To: <451142FF.20203@solarsail.hcs.harvard.edu>
References: <451142FF.20203@solarsail.hcs.harvard.edu>
Message-ID:

Ivan Krstić wrote:

> The `this` and `the matter` referred not to removing the GIL, but
> providing some form of sane multiprocessing support that doesn't require
> everyone interested in MP to reinvent the wheel.

no need to wait for Guido for this: adding library support for shared-
memory dictionaries/lists is a no-brainer. if you have experience in
this field, start hacking. I'll take care of the rest ;-)

From fredrik at pythonware.com  Wed Sep 20 16:02:28 2006
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Wed, 20 Sep 2006 16:02:28 +0200
Subject: [Python-3000] Kill GIL? - to PEP 3099?
In-Reply-To:
References: <451142FF.20203@solarsail.hcs.harvard.edu>
Message-ID: <451149F4.2040501@pythonware.com>

Fredrik Lundh wrote:

> no need to wait for Guido for this: adding library support for shared-
> memory dictionaries/lists is a no-brainer. if you have experience in
> this field, start hacking. I'll take care of the rest ;-)

and no need to wait for Python 3000 either, of course -- I see no
reason why this cannot go into some 2.X release.

From fredrik at pythonware.com  Wed Sep 20 16:03:39 2006
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Wed, 20 Sep 2006 16:03:39 +0200
Subject: [Python-3000] Kill GIL? - to PEP 3099?
In-Reply-To:
References: <451142FF.20203@solarsail.hcs.harvard.edu>
Message-ID:

Fredrik Lundh wrote:

> no need to wait for Guido for this: adding library support for shared-
> memory dictionaries/lists is a no-brainer. if you have experience in
> this field, start hacking. I'll take care of the rest ;-)

and you don't need to wait for Python 3000 either, of course -- if
done right, this would certainly fit into some future 2.X release.

From jcarlson at uci.edu  Wed Sep 20 17:50:25 2006
From: jcarlson at uci.edu (Josiah Carlson)
Date: Wed, 20 Sep 2006 08:50:25 -0700
Subject: [Python-3000] How will unicode get used?
In-Reply-To:
References:
Message-ID: <20060920083244.0817.JCARLSON@uci.edu>

"Adam Olsen" wrote:
> Before we can decide on the internal representation of our unicode
> objects, we need to decide on their external interface. My thoughts
> so far:

I believe the only option up for actual decision is what the internal
representation of a unicode object will be. Utf-8 that is never changed?
Utf-8 that is converted to ucs-2/4 on certain kinds of accesses?
Latin-1/ucs-2/ucs-4 depending on code point content?
Always ucs-2/4, depending on compiler switch?

> * Most transformation and testing methods (.lower(), .islower(), etc)
> can be copied directly from 2.x. They require no special
> implementation to perform reasonably.

A decoding variant of these would be required if the underlying
representation of a particular string is not latin-1, ucs-2, or ucs-4.

Further, any rstrip/split/etc. methods need to scan/parse the entire
string in order to discover code point starts/ends when using a utf-*
variant as an internal encoding (except for utf-32, which has a constant
width per character). Whether or not we choose to go with a varying
internal representation (the latin-1/ucs-2/ucs-4 variant I have been
suggesting), the questions below still need answers.

> * Indexing and slicing is the big issue. Do we need constant-time
> integer slicing? .find() could be changed to return a token that
> could be used as a constant-time offset. Incrementing the token would
> have linear costs, but that's no big deal if the offsets are always
> small.

If by "constant-time integer slicing" you mean "find the start and end
memory offsets of a slice in constant time", I would say yes.

Generally, I think tokens (in unicode strings) are a waste of time and
implementation. Giving each string a fixed-width per character allows
methods on those unicode strings to be far simpler in implementation.

> * Grapheme clusters, words, lines, other groupings, do we need/want
> ways to slice based on them too?

No.

> * Cheap slicing and concatenation (between O(1) and O(log(n))), do we
> want to support them? Now would be the time.

This would imply a tree-based string, which Guido has specifically
stated would not happen. Never mind that it would be a beast to
implement and maintain or that it would exclude the possibility for
offering the single-segment buffer interface, without reprocessing.

 - Josiah

From mcherm at mcherm.com  Wed Sep 20 18:27:56 2006
From: mcherm at mcherm.com (Michael Chermside)
Date: Wed, 20 Sep 2006 09:27:56 -0700
Subject: [Python-3000] Removing __del__
Message-ID: <20060920092756.eng2b1ul40mtc8sg@login.werra.lunarpages.com>

Giovanni Bajo writes:
> I believe my example is a good use case for __del__ with no good
> enough workaround, which was requested by Michael in the original post. I
> believe that it would be a mistake to remove __del__ unless we provide a
> graceful alternative (and I don't consider the code above a graceful
> alternative). I still like the __close__ method being proposed.

Thank you! This is exactly the kind of discussion that I was hoping to
engender. Let me see if I can make the case a little more effectively.

First of all, let me clean up Jean-Paul's solution a little bit so it
looks prettier when used. Let's put the following code into a module:

----- deletions.py -----
import weakref

# Maintain a separate list so the weakrefs themselves
# won't be garbage collected.
on_del_callbacks = []

def on_del_invoke(obj, func, *args, **kwargs):
    """This sets up a callback to be executed when an object
    is finalized. It is similar to the old __del__ method but
    without some of the risks and limitations of that method.

    The first argument is an object to watch; the second is a
    callable. After the object being watched gets finalized,
    the callable will be invoked; arguments for this call
    can be provided after the callable.
    Please note that the callable must not be a bound method
    of the object being watched, and the object being watched
    must not be (or be referred to by) one of the arguments
    or else the object will never be garbage collected."""
    def callback(ref):
        on_del_callbacks.remove(ref)
        func(*args, **kwargs)
    on_del_callbacks.append(
        weakref.ref(obj, callback))
--- end deletions.py ---

Performance could be improved in minor ways (avoiding the O(n) lookup
cost in the remove() call; avoiding the need for a separate function
object for each callback; catching obvious loops and raising an
exception immediately to make it more newbie-friendly), but this will
do for discussion.

Using this, your original code:

> class Wrapper:
>     def __init__(self, *args):
>         self.handle = CAPI.init(*args)
>
>     def __del__(self, *args):
>         CAPI.close(self.handle)
>
>     def foo(self):
>         CAPI.foo(self.handle)

becomes this code:

    from deletions import on_del_invoke

    class Wrapper:
        def __init__(self, *args):
            self.handle = CAPI.init(*args)
            on_del_invoke(self, CAPI.close, self.handle)

        def foo(self):
            CAPI.foo(self.handle)

It's actually *fewer* lines this way, and I find it quite readable.
Furthermore, unlike the __del__ version it doesn't break as soon as
someone accidentally puts a Wrapper object into a loop.

Working from this example, I'm not convinced that the price of giving
up __del__ is really all that high. (But please, find another example
to convince me!) On the other side of the scales, here are some
benefits that we gain if we get rid of __del__:

 * Simpler GC code which is less likely to have obscure
   bugs that are incredibly difficult to track down. Less
   core developer time spent maintaining complex, fragile
   code.

 * No need to explain about keeping __del__ objects[1] out
   of reference loops. In exchange, we choose to explain
   about not passing the object being monitored or
   anything that links to it as arguments to on_del_invoke.
   I find that preferable because: (1) it seems more
   intuitive to me that the callback mustn't reference the
   object being finalized, (2) it requires reasoning about
   the call-site, not about all future uses of the object,
   and (3) if the programmer violates this rule then the
   disadvantage is that the objects become immortal -- which
   is true for ALL __del__ objects in loops today.

 * Programmers no longer have the ability to allow __del__
   to resurrect the object being finalized. Technically,
   that's a disadvantage, not an advantage, but I honestly
   don't think anyone believes it's a good idea to write
   __del__ methods that resurrect the object, so I'm happy
   to lose that ability.

-- Michael Chermside

[1] - I'm using "__del__ object" to mean an object that has
a __del__ method.

From guido at python.org  Wed Sep 20 18:40:49 2006
From: guido at python.org (Guido van Rossum)
Date: Wed, 20 Sep 2006 09:40:49 -0700
Subject: [Python-3000] How will unicode get used?
In-Reply-To:
References:
Message-ID:

On 9/20/06, Adam Olsen wrote:
> Before we can decide on the internal representation of our unicode
> objects, we need to decide on their external interface. My thoughts
> so far:

Let me cut this short. The external string API in Py3k should not
change or only very marginally so (like removing rarely used useless
APIs or adding a few new conveniences). The plan is to keep the 2.x
API that is supported (in 2.x) by both str and unicode, but merge the
two string types into one. Anything else could be done just as easily
before or after Py3k.

OTOH, if you want to start to gather requirements for the bytes API,
now is the time.
-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From mcherm at mcherm.com  Wed Sep 20 18:48:10 2006
From: mcherm at mcherm.com (Michael Chermside)
Date: Wed, 20 Sep 2006 09:48:10 -0700
Subject: [Python-3000] Delayed reference counting idea
Message-ID: <20060920094810.2lpvw6n4pk4k44wc@login.werra.lunarpages.com>

Greg Ewing writes:
> A worry about that is whether the architecture required to
> allow pluggable GC implementations introduces inefficiencies
> of its own that would skew the results.

Bob Ippolito writes:
> There's no need to worry about that in the case of PyPy. Those kinds
> of choices are made way before runtime, so there's no required
> indirection.

Someone who knows PyPy better than me should feel free to chime in if
I get things wrong, but I *think* that it happens well before runtime,
well before compile-time even, at a point more equivalent to "the time
at which the interpreter is compiled". So if you have PyPy set up to
compile to C and use reference counting GC, then it generates calls to
INCR and DECR before and after variable accesses, but if you have it
set up to compile to LLVM, which has its own tracing GC, then it
doesn't generate anything before and after variable accesses.

Greg again:
> Even so, we're talking about machine-generated code rather
> than the sort of hand-crafting you need to get the best
> out of something critical like GC. There could still be
> room for inefficiencies.

Quite true. As further illustration: Python's GC is written in C, and
thus you can't get the kind of efficiency you might get out of
hand-crafted assembly. Unless, of course, the machine generating the
code is actually smarter about optimization than the hand that's
crafting it, or the two are close enough in performance that we don't
mind.

I don't think PyPy has anything to teach us about GC performance *yet*,
but I think their approach is quite promising as a platform for running
this kind of experiment.

-- Michael Chermside

From jimjjewett at gmail.com  Wed Sep 20 19:09:14 2006
From: jimjjewett at gmail.com (Jim Jewett)
Date: Wed, 20 Sep 2006 13:09:14 -0400
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <20060920083244.0817.JCARLSON@uci.edu>
References: <20060920083244.0817.JCARLSON@uci.edu>
Message-ID:

On 9/20/06, Josiah Carlson wrote:
> "Adam Olsen" wrote:
> > Before we can decide on the internal representation of our unicode
> > objects, we need to decide on their external interface. My thoughts
> > so far:

> I believe the only option up for actual decision is what the internal
> representation of a unicode object will be.

If I request string[4:7], what (format of string) will come back?

The same format as the original string?
A canonical format?
The narrowest possible for the new string?

When a recoding occurs, is that in addition to the original format, or
instead of? (I think "in addition" would be useful, as we're likely
to need that original format back for output -- but it does waste
space when we don't need the original again.)

> Further, any rstrip/split/etc. methods need to scan/parse the entire
> string in order to discover code point starts/ends when using a utf-*
> variant as an internal encoding (except for utf-32, which has a constant
> width per character).

No. That is true of some encodings, but not the UTF variants. A byte
(or double-byte, for UTF-16) is unambiguous.
Within a specific encoding, each possible (byte or double-byte) value
represents at most one of:

    a complete value
    the start of a multi-position value
    the continuation of a multi-position value

That said, string[47:-34] may need to parse the whole string, just to
count double-position characters. (To be honest, I'm not sure even
then; for UTF-16 it might make sense to treat surrogates as
double-width characters. Even for UTF-8, there might be a workaround
that speeds up the majority of strings.)

> Giving each string a fixed-width per character allows
> methods on those unicode strings to be far simpler in implementation.

Which is why that was done in Py 2K. The question for Py3K is:

    Should we *commit* to this particular representation and allow
    direct access to the internals?

    Or should we treat the internals as opaque, and allow more
    efficient representations if someone wants to write one?

Today, I can go ahead and write my own string representation, but if I
change the internal storage, I can't actually use it with most
compiled extensions.

> > * Grapheme clusters, words, lines, other groupings, do we need/want
> > ways to slice based on them too?

> No.

I assume that you don't really mean strings will stop supporting split()

> > * Cheap slicing and concatenation (between O(1) and O(log(n))), do we
> > want to support them? Now would be the time.

> This would imply a tree-based string,

Cheap slicing wouldn't. Cheap concatenation in *all* cases would.
Cheap concatenation in a few lucky cases wouldn't.

> it would exclude the possibility for
> offering the single-segment buffer interface, without reprocessing.

I'm not sure exactly what you mean here. If you just mean "C code
can't get at the internals without warning", then that is true.

It is also true that any function requesting the internals would need
to either get the encoding along with it, or work with bytes.

If the C code wants that buffer in a specific encoding, it will have
to request that, which might well require reprocessing. But if so,
then this recoding already happens today -- it is just that today, we
do it for every string, instead of only the ones that need it. (But
today, the recoding happens earlier, which can be better for
debugging.)

-jJ

From rhamph at gmail.com  Wed Sep 20 19:47:39 2006
From: rhamph at gmail.com (Adam Olsen)
Date: Wed, 20 Sep 2006 11:47:39 -0600
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <20060920083244.0817.JCARLSON@uci.edu>
References: <20060920083244.0817.JCARLSON@uci.edu>
Message-ID:

On 9/20/06, Josiah Carlson wrote:
>
> "Adam Olsen" wrote:
> > Before we can decide on the internal representation of our unicode
> > objects, we need to decide on their external interface. My thoughts
> > so far:
>
> I believe the only option up for actual decision is what the internal
> representation of a unicode object will be. Utf-8 that is never changed?
> Utf-8 that is converted to ucs-2/4 on certain kinds of accesses?
> Latin-1/ucs-2/ucs-4 depending on code point content? Always ucs-2/4,
> depending on compiler switch?

Just a minor nit. I doubt we could accept UCS-2, we'd want UTF-16
instead, with all the variable-width goodness that brings in.

Or maybe not so minor. Old versions of windows used UCS-2, new versions
use UTF-16. The former should get errors if too high of a character is
used, the latter will need conversion if we're not using UTF-16.

> > * Most transformation and testing methods (.lower(), .islower(), etc)
> > can be copied directly from 2.x.
> > They require no special
> > implementation to perform reasonably.
>
> A decoding variant of these would be required if the underlying
> representation of a particular string is not latin-1, ucs-2, or ucs-4.

That makes no sense. They can operate on any encoding we design them
to. The cost is always O(n) with the length of the string.

> Further, any rstrip/split/etc. methods need to scan/parse the entire
> string in order to discover code point starts/ends when using a utf-*
> variant as an internal encoding (except for utf-32, which has a constant
> width per character).

See below.

> Whether or not we choose to go with a varying internal representation
> (the latin-1/ucs-2/ucs-4 variant I have been suggesting),
>
> > * Indexing and slicing is the big issue. Do we need constant-time
> > integer slicing? .find() could be changed to return a token that
> > could be used as a constant-time offset. Incrementing the token would
> > have linear costs, but that's no big deal if the offsets are always
> > small.
>
> If by "constant-time integer slicing" you mean "find the start and end
> memory offsets of a slice in constant time", I would say yes.
>
> Generally, I think tokens (in unicode strings) are a waste of time and
> implementation. Giving each string a fixed-width per character allows
> methods on those unicode strings to be far simpler in implementation.

    s = 'foobar'
    p = s[s.find('bar'):] == 'bar'

Even if .find() is made to return a token, rather than an integer, the
behavior and performance of this example are unchanged.

However, I can imagine there might be use cases, such as the .find()
output on one string being used to slice a different string, which
tokens wouldn't support. I haven't been able to dream up any sane
examples, which is why I asked about it here. I want to see specific
examples showing that tokens won't work.

Using only utf-8 would be simpler than three distinct representations.
And if memory usage is an issue (which it seems to be, albeit in a
vague way), we could make a custom encoding that's even simpler and
more space efficient than utf-8.

> > * Grapheme clusters, words, lines, other groupings, do we need/want
> > ways to slice based on them too?
>
> No.

Can you explain your reasoning?

> > * Cheap slicing and concatenation (between O(1) and O(log(n))), do we
> > want to support them? Now would be the time.
>
> This would imply a tree-based string, which Guido has specifically
> stated would not happen. Never mind that it would be a beast to
> implement and maintain or that it would exclude the possibility for
> offering the single-segment buffer interface, without reprocessing.

The only reference I found was this:
http://mail.python.org/pipermail/python-3000/2006-August/003334.html

I interpret that as him being very sceptical, not an outright refusal.

Allowing external code to operate on a python string in-place seems
tenuous at best. Even with three types (Latin-1, UCS-2, UCS-4) you
would need to automatically copy and convert if the wrong type is given.

-- 
Adam Olsen, aka Rhamphoryncus

From rhamph at gmail.com  Wed Sep 20 20:20:13 2006
From: rhamph at gmail.com (Adam Olsen)
Date: Wed, 20 Sep 2006 12:20:13 -0600
Subject: [Python-3000] How will unicode get used?
In-Reply-To:
References:
Message-ID:

On 9/20/06, Guido van Rossum wrote:
> On 9/20/06, Adam Olsen wrote:
> > Before we can decide on the internal representation of our unicode
> > objects, we need to decide on their external interface. My thoughts
> > so far:
>
> Let me cut this short.
> The external string API in Py3k should not
> change or only very marginally so (like removing rarely used useless
> APIs or adding a few new conveniences). The plan is to keep the 2.x
> API that is supported (in 2.x) by both str and unicode, but merge the
> two string types into one. Anything else could be done just as easily
> before or after Py3k.

Thanks, but one thing remains unclear: is the indexing intended to
represent bytes, code points, or code units? Note that C code
operating on UTF-16 would use code units for slicing of UTF-16, which
splits surrogate pairs.

As far as I can tell, CPython on windows uses UTF-16 with code units.
Perhaps not intentionally, but by default (not throwing an error on
surrogates).

For those trying to make sense of this, a Code Point is anything in the
0 to 0x10FFFF range. A Code Unit goes up to 0xFF for UTF-8, 0xFFFF for
UTF-16, and 0xFFFFFFFF for UTF-32. One or more code units may be needed
to form a single code point. Obviously code units expose our internal
implementation choice.

-- 
Adam Olsen, aka Rhamphoryncus

From brett at python.org  Wed Sep 20 20:30:28 2006
From: brett at python.org (Brett Cannon)
Date: Wed, 20 Sep 2006 11:30:28 -0700
Subject: [Python-3000] How will unicode get used?
In-Reply-To:
References:
Message-ID:

On 9/20/06, Adam Olsen wrote:
>
> On 9/20/06, Guido van Rossum wrote:
> > On 9/20/06, Adam Olsen wrote:
> > > Before we can decide on the internal representation of our unicode
> > > objects, we need to decide on their external interface. My thoughts
> > > so far:
> >
> > Let me cut this short. The external string API in Py3k should not
> > change or only very marginally so (like removing rarely used useless
> > APIs or adding a few new conveniences). The plan is to keep the 2.x
> > API that is supported (in 2.x) by both str and unicode, but merge the
> > two string types into one. Anything else could be done just as easily
> > before or after Py3k.
>
> Thanks, but one thing remains unclear: is the indexing intended to
> represent bytes, code points, or code units? Note that C code
> operating on UTF-16 would use code units for slicing of UTF-16, which
> splits surrogate pairs.

Assuming my Unicode lingo is right and code point represents a
letter/character/digraph/whatever, then it will be a code point. Doing
one of my rare channels of Guido, I *really* doubt he wants to expose
the technical details of Unicode to the point of having people need to
realize that UTF-8 takes two bytes to represent "?". If you want that
kind of exposure, use the bytes type. Otherwise assume the usage will
be by people ignorant of Unicode who just want something that will work
the way they are used to when compared to working in ASCII.

-Brett
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.python.org/pipermail/python-3000/attachments/20060920/a71a932e/attachment.htm

From guido at python.org  Wed Sep 20 20:32:04 2006
From: guido at python.org (Guido van Rossum)
Date: Wed, 20 Sep 2006 11:32:04 -0700
Subject: [Python-3000] How will unicode get used?
In-Reply-To:
References:
Message-ID:

On 9/20/06, Adam Olsen wrote:
> On 9/20/06, Guido van Rossum wrote:
> > On 9/20/06, Adam Olsen wrote:
> > > Before we can decide on the internal representation of our unicode
> > > objects, we need to decide on their external interface. My thoughts
> > > so far:
> >
> > Let me cut this short.
> > The external string API in Py3k should not
> > change or only very marginally so (like removing rarely used useless
> > APIs or adding a few new conveniences). The plan is to keep the 2.x
> > API that is supported (in 2.x) by both str and unicode, but merge the
> > two string types into one. Anything else could be done just as easily
> > before or after Py3k.
>
> Thanks, but one thing remains unclear: is the indexing intended to
> represent bytes, code points, or code units?

I don't see what's unclear -- the existing unicode object does what it
does.

> Note that C code
> operating on UTF-16 would use code units for slicing of UTF-16, which
> splits surrogate pairs.

I thought we were discussing the Python API.

C code will likely have the same access to unicode objects as it has in
2.x.

> As far as I can tell, CPython on windows uses UTF-16 with code units.
> Perhaps not intentionally, but by default (not throwing an error on
> surrogates).

This is intentional, to be compatible with the rest of that platform.
Jython and IronPython do this too I believe.

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From rhamph at gmail.com  Wed Sep 20 20:43:03 2006
From: rhamph at gmail.com (Adam Olsen)
Date: Wed, 20 Sep 2006 12:43:03 -0600
Subject: [Python-3000] How will unicode get used?
In-Reply-To:
References:
Message-ID:

On 9/20/06, Guido van Rossum wrote:
> On 9/20/06, Adam Olsen wrote:
> > On 9/20/06, Guido van Rossum wrote:
> > > On 9/20/06, Adam Olsen wrote:
> > > > Before we can decide on the internal representation of our unicode
> > > > objects, we need to decide on their external interface. My thoughts
> > > > so far:
> > >
> > > Let me cut this short. The external string API in Py3k should not
> > > change or only very marginally so (like removing rarely used useless
> > > APIs or adding a few new conveniences). The plan is to keep the 2.x
> > > API that is supported (in 2.x) by both str and unicode, but merge the
> > > two string types into one. Anything else could be done just as easily
> > > before or after Py3k.
> >
> > Thanks, but one thing remains unclear: is the indexing intended to
> > represent bytes, code points, or code units?
>
> I don't see what's unclear -- the existing unicode object does what it
> does.

The existing unicode object doesn't expose the difference between them
except when UTF-16 is used and surrogates exist.

> > Note that C code
> > operating on UTF-16 would use code units for slicing of UTF-16, which
> > splits surrogate pairs.
>
> I thought we were discussing the Python API.
>
> C code will likely have the same access to unicode objects as it has in
> 2.x.

I only mentioned it because C doesn't mind exposing the internal
details for performance benefits, whereas Python usually does mind.

> > As far as I can tell, CPython on windows uses UTF-16 with code units.
> > Perhaps not intentionally, but by default (not throwing an error on
> > surrogates).
>
> This is intentional, to be compatible with the rest of that platform.
> Jython and IronPython do this too I believe.

So you're saying we should use code units?! Or are you referring to
the choice of UTF-16? I would expect us to use code points in 3.x,
but that's not how it is in 2.x.
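To make the 2.x behaviour concrete, here's a small check that runs on
either kind of build (just an illustration; sys.maxunicode is how you
tell the builds apart):

    import sys

    s = u'A' + u'\U00010143'  # 'A' plus one non-BMP code point

    if sys.maxunicode == 0xFFFF:
        # narrow build: indexing and len() work on UTF-16 code units,
        # so the single code point shows up as a surrogate pair
        assert len(s) == 3
        broken = s[:2]                 # slices between the surrogates,
        assert broken[1] == u'\ud800'  # leaving half a character behind
    else:
        # wide build (UCS-4): one code point per index position
        assert len(s) == 2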
-- 
Adam Olsen, aka Rhamphoryncus

From jimjjewett at gmail.com  Wed Sep 20 21:04:23 2006
From: jimjjewett at gmail.com (Jim Jewett)
Date: Wed, 20 Sep 2006 15:04:23 -0400
Subject: [Python-3000] Removing __del__
In-Reply-To: <20060920092756.eng2b1ul40mtc8sg@login.werra.lunarpages.com>
References: <20060920092756.eng2b1ul40mtc8sg@login.werra.lunarpages.com>
Message-ID:

On 9/20/06, Michael Chermside wrote:
> Giovanni Bajo writes:
> > I believe my example is a good use case for __del__ with no good
> > enough workaround, ... I still like the __close__ method being proposed.

[Michael asks about this alternative]
...
> def on_del_invoke(obj, func, *args, **kwargs):
...
>     Please note that the callable must not be a bound method
>     of the object being watched, and the object being watched
>     must not be (or be referred to by) one of the arguments
>     or else the object will never be garbage collected."""

By far the most frequently desired callable is self.close.

You can work around this with a wrapper, by setting self.f = open(...)
and then passing self.f.close -- but with this API, I'll be wondering
why I can't just register self.f as the object in the first place.

If bound methods did not increment the refcount, this would work, but I
imagine it would break various GUI and event-processing idioms. A
special rebind-this-method-weakly builtin would work, but I'm not sure
that is any simpler than __close__. (~= __del__ but cycles can be
broken in an arbitrary order)

> Using this, your original code:

> > class Wrapper:
> >     def __init__(self, *args):
> >         self.handle = CAPI.init(*args)
> >
> >     def __del__(self, *args):
> >         CAPI.close(self.handle)
> >
> >     def foo(self):
> >         CAPI.foo(self.handle)

> becomes this code:

>     from deletions import on_del_invoke
>
>     class Wrapper:
>         def __init__(self, *args):
>             self.handle = CAPI.init(*args)
>             on_del_invoke(self, CAPI.close, self.handle)
>
>         def foo(self):
>             CAPI.foo(self.handle)

Note that the wrapper (as posted) does nothing except store a pointer
to the CAPI object and then delegate to it. With a __close__ method,
this class could reduce to (at most)

    class MyCAPI(CAPI):
        __close__ = CAPI.close

Since the CAPI class could use the __close__ convention directly, the
wrapper could be eliminated entirely. (In real life, his class might do
more ... but if so, then *these* lines are still boilerplate that it
would be good to remove).

> On the other side of the scales, here are some benefits that
> we gain if we get rid of __del__:

> * No need to explain about keeping __del__ objects[1] out
>   of reference loops. In exchange, we choose to explain
>   about not passing the object being monitored or
>   anything that links to it as arguments to on_del_invoke.

Adding an extra wrapper just to avoid passing self isn't really any
better than adding an extra cleanup object hanging off an attribute to
avoid loops. So the explanation might be better, but the resulting code
would end up using the same workarounds that are recommended (but often
not used) today.

> (3) if the programmer violates this rule then the
> disadvantage is that the objects become immortal -- which
> is true for ALL __del__ objects in loops today.

But most objects are not in a __del__ loop. By passing a bound method,
the user makes the object immortal even if it is the only object that
needs cleanup.

> * Programmers no longer have the ability to allow __del__
>   to resurrect the object being finalized. Technically,
Technically, > that's a disadvantage, not an advantage, but I honestly > don't think anyone believes it's a good idea to write > __del__ methods that resurect the object, so I'm happy > to lose that ability. How do you feel about the __del__ in stdlib subprocess.Popen (about line 615)? This resurrects itself, in order to finish waiting for the child process. If the child isn't done yet, then it will check again the next time a new Popen is created (or at final closedown). Without this ability to reschedule itself, it would have to do a blocking wait, which might put some odd pressures on concurrency. (And note that if it needed to revive (not recreate, revive) subobjects, it would need the full immortal-cycle power of today's __del__. It may be valid not to support this case, but it isn't automatically bad usage.) -jJ From jimjjewett at gmail.com Wed Sep 20 22:59:22 2006 From: jimjjewett at gmail.com (Jim Jewett) Date: Wed, 20 Sep 2006 16:59:22 -0400 Subject: [Python-3000] How will unicode get used? In-Reply-To: References: Message-ID: On 9/20/06, Guido van Rossum wrote: > On 9/20/06, Adam Olsen wrote: > > On 9/20/06, Guido van Rossum wrote: > > > Let me cut this short. The external string API in Py3k should not > > > change or only very marginally so (like removing rarely used useless > > > APIs or adding a few new conveniences). ... > I thought we were discussing the Python API. I don't think anyone has proposed much change to strings *as seen from python*. At most, there has been an implicit suggestion that the bytes.decode().encode() dance be shortened. > C code will likely have the same access to unicode objects as it has in 2.x. Can C code still assume that (1) the data buffer will always be available for any sort of direct manipulation (including mutation) (2) in a specific canonical encoding (3) directly from the memory layout, without calling a "prepare" or "recode" or "encode" method first. Today, that canonical encoding is a compile-time choice, and any specific choice causes integration hassles. Unless the choice matches the system default for text, it also requires many decode/encode round trips that might otherwise be avoided. The proposed changes mostly boil down to removing the third assumption, and agreeing that some implementations might delay the decode-to-canonical-format until it was needed. Rough Summary of new C API restrictions: Replace ((PyStringObject *)string).ob_sval /* supported today */ with PyString_AsString(string) /* already recommended */ or replace ((PyUnicodeObject *)string)->str /* supported today */ and ((PyUnicodeObject *)string)->defenc /* supported today */ with PyUnicode_AsEncodedString(PyObject *unicode, /* already recommended */ const char *encoding, const char *errors) and PyUnicode_AsAnyString(PyObject *unicode, /* new */ char **encoding, /* return the actual encoding */ const char *errors) Also note that some macros would need to become functions. The most prominent is PyUnicode_AS_DATA(string) /* supports mutation */ -jJ From jcarlson at uci.edu Wed Sep 20 23:20:22 2006 From: jcarlson at uci.edu (Josiah Carlson) Date: Wed, 20 Sep 2006 14:20:22 -0700 Subject: [Python-3000] How will unicode get used? In-Reply-To: References: <20060920083244.0817.JCARLSON@uci.edu> Message-ID: <20060920135601.0822.JCARLSON@uci.edu> "Jim Jewett" wrote: > On 9/20/06, Josiah Carlson wrote: > > > "Adam Olsen" wrote: > > > Before we can decide on the internal representation of our unicode > > > objects, we need to decide on their external interface. 
My thoughts > > > so far: > > > I believe the only options up for actual decision is what the internal > > representation of a unicode object will be. > > If I request string[4:7], what (format of string) will come back? > > The same format as the original string? > A canonical format? > The narrowest possible for the new string? Which of the three depend on the choice of internal representation. If the internal representation is always canonical, narrowest, or same as the original string, then it would be one of those. > When a recoding occurs, is that in addition to the original format, or > instead of? (I think "in addition" would be useful, as we're likely > to need that original format back for output -- but it does waste > space when we don't need the original again.) The current implementation, I believe, uses "in addition", unless I'm misreading the unicode string struct. > > Further, any rstrip/split/etc. methods need to scan/parse the entire > > string in order to discover code point starts/ends when using a utf-* > > variant as an internal encoding (except for utf-32, which has a constant > > width per character). > > No. That is true of some encodings, but not the UTF variants. A byte > (or double-byte, for UTF-16) is unambiguous. I was under the impression that utf-8 was a particular kind of prefix encoding. Looking at the actual output of utf-8, I notice that the encodings are such that bytes with value >= 0xc0 define the beginning of the multi-character encodings, so handling 'from the front' or 'from the back' are equivalently as reasonable. > That said, string[47:-34] may need to parse the whole string, just to > count double-position characters. (To be honest, I'm not sure even > then; for UTF-16 it might make sense to treat surrogates as > double-width characters. Even for UTF-8, there might be a workaround > that speeds up the majority of strings.) It would involve keeping some sort of cache of indices/offset values. This may not be worthwhile. > > Giving each string a fixed-width per character allows > > methods on those unicode strings to be far simpler in implementation. > > Which is why that was done in Py 2K. The question for Py3K is > > Should we *commit* to this particular representation and allow > direct access to the internals? Why not? > Or should we treat the internals as opaque, and allow more > efficient representations if someone wants to write one. I'm not sure that the efficiencies are necessarily desireable. > Today, I can go ahead and write my own string representation, but if I > change the internal storage, I can't actually use it with most > compiled extensions. Right, but extensions that are used *right now* would need to be rewritten to handle these "more efficient" representations. > > > * Grapheme clusters, words, lines, other groupings, do we need/want > > > ways to slice based on them too? > > > No. > > I assume that you don't really mean strings will stop supporting split() That would be silly. What I meant was that text.word[7], text.line[3], etc., shouldn't mean anything on the base implementation. > > > * Cheap slicing and concatenation (between O(1) and O(log(n))), do we > > > want to support them? Now would be the time. > > > This would imply a tree-based string, > > Cheap slicing wouldn't. O(logn) would imply a tree-based string. O(1) would imply slicing on text returning views (which I'm not even advocating, and I'm a view proponent). > Cheap concatenation in *all* cases would. > Cheap concatenation in a few lucky cases wouldn't. 
Presumably one would need to copy data from one to the other, so that
would be O(n) with a non-tree version.

> > it would exclude the possibility for
> > offering the single-segment buffer interface, without reprocessing.
>
> I'm not sure exactly what you mean here. If you just mean "C code
> can't get at the internals without warning", then that is true.

The single-segment buffer interface is, not uncommonly, how C extensions
get at the content of strings, unicode, array, mmap, etc. Technically
speaking, the current implementation of str and unicode uses an internal
variant to gain access to their own internals for processing.

> It is also true that any function requesting the internals would need
> to either get the encoding along with it, or work with bytes.

Or code points... The point of specifying the character width as 1, 2,
or 4 bytes would be that one can iterate over chars, shorts, or ints.

> If the C code wants that buffer in a specific encoding, it will have
> to request that, which might well require reprocessing. But if so,
> then this recoding already happens today -- it is just that today, we
> do it for every string, instead of only the ones that need it. (But
> today, the recoding happens earlier, which can be better for
> debugging.)

Indeed. But it's not just for C extensions, it's for Python's own
string/unicode internals. Simple is better than complex. Having a flat
array-based implementation is simple, and allows us to re-use the vast
majority of code we already have.

 - Josiah

From jcarlson at uci.edu  Wed Sep 20 23:59:22 2006
From: jcarlson at uci.edu (Josiah Carlson)
Date: Wed, 20 Sep 2006 14:59:22 -0700
Subject: [Python-3000] How will unicode get used?
In-Reply-To:
References: <20060920083244.0817.JCARLSON@uci.edu>
Message-ID: <20060920142332.0825.JCARLSON@uci.edu>

"Adam Olsen" wrote:
>
> On 9/20/06, Josiah Carlson wrote:
> >
> > "Adam Olsen" wrote:
> > > Before we can decide on the internal representation of our unicode
> > > objects, we need to decide on their external interface. My thoughts
> > > so far:
> >
> > I believe the only option up for actual decision is what the internal
> > representation of a unicode object will be. Utf-8 that is never changed?
> > Utf-8 that is converted to ucs-2/4 on certain kinds of accesses?
> > Latin-1/ucs-2/ucs-4 depending on code point content? Always ucs-2/4,
> > depending on compiler switch?
>
> Just a minor nit. I doubt we could accept UCS-2, we'd want UTF-16
> instead, with all the variable-width goodness that brings in.

If we are opting for a *single* internal representation, then UTF-16 or
UTF-32 are really the only options.

> > > * Most transformation and testing methods (.lower(), .islower(), etc)
> > > can be copied directly from 2.x. They require no special
> > > implementation to perform reasonably.
> >
> > A decoding variant of these would be required if the underlying
> > representation of a particular string is not latin-1, ucs-2, or ucs-4.
>
> That makes no sense. They can operate on any encoding we design them
> to. The cost is always O(n) with the length of the string.

I was thinking .startswith() and .endswith(), but assuming *some*
canonical representation (UTF-16, UTF-32, etc.) this is trivial to
implement. I take back my concerns on this particular point.

> > Whether or not we choose to go with a varying internal representation
> > (the latin-1/ucs-2/ucs-4 variant I have been suggesting),
>
> > > * Indexing and slicing is the big issue.
.find() could be changed to return a token that > > > could be used as a constant-time offset. Incrementing the token would > > > have linear costs, but that's no big deal if the offsets are always > > > small. > > > > If by "constant-time integer slicing" you mean "find the start and end > > memory offsets of a slice in constant time", I would say yes. > > > > Generally, I think tokens (in unicode strings) are a waste of time and > > implementation. Giving each string a fixed-width per character allows > > methods on those unicode strings to be far simpler in implementation. > > However, I can imagine there might be use cases, such as the .find() > output on one string being used to slice a different string, which > tokens wouldn't support. I haven't been able to dream up any sane > examples, which is why I asked about it here. I want to see specific > examples showing that tokens won't work. p = s[6:-6] Or even in actual code I use today: p = s.lstrip() lil = len(s) - len(p) si = s[:lil] lil += si.count('\t')*(self.GetTabWidth()-1) #s is the original line #p is the line without leading indentation #si is the line indentation characters #lil is the indentation of the line in columns If I can't slice based on character index, then we end up with a similar situation that the wxPython StyledTextCtrl runs into right now: the content is encoded via utf-8 internally, so users have to use the fairly annoying PositionBefore(pos) and PositionAfter(pos) methods to discover where characters start/end. While it is possible to handle everything this way, it is *damn annoying*, and some users have gone so far as to say that it *doesn't work* for Europeans. While I won't make the claim that it *doesn't work*, it is a pain in the ass. > Using only utf-8 would be simpler than three distinct representations. > And if memory usage is an issue (which it seems to be, albeit in a > vague way), we could make a custom encoding that's even simpler and > more space efficient than utf-8. One of the reasons I've been pushing for the 3 representations is because it is (arguably) optimal for any particular string. > > > * Grapheme clusters, words, lines, other groupings, do we need/want > > > ways to slice based on them too? > > > > No. > > Can you explain your reasoning? We can already split based on words, lines, etc., usingsplit(), and re.split(). Building additional functionality for text.word[4] seems to be a waste of time. > > > * Cheap slicing and concatenation (between O(1) and O(log(n))), do we > > > want to support them? Now would be the time. > > > > This would imply a tree-based string, which Guido has specifically > > stated would not happen. Never mind that it would be a beast to > > implement and maintain or that it would exclude the possibility for > > offering the single-segment buffer interface, without reprocessing. > > The only reference I found was this: > http://mail.python.org/pipermail/python-3000/2006-August/003334.html > > I interpret that as him being very sceptical, not an outright refusal. > > Allowing external code to operate on a python string in-place seems > tenuous at best. Even with three types (Latin-1, UCS-2, UCS-4) you > would need to automatically copy and convert if the wrong type is > given. The only benefits that utf-8 gains over any other internal representation is that it is an arguably minimal-sized representation, and it is commonly used among other C libraries. The benefits gained by using the three internal representations are primarily from a simplicity standpoint. 
That is to say, when manipulating any one of the three representations,
you know that the value at offset X represents the code point of
character X in the string. Further, with a slight change in how the
single-segment buffer interface is defined (returns the width of the
character), C extensions that want to deal with unicode strings in
*native* format (due to concerns about speed) could do so without
having to worry about reencoding, variable-width characters, etc.

You can get this same behavior by always using UTF-32 (aka UCS-4), but
at least 1/4 of the underlying data is always going to be nulls (code
points are limited to 0x0010ffff), and for many people (in Europe, the
US, and anywhere else with code points < 65536), 1/2 to 3/4 of the
underlying data is going to be nulls.

While I would imagine that people could deal with UTF-16 as an
underlying representation (from a data waste perspective), the potential
for varying-width characters in such an encoding is a pain in the ass
(like it is for UTF-8).

Regardless of our choice, *some platform* is going to be angry. Why?
GTK takes utf-8 encoded strings. (I don't know what Qt or linux system
calls take.) Windows takes utf-16. Whatever the underlying
representation, *someone* is going to have to recode when dealing with
GUI or OS-level operations.

 - Josiah

From qrczak at knm.org.pl  Thu Sep 21 00:34:40 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Thu, 21 Sep 2006 00:34:40 +0200
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <20060920142332.0825.JCARLSON@uci.edu> (Josiah Carlson's message of "Wed, 20 Sep 2006 14:59:22 -0700")
References: <20060920083244.0817.JCARLSON@uci.edu>
	<20060920142332.0825.JCARLSON@uci.edu>
Message-ID: <87ac4ui2v3.fsf@qrnik.zagroda>

Josiah Carlson writes:

> Regardless of our choice, *some platform* is going to be angry. Why?
> GTK takes utf-8 encoded strings. (I don't know what Qt or linux system
> calls take) Windows takes utf-16.

The representation of QChar in Qt-3.3.5:

    ushort ucs;
    #if defined(QT_QSTRING_UCS_4)
    ushort grp;
    #endif

The representation of QStringData in Qt-3.3.5:

    QChar *unicode;
    char *ascii;
    #ifdef Q_OS_MAC9
    uint len;
    #else
    uint len : 30;
    #endif
    uint issimpletext : 1;
    #ifdef Q_OS_MAC9
    uint maxl;
    #else
    uint maxl : 30;
    #endif
    uint islatin1 : 1;

I would say that it's silly. It seems a transition from UCS-2 to UCS-4
in Qt is incomplete. Almost no code is prepared for QT_QSTRING_UCS_4.
For example the implementation of a function which explains what
issimpletext means:

    void QString::checkSimpleText() const
    {
        QChar *p = d->unicode;
        QChar *end = p + d->len;
        while ( p < end ) {
            ushort uc = p->unicode();
            // sort out regions of complex text formatting
            if ( uc > 0x058f && ( uc < 0x1100 || uc > 0xfb0f ) ) {
                d->issimpletext = FALSE;
                return;
            }
            p++;
        }
        d->issimpletext = TRUE;
    }

QChar documentation says:

    Unicode characters are (so far) 16-bit entities without any markup
    or structure. This class represents such an entity. It is
    lightweight, so it can be used everywhere. Most compilers treat it
    like a "short int". (In a few years it may be necessary to make
    QChar 32-bit when more than 65536 Unicode code points have been
    defined and come into use.)

Bleh...
-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From rasky at develer.com  Thu Sep 21 00:39:48 2006
From: rasky at develer.com (Giovanni Bajo)
Date: Thu, 21 Sep 2006 00:39:48 +0200
Subject: [Python-3000] Removing __del__
References: <20060920092756.eng2b1ul40mtc8sg@login.werra.lunarpages.com>
Message-ID: <000f01c6dd05$ae08b8e0$7b4b2597@bagio>

Michael Chermside wrote:

> from deletions import on_del_invoke
>
> class Wrapper:
>     def __init__(self, *args):
>         self.handle = CAPI.init(*args)
>         on_del_invoke(self, CAPI.close, self.handle)
>
>     def foo(self):
>         CAPI.foo(self.handle)
>
> It's actually *fewer* lines this way, and I find it quite
> readable.

It's fewer lines, but *less* readable than a simple plain method call.
It's still an indirection.

> Furthermore, unlike the __del__ version it doesn't
> break as soon as someone accidentally puts a Wrapper object
> into a loop.

Yes, but I'm an adult and I know that it won't. I'm not even touching
__del__ with a hundred-foot pole if it's a class which has even a 1%
chance of getting into a loop, really. I know it will always be a
"leaf" class, if you know what I mean.

> Working from this example, I'm not convinced that the price
> of giving up __del__ is really all that high.

If you ask me, I don't find any library solution to finalization
acceptable. Finalization is really something that ought to be easy. If
the cyclic GC and __del__ don't get along well together, let's replace
__del__ with another finalization feature in the core, with an easy
syntax and semantics, which can cope better with the cyclic GC. Again,
I vote for the __close__ method (that is: just fix the semantics).

> (But please,
> find another example to convince me!)

Let's say:

class Wrapper:
    def __init__(self, *args):
        self.handle = CAPI.init(*args)

    def close(self):
        if self.handle is not None:
            CAPI.close(self.handle)
            self.handle = None

    __del__ = close

Now what, remove_on_del_invokation()?

> On the other side of the scales, here are some benefits that
> we gain if we get rid of __del__:
>
> * Simpler GC code which is less likely to have obscure
>   bugs that are incredibly difficult to track down. Less
>   core developer time spent maintaining complex, fragile
>   code.

This is an argument against the current semantics of __del__, not
against any finalization method which is invoked during the cyclic GC.
I believe that __close__ fixes these problems as well.

> * No need to explain about keeping __del__ objects[1] out
>   of reference loops. In exchange, we choose to explain
>   about not passing the object being monitored or
>   anything that links to it as arguments to on_del_invoke.
>   I find that preferable because: [...]

I think you are right that the latter is preferable, but I think it's
even easier to just "avoid __del__ when coding, unless you are
dramatically sure of what you're doing". This way, you don't have to
keep mental reference counts.

In fact, I believe we're missing a valuable tool for Python 2.
Wouldn't it be possible to have a debug mode where, between each
statement (or very often, at least), Python looks for cycles with
__del__ in them and aborts execution? It would be very useful to detect
uncollectable cycles early, at the moment they are created, instead of
spending long sessions trying to parse gc.garbage.

Giovanni Bajo

From mcherm at mcherm.com  Thu Sep 21 00:41:15 2006
From: mcherm at mcherm.com (Michael Chermside)
Date: Wed, 20 Sep 2006 15:41:15 -0700
Subject: [Python-3000] How will unicode get used?
Message-ID: <20060920154115.4wy2hnw6cnc4gkgw@login.werra.lunarpages.com>

Guido writes:
> > As far as I can tell, CPython on windows uses UTF-16 with code units.
> > Perhaps not intentionally, but by default (not throwing an error on
> > surrogates).
>
> This is intentional, to be compatible with the rest of that platform.
> Jython and IronPython do this too I believe.

The following code illustrates this:

>>> msg = u'The ancient greeks used the letter "\U00010143" for the number 5.'
>>> msg[35:-18]
u'"\U00010143"'
>>> greek_five = msg[36:-19]
>>> len(greek_five)
2
>>> greek_five[0]
u'\ud800'
>>> greek_five[1]
u'\udd43'

The single unicode character greek_five, when expressed as a string in
CPython, has a length of 2 and can be sliced into two separate
characters. In Jython, the code above will not work because Jython
doesn't currently support \U or extended unicode (but someday that may
change). I'm not sure about IronPython.

So if I understand Guido's point, he's saying that it is on purpose
that len(greek_five) == 2. That's useful for compatibility today with
the Java and Microsoft VM platforms. But it's not particularly
compatible with extended Unicode. (Technically it doesn't violate any
rules so long as it's clearly defined that a character in Python is NOT
the same as a unicode code point.)

I wonder if it would be better to say that len(greek_five) is undefined
in Python. (And obviously slicing behavior follows from len behavior.)
There are excellent reasons for CPython to return 2 in the near future,
but the far future is less clear. And Jython and IronPython will be
constrained by common sense to do whatever their underlying platform
does, even if that changes in the future.

Designing these things would be a lot easier if we had a time machine
so we could go see how extended Unicode is used in practice a decade or
two from now. Oh, wait....

-- Michael Chermside

From mcherm at mcherm.com  Thu Sep 21 00:46:55 2006
From: mcherm at mcherm.com (Michael Chermside)
Date: Wed, 20 Sep 2006 15:46:55 -0700
Subject: [Python-3000] How will unicode get used?
Message-ID: <20060920154655.mrv1x9ksb24g8844@login.werra.lunarpages.com>

I wrote:
>>> msg = u'The ancient greeks used the letter "\U00010143" for the number 5.'
>>> msg[35:-18]
u'"\U00010143"'
>>> greek_five = msg[36:-19]
>>> len(greek_five)
2

After posting, I realized that it's worse than that. I suspect that if
I tried this on a CPython compiled with wide characters, then
len(greek_five) would be 1.

What should it be? 2? 1? Implementation-dependent?

-- Michael Chermside

From guido at python.org  Thu Sep 21 00:52:16 2006
From: guido at python.org (Guido van Rossum)
Date: Wed, 20 Sep 2006 15:52:16 -0700
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <20060920154655.mrv1x9ksb24g8844@login.werra.lunarpages.com>
References: <20060920154655.mrv1x9ksb24g8844@login.werra.lunarpages.com>
Message-ID: 

On 9/20/06, Michael Chermside wrote:
> I wrote:
> >>> msg = u'The ancient greeks used the letter "\U00010143" for the number 5.'
> >>> msg[35:-18]
> u'"\U00010143"'
> >>> greek_five = msg[36:-19]
> >>> len(greek_five)
> 2
>
> After posting, I realized that it's worse than that. I suspect that if
> I tried this on a CPython compiled with wide characters, then
> len(greek_five) would be 1.
>
> What should it be? 2? 1? Implementation-dependent?

This has all been rehashed endlessly. It's implementation- (and
platform- and compilation-options-) dependent because there are good
reasons for both choices.
Even if CPython 3.0 supports a dynamic choice (which some are
proposing), the *language* will still make it implementation-dependent
because of Jython and IronPython, where the only choice is UTF-16 (or
UCS-2, depending on the attitude towards surrogates).

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From rhamph at gmail.com  Thu Sep 21 00:52:38 2006
From: rhamph at gmail.com (Adam Olsen)
Date: Wed, 20 Sep 2006 16:52:38 -0600
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <20060920142332.0825.JCARLSON@uci.edu>
References: <20060920083244.0817.JCARLSON@uci.edu>
	<20060920142332.0825.JCARLSON@uci.edu>
Message-ID: 

On 9/20/06, Josiah Carlson wrote:
>
> "Adam Olsen" wrote:
> >
> > On 9/20/06, Josiah Carlson wrote:
> > >
> > > "Adam Olsen" wrote:

[snip token stuff]

Withdrawn. Blake Winston pointed me to some problems in private as well.

> If I can't slice based on character index, then we end up with a similar
> situation that the wxPython StyledTextCtrl runs into right now: the
> content is encoded via utf-8 internally, so users have to use the fairly
> annoying PositionBefore(pos) and PositionAfter(pos) methods to discover
> where characters start/end. While it is possible to handle everything
> this way, it is *damn annoying*, and some users have gone so far as to
> say that it *doesn't work* for Europeans.
>
> While I won't make the claim that it *doesn't work*, it is a pain in the
> ass.

I'm going to agree with you. That's also why I'm going to assume Guido
meant to use Code Points, not Code Units (which would be bytes in the
case of UTF-8).

> > Using only utf-8 would be simpler than three distinct representations.
> > And if memory usage is an issue (which it seems to be, albeit in a
> > vague way), we could make a custom encoding that's even simpler and
> > more space efficient than utf-8.
>
> One of the reasons I've been pushing for the 3 representations is
> because it is (arguably) optimal for any particular string.

It bothers me that adding a single character would cause it to double
or quadruple in size. It may be the best compromise, though.

> > > > * Grapheme clusters, words, lines, other groupings, do we need/want
> > > > ways to slice based on them too?
> > >
> > > No.
> >
> > Can you explain your reasoning?
>
> We can already split based on words, lines, etc., using split() and
> re.split(). Building additional functionality for text.word[4] seems to
> be a waste of time.

I'm not entirely convinced, but I'll leave it for now. Maybe it'll be
a 3.1 feature.

> The benefits gained by using the three internal representations are
> primarily from a simplicity standpoint. That is to say, when
> manipulating any one of the three representations, you know that the
> value at offset X represents the code point of character X in the string.
>
> Further, with a slight change in how the single-segment buffer interface
> is defined (returns the width of the character), C extensions that want
> to deal with unicode strings in *native* format (due to concerns about
> speed), could do so without having to worry about reencoding,
> variable-width characters, etc.

Is it really worthwhile if there are three different formats they'd
have to handle?

> You can get this same behavior by always using UTF-32 (aka UCS-4), but
> at least 1/4 of the underlying data is always going to be nulls (code
> points are limited to 0x0010ffff), and for many people (in Europe, the
> US, and anywhere else with code points < 65536), 1/2 to 3/4 of the
> underlying data is going to be nulls.
>
> While I would imagine that people could deal with UTF-16 as an
> underlying representation (from a data waste perspective), the potential
> for varying-width characters in such an encoding is a pain in the ass
> (like it is for UTF-8).
>
> Regardless of our choice, *some platform* is going to be angry. Why?
> GTK takes utf-8 encoded strings. (I don't know what Qt or linux system
> calls take.) Windows takes utf-16. Whatever the underlying
> representation, *someone* is going to have to recode when dealing with
> GUI or OS-level operations.

Indeed, it seems like all our options are lose-lose.

Just to summarize, our requirements are:
* Full unicode range (0 through 0x10FFFF)
* Constant-time slicing using integer offsets
* Basic unit is a Code Point
* Contiguous in memory

The best idea I've had so far for making UTF-8 have constant-time
slicing is to use a two-level table, with the second level having one
byte per code point. However, that brings the minimum size up to (more
than) 2 bytes per code point, ruining any space advantage that utf-8
had. UTF-16 is in the same boat, but at (more than) 3 bytes per code
point.

I think the only viable options (without changing the requirements)
are straight UCS-4 or three-way (Latin-1/UCS-2/UCS-4). The size
variability of three-way doesn't seem so important when its only
competitor is straight UCS-4.

The deciding factor is what we want to expose to third-party interfaces.

Sane interface (not bytes/code units), good efficiency, C-accessible:
pick two.

-- 
Adam Olsen, aka Rhamphoryncus

From rasky at develer.com  Thu Sep 21 01:00:32 2006
From: rasky at develer.com (Giovanni Bajo)
Date: Thu, 21 Sep 2006 01:00:32 +0200
Subject: [Python-3000] Removing __del__
References: <20060920092756.eng2b1ul40mtc8sg@login.werra.lunarpages.com>
Message-ID: <001501c6dd08$93c73260$7b4b2597@bagio>

Jim Jewett wrote:

>>> I believe my example is a good use case for __del__ with no good
>>> enough workaround, ... I still like the __close__ method being
>>> proposed.
>
> [Michael asks about this alternative]
> ...
>> def on_del_invoke(obj, func, *args, **kwargs):
> ...
>> Please note that the callable must not be a bound method
>> of the object being watched, and the object being watched
>> must not be (or be referred to by) one of the arguments
>> or else the object will never be garbage collected."""
>
> By far the most frequently desired callable is self.close
>
> You can work around this with a wrapper, by setting self.f=open(...)
> and then passing self.f.close -- but with this API, I'll be wondering
> why I can't just register self.f as the object in the first place.
>
> If bound methods did not increment the refcount, this would work, but
> I imagine it would break various GUI and event-processing idioms.
>
> A special rebind-this-method-weakly builtin would work, but I'm not
> sure that is any simpler than __close__. (~= __del__ but cycles can
> be broken in an arbitrary order)

I once wrote a simple weakref wrapper which binds methods weakly (it's
pretty easy to write). I thought it would turn out to be dramatically
useful one day, and I have yet to use it even once :)

And yes, I agree that __close__ is a much easier solution to the
problem.

> Note that the wrapper (as posted) does nothing except store a pointer
> to the CAPI object and then delegate to it. With a __close__ method,
> this class could reduce to (at most)
>
> class MyCAPI(CAPI):
>     __close__ = close

Ehm, can a class be derived from a module?
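BTW, for the curious: the wrapper I mentioned is more or less the
following (a from-memory sketch, untested, written against the Python
2.x bound-method attributes im_self/im_func):

import weakref

class WeakMethod(object):
    """Wrap a bound method without keeping its instance alive."""
    def __init__(self, bound):
        self._obj = weakref.ref(bound.im_self)  # weak link to the instance
        self._func = bound.im_func              # the underlying function
    def __call__(self, *args, **kwargs):
        obj = self._obj()
        if obj is None:
            raise ReferenceError("instance has already been collected")
        return self._func(obj, *args, **kwargs)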
Giovanni Bajo

From rhamph at gmail.com  Thu Sep 21 01:02:49 2006
From: rhamph at gmail.com (Adam Olsen)
Date: Wed, 20 Sep 2006 17:02:49 -0600
Subject: [Python-3000] How will unicode get used?
In-Reply-To: 
References: <20060920154655.mrv1x9ksb24g8844@login.werra.lunarpages.com>
Message-ID: 

On 9/20/06, Guido van Rossum wrote:
> On 9/20/06, Michael Chermside wrote:
> > I wrote:
> > >>> msg = u'The ancient greeks used the letter "\U00010143" for the number 5.'
> > >>> msg[35:-18]
> > u'"\U00010143"'
> > >>> greek_five = msg[36:-19]
> > >>> len(greek_five)
> > 2
> >
> > After posting, I realized that it's worse than that. I suspect that if
> > I tried this on a CPython compiled with wide characters, then
> > len(greek_five) would be 1.
> >
> > What should it be? 2? 1? Implementation-dependent?
>
> This has all been rehashed endlessly. It's implementation (and
> platform- and compilation options-) dependent because there are good
> reasons for both choices. Even if CPython 3.0 supports a dynamic
> choice (which some are proposing) then the *language* will still make
> it implementation dependent because of Jython and IronPython, where
> the only choice is UTF-16 (or UCS-2, depending the attitude towards
> surrogates).

Wow, you really did mean code units. In that case I'm very tempted to
support UTF-8, with byte indexing (which is what code units are in its
case). It's ugly, but it technically works fine, and it's the de facto
standard on Linux. No more ugly than UTF-16 code units IMO, just more
obvious.

-- 
Adam Olsen, aka Rhamphoryncus

From guido at python.org  Thu Sep 21 01:08:01 2006
From: guido at python.org (Guido van Rossum)
Date: Wed, 20 Sep 2006 16:08:01 -0700
Subject: [Python-3000] How will unicode get used?
In-Reply-To: 
References: <20060920154655.mrv1x9ksb24g8844@login.werra.lunarpages.com>
Message-ID: 

On 9/20/06, Adam Olsen wrote:
> Wow, you really did mean code units. In that case I'm very tempted to
> support UTF-8, with byte indexing (which is what code units are in its
> case). It's ugly, but it technically works fine, and it's the de
> facto standard on Linux. No more ugly than UTF-16 code units IMO,
> just more obvious.

Who charged you with designing the string implementation?

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From qrczak at knm.org.pl  Thu Sep 21 01:13:56 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Thu, 21 Sep 2006 01:13:56 +0200
Subject: [Python-3000] How will unicode get used?
In-Reply-To:  (Guido van Rossum's message of
	"Wed, 20 Sep 2006 15:52:16 -0700")
References: <20060920154655.mrv1x9ksb24g8844@login.werra.lunarpages.com>
Message-ID: <87mz8u9lmz.fsf@qrnik.zagroda>

"Guido van Rossum" writes:

> Even if CPython 3.0 supports a dynamic choice (which some are
> proposing) then the *language* will still make it implementation
> dependent because of Jython and IronPython, where the only choice
> is UTF-16 (or UCS-2, depending the attitude towards surrogates).

Jython and IronPython could use a dual UCS-2 / UTF-32 encoding (with
some work and interoperability overhead I admit).

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From rhamph at gmail.com  Thu Sep 21 01:20:29 2006
From: rhamph at gmail.com (Adam Olsen)
Date: Wed, 20 Sep 2006 17:20:29 -0600
Subject: [Python-3000] How will unicode get used?
In-Reply-To: 
References: <20060920154655.mrv1x9ksb24g8844@login.werra.lunarpages.com>
Message-ID: 

On 9/20/06, Guido van Rossum wrote:
> On 9/20/06, Adam Olsen wrote:
> > Wow, you really did mean code units. In that case I'm very tempted to
> > support UTF-8, with byte indexing (which is what code units are in its
> > case). It's ugly, but it technically works fine, and it's the de
> > facto standard on Linux. No more ugly than UTF-16 code units IMO,
> > just more obvious.
>
> Who charged you with designing the string implementation?

Last I checked, the point of mailing lists such as these was to allow
input from the community at large.

In any case, my reaction was simply because I misunderstood your
intentions.

-- 
Adam Olsen, aka Rhamphoryncus

From jcarlson at uci.edu  Thu Sep 21 02:29:32 2006
From: jcarlson at uci.edu (Josiah Carlson)
Date: Wed, 20 Sep 2006 17:29:32 -0700
Subject: [Python-3000] How will unicode get used?
In-Reply-To: 
References: <20060920142332.0825.JCARLSON@uci.edu>
Message-ID: <20060920165545.082B.JCARLSON@uci.edu>

"Adam Olsen" wrote:
> On 9/20/06, Josiah Carlson wrote:
> >
> > "Adam Olsen" wrote:
> > >
> > > On 9/20/06, Josiah Carlson wrote:
> > > >
> > > > "Adam Olsen" wrote:
>
> [snip token stuff]
>
> Withdrawn. Blake Winston pointed me to some problems in private as well.
>
> > If I can't slice based on character index, then we end up with a similar
> > situation that the wxPython StyledTextCtrl runs into right now: the
> > content is encoded via utf-8 internally, so users have to use the fairly
> > annoying PositionBefore(pos) and PositionAfter(pos) methods to discover
> > where characters start/end. While it is possible to handle everything
> > this way, it is *damn annoying*, and some users have gone so far as to
> > say that it *doesn't work* for Europeans.
> >
> > While I won't make the claim that it *doesn't work*, it is a pain in the
> > ass.
>
> I'm going to agree with you. That's also why I'm going to assume
> Guido meant to use Code Points, not Code Units (which would be bytes
> in the case of UTF-8).
>
> > > Using only utf-8 would be simpler than three distinct representations.
> > > And if memory usage is an issue (which it seems to be, albeit in a
> > > vague way), we could make a custom encoding that's even simpler and
> > > more space efficient than utf-8.
> >
> > One of the reasons I've been pushing for the 3 representations is
> > because it is (arguably) optimal for any particular string.
>
> It bothers me that adding a single character would cause it to double
> or quadruple in size. It may be the best compromise, though.

Ahh, but the crucial observation is that the string would have been two
or four times as large initially.

> I'm not entirely convinced, but I'll leave it for now. Maybe it'll be
> a 3.1 feature.

I'll just say, "you ain't gonna need it". Why? In my experience, I
rarely, if ever, say "give me the ith word" or "give me the ith line".
I really do say, "give me the first ..., and the remaining ...". With
partition (with or without views), you can do these things quite
easily.

> > The benefits gained by using the three internal representations are
> > primarily from a simplicity standpoint. That is to say, when
> > manipulating any one of the three representations, you know that the
> > value at offset X represents the code point of character X in the string.
> >
> > Further, with a slight change in how the single-segment buffer interface
> > is defined (returns the width of the character), C extensions that want
> > to deal with unicode strings in *native* format (due to concerns about
> > speed), could do so without having to worry about reencoding,
> > variable-width characters, etc.
>
> Is it really worthwhile if there are three different formats they'd
> have to handle?

It would depend, but any application that currently handles both utf-16
and UCS-4 builds of Python and unicode strings would require only
slight modification to handle Latin-1, and could be simplified to
handle UCS-2 instead of UTF-16.

> Indeed, it seems like all our options are lose-lose.
>
> Just to summarize, our requirements are:
> * Full unicode range (0 through 0x10FFFF)
> * Constant-time slicing using integer offsets
> * Basic unit is a Code Point
> * Contiguous in memory
>
> The best idea I've had so far for making UTF-8 have constant-time
> slicing is to use a two-level table, with the second level having one
> byte per code point. However, that brings the minimum size up to (more
> than) 2 bytes per code point, ruining any space advantage that
> utf-8 had.

(I'm not advocating the following, just expressing that it could be
done.) Another way of doing it would be to keep the underlying string
in utf-8 (or even utf-16), but layer a specially crafted tree structure
over the top of it, sized in a particular manner so that (bytes
used)/(bytes required) stays fairly small. It could offer log-time
discovery of offsets (for slicing) and a memory-contiguous
representation. This tree could also be generated only after some k
slices, avoiding the overhead of tree creation unless we have
determined it to be reasonable to amortize.

If one chooses one node per k*log(n) characters, then we get O(n) tree
construction time with the same big-O index discovery time, using
roughly 24*n/(k*log(n)) additional bytes per string of length n
(assuming a 64-bit Python). Choose k=24 (or k=12 on a 32-bit Python),
and we get a used/required ratio of 1 + log(n)/n. Not bad.

> I think the only viable options (without changing the requirements)
> are straight UCS-4 or three-way (Latin-1/UCS-2/UCS-4). The size
> variability of three-way doesn't seem so important when its only
> competitor is straight UCS-4.
>
> The deciding factor is what we want to expose to third-party interfaces.
>
> Sane interface (not bytes/code units), good efficiency, C-accessible:
> pick two.

I would say that both options are C-accessible, though perhaps not
optimally in either case. Note that we can always recode for the
third-party interfaces; that's what is already done for PyGTK, wxPython
on linux, 32-bit characters on Windows, etc.

- Josiah

From greg.ewing at canterbury.ac.nz  Thu Sep 21 02:54:29 2006
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Thu, 21 Sep 2006 12:54:29 +1200
Subject: [Python-3000] Delayed reference counting idea
In-Reply-To: <87psdqztxk.fsf@qrnik.zagroda>
References: <8764fl87j3.fsf@qrnik.zagroda>
	<422D230A-22E0-45D4-A5DC-267081CB8FEA@python.org>
	<87eju79aut.fsf@qrnik.zagroda> <878xkfc1tj.fsf@qrnik.zagroda>
	<451089D7.6060204@canterbury.ac.nz> <87hcz38367.fsf@qrnik.zagroda>
	<45110180.4070807@canterbury.ac.nz> <87psdqztxk.fsf@qrnik.zagroda>
Message-ID: <4511E2C5.2020500@canterbury.ac.nz>

Marcin 'Qrczak' Kowalczyk wrote:
> Incremental GC (e.g. in OCaml) has short pauses. It doesn't scan all
> memory at once, but distributes the work among GC cycles.
Can it be made to guarantee that no pause will be longer than some
small amount, such as 20ms? Because that's what is needed to ensure
smooth animation.

-- 
Greg

From greg.ewing at canterbury.ac.nz  Thu Sep 21 03:14:23 2006
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Thu, 21 Sep 2006 13:14:23 +1200
Subject: [Python-3000] Removing __del__
In-Reply-To: <20060920092756.eng2b1ul40mtc8sg@login.werra.lunarpages.com>
References: <20060920092756.eng2b1ul40mtc8sg@login.werra.lunarpages.com>
Message-ID: <4511E76F.9000706@canterbury.ac.nz>

Michael Chermside wrote:

> * Programmers no longer have the ability to allow __del__
>   to resurrect the object being finalized.

I've never even considered trying to write such code, and can't think
of any reason why I ever would, so I wouldn't miss this ability at all.

-- 
Greg

From david.nospam.hopwood at blueyonder.co.uk  Thu Sep 21 03:09:24 2006
From: david.nospam.hopwood at blueyonder.co.uk (David Hopwood)
Date: Thu, 21 Sep 2006 02:09:24 +0100
Subject: [Python-3000] How will unicode get used?
In-Reply-To: 
References: 
Message-ID: <4511E644.2030306@blueyonder.co.uk>

Brett Cannon wrote:
> On 9/20/06, Adam Olsen wrote:
>> On 9/20/06, Guido van Rossum wrote:
>> > On 9/20/06, Adam Olsen wrote:
>> > >
>> > > Before we can decide on the internal representation of our unicode
>> > > objects, we need to decide on their external interface. My thoughts
>> > > so far:
>> >
>> > Let me cut this short. The external string API in Py3k should not
>> > change or only very marginally so (like removing rarely used useless
>> > APIs or adding a few new conveniences). The plan is to keep the 2.x
>> > API that is supported (in 2.x) by both str and unicode, but merge the
>> > two string types into one. Anything else could be done just as easily
>> > before or after Py3k.
>>
>> Thanks, but one thing remains unclear: is the indexing intended to
>> represent bytes, code points, or code units? Note that C code
>> operating on UTF-16 would use code units for slicing of UTF-16, which
>> splits surrogate pairs.
>
> Assuming my Unicode lingo is right and code point represents a
> letter/character/digraph/whatever, then it will be a code point. Doing one
> of my rare channels of Guido, I *really* doubt he wants to expose the
> technical details of Unicode to the point of having people need to realize
> that UTF-8 takes two bytes to represent "ö".

The argument used here is not valid. People do need to realize that
*all* Unicode encodings are variable-length, in the sense that abstract
characters can be represented by multiple code points.

For example, "ö" can be represented either as the precomposed character
U+00F6, or as "o" followed by a combining diaeresis (U+006F U+0308).
Programs must avoid splitting sequences of code points that represent a
single abstract character. A program that does that correctly will
automatically also avoid splitting within the representation of a code
point, whatever UTF is used.

> If you want that kind of
> exposure, use the bytes type. Otherwise assume the usage will be by people
> ignorant of Unicode and thus want something that will work the way they are
> used to when compared to working in ASCII.

It simply is not possible to do correct string processing in Unicode
that will "work the way [programmers] are used to when compared to
working in ASCII".

The Unicode standard is on-line at www.unicode.org, and is quite well
written, with lots of motivation and explanation of how processing
international texts necessarily differs from working with ASCII.
There is no excuse for any programmer doing text processing not to have
read it.

Should we nevertheless try to avoid making the use of Unicode strings
unnecessarily difficult for people who have minimal knowledge of
Unicode? Absolutely, but not at the expense of making basic operations
on strings asymptotically less efficient. O(1) indexing and slicing is
a basic requirement, even if it has to be done using code units.

-- 
David Hopwood

From greg.ewing at canterbury.ac.nz  Thu Sep 21 03:34:14 2006
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Thu, 21 Sep 2006 13:34:14 +1200
Subject: [Python-3000] Removing __del__
In-Reply-To: 
References: <20060920092756.eng2b1ul40mtc8sg@login.werra.lunarpages.com>
Message-ID: <4511EC16.4030307@canterbury.ac.nz>

Jim Jewett wrote:

> How do you feel about the __del__ in stdlib subprocess.Popen (about line 615)?
>
> This resurrects itself, in order to finish waiting for the child
> process.

I don't see a need for resurrection here. Why can't it create another
object holding the necessary info for doing the waiting?

> (And note that if it needed to revive (not recreate, revive)
> subobjects, it would need the full immortal-cycle power of today's
> __del__.

Any subobjects which may need to be preserved can be passed as
arguments to the finalizer, which can then prevent them from dying in
the first place if it wants.

I'm far from convinced that there's ever a *need* for resurrection.

-- 
Greg

From jcarlson at uci.edu  Thu Sep 21 03:58:30 2006
From: jcarlson at uci.edu (Josiah Carlson)
Date: Wed, 20 Sep 2006 18:58:30 -0700
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <4511E644.2030306@blueyonder.co.uk>
References: <4511E644.2030306@blueyonder.co.uk>
Message-ID: <20060920184147.082F.JCARLSON@uci.edu>

David Hopwood wrote:
> Brett Cannon wrote:
[snip]
> > If you want that kind of
> > exposure, use the bytes type. Otherwise assume the usage will be by people
> > ignorant of Unicode and thus want something that will work the way they are
> > used to when compared to working in ASCII.
>
> It simply is not possible to do correct string processing in Unicode that
> will "work the way [programmers] are used to when compared to working in ASCII".
>
> The Unicode standard is on-line at www.unicode.org, and is quite well written,
> with lots of motivation and explanation of how processing international texts
> necessarily differs from working with ASCII. There is no excuse for any
> programmer doing text processing not to have read it.

Since basically everyone using Python today performs "text processing"
in one way or another, you are saying that basically everyone should be
reading the Unicode spec before using Python. Never mind that the
document is generally larger than most people want to read, and that
you didn't provide a link to the most applicable section (with regards
to *using* unicode).

I will also mention that in the unicode 4.0 spec, Chapter 5,
"Implementation Guidelines", starts with:

'''
It is possible to implement a substantial subset of the Unicode
Standard as "wide ASCII" with little change to existing programming
practice. ...
'''

It later goes on to explain where "wide ASCII" is not a reasonable
strategy, but I'm not sure that users of Python necessarily need to
know all of that.

> Should we nevertheless try to avoid making the use of Unicode strings
> unnecessarily difficult for people who have minimal knowledge of Unicode?
> Absolutely, but not at the expense of making basic operations on strings
> asymptotically less efficient. O(1) indexing and slicing is a basic
> requirement, even if it has to be done using code units.

I believe you mean "code points"; "code units" imply non-O(1) indexing
and slicing (variable-width characters).

- Josiah

From guido at python.org  Thu Sep 21 03:55:24 2006
From: guido at python.org (Guido van Rossum)
Date: Wed, 20 Sep 2006 18:55:24 -0700
Subject: [Python-3000] How will unicode get used?
In-Reply-To: 
References: <20060920154655.mrv1x9ksb24g8844@login.werra.lunarpages.com>
Message-ID: 

On 9/20/06, Adam Olsen wrote:
> On 9/20/06, Guido van Rossum wrote:
> > On 9/20/06, Adam Olsen wrote:
> > > Wow, you really did mean code units. In that case I'm very tempted to
> > > support UTF-8, with byte indexing (which is what code units are in its
> > > case). It's ugly, but it technically works fine, and it's the de
> > > facto standard on Linux. No more ugly than UTF-16 code units IMO,
> > > just more obvious.
> >
> > Who charged you with designing the string implementation?
>
> Last I checked, the point of mailing lists such as these was to allow
> input from the community at large.
>
> In any case, my reaction was simply because I misunderstood your intentions.

I was specifically reacting to your use of the phrasing "I'm very
tempted to support UTF-8"; this wording suggests that it would be your
choice to make. I could have pointed out the obvious (that equating
the difficulty of using UTF-8 with that of using UTF-16 doesn't make
it so) but I figured the other readers are also tired of your attempts
to move this into an entirely different direction, and based on a
thorough lack of understanding of the status quo no less.

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From rhamph at gmail.com  Thu Sep 21 04:12:35 2006
From: rhamph at gmail.com (Adam Olsen)
Date: Wed, 20 Sep 2006 20:12:35 -0600
Subject: [Python-3000] How will unicode get used?
In-Reply-To: 
References: <20060920154655.mrv1x9ksb24g8844@login.werra.lunarpages.com>
Message-ID: 

On 9/20/06, Guido van Rossum wrote:
> On 9/20/06, Adam Olsen wrote:
> > On 9/20/06, Guido van Rossum wrote:
> > > On 9/20/06, Adam Olsen wrote:
> > > > Wow, you really did mean code units. In that case I'm very tempted to
> > > > support UTF-8, with byte indexing (which is what code units are in its
> > > > case). It's ugly, but it technically works fine, and it's the de
> > > > facto standard on Linux. No more ugly than UTF-16 code units IMO,
> > > > just more obvious.
> > >
> > > Who charged you with designing the string implementation?
> >
> > Last I checked, the point of mailing lists such as these was to allow
> > input from the community at large.
> >
> > In any case, my reaction was simply because I misunderstood your intentions.
>
> I was specifically reacting to your use of the phrasing "I'm very
> tempted to support UTF-8"; this wording suggests that it would be your
> choice to make. I could have pointed out the obvious (that equating
> the difficulty of using UTF-8 with that of using UTF-16 doesn't make
> it so) but I figured the other readers are also tired of your attempts
> to move this into an entirely different direction, and based on a
> thorough lack of understanding of the status quo no less.

It was poor wording then. I never intended to imply that it was my
choice. Instead, I was referring to the input I have as a member of
the community.

I am not attempting to move this in a different direction.
I (and apparently several other people) thought it always was a
different direction. It is obvious now that it wasn't your intent to
use code points, and I can accept that code units are the best (most
efficient) choice.

-- 
Adam Olsen, aka Rhamphoryncus

From fredrik at pythonware.com  Thu Sep 21 09:00:56 2006
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Thu, 21 Sep 2006 09:00:56 +0200
Subject: [Python-3000] How will unicode get used?
In-Reply-To: 
References: <20060920154655.mrv1x9ksb24g8844@login.werra.lunarpages.com>
Message-ID: 

Adam Olsen wrote:

> Wow, you really did mean code units. In that case I'm very tempted to
> support UTF-8, with byte indexing (which is what code units are in its
> case). It's ugly, but it technically works fine, and it's the de
> facto standard on Linux. No more ugly than UTF-16 code units IMO,
> just more obvious.

*plonk*

From fredrik at pythonware.com  Thu Sep 21 09:16:01 2006
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Thu, 21 Sep 2006 09:16:01 +0200
Subject: [Python-3000] How will unicode get used?
In-Reply-To: 
References: <20060920154655.mrv1x9ksb24g8844@login.werra.lunarpages.com>
Message-ID: 

Guido van Rossum wrote:

> based on a thorough lack of understanding of the status quo no less.

that's, unfortunately, a bit too common on this list.

(as the author of Python's Unicode type and cElementTree, I especially
like arguments along the lines of "using separate buffers to hold the
actual character data is not feasible" and "that narrow storage would
have any advantages over wide storage is far from proven". nice try,
guys ;-)

From fredrik at pythonware.com  Thu Sep 21 09:21:34 2006
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Thu, 21 Sep 2006 09:21:34 +0200
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <4511E644.2030306@blueyonder.co.uk>
References: <4511E644.2030306@blueyonder.co.uk>
Message-ID: 

David Hopwood wrote:

> For example, "ö" can be represented either as the precomposed character
> U+00F6, or as "o" followed by a combining diaeresis (U+006F U+0308).

normalization is a good thing, though:

    http://www.w3.org/TR/charmod-norm/

(it would probably be a good idea to turn unicodedata.normalize into a
method for the new unicode string type).

From qrczak at knm.org.pl  Thu Sep 21 09:51:02 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Thu, 21 Sep 2006 09:51:02 +0200
Subject: [Python-3000] Delayed reference counting idea
In-Reply-To: <4511E2C5.2020500@canterbury.ac.nz> (Greg Ewing's message of
	"Thu, 21 Sep 2006 12:54:29 +1200")
References: <8764fl87j3.fsf@qrnik.zagroda>
	<422D230A-22E0-45D4-A5DC-267081CB8FEA@python.org>
	<87eju79aut.fsf@qrnik.zagroda> <878xkfc1tj.fsf@qrnik.zagroda>
	<451089D7.6060204@canterbury.ac.nz> <87hcz38367.fsf@qrnik.zagroda>
	<45110180.4070807@canterbury.ac.nz> <87psdqztxk.fsf@qrnik.zagroda>
	<4511E2C5.2020500@canterbury.ac.nz>
Message-ID: <87slil1wux.fsf@qrnik.zagroda>

Greg Ewing writes:

>> Incremental GC (e.g. in OCaml) has short pauses. It doesn't scan all
>> memory at once, but distributes the work among GC cycles.
>
> Can it be made to guarantee that no pause will
> be longer than some small amount, such as 20ms?

It's not hard realtime. There are no strict guarantees, and a single
large object is processed in whole. Python also processes large
objects in whole.
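(A rough way to observe the refcounting side of this in CPython --
purely illustrative, and the numbers will vary by machine:)

import time

# One reference keeps a large object graph alive; dropping it frees
# everything at once, i.e. one long refcounting "pause".
big = [[i] for i in xrange(2000000)]
t0 = time.time()
del big
print "freeing took %.1f ms" % ((time.time() - t0) * 1000.0)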
-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From qrczak at knm.org.pl  Thu Sep 21 11:22:29 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Thu, 21 Sep 2006 11:22:29 +0200
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <4511E644.2030306@blueyonder.co.uk> (David Hopwood's message of
	"Thu, 21 Sep 2006 02:09:24 +0100")
References: <4511E644.2030306@blueyonder.co.uk>
Message-ID: <87venh60bu.fsf@qrnik.zagroda>

David Hopwood writes:

> People do need to realize that *all* Unicode encodings are
> variable-length, in the sense that abstract characters can be
> represented by multiple code points.

Unicode algorithms for case mapping, word splitting, collation etc. are
generally defined in terms of code points. The character database is
keyed by code points, which are the largest practical text unit with a
finite domain. Even if on the high level there are some other units,
any algorithm which determines these high-level text boundaries is
easier to implement in terms of code points than in terms of even
lower-level UTF-x code units.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From ncoghlan at iinet.net.au  Thu Sep 21 12:26:18 2006
From: ncoghlan at iinet.net.au (Nick Coghlan)
Date: Thu, 21 Sep 2006 20:26:18 +1000
Subject: [Python-3000] Removing __del__
In-Reply-To: 
References: <20060919053609.vp8duwukq7sw4w48@login.werra.lunarpages.com>
	<451123A2.7040701@gmail.com>
Message-ID: <451268CA.3030307@iinet.net.au>

Second attempt, this time to the right list :)

Jim Jewett wrote:
> On 9/20/06, Nick Coghlan wrote:
>> # Create a class with the same instance attributes
>> # as the original
>> class attr_holder(object):
>>     pass
>> finalizer_arg = attr_holder()
>> finalizer_arg.__dict__ = self.__dict__
>
> Does this really work?

It works for normal user-defined classes at least:

>>> class C1(object):
...     pass
...
>>> class C2(object):
...     pass
...
>>> a = C1()
>>> b = C2()
>>> b.__dict__ = a.__dict__
>>> a.x = 1
>>> b.x
1

> (1) for classes with a dictproxy of some sort, you might get either a
> copy (which isn't updated)

Classes that change the way __dict__ is handled would probably need to
define their own __del_arg__.

> (2) for other classes, self might be added to the dict later

Yeah, that's the strongest argument I know of against having that
default fallback - it can easily lead to a strong reference from
sys.finalizers into an otherwise unreachable cycle. I believe it
currently takes two __del__ methods to prevent a cycle from being
collected, whereas in this setup it would only take one.

OTOH, fixing it would be much easier than it is now (by setting
__del_arg__ to something that holds only the subset of attributes that
require finalization).

> and of course, if it isn't added later, then it doesn't have the full
> power of current finalizers -- just the __close__ subset.

True, but most finalizers I've seen don't really *need* the full power
of the current __del__. They only need to get at a couple of their
internal members in order to explicitly release external resources.
And more sophisticated usage is still possible by assigning an
appropriate value to __del_arg__.

Cheers,
Nick.
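P.S. To make that concrete, here is roughly how I'd expect a simple
resource wrapper to be spelled under this scheme. This is a sketch of
the *proposed* semantics, not something that runs today, and CAPI is
the hypothetical extension module from the earlier examples in this
thread:

class Wrapper(object):
    def __init__(self, *args):
        self.handle = CAPI.init(*args)
        # Give the finalizer only the state it needs; since __del_arg__
        # doesn't reference self, cycles through self stay collectable.
        self.__del_arg__ = self.handle

    def __del__(handle):
        # Under the proposal, __del__ receives __del_arg__, not self.
        CAPI.close(handle)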
-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
             http://www.boredomandlaziness.org

From rasky at develer.com  Thu Sep 21 12:28:51 2006
From: rasky at develer.com (Giovanni Bajo)
Date: Thu, 21 Sep 2006 12:28:51 +0200
Subject: [Python-3000] How will unicode get used?
References: <20060920154655.mrv1x9ksb24g8844@login.werra.lunarpages.com>
Message-ID: <015c01c6dd68$bb6bcfa0$e303030a@trilan>

Guido van Rossum wrote:

> I was specifically reacting to your use of the phrasing "I'm very
> tempted to support UTF-8"; this wording suggests that it would be your
> choice to make. I could have pointed out the obvious (that equating
> the difficulty of using UTF-8 with that of using UTF-16 doesn't make
> it so) but I figured the other readers are also tired of your attempts
> to move this into an entirely different direction, and based on a
> thorough lack of understanding of the status quo no less.

Is there a design document explaining the rationale of the unicode
type, the status quo? Any time this subject is raised on the mailing
list, the net result is "you guys don't understand unicode". Well, let
us know what is good and what is bad about the current unicode type;
what is by design and what is an implementation detail; what you want
to absolutely keep, and what you want to absolutely change.

I am *really* confused about the status quo of the unicode type (which
is why I keep myself out of technical discussions on the matter, of
course). Is there any desire to let people understand and join the
discussion? Or otherwise, let's decide that the unicode type in Py3k
will not be publicly discussed and will be handled only by the experts.
This would save us from these "attempts" as well.

-- 
Giovanni Bajo

From ncoghlan at gmail.com  Thu Sep 21 12:31:25 2006
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Thu, 21 Sep 2006 20:31:25 +1000
Subject: [Python-3000] Removing __del__
In-Reply-To: <20060920062401.70tb7a65nxus0skg@login.werra.lunarpages.com>
References: <20060920062401.70tb7a65nxus0skg@login.werra.lunarpages.com>
Message-ID: <451269FD.30508@gmail.com>

Michael Chermside wrote:
> Nick Coghlan writes:
> [...proposes revision of __del__ rather than removal...]
>> The only way for __del__ to receive a reference to self is if the
>> finalizer argument had a reference to it - but that would mean the
>> object itself was not collectable, so __del__ wouldn't be called in
>> the first place.
>>
>> That all seems too simple, though. Since we're talking about gc and
>> that's never simple, there has to be something wrong with the idea :)
>
> Unfortunately you're right... this is all too simple. The existing
> mechanism doesn't have a problem with __del__ methods that do not
> participate in loops. For those that DO participate in loops I
> think it's perfectly plausible for your __del__ to receive a reference
> to the actual object being finalized.

Nope. If the argument to __del__ has a strong reference to the object,
that object simply won't get finalized at all because it's not in an
unreachable cycle. sys.finalizers would act as a global root for all
objects reachable from finalizers (with those refcounts only being
decremented when the callback removes the weakref object from the
finalizer set).

> Another problem (but less important as it's trivially fixable) is that
> you're storing away the values that the object had when it was created,
> perhaps missing out on things that got added or initialized later.
The default fallback doesn't do that - it stores a reference to the
instance dictionary of the object, so it sees later modifications and
additions.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
             http://www.boredomandlaziness.org

From fredrik at pythonware.com  Thu Sep 21 12:41:08 2006
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Thu, 21 Sep 2006 12:41:08 +0200
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <015c01c6dd68$bb6bcfa0$e303030a@trilan>
References: <20060920154655.mrv1x9ksb24g8844@login.werra.lunarpages.com>
	<015c01c6dd68$bb6bcfa0$e303030a@trilan>
Message-ID: 

Giovanni Bajo wrote:

> Is there a design document explaining the rationale of the unicode
> type, the status quo?

Guido isn't complaining about people who don't understand the rationale
behind the design, he's complaining about people who HAVEN'T EVEN
LOOKED AT THE CURRENT DESIGN before spouting off random proposals.

From gabor at nekomancer.net  Thu Sep 21 12:50:30 2006
From: gabor at nekomancer.net (=?ISO-8859-1?Q?G=E1bor_Farkas?=)
Date: Thu, 21 Sep 2006 12:50:30 +0200
Subject: [Python-3000] How will unicode get used?
In-Reply-To: 
References: <20060920154655.mrv1x9ksb24g8844@login.werra.lunarpages.com>
Message-ID: <45126E76.9020600@nekomancer.net>

Guido van Rossum wrote:
> On 9/20/06, Michael Chermside wrote:
>> I wrote:
>>>>> msg = u'The ancient greeks used the letter "\U00010143" for the number 5.'
>>>>> msg[35:-18]
>> u'"\U00010143"'
>>>>> greek_five = msg[36:-19]
>>>>> len(greek_five)
>> 2
>>
>> What should it be? 2? 1? Implementation-dependent?
>
> This has all been rehashed endlessly. It's implementation (and
> platform- and compilation options-) dependent because there are good
> reasons for both choices.

while i understand the constraints, i think it's not a good decision to
leave this implementation-dependent. strings seem to me such a basic
piece of functionality that their behaviour should not depend on the
platform.

for example, how is an application developer then supposed to write his
applications? should he write his own slicing/whatever functions to get
consistent behaviour on linux/windows?

i think this is not just a 'theoretical' issue. it's a very practical
issue. the only reason it does not seem important right now is that not
many of the non-16-bit unicode characters are in use yet (and this
situation seems quite similar to the one when only 8-bit characters
were used :-)

btw. an idea:
==============

maybe this 'problem' should be separated into 2 issues:

1. representation of the unicode string (utf-16 or utf-32)
2. behaviour of the unicode strings in python-3000

of course there are some dependencies between them (mostly the
performance of #2).

so why don't we make the *behaviour* cross-platform, and the
*performance characteristics* and the *representation*
platform-dependent? (this means that jython/ironpython could use
utf-16, but would slice strings more slowly (because of the
surrogate-issues))

================

> Even if CPython 3.0 supports a dynamic
> choice (which some are proposing) then the *language* will still make
> it implementation dependent because of Jython and IronPython, where
> the only choice is UTF-16 (or UCS-2, depending the attitude towards
> surrogates).
i don't see why utf-16 should be the only choice there. it's the
obvious/most-convenient choice for jython/ironpython, that's correct.
but (correct me if i'm wrong), ironpython or jython could support
utf-32 characters. it would of course mean that they could not use the
platform's string type for their string handling.

but in the same way i could say that, because most of the unix-world is
utf-8, for those pythons the best way is to handle it internally as
utf-8, couldn't i?

it simply seems strange to me to make compromises that make the life of
the cpython users harder, just to make life easier for the
jython/ironpython developers (i mean the 'creators').

gabor

From rasky at develer.com  Thu Sep 21 14:00:05 2006
From: rasky at develer.com (Giovanni Bajo)
Date: Thu, 21 Sep 2006 14:00:05 +0200
Subject: [Python-3000] Removing __del__
References: <20060920062401.70tb7a65nxus0skg@login.werra.lunarpages.com>
	<451269FD.30508@gmail.com>
Message-ID: <042501c6dd75$79f2df70$e303030a@trilan>

Nick Coghlan wrote:

>> Unfortunately you're right... this is all too simple. The existing
>> mechanism doesn't have a problem with __del__ methods that do not
>> participate in loops. For those that DO participate in loops I
>> think it's perfectly plausible for your __del__ to receive a
>> reference to the actual object being finalized.
>
> Nope. If the argument to __del__ has a strong reference to the
> object, that object simply won't get finalized at all because it's
> not in an unreachable cycle.

What if the "self" passed to __del__ was instead a weakref.proxy, or a
similar wrapper object which does not give you access to the object
itself but lets you access its attributes? The object could have been
already collected for all I care; what I really need is to be able to
say "self.foo" to access what used to be the "foo" member of self. You
can create a totally different object of any type but with the same
__dict__. Ok, it's not that easy (properties, etc.), but you get the
idea. Am I missing something?

-- 
Giovanni Bajo

From qrczak at knm.org.pl  Thu Sep 21 14:54:26 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Thu, 21 Sep 2006 14:54:26 +0200
Subject: [Python-3000] Removing __del__
In-Reply-To: <042501c6dd75$79f2df70$e303030a@trilan> (Giovanni Bajo's message
	of "Thu, 21 Sep 2006 14:00:05 +0200")
References: <20060920062401.70tb7a65nxus0skg@login.werra.lunarpages.com>
	<451269FD.30508@gmail.com> <042501c6dd75$79f2df70$e303030a@trilan>
Message-ID: <87u031xtvh.fsf@qrnik.zagroda>

"Giovanni Bajo" writes:

> What if the "self" passed to __del__ was instead a weakref.proxy,
> or a similar wrapper object which does not give you access to the
> object itself but lets you access its attributes?

weakref.proxy will find the object already dead.

I doubt this can be done fully automatically.

The basic design is splitting the object into an outer part handed to
clients, which is watched to become unreachable, and a private inner
part used to physically access the resource, including releasing it.
I see no good way around it.

Often the inner part is a single field which is already separated.
In other cases it might require an extra indirection, in particular
if it's a mutable field.

This design distinguishes between related objects which are needed
during finalization (fields of the inner object) and related objects
which are not (fields of the outer object).

Cycles involving only outer objects are harmless; they can be safely
freed together, triggering finalization of all associated objects.
Inner objects may also refer to most other objects, ensuring that they
are not finalized earlier. But a path from an inner object to its
associated outer object prevents the latter from being finalized, and
is a bug in the program (unless it is broken before the object loses
all other references).

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From theller at python.net  Thu Sep 21 15:24:52 2006
From: theller at python.net (Thomas Heller)
Date: Thu, 21 Sep 2006 15:24:52 +0200
Subject: [Python-3000] Small Py3k task: fix modulefinder.py
In-Reply-To: 
References: 
Message-ID: 

Guido van Rossum schrieb:
> Is anyone familiar enough with modulefinder.py to fix its breakage in
> Py3k? It chokes in a nasty way (exceeding the recursion limit) on the
> relative import syntax. I suspect this is also a problem for 2.5, when
> people use that syntax; hence the cross-post. There's no unittest for
> modulefinder.py, but I believe py2exe depends on it (and of course
> freeze.py, but who uses that still?)

I'm not (yet) using relative imports in 2.5 or Py3k, but I have not
been able to reproduce the recursion limit problem. Can you describe
the package that fails?

Thanks,
Thomas

From david.nospam.hopwood at blueyonder.co.uk  Thu Sep 21 21:41:54 2006
From: david.nospam.hopwood at blueyonder.co.uk (David Hopwood)
Date: Thu, 21 Sep 2006 20:41:54 +0100
Subject: [Python-3000] How will unicode get used?
In-Reply-To: 
References: <4511E644.2030306@blueyonder.co.uk>
Message-ID: <4512EB02.6@blueyonder.co.uk>

Fredrik Lundh wrote:
> David Hopwood wrote:
>
>> For example, "ö" can be represented either as the precomposed character
>> U+00F6, or as "o" followed by a combining diaeresis (U+006F U+0308).
>
> normalization is a good thing, though:
>
>     http://www.w3.org/TR/charmod-norm/
>
> (it would probably be a good idea to turn unicodedata.normalize into a
> method for the new unicode string type).

Normalization is certainly a good thing to support. But that's
orthogonal to my point above -- that some abstract characters are
representable by sequences of more than one code point, which must not
be split, and that avoidance of such splitting automatically also
avoids splitting within a code point representation.

Note that some abstract characters needed for living languages are
representable *only* by combining sequences.

-- 
David Hopwood

From greg.ewing at canterbury.ac.nz  Fri Sep 22 01:57:08 2006
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Fri, 22 Sep 2006 11:57:08 +1200
Subject: [Python-3000] Delayed reference counting idea
In-Reply-To: <87slil1wux.fsf@qrnik.zagroda>
References: <8764fl87j3.fsf@qrnik.zagroda>
	<422D230A-22E0-45D4-A5DC-267081CB8FEA@python.org>
	<87eju79aut.fsf@qrnik.zagroda> <878xkfc1tj.fsf@qrnik.zagroda>
	<451089D7.6060204@canterbury.ac.nz> <87hcz38367.fsf@qrnik.zagroda>
	<45110180.4070807@canterbury.ac.nz> <87psdqztxk.fsf@qrnik.zagroda>
	<4511E2C5.2020500@canterbury.ac.nz> <87slil1wux.fsf@qrnik.zagroda>
Message-ID: <451326D4.5030401@canterbury.ac.nz>

Marcin 'Qrczak' Kowalczyk wrote:
> It's not hard realtime. There are no strict guarantees, and a single
> large object is processed in whole.

I know. What I mean to say, I think, is: can it be designed so that
there cannot be any pauses longer than there would have been if freeing
had been performed as early as possible by refcounting?

-- 
Greg

From michel at dialnetwork.com  Fri Sep 22 04:22:55 2006
From: michel at dialnetwork.com (Michel Pelletier)
Date: Thu, 21 Sep 2006 19:22:55 -0700
Subject: [Python-3000] Kill GIL?
	- to PEP 3099?
In-Reply-To: 
References: 
Message-ID: <1158891775.14240.7.camel@amdy>

> Fredrik Lundh wrote:
>
> > no need to wait for Guido for this: adding library support for shared-
> > memory dictionaries/lists is a no-brainer. if you have experience in
> > this field, start hacking. I'll take care of the rest ;-)
>
> and you don't need to wait for Python 3000 either, of course -- if done
> right, this would certainly fit into some future 2.X release.

Here's a straight wrapper around the OSSP mm shared memory library:

    http://70.103.91.130/~michel/pymm-0.1.tgz

I've only minimally tested it on AMD64 linux. It exposes mm shared
memory regions as buffers and only wraps the "Standard" mm API.
Hacking the Python memory system to put objects in shared memory is too
deep for me. Included is a test that uses cPickle to share object state
between forked processes. It needs a lot more testing and tweaking, but
it works as a proof of concept.

-Michel

From ncoghlan at gmail.com  Fri Sep 22 13:02:38 2006
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Fri, 22 Sep 2006 21:02:38 +1000
Subject: [Python-3000] Removing __del__
In-Reply-To: <87u031xtvh.fsf@qrnik.zagroda>
References: <20060920062401.70tb7a65nxus0skg@login.werra.lunarpages.com>
	<451269FD.30508@gmail.com> <042501c6dd75$79f2df70$e303030a@trilan>
	<87u031xtvh.fsf@qrnik.zagroda>
Message-ID: <4513C2CE.4050506@gmail.com>

Marcin 'Qrczak' Kowalczyk wrote:
> "Giovanni Bajo" writes:
>
>> What if the "self" passed to __del__ was instead a weakref.proxy,
>> or a similar wrapper object which does not give you access to the
>> object itself but lets you access its attributes?
>
> weakref.proxy will find the object already dead.
>
> I doubt this can be done fully automatically.
>
> The basic design is splitting the object into an outer part handed to
> clients, which is watched to become unreachable, and a private inner
> part used to physically access the resource, including releasing it.
> I see no good way around it.
>
> Often the inner part is a single field which is already separated.
> In other cases it might require an extra indirection, in particular
> if it's a mutable field.
>
> This design distinguishes between related objects which are needed
> during finalization (fields of the inner object) and related objects
> which are not (fields of the outer object).
>
> Cycles involving only outer objects are harmless; they can be safely
> freed together, triggering finalization of all associated objects.
> Inner objects may also refer to most other objects, ensuring that
> they are not finalized earlier. But a path from an inner object to its
> associated outer object prevents the latter from being finalized, and
> is a bug in the program (unless it is broken before the object loses
> all other references).

Exactly. My strawman design made the default inner object a simple
class with the same instance dictionary as the outer object, so that
most current __del__ implementations would 'just work', but it poses
the problem of making it easy to inadvertently create an immortal
cycle.

OTOH, that can already happen today, and the __del_arg__ mechanism
provides an easy way of ensuring it doesn't happen.

Cheers,
Nick.
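P.S. For anyone who wants to experiment, the inner/outer split can
already be approximated with today's weakref machinery. A sketch (CAPI
is again the hypothetical extension module from earlier in the thread;
note the weakref has to live in a global registry, because a weakref
that dies along with its referent never fires its callback):

import weakref

_finalizers = set()  # global root: keeps the weakrefs (not the objects) alive

class _Inner(object):
    """Owns the resource; sees none of the outer object's other state."""
    def __init__(self, *args):
        self.handle = CAPI.init(*args)
    def close(self):
        if self.handle is not None:
            CAPI.close(self.handle)
            self.handle = None

class Outer(object):
    """What clients hold; free to participate in cycles."""
    def __init__(self, *args):
        inner = self._inner = _Inner(*args)
        def _cleanup(wr, inner=inner):  # closes over inner, never self
            _finalizers.discard(wr)
            inner.close()
        _finalizers.add(weakref.ref(self, _cleanup))
    def foo(self):
        CAPI.foo(self._inner.handle)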
-- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia --------------------------------------------------------------- http://www.boredomandlaziness.org From qrczak at knm.org.pl Fri Sep 22 13:54:19 2006 From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk) Date: Fri, 22 Sep 2006 13:54:19 +0200 Subject: [Python-3000] Delayed reference counting idea In-Reply-To: <451326D4.5030401@canterbury.ac.nz> (Greg Ewing's message of "Fri, 22 Sep 2006 11:57:08 +1200") References: <8764fl87j3.fsf@qrnik.zagroda> <422D230A-22E0-45D4-A5DC-267081CB8FEA@python.org> <87eju79aut.fsf@qrnik.zagroda> <878xkfc1tj.fsf@qrnik.zagroda> <451089D7.6060204@canterbury.ac.nz> <87hcz38367.fsf@qrnik.zagroda> <45110180.4070807@canterbury.ac.nz> <87psdqztxk.fsf@qrnik.zagroda> <4511E2C5.2020500@canterbury.ac.nz> <87slil1wux.fsf@qrnik.zagroda> <451326D4.5030401@canterbury.ac.nz> Message-ID: <87lkocxgk4.fsf@qrnik.zagroda> Greg Ewing writes: > I know. What I mean to say, I think, is can it be designed so that > there cannot be any pauses longer than there would have been if > freeing had been performed as early as possible by refcounting. The question is misleading: refcounting also causes pauses, but at different times and with different length distribution. An incremental GC generally has pauses which are incomparable to pauses of refcounting, i.e. it has longer pauses where refcounting had shorter pauses and vice versa. -- __("< Marcin Kowalczyk \__/ qrczak at knm.org.pl ^^ http://qrnik.knm.org.pl/~qrczak/ From barry at python.org Fri Sep 22 15:01:59 2006 From: barry at python.org (Barry Warsaw) Date: Fri, 22 Sep 2006 09:01:59 -0400 Subject: [Python-3000] Delayed reference counting idea In-Reply-To: <87lkocxgk4.fsf@qrnik.zagroda> References: <8764fl87j3.fsf@qrnik.zagroda> <422D230A-22E0-45D4-A5DC-267081CB8FEA@python.org> <87eju79aut.fsf@qrnik.zagroda> <878xkfc1tj.fsf@qrnik.zagroda> <451089D7.6060204@canterbury.ac.nz> <87hcz38367.fsf@qrnik.zagroda> <45110180.4070807@canterbury.ac.nz> <87psdqztxk.fsf@qrnik.zagroda> <4511E2C5.2020500@canterbury.ac.nz> <87slil1wux.fsf@qrnik.zagroda> <451326D4.5030401@canterbury.ac.nz> <87lkocxgk4.fsf@qrnik.zagroda> Message-ID: On Sep 22, 2006, at 7:54 AM, Marcin 'Qrczak' Kowalczyk wrote: > Greg Ewing writes: > >> I know. What I mean to say, I think, is can it be designed so that >> there cannot be any pauses longer than there would have been if >> freeing had been performed as early as possible by refcounting. > > The question is misleading: refcounting also causes pauses, but at > different times and with different length distribution. An incremental > GC generally has pauses which are incomparable to pauses of > refcounting, > i.e. it has longer pauses where refcounting had shorter pauses and > vice versa. Python's cyclic gc can also cause long pauses if you end up with a ton of objects in say generation 2, because it takes time just to traverse them even if they can't yet be collected.
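That is easy to watch with nothing beyond the stdlib gc module; a small demonstration (not a benchmark -- the numbers vary by machine):

import gc, time

print gc.get_threshold()              # e.g. (700, 10, 10)
junk = [[i] for i in xrange(1000000)] # lots of gc-tracked objects
gc.collect()                          # survivors settle in generation 2
start = time.time()
gc.collect(2)                         # a full pass must traverse them all,
                                      # even though nothing is collectable
print "full collection: %.3f seconds" % (time.time() - start)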
-Barry From mchermside at ingdirect.com Fri Sep 22 15:04:49 2006 From: mchermside at ingdirect.com (Chermside, Michael) Date: Fri, 22 Sep 2006 09:04:49 -0400 Subject: [Python-3000] Removing __del__ Message-ID: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com> I don't seem to have gotten anyone on board with the bold proposal to just rip __del__ out and tell people to learn to use weakrefs. But I'm hearing general agreement (at least among those contributing to this thread) that it might be wise to change the status quo. The two kinds of solutions I'm hearing are (1) those that are based around making a helper object that gets stored as an attribute in the object, or a list of weakrefs, or something like that, and (2) the __close__ proposal (or perhaps keep the name __del__ but change the semantics). The difficulties with (1) that have been acknowledged so far are that the way you code things becomes somewhat less obvious, and that there is the possibility of accidentally creating immortal objects through reference loops. I would like to hear someone address the weaknesses of (2). The first I know of is that the code in your __close__ method (or __del__) must assume that it might have been in a reference loop which was broken in some arbitrary place. As a result, it cannot assume that all references it holds are still valid. To avoid crashing the system, we'd probably have to set the broken references to None (is that the best choice?), but can people really write code that has to run assuming that its references might be invalid? A second problem I know of is, what if the code stores a reference to self someplace? The ability for __del__ methods to resurrect the object being finalized is one of the major sources of complexity in the GC module, and changing the semantics to __close__ doesn't fix this. Does anyone defending __close__ want to address these issues? -------- examples only below this line -------- Just in case it isn't clear enough, I wanted to put together some examples. First, I'll do the kind of problem that __close__ handles well: class MyClass(object): def __init__(self, resource1_name, resource2_name): self.resource1 = acquire_resource(resource1_name) self.resource2 = acquire_resource(resource2_name) def close(self): self.resource1.release() self.resource2.release() def __close__(self): self.close() This is the simplest example I could think of for an object which needs to call self.close() when it is freed in order to release resources. Now let's imagine creating a loop with such an object. x = MyClass('db1', 'db2') y = MyClass('db3', 'db4') x.next = y y.next = x In today's world, with __del__ instead of __close__ such a loop would be immortal (and the resources would never be released). And it would work fine with __close__ semantics because the __close__ method doesn't use self.next. So this one is just fine. The danger in __close__ is when something used (if only indirectly) by the __close__ method participates in the loop.
We will modify the original example by adding a flush() method which flushes the resources and calling it in close(): class MyClass2(object): def __init__(self, resource1_name, resource2_name): self.resource1 = acquire_resource(resource1_name) self.resource2 = acquire_resource(resource2_name) def flush(self): self.resource1.flush() self.resource2.flush() if hasattr(self, 'next'): self.next.flush() def close(self): self.resource1.release() self.resource2.release() def __close__(self): self.flush() self.close() x = MyClass2('db1', 'db2') y = MyClass2('db3', 'db4') x.next = y y.next = x This version will encounter a problem. When the GC sees the x <--> y loop it will break it somewhere... without loss of generality, let us say it breaks the y -> x link by setting y.next to None. Now y will be freed, so __close__ will be called. __close__ will invoke self.flush() which will then try to invoke self.next.flush(). But self.next is None, so we'll get an exception and never make it to invoking self.close(). ------ The other problem I discussed is illustrated by the following malicious code: evil_list = [] class MyEvilClass(object): def __close__(self): evil_list.append(self) Do the proponents of __close__ propose a way of prohibiting this behavior? Or do we continue to include complicated logic in the GC module to support it? I don't think anyone cares how this code behaves so long as it doesn't segfault. -- Michael Chermside From jimjjewett at gmail.com Fri Sep 22 15:53:27 2006 From: jimjjewett at gmail.com (Jim Jewett) Date: Fri, 22 Sep 2006 09:53:27 -0400 Subject: [Python-3000] Removing __del__ In-Reply-To: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com> References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com> Message-ID: On 9/22/06, Chermside, Michael wrote: > the code in your __close__ method (or > __del__) must assume that it might have been in a reference loop > which was broken in some arbitrary place. As a result, it cannot > assume that all references it holds are still valid. Most close methods already assume this; how well they defend against it varies. > A second problem I know of is, what if the code stores a reference > to self someplace? The ability for __del__ methods to resurrect > the object being finalized is one of the major sources of > complexity in the GC module, and changing the semantics to > __close__ doesn't fix this. Even if this were forbidden, __close__ could still create a new object that revived some otherwise-dead subobjects. Needing those exact subobjects (as opposed to a newly created equivalent) is the only justification I've seen for keeping the original __del__ semantics. (And even then, I think we should have __close__ as well, for the normal case.) > We will modify the original example by adding a flush() > method which flushes the resources and calling it in close(): The more careful close methods already either check a flag attribute or use try-except.
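For instance, a minimal sketch of that flag-plus-try-except idiom (using the same hypothetical acquire_resource() as the example quoted next):

class CarefulWrapper(object):
    def __init__(self, name):
        self.resource = acquire_resource(name)
        self.closed = False
    def close(self):
        if self.closed:          # the flag makes a second call harmless
            return
        self.closed = True
        try:
            self.resource.release()
        except Exception:
            pass                 # the resource may already be gone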
> class MyClass2(object): > def __init__(self, resource1_name, resource2_name): > self.resource1 = acquire_resource(resource1_name) > self.resource2 = acquire_resource(resource2_name) > def flush(self): > self.resource1.flush() > self.resource2.flush() > if hasattr(self, 'next'): > self.next.flush() Do the two resources need to be as correct as possible, or as in-sync as possible? If they need to be as correct as possible, this would be def flush(self): try: self.resource1.flush() except Exception: pass try: self.resource2.flush() except Exception: pass try: self.next.flush() # no need to check for self.next -- just eat the exception except Exception: pass Note that this is an additional motivation for exception expressions. (Or, at least, some way to write "This may fail -- I don't care" in less than four lines.) > def close(self): > self.resource1.release() > self.resource2.release() > def __close__(self): > self.flush() > self.close() If the resources instead need to be as in-sync as possible, then keep the original flush, but replace __close__ with def __close__(self): try: self.flush() except Exception: pass self.close() # exceptions here will be swallowed anyhow > The other problem I discussed is illustrated by the following > malicious code: > evil_list = [] > class MyEvilClass(object): > def __close__(self): > evil_list.append(self) > Do the proponents of __close__ propose a way of prohibiting > this behavior? Or do we continue to include complicated > logic in the GC module to support it? I don't think anyone > cares how this code behaves so long as it doesn't segfault. I'll again point to the standard library module subprocess, where MyEvilClass ~= subprocess.Popen MyEvilClass.__close__ ~= subprocess.Popen.__del__ evil_list ~= subprocess._active It does the append only conditionally -- if it is still waiting for the subprocess *and* python as a whole is not shutting down. People do care how that code behaves. If the decision is not to support it (or to require that it be written in a more complicated way), that may be a reasonable tradeoff, but there would be a cost. -jJ From rhettinger at ewtllc.com Fri Sep 22 18:26:17 2006 From: rhettinger at ewtllc.com (Raymond Hettinger) Date: Fri, 22 Sep 2006 09:26:17 -0700 Subject: [Python-3000] Removing __var Message-ID: I propose dropping the __var private name mangling trick for double-underscores. It is rarely used; it smells like a hack; it complicates introspection tools; it's not beautiful; and it is not in line with Python's spirit of "we're all consenting adults". Raymond From rhettinger at ewtllc.com Fri Sep 22 18:26:23 2006 From: rhettinger at ewtllc.com (Raymond Hettinger) Date: Fri, 22 Sep 2006 09:26:23 -0700 Subject: [Python-3000] Removing __del__ In-Reply-To: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com> Message-ID: [Michael Chermside] > I don't seem to have gotten anyone on board with the bold > proposal to just rip __del__ out and tell people to learn > to use weakrefs. But I'm hearing general agreement (at least > among those contributing to this thread) that it might be > wise to change the status quo. I'm on-board for just ripping out __del__. Is there anything vital that could be done with a __close__ method that can't already be done with a weakref callback? We aren't going to need it. FWIW, don't despair on your original bold proposal. While it's fun to free associate and generate ideas for new atrocities, I think most of your respondents are just kicking ideas around.
In the spirit of Py3k development, I recommend being quick to remove and slow to add. Let 3.0 emerge without __del__ and if strong use cases emerge, there can be a 3.1 PEP for a new magic method. I think Py3k should be as lean as possible and then build-up very slowly afterwards, emphasizing cruft-removal instead of cruft-substitution. Raymond From krstic at solarsail.hcs.harvard.edu Fri Sep 22 18:28:56 2006 From: krstic at solarsail.hcs.harvard.edu (=?UTF-8?B?SXZhbiBLcnN0acSH?=) Date: Sat, 23 Sep 2006 00:28:56 +0800 Subject: [Python-3000] Removing __var In-Reply-To: References: Message-ID: <45140F48.7000305@solarsail.hcs.harvard.edu> Raymond Hettinger wrote: > I propose dropping the __var private name mangling trick for > double-underscores. +1. -- Ivan Krstić | GPG: 0x147C722D From krstic at solarsail.hcs.harvard.edu Fri Sep 22 18:29:58 2006 From: krstic at solarsail.hcs.harvard.edu (=?UTF-8?B?SXZhbiBLcnN0acSH?=) Date: Sat, 23 Sep 2006 00:29:58 +0800 Subject: [Python-3000] Removing __del__ In-Reply-To: References: Message-ID: <45140F86.6050004@solarsail.hcs.harvard.edu> Raymond Hettinger wrote: > I'm on-board for just ripping out __del__. [...] > In the spirit of Py3k development, I recommend being quick to remove and > slow to add. Let 3.0 emerge without __del__ and if strong use cases > emerge, there can be a 3.1 PEP for a new magic method. I think Py3k > should be as lean as possible and then build-up very slowly afterwards, > emphasizing cruft-removal instead of cruft-substitution. +1, on all counts. -- Ivan Krstić | GPG: 0x147C722D From ncoghlan at gmail.com Fri Sep 22 18:49:55 2006 From: ncoghlan at gmail.com (Nick Coghlan) Date: Sat, 23 Sep 2006 02:49:55 +1000 Subject: [Python-3000] Removing __del__ In-Reply-To: References: Message-ID: <45141433.9070302@gmail.com> Raymond Hettinger wrote: > FWIW, don't despair on your original bold proposal. While it's fun to > free associate and generate ideas for new atrocities, I think most of > your respondents are just kicking ideas around. Who, us? ;) > In the spirit of Py3k development, I recommend being quick to remove and > slow to add. Let 3.0 emerge without __del__ and if strong use cases > emerge, there can be a 3.1 PEP for a new magic method. I think Py3k > should be as lean as possible and then build-up very slowly afterwards, > emphasizing cruft-removal instead of cruft-substitution. I'd be fine with this too (my suggestion for updated __del__ semantics was pure syntactic sugar for a weakref based solution), but I don't think I use __del__ enough for my vote on this particular topic to mean anything :) Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia --------------------------------------------------------------- http://www.boredomandlaziness.org From cfbolz at gmx.de Fri Sep 22 19:00:36 2006 From: cfbolz at gmx.de (Carl Friedrich Bolz) Date: Fri, 22 Sep 2006 19:00:36 +0200 Subject: [Python-3000] Removing __del__ In-Reply-To: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com> References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com> Message-ID: <451416B4.1060201@gmx.de> Chermside, Michael wrote: [snip] > The other problem I discussed is illustrated by the following > malicious code: > > evil_list = [] > > class MyEvilClass(object): > def __close__(self): > evil_list.append(self) > > > > Do the proponents of __close__ propose a way of prohibiting > this behavior? Or do we continue to include complicated > logic in the GC module to support it?
> I don't think anyone cares how this code behaves so long as it doesn't segfault. I still think a rather nice solution would be to guarantee to call __del__ (or __close__ or whatever) only once, as was discussed earlier: http://mail.python.org/pipermail/python-dev/2005-August/055251.html It solves all sorts of nasty problems with resurrection and cyclic GC and it is the semantics you already get when using Jython and PyPy (maybe IronPython too, I don't know how GC is handled in the CLR). Now the implementation side of this is more messy, especially with refcounting. You would need a way to store whether the object was already finalized. I think you could steal one bit of the refcounting field to store this information (and still have a very fast check for whether the rest of the refcounting field is really zero, if the correct bit is chosen). Cheers, Carl Friedrich Bolz From tanzer at swing.co.at Fri Sep 22 19:05:51 2006 From: tanzer at swing.co.at (Christian Tanzer) Date: Fri, 22 Sep 2006 19:05:51 +0200 Subject: [Python-3000] Removing __var In-Reply-To: Your message of "Fri, 22 Sep 2006 09:26:17 PDT." Message-ID: "Raymond Hettinger" wrote: > I propose dropping the __var private name mangling trick for > double-underscores. > > It is rarely used; it smells like a hack; it complicates introspection > tools; it's not beautiful; and it is not in line with Python's spirit of > "we're all consenting adults". It is useful in some situations, though. In particular, I use a metaclass that sets `__super` to the right value. This wouldn't work without name mangling. -- Christian Tanzer http://www.c-tanzer.at/ From fdrake at acm.org Fri Sep 22 19:29:17 2006 From: fdrake at acm.org (Fred L. Drake, Jr.) Date: Fri, 22 Sep 2006 13:29:17 -0400 Subject: [Python-3000] Removing __var In-Reply-To: References: Message-ID: <200609221329.17334.fdrake@acm.org> On Friday 22 September 2006 13:05, Christian Tanzer wrote: > It is useful in some situations, though. In particular, I use a > metaclass that sets `__super` to the right value. This wouldn't work > without name mangling. This also doesn't work if two classes in the inheritance hierarchy have the same __name__, if I understand how you're using this. My guess is that you're using calls like def doSomething(self, arg): self.__super.doSomething(arg + 1) -Fred -- Fred L. Drake, Jr. From jimjjewett at gmail.com Fri Sep 22 19:52:17 2006 From: jimjjewett at gmail.com (Jim Jewett) Date: Fri, 22 Sep 2006 13:52:17 -0400 Subject: [Python-3000] Removing __del__ In-Reply-To: <451416B4.1060201@gmx.de> References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com> <451416B4.1060201@gmx.de> Message-ID: On 9/22/06, Carl Friedrich Bolz wrote: > I still think a rather nice solution would be to guarantee to call > __del__ (or __close__ or whatever) only once, as was discussed earlier: How does this help? It doesn't say how to resolve cycles.
Saying "We don't make any promises regarding revival" just leads to inconsistency, and make the bugs subtler. The advantage of the __close__ semantics is that it greatly reduces the number of unbreakable cycles; this still doesn't avoid corner cases, but it simplifies the average case, and therefore the typical user experience. -jJ From bob at redivi.com Fri Sep 22 20:02:19 2006 From: bob at redivi.com (Bob Ippolito) Date: Fri, 22 Sep 2006 11:02:19 -0700 Subject: [Python-3000] Removing __var In-Reply-To: <200609221329.17334.fdrake@acm.org> References: <200609221329.17334.fdrake@acm.org> Message-ID: <6a36e7290609221102r757fa9e2r7bdf32b2e31f6eb1@mail.gmail.com> On 9/22/06, Fred L. Drake, Jr. wrote: > On Friday 22 September 2006 13:05, Christian Tanzer wrote: > > It is useful in some situations, though. In particular, I use a > > metaclass that sets `__super` to the right value. This wouldn't work > > without name mangling. > > This also doesn't work if two classes in the inheritance hierarchy have the > same __name__, if I understand how you're using this. My guess is that > you're using calls like > > def doSomething(self, arg): > self.__super.doSomething(arg + 1) In the one or two situations where it "is useful" you could always write out what it would've done. self._ThisClass__super.doSomething(arg + 1) -bob From theller at python.net Fri Sep 22 20:19:52 2006 From: theller at python.net (Thomas Heller) Date: Fri, 22 Sep 2006 20:19:52 +0200 Subject: [Python-3000] Removing __var In-Reply-To: <6a36e7290609221102r757fa9e2r7bdf32b2e31f6eb1@mail.gmail.com> References: <200609221329.17334.fdrake@acm.org> <6a36e7290609221102r757fa9e2r7bdf32b2e31f6eb1@mail.gmail.com> Message-ID: Bob Ippolito schrieb: > On 9/22/06, Fred L. Drake, Jr. wrote: >> On Friday 22 September 2006 13:05, Christian Tanzer wrote: >> > It is useful in some situations, though. In particular, I use a >> > metaclass that sets `__super` to the right value. This wouldn't work >> > without name mangling. >> >> This also doesn't work if two classes in the inheritance hierarchy have the >> same __name__, if I understand how you're using this. My guess is that >> you're using calls like >> >> def doSomething(self, arg): >> self.__super.doSomething(arg + 1) > > In the one or two situations where it "is useful" you could always > write out what it would've done. > > self._ThisClass__super.doSomething(arg + 1) It is much more verbose, though. The question is are you writing this more often, or are you introspecting more often? Thomas From bob at redivi.com Fri Sep 22 20:31:16 2006 From: bob at redivi.com (Bob Ippolito) Date: Fri, 22 Sep 2006 11:31:16 -0700 Subject: [Python-3000] Removing __var In-Reply-To: References: <200609221329.17334.fdrake@acm.org> <6a36e7290609221102r757fa9e2r7bdf32b2e31f6eb1@mail.gmail.com> Message-ID: <6a36e7290609221131v55d52401s59a74e55b8575376@mail.gmail.com> On 9/22/06, Thomas Heller wrote: > Bob Ippolito schrieb: > > On 9/22/06, Fred L. Drake, Jr. wrote: > >> On Friday 22 September 2006 13:05, Christian Tanzer wrote: > >> > It is useful in some situations, though. In particular, I use a > >> > metaclass that sets `__super` to the right value. This wouldn't work > >> > without name mangling. > >> > >> This also doesn't work if two classes in the inheritance hierarchy have the > >> same __name__, if I understand how you're using this. 
My guess is that > >> you're using calls like > >> > >> def doSomething(self, arg): > >> self.__super.doSomething(arg + 1) > > > > In the one or two situations where it "is useful" you could always > > write out what it would've done. > > > > self._ThisClass__super.doSomething(arg + 1) > > It is much more verbose, though. The question is: are you writing > this more often, or are you introspecting more often? The point is that legitimate __ usage is supposedly so rare that this verbosity doesn't matter. If it's verbose, people definitely won't use it until they need to, whereas right now people do it all the time because it's "private". -bob From brett at python.org Fri Sep 22 21:06:33 2006 From: brett at python.org (Brett Cannon) Date: Fri, 22 Sep 2006 12:06:33 -0700 Subject: [Python-3000] Removing __del__ In-Reply-To: References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com> Message-ID: On 9/22/06, Raymond Hettinger wrote: > > [Michael Chermside] > > I don't seem to have gotten anyone on board with the bold > > proposal to just rip __del__ out and tell people to learn > > to use weakrefs. But I'm hearing general agreement (at least > > among those contributing to this thread) that it might be > > wise to change the status quo. > > I'm on-board for just ripping out __del__. Same here. I have just been too busy with other stuff to make this thread a priority, partially because I still remember when Tim proposed this and said there was something slightly off with the way weakrefs worked for it to be the perfect solution. -Brett -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.python.org/pipermail/python-3000/attachments/20060922/1f490b48/attachment-0001.html From rasky at develer.com Sat Sep 23 00:16:58 2006 From: rasky at develer.com (Giovanni Bajo) Date: Sat, 23 Sep 2006 00:16:58 +0200 Subject: [Python-3000] Removing __del__ References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com> Message-ID: <008901c6de94$d2072ed0$4bbd2997@bagio> Raymond Hettinger wrote: > Is there anything vital that could be done with a __close__ method > that can't already be done with a weakref callback? We aren't going > to need it. It can't be done with the same cleanness and ease. It will require more convoluted and complex code. It will require people to understand weakrefs in the first place. Did you actually read my posts where I have shown some legitimate use cases of __del__ which can't be substituted with short and elegant enough code? Giovanni Bajo From cfbolz at gmx.de Fri Sep 22 20:45:27 2006 From: cfbolz at gmx.de (Carl Friedrich Bolz) Date: Fri, 22 Sep 2006 20:45:27 +0200 Subject: [Python-3000] Removing __del__ In-Reply-To: References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com> <451416B4.1060201@gmx.de> Message-ID: <45142F47.4050600@gmx.de> Jim Jewett wrote: > On 9/22/06, Carl Friedrich Bolz wrote: >> I still think a rather nice solution would be to guarantee to call >> __del__ (or __close__ or whatever) only once, as was discussed earlier: > > How does this help? It helps by removing many corner cases in the GC that come from objects reviving themselves (and putting themselves into a cycle, for example). It makes reviving an object perfectly ok, since the strange things start to happen when an object continuously revives itself again and again. > It doesn't say how to resolve cycles.
This cycle problem is the cause > of much implementation complexity and most user frustration (because > the method doesn't get called). But the above proposal is independent of how cycles with finalizers get resolved. We could still say that it does so in an arbitrary order. My point is more that just allowing objects to be finalized in arbitrary order does not solve the problem of objects continuously reviving themselves. > Once-only does prevent objects from usefully reviving them*selves*, > but it doesn't prevent them from creating a revived copy. Since you > still have to start at the top of a tree, they can even reuse > otherwise-dead subobjects -- which keeps most of the rest of the > complexity. > > And to be honest, I'm not sure you *can* remove the complexity, so > much as you can move it. Enforcing no-revival-even-of-subobjects is > the same tricky maze in reverse. Saying "We don't make any promises > regarding revival" just leads to inconsistency, and makes the bugs > subtler. > > The advantage of the __close__ semantics is that it greatly reduces > the number of unbreakable cycles; this still doesn't avoid corner > cases, but it simplifies the average case, and therefore the typical > user experience. See above. Calling __del__ once is an independent issue from how to break cycles. Cheers, Carl Friedrich Bolz From rhettinger at ewtllc.com Sat Sep 23 01:24:48 2006 From: rhettinger at ewtllc.com (Raymond Hettinger) Date: Fri, 22 Sep 2006 16:24:48 -0700 Subject: [Python-3000] Removing __del__ In-Reply-To: <023701c6dc34$8a79dc50$a14c2597@bagio> References: <20060919053609.vp8duwukq7sw4w48@login.werra.lunarpages.com> <023701c6dc34$8a79dc50$a14c2597@bagio> Message-ID: <451470C0.8060903@ewtllc.com> Giovanni Bajo wrote: >I don't use __del__ much. I use it only in leaf classes, where it surely can't >be part of loops. In those rare cases, it's very useful to me. For instance, I >have a small class which wraps an existing handle-based C API exported to >Python. Something along the lines of: > >class Wrapper: > def __init__(self, *args): > self.handle = CAPI.init(*args) > > def __del__(self, *args): > CAPI.close(self.handle) > > def foo(self): > CAPI.foo(self.handle) > >The real class isn't much longer than this (really). How do you propose to >write this same code without __del__? > > Use weakref and apply the usual idioms for the callbacks: class Wrapper: def __init__(self, *args): self.handle = CAPI.init(*args) self._wr = weakref.ref(self, lambda wr, h=self.handle: CAPI.close(h)) def foo(self): CAPI.foo(self.handle) Raymond From aahz at pythoncraft.com Sat Sep 23 01:56:02 2006 From: aahz at pythoncraft.com (Aahz) Date: Fri, 22 Sep 2006 16:56:02 -0700 Subject: [Python-3000] Removing __del__ In-Reply-To: <008901c6de94$d2072ed0$4bbd2997@bagio> References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com> <008901c6de94$d2072ed0$4bbd2997@bagio> Message-ID: <20060922235602.GA3427@panix.com> On Sat, Sep 23, 2006, Giovanni Bajo wrote: > > Did you actually read my posts where I have shown some legitimate use > cases of __del__ which can't be substituted with short and elegant > enough code? The question is whether those use cases are frequent enough -- especially for less-than-wizard programmers -- to warrant keeping __del__ around.
-- Aahz (aahz at pythoncraft.com) <*> http://www.pythoncraft.com/ "LL YR VWL R BLNG T S" -- www.nancybuttons.com From bob at redivi.com Sat Sep 23 02:35:50 2006 From: bob at redivi.com (Bob Ippolito) Date: Fri, 22 Sep 2006 17:35:50 -0700 Subject: [Python-3000] Removing __del__ In-Reply-To: <20060922235602.GA3427@panix.com> References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com> <008901c6de94$d2072ed0$4bbd2997@bagio> <20060922235602.GA3427@panix.com> Message-ID: <6a36e7290609221735hcbd3df2ne41406323ce5fd72@mail.gmail.com> On 9/22/06, Aahz wrote: > On Sat, Sep 23, 2006, Giovanni Bajo wrote: > > > > Did you actually read my posts where I have shown some legitimate use > > cases of __del__ which can't be substituted with short and elegant > > enough code? > > The question is whether those use cases are frequent enough -- especially > for less-than-wizard programmers -- to warrant keeping __del__ around. I still haven't seen one that can't be done pretty trivially with a weakref. Perhaps the solution is to make doing cleanup-by-weakref easier or more obvious? Something like this maybe: import weakref class GarbageDisposal: def __init__(self): self.refs = set() def __call__(self, object, func, *args, **kw): def cleanup(ref): self.refs.remove(ref) func(*args, **kw) self.refs.add(weakref.ref(object, cleanup)) on_cleanup = GarbageDisposal() class Wrapper: def __init__(self, *args): self.handle = CAPI.init(*args) on_cleanup(self, CAPI.close, self.handle) def foo(self): CAPI.foo(self.handle) -bob From greg.ewing at canterbury.ac.nz Sat Sep 23 04:08:11 2006 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Sat, 23 Sep 2006 14:08:11 +1200 Subject: [Python-3000] Removing __del__ In-Reply-To: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com> References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com> Message-ID: <4514970B.7070806@canterbury.ac.nz> Chermside, Michael wrote: > I don't seem to have gotten anyone on board with the bold proposal > to just rip __del__ out and tell people to learn to use weakrefs. Well, I'd be in favour of it. I've argued something similar in the past, without much success then either. -- Greg From greg.ewing at canterbury.ac.nz Sat Sep 23 04:20:53 2006 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Sat, 23 Sep 2006 14:20:53 +1200 Subject: [Python-3000] Removing __del__ In-Reply-To: References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com> Message-ID: <45149A05.4030409@canterbury.ac.nz> Brett Cannon wrote: > I still remember when Tim proposed > this and said there was something slightly off with the way weakrefs > worked for it to be the perfect solution. If that's true, it might be better to concentrate on fixing this problem so that weakrefs can be used, rather than trying to patch up __del__. -- Greg From rasky at develer.com Sat Sep 23 10:01:32 2006 From: rasky at develer.com (Giovanni Bajo) Date: Sat, 23 Sep 2006 10:01:32 +0200 Subject: [Python-3000] Removing __del__ References: <20060919053609.vp8duwukq7sw4w48@login.werra.lunarpages.com> <023701c6dc34$8a79dc50$a14c2597@bagio> <451470C0.8060903@ewtllc.com> Message-ID: <02ac01c6dee6$7bbee430$4bbd2997@bagio> Raymond Hettinger wrote: >> I don't use __del__ much. I use it only in leaf classes, where it >> surely can't be part of loops. In those rare cases, it's very useful >> to me. For instance, I have a small class which wraps an existing >> handle-based C API exported to Python.
Something along the lines of: >> >> class Wrapper: >> def __init__(self, *args): >> self.handle = CAPI.init(*args) >> >> def __del__(self, *args): >> CAPI.close(self.handle) >> >> def foo(self): >> CAPI.foo(self.handle) >> >> The real class isn't much longer than this (really). How do you >> propose to write this same code without __del__? >> >> > Use weakref and apply the usual idioms for the callbacks: > > class Wrapper: > def __init__(self, *args): > self.handle = CAPI.init(*args) > self._wr = weakref.ref(self, lambda wr, h=self.handle: > CAPI.close(h)) > > def foo(self): > CAPI.foo(self.handle) What happens if self.handle changes? Or if it's closed, so that the weakref should be destroyed? You will have to keep track of _wr everywhere across the class code. You're proposing to remove a simple method that is easy to use and explain, but that can cause complex problems in some cases (cycles). The alternative is a complex finalization system, which uses weakrefs, delayed function calls, and must be written smartly to avoid keeping references to "self". I don't see this as progress. On the other hand, __close__ is easy to understand, maintain, and would solve one problem of __del__. I think what Python 2.x really needs is a better way (library + interpreter support) to debug cycles (both collectable and uncollectable, as the former can be just as bad as the latter in real-time applications). Removing __del__ just complicates real-world use cases without providing a comprehensive solution to the problem. Giovanni Bajo From rasky at develer.com Sat Sep 23 11:13:41 2006 From: rasky at develer.com (Giovanni Bajo) Date: Sat, 23 Sep 2006 11:13:41 +0200 Subject: [Python-3000] Removing __del__ References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com><008901c6de94$d2072ed0$4bbd2997@bagio> <20060922235602.GA3427@panix.com> Message-ID: <035f01c6def0$900594c0$4bbd2997@bagio> Aahz wrote: >> Did you actually read my posts where I have shown some legitimate use >> cases of __del__ which can't be substituted with short and elegant >> enough code? > > The question is whether those use cases are frequent enough -- > especially for less-than-wizard programmers -- to warrant keeping > __del__ around. What I am basically against is the idea of removing an easy syntax which can have problematic side effects if you are not adult enough, in favor of a complicated library workaround which requires deeper knowledge of Python (weakrefs, lambdas, early binding of default arguments, just to name three), and can cause side effects just as bad if you are not adult enough. Where's the trade-off? On the other hand, __close__ (out-of-order, recallable __del__) fixes some issues of __del__, it is easy to teach and understand, it is easy to write. And, if we could (optionally) raise a RuntimeError as soon as an object with __del__ enters a loop, would your opinion about it be different?
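Part of that debugging support exists today; a small sketch of spotting uncollectable __del__ cycles with the current gc module (CPython 2.x behaviour):

import gc

class Leaky(object):
    def __del__(self):
        pass

a = Leaky(); b = Leaky()
a.partner = b; b.partner = a          # a cycle between __del__ objects
del a, b

gc.set_debug(gc.DEBUG_UNCOLLECTABLE)  # report objects that can't be freed
gc.collect()
print gc.garbage                      # the two Leaky instances end up here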
Giovanni Bajo From rasky at develer.com Sat Sep 23 11:18:48 2006 From: rasky at develer.com (Giovanni Bajo) Date: Sat, 23 Sep 2006 11:18:48 +0200 Subject: [Python-3000] Removing __del__ References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com><008901c6de94$d2072ed0$4bbd2997@bagio><20060922235602.GA3427@panix.com> <6a36e7290609221735hcbd3df2ne41406323ce5fd72@mail.gmail.com> Message-ID: <039d01c6def1$46df1ef0$4bbd2997@bagio> Bob Ippolito wrote: > import weakref > > class GarbageDisposal: > def __init__(self): > self.refs = set() > > def __call__(self, object, func, *args, **kw): > def cleanup(ref): > self.refs.remove(ref) > func(*args, **kw) > self.refs.add(weakref.ref(object, cleanup)) > > on_cleanup = GarbageDisposal() > > class Wrapper: > def __init__(self, *args): > self.handle = CAPI.init(*args) > on_cleanup(self, CAPI.close, self.handle) > > def foo(self): > CAPI.foo(self.handle) Try with this: class Wrapper2: def __init__(self, *args): self.handle = CAPI.init(*args) def foo(self): CAPI.foo(self.handle) def restart(self): self.handle = CAPI.restart(self.handle) def close(self): CAPI.close(self.handle) self.handle = None def __del__(self): if self.handle is not None: self.close() Giovanni Bajo From bob at redivi.com Sat Sep 23 11:22:07 2006 From: bob at redivi.com (Bob Ippolito) Date: Sat, 23 Sep 2006 02:22:07 -0700 Subject: [Python-3000] Removing __del__ In-Reply-To: <039d01c6def1$46df1ef0$4bbd2997@bagio> References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com> <008901c6de94$d2072ed0$4bbd2997@bagio> <20060922235602.GA3427@panix.com> <6a36e7290609221735hcbd3df2ne41406323ce5fd72@mail.gmail.com> <039d01c6def1$46df1ef0$4bbd2997@bagio> Message-ID: <6a36e7290609230222w1fe8dfaam4780a1fd81481cd0@mail.gmail.com> On 9/23/06, Giovanni Bajo wrote: > Bob Ippolito wrote: > > > import weakref > > > > class GarbageDisposal: > > def __init__(self): > > self.refs = set() > > > > def __call__(self, object, func, *args, **kw): > > def cleanup(ref): > > self.refs.remove(ref) > > func(*args, **kw) > > self.refs.add(weakref.ref(object, cleanup)) > > > > on_cleanup = GarbageDisposal() > > > > class Wrapper: > > def __init__(self, *args): > > self.handle = CAPI.init(*args) > > on_cleanup(self, CAPI.close, self.handle) > > > > def foo(self): > > CAPI.foo(self.handle) > > Try with this: > > class Wrapper2: > def __init__(self, *args): > self.handle = CAPI.init(*args) > > def foo(self): > CAPI.foo(self.handle) > > def restart(self): > self.handle = CAPI.restart(self.handle) > > def close(self): > CAPI.close(self.handle) > self.handle = None > > def __del__(self): > if self.handle is not None: > self.close() I've never seen an API that works like that. Have you? 
-bob From rasky at develer.com Sat Sep 23 11:39:20 2006 From: rasky at develer.com (Giovanni Bajo) Date: Sat, 23 Sep 2006 11:39:20 +0200 Subject: [Python-3000] Removing __del__ References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com> <008901c6de94$d2072ed0$4bbd2997@bagio> <20060922235602.GA3427@panix.com> <6a36e7290609221735hcbd3df2ne41406323ce5fd72@mail.gmail.com> <039d01c6def1$46df1ef0$4bbd2997@bagio> <6a36e7290609230222w1fe8dfaam4780a1fd81481cd0@mail.gmail.com> Message-ID: <03bb01c6def4$257b6c70$4bbd2997@bagio> Bob Ippolito wrote: >> class Wrapper2: >> def __init__(self, *args): >> self.handle = CAPI.init(*args) >> >> def foo(self): >> CAPI.foo(self.handle) >> >> def restart(self): >> self.handle = CAPI.restart(self.handle) >> >> def close(self): >> CAPI.close(self.handle) >> self.handle = None >> >> def __del__(self): >> if self.handle is not None: >> self.close() > > I've never seen an API that works like that. Have you? The class above shows a case where: 1) There's a way to destruct the handle BEFORE __del__ is called, which would require killing the weakref / deregistering the finalization hook. I believe you agree that this is pretty common (I have around 10 usages of this pattern, __del__ with a separate explicit closure method, in one Python code base of mine). 2) The objects required in the destructor can be mutated / changed during the lifetime of the instance. For instance, a class that wraps Win32 FindFirstFile/FindNextFile and supports transparent directory recursion needs something similar. Or CreateToolhelp32Snapshot() with the Module32First/Next stuff. Another example is a class which creates named temporary files and needs to remove them on finalization. It might need to create several different temporary files (say, self.handle is the filename in that case)[1], so the filename needed in the destructor changes during the lifetime of the instance. #2 is admittedly more convoluted (and probably more rare) than #1, but it's still a reasonable use case which really you can't easily do with a simple finalization API like the one you were proposing. Python is Turing-complete without __del__, but in some cases the alternatives are *really* worse. Giovanni Bajo [1] tempfile.NamedTemporaryFile can't always be used because it does not guarantee that the file can be reopened; for instance, zipfile.ZipFile() wants a filename, so if you want to create a temporary ZipFile you can't use tempfile.NamedTemporaryFile. From martin at v.loewis.de Sat Sep 23 13:33:14 2006 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Sat, 23 Sep 2006 13:33:14 +0200 Subject: [Python-3000] How will unicode get used? In-Reply-To: References: <20060920083244.0817.JCARLSON@uci.edu> Message-ID: <45151B7A.50000@v.loewis.de> Adam Olsen schrieb: > Just a minor nit. I doubt we could accept UCS-2, we'd want UTF-16 > instead, with all the variable-width goodness that brings in. Sure we could; we can currently. > Or maybe not so minor. Old versions of windows used UCS-2, new > versions use UTF-16. The former should get errors if too high of a > character is used, the latter will need conversion if we're not using > UTF-16. Define "used". Surrogate pairs work well in the NTFS of Windows NT 3.1; no errors are reported. Regards, Martin From martin at v.loewis.de Sat Sep 23 13:38:02 2006 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Sat, 23 Sep 2006 13:38:02 +0200 Subject: [Python-3000] How will unicode get used?
In-Reply-To: References: Message-ID: <45151C9A.3090908@v.loewis.de> Adam Olsen schrieb: > As far as I can tell, CPython on windows uses UTF-16 with code units. > Perhaps not intentionally, but by default (not throwing an error on > surrogates). It's intentional; that's what PEP 261 specifies. Regards, Martin From martin at v.loewis.de Sat Sep 23 13:50:36 2006 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Sat, 23 Sep 2006 13:50:36 +0200 Subject: [Python-3000] How will unicode get used? In-Reply-To: <015c01c6dd68$bb6bcfa0$e303030a@trilan> References: <20060920154655.mrv1x9ksb24g8844@login.werra.lunarpages.com> <015c01c6dd68$bb6bcfa0$e303030a@trilan> Message-ID: <45151F8C.2070304@v.loewis.de> Giovanni Bajo schrieb: > Is there a design document explaining the rationale of unicode type, the > status quo? There is a document documenting the status quo: the source code. Contributors to this thread (or, for that matter, to this mailing list) should really familiarize themselves with the source code before posting - nobody is willing to answer questions that can be answered just by looking at the source code. Now, there might be questions like "why is this or that done that way?" People are more open to answer questions like that if the poster demonstrates that he knows what the way is, and can suggest theories as to why it might be the way it is. > Any time this subject is raised on the mailing list, the net > result is "you guys don't understand unicode". Well, let us know what is > good and what is bad of the current unicode type; what is by design and what > is an implementation detail; what you want to absolutely keep, and what you > want to absolutely change. I am *really* confused about the status quo of > the unicode type (which is why I keep myself out of technical discussions on > the matter of course). Is there any desire to let people understand and join > the discussion? It's clear that there should be only a single character string type, and that should be close to the current Unicode type, in semantics and implementation. Regards, Martin From martin at v.loewis.de Sat Sep 23 14:01:56 2006 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Sat, 23 Sep 2006 14:01:56 +0200 Subject: [Python-3000] How will unicode get used? In-Reply-To: <45126E76.9020600@nekomancer.net> References: <20060920154655.mrv1x9ksb24g8844@login.werra.lunarpages.com> <45126E76.9020600@nekomancer.net> Message-ID: <45152234.1090303@v.loewis.de> Gábor Farkas schrieb: > while i understand the constraints, i think it's not a good decision to > leave this to be implementation-dependent. > > the strings seem to me such a basic functionality, that its > behaviour should not depend on the platform. > > for example, how is an application developer then supposed to write > their applications? An application developer should always know what the target platforms are. For example, does the code need to work with IronPython or not? Python is not aiming at 100% portability at all costs. Many aspects are platform dependent, and while this has complicated some applications, it has simplified others (which could make use of platform details that otherwise would not have been exposed to the Python programmer). > should he write his own slicing/whatever functions to get consistent > behaviour on linux/windows? Depends on the application, and the specific slicing operations.
If the slicing appears in the processing of .ini files (say), no platform-dependent slicing should be necessary. > i think this is not just a 'theoretical' issue. it's a very practical > issue. the only reason why it does not seem to be important is because > currently not many of the non-16-bit unicode characters are used. No, there is a deeper reason. A typical program only performs substring operations on selected boundaries (such as whitespace, or punctuation). Those are typically in the BMP (not sure whether *any* punctuation is outside the BMP). > but the same way i could say, that because most of the unix-world is > utf-8, for those pythons the best way is to handle it internally as > utf-8, couldn't i? I think you live in a free country: you can certainly say that. I think you would be wrong. The common on-disk/on-wire representation of text should not influence the design of an in-memory representation. > it simply seems to me strange to make compromises that make the life of > the cpython-users harder, just to make the life for the > jython/ironpython developers (i mean the 'creators') easier. Guido didn't say that the life of the CPython user needs to be hard. He said it will be implementation-dependent, referring to Jython and IronPython. Whether or not CPython uses a consistent representation or consistent python-level experience across platforms is a different issue. CPython could behave absolutely consistently, and use four-byte Unicode on all systems, and the length of a non-BMP string would still be implementation-defined. Regards, Martin From martin at v.loewis.de Sat Sep 23 14:09:00 2006 From: martin at v.loewis.de (=?UTF-8?B?Ik1hcnRpbiB2LiBMw7Z3aXMi?=) Date: Sat, 23 Sep 2006 14:09:00 +0200 Subject: [Python-3000] How will unicode get used? In-Reply-To: <4511E644.2030306@blueyonder.co.uk> References: <4511E644.2030306@blueyonder.co.uk> Message-ID: <451523DC.2050901@v.loewis.de> David Hopwood schrieb: >> Assuming my Unicode lingo is right and code point represents a >> letter/character/digraph/whatever, then it will be a code point. Doing one >> of my rare channels of Guido, I *really* doubt he wants to expose the >> technical details of Unicode to the point of having people need to realize >> that UTF-8 takes two bytes to represent "ö". > > The argument used here is not valid. People do need to realize that *all* > Unicode encodings are variable-length, in the sense that abstract characters > can be represented by multiple code points. Brett did not make such an argument. He made an argument that users should not need to care that "ö" in UTF-8 is two bytes. And I agree: users should not have to worry about this wrt. internal representation. > For example, "ö" can be represented either as the precomposed character U+00F6, > or as "o" followed by a combining diaeresis (U+006F U+0308). Programs must > avoid splitting sequences of code points that represent a single abstract > character. Why is that? Many programs never encounter cases where this would matter, so why do such programs have to operate correctly if that case is encountered?
> Absolutely, but not at the expense of making basic operations on strings > asymptotically less efficient. O(1) indexing and slicing is a basic > requirement, even if it has to be done using code units. It's not possible to implement slicing in constant time, unless string views are introduced. Currently, slicing takes time linear with the length of the result string. Regards, Martin From nas at arctrix.com Sat Sep 23 17:45:33 2006 From: nas at arctrix.com (Neil Schemenauer) Date: Sat, 23 Sep 2006 15:45:33 +0000 (UTC) Subject: [Python-3000] Removing __var References: <200609221329.17334.fdrake@acm.org> <6a36e7290609221102r757fa9e2r7bdf32b2e31f6eb1@mail.gmail.com> <6a36e7290609221131v55d52401s59a74e55b8575376@mail.gmail.com> Message-ID: Bob Ippolito wrote: > The point is that legitimate __ usage is supposedly so rare that this > verbosity doesn't matter. If it's verbose, people definitely won't use > it until they need to, whereas right now people do it all the time because > it's "private". It's very rare, in my experience. I vote to rip it out. Neil From qrczak at knm.org.pl Sat Sep 23 18:34:20 2006 From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk) Date: Sat, 23 Sep 2006 18:34:20 +0200 Subject: [Python-3000] Removing __del__ In-Reply-To: <03bb01c6def4$257b6c70$4bbd2997@bagio> (Giovanni Bajo's message of "Sat, 23 Sep 2006 11:39:20 +0200") References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com> <008901c6de94$d2072ed0$4bbd2997@bagio> <20060922235602.GA3427@panix.com> <6a36e7290609221735hcbd3df2ne41406323ce5fd72@mail.gmail.com> <039d01c6def1$46df1ef0$4bbd2997@bagio> <6a36e7290609230222w1fe8dfaam4780a1fd81481cd0@mail.gmail.com> <03bb01c6def4$257b6c70$4bbd2997@bagio> Message-ID: <8764fesfsj.fsf@qrnik.zagroda> "Giovanni Bajo" writes: > 1) There's a way to destruct the handle BEFORE __del__ is called, > which would require killing the weakref / deregistering the > finalization hook. Weakrefs should have a method which runs their callback and unregisters them. > 2) The objects required in the destructor can be mutated / changed > during the lifetime of the instance. For instance, a class that > wraps Win32 FindFirstFile/FindNextFile and supports transparent > directory recursion needs something similar. Listing files with transparent directory recursion can be implemented in terms of listing files of a given directory, such that a finalizer is only used with the low level object. > Another example is a class which creates named temporary files > and needs to remove them on finalization. It might need to create > several different temporary files (say, self.handle is the filename > in that case)[1], so the filename needed in the destructor changes > during the lifetime of the instance. Again: move the finalizer to a single temporary file object, and refer to such an object instead of a raw handle.
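A sketch of that refactoring (TempFile and Archiver are made-up names; the point is that each raw resource gets its own small owner object, and the finalizer travels with it):

import os, weakref

_refs = set()    # keeps the weakrefs alive so their callbacks can fire

class TempFile(object):
    # Owns exactly one temporary file.
    def __init__(self, filename):
        self.filename = filename
        def _cleanup(ref, filename=filename):
            _refs.discard(ref)
            try:
                os.remove(filename)
            except OSError:
                pass
        _refs.add(weakref.ref(self, _cleanup))

class Archiver(object):
    # The outer object just swaps TempFile instances around; it needs
    # no finalizer of its own, and the "handle changes during the
    # lifetime of the instance" problem disappears.
    def __init__(self, filename):
        self.current = TempFile(filename)
    def rotate(self, filename):
        self.current = TempFile(filename)  # the old one cleans itself up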
-- __("< Marcin Kowalczyk \__/ qrczak at knm.org.pl ^^ http://qrnik.knm.org.pl/~qrczak/ From bob at redivi.com Sat Sep 23 19:18:25 2006 From: bob at redivi.com (Bob Ippolito) Date: Sat, 23 Sep 2006 10:18:25 -0700 Subject: [Python-3000] Removing __del__ In-Reply-To: <03bb01c6def4$257b6c70$4bbd2997@bagio> References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com> <008901c6de94$d2072ed0$4bbd2997@bagio> <20060922235602.GA3427@panix.com> <6a36e7290609221735hcbd3df2ne41406323ce5fd72@mail.gmail.com> <039d01c6def1$46df1ef0$4bbd2997@bagio> <6a36e7290609230222w1fe8dfaam4780a1fd81481cd0@mail.gmail.com> <03bb01c6def4$257b6c70$4bbd2997@bagio> Message-ID: <6a36e7290609231018m65bed7edq116ab807aa5cf6f5@mail.gmail.com> On 9/23/06, Giovanni Bajo wrote: > Bob Ippolito wrote: > > >> class Wrapper2: > >> def __init__(self, *args): > >> self.handle = CAPI.init(*args) > >> > >> def foo(self): > >> CAPI.foo(self.handle) > >> > >> def restart(self): > >> self.handle = CAPI.restart(self.handle) > >> > >> def close(self): > >> CAPI.close(self.handle) > >> self.handle = None > >> > >> def __del__(self): > >> if self.handle is not None: > >> self.close() > > > > I've never seen an API that works like that. Have you? > > The class above shows a case where: > > 1) There's a way to destruct the handle BEFORE __del__ is called, which would > require killing the weakref / deregistering the finalization hook. I believe > you agree that this is pretty common (I've around 10 usages of this pattern, > __del__ with a separate explicit closure method, in one Python base-code of > mine). Easy enough, that would be a second function and the dict would change a bit. > 2) The objects required in the destructor can be mutated / changed during the > lifetime of the instance. For instance, a class that wraps Win32 > FindFirstFirst/FindFirstNext and support transparent directory recursion needs > something similar. Or CreateToolhelp32Snapshot() with the Module32First/Next > stuff. Another example is a class which creates named temporary files and needs > to remove them on finalization. It might need to create several different > temporary files (say, self.handle is the filename in that case)[1], so the > filename needed in the destructor changes during the lifetime of the instance. > > #2 is admittedly more convoluted (and probably more rare) than #1, but it's > still a reasonable use case which really you can't easily do with a simple > finalization API like the one you were proposing. Python is turing-complete > without __del__, but in some cases the alternatives are *really* worse. You can of course easily do this with a simple finalization API. Supporting this simply requires that multiple cleanup functions be allowed per object. -bob From jcarlson at uci.edu Sat Sep 23 20:03:43 2006 From: jcarlson at uci.edu (Josiah Carlson) Date: Sat, 23 Sep 2006 11:03:43 -0700 Subject: [Python-3000] How will unicode get used? In-Reply-To: <451523DC.2050901@v.loewis.de> References: <4511E644.2030306@blueyonder.co.uk> <451523DC.2050901@v.loewis.de> Message-ID: <20060923104310.0863.JCARLSON@uci.edu> "Martin v. L?wis" wrote: > David Hopwood schrieb: [snip] > > Should we nevertheless try to avoid making the use of Unicode strings > > unnecessarily difficult for people who have minimal knowledge of Unicode? > > Absolutely, but not at the expense of making basic operations on strings > > asymptotically less efficient. O(1) indexing and slicing is a basic > > requirement, even if it has to be done using code units. 
> It's not possible to implement slicing in constant time, unless string > views are introduced. Currently, slicing takes time linear with the length of the result string. I believe he was referring to discovering the memory address where slicing should begin. In the case of Latin-1, UCS-2, or UCS-4, given a starting address and some position i, it is trivial to discover the memory position of character i. In the case of UTF-8, given a starting address and some position i, one needs to somewhat parse the UTF-8 representation to discover the memory position of character i. For me, having recently remembered what was in a unicode string, and verifying it by checking the source, the question in my mind is whether we want to stick with the same 2-representation implementation (default encoding and UTF-16 or UCS-4 depending on build), or go with more or fewer representations. We can reduce memory consumption by using a single representation, whether it be constant or variable based on content, though in some cases (utf-16, ucs-4) we would lose the 'native' single-segment char (C char) buffer interface. Using multiple representations, and choosing those representations carefully based on platform (always keep utf-8 as one of the representations on linux, always keep utf-16 as one of the representations in Windows), we may be able to increase platform API calling speed, if such is desirable. After re-reading the source and thinking a bit more, just about my only real concern is the memory use of Python 3.x.
> The current implementation works, so I'm +1 on keeping it "as is", but
> I'm also +0 on some implementation that would reduce memory use (with
> limited, if any, slowdown) for as many platforms as possible, not any
> higher because changing the underlying implementation would be a PITA.

I think supporting multiple representations at run-time would really be
terrible. Any API of the "give me the data" kind would either have to
expose the choice of representations, or perform a copy. Either
alternative would produce many programming errors in extension modules.

Regards,
Martin

From rasky at develer.com Sun Sep 24 02:04:36 2006
From: rasky at develer.com (Giovanni Bajo)
Date: Sun, 24 Sep 2006 02:04:36 +0200
Subject: [Python-3000] __close__ method
References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com>
Message-ID: <07f701c6df6d$05b47660$4bbd2997@bagio>

Michael, many thanks for your interesting mail, which pointed out the
outcome of the previous thread. Let me try to answer some of your
questions about __close__.

> But I'm hearing general agreement (at least among those contributing
> to this thread) that it might be wise to change the status quo.

Status quo of __del__:

Pros:
- Easy syntax: very simple to use in easy situations.
- Easy semantics: familiar to beginners (similarity with other programming
languages), and being the "opposite" of __init__ makes it easy to teach.

Cons:
- Makes reference loops uncollectable -> people learn fast to avoid it in
most classes
- Allows resurrection, which is a headache for Python developers

> The two kinds of solutions I'm hearing are (1) those that are based
> around making a helper object that gets stored as an attribute in
> the object, or a list of weakrefs, or something like that, and (2)
> the __close__ proposal (or perhaps keep the name __del__ but change
> the semantics).
>
> The difficulties with (1) that have been acknowledged so far are
> that the way you code things becomes somewhat less obvious, and
> that there is the possibility of accidentally creating immortal
> objects through reference loops.

Exactly. To code these finalizers correctly, you need to be much more
Python-savvy than you need to be to use __del__, because you need to
understand and somehow master:

- weakrefs
- early binding of default arguments of functions

which are not exactly the two brightest areas of Python.

[ (2) the __close__ proposal ]
> I would like to hear someone address the weaknesses of (2).
> The first I know of is that the code in your __close__ method (or
> __del__) must assume that it might have been in a reference loop
> which was broken in some arbitrary place. As a result, it cannot
> assume that all references it holds are still valid. To avoid
> crashing the system, we'd probably have to set the broken
> references to None (is that the best choice?), but can people
> really write code that has to run assuming that its references
> might be invalid?

I might be wrong, but given the constraint that __close__ could be called
multiple times for the same object, I don't see how this situation might
appear. The cyclic GC could:

1) call __close__ on the instances *BEFORE* dropping the references. The
code in __close__ could break the cycle itself.
2) only after that, assume that __close__ did not dispose anything related
to the loop itself, and thus drop a random reference in the chain. This
would cause other calls to __close__ on the instances, which should result
in basically no-ops since they have already been executed.
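To make the "no-op on second call" behaviour concrete, here is a minimal
sketch; __close__ is the hook under discussion, not an existing method, and
the Handle class, the _closed flag and the release() stand-in are
illustrative only:

def release(handle):
    # Stand-in for the real cleanup, e.g. a C API close call.
    print("released %r" % (handle,))

class Handle(object):
    def __init__(self, handle):
        self.handle = handle
        self._closed = False    # the per-instance "already closed" bit

    def __close__(self):
        if self._closed:        # second and later calls are no-ops
            return
        self._closed = True
        release(self.handle)
        self.handle = None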
BTW: would it be possible to "nullify" the __close__ method after it has
been executed once somehow, so that it won't get executed twice on the
same instance? A single bit in the instance (with the meaning of "already
closed") should be sufficient. If this is possible, then the above
algorithm is easier to implement, and it also makes __close__ methods
easier to implement.

> A second problem I know of is, what if the code stores a reference
> to self someplace? The ability for __del__ methods to resurrect
> the object being finalized is one of the major sources of
> complexity in the GC module, and changing the semantics to
> __close__ doesn't fix this.

I don't think __close__ can solve this problem, in fact. I don't
specifically consider it a weakness of __close__, strictly speaking,
though.

> -------- examples only below this line --------
>
> class MyClass2(object):
>     def __init__(self, resource1_name, resource2_name):
>         self.resource1 = acquire_resource(resource1_name)
>         self.resource2 = acquire_resource(resource2_name)
>     def flush(self):
>         self.resource1.flush()
>         self.resource2.flush()
>         if hasattr(self, 'next'):
>             self.next.flush()
>     def close(self):
>         self.resource1.release()
>         self.resource2.release()
>     def __close__(self):
>         self.flush()
>         self.close()
>
> x = MyClass2('db1', 'db2')
> y = MyClass2('db3', 'db4')
> x.next = y
> y.next = x
>
> This version will encounter a problem. When the GC sees
> the x <--> y loop it will break it somewhere... without
> loss of generality, let us say it breaks the y -> x link
> by setting y.next to None. Now y will be freed, so
> __close__ will be called. __close__ will invoke self.flush()
> which will then try to invoke self.next.flush(). But
> self.next is None, so we'll get an exception and never
> make it to invoking self.close().

With my algorithm, the following things will happen:

0) I assume that the resources can be flushed() even after having been
released() without causing weird exceptions... Otherwise the code should
be more defensive, and delete the references to the resources after
disposal.
1) GC will first call __close__ on either instance (let's say x). This
would close the instance by releasing the resources. x is marked as
"already closed". y.flush() is invoked.
2) GC will then call __close__ on y. This would release y's resources, and
invoke x.flush(). x.flush() would either have no side-effects, or be
defensively coded against resource1/resource2 being None (since the
resources of x have been already disposed at step 1).
3) The loop was not broken, so GC will drop a random reference. Let's say
it breaks the y -> x link. This causes x to be disposed. x is marked as
"already closed" so __close__ is not invoked. During disposal, the
reference to y held in x.next is dropped.
4) y is disposed. It's marked as "already closed" so __close__ is not
invoked.

> ------
>
> The other problem I discussed is illustrated by the following
> malicious code:
>
> evil_list = []
>
> class MyEvilClass(object):
>     def __close__(self):
>         evil_list.append(self)
>
> Do the proponents of __close__ propose a way of prohibiting
> this behavior? Or do we continue to include complicated
> logic in the GC module to support it? I don't think anyone
> cares how this code behaves so long as it doesn't segfault.

I can see how this can confuse the GC, but I really don't know the
details. I don't have any proposal as to how to avoid this situation.
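As a concrete reading of step 0, here is one defensive variant of the
quoted MyClass2 (a sketch only: acquire_resource() is stubbed out so it
runs, and the next.flush() chaining is omitted, because with the x <--> y
cycle above it would recurse without bound):

def acquire_resource(name):
    # Toy stand-in so the sketch is self-contained.
    class Resource(object):
        def flush(self):
            print("flush %s" % name)
        def release(self):
            print("release %s" % name)
    return Resource()

class MyClass2Defensive(object):
    def __init__(self, name1, name2):
        self.resource1 = acquire_resource(name1)
        self.resource2 = acquire_resource(name2)
    def flush(self):
        # References to released resources are None, so flushing after
        # (or during) disposal is a harmless no-op in any order.
        for res in (self.resource1, self.resource2):
            if res is not None:
                res.flush()
    def close(self):
        if self.resource1 is not None:
            self.resource1.release()
            self.resource1 = None
        if self.resource2 is not None:
            self.resource2.release()
            self.resource2 = None
    def __close__(self):
        self.flush()
        self.close()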
Giovanni Bajo

From jimjjewett at gmail.com Sun Sep 24 02:38:40 2006
From: jimjjewett at gmail.com (Jim Jewett)
Date: Sat, 23 Sep 2006 20:38:40 -0400
Subject: [Python-3000] Removing __del__
In-Reply-To: <6a36e7290609221735hcbd3df2ne41406323ce5fd72@mail.gmail.com>
References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com>
	<008901c6de94$d2072ed0$4bbd2997@bagio>
	<20060922235602.GA3427@panix.com>
	<6a36e7290609221735hcbd3df2ne41406323ce5fd72@mail.gmail.com>
Message-ID:

On 9/22/06, Bob Ippolito wrote:

> I still haven't seen one that can't be done pretty trivially
> with a weakref. Perhaps the solution is to make
> doing cleanup-by-weakref easier or more obvious?

Possibly, but I've tried, and *I* couldn't come up with any way to use
them that was (1) generic enough to put in a module, rather than a recipe,
(2) easy enough to still be an improvement, and (3) correct.

>     def __call__(self, object, func, *args, **kw):
>         def cleanup(ref):
>             self.refs.remove(ref)
>             func(*args, **kw)
>         self.refs.add(weakref.ref(object, cleanup))

Now remember something like Michael Chermside's "simplest" example, where
you need to flush before closing. The obvious way is to pass self.close,
but it doesn't actually work. Because it is a bound method, it silently
makes the object effectively immortal. The "correct" way is to write
another function which is basically an awkward copy of self.close.

At the moment, I can't think of any *good* way to ensure access to
self.resource1 and self.resource2, but not to self. All the workarounds I
can come up with make __del__ look pretty good, from a maintenance
standpoint.

-jJ

From greg.ewing at canterbury.ac.nz Sun Sep 24 03:12:05 2006
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Sun, 24 Sep 2006 13:12:05 +1200
Subject: [Python-3000] Removing __del__
In-Reply-To: <035f01c6def0$900594c0$4bbd2997@bagio>
References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com>
	<008901c6de94$d2072ed0$4bbd2997@bagio>
	<20060922235602.GA3427@panix.com>
	<035f01c6def0$900594c0$4bbd2997@bagio>
Message-ID: <4515DB65.2090505@canterbury.ac.nz>

Giovanni Bajo wrote:

> What I am basically against is the need of removing an easy syntax which can
> have problematic side effects if you are not adult enough,

So what are you saying, that people who aren't adult enough should be
given a tool that's nice and easy but leads them to write buggy code?
That doesn't seem like a responsible thing to do.

> complicated library workaround which requires deeper knowledge of Python
> (weakrefs, lambdas, early binding of default arguments, just to name three),

I don't see why it needs to be anywhere near that complicated. All use of
weakrefs can be hidden behind a call such as

    register_finalizer(self, func, *args, **kwds)

and we just need to say that func should be a plain function, not a bound
method of self, and self shouldn't appear anywhere in the arguments.

Anyone who's not intelligent enough to understand and follow those
guidelines is not intelligent enough to avoid the pitfalls of using
__del__ either, IMO.

-- 
Greg

From tanzer at swing.co.at Sun Sep 24 14:04:54 2006
From: tanzer at swing.co.at (Christian Tanzer)
Date: Sun, 24 Sep 2006 14:04:54 +0200
Subject: [Python-3000] Removing __var
In-Reply-To: Your message of "Fri, 22 Sep 2006 11:31:16 PDT."
	<6a36e7290609221131v55d52401s59a74e55b8575376@mail.gmail.com>
Message-ID:

"Bob Ippolito" wrote:
> On 9/22/06, Thomas Heller wrote:
> > Bob Ippolito schrieb:
> > > On 9/22/06, Fred L. Drake, Jr.
> > > wrote:
> > >> On Friday 22 September 2006 13:05, Christian Tanzer wrote:
> > >> > It is useful in some situations, though. In particular, I use a
> > >> > metaclass that sets `__super` to the right value. This wouldn't work
> > >> > without name mangling.
> > >>
> > >> This also doesn't work if two classes in the inheritance hierarchy have the
> > >> same __name__, if I understand how you're using this. My guess is that
> > >> you're using calls like
> > >>
> > >>     def doSomething(self, arg):
> > >>         self.__super.doSomething(arg + 1)
> > >
> > > In the one or two situations where it "is useful" you could always
> > > write out what it would've done.
> > >
> > >     self._ThisClass__super.doSomething(arg + 1)
> >
> > It is much more verbose, though. The question is: are you writing
> > this more often, or are you introspecting more often?
>
> The point is that legitimate __ usage is supposedly so rare that this
> verbosity doesn't matter. If it's verbose, people definitely won't use
> it until they need to, where right now people do it all the time because
> it's "private".

How can you say that? I don't use __ for `private`, I use it for making
cooperative super calls (and `__super` occurs 1397 times in my sandbox). I
definitely don't *want* to put the name of the class into a cooperative
call. Compare

    self.__super.doSomething(arg + 1)

with

    super(SomeClass, self).doSomething(arg + 1)

The literal class name is verbose, error-prone, and hostile to
refactoring.

I don't care about people supposedly abusing __ to define `private`
attributes -- we are all consenting adults here. (And people trying to
restrict visibility probably commit all sorts of blunders. Trying to stop
that might mean taking away most of Python's features.)

-- 
Christian Tanzer http://www.c-tanzer.at/

From fredrik at pythonware.com Sun Sep 24 14:42:55 2006
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Sun, 24 Sep 2006 14:42:55 +0200
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <45158830.8020908@v.loewis.de>
References: <4511E644.2030306@blueyonder.co.uk>
	<451523DC.2050901@v.loewis.de>
	<20060923104310.0863.JCARLSON@uci.edu>
	<45158830.8020908@v.loewis.de>
Message-ID:

Martin v. Löwis wrote:

> I don't think reducing memory consumption is that important, for current
> hardware. Java and .NET have demonstrated that you can do "real"
> applications with that approach.

I've spent more time optimizing Python's string types than most, and
that doesn't match my experiences at all. Operations on wide chars are
often faster than one might think, but any processor can copy X bytes of
data faster than it can copy X*4 bytes of data, and I doubt that's going
to change soon.

> I think supporting multiple representations at run-time would really
> be terrible. Any API of the "give me the data" kind would either have
> to expose the choice of representations, or perform a copy.

Unless you can guarantee that *all* external APIs that a Python
extension might want to use will use exactly the same internal
representation as Python, that's something that we have to deal with
anyway.

> Either alternative would produce many programming errors in extension
> modules.

And even if that was true (which I don't believe), "many" would still
be "very small" compared to the problems that reference counting and
error handling is causing.

From martin at v.loewis.de Sun Sep 24 18:31:12 2006
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Sun, 24 Sep 2006 18:31:12 +0200
Subject: [Python-3000] How will unicode get used?
In-Reply-To: References: <4511E644.2030306@blueyonder.co.uk> <451523DC.2050901@v.loewis.de> <20060923104310.0863.JCARLSON@uci.edu> <45158830.8020908@v.loewis.de> Message-ID: <4516B2D0.9020109@v.loewis.de> Fredrik Lundh schrieb: >> I don't think reducing memory consumption is that important, for current >> hardware. Java and .NET have demonstrated that you can do "real" >> application with that approach. > > I've spent more time optimizing Python's string types than most, and > that doesn't match my experiences at all. Operations on wide chars are > often faster than one might think, but any processor can copy X bytes of > data faster than it can copy X*4 bytes of data, and I doubt that's going > to change soon. These statements don't contradict. You are saying that there is a measurable, perhaps significant difference between copying of single-byte vs. double-byte strings. I can believe this. My claim is that this still isn't that important, and that it will be "fast enough", anyway. In many cases, the application will be IO-bound, so the cost of string operations might be negligible, either way. Of course, both statements generalize across an unspecified set of applications, so it is a matter of personal preferences. >> I think supporting multiple representations at run-time would really >> be terrible. Any API of the "give me the data" kind would either have >> to expose the choice of representations, or perform a copy. > > Unless you can guarantee that *all* external API:s that a Python > extension might want to use will use exactly the same internal > representation as Python, that's something that we have to deal with anyway. APIs will certainly allow different kinds of memory buffers to create a Python string object. Creation is a fairly small part of the API; I believe it would noticeably simplify the implementation if there is only a single internal representation. >> Either alternative would produce many programming errors in extension > > modules. > > And even if that was true (which I don't believe), "many" would still > be "very small" compared to the problems that reference counting and > error handling is causing. We will see. We need a specification or implementation first to see, of course. Regards, Martin From talin at acm.org Sun Sep 24 20:55:47 2006 From: talin at acm.org (Talin) Date: Sun, 24 Sep 2006 11:55:47 -0700 Subject: [Python-3000] Transitional GC? Message-ID: <4516D4B3.50905@acm.org> I wonder if there is a way to create an API for extension modules that would allow a gradual phase-out of reference counting, towards a 'pure' GC. (Let's leave aside the merits of reference counting vs. non-reference counting for another thread - please.) Most of the discussion up to this point has assumed that there's a sharp line between the two GC schemes - in other words, once you switch over, you have to migrate every extension module all at once. I've been wondering, however, if there isn't some way for both schemes to coexist within the same interpreter, for some transitional period. You would have some modules that use the RC API, while other modules would use the 'tracing' API. Modules could gradually be ported to the new API until there were none left, at which point you could throw the switch and remove the RC support entirely. I'm assuming two things here: 1) That such a transitional scheme would have to be as efficient (or nearly so) as the existing scheme in terms of memory and speed. 
2) That we're talking source-level compatibility only - there's no
expectation that you would be able to link with modules compiled under the
old API.

I see two basic approaches to this. The first is to have
reference-counting modules live in a predominantly trace-based world; the
other is to allow tracing modules to live in a predominantly
reference-counted world.

The first approach is relatively straightforward - you simply add any
object with a non-zero refcount to the root set. Objects whose refcounts
fall to zero are not immediately deleted, but instead get placed into the
youngest generation to be traced and collected.

The second approach requires that an object be able to manage refcounts
via its trace function. Consider what an extension module looks like under
a tracing regime. Each extension class is required to provide a 'trace'
function that iterates through all references held by an object. The
'trace' function need not know the purpose of the trace - in other words,
it need not know *why* the references are being iterated; its only concern
is to provide access to each reference. This is most easily accomplished
by passing a callback function to the trace function. The trace function
iterates through the object's references and calls the callback once for
each one.

Because the extension module doesn't know why the references are being
traced, this gives us the freedom to redefine what a 'trace' means at
various points in the transition. So one scheme would allow a 'traced'
object to exist in a reference-counted world by using the trace function
to release references. When an object is destroyed, the trace function is
called, and the callback releases the reference. Dealing with mutation of
references is trickier - there are a couple of approaches I've thought of,
but none are particularly efficient. I guess the traced object will have
to call the old refcounting functions, but via macros which can be no-op'd
later.

-- Talin

From rasky at develer.com Sun Sep 24 21:50:01 2006
From: rasky at develer.com (Giovanni Bajo)
Date: Sun, 24 Sep 2006 21:50:01 +0200
Subject: [Python-3000] Removing __var
References: <6a36e7290609221131v55d52401s59a74e55b8575376@mail.gmail.com>
Message-ID: <0d2101c6e012$9f88be90$4bbd2997@bagio>

Christian Tanzer wrote:

> I don't use __ for `private`, I use it for making cooperative super
> calls (and `__super` occurs 1397 times in my sandbox).

I think you might be confusing the symptom for the disease. To me, your
mail means that Py3k should grow some syntactic sugar for super calls. I
guess if that happens, you won't be missing __.

Giovanni Bajo

From martin at v.loewis.de Sun Sep 24 22:00:35 2006
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Sun, 24 Sep 2006 22:00:35 +0200
Subject: [Python-3000] Transitional GC?
In-Reply-To: <4516D4B3.50905@acm.org>
References: <4516D4B3.50905@acm.org>
Message-ID: <4516E3E3.6000005@v.loewis.de>

Talin schrieb:
> I wonder if there is a way to create an API for extension modules that
> would allow a gradual phase-out of reference counting, towards a 'pure' GC.
>
> (Let's leave aside the merits of reference counting vs. non-reference
> counting for another thread - please.)
>
> Most of the discussion up to this point has assumed that there's a sharp
> line between the two GC schemes - in other words, once you switch over,
> you have to migrate every extension module all at once.

I think this is a minor issue.
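For concreteness: the information a per-type trace function reports to its
callback is already exposed in CPython through gc.get_referents(), which
is built on the tp_traverse visitor machinery Talin describes. A toy mark
phase over it can be sketched in pure Python (reachable_from is an
illustrative name, not a proposed API):

import gc

def reachable_from(*roots):
    # Toy mark phase: walk strong references, using exactly the
    # information a trace function would hand to its callback.
    seen = {}
    stack = list(roots)
    while stack:
        obj = stack.pop()
        if id(obj) in seen:
            continue
        seen[id(obj)] = obj          # hold a reference so ids stay valid
        stack.extend(gc.get_referents(obj))
    return list(seen.values())

d = {"payload": [1, 2, 3]}
objs = reachable_from(d)             # the dict, the list, the key, the items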
Your approach assumes that moving to a tracing GC will require module
authors to change their code. Perhaps that isn't necessary.

It is difficult to tell, in the abstract, whether your proposal works or
not.

Regards,
Martin

From jcarlson at uci.edu Sun Sep 24 23:45:36 2006
From: jcarlson at uci.edu (Josiah Carlson)
Date: Sun, 24 Sep 2006 14:45:36 -0700
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <45158830.8020908@v.loewis.de>
References: <20060923104310.0863.JCARLSON@uci.edu>
	<45158830.8020908@v.loewis.de>
Message-ID: <20060924144006.086D.JCARLSON@uci.edu>

"Martin v. Löwis" wrote:
> Josiah Carlson schrieb:
> > For me, having recently remembered what was in a unicode string, and
> > verifying it by checking the source, the question in my mind is whether
> > we want to stick with the same 2-representation implementation (default
> > encoding and UTF-16 or UCS-4 depending on build), or go with more or
> > fewer representations.
>
> I would personally like to see a Python API that operates on code
> points, with support for 17 planes. I also think that efficient indexing
> is important.

Fully-featured Unicode would be nice.

> There are trade-offs, of course. I personally think the best trade-off
> would be to have a two-byte representation, along with a flag telling
> whether there are any surrogate pairs in the string. Indexing and
> length would be constant-time if there are no surrogates, and linear
> time if there are.

What about a tree structure over the top of the string as I described in
another post? If there are no surrogate pairs, the pointer to the tree
is null. If there are surrogate pairs, we could either use the
structure as I described, or even modify it so that we get even better
memory utilization/performance (choose tree nodes based on where
surrogate pairs are, up to some limit).

- Josiah

From jcarlson at uci.edu Sun Sep 24 23:54:21 2006
From: jcarlson at uci.edu (Josiah Carlson)
Date: Sun, 24 Sep 2006 14:54:21 -0700
Subject: [Python-3000] How will unicode get used?
In-Reply-To:
References: <45158830.8020908@v.loewis.de>
Message-ID: <20060924144621.0870.JCARLSON@uci.edu>

Fredrik Lundh wrote:
> Martin v. Löwis wrote:
> > I think supporting multiple representations at run-time would really
> > be terrible. Any API of the "give me the data" kind would either have
> > to expose the choice of representations, or perform a copy.
>
> Unless you can guarantee that *all* external APIs that a Python
> extension might want to use will use exactly the same internal
> representation as Python, that's something that we have to deal with anyway.

I think Martin meant with regard to, for example, choosing an internal
Latin-1, UCS-2, or UCS-4 representation based on the code points of the
string. I stated earlier that with a buffer interface that returned the
*size* of elements, users could program based on internal representation,
but I agree that it would be error-prone.

What if we just chose UTF-16 as an internal representation? No default
system encoding version attached (as it is right now). Extension writers
could write for the single representation, and convert if it isn't what
they want (and where is the default system encoding ever what is
desired?)

- Josiah

From gabor at nekomancer.net Mon Sep 25 01:48:29 2006
From: gabor at nekomancer.net (gabor)
Date: Mon, 25 Sep 2006 01:48:29 +0200
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <45152234.1090303@v.loewis.de>
References: <20060920154655.mrv1x9ksb24g8844@login.werra.lunarpages.com>
	<45126E76.9020600@nekomancer.net>
	<45152234.1090303@v.loewis.de>
Message-ID: <4517194D.1030908@nekomancer.net>

Martin v. Löwis wrote:
> Gábor Farkas schrieb:
>> while i understand the constraints, i think it's not a good decision to
>> leave this to be implementation-dependent.
>>
>> strings seem to me such basic functionality that their behaviour
>> should not depend on the platform.
>>
>> for example, how is an application developer then supposed to write
>> their applications?
>
> An application developer should always know what the target platforms
> are. For example, does the code need to work with IronPython or not?

i think if IronPython claims to be a python implementation, then at least
a simple hello-world style string manipulation program should behave the
same way on IronPython and on Cpython.

(of course when it's a 'bigger' program, that uses some python libraries,
then yes, he should know. but we are talking about a builtin type here)

> Python is not aiming at 100% portability at all costs. Many aspects
> are platform dependent, and while this has complicated some
> applications, it has simplified others (which could make use of
> platform details that otherwise would not have been exposed to the
> Python programmer).

hmmm.. i thought that all those 'platform dependent' aspects are in the
libraries (win32/sys/posix/os/whatever), and not in the "core" part.

so, are there any in the "core" (stupid naming i know. i mean
not-in-libraries) part?

>> should he write his own slicing/whatever functions to get consistent
>> behaviour on linux/windows?
>
> Depends on the application, and the specific slicing operations.
> If the slicing appears in the processing of .ini files (say),
> no platform-dependent slicing should be necessary.

why? or do you simply assume that an ini file cannot contain non-bmp
unicode characters?

but if you'd like to have an example then: let's say in an application i
only want to display the first 70 characters of a string.

now, for this to behave correctly on non-bmp characters, i will need to
write a custom function, correct?

>> but the same way i could say, that because most of the unix-world is
>> utf-8, for those pythons the best way is to handle it internally as
>> utf-8, couldn't i?
>
> I think you live in a free country: you can certainly say that.
> I think you would be wrong. The common on-disk/on-wire representation
> of text should not influence the design of an in-memory representation.

sorry, i should have clarified this more.

i simply reacted to the situation that for example cpython-win32 and
IronPython use 16bit unicode-strings, which makes it easy for them to
communicate with the (afaik) mostly 16bit-unicode win32 API.

on the other hand, for example GTK uses utf8-encoded strings...so when on
linux the python-GTK bindings want to transfer strings, they will have to
do charset-conversion.

but this was only an example.

>> it simply seems to me strange to make compromises that makes the life of
>> the cpython-users harder, just to make the life for the
>> jython/ironpython developers (i mean the 'creators') easier.
>
> Guido didn't say that the life of the CPython user needs to be hard.

hmmm.. for me having to worry about string-handling differences in the
programming language i use qualifies as 'harder'.

> He said it will be implementation-dependent, referring to Jython
> and IronPython.
> Whether or not CPython uses a consistent representation
> or consistent python-level experience across platforms is a different
> issue. CPython could behave absolutely consistently, and use four-byte
> Unicode on all systems, and the length of a non-BMP string would
> still be implementation-defined.

i understand that difference.

(i just find it hard to believe, that string-handling does not seem
important enough to make it truly cross-platform (or
cross-implementation))

gabor

From jcarlson at uci.edu Mon Sep 25 06:34:12 2006
From: jcarlson at uci.edu (Josiah Carlson)
Date: Sun, 24 Sep 2006 21:34:12 -0700
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <4517194D.1030908@nekomancer.net>
References: <45152234.1090303@v.loewis.de>
	<4517194D.1030908@nekomancer.net>
Message-ID: <20060924210217.0873.JCARLSON@uci.edu>

gabor wrote:
> Martin v. Löwis wrote:
> > Gábor Farkas schrieb:
[snip]
> > Python is not aiming at 100% portability at all costs. Many aspects
> > are platform dependent, and while this has complicated some
> > applications, it has simplified others (which could make use of
> > platform details that otherwise would not have been exposed to the
> > Python programmer).
>
> hmmm.. i thought that all those 'platform dependent' aspects are in the
> libraries (win32/sys/posix/os/whatever), and not in the "core" part.
>
> so, are there any in the "core" (stupid naming i know. i mean
> not-in-libraries) part?

sys.setrecursionlimit(10000)

def foo():
    foo()

Run that in Windows, and you get a MemoryError. Run it in Linux, and you
get a segfault. Blame Linux's malloc.

> >> should he write his own slicing/whatever functions to get consistent
> >> behaviour on linux/windows?
> >
> > Depends on the application, and the specific slicing operations.
> > If the slicing appears in the processing of .ini files (say),
> > no platform-dependent slicing should be necessary.
[snip]
> let's say in an application i only want to display the first 70
> characters of a string.
>
> now, for this to behave correctly on non-bmp characters, i will need to
> write a custom function, correct?

That depends on what you mean by "now," and on the Python compile option.
If you mean that "today ... i would need to write a custom function", then
you would be correct on a utf-16 compiled Python for all characters with a
code point > 65535, but not so on a ucs-4 build (but perhaps both when
there are surrogate pairs). In the future, the plan, I believe, is to
attempt to make utf-16 behave like ucs-4 with regard to all operations
available from Python, at least for all characters represented with a
single code point.

> >> but the same way i could say, that because most of the unix-world is
> >> utf-8, for those pythons the best way is to handle it internally as
> >> utf-8, couldn't i?
> >
> > I think you live in a free country: you can certainly say that.
> > I think you would be wrong. The common on-disk/on-wire representation
> > of text should not influence the design of an in-memory representation.
>
> sorry, i should have clarified this more.
>
> i simply reacted to the situation that for example cpython-win32 and
> IronPython use 16bit unicode-strings, which makes it easy for them to
> communicate with the (afaik) mostly 16bit-unicode win32 API.
>
> on the other hand, for example GTK uses utf8-encoded strings...so when
> on linux the python-GTK bindings want to transfer strings, they will
> have to do charset-conversion.
>
> but this was only an example.
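The narrow-build behaviour behind the "custom function" caveat is easy to
demonstrate; on a UTF-16 ("narrow") CPython build of this era the
following holds, while a UCS-4 ("wide") build gives len(s) == 1 and s[0]
is the whole character:

s = u"\U00010000"     # one supplementary-plane character
print(len(s))         # 2 on a narrow build: stored as a surrogate pair
print(repr(s[0]))     # u'\ud800' -- half a character
print(repr(s[:1]))    # a lone surrogate, which is why naive s[:70]-style
                      # truncation can silently split characters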
The current CPython implementation keeps up to two representations of
unicode strings in memory: the utf-16 or ucs-4 representation (depending
on compile-time options) and a default system encoding representation. If
you set your default system encoding to be utf-8, Python doesn't need to
do anything more to hand unicode strings off to GTK, aside from
recognizing that it has what it wants already.

[snip]
> hmmm.. for me having to worry about string-handling differences in the
> programming language i use qualifies as 'harder'.

With what Martin and Fredrik have been saying recently, I don't believe
that you have anything significant to worry about when it comes to string
behavior on CPython vs. IronPython, Jython, or even PyPy.

> > He said it will be implementation-dependent, referring to Jython
> > and IronPython.
> > Whether or not CPython uses a consistent representation
> > or consistent python-level experience across platforms is a different
> > issue. CPython could behave absolutely consistently, and use four-byte
> > Unicode on all systems, and the length of a non-BMP string would
> > still be implementation-defined.
>
> i understand that difference.
>
> (i just find it hard to believe, that string-handling does not seem
> important enough to make it truly cross-platform (or cross-implementation))

It is important, arguably one of the most important pieces. But there are
three parts: 1) code points not currently defined within the unicode spec,
but which have specific encodings (based on the code point value), 2) in
the case of UTF-16 representations, Python's handling of characters >
65535, 3) surrogates. I believe #1 is handled "correctly" today, Martin
sounds like he wants #2 fixed for Py3k (I don't believe anyone *doesn't*
want it fixed), and #3 could be fixed while fixing #2 with a little more
work (if desired).

- Josiah

From martin at v.loewis.de Mon Sep 25 07:26:30 2006
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Mon, 25 Sep 2006 07:26:30 +0200
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <20060924144006.086D.JCARLSON@uci.edu>
References: <20060923104310.0863.JCARLSON@uci.edu>
	<45158830.8020908@v.loewis.de>
	<20060924144006.086D.JCARLSON@uci.edu>
Message-ID: <45176886.2090201@v.loewis.de>

Josiah Carlson schrieb:
> What about a tree structure over the top of the string as I described in
> another post? If there are no surrogate pairs, the pointer to the tree
> is null. If there are surrogate pairs, we could either use the
> structure as I described, or even modify it so that we get even better
> memory utilization/performance (choose tree nodes based on where
> surrogate pairs are, up to some limit).

As always, it's a time-vs-space tradeoff. People tend to resolve these in
favor of time, accepting an increase in space. I'm not so sure this is
always the right answer.

In the specific case, I'm also worried about the increase in complexity.
That said, it is always good to have a prototype implementation to
analyse the consequences better.

Regards,
Martin

From qrczak at knm.org.pl Mon Sep 25 11:57:10 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Mon, 25 Sep 2006 11:57:10 +0200
Subject: [Python-3000] Transitional GC?
In-Reply-To: <4516D4B3.50905@acm.org> (talin@acm.org's message of "Sun, 24 Sep 2006 11:55:47 -0700") References: <4516D4B3.50905@acm.org> Message-ID: <87psdkfevd.fsf@qrnik.zagroda> Talin writes: > I wonder if there is a way to create an API for extension modules that > would allow a gradual phase-out of reference counting, towards a 'pure' GC. I believe this is possible when C code doesn't access addresses of Python objects directly, but via handles. http://srfi.schemers.org/srfi-50/mail-archive/msg00295.html See "Minor" link there, and the whole SRFI-50 discussion about FFI styles. -- __("< Marcin Kowalczyk \__/ qrczak at knm.org.pl ^^ http://qrnik.knm.org.pl/~qrczak/ From qrczak at knm.org.pl Mon Sep 25 13:02:14 2006 From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk) Date: Mon, 25 Sep 2006 13:02:14 +0200 Subject: [Python-3000] Removing __del__ In-Reply-To: <4515DB65.2090505@canterbury.ac.nz> (Greg Ewing's message of "Sun, 24 Sep 2006 13:12:05 +1200") References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com> <008901c6de94$d2072ed0$4bbd2997@bagio> <20060922235602.GA3427@panix.com> <035f01c6def0$900594c0$4bbd2997@bagio> <4515DB65.2090505@canterbury.ac.nz> Message-ID: <87hcywgqfd.fsf@qrnik.zagroda> Greg Ewing writes: > All use of weakrefs can be hidden behind a call such as > > register_finalizer(self, func, *args, **kwds) It should be possible to finalize the object explicitly, given a handle returned by this function, and possibly to kill the finalizer without execution. The former is useful to implement close(). The latter is useful for weak dictionaries: when an entry is removed because it's overwritten, there is no need to keep a finalizer which will remove the old entry when the key dies. IMHO a weak reference can conveniently play the role of such handle. -- __("< Marcin Kowalczyk \__/ qrczak at knm.org.pl ^^ http://qrnik.knm.org.pl/~qrczak/ From jimjjewett at gmail.com Mon Sep 25 16:33:26 2006 From: jimjjewett at gmail.com (Jim Jewett) Date: Mon, 25 Sep 2006 10:33:26 -0400 Subject: [Python-3000] How will unicode get used? In-Reply-To: <20060924210217.0873.JCARLSON@uci.edu> References: <45152234.1090303@v.loewis.de> <4517194D.1030908@nekomancer.net> <20060924210217.0873.JCARLSON@uci.edu> Message-ID: On 9/25/06, Josiah Carlson wrote: > > gabor wrote: > > Martin v. L?wis wrote: > > > G?bor Farkas schrieb: > > >> should he write his own slicing/whatever functions to get consistent > > >> behaviour on linux/windows? > > now, for this to behave correctly on non-bmp characters, i will need to > > write a custom function, correct? As David Hopwood pointed out, to be fully correct, you already have to create a custom function even with bmp characters, because of decomposed characters. (Example: Representing a c-cedilla as a c and a combining cedilla, rather than as a single code point.) Separating those two would be wrong. Counting them as two characters for slicing purposes would usually be wrong. Even 32-bit representations are permitted to use surrogate pairs; it just doesn't often make sense. These are problems inherent to unicode (or at least to non-normalized unicode). Different python implementations may expose the problem in different places, but the problem is always there. We *could* specify that slicing and indexing act as though the underlying representation were normalized (and this would typically require normalization as part of construction), but I'm not sure that is the right answer. Even if it were trivial, there are reasons not to normalize. 
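The c-cedilla example is worth seeing in code; this runs identically on
narrow and wide builds, since no supplementary-plane characters are
involved:

import unicodedata

composed   = u"\u00e7"    # LATIN SMALL LETTER C WITH CEDILLA, one code point
decomposed = u"c\u0327"   # 'c' plus COMBINING CEDILLA, two code points

print(composed == decomposed)           # False: different code point sequences
print((len(composed), len(decomposed))) # (1, 2): slicing sees two "characters"
print(unicodedata.normalize("NFC", decomposed) == composed)  # True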
> It is important, arguably one of the most important pieces. But there > are three parts; 1) code points not currently defined within the unicode > spec, but who have specific encodings (based on the code point value), 2) > in the case of UTF-16 representations, Python's handling of characters > > 65535, 3) surrogates. > I believe #1 is handled "correctly" today, Martin sounds like he wants > #2 fixed for Py3k (I don't believe anyone *doesn't* want it fixed), and > #3 could be fixed while fixing #2 with a little more work (if desired). You also left out (4), decomposed characters, which is a more complex version of surrogates. Guido just stated that #2 is intentional, though he didn't pronounce that it should stay that way. There are sound arguments both ways. In particular, fixing it without fixing decomposed characters might incur the cost without the benefit. -jJ From paul at prescod.net Mon Sep 25 17:50:16 2006 From: paul at prescod.net (Paul Prescod) Date: Mon, 25 Sep 2006 08:50:16 -0700 Subject: [Python-3000] How will unicode get used? In-Reply-To: References: <45152234.1090303@v.loewis.de> <4517194D.1030908@nekomancer.net> <20060924210217.0873.JCARLSON@uci.edu> Message-ID: <1cb725390609250850w51903f00w148b750afdae9ee8@mail.gmail.com> On 9/25/06, Jim Jewett wrote: > > As David Hopwood pointed out, to be fully correct, you already have to > create a custom function even with bmp characters, because of > decomposed characters. (Example: Representing a c-cedilla as a c and > a combining cedilla, rather than as a single code point.) Separating > those two would be wrong. Counting them as two characters for slicing > purposes would usually be wrong. Even 32-bit representations are permitted to use surrogate pairs; it > just doesn't often make sense. There is at least one big difference between surrogate pairs and decomposed characters. The user can typically normalize away decompositions. How do you normalize away decompositions in a language that only supports 16-bit representations? Paul Prescod -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.python.org/pipermail/python-3000/attachments/20060925/bb100953/attachment.html From fredrik at pythonware.com Mon Sep 25 18:01:21 2006 From: fredrik at pythonware.com (Fredrik Lundh) Date: Mon, 25 Sep 2006 18:01:21 +0200 Subject: [Python-3000] How will unicode get used? In-Reply-To: <4516B2D0.9020109@v.loewis.de> References: <4511E644.2030306@blueyonder.co.uk> <451523DC.2050901@v.loewis.de> <20060923104310.0863.JCARLSON@uci.edu> <45158830.8020908@v.loewis.de> <4516B2D0.9020109@v.loewis.de> Message-ID: Martin v. L?wis wrote: >>> I think supporting multiple representations at run-time would really >>> be terrible. Any API of the "give me the data" kind would either have >>> to expose the choice of representations, or perform a copy. >> >> Unless you can guarantee that *all* external API:s that a Python >> extension might want to use will use exactly the same internal >> representation as Python, that's something that we have to deal with anyway. > > APIs will certainly allow different kinds of memory buffers to > create a Python string object. Creation is a fairly small part > of the API creation is not the problem; it's the "give me the data" API that's the problem. or rather, the "give me the data in a form that's compatible with the 3rd party API that I'm about to call" API. > I believe it would noticeably simplify the implementation if there is > only a single internal representation. 
and I, wearing my string algorithm implementor hat, tend to disagree with
that. writing source code that can be compiled into efficient code for
multiple representations is mostly trivial, even in C.

From david.nospam.hopwood at blueyonder.co.uk Tue Sep 26 01:19:54 2006
From: david.nospam.hopwood at blueyonder.co.uk (David Hopwood)
Date: Tue, 26 Sep 2006 00:19:54 +0100
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <1cb725390609250850w51903f00w148b750afdae9ee8@mail.gmail.com>
References: <45152234.1090303@v.loewis.de>
	<4517194D.1030908@nekomancer.net>
	<20060924210217.0873.JCARLSON@uci.edu>
	<1cb725390609250850w51903f00w148b750afdae9ee8@mail.gmail.com>
Message-ID: <4518641A.4070500@blueyonder.co.uk>

Paul Prescod wrote:
> On 9/25/06, Jim Jewett wrote:
>
>> As David Hopwood pointed out, to be fully correct, you already have to
>> create a custom function even with bmp characters, because of
>> decomposed characters. (Example: Representing a c-cedilla as a c and
>> a combining cedilla, rather than as a single code point.) Separating
>> those two would be wrong. Counting them as two characters for slicing
>> purposes would usually be wrong.
>
> Even 32-bit representations are permitted to use surrogate pairs; it
> just doesn't often make sense.
>
> There is at least one big difference between surrogate pairs and decomposed
> characters. The user can typically normalize away decompositions.

That depends what script they're using. For some scripts, they can't.

-- 
David Hopwood

From rhettinger at ewtllc.com Tue Sep 26 01:41:51 2006
From: rhettinger at ewtllc.com (Raymond Hettinger)
Date: Mon, 25 Sep 2006 16:41:51 -0700
Subject: [Python-3000] Removing __del__
In-Reply-To: <03bb01c6def4$257b6c70$4bbd2997@bagio>
References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com>
	<008901c6de94$d2072ed0$4bbd2997@bagio>
	<20060922235602.GA3427@panix.com>
	<6a36e7290609221735hcbd3df2ne41406323ce5fd72@mail.gmail.com>
	<039d01c6def1$46df1ef0$4bbd2997@bagio>
	<6a36e7290609230222w1fe8dfaam4780a1fd81481cd0@mail.gmail.com>
	<03bb01c6def4$257b6c70$4bbd2997@bagio>
Message-ID: <4518693F.1050500@ewtllc.com>

>>I've never seen an API that works like that. Have you?
>>
>>
>
>The class above shows a case where:
>
>1) There's a way to destruct the handle BEFORE __del__ is called, which would
>require killing the weakref / deregistering the finalization hook. I believe
>you agree that this is pretty common (I have around 10 uses of this pattern,
>__del__ with a separate explicit closure method, in one Python codebase of
>mine).
>
>

ISTM, you've adopted __del__ as your best friend, learned to avoid its
pitfalls, employed it throughout your code, and forsaken weakref-based
approaches, which is understandable because weakrefs came along rather
late in the game. I congratulate you on that level of accomplishment.

I support the original suggestion to remove __del__ because I think that
most programmers would be better off without it, that weakref-based
alternatives are possible (though not necessarily easier or more
succinct), and that explicit finalization is preferable to implicit (i.e.
there's a reason for advice to wrap file access in a try/finally to make
sure an explicit close() occurs). In a world dominated by new-style
classes, it is a strong plus that weakrefs reliably avoid creating cycles
which subtly block or delay finalization.
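The weakref-based alternative needs surprisingly little machinery. Here is
a sketch of the register_finalizer() helper floated earlier in the thread
(the name and semantics follow that proposal; it is not an existing API):

import weakref

_live_refs = set()    # keeps the weakrefs, and thus the callbacks, alive

def register_finalizer(obj, func, *args, **kwds):
    # func must be a plain function that does not reference obj; a bound
    # method such as obj.close would keep obj alive forever.
    def callback(ref):
        _live_refs.discard(ref)
        func(*args, **kwds)
    ref = weakref.ref(obj, callback)
    _live_refs.add(ref)
    return ref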
Eliminating __del__ will also mean an end to implementation headaches
relating to issues stemming from arbitrary finalization code running while
an object is still alive.

The __del__ special method has long been a dark corner of Python, a rarely
used and error-prone tool. Just having it around creates a suggestion that
it would be a good idea to design code relying on implicit finalization
and the fragile hope that you or some future maintainer doesn't
accidentally keep a reference to an object you had intended to vanish of
its own accord.

In short, __del__ should disappear not because it is useless but because
it is hazardous. The consenting adults philosophy means that we don't put
up artificial barriers to intentional hacks, but it does not mean that we
bait the hook and leave error-prone traps for the unwary. In Py3k, I would
like to see explicit finalization as the preferred approach and for
weakrefs to be the one-way-to-do-it for designs with implicit
finalization.

Raymond

From rasky at develer.com Tue Sep 26 10:59:33 2006
From: rasky at develer.com (Giovanni Bajo)
Date: Tue, 26 Sep 2006 10:59:33 +0200
Subject: [Python-3000] Removing __del__
References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com>
	<008901c6de94$d2072ed0$4bbd2997@bagio>
	<20060922235602.GA3427@panix.com>
	<6a36e7290609221735hcbd3df2ne41406323ce5fd72@mail.gmail.com>
	<039d01c6def1$46df1ef0$4bbd2997@bagio>
	<6a36e7290609230222w1fe8dfaam4780a1fd81481cd0@mail.gmail.com>
	<03bb01c6def4$257b6c70$4bbd2997@bagio>
	<4518693F.1050500@ewtllc.com>
Message-ID: <120f01c6e14a$15edbfd0$4bbd2997@bagio>

Raymond Hettinger wrote:

> In short, __del__ should disappear not because it is useless but because
> it is hazardous. The consenting adults philosophy means that we don't
> put up artificial barriers to intentional hacks, but it does not mean
> that we bait the hook and leave error-prone traps for the unwary. In
> Py3k, I would like to see explicit finalization as the preferred
> approach and for weakrefs to be the one-way-to-do-it for designs with
> implicit finalization.

Raymond, there is one thing I don't understand in your line of reasoning.
You say that you prefer explicit finalization, but that implicit
finalization still needs to be supported. And for that, you'd rather drop
__del__ and use weakrefs. But why? You say that __del__ is hazardous, but
I can't see how weakrefs are less hazardous. As an implicit finalization
method, they live on the fragile assumption that the callback won't hold a
reference to the object: an assumption which cannot be enforced in any way
but cautious programming and scrupulous auditing of the code. I assert
that they hide bugs much better than __del__ does (it's pretty easy to
find an offending __del__ by looking at gc.garbage, while it's harder to
notice a missing finalization because the cycle involving the weakref
callback was broken at the wrong point).

I guess there's something escaping me. If we have to drop one, why is it
__del__? And if __del__ could be fixed to reliably work in reference
cycles, would you still want to drop it?
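For reference, the gc.garbage behaviour relied on here is easy to
demonstrate under the CPython semantics of this era (cycles containing
__del__ are left uncollected and parked in gc.garbage):

import gc

class Leaky(object):
    def __del__(self):
        pass

a, b = Leaky(), Leaky()
a.other, b.other = b, a     # a reference cycle of __del__-bearing objects
del a, b
gc.collect()
print(gc.garbage)           # both instances show up here, uncollected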
Giovanni Bajo From greg.ewing at canterbury.ac.nz Tue Sep 26 11:57:13 2006 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Tue, 26 Sep 2006 21:57:13 +1200 Subject: [Python-3000] Removing __del__ In-Reply-To: <120f01c6e14a$15edbfd0$4bbd2997@bagio> References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com> <008901c6de94$d2072ed0$4bbd2997@bagio> <20060922235602.GA3427@panix.com> <6a36e7290609221735hcbd3df2ne41406323ce5fd72@mail.gmail.com> <039d01c6def1$46df1ef0$4bbd2997@bagio> <6a36e7290609230222w1fe8dfaam4780a1fd81481cd0@mail.gmail.com> <03bb01c6def4$257b6c70$4bbd2997@bagio> <4518693F.1050500@ewtllc.com> <120f01c6e14a$15edbfd0$4bbd2997@bagio> Message-ID: <4518F979.50902@canterbury.ac.nz> Giovanni Bajo wrote: > I assert that they hide bugs much better than > __del__ does (it's pretty easy to find an offending __del__ by looking at > gc.garbage, It should be feasible to modify the cyclic GC to detect groups of objects that are only being kept alive by references from the finalizer list. These could be treated the same way as __del__-containing cycles are now, and moved to a garbage list. -- Greg From tim.peters at gmail.com Tue Sep 26 13:01:23 2006 From: tim.peters at gmail.com (Tim Peters) Date: Tue, 26 Sep 2006 07:01:23 -0400 Subject: [Python-3000] Removing __del__ In-Reply-To: <120f01c6e14a$15edbfd0$4bbd2997@bagio> References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com> <008901c6de94$d2072ed0$4bbd2997@bagio> <20060922235602.GA3427@panix.com> <6a36e7290609221735hcbd3df2ne41406323ce5fd72@mail.gmail.com> <039d01c6def1$46df1ef0$4bbd2997@bagio> <6a36e7290609230222w1fe8dfaam4780a1fd81481cd0@mail.gmail.com> <03bb01c6def4$257b6c70$4bbd2997@bagio> <4518693F.1050500@ewtllc.com> <120f01c6e14a$15edbfd0$4bbd2997@bagio> Message-ID: <1f7befae0609260401g4d6a309asc8e1a11c56f0d9ec@mail.gmail.com> [Giovanni Bajo] > Raymond, there is one thing I don't understand in your line of reasoning. You > say that you prefer explicit finalization, but that implicit finalization still > needs to be supported. And for that, you'd rather drop __del__ and use > weakrefs. But why? You say that __del__ is harardous, but I can't see how > weakrefs are less hazardous. As an implicit finalization method, they live on > the fragile assumption that the callback won't hold a reference to the object: > an assumption which cannot be enforced in any way but cautious programming and > scrupolous auditing of the code. Nope, not so. Read Modules/gc_weakref.txt for the gory details. In outline, there are three objects of interest here: the weakly referenced object (WO), the weakref (WR) to the WO, and the callback (CB) callable attached to the WR. /Normally/ the CB is reachable (== not trash). If a reachable CB has a strong reference to the WO, then that keeps the WO reachable too, and of course the CB won't be invoked so long as its strong reference keeps the WO alive. The CB can't become trash either so long as the WR is reachable, since the WR holds a strong reference to the CB. If the WR becomes trash while the WO is reachable, the WR clears its reference to the CB, and then the CB will never be invoked period. OTOH, if the CB has a weak reference to the WO, then when the WO goes away and the CB is invoked, the CB's weak reference returns None instead of the WO. So in no case can a reachable CB actually get at the WO via the CB's own strong or weak reference to the WO. 
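The first of these cases is easy to check interactively; here the callback
is deliberately given a strong reference to the weakly referenced object:

import weakref

class WO(object):
    pass

obj = WO()
cb = lambda ref, keep=obj: None   # the callback strongly references obj
wr = weakref.ref(obj, cb)
del obj
print(wr() is None)               # False: the callback's reference keeps
                                  # the object reachable, so cb never fires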
More, this is true even if the WO is just strongly reachable via any path
/from/ a reachable CB: the fact that the CB is reachable guarantees the WO
is reachable then too.

Skipping details, things get muddier only when all three of these objects
end up in cyclic trash (CT) "at the same time". The dodge Python currently
takes is that, when a WR is part of CT, and the WR's referent is also part
of CT, the WR's CB (if any) is never invoked. This is defensible since the
order in which trash objects are finalized isn't defined, so it's
legitimate to kill the WR first. It's unclear whether that's entirely
desirable behavior, though. There were excruciating discussions about this
earlier, but nobody had a concrete use case favoring a specific position.

From rasky at develer.com Tue Sep 26 14:19:34 2006
From: rasky at develer.com (Giovanni Bajo)
Date: Tue, 26 Sep 2006 14:19:34 +0200
Subject: [Python-3000] Removing __del__
References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com>
	<008901c6de94$d2072ed0$4bbd2997@bagio>
	<20060922235602.GA3427@panix.com>
	<6a36e7290609221735hcbd3df2ne41406323ce5fd72@mail.gmail.com>
	<039d01c6def1$46df1ef0$4bbd2997@bagio>
	<6a36e7290609230222w1fe8dfaam4780a1fd81481cd0@mail.gmail.com>
	<03bb01c6def4$257b6c70$4bbd2997@bagio>
	<4518693F.1050500@ewtllc.com>
	<120f01c6e14a$15edbfd0$4bbd2997@bagio>
	<1f7befae0609260401g4d6a309asc8e1a11c56f0d9ec@mail.gmail.com>
Message-ID: <01d301c6e166$070bde90$e303030a@trilan>

Tim Peters wrote:
> [Giovanni Bajo]
>> Raymond, there is one thing I don't understand in your line of
>> reasoning. You say that you prefer explicit finalization, but that
>> implicit finalization still needs to be supported. And for that,
>> you'd rather drop __del__ and use weakrefs. But why? You say that
>> __del__ is hazardous, but I can't see how weakrefs are less
>> hazardous. As an implicit finalization method, they live on the
>> fragile assumption that the callback won't hold a reference to the
>> object: an assumption which cannot be enforced in any way but
>> cautious programming and scrupulous auditing of the code.
>
> Nope, not so. Read Modules/gc_weakref.txt for the gory details.
> [...]
> The dodge
> Python currently takes is that, when a WR is part of CT, and the WR's
> referent is also part of CT, the WR's CB (if any) is never invoked.
> This is defensible since the order in which trash objects are
> finalized isn't defined, so it's legitimate to kill the WR first.
> It's unclear whether that's entirely desirable behavior, though.
> There were excruciating discussions about this earlier, but nobody had
> a concrete use case favoring a specific position.

Thanks for the explanation, and I believe you are confirming my position.
You are saying that the CB of a WR which is part of CT is never invoked.
In the above quote, I'm saying that if the user makes a mistake and writes
a CB (as an implicit finalizer) which holds a reference to the WO, it is
creating a CT, so the CB will never be invoked. For instance:

class Wrapper:
    def __init__(self, *args):
        self.handle = CAPI.init(*args)
        self._wr = weakref.ref(self, lambda ref: CAPI.close(self.handle))  # BUG HERE!

In this case, we have a CT: a Wrapper instance is the WO, which holds a
strong reference to the WR (self._wr), which holds a strong reference to
the CB (the lambda), which holds a strong reference to the WO again
(through the implicit usage of nested scopes). Thus, in this case, the CB
will never be called. Is that right?
I have tried this variant to verify myself:

>>> import weakref
>>> class Wrapper:
...     def __init__(self):
...         def test(ref):
...             print "finalizer called", self.a
...         self.a = 1234
...         self._wr = weakref.ref(self, test)
...
>>> w = Wrapper()
>>> del w
>>> import gc
>>> gc.collect()
6
>>> gc.collect()
0
>>> gc.garbage
[]

Given these examples, I still can't see why weakrefs are thought to be a
preferable solution for implicit finalization, when compared to __del__.
They mostly share the same problems when it comes to cyclic trash, but
__del__ is far easier to teach, explain and understand. I can quickly
teach people not to use cycles with __del__, and I can verify whether
there's a mistake by looking at gc.garbage; teaching how to properly use
weakrefs, callbacks, and how to avoid reference loops with nested scopes
is much harder to grasp in the first place, and does not seem to provide
any advantage.

===============================

Tim, I sort of hoped you would jump into this discussion. I had this link
around I wanted to show you:
http://mail.python.org/pipermail/python-dev/2000-March/002526.html

I re-read most threads in those weeks about finalization issues with
cyclic trash. Guido was proposing a solution with __del__ and CT, which
approximately worked this way:

- When a CT is detected, any __del__ method is invoked once per instance,
in random order.
- We make sure that each __del__ method is called once and only once per
instance (by using some sort of flag; Guido was proposing to set
self.__dict__["__del__"] = None, but that predates new-style classes as
far as I can tell).
- After all __del__ methods in the CT have been called exactly once, we
collect the trash as usual (break links by reclaiming the __dict__ of the
instances, or whatever).

Since we are discussing Py3k here, I believe it is the right time to
revive this discussion. The __close__ proposal I'm backing (summed up in
this mail:
http://mail.python.org/pipermail/python-3000/2006-September/003892.html)
is pretty similar to how Guido was proposing to modify __del__. If there
are technical grounds for this (and my opinion does not matter much, but
Guido was proposing the same thing, which kind of gives me hope in this
regard), I believe it would be a far superior solution for the problem of
implicit finalization in the presence of CT in Py3k.

I think the idea is that, if you make sure that a __close__ method is
called exactly once (and before __dict__ is reclaimed), it really does not
matter much in which order you call __close__ methods within the CT. I
mean, it *might* matter for already-written in-the-wild __del__ methods of
course, but it sounds like a *very* reasonable constraint for Py3k's
__close__ methods. I would like to see real-world examples where calling
__close__ in random order breaks things.

In the message linked above, you reply with:

[Tim]
> I would have no objection to "__del__ called only once" if it weren't
> for that Python currently does something different. I don't know
> whether people rely on that now; if they do, it's a much more
> dangerous thing to change than adding a new keyword.

Would you still have this same position? Do you consider this "only once"
rule as a possible way to solve implicit finalization in GC?
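As a toy model of the once-only rule (purely illustrative: run_finalizers
and the _closed flag are not proposed names, and the real GC would work in
C), the two phases would look like:

def run_finalizers(cycle):
    # Phase 1: every instance's __close__ runs exactly once, while all
    # attributes are still intact; order within the cycle is arbitrary.
    for obj in cycle:
        close = getattr(type(obj), "__close__", None)
        if close is not None and not obj.__dict__.get("_closed"):
            obj.__dict__["_closed"] = True
            close(obj)
    # Phase 2: only now are the links broken so the trash can be reclaimed.
    for obj in cycle:
        obj.__dict__.clear()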
-- 
Giovanni Bajo

From qrczak at knm.org.pl  Tue Sep 26 14:24:27 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Tue, 26 Sep 2006 14:24:27 +0200
Subject: [Python-3000] Removing __del__
In-Reply-To: <1f7befae0609260401g4d6a309asc8e1a11c56f0d9ec@mail.gmail.com> (Tim Peters's message of "Tue, 26 Sep 2006 07:01:23 -0400")
References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com> <008901c6de94$d2072ed0$4bbd2997@bagio> <20060922235602.GA3427@panix.com> <6a36e7290609221735hcbd3df2ne41406323ce5fd72@mail.gmail.com> <039d01c6def1$46df1ef0$4bbd2997@bagio> <6a36e7290609230222w1fe8dfaam4780a1fd81481cd0@mail.gmail.com> <03bb01c6def4$257b6c70$4bbd2997@bagio> <4518693F.1050500@ewtllc.com> <120f01c6e14a$15edbfd0$4bbd2997@bagio> <1f7befae0609260401g4d6a309asc8e1a11c56f0d9ec@mail.gmail.com>
Message-ID: <87lko63jes.fsf@qrnik.zagroda>

"Tim Peters" writes:

> Read Modules/gc_weakref.txt for the gory details.

"It's a feature of Python's weakrefs too that when a weakref goes away, the callback (if any) associated with it is thrown away too, unexecuted."

I disagree with this choice. Doesn't it prevent weakrefs from being used as finalizers?

Here is my semantics: The core weakref constructor has three arguments: a key, a value, and a finalizer. (The finalizer is conceptually a function with no parameters. In Python it's more convenient to make it a function with any arity, along with the associated arguments.) It's often the case that the key and the value are the same object. The simplified form of the weakref constructor makes this assumption and takes only a single object and a finalizer. The generic form is needed for dictionaries with weak keys.

Creating a weak reference establishes a relationship:

- The key keeps the value alive.
- The weak reference and the finalizer are alive.

When the key dies, the relationship ends, and the finalizer is added to a queue of finalizers to be executed. Given a weak reference, you can obtain the value, or find out that the weakref is already dead (None). You can also invoke the finalizer explicitly, which also ends the relationship (the thread is suspended if the finalizer is currently executing). And you can kill the weak reference, ending the relationship.

I believe this is a sufficient design for most practical purposes. See also http://www.haible.de/bruno/papers/cs/weak/WeakDatastructures-writeup.html but I disagree with the section about finalizers.
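A minimal sketch of this semantics on top of what Python already has (the class and registry names are invented for illustration; note that the sketch itself leans on the cyclic GC to reclaim the weakref/callback pair afterwards):

import weakref

_live = set()   # keeps each relationship alive while its key lives

class Relationship(object):
    def __init__(self, key, value, finalizer, *args):
        self.value = value                  # the key keeps the value alive
        self.finalizer, self.args = finalizer, args
        self.ref = weakref.ref(key, self._key_died)
        _live.add(self)
    def _key_died(self, ref):
        self.invoke()
    def invoke(self):
        # Explicit invocation also ends the relationship.
        if self in _live:
            _live.remove(self)
            self.finalizer(*self.args)
    def kill(self):
        # End the relationship without running the finalizer.
        _live.discard(self)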
-- __("< Marcin Kowalczyk \__/ qrczak at knm.org.pl ^^ http://qrnik.knm.org.pl/~qrczak/ From qrczak at knm.org.pl Tue Sep 26 15:15:17 2006 From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk) Date: Tue, 26 Sep 2006 15:15:17 +0200 Subject: [Python-3000] Removing __del__ In-Reply-To: <01d301c6e166$070bde90$e303030a@trilan> (Giovanni Bajo's message of "Tue, 26 Sep 2006 14:19:34 +0200") References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com> <008901c6de94$d2072ed0$4bbd2997@bagio> <20060922235602.GA3427@panix.com> <6a36e7290609221735hcbd3df2ne41406323ce5fd72@mail.gmail.com> <039d01c6def1$46df1ef0$4bbd2997@bagio> <6a36e7290609230222w1fe8dfaam4780a1fd81481cd0@mail.gmail.com> <03bb01c6def4$257b6c70$4bbd2997@bagio> <4518693F.1050500@ewtllc.com> <120f01c6e14a$15edbfd0$4bbd2997@bagio> <1f7befae0609260401g4d6a309asc8e1a11c56f0d9ec@mail.gmail.com> <01d301c6e166$070bde90$e303030a@trilan> Message-ID: <87irjahiqi.fsf@qrnik.zagroda> "Giovanni Bajo" writes: > Guido was proposing a solution with __del__ and CT, which > approximately worked this way: > > - When a CT is detected, any __del__ method is invoked once per > instance, in random order. This means that __del__ may attempt to use an object which has already had its __del__ called. > Since we are discussing Py3k here, I believe it is the right time to revive > this discussion. The __close__ proposal I'm backing (sumed up in this mail: > http://mail.python.org/pipermail/python-3000/2006-September/003892.html) is > pretty similar to how Guido was proposing to modify __del__. "1) call __close__ on the instances *BEFORE* dropping the references. The code in __close__ could break the cycle itself." Same problem as above. Note that the problem is solvable when the subset of links in these objects which is needed during finalization doesn't contain cycles. But the language implementation can't know *which* links are these. -- __("< Marcin Kowalczyk \__/ qrczak at knm.org.pl ^^ http://qrnik.knm.org.pl/~qrczak/ From jimjjewett at gmail.com Tue Sep 26 15:22:21 2006 From: jimjjewett at gmail.com (Jim Jewett) Date: Tue, 26 Sep 2006 09:22:21 -0400 Subject: [Python-3000] Removing __del__ In-Reply-To: <4518F979.50902@canterbury.ac.nz> References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com> <008901c6de94$d2072ed0$4bbd2997@bagio> <20060922235602.GA3427@panix.com> <6a36e7290609221735hcbd3df2ne41406323ce5fd72@mail.gmail.com> <039d01c6def1$46df1ef0$4bbd2997@bagio> <6a36e7290609230222w1fe8dfaam4780a1fd81481cd0@mail.gmail.com> <03bb01c6def4$257b6c70$4bbd2997@bagio> <4518693F.1050500@ewtllc.com> <120f01c6e14a$15edbfd0$4bbd2997@bagio> <4518F979.50902@canterbury.ac.nz> Message-ID: On 9/26/06, Greg Ewing wrote: > Giovanni Bajo wrote: > > I assert that they hide bugs much better than > > __del__ does (it's pretty easy to find an offending __del__ by looking at > > gc.garbage, > It should be feasible to modify the cyclic GC to > detect groups of objects that are only being kept > alive by references from the finalizer list. This would let you use a bound method again, but ... Given this complexity, what advantage would it have over __del__, let alone __close__? 
-jJ

From jimjjewett at gmail.com  Tue Sep 26 15:30:01 2006
From: jimjjewett at gmail.com (Jim Jewett)
Date: Tue, 26 Sep 2006 09:30:01 -0400
Subject: [Python-3000] Removing __del__
In-Reply-To: <1f7befae0609260401g4d6a309asc8e1a11c56f0d9ec@mail.gmail.com>
References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com> <008901c6de94$d2072ed0$4bbd2997@bagio> <20060922235602.GA3427@panix.com> <6a36e7290609221735hcbd3df2ne41406323ce5fd72@mail.gmail.com> <039d01c6def1$46df1ef0$4bbd2997@bagio> <6a36e7290609230222w1fe8dfaam4780a1fd81481cd0@mail.gmail.com> <03bb01c6def4$257b6c70$4bbd2997@bagio> <4518693F.1050500@ewtllc.com> <120f01c6e14a$15edbfd0$4bbd2997@bagio> <1f7befae0609260401g4d6a309asc8e1a11c56f0d9ec@mail.gmail.com>
Message-ID:

On 9/26/06, Tim Peters wrote:
> [Giovanni Bajo]
>> You say that __del__ is hazardous, but I can't see how
>> weakrefs are less hazardous. As an implicit finalization method, they live on
>> the fragile assumption that the callback won't hold a reference to the object:

> Nope, not so.

I think you read "live" as "not trash", but in this particular sentence, he meant it as "be useful".

> Read Modules/gc_weakref.txt for the gory details. In
> outline, there are three objects of interest here: the weakly
> referenced object (WO), the weakref (WR) to the WO, and the callback
> (CB) callable attached to the WR.

> /Normally/ the CB is reachable (== not trash).

(Otherwise it can't act as a finalizer, because it isn't around.)

> If a reachable CB has
> a strong reference to the WO, then that keeps the WO reachable too,

So it doesn't act as a finalizer; it acts as an immortalizer. All the pain of __del__, and it takes only one to make a loop. (Bound methods are in this category.)

> OTOH, if the CB has a weak reference to the WO, then when the WO goes
> away and the CB is invoked, the CB's weak reference returns None
> instead of the WO.

So it still can't act as a proper finalizer, if only because it isn't fast enough.

-jJ

From rasky at develer.com  Tue Sep 26 15:32:10 2006
From: rasky at develer.com (Giovanni Bajo)
Date: Tue, 26 Sep 2006 15:32:10 +0200
Subject: [Python-3000] Removing __del__
References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com> <008901c6de94$d2072ed0$4bbd2997@bagio> <20060922235602.GA3427@panix.com> <6a36e7290609221735hcbd3df2ne41406323ce5fd72@mail.gmail.com> <039d01c6def1$46df1ef0$4bbd2997@bagio> <6a36e7290609230222w1fe8dfaam4780a1fd81481cd0@mail.gmail.com> <03bb01c6def4$257b6c70$4bbd2997@bagio> <4518693F.1050500@ewtllc.com> <120f01c6e14a$15edbfd0$4bbd2997@bagio> <1f7befae0609260401g4d6a309asc8e1a11c56f0d9ec@mail.gmail.com>
Message-ID: <033901c6e170$2b8103e0$e303030a@trilan>

Jim Jewett wrote:
>>> You say that __del__ is hazardous, but I can't see how
>>> weakrefs are less hazardous. As an implicit finalization method,
>>> they live on the fragile assumption that the callback won't hold a
>>> reference to the object:
>
>> Nope, not so.
>
> I think you read "live" as "not trash", but in this particular
> sentence, he meant it as "be useful".

Yes. Sorry for my bad English...
-- 
Giovanni Bajo

From ncoghlan at gmail.com  Tue Sep 26 16:12:10 2006
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Wed, 27 Sep 2006 00:12:10 +1000
Subject: [Python-3000] Removing __del__
In-Reply-To: <120f01c6e14a$15edbfd0$4bbd2997@bagio>
References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com> <008901c6de94$d2072ed0$4bbd2997@bagio> <20060922235602.GA3427@panix.com> <6a36e7290609221735hcbd3df2ne41406323ce5fd72@mail.gmail.com> <039d01c6def1$46df1ef0$4bbd2997@bagio> <6a36e7290609230222w1fe8dfaam4780a1fd81481cd0@mail.gmail.com> <03bb01c6def4$257b6c70$4bbd2997@bagio> <4518693F.1050500@ewtllc.com> <120f01c6e14a$15edbfd0$4bbd2997@bagio>
Message-ID: <4519353A.2030103@gmail.com>

Giovanni Bajo wrote:
> Raymond Hettinger wrote:
>
>> In short, __del__ should disappear not because it is useless but
>> because it is hazardous. The consenting adults philosophy means that we don't
>> put up artificial barriers to intentional hacks, but it does not mean
>> that we bait the hook and leave error-prone traps for the unwary. In
>> Py3k, I would like to see explicit finalization as a preferred approach
>> and for weakrefs to be the one-way-to-do-it for designs with implicit
>> finalization.
>
> Raymond, there is one thing I don't understand in your line of reasoning. You
> say that you prefer explicit finalization, but that implicit finalization still
> needs to be supported. And for that, you'd rather drop __del__ and use
> weakrefs. But why? You say that __del__ is hazardous, but I can't see how
> weakrefs are less hazardous.

As I see it, __del__ is more hazardous because it's an attractive nuisance - it *looks* like it should be easy to use, but I'm willing to bet that a lot of the __del__ methods implemented in the wild are either actual or potential bugs. For example, it would be easy for a maintenance programmer to make a change that adds a reference in a data structure from a child node back to its parent node to address a problem, and suddenly the application's memory usage goes through the roof due to uncollectable cycles.

Even the initial implementation of the generator __del__ slot in the *Python 2.5 core* was buggy, leading to such cycles - if the developers of the Python interpreter find it hard to get __del__ right, then there's something seriously wrong with it in its current form.

By explicitly stating that __del__ will go away in Py3k, with the current intent being to replace it with explicit finalization (via with statements) and the implicit finalization offered by weakref callbacks, it encourages people to look for ways to make the API for the latter easier to use.

For example, a "finalizer" factory function could be added to weakref:

import weakref

_finalizer_refs = set()

def finalizer(*args, **kwds):
    """Create a finalizer from an object, callback and keyword dictionary"""
    # Use positional args and a closure to avoid namespace collisions
    obj, callback = args
    def _finalizer(_ref=None):
        """Callable that invokes the finalization callback"""
        # Use closure to get at weakref to allow direct invocation
        # This creates a cycle, so this approach relies on cyclic GC
        # to clean up the finalizer objects!
        try:
            _finalizer_refs.remove(ref)
        except KeyError:
            pass
        else:
            callback(_finalizer)
    # Give callback access to keyword arguments
    _finalizer.__dict__ = kwds
    ref = weakref.ref(obj, _finalizer)
    _finalizer_refs.add(ref)
    return _finalizer

Example usage:

from weakref import finalizer

class Wrapper(object):
    def __init__(self, x=1):
        self._data = finalizer(self, self.finalize, x=x)
    @staticmethod
    def finalize(data):
        print "Finalizing: value=%s!" % data.x
    def get_value(self):
        return self._data.x
    def increment(self, by=1):
        self._data.x += by
    def close(self):
        self._data()   # Explicitly invoke the finalizer
        self._data = None

>>> test = Wrapper()
>>> test.get_value()
1
>>> test.increment(2)
>>> test.get_value()
3
>>> del test
Finalizing: value=3!
>>> test = Wrapper()
>>> test.get_value()
1
>>> test.increment(2)
>>> test.get_value()
3
>>> test.close()
Finalizing: value=3!
>>> del test

For comparison, here's the __del__ based version (which has the downside of potentially giving the cyclic GC fits if other attributes are added to the object):

class Wrapper(object):
    def __init__(self, x=1):
        self._x = x
    def __del__(self):
        if self._x is not None:
            print "Finalizing: value=%s!" % self._x
    def get_value(self):
        return self._x
    def increment(self, by=1):
        self._x += by
    def close(self):
        self.__del__()
        self._x = None

Not counting the import line, both versions are 13 lines long (granted, the weakref version would be a bit longer if the finalizer needed access to public attributes - in that case, the weakref version would need to use properties to hide the existence of the finalizer object).

Cheers,
Nick.

P.S. the central finalizers list also works a treat for debugging why objects aren't getting finalized as expected - a simple loop like "for wr in weakref.finalizers: print gc.get_referrers(wr)" after a gc.collect() call works pretty well.

-- 
Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia
---------------------------------------------------------------
            http://www.boredomandlaziness.org

From jimjjewett at gmail.com  Tue Sep 26 16:12:15 2006
From: jimjjewett at gmail.com (Jim Jewett)
Date: Tue, 26 Sep 2006 10:12:15 -0400
Subject: [Python-3000] Removing __del__
In-Reply-To: <87irjahiqi.fsf@qrnik.zagroda>
References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com> <6a36e7290609221735hcbd3df2ne41406323ce5fd72@mail.gmail.com> <039d01c6def1$46df1ef0$4bbd2997@bagio> <6a36e7290609230222w1fe8dfaam4780a1fd81481cd0@mail.gmail.com> <03bb01c6def4$257b6c70$4bbd2997@bagio> <4518693F.1050500@ewtllc.com> <120f01c6e14a$15edbfd0$4bbd2997@bagio> <1f7befae0609260401g4d6a309asc8e1a11c56f0d9ec@mail.gmail.com> <01d301c6e166$070bde90$e303030a@trilan> <87irjahiqi.fsf@qrnik.zagroda>
Message-ID:

On 9/26/06, Marcin 'Qrczak' Kowalczyk wrote:
> "Giovanni Bajo" writes:
>> Guido was proposing a solution with __del__ and CT, which
>> approximately worked this way:
>> - When a CT is detected, any __del__ method is invoked once per
>> instance, in random order.

[Note that this "__del__" is closer to what we've been calling __close__ than to the existing __del__.]

Note that the "at most" part of "once" is already a stronger promise than __close__. That's OK (maybe even helpful) for users, it just makes the implementation harder.

> This means that __del__ [~= __close__] may attempt to use an object which
> has already had its __del__ called.

Yes; this is the most important change between today's __del__ and the proposed __close__.
Today's __del__ doesn't have to defend against messed up subobjects, because it immortalizes them. A __close__ method would need to defend against this, because of the arbitrary ordering.

In practice, close methods already defend against this anyhow, largely because they know that they might be called by __del__ even after being called explicitly.

-jJ

From rasky at develer.com  Tue Sep 26 16:41:52 2006
From: rasky at develer.com (Giovanni Bajo)
Date: Tue, 26 Sep 2006 16:41:52 +0200
Subject: [Python-3000] Removing __del__
References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com> <008901c6de94$d2072ed0$4bbd2997@bagio> <20060922235602.GA3427@panix.com> <6a36e7290609221735hcbd3df2ne41406323ce5fd72@mail.gmail.com> <039d01c6def1$46df1ef0$4bbd2997@bagio> <6a36e7290609230222w1fe8dfaam4780a1fd81481cd0@mail.gmail.com> <03bb01c6def4$257b6c70$4bbd2997@bagio> <4518693F.1050500@ewtllc.com> <120f01c6e14a$15edbfd0$4bbd2997@bagio> <4519353A.2030103@gmail.com>
Message-ID: <047f01c6e179$e7b74d40$e303030a@trilan>

Nick Coghlan wrote:
>> Raymond, there is one thing I don't understand in your line of
>> reasoning. You say that you prefer explicit finalization, but that
>> implicit finalization still needs to be supported. And for that,
>> you'd rather drop __del__ and use weakrefs. But why? You say that
>> __del__ is hazardous, but I can't see how weakrefs are less
>> hazardous.
>
> As I see it, __del__ is more hazardous because it's an attractive
> nuisance - it *looks* like it should be easy to use, but I'm willing
> to bet that a lot of the __del__ methods implemented in the wild are
> either actual or potential bugs. For example, it would be easy for a
> maintenance programmer to make a change that adds a reference in a
> data structure from a child node back to its parent node to address a
> problem, and suddenly the application's memory usage goes through the
> roof due to uncollectable cycles.

Is that easier or harder to detect such a cycle, compared to accidentally adding a reference to self (through implicit nested scopes, or bound methods) in the finalizer callback? You have to admit that, at best, they are equally hazardous.

As things stand *now* (in Python 2.5 I mean), __del__ is easier to understand/teach, easier to debug (gc.garbage vs finalizers silently ignored), and easier to use (no boilerplate in user's code, no additional finalization API, which does not even exist). I saw numerous proposals to address these weakref "defects" by adding some kind of finalizer API, by modifying the GC to put uncollectable loops with weakref finalizers in gc.garbage, and so on. Most finalization APIs (including yours) create cycles just by using them, which also means that you *must* wait for the GC to kick in before the object is finalized, making it useless for several situations where you want implicit finalizations to happen immediately (file.close() just to name one). [And we are speaking of implicit finalization now; I know of 'with'.]

It would require some effort to make weakref finalizers *barely* as usable as __del__, and will absolutely not solve the problem per-se: the user will still have to pay attention and understand the hoops (different kind of hoops, but still hoops). So, why do we not spend this same time trying to *fix* __del__ instead? If somebody comes up with a sane way to define the semantics for a new finalizer method (like the __close__ proposal), which can be invoked *even* in the case of cycles, would you still prefer to go the weakref way?
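For concreteness, the kind of __close__ method I have in mind would look like this (a sketch against the *proposed* protocol, of course -- __close__ does not exist today, and CAPI is the same imaginary extension module as in my earlier example):

class Wrapper:
    def __init__(self, *args):
        self.handle = CAPI.init(*args)
    def __close__(self):
        # May run in arbitrary order within a cycle, but exactly once,
        # and always before __dict__ is reclaimed.
        if self.handle is not None:
            CAPI.close(self.handle)
            self.handle = None

No cycle-avoidance gymnastics are needed: whatever order the GC picks, each handle is released exactly once.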
> Even the initial implementation of
> the generator __del__ slot in the *Python 2.5 core* was buggy,
> leading to such cycles - if the developers of the Python interpreter
> find it hard to get __del__ right, then there's something seriously
> wrong with it in its current form.

I don't think it's a fair comparison: a generator is a pretty complex class, compared to an average class developed in Python which might need a __del__ method. I would also bet that you would get your first attempt at finalizing generators through weakrefs wrong.

> By explicitly stating that __del__ will go away in Py3k, with the
> current intent being to replace it with explicit finalization (via
> with statements) and the implicit finalization offered by weakref
> callbacks, it encourages people to look for ways to make the API for
> the latter easier to use.
>
> For example, a "finalizer" factory function could be added to weakref:
>
> _finalizer_refs = set()
> def finalizer(*args, **kwds):
>     """Create a finalizer from an object, callback and keyword dictionary"""
>     # Use positional args and a closure to avoid namespace collisions
>     obj, callback = args
>     def _finalizer(_ref=None):
>         """Callable that invokes the finalization callback"""
>         # Use closure to get at weakref to allow direct invocation
>         # This creates a cycle, so this approach relies on cyclic GC
>         # to clean up the finalizer objects!
>         try:
>             _finalizer_refs.remove(ref)
>         except KeyError:
>             pass
>         else:
>             callback(_finalizer)
>     # Give callback access to keyword arguments
>     _finalizer.__dict__ = kwds
>     ref = weakref.ref(obj, _finalizer)
>     _finalizer_refs.add(ref)
>     return _finalizer

So uhm, am I reading it wrong, or does your implementation (like any other similar API I have seen till now) create a cycle *just* by using it? This finalizer API obfuscates user code by forcing the use of a separate _data object to hold (part of) the context for apparently no good reason, and makes the object collectable *only* through the cyclic GC (while __del__ would happily be invoked in simple cases when the object goes out of context).

> P.S. the central finalizers list also works a treat for debugging why
> objects aren't getting finalized as expected - a simple loop like
> "for wr in weakref.finalizers: print gc.get_referrers(wr)" after a
> gc.collect() call works pretty well.

Yes, this is indeed interesting. One step closer to the __del__ feature set :)

-- 
Giovanni Bajo

From rrr at ronadam.com  Tue Sep 26 16:45:03 2006
From: rrr at ronadam.com (Ron Adam)
Date: Tue, 26 Sep 2006 09:45:03 -0500
Subject: [Python-3000] Removing __del__
In-Reply-To: <01d301c6e166$070bde90$e303030a@trilan>
References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com> <008901c6de94$d2072ed0$4bbd2997@bagio> <20060922235602.GA3427@panix.com> <6a36e7290609221735hcbd3df2ne41406323ce5fd72@mail.gmail.com> <039d01c6def1$46df1ef0$4bbd2997@bagio> <6a36e7290609230222w1fe8dfaam4780a1fd81481cd0@mail.gmail.com> <03bb01c6def4$257b6c70$4bbd2997@bagio> <4518693F.1050500@ewtllc.com> <120f01c6e14a$15edbfd0$4bbd2997@bagio> <1f7befae0609260401g4d6a309asc8e1a11c56f0d9ec@mail.gmail.com> <01d301c6e166$070bde90$e303030a@trilan>
Message-ID:

Giovanni Bajo wrote:
> Since we are discussing Py3k here, I believe it is the right time to revive
> this discussion. The __close__ proposal I'm backing (summed up in this mail:
> http://mail.python.org/pipermail/python-3000/2006-September/003892.html) is
> pretty similar to how Guido was proposing to modify __del__.
> If there are technical grounds for this (and my opinion does not matter
> much, but Guido was proposing the same thing, which kind of gives me hope
> in this regard), I believe it would be a far superior solution for the
> problem of implicit finalization in the presence of CT in Py3k.
>
> I think the idea is that, if you make sure that a __close__ method is called
> exactly once (and before __dict__ is reclaimed), it really does not matter
> much in which order you call __close__ methods within the CT. I mean, it
> *might* matter for already-written in-the-wild __del__ methods of course,
> but it sounds like a *very* reasonable constraint for Py3k's __close__
> methods. I would like to see real-world examples where calling __close__
> in random order breaks things.

How about...? (This isn't an area I'm real familiar with.)

Replace __del__ with: a __final__ method and a __finalized__ flag (or other equivalent names).

Insist on explicit finalizing by raising an exception if an object's __finalized__ flag is still False when it loses its last reference.

Would this be difficult to do in a timely way so the traceback is meaningful?

Would this avoid the problems being discussed with both __del__ and weakrefs?

Ron

From ncoghlan at gmail.com  Tue Sep 26 17:41:58 2006
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Wed, 27 Sep 2006 01:41:58 +1000
Subject: [Python-3000] Removing __del__
In-Reply-To: <047f01c6e179$e7b74d40$e303030a@trilan>
References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com> <008901c6de94$d2072ed0$4bbd2997@bagio> <20060922235602.GA3427@panix.com> <6a36e7290609221735hcbd3df2ne41406323ce5fd72@mail.gmail.com> <039d01c6def1$46df1ef0$4bbd2997@bagio> <6a36e7290609230222w1fe8dfaam4780a1fd81481cd0@mail.gmail.com> <03bb01c6def4$257b6c70$4bbd2997@bagio> <4518693F.1050500@ewtllc.com> <120f01c6e14a$15edbfd0$4bbd2997@bagio> <4519353A.2030103@gmail.com> <047f01c6e179$e7b74d40$e303030a@trilan>
Message-ID: <45194A46.90406@gmail.com>

Giovanni Bajo wrote:
> It would require some effort to make weakref finalizers *barely* as usable
> as __del__, and will absolutely not solve the problem per-se: the user will
> still have to pay attention and understand the hoops (different kind of
> hoops, but still hoops). So, why do we not spend this same time trying to
> *fix* __del__ instead? If somebody comes up with a sane way to define the
> semantics for a new finalizer method (like the __close__ proposal), which can
> be invoked *even* in the case of cycles, would you still prefer to go the
> weakref way?
The weakly referenced object (WO) itself isn't part of the cycle and gets finalized at the first opportunity after its reference count goes to zero (as shown in my example - the finalizer ran without having to call gc.collect() first). And don't forget that in non-refcounting implementations like Jython, IronPython and some flavours of PyPy, even non-cyclic garbage is collected through the GC mechanism at an arbitrary time after the last reference is released. If you need prompt finalization (for activities such as closing file handles or database connections), that's the whole reason the 'with' statement was added in Python 2.5. All that aside, my example finalizer API only took an hour or two to write, compared to the significant amount of effort that has gone into the current __del__ implementation. There are actually a number of ways to write weakref based finalization that avoid that WR-CB cycle I used, but considering the trade-offs between those approaches is a lot more than a two-hour project (and, not the least bit incidentally, not an assessment I would really want to make on my own ;). > This finalizer > API ofhuscates user code by forcing to use a separate _data object to hold > (part of) the context for apparently no good reason, and make the object > collectable *only* through the cyclic GC (while __del__ would happily be > invoked in simple cases when the object goes out of context). It stores part of the context in a separate object for an *excellent* reason - it identifies clearly to the Python interpreter *which* parts of the object the finalizer can access. The biggest problem with __del__ is that it *doesn't* make that distinction, so the interpreter is forced to assume the finalizer might touch any part of the object (including the object itself), leading to all of the insanity with self-resurrection and the need to relegate things to gc.garbage. With a weakref-based approach, you only end up with two possible scenarios: 1. Object gets trashed and finalized 2. Object is kept immortal by a strong reference from the callback in the list of finalizers By avoiding teaching people that care about finalization the important distinction between "things the finalizer can get at" and "things the object can get at but the finalizer can't", you aren't doing them any favours, because maintaining that distinction is the easiest way to avoid creating uncollectable cycles (i.e. by making sure the finalizer can't get at the other objects that might reference back to the current one). Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia --------------------------------------------------------------- http://www.boredomandlaziness.org From rrr at ronadam.com Tue Sep 26 17:32:27 2006 From: rrr at ronadam.com (Ron Adam) Date: Tue, 26 Sep 2006 10:32:27 -0500 Subject: [Python-3000] Removing __del__ In-Reply-To: References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com> <008901c6de94$d2072ed0$4bbd2997@bagio> <20060922235602.GA3427@panix.com> <6a36e7290609221735hcbd3df2ne41406323ce5fd72@mail.gmail.com> <039d01c6def1$46df1ef0$4bbd2997@bagio> <6a36e7290609230222w1fe8dfaam4780a1fd81481cd0@mail.gmail.com> <03bb01c6def4$257b6c70$4bbd2997@bagio> <4518693F.1050500@ewtllc.com> <120f01c6e14a$15edbfd0$4bbd2997@bagio> <1f7befae0609260401g4d6a309asc8e1a11c56f0d9ec@mail.gmail.com> <01d301c6e166$070bde90$e303030a@trilan> Message-ID: This was bit too brief I think... Ron Adam wrote: > How about...? (This isn't an area I'm real familiar with.) 
> > > Replace __del__ with: > > a __final__ method and a __finalized__ flag. (or other equivalent names) The __final__ method would need to be explicitly called, and the __finalized__ flag could be set either by the interpreter or the __final__ method when __final__ is called. __final__ would never be called implicitly by the interpreter. > Insist on explicit finalizing by raising an exception if an objects > __finalize__ flag is still False when it looses it's last reference. > > > Would this be difficult to do in a timely way so the traceback is meaningful? > > Would this avoid the problems being discussed with both __del__ and weak refs? > > > Ron Maybe just adding only an optional __finalized__ flag, that when False forces an exception if an object looses it's references, might be enough. I think.... It's not the actual closing/finishing/etc... that is the problem, it's the detecting when closing/finishing/etc... is not done that is the problem. Cheers, Ron From martin at v.loewis.de Tue Sep 26 21:14:29 2006 From: martin at v.loewis.de (=?UTF-8?B?Ik1hcnRpbiB2LiBMw7Z3aXMi?=) Date: Tue, 26 Sep 2006 21:14:29 +0200 Subject: [Python-3000] How will unicode get used? In-Reply-To: <1cb725390609250850w51903f00w148b750afdae9ee8@mail.gmail.com> References: <45152234.1090303@v.loewis.de> <4517194D.1030908@nekomancer.net> <20060924210217.0873.JCARLSON@uci.edu> <1cb725390609250850w51903f00w148b750afdae9ee8@mail.gmail.com> Message-ID: <45197C15.9040005@v.loewis.de> Paul Prescod schrieb: > There is at least one big difference between surrogate pairs and > decomposed characters. The user can typically normalize away > decompositions. How do you normalize away decompositions in a language > that only supports 16-bit representations? I don't see the problem: You use UTF-16; all normal forms (NFC, NFD, NFKC, NFKD) can be represented in UTF-16 just fine. It is somewhat tricky to implement a normalization algorithm in UTF-16, since you must combine surrogate pairs first in order to find out what the canonical decomposition of the code point is; but it's just more code, and no problem in principle. Regards, Martin From qrczak at knm.org.pl Tue Sep 26 21:20:24 2006 From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk) Date: Tue, 26 Sep 2006 21:20:24 +0200 Subject: [Python-3000] How will unicode get used? In-Reply-To: <45197C15.9040005@v.loewis.de> (Martin v. =?iso-8859-2?q?L=F6wis's?= message of "Tue, 26 Sep 2006 21:14:29 +0200") References: <45152234.1090303@v.loewis.de> <4517194D.1030908@nekomancer.net> <20060924210217.0873.JCARLSON@uci.edu> <1cb725390609250850w51903f00w148b750afdae9ee8@mail.gmail.com> <45197C15.9040005@v.loewis.de> Message-ID: <87ejtyjuyv.fsf@qrnik.zagroda> "Martin v. L?wis" writes: > It is somewhat tricky to implement a normalization algorithm in > UTF-16, since you must combine surrogate pairs first in order to > find out what the canonical decomposition of the code point is; > but it's just more code, and no problem in principle. The same issue is with virtually any algorithm: more code, more complex code is needed with UTF-16 than with UTF-32. -- __("< Marcin Kowalczyk \__/ qrczak at knm.org.pl ^^ http://qrnik.knm.org.pl/~qrczak/ From martin at v.loewis.de Tue Sep 26 21:25:08 2006 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Tue, 26 Sep 2006 21:25:08 +0200 Subject: [Python-3000] How will unicode get used? 
In-Reply-To: References: <4511E644.2030306@blueyonder.co.uk> <451523DC.2050901@v.loewis.de> <20060923104310.0863.JCARLSON@uci.edu> <45158830.8020908@v.loewis.de> <4516B2D0.9020109@v.loewis.de> Message-ID: <45197E94.3010502@v.loewis.de> Fredrik Lundh schrieb: >> I believe it would noticeably simplify the implementation if there is > > only a single internal representation. > > and I, wearing my string algorithm implementor hat, tend to disagree > with that. writing source code that can be compiled into efficient code > for multiple representations is mostly trivial, even in C. I wouldn't call SRE's macro trickeries "trivial", though. Regards, Martin From paul at prescod.net Tue Sep 26 22:44:07 2006 From: paul at prescod.net (Paul Prescod) Date: Tue, 26 Sep 2006 13:44:07 -0700 Subject: [Python-3000] How will unicode get used? In-Reply-To: <45197C15.9040005@v.loewis.de> References: <45152234.1090303@v.loewis.de> <4517194D.1030908@nekomancer.net> <20060924210217.0873.JCARLSON@uci.edu> <1cb725390609250850w51903f00w148b750afdae9ee8@mail.gmail.com> <45197C15.9040005@v.loewis.de> Message-ID: <1cb725390609261344m51297926tac13968f33eaee82@mail.gmail.com> I misspoke. I meant to ask: "How do you normalize away surrogate pairs in UTF-16?" It was a rhetorical question. The point was just that decomposed characters can be handled by implicit or explicit normalization. Surrogate pairs can only be similarly normalized away if your model allows you to represent their normalized forms. A UTF-16 characters model would not. On 9/26/06, "Martin v. L?wis" wrote: > > Paul Prescod schrieb: > > There is at least one big difference between surrogate pairs and > > decomposed characters. The user can typically normalize away > > decompositions. How do you normalize away decompositions in a language > > that only supports 16-bit representations? > > I don't see the problem: You use UTF-16; all normal forms (NFC, NFD, > NFKC, NFKD) can be represented in UTF-16 just fine. > > It is somewhat tricky to implement a normalization algorithm in > UTF-16, since you must combine surrogate pairs first in order to > find out what the canonical decomposition of the code point is; > but it's just more code, and no problem in principle. > > Regards, > Martin > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.python.org/pipermail/python-3000/attachments/20060926/717e1ceb/attachment.html From greg.ewing at canterbury.ac.nz Wed Sep 27 02:36:14 2006 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Wed, 27 Sep 2006 12:36:14 +1200 Subject: [Python-3000] Removing __del__ In-Reply-To: <87lko63jes.fsf@qrnik.zagroda> References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com> <008901c6de94$d2072ed0$4bbd2997@bagio> <20060922235602.GA3427@panix.com> <6a36e7290609221735hcbd3df2ne41406323ce5fd72@mail.gmail.com> <039d01c6def1$46df1ef0$4bbd2997@bagio> <6a36e7290609230222w1fe8dfaam4780a1fd81481cd0@mail.gmail.com> <03bb01c6def4$257b6c70$4bbd2997@bagio> <4518693F.1050500@ewtllc.com> <120f01c6e14a$15edbfd0$4bbd2997@bagio> <1f7befae0609260401g4d6a309asc8e1a11c56f0d9ec@mail.gmail.com> <87lko63jes.fsf@qrnik.zagroda> Message-ID: <4519C77E.6020503@canterbury.ac.nz> Marcin 'Qrczak' Kowalczyk wrote: > "It's a feature of Python's weakrefs too that when a weakref goes > away, the callback (if any) associated with it is thrown away too, > unexecuted." > > I disagree with this choice. Doesn't it prevent weakrefs to be used as > finalizers? 
No, it's quite possible to build a finalization mechanism on top of weakrefs. To register a finalizer F for an object O, you create a weak reference W to O and store it in a global list. You give W a callback that invokes F and then removes W from the global list. Now there's no way that W can go away before its callback is invoked, since that's the only thing that removes it from the global list. Furthermore, if the user makes a mistake and registers a function F that references its own object O, directly or indirectly, then eventually we will be left with a cycle that's only being kept alive from the global list via W and its callback. The cyclic GC can detect this situation and move the cycle to a garbage list or otherwise alert the user. I don't believe that this mechanism would be any harder to use *correctly* than __del__ methods currently are, and mistakes made in using it would be no harder to debug. -- Greg From greg.ewing at canterbury.ac.nz Wed Sep 27 02:36:21 2006 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Wed, 27 Sep 2006 12:36:21 +1200 Subject: [Python-3000] Removing __del__ In-Reply-To: References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com> <008901c6de94$d2072ed0$4bbd2997@bagio> <20060922235602.GA3427@panix.com> <6a36e7290609221735hcbd3df2ne41406323ce5fd72@mail.gmail.com> <039d01c6def1$46df1ef0$4bbd2997@bagio> <6a36e7290609230222w1fe8dfaam4780a1fd81481cd0@mail.gmail.com> <03bb01c6def4$257b6c70$4bbd2997@bagio> <4518693F.1050500@ewtllc.com> <120f01c6e14a$15edbfd0$4bbd2997@bagio> <4518F979.50902@canterbury.ac.nz> Message-ID: <4519C785.4010707@canterbury.ac.nz> Jim Jewett wrote: > Given this complexity, what advantage would it have over __del__, let > alone __close__? It wouldn't constitute an attractive nuisance, since it would force you to think about which pieces of information the finalizer really needs. This is something you need to do anyway if you're to ensure you don't get into trouble using __del__. The supposed "easiness" of __del__ is really just sloppiness that will turn around and bite you eventually (if you'll excuse the mixed metaphor). -- Greg From greg.ewing at canterbury.ac.nz Wed Sep 27 02:36:35 2006 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Wed, 27 Sep 2006 12:36:35 +1200 Subject: [Python-3000] Removing __del__ In-Reply-To: <047f01c6e179$e7b74d40$e303030a@trilan> References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com> <008901c6de94$d2072ed0$4bbd2997@bagio> <20060922235602.GA3427@panix.com> <6a36e7290609221735hcbd3df2ne41406323ce5fd72@mail.gmail.com> <039d01c6def1$46df1ef0$4bbd2997@bagio> <6a36e7290609230222w1fe8dfaam4780a1fd81481cd0@mail.gmail.com> <03bb01c6def4$257b6c70$4bbd2997@bagio> <4518693F.1050500@ewtllc.com> <120f01c6e14a$15edbfd0$4bbd2997@bagio> <4519353A.2030103@gmail.com> <047f01c6e179$e7b74d40$e303030a@trilan> Message-ID: <4519C793.2050800@canterbury.ac.nz> Giovanni Bajo wrote: > Is that easier or harder to detect such a cycle, compared to accidentally > adding a reference to self (through implicit nested scopes, or bound > methods) in the finalizer callback? I would put a notice in the docs strongly recommending that only global functions be registered as finalizers, not nested functions or bound methods. While not strictly necessary (or sufficient) for safety, following this guideline would greatly reduce the chance of accidentally creating troublesome cycles, I think. 
And if you did accidentally create such a cycle, it seems to me it would be much easier to fix than if you were using __del__, since you only need to make an adjustment to the parameter list of the finalizer. With __del__, you need to refactor your whole finalization strategy and create another object to do the finalization, which is a much bigger upheaval. > Most finalization APIs (including yours) create > cycles just by using them, which also mean that you *must* wait for the GC > to kick in before the object is finalized No, a weakref-based finalizer will kick in just as soon as __del__ would. I don't know what makes you think otherwise. > will absolutely not solve the problem per-se: the user will > still have to pay attention and understand the hoops Certainly, but it will make it much more obvious that the hoops are there in the first place, and exactly where and what shape they are. > So, why do we not spend this same time trying to > *fix* __del__ instead? So far nobody has found a *way* to fix __del__ (really fix it, that is, not just paper over the cracks). And a lot of smart people have given it a lot of thought over the years. If someone comes up with a way some day, we can always put __del__ back in. But I don't feel like holding my breath waiting for that to happen, when we have something else that we know will work. >> # Use closure to get at weakref to allow direct invocation >> # This creates a cycle, so this approach relies on cyclic GC >> # to clean up the finalizer objects! This implementation is broken. There's no need to create any such cycle. -- Greg From greg.ewing at canterbury.ac.nz Wed Sep 27 02:36:41 2006 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Wed, 27 Sep 2006 12:36:41 +1200 Subject: [Python-3000] Removing __del__ In-Reply-To: <45194A46.90406@gmail.com> References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com> <008901c6de94$d2072ed0$4bbd2997@bagio> <20060922235602.GA3427@panix.com> <6a36e7290609221735hcbd3df2ne41406323ce5fd72@mail.gmail.com> <039d01c6def1$46df1ef0$4bbd2997@bagio> <6a36e7290609230222w1fe8dfaam4780a1fd81481cd0@mail.gmail.com> <03bb01c6def4$257b6c70$4bbd2997@bagio> <4518693F.1050500@ewtllc.com> <120f01c6e14a$15edbfd0$4bbd2997@bagio> <4519353A.2030103@gmail.com> <047f01c6e179$e7b74d40$e303030a@trilan> <45194A46.90406@gmail.com> Message-ID: <4519C799.8000603@canterbury.ac.nz> Nick Coghlan wrote: > the weakref (WR) and the callback (CB) are in a > cycle with each other, so even after CB is invoked and removes WR from the > global list of finalizers, the two objects won't go away until the next GC > collection cycle. The CB can drop its reference to the WR when it's invoked. 
-- Greg From ncoghlan at gmail.com Wed Sep 27 03:36:15 2006 From: ncoghlan at gmail.com (Nick Coghlan) Date: Wed, 27 Sep 2006 11:36:15 +1000 Subject: [Python-3000] Removing __del__ In-Reply-To: <4519C793.2050800@canterbury.ac.nz> References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com> <008901c6de94$d2072ed0$4bbd2997@bagio> <20060922235602.GA3427@panix.com> <6a36e7290609221735hcbd3df2ne41406323ce5fd72@mail.gmail.com> <039d01c6def1$46df1ef0$4bbd2997@bagio> <6a36e7290609230222w1fe8dfaam4780a1fd81481cd0@mail.gmail.com> <03bb01c6def4$257b6c70$4bbd2997@bagio> <4518693F.1050500@ewtllc.com> <120f01c6e14a$15edbfd0$4bbd2997@bagio> <4519353A.2030103@gmail.com> <047f01c6e179$e7b74d40$e303030a@trilan> <4519C793.2050800@canterbury.ac.nz> Message-ID: <4519D58F.5070103@gmail.com> Greg Ewing wrote: >>> # Use closure to get at weakref to allow direct invocation >>> # This creates a cycle, so this approach relies on cyclic GC >>> # to clean up the finalizer objects! > > This implementation is broken. There's no need > to create any such cycle. I know, but it was late and my brain wasn't up to the job of getting rid of it :) Here's a pretty easy way to fix it to avoid relying on the cyclic GC (actually based on your other message about explicitly breaking the cycle when the finalizer is invoked): _finalizer_refs = set() def finalizer(*args, **kwds): """Create a finalizer from an object, callback and keyword dictionary""" # Use positional args and a closure to avoid namespace collisions obj, callback = args def _finalizer(_ref=None): """Callable that invokes the finalization callback""" # Use closure to get at weakref to allow direct invocation try: ref = boxed_ref.pop() except IndexError: pass else: _finalizer_refs.remove(ref) callback(_finalizer) # Give callback access to keyword arguments _finalizer.__dict__ = kwds boxed_ref = [weakref.ref(obj, _finalizer)] _finalizer_refs.add(boxed_ref[0]) return _finalizer Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia --------------------------------------------------------------- http://www.boredomandlaziness.org From jcarlson at uci.edu Thu Sep 28 01:32:33 2006 From: jcarlson at uci.edu (Josiah Carlson) Date: Wed, 27 Sep 2006 16:32:33 -0700 Subject: [Python-3000] How will unicode get used? In-Reply-To: <45176886.2090201@v.loewis.de> References: <20060924144006.086D.JCARLSON@uci.edu> <45176886.2090201@v.loewis.de> Message-ID: <20060927153914.089D.JCARLSON@uci.edu> "Martin v. L?wis" wrote: > > Josiah Carlson schrieb: > > What about a tree structure over the top of the string as I described in > > another post? If there are no surrogate pairs, the pointer to the tree > > is null. If there are surrogate pairs, we could either use the > > structure as I described, or even modify it so that we get even better > > memory utilization/performance (choose tree nodes based on where > > surrogate pairs are, up to some limit). > > As always, it's a time-vs-space tradeoff. People tend to resolve these > in favor of time, accepting an increase in space. I'm not so sure this > is always the right answer. In the specific case, I'm also worried about > the increase in complexness. > > That said, it is always good to have a prototype implementation to > analyse the consequences better. I'm away from my main machine at the moment, so I am unable to test my implementation, but I do have a sample. There are two main functions to this implementation. 
One which constructs a tree for O(log n) worst-case access to character addresses, and one which traverses the tree to discover the character address. For strings without surrogates, it's O(1) character address discovery. The implementation of surrogate discovery is very simple, using section 3.8 and 5.4 in the Unicode 4.0 standard. If there are no surrogates, it takes a single pass over the input, and constructs a single node (12 or 24 bytes, depending on the build, need to replace long with Py_ssize_t). If there are surrogates, it creates a block of nodes, adjusts pointers to create a tree, and returns a pointer to the root. The tree will have at most O(n/logn) nodes, though it will tend to create long blocks of non-surrogates, so that if you have a single surrogate in the middle of a huge string, it will be conceptually viewed as 3 blocks. Attached is my untested sample implementation (I'm away for the next week or so, and can't test), that should give an idea of what I was talking about. - Josiah -------------- next part -------------- A non-text attachment was scrubbed... Name: surrogate_tree.c Type: application/octet-stream Size: 4688 bytes Desc: not available Url : http://mail.python.org/pipermail/python-3000/attachments/20060927/8a293362/attachment.obj From martin at v.loewis.de Thu Sep 28 05:21:52 2006 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Thu, 28 Sep 2006 05:21:52 +0200 Subject: [Python-3000] How will unicode get used? In-Reply-To: <20060927153914.089D.JCARLSON@uci.edu> References: <20060924144006.086D.JCARLSON@uci.edu> <45176886.2090201@v.loewis.de> <20060927153914.089D.JCARLSON@uci.edu> Message-ID: <451B3FD0.9030600@v.loewis.de> Josiah Carlson schrieb: > Attached is my untested sample implementation (I'm away for the next > week or so, and can't test), that should give an idea of what I was > talking about. Thanks. It is hard to tell what the impact on the implementation is. For example, ISTM that you have to regenerate the tree each time a new string is created. E.g. if you slice a string, you would have to regenerate the tree for the slice. Right? As for the implementation: If you are using a array-based heap, couldn't you just drop the left and right child pointers, and instead use indices 2*k and 2*k+1 to find the child nodes? This would get down memory overhead significantly; you'd only need the length of the array to determine what a leaf node is. Regards, Martin From jcarlson at uci.edu Thu Sep 28 05:49:38 2006 From: jcarlson at uci.edu (Josiah Carlson) Date: Wed, 27 Sep 2006 20:49:38 -0700 Subject: [Python-3000] How will unicode get used? In-Reply-To: <451B3FD0.9030600@v.loewis.de> References: <20060927153914.089D.JCARLSON@uci.edu> <451B3FD0.9030600@v.loewis.de> Message-ID: <20060927204323.08A4.JCARLSON@uci.edu> "Martin v. L?wis" wrote: > > Josiah Carlson schrieb: > > Attached is my untested sample implementation (I'm away for the next > > week or so, and can't test), that should give an idea of what I was > > talking about. > > Thanks. It is hard to tell what the impact on the implementation is. > For example, ISTM that you have to regenerate the tree each time > a new string is created. E.g. if you slice a string, you would > have to regenerate the tree for the slice. Right? Generally, yes. We could use the pre-existing tree information, but it would probably be simpler (and faster) to scan the string and re-create it. 
Really, one would create the tree when someone wants to access an index for the first time (or during creation, for fewer surprises), then use the index finding function to return the address of character i. > As for the implementation: If you are using a array-based heap, > couldn't you just drop the left and right child pointers, and > instead use indices 2*k and 2*k+1 to find the child nodes? > This would get down memory overhead significantly; you'd only > need the length of the array to determine what a leaf node is. Good point. I had originally malloced each node individually, but I zoned the heap optimization when I went with that style of construction. - Josiah