From sales@lookelu.com Sat Jul 1 17:17:47 2000 From: sales@lookelu.com (The Western Web) Date: Sat, 1 Jul 2000 16:17:47 Subject: [I18n-sig] The Western Web has just finished our new classified ad section Message-ID: <20000701231654.16DBE1CE48@dinsdale.python.org> The Western Web has just finished our new classified ad section. Please check it out and make sure that your classified ad has been moved. We are in the process of moving ads at this time, and would appreciate your help in ensuring that your ad has been moved. If it hasn't been moved, or if you would like to place a new ad, feel free to do so. We have added new sections in the classifieds: hay/feed/shavings, livestock, camelids, cattle, deer and elk, poultry, rabbits, sheep, livestock equipment, swine, donkeys, dogs and mules. We are currently receiving 100 new ads a day, and over 20,000 unique hits a day. http://www.thewesternweb.com The new classified section is automated now and your ads will be posted immediately. You can also add multimedia files (photos, sound and video) online. This is a free service to you, so use it at will. http://www.westernwebclassified.com We have also finished the Western Web Search Engine, which is optimized solely for the western way of life. Please stop by the search engine and add your site. http://www.lookelu.com Our message board is also now up and running, so please use it.
http://www.westernmessageboard.com/cgi-bin/Ultimate.cgi We are sorry for any inconvenience. Thank you, http://www.thewesternweb.com This message is sent in compliance with the proposed e-mail bill S. 1618, Title 3, Section 301, Paragraph (a)(2)(C): http://www.senate.gov/~murkowski/commercialemail/S771index.html Further transmissions to you by the sender of this email may be stopped at no cost to you by sending a reply to this email address with the word "remove" in the subject line. From tdickenson@geminidataloggers.com Fri Jul 7 10:53:39 2000 From: tdickenson@geminidataloggers.com (Toby Dickenson) Date: Fri, 07 Jul 2000 10:53:39 +0100 Subject: [I18n-sig] Unicode experience Message-ID: I'm just nearing the end of getting Zope to play well with unicode data. Most of the changes involved replacing a call to str, in situations where either a unicode or narrow string would be acceptable. My best alternative is:

def convert_to_something_stringlike(x):
    if type(x)==type(u''):
        return x
    else:
        return str(x)

This seems like a fundamental operation - would it be worth having something similar in the standard library? Toby Dickenson tdickenson@geminidataloggers.com From mal@lemburg.com Fri Jul 7 12:15:49 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Fri, 07 Jul 2000 13:15:49 +0200 Subject: [I18n-sig] Unicode experience References: Message-ID: <3965BBE5.D67DD838@lemburg.com> Toby Dickenson wrote: > > I'm just nearing the end of getting Zope to play well with unicode > data. Most of the changes involved replacing a call to str, in > situations where either a unicode or narrow string would be > acceptable.
>
> My best alternative is:
>
> def convert_to_something_stringlike(x):
>     if type(x)==type(u''):
>         return x
>     else:
>         return str(x)
>
> This seems like a fundamental operation - would it be worth having
> something similar in the standard library?

You mean: for Unicode return Unicode and for everything else return strings ? It doesn't fit well with the builtins str() and unicode(). I'd say, make this a userland helper. BTW, could you elaborate a bit on your experience with adding Unicode support to Zope ? Such a report would certainly make a nice complement to the Unicode tutorial and help other people adding Unicode support to their apps. -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From guido@beopen.com Fri Jul 7 13:44:03 2000 From: guido@beopen.com (Guido van Rossum) Date: Fri, 07 Jul 2000 07:44:03 -0500 Subject: [I18n-sig] Unicode experience In-Reply-To: Your message of "Fri, 07 Jul 2000 13:15:49 +0200." <3965BBE5.D67DD838@lemburg.com> References: <3965BBE5.D67DD838@lemburg.com> Message-ID: <200007071244.HAA03694@cj20424-a.reston1.va.home.com> > Toby Dickenson wrote: > > > > I'm just nearing the end of getting Zope to play well with unicode > > data. Most of the changes involved replacing a call to str, in > > situations where either a unicode or narrow string would be > > acceptable.

> > My best alternative is:
> >
> > def convert_to_something_stringlike(x):
> >     if type(x)==type(u''):
> >         return x
> >     else:
> >         return str(x)
> >
> > This seems like a fundamental operation - would it be worth having
> > something similar in the standard library?

Marc-Andre Lemburg replied: > You mean: for Unicode return Unicode and for everything else > return strings ? > > It doesn't fit well with the builtins str() and unicode(). I'd > say, make this a userland helper. I think this would be helpful to have in the std library.
Note that in JPython, you'd already use str() for this, and in Python 3000 this may also be the case. At some point in the design discussion for the current Unicode support we also thought that we wanted str() to do this (i.e. allow 8-bit and Unicode string returns), until we realized that there were too many places that would be very unhappy if str() returned a Unicode string! The problem is similar to a situation you have with numbers: sometimes you want a coercion that converts everything to float except it should leave complex numbers complex. In other words it coerces up to float but it never coerces down to float. Luckily you can write that as "x+0.0" which converts int and long to float with the same value while leaving complex alone. For strings there is no compact notation like "+0.0" if you want to convert to string or Unicode -- adding "" might work in Perl, but not in Python. I propose ustr(x) with the semantics given by Toby. Class support (an __ustr__ method, with fallbacks on __str__ and __unicode__) would also be handy. > BTW, could you elaborate a bit on your experience with adding > Unicode support to Zope ? Such a report would certainly make > a nice complement to the Unicode tutorial and help other > people adding Unicode support to their apps. Yes, that's what we need. Thanks to Toby for pioneering this! --Guido van Rossum (home page: http://dinsdale.python.org/~guido/) From tdickenson@geminidataloggers.com Fri Jul 7 12:53:02 2000 From: tdickenson@geminidataloggers.com (Toby Dickenson) Date: Fri, 7 Jul 2000 12:53:02 +0100 Subject: [I18n-sig] Unicode experience Message-ID: <9FC702711D39D3118D4900902778ADC812855D@JUPITER> > BTW, could you elaborate a bit on your experience with adding > Unicode support to Zope ? Such a report would certainly make > a nice complement to the Unicode tutorial and help other > people adding Unicode support to their apps. I'll write this all up ASAP. For the Zope hackers, I'll post my patches early next week.
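Guido's "coerce up, never down" analogy and the ustr() helper he proposes can be sketched as follows. This is a hypothetical illustration, not code from the thread: `text_type` is an assumed alias (it would be `unicode` on the Python 2 line under discussion; on modern Python 3 the distinction is gone and it is simply `str`).

```python
# "x + 0.0" coerces int (and long) up to float with the same value,
# but leaves complex numbers complex -- it never coerces down:
assert isinstance(3 + 0.0, float)            # int -> float
assert isinstance((1 + 2j) + 0.0, complex)   # complex stays complex

# A ustr() sketch with the semantics Toby described: pass text
# (Unicode) through unchanged, convert everything else with str().
text_type = str  # assumption: would be `unicode` on Python 2

def ustr(x):
    if isinstance(x, text_type):
        return x
    return str(x)
```

There is no string equivalent of the compact `+0.0` spelling, which is why a named helper is needed at all.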
From fredrik@pythonware.com Fri Jul 7 13:30:51 2000 From: fredrik@pythonware.com (Fredrik Lundh) Date: Fri, 7 Jul 2000 14:30:51 +0200 Subject: [Python-Dev] Re: [I18n-sig] Unicode experience References: <3965BBE5.D67DD838@lemburg.com> <200007071244.HAA03694@cj20424-a.reston1.va.home.com> Message-ID: <020001bfe80f$3b23ab10$0900a8c0@SPIFF> guido wrote: > I propose ustr(x) with the semantics given by Toby. +1 on concept. not sure about the name and the semantics. maybe a better name would be "unistr" (to match "unichr"). or maybe that's backwards? how about "String" (!). (the perfect name is "string", but that appears to be reserved by someone else...) as for the semantics, note that __str__ is allowed to return a unicode string in the current code base ("str" converts it to 8-bit using the default encoding). ustr/unistr/String should pass that one right through:

def ustr(s):
    if type(s) in (type(""), type(u"")):
        return s
    s = s.__str__()
    if type(s) in (type(""), type(u"")):
        return s
    raise "__str__ returned wrong type"

> Class support (an __ustr__ method, with fallbacks on __str__ > and __unicode__) would also be handy. -0 on this one (__str__ can already return either type, and if the goal is to get rid of both unichr and unistr in the future, we shouldn't add more hooks if we can avoid it. it's easier to remove stuff if you don't add them in the first place ;-) From mal@lemburg.com Fri Jul 7 13:56:37 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Fri, 07 Jul 2000 14:56:37 +0200 Subject: [Python-Dev] Re: [I18n-sig] Unicode experience References: <3965BBE5.D67DD838@lemburg.com> <200007071244.HAA03694@cj20424-a.reston1.va.home.com> <020001bfe80f$3b23ab10$0900a8c0@SPIFF> Message-ID: <3965D385.68D97C31@lemburg.com> Fredrik Lundh wrote: > > guido wrote: > > I propose ustr(x) with the semantics given by Toby. > > +1 on concept. > > not sure about the name and the semantics. Uhm, what's left then ;-) ? > maybe a better name would be "unistr" (to match "unichr").
> or maybe that's backwards? > > how about "String" (!). > > (the perfect name is "string", but that appears to be reserved > by someone else...) > > as for the semantics, note that __str__ is allowed to return a > unicode string in the current code base ("str" converts it to 8-bit > using the default encoding). ustr/unistr/String should pass > that one right through:
>
> def ustr(s):
>     if type(s) in (type(""), type(u"")):
>         return s
>     s = s.__str__()
>     if type(s) in (type(""), type(u"")):
>         return s
>     raise "__str__ returned wrong type"
>
> > Class support (an __ustr__ method, with fallbacks on __str__ > > and __unicode__) would also be handy. > > -0 on this one (__str__ can already return either type, and if the > goal is to get rid of both unichr and unistr in the future, we shouldn't > add more hooks if we can avoid it. it's easier to remove stuff if you > don't add them in the first place ;-) Agreed. I'm just adding coercion support for instances using that technique: instances defining __str__ can return Unicode objects which will then be used by the implementation wherever coercion to Unicode takes place. I'll add a similar hook to unicode(). -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From tdickenson@geminidataloggers.com Mon Jul 10 12:25:04 2000 From: tdickenson@geminidataloggers.com (Toby Dickenson) Date: Mon, 10 Jul 2000 12:25:04 +0100 Subject: [I18n-sig] New Unicode default encoding scheme In-Reply-To: <3940D05E.9E266396@lemburg.com> References: <3940D05E.9E266396@lemburg.com> Message-ID: On Fri, 09 Jun 2000 13:09:19 +0200, "M.-A. Lemburg" wrote: >For this, the implementation maintains a global which can be set in >the site.py Python startup script. Subsequent changes are not >possible. The default encoding can be set and queried using the >two sys module APIs: I'm confused about the justification for this restriction.
I can see that frequent arbitrary changes would be bad style, but is there any reason stronger than that? For Zope, the right place to set the default encoding is in __main__, which doesn't seem unreasonable. At the moment the restriction is enforced with a 'del sys.setdefaultencoding' near the end of site.py. This means the restriction can be bypassed with a 'reload(sys)'. Am I going to regret doing that? Toby Dickenson tdickenson@geminidataloggers.com From guido@beopen.com Mon Jul 10 15:25:05 2000 From: guido@beopen.com (Guido van Rossum) Date: Mon, 10 Jul 2000 09:25:05 -0500 Subject: [I18n-sig] New Unicode default encoding scheme In-Reply-To: Your message of "Mon, 10 Jul 2000 12:25:04 +0100." References: <3940D05E.9E266396@lemburg.com> Message-ID: <200007101425.JAA18328@cj20424-a.reston1.va.home.com> > On Fri, 09 Jun 2000 13:09:19 +0200, "M.-A. Lemburg" > wrote: > > >For this, the implementation maintains a global which can be set in > >the site.py Python startup script. Subsequent changes are not > >possible. The default encoding can be set and queried using the > >two sys module APIs: > > I'm confused about the justification for this restriction. I can see > that frequent arbitrary changes would be bad style, but is there any > reason stronger than that? > > For Zope, the right place to set the default encoding is in __main__, > which doesn't seem unreasonable. > > > At the moment the restriction is enforced with a > 'del sys.setdefaultencoding' near the end of site.py. This means the > restriction can be bypassed with a 'reload(sys)'. Am I going to regret > doing that? Yes, when it is dropped from the sys module altogether. Remember that it's an experimental feature! There's currently a discussion regarding this issue that will make this a likely outcome. The default encoding may well become fixed to ASCII for all practical purposes. One particularly nasty issue is that allowing the default encoding to change may affect dictionary lookups in a bad way.
I'll try to explain the issue here. First, Python uses the rule that if two objects a and b compare equal using ==, they can be used interchangeably as dictionary keys, even if they have different types. So, if d is {0:'yo'}, then d[0], d[0L], d[0.0], and d[0j] all succeed returning 'yo'. Similarly, if d is {'a':'ho'}, then d['a'] and d[u'a'] both return the same thing. Now consider d = {'\200': 'bo'}. If the encoding is variable, the Unicode character that is equal to '\200' is also variable. So, at one point in the program, where the default encoding is Latin-1, d[u'\200'] might work; but it might be illegal in another part, where the default encoding maps '\200' to something else (or possibly it's even invalid as a Unicode encoding, e.g. when the encoding is UTF-8). This by itself is not a showstopper. However if we look into the implementation of dictionaries, we see that the hash() function is used to make lookups in the internal hash table fast. The use of the hash() function by the dictionary type requires that if two values compare equal, they *must* have the same hash() value. Otherwise, we can run into the situation where a value is equal to one of the keys of the dictionary, but it isn't found when used in a lookup, because its hash is different! (The same restriction is the reason why mutable types like lists cannot be used as dictionary keys.) The speed of the dictionary implementation is fundamental for the speed of Python, and changing its implementation to rely less on hash values would mean an unacceptable degradation in performance. (And yes, failing lookups must also be fast!) Making the default encoding variable causes unsurmountable problems for the hash() function. A fixed encoding (whether UTF-8, ASCII or Latin-1) means that we can code the hash() of 8-bit and Unicode strings to have the same result for 8-bit character strings and Unicode strings that compare equal. 
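The invariant Guido describes -- keys that compare equal must also hash equal, or dictionary lookups break -- can be checked directly in any Python interpreter. A minimal illustration (using modern syntax):

```python
# Keys that compare equal are interchangeable in a dict,
# regardless of their type:
d = {0: 'yo'}
assert d[0] == d[0.0] == d[0j] == 'yo'

# ...which is only possible because equal values hash equally:
assert hash(0) == hash(0.0) == hash(0j)

# Mutable types cannot guarantee a stable hash, so dicts reject them:
try:
    {[1, 2]: 'x'}
except TypeError:
    pass  # "unhashable type: 'list'", as expected
else:
    raise AssertionError("mutable key was accepted")
```

A mutable default encoding would break exactly this contract for 8-bit vs. Unicode string keys: `'\200'` and its Unicode counterpart could compare equal under one encoding but hash into different buckets.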
Toby, would it be a problem for Zope if the system's default encoding were ASCII? You can still introduce the concept of a Zope default encoding, to be applied explicitly by all Zope code whenever you need it. This is what we are trying to get applications to do anyway: don't rely on the default encoding, always be explicit about the encoding. The ASCII default is intended to avoid having to worry about encodings when using ASCII string literals in code that manipulates strings and would otherwise work fine with either 8-bit or Unicode strings. --Guido van Rossum (home page: http://dinsdale.python.org/~guido/) From tdickenson@geminidataloggers.com Mon Jul 10 14:47:25 2000 From: tdickenson@geminidataloggers.com (Toby Dickenson) Date: Mon, 10 Jul 2000 14:47:25 +0100 Subject: [I18n-sig] New Unicode default encoding scheme Message-ID: <9FC702711D39D3118D4900902778ADC8128561@JUPITER> Thanks for taking the time, > > >For this, the implementation maintains a global which can be set in > > >the site.py Python startup script. Subsequent changes are not > > >possible. The default encoding can be set and queried using the > > >two sys module APIs: > > I'm confused about the justification for this restriction. I can see > > that frequent arbitrary changes would be bad style, but is there any > > reason stronger than that? > Toby, would it be a problem for Zope if the system's default encoding > were ASCII? You can still introduce the concept of a Zope default > encoding, to be applied explicitly by all Zope code whenever you need > it. Yes, ascii is exactly what I want for Zope. The site.py in the current CVS permanently sets the default encoding based on locale, and I wanted Zope's __main__ to change it back to ascii! > This is what we are trying to get applications to do anyway: > don't rely on the default encoding, always be explicit about the encoding.
The ASCII default is intended to avoid having to worry > about encodings when using ASCII string literals in code that > manipulates strings and would otherwise work fine with either 8-bit or > Unicode strings. From guido@beopen.com Mon Jul 10 15:53:04 2000 From: guido@beopen.com (Guido van Rossum) Date: Mon, 10 Jul 2000 09:53:04 -0500 Subject: [I18n-sig] New Unicode default encoding scheme In-Reply-To: Your message of "Mon, 10 Jul 2000 14:47:25 +0100." <9FC702711D39D3118D4900902778ADC8128561@JUPITER> References: <9FC702711D39D3118D4900902778ADC8128561@JUPITER> Message-ID: <200007101453.JAA18473@cj20424-a.reston1.va.home.com> > Yes, ascii is exactly what I want for Zope. The site.py in the current > CVS permanently sets the default encoding based on locale, and I wanted > Zope's __main__ to change it back to ascii! For now, you can edit site.py as follows (I just did this to my own copy). If you prefer not to edit site.py, you can provide a sitecustomize.py module that calls sys.setdefaultencoding('ascii').

Index: site.py
===================================================================
RCS file: /cvsroot/python/python/dist/src/Lib/site.py,v
retrieving revision 1.12
diff -c -r1.12 site.py
*** site.py 2000/06/28 14:48:01 1.12
--- site.py 2000/07/10 13:51:13
***************
*** 134,147 ****
      except LookupError:
          sys.setdefaultencoding('ascii')

! if 1: # Enable to support locale aware default string encodings.
      locale_aware_defaultencoding()
  elif 0: # Enable to switch off string to Unicode coercion and implicit
      # Unicode to string conversion.
      sys.setdefaultencoding('undefined')
! elif 0: # Enable to hard-code a site specific default string encoding.
      sys.setdefaultencoding('ascii')
--- 134,147 ----
      except LookupError:
          sys.setdefaultencoding('ascii')

! if 0: # Enable to support locale aware default string encodings.
      locale_aware_defaultencoding()
  elif 0: # Enable to switch off string to Unicode coercion and implicit
      # Unicode to string conversion.
      sys.setdefaultencoding('undefined')
! elif 1: # Enable to hard-code a site specific default string encoding.
      sys.setdefaultencoding('ascii')

--Guido van Rossum (home page: http://dinsdale.python.org/~guido/) From dindin2k@yahoo.com Tue Jul 11 04:12:58 2000 From: dindin2k@yahoo.com (Dinesh Nadarajah) Date: Mon, 10 Jul 2000 20:12:58 -0700 (PDT) Subject: [I18n-sig] Python Translation Message-ID: <20000711031258.8061.qmail@web4001.mail.yahoo.com> Is there any work targeted towards translating Python into other languages, i.e. some sort of structure like the *.po files in KDE such that native-language keywords can be substituted for the standard ones? Are there any plans to port Python to other (human) languages? Thanks. -Dinesh PS: Are there any tools to compile Python scripts to native code? (I know it would defeat the purpose of a scripting language and its cross-platform capability - just curious). __________________________________________________ Do You Yahoo!? Get Yahoo! Mail – Free email you can access from anywhere! http://mail.yahoo.com/ From paul@prescod.net Tue Jul 11 18:20:38 2000 From: paul@prescod.net (Paul Prescod) Date: Tue, 11 Jul 2000 12:20:38 -0500 Subject: [I18n-sig] Python Translation References: <20000711031258.8061.qmail@web4001.mail.yahoo.com> Message-ID: <396B5766.721B3E73@prescod.net> Dinesh Nadarajah wrote: > > Is there any work targeted towards translating > Python into other languages, i.e. some sort of structure > like the *.po files in KDE such that native-language > keywords can be substituted for the standard ones? It's debatable whether that is a good idea. Even if it were, it surely could not be done until Python's documentation could be translated into multiple languages so that people could know what the keywords mean. It isn't helpful to say "si" is like "if" and then point people at the Python documentation. > Are > there any plans to port Python to other (human) > languages? Not that I know of. Porting error messages might be more doable.
> PS: Are there any tools to compile Python scripts to > native code? (I know it would defeat the purpose of a > scripting language and its cross-platform capability - just curious). You can compile Python to native code, but you need a large "runtime library" which is basically the Python interpreter. Most code still ends up depending heavily on that library. So, e.g., Python function calls do not equate to machine-language calls. Rather, they equate to something like PythonLibraryCallFunction("foo"). You don't get a huge speedup. -- Paul Prescod - Not encumbered by corporate consensus Simplicity does not precede complexity, but follows it. - http://www.cs.yale.edu/~perlis-alan/quotes.html From fw@deneb.enyo.de Tue Jul 11 20:25:40 2000 From: fw@deneb.enyo.de (Florian Weimer) Date: 11 Jul 2000 21:25:40 +0200 Subject: [I18n-sig] Python Translation In-Reply-To: Dinesh Nadarajah's message of "Mon, 10 Jul 2000 20:12:58 -0700 (PDT)" References: <20000711031258.8061.qmail@web4001.mail.yahoo.com> Message-ID: <87sntg31vv.fsf@deneb.enyo.de> Dinesh Nadarajah writes: > Is there any work targeted towards translating > Python into other languages, i.e. some sort of structure > like the *.po files in KDE such that native-language > keywords can be substituted for the standard ones? Are > there any plans to port Python to other (human) > languages? I hope this isn't the case. Microsoft did this to VBA in the '90s, but everybody agrees that it was a complete fiasco. Of course, they made some avoidable mistakes, but in general, the whole idea might sound nice, but it doesn't work out in practice. If there is any interest in this topic, I'm going to sum up the major issues.
From dindin2k@yahoo.com Tue Jul 11 22:30:51 2000 From: dindin2k@yahoo.com (Dinesh Nadarajah) Date: Tue, 11 Jul 2000 14:30:51 -0700 (PDT) Subject: [I18n-sig] Python Translation Message-ID: <20000711213051.26845.qmail@web4004.mail.yahoo.com> I posted this question because of the widespread interest in linguistics and language-independent computing in general. I also posted it when I ran into problems with Python when I used a variable name with upper-ASCII (128-255) characters. Python did not recognize the characters. Just a matter of interest. -Dinesh --- Florian Weimer wrote: > Dinesh Nadarajah writes: > > > Is there any work targeted towards translating > > Python into other > languages, i.e. some sort of > structure > > like the *.po files in KDE such that native-language > > keywords can be substituted for the standard ones. Are > > there any plans to port Python to other (human) > > languages? > > I hope this isn't the case. Microsoft did this to > VBA in the '90s, > but everybody agrees that it was a complete fiasco. > Of course, they > made some avoidable mistakes, but in general, the > whole idea might > sound nice, but it doesn't work out in practice. > > If there is any interest in this topic, I'm going to > sum up the major > issues. > > _______________________________________________ > I18n-sig mailing list > I18n-sig@python.org > http://www.python.org/mailman/listinfo/i18n-sig From fw@deneb.enyo.de Sun Jul 16 14:12:32 2000 From: fw@deneb.enyo.de (Florian Weimer) Date: 16 Jul 2000 15:12:32 +0200 Subject: [I18n-sig] UTF-8 decoder in CVS still buggy Message-ID: <87vgy6kym7.fsf@deneb.enyo.de> The UTF-8 decoder is still buggy (i.e.
it doesn't pass Markus Kuhn's stress test), mainly due to the following construct:

#define UTF8_ERROR(details) do { \
    if (utf8_decoding_error(&s, &p, errors, details)) \
        goto onError; \
    continue; \
} while (0)

(The "continue" statement is supposed to exit from the outer loop, but of course, it doesn't. Indeed, this is a marvelous example of the dangers of the C programming language and especially of the C preprocessor.) I've already sent a patch quite some time ago, but nobody bothered to apply it to the CVS tree. Shall I resend it or shall I assume that the Python developers don't care about this problem? From mal@lemburg.com Sun Jul 16 14:29:58 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Sun, 16 Jul 2000 15:29:58 +0200 Subject: [I18n-sig] UTF-8 decoder in CVS still buggy References: <87vgy6kym7.fsf@deneb.enyo.de> Message-ID: <3971B8D6.C5D91480@lemburg.com> Florian Weimer wrote: > > The UTF-8 decoder is still buggy (i.e. it doesn't pass Markus Kuhn's > stress test), mainly due to the following construct:
>
> #define UTF8_ERROR(details) do { \
>     if (utf8_decoding_error(&s, &p, errors, details)) \
>         goto onError; \
>     continue; \
> } while (0)
>
> (The "continue" statement is supposed to exit from the outer loop, > but of course, it doesn't. Indeed, this is a marvelous example of > the dangers of the C programming language and especially of the C > preprocessor.) > > I've already sent a patch quite some time ago, but nobody bothered to > apply it to the CVS tree. Sorry about that. > Shall I resend it or shall I assume that > the Python developers don't care about this problem? I've checked in a fix which should remedy the problem. Could you run the stress test using the fixed interpreter ? BTW, how much code is the stress test ? Maybe we should add some of it to the test suite.
Thanks, -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From fw@deneb.enyo.de Sun Jul 16 15:04:06 2000 From: fw@deneb.enyo.de (Florian Weimer) Date: 16 Jul 2000 16:04:06 +0200 Subject: [I18n-sig] UTF-8 decoder in CVS still buggy In-Reply-To: "M.-A. Lemburg"'s message of "Sun, 16 Jul 2000 15:29:58 +0200" References: <87vgy6kym7.fsf@deneb.enyo.de> <3971B8D6.C5D91480@lemburg.com> Message-ID: <87itu6kw89.fsf@deneb.enyo.de> "M.-A. Lemburg" writes: > I've checked in a fix which should remedy the problem. > Could you run the stress test using the fixed > interpreter ? Thanks. It's more consistent now, but I still don't like it. The basic question is whether a bad sequence like "c0 80" shall be replaced by one or multiple U+FFFD characters. I vote for a single replacement character because it seems natural, but different people may have different opinions here. ;-) > BTW, how much code is the stress test ? Maybe we should add > some of it to the test suite. Currently, it isn't automated (I only feed Markus Kuhn's UTF-8 test through the decoder), and I expect that an automated implementation would consist of around 100 lines of code. (The test covers just the most important borderline cases.) From mal@lemburg.com Sun Jul 16 16:36:18 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Sun, 16 Jul 2000 17:36:18 +0200 Subject: [I18n-sig] UTF-8 decoder in CVS still buggy References: <87vgy6kym7.fsf@deneb.enyo.de> <3971B8D6.C5D91480@lemburg.com> <87itu6kw89.fsf@deneb.enyo.de> Message-ID: <3971D672.BB1C2AFE@lemburg.com> Florian Weimer wrote: > > "M.-A. Lemburg" writes: > > > I've checked in a fix which should remedy the problem. > > Could you run the stress test using the fixed > > interpreter ? > > Thanks. It's more consistent now, but I still don't like it. 
The > basic question is whether a bad sequence like "c0 80" shall be > replaced by one or multiple U+FFFD characters. I vote for a single > replacement character because it seems natural, but different people > may have different opinions here. ;-) Is there a standard way of dealing with these errors ? What do other languages do, e.g. Perl, TCL ? I don't have any problem changing the current implementation, but would of course like to stick to an accepted standard here. > > BTW, how much code is the stress test ? Maybe we should add > > some of it to the test suite. > > Currently, it isn't automated (I only feed Markus Kuhn's UTF-8 test > through the decoder), and I expect that an automated implementation > would consist of around 100 lines of code. (The test covers just the > most important borderline cases.) 100 LOCs is ok. Would you be willing to write this up and submit it as patch ? (What's the copyright on Markus Kuhn's test suite ?) -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From fw@deneb.enyo.de Sun Jul 16 19:55:41 2000 From: fw@deneb.enyo.de (Florian Weimer) Date: 16 Jul 2000 20:55:41 +0200 Subject: [I18n-sig] UTF-8 decoder in CVS still buggy In-Reply-To: "M.-A. Lemburg"'s message of "Sun, 16 Jul 2000 17:36:18 +0200" References: <87vgy6kym7.fsf@deneb.enyo.de> <3971B8D6.C5D91480@lemburg.com> <87itu6kw89.fsf@deneb.enyo.de> <3971D672.BB1C2AFE@lemburg.com> Message-ID: <871z0tj45u.fsf@deneb.enyo.de> "M.-A. Lemburg" writes: > > Thanks. It's more consistent now, but I still don't like it. The > > basic question is whether a bad sequence like "c0 80" shall be > > replaced by one or multiple U+FFFD characters. I vote for a single > > replacement character because it seems natural, but different people > > may have different opinions here. ;-) > > Is there a standard way of dealing with these errors ? 
From Markus Kuhn's test file:

| According to ISO 10646-1, sections R.7 and 2.3c, a device receiving
| UTF-8 shall interpret a "malformed sequence in the same way that it
| interprets a character that is outside the adopted subset". This means
| usually that the malformed UTF-8 sequence is replaced by a replacement
| character (U+FFFD), which looks a bit like an inverted question mark,
| or a similar symbol. It might be a good idea to visually distinguish a
| malformed UTF-8 sequence from a correctly encoded Unicode character
| that is just not available in the current font but otherwise fully
| legal. For both cases, a clearly recognisable symbol should be used.
| Just ignoring malformed sequences or unavailable characters will make
| debugging more difficult and can lead to user confusion.

I've contacted Markus and he told me that the proposed approach (i.e. replace the whole sequence with a replacement character) is used in the UTF-8 xterm extension for XFree86. OTOH, the C library interface makes this approach a bit complicated to implement, so it's likely that each octet in a malformed sequence is replaced by a replacement character there. In the future, if UTF-8-aware C libraries are widely deployed, xterm might use them, resulting in a changed behavior, more like the current Python one. > What do other languages do, e.g. Perl, TCL ? Sorry, I don't know. Anyone else? > I don't have any problem changing the current implementation, > but would of course like to stick to an accepted standard here. There doesn't seem to be any standard yet, and I doubt that there is already something like best common practice. :-( [Test module] > > 100 LOCs is ok. Would you be willing to write this up and submit > it as a patch ? It might take some time, but yes, I'm going to do it. > (What's the copyright on Markus Kuhn's test suite ?) I got permission to use it for this task from him. Is this sufficient, or do you need a disclaimer or something like that?
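For comparison, the behavior under debate is easy to observe from Python itself. The snippet below uses a modern Python 3 interpreter, which settled on replacing each maximal invalid subsequence with U+FFFD; the bytes 0xC0 and 0xC1 can never start a valid sequence, so the overlong pair "c0 80" yields two replacement characters rather than one:

```python
bad = b"\xc0\x80"  # overlong encoding of U+0000 -- malformed UTF-8

# Strict decoding rejects it outright:
try:
    bad.decode("utf-8")
except UnicodeDecodeError:
    pass  # "invalid start byte", as expected
else:
    raise AssertionError("malformed UTF-8 was accepted")

# With errors="replace", each invalid byte becomes U+FFFD:
assert bad.decode("utf-8", "replace") == "\ufffd\ufffd"
```

So the "one U+FFFD per octet" camp is, decades later, roughly where CPython ended up, refined to "one U+FFFD per maximal invalid subpart" as later recommended by the Unicode Standard.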
From mal@lemburg.com  Sun Jul 16 20:38:01 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Sun, 16 Jul 2000 21:38:01 +0200
Subject: [I18n-sig] FYI: Some Unicode and I18n pointers
Message-ID: <39720F19.F9183AE4@lemburg.com>

Here are some pointers to recent documents related to Unicode and I18n:

Solaris Developer Connection: I18N Guidelines for C and C++
	http://soldc.sun.com/articles/i18n/Cguidelines.I18N.html

Perl, Unicode and I18N FAQ
	http://rf.net/~james/perli18n.html

--
Marc-Andre Lemburg
______________________________________________________________________
Business:       http://www.lemburg.com/
Python Pages:   http://www.lemburg.com/python/

From mal@lemburg.com  Sun Jul 16 20:54:54 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Sun, 16 Jul 2000 21:54:54 +0200
Subject: [I18n-sig] UTF-8 decoder in CVS still buggy
References: <87vgy6kym7.fsf@deneb.enyo.de> <3971B8D6.C5D91480@lemburg.com>
	<87itu6kw89.fsf@deneb.enyo.de> <3971D672.BB1C2AFE@lemburg.com>
	<871z0tj45u.fsf@deneb.enyo.de>
Message-ID: <3972130E.E48A364E@lemburg.com>

Florian Weimer wrote:
>
> "M.-A. Lemburg" writes:
>
> > > Thanks. It's more consistent now, but I still don't like it. The
> > > basic question is whether a bad sequence like "c0 80" shall be
> > > replaced by one or multiple U+FFFD characters. I vote for a single
> > > replacement character because it seems natural, but different people
> > > may have different opinions here. ;-)
> >
> > Is there a standard way of dealing with these errors?
>
> >From Markus Kuhn's test file:
>
> | According to ISO 10646-1, sections R.7 and 2.3c, a device receiving
> | UTF-8 shall interpret a "malformed sequence in the same way that it
> | interprets a character that is outside the adopted subset". This means
> | usually that the malformed UTF-8 sequence is replaced by a replacement
> | character (U+FFFD), which looks a bit like an inverted question mark,
> | or a similar symbol.
> | It might be a good idea to visually distinguish a
> | malformed UTF-8 sequence from a correctly encoded Unicode character
> | that is just not available in the current font but otherwise fully
> | legal. For both cases, a clearly recognisable symbol should be used.
> | Just ignoring malformed sequences or unavailable characters will make
> | debugging more difficult and can lead to user confusion.
>
> I've contacted Markus and he told me that the proposed approach (i.e.
> replace the whole sequence with a replacement character) is used in
> the UTF-8 xterm extension for XFree86. OTOH, the C library interface
> makes this approach a bit complicated to implement, so it's likely
> that each octet in a malformed sequence is replaced by a replacement
> character there. In the future, if UTF-8-aware C libraries are widely
> deployed, xterm might use them, resulting in a changed behavior, more
> like the current Python one.

Hmm, that would be a +0 for Python's version. Markus seems to always
argue for the "replace with one character" option.

BTW, I found some discussion of the subject:

	http://mail.nl.linux.org/linux-utf8/1999-10/msg00106.html
	http://mail.nl.linux.org/linux-utf8/1999-09/msg00149.html

> > What do other languages do, e.g. Perl, TCL?
>
> Sorry, I don't know. Anyone else?

Both have native UTF-8 support... can anyone help out on this one?

> > I don't have any problem changing the current implementation,
> > but would of course like to stick to an accepted standard here.
>
> There doesn't seem to be any standard yet, and I doubt that there is
> already something like best common practice. :-(

Perhaps we should just wait for somebody with more UTF-8 experience
to comment on this. Whatever strategy is used, it doesn't help the
user: she will have to correct the buggy input one way or another.
More error-indicating characters might make the location easier to
find but could also be more annoying.

> [Test module]
>
> > 100 LOCs is ok.
> > Would you be willing to write this up and submit it as a patch?
>
> It might take some time, but yes, I'm going to do it.

Great :-)

> > (What's the copyright on Markus Kuhn's test suite?)
>
> I got permission to use it for this task from him. Is this
> sufficient, or do you need a disclaimer or something like that?

I guess it should be available under the Python license (or a
compatible one)... frankly, I'm not sure what the current
requirements are (Python moved from CNRI to BeOpen).

--
Marc-Andre Lemburg
______________________________________________________________________
Business:       http://www.lemburg.com/
Python Pages:   http://www.lemburg.com/python/

From fw@deneb.enyo.de  Sun Jul 23 13:03:36 2000
From: fw@deneb.enyo.de (Florian Weimer)
Date: 23 Jul 2000 14:03:36 +0200
Subject: [I18n-sig] UTF-8 decoder in CVS still buggy
In-Reply-To: "M.-A. Lemburg"'s message of "Sun, 16 Jul 2000 21:54:54 +0200"
References: <87vgy6kym7.fsf@deneb.enyo.de> <3971B8D6.C5D91480@lemburg.com>
	<87itu6kw89.fsf@deneb.enyo.de> <3971D672.BB1C2AFE@lemburg.com>
	<871z0tj45u.fsf@deneb.enyo.de> <3972130E.E48A364E@lemburg.com>
Message-ID: <87vgxxaw9z.fsf@deneb.enyo.de>

"M.-A. Lemburg" writes:

> > > What do other languages do, e.g. Perl, TCL?
> >
> > Sorry, I don't know. Anyone else?
>
> Both have native UTF-8 support... can anyone help out on this one?

Perl's UTF-8 support is still extremely rudimentary. Even Larry seems
to admit that. The general Perl philosophy seems to be to preserve
invalid UTF-8 sequences. (They use UTF-8 for their strings, that's
why they can do this.) This is not applicable to Python, I think.

Tcl seems to assume that invalid UTF-8 sequences are ISO-8859-1. At
least this is what the code seems to do; its documentation says that
replacement characters are used. It doesn't handle overlong sequences
properly (contrary to the recommendation in RFC 2279).
In Java, the behavior of the UTF-8 decoder is not specified in the
language definition, which probably means that Java implementations
differ a lot in this area.

> Whatever strategy is used, it doesn't help the user: she will
> have to correct the buggy input one way or another. More
> error-indicating characters might make the location easier
> to find but could also be more annoying.

Anyway, I think we can agree that a single replacement character
shall be used in the following cases:

- a valid UTF-8 sequence which encodes a UCS-4 character not
  representable in UTF-16

- a UTF-8 sequence which is an overlong representation of a
  character, but otherwise correct

For the remaining cases, I would vote for the "one replacement
character per source octet" approach. After some thinking, this seems
to be the most natural approach to me. If the UTF-8 stream is
garbled, there's no point in being clever and trying to guess
character bounds, because this information is very likely meaningless
anyway.

As a safety measure, I'd suggest stating that Python's behavior may
change in a later version if the chosen approach proves to be
inadequate.

From wunder@ultraseek.com  Sun Jul 23 21:21:55 2000
From: wunder@ultraseek.com (Walter Underwood)
Date: Sun, 23 Jul 2000 13:21:55 -0700
Subject: [I18n-sig] UTF-8 decoder in CVS still buggy
In-Reply-To: <87vgxxaw9z.fsf@deneb.enyo.de>
Message-ID: <174466.3173347315@[192.168.8.114]>

I'd rather that it not try to "repair" broken UTF-8. If it isn't
UTF-8, throw an exception, and let the caller decide.

For example, when parsing XML, invalid UTF-8 means the whole document
is invalid. It is considered polite to say where the first invalid
character occurs, but it is not acceptable to continue parsing. An
XML parser cannot use a UTF-8 decoder that accepts invalid UTF-8.

Code that deals with multiple encodings usually needs to do some
encoding guessing up front, before choosing a decoder.
If the guess is wrong, I'd want the decoder to fail, so we can try
the next most likely encoding.

We're busy converting our search engine to use Unicode, so I'm really
familiar with the issues right now.

wunder
--
Walter Underwood
Senior Staff Engineer, Ultraseek Server, Inktomi Corp.
http://www.ultraseek.com/ http://www.inktomi.com/

From fw@deneb.enyo.de  Sun Jul 23 21:40:19 2000
From: fw@deneb.enyo.de (Florian Weimer)
Date: 23 Jul 2000 22:40:19 +0200
Subject: [I18n-sig] UTF-8 decoder in CVS still buggy
In-Reply-To: Walter Underwood's message of "Sun, 23 Jul 2000 13:21:55 -0700"
References: <174466.3173347315@[192.168.8.114]>
Message-ID: <87n1j8mvgs.fsf@deneb.enyo.de>

Walter Underwood writes:

> I'd rather that it not try to "repair" broken UTF-8. If it isn't
> UTF-8, throw an exception, and let the caller decide.

This option already exists. It isn't appropriate for some
applications, though. Sometimes you just have the data and you have
to make the best out of it, and you can't ask someone to give you a
fixed version.

> We're busy converting our search engine to use Unicode, so I'm
> really familiar with the issues right now.

And your search engine stops processing a document as soon as it
encounters an invalid UTF-8 sequence, even though the majority of it
is valid UTF-8? I don't think so.

From wunder@ultraseek.com  Mon Jul 24 00:28:56 2000
From: wunder@ultraseek.com (Walter Underwood)
Date: Sun, 23 Jul 2000 16:28:56 -0700
Subject: [I18n-sig] UTF-8 decoder in CVS still buggy
In-Reply-To: <87n1j8mvgs.fsf@deneb.enyo.de>
Message-ID: <50914.3173358536@[192.168.8.114]>

--On Sunday, July 23, 2000 10:40 PM +0200 Florian Weimer wrote:
>
> And your search engine stops processing a document as soon as it
> encounters an invalid UTF-8 sequence, even though the majority of it
> is valid UTF-8? I don't think so.

Actually, since it is likely to have errors and not be readable in an
application, tossing it could be the best choice.
Showing people hits that they can't read is not very polite. But we
do try harder than that. The engine falls back to a different
character set. Eventually, it ends up in a very liberal character
set, like windows-1252, where almost all 8-bit values are legal. We
do a similar thing with XML -- if it fails the parse, we try it as
HTML, and our HTML parser will take almost anything.

But back to the subject: I'm not sure that repairing invalid UTF-8 is
a good idea. The HTML experience is that it is a really bad idea to
accept invalid documents. If it is necessary, we might want to call
it something different than a decoder.

wunder
--
Walter Underwood
Senior Staff Engineer, Ultraseek Server, Inktomi Corp.
http://www.ultraseek.com/ http://www.inktomi.com/

From mal@lemburg.com  Mon Jul 24 09:26:25 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Mon, 24 Jul 2000 10:26:25 +0200
Subject: [I18n-sig] UTF-8 decoder in CVS still buggy
References: <174466.3173347315@[192.168.8.114]>
Message-ID: <397BFDB1.5FF108CF@lemburg.com>

Walter Underwood wrote:
>
> I'd rather that it not try to "repair" broken UTF-8. If it isn't
> UTF-8, throw an exception, and let the caller decide.

Note that we are talking about the "replace" error-handling case
here. The default "strict" mode will throw an exception.

> For example, when parsing XML, invalid UTF-8 means the whole
> document is invalid. It is considered polite to say where the first
> invalid character occurs, but it is not acceptable to continue
> parsing. An XML parser cannot use a UTF-8 decoder that accepts
> invalid UTF-8.
>
> Code that deals with multiple encodings usually needs to do some
> encoding guessing up front, before choosing a decoder. If the guess
> is wrong, I'd want the decoder to fail, so we can try the next most
> likely encoding.
>
> We're busy converting our search engine to use Unicode, so I'm
> really familiar with the issues right now.
Please keep us informed of any quirks you may experience during this
conversion. We can use some real-life reports on the new Unicode
support in Python to polish up the implementation and design.

--
Marc-Andre Lemburg
______________________________________________________________________
Business:       http://www.lemburg.com/
Python Pages:   http://www.lemburg.com/python/
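The guess-and-fall-back strategy Walter describes in this thread — try strict UTF-8 first, then drop to ever more liberal character sets — can be sketched in modern Python. The candidate list below is illustrative, not Ultraseek's actual one; windows-1252 leaves a handful of byte values undefined (so it can still fail), while latin-1 maps every byte and therefore always succeeds as a last resort:

```python
# Sketch of cascading decode attempts: strictest encoding first, with
# strict error handling so a wrong guess fails instead of garbling text.
def decode_with_fallback(data: bytes) -> tuple[str, str]:
    """Return (decoded text, name of the encoding that worked)."""
    for encoding in ("utf-8", "windows-1252", "latin-1"):
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # guess was wrong -- try the next candidate
    raise AssertionError("unreachable: latin-1 accepts every byte")

text, used = decode_with_fallback("h\xe9llo".encode("utf-8"))
print(used)   # utf-8

text, used = decode_with_fallback(b"h\xe9llo")  # a bare 0xE9 is not UTF-8
print(used)   # windows-1252
```

The key design point, matching Walter's argument, is that the decoder must raise on the wrong guess: a "replace"-style decoder would accept the bytes under the first encoding tried and the fallback chain would never run.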