From tree@basistech.com Tue May 15 20:28:05 2001 From: tree@basistech.com (Tom Emerson) Date: Tue, 15 May 2001 15:28:05 -0400 Subject: [I18n-sig] Extending definition of errors argument to codecs Message-ID: <15105.33605.300173.26763@cymru.basistech.com> I'd like to propose an extension to the Codec error reporting mechanism: The 'errors' argument to encode/decode et al. would be much more useful as a callable object. The current semantics of 'strict', 'ignore', and 'replace' are trivially implemented in this scheme, while allowing a specific application to extend a codec with custom error handling if necessary. Something along the lines of: class CodecError: def __call__(self, bytes): pass class CodecError_Replace ( CodecError ): def __call__(self, bytes): return u'\uFFFD' class CodecError_Strict ( CodecError ): def __call__(self, bytes): raise UnicodeError, "cannot map byte range to Unicode" Why would this be useful? I'm working with text that purports to be in Big 5, but in fact it is encoded with CP950. CP950 is identical to Big 5 except that it has a handful of extra codepoints in the 0xF9 VDA block (taken from the Eten extension). When using the current Big 5 codec on these files I sometimes blow up because of these extended characters. I would love to be able to do something like: class CodecError_CP950 ( CodecError_Strict ): def __call__(self, bytes): if bytes == '\xf9\xd6': return u'\u7881' return CodecError_Strict.__call__(self, bytes) This effectively allows me to expand upon the repertoire encoded by the codec without modifying the tables and rebuilding (as I do now as a workaround), generating new tables, or whatever else. Food for thought. The above design is off-the-cuff, but I think it is close to my thoughts on the matter. OK, flame away. -tree -- Tom Emerson Basis Technology Corp. Sr. Sinostringologist http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever" From mal@lemburg.com Tue May 15 21:12:22 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Tue, 15 May 2001 22:12:22 +0200 Subject: [I18n-sig] Extending definition of errors argument to codecs References: <15105.33605.300173.26763@cymru.basistech.com> Message-ID: <3B018DA6.B346732A@lemburg.com> Tom Emerson wrote: > > I'd like to propose an extension to the Codec error reporting mechanism: > > The 'errors' argument to encode/decode et al. would be much more > useful as a callable object. The current semantics of 'strict', > 'ignore', and 'replace' are trivially implemented in this scheme, > while allowing a specific application to extend a codec with custom > error handling if necessary. This has been proposed some months ago already. The problem with this approach is that it seriously breaks binary compatibility at the C level, since all C APIs use const char *error. The call interface would also have to be a little more context aware, so that the callback actually has a chance of modifying the current codec process -- simply returning a usable replacement character isn't enough in the general case, where one might want to be able to resync with the input stream in case there's a break in synchronization. If you can come up with a patch which maintains backward compatibility e.g. by adding a compatibility layer using lots of PyUnicode_EncodeEx() APIs, there's a good chance of getting this into the core. 
Still, it's lots of work and I'm not sure whether it wouldn't be more worthwhile adding these sorts of special error handling schemes to the codecs in question rather than making them a generic option for all codecs. > Something along the lines of: > > class CodecError: > def __call__(self, bytes): > pass > > class CodecError_Replace ( CodecError ): > def __call__(self, bytes): > return u'\uFFFD' > > class CodecError_Strict ( CodecError ): > def __call__(self, bytes): > raise UnicodeError, "cannot map byte range to Unicode" > > Why would this be useful? I'm working with text that purports to be in Big > 5, but in fact it is encoded with CP950. CP950 is identical to Big 5 > except that it has a handful of extra codepoints in the 0xF9 VDA block > (taken from the Eten extension). When using the current Big 5 codec on > these files I sometimes blow up because of these extended > characters. I would love to be able to do something like: > > class CodecError_CP950 ( CodecError_Strict ): > def __call__(self, bytes): > if bytes == '\xf9\xd6': > return u'\u7881' > return CodecError_Strict.__call__(self, bytes) > > This effectively allows me to expand upon the repertoire encoded by > the codec without modifying the tables and rebuilding (as I do now as > a workaround), generating new tables, or whatever else. > > Food for thought. The above design is off-the-cuff, but I think it is > close to my thoughts on the matter. > > OK, flame away. > > -tree > > -- > Tom Emerson Basis Technology Corp. > Sr. Sinostringologist http://www.basistech.com > "Beware the lollipop of mediocrity: lick it once and you suck forever" > > _______________________________________________ > I18n-sig mailing list > I18n-sig@python.org > http://mail.python.org/mailman/listinfo/i18n-sig -- Marc-Andre Lemburg ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From martin@loewis.home.cs.tu-berlin.de Tue May 15 22:09:52 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Tue, 15 May 2001 23:09:52 +0200 Subject: [I18n-sig] Extending definition of errors argument to codecs In-Reply-To: <3B018DA6.B346732A@lemburg.com> (mal@lemburg.com) References: <15105.33605.300173.26763@cymru.basistech.com> <3B018DA6.B346732A@lemburg.com> Message-ID: <200105152109.f4FL9q804004@mira.informatik.hu-berlin.de> > This has been proposed some months ago already. The problem with > this approach is that it seriously breaks binary compatibility > at the C level, since all C APIs use const char *error. As discussed last time, this is not a serious problem. You could move the existing API to use callable objects as arguments, and provide wrapper functions that still accept strings. > simply returning a usable replacement character isn't enough in the > general case That points to the major problem we had last time: We could not agree on what the general case is. In every demonstrated use case, a simple replacement string would have been enough (remember that, in the XML case, it would have also been a replacement *string*, e.g. "Ⴓ") > Still, it's lots of work and I'm not sure whether it wouldn't > be more worthwhile adding these sorts of special error handling > schemes to the codecs in question rather than making them > a generic option for all codecs. Ok, this is an improvement over the last time this discussion came up, where we only agreed to implement an "XML" error handling or some such. Regards, Martin
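For comparison, the callback mechanism that later landed in Python (codecs.register_error, PEP 293) can express Tom's CP950 fallback roughly as sketched below; the 'big5' codec name and its support for decode-time error callbacks are assumptions here, not something available at the time of this thread.

import codecs

def cp950_fallback(exc):
    # Decode error handler: map the CP950/Eten byte pair F9 D6 to U+7881,
    # re-raise anything else so other errors still fail loudly.
    if isinstance(exc, UnicodeDecodeError):
        if exc.object[exc.start:exc.start + 2] == '\xf9\xd6':
            return (u'\u7881', exc.start + 2)
    raise exc

codecs.register_error('cp950-fallback', cp950_fallback)
# assumed usage: text = data.decode('big5', 'cp950-fallback')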
From paulp@ActiveState.com Wed May 16 18:32:43 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Wed, 16 May 2001 10:32:43 -0700 Subject: [I18n-sig] UTF-8 and BOM Message-ID: <3B02B9BB.E1F6AE39@ActiveState.com> Notepad always saves UTF-8 documents with a BOM. Visual Studio 7 gives users an option. Python 2.1's UTF-8 decoder seems to treat the BOM as a real leading character. The UTF-16 decoder removes it. I recognize that the BOM is not useful as a "byte order mark" for UTF-8 data but I would still suggest that the UTF-8 decoder should remove it for these reasons: 1) Microsoft has taken the stance that a BOM is legal on UTF-8 data 2) Doing so is legal: "Q: Is the UTF-8 encoding scheme the same irrespective of whether the underlying processor is little endian or big endian? A: Yes. Since UTF-8 is interpreted as a sequence of bytes, there is no endian problem as there is for encoding forms that use 16-bit or 32-bit code units. Where a BOM is used with UTF-8, it is only to distinguish UTF-8 from other UTF encodings - it has nothing to do with byte order. [KW]" http://www.unicode.org/unicode/faq/utf_bom.html 3) I think that distinguishing UTF-8 from other encodings through the BOM is actually a great idea and I wish that every UTF-8 creator would do it! 4) The behavior would be consistent with the UTF-16 behavior. ---- import codecs with_bom = u"\uFEFFabcd" utf_8 = with_bom.encode("utf-8") utf_16 = with_bom.encode("utf-16") print repr(codecs.utf_8_decode(utf_8)) (u'\ufeffabcd', 7) print repr(codecs.utf_16_decode(utf_16)) (u'abcd', 12) -- Take a recipe. Leave a recipe. Python Cookbook! http://www.ActiveState.com/pythoncookbook From mal@lemburg.com Wed May 16 19:48:51 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Wed, 16 May 2001 20:48:51 +0200 Subject: [I18n-sig] UTF-8 and BOM References: <3B02B9BB.E1F6AE39@ActiveState.com> Message-ID: <3B02CB93.A9DCFD8@lemburg.com> Paul Prescod wrote: > > Notepad always saves UTF-8 documents with a BOM. Visual Studio 7 gives > users an option. > > Python 2.1's UTF-8 decoder seems to treat the BOM as a real leading > character. The UTF-16 decoder removes it. I recognize that the BOM is > not useful as a "byte order mark" for UTF-8 data but I would still > suggest that the UTF-8 decoder should remove it for these reasons: > 1) Microsoft has taken the stance that a BOM is legal on UTF-8 data BOMs are standard Unicode char points, so they are legal in all Unicode encodings. > 2) Doing so is legal: > > "Q: Is the UTF-8 encoding scheme the same irrespective of whether the > underlying processor is little endian or big endian? > > A: Yes. Since UTF-8 is interpreted as a sequence of bytes, there is no > endian problem as there is for encoding forms that use 16-bit or 32-bit > code units. Where a BOM is used with UTF-8, it is only to distinguish > UTF-8 from other UTF encodings - it has nothing to do with byte order. > [KW]" > > http://www.unicode.org/unicode/faq/utf_bom.html ... as I said :-) > 3) I think that distinguishing UTF-8 from other encodings through the > BOM is actually a great idea and I wish that every UTF-8 creator would > do it! Uhm, I can't follow you here... BOMs in UTF-8 look like this: >>> u'\ufeff'.encode('utf-8') '\xef\xbb\xbf' which is somewhat different from '\xff\xfe' or '\xfe\xff'. > 4) The behavior would be consistent with the UTF-16 behavior. 
>>> u'\ufeff'.encode('utf-16') '\xff\xfe\xff\xfe' >>> u'\ufeff'.encode('utf-16-le') '\xff\xfe' >>> u'\ufeff'.encode('utf-16-be') '\xfe\xff' >>> u'\ufeff'.encode('utf-8') '\xef\xbb\xbf' -- Marc-Andre Lemburg ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From guido@digicool.com Wed May 16 20:55:35 2001 From: guido@digicool.com (Guido van Rossum) Date: Wed, 16 May 2001 14:55:35 -0500 Subject: [I18n-sig] UTF-8 and BOM In-Reply-To: Your message of "Wed, 16 May 2001 20:48:51 +0200." <3B02CB93.A9DCFD8@lemburg.com> References: <3B02B9BB.E1F6AE39@ActiveState.com> <3B02CB93.A9DCFD8@lemburg.com> Message-ID: <200105161955.OAA04144@cj20424-a.reston1.va.home.com> > > 3) I think that distinguising UTF-8 from other encodings through the > > BOM is actually a great idea and I wish that every UTF-8 creator would > > do it! > > Uhm, I can't follow you here... BOMs in UTF-8 look like this: > > >>> u'\ufeff'.encode('utf-8') > '\xef\xbb\xbf' > > which is somewhat different from '\xff\xfe' or '\xfe\xff'. I think he meant that this serves as a sort-of "magic number" for UTF-8 files. I find that kind of cute myself. :-) --Guido van Rossum (home page: http://www.python.org/~guido/) From paulp@ActiveState.com Wed May 16 20:06:55 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Wed, 16 May 2001 12:06:55 -0700 Subject: [I18n-sig] UTF-8 and BOM References: <3B02B9BB.E1F6AE39@ActiveState.com> <3B02CB93.A9DCFD8@lemburg.com> <200105161955.OAA04144@cj20424-a.reston1.va.home.com> Message-ID: <3B02CFCF.A26624E8@ActiveState.com> Guido van Rossum wrote: > >... > > I think he meant that this serves as a sort-of "magic number" for > UTF-8 files. I find that kind of cute myself. :-) What he said. Thanks to this trick, notepad and Visual Studio are extremely good at auto-detecting encodings for Unicode text files created with either tool. -- Take a recipe. Leave a recipe. Python Cookbook! http://www.ActiveState.com/pythoncookbook From paulp@ActiveState.com Wed May 16 20:26:41 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Wed, 16 May 2001 12:26:41 -0700 Subject: [I18n-sig] UTF-8 and BOM References: <3B02B9BB.E1F6AE39@ActiveState.com> <3B02CB93.A9DCFD8@lemburg.com> Message-ID: <3B02D471.6628A0@ActiveState.com> "M.-A. Lemburg" wrote: > >... > > BOMs are standard Unicode char points, so they are legal in all > Unicode encodings. My point is that it is legal to interpret it as a BOM and not just a character. >... > Uhm, I can't follow you here... BOMs in UTF-8 look like this: > > >>> u'\ufeff'.encode('utf-8') > '\xef\xbb\xbf' > > which is somewhat different from '\xff\xfe' or '\xfe\xff'. That's what's great about it! >... > >>> u'\ufeff'.encode('utf-16') > '\xff\xfe\xff\xfe' It is curious that decoding this removes both FEFF characters. Is it right that the decoder removes all BOM sequences? >>> codecs.utf_16_decode( codecs.BOM*10 + "a".encode("UTF-16") + codecs.BOM*10) (u'a', 44) -- Take a recipe. Leave a recipe. Python Cookbook! http://www.ActiveState.com/pythoncookbook From mal@lemburg.com Wed May 16 20:59:50 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Wed, 16 May 2001 21:59:50 +0200 Subject: [I18n-sig] UTF-8 and BOM References: <3B02B9BB.E1F6AE39@ActiveState.com> <3B02CB93.A9DCFD8@lemburg.com> <3B02D471.6628A0@ActiveState.com> Message-ID: <3B02DC36.113E7BE9@lemburg.com> Paul Prescod wrote: > > "M.-A. Lemburg" wrote: > > > >... 
> > > > BOMs are standard Unicode char points, so they are legal in all > > Unicode encodings. > > My point is that it is legal to interpret it as a BOM and not just a > character. That's correct (and also the reasoning behind adding BOMs in files or streams and being allowed to remove them at your own will). > >... > > Uhm, I can't follow you here... BOMs in UTF-8 look like this: > > > > >>> u'\ufeff'.encode('utf-8') > > '\xef\xbb\xbf' > > > > which is somewhat different from '\xff\xfe' or '\xfe\xff'. > > That's what's great about it! Ok, now I get it: you want to use '\xef\xbb\xbf' as a file encoding identifier. Sounds like a good idea ! > >... > > >>> u'\ufeff'.encode('utf-16') > > '\xff\xfe\xff\xfe' > > It is curious that decoding this removes both FEFF characters. Is it > right that the decoder removes all BOM sequences? > > >>> codecs.utf_16_decode( codecs.BOM*10 + "a".encode("UTF-16") + codecs.BOM*10) > (u'a', 44) Yes. The codec is smart enough to even handle input streams with mixed byte orders (it switches dynamically based on what it finds in the stream). Note that BYTE ORDER MARK is only a comment for char point '\ufeff'. The real name is: ZERO WIDTH NO-BREAK SPACE. Adding or removing these will not cause any visible effect in the text or change the formatting. That's why you can add or remove them at your own will. So what do you want to see in 2.2 ? ... Have the UTF-8 codec remove all BOM marks from its input, or add BOM marks in some places or add a codec utf-8-bom which prepends BOM to the start of all encoded strings ? -- Marc-Andre Lemburg ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From martin@loewis.home.cs.tu-berlin.de Wed May 16 22:07:49 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Wed, 16 May 2001 23:07:49 +0200 Subject: [I18n-sig] UTF-8 and BOM In-Reply-To: <3B02B9BB.E1F6AE39@ActiveState.com> (message from Paul Prescod on Wed, 16 May 2001 10:32:43 -0700) References: <3B02B9BB.E1F6AE39@ActiveState.com> Message-ID: <200105162107.f4GL7nU01574@mira.informatik.hu-berlin.de> > Python 2.1's UTF-8 decoder seems to treat the BOM as a real leading > character. The UTF-16 decoder removes it. I recognize that the BOM is > not useful as a "byte order mark" for UTF-8 data but I would still > suggest that the UTF-8 decoder should remove it for these reasons: I think it is good to remove the BOM when decoding UTF-8. Most likely, the only reason that this is not done is that nobody thought that there might be one. I disagree that putting the BOM into a file is a good thing - I think it is stupid to do so. First of all, auto-detection can always be fooled, so there should be a higher-level protocol for reliable data processing. UTF-8 is relatively easy to auto-detect if you believe in auto-detection - it's just that looking at the first few bytes is not sufficient. OTOH, UTF-8 is concatenation-safe: you can reliably concatenate two UTF-8 files to get another UTF-8 file. That property is lost if there is a BOM in the file. Regards, Martin From mal@lemburg.com Wed May 16 22:27:13 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Wed, 16 May 2001 23:27:13 +0200 Subject: [I18n-sig] UTF-8 and BOM References: <3B02B9BB.E1F6AE39@ActiveState.com> <200105162107.f4GL7nU01574@mira.informatik.hu-berlin.de> Message-ID: <3B02F0B1.8863FDB1@lemburg.com> "Martin v. 
Loewis" wrote: > > > Python 2.1's UTF-8 decoder seems to treat the BOM as a real leading > > character. The UTF-16 decoder removes it. I recognize that the BOM is > > not useful as a "byte order mark" for UTF-8 data but I would still > > suggest that the UTF-8 decoder should remove it for these reasons: > > I think it is good to remove the BOM when decoding UTF-8. Most likely, > the only reason that this is not done is that nobody thought that > there might be one. > > I disagree that putting the BOM into a file is a good thing - I think > it is stupid to do so. First of all, auto-detection can always be > fooled, so there should be a higher-level protocol for reliable data > processing. UTF-8 is relatively easy to auto-detect if you believe in > auto-detection - it's just that looking at the first few bytes it not > sufficient. > > OTOH, UTF-8 is concatenation-safe: you can reliably concatenate two > UTF-8 files to get another UTF-8 file. That properly is lost if there > is a BOM in the file. Why should a BOM behave any different than any other Unicode character ? BOMs can be added and deleted in pretty much all places of a Unicode text -- that's their intent after all, so I don't see how they could break any property of an encoding. Or did you have the same misunderstanding as I did ? ... Paul is talking about the UTF-8 encoding of the BOM mark ('\xef\xbb\xbf'), not the FF FE or FE FF byte sequence as is seen in UTF-16 streams. -- Marc-Andre Lemburg ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From paulp@ActiveState.com Wed May 16 22:41:35 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Wed, 16 May 2001 14:41:35 -0700 Subject: [I18n-sig] UTF-8 and BOM References: <3B02B9BB.E1F6AE39@ActiveState.com> <200105162107.f4GL7nU01574@mira.informatik.hu-berlin.de> Message-ID: <3B02F40F.C6C1CE4A@ActiveState.com> "Martin v. Loewis" wrote: > > > Python 2.1's UTF-8 decoder seems to treat the BOM as a real leading > > character. The UTF-16 decoder removes it. I recognize that the BOM is > > not useful as a "byte order mark" for UTF-8 data but I would still > > suggest that the UTF-8 decoder should remove it for these reasons: > > I think it is good to remove the BOM when decoding UTF-8. Most likely, > the only reason that this is not done is that nobody thought that > there might be one. Okay good. > I disagree that putting the BOM into a file is a good thing - I think > it is stupid to do so. First of all, auto-detection can always be > fooled, so there should be a higher-level protocol for reliable data > processing. There should be but there isn't always. What is the standard way for tagging UTF-8 documents on the Windows file system? > UTF-8 is relatively easy to auto-detect if you believe in > auto-detection - it's just that looking at the first few bytes it not > sufficient. Yes, we're going to autodetect by trying to decode the data but that's a pretty expensive operation. You never know if the very first non-ASCII char will appear in the last few bytes of the file. Anyhow, it doesn't matter. If I want a BOM in files I write out, I can add it. My main goal is to have the reader do the right thing with "Microsoft-format" Unicode files. > OTOH, UTF-8 is concatenation-safe: you can reliably concatenate two > UTF-8 files to get another UTF-8 file. That properly is lost if there > is a BOM in the file. So what if there is a BOM in the middle of the data stream. 
MAL's decoder will just remove it anyhow. :) -- Take a recipe. Leave a recipe. Python Cookbook! http://www.ActiveState.com/pythoncookbook From paulp@ActiveState.com Wed May 16 22:57:06 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Wed, 16 May 2001 14:57:06 -0700 Subject: [I18n-sig] UTF-8 and BOM References: <3B02B9BB.E1F6AE39@ActiveState.com> <3B02CB93.A9DCFD8@lemburg.com> <3B02D471.6628A0@ActiveState.com> <3B02DC36.113E7BE9@lemburg.com> Message-ID: <3B02F7B2.F932C084@ActiveState.com> "M.-A. Lemburg" wrote: > >... > > Note that BYTE ORDER MARK is only a comment for char point > '\ufeff'. The real name is: ZERO WIDTH NO-BREAK SPACE. Adding > or removing these will not cause any visible effect in the > text or change the formatting. That's why you can add or > remove them at your own will. I'm not sure I buy that, but one could argue that a Zero width no-break space character is a legitimate character whether you can see it on a computer screen or not...but I don't care enough to make that argument. > So what do you want to see in 2.2 ? ... Have the UTF-8 codec remove > all BOM marks from its input, or add BOM marks in some places > or add a codec utf-8-bom which prepends BOM to the start of > all encoded strings ? I'd like the UTF-8 codec to treat BOMs (especially leading BOMs) as the UTF-16 one does. Probably BOM_UTF8 should be added to codecs.py. I'm not sure whether we need another codec. Probably not... -- Take a recipe. Leave a recipe. Python Cookbook! http://www.ActiveState.com/pythoncookbook From mal@lemburg.com Wed May 16 23:20:49 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Thu, 17 May 2001 00:20:49 +0200 Subject: [I18n-sig] UTF-8 and BOM References: <3B02B9BB.E1F6AE39@ActiveState.com> <3B02CB93.A9DCFD8@lemburg.com> <3B02D471.6628A0@ActiveState.com> <3B02DC36.113E7BE9@lemburg.com> <3B02F7B2.F932C084@ActiveState.com> Message-ID: <3B02FD41.20675BC3@lemburg.com> Paul Prescod wrote: > > "M.-A. Lemburg" wrote: > > > >... > > > > Note that BYTE ORDER MARK is only a comment for char point > > '\ufeff'. The real name is: ZERO WIDTH NO-BREAK SPACE. Adding > > or removing these will not cause any visible effect in the > > text or change the formatting. That's why you can add or > > remove them at your own will. > > I'm not sure I buy that, but one could argue that a Zero width no-break > space character is a legitimate character whether you can see it on a > computer screen or not...but I don't care enough to make that argument. Text data is different than binary data. Unicode text which uses combining characters (e.g. accent and 'e' to produce 'é') is equivalent to text which uses the combined character point directly. This corner of Unicode is not well covered yet in Python's Unicode implementation. The two major missing items are normalization and collation support. > > So what do you want to see in 2.2 ? ... Have the UTF-8 codec remove > > all BOM marks from its input, or add BOM marks in some places > > or add a codec utf-8-bom which prepends BOM to the start of > > all encoded strings ? > > I'd like the UTF-8 codec to treat BOMs (especially leading BOMs) as the > UTF-16 one does. Probably BOM_UTF8 should be added to codecs.py. I'm not > sure whether we need another codec. Probably not... You have to be careful here: UTF-16 prepends a BOM mark to every string pushed through the codec -- even small snippets. You certainly don't want to make that the default for the much more common UTF-8 which has no real requirement to include BOM marks at all... 
having the decoder automatically remove BOM marks is easy to implement and won't cause any harm, but carelessly adding them will get us into trouble. -- Marc-Andre Lemburg ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From paulp@ActiveState.com Wed May 16 23:26:56 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Wed, 16 May 2001 15:26:56 -0700 Subject: [I18n-sig] UTF-8 and BOM References: <3B02B9BB.E1F6AE39@ActiveState.com> <3B02CB93.A9DCFD8@lemburg.com> <3B02D471.6628A0@ActiveState.com> <3B02DC36.113E7BE9@lemburg.com> <3B02F7B2.F932C084@ActiveState.com> <3B02FD41.20675BC3@lemburg.com> Message-ID: <3B02FEB0.2A4135A6@ActiveState.com> "M.-A. Lemburg" wrote: > >... > > You have to be careful here: UTF-16 prepends a BOM mark to > every string pushed through the codec -- even small snippets. > You certainly don't want to make that the default for the > much more common UTF-8 which has no real requirement to include > BOM marks at all... having the decoder automatically remove > BOM marks is easy to implement and won't cause any harm, > but carelessly adding them will get us into trouble. Yes, I meant to say that the standard decoder should remove them and left it up to you whether we should have another codec where the encoder adds them. -- Take a recipe. Leave a recipe. Python Cookbook! http://www.ActiveState.com/pythoncookbook From martin@loewis.home.cs.tu-berlin.de Thu May 17 05:22:42 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Thu, 17 May 2001 06:22:42 +0200 Subject: [I18n-sig] UTF-8 and BOM In-Reply-To: <3B02F0B1.8863FDB1@lemburg.com> (mal@lemburg.com) References: <3B02B9BB.E1F6AE39@ActiveState.com> <200105162107.f4GL7nU01574@mira.informatik.hu-berlin.de> <3B02F0B1.8863FDB1@lemburg.com> Message-ID: <200105170422.f4H4MgC01079@mira.informatik.hu-berlin.de> > Why should a BOM behave any different than any other Unicode > character ? BOMs can be added and deleted in pretty much all > places of a Unicode text -- that's their intent after all, so > I don't see how they could break any property of an encoding. > > Or did you have the same misunderstanding as I did ? ... > Paul is talking about the UTF-8 encoding of the BOM mark ('\xef\xbb\xbf'), > not the FF FE or FE FF byte sequence as is seen in UTF-16 streams. So am I, and I think that when decoding UTF-8, the first Unicode character should be removed when it is the BOM, by the UTF-8 decoder. It should be removed in that place because it was inserted only to identify UTF-8 (just as the byte sequence FF FE was inserted into the UTF-16 stream to identify it as UTF-16, and to identify the byte order). I don't think the decoder should remove the BOM from any other location in the text, since removing it *does* change the content of the text. It may be removed as part of applying some normalization, but that should not happen unless the application explicitly requests that normalization. In fact, none of the Unicode normalization forms removes the BOM (see TR #15). The BOM is recommended to be a valid character in identifiers, and it is recommended to remove it before comparing identifiers (since it is a formatting character). Regards, Martin From martin@loewis.home.cs.tu-berlin.de Thu May 17 05:28:56 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. 
Loewis) Date: Thu, 17 May 2001 06:28:56 +0200 Subject: [I18n-sig] UTF-8 and BOM In-Reply-To: <3B02F7B2.F932C084@ActiveState.com> (message from Paul Prescod on Wed, 16 May 2001 14:57:06 -0700) References: <3B02B9BB.E1F6AE39@ActiveState.com> <3B02CB93.A9DCFD8@lemburg.com> <3B02D471.6628A0@ActiveState.com> <3B02DC36.113E7BE9@lemburg.com> <3B02F7B2.F932C084@ActiveState.com> Message-ID: <200105170428.f4H4Su501137@mira.informatik.hu-berlin.de> > "M.-A. Lemburg" wrote: > > > >... > > > > Note that BYTE ORDER MARK is only a comment for char point > > '\ufeff'. The real name is: ZERO WIDTH NO-BREAK SPACE. No, and yes. "BYTE ORDER MARK" is not in the comment field of the database, but in the "Unicode 1.0 name" of the database. [Paul] > I'm not sure I buy that, but one could argue that a Zero width no-break > space character is a legitimate character whether you can see it on a > computer screen or not...but I don't care enough to make that argument. I do. A reader must not remove the BOM, unless it is clearly meant to indicate the encoding of a document. Regards, Martin From martin@loewis.home.cs.tu-berlin.de Thu May 17 05:25:36 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Thu, 17 May 2001 06:25:36 +0200 Subject: [I18n-sig] UTF-8 and BOM In-Reply-To: <3B02F40F.C6C1CE4A@ActiveState.com> (message from Paul Prescod on Wed, 16 May 2001 14:41:35 -0700) References: <3B02B9BB.E1F6AE39@ActiveState.com> <200105162107.f4GL7nU01574@mira.informatik.hu-berlin.de> <3B02F40F.C6C1CE4A@ActiveState.com> Message-ID: <200105170425.f4H4Pa401135@mira.informatik.hu-berlin.de> > > I disagree that putting the BOM into a file is a good thing - I think > > it is stupid to do so. First of all, auto-detection can always be > > fooled, so there should be a higher-level protocol for reliable data > > processing. > > There should be but there isn't always. What is the standard way for > tagging UTF-8 documents on the Windows file system? There probably is none, although giving them a .txt extension is a good starting point. What is the standard for tagging KOI8-R documents on the Windows file system? > So what if there is a BOM in the middle of the data stream. MAL's > decoder will just remove it anyhow. :) Yes, and I think this is a bug. Regards, Martin From martin@loewis.home.cs.tu-berlin.de Thu May 17 05:32:24 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Thu, 17 May 2001 06:32:24 +0200 Subject: [I18n-sig] UTF-8 and BOM In-Reply-To: <3B02FD41.20675BC3@lemburg.com> (mal@lemburg.com) References: <3B02B9BB.E1F6AE39@ActiveState.com> <3B02CB93.A9DCFD8@lemburg.com> <3B02D471.6628A0@ActiveState.com> <3B02DC36.113E7BE9@lemburg.com> <3B02F7B2.F932C084@ActiveState.com> <3B02FD41.20675BC3@lemburg.com> Message-ID: <200105170432.f4H4WOw01160@mira.informatik.hu-berlin.de> > Text data is different than binary data. Unicode text > which uses combining characters (e.g. accent and 'e' to produce > 'é') is equivalent to text which uses the combined character > point directly. Are you saying that the BOM is removed under normalization? Which normalization form? > You have to be careful here: UTF-16 prepends a BOM mark to > every string pushed through the codec -- even small snippets. That seems like an error also. When writing to a UTF-16 stream, I want the BOM to appear only in the first bytes of the resulting file. 
Regards, Martin From paulp@ActiveState.com Thu May 17 17:46:12 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Thu, 17 May 2001 09:46:12 -0700 Subject: [I18n-sig] UTF-8 and BOM References: <3B02B9BB.E1F6AE39@ActiveState.com> <200105162107.f4GL7nU01574@mira.informatik.hu-berlin.de> <3B02F40F.C6C1CE4A@ActiveState.com> <200105170425.f4H4Pa401135@mira.informatik.hu-berlin.de> Message-ID: <3B040054.102EBE46@ActiveState.com> "Martin v. Loewis" wrote: > >... > > There probably is none, although giving them a .txt extension is a > good starting point. What is the standard for tagging KOI8-R documents > on the Windows file system? There isn't one. But utf-8 is an encoding that is growing in popularity and KOI8-R is one that is shrinking. The unreliability of "code pages" is a big part of what Unicode is supposed to fix. > > So what if there is a BOM in the middle of the data stream. MAL's > > decoder will just remove it anyhow. :) > > Yes, and I think this is a bug. Nevertheless, I don't see how concatenating two BOM-prefixed UTF-8 streams is any more or less problematic than concatenating two BOM-prefixed UTF-16 streams. I'll repeat that I'm not saying that the UTF-8 encoder should add a BOM. Until this convention is more common, we shouldn't try to be innovative. But I still think that BOMs on UTF-8 are a good idea. -- Take a recipe. Leave a recipe. Python Cookbook! http://www.ActiveState.com/pythoncookbook From paulp@ActiveState.com Fri May 18 19:45:22 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Fri, 18 May 2001 11:45:22 -0700 Subject: [I18n-sig] Transparent Encoding Message-ID: <3B056DC2.D143F641@ActiveState.com> I would like to suggest that if the "data_encoding" parameter of EncodedFile is missing or None, the encoding "unicode_internal" should be used. Right now it is not really clear how to use the EncodedFile to *encode* or *decode* as opposed to *transcode* (translate between encodings). In fact it is documented only as a transcoder even though I think that it will more often be used as an encoder or decoder. -- Take a recipe. Leave a recipe. Python Cookbook! http://www.ActiveState.com/pythoncookbook From mal@lemburg.com Fri May 18 21:05:20 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Fri, 18 May 2001 22:05:20 +0200 Subject: [I18n-sig] Transparent Encoding References: <3B056DC2.D143F641@ActiveState.com> Message-ID: <3B058080.7CEFF26C@lemburg.com> Paul Prescod wrote: > > I would like to suggest that if the "data_encoding" parameter of > EncodedFile is missing or None, the encoding "unicode_internal" should > be used. Right now it is not really clear how to use the EncodedFile to > *encode* or *decode* as opposed to *transcode* (translate between > encodings). In fact it is documented only as a transcoder even though I > think that it will more often be used as an encoder or decoder. EncodedFile() creates an object which interfaces between two worlds: the file and the program. In this sense it is always a recoder. I don't see why you want to make unicode-internal the default for data_encoding... if you don't want an encoding, you shouldn't use EncodedFile() at all. 
-- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From paulp@ActiveState.com Fri May 18 21:40:37 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Fri, 18 May 2001 13:40:37 -0700 Subject: [I18n-sig] Transparent Encoding References: <3B056DC2.D143F641@ActiveState.com> <3B058080.7CEFF26C@lemburg.com> Message-ID: <3B0588C5.97E5E2E8@ActiveState.com> "M.-A. Lemburg" wrote: > >... > > I don't see why you want to make unicode-internal the default > for data_encoding... if you don't want an encoding, you shouldn't > use EncodedFile() at all. What's a better idiom for stream = codecs.EncodedFile(fileobj, "unicode-internal", "utf-8") if I want to wrap a writable fileobj in a transparent UTF-8 encoder? ---- Also, in Python 2.1 I just noticed that this code does some weird pointer thing that crashes Python sometimes: >>> for i in (1,2,3): ... codecs.EncodedFile(open("foo.txt","w"), "unicode-internal", "utf-8").write(u"\u2222") ... >>> ^Z Sometimes it crashes immediately and sometimes it only crashes when you try to shut down Python. I can submit a bug report if you can't diagnose this easily and haven't heard of it before. -- Take a recipe. Leave a recipe. Python Cookbook! http://www.ActiveState.com/pythoncookbook From fw@deneb.enyo.de Fri May 18 22:05:51 2001 From: fw@deneb.enyo.de (Florian Weimer) Date: 18 May 2001 23:05:51 +0200 Subject: [I18n-sig] UTF-8 and BOM In-Reply-To: <3B02F0B1.8863FDB1@lemburg.com> ("M.-A. Lemburg"'s message of "Wed, 16 May 2001 23:27:13 +0200") References: <3B02B9BB.E1F6AE39@ActiveState.com> <200105162107.f4GL7nU01574@mira.informatik.hu-berlin.de> <3B02F0B1.8863FDB1@lemburg.com> Message-ID: <877kze74gg.fsf@deneb.enyo.de> "M.-A. Lemburg" writes: > Why should a BOM behave any different than any other Unicode > character ? BOMs can be added and deleted in pretty much all > places of a Unicode text -- that's their intent after all, so > I don't see how they could break any property of an encoding. The BOM is overloaded with two meanings, it's certainly not a no-op character. From fw@deneb.enyo.de Fri May 18 22:04:13 2001 From: fw@deneb.enyo.de (Florian Weimer) Date: 18 May 2001 23:04:13 +0200 Subject: [I18n-sig] UTF-8 and BOM In-Reply-To: <3B02B9BB.E1F6AE39@ActiveState.com> (Paul Prescod's message of "Wed, 16 May 2001 10:32:43 -0700") References: <3B02B9BB.E1F6AE39@ActiveState.com> Message-ID: <87bsoq74j6.fsf@deneb.enyo.de> Paul Prescod writes: > 3) I think that distinguishing UTF-8 from other encodings through the > BOM is actually a great idea and I wish that every UTF-8 creator would > do it! I think it's even mandated by ISO/IEC 10646-1:2000. However, the BOM is incompatible with the traditional Unix tools, so most people (especially the Linux-UTF-8 folks) recommend not to use it. From martin@loewis.home.cs.tu-berlin.de Fri May 18 22:05:09 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. 
Loewis) Date: Fri, 18 May 2001 23:05:09 +0200 Subject: [I18n-sig] Transparent Encoding In-Reply-To: <3B0588C5.97E5E2E8@ActiveState.com> (message from Paul Prescod on Fri, 18 May 2001 13:40:37 -0700) References: <3B056DC2.D143F641@ActiveState.com> <3B058080.7CEFF26C@lemburg.com> <3B0588C5.97E5E2E8@ActiveState.com> Message-ID: <200105182105.f4IL59I02050@mira.informatik.hu-berlin.de> > What's a better idiom for > > stream = codecs.EncodeFile(fileobj, "unicode-internal", "utf-8") > > I want to a writable fileobj in a transparent UTF-8 encoder? Is fileobj already given as open, or do you have a filename for it? If the latter, just do stream = codecs.open(filename, "w", encoding="utf-8") If the former, do writer = codecs.lookup("utf-8")[3] # or # enc, dec, reader, writer = codecs.lookup("utf-8") stream = writer(fileobj) An EncodedFile is not suitable since it has byte strings on both ends, and Unicode strings only inside. Regards, Martin From paulp@ActiveState.com Fri May 18 23:01:57 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Fri, 18 May 2001 15:01:57 -0700 Subject: [I18n-sig] Transparent Encoding References: <3B056DC2.D143F641@ActiveState.com> <3B058080.7CEFF26C@lemburg.com> <3B0588C5.97E5E2E8@ActiveState.com> <200105182105.f4IL59I02050@mira.informatik.hu-berlin.de> Message-ID: <3B059BD5.CE05BBF3@ActiveState.com> "Martin v. Loewis" wrote: > >... > An EncodedFile is not suitable since it has byte strings on both ends, > and Unicode strings only inside. EncodedFile seems to work as I ask if I pass it the encoding name as "unicode-internal". Furthermore, code that does that is much simpler than code that looks up the codec manually. I'm not a big fan of those codec tuples. Current: writer = codecs.lookup("utf-8")[3] stream = writer(fileobj) Proposed: codecs.EncodedFile(fileobj, None, "utf-8") As I understand it, you can almost always go without looking up the encoder tuple thanks to the .encode method. And you can almost always go without looking up the decoder, thanks to the .decode method. This EncodedFile convention would allow most common cases of wrapping Unicode to avoid looking up the tuple also. -- Take a recipe. Leave a recipe. Python Cookbook! http://www.ActiveState.com/pythoncookbook From mal@lemburg.com Sat May 19 11:16:55 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Sat, 19 May 2001 12:16:55 +0200 Subject: [I18n-sig] UTF-8 and BOM References: <3B02B9BB.E1F6AE39@ActiveState.com> <200105162107.f4GL7nU01574@mira.informatik.hu-berlin.de> <3B02F0B1.8863FDB1@lemburg.com> <877kze74gg.fsf@deneb.enyo.de> Message-ID: <3B064817.95FB5F5E@lemburg.com> Florian Weimer wrote: > > "M.-A. Lemburg" writes: > > > Why should a BOM behave any different than any other Unicode > > character ? BOMs can be added and deleted in pretty much all > > places of a Unicode text -- that's their intent after all, so > > I don't see how they could break any property of an encoding. > > The BOM is overloaded with two meanings, it's certainly not a no-op > character. I didn't say that a BOM is a no-op character, just that adding or removing a BOM character doesn't break the encoding. For more infos on BOMs and how they are intended to be used, please see the Unicode FAQ: http://www.unicode.org/unicode/faq/utf_bom.html#24 The problem with BOMs is that they are supposed to appear at the start of a string. However, if you concatenate two such strings, the BOM in the middle will turn into a normal ZWNBSP character. 
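For illustration, a small sketch of that effect (the result shown assumes a decoder that strips only a leading BOM; a decoder that strips them all would return u'abcdef' instead):

>>> data = u"abc".encode("utf-16") + u"def".encode("utf-16")
>>> unicode(data, "utf-16")
u'abc\ufeffdef'

The first BOM is consumed as a byte order mark, while the second survives as an ordinary ZWNBSP in the middle of the text.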
To be fully standards compliant, string concat of a UTF-16 string (which start with BOM marks) would have to be special cased. This is not possible though, since strings don't have any encoding information. The only way to properly deal with all this is at application level, since only the programmer knows which string will actually form the start of a file or a larger text string. What I could do, is add a UTF-8 codec which prepends a BOM mark and removes it from the stream during decode. The programmer would have to do use this codec in case she wants to prepend UTF-8 files with a BOM then. I'm still unsure whether I should change the UTF-16 decoder to only remove the BOM at the start of the stream -- the above case where BOMs are inserted due to string concatenation is very common (each .write() to a file will produce such a BOM mark). -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From mal@lemburg.com Sat May 19 13:08:08 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Sat, 19 May 2001 14:08:08 +0200 Subject: [I18n-sig] Transparent Encoding References: <3B056DC2.D143F641@ActiveState.com> <3B058080.7CEFF26C@lemburg.com> <3B0588C5.97E5E2E8@ActiveState.com> <200105182105.f4IL59I02050@mira.informatik.hu-berlin.de> <3B059BD5.CE05BBF3@ActiveState.com> Message-ID: <3B066228.96022493@lemburg.com> Paul Prescod wrote: > > "Martin v. Loewis" wrote: > > > >... > > An EncodedFile is not suitable since it has byte strings on both ends, > > and Unicode strings only inside. > > EncodedFile seems to work as I ask if I pass it the encoding name as > "unicode-internal". Furthermore, code that does that is much simpler > than code that looks up the codec manually. I'm not a big fan of those > codec tuples. > > Current: > > writer = codecs.lookup("utf-8")[3] > stream = writer(fileobj) > > Proposed: > > codecs.EncodedFile(fileobj, None, "utf-8") > > As I understand it, you can almost always go without looking up the > encoder tuple thanks to the .encode method. And you can almost always go > without looking up the decoder, thanks to the .decode method. This > EncodedFile convention would allow most common cases of wrapping Unicode > to avoid looking up the tuple also. Paul, I still don't understand what you really want to achieve. Do you want a file-like object which writes utf-8 and can take Unicode as input for write (as well as strings which are then handled in the usual ASCII way) and returns Unicode for .read() ? The encoding 'unicode-internal' is really only meant for low-level access to how we chose to represent Unicode at C level. This could well change in some future version (note that Unicode is still evolving and probably will continue to do so for some time; e.g. Unicode 3.1 is just out the door and adds another 50k character points, using the non-BMP space for the first time...). -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From guido@digicool.com Sat May 19 16:35:18 2001 From: guido@digicool.com (Guido van Rossum) Date: Sat, 19 May 2001 11:35:18 -0400 Subject: [I18n-sig] UTF-8 and BOM In-Reply-To: Your message of "Sat, 19 May 2001 12:16:55 +0200." 
<3B064817.95FB5F5E@lemburg.com> References: <3B02B9BB.E1F6AE39@ActiveState.com> <200105162107.f4GL7nU01574@mira.informatik.hu-berlin.de> <3B02F0B1.8863FDB1@lemburg.com> <877kze74gg.fsf@deneb.enyo.de> <3B064817.95FB5F5E@lemburg.com> Message-ID: <200105191535.LAA01629@cj20424-a.reston1.va.home.com> > The problem with BOMs is that they are supposed to appear at > the start of a string. Taken out of context, this strikes me as nonsense. Strings in memory (Python Unicode strings anyway) have absolutely no need for a byte order mark since they are always in the right (native) byte order. It is *files* that are supposed to have a BOM at the start. I think the difference is worth noting: I don't mind if apps that read files have to deal with the BOM (including, of course, using the proper byte order to read the rest of the file). But it is absurd to expect code dealing with *strings* to handle BOMs. --Guido van Rossum (home page: http://www.python.org/~guido/) From martin@loewis.home.cs.tu-berlin.de Sat May 19 07:59:10 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Sat, 19 May 2001 08:59:10 +0200 Subject: [I18n-sig] Transparent Encoding In-Reply-To: <3B059BD5.CE05BBF3@ActiveState.com> (message from Paul Prescod on Fri, 18 May 2001 15:01:57 -0700) References: <3B056DC2.D143F641@ActiveState.com> <3B058080.7CEFF26C@lemburg.com> <3B0588C5.97E5E2E8@ActiveState.com> <200105182105.f4IL59I02050@mira.informatik.hu-berlin.de> <3B059BD5.CE05BBF3@ActiveState.com> Message-ID: <200105190659.f4J6xAs01264@mira.informatik.hu-berlin.de> > > An EncodedFile is not suitable since it has byte strings on both ends, > > and Unicode strings only inside. > > EncodedFile seems to work as I ask if I pass it the encoding name as > "unicode-internal". What do you mean, "seems to work". The encoding "unicode-internal" still produces byte strings, e.g. >>> s=u"Hallo" >>> s.encode("unicode-internal") 'H\x00a\x00l\x00l\x00o\x00' >>> s u'Hallo' A unicode-internal encoded byte string is *not* the same thing as a Unicode string. > Furthermore, code that does that is much simpler > than code that looks up the codec manually. I'm not a big fan of those > codec tuples. > > Current: > > writer = codecs.lookup("utf-8")[3] > stream = writer(fileobj) > > Proposed: > > codecs.EncodedFile(fileobj, None, "utf-8") -0. Regards, Martin From tdickenson@geminidataloggers.com Mon May 21 11:06:46 2001 From: tdickenson@geminidataloggers.com (Toby Dickenson) Date: Mon, 21 May 2001 11:06:46 +0100 Subject: [I18n-sig] UTF-8 and BOM In-Reply-To: <200105191535.LAA01629@cj20424-a.reston1.va.home.com> References: <3B02B9BB.E1F6AE39@ActiveState.com> <200105162107.f4GL7nU01574@mira.informatik.hu-berlin.de> <3B02F0B1.8863FDB1@lemburg.com> <877kze74gg.fsf@deneb.enyo.de> <3B064817.95FB5F5E@lemburg.com> <200105191535.LAA01629@cj20424-a.reston1.va.home.com> Message-ID: On Sat, 19 May 2001 11:35:18 -0400, Guido van Rossum wrote: >> The problem with BOMs is that they are supposed to appear at >> the start of a string. > >Taken out of context, this strikes me as nonsense. Strings in memory >(Python Unicode strings anyway) have absolutely no need for a byte >order mark since they are always in the right (native) byte order. Thats true for Unicode strings. However, a python plain string containing an encoded Unicode string (in *any* character encoding) is no different to a file here - its just a block-o-bytes. >it is absurd to >expect code dealing with *strings* to handle BOMs. 
I agree with that, and is a good reason why the codecs should always remove them. "M.-A. Lemburg" wrote: >I'm still unsure whether I should change the UTF-16 decoder >to only remove the BOM at the start of the stream -- the above >case where BOMs are inserted due to string concatenation >is very common (each .write() to a file will produce such >a BOM mark). Toby Dickenson tdickenson@geminidataloggers.com From walter@livinglogic.de Mon May 21 12:08:34 2001 From: walter@livinglogic.de (Walter Doerwald) Date: Mon, 21 May 2001 13:08:34 +0200 Subject: [I18n-sig] UTF-8 and BOM In-Reply-To: References: <3B02B9BB.E1F6AE39@ActiveState.com> <200105162107.f4GL7nU01574@mira.informatik.hu-berlin.de> <3B02F0B1.8863FDB1@lemburg.com> <877kze74gg.fsf@deneb.enyo.de> <3B064817.95FB5F5E@lemburg.com> <200105191535.LAA01629@cj20424-a.reston1.va.home.com> Message-ID: <200105211308340281.00448C08@mail.livinglogic.de> On 21.05.01 at 11:06 Toby Dickenson wrote: > [...] > >it is absurd to > >expect code dealing with *strings* to handle BOMs. > > I agree with that, and is a good reason why the codecs should always > remove them. ??? This is a good reason why the codec should pass the \ufeff through, because a \ufeff in a unicode object should not be considered to be a BOM but a ZWNBSP (it might e.g. be used to give hints to a hyphenation or ligature algorithm.) > "M.-A. Lemburg" wrote: > > >I'm still unsure whether I should change the UTF-16 decoder > >to only remove the BOM at the start of the stream -- the above > >case where BOMs are inserted due to string concatenation > >is very common (each .write() to a file will produce such > >a BOM mark). Then the write function has an error. A BOM should only be written at the start of the file and not on every call to write(). The Unicode FAQ (http://www.unicode.org/unicode/faq/utf_bom.html#24) states: Q: I am using a protocol that has BOM at the start of text. How do I represent an initial ZWNBSP? A: Use the sequence FEFF FEFF But with the current decoder implementation *both* \ufeffs will be removed, so the ZWNBSP disappears. Bye, Walter Dörwald -- Walter Dörwald · LivingLogic AG · Bayreuth, Germany · www.livinglogic.de From mal@lemburg.com Mon May 21 12:45:46 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Mon, 21 May 2001 13:45:46 +0200 Subject: [I18n-sig] UTF-8 and BOM References: <3B02B9BB.E1F6AE39@ActiveState.com> <200105162107.f4GL7nU01574@mira.informatik.hu-berlin.de> <3B02F0B1.8863FDB1@lemburg.com> <877kze74gg.fsf@deneb.enyo.de> <3B064817.95FB5F5E@lemburg.com> <200105191535.LAA01629@cj20424-a.reston1.va.home.com> <200105211308340281.00448C08@mail.livinglogic.de> Message-ID: <3B08FFEA.7130A83F@lemburg.com> Walter Doerwald wrote: > > On 21.05.01 at 11:06 Toby Dickenson wrote: > > > [...] > > >it is absurd to > > >expect code dealing with *strings* to handle BOMs. > > > > I agree with that, and is a good reason why the codecs should always > > remove them. > > ??? This is a good reason why the codec should pass the \ufeff > through, because a \ufeff in a unicode object should not be > considered to be a BOM but a ZWNBSP (it might e.g. be used to > give hints to a hyphenation or ligature algorithm.) True. > > "M.-A. Lemburg" wrote: > > > > >I'm still unsure whether I should change the UTF-16 decoder > > >to only remove the BOM at the start of the stream -- the above > > >case where BOMs are inserted due to string concatenation > > >is very common (each .write() to a file will produce such > > >a BOM mark). > > Then the write function has an error. 
A BOM should only be > written at the start of the file and not on every call to > write(). That's hard to implement... how would the codec know where the stream starts -- it only interfaces to the underyling stream using .read() and .write() ? > The Unicode FAQ (http://www.unicode.org/unicode/faq/utf_bom.html#24) > states: > Q: I am using a protocol that has BOM at the start of text. > How do I represent an initial ZWNBSP? > > A: Use the sequence FEFF FEFF > > But with the current decoder implementation *both* \ufeffs > will be removed, so the ZWNBSP disappears. Note that this only happens in the UTF-16 codec. All other codecs pass through the BOMs as-is. Perhaps I should modify the UTF-16 codec to only remove BOMs when used in UTF-16 mode (without byte order indication) and not in UTF-16-LE/UTF-16-BE mode ?! ... and then only at the start of a string. -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From martin@loewis.home.cs.tu-berlin.de Mon May 21 15:40:56 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Mon, 21 May 2001 16:40:56 +0200 Subject: [I18n-sig] UTF-8 and BOM In-Reply-To: (message from Toby Dickenson on Mon, 21 May 2001 11:06:46 +0100) References: <3B02B9BB.E1F6AE39@ActiveState.com> <200105162107.f4GL7nU01574@mira.informatik.hu-berlin.de> <3B02F0B1.8863FDB1@lemburg.com> <877kze74gg.fsf@deneb.enyo.de> <3B064817.95FB5F5E@lemburg.com> <200105191535.LAA01629@cj20424-a.reston1.va.home.com> Message-ID: <200105211440.f4LEeuB01307@mira.informatik.hu-berlin.de> > Thats true for Unicode strings. > > However, a python plain string containing an encoded Unicode string > (in *any* character encoding) is no different to a file here - its > just a block-o-bytes. The problem with that approach is that writing to a UTF-16-encoded file (as obtained by codecs.open(filename, "w", encoding="utf-16")) will put the BOM in front of every chunk of data as passed to .write(). That is an error, IMO, the stream writer should only put the BOM into the beginning of the entire file. Regards, Martin From martin@loewis.home.cs.tu-berlin.de Mon May 21 15:44:20 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Mon, 21 May 2001 16:44:20 +0200 Subject: [I18n-sig] UTF-8 and BOM In-Reply-To: <200105211308340281.00448C08@mail.livinglogic.de> (walter@livinglogic.de) References: <3B02B9BB.E1F6AE39@ActiveState.com> <200105162107.f4GL7nU01574@mira.informatik.hu-berlin.de> <3B02F0B1.8863FDB1@lemburg.com> <877kze74gg.fsf@deneb.enyo.de> <3B064817.95FB5F5E@lemburg.com> <200105191535.LAA01629@cj20424-a.reston1.va.home.com> <200105211308340281.00448C08@mail.livinglogic.de> Message-ID: <200105211444.f4LEiKQ01309@mira.informatik.hu-berlin.de> > > >it is absurd to > > >expect code dealing with *strings* to handle BOMs. > > > > I agree with that, and is a good reason why the codecs should always > > remove them. > > ??? This is a good reason why the codec should pass the \ufeff > through, because a \ufeff in a unicode object should not be > considered to be a BOM but a ZWNBSP (it might e.g. be used to > give hints to a hyphenation or ligature algorithm.) I agree. The decoder should *never* remove the BOM in the middle of a string. > Then the write function has an error. A BOM should only be > written at the start of the file and not on every call to > write(). I agree. 
Fixing that should not be too difficult; the codec instance just needs to change its .encode and .decode attributes after the first write. This raises the question what: f = open("/tmp/foo","w",encoding="utf-16") f.close() should give: an empty file, or a file containing just the BOM? Regards, Martin From martin@loewis.home.cs.tu-berlin.de Mon May 21 15:50:41 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Mon, 21 May 2001 16:50:41 +0200 Subject: [I18n-sig] UTF-8 and BOM In-Reply-To: <3B08FFEA.7130A83F@lemburg.com> (mal@lemburg.com) References: <3B02B9BB.E1F6AE39@ActiveState.com> <200105162107.f4GL7nU01574@mira.informatik.hu-berlin.de> <3B02F0B1.8863FDB1@lemburg.com> <877kze74gg.fsf@deneb.enyo.de> <3B064817.95FB5F5E@lemburg.com> <200105191535.LAA01629@cj20424-a.reston1.va.home.com> <200105211308340281.00448C08@mail.livinglogic.de> <3B08FFEA.7130A83F@lemburg.com> Message-ID: <200105211450.f4LEofx01332@mira.informatik.hu-berlin.de> > That's hard to implement... how would the codec know where the > stream starts -- it only interfaces to the underyling stream > using .read() and .write() ? The stream readers and writers should assume that the first read and write operation use the ZWNBSP as the BOM, so they should stop giving a byte-order meaning to the BOM once they have seen the first chunk of data. That is best implemented by replacing the .encode function with utf_16_be/le_encode (as appropriate). > Note that this only happens in the UTF-16 codec. All other codecs > pass through the BOMs as-is. Perhaps I should modify the UTF-16 > codec to only remove BOMs when used in UTF-16 mode (without byte > order indication) and not in UTF-16-LE/UTF-16-BE mode ?! You may want to study the RFC just to be sure, but I think this is how UTF-16-[BL]E are defined. Regards, Martin From mal@lemburg.com Mon May 21 18:02:35 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Mon, 21 May 2001 19:02:35 +0200 Subject: [I18n-sig] UTF-8 and BOM References: <3B02B9BB.E1F6AE39@ActiveState.com> <200105162107.f4GL7nU01574@mira.informatik.hu-berlin.de> <3B02F0B1.8863FDB1@lemburg.com> <877kze74gg.fsf@deneb.enyo.de> <3B064817.95FB5F5E@lemburg.com> <200105191535.LAA01629@cj20424-a.reston1.va.home.com> <200105211308340281.00448C08@mail.livinglogic.de> <3B08FFEA.7130A83F@lemburg.com> <200105211450.f4LEofx01332@mira.informatik.hu-berlin.de> Message-ID: <3B094A2B.D7192F4C@lemburg.com> "Martin v. Loewis" wrote: > > > That's hard to implement... how would the codec know where the > > stream starts -- it only interfaces to the underyling stream > > using .read() and .write() ? > > The stream readers and writers should assume that the first read and > write operation use the ZWNBSP as the BOM, so they should stop giving > a byte-order meaning to the BOM once they have seen the first chunk of > data. That is best implemented by replacing the .encode function with > utf_16_be/le_encode (as appropriate). Patches are welcome :-) > > Note that this only happens in the UTF-16 codec. All other codecs > > pass through the BOMs as-is. Perhaps I should modify the UTF-16 > > codec to only remove BOMs when used in UTF-16 mode (without byte > > order indication) and not in UTF-16-LE/UTF-16-BE mode ?! > > You may want to study the RFC just to be sure, but I think this is how > UTF-16-[BL]E are defined. According to the Unicode FAQ, BOM marks should only be used where the byte order is not immediatly clear. In the case -LE and -BE, this information is available, which is why the codecs don't prepend a BOM mark. 
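A rough sketch of the approach Martin suggests (swap the writer's encode function after the first chunk), assuming the codecs.StreamWriter interface where write() calls self.encode(object, self.errors); this is purely illustrative, not the actual patch:

import codecs, sys

class UTF16BOMOnceWriter(codecs.StreamWriter):
    # The first write goes through utf_16_encode, which emits a BOM in
    # native byte order; afterwards switch to the endian-specific encoder
    # so that no further BOMs are written.
    def __init__(self, stream, errors='strict'):
        codecs.StreamWriter.__init__(self, stream, errors)
        self.encode = codecs.utf_16_encode

    def write(self, object):
        codecs.StreamWriter.write(self, object)
        if sys.byteorder == 'little':
            self.encode = codecs.utf_16_le_encode
        else:
            self.encode = codecs.utf_16_be_encode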
Ok, I will modify the UTF-16-LE and -BE decoders to not remove BOMs anymore and fix the UTF-16 decoder to only remove BOMs at the start of the string. With these changes you should be able to fix the UTF-16 stream codec to be more RFC compliant. -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From guido@digicool.com Mon May 21 17:55:21 2001 From: guido@digicool.com (Guido van Rossum) Date: Mon, 21 May 2001 12:55:21 -0400 Subject: [I18n-sig] UTF-8 and BOM In-Reply-To: Your message of "Mon, 21 May 2001 13:45:46 +0200." <3B08FFEA.7130A83F@lemburg.com> References: <3B02B9BB.E1F6AE39@ActiveState.com> <200105162107.f4GL7nU01574@mira.informatik.hu-berlin.de> <3B02F0B1.8863FDB1@lemburg.com> <877kze74gg.fsf@deneb.enyo.de> <3B064817.95FB5F5E@lemburg.com> <200105191535.LAA01629@cj20424-a.reston1.va.home.com> <200105211308340281.00448C08@mail.livinglogic.de> <3B08FFEA.7130A83F@lemburg.com> Message-ID: <200105211657.f4LGtcs20688@odiug.digicool.com> > > Then the write function has an error. A BOM should only be > > written at the start of the file and not on every call to > > write(). > > That's hard to implement... how would the codec know where the > stream starts -- it only interfaces to the underyling stream > using .read() and .write() ? To me this looks like it should be an application issue. The application should write an explicit BOM at the start of each file it writes. The codecs shouldn't do anything with BOMs -- just pass them through. I'm pretty sure that's what the intention of BOMs in the Unicode standard was, because it's the only reasonable approach -- if it isn't, I'd like to see chapter and verse quoted. ;-) --Guido van Rossum (home page: http://www.python.org/~guido/) From barry@wooz.org Mon May 21 20:49:33 2001 From: barry@wooz.org (Barry A. Warsaw) Date: Mon, 21 May 2001 15:49:33 -0400 Subject: [I18n-sig] pygettext.py extraction of docstrings References: <14840.35473.307059.990479@anthem.concentric.net> <200010272228.AAA01066@loewis.home.cs.tu-berlin.de> Message-ID: <15113.29005.357449.812516@anthem.wooz.org> A very long time ago I wrote: >> I have a tentative patch for Tools/i18n/pygettext.py which adds >> optional extraction of module, class, method, and function >> docstrings. >> One question: should docstring extraction be turned on my >> default? >>>>> And "MvL" == Martin v Loewis >>>>> responded: MvL> I'd say so, yes. People who are confronted with gettext for MvL> the first time will say "Wow, it even does that!". In the MvL> rare cases where doc strings would confuse the meat of the MvL> catalog, people will be able to turn that off. Perhaps it MvL> may be good to indicate in the catalog that this is a doc MvL> string? I'm thinking of MvL> #, py-doc MvL> I don't know the exact specification of the #, comments, but MvL> it can look like MvL> #, c-format, fuzzy MvL> i.e. it appears to be a comma-separated list of informative MvL> flags. Translators could then decide to deal with doc strings MvL> in a different manner (e.g follow different grammatical MvL> conventions). Nearest I can tell, according to http://www.gnu.org/manual/gettext/html_chapter/gettext_2.html#SEC9 I think the correct thing to do is to mark docstring extractions with #. docstring comments. 
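For reference, a catalog entry marked this way would look roughly like the following (the msgid and source reference are only an example):

#. docstring
#: Mailman/Archiver/Archiver.py:142
msgid "The mbox name where messages are left for archive construction."
msgstr ""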
I'm going to check in a patch to do this now, although for backwards compatibility I think I will still leave docstring extraction disabled by default (enabled it with -D / --docstrings). -Barry From mal@lemburg.com Tue May 22 09:57:43 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Tue, 22 May 2001 10:57:43 +0200 Subject: [I18n-sig] UTF-8 and BOM References: <3B02B9BB.E1F6AE39@ActiveState.com> <200105162107.f4GL7nU01574@mira.informatik.hu-berlin.de> <3B02F0B1.8863FDB1@lemburg.com> <877kze74gg.fsf@deneb.enyo.de> <3B064817.95FB5F5E@lemburg.com> <200105191535.LAA01629@cj20424-a.reston1.va.home.com> <200105211308340281.00448C08@mail.livinglogic.de> <3B08FFEA.7130A83F@lemburg.com> <200105211450.f4LEofx01332@mira.informatik.hu-berlin.de> <3B094A2B.D7192F4C@lemburg.com> Message-ID: <3B0A2A07.85EFB484@lemburg.com> "M.-A. Lemburg" wrote: > > "Martin v. Loewis" wrote: > > > > > That's hard to implement... how would the codec know where the > > > stream starts -- it only interfaces to the underyling stream > > > using .read() and .write() ? > > > > The stream readers and writers should assume that the first read and > > write operation use the ZWNBSP as the BOM, so they should stop giving > > a byte-order meaning to the BOM once they have seen the first chunk of > > data. That is best implemented by replacing the .encode function with > > utf_16_be/le_encode (as appropriate). Patches are welcome :-) > > > Note that this only happens in the UTF-16 codec. All other codecs > > > pass through the BOMs as-is. Perhaps I should modify the UTF-16 > > > codec to only remove BOMs when used in UTF-16 mode (without byte > > > order indication) and not in UTF-16-LE/UTF-16-BE mode ?! > > > > You may want to study the RFC just to be sure, but I think this is how > > UTF-16-[BL]E are defined. > > According to the Unicode FAQ, BOM marks should only be used > where the byte order is not immediatly clear. In the case -LE and > -BE, this information is available, which is why the codecs > don't prepend a BOM mark. > > Ok, I will modify the UTF-16-LE and -BE decoders to not remove > BOMs anymore and fix the UTF-16 decoder to only remove BOMs at > the start of the string. With these changes you should be able > to fix the UTF-16 stream codec to be more RFC compliant. Done. See the CVS versions of Misc/NEWS and Include/unicodeobject.h for details. -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From tkg@menteith.com Wed May 23 04:17:58 2001 From: tkg@menteith.com (Tony Graham) Date: Tue, 22 May 2001 23:17:58 -0400 (EST) Subject: [I18n-sig] UTF-8 and BOM In-Reply-To: <118422204@toto.iv> Message-ID: <15115.11238.318000.934980@menteith.com> At 21 May 2001 12:55 -0400, Guido van Rossum wrote: > I'm pretty sure that's what the intention of BOMs in the Unicode > standard was, because it's the only reasonable approach -- if it > isn't, I'd like to see chapter and verse quoted. ;-) See Section 5.6 in http://www.unicode.org/unicode/uni2book/ch13.pdf. I could also quote a chapter from "Unicode: A Primer," but it doesn't have any verses. Regards, Tony Graham. 
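A minimal sketch of the application-level approach Guido suggests, using a byte-order-specific codec so that the codec itself never inserts a BOM (the file name is arbitrary):

import codecs

f = codecs.open('report.txt', 'w', encoding='utf-16-le')
f.write(u'\ufeff')            # explicit BOM, written once by the application
f.write(u'first chunk, ')
f.write(u'second chunk\n')    # no BOM here
f.close()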
From walter@livinglogic.de Wed May 23 10:35:32 2001 From: walter@livinglogic.de (Walter Doerwald) Date: Wed, 23 May 2001 11:35:32 +0200 Subject: [I18n-sig] Re: [XML-SIG] XML and Unicode In-Reply-To: <20010522193314.E22396@mnot.net> References: <20010522150638.C22396@mnot.net> <3B0AEA6A.9CCD2A1F@lemburg.com> <20010522193314.E22396@mnot.net> Message-ID: <200105231135320031.00663C0E@mail.livinglogic.de> On 22.05.01 at 19:33 Mark Nottingham wrote: > OK, so I'm not getting something then. The attached test script (and > data file) is the problem pared down - if u'string' is a neutral > encoding, and .encode('utf-8') generates a utf-8 encoded string of > that encoding, then the utf-8.html output file should display > correctly; however, it doesn't, while the latin-1 output does > (because the input is latin-1). >>> open("ISO-8859-1.xml","rb").read() '\r\nNet 21 \x96 The Survivors\r\n\r\n' The character contained in your test XML file seems to be \x96, which is a control character in Unicode, but in Windows it's used as an endash. If you want a "real" endash you should use the Unicode ndash U+2013: "Net 21 – The Survivors". But then encoding the output with latin-1 will no longer work. > [...] BTW, you might want to try several variants for the name of the output encoding, because although Python's encode method recognises the name, your web browser might not. Bye, Walter Dörwald -- Walter Dörwald · LivingLogic AG · Bayreuth, Germany · www.livinglogic.de From keichwa@gmx.net Thu May 24 21:02:47 2001 From: keichwa@gmx.net (Karl Eichwalder) Date: 24 May 2001 22:02:47 +0200 Subject: [I18n-sig] Re: pygettext.py extraction of docstrings In-Reply-To: <15113.29005.357449.812516@anthem.wooz.org> References: <14840.35473.307059.990479@anthem.concentric.net> <200010272228.AAA01066@loewis.home.cs.tu-berlin.de> <15113.29005.357449.812516@anthem.wooz.org> Message-ID: barry@wooz.org (Barry A. Warsaw) writes: > >>>>> And "MvL" == Martin v Loewis > >>>>> responded: > MvL> #, py-doc > > MvL> I don't know the exact specification of the #, comments, but > MvL> it can look like > > MvL> #, c-format, fuzzy > > MvL> i.e. it appears to be a comma-separated list of informative > MvL> flags. Translators could then decide to deal with doc strings > MvL> in a different manner (e.g follow different grammatical > MvL> conventions). > I think the correct thing to do is to mark docstring extractions with > > #. docstring > > comments. No, #. is reserved for literally extracted comments; #, is for meta-comments. Martin's proposal sounds better. -- work : ke@suse.de | ,__o : http://www.suse.de/~ke/ | _-\_<, home : keichwa@gmx.net | (*)/'(*) From barry@wooz.org Fri May 25 00:15:50 2001 From: barry@wooz.org (Barry A. Warsaw) Date: Thu, 24 May 2001 19:15:50 -0400 Subject: [I18n-sig] Re: pygettext.py extraction of docstrings References: <14840.35473.307059.990479@anthem.concentric.net> <200010272228.AAA01066@loewis.home.cs.tu-berlin.de> <15113.29005.357449.812516@anthem.wooz.org> Message-ID: <15117.38438.361043.255768@anthem.wooz.org> >>>>> "KE" == Karl Eichwalder writes: >> I think the correct thing to do is to mark docstring >> extractions with #. docstring comments. KE> No, #. is reserved for literally extracted comments; #, is for KE> meta-comments. Martin's proposal sounds better. You probably know better than me, but, is that opinion based on more information than is available in the GNU gettext manual?
http://www.gnu.org/manual/gettext/html_node/gettext_9.html#SEC9 seems to imply to me that #, comments define only two flags (i.e. "fuzzy" and "c-format" / "no-c-format") and it doesn't say that the flags are extensible or user definable. Then again, it doesn't say that #. comments are reserved. It basically just says that #-whitespace comments are reserved for the translators. I'm happy to switch it, but I'd really like to have a reference I can point to to short-circuit any further discussion. Even a mailing list archive url would be fine. Thanks, -Barry From keichwa@gmx.net Fri May 25 06:11:57 2001 From: keichwa@gmx.net (Karl Eichwalder) Date: 25 May 2001 07:11:57 +0200 Subject: [I18n-sig] Re: pygettext.py extraction of docstrings In-Reply-To: <15117.38438.361043.255768@anthem.wooz.org> References: <14840.35473.307059.990479@anthem.concentric.net> <200010272228.AAA01066@loewis.home.cs.tu-berlin.de> <15113.29005.357449.812516@anthem.wooz.org> <15117.38438.361043.255768@anthem.wooz.org> Message-ID: barry@wooz.org (Barry A. Warsaw) writes: > You probably know better than me, but, is that opinion based on more > information than is available in the GNU gettext manual? This is another piece of info you'll find within the gettext manual: -=-=-=-=-=-=-=-=-=-=-=-=-=- cut here -=-=-=-=-=-=-=-=-=-=-=-=-=- Therefore the `xgettext' adds a special tag to those messages it thinks might be a format string. There is no absolute rule for this, only a heuristic. In the `.po' file the entry is marked using the `c-format' flag in the `#,' comment line (*note PO Files::). The careful reader now might say that this again can cause problems. The heuristic might guess it wrong. This is true and therefore `xgettext' knows about special kind of comment which lets the programmer take over the decision. If in the same line or the immediately preceding line of the `gettext' keyword the `xgettext' program find a comment containing the words `xgettext:c-format' it will mark the string in any case with the `c-format' flag. This kind of comment should be used when `xgettext' does not recognize the string as a format string but is really is one and it should be tested. Please note that when the comment is in the same line of the `gettext' keyword, it must be before the string to be translated. -=-=-=-=-=-=-=-=-=-=-=-=-=- cut here -=-=-=-=-=-=-=-=-=-=-=-=-=- > http://www.gnu.org/manual/gettext/html_node/gettext_9.html#SEC9 > > seems to imply to me that #, comments define only two flags > (i.e. "fuzzy" and "c-format" / "no-c-format") and it doesn't say that > the flags are extensible or user definable. Then again, it doesn't > say that #. comments are reserved. It basically just says that > #-whitespace comments are reserved for the translators. You're right. The term AUTOMATIC-COMMENTS is not properly defined. Also FLAG leaves open some questions. > I'm happy to switch it, but I'd really like to have a reference I can > point to to short-circuit any further discussion. Even a mailing list > archive url would be fine. It's now Bruno Haible who maintains the gettext suite. There's a po-utils-forum mailing list at IRO.UMontreal.CA initiated by François (thanks); mostly for my own amusement ;) The mailing list is archived -- at the moment I don't know where. You can start browsing here: http://www.iro.umontreal.ca/~pinard/po-utils/HTML/ But right now "titan" (François' workstation?) does not want to talk to me.
Please, try again later. The other gettext forum is gnu.utils.bugs . Karl --=20 work : ke@suse.de | ,__o : http://www.suse.de/~ke/ | _-\_<, home : keichwa@gmx.net | (*)/'(*) From barry@wooz.org Fri May 25 15:20:58 2001 From: barry@wooz.org (Barry A. Warsaw) Date: Fri, 25 May 2001 10:20:58 -0400 Subject: [I18n-sig] Re: pygettext.py extraction of docstrings References: <14840.35473.307059.990479@anthem.concentric.net> <200010272228.AAA01066@loewis.home.cs.tu-berlin.de> <15113.29005.357449.812516@anthem.wooz.org> <15117.38438.361043.255768@anthem.wooz.org> Message-ID: <15118.27210.930905.339141@anthem.wooz.org> Ah cool. For those just coming in: the issue is that pygettext.py extracts Python docstrings if you give it the -D/--docstring flag. I want to mark such docstrings in the .pot file because translators may not want or need to translate every docstring. The documentation for .po file comments is a little sparse here. I agree that the logical place for such markings is in the #, comments, e.g.: #, docstring #: Mailman/Archiver/Archiver.py:142 msgid "The mbox name where messages are left for archive construction." msgstr "" But the po-file format documentation doesn't say that additional flags can be defined for #, comments. It seems to me a simple omission in the documentation, right? Is the intent of #, flags that the extraction tools can define additional, language-specific flags? -Barry From martin@loewis.home.cs.tu-berlin.de Fri May 25 21:12:42 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Fri, 25 May 2001 22:12:42 +0200 Subject: [I18n-sig] Re: pygettext.py extraction of docstrings In-Reply-To: <15118.27210.930905.339141@anthem.wooz.org> (barry@wooz.org) References: <14840.35473.307059.990479@anthem.concentric.net> <200010272228.AAA01066@loewis.home.cs.tu-berlin.de> <15113.29005.357449.812516@anthem.wooz.org> <15117.38438.361043.255768@anthem.wooz.org> <15118.27210.930905.339141@anthem.wooz.org> Message-ID: <200105252012.f4PKCg801160@mira.informatik.hu-berlin.de> > But the po-file format documentation doesn't say that additional flags > can be defined for #, comments. It seems to me a simple omission in > the documentation, right? Is the intent of #, flags that the > extraction tools can define additional, language-specific flags? I'd say that nobody has thought of that. Bruno is probably the person to give a definitive yay or nay here, but I'd hope that tools shouldn't go into flames if they see an extra flag. Atleast GNU msgmerge does not show any concern. Of course, it would be better if this possibility could be codified somewhere, and if gettext.texi could serve as the repository of well-known flags - even if they don't all have a meaning to GNU gettext. Adding such documentation is probably an issue of submitting patches against gettext.texi. Regards, Martin From tree@basistech.com Wed May 30 22:37:16 2001 From: tree@basistech.com (Tom Emerson) Date: Wed, 30 May 2001 17:37:16 -0400 Subject: [I18n-sig] Unicode normalization and collation implementation? Message-ID: <15125.26636.297182.646562@cymru.basistech.com> I need to use the Unicode collation algorithm from Python --- has anyone implemented this yet? I'd rather not do it, so if someone else has code, share the wealth. -tree -- Tom Emerson Basis Technology Corp. Sr. Sinostringologist http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever" From mal@lemburg.com Thu May 31 08:16:41 2001 From: mal@lemburg.com (M.-A. 
Lemburg) Date: Thu, 31 May 2001 09:16:41 +0200 Subject: [I18n-sig] Unicode normalization and collation implementation? References: <15125.26636.297182.646562@cymru.basistech.com> Message-ID: <3B15EFD9.E6BAD2A9@lemburg.com> Tom Emerson wrote: > > I need to use the Unicode collation algorithm from Python --- has > anyone implemented this yet? I'd rather not do it, so if someone else > has code, share the wealth. No. It's been on the plate for some time now, though. Note that if your are going to start working in this direction, you should focus on normalization form C since this is probably the most often used (and practical) one: http://www.unicode.org/unicode/reports/tr15/ -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From tree@basistech.com Thu May 31 15:37:20 2001 From: tree@basistech.com (Tom Emerson) Date: Thu, 31 May 2001 10:37:20 -0400 Subject: [I18n-sig] Unicode normalization and collation implementation? In-Reply-To: <3B15EFD9.E6BAD2A9@lemburg.com> References: <15125.26636.297182.646562@cymru.basistech.com> <3B15EFD9.E6BAD2A9@lemburg.com> Message-ID: <15126.22304.663571.552971@cymru.basistech.com> M.-A. Lemburg writes: > Note that if your are going to start working in this direction, > you should focus on normalization form C since this is probably > the most often used (and practical) one: No, I need form D for the collation algorithm, so this is what I'm doing first. -tree -- Tom Emerson Basis Technology Corp. Sr. Sinostringologist http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever" From mal@lemburg.com Thu May 31 15:51:25 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Thu, 31 May 2001 16:51:25 +0200 Subject: [I18n-sig] Unicode normalization and collation implementation? References: <15125.26636.297182.646562@cymru.basistech.com> <3B15EFD9.E6BAD2A9@lemburg.com> <15126.22304.663571.552971@cymru.basistech.com> Message-ID: <3B165A6C.31B390C5@lemburg.com> Tom Emerson wrote: > > M.-A. Lemburg writes: > > Note that if your are going to start working in this direction, > > you should focus on normalization form C since this is probably > > the most often used (and practical) one: > > No, I need form D for the collation algorithm, so this is what I'm > doing first. Does that mean you are going to start working in that direction ? (would be great !) -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From tree@basistech.com Thu May 31 15:56:04 2001 From: tree@basistech.com (Tom Emerson) Date: Thu, 31 May 2001 10:56:04 -0400 Subject: [I18n-sig] Unicode normalization and collation implementation? In-Reply-To: <3B165A6C.31B390C5@lemburg.com> References: <15125.26636.297182.646562@cymru.basistech.com> <3B15EFD9.E6BAD2A9@lemburg.com> <15126.22304.663571.552971@cymru.basistech.com> <3B165A6C.31B390C5@lemburg.com> Message-ID: <15126.23428.889431.364510@cymru.basistech.com> M.-A. Lemburg writes: > > No, I need form D for the collation algorithm, so this is what I'm > > doing first. > > Does that mean you are going to start working in that direction ? > (would be great !) Yes, as I said, I need the Unicode collation algorithm now, so I'll be working on normalization and collation over the next week or two. 
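To make forms C and D concrete -- note that unicodedata.normalize() did not exist at the time of this thread; it only appeared in later Python releases (2.3 and up), so it is shown here purely as an illustration of the two forms:

import unicodedata   # normalize() is available from Python 2.3 on

s = u'\u00e9'                         # LATIN SMALL LETTER E WITH ACUTE
d = unicodedata.normalize('NFD', s)   # form D: base letter + combining acute
c = unicodedata.normalize('NFC', d)   # form C: precomposed again

print repr(d)   # u'e\u0301'
print repr(c)   # u'\xe9'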
-tree -- Tom Emerson Basis Technology Corp. Sr. Sinostringologist http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever" From mal@lemburg.com Thu May 31 18:13:03 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Thu, 31 May 2001 19:13:03 +0200 Subject: [I18n-sig] XML and UTF-16 Message-ID: <3B167B9F.344D6992@lemburg.com> What is the standard file layout to use for storing an XML file in UTF-16 ? 1) encode the whole file in UTF-16 (possibly prepended with a BOM) or 2) write the first line containing the XML header (which has the encoding information) in ASCII and then proceed with UTF-16 starting after the newline character or 3) none of the above: you simply don't do this ;-) Thanks, -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From tree@basistech.com Thu May 31 18:23:31 2001 From: tree@basistech.com (Tom Emerson) Date: Thu, 31 May 2001 13:23:31 -0400 Subject: [I18n-sig] XML and UTF-16 In-Reply-To: <3B167B9F.344D6992@lemburg.com> References: <3B167B9F.344D6992@lemburg.com> Message-ID: <15126.32275.110670.236066@cymru.basistech.com> M.-A. Lemburg writes: > What is the standard file layout to use for storing an XML file > in UTF-16 ? I thought this was covered in the XML specification as a non-normative appendix. Maybe not. > 1) encode the whole file in UTF-16 (possibly prepended with a BOM) Yes. You can then pretty easily autodetect the which Unicode transformation format is being used by looking at the first ten or so bytes. If the BOM is present, that's a big clue right there. UTF-16-BE will have the first "<?xml" encoded as 003C 003F 0078 006D 006E, while UTF-16-LE will have it encoded as 3C00 3F00 7800 6D00 6E00, and ASCII and UTF-8 will just have 3C 3F 78 6D 6E. > 2) write the first line containing the XML header (which has the > encoding information) in ASCII and then proceed with UTF-16 > starting after the newline character Ugh, no. -tree -- Tom Emerson Basis Technology Corp. Sr. Sinostringologist http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever" From mal@lemburg.com Thu May 31 18:39:17 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Thu, 31 May 2001 19:39:17 +0200 Subject: [I18n-sig] XML and UTF-16 References: <3B167B9F.344D6992@lemburg.com> <15126.32275.110670.236066@cymru.basistech.com> Message-ID: <3B1681C5.71FD484D@lemburg.com> Tom Emerson wrote: > > M.-A. Lemburg writes: > > What is the standard file layout to use for storing an XML file > > in UTF-16 ? > > I thought this was covered in the XML specification as a non-normative > appendix. Maybe not. I was too lazy to look it up :-) > > 1) encode the whole file in UTF-16 (possibly prepended with a BOM) > > Yes. You can then pretty easily autodetect the which Unicode > transformation format is being used by looking at the first ten or > so bytes. > > If the BOM is present, that's a big clue right there. > > UTF-16-BE will have the first "<?xml" encoded as > 003C 003F 0078 006D 006E > > while UTF-16-LE will have it encoded as > > 3C00 3F00 7800 6D00 6E00 > > ASCII and UTF-8 will just have > > 3C 3F 78 6D 6E Perhaps we should have some smart auto-detection API somewhere which does this automagically ?! Something like guess_xml_encoding(data) -> encoding string It could work by looking at the first 256 bytes of the data string and then apply all the tricks needed to extract the encoding information (or default to UTF-8 if no such information is given).
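A rough sketch of such a function -- the name and signature are only the suggestion made above, and this version deliberately ignores UTF-32 and the more unusual byte orders:

import re

def guess_xml_encoding(data):
    # 1) Byte order marks are the strongest clue.
    if data[:2] == '\xff\xfe':
        return 'utf-16-le'
    if data[:2] == '\xfe\xff':
        return 'utf-16-be'
    if data[:3] == '\xef\xbb\xbf':
        return 'utf-8'
    # 2) No BOM: see how '<?' comes out in the two UTF-16 byte orders.
    if data[:4] == '\x00<\x00?':
        return 'utf-16-be'
    if data[:4] == '<\x00?\x00':
        return 'utf-16-le'
    # 3) ASCII-compatible start: trust the encoding pseudo-attribute, if any.
    m = re.match(r"<\?xml[^>]*encoding=['\"]([A-Za-z0-9._-]+)", data[:256])
    if m:
        return m.group(1)
    # 4) Nothing recognisable: the XML recommendation says it must be UTF-8.
    return 'utf-8'

Feeding it the first few hundred bytes of the document, e.g. open(name, 'rb').read(256), is enough.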
> > 2) write the first line containing the XML header (which has the > > encoding information) in ASCII and then proceed with UTF-16 > > starting after the newline character > > Ugh, no. Thought so :-) -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From tree@basistech.com Thu May 31 18:52:11 2001 From: tree@basistech.com (Tom Emerson) Date: Thu, 31 May 2001 13:52:11 -0400 Subject: [I18n-sig] XML and UTF-16 In-Reply-To: <3B1681C5.71FD484D@lemburg.com> References: <3B167B9F.344D6992@lemburg.com> <15126.32275.110670.236066@cymru.basistech.com> <3B1681C5.71FD484D@lemburg.com> Message-ID: <15126.33995.327715.84261@cymru.basistech.com> M.-A. Lemburg writes: > Perhaps we should have some smart auto-detection API somewhere > which does this automagically ?! Something like > > guess_xml_encoding(data) -> encoding string > > It could work by looking at the first 256 bytes of the data > string and then apply all the tricks needed to extract the > encoding information (or default to UTF-8 if no such information > is given). Yes, I think this would be a good idea. I would use something along the lines of: 0) Assume UTF-8. 1) Look for the UTF-16 and UTF-32 uniBOMs. If you find one, assume the appropriate transmission format and endian nature. Goto 4. 2) Look for the UTF-8 uniBOM, since some editors like putting that in. Ignore it and goto 4. 3) Look for the sundry forms of '<?xml' with appropriate endian variants. If found, assume the detected encoding. Goto 4. -- Tom Emerson Basis Technology Corp. Sr. Sinostringologist http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever" From paulp@ActiveState.com Thu May 31 22:17:18 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Thu, 31 May 2001 14:17:18 -0700 Subject: [I18n-sig] XML and UTF-16 References: <3B167B9F.344D6992@lemburg.com> <15126.32275.110670.236066@cymru.basistech.com> Message-ID: <3B16B4DE.B0E8ADD4@ActiveState.com> Tom Emerson wrote: > >... > > Yes. You can then pretty easily autodetect the which Unicode > transformation format is being used by looking at the first ten or > so bytes. Actually, the first four bytes are sufficient to get you started. Then you have to look at the encoding declaration if present. > If the BOM is present, that's a big clue right there. """Entities encoded in UTF-16 must begin with the Byte Order Mark described by Annex F of [ISO/IEC 10646], Annex H of [ISO/IEC 10646-2000], section 2.4 of [Unicode], and section 2.7 of [Unicode3] (the ZERO WIDTH NO-BREAK SPACE character, #xFEFF). This is an encoding signature, not part of either the markup or the character data of the XML document. XML processors must be able to use this character to differentiate between UTF-8 and UTF-16 encoded documents.""" -- Take a recipe. Leave a recipe. Python Cookbook! http://www.ActiveState.com/pythoncookbook From paulp@ActiveState.com Thu May 31 22:21:24 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Thu, 31 May 2001 14:21:24 -0700 Subject: [I18n-sig] XML and UTF-16 References: <3B167B9F.344D6992@lemburg.com> <15126.32275.110670.236066@cymru.basistech.com> <3B1681C5.71FD484D@lemburg.com> Message-ID: <3B16B5D4.730D8E30@ActiveState.com> "M.-A. Lemburg" wrote: > >... > > Perhaps we should have some smart auto-detection API somewhere > which does this automagically ?!
Something like > > guess_xml_encoding(data) -> encoding string > > It could work by looking at the first 256 bytes of the data > string and then apply all the tricks needed to extract the > encoding information (or default to UTF-8 if no such information > is given). This might help: http://aspn.activestate.com/ASPN/Python/Cookbook/Recipe/52257 I think Lars has a version too... -- Take a recipe. Leave a recipe. Python Cookbook! http://www.ActiveState.com/pythoncookbook From tree@basistech.com Thu May 31 22:23:00 2001 From: tree@basistech.com (Tom Emerson) Date: Thu, 31 May 2001 17:23:00 -0400 Subject: [I18n-sig] XML and UTF-16 In-Reply-To: <3B16B4DE.B0E8ADD4@ActiveState.com> References: <3B167B9F.344D6992@lemburg.com> <15126.32275.110670.236066@cymru.basistech.com> <3B16B4DE.B0E8ADD4@ActiveState.com> Message-ID: <15126.46644.277960.763113@cymru.basistech.com> Paul Prescod writes: > Tom Emerson wrote: > > Yes. You can then pretty easily autodetect the which Unicode > > transformation format is being used by looking at the first ten or > > so bytes. > > > > Actually, the first four bytes are sufficient to get you started. Then > you have to look at the encoding declaration if present. Even for UTF-32? -tree -- Tom Emerson Basis Technology Corp. Sr. Sinostringologist http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever" From martin@loewis.home.cs.tu-berlin.de Thu May 31 21:28:31 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Thu, 31 May 2001 22:28:31 +0200 Subject: [I18n-sig] XML and UTF-16 In-Reply-To: <15126.32275.110670.236066@cymru.basistech.com> (message from Tom Emerson on Thu, 31 May 2001 13:23:31 -0400) References: <3B167B9F.344D6992@lemburg.com> <15126.32275.110670.236066@cymru.basistech.com> Message-ID: <200105312028.f4VKSVe02837@mira.informatik.hu-berlin.de> > M.-A. Lemburg writes: > > What is the standard file layout to use for storing an XML file > > in UTF-16 ? > > I thought this was covered in the XML specification as a non-normative > appendix. Maybe not. Indeed it is. In addition to the procedure you outline, they also anticipate that a higher-level protocol (such as HTTP) may identify a content type. Regards, Martin From martin@loewis.home.cs.tu-berlin.de Thu May 31 21:46:31 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Thu, 31 May 2001 22:46:31 +0200 Subject: [I18n-sig] XML and UTF-16 In-Reply-To: <15126.33995.327715.84261@cymru.basistech.com> (message from Tom Emerson on Thu, 31 May 2001 13:52:11 -0400) References: <3B167B9F.344D6992@lemburg.com> <15126.32275.110670.236066@cymru.basistech.com> <3B1681C5.71FD484D@lemburg.com> <15126.33995.327715.84261@cymru.basistech.com> Message-ID: <200105312046.f4VKkVY02913@mira.informatik.hu-berlin.de> > Yes, I think this would be a good idea. I would use something along > the lines of: Please have a look at xml.parsers.xmlproc.EntityParser.autodetect_encoding. This almost follows the procedure in the XML recommendation, except that it does not expect "unusual" byte orders (2134, 3412), and that it does not detect EBCDIC. > 0) Assume UTF-8. > > 1) Look for the UTF-16 and UTF-32 uniBOMs. If you find one, assume the > appropriate transmission format and endian nature. Goto 4. > > 2) Look for the UTF-8 uniBOM, since some editors like putting that in. > Ignore it and goto 4. I see this was added to the XML recommendation only in the second edition, so I should also add it to xmlproc. > 3) Look for the sundry forms of '<?xml' with appropriate endian variants.
If found, assume the detected > encoding. Goto 4. Please note that ASCII is not detectable this way: If you see ' References: <3B167B9F.344D6992@lemburg.com> <15126.32275.110670.236066@cymru.basistech.com> <3B1681C5.71FD484D@lemburg.com> <15126.33995.327715.84261@cymru.basistech.com> <200105312046.f4VKkVY02913@mira.informatik.hu-berlin.de> Message-ID: <15126.46901.610405.498190@cymru.basistech.com> Martin v. Loewis writes: > Please note that ASCII is not detectable this way: If you see ' then you don't know anything about the encoding except that you should > be able to parse the encoding= attribute successfully if present. Yes, of course --- I wasn't sufficiently explicit. If you see " <15126.32275.110670.236066@cymru.basistech.com> <3B16B4DE.B0E8ADD4@ActiveState.com> <15126.46644.277960.763113@cymru.basistech.com> Message-ID: <3B16B8E6.D1E083@ActiveState.com> Tom Emerson wrote: > > Paul Prescod writes: > > Tom Emerson wrote: > > > Yes. You can then pretty easily autodetect the which Unicode > > > transformation format is being used by looking at the first ten or > > > so bytes. > > > > Actually, the first four bytes are sufficient to get you started. Then > > you have to look at the encoding declaration if present. > > Even for UTF-32? I think so. UTF-32 is a 32-bit encoding and 32 bits are 4 bytes. You only need one character (either a BOM or a "<") sign to know what you are dealing with. You were right that it is an appendix to the spec: http://www.w3.org/TR/REC-xml.html#sec-guessing -- Take a recipe. Leave a recipe. Python Cookbook! http://www.ActiveState.com/pythoncookbook From tree@basistech.com Thu May 31 22:35:30 2001 From: tree@basistech.com (Tom Emerson) Date: Thu, 31 May 2001 17:35:30 -0400 Subject: [I18n-sig] XML and UTF-16 In-Reply-To: <3B16B8E6.D1E083@ActiveState.com> References: <3B167B9F.344D6992@lemburg.com> <15126.32275.110670.236066@cymru.basistech.com> <3B16B4DE.B0E8ADD4@ActiveState.com> <15126.46644.277960.763113@cymru.basistech.com> <3B16B8E6.D1E083@ActiveState.com> Message-ID: <15126.47394.654300.731399@cymru.basistech.com> Paul Prescod writes: > I think so. UTF-32 is a 32-bit encoding and 32 bits are 4 bytes. You > only need one character (either a BOM or a "<") sign to know what you > are dealing with. Well, you know that the first UTF-32 character is "<", but no more. I'd at least look for " <15126.32275.110670.236066@cymru.basistech.com> <3B16B4DE.B0E8ADD4@ActiveState.com> <15126.46644.277960.763113@cymru.basistech.com> <3B16B8E6.D1E083@ActiveState.com> <15126.47394.654300.731399@cymru.basistech.com> Message-ID: <3B16BB45.1A15560D@ActiveState.com> Tom Emerson wrote: > > Paul Prescod writes: > > I think so. UTF-32 is a 32-bit encoding and 32 bits are 4 bytes. You > > only need one character (either a BOM or a "<") sign to know what you > > are dealing with. > > Well, you know that the first UTF-32 character is "<", but no > more. I'd at least look for " also overly paranoid. You could be looking at " such. Would it matter if you were looking at References: <3B167B9F.344D6992@lemburg.com> <15126.32275.110670.236066@cymru.basistech.com> <3B16B4DE.B0E8ADD4@ActiveState.com> <15126.46644.277960.763113@cymru.basistech.com> <3B16B8E6.D1E083@ActiveState.com> <15126.47394.654300.731399@cymru.basistech.com> <3B16BB45.1A15560D@ActiveState.com> Message-ID: <15126.47990.808992.298339@cymru.basistech.com> Paul Prescod writes: > Would it matter if you were looking at document without an XML declaration would be in error. 
The declaration > is required for everything other than UTF-8 and UTF-16. I guess my point is that it is better to be overly conservative up front and look for at least two complete characters (in whatever encoding) before attempting to process the document. -tree -- Tom Emerson Basis Technology Corp. Sr. Sinostringologist http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever" From martin@loewis.home.cs.tu-berlin.de Thu May 31 23:12:11 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Fri, 1 Jun 2001 00:12:11 +0200 Subject: [I18n-sig] XML and UTF-16 In-Reply-To: <15126.47394.654300.731399@cymru.basistech.com> (message from Tom Emerson on Thu, 31 May 2001 17:35:30 -0400) References: <3B167B9F.344D6992@lemburg.com> <15126.32275.110670.236066@cymru.basistech.com> <3B16B4DE.B0E8ADD4@ActiveState.com> <15126.46644.277960.763113@cymru.basistech.com> <3B16B8E6.D1E083@ActiveState.com> <15126.47394.654300.731399@cymru.basistech.com> Message-ID: <200105312212.f4VMCBl04236@mira.informatik.hu-berlin.de> > Well, you know that the first UTF-32 character is "<", but no > more. According to the procedure specified in the XML recommendation, this is enough for auto-detection, so you clearly don't need to look at more bytes when parsing XML. In any case, what would you do if you find out that the next few bytes cannot be interpreted as ?xml in UTF-32? You would probably signal an error. So would you if the document is not well-formed XML if treated as UTF-32 after looking at the first few bytes. Regards, Martin
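To make the four-byte argument concrete, the following is essentially the table from the recommendation's non-normative detection appendix, minus the unusual byte orders and EBCDIC; the helper name is made up, and the returned strings are just labels (Python had no UTF-32 codec at the time):

# First four bytes of a document that begins with "<?xml ...":
#
#   00 00 00 3C   UCS-4 / UTF-32, big-endian
#   3C 00 00 00   UCS-4 / UTF-32, little-endian
#   00 3C 00 3F   UTF-16, big-endian, no BOM
#   3C 00 3F 00   UTF-16, little-endian, no BOM
#   3C 3F 78 6D   UTF-8, ASCII and other ASCII-compatible encodings

PREFIXES = {
    '\x00\x00\x00<': 'utf-32-be',
    '<\x00\x00\x00': 'utf-32-le',
    '\x00<\x00?':    'utf-16-be',
    '<\x00?\x00':    'utf-16-le',
    '<?xm':          'utf-8',
}

def tentative_encoding(first_four_bytes):
    # Good enough to start reading up to the encoding declaration.
    return PREFIXES.get(first_four_bytes, 'utf-8')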