From kajiyama@grad.sccs.chukyo-u.ac.jp Wed Dec 6 11:05:26 2000 From: kajiyama@grad.sccs.chukyo-u.ac.jp (Tamito KAJIYAMA) Date: Wed, 6 Dec 2000 20:05:26 +0900 Subject: [I18n-sig] naming codecs Message-ID: <200012061105.UAA21517@dhcp198.grad.sccs.chukyo-u.ac.jp> Hi all, I'd like to receive some suggestions about the naming of a codec. I consider releasing a version of the JapaneseCodecs package that will include a new codec for a variant of ISO-2022-JP. The codec is almost the same as the ISO-2022-JP codec, but it can encode and decode Halfwidth Katakana (U+FF61 to U+FF9F) which can not be encoded with ISO-2022-JP as defined in RFC1468. I believe there is a demand for the codec, but I have no idea on the name of the codec. I'd like to give it a name that is different from all standard encoding names, since the encoding for which the codec works is not defined as a standard (e.g. RFCs). I'd also like to avoid an encoding name that is likely to be used as a standard encoding name in the future. Does anyone have a good name for the codec? Or, how may I think about the naming of a codec? Any suggestions are welcome. Thanks, -- KAJIYAMA, Tamito From mal@lemburg.com Wed Dec 6 11:26:52 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Wed, 06 Dec 2000 12:26:52 +0100 Subject: [I18n-sig] naming codecs References: <200012061105.UAA21517@dhcp198.grad.sccs.chukyo-u.ac.jp> Message-ID: <3A2E227C.96FD2304@lemburg.com> Tamito KAJIYAMA wrote: > > Hi all, > > I'd like to receive some suggestions about the naming of a codec. > > I consider releasing a version of the JapaneseCodecs package > that will include a new codec for a variant of ISO-2022-JP. The > codec is almost the same as the ISO-2022-JP codec, but it can > encode and decode Halfwidth Katakana (U+FF61 to U+FF9F) which > can not be encoded with ISO-2022-JP as defined in RFC1468. > > I believe there is a demand for the codec, but I have no idea > on the name of the codec. I'd like to give it a name that is > different from all standard encoding names, since the encoding > for which the codec works is not defined as a standard > (e.g. RFCs). I'd also like to avoid an encoding name that is > likely to be used as a standard encoding name in the future. > > Does anyone have a good name for the codec? Or, how may I think > about the naming of a codec? Any suggestions are welcome. Why not simply append another "-" part to the name, e.g. "iso-2022-jp-hw" or "iso-2022-jp-extended" ? -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From kajiyama@grad.sccs.chukyo-u.ac.jp Wed Dec 6 11:42:44 2000 From: kajiyama@grad.sccs.chukyo-u.ac.jp (Tamito KAJIYAMA) Date: Wed, 6 Dec 2000 20:42:44 +0900 Subject: [I18n-sig] naming codecs In-Reply-To: <3A2E227C.96FD2304@lemburg.com> (mal@lemburg.com) References: <3A2E227C.96FD2304@lemburg.com> Message-ID: <200012061142.UAA21632@dhcp198.grad.sccs.chukyo-u.ac.jp> Thank you for the quick reply. M.-A. Lemburg wrote: | | > I consider releasing a version of the JapaneseCodecs package | > that will include a new codec for a variant of ISO-2022-JP. The | > codec is almost the same as the ISO-2022-JP codec, but it can | > encode and decode Halfwidth Katakana (U+FF61 to U+FF9F) which | > can not be encoded with ISO-2022-JP as defined in RFC1468. | > | > I believe there is a demand for the codec, but I have no idea | > on the name of the codec. 
I'd like to give it a name that is | > different from all standard encoding names, since the encoding | > for which the codec works is not defined as a standard | > (e.g. RFCs). I'd also like to avoid an encoding name that is | > likely to be used as a standard encoding name in the future. | > | > Does anyone have a good name for the codec? Or, how may I think | > about the naming of a codec? Any suggestions are welcome. | | Why not simply append another "-" part to the name, | e.g. "iso-2022-jp-hw" or "iso-2022-jp-extended" ? I like "iso-2022-jp-extended", but I wonder if this naming convention may be used. There are the standard encoding names ISO-2022-JP-1 and ISO-2022-JP-2 in addition to ISO-2022-JP, and also there are ISO-2022-CN and ISO-2022-CN-EXT. So, a simple "-variant" part is likely to conflict with a standard encoding name in the future. However, an abbreviated and/or tricky "-variant" part such as "-hw" is not user-friendly. I also wonder if I am thinking too much... -- KAJIYAMA, Tamito From mal@lemburg.com Wed Dec 6 12:16:19 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Wed, 06 Dec 2000 13:16:19 +0100 Subject: [I18n-sig] naming codecs References: <3A2E227C.96FD2304@lemburg.com> <200012061142.UAA21632@dhcp198.grad.sccs.chukyo-u.ac.jp> Message-ID: <3A2E2E13.41EE8BA4@lemburg.com> Tamito KAJIYAMA wrote: > > Thank you for the quick reply. > > M.-A. Lemburg wrote: > | > | > I consider releasing a version of the JapaneseCodecs package > | > that will include a new codec for a variant of ISO-2022-JP. The > | > codec is almost the same as the ISO-2022-JP codec, but it can > | > encode and decode Halfwidth Katakana (U+FF61 to U+FF9F) which > | > can not be encoded with ISO-2022-JP as defined in RFC1468. > | > > | > I believe there is a demand for the codec, but I have no idea > | > on the name of the codec. I'd like to give it a name that is > | > different from all standard encoding names, since the encoding > | > for which the codec works is not defined as a standard > | > (e.g. RFCs). I'd also like to avoid an encoding name that is > | > likely to be used as a standard encoding name in the future. > | > > | > Does anyone have a good name for the codec? Or, how may I think > | > about the naming of a codec? Any suggestions are welcome. > | > | Why not simply append another "-" part to the name, > | e.g. "iso-2022-jp-hw" or "iso-2022-jp-extended" ? > > I like "iso-2022-jp-extended", but I wonder if this naming > convention may be used. There are the standard encoding names > ISO-2022-JP-1 and ISO-2022-JP-2 in addition to ISO-2022-JP, and > also there are ISO-2022-CN and ISO-2022-CN-EXT. So, a simple > "-variant" part is likely to conflict with a standard encoding > name in the future. However, an abbreviated and/or tricky > "-variant" part such as "-hw" is not user-friendly. Hmm, I don't think there's anything user friendly about 'iso-2022-jp' either... User friendly would be 'japanese' and then have the codec registry figure out what the user means with this by applying some voodoo magic ;-) Seriously, I think the codec name should include at least a hint as to what it does -- so perhaps '-halfwidth-katakana' would be more appropriate. You can always provide shorter aliases either by means of providing more than one codec .py file for the same codec or by registering a codec search function which implements the aliasing. BTW, what happened to the idea of using package names for optional codecs ? E.g. 
you can install the codecs in a package 'mycodecs' and then reference them using 'mycodecs.iso-1234' without having to register a codec search function at startup. This would not only clarify the origin of the codec, but also allow using different codec implementations for the same encoding (e.g. one in Python and another in C). -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From martin@loewis.home.cs.tu-berlin.de Wed Dec 6 19:13:22 2000 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Wed, 6 Dec 2000 20:13:22 +0100 Subject: [I18n-sig] naming codecs In-Reply-To: <200012061105.UAA21517@dhcp198.grad.sccs.chukyo-u.ac.jp> (message from Tamito KAJIYAMA on Wed, 6 Dec 2000 20:05:26 +0900) References: <200012061105.UAA21517@dhcp198.grad.sccs.chukyo-u.ac.jp> Message-ID: <200012061913.UAA00746@loewis.home.cs.tu-berlin.de> > I consider releasing a version of the JapaneseCodecs package > that will include a new codec for a variant of ISO-2022-JP. The > codec is almost the same as the ISO-2022-JP codec, but it can > encode and decode Halfwidth Katakana (U+FF61 to U+FF9F) which > can not be encoded with ISO-2022-JP as defined in RFC1468. So how exactly does it encode them? Is that your own invention, or is there some precedent for that encoding (e.g. in an operating system, or text processing system)? > I believe there is a demand for the codec, but I have no idea on the > name of the codec. If it is your own invention, I'd be surprised if there was demand. If you just follow some existing practice, then I'd assume that this practice has a name - which should be used in the name of the codec. Regards, Martin From kajiyama@grad.sccs.chukyo-u.ac.jp Thu Dec 7 06:17:06 2000 From: kajiyama@grad.sccs.chukyo-u.ac.jp (Tamito KAJIYAMA) Date: Thu, 7 Dec 2000 15:17:06 +0900 Subject: [I18n-sig] naming codecs In-Reply-To: <200012061913.UAA00746@loewis.home.cs.tu-berlin.de> (martin@loewis.home.cs.tu-berlin.de) References: <3A2F1B701E3.FEEANODA@172.16.112.1> Message-ID: <200012070617.PAA22443@dhcp198.grad.sccs.chukyo-u.ac.jp> Martin v. Loewis wrote: | | > I consider releasing a version of the JapaneseCodecs package | > that will include a new codec for a variant of ISO-2022-JP. The | > codec is almost the same as the ISO-2022-JP codec, but it can | > encode and decode Halfwidth Katakana (U+FF61 to U+FF9F) which | > can not be encoded with ISO-2022-JP as defined in RFC1468. | | So how exactly does it encode them? | | Is that your own invention, or is there some precedent for that | encoding (e.g. in an operating system, or text processing system)? Halfwidth Katakana in Unicode corresponds to the character set JIS X 0201 Katakana, and this character set can be designated by the escape sequence "\033(I" in the framework of ISO 2022. For example, GNU Emacs and LV (http://www.ff.iij4u.or.jp/~nrt/lv/) can handle this encoding. This is not my invention. | > I believe there is a demand for the codec, but I have no idea on the | > name of the codec. | | If it is your own invention, I'd be surprised if there was demand. If | you just follow some existing practice, then I'd assume that this | practice has a name - which should be used in the name of the codec. GNU Emacs gives the name "iso-2022-7bit" to the encoding that can encode JIS X 0201 Katakana. 
However, this name is too big to use for the ISO-2022-JP variant, since iso-2022-7bit can encode all character sets that GNU Emacs supports. Regards, -- KAJIYAMA, Tamito From andy@reportlab.com Thu Dec 7 08:53:33 2000 From: andy@reportlab.com (Andy Robinson) Date: Thu, 7 Dec 2000 08:53:33 -0000 Subject: [I18n-sig] naming codecs In-Reply-To: <200012070617.PAA22443@dhcp198.grad.sccs.chukyo-u.ac.jp> Message-ID: > GNU Emacs gives the name "iso-2022-7bit" to the encoding that > can encode JIS X 0201 Katakana. However, this name is too big > to use for the ISO-2022-JP variant, since iso-2022-7bit can > encode all character sets that GNU Emacs supports. > How about "iso-2022-hwkk"? I think somebody who knows Japanese information processing will probably guess this, and they will certainly not forget it. - Andy Robinson From frank63@ms5.hinet.net Fri Dec 8 04:17:56 2000 From: frank63@ms5.hinet.net (Frank J.S. Chen) Date: Fri, 8 Dec 2000 04:17:56 -0000 Subject: [I18n-sig] Re:naming codecs Message-ID: <200012072016.EAA08931@ms5.hinet.net> > > GNU Emacs gives the name "iso-2022-7bit" to the encoding that > > can encode JIS X 0201 Katakana. However, this name is too big > > to use for the ISO-2022-JP variant, since iso-2022-7bit can > > encode all character sets that GNU Emacs supports. > > > How about "iso-2022-hwkk"? I think somebody who knows Japanese > information processing will probably guess this, and they will > certainly not forget it. > Agree. Since Halfwidth Katakana is a special encoding scheme for Japan in Japanese computers system history, it is absolutely required that someone knows and uses associated legacy operating system that supports Halfwidth Katakana. But I think it still has a "jp" in that name, like ISO-2022-JP-HWKK, so that someone like me won't consider it as a mysterous mapping table at a first glance. ---------------------------------------------------------------------------- ------- Chen Chien-Hsun Taipei,Taiwan,R.O.C. From kajiyama@grad.sccs.chukyo-u.ac.jp Mon Dec 11 21:43:21 2000 From: kajiyama@grad.sccs.chukyo-u.ac.jp (Tamito KAJIYAMA) Date: Tue, 12 Dec 2000 06:43:21 +0900 Subject: [I18n-sig] codec aliases Message-ID: <200012112143.GAA07454@dhcp198.grad.sccs.chukyo-u.ac.jp> Hi all, I'm implementing a modularized version of my Japanese codecs following Marc-Andre's proposal, and am having two problems regarding to codec aliases. >>> import codecs >>> funcs = codecs.lookup("japanese.jis-7") Traceback (most recent call last): File "", line 1, in ? LookupError: unknown encoding >>> funcs = codecs.lookup("japanese.jis_7") Traceback (most recent call last): File "", line 1, in ? LookupError: unknown encoding >>> funcs = codecs.lookup("japanese.iso-2022-jp") >>> funcs = codecs.lookup("japanese.iso_2022_jp") >>> funcs = codecs.lookup("japanese.jis-7") Traceback (most recent call last): File "", line 1, in ? LookupError: unknown encoding >>> funcs = codecs.lookup("japanese.jis_7") >>> One problem is that the alias "japanese.jis-7" does not work unless the corresponding original name "japanese.iso-2022-jp" have been referred once. This is because the alias is defined by means of getaliases() in japanese/iso_2022_jp.py, and this module is not imported when the first time the original name is referred. Is there a work-around for this problem? The other problem is that hyphens and underscores are significant in an alias, although they are not in an original name. A work-around is to define all combinations of hyphens and underscores for an alias (e.g. 
defining both "japanese.jis-7" and "japanese.jis_7"), but this seems not a good idea for me. Regards, -- KAJIYAMA, Tamito From mal@lemburg.com Mon Dec 11 22:55:59 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Mon, 11 Dec 2000 23:55:59 +0100 Subject: [I18n-sig] codec aliases References: <200012112143.GAA07454@dhcp198.grad.sccs.chukyo-u.ac.jp> Message-ID: <3A355B7F.4B25A64C@lemburg.com> Tamito KAJIYAMA wrote: > > Hi all, > > I'm implementing a modularized version of my Japanese codecs > following Marc-Andre's proposal, and am having two problems > regarding to codec aliases. > > >>> import codecs > >>> funcs = codecs.lookup("japanese.jis-7") > Traceback (most recent call last): > File "", line 1, in ? > LookupError: unknown encoding > >>> funcs = codecs.lookup("japanese.jis_7") > Traceback (most recent call last): > File "", line 1, in ? > LookupError: unknown encoding > >>> funcs = codecs.lookup("japanese.iso-2022-jp") > >>> funcs = codecs.lookup("japanese.iso_2022_jp") > >>> funcs = codecs.lookup("japanese.jis-7") > Traceback (most recent call last): > File "", line 1, in ? > LookupError: unknown encoding > >>> funcs = codecs.lookup("japanese.jis_7") > >>> > > One problem is that the alias "japanese.jis-7" does not work > unless the corresponding original name "japanese.iso-2022-jp" > have been referred once. This is because the alias is defined > by means of getaliases() in japanese/iso_2022_jp.py, and this > module is not imported when the first time the original name is > referred. Is there a work-around for this problem? The only "work-around" I know of (which doesn't involve some kind of boot code) is by defining aliases via almost empty module which redirect the search function to the correct codec, e.g. codec_alias.py: --------------- from codec_alias_target import * > The other problem is that hyphens and underscores are > significant in an alias, although they are not in an original > name. A work-around is to define all combinations of hyphens > and underscores for an alias (e.g. defining both > "japanese.jis-7" and "japanese.jis_7"), but this seems not a > good idea for me. Codec aliases returned by codec.getaliases() must always use the underscore naming scheme. The standard search function will convert hyphens to underscores *before* applying the alias mapping, so there's no need to worry about different combinations of hyphens and underscores in the alias names (unless I've overlooked something here). -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From kajiyama@grad.sccs.chukyo-u.ac.jp Tue Dec 12 02:01:15 2000 From: kajiyama@grad.sccs.chukyo-u.ac.jp (Tamito KAJIYAMA) Date: Tue, 12 Dec 2000 11:01:15 +0900 Subject: [I18n-sig] codec aliases In-Reply-To: <3A355B7F.4B25A64C@lemburg.com> (mal@lemburg.com) References: <200012112346.IAA07797@dhcp198.grad.sccs.chukyo-u.ac.jp> Message-ID: <200012120201.LAA08133@dhcp198.grad.sccs.chukyo-u.ac.jp> M.-A. Lemburg wrote: | | > One problem is that the alias "japanese.jis-7" does not work | > unless the corresponding original name "japanese.iso-2022-jp" | > have been referred once. This is because the alias is defined | > by means of getaliases() in japanese/iso_2022_jp.py, and this | > module is not imported when the first time the original name is | > referred. Is there a work-around for this problem? 
| | The only "work-around" I know of (which doesn't involve some | kind of boot code) is by defining aliases via almost empty | module which redirect the search function to the correct | codec, e.g. | | codec_alias.py: | --------------- | from codec_alias_target import * I'm not sure how your work-around works. How is codec_alias.py used? Is that intended to be imported in site.py? I also think that aliases cannot be defined only by importing a codec module, since the aliases are defined by means of getaliases(), and this function is not invoked until the original name corresponding to the aliases is looked up first. I wonder if I need to put a call of codecs.register() somewhere in the modularized codecs... | > The other problem is that hyphens and underscores are | > significant in an alias, although they are not in an original | > name. A work-around is to define all combinations of hyphens | > and underscores for an alias (e.g. defining both | > "japanese.jis-7" and "japanese.jis_7"), but this seems not a | > good idea for me. | | Codec aliases returned by codec.getaliases() must always use | the underscore naming scheme. | | The standard search function will convert hyphens to underscores | *before* applying the alias mapping, so there's no need to worry | about different combinations of hyphens and underscores in | the alias names (unless I've overlooked something here). Returning names with underscores in getaliases() seems not sufficient. In encodings/__init__.py: def search_function(encoding): ... # Cache the encoding and its aliases _cache[encoding] = entry try: codecaliases = mod.getaliases() except AttributeError: pass else: for alias in codecaliases: _cache[alias] = entry return entry The names returned by mod.getaliases() are put into _cache as it is, so equivalent names with hyphens will not be defined. Regards, -- KAJIYAMA, Tamito From mal@lemburg.com Tue Dec 12 09:59:24 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Tue, 12 Dec 2000 10:59:24 +0100 Subject: [I18n-sig] codec aliases References: <200012112346.IAA07797@dhcp198.grad.sccs.chukyo-u.ac.jp> <200012120201.LAA08133@dhcp198.grad.sccs.chukyo-u.ac.jp> Message-ID: <3A35F6FC.E2C0AA52@lemburg.com> Tamito KAJIYAMA wrote: > > M.-A. Lemburg wrote: > | > | > One problem is that the alias "japanese.jis-7" does not work > | > unless the corresponding original name "japanese.iso-2022-jp" > | > have been referred once. This is because the alias is defined > | > by means of getaliases() in japanese/iso_2022_jp.py, and this > | > module is not imported when the first time the original name is > | > referred. Is there a work-around for this problem? > | > | The only "work-around" I know of (which doesn't involve some > | kind of boot code) is by defining aliases via almost empty > | module which redirect the search function to the correct > | codec, e.g. > | > | codec_alias.py: > | --------------- > | from codec_alias_target import * > > I'm not sure how your work-around works. How is codec_alias.py > used? Is that intended to be imported in site.py? > > I also think that aliases cannot be defined only by importing > a codec module, since the aliases are defined by means of > getaliases(), and this function is not invoked until the > original name corresponding to the aliases is looked up first. > > I wonder if I need to put a call of codecs.register() somewhere > in the modularized codecs... 
The above scenario should enable you to write one codec, say "main_codec.py" which provides the Real Thing and then allow you to add aliases to this codec by adding any number of additionl redirection codec modules, e.g. "codec_alias_1.py", "codec_alias_2.py" which all contain just one line: from main_codec import * Now, when the search function is queried for e.g. "codec-alias-1" it will import codec_alias_1.py and then apply the usual processing (even register the additional aliases). However, the functionality is provided by main_codec.py. There's no need to call any registration function prior to using one of the codec aliases in this setup. The import mechanism will play the part of the aliasing engine in this case. > | > The other problem is that hyphens and underscores are > | > significant in an alias, although they are not in an original > | > name. A work-around is to define all combinations of hyphens > | > and underscores for an alias (e.g. defining both > | > "japanese.jis-7" and "japanese.jis_7"), but this seems not a > | > good idea for me. > | > | Codec aliases returned by codec.getaliases() must always use > | the underscore naming scheme. > | > | The standard search function will convert hyphens to underscores > | *before* applying the alias mapping, so there's no need to worry > | about different combinations of hyphens and underscores in > | the alias names (unless I've overlooked something here). > > Returning names with underscores in getaliases() seems not > sufficient. In encodings/__init__.py: > > def search_function(encoding): > ... > # Cache the encoding and its aliases > _cache[encoding] = entry > try: > codecaliases = mod.getaliases() > except AttributeError: > pass > else: > for alias in codecaliases: > _cache[alias] = entry > return entry > > The names returned by mod.getaliases() are put into _cache as it > is, so equivalent names with hyphens will not be defined. So I have indeed overlooked something. Thanks for pointing me at it (I don't currently have time to test what I write here, so please bare with me). The aliases should really be added to the aliases.aliases dictionary instead of _cache and also prevent overwrites of already existing aliases (since these would cause strange and unwanted effects). I'll think about this some more and check in a patch to implement the above scheme. Thanks again, -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ BTW: Your email occasionally bounces -- e.g. the last message I sent you got back to me (fortunately, you still seem to get the i18n-sig message). From mal@lemburg.com Tue Dec 12 14:46:27 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Tue, 12 Dec 2000 15:46:27 +0100 Subject: [I18n-sig] codec aliases References: <200012112346.IAA07797@dhcp198.grad.sccs.chukyo-u.ac.jp> <200012120201.LAA08133@dhcp198.grad.sccs.chukyo-u.ac.jp> <3A35F6FC.E2C0AA52@lemburg.com> Message-ID: <3A363A43.63E8528E@lemburg.com> "M.-A. Lemburg" wrote: > > > | > The other problem is that hyphens and underscores are > > | > significant in an alias, although they are not in an original > > | > name. A work-around is to define all combinations of hyphens > > | > and underscores for an alias (e.g. defining both > > | > "japanese.jis-7" and "japanese.jis_7"), but this seems not a > > | > good idea for me. 
> > | > > | Codec aliases returned by codec.getaliases() must always use > > | the underscore naming scheme. > > | > > | The standard search function will convert hyphens to underscores > > | *before* applying the alias mapping, so there's no need to worry > > | about different combinations of hyphens and underscores in > > | the alias names (unless I've overlooked something here). > > > > Returning names with underscores in getaliases() seems not > > sufficient. In encodings/__init__.py: > > > > def search_function(encoding): > > ... > > # Cache the encoding and its aliases > > _cache[encoding] = entry > > try: > > codecaliases = mod.getaliases() > > except AttributeError: > > pass > > else: > > for alias in codecaliases: > > _cache[alias] = entry > > return entry > > > > The names returned by mod.getaliases() are put into _cache as it > > is, so equivalent names with hyphens will not be defined. > > So I have indeed overlooked something. Thanks for pointing me > at it (I don't currently have time to test what I write here, > so please bare with me). The aliases should really be added to > the aliases.aliases dictionary instead of _cache and also prevent > overwrites of already existing aliases (since these would cause > strange and unwanted effects). > > I'll think about this some more and check in a patch to implement > the above scheme. I've checked in a patch which should provide the needed functionality. -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From kajiyama@grad.sccs.chukyo-u.ac.jp Tue Dec 12 21:29:36 2000 From: kajiyama@grad.sccs.chukyo-u.ac.jp (Tamito KAJIYAMA) Date: Wed, 13 Dec 2000 06:29:36 +0900 Subject: [I18n-sig] codec aliases In-Reply-To: <3A35F6FC.E2C0AA52@lemburg.com> (mal@lemburg.com) References: <200012130020000000030090000@reserve.mag2.com> Message-ID: <200012122129.GAA10208@dhcp198.grad.sccs.chukyo-u.ac.jp> M.-A. Lemburg wrote: | | > I'm not sure how your work-around works. How is codec_alias.py | > used? Is that intended to be imported in site.py? (snip) | The above scenario should enable you to write one codec, | say "main_codec.py" which provides the Real Thing and then | allow you to add aliases to this codec by adding any number | of additionl redirection codec modules, e.g. "codec_alias_1.py", | "codec_alias_2.py" which all contain just one line: | | from main_codec import * | | Now, when the search function is queried for e.g. | "codec-alias-1" it will import codec_alias_1.py and then | apply the usual processing (even register the additional | aliases). However, the functionality is provided by | main_codec.py. I see. Thank you for the elaboration. I've followed exactly your scenario, and it works! I've also omitted getaliases() from each main codec, since that function is no longer used. This weekend I'll release a new version of the JapaneseCodecs package that includes the modularized codecs and more. | BTW: Your email occasionally bounces -- e.g. the last message | I sent you got back to me (fortunately, you still seem to get the | i18n-sig message). Something seems go wrong... If you don't mind, please remove my email address (from To: or Cc:) when replying and send messages only to the list. 
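(For reference, a minimal sketch of the layout that the scenario described above produces; the file and codec names are only illustrative and need not match the actual contents of the JapaneseCodecs package:

    japanese/
        __init__.py        # marks the codec package
        iso_2022_jp.py     # the real codec module (Codec classes, getregentry(), ...)
        jis_7.py           # alias module containing just the line:
                           #     from iso_2022_jp import *

A lookup such as codecs.lookup("japanese.jis-7") has its hyphens converted to underscores by the standard search function, the module japanese.jis_7 is imported, and since that module re-exports everything from iso_2022_jp.py the alias resolves to the real codec without any explicit registration call.)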
Thanks, -- KAJIYAMA, Tamito From uche.ogbuji@fourthought.com Wed Dec 13 22:59:31 2000 From: uche.ogbuji@fourthought.com (Uche Ogbuji) Date: Wed, 13 Dec 2000 15:59:31 -0700 Subject: [I18n-sig] Mixed encodings and XML Message-ID: <3A37FF53.206662F3@fourthought.com> [crossposted: 4Suite, xml-sig, i18n-sig] Time for me to expose my ignorance on XML and i18n again. How would one go about creating a well-formed XML document with multiple encodings? For instance, if I had UCS-2, UTF-8 and BIG5 all in one doc, how could I make it work. Take the following example ftp://ftp.fourthought.com/pub/etc/HOWTO/cjkv.doc This document is a CJKV HOWTO by Chen Chien-Hsun. He originally wrote it in HTML. See ftp://ftp.fourthought.com/pub/etc/HOWTO/CJKV_4XSLT.HTM It contains many sections within HTML PREs with the different encodings I mentioned. They look like
<PRE LANG="zh-TW">
... BIG5-encoded stuff ...
</PRE>
I need to convert the document to XML Docbook format. My naive attempts at converting to ... BIG5-encoded stuff ... Of course don't work because the parser takes one look at the BIG5 and throws a well-formedness error. Is there any way to manage this besides using XInclude? Do any of the Python parsers have any tricks that could help? Thanks. -- Uche Ogbuji Principal Consultant uche.ogbuji@fourthought.com +1 303 583 9900 x 101 Fourthought, Inc. http://Fourthought.com 4735 East Walnut St, Ste. C, Boulder, CO 80301-2537, USA Software-engineering, knowledge-management, XML, CORBA, Linux, Python From tree@basistech.com Wed Dec 13 23:09:47 2000 From: tree@basistech.com (Tom Emerson) Date: Wed, 13 Dec 2000 18:09:47 -0500 Subject: [I18n-sig] Mixed encodings and XML In-Reply-To: <3A37FF53.206662F3@fourthought.com> References: <3A37FF53.206662F3@fourthought.com> Message-ID: <14904.443.228020.168633@cymru.basistech.com> Uche Ogbuji writes: > It contains many sections within HTML PREs with the different encodings > I mentioned. They look like > >
> <PRE LANG="zh-TW">
> ... BIG5-encoded stuff ...
> </PRE>
The LANG attribute does not specify an encoding, it specifies a language. You cannot safely imply anything about the encoding based on the value of the LANG attribute. For example, "zh-TW" text could be encoded in Big 5, Big 5+, GBK, CP950, CP936, EUC-CN (depending on the text), ISO-2022-CN, ISO-2022-CN-EXT, and others. The LANG attribute can be used by the application to help generate the appropriate glyph variants, however, though I don't know of any off hand that do this. > I need to convert the document to XML Docbook format. My naive attempts > at converting to > > > ... BIG5-encoded stuff ... > > > Of course don't work because the parser takes one look at the BIG5 and > throws a well-formedness error. Which it is required to do, see Section 4.3.3 of the XML specification. > Is there any way to manage this besides using XInclude? Do any of the > Python parsers have any tricks that could help? Convert all of those sections into Unicode, using UTF-8 as the encoding form. You could write a trivial Python script to do this for you. The bigger problem (IMHO) will be convincing your DocBook tool chain to handle the Asian characters. If you find a good solution to that (i.e., allowing Simplified and Traditional Chinese, Korean, and (say) Thai in a single document) let me know. -tree -- Tom Emerson Basis Technology Corp. Zenkaku Language Hacker http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever" From uche.ogbuji@fourthought.com Thu Dec 14 00:14:40 2000 From: uche.ogbuji@fourthought.com (uche.ogbuji@fourthought.com) Date: Wed, 13 Dec 2000 17:14:40 -0700 Subject: [I18n-sig] Mixed encodings and XML In-Reply-To: Message from Tom Emerson of "Wed, 13 Dec 2000 18:09:47 EST." <14904.443.228020.168633@cymru.basistech.com> Message-ID: <200012140014.RAA15620@localhost.localdomain> > Uche Ogbuji writes: > > It contains many sections within HTML PREs with the different encodings > > I mentioned. They look like > > > >
> > <PRE LANG="zh-TW">
> > ... BIG5-encoded stuff ...
> > </PRE>
> > The LANG attribute does not specify an encoding, it specifies a > language. You cannot safely imply anything about the encoding based on > the value of the LANG attribute. For example, "zh-TW" text could be > encoded in Big 5, Big 5+, GBK, CP950, CP936, EUC-CN (depending on the > text), ISO-2022-CN, ISO-2022-CN-EXT, and others. > > The LANG attribute can be used by the application to help generate the > appropriate glyph variants, however, though I don't know of any off > hand that do this. Makes sense, but I wasn't clear on this. > > I need to convert the document to XML Docbook format. My naive attempts > > at converting to > > > > > > ... BIG5-encoded stuff ... > > > > > > Of course don't work because the parser takes one look at the BIG5 and > > throws a well-formedness error. > > Which it is required to do, see Section 4.3.3 of the XML specification. I'm quite aware of this (I read the XML spec more often that I'd like to). That's why I said "of course". > > Is there any way to manage this besides using XInclude? Do any of the > > Python parsers have any tricks that could help? > > Convert all of those sections into Unicode, using UTF-8 as the > encoding form. You could write a trivial Python script to do this for > you. Not what I need, unfortunately. The whole point of the exercise is to have examples in the actual encodings. > The bigger problem (IMHO) will be convincing your DocBook tool chain > to handle the Asian characters. If you find a good solution to that > (i.e., allowing Simplified and Traditional Chinese, Korean, and (say) > Thai in a single document) let me know. Hmm? My docbook tool is simply 4XSLT, which handles the individual encodings just fine now. -- Uche Ogbuji Principal Consultant uche.ogbuji@fourthought.com +1 303 583 9900 x 101 Fourthought, Inc. http://Fourthought.com 4735 East Walnut St, Ste. C, Boulder, CO 80301-2537, USA Software-engineering, knowledge-management, XML, CORBA, Linux, Python From tree@basistech.com Thu Dec 14 01:05:43 2000 From: tree@basistech.com (Tom Emerson) Date: Wed, 13 Dec 2000 20:05:43 -0500 Subject: [I18n-sig] Mixed encodings and XML In-Reply-To: <200012140014.RAA15620@localhost.localdomain> References: <14904.443.228020.168633@cymru.basistech.com> <200012140014.RAA15620@localhost.localdomain> Message-ID: <14904.7399.328781.898962@cymru.basistech.com> uche.ogbuji@fourthought.com writes: > > Convert all of those sections into Unicode, using UTF-8 as the > > encoding form. You could write a trivial Python script to do this for > > you. > > Not what I need, unfortunately. The whole point of the exercise is > to have examples in the actual encodings. And the point of that is what? They will display (most probably) as jibberish within the browser... or is that the point? > Hmm? My docbook tool is simply 4XSLT, which handles the individual encodings > just fine now. Sure, but if you want to generate a LaTeX (and from there PDF or PS) version you're screwed, AFAIK. If you are just generating HTML then you're OK. -tree -- Tom Emerson Basis Technology Corp. Zenkaku Language Hacker http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever" From uche.ogbuji@fourthought.com Thu Dec 14 01:17:51 2000 From: uche.ogbuji@fourthought.com (uche.ogbuji@fourthought.com) Date: Wed, 13 Dec 2000 18:17:51 -0700 Subject: [I18n-sig] Mixed encodings and XML In-Reply-To: Message from Tom Emerson of "Wed, 13 Dec 2000 20:05:43 EST." 
<14904.7399.328781.898962@cymru.basistech.com> Message-ID: <200012140117.SAA15823@localhost.localdomain> > uche.ogbuji@fourthought.com writes: > > > Convert all of those sections into Unicode, using UTF-8 as the > > > encoding form. You could write a trivial Python script to do this for > > > you. > > > > Not what I need, unfortunately. The whole point of the exercise is > > to have examples in the actual encodings. > > And the point of that is what? They will display (most probably) as > jibberish within the browser... or is that the point? Good question. I have not tried Chen Chien-Hsun's original HTML. Perhaps even that won't work in a browser. Makes sense. What does a browser do with a document with ^^^^^^^^^^ !!!!???!!!! In the header and then runs into a big patch of UCS-2 or BIG5? My guess is that it displays gibberish as you suggest. In this case, I think there's no point expecting HTML generated from XML to do any better and it simply makes sense to break out the alternatively encoded portions into separate, linked files. Chen, does this make sense? > > Hmm? My docbook tool is simply 4XSLT, which handles the individual encodings > > just fine now. > > Sure, but if you want to generate a LaTeX (and from there PDF or PS) > version you're screwed, AFAIK. If you are just generating HTML then > you're OK. Yeah. That's all for now. Thanks much. -- Uche Ogbuji Principal Consultant uche.ogbuji@fourthought.com +1 303 583 9900 x 101 Fourthought, Inc. http://Fourthought.com 4735 East Walnut St, Ste. C, Boulder, CO 80301-2537, USA Software-engineering, knowledge-management, XML, CORBA, Linux, Python From tree@basistech.com Thu Dec 14 01:22:19 2000 From: tree@basistech.com (Tom Emerson) Date: Wed, 13 Dec 2000 20:22:19 -0500 Subject: [I18n-sig] Mixed encodings and XML In-Reply-To: <200012140117.SAA15823@localhost.localdomain> References: <14904.7399.328781.898962@cymru.basistech.com> <200012140117.SAA15823@localhost.localdomain> Message-ID: <14904.8395.286379.623954@cymru.basistech.com> uche.ogbuji@fourthought.com writes: > Good question. I have not tried Chen Chien-Hsun's original HTML. > Perhaps even that won't work in a browser. Makes sense. What does > a browser do with a document with > > > ^^^^^^^^^^ > !!!!???!!!! > > In the header and then runs into a big patch of UCS-2 or BIG5? It treats those bytes as 8-bit Latin 1 characters and it displays them. Once you've seen enough of these you start recognizing the patterns, but it is still junk. > My guess is that it displays gibberish as you suggest. In this case, I think > there's no point expecting HTML generated from XML to do any better and it > simply makes sense to break out the alternatively encoded portions into > separate, linked files. No. What makes sense, if the intention of the original author is to show the Chinese text correctly, is to convert that section to UTF-8 and put that in the document. -tree -- Tom Emerson Basis Technology Corp. Zenkaku Language Hacker http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever" From uche.ogbuji@fourthought.com Thu Dec 14 02:45:46 2000 From: uche.ogbuji@fourthought.com (uche.ogbuji@fourthought.com) Date: Wed, 13 Dec 2000 19:45:46 -0700 Subject: [I18n-sig] Mixed encodings and XML In-Reply-To: Message from Tom Emerson of "Wed, 13 Dec 2000 20:22:19 EST." <14904.8395.286379.623954@cymru.basistech.com> Message-ID: <200012140245.TAA16426@localhost.localdomain> > > My guess is that it displays gibberish as you suggest. 
In this case, I think > > there's no point expecting HTML generated from XML to do any better and it > > simply makes sense to break out the alternatively encoded portions into > > separate, linked files. > > No. What makes sense, if the intention of the original author is to > show the Chinese text correctly, is to convert that section to UTF-8 > and put that in the document. Eccovi! Now I understand why we've been talking past each other. I assumed you'd read the text in question: bad assumption, I admit. No. The intention is not to display Chinese characters correctly. The intention, I'm pretty sure, is to provide examples than can be cut and pasted in order for people to play with the various snippets themselves. As such, I'm not really concerned about what the HTML rendering looks like when it hits the different encodings. What I was originally writing about was: 1. Is there any way to convince an XML parser to work with source with mixed encoding. The exchange with you has helped disabuse me of any silly notion that this might be so. So I shall have to use XInclude. 2. Will the results of the rendering be such that the LATIN-1 parts can be read normally and the portions with other encodings would be available for cut and paste? If I use XInclude, no reason why not. So thanks for all the help. I think I was pretty much on a fool's errand from the start, but at least I know how to proceed. -- Uche Ogbuji Principal Consultant uche.ogbuji@fourthought.com +1 303 583 9900 x 101 Fourthought, Inc. http://Fourthought.com 4735 East Walnut St, Ste. C, Boulder, CO 80301-2537, USA Software-engineering, knowledge-management, XML, CORBA, Linux, Python From martin@loewis.home.cs.tu-berlin.de Thu Dec 14 03:05:01 2000 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Thu, 14 Dec 2000 04:05:01 +0100 Subject: [I18n-sig] Mixed encodings and XML In-Reply-To: <3A37FF53.206662F3@fourthought.com> (message from Uche Ogbuji on Wed, 13 Dec 2000 15:59:31 -0700) References: <3A37FF53.206662F3@fourthought.com> Message-ID: <200012140305.EAA00999@loewis.home.cs.tu-berlin.de> > How would one go about creating a well-formed XML document with multiple > encodings? As others have pointed out: You don't. XML documents are in Unicode. They may have some other encoding *for transfer*, but conceptually, they are still in Unicode. > It contains many sections within HTML PREs with the different encodings > I mentioned. They look like > >
> <PRE LANG="zh-TW">
> ... BIG5-encoded stuff ...
> </PRE>
So what you really want is to include binary data in a tag. As you've explained yourself when answering to Marc-Andre: That is not supported in XML. Of course, if XML had a BDATA type (or section) you could include a binary data fragment, and then any presentation tool would have to provide visualization (such as opening a hex editor on double-click). In the specific case of cjkv.doc, I guess the best approach would be: - use Python string escapes in Python code, e.g. sjisStr = "\0x88\0xc0\0x91\0x53\0x82\0xc9\0x8e\0x67\0x82\0xa6\0x82\0xe9" # Shift-JIS encoded source string - use Unicode text data where output is intended to be displayed properly - don't cite the output if it will come out as gibberish on any terminal (e.g. when printing both SJIS and UTF-8 on the same terminal). Instead, explain what the user will likely see. Regards, Martin From kajiyama@grad.sccs.chukyo-u.ac.jp Thu Dec 14 04:31:20 2000 From: kajiyama@grad.sccs.chukyo-u.ac.jp (Tamito KAJIYAMA) Date: Thu, 14 Dec 2000 13:31:20 +0900 Subject: [I18n-sig] Re: Mixed encodings and XML In-Reply-To: <200012140245.TAA16426@localhost.localdomain> (uche.ogbuji@fourthought.com) References: <200012140120.KAA14252@dhcp198.grad.sccs.chukyo-u.ac.jp> Message-ID: <200012140431.NAA14495@dhcp198.grad.sccs.chukyo-u.ac.jp> uche.ogbuji@fourthought.com wrote: | | The intention, I'm pretty sure, is to provide examples than | can be cut and pasted in order for people to play with the | various snippets themselves. I don't think that mixing different encodings in a document is a good idea. A brower assumes an encoding when reading a sequence of characters from a stream. If the browser finds one or more bytes out of the expected range, the result of decoding is undefined in general. So, cut-and-paste may or may not pass correct character data to the user. Safer ways for giving examples in various encodings are: - to use Unicode for displaying code snippets in the document the end users see on their browsers, and - to use native encodings in separate files to provide the real code snippets. Authoring an XML source of the document is another story. Regards, -- KAJIYAMA, Tamito From larsga@garshol.priv.no Thu Dec 14 10:03:11 2000 From: larsga@garshol.priv.no (Lars Marius Garshol) Date: 14 Dec 2000 11:03:11 +0100 Subject: [XML-SIG] Re: [I18n-sig] Mixed encodings and XML In-Reply-To: <200012140245.TAA16426@localhost.localdomain> References: <200012140245.TAA16426@localhost.localdomain> Message-ID: * uche ogbuji | | 1. Is there any way to convince an XML parser to work with source | with mixed encoding. A single XML entity must be entirely in a single character encoding. A document, however, can be in any number of different encodings, provided each entity is internally consistent. You can have encoding declarations on both the document entity (in the form of the XML declaration) and on subordinate entities (using text declarations). So you can do what you want using entities. --Lars M. From uche.ogbuji@fourthought.com Thu Dec 14 15:21:44 2000 From: uche.ogbuji@fourthought.com (uche.ogbuji@fourthought.com) Date: Thu, 14 Dec 2000 08:21:44 -0700 Subject: [XML-SIG] Re: [I18n-sig] Mixed encodings and XML In-Reply-To: Message from Lars Marius Garshol of "14 Dec 2000 11:03:11 +0100." Message-ID: <200012141521.IAA18229@localhost.localdomain> > > * uche ogbuji > | > | 1. Is there any way to convince an XML parser to work with source > | with mixed encoding. > > A single XML entity must be entirely in a single character encoding. 
> A document, however, can be in any number of different encodings, > provided each entity is internally consistent. You can have encoding > declarations on both the document entity (in the form of the XML > declaration) and on subordinate entities (using text declarations). > > So you can do what you want using entities. Excellent! Just when I'd convinced myself that I was on a fool's errand, comes Lars to the rescue. I gues it's been too long since I've exercised all of XML 1.0. I so rarely use entities that I completely forgot that they are exactly the solution. I can use entities in special XML elements, and extend the docbook stylesheet to output the contents of those elements to a separate file using the "ft:write-file" extension element. Perfect. Thanks. -- Uche Ogbuji Principal Consultant uche.ogbuji@fourthought.com +1 303 583 9900 x 101 Fourthought, Inc. http://Fourthought.com 4735 East Walnut St, Ste. C, Boulder, CO 80301-2537, USA Software-engineering, knowledge-management, XML, CORBA, Linux, Python From andy@reportlab.com Thu Dec 14 16:16:03 2000 From: andy@reportlab.com (Andy Robinson) Date: Thu, 14 Dec 2000 16:16:03 -0000 Subject: [I18n-sig] Mixed encodings and XML In-Reply-To: <200012140245.TAA16426@localhost.localdomain> Message-ID: > 1. Is there any way to convince an XML parser to work with > source with mixed > encoding. The exchange with you has helped disabuse me of > any silly notion > that this might be so. So I shall have to use XInclude. > > 2. Will the results of the rendering be such that the > LATIN-1 parts can be > read normally and the portions with other encodings would > be available for cut > and paste? If I use XInclude, no reason why not. I did exactly this in an internal help page for a company that was learning this stuff a year ago. I don't see a problem, because most CJKV encodings are 8-bit and ASCII compatible. Declare the document as Latin-1 - because that way your parser will not choke on or corrupt bytes above 127. Then paste in text in whatever encoding you want. Any Kanji text in one of the common ASCII-compatible encodings (Shift-JIS, EUC, or even UTF8) will appear as gobbledegook, but the underlying bytes will not be corrupted, so they should be able to paste them out. You should be able to transform the whole document from iso-latin-1 to utf8 and back without loss of data; do a quick test from Python to verify it. Not exactly an industrial solution, but it's not exactly an industrial problem. It would of course go horribly wrong if you used exotic encodings like UTF-16 with null bytes :-) - Andy Robinson From uche.ogbuji@fourthought.com Fri Dec 15 15:57:18 2000 From: uche.ogbuji@fourthought.com (uche.ogbuji@fourthought.com) Date: Fri, 15 Dec 2000 08:57:18 -0700 Subject: [I18n-sig] Mixed encodings and XML In-Reply-To: Message from "Andy Robinson" of "Thu, 14 Dec 2000 16:16:03 GMT." Message-ID: <200012151557.IAA22017@localhost.localdomain> > I did exactly this in an internal help page for a company that was > learning this stuff a year ago. I don't see a problem, because most > CJKV encodings are 8-bit and ASCII compatible. Declare the document as > Latin-1 - because that way your parser will not choke on or corrupt > bytes above 127. Then paste in text in whatever encoding you want. > Any Kanji text in one of the common ASCII-compatible encodings > (Shift-JIS, EUC, or even UTF8) will appear as gobbledegook, but the > underlying bytes will not be corrupted, so they should be able to > paste them out. 
You should be able to transform the whole document > from iso-latin-1 to utf8 and back without loss of data; do a quick > test from Python to verify it. > > Not exactly an industrial solution, but it's not exactly an industrial > problem. > > It would of course go horribly wrong if you used exotic encodings like > UTF-16 with null bytes :-) Now I _know_ I need more sleep. I never even tried the simple expedient of adding the XML declaration with LATIN-1 encoding. Not even when the original HTML doc geve a strong hint by adding a META tag that did the same thing. Now my problem is completely solved without needing to resort to multiple files. Thanks, Andy. -- Uche Ogbuji Principal Consultant uche.ogbuji@fourthought.com +1 303 583 9900 x 101 Fourthought, Inc. http://Fourthought.com 4735 East Walnut St, Ste. C, Boulder, CO 80301-2537, USA Software-engineering, knowledge-management, XML, CORBA, Linux, Python From kajiyama@grad.sccs.chukyo-u.ac.jp Fri Dec 15 19:40:17 2000 From: kajiyama@grad.sccs.chukyo-u.ac.jp (Tamito KAJIYAMA) Date: Sat, 16 Dec 2000 04:40:17 +0900 Subject: [I18n-sig] JapaneseCodecs 1.2 released Message-ID: <200012151940.EAA17536@dhcp198.grad.sccs.chukyo-u.ac.jp> Hi all, I've released JapaneseCodecs 1.2. As usual, the package is available at: http://pseudo.grad.sccs.chukyo-u.ac.jp/~kajiyama/python/ Changed files have been checked into the CVS repository, too. Here is an excerpt from README.en: | o Version 1.2 <16 December 2000> | - All codecs are moved into the "japanese" module. | - The packages is now installed into $lib/site-packages/. | - The ISO-2022-JP codec now maps 0x5c and 0x7e to U+00A5 (yen | mark) and U+00AF (overline), respectively, when JIS X 0201 | Roman is designated. | (Thanks to SUZUKI Hisao ) | - New codec for ISO-2022-JP plus JIS X 0201 Katakana is added. | (Thanks to SUZUKI Hisao ) | - New codecs for JIS 0201 X Roman and Katakana are added. JapaneseCodecs is no longer installed into $lib/encodings. Moreover, Japanese codecs are packed into the "japanese" codecs package. A few examples of the new codec names are japanese.euc-jp, japanese.sjis, and japanese.iso-2022-jp. For those who participated in discussions on the naming of an ISO-2022-JP variant, the implementation of packaged codecs, and problems concerning codec aliases: Thanks a lot!! Best regards, -- KAJIYAMA, Tamito From martin@loewis.home.cs.tu-berlin.de Fri Dec 15 21:03:37 2000 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Fri, 15 Dec 2000 22:03:37 +0100 Subject: [I18n-sig] Mixed encodings and XML In-Reply-To: <200012151557.IAA22017@localhost.localdomain> (uche.ogbuji@fourthought.com) References: <200012151557.IAA22017@localhost.localdomain> Message-ID: <200012152103.WAA00902@loewis.home.cs.tu-berlin.de> > Now my problem is completely solved without needing to resort to > multiple files. It is worked-around, not solved. You claim that those parts of the document are Latin-1, when they are actually different. A processor would normally convert that to some internal Unicode representation, and may it write out then in a different format, e.g. UTF-8. Then the essential information would be lost - even though this is an isomorphic transformation. 
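A minimal sketch of the failure mode Martin describes, assuming a Big5 codec is available (in 2000 that meant one of the external CJK codec packages):

    # Two Chinese characters, serialized as Big5 bytes inside a document
    # whose encoding declaration *claims* Latin-1.
    big5_bytes = u"\u4e2d\u6587".encode("big5")

    # A conforming processor trusts the declaration ...
    text = unicode(big5_bytes, "latin-1")

    # ... and may re-serialize the document in another encoding.
    utf8_bytes = text.encode("utf-8")

    # The round trip is lossless (isomorphic), but utf8_bytes is no
    # longer valid Big5, so tools expecting Big5 in those sections
    # can no longer read them.
    assert unicode(utf8_bytes, "utf-8").encode("latin-1") == big5_bytes
    assert utf8_bytes != big5_bytes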
Regards, Martin From kajiyama@grad.sccs.chukyo-u.ac.jp Wed Dec 20 11:05:49 2000 From: kajiyama@grad.sccs.chukyo-u.ac.jp (Tamito KAJIYAMA) Date: Wed, 20 Dec 2000 20:05:49 +0900 Subject: [I18n-sig] error handling in charmap-based codecs Message-ID: <200012201105.UAA24380@dhcp198.grad.sccs.chukyo-u.ac.jp> Hi, Most standard codecs based on the charmap codec, such as iso8859_2 and koi8_r, appear not to do correct error handling. Although the default error handling scheme is "strict", characters that are not in a mapping are passed through without decoding/encoding. Worse, a error handling scheme specified is completely ignored. Following code excerpt from Object/unicodeobject.c points out the problem: 1965: /* Get mapping (char ordinal -> integer, Unicode char or None) */ 1966: w = PyInt_FromLong((long)ch); 1967: if (w == NULL) 1968: goto onError; 1969: x = PyObject_GetItem(mapping, w); 1970: Py_DECREF(w); 1971: if (x == NULL) { 1972: if (PyErr_ExceptionMatches(PyExc_LookupError)) { 1973: /* No mapping found: default to Latin-1 mapping */ 1974: PyErr_Clear(); 1975: *p++ = (Py_UNICODE)ch; 1976: continue; 1977: } 1978: goto onError; 1979: } Evidently, a character not in the 'mapping' object is passed as it is. I'm not sure why the if statement shown above has been put here. A error handling scheme works as expected if the mapping object returns None for an undefined key. So, I've added the following code to charmap-based codecs of mine: import UserDict class Mapping(UserDict.UserDict): def __getitem__(self, key): return self.get(key) decoding_map = Mapping({ ... }) encoding_map = Mapping({}) for k, v in decoding_map.items(): encoding_map[v] = k Either Objects/unicodeobject.c or the charmap-based codecs need a fix, I think. Regards, -- KAJIYAMA, Tamito From martin@loewis.home.cs.tu-berlin.de Wed Dec 20 11:36:16 2000 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Wed, 20 Dec 2000 12:36:16 +0100 Subject: [I18n-sig] error handling in charmap-based codecs In-Reply-To: <200012201105.UAA24380@dhcp198.grad.sccs.chukyo-u.ac.jp> (message from Tamito KAJIYAMA on Wed, 20 Dec 2000 20:05:49 +0900) References: <200012201105.UAA24380@dhcp198.grad.sccs.chukyo-u.ac.jp> Message-ID: <200012201136.MAA00869@loewis.home.cs.tu-berlin.de> > Most standard codecs based on the charmap codec, such as > iso8859_2 and koi8_r, appear not to do correct error handling. > Although the default error handling scheme is "strict", > characters that are not in a mapping are passed through without > decoding/encoding. Worse, a error handling scheme specified is > completely ignored. Indeed. I have filed a bug report, "Unicode encoders don't report errors properly", http://sourceforge.net/bugs/?func=detailbug&bug_id=116285&group_id=5470 Unfortunately, there is disagreement whether this is a bug, or what the nature of the bug is. > 1965: /* Get mapping (char ordinal -> integer, Unicode char or None) */ > 1966: w = PyInt_FromLong((long)ch); > 1967: if (w == NULL) > 1968: goto onError; > 1969: x = PyObject_GetItem(mapping, w); > 1970: Py_DECREF(w); > 1971: if (x == NULL) { > 1972: if (PyErr_ExceptionMatches(PyExc_LookupError)) { > 1973: /* No mapping found: default to Latin-1 mapping */ > 1974: PyErr_Clear(); > 1975: *p++ = (Py_UNICODE)ch; > 1976: continue; > 1977: } > 1978: goto onError; > 1979: } > > Evidently, a character not in the 'mapping' object is passed as > it is. I'm not sure why the if statement shown above has been > put here. I'm not sure, either. 
There is no documentation of what the function is supposed to do, so it is hard to tell whether it does that correctly. IMO, it should read if (x == NULL) { if (PyErr_ExceptionMatches(PyExc_LookupError)) { /* No mapping found: default to Latin-1 mapping */ PyErr_Clear(); x = Py_None; Py_INCREF(x); } else goto onError; } I can't see any reason for defaulting to *Latin-1*. > A error handling scheme works as expected if the mapping object > returns None for an undefined key. So, I've added the following > code to charmap-based codecs of mine: Yes, that is also the proposed solution in response to my bug report. I don't like it at all as a solution; it's an ok work-around. As a solution, it is stupid: All codecs will have to pay the cost of UserDict accesses, and no codec makes use of this 1:1 "feature" - when the real solution is a three-line change. Just my 0.02EUR, Martin From kajiyama@grad.sccs.chukyo-u.ac.jp Wed Dec 20 12:31:05 2000 From: kajiyama@grad.sccs.chukyo-u.ac.jp (Tamito KAJIYAMA) Date: Wed, 20 Dec 2000 21:31:05 +0900 Subject: [I18n-sig] error handling in charmap-based codecs In-Reply-To: <200012201136.MAA00869@loewis.home.cs.tu-berlin.de> (martin@loewis.home.cs.tu-berlin.de) References: <200012201136.MAA00869@loewis.home.cs.tu-berlin.de> Message-ID: <200012201231.VAA24555@dhcp198.grad.sccs.chukyo-u.ac.jp> Martin v. Loewis wrote: | | IMO, it should read | | if (x == NULL) { | if (PyErr_ExceptionMatches(PyExc_LookupError)) { | /* No mapping found: default to Latin-1 mapping */ | PyErr_Clear(); | x = Py_None; | Py_INCREF(x); | } else | goto onError; | } I agree. | I can't see any reason for defaulting to *Latin-1*. Yes. Passing characters through intact might be okay for Latin-1 variants, but it is not acceptable at all for non-Latin encodings. Also, I don't think that defaulting to Latin-1 is the same as copying characters which do not have corresponding characters in an encoding. Regards, -- KAJIYAMA, Tamito From walter@livinglogic.de Wed Dec 20 14:06:25 2000 From: walter@livinglogic.de (Walter Doerwald) Date: Wed, 20 Dec 2000 15:06:25 +0100 Subject: [I18n-sig] Proposal: Extended error handling for unicode.encode Message-ID: <200012201506250171.00D313E3@mail.tmt.de> Problem: Most character encodings do not support the full range of Unicode characters. For these cases many high level protocols support a way of escaping a Unicode character (e.g. Python itself supports the \x, \u and \U convention, XML supports character references via &#xxxx; etc.). The problem with the current implementation of unicode.encode is that for determining which characters are unencodable by a certain encoding, every single character has to be tried, because encode does not provide any information about the location of the error(s), so us = u"xxx" s = us.encode("encoding", errors="strict") has to be replaced by: us = u"xxx" v = [] for c in us: try: v.append(c.encode("encoding", "strict")) except UnicodeError: v.append("&#" + str(ord(c)) + ";") s = "".join(v) This slows down encoding dramatically as now the loop through the string is done in Python code and no longer in C code. Solution: One simple and extensible solution would be to be able to pass an error handler function as the errors argument for encode. This error handler function is passed every unencodable character and might either raise an exception itself, or return a unicode string that will be encoded instead of the unencodable character.
(Note that this requires that the encoding *must* be able to encode what is returned from the handler) Example: us = unicode("aäoöuü", "latin1") def xmlEscape(char): return u"&#" + unicode(ord(char),"ascii") + u";" print s.encode("us-ascii", xmlEscape) will result in a&#228;o&#246;u&#252; With this scheme it would even be possible to reimplement the old error handling with the new one: def strict(char): raise UnicodeError("can't encode %r" % char) def ignore(char): return u"" def replace(char): return u"\uFFFD" Does this make sense? Bye, Walter Dörwald -- Walter Dörwald · LivingLogic AG · Bayreuth, Germany · www.livinglogic.de From martin@loewis.home.cs.tu-berlin.de Wed Dec 20 15:00:34 2000 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Wed, 20 Dec 2000 16:00:34 +0100 Subject: [I18n-sig] Proposal: Extended error handling for unicode.encode In-Reply-To: <200012201506250171.00D313E3@mail.tmt.de> (walter@livinglogic.de) References: <200012201506250171.00D313E3@mail.tmt.de> Message-ID: <200012201500.QAA00859@loewis.home.cs.tu-berlin.de> > One simple and extensible solution would be to be able to > pass an error handler function as the error argument for encode. [...] > Does this make sense? That is indeed the best solution for that problem that I've heard so far. I'm not so sure about replacing "strict" with, say, codecs.strict, but in general, it seems like a very elegant approach to me. The only problem is that all existing codecs have to be reworked to take the extended interface into account. Would you like to work on a patch for Python? Regards, Martin From mal@lemburg.com Wed Dec 20 18:52:37 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Wed, 20 Dec 2000 19:52:37 +0100 Subject: [I18n-sig] Proposal: Extended error handling for unicode.encode References: <200012201506250171.00D313E3@mail.tmt.de> Message-ID: <3A40FFF5.882E0D82@lemburg.com> Walter Doerwald wrote: > > Problem: > Most character encodings do not support the full range of > Unicode characters. For these cases many high level protocols > support a way of escaping a Unicode character (e.g. Python > itself supports the \x, \u and \U convention, XML supports > character references via &#xxxx; etc.). The problem with the > current implementation of unicode.encode is that for determining > which characters are unencodable by a certain encoding, every > single character has to be tried, because encode does not > provide any information about the location of the error(s), so > > us = u"xxx" > s = us.encode("encoding", errors="strict") > > has to be replaced by: > > us = u"xxx" > v = "" > for c in us: > try: > v.append(c.encode("encoding", "strict")) > except UnicodeError: > v.append("&#" + ord(c) + ";") > s = "".join(v) > > This slows down encoding dramatically as now the loop through > the string is done in Python code and no longer in C code. > > Solution: > One simple and extensible solution would be to be able to > pass an error handler function as the error argument for encode. > This error handler function is passed every unencodable character > and might either raise an exception itself, or return a unicode > string that will be encoded instead of the unencodable character.
> (Note that this requires the the encoding *must* be able to encode > what is returned from the handler) > > Example: > > us = unicode("aäoöuü", "latin1") > > def xmlEscape(char): > return u"&#" + unicode(ord(char),"ascii") + u";" > > print s.encode("us-ascii", xmlEscape) > > will result in > > aäoöuü > > With this scheme it would even be possible to reimplement the > old error handling with the new one: > > def strict(char): > raise UnicodeError("can't encode %r" % char) > > def ignore(char): > return u"" > > def replace(char): > return u"\uFFFD" > > Does this make sense? The problem with this is that the error handler will usually have to have access to the internal data structure of the codec to be able to process the error, e.g. in your example could be a single character, a UTF-16 sequence, etc. The codec in general knows better what to do in case of an error, that's why there's a simple string argument for the error handling: the codec can then decide on what to do depending on the value of this argument (and even call back to some error handler it implements as method). Since your main problem is locating the character causing the error, one possibility would be to extend the error instance to reference the position of the error as error instance attribute, e.g. unierror.position. -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Wed Dec 20 19:06:23 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Wed, 20 Dec 2000 20:06:23 +0100 Subject: [I18n-sig] error handling in charmap-based codecs References: <200012201105.UAA24380@dhcp198.grad.sccs.chukyo-u.ac.jp> <200012201136.MAA00869@loewis.home.cs.tu-berlin.de> Message-ID: <3A41032F.FA01042D@lemburg.com> "Martin v. Loewis" wrote: > > > Most standard codecs based on the charmap codec, such as > > iso8859_2 and koi8_r, appear not to do correct error handling. > > Although the default error handling scheme is "strict", > > characters that are not in a mapping are passed through without > > decoding/encoding. Worse, a error handling scheme specified is > > completely ignored. This is because I wanted to avoid having to put a huge number of mappings to None into the codec dictionaries. This would have caused the codec modules and dictionaries to become much larger than acceptable for the standard distribution. The charmap codec was originally written to simplify writing codecs for 8-bit encodings. Most of these only alter a few characters and this would warrant including mappings for all 256 characters in both directions. > Indeed. I have filed a bug report, "Unicode encoders don't report > errors properly", > > http://sourceforge.net/bugs/?func=detailbug&bug_id=116285&group_id=5470 > > Unfortunately, there is disagreement whether this is a bug, or what > the nature of the bug is. There is ? > > 1965: /* Get mapping (char ordinal -> integer, Unicode char or None) */ > > 1966: w = PyInt_FromLong((long)ch); > > 1967: if (w == NULL) > > 1968: goto onError; > > 1969: x = PyObject_GetItem(mapping, w); > > 1970: Py_DECREF(w); > > 1971: if (x == NULL) { > > 1972: if (PyErr_ExceptionMatches(PyExc_LookupError)) { > > 1973: /* No mapping found: default to Latin-1 mapping */ > > 1974: PyErr_Clear(); > > 1975: *p++ = (Py_UNICODE)ch; > > 1976: continue; > > 1977: } > > 1978: goto onError; > > 1979: } > > > > Evidently, a character not in the 'mapping' object is passed as > > it is. 
I'm not sure why the if statement shown above has been > > put here. > > I'm not sure, either. There is no documentation what the function is > supposed to do, so it is hard to tell whether it does that correctly. Ok, let me document it: It does what it's supposed to do :-) > IMO, it should read > > if (x == NULL) { > if (PyErr_ExceptionMatches(PyExc_LookupError)) { > /* No mapping found: default to Latin-1 mapping */ > PyErr_Clear(); > x = Py_None; > Py_INCREF(x); > } else > goto onError; > } > > I can't see any reason for defaulting to *Latin-1*. See above. The encodings using the charmap codec are usually only minor modifications of Latin-1. > > A error handling scheme works as expected if the mapping object > > returns None for an undefined key. So, I've added the following > > code to charmap-based codecs of mine: > > Yes, that is also the proposed solution in response to my bug > report. I don't like it at all as a solution; it's an ok work-around. > As a solution, it is stupid: All codecs will have to pay the cost for > UserDict accesses, and no codec makes uses of this 1:1 "feature" - > when real solution is three-line change. Huh ? The solution is simple: you only have to add mappings to None as appropriate. There's no need to change the codec. -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From martin@loewis.home.cs.tu-berlin.de Wed Dec 20 20:54:56 2000 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Wed, 20 Dec 2000 21:54:56 +0100 Subject: [I18n-sig] Proposal: Extended error handling for unicode.encode In-Reply-To: <3A40FFF5.882E0D82@lemburg.com> (mal@lemburg.com) References: <200012201506250171.00D313E3@mail.tmt.de> <3A40FFF5.882E0D82@lemburg.com> Message-ID: <200012202054.VAA01458@loewis.home.cs.tu-berlin.de> > The problem with this is that the error handler will usually > have to have access to the internal data structure of the codec > to be able to process the error, e.g. in your example > could be a single character, a UTF-16 sequence, etc. Please note that in his encoding, char is a Unicode string (specifically, character), so it can't be a UTF-16 sequence. What *encoder* that you know needs to have internal state? Anyway, if you think that state should be accessible to the error handling function, it won't be hard to pass state to the callback. E.g. you could pass the string being encoded, the current position, and optionally a Codec instance (many codecs would pass None, as they don't keep any state). > The codec in general knows better what to do in case of an error In the demonstrated use case, it doesn't know. It should create an XML character entity, but doesn't know anything about XML character entities. > Since your main problem is locating the character causing the > error, one possibility would be to extend the error instance > to reference the position of the error as error instance > attribute, e.g. unierror.position. That would work as well, but it would require to re-encode everything up to that position. The callback solution is more general. Regards, Martin From martin@loewis.home.cs.tu-berlin.de Wed Dec 20 21:22:17 2000 From: martin@loewis.home.cs.tu-berlin.de (Martin v. 
Loewis) Date: Wed, 20 Dec 2000 22:22:17 +0100 Subject: [I18n-sig] error handling in charmap-based codecs In-Reply-To: <3A41032F.FA01042D@lemburg.com> (mal@lemburg.com) References: <200012201105.UAA24380@dhcp198.grad.sccs.chukyo-u.ac.jp> <200012201136.MAA00869@loewis.home.cs.tu-berlin.de> <3A41032F.FA01042D@lemburg.com> Message-ID: <200012202122.WAA01518@loewis.home.cs.tu-berlin.de> > This is because I wanted to avoid having to put a huge number of > mappings to None into the codec dictionaries. This would have > caused the codec modules and dictionaries to become much larger > than acceptable for the standard distribution. I can't see the problem. If KeyError means "character not in the target character set", then why exactly would you have to put mappings to None into the codec dictionaries? Can you please give an example of a mapping that would need to be changed? > > I can't see any reason for defaulting to *Latin-1*. > > See above. The encodings using the charmap codec are usually > only minor modifications of Latin-1. I see, but I don't see. Let's take koi8_r.py as an example. It has a complete mapping for the range 128..255, the rest (0..127) is intended as a 1:1 mapping. I can't see a problem writing decoding_map = codecs.identity_dictionary(range(0,128)) decoding_map.update({ 0x0080: 0x2500, # BOX DRAWINGS LIGHT HORIZONTAL 0x0081: 0x2502, # BOX DRAWINGS LIGHT VERTICAL ... }) where codecs.identity_dictionary is defined as def identity_dictionary(rng): res = {} for i in rng:res[i]=i return res That will produce somewhat larger dictionaries once a codec is *used*, but it won't change the distribution significantly. > Huh ? The solution is simple: you only have to add mappings to None > as appropriate. There's no need to change the codec. So how can I correct the koi8_r codec without changing the C code? Regards, Martin From mal@lemburg.com Thu Dec 21 18:46:56 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Thu, 21 Dec 2000 19:46:56 +0100 Subject: [I18n-sig] error handling in charmap-based codecs References: <200012201105.UAA24380@dhcp198.grad.sccs.chukyo-u.ac.jp> <200012201136.MAA00869@loewis.home.cs.tu-berlin.de> <3A41032F.FA01042D@lemburg.com> <200012202122.WAA01518@loewis.home.cs.tu-berlin.de> Message-ID: <3A425020.3A80BC25@lemburg.com> "Martin v. Loewis" wrote: > > > This is because I wanted to avoid having to put a huge number of > > mappings to None into the codec dictionaries. This would have > > caused the codec modules and dictionaries to become much larger > > than acceptable for the standard distribution. > > I can't see the problem. If KeyError means "character not in the > target character set", then why exactly would you have to put mappings > to None into the codec dictionaries? Can you please give an example of > a mapping that would need to be changed? A mapping to None means: this mapping is undefined, so raise an exception. If this were the default, then all cpXXX.py would have to include all 1-1 mappings explicitely, e.g. 0x0020: 0x0020. This would cause the tables to enlarge substantially. To explicitely declare a mapping undefined, you'd have to add mappings to None. This is what causes the bug you reported on SF. A proper fix would involve adding the relevant mappings to all decode maps in the standard codecs. > > > I can't see any reason for defaulting to *Latin-1*. > > > > See above. The encodings using the charmap codec are usually > > only minor modifications of Latin-1. > > I see, but I don't see. Let's take koi8_r.py as an example. 
It has a > complete mapping for the range 128..255, the rest (0..127) is intended > as a 1:1 mapping. I can't see a problem writing > > decoding_map = codecs.identity_dictionary(range(0,128)) > decoding_map.update({ > > 0x0080: 0x2500, # BOX DRAWINGS LIGHT HORIZONTAL > 0x0081: 0x2502, # BOX DRAWINGS LIGHT VERTICAL > ... > }) > > where codecs.identity_dictionary is defined as > > def identity_dictionary(rng): > res = {} > for i in rng:res[i]=i > return res > > That will produce somewhat larger dictionaries once a codec is *used*, > but it won't change the distribution significantly. True; that would be an at runtime possibility -- perhaps we ought to provide more tools for creating those mapping tables ?! > > Huh ? The solution is simple: you only have to add mappings to None > > as appropriate. There's no need to change the codec. > > So how can I correct the koi8_r codec without changing the C code? Simple: add the missing mappings to None for the range 0..255. The mapping lives in the Python module koi8_r.py -- there's no need to touch any C code. -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Thu Dec 21 18:48:26 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Thu, 21 Dec 2000 19:48:26 +0100 Subject: [I18n-sig] Proposal: Extended error handling for unicode.encode References: <200012201506250171.00D313E3@mail.tmt.de> <3A40FFF5.882E0D82@lemburg.com> <200012202054.VAA01458@loewis.home.cs.tu-berlin.de> <3A423E4D.88C7639@lemburg.com> Message-ID: <3A42507A.AD7D196E@lemburg.com> "Martin v. Loewis" wrote: > > > The problem with this is that the error handler will usually > > have to have access to the internal data structure of the codec > > to be able to process the error, e.g. in your example > > could be a single character, a UTF-16 sequence, etc. > > Please note that in his encoding, char is a Unicode string > (specifically, character), so it can't be a UTF-16 sequence. > What *encoder* that you know needs to have internal state? The codec is much general and kept symmetric for obvious reasons. In his case, char would be a Unicode string, but the input to an encoder could just as well be an image, a sound or some other abstract form of data storage. It is not unlikely that these encoder will need to keep state. Even for Unicode you will need to keep state in the encoder, e.g. to write an encoder which uses the Unicode compression algorithm as basis (the output stream contains markers to switch pages). > Anyway, if you think that state should be accessible to the error > handling function, it won't be hard to pass state to the callback. > E.g. you could pass the string being encoded, the current position, > and optionally a Codec instance (many codecs would pass None, as they > don't keep any state). Hmm, I don't think this is generally useful. Using the codec instances directly would be the right way to go, IMHO. I don't want to overload .encode() or unicode() with too much functionality. Writing your own function helpers which then apply all the necessary magic is simple and doesn't warrant changing APIs in the core. Since the error handling is extensible by adding new options such as 'callback', the existing codecs could be extended to provide this functionality as well. We'd only need a way to pass the callback to the codecs in some way, e.g. 
by using a keyword argument on the constructor or by subclassing it and providing a new method for the error handling in question. > > The codec in general knows better what to do in case of an error > > In the demonstrated use case, it doesn't know. It should create an XML > character entity, but doesn't know anything about XML character > entities. I meant that it knows better about the current state and parameters of the encoding and input it is working on. The ideal error handling scheme would call a method on the codec which you could then override to provide your own handling, e.g. XML entity encoding. > > Since your main problem is locating the character causing the > > error, one possibility would be to extend the error instance > > to reference the position of the error as error instance > > attribute, e.g. unierror.position. > > That would work as well, but it would require to re-encode everything > up to that position. The callback solution is more general. Sure, but the more general solution needs to be well designed. The above trick only adds additional information to the error instance -- this is easy to implement and doesn't break anything. Note: simply changing the error parameter to a PyObject doesn't work, since all C APIs expect a simple const char. -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From martin@loewis.home.cs.tu-berlin.de Fri Dec 22 13:57:20 2000 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Fri, 22 Dec 2000 14:57:20 +0100 Subject: [I18n-sig] Proposal: Extended error handling for unicode.encode In-Reply-To: <3A423E4D.88C7639@lemburg.com> (mal@lemburg.com) References: <200012201506250171.00D313E3@mail.tmt.de> <3A40FFF5.882E0D82@lemburg.com> <200012202054.VAA01458@loewis.home.cs.tu-berlin.de> <3A423E4D.88C7639@lemburg.com> Message-ID: <200012221357.OAA00921@loewis.home.cs.tu-berlin.de> > Hmm, I don't think this is generally useful. Using the codec > instances directly would be the right way to go, IMHO. I don't > want to overload .encode() or unicode() with too much functionality. > Writing your own function helpers which then apply all the necessary > magic is simple and doesn't warrant changing APIs in the core. Ok, then I have a challenge for you. Write a codec family that emits XML character entities on encoding errors for any of the standard Python codecs. If its really simple, then I'd *really* appreciate concrete, working code. I really mean that - I doubt that this is simple. If a problem arises doing it for all of the encodings, just pick one. If that is still asked too much, outline a solution; preferably one that is as efficient as would be the solution involving the callback. > Since the error handling is extensible by adding new options such as > 'callback', the existing codecs could be extended to provide this > functionality as well. We'd only need a way to pass the callback to > the codecs in some way, e.g. by using a keyword argument on the > constructor or by subclassing it and providing a new method for the > error handling in question. That solution is quite similar to the callback approach, so we could probably chose either. I'm not entirely sure how the usage scenario is. 
Did you think that users, instead of writing u.encode("koi8-r",errors=xmlcharentities) would write I,forgot,which,parameter = codecs.lookup("koi8-r") encode = I() encode.install_error_cb(xmlcharentities) encode.encode(u,errors="callback") or did you have a more convenient API in mind? Also, how would I write the callback function for the koi8-r codec? > I meant that it knows better about the current state and > parameters of the encoding and input it is working on. The ideal > error handling scheme would call a method on the codec which > you could then override to provide your own handling, e.g. > XML entity encoding. Well, the proposed scheme *is* ideal, in that sense. > Sure, but the more general solution needs to be well designed. > The above trick only adds additional information to the error > instance -- this is easy to implement and doesn't break anything. Again, I'd like to see how the API is used - ease of implementation of the API is not my primary concern; I'd be willing to contribute involved implementations if they make the users' lifes easier. > Note: simply changing the error parameter to a PyObject doesn't > work, since all C APIs expect a simple const char. Sure. Looking from the Python core side of the things, it's a large change. Looking from the users' point of view, it's a small one. Regards, Martin From martin@loewis.home.cs.tu-berlin.de Fri Dec 22 14:18:49 2000 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Fri, 22 Dec 2000 15:18:49 +0100 Subject: [I18n-sig] error handling in charmap-based codecs In-Reply-To: <3A425020.3A80BC25@lemburg.com> (mal@lemburg.com) References: <200012201105.UAA24380@dhcp198.grad.sccs.chukyo-u.ac.jp> <200012201136.MAA00869@loewis.home.cs.tu-berlin.de> <3A41032F.FA01042D@lemburg.com> <200012202122.WAA01518@loewis.home.cs.tu-berlin.de> <3A425020.3A80BC25@lemburg.com> Message-ID: <200012221418.PAA01081@loewis.home.cs.tu-berlin.de> > Date: Thu, 21 Dec 2000 19:46:56 +0100 > A mapping to None means: this mapping is undefined, so raise an > exception. If this were the default, then all cpXXX.py would have > to include all 1-1 mappings explicitely, e.g. 0x0020: 0x0020. > This would cause the tables to enlarge substantially. I'm not sure what you mean by "tables". Please have a look at patch #103002; the actual increase in source code bytes for the Python core is quite minimal. Some overhead occurs when a codec is imported - it then adds at most 512 additional keys to the dictionaries that are actually used. > To explicitely declare a mapping undefined, you'd have to add > mappings to None. This is what causes the bug you reported on SF. > A proper fix would involve adding the relevant mappings to all > decode maps in the standard codecs. Why would that be smaller than adding the identity mappings? I hope you'd fill in the Nones using a loop, not by placing them in source code. Then the source code change is identical in both solutions. At run-time, the identity mapping is smaller than the mapping to None, since you'd need more than 65000 additional entries in each encoding_map. [how to correct the koi8-r codec] > Simple: add the missing mappings to None for the range 0..255. > The mapping lives in the Python module koi8_r.py -- there's > no need to touch any C code. That would be incorrect. u" ".encode("koi8-r") would then give a UnicodeError, when the result should be " ". I'm not sure why touching C code is a bad thing - especially when it is such a small change. 
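(For concreteness, the two pure-Python repairs being debated would look roughly like this; M stands for any charmap codec module such as encodings.koi8_r, and none of the code below exists in the distribution.)

    # (a) Declare every missing slot undefined, as suggested above.  For
    #     koi8-r this also covers 0..127, which is why, as noted, plain
    #     ASCII text would then be rejected with a UnicodeError:
    for i in range(256):
        if not M.decoding_map.has_key(i):
            M.decoding_map[i] = None

    # (b) Fill in identity mappings for the ASCII range instead, so that
    #     only genuinely unmapped code points remain undefined:
    for i in range(128):
        if not M.decoding_map.has_key(i):
            M.decoding_map[i] = i

    # In either case M.encoding_map has to be rebuilt from the patched
    # decoding_map so that the encoding direction follows suit.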
There is clearly an error in the C function; at least three different people have independently noticed the misbehaviour, and identified that function as the cause. Besides yourself, I have not seen anybody defending the "feature". Regards, Martin From walter@amazonas.livinglogic.de Fri Dec 22 15:32:31 2000 From: walter@amazonas.livinglogic.de (Walter Dörwald) Date: Fri, 22 Dec 2000 16:32:31 +0100 Subject: [I18n-sig] Proposal: Extended error handling for unicode.encode In-Reply-To: <3A423E4D.88C7639@lemburg.com> References: <200012201506250171.00D313E3@mail.tmt.de> <3A40FFF5.882E0D82@lemburg.com> <200012202054.VAA01458@loewis.home.cs.tu-berlin.de> <3A423E4D.88C7639@lemburg.com> Message-ID: <200012221632310203.0105EF8A@mail.livinglogic.de> On 21.12.00 at 18:30 M.-A. Lemburg wrote: > "Martin v. Loewis" wrote: > > > > > The problem with this is that the error handler will usually > > > have to have access to the internal data structure of the codec > > > to be able to process the error, e.g. in your example > > > could be a single character, a UTF-16 sequence, etc. > > > > Please note that in his encoding, char is a Unicode string > > (specifically, character), so it can't be a UTF-16 sequence. > > What *encoder* that you know needs to have internal state? > > The codec is much general and kept symmetric for obvious reasons. > In his case, char would be a Unicode string, but the input to > an encoder could just as well be an image, a sound or some other > abstract form of data storage. It is not unlikely that these > encoder will need to keep state. > > Even for Unicode you will need to keep state in the encoder, > e.g. to write an encoder which uses the Unicode compression > algorithm as basis (the output stream contains markers to > switch pages). But I don't see how this internal encoder state should influence what the error handler does. There are two layers involved: The character encoding layer and the "unencodable character escape mechanism". Both layers are completely independent, even in your "Unicode compression" example, where the "unencodable character escape mechanism" is XML character entities. > > Anyway, if you think that state should be accessible to the error > > handling function, it won't be hard to pass state to the callback. > > E.g. you could pass the string being encoded, the current position, > > and optionally a Codec instance (many codecs would pass None, as they > > don't keep any state). > > Hmm, I don't think this is generally useful. Using the codec > instances directly would be the right way to go, IMHO. I don't > want to overload .encode() or unicode() with too much functionality. We're only talking about encoding here. You're right that state might be required for a decoder. > Writing your own function helpers which then apply all the necessary > magic is simple and doesn't warrant changing APIs in the core. It is not as simple as the error handler, but I could live with that. The big problem is that it effectively kills the speed of your application. Every XML application written in Python, no matter what it does internally, will in the end have to produce an output bytestring. Normally the output encoding should be one that produces no unencodable characters, but you have to be prepared to handle them. With the error handler the complete encoding will be done in C code (with very infrequent calls to the error handler), so this scheme gives the best speed possible.
> Since the error handling is extensible by adding new options > such as 'callback', I would prefer a more object oriented way of extending the error handling. > the existing codecs could be extended to > provide this functionality as well. We'd only need a way to > pass the callback to the codecs in some way, e.g. by using > a keyword argument on the constructor or by subclassing it > and providing a new method for the error handling in question. There is no need for a string argument 'callback' and an additional callback function/method that is passed to the encoder. When the error argument is a string, the old mechanism can be used; when it is a callable object, the new one will be used. > [...] Bye, Walter Dörwald -- Walter Dörwald · LivingLogic AG · Bayreuth, Germany · www.livinglogic.de From mal@lemburg.com Fri Dec 22 18:15:38 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Fri, 22 Dec 2000 19:15:38 +0100 Subject: [I18n-sig] Proposal: Extended error handling for unicode.encode References: <200012201506250171.00D313E3@mail.tmt.de> <3A40FFF5.882E0D82@lemburg.com> <200012202054.VAA01458@loewis.home.cs.tu-berlin.de> <3A423E4D.88C7639@lemburg.com> <200012221632310203.0105EF8A@mail.livinglogic.de> Message-ID: <3A439A4A.B71F35DA@lemburg.com> "Walter Dörwald" wrote: > > On 21.12.00 at 18:30 M.-A. Lemburg wrote: > > [about state in encoders and error handlers] > But I don't see how this internal encoder state should influence > what the error handler does. There are two layers involved: The > character encoding layer and the "unencodable character escape > mechanism". Both layers are completely independent, even in your > "Unicode compression" example, where the "unencodable character > escape mechanism" is XML character entities. This is true for your XML entity escape example, but error resolving in general will likely need to know about the current state of the encoder, e.g. to be able to write data for the corresponding page in the Unicode compression example or to force a switch of the current page to a different one. I know that error handling could be more generic, but passing a callable object instead of the error parameter is not an option since the internal APIs all use a const char parameter for error. Besides, I consider such an approach a hack and not a solution. Instead of trying to tweak the implementation into providing some kind of new error scheme, let's focus on finding a generic framework which could provide a solution for the general case while not breaking the existing applications. > > Writing your own function helpers which then apply all the necessary > > magic is simple and doesn't warrant changing APIs in the core. > > It is not as simple as the error handler, but I could live with that. > > The big problem is that it effectively kills the speed of your > application. Every XML application written in Python, no matter > what it does internally, will in the end have to produce an output > bytestring. Normally the output encoding should be one that produces > no unencodable characters, but you have to be prepared to handle > them. With the error handler the complete encoding will be done > in C code (with very infrequent calls to the error handler), so > this scheme gives the best speed possible. It would give even better performance if the codec would provide this hook in some way at C level. Note that almost all codecs have their own error handlers written in C already.
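(A rough Python-level sketch of the "hook on the codec object" idea, using only the public codecs.Codec interface. Every name below is made up, the per-character loop is only there to keep the sketch short, and a real implementation would of course live at C level as discussed.)

    import codecs

    class HookedCodec(codecs.Codec):
        # Target encoding as a plain attribute, just to keep the sketch small.
        encoding = "ascii"

        def handle_unencodable(self, char):
            # Default behaviour corresponds to errors="strict".
            raise UnicodeError("can't encode %r" % char)

        def encode(self, input, errors="strict"):
            chunks = []
            for c in input:
                try:
                    chunks.append(c.encode(self.encoding, "strict"))
                except UnicodeError:
                    replacement = self.handle_unencodable(c)
                    chunks.append(replacement.encode(self.encoding, "strict"))
            return "".join(chunks), len(input)

    class XMLCharRefCodec(HookedCodec):
        # Overriding the hook gives the XML character reference behaviour
        # from Walter's example, with access to any state the codec keeps.
        def handle_unencodable(self, char):
            return u"&#%d;" % ord(char)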
> > Since the error handling is extensible by adding new options > > such as 'callback', > > I would prefer a more object oriented way of extending the error > handling. Sure, but we have to assure backward compatibility as well. > > the existing codecs could be extended to > > provide this functionality as well. We'd only need a way to > > pass the callback to the codecs in some way, e.g. by using > > a keyword argument on the constructor or by subclassing it > > and providing a new method for the error handling in question. > > There is no need for a string argument 'callback' and > an additional callback function/method that is passed to the > encoder. When the error argument is a string, the old mechanism > can be used, when it is a callable object the new will be used. This is bad style and also gives problems in the core implementation (have a look at unicodeobject.c). -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Sat Dec 23 12:27:31 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Sat, 23 Dec 2000 13:27:31 +0100 Subject: [I18n-sig] Proposal: Extended error handling for unicode.encode References: <200012201506250171.00D313E3@mail.tmt.de> <3A40FFF5.882E0D82@lemburg.com> <200012202054.VAA01458@loewis.home.cs.tu-berlin.de> <3A423E4D.88C7639@lemburg.com> <200012221357.OAA00921@loewis.home.cs.tu-berlin.de> Message-ID: <3A449A33.8DD063DD@lemburg.com> "Martin v. Loewis" wrote: > > > Hmm, I don't think this is generally useful. Using the codec > > instances directly would be the right way to go, IMHO. I don't > > want to overload .encode() or unicode() with too much functionality. > > Writing your own function helpers which then apply all the necessary > > magic is simple and doesn't warrant changing APIs in the core. > > Ok, then I have a challenge for you. Write a codec family that emits > XML character entities on encoding errors for any of the standard > Python codecs. If its really simple, then I'd *really* appreciate > concrete, working code. I really mean that - I doubt that this is > simple. If a problem arises doing it for all of the encodings, just > pick one. If that is still asked too much, outline a solution; > preferably one that is as efficient as would be the solution involving > the callback. Martin, I have a feeling that we both want to achieve the same thing. The only difference is that you want to add it fast and without reflecting about the APIs and needed changes, while I prefer to first draw up a design and then make a decision based on that design. The latter needs more time and some tossing around of ideas. Your approach is one of the possible ways to do this. Please let's not fight over this, but instead discuss a general design for error handlers. The design will have to assure (at least) these things: * backward compatibility * fast implementation * reuse of existing codecs * extensible * fits in with the existing C APIs (or extends these) * provides ways to set an error handler at C level as both C function and Python callable object About the "function helpers": see below. > > Since the error handling is extensible by adding new options such as > > 'callback', the existing codecs could be extended to provide this > > functionality as well. We'd only need a way to pass the callback to > > the codecs in some way, e.g. 
by using a keyword argument on the > > constructor or by subclassing it and providing a new method for the > > error handling in question. > > That solution is quite similar to the callback approach, so we could > probably chose either. I'm not entirely sure how the usage scenario > is. Did you think that users, instead of writing > > u.encode("koi8-r",errors=xmlcharentities) > > would write > > I,forgot,which,parameter = codecs.lookup("koi8-r") > encode = I() > encode.install_error_cb(xmlcharentities) > encode.encode(u,errors="callback") > > or did you have a more convenient API in mind? This is what I was referring to with the "function helpers" above. An alternative would probably be adding another optional argument to the .encode() method and the unicode() API: u.encode('koi8-r', 'callback', myerrorhandler) and unicode(data, 'koi8-r', 'callback', myerrorhandler) > [...] > > > Note: simply changing the error parameter to a PyObject doesn't > > work, since all C APIs expect a simple const char. > > Sure. Looking from the Python core side of the things, it's a large > change. Looking from the users' point of view, it's a small one. Right and that's why we have to be careful about the design. Cheers and Merry Christmas, -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/