From brian@garage.co.jp Fri Mar 10 07:59:01 2000 From: brian@garage.co.jp (Brian Takashi Hooper) Date: Fri, 10 Mar 2000 16:59:01 +0900 Subject: [I18n-sig] link: Lessons learned in internationalizing the ECMAScript standard Message-ID: <38C8AB455D.F6B6BRIAN@smtp.garage.co.jp> Hi - Python's obviously not JavaScript :-), but maybe there are some lessons which can be learned from this: http://www-4.ibm.com/software/developer/library/internationalization-support.html --Brian Hooper From mal@lemburg.com Fri Mar 10 09:00:58 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Fri, 10 Mar 2000 10:00:58 +0100 Subject: [I18n-sig] link: Lessons learned in internationalizing the ECMAScript standard References: <38C8AB455D.F6B6BRIAN@smtp.garage.co.jp> Message-ID: <38C8B9CA.8BF7AAF2@lemburg.com> Brian Takashi Hooper wrote: > > Hi - > > Python's obviously not JavaScript :-), but maybe there are some lessons > which can be learned from this: > > http://www-4.ibm.com/software/developer/library/internationalization-support.html The document makes some good points. I esp. like the sections about string operations w/r to i18n. Note that Python also uses UTF-16 as internal format, it does provide the combining character properties for all characters, but does not (in the core) have support to normalize strings. If someone needs this functionality a Unicode toolbox would be easy to write using the information from the Unicode database included in the core. BTW, the Python CVS version should include the Unicode patch RSN... I suppose, Guido is going to post an announcement about this too, so that the code can be put to some real world testing ;-) -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From andy@reportlab.com Fri Mar 10 10:06:04 2000 From: andy@reportlab.com (Andy Robinson) Date: Fri, 10 Mar 2000 10:06:04 -0000 Subject: [I18n-sig] Draft SIG page up for review Message-ID: A month too late, I have placed a draft page up at http://www.reportlab.com/i18n/i18nsig.html Any quick inclusions/omissions/errors, before it goes up on python.org? Thanks, Andy Robinson From mal@lemburg.com Fri Mar 10 11:23:21 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Fri, 10 Mar 2000 12:23:21 +0100 Subject: [I18n-sig] Draft SIG page up for review References: Message-ID: <38C8DB29.7EC6D17F@lemburg.com> Andy Robinson wrote: > > A month too late, I have placed a draft page up at > http://www.reportlab.com/i18n/i18nsig.html > > Any quick inclusions/omissions/errors, before it goes up on python.org? Looks fine... except maybe that we will want to change the email address to i18n-sig-owner. -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From andy@reportlab.com Fri Mar 10 13:53:59 2000 From: andy@reportlab.com (Andy Robinson) Date: Fri, 10 Mar 2000 13:53:59 -0000 Subject: [I18n-sig] Draft SIG page up for review In-Reply-To: <38C8DB29.7EC6D17F@lemburg.com> Message-ID: > Looks fine... except maybe that we will want to change > the email address to i18n-sig-owner. 
> It goes into a template - the real thing is up there now http://www.python.org/sigs/i18n-sig/ - Andy From guido@python.org Sat Mar 11 00:20:01 2000 From: guido@python.org (Guido van Rossum) Date: Fri, 10 Mar 2000 19:20:01 -0500 Subject: [I18n-sig] Unicode patches checked in Message-ID: <200003110020.TAA17777@eric.cnri.reston.va.us> I've just checked in a massive patch from Marc-Andre Lemburg which adds Unicode support to Python. This work was financially supported by Hewlett-Packard. Marc-Andre has done a tremendous amount of work, for which I cannot thank him enough. We're still awaiting some more things: Marc-Andre gave me documentation patches which will be reviewed by Fred Drake before they are checked in; Fredrik Lundh has developed a new regular expression which is Unicode-aware and which should be checked in real soon now. Also, the documentation is probably incomplete and will be updated, and of course there may be bugs -- this should be considered alpha software. However, I believe it is quite good already, otherwise I wouldn't have checked it in! I'd like to invite everyone with an interest in Unicode or Python 1.6 to check out this new Unicode-aware Python, so that we can ensure a robust code base by the time Python 1.6 is released (planned release date: June 1, 2000). The download links are below. Links: http://www.python.org/download/cvs.html Instructions on how to get access to the CVS version. (David Ascher is making nightly tarballs of the CVS version available at http://starship.python.net/crew/da/pythondists/) http://starship.python.net/crew/lemburg/unicode-proposal.txt The latest version of the specification on which the Marc has based his implementation. http://www.python.org/sigs/i18n-sig/ Home page of the i18n-sig (Internationalization SIG), which has lots of other links about this and related issues. http://www.python.org/search/search_bugs.html The Python Bugs List. Use this for all bug reports. Note that next Tuesday I'm going on a 10-day trip, with limited time to read email and no time to solve problems. The usual crowd will take care of urgent updates. See you at the Intel Computing Continuum Conference in San Francisco or at the Python Track at Software Development 2000 in San Jose! --Guido van Rossum (home page: http://www.python.org/~guido/) From shichang@icubed.com" I would love to test the Python 1.6 (Unicode support) in Chinese language aspect, but I don't know where I can get a copy of OS that supports Chinese. Anyone can point me a direction? -----Original Message----- From: Guido van Rossum [SMTP:guido@python.org] Sent: Saturday, March 11, 2000 12:20 AM To: Python mailing list; python-announce@python.org; python-dev@python.org; i18n-sig@python.org; string-sig@python.org Cc: Marc-Andre Lemburg Subject: Unicode patches checked in I've just checked in a massive patch from Marc-Andre Lemburg which adds Unicode support to Python. This work was financially supported by Hewlett-Packard. Marc-Andre has done a tremendous amount of work, for which I cannot thank him enough. We're still awaiting some more things: Marc-Andre gave me documentation patches which will be reviewed by Fred Drake before they are checked in; Fredrik Lundh has developed a new regular expression which is Unicode-aware and which should be checked in real soon now. Also, the documentation is probably incomplete and will be updated, and of course there may be bugs -- this should be considered alpha software. 
However, I believe it is quite good already, otherwise I wouldn't have checked it in! I'd like to invite everyone with an interest in Unicode or Python 1.6 to check out this new Unicode-aware Python, so that we can ensure a robust code base by the time Python 1.6 is released (planned release date: June 1, 2000). The download links are below. Links: http://www.python.org/download/cvs.html Instructions on how to get access to the CVS version. (David Ascher is making nightly tarballs of the CVS version available at http://starship.python.net/crew/da/pythondists/) http://starship.python.net/crew/lemburg/unicode-proposal.txt The latest version of the specification on which the Marc has based his implementation. http://www.python.org/sigs/i18n-sig/ Home page of the i18n-sig (Internationalization SIG), which has lots of other links about this and related issues. http://www.python.org/search/search_bugs.html The Python Bugs List. Use this for all bug reports. Note that next Tuesday I'm going on a 10-day trip, with limited time to read email and no time to solve problems. The usual crowd will take care of urgent updates. See you at the Intel Computing Continuum Conference in San Francisco or at the Python Track at Software Development 2000 in San Jose! --Guido van Rossum (home page: http://www.python.org/~guido/) -- http://www.python.org/mailman/listinfo/python-list From brian@garage.co.jp Mon Mar 13 12:05:50 2000 From: brian@garage.co.jp (Brian Takashi Hooper) Date: Mon, 13 Mar 2000 21:05:50 +0900 Subject: [I18n-sig] thinking of CJK codec, some questions Message-ID: <38CCD99EF9.16E2BRIAN@smtp.garage.co.jp> Hi there i18n-siggers - First of all, thank you very very much Marc-Andre (and Fredrik Lundh for the original implementation) for all your hard work, I checked out the CVS checkin yesterday and played with it a little, and took a print out of the source home with me. It seems really well thought out and organized. I scrutinized the code base thinking about issues for a CJK codec, and came up with a few questions: 1. Should the CJK ideograms also be included in the unicodehelpers numeric converters? From my perspective, I'd really like to see them go in, and think that it would make sense, too - any opinions? 2. Same as above with double-width alphanumeric characters - I assume these should probably also be included in the lowercase / uppercase helpers? Or will there be a way to add to these lists through the codec API (for those worried about data from unused codecs clogging up their character type helpers, maybe this would be a good option to have; I would by contrast like to be able to exclude all the extra Latin 1 stuff that I don't need, hmm.) 3. Same thing for whitespace - I think there are a number of double-width whitespace characters around also. 4. Are there any conventions for how non-standard codecs should be installed? Should they be added to Python's encodings directory, or should they just be added to site-packages or site-python like other third-party modules? 5. Are there any existing tools for converting from Unicode mapping files to a C source file that can be handily made into a dynamic library, or am I on my own there? Anyone who has any opinions on the above please chime in, I'm trying to start a discussion :-) ! Also, while I was reading the code, I found a few typos and spelling mistakes (for example the notoriously often misspelled 'occurrence'). 
While I doubt this is a very high priority, from watching the checkins list apparently Guido accepts spelling patches - so, I have a big context diff, who should I send it to? Thanks, -Brian Hooper From mal@lemburg.com Mon Mar 13 13:58:24 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Mon, 13 Mar 2000 14:58:24 +0100 Subject: [I18n-sig] thinking of CJK codec, some questions References: <38CCD99EF9.16E2BRIAN@smtp.garage.co.jp> Message-ID: <38CCF400.A7B64CDC@lemburg.com> Brian Takashi Hooper wrote: > > Hi there i18n-siggers - > > First of all, thank you very very much Marc-Andre (and Fredrik Lundh for > the original implementation) for all your hard work, I checked out the > CVS checkin yesterday and played with it a little, and took a print out > of the source home with me. It seems really well thought out and > organized. > > I scrutinized the code base thinking about issues for a CJK codec, and > came up with a few questions: > > 1. Should the CJK ideograms also be included in the unicodehelpers > numeric converters? From my perspective, I'd really like to see them go > in, and think that it would make sense, too - any opinions? > > 2. Same as above with double-width alphanumeric characters - I assume > these should probably also be included in the lowercase / uppercase > helpers? Or will there be a way to add to these lists through the codec > API (for those worried about data from unused codecs clogging up their > character type helpers, maybe this would be a good option to have; I > would by contrast like to be able to exclude all the extra Latin 1 stuff > that I don't need, hmm.) > > 3. Same thing for whitespace - I think there are a number of > double-width whitespace characters around also. I'm not sure I understand what you are intending here: the unicodectype.c file contains a switch statements which were deduced from the UnicodeData.txt file available at the Unicode.org FTP site. It contains all mappings which were defined in that files -- unless my parser omitted some. If you plan to add new mappings which are not part of the Unicode standard, I would suggest adding them to a separate module. E.g. you could extend the versions available through the unicodedata module. But beware: the Unicode methods only use the mappings defined in the unicodectype.c file. > 4. Are there any conventions for how non-standard codecs should be > installed? Should they be added to Python's encodings directory, or > should they just be added to site-packages or site-python like other > third-party modules? You can drop them anyplace you want... and then have them register a search function. The standard encodings package uses modules as codec basis but you could just as well provide other means of looking up and even creating codecs on-the-fly. Don't know what the standard installation method is... this hasn't been sorted out yet. My current thinking is to include all standard and small codecs in the standard dist and include the bigger ones in a separate Python add-on distribution (e.g. a tar file that gets untarred on top of an existing installation). A smart installer should ideally take care of this... > 5. Are there any existing tools for converting from Unicode mapping > files to a C source file that can be handily made into a dynamic > library, or am I on my own there? No, there is a tool to convert them to a Python source file though (Misc/gencodec.py). The created codecs will use the builtin generic mapping codec as basis for their work. 
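For concreteness, a module produced by gencodec.py for a simple one-byte character set has roughly the following shape (the exact generated output differs in detail, and the single mapping entry shown here is an invented placeholder, not real table data):

    import codecs

    class Codec(codecs.Codec):
        # encoding_map/decoding_map are plain dictionaries mapping byte
        # values to Unicode ordinals and back; the generic charmap codec
        # in the core does the actual per-character work.
        def encode(self, input, errors='strict'):
            return codecs.charmap_encode(input, errors, encoding_map)
        def decode(self, input, errors='strict'):
            return codecs.charmap_decode(input, errors, decoding_map)

    class StreamWriter(Codec, codecs.StreamWriter):
        pass

    class StreamReader(Codec, codecs.StreamReader):
        pass

    def getregentry():
        # hook called by the standard encodings package search function
        return (Codec().encode, Codec().decode, StreamReader, StreamWriter)

    decoding_map = {
        0x00a4: 0x20ac,   # placeholder entry: byte value -> Unicode ordinal
    }
    encoding_map = {}
    for k, v in decoding_map.items():
        encoding_map[v] = k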
If mappings get huge (like the CJK ones), I would create a new parser though, which then generates extension modules to have the mapping available as static C data rather than as Python dictionary on the heap... gencodec.py should provide a good template for such a tool. > Anyone who has any opinions on the above please chime in, I'm trying to > start a discussion :-) ! > > Also, while I was reading the code, I found a few typos and spelling > mistakes (for example the notoriously often misspelled 'occurrence'). Ahem ;-) -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From brian@garage.co.jp Mon Mar 13 14:42:41 2000 From: brian@garage.co.jp (Brian Takashi Hooper) Date: Mon, 13 Mar 2000 23:42:41 +0900 Subject: [I18n-sig] thinking of CJK codec, some questions In-Reply-To: <38CCF400.A7B64CDC@lemburg.com> References: <38CCD99EF9.16E2BRIAN@smtp.garage.co.jp> <38CCF400.A7B64CDC@lemburg.com> Message-ID: <38CCFE6129.16E7BRIAN@smtp.garage.co.jp> Hi again, On Mon, 13 Mar 2000 14:58:24 +0100 "M.-A. Lemburg" wrote: [snip] > I'm not sure I understand what you are intending here: the > unicodectype.c file contains a switch statements which were > deduced from the UnicodeData.txt file available at the > Unicode.org FTP site. It contains all mappings which were defined > in that files -- unless my parser omitted some. > > If you plan to add new mappings which are not part of the > Unicode standard, I would suggest adding them to a separate > module. E.g. you could extend the versions available through > the unicodedata module. But beware: the Unicode methods > only use the mappings defined in the unicodectype.c file. My mistake - I thought for some reason that double-width Latin characters, such as are used in Japanese, were part of the CJK ideogram code space that starts from \u3400, so I was expecting them to map to lower values in Unicode than they actually do (a double-width 'A', for example, is \uFF21. > > > 4. Are there any conventions for how non-standard codecs should be > > installed? Should they be added to Python's encodings directory, or > > should they just be added to site-packages or site-python like other > > third-party modules? > > You can drop them anyplace you want... and then have them > register a search function. The standard encodings package > uses modules as codec basis but you could just as well provide > other means of looking up and even creating codecs on-the-fly. > > Don't know what the standard installation method is... this > hasn't been sorted out yet. > > My current thinking is to include all standard and small > codecs in the standard dist and include the bigger ones > in a separate Python add-on distribution (e.g. a tar file > that gets untarred on top of an existing installation). > A smart installer should ideally take care of this... Maybe one using Distutils? I guess it would make the most sense if you run the install script with /usr/local/bin/python, for example, then the codecs would get installed in the proper place for that Python installation to use them... > > > 5. Are there any existing tools for converting from Unicode mapping > > files to a C source file that can be handily made into a dynamic > > library, or am I on my own there? > > No, there is a tool to convert them to a Python source file > though (Misc/gencodec.py). The created codecs will use the > builtin generic mapping codec as basis for their work. 
> > If mappings get huge (like the CJK ones), I would create a > new parser though, which then generates extension modules > to have the mapping available as static C data rather > than as Python dictionary on the heap... gencodec.py > should provide a good template for such a tool. You recommend in the unicode proposal that the mapping should probably be a buildable as a shared library, to allow multiple interpreter instances to share the table - for platforms which don't support this option, then, would it make sense to make the codec such that the mapping tables can be statically linked into the interpreter? Or, in such a case, do you think would it be better to try to set things up so that the mapping tables can be read from a file? --Brian From mal@lemburg.com Mon Mar 13 15:47:44 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Mon, 13 Mar 2000 16:47:44 +0100 Subject: [I18n-sig] thinking of CJK codec, some questions References: <38CCD99EF9.16E2BRIAN@smtp.garage.co.jp> <38CCF400.A7B64CDC@lemburg.com> <38CCFE6129.16E7BRIAN@smtp.garage.co.jp> Message-ID: <38CD0DA0.8DD4FC38@lemburg.com> Brian Takashi Hooper wrote: > > > I'm not sure I understand what you are intending here: the > > unicodectype.c file contains a switch statements which were > > deduced from the UnicodeData.txt file available at the > > Unicode.org FTP site. It contains all mappings which were defined > > in that files -- unless my parser omitted some. > > > > If you plan to add new mappings which are not part of the > > Unicode standard, I would suggest adding them to a separate > > module. E.g. you could extend the versions available through > > the unicodedata module. But beware: the Unicode methods > > only use the mappings defined in the unicodectype.c file. > My mistake - I thought for some reason that double-width Latin > characters, such as are used in Japanese, were part of the CJK ideogram > code space that starts from \u3400, so I was expecting them to map to > lower values in Unicode than they actually do (a double-width 'A', for > example, is \uFF21. Unicode is built upon ASCII -- I don't think that other encodings were taken into account during the ordinal assignment (not 100% sure though). You should be able to get at the numeric information of DBCS chars (this is what you're talking about, right ?) by first converting them to Unicode. > > > > > 4. Are there any conventions for how non-standard codecs should be > > > installed? Should they be added to Python's encodings directory, or > > > should they just be added to site-packages or site-python like other > > > third-party modules? > > > > You can drop them anyplace you want... and then have them > > register a search function. The standard encodings package > > uses modules as codec basis but you could just as well provide > > other means of looking up and even creating codecs on-the-fly. > > > > Don't know what the standard installation method is... this > > hasn't been sorted out yet. > > > > My current thinking is to include all standard and small > > codecs in the standard dist and include the bigger ones > > in a separate Python add-on distribution (e.g. a tar file > > that gets untarred on top of an existing installation). > > A smart installer should ideally take care of this... > Maybe one using Distutils? I guess it would make the most sense if you > run the install script with /usr/local/bin/python, for example, then the > codecs would get installed in the proper place for that Python > installation to use them... Right. 
distutils could be a solution on Unix -- the problem of using distutils is that you first have to have a working Python installation for it to work, so such an approach would only work in two steps: first Python core, then extended codecs package. > > > > > 5. Are there any existing tools for converting from Unicode mapping > > > files to a C source file that can be handily made into a dynamic > > > library, or am I on my own there? > > > > No, there is a tool to convert them to a Python source file > > though (Misc/gencodec.py). The created codecs will use the > > builtin generic mapping codec as basis for their work. > > > > If mappings get huge (like the CJK ones), I would create a > > new parser though, which then generates extension modules > > to have the mapping available as static C data rather > > than as Python dictionary on the heap... gencodec.py > > should provide a good template for such a tool. > You recommend in the unicode proposal that the mapping should probably > be a buildable as a shared library, to allow multiple interpreter > instances to share the table - for platforms which don't support this > option, then, would it make sense to make the codec such that the > mapping tables can be statically linked into the interpreter? Or, in > such a case, do you think would it be better to try to set things up so > that the mapping tables can be read from a file? Since memory mapped files are not supported by Python per default I would suggest letting the system linker take care of sharing the constant C data from a shared (or statically linked) extension module. Reading the information directly from a file would probably be too slow. Note that the module would only have to provide a simple __getitem__ interface compatible object which then fetches the data from the static C data. The rest can then be done in Python in the same way as the other mapping codecs do their job. -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From brian@garage.co.jp Tue Mar 14 08:10:47 2000 From: brian@garage.co.jp (Brian Takashi Hooper) Date: Tue, 14 Mar 2000 17:10:47 +0900 Subject: [I18n-sig] thinking of CJK codec, some questions In-Reply-To: <38CD0DA0.8DD4FC38@lemburg.com> References: <38CCFE6129.16E7BRIAN@smtp.garage.co.jp> <38CD0DA0.8DD4FC38@lemburg.com> Message-ID: <38CDF40712.82E2BRIAN@smtp.garage.co.jp> Hi! On Mon, 13 Mar 2000 16:47:44 +0100 "M.-A. Lemburg" wrote: [snip] > Unicode is built upon ASCII -- I don't think that other encodings > were taken into account during the ordinal assignment (not 100% > sure though). > > You should be able to get at the numeric information of DBCS > chars (this is what you're talking about, right ?) by first > converting them to Unicode. Yes - it looks like this is the case :-). > > > > > > > > 4. Are there any conventions for how non-standard codecs should be > > > > installed? Should they be added to Python's encodings directory, or > > > > should they just be added to site-packages or site-python like other > > > > third-party modules? > > > > > > You can drop them anyplace you want... and then have them > > > register a search function. The standard encodings package > > > uses modules as codec basis but you could just as well provide > > > other means of looking up and even creating codecs on-the-fly. > > > > > > Don't know what the standard installation method is... this > > > hasn't been sorted out yet. 
> > > > > > My current thinking is to include all standard and small > > > codecs in the standard dist and include the bigger ones > > > in a separate Python add-on distribution (e.g. a tar file > > > that gets untarred on top of an existing installation). > > > A smart installer should ideally take care of this... > > Maybe one using Distutils? I guess it would make the most sense if you > > run the install script with /usr/local/bin/python, for example, then the > > codecs would get installed in the proper place for that Python > > installation to use them... > > Right. distutils could be a solution on Unix -- the problem > of using distutils is that you first have to have a working > Python installation for it to work, so such an approach > would only work in two steps: first Python core, then extended > codecs package. I guess, then it would be nice to have something that could work in either case... Should encoding support be an option to ./configure, when you are first building Python? General question to everyone out there - should it be possible to intentionally build Python without Unicode support? > > > > > > > > 5. Are there any existing tools for converting from Unicode mapping > > > > files to a C source file that can be handily made into a dynamic > > > > library, or am I on my own there? > > > > > > No, there is a tool to convert them to a Python source file > > > though (Misc/gencodec.py). The created codecs will use the > > > builtin generic mapping codec as basis for their work. > > > > > > If mappings get huge (like the CJK ones), I would create a > > > new parser though, which then generates extension modules > > > to have the mapping available as static C data rather > > > than as Python dictionary on the heap... gencodec.py > > > should provide a good template for such a tool. > > You recommend in the unicode proposal that the mapping should probably > > be a buildable as a shared library, to allow multiple interpreter > > instances to share the table - for platforms which don't support this > > option, then, would it make sense to make the codec such that the > > mapping tables can be statically linked into the interpreter? Or, in > > such a case, do you think would it be better to try to set things up so > > that the mapping tables can be read from a file? > > Since memory mapped files are not supported by Python per > default I would suggest letting the system linker take care of > sharing the constant C data from a shared (or statically linked) > extension module. Reading the information directly from a file > would probably be too slow. > > Note that the module would only have to provide a simple > __getitem__ interface compatible object which then fetches > the data from the static C data. The rest can then be done > in Python in the same way as the other mapping codecs do their > job. Am I right in thinking that 'static C data' means something like static Py_UNICODE mapping[] = { ... }; ? Also, from a design standpoint do you (and anyone else on i18n) think it would be better to emphasize speed and / or memory efficiency by making specialized codecs for the different CJK encodings (for example, if a table such as the above is used, then in the case of a particular encoding, for example EUC, it may be possible to reduce the size of the table by introducing some EUC-specific casing into the encoder/decoder), or would it be better to try for a generalized implementation? 
We need something like codecs.charset_encode and codecs.charset_decode for CJK char sets - I was thinking that this might be best handled by a few separate C modules (for Japanese, one for SJIS, one for EUC, and one for JIS) that would in turn use similarly defined mapping modules, containing only one or more static conversion maps as arrays - in this sense I am leaning towards making tuned codecs for each encoding set. I want to try to make something that many people can use - does this sound like a reasonable approach, or am I on the wrong track here? --Brian From mal@lemburg.com Tue Mar 14 09:55:24 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Tue, 14 Mar 2000 10:55:24 +0100 Subject: [I18n-sig] thinking of CJK codec, some questions References: <38CCFE6129.16E7BRIAN@smtp.garage.co.jp> <38CD0DA0.8DD4FC38@lemburg.com> <38CDF40712.82E2BRIAN@smtp.garage.co.jp> Message-ID: <38CE0C8C.E0518D0D@lemburg.com> Brian Takashi Hooper wrote: > > Should encoding support be an option to ./configure, when you are first > building Python? General question to everyone out there - should it be > possible to intentionally build Python without Unicode support? How would you do this using configure ? As for the exclusion of Unicode: this is currently not planned. Doing this would cause the code to become very inelegant due to the many #ifdefs this introduces (the problem here being that Unicode support is tightly integrated into the interpreter in many places). > [Tools for creating codecs from mappings] > > > Note that the module would only have to provide a simple > > __getitem__ interface compatible object which then fetches > > the data from the static C data. The rest can then be done > > in Python in the same way as the other mapping codecs do their > > job. > Am I right in thinking that 'static C data' means something like > > static Py_UNICODE mapping[] = { ... }; Right. > ? Also, from a design standpoint do you (and anyone else on i18n) think > it would be better to emphasize speed and / or memory efficiency by > making specialized codecs for the different CJK encodings (for example, > if a table such as the above is used, then in the case of a particular > encoding, for example EUC, it may be possible to reduce the size of the > table by introducing some EUC-specific casing into the encoder/decoder), > or would it be better to try for a generalized implementation? How about a lib of common functions needed for CJK and then a few small extra modules for each of the specific codecs. Fast encoders/decoder should be done in C, the whole class business in Python. > We need > something like codecs.charset_encode and codecs.charset_decode for CJK > char sets - I was thinking that this might be best handled by a few > separate C modules (for Japanese, one for SJIS, one for EUC, and one for > JIS) that would in turn use similarly defined mapping modules, > containing only one or more static conversion maps as arrays - in this > sense I am leaning towards making tuned codecs for each encoding set. Andy mentioned that it should be possible to write codecs which do a couple of smaller switches and implement the other mappings using some more intelligent logic. The example I gave above has to be seen in the light of using the generic mapping codec -- which probably is not very much use in a multi-byte encoding world since it currently only supports 1-1 mappings. I'd suggest going Andy's way for the CJK codecs... Andy ? 
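To make the "fast C encoders/decoders, class business in Python" split concrete, a minimal sketch of the Python-side glue for one encoding, assuming a hypothetical C extension module _eucjp that holds the tables and fast loops and exposes encode()/decode() functions (the module and function names are invented for illustration, not an existing API):

    import codecs
    import _eucjp   # hypothetical C extension: static tables plus fast loops

    class Codec(codecs.Codec):
        def encode(self, input, errors='strict'):
            # the C function is assumed to return (encoded string, length consumed)
            return _eucjp.encode(input, errors)
        def decode(self, input, errors='strict'):
            return _eucjp.decode(input, errors)

    # StreamReader/StreamWriter subclasses and a getregentry() hook would
    # follow the same pattern as in the gencodec-style module sketched earlier.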
> I want to try to make something that many people can use - does this > sound like a reasonable approach, or am I on the wrong track here? Don't think so :-) -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From brian@garage.co.jp Wed Mar 15 08:51:49 2000 From: brian@garage.co.jp (Brian Takashi Hooper) Date: Wed, 15 Mar 2000 17:51:49 +0900 Subject: [I18n-sig] thinking of CJK codec, some questions Message-ID: <38CF4F25F.B74EBRIAN@smtp.garage.co.jp> Hi, > Andy mentioned that it should be possible to write codecs > which do a couple of smaller switches and implement the other > mappings using some more intelligent logic. > > The example I gave above has to be seen in the light of using the > generic mapping codec -- which probably is not very much use in a > multi-byte encoding world since it currently only supports > 1-1 mappings. > > I'd suggest going Andy's way for the CJK codecs... Andy ? I like the idea of an encoding/decoding state machine, and have started thinking about how this would work in the breakdown for the CJKV codecs - what I've got is kind of like this: The top level class interfaces, and the StreamReader/Writer classes as well, will be in Python - I think we can probably group these generally into modal and non-modal encoding schemes (ISO-2022-JP being an example of the first, and EUC being an example of the second), the difference between the two being largely a matter of how streams are handled. (Note: Andy, please pipe in if I'm misrepresenting your idea, or even if I'm not, I'd like to know what you think about all this!) For the encoders/decoders I like Andy's idea of trying to generalize out a kind of 'mini-language' ala mxTextTools for specifying encoding/decoding logic separately and then just have a generalized engine that can generically handle multi-byte mapping tasks. So, the main task then is to come up with a generalization that can encompass all of the manipulations which might be necessary in order to specify the behavior of the mapping machine: 1. one thing it should definitely be able to do is specify a byte offset for data in a static table. So, for example, if I have something like: static Py_UNICODE euc2unicode[] = { 0x3000, 0x3001, ... }; I should know to start indexing from (adding 0x8080 to the first JIS 0208 character, 0x2121) 0xa1a1, that is, EUC 0xa1a2 should be converted by looking up euc2unicode[1] => 0x3001 in Unicode. 2. another thing that it would be good to be able to do, I think, is to be able to somehow specify which map to look in. So, a character set should be able to be stored in multiple, non-contiguous static arrays; again using the example of EUC, the code set 2 zone (stuff that begins with 8e) should refer to a different mapping table than the code set 1 stuff (the regular JIS 0208 zone for EUC-JP). So, the encoder would be able to say -> OK, for a character in this range, I should look up the value at this offset into this mapping table. For EUC-JP, this would look like:

    first character    look in table           at offset
    0x21-7e            JIS-Roman->Unicode      - 0x21
    0xa1-fe            JIS 0208->Unicode       - 0x8080
    0x8e               HW Katakana->Unicode    - 0x8e00 (from JIS-Roman)
    0x8f               JIS 0212->Unicode       - 0x8080 (lookup w/ second & third bytes)

Actually, looking at this a little more, probably there should be a way of calculating the map index given some info about the dimensions of the map, i.e.
it should be possible to set more than one offset, so that instead of having to have a table with a lot of extra placeholding space in it, then we know that if we have a 94x94 matrix (pretty common in the Japanese encodings, as you know), then we can store all the data in a 5590-element array and just index it according to our chosen offsets. 3. coming back from Unicode I'm wondering a little about this, since when we're coming back from Unicode basically we have no choice (that I can think of) but to have 2^16 * (max number of bytes in target encoding), with placeholders where there is no mapping. So, for something like EUC-TW, which has a maximum of 4 bytes per character, we need an encoding map 256K in size... is there a better way, that doesn't waste so much space? 'Course, I would hope that the Taiwanese would put enough memory in their machines (since memory's pretty cheap there). I guess the encoder/decoder should also know about how to do modal encodings - I guess this is easier though if we can assume we have the whole string, or some convenient chunk of it, to do encoding on. Or maybe modal and non-modal encoders/decoders should be separately implemented (possibly, sharing utility functions)? I still have to look at more examples of asian encodings and especially the ISO-2022 style ones, and vendor encodings, to get a better idea of what manipulations they should do. I was also thinking that the maps, to keep them separate from the encoders/decoders themselves, would be degenerate Python modules that would return void * pointers to the mapping tables via PyCObjects... this seemed to me to be a good way to do maps which will primarily be accessed by other C modules, rather than by Python... does this seem like an OK thing? Awaiting further enlightenment, --Brian From mal@lemburg.com Wed Mar 15 14:36:34 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Wed, 15 Mar 2000 15:36:34 +0100 Subject: [I18n-sig] thinking of CJK codec, some questions References: <38CF4F25F.B74EBRIAN@smtp.garage.co.jp> Message-ID: <38CF9FF2.5E558813@lemburg.com> Just a few comments about the design (don't have any knowledge about Asian encodings): 1. Keep large mapping tables in single automatically generated C modules that export a lookup object (ones that define __getitem__). These could also be generated using some perfect hash table generator, BTW, to reduce memory consumption. 2. Write small special encoders/decoders that take the lookup table objects as argument. 3. Glue both together using Python code -- forget about the PyCObject idea :-) ... it causes too many problems when the import fails. -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From brian@garage.co.jp Wed Mar 15 15:07:44 2000 From: brian@garage.co.jp (Brian Takashi Hooper) Date: Thu, 16 Mar 2000 00:07:44 +0900 Subject: [I18n-sig] thinking of CJK codec, some questions In-Reply-To: <38CF9FF2.5E558813@lemburg.com> References: <38CF4F25F.B74EBRIAN@smtp.garage.co.jp> <38CF9FF2.5E558813@lemburg.com> Message-ID: <38CFA74044.B752BRIAN@smtp.garage.co.jp> Thanks, this is great advice, and the kind of feedback I have been looking for! Especially about not using PyCObject, which seemed like the thing to do but I have to admit some naivete about its proper use. I'll try thinking about this a bit more, along the lines you suggest. --Brian On Wed, 15 Mar 2000 15:36:34 +0100 "M.-A. 
Lemburg" wrote: > Just a few comments about the design (don't have any knowledge > about Asian encodings): > > 1. Keep large mapping tables in single automatically generated C > modules that export a lookup object (ones that define __getitem__). > These could also be generated using some perfect hash table > generator, BTW, to reduce memory consumption. > > 2. Write small special encoders/decoders that take the lookup > table objects as argument. > > 3. Glue both together using Python code -- forget about the > PyCObject idea :-) ... it causes too many problems when the import > fails. > > -- > Marc-Andre Lemburg > ______________________________________________________________________ > Business: http://www.lemburg.com/ > Python Pages: http://www.lemburg.com/python/ > > > > _______________________________________________ > I18n-sig mailing list > I18n-sig@python.org > http://www.python.org/mailman/listinfo/i18n-sig > From chris@ccbs.ntu.edu.tw Thu Mar 16 04:14:08 2000 From: chris@ccbs.ntu.edu.tw (Christian Wittern) Date: Thu, 16 Mar 2000 12:14:08 +0800 Subject: [I18n-sig] CJK codecs etc Message-ID: Hi everybody, I have some comments about CJK codecs, which are more from a user than a programmers perspective. 1.) Please provide a (configurable?) fallback for failed conversions. This is of course especially needed for conversions out of Unicode. What I have in mind is, for example, provide the Unicode codepoint as entity (&U-4e00;) or Java escape or some such, depending on the users choice. Don't just give a '?', what M$'s braindead conversion routines do and thus regularily drive me nuts. 2.) On the same topic, there are some fairly frequently codepoints that map to different codepoints in Japanese and Taiwans encoding, although this is in most cases not expected. These codepoints should have been eliminated by Unicodes unification rules, but crept in via the source-encoding separation rule -- not a very good decision in my opinion. I have a list of some such characters at http://www.chibs.edu.tw/~chris/smart/cjkconv.htm, Ideally, there should be a way for the user to influence the conversion by providing a list of his choice (with his modifications) to the codec, to overlay the predefined values. 3.) The nasty problem of user defined characters. I think there should be a default mapping of the user defined area in DBCS encodings to the Unicode code range for user characters. Microsoft uses fixed sequential tables and I think that is a good idea, since it is pretty straightforward. In big5 for example, the area of user defined characters starts at Fa40, Fa41 ..., which gets mapped to Unicode E000, E001, .. There should also be an option to use some kind of entity reference instead. 4.) I developped years ago the habit of using entity references for any characters not representable in the given characterset used by the system. I have seen this becoming more widespread in the user communities I work with. It would be very useful for us, if the Unicode conversion routines in Python could be told to tread some arbitray entity references (we use things like &M24501; for the characters assigned by the Mojikyo Font Institute (see www.mojikyo.gr.jp) and &C4-4e21; for characters in the Taiwanese CNS encoding). I realize that this is a rather specialised usage, but it would be great and very helpful to have some hook in the system to treat this stuff just like any other character. Any comments? All the best, Christian Dr. 
Christian Wittern Chung-Hwa Institute of Buddhist Studies 276, Kuang Ming Road, Peitou 112 Taipei, TAIWAN Tel. +886-2-2892-6111#65, Email chris@ccbs.ntu.edu.tw From brian@garage.co.jp Thu Mar 16 06:49:23 2000 From: brian@garage.co.jp (Brian Takashi Hooper) Date: Thu, 16 Mar 2000 15:49:23 +0900 Subject: [I18n-sig] thinking of CJK codec, some questions In-Reply-To: <38CF9FF2.5E558813@lemburg.com> References: <38CF4F25F.B74EBRIAN@smtp.garage.co.jp> <38CF9FF2.5E558813@lemburg.com> Message-ID: <38D083F3254.189CBRIAN@smtp.garage.co.jp> On Wed, 15 Mar 2000 15:36:34 +0100 "M.-A. Lemburg" wrote: > Just a few comments about the design (don't have any knowledge > about Asian encodings): > > 1. Keep large mapping tables in single automatically generated C > modules that export a lookup object (ones that define __getitem__). > These could also be generated using some perfect hash table > generator, BTW, to reduce memory consumption. After researching perfect hash tables a little and thinking about it a little more, a question: I think this could work well for the decoding maps, but for encoding (from Unicode to a legacy encoding), wouldn't I have to be able to detect misses in my hash lookup? For example, if I had a string in Unicode that I was trying to convert to EUC-JP, and I looked up a Unicode character that has no mapping to EUC-JP, with a regular hash I my lookup will still succeed and I'll get back an EUC character anyway, but the wrong one... The only way I could think of to avoid this would be to store the key as part of the value (or alternately some kind of unique checksum), and then after lookup compare the original key to the key that was looked up in the table; if they are the same, then I've got a valid mapping, and if they are different than my lookup failed, and I should return some kind of sentinel value (0xFFFF or something?). Since the Unicode keys are all two bytes apiece, and for some of the largest CJK encoding standards the values are a max of 4 bytes long (e.g. EUC-TW), I then need to define my mapping table as containing values 8 bytes each in length, right? (assuming that I should keep the values of the array aligned along machine words) Is this complication worth the space savings, I wonder? I think a table built this way might be a little smaller than an unhashed plain old table, since in a three- or four-byte encoding there are generally always a lot fewer mapped values than there are spaces in the available plane... maybe I should go with mappings as simple array first and then figure out how to make them smaller if it seems to matter a lot to people? This is a pretty much a speed vs. space issue. Opinions? Does anyone have a cleverer way to detect the validity of a hash lookup? --Brian From mal@lemburg.com Thu Mar 16 10:21:29 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Thu, 16 Mar 2000 11:21:29 +0100 Subject: [I18n-sig] thinking of CJK codec, some questions References: <38CF4F25F.B74EBRIAN@smtp.garage.co.jp> <38CF9FF2.5E558813@lemburg.com> <38D083F3254.189CBRIAN@smtp.garage.co.jp> Message-ID: <38D0B5A9.1A4A66E9@lemburg.com> Brian Takashi Hooper wrote: > > On Wed, 15 Mar 2000 15:36:34 +0100 > "M.-A. Lemburg" wrote: > > > Just a few comments about the design (don't have any knowledge > > about Asian encodings): > > > > 1. Keep large mapping tables in single automatically generated C > > modules that export a lookup object (ones that define __getitem__). > > These could also be generated using some perfect hash table > > generator, BTW, to reduce memory consumption. 
> After researching perfect hash tables a little and thinking about it a > little more, a question: I think this could work well for the decoding > maps, but for encoding (from Unicode to a legacy encoding), wouldn't I > have to be able to detect misses in my hash lookup? I'd suggest using the same technique as Python: lookup the hash(key) value and then compare the found entry (key,value) with the looked up key. Since we are lucky, you can use the identity function as hash function... keys still are Unicode ordinals, but now you also store them in the mapping result (some redundance, but better than putting them together with the keys). > For example, if I > had a string in Unicode that I was trying to convert to EUC-JP, and I > looked up a Unicode character that has no mapping to EUC-JP, with a > regular hash I my lookup will still succeed and I'll get back an EUC > character anyway, but the wrong one... The only way I could think of to > avoid this would be to store the key as part of the value (or > alternately some kind of unique checksum), and then after lookup compare > the original key to the key that was looked up in the table; if they are > the same, then I've got a valid mapping, and if they are different than > my lookup failed, and I should return some kind of sentinel value > (0xFFFF or something?). Since the Unicode keys are all two bytes > apiece, and for some of the largest CJK encoding standards the values > are a max of 4 bytes long (e.g. EUC-TW), I then need to define my > mapping table as containing values 8 bytes each in length, right? See above. > (assuming that I should keep the values of the array aligned along > machine words) Is this complication worth the space savings, I wonder? > I think a table built this way might be a little smaller than an > unhashed plain old table, since in a three- or four-byte encoding there > are generally always a lot fewer mapped values than there are spaces in > the available plane... maybe I should go with mappings as simple array > first and then figure out how to make them smaller if it seems to matter > a lot to people? This is a pretty much a speed vs. space issue. This is probably the way to go. You can always exchange the mapping table against some other technique as long as the interface stays the same. I think what more important now, is focussing on the encoders and decoders... > Opinions? Does anyone have a cleverer way to detect the validity of a > hash lookup? -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Thu Mar 16 10:35:04 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Thu, 16 Mar 2000 11:35:04 +0100 Subject: [I18n-sig] CJK codecs etc References: Message-ID: <38D0B8D8.5AC5C59C@lemburg.com> Christian Wittern wrote: > > Hi everybody, > > I have some comments about CJK codecs, which are more from a user than a > programmers perspective. > > 1.) Please provide a (configurable?) fallback for failed conversions. This > is of course especially needed for conversions out of Unicode. What I have > in mind is, for example, provide the Unicode codepoint as entity (&U-4e00;) > or Java escape or some such, depending on the users choice. Don't just give > a '?', what M$'s braindead conversion routines do and thus regularily drive > me nuts. Please read the Misc/unicode.txt file. There are different error handling techniques available... 
'strict' (raise an error), 'ignore' (ignore the failed mapping), 'replace' (replace the failed mapping by some codec specific replacement char, e.g. '?'). The error argument is codec specific -- the above values must work though. > 2.) On the same topic, there are some fairly frequently codepoints that map > to different codepoints in Japanese and Taiwans encoding, although this is > in most cases not expected. These codepoints should have been eliminated by > Unicodes unification rules, but crept in via the source-encoding separation > rule -- not a very good decision in my opinion. I have a list of some such > characters at http://www.chibs.edu.tw/~chris/smart/cjkconv.htm, Ideally, > there should be a way for the user to influence the conversion by providing > a list of his choice (with his modifications) to the codec, to overlay the > predefined values. Everybody can write their own codecs... so no comment on this one ;-) > 3.) The nasty problem of user defined characters. I think there should be a > default mapping of the user defined area in DBCS encodings to the Unicode > code range for user characters. Microsoft uses fixed sequential tables and I > think that is a good idea, since it is pretty straightforward. In big5 for > example, the area of user defined characters starts at Fa40, Fa41 ..., which > gets mapped to Unicode E000, E001, .. There should also be an option to use > some kind of entity reference instead. The core Python Unicode implementation doesn't touch these private code areas at all. This issue is left to the codecs. Since they are probably of some importance to the Asian world due to the many corporate char sets, I guess the Asian codecs should provide some kind of logic to handle these areas as special cases... perhaps by passing an extra mapping table to the codec. > 4.) I developped years ago the habit of using entity references for any > characters not representable in the given characterset used by the system. I > have seen this becoming more widespread in the user communities I work with. > It would be very useful for us, if the Unicode conversion routines in Python > could be told to tread some arbitray entity references (we use things like > &M24501; for the characters assigned by the Mojikyo Font Institute (see > www.mojikyo.gr.jp) and &C4-4e21; for characters in the Taiwanese CNS > encoding). I realize that this is a rather specialised usage, but it would > be great and very helpful to have some hook in the system to treat this > stuff just like any other character. Hmm, sounds like some kind of SGML entity codec could solve this aspect... -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From chris@ccbs.ntu.edu.tw Fri Mar 17 07:01:14 2000 From: chris@ccbs.ntu.edu.tw (Christian Wittern) Date: Fri, 17 Mar 2000 15:01:14 +0800 Subject: [I18n-sig] CJK codecs etc In-Reply-To: <38D0B8D8.5AC5C59C@lemburg.com> Message-ID: Marc-Andre Lemburg wrote: > Christian Wittern wrote: > > > > > > 1.) Please provide a (configurable?) fallback for failed > conversions. This > > is of course especially needed for conversions out of Unicode. > What I have > > in mind is, for example, provide the Unicode codepoint as > entity (&U-4e00;) > > or Java escape or some such, depending on the users choice. > Don't just give > > a '?', what M$'s braindead conversion routines do and thus > regularily drive > > me nuts. > > Please read the Misc/unicode.txt file. 
There are different error > handling techniques available... 'strict' (raise an error), > 'ignore' (ignore the failed mapping), 'replace' (replace the > failed mapping by some codec specific replacement char, e.g. '?'). Err. If you read my comment above, this is exactly what I *don't* want to see, since this is of no help at all. What I want to have is a fallback mechanism, that preserves the information contained in the file (or maps it to some other second best match). Simple raising an error or putting in some default char is not helpful to the user at all!!! Christian > > The error argument is codec specific -- the above values must > work though. > > > 2.) On the same topic, there are some fairly frequently > codepoints that map > > to different codepoints in Japanese and Taiwans encoding, > although this is > > in most cases not expected. These codepoints should have been > eliminated by > > Unicodes unification rules, but crept in via the > source-encoding separation > > rule -- not a very good decision in my opinion. I have a list > of some such > > characters at http://www.chibs.edu.tw/~chris/smart/cjkconv.htm, Ideally, > > there should be a way for the user to influence the conversion > by providing > > a list of his choice (with his modifications) to the codec, to > overlay the > > predefined values. > > Everybody can write their own codecs... so no comment on this one ;-) > > > 3.) The nasty problem of user defined characters. I think there > should be a > > default mapping of the user defined area in DBCS encodings to > the Unicode > > code range for user characters. Microsoft uses fixed sequential > tables and I > > think that is a good idea, since it is pretty straightforward. > In big5 for > > example, the area of user defined characters starts at Fa40, > Fa41 ..., which > > gets mapped to Unicode E000, E001, .. There should also be an > option to use > > some kind of entity reference instead. > > The core Python Unicode implementation doesn't touch these > private code areas at all. This issue is left to the codecs. > > Since they are probably of some importance to the Asian world > due to the many corporate char sets, I guess the Asian codecs > should provide some kind of logic to handle these areas as > special cases... perhaps by passing an extra mapping table > to the codec. That would solve the above point 2 as well and is all I have in mind here: Leave some hook that the user can pass some overlayed extra mapping table, without having to write a codec of his own. ALthough I realize the latter is possible, I don't think it is practicle and maybe not even desirable. I don't want to design a different car from scratch, just because I don't like the color:-) > > > 4.) I developped years ago the habit of using entity references for any > > characters not representable in the given characterset used by > the system. I > > have seen this becoming more widespread in the user communities > I work with. > > It would be very useful for us, if the Unicode conversion > routines in Python > > could be told to tread some arbitray entity references (we use > things like > > &M24501; for the characters assigned by the Mojikyo Font Institute (see > > www.mojikyo.gr.jp) and &C4-4e21; for characters in the Taiwanese CNS > > encoding). I realize that this is a rather specialised usage, > but it would > > be great and very helpful to have some hook in the system to treat this > > stuff just like any other character. 
> > Hmm, sounds like some kind of SGML entity codec could solve this > aspect... Right, but how would that be integrated with the other codecs? Christian Wittern, Taipei From mal@lemburg.com Fri Mar 17 08:40:49 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Fri, 17 Mar 2000 09:40:49 +0100 Subject: [I18n-sig] CJK codecs etc References: Message-ID: <38D1EF91.6B78B027@lemburg.com> Christian Wittern wrote: > > Marc-Andre Lemburg wrote: > > > Christian Wittern wrote: > > > > > > > > > 1.) Please provide a (configurable?) fallback for failed > > conversions. This > > > is of course especially needed for conversions out of Unicode. > > What I have > > > in mind is, for example, provide the Unicode codepoint as > > entity (&U-4e00;) > > > or Java escape or some such, depending on the users choice. > > Don't just give > > > a '?', what M$'s braindead conversion routines do and thus > > regularily drive > > > me nuts. > > > > Please read the Misc/unicode.txt file. There are different error > > handling techniques available... 'strict' (raise an error), > > 'ignore' (ignore the failed mapping), 'replace' (replace the > > failed mapping by some codec specific replacement char, e.g. '?'). > > Err. If you read my comment above, this is exactly what I *don't* want to > see, since this is of no help at all. What I want to have is a fallback > mechanism, that preserves the information contained in the file (or maps it > to some other second best match). Simple raising an error or putting in some > default char is not helpful to the user at all!!! Codecs may provide more than these three error handling modes -- the only requirement is that at least these three are defined. Note that 'replace' and 'ignore' do have their value when it comes to writing code that puts more priority on working without errors than 100% percent correct output. > > The error argument is codec specific -- the above values must > > work though. > > > > > 2.) On the same topic, there are some fairly frequently > > codepoints that map > > > to different codepoints in Japanese and Taiwans encoding, > > although this is > > > in most cases not expected. These codepoints should have been > > eliminated by > > > Unicodes unification rules, but crept in via the > > source-encoding separation > > > rule -- not a very good decision in my opinion. I have a list > > of some such > > > characters at http://www.chibs.edu.tw/~chris/smart/cjkconv.htm, Ideally, > > > there should be a way for the user to influence the conversion > > by providing > > > a list of his choice (with his modifications) to the codec, to > > overlay the > > > predefined values. > > > > Everybody can write their own codecs... so no comment on this one ;-) > > > > > 3.) The nasty problem of user defined characters. I think there > > should be a > > > default mapping of the user defined area in DBCS encodings to > > the Unicode > > > code range for user characters. Microsoft uses fixed sequential > > tables and I > > > think that is a good idea, since it is pretty straightforward. > > In big5 for > > > example, the area of user defined characters starts at Fa40, > > Fa41 ..., which > > > gets mapped to Unicode E000, E001, .. There should also be an > > option to use > > > some kind of entity reference instead. > > > > The core Python Unicode implementation doesn't touch these > > private code areas at all. This issue is left to the codecs. 
> > > > Since they are probably of some importance to the Asian world > > due to the many corporate char sets, I guess the Asian codecs > > should provide some kind of logic to handle these areas as > > special cases... perhaps by passing an extra mapping table > > to the codec. > > That would solve the above point 2 as well and is all I have in mind here: > Leave some hook that the user can pass some overlayed extra mapping table, > without having to write a codec of his own. ALthough I realize the latter is > possible, I don't think it is practicle and maybe not even desirable. I > don't want to design a different car from scratch, just because I don't like > the color:-) I think we are starting to pile up some good comments on what the Asian codecs should look like... perhaps its time for someone to jump in and write a proposal as basis for further discussion. (I don't have time for this and not even enough knowledge about the complexity of the Asian encodings, so I'll leave this to one of you...) > > > > > 4.) I developped years ago the habit of using entity references for any > > > characters not representable in the given characterset used by > > the system. I > > > have seen this becoming more widespread in the user communities > > I work with. > > > It would be very useful for us, if the Unicode conversion > > routines in Python > > > could be told to tread some arbitray entity references (we use > > things like > > > &M24501; for the characters assigned by the Mojikyo Font Institute (see > > > www.mojikyo.gr.jp) and &C4-4e21; for characters in the Taiwanese CNS > > > encoding). I realize that this is a rather specialised usage, > > but it would > > > be great and very helpful to have some hook in the system to treat this > > > stuff just like any other character. > > > > Hmm, sounds like some kind of SGML entity codec could solve this > > aspect... > > Right, but how would that be integrated with the other codecs? Codecs are stackable :-) -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From brian@garage.co.jp Tue Mar 21 03:27:52 2000 From: brian@garage.co.jp (Brian Takashi Hooper) Date: Tue, 21 Mar 2000 12:27:52 +0900 Subject: [I18n-sig] iconv encoding/decoding Message-ID: <38D6EC3839E.18B2BRIAN@smtp.garage.co.jp> Hi all - Have others looked at the double-byte codec implementations for iconv, in glibc 2? This implementation uses customized lookup code for each encoding - I don't think it's possible to make something that will run much faster than this. However, maybe we would be better off making a state-machine based implementation that can be programmed and customized from Python. Looking at the iconv implementation should give us a good idea of what atomic actions are possible. There are also scripts for automating the table creation from the mapping tables at Unicode.org, and test data sets. Does Python's license allow us to borrow pieces from GPL'd software? --Brian (Thanks Ted for the reference) From mal@lemburg.com Tue Mar 21 09:25:23 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Tue, 21 Mar 2000 10:25:23 +0100 Subject: [I18n-sig] iconv encoding/decoding References: <38D6EC3839E.18B2BRIAN@smtp.garage.co.jp> Message-ID: <38D74003.C4D54838@lemburg.com> Brian Takashi Hooper wrote: > > Hi all - > > Have others looked at the double-byte codec implementations for iconv, > in glibc 2? 
> > This implementation uses customized lookup code for each encoding - I > don't think it's possible to make something that will run much faster > than this. However, maybe we would be better off making a state-machine > based implementation that can be programmed and customized from Python. > Looking at the iconv implementation should give us a good idea of what > atomic actions are possible. > > There are also scripts for automating the table creation from the > mapping tables at Unicode.org, and test data sets. Does Python's > license allow us to borrow pieces from GPL'd software? It does, but nothing GPLed can go into the core distribution and have such an important piece of software under GPL would harm the useability of these codecs in commercial apps. Borrowing a few ideas is allowed though :-) -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From andy@reportlab.com Tue Mar 21 17:12:21 2000 From: andy@reportlab.com (Andy Robinson) Date: Tue, 21 Mar 2000 17:12:21 -0000 Subject: [I18n-sig] Asian Encodings Message-ID: I've been on vacation since The Big Patch - sorry about the lousy timing. I hope to get up a friendly tutorial on using the Unicode features shortly. In the meantime, some thoughts on the codecs and recent conversations: >1. Should the CJK ideograms also be included in the unicodehelpers >numeric converters? From my perspective, I'd really like to see them go >in, and think that it would make sense, too - any opinions? >2. Same as above with double-width alphanumeric characters - I assume >these should probably also be included in the lowercase / uppercase >helpers? Or will there be a way to add to these lists through the codec >API (for those worried about data from unused codecs clogging up their >character type helpers, maybe this would be a good option to have; I >would by contrast like to be able to exclude all the extra Latin 1 stuff >that I don't need, hmm.) >3. Same thing for whitespace - I think there are a number of >double-width whitespace characters around also. We have to be really careful about what goes in the Python core, and what is implemented as helper layers on top, with a preference for the latter where possible. If we have access to the character properties database, we could write some helper libraries which give the full range of isKatakana, isNumeric etc. in some dynamic way, without needing them hardcoded into the core; what we are really asking is 'does a character have a property'. I haven't checked the API for this yet, but if it is not there then we need it. >Don't know what the standard installation method is... this >hasn't been sorted out yet. I'm keen to sort this out, so we can start playing with codecs. Here's a bunch of ideas I'd like to float. From now on, please assume I am discussing some kind of CJK add-on package and not the Python core; it may benefit from some helper functions in the core, but is not for everybody. Character Sets and Encodings ---------------------------- Ken Lunde suggests that we should explicitly model Character Sets as distinct from Encodings; for example, Shift-JIS is an encoding which includes three character sets, (ASCII, JIS0208 Kanji and the Half width katakana). I tried to do this last year, but was not exactly sure of the point; AFAIK it is only useful if you want to reason about whether certain texts can survive certain round trips. 
Can anyone see a need to do this kind of thing? Bypassing Unicode ----------------- At some level, it should be possible to write and 'register' a codec which goes straight from, say, EUC to Shift_JIS without Unicode in the middle, using our codec machine. We need to figure out how this will be accessed; what is the clean way for a user to request the codec, without complicating or affecting anything in the present implementation. The present conventions of StreamWriters, StreamRecoders etc. are really useful, with or without Unicode. Can we overload to do codecs.lookup(sourceEncoding, destEncoding)? Or should it be something totally separate? Codecs State Machine -------------------- As you know I suggested an mxTextTools-inspired mini-language for doing stream transformations. I've never written this kind of thing before, but think it could be quite useful - I bet it could do data compression and image manipulation too. However, I have no experience designing languages. It seems to me that we should be able to convert data faster than we can rad/write to disk, but beyond that we need flexibility more than speed. Now what actions does it need? Should we steam straight in, or prototype it in Python? - what types? it cannot be as flexible as Python, or it will be no faster. Presumably most of the functions are statically typed, and we only need bytes/character, integers and booleans - what events when initialized ? construct mapping tables? - read n bytes from input into a string buffer - write n bytes from a string buffer to output - look up 1/2/n bytes in a mapping - full set of math and bit operators routines One good suggestion I had from Aaron Watters was that by treating it as a language, one could have a code-generation option as well as a runtime; we might be able to create C code for specific encodings on demand. Mapping tables: --------------- For CJKV stuff I strongly favour mapping tables which are built at run time. Mapping tables would be some of the possible inputs to our mini-language; we would be able to write routines saying 'until byte pattern x encountered do (read 2 bytes, look it up in a table, write the values found)', but with user-supplied mapping tables. These are currently implemented as dictionaries, but there are many contiguous ranges and a compact representation is possible. I did this last year for a client and it worked pretty well. Even the big CJKV ones come down to about 80 contiguous ranges. Conceptually, let's imagine that bytes 1 to 5 in source encoding map to 100-105 in destination; 6-10 map to 200-205; and 11-15 map to 300-305. Then we can create a 'compact map' structure like this... [(1, 5, 100), (6, 10, 200), (11, 15, 300)] ...and a routine which can expand it to a dictionary {1:100, 2:101 .... 15:305}. One can also write routines to invert maps, check if they represent a round trip and so on. The attraction is that the definitions can be in literal python modules, and look quite like the standards documents that create them. Furthermore, a lot of Japanese corporate encodings go like "Start with strict JIS-0208, and add these extra 17 characters..." - so one module could define all the variants for Japanese very cleanly and readably. I think this is a good way to tackle user-defined characters - tell them what to hack to add theirt 50 new characters and create an encoding with a new name. If this sounds sensible, I'll try to start on it. 
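To make the compact-map idea concrete, a minimal sketch, reading each tuple as (source_start, source_end, destination_start); note that under this reading the last range expands to 11:300 ... 15:304:

def expand_map(compact):
    # Expand [(src_start, src_end, dst_start), ...] into a plain
    # dictionary that a codec could use directly.
    full = {}
    for src_start, src_end, dst_start in compact:
        for offset in range(src_end - src_start + 1):
            full[src_start + offset] = dst_start + offset
    return full

def invert_map(full):
    # Only a true inverse if the forward map is 1-1, i.e. round-trip safe.
    inverse = {}
    for src, dst in full.items():
        inverse[dst] = src
    return inverse

compact = [(1, 5, 100), (6, 10, 200), (11, 15, 300)]
full = expand_map(compact)
assert full[2] == 101 and full[15] == 304
assert invert_map(invert_map(full)) == full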
Test Harness ------------ A digression here, but perhaps we should build a web interface to convert arbitrary files and output as HTML, so everyone can test the output of the codecs as we write them. Is this useful? That's enough rambling for one day... Thanks, Andy From brian@garage.co.jp Wed Mar 22 01:53:48 2000 From: brian@garage.co.jp (Brian Takashi Hooper) Date: Wed, 22 Mar 2000 10:53:48 +0900 Subject: [I18n-sig] Asian Encodings In-Reply-To: References: Message-ID: <38D827AC273.18CBBRIAN@smtp.garage.co.jp> Hi Andy, welcome back, On Tue, 21 Mar 2000 17:12:21 -0000 "Andy Robinson" wrote: [snip] > Character Sets and Encodings > ---------------------------- > Ken Lunde suggests that we should explicitly model Character Sets as > distinct from Encodings; for example, Shift-JIS is an encoding which > includes three character sets, (ASCII, JIS0208 Kanji and the Half width > katakana). I tried to do this last year, but was not exactly sure of the > point; AFAIK it is only useful if you want to reason about whether certain > texts can survive certain round trips. Can anyone see a need to do this > kind of thing? One complication that kind of arises from this is, if you've had a look at the mappings which are available on Unicode.org, some of them are encoding maps and some of them are character set maps. Which of course by itself is not such a huge chore but makes automatically generating maps somewhat less trivial than if you ignore such considerations. [snip] > Mapping tables: > --------------- > For CJKV stuff I strongly favour mapping tables which are built at run time. > Mapping tables would be some of the possible inputs to our mini-language; we > would be able to write routines saying 'until byte pattern x encountered do > (read 2 bytes, look it up in a table, write the values found)', but with > user-supplied mapping tables. > > These are currently implemented as dictionaries, but there are many > contiguous ranges and a compact representation is possible. I did this last > year for a client and it worked pretty well. Even the big CJKV ones come > down to about 80 contiguous ranges. Conceptually, let's imagine that bytes > 1 to 5 in source encoding map to 100-105 in destination; 6-10 map to > 200-205; and 11-15 map to 300-305. Then we can create a 'compact map' > structure like this... > [(1, 5, 100), > (6, 10, 200), > (11, 15, 300)] > ...and a routine which can expand it to a dictionary {1:100, 2:101 .... > 15:305}. This is similar to the way a bunch of the codecs for glibc's iconv work - there is an index mapping table which consists of start and end ranges, and an index, which allows a lookup function to index properly into a big static array. iconv, as I posted earlier, is one place that it might be good to get ideas, both for ideas on what kinds of operations the codec machine should be able to do and data storage. How about making the interface to mappings simply __getitem__, as suggested earlier on this list by Marc-Andre? I think that might be the best way to ensure that we have lots of different options for what we can use as mappings. The Java i18n classes are also worth a look - they do everything as an inheritance hierarchy, with the logic for doing the conversion kind of bundled together with the maps themselves - everything inherits from either ByteToCharConverter or CharToByteConverter, and then defines a convert routine to do conversion. 
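Picking up the __getitem__ suggestion above, here is one possible shape for a mapping object built directly on compact (start, end, destination-start) ranges, so the full dictionary never needs to be expanded. The class and method names are illustrative only, not an agreed interface:

import bisect

class CompactMap:
    def __init__(self, ranges):
        # ranges must be sorted and non-overlapping:
        # [(src_start, src_end, dst_start), ...]
        self.ranges = ranges
        self.starts = map(lambda r: r[0], ranges)

    def __getitem__(self, code):
        i = bisect.bisect(self.starts, code) - 1
        if i >= 0:
            src_start, src_end, dst_start = self.ranges[i]
            if code <= src_end:
                return dst_start + (code - src_start)
        raise KeyError(code)

m = CompactMap([(1, 5, 100), (6, 10, 200), (11, 15, 300)])
print m[7]    # -> 201

Anything exposing __getitem__ like this could then be handed to the conversion loop in place of a real dictionary.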
The inheritance relationships are kind of weird, I think - like, ByteToCharEUC_JP inherits from ByteToCharJIS0208, and contains ByteToCharJIS0201 and ByteToCharJIS0212 instances as class members. I like how the codecs return their max character width - this can sometimes be more than two bytes for some asian languages and helps to know for purposes of calculating memory allocation when going from Unicode back to a legacy encoding, for example. (If anyone's interested, I have decompiled copies of i18n.jar which I can put up someplace for people to look at). > One can also write routines to invert maps, check if they represent a round > trip and so on. The attraction is that the definitions can be in literal > python modules, and look quite like the standards documents that create > them. Furthermore, a lot of Japanese corporate encodings go like "Start > with strict JIS-0208, and add these extra 17 characters..." - so one module > could define all the variants for Japanese very cleanly and readably. I > think this is a good way to tackle user-defined characters - tell them what > to hack to add theirt 50 new characters and create an encoding with a new > name. If this sounds sensible, I'll try to start on it. > > > Test Harness > ------------ > A digression here, but perhaps we should build a web interface to convert > arbitrary files and output as HTML, so everyone can test the output of the > codecs as we write them. Is this useful? > > That's enough rambling for one day... > > Thanks, > > Andy > > > _______________________________________________ > I18n-sig mailing list > I18n-sig@python.org > http://www.python.org/mailman/listinfo/i18n-sig > From brian@garage.co.jp Wed Mar 22 02:17:43 2000 From: brian@garage.co.jp (Brian Takashi Hooper) Date: Wed, 22 Mar 2000 11:17:43 +0900 Subject: [I18n-sig] Asian Encodings In-Reply-To: References: Message-ID: <38D82D47136.18CCBRIAN@smtp.garage.co.jp> Hi again, One other thing I forgot to mention, is that we'll have to start thinking about (canonical) normalization, at least on a rudimentary level, for Asian encodings - one specific example I can think of is in Japanese with half-width katakana characters, there are a few diacritical marks (dakuten) which are represented themselves as separate characters - most encoding packages I've seen special case on these and turn them into their corresponding canonical representations. Without normalization, searches and processing for these characters become a bit of pain. So, one other goal of creating the East Asian codecs should also be to add some normalization support to the existing framework... other Unicode packages / implementations mostly use normalization form C for everything. Those that aren't familiar with Unicode Normalization Forms, here's the technical report, which is a good reference: http://www.unicode.org/unicode/reports/tr15/tr15-18.html --Brian From andy@reportlab.com Thu Mar 23 11:51:49 2000 From: andy@reportlab.com (Andy Robinson) Date: Thu, 23 Mar 2000 11:51:49 -0000 Subject: [I18n-sig] Codec Language In-Reply-To: <38D9DC5E103.DED4BRIAN@smtp.garage.co.jp> Message-ID: On the subject of a mini-language for dealing with Asian codecs...I'm fooling around with something in pure Python - a toy interpreter for a basic FSM - I'll try to post something up after the weekend. In the meantime, we should certainly list the actions we need to be able to perform at a conceptual level: 1. Data structures/types for bytes, strings, numbers and mapping tables 2. 
Read n bytes into designated buffers from input 3. Write contents of designated buffers to output 4. Look up contents of a buffer in a mapping table, and do somethign with the output (how to deal with failed lookups?) 5. Do math, string concenatenation, bit operations 6. Wide range of pattern-matching tests on short strings and bytes - byte in range, byte in set etc. mxTextTools gives loads of examples. Please pitch in with any suggested operations you think we need. The real issue seems to be, can we do it with an FSM that is not hideously complex to program? Or do we need a non-finite language in which infinite loops etc. are possible? The latter is easier to write things in, but may not be as safe or as fast. - Andy From mal@lemburg.com Thu Mar 23 12:11:00 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Thu, 23 Mar 2000 13:11:00 +0100 Subject: [I18n-sig] Kanji codec sample Message-ID: <38DA09D4.490C92B5@lemburg.com> Just thought this might be of interest to you. There is a sample implementation on the ftp.unicode.org site: ftp://ftp.unicode.org/Public/PROGRAMS/KANJIMAP/ Perhaps this could be used to get a quick start or at least some ideas about how Asian codecs could work... -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From brian@garage.co.jp Thu Mar 23 13:32:12 2000 From: brian@garage.co.jp (Brian Takashi Hooper) Date: Thu, 23 Mar 2000 22:32:12 +0900 Subject: [I18n-sig] Codec Language In-Reply-To: References: <38D9DC5E103.DED4BRIAN@smtp.garage.co.jp> Message-ID: <38DA1CDC1FD.DED7BRIAN@smtp.garage.co.jp> Hi Andy, On Thu, 23 Mar 2000 11:51:49 -0000 "Andy Robinson" wrote: > On the subject of a mini-language for dealing with Asian codecs...I'm > fooling around with something in pure Python - a toy interpreter for a basic > FSM - I'll try to post something up after the weekend. In the meantime, we > should certainly list the actions we need to be able to perform at a > conceptual level: > > > 1. Data structures/types for bytes, strings, numbers and mapping tables > 2. Read n bytes into designated buffers from input > 3. Write contents of designated buffers to output > 4. Look up contents of a buffer in a mapping table, and do somethign with > the output (how to deal with failed lookups?) > 5. Do math, string concenatenation, bit operations > 6. Wide range of pattern-matching tests on short strings and bytes - byte in > range, byte in set etc. mxTextTools gives loads of examples. I'd been thinking along these lines too; from the encodings that I've surveyed currently, which I think includes most of the major ones for which there are unicode.org mappings available, the above should probably be sufficient to do the job. It also seems like with a scheme that allows a single codec to use multiple maps, it should be possible to do any of the asian codecs with only a two-byte key and four-byte value. The four-byte value would include the key that mapped to it, plus the value itself (which, as far as I've gathered, could always be two bytes), so that misses could be detected. The reason two bytes is enough is that even though there are extensions to many encodings which allow them to use more space outside the BMP, those added spaces are always mapped as contiguous planes, and never (at least in any of the encodings that I know of) larger than what can be mapped on a 2-byte grid. > > Please pitch in with any suggested operations you think we need. 
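For the record, a toy pure-Python pass at the kind of conversion loop being discussed: read one or two bytes, look the value up through __getitem__, write the result, and fall back according to an errors flag. The lead-byte test is only indicative (roughly Shift-JIS-like) and speed is beside the point here; a real codec would take both the byte ranges and the mapping from the encoding definition:

def decode_dbcs(data, mapping, errors='strict'):
    # 'mapping' can be any object supporting __getitem__ that returns
    # Unicode strings; keys are (lead << 8) | trail for two-byte
    # sequences and the plain byte value otherwise.
    result = []
    i, n = 0, len(data)
    while i < n:
        lead = ord(data[i])
        if 0x81 <= lead <= 0x9F or 0xE0 <= lead <= 0xFC:
            if i + 1 >= n:
                raise UnicodeError('truncated two-byte sequence')
            key = (lead << 8) | ord(data[i + 1])
            i = i + 2
        else:
            key = lead
            i = i + 1
        try:
            result.append(mapping[key])
        except KeyError:
            if errors == 'replace':
                result.append(u'\uFFFD')
            elif errors == 'ignore':
                pass
            else:
                raise UnicodeError('unmapped code 0x%X' % key)
    return u''.join(result)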
> > The real issue seems to be, can we do it with an FSM that is not hideously > complex to program? Or do we need a non-finite language in which infinite > loops etc. are possible? The latter is easier to write things in, but may > not be as safe or as fast. Allowing for both algorithmic and mapping codecs within the same implementation might confuse matters somewhat... what about separating things into mapping codecs (which will handle all the Unicode stuff), and a separate machine (or possibly extension to the mapping machine) that can do algorithmic transformations? This would whittle down the immediate problem to developing the mapping machine, which as far as I can tell should only have to support reading, writing, lookup, and comparison, at least for doing Unicode conversions. How does this sound? Also, I think another thing on our agenda should be to list up a preliminary list of encodings/character sets we're going to support from the beginning - this will also help to narrow the scope of the problem somewhat. There may eventually be other encodings which we'll want to support by adding some extra functionality to the machine; but in general, I don't think that there's any harm in making something that's really simple to do what we want to do now... If this sounds like a good idea then I'll draw up a preliminary list from the Unicode site, and then we can take a look at implementations (iconv, Java, and the KANJIMAP link Marc-Andre just posted, for example) to help figure out the FSM instruction set. What do you all think? --Brian From andy@reportlab.com Thu Mar 23 22:14:19 2000 From: andy@reportlab.com (Andy Robinson) Date: Thu, 23 Mar 2000 22:14:19 -0000 Subject: [I18n-sig] More grief on Windows Message-ID: <002201bf9515$75b4fab0$01ac2ac0@boulder> I've built the Unicode-aware Python on Windows, with a proper encodings library. The moment I try to look up a codec, python crashes... C:\users>python Python 1.5.2+ (#0, Mar 23 2000, 15:31:41) [MSC 32 bit (Intel)] on win32 Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam >>> unicode('hello','ascii') !!!! Application Error at this point ...try again... C:\users>python Python 1.5.2+ (#0, Mar 23 2000, 15:31:41) [MSC 32 bit (Intel)] on win32 Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam >>> unicode('hello') u'hello' >>> unicode('hello','utf-8') u'hello' >>> import codecs >>> codecs.lookup('ascii') !!!! Application Error at this point This happens on two different machines, building using VC++ and the standard workspace, both with a full CVS tree and no other Pythons lurking. I stepped through in Pythonwin, and found that __init__.py is called, and the 'ascii' module is loaded on demand correctly; immediately after this, it crashes. I don't have the skills to debug the C - yet. Is anyone else able to run the above snippets on Windows, or is it me? Thanks very much, Andy Robinson From mal@lemburg.com Thu Mar 23 22:48:10 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Thu, 23 Mar 2000 23:48:10 +0100 Subject: [I18n-sig] More grief on Windows References: <002201bf9515$75b4fab0$01ac2ac0@boulder> Message-ID: <38DA9F2A.D174B26A@lemburg.com> Andy Robinson wrote: > > I've built the Unicode-aware Python on Windows, with a proper encodings > library. > The moment I try to look up a codec, python crashes... > > C:\users>python > Python 1.5.2+ (#0, Mar 23 2000, 15:31:41) [MSC 32 bit (Intel)] on win32 > Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam > >>> unicode('hello','ascii') > !!!! 
Application Error at this point I can reproduce this on Linux too... I'll look into this and send a patch. -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From andy@reportlab.com Thu Mar 23 23:16:55 2000 From: andy@reportlab.com (Andy Robinson) Date: Thu, 23 Mar 2000 23:16:55 -0000 Subject: [I18n-sig] Codec Language References: <38D9DC5E103.DED4BRIAN@smtp.garage.co.jp> <38DA1CDC1FD.DED7BRIAN@smtp.garage.co.jp> Message-ID: <000701bf951d$e1cc5c40$01ac2ac0@boulder> > Allowing for both algorithmic and mapping codecs within the same > implementation might confuse matters somewhat... what about separating > things into mapping codecs (which will handle all the Unicode stuff), > and a separate machine (or possibly extension to the mapping machine) > that can do algorithmic transformations? This would whittle down the > immediate problem to developing the mapping machine, which as far as I > can tell should only have to support reading, writing, lookup, and > comparison, at least for doing Unicode conversions. How does this > sound? I've been thinking hard what to do next, and actually I think the highest priorities are (a) build some kind if cgi test harness (maybe on Starship?), on which we can stash all manner of input files, and a front end which lets you specify input (file or a text field), say what encoding it is oiin, and say what encoding you want to see it in. Then, just using web browsers, we can actually see the results of type conversions, and can accumulate test files with subtle combinations of text. (b) write some pure Python Asian codecs, no matter how slow, using simple dictionaries for the mapping tables. This gives us a benchmark, documents the algorithms and features we are going to need, and lets people other than you and I see what features are needed in a faster codec machine. We should be able to move on that pretty fast. What do you think? BTW, I have often used uniconv.exe, a free utility from BasisTech - it is a command line program to do encoding conversion and character normalization transformations. Another really good test target would be to write a uniconv.py and a harness to run them both - when they give the same output for all encodings, we know we've done a good job. - Andy From mal@lemburg.com Thu Mar 23 23:21:31 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Fri, 24 Mar 2000 00:21:31 +0100 Subject: [I18n-sig] More grief on Windows References: <002201bf9515$75b4fab0$01ac2ac0@boulder> <38DA9F2A.D174B26A@lemburg.com> Message-ID: <38DAA6FB.63A036D5@lemburg.com> "M.-A. Lemburg" wrote: > > Andy Robinson wrote: > > > > I've built the Unicode-aware Python on Windows, with a proper encodings > > library. > > The moment I try to look up a codec, python crashes... > > > > C:\users>python > > Python 1.5.2+ (#0, Mar 23 2000, 15:31:41) [MSC 32 bit (Intel)] on win32 > > Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam > > >>> unicode('hello','ascii') > > !!!! Application Error at this point > > I can reproduce this on Linux too... I'll look into this and > send a patch. Here it is: --- CVS-Python/Python/codecs.c Fri Mar 24 00:02:04 2000 +++ Python+Unicode/Python/codecs.c Fri Mar 24 00:01:49 2000 @@ -91,11 +91,11 @@ PyObject *lowercasestring(const char *st If no codec is found, a KeyError is set and NULL returned. 
*/ PyObject *_PyCodec_Lookup(const char *encoding) { - PyObject *result, *args = NULL, *v = NULL; + PyObject *result, *args = NULL, *v; int i, len; if (_PyCodec_SearchCache == NULL || _PyCodec_SearchPath == NULL) { PyErr_SetString(PyExc_SystemError, "codec module not properly initialized"); @@ -117,27 +117,26 @@ PyObject *_PyCodec_Lookup(const char *en Py_DECREF(v); return result; } /* Next, scan the search functions in order of registration */ - len = PyList_Size(_PyCodec_SearchPath); - if (len < 0) - goto onError; - args = PyTuple_New(1); if (args == NULL) goto onError; PyTuple_SET_ITEM(args,0,v); - v = NULL; + + len = PyList_Size(_PyCodec_SearchPath); + if (len < 0) + goto onError; for (i = 0; i < len; i++) { PyObject *func; func = PyList_GetItem(_PyCodec_SearchPath, i); if (func == NULL) goto onError; - result = PyEval_CallObject(func,args); + result = PyEval_CallObject(func, args); if (result == NULL) goto onError; if (result == Py_None) { Py_DECREF(result); continue; @@ -161,11 +160,10 @@ PyObject *_PyCodec_Lookup(const char *en PyDict_SetItem(_PyCodec_SearchCache, v, result); Py_DECREF(args); return result; onError: - Py_XDECREF(v); Py_XDECREF(args); return NULL; } static -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From andy@reportlab.com Fri Mar 24 09:53:21 2000 From: andy@reportlab.com (Andy Robinson) Date: Fri, 24 Mar 2000 09:53:21 -0000 Subject: [I18n-sig] More grief on Windows References: <002201bf9515$75b4fab0$01ac2ac0@boulder> <38DA9F2A.D174B26A@lemburg.com> <38DAA6FB.63A036D5@lemburg.com> Message-ID: <000e01bf9576$ca342dc0$01ac2ac0@boulder> > "M.-A. Lemburg" wrote: > > I can reproduce this on Linux too... I'll look into this and > > send a patch. > > Here it is: Yup, that works for me. Thanks for the fast response. - Andy From mal@lemburg.com Fri Mar 31 22:15:53 2000 From: mal@lemburg.com (M.-A.
Lemburg) Date: Sat, 01 Apr 2000 00:15:53 +0200 Subject: [I18n-sig] Test Suite for the Unicode codecs Message-ID: <38E52399.19220D0@lemburg.com> I would like to add some more testing to the mapping codecs in the Python encodings package. Right now I can only test for round-trips of lower character ordinal ranges, and even those tests fail for a couple of encodings. Does anyone have access to some reference test suite for these mappings? The mapping codec is probably not the cause of these errors. Perhaps the maps themselves aren't of high enough quality, or maybe some mappings just cannot provide round-trip safety... Here are my findings in the form of a Python test script with comments. The tests first translate an encoded string into Unicode and then translate it back. Some encodings have undefined mappings even in the lower ranges, and others seem to be 1-n rather than 1-1.

print 'Testing standard mapping codecs...',

print '0-127...',
s = ''.join(map(chr, range(128)))
for encoding in (
    'cp037', 'cp1026', 'cp437', 'cp500', 'cp737', 'cp775', 'cp850',
    'cp852', 'cp855', 'cp860', 'cp861', 'cp862', 'cp863', 'cp865',
    'cp866',
    'iso8859_10', 'iso8859_13', 'iso8859_14', 'iso8859_15', 'iso8859_2',
    'iso8859_3', 'iso8859_4', 'iso8859_5', 'iso8859_6', 'iso8859_7',
    'iso8859_9',
    'koi8_r', 'latin_1', 'mac_cyrillic', 'mac_latin2',
    'cp1250', 'cp1251', 'cp1252', 'cp1253', 'cp1254', 'cp1255',
    'cp1256', 'cp1257', 'cp1258',
    'cp856', 'cp857', 'cp864', 'cp869', 'cp874',
    'mac_greek', 'mac_iceland','mac_roman', 'mac_turkish',
    'cp1006', 'cp875', 'iso8859_8',

    ### These have undefined mappings:
    #'cp424',
    ):
    try:
        assert unicode(s,encoding).encode(encoding) == s
    except AssertionError:
        print '*** codec "%s" failed round-trip' % encoding
    except ValueError,why:
        print '*** codec for "%s" failed: %s' % (encoding, why)

print '128-255...',
s = ''.join(map(chr, range(128,256)))
for encoding in (
    'cp037', 'cp1026', 'cp437', 'cp500', 'cp737', 'cp775', 'cp850',
    'cp852', 'cp855', 'cp860', 'cp861', 'cp862', 'cp863', 'cp865',
    'cp866',
    'iso8859_10', 'iso8859_13', 'iso8859_14', 'iso8859_15', 'iso8859_2',
    'iso8859_3', 'iso8859_4', 'iso8859_5', 'iso8859_6', 'iso8859_7',
    'iso8859_9',
    'koi8_r', 'latin_1', 'mac_cyrillic', 'mac_latin2',

    ### These have undefined mappings:
    #'cp1250', 'cp1251', 'cp1252', 'cp1253', 'cp1254', 'cp1255',
    #'cp1256', 'cp1257', 'cp1258',
    #'cp424', 'cp856', 'cp857', 'cp864', 'cp869', 'cp874',
    #'mac_greek', 'mac_iceland','mac_roman', 'mac_turkish',

    ### These fail the round-trip:
    #'cp1006', 'cp875', 'iso8859_8',
    ):
    try:
        assert unicode(s,encoding).encode(encoding) == s
    except AssertionError:
        print '*** codec "%s" failed round-trip' % encoding
    except ValueError,why:
        print '*** codec for "%s" failed: %s' % (encoding, why)

print 'done.'

-- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/
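A small follow-on helper (not part of the script above, and the name is made up) that reports exactly which byte values break the round trip for one encoding, which helps separate genuinely undefined positions from 1-n mappings:

def roundtrip_failures(encoding, lo=128, hi=256):
    failures = []
    for code in range(lo, hi):
        ch = chr(code)
        try:
            back = unicode(ch, encoding).encode(encoding)
        except ValueError:
            failures.append((code, 'undefined'))
            continue
        if back != ch:
            failures.append((code, 'comes back as ' + repr(back)))
    return failures

# e.g. one of the codecs reported as failing above:
for code, why in roundtrip_failures('cp875'):
    print '0x%02X: %s' % (code, why)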