From goodger@users.sourceforge.net Mon Jul 1 19:23:36 2002
From: goodger@users.sourceforge.net (David Goodger)
Date: Mon, 01 Jul 2002 14:23:36 -0400
Subject: [I18n-sig] encoding support for Docutils: please review
In-Reply-To: 
Message-ID: 

Thanks for your reply, Martin.

> I'd reorder this: (try command line). Try ASCII first, then UTF-8. If
> ASCII passes, it most likely is ASCII. If not, and UTF-8 passes, it
> most likely is UTF-8. Then try the locale's encoding.

Out of curiosity, is there any point in trying both ASCII and UTF-8? UTF-8 is a strict superset of ASCII, so shouldn't checking UTF-8 alone be enough for both? If we don't care what the original encoding was (we just want Unicode text to process), does explicitly checking for ASCII buy us anything?

-- 
David Goodger
Open-source projects:
- Python Docutils: http://docutils.sourceforge.net/
  (includes reStructuredText: http://docutils.sf.net/rst.html)
- The Go Tools Project: http://gotools.sourceforge.net/

From Matt Gushee Mon Jul 1 19:30:21 2002
From: Matt Gushee (Matt Gushee)
Date: Mon, 1 Jul 2002 12:30:21 -0600
Subject: [I18n-sig] encoding support for Docutils: please review
In-Reply-To: 
References: 
Message-ID: <20020701183021.GC361@swordfish.havenrock.com>

On Mon, Jul 01, 2002 at 02:23:36PM -0400, David Goodger wrote:
> Thanks for your reply, Martin.
>
> > I'd reorder this: (try command line). Try ASCII first, then UTF-8. If
> > ASCII passes, it most likely is ASCII.

Unless it's Shift-JIS.

-- 
Matt Gushee
Englewood, Colorado, USA
mgushee@havenrock.com
http://www.havenrock.com/

From martin@v.loewis.de Mon Jul 1 21:26:38 2002
From: martin@v.loewis.de (Martin v. Loewis)
Date: 01 Jul 2002 22:26:38 +0200
Subject: [I18n-sig] encoding support for Docutils: please review
In-Reply-To: 
References: 
Message-ID: 

David Goodger writes:

> Out of curiosity, is there any point in trying both ASCII and UTF-8? UTF-8
> is a strict superset of ASCII, so shouldn't checking UTF-8 alone be enough
> for both? If we don't care what the original encoding was (we just want
> Unicode text to process), does explicitly checking for ASCII buy us
> anything?

The answer to the last question is "no". The point in checking ASCII specifically is that you then know that it is strictly ASCII (unless it is iso-2022-jp, that is); if that is not interesting to know, there is no point.

Regards,
Martin

From Misha.Wolf@reuters.com Fri Jul 5 22:30:26 2002
From: Misha.Wolf@reuters.com (Misha.Wolf@reuters.com)
Date: Fri, 05 Jul 2002 22:30:26 +0100
Subject: [I18n-sig] 22nd Unicode Conference, Sep 2002, San Jose, CA -- Register now!
Message-ID: 

***********************************************************************
Register now! > Just 9 weeks to go > Register now! > Just 9 weeks to go
***********************************************************************

Twenty-second International Unicode Conference (IUC22)
Unicode and the Web: Evolution or Revolution?
http://www.unicode.org/iuc/iuc22
September 9-13, 2002
San Jose, California

***********************************************************************
Full program now live! >> Five days of 3 tracks! >> Check the Web site!
***********************************************************************

NEWS

> Visit the Conference Web site ( http://www.unicode.org/iuc/iuc22 ) to
  check the Conference program and register. To help you choose
  Conference sessions, we've included abstracts of talks and speakers'
  biographies.
> Hotel guest room group rate valid to 16 August.
> Early bird registration rate valid to 16 August.
CONFERENCE SPONSORS

Agfa Monotype Corporation
Basis Technology Corporation
Microsoft Corporation
Netscape Communications
Oracle Corporation
Reuters Ltd.
Sun Microsystems, Inc.
World Wide Web Consortium (W3C)

GLOBAL COMPUTING SHOWCASE

Visit the Showcase to find out more about products supporting the Unicode Standard, and products and services that can help you globalize/localize your software, documentation and Internet content. For details, visit the Conference Web site.

CONFERENCE VENUE

The Conference will take place at:
DoubleTree Hotel San Jose
2050 Gateway Place
San Jose, CA 95110 USA
Tel: +1 408 453 4000
Fax: +1 408 437 2898

CONFERENCE MANAGEMENT

Global Meeting Services Inc.
8949 Lombard Place, #416
San Diego, CA 92122, USA
Tel: +1 858 638 0206 (voice)
     +1 858 638 0504 (fax)
Email: info@global-conference.com
   or: conference@unicode.org

THE UNICODE CONSORTIUM

The Unicode Consortium was founded as a non-profit organization in 1991. It is dedicated to the development, maintenance and promotion of The Unicode Standard, a worldwide character encoding. The Unicode Standard encodes the characters of the world's principal scripts and languages, and is code-for-code identical to the international standard ISO/IEC 10646. In addition to cooperating with ISO on the future development of ISO/IEC 10646, the Consortium is responsible for providing character properties and algorithms for use in implementations. Today the membership base of the Unicode Consortium includes major computer corporations, software producers, database vendors, research institutions, international agencies and various user groups.

For further information on the Unicode Standard, visit the Unicode Web site at http://www.unicode.org or e-mail

* * * * *

Unicode(r) and the Unicode logo are registered trademarks of Unicode, Inc. Used with permission.

-------------------------------------------------------------
---
Visit our Internet site at http://www.reuters.com

Any views expressed in this message are those of the individual sender, except where the sender specifically states them to be the views of Reuters Ltd.

From kajiyama@grad.sccs.chukyo-u.ac.jp Fri Jul 12 17:50:04 2002
From: kajiyama@grad.sccs.chukyo-u.ac.jp (Tamito KAJIYAMA)
Date: Sat, 13 Jul 2002 01:50:04 +0900
Subject: [I18n-sig] JapaneseCodecs 1.4.7 released
Message-ID: <200207121650.g6CGo4N15961@grad.sccs.chukyo-u.ac.jp>

Hi,

I've released JapaneseCodecs 1.4.7. As usual, the source tarball is available at the following location:

    http://www.python.jp/Zope/download/JapaneseCodecs (in Japanese)
    http://www.python.jp/Zope/download/JapaneseCodecs/JapaneseCodecs-1.4.7.tar.gz

Encoders and decoders now raise a ValueError instead of UnicodeError if their optional argument "errors" has an invalid value. Thanks Walter for reminding me!

Regards,

-- 
KAJIYAMA, Tamito

From barry@zope.com Sat Jul 13 00:24:37 2002
From: barry@zope.com (Barry A. Warsaw)
Date: Fri, 12 Jul 2002 19:24:37 -0400
Subject: [I18n-sig] JapaneseCodecs 1.4.7 released
References: <200207121650.g6CGo4N15961@grad.sccs.chukyo-u.ac.jp>
Message-ID: <15663.25909.341858.861899@anthem.wooz.org>

>>>>> "TK" == Tamito KAJIYAMA writes:

TK> Hi,

TK> I've released JapaneseCodecs 1.4.7.
TK> As usual, the source tarball is available at the following location:
TK> http://www.python.jp/Zope/download/JapaneseCodecs (in Japanese)
TK> http://www.python.jp/Zope/download/JapaneseCodecs/JapaneseCodecs-1.4.7.tar.gz

TK> Encoders and decoders now raise a ValueError instead of
TK> UnicodeError if their optional argument "errors" has an
TK> invalid value. Thanks Walter for reminding me!

Thanks for the update; I've installed it in the Mailman project.

-Barry

From MBleyer@DEFiNiENS.com Wed Jul 17 17:16:39 2002
From: MBleyer@DEFiNiENS.com (Bleyer, Michael)
Date: Wed, 17 Jul 2002 18:16:39 +0200
Subject: [I18n-sig] Passing unicode strings to file system calls
Message-ID: <023814FAC196D5119E4A00D0B76C0F98052123@mmuc.definiens.com>

Assume I have a list of unicode strings in UTF-16-le. Reading and parsing the list all works really fine.

Now I want to create/copy a number of files and I want the file/directory names to be these unicode strings. When I give a unicode string to a file system call like shutil.copy() or os.mkdir(), Python converts the unicode string to a "regular" string using the default site encoding (which usually fails when it is 'ascii'). I can influence this by encode()'ing myself before I pass the string to the system function call, so far so good.

However, I do have a problem if I have unicode strings from different, incompatible encodings in my list (e.g. ISO Latin-1 and some Asian encoding), as I cannot use the same encoding conversion for all strings; some will fail. I can of course convert to UTF-8, which will always work, but the filenames turn out to be garbage (because the OS does not interpret them as UTF-8 but in the local encoding).

My question is thus: since modern-day operating systems claim to support unicode (I assume) in filenames, how do I pass a unicode string directly to a system function call without having to convert to a "localized" encoding? Alternatively, how can I find out the "proper" or "legal" encoding for a unicode string just by looking at the string (i.e. not with a brute-force try-encode-except trial and error loop)?

As a side problem: how do I deal with filename length limits, since these are actually byte limits, not character limits? If I do a u''[:255] followed by an encode, I end up with a unicode string that's at most 255 characters long, but may be longer than 255 bytes after encoding. If I do encode followed by ''[:255], I get at most 255 bytes, but my string may be illegal because I cut off in the middle of a 3-byte character.

Any insights and suggestions greatly appreciated.

Mike

From mal@lemburg.com Wed Jul 17 17:50:59 2002
From: mal@lemburg.com (M.-A. Lemburg)
Date: Wed, 17 Jul 2002 18:50:59 +0200
Subject: [I18n-sig] Passing unicode strings to file system calls
References: <023814FAC196D5119E4A00D0B76C0F98052123@mmuc.definiens.com>
Message-ID: <3D35A073.3060602@lemburg.com>

Bleyer, Michael wrote:
> Assume I have a list of unicode strings in UTF-16-le. Reading and parsing
> the list all works really fine.
>
> Now I want to create/copy a number of files and I want the file/directory
> names to be these unicode strings.
> When I give a unicode string to a file system call like
> shutil.copy()
> or
> os.mkdir()
> Python converts the unicode string to a "regular" string using the default
> site encoding (which usually fails when it is 'ascii').
> I can influence this by encode()'ing myself before I pass the string to the
> system function call, so far so good.
>
> However, I do have a problem if I have unicode strings from different,
> incompatible encodings in my list (e.g. ISO Latin-1 and some Asian
> encoding), as I cannot use the same encoding conversion for all strings;
> some will fail. I can of course convert to UTF-8, which will always work,
> but the filenames turn out to be garbage (because the OS does not interpret
> them as UTF-8 but in the local encoding).
>
> My question is thus: since modern-day operating systems claim to support
> unicode (I assume) in filenames, how do I pass a unicode string directly to
> a system function call without having to convert to a "localized" encoding?

Python 2.2 tries to automagically encode Unicode into the encoding used by the OS. This only works if Python can figure out this encoding. AFAIK, only Windows platforms are supported.

> Alternatively, how can I find out the "proper" or "legal" encoding for a
> unicode string just by looking at the string (i.e. not with a brute-force
> try-encode-except trial and error loop)?

If you know the encoding used by the file system, then you should simply encode the Unicode filename using that encoding.

> As a side problem: how do I deal with filename length limits, since these
> are actually byte limits, not character limits?
> If I do a u''[:255] followed by an encode, I end up with a unicode string
> that's at most 255 characters long, but may be longer than 255 bytes after
> encoding.
> If I do encode followed by ''[:255], I get at most 255 bytes, but my string
> may be illegal because I cut off in the middle of a 3-byte character.

Good question. You could try stripping after the encoding and then have Python decode the result using the 'ignore' error handling. That should give you the maximum-sized Unicode string to use for encoding.

-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
_______________________________________________________________________
eGenix.com -- Makers of the Python mx Extensions: mxDateTime,mxODBC,...
Python Consulting: http://www.egenix.com/
Python Software: http://www.egenix.com/files/python/

From martin@v.loewis.de Wed Jul 17 19:25:30 2002
From: martin@v.loewis.de (Martin v. Loewis)
Date: 17 Jul 2002 20:25:30 +0200
Subject: [I18n-sig] Passing unicode strings to file system calls
In-Reply-To: <023814FAC196D5119E4A00D0B76C0F98052123@mmuc.definiens.com>
References: <023814FAC196D5119E4A00D0B76C0F98052123@mmuc.definiens.com>
Message-ID: 

"Bleyer, Michael" writes:

> My question is thus: since modern-day operating systems claim to support
> unicode (I assume) in filenames

That is not really true. WinNT and MacOS do. Unix only supports byte-based file names, and there is an ongoing debate on how those should be used to represent non-ASCII in file names. The convention seems to be that the locale's encoding should be assumed for file names.

As MAL explains, you can pass Unicode file names automatically in Python 2.2; you might need to invoke locale.setlocale for this to work properly.

> Alternatively, how can I find out the "proper" or "legal" encoding for a
> unicode string just by looking at the string (i.e. not with a brute-force
> try-encode-except trial and error loop)?

For this, you need to tell us what system you use.

> As a side problem: how do I deal with filename length limits, since these
> are actually byte limits, not character limits?

Again, depends on the system. As a starting point, you need to find out what the limit is.
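A minimal sketch of the encode-truncate-decode approach suggested above, in Python 2.2-era style (the 255-byte limit and the helper name truncate_filename are illustrative, not part of any stdlib API):

    def truncate_filename(uname, encoding, maxbytes=255):
        # Encode first, cut at the byte limit, then decode with
        # 'ignore' so that a character cut in half at the boundary
        # is dropped instead of leaving an illegal partial sequence.
        raw = uname.encode(encoding)[:maxbytes]
        return unicode(raw, encoding, 'ignore').encode(encoding)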
> If I do a u''[:255] followed by an encode, I end up with a unicode string
> that's at most 255 characters long, but may be longer than 255 bytes after
> encoding.

Also, the limit might be smaller than 255.

> If I do encode followed by ''[:255], I get at most 255 bytes, but my string
> may be illegal because I cut off in the middle of a 3-byte character.

If truncation is acceptable, I recommend truncating to 50% of the maximum size, and asserting that the encoded result is smaller than the maximum size. You can try to be smart and use binary search to find the largest acceptable character string.

Regards,
Martin

From martin@v.loewis.de Wed Jul 17 19:27:07 2002
From: martin@v.loewis.de (Martin v. Loewis)
Date: 17 Jul 2002 20:27:07 +0200
Subject: [I18n-sig] Passing unicode strings to file system calls
In-Reply-To: <3D35A073.3060602@lemburg.com>
References: <023814FAC196D5119E4A00D0B76C0F98052123@mmuc.definiens.com> <3D35A073.3060602@lemburg.com>
Message-ID: 

"M.-A. Lemburg" writes:

> Python 2.2 tries to automagically encode Unicode into the
> encoding used by the OS. This only works if Python can figure
> out this encoding. AFAIK, only Windows platforms are supported.

No; it works on Unix as well (if nl_langinfo(CODESET) is supported); you need to invoke setlocale to activate this support (in particular, the LC_CTYPE category).

Regards,
Martin

From mal@lemburg.com Wed Jul 17 19:38:52 2002
From: mal@lemburg.com (M.-A. Lemburg)
Date: Wed, 17 Jul 2002 20:38:52 +0200
Subject: [I18n-sig] Passing unicode strings to file system calls
References: <023814FAC196D5119E4A00D0B76C0F98052123@mmuc.definiens.com> <3D35A073.3060602@lemburg.com>
Message-ID: <3D35B9BC.8020004@lemburg.com>

Martin v. Loewis wrote:
> "M.-A. Lemburg" writes:
>
>>Python 2.2 tries to automagically encode Unicode into the
>>encoding used by the OS. This only works if Python can figure
>>out this encoding. AFAIK, only Windows platforms are supported.
>
> No; it works on Unix as well (if nl_langinfo(CODESET) is supported);
> you need to invoke setlocale to activate this support (in particular,
> the LC_CTYPE category).

You mean: call setlocale() to set something or fetch the encoding from it ? Setting a locale to something other than "C" will cause quite a few semantic changes, so you should beware...

Note there's also locale.getdefaultlocale() which works on many platforms and returns the default locale and encoding for the platform Python currently runs on.

BTW, running "python locale.py" prints your current settings.

-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
_______________________________________________________________________
eGenix.com -- Makers of the Python mx Extensions: mxDateTime,mxODBC,...
Python Consulting: http://www.egenix.com/
Python Software: http://www.egenix.com/files/python/

From martin@v.loewis.de Wed Jul 17 19:54:07 2002
From: martin@v.loewis.de (Martin v. Loewis)
Date: 17 Jul 2002 20:54:07 +0200
Subject: [I18n-sig] Passing unicode strings to file system calls
In-Reply-To: <3D35B9BC.8020004@lemburg.com>
References: <023814FAC196D5119E4A00D0B76C0F98052123@mmuc.definiens.com> <3D35A073.3060602@lemburg.com> <3D35B9BC.8020004@lemburg.com>
Message-ID: 

"M.-A. Lemburg" writes:

> You mean: call setlocale() to set something or fetch the
> encoding from it ? Setting a locale to something other than
> "C" will cause quite a few semantic changes, so you should
> beware...

Indeed. However, setting the locale may be the only way to find out what the locale's encoding is.
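A minimal sketch of that approach on Unix, assuming a Python build where locale.nl_langinfo and locale.CODESET are available:

    import locale
    # Set only LC_CTYPE from the user's environment, to limit the
    # semantic side effects of switching away from the "C" locale.
    locale.setlocale(locale.LC_CTYPE, '')
    encoding = locale.nl_langinfo(locale.CODESET)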
> Note there's also locale.getdefaultlocale()

That is broken beyond repair, and should not be used for anything. It can't possibly work.

> which works on many platforms and returns the default locale and
> encoding for the platform Python currently runs on.

In particular when it comes to the locale's encoding, it has no chance to work correctly, except on Windows.

Regards,
Martin

From mal@lemburg.com Wed Jul 17 21:23:54 2002
From: mal@lemburg.com (M.-A. Lemburg)
Date: Wed, 17 Jul 2002 22:23:54 +0200
Subject: [I18n-sig] Passing unicode strings to file system calls
References: <023814FAC196D5119E4A00D0B76C0F98052123@mmuc.definiens.com> <3D35A073.3060602@lemburg.com> <3D35B9BC.8020004@lemburg.com>
Message-ID: <3D35D25A.4080603@lemburg.com>

Martin v. Loewis wrote:
> "M.-A. Lemburg" writes:
>
>>You mean: call setlocale() to set something or fetch the
>>encoding from it ? Setting a locale to something other than
>>"C" will cause quite a few semantic changes, so you should
>>beware...
>
> Indeed. However, setting the locale may be the only way to find out
> what the locale's encoding is.
>
>>Note there's also locale.getdefaultlocale()
>
> That is broken beyond repair, and should not be used for anything. It
> can't possibly work.

Hmm, why is that ?

>>which works on many platforms and returns the default locale and
>>encoding for the platform Python currently runs on.
>
> In particular when it comes to the locale's encoding, it has no chance
> to work correctly, except on Windows.

There's a large database in locale.py for this and a few support APIs which make use of it. It would probably be worthwhile to add an interface encoding(localename) which only returns the encoding used by default for that locale.

-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
_______________________________________________________________________
eGenix.com -- Makers of the Python mx Extensions: mxDateTime,mxODBC,...
Python Consulting: http://www.egenix.com/
Python Software: http://www.egenix.com/files/python/

From martin@v.loewis.de Wed Jul 17 22:13:01 2002
From: martin@v.loewis.de (Martin v. Loewis)
Date: 17 Jul 2002 23:13:01 +0200
Subject: [I18n-sig] Passing unicode strings to file system calls
In-Reply-To: <3D35D25A.4080603@lemburg.com>
References: <023814FAC196D5119E4A00D0B76C0F98052123@mmuc.definiens.com> <3D35A073.3060602@lemburg.com> <3D35B9BC.8020004@lemburg.com> <3D35D25A.4080603@lemburg.com>
Message-ID: 

"M.-A. Lemburg" writes:

> > That is broken beyond repair, and should not be used for anything. It
> > can't possibly work.
>
> Hmm, why is that ?

It tries to find out locale information from environment variables. That is bound to fail because:

- it may not know what variables to consider. In particular, on Unix,
  it tries LANGUAGE, LC_ALL, LC_CTYPE, and LANG. In doing so, it makes
  a number of errors when trying to find the encoding:

  - if LANGUAGE is set, it is used to determine the encoding. This is
    incorrect; LANGUAGE cannot be used for that. For example, with
    LANGUAGE=german LANG=de_DE.UTF-8, it returns
      ['de_DE', 'ISO8859-1']
    This is incorrect; the encoding should have been UTF-8

  - it misses that LANGUAGE can contain colons to denote
    fallbacks, on GNU/Linux; with
    LANGUAGE=german:french LANG=de_DE.UTF-8, it returns
      ['de_DE', 'french']
    This is even worse: french is not the name of an encoding

- it may not know the syntax of the environment variables. For
  example, the current implementation breaks for "de_DE@euro"; this is
  an SF bug report.
- it may not know the encoding associated with a locale. For example,
  for de_DE@euro, it is Latin-9 on Linux today, but might be UTF-8 on
  some other system. Likewise, locale.py just *knows* that de_DE means
  ".iso-8859-1" on any system - that can easily be wrong.

- the language name returned from getdefaultlocale is incorrect on
  Windows, see
  http://groups.google.com/groups?selm=917pjb%24ii2%241%40reader1.imaginet.fr
  Users apparently expect that they can pass the result of
  getdefaultlocale to setlocale, but this is not the case.

> There's a large database in locale.py for this and a few
> support APIs which make use of it.

That is the major problem. This database is incorrect, cannot be corrected, and is both unmaintained and unmaintainable.

> It would probably be worthwhile to add an interface
> encoding(localename) which only returns the encoding used by
> default for that locale.

I would make this getencoding(), and document that you need to call setlocale before, to make use of the user settings. The official way, on Unix, to obtain the locale's encoding is to use nl_langinfo(CODESET), which only works if the LC_CTYPE facet has been set. On Windows, locale._getdefaultlocale fortunately already returns the current codeset (which isn't influenced by setlocale, anyway).

Regards,
Martin

From mal@lemburg.com Wed Jul 17 22:30:51 2002
From: mal@lemburg.com (M.-A. Lemburg)
Date: Wed, 17 Jul 2002 23:30:51 +0200
Subject: [I18n-sig] Passing unicode strings to file system calls
References: <023814FAC196D5119E4A00D0B76C0F98052123@mmuc.definiens.com> <3D35A073.3060602@lemburg.com> <3D35B9BC.8020004@lemburg.com> <3D35D25A.4080603@lemburg.com>
Message-ID: <3D35E20B.8080103@lemburg.com>

Martin v. Loewis wrote:
> "M.-A. Lemburg" writes:
>
>>>That is broken beyond repair, and should not be used for anything. It
>>>can't possibly work.
>>
>>Hmm, why is that ?
>
> It tries to find out locale information from environment
> variables. That is bound to fail because:
>
> - it may not know what variables to consider. In particular, on Unix,
>   it tries LANGUAGE, LC_ALL, LC_CTYPE, and LANG. In doing so, it makes
>   a number of errors when trying to find the encoding:

That's the search order which GNU readline uses (at least at the time I wrote the code).

> - if LANGUAGE is set, it is used to determine the encoding. This is
>   incorrect; LANGUAGE cannot be used for that. For example, with
>   LANGUAGE=german LANG=de_DE.UTF-8, it returns
>     ['de_DE', 'ISO8859-1']
>   This is incorrect; the encoding should have been UTF-8
>
> - it misses that LANGUAGE can contain colons to denote
>   fallbacks, on GNU/Linux; with
>   LANGUAGE=german:french LANG=de_DE.UTF-8, it returns
>     ['de_DE', 'french']
>   This is even worse: french is not the name of an encoding

Interesting. Is the format documented somewhere ? It should be easy to fix this.

> - it may not know the syntax of the environment variables. For
>   example, the current implementation breaks for "de_DE@euro"; this is
>   an SF bug report.

This should be fixable too. What does the '@euro' mean ? Does it have to do with currency ?

Sure, but you normally only get the locale name and then have to make an educated guess for the encoding.
If the encoding is known (e.g. by looking at the LANG environment variable), then that information should override the database information.

> - the language name returned from getdefaultlocale is incorrect on
>   Windows, see
>   http://groups.google.com/groups?selm=917pjb%24ii2%241%40reader1.imaginet.fr
>   Users apparently expect that they can pass the result of
>   getdefaultlocale to setlocale, but this is not the case.

Hmm, the names returned by getdefaultlocale() and normalize() are standards. I wonder what Windows expects to see for setlocale().

>>There's a large database in locale.py for this and a few
>>support APIs which make use of it.
>
> That is the major problem. This database is incorrect, cannot be
> corrected, and is both unmaintained and unmaintainable.

I'd say, it's better than nothing :-)

>>It would probably be worthwhile to add an interface
>>encoding(localename) which only returns the encoding used by
>>default for that locale.
>
> I would make this getencoding(), and document that you need to call
> setlocale before, to make use of the user settings. The official way,
> on Unix, to obtain the locale's encoding is to use
> nl_langinfo(CODESET), which only works if the LC_CTYPE facet has been
> set. On Windows, locale._getdefaultlocale fortunately already returns
> the current codeset (which isn't influenced by setlocale, anyway).

Fine.

-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
_______________________________________________________________________
eGenix.com -- Makers of the Python mx Extensions: mxDateTime,mxODBC,...
Python Consulting: http://www.egenix.com/
Python Software: http://www.egenix.com/files/python/

From martin@v.loewis.de Wed Jul 17 23:11:10 2002
From: martin@v.loewis.de (Martin v. Loewis)
Date: 18 Jul 2002 00:11:10 +0200
Subject: [I18n-sig] Passing unicode strings to file system calls
In-Reply-To: <3D35E20B.8080103@lemburg.com>
References: <023814FAC196D5119E4A00D0B76C0F98052123@mmuc.definiens.com> <3D35A073.3060602@lemburg.com> <3D35B9BC.8020004@lemburg.com> <3D35D25A.4080603@lemburg.com> <3D35E20B.8080103@lemburg.com>
Message-ID: 

"M.-A. Lemburg" writes:

> > - it may not know what variables to consider. In particular, on Unix,
> >   it tries LANGUAGE, LC_ALL, LC_CTYPE, and LANG. In doing so, it makes
> >   a number of errors when trying to find the encoding:
>
> That's the search order which GNU readline uses (at least
> at the time I wrote the code).

GNU readline does not check LANGUAGE, and it uses setlocale if available (so you are talking about rarely-used fallback code).

> > - it misses that LANGUAGE can contain colons to denote
> >   fallbacks, on GNU/Linux; with
> >   LANGUAGE=german:french LANG=de_DE.UTF-8, it returns
> >     ['de_DE', 'french']
> >   This is even worse: french is not the name of an encoding
>
> Interesting. Is the format documented somewhere ? It should be
> easy to fix this.

Of LANGUAGE? I believe it's documented in the gettext documentation.

> > - it may not know the syntax of the environment variables. For
> >   example, the current implementation breaks for "de_DE@euro"; this is
> >   an SF bug report.
>
> This should be fixable too. What does the '@euro' mean ? Does it
> have to do with currency ?

In a way. It is a "locale variant". A variant could be just about anything. Common variants are @euro (used to denote the variant that has the Euro for LC_MONETARY), @nynorsk (used to tell apart the two Norwegian languages - now nb and no), and @xim, used for X Input Methods (like @xim=kinput2). It could be used for many other things, too.
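A sketch of what parsing such locale names involves, assuming the common language[_territory][.codeset][@modifier] syntax (illustrative only; as the next paragraph points out, the variant itself does not reveal the encoding):

    import re
    # language, optional territory, optional codeset, optional modifier
    _locale_re = re.compile(r'^([a-zA-Z]+)(?:_([a-zA-Z]+))?'
                            r'(?:\.([^@]+))?(?:@(.+))?$')
    print _locale_re.match('de_DE@euro').groups()
    # -> ('de', 'DE', None, 'euro')
    print _locale_re.match('fr_FR.cp1252').groups()
    # -> ('fr', 'FR', 'cp1252', None)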
You can fix the parsing of the variants, but you cannot infer the encoding.

> Sure, but you normally only get the locale name and then
> have to make an educated guess for the encoding.

That is my point: This algorithm must guess, and it *will* guess wrong.

> If the encoding is known (e.g. by looking at the LANG environment
> variable), then that information should override the database
> information.

In this specific case (of the @euro domains), the LANG variable does not explicitly mention the encoding. So that doesn't help.

> Hmm, the names returned by getdefaultlocale() and normalize()
> are standards. I wonder what Windows expects to see for
> setlocale().

What standards? Posix? That has never impressed Microsoft. Instead of "fr_FR.cp1252", they accept "French_France.1252". That may even be Posix-conforming, though, which allows "<language>_<territory>.<codeset>".

Locale names are *not* standard. An algorithm that assumes that they are is broken.

> I'd say, it's better than nothing :-)

Yes, that's why I propose to provide a replacement, and then deprecate the existing function.

Regards,
Martin

From martin@v.loewis.de Thu Jul 18 16:14:33 2002
From: martin@v.loewis.de (Martin v. Loewis)
Date: 18 Jul 2002 17:14:33 +0200
Subject: [I18n-sig] Passing unicode strings to file system calls
In-Reply-To: <3D36BA55.1000802@lemburg.com>
References: <023814FAC196D5119E4A00D0B76C0F98052123@mmuc.definiens.com> <3D35A073.3060602@lemburg.com> <3D35B9BC.8020004@lemburg.com> <3D35D25A.4080603@lemburg.com> <3D35E20B.8080103@lemburg.com> <3D36BA55.1000802@lemburg.com>
Message-ID: 

"M.-A. Lemburg" writes:

> > You can fix the parsing of the variants, but you cannot infer the
> > encoding.
>
> Why not ? I know that several locales use more than one
> encoding for their script(s),

Which locale, on which system?

> but having at least a hint is better than no information at all.

Where do you get the hint from? And why is it better to guess a random encoding than to guess "ascii" all the time?

> I've never said that it will always guess right. AFAIK,
> there is no platform independent solution to the problem.
> I am all for adding more support for platform specific
> solutions, though.

For that, I would need to understand the meaning of getdefaultlocale first. What precisely is it supposed to return? I can understand the "encoding" part (what encoding is the user likely to use), but what is the meaning of the "language code" return value? And what can you do with that result?

> > In this specific case (of the @euro domains), the LANG variable does
> > not explicitly mention the encoding. So that doesn't help.
>
> It can be used as a hint, e.g. in Germany we use Latin-1 as
> encoding, so that's a good assumption.

That is a wrong assumption. In Germany, we use windows-1252, iso-8859-1, iso-8859-15, and UTF-8. Many modern Unix installations use Latin-9 instead of Latin-1, since Latin-1 cannot represent the currency symbol of the locale.

> >>I'd say, it's better than nothing :-)
> > Yes, that's why I propose to provide a replacement, and then
> > deprecate the existing function.
>
> Why a replacement and what kind of replacement ? It should well
> be possible to add more support to the existing APIs and
> perhaps extend them with new ones.

Because the other APIs have different usage constraints. It *is* possible to find out the user's encoding reliably on many Unix systems, but you have to invoke setlocale for that to work. Calling setlocale behind the scenes is bad, so the users have to change their code. Also, this only returns the encoding.
I don't know what the "language code" is or how to obtain it - even in a system specific way. Fortunately, I don't consider this a problem - since I can't see why anybody would want that value, either.

Regards,
Martin

From martin@v.loewis.de Thu Jul 18 16:05:02 2002
From: martin@v.loewis.de (Martin v. Loewis)
Date: 18 Jul 2002 17:05:02 +0200
Subject: [I18n-sig] Passing unicode strings to file system calls
In-Reply-To: <023814FAC196D5119E4A00D0B76C0F98052125@mmuc.definiens.com>
References: <023814FAC196D5119E4A00D0B76C0F98052125@mmuc.definiens.com>
Message-ID: 

"Bleyer, Michael" writes:

> What I want to do is create file names from a list that has strings in both
> encodings. The strings can be handled fine while in unicode, but as soon as
> I try to convert all of them to one encoding, half of the conversions will
> fail. I just want to convert them with the proper encoding and then pass the
> bytestring to the system function; I don't worry about whether it will
> _display_ right, just about whether the name is correct.

For that, you need to give a definition of "correct". From your description, I'd say that encoding the strings as "utf-8" is also "correct" - it gives you byte strings that identify the original file names.

> What I would like to have is some function that will tell me for a given
> Unicode string, a list of all the encodings that this string can be
> converted into (without having to try all available encodings in a brute
> force loop), because I do not know the proper encoding a priori.

I doubt that you can implement such a function without a "brute force" algorithm of some kind.

> Anyway, if there isn't a direct interface/solution, what would you
> consider the best workaround for Python?

Use brute force.

Perhaps I'm still not understanding your problem clearly. To understand it better, can you please answer the following questions?

- does your problem really have to do with file names? Or can it be considered as independent of the problem of file names?

- would it help if, for each Unicode character, there was a list of encodings that can represent that character?

Regards,
Martin

From yedian@worldnet.att.net Thu Jul 18 16:27:53 2002
From: yedian@worldnet.att.net (Dan Edwards)
Date: Thu, 18 Jul 2002 23:27:53 +0800
Subject: [I18n-sig] Chinese Codecs?
Message-ID: 

Is there a definitive source for Traditional and Simplified Chinese codecs for Python? I'm specifically looking for codecs to handle CP936 and CP950.

I've looked at the pythonzh project at SourceForge and it appears to be 1) abandonware, 2) not functioning as expected.

Tnx for any help on this question.

Dan

From MBleyer@DEFiNiENS.com Thu Jul 18 14:42:06 2002
From: MBleyer@DEFiNiENS.com (Bleyer, Michael)
Date: Thu, 18 Jul 2002 15:42:06 +0200
Subject: [I18n-sig] Passing unicode strings to file system calls
Message-ID: <023814FAC196D5119E4A00D0B76C0F98052125@mmuc.definiens.com>

> > Python 2.2 tries to automagically encode Unicode into the encoding
> > used by the OS. This only works if Python can figure out this
> > encoding. AFAIK, only Windows platforms are supported.
>
> No; it works on Unix as well (if nl_langinfo(CODESET) is
> supported); you need to invoke setlocale to activate this
> support (in particular, the LC_CTYPE category).

It does work on Unix as well with some caveats. However, I think maybe my original question was not clear enough.
Let's assume, for the sake of the argument, that a call to locale.getdefaultlocale()[1] will get me the system's default encoding, which I can use to encode my unicode strings so they show up properly when used in filenames.

But I know that in some areas people work with two different incompatible (non-symmetric) encodings, for example people in Japan with mixed Sun and Windows networks. They have some filenames in one encoding and some in the other. One half of the filenames used always shows up as garbage, since they cannot be displayed in the other encoding and vice versa. Let's assume that people know this and accept it.

What I want to do is create file names from a list that has strings in both encodings. The strings can be handled fine while in unicode, but as soon as I try to convert all of them to one encoding, half of the conversions will fail. I just want to convert them with the proper encoding and then pass the bytestring to the system function; I don't worry about whether it will _display_ right, just about whether the name is correct.

What I would like to have is some function that will tell me for a given Unicode string, a list of all the encodings that this string can be converted into (without having to try all available encodings in a brute force loop), because I do not know the proper encoding a priori. The system locale info will only tell me which encoding is _displayed_ properly; it does not mean that this encoding will be able to handle all my unicode strings.

I am not sure if this is a fundamental problem with Unicode: it seems to be a great way to store data, but as soon as you actually want to do anything with it you need some extra META information that is not stored in the data itself (uhm, I don't mean to rant, I realize this holds true for other formats as well). I also know that an obvious answer to this whole issue could be "just keep your data in your local encoding and avoid using unicode"; unfortunately the source format is UTF-16.

Anyway, if there isn't a direct interface/solution, what would you consider the best workaround for Python? :-)

Mike

From mal@lemburg.com Thu Jul 18 13:53:41 2002
From: mal@lemburg.com (M.-A. Lemburg)
Date: Thu, 18 Jul 2002 14:53:41 +0200
Subject: [I18n-sig] Passing unicode strings to file system calls
References: <023814FAC196D5119E4A00D0B76C0F98052123@mmuc.definiens.com> <3D35A073.3060602@lemburg.com> <3D35B9BC.8020004@lemburg.com> <3D35D25A.4080603@lemburg.com> <3D35E20B.8080103@lemburg.com>
Message-ID: <3D36BA55.1000802@lemburg.com>

Martin v. Loewis wrote:
> "M.-A. Lemburg" writes:
>
>>>- it may not know what variables to consider. In particular, on Unix,
>>>  it tries LANGUAGE, LC_ALL, LC_CTYPE, and LANG. In doing so, it makes
>>>  a number of errors when trying to find the encoding:
>>
>>That's the search order which GNU readline uses (at least
>>at the time I wrote the code).
>
> GNU readline does not check LANGUAGE, and it uses setlocale if
> available (so you are talking about rarely-used fallback code).

See the gettext man page:

    If the LANGUAGE environment variable is set to a nonempty value, and
    the locale is not the "C" locale, the value of LANGUAGE is assumed to
    contain a colon separated list of locale names. The functions will
    attempt to look up a translation of msgid in each of the locales in
    turn. This is a GNU extension.
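A sketch of the colon-separated convention the man page describes (the variable holds locale names to try in order, not encodings):

    import os
    # e.g. LANGUAGE=german:french yields ['german', 'french']
    fallbacks = os.environ.get('LANGUAGE', '').split(':')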
>>> - it misses that LANGUAGE can contain colons to denote
>>>   fallbacks, on GNU/Linux; with
>>>   LANGUAGE=german:french LANG=de_DE.UTF-8, it returns
>>>     ['de_DE', 'french']
>>>   This is even worse: french is not the name of an encoding
>>
>>Interesting. Is the format documented somewhere ? It should be
>>easy to fix this.
>
> Of LANGUAGE? I believe it's documented in the gettext documentation.

Yes. It looks as if parsing LANGUAGE is the wrong thing to do if you're looking for the default locale (i.e. the one which is used at process startup time before any calls to setlocale()).

>>>- it may not know the syntax of the environment variables. For
>>>  example, the current implementation breaks for "de_DE@euro"; this is
>>>  an SF bug report.
>>
>>This should be fixable too. What does the '@euro' mean ? Does it
>>have to do with currency ?
>
> In a way. It is a "locale variant". A variant could be just about
> anything. Common variants are @euro (used to denote the variant that
> has the Euro for LC_MONETARY), @nynorsk (used to tell apart the two
> Norwegian languages - now nb and no), and @xim, used for X Input
> Methods (like @xim=kinput2). It could be used for many other things,
> too.
>
> You can fix the parsing of the variants, but you cannot infer the
> encoding.

Why not ? I know that several locales use more than one encoding for their script(s), but having at least a hint is better than no information at all. Of course, if the system provides different means of accessing this information, then those means should be used instead.

>>Sure, but you normally only get the locale name and then
>>have to make an educated guess for the encoding.
>
> That is my point: This algorithm must guess, and it *will* guess
> wrong.

I've never said that it will always guess right. AFAIK, there is no platform independent solution to the problem. I am all for adding more support for platform specific solutions, though.

>>If the encoding is known (e.g. by looking at the LANG environment
>>variable), then that information should override the database
>>information.
>
> In this specific case (of the @euro domains), the LANG variable does
> not explicitly mention the encoding. So that doesn't help.

It can be used as a hint, e.g. in Germany we use Latin-1 as encoding, so that's a good assumption.

>>Hmm, the names returned by getdefaultlocale() and normalize()
>>are standards. I wonder what Windows expects to see for
>>setlocale().
>
> What standards? Posix? That has never impressed Microsoft. Instead of
> "fr_FR.cp1252", they accept "French_France.1252". That may even be
> Posix-conforming, though, which allows "<language>_<territory>.<codeset>".
>
> Locale names are *not* standard. An algorithm that assumes that they
> are is broken.

I didn't say that locale names are always standard. To the contrary: I added the normalize() API to locale.py to map some of the commonly used non-standard locale names to the standards compatible ones (ISO 639 language code + ISO 3166 country code).

>>I'd say, it's better than nothing :-)
>
> Yes, that's why I propose to provide a replacement, and then deprecate
> the existing function.

Why a replacement and what kind of replacement ? It should well be possible to add more support to the existing APIs and perhaps extend them with new ones.

-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
_______________________________________________________________________
eGenix.com -- Makers of the Python mx Extensions: mxDateTime,mxODBC,...
Python Consulting: http://www.egenix.com/
Python Software: http://www.egenix.com/files/python/

From barry@zope.com Thu Jul 18 19:02:20 2002
From: barry@zope.com (Barry A. Warsaw)
Date: Thu, 18 Jul 2002 14:02:20 -0400
Subject: [I18n-sig] Chinese Codecs?
References: 
Message-ID: <15671.684.366934.581967@anthem.wooz.org>

>>>>> "DE" == Dan Edwards writes:

DE> Is there a definitive source for Traditional and Simplified
DE> Chinese codecs for Python? I'm specifically looking for codecs
DE> to handle CP936 and CP950.

DE> I've looked at the pythonzh project at SourceForge and it
DE> appears to be 1) abandonware, 2) not functioning as expected.

DE> Tnx for any help on this question.

I think I got Chinese codecs for Mailman from http://sourceforge.net/projects/python-codecs/ but I don't know how current (or usable) they are.

-Barry

From Matt Gushee Thu Jul 18 20:29:33 2002
From: Matt Gushee (Matt Gushee)
Date: Thu, 18 Jul 2002 13:29:33 -0600
Subject: [I18n-sig] Chinese Codecs?
In-Reply-To: <15671.684.366934.581967@anthem.wooz.org>
References: <15671.684.366934.581967@anthem.wooz.org>
Message-ID: <20020718192933.GE10401@swordfish.havenrock.com>

On Thu, Jul 18, 2002 at 02:02:20PM -0400, Barry A. Warsaw wrote:
>
> I think I got Chinese codecs for Mailman from
> http://sourceforge.net/projects/python-codecs/ but I don't know how
> current (or usable) they are.

Nor do I, since I currently only work with Japanese. But I tried to contact their author, J.S. Frank Chen, a few months ago, to see about getting an updated version of a document of his for 4Suite. I never got any response, and judging from the record of his project activity, posts to mailing lists, etc., he seems to have vanished from the Internet. So it seems the Chinese codecs may be orphaned (not literally, I hope).

-- 
Matt Gushee
Englewood, Colorado, USA
mgushee@havenrock.com
http://www.havenrock.com/

From martin@v.loewis.de Thu Jul 18 22:14:04 2002
From: martin@v.loewis.de (Martin v. Loewis)
Date: 18 Jul 2002 23:14:04 +0200
Subject: [I18n-sig] Chinese Codecs?
In-Reply-To: 
References: 
Message-ID: 

"Dan Edwards" writes:

> Is there a definitive source for Traditional and Simplified Chinese codecs
> for Python? I'm specifically looking for codecs to handle CP936 and CP950.

Depending on the platform you are using, you may try the iconv codec, thus accessing codecs in your C library.

Regards,
Martin

From MBleyer@DEFiNiENS.com Fri Jul 19 10:06:13 2002
From: MBleyer@DEFiNiENS.com (Bleyer, Michael)
Date: Fri, 19 Jul 2002 11:06:13 +0200
Subject: [I18n-sig] Passing unicode strings to file system calls
Message-ID: <023814FAC196D5119E4A00D0B76C0F98052127@mmuc.definiens.com>

> For that, you need to give a definition of "correct". From
> your description, I'd say that encoding the strings as
> "utf-8" is also "correct" - it gives you byte strings that
> identify the original file names.

True. But then UTF-8 is always "correct", because any Unicode string can be converted to UTF-8. I guess most OSs use some other encoding for display though.

> - does your problem really have to do with file names? Or can it be
> considered as independent of the problem of file names?

I guess it's not only file names but any system call.

> - would it help if, for each Unicode character, there was a list of
> encodings that can represent that character?

Yup. If I have:

    myUString = u''

I'd like a function that returns a list of legal encodings for that string, e.g.
    myLegalEncodingList = locale.getLegalEncodings(myUString)

The list would be something like ['cp1250', 'latin-1', 'utf8'], for example. A function that only works with a single Unicode character would be good enough, I guess.

Now if I have a unicode string, I would try to convert it to the system default encoding first, and if that doesn't work, I would like to give the user some feedback and maybe some choice (from a list of legal encodings) over which encoding to use instead.

Does that make sense?

Mike

From martin@v.loewis.de Fri Jul 19 17:14:27 2002
From: martin@v.loewis.de (Martin v. Loewis)
Date: 19 Jul 2002 18:14:27 +0200
Subject: [I18n-sig] Passing unicode strings to file system calls
In-Reply-To: <023814FAC196D5119E4A00D0B76C0F98052127@mmuc.definiens.com>
References: <023814FAC196D5119E4A00D0B76C0F98052127@mmuc.definiens.com>
Message-ID: 

"Bleyer, Michael" writes:

> I'd like a function that returns a list of legal encodings for that string,
> e.g.
> myLegalEncodingList = locale.getLegalEncodings(myUString)
>
> The list would be something like
> ['cp1250', 'latin-1', 'utf8'], for example.

What has the locale to do with that? The set of encodings that can encode a particular Unicode character is independent of the locale...

> Now if I have a unicode string, I would try to convert it to the system
> default encoding first, and if that doesn't work, I would like to give the
> user some feedback and maybe some choice (from a list of legal encodings)
> over which encoding to use instead.
>
> Does that make sense?

Yes, that's a good approach. However, I'd recommend using the locale's encoding instead of the system default encoding, as reported by nl_langinfo(CODESET). As for computing the list of possible encodings: I'm not sure what the best data format for that would be.

Regards,
Martin

From perky@FreeBSD.org Tue Jul 23 20:06:25 2002
From: perky@FreeBSD.org (Hye-Shik Chang)
Date: Wed, 24 Jul 2002 04:06:25 +0900
Subject: [I18n-sig] KoreanCodecs 2.0.5 Released
Message-ID: <20020723190625.GA68013@fallin.lv>

Hello!

I've just released KoreanCodecs 2.0.5. This version has the following changes from the previous version:

- Add two new characters which were introduced by KSX1001-1998 (euro symbol and registered mark)
- Raise ValueError instead of UnicodeError when the keyword argument "errors" is invalid.
- hangul.isJaeum and hangul.isMoeum test the entire string, same as str.isdigit and its friends do.

As always, you can download it from http://sourceforge.net/project/showfiles.php?group_id=46747

Thank you for listening! :)

-- 
Hye-Shik Chang
Yonsei University, Seoul
^D

From barry@python.org Thu Jul 25 07:02:05 2002
From: barry@python.org (Barry A. Warsaw)
Date: Thu, 25 Jul 2002 02:02:05 -0400
Subject: [I18n-sig] JapaneseCodecs 1.4.7 released
References: <200207121650.g6CGo4N15961@grad.sccs.chukyo-u.ac.jp>
Message-ID: <15679.37981.879403.546544@anthem.wooz.org>

>>>>> "TK" == Tamito KAJIYAMA writes:

TK> I've released JapaneseCodecs 1.4.7. As usual, the source
TK> tarball is available at the following location:

TK> Encoders and decoders now raise a ValueError instead of
TK> UnicodeError if their optional argument "errors" has an
TK> invalid value. Thanks Walter for reminding me!

I just realized we still have a problem with this distutils package. It still insists on installing japanese.pth in /usr/local/lib/python-2.2/site-packages even if I include --install-lib and --install-purelib switches to the "python setup.py install" command.
This is bad because in Mailman, I don't want to pollute the Python distribution with the packages I bundle, so I install them in (Mailman's) $prefix/pythonlib directory. But because JapaneseCodecs's setup.py leaves *only* the japanese.pth file in site-packages, people's Python installations are now broken. I do the same thing with the email and KoreanCodecs packages, and they work fine.

Is there some way we can stop the japanese.pth file from getting installed in the site-packages directory? (I vaguely remember this issue came up once before and I thought the problem had been solved, but it clearly is a problem w/ 1.4.7.)

Cheers,
-Barry

From martin@v.loewis.de Thu Jul 25 08:50:56 2002
From: martin@v.loewis.de (Martin v. Loewis)
Date: 25 Jul 2002 09:50:56 +0200
Subject: [I18n-sig] JapaneseCodecs 1.4.7 released
In-Reply-To: <15679.37981.879403.546544@anthem.wooz.org>
References: <200207121650.g6CGo4N15961@grad.sccs.chukyo-u.ac.jp> <15679.37981.879403.546544@anthem.wooz.org>
Message-ID: 

barry@python.org (Barry A. Warsaw) writes:

> I just realized we still have a problem with this distutils package.
> It still insists on installing japanese.pth in
> /usr/local/lib/python-2.2/site-packages even if I include
> --install-lib and --install-purelib switches to the "python setup.py
> install" command.

Since japanese.pth is processed as a 'data' file, you have two options:

1. Only invoke the install_lib command, not the install command. This will then avoid the install_headers, install_scripts, and install_data commands (the first two not being used here, anyway).

2. Provide the --install-data= argument to the install command, to specify an alternative prefix for data files.

Regards,
Martin

From barry@python.org Thu Jul 25 14:22:27 2002
From: barry@python.org (Barry A. Warsaw)
Date: Thu, 25 Jul 2002 09:22:27 -0400
Subject: [I18n-sig] JapaneseCodecs 1.4.7 released
References: <200207121650.g6CGo4N15961@grad.sccs.chukyo-u.ac.jp> <15679.37981.879403.546544@anthem.wooz.org>
Message-ID: <15679.64403.935865.272438@anthem.wooz.org>

[Adding distutils-sig@python.org]

>>>>> "MvL" == Martin v Loewis writes:

>> I just realized we still have a problem with this distutils
>> package. It still insists on installing japanese.pth in
>> /usr/local/lib/python-2.2/site-packages even if I include
>> --install-lib and --install-purelib switches to the "python
>> setup.py install" command.

MvL> Since japanese.pth is processed as a 'data' file, you have
MvL> two options:

MvL> 1. Only invoke the install_lib command, not the
MvL> install command. This will then avoid the install_headers,
MvL> install_scripts, and install_data commands (the first two not
MvL> being used here, anyway).

MvL> 2. Provide the --install-data= argument to the install
MvL> command, to specify an alternative prefix for data files.

I think I'm going to go with your second suggestion, since it fits in better with what I've already got.

ObDistutils: the install command has a --root option and a --home option, both of which would seem to do what I want, but neither quite do. E.g. invoking install with --root=/tmp/foo leaves me with

    /tmp/foo/usr/local/lib/python2.1/site-packages/

and invoking with --home=/tmp/foo leaves me with

    /tmp/foo/lib/python/

What I really want is an option to leave me with

    /tmp/foo/

so that I can put /tmp/foo on sys.path and be done with it, yet still guarantee that distutils will only install files under /tmp/foo and nowhere else.
It seems that I'm left with this as my best option:

    % python setup.py install --install-lib /tmp/foo --install-purelib \
      /tmp/foo --install-data /tmp/foo

That still leaves me with

    /tmp/foo/lib/pythonX.Y/site-packages/japanese.pth

but I'll ignore that for now.

Am I whacked not to want those extra directories in what I have to set my PYTHONPATH to? Maybe I'm just bucking the natural order of things, but I still think I'd like an install command option that collapses those three options into one. If distutils is going to be used to install stuff in a site-packages override directory, or in a user-specific search-first directory, I think we need to make this simpler.

but-maybe-I'm-insane-ly y'rs,
-Barry