From goodger@users.sourceforge.net Mon Jul 1 19:23:36 2002
From: goodger@users.sourceforge.net (David Goodger)
Date: Mon, 01 Jul 2002 14:23:36 -0400
Subject: [I18n-sig] encoding support for Docutils: please review
In-Reply-To: 
Message-ID: 

Thanks for your reply, Martin.

> I'd reorder this: (try command line). Try ASCII first, then UTF-8. If
> ASCII passes, it most likely is ASCII. If not, and UTF-8 passes, it
> most likely is UTF-8. Then try the locale's encoding.

Out of curiosity, is there any point in trying both ASCII and UTF-8? UTF-8 is a strict superset of ASCII, so shouldn't checking UTF-8 alone be enough for both? If we don't care what the original encoding was (we just want Unicode text to process), does explicitly checking for ASCII buy us anything?

-- 
David Goodger
Open-source projects:
- Python Docutils: http://docutils.sourceforge.net/
  (includes reStructuredText: http://docutils.sf.net/rst.html)
- The Go Tools Project: http://gotools.sourceforge.net/

From Matt Gushee Mon Jul 1 19:30:21 2002
From: Matt Gushee (Matt Gushee)
Date: Mon, 1 Jul 2002 12:30:21 -0600
Subject: [I18n-sig] encoding support for Docutils: please review
In-Reply-To: 
References: 
Message-ID: <20020701183021.GC361@swordfish.havenrock.com>

On Mon, Jul 01, 2002 at 02:23:36PM -0400, David Goodger wrote:
> Thanks for your reply, Martin.
>
> > I'd reorder this: (try command line). Try ASCII first, then UTF-8. If
> > ASCII passes, it most likely is ASCII.

Unless it's Shift-JIS.

-- 
Matt Gushee
Englewood, Colorado, USA
mgushee@havenrock.com
http://www.havenrock.com/

From martin@v.loewis.de Mon Jul 1 21:26:38 2002
From: martin@v.loewis.de (Martin v. Loewis)
Date: 01 Jul 2002 22:26:38 +0200
Subject: [I18n-sig] encoding support for Docutils: please review
In-Reply-To: 
References: 
Message-ID: 

David Goodger writes:

> Out of curiosity, is there any point in trying both ASCII and UTF-8? UTF-8
> is a strict superset of ASCII, so shouldn't checking UTF-8 alone be enough
> for both? If we don't care what the original encoding was (we just want
> Unicode text to process), does explicitly checking for ASCII buy us
> anything?

The answer to the last question is "no". The point in checking ASCII specifically is that you then know that it is strictly ASCII (unless it is iso-2022-jp, that is); if that is not interesting to know, there is no point.

Regards,
Martin

From Misha.Wolf@reuters.com Fri Jul 5 22:30:26 2002
From: Misha.Wolf@reuters.com (Misha.Wolf@reuters.com)
Date: Fri, 05 Jul 2002 22:30:26 +0100
Subject: [I18n-sig] 22nd Unicode Conference, Sep 2002, San Jose, CA -- Register now!
Message-ID: 

***********************************************************************
Register now! > Just 9 weeks to go > Register now! > Just 9 weeks to go
***********************************************************************

Twenty-second International Unicode Conference (IUC22)
Unicode and the Web: Evolution or Revolution?
http://www.unicode.org/iuc/iuc22
September 9-13, 2002
San Jose, California

***********************************************************************
Full program now live! >> Five days of 3 tracks! >> Check the Web site!
***********************************************************************

NEWS

> Visit the Conference Web site ( http://www.unicode.org/iuc/iuc22 ) to
  check the Conference program and register. To help you choose
  Conference sessions, we've included abstracts of talks and speakers'
  biographies.
> Hotel guest room group rate valid to 16 August.
> Early bird registration rate valid to 16 August.
CONFERENCE SPONSORS

Agfa Monotype Corporation
Basis Technology Corporation
Microsoft Corporation
Netscape Communications
Oracle Corporation
Reuters Ltd.
Sun Microsystems, Inc.
World Wide Web Consortium (W3C)

GLOBAL COMPUTING SHOWCASE

Visit the Showcase to find out more about products supporting the Unicode Standard, and products and services that can help you globalize/localize your software, documentation and Internet content. For details, visit the Conference Web site.

CONFERENCE VENUE

The Conference will take place at:
DoubleTree Hotel San Jose
2050 Gateway Place
San Jose, CA 95110 USA
Tel: +1 408 453 4000
Fax: +1 408 437 2898

CONFERENCE MANAGEMENT

Global Meeting Services Inc.
8949 Lombard Place, #416
San Diego, CA 92122, USA
Tel: +1 858 638 0206 (voice)
     +1 858 638 0504 (fax)
Email: info@global-conference.com
   or: conference@unicode.org

THE UNICODE CONSORTIUM

The Unicode Consortium was founded as a non-profit organization in 1991. It is dedicated to the development, maintenance and promotion of The Unicode Standard, a worldwide character encoding. The Unicode Standard encodes the characters of the world's principal scripts and languages, and is code-for-code identical to the international standard ISO/IEC 10646. In addition to cooperating with ISO on the future development of ISO/IEC 10646, the Consortium is responsible for providing character properties and algorithms for use in implementations. Today the membership base of the Unicode Consortium includes major computer corporations, software producers, database vendors, research institutions, international agencies and various user groups.

For further information on the Unicode Standard, visit the Unicode Web site at http://www.unicode.org or e-mail

* * * * *

Unicode(r) and the Unicode logo are registered trademarks of Unicode, Inc. Used with permission.

-------------------------------------------------------------
---
Visit our Internet site at http://www.reuters.com

Any views expressed in this message are those of the individual sender, except where the sender specifically states them to be the views of Reuters Ltd.

From kajiyama@grad.sccs.chukyo-u.ac.jp Fri Jul 12 17:50:04 2002
From: kajiyama@grad.sccs.chukyo-u.ac.jp (Tamito KAJIYAMA)
Date: Sat, 13 Jul 2002 01:50:04 +0900
Subject: [I18n-sig] JapaneseCodecs 1.4.7 released
Message-ID: <200207121650.g6CGo4N15961@grad.sccs.chukyo-u.ac.jp>

Hi,

I've released JapaneseCodecs 1.4.7. As usual, the source tarball is available at the following location:

    http://www.python.jp/Zope/download/JapaneseCodecs (in Japanese)
    http://www.python.jp/Zope/download/JapaneseCodecs/JapaneseCodecs-1.4.7.tar.gz

Encoders and decoders now raise a ValueError instead of UnicodeError if their optional argument "errors" has an invalid value. Thanks Walter for reminding me!

Regards,

-- 
KAJIYAMA, Tamito

From barry@zope.com Sat Jul 13 00:24:37 2002
From: barry@zope.com (Barry A. Warsaw)
Date: Fri, 12 Jul 2002 19:24:37 -0400
Subject: [I18n-sig] JapaneseCodecs 1.4.7 released
References: <200207121650.g6CGo4N15961@grad.sccs.chukyo-u.ac.jp>
Message-ID: <15663.25909.341858.861899@anthem.wooz.org>

>>>>> "TK" == Tamito KAJIYAMA writes:

TK> Hi,

TK> I've released JapaneseCodecs 1.4.7.
TK> As usual, the source tarball is available at the following location:
TK> http://www.python.jp/Zope/download/JapaneseCodecs (in Japanese)
TK> http://www.python.jp/Zope/download/JapaneseCodecs/JapaneseCodecs-1.4.7.tar.gz

TK> Encoders and decoders now raise a ValueError instead of
TK> UnicodeError if their optional argument "errors" has an
TK> invalid value. Thanks Walter for reminding me!

Thanks for the update; I've installed it in the Mailman project.

-Barry

From MBleyer@DEFiNiENS.com Wed Jul 17 17:16:39 2002
From: MBleyer@DEFiNiENS.com (Bleyer, Michael)
Date: Wed, 17 Jul 2002 18:16:39 +0200
Subject: [I18n-sig] Passing unicode strings to file system calls
Message-ID: <023814FAC196D5119E4A00D0B76C0F98052123@mmuc.definiens.com>

Assume I have a list of unicode strings in UTF-16-le. Reading and parsing the list all works really fine.

Now I want to create/copy a number of files and I want the file/directory names to be these unicode strings. When I give a unicode string to a file system call like shutil.copy() or os.mkdir(), Python converts the unicode string to a "regular" string using the default site encoding (which usually fails when it is 'ascii'). I can influence this by encode()'ing myself before I pass the string to the system function call, so far so good.

However, I do have a problem if I have unicode strings from different, incompatible encodings in my list (e.g. ISO Latin-1 and some Asian encoding), as I cannot use the same encoding conversion for all strings; some will fail. I can of course convert to UTF-8, which will always work, but the filenames turn out to be garbage (because the OS does not interpret them as UTF-8 but in the local encoding).

My question is thus: since modern-day operating systems claim to support unicode (I assume) in filenames, how do I pass a unicode string directly to a system function call without having to convert to a "localized" encoding? Alternatively, how can I find out the "proper" or "legal" encoding for a unicode string just by looking at the string (i.e. not with a brute-force try-encode-except trial and error loop)?

As a side problem: how do I deal with filename length limits, since these are actually byte limits, not character limits? If I do a u''[:255] followed by an encode, I end up with a unicode string that's at most 255 characters long, but may be longer than 255 bytes after encoding. If I do encode followed by ''[:255], I get at most 255 bytes, but my string may be illegal because I cut off in the middle of a 3-byte character.

Any insights and suggestions greatly appreciated.

Mike

From mal@lemburg.com Wed Jul 17 17:50:59 2002
From: mal@lemburg.com (M.-A. Lemburg)
Date: Wed, 17 Jul 2002 18:50:59 +0200
Subject: [I18n-sig] Passing unicode strings to file system calls
References: <023814FAC196D5119E4A00D0B76C0F98052123@mmuc.definiens.com>
Message-ID: <3D35A073.3060602@lemburg.com>

Bleyer, Michael wrote:
> Assume I have a list of unicode strings in UTF-16-le. Reading and parsing
> the list all works really fine.
>
> Now I want to create/copy a number of files and I want the file/directory
> names to be these unicode strings.
> When I give a unicode string to a file system call like
> shutil.copy()
> or
> os.mkdir()
> Python converts the unicode string to a "regular" string using the default
> site encoding (which usually fails when it is 'ascii').
> I can influence this by encode()'ing myself before I pass the string to the
> system function call, so far so good.
>
> However, I do have a problem if I have unicode strings from different,
> incompatible encodings in my list (e.g. ISO Latin-1 and some Asian
> encoding), as I cannot use the same encoding conversion for all strings;
> some will fail. I can of course convert to UTF-8, which will always work,
> but the filenames turn out to be garbage (because the OS does not interpret
> them as UTF-8 but in the local encoding).
>
> My question is thus: since modern-day operating systems claim to support
> unicode (I assume) in filenames, how do I pass a unicode string directly to
> a system function call without having to convert to a "localized" encoding?

Python 2.2 tries to automagically encode Unicode into the encoding used by the OS. This only works if Python can figure out this encoding. AFAIK, only Windows platforms are supported.

> Alternatively, how can I find out the "proper" or "legal" encoding for a
> unicode string just by looking at the string (i.e. not with a brute-force
> try-encode-except trial and error loop)?

If you know the encoding used by the file system, then you should simply encode the Unicode filename using that encoding.

> As a side problem: how do I deal with filename length limits, since these
> are actually byte limits, not character limits?
> If I do a u''[:255] followed by an encode, I end up with a unicode string
> that's at most 255 characters long, but may be longer than 255 bytes after
> encoding.
> If I do encode followed by ''[:255], I get at most 255 bytes, but my string
> may be illegal because I cut off in the middle of a 3-byte character.

Good question. You could try stripping after the encoding and then have Python decode the result using the 'ignore' error handling. That should give you the maximum-sized Unicode string to use for encoding.

-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
_______________________________________________________________________
eGenix.com -- Makers of the Python mx Extensions: mxDateTime,mxODBC,...
Python Consulting: http://www.egenix.com/
Python Software: http://www.egenix.com/files/python/

From martin@v.loewis.de Wed Jul 17 19:25:30 2002
From: martin@v.loewis.de (Martin v. Loewis)
Date: 17 Jul 2002 20:25:30 +0200
Subject: [I18n-sig] Passing unicode strings to file system calls
In-Reply-To: <023814FAC196D5119E4A00D0B76C0F98052123@mmuc.definiens.com>
References: <023814FAC196D5119E4A00D0B76C0F98052123@mmuc.definiens.com>
Message-ID: 

"Bleyer, Michael" writes:

> My question is thus: since modern-day operating systems claim to support
> unicode (I assume) in filenames

That is not really true. WinNT and MacOS do. Unix only supports byte-based file names, and there is an ongoing debate on how those should be used to represent non-ASCII in file names. The convention seems to be that the locale's encoding should be assumed for file names.

As MAL explains, you can pass Unicode file names automatically in Python 2.2; you might need to invoke locale.setlocale for this to work properly.

> Alternatively, how can I find out the "proper" or "legal" encoding for a
> unicode string just by looking at the string (i.e. not with a brute-force
> try-encode-except trial and error loop)?

For this, you need to tell us what system you use.

> As a side problem: how do I deal with filename length limits, since these
> are actually byte limits, not character limits?

Again, depends on the system. As a starting point, you need to find out what the limit is.
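A minimal sketch of the encode-truncate-decode approach suggested above, in Python 2.2-era style (the 255-byte limit and the helper name truncate_filename are illustrative, not part of any stdlib API):

    def truncate_filename(uname, encoding, maxbytes=255):
        # Encode first, cut at the byte limit, then decode with
        # 'ignore' so that a character cut in half at the boundary
        # is dropped instead of leaving an illegal partial sequence.
        raw = uname.encode(encoding)[:maxbytes]
        return unicode(raw, encoding, 'ignore').encode(encoding)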
> If I do a u''[:255] followed by an encode, I end up with a unicode string
> that's at most 255 characters long, but may be longer than 255 bytes after
> encoding.

Also, the limit might be smaller than 255.

> If I do encode followed by ''[:255], I get at most 255 bytes, but my string
> may be illegal because I cut off in the middle of a 3-byte character.

If truncation is acceptable, I recommend truncating to 50% of the maximum size, and asserting that the encoded result is smaller than the maximum size. You can try to be smart and use binary search to find the largest acceptable character string.

Regards,
Martin

From martin@v.loewis.de Wed Jul 17 19:27:07 2002
From: martin@v.loewis.de (Martin v. Loewis)
Date: 17 Jul 2002 20:27:07 +0200
Subject: [I18n-sig] Passing unicode strings to file system calls
In-Reply-To: <3D35A073.3060602@lemburg.com>
References: <023814FAC196D5119E4A00D0B76C0F98052123@mmuc.definiens.com> <3D35A073.3060602@lemburg.com>
Message-ID: 

"M.-A. Lemburg" writes:

> Python 2.2 tries to automagically encode Unicode into the
> encoding used by the OS. This only works if Python can figure
> out this encoding. AFAIK, only Windows platforms are supported.

No; it works on Unix as well (if nl_langinfo(CODESET) is supported); you need to invoke setlocale to activate this support (in particular, the LC_CTYPE category).

Regards,
Martin

From mal@lemburg.com Wed Jul 17 19:38:52 2002
From: mal@lemburg.com (M.-A. Lemburg)
Date: Wed, 17 Jul 2002 20:38:52 +0200
Subject: [I18n-sig] Passing unicode strings to file system calls
References: <023814FAC196D5119E4A00D0B76C0F98052123@mmuc.definiens.com> <3D35A073.3060602@lemburg.com>
Message-ID: <3D35B9BC.8020004@lemburg.com>

Martin v. Loewis wrote:
> "M.-A. Lemburg" writes:
>
>>Python 2.2 tries to automagically encode Unicode into the
>>encoding used by the OS. This only works if Python can figure
>>out this encoding. AFAIK, only Windows platforms are supported.
>
> No; it works on Unix as well (if nl_langinfo(CODESET) is supported);
> you need to invoke setlocale to activate this support (in particular,
> the LC_CTYPE category).

You mean: call setlocale() to set something or fetch the encoding from it ? Setting a locale to something other than "C" will cause quite a few semantic changes, so you should beware...

Note there's also locale.getdefaultlocale() which works on many platforms and returns the default locale and encoding for the platform Python currently runs on.

BTW, running "python locale.py" prints your current settings.

-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
_______________________________________________________________________
eGenix.com -- Makers of the Python mx Extensions: mxDateTime,mxODBC,...
Python Consulting: http://www.egenix.com/
Python Software: http://www.egenix.com/files/python/

From martin@v.loewis.de Wed Jul 17 19:54:07 2002
From: martin@v.loewis.de (Martin v. Loewis)
Date: 17 Jul 2002 20:54:07 +0200
Subject: [I18n-sig] Passing unicode strings to file system calls
In-Reply-To: <3D35B9BC.8020004@lemburg.com>
References: <023814FAC196D5119E4A00D0B76C0F98052123@mmuc.definiens.com> <3D35A073.3060602@lemburg.com> <3D35B9BC.8020004@lemburg.com>
Message-ID: 

"M.-A. Lemburg" writes:

> You mean: call setlocale() to set something or fetch the
> encoding from it ? Setting a locale to something other than
> "C" will cause quite a few semantic changes, so you should
> beware...

Indeed. However, setting the locale may be the only way to find out what the locale's encoding is.
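A minimal sketch of that approach on Unix, assuming a Python build where locale.nl_langinfo and locale.CODESET are available:

    import locale
    # Set only LC_CTYPE from the user's environment, to limit the
    # semantic side effects of switching away from the "C" locale.
    locale.setlocale(locale.LC_CTYPE, '')
    encoding = locale.nl_langinfo(locale.CODESET)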
> Note there's also locale.getdefaultlocale()

That is broken beyond repair, and should not be used for anything. It can't possibly work.

> which works on many platforms and returns the default locale and
> encoding for the platform Python currently runs on.

In particular when it comes to the locale's encoding, it has no chance to work correctly, except on Windows.

Regards,
Martin

From mal@lemburg.com Wed Jul 17 21:23:54 2002
From: mal@lemburg.com (M.-A. Lemburg)
Date: Wed, 17 Jul 2002 22:23:54 +0200
Subject: [I18n-sig] Passing unicode strings to file system calls
References: <023814FAC196D5119E4A00D0B76C0F98052123@mmuc.definiens.com> <3D35A073.3060602@lemburg.com> <3D35B9BC.8020004@lemburg.com>
Message-ID: <3D35D25A.4080603@lemburg.com>

Martin v. Loewis wrote:
> "M.-A. Lemburg" writes:
>
>>You mean: call setlocale() to set something or fetch the
>>encoding from it ? Setting a locale to something other than
>>"C" will cause quite a few semantic changes, so you should
>>beware...
>
> Indeed. However, setting the locale may be the only way to find out
> what the locale's encoding is.
>
>>Note there's also locale.getdefaultlocale()
>
> That is broken beyond repair, and should not be used for anything. It
> can't possibly work.

Hmm, why is that ?

>>which works on many platforms and returns the default locale and
>>encoding for the platform Python currently runs on.
>
> In particular when it comes to the locale's encoding, it has no chance
> to work correctly, except on Windows.

There's a large database in locale.py for this and a few support APIs which make use of it. It would probably be worthwhile to add an interface encoding(localename) which only returns the encoding used by default for that locale.

-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
_______________________________________________________________________
eGenix.com -- Makers of the Python mx Extensions: mxDateTime,mxODBC,...
Python Consulting: http://www.egenix.com/
Python Software: http://www.egenix.com/files/python/

From martin@v.loewis.de Wed Jul 17 22:13:01 2002
From: martin@v.loewis.de (Martin v. Loewis)
Date: 17 Jul 2002 23:13:01 +0200
Subject: [I18n-sig] Passing unicode strings to file system calls
In-Reply-To: <3D35D25A.4080603@lemburg.com>
References: <023814FAC196D5119E4A00D0B76C0F98052123@mmuc.definiens.com> <3D35A073.3060602@lemburg.com> <3D35B9BC.8020004@lemburg.com> <3D35D25A.4080603@lemburg.com>
Message-ID: 

"M.-A. Lemburg" writes:

> > That is broken beyond repair, and should not be used for anything. It
> > can't possibly work.
>
> Hmm, why is that ?

It tries to find out locale information from environment variables. That is bound to fail because:

- it may not know what variables to consider. In particular, on Unix,
  it tries LANGUAGE, LC_ALL, LC_CTYPE, and LANG. In doing so, it makes
  a number of errors when trying to find the encoding:

  - if LANGUAGE is set, it is used to determine the encoding. This is
    incorrect; LANGUAGE cannot be used for that. For example, with
    LANGUAGE=german LANG=de_DE.UTF-8, it returns
      ['de_DE', 'ISO8859-1']
    This is incorrect; the encoding should have been UTF-8

  - it misses that LANGUAGE can contain colons to denote
    fallbacks, on GNU/Linux; with
    LANGUAGE=german:french LANG=de_DE.UTF-8, it returns
      ['de_DE', 'french']
    This is even worse: french is not the name of an encoding

- it may not know the syntax of the environment variables. For
  example, the current implementation breaks for "de_DE@euro"; this is
  an SF bug report.
- it may not know the encoding associated with a locale. For example,
  for de_DE@euro, it is Latin-9 on Linux today, but might be UTF-8 on
  some other system. Likewise, locale.py just *knows* that de_DE means
  ".iso-8859-1" on any system - that can easily be wrong.

- the language name returned from getdefaultlocale is incorrect on
  Windows, see
  http://groups.google.com/groups?selm=917pjb%24ii2%241%40reader1.imaginet.fr
  Users apparently expect that they can pass the result of
  getdefaultlocale to setlocale, but this is not the case.

> There's a large database in locale.py for this and a few
> support APIs which make use of it.

That is the major problem. This database is incorrect, cannot be corrected, and is both unmaintained and unmaintainable.

> It would probably be worthwhile to add an interface
> encoding(localename) which only returns the encoding used by
> default for that locale.

I would make this getencoding(), and document that you need to call setlocale before, to make use of the user settings. The official way, on Unix, to obtain the locale's encoding is to use nl_langinfo(CODESET), which only works if the LC_CTYPE facet has been set. On Windows, locale._getdefaultlocale fortunately already returns the current codeset (which isn't influenced by setlocale, anyway).

Regards,
Martin

From mal@lemburg.com Wed Jul 17 22:30:51 2002
From: mal@lemburg.com (M.-A. Lemburg)
Date: Wed, 17 Jul 2002 23:30:51 +0200
Subject: [I18n-sig] Passing unicode strings to file system calls
References: <023814FAC196D5119E4A00D0B76C0F98052123@mmuc.definiens.com> <3D35A073.3060602@lemburg.com> <3D35B9BC.8020004@lemburg.com> <3D35D25A.4080603@lemburg.com>
Message-ID: <3D35E20B.8080103@lemburg.com>

Martin v. Loewis wrote:
> "M.-A. Lemburg" writes:
>
>>>That is broken beyond repair, and should not be used for anything. It
>>>can't possibly work.
>>
>>Hmm, why is that ?
>
> It tries to find out locale information from environment
> variables. That is bound to fail because:
>
> - it may not know what variables to consider. In particular, on Unix,
>   it tries LANGUAGE, LC_ALL, LC_CTYPE, and LANG. In doing so, it makes
>   a number of errors when trying to find the encoding:

That's the search order which GNU readline uses (at least at the time I wrote the code).

> - if LANGUAGE is set, it is used to determine the encoding. This is
>   incorrect; LANGUAGE cannot be used for that. For example, with
>   LANGUAGE=german LANG=de_DE.UTF-8, it returns
>     ['de_DE', 'ISO8859-1']
>   This is incorrect; the encoding should have been UTF-8
>
> - it misses that LANGUAGE can contain colons to denote
>   fallbacks, on GNU/Linux; with
>   LANGUAGE=german:french LANG=de_DE.UTF-8, it returns
>     ['de_DE', 'french']
>   This is even worse: french is not the name of an encoding

Interesting. Is the format documented somewhere ? It should be easy to fix this.

> - it may not know the syntax of the environment variables. For
>   example, the current implementation breaks for "de_DE@euro"; this is
>   an SF bug report.

This should be fixable too. What does the '@euro' mean ? Does it have to do with currency ?

Sure, but you normally only get the locale name and then have to make an educated guess for the encoding.
If the encoding is known (e.g. by looking at the LANG environment variable), then that information should override the database information.

> - the language name returned from getdefaultlocale is incorrect on
>   Windows, see
>   http://groups.google.com/groups?selm=917pjb%24ii2%241%40reader1.imaginet.fr
>   Users apparently expect that they can pass the result of
>   getdefaultlocale to setlocale, but this is not the case.

Hmm, the names returned by getdefaultlocale() and normalize() are standards. I wonder what Windows expects to see for setlocale().

>>There's a large database in locale.py for this and a few
>>support APIs which make use of it.
>
> That is the major problem. This database is incorrect, cannot be
> corrected, and is both unmaintained and unmaintainable.

I'd say, it's better than nothing :-)

>>It would probably be worthwhile to add an interface
>>encoding(localename) which only returns the encoding used by
>>default for that locale.
>
> I would make this getencoding(), and document that you need to call
> setlocale before, to make use of the user settings. The official way,
> on Unix, to obtain the locale's encoding is to use
> nl_langinfo(CODESET), which only works if the LC_CTYPE facet has been
> set. On Windows, locale._getdefaultlocale fortunately already returns
> the current codeset (which isn't influenced by setlocale, anyway).

Fine.

-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
_______________________________________________________________________
eGenix.com -- Makers of the Python mx Extensions: mxDateTime,mxODBC,...
Python Consulting: http://www.egenix.com/
Python Software: http://www.egenix.com/files/python/

From martin@v.loewis.de Wed Jul 17 23:11:10 2002
From: martin@v.loewis.de (Martin v. Loewis)
Date: 18 Jul 2002 00:11:10 +0200
Subject: [I18n-sig] Passing unicode strings to file system calls
In-Reply-To: <3D35E20B.8080103@lemburg.com>
References: <023814FAC196D5119E4A00D0B76C0F98052123@mmuc.definiens.com> <3D35A073.3060602@lemburg.com> <3D35B9BC.8020004@lemburg.com> <3D35D25A.4080603@lemburg.com> <3D35E20B.8080103@lemburg.com>
Message-ID: 

"M.-A. Lemburg" writes:

> > - it may not know what variables to consider. In particular, on Unix,
> >   it tries LANGUAGE, LC_ALL, LC_CTYPE, and LANG. In doing so, it makes
> >   a number of errors when trying to find the encoding:
>
> That's the search order which GNU readline uses (at least
> at the time I wrote the code).

GNU readline does not check LANGUAGE, and it uses setlocale if available (so you are talking about rarely-used fallback code).

> > - it misses that LANGUAGE can contain colons to denote
> >   fallbacks, on GNU/Linux; with
> >   LANGUAGE=german:french LANG=de_DE.UTF-8, it returns
> >     ['de_DE', 'french']
> >   This is even worse: french is not the name of an encoding
>
> Interesting. Is the format documented somewhere ? It should be
> easy to fix this.

Of LANGUAGE? I believe it's documented in the gettext documentation.

> > - it may not know the syntax of the environment variables. For
> >   example, the current implementation breaks for "de_DE@euro"; this is
> >   an SF bug report.
>
> This should be fixable too. What does the '@euro' mean ? Does it
> have to do with currency ?

In a way. It is a "locale variant". A variant could be just about anything. Common variants are @euro (used to denote the variant that has the Euro for LC_MONETARY), @nynorsk (used to tell apart the two Norwegian languages - now nb and no), and @xim, used for X Input Methods (like @xim=kinput2). It could be used for many other things, too.
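A sketch of what parsing such locale names involves, assuming the common language[_territory][.codeset][@modifier] syntax (illustrative only; as the next paragraph points out, the variant itself does not reveal the encoding):

    import re
    # language, optional territory, optional codeset, optional modifier
    _locale_re = re.compile(r'^([a-zA-Z]+)(?:_([a-zA-Z]+))?'
                            r'(?:\.([^@]+))?(?:@(.+))?$')
    print _locale_re.match('de_DE@euro').groups()
    # -> ('de', 'DE', None, 'euro')
    print _locale_re.match('fr_FR.cp1252').groups()
    # -> ('fr', 'FR', 'cp1252', None)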
You can fix the parsing of the variants, but you cannot infer the encoding.

> Sure, but you normally only get the locale name and then
> have to make an educated guess for the encoding.

That is my point: This algorithm must guess, and it *will* guess wrong.

> If the encoding is known (e.g. by looking at the LANG environment
> variable), then that information should override the database
> information.

In this specific case (of the @euro domains), the LANG variable does not explicitly mention the encoding. So that doesn't help.

> Hmm, the names returned by getdefaultlocale() and normalize()
> are standards. I wonder what Windows expects to see for
> setlocale().

What standards? Posix? That has never impressed Microsoft. Instead of "fr_FR.cp1252", they accept "French_France.1252". That may even be Posix-conforming, though, which allows "<language>_<territory>.<codeset>".

Locale names are *not* standard. An algorithm that assumes that they are is broken.

> I'd say, it's better than nothing :-)

Yes, that's why I propose to provide a replacement, and then deprecate the existing function.

Regards,
Martin

From martin@v.loewis.de Thu Jul 18 16:14:33 2002
From: martin@v.loewis.de (Martin v. Loewis)
Date: 18 Jul 2002 17:14:33 +0200
Subject: [I18n-sig] Passing unicode strings to file system calls
In-Reply-To: <3D36BA55.1000802@lemburg.com>
References: <023814FAC196D5119E4A00D0B76C0F98052123@mmuc.definiens.com> <3D35A073.3060602@lemburg.com> <3D35B9BC.8020004@lemburg.com> <3D35D25A.4080603@lemburg.com> <3D35E20B.8080103@lemburg.com> <3D36BA55.1000802@lemburg.com>
Message-ID: 

"M.-A. Lemburg" writes:

> > You can fix the parsing of the variants, but you cannot infer the
> > encoding.
>
> Why not ? I know that several locales use more than one
> encoding for their script(s),

Which locale, on which system?

> but having at least a hint is better than no information at all.

Where do you get the hint from? And why is it better to guess a random encoding than to guess "ascii" all the time?

> I've never said that it will always guess right. AFAIK,
> there is no platform independent solution to the problem.
> I am all for adding more support for platform specific
> solutions, though.

For that, I would need to understand the meaning of getdefaultlocale first. What precisely is it supposed to return? I can understand the "encoding" part (what encoding is the user likely to use), but what is the meaning of the "language code" return value? And what can you do with that result?

> > In this specific case (of the @euro domains), the LANG variable does
> > not explicitly mention the encoding. So that doesn't help.
>
> It can be used as a hint, e.g. in Germany we use Latin-1 as
> encoding, so that's a good assumption.

That is a wrong assumption. In Germany, we use windows-1252, iso-8859-1, iso-8859-15, and UTF-8. Many modern Unix installations use Latin-9 instead of Latin-1, since Latin-1 cannot represent the currency symbol of the locale.

> >>I'd say, it's better than nothing :-)
> > Yes, that's why I propose to provide a replacement, and then
> > deprecate the existing function.
>
> Why a replacement and what kind of replacement ? It should well
> be possible to add more support to the existing APIs and
> perhaps extend them with new ones.

Because the other APIs have different usage constraints. It *is* possible to find out the user's encoding reliably on many Unix systems, but you have to invoke setlocale for that to work. Calling setlocale behind the scenes is bad, so the users have to change their code. Also, this only returns the encoding.
I don't know what the "language code" is or how to obtain it - even in a system specific way. Fortunately, I don't consider this a problem - since I can't see why anybody would want that value, either.

Regards,
Martin

From martin@v.loewis.de Thu Jul 18 16:05:02 2002
From: martin@v.loewis.de (Martin v. Loewis)
Date: 18 Jul 2002 17:05:02 +0200
Subject: [I18n-sig] Passing unicode strings to file system calls
In-Reply-To: <023814FAC196D5119E4A00D0B76C0F98052125@mmuc.definiens.com>
References: <023814FAC196D5119E4A00D0B76C0F98052125@mmuc.definiens.com>
Message-ID: 

"Bleyer, Michael" writes:

> What I want to do is create file names from a list that has strings in both
> encodings. The strings can be handled fine while in unicode, but as soon as
> I try to convert all of them to one encoding, half of the conversions will
> fail. I just want to convert them with the proper encoding and then pass the
> bytestring to the system function; I don't worry about whether it will
> _display_ right, just about whether the name is correct.

For that, you need to give a definition of "correct". From your description, I'd say that encoding the strings as "utf-8" is also "correct" - it gives you byte strings that identify the original file names.

> What I would like to have is some function that will tell me for a given
> Unicode string, a list of all the encodings that this string can be
> converted into (without having to try all available encodings in a brute
> force loop), because I do not know the proper encoding a priori.

I doubt that you can implement such a function without a "brute force" algorithm of some kind.

> Anyway, if there isn't a direct interface/solution, what would you
> consider the best workaround for Python?

Use brute force.

Perhaps I'm still not understanding your problem clearly. To understand it better, can you please answer the following questions?

- does your problem really have to do with file names? Or can it be considered as independent of the problem of file names?

- would it help if, for each Unicode character, there was a list of encodings that can represent that character?

Regards,
Martin

From yedian@worldnet.att.net Thu Jul 18 16:27:53 2002
From: yedian@worldnet.att.net (Dan Edwards)
Date: Thu, 18 Jul 2002 23:27:53 +0800
Subject: [I18n-sig] Chinese Codecs?
Message-ID: 

Is there a definitive source for Traditional and Simplified Chinese codecs for Python? I'm specifically looking for codecs to handle CP936 and CP950.

I've looked at the pythonzh project at SourceForge and it appears to be 1) abandonware, 2) not functioning as expected.

Tnx for any help on this question.

Dan

From MBleyer@DEFiNiENS.com Thu Jul 18 14:42:06 2002
From: MBleyer@DEFiNiENS.com (Bleyer, Michael)
Date: Thu, 18 Jul 2002 15:42:06 +0200
Subject: [I18n-sig] Passing unicode strings to file system calls
Message-ID: <023814FAC196D5119E4A00D0B76C0F98052125@mmuc.definiens.com>

> > Python 2.2 tries to automagically encode Unicode into the encoding
> > used by the OS. This only works if Python can figure out this
> > encoding. AFAIK, only Windows platforms are supported.
>
> No; it works on Unix as well (if nl_langinfo(CODESET) is
> supported); you need to invoke setlocale to activate this
> support (in particular, the LC_CTYPE category).

It does work on Unix as well with some caveats. However, I think maybe my original question was not clear enough.
Let's assume, for the sake of the argument, that a call to locale.getdefaultlocale()[1] will get me the system's default encoding, which I can use to encode my unicode strings so they show up properly when used in filenames.

But I know that in some areas people work with two different incompatible (non-symmetric) encodings, for example people in Japan with mixed Sun and Windows networks. They have some filenames in one encoding and some in the other. One half of the filenames used always shows up as garbage, since they cannot be displayed in the other encoding and vice versa. Let's assume that people know this and accept it.

What I want to do is create file names from a list that has strings in both encodings. The strings can be handled fine while in unicode, but as soon as I try to convert all of them to one encoding, half of the conversions will fail. I just want to convert them with the proper encoding and then pass the bytestring to the system function; I don't worry about whether it will _display_ right, just about whether the name is correct.

What I would like to have is some function that will tell me for a given Unicode string, a list of all the encodings that this string can be converted into (without having to try all available encodings in a brute force loop), because I do not know the proper encoding a priori. The system locale info will only tell me which encoding is _displayed_ properly; it does not mean that this encoding will be able to handle all my unicode strings.

I am not sure if this is a fundamental problem with Unicode: it seems to be a great way to store data, but as soon as you actually want to do anything with it you need some extra META information that is not stored in the data itself (uhm, I don't mean to rant, I realize this holds true for other formats as well). I also know that an obvious answer to this whole issue could be "just keep your data in your local encoding and avoid using unicode"; unfortunately the source format is UTF-16.

Anyway, if there isn't a direct interface/solution, what would you consider the best workaround for Python? :-)

Mike

From mal@lemburg.com Thu Jul 18 13:53:41 2002
From: mal@lemburg.com (M.-A. Lemburg)
Date: Thu, 18 Jul 2002 14:53:41 +0200
Subject: [I18n-sig] Passing unicode strings to file system calls
References: <023814FAC196D5119E4A00D0B76C0F98052123@mmuc.definiens.com> <3D35A073.3060602@lemburg.com> <3D35B9BC.8020004@lemburg.com> <3D35D25A.4080603@lemburg.com> <3D35E20B.8080103@lemburg.com>
Message-ID: <3D36BA55.1000802@lemburg.com>

Martin v. Loewis wrote:
> "M.-A. Lemburg" writes:
>
>>>- it may not know what variables to consider. In particular, on Unix,
>>>  it tries LANGUAGE, LC_ALL, LC_CTYPE, and LANG. In doing so, it makes
>>>  a number of errors when trying to find the encoding:
>>
>>That's the search order which GNU readline uses (at least
>>at the time I wrote the code).
>
> GNU readline does not check LANGUAGE, and it uses setlocale if
> available (so you are talking about rarely-used fallback code).

See the gettext man page:

    If the LANGUAGE environment variable is set to a nonempty value, and
    the locale is not the "C" locale, the value of LANGUAGE is assumed to
    contain a colon separated list of locale names. The functions will
    attempt to look up a translation of msgid in each of the locales in
    turn. This is a GNU extension.
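A sketch of the colon-separated convention the man page describes (the variable holds locale names to try in order, not encodings):

    import os
    # e.g. LANGUAGE=german:french yields ['german', 'french']
    fallbacks = os.environ.get('LANGUAGE', '').split(':')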
>>> - it misses that LANGUAGE can contain colons to denote
>>>   fallbacks, on GNU/Linux; with
>>>   LANGUAGE=german:french LANG=de_DE.UTF-8, it returns
>>>     ['de_DE', 'french']
>>>   This is even worse: french is not the name of an encoding
>>
>>Interesting. Is the format documented somewhere ? It should be
>>easy to fix this.
>
> Of LANGUAGE? I believe it's documented in the gettext documentation.

Yes. It looks as if parsing LANGUAGE is the wrong thing to do if you're looking for the default locale (i.e. the one which is used at process startup time before any calls to setlocale()).

>>>- it may not know the syntax of the environment variables. For
>>>  example, the current implementation breaks for "de_DE@euro"; this is
>>>  an SF bug report.
>>
>>This should be fixable too. What does the '@euro' mean ? Does it
>>have to do with currency ?
>
> In a way. It is a "locale variant". A variant could be just about
> anything. Common variants are @euro (used to denote the variant that
> has the Euro for LC_MONETARY), @nynorsk (used to tell apart the two
> Norwegian languages - now nb and no), and @xim, used for X Input
> Methods (like @xim=kinput2). It could be used for many other things,
> too.
>
> You can fix the parsing of the variants, but you cannot infer the
> encoding.

Why not ? I know that several locales use more than one encoding for their script(s), but having at least a hint is better than no information at all. Of course, if the system provides different means of accessing this information, then those means should be used instead.

>>Sure, but you normally only get the locale name and then
>>have to make an educated guess for the encoding.
>
> That is my point: This algorithm must guess, and it *will* guess
> wrong.

I've never said that it will always guess right. AFAIK, there is no platform independent solution to the problem. I am all for adding more support for platform specific solutions, though.

>>If the encoding is known (e.g. by looking at the LANG environment
>>variable), then that information should override the database
>>information.
>
> In this specific case (of the @euro domains), the LANG variable does
> not explicitly mention the encoding. So that doesn't help.

It can be used as a hint, e.g. in Germany we use Latin-1 as encoding, so that's a good assumption.

>>Hmm, the names returned by getdefaultlocale() and normalize()
>>are standards. I wonder what Windows expects to see for
>>setlocale().
>
> What standards? Posix? That has never impressed Microsoft. Instead of
> "fr_FR.cp1252", they accept "French_France.1252". That may even be
> Posix-conforming, though, which allows "<language>_<territory>.<codeset>".
>
> Locale names are *not* standard. An algorithm that assumes that they
> are is broken.

I didn't say that locale names are always standard. To the contrary: I added the normalize() API to locale.py to map some of the commonly used non-standard locale names to the standards compatible ones (ISO 639 language code + ISO 3166 country code).

>>I'd say, it's better than nothing :-)
>
> Yes, that's why I propose to provide a replacement, and then deprecate
> the existing function.

Why a replacement and what kind of replacement ? It should well be possible to add more support to the existing APIs and perhaps extend them with new ones.

-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
_______________________________________________________________________
eGenix.com -- Makers of the Python mx Extensions: mxDateTime,mxODBC,...
Python Consulting: http://www.egenix.com/
Python Software: http://www.egenix.com/files/python/

From barry@zope.com Thu Jul 18 19:02:20 2002
From: barry@zope.com (Barry A. Warsaw)
Date: Thu, 18 Jul 2002 14:02:20 -0400
Subject: [I18n-sig] Chinese Codecs?
References: 
Message-ID: <15671.684.366934.581967@anthem.wooz.org>

>>>>> "DE" == Dan Edwards writes:

DE> Is there a definitive source for Traditional and Simplified
DE> Chinese codecs for Python? I'm specifically looking for codecs
DE> to handle CP936 and CP950.

DE> I've looked at the pythonzh project at SourceForge and it
DE> appears to be 1) abandonware, 2) not functioning as expected.

DE> Tnx for any help on this question.

I think I got Chinese codecs for Mailman from http://sourceforge.net/projects/python-codecs/ but I don't know how current (or usable) they are.

-Barry

From Matt Gushee Thu Jul 18 20:29:33 2002
From: Matt Gushee (Matt Gushee)
Date: Thu, 18 Jul 2002 13:29:33 -0600
Subject: [I18n-sig] Chinese Codecs?
In-Reply-To: <15671.684.366934.581967@anthem.wooz.org>
References: <15671.684.366934.581967@anthem.wooz.org>
Message-ID: <20020718192933.GE10401@swordfish.havenrock.com>

On Thu, Jul 18, 2002 at 02:02:20PM -0400, Barry A. Warsaw wrote:
>
> I think I got Chinese codecs for Mailman from
> http://sourceforge.net/projects/python-codecs/ but I don't know how
> current (or usable) they are.

Nor do I, since I currently only work with Japanese. But I tried to contact their author, J.S. Frank Chen, a few months ago, to see about getting an updated version of a document of his for 4Suite. I never got any response, and judging from the record of his project activity, posts to mailing lists, etc., he seems to have vanished from the Internet. So it seems the Chinese codecs may be orphaned (not literally, I hope).

-- 
Matt Gushee
Englewood, Colorado, USA
mgushee@havenrock.com
http://www.havenrock.com/

From martin@v.loewis.de Thu Jul 18 22:14:04 2002
From: martin@v.loewis.de (Martin v. Loewis)
Date: 18 Jul 2002 23:14:04 +0200
Subject: [I18n-sig] Chinese Codecs?
In-Reply-To: 
References: 
Message-ID: 

"Dan Edwards" writes:

> Is there a definitive source for Traditional and Simplified Chinese codecs
> for Python? I'm specifically looking for codecs to handle CP936 and CP950.

Depending on the platform you are using, you may try the iconv codec, thus accessing codecs in your C library.

Regards,
Martin

From MBleyer@DEFiNiENS.com Fri Jul 19 10:06:13 2002
From: MBleyer@DEFiNiENS.com (Bleyer, Michael)
Date: Fri, 19 Jul 2002 11:06:13 +0200
Subject: [I18n-sig] Passing unicode strings to file system calls
Message-ID: <023814FAC196D5119E4A00D0B76C0F98052127@mmuc.definiens.com>

> For that, you need to give a definition of "correct". From
> your description, I'd say that encoding the strings as
> "utf-8" is also "correct" - it gives you byte strings that
> identify the original file names.

True. But then UTF-8 is always "correct", because any Unicode string can be converted to UTF-8. I guess most OSs use some other encoding for display though.

> - does your problem really have to do with file names? Or can it be
> considered as independent of the problem of file names?

I guess it's not only file names but any system call.

> - would it help if, for each Unicode character, there was a list of
> encodings that can represent that character?

Yup. If I have:

    myUString = u''

I'd like a function that returns a list of legal encodings for that string, e.g.
    myLegalEncodingList = locale.getLegalEncodings(myUString)

The list would be something like ['cp1250', 'latin-1', 'utf8'], for example. A function that only works with a single Unicode character would be good enough, I guess.

Now if I have a unicode string, I would try to convert it to the system default encoding first, and if that doesn't work, I would like to give the user some feedback and maybe some choice (from a list of legal encodings) over which encoding to use instead.

Does that make sense?

Mike

From martin@v.loewis.de Fri Jul 19 17:14:27 2002
From: martin@v.loewis.de (Martin v. Loewis)
Date: 19 Jul 2002 18:14:27 +0200
Subject: [I18n-sig] Passing unicode strings to file system calls
In-Reply-To: <023814FAC196D5119E4A00D0B76C0F98052127@mmuc.definiens.com>
References: <023814FAC196D5119E4A00D0B76C0F98052127@mmuc.definiens.com>
Message-ID: 

"Bleyer, Michael" writes:

> I'd like a function that returns a list of legal encodings for that string,
> e.g.
> myLegalEncodingList = locale.getLegalEncodings(myUString)
>
> The list would be something like
> ['cp1250', 'latin-1', 'utf8'], for example.

What has the locale to do with that? The set of encodings that can encode a particular Unicode character is independent of the locale...

> Now if I have a unicode string, I would try to convert it to the system
> default encoding first, and if that doesn't work, I would like to give the
> user some feedback and maybe some choice (from a list of legal encodings)
> over which encoding to use instead.
>
> Does that make sense?

Yes, that's a good approach. However, I'd recommend using the locale's encoding instead of the system default encoding, as reported by nl_langinfo(CODESET). As for computing the list of possible encodings: I'm not sure what the best data format for that would be.

Regards,
Martin

From perky@FreeBSD.org Tue Jul 23 20:06:25 2002
From: perky@FreeBSD.org (Hye-Shik Chang)
Date: Wed, 24 Jul 2002 04:06:25 +0900
Subject: [I18n-sig] KoreanCodecs 2.0.5 Released
Message-ID: <20020723190625.GA68013@fallin.lv>

Hello!

I've just released KoreanCodecs 2.0.5. This version has the following changes from the previous version:

- Add two new characters which were introduced by KSX1001-1998 (euro symbol and registered mark)
- Raise ValueError instead of UnicodeError when the keyword argument "errors" is invalid.
- hangul.isJaeum and hangul.isMoeum test the entire string, same as str.isdigit and its friends do.

As always, you can download it from http://sourceforge.net/project/showfiles.php?group_id=46747

Thank you for listening! :)

-- 
Hye-Shik Chang
Yonsei University, Seoul
^D

From barry@python.org Thu Jul 25 07:02:05 2002
From: barry@python.org (Barry A. Warsaw)
Date: Thu, 25 Jul 2002 02:02:05 -0400
Subject: [I18n-sig] JapaneseCodecs 1.4.7 released
References: <200207121650.g6CGo4N15961@grad.sccs.chukyo-u.ac.jp>
Message-ID: <15679.37981.879403.546544@anthem.wooz.org>

>>>>> "TK" == Tamito KAJIYAMA writes:

TK> I've released JapaneseCodecs 1.4.7. As usual, the source
TK> tarball is available at the following location:

TK> Encoders and decoders now raise a ValueError instead of
TK> UnicodeError if their optional argument "errors" has an
TK> invalid value. Thanks Walter for reminding me!

I just realized we still have a problem with this distutils package. It still insists on installing japanese.pth in /usr/local/lib/python-2.2/site-packages even if I include --install-lib and --install-purelib switches to the "python setup.py install" command.
This is bad because in Mailman, I don't want to pollute the Python distribution with the packages I bundle, so I install them in (Mailman's) $prefix/pythonlib directory. But because JapaneseCodecs's setup.py leaves *only* the japanese.pth file in site-packages, people's Python installations are now broken. I do the same thing with the email and KoreanCodecs packages, and they work fine.

Is there some way we can stop the japanese.pth file from getting installed in the site-packages directory? (I vaguely remember this issue came up once before and I thought the problem had been solved, but it clearly is a problem w/ 1.4.7.)

Cheers,
-Barry

From martin@v.loewis.de Thu Jul 25 08:50:56 2002
From: martin@v.loewis.de (Martin v. Loewis)
Date: 25 Jul 2002 09:50:56 +0200
Subject: [I18n-sig] JapaneseCodecs 1.4.7 released
In-Reply-To: <15679.37981.879403.546544@anthem.wooz.org>
References: <200207121650.g6CGo4N15961@grad.sccs.chukyo-u.ac.jp> <15679.37981.879403.546544@anthem.wooz.org>
Message-ID: 

barry@python.org (Barry A. Warsaw) writes:

> I just realized we still have a problem with this distutils package.
> It still insists on installing japanese.pth in
> /usr/local/lib/python-2.2/site-packages even if I include
> --install-lib and --install-purelib switches to the "python setup.py
> install" command.

Since japanese.pth is processed as a 'data' file, you have two options:

1. Only invoke the install_lib command, not the install command. This will then avoid the install_headers, install_scripts, and install_data commands (the first two not being used here, anyway).

2. Provide the --install-data= argument to the install command, to specify an alternative prefix for data files.

Regards,
Martin

From barry@python.org Thu Jul 25 14:22:27 2002
From: barry@python.org (Barry A. Warsaw)
Date: Thu, 25 Jul 2002 09:22:27 -0400
Subject: [I18n-sig] JapaneseCodecs 1.4.7 released
References: <200207121650.g6CGo4N15961@grad.sccs.chukyo-u.ac.jp> <15679.37981.879403.546544@anthem.wooz.org>
Message-ID: <15679.64403.935865.272438@anthem.wooz.org>

[Adding distutils-sig@python.org]

>>>>> "MvL" == Martin v Loewis writes:

>> I just realized we still have a problem with this distutils
>> package. It still insists on installing japanese.pth in
>> /usr/local/lib/python-2.2/site-packages even if I include
>> --install-lib and --install-purelib switches to the "python
>> setup.py install" command.

MvL> Since japanese.pth is processed as a 'data' file, you have
MvL> two options:

MvL> 1. Only invoke the install_lib command, not the
MvL> install command. This will then avoid the install_headers,
MvL> install_scripts, and install_data commands (the first two not
MvL> being used here, anyway).

MvL> 2. Provide the --install-data= argument to the install
MvL> command, to specify an alternative prefix for data files.

I think I'm going to go with your second suggestion, since it fits in better with what I've already got.

ObDistutils: the install command has a --root option and a --home option, both of which would seem to do what I want, but neither quite do. E.g. invoking install with --root=/tmp/foo leaves me with

    /tmp/foo/usr/local/lib/python2.1/site-packages/

and invoking with --home=/tmp/foo leaves me with

    /tmp/foo/lib/python/

What I really want is an option to leave me with

    /tmp/foo/

so that I can put /tmp/foo on sys.path and be done with it, yet still guarantee that distutils will only install files under /tmp/foo and nowhere else.
It seems that I'm left with this as my best option:

    % python setup.py install --install-lib /tmp/foo --install-purelib \
      /tmp/foo --install-data /tmp/foo

That still leaves me with

    /tmp/foo/lib/pythonX.Y/site-packages/japanese.pth

but I'll ignore that for now.

Am I whacked not to want those extra directories in what I have to set my PYTHONPATH to? Maybe I'm just bucking the natural order of things, but I still think I'd like an install command option that collapses those three options into one. If distutils is going to be used to install stuff in a site-packages override directory, or in a user-specific search-first directory, I think we need to make this simpler.

but-maybe-I'm-insane-ly y'rs,
-Barry