[I18n-sig] Changing case

M.-A. Lemburg mal@lemburg.com
Tue, 11 Apr 2000 17:38:32 +0200


Guido van Rossum wrote:
> 
> This is quite independent of the source encoding when reading from a
> file.  I have some issues with the current approach (which seems to be
> "use whatever bytes you read" and thus defaults to Latin-1 if you use
> non-ASCII characters in Unicode string literals; otherwise it's
> whatever the user wants it to be).

Which direction should we be heading in: interpret the source
files under some encoding assumption deduced from the
platform, a command-line switch or a #pragma, or simply fix
one encoding (e.g. Latin-1)?

The current divergence between u"...chars..." and "...chars..."
really only stems from the fact that "...chars..." does not have
to know the encoding that was used, while u"...chars..." does,
in order to convert the data to Unicode.
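
To make this concrete, here is a minimal sketch (the byte values
are only examples; \xe9 is LATIN SMALL LETTER E WITH ACUTE):

    s = "\xe9"     # 8-bit string: the raw byte, no encoding attached
    u = u"\xe9"    # Unicode literal: the byte is read as Latin-1 -> U+00E9

    print len(s), len(u)              # 1 1
    print unicode(s, "latin-1") == u  # 1 -- but only correct if the
                                      # source really was Latin-1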

> Note in particular that a user who
> edits her source code in shift-JIS can currently *not* use shift-JIS
> in Unicode literals -- she must use something like
> unicode(".....","shift-jis") to get a Unicode string containing the
> correct Japanese characters encoded in Unicode.

See above -- without any further knowledge about the encoding
used to write the source file, there is no alternative but to
fix one encoding (which happens to be Latin-1, because that is
how the first 256 Unicode ordinals are defined).
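
For the Shift-JIS user that boils down to writing something like
this (assuming a Shift-JIS codec is installed -- the core
distribution does not ship one):

    raw = "\x83\x4e"                # Shift-JIS bytes for KATAKANA KU
    ku = unicode(raw, "shift-jis")  # explicit conversion to Unicode
    print repr(ku)                  # u'\u30af'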

Note that even if the parser knew the encoding, you'd
still have a problem processing the strings at run-time:
8-bit strings do not carry any encoding information.
The only ways to fix this would be to define a global 8-bit
string encoding or to add an encoding attribute to strings.
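
The attribute idea would look something like this (just a sketch;
the class and its name are made up for illustration):

    class EncodedString:
        def __init__(self, data, encoding):
            self.data = data          # the raw 8-bit bytes
            self.encoding = encoding  # how to interpret them
        def to_unicode(self):
            # lossless, because the encoding travels with the data
            return unicode(self.data, self.encoding)

    s = EncodedString("\xe9", "latin-1")
    u = s.to_unicode()                # u'\xe9'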

One possibility would be to define that all 8-bit strings
are converted to UTF-8 when parsed (by the compiler, eval(), etc.).
This would ensure that all strings used at run-time are
in fact UTF-8, so that conversions to and from Unicode would
be possible without information loss.
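
In code, the invariant would be (a sketch of the intent, not of
an implementation):

    u = u"\xe4\xf6"                  # two characters
    s = u.encode("utf-8")            # the 8-bit form the compiler would store
    assert unicode(s, "utf-8") == u  # round-trip without information loss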

The downside of this approach is that indexing and slicing do
not work well with UTF-8: a single input character can be
encoded in as many as 6 bytes (for 32-bit Unicode)! I also
assume that many applications rely on the fact that
len("äö") == 2 and not 4.

Perhaps we should just make the encoding used for u"...chars..."
configurable via #pragmas and/or command-line switches. People
around the world would then at least have a simple way to write
programs which still work everywhere, yet can be written in any
of the encodings known to Python. 8-bit "...chars..." would be
interpreted as before: user-defined data in a user-defined
encoding (the string->Unicode conversion would still have to
make the UTF-8 assumption, though).
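
Such a #pragma could e.g. look like this (syntax invented on the
spot -- nothing like it exists yet):

    #pragma encoding = "shift-jis"
    u = u"...chars..."   # literal's bytes now decoded as Shift-JIS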

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/