[Python-Dev] Encoding of 8-bit strings and Python source code

M.-A. Lemburg mal@lemburg.com
Tue, 25 Apr 2000 22:13:39 +0200


Fredrik Lundh wrote:
> 
> I'll follow up with a longer reply later; just one correction:
> 
> M.-A. Lemburg <mal@lemburg.com> wrote:
> > Ad 1. UTF-8 is used as basis in many other languages such
> > as TCL or Perl.  It is not an intuitive way of
> > writing strings and causes problems due to one character
> > spanning 1-6 bytes. Still, the world seems to be moving
> > into this direction, so going the same way can't be all
> > wrong...
> 
> the problem here is that the current Python implementation
> doesn't use UTF-8 in the same way as Perl and Tcl.  Perl
> and Tcl only expose one string type, and that type behaves
> exactly like it should:
> 
>      "The Tcl string functions properly handle multi-byte
>      UTF-8 characters as single characters."
> 
>      "By default, Perl now thinks in terms of Unicode
>      characters instead of simple bytes. /.../ All the
>      relevant built-in functions (length, reverse, and
>      so on) now work on a character-by-character
>      basis instead of byte-by-byte, and strings are
>      represented internally in Unicode."
> 
> or in other words, both languages guarantee that given a
> string s:
> 
>     - s is a sequence of characters (not bytes)
>     - len(s) is the number of characters in the string
>     - s[i] is the i'th character
>     - len(s[i]) is 1
> 
> and as I've pointed out a zillion times, Python 1.6a2 doesn't.
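
A minimal sketch of the four guarantees listed above, assuming a
modern Python 3 interpreter (not Python 1.6a2), where str holds
characters and bytes holds the 8-bit representation. The sample
text and the variable names are illustrative only:

```python
# Assumption: Python 3 semantics, where str is a sequence of characters
# and bytes is a sequence of 8-bit values.
s = "na\u00efve caf\u00e9"      # 10 characters, two of them non-ASCII
utf8 = s.encode("utf-8")        # the same text as UTF-8 encoded bytes

char_count = len(s)             # counts characters
byte_count = len(utf8)          # counts bytes; each non-ASCII char takes 2 here

third_char = s[2]               # indexing the text yields one character
third_char_bytes = utf8[2:4]    # the same character spans two bytes in UTF-8
```

Indexing the UTF-8 bytes directly would split that character in
half, which is the behaviour the list above rules out.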

Just a side note: we never discussed turning the native
8-bit strings into any encoding-aware type.

> this
> should be solved, and I see (at least) four ways to do that:
>
> ...
> -- the Perl 5.6 way? (haven't looked at the implementation, but I'm
>    pretty sure someone told me it was done this way).   essentially
>    same as Tcl 8.2, but with an extra encoding field (to avoid con-
>    versions if data is just passed through).
> 
>     struct {
>         int encoding;
>         char* bytes; /* 8-bit representation */
>         Tcl_UniChar* unicode; /* 16-bit representation */
>     }
> 
> [imho: see Tcl 8.2]
> 
> -- my proposal: expose both types, but let them contain characters
>    from the same character set -- at least when used as strings.
> 
>    as before, 8-bit strings can be used to store binary data, so we
>    don't need a separate ByteArray type.  in an 8-bit string, there's
>    always one character per byte.
> 
> [imho: small changes to the existing code base, about as efficient as
> can be, no attempt to second-guess the user, fully backwards com-
> patible, fully compliant with the definition of strings in the language
> reference, patches are available, etc...]

Why not name the beast?! In your proposal, the old 8-bit
strings simply use Latin-1 as their native encoding.

The current version doesn't make any encoding assumption as
long as the 8-bit strings do not get auto-converted. When they
are auto-converted, they are interpreted as UTF-8 -- which will
(usually) fail for Latin-1 encoded strings using the 8th bit, but
hey, at least you get an error message telling you what is going
wrong.
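
To illustrate the failure mode, here is a small sketch. It assumes
Python 3, where the conversion is an explicit decode() call rather
than the auto-conversion discussed here; the sample bytes are made
up for the example:

```python
# Assumption: Python 3; decode() stands in for the auto-conversion step.
latin1_bytes = "caf\u00e9".encode("latin-1")   # b'caf\xe9', uses the 8th bit

error = None
try:
    latin1_bytes.decode("utf-8")               # 0xE9 starts a multi-byte sequence
except UnicodeDecodeError as exc:
    error = exc                                # the error names the offending byte

# Pure 7-bit data decodes fine, since ASCII is a subset of UTF-8.
ascii_text = b"cafe".decode("utf-8")
```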

The key to these problems is using explicit conversions where
8-bit strings meet Unicode objects.

Some more ideas along the convenience path:

Perhaps changing just the way 8-bit strings are coerced
to Unicode would help: strings would then be interpreted
as Latin-1. str(Unicode) and "t" would still return
UTF-8 to assure loss-less conversion.
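
A sketch of why this pairing is loss-less in both directions,
assuming Python 3 and its standard "latin-1" and "utf-8" codecs
(the explicit calls stand in for the coercion described above):

```python
# Assumption: Python 3; explicit decode/encode calls model the coercion.
raw = bytes(range(256))                  # every possible 8-bit value

# Latin-1 maps byte value N to code point N, so decoding never fails.
text = raw.decode("latin-1")
code_points = [ord(ch) for ch in text]

# Going back out as UTF-8 can represent every character without loss.
utf8 = text.encode("utf-8")
round_trip = utf8.decode("utf-8")
```

Note that the UTF-8 output is longer than the 8-bit input: the 128
characters above 0x7F each take two bytes.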

Another way to tackle this would be to first try UTF-8
conversion during auto-conversion and then fall back to
Latin-1 in case it fails. Has anyone tried this? Guido
mentioned that Tcl does something along these lines...
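
The try-UTF-8-then-Latin-1 idea could look roughly like this. It is
only a sketch in Python 3; coerce_to_text is a hypothetical helper
name, not an existing API:

```python
# Assumption: Python 3; coerce_to_text is a hypothetical helper.
def coerce_to_text(data: bytes) -> str:
    """Decode as UTF-8 if possible, otherwise fall back to Latin-1."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 accepts every byte value, so this branch cannot fail.
        return data.decode("latin-1")

from_utf8 = coerce_to_text("caf\u00e9".encode("utf-8"))
from_latin1 = coerce_to_text("caf\u00e9".encode("latin-1"))

# Caveat: Latin-1 data that happens to be valid UTF-8 is read as UTF-8.
ambiguous = coerce_to_text("\u00c3\u00a9".encode("latin-1"))
```

The last line shows the cost of guessing: the two Latin-1 characters
0xC3 0xA9 come back as a single e-acute, with no error raised.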

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/