Further changes to source encodings, and 7-bit source str's

Hallvard B Furuseth h.b.furuseth at usit.uio.no
Sun Aug 8 08:35:55 EDT 2004


John Roth wrote:

> The problem I have is that if you use utf-8 as the
> source encoding, you can suddenly drop multi-byte
> characters into an 8-bit string ***BY ACCIDENT***.
> (...)
> Now, my suggested solution of this problem was
> to require that 8-bit string literals in source that was
> encoded with UTF-8 be restricted to the 7-bit
> ascii subset.

Then shouldn't your solution catch all multibyte encodings, not
just UTF-8?

Martin v. Löwis wrote:

> [Hallvard] proposes your third alternative (ban non-ASCII
> characters in byte string literals), not just for UTF-8,
> but for all encodings. Not for all files, though, but
> only for selected files.

John Roth wrote:

> Which is what I don't like about it. It adds complexity
> to the language and a feature that I don't think is really
> necessary (restricting string literals for single-byte encodings.)

It's to prevent several errors:

* If the source file declares one 'coding:' and the output destination
  uses another character set/encoding, then the output will be in the
  wrong character set.  Python offers two simple solutions to this:
  - If the program is charset-aware, it can work with Unicode strings,
    and the 8-bit string literal should be a Unicode literal.
  - Otherwise, the program can stay away from Unicode and leave the
    charset problem to the user.

* A worse case of the above:  If the 8-bit output goes to a UTF-8
  destination, it won't merely give the wrong character, the output
  will be malformed.  So a program which reads the output may close
  the connection it reads from, or fail to display the file at all,
  or, if it is not robust, crash.  I expect the same applies to other
  multibyte encodings, and probably some single-byte encodings too.

* If the program is charset-aware and works with Unicode strings,
  the Unicode handling blows up if it is passed an 8-bit str that
  contains non-ASCII bytes (example copied from Anders' pychecker
  feature request):

    # -*- coding: latin-1 -*-
    x = "blåbærgrød"
    unicode(x)
  -->
    Traceback (most recent call last):
      File "/tmp/u.py", line 3, in ?
        unicode(x)
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5
    in position 2: ordinal not in range(128)

  The problem is that even though the file is tagged with latin-1, the
  string x does not inherit that tag.  So the Unicode handling doesn't
  know which character set, if any, the string contains.
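
The three failure modes above can be reproduced in a few lines.  This is
only a sketch, and it assumes Python 3, where bytes objects stand in for
the 8-bit str discussed here.  The cp437 destination is an arbitrary
choice of "some other single-byte charset", not something from the
original report.

```python
# Python 3 sketch: bytes stand in for the Python 2 8-bit str.
text = "blåbærgrød"

# The bytes a latin-1 source file would store in the 8-bit literal.
raw = text.encode("latin-1")

# Case 1: a destination that assumes another single-byte charset
# (cp437 is an arbitrary assumption) silently shows wrong characters.
wrong = raw.decode("cp437")

# Case 2: a UTF-8 destination cannot decode the bytes at all,
# because 0xe5 starts a multibyte sequence that never completes.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Case 3: the implicit ASCII decode that unicode(x) performed fails,
# since the bytes carry no record of the file's coding declaration.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError as exc:
    ascii_ok = False
    failing_position = exc.start

# Naming the charset explicitly recovers the original text.
recovered = raw.decode("latin-1")
```

The point of the last line is that the decode only works because the
caller supplies "latin-1" by hand; nothing in the bytes themselves says
which charset they came from.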

> The other thing I don't like is that it still leaves the
> trap for the unwary which I'm discussing.

Well, I would like to see a feature like this turned on by default
eventually (both for UTF-8 and other character sets), but for the time
being I'll stick to getting the feature into Python in the first place.

Though I do seem to have been too unambitious.  For some reason I was
thinking it would be harder to get a new option into Python than a
per-file declaration.
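
To make the proposed restriction concrete, here is a rough sketch of the
kind of check I have in mind, written as an external checker rather than
as a change to the compiler.  It assumes Python 3's tokenize module and
reads the source as Python 2-style text, where an unprefixed literal is
an 8-bit str.  The function name non_ascii_byte_literals is made up for
illustration, and the sketch checks every literal it is given rather
than honouring any per-file declaration.

```python
import io
import tokenize


def non_ascii_byte_literals(source):
    """Return (line, literal) pairs for string literals without a
    u prefix that contain non-ASCII characters.

    Hypothetical helper: treats unprefixed literals as 8-bit strings,
    as in Python 2.
    """
    hits = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type != tokenize.STRING:
            continue
        # Everything before the first quote character is the prefix.
        quote_at = min(i for i, ch in enumerate(tok.string) if ch in "'\"")
        prefix = tok.string[:quote_at].lower()
        if "u" in prefix:
            continue  # Unicode literal, allowed to hold any character
        if any(ord(ch) > 127 for ch in tok.string):
            hits.append((tok.start[0], tok.string))
    return hits


sample = 'x = "blåbærgrød"\ny = u"blåbærgrød"\nz = "plain ascii"\n'
found = non_ascii_byte_literals(sample)
```

With this sample, only the literal on line 1 is reported; the Unicode
literal and the pure-ASCII literal pass.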

-- 
Hallvard


