Further changes to source encodings, and 7-bit source str's
Hallvard B Furuseth
h.b.furuseth at usit.uio.no
Sun Aug 8 08:35:55 EDT 2004
John Roth wrote:
> The problem I have is that if you use utf-8 as the
> source encoding, you can suddenly drop multi-byte
> characters into an 8-bit string ***BY ACCIDENT***.
> (...)
> Now, my suggested solution of this problem was
> to require that 8-bit string literals in source that was
> encoded with UTF-8 be restricted to the 7-bit
> ascii subset.
Then shouldn't your solution catch all multibyte encodings, not
just UTF-8?
Martin v. Löwis wrote:
> [Hallvard] proposes your third alternative (ban non-ASCII
> characters in byte string literals), not just for UTF-8,
> but for all encodings. Not for all files, though, but
> only for selected files.
John Roth wrote:
> Which is what I don't like about it. It adds complexity
> to the language and a feature that I don't think is really
> necessary (restricting string literals for single-byte encodings.)
It's to prevent several errors:
* If the source file has one 'coding:' and the output destination has
  another character set/encoding, then the wrong character set will be
  output. Python offers two simple solutions to this:
  - If the program is charset-aware, it can work with Unicode strings,
    and the 8-bit string literal should be a Unicode literal.
  - Otherwise, the program can stay away from Unicode and leave the
    charset problem to the user.
* A worse case of the above: If the 8-bit output goes to a UTF-8
  destination, it won't merely give the wrong character, it will be
  malformed UTF-8. So a program which reads the output may close the
  connection it reads from, or fail to display the file at all, or
  (if it is not robust) crash. I expect the same applies to other
  multibyte encodings, and probably some single-byte encodings too.
* If the program is charset-aware and works with Unicode strings,
  the Unicode handling blows up if it is passed an 8-bit str
  containing non-ASCII bytes
  (example copied from Anders' pychecker feature request):
    # -*- coding: latin-1 -*-
    x = "blåbærgrød"
    unicode(x)
  -->
    Traceback (most recent call last):
      File "/tmp/u.py", line 3, in ?
        unicode(x)
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5
                        in position 2: ordinal not in range(128)
The problem is that even though the file is tagged with latin-1, the
string x does not inherit that tag. So the Unicode handling doesn't
know which character set, if any, the string contains.
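
To make the last two points concrete, here is a small sketch of what happens to those bytes. It is written for Python 3, where the 8-bit string has to be produced explicitly with encode(); the helper name try_decode is made up for this illustration and is not part of any library.

```python
# The bytes a Python 2 str literal would hold in a latin-1 source file.
raw = "blåbærgrød".encode("latin-1")

def try_decode(data, codec):
    """Return the decoded text, or None if data is not valid in codec."""
    try:
        return data.decode(codec)
    except UnicodeDecodeError:
        return None

# The implicit conversion in the example above uses the 'ascii' codec.
print(try_decode(raw, "ascii"))    # None
# A UTF-8 consumer sees the same bytes as malformed input.
print(try_decode(raw, "utf-8"))    # None
# Only a caller who knows the source encoding gets the text back.
print(try_decode(raw, "latin-1"))  # blåbærgrød
```

The bytes themselves carry no record of the file's 'coding:' line, so the right codec has to be supplied by whoever decodes them.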
> The other thing I don't like is that it still leaves the
> trap for the unwary which I'm discussing.
Well, I would like to see a feature like this turned on by default
eventually (both for UTF-8 and other character sets), but for the time
being I'll stick to getting the feature into Python in the first place.
Though I do seem to have been too unambitious. For some reason I was
thinking it would be harder to get a new option into Python than a
per-file declaration.
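
For illustration only, the kind of check being discussed could look roughly like the sketch below. It is not the proposed implementation: it uses the tokenize module from Python 3 on Python 2 style source, and the function name non_ascii_plain_literals is hypothetical. A literal without a u prefix stands in for the 8-bit str of the discussion.

```python
import io
import tokenize

def non_ascii_plain_literals(source_bytes):
    """Return (line, literal) pairs for string literals that have no
    u prefix and contain characters outside 7-bit ASCII."""
    hits = []
    readline = io.BytesIO(source_bytes).readline
    # tokenize honours the 'coding:' line when decoding the source.
    for tok in tokenize.tokenize(readline):
        if tok.type != tokenize.STRING:
            continue
        text = tok.string
        # Everything before the first quote character is the prefix.
        quote_pos = min(i for i in (text.find('"'), text.find("'")) if i != -1)
        if "u" in text[:quote_pos].lower():
            continue  # Unicode literal, allowed to hold any character
        if any(ord(ch) > 127 for ch in text):
            hits.append((tok.start[0], text))
    return hits

sample = (
    '# -*- coding: latin-1 -*-\n'
    'x = "blåbærgrød"\n'
    'y = u"blåbærgrød"\n'
    'z = "plain"\n'
).encode("latin-1")

print(non_ascii_plain_literals(sample))
```

Only the literal on line 2 is reported; the Unicode literal and the pure ASCII literal pass.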
--
Hallvard