Proposal: require 7-bit source str's

"Martin v. Löwis" martin at v.loewis.de
Fri Aug 6 18:16:41 EDT 2004


Hallvard B Furuseth wrote:
> - For a number of source encodings (like utf-8:-) it should be easy
>   to parse and charset-convert in the same step, and only convert
>   selected parts of the source to Unicode.

Correct. However, that it works "for a number of source encodings"
is insufficient - if it doesn't work for all of them, it only
complicates the code unreasonably.

For some source encodings (namely the CJK ones), conversion to UTF-8
is absolutely necessary even for proper lexical analysis, as the
byte that represents a backslash in ASCII might be the first byte
of a two-byte sequence.
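
To illustrate the point above, here is a minimal sketch in present-day
Python. The choice of Shift_JIS and of the character U+30BD is mine, purely
as an example; any CJK encoding whose trail bytes overlap the ASCII range
would do. It shows a two-byte sequence that contains the byte 0x5C, which
a byte-level lexer would mistake for a backslash, while the UTF-8 form of
the same character contains no such byte.

```python
# Illustration only: the codec and the character are chosen for the example.
BACKSLASH = 0x5C  # the byte that means "\" in ASCII

katakana_so = "\u30bd"  # KATAKANA LETTER SO

sjis_bytes = katakana_so.encode("shift_jis")
utf8_bytes = katakana_so.encode("utf-8")


def has_backslash_byte(data):
    """Return True if the raw bytes contain 0x5C anywhere."""
    return BACKSLASH in data


# In Shift_JIS the second byte of this character is 0x5C.
print(sjis_bytes, has_backslash_byte(sjis_bytes))

# In UTF-8 every byte of a multi-byte sequence is >= 0x80,
# so a 0x5C byte can only ever be a real backslash.
print(utf8_bytes, has_backslash_byte(utf8_bytes))
```

A lexer that scans the Shift_JIS bytes directly would see an escape
character in the middle of that letter; after conversion to UTF-8 the
ambiguity is gone.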

> - I think the spec is buggy anyway.  Converting to Unicode and back
>   can change the string representation.  But I'll file a separate
>   bug report for that.

That is by design. The only effect of such a bug report will be that
the documentation states that clearly. Users who need to make
sure the run-time representation of a string is the same as the
source representation need to pick a source encoding that round-trips.
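
As a rough sketch of what "round-trips" means here, the helper below
(the name round_trips is hypothetical, and this is present-day Python
simulating the decode and re-encode step, not the compiler's actual code)
checks whether bytes survive a trip through Unicode unchanged. Latin-1 is
used as an encoding where every byte maps to exactly one character; UTF-7
is used only as a convenient example of a codec with more than one byte
spelling for the same character, not as a recommended source encoding.

```python
def round_trips(raw, encoding):
    """Return True if decoding and re-encoding gives back the same bytes."""
    return raw.decode(encoding).encode(encoding) == raw


# Latin-1 maps each of the 256 byte values to one code point and back.
all_bytes = bytes(range(256))
print(round_trips(all_bytes, "latin-1"))

# UTF-7 allows "a" to be spelled as a base64 run; the encoder emits the
# direct form instead, so the bytes change on the way back.
alternate_a = b"+AGE-"
print(alternate_a.decode("utf-7"))
print(round_trips(alternate_a, "utf-7"))
```

With an encoding of the second kind, the bytes in the running program
can differ from the bytes in the source file even though both denote the
same text.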

> Sorry, I thought you were speaking of promising a __future__ when all
> string literals are required to be 7-bit or u'' literals.

Yes, but that *will* cause a wide debate. Say, Python 3.5, to be
released in 2017 or so. I could live with such a language, but I'm
certain many users can't, in any foreseeable future.

Regards,
Martin


