PEP 263 status check

Fri Aug 6 18:46:57 EDT 2004

John Roth wrote:
> I've
> been in this business for close to 40 years, and I'm
> quite well aware of backwards compatibility issues
> and issues with breaking existing code.
> 
> Programmers in general have a very strong, and
> let me repeat that, VERY STRONG assumption
> that an 8-bit string contains one byte per character
> unless there is a good reason to believe otherwise.

You clearly come from a Western business. In CJK
languages, people are very aware that characters can
have more than one byte. They consider UTF-8 as just
another multi-byte encoding, and used to consider it
as an encoding that Westerners made to complicate their
lives. That attitude appears to be changing now, but
UTF-8 is not a clear winner in the worlds where we
Westerners would expect it to be a clear winner.
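
To make the point about multi-byte encodings concrete, here is a
minimal sketch. It assumes a present-day Python 3 interpreter rather
than the Python 2.x of this thread, and the helper name byte_lengths
and the sample strings are made up for illustration. It compares the
number of characters in a string with the number of bytes the same
string occupies in UTF-8 and in two common Japanese encodings.

```python
# Python 3 sketch; byte_lengths and the sample strings are illustrative only.
def byte_lengths(text, encodings=("utf-8", "shift_jis", "euc_jp")):
    """Map each encoding name to the number of bytes `text` occupies in it."""
    return {enc: len(text.encode(enc)) for enc in encodings}

ascii_text = "abc"     # three characters, all in the 7-bit ASCII subset
cjk_text = "日本語"    # three characters, none of them ASCII

# Character count first, then the per-encoding byte counts.
print(len(ascii_text), byte_lengths(ascii_text))
print(len(cjk_text), byte_lengths(cjk_text))
```

For the ASCII string the character count and every byte count agree;
for the CJK string they differ in all three encodings, so UTF-8 is not
special in breaking the one-byte-per-character assumption.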

> The current design allows accidental inclusion of
> a character that is not in the 7bit ascii subset ***IN
> A PROGRAM THAT HAS A UTF-8 CHARACTER
> ENCODING DECLARATION*** to break that
> assumption without any kind of notice. 

This is a problem only for the Western world. In the
CJK languages, such programs were broken a long time
ago. I don't think Python needs to be so Americo-centric
as to protect American programmers from programming
mistakes.

> That in
> turn will break all of the assumptions that the string
> module and string methods are based on. That in
> turn is likely to break lots of existing modules and
> cause a lot of debugging time that could be avoided
> by proper design.

Indeed. If the program is currently not broken, why
are you changing the source encoding? If you are
trying to support multiple languages, a properly-
designed application would use gettext instead
of putting non-ASCII into source code.

If you are writing a new application, and you
put non-ASCII into the source, in UTF-8, are you
not testing your application properly?

> I'm not worried about this causing people to
> abandon Python. I'm more worried about the
> current situation causing enough grief that people
> will decide that utf-8 source code encoding isn't
> worth it.

Again, this is what Hallvard's PEP is for. It
does not apply to UTF-8 only, but I see no reason
why UTF-8 needs to be singled out.

> I'll withdraw my objection if you can seriously
> assure me that working with raw utf-8 in
> 8-bit character string literals is what most programmers
> are going to do most of the time.

On what time scale? In the near term, most people will use
other source encodings. In the medium term, I expect
Unix will switch to UTF-8 throughout, at which point
using UTF-8 byte strings will work on every Unix
system - the scripts, by nature, won't work on non-Unix
systems, anyway. In the long term, I expect all Python
strings will be Unicode strings, unless explicitly
declared as byte strings.
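
As an illustration of that long-term picture, here is a small sketch
of how a present-day Python 3 interpreter treats a PEP 263 declaration.
It is not how the Python 2.x of this thread behaves. The variable
names and the "<demo>" file name are made up for the example. The
source is built as bytes so that the coding line is what decides how
the literal is decoded.

```python
# Python 3 sketch: a tiny module in ISO-8859-1 with a PEP 263 coding line.
source_text = (
    "# -*- coding: iso-8859-1 -*-\n"
    "text = 'é'\n"         # plain literal: a Unicode string
    "raw = b'\\xe9'\n"     # explicitly declared byte string
)
source_bytes = source_text.encode("iso-8859-1")

# compile() reads the coding line to decode the bytes before parsing.
namespace = {}
exec(compile(source_bytes, "<demo>", "exec"), namespace)

text = namespace["text"]
raw = namespace["raw"]
print(type(text).__name__, len(text))
print(type(raw).__name__, len(raw))
```

Here the plain literal is one character long regardless of the source
encoding, and only the literal marked with b is a sequence of bytes.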

Regards,
Martin