[I18n-sig] Re: [Python-Dev] Unicode debate

Guido van Rossum guido@python.org
Mon, 01 May 2000 17:32:38 -0400


> > Are you sure you understand what we are arguing about?
> 
> Here's what I thought we were arguing about:
> 
> If you put a bunch of "funny characters" into a Python string literal,
> and then compare that string literal against a Unicode object, should
> those funny characters be treated as logical units of text (characters)
> or as bytes? And if bytes, should some transformation be automatically
> performed to have those bytes reinterpreted as characters according
> to some particular encoding scheme (probably UTF-8)?
> 
> I claim that we should *as far as possible* treat strings as character
> lists and not add any new functionality that depends on them being byte
> lists. Ideally, we could add a byte array type and start deprecating the
> use of strings in that manner. Yes, it will take a long time to fix this
> bug but that's what happens when good software lives a long time and the
> world changes around it.
> 
> > Earlier, you quoted some reference documentation that defines 8-bit
> > strings as containing characters.  That's taken out of context -- this
> > was written in a time when there was (for most people anyway) no
> > difference between characters and bytes, and I really meant bytes.
> 
> Actually, I think that that was Fredrik. 

Yes, I came across the post again later.  Sorry.

> Anyhow, you wrote the documentation that way because it was the most
> intuitive way of thinking about strings. It remains the most intuitive
> way. I think that that was the point Fredrik was trying to make.

I just wish he had made the point more eloquently.  The eff-bot seems to
be in a crunchy mood lately...

> We can't make "byte-list" strings go away soon but we can start moving
> people towards the "character-list" model. In concrete terms I would
> suggest that old-fashioned strings be automatically coerced to Unicode by
> interpreting each byte as a Unicode character. Trying to go the other
> way could cause the moral equivalent of an OverflowError but that's not
> a problem. 
> 
> >>> a=1000000000000000000000000000000000000L
> >>> int(a)
> Traceback (innermost last):
>   File "<stdin>", line 1, in ?
> OverflowError: long int too long to convert
> 
> And just as with ints and longs, we would expect to eventually unify
> strings and unicode strings (but not byte arrays).

OK, you've made your claim -- like Fredrik, you want to interpret
8-bit strings as Latin-1 when converting (not just comparing!) them to
Unicode.
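In modern Python terms (a sketch for illustration, not the Python 1.6 implementation under discussion), the rule Paul and Fredrik propose amounts to Latin-1 decoding: each byte maps directly to the Unicode code point with the same numeric value.

```python
# Sketch of the proposed coercion rule, in modern Python terms:
# each byte of an 8-bit string becomes the Unicode character whose
# code point equals the byte value -- which is what Latin-1 decoding does.
data = bytes([0xE9])             # one "funny" byte, value 0xE9

as_latin1 = data.decode("latin-1")
print(as_latin1)                 # 'é' -- byte value == code point U+00E9

# Going the other way can fail -- the "moral equivalent of OverflowError":
try:
    "\u20ac".encode("latin-1")   # U+20AC has no single-byte Latin-1 value
except UnicodeEncodeError as exc:
    print("cannot narrow:", exc.reason)
```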

I don't think I've heard a good *argument* for this rule though.  "A
character is a character is a character" sounds like an axiom to me --
something you can't prove or disprove rationally.

I have a bunch of good reasons (I think) for liking UTF-8: it allows
you to convert between Unicode and 8-bit strings without losses, Tcl
uses it (so displaying Unicode in Tkinter *just* *works*...), and it is
not Western-language-centric.
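The lossless round-trip property is easy to demonstrate (again a modern-Python sketch): any Unicode string survives a trip through UTF-8 bytes unchanged, whereas Latin-1 can only represent code points up to U+00FF.

```python
# UTF-8 can encode every Unicode string, so the round trip is lossless:
text = "Grüße, Ελλάδα, 日本"
assert text.encode("utf-8").decode("utf-8") == text

# Latin-1 cannot represent text outside the Western repertoire:
try:
    "Ελλάδα".encode("latin-1")
except UnicodeEncodeError:
    print("Latin-1 is lossy for non-Western text")
```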

Another reason: while you may claim that your (and /F's, and Just's)
preferred solution doesn't enter into the encodings issue, I claim it
does: Latin-1 is just as much an encoding as any other one.

I claim that as long as we're using an encoding we might as well use
the most accepted 8-bit encoding of Unicode as the default encoding.

I also think that the issue is blown out of proportion: this ONLY
happens when you use Unicode objects, and it ONLY matters when some
other part of the program uses 8-bit string objects containing
non-ASCII characters.  Given the long tradition of using different
encodings in 8-bit strings, at that point it is anybody's guess what
encoding is used, and UTF-8 is a better guess than Latin-1.
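The guessing problem can be made concrete (a modern-Python sketch): the very same byte sequence yields different text depending on which encoding you assume.

```python
# One byte sequence, two interpretations -- the guess matters:
data = b"\xc3\xa9"               # two bytes

print(data.decode("utf-8"))      # 'é'  -- one character under the UTF-8 guess
print(data.decode("latin-1"))    # 'Ã©' -- two characters under the Latin-1 guess
```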

--Guido van Rossum (home page: http://www.python.org/~guido/)