py3k s***s

Diez B. Roggisch deets at nospam.web.de
Fri Apr 18 08:27:31 EDT 2008


Robin Becker wrote:
> I'm in the process of attempting a straightforward port of a relatively 
> simple package which does most of its work by writing out files with a 
> more or less complicated set of possible encodings. So far I have used 
> all the 2to3 tools and a lot of effort, but still don't have a working 
> version. This must be the worst way to convert people to unicode. When 
> tcl went through this they chose the eminently sensible route of not 
> choosing a separate unicode type (they used utf8 byte strings instead). 
> Not only has python chosen to burden itself with two string types, but 
> with 3 they've swapped roles. This is certainly the first time I've had 
> to decide on an encoding before writing simple text to a file.

Which is EXACTLY the right thing to do! See below.
> 
> Of course we may end up with a better language, but it will be a 
> worse (more complex) tool for many simple tasks. Using a complex 
> writing system with many glyphs costs effort no matter how you do it, 
> but I just use ascii :( and it's still an effort.
> 
> I find the differences in C/OS less hard to understand than why I need 
> bytes(x,'encoding') everywhere I just used to use str(x).

If you google my name + unicode, you'll see that I often answer 
questions regarding unicode. I wouldn't say I'm a recognized expert on 
the subject, but I certainly know enough to deal with it whenever I 
encounter it.

And from my experience with these problems in general, and specifically 
in Python, as well as from trying to help others, I can say that:

  - 95% of the time, the problem is in front of the keyboard.

  - programmers stubbornly refuse to *learn* what unicode is, what an 
encoding is, and what role utf-8 plays. Instead, they resort to a 
voodoo approach of throwing in various encode/decode calls plus a good 
deal of cat's feces in the hope of wriggling out of the problem.

  - it is NOT sensible to use utf8 as a unicode-"type" - that is as bad 
as it can get, because you don't see the errors; instead you mangle 
your data and end up with a byte-string mess (see the small example 
after this list). If that is your road to heaven, by all means choose 
it - and don't use unicode at all. And be prepared for damnation :)

If your programs worked until now, but no longer do because Py3K makes 
string literals unicode objects, it pretty much follows that they only 
*seemed* to work - and would very, very probably fail in the face of 
actual i18nized data.
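
As an illustration (a made-up Python 2.x sketch, not your code): this 
is the classic pattern of a program that runs fine for years on 
ASCII-only data and only blows up - or worse, silently produces 
garbage - once real i18nized input arrives:

    def tag(value):
        # looks harmless and has "worked" forever on ASCII-only data
        return "<name>" + value + "</name>"

    print tag("Robin")              # fine

    user = u"J\xfcrgen"             # real i18nized input (J-umlaut-rgen)
    print repr(tag(user))           # still fine: unicode + ASCII str works

    raw = user.encode("utf-8")      # the same name arriving as utf-8 bytes
    print repr(tag(raw))            # no error, but now a byte-string mess

    u"<root>" + tag(raw)            # UnicodeDecodeError: the implicit
                                    # ascii-decode finally blows up here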

The *only* sensible thing to do is to follow these simple rules - they 
already apply in Python 2.x, and will be enforced by 3k, which is a 
good thing IMHO:

  - when you read data from somewhere, make sure you know which encoding 
it has, and *immediately* convert it to unicode

  - when you write data, make sure you know which encoding you want it 
to have (if in doubt, choose utf-8 to prevent loss of data) and apply 
it (see the sketch after this list)

  - XML-parsers take byte-strings & spit out unicode. Period.
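
A minimal sketch of those rules in Python 2.x, using codecs.open; the 
file names and the latin-1 input encoding are assumptions I chose for 
illustration:

    import codecs

    # Read: know the encoding of the source and decode *immediately*.
    # (latin-1 here is an assumption -- in real code the encoding comes
    # from the file format, a header, or the docs of the data source.)
    infile = codecs.open("input.txt", "r", encoding="latin-1")
    try:
        text = infile.read()       # already a unicode object
    finally:
        infile.close()

    # Everything in between works on unicode only.
    text = text.strip().title()

    # Write: decide which encoding the output should have (utf-8 if in
    # doubt, it can represent everything) and apply it at this boundary.
    outfile = codecs.open("output.txt", "w", encoding="utf-8")
    try:
        outfile.write(text)        # encoded on the way out
    finally:
        outfile.close()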

I neither want to imply that you are an idiot nor that unicode doesn't 
have its complexities. And I'd love to be able to say that Python 
doesn't add to these by having two string types.

But the *real* problem is that it used to have only bytestrings, and 
finally Py3K will solve that issue.

Diez


