can't get utf8 / unicode strings from embedded python

Steven D'Aprano steve+comp.lang.python at pearwood.info
Wed Aug 28 08:03:31 EDT 2013


On Tue, 27 Aug 2013 22:57:45 -0700, David M. Cotter wrote:

> I am very sorry that I have offended you to such a degree you feel it
> necessary to publicly eviscerate me.

You know David, you are right. I did over-react. And I apologise for 
that. I am sorry, I was excessively confrontational. (Although I think 
"eviscerate" is a bit strong.)

Putting aside my earlier sarcasm, the basic message remains the same: 
Python byte strings are not designed to work with Unicode characters, and 
if they do work, it is an accident, not defined behaviour.
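
To make that concrete, here is a minimal sketch of how a byte string and a Unicode string disagree about the same text. The sample word is only an illustration, not something from your code, and the sketch is written to run on Python 3 (the u prefix is accepted there too).

```python
# Illustrative sample text (an assumption, not taken from the original code).
text = u"caf\u00e9"            # Unicode string: four characters
raw = text.encode("utf-8")     # byte string: the same text as UTF-8 bytes

char_count = len(text)         # counts characters
byte_count = len(raw)          # counts bytes; the accented letter needs two

# Slicing a byte string can cut a character in half, leaving invalid UTF-8.
truncated = raw[:4]
try:
    truncated.decode("utf-8")
    split_ok = True
except UnicodeDecodeError:
    split_ok = False

print(char_count, byte_count, split_ok)
```

The byte string knows nothing about characters, so length, slicing and indexing all operate on bytes, and anything that lines up with the characters does so only by coincidence.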


> Perhaps I could have worded it like this:  "So far I have not seen any
> troubles including unicode characters in my strings, they *seem* to be
> fine for my use-case.  What kind of trouble has been seen with this by
> others?"

Exactly the same sort of trouble you were having earlier when you were 
inadvertently decoding the source file as MacRoman rather than UTF-8. 
Mojibake, garbage characters in your text, corrupted data.

http://en.wikipedia.org/wiki/Mojibake
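
A small sketch of that exact failure, assuming an illustrative sample word and using Python's standard "mac_roman" and "utf-8" codecs: the bytes are written as UTF-8 but read back as MacRoman.

```python
# Illustrative sample text (an assumption, not taken from the original code).
original = u"caf\u00e9"

# The text is stored as UTF-8 bytes...
stored = original.encode("utf-8")

# ...but something downstream decodes those bytes as MacRoman instead.
garbled = stored.decode("mac_roman")

# No exception is raised; the two UTF-8 bytes of the accented letter
# simply come back as two unrelated MacRoman characters.
print(repr(original), repr(garbled))

# The damage can be undone only if you know which wrong codec was used.
repaired = garbled.encode("mac_roman").decode("utf-8")
```

Note that nothing here fails loudly. The text just comes out one character longer and wrong.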


The point is, you might not see these errors, because by accident all the 
relevant factors conspire to give you the correct result. You might test 
it on a Mac and on Windows and it all works well. You might even test it 
on a dozen different machines, and it works fine on all of them. But 
since you're relying on an accident of implementation, none of this is 
guaranteed. And then in eighteen months' time, *something* changes -- a 
minor update to Python, a different version of Mac OS X, an unusual 
Registry setting in Windows, who knows what? -- and all of a sudden the 
factors no longer line up to give you the correct results and it all 
comes tumbling down in a big stinking mess. If you are lucky you will get 
a nice clear exception telling you something is broken, but more likely 
you'll just get corrupted data and mojibake and you, or the poor guy who 
maintains the code after you, will have no idea why. And you'll probably 
come here asking for our help to solve it.
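
The difference between the lucky and unlucky cases can be sketched like this. The helper name and sample text are hypothetical, chosen only for illustration: the same UTF-8 bytes either raise a clear exception or decode silently into garbage, depending on which codec happens to be applied.

```python
# Illustrative sample: UTF-8 bytes for a word with one accented letter.
data = u"caf\u00e9".encode("utf-8")

def try_decode(raw, codec):
    """Hypothetical helper: return (decoded_text, error_name_or_None)."""
    try:
        return raw.decode(codec), None
    except UnicodeDecodeError as exc:
        return None, type(exc).__name__

# Lucky case: ASCII cannot represent the high bytes, so decoding fails loudly.
ascii_text, ascii_error = try_decode(data, "ascii")

# Unlucky case: Latin-1 maps every byte to some character, so decoding
# always "succeeds" and quietly produces mojibake.
latin_text, latin_error = try_decode(data, "latin-1")

print(ascii_error, repr(latin_text))
```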

If you came back and said "I tried it with the u prefix, and it broke a 
bunch of other code, and I don't have time to fix it now so I'm reverting 
to the u-less byte string form" I wouldn't *like* it but I could *accept* 
it as one of those sub-optimal compromises people make in Real Life. I've 
done the same thing myself, we probably all have: written code we knew 
was broken, but fixing it was too hard or too low a priority.


> Really, I wonder why you are so angry at me for having made a mistake? 
> I'm going to guess that you don't have kids.

What do kids have to do with this? Are you an adult or a child? *wink*

You didn't offend me so much as frustrate me. You had multiple people 
telling you the same thing: don't embed Unicode characters in a byte 
string. But you chose not just to ignore them but effectively to declare 
that they were all wrong to give that advice, and not just the people 
here but essentially the entire Python development community responsible 
for adding Unicode strings to the language. Can you blame me for feeling 
that your reply seemed rather arrogant?

In any case, I'm glad you responded with a little more restraint than I 
did. I hope you can see my point of view, and that I haven't soured you 
on this forum.


-- 
Steven 


