encoding problem

Fri Dec 19 18:02:08 EST 2008

On Fri, 19 Dec 2008 15:20:08 -0700, Joe Strout wrote:

> Marc 'BlackJack' Rintsch wrote:
> 
>>> And because strings in Python, unlike in (say) REALbasic, do not know
>>> their encoding -- they're just a string of bytes.  If they were a
>>> string of bytes PLUS an encoding, then every string would know what it
>>> is, and things like conversion to another encoding, or concatenation
>>> of two strings that may differ in encoding, could be handled
>>> automatically.
>>>
>>> I consider this one of the great shortcomings of Python, but it's
>>> mostly just a temporary inconvenience -- the world is moving to
>>> Unicode, and with Python 3, we won't have to worry about it so much.
>> 
>> I don't see the shortcoming in Python <3.0.  If you want real strings
>> with characters instead of just a bunch of bytes simply use `unicode`
>> objects instead of `str`.
> 
> Fair enough -- that certainly is the best policy.  But working with any
> other encoding (sometimes necessary when interfacing with any other
> software), it's still a bit of a PITA.

But it has to be.  A byte string carries no record of its encoding, so
there is no automagic guessing possible.
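
A minimal sketch of why guessing fails, written in Python 3 syntax as an
assumption (there `bytes` is the byte string and `str` is the unicode
type; in Python 2 the same calls apply to `str` and `unicode`).  The same
two bytes decode without error under two encodings and give different
text, so nothing in the bytes themselves says which reading is right:

```python
# Assumption: Python 3, where str is Unicode text and bytes is raw bytes.
raw = "\u00e9".encode("utf-8")      # e with acute accent, as UTF-8 bytes

# Both decodes succeed; only one matches what the sender meant.
as_utf8 = raw.decode("utf-8")       # one character
as_latin1 = raw.decode("latin-1")   # two characters, classic mojibake

print(repr(raw), repr(as_utf8), repr(as_latin1))
```

The caller has to supply the encoding from outside knowledge (a header,
a file format spec, a convention), which is the PITA part.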

>> And does REALbasic really use byte strings plus an encoding!?
> 
> You betcha!  Works like a dream.

IMHO a strange design decision.  It is a lot more hassle than an opaque
unicode string type that uses a single internal encoding, which makes
getting the character at a given index easy and lets you concatenate
without re-encoding.
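
To illustrate what I mean, a small sketch, again assuming Python 3
syntax (with Python 2 you would get the same behaviour from `unicode`
objects).  The sample byte strings are made up for the example.  Decode
once at the boundary, and after that indexing and concatenation work on
characters no matter which encoding each piece arrived in:

```python
# Assumption: Python 3; the input bytes below are invented sample data.
a = b"caf\xe9".decode("latin-1")       # arrived as Latin-1
b = b" cr\xc3\xa8me".decode("utf-8")   # arrived as UTF-8

text = a + b          # plain concatenation, no re-encoding step
fourth = text[3]      # index by character, not by byte

# Encode only when handing the text to other software.
encoded = text.encode("utf-8")

print(len(text), len(encoded))
```

The character count and the byte count differ here, which is exactly
the bookkeeping an opaque unicode type hides from you.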

Ciao,
	Marc 'BlackJack' Rintsch