encoding problem

Fri Dec 19 18:50:39 EST 2008

Marc 'BlackJack' Rintsch wrote:

>>> I don't see the shortcoming in Python <3.0.  If you want real strings
>>> with characters instead of just a bunch of bytes simply use `unicode`
>>> objects instead of `str`.
>> Fair enough -- that certainly is the best policy.  But working with any
>> other encoding (sometimes necessary when interfacing with any other
>> software), it's still a bit of a PITA.
> 
> But it has to be.  There is no automagic guessing possible.

Automagic guessing isn't necessary if strings keep track of what encoding 
their data is in.  And why shouldn't they?  We're a long way from the day 
when a "string" was nothing more than an array of bytes.  Adding a teeny 
bit of metadata makes life much easier.

>>> And does REALbasic really use byte strings plus an encoding!?
>> You betcha!  Works like a dream.
> 
> IMHO a strange design decision.

I get that you don't grok it, but I think that's because you haven't 
worked with it.  RB added encoding data to its strings years ago, and 
changed the default string encoding to UTF-8 at about the same time, and 
life has been delightful since then.  The only time you ever have to 
think about it is when you're importing a string from some unknown 
source (e.g. a socket), at which point you need to tell RB what encoding 
it is.  From that point on, you can pass that string around, extract 
substrings, split it into words, concatenate it with other strings, 
etc., and it all Just Works (tm).
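
To make the idea concrete, here is a rough Python sketch of a byte string 
that carries its encoding along.  It is not how RB implements it; the 
class name TaggedString and its methods are made up for illustration, and 
it decodes internally just to keep the sketch short.

```python
class TaggedString(object):
    """Hypothetical byte string that remembers its own encoding."""

    def __init__(self, data, encoding):
        self.data = bytes(data)      # the raw bytes, untouched
        self.encoding = encoding     # the "teeny bit of metadata"

    def text(self):
        # Interpret the bytes as characters using the stored encoding.
        return self.data.decode(self.encoding)

    def split(self):
        # Every piece inherits the parent's encoding tag.
        return [TaggedString(w.encode(self.encoding), self.encoding)
                for w in self.text().split()]

    def substring(self, start, end):
        # Indices count characters, not bytes.
        piece = self.text()[start:end]
        return TaggedString(piece.encode(self.encoding), self.encoding)


# Tag the data once, at the point where it arrives (e.g. from a socket).
raw = b"caf\xc3\xa9 au lait"
s = TaggedString(raw, "utf-8")
words = s.split()
first = s.substring(0, 4)
```

After the one-time tagging, the caller never mentions "utf-8" again; the 
pieces in words and first all know what they are.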

In comparison, Python requires a lot more thought on the part of the 
programmer to keep track of what's what (unless, as you point out, you 
convert everything into unicode strings as soon as you get them, but 
that can be a very expensive operation to do on, say, a 500MB UTF-8 text 
file).
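
For reference, the "convert as soon as you get it" approach can at least 
be done incrementally in Python.  The sketch below uses the standard 
io.open; the helper name iter_decoded_lines and the temporary demo file 
are my own stand-ins.  It does not make the decoding any cheaper, it only 
avoids holding the whole decoded file in memory at once.

```python
import io
import os
import tempfile


def iter_decoded_lines(path, encoding="utf-8"):
    """Yield unicode lines from a file, decoding as the file is read."""
    with io.open(path, "r", encoding=encoding) as f:
        for line in f:
            yield line


# Demo on a small temporary file standing in for the large one.
fd, demo_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as out:
    out.write(u"na\u00efve\ncaf\u00e9\n".encode("utf-8"))

lines = list(iter_decoded_lines(demo_path))
os.remove(demo_path)
```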

> A lot more hassle compared to an opaque 
> unicode string type which uses some internal encoding that makes 
> operations like getting a character at a given index easy or 
> concatenating without the need to reencode.

No.  RB supports UCS-2 encoding, too, and is smart enough to take 
advantage of a fixed character width whenever a string's encoding 
happens to have one.  And no reencoding is done when it's not 
necessary (e.g., concatenating two strings of the same encoding, or 
adding an ASCII string to a string using any ASCII superset, such as 
UTF-8).  There's nothing stopping you from converting all your strings 
to UCS-2 when you get them, if that's your preference.
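
The concatenation rule described above might look something like this in 
Python.  This is a sketch under my own assumptions, not RB's code: the 
function name concat, the short list of ASCII supersets, and the choice 
of UTF-8 as the fallback target are all illustrative.

```python
import codecs


def _norm(name):
    # Normalize aliases such as "UTF8" or "latin-1" to one canonical name.
    return codecs.lookup(name).name


# Assumption: a small, non-exhaustive list of ASCII-compatible encodings.
ASCII = _norm("ascii")
ASCII_SUPERSETS = set(_norm(n) for n in ("ascii", "utf-8", "latin-1", "cp1252"))


def concat(a, a_enc, b, b_enc):
    """Return (bytes, encoding) for a + b, reencoding only when needed."""
    a_enc, b_enc = _norm(a_enc), _norm(b_enc)
    if a_enc == b_enc:
        return a + b, a_enc          # same encoding: plain byte append
    if b_enc == ASCII and a_enc in ASCII_SUPERSETS:
        return a + b, a_enc          # ASCII bytes are valid in a's encoding
    if a_enc == ASCII and b_enc in ASCII_SUPERSETS:
        return a + b, b_enc
    # Otherwise transcode both sides to a common encoding (assumed: UTF-8).
    joined = a.decode(a_enc) + b.decode(b_enc)
    return joined.encode("utf-8"), _norm("utf-8")


# UTF-8 plus ASCII: no reencoding happens.
same = concat(b"caf\xc3\xa9", "utf-8", b" au lait", "ascii")
# Latin-1 plus UTF-8: both get transcoded.
mixed = concat(b"caf\xe9", "latin-1", b"\xe2\x82\xac", "utf-8")
```

Only the last branch touches the character data; the first three are 
straight byte appends.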

But saying that having one string type that knows it's Unicode, and 
another string type that hasn't the foggiest clue how to interpret its 
data as text, is somehow easier than every string knowing what it is and 
doing the right thing -- well, that's just silly.

Best,
- Joe