encoding problem

Marc 'BlackJack' Rintsch bj_666 at gmx.net
Sat Dec 20 07:07:07 EST 2008


On Fri, 19 Dec 2008 16:50:39 -0700, Joe Strout wrote:

> Marc 'BlackJack' Rintsch wrote:
> 
>>>> And does REALbasic really use byte strings plus an encoding!?
>>> You betcha!  Works like a dream.
>> 
>> IMHO a strange design decision.
> 
> I get that you don't grok it, but I think that's because you haven't
> worked with it.  RB added encoding data to its strings years ago, and
> changed the default string encoding to UTF-8 at about the same time, and
> life has been delightful since then.  The only time you ever have to
> think about it is when you're importing a string from some unknown
> source (e.g. a socket), at which point you need to tell RB what encoding
> it is.  From that point on, you can pass that string around, extract
> substrings, split it into words, concatenate it with other strings,
> etc., and it all Just Works (tm).

Except that you don't know for sure what the output encoding will be, as
it depends on which operations the strings went through in the program
flow.  So to be sure, you have to encode or recode at output too.  And
then it is the same as in Python: decode when bytes enter the program and
encode when (unicode) strings leave the program.
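
A minimal sketch of that decode-in/encode-out pattern, written for Python 3
(where `str` is the text type and `bytes` the byte type).  The sample bytes
and the Latin-1 input encoding are made up for illustration; in real code
the input encoding has to come from the source's own metadata or protocol.

```python
# Hypothetical input: bytes as they might arrive from a socket or file.
# Assumption for this sketch: we were told the sender uses Latin-1.
raw = b"caf\xe9 au lait"

# Decode once, at the point where bytes enter the program.
text = raw.decode("latin-1")

# Inside the program only text is handled: split, join, slice, etc.
words = text.split()
result = "-".join(words)

# Encode once, at the point where text leaves the program.
# The output encoding is chosen here, independent of the input encoding.
out = result.encode("utf-8")
```

Everything between the `decode` and the `encode` works on text only, so
the output encoding is decided in exactly one place.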

> In comparison, Python requires a lot more thought on the part of the
> programmer to keep track of what's what (unless, as you point out, you
> convert everything into unicode strings as soon as you get them, but
> that can be a very expensive operation to do on, say, a 500MB UTF-8 text
> file).

So it doesn't require more thought, unless you complicate it yourself, 
and that risk is language independent.

I would not do operations on 500 MiB of text in any language if there is 
any way to break that down into smaller chunks.  Slurping in large files 
whole doesn't scale very well.  On my Eee-PC even a 500 MiB byte `str` is 
(too) expensive.
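
One way to break the work down, sketched for Python 3: read the bytes in
fixed-size chunks and decode them incrementally, so a multi-byte character
that straddles a chunk boundary is still decoded correctly.  The function
name `iter_text_chunks` and the chunk size are my own choices for the
example, and the `BytesIO` object merely stands in for a large file.

```python
import codecs
import io


def iter_text_chunks(stream, encoding="utf-8", chunk_size=64 * 1024):
    """Yield decoded text from a binary stream, one chunk at a time."""
    # The incremental decoder buffers incomplete byte sequences between calls.
    decoder = codecs.getincrementaldecoder(encoding)()
    while True:
        data = stream.read(chunk_size)
        if not data:
            break
        text = decoder.decode(data)
        if text:
            yield text
    # Flush the decoder; this raises UnicodeDecodeError if the input
    # ended in the middle of a character.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail


# Stand-in for a large file: 18 bytes of UTF-8, read 5 bytes at a time,
# so some chunk boundaries fall inside a two-byte character.
stream = io.BytesIO("\u00e4\u00f6\u00fc".encode("utf-8") * 3)
pieces = list(iter_text_chunks(stream, chunk_size=5))
```

Only one chunk of bytes and its decoded text are held at a time by the
generator, so memory use stays bounded regardless of the file size.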

> But saying that having only one string type that knows it's Unicode, and
> another string type that hasn't the foggiest clue how to interpret its
> data as text, is somehow easier than every string knowing what it is and
> doing the right thing -- well, that's just silly.

Sorry, I meant the implementation, not the point of view of the 
programmer, which seems to be much the same in both languages.

Ciao,
	Marc 'BlackJack' Rintsch


