encoding problem

Fri Dec 19 18:38:26 EST 2008

On Dec 20, 10:02 am, Marc 'BlackJack' Rintsch <bj_... at gmx.net> wrote:
> On Fri, 19 Dec 2008 15:20:08 -0700, Joe Strout wrote:
> > Marc 'BlackJack' Rintsch wrote:
>
> >>> And because strings in Python, unlike in (say) REALbasic, do not know
> >>> their encoding -- they're just a string of bytes.  If they were a
> >>> string of bytes PLUS an encoding, then every string would know what it
> >>> is, and things like conversion to another encoding, or concatenation
> >>> of two strings that may differ in encoding, could be handled
> >>> automatically.
>
> >>> I consider this one of the great shortcomings of Python, but it's
> >>> mostly just a temporary inconvenience -- the world is moving to
> >>> Unicode, and with Python 3, we won't have to worry about it so much.
>
> >> I don't see the shortcoming in Python <3.0.  If you want real strings
> >> with characters instead of just a bunch of bytes simply use `unicode`
> >> objects instead of `str`.
>
> > Fair enough -- that certainly is the best policy.  But working with any
> > other encoding (sometimes necessary when interfacing with any other
> > software), it's still a bit of a PITA.
>
> But it has to be.  There is no automagic guessing possible.
>
> >> And does REALbasic really use byte strings plus an encoding!?
>
> > You betcha!  Works like a dream.
>
> IMHO a strange design decision.  A lot more hassle compared to an opaque
> unicode string type which uses some internal encoding that makes
> operations like getting a character at a given index easy or
> concatenating without the need to reencode.

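In general I quite agree with you ... however, with Unicode "getting a
character at a given index" is fine unless and until you stray (or are
dragged!) outside the BMP and you have only a 16-bit Unicode
implementation. There, a non-BMP character occupies two 16-bit code
units (a surrogate pair), so an index no longer maps one-to-one onto a
character.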
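
To make the BMP point concrete, here is a rough sketch of what a 16-bit
implementation stores for a string containing one non-BMP character. It
assumes Python 3 syntax, and the helper name utf16_code_units is made up
for this illustration; it is not part of any library.

```python
import struct

def utf16_code_units(text):
    """Return the 16-bit code units that UTF-16 uses to store `text`.

    Hypothetical helper for illustration only.
    """
    data = text.encode("utf-16-le")  # little-endian, no byte-order mark
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

# U+1D11E MUSICAL SYMBOL G CLEF lies outside the BMP.
sample = "a\U0001D11Eb"
units = utf16_code_units(sample)

# Three characters, but four code units: the clef becomes a surrogate pair.
print([hex(u) for u in units])

# On a 16-bit implementation, index 1 would yield only the high surrogate,
# not the whole character.
print(hex(units[1]))
```

With three characters going in and four code units coming out, indexing by
code unit can land in the middle of a character, which is the trap described
above.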