unicode string problems

Tue Apr 2 01:05:34 EST 2002

bokr at oz.net (Bengt Richter) writes:

> >That assumes that the output encoding is known, or can be
> >determined. As-is, it can't - you don't know the encoding of f, and
> Well, I was wondering more where we're heading than 'As-is' ;-)
> 
> IOW, assume encoding was an optional keyword parameter to open/file.
> Then you'd know what output encoding was desired.

Provided that the parameter is actually supplied. You get this today, via
codecs.open.
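
A minimal sketch of what that looks like, assuming a throwaway file in a
temporary directory and Latin-1 as the desired output encoding (both are
picked for illustration only). It is written so that it also runs on
later Python versions, where the built-in open accepts an encoding
argument as well:

```python
import codecs
import os
import tempfile

# Hypothetical path and encoding, chosen only for this example.
path = os.path.join(tempfile.mkdtemp(), "example.txt")
text = u"Gr\u00fc\u00dfe"  # the word "Gruesse" spelled with u-umlaut and sharp s

# The encoding argument tells the file object how to turn characters
# into bytes on write.
with codecs.open(path, "w", encoding="latin-1") as f:
    f.write(text)

# Reading the file in binary mode shows the bytes that were written.
with open(path, "rb") as f:
    raw = f.read()

# Reading through codecs.open with the same encoding restores the text.
with codecs.open(path, "r", encoding="latin-1") as f:
    round_trip = f.read()
```

If the encoding argument is left out, the file object has no such
information, which is the "as-is" situation discussed above.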

> But no string would exist without at least an assumption as to its
> encoding.

This alternative (associate each string with an encoding) has been
considered and rejected: it was judged better to support only Unicode
and explicit conversion. Otherwise, combining two strings with
different, incompatible encodings might produce interesting results.
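
To illustrate the kind of "interesting results" meant here, a small
sketch that joins two byte strings holding the same character in
different encodings. The choice of Latin-1 and UTF-8 is an assumption
made for the example:

```python
# The same character (u-umlaut) encoded two different ways.
a = u"\u00fc".encode("latin-1")  # one byte: 0xFC
b = u"\u00fc".encode("utf-8")    # two bytes: 0xC3 0xBC

# Concatenating the bytes works, but the result has no single encoding.
mixed = a + b

# Interpreted as UTF-8, the leading 0xFC byte is invalid.
try:
    mixed.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Interpreted as Latin-1, decoding succeeds but yields three characters
# instead of the intended two.
as_latin1 = mixed.decode("latin-1")
```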

Notice that the current approach gives the same result as the "each
string has an encoding" approach would give if it worked: Somehow, the
application has to tell, for each string, what the encoding is. *If*
the application can tell, it should currently convert that string to
Unicode - then the information about which byte means which character
is preserved in the Unicode string.
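
As a sketch of that explicit conversion, assume the application knows
that one input is Latin-1 and the other is UTF-8 (the byte values and
encodings below are made up for the example). Each string is decoded
once, at the point where its encoding is known, and everything after
that works on Unicode:

```python
# Two byte strings carrying the same text in different encodings.
latin1_bytes = b"caf\xe9"
utf8_bytes = b"caf\xc3\xa9"

# Explicit conversion: the application states the encoding per string.
s1 = latin1_bytes.decode("latin-1")
s2 = utf8_bytes.decode("utf-8")

# Once decoded, the two compare equal and can be combined safely.
same = (s1 == s2)
combined = s1 + u" " + s2

# Encoding happens again only on output, with one chosen encoding.
out = combined.encode("utf-8")
```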

Also notice that there *is* an assumption about the encoding of each
string: sys.getdefaultencoding().
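
A short sketch for inspecting that assumption. The value returned
depends on the Python version and installation, so none is shown here;
the sample bytes are plain ASCII so that they decode under any common
default:

```python
import sys

# The encoding Python assumes when none is given explicitly.
default = sys.getdefaultencoding()

# Decoding with the default made explicit; pure-ASCII input is chosen
# so the result does not depend on which default is in effect.
decoded = b"plain ascii".decode(default)

print(default)
```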

> I guess you could do unix file-type magic to infer encoding if you had to,
> but it wouldn't seem reliable or cheap except utf & co.

Indeed. That can't really work.

Regards,
Martin