unicode string problems

Bengt Richter bokr at oz.net
Mon Apr 1 21:58:49 EST 2002


On 02 Apr 2002 01:27:38 +0200, martin at v.loewis.de (Martin v. Loewis) wrote:

>bokr at oz.net (Bengt Richter) writes:
>
>> But it does make me think, should _all_ strings be subtypes
>> of a raw octet-string type according to their encoding? Then one
>> could visualize automatic inter-encoding promotions analogous to
>> numeric promotions, and if i/o sources and sinks have encoding
>> designators, Gonçalo's f.write("Março 2002" + march.Name()) should
>> "just work" if the output encoding permits.
>
>That assumes that the output encoding is known, or can be
>determined. As-is, it can't - you don't know the encoding of f, and
Well, I was wondering more where we're heading than 'As-is' ;-)

IOW, assume encoding was an optional keyword parameter to open/file.
Then you'd know what output encoding was desired.
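
A minimal sketch of that idea, assuming an open() that accepts an
encoding keyword (the built-in open in Python 3 does). The file name and
the choice of Latin-1 are only illustrative:

```python
import os
import tempfile

# Hypothetical output file; "latin-1" is just an example sink encoding.
path = os.path.join(tempfile.mkdtemp(), "march.txt")

# The encoding keyword tells the file object how to turn text into octets.
with open(path, "w", encoding="latin-1") as f:
    f.write("Março 2002")

# Reading the raw octets back shows that the sink's encoding was applied:
# in Latin-1 the character "ç" is the single byte 0xE7.
with open(path, "rb") as raw:
    data = raw.read()
```

If the text contained a character the sink encoding cannot represent,
the write would raise UnicodeEncodeError, which is the "if the output
encoding permits" case.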

>you don't know the encoding of "Março" (furthermore, the encoding of f
>won't help, since you have to perform the addition before invoking
>write).
Whoa, f is the last step. First would come the addition, and the premise
was that strings would have encoding attributes one way or another (different
subtype names?), so "Março 2002" would have a known encoding, and so would
whatever comes from march.Name(). Forming the sum would involve identifying
an encoding that contains codes for both character sets and converting both
operands to it as necessary. Only then, logically speaking, comes the
comparison with f's encoding. It might be optimized in case f is unicode,
since then there's no point in merging to an in-between encoding (if that's
even a possibility).
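
A rough sketch of what such a promotion could look like. EncodedStr and
its promotion order are hypothetical names made up for illustration, not
an existing Python type, and the three-step ladder is an assumption:

```python
class EncodedStr:
    """Hypothetical octet string that carries its encoding as an attribute."""

    # Assumed promotion ladder, narrowest first (illustrative only).
    _PROMOTION = ["ascii", "latin-1", "utf-8"]

    def __init__(self, octets, encoding):
        self.octets = octets
        self.encoding = encoding

    def text(self):
        # Decode the octets according to the encoding they were tagged with.
        return self.octets.decode(self.encoding)

    def __add__(self, other):
        combined = self.text() + other.text()
        # Like numeric promotion: start at the wider of the two operand
        # encodings and move up until one can hold every character.
        start = max(self._PROMOTION.index(self.encoding),
                    self._PROMOTION.index(other.encoding))
        for candidate in self._PROMOTION[start:]:
            try:
                return EncodedStr(combined.encode(candidate), candidate)
            except UnicodeEncodeError:
                continue
        raise ValueError("no common encoding for the operands")


year = EncodedStr(b"2002 ", "ascii")
month = EncodedStr("Março".encode("latin-1"), "latin-1")
greek = EncodedStr("Μάρτιος".encode("utf-8"), "utf-8")

narrow_sum = year + month    # promoted to latin-1
wide_sum = narrow_sum + greek  # promoted to utf-8
```

Only after the sum exists would its encoding be compared with that of f,
converting once more if f wants something else.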

But no string would exist without at least an assumption as to its encoding.
I guess you could do unix file-type magic to infer the encoding if you had to,
but that wouldn't seem reliable or cheap, except for UTF & co.
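
For the UTF case, the cheap check is presumably a byte order mark at the
start of the data. A small sketch, where sniff_encoding is a hypothetical
helper and anything without a BOM is simply reported as unknown:

```python
import codecs

# Check longer marks first: the UTF-32 LE mark starts with the UTF-16 LE mark.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_encoding(octets):
    """Return an encoding name if the data starts with a known BOM, else None."""
    for bom, name in _BOMS:
        if octets.startswith(bom):
            return name
    return None


with_bom = sniff_encoding(codecs.BOM_UTF8 + b"Mar\xc3\xa7o")
without_bom = sniff_encoding(b"Mar\xe7o")  # Latin-1 octets, no mark to go on
```

Data without a mark gives no answer at all, which is the unreliable part.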

It's an interesting exercise in keeping distinctions between codes and
representations straight ;-)

>In an all-Unicode approach, f would have been obtained from
>codecs.open, and the string literal would have been a Unicode literal.
>
One day maybe I'll need to think about this ;-)

Regards,
Bengt Richter


