why isn't Unicode the default encoding?

Mon Mar 20 18:48:30 EST 2006

John Salerno wrote:
> Interesting. So then the read() method, if given a numeric argument for 
> bytes to read, would act differently depending on if you were using 
> Unicode or not?

The read method currently returns a byte string, not a Unicode string.
It's not clear to me how the numeric argument should be interpreted if
read() returns characters some day; it would probably be best for the
number to count characters then. However, not supporting a numeric
argument at all might also be reasonable.
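
To illustrate the "count characters" interpretation, here is a minimal
sketch. It assumes Python 3 and its io module (later than the Python
discussed in this thread), where io.TextIOWrapper decodes a byte
stream and read(n) on it counts characters. The sample string is
made up for the example.

```python
import io

# Hypothetical sample: 'a', e-acute, 'b'. In UTF-8 that is 4 bytes.
data = "a\u00e9b".encode("utf-8")

byte_stream = io.BytesIO(data)
text_stream = io.TextIOWrapper(io.BytesIO(data), encoding="utf-8")

two_bytes = byte_stream.read(2)   # binary stream: the argument counts bytes
two_chars = text_stream.read(2)   # text stream: the argument counts characters

print(repr(two_bytes))
print(repr(two_chars))
```

The same argument value gives two bytes (one and a half characters)
from the byte stream and two whole characters from the text stream.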

> As it is now, it seems to equate the bytes with number 
> of characters, but if the document was written using Unicode characters, 
> is it possible that read(2) might only pull out one character?

Unicode isn't a character encoding; it is a character set. In that
sense, *all* documents in the world are "written in Unicode",
including those encoded with ASCII or Latin-1.

In any case, it doesn't matter what encoding the document is in:
read(2) always returns two bytes. How many characters that constitutes
depends on the encoding - but read() doesn't return a character
string.
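
A small sketch of this, assuming Python 3 (where byte strings are the
bytes type) and using io.BytesIO as a stand-in for a file opened in
binary mode. The helper first_two_bytes and the sample text are
hypothetical names for the example.

```python
import io

sample = "a\u00e9"  # hypothetical sample text: 'a' followed by e-acute

def first_two_bytes(text, encoding):
    """Encode text, then read two bytes from it as a binary stream."""
    stream = io.BytesIO(text.encode(encoding))
    return stream.read(2)

for encoding in ("latin-1", "utf-8", "utf-16-le"):
    chunk = first_two_bytes(sample, encoding)
    # Always two bytes; how many characters they cover varies.
    print(encoding, len(chunk), repr(chunk))
```

In Latin-1 the two bytes are both characters, in UTF-8 they are 'a'
plus the first half of the e-acute, and in UTF-16-LE they are just
the 'a'.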

It might be that these two bytes are only part of a character,
e.g. if you need three bytes to encode a character, or it might
be that they are parts of two characters, e.g. when you get the
second byte of the first character and the first byte of the
second one. In some encodings (e.g. ISO-2022), these bytes
may indicate *no* character, e.g. when the bytes just indicate
an in-stream change of character set.
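
These cases can be sketched as follows, assuming Python 3 and its
codecs module. The helper decode_prefix is a hypothetical name; it
feeds bytes to an incremental decoder, which holds back incomplete
input instead of raising an error. ISO-2022-JP is used as one member
of the ISO-2022 family; its escape sequence here happens to be three
bytes long rather than two.

```python
import codecs

def decode_prefix(data, encoding):
    """Return the characters an incremental decoder completes from data."""
    decoder = codecs.getincrementaldecoder(encoding)()
    return decoder.decode(data)

# Case 1: two bytes of a three-byte character (the euro sign in UTF-8).
euro = "\u20ac".encode("utf-8")
partial = decode_prefix(euro[:2], "utf-8")   # no complete character yet

# Case 2: two bytes straddling two characters (e-acute, then euro sign).
pair = "\u00e9\u20ac".encode("utf-8")
straddle = pair[1:3]   # last byte of the first char, first byte of the second

# Case 3: bytes that only switch the character set (ISO-2022-JP).
japanese = "\u3042".encode("iso2022_jp")     # hiragana 'a'
escape = japanese[:3]                        # the leading escape sequence
no_chars = decode_prefix(escape, "iso2022_jp")

print(repr(partial), repr(straddle), repr(escape), repr(no_chars))
```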

Regards,
Martin