Is there really a default source encoding?

"Martin v. Löwis" martin at v.loewis.de
Fri Jan 24 06:13:11 EST 2003


Brian Quinlan wrote:
> No, UTF-32 exists. For Japanese, UTF-8 requires (at minimum) 50% more
> space per character than UTF-16. I was being facetious with my UTF-32
> comment. But UTF-32 may become more efficient than UTF-16, for some
> languages (e.g. Sanskrit), in the future.

Hardly so. UTF-16 requires four bytes per character in the worst case; 
UTF-32 requires four bytes per character for every character.
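
The size comparison can be checked with a short sketch. It assumes a
Python 3 interpreter with the standard utf-16-be and utf-32-be codecs
(the -be variants are used so that no BOM is counted); encoded_size is
a hypothetical helper name, not part of any library.

```python
def encoded_size(text, encoding):
    """Return the number of bytes that `text` occupies in `encoding`."""
    return len(text.encode(encoding))


# One character from each range of interest (illustrative choices).
samples = {
    "ASCII letter A": "A",
    "Hiragana U+3042": "\u3042",
    "Supplementary U+10000": "\U00010000",
}

for label, ch in samples.items():
    sizes = {
        enc: encoded_size(ch, enc)
        for enc in ("utf-8", "utf-16-be", "utf-32-be")
    }
    print(label, sizes)
```

For the Hiragana character, UTF-8 takes three bytes against two for
UTF-16 (the 50% overhead mentioned above); for the supplementary
character, UTF-16 and UTF-32 both take four.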

> I don't understand. In UTF-8, the BOM allows you to easily distinguish
> between documents with UTF-8 encoding and a locale dependent
> byte-encoding. For multibyte encodings (e.g. UTF-16) it is impossible to
> determine the encoding without knowing the byte order. Do you have some
> other solution with a feasible implementation?

I think the entire BOM issue is messed up. UTF-16 was originally 
intended to be big-endian, AFAIK. It is only those Microsoft engineers 
who missed this point (or deliberately ignored it) that required 
introduction of the BOM afterwards.

I usually refer to the UTF-8 BOM as "UTF-8 signature", as it does not 
indicate a byte order, but indicates the encoding itself.
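
A minimal sketch of treating those three bytes as a signature, assuming
a Python 3 interpreter with the standard codecs module and the
utf-8-sig codec; has_utf8_signature is a hypothetical helper name used
only for illustration.

```python
import codecs


def has_utf8_signature(data):
    """Return True if the byte string starts with the UTF-8 signature."""
    return data.startswith(codecs.BOM_UTF8)


# Build a sample document that carries the signature.
raw = codecs.BOM_UTF8 + "Löwis".encode("utf-8")

print(has_utf8_signature(raw))

# The utf-8-sig codec drops a leading signature if one is present.
print(raw.decode("utf-8-sig"))

# The plain utf-8 codec keeps it as a leading U+FEFF character.
print(repr(raw.decode("utf-8")))
```

The check says nothing about byte order; it only hints that the rest of
the data is probably UTF-8 rather than a locale-dependent encoding.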

Regards,
Martin




