Unicode BOM marks

Mon Mar 7 19:39:29 EST 2005

Francis Girard wrote:
> Well, no text files can't be concatenated ! Sooner or later, someone will use 
> "cat" on the text files your application did generate. That will be a lot of 
> fun for the new unicode aware "super-cat".

Well, no. For example, Python source code is not typically concatenated,
nor is source code in any other language. The same holds for XML files:
concatenating two XML documents (using cat) gives an ill-formed document
- whether the files start with a UTF-8 signature or not.
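
To make the XML point concrete, here is a small sketch. It assumes
the xml.etree.ElementTree module is available (it is in current
Python releases); the helper name is_well_formed and the two tiny
sample documents are made up for illustration.

```python
import codecs
import xml.etree.ElementTree as ET


def is_well_formed(data):
    """Return True if the byte string parses as one XML document."""
    try:
        ET.fromstring(data)
    except ET.ParseError:
        return False
    return True


# Two minimal documents, invented for this example.
doc_a = b"<?xml version='1.0' encoding='utf-8'?><a/>"
doc_b = b"<?xml version='1.0' encoding='utf-8'?><b/>"

# The same documents with a UTF-8 signature (EF BB BF) prepended.
signed_a = codecs.BOM_UTF8 + doc_a
signed_b = codecs.BOM_UTF8 + doc_b

# What "cat a.xml b.xml" would produce, without and with signatures.
plain_cat = doc_a + doc_b
signed_cat = signed_a + signed_b

for label, data in [("single", doc_a), ("single+sig", signed_a),
                    ("cat", plain_cat), ("cat+sig", signed_cat)]:
    print(label, is_well_formed(data))
```

Each document parses on its own, with or without the signature; the
concatenation fails in both cases, so the signature is not what
breaks it.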

As for the "super-cat": there is actually no problem with putting U+FEFF
in the middle of some document - applications are supposed to filter it
out. The precise processing instructions in the Unicode standard vary
from Unicode version to Unicode version, but essentially, you are
supposed to ignore the BOM if you see it.
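
A rough sketch of what "ignore the BOM" could look like for an
application reading the output of cat. The function name strip_boms
is invented here, and dropping every U+FEFF is a simplification: the
exact rules differ between Unicode versions, as noted above.

```python
import codecs

BOM = u"\ufeff"  # the code point used as byte order mark / signature


def strip_boms(text):
    """Drop every U+FEFF from already-decoded text (a simplification)."""
    return text.replace(BOM, u"")


# Two UTF-8 "files", each starting with the UTF-8 signature.
part_a = codecs.BOM_UTF8 + u"first\n".encode("utf-8")
part_b = codecs.BOM_UTF8 + u"second\n".encode("utf-8")

# Plain byte-wise concatenation, as cat would do it.
combined = (part_a + part_b).decode("utf-8")

# The second signature now sits in the middle of the text.
print(repr(combined))
print(repr(strip_boms(combined)))
```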

> BTW, the python "unicode" built-in function documentation says it returns a 
> "unicode" string which scarcely means something. What is the python 
> "internal" unicode encoding ?

A Unicode string is a sequence of integers. The numbers are typically
represented as base-2, but the details depend on the C compiler.
It is specifically *not* UTF-16, big or little endian (i.e. a single
number is *not* a sequence of bytes). It may be UCS-2 or UCS-4,
depending on a compile-time choice; sys.maxunicode tells you which:
it is 65535 on a UCS-2 build and 1114111 on a UCS-4 build.
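
For illustration, the check could be written like this. The function
name describe_build is made up; only sys.maxunicode and the two
values are taken from the text above. Note that Python 3.3 and later
always report 1114111, so the distinction only shows up on older
interpreters.

```python
import sys


def describe_build(maxunicode=sys.maxunicode):
    """Map a sys.maxunicode value to the build type described above."""
    if maxunicode == 0xFFFF:        # 65535
        return "UCS-2"
    if maxunicode == 0x10FFFF:      # 1114111
        return "UCS-4"
    return "unknown"


print(sys.maxunicode, describe_build())
```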

The programming interface to the individual characters is formed by
the unichr and ord builtin functions, which expect and return integers
between 0 and sys.maxunicode.
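
A short sketch of that interface. The try/except at the top is an
assumption added so that the same lines also run on Python 3, where
unichr no longer exists and chr takes its place; the sample string is
arbitrary.

```python
import sys

try:
    unichr            # present on Python 2
except NameError:
    unichr = chr      # Python 3: chr does what unichr did

# A string is a sequence of integers: look at them with ord.
sample = u"a\u20ac"   # LATIN SMALL LETTER A, EURO SIGN
numbers = [ord(ch) for ch in sample]
print(numbers)

# unichr goes the other way: integer in, one-character string out.
rebuilt = u"".join(unichr(n) for n in numbers)
print(rebuilt == sample)

# The largest accepted argument is sys.maxunicode.
print(ord(unichr(sys.maxunicode)) == sys.maxunicode)
```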

Regards,
Martin