Unicode BOM marks

Tue Mar 8 10:52:18 EST 2005

"Martin v. Löwis" <martin at v.loewis.de> wrote in message
news:422cf441$0$12162$9b622d9e at news.freenet.de...
> Francis Girard wrote:
>> Well, no text files can't be concatenated ! Sooner or later, someone will 
>> use "cat" on the text files your application did generate. That will be a 
>> lot of fun for the new unicode aware "super-cat".
>
> Well, no. For example, Python source code is not typically concatenated,
> nor is source code in any other language. The same holds for XML files:
> concatenating two XML documents (using cat) gives an ill-formed document
> - whether the files start with a UTF-8 signature or not.

And if you're talking HTML and XML, the situation is even worse, since
the application absolutely needs to be aware of the signature. HTML might
have a <meta ... > directive close to the front to tell you what the
encoding is supposed to be, and then again, it might not. You should be
able to depend on the first character being a <, but you might not be
able to. FitNesse, for example, sends FIT a file that consists of the
HTML between the <body> and </body> tags, and nothing else. This
situation makes character set detection in PyFit, um, interesting.
(Fortunately, I have other ways of dealing with FitNesse, but it's still
an issue for batch use.)
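
For a bare fragment with no <meta ... > to consult, about the only cheap
check left is the signature itself. A rough sketch of that check follows.
It is not what PyFit actually does; the function name decode_fragment and
the "latin-1" fallback are assumptions made up for illustration.

```python
import codecs

# Checked in order. UTF-32 must come before UTF-16 because the UTF-32 LE
# signature (FF FE 00 00) begins with the UTF-16 LE signature (FF FE).
_SIGNATURES = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def decode_fragment(data, fallback="latin-1"):
    """Decode bytes using a leading signature if one is present.

    Returns (text, encoding). The fallback encoding is a guess
    (hypothetical default), used only when no signature is found.
    """
    for bom, encoding in _SIGNATURES:
        if data.startswith(bom):
            # Strip the signature so it does not show up in the text.
            return data[len(bom):].decode(encoding), encoding
    return data.decode(fallback), fallback
```

Without a signature this can only guess, which is exactly the batch-use
problem: the fallback is a policy decision, not a detection.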

> As for the "super-cat": there is actually no problem with putting U+FEFF
> in the middle of some document - applications are supposed to filter it
> out. The precise processing instructions in the Unicode standard vary
> from Unicode version to Unicode version, but essentially, you are
> supposed to ignore the BOM if you see it.

It would be useful for "super-cat" to filter all but the first one, however.
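
Something along these lines, say. This is only a sketch: it assumes every
input is UTF-8 bytes already held in memory, and the name super_cat is
just borrowed from this thread.

```python
import codecs

BOM = codecs.BOM_UTF8


def super_cat(parts):
    """Concatenate UTF-8 byte strings, keeping at most one signature.

    A leading signature is stripped from every part; one is put back at
    the front only if the first part started with one.
    """
    parts = list(parts)
    bodies = [p[len(BOM):] if p.startswith(BOM) else p for p in parts]
    head = BOM if parts and parts[0].startswith(BOM) else b""
    return head + b"".join(bodies)
```

Whether the output should carry a signature when only a later input had
one is debatable; this version follows the first input.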

John Roth