Unicode BOM marks

Steve Horsley shoot at the.moon
Wed Mar 9 16:16:38 EST 2005


Francis Girard wrote:
> On Monday 7 March 2005 21:54, "Martin v. Löwis" wrote:
> 
> Hi,
> 
> Thank you for your very informative answer. Some interspersed remarks follow.
> 
> 
>>I personally would write my applications so that they put the signature
>>into files that cannot be concatenated meaningfully (since the
>>signature simplifies encoding auto-detection) and leave out the
>>signature from files which can be concatenated (as concatenating the
>>files will put the signature in the middle of a file).
>>
> 
> 
> Well, there are no text files that can't be concatenated! Sooner or later,
> someone will use "cat" on the text files your application generated. That
> will be a lot of fun for the new Unicode-aware "super-cat".
> 

It is my understanding that the BOM (U+FEFF) is actually the 
Unicode character "zero width no-break space". I take 
this to mean that the character can appear invisibly 
anywhere in text, and its appearance as the first character 
of a text is pretty harmless. Concatenating files will 
leave invisible space characters in the middle of the text, 
but presumably not in the middle of words, so no harm is 
done there either.
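
As a rough illustration of the concatenation case, here is a minimal
Python sketch. The helper name concat_with_bom and the sample strings
are made up for the example; it assumes UTF-8 files that each start
with the signature, and it only shows where the extra U+FEFF ends up,
not whether a given program will tolerate it.

```python
import codecs
import unicodedata


def concat_with_bom(parts):
    """Encode each text with a UTF-8 signature and join the bytes,
    roughly what "cat" would produce from separately written files.
    (Hypothetical helper, for illustration only.)"""
    return b"".join(codecs.BOM_UTF8 + p.encode("utf-8") for p in parts)


data = concat_with_bom(["hello ", "world"])

# The "utf-8-sig" codec strips only a signature at the very start;
# the second file's signature survives as an ordinary character.
text = data.decode("utf-8-sig")

print(repr(text))
print(unicodedata.name(text[6]))
```

The surviving character sits between the two pieces of text, so it is
invisible when displayed but still present for anything that compares
or searches the string.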

I suspect that the fact that an explicitly invisible 
character U+FEFF has an invalid character code, U+FFFE, for 
its byte-reversed counterpart is no accident, and that the 
character was intended from inception to also serve as a 
byte order indication.
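
The byte-reversal point can be sketched in Python as follows. The
function name detect_utf16_order is hypothetical, and the sketch
assumes the data is UTF-16 (it deliberately ignores UTF-32, whose
little-endian signature begins with the same two bytes).

```python
import codecs


def detect_utf16_order(data: bytes) -> str:
    """Guess the UTF-16 byte order from a leading BOM.
    Hypothetical helper; assumes the input is UTF-16, not UTF-32."""
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return "big"
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE
        return "little"
    return "unknown"


be = "\ufeff".encode("utf-16-be")
le = "\ufeff".encode("utf-16-le")

# Reading the little-endian bytes with the wrong (big-endian) order
# yields the code U+FFFE rather than U+FEFF, which is how a reader
# can tell that the byte order is swapped.
misread = int.from_bytes(le, "big")

print(detect_utf16_order(be), detect_utf16_order(le), hex(misread))
```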

Steve


