Unicode BOM marks

Mon Mar 7 17:56:57 EST 2005

On Monday 7 March 2005 21:54, "Martin v. Löwis" wrote:

Hi,

Thank you for your very informative answer. Some interspersed remarks follow.

>
> I personally would write my applications so that they put the signature
> into files that cannot be concatenated meaningfully (since the
> signature simplifies encoding auto-detection) and leave out the
> signature from files which can be concatenated (as concatenating the
> files will put the signature in the middle of a file).
>

Well, there is no text file that can't be concatenated! Sooner or later,
someone will use "cat" on the text files your application generated. That
will be a lot of fun for the new Unicode-aware "super-cat".
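
To make the concatenation problem concrete, here is a minimal sketch in
present-day Python 3 (which this thread predates). The two byte strings are
made-up stand-ins for files that an application wrote with the signature:

```python
import codecs

# Two hypothetical "files", each written with the UTF-8 signature in front.
part_one = codecs.BOM_UTF8 + "first\n".encode("utf-8")
part_two = codecs.BOM_UTF8 + "second\n".encode("utf-8")

# What a plain "cat part_one part_two" would produce.
joined = part_one + part_two

# The plain UTF-8 codec reports every signature as a U+FEFF character,
# so the second one ends up in the middle of the decoded text.
text = joined.decode("utf-8")
print(repr(text))
```

A signature-aware "cat" would have to drop the second signature itself.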

> > I guess that this leading BOM is a special sequence of marker bytes that
> > can in no way be decoded as valid text.
> > Right ?
>
> Wrong. The BOM mark decodes as U+FEFF:
>
>  >>> import codecs
>  >>> codecs.BOM_UTF8.decode("utf-8")
>  u'\ufeff'

By "valid text" I meant actual human-readable natural-language text. My
intent with this question was to make sure that we can easily distinguish a
UTF-8 file with the signature from one without. Your answer implies a "yes".
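
A minimal sketch of that check, assuming Python 3 and a hypothetical helper
name (has_utf8_signature is not part of any library). It only looks at the
first three bytes, so it cannot tell a signature from real content that
happens to start with U+FEFF:

```python
import codecs

def has_utf8_signature(data: bytes) -> bool:
    """Return True if the raw bytes start with the UTF-8 signature (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)

# Made-up sample data: the same text with and without the signature.
with_sig = codecs.BOM_UTF8 + "bonjour".encode("utf-8")
without_sig = "bonjour".encode("utf-8")

print(has_utf8_signature(with_sig), has_utf8_signature(without_sig))
```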

> > I also guess that this leading BOM is silently ignored by any
> > Unicode-aware file stream reader that we have already told that the
> > file follows the UTF-8 encoding standard.
> > Right ?
>
> No. It should eventually be ignored by the application, but whether the
> stream reader special-cases it or not depends on application needs.
>

Well, for most of us, I think, the need is to transparently decode the input
into a single internal Unicode encoding (UTF-16 for both Java and Qt; the Qt
docs say there might be a need to switch to UTF-32 some day) and then be
able to manipulate this internal text with the usual tools the programming
system provides. By "transparent", I mean, at least, being able to
automatically process the two variants of the same UTF-8 encoding. We should
only have to specify "UTF-8" and the streamer takes care of the rest.
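
A minimal sketch of such a transparent decoder at the application level,
assuming Python 3. The name decode_utf8_either is hypothetical; it simply
decodes with the plain "utf-8" codec and drops one leading U+FEFF:

```python
def decode_utf8_either(data: bytes) -> str:
    """Decode UTF-8 bytes, accepting input with or without the signature."""
    text = data.decode("utf-8")
    # The plain codec reports the signature as a leading U+FEFF character;
    # remove at most one, and only at the very start.
    if text.startswith("\ufeff"):
        text = text[1:]
    return text

print(decode_utf8_either(b"\xef\xbb\xbfabc"))  # input with signature
print(decode_utf8_either(b"abc"))              # input without signature
```

A U+FEFF anywhere else in the text is left alone, since only the first
character can be a signature.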

BTW, the documentation of the Python built-in function "unicode" says it
returns a "unicode" string, which hardly tells me anything. What is Python's
"internal" Unicode encoding?

>
> No; the Python UTF-8 codec is unaware of the UTF-8 signature. It reports
> it to the application when it finds it, and it will never generate the
> signature on its own. So processing the UTF-8 signature is left to the
> application in Python.
>
Ok.

> > In the Python documentation, I see these constants. The documentation is not
> > clear about which encoding these constants apply to. Here's my understanding:
> >
> > BOM : UTF-8 only or UTF-8 and UTF-32 ?
>
> UTF-16.
>
> > BOM_BE : UTF-8 only or UTF-8 and UTF-32 ?
> > BOM_LE : UTF-8 only or UTF-8 and UTF-32 ?
>
> UTF-16
>
Ok.
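
This can be checked directly against the codecs module. A minimal sketch,
assuming Python 3, where the explicitly named BOM_UTF16_* and BOM_UTF32_*
constants are also available:

```python
import codecs
import sys

# The short names are aliases for the UTF-16 constants.
be_is_utf16 = codecs.BOM_BE == codecs.BOM_UTF16_BE
le_is_utf16 = codecs.BOM_LE == codecs.BOM_UTF16_LE

# codecs.BOM is the UTF-16 BOM in the byte order of the running machine.
native_bom = codecs.BOM_UTF16_LE if sys.byteorder == "little" else codecs.BOM_UTF16_BE
bom_is_native = codecs.BOM == native_bom

print(be_is_utf16, le_is_utf16, bom_is_native)

# The UTF-8 and UTF-32 marks have their own, explicitly named constants.
print(codecs.BOM_UTF8, codecs.BOM_UTF32_BE, codecs.BOM_UTF32_LE)
```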

> > Why should I need these constants if the codecs' decoders can handle them
> > without my help, given only the encoding ?
>
> Well, because the codecs don't. It might be useful to add a
> "utf-8-signature" codec some day, which generates the signature on
> encoding, and removes it on decoding.
>
Ok.
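
For readers coming to this thread later: Python releases made after this
exchange ship a codec along these lines under the name "utf-8-sig". A minimal
sketch of its behaviour, assuming Python 3:

```python
import codecs

# Encoding with "utf-8-sig" writes the signature in front of the data.
encoded = "abc".encode("utf-8-sig")
print(encoded.startswith(codecs.BOM_UTF8))

# Decoding with "utf-8-sig" removes a leading signature if there is one,
# and accepts input without a signature as well.
decoded_with = encoded.decode("utf-8-sig")
decoded_without = b"abc".decode("utf-8-sig")
print(decoded_with, decoded_without)
```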

My sincere thanks,

Francis Girard

> Regards,
> Martin