Slightly OT: Unicode BOM

pcarey at lexmark.com
Wed Oct 23 08:18:59 EDT 2002


Hello Python Champs.

Sorry for yet another Unicode question (I'm slowly getting through the book "Unicode: A Primer").

I am assembling an XML file that contains text in 26 different languages, for use with a Python app.
I have received all the translations back from our translation vendor (I specifically requested UTF-8 encoding).
So I pieced the XML file back together, ran it through the well-formedness checker at http://validator.w3.org/
and received the following message:

"UTF-8 'BOM' detected and removed"

"The document contained an UTF-8 encoded Unicode Byte Order Mark (BOM)
as the first character and we have removed it before parsing.
Many XML Processors do not allow it.
To be on the safe side you should avoid using the BOM in UTF-8 encoded documents."

(If it matters, I am pretty sure that our Traditional Chinese translator sent back UTF-16, which I copied and
 pasted into the UTF-8 XML file.)
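
For what it's worth, here is a minimal sketch of converting such a UTF-16 fragment to UTF-8 in code instead of pasting it in by hand. It assumes a current Python 3 and that the fragment really is UTF-16; the function name utf16_to_utf8 is only illustrative. If the fragment carries no BOM, the "utf-16" codec falls back to the platform's native byte order, which may or may not be what the translator used.

```python
def utf16_to_utf8(raw: bytes) -> bytes:
    """Re-encode UTF-16 bytes as UTF-8 (hypothetical helper, not from any library)."""
    # The "utf-16" codec reads a leading BOM to pick the byte order
    # and does not include it in the decoded text.
    text = raw.decode("utf-16")
    # The plain "utf-8" codec never writes a BOM on encoding.
    return text.encode("utf-8")
```

The result can then be spliced into the UTF-8 file as raw bytes, without a stray BOM landing in the middle of the document.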

Anyone have any ideas on how I can be on the "safe side" and avoid using the BOM (byte order mark) in the future?
Also, if anyone has any tried-and-true references on Unicode, please pass those along.
(Something like a Unicode-Python no-no FAQ would be awesome.)
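
One possible way to stay on the "safe side", sketched under the assumption of a current Python 3: read the file with the "utf-8-sig" codec, which drops a leading BOM if there is one, and write it back with plain "utf-8", which never adds one. The function name strip_utf8_bom and the two path parameters are illustrative, not part of any library.

```python
def strip_utf8_bom(src_path, dst_path):
    """Copy a UTF-8 text file, dropping a leading BOM if present (hypothetical helper)."""
    # "utf-8-sig" skips a BOM at the start of the file when decoding;
    # newline="" leaves the original line endings untouched.
    with open(src_path, "r", encoding="utf-8-sig", newline="") as src:
        text = src.read()
    # Plain "utf-8" writes the text back without any BOM.
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)
```

A file that has no BOM to begin with passes through unchanged, so this should be safe to run on every file the vendor sends back.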

Feeling like a true rookie,

PETE





