remove BOM from string read from utf-8 file

Piet van Oostrum piet at cs.uu.nl
Fri Feb 27 09:51:35 EST 2004


>>>>> "Achim Domma" <domma at procoders.net> (AD) wrote:

AD> Hi,
AD> I read some text from a utf-8 encoded text file like this:

AD> text = codecs.open('example.txt','r','utf8').read()

AD> If I pass this text to a COM object, I can see that there is still the BOM
AD> in the file, which marks the file as utf-8. Simply removing the first
AD> character in the string is not ok, because the BOM is optional. So I tried
AD> something like this:

The BOM is in the file, but not in the string 'text'.
text is a unicode string, which consists of Unicode characters, and the BOM
is not a Unicode character.

Check text[0] and len(text) to verify.
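
The check can be sketched roughly as follows. The helper names (read_utf8,
describe, write_sample) and the temporary sample files are assumptions made
for illustration; only codecs.open and codecs.BOM_UTF8 come from the thread.
What the check reports depends on the codec: as far as I know, the plain
'utf8' codec in current Python versions leaves a leading U+FEFF in the
decoded text, while the 'utf-8-sig' codec removes it, so it is worth running
this against your own file rather than assuming either result.

```python
import codecs
import os
import tempfile


def read_utf8(path):
    """Read a file the way the original post does: codecs.open with 'utf8'."""
    f = codecs.open(path, 'r', 'utf8')
    try:
        return f.read()
    finally:
        f.close()


def describe(text):
    """Return (first character, length), the two values suggested for inspection."""
    return (text[:1], len(text))


def write_sample(with_bom):
    """Hypothetical helper: write b'hello' to a temp file, optionally BOM-prefixed."""
    fd, path = tempfile.mkstemp(suffix='.txt')
    with os.fdopen(fd, 'wb') as f:
        if with_bom:
            f.write(codecs.BOM_UTF8)
        f.write(b'hello')
    return path


if __name__ == '__main__':
    for with_bom in (True, False):
        path = write_sample(with_bom)
        try:
            first, length = describe(read_utf8(path))
            # repr() makes an otherwise invisible U+FEFF visible
            print(with_bom, repr(first), length)
        finally:
            os.remove(path)
```

Comparing the reported length with the number of visible characters is the
quickest way to see whether an extra leading character came through.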

Moreover, BOM_UTF8 is a (non-ASCII) byte string, not a Unicode string; that
is the reason for the complaint.
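
To illustrate the byte string versus Unicode string distinction, here is a
minimal sketch. BOM_TEXT and strip_bom are hypothetical names introduced for
this example, not part of the codecs module. The idea is to decode
codecs.BOM_UTF8 once, so that any comparison against already-decoded text is
done between two values of the same type.

```python
import codecs

# codecs.BOM_UTF8 is the three-byte encoded form of the mark.
# Decoding it yields the single-character text form, U+FEFF.
BOM_TEXT = codecs.BOM_UTF8.decode('utf8')


def strip_bom(text):
    """Hypothetical helper: drop one leading U+FEFF if present.

    Text that does not start with the mark is returned unchanged,
    so the function is safe to call whether or not the file had a BOM.
    """
    if text.startswith(BOM_TEXT):
        return text[len(BOM_TEXT):]
    return text


if __name__ == '__main__':
    print(repr(codecs.BOM_UTF8))            # the encoded byte string
    print(repr(BOM_TEXT))                   # the decoded one-character string
    print(repr(strip_bom(u'\ufeffhello')))
    print(repr(strip_bom(u'hello')))
```

Because the helper only tests the start of the string, it also covers the
case raised in the question, where the BOM is optional.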
-- 
Piet van Oostrum <piet at cs.uu.nl>
URL: http://www.cs.uu.nl/~piet [PGP]
Private email: P.van.Oostrum at hccnet.nl


