remove BOM from string read from utf-8 file
Piet van Oostrum
piet at cs.uu.nl
Fri Feb 27 09:51:35 EST 2004
>>>>> "Achim Domma" <domma at procoders.net> (AD) wrote:
AD> Hi,
AD> I read some text from a utf-8 encoded text file like this:
AD> text = codecs.open('example.txt','r','utf8').read()
AD> If I pass this text to a COM object, I can see that there is still the BOM
AD> in the file, which marks the file as utf-8. Simply removing the first
AD> character in the string is not ok, because the BOM is optional. So I tried
AD> something like this:
The BOM is in the file as the three bytes EF BB BF. The 'utf8' codec does
not strip it: after decoding it appears in 'text' as the single Unicode
character U+FEFF, but only if the file actually starts with a BOM.
Check text[0] and len(text) to verify.
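
A minimal sketch of that check, written in Python 3 syntax (an assumption; the original question predates it). The temporary file and its contents are made up for illustration, and the built-in open() with an encoding argument stands in for codecs.open():

```python
import codecs
import os
import tempfile

# Hypothetical sample file: a UTF-8 BOM followed by the text "abc".
path = os.path.join(tempfile.mkdtemp(), "example.txt")
with open(path, "wb") as f:
    f.write(codecs.BOM_UTF8 + "abc".encode("utf-8"))

# Decode with the plain utf-8 codec, as in the question.
with open(path, "r", encoding="utf-8") as f:
    text = f.read()

first_char = text[0]
length = len(text)
print(repr(first_char), length)  # '\ufeff' 4
```

The length is one more than the three visible characters because the plain utf-8 codec decodes the BOM bytes to U+FEFF instead of dropping them.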
Moreover, BOM_UTF8 is a (non-ASCII) byte string, not a Unicode string;
that is the reason for the complaint.
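
Since BOM_UTF8 is a byte string, one way to compare it against decoded text is to decode it first. The sketch below assumes Python 3; strip_bom is a hypothetical helper name, not something from the standard library. The 'utf-8-sig' codec shown at the end is a later addition to Python that drops an optional BOM during decoding.

```python
import codecs

# codecs.BOM_UTF8 is the encoded form: three bytes, not a text string.
bom_bytes = codecs.BOM_UTF8           # b'\xef\xbb\xbf'
bom_char = bom_bytes.decode("utf-8")  # '\ufeff', a one-character text string


def strip_bom(text):
    """Return text without a leading U+FEFF, if present (hypothetical helper)."""
    if text.startswith(bom_char):
        return text[len(bom_char):]
    return text


with_bom = bom_char + "abc"
without_bom = "abc"

# Both inputs give the same result, so an optional BOM is handled.
same = strip_bom(with_bom) == strip_bom(without_bom)

# Alternative: let the codec remove an optional BOM while decoding.
decoded = (bom_bytes + b"abc").decode("utf-8-sig")
```

Comparing like with like (text against text) avoids the type mismatch, and the startswith test leaves files without a BOM untouched.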
--
Piet van Oostrum <piet at cs.uu.nl>
URL: http://www.cs.uu.nl/~piet [PGP]
Private email: P.van.Oostrum at hccnet.nl