Newbie problem with codecs

Fri Aug 22 01:37:38 EDT 2003

On Thu, 21 Aug 2003 10:06:53 GMT, "Andrew Dalke" <adalke at mindspring.com> wrote:

>Still, this might help.  Suppose you wanted to read from a utf-16-le
>encoded file and write to a utf-8 encoded file.  You can do

Very close. I want to read a utf-16-le file into memory, convert it to text,
change 100 lines in the file, convert back to utf-16-le and write it back to disk.

>The other option is to do the conversion through strings
>instead of through files.
>
># s = "....some set of bytes with your utf-16 in it .."
>s = open("input.utf16", "rb").read() # the whole file
>
># convert to unicode, given the encoding
>t = unicode(s, "utf-16-le")
>
># convert to utf-8 encoding
>s2 = t.encode("utf-8")
>
>open("output.utf8", "wb").write(s2)

My code so far
-------------------------------------------
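import codecs
codecs.lookup("utf-16-le")	# check that the codec is available

# the path is split over two string literals only to keep the line short
eng_path = ("c:/program files/microsoft games/train simulator"
            "/trains/trainset/dash9/dash9.eng")
eng_file = open(eng_path, "rb").read()	# read the whole file as raw bytes

# decode the bytes into a unicode string
t = unicode(eng_file, "utf-16-le")
print t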
-----------------------------------------------------

The print fails (as expected) on the non-printing character '\ufeff', which is
of course the BOM.
Is there a nice way to strip off the BOM?
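
One possible approach, sketched under the assumption that the BOM only ever
shows up as the very first character of the decoded string: check for a leading
u"\ufeff" and slice it off. The helper name strip_bom and the sample bytes are
made up for illustration; they are not part of any library.

```python
import codecs

BOM = u"\ufeff"  # the byte order mark as it appears after decoding

def strip_bom(text):
    # hypothetical helper: drop a single leading BOM, if there is one
    if text.startswith(BOM):
        return text[len(BOM):]
    return text

# sample bytes standing in for the file contents: BOM + "abc" in utf-16-le
raw = codecs.BOM_UTF16_LE + u"abc".encode("utf-16-le")
t = raw.decode("utf-16-le")   # same result as unicode(raw, "utf-16-le")
clean = strip_bom(t)          # text without the leading BOM
```

If the BOM is stripped this way, it presumably has to be put back (for example
by prepending u"\ufeff" again) before encoding, so that the file written to
disk still starts with it.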

At the line where the conversion to utf-8 is, I would instead like to convert
to text, but I cannot find a built-in command.
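
It may be that no separate command is needed: the unicode object produced by
the decode step is already the text, and it can be split into lines, edited,
joined and encoded back. A rough sketch of that round trip follows. The temp
file, its contents and the CRLF line endings are assumptions standing in for
the real .eng file.

```python
import os
import tempfile

# stand-in for the real .eng file: a small utf-16-le file with a BOM
path = os.path.join(tempfile.mkdtemp(), "sample.eng")
original = u"\ufeffline one\r\nline two\r\nline three\r\n"
f = open(path, "wb")
f.write(original.encode("utf-16-le"))
f.close()

# read the raw bytes and decode them; the result is the text
f = open(path, "rb")
t = f.read().decode("utf-16-le")
f.close()

# edit individual lines; passing True keeps the line endings on each line
lines = t.splitlines(True)
lines[1] = lines[1].replace(u"two", u"2")

# join, encode back to utf-16-le, and write in binary mode
f = open(path, "wb")
f.write(u"".join(lines).encode("utf-16-le"))
f.close()
```

The BOM is left alone here, so it stays as the first character of the first
line and ends up at the start of the rewritten file.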

Many thanks so far