Newbie problem with codecs

Fri Aug 22 13:10:40 EDT 2003

"derek / nul" <abuseonly at sgrail.org> wrote in message
news:mkabkvkguslj36n1qd1gsot8hbvh5qm321 at 4ax.com...
> On Thu, 21 Aug 2003 10:06:53 GMT, "Andrew Dalke" <adalke at mindspring.com> wrote:
>
> >Still, this might help.  Suppose you wanted to read from a utf-16-le
> >encoded file and write to a utf-8 encoded file.  You can do
>
> Very close. I want to read a utf16le file into memory, convert it to text,
> change 100 lines in the file, convert it back to utf16le, and write it back
> to disk.
>
> >The other option is to do the conversion through strings
> >instead of through files.
> >
> ># s = "....some set of bytes with your utf-16 in it .."
> >s = open("input.utf16", "rb").read() # the whole file
> >
> ># convert to unicode, given the encoding
> >t = unicode(s, "utf-16-le")
> >
> ># convert to utf-8 encoding
> >s2 = t.encode("utf-8")
> >
> >open("output.utf8", "wb").write(s2)
>
> My code so far
> -------------------------------------------
> import codecs
> codecs.lookup("utf-16-le")
> # read the whole file
> eng_file = open("c:/program files/microsoft games/train simulator/trains/trainset/dash9/dash9.eng", "rb").read()
>
> t = unicode(eng_file, "utf-16-le")
> print t
> -----------------------------------------------------
>
> The print fails (as expected) with a non-printing char '\ufeff', which is
> of course the BOM.
> Is there a nice way to strip off the BOM?

"derek / nul" <abuseonly at sgrail.org> wrote:
> I need a pointer to converting utf-16-le to text

If there is a BOM, then it is not UTF-16LE; it is UTF-16.
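
In practice that means decoding with the plain "utf-16" codec, which reads the BOM to pick the byte order and then drops it, so no '\ufeff' ends up in the text. Here is a rough sketch of the read / edit / write-back round trip, assuming a current Python where bytes have a .decode() method. The helper names and the sample data are made up for illustration, and falling back to little-endian when no BOM is found is my assumption, not something the codec does for you.

```python
import codecs

def decode_utf16(raw):
    """Decode UTF-16 bytes to text, consuming a leading BOM if there is one."""
    if raw.startswith(codecs.BOM_UTF16_LE) or raw.startswith(codecs.BOM_UTF16_BE):
        # The generic "utf-16" codec uses the BOM to choose the byte order
        # and does not include it in the decoded text.
        return raw.decode("utf-16")
    # No BOM present: assume little-endian (an assumption for this sketch).
    return raw.decode("utf-16-le")

def encode_utf16_le_with_bom(text):
    """Encode text as little-endian UTF-16 with an explicit BOM in front."""
    return codecs.BOM_UTF16_LE + text.encode("utf-16-le")

# Hypothetical stand-in for the bytes read with open(path, "rb").read()
raw = codecs.BOM_UTF16_LE + u"line one\r\nline two\r\n".encode("utf-16-le")

text = decode_utf16(raw)                      # BOM is gone at this point
edited = text.replace(u"line two", u"line 2") # change whatever lines you need
out = encode_utf16_le_with_bom(edited)        # bytes ready for open(path, "wb")
```

Writing the BOM back explicitly keeps the output in the same form as the input, since the "utf-16-le" codec on its own never emits one.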