the same strings, different utf-8 repr values?

Thu Sep 7 13:53:06 EDT 2006

slowness.chen at gmail.com wrote:
> I have two files:
>
> test.py:
> --------------------------------------------------
> # -*- encoding : utf8 -*-
> print 'in this file', repr('中文')
>
> # tt.txt is saved as utf8 encoding
> f = file('tt.txt')
> line1 = f.readline().strip()
> print 'another file', repr(line1)
> -------------------------------------------------------
>
> tt.txt:
> ----------------------------------------------------
> 中文
> test
> -------------------------------------------------------
> run test.py and I get the following output:
> in this file '\xe4\xb8\xad\xe6\x96\x87'
> another file '\xef\xbb\xbf\xe4\xb8\xad\xe6\x96\x87'
>
> and I can't encode line1 like:
>        line1.decode('utf8').encode('gbk')
> get this error:
> UnicodeEncodeError: 'gbk' codec can't encode character u'\ufeff' in
> position 0:
> illegal multibyte sequence
>
> why did I get the different repr values?

Because whatever you used to "save as" that file has retained or
inserted a BOM (byte order mark, U+FEFF) at the start of the file
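before encoding as UTF-8. It's the three bytes '\xef\xbb\xbf' at the
start of line1, and once decoded it becomes the u'\ufeff' that is giving
the gbk codec indigestion. You can remove it in your script: check
whether the raw string starts with codecs.BOM_UTF8 and slice it off
before decoding.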
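
A minimal sketch of that removal, using the bytes from your output in
place of actually reading tt.txt. The helper name strip_utf8_bom is made
up for illustration; codecs.BOM_UTF8 is the standard-library constant for
'\xef\xbb\xbf'. The b'' literals assume Python 2.6 or later (or Python 3);
on an older 2.x, drop the b prefix.

```python
import codecs

def strip_utf8_bom(raw):
    """Return the byte string raw without a leading UTF-8 BOM, if any.

    Hypothetical helper, not part of the standard library.
    """
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):]
    return raw

# The bytes reported for the first line of tt.txt (BOM + the two characters)
line1 = b'\xef\xbb\xbf\xe4\xb8\xad\xe6\x96\x87'

clean = strip_utf8_bom(line1)   # same bytes as the literal in test.py
text = clean.decode('utf8')     # unicode string with no u'\ufeff' in front
gbk = text.encode('gbk')        # no longer raises UnicodeEncodeError

# Alternative, assuming Python 2.5 or later: the 'utf-8-sig' codec
# skips a leading BOM while decoding.
text_sig = line1.decode('utf-8-sig')
```

Only a BOM at the very start of the data is removed, so it is enough to
apply this to the first line read from the file.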

HTH
John