the same strings, different utf-8 repr values?

John Machin sjmachin at lexicon.net
Thu Sep 7 13:53:06 EDT 2006


slowness.chen at gmail.com wrote:
> I have two files:
>
> test.py:
> --------------------------------------------------
> # -*- encoding : utf8 -*-
> print 'in this file', repr('中文')
>
> # tt.txt is saved as utf8 encoding
> f = file('tt.txt')
> line1 = f.readline().strip()
> print 'another file', repr(line1)
> -------------------------------------------------------
>
> tt.txt:
> ----------------------------------------------------
> 中文
> test
> -------------------------------------------------------
> run test.py and I get the following output:
> in this file '\xe4\xb8\xad\xe6\x96\x87'
> another file '\xef\xbb\xbf\xe4\xb8\xad\xe6\x96\x87'
>
> and I can't encode line1 like:
>        line1.decode('utf8').encode('gbk')
> get this error:
> UnicodeEncodeError: 'gbk' codec can't encode character u'\ufeff' in
> position 0:
> illegal multibyte sequence
>
> why did I get the different repr values?

Because whatever you used to "save as" that file has retained or
inserted a BOM (byte order mark, U+FEFF) at the start of the file
before encoding as UTF-8. It's the '\xef\xbb\xbf' at the start of the
file, and also the u'\ufeff' that is giving the gbk codec indigestion.
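You can strip it in your script before decoding: drop the first three
bytes if the line starts with codecs.BOM_UTF8, or decode with the
'utf-8-sig' codec (Python 2.5 and later), which discards a leading BOM
for you.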
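
Something along these lines should do it. This is only a sketch: the
helper name strip_utf8_bom is made up for illustration, the byte string
is copied from your output instead of being read from tt.txt, and the
b'' literals assume Python 2.6 or later (on older versions write plain
'...' strings instead).

```python
import codecs

def strip_utf8_bom(raw):
    """Return the byte string `raw` without a leading UTF-8 BOM, if any."""
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):]
    return raw

# The bytes reported for the first line of tt.txt in the original post.
line1 = b'\xef\xbb\xbf\xe4\xb8\xad\xe6\x96\x87'

# Option 1: remove the BOM bytes by hand, then decode and re-encode.
clean = strip_utf8_bom(line1)
gbk_bytes = clean.decode('utf8').encode('gbk')

# Option 2: let the 'utf-8-sig' codec drop the BOM while decoding.
gbk_bytes_sig = line1.decode('utf-8-sig').encode('gbk')
```

Either way, the BOM is gone before the gbk codec ever sees it, and the
remaining bytes match the repr you got from the literal in test.py.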

HTH
John



