lxml can't output the right unicode result

contro opinion contropinion at gmail.com
Thu Sep 6 20:21:53 EDT 2012


I edited a file named test and saved it in GBK encoding. My system is
Debian (locale: en, UTF-8) with Python 2.6 (locale encoding: UTF-8).

<html>
<p>你</p>
</html>

In the terminal I run:

xxd  test

0000000: 3c68 746d 6c3e 0a3c 703e c4e3 3c2f 703e  <html>.<p>..</p>
0000010: 0a3c 2f68 746d 6c3e 0a                   .</html>.
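
For anyone who wants to reproduce this, the file can be rebuilt from the bytes in the dump. This is only a minimal sketch: the helper name write_test_file and the temporary path are made up for illustration, and it is written so that it also runs on Python 3.

```python
import binascii
import os
import tempfile

# Hex bytes copied from the xxd dump above (spaces removed).
DUMP_HEX = (
    "3c68746d6c3e0a3c703ec4e33c2f703e"
    "0a3c2f68746d6c3e0a"
)

def write_test_file(path):
    """Write the bytes shown in the dump to `path` and return them."""
    data = binascii.unhexlify(DUMP_HEX)
    with open(path, "wb") as f:
        f.write(data)
    return data

# Hypothetical location; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), "test")
data = write_test_file(path)
print(repr(data))
```

The two bytes c4 e3 between the p tags are the only non-ASCII bytes in the file.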

你 means "you" in English.
"\xc4\xe3" is its GBK encoding.
"\xe4\xbd\xa0" is its UTF-8 encoding.
u"\u4f60" is its unicode form.
Before parsing it with lxml, a quick check in the interpreter:

>>> "你"
'\xe4\xbd\xa0'
>>> "你".decode("utf-8")
u'\u4f60'
>>> "你".decode("utf-8").encode("gbk")
'\xc4\xe3'
>>>

code1:

>>> import lxml.html
>>> root=lxml.html.parse("test")
>>> d=root.xpath("//p")
>>> d[0].text_content()
u'\xc4\xe3'

From what I have read, lxml parses a file and returns its text as unicode.
Why does d[0].text_content() not return u'\u4f60'?
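
One possible reading of the code1 result (an assumption, not something confirmed here): the file has no charset declaration, so the parser may fall back to a Latin-1 style decoding, where every byte becomes the code point with the same number. The sketch below uses only the standard library to show that such a misreading gives exactly u'\xc4\xe3', and that the character can be recovered from it. The helper name undo_latin1 is made up for illustration.

```python
raw = b"\xc4\xe3"  # GBK bytes of the character, as stored in the file

# Decoding as Latin-1 maps each byte to the code point with the same value.
misread = raw.decode("latin-1")

def undo_latin1(text, real_encoding):
    """Turn a Latin-1 misreading back into bytes, then decode properly."""
    return text.encode("latin-1").decode(real_encoding)

print(repr(misread))
print(repr(undo_latin1(misread, "gbk")))
```

If that reading is right, telling the parser the encoding up front, for example by passing lxml.html.HTMLParser(encoding='gbk') as the parser argument of lxml.html.parse, should avoid the misreading.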

code2:

>>> import codecs
>>> import lxml.html
>>> f = codecs.open('test', 'r', 'gbk')
>>> root=lxml.html.parse(f)
>>> d=root.xpath("//p")
>>> d[0].text_content()
u'\xe4\xbd\xa0'

Why does d[0].text_content() not return u'\u4f60' here either?
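
The code2 result looks like a similar misreading one step later. This is a guess: the text decoded by codecs.open may be re-encoded as UTF-8 somewhere on the way into the parser and then read byte by byte as Latin-1. The standard-library sketch below shows only that such a chain would produce u'\xe4\xbd\xa0'; it does not show what lxml does internally.

```python
text = u"\u4f60"  # what decoding the GBK bytes c4 e3 should give

# Step 1 (assumed): the unicode text is re-encoded as UTF-8.
utf8_bytes = text.encode("utf-8")

# Step 2 (assumed): those bytes are read one by one as Latin-1.
misread = utf8_bytes.decode("latin-1")

print(repr(utf8_bytes))
print(repr(misread))

# Reversing both steps gets the original character back.
recovered = misread.encode("latin-1").decode("utf-8")
print(recovered == text)
```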

I have been confused by this problem for two days.