Using lxml to screen scrape a site, problem with charset

Дамјан Георгиевски gdamjan at gmail.com
Sun Feb 1 19:15:39 EST 2009


So, I'm using lxml to screen scrape a site that uses the Cyrillic 
alphabet (windows-1251 encoding). The site's HTML doesn't have the <META 
..content-type.. charset=..> tag, but the HTTP response does carry a 
header that specifies the charset... so they are standards-compliant enough.

Now when I run this code:

from lxml import html
doc = html.parse('http://a1.com.mk/')
root = doc.getroot()
title = root.cssselect('head title')[0]
print title.text

the title.text is a unicode string, but it has been decoded wrongly: 
the windows-1251 bytes look like they were treated as latin-1 when 
converted to unicode.
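
For what it's worth, here is a rough sanity check (continuing from the 
snippet above, and assuming the mis-decode really is latin-1 and not 
something else):

# Rough check only, not a fix: undo the suspected latin-1 decode and
# redo it as windows-1251. If this prints the correct Cyrillic title,
# the bytes were simply decoded with the wrong codec.
raw_bytes = title.text.encode('latin-1')
print raw_bytes.decode('windows-1251')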

So... is this a deficiency/bug in lxml, or am I doing something wrong?
Also, what are my other options here?
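
One workaround I could try (a rough sketch, untested with this exact 
lxml version): fetch the page with urllib2, read the charset out of the 
HTTP Content-Type header, and tell lxml about it explicitly via 
HTMLParser(encoding=...). The 'windows-1251' fallback below is just my 
assumption for this particular site.

import urllib2
from lxml import html

url = 'http://a1.com.mk/'
response = urllib2.urlopen(url)
# the HTTP headers carry the charset that the HTML itself omits
charset = response.info().getparam('charset') or 'windows-1251'
data = response.read()

# tell lxml up front which encoding the raw bytes are in
parser = html.HTMLParser(encoding=charset)
root = html.fromstring(data, parser=parser)
title = root.cssselect('head title')[0]
print title.text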


I'm running Python 2.6.1 and python-lxml 2.1.4 on Linux, if it matters.

-- 
дамјан ( http://softver.org.mk/damjan/ )

"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
