Using lxml to screen-scrape a site, problem with charset

Stefan Behnel stefan_ml at behnel.de
Wed Feb 4 15:02:52 EST 2009


Tim Arnold wrote:
> "?????? ???????????" <gdamjan at gmail.com> wrote in message 
> news:ciqh56-ses.ln1 at archaeopteryx.softver.org.mk...
>> So, I'm using lxml to screen-scrape a site that uses the Cyrillic
>> alphabet (windows-1251 encoding). The site's HTML doesn't have the <META
>> ..content-type.. charset=..> header, but it does have an HTTP header that
>> specifies the charset... so they are standards-compliant enough.
>>
>> Now when I run this code:
>>
>> from lxml import html
>> doc = html.parse('http://a1.com.mk/')
>> root = doc.getroot()
>> title = root.cssselect('head title')[0]
>> print title.text
>>
>> the title.text is a unicode string, but it has been wrongly decoded as
>> latin-1 -> unicode
> 
> The way I do that is to open the file with codecs, encoding=cp1251, read it 
> into a variable, and feed that to the parser.

Yes, if you know the encoding through an external source (especially when
parsing broken HTML), it's best to pass in either a decoded string or a
decoding file-like object, as in

	tree = lxml.html.parse( codecs.open(..., encoding='...') )
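
For a URL like the one in the original post, the charset lives in the HTTP
Content-Type header rather than in the document itself, so you have to fetch
the page yourself to see it. A rough sketch of the full round trip (Python 2
with urllib2; the windows-1251 fallback is just the encoding the original
poster reported, not something lxml would pick on its own):

	import urllib2
	from lxml import html

	response = urllib2.urlopen('http://a1.com.mk/')
	# getparam() pulls the charset parameter out of a header like
	# "text/html; charset=windows-1251"; fall back to the known encoding
	charset = response.info().getparam('charset') or 'windows-1251'

	# decode to unicode before lxml ever sees the bytes; this is safe here
	# because the document carries no encoding declaration of its own
	text = response.read().decode(charset)
	root = html.fromstring(text)
	print root.cssselect('head title')[0].text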

You can also create a parser with an encoding override:

	parser = etree.HTMLParser(encoding='...', **other_options)
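
Applied to the snippet from the original post, that might look like the
following sketch. Using lxml.html's own HTMLParser class (rather than the
plain etree one) keeps the parsed tree as HTML elements, so cssselect()
still works:

	from lxml import html

	# windows-1251 is the encoding the original poster reported; the
	# override replaces the latin-1 decoding he was seeing
	parser = html.HTMLParser(encoding='windows-1251')
	doc = html.parse('http://a1.com.mk/', parser=parser)
	print doc.getroot().cssselect('head title')[0].text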

Stefan


