Using lxml to screen scrape a site, problem with charset
Stefan Behnel
stefan_ml at behnel.de
Wed Feb 4 15:02:52 EST 2009
Tim Arnold wrote:
> <gdamjan at gmail.com> wrote in message
> news:ciqh56-ses.ln1 at archaeopteryx.softver.org.mk...
>> So, I'm using lxml to screen scrape a site that uses the Cyrillic
>> alphabet (windows-1251 encoding). The site's HTML doesn't have the <META
>> ..content-type.. charset=..> header, but does have an HTTP header that
>> specifies the charset... so they are standards-compliant enough.
>>
>> Now when I run this code:
>>
>> from lxml import html
>> doc = html.parse('http://a1.com.mk/')
>> root = doc.getroot()
>> title = root.cssselect('head title')[0]
>> print title.text
>>
>> the title.text is a unicode string, but it has been wrongly decoded as
>> latin1 -> unicode
>
> The way I do that is to open the file with codecs, encoding=cp1251, read it
> into a variable, and feed that to the parser.
Yes, if you know the encoding through an external source (especially when
parsing broken HTML), it's best to pass in either a decoded string or a
decoding file-like object, as in
tree = lxml.html.parse( codecs.open(..., encoding='...') )
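A minimal sketch of the decode-first approach, using an in-memory windows-1251 byte string to stand in for the downloaded page (the markup and title text here are invented for illustration):

```python
import io
from lxml import html

# Hypothetical windows-1251 bytes standing in for the fetched page;
# note there is no <meta charset> hint anywhere in the markup.
raw = '<html><head><title>Тест</title></head><body></body></html>'.encode('windows-1251')

# Decode explicitly before handing the markup to lxml, so the parser
# never has to guess the charset from the (absent) meta header.
doc = html.parse(io.StringIO(raw.decode('windows-1251')))

# The title comes back as a correctly decoded text string.
title = doc.getroot().findtext('.//title')
print(title)
```

With a real site you would decode the HTTP response body (or a codecs-opened file) the same way before parsing.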
You can also create a parser with an encoding override:
parser = etree.HTMLParser(encoding='...', **other_options)
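The override variant might look like this; again the windows-1251 sample bytes are an assumption for illustration:

```python
from io import BytesIO
from lxml import etree

# Assumed sample: windows-1251 bytes with no charset declaration in the markup.
raw = '<html><head><title>Тест</title></head><body></body></html>'.encode('windows-1251')

# Tell the parser the encoding up front, overriding its own detection,
# which would otherwise mis-decode the Cyrillic text.
parser = etree.HTMLParser(encoding='windows-1251')
doc = etree.parse(BytesIO(raw), parser)

title = doc.getroot().findtext('.//title')
print(title)
```

This keeps the raw bytes untouched and lets lxml do the decoding itself, which is convenient when you are parsing straight from a byte stream.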
Stefan