[Tutor] The dreaded UnicodeDecodeError... why, why, why does it still want ascii?

Stefan Behnel stefan_ml at behnel.de
Wed Jun 6 11:36:14 CEST 2012


Marc Tompkins, 06.06.2012 10:21:
> On Tue, Jun 5, 2012 at 11:22 PM, Stefan Behnel wrote:
> 
>> You can do this:
>>
>>    connection = urllib2.urlopen(url)
>>    tree = etree.parse(connection, my_html_parser)
>>
>> Alternatively, use fromstring() to parse from strings:
>>
>>    page = urllib2.urlopen(url)
>>    pagecontents = page.read()
>>    html_root = etree.fromstring(pagecontents, my_html_parser)
>>
>>
> Thank you!  fromstring() did the trick for me.
> 
> Interestingly, your first suggestion - parsing straight from the connection
> without an intermediate read() - appears to create the tree successfully,
> but my first strip_tags() fails, with the error "ValueError: Input object
> has no document: lxml.etree._ElementTree".

Weird. You may want to check the parser error log to see if it has any hint.
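
Just as a sketch, checking the log could look something like this (assuming
the same 'url' as in the snippets above and a plain HTMLParser):

    import urllib2
    from lxml import etree

    parser = etree.HTMLParser()   # recover=True is the default
    page = urllib2.urlopen(url)   # 'url' as in the earlier snippets
    root = etree.fromstring(page.read(), parser)

    # the parser keeps a log of everything it had to recover from
    for entry in parser.error_log:
        print(entry)              # message plus line/column information

If the page is as broken as you describe, the entries there may hint at why
the tree behaves differently depending on how it was parsed.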


>> See the lxml tutorial.
> 
> I did - I've been consulting it religiously - but I missed the fact that I
> was mixing strings with file-like IO, and (as you mentioned) the error
> message really wasn't helping me figure out my problem.

Yes, I think it could do better here. Even reporting a parser error with an
"unprintable error message" would at least make it less likely that users
are diverted from the actual cause of the problem.


>> Also note that there's lxml.html, which provides an
>> extended tool set for HTML processing.
> 
> I've been using lxml.etree because I'm used to the syntax, and because
> (perhaps mistakenly) I was under the impression that its parser was more
> resilient in the face of broken HTML - this page has unclosed tags all over
> the place.

Both are using the same parser and share most of their API. lxml.html is
mostly just an extension to lxml.etree with special HTML tools.
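
As a rough sketch, the same download through lxml.html (again assuming the
'url' from the earlier snippets) would look like this:

    import urllib2
    from lxml import html

    page = urllib2.urlopen(url)
    doc = html.fromstring(page.read())   # same tolerant HTML parser underneath

    # the usual lxml.etree element API is all there ...
    titles = doc.findall('.//title')

    # ... plus HTML-specific helpers on the elements:
    doc.make_links_absolute(url)         # resolve relative links
    print(doc.text_content()[:80])       # text with all tags stripped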

Stefan


