[Tutor] The dreaded UnicodeDecodeError... why, why, why does it still want ascii?

Stefan Behnel stefan_ml at behnel.de
Wed Jun 6 11:36:14 CEST 2012


Marc Tompkins, 06.06.2012 10:21:
> On Tue, Jun 5, 2012 at 11:22 PM, Stefan Behnel wrote:
> 
>> You can do this:
>>
>>    connection = urllib2.urlopen(url)
>>    tree = etree.parse(connection, my_html_parser)
>>
>> Alternatively, use fromstring() to parse from strings:
>>
>>    page = urllib2.urlopen(url)
>>    pagecontents = page.read()
>>    html_root = etree.fromstring(pagecontents, my_html_parser)
>>
>>
> Thank you!  fromstring() did the trick for me.
> 
> Interestingly, your first suggestion - parsing straight from the connection
> without an intermediate read() - appears to create the tree successfully,
> but my first strip_tags() fails, with the error "ValueError: Input object
> has no document: lxml.etree._ElementTree".

Weird. You may want to check the parser error log to see if it has any hint.
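
Just as a sketch, checking the log could look something like this (assuming
the same 'url' as in the snippets above and a plain HTMLParser):

    import urllib2
    from lxml import etree

    parser = etree.HTMLParser()   # recover=True is the default
    page = urllib2.urlopen(url)   # 'url' as in the earlier snippets
    root = etree.fromstring(page.read(), parser)

    # the parser keeps a log of everything it had to recover from
    for entry in parser.error_log:
        print(entry)              # message plus line/column information

If the page is as broken as you describe, the entries there may hint at why
the tree behaves differently depending on how it was parsed.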


>> See the lxml tutorial.
> 
> I did - I've been consulting it religiously - but I missed the fact that I
> was mixing strings with file-like IO, and (as you mentioned) the error
> message really wasn't helping me figure out my problem.

Yes, I think it could do better here. Even reporting a parser error with an
"unprintable error message" would at least make it less likely that users
are diverted from the actual cause of the problem.


>> Also note that there's lxml.html, which provides an
>> extended tool set for HTML processing.
> 
> I've been using lxml.etree because I'm used to the syntax, and because
> (perhaps mistakenly) I was under the impression that its parser was more
> resilient in the face of broken HTML - this page has unclosed tags all over
> the place.

Both are using the same parser and share most of their API. lxml.html is
mostly just an extension to lxml.etree with special HTML tools.
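
As a rough sketch, the same download through lxml.html (again assuming the
'url' from the earlier snippets) would look like this:

    import urllib2
    from lxml import html

    page = urllib2.urlopen(url)
    doc = html.fromstring(page.read())   # same tolerant HTML parser underneath

    # the usual lxml.etree element API is all there ...
    titles = doc.findall('.//title')

    # ... plus HTML-specific helpers on the elements:
    doc.make_links_absolute(url)         # resolve relative links
    print(doc.text_content()[:80])       # text with all tags stripped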

Stefan


