[Tutor] The dreaded UnicodeDecodeError... why, why, why does it still want ascii?

Wed Jun 6 08:22:35 CEST 2012

Marc Tompkins, 06.06.2012 03:10:
> I'm trying to parse a webpage using lxml; every time I try, I'm
> rewarded with "UnicodeDecodeError: 'ascii' codec can't decode byte
> 0x?? in position ?????: ordinal not in range(128)" (the byte value and
> the position occasionally change; the error never does.)
> 
> The page's encoding is UTF-8:
>      <meta http-equiv="content-type" content="text/html; charset=utf-8" />
> so I have tried:
> -  setting HTMLParser's encoding to 'utf-8'

That's the way to do it, although the parser should be able to figure it
out by itself, given the above content type declaration.
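As a minimal sketch of the explicit-encoding route: the snippet below assumes the page bytes are already in memory, so a made-up byte string (sample_page) stands in for the result of page.read(). The names sample_page, parser, root and paragraph_text are only illustrative. Note that the bytes are handed to the parser undecoded.

```python
from lxml import etree

# Hypothetical stand-in for the bytes returned by page.read().
sample_page = (
    b'<html><head>'
    b'<meta http-equiv="content-type" content="text/html; charset=utf-8" />'
    b'</head><body><p>caf\xc3\xa9</p></body></html>'
)

# Hand the raw bytes to the parser and tell it which encoding to use.
parser = etree.HTMLParser(encoding='utf-8')
root = etree.fromstring(sample_page, parser)

# The text comes back as a Unicode string, with the two UTF-8 bytes
# decoded into a single character.
paragraph_text = root.findtext('.//p')
print(repr(paragraph_text))
```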

> Here's my current version, trying everything at once:
> 
> from __future__ import print_function
> import datetime
> import urllib2
> from lxml import etree
> url = 'http://www.wpc-edi.com/reference/codelists/healthcare/claim-adjustment-reason-codes/'
> page = urllib2.urlopen(url)
> pagecontents = page.read()
> pagecontents = pagecontents.decode('utf-8')
> pagecontents = pagecontents.encode('ascii', 'ignore')
> tree = etree.parse(pagecontents,
>     etree.HTMLParser(encoding='utf-8', recover=True))

parse() is meant to parse from files and file-like objects. When you pass it
a string, it treats that string as a file path (or URL), so you are telling
it to open the "file" named by the contents of pagecontents, which obviously
does not exist. I admit that the error message is not helpful.

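You can pass the open connection, which is a file-like object, directly:

    my_html_parser = etree.HTMLParser(encoding='utf-8')
    connection = urllib2.urlopen(url)
    tree = etree.parse(connection, my_html_parser)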

Alternatively, use fromstring() to parse from strings:

    page = urllib2.urlopen(url)
    pagecontents = page.read()
    html_root = etree.fromstring(pagecontents, my_html_parser)
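To see the two routes side by side without a network connection, here is a small sketch that uses io.BytesIO as the file-like object. The sample markup and the helper names parse_from_file_like and parse_from_string are made up for illustration; in the real script the bytes would come from page.read().

```python
import io
from lxml import etree

# Hypothetical page content standing in for the downloaded bytes.
page_bytes = b'<html><body><h1>Codes</h1><p>First</p><p>Second</p></body></html>'


def parse_from_file_like(data):
    """parse() reads from a file-like object and returns an ElementTree."""
    parser = etree.HTMLParser(encoding='utf-8')
    tree = etree.parse(io.BytesIO(data), parser)
    return tree.getroot()


def parse_from_string(data):
    """fromstring() parses the bytes directly and returns the root element."""
    parser = etree.HTMLParser(encoding='utf-8')
    return etree.fromstring(data, parser)


root_a = parse_from_file_like(page_bytes)
root_b = parse_from_string(page_bytes)

# Both routes should yield the same document structure.
paragraphs_a = [p.text for p in root_a.iter('p')]
paragraphs_b = [p.text for p in root_b.iter('p')]
print(paragraphs_a, paragraphs_b)
```

The only practical difference is the return type: parse() gives you an ElementTree (hence the getroot() call), while fromstring() gives you the root element directly.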

See the lxml tutorial. Also note that there's lxml.html, which provides an
extended tool set for HTML processing.
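For a first impression of lxml.html, here is a short sketch on a made-up fragment. The fragment and the variable names are assumptions for illustration only; a Unicode string is used so that no encoding question arises.

```python
import lxml.html

# Made-up HTML fragment, not taken from the page in question.
fragment = u'<div><a href="/codes/1">One</a> and <a href="/codes/2">Two</a></div>'

# For a single-element fragment, fromstring() returns that element.
root = lxml.html.fromstring(fragment)

# text_content() is one of the HTML-specific helpers: it returns the text
# of the element and all of its descendants, with the markup removed.
text = root.text_content()

# Ordinary XPath works on these elements as well.
links = root.xpath('.//a/@href')
print(text, links)
```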

Stefan