[Tutor] The dreaded UnicodeDecodeError... why, why, why does it still want ascii?
Stefan Behnel
stefan_ml at behnel.de
Wed Jun 6 08:22:35 CEST 2012
Marc Tompkins, 06.06.2012 03:10:
> I'm trying to parse a webpage using lxml; every time I try, I'm
> rewarded with "UnicodeDecodeError: 'ascii' codec can't decode byte
> 0x?? in position?????: ordinal not in range(128)" (the byte value and
> the position occasionally change; the error never does.)
>
> The page's encoding is UTF-8:
> <meta http-equiv="content-type" content="text/html; charset=utf-8" />
> so I have tried:
> - setting HTMLParser's encoding to 'utf-8'
That's the way to do it, although the parser should be able to figure it
out by itself, given the above content type declaration.
> Here's my current version, trying everything at once:
>
> from __future__ import print_function
> import datetime
> import urllib2
> from lxml import etree
> url = 'http://www.wpc-edi.com/reference/codelists/healthcare/claim-adjustment-reason-codes/'
> page = urllib2.urlopen(url)
> pagecontents = page.read()
> pagecontents = pagecontents.decode('utf-8')
> pagecontents = pagecontents.encode('ascii', 'ignore')
> tree = etree.parse(pagecontents,
> etree.HTMLParser(encoding='utf-8',recover=True))
parse() is meant to parse from files and file-like objects. When you pass
it a string, it takes that string as a file path, so you are telling it to
open a "file" whose name is the entire page content, which obviously does
not exist. I admit that the error message is not helpful.
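You can pass the open connection directly, since it is a file-like object:
my_html_parser = etree.HTMLParser(encoding='utf-8')
connection = urllib2.urlopen(url)
tree = etree.parse(connection, my_html_parser)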
Alternatively, use fromstring() to parse from strings:
page = urllib2.urlopen(url)
pagecontents = page.read()
html_root = etree.fromstring(pagecontents, my_html_parser)
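To make the difference between the two calls concrete, here is a minimal sketch that runs without network access. It is written for Python 3, whereas the code in this thread uses Python 2's urllib2. SAMPLE_HTML is a made-up stand-in for whatever page.read() would return, and io.BytesIO plays the role of the open connection; neither comes from the original post.

```python
import io
from lxml import etree

# Made-up stand-in for the bytes that page.read() would return.
SAMPLE_HTML = (
    b'<html><head>'
    b'<meta http-equiv="content-type" content="text/html; charset=utf-8" />'
    b'</head><body><p>caf\xc3\xa9</p></body></html>'
)

my_html_parser = etree.HTMLParser(encoding='utf-8', recover=True)

# parse() expects a file name or a file-like object.
# BytesIO stands in for the object returned by urlopen().
tree = etree.parse(io.BytesIO(SAMPLE_HTML), my_html_parser)
root_from_file = tree.getroot()

# fromstring() takes the document content itself and returns the root element.
root_from_string = etree.fromstring(SAMPLE_HTML, my_html_parser)

# Both routes decode the UTF-8 bytes into the same text.
text_from_file = root_from_file.findtext('.//p')
text_from_string = root_from_string.findtext('.//p')
print(root_from_file.tag, text_from_file == text_from_string)
```

Note that the raw bytes go to the parser untouched. There is no need to decode them first, and re-encoding to ASCII with 'ignore' would only throw away characters.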
See the lxml tutorial. Also note that there's lxml.html, which provides an
extended tool set for HTML processing.
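As a rough illustration of lxml.html, the following sketch parses a small in-memory document and collects the text of each list item. The SAMPLE markup and its contents are invented placeholders, not the structure of the page in question, so the element names would need adjusting for real data.

```python
import lxml.html

# Invented placeholder markup; the real page will be structured differently.
SAMPLE = (
    b'<html><body><ul>'
    b'<li><b>A</b> first entry</li>'
    b'<li><b>B</b> second entry</li>'
    b'</ul></body></html>'
)

# lxml.html.fromstring() accepts the page content and returns an HTML element.
doc = lxml.html.fromstring(SAMPLE)

# text_content() joins the text of an element and all of its children.
items = [li.text_content().strip() for li in doc.findall('.//li')]
print(items)
```

The elements returned by lxml.html support the usual etree calls such as findall() and add HTML-specific helpers like text_content() on top.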
Stefan