Should HTML entity translation accept "&amp"?

John Nagle nagle at animats.com
Sun Jan 6 20:09:48 EST 2008


   Another in our ongoing series on "Parsing Real-World HTML".

   It's wrong, of course.  But Firefox will accept as HTML escapes

	&amp
	&gt
	&lt

as well as the correct forms

	&amp;
	&gt;
	&lt;

To be "compatible", a Python screen scraper at

http://zesty.ca/python/scrape.py

has a function "htmldecode", which is supposed to recognize
HTML escapes and generate Unicode.  (Why isn't this a standard
Python library function?  Its inverse is available.)

This uses the regular expression

charrefpat = re.compile(r'&(#(\d+|x[\da-fA-F]+)|[\w.:-]+);?',re.UNICODE)

to recognize HTML escapes.

Note the ";?", which makes the closing ";" optional.

This seems fine until we hit something valid but unusual like

	http://www.example.com?foo=1&#1234567

for which "htmldecode" tries to convert "1234567" into
a Unicode character with that decimal number, and gets a
Unicode overflow, since 1234567 is above the largest Unicode
code point (1114111, i.e. 0x10FFFF).
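
To make the failure concrete, here is a minimal sketch that applies the
pattern quoted above to that URL. It is written for Python 3, where chr()
stands in for the unichr() a 2008-era scraper would have called; the
variable names are only for illustration and are not taken from scrape.py.

```python
import re

# The pattern quoted above from scrape.py; note the optional trailing ";".
charrefpat = re.compile(r'&(#(\d+|x[\da-fA-F]+)|[\w.:-]+);?', re.UNICODE)

url = "http://www.example.com?foo=1&#1234567"
match = charrefpat.search(url)
matched_text = match.group(0)      # the whole "&#1234567"; no ";" was needed
code_point = int(match.group(2))   # the decimal digits after "&#"

try:
    chr(code_point)                # Python 3 spelling of unichr()
    overflowed = False
except ValueError:
    # 1234567 exceeds the largest Unicode code point, 0x10FFFF (1114111).
    overflowed = True
```

Because the ";" is optional, the query-string fragment is treated as a
numeric character reference even though nobody intended it as one.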

For our own purposes, I rewrote "htmldecode" to require a
sequence ending in ";", which means some bogus HTML escapes won't
be recognized, but correct HTML will be processed correctly.
What's the general opinion of this behavior?  Too strict, or OK?
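
For discussion, a rough sketch of what such a strict decoder might look
like. This is not the rewritten code itself: the names strict_charrefpat
and htmldecode_strict are hypothetical, it assumes Python 3 and the
standard html.entities table, and leaving unknown or out-of-range escapes
untouched is one possible policy among several.

```python
import re
from html.entities import name2codepoint

# Hypothetical strict pattern: same shape as charrefpat, but ";" is mandatory.
strict_charrefpat = re.compile(r'&(#(\d+|x[\da-fA-F]+)|[\w.:-]+);')

def htmldecode_strict(text):
    """Decode only well-formed, ";"-terminated HTML escapes."""
    def replace(match):
        ref = match.group(1)
        try:
            if ref.startswith('#x'):
                return chr(int(ref[2:], 16))   # hexadecimal reference
            if ref.startswith('#'):
                return chr(int(ref[1:]))       # decimal reference
            return chr(name2codepoint[ref])    # named entity such as "amp"
        except (KeyError, ValueError, OverflowError):
            # Unknown name or code point out of range: leave the text as-is.
            return match.group(0)
    return strict_charrefpat.sub(replace, text)
```

With this version "&lt;b&gt;" decodes to "<b>", while "x &amp y" and the
URL above pass through unchanged because neither contains a terminated
escape.
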

				John Nagle
				SiteTruth


