Should HTML entity translation accept "&amp"?

Sun Jan 6 20:09:48 EST 2008

   Another in our ongoing series on "Parsing Real-World HTML".

   It's wrong, of course.  But Firefox will accept as HTML escapes

	&amp
	&gt
	&lt

as well as the correct forms

	&amp;
	&gt;
	&lt;

To be "compatible", a Python screen scraper at

http://zesty.ca/python/scrape.py

has a function "htmldecode", which is supposed to recognize
HTML escapes and generate Unicode.  (Why isn't this a standard
Python library function?  Its inverse is available.)

This uses the regular expression

charrefpat = re.compile(r'&(#(\d+|x[\da-fA-F]+)|[\w.:-]+);?',re.UNICODE)

to recognize HTML escapes.

Note the ";?", which makes the closing ";" optional.
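To see what the optional ";" does in practice, here is a small sketch that runs the quoted pattern over a few strings. It assumes Python 3, and "first_match" is a hypothetical helper name for illustration only; it is not part of scrape.py.

```python
import re

# The pattern quoted above, unchanged.
charrefpat = re.compile(r'&(#(\d+|x[\da-fA-F]+)|[\w.:-]+);?', re.UNICODE)

def first_match(text):
    """Return the first substring the pattern treats as an escape, or None."""
    m = charrefpat.search(text)
    return m.group(0) if m else None

# With or without the closing ";", the pattern reports a match.
with_semicolon = first_match('&amp;')
without_semicolon = first_match('&amp')
# A bare "&" followed by a space is not matched at all.
bare_ampersand = first_match('a & b')
# The query-string case: everything after "&#" is taken as a decimal code.
in_url = first_match('http://www.example.com?foo=1&#1234567')
```

The last case is the one that causes trouble: nothing in the pattern limits how many digits follow "&#".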

This seems fine until we hit something valid but unusual like

	http://www.example.com?foo=1&#1234567

for which "htmldecode" tries to convert "1234567" into
a Unicode character with that decimal number, and gets a
Unicode overflow, because 1234567 is above the highest
Unicode code point, 0x10FFFF (1114111).
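The failure is easy to reproduce in isolation. This is a sketch under the assumption of Python 3, where "chr" plays the role "unichr" had in Python 2; "to_char" is a hypothetical helper, not the scraper's code.

```python
import sys

def to_char(codepoint):
    """Return the character for codepoint, or None if it is out of range."""
    try:
        return chr(codepoint)
    except (ValueError, OverflowError):
        # chr() rejects anything above sys.maxunicode (0x10FFFF).
        return None

in_range = to_char(38)            # the code for "&"
out_of_range = to_char(1234567)   # the number taken from the URL above
```

Catching the exception avoids the crash, but it does not settle whether "&#1234567" should have been treated as an escape in the first place.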

For our own purposes, I rewrote "htmldecode" to require a
sequence ending in ";", which means some bogus HTML escapes won't
be recognized, but correct HTML will be processed correctly.
What's the general opinion of this behavior?  Too strict, or OK?
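For concreteness, the strict behavior could look roughly like the following. This is a minimal sketch of the idea, not the actual rewrite: it assumes Python 3 and the standard "html.entities.name2codepoint" table, and the names "strict_charrefpat" and "htmldecode_strict" are made up for illustration.

```python
import re
from html.entities import name2codepoint

# Same alternatives as the original pattern, but the trailing ";" is required.
strict_charrefpat = re.compile(r'&(#(\d+|x[\da-fA-F]+)|[\w.:-]+);')

def htmldecode_strict(text):
    """Decode only escapes that end in ";"; leave everything else as-is."""
    def replace(match):
        entity = match.group(1)
        try:
            if entity.startswith('#x'):
                return chr(int(entity[2:], 16))   # hexadecimal reference
            if entity.startswith('#'):
                return chr(int(entity[1:]))       # decimal reference
            return chr(name2codepoint[entity])    # named entity
        except (KeyError, ValueError, OverflowError):
            # Unknown name or out-of-range number: keep the original text.
            return match.group(0)
    return strict_charrefpat.sub(replace, text)

decoded = htmldecode_strict('a &lt; b &gt c')
url = htmldecode_strict('http://www.example.com?foo=1&#1234567')
```

With this version "&lt;" is decoded, "&gt" without the ";" is left alone, and the URL passes through untouched.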

				John Nagle
				SiteTruth