Should HTML entity translation accept "&amp"?

Mon Jan 7 03:14:01 EST 2008

On Jan 7, 1:09 am, John Nagle <na... at animats.com> wrote:
>    Another in our ongoing series on "Parsing Real-World HTML".
>
>    It's wrong, of course.  But Firefox will accept as HTML escapes
>
>         &amp
>         &gt
>         &lt
>
> as well as the correct forms
>
>         &amp;
>         &gt;
>         &lt;
>
> To be "compatible", a Python screen scraper at
>
> http://zesty.ca/python/scrape.py
>
> has a function "htmldecode", which is supposed to recognize
> HTML escapes and generate Unicode.  (Why isn't this a standard
> Python library function?  Its inverse is available.)
>
> This uses the regular expression
>
> charrefpat = re.compile(r'&(#(\d+|x[\da-fA-F]+)|[\w.:-]+);?',re.UNICODE)
>
> to recognize HTML escapes.
>
> Note the ";?", which makes the closing ";" optional.
>
> This seems fine until we hit something valid but unusual like
>
>        http://www.example.com?foo=1&#1234567
>
> for which "htmldecode" tries to convert "1234567" into
> a Unicode character with that decimal number, and gets a
> Unicode overflow.
>
> For our own purposes, I rewrote "htmldecode" to require a
> sequence ending in ";", which means some bogus HTML escapes won't
> be recognized, but correct HTML will be processed correctly.
> What's the general opinion of this behavior?  Too strict, or OK?
>
>                                 John Nagle
>                                 SiteTruth
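
To make the failure concrete, here is a small sketch of what the quoted pattern does with such a URL. It assumes the query string contained a numeric reference like "&#1234567" with no trailing ";" (the sample URL is my guess at the kind of input meant), and it is written for Python 3, where chr stands in for Python 2's unichr:

```python
import re

# The pattern quoted above; the trailing ";?" makes the semicolon optional.
charrefpat = re.compile(r'&(#(\d+|x[\da-fA-F]+)|[\w.:-]+);?', re.UNICODE)

# Assumed sample input: a query string that happens to contain "&#" plus digits.
url = "http://www.example.com?foo=1&#1234567"

match = charrefpat.search(url)
matched_text = match.group(0)      # the text treated as an HTML escape
codepoint = int(match.group(2))    # the decimal number inside it

# 1234567 is above the Unicode maximum of 0x10FFFF (1114111),
# so converting it to a character fails.
try:
    chr(codepoint)
    overflowed = False
except (ValueError, OverflowError):
    overflowed = True

print(matched_text, codepoint, overflowed)
```

Because the ";" is optional, the pattern matches "&#1234567" even though nothing marks it as an escape.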

Maybe HTML Tidy could help?
  http://tidy.sourceforge.net/
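
If you stay with the regex approach, the strict variant you describe might look roughly like the sketch below. This is not the code from scrape.py or your rewrite; the name htmldecode_strict and the choice to leave unknown or out-of-range references untouched are my own assumptions. It is written for Python 3 (html.entities; the Python 2 module is htmlentitydefs).

```python
import re
from html.entities import name2codepoint

# Strict pattern: the closing ";" is required, so "&amp" alone is not an escape.
strict_charrefpat = re.compile(r'&(#(\d+|[xX][\da-fA-F]+)|[\w.:-]+);')

def htmldecode_strict(text):
    """Decode only well-formed HTML escapes; leave everything else as-is."""
    def replace(match):
        ref = match.group(1)
        try:
            if ref.startswith('#'):
                num = match.group(2)
                # Hex references look like "x41"; decimal ones are plain digits.
                code = int(num[1:], 16) if num[0] in 'xX' else int(num)
                return chr(code)
            return chr(name2codepoint[ref])
        except (KeyError, ValueError, OverflowError):
            # Unknown entity name or number outside the Unicode range:
            # keep the original text instead of raising.
            return match.group(0)
    return strict_charrefpat.sub(replace, text)

print(htmldecode_strict("a &amp; b &lt; c"))
print(htmldecode_strict("http://www.example.com?foo=1&#1234567"))
```

With this, correct escapes are decoded, while "&amp" without the semicolon and the URL above pass through unchanged. Whether that is too strict depends on how much sloppy HTML you need to handle.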