Should HTML entity translation accept "&amp"?
Paddy
paddy3118 at googlemail.com
Mon Jan 7 03:14:01 EST 2008
On Jan 7, 1:09 am, John Nagle <na... at animats.com> wrote:
> Another in our ongoing series on "Parsing Real-World HTML".
>
> It's wrong, of course. But Firefox will accept as HTML escapes
>
> &amp
> &gt
> &lt
>
> as well as the correct forms
>
> &amp;
> &gt;
> &lt;
>
> To be "compatible", a Python screen scraper at
>
> http://zesty.ca/python/scrape.py
>
> has a function "htmldecode", which is supposed to recognize
> HTML escapes and generate Unicode. (Why isn't this a standard
> Python library function? Its inverse is available.)
>
> This uses the regular expression
>
> charrefpat = re.compile(r'&(#(\d+|x[\da-fA-F]+)|[\w.:-]+);?',re.UNICODE)
>
> to recognize HTML escapes.
>
> Note the ";?", which makes the closing ";" optional.
>
> This seems fine until we hit something valid but unusual like
>
> http://www.example.com?foo=1&#1234567
>
> for which "htmldecode" tries to convert "1234567" into
> a Unicode character with that decimal number, and gets a
> Unicode overflow.
>
> For our own purposes, I rewrote "htmldecode" to require a
> sequence ending in ";", which means some bogus HTML escapes won't
> be recognized, but correct HTML will be processed correctly.
> What's the general opinion of this behavior? Too strict, or OK?
>
> John Nagle
> SiteTruth
Maybe HTML Tidy could help?
http://tidy.sourceforge.net/
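
For reference, the strict behaviour described in the quoted post could look roughly like the sketch below. It is not the code from scrape.py or from John's rewrite. It reuses the quoted pattern with the ";" made mandatory, and it assumes Python 3 (chr and html.entities). The name htmldecode_strict, the re.ASCII flag and the range check are my own choices for illustration.

```python
import re
from html.entities import name2codepoint

# Same shape as the quoted charrefpat, but the trailing ";" is required.
# re.ASCII (an assumption, not in the original) keeps \d and \w to ASCII.
STRICT_CHARREF = re.compile(r'&(#(\d+|x[\da-fA-F]+)|[\w.:-]+);', re.ASCII)

MAX_CODEPOINT = 0x10FFFF  # highest valid Unicode code point


def htmldecode_strict(text):
    """Decode escapes that end in ';' and leave everything else untouched."""
    def replace(match):
        body = match.group(1)
        if body.startswith('#'):
            digits = body[1:]
            if digits[0] == 'x':
                code = int(digits[1:], 16)
            else:
                code = int(digits)
            # Numbers past the Unicode range are kept as literal text
            # instead of raising an overflow error.
            if code > MAX_CODEPOINT:
                return match.group(0)
            return chr(code)
        code = name2codepoint.get(body)
        # Unknown entity names are also kept as literal text.
        return chr(code) if code is not None else match.group(0)
    return STRICT_CHARREF.sub(replace, text)
```

With this version the unterminated forms ("&amp", "&gt", "&lt") pass through unchanged, and so does the query-string case from the quoted post, so whether that is too strict depends on how much sloppy markup the scraper has to cope with. Newer Python versions (3.4 and later) also ship html.unescape in the standard library, which follows the HTML5 rules rather than this strict reading.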