[ python-Bugs-856617 ] HTMLParser parsers AT&T to AT
SourceForge.net
noreply at sourceforge.net
Thu Dec 11 13:32:22 EST 2003
Bugs item #856617, was opened at 2003-12-08 21:47
Message generated for change (Comment added) made by jimjjewett
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=856617&group_id=5470
Category: Python Library
Group: Python 2.3
Status: Open
Resolution: None
Priority: 5
Submitted By: Hammer Lee (lhy719)
Assigned to: Nobody/Anonymous (nobody)
Summary: HTMLParser parsers AT&T to AT
Initial Comment:
I use HTMLParser to parse HTML files. There is a
mistake when the HTML contents contain '&', as in <BR>AT&T
Research Labs Cambridge - WinVNC Version 3, 3, 3, 7.
HTMLParser parses "AT&T Research" as "AT
Research".
It also happens with "ETTC&P EpSCTWeb_Fr Application Version
1, 0, 0, 1".
I'm a newbie in Python, and I don't know how to solve it.
----------------------------------------------------------------------
Comment By: Jim Jewett (jimjjewett)
Date: 2003-12-11 13:32
Message:
Logged In: YES
user_id=764593
Technically, that isn't legal HTML; they're supposed to write
&amp; (follow the & with "amp;"), because & is an
escape character.
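For illustration, here is a minimal sketch of how a page author could produce the escaped form. It assumes a modern Python 3 standard library (the html module did not exist in Python 2.3), so treat it as an aside rather than something available to the original reporter.

```python
import html

# A literal ampersand in text content has to be written as "&amp;".
raw = "AT&T Research Labs"
escaped = html.escape(raw)

# html.unescape reverses the escaping and recovers the original text.
restored = html.unescape(escaped)
```

Here escaped holds the text as it should appear in the HTML source, and restored is what a reader of the page would see.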
That said, it is a pretty common error in web pages. The
parser already recovers at the next space (instead of waiting
for a ";"), and I think it would be reasonable to just return the
"&T" when T doesn't turn out to be a known entity.
You would do this by overriding handle_entityref -- but to be
honest, I suspect that you're "really" using some other library
(or local code) which already does this, so you may have to
make the modification there.
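A rough sketch of the handle_entityref override follows. It is written against Python 3's html.parser rather than the Python 2.3 HTMLParser module from this report, and it passes convert_charrefs=False so that handle_entityref is actually called. The TextCollector class and the extract_text helper are hypothetical names made up for this example, not part of any library.

```python
from html.entities import name2codepoint
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: collects text and keeps unknown '&name' runs."""

    def __init__(self):
        # Assumption: Python 3.4+; with convert_charrefs=False the parser
        # reports entity references through handle_entityref.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        if name in name2codepoint:
            # Known entity such as "amp": emit the character it stands for.
            self.parts.append(chr(name2codepoint[name]))
        else:
            # Unknown entity such as the "T" in "AT&T": return it unchanged.
            self.parts.append("&" + name)

    def handle_charref(self, name):
        # Numeric references are kept literally to keep the sketch short.
        self.parts.append("&#" + name + ";")


def extract_text(markup):
    """Return the text content of markup, ignoring tags."""
    parser = TextCollector()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)
```

Calling extract_text('<BR>AT&T Research Labs') with this sketch keeps the "&T" in the result instead of dropping it. If the text actually goes through another library built on the parser, the same override would have to be made in that library's subclass.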
----------------------------------------------------------------------