[ python-Bugs-856617 ] HTMLParser parsers AT&T to AT

Thu Dec 11 13:32:22 EST 2003

Bugs item #856617, was opened at 2003-12-08 21:47
Message generated for change (Comment added) made by jimjjewett
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=856617&group_id=5470

Category: Python Library
Group: Python 2.3
Status: Open
Resolution: None
Priority: 5
Submitted By: Hammer Lee (lhy719)
Assigned to: Nobody/Anonymous (nobody)
Summary: HTMLParser parsers AT&T to AT

Initial Comment:
I use HTMLParser to parse HTML files. There is a 
mistake when the HTML contents have '&', like <BR>AT&T 
Research Labs Cambridge - WinVNC Version 3, 3, 3, 7.

HTMLParser parses "AT&T Research" to "AT
 Research".

It happens on "ETTC&P EpSCTWeb_Fr Application Version 
1, 0, 0, 1" also.

I'm a newbie in Python, I don't know how to solve it.

----------------------------------------------------------------------

Comment By: Jim Jewett (jimjjewett)
Date: 2003-12-11 13:32

Message:
Logged In: YES 
user_id=764593

Technically, that isn't legal HTML; authors are supposed to write 
&amp;  (follow the & with the word "amp;"), because & is the 
escape character that introduces an entity reference.
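
For illustration, here is a small sketch of the escaping a page author would apply. It uses the html module from current Python 3, which did not exist in Python 2.3; the helper name escape_text is hypothetical.

```python
import html


def escape_text(text):
    """Return text with &, < and > replaced by HTML entities.

    escape_text is a hypothetical helper name; html.escape is the
    standard-library call doing the work (Python 3.2 and later).
    """
    # quote=False leaves quotation marks alone; they only need
    # escaping inside attribute values.
    return html.escape(text, quote=False)


if __name__ == "__main__":
    print(escape_text("AT&T Research Labs Cambridge"))
```

Written this way, the page contains "AT&amp;T", and a parser that decodes entities gives back "AT&T".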

That said, it is a pretty common error in web pages.  The 
parser already recovers at the next space (instead of waiting 
for a ";"), and I think it would be reasonable to just return the 
"&T" when T doesn't turn out to be a known entity.

You would do this by overriding handle_entityref -- but to be 
honest, I suspect that you're "really" using some other library 
(or local code) which already does this, so you may have to 
make the modification there.
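
A minimal sketch of that override, under some assumptions: it targets the Python 3 html.parser module rather than the Python 2.3 HTMLParser module this report is about, and it passes convert_charrefs=False so that handle_entityref is actually called (with the Python 3.5+ default of True, the parser decodes references inside handle_data itself). The names TextCollector and extract_text are hypothetical.

```python
from html.entities import name2codepoint
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects document text, keeping unknown '&name' sequences.

    Hypothetical class for illustration; not part of the library.
    """

    def __init__(self):
        # convert_charrefs=False makes the parser report entity
        # references through handle_entityref.
        super().__init__(convert_charrefs=False)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

    def handle_entityref(self, name):
        if name in name2codepoint:
            # Known entity such as "amp": emit the character.
            self.chunks.append(chr(name2codepoint[name]))
        else:
            # Unknown name such as "T" in "AT&T": put the text back.
            self.chunks.append("&" + name)


def extract_text(markup):
    """Return the text content of markup with tags dropped."""
    parser = TextCollector()
    parser.feed(markup)
    parser.close()
    return "".join(parser.chunks)


if __name__ == "__main__":
    print(extract_text("<BR>AT&T Research Labs Cambridge"))
```

The unknown-name branch is the part that matters here: instead of dropping "&T", it re-emits it, so the surrounding text stays intact. If another library sits between your code and the parser, the equivalent change would have to go into that library's handler.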

----------------------------------------------------------------------
