Should HTML entity translation accept "&amp"?

Steven D'Aprano steven at REMOVE.THIS.cybersource.com.au
Sun Jan 6 22:55:44 EST 2008


On Mon, 07 Jan 2008 12:25:07 +1100, Ben Finney wrote:

> John Nagle <nagle at animats.com> writes:
> 
>> For our own purposes, I rewrote "htmldecode" to require a sequence
>> ending in ";", which means some bogus HTML escapes won't be recognized,
>> but correct HTML will be processed correctly. What's general opinion of
>> this behavior? Too strict, or OK?
> 
> I think it's fine. In the face of ambiguity (and deviation from the
> published standards), refuse the temptation to guess.

That's good advice for a library function. But...

> More specifically, I don't see any reason to contort your code to
> understand some non-entity sequence that would be flagged as invalid by
> HTML validator tools.

... it is questionable advice for a program which is designed to make 
sense of invalid HTML.

Like it or not, real-world applications sometimes have to work with bad 
data. I think we can all agree that the world would have been better off 
if the major browsers had followed your advice. But they did not, so 
websites with invalid HTML exist, and John is left in the painful 
position of having to write code that makes sense of that HTML.

I think only John can really answer his own question. What are the 
consequences of false positives versus false negatives? If the strict 
decoder raises an exception, can he shunt the input to another function 
and use some heuristics to make sense of it, or is it "game over, 
another site can't be analyzed"?
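
To make the "strict first, heuristics second" idea concrete, here is a 
rough sketch. It is not John's htmldecode: the names decode_strict and 
decode_with_fallback are made up for illustration, it assumes Python 3 
and the standard html.entities table, and the "heuristic" is simply to 
accept a known entity that lacks its trailing ";".

```python
import re
from html.entities import name2codepoint

_BODY = r"(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*)"
STRICT = re.compile("&" + _BODY + ";")    # reference must end in ";"
LENIENT = re.compile("&" + _BODY + ";?")  # ";" optional (a heuristic)


def _replace(match):
    """Return the decoded character, or the original text if unknown."""
    body = match.group(1)
    try:
        if body[0] == "#":
            if body[1] in "xX":
                return chr(int(body[2:], 16))
            return chr(int(body[1:]))
        return chr(name2codepoint[body])
    except (KeyError, ValueError, OverflowError):
        return match.group(0)


def decode_strict(text):
    """Decode well-formed references; raise on a known one missing ';'."""
    for m in LENIENT.finditer(text):
        if not m.group(0).endswith(";") and _replace(m) != m.group(0):
            raise ValueError("unterminated entity: %r" % m.group(0))
    return STRICT.sub(_replace, text)


def decode_with_fallback(text):
    """Hypothetical wrapper: try strict, then fall back to guessing.

    Returns (decoded_text, was_strict) so the caller can decide how
    much to trust the result.
    """
    try:
        return decode_strict(text), True
    except ValueError:
        return LENIENT.sub(_replace, text), False
```

With this, decode_with_fallback("x &amp y") yields ("x & y", False), 
while "AT&T" passes through the strict path untouched because "T" is not 
a known entity name. Whether a guessed result is acceptable is exactly 
the false positive versus false negative question above.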



-- 
Steven


