Regular Expression Help for Python Newbie.

Tony J Ibbs (Tibs) tony at lsl.co.uk
Mon Apr 10 05:32:43 EDT 2000


Raoul-Sam Daruwala wrote:
> My problem is that one of the sets of files that I'm trying to parse has
> badly formatted tables. Now when I do a view source on the files I can
> see the problem clearly. It's quite simple: the tables in these files
> start out properly formatted, but after a standard header the script
> that generates them leaves out the <TR> tag. This is incredible to me
> because both Netscape and IE can view the tables properly.

Fredrik Lundh replied:
> <TR>'s are optional -- if the browser stumbles upon <TD>
> in a <TABLE> context, it should insert <TR>'s all by itself.

Well, no...

OK. As far as I can see, <TR>'s are not optional - given an (even vaguely)
SGML based specification I don't understand how they could be. And looking
at the HTML 3.2 Reference Specification, they clearly aren't (although the
</TR> is). And that makes sense as the <TR> defines the start of the row
"element".

Of course, it's entirely possible to write a browser (or other HTML reader)
that can cope with missing <TR>'s *if* the </TR>'s are present. And
obviously that's what some browsers do. This way also lies madness - it's
going to be impossible to guess which "mistakes" the browser will
self-correct for, and which it won't (it'll have to be by some ad-hoc
mixture of emulating what other browsers appear to do, noticing what
mistakes one has actually seen, guessing which mistakes are likely to
happen, and fixing some because they're easy to fix, even if unlikely). I
always find it strange that otherwise intelligent people don't want their
compilers to work like this, but do want their browsers to.

As to the specific problem, it's clearly possible to write code that will
trigger on the </TR>'s instead of on the <TR>'s. This is left as an
exercise for the reader, especially if they want to make it harder by using
regular expressions...
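
For anyone who does want to try it the hard way, here is a minimal sketch
of the idea, not a definitive implementation. It assumes that every row
still ends with a </TR>, that tables are not nested, and that cell text is
all you want. The names rows_from_table, ROW_END, CELL and TAG are made up
for this illustration.

```python
import re

# Hypothetical names for illustration only; none of these come from the
# original post or from any library.
ROW_END = re.compile(r"</tr\s*>", re.IGNORECASE)

# A cell starts at <TD ...> or <TH ...> and runs until the next cell start,
# a closing </TD> or </TH>, or the end of the chunk (closing tags may be missing).
CELL = re.compile(
    r"<t[dh]\b[^>]*>(.*?)(?=<t[dh]\b|</t[dh]\s*>|$)",
    re.IGNORECASE | re.DOTALL,
)

# Used to strip any remaining markup from inside a cell.
TAG = re.compile(r"<[^>]+>")


def rows_from_table(html):
    """Split on </TR> (not <TR>) and return a list of rows of cell text."""
    rows = []
    for chunk in ROW_END.split(html):
        cells = [TAG.sub("", cell).strip() for cell in CELL.findall(chunk)]
        if cells:  # skip chunks with no cells, e.g. the trailing </TABLE>
            rows.append(cells)
    return rows


if __name__ == "__main__":
    # The header row has its <TR>; the later rows have lost theirs.
    sample = (
        "<TABLE><TR><TH>Name</TH><TH>Qty</TH></TR>"
        "<TD>apple</TD><TD>3</TD></TR>"
        "<TD>pear</TD><TD>5</TD></TR></TABLE>"
    )
    for row in rows_from_table(sample):
        print(row)
```

Because the rows are delimited by their end tags, a missing <TR> makes no
difference to the result. A nested table, or a row that has lost its </TR>
as well, will defeat this approach.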

Tibs
(damn - it's Monday and I've already disagreed with /F, not something I'd
normally do - let's hope the rest of the week settles down)
--
Tony J Ibbs (Tibs)      http://www.tibsnjoan.demon.co.uk/
Give a pedant an inch and they'll take 25.4mm
(once they've established you're talking a post-1959 inch, of course)
My views! Mine! Mine! (Unless Laser-Scan ask nicely to borrow them.)




