[Tutor] Clarified: Best way to alter sections of a string which match dictionary keys?

Karl Pflästerer sigurd at 12move.de
Sat Jan 3 18:48:16 EST 2004


On  3 Jan 2004, SSokolow <- from_python_tutor at SSokolow.com wrote:

> Your reply is confusing me but as I understand it, there are three
> problems with this:

I didn't want to confuse you.

[...]
> 2. What do you mean safer? The situation may not apply to this

The regexp isn't 100% safe against badly written (or broken) HTML.  A
match starts with a `<a', then come some attributes, and somewhere among
them is a `href='.  I'm not absolutely sure at the moment (I would have
to reread the W3C docs) how much the syntax may differ.  Furthermore you
need to cope with HTML and XHTML; the latter should be the smaller
problem as it is much stricter, but HTML may differ a lot.  That's
because a lot of people don't read the W3C docs.  But I think you need
to cope with spaces between `href=' and the following value of the
attribute.  Also the quotes can be single or double quotes (should be
double).
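
To make the point about spaces and quotes concrete, here is a minimal
sketch of a more tolerant pattern.  It is not the regexp from earlier in
the thread; the names HREF_RE and find_hrefs are made up for this
illustration, and unquoted attribute values are deliberately not handled.

```python
import re

# Hypothetical pattern, for illustration only:
#   <a\b        the tag name "a" (not "abbr", "area", ...)
#   [^>]*?      any other attributes, without leaving the tag
#   \s*=\s*     optional whitespace around the equals sign
#   (["'])      an opening single or double quote, remembered as group 1
#   (.*?)\1     the value, up to the matching closing quote
HREF_RE = re.compile(
    r"""<a\b[^>]*?\bhref\s*=\s*(["'])(.*?)\1""",
    re.IGNORECASE | re.DOTALL,
)


def find_hrefs(html):
    """Return the quoted href values of all <a> tags found in html."""
    return [m.group(2) for m in HREF_RE.finditer(html)]


sample = '<a href="a.html">x</a> <A class=k HREF = \'b.html\'>y</a>'
print(find_hrefs(sample))
```

The backreference \1 is what lets one pattern accept either kind of
quote while still requiring that the closing quote matches the opening
one.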

That's not the biggest problem; all this can be handled with a regexp.
But if you had the (pathological) case that somebody writes
   <a ....> <a       </a> ..   </a>
a regexp will fail.  But maybe that never happens, or only once in a
million.  If you can live with it, fine.
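
A small sketch of what goes wrong in the nested case.  The pattern
LINK_RE below is an assumed, typical non-greedy link pattern, not
necessarily the one used in your code: it pairs the outer opening tag
with the first closing tag it sees, so the captured "link text" still
contains the inner opening tag.

```python
import re

# Assumed pattern for illustration: an <a ...> tag, then the shortest
# possible text up to the next </a>.
LINK_RE = re.compile(r"<a\b[^>]*>(.*?)</a>", re.IGNORECASE | re.DOTALL)

nested = '<a href="outer"> <a href="inner">text</a> more </a>'

# One match only, and its body is cut off at the inner </a>.
bodies = [m.group(1) for m in LINK_RE.finditer(nested)]
print(bodies)
```

A regexp has no notion of nesting depth, so no amount of tuning the
pattern fixes this in general; a parser that tracks open tags does.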

[...]
>  I also forgot to mention that the variable string does not hold the
>  entire file. This is run for each chunk of data as it's received from
>  the server. (I don't know how to build content-layer filtering into
>  the proxy code I'm extending so I hooked it in at the content layer. 
>  testing has shown that some links lie across chunk boundaries like
>  this:

> [continued from previous chunk]is some link text</a>
> .
> .
> .
> <a href="whatever">This is th[continued in next chunk]

> and I don't know if the HTML parser might stumble on an unclosed <a>
> tag pair.

The parser can cope with that very well.  You would just have to change
the code a bit, but that should be possible.
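
Here is a minimal sketch of the idea, assuming the parser from the
standard library (html.parser.HTMLParser in current Python; in 2004 it
lives in the HTMLParser module).  The class name LinkCollector and the
helper collect_links are invented for this example.  The point is that
one parser object is kept alive for the whole response and fed each
chunk as it arrives; the parser buffers an incomplete tag until the rest
shows up.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from HTML fed in arbitrary chunks."""

    def __init__(self):
        super().__init__()
        self.links = []    # finished (href, text) pairs
        self._href = None  # href of the <a> we are inside, if any
        self._text = []    # pieces of link text seen so far

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Text may arrive in several pieces when it spans chunks.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # A stray </a> with no open link is simply ignored.
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text)))
            self._href = None


def collect_links(chunks):
    """Feed every chunk to a single parser and return the links found."""
    parser = LinkCollector()
    for chunk in chunks:
        parser.feed(chunk)
    parser.close()
    return parser.links


# The tag and the link text are both split across chunk boundaries.
chunks = ['<p><a hre', 'f="whatever">This is th', 'e link text</a></p>']
print(collect_links(chunks))
```

This only collects links; rewriting the stream on the fly would also
need the untouched parts passed through, which is the "change the code a
bit" part.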

But if speed matters I think the simple regexp solution might suffice.

[...]

I think the problem is interesting, so post here if you know more (but
please with as many facts as possible).


   Karl
-- 
Please do *not* send copies of replies to me.
I read the list



