Pulling out <TITLE></TITLE>

Brett Cannon bac at OCF.Berkeley.EDU
Thu Nov 22 02:42:51 EST 2001


You could use negative lookahead and lookbehind assertions.  Another solution
is to just strip out all comments from the HTML before searching.  That
probably wouldn't hurt anyway, since it will probably improve performance
slightly by cutting down on the number of tags to deal with.

But it is also illegal syntax, I believe, to embed tags within a comment.
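
For what it's worth, here is a rough sketch of the strip-the-comments idea,
run against Bengt's example page.  The names extract_title, COMMENT_RE and
TITLE_RE are just ones I made up for illustration, and this assumes comments
are well-formed (<!-- ... -->) and that the whole page fits in one string.
It is a regex shortcut, not a real HTML parser.

```python
import re

# Hypothetical names for illustration; not from the original thread.
# re.S lets "." match newlines, so multi-line comments and titles work.
COMMENT_RE = re.compile(r'<!--.*?-->', re.S)
TITLE_RE = re.compile(r'<title[^>]*>(?P<title>.*?)</title>', re.I | re.S)

def extract_title(page):
    """Return the first <title> text found outside comments, or None."""
    # Drop every comment first so a commented-out <TITLE> cannot match.
    without_comments = COMMENT_RE.sub('', page)
    match = TITLE_RE.search(without_comments)
    return match.group('title').strip() if match else None

# The example page from Bengt's message.
page = """<HTML><HEAD>
<!-- (old title kept for reference, or possible restoring)
<TITLE>This is the old title</TITLE>
-->
<TITLE>Official new title</TITLE>
</HEAD><Body>...whatever...</BODY></HTML>"""

print(extract_title(page))  # prints: Official new title
```

Checking for None matters here, since some of the 30,000 pages may have no
title at all.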

-Brett C.



On Wed, 21 Nov 2001, Bengt Richter wrote:

> On Sun, 18 Nov 2001 20:45:44 -0800, Brett Cannon <bac at OCF.Berkeley.EDU> wrote:
>
> >You could just read each page and use a regex to fetch it:
> >
> >title_value=re.search(r'<title>(?P<title>.*?)</title>',page,re.I)
> >title_value.group('title')
> >
> Hm. What happens with the following page?
>
>  <HTML><HEAD>
>  <!-- (old title kept for reference, or possible restoring)
>  <TITLE>This is the old title</TITLE>
>  -->
>  <TITLE>Official new title</TITLE>
>  </HEAD><Body>...whatever...</BODY></HTML>
>
> >On Sun, 18 Nov 2001, David A McInnis wrote:
> >
> >> I am writing a script to catalog about 30,000 html pages on my site and need
> >> to pull out the value of <TITLE></TITLE>.
> >>
> >> I guess this is possible with htmllib, but I cannot figure it out.
> >>
> >> Thanks,
> >> David
> >>
> >>
> >>
> >
>
>



