Extract Title from HTML documents

Max M maxm at mxm.dk
Fri Nov 5 03:18:40 EST 2004


Nickolay Kolev wrote:
> Hi all,
> 
> I am looking for a way to extract the titles of HTML documents. I have 
> made an honest attempt at doing it, and it even works. Is there an 
> easier (faster / more efficient / clearer) way?

You only need one tag here, so using a regex is OK.

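import re

# Capture the text between <title ...> and </title>; re.I ignores case,
# re.S lets '.' match newlines so a multi-line title is still found.
titlePattern = re.compile(r'<title.*?>(.*?)</title>', re.I|re.S)
match = titlePattern.search(source)
if match is None:
    result = ''
else:
    result = match.group(1)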

If you need more than just the title, I would definitely go with 
BeautifulSoup.
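
To give an idea of what "more than just the title" can look like, here is a minimal sketch that collects the title and all link targets in one pass. Note that it uses the standard library's html.parser instead of BeautifulSoup (my substitution, so it runs without installing anything), and that the names TitleAndLinks and extract, as well as the sample HTML, are made up for illustration.

```python
from html.parser import HTMLParser


class TitleAndLinks(HTMLParser):
    """Hypothetical helper: collects the <title> text and all <a href> values."""

    def __init__(self):
        super().__init__()
        self.title = ''
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names before calling us.
        if tag == 'title':
            self._in_title = True
        elif tag == 'a':
            href = dict(attrs).get('href')
            if href is not None:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == 'title':
            self._in_title = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so accumulate it.
        if self._in_title:
            self.title += data


def extract(source):
    """Return (title, links) for an HTML string; title is '' if absent."""
    parser = TitleAndLinks()
    parser.feed(source)
    parser.close()
    return parser.title.strip(), parser.links


sample = ('<html><head><TITLE>My page</TITLE></head>'
          '<body><a href="http://www.mxm.dk/">x</a></body></html>')
title, links = extract(sample)
print(title)
print(links)
```

Once you want a second kind of tag, a parser like this (or BeautifulSoup) tends to stay readable where a growing pile of regexes does not.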

-- 

hilsen/regards Max M, Denmark

http://www.mxm.dk/
IT's Mad Science


