how to get the summarized text from a given URL?

Peter Otten __peter__ at web.de
Tue Mar 24 07:46:12 EDT 2009


Rama Vadakattu wrote:

> Is there any Python library to solve the problem below?
> 
> For the below URL:
> --------------------------
> http://tinyurl.com/dzcwbg
> 
> Summarized text is:
> ---------------------------
> By Roy Mark With sales plummeting and its smart phones failing to woo
> new customers, Sony Ericsson follows its warning that first quarter
> sales will be disappointing with the announcement that Najmi Jarwala,
> president of Sony Ericsson USA and head of ...
> 
> ~~~~~~~~~~~~~~
> Usually the summarized text is a 2 to 3 line description of the URL, which
> we usually obtain by fetching that HTML page, examining the content,
> and figuring out a short description from the HTML markup.
> ~~~~~~~~~~~~~
> 
> Are there any Python libraries which give summarized text for a given
> URL?

BeautifulSoup makes it easy to access parts of a web page. For example, to
print the content of the page's description meta tag:

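# Python 2 with BeautifulSoup 3
import urllib2
from BeautifulSoup import BeautifulSoup

# Fetch the page and parse it.
data = urllib2.urlopen("http://tinyurl.com/dzcwbg").read()
bs = BeautifulSoup(data)
# Print the content attribute of <meta name="description" content="...">.
print bs.find("meta", dict(name="description"))["content"]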

> It is OK even if the library just gives the initial two lines of text
> from the given URL instead of a summarization.

The problem is how you identify the summary. Different web sites will put it
in different places using different markup.
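
One way to cope with that is to try several likely places in turn. The
following is only a rough sketch, written for Python 3 with the standard
library's html.parser rather than BeautifulSoup. The names SummaryParser,
summarize, fetch_summary and the max_chars cutoff are made up for
illustration, and the order of the fallbacks (description meta tag, then
og:description, then the leading paragraph text) is an assumption, not
something every site will honour.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class SummaryParser(HTMLParser):
    """Collect candidate summaries from an HTML page (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.meta = {}        # lower-cased meta name/property -> content
        self.paragraphs = []  # whitespace-normalized text of each <p>
        self._in_p = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            key = attrs.get("name") or attrs.get("property")
            content = (attrs.get("content") or "").strip()
            if key and content:
                # Keep the first value seen for each key.
                self.meta.setdefault(key.lower(), content)
        elif tag == "p":
            # A new <p> implicitly ends any paragraph still open.
            self.flush_paragraph()
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.flush_paragraph()

    def handle_data(self, data):
        if self._in_p:
            self._buf.append(data)

    def flush_paragraph(self):
        text = " ".join("".join(self._buf).split())
        if text:
            self.paragraphs.append(text)
        self._buf = []
        self._in_p = False


def summarize(html, max_chars=200):
    """Return a short description, trying the assumed fallbacks in order."""
    parser = SummaryParser()
    parser.feed(html)
    parser.close()
    parser.flush_paragraph()  # pick up a trailing unclosed <p>
    for key in ("description", "og:description"):
        if parser.meta.get(key):
            return parser.meta[key]
    # No usable meta tag: fall back to the leading paragraph text.
    text = " ".join(parser.paragraphs)
    if len(text) > max_chars:
        text = text[:max_chars].rstrip() + " ..."
    return text


def fetch_summary(url):
    """Download a page and summarize it (needs network access)."""
    with urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        html = response.read().decode(charset, errors="replace")
    return summarize(html)
```

Calling fetch_summary("http://tinyurl.com/dzcwbg") would then return
whichever of those candidates the page happens to provide. Sites that keep
the interesting text outside <p> elements would still need their own rule.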

Peter


