Finding keywords

Matt Chaput matt at whoosh.ca
Tue Mar 8 14:00:01 EST 2011


On 08/03/2011 8:58 AM, Cross wrote:
> I know meta tags contain keywords but they are not always reliable. I
> can parse xhtml to obtain keywords from meta tags; but how do I verify
> them. To obtain reliable keywords, I have to parse the plain text
> obtained from the URL.

I think what the OP is asking about is keyword extraction, i.e. 
producing a short list of words that characterize a text. This is an 
information retrieval problem, not really a Python problem.

One simple way to do this is to calculate word frequency histograms for 
each document in your corpus, and then for a given document, select 
words that are frequent in that document but infrequent in the corpus as 
a whole. Whoosh does this. There are different ways of calculating the 
importance of words, and stemming and conflating synonyms can give you 
better results as well.
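
As a rough illustration of the frequency idea above, here is a minimal 
sketch using only the standard library. It is not how Whoosh implements 
it; the scoring is a simplified TF-IDF-style weight, and the names 
tokenize() and keywords() are made up for this example. It does no 
stemming or synonym handling.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its runs of ASCII letters as words."""
    return re.findall(r"[a-z]+", text.lower())


def keywords(doc, corpus, n=5):
    """Return up to n words that are frequent in doc but rare in corpus.

    Hypothetical helper: each word is scored as its count in doc times a
    smoothed inverse document frequency, log((1 + N) / (1 + df)), where N
    is the number of documents and df the number containing the word.
    """
    # Document frequency: in how many corpus documents each word appears.
    df = Counter()
    for text in corpus:
        df.update(set(tokenize(text)))

    # Term frequency: how often each word appears in this document.
    tf = Counter(tokenize(doc))

    total = len(corpus)
    scores = {
        word: count * math.log((1 + total) / (1 + df[word]))
        for word, count in tf.items()
    }
    # Highest score first; ties are broken alphabetically.
    return sorted(scores, key=lambda w: (-scores[w], w))[:n]


corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the python ate the python food",
]
print(keywords(corpus[2], corpus, n=3))
```

With this toy corpus, "the" occurs in every document and so scores 
zero, while "python" occurs twice in only one document and ranks first. 
On real pages you would want a much larger corpus and probably a stop 
word list.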

A more sophisticated method uses "part of speech" tagging. See the 
Python Natural Language Toolkit (NLTK) and topia.termextract for more 
information.

http://pypi.python.org/pypi/topia.termextract/

Yahoo has a web service for key word extraction:

http://developer.yahoo.com/search/content/V1/termExtraction.html

You might want to investigate these resources and try Google searches 
for, e.g., "extracting key terms from documents", and then come back if 
you have a question about the Python implementation.

Cheers,

Matt


