Finding keywords

Cross X at X.tv
Wed Mar 9 13:13:26 EST 2011


On 03/09/2011 01:21 AM, Vlastimil Brom wrote:
> 2011/3/8 Cross <X at x.tv>:
>> On 03/08/2011 06:09 PM, Heather Brown wrote:
>>>
>>> The keywords are an attribute in a tag called <meta>, in the section
>>> called <head>. Are you having trouble parsing the xhtml to that point?
>>>
>>> Be more specific in your question, and somebody is likely to chime in,
>>> although I'm not the one if it's a question of parsing the xhtml.
>>>
>>> DaveA
>>
>> I know meta tags contain keywords, but they are not always reliable. I
>> can parse the xhtml to obtain keywords from meta tags, but how do I
>> verify them? To obtain reliable keywords, I have to parse the plain
>> text obtained from the URL.
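>>
>> The meta-tag part is roughly this (a minimal, illustrative sketch
>> using the stdlib HTMLParser; the class name is just a placeholder):
>>
>> import urllib2
>> from HTMLParser import HTMLParser
>>
>> class MetaKeywords(HTMLParser):
>>     # collect the content of <meta name="keywords" ...> tags
>>     def __init__(self):
>>         HTMLParser.__init__(self)
>>         self.keywords = []
>>     def handle_starttag(self, tag, attrs):
>>         attrs = dict(attrs)
>>         if tag == "meta" and attrs.get("name", "").lower() == "keywords":
>>             self.keywords.extend(
>>                 kw.strip() for kw in attrs.get("content", "").split(","))
>>
>> p = MetaKeywords()
>> p.feed(urllib2.urlopen("http://www.python.org/").read().decode("utf-8"))
>> print p.keywords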
>>
>> Cross
>>
>
> Hi,
> if you need to extract meaningful keywords in terms of data mining
> using natural language processing, it might become quite a complex
> task, depending on the requirements; the NLTK toolkit may help with
> some approaches [ http://www.nltk.org/ ].
> One possibility would be to filter out more frequent and less
> meaningful words ("stopwords") and extract the more frequent words
> from the remainder, e.g. (with some simplifications/hacks in the
> interactive mode):
>
>>>> import re, urllib2, nltk
>>>> page_src = urllib2.urlopen("http://www.python.org/doc/essays/foreword/").read().decode("utf-8")
>>>> page_plain = nltk.clean_html(page_src).lower()
>>>> stop_words = set(nltk.corpus.stopwords.words("english"))
>>>> txt_filtered = nltk.Text(word for word in re.findall(r"(?u)\w+", page_plain) if word not in stop_words)
>>>> frequency_dist = nltk.FreqDist(txt_filtered)
>>>> [(word, freq) for (word, freq) in frequency_dist.items() if freq > 2]
> [(u'python', 39), (u'abc', 11), (u'code', 10), (u'c', 7),
> (u'language', 7), (u'programming', 7), (u'unix', 7), (u'foreword', 5),
> (u'new', 5), (u'would', 5), (u'1st', 4), (u'book', 4), (u'ed', 4),
> (u'features', 4), (u'many', 4), (u'one', 4), (u'programmer', 4),
> (u'time', 4), (u'use', 4), (u'community', 3), (u'documentation', 3),
> (u'early', 3), (u'enough', 3), (u'even', 3), (u'first', 3), (u'help',
> 3), (u'indentation', 3), (u'instance', 3), (u'less', 3), (u'like', 3),
> (u'makes', 3), (u'personal', 3), (u'programmers', 3), (u'readability',
> 3), (u'readable', 3), (u'write', 3)]
>>>>
>
> Another possibility would be to extract parts of speech (e.g. nouns,
> adjectives, verbs) using e.g. nltk.pos_tag(input_txt) etc. (a rough
> sketch follows below); for more convoluted html code BeautifulSoup
> might be used, and there are likely many other options.
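>
> For instance, a rough sketch of the part-of-speech idea (reusing
> page_plain from the session above; nltk.pos_tag needs the tagger
> data, installable via nltk.download()):
>
>>>> words = re.findall(r"(?u)\w+", page_plain)
>>>> tagged = nltk.pos_tag(words)  # (word, part-of-speech tag) pairs
>>>> nouns = [word for (word, tag) in tagged if tag.startswith("NN")]
>>>> nltk.FreqDist(nouns).items()[:10]  # ten most frequent nouns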
>
> hth,
>    vbr
I had considered nltk; that is why I said a straightforward word-frequency
count would be naive. I have to look into this BeautifulSoup thing.
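
Something along these lines, perhaps (an untested sketch with
BeautifulSoup 3.x; the stripped text could then replace the
nltk.clean_html() step in the session above):

import urllib2
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3.x

page_src = urllib2.urlopen("http://www.python.org/doc/essays/foreword/").read()
soup = BeautifulSoup(page_src)
# drop <script> and <style> elements, whose contents are not visible text
for tag in soup.findAll(["script", "style"]):
    tag.extract()
# join the remaining text nodes into one lowercase string
page_plain = " ".join(soup.findAll(text=True)).lower()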
