Some Issues on Tagging Text

Cameron Simpson cs at cskk.id.au
Sat May 26 17:11:12 EDT 2018


On 26May2018 04:02, Subhabrata Banerjee <subhabangalore at gmail.com> wrote:
>On Saturday, May 26, 2018 at 3:54:37 AM UTC+5:30, Cameron Simpson wrote:
>> It sounds like you want a more general purpose parser, and that depends upon
>> your purposes. If you're coding to learn the basics of breaking up text, what
>> you're doing is fine and I'd stick with it. But if you're just after the
>> outcome (tags), you could use other libraries to break up the text.
>>
>> For example, the Natural Language ToolKit (NLTK) will do structured parsing of
>> text and return you a syntax tree, and it has many other facilities. Doco:
>>
>>   http://www.nltk.org/
>>
>> PyPI module:
>>
>>   https://pypi.org/project/nltk/
>>
>> which you can install with the command:
>>
>>   pip install --user nltk
>>
>> That would get you a tree structure of the corpus, which you could process more
>> meaningfully. For example, you could traverse the tree and tag higher level
>> nodes as you came across them, possibly then _not_ traversing their inner
>> nodes. The effect of that would be that if you hit the grammatic node:
>>
>>   government of Mexico
>>
>> you might tag that node with "ORGANISATION", and choose not to descend inside
>> it, thus avoiding tagging "government" and "of" and so forth because you have a
>> high level tag. Nodes not specially recognised you'd keep descending into,
>> tagging smaller things.
>>
>> Cheers,
>> Cameron Simpson
>
>Dear Sir,
>
>Thank you for your kind and valuable suggestions. Thank you for your kind time too.
>I know NLTK and machine learning. I believe that if we use language properly, we need machine learning the least.

I have similar beliefs: not that machine learning is not useful, but that it 
tends to produce black boxes, because its categorisation rules are not overt; 
rather, they tend to be side effects of weights in a graph.

So one might end up with a useful tool, but not understand how or why it works.

>So, I am trying to design a tagger without the help of machine learning, by simple Python coding. I have thus set aside the standard Parts of Speech (PoS) and Named Entity (NE) tagging schemes.
>I am trying to design a basic model which, if required, may be applied to any one of these problems.
>Detecting longer phrases is slightly a problem now; I am thinking of employing 
>re.search(pattern, text). If this part is done I do not need machine learning. 
>Maintaining so much data is a cumbersome issue in machine learning.

NLTK is not machine learning (I believe). It can parse the corpus for you, 
emitting grammatical structures. So that would aid you in recognising words, 
phrases, nouns, verbs and so forth. With that structure you can then make 
better decisions about what to tag and how.
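As a minimal sketch of the "tag the high-level node and don't descend" idea from earlier in the thread (using a hypothetical nested-tuple tree rather than NLTK's own Tree class, so the shape of the data is an assumption):

```python
# A toy parse tree: each node is (label, children); children are either
# sub-nodes (tuples) or plain word strings. This stands in for the
# structure a real parser such as NLTK would emit.
ORG_PHRASES = {("government", "of", "mexico")}

def leaves(node):
    """Collect the words under a node, in order."""
    label, children = node
    words = []
    for child in children:
        if isinstance(child, tuple):
            words.extend(leaves(child))
        else:
            words.append(child)
    return words

def tag_tree(node, tags):
    """Tag recognised high-level phrases, skipping their inner nodes."""
    label, children = node
    if tuple(w.lower() for w in leaves(node)) in ORG_PHRASES:
        tags.append((" ".join(leaves(node)), "ORGANISATION"))
        return  # don't descend: "government" and "of" get no separate tags
    for child in children:
        if isinstance(child, tuple):
            tag_tree(child, tags)
        else:
            tags.append((child, "WORD"))

tree = ("NP", [("NP", ["government"]),
               ("PP", ["of", ("NP", ["Mexico"])])])
tags = []
tag_tree(tree, tags)
print(tags)  # [('government of Mexico', 'ORGANISATION')]
```

The early `return` is the whole trick: once a node matches, its inner words are covered by the one high-level tag.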

Using the re module is a very hazard-prone way of parsing text. It can be 
useful for finding fairly fixed text, particularly in machine generated text, 
but it is terrible for prose.
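To illustrate the hazard (the pattern and sentences here are made up for the example): a regexp finds a fixed phrase easily, but the free variation of ordinary prose defeats the same pattern.

```python
import re

# Works when the text is rigidly phrased, as machine output often is:
pattern = re.compile(r"\bgovernment of Mexico\b")
fixed = "The government of Mexico announced a new policy."
print(bool(pattern.search(fixed)))   # True

# Prose says the same thing a different way, and the pattern misses it:
prose = "Mexico's government announced a new policy."
print(bool(pattern.search(prose)))   # False
```

A structured parse is robust against this kind of rephrasing in a way that a fixed pattern can never be.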

Cheers,
Cameron Simpson <cs at cskk.id.au>
