Some Issues on Tagging Text

subhabangalore at gmail.com
Sun May 27 07:43:10 EDT 2018


On Sunday, May 27, 2018 at 2:41:43 AM UTC+5:30, Cameron Simpson wrote:
> On 26May2018 04:02, Subhabrata Banerjee  wrote:
> >On Saturday, May 26, 2018 at 3:54:37 AM UTC+5:30, Cameron Simpson wrote:
> >> It sounds like you want a more general purpose parser, and that depends upon
> >> your purposes. If you're coding to learn the basics of breaking up text, what
> >> you're doing is fine and I'd stick with it. But if you're just after the
> >> outcome (tags), you could use other libraries to break up the text.
> >>
> >> For example, the Natural Language ToolKit (NLTK) will do structured parsing of
> >> text and return you a syntax tree, and it has many other facilities. Doco:
> >>
> >>   http://www.nltk.org/
> >>
> >> PyPI module:
> >>
> >>   https://pypi.org/project/nltk/
> >>
> >> which you can install with the command:
> >>
> >>   pip install --user nltk
> >>
> >> That would get you a tree structure of the corpus, which you could process more
> >> meaningfully. For example, you could traverse the tree and tag higher level
> >> nodes as you came across them, possibly then _not_ traversing their inner
> >> nodes. The effect of that would be that if you hit the grammatic node:
> >>
> >>   government of Mexico
> >>
> >> you might tag that node with "ORGANISATION", and choose not to descend inside
> >> it, thus avoiding tagging "government" and "of" and so forth because you have a
> >> high level tag. Nodes not specially recognised you'd keep descending into,
> >> tagging smaller things.
> >>
> >> Cheers,
> >> Cameron Simpson
> >
> >Dear Sir,
> >
> >Thank you for your kind and valuable suggestions. Thank you for your kind time too.
> >I know NLTK and machine learning. I am of the belief that if we use language properly, we need machine learning the least.
> 
> I have similar beliefs: not that machine learning is not useful, but that it 
> has a tendency to produce black boxes in terms of the results it produces 
> because its categorisation rules are not overt, rather they tend to be side 
> effects of weights in a graph.
> 
> So one might end up with a useful tool, but not understand how or why it works.
> 
> >So, I am trying to design a tagger without the help of machine learning, by simple Python coding. I have thus set aside the standard Parts of Speech (PoS) and Named Entity (NE) tagging schemes.
> >I am trying to design a basic model which, if required, may be applied to any one of these problems.
> >Detecting longer phrases is slightly a problem now; I am thinking of employing
> >re.search(pattern, text). If this part is done I do not need machine learning.
> >Maintaining so much data is a cumbersome issue in machine learning.
> 
> NLTK is not machine learning (I believe). It can parse the corpus for you, 
> emitting grammatical structures. So that would aid you in recognising words, 
> phrases, nouns, verbs and so forth. With that structure you can then make 
> better decisions about what to tag and how.
> 
> Using the re module is a very hazard-prone way of parsing text. It can be
> useful for finding fairly fixed text, particularly in machine-generated text,
> but it is terrible for prose.
> 
> Cheers,
> Cameron Simpson 

Dear Sir, 

Thank you for your kind time to discuss the matter. 
I am very clear about Statistics, but as I am a Linguist too, I feel the modern-day
craze for theories is going nowhere. Many theories, but hardly anything of
practical value, a bit like the post-Chomskyan Linguistics scenario. Theories of parsing
are equally bad. The only advantage of statistics is that if it is not giving results,
you may abandon it quickly.

I do not feel the parsing theories of Linguistics lead anywhere, especially if the data is really big.

I am looking for patterns. For example, organisation names in documents are mostly
all-capital acronyms, so there is no need for an ML solution there; a simple
line like [word for word in words if word.isupper()] does the job. In the same way
there are many interesting patterns in language if you observe them. I have made many,
and am making many more. All you need is some good time to observe the data patiently.
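
For instance, a minimal sketch of that acronym heuristic (the sample text,
variable names, and length check are my own illustration, not a fixed recipe):

    # Illustrative sample text; a real corpus would be read from a file.
    text = "The WHO and the government of Mexico signed a UNESCO accord"
    words = text.split()

    # Keep tokens that are fully upper-case and longer than one letter,
    # so stray capitals like "I" or "A" are not mistaken for acronyms.
    acronyms = [word for word in words if word.isupper() and len(word) > 1]

    print(acronyms)  # ['WHO', 'UNESCO']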

NLTK is a library mainly built for student practice, but now everyone uses it.
It has many corpora and tools (most of them built with ML-based approaches),
and it has many more ML facilities which you may use on user-defined as well as standard data.
NLTK integrates nicely with other Python-based libraries like Scikit or Gensim, and with
Java-based ones like Stanford. The code is nicely documented, and if you feel like reading
further, proper references are mostly given.
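
If it helps, here is a small sketch of the kind of tree-based tagging you describe,
using NLTK's stock tokenizer, PoS tagger, and NE chunker (the sample sentence is
mine, and the models need a one-time nltk.download):

    import nltk

    # One-time model downloads (uncomment on first run):
    # nltk.download('punkt')
    # nltk.download('averaged_perceptron_tagger')
    # nltk.download('maxent_ne_chunker')
    # nltk.download('words')

    sentence = "The government of Mexico signed the accord."
    tokens = nltk.word_tokenize(sentence)
    tagged = nltk.pos_tag(tokens)   # [('The', 'DT'), ('government', 'NN'), ...]
    tree = nltk.ne_chunk(tagged)    # a Tree whose entity subtrees are labelled

    # Tag a recognised entity node as a whole and skip its children;
    # otherwise stay at the word level.
    for node in tree:
        if isinstance(node, nltk.Tree):
            phrase = " ".join(word for word, pos in node.leaves())
            print(node.label(), phrase)   # e.g. GPE Mexico
        else:
            word, pos = node
            print(pos, word)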

I got good results with re earlier, but I will surely look into your point.
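
For the longer-phrase detection I mentioned, the kind of re use I had in mind is
roughly this (the phrase list and text are illustrative; in practice the phrases
would come from patient observation of the corpus):

    import re

    phrases = ["government of Mexico", "United Nations"]
    text = "Last year the Government of Mexico hosted the United Nations summit."

    for phrase in phrases:
        # \b anchors stop partial-word matches, and re.escape guards any
        # punctuation that may appear inside a phrase.
        match = re.search(r"\b" + re.escape(phrase) + r"\b", text, re.IGNORECASE)
        if match:
            print("ORGANISATION:", match.group(0))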

Thank you again for your kind time and a nice discussion.