Some Issues on Tagging Text

subhabangalore at gmail.com
Sat May 26 07:02:48 EDT 2018


On Saturday, May 26, 2018 at 3:54:37 AM UTC+5:30, Cameron Simpson wrote:
> On 25May2018 04:23, Subhabrata Banerjee  wrote:
> >On Friday, May 25, 2018 at 3:59:57 AM UTC+5:30, Cameron Simpson wrote:
> >> On 24May2018 03:13, wrote:
> >> >I have a text as,
> >> >
> >> >"Hawaii volcano generates toxic gas plume called laze PAHOA: The eruption of Kilauea volcano in Hawaii sparked new safety warnings about toxic gas on the Big Island's southern coastline after lava began flowing into the ocean and setting off a chemical reaction. Lava haze is made of dense white clouds of steam, toxic gas and tiny shards of volcanic glass. Janet Babb, a geologist with the Hawaiian Volcano Observatory, says the plume "looks innocuous, but it's not." "Just like if you drop a glass on your kitchen floor, there's some large pieces and there are some very, very tiny pieces," Babb said. "These little tiny pieces are the ones that can get wafted up in that steam plume." Scientists call the glass Limu O Pele, or Pele's seaweed, named after the Hawaiian goddess of volcano and fire"
> >> >
> >> >and I want to see its tagged output as,
> >> >
> >> >"Hawaii/TAG volcano generates toxic gas plume called laze PAHOA/TAG: The eruption of Kilauea/TAG volcano/TAG in Hawaii/TAG sparked new safety warnings about toxic gas on the Big Island's southern coastline after lava began flowing into the ocean and setting off a chemical reaction. Lava haze is made of dense white clouds of steam, toxic gas and tiny shards of volcanic glass. Janet/TAG Babb/TAG, a geologist with the Hawaiian/TAG Volcano/TAG Observatory/TAG, says the plume "looks innocuous, but it's not." "Just like if you drop a glass on your kitchen floor, there's some large pieces and there are some very, very tiny pieces," Babb/TAG said. "These little tiny pieces are the ones that can get wafted up in that steam plume." Scientists call the glass Limu/TAG O/TAG Pele/TAG, or Pele's seaweed, named after the Hawaiian goddess of volcano and fire"
> >> >
> >> >To do this I generally try to take a list at the back end as,
> >> >
> >> >Hawaii
> >> >PAHOA
> [...]
> >> >and do a simple code as follows,
> >> >
> >> >def tag_text():
> >> >    corpus=open("/python27/volcanotxt.txt","r").read().split()
> >> >    wordlist=open("/python27/taglist.txt","r").read().split()
> [...]
> >> >    list1 = []
> >> >    for word in corpus:
> >> >        if word in wordlist:
> >> >            # Mark words that appear in the tag wordlist.
> >> >            list1.append(word + "/TAG")
> >> >        else:
> >> >            list1.append(word)
> >> >    tagged_text = " ".join(list1)
> >> >    print(tagged_text)
> >> >
> >> >get the results, and then hand-repair unwanted tags like Hawaiian/TAG goddess of volcano/TAG.
> >> >I am looking for a better coding approach so that I need not spend time on
> >> >hand repairs.
> >>
> >> It isn't entirely clear to me why these two taggings are unwanted. Intuitively,
> >> they seem to be either because "Hawaiian goddess" is a compound term where you
> >> don't want "Hawaiian" to get a tag, or because "Hawaiian" has already received
> >> a tag earlier in the list. Or are there other criteria?
> >>
> >> If you want to solve this problem with a programme you must first clearly
> >> define what makes an unwanted tag "unwanted". [...]
> >
> >By unwanted I did not mean anything so intricate.
> >Unwanted meant things I did not want.
> 
> That much was clear, but you need to specify in your own mind _precisely_ what 
> makes some things unwanted and others wanted. Without concrete criteria you 
> can't write code to implement those criteria.
> 
> I'm not saying "you need to imagine code to match these things": you're clearly 
> capable of doing that. I'm saying you need to have well defined concepts of 
> what makes something unwanted (or, if that is easier to define, wanted).  You 
> can do that iteratively: start with your basic concept and see how well it 
> works. When those concepts don't give you the outcome you desire, consider a 
> specific example which isn't working and try to figure out what additional 
> criterion would let you distinguish it from a working example.
> 
> >For example,
> >if my target phrases included terms like,
> >government of Mexico,
> >
> >now in my list I would have words with their tags as,
> >government
> >of
> >Mexico
> >
> >If I put these words in the list, it would tag
> >government/TAG of/TAG Mexico/TAG,
> >
> >but it would also tag every other "of", which may appear
> >anywhere, like haze is made of/TAG dense white,
> >clouds of/TAG steam, etc.
> >
> >Cleaning up these unwanted tags becomes a daunting task
> >for me.
> 
> Richard Damon has pointed out that you seem to want phrases instead of just 
> words.
> 
> >I have been experimenting around
> >wordlist=[("Kilauea volcano","Kilauea/TAG volcano/TAG"),("Hawaii","Hawaii/TAG"),...]
> >tag=reduce(lambda a, kv: a.replace(*kv), wordlist, corpus)
> >
> >is giving me reasonably good results, but the size of the wordlist is a slight concern.
> 
> You can reduce that list by generating the "wordlist" form from something 
> smaller:
> 
>   base_phrases = ["Kilauea volcano", "government of Mexico", "Hawaii"]
>   wordlist = [
>       (base_phrase, " ".join([word + "/TAG" for word in base_phrase.split()]))
>       for base_phrase in base_phrases
>   ]
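> 
> With those base_phrases, the comprehension above would produce pairs like:
> 
>   >>> wordlist
>   [('Kilauea volcano', 'Kilauea/TAG volcano/TAG'),
>    ('government of Mexico', 'government/TAG of/TAG Mexico/TAG'),
>    ('Hawaii', 'Hawaii/TAG')]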
> 
> You could even autosplit the longer phrases so that your base_phrases 
> _automatically_ becomes:
> 
>   base_phrases = ["Kilauea volcano", "Kilauea", "volcano",
>                   "government of Mexico", "government", "Mexico",
>                   "Hawaii"]
> 
> That way your "replace" call would find the longer phrases before the shorter 
> phrases and thus _not_ tag the single words if they occurred in a longer 
> phrase, while still tagging the single words when they _didn't_ land in a 
> longer phrase.
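> 
> A minimal sketch of that autosplitting (the helper name here is just 
> illustrative) might be:
> 
>   def autosplit(base_phrases):
>       # Keep the original phrases and also add their individual
>       # words, then sort longest-first so that a replace() pass
>       # matches "government of Mexico" before bare "government".
>       expanded = set(base_phrases)
>       for phrase in base_phrases:
>           expanded.update(phrase.split())
>       return sorted(expanded, key=len, reverse=True)
> 
>   base_phrases = autosplit(["Kilauea volcano", "government of Mexico", "Hawaii"])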
> 
> Also, it is unclear to me whether "/TAG" is a fixed string or intended to be 
> distinct such as "/PROPER_NOUN", "/LOCATION" etc. If they vary then you need a 
> more elaborate setup.
> 
> It sounds like you want a more general purpose parser, and that depends upon 
> your purposes. If you're coding to learn the basics of breaking up text, what 
> you're doing is fine and I'd stick with it. But if you're just after the 
> outcome (tags), you could use other libraries to break up the text.
> 
> For example, the Natural Language ToolKit (NLTK) will do structured parsing of 
> text and return you a syntax tree, and it has many other facilities. Doco:
> 
>   http://www.nltk.org/
> 
> PyPI module:
> 
>   https://pypi.org/project/nltk/
> 
> which you can install with the command:
> 
>   pip install --user nltk
> 
> That would get you a tree structure of the corpus, which you could process more 
> meaningfully. For example, you could traverse the tree and tag higher level 
> nodes as you came across them, possibly then _not_ traversing their inner 
> nodes. The effect of that would be that if you hit the grammatical node:
> 
>   government of Mexico
> 
> you might tag that node with "ORGANISATION", and choose not to descend inside 
> it, thus avoiding tagging "government" and "of" and so forth, because you have 
> a high level tag. Nodes not specially recognised you'd keep descending into, 
> tagging smaller things.
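> 
> For example, a minimal sketch with NLTK's ne_chunk (the punkt, 
> averaged_perceptron_tagger, maxent_ne_chunker and words models need 
> downloading first, and the sample text is just illustrative):
> 
>   import nltk
> 
>   text = "The government of Mexico issued a warning about Kilauea."
>   # ne_chunk returns a Tree whose entity subtrees carry labels
>   # such as PERSON, GPE or ORGANIZATION.
>   tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(text)))
>   tagged = []
>   for node in tree:
>       if isinstance(node, nltk.Tree):
>           # A recognised entity: tag the whole node with its label
>           # and don't descend into its individual words.
>           phrase = " ".join(word for word, pos in node.leaves())
>           tagged.append(phrase + "/" + node.label())
>       else:
>           word, pos = node
>           tagged.append(word)
>   print(" ".join(tagged))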
> 
> Cheers,
> Cameron Simpson 

Dear Sir, 

Thank you for your kind and valuable suggestions, and for your kind time too. 
I know NLTK and machine learning. I believe that if language is used properly, we need machine learning the least. 
So, I am trying to design a tagger without the help of machine learning, by simple Python coding. I have thus set aside the standard Parts of Speech (PoS) and Named Entity (NE) tagging schemes. 
I am trying to design a basic model that, if required, may be applied to any one of these problems. 
Detecting longer phrases is slightly a problem now; I am thinking of employing re.search(pattern, text). If this part is done, I do not need machine learning. Maintaining so much data is a cumbersome issue in machine learning.
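
For example, a minimal sketch of that regex idea (the phrase list and sample text here are hypothetical) might be:

  import re

  # Hypothetical phrase list; sorting longest-first makes the regex
  # alternation match "government of Mexico" before bare "government".
  phrases = sorted(["government of Mexico", "Kilauea volcano", "Hawaii"],
                   key=len, reverse=True)
  pattern = re.compile(r"\b(?:" + "|".join(re.escape(p) for p in phrases) + r")\b")

  def tag_match(match):
      # Tag every word inside the matched phrase.
      return " ".join(word + "/TAG" for word in match.group(0).split())

  text = "The government of Mexico monitors the Kilauea volcano in Hawaii."
  print(pattern.sub(tag_match, text))
  # -> The government/TAG of/TAG Mexico/TAG monitors the Kilauea/TAG volcano/TAG in Hawaii/TAG.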

My regards to all other esteemed coders and members of the group for their kind and valuable time and valuable suggestions. 
