Some Issues on Tagging Text

Fri May 25 08:05:36 EDT 2018

On 5/25/18 7:23 AM, subhabangalore at gmail.com wrote:
> On Friday, May 25, 2018 at 3:59:57 AM UTC+5:30, Cameron Simpson wrote:
>> First up, thank you for a well described problem! Remarks inline below.
>>
>> On 24May2018 03:13, wrote:
>>> I have a text as,
>>>
>>> "Hawaii volcano generates toxic gas plume called laze PAHOA: The eruption of Kilauea volcano in Hawaii sparked new safety warnings about toxic gas on the Big Island's southern coastline after lava began flowing into the ocean and setting off a chemical reaction. Lava haze is made of dense white clouds of steam, toxic gas and tiny shards of volcanic glass. Janet Babb, a geologist with the Hawaiian Volcano Observatory, says the plume "looks innocuous, but it's not." "Just like if you drop a glass on your kitchen floor, there's some large pieces and there are some very, very tiny pieces," Babb said. "These little tiny pieces are the ones that can get wafted up in that steam plume." Scientists call the glass Limu O Pele, or Pele's seaweed, named after the Hawaiian goddess of volcano and fire"
>>>
>>> and I want to see its tagged output as,
>>>
>>> "Hawaii/TAG volcano generates toxic gas plume called laze PAHOA/TAG: The eruption of Kilauea/TAG volcano/TAG in Hawaii/TAG sparked new safety warnings about toxic gas on the Big Island's southern coastline after lava began flowing into the ocean and setting off a chemical reaction. Lava haze is made of dense white clouds of steam, toxic gas and tiny shards of volcanic glass. Janet/TAG Babb/TAG, a geologist with the Hawaiian/TAG Volcano/TAG Observatory/TAG, says the plume "looks innocuous, but it's not." "Just like if you drop a glass on your kitchen floor, there's some large pieces and there are some very, very tiny pieces," Babb/TAG said. "These little tiny pieces are the ones that can get wafted up in that steam plume." Scientists call the glass Limu/TAG O/TAG Pele/TAG, or Pele's seaweed, named after the Hawaiian goddess of volcano and fire"
>>>
>>> To do this I generally try to take a list at the back end as,
>>>
>>> Hawaii
>>> PAHOA
>>> Kilauea
>>> volcano
>>> Janet
>>> Babb
>>> Hawaiian
>>> Volcano
>>> Observatory
>>> Babb
>>> Limu
>>> O
>>> Pele
>>>
>>> and do a simple code as follows,
>>>
>>> def tag_text():
>>>    corpus=open("/python27/volcanotxt.txt","r").read().split()
>>>    wordlist=open("/python27/taglist.txt","r").read().split()
>> You might want to use this to compose "wordlist":
>>
>>      wordlist=set(open("/python27/taglist.txt","r").read().split())
>>
>> because it will make your "if word in wordlist" test O(1) instead of O(n), 
>> which will matter later if your wordlist grows.
>>
>>>    list1=[]
>>>    for word in corpus:
>>>        if word in wordlist:
>>>            word_new=word+"/TAG"
>>>            list1.append(word_new)
>>>        else:
>>>            list1.append(word)
>>>    lst1=list1
>>>    tagged_text=" ".join(lst1)
>>>    print tagged_text
>>>
>>> I get the results and hand-repair unwanted tags such as "Hawaiian/TAG goddess of volcano/TAG".
>>> I am looking for a better approach of coding so that I need not spend time on 
>>> hand repairing.
>> It isn't entirely clear to me why these two taggings are unwanted. Intuitively, 
>> they seem to be either because "Hawaiian goddess" is a compound term where you 
>> don't want "Hawaiian" to get a tag, or because "Hawaiian" has already received 
>> a tag earlier in the list. Or are there other criteria?
>>
>> If you want to solve this problem with a programme you must first clearly 
>> define what makes an unwanted tag "unwanted".
>>
>> For example, "Hawaiian" is an adjective, and therefore will always be part of a 
>> compound term.
>>
>> Can you clarify what makes these taggings you mention "unwanted"?
>>
>> Cheers,
>>
> Sir, Thank you for your kind time to write such a nice reply. 
>
> By unwanted I did not mean anything so intricate. 
> Unwanted meant things I did not want. 
> For example, 
> if my target phrases included terms like, 
> government of Mexico, 
>
> now in my list I would have words with their tags as,
> government
> of
> Mexico
>
> If I put these words in the list, it would tag 
> government/TAG of/TAG Mexico
>
> but it would also tag every other "of" anywhere in the text,
> as in haze is made of/TAG dense white,
> clouds of/TAG steam, etc. 
>
> Cleaning up these unwanted tags becomes a daunting task
> for me. 
>
> I have been experimenting with 
> wordlist=[("Kilauea volcano","Kilauea/TAG volcano/TAG"),("Hawaii","Hawaii/TAG"),...]
> tag=reduce(lambda a, kv: a.replace(*kv), wordlist, corpus)
>
> which is giving me reasonably good results, but the size of the wordlist is a slight concern. 
>
The issue then sounds like you implemented tagging based on words, but
what you REALLY want is tagging based on phrases. It looks like you did
this in part because you had a tool that gave you words and not phrases.

The key here is either to reframe the solution in the terms the problem
states, or to transform the problem statement into something based on
the terms of the tools you are using.

Basically you had a plank of wood to attach to something and a screw,
and saw a hammer, so you hammered the screw in and wondered why it
didn't work that well.
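
As a rough sketch of what tagging by phrase rather than by word could
look like (illustration only: the function name tag_phrases and the
sample sentence are my own assumptions, and it is written for Python 3
rather than the Python 2 used above), the phrase list can be compiled
into one regular expression so that only whole phrases are tagged:

```python
import re


def tag_phrases(text, phrases, tag="/TAG"):
    """Append `tag` to every word of each listed phrase found in `text`.

    Hypothetical helper: phrases are matched as whole units, so a word
    such as "of" is tagged only when it occurs inside a listed phrase.
    """
    # Try longer phrases first so "Kilauea volcano" is preferred over
    # a shorter entry such as "Kilauea" that starts at the same place.
    ordered = sorted(set(phrases), key=len, reverse=True)
    if not ordered:
        return text

    # (?<!\w) and (?!\w) stop "Hawaii" from matching inside "Hawaiian".
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(p) for p in ordered) + r")(?!\w)"
    )

    def add_tags(match):
        # Tag each word of the matched phrase separately.
        return " ".join(word + tag for word in match.group(0).split())

    return pattern.sub(add_tags, text)


if __name__ == "__main__":
    sample = "The government of Mexico said the haze is made of steam."
    print(tag_phrases(sample, ["government of Mexico"]))
```

Because each phrase is matched as a whole, a bare "of" elsewhere in the
text is left alone. The sketch assumes the words of a phrase are
separated by single spaces; text wrapped across lines would need the
pattern loosened.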

-- 
Richard Damon