Some Issues on Tagging Text

MRAB python at mrabarnett.plus.com
Fri May 25 20:01:43 EDT 2018


On 2018-05-25 23:24, Cameron Simpson wrote:
[snip]
> You can reduce that list by generating the "wordlist" form from something
> smaller:
> 
>    base_phrases = ["Kilauea volcano", "government of Mexico", "Hawaii"]
>    wordlist = [
>        (base_phrase, " ".join([word + "/TAG" for word in base_phrase.split()]))
>        for base_phrase in base_phrases
>    ]
> 
> You could even autosplit the longer phrases so that your base_phrases
> _automatically_ becomes:
> 
>    base_phrases = ["Kilauea volcano", "Kilauea", "volcano",
>                    "government of Mexico", "government", "Mexico", "Hawaii"]
> 
That list should also include "of", since splitting "government of Mexico"
on whitespace produces it too.

As the OP doesn't want all instances of "of" to be tagged, there could 
be a separate exceptions list that contains those sub-phrases that 
should not be tagged; they would be dropped from the base_phrases list 
that was created.
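
A minimal sketch of that idea, assuming the autosplit only breaks phrases
into single words (as in Cameron's example) and that the exceptions are
matched exactly. The names expand_phrases and exceptions are made up for
illustration:

```python
base_phrases = ["Kilauea volcano", "government of Mexico", "Hawaii"]

# Hypothetical exceptions list: sub-phrases that should not be tagged
# when they occur on their own.
exceptions = {"of"}


def expand_phrases(phrases, exceptions=()):
    """Return each phrase followed by its individual words, skipping
    duplicates and anything listed in exceptions."""
    expanded = []
    seen = set()
    for phrase in phrases:
        candidates = [phrase]
        words = phrase.split()
        if len(words) > 1:
            # Only multi-word phrases contribute sub-phrases.
            candidates.extend(words)
        for candidate in candidates:
            if candidate in exceptions or candidate in seen:
                continue
            seen.add(candidate)
            expanded.append(candidate)
    return expanded


expanded = expand_phrases(base_phrases, exceptions)

# Same "wordlist" construction as in the quoted code above.
wordlist = [
    (phrase, " ".join(word + "/TAG" for word in phrase.split()))
    for phrase in expanded
]
```

Note that "of" is still tagged when it appears inside the full phrase
"government of Mexico"; only the standalone entry is dropped.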

[snip]


