python/regex question... hope someone can help

Mon Dec 10 00:30:33 EST 2007

On Sun, 09 Dec 2007 16:45:53 -0300, charonzen <your.master at gmail.com>
wrote:

>> [John Machin] Another suggestion is to ensure that the job
>> specification is not overly simplified. How did you parse the text
>> into "words" in the prior exercise that produced the list of bigrams?
>> Won't you need to use the same parsing method in the current exercise
>> of tagging the bigrams with an underscore?
>
> Thank you John, that definitely puts things in perspective!  I'm very
> new to both Python and text parsing, and I often feel that I can't see
> the forest for the trees.  If you're asking, I'm working on a project
> that utilizes Church's mutual information score.  I tokenize my text,
> split it into a list, derive some unigram and bigram dictionaries, and
> then calculate a pmi dictionary based on x,y from the bigrams and
> unigrams.  The bigrams that pass my threshold then get put into my
> list of x_y strings, and you know the rest.  By modifying the original
> text file, I can view 'x_y', z pairs as x,y and iterate it until I
> have some collocations that are worth playing with.  So I think that
> covers the question of using the same parsing method.  I'm sure there are more
> pythonic ways to do it, but I'm on deadline :)
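
For reference, the scoring step described above might look roughly like
the sketch below. It assumes the usual definition of pointwise mutual
information, log2(P(x,y) / (P(x) * P(y))), with probabilities estimated
as plain relative frequencies. The name pmi_scores is made up for
illustration and is not taken from the original code.

```python
import math
from collections import Counter


def pmi_scores(tokens):
    """Return {(x, y): PMI} for every adjacent token pair (hypothetical helper)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (x, y), count in bigrams.items():
        # Relative-frequency estimates; an assumption, not the poster's exact method.
        p_xy = count / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores
```

The bigrams to merge would then be those whose score exceeds whatever
threshold is in use, for example
set(pair for pair, s in pmi_scores(tokens).items() if s > threshold).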

It looks like you should work on the list of tokens directly, collapsing
consecutive elements into a single 'x_y' token, rather than modifying the
original text. That should be easier, and faster too, because you don't
regenerate the text and tokenize it again on every iteration.
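
A minimal sketch of that idea, assuming the tokens are already in a list
and the accepted bigrams are kept as a set of (x, y) tuples. The function
name collapse_bigrams and the greedy left-to-right matching are my own
choices for illustration, not something from the original code.

```python
def collapse_bigrams(tokens, bigrams, sep="_"):
    """Return a new list where each adjacent pair found in `bigrams`
    is merged into a single 'x_y' token."""
    merged = []
    i = 0
    while i < len(tokens):
        # Greedy match: a token takes part in at most one merge per pass.
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in bigrams:
            merged.append(tokens[i] + sep + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


tokens = "new york is in new york state".split()
tokens = collapse_bigrams(tokens, {("new", "york")})
print(tokens)  # ['new_york', 'is', 'in', 'new_york', 'state']
```

Feeding the result back in gives the next round, since 'new_york' is now
an ordinary token: collapse_bigrams(tokens, {("new_york", "state")})
returns ['new_york', 'is', 'in', 'new_york_state'].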

-- 
Gabriel Genellina