need an idea, recognize sequence, fsm or genetic ?

Thu Sep 2 03:21:48 EDT 2004

hello,

here is my trouble, i would to like to write a program which could
help me to detect sequence of consecutive words in list in a very
efficient way. (i need to do it upon large amount of text, and for now
i'm looking for a good start point)

i have many "sentences" of variables sizes, for example one could be :
sentence1 = ['w43', 'w12', 'w41', 'w85', 'w4', 'w74', 'w24', 'w1',
'w6', 'w41', 'w85', 'w7', 'w23', 'w43']
where wx can be seen as a word.

then i had sequence of words which i need to recognize, for example :
sequences = [['w41', 'w85', 'w4'], ['w41', 'w85', 'w7'],] etc.

and need something like span position where sequences are present in
sentence:

{
['w41', 'w85', 'w4']: [[2, 4], ],
['w41', 'w85', 'w7']: [[9, 11], ]
}

for now i'm doing it by using many nested 'for loop', and it is too
slow, as i search for each sequence one by one, and then restart from
beginning, and so on, i'm wondering if this can not be done by using
nested dictionnary or something like that, maybe is there an
improvement to imagine ?
(i'm wondering if i should not find a way to not re-start from
beginning when i had found ['w41', 'w85', ] for the second time and
thus just only look after last token 'w4' or 'w7')

also i think this is related to some kind of finite state automata,
but i do not know how to build automatically massive parallel FSM (i
can have many many sequences and huge amount of sentences)

i was also wondering if some genetics algorithm used to align sequence
of protein could not help me, i had once read something about that ,
but still can not where to start. also as far as i remember such
algorithms take into account that 'sequence' could not be consecutive,
what i need is consecutive.

maybe one of you could help ? give a start point or even an idea that
could help me to design this piece of code ?