[Tutor] best way to tokenize [was script too slow]

Paul Tremblay phthenry@earthlink.net
Tue Feb 25 21:32:02 2003


Thanks. Your method is very instructive on how to use recursion. It is
not quite perfect, though, since for some input the tokens come out like this:

\par}}{\par{\ect => '\par}}' (should be '\par', '}', '}')
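
By contrast, the regular-expression approach splits that line the way I
want. Roughly, the kind of pattern I mean is something like this (a
simplified sketch, not my actual code):

import re

# control word (\par), escaped character (\{), lone brace, or run of plain text
pattern = re.compile(r'\\[a-zA-Z]+|\\.|[{}]|[^\\{}]+')

line = r'\par}}{\par{\ect'
print pattern.findall(line)
# ['\\par', '}', '}', '{', '\\par', '{', '\\ect']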

However, it ends up that your method takes just a bit longer than using
regular expressions, so there is probably no use in trying to perfect
it. I did have one question about this line:


>             expandedWord = ' '.join(['\\'+item for item in word.split('\\') if item])

I get this much from it:

1. First, Python splits the word on the "\\".

2. Then ??? It joins the pieces back together somehow; I'm not sure what the .join is doing.
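
Here is my reading of it on a small example (the names pieces and
rebuilt are just mine, spelling out what I think each step produces);
please correct me if I have it wrong:

word = '\\par\\pard'          # the characters \par\pard

pieces = word.split('\\')                           # ['', 'par', 'pard']
rebuilt = ['\\' + item for item in pieces if item]  # ['\\par', '\\pard']
print ' '.join(rebuilt)                             # \par \pard

So if I have it right, ' '.join(someList) glues the items of someList back
into one string with a space between them, and the "if item" skips the
empty string that split() produces before the leading backslash.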

Thanks

Paul



On Tue, Feb 25, 2003 at 03:43:25PM +1000, Alfred Milgrom wrote:
> 
> At 07:04 PM 24/02/03 -0500, Paul Tremblay wrote:
> >However, I don't know if there is a better way to split a line of RTF.
> >Here is a line of RTF that exhibits each of the main type of tokens:

[snip]

> Hi Paul:
> 
> I can't say whether regular expressions are the best way to tokenise your 
> RTF input, but here is an alternative recursive approach.
> 
> Each line is split into words (using spaces as the separator), and then 
> recursively split into sub-tokens if appropriate.
> 
> def splitWords(inputline):
>     outputList = []
>     for word in inputline.split(' '):
>         if word.startswith('{') and word != '{':
>             expandedWord = '{' + ' ' + word[1:]
>         elif word.endswith('}') and word != '}' and word != '\\}':
>             expandedWord = word[:-1] + ' ' + '}'
>         elif '\\' in word and word != '\\':
>             expandedWord = ' '.join(['\\'+item for item in word.split('\\') if item])
>         else:
>             expandedWord = word
>         if expandedWord != word:
>             expandedWord = splitWords(expandedWord)
>         outputList.append(expandedWord)
>     return ' '.join(outputList)
> 
> example1 = 'text \par \\ \{ \} {}'
> 
> print splitWords(example1)
> >>> text \par \ \{ \} { }
> print splitWords(example1).split(' ')
> >>> ['text', '\\par', '\\', '\\{', '\\}', '{', '}']
> 
> Seven different tokens seem to be identified correctly.
> 
> example2 = 'text \par\pard \par} \\ \{ \} {differenttext}'
> print splitWords(example2)
> >>> text \par \pard \par } \ \{ \} { differenttext }
> print splitWords(example2).split(' ')
> >>> ['text', '\\par', '\\pard', '\\par', '}', '\\', '\\{', '\\}', '{', 
> 'differenttext', '}']
> 
> Haven't tested exhaustively, but this seems to do what you wanted it to do.
> As I said, I don't know if this will end up being better than using re or 
> not, but it is an alternative approach.
> 
> Best regards,
> Fred
> 
> 
> _______________________________________________
> Tutor maillist  -  Tutor@python.org
> http://mail.python.org/mailman/listinfo/tutor

-- 

************************
*Paul Tremblay         *
*phthenry@earthlink.net*
************************