[Tutor] best way to tokenize [was script too slow]

Alfred Milgrom fredm@smartypantsco.com
Mon Feb 24 23:45:01 2003


At 07:04 PM 24/02/03 -0500, Paul Tremblay wrote:
>However, I don't know if there is a better way to split a line of RTF.
>Here is a line of RTF that exhibits each of the main type of tokens:
>text \par \\ \{ \} {}
>
>Broken into tokens:
>['text', '\par', '\\', '\{', '\}', '{', '}']
>
>There are 7 types of tokens:
>
>1. text
>2. control word, i.e. a backslash followed by any number of characters. A
>space, backslash, or open or close bracket ends this group.
>3. escaped backslash
>4. escaped open bracket
>5. escaped closed bracket
>6. open bracket
>7. closed bracket
>
>Is there any way to split this line *without* using regular expressions?
>
>Once I know how to split and save the tokens, I
>imagine I can split the line into lists, then split the lists into
>lists, and so on--even though I'm vague on how to do this.
>
>But I'm not sure if this would be faster. Also, I don't know how to get
>around using a regular expression for the control words. A control word
>can be any length, and can take multiple forms:
>
>'\pard ' => '\pard '
>'\par\pard' => '\par', '\pard'
>'\par\pard ' => '\par', '\pard '
>'\par}' => '\par', '}'

Hi Paul:

I can't say whether regular expressions are the best way to tokenise your 
RTF input, but here is an alternative recursive approach.

Each line is split into words (using spaces as the separator), and each 
word is then recursively split into sub-tokens where appropriate.

def splitWords(inputline):
    outputList = []
    for word in inputline.split(' '):
        if word.startswith('{') and word != '{':
            # open bracket stuck to the front of a word: peel it off
            expandedWord = '{' + ' ' + word[1:]
        elif word.endswith('}') and word != '}' and word != '\\}':
            # close bracket stuck to the end of a word (but not an
            # escaped close bracket): peel it off
            expandedWord = word[:-1] + ' ' + '}'
        elif '\\' in word and word != '\\':
            # run of control words such as \par\pard: split on the
            # backslashes and put one back in front of each piece
            expandedWord = ' '.join(['\\' + item for item in word.split('\\') if item])
        else:
            expandedWord = word
        if expandedWord != word:
            # the word changed, so the new pieces may split further
            expandedWord = splitWords(expandedWord)
        outputList.append(expandedWord)
    return ' '.join(outputList)
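
One side note, since the original complaint was speed: splitWords() glues 
the tokens back together with spaces, and the caller then splits them 
apart again (as in the examples below). If what you ultimately want is 
the token list, a variant that builds the list directly saves that round 
trip. This is only a sketch of the same logic, not tested beyond the 
examples in this message:

def splitWordsList(inputline):
    # Same splitting rules as splitWords(), but collect the tokens
    # into a list instead of joining with spaces and re-splitting.
    outputList = []
    for word in inputline.split(' '):
        if word.startswith('{') and word != '{':
            outputList.extend(splitWordsList('{ ' + word[1:]))
        elif word.endswith('}') and word != '}' and word != '\\}':
            outputList.extend(splitWordsList(word[:-1] + ' }'))
        elif '\\' in word and word != '\\':
            expanded = ' '.join(['\\' + item for item in word.split('\\') if item])
            if expanded != word:
                outputList.extend(splitWordsList(expanded))
            else:
                outputList.append(word)
        else:
            outputList.append(word)
    return outputList

Called on the examples below, it gives the same lists without the final 
.split(' ').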

example1 = 'text \par \\ \{ \} {}'

print splitWords(example1)
 >>> text \par \ \{ \} { }
print splitWords(example1).split(' ')
 >>> ['text', '\\par', '\\', '\\{', '\\}', '{', '}']

All seven token types seem to be identified correctly.

example2 = 'text \par\pard \par} \\ \{ \} {differenttext}'
print splitWords(example2)
 >>> text \par \pard \par } \ \{ \} { differenttext }
print splitWords(example2).split(' ')
 >>> ['text', '\\par', '\\pard', '\\par', '}', '\\', '\\{', '\\}', '{', 'differenttext', '}']

I haven't tested this exhaustively, but it seems to do what you wanted.
As I said, I don't know if this will end up being better than using re 
or not, but it is an alternative approach.
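
For what it's worth, the re version I would measure against is a single 
pattern with one alternative per token type, applied with findall(). 
This is only a sketch, and it makes some assumptions Paul didn't spell 
out: it runs on the raw RTF text (where the escaped backslash is two 
backslash characters, which the Python string literals above collapse to 
one), it lets a control word swallow one delimiting space, and it throws 
away whitespace-only tokens.

import re

# One alternative per token type; earlier alternatives win, so the
# escapes must be tried before the general control-word rule.
RTF_TOKEN = re.compile(r'''
      \\[\\{}]          # escaped backslash, \{ and \}
    | \\[^\\{}\ ]+\ ?   # control word, plus one optional delimiting space
    | [{}]              # open and close bracket
    | [^\\{}]+          # run of plain text
''', re.VERBOSE)

def tokenize(line):
    # drop the tokens that are nothing but whitespace
    return [tok for tok in RTF_TOKEN.findall(line) if tok.strip()]

Timing both versions on a real chunk of your file would settle the 
question; my guess is that a single findall() pass pulls ahead as the 
lines get longer, but that is only a guess.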

Best regards,
Fred