[Tutor] best way to tokenize [was script too slow]

Jeff Shannon jeff@ccvcorp.com
Wed Feb 26 14:33:01 2003


Paul Tremblay wrote:

>Okay, now I do see the whole thing. The list to join is: first split the
>token by "\\", which will get rid of the "\\", and then add the "\\" to
>each item.
>
>>>            expandedWord = ' '.join(['\\'+item for item in word.split('\\') if item])

A good approach with this sort of thing is to try to spread it out into 
several lines.  One-liners can be convenient, but sometimes they're a 
little confusing, especially if you're not terribly familiar with the 
way that things fit together.  So let's break this down into several steps.

    WordList = word.split('\\')
    TokenList = ['\\' + item for item in WordList if item]
    ExpandedWord = ' '.join(TokenList)

This code has exactly the same effect as the one-liner, but is a little 
bit easier to figure out (even though it may take a bit longer to read). 
I find myself using intermediate variables like this fairly often -- if 
it takes me more than a couple of seconds to figure out what a compound 
expression is doing, I figure that it's too complex and would be better 
split into several parts.  The decision of how much complexity is 
appropriate is, of course, a very personal stylistic one.

For instance, my second line above is actually doing two things -- 
filtering out any empty strings from WordList, and prepending '\\' to 
each remaining item.  I could have split that out into two separate list 
comprehensions, and I could argue that it would make things a little 
more explicit... but that would also require two passes over the list 
instead of one, plus an extra intermediate list, so it could actually 
affect performance -- and if the list might be long, that could be a 
significant effect.  I feel that the (marginal) extra clarity is not 
worth that possible performance loss.  On the other hand, splitting the 
original one-liner into these three lines adds only a few variable 
assignments and lookups.  That's a very small cost, so it's much easier 
to argue that the increased clarity is worth it.
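
To make the comparison concrete, here is a small sketch that runs the 
one-liner, the three-step version, and the two-comprehension variant on 
the same input.  The sample value of word is made up for illustration, 
and the names NonEmpty and TokenList2 are mine, not from the original 
script.

```python
# Hypothetical sample token, invented for this illustration.
word = '\\par\\b\\i0'

# The one-liner from the quoted message.
one_liner = ' '.join(['\\' + item for item in word.split('\\') if item])

# The three-step version.
WordList = word.split('\\')    # the leading '\\' produces an empty first item
TokenList = ['\\' + item for item in WordList if item]
ExpandedWord = ' '.join(TokenList)

# The "two comprehensions" variant: one pass to filter, one to prepend.
NonEmpty = [item for item in WordList if item]
TokenList2 = ['\\' + item for item in NonEmpty]

print(WordList)                       # ['', 'par', 'b', 'i0']
print(ExpandedWord)                   # \par \b \i0
print(one_liner == ExpandedWord)      # True
print(TokenList2 == TokenList)        # True
```

All three spellings build the same result; the only differences are how 
many names get bound and how many times the list is walked.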

Personally, if I were writing code that I thought might be read by 
others (example code for this list, say) -- and that probably covers 
most code for programs that will be in use for any length of time -- 
I'd use the longer multi-line version.  Only if I were writing a quick 
script, or working in a throw-away interactive session, would I use the 
one-liner.
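
For code that is meant to stick around, one option is to wrap the 
multi-line version in a small function.  This is only a sketch: the 
name expand_word and the sep parameter are hypothetical, not something 
from the original script.

```python
def expand_word(word, sep='\\'):
    """Split word on sep, drop empty pieces, and put sep back on each piece."""
    pieces = word.split(sep)
    tokens = [sep + item for item in pieces if item]
    return ' '.join(tokens)


print(expand_word('\\par\\b'))    # \par \b
```

The intermediate names are kept inside the function, so the caller sees 
a single readable call while the body stays easy to step through.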

Jeff Shannon
Technician/Programmer
Credit International