Confusion about dictionaries - keys use value or identity?

Roy Smith roy at panix.com
Sun Jul 8 19:46:43 EDT 2001


"Tim Peters" <tim.one at home.com> wrote:
> BTW, if you want, say, "aa" not to be the same key as "a"+"a", in what sense
> is this dict indexed by strings?  That is, it's unclear what strings have to
> do with this.

Well, OK, that's a fair question.  Here's what I really want to do...

I'm writing a parser for a kind of text file we use a lot.  The main loop 
of the parser is a state machine.  In each state, the current input line is 
tested to see if it matches one or more regular expressions, and based on 
which (if any) it matches, you do some processing and advance to the next 
state.  Something like:

   state = "cvsid"
   for line in inputfile.readlines():
      if state == "cvsid":
         if re.match (r"^\s*\$\s*Id:([^$]*)\$\s*$", line):
            state = "header"
      elif state == "header":
         if re.match (r"big ugly pattern", line):
            do some stuff
            state = "what comes after the header"

and so on.  I'll end up with about a dozen different states, with perhaps 2 
dozen different regexes.  Here's the dilemma.  If I write it as I did above, 
the regexes will get compiled each time they're evaluated, which is clearly 
inefficient in the extreme.  The alternative would be to re.compile() all 
the regexes once, at the top, then use the stored programs, something like 
this:

   cvsPattern = re.compile (r"^\s*\$\s*Id:([^$]*)\$\s*$")
         [...]
         if cvsPattern.match (line):
            state = "header"
and so on.  This would certainly work, but it moves the regexes away from 
where they are used, making (IMHO) the program more difficult to read and 
understand.  It reminds me of the bad old days when we would collect all 
our FORTRAN format statements at the back of the deck.  See 
http://www.python.org/doc/Humor.html#habits for why I don't want to do that 
any more :-)

So, here's my plan.  I was thinking that I might write a thin wrapper 
around the re module which keeps a cache of the regexes it has compiled.  
myRe.match would take the passed-in regex string, look it up in the cache, 
and if it found it, just use the cached compiled version.  I figure it 
would be very close to the speed of storing the pre-compiled versions in 
the mainline code, as shown above, but still have the convenience and 
comprehensibility of keeping the regex string at the place where it's used.

So, that's what got me going on this.  Thinking about it again (and reading 
what people have written here), I'm starting to realize the default key 
algorithm is probably fast enough for what I'm trying to do, but it's still 
an interesting thing to think about.
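For what it's worth, the value-vs-identity question the thread started 
with is easy to check at the interpreter: dictionary lookup goes by hash 
and value equality, not object identity, which is exactly what makes a 
string-keyed cache like the one above work.

```python
d = {}
d["aa"] = "hit"

# Build an equal string by a different route; it's a distinct object,
# but equal by value, so it finds the same dictionary entry.
key = "".join(["a", "a"])
assert hash(key) == hash("aa")
assert d[key] == "hit"
```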



More information about the Python-list mailing list