help please

Sun Feb 13 20:14:10 EST 2005

gargonx wrote:
> let's take the word "dogs"
> 
>    ext = dict("D":"V1",  "O":"M1", "G":"S1")
>    std = dict("S":"H")
> 
> encode("DOGS") # proc()
> we'll get: "V1M1S1H"
> 
> let's say i want to do just the opposite
> word: "V1M1S1H"
> decode("V1M1S1H")
>     #how do i decode "V1" to "D", how do i keep the "V1" together?
> and get: "DOGS"

If you can make some assumptions about the right-hand sides of your 
dicts, you can probably tokenize your string with a simple regular 
expression:

py> import re
py> charmatcher = re.compile(r'[A-Z]\d?')
py>
py> ext = dict(D="V1", O="M1", G="S1")
py> std = dict(S="H")
py>
py> decode_replacements = {}
py> decode_replacements.update([(std[key], key) for key in std])
py> decode_replacements.update([(ext[key], key) for key in ext])
py>
py> def decode(text):
...     return ''.join([decode_replacements.get(c, c)
...                     for c in charmatcher.findall(text)])
...
py>
py> decode("V1M1S1H")
'DOGS'

So, instead of using
     for c in text
I use
     for c in charmatcher.findall(text)
That gives me the correct tokenization, and I can just use the inverted 
dicts to map each token back.  Note, however, that I've written the regular 
expression to depend on the fact that the values in std and ext are 
either single uppercase characters or single uppercase characters 
followed by a single digit.
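
If your codes don't all fit that pattern, one possible variation is to 
build the regular expression from the dict values themselves.  This is 
only a sketch: the names token_re and decode_general are made up for 
illustration, and it assumes the codes can be split unambiguously when 
the longest candidate is tried first at each position, which won't hold 
for every conceivable set of codes.

```python
import re

ext = dict(D="V1", O="M1", G="S1")
std = dict(S="H")

# Invert both mappings: code -> original letter.
decode_replacements = {code: letter
                       for table in (std, ext)
                       for letter, code in table.items()}

# Try longer codes first so that a code which is a prefix of another
# (say "V" and "V1") cannot shadow the longer one.
codes = sorted(decode_replacements, key=len, reverse=True)

# The trailing "." lets any unrecognized character through as its own token.
token_re = re.compile("|".join(re.escape(code) for code in codes) + "|.",
                      re.DOTALL)

def decode_general(text):
    # Map each token back to its letter; unknown tokens are kept as-is.
    return "".join(decode_replacements.get(tok, tok)
                   for tok in token_re.findall(text))

print(decode_general("V1M1S1H"))  # prints DOGS
```

The re.escape call matters if a code ever contains a character that is 
special in regular expressions, such as "." or "+".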

Steve