help please
Steven Bethard
steven.bethard at gmail.com
Sun Feb 13 20:14:10 EST 2005
gargonx wrote:
> let's take the word "dogs"
>
> ext = dict("D":"V1", "O":"M1", "G":"S1")
> std = dict("S":"H")
>
> encode("DOGS") # proc()
> we'll get: "V1M1S1H"
>
> let's say i want to do just the opposite
> word: "V1M1S1H"
> decode("V1M1S1H")
> #how do i decode "V1" to "D", how do i keep the "V1" together?
> and get: "DOGS"
If you can make some assumptions about the right-hand sides of your
dicts, you can probably tokenize your string with a simple regular
expression:
py> import re
py> charmatcher = re.compile(r'[A-Z][\d]?')
py>
py> ext = dict(D="V1", O="M1", G="S1")
py> std = dict(S="H")
py>
py> decode_replacements = {}
py> decode_replacements.update([(std[key], key) for key in std])
py> decode_replacements.update([(ext[key], key) for key in ext])
py>
py> def decode(text):
...     return ''.join([decode_replacements.get(c, c)
...                     for c in charmatcher.findall(text)])
...
py>
py> decode("V1M1S1H")
'DOGS'
So, instead of using
    for c in text
I use
    for c in charmatcher.findall(text)
That gives me the correct tokenization, and I can just use the inverted
dicts to map it back. Note however that I've written the regular
expression to depend on the fact that the values in std and ext are
either single uppercase characters or single uppercase characters
followed by a single digit.
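If that assumption doesn't hold for your real tables, one option is to
build the pattern from the encoded values themselves instead of
hard-coding their shape. The following is only a sketch under some
assumptions of mine: tokenmatcher and tokens are names I made up, it
uses a dict comprehension (so it needs a newer Python than the session
above), and it still assumes that no encoded value can be misread as a
run of other encoded values.

```python
import re

ext = dict(D="V1", O="M1", G="S1")
std = dict(S="H")

# Invert both tables: encoded token -> original letter.
decode_replacements = {value: key
                       for table in (std, ext)
                       for key, value in table.items()}

# Build the tokenizer from the encoded tokens, longest first, so a longer
# token is tried before any shorter token that happens to be its prefix.
# The trailing '.' alternative (hypothetical choice) matches any single
# character that is not part of a known token.
tokens = sorted(decode_replacements, key=len, reverse=True)
tokenmatcher = re.compile(
    '|'.join(re.escape(t) for t in tokens) + '|.', re.DOTALL)

def decode(text):
    # Unknown characters fall through .get() and are kept unchanged.
    return ''.join(decode_replacements.get(tok, tok)
                   for tok in tokenmatcher.findall(text))

print(decode("V1M1S1H"))  # prints DOGS
```

The re.escape() call is there so that encoded values containing
characters like "." or "*" are matched literally rather than being
treated as regular expression syntax.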
Steve