re beginner
Fredrik Lundh
fredrik at pythonware.com
Mon Jun 5 06:16:12 EDT 2006
SuperHik wrote:
> I'm trying to understand regex for the first time, and it would be very
> helpful to get an example. I have an old(er) script with the following
> task - takes a string I copy-pasted and which always has the same format:
>
> >>> print stuff
> Yellow hat 2 Blue shirt 1
> White socks 4 Green pants 1
> Blue bag 4 Nice perfume 3
> Wrist watch 7 Mobile phone 4
> Wireless cord! 2 Building tools 3
> One for the money 7 Two for the show 4
>
> >>> stuff
> 'Yellow hat\t2\tBlue shirt\t1\nWhite socks\t4\tGreen pants\t1\nBlue
> bag\t4\tNice perfume\t3\nWrist watch\t7\tMobile phone\t4\nWireless
> cord!\t2\tBuilding tools\t3\nOne for the money\t7\tTwo for the show\t4'
the first thing you need to do is to figure out exactly what the syntax
is. given your example, the format of the items you are looking for
seems to be "some text" followed by a tab character followed by an integer.
an initial attempt would be "\w+\t\d+" (one or more word characters,
followed by a tab, followed by one or more digits). to try this out,
you can do:
>>> import re
>>> re.findall(r'\w+\t\d+', stuff)
['hat\t2', 'shirt\t1', 'socks\t4', ...]
as you can see, using \w+ isn't good enough here; the "keys" in this
case may contain whitespace as well, and findall simply skips any text that
doesn't match the pattern. if we assume that a key consists of words
and spaces, we can replace the single \w with [\w ] (either word
character or space), and get
>>> re.findall(r'[\w ]+\t\d+', stuff)
['Yellow hat\t2', 'Blue shirt\t1', 'White socks\t4', ...]
which looks a bit better. however, if you check the output carefully,
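you'll notice that the "Wireless cord!" entry is missing: the "!" isn't
a word character or a space. the easiest way to fix this is to look for
"anything except the separators" instead, using "[^\t\n]" (this matches
any character except a tab or a newline; leaving the newline out of the
class would let a key at the start of a row pick up the "\n" that ends
the previous row):
>>> len(re.findall(r'[\w ]+\t\d+', stuff))
11
>>> len(re.findall(r'[^\t\n]+\t\d+', stuff))
12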
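to see what each candidate key pattern actually captures, it can help to run them side by side. the sketch below is only an illustration: the helper name keys_found is made up for this example, and the list includes both the plain "not a tab" class and one that also excludes newlines, so the difference between the two shows up in the output.

```python
import re

# sample data copied from the original question
stuff = ('Yellow hat\t2\tBlue shirt\t1\nWhite socks\t4\tGreen pants\t1\n'
         'Blue bag\t4\tNice perfume\t3\nWrist watch\t7\tMobile phone\t4\n'
         'Wireless cord!\t2\tBuilding tools\t3\n'
         'One for the money\t7\tTwo for the show\t4')

# candidate patterns for the key part only
patterns = [r'\w+', r'[\w ]+', r'[^\t]+', r'[^\t\n]+']

def keys_found(key_pattern, text):
    """Return the keys matched by key_pattern when followed by a tab and digits.

    (hypothetical helper, not part of the re module)
    """
    return re.findall('(' + key_pattern + r')\t\d+', text)

for p in patterns:
    keys = keys_found(p, stuff)
    # repr() makes stray newlines inside a key visible
    print(p, len(keys), repr(keys[:3]))
```

printing the keys with repr() is the useful part here: a key that silently swallowed a newline looks fine with a plain print, but shows up as '\n...' in the repr.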
now, to turn this into a dictionary, you could split the returned
strings on a tab character (\t), but RE provides a better mechanism:
capturing groups. by adding () to the pattern string, you can mark the
sections you want returned:
>>> re.findall(r'([^\t\n]+)\t(\d+)', stuff)
[('Yellow hat', '2'), ('Blue shirt', '1'), ('White socks', ...]
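the shape of what findall returns depends on how many groups the pattern has. a small sketch on a shortened input (the variable name text and the two-item sample are just for illustration):

```python
import re

text = 'Yellow hat\t2\tBlue shirt\t1'

# no groups: each result is the whole matched string
whole = re.findall(r'[^\t\n]+\t\d+', text)

# one group: each result is just the text of that group
keys_only = re.findall(r'([^\t\n]+)\t\d+', text)

# two groups: each result is a tuple with one string per group
pairs = re.findall(r'([^\t\n]+)\t(\d+)', text)

print(whole)
print(keys_only)
print(pairs)
```

the list of 2-tuples from the last call is exactly the form dict() accepts.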
turning this into a dictionary is trivial:
>>> dict(re.findall(r'([^\t\n]+)\t(\d+)', stuff))
{'Green pants': '1', 'Blue shirt': '1', 'White socks': ...}
>>> len(dict(re.findall(r'([^\t\n]+)\t(\d+)', stuff)))
12
or, in function terms:
def putindict(items):
    # map each item name to its count (the counts stay strings)
    return dict(re.findall(r'([^\t\n]+)\t(\d+)', items))
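note that the counts come back as strings. if you want numbers instead, one possible variation is sketched below; the name parse_items, the precompiled ITEM pattern and the int conversion are assumptions of this example, not something the original script is known to need.

```python
import re

# compiled once; key is anything except tab/newline, value is a run of digits
ITEM = re.compile(r'([^\t\n]+)\t(\d+)')

def parse_items(text):
    """Return a dict mapping each item name to its count as an int.

    If the same name appears twice, the later count wins.
    """
    return {key: int(count) for key, count in ITEM.findall(text)}

sample = 'Yellow hat\t2\tBlue shirt\t1\nWireless cord!\t2\tBuilding tools\t3'
print(parse_items(sample))
```

compiling the pattern up front is optional; it mostly keeps the function body short if you call it many times.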
hope this helps!
</F>