re beginner

Mon Jun 5 06:16:12 EDT 2006

SuperHik wrote:

> I'm trying to understand regex for the first time, and it would be very 
> helpful to get an example. I have an old(er) script with the following 
> task - takes a string I copy-pasted and which always has the same format:
> 
>  >>> print stuff
> Yellow hat	2	Blue shirt	1
> White socks	4	Green pants	1
> Blue bag	4	Nice perfume	3
> Wrist watch	7	Mobile phone	4
> Wireless cord!	2	Building tools	3
> One for the money	7	Two for the show	4
> 
>  >>> stuff
> 'Yellow hat\t2\tBlue shirt\t1\nWhite socks\t4\tGreen pants\t1\nBlue 
> bag\t4\tNice perfume\t3\nWrist watch\t7\tMobile phone\t4\nWireless 
> cord!\t2\tBuilding tools\t3\nOne for the money\t7\tTwo for the show\t4'

the first thing you need to do is to figure out exactly what the syntax 
is.  given your example, the format of the items you are looking for 
seems to be "some text" followed by a tab character followed by an integer.

an initial attempt would be "\w+\t\d+" (one or more word characters, 
followed by a tab, followed by one or more digits).  to try this out, 
you can do:

     >>> import re
     >>> re.findall(r'\w+\t\d+', stuff)
     ['hat\t2', 'shirt\t1', 'socks\t4', ...]

as you can see, using \w+ isn't good enough here; the "keys" in this 
case may contain whitespace as well, and findall simply skips stuff that 
doesn't match the pattern.  if we assume that a key consists of words 
and spaces, we can replace the single \w with [\w ] (either word 
character or space), and get

     >>> re.findall(r'[\w ]+\t\d+', stuff)
     ['Yellow hat\t2', 'Blue shirt\t1', 'White socks\t4', ...]

which looks a bit better.  however, if you check the output carefully, 
you'll notice that the "Wireless cord!" entry is missing: the "!" isn't 
a letter or a digit.  the easiest way to fix this is to look for 
"non-tab characters" instead, using "[^\t]" (this matches anything 
except a tab):

     >>> len(re.findall(r'[\w ]+\t\d+', stuff))
     11
     >>> len(re.findall(r'[^\t]+\t\d+', stuff))
     12
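
to see exactly which entries a pattern drops, you can compare its matches against the full list of items.  the following is only a minimal sketch: it rebuilds the sample string from the question, and the names "expected", "found" and "missing" are made up for the illustration.

```python
import re

# the sample string from the question
stuff = ('Yellow hat\t2\tBlue shirt\t1\nWhite socks\t4\tGreen pants\t1\n'
         'Blue bag\t4\tNice perfume\t3\nWrist watch\t7\tMobile phone\t4\n'
         'Wireless cord!\t2\tBuilding tools\t3\n'
         'One for the money\t7\tTwo for the show\t4')

# every "key<tab>count" item, obtained without regular expressions:
# split into lines, then pair up the tab-separated fields two by two
expected = []
for line in stuff.split('\n'):
    fields = line.split('\t')
    for i in range(0, len(fields), 2):
        expected.append(fields[i] + '\t' + fields[i + 1])

# items matched by the "word characters and spaces" pattern
found = re.findall(r'[\w ]+\t\d+', stuff)

# anything in the full list that the pattern did not return
missing = [item for item in expected if item not in found]
print(missing)   # ['Wireless cord!\t2']
```

swapping in another pattern for the one passed to findall lets you check it the same way.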

now, to turn this into a dictionary, you could split the returned 
strings on a tab character (\t), but the re module provides a better 
mechanism: capturing groups.  by adding () to the pattern string, you 
can mark the sections you want returned.  since [^\t] also matches the 
newline at the end of each line, exclude \n as well, so that keys such 
as 'White socks' don't pick up a leading newline:

     >>> re.findall(r'([^\t\n]+)\t(\d+)', stuff)
     [('Yellow hat', '2'), ('Blue shirt', '1'), ('White socks', ...]
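
for comparison, here is a sketch of the split-based alternative mentioned above, next to the capturing-group version.  it uses a shortened copy of the sample string, and "by_split" and "by_groups" are illustrative names only.

```python
import re

# first two lines of the sample string from the question
stuff = 'Yellow hat\t2\tBlue shirt\t1\nWhite socks\t4\tGreen pants\t1'

# [^\t\n] excludes the newline too, so a key never starts with '\n'

# approach 1: match whole items, then split each one on the tab
by_split = [tuple(item.split('\t'))
            for item in re.findall(r'[^\t\n]+\t\d+', stuff)]

# approach 2: let capturing groups return the two parts directly
by_groups = re.findall(r'([^\t\n]+)\t(\d+)', stuff)

print(by_split == by_groups)   # True
print(by_groups[2])            # ('White socks', '4')
```

both give the same pairs; the groups simply save the extra pass over the results.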

turning this into a dictionary is trivial:

     >>> dict(re.findall(r'([^\t\n]+)\t(\d+)', stuff))
     {'Green pants': '1', 'Blue shirt': '1', 'White socks': ...}
     >>> len(dict(re.findall(r'([^\t\n]+)\t(\d+)', stuff)))
     12

or, in function terms:

     def putindict(items):
         return dict(re.findall(r'([^\t\n]+)\t(\d+)', items))
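
note that the counts come back as strings.  if you need numbers, a variation along these lines converts them on the way into the dictionary.  this is only a sketch: the function name "counts_from_text" and the precompiled "ITEM" pattern are names invented for this example, not something the original script requires.

```python
import re

# key: anything except tab or newline; count: one or more digits
ITEM = re.compile(r'([^\t\n]+)\t(\d+)')

def counts_from_text(text):
    """Return a dict mapping each key to its count as an int."""
    return {key: int(count) for key, count in ITEM.findall(text)}

sample = 'Yellow hat\t2\tBlue shirt\t1\nWhite socks\t4\tGreen pants\t1'
print(counts_from_text(sample))
# {'Yellow hat': 2, 'Blue shirt': 1, 'White socks': 4, 'Green pants': 1}
```

compiling the pattern once is optional here; it mostly keeps the pattern and its explanation in one place.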

hope this helps!

</F>