regex help

Rhodri James rhodri at wildebst.demon.co.uk
Wed Jul 8 19:20:59 EDT 2009


On Wed, 08 Jul 2009 23:06:22 +0100, David <david.bramer at googlemail.com>  
wrote:

> Hi
>
> I have a few regexes I need to do, but I'm struggling to come up with a
> nice way of doing them, and more than anything am here to learn some
> tricks and some neat code rather than getting an answer - although
> that's obviously what I would like to get to.
>
> Problem 1 -
>
> <span class="chg"
>                 id="ref_678774_cp">(25.47%)</span><br>
>
> I want to extract 25.47 from here - so far I've tried -
>
> xPer = re.search('<span class="chg" id="ref_"'+str(xID.group(1))+'"_cp
> \">(.*?)%', content)

Supposing that str(xID.group(1)) == "678774", let's see how that string
concatenation turns out:

<span class="chg" id="ref_"678774"_cp">(.*?)%

The obvious problems here are the spurious double quotes around the ID,
the spurious (but harmless) escaping of a double quote, and the lack of a
backslash-escaped open parenthesis before the capturing group.  The
missing parenthesis only means the capture comes back as "(25.47", which
you can always strip off later, but the stray quotes sink the match
rather thoroughly.
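
A quick sketch to see this for yourself.  The name xid is just a stand-in
for str(xID.group(1)), its value is the one assumed above, and I have
taken the line break in the posted pattern to be mail wrapping:

```python
import re

xid = "678774"  # stand-in for str(xID.group(1)); value assumed as above

# The HTML from the question, flattened onto one line so that whitespace
# differences don't muddy the picture.
content = '<span class="chg" id="ref_678774_cp">(25.47%)</span><br>'

# First attempt as posted (mail line wrap removed).
posted = '<span class="chg" id="ref_"' + xid + '"_cp\">(.*?)%'
print(posted)                      # shows the stray quotes around the ID
print(re.search(posted, content))  # no match, so this prints None

# The same pattern minus the two stray double quotes.
unquoted = '<span class="chg" id="ref_' + xid + '_cp">(.*?)%'
found = re.search(unquoted, content)
print(found.group(1))              # the capture still starts with "("
```

With the quotes gone the search succeeds, but the capture includes the
opening parenthesis, which is the part you would have to strip afterwards.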

>
> and
>
> xPer = re.search('<span class=\"chg\" id=\"ref_"+str(xID.group(1))+"_cp
> \">\((\d*)%\)</span><br>', content)

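With only two single quotes present, the biggest problem should be obvious:
the whole thing is a single string literal, so the "+str(xID.group(1))+"
in the middle is never evaluated and is searched for as literal text.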
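
A small sketch of what Python actually sees, using only the middle part
of the posted pattern (again assuming the line break is mail wrapping).
Note that nothing complains about xID being undefined, because the call
is never made:

```python
# Everything between the outer single quotes is one string; the double
# quotes inside it do not end the literal.
fragment = 'id=\"ref_"+str(xID.group(1))+"_cp\">'
print(fragment)

# The "concatenation" is just characters in the pattern.
print("str(xID.group(1))" in fragment)
```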

Unfortunately if you just fix the obvious in either of the two regular
expressions, you're setting yourself up for a fall later on.  As The Fine
Manual says right at the top of the page on the re module
(http://docs.python.org/library/re.html), you want to be using raw string
literals when you're dealing with regular expressions, because you want
the backslashes getting through without being interpreted specially by
Python's own parser.  As it happens, you get away with it in this case,
since neither '\d' nor '\(' has a special meaning to Python, so they aren't
changed, and '\"' is interpreted as '"', which happens to be the right
thing anyway.
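
A minimal sketch of the difference raw strings make, followed by one way
the first pattern might be written with them.  The value of xid is the
assumed one from above; the \s+ between the attributes and the use of
re.escape() on the ID are my own choices rather than anything in the
original attempts.  The number is matched as digits with an optional
fractional part, since \d* on its own would stop at the decimal point:

```python
import re

# '\b' is a single backspace character to Python, whereas r'\b' reaches
# the re module as backslash + b, i.e. a word boundary.
print(len('\b'), len(r'\b'))
print(re.search('\bcat\b', 'a cat'))            # looks for backspaces
print(re.search(r'\bcat\b', 'a cat').group(0))  # looks for the word

xid = "678774"  # stand-in for str(xID.group(1)); assumed value

# The HTML as quoted in the question, newline and indentation included.
content = '<span class="chg"\n                id="ref_678774_cp">(25.47%)</span><br>'

# Raw string pieces joined around the ID; \( and \) match the literal
# parentheses, and the inner group captures just the number.
pattern = (r'<span class="chg"\s+id="ref_' + re.escape(xid)
           + r'_cp">\((\d+(?:\.\d+)?)%\)')

match = re.search(pattern, content)
if match:
    print(match.group(1))
```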


> Problem 2 -
>
> <td> </td>
>
> <td width="1%" class=key>Open:
> </td>
> <td width="1%" class=val>5.50
> </td>
> <td> </td>
> <td width="1%" class=key>Mkt Cap:
> </td>
> <td width="1%" class=val>6.92M
> </td>
> <td> </td>
> <td width="1%" class=key>P/E:
> </td>
> <td width="1%" class=val>21.99
> </td>
>
>
> I want to extract the open, mkt cap and P/E values - but apart from
> doing loads of individual REs which I think would look messy, I can't
> think of a better and neater looking way. Any ideas?

What you're trying to do is inherently messy.  You might want to use
something like BeautifulSoup to hide the mess, but never having had
cause to use it myself I couldn't say for sure.
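
If you do stay with regular expressions, one possible sketch is a single
pattern that pairs each class=key cell with the class=val cell following
it, and then lets findall() collect the lot.  This assumes the page
always looks like the fragment you posted (key cell immediately followed
by its value cell, labels ending in a colon); the names pair_re and
values are made up for the example, and it will break as soon as the
markup changes shape:

```python
import re

# The fragment from the question.
html = """<td> </td>

<td width="1%" class=key>Open:
</td>
<td width="1%" class=val>5.50
</td>
<td> </td>
<td width="1%" class=key>Mkt Cap:
</td>
<td width="1%" class=val>6.92M
</td>
<td> </td>
<td width="1%" class=key>P/E:
</td>
<td width="1%" class=val>21.99
</td>
"""

# Group 1: the label up to (not including) the colon.
# Group 2: the value text, with surrounding whitespace left out.
pair_re = re.compile(
    r'class=key>\s*([^<:]+):\s*</td>\s*'
    r'<td[^>]*class=val>\s*([^<]*?)\s*</td>'
)

# findall() returns (label, value) tuples, which dict() accepts directly.
values = dict(pair_re.findall(html))
print(values["Open"], values["Mkt Cap"], values["P/E"])
```

That gets all three values in one pass instead of one search per field,
at the cost of being tied rather tightly to this exact layout.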

-- 
Rhodri James *-* Wildebeest Herder to the Masses


