Looking for help with Regular Expression

James Stroud jstroud at ucla.edu
Tue May 23 20:18:56 EDT 2006


ProvoWallis wrote:
> Hi,
> 
> I'm looking for a little advice about regular expressions. I want to
> capture a string of text that falls between an opening square bracket
> and a closing square bracket (e.g., "[" and "]") but I've run into a
> small problem.
> 
> I've been using this: '''\[(.*?)\]''' as my pattern. I was expecting
> this to be greedy but the funny thing is that it's not greedy enough in
> some situations.
> 
> Here's my problem: The end of my string sometimes contains a cross
> reference to a section in a book and the subsections are cited using
> square brackets exactly like the one I'm using as the ending point in
> my original regular expression.
> 
> E.g., the text string in my data looks like this: <core:emph
> typestyle="it">see</core:emph> discussion in
> &#xa7;&#x2002;512.16[3][b]]
> 
> But my regular expression is stopping after the first "]" so after I
> add the new markup the output looks like this:
> 
> <core:emph typestyle="it">see</core:emph> discussion in
> &#xa7;&#x2002;512.16[3]</fn:note>[b]]
> 
> So the last subsection is outside of the note tag. I want something
> like this:
> 
> <core:emph typestyle="it">see</core:emph> discussion in
> &#xa7;&#x2002;512.16[3][b]]</fn:note>
> 
> I'm not sure how to make my capture more greedy so I've resorted to
> cleaning up the data after I make the first round of replacements:
> 
> data = re.sub(r'''\[(\d*?)\]</fn:note>\[(\w)\]\]''',
>               r'''[\1][\2]]</fn:note>''', data)
> 
> There's got to be a better way but I'm not sure what it is.

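I do: pyparsing. The grammar below treats each bracketed subsection as
its own token, so the note's closing bracket is matched only after the
cross-references have been consumed.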

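from pyparsing import Suppress, Word, SkipTo, ZeroOrMore, alphanums

# A cross-reference is one alphanumeric character in brackets, e.g. [3] or [b].
crossref = Suppress("[") + Word(alphanums, exact=1) + Suppress("]")

# A footnote is "[", the text up to the first cross-reference,
# any number of cross-references, then the closing "]".
# Note: SkipTo(crossref) needs at least one cross-reference to be present.
footnote = (
    Suppress("[") + SkipTo(crossref) +
    ZeroOrMore(crossref) + Suppress("]")
)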

py> footnote.parseString("[&#xa7;&#x2002;512.16[3][b]]").asList()
['&#xa7;&#x2002;512.16', '3', 'b']
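
If you would rather stay with re, here is a minimal sketch, assuming the
brackets nest only one level deep, as in your sample. The second bracketed
item at the end of the sample text is hypothetical; I added it only to show
where a plain greedy pattern overshoots. The variable names are mine.

```python
import re

# Sample modeled on the string in the question. The trailing " and [x]"
# is a hypothetical addition to show the difference between the patterns.
text = "[&#xa7;&#x2002;512.16[3][b]] and [x]"

# Non-greedy: ".*?" stops at the first "]" it can, which is the reported problem.
lazy = re.compile(r"\[(.*?)\]")

# Greedy: ".*" runs to the last "]" in the text, which is too far.
greedy = re.compile(r"\[(.*)\]")

# One level of nesting: the body is any mix of non-bracket characters
# and complete "[...]" groups that contain no brackets themselves.
nested = re.compile(r"\[((?:[^\[\]]|\[[^\[\]]*\])*)\]")

lazy_body = lazy.search(text).group(1)
greedy_body = greedy.search(text).group(1)
nested_body = nested.search(text).group(1)

print(lazy_body)    # &#xa7;&#x2002;512.16[3
print(greedy_body)  # &#xa7;&#x2002;512.16[3][b]] and [x
print(nested_body)  # &#xa7;&#x2002;512.16[3][b]
```

The nested pattern will not cope with brackets nested two or more levels
deep; for that a real parser is the safer choice.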

James

-- 
James Stroud
UCLA-DOE Institute for Genomics and Proteomics
Box 951570
Los Angeles, CA 90095

http://www.jamesstroud.com/


