working with RE's (was Re: Pls, help me with re)

Mon Mar 3 05:29:47 EST 2003

<posted & mailed>

Lexy Zhitenev wrote:

> 
> "Erik Max Francis" <max at alcyone.com> wrote in message:
> news:3E6324D2.7E76BCB1 at alcyone.com...
>> >
>> > 'Course: 23,95' - matches '23,95'
>> > '$1 = 23,95 y' - matches '23,95'.
>>
>> Why not a regular expression simply like [including both commas and
>> periods]:
>>
>> =.*([0-9,.]+)
> 
> This re matches the second example, but it doesn't match the first one.
> Sorry

The specs as you gave them originally are a little bit strange:

> match a number if it is the only one in the string, and the
> second one, if '=' preceeds it.

because they don't cover many other possibilities and also,
taken literally, they seem to specify things you're unlikely
to actually want.  For example, '= 23 45 67' should match
45?  That's what your specs say -- here 45 is the "second
number" and there is indeed an '=' that precedes it.  And
what is "a number" is not clear either -- you show commas
in your examples, but what about signs (leading? trailing?),
exponents, periods...?

Let's first finesse the second issue by assuming that
somehow in string renu you put the regular expression
that matches "a number" according to your favourite (and
perhaps locale-dependent) definition.  For example "a
digit followed by zero or more digits or commas" would
be expressed by

renu = r'\d[\d,]*'

if you do NOT want signs, neither leading nor trailing,
and have no problem considering "1,,,,44" ``a number''
(which feels weird, but, how can we TELL what you mean
by ``a number'' unless you TELL us?-).
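
To see what that definition does and does not accept, here is a small check.  The first two strings are the examples from the thread; the other two are made up for illustration.

```python
import re

renu = r'\d[\d,]*'   # a digit followed by zero or more digits or commas

# findall returns every non-overlapping match, scanning left to right
print(re.findall(renu, 'Course: 23,95'))    # ['23,95']
print(re.findall(renu, '$1 = 23,95 y'))     # ['1', '23,95']
print(re.findall(renu, 'odd: 1,,,,44'))     # ['1,,,,44']
print(re.findall(renu, '-3.5'))             # ['3', '5']
```

Note how the sign is ignored and the period splits '3.5' into two "numbers", which is exactly the kind of thing the definition has to settle.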

Now, for your first specs, determining that a number is
"the only one in the string" may be delicate.  Negative
lookahead works for "is followed by no numbers in the
string", but negative lookbehind is quite limited (to
patterns matching strings of fixed length...!).  And
for your second specs, determining that a number is the
second one is NOT satisfied by Erik's proposal -- that
one (used with search, no doubt, not match) would also
match the third, fourth, ... and so on, which breaks the
specs you've given.  Note that, _in general_, given the
limitations of negative lookbehind, it's not possible
to express "no numbers before this one" for a sufficiently
general definition of "number".
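
A rough illustration of these three points with Python's re module, using the simple renu from above and the made-up string '= 23 45 67'.  The names last_only and lookbehind_ok are just for this sketch.

```python
import re

renu = r'\d[\d,]*'

# Negative lookahead: a number NOT followed, anywhere later, by a digit.
last_only = re.compile('(' + renu + r')(?!.*\d)')
found = last_only.search('= 23 45 67').group(1)
print(found)                      # 67

# A lookbehind of variable width is rejected when the pattern is compiled.
try:
    re.compile(r'(?<!\d.*)' + renu)
    lookbehind_ok = True
except re.error:
    lookbehind_ok = False
print(lookbehind_ok)              # False

# Erik's pattern with search: the greedy '.*' runs as far right as it
# can, so the group ends up nowhere near the "second" number.
erik = re.search(r'=.*([0-9,.]+)', '= 23 45 67').group(1)
print(erik)                       # 7
```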

However, if your definition of "number" is something
that MUST include at least a digit, then you might
accept "no DIGITS before here" as a satisfactory proxy
for "no NUMBERS before here".  As a digit is a single
character, NOW we could do something, i.e. meet the
first half of your specs by:

firsthalf = r'^[^\d]*(' + renu + r')[^\d]*$'   # no digits before OR after

and the second half by:

secondhalf = r'^[^\d]*' + renu + r'[^\d]*=[^\d]*(' + renu + r')'

and the whole, obviously, by:

allspecs = firsthalf + '|' + secondhalf
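
A quick check of how such an alternation behaves with re.match.  This is only a sketch, and it assumes the first alternative also insists that no digit follows the number (the trailing `[^\d]*$`); without that, the first alternative succeeds on any string containing a digit and the second one is never tried.  The helper name extract is made up.

```python
import re

renu = r'\d[\d,]*'
# Assumption: "the only number" is enforced by forbidding digits after it too.
firsthalf = r'^[^\d]*(' + renu + r')[^\d]*$'
secondhalf = r'^[^\d]*' + renu + r'[^\d]*=[^\d]*(' + renu + r')'
allspecs = re.compile(firsthalf + '|' + secondhalf)

def extract(astring):
    """Return the number selected by allspecs, or None if nothing matches."""
    mo = allspecs.match(astring)
    if mo is None:
        return None
    # group 1 belongs to firsthalf, group 2 to secondhalf; only one is set
    return mo.group(1) or mo.group(2)

print(extract('Course: 23,95'))   # 23,95
print(extract('$1 = 23,95 y'))    # 23,95
print(extract('= 23 45 67'))      # None
```

The last case returns None because this pattern wants the '=' between the first and the second number, not in front of both.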

The resulting monstrosity (whose correctness would be
hard to verify entirely!) would be quite typical
of unbridled use of regular expressions for tasks they do
not suit well -- often they CAN be stretched to perform
jobs they're not ideal for, but, look at the PRICE you
pay in terms of complication...!  Also note that the
secondhalf is not an EXACT match for your specs -- I've
placed the '=' where I suspect you MEAN it should go,
rather than going for the positive lookbehind.

A far more sensible approach would be to ask RE's to do
only what RE's do well, and combine that with some Python
code to get exactly what you desire.  For example, the
split method of RE objects is quite handy:

import re

def pick_number(astring):              # the function name is arbitrary
    nure = re.compile('(%s)' % renu)   # assume no parens in renu!
    pieces = nure.split(astring, maxsplit=2)
    if len(pieces) == 1:
        return None                    # no number at all in astring
    if len(pieces) == 3:
        return pieces[1]               # case "just one number"
    assert len(pieces) == 5            # just a sanity check
    if '=' in pieces[2]:               # =-positioning assumption
        return pieces[3]               # case "2nd number prec. ="
    return None                        # or other indication of no-match

With this approach, you'd only have to design ONE regular
expression -- the relatively simple one whose pattern you
should bind to the name renu -- and leave the other issues,
such as distinguishing the first from the second match of
this RE, etc, to very elementary Python code working on
the list of strings that I've bound to name pieces.
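
To make the shape of that list concrete, here is what split returns for a few inputs, again assuming the simple `\d[\d,]*` definition of a number.  The last two strings are made up for illustration.

```python
import re

nure = re.compile(r'(\d[\d,]*)')   # renu wrapped in one capturing group

# Because the pattern has a capturing group, split keeps the matched
# numbers in the result, alternating with the text around them.
one = nure.split('Course: 23,95', maxsplit=2)
two = nure.split('$1 = 23,95 y', maxsplit=2)
none = nure.split('no digits here', maxsplit=2)
many = nure.split('= 23 45 67', maxsplit=2)

print(one)    # ['Course: ', '23,95', '']
print(two)    # ['$', '1', ' = ', '23,95', ' y']
print(none)   # ['no digits here']
print(many)   # ['= ', '23', ' ', '45', ' 67']
```

So the list has 1, 3 or 5 items for zero, one, or two-or-more numbers; the numbers sit at the odd indices, and the text between the first two numbers is at index 2.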

Personally, THIS is the way I suggest people should more
often than not work with RE's -- not eschew them completely,
but be on guard against the temptation of delving into
them to the utmost complexity they can afford.  Keep your
RE's reasonably simple, and dress them up with Python code
(which is often quite elementary) for those cases in which
the RE, even if feasible, would become horridly complex.

Alex