How to write simple code to match strings?

Steven D'Aprano steve at REMOVE-THIS-cybersource.com.au
Wed Dec 30 01:01:57 EST 2009


On Tue, 29 Dec 2009 21:01:05 -0800, beginner wrote:

> Hi All,
> 
> I run into a problem.  I have a string s that can be a number of
> possible things. I use a regular expression code like below to match and
> parse it. But it looks very ugly. Also, the strings are literally
> matched twice -- once for matching and once for extraction -- which
> seems to be very slow. Is there any better way to handle this?

The most important thing you should do is to put the regular expressions 
into named variables, rather than typing them out twice. The names 
should, preferably, describe what they represent.

Oh, and you should use raw strings for regexes. In this particular 
example, I don't think it makes a difference, but if you ever modify the 
strings, it will!
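
To illustrate why raw strings matter, here is a small sketch. The pattern and the sample text are made up for the example and are not taken from your code. Sequences like \d or \$ happen to survive in an ordinary string because Python does not treat them as escapes, but \b does not:

```python
import re

# In an ordinary string literal, '\b' is a single backspace character.
plain = '\bcat\b'
# In a raw string, r'\b' is a backslash followed by 'b', which the re
# module reads as a word-boundary assertion.
raw = r'\bcat\b'

print("lengths: %d and %d" % (len(plain), len(raw)))

# The ordinary string looks for literal backspace characters and finds none.
plain_result = re.search(plain, 'a cat sat')
# The raw string finds the word 'cat'.
raw_result = re.search(raw, 'a cat sat')

print("plain pattern matched: %s" % (plain_result is not None))
print("raw pattern matched: %s" % (raw_result is not None))
```

The two patterns look the same in the source but have different lengths, and only the raw one matches.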

You should get rid of the unnecessary double calls to match. That's just 
wasteful. Also, since re.match tests the start of the string, you don't 
need the leading ^ regex (but you do need the $ to match the end of the 
string).
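
A quick sketch of the anchoring behaviour, using a made-up digits-only pattern rather than one of yours:

```python
import re

# re.match only tries to match at the start of the string,
# so a leading '^' adds nothing.
at_start = re.match(r'\d+', '123abc')      # matches the prefix '123'
not_at_start = re.match(r'\d+', 'abc123')  # None: digits are not at the start

# Without '$', trailing junk after the number is silently accepted.
# With '$', the whole string has to fit the pattern.
with_junk = re.match(r'\d+$', '123abc')    # None
whole = re.match(r'\d+$', '123')           # matches

print("prefix match: %r" % at_start.group(0))
print("not at start: %r" % not_at_start)
print("trailing junk with $: %r" % with_junk)
print("whole string with $: %r" % whole.group(0))
```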

You should also fix the syntax error, where you have "elif s='-'" 
instead of "elif s=='-'".

You should consider putting the cheapest test(s) first, or even moving 
the expensive tests into a separate function.

And don't be so stingy with spaces in your source code. Spacing helps 
readability by reducing the density of characters.

So, here's my version:

import re
import dateutil.parser  # third-party package (python-dateutil)


def _re_match_items(s):
    # Set up some regular expressions.
    COMMON_RE = r'\$?([-+]?[0-9,]*\.?[0-9,]+)'
    FLOAT_RE = COMMON_RE + '$'
    BRACKETED_FLOAT_RE = r'\(' + COMMON_RE + r'\)$'
    DATE_RE = r'\d{1,2}-\w+-\d{1,2}$'
    mo = re.match(FLOAT_RE, s)  # "mo" short for "match object"
    if mo:
        return float(mo.group(1).replace(',', ''))
    # Otherwise mo will be None and we go on to the next test.
    mo = re.match(BRACKETED_FLOAT_RE, s)
    if mo:
        return -float(mo.group(1).replace(',', ''))
    if re.match(DATE_RE, s):
        return dateutil.parser.parse(s, dayfirst=True)
    raise ValueError("bad string can't be matched")


def convert_data_item(s):
    if s == '-':
        return None
    else:
        try:
            return _re_match_items(s)
        except ValueError:
            print("Unrecognized format %s" % s)
            return s
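
To see what the two numeric patterns accept, here is a standalone sketch that leaves out the date branch so it runs without dateutil. The helper name parse_number and the sample inputs are invented for this illustration. It returns None instead of raising, purely to keep the demonstration short.

```python
import re

COMMON_RE = r'\$?([-+]?[0-9,]*\.?[0-9,]+)'
FLOAT_RE = COMMON_RE + '$'
BRACKETED_FLOAT_RE = r'\(' + COMMON_RE + r'\)$'


def parse_number(s):
    """Hypothetical helper: return a float, or None if s does not match."""
    mo = re.match(FLOAT_RE, s)
    if mo:
        # Strip thousands separators before converting.
        return float(mo.group(1).replace(',', ''))
    mo = re.match(BRACKETED_FLOAT_RE, s)
    if mo:
        # Accounting style: brackets mean a negative amount.
        return -float(mo.group(1).replace(',', ''))
    return None


for text in ['42', '$1,234.50', '-3.5', '(1,234.50)', '12abc', '-']:
    print("%r -> %r" % (text, parse_number(text)))
```

Be aware that the pattern is permissive about commas: a lone ',' matches, and float('') then raises ValueError. In the version above that error is caught by convert_data_item, which reports the string as unrecognized.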



Hope this helps.


-- 
Steven


