Regular Expression - old regex module vs. re module

Jim Segrave jes at nl.demon.net
Sat Jul 1 09:19:05 EDT 2006


In article <1151707747.611123.65800 at m73g2000cwd.googlegroups.com>,
Steve <stever at cruzio.com> wrote:
>Hi All!
>
>Thanks for your suggestions and comments!  I was able to use some of
>your code and suggestions and have come up with this new version of
>Report.py.
>
>Here's the updated code :
>
>exponentPattern = re.compile('\(^\|[^\\#]\)|#+\.#+\*\*\*\*')
>floatPattern = re.compile('\(^\|[^\\#]\)|#+\.#+')
>integerPattern = re.compile("\(^\|[^\\#]\)|\##+")
>leftJustifiedStringPattern = re.compile('\(^\|[^\\<]\)|\<<+')
>rightJustifiedStringPattern = re.compile('\(^\|[^\\>]\)|\>>+')


Some comments and suggestions 

If you want to include backslashes in a string, either
use raw strings, or use double backslashes (raw strings are much
easier to read). Otherwise, you have an accident waiting to happen -
'\(' _does_ make a two character string as you desired, but '\v' makes
a one character string, which is unlikely to be what you wanted.

That's a stylistic point. But more serious is the leading part of all
your regexes - '\(^\|[^\\#]\)|'. I'm not sure what you're trying to
accomplish - presumably to skip over escaped format characters, but
that's not what they do.

\( says to look for an open parenthesis (you've escaped it, so it's
not a grouping character.  ^ says look at the start of the line. This
means the first character can never match an open parens, so this
entire term, up to the alternate expression (after the non-escaped
pipe symbol) never matches anything. If you want to ignore escaped
formating characters before a format, then you should use a negative
lookbehind assertation (see the library reference, 4.2.1, Regular
Expression Syntax:

'(?<!\\)' 

This says that the match to the format can't start
immediately after a backslash. You need to make a final
pass over your format to remove the extra backslashes, otherwise they
will appear in the output, which you do, but you replace them with
spaces, which may not be the right thing to do - how could you output
a line of like '################' as part of a template? 

Other odds and ends comments interleaved here

>###################################################################
>#              _make_printf                                       #
>###################################################################
>
>def _make_printf(s):
>    """Convert perl style format symbology to printf tokens.
>
>    Take a string and substitute computed printf tokens for perl style
>    format symbology.
>
>    For example:
>
>        ###.##    yields %6.2f
>        ########  yields %8d
>        <<<<<     yields %-5s
>    """
>#    print("Original String = %s\n\n") % (s)
>
>
>    while 1:                          # process all sci notation fields
>        if exponentPattern.search(s) < 0: break
>        i1 , i2 = exponentPattern.search(s).span()
>        width_total = i2 - i1
>        field = s[i1:i2-4]
>        width_mantissa = len( field[string.find(field,'.')+1:] )
>        f = '%'+`width_total`+'.'+`width_mantissa`+'e'
>        s = exponentPattern.sub(f, s, 1)

There are better ways to examine a match than with span() - consider
using grouping in your regex to get the mantissa width and the total
width, use a regex with grouping like this:

If:

exponentPattern = re.compile(r'(?<!\\)(#+\.(#+)\*\*\*\*'))

then you could do this:
         m = re.match(exponentPattern, s)
         if m:
             s = exponentPattern.sub("%%%d.%de" % (len(m.groups()[0],
                                                   len(m.groups()[1]), s, 1)

m.groups()[0] will be the entire '#+\.#\*\*\'*\*') match, in other
words the field width to be printed

m.groups()[1] will be the string after the decimal point, not
inclouding the '*'s

In my opinion, building the string by using the sprintf like '%'
formatting operator, rather than adding together a bunch of substrings
is easier to read and maintain.

Similar use of grouping can be done for the other format string types.


>    s = re.sub('\\\\', ' ', s)
>    return s

As I noted, should backslashes be converted to spaces? And again, it's
easier to type and read if it uses raw strings:

     s = re.sub(r'\\', ' ', s)


>###################################################################
>#              ReportTemplate                                     #
>###################################################################
>
>class ReportTemplate:
>    """Provide a print formatting object.
[Snip]
>    The total width of the symbol and it's decimal point position is
>    used to compute the appropriate printf token; see 'make_printf'
>    method. The symbol must have at least two adjacent characters for

A minor nit - make_printf is better described as a function, not a
method (it's not part of the ReportTemplate class or any other cleass)

[SNIP]

>    def __init__( self, template = ''):
>        self.body = []
>        self.vars = []
>
>        #read in and parse a format template
>        try:
>            tpl = open(template, 'r')
>            lines = string.split(tpl.read(), '\n')[:-1]

You'd be better off to use something like:

             lines = []
             for l in open(template, 'r').readlines():
                lines.append(l.rstrip)
             
The use of rstrip discards any trailing whitespace and deals with
reading Windows generated files on a Unix box, where lines will end in
CRLF and you'd strip only the LF

>        except IOError:
>            lines = string.split(template, '\n')

I have my doubts about the advisability of assuming that you are
either passed the name of a file containing a template or a
template itself. A misspelled file name won't raise
an error, it will simply be processed as a fixed output. I would have
passed a flag to say if the template argument was file name or a
template and terminated with an error if it was a non-existant file.

[SNIP]

>    def _format( self, dataobj ):
>        """Return the values of the given data object substituted into
>        the template format stored in this object.
>        """
>        # value[] is a list of lists of values from the dataobj
>        # body[] is the list of strings with %tokens to print
>        # if value[i] == None just print the string without the %
>argument
>        s = ''
>        value = []
>
>        for i in range(self.nrows):
>            value.append([])
>
>        for i in range(self.nrows):
>            for vname in self.vars[i]:
>                try:
>                    if string.find(vname, '[') < 0:
>                        # this is the nominal case and a simple get
>will be faster
>                        value[i].append(getattr(dataobj, vname))
>                    else:
>                        # I use eval so that I can support sequence
>values
>                        # although it's slow.
>                        value[i].append(eval('dataobj.'+vname))
>                except AttributeError, SyntaxError:
>                    value[i].append('')

There's another way to do this - use getattr to retrieve the sequence,
then use __getitem__ to index it

Something like this would work, again using a regex to separate out
the index (you might want the regex compiled once at __init__
time). The regex looks for a run of characters valid in a python
variable name followed by an optional integer (with or without sign)
index in square brackets. m.groups()[0] is the variable name portion,
m.grousp()[1] is the index if there's a subscripting term.

                 try:
                   m = re.match(r'([a-zA-Z_][a-zA-Z0-9._]*)\s*(?:\[\s*([+-]?\d+)\s*\])?\s*',
                     value[i])
                   if not m: raise SyntaxError
                   if m.groups()[1] is None:
                      value[i].append(getattr(dataobj, vname))
                   else:
		      value[i].append(getattr(m.groups()[0]).\
                         __getitem__(int(m.groups()[1])))
                 except AttributeError, SyntaxError, IndexError:
>                    value[i].append('')

This is a bit ugly, but avoids eval with all the dangers it carries -
a deliberately hacked template file can be used to do a lot of damage,
a badly written one could be hard to debug

>            if value[i][0] != '':
>                try:
>                    temp_vals = []
>                    for item in value[i]:
>                        # items on the list of values for this line
>                        # can be either literals or lists
>                        if type(item) == ListType:

Might you be better off asking if item has a __getitem__? It would
then work with tuples and Extending to dictionaries would then be easier

>def isReportTemplate(obj):
>    """Return 1 if obj is an instance of class ReportTemplate.
>    """
>    if  type(obj) == InstanceType and \
>        string.find(`obj.__class__` , ' ReportTemplate ') > -1:
>        return 1
>    else:
>        return 0

Why not just use isinstance(obj, classname)?

>###################################################################
>#              ColumnReportTemplate                               #
>###################################################################
>
>class ColumnReportTemplate:
>    """This class allows one to specify column oriented output formats.
>


-- 
Jim Segrave           (jes at jes-2.demon.nl)




More information about the Python-list mailing list