Stripping C-style comments using a Python regexp

Jeff Epler jepler at unpythonic.net
Wed Jul 27 12:37:30 EDT 2005


#------------------------------------------------------------------------
import re, sys

def q(c):
    """Returns a regular expression that matches a region delimited by c,
    inside which c may be escaped with a backslash"""

    return r"%s(\\.|[^%s])*%s" % (c, c, c)

double_quoted_string = q('"')
single_quoted_string = q("'")
c_comment = r"/\*.*?\*/"
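# Stop before the newline so that lines are not joined together and a
# trailing // comment without a final newline is still matched.
cxx_comment = r"//[^\n]*"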

rx = re.compile("|".join([single_quoted_string, double_quoted_string,
                            c_comment, cxx_comment]), re.DOTALL)

def replace(x):
    x = x.group(0)
    if x.startswith("/"): return ' '
    return x

result = rx.sub(replace, sys.stdin.read())
sys.stdout.write(result)
#------------------------------------------------------------------------

The regular expression matches ""-strings, ''-character-constants,
C comments, and C++ comments.  The replace function returns ' ' (space)
when the matched thing was a comment, or the original thing otherwise.
Depending on your use for this code, you may want replace() to return as
many '\n's as are in the matched comment, or ' ' if it contains none, so
that line numbers remain unchanged.
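
A minimal sketch of that line-preserving variant, assuming the same
token pattern as the script.  The names replace_keep_lines and
strip_comments are made up for this illustration and are not part of
the script:

```python
import re

def q(c):
    """Regex for a region delimited by c, where c may be backslash-escaped."""
    return r"%s(\\.|[^%s])*%s" % (c, c, c)

# Same four alternatives as the script: "" strings, '' constants,
# /* */ comments, and // comments (including their trailing newline).
rx = re.compile("|".join([q('"'), q("'"), r"/\*.*?\*/", r"//[^\n]*[\n]"]),
                re.DOTALL)

def replace_keep_lines(m):
    """Hypothetical replacement callback that keeps the line count stable."""
    text = m.group(0)
    if text.startswith("/"):
        newlines = "\n" * text.count("\n")
        # A comment with no newline in it still needs a separator, so that
        # a/**/b does not become the single token ab.
        return newlines if newlines else " "
    return text  # string and character literals pass through untouched

def strip_comments(source):
    """Hypothetical wrapper around rx.sub for use outside a stdin filter."""
    return rx.sub(replace_keep_lines, source)
```

With this callback a comment spanning three lines is replaced by two
newlines, so compiler diagnostics on the stripped output still point at
the right lines of the original file.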

Basically, the regular expression is a tokenizer, and replace() chooses
what to do with each recognized token.  Things not recognized as tokens
by the regular expression are left unchanged.
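
To make the tokenizer view concrete, here is a small sketch that labels
each match instead of replacing it.  The group names and the tokens()
helper are assumptions added for illustration; the script itself uses
an unnamed pattern and only checks whether a match starts with "/":

```python
import re

# One named group per token kind.  Inner groups are non-capturing so that
# Match.lastgroup always reports the name of the alternative that matched.
TOKEN_RX = re.compile(
    r'''(?P<string>"(?:\\.|[^"])*")'''
    r"""|(?P<char>'(?:\\.|[^'])*')"""
    r'''|(?P<c_comment>/\*.*?\*/)'''
    r'''|(?P<cxx_comment>//[^\n]*)''',
    re.DOTALL,
)

def tokens(source):
    """Yield (kind, text) for every recognized token, in source order."""
    for m in TOKEN_RX.finditer(source):
        yield m.lastgroup, m.group(0)

if __name__ == "__main__":
    for kind, text in tokens('x = "a /* b */"; /* c */ y // z\n'):
        print(kind, repr(text))
```

Because the string alternatives come first and matching proceeds left
to right, the /* inside the string literal is consumed as part of the
string token and is never seen as the start of a comment.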

Jeff
PS this is the test file I used:
/* ... */ xyzzy;
456 // 123
const char *mystr =  "This is /*trouble*/";
/* * */
/* /* */
// /* /* */
/* // /* */
/*
 * */