Stripping C-style comments using a Python regexp

lorinh at gmail.com
Wed Jul 27 11:42:09 EDT 2005


Hi Folks,

I'm trying to strip C/C++ style comments (/* ... */ or //) from
source code using Python regexps.

If I don't have to worry about comments embedded in strings, it seems
pretty straightforward (this is what I'm using now):

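import re

cpp_pat = re.compile(r"""
    /\* .*? \*/ |    # C comments
    // [^\n\r]*      # C++ comments
""", re.S | re.X)

# Read the whole file, then keep the result of the substitution
with open('myprog.cpp') as f:
    s = f.read()
stripped = cpp_pat.sub(' ', s)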

However, the sticking point is dealing with tokens like /* embedded
within a string:

const char *mystr =  "This is /*trouble*/";
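
A minimal sketch of the failure, with the pattern above repeated under the illustrative name naive_pat so the snippet is self-contained. It only shows what happens to the one-line example; nothing here is specific to any real program:

```python
import re

# Same two alternatives as the pattern above; the name naive_pat is illustrative.
naive_pat = re.compile(r"""
    /\* .*? \*/ |   # C comments
    // [^\n\r]*     # C++ comments
""", re.S | re.X)

line = 'const char *mystr =  "This is /*trouble*/";'

# The pattern has no notion of string literals, so the text between
# /* and */ inside the quotes is replaced by a single space.
print(naive_pat.sub(' ', line))
# prints: const char *mystr =  "This is  ";
```

The string literal is silently altered, which changes the meaning of the program being processed.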

I've inherited a working Perl script, which I'd like to reimplement in
Python so that I don't have to spawn a new Perl process in my Python
program each time I want to strip comments from a file. The Perl script
looks like this:

#!/usr/bin/perl -w

$/ = undef;			# no line delimiter
$_ = <>;			# read entire file

s! ((['"]) (?: \\. | .)*? \2) | # skip quoted strings
   /\* .*? \*/ |		# delete C comments
   // [^\n\r]*			# delete C++ comments
 ! $1 || ' '			# change comments to a single space
 !xseg; 			# ignore white space, treat as single line
				# evaluate result, repeat globally
print;

The Perl regexp above uses some sort of conditional to deal with this,
replacing a quoted string with itself if the initial match is a
quoted string. Is there some equivalent feature in Python regexps?

Lorin
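
One possible Python counterpart, sketched rather than offered as a definitive answer: re.sub accepts a function as the replacement, and that function can return the matched text unchanged when the quoted-string group took part in the match. The names comment_or_string, _replace and strip_comments are hypothetical, chosen only for this illustration, and the pattern is a direct transcription of the Perl one, so it shares its limits (for example, it knows nothing about preprocessor directives or raw string literals).

```python
import re

# Hypothetical names throughout; the pattern mirrors the Perl script above.
comment_or_string = re.compile(r"""
    ( (['"]) (?: \\. | . )*? \2 )   # group 1: a quoted string, to be kept
  | /\* .*? \*/                     # C comment
  | // [^\n\r]*                     # C++ comment
""", re.S | re.X)

def _replace(match):
    # group(1) is None when a comment alternative matched instead of a string
    if match.group(1) is not None:
        return match.group(1)   # put the quoted string back unchanged
    return ' '                  # turn the comment into a single space

def strip_comments(source):
    """Return source with C and C++ comments replaced by single spaces."""
    return comment_or_string.sub(_replace, source)

if __name__ == '__main__':
    sample = 'const char *mystr =  "This is /*trouble*/"; /* gone */ int x; // tail\n'
    print(strip_comments(sample))
```

Because the regex engine scans left to right, a quote that opens a string is consumed by the first alternative before the comment alternatives ever see the /* inside it, which is the same trick the Perl version relies on.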



