String Manipulation Help!

Sat Jan 28 16:47:01 EST 2006

"Dave" <davidworley at gmail.com> wrote in message
news:1138481853.165529.321870 at z14g2000cwz.googlegroups.com...
> OK, I'm stumped.
>
> I'm trying to find newline characters (\n, specifically) that are NOT
> in comments.
>
> So, for example (where "<-" = a newline character):
> ==========================================
> 1: <-
> 2: /*<-
> 3: ----------------------<-
> 4:     comment<-
> 5: ----------------------<-
> 6: */<-
> 7: <-
> 8: CODE CODE CODE<-
> 9: <-
> ==========================================
>
> I want to return the newline characters at lines 1, 6, 7, 8, and 9 but
> NOT the others.
>

Dave -

Pyparsing has built-in support for detecting line breaks and comments, and
the syntax is pretty simple, I think.  Here's a pyparsing program that gives
your desired results:

===============================
from pyparsing import lineEnd, cStyleComment, lineno

testsource = """
/*
----------------------
    comment
----------------------
*/

CODE CODE CODE

"""

# define the expression you want to search for
eol = lineEnd

# specify that you don't want to match within C-style comments
eol.ignore(cStyleComment.leaveWhitespace())

# loop through all the occurrences returned by scanString
# and print the line number of that location within the original string
for toks, startloc, endloc in eol.scanString(testsource):
    print(lineno(startloc, testsource))
===============================

The expression you are searching for is pretty basic: a plain end-of-line,
which is pyparsing's built-in expression lineEnd.  The curve you are
throwing is that you *don't* want eols inside of C-style comments.
Pyparsing lets you designate an "ignore" expression to skip unwanted
content, and since ignoring comments comes up so often in parsing,
pyparsing includes common comment expressions for C, C++, Java, Python,
and HTML.  Next, pyparsing's version of re.search is scanString (it is
really closer to re.finditer, since it reports every match).  scanString
returns a generator that yields the matching tokens, start location, and
end location for each occurrence of the given parse expression, in your
case, eol.  Finally, in the body of the for loop, we use pyparsing's lineno
function to convert a string location into a line number within the
original string.
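
For comparison, here is a minimal sketch of the same scan-and-skip idea
using only the standard library's re module rather than pyparsing.  The
names TOKEN, newline_lines_outside_comments, and sample are made up for
illustration.  It assumes every comment is properly closed, and it does not
try to handle "/*" appearing inside string literals.

```python
import re

# Either a whole C-style comment or a single newline.  The comment
# alternative comes first, so newlines inside a comment are consumed
# together with the comment and never reported on their own.
TOKEN = re.compile(r"/\*.*?\*/|\n", re.DOTALL)


def newline_lines_outside_comments(text):
    """Return 1-based line numbers of newlines not inside /* ... */."""
    lines = []
    for match in TOKEN.finditer(text):
        if match.group() == "\n":
            # Line number = count of newlines before this offset, plus one.
            lines.append(text.count("\n", 0, match.start()) + 1)
    return lines


sample = """
/*
----------------------
    comment
----------------------
*/

CODE CODE CODE

"""

print(newline_lines_outside_comments(sample))  # [1, 6, 7, 8, 9]
```

The text.count() call plays roughly the role that lineno plays in the
pyparsing program: it turns a character offset into a line number.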

About the only real wart in all this is that pyparsing implicitly skips over
leading whitespace, even when looking for expressions to be ignored.  So as
not to lose eols that come just before a comment (like your line 1), we
have to call leaveWhitespace() on cStyleComment so that it does not skip
leading whitespace.
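
The effect of that whitespace skipping can be imitated with plain regular
expressions.  This is only an analogy, not a description of pyparsing's
internals, and the names STRICT, GREEDY_WS, and newline_lines are
hypothetical.  If the comment pattern is allowed to swallow leading
whitespace, the newline just before the comment is consumed as part of the
comment match and is never reported.

```python
import re

# Comment pattern that must start exactly at "/*".
STRICT = re.compile(r"/\*.*?\*/|\n", re.DOTALL)

# Comment pattern that first skips any leading whitespace, newlines
# included, imitating an ignore expression that skips whitespace.
GREEDY_WS = re.compile(r"\s*/\*.*?\*/|\n", re.DOTALL)


def newline_lines(pattern, text):
    """Return 1-based line numbers of the newlines matched by pattern."""
    return [
        text.count("\n", 0, m.start()) + 1
        for m in pattern.finditer(text)
        if m.group() == "\n"
    ]


text = "\n/*\n comment\n*/\n\nCODE\n"

strict_lines = newline_lines(STRICT, text)
greedy_lines = newline_lines(GREEDY_WS, text)

print(strict_lines)  # [1, 4, 5, 6]
print(greedy_lines)  # [4, 5, 6]
```

In the second result the newline on line 1 is missing, which is the same
kind of loss that leaveWhitespace() is there to prevent.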

Download pyparsing at http://pyparsing.sourceforge.net.

-- Paul