Regexes: How to handle escaped characters

Paul McGuire ptmcg at austin.rr.com
Thu May 17 19:46:17 EDT 2007


On May 17, 6:12 pm, John Machin <sjmac... at lexicon.net> wrote:
>
> Note: "must not be *part of* any match" [my emphasis]
>
Oops, my bad.  This version rejects a match if any part of it falls on
a forbidden offset:

from pyparsing import Regex,ParseException,col,lineno,getTokensEndLoc

# fake (and inefficient) version of any, defined only if not yet
# upgraded to Py2.5 (so the real builtin is not shadowed)
try:
    any
except NameError:
    any = lambda lst : sum(list(lst)) > 0

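def guardedSearch(pattern, text, forbidden_offsets):
    # return (offset, matched text) for each match covering no forbidden offset

    def offsetValidator(strng,locn,tokens):
        # inclusive range of 0-based offsets covered by this match
        start,end = locn,getTokensEndLoc()-1
        if any( start <= i <= end for i in forbidden_offsets ):
            # raising from the parse action makes scanString skip this match
            raise ParseException(strng, locn, "can't match at offset %d" % locn)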

    regex = Regex(pattern).setParseAction(offsetValidator)
    return [ (tokStart,toks[0]) for toks,tokStart,tokEnd in
                regex.scanString(text) ]

print guardedSearch(ur"o\S", u"Hollo how are you", [8,])


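def guardedSearchByColumn(pattern, text, forbidden_columns):
    # return (line, column, matched text) for each match covering no forbidden column

    def offsetValidator(strng,locn,tokens):
        # inclusive range of 1-based columns covered by this match
        start,end = col(locn,strng), col(getTokensEndLoc(),strng)-1
        if any( start <= i <= end for i in forbidden_columns ):
            raise ParseException(strng, locn, "can't match at col %d" % start)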

    regex = Regex(pattern).setParseAction(offsetValidator)
    return [ (lineno(tokStart,text),col(tokStart,text),toks[0])
                for toks,tokStart,tokEnd in regex.scanString(text) ]

text = """\
alksjdflasjf;sa
a;sljflsjlaj
;asjflasfja;sf
aslfj;asfj;dsf
aslf;lajdf;ajsf
aslfj;afsj;sd
"""
print guardedSearchByColumn("[fa];", text, [4,12,13,])

Prints:
[(1, 'ol'), (15, 'ou')]
[(2, 1, 'a;'), (5, 10, 'f;')]
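
For anyone without pyparsing installed, roughly the same offset check
can be sketched with the standard re module.  This is only an
approximation under stated assumptions: guarded_search_re is a made-up
name (not part of re or pyparsing), the code is written for Python 3,
and after a rejected match it simply retries one character past the
start of that match.

```python
import re

def guarded_search_re(pattern, text, forbidden_offsets):
    """Return (start, matched_text) for matches covering no forbidden offset.

    Hypothetical helper for illustration; not part of re or pyparsing.
    """
    regex = re.compile(pattern)
    forbidden = set(forbidden_offsets)
    results = []
    pos = 0
    while pos <= len(text):
        m = regex.search(text, pos)
        if m is None:
            break
        start, end = m.start(), m.end()  # end is exclusive
        if any(i in forbidden for i in range(start, end)):
            # rejected: retry one character past the start of this match
            pos = start + 1
        else:
            results.append((start, m.group()))
            # accepted: continue after the match (step by 1 on an empty match)
            pos = end if end > start else end + 1
    return results

print(guarded_search_re(r"o\S", "Hollo how are you", [8]))
# [(1, 'ol'), (15, 'ou')]
```

The 'ow' at offsets 7-8 is dropped because it covers offset 8, even
though it does not start there.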

>
> While we're waiting for clarification from the OP, there's a chicken-
> and-egg thought that's been nagging me: if the OP knows so much about
> the searched string that he can specify offsets which search patterns
> should not span, why does he still need to search it?
>
I suspect that this is column/tabular data (a log file perhaps?), and
some columns are not interesting, but produce many false hits for the
search pattern.
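
If the data really is column-oriented, the same idea can be applied one
line at a time with plain re.  The sketch below makes several
assumptions: guarded_search_by_column_re is a hypothetical name, the
code is Python 3, columns are 1-based, and a match never spans a line
break.

```python
import re

def guarded_search_by_column_re(pattern, text, forbidden_columns):
    """Return (line, column, matched_text), line and column 1-based.

    Hypothetical helper for illustration. Matches that cover any
    forbidden column are skipped; each line is searched on its own.
    """
    regex = re.compile(pattern)
    forbidden = set(forbidden_columns)
    results = []
    for line_number, line in enumerate(text.splitlines(), 1):
        pos = 0
        while pos <= len(line):
            m = regex.search(line, pos)
            if m is None:
                break
            # 1-based inclusive column range covered by this match
            first_col, last_col = m.start() + 1, m.end()
            if any(c in forbidden for c in range(first_col, last_col + 1)):
                pos = m.start() + 1   # rejected: retry one character on
            else:
                results.append((line_number, first_col, m.group()))
                pos = m.end() if m.end() > m.start() else m.end() + 1
    return results

sample = """\
alksjdflasjf;sa
a;sljflsjlaj
;asjflasfja;sf
aslfj;asfj;dsf
aslf;lajdf;ajsf
aslfj;afsj;sd
"""
print(guarded_search_by_column_re("[fa];", sample, [4, 12, 13]))
# [(2, 1, 'a;'), (5, 10, 'f;')]
```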

-- Paul



