Regexes: How to handle escaped characters

Thu May 17 18:16:43 EDT 2007

On May 17, 4:06 pm, John Machin <sjmac... at lexicon.net> wrote:
> On May 18, 6:00 am, Torsten Bronger <bron... at physik.rwth-aachen.de>
> wrote:
>
> > Hallöchen!
>
> > James Stroud writes:
> > > Torsten Bronger wrote:
>
> > >> I need some help with finding matches in a string that has some
> > >> characters which are marked as escaped (in a separate list of
> > >> indices).  Escaped means that they must not be part of any match.
>
> > >> [...]
>
> > > You should probably provide examples of what you are trying to do
> > > or you will likely get a lot of irrelevant answers.
>
> > Example string: u"Hollo", escaped positions: [4].  Thus, the second
> > "o" is escaped and must not be found by the regexp searches.
>
> > Instead of re.search, I call the function guarded_search(pattern,
> > text, offset) which takes care of escaped characters.  Thus, while
>
> >     re.search("o$", string)
>
> > will find the second "o",
>
> >     guarded_search("o$", string, 0)
>
> Huh? Did you mean 4 instead of zero?
>
> > won't find anything.
>
> Quite apart from the confusing use of "escape", your requirements are
> still as clear as mud. Try writing up docs for your "guarded_search"
> function. Supply test cases showing what you expect to match and what
> you don't expect to match. Is "offset" the offset in the text? If so,
> don't you really want a set of "forbidden" offsets, not just one?
>
> >  But how to program "guarded_search"?
> > Actually, it is about changing the semantics of the regexp syntax:
> > "." no longer means "any character except newline" but "any
> > character except newline and characters marked as escaped".
>
> Make up your mind whether you are "escaping" characters [likely to be
> interpreted by many people as position-independent] or "escaping"
> positions within the text.
>
> >  And so
> > on, for all syntax elements of regular expressions.  Escaped
> > characters must spoil any match, however, the regexp machine should
> > continue to search for other matches.
>
> Whatever your exact requirement, it would seem unlikely to be so
> wildly popularly demanded as to warrant inclusion in the "regexp
> machine". You would have to write your own wrapper, something like the
> following totally-untested example of one possible implementation of
> one possible guess at what you mean:
>
> import re
> def guarded_search(pattern, text, forbidden_offsets, overlap=False):
>     regex = re.compile(pattern)
>     pos = 0
>     while True:
>         m = regex.search(text, pos)
>         if not m:
>             return
>         start, end = m.span()
>         for bad_pos in forbidden_offsets:
>             if start <= bad_pos < end:
>                 break
>         else:
>             yield m
>         if overlap:
>             pos = start + 1
>         else:
>             pos = end
> 8<-------
>
> HTH,
> John

Here are two pyparsing-based routines, guardedSearch and
guardedSearchByColumn.  The first uses a pyparsing parse action to
reject any match that starts at a forbidden string location, and
returns a list of tuples containing the string location and the
matched text.  The second is a variant of guardedSearch that uses the
pyparsing built-ins col and lineno to filter matches by column instead
of by raw string location, and returns a list of tuples containing the
line and column of the match location and the matched text.  (Note
that string locations are zero-based, while line and column numbers
are 1-based.)
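
To make the offset/column distinction concrete, here is a minimal
stdlib-only sketch of converting a zero-based string location into a
1-based (line, column) pair.  The helper name line_and_column is
hypothetical and is not part of pyparsing; it only mirrors what lineno
and col are described as returning above.

```python
def line_and_column(text, offset):
    """Convert a zero-based offset into a 1-based (line, column) pair."""
    # Lines are counted by the newlines that precede the offset.
    line = text.count("\n", 0, offset) + 1
    # rfind returns -1 when no newline precedes the offset, which makes
    # the first character of the text land in column 1.
    column = offset - text.rfind("\n", 0, offset)
    return line, column


sample = "ab;c\nd;e\n"
positions = [line_and_column(sample, i)
             for i, ch in enumerate(sample) if ch == ";"]
print(positions)  # [(1, 3), (2, 2)]
```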

-- Paul

from pyparsing import Regex,ParseException,col,lineno

def guardedSearch(pattern, text, forbidden_offsets):

    def offsetValidator(strng,locn,tokens):
        if locn in forbidden_offsets:
            # ParseException takes (string, location, message)
            raise ParseException(strng, locn,
                                 "can't match at offset %d" % locn)

    regex = Regex(pattern).setParseAction(offsetValidator)
    return [ (tokStart,toks[0]) for toks,tokStart,tokEnd in
                regex.scanString(text) ]

print(guardedSearch(u"o", u"Hollo how are you", [4,]))

def guardedSearchByColumn(pattern, text, forbidden_columns):

    def offsetValidator(strng,locn,tokens):
        if col(locn,strng) in forbidden_columns:
            # ParseException takes (string, location, message)
            raise ParseException(strng, locn,
                                 "can't match at offset %d" % locn)

    regex = Regex(pattern).setParseAction(offsetValidator)
    return [ (lineno(tokStart,text),col(tokStart,text),toks[0])
                for toks,tokStart,tokEnd in regex.scanString(text) ]

text = """\
alksjdflasjf;sa
a;sljflsjlaj
;asjflasfja;sf
aslfj;asfj;dsf
aslf;lajdf;ajsf
aslfj;afsj;sd
"""
print(guardedSearchByColumn(";", text, [1,6,11,]))

Prints:
[(1, 'o'), (7, 'o'), (15, 'o')]
[(1, 13, ';'), (2, 2, ';'), (3, 12, ';'), (5, 5, ';')]
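
Note that the parse actions above only test the location where a match
starts.  If the requirement is that no forbidden offset may fall
anywhere inside a match (as in the "o$" example from the original
question), a sketch along the following lines, using only the stdlib
re module, might be closer.  The name guarded_finditer is hypothetical,
and this is one guess at the intended behavior, not a full redefinition
of regexp semantics.

```python
import re


def guarded_finditer(pattern, text, forbidden_offsets):
    """Yield matches whose span contains no forbidden offset."""
    forbidden = set(forbidden_offsets)
    regex = re.compile(pattern)
    pos = 0
    while pos <= len(text):
        m = regex.search(text, pos)
        if m is None:
            return
        start, end = m.span()
        if any(start <= bad < end for bad in forbidden):
            # Rejected: retry one character past the rejected start so
            # later candidates inside the rejected span are still seen.
            pos = start + 1
        else:
            yield m
            # Always advance, even after a zero-width match.
            pos = end if end > start else end + 1


print([m.start() for m in guarded_finditer("o", "Hollo how are you", [4])])
print(list(guarded_finditer("o$", "Hollo", [4])))
```

One known limitation of this sketch: a rejected match is discarded
whole, so a shorter match at the same start that would have avoided the
forbidden offset is not found.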