Regexes: How to handle escaped characters

Fri May 18 03:35:06 EDT 2007

Hello!

John Machin writes:

> On May 18, 6:00 am, Torsten Bronger <bron... at physik.rwth-aachen.de>
> wrote:
>
>> [...]
>>
>> Example string: u"Hollo", escaped positions: [4].  Thus, the
>> second "o" is escaped and must not be found by the regexp
>> searches.
>>
>> Instead of re.search, I call the function guarded_search(pattern,
>> text, offset) which takes care of escaped characters.  Thus, while
>>
>>     re.search("o$", string)
>>
>> will find the second "o",
>>
>>     guarded_search("o$", string, 0)
>
> Huh? Did you mean 4 instead of zero?

No, the "offset" parameter is like the "pos" parameter in the search
method of regular expression objects.  It's like

    guarded_search("o$", string[offset:])

Actually, my real guarded_search also has an "endpos" parameter.
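
To illustrate the "pos"/"endpos" semantics referred to above, here is a
small sketch using only the standard re module.  The sample string
"Hollo!" is made up for the illustration; it is not the guarded_search
function itself.

```python
import re

text = "Hollo!"
pattern = re.compile("o")

# Searching with pos reports indexes into the full string.
m_pos = pattern.search(text, 2)
# Searching a slice reports indexes relative to the slice.
m_slice = pattern.search(text[2:])

print(m_pos.start())    # 4
print(m_slice.start())  # 2

# endpos behaves as if the string ended there, so "$" can match there.
m_end = re.compile("o$").search(text, 0, 5)
print(m_end.start())    # 4
```

So "pos" and slicing find the same character here, but only "pos"
keeps the reported positions comparable with a list of escaped
positions that refers to the full string.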

> [...]
>
> Quite apart from the confusing use of "escape", your requirements are
> still as clear as mud. Try writing up docs for your "guarded_search"
> function.

Note that I don't want to add functionality to the stdlib; I just
want to solve my tiny annoying problem.  Okay, here is a more
complete story:

I've specified a simple text document syntax, like reStructuredText,
MediaWiki, LaTeX or whatever.  I already have a preprocessor for it,
and now I am trying to implement the parser.  A section heading looks
like this:

Introduction
============

Thus, my parser searches (among many other things) for
r"\n\s*={4,}\s*$".  However, the author can escape any character
with a backslash:

Introduction     or     Introduction
\===========            ====\=======

This means the first (or fifth) equals sign is a literal equals sign
and not part of a heading underline.  This must not be interpreted
as the start of a section.  The preprocessor generates
u"===========" with escaped_positions=[0].  (Or [4], in the
right-hand case.)

This is why I cannot use normal search methods.
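
One possible way to approach this, sketched under assumptions and not
taken from the actual parser, is to replace every escaped character
with a placeholder that no pattern mentions and then run an ordinary
search on the masked copy.  The names mask_escaped,
guarded_search_masked and PLACEHOLDER are hypothetical, the escaped
positions are assumed to index into the whole text, and the code is
written for Python 3 strings.

```python
import re

# Assumption: this private-use code point never occurs in real input.
PLACEHOLDER = "\ue000"

def mask_escaped(text, escaped_positions, placeholder=PLACEHOLDER):
    """Return a copy of text with each escaped position replaced."""
    chars = list(text)
    for pos in escaped_positions:
        chars[pos] = placeholder
    return "".join(chars)

def guarded_search_masked(pattern, text, escaped_positions,
                          pos=0, endpos=None):
    """Search the masked copy; match indexes stay valid for text."""
    masked = mask_escaped(text, escaped_positions)
    if endpos is None:
        endpos = len(masked)
    return re.compile(pattern).search(masked, pos, endpos)

heading = r"\n\s*={4,}\s*$"
plain = "Introduction\n==========="   # underline starts at index 13

print(guarded_search_masked(heading, plain, []) is not None)    # True
print(guarded_search_masked(heading, plain, [13]) is not None)  # False
print(guarded_search_masked(heading, plain, [17]) is not None)  # False
```

Because the masked copy has the same length as the original, anchors
such as "^" and "$" and all match positions behave as in a normal
search.  The obvious limitation is that wildcard patterns such as "."
still match the placeholder, so this only helps when escaped
characters should never count as the literal character they replace.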

> [...]
>
> Whatever your exact requirement, it would seem unlikely to be so
> wildly popularly demanded as to warrant inclusion in the "regexp
> machine". You would have to write your own wrapper, something like
> the following totally-untested example of one possible
> implementation of one possible guess at what you mean:
>
> import re
> def guarded_search(pattern, text, forbidden_offsets, overlap=False):
>     regex = re.compile(pattern)
>     pos = 0
>     while True:
>         m = regex.search(text, pos)
>         if not m:
>             return
>         start, end = m.span()
>         for bad_pos in forbidden_offsets:
>             if start <= bad_pos < end:
>                 break
>         else:
>             yield m
>         if overlap:
>             pos = start + 1
>         else:
>             pos = end
> 8<-------

This is similar to my current approach.  However, it also finds too
many "^a" patterns because it starts a fresh search at different
positions.
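
How many "^a" hits a restarted search produces seems to depend on how
the restart is done.  The following sketch uses only the standard re
module with a made-up sample string, and compares restarting through
the "pos" argument of a compiled pattern with restarting on a slice:

```python
import re

text = "aaa"
pattern = re.compile("^a")

# Restart via the pos argument: "^" still only matches at index 0.
hits_pos = [pos for pos in range(len(text))
            if pattern.search(text, pos)]

# Restart by slicing: each slice has its own beginning, so "^"
# matches every time.
hits_slice = [pos for pos in range(len(text))
              if pattern.search(text[pos:])]

print(hits_pos)    # [0]
print(hits_slice)  # [0, 1, 2]
```

With the slicing variant, the surplus matches at 1 and 2 are exactly
the "too many" hits; with "pos", "^" stays tied to the real beginning
of the string (or to line starts when re.MULTILINE is set).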

Bye,
Torsten.

--
Torsten Bronger, aquisgrana, europa vetus
                                      Jabber ID: bronger at jabber.org
                      (See http://ime.webhop.org for ICQ, MSN, etc.)