python regex: variable length of positive lookbehind assertion

Wed Jun 15 01:10:16 EDT 2016

Yubin Ruan writes:

> Hi everyone,
> I am struggling to write a regex that matches what I want:
>
> Problem Description:
>
> Given a string like this:
>
>     >>>string = "false_head <a>aaa</a> <a>bbb</a> false_tail \
>              true_head some_text_here <a>ccc</a> <a>ddd</a> <a>eee</a> true_tail"
>
> I want to match all the text enclosed in "<a> </a>" tags, but
> only where those tags occur **at some distance** after
> "true_head". That is, I expect the result to be like this:
>
>     >>> import re
>     >>> result = re.findall("the_regex", string)
>     >>> print result
>     ["ccc","ddd","eee"]
>
> How can I write a regex to match that?
> I have tried to use the **positive lookbehind assertion** in python regex,
> but it does not allow a variable-length lookbehind.

Don't.

Don't even try to do it all in one regex. Keep your regexen simple and
match in two steps.

For example, capture all such elements together with your marker:

re.findall(r'true_head|<a>[^<]+</a>', string)
==>
['<a>aaa</a>', '<a>bbb</a>',
 'true_head', '<a>ccc</a>', '<a>ddd</a>', '<a>eee</a>']

Then filter the result without any further regex: keep only the elements
that come after 'true_head' in the list. (A regex is needed again only
if recognizing the true 'true_head' requires one.) I've kept the tags at
this stage, so a possible '<a>true_head</a>' won't look like 'true_head'
yet.
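
The filtering step could look something like the sketch below. It assumes
that everything after the first 'true_head' counts, and that the tags are
exactly '<a>' and '</a>'. The helper name elements_after_marker is made up
for illustration.

```python
import re

string = ("false_head <a>aaa</a> <a>bbb</a> false_tail "
          "true_head some_text_here <a>ccc</a> <a>ddd</a> <a>eee</a> true_tail")

# Step 1: capture the marker and the elements, in order of appearance.
tokens = re.findall(r'true_head|<a>[^<]+</a>', string)

def elements_after_marker(tokens, marker='true_head'):
    """Return the text of each <a>...</a> token after the first marker.

    Hypothetical helper: assumes everything after the first marker counts.
    """
    if marker not in tokens:
        return []
    start = tokens.index(marker) + 1
    # Strip the literal '<a>' and '</a>' wrappers; skip any repeated marker.
    return [t[len('<a>'):-len('</a>')] for t in tokens[start:] if t != marker]

# Step 2: plain list filtering, no regex involved.
result = elements_after_marker(tokens)
print(result)  # ['ccc', 'ddd', 'eee']
```

Because the tags are still attached in step 1, an element whose text is
'true_head' arrives as '<a>true_head</a>' and is not mistaken for the
marker.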

Another way is to find 'true_head' first (if you can recognize it safely
before also recognizing the elements), and then capture the elements
only in the part of the string that follows it.
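
A rough sketch of this second approach, assuming that a plain substring
search is enough to recognize 'true_head' (if it is not, re.search and
the match's end() could locate the split point instead):

```python
import re

string = ("false_head <a>aaa</a> <a>bbb</a> false_tail "
          "true_head some_text_here <a>ccc</a> <a>ddd</a> <a>eee</a> true_tail")

# Split at the first 'true_head'; sep is '' when the marker is absent.
head, sep, tail = string.partition('true_head')

# Capture element text only in the part after the marker.
found = re.findall(r'<a>([^<]+)</a>', tail) if sep else []
print(found)  # ['ccc', 'ddd', 'eee']
```

Either way, each regex stays simple and no lookbehind is needed.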