python regex: variable length of positive lookbehind assertion

Wed Jun 15 12:04:29 EDT 2016

alister writes:

> On Wed, 15 Jun 2016 15:55:42 +0300, Jussi Piitulainen wrote:
>
>> alister writes:
>> 
>>> On Tue, 14 Jun 2016 20:28:24 -0700, Yubin Ruan wrote:
>>>
>>>> Hi everyone,
>>>> I am struggling to write a regex that matches what I want:
>>>> 
>>>> Problem Description:
>>>> 
>>>> Given a string like this:
>>>> 
>>>>     >>> string = "false_head <a>aaa</a> <a>bbb</a> false_tail \
>>>>              true_head some_text_here <a>ccc</a> <a>ddd</a> <a>eee</a> \
>>>>              true_tail"
>>>> 
>>>> I want to match all the text surrounded by those "<a> </a>",
>>>> but only if those "<a> </a>" are located **some distance** after
>>>> "true_head". That is, I expect the result to be like this:
>>>> 
>>>>     >>> import re
>>>>     >>> result = re.findall("the_regex", string)
>>>>     >>> print result
>>>>     ["ccc","ddd","eee"]
>>>> 
>>>> How can I write a regex to match that?
>>>> I have tried to use the **positive lookbehind assertion** in Python
>>>> regex, but it does not allow a variable-length lookbehind.
>>>> 
>>>> Thanks in advance,
>>>> Ruan
>>>
>>> Don't try to use regex to parse HTML; it won't work reliably. I am
>>> surprised no one has mentioned BeautifulSoup yet, which is probably
>>> what you require.
>> 
>> Nothing in the question indicates that the data is HTML.
>
> the <a></a> tags are a pretty good indicator though

I can see how they point that way, but to me that alone seemed pretty
weak.

> even if it is not HTML the same advice stands for XML (the quoted
> example would be invalid if it was XML)

It's not valid HTML either, for similar reasons. Or is it? I don't even
want to know.

> if it is neither of these formats but still uses a similar tag
> structure then I would say that regex is still unsuitable & the OP
> would need to write a full parser for the format if one does not
> already exist

That depends on details that weren't provided.

I work with a data format that mixes element tags with line-oriented
data records, and having a dedicated parser would be more of a hassle. A
couple of very simple regexen are useful in making sure that start tags
have a valid form and extracting attribute-value pairs from them. I'm
not at all experiencing "two problems" here. Some uses of regex are
good. (And now I may be about to experience the third problem. That
makes me sad.)
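
A minimal sketch of that kind of check, under assumptions: the tag shape
(a name followed by double-quoted attr="value" pairs) and the names
START_TAG, ATTR and parse_start_tag are hypothetical, made up for
illustration, and are not the actual format described above.

```python
import re

# Hypothetical start-tag shape: <name attr="value" attr2="value2">
NAME = r'[A-Za-z_][\w.-]*'
START_TAG = re.compile(r'<(' + NAME + r')((?:\s+' + NAME + r'="[^"<>]*")*)\s*>')
ATTR = re.compile(r'(' + NAME + r')="([^"<>]*)"')

def parse_start_tag(line):
    """Return (tag_name, attributes) for a well-formed start tag, else None."""
    m = START_TAG.fullmatch(line.strip())
    if m is None:
        return None  # a data record, or a malformed tag
    # The second regex only pulls pairs out of text the first one validated.
    return m.group(1), dict(ATTR.findall(m.group(2)))
```

Here parse_start_tag('<sentence id="s1" lang="en">') gives the tag name
and a dict of its attributes, while a line such as '<sentence id=s1>' or
an ordinary data record gives None.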

Anyway, I think you and another person guessed correctly that the OP is
indeed really considering HTML, and then your suggestion is certainly
helpful.
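
If the data is not really HTML, one way around the fixed-width
lookbehind limit is to skip the lookbehind altogether: locate the marker
first, then run a simple regex on the text after it. The sketch below
assumes "true_head" and "true_tail" are literal markers, as in the
original example; the function name links_after is hypothetical.

```python
import re

string = ("false_head <a>aaa</a> <a>bbb</a> false_tail "
          "true_head some_text_here <a>ccc</a> <a>ddd</a> <a>eee</a> "
          "true_tail")

def links_after(marker, text, end_marker=None):
    """Return the contents of <a>...</a> found after marker (up to end_marker)."""
    start = text.find(marker)
    if start == -1:
        return []  # marker absent: nothing qualifies
    start += len(marker)
    end = text.find(end_marker, start) if end_marker else -1
    if end == -1:
        end = len(text)  # no end marker: search to the end of the text
    # Non-greedy match so each <a>...</a> pair is captured separately.
    return re.findall(r"<a>(.*?)</a>", text[start:end])

print(links_after("true_head", string, "true_tail"))
# prints ['ccc', 'ddd', 'eee']
```

For real HTML, a parser such as BeautifulSoup remains the more reliable
choice, as suggested earlier in the thread.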