[Tutor] Regex across multiple lines

Ed Singleton singletoned at gmail.com
Thu Apr 27 12:16:08 CEST 2006


On 26/04/06, Liam Clarke <ml.cyresse at gmail.com> wrote:
> Hi Frank, just bear in mind that the pattern:
>
> patObj = re.compile("<title>.*</title>", re.DOTALL)
>
> will match
>
> <title>
>    This is my title
> </title>
>
> But, it'll also match
>
> <title>
>    This is my title
> </title>
> <p>Some content here</p>
> <title>
>     Another title; not going to happen with a title tag in HTML, but
> more an illustration
> </title>
>
> All of that.
>
> Got to watch .* with re.DOTALL; try using .*? instead, it makes it
> non-greedy. Functionality for your current use case won't change, but
> you won't spend ages when you have a different use case trying to
> figure out why half your data is matching. >_<
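
To make the difference concrete, here is a minimal sketch comparing the greedy and non-greedy patterns on the sample above. The variable names (html, greedy, lazy) are only illustrative and are not from Liam's message.

```python
import re

# Sample document with two <title> elements, as in the illustration above.
html = """<title>
   This is my title
</title>
<p>Some content here</p>
<title>
    Another title
</title>"""

# Greedy: .* runs to the LAST </title> it can reach.
greedy = re.compile(r"<title>.*</title>", re.DOTALL)
# Non-greedy: .*? stops at the FIRST </title> it can reach.
lazy = re.compile(r"<title>.*?</title>", re.DOTALL)

greedy_matches = greedy.findall(html)
lazy_matches = lazy.findall(html)

print(len(greedy_matches))  # 1 (one match swallowing everything)
print(len(lazy_matches))    # 2 (one match per <title> element)
```

The greedy version returns the whole string as a single match, <p> element included, while the non-greedy version returns each title separately.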

When I only want a tag like <title> that contains no nested tags, I sometimes use:

<title>[^<]*</title>

though for anything but the most trivial cases, it's often better to
use BeautifulSoup.
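
A rough sketch of how the [^<]* pattern behaves, assuming Python's re module. The capture group and the names title_pat, simple and nested are my additions for illustration. Note that [^<] matches newlines too, so re.DOTALL isn't needed here.

```python
import re

# [^<]* matches any run of characters that are not '<', newlines included,
# so it cannot run past the closing tag and needs no re.DOTALL.
# The parentheses (an addition for this sketch) capture just the title text.
title_pat = re.compile(r"<title>([^<]*)</title>")

simple = (
    "<title>\n   This is my title\n</title>\n"
    "<p>Some content here</p>\n"
    "<title>Another</title>"
)
nested = "<title>My <b>bold</b> title</title>"

# Each title is matched separately; strip the surrounding whitespace.
titles = [t.strip() for t in title_pat.findall(simple)]
print(titles)  # ['This is my title', 'Another']

# With a nested tag the pattern hits '<b>' before '</title>' and fails.
nested_match = title_pat.search(nested)
print(nested_match)  # None
```

The second case is the limitation: as soon as the tag contains any other markup the pattern silently matches nothing, which is where a real HTML parser starts to pay off.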

Ed

