Regex Question

Steven D'Aprano steve+comp.lang.python at pearwood.info
Sat Aug 18 10:22:29 EDT 2012


On Fri, 17 Aug 2012 21:41:07 -0700, Frank Koshti wrote:

> Hi,
> 
> I'm new to regular expressions. I want to be able to match for tokens
> with all their properties in the following examples. I would appreciate
> some direction on how to proceed.

Others have already given you excellent advice to NOT use regular 
expressions to parse HTML files, but to use a proper HTML parser instead.

However, since I remember how hard it was to get started with regexes, 
I'm going to ignore that advice and show you how to abuse regexes to 
search for text, and pretend that they aren't HTML tags.

Here's the string you want to search for:

> <h1>@foo1</h1>

You want to find a piece of text that starts with "<h1>@", followed by 
one or more alphanumeric characters, followed by "</h1>".


We start by compiling a regex:

import re
pattern = r"<h1>@\w+</h1>"
regex = re.compile(pattern, re.I)


First we import the re module. Then we define a pattern string. Note that 
I use a "raw string" instead of a regular string -- this is not 
compulsory, but it is very common.

The difference between a raw string and a regular string is how they 
handle backslashes. In Python, some (but not all!) backslashes are 
special. For example, the regular string "\n" is not two characters, 
backslash-n, but a single character, Newline. The Python string parser 
converts certain backslash combinations into special characters, e.g.:

\n => newline
\t => tab
\0 => ASCII Null character
\\ => a single backslash
etc.

We often call these "backslash escapes".

Regular expressions use a lot of backslashes, and so it is useful to 
disable the interpretation of backslash escapes when writing regex 
patterns. We do that with a "raw string" -- if you prefix the string with 
the letter r, the string is raw and backslash-escapes are ignored:

# ordinary "cooked" string:
"abc\n" => a b c newline

# raw string
r"abc\n" => a b c backslash n
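
If you want to convince yourself of the difference, ask Python for the 
length of each string. This is only a small sketch, and the names 
"cooked" and "raw" are mine, chosen for illustration:

```python
# Illustrative names only: "cooked" and "raw" mean nothing special to Python.
cooked = "abc\n"    # a, b, c, newline
raw = r"abc\n"      # a, b, c, backslash, n

print(len(cooked))          # 4
print(len(raw))             # 5

# Doubling the backslash in an ordinary string spells the same text
# as the raw string does.
print(raw == "abc\\n")      # True
```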


Here is our pattern again:

pattern = r"<h1>@\w+</h1>"

which is thirteen characters:

less-than h 1 greater-than at-sign backslash w plus-sign less-than slash 
h 1 greater-than

Most of the characters shown just match themselves. For example, the @ 
sign will only match another @ sign. But some have special meaning to the 
regex:

\w doesn't match "backslash w", but any alphanumeric character (a letter 
or a digit) or an underscore;

+ doesn't match a plus sign, but tells the regex to match the previous 
symbol one or more times. Since it immediately follows \w, this means 
"match at least one alphanumeric character".

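You can try these two special symbols on their own before using them in 
the full pattern. The sketch below assumes Python 3.4 or later, where 
re.fullmatch is available, and "matches_fully" is a helper name I made 
up for the example:

```python
import re

def matches_fully(pattern, s):
    """Hypothetical helper: True if the whole of s matches pattern."""
    return re.fullmatch(pattern, s) is not None

print(matches_fully(r"\w+", "foo1"))    # True: letters and digits
print(matches_fully(r"\w+", "foo_1"))   # True: underscore is a word character
print(matches_fully(r"\w+", "foo-1"))   # False: hyphen is not
print(matches_fully(r"\w+", ""))        # False: + needs at least one character
print(matches_fully(r"\+", "+"))        # True: \+ matches a literal plus sign
```
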
Now we feed that string into re.compile, to create a pre-compiled 
regex. (This step is optional: any function which takes a compiled regex 
will also accept a string pattern. But pre-compiling regexes which you 
are going to use repeatedly is a good idea.)

regex = re.compile(pattern, re.I)

The second argument to re.compile is a flag, re.I, a special value that 
tells the regular expression to ignore case, so "h" will match both "h" 
and "H".
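
To see what the flag buys you, compare the same pattern compiled with 
and without it. This is a rough sketch: the test string "shouting" and 
the names "strict" and "relaxed" are invented for the example, and it 
uses the search method, which is explained properly further down.

```python
import re

pattern = r"<h1>@\w+</h1>"
shouting = "<H1>@Foo1</H1>"    # made-up test string with upper-case tags

strict = re.compile(pattern)           # no flags: case matters
relaxed = re.compile(pattern, re.I)    # re.I is short for re.IGNORECASE

print(strict.search(shouting))             # None: "<h1>" does not match "<H1>"
print(relaxed.search(shouting).group(0))   # <H1>@Foo1</H1>
```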

Now on to using the regex. Here's a bunch of text to search:

text = """Now is the time for all good men blah blah blah <h1>spam</h1>
and more text here blah blah blah
and some more <h1>@victory</h1> blah blah blah"""


And we search it this way:

mo = re.search(regex, text)

"mo" stands for "Match Object", which is returned if the regular 
expression finds something that matches your pattern. If nothing matches, 
then None is returned instead.

if mo is not None:
    print(mo.group(0))

=> prints <h1>@victory</h1>

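There is more than one way to spell that search, and they all give the 
same answer. Here is a small sketch of the alternatives, plus the case 
where nothing matches. The names mo1, mo2 and mo3 are just for 
illustration, and I've shortened the text to a single line:

```python
import re

pattern = r"<h1>@\w+</h1>"
regex = re.compile(pattern, re.I)
text = "and some more <h1>@victory</h1> blah blah blah"

# Three ways of asking the same question:
mo1 = re.search(regex, text)            # module function, compiled regex
mo2 = regex.search(text)                # method on the compiled regex
mo3 = re.search(pattern, text, re.I)    # module function, plain string pattern

for mo in (mo1, mo2, mo3):
    print(mo.group(0))                  # <h1>@victory</h1> each time

# There is no "@" after the <h1> here, so we get None back.
print(re.search(regex, "<h1>spam</h1>"))    # None
```
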
So far so good. But we can do better. In this case, we don't really care 
about the tags <h1>, we only care about the "victory" part. Here's how to 
use grouping to extract substrings from the regex:

pattern = r"<h1>@(\w+)</h1>"  # notice the round brackets ()
regex = re.compile(pattern, re.I)
mo = re.search(regex, text)
if mo is not None:
    print(mo.group(0))
    print(mo.group(1))

This prints:

<h1>@victory</h1>
victory
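
Note that re.search stops at the first match. If your text holds several 
tokens, one possible next step is findall or finditer. This is only a 
sketch under my own assumptions: the sample text below is made up, and I 
am guessing that you want every token rather than just the first.

```python
import re

regex = re.compile(r"<h1>@(\w+)</h1>", re.I)

# Made-up sample text containing two tokens and one plain heading.
text = "<h1>@foo1</h1> some filler <H1>@foo2</H1> and <h1>spam</h1>"

# With exactly one group in the pattern, findall returns a list of the
# group contents rather than the full matches.
names = regex.findall(text)
print(names)    # ['foo1', 'foo2']

# finditer yields one match object per match, working left to right.
for mo in regex.finditer(text):
    print(mo.group(0), "->", mo.group(1))
```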


Hope this helps.


-- 
Steven


