[Tutor] Regular expression - I

Tue Feb 18 20:39:55 CET 2014

Hi Santosh,

On Tue, Feb 18, 2014 at 9:52 AM, Santosh Kumar <rhce.san at gmail.com> wrote:
>
> Hi All,
>
> If you notice the below example, case I is working as expected.
>
> Case I:
> In [41]: string = "<H*>test<H*>"
>
> In [42]: re.match('<H\*>',string).group()
> Out[42]: '<H*>'
>
> But why is the raw string 'r' not working as expected ?
>
> Case II:
>
> In [43]: re.match(r'<H*>',string).group()
> ---------------------------------------------------------------------------
> AttributeError                            Traceback (most recent call last)
> <ipython-input-43-d66b47f01f1c> in <module>()
> ----> 1 re.match(r'<H*>',string).group()
>
> AttributeError: 'NoneType' object has no attribute 'group'
>
> In [44]: re.match(r'<H*>',string)

It is working as expected, but you're not expecting the right thing
;).  Raw strings don't escape anything; they just stop Python from
interpreting backslash escapes.  Case I works because "\*" is not an
escape sequence that Python recognizes (unlike "\n" or "\t"), so it
leaves the backslash in place:

   >>> '<H\*>'
   '<H\*>'

The equivalent raw string is exactly the same in this case:

   >>> r'<H\*>'
   '<H\*>'

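The raw string you provided doesn't have the backslash, and Python
will not add one for you, so the regex engine sees "H*" (zero or more
"H"s) followed by ">", which cannot match the literal "*" in your
string: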

   >>> r'<H*>'
   '<H*>'
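
To see the difference for yourself, here is a small sketch (my own
illustration, reusing the 'string' from your example; the names
'escaped', 'raw' and 'unescaped' exist only for this demo).  It writes
the normal literal with a doubled backslash, because newer Python
versions warn about unrecognized escapes such as "\*" in normal
strings:

```python
import re

string = "<H*>test<H*>"

# An explicitly escaped normal literal and the raw literal hold the
# same five characters: <, H, backslash, *, >
escaped = '<H\\*>'
raw = r'<H\*>'
print(escaped == raw, len(raw))

# Without the backslash the pattern is only four characters, and the
# regex engine reads "H*" as "zero or more H characters".
unescaped = r'<H*>'
print(len(unescaped))

for candidate in ('<>', '<H>', '<HHH>', string):
    m = re.match(unescaped, candidate)
    print(repr(candidate), '->', m.group() if m else None)
```

The first three candidates match, but your original string does not,
because after "<H" the pattern wants ">" and finds "*" instead.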

The purpose of raw strings is to prevent Python from recognizing
backslash escapes.  For example:

   >>> path = 'C:\temp\new\dir' # Windows paths are notorious...
   >>> path   # it looks mostly ok... [1]
   'C:\temp\new\\dir'
   >>> print(path)  # until you try to use it
   C:      emp
   ew\dir
   >>> path = r'C:\temp\new\dir'  # now try a raw string
   >>> path   # Now it looks like it's stuffed full of backslashes [2]
   'C:\\temp\\new\\dir'
   >>> print(path)  # but it works properly!
   C:\temp\new\dir

[1] Count the backslashes in the repr of 'path'.  Notice that there is
only one before the 't' and the 'n', but two before the 'd'.  "\d" is
not an escape sequence that Python recognizes, so Python left it
alone.  The repr shows two backslashes there because that's the only
way to show a real backslash unambiguously; the "\t" and "\n" are
actually the TAB and LINE FEED characters, as seen when printing
'path'.

[2] They are all real backslashes now, so they have to be shown
escaped ("\\") in the repr.
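
If counting backslashes in a repr feels error-prone, you can ask
Python directly what is in each string.  A minimal sketch, assuming
the same two paths as above (the names 'normal' and 'raw' are mine;
I've spelled the last backslash of the normal literal as "\\" so that
newer Python versions don't warn about "\d", which yields the same
string):

```python
# Normal literal: "\t" becomes TAB, "\n" becomes LINE FEED,
# and "\\" becomes a single real backslash.
normal = 'C:\temp\new\\dir'

# Raw literal: every backslash stays a backslash.
raw = r'C:\temp\new\dir'

print(len(normal), len(raw))                   # 13 15
print('\t' in normal, '\n' in normal)          # True True
print('\t' in raw, '\n' in raw)                # False False
print(normal.count('\\'), raw.count('\\'))     # 1 3
```

The raw string is two characters longer because each of "\t" and "\n"
is two characters there instead of one.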

In your regex, since you're looking for, literally, "<H*>", you'll
need to backslash-escape the "*", since it is a special character *in
regular expressions*.  The regex engine has to receive "\*", which
means the backslash must survive Python's own string-literal
processing.  Rather than keeping track of what's special to Python as
well as to regular expressions, the easiest way to do that is a raw
string:

   >>> re.match(r'<H\*>', string).group()
   '<H*>'
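
As a rough sketch of how you might guard against the AttributeError
from Case II, the helper below (a hypothetical 'first_match' of my
own, not something in the 're' module) checks for None before calling
.group().  It also shows re.escape(), which is a real 're' function
that builds the escaped pattern for you when you want to match text
literally:

```python
import re

string = "<H*>test<H*>"

def first_match(pattern, text):
    """Return the text matched at the start of 'text', or None."""
    m = re.match(pattern, text)
    return m.group() if m is not None else None

print(first_match(r'<H\*>', string))            # escaped by hand
print(first_match(r'<H*>', string))             # no match, prints None
print(first_match(re.escape('<H*>'), string))   # escaped by re.escape
```

The first and third calls both find '<H*>'; the second prints None
instead of raising, because re.match() returns None when the pattern
does not match.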

I hope this makes some amount of sense; I've had to write it up
piecemeal and will never get it posted at all if I don't go ahead and
post :).  If you still have questions, I'm happy to try again.  You
may also want to have a look at the Regex HowTo in the Python docs:
http://docs.python.org/3/howto/regex.html

Hope this helps,

-- 
Zach