re Questions

Sun Jan 26 12:15:51 EST 2014

On Sunday, January 26, 2014 12:08:01 PM UTC-5, Chris Angelico wrote:
> On Mon, Jan 27, 2014 at 3:59 AM, Blake Adams <blakesadams at gmail.com> wrote:
> 
> > If I want to set up a match replicating the '\w' pattern I would assume that would be done with '[A-z0-9_]'.  However, when I run the following:
> 
> >
> 
> > re.findall('[A-z0-9_]','^;z %C\@0~_') it matches ['^', 'z', 'C', '\\', '0', '_'].  I would expect the match to be ['z', 'C', '0', '_'].
> 
> >
> 
> > Why does this happen?
> 
> 
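For reference, the result in the question can be reproduced with the short sketch below. The sample string and the first pattern come from the question; the raw-string prefix, the variable names, and the second pattern are my own additions for comparison.

```python
import re

# Sample text from the question; the raw string keeps the backslash literal.
text = r'^;z %C\@0~_'

# The range A-z covers every code point from 'A' (65) to 'z' (122),
# which includes the punctuation that sits between 'Z' and 'a'.
wide = re.findall('[A-z0-9_]', text)

# Listing the two letter ranges separately leaves that punctuation out.
narrow = re.findall('[A-Za-z0-9_]', text)

print(wide)    # ['^', 'z', 'C', '\\', '0', '_']
print(narrow)  # ['z', 'C', '0', '_']
```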
> 
> Because \w is not the same as [A-z0-9_]. Quoting from the docs:
> 
> 
> 
> """
> 
> \w For Unicode (str) patterns: Matches Unicode word characters; this
> 
> includes most characters that can be part of a word in any language,
> 
> as well as numbers and the underscore. If the ASCII flag is used, only
> 
> [a-zA-Z0-9_] is matched (but the flag affects the entire regular
> 
> expression, so in such cases using an explicit [a-zA-Z0-9_] may be a
> 
> better choice). For 8-bit (bytes) patterns: Matches characters
> 
> considered alphanumeric in the ASCII character set; this is equivalent
> 
> to [a-zA-Z0-9_].
> 
> """
> 
> 
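The three cases in the quoted docs can be compared side by side. This is a minimal sketch assuming Python 3, where str patterns are Unicode by default; the sample string and variable names are made up for illustration.

```python
import re

# Hypothetical sample: "naive" with a diaeresis, an underscore, a digit,
# a space, and an accented e (written as escapes to stay encoding-neutral).
sample = 'na\u00efve_x1 \u00e9'

# str pattern: \w matches Unicode word characters, so the accented
# letters count as word characters.
unicode_hits = re.findall(r'\w', sample)

# Same pattern with re.ASCII: only [a-zA-Z0-9_] is matched.
ascii_hits = re.findall(r'\w', sample, flags=re.ASCII)

# bytes pattern: \w is ASCII-only even without a flag.
bytes_hits = re.findall(rb'\w', sample.encode('utf-8'))

print(len(unicode_hits))     # 9
print(''.join(ascii_hits))   # nave_x1
print(b''.join(bytes_hits))  # b'nave_x1'
```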
> 
> If you're working with a byte string, then you're close, but A-z is
> 
> quite different from A-Za-z. The set [A-z] is equivalent to
> 
> [ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\\]^_`abcdefghijklmnopqrstuvwxyz] (that's
> 
> a literal backslash and an escaped ] in there, btw), so it'll also catch several
> 
> non-alphabetic characters. With a Unicode string, \w also matches
> 
> non-ASCII letters and digits. Either way, \w means "word characters", though,
> 
> so just go ahead and use it whenever you want word characters :)
> 
> 
> 
> ChrisA
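
To see exactly which extra characters the A-z range picks up, the code points can be listed directly. A small sketch; the variable names are mine and nothing here is specific to the re module beyond the final cross-check.

```python
import re
import string

# Every character from 'A' to 'z' inclusive, by code point.
span = [chr(cp) for cp in range(ord('A'), ord('z') + 1)]

# The members of that range that are not ASCII letters.
extras = [ch for ch in span if ch not in string.ascii_letters]
print(extras)  # ['[', '\\', ']', '^', '_', '`']

# Cross-check: the class [A-z] matches every character in the range,
# and rejects the neighbours just outside it ('@' and '{').
assert all(re.fullmatch('[A-z]', ch) for ch in span)
assert not re.fullmatch('[A-z]', '@')
assert not re.fullmatch('[A-z]', '{')
```

Those six characters are why '^' and the backslash showed up in the original result; '_' would have matched anyway because it is listed explicitly in the pattern.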

Thanks Chris