python regex character group matches

Wed Sep 17 09:56:31 EDT 2008

christopher taylor wrote:

> my issue, is that the pattern i used was returning:
> 
> [ '\\uAD0X', '\\u1BF3', ... ]
> 
> when i expected:
> 
> [ '\\uAD0X\\u1BF3', ]
> 
> the code looks something like this:
> 
> pat = re.compile("(\\\u[0-9A-F]{4})+", re.UNICODE|re.LOCALE)
> #print pat.findall(txt_line)
> results = pat.finditer(txt_line)
> 
> i ran the pattern through a couple of my colleagues and they were all
> in agreement that my pattern should have matched correctly.

First, [0-9A-F] cannot match an "X".  Assuming that's a typo, your next 
problem is a precedence issue: (X)+ means "one or more (X)", not "one or 
more X inside parens".  In other words, that pattern matches one or more 
X's and captures the last one.
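
To see the "captures the last one" behaviour concretely, here is a small sketch. The sample text is made up for illustration (with a valid hex digit in place of the "X"), and the pattern is written as a raw string so the backslash handling is easier to follow:

```python
import re

# Made-up sample line containing two adjacent literal \uXXXX escapes.
text = r"abc \uAD0C\u1BF3 def"

# A capturing group followed by +: the overall match spans the whole run,
# but the group itself is overwritten on each repetition.
pat = re.compile(r"(\\u[0-9A-F]{4})+")
m = pat.search(text)

whole_match = m.group(0)   # the entire run of escapes
last_group = m.group(1)    # only the final repetition of the group

# With one capturing group in the pattern, findall returns the group
# contents rather than the full match.
found = pat.findall(text)

print(whole_match)
print(last_group)
print(found)
```

So the run is matched as a whole, but anything that reads the group (group(1), or findall) only sees the last escape in it.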

Assuming that you want to find runs of \uXXXX escapes, simply use 
non-capturing parentheses:

    pat = re.compile(r"(?:\\u[0-9A-F]{4})+")

and use group(0) instead of group(1) to get the match.
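
A minimal sketch of that approach, assuming the goal is one result per run of adjacent escapes. The sample text and the name "runs" are invented for the example; the pattern is a raw string with a trailing + so that a whole run is matched at once:

```python
import re

# Made-up sample: one run of two escapes, then a separate single escape.
text = r"x \uAD0C\u1BF3 y \u00E9 z"

# Non-capturing group repeated one or more times.
pat = re.compile(r"(?:\\u[0-9A-F]{4})+")

# group(0) is the full text of each match, i.e. the whole run.
runs = [m.group(0) for m in pat.finditer(text)]

# With no capturing groups, findall also returns the full matches.
same_runs = pat.findall(text)

print(runs)
```

Because the pattern no longer contains a capturing group, findall and finditer agree, and each adjacent run comes back as a single string.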

</F>