Finding # prefixing numbers

Tue Jul 19 08:05:28 EDT 2005

peterbe at gmail.com wrote:

> In a text that contains references to numbers like this: #583 I want
> to find them with a regular expression but I'm having problems with
> the hash. Hopefully this code explains where I'm stuck:
> 
>>>> import re
>>>> re.compile(r'\b(\d\d\d)\b').findall('#123 x (#234) or:#456 #6789')
> ['123', '234', '456']
>>>> re.compile(r'\b(X\d\d\d)\b').findall('X123 x (X234) or:X456 X6789')
> ['X123', 'X234', 'X456']
>>>> re.compile(r'\b(#\d\d\d)\b').findall('#123 x (#234) or:#456 #6789')
> []
>>>> re.compile(r'\b(\#\d\d\d)\b').findall('#123 x (#234) or:#456 #6789')
> []
> 
> As you can guess, I'm trying to find a hash followed by 3 digits word
> bounded. As in the example above, it wouldn't have been a problem if
> the prefix was an 'X' but that's not the case here.
> 
> 

From the re documentation:

> \b 
> Matches the empty string, but only at the beginning or end of a word.
> A word is defined as a sequence of alphanumeric or underscore
> characters, so the end of a word is indicated by whitespace or a
> non-alphanumeric, non-underscore character. Note that \b is defined as
> the boundary between \w and \W, so the precise set of characters
> deemed to be alphanumeric depends on the values of the UNICODE and
> LOCALE flags. Inside a character range, \b represents the backspace
> character, for compatibility with Python's string literals. 
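
The definition quoted above can be checked directly. The following is only a small sketch: the sample string and the variable names are made up for illustration, and it assumes a Python 3 interpreter. It lists the positions at which \b matches in a short string.

```python
import re

# Hypothetical sample: a '#' reference wrapped in parentheses.
sample = '(#234)'

# \b matches the empty string wherever a word character (\w)
# meets a non-word character (\W) or the start/end of the string.
boundaries = [m.start() for m in re.finditer(r'\b', sample)]

# Try to match a '#' that sits at a word boundary, and one that does not.
hash_at_boundary = re.findall(r'\b#', sample)
hash_at_non_boundary = re.findall(r'\B#', sample)

print(boundaries)
print(hash_at_boundary)
print(hash_at_non_boundary)
```

The only boundaries fall between '#' and '2' and between '4' and ')', so there is no boundary in front of the '#' itself.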

# is not a word character (letter, digit or underscore), so \b# will match
only if the # is directly preceded by a word character, which isn't the case
in any of your examples. Use \B (which is the opposite of \b and matches only
where there is no word boundary) instead:

>>> re.compile(r'\B(#\d\d\d)\b').findall('#123 x (#234) or:#456 #6789')
['#123', '#234', '#456']
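
If the intent is "a hash and exactly three digits, not glued to a word on either side", the same thing can also be spelled with lookarounds, which some people find easier to read than \B. This is just a sketch of one possible alternative, not the only way to do it; the names text, with_B and with_lookaround are invented for the example.

```python
import re

text = '#123 x (#234) or:#456 #6789'

# The \B version from above, with \d{3} as shorthand for \d\d\d.
with_B = re.compile(r'\B(#\d{3})\b').findall(text)

# Alternative: say explicitly that no word character may come
# directly before the '#' or directly after the third digit.
with_lookaround = re.compile(r'(?<!\w)(#\d{3})(?!\w)').findall(text)

# Both patterns skip a '#' that directly follows a word character.
glued = re.compile(r'\B(#\d{3})\b').findall('abc#123')

print(with_B)
print(with_lookaround)
print(glued)
```

Note that both forms reject something like 'abc#123', because there the # does sit on a word boundary. Whether that is wanted depends on the text being searched.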