Regular expression to capture model numbers

John Machin sjmachin at lexicon.net
Wed Apr 22 21:42:28 EDT 2009


On Apr 23, 8:01 am, krishnaposti... at gmail.com wrote:
> My quick attempt is below:
> obj = re.compile(r'\b[0-9|a-zA-Z]+[\w-]+')

1. Provided the remainder of the pattern is greedy and it will be used
only for findall, the \b seems pointless.

2. What is the "|" for? Inside a character class, | has no special
meaning, and will match a literal "|" character (which isn't part of
your stated requirement).

3. \w will match underscore "_" ... not in your requirement.

4. Re [\w-] : manual says "If you want to include a ']' or a '-'
inside a set, precede it with a backslash, or place it as the first
character" which IIRC is the usual advice given with just about any
regex package -- actually, placing it at the end works but relying on
undocumented behaviour when there are alternatives that are as easy to
use and are documented is not a good habit to get into :-)

5. You have used "+" twice; does this mean a minimum length of 2 is
part of your requirement?
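
Points 2 to 4 are easy to check at the interactive prompt. A quick sketch follows; the test strings and variable names are made up for illustration and are not from your data:

```python
import re

# Point 2: inside a character class, "|" is an ordinary character,
# so it is matched literally along with the letters and digits.
pipe_matches = re.findall(r'[0-9|a-zA-Z]+', 'ab|12;cd')

# Point 3: \w matches the underscore as well as letters and digits.
underscore_matches = re.findall(r'\w+', 'foo_bar;baz')

# Point 4: a hyphen placed first in the set is a literal hyphen.
hyphen_matches = re.findall(r'[-A-Za-z0-9]+', 'A123-456;x')

print(pipe_matches)        # ['ab|12', 'cd']
print(underscore_matches)  # ['foo_bar', 'baz']
print(hyphen_matches)      # ['A123-456', 'x']
```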

>   >>> re.findall(obj, 'TestThis;1234;Test123AB-x')
>
> ['TestThis', '1234', 'Test123AB-x']
>
> This is not working.
>
> Requirements:
>   The text must contain a combination of numbers, alphabets and hyphen
> with at least two of the three elements present.

Unfortunately(?), regular expressions can't readily express
complicated conditions like that.

> I can use it to set
> min length using ) {}

I presume that you mean enforcing a minimum length of (say) 4 by using
{4,} in the pattern ...

You are already faced with the necessity of filtering out unwanted
matches programmatically. You might as well leave the length check
until then.

So: first let's establish what the pattern should be, ignoring the "2
or more out of 3 classes" rule and the length rule.

First character: Digits? Maybe not. Hyphen? Probably not.
Last character: Hyphen? Probably not.
Other characters: Any of (ASCII) letters, digits, hyphen.

So based on my guesses for answers to the above questions, the pattern
should be r"[A-Za-z][-A-Za-z0-9]*[A-Za-z0-9]"
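
As a rough check, here is that pattern run over your sample string (the names model_re and candidates are just for illustration):

```python
import re

# The pattern from above: a letter first, then any mix of letters,
# digits and hyphens, ending on a letter or digit (so no trailing hyphen).
model_re = re.compile(r"[A-Za-z][-A-Za-z0-9]*[A-Za-z0-9]")

candidates = model_re.findall('TestThis;1234;Test123AB-x')
print(candidates)  # ['TestThis', 'Test123AB-x']
```

'1234' drops out because the first character must be a letter under my guesses. 'TestThis' still gets through, since the pattern alone knows nothing about the "2 out of 3" rule. Note also that this pattern can never match a single character.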

Note: this assumes that your data is impeccably clean, and there isn't
any such data outside textbooks. You may wish to make the pattern less
restrictive, so that you can pick up probable mistakes like
"A123- 456" instead of "A123-456".

Checking a candidate returned by findall could be done something like
this:

# initial setup:
import string
alpha_set = set(string.ascii_letters)
digit_set = set('1234567890')
min_len = 4 # for example

# each candidate (cand is one string returned by findall):
cand_set = set(cand)
ok = len(cand) >= min_len and (
   bool(cand_set & alpha_set)
   + bool(cand_set & digit_set)
   + bool('-' in cand_set)
   ) >= 2
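
Putting the pieces together, something along these lines should work. It is only a sketch under the guesses made above (letter first, no trailing hyphen, minimum length of 4 as an example); the function name find_model_numbers and the upper-case constants are my own inventions for illustration:

```python
import re
import string

ALPHA_SET = set(string.ascii_letters)
DIGIT_SET = set(string.digits)
# Assumed pattern: letter first, no trailing hyphen.
MODEL_RE = re.compile(r"[A-Za-z][-A-Za-z0-9]*[A-Za-z0-9]")


def find_model_numbers(text, min_len=4):
    """Return candidates that are long enough and contain at least
    two of the three classes: letters, digits, hyphen."""
    results = []
    for cand in MODEL_RE.findall(text):
        cand_set = set(cand)
        # Each test is a bool; adding them counts the classes present.
        classes_present = (
            bool(cand_set & ALPHA_SET)
            + bool(cand_set & DIGIT_SET)
            + ('-' in cand_set)
        )
        if len(cand) >= min_len and classes_present >= 2:
            results.append(cand)
    return results


print(find_model_numbers('TestThis;1234;Test123AB-x'))  # ['Test123AB-x']
```

Keeping the class and length checks in plain Python means the regex stays simple, and each rule can be changed without touching the pattern.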

HTH,
John


