Is there a maximum length of a regular expression in python?

Steve Holden steve at holdenweb.com
Wed Jan 18 09:07:21 EST 2006


olekristianvillabo at gmail.com wrote:
> I have a regular expression that is approximately 100k bytes. (It is
> basically a list of all known Norwegian postal numbers and the
> corresponding places with | in between.) I know this is not the intended
> use for regular expressions, but it should nonetheless work.
> 
> the pattern is
> ur'(N-|NO-)?(5259 HJELLESTAD|4026 STAVANGER|4027 STAVANGER........|8305
> SVOLVÆR)'
> 
> The error message I get is:
> RuntimeError: internal error in regular expression engine
> 
And I'm not the least bit surprised. Your code is brittle (i.e. likely 
to break) and cannot, for example, cope with multiple spaces between the 
number and the word(s). Quite apart from breaking the interpreter :-)

I'd say your test was the clearest possible demonstration that there 
*is* a limit.

Wouldn't it be better to have a dict keyed on the number and containing 
the word (which you can construct from the same source you constructed 
your horrendously long regexp)?

Then if you find something matching the pattern (untested)

ur'(N-|NO-)?((\d\d\d\d)\s*([A-Za-z ]+))'

or something like it that actually works (I invariably get regexps wrong 
at least three times before I get them right) you can use the dict to 
validate the number and name.
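
A minimal sketch of that approach follows, written for Python 3 (where 
the ur'' prefix no longer exists and str patterns are Unicode by 
default). The names POSTAL_PLACES, ADDRESS_RE and is_known_address are 
made up for illustration, and the four entries stand in for the full 
table. Note that [A-Za-z ] would not match SVOLVÆR, so the sketch uses 
[^\W\d_] to accept any Unicode letter, and it requires at least one 
whitespace character between number and name. Hyphenated place names 
would need the pattern extended.

```python
import re

# Hypothetical sample data; a real table would hold every known postal code.
POSTAL_PLACES = {
    "5259": "HJELLESTAD",
    "4026": "STAVANGER",
    "4027": "STAVANGER",
    "8305": "SVOLVÆR",
}

# Syntax only: optional N-/NO- prefix, four digits, whitespace, then one or
# more words made of letters. [^\W\d_] matches any Unicode letter.
ADDRESS_RE = re.compile(r"(?:N-|NO-)?(\d{4})\s+([^\W\d_]+(?:\s+[^\W\d_]+)*)")


def is_known_address(text):
    """Return True if text has the right form and a known number/place pair."""
    match = ADDRESS_RE.fullmatch(text.strip())
    if match is None:
        # Wrong shape: rejected after a single cheap match attempt.
        return False
    code, name = match.groups()
    # Collapse runs of whitespace and ignore case before the dict lookup.
    name = " ".join(name.split()).upper()
    return POSTAL_PLACES.get(code) == name


print(is_known_address("NO-4026 STAVANGER"))
print(is_known_address("4026 OSLO"))
```

The regexp stays a few dozen characters long however many postal 
numbers there are, and the data check is a single dict lookup.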

Quite apart from anything else, if the text line you are examining 
doesn't have the right syntactic form then you are going to test 
hundreds of options, none of which can possibly match. So matching the 
syntax and then validating the data identified seems like a much more 
sensible option (to me, at least).

regards
  Steve
-- 
Steve Holden       +44 150 684 7255  +1 800 494 3119
Holden Web LLC                     www.holdenweb.com
PyCon TX 2006                  www.python.org/pycon/



