more than 100 capturing groups in a regex

Steven D'Aprano steve at REMOVETHIScyber.com.au
Tue Oct 25 10:37:02 EDT 2005


On Tue, 25 Oct 2005 06:30:35 -0700, Iain King wrote:

> 
> Steven D'Aprano wrote:
>> On Tue, 25 Oct 2005 05:17:52 -0700, Iain King wrote:
>>
>> >
>> > Fredrik Lundh wrote:
>> >> Joerg Schuster wrote:
>> >>
>> >> > I just want to use more than 100 capturing groups.
>> >>
>> >> define "more" (101, 200, 1000, 100000, ... ?)
>> >>
>> >> </F>
>> >
>> > The Zero-One-Infinity Rule:
>> >
>> > http://www.catb.org/~esr/jargon/html/Z/Zero-One-Infinity-Rule.html
>>
>>
>> Nice in principle, not always practical. Sometimes the choice is, "do you
>> want it today with arbitrary limits, or six months from now with bugs
>> but no limits?"
>>
>> If assigning arbitrary limits prevents worse problems, well, then go for
>> the limit. For instance, anyone who has fought browser pops ups ("close
>> one window, and ten more open") may have wished that the browser
>> implemented an arbitrary limit of, say, ten pop ups. Or even zero :-)
>>
> 
> Well, exactly.  Why limit to ten?  The user is either going to want to
> see pop-ups, or not.  So either limit to 0, or to infinity (and indeed,
> this is what most browsers do).

I haven't been troubled by exponentially increasing numbers of pop-up
windows for a long, long time. But consider your question "why limit to
ten?" in a wider context.

Elevators always have a weight limit: the lift will operate up to N
kilograms, and stop at N+1. This limit is, in a sense, quite arbitrary,
since that value of N is well below the breaking point of the elevator
cables. Lift engineers, I'm told, use a safety factor of 10 (if the
cable will carry X kg without breaking, set N = X/10). This safety
factor is obviously arbitrary: a more cautious engineer might use a
factor of 100, or even 1000, while another might choose a factor of 5 or
2 or even 1.1. If engineers followed your advice, they would build lifts
that either carried nothing at all, or kept accepting weight until the
cable stretched and snapped.

Perhaps computer programmers would have fewer buffer overflow security
exploits if they took a leaf out of the engineers' book and built a few
more arbitrary safety factors into their data-handling routines. We can
argue whether 256 bytes is long enough for a URL or not, but I think we
can all agree that 3 MB for a URL is more than any person needs.
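
As a minimal sketch of that kind of safety factor, here is a length check
applied before any further handling. The name check_url_length and the
2048-character cap are my own assumptions for illustration; the only point
is that some limit is enforced at all.

```python
# Hypothetical cap, chosen only for illustration. The argument above is
# that *some* limit is sensible, not that this particular value is right.
MAX_URL_LENGTH = 2048


def check_url_length(url, limit=MAX_URL_LENGTH):
    """Return url unchanged, or raise ValueError if it exceeds the limit."""
    if len(url) > limit:
        raise ValueError(
            "URL is %d characters; limit is %d" % (len(url), limit)
        )
    return url
```

Rejecting the oversized input up front means nothing downstream ever has
to cope with a 3 MB URL.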

When you are creating an iterative solution to a problem, the ending
condition is not always well-specified. Trig functions such as sine and
cosine are an easy case: although their series expansions theoretically
require an infinite number of terms to generate an exact answer, the
terms will eventually underflow to zero, allowing us to stop the
calculation.
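
To make the easy case concrete, here is a rough sketch that sums the
Taylor series for sine and stops as soon as adding another term no longer
changes the total. The function name taylor_sin is hypothetical, and this
is an illustration of the stopping condition, not how math.sin is
actually implemented.

```python
import math


def taylor_sin(x):
    """Sum the Taylor series for sin(x) until a term stops changing the sum."""
    # Reduce the argument so that the terms start shrinking quickly.
    x = math.fmod(x, 2 * math.pi)
    term = x       # current term, starting with x**1 / 1!
    total = 0.0
    n = 1          # exponent of the current term
    while True:
        new_total = total + term
        if new_total == total:
            # The term is too small to affect the sum (or has underflowed
            # to zero), so there is nothing left to gain by continuing.
            return total
        total = new_total
        # Next term: multiply by -x**2 / ((n + 1) * (n + 2)).
        term *= -x * x / ((n + 1) * (n + 2))
        n += 2
```

No arbitrary iteration count is needed here, because the terms shrink
factorially and the loop is certain to end.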

But unfortunately that isn't the case for all mathematical calculations.
Often, the terms of our sequence do not converge to zero, due to round-off
error. Our answer cycles backwards and forwards between two or more
floating point approximations, e.g. 1.276805 <-> 1.276804. The developer
must make an arbitrary choice to stop after N iterations, if the answer
has not converged. Zero iterations is clearly pointless. One is useless.
And infinite iterations will simply never return an answer. So an
arbitrary choice for N is the only sensible way out.
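
A minimal sketch of that arbitrary cut-off follows. The helper name
iterate_until_stable and the default of 50 iterations are assumptions
picked for illustration; any real routine would choose N to suit the
problem.

```python
def iterate_until_stable(step, guess, max_iterations=50):
    """Apply step repeatedly until the value stops changing or the cap is hit.

    Returns a (value, converged) pair. converged is False when the
    arbitrary limit was reached first, e.g. because round-off error left
    the answer cycling between neighbouring floats.
    """
    for _ in range(max_iterations):
        new_guess = step(guess)
        if new_guess == guess:
            return new_guess, True
        guess = new_guess
    return guess, False


# Usage: Newton's method for the square root of 2.
root, converged = iterate_until_stable(lambda x: 0.5 * (x + 2.0 / x), 1.0)
```

Whether or not the converged flag is set, the caller gets an answer back
after at most max_iterations steps instead of waiting forever.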

In a database, we might like to associate (say) multiple phone numbers
with a single account. Most good databases will allow you to do that, but
there is still the question of how to collect that information: you have
to provide some sort of user interface. Now, perhaps you are willing to
build some sort of web-based front-end that allows the user to add new
fields, one for each phone number, with no limit. But
perhaps you are also collecting data using paper forms. So you make an
arbitrary choice: leave two (or three, or ten) boxes for phone numbers.

There are many other reasons why you might decide rationally to impose an
arbitrary limit on some process -- arbitrary does not necessarily mean
"for no good reason". Just make sure that the reason is a good one.


-- 
Steven.



