more than 100 capturing groups in a regex

Tue Oct 25 11:12:42 EDT 2005

Steven D'Aprano wrote:
> On Tue, 25 Oct 2005 06:30:35 -0700, Iain King wrote:
>
> >
> > Steven D'Aprano wrote:
> >> On Tue, 25 Oct 2005 05:17:52 -0700, Iain King wrote:
> >>
> >> >
> >> > Fredrik Lundh wrote:
> >> >> Joerg Schuster wrote:
> >> >>
> >> >> > I just want to use more than 100 capturing groups.
> >> >>
> >> >> define "more" (101, 200, 1000, 100000, ... ?)
> >> >>
> >> >> </F>
> >> >
> >> > The Zero-One-Infinity Rule:
> >> >
> >> > http://www.catb.org/~esr/jargon/html/Z/Zero-One-Infinity-Rule.html
> >>
> >>
> >> Nice in principle, not always practical. Sometimes the choice is, "do you
> >> want it today with arbitrary limits, or six months from now with bugs
> >> but no limits?"
> >>
> >> If assigning arbitrary limits prevents worse problems, well, then go for
> >> the limit. For instance, anyone who has fought browser pop ups ("close
> >> one window, and ten more open") may have wished that the browser
> >> implemented an arbitrary limit of, say, ten pop ups. Or even zero :-)
> >>
> >
> > Well, exactly.  Why limit to ten?  The user is either going to want to
> > see pop-ups, or not.  So either limit to 0, or to infinity (and indeed,
> > this is what most browsers do).
>
> I haven't been troubled by exponentially increasing numbers of pop up
> windows for a long, long time. But consider your question "why limit to
> ten?" in a wider context.
>
> Elevators always have a weight limit: the lift will operate up to N
> kilograms, and stop at N+1. This limit is, in a sense, quite arbitrary,
> since that value of N is well below the breaking point of the elevator
> cables. Lift engineers, I'm told, use a safety factor of 10 (if the
> cable will carry X kg without breaking, set N = X/10). This safety
> factor is obviously arbitrary: a more cautious engineer might use a
> factor of 100, or even 1000, while another might choose a factor of 5 or
> 2 or even 1.1. If engineers followed your advice, they would build lifts
> that either carried nothing at all, or kept accepting weight until the
> cable stretched and snapped.
>
> Perhaps computer programmers would have fewer buffer overflow security
> exploits if they took a leaf out of engineers' book and built in a few
> more arbitrary safety factors into their data-handling routines. We can
> argue whether 256 bytes is long enough for a URL or not, but I think we
> can all agree that 3 MB for a URL is more than any person needs.
>
> When you are creating an iterative solution to a problem, the ending
> condition is not always well-specified. Trig functions such as sine and
> cosine are an easy case: although they theoretically require an infinite
> number of terms to generate an exact answer, the terms will eventually
> underflow to zero allowing us to stop the calculation.
>
> But unfortunately that isn't the case for all mathematical calculations.
> Often, the terms of our sequence do not converge to zero, due to round-off
> error. Our answer cycles backwards and forwards between two or more
> floating point approximations, e.g. 1.276805 <-> 1.276804. The developer
> must make an arbitrary choice to stop after N iterations, if the answer
> has not converged. Zero iterations is clearly pointless. One is useless.
> And infinite iterations will simply never return an answer. So an
> arbitrary choice for N is the only sensible way out.
>
> In a database, we might like to associate (say) multiple phone numbers
> with a single account. Most good databases will allow you to do that, but
> there is still the question of how to collect that information: you have
> to provide some sort of user interface. Now, perhaps you are willing to
> build some sort of web-based front-end that allows the user to add new
> fields, put their phone number in the new field, with no limit. But
> perhaps you are also collecting data using paper forms. So you make an
> arbitrary choice: leave two (or three, or ten) boxes for phone numbers.
>
> There are many other reasons why you might decide rationally to impose an
> arbitrary limit on some process -- arbitrary does not necessarily mean
> "for no good reason". Just make sure that the reason is a good one.
>
>
> --
> Steven.

I think we are arguing at cross-purposes, mainly because the term
'arbitrary' has snuck in.  The actual rule:

 "Allow none of foo, one of foo, or any number of foo." A rule of
thumb for software design, which instructs one to not place random
limits on the number of instances of a given entity.

Firstly, 'for software design'.  Not for field engineers servicing
elevators :)

Secondly, it's [random], not [arbitrary].  I took your use of arbitrary
to mean much the same thing - a number picked without any real
judgement involved, simply because it was deemed larger than some
assumed maximum size.  The rule does not apply to a number selected for
good reason.

I don't think I get your phone record example: surely you'd have the
client record in a one-to-many relationship with the phone number
records, so there would be (theoretically) no limit?
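
To make the one-to-many idea concrete, here is a minimal sketch using
Python's sqlite3 module.  The table names (client, phone), the column
names and the helper functions are all made up for illustration; they
are not taken from anyone's real schema.  The point is only that the
phone numbers live in their own table, so the schema itself sets no
cap on how many a client can have.

```python
import sqlite3


def build_demo():
    """Create an in-memory database with a hypothetical client/phone schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE client (id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
    )
    # Each phone row points back at one client: the "many" side.
    conn.execute(
        "CREATE TABLE phone ("
        " id INTEGER PRIMARY KEY,"
        " client_id INTEGER NOT NULL REFERENCES client(id),"
        " number TEXT NOT NULL)"
    )
    return conn


def add_client(conn, name, numbers):
    """Insert one client and any number of phone numbers for it."""
    cur = conn.execute("INSERT INTO client (name) VALUES (?)", (name,))
    client_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO phone (client_id, number) VALUES (?, ?)",
        [(client_id, n) for n in numbers],
    )
    return client_id


def phones_for(conn, client_id):
    """Return the client's phone numbers in insertion order."""
    rows = conn.execute(
        "SELECT number FROM phone WHERE client_id = ? ORDER BY id",
        (client_id,),
    )
    return [row[0] for row in rows]
```

Any limit on the count would then come from the data-entry side (the
paper form, say), not from the database.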

Your web interface rang a bell though - in Gmail's contacts info page,
each contact has info stored in sections.  Each of these sections
stores a heading, an address, and some fields.  It defaults to two
fields, with an 'add field' button.  Hitting it repeatedly, I found it
maxed out at 20 fields per section.  You can also add more sections
though - I got bored hitting the 'add section' button once I got to 51
sections with the button still active.  I assume there is some limit to
the number of sections, but I don't know what it is :)  Gmail is awesome.

Anyway, back to the OP: in this specific case, the cap of 100 groups in
a RE seems random to me, so I think the rule applies.
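
For anyone who wants to check what their own interpreter does, here is
a small probe.  It is only a sketch: can_compile_groups is a name I
made up, and I am assuming that an interpreter with the cap signals it
by raising AssertionError or re.error from re.compile.  An interpreter
without the cap should simply compile the pattern.

```python
import re


def can_compile_groups(n):
    """Return True if a pattern with n capturing groups compiles here."""
    # Build "(a)(a)(a)..." with exactly n capturing groups.
    pattern = "(a)" * n
    try:
        compiled = re.compile(pattern)
    except (AssertionError, re.error):
        # Assumption: a group cap, where one exists, surfaces as one of these.
        return False
    # .groups reports how many capturing groups the compiled pattern has.
    return compiled.groups == n


if __name__ == "__main__":
    for count in (100, 101, 1000):
        print(count, can_compile_groups(count))
```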

Also, see "C Programmer's Disease":
http://www.catb.org/~esr/jargon/html/C/C-Programmers-Disease.html

Iain