Perl is worse!

Alex Martelli alex at magenta.com
Sat Jul 29 05:13:48 EDT 2000


"Steve Lamb" <grey at despair.rpglink.com> wrote in message
news:slrn8o40d1.49c.grey at teleute.rpglink.com...
    [snip]
> >> (/^(\d{1,2})[Dd](\d{1,3})$/ ||
> >>  /^(\d{1,2})[Dd](\d{1,3})([\+\-])(\d{1,2})$/)
>
> >I don't know what you mean here. I never do anything like that, I'm
> >happy to say. :)
>
>     Extracting up to 6 match groups, 5 of which will be integers with the
> last being a + or a -.  This means all groups can be either strings or
> None.  I've already verified that groups 1-4&6 will be integers through
> the regular expression so why check again?

You have most definitely NOT verified that these groups will be
integers: you have verified that they will be sequences of digits
*which may be empty*.  You keep talking about how the person on the
street feels -- do you truly believe he or she perceives
    ''
as being a NUMBER?!

Of course, you can (and IMHO should) do better in this particular
case because the start of the string is fixed; i.e., you could
use a more compact and efficient expression:

    dicere = re.compile(r'(\d{1,2})[Dd](\d{1,3})([+-]\d{1,2})?$')

The key issue here is striving to 'express any given thing ONLY
ONCE' -- here, the 'given thing' is the start of the RE.

I've also applied two minor simplifications -- no need for a ^ at
the start, since we're going to re.match and not re.search; and
no need to escape the + or - when within brackets.  Strictly an
issue of trying to minimize the syntax-complexity of the RE to
make it as readable as possible (not much, but that's OK:-).

Now, after mat=dicere.match(theline), you're left with only 3
match groups.  Groups mat.group(1) and mat.group(2) are known
to be non-empty sequences of digits; group mat.group(3) is
either empty (None, in Python), or a + or - sign followed by a
non-empty sequence
of digits.  You simply need to be explicit about what you
want to do with each, as per Python's general mantra "explicit
is better than implicit".
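
As a quick check of the compact pattern, here is a small sketch. The sample strings are made up for illustration, and note that current Python reports an unmatched optional group as None:

```python
import re

# The compact pattern from above; match() anchors at the start of the string.
dicere = re.compile(r'(\d{1,2})[Dd](\d{1,3})([+-]\d{1,2})?$')

for spec in ('2d6', '3D20+4', '1d8-2', 'd6', '2d6+'):
    mat = dicere.match(spec)
    # Either all three groups are available, or there is no match at all.
    print(spec, mat.groups() if mat else None)
```

Here '2d6' yields ('2', '6', None) and '3D20+4' yields ('3', '20', '+4'), while 'd6' and '2d6+' do not match at all.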

If you have to do this any oftener than once, you should of
course pack it in a little function of its own.  What to put
in it and what to leave outside is of course an issue of
design, depending on what exactly you need/want to reuse. To
get one single field out as a number (0 if the digit sequence
is empty), you could do:

def intgroup(match,index):
    digits = match.group(index)
    if digits: return int(digits)
    else: return 0

To get ALL of the matched-fields out as numbers (as before):

def intgroups(match):
    results = []
    for g in match.groups():
        if g: results.append(int(g))
        else: results.append(0)
    return results

Or, of course, you could also move the re.match inside
the function itself, etc.  Also, I've been careful to
only return a 0 for an empty sequence ("" or None), and
not many other kinds of non-digit-sequences; you could
soften that, if needed, with a try/except in lieu of
the if construct I've chosen.
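
A minimal sketch of that softer try/except variant follows. The name lenient_int is hypothetical, and whether swallowing every bad input is wise depends on the application:

```python
def lenient_int(text, default=0):
    """Return int(text), or default when text is not a valid integer."""
    try:
        return int(text)
    except (TypeError, ValueError):
        # TypeError covers None; ValueError covers '' and strings like 'foo'.
        return default
```

Unlike the if-based version, this also maps junk such as 'foo' to the default, which is exactly the softening mentioned above and may hide real input errors.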

On the other hand, you could only package the digit
sequence to integer conversion, if you get such seqs
from several different sources and not just from
match.group constructs:

def asInt(digits):
    if digits: return int(digits)
    else: return 0

and of course you could have BOTH layers, too:

def intGroups(match):
    results = []
    for g in match.groups():
        results.append(asInt(g))
    return results

where the latter might perhaps more elegantly
be expressed as:

def intGroups(match):
    # list() is needed on Python 3, where map returns an iterator
    return list(map(asInt, match.groups()))

[map and friends are easy to overuse, but, for a
simple case like this, it seems made to order].


Don't get me wrong -- I fully realize how attractive it
is, when you frequently need a certain functionality, to
find it bundled right in the language, or in its standard
libraries.  Here, you seem to very often need 'treat this
digit-sequence as a number, 0 if empty', and have chosen
to think of any language not supplying it to you in a
pre-cooked form as 'quirky' in this regard.

But, really, the advantages of a language's built-in
behaviour in a given situation being _close_ to what you
want are easily overrated.  As long as the issue is not
as fundamental as, say, unlimited-precision integers:-),
keeping it under your control has its own pluses: when
the behaviour you want becomes slightly different (and
it will!), it's an easy change to make rather than a
potential nightmare.

Let's take your case, that of parsing and interpreting
input strings of the form 'iDj[+k]' as found in role
playing games ('roll i j-sided dice and add k').  Say
that you now also need to support GURPS, which only ever
uses six-sided dice, so the j is often omitted *and needs
to default to SIX*, NOT to zero, when it is.

If you have closely relied on a language's implicitly
and silently transforming empty digit-sequences into 0,
you have to insert a lot of code, in various appropriate
places, of the form "if x==0: x=6".

If you have your own digits-to-number conversion function,
on the other hand, the change is easier, sharper, and
more explicit.  For example:

def asInt(digits, default=0):
    if digits: return int(digits)
    else: return default

def intGroups(match, default6index=-1):
    # default6index is a 1-based group number, as in match.group(n)
    groups = match.groups()
    defaults = [0] * len(groups)
    if default6index >= 1:
        defaults[default6index - 1] = 6
    return list(map(asInt, groups, defaults))

Now, you only need to change some calls from:
    x=asInt(match.group(2))
to
    x=asInt(match.group(2), default=6)
(you don't _need_ to give the argument name, but
I find it very readable in this case!), or, when
you're doing it in bulk to all groups, change:
    nums=intGroups(match)
to
    nums=intGroups(match, default6index=2)


> it behind and if or try every time.  I don't feel that is an uncommon
> occurrence for anyone who might be taking numbers from plain text and
> wanting to perform math operations on them.

It's not all that rare, but very often 0 is not the desired result
of an empty digit sequence.  I get a lot of "strings that I know
should be [non-empty sequences of digits; representations of dates;
syntactically valid GUIDs; &c, &c]" from XML parsing, and rarely
is 0 what I want the empty-string or missing-string to be equivalent
to.  An error is more often what I want to receive in that case;
the rarer cases where the empty-string is OK (and the default is
probably most often 1 in that case, NOT 0!) are best handled by
either try/except, or an explicit test.

Why would one think that 0 is most often desired, by the way?  Take
your RPG-dice example again.  If you accepted a syntax such as
'D20+3', would anybody on Earth take that as meaning 'roll no
20-sided dice at all, and add 3 to the non-result'?-)  Come on.
If you ever WANT to accept such specs, they'd better be VERY
explicit, say '0D20+3' -- you want to explicitly SEE that 0 there
(if the string was generated by another program, it might make
life simpler to accept it, but having the explicit 0 is not a
burden and clarifies things).

If you accept 'D20+3', which seems sensible to do, it must default
to mean the same as '1d20+3' -- roll ONE 20-sided die, etc, etc.
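
One way this could look, as a sketch only: the pattern below is a hypothetical variant in which the dice count is optional, and parse_dice is an illustrative name, not something from the thread.

```python
import re

# Hypothetical variant of the pattern: the leading dice count may be omitted.
dice_re = re.compile(r'(\d{1,2})?[Dd](\d{1,3})([+-]\d{1,2})?$')

def parse_dice(spec):
    mat = dice_re.match(spec)
    if mat is None:
        raise ValueError('not a dice spec: %r' % (spec,))
    count, sides, bonus = mat.groups()
    return (int(count) if count else 1,   # absent count means ONE die
            int(sides),                   # always present with this pattern
            int(bonus) if bonus else 0)   # absent modifier means +0
```

With this, parse_dice('D20+3') gives (1, 20, 3), while an explicit '0D20+3' still gives (0, 20, 3).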

Same in the typical XML situations, by the way.
    <sandwich number='2'>
        <ham/>
        <cheese/>
    </sandwich>
is OK for "two ham and cheese sandwiches", but, if the number
attribute can be omitted, then it had better default to ONE
ham and cheese sandwich, not ZERO of them (although the
explicit number='0' case should probably be accepted).
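
A sketch of that rule using the standard library's ElementTree; the enclosing order element and the function name sandwich_count are assumptions made up for the example:

```python
import xml.etree.ElementTree as ET

# Hypothetical document: an <order> wrapper around three sandwich elements.
doc = ET.fromstring(
    "<order>"
    "<sandwich number='2'><ham/><cheese/></sandwich>"
    "<sandwich><ham/><cheese/></sandwich>"
    "<sandwich number='0'><ham/></sandwich>"
    "</order>")

def sandwich_count(elem):
    text = elem.get('number')   # None when the attribute is missing
    if text is None:
        return 1                # omitted means ONE sandwich
    return int(text)            # explicit '0' is honoured; junk raises ValueError

counts = [sandwich_count(s) for s in doc.findall('sandwich')]
# counts is [2, 1, 0]
```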

Isn't that what the person on the street expects, when going to
a counter and saying "ham and cheese sandwich, please", rather
than an explicit "one sandwich" or "two sandwiches"?-)  He or
she expects one sandwich, not zero.  What the counter attendant
will do on an explicit order of "zero sandwiches" is here of
course another issue:-).


So, the language had BETTER not second-guess what IS meant
when a digit-sequence happens to be empty!  Defaulting it to
the integer 0, as Perl does, seems to me a pretty bad design
choice, predicated on a situation that is not really frequent.

But I would equally oppose defaulting it to the integer 1, AS
A LANGUAGE-BUILTIN CHOICE.  The meta-design mistake lies in
trying to have the language second-guess the programmer's
intent: explicit is better than implicit.  Just let the
programmer STATE what he or she intends to happen when the
digit-sequence is empty.  If in a given application domain
that is a frequent issue, the statement will happen once,
as a function:

def asInt(digits, ifEmpty=1):
    if digits: return int(digits)
    else: return ifEmpty

the programmer may encode the default once (here assumed to
be 1) and only have to make it explicit in the code body
when it needs to differ; or the '=1' may be omitted if the
programmer accepts 'explicit is better than implicit' and
WANTS to code explicit statements such as:

    x = asInt(match.group(2), ifEmpty=6)

each and every time (maximally readable, IMHO).  In any
case, the choice is UP TO THE PROGRAMMER, just as it should
be: the language has not made it in advance for him/her,
and has not second-guessed in the ambiguous case.  Bliss!-)
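
For the stricter style, where the default must be spelled out at every call, a keyword-only parameter is one option. This is a sketch using Python 3 syntax that did not exist when this was posted, with hypothetical names:

```python
def as_int(digits, *, if_empty):
    # if_empty is keyword-only and has no default: every call must state it.
    if digits:
        return int(digits)
    return if_empty

sides = as_int('', if_empty=6)      # 6
count = as_int('12', if_empty=1)    # 12
```

Calling as_int('') without if_empty raises TypeError, so the intent can never be left implicit.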


> >>> a = 1
> >>> b = a
> >>> a = 2
> >>> b
> 1
> >>> a = [1]
> >>> b = a
> >>> del(a[0])
> >>> b
> []
>
>     Forget, for a moment, what is happening internally.  What looks like
> here is that in the first case there is an assignment going on and in the
> latter a reference.  Never mind that what is happening in the first case
> is a references objint[1] and b references objint[1] and when we change a
> it points to a new reference objint[2].  It is the difference in behavior
> of an identical operator that is ambiguous.  I'm sure even though I am
> aware that it happens I am going to be nailed by it a few times before I
> get used to it.

There is no difference in the behaviour *of an identical operator*.

The difference is in the behaviour of two different constructs,
    a = 2
in the first case versus
    del a[0]
in the second case.

Use IDENTICAL operators, and you'll get the same behaviour:

>>> a=[1]
>>> b=a
>>> a=[2]
>>> b
[1]

See?  EXACTLY as in the first case, b is not in the least affected
by any rebinding whatsoever that you may perform on a.

The difference is between _rebinding_ a, and _modifying_ whatever
a is bound to (which can't happen when a is bound to something that
cannot be modified, of course -- a number, string, tuple).  Rebinding
a never has any effect whatsoever on any other variables that may
happen to have been bound to the same object as a before the latter
variable was re-bound.  Modifying happens to the OBJECT, *not* to
the variable[s] bound to it; and so, after the object is modified,
all variables still bound to it now reference the modified object,
never the original, unmodified one, which isn't even around any
longer.
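
The distinction can be checked directly with the "is" operator, which tests whether two names are bound to the same object. A small sketch:

```python
a = [1]
b = a                  # both names are bound to the same list object
same_before = a is b   # True

del a[0]               # modifies the OBJECT; b sees it too, b is now []
b_after_modify = list(b)

a = [2]                # rebinds the NAME a only; b keeps its binding
same_after = a is b    # False, and b is still []
```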

Never any "identical operator" in the two different cases, and
never any ambiguity.  You may have trouble thinking of variables
as post-its that are just transiently bound to something (whether
that something gets modified during that transient, or not; indeed,
whether that something is ever SUBJECT to modification, or not),
but there is no ambiguity and no identical-operators.

An everyday analogy is with identifying-roles (unique ones, that
is) as opposed to names.  'The president of Smallton Soccer Club'
and 'The president of Smallton Country Club' can happen to refer
to the same person AT A CERTAIN POINT IN TIME, but there is
nothing forcing the situation to remain the same.  If that person
undergoes some modification, e.g. she dyes her hair blue, then
that modification happens to 'both' presidents.  If a new
president of the Country Club is elected, the president of the
Soccer Club can remain the same person.

Other identifying-roles may refer to unmodifiable objects (hard
to find examples in the real world, but a slight abstraction that
is not alien to everyday thinking fixes that).  'My favourite
novel' and 'Your favourite novel' can both happen to be "Moby
Dick" at a given point in time (and nobody's going to modify
the novel "Moby Dick" itself, which is an abstraction of what
is common to all physical books that are copies of it:-).  If
I change my tastes so "My favourite novel" becomes "Heart Of
Darkness", that does not automatically change _yours_, nor can
it modify "Moby Dick" itself in any sense; it's just a rebinding
of one variable, with no effect whatsoever on the bindings of
other variables (here, other people's favourite novels) that just
happened to be similarly bound at one point in time.


> >Isn't that worse than the program giving up?
>
>     No.  The program giving up in the customer's hands because of an
> unchecked except is a far cry worse than a nigglet but having it continue
> on.  As a customer which would you rather have, given the choice of these
> two:
>
> a: A program that crashes, stopping everything that you're doing with no
> chance of recovery
>
> b: A program that behaves oddly but allows for chance of recovery.

As the 'behaving oddly' will typically produce irreversible alterations
of a precious persistent database, while the crash can at least be
assumed to leave the database intact, the choice is anything but easy.
Would you rather die suddenly and painlessly, but with no chance of
making a last will and testament, or slowly in excruciating pain, but
with a last chance to put your worldly affairs in order?  What a heck
of a choice to have to make!

Fortunately, it doesn't have to be made: wrapping the top-level entries
in a try/except is so TRIVIALLY easy, that there is really NO excuse
for any program not doing so.  Just as there is no excuse for not
HAVING a last will and testament already made, if your estate (were
you to die suddenly) has any worth whatever (yes, I know that most
people die intestate, often leaving a mess in the executors' hands,
and most programs do crash irretrievably on occasion -- two sad
comments on the state of the world, that most people never think of
their own demise AND most programmers never think of possible errors;
but that does not stop ME from having a last will registered, nor
from wrapping my programs in try/except...:-).
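
A minimal sketch of such a top-level wrapper. Here run is a stand-in for the real program, and what to do in the handler (log, notify, clean up) is an application decision:

```python
import sys
import traceback

def run(argv):
    # Stand-in for the real work; fails on missing or non-numeric input.
    return int(argv[0]) * 2

def main(argv):
    try:
        print(run(argv))
        return 0
    except Exception:
        # Last line of defence: report the problem, return a failure status.
        traceback.print_exc(file=sys.stderr)
        return 1
```

A script would typically finish with sys.exit(main(sys.argv[1:])), so the failure status reaches the caller instead of an unhandled traceback.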

(PS: to ensure the database is unpolluted, transactions exist: a
crash will roll back the transaction, only an explicit commit will
make the changes persist; the 'odd behaviour' that pollutes the
database, however, is not helped at all by this, unless you've
been able to express ALL the semantic constraints on your data
as triggers in your db -- fat chance!-).
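
To illustrate the rollback half of that point, here is a sketch with the standard sqlite3 module. The stock table and the take function are invented for the example; the "odd behaviour" case is, as said, not something a transaction can catch.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE stock (item TEXT PRIMARY KEY, qty INTEGER)')
conn.execute("INSERT INTO stock VALUES ('ham', 10)")
conn.commit()

def take(conn, item, n):
    # Used as a context manager, the connection commits if the block
    # succeeds and rolls back if it raises.
    with conn:
        conn.execute('UPDATE stock SET qty = qty - ? WHERE item = ?', (n, item))
        (qty,) = conn.execute('SELECT qty FROM stock WHERE item = ?',
                              (item,)).fetchone()
        if qty < 0:
            raise ValueError('not enough %s in stock' % item)

take(conn, 'ham', 3)            # committed: 7 left
try:
    take(conn, 'ham', 50)       # raises, so the UPDATE is rolled back
except ValueError:
    pass
remaining = conn.execute("SELECT qty FROM stock WHERE item = 'ham'").fetchone()[0]
```

After both calls, remaining is 7: the failed withdrawal left no trace in the table.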


>     Right, which is worthless in a script.  I'd personally have the Python
> check what mode it is in and toss an exception when it is in interactive
> mode and gets a meaningless statement.  It does so on every other possible
> ambiguity, why not this one when it is even less grey than a lot of the
> ones I am tossing out?  Is there a purpose for having such informative
> statements in non-interactive mode?

I'm not sure there is any practical way for Python to ensure that
a certain statement IS meaningless, i.e., that it can have had no
side-effects whatsoever.  If such insurance could be had without
pain, it would be worth having.  But if there's a price to be paid,
and I suspect there would be, I would not be willing to pay it.
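
One reason this is hard, sketched with a made-up class: a statement that merely names an attribute can run arbitrary code, so it is not necessarily free of side effects.

```python
class Audited:
    """Hypothetical class whose attribute access has a side effect."""
    def __init__(self):
        self.reads = 0

    @property
    def value(self):
        self.reads += 1     # merely reading the attribute changes state
        return 42

obj = Audited()
obj.value       # looks like a no-op statement, yet it runs code
obj.value
# obj.reads is now 2
```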


> >No, my entire point was that you need to know what your data is *anyway*.
>
>     Right, I know what the data is, I don't need the language to ask me at
> every operation, "Are you sure?  I mean, really, really, REALLY sure!?"

The language asks nothing: just TELL it what you know, explicitly.  If
you know s is a string which is a non-empty sequence of digits, and want
to use the equivalent integer, use int(s).  If s may be an empty sequence,
or None, or "foo", you need to just-as-unambiguously state what is to
be done in those cases.  You appear to ASSUME that the only sensible
thing in those cases is to have int(s) return 0, but I hope I've shown
above how wrong that would be.  Just write your explicit, tailored
conversion-function ONCE, again as shown above, and you're all set.


> >You need to know that something is an integer anyway, so why not simply
> >tell the computer what's going on? I'm *not* saying the language should
> >do the checking. It can't; even though Perl tries.
>
>     Because what if I want to do something that the computer thinks is
> non-sensical?

Such as?  Whatever it IS you want to do, if algorithmically feasible, it
can be expressed.


> >checking. If you want the system to shut up about your mistakes,
> >you can always do this in Python:
>
> >try:
> >   a = a + 1
> >except:
> >   pass
>
>     Great, now I get to litter my code on a per operation basis with tries
> instead of being able to set it on certain variables.  That would make
> more sense.

In Python, a variable refers to an object, PERIOD.  It would make no
sense at all, in this setting, to make a variable carry, besides the
reference, extra flags specifying peculiar special-case behaviour, so
that two variables referring to the same object would not be equivalent
because of the flag-setting; a huge increase of complexity for no real
gain, since the flags' meaning just has to depend on the object as well,
so what happens when the variable with the peculiar flags is re-bound
to some object of a completely different type?

You want tailored behaviour, you put it in the OBJECT, *NOT* in the
several variables that may or may not refer to it at different points
in time.  We're doing OBJECT-oriented programming, not VARIABLE-
oriented programming, see?
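
As a sketch of putting the tailored behaviour in the object, here is a hypothetical LenientCounter class (not an existing library type) that tolerates nonsensical additions, so no call site needs its own try/except:

```python
class LenientCounter:
    """Hypothetical number-like object: adding a non-number is ignored."""
    def __init__(self, value=0):
        self.value = value

    def __add__(self, other):
        try:
            return LenientCounter(self.value + other)
        except TypeError:
            # The tolerance lives in the object, not in the variables.
            return LenientCounter(self.value)

a = LenientCounter(1)
b = a               # a second name for the same object: same behaviour
a = a + 1           # a.value is 2
b = b + 'oops'      # tolerated by the object itself; b.value is 1
```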

This is utterly fundamental to the language architecture, and
even the underlying idea of O-O programming.  Getting different
behaviour from two variables referring to the same object, as
can happen e.g. in C++ (because C++ variables also have a type, as
well as objects having one, and not all methods need be virtual,
so there CAN be a semantic mismatch between variable & object!)
is a horrid wart, and in practice never what one MEANS to do.


>     Right, which is basically what I had to do except I left None as None.
> Just seems like a pain in the butt to do after I have already made sure,
> beyond all doubt, that those variables contain numbers in the first place.

Hey, you just *SAID* there could be a None, so you had NOT made
sure at all that they did contain numbers!  None is NOT a number.
The empty sequence is not a number.  You're contradicting yourself!


> >Though I may be missing something about regular expressions here and
> >I'm goofing up?
>
>     A regex could return a string or None, only one of which is
> convertible to int.  You must litter your code with such checks.

No, you must just code *ONCE* what specific integer you want None to
be transformed to: you encode that in a make-this-an-integer function,
and, if it's always the same integer, you're done -- you then just
call that function, see above.  Which gives you the needed flexibility
of having a different number for None in some cases, etc, etc, again
see above.


Alex





