Graceful failures

Wed Dec 31 15:28:45 EST 2003

|Thus Spake Hans-Joachim Widmaier On the now historical date of Wed, 31
Dec 2003 18:19:00 +0100|
> It is quite a coincidence that a thread like this pops up right when I
> feel I have to start one - with almost exactly the same subject -
> myself.
> 
> Error handling can easily be the hardest task in a program, and that's
> why it's being neglected most of the time.

Agreed.  Another problem is that programmers often forget what it's like
for people who don't understand computers.  It's hard to overestimate the
apprehension many users feel towards computers.  I've seen very competent
and intelligent people frozen with fear in front of a word processor
because they were afraid that they'd break the computer.

> I would like it much more detailed, like:

Most people who will read this post would prefer a more detailed message.
Of course you should tailor the system to your program's target audience,
but let's assume, for the sake of this particular discussion, that the
errors will mostly be read by non-technically oriented people.  I think
that detailed error reports should always be readily available, but should
not necessarily be the first thing presented.

> """
> Foo has encountered a problem. It was trying to load a necessary image
> from the file '/usr/share/Foo/images/up.png'. This file does apparently
> not exist. The problem may be caused by a broken installation of Foo.
> Alas, Foo cannot be continued and will be closed. """

That's a good message.  I would avoid putting raw filenames in the
non-technical description.  (Unless, of course, the user was trying to
open a document, then it's okay to list the path.)  I would also avoid
using words like "broken."  Both of these things look normal to you and
me, but to a lot of people they're Scary Things(tm).  By simply hiding the
filename behind a button that says "Technical Details" you have said "Hey,
you might not understand this, but that's OK because it's meant for the
geeks."  Heck, they might take a look at it and feel proud that they
understand more of it than they thought they would.

I like the increased specificity of your error message.  I might suggest:

"""
Foo has encountered a problem. It was trying to load necessary image
files. At least one image apparently does not exist. Uninstalling and then
reinstalling Foo may correct this problem. Unfortunately, Foo cannot be
continued and will be closed. """
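
To make the split between the friendly text and the hidden details
concrete, here is a rough sketch in Python.  The ErrorReport class and the
report_missing_image function are names made up for this illustration, not
part of any library, and a real GUI would wire render(show_details=True)
to the "Technical Details" button.

```python
from dataclasses import dataclass


@dataclass
class ErrorReport:
    """Hypothetical container: a friendly summary plus details kept out of sight."""

    summary: str  # shown to everyone
    details: str  # shown only when the user asks for "Technical Details"

    def render(self, show_details: bool = False) -> str:
        if show_details:
            return f"{self.summary}\n\nTechnical Details:\n{self.details}"
        return self.summary


def report_missing_image(exc: OSError) -> ErrorReport:
    """Build the non-technical message; the raw filename stays in the details."""
    summary = (
        "Foo has encountered a problem. It was trying to load necessary image "
        "files. At least one image apparently does not exist. Uninstalling and "
        "then reinstalling Foo may correct this problem. Unfortunately, Foo "
        "cannot be continued and will be closed."
    )
    details = f"{type(exc).__name__}: {exc}"
    return ErrorReport(summary=summary, details=details)
```

The point of the design is only that the path travels with the report but
is never the first thing the user sees.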

> (I'm not the best error message designer, but I hope you get the point:

Nor am I.  Most of the reason I offered up my opinion is so that other
people might clue me in to new ideas on the matter.

> Which means the details I want go there - fine with me.

Most people like you and me wouldn't mind one extra click to get at the
juicy details.

> This, of course, is a bit overkill for a little freeware program, but
> sounds good for a "big" application with a hefty price tag.

It makes as much sense as installing something like Bugzilla.  (More on
that in a minute.)

> Handling every conceivable error right is quite a challenge and lots of
> work, especially in the test department.

Let me digress for a moment and explain how I stumbled into this idea of
allowing the user to jump to an online information system based on their
particular error.  It might make more sense if you know where it came
from.  Or, at least, you can pin-point the critical flaw in my logic.

My last development job was with a large company that preprocessed medical
insurance claims.  Essentially, we took on contracts to collect insurance
claims from doctors, verify that the data was well-formatted (there are
several hundred formats for medical insurance claims), translate them into
the single format requested by our client, transmit them to the client,
receive the response and deliver the responses back to the doctors.  Karl
Marx would have hated us.  We were nothing but big-time middle-men.  When
asked where I worked, I used to say "I work for a huge quasi-governmental
corporation that exists solely to shuffle paper."

The method for transferring data was plain old 56k modems and a simple
BBS.  Our competitive advantage in the market was that we handled all
end-user support and that we offered a pretty GUI program tailored
specifically for the medical field.  (At this point, anyone who's worked
in that field knows exactly who I worked for.)  The GUI program was
essentially a fancy modem driver that allowed the user to track which
files had been sent and pair them with the proper responses.

Unfortunately the code for it was one huge tangled mess.  It was the first
program I was asked to make changes to and, as I hadn't mastered the
Borland C++ debugger, I resorted to the tried and true method of keeping a
log file while debugging.  Every time the program began an action, I had
it write and flush, in plain English, what it was attempting to do.  I
found that when I sent the product on to QA, having that log was
invaluable.  After some discussion with our tech support department, we
decided to keep the code in for the production version.

It was a smashing success.  The front line tech support people were able
to get much more reliable information about how the user's particular
error came about.  They started a database around the log file so that
they could reference the solutions for people with the same or similar
problems.  Most importantly, for me at least, I knew exactly where the end
users were having problems.  I didn't have to deal with errors like
"Version Foo failed an assertion on line X in module Bar."  I was able to
implement more graceful error handling.
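
The original was C++, and I'm not reproducing it here, but the
write-and-flush idea might look something like this in Python.  ActionLog
and its method names are invented for the sketch.

```python
import time


class ActionLog:
    """Hypothetical helper: appends one plain-English line per action."""

    def __init__(self, path):
        self._file = open(path, "a", encoding="utf-8")

    def note(self, message):
        """Record what the program is about to attempt."""
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        self._file.write(f"{stamp} {message}\n")
        # Flush right away so the line is already out of Python's buffer
        # if the program dies on the very next step.
        self._file.flush()

    def close(self):
        self._file.close()
```

You would call something like log.note("Dialing the claims server.")
before each step, so the last line in the file names the action that was
in progress when things went wrong.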

So, I read the OP and thought to myself:  Couldn't Python keep a stack of
the 'goals' it's attempting to achieve, so that if an exception was thrown
or an assertion failed, the stack would become part of the information
about the error?  You would push goals onto the stack, then pop them off
once the program was reasonably certain that the action wasn't going to
cause a problem.  Then, couldn't that information be used to see if other
users were having the same or similar problems?  Couldn't that information
be used to register and track problems via a hybrid of a message board and
Bugzilla?  Well, I don't see why not.  In fact, since Python's exception
system is so wonderful, you could have the error dialog marshal error
codes and basic stack-trace info into the GET portion of a URL and
probably even make it open a browser with the press of a button.  The user
wouldn't have to know that all this went on in the background.  All they
would know is "I pressed a button and it took me to a place where people
wanted to help me make it work."

I want to work on something like that someday.  I've already got a couple
of projects on my plate, so at the moment this is blue sky thinking, but
don't you think it would be nice if things could work this way?  Maybe
when my current project becomes stable, I'll see if I can add this as a
feature.

Sam Walters

-- 
Never forget the halloween documents.
http://www.opensource.org/halloween/
""" Where will Microsoft try to drag you today?
    Do you really want to go there?"""