Test driven programming, was Re: VB to Python migration

Thu Feb 2 08:05:35 EST 2006

First of all, while I use TextTest, I'm fortunate to be surrounded
by TextTest experts such as Goeff and Johan here at Carmen, so I'm
not a TextTest expert by any measure. I probably use it in a
non-optimal way. For really good answers, I suggest using the mailing
list at sourceforge:
http://lists.sourceforge.net/lists/listinfo/texttest-users

For those who don't know, TextTest is a testing framework which
is based on executing a program under test, and storing all
output from such runs (stdout, stderr, various files written
etc). On each test run, the output is compared to output which is
known to be correct. When there are differences, you can inspect
these with a normal diff tool such as tkdiff. In other words,
there are no test scripts that compare actual and expected values,
but there might be scripts for setting up, extracting results and
cleaning up.
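
A minimal sketch of that compare-against-stored-output idea in
Python. This is not how TextTest itself is implemented; the function
names are made up for illustration, and only stdout is captured.

```python
import difflib
import subprocess


def capture_stdout(command):
    """Run the program under test and return what it wrote to stdout."""
    result = subprocess.run(command, capture_output=True, text=True)
    return result.stdout


def diff_output(expected, actual):
    """Return unified-diff lines; an empty list means the run matches."""
    return list(difflib.unified_diff(
        expected.splitlines(), actual.splitlines(),
        fromfile="expected", tofile="actual", lineterm=""))
```

The stored "known correct" text would be read from a file and passed
as expected, with capture_stdout() supplying actual. A test with no
stored output yet can pass an empty string as expected, so every line
shows up as a difference.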

Grig Gheorghiu wrote:
> I've been writing TextTest tests lately for an application that will be
> presented at a PyCon tutorial on "Agile development and testing". I
> have to say that if your application does a lot of logging, then the
> TextTest tests become very fragile in the presence of changes. So I had
> to come up with this process in order for the tests to be of any use at
> all:

The TextTest approach is different from typical unittest work,
and it takes some time to get used to. Having worked with it
for almost a year, I'm fairly happy with it, and feel that it
would mean much more work to achieve the same kind of confidence
in the software we write if we used e.g. the unittest module.
YMMV.

> 1) add new code in, with no logging calls
> 2) run texttest and see if anything got broken
> 3) if nothing got broken, add logging calls for new code and
> re-generate texttest golden images

I suppose its usefulness depends partly on what your application
looks like. We've traditionally built applications that produce
a lot of text files which have to look in a certain way. There are
several things to think about though:

We use logging a lot and we have a clear strategy for what log
levels to use in different cases. While debugging we can set a
high log level, but during ordinary tests, we have a fairly low
log level. Which log messages get emitted is controlled by an
environment variable here, but you could also filter out e.g.
all INFO and DEBUG messages from your test comparisons.

We're not really interested in exactly what the program does when
we test it. We want to make sure that it produces the right results.
As you suggested, we'll get lots of false negatives if we log too
much detail during test runs.

We filter out stuff like timestamps, machine names etc. There are
also features that can be used to verify that texts occur even if
they might appear in different orders etc (I haven't meddled with
that myself).
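
The filtering described above (dropping INFO and DEBUG lines, masking
timestamps and machine names) could be sketched in Python as follows.
The patterns are assumptions about a log format, not TextTest's own
filter syntax, and would need adapting to the real output.

```python
import re

# Hypothetical patterns; adapt them to the actual log format.
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")
HOSTNAME = re.compile(r"\bhost=\S+")
LOW_LEVEL = re.compile(r"^(DEBUG|INFO)\b")


def normalize(lines):
    """Drop DEBUG/INFO lines and mask fields that vary between runs."""
    cleaned = []
    for line in lines:
        if LOW_LEVEL.match(line):
            continue  # low-level messages are excluded from comparison
        line = TIMESTAMP.sub("<timestamp>", line)
        line = HOSTNAME.sub("host=<host>", line)
        cleaned.append(line)
    return cleaned
```

Both the stored output and the new output go through normalize()
before they are compared, so two runs that differ only in these
fields come out identical.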

We don't just compare program progress logs, but make sure that
the programs produce appropriate results files. I personally don't
work with any GUI applications, being confined in the database
dungeons here at Carmen, but I guess that if you did, you'd need
to instrument your GUI so that it can dump the contents of e.g.
list views to files, if you want to verify the display contents.

Different output files are assigned different severities. If
there are no changes, you get a green/green result in the test
list. If a high severity file (say a file showing produced
results) differs, you get a red/red result in the test list.
If a low severity file (e.g. program progress log or stderr)
changes, you get a green/red result in the test list.
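
The severity rule above can be written down as a small function.
This is only an illustration of the rule as described here; the
"high"/"low" labels, the file names, and the choice to treat unknown
files as high severity are all assumptions.

```python
def classify(changed_files, severities):
    """Map the set of changed output files to a test-list colour.

    severities maps a file name to "high" or "low". Files without an
    entry are treated as high severity (an assumption, to be safe).
    """
    if not changed_files:
        return "green/green"
    if any(severities.get(name, "high") == "high" for name in changed_files):
        return "red/red"
    return "green/red"
```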

You can obviously use TextTest for regression testing legacy
software, where you can't influence the amount of logs spewed
out, but in those cases, it's probable that the appearance of
the logs will be stable over time. Otherwise you have to fiddle
with filters etc.

In a case like you describe above, I could imagine the following
situation. You're changing some code, and that change will add
some new feature, and as a side effect change the process (but
not the results) for some existing test. You prepare a new test,
but want to make sure that existing tests don't break.

For your new test, you have no previous results, so all is red,
and you have to carefully inspect the files that describe the
generated results. If you're a strict test-first guy, you've
written a results file already and with some luck it's green,
or just have some cosmetic differences that you can accept. You
look at the other produced files as well and see that there
are no unexpected error messages etc, but you might accept the
program progress log without caring too much about every line.

For your existing tests, you might have some green/red cases,
where low severity files changed. If there are red/red test
cases among the existing tests, you are in trouble. Your new
code has messed up the results of other runs. Well, this
might be expected. It's all text files, and if it's hundreds
of tests, you can write a little script to verify that the
only thing that changed in these 213 result files was that a
date is formatted differently or whatever. In a more normal
case, a handful of tests have a few more lines (maybe repeated
a number of times) in some progress logs, and it's quick to
verify in the GUI that this is the only change. As you suggested,
you might also raise the log level for this particular log
message for the moment, rerun the tests, see that the green/red
turned to green/green, revert the log levels, rerun again, and
save all the new results as correct.
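
The "little script" for the date-format case might look something
like this. It assumes, purely for illustration, that dates went from
DD/MM/YYYY to YYYY-MM-DD; the function names are hypothetical.

```python
import re

# Assumption: the old format was DD/MM/YYYY, the new one YYYY-MM-DD.
OLD_DATE = re.compile(r"\b(\d{2})/(\d{2})/(\d{4})\b")


def only_date_format_changed(old_text, new_text):
    """True if rewriting the old dates yields exactly the new text."""
    converted = OLD_DATE.sub(r"\3-\2-\1", old_text)
    return converted == new_text


def suspicious_files(pairs):
    """pairs maps a file name to (old_text, new_text).

    Returns the names whose changes go beyond the date format.
    """
    return [name for name, (old, new) in sorted(pairs.items())
            if not only_date_format_changed(old, new)]
```

An empty list from suspicious_files() means every result file changed
only in the expected way, so the new results can be saved as correct.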

> I've been doing 3) pretty much for a while and I find myself
> regenerating the golden images over and over again. So I figured that I
> won't go very far with this tool without the discipline of going
> through 1) and 2) first.

If you want this, it might be convenient to write something like

LOGLEVEL_NEW=1
...
log(LOGLEVEL_NEW, 'a new log message')
...
log(LOGLEVEL_NEW, 'another new log message')
...

and then remove the first assignment and replace all LOGLEVEL_NEW
with DEBUG or whatever after the first green test.
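
A runnable variant of the same trick, assuming Python's standard
logging module is what sits behind log(). Level 1 is below
logging.DEBUG (10), so a test run at DEBUG or higher never emits the
new messages. The logger name "app" and the process() function are
made up for the example.

```python
import logging

# Temporary level for messages added together with the new code.
# Replace it with logging.DEBUG (or whatever fits) after the first
# green test, then save the new results.
LOGLEVEL_NEW = 1
logging.addLevelName(LOGLEVEL_NEW, "NEW")

log = logging.getLogger("app")  # hypothetical logger name


def process(items):
    """Stand-in for new code that comes with a new log message."""
    log.log(LOGLEVEL_NEW, "processing %d items", len(items))
    return sorted(items)
```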

It's also worth thinking about how much logging you want during
tests. Do you want to see implementation changes that don't cause
changed results? You are verifying the achieved results properly
in some kind of snapshot or output files, right?

> From what I see though, there's no way I can replace my unit tests with
> TextTest. It's just too coarse-grained to catch subtle errors. I'm
> curious to know how exactly you use it at Carmen and how you can get
> rid of your unit tests by using it.

I work in the database group at Carmen R&D, and a big part of our job
is to provide APIs for the applications to use when working with
databases. Thus, we typically test our APIs rather than full programs.
This means that we need to write test programs, typically in Python,
that use the API, and present appropriate results.

The way we work in those cases isn't very different from typical unit
testing. We have to write test scripts (since the *real* program
doesn't exist) and we have to make the test scripts produce a
predictable and reproducible output. This isn't different from what
you do in unittest. If you can use .assertEqual() etc, you can also
get output to a file which is reproducible. For us, who mainly work
with content in databases, the typical thing for us to verify is that
one or several tables hold a particular content after a particular
set of operations. The easiest way to verify that is to dump the results
of one or several DB queries to text files, and to compare the files.
Typically, the files are identical, and if there is a difference, we
can directly see the difference between the files with tkdiff. This
is far superior to just writing typical unittest modules where we'd
perform a certain query and get a message that this particular query
gave the wrong result, but we don't get a chance to see any context.
Tkdiff on dumps of small tables is very informative.
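
As an illustration of dumping query results to diff-friendly text,
here is a sketch using sqlite3 from the standard library. That is a
stand-in for whatever database API is actually in use; the function
name and the tab-separated layout are assumptions.

```python
import sqlite3


def dump_query(conn, sql):
    """Return the result of a query as tab-separated text.

    The query should carry an ORDER BY clause, otherwise the row
    order (and so the dump) is not guaranteed to be reproducible.
    """
    cursor = conn.execute(sql)
    header = "\t".join(col[0] for col in cursor.description)
    rows = ["\t".join(str(value) for value in row) for row in cursor]
    return "\n".join([header] + rows) + "\n"
```

The test script writes the returned text to a file after running the
API calls under test, and that file is what gets compared with the
stored result.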

It's also easier when the expected result changes. I can usually see
fairly easily in tkdiff that the new result differs from the old just
in the way I expected, and save the new result as the correct result
to compare with. One has to be very careful here of course. If you're
in a hurry and don't inspect everything properly, you could save an
incorrect result as correct, and think that your flawed program works
as intended. It might be less likely that you would write an incorrect
assertion in unittest and make the program produce this incorrect
result.

This is a fundamental difference between unittest and the way *I* use
TextTest. With unittest, you state your expected results before the
test, and mechanically verify that this is correct. With TextTest, you
*could* do the same, and change the text file with expected results to
what you now think it should be. I'm usually too lazy for that though.
I just change my code, rerun the test, expect red, look at the
difference in tkdiff, determine whether this is what it should look
like, and save the new results as correct if I determine that it is.

So far, this hasn't led to any problems for me, and it's far less work
than to manually write all the assertions. TextTest (particularly in
the mature setup here) also provides support for distribution of tests
among hosts, on different operating systems, with different database
systems etc, automatic test execution as part of nightly builds etc.

TextTest originated among our optimization experts. As far as I know,
they use it more as it is intended than we do, i.e. to test complete
programs, rather than small scripts written just to check a feature in
an API. Honestly, I don't know a lot of details there though. They
seem to practice XP fairly strictly, but instead of writing unit tests,
they test the entire programs in TextTest, and they obviously manage
to produce world class software in an efficient way. As I said, I'm
not the expert...