Seven Principles of Software Testing

May 2008

While everyone knows the theoretical limitations of testing, in practice we devote considerable effort to this task, and would consider it foolish or downright dangerous to skip testing.

Other verification techniques, such as static analysis, model checking and proofs, have great potential, but it is unlikely they will ever fully remove the need for testing. In the meantime we need to understand the scope and limitations of tests, and perform them right.

The seven principles that follow emerged from experience of studying testing and developing automated testing tools (AutoTest, CDD) over the past few years.

1 Defining testing

As a verification technique, testing is a paradox. Testing a program to assess its quality is, in theory, akin to sticking pins into a doll; very small pins, very large doll.

The way out of the paradox is to set realistic expectations. All too often the software engineering literature claims an overblown role for testing, echoed in the Wikipedia definition [1]:

Software testing is the process used to assess the quality of computer software. Software testing is an empirical technical investigation conducted to provide stakeholders with information about the quality of the product or service under test, with respect to the context in which it is intended to operate.

In truth, testing a program tells us little about its quality, since ten test runs or ten million are a drop in the ocean of possible cases. More precisely, there are connections between tests and quality, but they are tenuous:

  • A successful test is only relevant to quality assessment if the test previously failed; then it shows a failure has gone away, possibly indicating that a fault has been corrected. (This article will follow the IEEE standard terminology [2]: an unsatisfactory program execution is a "failure", pointing to a "fault" in the program, itself the result of a "mistake" in the programmer's thinking. It avoids the word "bug", which may refer to any one of these three phenomena.)
  • If a systematic process tracks failures and faults over the life of the project or - better - many projects, the observed history may give some clues as to how many remain. For example, if the last three weekly test runs have evidenced 550, 540 and 530 faults, the trend is mildly encouraging but it is unlikely that the next run will find no faults, or a hundred; it is difficult to make more precise predictions, and they would in any case involve information from sources other than the project's current tests.

The only incontrovertible connection is a negative one, a falsification in the Popperian sense: a failed test gives us evidence of non-quality. (In addition, if the test previously passed, it indicates regression and suggests that a quality problem may exist not just for the current program but for the whole development process.)

The most famous of all quotations about testing says this well: "Program testing", wrote Edsger Dijkstra in 1970 [3], "can be used to show the presence of bugs, but never to show their absence!" What is less widely understood (and probably not intended by Dijkstra) is what this maxim actually means for testers: the best possible advertisement for their discipline. Surely, any technique that helps to uncover faults holds great interest for any "stakeholder", from managers to developers and customers.

Rather than an indictment of testing, we should understand this view as a definition of testing which, for being less ambitious than any attempt at providing "information about quality", is more realistic and extremely useful in practice. It yields our first principle:

Principle 1: Definition of software testing
    To test a program is to try to make it fail.

Taking this as the definition of software testing helps keep the testing process focused: its single goal is to uncover faults by triggering failures. Any inference on quality is the responsibility of the quality assurance process and beyond the scope of testing.

The definition also reminds us that testing, unlike debugging, does not concern itself with correcting faults, only with finding them.

2 Tests and specification

The next principle was already covered in an earlier EiffelWorld column [4]. The introduction of Test-Driven Development has convincingly brought tests to the center stage of software development, but sometimes with the implication that tests can be a substitute for specifications. They cannot. Tests, even a million of them, are only instances; they miss the abstraction that only a specification can provide. If I tell you about a particular function whose value is 1 for the input 1, 4 for the input 2, 9 for 3 and 16 for 4, I have given you some evidence, but no sure answer to the question of what it will give for 5. (I tried for fun to feed the values into a curve-fitting program; sure enough, the square function came out first, but a few really strange functions fit almost as well.)
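The point can be made concrete with a minimal sketch in Python. The function names `square` and `impostor` are illustrative assumptions, not taken from the column or from any curve-fitting tool: both functions satisfy the four observations quoted above, yet they disagree on the input 5.

```python
def square(x: int) -> int:
    """The 'obvious' candidate: x squared."""
    return x * x


def impostor(x: int) -> int:
    """Agrees with square on 1..4: the extra product is zero at each of those points."""
    return x * x + (x - 1) * (x - 2) * (x - 3) * (x - 4)


# The four "tests" from the text: input -> expected output.
observations = {1: 1, 2: 4, 3: 9, 4: 16}

for f in (square, impostor):
    passed = all(f(x) == y for x, y in observations.items())
    print(f.__name__, "passes all four tests:", passed, "| value at 5:", f(5))
```

Only a specification such as "the result is the square of the argument" distinguishes the two; no finite list of instances can.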

The danger of believing that a set of tests, however carefully designed, can serve as specification is evidenced by a number of software failures that happened because no one had thought of some extreme case. While writing a specification is not a panacea either - specifications can miss cases - at least it implies an effort at abstraction that can only help.

One aspect of the primacy of specifications over tests is that from a specification it is possible to generate tests, even automatically (model-based testing); the reverse is not possible without human intervention.

Principle 2: Tests versus specs
    Tests are no substitute for specifications.
    One can derive tests from specifications, not the other way around.

3 Regression testing

A characteristic of the testing activity as practiced in software - well known to experienced developers, although outsiders would not guess its extent - is the regrettable propensity of old, previously corrected faults to resuscitate. The hydra's old heads, thought to have been long cut off, pop back up. This is known as regression and may be due to various reasons including bad configuration management (an older version of a module is reinstated by mistake), incomplete correction (the correction fixed one manifestation of an underlying problem but stopped there, failing to investigate whether it had other similar consequences elsewhere in the code), and recurrence of incorrect programmer thought patterns.

Whatever its source, regression is a common plague of software projects and leads to regression testing: checking that what has been corrected still functions. This means in particular that once you have uncovered a fault it must remain part of your life forever (or at least for as long as the project lasts):

Principle 3: Regression testing
    Any failed execution must yield a test case, to remain permanently part of
    the project's test suite.

Ideally, we would apply this principle to all executions rather than just the failed ones, but with any reasonable process we expect the number of successful executions to be too large to permit this in practice. To a tester, in any case, successful tests are profoundly uninteresting; what counts are failures.

The principle applies to all failed executions occurring during the development and testing process. Ideally, there should be a mechanism for users of the software to report, optionally, any failure encountered during operation.

In any of these cases, application of this principle assumes a mechanism for turning a failed execution into an easily reproducible test case. The CDD tool for EiffelStudio [5] is an example of such a mechanism.
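As a rough illustration of what such a mechanism might look like, here is a minimal Python sketch. It is not how CDD works; the helper names `record_failure` and `run_regressions`, the JSON store, and the `mean` example are all hypothetical, and "failure" is simplified to "the call raises an exception".

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Dict, List


def record_failure(store: Path, routine: str, args: list) -> None:
    """Append the inputs of a failed execution to the permanent regression store."""
    cases = json.loads(store.read_text()) if store.exists() else []
    case = {"routine": routine, "args": args}
    if case not in cases:  # keep each failing input only once
        cases.append(case)
    store.write_text(json.dumps(cases, indent=2))


def run_regressions(store: Path, routines: Dict[str, Callable]) -> List[dict]:
    """Replay every recorded case; return those that still raise an exception."""
    cases = json.loads(store.read_text()) if store.exists() else []
    still_failing = []
    for case in cases:
        try:
            routines[case["routine"]](*case["args"])
        except Exception:
            still_failing.append(case)
    return still_failing


def mean_v1(xs):  # faulty version: fails on an empty list
    return sum(xs) / len(xs)


def mean_v2(xs):  # corrected version
    return sum(xs) / len(xs) if xs else 0.0


store = Path(tempfile.mkdtemp()) / "regressions.json"
try:
    mean_v1([])
except ZeroDivisionError:
    record_failure(store, "mean", [[]])  # the failed execution becomes a test case

print(run_regressions(store, {"mean": mean_v1}))  # the recorded case still fails
print(run_regressions(store, {"mean": mean_v2}))  # no case fails after the correction
```

The recorded case stays in the store after the correction, so any later reappearance of the fault is caught on the next replay.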

4 Test oracles

A test run is only useful if it is possible to determine unambiguously whether it passed or not. The criterion that determines success or failure is called a test oracle.

If you have a few dozen or perhaps a few hundred tests you may afford to examine the results one by one. Clearly this approach does not scale up. An effective approach to testing requires automating this task:

Principle 4: Applying oracles
    Determining success or failure of tests must be an automatic process.

This statement of the principle leaves open the form of oracles. Often, oracles are specified separately, for example as files listing the expected outcome of every test case. Our work on automated testing (AutoTest framework [6]) goes one step further than the above principle by making the test oracles part of the software; thanks to the use of contracts, programs contain their own tests. We call the results "programs that test themselves" [7]. The following variant of the previous principle, which not everyone may be ready to accept yet, covers this more systematic view:

Principle 4 (variant): Contracts as oracles
    Oracles should be part of the program text, in the form of contracts.
    Determining success or failure of tests should be an automatic process
    consisting of monitoring contract satisfaction during execution.
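Eiffel expresses contracts natively through require and ensure clauses; the Python sketch below only imitates them with two helper functions, so it should be read as an approximation of the idea rather than as the AutoTest mechanism. The names `require`, `ensure`, `integer_sqrt` and `passes`, as well as the seeded fault, are assumptions made for illustration.

```python
class ContractViolation(AssertionError):
    """Raised when a precondition or postcondition does not hold."""


def require(condition: bool, label: str) -> None:
    if not condition:
        raise ContractViolation("precondition violated: " + label)


def ensure(condition: bool, label: str) -> None:
    if not condition:
        raise ContractViolation("postcondition violated: " + label)


def integer_sqrt(n: int) -> int:
    """Return the largest r such that r * r <= n (contains a seeded fault)."""
    require(n >= 0, "n_non_negative")
    r = 0
    while (r + 1) * (r + 1) < n:  # fault: the comparison should be <=
        r += 1
    ensure(r * r <= n < (r + 1) * (r + 1), "r_is_floor_of_square_root")
    return r


def passes(n: int) -> bool:
    """The oracle: a run passes exactly when no contract is violated."""
    try:
        integer_sqrt(n)
        return True
    except ContractViolation:
        return False


# The driver only supplies inputs that satisfy the precondition.
failing_inputs = [n for n in range(20) if not passes(n)]
print(failing_inputs)  # [1, 4, 9, 16]
```

No expected outputs are listed anywhere: the postcondition alone decides success or failure, and here it exposes the fault on every nonzero perfect square.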

5 Manual and automatic test cases

Many test cases are manual: developers or testers think up execution cases of interest and devise a test case accordingly. To this category we may add the test cases derived - according to principle 3 - from an execution that failed even though it was not initially intended as a test case.

It is becoming increasingly interesting to include automatic test cases, typically derived from the specification as in model-based testing. If we restricted ourselves to manual tests we would be under-utilizing the power of modern computers.

Manual tests are good at depth: they reflect the developers' understanding of the problem domain and the data structure. Automatic tests are good at breadth: they will try many different values, including extreme cases that humans would not necessarily include. Our next principle reflects the complementary role of these two approaches:

Principle 5: Manual and automatic
    An effective testing process must include both manually and automatically
    produced test cases.
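A small Python sketch of how the two kinds of test case can sit side by side. The routine `days_in_month` is a hypothetical example, and using the standard library's `calendar.monthrange` as the independent oracle for the automatic cases is an assumption of this sketch, not something prescribed by the column.

```python
import calendar
import random


def days_in_month(year: int, month: int) -> int:
    """Routine under test (hypothetical): number of days in a Gregorian month."""
    if month == 2:
        leap = year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)
        return 29 if leap else 28
    return 30 if month in (4, 6, 9, 11) else 31


# Manual cases: few, but chosen with knowledge of the domain (century rules).
manual_cases = [((1900, 2), 28), ((2000, 2), 29), ((2024, 2), 29), ((2023, 4), 30)]
manual_failures = [(args, want) for args, want in manual_cases
                   if days_in_month(*args) != want]

# Automatic cases: many random inputs, judged by an independent oracle.
rng = random.Random(2008)  # fixed seed so that a failing run can be reproduced
auto_failures = []
for _ in range(10_000):
    year, month = rng.randint(1600, 2400), rng.randint(1, 12)
    if days_in_month(year, month) != calendar.monthrange(year, month)[1]:
        auto_failures.append((year, month))

print("manual failures:", manual_failures)
print("automatic failures:", auto_failures)
```

The manual cases encode what a developer knows to be delicate (1900 versus 2000); the random cases cover thousands of inputs nobody would write by hand.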

6 Testing strategies

For our last two principles we move from the practice of testing to the search for new techniques. The term "principle" may be a bit too grand for the next rule, which is more a piece of general advice drawn from examining the testing literature and trying to apply it, as well as from our own efforts to develop effective testing strategies. Both in works published by others and in our own initial efforts, we have repeatedly come across a risky thought process: you find an idea that looks certain to improve the testing process, and you believe your own intuition. Well, testing is a tricky matter; not all smart ideas turn out to be helpful when you submit them to the unforgiving test (if I may use this word) of objective evaluation.

A typical example is random testing. Intuition tells us that when we look for appropriate values to exercise a program, any strategy based on knowledge of the program must beat the (seemingly silly) strategy of selecting random values. And yet if one applies objective measures, such as the number of faults found, random testing often beats supposedly smart ideas; Hamlet's fascinating review of random testing [8] is a good place to consult for a confrontation of folk knowledge about testing and the objective reality.

That objective reality is not always easy to ascertain. Some testing papers follow a simplistic scheme: start from an existing strategy (for example to generate test cases); propose an improvement; describe a few experiments, on programs of the author's making with faults added ("seeded" or "injected"); show that the faults are identified faster than with the earlier strategy. But the examples may not be representative of real programs, and seeded faults may not be representative of real faults. Worse, it is not yet a universal practice to publish the programs and tests on the Internet in a way that enables others to perform the same experiments and check the results (as is common in experimental sciences); this makes many published results suspicious, as they are subject to classic problems of experimenter bias.

I remember being impressed by reading, many years ago, "The Life of Bees" (La Vie des Abeilles) by the Belgian writer Maurice Maeterlinck, better known for the text of Debussy's Pelléas et Mélisande. The book states that the intelligence of bees will make them lose out to the much dumber flies in the following experiment: put a bunch of insects into a carafe whose bottom you turn towards the light. Bees, attracted by the light, will get stuck forever; flies don't have a clue and will try all directions, so that eventually they get out. Maeterlinck was a poet rather than a professional biologist and I don't know whether the experiment really holds up. But it is a pretty good metaphor for cases in which apparent stupidity may outsmart apparent smartness. This can happen in testing; there is no substitute for empirical assessment.

The following advice summarizes these observations:

Principle 6: Empirical assessment of testing strategies
    Evaluate any testing strategy, however attractive it may appear in
    principle, through objective assessment based on explicit criteria and a
    reproducible testing process.

7 Assessment criteria

In applying the last principle the issue remains of which criteria to apply. In the testing literature, for example, one often finds measures such as "number of tests to first failure". From a practitioner's viewpoint this is not the most useful:

  • We want to find all faults, not just one. Of course the idea behind the given criterion is that once the first fault has been spotted it will be corrected, so it is just a matter of applying the criterion again. But this is not really satisfactory since the second and successive faults may be of an entirely different nature. In particular, an automated process must trigger as many failures as possible, not stop at the first one.
  • The number of tests is not that useful to stakeholders of the testing process, such as project managers, who need help deciding when to stop testing and ship, and to customers, who need some idea of fault densities. More relevant is the testing time needed to uncover the faults. Using the number of tests as the quantity to minimize has the pernicious effect of favoring strategies that may take a long time to devise and set up the tests, then get to the first failure in a small number of tests; but what counts is the time of the whole process. A strategy that seems dumb, such as random testing, may turn out to be better if it makes up in setup time for whatever time it may lose in finding the faults.

In our experience, what really matters in a testing strategy is how fast it can produce failures (from which we infer faults); to be precise, the number of failures found plotted against time, say f(t). This is useful in two ways: counting failures found in a set time, researchers who start from a software base with known faults can assess the effectiveness of a testing strategy by seeing how many of them it finds in that time; counting the time needed to go from one failure to the next, project managers can get some help to address the age-old question "when can I stop testing?" Hence the principle:

Principle 7: Assessment criteria
    The most important property of a testing strategy is the number of faults
    (or, if faults are not directly measurable, failures) it uncovers as a
    function of testing time.
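The following Python sketch shows one way to compute the two quantities just mentioned. The helper names `failures_by` and `gaps` are hypothetical, and the two lists of failure times (in minutes since the start of testing, setup included) are invented for illustration only; they are not measurements.

```python
from bisect import bisect_right


def failures_by(times, t):
    """f(t): number of failures found within the first t minutes (times sorted)."""
    return bisect_right(times, t)


def gaps(times):
    """Time elapsed before the first failure and between consecutive failures."""
    return [later - earlier for earlier, later in zip([0.0] + times, times)]


# Invented data: a "smart" strategy with 30 minutes of setup, and random
# testing with no setup at all.
smart_times = [35.0, 36.0, 38.0, 45.0, 70.0]
random_times = [2.0, 5.0, 9.0, 20.0, 41.0, 44.0, 90.0]

for budget in (30, 60):
    print("f(%d): smart = %d, random = %d" % (
        budget, failures_by(smart_times, budget), failures_by(random_times, budget)))

# Lengthening gaps are one signal a manager might use when deciding to stop.
print("gaps for random testing:", gaps(random_times))
```

With these invented numbers the smart strategy needs only a handful of tests to reach its first failure, yet random testing is ahead at both time budgets, which is the effect the second bullet above warns about.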

In all this we never strayed very far from where we started. The first principle told us that testing is about producing failures; the last one is but a quantitative restatement of the same general observation, which also underlies all the others.

-- Bertrand Meyer

References

  1. Wikipedia: entry on "Software Testing", as of March 2008.
  2. IEEE: Standard Glossary of Software Engineering Terminology, 1990, available online at ieeexplore.ieee.org/iel1/2238/4148/00159342.pdf.
  3. Edsger W. Dijkstra, Notes On Structured Programming, in Dahl, Dijkstra, Hoare, Structured Programming, Academic Press, 1972. Also as EWD249 at http://www.cs.utexas.edu/users/EWD/ewd02xx/EWD249.PDF.
  4. Bertrand Meyer: Test or spec? Test and spec? Test from spec!, in EiffelWorld, September 2004, available online at http://www.eiffel.com/general/column/2004/september.html.
  5. Andreas Leitner, Ilinca Ciupa, Manuel Oriol, Bertrand Meyer and Arno Fiva: Contract Driven Development = Test Driven Development - Writing Test Cases, Proceedings of ESEC/FSE'07 (European Software Engineering Conference and ACM SIGSOFT Symposium on the Foundations of Software Engineering 2007), Dubrovnik, Croatia, September 2007, available online at http://se.ethz.ch/people/leitner/publications/cdd_leitner_esec_fse_2007.pdf.
  6. Ilinca Ciupa, Andreas Leitner, Bertrand Meyer, Manuel Oriol et al.: AutoTest references, se.ethz.ch/research/autotest/.
  7. Bertrand Meyer, Ilinca Ciupa, Andreas Leitner: When Programs Test Themselves, submitted for publication, 2008.
  8. Richard Hamlet: Random testing, in Encyclopedia of Software Engineering, ed. J.Marciniak, Wiley, 1994, pp. 970-978.