Simon Dyck
David Sloane

Software Testing 

Seng 621 Winter 1999 


Abstract

  This web document, an extension of a presentation for S. Eng. 623, provides an introduction to software testing. It covers the basic methods of black box and white box testing, as well as the different test levels (unit, integration, system, etc.), describing how each level builds on the previous one. A brief discussion of software testing metrics is presented. The challenges facing software testing in an organization are explored, and the question of testing versus software inspections is discussed. Finally, we take a look at fault-based testing methods, a testing strategy that is gaining in popularity.

Table of Contents

  Abstract
Introduction to Software Testing
Basic Methods
Testing Levels

Metrics
Organization 
Testing and SQA, Inspections
A Closer Look: Fault Based Methods
Conclusion
References


Introduction to Software Testing

   Software testing is a vital part of the software lifecycle. To understand its role, it is instructive to review the definitions of software testing found in the literature.

Among alternative definitions of testing are the following: 

"... the process of exercising or evaluating a system or system component by manual or automated means to verify that it satisfies specified requirements or to identify differences between expected and actual results ..."

(ANSI/IEEE Standard 729, 1983). 

"... any activity aimed at evaluating an attribute or capability of a program or system and determining that it meets its required results. Testing is the measurement of software quality ..."

(Hetzel, W., The Complete Guide to Software Testing, QED Information Sciences Inc., 1984). 

"... the process of executing a program with the intent of finding errors..."

(Myers, G. J., The Art of Software Testing, Wiley, 1979). 

Of course, none of these definitions claims that testing shows that software is free from defects. Testing can show the presence, but not the absence of problems. 

According to Humphrey [1], software testing is defined as 'the execution of a program to find its faults'. Thus, a successful test is one that finds a defect. This sounds simple enough, but there is much to consider when we want to do software testing. Besides finding faults, we may also be interested in testing performance, safety, fault-tolerance or security. 

Testing often becomes a question of economics. For projects of a large size, more testing will usually reveal more bugs. The question then becomes when to stop testing, and what is an acceptable level of bugs. This is the question of 'good enough software'. 

It is important to remember that testing assumes that requirements are already validated.


Basic Methods

White Box Testing White box testing is performed to reveal problems with the internal structure of a program. This requires the tester to have detailed knowledge of the internal structure. A common goal of white-box testing is to ensure a test case exercises every path through a program. A fundamental strength that all white box testing strategies share is that the entire software implementation is taken into account during testing, which facilitates error detection even when the software specification is vague or incomplete. The effectiveness or thoroughness of white-box testing is commonly expressed in terms of test or code coverage metrics, which measure the fraction of code exercised by test cases.
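As an illustration (not from the original text; the function below is invented), a white-box test set can be chosen so that every return path through the code is exercised:

    # Hypothetical white-box example: tests chosen to exercise each return path.

    def classify_triangle(a: int, b: int, c: int) -> str:
        """Classify a triangle by its side lengths."""
        if a == b and b == c:
            return "equilateral"
        if a == b or b == c or a == c:
            return "isosceles"
        return "scalene"

    # One input per return path through the function.
    assert classify_triangle(3, 3, 3) == "equilateral"   # first branch taken
    assert classify_triangle(3, 3, 4) == "isosceles"     # second branch taken
    assert classify_triangle(3, 4, 5) == "scalene"       # fall-through path

A coverage tool run over such a suite would report which statements and branches were never reached, pointing out where further test cases are needed.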
Black Box Testing Black box tests are performed to assess how well a program meets its requirements, looking for missing or incorrect functionality. Functional tests typically exercise code with valid or nearly valid input for which the expected output is known. This includes concepts such as 'boundary values'. 
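As a sketch of boundary-value testing (the validator and its specification are assumptions made for this example), test inputs are clustered at and just beyond the specified limits:

    # Hypothetical black-box example: boundary-value tests for an input validator.
    # The assumed specification: valid scores are integers from 0 to 100 inclusive.

    def is_valid_score(score: int) -> bool:
        return 0 <= score <= 100

    # Values at, just inside, and just outside the boundaries, plus one nominal value.
    cases = {-1: False, 0: True, 1: True, 50: True, 99: True, 100: True, 101: False}
    for value, expected in cases.items():
        assert is_valid_score(value) == expected, f"boundary case {value} failed"
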

Performance tests evaluate response time, memory usage, throughput, device utilization, and execution time. Stress tests push the system to or beyond its specified limits to evaluate its robustness and error handling capabilities. Reliability tests monitor system response to representative user input, counting failures over time to measure or certify reliability.
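For instance (a minimal sketch, not from the original text; the operation and the 50 ms budget are invented), a basic performance test simply times an operation against a stated budget:

    # Hypothetical performance-test sketch: check an operation against a response-time budget.
    import time

    def operation_under_test() -> None:
        sum(range(100_000))   # stand-in for the real work being measured

    start = time.perf_counter()
    operation_under_test()
    elapsed_ms = (time.perf_counter() - start) * 1000
    assert elapsed_ms < 50, f"response time {elapsed_ms:.1f} ms exceeds the 50 ms budget"
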


Testing Levels

Different Levels of Test
Testing occurs at every stage of system construction. The larger the piece of code in which a defect is detected, the harder and more expensive the defect is to find and correct. 

The different levels of testing reflect that testing, in the general sense, is not a single phase of the software lifecycle. It is a set of activities performed throughout the entire software lifecycle. 

In considering testing, most people think of the activities described in figure 1. The activities after Implementation are normally the only ones associated with testing. Software testing must be considered before implementation, as is suggested by the input arrows into the testing activities.

Figure 1: V-Shaped Life Cycle

The following paragraphs describe the testing activities from the 'second half' of the software lifecycle.

Unit Testing Unit testing exercises a unit in isolation from the rest of the system. A unit is typically a function or small collection of functions (libraries, classes), implemented by a single developer. 

The main characteristic that distinguishes a unit is that it is small enough to test thoroughly, if not exhaustively. Developers are normally responsible for the testing of their own units and these are normally white box tests. The small size of units allows a high level of code coverage. It is also easier to locate and remove bugs at this level of testing.
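A minimal sketch of a developer-written unit test, using Python's standard unittest module (the unit under test is invented for illustration):

    # Hypothetical unit test: the unit is small enough to be covered thoroughly,
    # and it is exercised in isolation from the rest of the system.
    import unittest

    def word_count(text: str) -> int:
        """Unit under test: count whitespace-separated words."""
        return len(text.split())

    class WordCountTest(unittest.TestCase):
        def test_empty_string(self):
            self.assertEqual(word_count(""), 0)

        def test_single_word(self):
            self.assertEqual(word_count("testing"), 1)

        def test_multiple_words(self):
            self.assertEqual(word_count("unit tests run in isolation"), 5)

    if __name__ == "__main__":
        unittest.main()
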

Integration Testing One of the most difficult aspects of software development is the integration and testing of large, untested sub-systems. The integrated system frequently fails in significant and mysterious ways, and it is difficult to fix.

Integration testing exercises several units that have been combined to form a module, subsystem, or system. Integration testing focuses on the interfaces between units, to make sure the units work together. The nature of this phase is certainly 'white box', as we need some knowledge of the units to recognize whether we have been successful in fusing them together into the module. 

There are three main approaches to integration testing: top-down, bottom-up and 'big bang'. Top-down combines, tests, and debugs top-level routines that become the test 'harness' or 'scaffolding' for lower-level units. Bottom-up combines and tests low-level units into progressively larger modules and subsystems. 'Big bang' testing is, unfortunately, the prevalent integration test 'method'. This is waiting for all the module units to be complete before trying them out together. 

(From [1])

Bottom-up

  Major features:
  • Allows early testing aimed at proving feasibility and practicality of particular modules 
  • Modules can be integrated in various clusters as desired 
  • Major emphasis is on module functionality and performance 

  Advantages:
  • No test stubs are needed 
  • It is easier to adjust manpower needs 
  • Errors in critical modules are found early 

  Disadvantages:
  • Test drivers are needed 
  • Many modules must be integrated before a working program is available 
  • Interface errors are discovered late 

  Comments: At any given point, more code has been written and tested than with top-down testing. Some people feel that bottom-up is a more intuitive test philosophy.

Top-down

  Major features:
  • The control program is tested first 
  • Modules are integrated one at a time 
  • Major emphasis is on interface testing 

  Advantages:
  • No test drivers are needed 
  • The control program plus a few modules forms a basic early prototype 
  • Interface errors are discovered early 
  • Modular features aid debugging 

  Disadvantages:
  • Test stubs are needed 
  • The extended early phases dictate a slow manpower buildup 
  • Errors in critical modules at low levels are found late 

  Comments: An early working program raises morale and helps convince management progress is being made. It is hard to maintain a pure top-down strategy in practice.

    Integration tests can rely heavily on stubs or drivers. Stubs stand in for subroutines or sub-systems that are not yet finished or not yet integrated. A stub might consist of a function header with no body, or it may read and return test data from a file, return hard-coded values, or obtain data from the tester. Stub creation can be a time-consuming part of the testing effort. 
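    As an illustration (the interfaces and values are invented for this sketch), a stub for a not-yet-implemented tax-rate service can return hard-coded data so that the calling module can be integrated and tested now:

        # Hypothetical stub: stands in for a subsystem that is not yet implemented.

        def get_tax_rate_stub(region: str) -> float:
            """Stub for the real tax service; returns canned test data."""
            rates = {"AB": 0.05, "ON": 0.13}      # hard-coded values chosen by the tester
            return rates.get(region, 0.05)

        def price_with_tax(price: float, region: str, rate_lookup=get_tax_rate_stub) -> float:
            """Module under integration test; the real rate service is plugged in later."""
            return round(price * (1 + rate_lookup(region)), 2)

        assert price_with_tax(100.0, "ON") == 113.0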

    The cost of drivers and stubs in the top-down and bottom-up testing methods is what drives the use of 'big bang' testing. This approach waits for all the modules to be constructed and tested independently, and when they are finished, they are integrated all at once. While this approach is very quick, it frequently reveals more defects than the other methods. These errors have to be fixed and as we have seen, errors that are found 'later' take longer to fix. In addition, like bottom up, there is really nothing that can be demonstrated until later in the process.

    External Function Testing The 'external function test' is a black box test to verify the system correctly implements specified functions. This phase is sometimes known as an alpha test. Testers will run tests that they believe reflect the end use of the system. 
    System Testing The 'system test' is a more robust version of the external function test, and may also be referred to as an alpha test. The essential difference between 'system' and 'external function' testing is the test platform. In system testing, the platform must be as close as possible to production use in the customer's environment, including factors such as hardware setup and database size and complexity. By replicating the target environment, we can more accurately test 'softer' system features (performance, security and fault-tolerance). 

    Because of the similarities between the test suites in the external function and system test phases, a project may leave one of them out. It may be too expensive to replicate the user environment for the system test, or we may not have enough time to run both.

    Acceptance Testing An acceptance (or beta) test is an exercise of a completed system by a group of end users to determine whether the system is ready for deployment. Here the system will receive more realistic testing than in the 'system test' phase, as the users have a better idea of how the system will be used than the system testers do.
    Regression Testing Regression testing is an expensive but necessary activity performed on modified software to provide confidence that changes are correct and do not adversely affect other system components. Four things can happen when a developer attempts to fix a bug. Three of these things are bad, and one is good: 
                              New Bug     No New Bug
      Successful change       Bad         Good
      Unsuccessful change     Bad         Bad

    Because of the high probability that one of the bad outcomes will result from a change to the system, it is necessary to do regression testing. 

    It can be difficult to determine how much re-testing is needed, especially near the end of the development cycle. Most industrial testing is done via test suites: automated sets of procedures designed to exercise all parts of a program and to expose defects. While the original suite could be used to test the modified software, this might be very time-consuming. A regression test selection technique chooses, from an existing test set, the tests that are deemed necessary to validate modified software. 

    There are three main groups of test selection approaches in use: 

    • Minimization approaches seek to satisfy structural coverage criteria by identifying a minimal set of tests that must be rerun. 
    • Coverage approaches are also based on coverage criteria, but do not require minimization of the test set. Instead, they seek to select all tests that exercise changed or affected program components. 
    • Safe approaches attempt instead to select every test that could cause the modified program to produce different output than the original program. 

    An interesting approach to limiting test cases is based on whether we can confine testing to the "vicinity" of the change. (For example, if I put a new radio in my car, do I have to do a complete road test to make sure the change was successful?) A newer breed of regression test theory tries to identify, through program flow graphs or reverse engineering, where boundaries can be placed around modules and subsystems. These graphs can then be used to determine which tests from the existing suite may exhibit changed behavior on the new version. 
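    A coverage-based selection scheme could be sketched as follows (the test names, module names, and data structures are invented): record which modules each test exercises and rerun only the tests that touch changed modules.

        # Hypothetical coverage-based regression test selection: rerun only the tests
        # whose recorded coverage intersects the set of changed modules.

        test_coverage = {
            "test_login":    {"auth", "session"},
            "test_checkout": {"cart", "payment"},
            "test_reports":  {"reporting"},
        }

        changed_modules = {"payment"}    # e.g. derived from the change set under review

        selected = [name for name, covered in test_coverage.items()
                    if covered & changed_modules]
        print(selected)    # -> ['test_checkout']; the other tests are skipped this run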

    Regression testing has been receiving more attention as corporations focus on fixing the 'Year 2000 Bug'. The goal of most Y2K work is to correct the date-handling portions of a system without changing any other behavior. A new 'Y2K' version of the system is compared against a baseline original system. With the obvious exception of date formats, the behavior of the two versions should be identical. This means not only do they do the same things correctly, they also do the same things incorrectly. A non-Y2K bug in the original software should not be fixed by the Y2K work. 

    A frequently asked question about regression testing is 'The developer says this problem is fixed. Why do I need to re-test?', to which the answer is 'The same person probably told you it worked in the first place'.

    Installation Testing The testing of full, partial, or upgrade install/uninstall processes.
    Completion Criteria There are a number of different ways to determine when the test phase of the software lifecycle is complete. Some common examples are: 
    • All black-box test cases are run 
    • White-box test coverage targets are met 
    • Rate of fault discovery goes below a target value 
    • Target percentage of all faults in the system are found 
    • Measured reliability of the system achieves its target value (mean time to failure) 
    • Test phase time or resources are exhausted 

    When we begin to talk about completion criteria, we move naturally into a discussion of software testing metrics.


    Metrics

    Goals As stated above, the major goal of testing is to discover errors in the software. A secondary goal is to build confidence that the system will work without error. So what does it mean when testing does not detect any errors? Either the software is of high quality, or the testing process is of low quality; we need metrics on our testing process if we are to tell which is the case. 

    As with all domains of the software process, there are hosts of metrics that can be used in testing. Rather than discuss the merits of specific measurements, it is more important to know what they are trying to achieve. 

    Three themes prevail: 

    • Quality Assessment (What percentage of defects are captured by our testing process, how many remain?) 
    • Risk Management (What is the risk related to remaining defects?) 
    • Test Process Improvement (How long does our testing process take?) 
    Quality Assessment An important question in the testing process is "when should we stop?" The answer is when system reliability is acceptable or when the gain in reliability cannot compensate for the testing cost. To answer either of these concerns we need a measurement of the quality of the system. 

    The most commonly used means of measuring system quality is defect density. Defect density is represented by: 

    # of Defects / System Size

    where system size is usually expressed in thousands of lines of code or KLOC. Although it is a useful indicator of quality when used consistently within an organization, there are a number of well-documented problems with this metric. The best known relate to inconsistent definitions of defects and of system size. 

    Defect density accounts only for defects that are found in-house or over a given amount of operational field use. Other metrics attempt to estimate how many defects remain undetected. A simple approach to such estimation is based on "error seeding". We assume the system has X errors. It is artificially seeded with S additional errors. After testing, we have discovered Tr 'real' errors and Ts seeded errors. If we assume (a questionable assumption) that the testers find the same percentage of seeded errors as of real errors, we can calculate X: 

    • S / (X + S) = Ts / (Tr + Ts) 
    • X = S * ((Tr + Ts) / Ts - 1) 

    For example, if we find half the seeded errors, then the number of 'real' defects found represents half of the total defects in the system. 
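    A sketch of the arithmetic, with variable names following the formulas above (the figures are invented for illustration):

        # Error-seeding estimate: S faults seeded, Ts seeded and Tr real faults found.
        def estimate_total_real_faults(S: int, Ts: int, Tr: int) -> float:
            """Estimate X, the total number of real faults, assuming equal detection rates."""
            return S * ((Tr + Ts) / Ts - 1)      # equivalently X = S * Tr / Ts

        # Example: 50 faults seeded, 25 of them found, along with 40 real faults.
        X = estimate_total_real_faults(S=50, Ts=25, Tr=40)
        print(X)          # -> 80.0 real faults estimated in total
        print(X - 40)     # -> 40.0 real faults estimated to remain undetected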

    Estimating the number and severity of undetected defects allows informed decisions on whether the quality is acceptable or whether additional testing is cost-effective. It is very important to consider maintenance costs and redevelopment efforts when deciding on the value of additional testing.

    Risk Management Metrics involved in risk management measure how important a particular defect is (or could be). These measurements allow us to prioritize our testing and repair cycles. A truism is that there is never enough time or resources for complete testing, making prioritization a necessity. 

    One approach is known as Risk Driven Testing, where Risk has specific meaning. The failure of each component is rated by Impact and Likelihood. Impact is a severity rating, based on what would happen if the component malfunctioned. Likelihood is an estimate of how probable it is that the component would fail. Together, Impact and Likelihood determine the Risk for the piece. 

    Obviously, the higher the rating on each scale, the higher the overall risk associated with defects in the component. With a 1-4 rating scale on each axis, this might be represented visually: 

    [Figure: a 4 x 4 grid with Impact (1-4) on the vertical axis and Likelihood (1-4) on the horizontal axis; components falling in the high-Impact, high-Likelihood cells carry the greatest risk.]

    The relative importance of likelihood and impact will vary from project to project and company to company. 
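    A minimal sketch of how such ratings might be combined to order the test effort (the component names and ratings are invented; here risk is simply taken as the product of the two scores):

        # Hypothetical risk-driven prioritization: risk = impact rating x likelihood rating,
        # each on a 1-4 scale; components are tested in decreasing order of risk.

        components = {
            "billing":   {"impact": 4, "likelihood": 2},
            "reporting": {"impact": 2, "likelihood": 3},
            "login":     {"impact": 3, "likelihood": 1},
        }

        by_risk = sorted(components.items(),
                         key=lambda item: item[1]["impact"] * item[1]["likelihood"],
                         reverse=True)

        for name, rating in by_risk:
            print(name, rating["impact"] * rating["likelihood"])
        # billing 8, reporting 6, login 3 -- so billing is tested first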

    A system-level measurement for risk management is the Mean Time To Failure (MTTF). Test data sampled from realistic beta testing is used to find the average time until system failure. This data is extrapolated to predict overall uptime, the expected time the system will be operational. Sometimes measured alongside MTTF is the Mean Time To Repair (MTTR). This represents the expected time until the system will be repaired and back in use after a failure is observed. Availability, obtained by calculating MTTF / (MTTF + MTTR), is the probability that the system is available when needed. While these are reasonable measures for assessing quality, they are more often used to assess the risk (financial or otherwise) that a failure poses to a customer or, in turn, to the system supplier.
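    For example (the figures below are invented), an MTTF of 200 hours and an MTTR of 2 hours give:

        # Availability computed from MTTF and MTTR (hypothetical figures).
        mttf_hours = 200.0    # mean time to failure observed in beta testing
        mttr_hours = 2.0      # mean time to repair after a failure is observed

        availability = mttf_hours / (mttf_hours + mttr_hours)
        print(f"{availability:.4f}")   # -> 0.9901, i.e. the system is up about 99% of the time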

    Process Improvement It is generally accepted that to achieve improvement you need a measure against which to gauge performance. To improve our testing processes we need the ability to compare the results of one process against another. 

    Popular measures of the testing process report: 

    • Effectiveness: number of defects found and successfully removed / number of defects present 
    • Efficiency: Number of defects found in a given time 

    It is also important to consider system failures reported in the field by customers. If a high percentage of customer-reported defects were not revealed in-house, that is a significant indicator that the testing process is incomplete. 

    A good defect reporting structure will allow defect types and origins to be identified. We can use this information to improve the testing process by altering and adding test activities to improve our chances of finding the defects that are currently escaping detection. By tracking our test efficiency and effectiveness, we can evaluate the changes made to the testing process. 

    Testing metrics give us an idea of how reliable our testing process has been at finding defects, and they are a reasonable indicator of its performance in the future. It must be remembered that measurement is not the goal; improvement through measurement, analysis and feedback is what is needed.


    Software Testing Organization

    Test Groups The following summarizes the Pros and Cons of maintaining separate test groups… 

    Pros 

    • Testers are usually the only people to use a system heavily as experts; 
    • Independent testing is typically more efficient at detecting defects related to special cases, interaction between modules, and system level usability and performance problems 
    • Programmers are neither trained, nor motivated to test 
    • Overall, more of the defects in the product will likely be detected. 
    • Test groups can provide insight into the reliability of the software before it is actually shipped 

    Cons 

    • Having separate test groups can result in duplication of effort (e.g., the test group expends resources executing tests developers have already run). 
    • The detection of defects happens at a later stage, and designers may have to wait for responses from the test group before proceeding. This problem can be exacerbated in situations where the test group is not physically collocated with the design group. 
    • The cost of maintaining separate test groups 

    The key to optimizing the use of separate test groups is understanding that developers are able to find certain types of bugs very efficiently, and testers have greater abilities in detecting other bugs. An important consideration would be the size of the organization, and the criticality of the product.

    Testing Problems When trying to effectively implement software testing, there are several mistakes that organizations typically make. The errors fall into (at least) 4 broad classes: 

    Misunderstanding the role of testing. 

    The purpose of testing is to discover defects in the product. Furthermore, it is important to have an understanding of the relative criticality of defects when planning tests, reporting status, and recommending actions. 

    Poor planning of the testing effort. 

    Test plans often over emphasize testing functionality at the expense of potential interactions. This mentality also can lead to incomplete configuration testing and inadequate load and stress testing. Neglecting to test documentation and/or installation procedures is also a risky decision. 

    Using the wrong personnel as testers.

    The role of testing should not be relegated to junior programmers, nor should it be a place to employ failed programmers. A test group should include domain experts, and need not be limited to people who can program. A test team that lacks diversity will not be as effective. 

    Poor testing methodology. 

    Just as programmers often prefer coding to design, testers can be too focussed on running tests at the expense of designing them. The tests must verify that the product does what it is supposed to do, while not doing what it should not. As well, using code coverage as a performance goal for testers, or ignoring coverage entirely, are both poor strategies.


    Testing and SQA, Inspections

      Inspections are undoubtedly a critical tool to detect and prevent defects. Inspections are strict and close examinations conducted on specifications, design, code, tests, and other artifacts. An important point about inspections is that they can be performed much earlier in the design cycle, well before testing begins. Having said that, testing is something that can be started much earlier than is normally the case. Testers can review their test plans with developers as the developers are creating their designs. Thus the developer may be more aware of the potential defects and act accordingly. In any case, early detection of defects is critical: the closer to the time of its creation that we detect and remove a defect, the lower the cost, both in terms of time and money. This is illustrated in figure 2:
    Figure 2 : Defect Detection and cost to correct (Source: McConnell)

    Evidence of the benefits of inspections abounds. The literature (Humphrey 1989) reports cases where: 

    • inspections are up to 20 times more efficient than testing; 
    • code reading detects twice as many defects/hour as testing; 
    • 80% of development errors were found by inspections; 
    • inspections resulted in a 10x reduction in cost of finding errors; 

    In the face of all this evidence, it has been suggested that "software inspections can replace testing". While the benefits of inspections are real, they are not enough to replace testing. Inspections could replace testing if and only if all information gleaned through testing could be obtained through inspection. This is not true, for several reasons. Firstly, testing can identify defects due to complex interactions in large systems (e.g. timing/synchronization). While inspections can in principle detect such defects, as systems become more complex the chance that one person understands all the interfaces and is present at all the reviews is quite small. 

    Second, testing can provide a measure of software reliability (i.e. failures/execution time) that is unobtainable from inspections. This measure can often be used as a vital input to the release decision. Thirdly, testing identifies system level performance and usability issues that inspections cannot. Therefore, since inspections and testing provide different, equally important information, one cannot replace the other. However, depending on the product, the optimal mix of inspections and testing may be different! 


    A Closer Look: Fault Based Methods

      The following paragraphs describe some newer techniques in the software testing field. Fault-based methods include error-based testing, fault seeding, mutation testing, and fault injection, among others. 

    After briefly describing each of these four techniques, fault injection will be discussed in more detail. 

    • Error based testing defines classes of errors as well as inputs that will reveal any error of a particular class, if it exists. 
    • Fault seeding implies the injection of faults into software prior to test. Based on the number of these artificial faults discovered during testing, inferences are made about the number of remaining 'real' faults. For this to be valid, the seeded faults must be assumed to be similar to the real faults. 
    • Mutation testing injects faults into code to determine optimal test inputs. 
    • Fault Injection evaluates the impact of changing the code or state of an executing program on behavior of the software. 

    These methods attempt to address the belief that current techniques for assessing software quality are not adequate, particularly in the case of mission-critical systems. Voas et al. suggest that the traditional belief that improving and documenting the software development process will increase software quality is lacking. Yet they recognize that the amount of (product-focussed) testing required to demonstrate high reliability is impractical. In short, quality processes cannot demonstrate reliability, and the testing necessary to do so is impossible to perform. 

    Fault injection is not a new concept. Hardware design techniques have long used inserted fault conditions to test system behavior. It is as simple as pulling the modem out of your PC during use and observing the results to determine if they are safe and/or desired. The injection of faults into software is not so widespread, though it would appear that companies such as Hughes Information Systems, Microsoft, and Hughes Electronics have applied the techniques or are considering them. Properly used, fault insertion can give insight as to where testing should be concentrated, how much testing should be done, whether or not systems are fail-safe, etc. 

    As a simple example consider the following code:   

    Original:

        X = (r1 - 2) + (s2 - s1)
        Y = z - 1
        …
        T = x/y

    Fault injected:

        X = (r1 - 2) + (s2 - s1)
        X = perturb(x)
        Y = z - 1
        …
        T = x/y
        If T > 100 then print('WARNING')

    In this case it is catastrophic if T > 100. By using perturb(x) to generate changed values of X (e.g. with a random number generator) you can quickly determine how often corrupted values of X lead to undesired values of T. 
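    A runnable sketch of the same idea (perturb and the threshold of 100 follow the pseudo-code above; the input values and the perturbation range are assumptions made for this example):

        # Hypothetical fault-injection sketch: perturb an intermediate value and count
        # how often the corruption propagates to a hazardous output (T > 100).
        import random

        def perturb(x: float) -> float:
            """Corrupt x with random noise to simulate an internal data-state fault."""
            return x + random.uniform(-150.0, 150.0)   # assumed perturbation range

        def compute_t(r1, s1, s2, z, inject_fault=False) -> float:
            x = (r1 - 2) + (s2 - s1)
            if inject_fault:
                x = perturb(x)
            y = z - 1
            return x / y

        trials = 10_000
        hazards = sum(compute_t(10, 1, 5, 2, inject_fault=True) > 100 for _ in range(trials))
        print(f"{hazards / trials:.2%} of injected faults led to T > 100")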

    The technique can be applied to internal source code, as well as to third-party software, which may be a "black box".


    Conclusion

      Software testing is an important part of the software development process. It is not a single activity that takes place after code implementation, but is part of each stage of the lifecycle. A successful test strategy begins with consideration during requirements specification. Testing details are fleshed out through high- and low-level system design, and testing is carried out by developers and separate test groups after code implementation.

    As with the other activities in the software lifecycle, testing has its own unique challenges. As software systems become more and more complex, the importance of effective, well planned testing efforts will only increase.


    References

    1. Humphrey, Watts S., "Managing the Software Process", Addison-Wesley Publishing Company, Inc., 1989 
    2. McConnell, Steve, "Software Quality at Top Speed", August 1996. http://www.construx.com/stevemcc/art04.htm 
    3. Voas, J and Miller, K.W., "Using Fault Injection To Assess Software Engineering Standards", Proceedings of Int'l. Symp. on Software Engineering Standards, August, 1995.
    4. Voas, J., "Fault Injection for the Masses", IEEE Computer, December 1997. 
    5. Mills, E., "Software Metrics" SEI-CM-12-1.1, December 1998.

    Further Reading

    Marick, Brian, "Classic Testing Mistakes", 1997
    URL:http://www.stlabs.com/MARICK/Classic/mistakes.html

    "Software Testing Techniques"
    URL: http://hebb.cis.uoguelph.ca/~deb/27320/testing/testing.html

    "Software Inspections"
    URL: http://www.sei.cmu.edu/str/descriptions/inspections_body.html

    Hower, Rick, "Software QA and Testing Frequently-Asked-Questions, Part 1", 1998
    URL: http://www.charm.net/~dmg/qatest/qatfac1.html

