Callisto - Selecting Effective Mutation Operators for Mutation Testing

A Master thesis project by Jan Smits

University supervisor: Ansgar Fehnker
Daily supervisors: Nico Jansen, Jan-Jelle Kester
Process supervisor: Harry Nieboer
University assessors: Marieke Huisman, Maarten Everts

Thesis project submitted in fulfillment of the requirements for the degree of:
MSc Computer Science, Software Technology specialization
Faculty of Electrical Engineering, Mathematics and Computer Science (EEMCS)

Mutation Testing

Mutation testing is a software strategy that works by deliberately introducing bugs to the source code, creating mutant versions of the program. Mutants are generated by mutation operators, which mutate a syntax token in a specific way. These mutants are then tested using the existing test suite. If the tests fail, then the mutation was detected and is now deemed killed. If the tests pass then the mutation has gone undetected and has survived. The more mutants are killed, the better the test suite is at detecting bugs in the source code. The mutation score is the percentage of mutants that is killed, and indicates the effectiveness of the test suite. Mutation testing therefore does not test the software directly, but rather the tests.

Performance Problems

The main reason why mutation testing is not widely applied in industry is because of its high performance cost. Every mutant requires up to a full run of the test suite, and there are hundreds to thousands of mutants possible, depending on the program size.

Several techniques have been developed to speed up mutation testing. This includes more efficient ways to generate and test mutants, for example by using coverage information or compiling multiple mutants into one program and activating them using code flags. For this project we focus on selective mutation, which reasons that too many mutants are generated and tries to select a subset of mutants such that mutation testing is still effective for judging a test suite.

Mutation Levels

This project introduces the use of mutation levels as a technique to speed up mutation testing. A mutation level is a subset of the available mutation operators, such that, when used, fewer mutants will be generated and thus fewer executions of the test suite will be needed during testing. Mutation operators are selected based on their resolution and performance impact. Resolution describes the ability of operators to generate hard to kill mutants, which require more specific test cases to kill and thus promote the creation of a high-quality test suite. Performance impact relates to the relative number of mutants an operator generates.

An ideal mutation level consists of mutation operators with a high resolution and low performance impact, so that a high-quality test suite is encouraged with good performance. The difficulty of designing the mutation levels therefore lies in choosing which mutation operators are worth keeping and which can be excluded, based on their resolution and performance impact. For this the tool Callisto is developed, which analyses mutation operators in an empirical setting to quantify their resolution and performance impact. This is done using an existing quality metric from literature.

Callisto

A tool to analyse mutation operators.

Metrics

To quantify the resolution of mutation operators the Quality metric defined by Estero-Botaro et al. and improved by Delgado-Pérez et al. is used. In a nutshell, Quality is calculated experimentally after a mutation testing session by looking at the number of test cases that killed individual mutants. The fewer test cases kill a mutant, the harder it is to kill, and therefore the higher the quality of the mutation operator that generated the mutant. Quality always falls between 0 and 1.

The performance impact of a mutation operator can be quantified in the same experiment by counting the total number of test case executions that were needed to test the mutants generated by it.

Callisto

Callisto was created to calculate the two metrics described above. It does this using the mutation report generated during a mutation testing session, which contains all necessary information. The design is largely based on the steps for calculating Quality defined by Estero-Botaro et al., as this is far more complicated than counting test case executions.

On the practical side Callisto is designed as a CLI tool and has been implemented in C#. As input Callisto accepts a JSON mutation report conforming to the open-source standard for mutation reports defined by Stryker Mutator. Callisto will then output the found Quality and number of test case executions for each used mutation operator, along with several other statistics, as a CSV file.

Evaluation

To assess the effectiveness of using mutation levels as a technique to speed up mutation testing an evaluation is performed where it is applied to an existing mutation testing framework: Stryker Mutator. Stryker is an open-source mutation testing framework consisting of three flavours: StrykerJS, Stryker.NET and Stryker4s. They provide mutation testing for JavaScript / TypeScript, C# and Scala respectively. For the evaluation StrykerJS is used, because at the time it was the only flavour able to provide a suitable mutation report to Callisto.

Program Name	Mutant Count	Mutation Score	Description
BigMath	1671	88.51%	A library for working with large numbers
CucumberJS	2677	66.23%	A test framework adding cucumber testing to JavaScript
Express	1984	88.77%	“Fast, unopinionated, minimalist web framework for Node”
Freecodecamp	8465	10.02%	FreeCodeCamp.org’s open-source codebase and curriculum
Minesweeper	398	97.14%	Well-known puzzle game implemented in JavaScript
Mutation Testing Elements	1170	74.28%	Metadata and visualisation for the mutation reports of Stryker
Nest	8779	51.85%	Framework for server-side Node.js apps
StrykerJS Core	3052	69.82%	Core module of StrykerJS
StrykerJS Instrumenter	1197	90.63%	Instrumenter module of StrykerJS

First several example programs to mutate with StrykerJS are found. These programs should be of sufficient size, so that enough mutants are generated, and come with an effective test suite, so that enough mutants are killed. This is needed, as Callisto can only analyse killed mutants, and manually expanding the test suites is outside the scope of this project. The example programs that were used can be seen in the table to the right.

All programs were found through GitHub and make use of Node, to ensure compatibility with StrykerJS. Several programs were found based on their popularity, such as CucumberJS, Express and Freecodecamp. Others were referred to the project because they already used StrykerJS, thus saving the time needed to set it up. Examples of these are Minesweeper, and the two modules of StrykerJS itself.

Not all example programs have good mutation scores, but they compensate for this with their size, seen by the number of mutants generated for these programs. For example, Freecodecamp only has a score of 10%, but with a mutant count of over 8000 there will still be enough killed mutants to analyse with Callisto.

To start the evaluation the programs are mutated using StrykerJS, and mutants are killed using the accompanying test suites. This results in mutation reports for each program, with detailed information about the mutants, that are then given to Callisto. Callisto then calculates the Quality of each killed mutant using the formula from literature by looking at how many test cases killed that mutant. The Quality of a mutation operator is then simply the average of the qualities of the individual mutants it generated. This forms the quantification of the resolution. Performance impact is quantified simultaneously by counting how many test case executions were necessary to test the mutants of an operator.

Now that the resolution and performance impact of mutation operators is known, mutation levels can be designed. Several techniques can be used for this, for example by removing performance heavy operators, or only including operators with a resolution above a certain threshold. Two metrics are introduced to analyse a mutation level once designed: the effectiveness and performance of mutation levels. Effectiveness is measured by looking at how many test cases are induced by the chosen mutation operators. A test case is induced by a mutant if it is needed to kill that mutant. The more test cases are induced by the mutants generated by the mutation level, the more effective the level is at creating high-quality test suites. The size of the test suite induced by the mutation level is divided by test suite size needed when using all mutation operators. This is the effectiveness as a percentage. The performance of a mutation level is easier to calculate: it is simply the percentage of test case executions that is saved when using the mutation level compared to using all mutation operators. Callisto is capable of calculating these two metrics for a given mutation level, within the context of an example program.

Results

First the results of the Quality and performance impact that Callisto calculated for the mutation operators used in the example programs is discussed. In total 56 mutation operators in StrykerJS were analysed. For an overview of the available mutations in StrykerJS, see here. The found Qualities range overall between 0.5 and 1, where Relational Operator Replacement (ROR) mutation operators generally score above 0.9, whereas string and block statement mutations score relatively low, around 0.6. This reflects how difficult it is to kill such mutants when mutation testing, where a high quality indicates it is difficult to kill. A complete overview of all mutation operators is too elaborate to show here.

The pie chart to the right shows the counted test case executions that were needed for each mutation operator, out of a total of 320,589. As can be seen, the top five operators account for over half of all executions, and therefore these operators have a high performance impact compared to others. Again string and block statement mutations stand out, as these type of mutants occur frequently, and therefore need many test case executions during mutation testing.

With these results several mutation levels are designed using simple techniques and intuition. The fewer mutation operators are included in a level, the more performance, as fewer test case executions are required. However, with fewer mutants the effectiveness of a level will decrease, as the induced test suite needed to kill them will become smaller. Designing a mutation level is therefore a game of removing many test case executions without decreasing the size of the induced test suite too much.

In total 17 mutation levels are designed. 6 of them establish a quality threshold and only include mutation operators with a quality above the threshold. Thresholds were set between 0.6 and 0.85, with intervals of 0.05. Other levels were designed by removing performance-heavy mutation operators. Several levels were also added that intuitively seem 'badly' designed, to see how they behave. Finally 4 custom levels were made that remove varying amounts of mutation operators. The goal of these was to keep related mutation operators (such as mutating a + to a - and vice versa) together in a level, as the other mutation levels did not respect this.

The graph above shows the effectiveness and performance that were determined using Callisto for the 17 designed mutation levels. A dotted average trend-line is added to better show the outliers. Both a high effectiveness and performance percentage are desired. Thus the closer a level is placed to the top-right corner of the graph, the better it is. Surprisingly, several of the 'badly' designed mutation levels outperformed all others. Most notably, 'OnlyBlockStatement' (a level only containing the BlockStatement operator) achieved a performance of 86% and an effectiveness of 63%, meaning it removes 86% of test case executions while retaining 63% of the test suite. Its high performance can be explained by the fact that only one operator is used. Its high effectiveness may be caused by the nature of the mutation. BlockStatement deletes blocks of code and can therefore be applied frequently throughout the code, thus needing many individual test cases to kill them all, leading to a high effectiveness.

Most of the other levels performed average. The threshold levels can mostly be found in order close to the trend-line. The 4 custom levels achieved mixed results, with custom 1 and 2 performing average, but 3 and 4 sub-par.

What is apparent from this graph is that all mutation levels can be found in the top-right half of the graph. This means that the performance percentage achieved is always higher than the effectiveness percentage lost compared to 100% effectiveness. This is a strong indication that mutation levels are a valid means to speed up mutation testing. The evaluation using StrykerJS is therefore deemed a success. In particular Custom 1 and 2 are recommended for use in StrykerJS, as they provide a more consistent mutation experience and still offer increased performance with a decent effectiveness. Two levels are offered, so that users of Stryker can choose how much effectiveness they wish to lose to gain performance.

Future work

There are multiple options for future work with this project, regarding both Callisto itself, and the experiments performed with it. The following list provides a few examples.

Performance impact is currently quantified by counting test case executions. This assumes that each test case takes equal time to execute, which is of course not true in reality. A better solution would be to measure the time needed for executing test cases and storing this in the mutation report, so that Callisto can use it.
The test suites that came with example programs were not expanded to achieve a mutation score of 100%, as this is too much work. However, this decision may have influenced the results of the project. It would be interesting to see how the results may change if only programs with a mutation score of (near) 100% are used.
Only mutation levels for StrykerJS were designed. When other flavours of Stryker, or even other frameworks support the mutation report standard on which Callisto is based, then the method of evaluation described here can be applied to design mutation levels for them as well.
The developers of Stryker would like to have the use of Callisto automated in a pipeline. Then all the steps leading to the design of mutation levels can be done automatically. For this a set of example programs should be set up so that Stryker can run and collect the resulting mutation reports. This can make use of the example programs used in this project. This will allow Callisto to be more easily used when for example new mutation operators are considered for Stryker.