Results
First, the quality and performance impact that Callisto calculated for the mutation operators used in the example programs are discussed. In total, 56 mutation operators in StrykerJS were analysed. For an overview of the available mutations in StrykerJS, see the overview of supported mutators in the Stryker documentation. The qualities found range between 0.5 and 1: Relational Operator Replacement (ROR) mutation operators generally score above 0.9, whereas string and block statement mutations score relatively low, around 0.6. Quality reflects how difficult it is to kill the mutants an operator produces during mutation testing, with a high quality indicating mutants that are difficult to kill. A complete overview of all mutation operators is too extensive to show here.
The pie chart to the right shows the number of test case executions needed for each mutation operator, out of a total of 320,589. The top five operators account for over half of all executions and therefore have a high performance impact compared to the others. String and block statement mutations stand out again: these types of mutants occur frequently and therefore require many test case executions during mutation testing.
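The "top five operators account for over half" observation boils down to ranking operators by execution count and summing the largest shares. A minimal sketch of that calculation follows; the operator names and counts are hypothetical placeholders, not the measured data behind the pie chart, and `top_k_share` is a helper name introduced here for illustration.

```python
from typing import Dict, List, Tuple


def top_k_share(executions: Dict[str, int], k: int) -> Tuple[List[str], float]:
    """Return the k operators with the most executions and their combined share."""
    total = sum(executions.values())
    # Rank operator names by their execution count, highest first.
    ranked = sorted(executions, key=lambda name: executions[name], reverse=True)
    top = ranked[:k]
    share = sum(executions[name] for name in top) / total
    return top, share


# Hypothetical counts for illustration only (NOT the measured 320,589 executions).
example_counts = {"op_a": 400, "op_b": 300, "op_c": 150, "op_d": 100, "op_e": 50}

top_ops, top_share = top_k_share(example_counts, 2)
print(top_ops, f"{top_share:.0%}")
```

With real per-operator counts, calling `top_k_share(counts, 5)` would give the share that the five heaviest operators contribute to the total.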
Using these results, several mutation levels were designed with simple techniques and intuition. The fewer mutation operators a level includes, the better its performance, as fewer test case executions are required. However, with fewer mutants the effectiveness of a level decreases, as the induced test suite needed to kill them becomes smaller. Designing a mutation level is therefore a matter of removing as many test case executions as possible without shrinking the induced test suite too much.
In total, 17 mutation levels were designed. Six of them set a quality threshold and include only mutation operators with a quality above that threshold. Thresholds range from 0.6 to 0.85 in steps of 0.05. Other levels were designed by removing performance-heavy mutation operators. Several levels that intuitively seem 'badly' designed were also added, to see how they behave. Finally, four custom levels were made that remove varying numbers of mutation operators. The goal of these was to keep related mutation operators (such as mutating a + to a - and vice versa) together in a level, as the other mutation levels did not respect this.
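The threshold-based levels can be sketched as a simple filter over per-operator quality scores. The snippet below is an illustration under assumptions: the operator names and quality values are invented placeholders, "above the threshold" is read as a strict comparison, and `threshold_levels` is a hypothetical helper, not part of Callisto or StrykerJS.

```python
from typing import Dict, List

# Thresholds 0.60, 0.65, ..., 0.85, built from integers to avoid float drift.
THRESHOLDS: List[float] = [t / 100 for t in range(60, 90, 5)]


def threshold_levels(
    quality: Dict[str, float], thresholds: List[float]
) -> Dict[float, List[str]]:
    """For each threshold, keep only the operators whose quality exceeds it."""
    return {
        t: sorted(op for op, q in quality.items() if q > t)
        for t in thresholds
    }


# Hypothetical quality scores for illustration only.
example_quality = {
    "ror_example": 0.93,
    "arithmetic_example": 0.78,
    "block_example": 0.62,
    "string_example": 0.60,
}

levels = threshold_levels(example_quality, THRESHOLDS)
for threshold, operators in levels.items():
    print(threshold, operators)
```

Each higher threshold yields a subset of the previous level, which fits the observation that these levels form an ordered sequence.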
The graph above shows the effectiveness and performance that were determined using Callisto for the 17 designed mutation levels. A dotted average trend line is added to make the outliers easier to see. Both a high effectiveness and a high performance percentage are desired, so the closer a level lies to the top-right corner of the graph, the better it is. Surprisingly, several of the 'badly' designed mutation levels outperformed all others. Most notably, 'OnlyBlockStatement' (a level containing only the BlockStatement operator) achieved a performance of 86% and an effectiveness of 63%, meaning it removes 86% of test case executions while retaining 63% of the test suite. Its high performance can be explained by the fact that only one operator is used. Its high effectiveness may be caused by the nature of the mutation: BlockStatement deletes blocks of code and can therefore be applied frequently throughout the code, so many individual test cases are needed to kill all of the resulting mutants.
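Read this way, the two percentages are simple ratios against the full operator set: performance is the share of test case executions removed, and effectiveness is the share of the induced test suite retained. The sketch below follows that reading; how Callisto computes these values internally is not described here, so the function names and the input numbers are assumptions chosen only to reproduce the 86% and 63% figures.

```python
def performance(full_executions: int, level_executions: int) -> float:
    """Share of test case executions removed by the level (0.0 to 1.0)."""
    return 1 - level_executions / full_executions


def effectiveness(full_suite_size: int, level_suite_size: int) -> float:
    """Share of the induced test suite that the level still retains (0.0 to 1.0)."""
    return level_suite_size / full_suite_size


# Hypothetical inputs chosen to match the reported percentages.
perf = performance(full_executions=1000, level_executions=140)
eff = effectiveness(full_suite_size=100, level_suite_size=63)
print(f"performance={perf:.0%} effectiveness={eff:.0%}")
```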
Most of the other levels achieved average results. The threshold levels mostly lie close to the trend line, in order of their threshold. The four custom levels achieved mixed results: Custom 1 and 2 performed around average, whereas Custom 3 and 4 were below par.
What is apparent from this graph is that all mutation levels lie in the top-right half of the graph. This means that the performance percentage gained is always higher than the effectiveness percentage lost relative to 100% effectiveness. This is a strong indication that mutation levels are a valid means of speeding up mutation testing. The evaluation using StrykerJS is therefore deemed a success. In particular, Custom 1 and 2 are recommended for use in StrykerJS, as they provide a more consistent mutation experience and still offer increased performance with decent effectiveness. Two levels are offered so that users of Stryker can choose how much effectiveness they are willing to give up to gain performance.
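The "top-right half" condition can be written as a one-line check: a level lies there when its performance percentage exceeds the effectiveness it gives up, which is the same as performance plus effectiveness exceeding 100. The sketch below applies this to the OnlyBlockStatement figures from the text; the function names and the second, failing data point are hypothetical.

```python
def net_gain(performance_pct: float, effectiveness_pct: float) -> float:
    """Performance gained minus effectiveness lost, in percentage points."""
    effectiveness_lost = 100 - effectiveness_pct
    return performance_pct - effectiveness_lost


def in_top_right_half(performance_pct: float, effectiveness_pct: float) -> bool:
    """True when the level gains more performance than it loses effectiveness."""
    return net_gain(performance_pct, effectiveness_pct) > 0


# OnlyBlockStatement as reported: 86% performance, 63% effectiveness.
print(net_gain(86, 63), in_top_right_half(86, 63))

# Hypothetical counter-example: little performance gained, much effectiveness lost.
print(net_gain(30, 60), in_top_right_half(30, 60))
```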