Sunday, November 13, 2011

Testing Legacy Code - Test Driven Maintenance

In this post I will first briefly present the method of Shihab et al. for prioritising the creation of unit tests for legacy code, and then show how we can implement it using StatSVN and Gendarme (both tools are open source and available for Windows and Linux).

How to prioritise creation of unit tests for legacy code
When facing a large legacy system you have to write unit tests for, Michael Feathers in his excellent book Working Effectively with Legacy Code (0) tells us two important things:
  1. Write characterisation tests, which capture and document how the legacy code currently behaves.
  2. Before introducing any change into the legacy code, build a safety net of unit tests around the object to be changed, so that you have full control over any undesired change in behaviour.
But often not even these two principles help at the moment we have to decide which functions or objects to write tests for. For example, your company might have decided to increase the quality of its code by creating unit tests, but there are no recent changes in the code, and there is far too much of it to write characterisation tests for every line.

The general recommendation found on the internet is to build unit tests incrementally: on the one hand when changes are introduced or about to be introduced, on the other hand whenever there is time to test some code.

But we still don't know which code to write tests for first. Shihab et al. (1) propose a solution to this problem of what they call Test-Driven Maintenance (TDM):
"[...] we think it is extremely beneficial to study the adaption to TDD-like practices for maintenance of already implemented code, in particular for legacy systems. In this paper we call this 'Test-Driven Maintenance' (TDM)."
Their main idea is to prioritise using history-based heuristics such as function size, modification frequency, and fixing frequency. Here is the full list; I find the intuitions quite convincing:


Modifications

Most Frequently Modified (MFM)

Functions that were modified the most since the start of the project.
Functions that are modified frequently tend to decay over time, leading to more bugs.

Most Recently Modified (MRM)

Functions that were most recently modified.
Functions that were modified most recently are the ones most likely to have a bug in them (due to the recent changes).

Bug Fixes

Most Frequently Fixed (MFF)

Functions that were fixed the most since the start of the project.
Functions that were frequently fixed in the past are likely to be fixed in the future.

Most Recently Fixed (MRF)

Functions that were most recently fixed.
Functions that were fixed most recently are more likely to have a bug in them in the future.

Size

Largest Modified (LM)

The largest modified functions, in terms of total lines of code (i.e. source, comment and blank lines).
Large functions are more likely to have bugs than smaller functions.

Largest Fixed (LF)

The largest fixed functions, in terms of total lines of code (i.e. source, comment and blank lines).
Large functions that need to be fixed are more likely to have more bugs than smaller functions that are fixed less.

Risk

Size Risk (SR)

Riskiest functions, defined as the number of bug-fixing changes divided by the size of the function in lines of code.
Since larger functions may naturally need to be fixed more often than smaller ones, we normalise the number of bug-fixing changes by the size of the function. This heuristic will mostly point out relatively small functions that are fixed a lot (i.e. that have a high defect density).

Change Risk (CR)

Riskiest functions, defined as the number of bug-fixing changes divided by the total number of changes.
For example, a function that changed 10 times in total, 9 of those times to fix a bug, should have a higher priority to be tested than a function that changed 10 times where only 1 of those changes was a bug fix.

Random

Randomly selected functions to write unit tests for.
Randomly selecting functions to test can be thought of as a baseline scenario. Therefore, the random heuristic's performance serves as a baseline against which to compare the performance of the other heuristics.
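To make the risk heuristics concrete, here is a minimal sketch in Python that ranks functions by Change Risk and Size Risk. The function names and the per-function statistics are made up for illustration; in practice they would come from the version-control history.

```python
# Rank functions by two of the heuristics above, given per-function
# statistics extracted from the version-control history.
# The sample numbers are hypothetical.

stats = {
    # function: (total changes, bug-fixing changes, lines of code)
    "ParseHeader":  (10, 9, 40),
    "RenderReport": (10, 1, 400),
    "SaveSettings": (4,  2, 20),
}

def change_risk(changes, fixes, loc):
    # bug-fixing changes normalised by the total number of changes
    return fixes / changes

def size_risk(changes, fixes, loc):
    # bug-fixing changes normalised by the size of the function
    return fixes / loc

def ranked(heuristic):
    # highest-risk functions first
    return sorted(stats, key=lambda f: heuristic(*stats[f]), reverse=True)

print(ranked(change_risk))  # → ['ParseHeader', 'SaveSettings', 'RenderReport']
print(ranked(size_risk))    # → ['ParseHeader', 'SaveSettings', 'RenderReport']
```

Note how both risk heuristics put the small, frequently fixed ParseHeader ahead of the much larger but rarely fixed RenderReport, exactly the prioritisation the paper's intuition describes.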

In their article (1) they furthermore show that this approach really does have advantages over picking parts of the code at random. There also aren't many other known methods for prioritising; in fact, so far I haven't found anything else.

Some ideas about how to implement the TDM heuristics with StatSVN, Reflection and Gendarme
Of course, we don't have the tool Shihab et al. used, and we don't have time to create one ourselves. But we can implement the approach partially.
The crucial point is to get hold of the project's history. If the project has been checked into a CVS or SVN repository, we can get that information from the repository log. Suppose we have an SVN repository.

In a shell we navigate to the working directory of the checked out project and we type
svn log -v --xml > logfile.log

to obtain the history needed. After that we generate StatSVN's report by typing
java -jar /path/to/statsvn.jar /path/to/module/logfile.log /path/to/module

We do all this because StatSVN automatically produces a list of the 20 largest files and the 20 files with the most revisions. Thus we implement MFM and LM (see above) at the file (not object or function) level.
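If you want the raw revision counts rather than StatSVN's report, a small script can parse the `svn log -v --xml` output directly. This is a minimal sketch assuming the standard structure of that XML (logentry elements containing a paths list); the embedded sample log is made up, and in practice you would read logfile.log instead.

```python
# Count revisions per file from `svn log -v --xml` output,
# i.e. the Most Frequently Modified heuristic at file level.
import xml.etree.ElementTree as ET
from collections import Counter

# Tiny hypothetical sample; in practice: ET.parse("logfile.log")
log_xml = """<?xml version="1.0"?>
<log>
  <logentry revision="2">
    <msg>refactoring</msg>
    <paths>
      <path action="M">/trunk/Parser.cs</path>
      <path action="M">/trunk/Report.cs</path>
    </paths>
  </logentry>
  <logentry revision="1">
    <msg>initial import</msg>
    <paths>
      <path action="A">/trunk/Parser.cs</path>
    </paths>
  </logentry>
</log>"""

counts = Counter()
for entry in ET.fromstring(log_xml).iter("logentry"):
    for path in entry.iter("path"):
        counts[path.text] += 1

# Files with the most revisions come first.
print(counts.most_common())  # → [('/trunk/Parser.cs', 2), ('/trunk/Report.cs', 1)]
```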

Note that you will normally want to fine-tune StatSVN's output, e.g. use -include "**/*.cs" -exclude "DeletedClass.cs:ClassWithTests.cs" to include only source files and to exclude files that have been deleted (StatSVN will list them in your first run) or that already have unit tests.

We can obtain information for the other heuristics from the other options StatSVN offers (e.g. its Bugzilla integration), by talking to our developers, or through static code analysis.
Furthermore, using any reflection framework we can count the number of classes and their members. This gives us some useful information about complexity, if we accept the following intuition:
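If no bug tracker integration is available, the fix-related heuristics can be roughly approximated by scanning commit messages for fix-related keywords. This is a sketch under that assumption; the keyword pattern and the sample history are hypothetical, and real data would come from the parsed svn log.

```python
# Approximate the Most Frequently Fixed heuristic (at file level) by
# treating any commit whose message mentions "fix" or "bug" as a bug fix.
# Keyword list and sample history are assumptions for illustration.
import re
from collections import Counter

# (commit message, files touched by the commit) -- made-up sample
history = [
    ("fix crash in parser",       ["Parser.cs"]),
    ("add report feature",        ["Report.cs"]),
    ("Fixed issue #42 in parser", ["Parser.cs", "Report.cs"]),
]

FIX_PATTERN = re.compile(r"\b(fix(es|ed)?|bug)\b", re.IGNORECASE)

fix_counts = Counter()
for message, files in history:
    if FIX_PATTERN.search(message):
        for f in files:
            fix_counts[f] += 1

# Most frequently fixed files come first.
print(fix_counts.most_common())  # → [('Parser.cs', 2), ('Report.cs', 1)]
```

Keyword matching is of course noisy, but the paper's heuristics only need a ranking, not exact counts, so a rough approximation is often good enough.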

If an object has many methods (and complex properties) it is more likely to contain a bug (and more lines per method).
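The post targets .NET reflection; here is the analogous idea as a minimal Python sketch, using the standard inspect module to count the methods of each class as a rough complexity proxy. The class names are hypothetical.

```python
# Count methods per class via reflection and rank classes by that count,
# following the intuition that many-membered objects deserve tests first.
import inspect

class SmallClass:
    def do_one_thing(self):
        pass

class BigClass:
    def load(self): pass
    def save(self): pass
    def validate(self): pass
    def render(self): pass

def method_count(cls):
    # plain functions defined on the class (inherited slot wrappers excluded)
    return len(inspect.getmembers(cls, predicate=inspect.isfunction))

classes = [SmallClass, BigClass]
# Classes with the most methods come first in the test-writing queue.
for cls in sorted(classes, key=method_count, reverse=True):
    print(cls.__name__, method_count(cls))
```

In a real .NET project the same counts would come from a reflection framework iterating over an assembly's types and their members.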

And finally we use the static code analyser Gendarme (in the case of .NET assemblies), which tells us which code-quality rules were violated in which object; if we place the debugging symbols (.pdb, .mdb) next to the assemblies (.dll), it also reports the file name and line where each item is located, thus directly giving us another heuristic per file. Our intuition is:

If an object has design problems, it is likely to have deeper errors as well.

In the end we will have quite a lot of heuristic information that will hopefully help us decide which parts of the legacy code to test first.
When I first implemented this approach, I quickly identified one object as critical because it came up in MFM, LM, the Gendarme report and the reflection-based member count. I later talked to a developer and, without knowing about my heuristics, he pointed at exactly that object and said it was critical that we had some tests for it. This convinced me that these heuristics are a valuable way of prioritising test case creation.

Having talked a lot about test planning, finally a little video about writing the test code: Gerard Meszaros himself gives a very interesting presentation about xUnit Test Patterns (2).

(0) Michael C. Feathers: Working Effectively with Legacy Code, Pearson Education, 2005.
(1) E. Shihab, Z. M. Jiang, B. Adams, A. E. Hassan and R. Bowerman: Prioritizing the Creation of Unit Tests in Legacy Software Systems, Software: Practice and Experience, 2011.
(2) Gerard Meszaros: xUnit Test Patterns: Refactoring Test Code, Pearson Education, 2007.
