Sunday, November 13, 2011

Testing Legacy Code - Test Driven Maintenance

In this post I will first briefly present the method of Shihab et al. for prioritising the creation of unit tests for legacy code, and then show how we can implement it using StatSVN and Gendarme (both tools are open source and available for Windows and Linux).


How to prioritise creation of unit tests for legacy code
When facing a large legacy system you have to write unit tests for, Michael Feathers in his excellent book Working Effectively with Legacy Code (0) tells us two important things:
  1. We build characterisation tests to tell us something about how the legacy code actually works.
  2. Before introducing any changes into the legacy code, build a corset of unit tests around the object to be changed, so that you have full control over any undesired behaviour.
But often not even these two principles help at the moment we have to decide which functions or objects to write tests for first. For example, your company might have decided to increase the quality of its code by creating unit tests: there are no recent changes to start from, and there is far too much code to write characterisation tests for every line.

The general recommendation found on the internet is to build unit tests incrementally: on the one hand when changes are introduced or about to be introduced, on the other hand whenever there is time to test some code.

But we still don't know which code to write tests for first. Shihab et al. (1) have found a possible solution for this problem of what they call Test-Driven Maintenance (TDM): 
"[...] we think it is extremely beneficial to study the adaption to TDD-like practices for maintenance of already implemented code, in particular for legacy systems. In this paper we call this 'Test-Driven Maintenance' (TDM)."
Their main idea is to prioritise by making use of history-based heuristics such as function size, modification frequency and fixing frequency. Here is the full list; I find the intuitions quite convincing:



Modifications
  • MFM (Most Frequently Modified): functions that were modified the most since the start of the project. Intuition: functions that are modified frequently tend to decay over time, leading to more bugs.
  • MRM (Most Recently Modified): functions that were most recently modified. Intuition: functions that were modified most recently are the ones most likely to have a bug in them (due to the recent changes).

Bug Fixes
  • MFF (Most Frequently Fixed): functions that were fixed the most since the start of the project. Intuition: functions that were frequently fixed in the past are likely to be fixed in the future.
  • MRF (Most Recently Fixed): functions that were most recently fixed. Intuition: functions that were fixed most recently are more likely to have a bug in them in the future.

Size
  • LM (Largest Modified): the largest modified functions, in terms of total lines of code (i.e. source, comment and blank lines). Intuition: large functions are more likely to have bugs than smaller functions.
  • LF (Largest Fixed): the largest fixed functions, in terms of total lines of code (i.e. source, comment and blank lines). Intuition: large functions that need to be fixed are more likely to have more bugs than smaller functions that are fixed less.

Risk
  • SR (Size Risk): the riskiest functions, defined as the number of bug fixing changes divided by the size of the function in lines of code. Intuition: since larger functions may naturally need to be fixed more often than smaller ones, the number of bug fixing changes is normalised by the size of the function. This heuristic will mostly point out relatively small functions that are fixed a lot (i.e. have a high defect density).
  • CR (Change Risk): the riskiest functions, defined as the number of bug fixing changes divided by the total number of changes. Intuition: a function that changed 10 times in total, 9 of them to fix a bug, should have a higher priority to be tested than a function that also changed 10 times but where only 1 of those changes was a bug fix.

Random
  • Random: randomly selects functions to write unit tests for. Intuition: randomly selecting functions to test can be thought of as a baseline scenario; therefore the random heuristic's performance is used as a baseline against which to compare the performance of the other heuristics.

In their article (1) they furthermore show that this approach really does have advantages over picking parts of the code randomly. Besides, there aren't many other known methods for prioritising; in fact, so far I haven't found anything else.


Some ideas about how to implement the TDM heuristics with StatSVN, Reflection and Gendarme
Of course, we don't have the tool Shihab et al. used at our disposal, and we don't have time to create one ourselves. But we can try to implement the approach partially.
The crucial point is to get hold of the project's history. If the project has been checked into a CVS or SVN repository, we can get that information from its log. In the following we assume an SVN repository.


In a shell we navigate to the working directory of the checked-out project and type
 
svn log -v --xml > logfile.log

to obtain the history needed. After that we generate StatSVN's documentation by typing 
 
java -jar /path/to/statsvn.jar /path/to/module/logfile.log /path/to/module


We do all this because StatSVN automatically produces a list of the 20 biggest files and the 20 files with the most revisions. Thus we implement MFM and LM (see above) at file level (not at object or function level).

Note that you would normally want to fine-tune StatSVN's output, e.g. use -include "**/*.cs" -exclude "DeletedClass.cs:ClassWithTests.cs" to include only source files and to exclude files that have been deleted (StatSVN will show them to you in your first run) or that already have unit tests.
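Put together, a fine-tuned run could then look like this (paths and file names are of course placeholders):

java -jar /path/to/statsvn.jar -include "**/*.cs" -exclude "DeletedClass.cs:ClassWithTests.cs" /path/to/module/logfile.log /path/to/module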

We can obtain information for the other heuristics from the other options StatSVN offers (e.g. Bugzilla integration), by talking to our developers, or through static code analysis.
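If you can't use the Bugzilla integration, a very rough approximation of MFF is to scan the commit messages in logfile.log for words like "fix" and count the files touched by those commits. The keyword matching is my own assumption, not part of Shihab et al.'s method; a small C# sketch:

using System;
using System.Linq;
using System.Xml.Linq;

class FixFrequency
{
    static void Main(string[] args)
    {
        // logfile.log as produced above by: svn log -v --xml > logfile.log
        XDocument log = XDocument.Load(args[0]);

        var fixCounts = log.Descendants("logentry")
            // keep only commits whose message looks like a bug fix (rough keyword heuristic)
            .Where(entry => ((string)entry.Element("msg") ?? "")
                .IndexOf("fix", StringComparison.OrdinalIgnoreCase) >= 0)
            // every path touched by such a commit gets one "fix" point
            .SelectMany(entry => entry.Descendants("path").Select(p => p.Value))
            .GroupBy(path => path)
            .OrderByDescending(g => g.Count());

        foreach (var g in fixCounts.Take(20))
            Console.WriteLine("{0,4}  {1}", g.Count(), g.Key);
    }
}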
Furthermore, using any reflection framework we can count the number of classes and their members. This gives us some useful information about complexity, if we accept the following intuition:

If an object has many methods (and complex properties) it will be more likely to contain a bug (and more lines per method).
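A minimal sketch of such counting with System.Reflection could look like this (the assembly path is passed on the command line; which members you count is of course up to you):

using System;
using System.Linq;
using System.Reflection;

class MemberCounter
{
    static void Main(string[] args)
    {
        // path of the assembly to inspect
        Assembly assembly = Assembly.LoadFrom(args[0]);

        var ranking = assembly.GetTypes()
            .Where(type => type.IsClass)
            .Select(type => new
            {
                type.FullName,
                Members = type.GetMethods().Length + type.GetProperties().Length
            })
            .OrderByDescending(x => x.Members);

        // the "fattest" classes come first: candidates for closer inspection
        foreach (var x in ranking.Take(20))
            Console.WriteLine("{0,4}  {1}", x.Members, x.FullName);
    }
}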

Finally, we use the static code analyser Gendarme (in the case of .NET assemblies), which tells us which code quality rules were violated in which object; and if we place the debugging symbols (.pdb, .mdb) next to the .dlls, it also reports the file name and line where the offending item is located, so we directly obtain another heuristic per file. Our intuition is:

If an object has design problems it is likely to have deeper errors, too.
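To get that per-file report, an invocation along the following lines should do; check gendarme --help, as the exact options may vary between versions:

gendarme --html report.html --xml report.xml path/to/YourAssembly.dll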

In the end we have quite a lot of heuristic information that will hopefully help us decide which parts of the legacy code to test first.
When I first implemented this approach, I quickly identified an object as critical because it came up in MFM, LM, Gendarme and the reflection-based counting. I later talked to a developer and, without knowing anything about my heuristics, he pointed at exactly that object and said it was critical that we had some tests for it. This convinced me that these heuristics are a valuable way of prioritising test case creation.

Having talked a lot about test planning, finally a little video about writing the test code: Gerard Meszaros himself gives a very interesting presentation about xUnit Test Patterns (2).




(0) Michael C. Feathers: Working Effectively with Legacy Code, Pearson Education, 2005
(1) E. Shihab, Z. Jiang, B. Adams, A. E. Hassan and R. Bowerman: Prioritizing the Creation of Unit Tests in Legacy Software Systems, in Software: Practice and Experience, 2011
(2) Gerard Meszaros: xUnit Test Patterns: Refactoring Test Code, Pearson Education, 2007

Saturday, October 29, 2011

Accessibility and testability

This year's winner for the best presentation at QA&Test 2011 in Bilbao was Julian Harty, and I am totally convinced that it was the right choice. Julian not only presented in a very clear style, combining theory with practical examples, but also offered a totally new view of accessibility testing (@testinggeek.com).

Does your company care about accessibility issues in its products?

We don't! And I don't know of anyone else who does. It's expensive and normally not worth the effort. Public institutions, of course, need to care about this, but private companies? Why should they care?

Julian's brilliant idea consists in using accessibility technology to improve the testability of our products! (see his slides, 18)

Brilliant, of course, because he found further uses for accessibility, namely testability (and search engine optimisation, see his slides, 18). Everybody is happy and the product gets improved in various ways: more SEO for management, more testability for testers, better usability and accessibility for the users.

In Germany we have a saying for this: it's like a Kinder Surprise!
(because it contains, as advertised, three things at the same time: amazement, chocolate AND a toy :-) )

Monday, October 10, 2011

Effort estimates - be brave, invent your own

Inspired by Michael Feathers' Working Effectively With Legacy Code (on google books) I felt brave enough to confront my latest "homework" in testing: "Write a detailed test plan about how to test that code!" The technical term, I understand, is "retrofitting tests for OO code".

Now, unfortunately I was rather constrained, because some premises were set up that made much of Feathers' advice impossible to follow:

1.) Tests will be separated from the developers' code, into their own projects.
2.) Only public members are relevant.
3.) You cannot break dependencies.
4.) You need to mainly write characterisation tests.
5.) My boss needs an idea about cost estimation.


So how did I start? Well, first I used that good old tranquilizer: counting. I wrote a program using System.Reflection (.NET) to list and count the following per project (a rough sketch follows the list):

1.) Non-abstract classes (without enums), giving t (note that in System.Reflection static classes are typed as sealed and abstract)
2.) Methods (and constructors), giving m
3.) Properties and fields (without constants), giving p
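Here is a rough sketch of that counting program; the exact binding flags (only public members, in line with premise 2) and exclusions are my interpretation of the list above:

using System;
using System.Linq;
using System.Reflection;

class EffortCounter
{
    static void Main(string[] args)
    {
        // one assembly per project, its path passed on the command line
        Assembly assembly = Assembly.LoadFrom(args[0]);

        // only public members, in line with premise 2 above
        BindingFlags flags = BindingFlags.Public | BindingFlags.Instance |
                             BindingFlags.Static | BindingFlags.DeclaredOnly;

        // t: non-abstract classes; IsClass excludes enums, and since static
        // classes are sealed *and* abstract they drop out here as well
        var classes = assembly.GetTypes()
            .Where(type => type.IsClass && !type.IsAbstract)
            .ToList();
        int t = classes.Count;

        // m: methods and constructors
        int m = classes.Sum(type => type.GetMethods(flags).Length +
                                    type.GetConstructors(flags).Length);

        // p: properties and fields, constants (literal fields) excluded
        int p = classes.Sum(type => type.GetProperties(flags).Length +
                                    type.GetFields(flags).Count(f => !f.IsLiteral));

        Console.WriteLine("t = {0}, m = {1}, p = {2}", t, m, p);
    }
}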

I then loosely mimicked a formula I found on the internet (mathematik.uni-ulm.de) for calculating the effort on the following basis:

1.) A first estimate cannot count every testing state (of the input parameters / the object under test), nor every dependency, without digging too deep.
2.) There's a stable average of testing states and complexity.
3.) Objects will be used as parameters, too.
4.) My boss wants a simple answer, not a maths thesis on the subject, things need to get done.

After much trying, and after demanding the attention of a very good and patient friend of mine (thank you, Ewald!), I came back to the formula I started from:

testing_effort = p*m*t*TEU


(where p, m, t are as above and TEU refers to a "testing effort unit", which needs to be interpolated from an earlier project)


When I applied that formula to a recent unit testing project to interpolate TEU, and after that used it to calculate a first estimate, I could be sure this wasn't the right way: my results predicted gigantic resources we simply don't have.
So the next step was some static code analysis targeting the question of how to reduce the test basis. (Remember: "We cannot test everything.")

Some sensible criteria I then set up with one of our developers were, for example:

1.) If there are two overloads, check whether they really do something different; if not, kick one out (of your calculation).
2.) If there are e.g. 20 read-only properties of type string, count them as 1 property of type List<string>.
3.) If two classes inherit from the same abstract class, only count what is overridden / where they differ, and so on.

The good thing about reading the source code is that you get quite a good picture of the importance of certain classes so you can prioritize the components / units at the same time.

So then I had a final list and different counts; surely my formula would now give better results:

total_testing_effort_1 = \sum_{i=1,...,7} te_i


(where te_i is the testing effort of assembly i, te_i = p_i*m_i*t_i, again multiplied by TEU to get hours)


The ridiculous result was: 37h. Thirty-seven hours for seven quite complicated assemblies (e.g. there was one object that could only be created by another which itself depended on a third one).

Still I didn't panic - it's just numbers!

So I thought: to account for the dependencies my assemblies have on each other, I can instead calculate the testing effort from the total numbers, thus, by multiplication, establishing a stronger relation between those numbers, that is

total_testing_effort_2 = (\sum t_i)*(\sum p_i)*(\sum m_i)


resulting in another ridiculous estimate of 1039 hours.

Knowing that the plan had to be written, and that I couldn't live with 37 h and my boss couldn't live with 1039 h, I had to find some number acceptable to both of us, i.e. some kind of mean. The plain arithmetic average obviously isn't good enough (it is dominated by the large estimate), so I used the geometric mean of the two:

total_testing_effort = sqrt(total_testing_effort_1*total_testing_effort_2)


And there it was: a magical 196.1 h, approximately 25 days of 8 h work.
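Putting the three formulas next to each other, a throwaway sketch with completely made-up counts and a made-up TEU might look like this:

using System;
using System.Linq;

class EffortFormulas
{
    static void Main()
    {
        // purely hypothetical (t, p, m) counts per assembly; replace with your own
        var counts = new[]
        {
            new { t = 5,  p = 40, m = 30 },
            new { t = 8,  p = 55, m = 45 },
            new { t = 12, p = 90, m = 70 }
        };

        // testing effort unit in hours, interpolated from an earlier unit testing project
        double teu = 0.001;

        double effort1 = counts.Sum(c => (double)c.t * c.p * c.m) * teu;
        double effort2 = (double)counts.Sum(c => c.t) *
                         counts.Sum(c => c.p) *
                         counts.Sum(c => c.m) * teu;

        Console.WriteLine("total_testing_effort_1 = {0:F1} h", effort1);
        Console.WriteLine("total_testing_effort_2 = {0:F1} h", effort2);
        Console.WriteLine("geometric mean         = {0:F1} h", Math.Sqrt(effort1 * effort2));
    }
}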

You think this is all hocus pocus? Maybe you're right. But I've found the following to be true, too:

1.) There is an effort estimate we feel good about.
2.) My boss understands that the code isn't as easy to test as he had thought, and will probably be more forgiving when not "everything" has been tested.

Of course, I still hope my baby will turn out to give good approximations over time :-)

Do you have other means of estimating costs / setting up test plans for legacy code? Found mistakes or nonsense? Please comment!

Wednesday, August 31, 2011

Pass-fail-criteria - testing a hash function

I recently had to test a hash function. Let me explain:

Hash function
Mathematically, a hash function is a mapping h : D -> V that assigns to each datum in D a value in V, where V is normally a subset of the natural numbers {0,1,2,...} or a finite sum of such subsets.

The purpose is to have an easy way to identify data of arbitrary type; the hash value can be thought of as a fingerprint of my piece of data. For example, I want to save images in a table, but I don't want my database to blow up, so I only allow one copy of each image to be saved. How do I recognise whether I already have a copy in my table?

If d is my image and v = h(d), then I can use v as the key in my table. If I then want to save another image d' and find that v' = h(d') is already a key in my table, I know I have already saved it and don't need to upload it again.
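In code, the idea is simply to use the hash value as the key of a dictionary. A tiny C# sketch; the Hash method here is a deliberately silly stand-in, it only illustrates the bookkeeping:

using System;
using System.Collections.Generic;

class ImageTable
{
    // hash value -> image data
    static readonly Dictionary<int, byte[]> table = new Dictionary<int, byte[]>();

    // stand-in for the real hash function (the real one is the object under test)
    static int Hash(byte[] data)
    {
        return data.Length; // deliberately silly, for illustration only
    }

    // returns false if an image with the same fingerprint was already stored
    static bool Save(byte[] image)
    {
        int key = Hash(image);
        if (table.ContainsKey(key))
            return false;
        table.Add(key, image);
        return true;
    }

    static void Main()
    {
        byte[] a = { 1, 2, 3 };
        byte[] b = { 4, 5, 6 };            // a different image...
        Console.WriteLine(Save(a));        // True
        Console.WriteLine(Save(b));        // False: a collision with the silly stand-in hash
    }
}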

But hash functions normally aren't (and in practice virtually cannot be) injective, which is the mathematical term for uniqueness of the fingerprints: injectivity would mean that every piece of data I hash gets its own unique fingerprint.

If two pieces of data have the same fingerprint, this is called a collision. If we can't avoid collisions completely, we at least want them to occur almost never.

Test a hash function
"Almost never" is clearly a relative term. If I have only 10 pieces of data, I cannot accept that two of them share a fingerprint; that is, I demand 100% confidence. You could also feel that you can accept 2 out of 100000 pieces of data having the same hash value; this would be 1 - 2/100000 = 99.998% confidence.

I didn't have any specification telling me what confidence or quality my hash function had to fulfil. The function has the form:

hash(str string, range int): hashkey int, i.e. each string (of arbitrary length) is encoded into a natural number in {0,1,...,range}.

So I reasoned:
  • If I have a confidence of 99.99%, I personally am fine with my test, as long as
  • there is a way to improve the confidence by tuning the input of my hash function.
Admittedly 99.99% isn't too much. See discussion below.

So I took 50000 distinct strings (randomly generated), hashed them and then looked at how many different hash keys were produced. Mathematically speaking, I calculated:

test_confidence_1 = #h({set of 50000 distinct strings}, 50000) / 50000, where # means 'number of'.

I got a pretty bad result (67% confidence), which is about what one should expect when hashing 50000 arbitrary strings into a range of only 50000 values (even a perfectly random hash would leave only about 1 - 1/e, roughly 63%, of the keys distinct), but there was a way to improve the test confidence:

test_confidence_2 = #h({set of 50000 distinct strings}, 50000*1000) / 50000.
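As a sketch, the whole measurement fits into a few lines of C#; the Hash method is only a stand-in for the function under test, and the random string generator is my own:

using System;
using System.Collections.Generic;
using System.Linq;

class HashConfidence
{
    // stand-in for the hash function under test: hash(str, range)
    static int Hash(string str, int range)
    {
        return (str.GetHashCode() & 0x7fffffff) % range;
    }

    static void Main()
    {
        const int n = 50000;
        var random = new Random(42);

        // n distinct, randomly generated strings
        var strings = new HashSet<string>();
        while (strings.Count < n)
            strings.Add(new string(Enumerable.Range(0, 20)
                .Select(_ => (char)random.Next('a', 'z' + 1)).ToArray()));

        foreach (int range in new[] { n, n * 1000 })
        {
            // number of distinct hash keys divided by the number of strings
            int distinctKeys = strings.Select(s => Hash(s, range)).Distinct().Count();
            Console.WriteLine("range = {0,10}: confidence = {1:P3}",
                              range, (double)distinctKeys / n);
        }
    }
}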

The enlarged range gave me 99.995% confidence, so I filed a "bug report" about the results. Hopefully the developers will read it and improve the hash function, or at least document the confidence for future users of this hash function.

I learned what a hash function is, how I could test it, and finally:
  • As a tester, sometimes you will have to set up your own pass-fail criteria.
Management and development can and hopefully will discuss your results and criteria with you later. That's okay. But you cannot do your work while waiting indefinitely for some specification. You can also use your own criteria to provoke that later discussion; this is why I chose my required test confidence just high enough to make the test fail.

Friday, August 12, 2011

Overflow without warning

Testing can also be a kind of learning to program. For example, there is this programming, or rather transformation, language for structured (text) data called MetaMorphosis.
As I was trying to figure out how that scripting language works, I found myself with little documentation about its data types. As a mathematician with an interest in prime number testing I'm always interested in big integers, as in Java or C#. So I wrote the following, hoping the loop would NOT end with an overflow:

*count := 1, *lastcount:=*count,
!while *count >= *lastcount !do
   (*count).echo,
   *lastcount := *count,
   *count := 10*(*count)
!end


The loop really didn't end by itself or throw any error. But what I saw on my screen somewhat surprised me. When I paused it I found that 
 
1946039808 <= -1348927488  

seemed to be true. So I asked in pace mode (option -p):

> i *result := (-1348927488) - 1946039808;
> i *result;
> 1000000000


So, supposing that a < b is evaluated as true if and only if 0 < b - a, we can see why 1946039808 <= -1348927488 comes out as true.
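The same effect is easy to reproduce in C#, where integer arithmetic is unchecked by default and silently wraps around; this is only an analogy, not MetaMorphosis code:

using System;

class OverflowComparison
{
    static void Main()
    {
        int a = 1946039808;
        int b = -1348927488;

        // mathematically -3294967296, but 32-bit arithmetic silently wraps
        // around 2^32 and yields 1000000000
        int difference = b - a;
        Console.WriteLine(difference);     // 1000000000

        // a "compare via subtraction" therefore gets it wrong...
        Console.WriteLine(difference > 0); // True, i.e. it claims a < b

        // ...while the built-in comparison is fine
        Console.WriteLine(a < b);          // False
    }
}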
What have I learned? Well, first, you cannot rely on a programming language to tell you about its limitations. Second, overflows do not only influence the results when we calculate with numbers, say when subtracting or adding, but reach further, as seen here in the comparison of two numbers.

It only takes two numbers a, b with a - b > max(number domain) and we can no longer be sure that a < b will give the expected result.