Monday, July 9, 2012

History Based Heuristics for Regression Testing

Common testing knowledge - see for example this article - recommends that you:
  1. Identify test areas of your product
  2. Select and add test cases based on the identified areas
  3. Add some basic test cases (some kind of smoke test)
This really is a good way to get the job done. In a legacy environment, however, where the testing process can be expected to have a much shorter lifetime than the product under test, you might want some more testing to be done.



Imagine regression testing as part of a bigger test iteration planned for your next product release. With steps 1 through 3 you can only check whether recently modified items - those changed since the last release - continue to work as expected, so you are still left with a blind spot: all those things that have never been tested.

How to approach this? You might want to organise a bug bash. Or you might consider expanding step 3 to more test areas. But which would those be? If you work in a legacy environment, you probably at least have a bug tracking system (BTS) at your disposal that has been around since the beginning of your company's development process (or at least longer than your testing documentation).

The idea is quite simple and is analogous to the one used to identify test items for test-driven maintenance.

First we identify key numbers that we can normally obtain by querying the underlying database of our BTS:

  • D - Number of Duplicate Bug Reports: Duplicates come up if several users have found an issue in different contexts. The more duplicates a bug report has, the more important it is for us to identify the underlying area.
  • R - Number of Related Bug Reports: The more related bug reports are known, the bigger the test area must be, and therefore the more important the bug report itself.
  • C - Number of Comments: Many comments indicate a matter worth discussing. This number also includes status changes (like New => Confirmed).
  • S - Summary Length: Although a long, unconcise bug summary may be a symptom of bad style, we assume that more complex bugs need a longer summary.
  • B - Description Length: Same as for S: we assume that more complicated issues need a longer description / report.
  • P - Priority (mapped to a numeric value): There might be different priorities (internal, external), but after all there is a reason why someone decided to assign such a priority.
  • T - Creation Date: If the first bug report ever has status fixed and the bug hasn't come up again after n years, we assume the feature to be stable. Also, if we tested a feature in our last internal release test ("recent date"), we might not need to re-test it again.
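To make this concrete, here is a minimal Python sketch of how such key numbers could be pulled from the BTS database. The schema (table and column names) is purely an assumption for illustration; a real tracker will look different:

    import sqlite3

    # Hypothetical BTS schema: the table and column names below are assumptions
    # made for this sketch; adjust them to whatever your tracker actually stores.
    QUERY = """
    SELECT
        b.id,
        (SELECT COUNT(*) FROM duplicates d WHERE d.original_id = b.id) AS D,
        (SELECT COUNT(*) FROM relations  r WHERE r.bug_id      = b.id) AS R,
        (SELECT COUNT(*) FROM comments   c WHERE c.bug_id      = b.id) AS C,
        LENGTH(TRIM(b.summary))     AS S,
        LENGTH(TRIM(b.description)) AS B,
        b.priority                  AS P,
        b.created_at                AS T
    FROM bugs b
    """

    def fetch_key_numbers(db_path):
        """Return one dict of raw key numbers (D, R, C, S, B, P, T) per bug report."""
        with sqlite3.connect(db_path) as conn:
            conn.row_factory = sqlite3.Row
            return [dict(row) for row in conn.execute(QUERY)]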

From all those numbers we can derive values indicating the relevance for our regression test.
Assuming linear relations, we might then calculate numbers {n_D, ..., n_T} in [0,1] for each bug report like this (a code sketch follows the list):

  • n_D = D/(maximal number of duplicates of any bug report)
  • n_R = R/(maximal number of related bug reports of any bug report)
  • n_C = C/(maximal number of comments of any bug report)
  • n_S = S/(maximal number of characters in the trimmed summary of any bug report)
  • n_B = B/(maximal number of characters in the trimmed description of any bug report)
  • n_P = P/(maximal possible priority)
  • n_T = ticks(T)/ticks(recent date) or 0, if the report was created after the set recent date.
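A minimal sketch of this normalisation in Python, assuming the raw key numbers were collected as above and that creation dates are available as numeric timestamps (ticks):

    def normalize(reports, recent_date):
        """Map the raw key numbers of each bug report into [0, 1].

        `reports` is the list returned by fetch_key_numbers; T and `recent_date`
        are assumed to be numeric timestamps (ticks). As a simplification we
        divide P by the maximal priority actually seen, not the maximal possible one.
        """
        def maximum(key):
            return max(r[key] for r in reports) or 1  # guard against division by zero

        maxima = {key: maximum(key) for key in ("D", "R", "C", "S", "B", "P")}

        normalized = []
        for r in reports:
            n = {"id": r["id"]}
            for key in maxima:
                n["n_" + key] = r[key] / maxima[key]
            # Reports created after the recent date were covered by the last test run.
            n["n_T"] = r["T"] / recent_date if r["T"] <= recent_date else 0.0
            normalized.append(n)
        return normalized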
We don't have much experience with which of those numbers might be the most important, and we don't want to make a scientific study of it, so we set weights {w_D, ..., w_T}, each a number in [0,1], such that w_D + ... + w_T = 1. These weights allow us to play with how much importance we attribute to each key number. For example, if we only wanted to evaluate our bug reports based on the creation date, we would set w_D = ... = w_P = 0 and w_T = 1, meaning that we don't care about the other key numbers.

For any selection of these weights we then calculate, for each bug report, its relevance as the weighted mean

K(bug report) = w_D * n_D + ... + w_T * n_T
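Putting it together, here is a sketch of the relevance calculation; the concrete weight values below are an arbitrary example and only need to sum to 1:

    # Example weights - an arbitrary choice for illustration, they only need to sum to 1.
    WEIGHTS = {"n_D": 0.2, "n_R": 0.1, "n_C": 0.2, "n_S": 0.05,
               "n_B": 0.05, "n_P": 0.2, "n_T": 0.2}

    def relevance(normalized_report, weights=WEIGHTS):
        """Weighted mean K of the normalized key numbers of one bug report."""
        return sum(weights[key] * normalized_report[key] for key in weights)

    # Usage, assuming a local SQLite copy of the BTS and a recent date given in ticks:
    # reports = normalize(fetch_key_numbers("bts.db"), recent_date=1341792000)
    # ranked = sorted(reports, key=relevance, reverse=True)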


K is now a good means to identify the bug reports which matter most to us, given our intuitions and how much we think we should take each factor into account. What do you think?