Thursday, May 30, 2013

Some advanced code testing

For those who don't want to, or don't have the time to, read thick books or dig through testing framework APIs, I have collected a few topics that can improve your unit testing but might not always be obvious. They are not restricted to C#, even though the examples are.

Mighty testing frameworks

There are some mighty mocking and unit testing frameworks out there with impressive features such as mocking the static methods you depend on, testing private members and so on. Although these features are vital in some cases, e.g. when writing tests for legacy code, they might tempt you to write lower-quality code, taking less care about class design, dependencies and the like. Which is better: code quality or a new (and possibly costly) dependency on a mighty framework? Code quality does not depend on the language you're programming in; it depends mainly on you as a developer!

Use delegates for dependency to static methods

C# delegates can work like functions in languages such as Python or Go, where they are first-class citizens. This is a good thing, for example in the following case:
 public void Init()  
 {  
      var content = File.ReadAllText(this.Path);  
      ...  
 }  
We have a dependency here that makes testing somewhat difficult. Whereas in some situations it's good practice to wrap calls to static methods in an instance, here we can get along without it:
 public void Init()  
 {  
      this.Init(File.ReadAllText);  
 }  
 internal void Init(ReadAllText readAllText)  
 {  
      var content = readAllText(this.Path);  
      ...  
 }  
 internal delegate string ReadAllText(string path);  
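A test can now inject a stub in place of File.ReadAllText. The following is only a sketch; the TestObject name, the content assertion helper and access to the internal Init overload (e.g. via friend assemblies) are assumptions for illustration, not part of the snippet above:
 public void TestCaseInitReadsContentFromPath()  
 {  
      var testObject = new TestObject();  
      // the lambda is converted to the ReadAllText delegate - no file system involved  
      testObject.Init(path => "some file content");  
      AssertThatContentWasProcessed(testObject);  
 }  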

Dependencies to the surface

Somewhat related to the last topic on delegates is the following: often we program quickly and only afterwards discover that we have dependencies, especially on built-ins like System.IO.FileInfo. Now, built-ins can have bugs, too, and a dependency on them is better injected, e.g. for testability or extensibility. We can avoid overlooking those dependencies by not using global imports, that is, by deleting the using statements at the beginning of our files and writing full namespaces instead. What is the difference between
 using System.IO;  
 using System.Xml;  
 ...  
 internal void Init()  
 {  
      var file = new FileInfo(this.Path);  
      var dir = new DirectoryInfo(this.DirPath);  
      var xdoc = new XmlDocument();  
      ...  
 }  
and
 internal void Init()  
 {  
      var file = new System.IO.FileInfo(this.Path);  
      var dir = new System.IO.DirectoryInfo(this.DirPath);  
      var xdoc = new System.Xml.XmlDocument();  
      ...  
 }  
The difference is that in the second case you get so tired of typing the namespaces that you will want to do something about it. Since you have banned the using directives for the moment, the way out is to refactor towards dependency injection, interface extraction and the like.
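If you do end up refactoring towards dependency injection, one possible shape is the sketch below; the IFileSystem abstraction and its members are my own assumption here, they are not part of the BCL:
 internal interface IFileSystem  
 {  
      System.IO.FileInfo GetFile(string path);  
      System.IO.DirectoryInfo GetDirectory(string path);  
 }  
 internal void Init(IFileSystem fileSystem)  
 {  
      // a test can pass a fake IFileSystem; production code passes a real, disk-backed one  
      var file = fileSystem.GetFile(this.Path);  
      var dir = fileSystem.GetDirectory(this.DirPath);  
 }  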

The protected antipattern

There is this rule that production code and test code should not be mixed. In .NET we choose to create separate test projects and test only against the public interface, letting private members become internal (and thus testable via friend assemblies) where it is necessary and sensible to do so.
Another possibility is to make members protected instead. Then in our test project we simply inherit and override.
In my experience though, this second approach has at least two downsides that the use of internal doesn't have:

  • the access modifiers completely lose their meaning, because we cannot control what is done with the protected members.
  • what actually gets tested is the inheriting class, and in a more complicated setup we might eventually lose track: did we really test our code or the test code?

Any vs. Some

In order for our test cases to serve as documentation we need method and variable names communicating intention:
 public void TestCaseConstructorSetsProperties()  
 {  
      var to = new TestObject(anyParameter());  
      AssertThatPropertiesAreSet(to);  
 }  
My personal preference is to use the prefix any only if null is permitted, too. So if, following common practice, the constructor checks for a null argument, that will be a separate test case. We can opt for anyNonNullParameter() or simply:
 public void TestCaseConstructorSetsProperties()  
 {  
      var to = new TestObject(someParameter());  
      AssertThatPropertiesAreSet(to);  
 }  
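For completeness, a minimal sketch of what such helpers could look like; the names and the string return type are assumptions for illustration:
 // "any" admits every legal value - null is as good a choice as any other here  
 private static string anyParameter()  
 {  
      return null;  
 }  
 // "some" always delivers a concrete, non-null value  
 private static string someParameter()  
 {  
      return "some non-null parameter";  
 }  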

Advanced setup and teardown - cleanup files example

Have a look at the following code:
 public void TestCaseUsingFileSystem()  
 {  
      var file = createTestFile();  
      var to = new TestObject();  
      to.doSomething(file);  
      AssertThatSomethingHoldsOn(to);  
      file.Delete();  
 }  
The problem here is that the file won't be deleted if the assertion fails, which makes this test case fragile. Surely you can think of other objects that need a proper teardown even when the test fails, in order to keep the test fixture correct. A common solution to this problem is to introduce a class variable serving as a trash bin and to use a shared teardown. Note also the file creation method, which could just as well have kept the name createTestFile() as before:
 private readonly List<object> trash = new List<object>();  
 public void TestCaseUsingFileSystem()  
 {  
      var file = createAndRegisterForCleanupTestFile();  
      var to = new TestObject();  
      to.doSomething(file);  
      AssertThatSomethingHoldsOn(to);  
 }  
 public void TearDown()  
 {  
      foreach(var item in this.trash)  
      {  
           try  
           {  
                var file = item as FileInfo;  
                if(file != null) file.Delete();  
                ...  
           }  
           catch(Exception e)  
           {  
                reportToTestRunner(e.Message);  
           }  
      }  
 }  
 private FileInfo createAndRegisterForCleanupTestFile()  
 {  
      var file = createTestFile();  
      this.trash.Add(file);  
      return file;  
 }  

Event checking

You should always check whether events are raised, too! An easy pattern for doing so is this:
 public void TestCaseSomeMethodRaisesEvent()  
 {  
      var eventHasBeenRaised = false;  
      testObject.SomeEventHandler += (sender, args) => eventHasBeenRaised = true;  
      testObject.SomeMethod();  
      AssertThat(eventHasBeenRaised);  
 }  
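A slight variation of the same pattern (again just a sketch) also captures the event arguments, so the assertion can inspect them:
 public void TestCaseSomeMethodRaisesEventWithArgs()  
 {  
      EventArgs raisedArgs = null;  
      testObject.SomeEventHandler += (sender, args) => raisedArgs = args;  
      testObject.SomeMethod();  
      AssertThat(raisedArgs != null);  
 }  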

Friday, May 10, 2013

Android Programming and Testing with ADT

It's been a while since I dared to take a look at Android application development using Eclipse. I'm pleasantly surprised by the high-quality tutorials and documentation. It is fun to learn, with the right tools, of course. Also, the developer community hasn't left out the vital testing perspective, delivering automatic test project setup and JUnit extensions (even mocks!). If you want to take your first steps in Android application development and learn how to test it right from the start, I recommend the following steps, assuming you have some experience with Java, Eclipse and, of course, unit testing:


  1. Download the Android ADT Bundle here.
  2. Follow the steps for setting it up here.
  3. Complete the tutorial Building your First App. I recommend using a real device, not only for performance but because it feels great ;-) In case you're working on Linux, you'll probably have to add a rule for udev. This is well documented in the tutorial. Tip: find your vendorId using lsusb and use MSC as the transfer protocol.
  4. Skim through Managing Projects from Eclipse with ADT, Building and Running from Eclipse and Testing Fundamentals, the latter being a fascinating read by itself for testing developers (and suffering testers in automation).
  5. Make sure you have the Samples for SDK. If you don't, you can download them using the Android SDK Manager as described here.
  6. Skip Testing from Eclipse with ADT and dive directly into the Activity Testing Tutorial.
This is a good starting point and fun to do :-)

Monday, October 29, 2012

Documenting smells for untestable .NET code with NSubstitute and NUnit

Retrofitting unit tests is hard work, and the knowledge gained about the code can easily be forgotten by the time it gets refactored. We want to communicate to the developers (and to ourselves, once enough time has passed to forget what we've learned about the code) our testing intentions and how the code could be made more testable.

One of the unit testing axioms is: always keep production and test code separate. Now if you program in .NET/C# and create a separate test project, then friend assemblies or this pattern can help you overcome the visibility problem: if I can only access the public interface of my code, there will be a lot of inaccessible code that I still need to test.

But if you work with legacy code, you will likely come across a lot of untestable code. Tight coupling, hidden dependencies, dependencies on implementations and many others are common problems you would normally conquer by refactoring your code (see e.g. Michael C. Feathers' dependency-breaking techniques in Part III of his book).

Sometimes, though, you won't have permission to change the production code. Sometimes you just need to document the code for future reference and prepare test cases so the knowledge you've built up isn't wasted before the code gets refactored.

In this world of retrofitting tests to legacy code you will often find it useful to document the smells or pathologies that make your testing so hard, or even impossible. Instead of writing bug reports, source comments, etc., you might consider the following pattern next time.

For example, imagine you have the following code you're not allowed to make changes to:

public class HasDependencyToImplementation{
  
  public HasDependencyToImplementation(Class c)
  {
    this.c = c; 
    ...
  }

  public void DoSomethingWithC(){
    this.privateProperty = this.c.DoSomething();
    this.PublicProperty = this.WorkWithPrivateProperty();
  }
  
  private T WorkWithPrivateProperty(){
    ... do something with this.privateProperty ...
  }

}

public class Class{...}

Now, if you need to take control of the method Class.DoSomething, you'll normally be advised to extract an interface from Class; the constructor can then accept a mocked instance. But what if you're not allowed to do that? Do you want to call the developer and ask them to please refactor that code and check it in again? Does your developer even have the mood or the time to listen to your pleas? Let's guess they don't. With NSubstitute and NUnit you would then still be able to write a (failing) test like this:

...
[Test]
public void DoSomethingOnC_ChangesState(){
   var c = Substitute.For<Class>();
   var controllableValue = getValue();
   c.DoSomething().Returns(controllableValue);
   
   var hasDependency = new HasDependencyToImplementation(c);
   hasDependency.DoSomethingWithC();
   var expected = getExpectedFor(controllableValue);

   Assert.That(
       hasDependency.PublicProperty,
       Is.EqualTo(expected));
}

This test compiles, but when we run it we get an exception from NSubstitute: Class is not abstract and the method Class.DoSomething isn't virtual, so NSubstitute cannot override it. Our problem is this: on the one hand we want this test to exist as documentation and for future testing once the refactoring is done; on the other hand the refactoring developers won't necessarily interpret the test result as a todo for them, and you might forget what the problem was by the next time you work with the code. A solution might be the following:

We implement a custom assert on top of NUnit.Framework.Assert:

public static void HasSmell(int PRIO, string PROBLEM, string PROPOSAL){
   // helper formats the smell report shown in the test runner output below
   Assert.Fail(helper.GetMessage(PRIO, PROBLEM, PROPOSAL));
}

and implement constant classes for the smells and the proposed refactorings:

public static class Smells{
   public const string IrritatingParameter = "Irritating parameter";
   ...
}

public static class Refactorings{
   public const string ExtractInterface = "Extract interface";
   ...
}
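The HIGH value used in the test below is never defined in the post; a constants class analogous to Smells and Refactorings (purely an assumption on my part, referenced e.g. as Prio.HIGH or brought into scope some other way) could provide it:

public static class Prio{
   public const int LOW = 0;
   public const int MEDIUM = 1;
   public const int HIGH = 2;
}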

Then we can add documentation for developers:


...
[Test]
public void DoSomethingOnC_ChangesState(){

   Assert.HasSmell(
          PRIO: HIGH,
          PROBLEM: Smells.IrritatingParameter
                  +"\nClass c: cannot be mocked for checking.",
          PROPOSAL: Refactorings.ExtractInterface);

   var c = Substitute.For<Class>();
   var controllableValue = getValue();
   ...
}

This test not only fails when the developers run it; the NUnit runner will also tell them:

Test DoSomethingOnC_ChangesState failed:
SMELL PRIO HIGH
PROBLEM: Irritating parameter
Class c: cannot be mocked for checking.
PROPOSAL: Extract interface

Monday, July 9, 2012

History Based Heuristics for Regression Testing

Common testing knowledge - see for example this article - recommends that you:
  1. Identify test areas of your product based on recent changes
  2. Select and add test cases based on the identified areas
  3. Add some basic test cases (some kind of smoke test)
This really is a good way to get the job done; though in a legacy environment, where the testing process has existed for much less time than the product under test, you might want some more testing to be done.



Imagine regression testing as part of a bigger test iteration planned for your next product release. Since steps 1 through 3 only check that recently modified items - modified since the last release - continue to work pleasantly, you are still left with the blind spot of all those things that have never been tested.

How to approach this? You might want to organise a bug bash. Or you might consider expanding step 3 to more test areas. But which would those be? If you work in a legacy environment, you probably at least have a bug tracking system (BTS) at your disposal that has been around since the beginning of your company's development process (or at least longer than your testing documentation).

The idea is quite simple and is analogous to the one used to identify test items for test-driven maintenance.

First we identify the key numbers, which we can normally obtain by querying the underlying database of our BTS:

  • D (Number of Duplicate Bug Reports): Duplicates come up if several users have found an issue in different contexts. The more duplicates a bug report has, the more important it is for us to identify the underlying area.
  • R (Number of Related Bug Reports): The more related bug reports are known, the bigger the test area must be and therefore the more important the bug report itself.
  • C (Number of Comments): Many comments indicate a matter worth a discussion. This number also includes status changes (like New => Confirmed).
  • S (Summary Length): Although a long, imprecise bug summary may be a symptom of bad style, we want to assume that more complex bugs need a longer summary.
  • B (Description Length): Same as for S: we want to assume that more complicated issues need a longer description / report.
  • P (Priority, mapped to a numeric value): There might be different priorities (internal, external), but after all there's a reason why someone decided to assign such a priority.
  • T (Creation Date): If the first bug report ever has status fixed and after n years the bug hasn't come up again, we assume the feature to be stable. Also, if we tested a feature in our last internal release test ("recent date"), we might not need to re-test it again.

From all those numbers we can derive values indicating the relevance for our regression test.
Assuming linear relations, we might then calculate numbers {n_D,...,n_T} in [0,1] for each bug report like this:

  • n_D = D/(maximal number of duplicates of any bug report)
  • n_R = R/(maximal number of related bug reports of any bug report)
  • n_C = C/(maximal number of comments of any bug report)
  • n_S = S/(maximal number of characters in the trimmed summary of any bug report)
  • n_B = B/(maximal number of characters in the trimmed description of any bug report)
  • n_P = P/(maximal possible priority)
  • n_T = ticks(T)/ticks(recent date) or 0, if the report was created after the set recent date.
We don't have much experience regarding which of those numbers might be the most important, and we don't want to make a scientific study of it, so we set weights {w_D,...,w_T}, numbers in [0,1] such that w_D + ... + w_T = 1. These weights allow us to play with how much importance we attribute to each key number. For example, if we only wanted to evaluate our bug reports based on the creation date, we would set w_D = ... = w_P = 0 and w_T = 1, meaning that we don't care about the other key numbers.

For any selection of these weights we then calculate for each bug report its relevance as the weighted mean

K(bug report) = (w_D * n_D + ... + w_T * n_T)
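As a small illustration, once the normalized values and the weights are at hand, K is just a weighted sum; the array layout and method name below are my own assumptions:

// n = {n_D, n_R, n_C, n_S, n_B, n_P, n_T}, w = {w_D, ..., w_T} with the weights summing to 1
public static double Relevance(double[] n, double[] w)
{
   double k = 0.0;
   for (int i = 0; i < n.Length; i++)
   {
      k += w[i] * n[i];
   }
   return k; // lies in [0,1]; higher means more relevant for the regression test
}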


K is now a good means of identifying the bug reports which matter most to us, given our intuitions and how much we think we should take each factor into account. What do you think?

Sunday, November 13, 2011

Testing Legacy Code - Test Driven Maintenance

In this post I will first briefly present the method of Shihab et al. for prioritising the creation of unit tests for legacy code, and second, show how we can implement it using StatSVN and Gendarme (both tools are open source and available for Windows and Linux).


How to prioritise creation of unit tests for legacy code
When facing a large legacy system you have to write unit tests for, Michael Feathers in his excellent book Working Effectively with Legacy Code (0) tells us two important things:
  1. We build characterisation tests so that they can tell us something about how the legacy code works.
  2. Before introducing any changes into the legacy code, build a corset of unit tests around the object to be changed so that you have full control over any undesired behaviour.
But often not even these two principles help us at the moment we have to decide which functions or objects to write tests for. For example, your company might have decided to increase the quality of its code by creating unit tests: there are no recent changes in the code, and there is far too much of it to write characterisation tests for every line.

The general recommendation found on the internet is to build unit tests incrementally: on the one hand when changes are introduced or about to be introduced, on the other hand whenever there is time to test some code.

But we still don't know which code to write tests for first. Shihab et al. (1) have found a possible solution to this problem of - what they call - Test Driven Maintenance (TDM):
"[...] we think it is extremely beneficial to study the adaption to TDD-like practices for maintenance of already implemented code, in particular for legacy systems. In this paper we call this 'Test-Driven Maintenance' (TDM)."
Their main idea is to prioritise using history-based heuristics such as function size, modification frequency and fixing frequency, among others. Here I present the full list; I find the intuitions quite convincing:



Modifications
  • MFM (Most Frequently Modified): functions that were modified the most since the start of the project. Intuition: functions that are modified frequently tend to decay over time, leading to more bugs.
  • MRM (Most Recently Modified): functions that were most recently modified. Intuition: functions that were modified most recently are the ones most likely to have a bug in them (due to the recent changes).

Bug Fixes
  • MFF (Most Frequently Fixed): functions that were fixed the most since the start of the project. Intuition: functions that were frequently fixed in the past are likely to be fixed in the future.
  • MRF (Most Recently Fixed): functions that were most recently fixed. Intuition: functions that were fixed most recently are more likely to have a bug in them in the future.

Size
  • LM (Largest Modified): the largest modified functions, in terms of total lines of code (i.e., source, comment and blank lines). Intuition: large functions are more likely to have bugs than smaller functions.
  • LF (Largest Fixed): the largest fixed functions, in terms of total lines of code (i.e., source, comment and blank lines). Intuition: large functions that need to be fixed are more likely to have more bugs than smaller functions that are fixed less.

Risk
  • SR (Size Risk): the riskiest functions, defined by the number of bug fixing changes divided by the size of the function in lines of code. Intuition: since larger functions may naturally need to be fixed more often than smaller functions, we normalise the number of bug fixing changes by the size of the function. This heuristic mostly points out relatively small functions that are fixed a lot (i.e., have a high defect density).
  • CR (Change Risk): the riskiest functions, defined by the number of bug fixing changes divided by the total number of changes. Intuition: the number of bug fixing changes is normalised by the total number of changes. For example, a function that changed 10 times in total, 9 of them to fix a bug, should have a higher priority to be tested than a function that changed 10 times where only 1 of those ten changes was a bug fix.

Random
  • Random: randomly selects functions to write unit tests for. Intuition: randomly selecting functions to test can be thought of as a baseline scenario. Therefore, we use the random heuristic's performance as a baseline to compare the performance of the other heuristics.

In their article (1) they furthermore show that this approach really does have advantages over picking parts of the code randomly. Also, there aren't many known methods for prioritising; in fact, so far I haven't found anything else.


Some ideas about how to implement the TDM heuristics with StatSVN, Reflection and Gendarme
Of course, we don't have the tool Shihab et al. used at our disposal, and we don't have time to create one ourselves. But we can try to implement it partially.
The crucial point is to get hold of the project's history. If the project has been checked into a CVS or SVN repository, we can get that information from the repository log. We will suppose we have an SVN repository.


In a shell we navigate to the working directory of the checked-out project and type
 
svn log -v --xml > logfile.log

to obtain the history needed. After that we generate StatSVN's documentation by typing 
 
java -jar /path/to/statsvn.jar /path/to/module/logfile.log /path/to/module


We do all this because StatSVN automatically produces a list of the 20 biggest files and the 20 files with the most revisions. Thus we implement MFM and LM (see above) at file level (not at object or function level).

Note that you would normally want to fine-tune StatSVN's output, e.g. use -include "**/*.cs" -exclude "DeletedClass.cs:ClassWithTests.cs" to include only source files and to exclude files that have been deleted (StatSVN will show them to you in your first run) or that already have unit tests.

We can obtain information about the other heuristics through the other options StatSVN offers (e.g. Bugzilla integration), by talking to our developers, or through static code analysis.
Furthermore, using any reflection framework we can count the number of classes and their members. This gives us some useful information about complexity, if we accept the following intuition:

If an object has many methods (and complex properties) it will be more likely to contain a bug (and more lines per method).

And finally we use the static code analyser Gendarme (in the case of .NET assemblies), which tells us which code quality rules weren't followed in which object; if we place the debugging symbols (.pdb, .mdb) next to the .dlls, it will also tell us the line and file name where the item is located, thus directly giving us another heuristic per file. Our intuition is:

If an object has design problems, it is likely to have errors deeper down, too.

In the end we will have quite a lot of heuristic information that will hopefully help us decide which parts of the legacy code to test first.
When I first implemented this approach, I quickly identified an object as critical because it came up in MFM, LM, Gendarme and the reflection-based counting. I later talked to a developer and, without knowing about my heuristics, he actually pointed at that very object and said it was critical that we had some tests for it. This convinced me that these heuristics are a valuable way of prioritising test case creation.

Having talked a lot about test planning, here finally is a little video about writing the test code: Gerard Meszaros himself gives a very interesting presentation about xUnit Test Patterns (2).




(0) Michael C. Feathers: Working Effectively With Legacy Code, Pearson Education, 2005
(1) E. Shihab, Z. Jiang, B. Adams, A. E. Hassan and R. Bowerman: Prioritizing the Creation of Unit Tests in Legacy Software Systems, Software: Practice and Experience, 2011
(2) Gerard Meszaros: xUnit Test Patterns: Refactoring Test Code, Pearson Education, 2007

Saturday, October 29, 2011

Accessibility and testability

This year's winner for the best presentation at QA&Test 2011 in Bilbao was Julian Harty, and I am totally convinced that it was the right choice. Julian not only presented in a very clear style, combining theory with practical examples, but also presented a totally new view of accessibility testing (@testinggeek.com).

Does your company care about accessibility issues in its products?

We don't! And I don't know of anyone else who does. It's expensive and normally not worth the effort. Public institutions, of course, need to care about this, but private companies? Why should they care?

Julian's brilliant idea consists of using accessibility technology to improve the testability of our products! (see his slides, 18)

Brilliant, of course, because he found another use for accessibility, which is testability (and search engine optimisation, see his slides, 18). Everybody is happy and the product gets improved in various ways: more SEO for the management, more testability for testers, better usability and accessibility for the users.

In Germany we have a saying for this: it's like a Kinder Surprise!
(because it contains - as advertised - three things at once: amazement, chocolate AND a toy :-) )

Monday, October 10, 2011

Effort estimates - be brave, invent your own

Inspired by Michael Feathers' Working Effectively With Legacy Code (on Google Books) I felt brave enough to confront my latest "homework" in testing: "Write a detailed test plan about how to test that code!" The technical term, I understand, would be "retrofitting tests for OO code".

Now, unfortunately, I was rather constrained, because some premises were put up that made much of Feathers' advice impossible to follow:

1.) Tests will be separated from the developers' code, into their own projects.
2.) Only public members are relevant.
3.) You cannot break dependencies.
4.) You need to mainly write characterisation tests.
5.) My boss needs an idea about cost estimation.


So how did I start? Well, first I used that good old tranquilizer: counting. So I wrote a program using System.Reflection (.NET) to list and count the following per project (a rough sketch of such a counting program follows the list):

1.) Non-abstract classes (without enums) t (note that in System.Reflection static classes are typed sealed and abstract)
2.) Methods (and constructors) m
3.) Properties and fields (without constants) p
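Here is the promised sketch; the exact filtering rules (e.g. how static classes and non-public constructors or properties are treated) are assumptions for illustration, not necessarily what my original program did:

using System;
using System.Linq;
using System.Reflection;

public static class TestEffortCounter
{
    public static void Main(string[] args)
    {
        var assembly = Assembly.LoadFrom(args[0]);

        // t: non-abstract classes (enums fall out automatically, their IsClass is false);
        // note that static classes appear as sealed AND abstract via reflection
        var classes = assembly.GetTypes()
            .Where(type => type.IsClass && !type.IsAbstract)
            .ToArray();
        var t = classes.Length;

        // m: methods declared on those classes plus their public constructors
        var m = classes.Sum(type =>
            type.GetMethods(BindingFlags.Instance | BindingFlags.Static |
                            BindingFlags.Public | BindingFlags.NonPublic |
                            BindingFlags.DeclaredOnly).Length +
            type.GetConstructors().Length);

        // p: public properties plus fields, skipping constants (literal fields)
        var p = classes.Sum(type =>
            type.GetProperties().Length +
            type.GetFields(BindingFlags.Instance | BindingFlags.Static |
                           BindingFlags.Public | BindingFlags.NonPublic)
                .Count(field => !field.IsLiteral));

        Console.WriteLine("t = {0}, m = {1}, p = {2}", t, m, p);
    }
}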

I then loosely mimicked a formula I found on the internet (mathematik.uni-ulm.de) for calculating the effort on the following basis:

1.) A first impression cannot count every testing state (of the input parameters / the object under test), nor too many dependencies, without digging too deep.
2.) There's a stable average of testing states and complexity.
3.) Objects will be used as parameters, too.
4.) My boss wants a simple answer, not a maths thesis on the subject; things need to get done.

After much trying, and after demanding the attention of a very good and patient friend of mine (thank you, Ewald!), I came back to the formula I started from:

testing_effort = p*m*t*TEU


(where p, m, t are as above and TEU refers to a "testing effort unit", which would need to be interpolated)


When I applied that formula to a recent unit testing project in order to interpolate TEU, and after that used it to calculate a first estimate, I could be sure this wasn't the right way, as my results predicted gigantic resources we simply don't have.
So the next step was some static code analysis targeting the question of how to reduce the test basis. (Remember: "We cannot test everything.")

Some sensible criteria I then set up with one of our developers were, for example:

1.) If there are two overloads, check whether they really do something different; if not, kick one out (of your calculation).
2.) If there are e.g. 20 read-only properties of type string, count 1 property of type List<string>.
3.) If two classes inherit from the same abstract class, only count where they override / where they differ, and so on.

The good thing about reading the source code is that you get quite a good picture of the importance of certain classes so you can prioritize the components / units at the same time.

So then I had a final list and different counts; my formula would certainly give better results:

total_testing_effort_1 = \sum_{i=1,...,7} te_i


(where te_i is the testing effort of each assembly, te_i = p_i*m_i*t_i)


The ridiculous result was: 37h. Thirty-seven hours for seven quite complicated assemblies (e.g. there was one object that could only be created by another which itself depended on a third one).

Still I didn't panic - it's just numbers!

So I thought: to include the dependencies my assemblies have on each other, I can instead calculate the testing effort from the total numbers, thus, by multiplication, establishing a stronger relation between those numbers, that is

total_testing_effort_2 = (\sum t_i)*(\sum p_i)*(\sum m_i)


resulting in another ridiculous estimate of 1039 hours.

Knowing that the plan had to be written, and that I couldn't live with 37h and my boss couldn't live with 1039h, I had to find some mutually acceptable number - some mean. It's obvious that the arithmetic average isn't good enough, so I used the geometric mean of those two:

total_testing_effort = sqrt(total_testing_effort_1*total_testing_effort_2)


And there it was: a magical 196,1h, approx. 25 days of 8h work.

You think this is all hocus pocus? Maybe you're right. But I've found the following to be true, too:

1.) There is an effort estimate we feel good about.
2.) My boss understands that the code isn't as easy to test as he had thought, and will probably be more forgiving when not "everything" has been tested.

Of course, I still hope my baby will prove to be a good approximation over time :-)

Do you have other means of estimating costs / setting up test plans for legacy code? Found mistakes or nonsense? Please comment!