Wednesday, August 31, 2011

Pass-fail-criteria - testing a hash function

I recently had to test a hash function. Let me explain:

Hash function
Mathematically, a hash function is a mapping h : D -> V that assigns to each datum in D a value in V, where V is normally a subset of the natural numbers {0,1,2,...} or a finite union of such subsets.
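
As a toy illustration (my own example, not the function discussed below), such a mapping could look like this in Java, folding every string into V = {0,1,...,99}:

// Toy hash: sum the character codes of s, then fold the sum into {0,...,99}.
// Purely illustrative; not the hash function under test in this post.
static int toyHash(String s) {
    int sum = 0;
    for (char c : s.toCharArray()) {
        sum += c;
    }
    return Math.floorMod(sum, 100); // floorMod keeps the result non-negative even if sum wraps
}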

The purpose is to have an easy way to identify data of arbitrary type; the hash value can be thought of as a fingerprint of a piece of data. E.g. I want to save images in a table, but I don't want my database to blow up, so I allow only one copy of each image to be saved. How do I recognise whether I already have a copy in my table?

If d is my image and v = h(d), then I can use v as the key in my table. If I later want to save another image d' and find that v' = h(d') is already in my table, I know I have saved it before and don't need to upload it again.
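
In code the idea looks roughly like this; a minimal Java sketch, where the byte[] images, the table and the fingerprint function are my own stand-ins:

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

class ImageStore {
    private final Map<Integer, byte[]> table = new HashMap<>();

    // Stand-in fingerprint; a real system would use a much stronger hash.
    private int fingerprint(byte[] image) {
        return Arrays.hashCode(image);
    }

    // Saves the image only if no copy with the same fingerprint exists yet.
    boolean saveIfNew(byte[] image) {
        int key = fingerprint(image);   // v = h(d)
        if (table.containsKey(key)) {
            return false;               // v' already present: assume d' is a duplicate
        }
        table.put(key, image);
        return true;
    }
}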

But hash functions normally aren't (and for all practical purposes cannot be) injective. Injectivity is the mathematical term for uniqueness of the fingerprints: it would mean that every piece of data I hash gets its own unique fingerprint.

If two pieces of data have the same fingerprint, this is called a collision. Since D (strings of arbitrary length) is larger than any finite range V, the pigeonhole principle guarantees that collisions exist; the toy hash above, for instance, maps "ab" and "ba" to the same value. If we can't avoid collisions entirely, we'd at least want them to occur almost never.

Test a hash function
Almost never is clearly a relative term. With 10 pieces of data I cannot accept that any two share a fingerprint; call that 100% confidence. You might, on the other hand, accept 2 of 100000 pieces of data having the same hashvalue; that would be 1 - 2/100000 = 99.998% confidence.

I didn't have any specifications telling me what confidence or quality my hash function had to fulfill. The function has the form:

hash(str string, range int): hashkey int, i.e. each string (of arbitrary length) is encoded into a natural number in {0,1,...,range}.

So I reasoned:
  • I am personally fine with a confidence of 99.99% for my test, as long as
  • there is a way to improve the confidence by tuning the input of my hash function.
Admittedly 99.99% isn't too much. See discussion below.

So I took 50000 arbitrary distinct strings (randomly generated), hashed them and then looked at how many different hashkeys were produced. In mathematical terms, I calculated:

test_confidence_1 = #h({set of 50000 distinct strings}, 50000) / 50000, where # means 'number of'.

I got a pretty bad result (67% confidence), which is roughly what one should expect: even an ideal hash distributing 50000 arbitrary strings uniformly over 50000 possible keys yields only about 1 - 1/e ≈ 63% distinct keys (the birthday effect). But there was a way to improve the test confidence:

test_confidence_2 = #h({set of 50000 distinct strings}, 50000*1000) / 50000.
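
Here is a minimal Java sketch of how both measurements can be run. Since the real function under test isn't shown here, hash() below is a stand-in that folds String.hashCode() into the given range:

import java.util.HashSet;
import java.util.Random;
import java.util.Set;

public class HashConfidenceTest {
    // Stand-in for the function under test: string -> {0,...,range-1}.
    static int hash(String str, int range) {
        return Math.floorMod(str.hashCode(), range);
    }

    // #distinct hashkeys / #distinct input strings
    static double confidence(int n, int range) {
        Random rnd = new Random(42);              // fixed seed: reproducible test data
        Set<String> inputs = new HashSet<>();
        while (inputs.size() < n) {               // n distinct random strings
            inputs.add(Long.toString(rnd.nextLong(), 36));
        }
        Set<Integer> keys = new HashSet<>();
        for (String s : inputs) {
            keys.add(hash(s, range));
        }
        return (double) keys.size() / n;
    }

    public static void main(String[] args) {
        System.out.println(confidence(50000, 50000));        // test_confidence_1
        System.out.println(confidence(50000, 50000 * 1000)); // test_confidence_2
    }
}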

This one gave me 99.995% confidence, so I filed a "bug report" about the results. Hopefully the developers will read it and either improve the hash function or document its confidence for future users.

I learned what a hash function is, how I could test it, and finally:
  • As a tester you will sometimes have to set up your own pass-fail criteria.
Management and development can and hopefully will discuss your results and criteria with you later. That's okay. But you cannot do your work while waiting indefinitely for some specification. You can even use your own criteria to provoke that later discussion; this is why I chose my test confidence just high enough to make the test fail.

Friday, August 12, 2011

Overflow without warning

Testing can also be a kind of learning to program. E.g. there is this programming, or rather transformation, language for structured (text) data: MetaMorphosis.
While trying to figure out how that scripting language works, I found myself with little documentation about its data types. As a mathematician with an interest in prime number testing, I am always interested in big integers such as those in Java or C#. So I wrote the following, hoping the loop would NOT end with an overflow:

*count := 1, *lastcount:=*count,
!while *count >= *lastcount !do
   (*count).echo,
   *lastcount := *count,
   *count := 10*(*count)
!end


The loop really didn't end by itself or throw any error. But what I saw on my screen surprised me. When I paused it, I found that
 
1946039808 <= -1348927488  

seemed to be true. So I asked in pace mode (option -p):

> i *result := (-1348927488) - 1946039808;
> i *result;
> 1000000000


So, supposing that a < b is evaluated as true if and only if 0 < b - a, we see why 1946039808 <= -1348927488 seemed to hold: the subtraction b - a overflows and wraps around to a positive number.
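
The same silent wrap-around is easy to reproduce in Java, whose 32-bit int arithmetic also overflows without warning:

public class OverflowCompare {
    public static void main(String[] args) {
        int a = 1946039808;
        int b = -1348927488;
        // Mathematically b - a = -3294967296, which doesn't fit into 32 bits;
        // Java silently wraps it around to 1000000000.
        System.out.println(b - a);      // prints 1000000000
        System.out.println(b - a > 0);  // true: the flawed "a < b" test succeeds
        System.out.println(a < b);      // false: Java's direct comparison is correct
    }
}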
What have I learned? First, you cannot rely on any programming language to tell you about its limitations. Second, overflows do not only influence our results when we explicitly calculate with numbers, by subtracting or adding; they reach further, as seen here in the comparison of two numbers.

In short: as soon as we can find two values a and b with a - b > max(numberdomain), we cannot be sure that a < b will give the expected result.