I recently had to test a hash function. Let me explain:
Hash function
A hash function, mathematically, is a mapping h : D -> V that assigns each datum in D a value in V, where V is normally a subset of the natural numbers {0,1,2,...}, often a finite one such as {0,1,...,n}.
The purpose is to have an easy way to identify data of arbitrary type. A hash value can be thought of as a fingerprint of my piece of data. E.g. I want to save images in a table, but I don't want my database to blow up, so I allow only one copy of each image to be saved. How do I recognise whether I already have a copy in my table?
If d is my image and v = h(d), then I can use v as a key in my table. If I later want to save another image d' and find that v' = h(d') is already in my table, I know I have saved it before and don't need to upload it again.
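A minimal sketch of this deduplication idea in Go, assuming an in-memory map as the "table" and Go's standard FNV-1a hash as a stand-in fingerprint (the actual hash function under test is not shown here):

    package main

    import (
        "fmt"
        "hash/fnv"
    )

    // fingerprint is a stand-in hash (FNV-1a); the real hash function
    // under test could be used in its place.
    func fingerprint(data []byte) uint64 {
        h := fnv.New64a()
        h.Write(data)
        return h.Sum64()
    }

    func main() {
        table := make(map[uint64][]byte) // hash key -> stored image

        images := [][]byte{
            []byte("image-1 bytes"),
            []byte("image-2 bytes"),
            []byte("image-1 bytes"), // a copy of the first image
        }

        for _, img := range images {
            v := fingerprint(img)
            if _, ok := table[v]; ok {
                fmt.Println("already saved, skipping upload")
                continue
            }
            table[v] = img
        }
        fmt.Println("images stored:", len(table)) // prints 2
    }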
But hash functions normally aren't (and for arbitrary data virtually cannot be) injective. Injective is the mathematical term for the uniqueness of my fingerprints: it would mean that every piece of data I hash gets its own distinct fingerprint. By the pigeonhole principle this is impossible whenever there are more possible data than hash values, e.g. strings of arbitrary length mapped to a finite range.
If two pieces of data have the same fingerprint, this is called a collision. Since we can't avoid collisions entirely, we'd at least want them to occur almost never.
Test a hash function
"Almost never" is clearly a relative term. If I have 10 pieces of data, we cannot accept that any two have the same fingerprint; call that 100% confidence. You might, on the other hand, accept 2 out of 100000 pieces of data having the same hash value; this would be 1 - 2/100000 = 99.998% confidence.
I didn't have any specifications telling me what confidence or quality my hash function had to fulfill. The function has the form:
hash(str string, range int): hashkey int, i.e. each string (of arbitrary length) is encoded into a natural number in {0,1,...,range}.
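The real implementation wasn't available to me, so here is a hypothetical stand-in with that signature, again using Go's standard FNV-1a hash. Note that range is a reserved word in Go, hence the parameter name rangeSize, and that the modulo puts the key in {0,1,...,rangeSize-1}:

    package main

    import "hash/fnv"

    // hashKey is a hypothetical stand-in for the function under test.
    // It encodes an arbitrary string into a natural number in
    // {0, 1, ..., rangeSize-1} via FNV-1a and a modulo reduction.
    func hashKey(str string, rangeSize int) int {
        h := fnv.New64a()
        h.Write([]byte(str))
        return int(h.Sum64() % uint64(rangeSize))
    }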
So I reasoned:
- if I have a confidence of 99.99%, I am personally fine with my test, as long as
- there's a way to improve the confidence by tuning the input of my hash function.
Admittedly, 99.99% isn't a very demanding threshold. See the discussion below.
So I took 50000 arbitrary distinct strings (randomly generated), hashed them, and then looked at how many different hash keys were produced. In mathematical terms I calculated:
test_confidence_1 = #h({set of 50000 distinct strings}, 50000) / 50000, where # means 'number of distinct elements'.
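Sketched as a small, self-contained Go program (randomString, the fixed seed, and the 16-character length are my own arbitrary choices; hashKey is the stand-in from above):

    package main

    import (
        "fmt"
        "hash/fnv"
        "math/rand"
    )

    // hashKey: hypothetical stand-in for the function under test.
    func hashKey(str string, rangeSize int) int {
        h := fnv.New64a()
        h.Write([]byte(str))
        return int(h.Sum64() % uint64(rangeSize))
    }

    // randomString builds an arbitrary alphanumeric string of length n.
    func randomString(rng *rand.Rand, n int) string {
        const letters = "abcdefghijklmnopqrstuvwxyz0123456789"
        b := make([]byte, n)
        for i := range b {
            b[i] = letters[rng.Intn(len(letters))]
        }
        return string(b)
    }

    func main() {
        const n = 50000
        rng := rand.New(rand.NewSource(1))

        // 50000 arbitrary distinct strings (the map enforces distinctness).
        strings := make(map[string]bool, n)
        for len(strings) < n {
            strings[randomString(rng, 16)] = true
        }

        // Hash them and count how many different hash keys come out.
        keys := make(map[int]bool, n)
        for s := range strings {
            keys[hashKey(s, n)] = true
        }

        confidence := float64(len(keys)) / float64(n)
        fmt.Printf("test_confidence_1 = %.3f\n", confidence)
    }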
I got a pretty bad result (67% confidence), which is about what one should expect when hashing 50000 arbitrary strings into a range of 50000: even an ideal, uniformly random hash would produce only about 1 - 1/e, roughly 63.2%, distinct keys. But there was a way to improve the test confidence:
test_confidence_2 = #h({set of 50000 distinct strings}, 50000*1000) / 50000.
This one gave me 99.995% confidence, so I filed a "bug report" about the results. Hopefully the developers will read it and either improve the hash function or document its confidence for future users of this hash function.
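For comparison, the baseline an idealized, uniformly random hash would achieve on average can be computed directly: with n distinct inputs and m possible keys, the expected number of distinct keys is m * (1 - (1 - 1/m)^n). A quick Go check of this baseline for both ranges used above:

    package main

    import (
        "fmt"
        "math"
    )

    // expectedConfidence is the average fraction of distinct keys an
    // idealized, uniformly random hash produces: with n distinct inputs
    // and m possible keys, E[#distinct keys] = m * (1 - (1 - 1/m)^n).
    func expectedConfidence(n, m float64) float64 {
        return m * (1 - math.Pow(1-1/m, n)) / n
    }

    func main() {
        fmt.Printf("range = 50000:      %.4f\n", expectedConfidence(50000, 50000))      // ~0.6321, i.e. 1 - 1/e
        fmt.Printf("range = 50000*1000: %.5f\n", expectedConfidence(50000, 50000*1000)) // ~0.99950
    }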
I learned what a hash function is, how I could test it, and finally:
- As a tester, you will sometimes have to set up your own pass/fail criteria.
Management and Development can and hopefully will discuss your results and criteria with you later. That's okay. But you cannot do your work while waiting indefinitely for some specification. You can also use your own criteria to provoke that later discussion; this is why I chose my test confidence just high enough to make the test fail.