I recently had to test a hash function. Let me explain:

**Hash function**A hash function mathematically is a mapping

*h : D -> V*, where each datum in*D*is assigned a value in*V*, normally being V a subset of the natural numbers*{0,1,2,...}*or a finite sum of subsets of the natural numbers.The purpose is to find an easy way to identify data of arbitrary type. It can be thought of as a fingerprint of my piece of data. E.g. I want to save images in a table, but I don't want my data base to blow up so I only allow for one copy of an image to be saved. How do I recognise if I already have a copy in my table?

If

*d*is my image and*v=h(d)*then I can use*v*as key in my table. If then I want to save another image*d'*in my table and found that*v'=h(d')*already is in my table, I'd know I already saved it and shouldn't need to upload it again.But then hash functions aren't normally (or virtually cannot be)

*injective.*This is the mathematical term for*uniqueness*of my fingerprints. This would be the case if each piece of data I*hash*would have a*different unique fingerprint.*If two pieces of data have the same fingerprint this is called a

*collision.*Surely, if we can't avoid collisions totally, we'd at least want them to occur*almost never.*

**Test a hash function***Almost never*is clearly a relative term. If I have 10 pieces of data, we cannot accept that two have the same fingerprint, saying 100% confidence. You could also feel that you can accept 2 of 100000 pieces of data to have the same hashvalue; this would be 1 - 2/100000 = 99.998% confidence.

I didn't have any specifications telling me what confidence or quality my hash function had to fulfill. The function has the form:

*hash(str string, range int): hashkey int*, i.e. each string (of arbitrary length) is encoded into a natural number in {0,1,...,range}.

So I reasoned:

- If I have a confidence of 99.99% I personally am fine with my test. As long as
- there's a way to improve confidence by tuning the input of my hash function.

Admittedly 99.99% isn't too much. See discussion below.

So I took 50000 arbitrary distinct strings (randomly generated), hashed them and than looked at how many different hashkeys were produced. Mathematically spoken I calculated:

*test_confidence_1 = #h({set of 50000 distinct strings}, 50000) / 50000, where # means 'number of'.*

I had a pretty bad result (67% confidence) which was what I could have expected choosing 50000

*arbitrary*strings but there was a way to improve test confidence:*test_confidence_2 = #h({set of 50000 distinct strings}, 50000*1000) / 50000.*

This one gave me 99.995% confidence so I filed a "bug report" about the results. Hopefully developeres will read it and improve the hash function or document the confidence for future users of this hash function.

I learned what a hash function is, how I could test it, and finally:

**As a tester sometimes you will have to setup your own pass-fail-criteria.**

Management and Development later still can and hopefully will discuss your results / criteria with you. That's okay. But you cannot do your work waiting indefinitely for some specification. You can use your own criteria to provoke later discussion also; this is why I chose my test confidence just high enough to make the test fail.

## No comments:

## Post a Comment