Tuesday, October 18, 2022

CRC vs Hash Algorithms

Before the System.IO.Hashing namespace was introduced in late 2021, the .NET FCL (Framework Class Library) did not provide any classes for error detection or for general-purpose non-cryptographic hashing.

For error detection you would have to find some code for one of the CRC algorithms and paste it into your codebase. Some developers may find that objectionable.

For hashing you could use the built-in MD5 class, but it produces a 16-byte cryptographic hash, which may be overkill for typical business apps that just want to generate hash keys for database rows or dictionaries. Otherwise you would have to paste in code that implements a popular non-cryptographic hash algorithm such as FNV or Murmur. Web searches reveal dozens of algorithms that produce 32-bit or 64-bit hashes, which unfortunately means there is too much choice and no obvious winner.

For the last 20 years I have followed these rough rules:

  • For signing data I use an MD5 hash. The 16 bytes can conveniently be manipulated in a Guid. MD5 is old and regarded as cryptographically "broken", but I'm happy to use it to create the internal signature of the contents of a database row, for example.
  • For hashing small numbers of strings in non-persistent dictionaries I would use FNV-1a because it's only a few lines of code (sketched after this list) and it's well researched.
  • For all other persistent hashes I preferred CRC-64 (ECMA-182), via code I stole decades ago and placed in a shared library. 64-bit hashes are collision resistant for several million items, but don't use 32-bit hashes: by the birthday bound they reach a 50% collision probability after only about 77,000 items, as described in Birthday Attack.
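As a minimal sketch of the first two rules (the class and method names below are mine, purely for illustration), the MD5-to-Guid trick and the whole FNV-1a algorithm look like this:

    using System;
    using System.Security.Cryptography;
    using System.Text;

    static class HashingRules
    {
        // Rule 1: an MD5 digest is 16 bytes, exactly the size of a Guid,
        // so the signature can be stored and compared as a Guid.
        public static Guid Md5Signature(string text) =>
            new Guid(MD5.HashData(Encoding.UTF8.GetBytes(text))); // HashData is .NET 5+

        // Rule 2: 64-bit FNV-1a really is only a few lines of code.
        public static ulong Fnv1a64(ReadOnlySpan<byte> data)
        {
            const ulong OffsetBasis = 14695981039346656037;
            const ulong Prime = 1099511628211;
            ulong hash = OffsetBasis;
            foreach (byte b in data)
            {
                hash ^= b;     // XOR the byte in first (the "1a" variant)...
                hash *= Prime; // ...then multiply by the FNV prime
            }
            return hash;
        }
    }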

With the arrival of the System.IO.Hashing namespace we now have classes specifically designed for error checking (the Crc32 and Crc64 classes) and for hashing (the XxHash32 and XxHash64 classes). You can throw away your borrowed or hand-written equivalent code.
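For orientation, here is a quick sketch of the new API. Each class offers static one-shot Hash helpers as well as incremental use via Append and GetCurrentHash; narrowing the result to a ulong with BitConverter is just one convenient option (the byte order of that conversion doesn't matter as long as you are consistent):

    using System;
    using System.IO.Hashing; // NuGet package: System.IO.Hashing
    using System.Text;

    byte[] data = Encoding.UTF8.GetBytes("Hello, world");

    // Error detection: one-shot CRC-64.
    byte[] crc = Crc64.Hash(data);

    // Hashing: one-shot XxHash64, narrowed to a ulong for use as a key.
    ulong key = BitConverter.ToUInt64(XxHash64.Hash(data));

    // Both families also support incremental hashing of streamed data.
    var xxh = new XxHash64();
    xxh.Append(data);
    byte[] sameHash = xxh.GetCurrentHash();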

The arrival of the classes also made me realise that I've been a naughty developer, because I have historically been using an error-detecting algorithm as a hash algorithm. The distinction is important...

👉 Error-detecting codes like those created by CRC are specifically designed to detect common types of accidental changes to digital data (as described in the Wikipedia article). I can't find any evidence that CRC was also designed to produce good hash codes, but luckily it does. My HashStrings app and other tests show that CRC produces widely distributed and collision-resistant output codes. This reassures me that I haven't been making a gross mistake by using CRC for hashing over the last 20 years.
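The kind of evidence I mean looks like this (a throwaway sketch, not the actual HashStrings code): hash a large batch of similar short keys and count collisions.

    using System;
    using System.Collections.Generic;
    using System.IO.Hashing;
    using System.Text;

    const int Count = 1_000_000;
    var seen = new HashSet<ulong>();
    int collisions = 0;

    for (int i = 0; i < Count; i++)
    {
        byte[] key = Encoding.UTF8.GetBytes($"ROW-{i:D8}");
        if (!seen.Add(BitConverter.ToUInt64(Crc64.Hash(key)))) collisions++;
    }

    // The birthday bound predicts roughly Count^2 / 2^65 collisions for a
    // well-distributed 64-bit hash -- effectively zero at a million items.
    Console.WriteLine($"{Count:N0} keys -> {collisions} collisions");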

👉 A hash algorithm is specifically designed to convert arbitrary amounts of input data into a small fixed-size value called the hash, digest, or signature. It is desirable that the algorithm be sensitive (the avalanche effect), so that small changes to the input produce completely different hash outputs, and that the hashes be widely distributed. The new XxHash classes claim to satisfy all of those requirements.
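The sensitivity property is easy to demonstrate (another throwaway sketch): hash two inputs that differ by a single character and count how many output bits change.

    using System;
    using System.IO.Hashing;
    using System.Numerics;
    using System.Text;

    ulong a = BitConverter.ToUInt64(XxHash64.Hash(Encoding.UTF8.GetBytes("invoice-1001")));
    ulong b = BitConverter.ToUInt64(XxHash64.Hash(Encoding.UTF8.GetBytes("invoice-1002")));

    // Around half of the 64 output bits should differ for a good hash.
    Console.WriteLine($"{a:X16} vs {b:X16} -> {BitOperations.PopCount(a ^ b)} bits differ");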

In the future I will stop using CRC or FNV-1a for hashing and switch to XxHash everywhere.

I will continue to use MD5 for making internal data signatures that need to be stronger than a non-cryptographic hash.

Further reading:

Hashing Short Strings
HashStrings Repository