If Linus truly doesn't care about security, then git could use any error-detecting code that produces a uniform distribution of tags, such as CRC64. The size of the tag only affects the number of objects we'd expect to be able to commit before we see a collision: over 4 billion in the case of CRC64.
Linus mistakenly claims using a cryptographic hash function helps avoid non-malicious collisions, but this is not the case.
Where the choice of a cryptographic hash function matters is specifically if we expect an attacker to be trying to collide tags. CRC64 is a linear function of the input data and therefore fails miserably at preventing attackers from colliding tags, but still produces a uniform distribution of tags for non-malicious inputs.
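A quick sketch of what that linearity means in practice, in Python with stdlib zlib.crc32 standing in for CRC64 (the same algebra applies at 64 bits): for equal-length messages, the CRC of an XOR of inputs is the XOR of their CRCs, which is exactly the structure an attacker exploits to build collisions.

    import os
    import zlib

    # Three random equal-length messages
    a, b, c = (os.urandom(64) for _ in range(3))
    abc = bytes(x ^ y ^ z for x, y, z in zip(a, b, c))

    # CRC is affine over GF(2), so for equal-length inputs:
    # crc(a ^ b ^ c) == crc(a) ^ crc(b) ^ crc(c) -- always.
    assert zlib.crc32(abc) == zlib.crc32(a) ^ zlib.crc32(b) ^ zlib.crc32(c)

    # No unbroken cryptographic hash admits an identity like this.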
git seems to be in the odd place where Linus argues he's using a cryptographic hash function but not for security purposes.
Well, if the cost of computation is not too relevant, and if you don't explicitly need the ability to craft collisions, why would you use a non-cryptographic hash function?
Like, when I'm building a lookup index for files, I'm going to use sha-(something), because it's easy and well known. I don't particularly care about the security aspect; I care that everyone immediately knows the contract of sha-1.
There is nothing to be gained from using cryptographic primitives in a non-security context. You could just as easily use e.g. CRC32 for the case you're describing.
There is, however, a performance cost in using cryptographic primitives in non-security-related contexts. You may not care about performance, but it certainly matters for something like git.
Linus claims: "So in git, the hash is used for de-duplication and error detection, and the 'cryptographic' nature is mainly because a cryptographic hash is really good at those things."
CRC produces a distribution just as uniform as a cryptographic hash function, and it's faster to boot. If these are the only things he actually cares about, and he's explicitly discounting security, he's choosing a slower primitive for no reason.
He writes off CRC inexplicably earlier in the post:
"Other SCM's have used things like CRC's for error detection, although honestly the most common error handling method in most SCM's tends to be 'tough luck, maybe your data is there, maybe it isn't, I don't care'."
Linus seems to think that SHA1 has some sort of magic crypto sauce that makes the distribution it produces more uniform than a CRC's. It doesn't. The only difference is SHA1 was originally designed to be resistant to preimage and collision attacks, both of which are irrelevant outside of a security context.
You can still make a not-totally-unreasonable argument that something like CRC64 is simply too small - that maybe the 1 in a million collision chance for a few million hashes is too high. The fast, keyed 'semi-cryptographic' big hashes that are common now weren't around when git was written, so the easiest thing to reach for would have been something like SHA-1.
For a 64-bit tag, even with 6,100,000 objects we'd only have a 1 in 10 million chance of a collision, so a 64-bit tag is more than sufficient to meet your stated requirements.
No no, I don't have any requirements, I just did the numbers and missed a zero instead of sensibly looking at a table. But that doesn't change the argument much - if one in a million isn't crazy, one in ten million is not completely insane either. With a 64-bit hash, you probably should write code to deal with a potential collision. Besides the (tiny) chance, someone might legitimately plop some test vectors that collide into a repo, just like the WebKit people did the other day. With SHA-1 (in 2005), you can just punt and spend the time you saved yelling at people complaining to you about the choice of SHA-1 on mailing lists. It seems like a pragmatic implementation decision for the time.
I'm not trying to defend Linus's somewhat confused explanations; it's just that git's 'security' requirements are woolly, and one could reasonably get away with half-assing it a bit for a while.
If you feel CRC64 does not meet the requirements, use CRC128.
For what it's worth, I think CRC64 should be fine for git-like workloads (but would still recommend using a cryptographically secure hash function, because git's usage is security-critical despite Linus constantly insisting it's not).
Why would you do that? Even if you don't know exactly what you want, or are wrong about whether it's the basis of 'trust', for the purposes of writing git you'd just take SHA-1. Nothing terrible is going to happen if it's both overkill and you aren't really building a secure system. You seem to be arguing, if I'm understanding you right, that you should use a cryptographically strong hash only if you need all its properties. That seems like a really odd angle.
I am likewise perplexed why "cryptographic hash functions are unnecessary in the absence of an attacker" is such a difficult concept for you to grasp.
It is not an "odd angle". It is literally the very purpose for which they were created in the first place: to defend against attacks (preimage, collision).
If there are no attackers, the cryptography buys you nothing and merely makes the system slower.
Again, to go back to the original point: Linus's argument is that cryptographic functions have unique properties that make them specifically useful in non-security contexts. He's wrong. They don't. The non-cryptographic constructions he namedropped then glossed over work fine in these contexts.
> I am likewise perplexed why "cryptographic hash functions are unnecessary in the absence of an attacker" is such a difficult concept for you to grasp.
Well, if we're going to be dicks to each other about it I'll try to explain what I think appears to be difficult for you to grasp. :) If you were throwing together something like git in a hurry you'd want a hash that
Lets you not have to think about collisions at all even if:
- The collisions are by mere chance
- The collisions arise by non-malicious accident
- The collisions arise from malicious inputs [obviously, that implies an attacker but it also falls under 'I just don't want to think about collisions']
And right there, you grab the first non-completely-broken, not-too-giant, not-too-slow crypto hash around and get on with whatever else you had in mind. And this, to me, seems like the right call, especially since if collisions did eventually pop up, it'd probably be one generated collision and your entire system won't suddenly implode just because it exists. You'll have some time to fix stuff.
> Again, to go back to the original point: Linus's argument is that cryptographic functions have unique properties that make them specifically useful in non-security contexts. He's wrong. They don't. The non-cryptographic constructions he namedropped then glossed over work fine in these contexts.
No argument there. Maybe I misunderstood 'Linus said something wrong' as 'Linus did something horribly wrong' and we're arguing over nothing?
I recently worked on a project where I had to choose a hash function for non-cryptographic error checking. I investigated CRC in detail for it, even wrote two different implementations from scratch.
CRC-X is complicated to use and a terrible choice.
First of all, it is not a single algorithm, it is a family of algorithms. For a CRC of size N, you have to also choose an N-bit polynomial, an N-bit starting value, and an N-bit output XOR value. There is no single standard; there are tens or hundreds of popular options [1]. The optimal polynomial depends on both the CRC size and the length of the data you are feeding into it [2]. If you choose poor parameters, you will get terrible error detection characteristics (like allowing extra 0 bits at the start of data with no change to the checksum). If you choose good parameters, certain classes of common errors, like zeroing out a block of more than N bits, will still have much worse characteristics than a 1 / 2^N random chance of collision.
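To make the "family of algorithms" point concrete, here is a sketch of a generic reflected 32-bit CRC in Python, parameterized the way the zoo of CRC variants is. The default parameters reproduce zlib.crc32; a poor init value demonstrates the leading-zeros blind spot mentioned above (shown at the byte level):

    import zlib

    def crc32_generic(data, poly=0xEDB88320, init=0xFFFFFFFF, xorout=0xFFFFFFFF):
        # Generic reflected 32-bit CRC; "CRC-32" is just one choice
        # of (poly, init, xorout) among the many parameter sets in use.
        crc = init
        for byte in data:
            crc ^= byte
            for _ in range(8):
                crc = (crc >> 1) ^ (poly if crc & 1 else 0)
        return crc ^ xorout

    msg = b"some object data"
    # The default parameters reproduce zlib.crc32:
    assert crc32_generic(msg) == zlib.crc32(msg)
    # A poor choice (init=0) is blind to extra leading zero bytes:
    assert (crc32_generic(b"\x00\x00" + msg, init=0, xorout=0)
            == crc32_generic(msg, init=0, xorout=0))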
Second, implementation in standard libraries is patchy. Most programming languages have some CRC32 implementation - but do not document what parameters they use, or use different notation for the same parameters (forward and reverse), or do not let you change the parameters, or do not let you change the CRC size, or all of these. There is no easy way to get a "standard CRC64 or CRC128" compatible across platforms without putting it together from GitHub snippets and example code yourself.
Third, CRC is fast when implemented in hardware, but not that much faster than SHA-1 or SHA-512 in software. It's only 1.5-2.5x faster [3], and when you're doing one checksum per uploaded file or something, it really does not matter. It's going to be even slower when you don't have a suitable-length CRC available in optimised native code, and have to write it yourself in simple C or pure Python.
The obvious and simple solution is to pick a known popular cryptographic hash like SHA-X that is available in the standard library of every programming language under the exact same API with no parameters to configure, truncate its output to the digest size you want, and call it a day. No need to worry about error detection performance on specific error cases like long bursts of zeros, and you get some defense against malicious tampering as a free bonus.
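A sketch of that approach in Python (the name tag64 is hypothetical; assuming a 64-bit tag is the target):

    import hashlib

    def tag64(data: bytes) -> bytes:
        # Truncate a standard digest to 8 bytes; any prefix of a
        # cryptographic hash output is still uniformly distributed.
        return hashlib.sha256(data).digest()[:8]

    print(tag64(b"some object data").hex())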
Wouldn't MD5 also solve all the availability and standardization problems you mentioned just as well as SHA-X, and also meet git's other non-security requirements just as well?
So why SHA-1 over MD5, which is probably more available and an order of magnitude faster? What problems does MD5 have, outside of its security, that SHA-1 does not?
I'm glad you've at least looked into this, but you have a number of things wrong:
> First of all, it is not a single algorithm, it is a family of algorithms
This argument holds for SHA1, but all modern cryptographic hash functions are also families. The SHA3 family has SHA3-224, SHA3-256, SHA3-384, SHA3-512, SHAKE128, and SHAKE256.
> For a CRC of size N, you have to also choose an N-bit polynomial, an N-bit starting value, and an N-bit output XOR value. There is no single standard; there are tens or hundreds of popular options
> Most programming languages have some CRC32 implementation - but do not document what parameters they use, or use different notation for the same parameters (forward and reverse), or do not let you change the parameters, or do not let you change the CRC size, or all of these. There is no easy way to get a "standard CRC64 or CRC128" compatible across platforms without putting it together from GitHub snippets and example code yourself.
Here's CRC32C implementations in a handful of popular languages:
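For instance, a minimal from-scratch Python version, assuming the standard reflected Castagnoli polynomial 0x82F63B78 (Python's stdlib ships plain CRC-32 in zlib, but not CRC-32C):

    def crc32c(data: bytes) -> int:
        # Bitwise CRC-32C (Castagnoli): reflected polynomial 0x82F63B78,
        # initial value and final XOR both 0xFFFFFFFF.
        crc = 0xFFFFFFFF
        for byte in data:
            crc ^= byte
            for _ in range(8):
                crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
        return crc ^ 0xFFFFFFFF

    assert crc32c(b"123456789") == 0xE3069283  # published check value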
And again, I'm not actually recommending git switch to CRC (as a fan of security, I would prefer they use an actual collision resistant hash function). But CRC better meets Linus's stated requirements:
- CRC will be faster (much faster, as in ~4X; your numbers seem off to me) in almost all cases - see the benchmark sketch below
- CRC will *not* fail to detect a bitflip, whereas there's a certain probability a random oracle-based construction will
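A rough way to check both claims yourself (the ratios vary a lot by platform and by how the primitives are compiled; this uses Python's stdlib CRC-32 and SHA family as proxies):

    import hashlib
    import timeit
    import zlib

    buf = bytes(range(256)) * 4096  # 1 MiB test buffer

    for name, fn in [("crc32", lambda: zlib.crc32(buf)),
                     ("sha1", lambda: hashlib.sha1(buf).digest()),
                     ("sha256", lambda: hashlib.sha256(buf).digest())]:
        t = timeit.timeit(fn, number=200) / 200
        print(f"{name:7s} {t * 1e3:.3f} ms per MiB")

    # The bitflip point: flipping any single bit always changes a CRC.
    msg = bytearray(b"some object data")
    base = zlib.crc32(msg)
    for i in range(len(msg) * 8):
        msg[i // 8] ^= 1 << (i % 8)      # flip one bit
        assert zlib.crc32(msg) != base   # detection is guaranteed, not probabilistic
        msg[i // 8] ^= 1 << (i % 8)      # flip it back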
> I am likewise perplexed why "cryptographic hash functions are unnecessary in the absence of an attacker" is such a difficult concept for you to grasp.
Because you can't seem to tell the difference between "unnecessary" and "shouldn't be done".
If I build a shed, then using larger screws on the door might be unnecessary, but it only costs me 0.2% more and I know it won't fall over.
Using a recent SHA function might be overkill in a non-crypto context but it's high-quality and fast. And there's existing libraries for whatever language you want. Why the hell would anyone use CRC128? I've never even heard of it before.
It's a good hash choice, no matter how unnecessary.
Clearly it's not, because it's both several times slower than the CRC family, and was known to have cryptographic flaws before git was even released. "The worst of both worlds" so to speak...
> There is, however, a performance cost in using cryptographic primitives in non-security-related contexts. You may not care about performance, but it certainly matters for something like git.
The performance difference between SHA1 and SHA256 is less than 50%. Unless hashing is a significant percentage of git, which it isn't, this is an insignificant difference.
I absolutely believe git should use a cryptographically secure hash function.
But you're completely missing my point, which is about Linus claiming that git's use of SHA1 isn't security-critical while using a cryptographically secure hash function anyway.
There are better cryptographic hash functions (e.g. Blake2) than SHA1 that are even faster. Check out the "results" slide of this slide deck for the cycles per byte table: https://blake2.net/acns/slides.html
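For example, Python's hashlib has shipped BLAKE2 since 3.6, and digest_size=20 gives a drop-in SHA-1-sized tag:

    import hashlib

    # BLAKE2b with a 20-byte digest: same tag size as SHA-1, still
    # collision resistant, and typically faster in software.
    tag = hashlib.blake2b(b"object data", digest_size=20).hexdigest()
    print(tag)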
> Linus mistakenly claims using a cryptographic hash function helps avoid non-malicious collisions, but this is not the case.
Of course it does. It is using a different field (cryptography) to get a CRC that "really, really won't collide", because that whole field is completely busted if it does.
Let me put it this way. If I really, really need a random distribution of white noise, I might use a different field, cryptography, to provide it: because if the distribution is not effectively random and uniformly distributed, that field in some fundamental sense is broken: no information is supposed to make it into the ciphertext, it should be indistinguishable from white noise.
So encrypting your source of white noise for the sole purpose of making it statistically closer to noise is a perfectly valid choice.
Actually, in your comment you said it yourself: after as few as four billion commits, CRC64 expects to see a collision. That is tiny compared to the search space cryptographers work with.
If you look at the history of git there was originally no reason to use cryptographic functions except in the same way as the analogy I just made (for white noise): he borrowed a property from a different field from the one he was working in.
You seem to be operating under the same sort of "cryptographic hash functions are magic!" delusions as Linus.
CRC and SHA1 both produce a uniform distribution. SHA1 does not magically do this better because cryptography. The only things that make CRC and SHA1 any different are:
- SHA1 produces a longer tag (of course CRC256 is a thing)
- SHA1 is hardened against preimage attacks
- SHA1 was intended to be secure against collision attacks (not anymore!)
SHA1, truncated to 32 or 64 bits, will produce a distribution just as uniform as CRC.
In a non-security setting, we can pick the size of the tag based on the rough number of objects we'd like to be able to store before we'd expect to see a collision (i.e. the birthday bound). If that number is ~4 billion, then CRC64 is sufficient.
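A small sketch of the arithmetic, using the usual birthday approximation:

    import math

    def collision_probability(n: float, bits: int) -> float:
        # Birthday approximation: p ~= 1 - exp(-n^2 / 2^(bits + 1))
        return 1 - math.exp(-n * n / 2 ** (bits + 1))

    # A 64-bit tag starts seeing real collision risk around 2^32 objects:
    print(collision_probability(4.3e9, 64))  # ~0.39
    print(collision_probability(1e6, 64))    # ~2.7e-08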