If Linus truly doesn't care about security, then git could use any error-detecting code that produces a uniform distribution of tags, such as CRC64. The size of the tag only affects the number of objects we'd expect to be able to commit before we see a collision: over 4 billion in the case of CRC64.
Linus mistakenly claims using a cryptographic hash function helps avoid non-malicious collisions, but this is not the case.
Where the choice of a cryptographic hash function matters is specifically if we expect an attacker to be trying to collide tags. CRC64 is a linear function of the input data and therefore fails miserably at preventing attackers from colliding tags, but still produces a uniform distribution of tags for non-malicious inputs.
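A quick sketch of what that linearity means in practice, in Python with stdlib zlib.crc32 standing in for CRC64 (the same algebra applies at 64 bits): for equal-length messages, the CRC of an XOR of inputs is the XOR of their CRCs, which is exactly the structure an attacker exploits to build collisions.

    import os
    import zlib

    # Three random equal-length messages
    a, b, c = (os.urandom(64) for _ in range(3))
    abc = bytes(x ^ y ^ z for x, y, z in zip(a, b, c))

    # CRC is affine over GF(2), so for equal-length inputs:
    # crc(a ^ b ^ c) == crc(a) ^ crc(b) ^ crc(c) -- always.
    assert zlib.crc32(abc) == zlib.crc32(a) ^ zlib.crc32(b) ^ zlib.crc32(c)

    # No unbroken cryptographic hash admits an identity like this.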
git seems to be in the odd place where Linus argues he's using a cryptographic hash function but not for security purposes.
Well, if the cost of computation is not too relevant, and if you don't explicitly need the ability to craft collisions, why would you use a non-cryptographic hash function?
Like, when I'm building a lookup index for files, I'm going to use sha-(something), because it's easy and well known. I don't particularly care about the security aspect; I care that everyone immediately knows the contract of sha-1.
There is nothing to be gained from using cryptographic primitives in a non-security context. You could just as easily use e.g. CRC32 for the case you're describing.
There is, however, a performance cost in using cryptographic primitives in non-security-related contexts. You may not care about performance, but it certainly matters for something like git.
Linus claims: "So in git, the hash is used for de-duplication and error detection, and the 'cryptographic' nature is mainly because a cryptographic hash is really good at those things."
CRC produces a distribution just as uniform as a cryptographic hash function, and it's faster to boot. If these are the only things he actually cares about, and he's explicitly discounting security, he's choosing a slower primitive for no reason.
He writes off CRC inexplicably earlier in the post:
"Other SCM's have used things like CRC's for error detection, although honestly the most common error handling method in most SCM's tends to be 'tough luck, maybe your data is there, maybe it isn't, I don't care'."
Linus seems to think that SHA1 has some sort of magic crypto sauce that makes the distribution it produces more uniform than a CRC's. It doesn't. The only difference is SHA1 was originally designed to be resistant to preimage and collision attacks, both of which are irrelevant outside of a security context.
You can still make a not-totally-unreasonable argument that something like CRC64 is simply too small - that maybe the 1 in a million collision chance for a few million hashes is too high. The fast, keyed 'semi-cryptographic' big hashes that are common now weren't around when git was written, so the easiest thing to reach for would have been something like SHA-1.
For a 64-bit tag, even with 6,100,000 objects we'd only have a 1 in 10 million chance of a collision, so a 64-bit tag is more than sufficient to meet your stated requirements.
No no, I don't have any requirements, I just did the numbers and missed a zero instead of sensibly looking at a table. But that doesn't change the argument much - if one in a million isn't crazy, one in ten million is not completely insane either. With a 64-bit hash, you probably should write code to deal with a potential collision. Besides the (tiny) chance, someone might legitimately plop some test vectors that collide into a repo, just like the WebKit people did the other day. With SHA-1 (in 2005), you can just punt and spend the time you saved yelling at people complaining to you about the choice of SHA-1 on mailing lists. It seems like a pragmatic implementation decision for the time.
I'm not trying to defend Linus's somewhat confused explanations; it's just that git's 'security' requirements are woolly, and one could reasonably get away with half-assing it a bit for a while.
If you feel CRC64 does not meet the requirements, use CRC128.
For what it's worth, I think CRC64 should be fine for git-like workloads (but would still recommend using a cryptographically secure hash function, because git's usage is security-critical despite Linus constantly insisting it's not).
Why would you do that? Even if you don't know exactly what you want, or are wrong about whether it's the basis of 'trust', for the purposes of writing git you'd just take SHA-1. Nothing terrible is going to happen if it's both overkill and you aren't really building a secure system. You seem to be arguing, if I'm understanding you right, that you should use a cryptographically strong hash only if you need all its properties. That seems like a really odd angle.
I am likewise perplexed why "cryptographic hash functions are unnecessary in the absence of an attacker" is such a difficult concept for you to grasp.
It is not an "odd angle". It is literally the very purpose for which they were created in the first place: to defend against attacks (preimage, collision).
If there are no attackers, the cryptography buys you nothing and merely makes the system slower.
Again, to go back to the original point: Linus's argument is that cryptographic functions have unique properties that make them specifically useful in non-security contexts. He's wrong. They don't. The non-cryptographic constructions he namedropped then glossed over work fine in these contexts.
> I am likewise perplexed why "cryptographic hash functions are unnecessary in the absence of an attacker" is such a difficult concept for you to grasp.
Well, if we're going to be dicks to each other about it I'll try to explain what I think appears to be difficult for you to grasp. :) If you were throwing together something like git in a hurry you'd want a hash that
Lets you not have to think about collisions at all even if:
- The collisions are by mere chance
- The collisions arise by non-malicious accident
- The collisions arise from malicious inputs [obviously, that implies an attacker but it also falls under 'I just don't want to think about collisions']
And right there, you grab the first non-completely-broken, not-too-giant, not-too-slow crypto hash around and get on with whatever else you had in mind. And this, to me, seems like the right call, especially since if collisions did eventually pop up, it'd probably be one generated collision and your entire system won't suddenly implode just because it exists. You'll have some time to fix stuff.
> Again, to go back to the original point: Linus's argument is that cryptographic functions have unique properties that make them specifically useful in non-security contexts. He's wrong. They don't. The non-cryptographic constructions he namedropped then glossed over work fine in these contexts.
No argument there. Maybe I misunderstood 'Linus said something wrong' as 'Linus did something horribly wrong' and we're arguing over nothing?
I recently worked on a project where I had to choose a hash function for non-cryptographic error checking. I investigated CRC in detail for it, even wrote two different implementations from scratch.
CRC-X is complicated to use and a terrible choice.
First of all, it is not a single algorithm, it is a family of algorithms. For a CRC of size N, you have to also choose an N-bit polynomial, an N-bit starting value, and an N-bit output XOR value. There is no single standard; there are tens or hundreds of popular options [1]. The optimal polynomial depends on both the CRC size and the length of the data you are feeding into it [2]. If you choose poor parameters, you will get terrible error detection characteristics (like allowing extra 0 bits at the start of data with no change to the checksum). If you choose good parameters, certain classes of common errors, like zeroing out a block of more than N bits, will still have much worse characteristics than a 1 / 2^N random chance of collision.
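To make the "family of algorithms" point concrete, here is a sketch of a generic reflected 32-bit CRC in Python, parameterized the way the zoo of CRC variants is. The default parameters reproduce zlib.crc32; a poor init value demonstrates the leading-zeros blind spot mentioned above (shown at the byte level):

    import zlib

    def crc32_generic(data, poly=0xEDB88320, init=0xFFFFFFFF, xorout=0xFFFFFFFF):
        # Generic reflected 32-bit CRC; "CRC-32" is just one choice
        # of (poly, init, xorout) among the many parameter sets in use.
        crc = init
        for byte in data:
            crc ^= byte
            for _ in range(8):
                crc = (crc >> 1) ^ (poly if crc & 1 else 0)
        return crc ^ xorout

    msg = b"some object data"
    # The default parameters reproduce zlib.crc32:
    assert crc32_generic(msg) == zlib.crc32(msg)
    # A poor choice (init=0) is blind to extra leading zero bytes:
    assert (crc32_generic(b"\x00\x00" + msg, init=0, xorout=0)
            == crc32_generic(msg, init=0, xorout=0))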
Second, implementation in standard libraries is patchy. Most programming languages have some CRC32 implementation - but do not document what parameters they use, or use different notation for the same parameters (forward and reverse), or do not let you change the parameters, or do not let you change the CRC size, or all of these. There is no easy way to get a "standard CRC64 or CRC128" compatible across platforms without putting it together from GitHub snippets and example code yourself.
Third, CRC is fast when implemented in hardware, but not that much faster than SHA-1 or SHA-512 in software. It's only 1.5-2.5x faster [3], and when you're doing one checksum per uploaded file or something, it really does not matter. It's going to be even slower when you don't have a suitable-length CRC available in optimised native code, and have to write it yourself in simple C or pure Python.
The obvious and simple solution is to pick a known popular cryptographic hash like SHA-X that is available in the standard library of every programming language under the exact same API with no parameters to configure, truncate its output to the digest size you want, and call it a day. No need to worry about error detection performance on specific error cases like long bursts of zeros, and you get some defense against malicious tampering as a free bonus.
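A sketch of that approach in Python (the name tag64 is hypothetical; assuming a 64-bit tag is the target):

    import hashlib

    def tag64(data: bytes) -> bytes:
        # Truncate a standard digest to 8 bytes; any prefix of a
        # cryptographic hash output is still uniformly distributed.
        return hashlib.sha256(data).digest()[:8]

    print(tag64(b"some object data").hex())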
Wouldn't MD5 also solve all the availability and standardization problems you mentioned just as well as SHA-X, and also meet git's other non-security requirements just as well?
So why SHA-1 over MD5, which is probably more available and an order of magnitude faster? What problems does MD5 have, outside of its security, that SHA-1 does not?
I'm glad you've at least looked into this, but you have a number of things wrong:
> First of all, it is not a single algorithm, it is a family of algorithms
This argument holds for SHA1, but all modern cryptographic hash functions are also families. The SHA3 family has SHA3-224, SHA3-256, SHA3-384, SHA3-512, SHAKE128, and SHAKE256.
> For a CRC of size N, you have to also choose an N-bit polynomial, an N-bit starting value, and an N-bit output XOR value. There is no single standard; there are tens or hundreds of popular options
> Most programming languages have some CRC32 implementation - but do not document what parameters they use, or use different notation for the same parameters (forward and reverse), or do not let you change the parameters, or do not let you change the CRC size, or all of these. There is no easy way to get a "standard CRC64 or CRC128" compatible across platforms without putting it together from GitHub snippets and example code yourself.
Here's CRC32C implementations in a handful of popular languages:
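For instance, a minimal from-scratch Python version, assuming the standard reflected Castagnoli polynomial 0x82F63B78 (Python's stdlib ships plain CRC-32 in zlib, but not CRC-32C):

    def crc32c(data: bytes) -> int:
        # Bitwise CRC-32C (Castagnoli): reflected polynomial 0x82F63B78,
        # initial value and final XOR both 0xFFFFFFFF.
        crc = 0xFFFFFFFF
        for byte in data:
            crc ^= byte
            for _ in range(8):
                crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
        return crc ^ 0xFFFFFFFF

    assert crc32c(b"123456789") == 0xE3069283  # published check value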
And again, I'm not actually recommending git switch to CRC (as a fan of security, I would prefer they use an actual collision resistant hash function). But CRC better meets Linus's stated requirements:
- CRC will be faster (much faster, as in ~4X; your numbers seem off to me) in almost all cases - see the benchmark sketch below
- CRC will *not* fail to detect a bitflip, whereas there's a certain probability a random oracle-based construction will
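A rough way to check both claims yourself (the ratios vary a lot by platform and by how the primitives are compiled; this uses Python's stdlib CRC-32 and SHA family as proxies):

    import hashlib
    import timeit
    import zlib

    buf = bytes(range(256)) * 4096  # 1 MiB test buffer

    for name, fn in [("crc32", lambda: zlib.crc32(buf)),
                     ("sha1", lambda: hashlib.sha1(buf).digest()),
                     ("sha256", lambda: hashlib.sha256(buf).digest())]:
        t = timeit.timeit(fn, number=200) / 200
        print(f"{name:7s} {t * 1e3:.3f} ms per MiB")

    # The bitflip point: flipping any single bit always changes a CRC.
    msg = bytearray(b"some object data")
    base = zlib.crc32(msg)
    for i in range(len(msg) * 8):
        msg[i // 8] ^= 1 << (i % 8)      # flip one bit
        assert zlib.crc32(msg) != base   # detection is guaranteed, not probabilistic
        msg[i // 8] ^= 1 << (i % 8)      # flip it back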
> I am likewise perplexed why "cryptographic hash functions are unnecessary in the absence of an attacker" is such a difficult concept for you to grasp.
Because you can't seem to tell the difference between "unnecessary" and "shouldn't be done".
If I build a shed, then using larger screws on the door might be unnecessary, but it only costs me 0.2% more and I know it won't fall over.
Using a recent SHA function might be overkill in a non-crypto context but it's high-quality and fast. And there's existing libraries for whatever language you want. Why the hell would anyone use CRC128? I've never even heard of it before.
It's a good hash choice, no matter how unnecessary.
Clearly it's not, because it's both several times slower than the CRC family, and was known to have cryptographic flaws before git was even released. "The worst of both worlds" so to speak...
> There is, however, a performance cost in using cryptographic primitives in non-security-related contexts. You may not care about performance, but it certainly matters for something like git.
The performance difference between SHA1 and SHA256 is less than 50%. Unless hashing is a significant percentage of git, which it isn't, this is an insignificant difference.
I absolutely believe git should use a cryptographically secure hash function.
But you're completely missing my point, which is about Linus claiming that git's use of SHA1 isn't security-critical while using a cryptographically secure hash function anyway.
There are better cryptographic hash functions (e.g. Blake2) than SHA1 that are even faster. Check out the "results" slide of this slide deck for the cycles per byte table: https://blake2.net/acns/slides.html
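For example, Python's hashlib has shipped BLAKE2 since 3.6, and digest_size=20 gives a drop-in SHA-1-sized tag:

    import hashlib

    # BLAKE2b with a 20-byte digest: same tag size as SHA-1, still
    # collision resistant, and typically faster in software.
    tag = hashlib.blake2b(b"object data", digest_size=20).hexdigest()
    print(tag)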
> Linus mistakenly claims using a cryptographic hash function helps avoid non-malicious collisions, but this is not the case.
Of course it does. It is using a different field (cryptography) to get a CRC that "really, really won't collide", because that whole field is completely busted if it does.
Let me put it this way. If I really, really need a random distribution of white noise, I might use a different field, cryptography, to provide it: because if the distribution is not effectively random and uniformly distributed, that field in some fundamental sense is broken: no information is supposed to make it into the ciphertext, it should be indistinguishable from white noise.
So encrypting your source of white noise for the sole purpose of making it statistically closer to noise is a perfectly valid choice.
Actually, in your comment you said it yourself: after as few as four billion commits, CRC64 expects to see a collision. That is tiny compared to the search space cryptographers work with.
If you look at the history of git there was originally no reason to use cryptographic functions except in the same way as the analogy I just made (for white noise): he borrowed a property from a different field from the one he was working in.
You seem to be operating under the same sort of "cryptographic hash functions are magic!" delusions as Linus.
CRC and SHA1 both produce a uniform distribution. SHA1 does not magically do this better because cryptography. The only things that make CRC and SHA1 any different are:
- SHA1 produces a longer tag (of course CRC256 is a thing)
- SHA1 is hardened against preimage attacks
- SHA1 was intended to be secure against collision attacks (not anymore!)
SHA1, truncated to 32 or 64 bits, will produce a distribution just as uniform as CRC.
In a non-security setting, we can pick the size of the tag based on the rough number of objects we'd like to be able to store before we'd expect to see a collision (i.e. the birthday bound). If that number is ~4 billion, then CRC64 is sufficient.
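A small sketch of the arithmetic, using the usual birthday approximation:

    import math

    def collision_probability(n: float, bits: int) -> float:
        # Birthday approximation: p ~= 1 - exp(-n^2 / 2^(bits + 1))
        return 1 - math.exp(-n * n / 2 ** (bits + 1))

    # A 64-bit tag starts seeing real collision risk around 2^32 objects:
    print(collision_probability(4.3e9, 64))  # ~0.39
    print(collision_probability(1e6, 64))    # ~2.7e-08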