
> reversing commit hashes back into their contents

Somewhat off topic, but is this actually possible?

Given hashing is inherently lossy, I'm inclined to assume it's not possible for anything much longer than a password, but commits are text, which I suppose is low entropy per character, so I don't know.



It doesn't sound possible for all but the most trivial diffs. An attacker would have to guess not only the exact commit text, but also its timestamp. And the only attack scenario I can conceive of seems pretty silly. Maybe I'm missing something...

Alice clones an open source git repo, commits one secret change where she edits a config file's default password to her own secret password (a bad practice), and then publishes the new hash in public for some reason (build info?). Mallory would have to (a) know that exactly this happened, (b) guess the commit message, (c) guess the commit's timestamp to the second (or within a few seconds), and (d) preimage-attack her password.

And the preimage attack must pierce git's Merkle tree, which sounds downright impossible. (Unless Mallory is just bruteforcing, in which case a strong password is enough.)
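For reference, the hash covers much more than the diff. A rough Python sketch of how git derives a commit's SHA-1 (the field values here are hypothetical, but the object layout matches what `git cat-file commit` shows):

```python
import hashlib

def git_commit_hash(tree_sha, parent_sha, author, committer, timestamp, message):
    """Sketch of git's commit hashing: SHA-1 over a header plus the
    commit object body. The body includes the tree hash, parent hash,
    author/committer lines (with timestamps), and the message -- so an
    attacker must guess every one of these fields exactly."""
    body = (
        f"tree {tree_sha}\n"
        f"parent {parent_sha}\n"
        f"author {author} {timestamp} +0000\n"
        f"committer {committer} {timestamp} +0000\n"
        f"\n"
        f"{message}\n"
    ).encode()
    # Git prepends an object header: "commit <byte length>\0"
    return hashlib.sha1(b"commit %d\x00" % len(body) + body).hexdigest()
```

Shifting the timestamp by even one second produces a completely different hash, which is why (c) above is a real obstacle.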


Most commits are not just text, they are source code. If you had a feasible way to enumerate all the byte sequences that collapse into a given hash value, you might find the subset that is syntactically correct code to be very small. Except if it's Perl, of course.

And the likelihood of all the characters aligning in a way a compiler might find acceptable gets lower with every increase in the length of your collisions, so it would be extremely unlikely that the shortest nontrivial match (for both the hash and for code sanity) would not be the right one. The code constraint certainly would not make it easier to find the collision in the first place, but it would give you great confidence in the result if you did.


No, you're correct that for large inputs there are obviously going to be many that lead to the same hash (see pigeonhole principle).


This is presumably true also of small inputs, given that the large and small inputs are all mapping to the same space.


Well not necessarily true for small inputs. Hash a single byte for example. There is no collision in sha1 for that so you can build a 1:1 mapping of hashes back to input examples for that case.

But yeah, as the input size grows past the output size, the probability of a collision existing approaches 1 (and once there are more possible inputs than outputs, a collision is guaranteed). The birthday paradox formula gives the probability of a collision (assuming random placement in the output space) based on the number of inputs.
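A quick sketch of that birthday bound, using the standard approximation 1 - exp(-n(n-1)/2N) for n inputs into an output space of size N (function name is my own):

```python
import math

def collision_probability(n_inputs, output_bits):
    """Approximate probability that at least two of n_inputs distinct
    inputs collide in a 2**output_bits output space, assuming the hash
    places inputs uniformly at random (the birthday bound)."""
    space = 2.0 ** output_bits
    return 1.0 - math.exp(-n_inputs * (n_inputs - 1) / (2.0 * space))
```

For SHA-1's 160-bit output, a collision only becomes likely around 2**80 inputs, which is the usual "square root of the space" rule of thumb.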


> Well not necessarily true for small inputs. Hash a single byte for example. There is no collision in sha1 for that so you can build a 1:1 mapping of hashes back to input examples for that case.

I suppose you are saying that if you know the input size is sufficiently small, you don't have to worry about collisions, which is true. I was interpreting "for small inputs" to mean that if you give a small input to a hash function (which can take inputs much larger than the space of the output), you can still reconstruct the small input uniquely. Unless the hash function is deliberately designed to provide unique 1:1 mappings for small inputs, I would think it's not true that you can uniquely reconstruct small inputs, because they will likely map to the same value as a large input would (i.e. 'a' might hash to the same value as some 14821-byte string).
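A toy illustration of that point: with a deliberately truncated 16-bit slice of SHA-1 (truncation width and names are my own choices, just to make the search feasible), a one-byte input shares its hash with some much longer input:

```python
import hashlib
from itertools import count

def truncated_hash(data, bits=16):
    """First `bits` bits of SHA-1, as an int -- a toy stand-in for a
    full-width hash, small enough to brute-force."""
    digest = hashlib.sha1(data).digest()
    return int.from_bytes(digest[:4], "big") >> (32 - bits)

# Find a 100+ byte input whose truncated hash matches that of b"a".
target = truncated_hash(b"a")
for i in count():
    candidate = b"x" * 100 + str(i).encode()
    if truncated_hash(candidate) == target:
        break
```

The same argument applies to the full 160 bits; the only difference is that the search becomes computationally infeasible, not that the colliding long input stops existing.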


Yes this is possible. It's called a preimage attack:

https://en.wikipedia.org/wiki/Preimage_attack

It's computationally infeasible against most cryptographically secure hash functions. However, Grover's algorithm reduces the cost of a brute-force search to roughly the square root of the keyspace (effectively halving the security level measured in bits):

https://en.wikipedia.org/wiki/Grover's_algorithm
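The arithmetic of that reduction, for a 160-bit hash like SHA-1:

```python
# Grover's algorithm searches an unstructured space of size N in roughly
# sqrt(N) quantum queries, so a b-bit brute-force preimage search drops
# from ~2**b operations to ~2**(b/2): the security level in bits halves.
classical_effort = 2 ** 160   # classical brute-force preimage search
grover_effort = 2 ** 80       # ~sqrt of the classical search space
```

So a quantum attacker faces SHA-1 preimages at roughly the 80-bit level rather than 160-bit, which is still large, but much less comfortable.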


A successful preimage attack against a cryptographic hash would most likely find you a different message that gets you the same hash. You wouldn't do a preimage attack to find what the original input to create a hash was. You would do a preimage attack to find a new input with the same hash so that you could pass this new input around claiming it was the original (and have any signatures for the old input be valid for your new input, etc).


What you described is called a "second preimage attack", which is not the same thing as a (first) preimage attack.


In a second preimage attack, you start with an input and its hash, and you're tasked with producing a distinct input that comes out to the same hash.

In a first preimage attack, you start with only the hash, and have to come up with an input from scratch that comes out to that hash. Even in a first preimage attack, you're extremely unlikely to craft the same input that someone else used to create the hash you were given: by the pigeonhole principle, the number of inputs that map to any given shorter hash is astronomically large.
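Back-of-the-envelope arithmetic for that, taking SHA-1's 160-bit output and an arbitrary (my choice) 64-byte message length:

```python
# On average, how many 64-byte messages map to each 160-bit hash value?
n_messages = 256 ** 64       # every possible 64-byte string: 2**512
n_hashes = 2 ** 160          # every possible SHA-1 output
avg_preimages = n_messages // n_hashes
# 2**512 / 2**160 = 2**352 -- roughly 9e105 preimages per hash value,
# counting only messages of this one length.
```

So even a successful first preimage attack would almost certainly hand you one of the ~2**352 other inputs, not the original.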




