
> reversing commit hashes back into their contents

Somewhat off topic, but is this actually possible?

Given hashing is inherently lossy, I'm inclined to assume it's not possible for anything much longer than a password, but commits are text, which I suppose is low entropy per character, so I don't know.



It doesn't sound possible for all but the most trivial diffs. An attacker would have to guess not only the exact commit text, but also its timestamp. And the only attack scenario I can conceive of seems pretty silly. Maybe I'm missing something...

Alice clones an open source git repo, commits one secret change where she edits a config file's default password to her own secret password (a bad practice), and then publishes the new hash in public for some reason (build info?). Mallory would have to (a) know that exactly this happened, (b) guess the commit message, (c) guess the commit's timestamp to the second (or within a few seconds), and (d) preimage-attack her password.

And the preimage attack must pierce git's Merkle tree, which sounds downright impossible. (Unless Mallory is just bruteforcing, in which case a strong password is enough.)
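For reference, the hash covers much more than the diff. A rough Python sketch of how git derives a commit's SHA-1 (the field values here are hypothetical, but the object layout matches what `git cat-file commit` shows):

```python
import hashlib

def git_commit_hash(tree_sha, parent_sha, author, committer, timestamp, message):
    """Sketch of git's commit hashing: SHA-1 over a header plus the
    commit object body. The body includes the tree hash, parent hash,
    author/committer lines (with timestamps), and the message -- so an
    attacker must guess every one of these fields exactly."""
    body = (
        f"tree {tree_sha}\n"
        f"parent {parent_sha}\n"
        f"author {author} {timestamp} +0000\n"
        f"committer {committer} {timestamp} +0000\n"
        f"\n"
        f"{message}\n"
    ).encode()
    # Git prepends an object header: "commit <byte length>\0"
    return hashlib.sha1(b"commit %d\x00" % len(body) + body).hexdigest()
```

Shifting the timestamp by even one second produces a completely different hash, which is why (c) above is a real obstacle.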


Most commits are not just text, they are source code. If you had a feasible way to enumerate all the byte sequences that collapse into a given hash value, you might find the subset that is syntactically correct code to be very small. Except if it's Perl, of course.

And the likelihood of all the characters aligning in a way a compiler might find acceptable gets lower with every increase in the length of your collisions, so it would be extremely unlikely that the shortest nontrivial match (for both the hash and for code sanity) would not be the right one. The code constraint certainly would not make it easier to find the collision in the first place, but it would give you great confidence in the result if you did.


No, you're correct that for large inputs there are obviously going to be many that lead to the same hash (see pigeonhole principle).


This is presumably true also of small inputs, given that the large and small inputs are all mapping to the same space.


Well not necessarily true for small inputs. Hash a single byte for example. There is no collision in sha1 for that so you can build a 1:1 mapping of hashes back to input examples for that case.

But yeah, as the input size grows past the output size, the probability of a collision existing approaches 1 (and once there are more possible inputs than outputs, a collision is guaranteed). The birthday paradox formula gives the probability of a collision (assuming random placement in the output space) based on the number of inputs.
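A quick sketch of that birthday bound, using the standard approximation 1 - exp(-n(n-1)/2N) for n inputs into an output space of size N (function name is my own):

```python
import math

def collision_probability(n_inputs, output_bits):
    """Approximate probability that at least two of n_inputs distinct
    inputs collide in a 2**output_bits output space, assuming the hash
    places inputs uniformly at random (the birthday bound)."""
    space = 2.0 ** output_bits
    return 1.0 - math.exp(-n_inputs * (n_inputs - 1) / (2.0 * space))
```

For SHA-1's 160-bit output, a collision only becomes likely around 2**80 inputs, which is the usual "square root of the space" rule of thumb.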


> Well not necessarily true for small inputs. Hash a single byte for example. There is no collision in sha1 for that so you can build a 1:1 mapping of hashes back to input examples for that case.

I suppose you are saying that if you know the input size is sufficiently small, you don't have to worry about collisions, which is true. I was interpreting "for small inputs" to mean that if you give a small input to a hash function (which can take inputs much larger than the space of the output), you can still reconstruct the small input uniquely. Unless the hash function is deliberately designed to provide unique 1:1 mappings for small inputs, I would think it's not true that you can uniquely reconstruct small inputs, because they will likely map to the same value as a large input would (i.e. 'a' might hash to the same value as some 14821-byte string).
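A toy illustration of that point: with a deliberately truncated 16-bit slice of SHA-1 (truncation width and names are my own choices, just to make the search feasible), a one-byte input shares its hash with some much longer input:

```python
import hashlib
from itertools import count

def truncated_hash(data, bits=16):
    """First `bits` bits of SHA-1, as an int -- a toy stand-in for a
    full-width hash, small enough to brute-force."""
    digest = hashlib.sha1(data).digest()
    return int.from_bytes(digest[:4], "big") >> (32 - bits)

# Find a 100+ byte input whose truncated hash matches that of b"a".
target = truncated_hash(b"a")
for i in count():
    candidate = b"x" * 100 + str(i).encode()
    if truncated_hash(candidate) == target:
        break
```

The same argument applies to the full 160 bits; the only difference is that the search becomes computationally infeasible, not that the colliding long input stops existing.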


Yes this is possible. It's called a preimage attack:

https://en.wikipedia.org/wiki/Preimage_attack

It's computationally infeasible against most cryptographically secure hash functions. However, Grover's algorithm reduces the cost of a brute-force search to roughly the square root of the keyspace (effectively halving the security level measured in bits):

https://en.wikipedia.org/wiki/Grover's_algorithm
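The arithmetic of that reduction, for a 160-bit hash like SHA-1:

```python
# Grover's algorithm searches an unstructured space of size N in roughly
# sqrt(N) quantum queries, so a b-bit brute-force preimage search drops
# from ~2**b operations to ~2**(b/2): the security level in bits halves.
classical_effort = 2 ** 160   # classical brute-force preimage search
grover_effort = 2 ** 80       # ~sqrt of the classical search space
```

So a quantum attacker faces SHA-1 preimages at roughly the 80-bit level rather than 160-bit, which is still large, but much less comfortable.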


A successful preimage attack against a cryptographic hash would most likely find you a different message that gets you the same hash. You wouldn't do a preimage attack to find what the original input to create a hash was. You would do a preimage attack to find a new input with the same hash so that you could pass this new input around claiming it was the original (and have any signatures for the old input be valid for your new input, etc).


What you described is called a "second preimage attack", which is not the same thing as a (first) preimage attack.


In a second preimage attack, you start with an input and its hash, and you're tasked with producing a distinct input that comes out to the same hash.

In a first preimage attack, you start with only the hash, and have to come up with an input from scratch that comes out to that hash. Even in a first preimage attack, you're extremely unlikely to craft the same input that someone else used to create the hash you were given: by the pigeonhole principle, the number of inputs that map to any given shorter hash is astronomically large.
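Back-of-the-envelope arithmetic for that, taking SHA-1's 160-bit output and an arbitrary (my choice) 64-byte message length:

```python
# On average, how many 64-byte messages map to each 160-bit hash value?
n_messages = 256 ** 64       # every possible 64-byte string: 2**512
n_hashes = 2 ** 160          # every possible SHA-1 output
avg_preimages = n_messages // n_hashes
# 2**512 / 2**160 = 2**352 -- roughly 9e105 preimages per hash value,
# counting only messages of this one length.
```

So even a successful first preimage attack would almost certainly hand you one of the ~2**352 other inputs, not the original.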




