Contents
For a more detailed write-up, please check out this post.
Why do we need this type of tree
You might be saying: we already have binary trees, B-trees, and Red-Black trees. What is so special about this tree?
Suppose you keep files stored on a 3rd party system or on some server. Whenever you download one of these files, you’d like to be sure that it is exactly the same as the file you uploaded. Now you might be thinking: why not keep the file on your local computer and just do a comparison when you download it? The issue with this approach is that it defeats the purpose of storing the file elsewhere if you always have it stored locally as well…
Now you might be wondering: if we can’t store a copy of the file locally, why don’t we simply store a SHA hash of the file and then compare it against the SHA hash of the downloaded file? This is a far better approach. However, if we have millions and millions of files that can be updated, we will be forced to maintain millions and millions of hashes. Yikes, not fun.
What if I told you it’s possible to store just 1 hash? Yup, just one hash for verifying millions and millions of files. So let’s learn about Merkle Trees!
Let’s discuss hash functions
Have you heard of SHA-256, MD5, SHA-1, …? They are all hash functions. What do they do? Well, they take some input X and map it to an irreversible, uniformly distributed hash h, where the hash is simply a fixed-size string of seemingly random numbers and letters. For example:
H("Hello World") = B94D27B9934D3E08A52E52D7DA7DABFAC484EFE37A5380EE9088F7ACE2EFCD7A
Let’s create a tree from hashes
Suppose you had n files and you used the data in those files to generate n hashes.
_________ _________ _________ _________
| Hash 1 | | Hash 2 | | Hash 3 | ... | Hash n |
|_________| |_________| |_________| |_________|
Now suppose we joined adjacent hashes together into one string and then regenerated the hash. This would result in half as many hashes in the second layer. Then suppose we repeated this until we were left with one hash. As you can imagine, this results in a tree-like data structure:
_________ _________ _________ _________
| File 1 | | File 2 | | File 3 | ... | File n |
|_________| |_________| |_________| |_________|
| | | |
v v v v
_________ _________ _________ _________
| Hash 1 | | Hash 2 | | Hash 3 | ... | Hash n |
|_________| |_________| |_________| |_________|
| | | |
|_______________| |________________|
| |
v v
__________ __________
| Hash 1-2 | | Hash 3-4 | ...
|__________| |__________|
| |
|______________________________|
|
v
__________
| Hash 1-4 | ...
|__________|
|
|
v
......
|
|
v
_________
| Final |
| Hash |
|_________|
Congrats! That is an upside-down Merkle tree. The nice thing about having a Merkle tree is that, locally, you only need to store the final (root) hash and not the individual n hashes.
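The layer-by-layer pairing described above can be sketched in a few lines of Python. This is a minimal sketch, assuming SHA-256 as the hash function and duplicating the last hash whenever a layer has an odd number of entries (one common convention; real implementations vary):

```python
import hashlib

def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves: list[bytes]) -> bytes:
    """Compute the root hash by repeatedly joining adjacent hashes."""
    level = [sha256(leaf) for leaf in leaves]      # layer of n leaf hashes
    while len(level) > 1:
        if len(level) % 2 == 1:                    # odd layer: duplicate the last hash
            level.append(level[-1])
        level = [sha256(level[i] + level[i + 1])   # join adjacent pairs, rehash
                 for i in range(0, len(level), 2)]
    return level[0]                                # the single final hash

# Hypothetical file contents, standing in for the n files on the server.
files = [b"file 1 contents", b"file 2 contents",
         b"file 3 contents", b"file 4 contents"]
root = merkle_root(files)
print(root.hex())  # the one hash we store locally
```

For 4 files this produces exactly the `Hash 1-2` / `Hash 3-4` / `Final Hash` layers from the diagram.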
How can we prove that a file is unchanged
Now that you know what a Merkle tree is, let us discuss how we can be sure the file you retrieved from a web server remains unchanged.
Let’s assume we are running a file hosting service and we are asked to store n files. We could generate a Merkle tree and find the root hash. We then give this root hash to the verifier to store safely. The next time the verifier requests a file, we can respond with the following:
- The requested file
- $\log_2(n)$ hashes
This gives the verifier enough information to recompute the root hash and check that it equals the hash they stored locally when they originally stored the n files.
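Here is a sketch of what that verification could look like. The proof format is an assumption for illustration: each of the $\log_2(n)$ entries is a sibling hash plus a flag saying which side of the running hash it sits on:

```python
import hashlib

def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def verify(file_data: bytes, proof: list[tuple[bytes, str]], root: bytes) -> bool:
    """Recompute the root from a file and its sibling hashes.

    Each proof entry is (sibling_hash, side), where side is "left" or
    "right" depending on where the sibling sits relative to our hash.
    """
    h = sha256(file_data)
    for sibling, side in proof:
        h = sha256(sibling + h) if side == "left" else sha256(h + sibling)
    return h == root

# Build a tiny 4-file tree by hand (hypothetical file contents).
files = [b"f1", b"f2", b"f3", b"f4"]
h = [sha256(f) for f in files]
h12, h34 = sha256(h[0] + h[1]), sha256(h[2] + h[3])
root = sha256(h12 + h34)

# To prove file 3, the server sends just 2 = log2(4) hashes.
proof = [(h[3], "right"), (h12, "left")]
print(verify(b"f3", proof, root))           # True
print(verify(b"f3 tampered", proof, root))  # False
```

The verifier never needs the other files, only the two sibling hashes, which is why storing the single root hash is enough.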
Why can’t the server lie to us
You might be wondering: what if the server lies to us with the $\log_2(n)$ hashes?
Well, given the randomness of hash functions, it is virtually impossible to modify a requested file and then engineer $\log_2(n)$ hashes that trick the verifier. This goes back to the two properties of hash functions: irreversibility and uniform randomness. If you don’t believe me, try to do it!