Merkle Trees

Posted by Wahab Ahmad on Sunday, July 14, 2024

Contents

For a more detailed write-up, please check out this post.

Why do we this type of tree

You might be saying, we already have binary trees, B-trees, Red Black trees. What is so special about this tree?

Suppose you keep files stored on a 3rd party system or on some server. Whenever you download one of these files, you’d like to be sure that it is the exact same as the file you uploaded. Now you might be thinking, why not keep the file on you local computer and do just do a comparison when you download the file. The issue with this approach is that it defeats the purpose of having to store the file elsewhere if you always have it stored locally as well…

Now you might be wondering, well if we can’t store a copy of the file locally, why don’t we simple store a SHA hash of the file and then compare the SHA hash of the downloaded file. This is a far better approach, however, if we have millions and millons of files that can be updated. We will be foreced to maintain millions and millons of hashes. Yikes, not fun.

What if I tell you its possible to just store 1 hash. Yup, just one hash for verifying millions and millions of files. So lets learn about Merkle Tree!

Lets discuss hash functions

Have you heard of SHA-256, MD5, SHA1, … They are all hash functions. What do they do? Well, they take some input X and map it to an irreversible and uniformly distributed hash h. Where the hash is simply a fixed size of random numbers and letters. For example:

H("Hello World") = B94D27B9934D3E08A52E52D7DA7DABFAC484EFE37A5380EE9088F7ACE2EFCD7A

Lets create a tree from hashes

Suppose you had n files and you used the data in those files to generate n hashes.

   _________       _________       _________        _________
  |  Hash 1 |     |  Hash 2 |     |  Hash 3 |  ... |  Hash n |
  |_________|     |_________|     |_________|      |_________|

Now suppose we joined adjacent hashes together into one string and then regenerated the hash. This would result in half as many hashes in the second layer. Then suppose you did this until you were left with one hash. As you can image this would result in a tree like data structure:

   _________       _________       _________        _________
  |  File 1 |     |  File 2 |     |  File 3 |  ... |  File n |
  |_________|     |_________|     |_________|      |_________|
       |               |               |                |
       v               v               v                v
   _________       _________       _________        _________
  |  Hash 1 |     |  Hash 2 |     |  Hash 3 |  ... |  Hash n |
  |_________|     |_________|     |_________|      |_________|
       |               |               |                |
       |_______________|               |________________|
               |                              |
               v                              v
            __________                  __________
           | Hash 1-2 |                | Hash 3-4 |  ... 
           |__________|                |__________|       
               |                              |
               |______________________________|
                              |
                              v
                           __________
                          | Hash 1-4 |  ... 
                          |__________|
                              |
                              |
                              v
                            ......
                              |
                              |
                              v
                           _________
                          |  Final  |
                          |  Hash   |
                          |_________|

Congrats! That is a upside down merkle tree. The nice thing about having a merkle tree is that locally, you only need to store the final hash and not the individual n hashes.

How can we prove that a file is unchanged

Now that you know of a Merkle Tree, let us discuss how we can be sure the file you retrieved from a web server remains unchanged.

Lets assume we are running a file hosting service, and we are asked to store n files. We could generate a merkle tree and find the root hash. Then we can give this root hash to the verifier to store safely. Then the next time the verifier requests a file, we can respond with the following:

  1. The requested file
  2. $log_2(n)$ hashes

This will provide the verifier enough information to recompute the root hash and verify that the root hash is equal to the hash they stored locally when they originally stored the n files.

Why can’t the server lie to us

You might be wondering what if the server lies to us with the $log_2(n)$ hashes?

Well given the randomness of hash functions it is virtually impossible to modify a requested file and then design $log_2(n)$ hashes to trick the verifier. This goes back to the two properties of hash functions, irreversibility and uniformly random. If you don’t believe me, try to do it!