File Comparison in C# – Part 3

In this third and final article of my File Comparison in C# series, we will create a method that compares file hashes. I suggest you read File Comparison in C# – Part 1 and File Comparison in C# – Part 2 before continuing with this article, if you haven’t already done so.

What is a File Hash?

First of all, a hash function is a mathematical function that converts an arbitrarily large amount of data into a much smaller, fixed-size value. This value acts as a compact fingerprint of the original data, which makes it ideal for data comparisons or database lookups, for example.

So a file hash is the hash of a file’s contents. When you look at it, a hash is just a long numeric or alphanumeric value, usually written in hexadecimal, such as these two:
b7404b4dd5e4d1b67869226dcbc2da09
29-B4-1C-B3-54-F3-14-19-16-EE-0D-6A-F5-73-56-9F-DA-3F-D5-47
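
To see where values like these come from, here is a minimal sketch that hashes some bytes and formats the result in both of the styles shown above. It assumes the first example is an MD5 hash (32 hex characters) and the second a SHA-1 hash (20 bytes), which is what their lengths suggest. The class and method names are made up for illustration.

```csharp
using System;
using System.Security.Cryptography;

public static class HashFormatDemo
{
    // Lowercase hex with no separators, e.g. "b7404b4d...".
    public static string ToCompactHex(byte[] hash)
    {
        return BitConverter.ToString(hash).Replace("-", "").ToLowerInvariant();
    }

    // Uppercase hex pairs joined by dashes, which is BitConverter's default format.
    public static string ToDashedHex(byte[] hash)
    {
        return BitConverter.ToString(hash);
    }

    // MD5 produces 16 bytes, so the compact form is 32 characters long.
    public static string Md5Compact(byte[] data)
    {
        using (MD5 md5 = MD5.Create())
        {
            return ToCompactHex(md5.ComputeHash(data));
        }
    }

    // SHA-1 produces 20 bytes, so the dashed form has 20 two-character pairs.
    public static string Sha1Dashed(byte[] data)
    {
        using (SHA1 sha1 = SHA1.Create())
        {
            return ToDashedHex(sha1.ComputeHash(data));
        }
    }
}
```

For example, HashFormatDemo.Md5Compact(File.ReadAllBytes(path)) would give the compact form for a small file, where path is whatever file you want to inspect.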

File Comparison using Hashes

To create our file comparison method we are going to use the .NET HashAlgorithm class. To use this class we need to import the System.Security.Cryptography namespace with a using directive.

The code below is an example of a file hash comparison method:

private static bool CompareFileHashes(string fileName1, string fileName2)
{
    // Compare file sizes before continuing.
    // If the sizes are equal then compare the hashes.
    if (CompareFileSizes(fileName1, fileName2))
    {
        // Declare byte arrays to store our file hashes
        byte[] fileHash1;
        byte[] fileHash2;

        // Create an instance of System.Security.Cryptography.HashAlgorithm.
        // SHA-1 is what the parameterless HashAlgorithm.Create() returns on
        // .NET Framework; that overload throws PlatformNotSupportedException
        // on .NET Core and later, so the algorithm is named explicitly here.
        // Open a System.IO.FileStream for each file, with read-only access
        // so that read-only files can be compared too.
        // Note: With the 'using' keyword the hash object and the streams
        // are disposed automatically.
        using (HashAlgorithm hash = SHA1.Create())
        using (FileStream fileStream1 = new FileStream(fileName1, FileMode.Open, FileAccess.Read),
                          fileStream2 = new FileStream(fileName2, FileMode.Open, FileAccess.Read))
        {
            // Compute file hashes
            fileHash1 = hash.ComputeHash(fileStream1);
            fileHash2 = hash.ComputeHash(fileStream2);
        }

        return BitConverter.ToString(fileHash1) == BitConverter.ToString(fileHash2);
    }
    else
    {
        return false;
    }
}

This method accepts two parameters: the full file names of the files we want to compare. The first line of code calls the CompareFileSizes method we created in Part 1 of this series, and the method returns false straight away if the sizes differ. Next we create the HashAlgorithm instance that will generate our file hashes, and we declare two byte arrays to store them. We then open a FileStream for each of the files passed in as parameters, and finally we use the ComputeHash method of the HashAlgorithm class to compute the file hashes. The method returns true if the hashes match and false if they don’t.
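
If you want to try the idea without the code from the earlier parts, the following is a self-contained sketch of the same approach. It is a variation rather than the method above: it assumes SHA-256 as the algorithm, opens the files with File.OpenRead (read-only access), and compares the raw hash bytes with SequenceEqual instead of converting them to strings. The CompareFileSizes method here is only a stand-in for the Part 1 method, which is not reproduced in this article.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

public static class FileHashComparer
{
    // Stand-in for the Part 1 method (assumption): true when both files have the same length.
    public static bool CompareFileSizes(string fileName1, string fileName2)
    {
        return new FileInfo(fileName1).Length == new FileInfo(fileName2).Length;
    }

    public static bool CompareFileHashes(string fileName1, string fileName2)
    {
        // Files of different sizes cannot be equal, so skip the hashing.
        if (!CompareFileSizes(fileName1, fileName2))
        {
            return false;
        }

        // File.OpenRead requests read access only, so read-only files work too.
        using (HashAlgorithm hash = SHA256.Create())
        using (FileStream stream1 = File.OpenRead(fileName1))
        using (FileStream stream2 = File.OpenRead(fileName2))
        {
            // ComputeHash reads each stream to the end in chunks,
            // so the whole file is never held in memory at once.
            byte[] fileHash1 = hash.ComputeHash(stream1);
            byte[] fileHash2 = hash.ComputeHash(stream2);

            // Compare the hash bytes directly.
            return fileHash1.SequenceEqual(fileHash2);
        }
    }
}
```

Calling FileHashComparer.CompareFileHashes(path1, path2) then returns true only when both files have the same size and the same hash.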

How accurate is this method?

The chance of two different files producing identical hashes is extremely small. A tiny change in a file results in a large and unpredictable change in the generated hash. During my entire software development career, I have never encountered equal hashes for different values or files. That said, it is still technically possible for different files to produce equal hashes (known as a collision), but it is so rare that it is not something to worry about when comparing files with hashes.
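
You can see how sensitive a hash is to small changes with a quick experiment. The sketch below hashes two strings and counts how many bits differ between the two hashes. It assumes SHA-1, and the class and method names are invented for this example.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class HashSensitivityDemo
{
    // Computes the SHA-1 hash of a string's UTF-8 bytes.
    public static byte[] Sha1Of(string text)
    {
        using (SHA1 sha1 = SHA1.Create())
        {
            return sha1.ComputeHash(Encoding.UTF8.GetBytes(text));
        }
    }

    // Counts how many bits differ between two hashes of equal length.
    public static int CountDifferingBits(byte[] hash1, byte[] hash2)
    {
        if (hash1.Length != hash2.Length)
        {
            throw new ArgumentException("Hashes must have the same length.");
        }

        int count = 0;
        for (int i = 0; i < hash1.Length; i++)
        {
            // XOR leaves a 1 in every bit position where the two bytes differ.
            int diff = hash1[i] ^ hash2[i];
            while (diff != 0)
            {
                count += diff & 1;
                diff >>= 1;
            }
        }
        return count;
    }
}
```

Comparing Sha1Of("file contents") with Sha1Of("file content5") and passing both results to CountDifferingBits shows how much of the 160-bit hash changes when a single character changes.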

So which comparison technique is the best?

To find out, I timed the byte-by-byte comparison and the file hash comparison methods we created, using both small and large files. Below are the results:

[Image: File Comparisons timing results]

As you can see from the results, there is practically no difference between the comparison methods for small files. For two large files of around 700 MB each, however, the difference is clear: the byte-by-byte method took around 27 seconds to compare them, while the file hash method took around 18 seconds.
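
If you would like to run a similar test on your own files, one way to time a comparison method is with the Stopwatch class. This is only a sketch of how such a measurement could be taken, not the code used for the results above; the ComparisonTimer class and its Time method are names chosen for this example.

```csharp
using System;
using System.Diagnostics;

public static class ComparisonTimer
{
    // Runs the given comparison once and returns how long it took.
    // The result of the comparison itself is returned through 'areEqual'.
    public static TimeSpan Time(Func<string, string, bool> compare,
                                string fileName1, string fileName2,
                                out bool areEqual)
    {
        Stopwatch stopwatch = Stopwatch.StartNew();
        areEqual = compare(fileName1, fileName2);
        stopwatch.Stop();
        return stopwatch.Elapsed;
    }
}
```

Any method with the same signature as CompareFileHashes can be passed as the first argument. Bear in mind that the operating system may cache a file after it has been read once, so it is worth timing each method several times, and in different orders, before drawing conclusions.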

This comparison shows that the file hash method was quicker in my tests, and given how unlikely it is for different files to produce equal hashes, I would say that the accuracy of this method is basically the same as that of the byte-by-byte method.

Therefore, if I were creating a comparison method, I would go for the file hash comparison technique 🙂

I hope you found this article series interesting. Feel free to leave a comment, or contact me through my contact page if you prefer.

Happy comparing…
Dave

13 comments
  • Robert Glaab

    Awesome article, and very informative. The created and last modified date properties of the FileInfo class tend to be inaccurate, and file sizes can still be equal when a single ASCII character changes value, so hashing in my opinion is by far the best way to compare files for equality.

  • Thanks for your comment Robert. I absolutely agree that hashing is the best way to compare files.

  • Pankaj

    Great article for beginners. However, I have a doubt: if you have to count the number of differences between two files, what will you do?

    • Thanks for your comment Pankaj. Counting the differences between files can become quite complex if you want to do it properly, and as far as I know .NET does not have anything built in to do this.

      In my opinion, what you need is a Difference (or Diff) algorithm. Do a Google search for ‘diff algorithm’ and you should find plenty of examples and free classes you can use in your code. 🙂

  • Jay

    Are you able to take the differences in hashes and output the original data to another file? I am working with two large data files and I am trying to find the differences between them and output the lines that are different to another file.

  • Prabu

    Wow!!!!!! it works for me..:)

  • Rz

    Thanks a lot, Dave, for this three-part series. It really helped me a lot to come to a decision.
    I searched a lot before this, and I was unsure whether to go for byte comparison or hash.
    Have a nice time. Good luck.

    Rz

  • Kiran Ravi

    Very well explained. Thank you.

  • DG

    What if the file is read only?

  • Thanks a lot!!!
    I had the md5 test for the CRC of comparing files, but all the examples don’t work, the md5 comparison only match the name of the files, not the content…

    This works great!!!!

  • Mohamed Saleh

    The article is very informative, and I liked the idea behind it. The thing is, would that work with large files, or might it throw an exception while calculating the hash value for a file larger than 10GB? I’ve done a small program to manage that, but I had to put a limit on the file size, so I compare the first and last 10MB (more or less) of both files to check. It works fine, but later on I thought about multi-threading to check the whole files and it didn’t work out.
    Thanks for the article

  • Pratibha

    I want file hash comparison in web application.
