A couple of days ago I decided I wanted to go through my collection of music in order to get rid of duplicates. I would have assumed that this problem had been solved already, but when I googled “linux file duplicate finder” the results were not so great. The first result was a program called FSlint. It wasn’t in portage, and I didn’t feel like finding out what its dependencies were, so I skipped to the next result, which was someones duplicate file finder that they wrote in Java. It was amazingly slow.
Therefore, I decided, what the heck, I will just write my own in Erlang! It sounded simple enough, just get a complete file listing, sort the files by their file size, then go through and compare files that have the same size(since a different file size would mean a different file), if you find files with the same size, get the MD5 sum, then compare those.
It turned out to be slightly more complicated than that. First of all, the filelib:fold_files function gets stuck in an infinite loop if you have a symlink that points to a higher parent directory in which that symlink is contained. Wine actually does this when you set it up. So I had to write my own function to generate a file listing that just ignores symlinks for now.
Another thing I found out was the amazement of how many files I had that were the same size. Taking hashes of all these files is a pain, especially if they are large. So I decided I would instead just take the first 256 bytes and the last 256 bytes of the file and hash that. If the file was less than 512 bytes, I just hash the whole file. Using this method increased the performance by 400% and maintained the same accuracy, though is technically still a small possibility of a false positive, but it is good enough for what I need.
The funny thing was, after I was almost done, a friend on the #firstname.lastname@example.org channel referred me to a perl program called fdupe. I tried it out and found out it actually did what I wanted and used close to the same methods. So I decided I would benchmark it against my implementation.
For a collection of 35,042 files that totaled 168GB, my implementation was 23.77% faster
For a collection of 220,463 files that totaled 85GB, my implementation was 16.46% faster
It is good to see that Erlang can hold its own.