A couple of days ago I decided I wanted to go through my collection of music in order to get rid of duplicates. I would have assumed that this problem had been solved already, but when I googled “linux file duplicate finder” the results were not so great. The first result was a program called FSlint. It wasn’t in portage, and I didn’t feel like finding out what its dependencies were, so I skipped to the next result, which was someones duplicate file finder that they wrote in Java. It was amazingly slow.
Therefore, I decided, what the heck, I will just write my own in Erlang! It sounded simple enough, just get a complete file listing, sort the files by their file size, then go through and compare files that have the same size(since a different file size would mean a different file), if you find files with the same size, get the MD5 sum, then compare those.
It turned out to be slightly more complicated than that. First of all, the filelib:fold_files function gets stuck in an infinite loop if you have a symlink that points to a higher parent directory in which that symlink is contained. Wine actually does this when you set it up. So I had to write my own function to generate a file listing that just ignores symlinks for now.
Another thing I found out was the amazement of how many files I had that were the same size. Taking hashes of all these files is a pain, especially if they are large. So I decided I would instead just take the first 256 bytes and the last 256 bytes of the file and hash that. If the file was less than 512 bytes, I just hash the whole file. Using this method increased the performance by 400% and maintained the same accuracy, though is technically still a small possibility of a false positive, but it is good enough for what I need.
The funny thing was, after I was almost done, a friend on the #chiglug@irc.oftc.net channel referred me to a perl program called fdupe. I tried it out and found out it actually did what I wanted and used close to the same methods. So I decided I would benchmark it against my implementation.
For a collection of 35,042 files that totaled 168GB, my implementation was 23.77% faster
fdupe: 2m23s
edupe: 1m49s
For a collection of 220,463 files that totaled 85GB, my implementation was 16.46% faster
fdupe: 35m39s
edupe: 29m47s
It is good to see that Erlang can hold its own.
I have released the code under the name EDupe, and you can download the latest code here.



rad!
Comment by Erl — April 4, 2007 @ 3:17 am
Hey there, just wanted to drop you a comment to let you know that edupe seems to work great under Windows as well. Nice job!
FYI, it took 424 seconds to process 34,492 files with 3155 folders for a total of 25 GB of source material. I had other things I was doing while it was processing and this is a laptop (albeit with 4 GB of RAM), so the numbers should be taken with a large grain of salt.
-Vince
Comment by Vince P. — April 4, 2007 @ 1:43 pm
Vince, I appreciate testing it in Windows. I did not have the time to try it out yet. Good to hear that it works well!
Comment by diginux — April 4, 2007 @ 1:57 pm
For better accuracy, (assuming you didn’t already do this) why don’t you just take the ones that hash the same with the first and last 256 bytes, and do a full comparison on just those files?
Comment by Dan — April 4, 2007 @ 5:30 pm
I probably will in the next version if I update it. From my testing on my own files, it just wasn’t needed, and I was trying to keep the code as clean as possible and reduce the number of comparisons needed. I will definitely do it though if I update the code.
Comment by diginux — April 4, 2007 @ 8:27 pm
It looks like you are only looking at part of the files, which is cheating :) You actually do not need to calculate any md5 hashes in order to implement a duplicate file finding program efficiently. The Perl program you referenced seems to get things almost right, but it looks like it does too much reading. In your case, you could probably just store the 256+256 byte chunks as the “hash” itself, rather than the md5 hash of it.
I have a duplicate finder program implemented in python, but the general idea would be able to be implemented in Erlang as well. The method I use is to find all files of the same size, and then compare them all to each other 8K at a time. With this method, the program only reads as many bytes as it needs to to determine if files are different, or the same. It only ends up reading an entire file if it is actually a duplicate. This makes it extremely IO efficient, while also being 100% accurate.
http://bouncybouncy.net/dupes.py
Comment by Justin — April 13, 2007 @ 10:04 pm
I’ll take your ideas into consideration for a version 2. I know it is “cheating”, but in general I still got the same accuracy of results in less time. Also, it seemed quicker to take 512 bytes, hash it, then compare the 16 bit hashes, then it took to just compare the 512 bytes. I admit that doesn’t make much sense, but I didn’t have time to investigate further.
Thanks for your comments, I will have to check your implementation out!
Comment by diginux — April 17, 2007 @ 12:30 pm
Hey you didn’t provide a link to FSlint. boo hiss :)
http://www.pixelbeat.org/fslint/
You can download the tarball of fslint.
Also there is an ebuild file linked prominently on that page.
I’ve been very careful about performance in fslint,
and I’ve found nothing as robust or fast even though
it’s written in shell script.
cheers,
Pádraig.
Comment by Pádraig Brady — May 9, 2007 @ 6:29 am
Pádraig,
Actually not a bad tool now that I have tried it. I just saw it was a GUI, and no information was listed about the required libraries for it, so I just assumed it was a pain. You should make your introductory page of it a little more informative.
Thanks!
Jordan Wilberding
Comment by diginux — May 9, 2007 @ 7:10 am
comment - thepetite
Comment by thepetite — August 26, 2007 @ 12:44 pm
I can’t get the source. Save never finishes; times out repeatedly, on separate days.
Please advise.
Thanks for any help.
Jim
Comment by Jim Harris — August 5, 2008 @ 11:43 am
download slot fortune wheel of wheel fortune slot of download
Comment by free of slot fortune wheel — September 20, 2008 @ 12:02 pm