Motivation

Finding image duplicates can be useful for many purposes: spotting fake profiles in social networks (checking whether the profile pictures are copies from other profiles), finding content taken from other sources, etc.

We may not be able to rely on the images alone to determine that, but they give us a good hint.

We can also look for duplicates of our own images; however, if the images are exactly the same file (not resized or edited versions), we can do better by simply looking for file duplicates.

Finding file duplicates

Hashing every file on your disk does not look like a great idea, so, if you know the file extensions of your images, you should filter the list by those extensions (jpeg, jpg, png...).

Since files of different sizes cannot be identical, and reading the file size is much cheaper than hashing the content (it only needs the directory inode info), we can make a first selection by grouping files with the same size.
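Just as an illustration (the group_by_size helper and the extension list below are mine, not the code linked later in this post), this first pass could look something like this:

import os
from collections import defaultdict

# Extensions to consider; adjust to your own image files
IMAGE_EXTENSIONS = ('.jpg', '.jpeg', '.png')


def group_by_size(root):
    """Walk root and group image paths that share the same file size."""
    groups = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(IMAGE_EXTENSIONS):
                path = os.path.join(dirpath, name)
                groups[os.path.getsize(path)].append(path)
    # Only sizes shared by two or more files can contain duplicates
    return {size: paths for size, paths in groups.items() if len(paths) > 1}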

Once that is done, the most reliable approach is to hash the full contents of the files and compare those hashes to check whether they are the same.

Also, if we have very big files and slow disk access (perhaps a remote location), hashing the full content can be expensive, so we can repeat the process we did with sizes, but hashing only a block of data from the beginning of each file (a few KBs per file).

Files can share the same bytes at the beginning (image headers), so this pass is only useful for discarding non-duplicates early: files that still match should be fully hashed to be sure they really are the same file.

The last step is, with the groups we have, to split them by full-content hash and output the list of duplicates.
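Just to illustrate the idea (the hash_file and find_duplicates helpers, the MD5 choice and the 4 KB prefix size are my own, and size_groups is the dictionary produced by the group_by_size sketch above), those two hashing passes could look like this:

import hashlib
from collections import defaultdict

PREFIX_SIZE = 4 * 1024  # a few KBs are enough to discard most non-duplicates


def hash_file(path, limit=None, block_size=1024 * 1024):
    """MD5 of a file's content; if limit is set, hash only the first limit bytes."""
    digest = hashlib.md5()
    remaining = limit
    with open(path, 'rb') as f:
        while True:
            size = block_size if remaining is None else min(block_size, remaining)
            chunk = f.read(size)
            if not chunk:
                break
            digest.update(chunk)
            if remaining is not None:
                remaining -= len(chunk)
                if remaining <= 0:
                    break
    return digest.hexdigest()


def find_duplicates(size_groups):
    """Split each same-size group by prefix hash, then confirm with a full hash."""
    duplicates = []
    for paths in size_groups.values():
        by_prefix = defaultdict(list)
        for path in paths:
            by_prefix[hash_file(path, limit=PREFIX_SIZE)].append(path)
        for candidates in by_prefix.values():
            if len(candidates) < 2:
                continue
            by_full = defaultdict(list)
            for path in candidates:
                by_full[hash_file(path)].append(path)
            duplicates.extend(g for g in by_full.values() if len(g) > 1)
    return duplicates

The result is a list of groups of paths whose contents are byte-for-byte identical.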

Here is some code that is not optimized, but could serve as a basic implementation of this task:

Python 2 code to find exact file duplicates

Comparing images

Standing on others’ shoulders

A lot has been written about image duplicate finding. Actually, one of the first references I used was Skyscanner’s post about image deduplication, where they explain the process used to detect image duplicates and select the best one to show to the user, and the tool they built to check that everything was working as expected.

In that article you can find references to the libraries used for image hashing, like Python’s ImageHash, and some other good articles about image hashing, like how the perceptual hash works. Actually, I must point out that there is another good blog post on the same site about dHash compared with aHash and pHash.

Some samples

To show these image differences, I have selected a hotel image from different sites (the Cardenal Hotel at Monforte de Lemos):

| source | width (px) | height (px) | aspect ratio | file size (bytes) |
| --- | --- | --- | --- | --- |
| booking.com | 1024 | 683 | 1.5 | 67 232 |
| tripadvisor.com | 840 | 460 | 1.82 | 41 078 |
| lol.travel | 694 | 462 | 1.5 | 44 994 |
| roomdi | 903 | 500 | 1.86 | 31 589 |
| zenhotels.com | 1000 | 666 | 1.5 | 82 147 |

![booking image](/images/imagedups/booking_cardenal_sm.jpg) ![roomdi image](/images/imagedups/roomdi_cardenal_sm.jpg)

If you follow the links provided in the table you can download the full-size image from each site, and you will also see that not only the aspect ratio differs (some images are cropped), but also the image quality (e.g. the roomdi one is really, really bad).

Using Python’s ImageHash

So I decided to collect some pictures, download them, and store three of their hashes: pHash, dHash, and wHash. I left aHash out, as it does not seem to give as good results regarding false positives, and I prefer to err on the false-negative side.

As a previous step, I cropped all the images to a square; this way I get rid of the different aspect ratios by not using the “extra pixels” at the borders.
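As a sketch of that preprocessing (the center_crop_square helper is my own; only the phash, dhash and whash calls come from the ImageHash library), computing the three hashes for an image could look like this:

from PIL import Image
import imagehash


def center_crop_square(img):
    """Crop the largest centered square out of a PIL image."""
    side = min(img.size)
    left = (img.width - side) // 2
    top = (img.height - side) // 2
    return img.crop((left, top, left + side, top + side))


def compute_hashes(path):
    """Return the three hex-encoded hashes kept for each image."""
    img = center_crop_square(Image.open(path))
    return {
        'phash': str(imagehash.phash(img)),
        'dhash': str(imagehash.dhash(img)),
        'whash': str(imagehash.whash(img)),
    }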

I used a sqlite3 database and the Peewee ORM, where I stored around 33 different hotels, with an average of 15 images each.

Too slooooow

So, I started with the naive approach of checking each hotel against every other one, to find whether they had image duplicates. Yeah, that's an N^2 algorithm and should be optimized. However, something smelled really bad for N = 33.

It took 27.729 seconds to process all of them!!

So I decided to profile it with python -m cProfile -s tottime dupfinder.py, only to find out that most of the time was spent on image hash comparison:

         9817878 function calls (9788041 primitive calls) in 27.729 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  2458848   11.742    0.000   11.742    0.000 __init__.py:110(<listcomp>)
   307356    7.129    0.000   21.626    0.000 __init__.py:97(hex_to_hash)
   307394    2.226    0.000    2.226    0.000 {built-in method numpy.core.multiarray.array}
    51226    0.693    0.000   23.188    0.000 imghasher.py:82(calc_hash_distances)
   153678    0.615    0.000    0.855    0.000 __init__.py:75(__sub__)
    21882    0.408    0.000    0.703    0.000 _strptime.py:321(_strptime)
  2549715    0.323    0.000    0.323    0.000 {method 'append' of 'list' objects}
     7273    0.265    0.000    1.851    0.000 peewee.py:2428(construct_instances)
      682    0.252    0.000    0.252    0.000 {method 'read' of '_io.FileIO' objects}

This is the offending code inside calc_hash_distances:

    h_a = hex_to_hash(hexhash_a[hk])
    h_b = hex_to_hash(hexhash_b[hk])
    hash_distances[hk] = (h_a - h_b) / 64.0

This was being executed for each stored hash (remember that I had 3 per image, and I ran an N^2 loop to compare them).

Looking at the original source code for hex_to_hash, I was actually creating two ImageHash objects from the hex strings only to count the number of bits that differ between the two hashes.

What happens if I code it directly? I replaced that snippet with this:


_LOOKUP_BIT_COUNT = {
    0: 0,   1: 1,   2: 1,   3: 2,
    4: 1,   5: 2,   6: 2,   7: 3,
    8: 1,   9: 2,  10: 2,  11: 3,
   12: 2,  13: 3,  14: 3,  15: 4,
}


def calc_hash_distances(hexhash_a, hexhash_b):
    # hexhash_a / hexhash_b map each hash kind (phash, dhash, whash) to its hex string
    hash_distances = {}
    for hk in hexhash_a:
        dist = 0
        for cha, chb in zip(hexhash_a[hk], hexhash_b[hk]):
            # XOR the two hex digits as nibbles and count the differing bits
            dist += _LOOKUP_BIT_COUNT[int(cha, 16) ^ int(chb, 16)]
        hash_distances[hk] = dist / 64.0
    return hash_distances

and ran the profiler again:

         2940613 function calls (2911799 primitive calls) in 3.679 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    21896    0.384    0.000    0.685    0.000 _strptime.py:321(_strptime)
     7281    0.271    0.000    1.879    0.000 peewee.py:2428(construct_instances)
   316495    0.207    0.000    0.453    0.000 {built-in method builtins.setattr}
   216598    0.177    0.000    0.225    0.000 peewee.py:866(__set__)
     7339    0.114    0.000    0.114    0.000 {method 'fetchone' of 'sqlite3.Cursor' objects}
      681    0.112    0.000    0.112    0.000 {built-in method marshal.loads}
   179648    0.096    0.000    0.187    0.000 peewee.py:960(python_value)
    27686    0.076    0.000    0.076    0.000 {method 'match' of '_sre.SRE_Pattern' objects}
    14581    0.074    0.000    0.101    0.000 peewee.py:4734(__init__)

Nice! 3.679 seconds, a 7.5x speed-up. It seems like good advice to compare the hashes directly instead of going through the library.

Further improvements

Obviously I could use something better than comparing every hotel against every other one, and there are good articles about that, like this one explaining BK-Trees for finding near-matches to a string, which I found through this also good blog post about Duplicate image detection with perceptual hashing in Python.
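Just for illustration, and not taken from those articles, a minimal BK-tree over the Hamming distance between hex-encoded hashes could be sketched like this:

def hamming(a, b):
    """Hamming distance between two hex-encoded hashes of the same length."""
    return bin(int(a, 16) ^ int(b, 16)).count('1')


class BKTree:
    """Metric tree that finds stored items within a given distance of a query."""

    def __init__(self, distance=hamming):
        self.distance = distance
        self.root = None  # each node is a tuple: (item, {edge_distance: child_node})

    def add(self, item):
        if self.root is None:
            self.root = (item, {})
            return
        node = self.root
        while True:
            d = self.distance(item, node[0])
            child = node[1].get(d)
            if child is None:
                node[1][d] = (item, {})
                return
            node = child

    def search(self, item, tolerance):
        """Return (distance, stored_item) pairs within tolerance of item."""
        results = []
        stack = [self.root] if self.root else []
        while stack:
            node = stack.pop()
            d = self.distance(item, node[0])
            if d <= tolerance:
                results.append((d, node[0]))
            # The triangle inequality lets us skip children outside [d - tol, d + tol]
            stack.extend(child for dist, child in node[1].items()
                         if d - tolerance <= dist <= d + tolerance)
        return results

With something like this, each hash only has to be compared against a small set of candidates instead of every other stored hash.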

In the hotels’ case I could also filter by hotel name, so N is reduced to a small number.

Also, it would be great to have a tool to check that everything is ok :)

References