Cleaning duplicate photos in large libraries (I turned a script into a safe tool)
At some point a photo archive fills with near-identical copies, and it becomes difficult to know what can safely be removed. A while ago I wrote a small Python script to detect duplicate photos. It worked surprisingly well and I shared it on Reddit. Once people started using it on real photo libraries, the requirements quickly grew.
Some examples that came up: libraries with hundreds of thousands of images, HEIC/JPEG variants created by phones, the need to review duplicates visually before deleting anything, and a general desire for something safer than automatic deletion. So the project evolved from a simple script into a full tool.
The workflow became deliberately conservative and safety-first: dry run by default (nothing is deleted automatically); duplicates can be moved to a quarantine folder instead of deleted, or optionally sent to the Windows Recycle Bin; an HTML report with thumbnails lets you visually inspect each duplicate group; and a CSV log shows exactly what the tool decided. The goal is not aggressive cleanup but controlled reduction of redundancy while keeping the best version of each image.
One interesting challenge was handling phone photos where the same picture exists as both HEIC and JPEG, or slightly edited variants that are visually identical but not byte-identical. The tool groups those into clusters so you can review them before deciding what to keep.
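A common way to group visually identical but not byte-identical images is to compute a perceptual hash per file and cluster hashes that are within a small Hamming distance of each other. I don't know the tool's exact algorithm, so treat this as a sketch of that general approach; `cluster_hashes` and its greedy strategy are assumptions for illustration:

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bits between two perceptual hashes."""
    return bin(a ^ b).count("1")

def cluster_hashes(hashes, max_distance=5):
    """Greedy clustering: each image joins the first cluster whose
    representative hash is within max_distance bits, else starts a
    new cluster. hashes is a list of (filename, int_hash) pairs.
    """
    clusters = []  # list of (representative_hash, [filenames])
    for name, h in hashes:
        for rep, members in clusters:
            if hamming(rep, h) <= max_distance:
                members.append(name)
                break
        else:
            clusters.append((h, [name]))
    return [members for _, members in clusters]
```

A HEIC and a JPEG export of the same shot typically hash within a few bits of each other, so they land in the same cluster for review, while genuinely different photos stay apart.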
If anyone is interested in the engineering behind it, I wrote a deeper breakdown here: from-a-finding-duplicates-script-to-the-deduptool-engineering-a-safe-deterministic-photo-deduplication-tool-for-windows