
Add Concurrent File Deduplication Script Using Content Hashing #528

@AndreaPallotta

Description

Is your feature request related to a problem? Please describe.

Over time, systems accumulate duplicate files across different directories (e.g., backups, downloads, synced folders, datasets). These files may have different names or locations but are identical in content, leading to wasted disk space and unnecessary data management overhead.

Currently, this repository contains many useful automation scripts, but there is no tool that safely identifies and manages duplicate files based on file content, rather than filename or size alone. Relying only on filenames or file size can be inaccurate, as files with the same size (e.g., CSVs or binaries) may have completely different content.


Describe the solution you'd like.

I’d like to add a File Deduplication Python script that accurately detects duplicate files by hashing their contents in a memory-efficient and safe manner.

Key features (rough sketches follow the list):

  • Scan one or more directories for duplicate files
  • Use file size as a fast pre-filter, followed by full content-based hashing (SHA-256)
  • Read files in chunks to support very large files without high memory usage
  • Optional concurrency (threading or multiprocessing) to improve performance
  • Provide safe actions:
    -- Report duplicates
    -- Move duplicates to a separate folder
    -- Optionally delete duplicates with explicit confirmation
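
A minimal sketch of the detection pass, to make the proposal concrete. The find_duplicates(roots) entry point, the 1 MiB chunk size, and the thread-pool size are illustrative placeholders rather than a final interface:

```python
import hashlib
import sys
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

CHUNK_SIZE = 1 << 20  # read files 1 MiB at a time so memory stays flat


def sha256_of_file(path: Path) -> str:
    """Hash a file's contents in fixed-size chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(CHUNK_SIZE), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(roots, max_workers=4):
    """Return {sha256: [paths]} for files with identical content."""
    # Pass 1: group files by size -- a cheap pre-filter that avoids
    # hashing any file whose size is unique.
    by_size = defaultdict(list)
    for root in roots:
        for path in Path(root).rglob("*"):
            if path.is_file():
                by_size[path.stat().st_size].append(path)

    # Pass 2: hash only size-colliding files, in a thread pool
    # (reads are I/O-bound, and hashlib releases the GIL on large buffers).
    candidates = [p for group in by_size.values() if len(group) > 1 for p in group]
    by_hash = defaultdict(list)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for path, digest in zip(candidates, pool.map(sha256_of_file, candidates)):
            by_hash[digest].append(path)

    return {h: paths for h, paths in by_hash.items() if len(paths) > 1}


if __name__ == "__main__":
    for digest, paths in find_duplicates(sys.argv[1:] or ["."]).items():
        print(digest)
        for p in paths:
            print(f"  {p}")
```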

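The safe actions could then sit on top of the {hash: [paths]} mapping returned above. This is a rough sketch: the handle_duplicates name, the quarantine-folder default, and the per-file confirmation prompt are placeholders, and it does not resolve basename collisions inside the quarantine folder:

```python
import shutil
from pathlib import Path


def handle_duplicates(duplicates, action="report", quarantine_dir="duplicates_quarantine"):
    """Keep the first file in each duplicate group and act on the rest.

    action: "report" (default), "move" (to quarantine_dir),
    or "delete" (requires an explicit per-file confirmation).
    """
    for digest, paths in duplicates.items():
        keep, extras = paths[0], paths[1:]
        print(f"{digest}: keeping {keep}")
        for extra in extras:
            if action == "report":
                print(f"  duplicate: {extra}")
            elif action == "move":
                target = Path(quarantine_dir) / extra.name
                target.parent.mkdir(parents=True, exist_ok=True)
                shutil.move(str(extra), str(target))
                print(f"  moved {extra} -> {target}")
            elif action == "delete":
                # Deletion stays opt-in: nothing is removed without a "y" answer.
                if input(f"  delete {extra}? [y/N] ").strip().lower() == "y":
                    extra.unlink()
                    print(f"  deleted {extra}")
```
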
Let me know your thoughts and whether you think this would be valuable to have; I can then work on a PR.
