
Add Concurrent File Deduplication Script Using Content Hashing #528

@AndreaPallotta

Description

Is your feature request related to a problem? Please describe.

Over time, systems accumulate duplicate files across different directories (e.g., backups, downloads, synced folders, datasets). These files may have different names or locations but are identical in content, leading to wasted disk space and unnecessary data management overhead.

Currently, this repository contains many useful automation scripts, but there is no tool that safely identifies and manages duplicate files based on file content, rather than filename or size alone. Relying only on filenames or file size can be inaccurate, as files with the same size (e.g., CSVs or binaries) may have completely different content.


Describe the solution you'd like.

I’d like to add a File Deduplication Python script that accurately detects duplicate files by hashing their contents in a memory-efficient and safe manner.

Key features (rough sketches follow the list):

  • Scan one or more directories for duplicate files
  • Use file size as a fast pre-filter, followed by full content-based hashing (SHA-256)
  • Read files in chunks to support very large files without high memory usage
  • Optional concurrency (threading or multiprocessing) to improve performance
  • Provide safe actions:
    -- Report duplicates
    -- Move duplicates to a separate folder
    -- Optionally delete duplicates with explicit confirmation
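
A minimal sketch of the detection pass, to make the proposal concrete. The find_duplicates(roots) entry point, the 1 MiB chunk size, and the thread-pool size are illustrative placeholders rather than a final interface:

```python
import hashlib
import sys
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

CHUNK_SIZE = 1 << 20  # read files 1 MiB at a time so memory stays flat


def sha256_of_file(path: Path) -> str:
    """Hash a file's contents in fixed-size chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(CHUNK_SIZE), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(roots, max_workers=4):
    """Return {sha256: [paths]} for files with identical content."""
    # Pass 1: group files by size -- a cheap pre-filter that avoids
    # hashing any file whose size is unique.
    by_size = defaultdict(list)
    for root in roots:
        for path in Path(root).rglob("*"):
            if path.is_file():
                by_size[path.stat().st_size].append(path)

    # Pass 2: hash only size-colliding files, in a thread pool
    # (reads are I/O-bound, and hashlib releases the GIL on large buffers).
    candidates = [p for group in by_size.values() if len(group) > 1 for p in group]
    by_hash = defaultdict(list)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for path, digest in zip(candidates, pool.map(sha256_of_file, candidates)):
            by_hash[digest].append(path)

    return {h: paths for h, paths in by_hash.items() if len(paths) > 1}


if __name__ == "__main__":
    for digest, paths in find_duplicates(sys.argv[1:] or ["."]).items():
        print(digest)
        for p in paths:
            print(f"  {p}")
```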

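The safe actions could then sit on top of the {hash: [paths]} mapping returned above. This is a rough sketch: the handle_duplicates name, the quarantine-folder default, and the per-file confirmation prompt are placeholders, and it does not resolve basename collisions inside the quarantine folder:

```python
import shutil
from pathlib import Path


def handle_duplicates(duplicates, action="report", quarantine_dir="duplicates_quarantine"):
    """Keep the first file in each duplicate group and act on the rest.

    action: "report" (default), "move" (to quarantine_dir),
    or "delete" (requires an explicit per-file confirmation).
    """
    for digest, paths in duplicates.items():
        keep, extras = paths[0], paths[1:]
        print(f"{digest}: keeping {keep}")
        for extra in extras:
            if action == "report":
                print(f"  duplicate: {extra}")
            elif action == "move":
                target = Path(quarantine_dir) / extra.name
                target.parent.mkdir(parents=True, exist_ok=True)
                shutil.move(str(extra), str(target))
                print(f"  moved {extra} -> {target}")
            elif action == "delete":
                # Deletion stays opt-in: nothing is removed without a "y" answer.
                if input(f"  delete {extra}? [y/N] ").strip().lower() == "y":
                    extra.unlink()
                    print(f"  deleted {extra}")
```
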
Let me know your thoughts and whether you think this would be valuable to have; I can then work on a PR.
