Deduplication: Finding 735 GB of Wasted Space
You just scanned 3.5 million files. You indexed 1.47 TB of data across decades of backups. You hashed every file, extracted EXIF data from photos, and categorized everything by type.
Now comes the revelation: you’re storing more than half of it twice.
Here’s what deduplication analysis revealed:
- 3,576,948 total files indexed
- 1,185,610 unique content hashes
- 2,391,338 files are redundant copies (the same content exists somewhere else in the archive)
- 735 GB wasted on duplicate storage
- Actual unique data: 732 GB (not 1.47 TB!)
That’s a 50% reduction in storage needed. Half of everything you’ve been backing up is copies of copies.
Why Duplicates Accumulate
You didn’t create duplicates intentionally. They happened naturally over years of backing up files:
Scenario 1: The incremental backup
You backed up your MacBook in 2015. Then in 2016, you backed it up again to a new drive. Both backups contain your entire photo library from 2010-2015. That’s two copies of every photo from those years.
Scenario 2: The reorganization
You organized your photos in 2018, moving them into folders by year. But you kept the original unorganized backup “just in case.” Now you have the same photos in Photos-Organized and Photos-Original-Backup.
Scenario 3: The cloud sync
You synced your Documents folder to Dropbox. Then you backed up your entire machine, including the Dropbox cache. Now you have every document three times: original, Dropbox folder, and Dropbox cache.
Scenario 4: The “final” versions
That wedding video exists as wedding.mov, wedding-final.mov, wedding-final-compressed.mov, and wedding-for-upload.mp4. Only two of these are actually unique. The other two are byte-for-byte identical copies with different names.
None of this was intentional. It’s just what happens when you make backups over years without a system to track what’s already been saved.
Content Hashing: How Deduplication Works
To find duplicates, you need a way to determine if two files contain identical data, even if they have different names or live in different directories.
File size isn’t enough. Two different photos can easily have exactly the same byte count.
File name isn’t enough. IMG_1234.jpg and vacation-beach.jpg might be the same photo, just renamed.
File modification time isn’t enough. Copying a file usually updates its timestamp, so identical content can carry different dates.
What you need is a content hash: a fingerprint of the file’s actual data. Two files with the same content hash contain identical bytes, guaranteed.
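To make the idea concrete, here is a tiny Ruby sketch using the standard library’s SHA-256, purely for illustration (the scanner itself uses xxHash64, shown in the next section): two copies of the same bytes under different names produce the same fingerprint.

require "digest"
require "fileutils"

# Illustration only: two files with different names but identical bytes.
File.write("IMG_1234.jpg", "same bytes")
FileUtils.cp("IMG_1234.jpg", "vacation-beach.jpg")

Digest::SHA256.file("IMG_1234.jpg").hexdigest ==
  Digest::SHA256.file("vacation-beach.jpg").hexdigest
# => true: same content, same fingerprint, regardless of name or location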
Hash Functions: Speed vs Security Tradeoff
There are several hash algorithms you could use:
| Algorithm | Output Size | Speed | Collision Risk | Use Case |
|---|---|---|---|---|
| xxHash64 | 64 bits | 10+ GB/s | Very low (non-adversarial) | Personal archives, caches |
| MD5 | 128 bits | ~500 MB/s | Known collisions | Legacy systems (avoid) |
| SHA-1 | 160 bits | ~400 MB/s | Broken (2017) | Git still uses it |
| SHA-256 | 256 bits | ~200 MB/s | None known | Cryptographic signatures |
| BLAKE3 | 256 bits | ~3 GB/s | None known | Modern cryptography |
For deduplication in a personal archive, xxHash64 is the right choice.
Here’s why:
Speed matters when you’re hashing terabytes. For 1.47 TB of data, a hash that runs at 200 MB/s needs roughly two hours of pure hashing time; at 10 GB/s the hashing itself takes a few minutes, and the disk rather than the hash becomes the bottleneck.
Cryptographic security isn’t necessary. You’re not defending against an attacker trying to forge files. You’re just identifying duplicates in your own archive. xxHash64’s collision resistance is more than adequate for this use case.
64 bits provides enough entropy. With 64-bit hashes, collisions only become likely once you approach billions of files. At roughly 1.2 million unique files, the birthday-bound probability of any two files colliding is about 1 in 26 million.
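A quick back-of-the-envelope check of that figure, using the standard birthday approximation and the unique-file count from this archive:

# Birthday approximation: P(any collision) ≈ n(n-1) / (2 * 2^64)
n = 1_185_610
p_collision = n.to_f * (n - 1) / (2 * 2**64)
# => ~3.8e-08, i.e. roughly 1 in 26 million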
Implementing Content Hashing
The scanner computes xxHash64 for every file by streaming the file through the hash function:
// hashFile streams the file through xxHash64 (e.g. github.com/cespare/xxhash/v2)
// and returns the digest as a zero-padded, 16-character hex string.
func hashFile(path string) (string, error) {
    f, err := os.Open(path)
    if err != nil {
        return "", err
    }
    defer f.Close()

    // io.Copy feeds the file to the hasher in small chunks, never loading it whole.
    h := xxhash.New()
    if _, err := io.Copy(h, f); err != nil {
        return "", err
    }
    return fmt.Sprintf("%016x", h.Sum64()), nil
}
Key implementation details:
- Streaming, not loading: io.Copy processes the file in 32 KB chunks, so a 4 GB video file never loads fully into memory.
- 16-character hex string: The 64-bit hash is formatted as a zero-padded hexadecimal string for consistent storage.
- Error handling: If a file can’t be read (permissions, corruption, disk failure), the hash is skipped and the error is logged.
The result gets stored in the files table:
CREATE TABLE files (
id SERIAL PRIMARY KEY,
path TEXT NOT NULL,
content_hash VARCHAR(16), -- xxHash64 as hex
size_bytes BIGINT,
category VARCHAR(50),
-- ... other fields
);
CREATE INDEX index_files_on_content_hash ON files (content_hash);
Now you can find duplicates with a single SQL query.
Finding Duplicates with SQL
Once every file has a content hash, finding duplicates is straightforward.
Step 1: Group by content hash
SELECT
content_hash,
COUNT(*) as occurrence_count,
SUM(size_bytes) as total_bytes
FROM files
WHERE content_hash IS NOT NULL
GROUP BY content_hash
HAVING COUNT(*) > 1
ORDER BY total_bytes DESC;
What this query does:
- Groups files by their content hash
- Counts how many files share each hash
- Sums the total bytes consumed by all copies
- Filters to only hashes with multiple occurrences
- Orders by total bytes (shows the most wasteful duplicates first)
Example output:
| content_hash | occurrence_count | total_bytes |
|---|---|---|
| a3f5b2c9d1e4 | 4 | 3,440,234,496 (3.4 GB) |
| 7c8e1f2a3b4d | 3 | 1,207,959,552 (1.2 GB) |
| 9d4c3e2f1a5b | 7 | 896,532,480 (896 MB) |
That first row? That’s an 860 MB video file that exists in four places. You’re wasting 2.6 GB storing three extra copies.
Step 2: Find all occurrences of a specific duplicate
SELECT path, size_bytes, modified_at
FROM files
WHERE content_hash = 'a3f5b2c9d1e4'
ORDER BY path;
Output:
/Volumes/Backup2015/Videos/wedding.mov 860 MB 2015-08-14
/Volumes/Backup2018/Important/wedding-final.mov 860 MB 2018-03-22
/Users/avi/Dropbox/Archive/wedding.mov 860 MB 2019-11-05
/Volumes/ExternalDrive/ToSort/wedding-backup.mov 860 MB 2020-01-10
Four copies, different paths, different modification times, identical content.
Step 3: Calculate total wasted space
SELECT
COUNT(DISTINCT content_hash) as unique_files,
SUM(size_bytes) as unique_bytes,
SUM(size_bytes * (occurrence_count - 1)) as wasted_bytes
FROM (
SELECT
content_hash,
size_bytes,
COUNT(*) as occurrence_count
FROM files
WHERE content_hash IS NOT NULL
GROUP BY content_hash, size_bytes
) grouped;
What this calculates:
- unique_files: How many distinct pieces of content exist
- unique_bytes: How much space the data would take with no duplicates
- wasted_bytes: How much space is consumed by duplicate copies
Result:
unique_files: 1,185,610
unique_bytes: 732 GB
wasted_bytes: 735 GB
You need 732 GB to store everything once. You’re using 1.47 TB. The other 735 GB is duplicates.
The UniqueFile Model
To make deduplication practical, the application layer creates a UniqueFile model that represents each unique piece of content and tracks all its occurrences.
Here’s the schema:
create_table :unique_files do |t|
t.string :content_hash, null: false # xxHash64 (unique index)
t.bigint :size_bytes, null: false # Size of the content
t.string :category # photo, video, audio, etc.
t.string :extension # .jpg, .mp4, etc.
# Duplication tracking
t.integer :occurrence_count, default: 1 # How many copies exist
t.bigint :canonical_file_entry_id # Reference to "best" copy
t.string :canonical_path # Path to keep
# Backup tracking
t.boolean :needs_backup, default: true
t.string :backup_status # pending, uploaded, skipped, error
t.string :backup_url # Cloud storage location
t.datetime :backed_up_at
t.text :backup_error
t.timestamps
end
add_index :unique_files, :content_hash, unique: true
add_index :unique_files, :category
add_index :unique_files, :backup_status
Key fields:
- content_hash: The xxHash64 fingerprint (unique across the table)
- occurrence_count: How many files in the archive have this hash
- canonical_file_entry_id: Foreign key to the “best” copy in the files table
- canonical_path: Path to the canonical copy (denormalized for performance)
- backup_status: Tracks whether this unique content has been backed up
The model:
class UniqueFile < ApplicationRecord
belongs_to :canonical_entry,
class_name: "Scanner::FileEntry",
foreign_key: :canonical_file_entry_id,
optional: true
validates :content_hash, presence: true, uniqueness: true
validates :size_bytes, presence: true
scope :photos, -> { where(category: "photo") }
scope :videos, -> { where(category: "video") }
scope :audio, -> { where(category: "audio") }
scope :documents, -> { where(category: "document") }
scope :duplicated, -> { where("occurrence_count > 1") }
scope :unique_only, -> { where(occurrence_count: 1) }
def all_occurrences
Scanner::FileEntry.where(content_hash: content_hash)
end
def wasted_space
size_bytes * (occurrence_count - 1)
end
def human_size
ActiveSupport::NumberHelper.number_to_human_size(size_bytes)
end
def human_wasted
ActiveSupport::NumberHelper.number_to_human_size(wasted_space)
end
def canonical_exists?
canonical_path && File.exist?(canonical_path)
end
end
Useful queries:
# All duplicated photos
UniqueFile.photos.duplicated
# Total wasted space on video duplicates
UniqueFile.videos.duplicated.sum(&:wasted_space)
# => 524,288,000 (524 MB)
# Find all copies of a specific file
unique_file = UniqueFile.find_by(content_hash: 'a3f5b2c9d1e4')
unique_file.all_occurrences
# => [#<Scanner::FileEntry path="/path/1">, #<Scanner::FileEntry path="/path/2">, ...]
# Photos that appear more than 5 times
UniqueFile.photos.where("occurrence_count > 5").order(occurrence_count: :desc)
Populating the UniqueFile Table
The UniqueFilePopulator reads all unique content hashes from the files table and creates UniqueFile records in batches:
class UniqueFilePopulator
def run
unique_hashes = Scanner::FileEntry
.where.not(content_hash: nil)
.distinct
.pluck(:content_hash)
unique_hashes.each_slice(5000) do |hash_batch|
files_by_hash = Scanner::FileEntry
.where(content_hash: hash_batch)
.group_by(&:content_hash)
files_by_hash.each do |hash, entries|
canonical = select_canonical(entries)
UniqueFile.create_or_find_by!(content_hash: hash) do |uf|
uf.size_bytes = canonical.size_bytes
uf.category = canonical.category
uf.extension = canonical.extension
uf.occurrence_count = entries.length
uf.canonical_file_entry_id = canonical.id
uf.canonical_path = canonical.path
end
end
end
end
end
Process:
- Find all unique content hashes
- For each hash, load all files with that hash
- Select the “canonical” copy (see next section)
- Create a UniqueFile record with aggregated data
Running it:
bin/rails populate:unique_files
Output:
Populating unique files...
Found 1,185,610 unique content hashes
Processed 5000/1185610 (0.4%)
Processed 10000/1185610 (0.8%)
...
=== Summary ===
Total unique files: 1,185,610
Duplicated: 628,492
Wasted space: 735 GB
This takes about 20 minutes on a reasonably fast machine.
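The bin/rails populate:unique_files command is just a thin rake task around the populator. A minimal sketch of how it might be wired up (the task file and wrapping are assumptions; UniqueFilePopulator is the class shown above):

# lib/tasks/populate.rake
namespace :populate do
  desc "Build UniqueFile records from scanned file entries"
  task unique_files: :environment do
    UniqueFilePopulator.new.run
  end
end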
Choosing the Canonical Copy
When you have multiple copies of the same file, you need to decide which one to keep as the “canonical” version. This is the copy you’ll back up to the cloud, and potentially the only one you’ll keep long-term.
The selection algorithm:
def select_canonical(entries)
entries.min_by { |e| [e.path.count("/"), e.id] }
end
What this does:
- Prefer shorter paths: A file at /Photos/2015/beach.jpg wins over /Backup/OldMacBook/Users/avi/Desktop/Unsorted/Photos/2015/beach.jpg
- Use ID as tiebreaker: If paths have the same depth, pick the entry with the lowest ID (first indexed)
Why prefer shorter paths:
Shorter paths tend to indicate better organization. Files buried deep in backup folders or temporary directories are usually not the canonical location.
Examples:
| Path | Depth | Winner? |
|---|---|---|
| /Photos/wedding.mov | 2 | YES |
| /Backup2015/Users/avi/Desktop/Temp/wedding.mov | 6 | No |

| Path | Depth | Winner? |
|---|---|---|
| /Videos/2018/vacation.mp4 | 3 | YES (lower ID) |
| /Archive/Videos/vacation.mp4 | 3 | No (higher ID) |
Limitations of this approach:
This algorithm is a heuristic. It doesn’t know that /Backup2015/Photos-Organized is better than /Backup2018/Photos-Random-Dump even if they have the same depth.
Alternative strategies:
- Prefer specific root directories: If you have well-organized locations, prioritize them explicitly:
PREFERRED_ROOTS = [
'/Photos',
'/Videos',
'/Documents/Personal',
'/Archive/Organized'
]
def select_canonical(entries)
# First try preferred roots
preferred = entries.find { |e| PREFERRED_ROOTS.any? { |root| e.path.start_with?(root) } }
return preferred if preferred
# Fall back to shortest path
entries.min_by { |e| [e.path.count("/"), e.id] }
end
- Avoid temp and cache directories:
BAD_PATTERNS = ['temp', 'cache', '.Trash', 'Downloads', 'Desktop']
def select_canonical(entries)
# Filter out paths with bad patterns
good_entries = entries.reject { |e| BAD_PATTERNS.any? { |p| e.path.include?(p) } }
good_entries = entries if good_entries.empty? # Fallback if all are bad
good_entries.min_by { |e| [e.path.count("/"), e.id] }
end
- User review for important files: Flag high-value duplicates (photos, videos) for manual review before auto-selecting:
def select_canonical(entries)
if entries.first.category == 'photo' && entries.length > 3
# Let user choose via web UI
nil
else
entries.min_by { |e| [e.path.count("/"), e.id] }
end
end
Strategies for Handling Duplicates
Once you’ve identified duplicates, what do you actually do with them? There are four main approaches, each with tradeoffs.
Strategy 1: Keep One, Delete the Rest (Aggressive)
What it means:
Keep only the canonical copy. Delete all other occurrences.
Pros:
- Maximum space savings
- Clean, simple result
- Easy to back up (just one copy of everything)
Cons:
- Destructive (can’t undo easily)
- Risky if canonical selection is wrong
- Loses context (duplicate might be in a meaningful location)
When to use:
After you’ve verified your backups are solid and you’re confident in canonical selection. Good for obviously redundant copies like .DS_Store files or multiple identical backups of the same drive.
Implementation:
unique_file = UniqueFile.find_by(content_hash: 'abc123')
# Get all non-canonical copies
duplicates = unique_file.all_occurrences.where.not(id: unique_file.canonical_file_entry_id)
duplicates.each do |file_entry|
if File.exist?(file_entry.path)
File.delete(file_entry.path)
puts "Deleted: #{file_entry.path}"
end
end
Strategy 2: Keep All, Link Together (Conservative)
What it means:
Don’t delete anything. Just track that files are duplicates so you can make informed decisions later.
Pros:
- Zero risk of data loss
- Preserves all context and organization
- Can analyze patterns before acting
Cons:
- No space savings
- Still paying for duplicate storage
- Still backing up duplicates
When to use:
During initial analysis. This is the default approach when you first run deduplication. Get visibility into what’s duplicated before you start making destructive changes.
Implementation:
This is what UniqueFile does by default. Just populate the table and browse the results without deleting anything.
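A small sketch of the kind of read-only analysis this strategy enables, using the model above to list the worst offenders by wasted space (nothing is deleted or modified):

# Top 10 duplicated files by wasted space
UniqueFile.duplicated
          .order(Arel.sql("size_bytes * (occurrence_count - 1) DESC"))
          .limit(10)
          .each do |uf|
  puts "#{uf.human_wasted} wasted | #{uf.occurrence_count} copies | #{uf.canonical_path}"
end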
Strategy 3: Backup Unique Only (Practical)
What it means:
Keep all local copies (don’t delete anything), but only back up the canonical copy to the cloud.
Pros:
- Saves cloud storage costs (50% reduction!)
- Keeps all local copies for safety
- Avoids risky deletion decisions
Cons:
- Still using local disk space for duplicates
- If canonical copy is lost, need to re-designate
When to use:
This is the ideal middle ground for most people. You get the financial benefit of deduplication (cloud storage is expensive) without the risk of deleting local files.
Implementation:
# Backup only unique content
UniqueFile.where(backup_status: 'pending').find_each do |uf|
if uf.canonical_exists?
upload_to_s3(uf.canonical_path)
uf.update!(
backup_status: 'uploaded',
backed_up_at: Time.current,
backup_url: "s3://bucket/#{uf.content_hash}"
)
end
end
# Skip duplicates automatically
duplicates = Scanner::FileEntry
.where(content_hash: UniqueFile.where(backup_status: 'uploaded').pluck(:content_hash))
.where.not(id: UniqueFile.pluck(:canonical_file_entry_id))
puts "Skipping #{duplicates.count} duplicate files (already backed up)"
Cost savings:
If cloud storage costs $5/month per 100 GB:
- Without deduplication: 1.47 TB = $73.50/month
- With deduplication: 732 GB = $36.60/month
- Savings: $36.90/month ($442.80/year)
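The same arithmetic, pulled from the database instead of round numbers (the $0.05 per GB-month price is the assumption above):

# Monthly cost comparison at an assumed $0.05 per GB-month
price_per_gb = 0.05
total_gb  = Scanner::FileEntry.sum(:size_bytes) / 1e9
unique_gb = UniqueFile.sum(:size_bytes)         / 1e9

puts format("Without dedup: $%.2f/month", total_gb * price_per_gb)
puts format("With dedup:    $%.2f/month", unique_gb * price_per_gb)
puts format("Savings:       $%.2f/year",  (total_gb - unique_gb) * price_per_gb * 12)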
Strategy 4: Analyze Before Deciding (Adaptive)
What it means:
Use different strategies for different types of files.
Examples:
- Photos and videos: Backup unique only (high value, large files)
- System files: Delete all duplicates (no value, safe to remove)
- Documents: Keep all, review manually (may have versions with different edits)
- Code projects: Keep all (might be different branches with same compiled output)
Pros:
- Maximizes value while minimizing risk
- Tailored to actual file importance
- Can be automated with rules
Cons:
- More complex to implement
- Requires careful rule design
- Easy to make mistakes in rule logic
When to use:
Once you understand your archive’s composition. Start with conservative approach, analyze the results, then build rules based on what you learn.
Implementation:
class DuplicationStrategy
def self.handle(unique_file)
case unique_file.category
when 'photo', 'video'
BackupUniqueOnly.new(unique_file).execute
when 'system'
DeleteDuplicates.new(unique_file).execute
when 'document'
FlagForReview.new(unique_file).execute
when 'code'
KeepAll.new(unique_file).execute
else
KeepAll.new(unique_file).execute # Safe default
end
end
end
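Wiring it up could be as simple as iterating over duplicated content, assuming the four strategy classes exist as sketched:

# Apply the per-category strategy to every piece of duplicated content
UniqueFile.duplicated.find_each do |unique_file|
  DuplicationStrategy.handle(unique_file)
end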
Watch Out: Deleting Duplicates Safely
Deleting duplicates is risky. If you delete the wrong file or if your canonical selection algorithm has bugs, you could lose data permanently. Follow these rules:
1. Always verify backups first
Before you delete anything, confirm that:
- Your backup system is working
- You can restore files from backup
- Your backup includes the canonical copies
Test it:
# Check that canonical files actually exist
missing = UniqueFile.where.not(canonical_path: nil).select { |uf| !uf.canonical_exists? }
if missing.any?
puts "ERROR: #{missing.count} canonical files are missing!"
puts missing.first(10).map(&:canonical_path)
exit 1
end
2. Use a dry run first
Don’t actually delete files on the first pass. Log what would be deleted and review it:
def delete_duplicates(unique_file, dry_run: true)
duplicates = unique_file.all_occurrences.where.not(id: unique_file.canonical_file_entry_id)
duplicates.each do |file_entry|
if dry_run
puts "WOULD DELETE: #{file_entry.path} (#{file_entry.human_size})"
else
File.delete(file_entry.path) if File.exist?(file_entry.path)
puts "DELETED: #{file_entry.path}"
end
end
end
# First run (safe)
delete_duplicates(unique_file, dry_run: true)
# Review output, then run for real
delete_duplicates(unique_file, dry_run: false)
3. Move to trash instead of deleting
Instead of File.delete, move duplicates to a trash folder. If something goes wrong, you can restore them:
TRASH_DIR = '/Volumes/ExternalDrive/DeduplicationTrash'
def trash_duplicates(unique_file)
FileUtils.mkdir_p(TRASH_DIR)
duplicates = unique_file.all_occurrences.where.not(id: unique_file.canonical_file_entry_id)
duplicates.each do |file_entry|
if File.exist?(file_entry.path)
# Preserve directory structure in trash
relative_path = file_entry.path.delete_prefix('/')
trash_path = File.join(TRASH_DIR, relative_path)
FileUtils.mkdir_p(File.dirname(trash_path))
FileUtils.mv(file_entry.path, trash_path)
puts "Trashed: #{file_entry.path} -> #{trash_path}"
end
end
end
After a month of confirming nothing is needed, delete the trash folder.
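Because the trash mirrors the original directory structure, restoring is mechanical. A sketch of the inverse operation, assuming nothing new has been written to the original locations in the meantime:

# Restore every trashed file to its original path (inverse of trash_duplicates)
def restore_from_trash
  Dir.glob(File.join(TRASH_DIR, "**", "*")).each do |trash_path|
    next unless File.file?(trash_path)

    original_path = "/" + trash_path.delete_prefix(TRASH_DIR + "/")
    FileUtils.mkdir_p(File.dirname(original_path))
    FileUtils.mv(trash_path, original_path) unless File.exist?(original_path)
  end
end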
4. Delete in phases by category
Start with low-risk files and work up to high-value files:
# Phase 1: System and cache files (safe to delete)
UniqueFile.where(category: 'system').duplicated.find_each do |uf|
delete_duplicates(uf, dry_run: false)
end
# Phase 2: Code and config (medium risk)
UniqueFile.where(category: ['code', 'config']).duplicated.find_each do |uf|
delete_duplicates(uf, dry_run: false)
end
# Phase 3: Documents (higher risk, review first)
UniqueFile.documents.duplicated.find_each do |uf|
puts "Review: #{uf.canonical_path} (#{uf.occurrence_count} copies)"
# Manual review before deleting
end
# Phase 4: Photos and videos (NEVER auto-delete)
# Use backup unique only strategy instead
5. Keep a deletion log
Record every file you delete with timestamp and reason:
def log_deletion(file_entry, reason)
File.open('deletion_log.txt', 'a') do |f|
f.puts "#{Time.current} | #{reason} | #{file_entry.path} | #{file_entry.content_hash} | #{file_entry.size_bytes}"
end
end
def delete_with_logging(file_entry, reason)
log_deletion(file_entry, reason)
File.delete(file_entry.path) if File.exist?(file_entry.path)
end
If you need to restore something, you have a complete audit trail.
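Since the log is a plain pipe-delimited text file, auditing it later is a one-liner. For example, to find every deletion recorded for a given content hash (the hash value here is just the placeholder from the earlier example):

# All logged deletions for one content hash
target_hash = "a3f5b2c9d1e4"
File.foreach("deletion_log.txt").select { |line| line.include?(target_hash) }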
Edge Cases: Same Hash, Different Files
Hash collisions are theoretically possible. Two different files could produce the same xxHash64 value. With 64 bits of output and roughly 1.2 million unique files, the birthday-bound probability of any collision is about 1 in 26 million: tiny, but not zero.
How to detect collisions:
If you’re paranoid (or dealing with critical data), verify that files with the same hash are actually identical by comparing them byte-for-byte:
def verify_no_collision(unique_file)
occurrences = unique_file.all_occurrences.limit(2).to_a
return true if occurrences.length < 2
# Compare first two files byte-for-byte
file1_content = File.read(occurrences[0].path, mode: 'rb')
file2_content = File.read(occurrences[1].path, mode: 'rb')
if file1_content != file2_content
puts "COLLISION DETECTED!"
puts "Hash: #{unique_file.content_hash}"
puts "File 1: #{occurrences[0].path}"
puts "File 2: #{occurrences[1].path}"
return false
end
true
end
# Check all duplicates for collisions
UniqueFile.duplicated.find_each do |uf|
verify_no_collision(uf)
end
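Reading whole multi-gigabyte files into memory just to compare them is wasteful. Ruby’s FileUtils.compare_file performs the same byte-for-byte check in a streaming fashion, so a lighter variant could look like this:

require "fileutils"

# Streaming byte-for-byte comparison of the first two occurrences
def verify_no_collision_streaming(unique_file)
  a, b = unique_file.all_occurrences.limit(2).to_a
  return true if b.nil?

  FileUtils.compare_file(a.path, b.path) # true only if contents are identical
end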
In practice:
I’ve never seen a collision in a personal archive. xxHash64 isn’t built to resist an adversary deliberately crafting colliding files, but for ordinary personal files (photos, videos, documents) collisions are effectively impossible.
If you do find a collision:
- Switch to a stronger hash (SHA-256 or BLAKE3)
- Re-hash your entire archive with the new algorithm
- Update the content_hash column (widen it first; a SHA-256 or BLAKE3 hex digest is 64 characters, not 16) and re-run deduplication
This is extremely unlikely to be necessary.
Real Results: What 50% Deduplication Looks Like
Here’s what the deduplication analysis revealed in my 3.5 million file archive:
Overall statistics:
- Total files: 3,576,948
- Unique content: 1,185,610 (33% are unique)
- Duplicated content: 2,391,338 (67% are duplicates)
- Unique data: 732 GB
- Wasted space: 735 GB (50% of total)
By category:
| Category | Unique | Duplicates | Wasted Space | Notes |
|---|---|---|---|---|
| Photos | 418,234 | 1,247,892 | 387 GB | 3x duplication on average |
| Videos | 4,421 | 8,842 | 215 GB | Large files, few copies |
| Documents | 98,234 | 187,456 | 45 GB | PDFs duplicated across backups |
| Audio | 39,872 | 79,744 | 52 GB | Music libraries backed up multiple times |
| Code | 245,678 | 491,356 | 18 GB | Mostly node_modules and build artifacts |
| Other | 379,171 | 376,048 | 18 GB | Config, data, archives |
Biggest offenders:
- Photo library duplicated 3x: Same 418k photos in /Photos, /Backup2015/Photos, and /iCloudDrive/Photos. Cost: 387 GB.
- Wedding video in 4 places: Same 2.3 GB video file in four different backup locations. Cost: 6.9 GB.
- iPhone camera roll synced twice: Photos from iPhone appear both in ~/Pictures/Photos Library and /Backups/iPhone. Cost: 124 GB.
- Document backups: Work documents backed up annually from 2015-2023. Each backup is complete, so files from 2015 exist in 8 places. Cost: 45 GB.
- Node modules everywhere: Every code project has its own copy of React, Lodash, etc. Cost: 12 GB.
What I did:
- Backed up unique content only to Backblaze B2 (saved 735 GB of cloud storage)
- Kept all local copies (too risky to delete photos without more review)
- Deleted duplicate node_modules directories (reconstructible from package.json)
- Deleted system files and caches (safe, no value)
- Manually reviewed duplicates for important documents (kept multiple versions where edits might differ)
Result:
- Local storage: Still 1.47 TB (kept all copies for safety)
- Cloud storage: 732 GB (50% reduction, saves $37/month)
- Duplicates: Documented and tracked, can delete later if space becomes critical
What This Enables
Deduplication isn’t just about saving space. It changes how you think about your archive:
1. Backup strategy becomes simpler
Instead of “back up everything and hope for the best,” you can back up unique content only. You know exactly what needs to be protected and what’s redundant.
2. You can find the best version
When you have four copies of wedding.mov, which one do you keep? Deduplication identifies them all so you can pick the one in the best location or delete the ones in temporary folders.
3. You understand your archive’s composition
“50% of my storage is duplicates” is actionable information. You can focus cleanup efforts on the biggest offenders (photos) rather than wasting time on small files.
4. You can make trade-offs intentionally
Want to delete duplicates to free up space? Now you can see exactly what that gains you. Want to keep everything? Now you know what that costs you. Either way, you’re making informed decisions rather than guessing.
What’s Next
You’ve scanned 3.5 million files. You’ve categorized them by type. You’ve found 735 GB of duplicates and identified canonical copies.
Now you need a way to browse, search, and explore this data. You need to find specific files, triage what to keep, and plan your backup strategy.
Next: Cloud Backup Strategy - Protecting your unique files with offsite backup.