Deduplication: Finding 735 GB of Wasted Space

You just scanned 3.5 million files. You indexed 1.47 TB of data across decades of backups. You hashed every file, extracted EXIF data from photos, and categorized everything by type.

Now comes the revelation: you’re storing more than half of it twice.

Here’s what deduplication analysis revealed:

  • 3,576,948 total files indexed
  • 1,185,610 unique content hashes
  • 1,002,684 files are duplicates (multiple copies exist)
  • 735 GB wasted on duplicate storage
  • Actual unique data: 732 GB (not 1.47 TB!)

That’s a 50% reduction in storage needed. Half of everything you’ve been backing up is copies of copies.

Why Duplicates Accumulate

You didn’t create duplicates intentionally. They happened naturally over years of backing up files:

Scenario 1: The incremental backup

You backed up your MacBook in 2015. Then in 2016, you backed it up again to a new drive. Both backups contain your entire photo library from 2010-2015. That’s two copies of every photo from those years.

Scenario 2: The reorganization

You organized your photos in 2018, moving them into folders by year. But you kept the original unorganized backup “just in case.” Now you have the same photos in Photos-Organized and Photos-Original-Backup.

Scenario 3: The cloud sync

You synced your Documents folder to Dropbox. Then you backed up your entire machine, including the Dropbox cache. Now you have every document three times: original, Dropbox folder, and Dropbox cache.

Scenario 4: The “final” versions

That wedding video exists as wedding.mov, wedding-final.mov, wedding-final-compressed.mov, and wedding-for-upload.mp4. Only two of these are actually unique. The other two are byte-for-byte identical copies with different names.

None of this was intentional. It’s just what happens when you make backups over years without a system to track what’s already been saved.

Content Hashing: How Deduplication Works

To find duplicates, you need a way to determine if two files contain identical data, even if they have different names or live in different directories.

File size isn’t enough. Two different photos can both be exactly 2,048 KB without sharing a single pixel.

File name isn’t enough. IMG_1234.jpg and vacation-beach.jpg might be the same photo, just renamed.

File modification time isn’t enough. Copying a file typically updates its modification time, so identical files end up with different timestamps.

What you need is a content hash: a fingerprint of the file’s actual data. Two files with the same content hash contain identical bytes, guaranteed.
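The idea can be demonstrated in a few lines of Ruby. This sketch uses SHA-256 from the standard library’s Digest module purely for illustration — the scanner itself uses xxHash64, which requires a third-party gem:

```ruby
require 'digest'
require 'tempfile'

# Two files with different names and paths but identical bytes
a = Tempfile.new('IMG_1234')
a.write('identical photo bytes')
a.close

b = Tempfile.new('vacation-beach')
b.write('identical photo bytes')
b.close

hash_a = Digest::SHA256.file(a.path).hexdigest
hash_b = Digest::SHA256.file(b.path).hexdigest

puts hash_a == hash_b  # prints true: same content, same fingerprint
```

Rename or move either file and the hashes still match — the fingerprint depends only on the bytes.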

Hash Functions: Speed vs Security Tradeoff

There are several hash algorithms you could use:

Algorithm   Output Size   Speed       Collision Risk               Use Case
xxHash64    64 bits       10+ GB/s    Very low (non-adversarial)   Personal archives, caches
MD5         128 bits      ~500 MB/s   Known collisions             Legacy systems (avoid)
SHA-1       160 bits      ~400 MB/s   Broken (2017)                Git still uses it
SHA-256     256 bits      ~200 MB/s   None known                   Cryptographic signatures
BLAKE3      256 bits      ~3 GB/s     None known                   Modern cryptography

For deduplication in a personal archive, xxHash64 is the right choice.

Here’s why:

Speed matters when you’re hashing terabytes. At 1.47 TB, hashing at 200 MB/s takes roughly two hours; at 10 GB/s it takes under three minutes.
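The arithmetic is easy to check (throughput figures taken from the table above):

```ruby
# Rough hashing-time estimates for 1.47 TB at each throughput
total_bytes = 1.47e12

slow_seconds = total_bytes / 200e6   # SHA-256 at ~200 MB/s
fast_seconds = total_bytes / 10e9    # xxHash64 at ~10 GB/s

puts "SHA-256:  #{(slow_seconds / 3600).round(1)} hours"    # ~2 hours
puts "xxHash64: #{(fast_seconds / 60).round(1)} minutes"    # ~2.5 minutes
```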

Cryptographic security isn’t necessary. You’re not defending against an attacker trying to forge files. You’re just identifying duplicates in your own archive. xxHash64’s collision resistance is more than adequate for this use case.

64 bits provides enough entropy. With 64 bits, the probability of a collision stays negligible until you approach billions of files. At 1.2 million unique files, the birthday-bound probability of any collision is roughly 1 in 26 million.
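The standard birthday-bound approximation gives a quick way to sanity-check collision odds for any file count:

```ruby
# Birthday bound: P(any collision) ≈ n(n-1) / (2 * 2^64) for n random 64-bit hashes
n = 1_185_610.0
space = 2.0**64

p_collision = n * (n - 1) / (2 * space)
puts "~1 in #{(1 / p_collision).round(-6)}"  # about 1 in 26 million
```

This is an approximation, not an exact figure, but it is accurate to within a few percent at these scales.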

Implementing Content Hashing

The scanner computes xxHash64 for every file by streaming the file through the hash function:

import (
    "fmt"
    "io"
    "os"

    "github.com/cespare/xxhash/v2"
)

func hashFile(path string) (string, error) {
    f, err := os.Open(path)
    if err != nil {
        return "", err
    }
    defer f.Close()

    h := xxhash.New()
    if _, err := io.Copy(h, f); err != nil {
        return "", err
    }

    return fmt.Sprintf("%016x", h.Sum64()), nil
}

Key implementation details:

  • Streaming, not loading: io.Copy processes the file in 32KB chunks. A 4GB video file never loads fully into memory.
  • 16-character hex string: The 64-bit hash is formatted as a zero-padded hexadecimal string for consistent storage.
  • Error handling: If a file can’t be read (permissions, corruption, disk failure), the hash is skipped and the error is logged.

The result gets stored in the files table:

CREATE TABLE files (
    id SERIAL PRIMARY KEY,
    path TEXT NOT NULL,
    content_hash VARCHAR(16),  -- xxHash64 as hex
    size_bytes BIGINT,
    category VARCHAR(50),
    -- ... other fields
);

CREATE INDEX index_files_on_content_hash ON files (content_hash);

Now you can find duplicates with a single SQL query.

Finding Duplicates with SQL

Once every file has a content hash, finding duplicates is straightforward.
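Conceptually, the grouping works like this plain-Ruby sketch over an in-memory list (the Entry struct and sample data are made up for illustration):

```ruby
# Toy model of the files table: group by hash, keep groups with more than one entry
Entry = Struct.new(:path, :content_hash, :size_bytes)

files = [
  Entry.new('/Backup2015/wedding.mov', 'a3f5', 860),
  Entry.new('/Dropbox/wedding.mov',    'a3f5', 860),
  Entry.new('/Photos/solo.jpg',        '9d4c', 2)
]

duplicates = files.group_by(&:content_hash).select { |_, group| group.length > 1 }
wasted = duplicates.values.sum { |group| group.first.size_bytes * (group.length - 1) }

puts wasted  # prints 860: one extra copy of the 860-byte entry
```

The SQL versions below do the same thing, but at database scale with an index instead of an in-memory hash map.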

Step 1: Group by content hash

SELECT
    content_hash,
    COUNT(*) as occurrence_count,
    SUM(size_bytes) as total_bytes
FROM files
WHERE content_hash IS NOT NULL
GROUP BY content_hash
HAVING COUNT(*) > 1
ORDER BY total_bytes DESC;

What this query does:

  • Groups files by their content hash
  • Counts how many files share each hash
  • Sums the total bytes consumed by all copies
  • Filters to only hashes with multiple occurrences
  • Orders by total bytes (shows the most wasteful duplicates first)

Example output:

content_hash   occurrence_count   total_bytes
a3f5b2c9d1e4   4                  3,440,234,496 (3.4 GB)
7c8e1f2a3b4d   3                  1,207,959,552 (1.2 GB)
9d4c3e2f1a5b   7                    896,532,480 (896 MB)

That first row? That’s an 860 MB video file that exists in four places. You’re wasting 2.6 GB storing three extra copies.

Step 2: Find all occurrences of a specific duplicate

SELECT path, size_bytes, modified_at
FROM files
WHERE content_hash = 'a3f5b2c9d1e4'
ORDER BY path;

Output:

/Volumes/Backup2015/Videos/wedding.mov                   860 MB  2015-08-14
/Volumes/Backup2018/Important/wedding-final.mov          860 MB  2018-03-22
/Users/avi/Dropbox/Archive/wedding.mov                   860 MB  2019-11-05
/Volumes/ExternalDrive/ToSort/wedding-backup.mov         860 MB  2020-01-10

Four copies, different paths, different modification times, identical content.

Step 3: Calculate total wasted space

SELECT
    COUNT(DISTINCT content_hash) as unique_files,
    SUM(size_bytes) as unique_bytes,
    SUM(size_bytes * (occurrence_count - 1)) as wasted_bytes
FROM (
    SELECT
        content_hash,
        size_bytes,
        COUNT(*) as occurrence_count
    FROM files
    WHERE content_hash IS NOT NULL
    GROUP BY content_hash, size_bytes
) grouped;

What this calculates:

  • unique_files: How many distinct pieces of content exist
  • unique_bytes: How much space the data would take with no duplicates
  • wasted_bytes: How much space is consumed by duplicate copies

Result:

unique_files:   1,185,610
unique_bytes:   732 GB
wasted_bytes:   735 GB

You need 732 GB to store everything once. You’re using 1.47 TB. The other 735 GB is duplicates.

The UniqueFile Model

To make deduplication practical, the application layer creates a UniqueFile model that represents each unique piece of content and tracks all its occurrences.

Here’s the schema:

create_table :unique_files do |t|
  t.string :content_hash, null: false      # xxHash64 (unique index)
  t.bigint :size_bytes, null: false        # Size of the content
  t.string :category                       # photo, video, audio, etc.
  t.string :extension                      # .jpg, .mp4, etc.

  # Duplication tracking
  t.integer :occurrence_count, default: 1  # How many copies exist
  t.bigint :canonical_file_entry_id        # Reference to "best" copy
  t.string :canonical_path                 # Path to keep

  # Backup tracking
  t.boolean :needs_backup, default: true
  t.string :backup_status                  # pending, uploaded, skipped, error
  t.string :backup_url                     # Cloud storage location
  t.datetime :backed_up_at
  t.text :backup_error

  t.timestamps
end

add_index :unique_files, :content_hash, unique: true
add_index :unique_files, :category
add_index :unique_files, :backup_status

Key fields:

  • content_hash: The xxHash64 fingerprint (unique across the table)
  • occurrence_count: How many files in the archive have this hash
  • canonical_file_entry_id: Foreign key to the “best” copy in the files table
  • canonical_path: Path to the canonical copy (denormalized for performance)
  • backup_status: Tracks whether this unique content has been backed up

The model:

class UniqueFile < ApplicationRecord
  belongs_to :canonical_entry,
             class_name: "Scanner::FileEntry",
             foreign_key: :canonical_file_entry_id,
             optional: true

  validates :content_hash, presence: true, uniqueness: true
  validates :size_bytes, presence: true

  scope :photos, -> { where(category: "photo") }
  scope :videos, -> { where(category: "video") }
  scope :audio, -> { where(category: "audio") }
  scope :documents, -> { where(category: "document") }
  scope :duplicated, -> { where("occurrence_count > 1") }
  scope :unique_only, -> { where(occurrence_count: 1) }

  def all_occurrences
    Scanner::FileEntry.where(content_hash: content_hash)
  end

  def wasted_space
    size_bytes * (occurrence_count - 1)
  end

  def human_size
    ActiveSupport::NumberHelper.number_to_human_size(size_bytes)
  end

  def human_wasted
    ActiveSupport::NumberHelper.number_to_human_size(wasted_space)
  end

  def canonical_exists?
    canonical_path && File.exist?(canonical_path)
  end
end

Useful queries:

# All duplicated photos
UniqueFile.photos.duplicated

# Total wasted space on video duplicates
UniqueFile.videos.duplicated.sum(&:wasted_space)
# => 524,288,000 (524 MB)

# Find all copies of a specific file
unique_file = UniqueFile.find_by(content_hash: 'a3f5b2c9d1e4')
unique_file.all_occurrences
# => [#<Scanner::FileEntry path="/path/1">, #<Scanner::FileEntry path="/path/2">, ...]

# Photos that appear more than 5 times
UniqueFile.photos.where("occurrence_count > 5").order(occurrence_count: :desc)

Populating the UniqueFile Table

The UniqueFilePopulator reads all unique content hashes from the files table and creates UniqueFile records in batches:

class UniqueFilePopulator
  def run
    unique_hashes = Scanner::FileEntry
      .where.not(content_hash: nil)
      .distinct
      .pluck(:content_hash)

    unique_hashes.each_slice(5000) do |hash_batch|
      files_by_hash = Scanner::FileEntry
        .where(content_hash: hash_batch)
        .group_by(&:content_hash)

      files_by_hash.each do |hash, entries|
        canonical = select_canonical(entries)

        UniqueFile.create_or_find_by!(content_hash: hash) do |uf|
          uf.size_bytes = canonical.size_bytes
          uf.category = canonical.category
          uf.extension = canonical.extension
          uf.occurrence_count = entries.length
          uf.canonical_file_entry_id = canonical.id
          uf.canonical_path = canonical.path
        end
      end
    end
  end
end

Process:

  1. Find all unique content hashes
  2. For each hash, load all files with that hash
  3. Select the “canonical” copy (see next section)
  4. Create a UniqueFile record with aggregated data

Running it:

bin/rails populate:unique_files

Output:

Populating unique files...
Found 1,185,610 unique content hashes
Processed 5000/1185610 (0.4%)
Processed 10000/1185610 (0.8%)
...
=== Summary ===
Total unique files: 1,185,610
Duplicated: 628,492
Wasted space: 735 GB

This takes about 20 minutes on a reasonably fast machine.

Choosing the Canonical Copy

When you have multiple copies of the same file, you need to decide which one to keep as the “canonical” version. This is the copy you’ll back up to the cloud, and potentially the only one you’ll keep long-term.

The selection algorithm:

def select_canonical(entries)
  entries.min_by { |e| [e.path.count("/"), e.id] }
end

What this does:

  1. Prefer shorter paths: A file at /Photos/2015/beach.jpg wins over /Backup/OldMacBook/Users/avi/Desktop/Unsorted/Photos/2015/beach.jpg
  2. Use ID as tiebreaker: If paths have the same depth, pick the entry with the lowest ID (first indexed)
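The tie-break behaves like this in plain Ruby (the Entry struct and sample entries are illustrative):

```ruby
Entry = Struct.new(:path, :id)

entries = [
  Entry.new('/Backup2015/Users/avi/Desktop/Temp/wedding.mov', 7),
  Entry.new('/Photos/wedding.mov', 42)
]

# Sort key is [path depth, id] — the smallest array wins, compared element by element
canonical = entries.min_by { |e| [e.path.count('/'), e.id] }
puts canonical.path  # prints /Photos/wedding.mov
```

Because array comparison is lexicographic, depth always dominates and the ID only matters when depths are equal.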

Why prefer shorter paths:

Shorter paths tend to indicate better organization. Files buried deep in backup folders or temporary directories are usually not the canonical location.

Examples:

Path                                             Depth   Winner?
/Photos/wedding.mov                              2       YES
/Backup2015/Users/avi/Desktop/Temp/wedding.mov   6       No

Path                                             Depth   Winner?
/Videos/2018/vacation.mp4                        3       YES (lower ID)
/Archive/Videos/vacation.mp4                     3       No (higher ID)

Limitations of this approach:

This algorithm is a heuristic. It doesn’t know that /Backup2015/Photos-Organized is better than /Backup2018/Photos-Random-Dump even if they have the same depth.

Alternative strategies:

  1. Prefer specific root directories: If you have well-organized locations, prioritize them explicitly:
PREFERRED_ROOTS = [
  '/Photos',
  '/Videos',
  '/Documents/Personal',
  '/Archive/Organized'
]

def select_canonical(entries)
  # First try preferred roots
  preferred = entries.find { |e| PREFERRED_ROOTS.any? { |root| e.path.start_with?(root) } }
  return preferred if preferred

  # Fall back to shortest path
  entries.min_by { |e| [e.path.count("/"), e.id] }
end
  2. Avoid temp and cache directories:
BAD_PATTERNS = ['temp', 'cache', '.Trash', 'Downloads', 'Desktop']

def select_canonical(entries)
  # Filter out paths with bad patterns
  good_entries = entries.reject { |e| BAD_PATTERNS.any? { |p| e.path.include?(p) } }
  good_entries = entries if good_entries.empty?  # Fallback if all are bad

  good_entries.min_by { |e| [e.path.count("/"), e.id] }
end
  3. User review for important files: Flag high-value duplicates (photos, videos) for manual review before auto-selecting:
def select_canonical(entries)
  if entries.first.category == 'photo' && entries.length > 3
    # Let user choose via web UI
    nil
  else
    entries.min_by { |e| [e.path.count("/"), e.id] }
  end
end

Strategies for Handling Duplicates

Once you’ve identified duplicates, what do you actually do with them? There are four main approaches, each with tradeoffs.

Strategy 1: Keep One, Delete the Rest (Aggressive)

What it means:

Keep only the canonical copy. Delete all other occurrences.

Pros:

  • Maximum space savings
  • Clean, simple result
  • Easy to back up (just one copy of everything)

Cons:

  • Destructive (can’t undo easily)
  • Risky if canonical selection is wrong
  • Loses context (duplicate might be in a meaningful location)

When to use:

After you’ve verified your backups are solid and you’re confident in canonical selection. Good for obviously redundant copies like .DS_Store files or multiple identical backups of the same drive.

Implementation:

unique_file = UniqueFile.find_by(content_hash: 'abc123')

# Get all non-canonical copies
duplicates = unique_file.all_occurrences.where.not(id: unique_file.canonical_file_entry_id)

duplicates.each do |file_entry|
  if File.exist?(file_entry.path)
    File.delete(file_entry.path)
    puts "Deleted: #{file_entry.path}"
  end
end

Strategy 2: Track Duplicates, Delete Nothing (Conservative)

What it means:

Don’t delete anything. Just track that files are duplicates so you can make informed decisions later.

Pros:

  • Zero risk of data loss
  • Preserves all context and organization
  • Can analyze patterns before acting

Cons:

  • No space savings
  • Still paying for duplicate storage
  • Still backing up duplicates

When to use:

During initial analysis. This is the default approach when you first run deduplication. Get visibility into what’s duplicated before you start making destructive changes.

Implementation:

This is what UniqueFile does by default. Just populate the table and browse the results without deleting anything.

Strategy 3: Backup Unique Only (Practical)

What it means:

Keep all local copies (don’t delete anything), but only back up the canonical copy to the cloud.

Pros:

  • Saves cloud storage costs (50% reduction!)
  • Keeps all local copies for safety
  • Avoids risky deletion decisions

Cons:

  • Still using local disk space for duplicates
  • If canonical copy is lost, need to re-designate

When to use:

This is the ideal middle ground for most people. You get the financial benefit of deduplication (cloud storage is expensive) without the risk of deleting local files.

Implementation:

# Backup only unique content
UniqueFile.where(backup_status: 'pending').find_each do |uf|
  if uf.canonical_exists?
    upload_to_s3(uf.canonical_path)
    uf.update!(
      backup_status: 'uploaded',
      backed_up_at: Time.current,
      backup_url: "s3://bucket/#{uf.content_hash}"
    )
  end
end

# Skip duplicates automatically
duplicates = Scanner::FileEntry
  .where(content_hash: UniqueFile.where(backup_status: 'uploaded').pluck(:content_hash))
  .where.not(id: UniqueFile.pluck(:canonical_file_entry_id))

puts "Skipping #{duplicates.count} duplicate files (already backed up)"

Cost savings:

If cloud storage costs $5/month per 100 GB:

  • Without deduplication: 1.47 TB = $73.50/month
  • With deduplication: 732 GB = $36.60/month
  • Savings: $36.90/month ($442.80/year)
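Those figures follow directly from the rate:

```ruby
rate_per_gb = 5.0 / 100   # $5 per 100 GB per month

monthly_before = 1470 * rate_per_gb   # 1.47 TB, no deduplication
monthly_after  =  732 * rate_per_gb   # unique data only

puts format('before: $%.2f/month', monthly_before)                        # $73.50
puts format('after:  $%.2f/month', monthly_after)                         # $36.60
puts format('saved:  $%.2f/year', (monthly_before - monthly_after) * 12)  # $442.80
```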

Strategy 4: Analyze Before Deciding (Adaptive)

What it means:

Use different strategies for different types of files.

Examples:

  • Photos and videos: Backup unique only (high value, large files)
  • System files: Delete all duplicates (no value, safe to remove)
  • Documents: Keep all, review manually (may have versions with different edits)
  • Code projects: Keep all (might be different branches with same compiled output)

Pros:

  • Maximizes value while minimizing risk
  • Tailored to actual file importance
  • Can be automated with rules

Cons:

  • More complex to implement
  • Requires careful rule design
  • Easy to make mistakes in rule logic

When to use:

Once you understand your archive’s composition. Start with the conservative approach, analyze the results, then build rules based on what you learn.

Implementation:

class DuplicationStrategy
  def self.handle(unique_file)
    case unique_file.category
    when 'photo', 'video'
      BackupUniqueOnly.new(unique_file).execute
    when 'system'
      DeleteDuplicates.new(unique_file).execute
    when 'document'
      FlagForReview.new(unique_file).execute
    when 'code'
      KeepAll.new(unique_file).execute
    else
      KeepAll.new(unique_file).execute  # Safe default
    end
  end
end

Watch Out: Deleting Duplicates Safely

Deleting duplicates is risky. If you delete the wrong file or if your canonical selection algorithm has bugs, you could lose data permanently. Follow these rules:

1. Always verify backups first

Before you delete anything, confirm that:

  • Your backup system is working
  • You can restore files from backup
  • Your backup includes the canonical copies

Test it:

# Check that canonical files actually exist
missing = UniqueFile.where.not(canonical_path: nil).select { |uf| !uf.canonical_exists? }
if missing.any?
  puts "ERROR: #{missing.count} canonical files are missing!"
  puts missing.first(10).map(&:canonical_path)
  exit 1
end

2. Use a dry run first

Don’t actually delete files on the first pass. Log what would be deleted and review it:

def delete_duplicates(unique_file, dry_run: true)
  duplicates = unique_file.all_occurrences.where.not(id: unique_file.canonical_file_entry_id)

  duplicates.each do |file_entry|
    if dry_run
      puts "WOULD DELETE: #{file_entry.path} (#{file_entry.human_size})"
    else
      File.delete(file_entry.path) if File.exist?(file_entry.path)
      puts "DELETED: #{file_entry.path}"
    end
  end
end

# First run (safe)
delete_duplicates(unique_file, dry_run: true)

# Review output, then run for real
delete_duplicates(unique_file, dry_run: false)

3. Move to trash instead of deleting

Instead of File.delete, move duplicates to a trash folder. If something goes wrong, you can restore them:

TRASH_DIR = '/Volumes/ExternalDrive/DeduplicationTrash'

def trash_duplicates(unique_file)
  FileUtils.mkdir_p(TRASH_DIR)

  duplicates = unique_file.all_occurrences.where.not(id: unique_file.canonical_file_entry_id)

  duplicates.each do |file_entry|
    if File.exist?(file_entry.path)
      # Preserve directory structure in trash
      relative_path = file_entry.path.delete_prefix('/')
      trash_path = File.join(TRASH_DIR, relative_path)

      FileUtils.mkdir_p(File.dirname(trash_path))
      FileUtils.mv(file_entry.path, trash_path)

      puts "Trashed: #{file_entry.path} -> #{trash_path}"
    end
  end
end

After a month of confirming nothing is needed, delete the trash folder.

4. Delete in phases by category

Start with low-risk files and work up to high-value files:

# Phase 1: System and cache files (safe to delete)
UniqueFile.where(category: 'system').duplicated.find_each do |uf|
  delete_duplicates(uf, dry_run: false)
end

# Phase 2: Code and config (medium risk)
UniqueFile.where(category: ['code', 'config']).duplicated.find_each do |uf|
  delete_duplicates(uf, dry_run: false)
end

# Phase 3: Documents (higher risk, review first)
UniqueFile.documents.duplicated.find_each do |uf|
  puts "Review: #{uf.canonical_path} (#{uf.occurrence_count} copies)"
  # Manual review before deleting
end

# Phase 4: Photos and videos (NEVER auto-delete)
# Use backup unique only strategy instead

5. Keep a deletion log

Record every file you delete with timestamp and reason:

def log_deletion(file_entry, reason)
  File.open('deletion_log.txt', 'a') do |f|
    f.puts "#{Time.current} | #{reason} | #{file_entry.path} | #{file_entry.content_hash} | #{file_entry.size_bytes}"
  end
end

def delete_with_logging(file_entry, reason)
  log_deletion(file_entry, reason)
  File.delete(file_entry.path) if File.exist?(file_entry.path)
end

If you need to restore something, you have a complete audit trail.
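Because each line uses a fixed " | " separator, the log stays machine-readable; a restore script can split it back into fields (the sample line here is made up):

```ruby
# Parse one deletion_log.txt line back into its five fields
line = '2024-05-01 10:15:00 UTC | node_modules cleanup | /code/old/react.js | a3f5b2c9d1e4 | 1824'

timestamp, reason, path, content_hash, size_bytes = line.split(' | ')

puts path             # prints /code/old/react.js
puts size_bytes.to_i  # prints 1824
```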

Edge Cases: Same Hash, Different Files

Hash collisions are theoretically possible. Two different files could produce the same xxHash64 value. With 64 bits of entropy and 1.2 million files, the birthday-bound probability is about 1 in 26 million, but it’s not zero.

How to detect collisions:

If you’re paranoid (or dealing with critical data), verify that files with the same hash are actually identical by comparing them byte-for-byte:

def verify_no_collision(unique_file)
  occurrences = unique_file.all_occurrences.limit(2).to_a
  return true if occurrences.length < 2

  # Stream-compare the first two files; avoids loading multi-GB files into memory
  unless FileUtils.compare_file(occurrences[0].path, occurrences[1].path)
    puts "COLLISION DETECTED!"
    puts "Hash: #{unique_file.content_hash}"
    puts "File 1: #{occurrences[0].path}"
    puts "File 2: #{occurrences[1].path}"
    return false
  end

  true
end

# Check all duplicates for collisions
UniqueFile.duplicated.find_each do |uf|
  verify_no_collision(uf)
end

In practice:

I’ve never seen a collision in a personal archive. xxHash64 distributes well on ordinary, non-adversarial input, and for random personal files (photos, videos, documents), collisions are effectively impossible in practice.

If you do find a collision:

  1. Switch to a stronger hash (SHA-256 or BLAKE3)
  2. Re-hash your entire archive with the new algorithm
  3. Update the content_hash column and re-run deduplication

This is extremely unlikely to be necessary.

Real Results: What 50% Deduplication Looks Like

Here’s what the deduplication analysis revealed in my 3.5 million file archive:

Overall statistics:

  • Total files: 3,576,948
  • Unique content: 1,185,610 (33% are unique)
  • Duplicated content: 2,391,338 (67% are duplicates)
  • Unique data: 732 GB
  • Wasted space: 735 GB (50% of total)

By category:

Category    Unique     Duplicates   Wasted Space   Notes
Photos      418,234    1,247,892    387 GB         3x duplication on average
Videos      4,421      8,842        215 GB         Large files, few copies
Documents   98,234     187,456      45 GB          PDFs duplicated across backups
Audio       39,872     79,744       52 GB          Music libraries backed up multiple times
Code        245,678    491,356      18 GB          Mostly node_modules and build artifacts
Other       379,171    376,048      18 GB          Config, data, archives

Biggest offenders:

  1. Photo library duplicated 3x: Same 418k photos in /Photos, /Backup2015/Photos, and /iCloudDrive/Photos. Cost: 387 GB.

  2. Wedding video in 4 places: Same 2.3 GB video file in four different backup locations. Cost: 6.9 GB.

  3. iPhone camera roll synced twice: Photos from iPhone appear both in ~/Pictures/Photos Library and /Backups/iPhone. Cost: 124 GB.

  4. Document backups: Work documents backed up annually from 2015-2023. Each backup is complete, so files from 2015 exist in 8 places. Cost: 45 GB.

  5. Node modules everywhere: Every code project has its own copy of React, Lodash, etc. Cost: 12 GB.

What I did:

  • Backed up unique content only to Backblaze B2 (saved 735 GB of cloud storage)
  • Kept all local copies (too risky to delete photos without more review)
  • Deleted duplicate node_modules (reconstructible from package.json)
  • Deleted system files and caches (safe, no value)
  • Manually reviewed duplicates for important documents (kept multiple versions where edits might differ)

Result:

  • Local storage: Still 1.47 TB (kept all copies for safety)
  • Cloud storage: 732 GB (50% reduction, saves $37/month)
  • Duplicates: Documented and tracked, can delete later if space becomes critical

What This Enables

Deduplication isn’t just about saving space. It changes how you think about your archive:

1. Backup strategy becomes simpler

Instead of “back up everything and hope for the best,” you can back up unique content only. You know exactly what needs to be protected and what’s redundant.

2. You can find the best version

When you have four copies of wedding.mov, which one do you keep? Deduplication identifies them all so you can pick the one in the best location or delete the ones in temporary folders.

3. You understand your archive’s composition

“50% of my storage is duplicates” is actionable information. You can focus cleanup efforts on the biggest offenders (photos) rather than wasting time on small files.

4. You can make trade-offs intentionally

Want to delete duplicates to free up space? Now you can see exactly what that gains you. Want to keep everything? Now you know what that costs you. Either way, you’re making informed decisions rather than guessing.

What’s Next

You’ve scanned 3.5 million files. You’ve categorized them by type. You’ve found 735 GB of duplicates and identified canonical copies.

Now you need a way to browse, search, and explore this data. You need to find specific files, triage what to keep, and plan your backup strategy.


Next: Cloud Backup Strategy - Protecting your unique files with offsite backup.

