Cloud Backup: Protecting Your Digital Treasures

Your archive database is working. Duplicates are identified. You know exactly what you have. Now comes the most important part: making sure you never lose it.

Local storage fails. Hard drives die. Houses burn down. Laptops get stolen. Floods happen. The only backup strategy that survives disaster is offsite backup. That means cloud storage.

Why Cloud Backup Matters

Hard drives have a 100% failure rate. Not “might fail.” Will fail. The question is when, not if.

Your external drive spinning in your closet right now? Average lifespan is 3 to 5 years. SSDs? 5 to 7 years, then the NAND flash cells start degrading. NAS drives? Better, but still mechanical parts that wear out.

Local backups protect against accidents, not disasters. Time Machine on an external drive saves you when you accidentally delete a file. It does nothing when your house burns down. Or floods. Or gets robbed. Or when both your laptop and backup drive fail in the same week because they were both in your backpack.

Cloud storage survives everything. Proper cloud object storage like S3 has 99.999999999% durability (11 nines). Your data is replicated across multiple geographic regions. AWS could lose an entire data center and your files would be fine. Your house could disappear and your files would be fine.

That wedding video? Those photos of your kids? Your college thesis? If they only exist on local drives, they are temporary. Eventually, you will lose them. Cloud backup makes them permanent.

The Deduplication Advantage

Here is where the archive system pays off. After deduplication, you don’t need to back up 1.47 TB. You need to back up 732 GB.

  • Total data across all backups: 1.47 TB
  • Unique data after deduplication: 732 GB
  • Duplicate waste: 735 GB (50%)

You save 50% on storage costs immediately. That wedding video you had in four places? Back it up once. Those photo library copies scattered across three drives? One backup covers all of them.

This is the power of content-based deduplication. The cloud backup job processes unique files only. Duplicates are marked “skipped” and reference the canonical copy. You pay once, protect everything.
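The skip-the-duplicates logic can be sketched in plain Ruby. This is a hypothetical in-memory illustration (made-up paths and contents); the real system hashes file bytes from disk, but the grouping idea is the same:

```ruby
require "digest"

# Hypothetical archive: four paths, only two distinct contents
files = {
  "/backup_2019/wedding.mov" => "VIDEO-BYTES",
  "/backup_2021/wedding.mov" => "VIDEO-BYTES",
  "/photos/wedding_copy.mov" => "VIDEO-BYTES",
  "/photos/beach.jpg"        => "JPEG-BYTES"
}

# Group paths by content hash; the first path in each group is the canonical copy
groups = files.group_by { |_path, bytes| Digest::SHA256.hexdigest(bytes) }

plan = groups.map do |hash, entries|
  paths = entries.map(&:first)
  { hash: hash, upload: paths.first, skipped: paths.drop(1) }
end

plan.each do |job|
  puts "upload  #{job[:upload]}"
  job[:skipped].each { |path| puts "skip    #{path} (duplicate)" }
end
```

Four files, but only two uploads: the three wedding copies collapse into one backed-up object plus two "skipped" references.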

S3-Compatible Storage Options

“S3-compatible” means it speaks Amazon’s S3 API. Your Rails app doesn’t care who runs the actual storage. AWS, Backblaze, Wasabi, or MinIO all work the same way from the application layer.

AWS S3 (Standard)

  • Cost: $0.023/GB/month
  • Egress: $0.09/GB
  • 732 GB archive: $16.84/month
  • Pros: Most compatible, every tool supports it, instant availability
  • Cons: Expensive for long-term storage, egress fees add up

Backblaze B2

  • Cost: $0.006/GB/month
  • Egress: First 3x storage free, then $0.01/GB
  • 732 GB archive: $4.39/month
  • Pros: Cheapest storage, generous free egress (2.2 TB/month free)
  • Cons: Slower than AWS in some regions

Wasabi Hot Cloud Storage

  • Cost: $0.0069/GB/month (min 1 TB billing)
  • Egress: $0 (unlimited free)
  • 732 GB archive: $6.90/month (billed at the 1 TB minimum; 732 GB alone would be $5.05)
  • Pros: No egress fees, fast, predictable costs
  • Cons: 1 TB minimum billing, 90-day minimum retention

MinIO (Self-Hosted)

  • Cost: Your server costs
  • Egress: Free (your bandwidth)
  • 732 GB archive: Depends on your server
  • Pros: Full control, no vendor lock-in, free software
  • Cons: You manage infrastructure, redundancy, backups of backups

Why This Choice: Wasabi

This project uses Wasabi for three reasons:

1. No egress fees. Restoring your entire archive costs $0. With AWS S3, downloading 732 GB costs $65.88. With Wasabi, it costs nothing. For a disaster recovery scenario where you need everything back, this is huge.

2. Predictable pricing. The price is the price. No surprise charges for API calls, no tiered storage math, no calculating free egress allowances. $0.0069/GB/month, period.

3. S3-compatible API. ActiveStorage works out of the box. No custom adapters, no compatibility layers. It is S3 as far as Rails is concerned.

The 1 TB minimum is fine. This archive is 732 GB now and will grow. Photos and videos accumulate over time. You will hit 1 TB eventually.

Cost Comparison Table

Provider          Storage Cost   Egress Cost   Monthly (732 GB)   Full Restore Cost
AWS S3 Standard   $0.023/GB      $0.09/GB      $16.84             $65.88
Backblaze B2      $0.006/GB      $0.01/GB*     $4.39              $0 (free tier)
Wasabi            $0.0069/GB     $0            $5.05**            $0
MinIO             Server cost    $0            Varies             $0

* First 3x storage size is free egress (2.2 TB/month for 732 GB)
** Billed at the 1 TB minimum ($6.90/month)

Over one year:

  • AWS S3: $202/year + restore costs
  • Backblaze B2: $53/year
  • Wasabi: $83/year (1 TB billing)
  • MinIO: Your server costs

For 732 GB of irreplaceable family photos and videos, $83/year is cheap insurance.
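The arithmetic behind those numbers is simple enough to sketch (rates from the table above; Wasabi billed at its 1 TB minimum):

```ruby
ARCHIVE_GB = 732

# Per-GB monthly rates from the comparison table; min_gb models Wasabi's billing floor
providers = {
  "AWS S3"       => { rate: 0.023,  min_gb: 0 },
  "Backblaze B2" => { rate: 0.006,  min_gb: 0 },
  "Wasabi"       => { rate: 0.0069, min_gb: 1000 }
}

providers.each do |name, p|
  billed_gb = [ARCHIVE_GB, p[:min_gb]].max
  monthly   = (billed_gb * p[:rate]).round(2)
  yearly    = (monthly * 12).round
  puts format("%-12s $%.2f/month  ~$%d/year", name, monthly, yearly)
end
```

Running this reproduces the figures above: $16.84/$202 for AWS, $4.39/$53 for B2, and $6.90/$83 for Wasabi.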

ActiveStorage Configuration

Rails 8 includes ActiveStorage for cloud file management. Configure Wasabi (or any S3-compatible provider) in config/storage.yml:

# config/storage.yml
wasabi:
  service: S3
  access_key_id: <%= Rails.application.credentials.dig(:wasabi, :access_key_id) %>
  secret_access_key: <%= Rails.application.credentials.dig(:wasabi, :secret_access_key) %>
  region: <%= Rails.application.credentials.dig(:wasabi, :region) || 'us-east-1' %>
  endpoint: https://s3.<%= Rails.application.credentials.dig(:wasabi, :region) || 'us-east-1' %>.wasabisys.com
  bucket: avi-archive
  force_path_style: true

Why force_path_style: true? S3 supports two URL styles: virtual-hosted (bucket.s3.amazonaws.com) and path-style (s3.amazonaws.com/bucket). AWS prefers virtual-hosted. Wasabi requires path-style. The flag forces path-style URLs.
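The two styles differ only in where the bucket name sits. A quick sketch, using this chapter's bucket name and a hypothetical blob key:

```ruby
bucket = "avi-archive"
key    = "abc123xyz"          # hypothetical blob key
host   = "s3.us-east-1.wasabisys.com"

# Virtual-hosted style: the bucket becomes part of the hostname
virtual_hosted = "https://#{bucket}.#{host}/#{key}"

# Path-style: the bucket is the first path segment
# (this is what force_path_style: true produces)
path_style = "https://#{host}/#{bucket}/#{key}"

puts virtual_hosted  # https://avi-archive.s3.us-east-1.wasabisys.com/abc123xyz
puts path_style      # https://s3.us-east-1.wasabisys.com/avi-archive/abc123xyz
```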

Store credentials securely:

# Edit encrypted credentials
rails credentials:edit

# Add Wasabi credentials
wasabi:
  access_key_id: YOUR_ACCESS_KEY
  secret_access_key: YOUR_SECRET_KEY
  region: us-east-1
  bucket: your-archive-bucket

Configure the service in your environment:

# config/environments/production.rb
config.active_storage.service = :wasabi

# config/environments/development.rb
config.active_storage.service = :local  # Use local disk in dev

ActiveStorage now routes all file operations through Wasabi in production.

Backup Status Tracking

The Backupable concern adds cloud backup status to any model:

# app/models/concerns/backupable.rb
module Backupable
  extend ActiveSupport::Concern

  BACKUP_STATUSES = %w[pending uploaded skipped error].freeze

  included do
    validates :backup_status, inclusion: { in: BACKUP_STATUSES }, allow_nil: true

    scope :backup_pending, -> { where(backup_status: [nil, "pending"]) }
    scope :backup_complete, -> { where(backup_status: "uploaded") }
    scope :backup_error, -> { where(backup_status: "error") }
  end

  def mark_uploaded!(url)
    update!(backup_status: "uploaded", backup_url: url, backed_up_at: Time.current)
  end

  def mark_error!(message)
    update!(backup_status: "error", backup_error: message)
  end

  def mark_skipped!
    update!(backup_status: "skipped")
  end
end

Status values:

  • pending (or nil): Not yet uploaded
  • uploaded: Successfully backed up to cloud
  • skipped: Duplicate, references another file’s backup
  • error: Upload failed, see backup_error for details

Models using Backupable:

  • UniqueFile: One row per unique content hash (732 GB of unique files)
  • Photo: Individual photos with EXIF data
  • Video: Video files
  • AudioFile: Audio files
  • Document: Documents

The Backup Workflow

Upload unique files only. Skip duplicates. Track everything.

Step 1: Filter for unique files

# Find all unique files that need backup
files_to_backup = UniqueFile
  .where(needs_backup: true)
  .where(backup_status: [nil, "pending", "error"])
  .where.not(content_hash: nil)
  .order(:category, size_bytes: :desc)  # Prioritize: photos first, largest first
  .limit(1000)

This query:

  • Filters files marked needs_backup: true
  • Skips already uploaded files
  • Includes failed uploads for retry
  • Orders by category (photos first) and size (largest first)
  • Batches 1000 at a time

Step 2: Upload each file

# lib/tasks/backup.rake
namespace :backup do
  desc "Upload unique files to cloud storage"
  task upload: :environment do
    # backup_pending covers nil/"pending"; include errored files for retry,
    # matching the Step 1 query above
    files_to_backup = UniqueFile.backup_pending.or(UniqueFile.backup_error).limit(1000)

    files_to_backup.find_each do |unique_file|
      begin
        # Check if file exists on disk
        unless File.exist?(unique_file.canonical_path)
          unique_file.mark_error!("File not found: #{unique_file.canonical_path}")
          next
        end

        # Open in binary mode; the block form closes the handle even on error
        File.open(unique_file.canonical_path, "rb") do |file_content|
          # Upload to ActiveStorage
          # (MIME::Types comes from the mime-types gem)
          blob = ActiveStorage::Blob.create_and_upload!(
            io: file_content,
            filename: File.basename(unique_file.canonical_path),
            content_type: MIME::Types.type_for(unique_file.extension).first&.content_type || "application/octet-stream",
            metadata: {
              content_hash: unique_file.content_hash,
              original_path: unique_file.canonical_path,
              size_bytes: unique_file.size_bytes,
              category: unique_file.category
            }
          )

          # Mark as uploaded
          unique_file.mark_uploaded!(blob.url)
        end

        puts "[OK] Uploaded: #{unique_file.canonical_path} (#{unique_file.size_bytes} bytes)"

      rescue => e
        unique_file.mark_error!(e.message)
        puts "[FAIL] Failed: #{unique_file.canonical_path} - #{e.message}"
      end
    end
  end
end

Step 3: Skip duplicates

# Mark duplicates as skipped (they reference the canonical backup)
namespace :backup do
  desc "Mark duplicate files as skipped"
  task skip_duplicates: :environment do
    # For each unique file that's uploaded, mark all duplicate references as skipped
    UniqueFile.backup_complete.find_each do |unique_file|
      # Find all scanner file entries with the same content hash
      Scanner::FileEntry
        .where(content_hash: unique_file.content_hash)
        .where.not(path: unique_file.canonical_path)
        .find_each do |duplicate|
          # Update Photo/Video/AudioFile/Document records that reference this duplicate
          Photo.find_by(file_entry_id: duplicate.id)&.mark_skipped!
          Video.find_by(file_entry_id: duplicate.id)&.mark_skipped!
          AudioFile.find_by(file_entry_id: duplicate.id)&.mark_skipped!
          Document.find_by(file_entry_id: duplicate.id)&.mark_skipped!
        end
    end

    puts "Marked duplicates as skipped"
  end
end

Step 4: Verify uploads

namespace :backup do
  desc "Verify uploaded files exist in cloud storage"
  task verify: :environment do
    UniqueFile.backup_complete.find_each do |unique_file|
      begin
        # Parse blob key from backup_url
        blob = ActiveStorage::Blob.find_by(key: extract_blob_key(unique_file.backup_url))

        unless blob&.service&.exist?(blob.key)
          unique_file.mark_error!("Blob missing from cloud storage")
          puts "[FAIL] Missing: #{unique_file.canonical_path}"
        else
          puts "[OK] Verified: #{unique_file.canonical_path}"
        end

      rescue => e
        unique_file.mark_error!("Verification failed: #{e.message}")
        puts "[FAIL] Error: #{unique_file.canonical_path} - #{e.message}"
      end
    end
  end

  def extract_blob_key(url)
    # Extract the ActiveStorage blob key from a (possibly signed) URL,
    # ignoring any query string the service appends
    # Example: https://s3.us-east-1.wasabisys.com/avi-archive/abc123xyz?X-Amz-Signature=...
    URI.parse(url).path.split('/').last
  end
end

Content Hash Verification

Before marking a file as uploaded, verify the content hash matches:

def verify_upload(unique_file, blob)
  # Stream every chunk through the digest. Hashing only the first chunk
  # would produce a false mismatch for any file larger than one chunk.
  digest = Digest::SHA256.new
  blob.download { |chunk| digest.update(chunk) }

  remote_hash = digest.hexdigest
  local_hash = unique_file.content_hash

  unless remote_hash == local_hash
    raise "Hash mismatch: local=#{local_hash}, remote=#{remote_hash}"
  end

  true
end

For large files, consider streaming hash calculation:

def streaming_hash(io)
  digest = Digest::SHA256.new
  while chunk = io.read(8192)
    digest.update(chunk)
  end
  digest.hexdigest
end

This avoids loading entire files into memory.
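A quick self-contained check (using a temp file larger than one 8 KB chunk) confirms the streaming digest matches Ruby's one-shot file hash:

```ruby
require "digest"
require "tempfile"

# Same helper as above: hash an IO in 8 KB chunks
def streaming_hash(io)
  digest = Digest::SHA256.new
  while chunk = io.read(8192)
    digest.update(chunk)
  end
  digest.hexdigest
end

Tempfile.create("sample") do |f|
  f.write("x" * 20_000)  # spans multiple 8 KB chunks
  f.flush
  f.rewind

  streamed = streaming_hash(f)
  one_shot = Digest::SHA256.file(f.path).hexdigest

  puts streamed == one_shot  # true
end
```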

Disaster Recovery Considerations

Scenario: Total data loss. Your laptop dies. Your NAS fails. Your external drives are gone. Everything local is destroyed.

Recovery plan:

  1. Install Rails app on a new machine
  2. Restore PostgreSQL database from backup (you are backing this up too, right?)
  3. Run restore rake task to download all files from cloud storage
  4. Verify content hashes match database records
  5. Rebuild local directory structure
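Step 4, hash verification, works the same outside Rails. A self-contained sketch with two simulated restored files (contents and paths are hypothetical):

```ruby
require "digest"
require "tmpdir"

Dir.mktmpdir do |dir|
  # Simulate two restored files
  a = File.join(dir, "a.txt"); File.write(a, "hello")
  b = File.join(dir, "b.txt"); File.write(b, "world")

  # The content hashes the database would have recorded for these files
  expected = {
    a => Digest::SHA256.hexdigest("hello"),
    b => Digest::SHA256.hexdigest("world")
  }

  expected.each do |path, recorded|
    actual = Digest::SHA256.file(path).hexdigest
    puts "#{actual == recorded ? '[OK]' : '[MISMATCH]'} #{File.basename(path)}"
  end
end
```

In the real recovery, `expected` would come from UniqueFile records and the paths from the restore directory.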

Restore rake task:

namespace :backup do
  desc "Restore all files from cloud storage"
  task restore: :environment do
    restore_path = ENV['RESTORE_PATH'] || Rails.root.join('restored_archive')
    FileUtils.mkdir_p(restore_path)

    UniqueFile.backup_complete.find_each do |unique_file|
      begin
        blob = ActiveStorage::Blob.find_by(key: extract_blob_key(unique_file.backup_url))

        # Recreate directory structure
        relative_path = unique_file.canonical_path.sub(/^\/Volumes\/[^\/]+/, '')
        full_path = File.join(restore_path, relative_path)
        FileUtils.mkdir_p(File.dirname(full_path))

        # Stream chunks to disk; File.write would overwrite the file on each chunk
        File.open(full_path, "wb") do |f|
          blob.download { |chunk| f.write(chunk) }
        end

        puts "[OK] Restored: #{full_path}"

      rescue => e
        puts "[FAIL] Failed: #{unique_file.canonical_path} - #{e.message}"
      end
    end
  end
end

Run with:

RESTORE_PATH=/Volumes/NewDrive/Archive rails backup:restore

Database backup is critical. The cloud storage holds your files. The database holds the metadata, EXIF data, GPS locations, categories, and deduplication mapping. Back up PostgreSQL regularly:

# Dump database
pg_dump avi_archive > avi_archive_backup_$(date +%Y%m%d).sql

# Upload to cloud
aws s3 cp avi_archive_backup_20260202.sql s3://avi-archive-db-backups/

Automate this with cron or a scheduled job. Daily database backups are cheap insurance.
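One way to script it: a small Ruby sketch that builds the dated dump filename and the two commands above. Names are the ones used in this chapter; nothing is executed here, the strings would be passed to cron or `system`:

```ruby
require "date"

db     = "avi_archive"
bucket = "avi-archive-db-backups"
stamp  = Date.today.strftime("%Y%m%d")
dump   = "#{db}_backup_#{stamp}.sql"

commands = [
  "pg_dump #{db} > #{dump}",          # dump the database
  "aws s3 cp #{dump} s3://#{bucket}/" # ship it offsite
]

commands.each { |c| puts c }
```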

Watch Out: Egress Fees

The hidden cost of cloud storage is getting your data back.

AWS S3 charges $0.09/GB for egress (downloads). Restoring 732 GB costs $65.88. If you restore twice, you have paid $131.76 in bandwidth fees. That is more than a year of storage.

Backblaze B2 offers 3x your storage size as free egress per month. For 732 GB stored, you get 2.2 TB/month free downloads. Restoring once per month is free. Restoring twice costs money.

Wasabi has zero egress fees. Restore your entire archive ten times. Download it daily. Transfer it to another provider. It costs $0.

Why this matters: Disaster recovery means downloading everything. Testing your backups (which you should do) means downloading files. Migrating to another provider means downloading everything. Egress fees turn disaster recovery into a financial decision. That is a bad position to be in when your house is on fire.

Free egress removes this friction. You can test restores without anxiety. You can move data freely. You can recover from disasters without calculating costs.

Adapt This: Choose Your Storage Provider

This guide uses Wasabi, but the architecture works with any S3-compatible provider. Swap the config/storage.yml configuration and everything else stays the same.

If you prioritize lowest cost: Use Backblaze B2 ($4.39/month for 732 GB). The free egress tier covers most disaster recovery scenarios.

If you prioritize compatibility: Use AWS S3. Every tool, every library, every service supports it. You pay more, but you never fight compatibility issues.

If you prioritize control: Use MinIO on your own server. You manage the infrastructure, but you own the data completely. Good for privacy-sensitive archives.

If you prioritize simplicity: Use Wasabi. One price, no surprise fees, works exactly like S3. This is the “set it and forget it” option.

The Rails ActiveStorage layer abstracts the provider. The backup workflow is identical regardless of where files land. Choose the provider that matches your priorities, update the config, and ship it.

Backup Priorities

Not all files are equally important. Prioritize backups by category:

Priority 1: Photos (418k files, ~300 GB) Irreplaceable. Family memories, travel photos, events. Back these up first.

Priority 2: Videos (4.4k files, ~200 GB) Also irreplaceable. Graduations, weddings, kids growing up. Second priority.

Priority 3: Documents (98k files, ~50 GB) Tax records, college papers, legal documents. Important but often reproducible.

Priority 4: Audio (40k files, ~80 GB) Music collections, voice memos, podcasts. Nice to have, but often re-downloadable.

Priority 5: Code Projects (2.2k projects, ~100 GB) Your own projects are valuable. Dependencies and node_modules are not. Back up source, skip build artifacts.

Run separate backup jobs for each priority. Start with photos and videos. If you run out of budget or storage, at least the irreplaceable stuff is safe.

Monitoring and Alerts

Track backup progress with simple metrics:

# app/models/backup_stats.rb
class BackupStats
  def self.summary
    {
      total_files: UniqueFile.count,
      uploaded: UniqueFile.backup_complete.count,
      pending: UniqueFile.backup_pending.count,
      errors: UniqueFile.backup_error.count,
      skipped: UniqueFile.where(backup_status: "skipped").count,
      total_bytes_backed_up: UniqueFile.backup_complete.sum(:size_bytes),
      percent_complete: UniqueFile.count.zero? ? 0.0 : (UniqueFile.backup_complete.count.to_f / UniqueFile.count * 100).round(2)
    }
  end
end

Add a dashboard page:

<!-- app/views/backup_status/index.html.erb -->
<h1>Cloud Backup Status</h1>

<% stats = BackupStats.summary %>

<div class="stats">
  <div class="stat">
    <h3>Files Uploaded</h3>
    <p><%= number_with_delimiter(stats[:uploaded]) %> / <%= number_with_delimiter(stats[:total_files]) %></p>
    <p><%= stats[:percent_complete] %>%</p>
  </div>

  <div class="stat">
    <h3>Bytes Backed Up</h3>
    <p><%= number_to_human_size(stats[:total_bytes_backed_up]) %></p>
  </div>

  <div class="stat">
    <h3>Pending</h3>
    <p><%= number_with_delimiter(stats[:pending]) %></p>
  </div>

  <div class="stat">
    <h3>Errors</h3>
    <p><%= number_with_delimiter(stats[:errors]) %></p>
  </div>
</div>

Set up alerts for errors:

# Check for upload errors daily
namespace :backup do
  desc "Report backup errors"
  task report_errors: :environment do
    errors = UniqueFile.backup_error.limit(100)

    if errors.any?
      puts "Backup errors detected:"
      errors.each do |file|
        puts "  #{file.canonical_path}: #{file.backup_error}"
      end

      # Send email alert (optional)
      BackupMailer.error_report(errors).deliver_later
    else
      puts "No backup errors"
    end
  end
end

Run this daily with cron or a scheduled job.

Conclusion

Cloud backup is the final step. Everything before this (scanning, deduplication, categorization) prepares the data. This step makes it permanent.

732 GB of unique content. Not 1.47 TB. Deduplication saves 50% of backup costs.

S3-compatible storage. Wasabi, Backblaze, AWS, or MinIO. The architecture works with any provider.

Backup status tracking. Every file knows if it is pending, uploaded, skipped, or errored.

Verification. Content hashes confirm uploads are correct.

Disaster recovery. Restore everything from cloud with a single rake task.

Your digital archive is now protected against every disaster except the end of the internet. Hard drives will fail. Laptops will die. Houses might burn. Your files will survive.


Next: Customization Guide - Making the system work for your specific needs.

