Customization: Making It Your Own

This system is not a finished product. It’s a framework you adapt to your specific archive, your specific priorities, and your specific obsessions. Maybe you’re a photographer drowning in RAW files. Maybe you’re a musician with 20 years of Logic projects. Maybe you inherited a hard drive full of family videos and need to organize them before they’re lost forever.

The architecture works for all of these. The customization points are designed to be found and modified. Here’s where to adapt the system for your needs.

Layer 1: Scanner Customization

The scanner is where files get classified, categorized, and fingerprinted. This is the first place to customize because it determines what data flows into the rest of the system.

Add New File Categories

Your archive has file types the default scanner doesn’t know about. The category map is a simple Go map. Find it in scanner/main.go:

var categoryByExt = map[string]struct {
    category   string
    confidence string
}{
    // Default categories
    ".jpg": {"photo", "high"},
    ".mp4": {"video", "high"},
    ".mp3": {"audio", "high"},
    // ... more defaults
}

Add your categories:

// 3D modeling and CAD
".blend": {"3d_model", "high"},
".obj":   {"3d_model", "high"},
".stl":   {"3d_model", "high"},
".dwg":   {"cad", "high"},
".dxf":   {"cad", "high"},

// Music production
".flp":   {"music_production", "high"},  // FL Studio
".logic": {"music_production", "high"},  // Logic Pro
".als":   {"music_production", "high"},  // Ableton Live
".ptx":   {"music_production", "high"},  // Pro Tools

// Ebooks and documents
".epub":  {"ebook", "high"},
".mobi":  {"ebook", "high"},
".azw3":  {"ebook", "high"},

// Game development
".unity": {"game_project", "high"},
".uproject": {"game_project", "high"},  // Unreal Engine
".godot": {"game_project", "high"},

Confidence levels guide your triage:

  • high: Extension strongly indicates category, no review needed
  • medium: Extension suggests category but might be ambiguous
  • low: Extension is weak signal, flag for manual review

After adding categories, recompile and re-scan:

cd scanner
go build -o scanner main.go
./scanner -source /Volumes/YourDrive -db "postgres://localhost/archive"

The scanner’s resume capability means it only processes new files. Existing files keep their old categories unless you clear the database first.
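Existing rows can also be re-categorized in place, without clearing anything. Assuming the category, category_confidence, and extension columns used throughout this guide, a one-off SQL update does it:

```sql
-- Re-categorize already-scanned rows for newly added extensions
UPDATE files
SET category = '3d_model',
    category_confidence = 'high'
WHERE lower(extension) IN ('.blend', '.obj', '.stl');
```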

Adapt This: Category-Specific Metadata

Some file types carry metadata beyond what EXIF handles. Music files have ID3 tags. Videos have duration and codec info. PDFs have page counts.

Example: Extract video metadata with FFprobe:

import (
    "encoding/json"
    "os/exec"
)

func extractVideoMetadata(path string) (map[string]interface{}, error) {
    cmd := exec.Command("ffprobe",
        "-v", "quiet",
        "-print_format", "json",
        "-show_format",
        "-show_streams",
        path)

    output, err := cmd.Output()
    if err != nil {
        return nil, err
    }

    var data map[string]interface{}
    if err := json.Unmarshal(output, &data); err != nil {
        return nil, err
    }
    return data, nil
}

Call this in processFile() when the category is video. Store the JSON in a new JSONB column video_metadata. The pattern is identical to EXIF extraction.
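The JSONB column itself is a one-line change to the schema:

```sql
-- New column to hold raw ffprobe output
ALTER TABLE files ADD COLUMN video_metadata JSONB;
```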

Customize Skip Patterns

The default skip list excludes system directories and dependency folders. Your archive has others worth skipping:

var skipDirs = map[string]bool{
    // Defaults
    "node_modules":    true,
    ".git":            true,
    "Applications":    true,
    "Library":         true,

    // Add your patterns
    "vendor":          true,  // Go dependencies
    ".bundle":         true,  // Ruby gems
    "__pycache__":     true,  // Python bytecode
    ".venv":           true,  // Python virtual envs
    "venv":            true,
    ".next":           true,  // Next.js build output
    ".nuxt":           true,  // Nuxt build output
    "dist":            true,  // Build artifacts
    "build":           true,
    ".cache":          true,  // Generic caches
    "tmp":             true,  // Temporary files
    "temp":            true,
    ".docker":         true,  // Docker cache

    // Project-specific
    "DaVinci Resolve Cache": true,  // Video editing cache
    "Final Cut Pro Cache": true,
    "Adobe Cache":     true,
    ".dropbox.cache":  true,  // Dropbox local cache
}

Pattern-based skipping:

The current implementation only matches exact directory names. For more flexibility, modify the shouldSkipDir() function to match patterns:

func shouldSkipDir(name string) bool {
    // Exact match
    if skipDirs[name] {
        return true
    }

    // Substring matches, compared case-insensitively below,
    // so one lowercase entry covers "Cache", "CACHE", etc.
    patterns := []string{
        "cache",
        ".tmp",
        "~temp",
    }

    lowerName := strings.ToLower(name)
    for _, pattern := range patterns {
        if strings.Contains(lowerName, strings.ToLower(pattern)) {
            return true
        }
    }

    return false
}

Now any directory with “cache” in the name gets skipped automatically.

Expand Sensitive File Detection

The default sensitive patterns catch SSH keys, credentials, and environment files. Your archive has other sensitive data:

var sensitivePatterns = []string{
    // Defaults
    ".ssh", ".aws", ".gnupg", "credentials", ".env",
    "private_key", "id_rsa", "id_ed25519",

    // Financial
    "tax", "Tax", "taxes", "Taxes",
    "bank", "Bank", "banking",
    "investment", "Investment",
    "financial", "Financial",

    // Personal
    "medical", "Medical", "health", "Health",
    "legal", "Legal",
    "passwords", "Passwords",
    "private", "Private",
    "confidential", "Confidential",

    // Work
    "nda", "NDA", "proprietary", "Proprietary",
    "internal", "Internal",
}

Files matching these patterns get is_sensitive = true. The web application filters them from public views. Backup scripts can exclude them from cloud uploads.

Add EXIF Support for New Formats

The default scanner extracts EXIF from common photo formats. Add more:

var exifExtensions = map[string]bool{
    // Defaults
    ".jpg":  true,
    ".jpeg": true,
    ".png":  true,
    ".heic": true,
    ".tiff": true,

    // RAW formats
    ".cr2":  true,  // Canon
    ".cr3":  true,  // Canon (newer)
    ".nef":  true,  // Nikon
    ".arw":  true,  // Sony
    ".dng":  true,  // Adobe Digital Negative
    ".orf":  true,  // Olympus
    ".rw2":  true,  // Panasonic
    ".raf":  true,  // Fujifilm
}

// In processFile():
if exifExtensions[strings.ToLower(ext)] {
    exifData, dateTaken, lat, lon, width, height := extractExif(path)
    record.ExifData = exifData
    record.ExifDateTaken = dateTaken
    record.GpsLat = lat
    record.GpsLon = lon
    record.MediaWidth = width
    record.MediaHeight = height
}

Tune Worker Concurrency

The default worker count is 8. This works for most machines, but you can tune it:

# More workers for fast SSDs and many cores
./scanner -source /path -workers 16

# Fewer workers for slow HDDs or low CPU
./scanner -source /path -workers 4

# Skip hashing for faster exploratory scans
./scanner -source /path -skip-hash

# Skip EXIF if you don't need camera metadata
./scanner -source /path -skip-exif

How to choose:

  • SSD + 8+ cores: 12-16 workers
  • HDD + 4 cores: 4-6 workers
  • Networked storage: 2-4 workers (network is bottleneck)
  • Battery-powered laptop: 4 workers (balance speed and heat)

Watch system load during the scan. If CPU is pegged at 100% and disk is idle, reduce workers. If disk is thrashing and CPU is idle, you’re probably I/O bound and worker count doesn’t matter.

Layer 2: Database Customization

The PostgreSQL schema is designed to be extended. Add columns for new metadata, create indexes for your common queries, and build views for specific reporting needs.

Add Columns for New Metadata

Every file type has unique metadata. Add columns as you need them:

-- Video-specific metadata
ALTER TABLE files ADD COLUMN video_duration_seconds INTEGER;
ALTER TABLE files ADD COLUMN video_resolution TEXT;  -- '1920x1080', '4K', etc
ALTER TABLE files ADD COLUMN video_codec TEXT;
ALTER TABLE files ADD COLUMN video_framerate FLOAT;

-- Audio-specific metadata
ALTER TABLE files ADD COLUMN audio_artist TEXT;
ALTER TABLE files ADD COLUMN audio_album TEXT;
ALTER TABLE files ADD COLUMN audio_year INTEGER;
ALTER TABLE files ADD COLUMN audio_genre TEXT;
ALTER TABLE files ADD COLUMN audio_duration_seconds INTEGER;

-- Document-specific metadata
ALTER TABLE files ADD COLUMN document_page_count INTEGER;
ALTER TABLE files ADD COLUMN document_author TEXT;
ALTER TABLE files ADD COLUMN document_word_count INTEGER;

-- Code-specific metadata
ALTER TABLE files ADD COLUMN code_language TEXT;
ALTER TABLE files ADD COLUMN code_line_count INTEGER;
ALTER TABLE files ADD COLUMN code_last_commit_date TIMESTAMP WITH TIME ZONE;

Add corresponding indexes for columns you’ll query frequently:

CREATE INDEX idx_files_video_duration ON files(video_duration_seconds)
    WHERE video_duration_seconds IS NOT NULL;

CREATE INDEX idx_files_audio_artist ON files(audio_artist)
    WHERE audio_artist IS NOT NULL;

CREATE INDEX idx_files_document_pages ON files(document_page_count)
    WHERE document_page_count IS NOT NULL;

The WHERE clause makes each of these a partial index: only rows matching the condition are stored, which keeps the index small and writes cheap.

Create Custom Views for Reporting

Views simplify complex queries. Create views for your common reporting needs:

-- Photos with full metadata
CREATE VIEW photo_timeline AS
SELECT
    id,
    filename,
    exif_date_taken AS date_taken,
    exif_data->>'Make' AS camera_make,
    exif_data->>'Model' AS camera_model,
    gps_lat,
    gps_lon,
    media_width,
    media_height,
    size_bytes,
    path
FROM files
WHERE category = 'photo'
    AND exif_date_taken IS NOT NULL
ORDER BY exif_date_taken DESC;

-- Duplicate files with wasted space calculation
CREATE VIEW duplicate_files AS
SELECT
    content_hash,
    COUNT(*) AS copy_count,
    ARRAY_AGG(path ORDER BY path) AS file_paths,
    size_bytes,
    size_bytes * (COUNT(*) - 1) AS wasted_bytes
FROM files
WHERE content_hash IS NOT NULL
GROUP BY content_hash, size_bytes
HAVING COUNT(*) > 1
ORDER BY wasted_bytes DESC;

-- Files by decade
CREATE VIEW files_by_decade AS
SELECT
    (EXTRACT(YEAR FROM modified_at)::int / 10) * 10 AS decade,
    category,
    COUNT(*) AS file_count,
    SUM(size_bytes) AS total_bytes
FROM files
WHERE NOT is_dir
GROUP BY decade, category
ORDER BY decade DESC, total_bytes DESC;

-- Largest files per category
CREATE VIEW largest_files_by_category AS
SELECT DISTINCT ON (category)
    category,
    filename,
    size_bytes,
    path,
    modified_at
FROM files
WHERE NOT is_dir
ORDER BY category, size_bytes DESC;

Query views like regular tables:

-- Show all photos from 2015
SELECT * FROM photo_timeline
WHERE EXTRACT(YEAR FROM date_taken) = 2015;

-- Calculate total wasted space
SELECT SUM(wasted_bytes) FROM duplicate_files;

-- Show file distribution by decade
SELECT * FROM files_by_decade;

Views don’t store data. They’re just saved queries. No storage cost, no sync issues.

Adapt This: Partitioning for Huge Archives

If your archive grows beyond 10 million files, consider partitioning the files table by year or category. Partitioning splits a single table into multiple physical tables, improving query performance and maintenance.

Partition by year:

-- Create partitioned table
-- (don't copy constraints with INCLUDING ALL: any primary key or unique
-- constraint on a partitioned table must include the partition key)
CREATE TABLE files_partitioned (
    LIKE files INCLUDING DEFAULTS
) PARTITION BY RANGE (modified_at);

-- Create partitions
CREATE TABLE files_2020 PARTITION OF files_partitioned
    FOR VALUES FROM ('2020-01-01') TO ('2021-01-01');

CREATE TABLE files_2021 PARTITION OF files_partitioned
    FOR VALUES FROM ('2021-01-01') TO ('2022-01-01');

-- Auto-create future partitions with a function
CREATE OR REPLACE FUNCTION create_partition_for_year(year INT) RETURNS VOID AS $$
BEGIN
    EXECUTE format('
        CREATE TABLE IF NOT EXISTS files_%s PARTITION OF files_partitioned
        FOR VALUES FROM (''%s-01-01'') TO (''%s-01-01'')',
        year, year, year + 1);
END;
$$ LANGUAGE plpgsql;

Queries automatically use the correct partition. Partitioning by year means queries for recent files never touch old partitions.
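One way to exercise the helper is with generate_series, creating several years of partitions in a single statement:

```sql
-- Create partitions for 2022 through 2026 using the function above
SELECT create_partition_for_year(y)
FROM generate_series(2022, 2026) AS y;
```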

Create Indexes for Your Common Queries

Index every column you filter or sort by frequently:

-- If you search by filename prefix often
CREATE INDEX idx_files_filename_prefix ON files(filename text_pattern_ops);

-- If you filter by multiple categories
CREATE INDEX idx_files_category_array ON files(category)
    WHERE category IN ('photo', 'video', 'audio');

-- If you join on content_hash frequently
CREATE INDEX idx_files_hash_category ON files(content_hash, category)
    WHERE content_hash IS NOT NULL;

-- If you query by file size ranges
CREATE INDEX idx_files_size_range ON files(size_bytes)
    WHERE size_bytes > 1000000;  -- Only large files

Index cost vs benefit:

Indexes speed reads but slow writes. Every index adds overhead to INSERT and UPDATE operations. For an archive that’s scanned once and queried many times, this tradeoff is obvious. Index everything you query. Writes are rare, reads are constant.
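PostgreSQL also tracks index usage, so you can verify the tradeoff instead of guessing. A query against the built-in pg_stat_user_indexes view surfaces indexes that never get scanned and are safe to drop:

```sql
-- Indexes on files that are rarely or never used
SELECT indexrelname, idx_scan
FROM pg_stat_user_indexes
WHERE relname = 'files'
ORDER BY idx_scan ASC;
```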

Layer 3: Application Customization

The Rails application is where users interact with the archive. This is the most visible customization layer.

Add Models for Specific File Types

The base File model works for everything, but specific file types benefit from specialized models:

# app/models/photo.rb
class Photo < ApplicationRecord
  self.table_name = 'files'

  default_scope { where(category: 'photo') }

  # Parse EXIF data
  def camera
    return nil unless exif_data
    make = exif_data['Make']
    model = exif_data['Model']
    "#{make} #{model}".strip if make || model
  end

  def exposure_settings
    return {} unless exif_data
    {
      iso: exif_data['ISO'],
      aperture: exif_data['FNumber'],
      shutter: exif_data['ExposureTime'],
      focal_length: exif_data['FocalLength']
    }
  end

  def location
    return nil unless gps_lat && gps_lon
    [gps_lat, gps_lon]
  end

  # Scopes
  scope :recent, -> { order(exif_date_taken: :desc) }
  scope :by_camera, ->(camera) { where("exif_data->>'Model' = ?", camera) }
  scope :with_location, -> { where.not(gps_lat: nil, gps_lon: nil) }
  scope :taken_in, ->(year) { where('EXTRACT(YEAR FROM exif_date_taken) = ?', year) }
end

# app/models/video.rb
class Video < ApplicationRecord
  self.table_name = 'files'

  default_scope { where(category: 'video') }

  def duration_formatted
    return nil unless video_duration_seconds
    seconds = video_duration_seconds
    hours = seconds / 3600
    minutes = (seconds % 3600) / 60
    secs = seconds % 60
    "%02d:%02d:%02d" % [hours, minutes, secs]
  end

  scope :by_resolution, ->(res) { where(video_resolution: res) }
  scope :longer_than, ->(minutes) { where('video_duration_seconds > ?', minutes * 60) }
end

# app/models/music_file.rb
class MusicFile < ApplicationRecord
  self.table_name = 'files'

  default_scope { where(category: 'audio') }

  scope :by_artist, ->(artist) { where(audio_artist: artist) }
  scope :by_album, ->(album) { where(audio_album: album) }
  scope :by_genre, ->(genre) { where(audio_genre: genre) }
  scope :by_year, ->(year) { where(audio_year: year) }
end

Use these models in controllers and views:

# app/controllers/photos_controller.rb
class PhotosController < ApplicationController
  def index
    @photos = Photo.recent
                   .with_location
                   .page(params[:page])
                   .per(50)
  end

  def by_camera
    @camera = params[:camera]
    @photos = Photo.by_camera(@camera)
                   .recent
                   .page(params[:page])
  end

  def timeline
    # { year => photo count }, newest first
    @years = Photo.where.not(exif_date_taken: nil)
                  .group('EXTRACT(YEAR FROM exif_date_taken)')
                  .count
                  .sort_by { |year, _| -year }
  end
end

Build Custom Dashboards

Different use cases need different dashboards. A photographer cares about photo timeline and camera stats. A music collector cares about artists and albums. Build specialized views:

Photo dashboard:

<!-- app/views/dashboards/photo.html.erb -->
<div class="photo-dashboard">
  <div class="stats">
    <div class="stat">
      <h3><%= number_with_delimiter(Photo.count) %></h3>
      <p>Total Photos</p>
    </div>
    <div class="stat">
      <h3><%= Photo.with_location.count %></h3>
      <p>With GPS</p>
    </div>
    <div class="stat">
      <h3><%= Photo.distinct.pluck("exif_data->>'Model'").count %></h3>
      <p>Cameras</p>
    </div>
  </div>

  <div class="timeline">
    <h2>Photos by Year</h2>
    <%= render partial: 'photos_by_year_chart' %>
  </div>

  <div class="map">
    <h2>Photo Locations</h2>
    <%= render partial: 'photo_map' %>
  </div>

  <div class="recent">
    <h2>Recent Photos</h2>
    <%= render partial: 'photo_grid', locals: { photos: @recent_photos } %>
  </div>
</div>

Music dashboard:

<!-- app/views/dashboards/music.html.erb -->
<div class="music-dashboard">
  <div class="artists">
    <h2>Artists</h2>
    <ul>
      <% @artists.each do |artist, count| %>
        <li>
          <%= link_to artist, music_files_path(artist: artist) %>
          <span class="count"><%= count %> tracks</span>
        </li>
      <% end %>
    </ul>
  </div>

  <div class="albums">
    <h2>Recent Albums</h2>
    <%= render partial: 'album_grid', locals: { albums: @albums } %>
  </div>

  <div class="genres">
    <h2>By Genre</h2>
    <%= render partial: 'genre_breakdown' %>
  </div>
</div>

Add Tagging and Rating Systems

The base schema doesn’t include tags or ratings. Add them:

rails generate model Tag name:string
rails generate model FileTag file_id:bigint tag_id:bigint
rails db:migrate

# app/models/tag.rb
class Tag < ApplicationRecord
  has_many :file_tags, dependent: :destroy
  has_many :files, through: :file_tags

  validates :name, presence: true, uniqueness: true
end

# app/models/file_tag.rb
class FileTag < ApplicationRecord
  belongs_to :file
  belongs_to :tag
end

# Add to app/models/file.rb
# (Caution: a model named File shadows Ruby's built-in File class inside
# the app. Use ::File for filesystem calls, or rename the model to
# something like ArchiveFile with self.table_name = 'files'.)
class File < ApplicationRecord
  has_many :file_tags, dependent: :destroy
  has_many :tags, through: :file_tags

  def tag_list
    tags.pluck(:name).join(', ')
  end

  def tag_list=(names)
    self.tags = names.split(',').map do |name|
      Tag.find_or_create_by(name: name.strip)
    end
  end
end

Add rating support:

rails generate migration AddRatingToFiles rating:integer
rails db:migrate

# Add to app/models/file.rb
validates :rating, inclusion: { in: 0..5, allow_nil: true }

scope :rated, -> { where.not(rating: nil) }
scope :highly_rated, -> { where('rating >= ?', 4) }

Build UI for tagging and rating in the file detail view.
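A minimal version of that UI, as an ERB sketch. The fields follow the tag_list and rating attributes defined above; the partial name is an assumption:

```erb
<!-- app/views/files/_tagging.html.erb -->
<%= form_with model: @file do |f| %>
  <%= f.text_field :tag_list, placeholder: "comma, separated, tags" %>
  <%= f.select :rating, (0..5).to_a, include_blank: true %>
  <%= f.submit "Save" %>
<% end %>
```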

Create Custom Triage Workflows

Triage is the process of reviewing files marked needs_review = true and deciding what to do with them. Build workflows for different review types:

# app/models/triage_queue.rb
class TriageQueue
  def self.unknown_extensions
    File.where(needs_review: true, review_reason: 'unknown_extension')
        .group(:extension)
        .count
  end

  def self.low_confidence_files
    File.where(needs_review: true, category_confidence: 'low')
        .order(size_bytes: :desc)
        .limit(100)
  end

  def self.files_by_review_reason
    File.where(needs_review: true)
        .group(:review_reason)
        .count
  end
end

# app/controllers/triage_controller.rb
class TriageController < ApplicationController
  def index
    @stats = TriageQueue.files_by_review_reason
    @unknown_exts = TriageQueue.unknown_extensions
  end

  def review
    # Resume after the last skipped id so Skip doesn't loop on the same file
    @file = File.where(needs_review: true)
                .where('id > ?', params[:after].to_i).order(:id).first
    redirect_to triage_index_path, notice: 'No files to review!' unless @file
  end

  def resolve
    @file = File.find(params[:id])
    @file.update!(
      needs_review: false,
      category: params[:category],
      category_confidence: 'manual'
    )
    redirect_to review_triage_path, notice: 'File categorized!'
  end

  def skip
    @file = File.find(params[:id])
    # Pass the skipped id along so review moves past it
    redirect_to review_triage_path(after: @file.id)
  end
end

Build a triage UI that shows one file at a time with context (preview, metadata, similar files) and buttons to categorize or skip.

Layer 4: Integration Points

The archive system has clean boundaries for integrating external services and tools.

AI Classification for Ambiguous Files

Files with unknown extensions or low confidence categories can be classified with AI. Use OpenAI’s vision API for images, or GPT-4 for text content analysis.

Example: Classify images with GPT-4 Vision:

# app/services/ai_classifier.rb
class AiClassifier
  def self.classify_image(file_path)
    client = OpenAI::Client.new(access_token: ENV['OPENAI_API_KEY'])

    # Read image and encode as base64 (::File, not the File model)
    image_data = Base64.strict_encode64(::File.read(file_path))

    response = client.chat(
      parameters: {
        model: "gpt-4-vision-preview",
        messages: [
          {
            role: "user",
            content: [
              { type: "text", text: "What type of image is this? Respond with one word: photo, screenshot, diagram, meme, document, or artwork." },
              { type: "image_url", image_url: { url: "data:image/jpeg;base64,#{image_data}" } }
            ]
          }
        ],
        max_tokens: 50
      }
    )

    category = response.dig("choices", 0, "message", "content").strip.downcase
    { category: category, confidence: 'ai', source: 'openai_gpt4v' }
  end
end

# Usage in a rake task
# lib/tasks/classify.rake
namespace :archive do
  desc "Classify ambiguous images with AI"
  task classify_images: :environment do
    files = File.where(
      category: 'photo',
      needs_review: true,
      review_reason: 'ambiguous'
    ).limit(100)

    # find_each ignores :limit on some Rails versions; each is fine for 100 rows
    files.each do |file|
      result = AiClassifier.classify_image(file.path)
      file.update!(
        category: result[:category],
        category_confidence: result[:confidence],
        category_source: result[:source],
        needs_review: false
      )
      puts "Classified: #{file.filename} -> #{result[:category]}"
    rescue => e
      puts "Error: #{file.filename} - #{e.message}"
    end
  end
end

Run periodically:

rails archive:classify_images

Cost control: The OpenAI Vision API costs about $0.01 per image. Batch small images together and only classify files you actually care about.

Thumbnail Generation

Generate thumbnails for photos and videos to speed up browsing:

# app/services/thumbnail_generator.rb
require 'mini_magick'
require 'fileutils'

class ThumbnailGenerator
  THUMBNAIL_SIZES = {
    small: 200,
    medium: 600,
    large: 1200
  }

  def self.generate_for_photo(file)
    return unless file.category == 'photo'

    THUMBNAIL_SIZES.each do |size_name, width|
      thumbnail_path = thumbnail_path_for(file, size_name)
      next if ::File.exist?(thumbnail_path)  # ::File, not the File model

      FileUtils.mkdir_p(::File.dirname(thumbnail_path))
      image = MiniMagick::Image.open(file.path)
      image.resize "#{width}x#{width}>"  # ">" resizes only if larger
      image.write thumbnail_path
    end

    # Assumes a thumbnail_path column on files (add it with a migration)
    file.update!(thumbnail_path: thumbnail_path_for(file, :medium))
  rescue => e
    Rails.logger.error("Thumbnail generation failed for #{file.path}: #{e.message}")
  end

  def self.thumbnail_path_for(file, size)
    ::File.join(Rails.root, 'public', 'thumbnails', size.to_s, "#{file.id}.jpg")
  end
end

# lib/tasks/thumbnails.rake
namespace :archive do
  desc "Generate thumbnails for all photos"
  task generate_thumbnails: :environment do
    Photo.where(thumbnail_path: nil).find_each do |photo|
      ThumbnailGenerator.generate_for_photo(photo)
      print "."
    end
    puts "\nDone!"
  end
end

For videos, use FFmpeg to extract a frame:

ffmpeg -i input.mp4 -ss 00:00:01.000 -vframes 1 thumbnail.jpg

Wrap this in a Ruby service similar to the photo thumbnail generator.
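A sketch of that service, assuming the same files columns used above (the class name, output paths, and one-second seek offset are my own, not part of the original system):

```ruby
# app/services/video_thumbnail_generator.rb (sketch)
require 'fileutils'

class VideoThumbnailGenerator
  # Build the ffmpeg argv: grab one frame, one second in.
  def self.command_for(input_path, output_path, seek: '00:00:01.000')
    ['ffmpeg', '-y', '-i', input_path, '-ss', seek, '-vframes', '1', output_path]
  end

  def self.generate(file)
    return unless file.category == 'video'

    out = ::File.join('public', 'thumbnails', 'video', "#{file.id}.jpg")
    FileUtils.mkdir_p(::File.dirname(out))
    system(*command_for(file.path, out))  # false if ffmpeg exits non-zero
  end
end
```

Passing the command as an argv array to system avoids shell interpolation, so paths with spaces or quotes are safe.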

Face Detection

Detect faces in photos using AWS Rekognition or a local model:

# app/services/face_detector.rb
class FaceDetector
  def self.detect_faces_aws(file)
    client = Aws::Rekognition::Client.new(region: 'us-east-1')

    response = client.detect_faces({
      image: { bytes: ::File.read(file.path) },  # ::File, not the File model
      attributes: ['ALL']
    })

    faces = response.face_details.map do |face|
      {
        confidence: face.confidence,
        bounding_box: face.bounding_box.to_h,
        emotions: face.emotions.map { |e| { type: e.type, confidence: e.confidence } },
        age_range: { low: face.age_range.low, high: face.age_range.high }
      }
    end

    file.update!(exif_data: (file.exif_data || {}).merge({ faces: faces }))
  end
end

A local alternative is OpenCV face detection via Ruby bindings: no per-image AWS costs, though you'll want a GPU for reasonable speed on a large archive.

OCR for Documents

Extract text from scanned documents and PDFs with Tesseract:

# app/services/ocr_extractor.rb
require 'shellwords'

class OcrExtractor
  def self.extract_text(file)
    return unless file.category == 'document'
    return unless ['.jpg', '.png', '.tiff', '.pdf'].include?(file.extension)

    # Escape the path so spaces and quotes can't break the command
    text = `tesseract #{Shellwords.escape(file.path)} stdout`

    file.update!(
      exif_data: (file.exif_data || {}).merge({ ocr_text: text }),
      document_word_count: text.split.size
    )
  end
end

Make extracted text searchable:

-- Add full-text search to OCR text (requires the pg_trgm extension)
CREATE EXTENSION IF NOT EXISTS pg_trgm;

CREATE INDEX idx_files_ocr_text ON files
USING GIN ((exif_data->>'ocr_text') gin_trgm_ops)
WHERE exif_data ? 'ocr_text';

Query:

File.where("exif_data->>'ocr_text' ILIKE ?", "%invoice%")

Video Analysis with FFprobe

Extract video metadata (duration, resolution, codec, framerate):

# app/services/video_analyzer.rb
require 'shellwords'

class VideoAnalyzer
  def self.analyze(file)
    return unless file.category == 'video'

    # Shellwords.escape guards against spaces and quotes in the path
    json = `ffprobe -v quiet -print_format json -show_format -show_streams #{Shellwords.escape(file.path)}`
    data = JSON.parse(json)

    video_stream = data['streams'].find { |s| s['codec_type'] == 'video' }
    format = data['format']

    file.update!(
      video_duration_seconds: format['duration'].to_i,
      video_resolution: "#{video_stream['width']}x#{video_stream['height']}",
      video_codec: video_stream['codec_name'],
      # r_frame_rate is a fraction string like "30000/1001"; Rational parses
      # it safely (eval is dangerous and would integer-divide to 29)
      video_framerate: Rational(video_stream['r_frame_rate']).to_f.round(2)
    )
  rescue => e
    Rails.logger.error("Video analysis failed: #{e.message}")
  end
end

Run in batch:

rails runner "Video.where(video_duration_seconds: nil).find_each { |v| VideoAnalyzer.analyze(v) }"

Adapt This: Automation Ideas

Build automation around scanning, backup, and maintenance tasks.

Auto-Backup on Scan Completion

After scanning a new drive, automatically back up the database and new files to cloud storage:

#!/bin/bash
# scripts/scan_and_backup.sh

DRIVE_PATH="$1"
DB_NAME="archive"
BACKUP_DIR="$HOME/archive_backups"
S3_BUCKET="s3://my-archive-backup"

echo "Scanning $DRIVE_PATH..."
./scanner/scanner -source "$DRIVE_PATH" -db "postgres://localhost/$DB_NAME"

echo "Backing up database..."
pg_dump "$DB_NAME" | gzip > "$BACKUP_DIR/archive_$(date +%Y%m%d_%H%M%S).sql.gz"

echo "Uploading to S3..."
aws s3 sync "$BACKUP_DIR" "$S3_BUCKET/database/"

echo "Done!"

Run via cron or whenever you plug in a new drive.

Slack/Discord Notifications

Send notifications when scans complete or when interesting files are found:

# app/services/notifier.rb
require 'net/http'
require 'json'

class Notifier
  def self.notify_slack(message)
    uri = URI(ENV['SLACK_WEBHOOK_URL'])
    payload = { text: message }.to_json

    Net::HTTP.post_form(uri, payload: payload)
  end

  def self.scan_complete(stats)
    message = "Archive scan complete!\n" \
              "Files: #{stats[:files_processed]}\n" \
              "Size: #{stats[:total_bytes] / 1.gigabyte}GB\n" \
              "Duration: #{stats[:duration_seconds] / 60} minutes"
    notify_slack(message)
  end

  def self.interesting_files_found(files)
    message = "Found #{files.count} interesting files:\n" +
              files.map { |f| "- #{f.filename}" }.join("\n")
    notify_slack(message)
  end
end

Scheduled Re-Scans

Set up cron jobs to automatically re-scan specific directories for changes:

# crontab -e
# Scan Downloads folder daily at 2am
0 2 * * * cd /path/to/scanner && ./scanner -source ~/Downloads -db "postgres://localhost/archive"

# Scan external drives weekly (when mounted)
0 3 * * 0 cd /path/to/scanner && [ -d /Volumes/Backup1 ] && ./scanner -source /Volumes/Backup1 -db "postgres://localhost/archive"

Duplicate Cleanup Reports

Generate weekly reports of duplicates and wasted space:

# lib/tasks/reports.rake
namespace :archive do
  desc "Generate duplicate files report"
  task duplicate_report: :environment do
    duplicates = File.select('content_hash, COUNT(*) as count, size_bytes')
                     .where.not(content_hash: nil)
                     .group(:content_hash, :size_bytes)
                     .having('COUNT(*) > 1')
                     .order('COUNT(*) DESC, size_bytes DESC')
                     .limit(100)

    total_wasted = duplicates.sum { |d| d.size_bytes * (d.count - 1) }

    report = "=== Duplicate Files Report ===\n"
    report += "Total wasted space: #{total_wasted / 1.gigabyte}GB\n\n"

    duplicates.each do |dup|
      paths = File.where(content_hash: dup.content_hash).pluck(:path)
      report += "#{dup.count} copies of #{paths.first}\n"
      report += "  Size: #{dup.size_bytes / 1.megabyte}MB\n"
      report += "  Wasted: #{dup.size_bytes * (dup.count - 1) / 1.megabyte}MB\n"
      paths.each { |p| report += "  - #{p}\n" }
      report += "\n"
    end

    FileUtils.mkdir_p('reports')
    ::File.write('reports/duplicates.txt', report)  # ::File, not the File model
    Notifier.notify_slack("Duplicate report generated. #{total_wasted / 1.gigabyte}GB wasted.")
  end
end

Run weekly via cron:

0 9 * * 1 cd /path/to/rails/app && rails archive:duplicate_report

Layer 5: Alternative Technologies

The default stack is Go + PostgreSQL + Rails. You can swap pieces based on your preferences and constraints.

Scanner Alternatives

Rust (ripgrep-style):

Rust offers similar performance to Go with stricter memory safety guarantees. Use walkdir for filesystem traversal, xxhash-rust for hashing, and rexiv2 for EXIF.

Pros: Even faster than Go for CPU-bound tasks, excellent error handling, mature ecosystem.

Cons: Steeper learning curve, longer compile times, smaller community than Go.

When to choose: You’re comfortable with Rust, or you need maximum performance (scanning 10M+ files).

Python (easier but slower):

Python is more accessible but significantly slower. Use pathlib for traversal, xxhash for hashing, piexif for EXIF.

Pros: Easy to modify, huge ecosystem, great for experimentation.

Cons: 5-10x slower than Go, GIL limits parallelism, dependency management headaches.

When to choose: You’re prototyping, or your archive is small (under 500K files).

Database Alternatives

SQLite (simpler):

SQLite is a single-file database with zero configuration. Enable the FTS5 extension (compiled into most modern SQLite builds) for full-text search.

Pros: No server setup, portable (one file), built into everything.

Cons: No fuzzy search (no pg_trgm equivalent), no PostGIS (no GPS queries), weaker JSON support, locks on writes.

When to choose: Your archive is under 1M files and you don’t need GPS queries or fuzzy search.

DuckDB (analytics):

DuckDB is an embedded analytics database. Blazingly fast for aggregations and reporting queries.

Pros: Columnar storage (fast aggregations), Parquet export, SQL analytics, embedded like SQLite.

Cons: Write-heavy workloads are slower, fewer extensions than PostgreSQL, smaller community.

When to choose: You’re doing heavy analytics (reporting, dashboards) and writes are infrequent.

Application Alternatives

Django (Python):

Django is a mature Python web framework with excellent admin interface and ORM.

Pros: Auto-generated admin UI, large ecosystem, easy to learn.

Cons: No real performance advantage over Rails, and the ORM is less flexible than ActiveRecord.

When to choose: You prefer Python over Ruby.

Next.js (React):

Build a modern SPA with Next.js for the frontend and a minimal API backend (Express, FastAPI, Go).

Pros: Modern UI/UX, great performance, easy deployment (Vercel).

Cons: More JavaScript complexity, client-side state management, API design overhead.

When to choose: You want a slick modern interface and are comfortable with React.

Svelte (leaner):

Svelte compiles to vanilla JavaScript with no runtime overhead. Lighter and faster than React.

Pros: Simpler than React, great performance, smaller bundle size.

Cons: Smaller ecosystem, fewer ready-made components.

When to choose: You want a modern SPA but React feels too heavy.

Watch Out: Over-Engineering

Customization is powerful but dangerous. Every custom feature is code you have to maintain. Every integration is a dependency that can break.

Start small. Use the default scanner, database, and application until you hit a real limitation. Don’t add features you might need someday. Add features when you feel the pain of not having them.

Resist scope creep. The goal is a working archive system, not a perfect one. Perfect is the enemy of done. Ship something that works, then iterate.

Document your changes. Six months from now you won’t remember why you modified the scanner’s skip patterns or added a custom JSONB column. Add comments. Keep a changelog.

Future Ideas: Where This Could Go

Here are features I haven’t built yet but keep thinking about:

Semantic search with embeddings: Generate vector embeddings for images (CLIP) and text (sentence-transformers). Query by concept: “photos of beaches” or “documents about taxes.”

Timeline reconstruction: Build a visual timeline of your entire digital life. Photos, videos, documents, code commits, all in one chronological view.

Project detection: Automatically group related files into projects (detect code repos, photo albums, video edits) using path patterns and timestamps.

Content similarity: Find visually similar photos (perceptual hashing), duplicate videos (content fingerprinting), and near-duplicate documents (TF-IDF).

Automated tagging: Use CLIP or GPT-4V to auto-tag images. Run in background, review and approve in triage UI.

Archive versioning: Track changes over time. When you re-scan a directory, detect moved/renamed files, deleted files, and new files. Show history.

Multi-user archives: Share portions of your archive with family. Collaborative tagging. Permission system.

Archival format conversion: Automatically convert obsolete formats (old video codecs, proprietary document formats) to modern standards before the originals become unreadable.

All of these are possible with the architecture we’ve built. The database schema supports it. The scanner provides the raw data. The application is the UI layer.

Making It Yours

The system you build will be different from mine. Your archive is different. Your priorities are different. Your tolerance for complexity is different.

That’s the point. This isn’t a product you install and use as-is. It’s a framework you adapt and extend. The scanner, database, and application are starting points. The real value is in the customization layer you build on top.

Start with the defaults. Run a scan. Explore the data. Find the gaps. Then come back to this guide and add the features you actually need. Your archive will tell you what it needs. Listen to it.


Next: Lessons Learned - Reflections on what worked and what we’d do differently.

