Architecture: Three Layers, Clear Boundaries

The personal digital archive system is built on a simple principle: separate what you find from what you decide to keep. This architecture has three layers, each with a single responsibility. No layer reaches into another layer’s domain.

The Big Picture

flowchart TB
    subgraph FS["Your Filesystem"]
        FILES["~/Photos, /Volumes/Backup, ~/Desktop, etc.<br/>3.5M files scattered across multiple drives"]
    end

    subgraph SCANNER["Scanner Layer (Go)"]
        S1["Write-only, high-performance"]
        S2["Walks filesystem trees in parallel"]
        S3["Computes content hashes (xxHash64)"]
        S4["Extracts EXIF metadata"]
        S5["Inserts to files table"]
    end

    subgraph DB["Database Layer (PostgreSQL)"]
        subgraph SD["Scanner Domain"]
            FTABLE["files<br/>(3.5M rows)<br/>Scanner owns, Write-only"]
        end
        subgraph AD["Application Domain"]
            ATABLES["directories (1.1M)<br/>projects (2.2k)<br/>unique_files (1.2M)<br/>photos (418k)<br/>App owns, Read-write"]
        end
        EXT["Extensions: pg_trgm, PostGIS, pgvector, ltree"]
    end

    subgraph APP["Application Layer (Rails 8)"]
        A1["Read-heavy browsing and triage"]
        A2["Scanner::FileEntry (read-only model)"]
        A3["Application models (Directory, Photo, etc.)"]
        A4["Web UI for browsing, search, triage"]
    end

    subgraph CLOUD["Cloud Storage (S3)"]
        C1["Backup destination for curated files"]
    end

    FS -->|"Parallel scan with xxHash64<br/>EXIF extraction<br/>Batched inserts"| SCANNER
    SCANNER -->|"Raw file records<br/>(path, size, hash, modified_time, exif)"| DB
    DB -->|"JOIN queries<br/>Hybrid model reads"| APP
    APP -->|"Selected files for preservation<br/>(after triage decisions)"| CLOUD

Layer 1: Scanner (Go)

The scanner has one job: find every file on your drives and record what it sees. It does not interpret. It does not decide what matters. It writes facts to the database and moves on.

Responsibilities:

  • Traverse filesystem trees in parallel (configurable workers)
  • Compute content hashes using xxHash64
  • Extract EXIF metadata from images
  • Batch INSERT operations (1000 records per transaction)
  • Resume interrupted scans
  • Never touch application tables

What it writes:

  • File path (absolute)
  • File size (bytes)
  • Content hash (xxHash64)
  • Modified timestamp
  • EXIF data (if image)
  • Scan timestamp

What it never does:

  • Read from the database (except to check resume state)
  • Decide if a file is important
  • Delete records
  • Update existing records (scans are append-only)
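The scanner's contract — insert-only writes plus a single read for resume state — can be sketched as follows. This is a Ruby sketch for brevity (the real scanner is Go), and `AppendOnlyStore` and `scan_file` are illustrative names, not actual scanner code.

```ruby
# Sketch of the scanner's append-only write contract. Records are only
# ever inserted; the one permitted read checks whether a path was already
# recorded, so an interrupted scan can resume without duplicating work.
ScanRecord = Struct.new(:path, :size, :content_hash, :modified_at, :exif, :scanned_at)

class AppendOnlyStore
  attr_reader :rows

  def initialize
    @rows = []   # stands in for the files table
  end

  # The only write operation the scanner performs: INSERT.
  def insert(record)
    @rows << record
  end

  # Resume support: the one read the scanner is allowed.
  def already_scanned?(path)
    @rows.any? { |r| r.path == path }
  end
end

def scan_file(store, path, size, hash)
  return :skipped if store.already_scanned?(path)   # resume interrupted scan
  store.insert(ScanRecord.new(path, size, hash, Time.now, nil, Time.now))
  :inserted
end
```

Re-running `scan_file` on an already-recorded path is a no-op, which is what makes interrupted scans safe to restart.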

Why This Choice: Go for Scanning

Go gives you three things you need for filesystem work:

  1. Goroutines: Spin up hundreds of workers without thinking about thread pools
  2. Compiled speed: Process 3.5M files without waiting hours
  3. Simple deployment: One binary, no runtime dependencies

The scanner runs when you plug in a new drive or remember an old backup. It needs to be fast and it needs to just work.

Layer 2: Database (PostgreSQL)

PostgreSQL is the only source of truth. Every file the scanner found lives here. Every triage decision you make lives here. Nothing happens in memory that isn’t persisted here first.

Scanner Domain Tables:

files (3.5M rows)

  • Owned exclusively by the scanner
  • Application reads but never writes
  • Contains every file ever scanned
  • May include duplicates (same file in multiple locations)
  • Immutable after scanner runs

Application Domain Tables:

directories (1.1M rows)

  • Groups files by directory
  • Tracks user decisions about directory importance

unique_files (1.2M rows)

  • Deduplicated files (by content hash)
  • One record per unique file content

photos (418k rows)

  • Subset of files identified as photos
  • Enriched with location, camera metadata
  • Linked to albums, events, people

projects (2.2k rows)

  • High-level organization
  • Creative work, life events, archives

Extensions in use:

Extension      Purpose
pg_trgm        Fuzzy text search (filename matching)
PostGIS        Geospatial queries (photo locations)
pgvector       Similarity search (future: visual embeddings)
ltree          Directory hierarchy queries
fuzzystrmatch  Approximate string matching
unaccent       Unicode normalization

Why This Choice: PostgreSQL Extensions

You could build fuzzy search in application code. You could use Elasticsearch for text search. You could use a separate geospatial database.

Or you could use PostgreSQL with extensions and keep everything in one place.

The benefits:

  • One database to back up
  • One connection pool to manage
  • JOIN across search, location, and hierarchy in a single query
  • No data synchronization between systems
  • No “eventually consistent” problems

The tradeoff is learning PostgreSQL deeply. The payoff is a system that does not surprise you.

Layer 3: Application (Rails 8)

The Rails application is read-heavy. It queries the database, renders web pages, and lets you make decisions about what to keep. It writes to application tables but treats the files table as read-only.

The Hybrid Model:

# Scanner domain (read-only)
class Scanner::FileEntry < ApplicationRecord
  self.table_name = 'files'

  # Enforce the boundary: any save or destroy raises
  # ActiveRecord::ReadOnlyRecord. Only query and display.
  def readonly?
    true
  end
end

# Application domain (read-write)
class Photo < ApplicationRecord
  belongs_to :file_entry, class_name: 'Scanner::FileEntry'

  # Application decisions live here:
  # - Album assignments
  # - Favorite markers
  # - Rotation corrections
  # - Manual tags
end

Responsibilities:

  • Display file listings and search results
  • Provide triage UI (keep/delete/maybe)
  • Track user decisions in application tables
  • Upload selected files to cloud storage
  • Generate statistics and visualizations
  • Never write to scanner tables

What the hybrid model prevents:

  • Scanner cannot accidentally delete your triage work
  • Application cannot corrupt raw scan data
  • Re-running scanner does not lose your decisions
  • Both systems can evolve independently

Why This Choice: Hybrid Over Single Model

You could make the scanner write to the same tables the application uses. The scanner could populate photos directly. The application could update files when you rename something.

This is a mistake.

The problem with merging domains:

When you re-run the scanner (and you will), it needs to know what it owns. If the scanner and application share tables, you have three bad options:

  1. Scanner deletes everything and starts fresh (loses your triage work)
  2. Scanner tries to merge (complex, error-prone, slow)
  3. Scanner gives up on re-scanning (defeats the purpose)

The hybrid model solves this:

  • Scanner drops and rebuilds files table: safe
  • Application tables reference files by hash: still work
  • Stale references (deleted files): application can detect and clean up
  • New files appear immediately: no manual import needed
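The core claim — application tables survive a files rebuild because they reference content hashes rather than row ids — can be sketched with hypothetical in-memory structures standing in for the two domains:

```ruby
# Sketch: application records point at scanner rows by content hash, so
# dropping and rebuilding the files table leaves references intact, and
# stale references become detectable rather than destructive.
files  = { "h1" => "/old/a.jpg", "h2" => "/old/b.jpg" }       # files table, keyed by hash
photos = [{ hash: "h1", album: "Trip" }, { hash: "h2", album: "Trip" }]

# Re-run the scanner: drop and rebuild; the file behind "h2" is gone.
files = { "h1" => "/new/a.jpg", "h3" => "/new/c.jpg" }

resolved = photos.select { |p| files.key?(p[:hash]) }   # still work
stale    = photos.reject { |p| files.key?(p[:hash]) }   # detect and clean up
```

Nothing in the rebuild touched `photos`; the album assignments survive, and the one stale reference is identifiable for cleanup.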

Data Flow: Files to Cloud

Understanding how data moves through the system makes the architecture obvious.

flowchart LR
    subgraph P1["Phase 1: Discovery"]
        D1["Scanner walks tree"] --> D2["Hash & extract EXIF"] --> D3["Batch INSERT to files"]
    end

    subgraph P2["Phase 2: Deduplication"]
        DD1["Query DISTINCT hashes"] --> DD2["Create unique_files records"]
    end

    subgraph P3["Phase 3: Classification"]
        C1["Identify images from EXIF"] --> C2["Create photos records"] --> C3["Extract GPS locations"]
    end

    subgraph P4["Phase 4: Triage"]
        T1["User browses in web UI"] --> T2["Marks favorites, assigns albums"] --> T3["Decides what to backup"]
    end

    subgraph P5["Phase 5: Backup"]
        B1["Query photos marked for backup"] --> B2["Read file from disk"] --> B3["Upload to S3"]
    end

    P1 --> P2 --> P3 --> P4 --> P5

Phase 1: Discovery (Scanner → Database)

  1. Scanner starts with root path: ~/Photos
  2. Walks tree with N parallel workers (default: 10)
  3. For each file:
    • Compute xxHash64 of contents
    • Extract EXIF if image
    • Add to batch buffer
  4. Every 1000 files, INSERT batch to files table
  5. Repeat until tree exhausted

Result: files table contains every file found, with content hash and metadata.
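The batching step above can be sketched as follows — a Ruby stand-in for the Go scanner, with `BatchInserter` as an illustrative name and the batch size matching the 1000-record batches described:

```ruby
# Minimal sketch of batched inserts: records accumulate in a buffer and
# are flushed every BATCH_SIZE files, with a final flush for the remainder.
class BatchInserter
  BATCH_SIZE = 1000

  attr_reader :flushes

  def initialize
    @buffer  = []
    @flushes = []   # each entry is one INSERT transaction's worth of rows
  end

  def add(record)
    @buffer << record
    flush if @buffer.size >= BATCH_SIZE
  end

  def flush
    return if @buffer.empty?
    @flushes << @buffer   # stands in for one batched INSERT
    @buffer = []
  end
end

inserter = BatchInserter.new
2500.times { |i| inserter.add("file-#{i}") }
inserter.flush   # flush the final partial batch of 500
```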

Phase 2: Deduplication (Application Processing)

  1. Application runs background job
  2. Query: SELECT DISTINCT ON (content_hash) * FROM files ORDER BY content_hash
  3. Create unique_files records
  4. Link back to all files table entries with same hash

Result: unique_files table has one record per content, with references to all copies.
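The deduplication pass reduces to a group-by over content hashes. A plain-Ruby sketch, with hypothetical in-memory rows standing in for the files and unique_files tables:

```ruby
# Sketch: collapse files rows to one unique_files record per content hash,
# keeping back-references to every copy.
files = [
  { path: "~/Photos/a.jpg",        hash: "h1" },
  { path: "/Volumes/Backup/a.jpg", hash: "h1" },   # duplicate content
  { path: "~/Desktop/b.jpg",       hash: "h2" },
]

unique_files = files.group_by { |f| f[:hash] }.map do |hash, copies|
  { content_hash: hash, copies: copies.map { |c| c[:path] } }
end
```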

Phase 3: Classification (Application Processing)

  1. Application identifies images from EXIF data
  2. Creates photos records
  3. Extracts location from GPS EXIF
  4. Associates with directories, projects

Result: photos table ready for browsing and triage.
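The classification step can be sketched in the same style — files with EXIF data become photo candidates, and GPS coordinates are lifted out when present. Field names here are illustrative, not the actual schema:

```ruby
# Sketch: classify scanned files as photos when EXIF data is present,
# and extract a location from the EXIF blob when GPS tags exist.
rows = [
  { path: "a.jpg",     exif: { "GPSLatitude" => 40.7, "GPSLongitude" => -74.0 } },
  { path: "b.jpg",     exif: {} },     # image, but no GPS tags
  { path: "notes.txt", exif: nil },    # not an image
]

photos = rows.select { |r| r[:exif] }.map do |r|
  lat, lon = r[:exif].values_at("GPSLatitude", "GPSLongitude")
  { path: r[:path], location: (lat && lon) ? [lat, lon] : nil }
end
```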

Phase 4: Triage (Human Decision)

  1. User browses photos in web UI
  2. Marks favorites, assigns to albums
  3. Decides which to back up to cloud
  4. Application writes decisions to photos, albums tables

Result: Application tables contain human decisions, scanner tables unchanged.

Phase 5: Backup (Application → Cloud)

  1. User triggers backup job
  2. Application queries photos marked for backup
  3. Reads file from disk using path from files table
  4. Uploads to S3 via ActiveStorage
  5. Records cloud location in application tables

Result: Selected files preserved in cloud, linked to triage metadata.
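The selection step of the backup job can be sketched as a join from triage decisions back to on-disk paths via content hash. The fields are hypothetical and the actual upload (which goes through ActiveStorage) is omitted; only the worklist construction is shown:

```ruby
# Sketch: build the upload worklist from photos marked for backup,
# resolving each content hash to its on-disk path in the files data.
files  = { "h1" => "~/Photos/a.jpg", "h2" => "~/Photos/b.jpg" }
photos = [
  { hash: "h1", backup: true  },
  { hash: "h2", backup: false },
]

worklist = photos.select { |p| p[:backup] }
                 .map    { |p| { hash: p[:hash], path: files[p[:hash]] } }
```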

Technology Choices

Layer          Technology      Why
Scanner        Go 1.21+        Parallel I/O, single binary deployment
Database       PostgreSQL 15+  Rich extension ecosystem, reliable
Application    Rails 8         Rapid UI development, ActiveStorage
Hashing        xxHash64        Fast non-cryptographic hash for deduplication
Cloud Storage  S3-compatible   Standard API, multiple provider options

Why This Choice: xxHash64 Over SHA256

Cryptographic hashes like SHA256 are slow. You do not need cryptographic properties for deduplication. You need to answer one question: “Do these two files contain the same content?”

xxHash64 is roughly ten times faster than SHA256, and for personal archives the collision probability is negligible, even with millions of files.

If you need cryptographic verification (detecting tampering), add SHA256 as a separate column and compute it only for critical files.
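A quick birthday-bound estimate backs up the "negligible" claim for a 64-bit hash:

```ruby
# Back-of-the-envelope check: the birthday bound for n items hashed into
# a 64-bit space puts the chance of any collision at roughly
# n * (n - 1) / (2 * 2**64).
n = 3_500_000
p_collision = n.to_f * (n - 1) / (2.0 * 2**64)
# For 3.5M files this works out to the order of 3e-7 -- about one in
# three million odds of even a single accidental collision.
```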

Scaling Considerations

This architecture handles 3.5M files on a single PostgreSQL instance. Here is when you need to think about scaling:

Scanner Scaling

Current limits:

  • 10 parallel workers (configurable)
  • 1000 records per batch INSERT
  • Handles 100k files in ~2 minutes

Scale up when:

  • Scanning takes longer than drive connection time (plug in, scan, unplug)
  • Database becomes bottleneck on INSERT

How to scale:

  • Increase worker count (watch disk I/O, not just CPU)
  • Increase batch size (watch transaction time)
  • Use COPY instead of INSERT for initial load
  • Partition files table by scan date

Database Scaling

Current limits:

  • 3.5M files table
  • 1.1M directories
  • 418k photos
  • All on single instance

Scale up when:

  • Query times exceed 100ms for common operations
  • Disk space exceeds single volume
  • Backup/restore time becomes painful

How to scale:

  • Add read replicas for application queries
  • Partition files table by path prefix or scan date
  • Use table inheritance for files by type
  • Move old scans to archive tables

Application Scaling

Current limits:

  • Single Rails instance
  • Read-heavy workload
  • Occasional background jobs

Scale up when:

  • Response times exceed 200ms
  • Background jobs create queue backlog
  • Concurrent users impact performance

How to scale:

  • Add application servers (Rails is stateless)
  • Move background jobs to separate workers
  • Add Redis for caching and job queue
  • Use CDN for static assets

The Elegance of Separation

The architecture works because each layer trusts the others to do their job and nothing more.

Scanner trusts: Database will store what it writes, application will read it correctly.

Database trusts: Scanner writes valid data, application respects read-only boundaries.

Application trusts: Scanner has found all files, database is source of truth.

You trust: Re-running the scanner will not lose your work.

This is architecture as contract. When you need to rebuild the file catalog (corrupted drive, new backup found, scanner bug fixed), you drop the files table and re-run the scanner. Your triage decisions in photos, your albums, your projects remain untouched.

The scanner does not know about your decisions. Your decisions do not depend on any particular scan. The database holds both, separate and safe.

This is how you build systems that last.


Next: The Scanner - How the Go scanner achieves high performance with parallel workers and batched writes.

