Architecture: Three Layers, Clear Boundaries
The personal digital archive system is built on a simple principle: separate what you find from what you decide to keep. This architecture has three layers, each with a single responsibility. No layer reaches into another layer’s domain.
The Big Picture
```mermaid
flowchart TB
    subgraph FS["Your Filesystem"]
        FILES["~/Photos, /Volumes/Backup, ~/Desktop, etc.<br/>3.5M files scattered across multiple drives"]
    end
    subgraph SCANNER["Scanner Layer (Go)"]
        S1["Write-only, high-performance"]
        S2["Walks filesystem trees in parallel"]
        S3["Computes content hashes (xxHash64)"]
        S4["Extracts EXIF metadata"]
        S5["Inserts to files table"]
    end
    subgraph DB["Database Layer (PostgreSQL)"]
        subgraph SD["Scanner Domain"]
            FTABLE["files<br/>(3.5M rows)<br/>Scanner owns, Write-only"]
        end
        subgraph AD["Application Domain"]
            ATABLES["directories (1.1M)<br/>projects (2.2k)<br/>unique_files (1.2M)<br/>photos (418k)<br/>App owns, Read-write"]
        end
        EXT["Extensions: pg_trgm, PostGIS, pgvector, ltree"]
    end
    subgraph APP["Application Layer (Rails 8)"]
        A1["Read-heavy browsing and triage"]
        A2["Scanner::FileEntry (read-only model)"]
        A3["Application models (Directory, Photo, etc.)"]
        A4["Web UI for browsing, search, triage"]
    end
    subgraph CLOUD["Cloud Storage (S3)"]
        C1["Backup destination for curated files"]
    end
    FS -->|"Parallel scan with xxHash64<br/>EXIF extraction<br/>Batched inserts"| SCANNER
    SCANNER -->|"Raw file records<br/>(path, size, hash, modified_time, exif)"| DB
    DB -->|"JOIN queries<br/>Hybrid model reads"| APP
    APP -->|"Selected files for preservation<br/>(after triage decisions)"| CLOUD
```
Layer 1: Scanner (Go)
The scanner has one job: find every file on your drives and record what it sees. It does not interpret. It does not decide what matters. It writes facts to the database and moves on.
Responsibilities:
- Traverse filesystem trees in parallel (configurable workers)
- Compute content hashes using xxHash64
- Extract EXIF metadata from images
- Batch INSERT operations (1000 records per transaction)
- Resume interrupted scans
- Never touch application tables
What it writes:
- File path (absolute)
- File size (bytes)
- Content hash (xxHash64)
- Modified timestamp
- EXIF data (if image)
- Scan timestamp
What it never does:
- Read from the database (except to check resume state)
- Decide if a file is important
- Delete records
- Update existing records (scans are append-only)
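The append-only, batched write pattern above is easy to see in miniature. A minimal Ruby sketch (the real scanner is Go; `BatchWriter`, its sink, and the record shape are illustrative):

```ruby
# Sketch of the scanner's batched, append-only write pattern.
# The sink stands in for one INSERT transaction per batch.
class BatchWriter
  BATCH_SIZE = 1000

  attr_reader :flushes

  def initialize(&sink)
    @buffer  = []
    @sink    = sink   # e.g. a proc that runs one batched INSERT
    @flushes = 0
  end

  # Append-only: records are buffered, never updated or deleted.
  def add(record)
    @buffer << record
    flush if @buffer.size >= BATCH_SIZE
  end

  # Called once more at end of scan to write the remainder.
  def flush
    return if @buffer.empty?
    @sink.call(@buffer)
    @flushes += 1
    @buffer = []
  end
end
```

Every 1000 records triggers one transaction; a final `flush` catches the remainder when the tree is exhausted.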
Why This Choice: Go for Scanning
Go gives you three things you need for filesystem work:
- Goroutines: Spin up hundreds of workers without thinking about thread pools
- Compiled speed: Process 3.5M files without waiting hours
- Simple deployment: One binary, no runtime dependencies
The scanner runs when you plug in a new drive or remember an old backup. It needs to be fast and it needs to just work.
Layer 2: Database (PostgreSQL)
PostgreSQL is the only source of truth. Every file the scanner found lives here. Every triage decision you make lives here. Nothing happens in memory that isn’t persisted here first.
Scanner Domain Tables:
files (3.5M rows)
- Owned exclusively by the scanner
- Application reads but never writes
- Contains every file ever scanned
- May include duplicates (same file in multiple locations)
- Immutable after scanner runs
Application Domain Tables:
directories (1.1M rows)
- Groups files by directory
- Tracks user decisions about directory importance
unique_files (1.2M rows)
- Deduplicated files (by content hash)
- One record per unique file content
photos (418k rows)
- Subset of files identified as photos
- Enriched with location, camera metadata
- Linked to albums, events, people
projects (2.2k rows)
- High-level organization
- Creative work, life events, archives
Extensions in use:
| Extension | Purpose |
|---|---|
| pg_trgm | Fuzzy text search (filename matching) |
| PostGIS | Geospatial queries (photo locations) |
| pgvector | Similarity search (future: visual embeddings) |
| ltree | Directory hierarchy queries |
| fuzzystrmatch | Approximate string matching |
| unaccent | Unicode normalization |
Why This Choice: PostgreSQL Extensions
You could build fuzzy search in application code. You could use Elasticsearch for text search. You could use a separate geospatial database.
Or you could use PostgreSQL with extensions and keep everything in one place.
The benefits:
- One database to back up
- One connection pool to manage
- JOIN across search, location, and hierarchy in a single query
- No data synchronization between systems
- No “eventually consistent” problems
The tradeoff is learning PostgreSQL deeply. The payoff is a system that does not surprise you.
Layer 3: Application (Rails 8)
The Rails application is read-heavy. It queries the database, renders web pages, and lets you make decisions about what to keep. It writes to application tables but treats the files table as read-only.
The Hybrid Model:
```ruby
# Scanner domain (read-only)
class Scanner::FileEntry < ApplicationRecord
  self.table_name = 'files'

  # Never save, never destroy -- only query and display.
  # Enforce the contract: Rails raises ActiveRecord::ReadOnlyRecord
  # on any attempt to save or destroy a readonly record.
  def readonly?
    true
  end
end

# Application domain (read-write)
class Photo < ApplicationRecord
  belongs_to :file_entry, class_name: 'Scanner::FileEntry'

  # Application decisions live here:
  # - Album assignments
  # - Favorite markers
  # - Rotation corrections
  # - Manual tags
end
```
Responsibilities:
- Display file listings and search results
- Provide triage UI (keep/delete/maybe)
- Track user decisions in application tables
- Upload selected files to cloud storage
- Generate statistics and visualizations
- Never write to scanner tables
What the hybrid model prevents:
- Scanner cannot accidentally delete your triage work
- Application cannot corrupt raw scan data
- Re-running scanner does not lose your decisions
- Both systems can evolve independently
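The read-only half of that contract does not depend on Rails. A plain-Ruby sketch of the same guarantee (`ReadOnlyRecordError` and `FileEntry` are illustrative names, not the app's real classes):

```ruby
# Plain-Ruby sketch of the read-only contract: scanner-owned
# records answer queries but refuse mutation.
class ReadOnlyRecordError < StandardError; end

class FileEntry
  attr_reader :path, :content_hash

  def initialize(path:, content_hash:)
    @path = path
    @content_hash = content_hash
  end

  # Mutation is refused; only the scanner writes this data.
  def save
    raise ReadOnlyRecordError, "files rows are scanner-owned"
  end
  alias destroy save
end
```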
Why This Choice: Hybrid Over Single Model
You could make the scanner write to the same tables the application uses. The scanner could populate photos directly. The application could update files when you rename something.
This is a mistake.
The problem with merging domains:
When you re-run the scanner (and you will), it needs to know what it owns. If the scanner and application share tables, you have three bad options:
- Scanner deletes everything and starts fresh (loses your triage work)
- Scanner tries to merge (complex, error-prone, slow)
- Scanner gives up on re-scanning (defeats the purpose)
The hybrid model solves this:
- Scanner drops and rebuilds the `files` table: safe
- Application tables reference files by content hash: still work
- Stale references (deleted files): application can detect and clean up
- New files appear immediately: no manual import needed
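Stale-reference cleanup after a re-scan reduces to a set difference on content hashes. A minimal sketch, assuming records carry a `content_hash` field (method and record shapes are illustrative):

```ruby
require 'set'

# After the scanner rebuilds the files table, application records
# whose content hash no longer appears anywhere on disk are stale.
# Inputs are plain hashes here; in the app these would be table rows.
def stale_photo_hashes(photos, files)
  live = files.map { |f| f[:content_hash] }.to_set
  photos.map { |p| p[:content_hash] }
        .reject { |h| live.include?(h) }
        .uniq
end
```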
Data Flow: Files to Cloud
Understanding how data moves through the system makes the architecture obvious.
```mermaid
flowchart LR
    subgraph P1["Phase 1: Discovery"]
        D1["Scanner walks tree"] --> D2["Hash & extract EXIF"] --> D3["Batch INSERT to files"]
    end
    subgraph P2["Phase 2: Deduplication"]
        DD1["Query DISTINCT hashes"] --> DD2["Create unique_files records"]
    end
    subgraph P3["Phase 3: Classification"]
        C1["Identify images from EXIF"] --> C2["Create photos records"] --> C3["Extract GPS locations"]
    end
    subgraph P4["Phase 4: Triage"]
        T1["User browses in web UI"] --> T2["Marks favorites, assigns albums"] --> T3["Decides what to backup"]
    end
    subgraph P5["Phase 5: Backup"]
        B1["Query photos marked for backup"] --> B2["Read file from disk"] --> B3["Upload to S3"]
    end
    P1 --> P2 --> P3 --> P4 --> P5
```
Phase 1: Discovery (Scanner → Database)
- Scanner starts with root path: ~/Photos
- Walks tree with N parallel workers (default: 10)
- For each file:
- Compute xxHash64 of contents
- Extract EXIF if image
- Add to batch buffer
- Every 1000 files, INSERT batch to files table
- Repeat until tree exhausted
Result: files table contains every file found, with content hash and metadata.
Phase 2: Deduplication (Application Processing)
- Application runs background job
- Query: `SELECT DISTINCT ON (content_hash) * FROM files ORDER BY content_hash`
- Create unique_files records
- Link back to all files table entries with same hash
Result: unique_files table has one record per content, with references to all copies.
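In the database this is a single `DISTINCT ON (content_hash)` query; the same logic in plain Ruby, with illustrative record shapes, looks like:

```ruby
# Deduplicate scanned file records by content hash:
# one unique_files entry per hash, plus back-references to every copy.
def deduplicate(file_records)
  file_records.group_by { |r| r[:content_hash] }.map do |hash, copies|
    { content_hash:   hash,
      canonical_path: copies.first[:path],        # arbitrary representative
      copy_paths:     copies.map { |r| r[:path] } }
  end
end
```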
Phase 3: Classification (Application Processing)
- Application identifies images from EXIF data
- Creates photos records
- Extracts location from GPS EXIF
- Associates with directories, projects
Result: photos table ready for browsing and triage.
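Classification is essentially a filter over scanned records. A minimal sketch, assuming the scanner stored EXIF as a nullable field with camera and GPS keys (field names are illustrative):

```ruby
# A file record becomes a photo candidate when the scanner extracted
# EXIF data for it; GPS coordinates are pulled out when present.
def classify_photos(file_records)
  file_records.filter_map do |r|
    exif = r[:exif]
    next unless exif   # not an image, or no metadata found
    { content_hash: r[:content_hash],
      camera:       exif[:camera],
      latitude:     exif.dig(:gps, :lat),
      longitude:    exif.dig(:gps, :lon) }
  end
end
```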
Phase 4: Triage (Human Decision)
- User browses photos in web UI
- Marks favorites, assigns to albums
- Decides which to back up to cloud
- Application writes decisions to photos, albums tables
Result: Application tables contain human decisions, scanner tables unchanged.
Phase 5: Backup (Application → Cloud)
- User triggers backup job
- Application queries photos marked for backup
- Reads file from disk using path from files table
- Uploads to S3 via ActiveStorage
- Records cloud location in application tables
Result: Selected files preserved in cloud, linked to triage metadata.
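The backup phase is a loop over marked photos with an injected uploader, so the cloud client can be swapped for a test double. A sketch under assumed interfaces (the real app uploads via ActiveStorage; `uploader` and the record shapes here are illustrative):

```ruby
# Upload every photo marked for backup, recording the cloud location.
# `uploader` responds to call(key, bytes) and returns a cloud URL;
# `read` is injectable so disk access can be faked in tests.
def backup_marked(photos, uploader, read: ->(path) { File.binread(path) })
  photos.select { |p| p[:backup] }.map do |p|
    key = "photos/#{p[:content_hash]}"
    { content_hash:   p[:content_hash],
      cloud_location: uploader.call(key, read.call(p[:path])) }
  end
end
```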
Technology Choices
| Layer | Technology | Why |
|---|---|---|
| Scanner | Go 1.21+ | Parallel I/O, single binary deployment |
| Database | PostgreSQL 15+ | Rich extension ecosystem, reliable |
| Application | Rails 8 | Rapid UI development, ActiveStorage |
| Hashing | xxHash64 | Fast non-cryptographic hash for deduplication |
| Cloud Storage | S3-compatible | Standard API, multiple provider options |
Why This Choice: xxHash64 Over SHA256
Cryptographic hashes like SHA256 are slow. You do not need cryptographic properties for deduplication. You need to answer: “Are these two files the same content?”
xxHash64 is 10x faster than SHA256 and collision probability is negligible for personal archives (even with millions of files).
If you need cryptographic verification (detecting tampering), add SHA256 as a separate column and compute it only for critical files.
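Content-based identity means two copies with different paths and timestamps hash identically. Ruby's stdlib has no xxHash64, so this sketch uses `Zlib.crc32` purely as a stand-in for a fast non-cryptographic hash:

```ruby
require 'zlib'

# Identity depends only on the bytes, never on path or mtime.
# Zlib.crc32 stands in for xxHash64 (which needs a gem in Ruby);
# the real system uses xxHash64 for its far lower collision odds.
def content_id(bytes)
  Zlib.crc32(bytes)
end
```

Two files anywhere on disk with the same bytes get the same id, which is exactly what deduplication needs.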
Scaling Considerations
This architecture handles 3.5M files on a single PostgreSQL instance. Here is when you need to think about scaling:
Scanner Scaling
Current limits:
- 10 parallel workers (configurable)
- 1000 records per batch INSERT
- Handles 100k files in ~2 minutes
Scale up when:
- Scanning takes longer than drive connection time (plug in, scan, unplug)
- Database becomes bottleneck on INSERT
How to scale:
- Increase worker count (watch disk I/O, not just CPU)
- Increase batch size (watch transaction time)
- Use COPY instead of INSERT for initial load
- Partition files table by scan date
Database Scaling
Current limits:
- 3.5M files table
- 1.1M directories
- 418k photos
- All on single instance
Scale up when:
- Query times exceed 100ms for common operations
- Disk space exceeds single volume
- Backup/restore time becomes painful
How to scale:
- Add read replicas for application queries
- Partition files table by path prefix or scan date
- Use table inheritance for files by type
- Move old scans to archive tables
Application Scaling
Current limits:
- Single Rails instance
- Read-heavy workload
- Occasional background jobs
Scale up when:
- Response times exceed 200ms
- Background jobs create queue backlog
- Concurrent users impact performance
How to scale:
- Add application servers (Rails is stateless)
- Move background jobs to separate workers
- Add Redis for caching and job queue
- Use CDN for static assets
The Elegance of Separation
The architecture works because each layer trusts the others to do their job and nothing more.
Scanner trusts: Database will store what it writes, application will read it correctly.
Database trusts: Scanner writes valid data, application respects read-only boundaries.
Application trusts: Scanner has found all files, database is source of truth.
You trust: Re-running the scanner will not lose your work.
This is architecture as contract. When you need to rebuild the file catalog (corrupted drive, new backup found, scanner bug fixed), you drop the files table and re-run the scanner. Your triage decisions in photos, your albums, your projects remain untouched.
The scanner does not know about your decisions. Your decisions do not depend on any particular scan. The database holds both, separate and safe.
This is how you build systems that last.
Next: The Scanner - How the Go scanner achieves high performance with parallel workers and batched writes.