Architecture: Three Layers, Clear Boundaries
The personal digital archive system is built on a simple principle: separate what you find from what you decide to keep. This architecture has three layers, each with a single responsibility. No layer reaches into another layer’s domain.
The Big Picture
```mermaid
flowchart TB
    subgraph FS["Your Filesystem"]
        FILES["~/Photos, /Volumes/Backup, ~/Desktop, etc.<br/>3.5M files scattered across multiple drives"]
    end
    subgraph SCANNER["Scanner Layer (Go)"]
        S1["Write-only, high-performance"]
        S2["Walks filesystem trees in parallel"]
        S3["Computes content hashes (xxHash64)"]
        S4["Extracts EXIF metadata"]
        S5["Inserts to files table"]
    end
    subgraph DB["Database Layer (PostgreSQL)"]
        subgraph SD["Scanner Domain"]
            FTABLE["files<br/>(3.5M rows)<br/>Scanner owns, Write-only"]
        end
        subgraph AD["Application Domain"]
            ATABLES["directories (1.1M)<br/>projects (2.2k)<br/>unique_files (1.2M)<br/>photos (418k)<br/>App owns, Read-write"]
        end
        EXT["Extensions: pg_trgm, PostGIS, pgvector, ltree"]
    end
    subgraph APP["Application Layer (Rails 8)"]
        A1["Read-heavy browsing and triage"]
        A2["Scanner::FileEntry (read-only model)"]
        A3["Application models (Directory, Photo, etc.)"]
        A4["Web UI for browsing, search, triage"]
    end
    subgraph CLOUD["Cloud Storage (S3)"]
        C1["Backup destination for curated files"]
    end
    FS -->|"Parallel scan with xxHash64<br/>EXIF extraction<br/>Batched inserts"| SCANNER
    SCANNER -->|"Raw file records<br/>(path, size, hash, modified_time, exif)"| DB
    DB -->|"JOIN queries<br/>Hybrid model reads"| APP
    APP -->|"Selected files for preservation<br/>(after triage decisions)"| CLOUD
```
Layer 1: Scanner (Go)
The scanner has one job: find every file on your drives and record what it sees. It does not interpret. It does not decide what matters. It writes facts to the database and moves on.
Responsibilities:
- Traverse filesystem trees in parallel (configurable workers)
- Compute content hashes using xxHash64
- Extract EXIF metadata from images
- Batch INSERT operations (1000 records per transaction)
- Resume interrupted scans
- Never touch application tables
What it writes:
- File path (absolute)
- File size (bytes)
- Content hash (xxHash64)
- Modified timestamp
- EXIF data (if image)
- Scan timestamp
What it never does:
- Read from the database (except to check resume state)
- Decide if a file is important
- Delete records
- Update existing records (scans are append-only)
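The append-only, batched write pattern above is easy to see in miniature. A minimal Ruby sketch (the real scanner is Go; `BatchWriter`, its sink, and the record shape are illustrative):

```ruby
# Sketch of the scanner's batched, append-only write pattern.
# The sink stands in for one INSERT transaction per batch.
class BatchWriter
  BATCH_SIZE = 1000

  attr_reader :flushes

  def initialize(&sink)
    @buffer  = []
    @sink    = sink   # e.g. a proc that runs one batched INSERT
    @flushes = 0
  end

  # Append-only: records are buffered, never updated or deleted.
  def add(record)
    @buffer << record
    flush if @buffer.size >= BATCH_SIZE
  end

  # Called once more at end of scan to write the remainder.
  def flush
    return if @buffer.empty?
    @sink.call(@buffer)
    @flushes += 1
    @buffer = []
  end
end
```

Every 1000 records triggers one transaction; a final `flush` catches the remainder when the tree is exhausted.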
Why This Choice: Go for Scanning
Go gives you three things you need for filesystem work:
- Goroutines: Spin up hundreds of workers without thinking about thread pools
- Compiled speed: Process 3.5M files without waiting hours
- Simple deployment: One binary, no runtime dependencies
The scanner runs when you plug in a new drive or remember an old backup. It needs to be fast and it needs to just work.
Layer 2: Database (PostgreSQL)
PostgreSQL is the only source of truth. Every file the scanner found lives here. Every triage decision you make lives here. Nothing happens in memory that isn’t persisted here first.
Scanner Domain Tables:
files (3.5M rows)
- Owned exclusively by the scanner
- Application reads but never writes
- Contains every file ever scanned
- May include duplicates (same file in multiple locations)
- Immutable after scanner runs
Application Domain Tables:
directories (1.1M rows)
- Groups files by directory
- Tracks user decisions about directory importance
unique_files (1.2M rows)
- Deduplicated files (by content hash)
- One record per unique file content
photos (418k rows)
- Subset of files identified as photos
- Enriched with location, camera metadata
- Linked to albums, events, people
projects (2.2k rows)
- High-level organization
- Creative work, life events, archives
Extensions in use:
| Extension | Purpose |
|---|---|
| pg_trgm | Fuzzy text search (filename matching) |
| PostGIS | Geospatial queries (photo locations) |
| pgvector | Similarity search (future: visual embeddings) |
| ltree | Directory hierarchy queries |
| fuzzystrmatch | Approximate string matching |
| unaccent | Unicode normalization |
Why This Choice: PostgreSQL Extensions
You could build fuzzy search in application code. You could use Elasticsearch for text search. You could use a separate geospatial database.
Or you could use PostgreSQL with extensions and keep everything in one place.
The benefits:
- One database to back up
- One connection pool to manage
- JOIN across search, location, and hierarchy in a single query
- No data synchronization between systems
- No “eventually consistent” problems
The tradeoff is learning PostgreSQL deeply. The payoff is a system that does not surprise you.
Layer 3: Application (Rails 8)
The Rails application is read-heavy. It queries the database, renders web pages, and lets you make decisions about what to keep. It writes to application tables but treats the files table as read-only.
The Hybrid Model:
```ruby
# Scanner domain (read-only)
class Scanner::FileEntry < ApplicationRecord
  self.table_name = 'files'

  # Never save, never destroy -- only query and display.
  # Enforce the contract: Rails raises ActiveRecord::ReadOnlyRecord
  # on any attempt to save or destroy a readonly record.
  def readonly?
    true
  end
end

# Application domain (read-write)
class Photo < ApplicationRecord
  belongs_to :file_entry, class_name: 'Scanner::FileEntry'

  # Application decisions live here:
  # - Album assignments
  # - Favorite markers
  # - Rotation corrections
  # - Manual tags
end
```
Responsibilities:
- Display file listings and search results
- Provide triage UI (keep/delete/maybe)
- Track user decisions in application tables
- Upload selected files to cloud storage
- Generate statistics and visualizations
- Never write to scanner tables
What the hybrid model prevents:
- Scanner cannot accidentally delete your triage work
- Application cannot corrupt raw scan data
- Re-running scanner does not lose your decisions
- Both systems can evolve independently
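The read-only half of that contract does not depend on Rails. A plain-Ruby sketch of the same guarantee (`ReadOnlyRecordError` and `FileEntry` are illustrative names, not the app's real classes):

```ruby
# Plain-Ruby sketch of the read-only contract: scanner-owned
# records answer queries but refuse mutation.
class ReadOnlyRecordError < StandardError; end

class FileEntry
  attr_reader :path, :content_hash

  def initialize(path:, content_hash:)
    @path = path
    @content_hash = content_hash
  end

  # Mutation is refused; only the scanner writes this data.
  def save
    raise ReadOnlyRecordError, "files rows are scanner-owned"
  end
  alias destroy save
end
```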
Why This Choice: Hybrid Over Single Model
You could make the scanner write to the same tables the application uses. The scanner could populate photos directly. The application could update files when you rename something.
This is a mistake.
The problem with merging domains:
When you re-run the scanner (and you will), it needs to know what it owns. If the scanner and application share tables, you have three bad options:
- Scanner deletes everything and starts fresh (loses your triage work)
- Scanner tries to merge (complex, error-prone, slow)
- Scanner gives up on re-scanning (defeats the purpose)
The hybrid model solves this:
- Scanner drops and rebuilds the `files` table: safe
- Application tables reference files by content hash: still work
- Stale references (deleted files): application can detect and clean up
- New files appear immediately: no manual import needed
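Stale-reference cleanup after a re-scan reduces to a set difference on content hashes. A minimal sketch, assuming records carry a `content_hash` field (method and record shapes are illustrative):

```ruby
require 'set'

# After the scanner rebuilds the files table, application records
# whose content hash no longer appears anywhere on disk are stale.
# Inputs are plain hashes here; in the app these would be table rows.
def stale_photo_hashes(photos, files)
  live = files.map { |f| f[:content_hash] }.to_set
  photos.map { |p| p[:content_hash] }
        .reject { |h| live.include?(h) }
        .uniq
end
```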
Data Flow: Files to Cloud
Understanding how data moves through the system makes the architecture obvious.
```mermaid
flowchart LR
    subgraph P1["Phase 1: Discovery"]
        D1["Scanner walks tree"] --> D2["Hash & extract EXIF"] --> D3["Batch INSERT to files"]
    end
    subgraph P2["Phase 2: Deduplication"]
        DD1["Query DISTINCT hashes"] --> DD2["Create unique_files records"]
    end
    subgraph P3["Phase 3: Classification"]
        C1["Identify images from EXIF"] --> C2["Create photos records"] --> C3["Extract GPS locations"]
    end
    subgraph P4["Phase 4: Triage"]
        T1["User browses in web UI"] --> T2["Marks favorites, assigns albums"] --> T3["Decides what to backup"]
    end
    subgraph P5["Phase 5: Backup"]
        B1["Query photos marked for backup"] --> B2["Read file from disk"] --> B3["Upload to S3"]
    end
    P1 --> P2 --> P3 --> P4 --> P5
```
Phase 1: Discovery (Scanner → Database)
- Scanner starts with root path: ~/Photos
- Walks tree with N parallel workers (default: 10)
- For each file:
- Compute xxHash64 of contents
- Extract EXIF if image
- Add to batch buffer
- Every 1000 files, INSERT batch to files table
- Repeat until tree exhausted
Result: files table contains every file found, with content hash and metadata.
Phase 2: Deduplication (Application Processing)
- Application runs background job
- Query: `SELECT DISTINCT ON (content_hash) * FROM files ORDER BY content_hash`
- Create unique_files records
- Link back to all files table entries with same hash
Result: unique_files table has one record per content, with references to all copies.
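In the database this is a single `DISTINCT ON (content_hash)` query; the same logic in plain Ruby, with illustrative record shapes, looks like:

```ruby
# Deduplicate scanned file records by content hash:
# one unique_files entry per hash, plus back-references to every copy.
def deduplicate(file_records)
  file_records.group_by { |r| r[:content_hash] }.map do |hash, copies|
    { content_hash:   hash,
      canonical_path: copies.first[:path],        # arbitrary representative
      copy_paths:     copies.map { |r| r[:path] } }
  end
end
```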
Phase 3: Classification (Application Processing)
- Application identifies images from EXIF data
- Creates photos records
- Extracts location from GPS EXIF
- Associates with directories, projects
Result: photos table ready for browsing and triage.
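Classification is essentially a filter over scanned records. A minimal sketch, assuming the scanner stored EXIF as a nullable field with camera and GPS keys (field names are illustrative):

```ruby
# A file record becomes a photo candidate when the scanner extracted
# EXIF data for it; GPS coordinates are pulled out when present.
def classify_photos(file_records)
  file_records.filter_map do |r|
    exif = r[:exif]
    next unless exif   # not an image, or no metadata found
    { content_hash: r[:content_hash],
      camera:       exif[:camera],
      latitude:     exif.dig(:gps, :lat),
      longitude:    exif.dig(:gps, :lon) }
  end
end
```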
Phase 4: Triage (Human Decision)
- User browses photos in web UI
- Marks favorites, assigns to albums
- Decides which to back up to cloud
- Application writes decisions to photos, albums tables
Result: Application tables contain human decisions, scanner tables unchanged.
Phase 5: Backup (Application → Cloud)
- User triggers backup job
- Application queries photos marked for backup
- Reads file from disk using path from files table
- Uploads to S3 via ActiveStorage
- Records cloud location in application tables
Result: Selected files preserved in cloud, linked to triage metadata.
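The backup phase is a loop over marked photos with an injected uploader, so the cloud client can be swapped for a test double. A sketch under assumed interfaces (the real app uploads via ActiveStorage; `uploader` and the record shapes here are illustrative):

```ruby
# Upload every photo marked for backup, recording the cloud location.
# `uploader` responds to call(key, bytes) and returns a cloud URL;
# `read` is injectable so disk access can be faked in tests.
def backup_marked(photos, uploader, read: ->(path) { File.binread(path) })
  photos.select { |p| p[:backup] }.map do |p|
    key = "photos/#{p[:content_hash]}"
    { content_hash:   p[:content_hash],
      cloud_location: uploader.call(key, read.call(p[:path])) }
  end
end
```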
Technology Choices
| Layer | Technology | Why |
|---|---|---|
| Scanner | Go 1.21+ | Parallel I/O, single binary deployment |
| Database | PostgreSQL 15+ | Rich extension ecosystem, reliable |
| Application | Rails 8 | Rapid UI development, ActiveStorage |
| Hashing | xxHash64 | Fast non-cryptographic hash for deduplication |
| Cloud Storage | S3-compatible | Standard API, multiple provider options |
Why This Choice: xxHash64 Over SHA256
Cryptographic hashes like SHA256 are slow. You do not need cryptographic properties for deduplication. You need to answer: “Are these two files the same content?”
xxHash64 is 10x faster than SHA256 and collision probability is negligible for personal archives (even with millions of files).
If you need cryptographic verification (detecting tampering), add SHA256 as a separate column and compute it only for critical files.
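Content-based identity means two copies with different paths and timestamps hash identically. Ruby's stdlib has no xxHash64, so this sketch uses `Zlib.crc32` purely as a stand-in for a fast non-cryptographic hash:

```ruby
require 'zlib'

# Identity depends only on the bytes, never on path or mtime.
# Zlib.crc32 stands in for xxHash64 (which needs a gem in Ruby);
# the real system uses xxHash64 for its far lower collision odds.
def content_id(bytes)
  Zlib.crc32(bytes)
end
```

Two files anywhere on disk with the same bytes get the same id, which is exactly what deduplication needs.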
Scaling Considerations
This architecture handles 3.5M files on a single PostgreSQL instance. Here is when you need to think about scaling:
Scanner Scaling
Current limits:
- 10 parallel workers (configurable)
- 1000 records per batch INSERT
- Handles 100k files in ~2 minutes
Scale up when:
- Scanning takes longer than drive connection time (plug in, scan, unplug)
- Database becomes bottleneck on INSERT
How to scale:
- Increase worker count (watch disk I/O, not just CPU)
- Increase batch size (watch transaction time)
- Use COPY instead of INSERT for initial load
- Partition files table by scan date
Database Scaling
Current limits:
- 3.5M files table
- 1.1M directories
- 418k photos
- All on single instance
Scale up when:
- Query times exceed 100ms for common operations
- Disk space exceeds single volume
- Backup/restore time becomes painful
How to scale:
- Add read replicas for application queries
- Partition files table by path prefix or scan date
- Use table inheritance for files by type
- Move old scans to archive tables
Application Scaling
Current limits:
- Single Rails instance
- Read-heavy workload
- Occasional background jobs
Scale up when:
- Response times exceed 200ms
- Background jobs create queue backlog
- Concurrent users impact performance
How to scale:
- Add application servers (Rails is stateless)
- Move background jobs to separate workers
- Add Redis for caching and job queue
- Use CDN for static assets
The Elegance of Separation
The architecture works because each layer trusts the others to do their job and nothing more.
Scanner trusts: Database will store what it writes, application will read it correctly.
Database trusts: Scanner writes valid data, application respects read-only boundaries.
Application trusts: Scanner has found all files, database is source of truth.
You trust: Re-running the scanner will not lose your work.
This is architecture as contract. When you need to rebuild the file catalog (corrupted drive, new backup found, scanner bug fixed), you drop the files table and re-run the scanner. Your triage decisions in photos, your albums, your projects remain untouched.
The scanner does not know about your decisions. Your decisions do not depend on any particular scan. The database holds both, separate and safe.
This is how you build systems that last.
Next: The Scanner - How the Go scanner achieves high performance with parallel workers and batched writes.