Building a Personal Digital Archive System

You have 20 years of digital life scattered across old hard drives, cloud accounts, and forgotten USB sticks. Somewhere in there are photos from your wedding, tax documents you might need, and code projects that taught you everything you know. But you also have 47 copies of the same vacation photo, installer files from 2009, and enough duplicate downloads to fill a small data center.

I built a system to index 3.5 million files (1.47 TB) and discovered something shocking: 735 GB of it was duplicate content. That’s 50% wasted space. Half of my digital history was copies of copies, forgotten and multiplying in the dark corners of my filesystem. This system gave me the power to see everything, understand what I actually have, and make intelligent decisions about what to keep, delete, or back up to the cloud.

This guide shows you how to build the same thing: a high-performance scanner that indexes your entire digital archive, a PostgreSQL database with advanced search capabilities, a Rails web app for browsing and triage, and a strategy for deduplicating and backing up only what matters. It’s not a product you install. It’s a system you build and adapt to your specific chaos.

Key Capabilities

Scan: A Go-based filesystem indexer that walks your directories in parallel, computes content hashes for deduplication, extracts EXIF data from photos, and loads everything into PostgreSQL. Handles 3.5M+ files without breaking a sweat.

Index: PostgreSQL database with pg_trgm for fuzzy filename search, PostGIS for GPS-based photo queries, and pgvector for future AI classification. Every file has a hash, every photo has metadata, every path is searchable.
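To see why pg_trgm tolerates typos, here is a rough Go model of its `similarity()` score: shared trigrams divided by the union of both trigram sets. This is an ASCII-only approximation of pg_trgm's actual tokenization (which also normalizes punctuation and handles multi-word strings), meant only to illustrate the idea.

```go
package main

import (
	"fmt"
	"strings"
)

// trigrams extracts 3-grams roughly the way pg_trgm does: the string
// is lowercased and padded with two leading spaces and one trailing.
func trigrams(s string) map[string]bool {
	padded := "  " + strings.ToLower(s) + " "
	set := make(map[string]bool)
	for i := 0; i+3 <= len(padded); i++ {
		set[padded[i:i+3]] = true
	}
	return set
}

// similarity mimics pg_trgm's similarity(): |shared| / |union|.
func similarity(a, b string) float64 {
	ta, tb := trigrams(a), trigrams(b)
	shared := 0
	for t := range ta {
		if tb[t] {
			shared++
		}
	}
	union := len(ta) + len(tb) - shared
	if union == 0 {
		return 0
	}
	return float64(shared) / float64(union)
}

func main() {
	// A misspelled filename still scores high against the query.
	fmt.Printf("%.2f\n", similarity("vacation", "vaccation"))
	fmt.Printf("%.2f\n", similarity("vacation", "invoice"))
}
```

Because the score depends on overlapping trigrams rather than exact matches, a GIN index on those trigrams lets PostgreSQL answer `filename % 'vaccation'` queries without scanning every row.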

Browse: Rails 8 web application with a directory browser, category filters, fuzzy search, and detailed file views. See your archive the way it actually exists on disk, with pre-computed stats and smart navigation.

Deduplicate: Content-based deduplication using xxHash64. Identify identical files across your entire archive, calculate wasted space, and make informed decisions about what to keep as the canonical copy.
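Once every file has a content hash, the dedup math is simple: group by hash, keep one canonical copy per group, and everything beyond it is wasted space. In the database this is a `GROUP BY` over the hash column; a toy in-memory sketch (paths, sizes, and hashes below are made up):

```go
package main

import "fmt"

// dupFile is a minimal stand-in for a row from the files table.
type dupFile struct {
	Path string
	Size int64
	Hash string
}

// wastedBytes groups files by content hash; every copy beyond the
// first ("canonical") one counts as wasted space.
func wastedBytes(files []dupFile) int64 {
	byHash := make(map[string][]dupFile)
	for _, f := range files {
		byHash[f.Hash] = append(byHash[f.Hash], f)
	}
	var wasted int64
	for _, group := range byHash {
		for _, f := range group[1:] { // keep one canonical copy
			wasted += f.Size
		}
	}
	return wasted
}

func main() {
	files := []dupFile{
		{"/photos/2019/beach.jpg", 4_000_000, "a1"},
		{"/backup/beach.jpg", 4_000_000, "a1"},
		{"/old-drive/beach (1).jpg", 4_000_000, "a1"},
		{"/docs/taxes-2020.pdf", 900_000, "b2"},
	}
	fmt.Println(wastedBytes(files)) // 8000000: two redundant copies of beach.jpg
}
```

Which copy counts as canonical is a policy decision (oldest path, shortest path, best-organized directory); the wasted-space total is the same either way.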

Backup: Prepare unique content for cloud backup to S3-compatible storage. Why pay to store the same file 47 times? Upload only what’s unique and irreplaceable.

Technology Choices at a Glance

| Layer | Technology | Why |
|-------|------------|-----|
| Scanner | Go | Parallel filesystem traversal, minimal dependencies, fast hashing |
| Database | PostgreSQL 14+ | Extensions (pg_trgm, PostGIS, pgvector), proven at scale |
| Web App | Rails 8 | Rapid development, Hotwire for rich UI, ActiveStorage for uploads |
| Hashing | xxHash64 | Much faster than MD5/SHA; non-cryptographic, but collisions are vanishingly unlikely at this scale |
| Deployment | Docker Compose | Reproducible local development, easy cloud migration |
| Search | pg_trgm | Fuzzy text search without Elasticsearch complexity |
| Styling | Tailwind CSS | Utility-first, fast iteration, consistent design |

Quick Start Path

Here’s the workflow this system follows:

```mermaid
flowchart LR
    A[Scanner runs] --> B[Database fills]
    B --> C[Web app populates]
    C --> D[You browse & triage]
    D --> E[Backup unique content]
```

  1. Scanner runs - Walks your filesystem, hashes files, extracts metadata
  2. Database fills - PostgreSQL stores everything with indexes for fast queries
  3. Web app populates - Rails tasks build denormalized views (directories, unique files)
  4. You browse and triage - Web UI shows duplicates, stats, search results
  5. Backup unique content - Upload only canonical copies to S3-compatible storage

Each phase builds on the previous one. The scanner is write-only (fast). The database is the source of truth. The web app is read-heavy (responsive). The backup is selective (cheap).

Who This Is For

This system is for people who:

  • Have accumulated 10+ years of digital files across multiple devices
  • Want to understand what they actually have before it disappears
  • Are drowning in duplicates and don’t know where to start
  • Need a searchable index of their entire digital life
  • Care about backing up unique content, not redundant copies
  • Are comfortable with code and databases (or willing to learn)

You don’t need to be a Go expert or a PostgreSQL wizard. But you should be willing to run commands, read stack traces, and adapt the system to your needs. This is infrastructure for your digital life, not a consumer app.

Documentation

Getting Started

  • The Problem - Why personal digital archives are different from enterprise backups, and why existing tools fall short
  • Architecture Overview - Three-layer design: scanner (write-only), database (source of truth), web app (read-heavy)

Building the System

  • The Scanner - Go-based filesystem indexer with parallel traversal, content hashing, and EXIF extraction
  • Database Design - PostgreSQL schema, extensions (pg_trgm, PostGIS, pgvector), and index strategy
  • The Web Application - Rails app for browsing, searching, and triage with specialized models and Hotwire UI

Analysis and Backup

  • Deduplication Analysis - Content hashing, identifying duplicates, calculating wasted space, choosing canonical copies
  • Cloud Backup Strategy - S3-compatible storage options, cost comparison, ActiveStorage integration, verification

Reference

  • Customization Guide - Adapting the scanner, adding categories, custom metadata, AI integration points
  • Lessons Learned - What worked, what I’d do differently, performance insights, the emotional side of digital archaeology

Adapt This

Paths and Categories: The scanner uses configurable skip patterns and category mappings. You’ll want to customize these for your filesystem layout and file types.
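As a sketch of what those mappings might look like (the pattern lists and category names below are examples to customize, not the scanner's actual defaults):

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// skipPatterns lists directory names to prune during the walk.
var skipPatterns = []string{"node_modules", ".git", ".Trash", "Cache"}

// categories maps file extensions to coarse categories.
var categories = map[string]string{
	".jpg": "photo", ".jpeg": "photo", ".png": "photo", ".heic": "photo",
	".mp4": "video", ".mov": "video",
	".pdf": "document", ".docx": "document",
	".go": "code", ".rb": "code", ".py": "code",
}

// shouldSkip reports whether any path component matches a skip pattern.
func shouldSkip(path string) bool {
	for _, part := range strings.Split(filepath.ToSlash(path), "/") {
		for _, pat := range skipPatterns {
			if part == pat {
				return true
			}
		}
	}
	return false
}

// categorize maps a file's extension to a category, defaulting to "other".
func categorize(path string) string {
	if cat, ok := categories[strings.ToLower(filepath.Ext(path))]; ok {
		return cat
	}
	return "other"
}

func main() {
	fmt.Println(shouldSkip("projects/app/node_modules/left-pad/index.js")) // true
	fmt.Println(categorize("Photos/IMG_1234.JPG"))                         // photo
}
```

Pruning `node_modules` and similar trees before hashing matters more than it looks: developer directories often hold the bulk of a drive's file count while contributing nothing worth archiving.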

Database Extensions: This guide uses PostGIS for GPS queries and pgvector for future AI features. If you don’t need these, you can skip them and simplify your setup.

Web Framework: Rails is one choice. The database schema works with any web framework (Django, Laravel, Express, etc.). Pick what you know.

Cloud Storage: The backup strategy uses S3-compatible storage, but the principles apply to any cloud provider or even local NAS backup.

Scale: This system handles millions of files. If you have tens of thousands, you can simplify. If you have tens of millions, you’ll need to optimize further.

Let’s Build

Start with The Problem to understand why this approach works, or jump straight to The Scanner if you want to start indexing immediately.

Your digital archive is waiting. Let’s make sense of it.


