Lessons Learned: What Worked, What Didn’t
After building and running a personal archive system on 3.5 million files, here’s what we learned. This isn’t a victory lap. It’s an honest reflection on what worked brilliantly, what we’d do differently, and what surprised us along the way.
What Worked Well
Go for the Scanner
Writing the scanner in Go was one of our best decisions. The compiled binary made deployment trivial (no dependency hell), and Go’s concurrency model handled filesystem traversal beautifully. We got excellent performance without fighting the language.
When you’re scanning millions of files, startup time matters. Go’s instant startup meant we could restart the scanner without penalty when tuning parameters or recovering from errors.
PostgreSQL Extensions
We chose PostgreSQL early and it paid off immediately. The pg_trgm extension for fuzzy search was game-changing. We got trigram-based similarity matching in pure SQL without external services or complex application code.
SELECT * FROM archive_items
WHERE name % 'vacation'
ORDER BY similarity(name, 'vacation') DESC
LIMIT 20;
This single feature powers the entire search experience. It handles typos, partial matches, and finds files when you only remember fragments of the name.
Hybrid Architecture
Keeping the scanner independent from the Rails app was critical. The scanner runs as a standalone process, writes to the database, and exits. The app reads that data and adds its own layers (tagging, collections, search).
This separation meant we could:
- Run scans without starting the app
- Develop the app without re-scanning
- Replace either component independently
- Debug each system in isolation
When the scanner took 4 hours to process 3.5M files, we weren’t tying up Rails processes or worrying about HTTP timeouts.
xxHash64 for Content Hashing
We chose xxHash64 over MD5 or SHA-256 for duplicate detection. At terabyte scale, speed matters more than cryptographic security. xxHash64 is blazingly fast, and while it isn’t cryptographically collision-resistant, accidental collisions are vanishingly unlikely at personal-archive scale.
The performance difference is real. xxHash64 let us hash every file during scanning without adding significant time to the process.
Resume Capability
The scanner took 4 hours on our first complete run. We needed to run it multiple times while tuning parameters, fixing bugs, and handling drive disconnections.
Building resume capability from day one saved countless hours. The scanner tracks progress and skips already-processed directories. When something fails, you restart and continue where you left off.
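The mechanism boils down to a persisted set of completed work units. The `Resumer` type below is a hypothetical sketch, not our actual code; the idea is simply “record each directory once its rows are committed, and skip it on restart.”

```go
package main

import "fmt"

// Resumer tracks directories that have already been fully processed.
type Resumer struct {
	done map[string]bool
}

// NewResumer loads the set of completed directories from a previous run.
func NewResumer(completed []string) *Resumer {
	r := &Resumer{done: make(map[string]bool)}
	for _, d := range completed {
		r.done[d] = true
	}
	return r
}

// ShouldScan reports whether a directory still needs processing.
func (r *Resumer) ShouldScan(dir string) bool { return !r.done[dir] }

// MarkDone records a directory after its files are committed to the DB.
func (r *Resumer) MarkDone(dir string) { r.done[dir] = true }

func main() {
	r := NewResumer([]string{"/photos/2019"}) // persisted from last run
	fmt.Println(r.ShouldScan("/photos/2019")) // false: already processed
	fmt.Println(r.ShouldScan("/photos/2020")) // true: pick up here
}
```

The important detail is marking a directory done only after its batch commits, so a crash mid-directory causes a re-scan of that directory rather than silent data loss.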
Batched Inserts
We batch database inserts at 1,000 records per transaction using PostgreSQL’s COPY protocol. This turned what would have been millions of individual INSERTs into thousands of efficient bulk operations.
// Batch accumulation
if len(batch) >= 1000 {
    flushBatch()
}
The difference is dramatic. Single inserts would have taken days. Batched COPY operations completed in hours.
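The accumulation logic can be sketched as a small batcher. Here the `flush` callback is a stand-in for the real bulk write (a COPY call such as pgx’s `CopyFrom`); the type and its names are illustrative, not our actual code.

```go
package main

import "fmt"

// Batcher accumulates rows and flushes them in fixed-size batches.
type Batcher struct {
	batch [][]any
	size  int
	flush func(rows [][]any) // stand-in for a bulk COPY write
}

// Add queues one row and flushes once the batch is full.
func (b *Batcher) Add(row []any) {
	b.batch = append(b.batch, row)
	if len(b.batch) >= b.size {
		b.Flush()
	}
}

// Flush writes any queued rows; call once more at end of scan.
func (b *Batcher) Flush() {
	if len(b.batch) == 0 {
		return
	}
	b.flush(b.batch)
	b.batch = nil
}

func main() {
	flushes := 0
	b := &Batcher{size: 1000, flush: func(rows [][]any) { flushes++ }}
	for i := 0; i < 2500; i++ {
		b.Add([]any{i})
	}
	b.Flush()                        // final partial batch of 500
	fmt.Println("flushes:", flushes) // flushes: 3
}
```

The final `Flush()` matters: without it, the last partial batch would never reach the database.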
Content Hashing Over Path-Based Deduplication
We identify duplicates by file content (hash), not path. This catches files that have been moved, renamed, or downloaded multiple times with different names.
Path-based deduplication would have missed the 10 copies of IMG_1234.jpg scattered across backup folders with names like Copy of IMG_1234.jpg and IMG_1234 (1).jpg.
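Once every file has a content hash, finding duplicates is a grouping problem: bucket paths by hash and keep the buckets with more than one member. A minimal sketch, with hypothetical paths and hash values:

```go
package main

import "fmt"

// duplicates returns hash → paths for every hash seen more than once.
func duplicates(byPath map[string]uint64) map[uint64][]string {
	groups := make(map[uint64][]string)
	for path, h := range byPath {
		groups[h] = append(groups[h], path)
	}
	// Drop singletons: a hash with one path is not a duplicate.
	for h, paths := range groups {
		if len(paths) < 2 {
			delete(groups, h)
		}
	}
	return groups
}

func main() {
	dupes := duplicates(map[string]uint64{
		"/photos/IMG_1234.jpg":         0xabc,
		"/backup/Copy of IMG_1234.jpg": 0xabc,
		"/downloads/IMG_1234 (1).jpg":  0xabc,
		"/photos/IMG_9999.jpg":         0xdef,
	})
	fmt.Println(len(dupes)) // 1 duplicate group
}
```

In practice this is a single SQL GROUP BY over the hash column, but the logic is the same.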
What We’d Do Differently
Start with SQLite for Smaller Archives
PostgreSQL is excellent, but it’s overkill for archives under 500,000 files. If we were starting with a smaller collection, we’d begin with SQLite and migrate later if needed.
SQLite requires no server setup, no authentication configuration, and no background processes. For personal projects, that’s a real advantage. You can always migrate to PostgreSQL when you need advanced features or concurrent access.
Better Progress Estimation
We originally counted files before scanning to show accurate progress. Then we discovered that counting 3.5 million files takes 30 minutes. We removed the pre-count to start scanning faster.
The trade-off was worth it, but progress reporting suffered. We’d invest more in better estimation techniques (sampling directories, tracking historical rates) rather than eliminating progress indicators entirely.
More Metadata Upfront
We extract basic metadata (size, timestamps, EXIF data) during scanning. But we punt on some expensive operations like video duration, audio tags, and thumbnail generation.
In hindsight, extracting more metadata upfront would have been valuable. Scanning already takes hours. Adding 20% more time to capture video duration or audio metadata would have saved us from implementing separate processing passes later.
Built-In Thumbnail Generation
We generate thumbnails on-demand in the Rails app. This works, but it means the first view of each image is slow. Building thumbnail generation into the scanner would have frontloaded this work.
The scanner already reads every file. Generating thumbnails during that pass (and storing them separately) would have improved the browsing experience from day one.
Technical Insights
EXIF Data Is Often Corrupt
We crashed on EXIF parsing multiple times before adding proper validation. Camera manufacturers don’t follow standards carefully, and file corruption produces wild values.
We’ve seen:
- Dates in the year 2099
- GPS coordinates in the ocean
- Negative image dimensions
- Integer overflow on image width
Now we validate every EXIF field and skip invalid data rather than crashing. Assume EXIF data is corrupt until proven otherwise.
# Validate GPS coordinates
if lat && lon && lat.abs <= 90 && lon.abs <= 180
  self.latitude = lat
  self.longitude = lon
end
PostGIS COPY Doesn’t Work
We wanted to use PostGIS for geographic search, but PostgreSQL’s COPY protocol doesn’t support PostGIS geometry columns directly. We tried various workarounds before giving up.
Solution: Store latitude and longitude as separate decimal columns. Convert to PostGIS geometries in a post-processing step if needed. For most queries, numeric columns work fine and are easier to import in bulk.
External Drives Are Slow
Scanning external USB drives is painfully slow compared to internal SSDs. We’re talking 10x slower in some cases. There’s no magic fix. External drives have slower seek times, USB overhead, and sometimes spin down to save power.
Budget time accordingly. A 1TB external drive might take 8 hours to scan completely. Run it overnight. Be patient.
Category Classification by Extension
We classify files by extension (.jpg = photo, .mp4 = video, .pdf = document). This simple approach is 95% accurate in practice.
Edge cases exist (.dat files could be anything, .bin is ambiguous), but most personal files follow conventions. Don’t overthink category classification. Extension-based mapping works.
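The whole classifier is essentially a lookup table with a fallback. This sketch shows the shape; the real mapping is larger, and the category names here are illustrative.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// categories maps lowercase extensions to broad file categories.
var categories = map[string]string{
	".jpg": "photo", ".jpeg": "photo", ".png": "photo", ".heic": "photo",
	".mp4": "video", ".mov": "video",
	".mp3": "audio", ".flac": "audio",
	".pdf": "document", ".doc": "document", ".txt": "document",
}

// classify returns a category for a path, defaulting to "other".
func classify(path string) string {
	ext := strings.ToLower(filepath.Ext(path))
	if c, ok := categories[ext]; ok {
		return c
	}
	return "other" // .dat, .bin, and friends land here
}

func main() {
	fmt.Println(classify("/photos/IMG_1234.JPG")) // photo
	fmt.Println(classify("/stuff/mystery.dat"))   // other
}
```

Lowercasing the extension first is the one detail that matters: camera firmware loves .JPG.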
Fuzzy Search Needs Minimum Similarity
PostgreSQL’s pg_trgm similarity scores range from 0 to 1. Without a minimum threshold, searches return thousands of terrible matches.
We use similarity(name, query) > 0.3 as a filter. This threshold was tuned empirically. Too low and you get noise. Too high and you miss legitimate matches. 0.3 works well for our filenames.
Issues We Encountered
EXIF Integer Overflow
Corrupt EXIF data caused integer overflow errors in Ruby’s EXIF parser. A corrupted width value of 2^32 crashed the parsing library.
Fix: Wrap EXIF parsing in rescue blocks and validate ranges before trusting values. Log corrupted files for manual review.
Database Connection Drops on Long Scans
Four-hour scans occasionally hit database connection timeouts or network hiccups. PostgreSQL would drop the connection mid-scan.
Fix: The scanner detects lost connections and reconnects automatically. It commits batches frequently so at most 1,000 records are lost on disconnect. Resume capability handles the rest.
Postgres.app Authentication Hiccups
Postgres.app on macOS occasionally fails authentication after system sleep or updates. The server is running but refuses connections.
Fix: Restart Postgres.app. There’s no deeper insight here. It’s a quirk of the local development setup. In production, use a proper PostgreSQL installation.
Security Considerations
Sensitive File Detection Is Imperfect
We flag files with names like password, secret, or private as potentially sensitive. This catches obvious cases but misses plenty.
A file named notes.txt could contain credit cards. A file called backup.zip might hold tax returns. Automated detection is a helpful hint, not a security guarantee.
Manual review is still required for sensitive data. Don’t trust the system to catch everything.
Don’t Index System Directories
Exclude operating system directories (/System, /Library, C:\Windows) from scans. These directories contain thousands of files you don’t need, and indexing them clutters your archive with irrelevant system files.
We learned this after scanning /Library and getting 500,000 cache files in our archive. Oops.
Exclude Credentials from Cloud Backup
If you back up your archive database to the cloud, be careful. The database contains file paths that might reveal sensitive directory structures. Index data about sensitive files should be excluded or encrypted.
Consider whether your database backup strategy matches your security requirements. The index is searchable metadata about your files. Treat it accordingly.
Consider Encryption for Sensitive Archives
We don’t encrypt our archive database because it’s on a local machine and contains paths/metadata, not file contents. But if you’re archiving truly sensitive material, encryption at rest is worth the complexity.
Think about your threat model. Who has physical access to your machine? Where are database backups stored? Would encryption add meaningful security or just complexity?
The Emotional Side
Finding Treasures Is Genuinely Exciting
We found a complete copy of our high school website (2005!) buried in a backup folder. HTML, images, terrible CSS, everything. Rediscovering that was a genuine thrill.
Building this system creates moments of joy. You’ll find forgotten photos, old writing, music you loved 15 years ago. Those moments make the technical work worthwhile.
Some Files Should Be Deleted
Not everything deserves archival. Downloaded PDFs you never read, temp files from old projects, duplicate downloads, installer DMGs from 2012. Some files are digital clutter.
Archiving everything feels comprehensive, but it’s also exhausting. Give yourself permission to delete obvious junk. You don’t need 50 copies of Untitled.txt.
Decision Fatigue Is Real
Reviewing 3.5 million files and deciding what to keep is mentally taxing. Even with automation, you’ll hit decisions you need to make manually. Keep or delete? Tag or skip? Organize or leave as-is?
Build triage workflows. Set time limits. Work in batches. Accept that you can’t review everything perfectly. Progress over perfection.
Take Breaks During Review
Archival work is slower and more emotional than expected. You’ll run into photos from hard times, files from deceased relatives, reminders of past versions of yourself.
Take breaks. This isn’t a race. The files aren’t going anywhere. Work at a pace that feels sustainable.
Maintaining Long-Term
Re-scan New Drives as Acquired
When you acquire a new external drive or migrate to a new computer, run the scanner on the new data. The system is designed for incremental updates. Point it at the new directory and let it add to the index.
Old data remains stable. New data gets integrated. The archive grows over time.
Periodic Sync to Find Deleted Files
Run the scanner periodically (quarterly? annually?) to catch deleted files. The scanner can compare current filesystem state to database records and mark missing files.
We haven’t implemented deletion tracking yet, but it’s on the roadmap. For now, occasional full re-scans keep the database accurate.
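When we do build deletion tracking, the core is a set difference: every path in the database that no longer exists on disk gets flagged. A hypothetical sketch of that comparison:

```go
package main

import "fmt"

// missing returns every indexed path that no longer exists on disk.
func missing(indexed []string, onDisk map[string]bool) []string {
	var gone []string
	for _, p := range indexed {
		if !onDisk[p] {
			gone = append(gone, p)
		}
	}
	return gone
}

func main() {
	indexed := []string{"/a.jpg", "/b.jpg", "/c.jpg"} // from the database
	onDisk := map[string]bool{"/a.jpg": true, "/c.jpg": true}
	fmt.Println(missing(indexed, onDisk)) // [/b.jpg]
}
```

Marking files missing rather than deleting their rows preserves history: a file on a disconnected external drive looks identical to a deleted one until the drive comes back.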
Monitor Backup Status
An archive is only as good as its backups. Monitor backup success. Test restores occasionally. Know where your backups live and how to access them.
We use Time Machine for local backups and cloud storage for critical directories. Find a system that matches your risk tolerance and budget.
Database Maintenance
Run PostgreSQL maintenance occasionally:
VACUUM ANALYZE archive_items;
This reclaims space from deleted records and updates query planner statistics. For a read-heavy archive database, maintenance is low-effort but worth doing quarterly.
Future Directions
AI Embeddings for Semantic Search
We’d like to generate embeddings for text content and enable semantic search. Find documents by meaning, not just keywords. “Tax deductions for home office” instead of exact phrase matching.
This requires processing text content, calling embedding APIs, and storing vectors. It’s feasible but not implemented yet.
Face Recognition for Photos
Identifying people in photos automatically would be valuable. PostgreSQL could store face embeddings and support queries like “all photos containing person X.”
Privacy concerns are real here. Face recognition feels more invasive than filename indexing. We’d implement this carefully with clear opt-in controls.
Automatic Tagging
Can we auto-tag files based on content, metadata, or directory structure? Photos from 2010 get tagged 2010s. Documents containing IRS get tagged taxes. Files in Projects/Website get tagged web-development.
Rule-based tagging is straightforward. ML-based tagging is more interesting but requires training data.
Timeline Visualization
A visual timeline of your digital life would be compelling. See photo density over years, document creation patterns, major file acquisition events.
We have the timestamp data. Building an interactive timeline visualization is a front-end project waiting to happen.
Start Building
If you’re thinking about building your own archive system, start. Don’t wait for the perfect design. Start with a simple scanner, a SQLite database, and a weekend.
You’ll learn what matters for your collection. You’ll discover files you forgot existed. You’ll build something genuinely useful.
Our system isn’t perfect. It has rough edges, missing features, and technical debt. But it works. It has processed 3.5 million files and made them searchable. That’s enough.
Your first version won’t be perfect either. Build it anyway. Fix problems as you encounter them. Add features when you need them. Ship something that works and improve it over time.
The best archive system is the one you actually build.