The Scanner: High-Performance Filesystem Indexing
You need to index 3.5 million files. You need to hash 1.47 TB of content. You need EXIF data from hundreds of thousands of photos. You need to classify everything by type, detect sensitive files, and handle corrupted data gracefully. And you need to do all of this in hours, not days.
This is why the scanner is written in Go.
Why Go for Filesystem Scanning
Three reasons:
1. Concurrency is built into the language. Goroutines make it trivial to scan thousands of files in parallel. You don’t fight with threads or process pools. You just spin up a goroutine for each file and let Go’s runtime handle the scheduling.
2. Single binary deployment. You compile once and get a binary that runs anywhere. No Ruby version managers, no Python virtual environments, no dependency hell. Just copy the scanner to any machine and run it.
3. Performance where it matters. File I/O is the bottleneck when you’re scanning millions of files, but CPU-bound operations like hashing and EXIF extraction benefit enormously from Go’s efficiency. The scanner can hash and extract metadata from files faster than most drives can feed them.
Real numbers: scanning 3.5 million files with full content hashing and EXIF extraction took about 4 hours on a 2019 MacBook Pro. That’s roughly 240 files per second, sustained.
Architecture: Parallel Pipeline with Controlled Concurrency
The scanner uses a producer-consumer pattern with a semaphore to limit concurrent operations:
// Semaphore for limiting concurrent hash operations
sem := make(chan struct{}, *workers)

// Walk the directory tree
filepath.WalkDir(*sourceDir, func(path string, d fs.DirEntry, err error) error {
    if err != nil {
        errorFiles.Add(1)
        return nil
    }

    // Skip excluded directories
    if d.IsDir() {
        if shouldSkipDir(d.Name()) {
            skippedFiles.Add(1)
            return filepath.SkipDir
        }
        return nil
    }

    // Skip already indexed files (resume capability)
    if existingPaths[path] {
        skippedFiles.Add(1)
        return nil
    }

    wg.Add(1)
    sem <- struct{}{} // Acquire semaphore
    go func(path string, d fs.DirEntry) {
        defer wg.Done()
        defer func() { <-sem }() // Release semaphore

        record, err := processFile(path, d, *skipHash, *skipExif)
        if err != nil {
            errorFiles.Add(1)
            return
        }

        recordChan <- record
        processedFiles.Add(1)
        totalBytes.Add(record.SizeBytes)
        bar.Add(1)
    }(path, d)

    return nil
})
What’s happening here:
- filepath.WalkDir traverses the filesystem depth-first, calling the function for each file
- For each file, we spawn a goroutine to process it (hash, EXIF, classify)
- The semaphore (sem) limits how many goroutines can run concurrently (default 8 workers)
- Processed records go into a buffered channel (recordChan)
- A separate writer goroutine batches records and inserts them into PostgreSQL
This architecture keeps every stage busy. The walker keeps finding new files and hands them to up to eight workers, pausing only when all of them are already hashing and extracting EXIF. Meanwhile, the database writer batches inserts on its own goroutine. Everything flows.
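The writer side isn’t shown above. A minimal sketch of what it might look like, reusing the insertBatch helper covered later in this chapter and a batchSize value from the -batch flag (names are my assumptions, not the scanner’s exact source):

// Sketch (assumption): drain recordChan and flush to PostgreSQL in batches.
// insertBatch is the COPY-based helper shown later in this chapter.
writerDone := make(chan struct{})
go func() {
    defer close(writerDone)
    batch := make([]FileRecord, 0, *batchSize)
    for record := range recordChan {
        batch = append(batch, record)
        if len(batch) >= *batchSize {
            if err := insertBatch(ctx, pool, batch); err != nil {
                fmt.Printf("\nError inserting batch: %v\n", err)
            }
            batch = batch[:0] // reuse the backing array; CopyFrom has already read it
        }
    }
    // Flush whatever is left once the walk finishes and recordChan is closed.
    if err := insertBatch(ctx, pool, batch); err != nil {
        fmt.Printf("\nError inserting batch: %v\n", err)
    }
}()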
Feature Breakdown
Content Hashing with xxHash64
Deduplication requires content hashing. The question is which algorithm.
MD5 is fast but has known collision issues. SHA-256 is cryptographically secure but overkill for this use case. We’re not verifying signatures, we’re finding duplicate files.
xxHash64 is a non-cryptographic hash that’s extremely fast and has excellent distribution properties. It’s what we use:
func hashFile(path string) (string, error) {
    f, err := os.Open(path)
    if err != nil {
        return "", err
    }
    defer f.Close()

    h := xxhash.New()
    if _, err := io.Copy(h, f); err != nil {
        return "", err
    }

    return fmt.Sprintf("%016x", h.Sum64()), nil
}
On a modern SSD, this can hash about 500 MB/s per worker. Eight workers means roughly 4 GB/s of hashing throughput, which is faster than most drives can sustain for random file reads.
Why xxHash64 specifically:
- Speed: 10x faster than MD5, 50x faster than SHA-256
- Quality: Excellent avalanche properties, extremely low collision rate for non-adversarial data
- Simplicity: Returns a 64-bit integer (16 hex characters when formatted)
For a personal archive where you control the input and aren’t defending against malicious actors, xxHash64 is perfect.
EXIF Extraction
Photos carry their own metadata. You want that metadata:
func extractExif(path string) (map[string]interface{}, *time.Time, *float64, *float64, *int, *int) {
    f, err := os.Open(path)
    if err != nil {
        return nil, nil, nil, nil, nil, nil
    }
    defer f.Close()

    x, err := exif.Decode(f)
    if err != nil {
        return nil, nil, nil, nil, nil, nil
    }

    data := make(map[string]interface{})

    // Camera info (mk rather than make, to avoid shadowing the builtin)
    if mk, err := x.Get(exif.Make); err == nil {
        data["make"] = strings.TrimSpace(mk.String())
    }
    if model, err := x.Get(exif.Model); err == nil {
        data["model"] = strings.TrimSpace(model.String())
    }

    // Exposure settings
    if iso, err := x.Get(exif.ISOSpeedRatings); err == nil {
        if val, err := iso.Int(0); err == nil {
            data["iso"] = val
        }
    }

    // Date taken
    var dateTaken *time.Time
    if dt, err := x.Get(exif.DateTimeOriginal); err == nil {
        if t, err := time.Parse("2006:01:02 15:04:05", strings.Trim(dt.String(), "\"")); err == nil {
            dateTaken = &t
            data["date_taken"] = t.Format(time.RFC3339)
        }
    }

    // GPS
    var lat, lon *float64
    if gpsLat, gpsLon, err := x.LatLong(); err == nil {
        lat = &gpsLat
        lon = &gpsLon
        data["gps_lat"] = gpsLat
        data["gps_lon"] = gpsLon
    }

    // Dimensions are extracted and validated here; that code is shown in
    // "Corrupt EXIF Data" below.
    var width, height *int

    return data, dateTaken, lat, lon, width, height
}
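The listing above shows only the ISO lookup; the aperture, shutter speed, and focal length reads were elided. They presumably follow the same pattern as ISO, with the difference that these fields are stored as EXIF rationals. A hedged sketch:

// Sketch (assumption): the elided exposure fields, following the ISO pattern above.
// FNumber and FocalLength are rationals, read with goexif's Rat2.
if fnum, err := x.Get(exif.FNumber); err == nil {
    if num, den, err := fnum.Rat2(0); err == nil && den != 0 {
        data["aperture"] = float64(num) / float64(den)
    }
}
if fl, err := x.Get(exif.FocalLength); err == nil {
    if num, den, err := fl.Rat2(0); err == nil && den != 0 {
        data["focal_length"] = float64(num) / float64(den)
    }
}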
What we extract:
- Camera info: Make and model (Canon EOS 5D, iPhone 12 Pro)
- Date taken: When the photo was actually captured, not when the file was last modified
- GPS coordinates: Where the photo was taken (if available)
- Exposure settings: ISO, aperture, shutter speed, focal length
- Dimensions: Width and height from EXIF (validated to handle corrupt data)
All of this goes into a JSONB column in PostgreSQL, plus dedicated columns for date_taken and GPS coordinates so you can query them efficiently.
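With that layout, the queries stay simple. Two examples of mine (not from the original post), using the column names that appear in the insert code later in this chapter:

-- Top camera models among photos (reads the JSONB column)
SELECT exif_data->>'model' AS camera, COUNT(*) AS photos
FROM files
WHERE category = 'photo' AND exif_data ? 'model'
GROUP BY camera
ORDER BY photos DESC
LIMIT 10;

-- Geotagged photos from a given year (uses the dedicated columns)
SELECT path, exif_date_taken, gps_lat, gps_lon
FROM files
WHERE gps_lat IS NOT NULL
  AND exif_date_taken BETWEEN '2015-01-01' AND '2015-12-31';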
Category Classification
The scanner automatically categorizes files by extension:
var categoryByExt = map[string]struct {
    category   string
    confidence string
}{
    // Photos
    ".jpg": {"photo", "high"}, ".jpeg": {"photo", "high"}, ".png": {"photo", "high"},
    ".gif": {"photo", "high"}, ".heic": {"photo", "high"}, ".raw": {"photo", "high"},

    // Videos
    ".mov": {"video", "high"}, ".mp4": {"video", "high"}, ".avi": {"video", "high"},
    ".mkv": {"video", "high"}, ".m4v": {"video", "high"},

    // Audio
    ".mp3": {"audio", "high"}, ".wav": {"audio", "high"}, ".flac": {"audio", "high"},
    ".m4a": {"audio", "high"},

    // Documents
    ".pdf": {"document", "high"}, ".docx": {"document", "high"},
    ".txt": {"document", "low"}, ".md": {"document", "low"},

    // Code
    ".rb": {"code", "high"}, ".py": {"code", "high"}, ".js": {"code", "high"},
    ".go": {"code", "high"}, ".java": {"code", "high"},

    // Config
    ".json": {"config", "medium"}, ".yaml": {"config", "medium"},
    ".env": {"config", "high"},

    // Archives
    ".zip": {"archive", "high"}, ".tar": {"archive", "high"},
    ".dmg": {"archive", "high"},

    // Data
    ".sqlite": {"data", "high"}, ".db": {"data", "high"},
    ".csv": {"data", "medium"},
}
Confidence levels:
- High: Extension strongly indicates the category (.jpg is definitely a photo)
- Medium: Extension suggests the category but could have multiple uses (.json might be config, data, or just a file)
- Low: Extension is ambiguous (.txt could be documentation, code, or notes)
Files with low confidence or no extension get flagged with needs_review = true so you can triage them later.
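The lookup and flagging themselves aren’t shown in the post; presumably they amount to a few lines like this sketch (classifyFile is a hypothetical name of mine):

// Hypothetical helper: map an extension to category/confidence, flagging
// low-confidence or unknown extensions for review.
func classifyFile(ext string) (category, confidence string, needsReview bool) {
    if c, ok := categoryByExt[strings.ToLower(ext)]; ok {
        return c.category, c.confidence, c.confidence == "low"
    }
    return "unknown", "low", true
}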
Full Category Mapping
Here’s the complete breakdown of what gets classified and how:
| Category | Extensions | Confidence | Notes |
|---|---|---|---|
| photo | jpg, jpeg, png, gif, heic, heif, webp, raw, cr2, nef, arw, dng, tiff, tif, bmp | high | All common photo formats including RAW |
| video | mov, mp4, avi, mkv, m4v, webm, wmv, flv, mpg, mpeg, 3gp | high | Video containers and codecs |
| audio | mp3, wav, aiff, flac, m4a, ogg, wma, aac | high | Lossless and lossy audio |
| document | pdf, doc, docx, xls, xlsx, ppt, pptx, pages, numbers, key, odt, rtf | high (most) | Office docs and alternatives |
| document | txt, md, markdown | low | Plain text could be anything |
| code | rb, py, js, ts, jsx, tsx, go, rs, java, c, cpp, h, swift, php, html, css, sql, sh | high | Programming languages |
| config | json, xml, yaml, yml, toml, ini, cfg, conf, plist | medium | Configuration and data |
| config | .env | high | Environment vars (also marked sensitive) |
| archive | zip, rar, tar, gz, 7z, dmg, iso, pkg, deb, rpm, bz2, xz | high | Compressed archives |
| data | sqlite, db | high | Databases |
| data | csv, tsv | medium | Structured data |
| unknown | (no extension or not in map) | low | Flagged for review |
Adapt This: Custom Categories
Your archive might need different categories. Maybe you have a lot of 3D models (.obj, .stl, .blend) or music production files (.mid, .flp, .logic). Add them to the map:
// 3D and CAD
".obj": {"3d_model", "high"}, ".stl": {"3d_model", "high"},
".blend": {"3d_model", "high"}, ".dwg": {"cad", "high"},

// Music production
".mid": {"music_production", "high"}, ".flp": {"music_production", "high"},
".logic": {"music_production", "high"},
// Note: .aiff already maps to "audio" above; a Go map literal can't repeat a
// key, so change the existing entry rather than adding a duplicate.
Recompile and re-scan (or just update existing records with a SQL migration).
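For the 3D formats above, such a migration might look like this (my example; it assumes extensions are stored with their leading dot, as in the category map):

-- Recategorize already-indexed 3D files without re-scanning
UPDATE files
SET category = '3d_model', category_confidence = 'high'
WHERE extension IN ('.obj', '.stl', '.blend');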
Sensitive File Detection
Some files should never be uploaded to the cloud without careful consideration:
var sensitivePatterns = []string{
    ".ssh", ".aws", ".gnupg", ".gpg", "credentials", ".env",
    "DotFiles", "dotfiles", "private_key", "id_rsa", "id_ed25519",
    ".keychain", "Keychain", "wallet", ".bitcoin", ".ethereum",
}

// Check for sensitive paths
for _, pattern := range sensitivePatterns {
    if strings.Contains(path, pattern) {
        record.IsSensitive = true
        break
    }
}
Any file whose path contains these patterns gets flagged with is_sensitive = true. Later, the web application and backup system can filter these out automatically.
Skip Patterns: What Not to Index
You don’t want to index everything. System directories, build artifacts, and dependency folders add millions of files that you don’t care about:
var skipDirs = map[string]bool{
    "Applications":    true, // macOS apps
    "Library":         true, // macOS system files
    "bin":             true, // Unix binaries
    "sbin":            true, // System binaries
    "usr":             true, // Unix system files
    "etc":             true, // Config files
    "private":         true, // System private
    "dev":             true, // Device files
    ".Spotlight-V100": true, // macOS indexing
    ".fseventsd":      true, // macOS file events
    ".Trashes":        true, // macOS trash
    "node_modules":    true, // JavaScript dependencies
    ".git":            true, // Git internals
}
When the walker encounters a directory with one of these names, it returns filepath.SkipDir, which tells the walker to skip the entire subtree. This saves enormous amounts of time and database space.
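The shouldSkipDir helper referenced in the walker isn’t shown in the post; presumably it’s nothing more than a map lookup:

// Presumed implementation: skipDirs is keyed by bare directory name,
// so the lookup is a single map access.
func shouldSkipDir(name string) bool {
    return skipDirs[name]
}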
Why skip these:
| Directory | Reason |
|---|---|
| Applications, Library | macOS system files, rebuilt with the OS |
| node_modules | JavaScript dependencies, hundreds of thousands of tiny files, reconstructible from package.json |
| .git | Version control internals, large and reconstructible |
| bin, sbin, usr, etc | System directories, not personal data |
| .Spotlight-V100, .fseventsd | macOS indexing caches, constantly changing |
| .Trashes | Deleted files, not worth indexing |
Adapt This: Custom Skip Patterns
Your archive might have other directories you want to skip. Common additions:
"vendor": true, // Go dependencies
".bundle": true, // Ruby bundler
"__pycache__": true, // Python bytecode
".next": true, // Next.js build
".cache": true, // Various caches
"tmp": true, // Temporary files
"Dropbox/.dropbox.cache": true, // Dropbox cache
Add them to the map before scanning.
Resume Capability
The scanner can resume interrupted scans by loading existing file paths from the database:
// Load existing paths for resume capability
fmt.Println("Loading existing files for resume...")
existingPaths := make(map[string]bool)

rows, err := pool.Query(ctx, "SELECT path FROM files")
if err != nil {
    fmt.Printf("Error loading existing paths: %v\n", err)
    os.Exit(1)
}
for rows.Next() {
    var path string
    if err := rows.Scan(&path); err == nil {
        existingPaths[path] = true
    }
}
rows.Close()

fmt.Printf("Found %d existing files, will skip them\n", len(existingPaths))
During the walk, any file whose path is already in existingPaths gets skipped:
// Skip already indexed files (resume capability)
if existingPaths[path] {
    skippedFiles.Add(1)
    return nil
}
This means you can stop and restart the scanner at any time. If you realize you forgot to plug in an external drive, just stop the scanner, mount the drive, and run it again with the same database. Only new files get processed.
Batched PostgreSQL Inserts
Inserting 3.5 million rows one at a time would take days. The scanner uses PostgreSQL’s COPY protocol to insert records in batches of 1,000:
func insertBatch(ctx context.Context, pool *pgxpool.Pool, records []FileRecord) error {
    if len(records) == 0 {
        return nil
    }

    _, err := pool.CopyFrom(
        ctx,
        pgx.Identifier{"files"},
        []string{
            "path", "filename", "extension", "size_bytes", "created_at", "modified_at",
            "is_dir", "content_hash", "category", "category_source", "category_confidence",
            "needs_review", "review_reason", "is_sensitive", "media_width", "media_height",
            "exif_data", "exif_date_taken", "gps_lat", "gps_lon",
        },
        pgx.CopyFromSlice(len(records), func(i int) ([]interface{}, error) {
            r := records[i]

            // Convert EXIF data to JSON
            var exifJSON []byte
            if r.ExifData != nil {
                exifJSON, _ = json.Marshal(r.ExifData)
            }

            return []interface{}{
                r.Path, r.Filename, r.Extension, r.SizeBytes, r.CreatedAt, r.ModifiedAt,
                r.IsDir, r.ContentHash, r.Category, r.CategorySource, r.CategoryConfidence,
                r.NeedsReview, r.ReviewReason, r.IsSensitive, r.MediaWidth, r.MediaHeight,
                exifJSON, r.ExifDateTaken, r.GpsLat, r.GpsLon,
            }, nil
        }),
    )
    return err
}
COPY is dramatically faster than INSERT:
- Regular INSERT: ~1,000 rows/second
- Batched INSERT with transactions: ~10,000 rows/second
- COPY protocol: ~100,000 rows/second
At 3.5 million files, that’s the difference between roughly an hour of single-row INSERTs and well under a minute with COPY.
CLI Flags
The scanner exposes several configuration options:
scanner -source <directory> \
    -db "postgres://user:pass@localhost/archive" \
    -workers 8 \
    -batch 1000 \
    -skip-hash \
    -skip-exif \
    -verbose
Flags:
- -source: Root directory to scan (required)
- -db: PostgreSQL connection string (default: postgres://localhost/avi_archive)
- -workers: Number of parallel workers for hashing and EXIF (default: 8)
- -batch: Batch size for database inserts (default: 1000)
- -skip-hash: Skip content hashing (faster but no deduplication)
- -skip-exif: Skip EXIF extraction (faster but no photo metadata)
- -verbose: Print errors and warnings during scan
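For reference, a sketch of how these flags would be declared with Go’s standard flag package. The variable names are chosen to match the code excerpts above; this is an assumption, not the scanner’s exact source:

// Flag declarations matching the CLI described above (names assumed).
var (
    sourceDir = flag.String("source", "", "root directory to scan (required)")
    dbURL     = flag.String("db", "postgres://localhost/avi_archive", "PostgreSQL connection string")
    workers   = flag.Int("workers", 8, "parallel workers for hashing and EXIF")
    batchSize = flag.Int("batch", 1000, "batch size for database inserts")
    skipHash  = flag.Bool("skip-hash", false, "skip content hashing")
    skipExif  = flag.Bool("skip-exif", false, "skip EXIF extraction")
    verbose   = flag.Bool("verbose", false, "print errors and warnings during scan")
)

func main() {
    flag.Parse()
    // ...
}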
Tuning workers:
- More workers = faster scanning if you’re CPU-bound
- Fewer workers = less disk thrashing if you’re I/O-bound
- Start with 8, adjust based on system load
When to skip hashing:
If you’re doing an initial exploratory scan and don’t care about deduplication yet, -skip-hash cuts scan time roughly in half. You can always re-scan later with hashing enabled.
When to skip EXIF:
EXIF extraction is relatively fast (milliseconds per image), but if you have millions of images and don’t care about camera metadata or GPS coordinates, -skip-exif saves time.
Running the Scanner
First run:
# Compile
cd scanner
go build -o scanner main.go
# Run
./scanner -source /Volumes/ExternalDrive/Backups
Scanning multiple drives:
Run the scanner once per drive, pointing to the same database:
./scanner -source /Volumes/Drive1/Backups -db "postgres://localhost/archive"
./scanner -source /Volumes/Drive2/OldMac -db "postgres://localhost/archive"
./scanner -source ~/Documents -db "postgres://localhost/archive"
The scanner appends to the database, it doesn’t overwrite. Resume capability ensures files aren’t duplicated if you scan the same directory twice.
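If you also want the database to enforce that guarantee, a unique index on path is a cheap safety net (my addition; the scanner itself relies only on the in-memory path set):

-- Optional: reject duplicate paths at the database level
CREATE UNIQUE INDEX IF NOT EXISTS files_path_unique ON files (path);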
Monitoring progress:
The scanner shows a live progress bar with files per second:
Scanning [=>---------------------------] 347,423 files (1,247/s)
On completion, it prints a summary:
=== Scan Complete ===
Files processed: 3,489,234
Files skipped: 127,845
Errors: 234
Total size: 1.47 TB
Database rows: 3,489,234
Handling Edge Cases
Permission Errors
Some files are protected by the OS. The scanner handles these gracefully:
filepath.WalkDir(*sourceDir, func(path string, d fs.DirEntry, err error) error {
    if err != nil {
        if *verbose {
            fmt.Printf("\nError accessing %s: %v\n", path, err)
        }
        errorFiles.Add(1)
        return nil // Continue walking
    }
    // ...
})
If a file can’t be accessed:
- The error is logged (if -verbose is enabled)
- The error counter increments
- The walker continues with the next file
You’ll see permission errors for system files, files owned by other users, and occasionally corrupted filesystem entries. This is expected and harmless.
Symlinks
Go’s filepath.WalkDir does not follow symlinks: a link is reported as a single entry, and symlinked directories are never descended into, so circular links can’t send the walker into a loop.
Current behavior: The scanner indexes the symlink entry itself, not its target. Files reachable only through a link won’t be indexed, and a file linked from several places is still indexed once, at its real path.
Alternative approach: Detect links by checking d.Type()&fs.ModeSymlink != 0 and resolve them explicitly (for example with filepath.EvalSymlinks), keeping a visited-path set to guard against cycles, if you need link targets indexed.
Choose based on your use case. For most personal archives, the default behavior is fine.
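If you just want links counted and set aside, a minimal sketch of the check inside the WalkDir callback (counter name as used earlier):

// Inside the WalkDir callback: treat symlinks explicitly rather than
// indexing the link entry. Resolving targets is left out; add
// filepath.EvalSymlinks plus a visited set if you need that.
if d.Type()&fs.ModeSymlink != 0 {
    skippedFiles.Add(1)
    return nil
}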
Corrupt EXIF Data
Real-world photos sometimes have corrupt or invalid EXIF data. The scanner validates dimensions to filter out garbage:
// Dimensions (with validation to filter corrupt EXIF data)
var width, height *int
if w, err := x.Get(exif.PixelXDimension); err == nil {
    if val, err := w.Int(0); err == nil && val > 0 && val < 100000 {
        width = &val
        data["width"] = val
    }
}
// height mirrors this with exif.PixelYDimension
Why validate dimensions:
Some corrupt EXIF blocks report widths like 4,294,967,295 (uint32 max value) or negative dimensions. The validator ensures dimensions are sane (positive and less than 100,000 pixels in either direction).
If EXIF extraction fails entirely, the function returns nil values and the scanner continues. You still get the file indexed, you just don’t get EXIF metadata.
Large Files
The scanner hashes files by streaming them through io.Copy. This means even multi-gigabyte video files get hashed without loading the entire file into memory:
h := xxhash.New()
if _, err := io.Copy(h, f); err != nil {
    return "", err
}
Go’s io.Copy uses a 32KB buffer by default, so memory usage stays constant regardless of file size.
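If you want to experiment with a larger buffer on fast sequential media, io.CopyBuffer accepts one. This is a tuning idea of mine, not something the scanner does:

// Variant of the copy inside hashFile (assumption: not in the original scanner).
// io.CopyBuffer behaves like io.Copy but uses the buffer you supply.
buf := make([]byte, 1<<20) // 1 MiB read buffer
if _, err := io.CopyBuffer(h, f, buf); err != nil {
    return "", err
}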
Watch Out: Database Connection Pooling
The scanner uses pgxpool for database connections, which maintains a pool of connections and reuses them across goroutines. The default pool size is based on the number of CPU cores.
If you see connection errors:
Error inserting batch: too many connections
You’re probably running the scanner on a machine with many cores (or with high -workers), and PostgreSQL is hitting its connection limit (default 100 connections).
Fix:
- Reduce -workers to limit concurrency
- Increase PostgreSQL's max_connections setting
- Configure the pool explicitly in code:
config, _ := pgxpool.ParseConfig(dbURL)
config.MaxConns = 10 // Limit pool size
pool, _ := pgxpool.NewWithConfig(ctx, config)
For most setups, the defaults work fine.
Dependencies
The scanner uses four external libraries:
require (
    github.com/jackc/pgx/v5 v5.8.0            // PostgreSQL driver
    github.com/cespare/xxhash/v2 v2.3.0       // Fast content hashing
    github.com/rwcarlsen/goexif v0.0.0        // EXIF extraction
    github.com/schollz/progressbar/v3 v3.19.0 // Progress display
)
Why these:
- pgx: Pure Go PostgreSQL driver with excellent performance and support for COPY protocol
- xxhash: Fastest non-cryptographic hash function with good distribution
- goexif: Mature EXIF library that handles edge cases gracefully
- progressbar: Clean, simple progress indication
No dependencies on C libraries, no cgo, no external binaries. Just Go and PostgreSQL.
Results: What to Expect
My scan results:
- Files: 3.5 million
- Size: 1.47 TB
- Duration: ~4 hours (includes full content hashing and EXIF extraction)
- Errors: 234 (mostly permission denied on system files)
- Database size: ~25 GB (including indexes and JSONB data)
Your results will vary based on:
- Drive speed (SSD vs HDD makes an enormous difference)
- File size distribution (millions of tiny files are slower than thousands of large files)
- Number of workers (more parallelism helps on fast drives)
- Whether you’re hashing and extracting EXIF
Next: Database Design - How the PostgreSQL schema supports fuzzy search, geographic queries, and AI-ready embeddings.