The Scanner: High-Performance Filesystem Indexing

You need to index 3.5 million files. You need to hash 1.47 TB of content. You need EXIF data from hundreds of thousands of photos. You need to classify everything by type, detect sensitive files, and handle corrupted data gracefully. And you need to do all of this in hours, not days.

This is why the scanner is written in Go.

Why Go for Filesystem Scanning

Three reasons:

1. Concurrency is built into the language. Goroutines make it trivial to scan thousands of files in parallel. You don’t fight with threads or process pools. You just spin up a goroutine for each file and let Go’s runtime handle the scheduling.

2. Single binary deployment. You compile once and get a binary that runs anywhere. No Ruby version managers, no Python virtual environments, no dependency hell. Just copy the scanner to any machine and run it.

3. Performance where it matters. File I/O is usually the bottleneck when scanning millions of files, but the CPU-bound work (hashing, EXIF parsing, classification) still benefits enormously from Go’s efficiency. The scanner can hash and extract metadata faster than most drives can feed it files.

Real numbers: scanning 3.5 million files with full content hashing and EXIF extraction took about 4 hours on a 2019 MacBook Pro. That’s roughly 240 files per second, sustained.

Architecture: Parallel Pipeline with Controlled Concurrency

The scanner uses a producer-consumer pattern with a semaphore to limit concurrent operations:

// Semaphore for limiting concurrent hash operations
sem := make(chan struct{}, *workers)

// Walk the directory tree
filepath.WalkDir(*sourceDir, func(path string, d fs.DirEntry, err error) error {
    if err != nil {
        errorFiles.Add(1)
        return nil
    }

    // Skip excluded directories
    if d.IsDir() {
        if shouldSkipDir(d.Name()) {
            skippedFiles.Add(1)
            return filepath.SkipDir
        }
        return nil
    }

    // Skip already indexed files (resume capability)
    if existingPaths[path] {
        skippedFiles.Add(1)
        return nil
    }

    wg.Add(1)
    sem <- struct{}{} // Acquire semaphore

    go func(path string, d fs.DirEntry) {
        defer wg.Done()
        defer func() { <-sem }() // Release semaphore

        record, err := processFile(path, d, *skipHash, *skipExif)
        if err != nil {
            errorFiles.Add(1)
            return
        }

        recordChan <- record
        processedFiles.Add(1)
        totalBytes.Add(record.SizeBytes)
        bar.Add(1)
    }(path, d)

    return nil
})

What’s happening here:

  • filepath.WalkDir traverses the filesystem depth-first, calling the function for each file
  • For each file, we spawn a goroutine to process it (hash, EXIF, classify)
  • The semaphore (sem) limits how many goroutines can run concurrently (default 8 workers)
  • Processed records go into a buffered channel (recordChan)
  • A separate writer goroutine batches records and inserts them into PostgreSQL

This architecture keeps every stage busy. While eight workers hash and extract EXIF, the directory walker keeps finding new files, pausing only when the semaphore is full. While files are being processed, the database writer batches inserts. Everything flows.

Feature Breakdown

Content Hashing with xxHash64

Deduplication requires content hashing. The question is which algorithm.

MD5 is fast but has known collision issues. SHA-256 is cryptographically secure but overkill for this use case. We’re not verifying signatures; we’re finding duplicate files.

xxHash64 is a non-cryptographic hash that’s extremely fast and has excellent distribution properties. It’s what we use:

func hashFile(path string) (string, error) {
    f, err := os.Open(path)
    if err != nil {
        return "", err
    }
    defer f.Close()

    h := xxhash.New()
    if _, err := io.Copy(h, f); err != nil {
        return "", err
    }

    return fmt.Sprintf("%016x", h.Sum64()), nil
}

On a modern SSD, this can hash about 500 MB/s per worker. Eight workers means roughly 4 GB/s of hashing throughput, which is faster than most drives can sustain for random file reads.

Why xxHash64 specifically:

  • Speed: 10x faster than MD5, 50x faster than SHA-256
  • Quality: Excellent avalanche properties, extremely low collision rate for non-adversarial data
  • Simplicity: Returns a 64-bit integer (16 hex characters when formatted)

For a personal archive where you control the input and aren’t defending against malicious actors, xxHash64 is perfect.

EXIF Extraction

Photos carry their own metadata. You want that metadata:

func extractExif(path string) (map[string]interface{}, *time.Time, *float64, *float64, *int, *int) {
    f, err := os.Open(path)
    if err != nil {
        return nil, nil, nil, nil, nil, nil
    }
    defer f.Close()

    x, err := exif.Decode(f)
    if err != nil {
        return nil, nil, nil, nil, nil, nil
    }

    data := make(map[string]interface{})

    // Camera info ("mk" avoids shadowing the builtin make)
    if mk, err := x.Get(exif.Make); err == nil {
        data["make"] = strings.TrimSpace(mk.String())
    }
    if model, err := x.Get(exif.Model); err == nil {
        data["model"] = strings.TrimSpace(model.String())
    }

    // Exposure settings
    if iso, err := x.Get(exif.ISOSpeedRatings); err == nil {
        if val, err := iso.Int(0); err == nil {
            data["iso"] = val
        }
    }

    // Date taken
    var dateTaken *time.Time
    if dt, err := x.Get(exif.DateTimeOriginal); err == nil {
        if t, err := time.Parse("2006:01:02 15:04:05", strings.Trim(dt.String(), "\"")); err == nil {
            dateTaken = &t
            data["date_taken"] = t.Format(time.RFC3339)
        }
    }

    // GPS
    var lat, lon *float64
    if gpsLat, gpsLon, err := x.LatLong(); err == nil {
        lat = &gpsLat
        lon = &gpsLon
        data["gps_lat"] = gpsLat
        data["gps_lon"] = gpsLon
    }

    // Dimensions (validated; see "Corrupt EXIF Data" below)
    var width, height *int
    if w, err := x.Get(exif.PixelXDimension); err == nil {
        if val, err := w.Int(0); err == nil && val > 0 && val < 100000 {
            width = &val
            data["width"] = val
        }
    }
    if h, err := x.Get(exif.PixelYDimension); err == nil {
        if val, err := h.Int(0); err == nil && val > 0 && val < 100000 {
            height = &val
            data["height"] = val
        }
    }

    return data, dateTaken, lat, lon, width, height
}

What we extract:

  • Camera info: Make and model (Canon EOS 5D, iPhone 12 Pro)
  • Date taken: When the photo was actually captured, not when the file was last modified
  • GPS coordinates: Where the photo was taken (if available)
  • Exposure settings: ISO, aperture, shutter speed, focal length
  • Dimensions: Width and height from EXIF (validated to handle corrupt data)

All of this goes into a JSONB column in PostgreSQL, plus dedicated columns for date_taken and GPS coordinates so you can query them efficiently.

Category Classification

The scanner automatically categorizes files by extension:

var categoryByExt = map[string]struct {
    category   string
    confidence string
}{
    // Photos
    ".jpg": {"photo", "high"}, ".jpeg": {"photo", "high"}, ".png": {"photo", "high"},
    ".gif": {"photo", "high"}, ".heic": {"photo", "high"}, ".raw": {"photo", "high"},

    // Videos
    ".mov": {"video", "high"}, ".mp4": {"video", "high"}, ".avi": {"video", "high"},
    ".mkv": {"video", "high"}, ".m4v": {"video", "high"},

    // Audio
    ".mp3": {"audio", "high"}, ".wav": {"audio", "high"}, ".flac": {"audio", "high"},
    ".m4a": {"audio", "high"},

    // Documents
    ".pdf": {"document", "high"}, ".docx": {"document", "high"},
    ".txt": {"document", "low"}, ".md": {"document", "low"},

    // Code
    ".rb": {"code", "high"}, ".py": {"code", "high"}, ".js": {"code", "high"},
    ".go": {"code", "high"}, ".java": {"code", "high"},

    // Config
    ".json": {"config", "medium"}, ".yaml": {"config", "medium"},
    ".env": {"config", "high"},

    // Archives
    ".zip": {"archive", "high"}, ".tar": {"archive", "high"},
    ".dmg": {"archive", "high"},

    // Data
    ".sqlite": {"data", "high"}, ".db": {"data", "high"},
    ".csv": {"data", "medium"},
}

Confidence levels:

  • High: Extension strongly indicates the category (.jpg is definitely a photo)
  • Medium: Extension suggests the category but could have multiple uses (.json might be config, data, or just a file)
  • Low: Extension is ambiguous (.txt could be documentation, code, or notes)

Files with low confidence or no extension get flagged with needs_review = true so you can triage them later.

Full Category Mapping

Here’s the complete breakdown of what gets classified and how:

| Category | Extensions | Confidence | Notes |
|----------|------------|------------|-------|
| photo | jpg, jpeg, png, gif, heic, heif, webp, raw, cr2, nef, arw, dng, tiff, tif, bmp | high | All common photo formats including RAW |
| video | mov, mp4, avi, mkv, m4v, webm, wmv, flv, mpg, mpeg, 3gp | high | Video containers and codecs |
| audio | mp3, wav, aiff, flac, m4a, ogg, wma, aac | high | Lossless and lossy audio |
| document | pdf, doc, docx, xls, xlsx, ppt, pptx, pages, numbers, key, odt, rtf | high (most) | Office docs and alternatives |
| document | txt, md, markdown | low | Plain text could be anything |
| code | rb, py, js, ts, jsx, tsx, go, rs, java, c, cpp, h, swift, php, html, css, sql, sh | high | Programming languages |
| config | json, xml, yaml, yml, toml, ini, cfg, conf, plist | medium | Configuration and data |
| config | .env | high | Environment vars (also marked sensitive) |
| archive | zip, rar, tar, gz, 7z, dmg, iso, pkg, deb, rpm, bz2, xz | high | Compressed archives |
| data | sqlite, db | high | Databases |
| data | csv, tsv | medium | Structured data |
| unknown | (no extension or not in map) | low | Flagged for review |

Adapt This: Custom Categories

Your archive might need different categories. Maybe you have a lot of 3D models (.obj, .stl, .blend) or music production files (.mid, .flp, .logic). Add them to the map:

// 3D and CAD
".obj": {"3d_model", "high"}, ".stl": {"3d_model", "high"},
".blend": {"3d_model", "high"}, ".dwg": {"cad", "high"},

// Music production
".mid": {"music_production", "high"}, ".flp": {"music_production", "high"},
".logic": {"music_production", "high"}, ".aiff": {"music_production", "high"},

Recompile and re-scan (or just update existing records with a SQL migration).

Sensitive File Detection

Some files should never be uploaded to the cloud without careful consideration:

var sensitivePatterns = []string{
    ".ssh", ".aws", ".gnupg", ".gpg", "credentials", ".env",
    "DotFiles", "dotfiles", "private_key", "id_rsa", "id_ed25519",
    ".keychain", "Keychain", "wallet", ".bitcoin", ".ethereum",
}

// Check for sensitive paths
for _, pattern := range sensitivePatterns {
    if strings.Contains(path, pattern) {
        record.IsSensitive = true
        break
    }
}

Any file whose path contains these patterns gets flagged with is_sensitive = true. Later, the web application and backup system can filter these out automatically.

Skip Patterns: What Not to Index

You don’t want to index everything. System directories, build artifacts, and dependency folders add millions of files that you don’t care about:

var skipDirs = map[string]bool{
    "Applications":    true,  // macOS apps
    "Library":         true,  // macOS system files
    "bin":             true,  // Unix binaries
    "sbin":            true,  // System binaries
    "usr":             true,  // Unix system files
    "etc":             true,  // Config files
    "private":         true,  // System private
    "dev":             true,  // Device files
    ".Spotlight-V100": true,  // macOS indexing
    ".fseventsd":      true,  // macOS file events
    ".Trashes":        true,  // macOS trash
    "node_modules":    true,  // JavaScript dependencies
    ".git":            true,  // Git internals
}

When the walker encounters a directory with one of these names, it returns filepath.SkipDir, which tells the walker to skip the entire subtree. This saves enormous amounts of time and database space.

Why skip these:

| Directory | Reason |
|-----------|--------|
| Applications, Library | macOS system files, rebuilt with the OS |
| node_modules | JavaScript dependencies, hundreds of thousands of tiny files, reconstructible from package.json |
| .git | Version control internals, large and reconstructible |
| bin, sbin, usr, etc | System directories, not personal data |
| .Spotlight-V100, .fseventsd | macOS indexing caches, constantly changing |
| .Trashes | Deleted files, not worth indexing |

Adapt This: Custom Skip Patterns

Your archive might have other directories you want to skip. Common additions:

"vendor":          true,  // Go dependencies
".bundle":         true,  // Ruby bundler
"__pycache__":     true,  // Python bytecode
".next":           true,  // Next.js build
".cache":          true,  // Various caches
"tmp":             true,  // Temporary files
"Dropbox/.dropbox.cache": true,  // Dropbox cache

Add them to the map before scanning.

Resume Capability

The scanner can resume interrupted scans by loading existing file paths from the database:

// Load existing paths for resume capability
fmt.Println("Loading existing files for resume...")
existingPaths := make(map[string]bool)
rows, err := pool.Query(ctx, "SELECT path FROM files")
if err != nil {
    fmt.Printf("Error loading existing paths: %v\n", err)
    os.Exit(1)
}
for rows.Next() {
    var path string
    if err := rows.Scan(&path); err == nil {
        existingPaths[path] = true
    }
}
rows.Close()
fmt.Printf("Found %d existing files, will skip them\n", len(existingPaths))

During the walk, any file whose path is already in existingPaths gets skipped:

// Skip already indexed files (resume capability)
if existingPaths[path] {
    skippedFiles.Add(1)
    return nil
}

This means you can stop and restart the scanner at any time. If you realize you forgot to plug in an external drive, just stop the scanner, mount the drive, and run it again with the same database. Only new files get processed.

Batched PostgreSQL Inserts

Inserting 3.5 million rows one at a time would take days. The scanner uses PostgreSQL’s COPY protocol to insert records in batches of 1,000:

func insertBatch(ctx context.Context, pool *pgxpool.Pool, records []FileRecord) error {
    if len(records) == 0 {
        return nil
    }

    _, err := pool.CopyFrom(
        ctx,
        pgx.Identifier{"files"},
        []string{
            "path", "filename", "extension", "size_bytes", "created_at", "modified_at",
            "is_dir", "content_hash", "category", "category_source", "category_confidence",
            "needs_review", "review_reason", "is_sensitive", "media_width", "media_height",
            "exif_data", "exif_date_taken", "gps_lat", "gps_lon",
        },
        pgx.CopyFromSlice(len(records), func(i int) ([]interface{}, error) {
            r := records[i]

            // Convert EXIF data to JSON
            var exifJSON []byte
            if r.ExifData != nil {
                exifJSON, _ = json.Marshal(r.ExifData)
            }

            return []interface{}{
                r.Path, r.Filename, r.Extension, r.SizeBytes, r.CreatedAt, r.ModifiedAt,
                r.IsDir, r.ContentHash, r.Category, r.CategorySource, r.CategoryConfidence,
                r.NeedsReview, r.ReviewReason, r.IsSensitive, r.MediaWidth, r.MediaHeight,
                exifJSON, r.ExifDateTaken, r.GpsLat, r.GpsLon,
            }, nil
        }),
    )

    return err
}

COPY is dramatically faster than INSERT:

  • Regular INSERT: ~1,000 rows/second
  • Batched INSERT with transactions: ~10,000 rows/second
  • COPY protocol: ~100,000 rows/second

At 3.5 million files, COPY cuts insertion time from hours to minutes.

CLI Flags

The scanner exposes several configuration options:

scanner -source <directory> \
        -db "postgres://user:pass@localhost/archive" \
        -workers 8 \
        -batch 1000 \
        -skip-hash \
        -skip-exif \
        -verbose

Flags:

  • -source: Root directory to scan (required)
  • -db: PostgreSQL connection string (default: postgres://localhost/avi_archive)
  • -workers: Number of parallel workers for hashing and EXIF (default: 8)
  • -batch: Batch size for database inserts (default: 1000)
  • -skip-hash: Skip content hashing (faster but no deduplication)
  • -skip-exif: Skip EXIF extraction (faster but no photo metadata)
  • -verbose: Print errors and warnings during scan

Tuning workers:

  • More workers = faster scanning if you’re CPU-bound
  • Fewer workers = less disk thrashing if you’re I/O-bound
  • Start with 8, adjust based on system load

When to skip hashing:

If you’re doing an initial exploratory scan and don’t care about deduplication yet, -skip-hash cuts scan time roughly in half. You can always re-scan later with hashing enabled.

When to skip EXIF:

EXIF extraction is relatively fast (milliseconds per image), but if you have millions of images and don’t care about camera metadata or GPS coordinates, -skip-exif saves time.

Running the Scanner

First run:

# Compile
cd scanner
go build -o scanner main.go

# Run
./scanner -source /Volumes/ExternalDrive/Backups

Scanning multiple drives:

Run the scanner once per drive, pointing to the same database:

./scanner -source /Volumes/Drive1/Backups -db "postgres://localhost/archive"
./scanner -source /Volumes/Drive2/OldMac -db "postgres://localhost/archive"
./scanner -source ~/Documents -db "postgres://localhost/archive"

The scanner appends to the database; it doesn’t overwrite. Resume capability ensures files aren’t duplicated if you scan the same directory twice.

Monitoring progress:

The scanner shows a live progress bar with files per second:

Scanning [=>---------------------------]  347,423 files (1,247/s)

On completion, it prints a summary:

=== Scan Complete ===
Files processed: 3,489,234
Files skipped:   127,845
Errors:          234
Total size:      1.47 TB
Database rows:   3,489,234

Handling Edge Cases

Permission Errors

Some files are protected by the OS. The scanner handles these gracefully:

filepath.WalkDir(*sourceDir, func(path string, d fs.DirEntry, err error) error {
    if err != nil {
        if *verbose {
            fmt.Printf("\nError accessing %s: %v\n", path, err)
        }
        errorFiles.Add(1)
        return nil  // Continue walking
    }
    // ...
})

If a file can’t be accessed:

  1. The error is logged (if -verbose is enabled)
  2. The error counter increments
  3. The walker continues with the next file

You’ll see permission errors for system files, files owned by other users, and occasionally corrupted filesystem entries. This is expected and harmless.

Symlinks

Go’s filepath.WalkDir does not follow symlinks: it won’t descend into symlinked directories, and symlinked files are reported as symlink entries rather than regular files. However, os.Open does follow symlinks, so when the scanner processes a symlinked file it hashes the target’s content. The same content can therefore appear in the index under multiple paths, while circular directory links never trap the walker itself.

Current behavior: The scanner processes file symlinks and indexes the target’s content under the link’s path.

Alternative approach: Skip symlinks entirely by checking d.Type()&fs.ModeSymlink != 0 and returning early. This keeps only real files in the index, but linked files won’t be indexed.

Choose based on your use case. For most personal archives, the default behavior is fine.

Corrupt EXIF Data

Real-world photos sometimes have corrupt or invalid EXIF data. The scanner validates dimensions to filter out garbage:

// Dimensions (with validation to filter corrupt EXIF data)
var width, height *int
if w, err := x.Get(exif.PixelXDimension); err == nil {
    if val, err := w.Int(0); err == nil && val > 0 && val < 100000 {
        width = &val
        data["width"] = val
    }
}
if h, err := x.Get(exif.PixelYDimension); err == nil {
    if val, err := h.Int(0); err == nil && val > 0 && val < 100000 {
        height = &val
        data["height"] = val
    }
}

Why validate dimensions:

Some corrupt EXIF blocks report widths like 4,294,967,295 (uint32 max value) or negative dimensions. The validator ensures dimensions are sane (positive and less than 100,000 pixels in either direction).

If EXIF extraction fails entirely, the function returns nil values and the scanner continues. You still get the file indexed, you just don’t get EXIF metadata.

Large Files

The scanner hashes files by streaming them through io.Copy. This means even multi-gigabyte video files get hashed without loading the entire file into memory:

h := xxhash.New()
if _, err := io.Copy(h, f); err != nil {
    return "", err
}

Go’s io.Copy uses a 32KB buffer by default, so memory usage stays constant regardless of file size.

Watch Out: Database Connection Pooling

The scanner uses pgxpool for database connections, which maintains a pool of connections and reuses them across goroutines. The default pool size is based on the number of CPU cores.

If you see connection errors:

Error inserting batch: FATAL: sorry, too many clients already (SQLSTATE 53300)

You’re probably running the scanner on a machine with many cores (or with high -workers), and PostgreSQL is hitting its connection limit (default 100 connections).

Fix:

  1. Reduce -workers to limit concurrency
  2. Increase PostgreSQL’s max_connections setting
  3. Configure the pool explicitly in code:
config, _ := pgxpool.ParseConfig(dbURL)
config.MaxConns = 10  // Limit pool size
pool, _ := pgxpool.NewWithConfig(ctx, config)

For most setups, the defaults work fine.

Dependencies

The scanner uses four external libraries:

require (
    github.com/jackc/pgx/v5 v5.8.0          // PostgreSQL driver
    github.com/cespare/xxhash/v2 v2.3.0     // Fast content hashing
    github.com/rwcarlsen/goexif v0.0.0      // EXIF extraction
    github.com/schollz/progressbar/v3 v3.19.0 // Progress display
)

Why these:

  • pgx: Pure Go PostgreSQL driver with excellent performance and support for COPY protocol
  • xxhash: Fastest non-cryptographic hash function with good distribution
  • goexif: Mature EXIF library that handles edge cases gracefully
  • progressbar: Clean, simple progress indication

No dependencies on C libraries, no cgo, no external binaries. Just Go and PostgreSQL.

Results: What to Expect

My scan results:

  • Files: 3.5 million
  • Size: 1.47 TB
  • Duration: ~4 hours (includes full content hashing and EXIF extraction)
  • Errors: 234 (mostly permission denied on system files)
  • Database size: ~25 GB (including indexes and JSONB data)

Your results will vary based on:

  • Drive speed (SSD vs HDD makes an enormous difference)
  • File size distribution (millions of tiny files are slower than thousands of large files)
  • Number of workers (more parallelism helps on fast drives)
  • Whether you’re hashing and extracting EXIF

Next: Database Design - How the PostgreSQL schema supports fuzzy search, geographic queries, and AI-ready embeddings.


This site uses Just the Docs, a documentation theme for Jekyll.