Automate an HTML Disk Catalog: Tools and Scripts to Generate Index Pages

An HTML disk catalog is a web-style index of the files and directories stored on a disk, NAS, or removable media. Automating its creation saves time, ensures consistent formatting, and makes archives searchable and navigable from any modern browser. This article explains why automation helps, outlines common design choices, surveys tools and scripting approaches, and gives concrete examples for building automated index pages that are fast, secure, and maintainable.


Why automate an HTML disk catalog?

  • Manual generation is time-consuming and error-prone, especially for large or frequently changing file collections.
  • Automation ensures consistency: file listing format, metadata shown, and navigation structure.
  • Automated catalogs can include metadata (file size, dates, checksums), previews (thumbnails for images, audio/video players), and search or filter features.
  • An HTML catalog is platform-independent: accessible over a local file:// URL or served by any web server.

Design considerations

Before choosing tools or writing scripts, decide on these aspects:

  • Scope: single folder, entire disk, multiple volumes, or a network share.
  • Structure: one index page per folder vs. a single-page, all-in-one index.
  • Metadata: include file size, modification date, MIME type, owner, permissions, checksums (MD5/SHA256), EXIF data for photos, or ID3 tags for audio.
  • Sorting & filtering: client-side (JavaScript) or server-side.
  • Search: simple text filter, indexed search (Lunr.js), or server-backed search.
  • Thumbnails/previews: generate server-side during catalog build to avoid on-the-fly processing.
  • Security & privacy: exclude sensitive files, avoid exposing system files, and sanitize file names/paths in generated HTML.
  • Performance: paginate large listings or load items lazily.
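
The pagination choice in the last item can be sketched in a few lines. This is a minimal illustration only: the page size of 200 and the index.html, index-2.html naming scheme are assumptions made for the example, not conventions of any tool mentioned in this article.

```python
from typing import Iterator, List, Sequence, Tuple

PAGE_SIZE = 200  # hypothetical default; tune for your listings


def page_name(page_number: int) -> str:
    """Return the file name for a 1-based page number (assumed naming scheme)."""
    return "index.html" if page_number == 1 else f"index-{page_number}.html"


def paginate(entries: Sequence[str], page_size: int = PAGE_SIZE) -> Iterator[Tuple[str, List[str]]]:
    """Yield (page file name, entries on that page) pairs in order."""
    if page_size < 1:
        raise ValueError("page_size must be at least 1")
    for start in range(0, len(entries), page_size):
        number = start // page_size + 1
        yield page_name(number), list(entries[start:start + page_size])


pages = list(paginate([f"file{i}.txt" for i in range(450)]))
print([(name, len(chunk)) for name, chunk in pages])
# [('index.html', 200), ('index-2.html', 200), ('index-3.html', 50)]
```

Each page would then be rendered separately, with previous/next links built from page_name.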

Tools and approaches overview

  1. Static-site generators

    • Use tools like Jekyll, Hugo, or MkDocs with content pages generated from disk scans. Good when you want templating, theming, and easy deployment.

  2. Dedicated index generators

    • Programs like tree (which can emit HTML through its -H option or be wrapped by scripts), httrack-like archivers, or specialized indexers that produce HTML directly.

  3. Scripting (recommended for flexibility)

    • Write scripts in Python, Node.js, Bash, or PowerShell to traverse directories, collect metadata, and render HTML templates.

  4. Hybrid: server-side apps

    • Small web apps (Flask, Express, or static file servers with directory listing enhancements) that generate or serve dynamic indexes on request.

Which approach fits best depends on the job:

  • Quick one-off indexes: a Bash script using find + a simple HTML template.
  • Cross-platform recurring tasks: Python script with Jinja2 templates and optional thumbnail generation (Pillow).
  • Large photo/video archives: Python + exifread or exiftool for metadata + ffmpeg for video thumbnails.
  • Searchable catalogs: generate JSON index and use Lunr.js or Elasticlunr on the client side for instant search.
  • Windows environments: PowerShell script that exports directory trees and metadata into HTML with embedded icons/previews.

Example 1 — Python script that generates folder index pages

This example shows a concise Python approach using os, hashlib (for optional checksums), and Jinja2 for templating. It produces one index.html per directory, includes file size and mtime, and generates image thumbnails using Pillow.

Prerequisites:

  • Python 3.8+
  • pip install Jinja2 Pillow

Example script (save as generate_catalog.py):

#!/usr/bin/env python3
"""Generate one index.html per directory, with thumbnails for images."""
import hashlib  # available for optional checksums; not used in this minimal version
import os
import sys
from datetime import datetime

from jinja2 import Environment, select_autoescape
from PIL import Image

ROOT = sys.argv[1] if len(sys.argv) > 1 else '.'
OUTNAME = 'index.html'
THUMB_DIR = '.thumbs'
THUMB_SIZE = (240, 240)
IMAGE_EXTS = ('.png', '.jpg', '.jpeg', '.gif', '.webp')

# Autoescaping is enabled for string templates, so file names are HTML-escaped.
env = Environment(autoescape=select_autoescape(['html', 'xml'], default_for_string=True))

# The urlencode filter percent-encodes names so spaces, '#', or '?' do not break links.
template = env.from_string("""<!doctype html>
<html>
<head><meta charset="utf-8"><title>Index of {{ path }}</title>
<style>body{font-family:system-ui,Segoe UI,Roboto,Helvetica,Arial;} .file{display:flex;gap:.6rem;padding:.2rem 0}</style>
</head>
<body>
<h1>Index of {{ path }}</h1>
<ul>
{% for d in dirs %}
<li class="dir">📁 <a href="{{ d.rel|urlencode }}/index.html">{{ d.name }}/</a></li>
{% endfor %}
{% for f in files %}
<li class="file">
  {% if f.thumb %}
    <img src="{{ f.thumb|urlencode }}" alt="" style="width:72px;height:auto;margin-right:.5rem">
  {% endif %}
  <a href="{{ f.rel|urlencode }}">{{ f.name }}</a>
  <small> · {{ f.size }} · {{ f.mtime }}</small>
</li>
{% endfor %}
</ul>
</body></html>
""")


def human_size(n):
    """Format a byte count using binary (1024-based) units."""
    for unit in ['B', 'KB', 'MB', 'GB', 'TB']:
        if n < 1024.0:
            return f"{n:.1f}{unit}"
        n /= 1024.0
    return f"{n:.1f}PB"


def make_thumb(src_path, thumb_path):
    """Write a JPEG thumbnail; return False if the image cannot be processed."""
    try:
        os.makedirs(os.path.dirname(thumb_path), exist_ok=True)
        with Image.open(src_path) as im:
            im.thumbnail(THUMB_SIZE)
            # JPEG has no alpha or palette modes, so convert before saving.
            im.convert('RGB').save(thumb_path, format='JPEG', quality=80)
        return True
    except Exception:
        return False


for dirpath, dirnames, filenames in os.walk(ROOT):
    # Sort in place and prune thumbnail directories so they are not indexed.
    dirnames[:] = sorted(d for d in dirnames if d != THUMB_DIR)
    rel_dir = os.path.relpath(dirpath, ROOT)
    dirs = [{'name': d, 'rel': d} for d in dirnames]
    files = []
    for fname in sorted(filenames):
        if fname == OUTNAME:
            continue  # do not list the generated index itself
        full = os.path.join(dirpath, fname)
        st = os.stat(full)
        thumb = None
        if fname.lower().endswith(IMAGE_EXTS):
            thumb_path = os.path.join(dirpath, THUMB_DIR, fname + '.jpg')
            if make_thumb(full, thumb_path):
                # Forward slashes: this is a URL, not a filesystem path.
                thumb = f"{THUMB_DIR}/{fname}.jpg"
        files.append({
            'name': fname,
            # Each index.html sits next to its files, so the link is just the name.
            'rel': fname,
            'size': human_size(st.st_size),
            'mtime': datetime.fromtimestamp(st.st_mtime).isoformat(sep=' ', timespec='seconds'),
            'thumb': thumb,
        })
    out_html = template.render(path=rel_dir, dirs=dirs, files=files)
    with open(os.path.join(dirpath, OUTNAME), 'w', encoding='utf-8') as f:
        f.write(out_html)

Run:

  • python generate_catalog.py /path/to/archive

This produces index.html files and thumbnails under .thumbs/ for image files.
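
The script imports hashlib for optional checksums but never calls it. One way the checksum step might look is sketched below. The helper name file_sha256 and the 1 MiB chunk size are assumptions for illustration, not part of the original script.

```python
import hashlib

CHUNK_SIZE = 1024 * 1024  # read 1 MiB at a time so large files never load fully into memory


def file_sha256(path: str, chunk_size: int = CHUNK_SIZE) -> str:
    """Return the hex SHA-256 digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel keeps calling read() until it returns b"" at end of file.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

To use it, add a 'sha256': file_sha256(full) entry to each file dictionary and print {{ f.sha256 }} in the template. Hashing reads every byte of every file, so it is best kept behind an option for large archives.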


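Example 2 — Node.js and Lunr.js for a searchable single-page catalog

This example generates a single index.json file containing file metadata, then uses Lunr.js on the client to search that metadata (names, paths, types) instantly. Steps: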

  1. Use Node to walk directories and produce index.json.
  2. Create index.html which loads index.json and Lunr to provide a search box and dynamic results.

Node script (outline):

  • Use fs/promises, path, and a simple recursive walker.
  • Save index as an array with fields: path, name, size, mtime, type.
  • Optional: precompute tokens or tags for faster search.

Client HTML loads index.json and builds a Lunr index in the browser, or uses a lightweight library like Fuse.js for fuzzy search.
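
The outline above describes a Node script. To stay in the same language as Example 1, here is a rough Python equivalent of the index-building step. It is not the Node implementation itself: the function names build_index and write_index are hypothetical, and using mimetypes to fill the type field is an assumption.

```python
import json
import mimetypes
import os
from datetime import datetime, timezone


def build_index(root: str) -> list:
    """Walk `root` and return one record per file with the fields listed above."""
    records = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # deterministic traversal order
        for name in sorted(filenames):
            full = os.path.join(dirpath, name)
            st = os.stat(full)
            mime, _ = mimetypes.guess_type(name)
            records.append({
                # Forward slashes so the path works as a relative URL.
                "path": os.path.relpath(full, root).replace(os.sep, "/"),
                "name": name,
                "size": st.st_size,
                "mtime": datetime.fromtimestamp(st.st_mtime, tz=timezone.utc).isoformat(timespec="seconds"),
                "type": mime or "application/octet-stream",
            })
    return records


def write_index(root: str, out_path: str) -> int:
    """Write the index as JSON and return the number of records."""
    records = build_index(root)
    with open(out_path, "w", encoding="utf-8") as fh:
        json.dump(records, fh, ensure_ascii=False, indent=2)
    return len(records)
```

If index.json is written inside the scanned tree, later runs will list it as an ordinary file. Writing it elsewhere or filtering it out avoids that.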


Example 3 — PowerShell for Windows users

PowerShell can quickly export directory listings with metadata and build an HTML page using ConvertTo-Html. Example:

Get-ChildItem -Recurse -File |
    Select-Object FullName, Name, Length, LastWriteTime |
    ConvertTo-Html -Title "Disk Catalog" -PreContent "<h1>Disk Catalog</h1>" |
    Out-File index.html -Encoding utf8

Enhance by adding icons, thumbnails, or grouping by folder with additional scripting.


Best practices

  • Exclude system or hidden files by default; provide an opt-in flag to include them.
  • Sanitize file names and escape HTML to prevent XSS.
  • Store generated thumbnails and metadata alongside indexes to avoid reprocessing.
  • Use checksums (SHA256) for archival integrity; compute them incrementally and cache results.
  • For very large archives, split indexes by folder or month and provide a top-level navigation page.
  • When serving indexes over a network, set HTTP caching headers (for example Cache-Control and ETag); thumbnails can use a far-future expiry as long as a changed image gets a new thumbnail URL.
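
The first practice, excluding hidden files unless the user opts in, might be sketched as follows. The --include-hidden flag and the helper names are hypothetical. The check covers only the dot-prefix convention, so Windows hidden attributes would need separate handling.

```python
import argparse
import os
from typing import Iterator


def is_hidden(name: str) -> bool:
    """Treat dot-prefixed names as hidden (Unix convention; Windows attributes are not checked)."""
    return name.startswith(".")


def walk_visible(root: str, include_hidden: bool = False) -> Iterator[str]:
    """Yield file paths under `root`, skipping hidden files and directories unless opted in."""
    for dirpath, dirnames, filenames in os.walk(root):
        if not include_hidden:
            # Pruning in place stops os.walk from descending into hidden directories.
            dirnames[:] = [d for d in dirnames if not is_hidden(d)]
            filenames = [f for f in filenames if not is_hidden(f)]
        dirnames.sort()
        for name in sorted(filenames):
            yield os.path.join(dirpath, name)


def parse_args(argv=None) -> argparse.Namespace:
    """Parse the catalog root and the opt-in flag for hidden entries."""
    parser = argparse.ArgumentParser(description="List files for a disk catalog.")
    parser.add_argument("root", nargs="?", default=".")
    parser.add_argument("--include-hidden", action="store_true",
                        help="also index dot-prefixed files and directories")
    return parser.parse_args(argv)
```

A caller would pass parse_args().include_hidden straight to walk_visible, so the safe behavior is the default and indexing hidden entries takes an explicit flag.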

Performance tips

  • Keep index generation incremental: detect changed files by mtime and update only those entries.
  • Parallelize thumbnail and checksum computation (thread pools or multiprocessing).
  • Cache each file's size and modification timestamp from the previous run, and skip files whose values match.
  • Compress JSON indexes and enable gzip or Brotli on the web server.
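
Incremental generation can be sketched as a comparison between the current scan and a cached one. The JSON cache format and the function names below are assumptions for illustration. A size-plus-mtime signature can also miss an edit that preserves both values, so checksums remain the stricter test.

```python
import json
import os
from typing import Dict, List, Tuple

Signature = Tuple[int, int]  # (size in bytes, mtime in nanoseconds)


def scan(root: str) -> Dict[str, Signature]:
    """Map each file's path relative to `root` to its (size, mtime_ns) signature."""
    current = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            st = os.stat(full)
            current[os.path.relpath(full, root)] = (st.st_size, st.st_mtime_ns)
    return current


def diff(previous: Dict[str, Signature], current: Dict[str, Signature]) -> Tuple[List[str], List[str]]:
    """Return (new or changed paths, removed paths), each sorted."""
    changed = sorted(p for p, sig in current.items() if previous.get(p) != sig)
    removed = sorted(p for p in previous if p not in current)
    return changed, removed


def load_cache(cache_path: str) -> Dict[str, Signature]:
    """Load a saved scan; a missing cache means every file counts as new."""
    try:
        with open(cache_path, encoding="utf-8") as fh:
            # JSON stores tuples as lists, so convert back for comparison.
            return {p: tuple(sig) for p, sig in json.load(fh).items()}
    except FileNotFoundError:
        return {}


def save_cache(cache_path: str, current: Dict[str, Signature]) -> None:
    """Persist the scan so the next run can compare against it."""
    with open(cache_path, "w", encoding="utf-8") as fh:
        json.dump(current, fh)
```

Only the paths returned as changed need new thumbnails, checksums, and index entries. Keeping the cache file outside the scanned tree stops it from appearing as a changed file on every run.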

Security and privacy

  • Avoid embedding any sensitive metadata (full local paths, user IDs) unless necessary.
  • When exposing catalogs externally, ensure correct access controls; consider static hosting behind authentication.
  • Sanitize and encode filenames in HTML to prevent injection/XSS.
  • For public shares, remove thumbnails or previews that reveal private content unintentionally.
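
When HTML is assembled by hand rather than through an autoescaping template engine, a file name needs two different encodings: percent-encoding inside the href and HTML escaping in the visible text. A minimal sketch using only the standard library follows; link_for is a hypothetical helper name.

```python
import html
from urllib.parse import quote


def link_for(rel_path: str) -> str:
    """Build an <a> element for a relative path that uses '/' separators."""
    # Percent-encode spaces, '#', '?', quotes, and angle brackets; keep '/' as the separator.
    href = quote(rel_path, safe="/")
    # Neutralize '<', '>', '&', and quotes in the visible text.
    label = html.escape(rel_path, quote=True)
    return f'<a href="{href}">{label}</a>'


print(link_for("holiday photos/<b>beach</b> & sea.jpg"))
```

Applying only one of the two encodings is a common mistake: an unescaped label can inject markup, and an unencoded href breaks on names containing '#' or '?'.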

Extensions and advanced features

  • Add previews: audio/video players, text file previews, PDF thumbnails.
  • Integrate with metadata extractors: exiftool for rich photo metadata; ffprobe for video metadata.
  • Provide multiple output formats: HTML + JSON + CSV for interoperability.
  • Offer export options: ZIP selected files, or generate an offline static site.
  • Add incremental update endpoints: an API route that returns changed files since a timestamp.
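
Offering several output formats mostly means serializing the same records more than once. The sketch below writes JSON and CSV side by side. It assumes each record is a dictionary with the path, name, size, and mtime keys used earlier in this article, and export_records is a hypothetical name.

```python
import csv
import json
from typing import Dict, List

FIELDS = ["path", "name", "size", "mtime"]  # assumed record schema


def export_records(records: List[Dict], json_path: str, csv_path: str) -> None:
    """Write the same records as a JSON array and as a CSV table."""
    with open(json_path, "w", encoding="utf-8") as fh:
        json.dump(records, fh, ensure_ascii=False, indent=2)
    # newline="" lets the csv module control line endings on every platform.
    with open(csv_path, "w", encoding="utf-8", newline="") as fh:
        # extrasaction="ignore" drops keys outside FIELDS instead of raising an error.
        writer = csv.DictWriter(fh, fieldnames=FIELDS, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(records)
```

The HTML pages can be rendered from the same list, so all three formats stay consistent with a single scan.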

Summary

Automating an HTML disk catalog improves the usability, consistency, and maintainability of file archives. Choose the approach that fits your environment: small Bash or PowerShell scripts for quick jobs, Python for cross-platform flexibility, or Node.js for single-page searchable catalogs. Include metadata and thumbnails where helpful, cache expensive computations, and keep security and privacy front of mind. The examples above give a practical starting point you can adapt and extend for your needs.
