Changelog

All notable changes to facetmask are documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

[Unreleased]

Nothing yet.

[0.2.0]

Added

In-app update notifications. On launch the desktop app checks the published version manifest (site/version.json, read from the public GitHub raw URL so it works without the site's Cloudflare Access OTP) and, if a newer build exists, shows a popup with the version, a bullet list of what's new, and Update now / Skip this version / Later — plus a link to the full changelog. Skipped versions are remembered. Help → Check for updates… runs it on demand; facetmask --version prints the build. update.py is a small Tk-free, unit-tested module (12 tests). Set FACETMASK_FAKE_UPDATE=1 to force the prompt for a UI self-test. Replaces the brittle git fetch/pull in the launchers as the update path.
GPU-aware installer. install.bat now detects the GPU's compute capability via nvidia-smi and auto-selects the matching PyTorch wheel instead of hardcoding cu124: cu128 for RTX 50-series (Blackwell, compute 12.x), cu124 for older NVIDIA cards, and the CPU build when there's no NVIDIA GPU. This permanently fixes the "CUDA error: no kernel image is available" crash on Blackwell cards without anyone needing to know CUDA versions. Installs torch with --no-cache-dir (so pip can't silently reuse a wrong-variant cached wheel) and adds a real GPU compute test to the verify step: torch.cuda.is_available() still returns True when the installed wheel has no kernels for the card, so running a trivial GPU op is the honest check. Detection validated on an RTX 4070 (→ cu124); RTX 50-series cards map to cu128.
SAM 3 low-VRAM masking — fixes GPU out-of-memory crashes on long runs with limited VRAM (e.g. 8 GB RTX 5060 Ti OOM'ing after ~200 frames). SAM 3's image processor force-resizes every frame to a fixed square, so VRAM scales with that square (quadratic), not the source resolution. The masker now: loads in bfloat16 with SDPA attention on CUDA, runs under torch.inference_mode, casts pixel values to the model dtype, moves results off-GPU and frees the CUDA cache between frames (the fragmentation that makes long runs fail where short ones don't), and **auto-picks the processed square size from the GPU's VRAM** (~560px on 8 GB, native 1008px on 16 GB+). On a mid-run OOM it auto-steps the resolution down and reloads, so a too-high guess self-corrects instead of crashing. New --sam-image-size CLI flag (and mask_image_size pipeline kwarg) for a manual override; masks are still written at the source resolution. The launchers export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to further cut fragmentation OOMs.
Native desktop app (facetmask-app console-script, launch-app.bat). A real Windows window (no browser, no local webserver, no port) built on Tkinter/ttk, which is already a dependency, so zero new packages. Native OS file dialogs that belong to the window, a single-screen form (input/output, preset, interval, SAM prompts, collapsible Advanced), and a richer status panel than the web UI: a per-stage checklist (extract → mask → voronoi → reframe), a live progress bar with percent + elapsed timer, a calibration-based ETA, a per-file batch queue showing each input's detected type (360 / photo), a scrolling run log, and a working Stop button (cooperative cancel via the progress callback). Same pipeline and output as the CLI and Gradio UI. The Gradio web UI is retained as a cross-platform fallback. New module desktop.py; pure logic (run controller, stage helpers, estimator) is Tk-free and unit-tested headless. The desktop app follows the OS light/dark theme, renders crisply on high-DPI displays (process DPI-awareness + Tk scaling), keeps Advanced settings in an always-open right-hand side panel, pairs every slider with a live typable numeric field, and shows hover tooltips on the non-obvious controls.
Non-360 batch processing (passthrough mode). Ordinary photos and non-equirect videos are now first-class inputs. Each source is auto-classified by aspect ratio — ~2:1 (±0.2 band) → equirect 360 (reframed into pinhole views as before); anything else → passthrough: the frame is already a pinhole image, so it skips reframing (and Voronoi) entirely. It still goes through the same sharpest-frame-per-interval extraction and optional full-frame SAM masking. Passthrough frames land in photos/ with their masks in photo_masks/ ({stem}_mask.png, Metashape convention). A single batch may freely mix 360 clips and ordinary media — a folder ripped off a device routes each input correctly. Works identically across the desktop app, the Gradio UI, and the CLI with no new flags (detection is automatic). reframe_log.json records a processing_mode per source plus passthrough_images_written / passthrough_masks_written totals.
Folder batching now picks up images too. --input-dir (CLI) and the folder pickers (both UIs) collect video *and* image files (.jpg/.jpeg/.png/.webp/.tif/.tiff/.bmp), not just videos, so a mixed device dump batches in one pass.
New tests: 14 desktop-UI tests (run controller success/error/cancel, stage helpers, plain-text estimator), ~20 passthrough-pipeline tests (aspect classifier, source-mode resolution, layout photo paths, pure-passthrough + mixed-batch end-to-end), and batch-input collection.
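
The wheel selection described in the GPU-aware installer entry above can be sketched in a few lines. This is an illustration only, not the logic in install.bat: the function name select_torch_variant is hypothetical, the compute-capability string is assumed to have already been read from nvidia-smi, and treating every major version of 12 or higher as cu128 is an assumption (the entry only names compute 12.x).

```python
from typing import Optional


def select_torch_variant(compute_cap: Optional[str]) -> str:
    """Map a compute-capability string such as "8.9" to a wheel variant tag."""
    if not compute_cap or not compute_cap.strip():
        return "cpu"  # no NVIDIA GPU detected
    try:
        major = int(compute_cap.strip().split(".")[0])
    except ValueError:
        return "cpu"  # assumption: unparseable output falls back to the CPU build
    # Blackwell (RTX 50-series) reports compute 12.x and needs the cu128 build;
    # older cards get cu124.
    return "cu128" if major >= 12 else "cu124"
```

For example, select_torch_variant("8.9") gives "cu124" and select_torch_variant("12.0") gives "cu128".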
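
To put a number on the low-VRAM masking entry above: if memory tracks the pixel count of the processed square, as that entry states, the relative cost of a smaller square is a one-line calculation. The helper name relative_pixel_count is hypothetical, and real VRAM use also includes model weights and other fixed overheads that this ratio ignores.

```python
def relative_pixel_count(size: int, native: int = 1008) -> float:
    """Pixel count of a size x size square relative to the native square."""
    if size <= 0 or native <= 0:
        raise ValueError("sizes must be positive")
    return (size / native) ** 2
```

With the sizes quoted in the entry, relative_pixel_count(560) is 25/81, so the 560px square carries roughly 31% of the pixels of the native 1008px square.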
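
The aspect-ratio routing from the passthrough entry above can be sketched as follows. This is a minimal illustration, not the shipped classifier: the function name classify_source and the returned mode strings are hypothetical, and reading the "±0.2 band" as an absolute tolerance on width / height around 2.0 is an assumption.

```python
def classify_source(width: int, height: int, tolerance: float = 0.2) -> str:
    """Return "equirect" for roughly 2:1 sources, otherwise "passthrough"."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    aspect = width / height
    # Assumption: the band is |aspect - 2.0| <= tolerance.
    return "equirect" if abs(aspect - 2.0) <= tolerance else "passthrough"
```

Under this reading a 5760×2880 frame is routed to reframing, while a 1920×1080 video (aspect about 1.78) and a 4000×3000 photo both fall through to passthrough.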

Added (earlier this cycle)

Gradio web UI (facetmask-gui console-script). Drag-drop MP4, preset dropdown, comma-separated SAM 3 prompts, collapsible advanced panel for interval / overlap / Voronoi / JPEG quality, "Process" button, live status + run log output. Designed for non-technical operators who shouldn't have to learn the CLI — same pipeline, same output structure, just point-and-click. New [gui] optional dependency group brings in Gradio 4.x / 5.x.
7 new GUI tests (preset description rendering, empty-input error paths, missing-gradio import error message, console-script arg parsing). Test count was 110 after first pass.

Changed (GUI iteration after first real-user feedback)

Path-based video input instead of file upload. Gradio's `gr.File` widget copies uploaded files to its temp dir even when `type="filepath"`, which is wasteful for multi-GB Insta360 videos. The new `gr.Textbox` accepts a full filesystem path and the pipeline reads from disk in place — no copy, works equally well for local files and NAS-mounted UNC paths.
Curated SAM 3 prompt checkboxes instead of a single freeform textbox. Common AccessCity-relevant classes (`person, car`, `bus, truck, bicycle, motorcycle, sky`, `traffic cone, garbage can`) are checkboxes that pre-populate the prompt string; a separate freeform textbox accepts open-vocabulary additions. The two sources are deduplicated and concatenated automatically.
Live per-stage progress via `gr.Progress`. The pipeline's existing `progress_callback` is now plumbed through to the UI so the operator sees `stage: current/total` updates during long runs (was: no feedback at all between click and completion).
In-UI Hugging Face token override in the Advanced panel (password-masked). Required when SAM 3 masking is enabled and neither the `HF_TOKEN` env var nor a `.hf_token` file is auto-discoverable from the facetmask install location.

Added (GUI test coverage)

`combine_prompts()` helper plus 6 tests covering: checkbox-only, freeform-only, both, deduplication, whitespace handling, empty inputs. Plus 4 new path-validation tests (none / empty / nonexistent / directory-not-file).
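
A minimal sketch of the merge behaviour those tests describe, for readers who want a feel for it. This is not the shipped `combine_prompts()`: the name `combine_prompts_sketch`, its signature, the case-insensitive deduplication, and the `", "` separator are all assumptions.

```python
from typing import Iterable, List, Optional


def combine_prompts_sketch(checked: Optional[Iterable[str]],
                           freeform: Optional[str] = "") -> str:
    """Merge checkbox prompts and freeform text into one prompt string."""
    seen = set()
    merged: List[str] = []
    # Checkbox values come first so their order is preserved in the output.
    sources = list(checked or []) + [freeform or ""]
    for source in sources:
        for part in source.split(","):
            prompt = part.strip()
            key = prompt.lower()  # assumption: duplicates are matched case-insensitively
            if prompt and key not in seen:
                seen.add(key)
                merged.append(prompt)
    return ", ".join(merged)
```

For example, `combine_prompts_sketch(["person", "car"], " car, traffic cone ,")` returns `"person, car, traffic cone"`, and empty inputs return an empty string.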

Test count: 121 passing.

[0.1.0] — 2026-05-15

Initial release. Feature parity with the equirect-to-COLMAP-dataset path of the alexmgee/lichtfeld-360-plugin, running standalone as a CLI tool.

Added

presets.py — seven view-layout presets covering the common equirect-to-pinhole reframing layouts:
cubemap-6 — classic 6-face cube
default-16 — two ±35° rings of 8, no poles (alexmgee plugin's default; street-capture optimised)
fibonacci-10/14/20/24 — Fibonacci-spiral sphere sampling, full-sphere coverage
icosahedral-20 — face-centroid sampling, exactly uniform by construction
reframer.py — cv2.remap-based equirect → pinhole reprojection. Map caching per (preset, equirect resolution) so batch reframing of many frames with the same preset costs one map computation + per-frame cv2.remap calls. Supports overlap_degrees to widen each view's effective FOV.
extractor.py — Tenengrad sharpness scoring per frame, interval partitioning with scene-change splits (threshold 0.3), pick of the sharpest frame per (sub-)chunk. Port of the alexmgee plugin's "Best" extraction mode.
masker.py — SAM 3 (facebook/sam3 via transformers >= 5.0) with open-vocabulary text prompts. Lazy imports of heavy ML deps — module import is free; torch/transformers only load when Sam3Masker is instantiated. Token lookup order: explicit arg → HF_TOKEN env var → .hf_token file at cwd or repo root → error.
overlap_masks.py — Voronoi sphere partitioning. For each preset view, assigns every direction on the sphere to the view whose centre direction is closest, producing per-view exclusion masks that prevent COLMAP from extracting duplicate features across overlapping views. combine_masks() helper OR-combines binary masks.
pipeline.py — orchestrates extract → (optional SAM) → reframe + per-view mask projection → (optional Voronoi) → combined per-view mask output. Writes frames/, masks/, views/, view_masks/, overlap_masks/, rig_config.json, reframe_log.json.
cli.py + __main__.py — argparse entry point. Subcommands: extract (run the pipeline), list-presets (show available presets and exit). Optional tqdm progress bars, falls back to periodic stderr prints if tqdm isn't installed.
103 tests across presets, reframer, extractor, masker infrastructure, and overlap-masks geometry. All passing.
Packaging: pyproject.toml with facetmask console-script entry point. Apache-2.0 licensed. Optional [sam] and [dev] extras.
Verified end-to-end against a real Forum back-entrance Insta360 MP4: extracts equirect frames, runs SAM 3 with prompts person, car, bus, truck, bicycle, produces correctly-shaped masks on the operator and visible vehicles + pedestrians, projects masks per view, combines with Voronoi, writes alexmgee-compatible output.
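
For reference, the Tenengrad score mentioned in the extractor.py entry is the mean squared Sobel gradient magnitude of a grayscale frame. The sketch below uses plain NumPy slicing rather than OpenCV and is not the code in extractor.py; the function names are hypothetical, and the interval partitioning and scene-change splitting are left out.

```python
import numpy as np


def tenengrad(gray: np.ndarray) -> float:
    """Mean squared Sobel gradient magnitude of a 2-D grayscale image."""
    img = np.asarray(gray, dtype=np.float64)
    if img.ndim != 2 or min(img.shape) < 3:
        raise ValueError("expected a 2-D image of at least 3x3 pixels")
    # 3x3 Sobel responses on the interior pixels, written with array slices.
    gx = (img[:-2, 2:] + 2 * img[1:-1, 2:] + img[2:, 2:]
          - img[:-2, :-2] - 2 * img[1:-1, :-2] - img[2:, :-2])
    gy = (img[2:, :-2] + 2 * img[2:, 1:-1] + img[2:, 2:]
          - img[:-2, :-2] - 2 * img[:-2, 1:-1] - img[:-2, 2:])
    return float(np.mean(gx ** 2 + gy ** 2))


def sharpest_index(frames) -> int:
    """Index of the frame with the highest Tenengrad score in one chunk."""
    scores = [tenengrad(frame) for frame in frames]
    return int(np.argmax(scores))
```

A flat frame scores 0 and any frame with edges scores higher, so calling sharpest_index on the frames of one interval picks the one with the strongest gradients.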
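
The nearest-centre assignment behind the Voronoi partitioning in the overlap_masks.py entry can be illustrated in pure Python. This is a sketch under stated assumptions, not the module's code: the function names and the yaw/pitch axis convention are hypothetical, and rendering the per-view masks from the assignment is omitted.

```python
import math
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def direction(yaw_deg: float, pitch_deg: float) -> Vec3:
    """Unit vector for a yaw/pitch pair (axis convention is an assumption)."""
    yaw, pitch = math.radians(yaw_deg), math.radians(pitch_deg)
    return (math.cos(pitch) * math.sin(yaw),
            math.sin(pitch),
            math.cos(pitch) * math.cos(yaw))


def nearest_view(d: Vec3, centres: Sequence[Vec3]) -> int:
    """Index of the view centre with the smallest angle to direction d."""
    if not centres:
        raise ValueError("at least one view centre is required")
    # For unit vectors the largest dot product means the smallest angle.
    dots = [sum(a * b for a, b in zip(d, c)) for c in centres]
    return max(range(len(centres)), key=dots.__getitem__)
```

A pixel of view i whose direction has nearest_view(...) != i lies in another view's cell, which is the condition an exclusion mask would mark for that view.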

Performance

SAM 3 inference on CPU: ~22s per prompt per equirect frame (5760×2880). 5 prompts × 1 frame = ~113s.
The same workload on an RTX 5070 Ti is expected to run in well under a minute (vision-backbone-dominated, scales with GPU FLOPS).
See ROADMAP.md for the get_vision_features() optimization that should drop multi-prompt latency by 3–4× on top of the hardware speedup.