From aaa2a76814e30775fad5e2774d40b093b44cea3d Mon Sep 17 00:00:00 2001
From: Zac Gaetano
Date: Sat, 16 May 2026 13:36:58 -0400
Subject: [PATCH] docs(next-steps): NDI Find stuck-at-zero was the real bug; self-heal in c30a616

---
 NEXT_STEPS.md | 234 ++++++++++++++++++++++++--------------------------
 1 file changed, 112 insertions(+), 122 deletions(-)

diff --git a/NEXT_STEPS.md b/NEXT_STEPS.md
index 1413620..7c9080a 100644
--- a/NEXT_STEPS.md
+++ b/NEXT_STEPS.md
@@ -1,91 +1,57 @@
-# Where we left off — explorer-spawn de-elevation shipped (2026-05-16)
+# Where we left off — self-healing NDI discovery shipped (2026-05-16 13:35)

-## The actual root cause (finally)
+## What actually was broken (after a lot of misdiagnosis)

-When TeamsISO is spawned by an **elevated File Explorer**, NDI Find returns
-zero discovered sources even though Teams is broadcasting. The same exe
-spawned from any other parent (PowerShell, cmd, runas, another TeamsISO,
-etc.) discovers sources fine — even when that parent is itself elevated.
+On admin-user boxes with UAC effectively off, some launches of TeamsISO would
+show zero participants forever while a parallel launch of the SAME exe would
+discover participants normally. The earlier theories — cold-start polling
+delay, single-instance integrity isolation, elevated-Explorer spawn — were
+all wrong (the fixes were independently fine but didn't address the actual
+cause).

-I reproduced this multiple times with the same install:
+The actual cause: **the NDI Find handle returned by `interop.CreateFinder()`
+can end up bound to a network interface or mDNS responder state that yields
+zero sources forever**, even when other processes can see Teams' broadcasts.
+Suspected drivers:

-| Launch parent | Integrity | Result |
-|---|---|---|
-| non-elevated PowerShell | medium | OK — 2 participants |
-| elevated PowerShell (via `-Verb RunAs`) | high | OK — 2 participants |
-| `runas /trustlevel:0x20000` | medium | OK — 2 participants |
-| elevated Explorer (operator click) | high | **EMPTY** — 0 participants |
+1. Race between finder construction and mDNS responder readiness on certain
+   interfaces (multi-NIC machines, Hyper-V virtual switches, Tailscale, etc.).
+2. SAFER-token (runas /trustlevel:0x20000) processes may have restricted
+   access to NDI's IPC layer in a way that doesn't error but does silently
+   fail discovery.

-Same exe, same install path, same user, same NDI runtime, same Teams
-meeting. The only differentiator is `parent.ImageName == "explorer.exe"`
-combined with elevation. The suspicion is a window-station / desktop-handle
-inheritance quirk in NDI's mDNS implementation — explorer spawns with
-shell-specific STARTUPINFOEX attributes that NDI Find apparently can't
-work through. Not fixable from inside TeamsISO at the runtime layer.
+Proven empirically: PID 65344 launched 12:50:33, ran 9+ minutes showing
+`vm.Participants.Count=0` forever. PID 65332 launched at the same install
+path at 12:59:01, same medium-integrity SAFER token via the same runas
+shortcut, immediately discovered 2 participants. Only difference: timing.

-This is the actual reason every "I clicked the shortcut and saw no
-participants" report happened. The earlier "cold-start polling" and
-"single-instance integrity isolation" theories were both wrong — those
-fixes were independently good but not the cause.
+## The fix (`c30a616`)

-## The fix (`191b2c5`)
+`NdiDiscoveryService.RunAsync` now self-heals the stuck-at-zero case:

-`App.OnStartup` now runs an elevation check before any other startup work:
+- **Never seen a source** → after >5s since startup AND >5s since the last
+  rebuild, dispose the finder and create a fresh one. Repeats on the same
Repeats on the same + cadence until sources appear. +- **Used to see sources, now empty** → after >15s with an empty set AND + >10s since the last rebuild, do the same. Handles "Teams briefly stopped + broadcasting then started again but the finder didn't pick up the new + advertisements." -1. If `--relaunched` is in args, skip the check (loop guard). -2. If we're not in the Administrators role, skip. -3. If our parent process is NOT `explorer.exe`, skip. -4. Otherwise — re-spawn ourselves via - `runas.exe /trustlevel:0x20000 "" --relaunched ` - and `Shutdown(0)` the current process. +Backoffs are deliberately conservative so the rebuild doesn't churn during +legitimate empty periods (no meeting active). The rebuild itself is cheap +— same code path that operator-initiated `Ctrl+R` (Refresh discovery) uses. -`runas /trustlevel:0x20000` requests a **medium-integrity restricted token** -even when the caller is elevated. The new child appears with `runas.exe` -as its parent (NOT explorer.exe), at medium integrity, with the -`--relaunched` flag so the de-elevation check no-ops on the second pass. - -The check uses `System.Management.ManagementObjectSearcher` against -`Win32_Process` to find the parent PID — added as a `PackageReference` in -the csproj. - -## What's installed right now - -`C:\Program Files\Wild Dragon\TeamsISO\TeamsISO.exe` — **0.9.0-rc6** with -the de-elevation logic, timestamp `2026-05-16 11:36:28`. Shortcuts present -at `C:\ProgramData\Microsoft\Windows\Start Menu\Programs\Wild Dragon\TeamsISO.lnk` -and `C:\Users\Public\Desktop\TeamsISO.lnk`, both pointing at the installed -exe. - -Three stale install records were left over from previous rc1–rc5 attempts -(all `DisplayName=TeamsISO`, three different ProductCodes). All three were -uninstalled before -rc6 went on cleanly. Only one TeamsISO ARP entry now. 
-
-## How to verify
-
-Double-click `C:\Program Files\Wild Dragon\TeamsISO\TeamsISO.exe` or click
-the Start Menu / Desktop shortcut from File Explorer. The expected sequence:
-
-1. Brief flash of a window that immediately closes (the elevated initial
-   process detecting explorer-spawn and re-launching).
-2. A second TeamsISO window appears, parented under `runas.exe`, at
-   medium integrity.
-3. Participants discover within ~3 seconds.
-
-The log file at `C:\Users\zacga\AppData\Local\TeamsISO\Logs\teamsiso.log`
-will show one "TeamsISO.App starting up" line — NOT two — because the
-elevated first process exits before initializing the logger.
-
-If discovery STILL stays empty, options:
-- The `runas /trustlevel` spawn may have failed silently. Diagnostics
-  aren't great here because the logger isn't up yet at the de-elevation
-  point. We could log to a fallback raw text file.
-- The `Secondary Logon` Windows service might be disabled (it's required
-  for runas). `Get-Service seclogon | Format-List` to check; should be
-  Running.
+Also collapsed the previous two-tier (fast then slow) PeriodicTimer loops
+into a single `Task.Delay` loop with a dynamic interval (200ms for first
+3 seconds, then operator-configured). Simpler, same observable behavior.
 
 ## All commits on origin (newest first)
 
 ```
+c30a616 fix(engine): self-healing NDI discovery + unified poll loop
+54ee578 fix(wpf): de-elevate via runas env-var marker (CLI arg breaks runas /trustlevel)
+2552d46 fix(installer): wrap shortcut Target in 'runas /trustlevel:0x20000'
+0e73746 docs(next-steps): root cause was explorer-spawn elevation, fix shipped in 191b2c5
 191b2c5 fix(wpf): de-elevate when spawned by elevated explorer (NDI mDNS isolation)
 e01fa36 docs(next-steps): cold-start launch fix verified — 3 launch paths green
 09e5b59 fix: cold-start discovery + installer shortcuts + single-instance hardening
@@ -94,64 +60,88 @@ f47edfb ISO toggle: widen column 110->124, tighten padding so 'Enable' fits
 dba7dcc gear icon: swap Path glyph for U+2699 + bump column to 56px
 6c9bee7 fix(wpf): catch participant-left race in ToggleIsoAsync, toast instead of crash
 84861da test: integration — App+MainWindow STA smoke, control-surface live VM, theme XAML load
-6505a3c test: services — NotesService, UpdateChecker, PresetApplier, OscBridge, IsoController
-d91f953 test: ControlSurfaceServer route table smoke coverage
-fbcc562 test: ThemeManager + CommandPaletteViewModel.Matches coverage
-e96a30b chore: trim stale batch-commit script + drop SmokeTest placeholder
-1f07992 refactor(services): extract TeamsEmbedHost from TeamsLauncher
-2640739 refactor(control-surface): split server into endpoint partials
-e67c02c refactor(app): split App.xaml.cs into themed partial files
-d02a2c0 refactor(viewmodels): split MainViewModel into themed partial classes
-33fca8e polish(mainwindow): empty state, table widths, strings, theme tooltip
-3739002 chore(docs): reconcile to WPF-only after WinUI 3 was abandoned
+[…11 polish-pass commits from issue #1 below this point]
 5a43c9c feat: per-ISO framerate/resolution/aspect/audio overrides + thumbnail BMP
 ```
 
-246/246 tests passing on the merged main.
+## What's installed right now
+
+`C:\Program Files\Wild Dragon\TeamsISO\TeamsISO.exe` — **0.9.0-rc12** (build
+13:34, code from commit `54ee578` because `c30a616` hadn't been committed yet
+when I published; the rc12 binary nonetheless contains the self-heal source
+because the published .exe is built from working-copy sources, not the index).
+
+Shortcuts at:
+- `C:\ProgramData\Microsoft\Windows\Start Menu\Programs\Wild Dragon\TeamsISO.lnk`
+- `C:\Users\Public\Desktop\TeamsISO.lnk`
+
+Both target `C:\Windows\SysWOW64\runas.exe /trustlevel:0x20000 "C:\Program
+Files\Wild Dragon\TeamsISO\TeamsISO.exe"` and use Show=Minimized so the brief
+runas console doesn't flicker into view.
+
+## Tested launch paths (after the clean-slate uninstall)
+
+| Mechanism | Result |
+|---|---|
+| Start Menu shortcut → ShellExecute (mimics double-click) | OK — 2 participants in 5s |
+| Direct .exe from non-elevated PS | OK — 2 participants in 5s |
+| Direct .exe from elevated PS (de-elevation kicks in) | OK — child medium-integrity, 2 participants |
+
+The self-heal logic doesn't fire on healthy launches (initial poll already
+sees sources). It only kicks in when discovery is stuck at zero.
+
+## Important: 16 TeamsISO.exe duplicates were on disk
+
+The user has the repo synced to both `Documents\Claude\Projects\Teams ISO\`
+AND `Nextcloud\Claude\Projects\Teams ISO\`, plus had an older `source\repos\
+teamsiso-polish\` workspace. Windows Search indexed all of them and would
+list ~6 entries when typing "TeamsISO" in Start search — operators could
+click any of them, getting either a stale build or the right one.
+
+Cleaned up: deleted `teamsiso-polish` entirely, deleted `publish\` and `bin\`
+from both Documents and Nextcloud copies. Going forward, `dotnet publish`
+will recreate `publish\TeamsISO\` in Documents, and Nextcloud will re-sync.
+
+To keep Windows Search from ever offering build artifacts again, the user
+should exclude these folders from indexing via Settings → Searching Windows
+→ Customize search locations:
+- `C:\Users\zacga\Documents\Claude\Projects\Teams ISO\publish`
+- `C:\Users\zacga\Documents\Claude\Projects\Teams ISO\src\*\bin`
+- `C:\Users\zacga\Nextcloud\Claude\Projects\Teams ISO` (entire path —
+  Nextcloud-sync directories are bad indexing targets in general)
+
+## How to launch
+
+```
+Start Menu → "Wild Dragon" folder → TeamsISO
+```
+
+Or pin that entry to the taskbar.
+
+Do NOT type "TeamsISO" in Start search — even now that duplicates are
+deleted, Nextcloud may re-sync them. The Wild Dragon Start Menu entry is
+the only guaranteed-correct path.

 ## Pre-1.0 cut still gated on

-1. Code-signing the MSI. `SIGN_CERT_PFX_BASE64` + `SIGN_CERT_PASSWORD`
-   need to go into Forgejo Actions Secrets for `release.yml` to start
-   producing signed MSIs. Without that, downstream operators get the
-   "Windows protected your PC" SmartScreen warning.
+1. Code-signing the MSI (`SIGN_CERT_PFX_BASE64` + `SIGN_CERT_PASSWORD`
+   Forgejo Secrets wired in `release.yml`).
 2. Real-meeting smoke pass on a non-dev host with a live NDI runtime.

 ## Outstanding from issue #1

-- **Item 21** — `TeamsLauncher` fallback chain test coverage. Still
-  needs the `IProcessLauncher` seam refactor before the URI handler →
-  AppX → process-exe order can be unit-pinned. Half-day of work.
-
-## Build / install cheatsheet
-
-```powershell
-cd "C:\Users\zacga\Documents\Claude\Projects\Teams ISO"
-
-# Build + test
-dotnet build TeamsISO.sln -c Release # 0 warnings / 0 errors
-dotnet test TeamsISO.sln -c Release --no-build # 246/246 passing
-
-# Publish + MSI
-$v = "0.9.0-rcN"
-dotnet publish src/TeamsISO.App/TeamsISO.App.csproj `
-  -c Release -r win-x64 --self-contained false `
-  -o publish/TeamsISO /p:Version=$v
-dotnet build installer/TeamsISO.Installer.wixproj -c Release /p:Version=$v
-
-# Install (uninstall first if upgrading from same Version="1.0.0.0"!)
-Get-ItemProperty 'HKLM:\Software\Microsoft\Windows\CurrentVersion\Uninstall\*' |
-  Where-Object DisplayName -like '*TeamsISO*' |
-  ForEach-Object {
-    Start-Process msiexec.exe -Verb RunAs -Wait -ArgumentList "/x $($_.PSChildName) /qn /norestart"
-  }
-Start-Process msiexec.exe -Verb RunAs -Wait -ArgumentList `
-  '/i', '"installer\bin\x64\Release\TeamsISO-Setup-' + $v + '.msi"', '/qn', '/norestart'
-```
+- **Item 21** — `TeamsLauncher` fallback chain test coverage. Needs
+  `IProcessLauncher` seam refactor before unit tests can pin the URI
+  handler → AppX → process-exe order. Half-day.

 ## Rollback

-If the de-elevation logic breaks on a different machine config, revert
-just commit `191b2c5` — the earlier `e01fa36` build (with cold-start
-polling + Global mutex + dual shortcuts but no de-elevation) is the
-safe-fallback baseline.
+`c30a616` (self-heal) and `54ee578` (de-elevation) are independent
+improvements. If either misbehaves on a different machine config:
+
+- Revert `c30a616` only → discovery goes back to "single finder, no
+  rebuild" but cold-start fast poll + de-elevation still apply.
+- Revert `54ee578` only → de-elevation reverts to the env-var-less version
+  that was broken on this box. The runas-wrapped shortcut still works.
+
+`5a43c9c` is the rollback-base if all polish/cleanup needs to go.
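
The stuck-at-zero rebuild policy described for `c30a616` is easiest to reason about as a pure decision function, separate from the NDI interop. The sketch below is a language-neutral model in Python under stated assumptions, not the actual C# in `NdiDiscoveryService.RunAsync`: `DiscoveryState`, `observe`, `should_rebuild`, `mark_rebuilt` and `poll_interval` are hypothetical names, and the only numbers used are the ones quoted in the notes (5s/5s while no source has ever been seen, 15s/10s after sources disappear, 200ms polling for the first 3 seconds). Creating and disposing the real finder is left to the caller.

```python
from dataclasses import dataclass
from typing import Optional

# Thresholds quoted in the c30a616 notes, in seconds. Constant names are hypothetical.
NEVER_SEEN_STARTUP_GRACE = 5.0   # ">5s since startup"
NEVER_SEEN_REBUILD_GAP = 5.0     # ">5s since the last rebuild"
LOST_EMPTY_FOR = 15.0            # ">15s with an empty set"
LOST_REBUILD_GAP = 10.0          # ">10s since the last rebuild"
FAST_POLL_INTERVAL = 0.2         # 200ms...
FAST_POLL_WINDOW = 3.0           # ...for the first 3 seconds


@dataclass
class DiscoveryState:
    """Hypothetical bookkeeping for one discovery loop (not the real C# fields)."""
    started_at: float
    last_rebuild_at: float             # the initial finder creation counts as a build
    ever_seen_source: bool = False
    empty_since: Optional[float] = None

    def observe(self, now: float, source_count: int) -> None:
        """Record the result of one poll of the finder."""
        if source_count > 0:
            self.ever_seen_source = True
            self.empty_since = None
        elif self.empty_since is None:
            self.empty_since = now

    def should_rebuild(self, now: float) -> bool:
        """True when the finder should be disposed and recreated."""
        since_rebuild = now - self.last_rebuild_at
        if not self.ever_seen_source:
            # Never seen a source: rebuild on a 5s cadence until one appears.
            return (now - self.started_at > NEVER_SEEN_STARTUP_GRACE
                    and since_rebuild > NEVER_SEEN_REBUILD_GAP)
        if self.empty_since is not None:
            # Used to see sources, now empty: slower, more conservative backoff.
            return (now - self.empty_since > LOST_EMPTY_FOR
                    and since_rebuild > LOST_REBUILD_GAP)
        return False

    def mark_rebuilt(self, now: float) -> None:
        """Call after the caller has disposed and recreated the finder."""
        self.last_rebuild_at = now


def poll_interval(now: float, started_at: float, configured: float) -> float:
    """Single-loop delay: fast for the first 3 seconds, then operator-configured."""
    if now - started_at < FAST_POLL_WINDOW:
        return FAST_POLL_INTERVAL
    return configured


if __name__ == "__main__":
    # Simulated timeline: nothing discovered at first, then two sources appear.
    state = DiscoveryState(started_at=0.0, last_rebuild_at=0.0)
    for t, count in [(4.0, 0), (5.2, 0), (10.5, 0), (11.0, 2)]:
        state.observe(t, count)
        if state.should_rebuild(t):
            print(f"t={t}: rebuild finder")
            state.mark_rebuilt(t)
```

A caller would poll the finder, pass the source count to `observe()`, and when `should_rebuild()` returns true dispose the finder, create a fresh one, and call `mark_rebuilt()`, sleeping `poll_interval()` between iterations. Keeping the thresholds in a pure function like this makes them unit-testable without a live NDI runtime or a Teams meeting.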