GPU Utilization Optimization

Goal: raise GPU utilization as close to 100% as possible on the primary dev environment (Intel Iris Xe, i7-1355U).

Baseline Measurements (2026-04-03)

Metric	Intel Iris Xe (iGPU)	NVIDIA 950M (dGPU)
GPU Usage (MangoHud)	~63%	~99%
FPS	154	129
Total Frame	6.85 ms	7.75 ms
Scene Render	2.25 ms	3.29 ms
Billboard Render	1.31 ms	2.55 ms
Post-Process	3.20 ms	3.39 ms
Swap Buffers	1.16 ms	0.87 ms

Analysis Evolution

Initial Hypotheses (Pre-Tracy)

#	Hypothesis	Estimated Impact	Confidence
H1	GPU Profiler query readback (`glGetQueryObjectui64v`) blocks CPU	15-25% GPU idle	70%
H2	Bitonic sort GPU barriers cause pipeline flushes	2-3%	50%
H3	PostProcess UBO `glBufferSubData` implicit sync on Mesa	3-5%	40%
H4	Shared memory bandwidth (iGPU) limits throughput	structural	60%

Tracy Instrumentation Results (Measured)

Added `PROFILE_ZONE` markers around key sync points. Tracy Statistics revealed:

Zone	Mean	Median	P99	Verdict
GPU Query Readback (sync)	37 µs	33 µs	94 µs	Not a bottleneck — negligible
GPU Sort: SSBO Upload	28 µs	—	—	Not a bottleneck
GPU Sort: Compute Dispatch	55 µs	50 µs	135 µs	Not a bottleneck
PostProcess UBO Upload	(< 10 µs)	—	—	Not a bottleneck

The three sync-point hypotheses (H1-H3) were invalidated by measurement: none of these sync points causes a significant stall. H4 (shared memory bandwidth) was not measured by these zones and remains open.

Revised Hypothesis (H5): CPU-GPU Pipeline Bubble (Confidence: 95% → Invalidated)

The initial Tracy timeline analysis suggested a CPU-GPU pipeline bubble:

GPU: [==Scene+PP+UI==][.........Swap Buffers (IDLE).........][==next frame==]
CPU: [==GL commands==][Tracy][Swap][Collect][Poll][Update][==GL commands==]
                       <---- GPU idle while CPU does non-GL work ---->

This hypothesis was tested and invalidated. Reordering the main loop (moving CPU-only work after SwapBuffers) produced no measurable improvement: the GPU idle gap was unchanged. The GPU idle time shown under "Swap Buffers" was caused not by a pipeline bubble but by X11/Compiz event-polling overhead (H6, described in the next section).

Final Root Cause (H6): X11/Compiz Compositor Overhead (Confidence: 95%)

Deeper Tracy profiling in fullscreen mode revealed the actual root cause:

Evidence from Tracy (Frame 2,053, fullscreen, VSync OFF):

Zone	Time	% of Frame
`GLFW PollEvents`	625 µs	28%
GPU useful work	~200 µs	9%
Render submit + Swap	~400 µs	18%
GPU idle (starving)	~1000 µs	45%

`glfwPollEvents()` calls into the X11 server, and that traffic is mediated by the Compiz compositing window manager. Each round-trip incurs significant latency compared to Wayland's direct model.

Cross-platform comparison:

Environment	Display	Compositor	`PollEvents`	GPU Usage
Intel Iris Xe (i7-1355U)	X11	Compiz	~625 µs	~63%
NVIDIA 950M (Bazzite)	Wayland	Native	~10-50 µs	~99%

The GPU is idle ~37% of the time because the CPU spends ~625 µs per frame in X11/Compiz event polling, during which no GL commands are submitted. The NVIDIA 950M reaches ~99% not because it is a slower GPU, but because event polling under Wayland adds negligible overhead.

This is a system-level limitation, not an application-level bug. No application code change can reduce the latency of `glfwPollEvents()` under X11/Compiz.

Updated Confidence Table

#	Finding	Impact	Confidence	Method
~~H1~~	Query readback stall	~37 µs (negligible)	Measured	Tracy Statistics
~~H2~~	Sort barrier flushes	~55 µs (negligible)	Measured	Tracy Statistics
~~H3~~	UBO implicit sync	< 10 µs (negligible)	Measured	Tracy Statistics
H4	Shared memory bandwidth	Structural, not primary	40%	Unchanged
~~H5~~	~~CPU-GPU pipeline bubble~~	~~30-40% GPU idle~~	Invalidated	Loop reorder tested, no effect
H6	X11/Compiz event polling overhead	~28% frame time, ~37% GPU idle	95%	Tracy Timeline (fullscreen)

Proposed Fix: Main Loop Reordering (Tested, No Effect)

The loop reordering approach was implemented and tested:

Before: PollEvents → physics/camera → App Update → Render → Tracy → SwapBuffers → Collect
After:  PollEvents → Render → SwapBuffers → physics/camera/App Update → Collect

Result: No measurable change in GPU utilization or frame time. The reorder was reverted because:

- It added 1 frame of input latency for zero benefit.
- The GPU idle gap was caused by X11/Compiz, not by CPU-side work ordering.

Conclusion

The ~63% GPU utilization on Intel Iris Xe under X11/Compiz is a system-level characteristic, not an application defect. The GPU renders the scene in ~200 µs but the CPU spends ~625 µs in X11 event polling per frame, starving the GPU of new work.

Mitigation options (all external to the application):

Option	Expected Impact	Feasibility
Switch to Wayland compositor	GPU → ~100%	Requires desktop environment change
Use a non-compositing WM (e.g., i3, dwm)	Reduced PollEvents overhead	User preference
Disable Compiz compositing (`compiz --replace --no-composite`)	Partial improvement	May break desktop features

No application-level code changes are planned for this issue.

Phase 1: Tracy Instrumentation (Done)

Added PROFILE_ZONE CPU markers at key synchronization points:

Zone	File	Purpose
`"GPU Query Readback (sync)"`	`gpu_profiler.c`	Measure blocking `glGetQueryObjectui64v` loop
`"GI Probe Sync (buffer upload)"`	`scene.c`	`glBufferSubData` SSBO + 3D texture packing for GI probes
`"GPU Sort: SSBO Upload"`	`sphere_sorting.c`	Instance data transfer to GPU
`"GPU Sort: Compute Dispatch"`	`sphere_sorting.c`	Full dispatch + barrier chain
`"PostProcess UBO Upload"`	`postprocess.c`	`glBufferSubData` implicit sync detection

Phase 2: Main Loop Reordering (Tested, Invalidated)

Reorganized `app_run()` to move CPU-only work (camera physics, UI update, notifier, sampler) after `glfwSwapBuffers()`. The change compiled and passed all 60 tests but produced no improvement in GPU utilization. The reorder was reverted.

This led to the discovery of the actual root cause (H6: X11/Compiz overhead) through additional Tracy instrumentation of the full frame loop.

Additional Tracy Zones (Phase 2 Diagnostic)

Zone	File	Purpose
`"Frame Timing"`	`app.c`	Timing/FPS/sampler block
`"UI & Notifier Update"`	`app.c`	Action notifier, UI overlay, postprocess time
`"Camera Physics"`	`app.c`	Fixed timestep physics + rotation interpolation
`"PostProcess Resize"`	`app.c`	Deferred FBO/texture recreation
`"Icosphere Regen"`	`app.c`	Mesh regeneration on subdivision change