Effect Benchmark: A/B GPU Cost Measurement
An automated tool for measuring the individual GPU cost of each post-process effect within the uber-shader ("Final Composite"). Multi-pass effects (Bloom, DoF, Auto Exposure, Motion Blur) already have their own GPU Profiler stage and are not covered by this tool.
Overview
Usage
| Key | Action |
|---|---|
| 8 | Starts the sweep, or shows "Already running" if one is in progress |
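The sweep takes approximately 22 seconds at 60 fps: 8 phases × (30 + 120) frames ÷ 60 fps = 20 s of warmup and measurement, with the remainder presumably spent in stabilization (see Changelog).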
- Launch the application
- Stabilize the scene (do not move the camera during the bench)
- Press 8
- Wait for the "FX Benchmark: Done (see log)" notification
- Read results in the log output
⚠️ Important: Do not interact with the scene or toggle effects during the benchmark. The system saves and restores `active_effects`, but any external change would invalidate the measurements.
Reading the Results
Sample real output (Intel Iris Xe, 1920×1080, IBL scene + 20 spheres):
╔══════════════════════════════════════════════════════╗
║ POSTPROCESS EFFECT BENCHMARK RESULTS ║
╠══════════════════════════════════════════════════════╣
║ Baseline (all ON): 1.1308 ms (±0.0222 ms) ║
╠════════════════════╦═══════════╦═══════════╦════════╣
║ Effect ║ Cost(ms) ║ StdDev ║ Status ║
╠════════════════════╬═══════════╬═══════════╬════════╣
║ FXAA ║ +0.0110 ║ ±0.0042 ║ ON ║
║ Chromatic Aberration ║ — ║ — ║ OFF ║
║ Vignette ║ +0.0109 ║ ±0.0045 ║ ON ║
║ Grain ║ -0.0014 ║ ±0.0242 ║ ON ║
║ Color Grading ║ -0.0289 ║ ±0.0342 ║ ON ║
║ Banding ║ — ║ — ║ OFF ║
║ Exposure ║ — ║ — ║ OFF ║
╠════════════════════╬═══════════╬═══════════╬════════╣
║ Sum of costs ║ -0.0083 ║ ║ ║
╚════════════════════╩═══════════╩═══════════╩════════╝
Columns
| Column | Meaning |
|---|---|
| Effect | Post-process effect name |
| Cost(ms) | baseline_mean - mean_with_effect_OFF. Positive = the effect costs GPU time |
| StdDev | Standard deviation over 120 samples. Indicates measurement stability |
| Status | ON = tested (was active), OFF = skipped (was already disabled) |
Interpreting Values
Positive cost (+0.0110 ms)
The effect adds GPU time. This is the expected case. The larger the value, the more costly the effect.
Negative cost (-0.0014 ms, -0.0289 ms)
A negative cost means that disabling the effect slows the composite. This is counter-intuitive but normal on an iGPU. Possible causes:
- Measurement noise: if `|cost| < stddev`, the measurement is within noise. Example: Grain costs -0.0014 ms ± 0.0242 ms, so its true cost is indistinguishable from zero.
- Branch divergence: the uber-shader uses `if (effect_enabled)`. On SIMD GPUs (wavefronts/warps), branch cost depends on coherence within the warp. Disabling a single effect may change the divergence pattern and paradoxically slow adjacent warps.
- Register/cache pressure: the GLSL compiler may reorganize registers when dead code is eliminated. A different configuration may have slightly different memory pressure.
- ALU/TEX scheduling: on Intel iGPUs, the GPU shares memory bandwidth with the CPU. With one less computation, TEX units may wait on memory with no ALU work left to overlap the latency.
Sum ≠ baseline
The "Sum of costs" line will rarely equal baseline_mean. This is expected:
effects are not additive since they share the same execution units
(ALU, texture caches, bandwidth). The interaction between effects creates
masking effects (latency hiding).
Practical Rules
| Observation | Conclusion |
|---|---|
| `cost > 0` and `cost > 2 × stddev` | The effect has a significant, measurable cost |
| `cost > 0` but `cost < stddev` | Probable cost but not statistically significant |
| `cost ≈ 0` (pos or neg) and high stddev | Noise: re-run the bench with a stable scene |
| `cost < 0` and `\|cost\| > stddev` | Divergence/cache effect: not alarming, inherent to uber-shader |
| All costs very small (<0.05 ms) | Postprocess is not the bottleneck — look elsewhere (geometry, lighting) |
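The rules in this table can be applied mechanically to each result row. The following is a minimal sketch, not part of the module: `BenchVerdict` and `classify_cost` are hypothetical names. The table does not define a threshold for "cost ≈ 0" and does not cover costs between 1× and 2× stddev, so the sketch assumes every case outside the three explicit rules is reported as inconclusive.

```c
/* Hypothetical verdicts mirroring the rows of the Practical Rules table. */
typedef enum {
    VERDICT_SIGNIFICANT,  /* cost > 0 and cost > 2 * stddev           */
    VERDICT_PROBABLE,     /* cost > 0 but cost < stddev               */
    VERDICT_NEGATIVE,     /* cost < 0 and |cost| > stddev             */
    VERDICT_INCONCLUSIVE  /* assumption: anything the table leaves out */
} BenchVerdict;

/* Classify one result row from its cost and standard deviation (both in ms). */
static BenchVerdict classify_cost(double cost_ms, double stddev_ms)
{
    if (cost_ms > 0.0 && cost_ms > 2.0 * stddev_ms)
        return VERDICT_SIGNIFICANT;
    if (cost_ms > 0.0 && cost_ms < stddev_ms)
        return VERDICT_PROBABLE;
    if (cost_ms < 0.0 && -cost_ms > stddev_ms)
        return VERDICT_NEGATIVE;
    return VERDICT_INCONCLUSIVE;
}
```

Applied to the sample output, FXAA (+0.0110 ± 0.0042) and Vignette (+0.0109 ± 0.0045) are classified as significant, while Grain and Color Grading fall into the inconclusive bucket because their absolute cost is below their standard deviation.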
Benchmarked Effects
Only fragment-shader effects executed in the "Final Composite" draw call are measured by A/B toggle:
| Effect | Bit | Macro |
|---|---|---|
| FXAA | `1 << 12` | `POSTFX_FXAA` |
| Chromatic Aberration | `1 << 3` | `POSTFX_CHROM_ABBR` |
| Vignette | `1 << 0` | `POSTFX_VIGNETTE` |
| Grain | `1 << 1` | `POSTFX_GRAIN` |
| Color Grading | `1 << 5` | `POSTFX_COLOR_GRADING` |
| Banding | `1 << 14` | `POSTFX_BANDING` |
| Exposure | `1 << 2` | `POSTFX_EXPOSURE` |
Multi-pass effects (Bloom, DoF, Auto Exposure, Motion Blur) already have
their own stage in the GPU Profiler (F1 to display the overlay) and do not
need A/B testing.
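To illustrate how a single effect can be toggled for the "B" run, here is a minimal sketch using the bit values from the table. The helper names `fx_is_enabled` and `fx_mask_without` are hypothetical, and treating `active_effects` as a `uint32_t` bitmask is an assumption; the real field type is not shown in this document.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit values copied from the Benchmarked Effects table. */
#define POSTFX_VIGNETTE      (1u << 0)
#define POSTFX_GRAIN         (1u << 1)
#define POSTFX_EXPOSURE      (1u << 2)
#define POSTFX_CHROM_ABBR    (1u << 3)
#define POSTFX_COLOR_GRADING (1u << 5)
#define POSTFX_FXAA          (1u << 12)
#define POSTFX_BANDING       (1u << 14)

/* Hypothetical helper: true if the effect is set in the saved mask.
 * An effect that is already off would be reported as OFF and skipped. */
static bool fx_is_enabled(uint32_t active_effects, uint32_t bit)
{
    return (active_effects & bit) != 0u;
}

/* Hypothetical helper: mask for the "B" run, with exactly one effect cleared
 * and every other bit left untouched. */
static uint32_t fx_mask_without(uint32_t active_effects, uint32_t bit)
{
    return active_effects & ~bit;
}
```

Restoring the saved mask after each phase keeps every "B" run different from the baseline by exactly one bit.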
Internal Architecture
Why A/B?
GPU timer queries (GL_TIMESTAMP) record timestamps between commands, so
the finest interval they can measure is a whole draw call. However, all
fragment-shader effects execute within a single fullscreen quad draw call
("Final Composite"), and it is impossible to place timers inside a draw call.
The A/B method works around this:
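As a minimal sketch of the idea, using the cost definition from the Columns table (`baseline_mean - mean_with_effect_OFF`): the whole draw call is timed once with every effect in its saved state (run A) and once with a single effect disabled (run B), and the difference is attributed to that effect. The function name `ab_cost_ms` is hypothetical.

```c
/* Hypothetical helper: per-effect cost derived from two whole-draw-call timings.
 *   baseline_mean_ms   : mean "Final Composite" time with the saved effect set (run A)
 *   mean_effect_off_ms : mean time with exactly one effect disabled (run B)
 * A positive result means the composite got faster once the effect was removed. */
static double ab_cost_ms(double baseline_mean_ms, double mean_effect_off_ms)
{
    return baseline_mean_ms - mean_effect_off_ms;
}
```

For the sample output, the 1.1308 ms baseline and the +0.0110 ms FXAA cost imply that the FXAA-off run averaged about 1.1198 ms.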
State Machine
Per-Frame Flow
effect_benchmark_update() is called after gpu_profiler_begin_frame()
to read frame N-1 results (double-buffered timer queries):
- Warmup (30 frames): results are discarded. Lets the driver/GPU stabilize caches and the pipeline after the state change.
- Accumulation (120 frames): accumulates `sum_ms` and `sum_sq_ms` to compute the mean and standard deviation.
- Transition: computes stats, stores the result, disables the next effect, resets the counter.
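The three steps above can be sketched for a single phase as follows. This is an illustration under assumptions, not the module's code: `PhaseStats`, `phase_feed` and `run_phase_constant` are hypothetical names, and the statistics use the population formulas mean = `sum_ms / n` and variance = `sum_sq_ms / n - mean²`, which is one common choice and may differ from the real implementation. The standard deviation is the square root of the variance; the sketch stops at the variance so that it needs nothing beyond the C standard headers shown.

```c
#include <stdbool.h>

#define BENCH_WARMUP_FRAMES  30
#define BENCH_MEASURE_FRAMES 120

/* Hypothetical per-phase accumulator; the real EffectBenchmark layout is not shown here. */
typedef struct {
    int    frame;      /* frames seen since the last state change  */
    double sum_ms;     /* sum of measured samples                  */
    double sum_sq_ms;  /* sum of squared samples                   */
    double mean_ms;    /* valid once phase_feed() has returned true */
    double var_ms2;    /* variance; stddev is its square root      */
} PhaseStats;

/* Feed one "Final Composite" timing (ms). Returns true when the phase is complete. */
static bool phase_feed(PhaseStats *p, double sample_ms)
{
    p->frame++;
    if (p->frame <= BENCH_WARMUP_FRAMES)
        return false;                              /* warmup: discard the sample */

    p->sum_ms    += sample_ms;                     /* accumulation */
    p->sum_sq_ms += sample_ms * sample_ms;
    if (p->frame < BENCH_WARMUP_FRAMES + BENCH_MEASURE_FRAMES)
        return false;

    /* Transition: compute stats; the caller would store them and move to the next effect. */
    const double n = (double)BENCH_MEASURE_FRAMES;
    p->mean_ms = p->sum_ms / n;
    p->var_ms2 = p->sum_sq_ms / n - p->mean_ms * p->mean_ms;
    if (p->var_ms2 < 0.0)
        p->var_ms2 = 0.0;                          /* guard against rounding error */
    return true;
}

/* Drive one phase with a constant timing and return the final state. */
static PhaseStats run_phase_constant(double sample_ms)
{
    PhaseStats p = {0};
    while (!phase_feed(&p, sample_ms)) { /* keep feeding frames */ }
    return p;
}
```

Keeping only two running sums avoids storing all 120 samples, at the price of some numerical cancellation when the variance is tiny compared to the mean, hence the clamp to zero.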
Files
| File | Role |
|---|---|
| `include/effect_benchmark.h` | Types (`EffectBenchmark`, `BenchPhase`, `EffectBenchResult`), constants, API |
| `src/effect_benchmark.c` | State machine, accumulation, effect table, result display |
| `include/app.h` | `EffectBenchmark effect_bench` field in `App` |
| `src/app.c` | `effect_benchmark_init()` at startup, `effect_benchmark_update()` per frame |
| `src/app_input.c` | Key 8 binding → `effect_benchmark_start()` |
API
```c
// Initialization (once at startup)
void effect_benchmark_init(EffectBenchmark* bench,
                           PostProcess* postprocess,
                           GPUProfiler* profiler);

// Start a sweep (returns false if already running)
bool effect_benchmark_start(EffectBenchmark* bench);

// Call every frame after gpu_profiler_begin_frame()
// Returns true when the sweep just finished
bool effect_benchmark_update(EffectBenchmark* bench);

// Check if a benchmark is running
bool effect_benchmark_is_running(const EffectBenchmark* bench);

// Display results (called automatically at the end)
void effect_benchmark_log_results(const EffectBenchmark* bench);
```
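The call order implied by these declarations can be sketched as below. Everything except the function names and the documented signatures is a stand-in so that the snippet compiles on its own: the struct layouts, the stub bodies (which simply finish after three frames) and the `gpu_profiler_begin_frame(GPUProfiler*)` signature are assumptions, and `run_sweep_demo` is a hypothetical driver.

```c
#include <stdbool.h>

/* Stand-in types: the real definitions live in the project headers. */
typedef struct { int unused; } PostProcess;
typedef struct { int unused; } GPUProfiler;
typedef struct { bool running; int frames_left; } EffectBenchmark;

/* Stand-in stubs with the documented signatures (bodies are placeholders). */
void effect_benchmark_init(EffectBenchmark* bench, PostProcess* postprocess,
                           GPUProfiler* profiler)
{
    (void)postprocess; (void)profiler;
    bench->running = false;
    bench->frames_left = 0;
}
bool effect_benchmark_start(EffectBenchmark* bench)
{
    if (bench->running) return false;   /* documented: false if already running */
    bench->running = true;
    bench->frames_left = 3;             /* placeholder for the real sweep length */
    return true;
}
bool effect_benchmark_update(EffectBenchmark* bench)
{
    if (!bench->running) return false;
    if (--bench->frames_left > 0) return false;
    bench->running = false;
    return true;                        /* documented: true when the sweep just finished */
}
/* Assumed signature: the document only names this function. */
void gpu_profiler_begin_frame(GPUProfiler* profiler) { (void)profiler; }

/* Hypothetical driver: init once, start (what key 8 triggers), then update every frame. */
static int run_sweep_demo(void)
{
    PostProcess pp = {0};
    GPUProfiler prof = {0};
    EffectBenchmark bench;
    int frames = 0;

    effect_benchmark_init(&bench, &pp, &prof);
    if (!effect_benchmark_start(&bench)) return -1;
    for (;;) {
        gpu_profiler_begin_frame(&prof);            /* must precede the update call */
        frames++;
        if (effect_benchmark_update(&bench)) break; /* sweep finished this frame */
    }
    return frames;
}
```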
Measurement Parameters
| Constant | Value | Role |
|---|---|---|
| `BENCH_WARMUP_FRAMES` | 30 | Frames discarded after each state change (pipeline stabilization) |
| `BENCH_MEASURE_FRAMES` | 120 | Frames sampled per phase (≈2 s at 60 fps) |
| `BENCH_MAX_EFFECTS` | 16 | Maximum effect table capacity |
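These constants determine how long a sweep lasts at a given frame rate. The sketch below is a rough estimate only: `sweep_seconds` is a hypothetical helper that counts warmup and measurement frames and does not model the `BENCH_STABILIZE` phase mentioned in the changelog.

```c
#define BENCH_WARMUP_FRAMES  30
#define BENCH_MEASURE_FRAMES 120

/* Hypothetical helper: seconds spent in warmup + measurement for a number of phases.
 * A full sweep of the seven listed effects plus the baseline is 8 phases. */
static double sweep_seconds(int phases, double fps)
{
    const int frames_per_phase = BENCH_WARMUP_FRAMES + BENCH_MEASURE_FRAMES;
    return (double)(phases * frames_per_phase) / fps;
}
```

With 8 phases at 60 fps this gives 8 × 150 ÷ 60 = 20 seconds; at 30 fps the same sweep takes twice as long, since the phase lengths are defined in frames rather than in time.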
Limitations
- iGPU Precision: on an integrated GPU (Intel Iris Xe), timer query resolution is around 80 ns. Very light effects (< 0.01 ms) are often within noise.
- Non-Additivity: the cost of an effect depends on other active effects (latency hiding, register pressure). The sum of individual costs will not equal the total cost.
- Scene Stability Required: moving the camera during the bench modifies fragment load (overdraw, fill rate) and skews measurements.
- GPU Divergence: the `if` branches of the uber-shader have a cost that depends on the spatial coherence of pixels. A/B does not capture the additional divergence cost when multiple effects are simultaneously active.
Changelog
| Date | Change |
|---|---|
| 2026-02-07 | Created effect_benchmark module (header, implementation, integration) |
| 2026-02-08 | Added BENCH_STABILIZE phase and Timeout (2s) for reliability |