Post-Process Pipeline: Optimizations & Fixes (February 2026)
Date: February 7, 2026
Target: OpenGL 4.6, Mesa / Intel Iris Xe (RPL-U), unified-memory iGPU
Modified files: 13 files (headers, C sources, shaders, tests)
Table of Contents
- Summary
- Bug Fixes
  - FXAA / Motion Blur: Fragment pipeline order
  - Neighbor Max: Compute over-dispatch
  - Luminance Adaptation: Parallel reduction
  - Missing memory barriers
  - Auto-Exposure: FBO unbind before compute
- Performance Optimizations
  - UBO dirty flag: Partial upload
  - Sampler uniforms: Bind once
  - VAO bind/unbind: Factored out
- GPU Profiler Refactoring
  - Double-buffering metadata
  - Separating recording_count / stage_count
  - Restructured profiling stages
- Modified Files
- Validation
Summary
This update applies a series of optimizations and fixes to the post-processing pipeline and the GPU profiler. The changes fall into four areas:
- Bug fixes in effect execution order, compute shader dispatches, and synchronization barriers.
- CPU-side optimizations reducing redundant OpenGL calls per frame.
- GPU profiler refactoring to fix index conflicts between frames in the double-buffered system.
- Profiling restructure to separate compute cost from fragment cost in motion blur.
1. Bug Fixes
1.1 FXAA / Motion Blur: Fragment Pipeline Order
File: shaders/postprocess.frag
Problem: FXAA was applied directly on screenTexture before
Motion Blur and Chromatic Aberration. Motion Blur was therefore never
smoothed by FXAA, and FXAA did not benefit from motion-blurred content.
Before:
After:
FXAA now receives color already processed by MB+CA via the color parameter,
instead of re-sampling screenTexture independently.
1.2 Neighbor Max: Compute Over-Dispatch
File: src/effects/fx_motion_blur.c
Problem: The Neighbor Max dispatch was identical to Tile Max
(groups = ceil(pixels / 16)), whereas Neighbor Max operates on
tiles, not pixels. This caused an over-dispatch of up to 16× in X and 16× in Y
(nominally 256×; 204× at 1920×1080, where the group counts round up).
Before:
glDispatchCompute(groups_x, groups_y, 1); // groups = ceil(width/16)
// For 1920×1080: 120×68 = 8160 groups (each 16×16 = 256 threads)
// Total: ~2M threads to process 8160 tiles
After:
int neighbor_groups_x = (tile_count_x + 15) / 16;
int neighbor_groups_y = (tile_count_y + 15) / 16;
glDispatchCompute(neighbor_groups_x, neighbor_groups_y, 1);
// For 1920×1080: 8×5 = 40 groups → ~10K threads for 8160 tiles
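The arithmetic behind both dispatches can be checked on the CPU. The sketch below is a minimal model, assuming one tile per 16×16 pixel block and a 16×16 workgroup for Neighbor Max, as the figures above suggest. The `MotionBlurDispatch` struct and the `motion_blur_dispatch_sizes` helper are hypothetical names used for illustration only, not functions from the project.

```c
#include <assert.h>

/* Tile edge and workgroup edge, both 16 according to the text above. */
enum { TILE_SIZE = 16, LOCAL_SIZE = 16 };

/* Integer ceiling division for a non-negative value and a positive divisor. */
static int ceil_div(int value, int divisor)
{
    assert(value >= 0 && divisor > 0);
    return (value + divisor - 1) / divisor;
}

/* Hypothetical container for the derived sizes. */
typedef struct {
    int tile_count_x, tile_count_y;           /* size of the Tile Max output, in tiles */
    int tile_max_groups_x, tile_max_groups_y; /* one workgroup per tile */
    int neighbor_groups_x, neighbor_groups_y; /* one thread per tile */
} MotionBlurDispatch;

/* Hypothetical helper: derives both dispatch sizes from the frame size. */
static MotionBlurDispatch motion_blur_dispatch_sizes(int width, int height)
{
    MotionBlurDispatch d;
    d.tile_count_x = ceil_div(width, TILE_SIZE);
    d.tile_count_y = ceil_div(height, TILE_SIZE);

    /* Tile Max: ceil(pixels / 16) groups per axis, i.e. one group per tile. */
    d.tile_max_groups_x = d.tile_count_x;
    d.tile_max_groups_y = d.tile_count_y;

    /* Neighbor Max: each thread handles one tile, so divide by 16 again. */
    d.neighbor_groups_x = ceil_div(d.tile_count_x, LOCAL_SIZE);
    d.neighbor_groups_y = ceil_div(d.tile_count_y, LOCAL_SIZE);
    return d;
}
```

For a 1920×1080 frame this yields 120×68 tiles and 8×5 Neighbor Max groups, matching the comments in the snippets above.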
1.3 Luminance Adaptation: Parallel Reduction
File: shaders/lum_adapt.comp
Problem: The luminance adaptation compute shader used a single
thread (gl_GlobalInvocationID == (0,0)) that iterated sequentially over
64×64 = 4096 texels of the luminance map.
After: Parallel reduction in shared memory with 256 threads (16×16),
each thread processing 16 texels (4×4 block), followed by a logarithmic
reduction 256 → 128 → 64 → ... → 1.
layout(local_size_x = 16, local_size_y = 16) in;
shared float sharedLogLum[256];
shared float sharedValidCount[256];
// Each thread accumulates 4×4 texels
// Parallel reduction with barrier()
for (uint s = 128u; s > 0u; s >>= 1u) { ... }
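To make the reduction pattern concrete, here is a CPU-side model in C of the two phases described above: per-thread 4×4 accumulation, then a halving-stride reduction. It is a sketch under stated assumptions, not the shader itself. The input is assumed to already hold log-luminance values, validity is passed as a separate mask (the shader's actual validity criterion is not given here), and `LumReduction` and `reduce_luminance_map` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

enum {
    MAP_SIZE     = 64,                     /* luminance map is 64x64 texels */
    GROUP_DIM    = 16,                     /* local_size_x = local_size_y = 16 */
    THREAD_COUNT = GROUP_DIM * GROUP_DIM,  /* 256 threads */
    BLOCK_DIM    = MAP_SIZE / GROUP_DIM    /* each thread covers a 4x4 block */
};

/* Hypothetical result type: the two values left in slot 0 after reduction. */
typedef struct {
    float log_lum_sum;
    float valid_count;
} LumReduction;

/* CPU model of the reduction. Both inputs are row-major arrays of
   MAP_SIZE * MAP_SIZE entries (assumed layout). */
static LumReduction reduce_luminance_map(const float *log_lum,
                                         const unsigned char *valid)
{
    float shared_log_lum[THREAD_COUNT];
    float shared_valid_count[THREAD_COUNT];
    LumReduction result;

    assert(log_lum != NULL && valid != NULL);

    /* Phase 1: every "thread" accumulates its own 4x4 block of texels. */
    for (int tid = 0; tid < THREAD_COUNT; ++tid) {
        int tx = tid % GROUP_DIM;
        int ty = tid / GROUP_DIM;
        float sum = 0.0f;
        float count = 0.0f;
        for (int y = 0; y < BLOCK_DIM; ++y) {
            for (int x = 0; x < BLOCK_DIM; ++x) {
                size_t texel = (size_t)(ty * BLOCK_DIM + y) * MAP_SIZE
                             + (size_t)(tx * BLOCK_DIM + x);
                if (valid[texel]) {
                    sum += log_lum[texel];
                    count += 1.0f;
                }
            }
        }
        shared_log_lum[tid] = sum;
        shared_valid_count[tid] = count;
    }

    /* Phase 2: halving-stride reduction, 256 -> 128 -> ... -> 1.
       On the GPU, a barrier() separates the steps; here each step simply
       finishes before the next one starts. */
    for (unsigned s = THREAD_COUNT / 2; s > 0u; s >>= 1u) {
        for (unsigned tid = 0; tid < s; ++tid) {
            shared_log_lum[tid] += shared_log_lum[tid + s];
            shared_valid_count[tid] += shared_valid_count[tid + s];
        }
    }

    result.log_lum_sum = shared_log_lum[0];
    result.valid_count = shared_valid_count[0];
    return result;
}
```

The reduction phase takes 8 steps instead of a 4096-iteration sequential loop, which is where the speed-up over the single-thread version comes from.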
1.4 Missing Memory Barriers
File: src/postprocess.c
Addition: A glMemoryBarrier(GL_TEXTURE_FETCH_BARRIER_BIT) call at the start of
postprocess_end(), after scene MRT rendering and before the compute effects
(Bloom, DoF, AE, MB). It is intended to guarantee that writes to the color,
velocity, and depth/stencil textures are visible to the compute shaders that sample them.
1.5 Auto-Exposure: FBO Unbind Before Compute
File: src/effects/fx_auto_exposure.c
Problem: The downsample FBO remained bound during the adaptation compute
shader dispatch. The compute shader was reading downsample_tex while it was
still attached to the active FBO, creating a potential read/write conflict.
After: The FBO is unbound (glBindFramebuffer(GL_FRAMEBUFFER, 0))
before the memory barrier and dispatch, ensuring rasterized data is flushed.
2. Performance Optimizations
2.1 UBO Dirty Flag: Partial Upload
Files: include/postprocess.h, src/postprocess.c
Problem: The PostProcessUBO (~300 bytes, std140 layout) was fully
rebuilt and uploaded every frame via glBufferSubData, even if no parameters
had changed (only time changes each frame).
Solution: Added a bool ubo_dirty flag to the PostProcess struct.
- When ubo_dirty = true: full UBO rebuild and sizeof(PostProcessUBO)-byte upload.
- When ubo_dirty = false: partial 8-byte upload only (active_effects + time), the first two fields of the UBO.
The flag is set to true in the 17 parameter setters (postprocess_set_*)
and at initialization. postprocess_update_time() does not set the flag
since time is in the header that is always updated.
if (post_processing->ubo_dirty) {
PostProcessUBO ubo = { /* ... full rebuild ... */ };
glBufferSubData(GL_UNIFORM_BUFFER, 0, sizeof(PostProcessUBO), &ubo);
post_processing->ubo_dirty = false;
} else {
struct { uint32_t active_effects; float time; } header = { ... };
glBufferSubData(GL_UNIFORM_BUFFER, 0, sizeof(header), &header);
}
Gain: In steady state (no UI interaction), the upload drops from ~300 bytes/frame to 8 bytes/frame.
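The dirty-flag logic can be exercised without a GL context. The following sketch models the upload with `memcpy` into a plain byte buffer standing in for `glBufferSubData`. Only `active_effects`, `time`, and `ubo_dirty` come from the text above; the `exposure` and `bloom_intensity` fields, the `PostProcessUBOHeader` type, the `postprocess_set_exposure` setter, and the `postprocess_upload_ubo` function are hypothetical placeholders, and the real struct is much larger.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified UBO: the two leading fields are from the text, the rest are
   hypothetical stand-ins for the real effect parameters. */
typedef struct {
    uint32_t active_effects;
    float    time;
    float    exposure;         /* hypothetical */
    float    bloom_intensity;  /* hypothetical */
} PostProcessUBO;

/* The 8-byte header that is refreshed every frame. */
typedef struct {
    uint32_t active_effects;
    float    time;
} PostProcessUBOHeader;

typedef struct {
    PostProcessUBO values;  /* CPU-side copy of all parameters */
    bool           ubo_dirty;
} PostProcess;

/* Example setter: any parameter change marks the whole UBO dirty. */
static void postprocess_set_exposure(PostProcess *pp, float exposure)
{
    pp->values.exposure = exposure;
    pp->ubo_dirty = true;
}

/* time lives in the always-uploaded header, so the flag is left untouched. */
static void postprocess_update_time(PostProcess *pp, float time)
{
    pp->values.time = time;
}

/* Copies either the full UBO or only the header into gpu_buffer (a stand-in
   for the bound uniform buffer) and returns the number of bytes written. */
static size_t postprocess_upload_ubo(PostProcess *pp, unsigned char *gpu_buffer)
{
    assert(pp != NULL && gpu_buffer != NULL);

    if (pp->ubo_dirty) {
        memcpy(gpu_buffer, &pp->values, sizeof pp->values);
        pp->ubo_dirty = false;
        return sizeof pp->values;
    }

    PostProcessUBOHeader header = { pp->values.active_effects, pp->values.time };
    memcpy(gpu_buffer, &header, sizeof header);
    return sizeof header;
}
```

Keeping the per-frame fields at offset 0 is what makes the partial upload a single contiguous write; if `time` sat in the middle of the struct, the fast path would need a second call or a larger range.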
2.2 Sampler Uniforms: Bind Once
Files: src/postprocess.c, src/effects/fx_bloom.c,
src/effects/fx_dof.c, src/effects/fx_auto_exposure.c,
src/effects/fx_motion_blur.c
Problem: shader_set_int(shader, "samplerName", unit) was called every
frame in *_render() functions, even though sampler uniforms are
per-program state (glProgramUniform) that never changes: each sampler
is always associated with the same texture unit.
Solution:
- Postprocess (uber-shader): 8 sampler→unit bindings are configured in setup_sampler_uniforms(), called once per update_current_shader() (shader change or recompilation).
- Effect shaders: Sampler uniforms are configured in *_init() functions:
  - Bloom: srcTexture = 0 on prefilter, downsample, upsample
  - Auto-Exposure: sceneTexture = 0 on downsample, lumTexture = 0 on adapt
  - Motion Blur: velocityTexture = 0 on tile_max, tileMaxTexture = 0 on neighbor_max
  - DoF: Reuses Bloom shaders (already configured)
Gain: Elimination of ~15 glGetUniformLocation + glUniform1i calls per frame.
Note: silent_warnings = true is now set before update_current_shader()
in postprocess_compile_optimized() to avoid warnings about missing uniforms
on compiled-out samplers.
2.3 VAO Bind/Unbind: Factored Out
Files: src/postprocess.c, src/effects/fx_bloom.c,
src/effects/fx_dof.c, src/effects/fx_auto_exposure.c
Problem: Each effect (Bloom, DoF, AE downsample) independently bound
and unbound the screen quad VAO (glBindVertexArray(vao) /
glBindVertexArray(0)), generating unnecessary state changes since all
fullscreen passes use the same VAO.
Solution: The VAO is bound once at the start of postprocess_end()
and unbound once after the last glDrawArrays (Final Composite).
// Start of postprocess_end()
glBindVertexArray(post_processing->screen_quad_vao);
// ... Bloom, DoF, AE, MB, Final Composite ...
// After last draw
glBindVertexArray(0);
Gain: Elimination of ~6 glBindVertexArray calls per frame (3 bind +
3 unbind in Bloom, DoF, AE).
3. GPU Profiler Refactoring
3.1 Double-Buffering Metadata
Files: include/gpu_profiler.h, src/gpu_profiler.c
Problem: Stage metadata (name, color, depth, parent_index) was written
directly into profiler->stages[] during recording. This array is also read
by the UI for display. With query double-buffering, frame N write indices could
overwrite frame N-1 metadata still being read.
Solution: Added GPUStageInfo stage_info[MAX_GPU_STAGES] to each
GPUQueryBuffer. Metadata is written to the write buffer during
gpu_profiler_start_stage(), then restored to stages[] during readback in
gpu_profiler_begin_frame().
typedef struct {
char name[MAX_GPU_STAGE_NAME];
uint32_t color;
int depth;
int parent_index;
} GPUStageInfo;
3.2 Separating recording_count / stage_count
Files: include/gpu_profiler.h, src/gpu_profiler.c,
tests/test_gpu_profiler.c
Problem: stage_count served both as a recording counter (write-path)
and a display counter (read-path). After the buffer swap, it was reset to 0,
causing the UI to flicker for one frame.
Solution: Two distinct counters:
- recording_count: write counter, reset to 0 on each begin_frame().
- stage_count: display counter, updated from the completed read buffer, never reset to 0.
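The interaction between per-buffer metadata (3.1) and the two counters (3.2) can be modeled without any GPU queries. The sketch below is a simplified assumption of how the pieces might fit together. The function names follow the text, but their signatures, the `count` and `write_index` fields, and the constant values are hypothetical. It also assumes that an empty read buffer leaves the displayed data untouched, which is one way to satisfy "never reset to 0".

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical limits; the real values are not given in the text. */
enum { MAX_GPU_STAGES = 32, MAX_GPU_STAGE_NAME = 32, GPU_QUERY_BUFFER_COUNT = 2 };

typedef struct {
    char     name[MAX_GPU_STAGE_NAME];
    uint32_t color;
    int      depth;
    int      parent_index;
} GPUStageInfo;

typedef struct {
    GPUStageInfo stage_info[MAX_GPU_STAGES]; /* metadata recorded with this buffer */
    int          count;                      /* hypothetical: stages recorded here */
} GPUQueryBuffer;

typedef struct {
    GPUQueryBuffer buffers[GPU_QUERY_BUFFER_COUNT];
    int            write_index;              /* hypothetical: buffer being recorded */
    GPUStageInfo   stages[MAX_GPU_STAGES];   /* what the UI reads */
    int            recording_count;          /* write path, reset every frame */
    int            stage_count;              /* read path, only updated on readback */
} GPUProfiler;

/* Seals the buffer recorded last frame, publishes it to the UI-facing
   array, then switches to the other buffer for the new frame. */
static void gpu_profiler_begin_frame(GPUProfiler *p)
{
    GPUQueryBuffer *completed = &p->buffers[p->write_index];
    completed->count = p->recording_count;

    if (completed->count > 0) {
        memcpy(p->stages, completed->stage_info,
               (size_t)completed->count * sizeof(GPUStageInfo));
        p->stage_count = completed->count;
    }

    p->write_index = (p->write_index + 1) % GPU_QUERY_BUFFER_COUNT;
    p->recording_count = 0;
}

/* Records metadata into the write buffer only; stages[] is not touched.
   Returns the stage index, or -1 when the buffer is full. */
static int gpu_profiler_start_stage(GPUProfiler *p, const char *name,
                                    uint32_t color, int depth, int parent_index)
{
    GPUStageInfo *info;

    assert(p != NULL && name != NULL);
    if (p->recording_count >= MAX_GPU_STAGES) {
        return -1;
    }

    info = &p->buffers[p->write_index].stage_info[p->recording_count];
    strncpy(info->name, name, MAX_GPU_STAGE_NAME - 1);
    info->name[MAX_GPU_STAGE_NAME - 1] = '\0';
    info->color = color;
    info->depth = depth;
    info->parent_index = parent_index;
    return p->recording_count++;
}
```

In this model the UI can read stages[] and stage_count at any point during a frame and always sees a complete, consistent snapshot of an earlier frame, because recording only ever touches the write buffer and recording_count.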
3.3 Restructured Profiling Stages
Files: src/postprocess.c, include/app_settings.h
Post-process profiling is restructured to distinguish:
| Stage | Content |
|---|---|
| Post-Process | Parent of all sub-stages |
| Bloom | Prefilter + Downsample + Upsample |
| DoF | Downsample + Tent blur |
| Auto-Exposure | Lum downsample + Compute adaptation |
| MB Compute | Tile Max + Neighbor Max (compute dispatches) |
| Final Composite | Fullscreen quad: MB sampling, CA, FXAA, etc. |
Added GPU_PROFILER_COMPOSITE_COLOR (Nord Frost Medium Blue, 0x81A1C1)
to app_settings.h.
4. Modified Files
| File | Changes |
|---|---|
| include/postprocess.h | Added bool ubo_dirty |
| include/gpu_profiler.h | GPUStageInfo, recording_count, docs |
| include/perf_timer.h | Clarified GPUTimer docs |
| include/app_settings.h | GPU_PROFILER_COMPOSITE_COLOR |
| src/postprocess.c | UBO dirty flag, setup_sampler_uniforms(), VAO factoring, memory barrier, profiling restructure |
| src/gpu_profiler.c | Metadata double-buffering, recording_count, docs |
| src/effects/fx_bloom.c | Sampler init, removed per-frame VAO/sampler |
| src/effects/fx_dof.c | Removed per-frame VAO/sampler |
| src/effects/fx_auto_exposure.c | Sampler init, FBO unbind, removed per-frame VAO/sampler |
| src/effects/fx_motion_blur.c | Sampler init, fixed dispatch, docs |
| shaders/postprocess.frag | FXAA/MB/CA reordering |
| shaders/lum_adapt.comp | Parallel reduction with 256 threads |
| tests/test_gpu_profiler.c | Adapted to new fields |
5. Validation
- Build (make all): 0 errors, 0 warnings
- Tests (make test): 32/32 tests passed (100%)
- Lint (make lint): All checks passed (clang-tidy)
- Run (make run-release): Visual verification OK