GPU Rendering Synchronization: Intel vs NVIDIA¶
Date: 2026-01-30 Status: Resolved Impact: Critical - Visual quality consistency across GPU vendors
Executive Summary¶
Investigation and resolution of rendering differences between Intel and NVIDIA GPUs in the suckless-ogl PBR renderer. Issues manifested as white halos and incorrect FXAA edge blending on NVIDIA hardware.
Visual Comparison¶
| GPU | Before Fix | After Fix |
|---|---|---|
| Intel | ✅ Clean edges, proper FXAA | ✅ Unchanged |
| NVIDIA | ❌ White halos, buggy edges | ✅ Identical to Intel |
Root Causes Identified¶
Issue 1: FXAA Luminance Recalculation¶
File: shaders/postprocess/fxaa.glsl (Lines 176-198)
Problem: Edge search loop recalculated luma using sqrt(), which has different precision on Intel vs NVIDIA.
Fix: Use pre-calculated luma from alpha channel instead.
- lumaEnd1 = FxaaLuma(texture(screenTexture, uv1).rgb);
+ lumaEnd1 = texture(screenTexture, uv1).a; // Pre-calculated in PBR shader
Issue 2: Derivative-Based Roughness Clamping¶
File: shaders/pbr_functions.glsl (Lines 76-100)
Problem: dFdx()/dFdy() produce different values on Intel vs NVIDIA, causing extreme roughness values at edges on NVIDIA.
Attempted Fixes:
- ❌ Threshold 0.1 → 0.5: Reduced but didn't eliminate halos
- ❌ Saturation
min(maxVariation, 1.0): Still visible artifacts - ✅ Complete removal: Achieved visual parity
Final Solution: Disabled roughness clamping entirely.
float compute_roughness_clamping(vec3 N_val, float roughness_val)
{
// Disabled: derivatives have different precision on NVIDIA vs Intel
roughness_val = clamp(roughness_val, 0.0, 1.0);
return roughness_val;
}
Why Derivatives Differ¶
| Vendor | Implementation | Behavior |
|---|---|---|
| Intel | Conservative 2x2 quad finite differences | Stable, predictable values |
| NVIDIA | Optimized hardware units | Different rounding, can spike |
Result: pow(maxVariation, 0.1) amplified vendor differences → white halos on NVIDIA.
Trade-offs¶
Lost¶
- Geometric anti-aliasing on curved surfaces
- Specular aliasing prevention on very smooth metals
Gained¶
- ✅ Cross-vendor consistency
- ✅ Predictable behavior
- ✅ Simplified shader code
- ✅ Minor performance improvement
Verdict: FXAA already provides excellent AA. Loss of hardware-dependent roughness clamping is compensated by a stable analytic minimum and sphere-specific curvature clamping.
Derivatives vs Analytic Performance¶
A key finding during this synchronization effort was the trade-off between using GLSL built-ins and custom analytic math.
| Method | Built-in (dFdx, fwidth) |
Analytic Curvature/Fade |
|---|---|---|
| Vendor Consistency | ❌ Poor (Driver/Hardware precision) | ✅ 100% (Mathematical) |
| Latency | Medium (Quad-sync required) | Low (Pure ALU) |
| Logic Safety | ❌ Fails in divergent branches | ✅ Branch-safe |
| Implementation | Trivial (1 line) | Complex (Custom math per primitive) |
Cost Evaluation: While analytic math uses more ALU cycles (approx. 5-10 more), it avoids the hardware synchronization latency of quads and ensures that regression maps between Intel, NVIDIA, and AMD remain dark (bit-perfect parity).
Validation Results¶
Visual Inspection¶
Reference Metrics (FXAA Synthetic Test)¶
Target values based on correct Intel HD 4600 behavior (Sphere Pattern):
| Metric | Reference Value (Intel) | NVIDIA (GTX 950M) | Pass Threshold |
|---|---|---|---|
| Edge Noise (No AA) | ~0.0026 | 0.0026 | N/A |
| Edge Noise (FXAA) | ~0.0015 | 0.0015 | < 0.0020 |
| Noise Reduction | ~41.88% | 41.85% | > 10% |
Conclusion: NVIDIA rendering is now mathematically identical to Intel (delta < 0.1%), confirming the precision fix is successful.
Files Modified¶
shaders/postprocess/fxaa.glslshaders/pbr_functions.glsl
PBO Mapping Synchronization (Implicit vs. Explicit)¶
One of the most elusive performance issues in OpenGL is the implicit synchronization that occurs during glMapBuffer.
The Symptom¶
ApiTrace reports: api performance issue 1: memory mapping a busy "buffer" BO stalled and took 1.379 ms.
The Cause¶
If you try to map a buffer that is currently being used by a pending GPU command (like a glReadPixels or glTexSubImage2D), the driver must pause the CPU until that command is finished. Even "unsynchronized" mapping can sometimes stall if the buffer hasn't been properly fenced.
The Fix: Explicit Fencing (GLsync)¶
Instead of letting the driver guess, we use explicit synchronization:
- Fence after Command:
- Wait before Map:
// Non-blocking wait (timeout 0)
GLenum status = glClientWaitSync(app->sync[!idx], GL_SYNC_FLUSH_COMMANDS_BIT, 0);
if (status != GL_TIMEOUT_EXPIRED) {
void* ptr = glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY);
// ... process ...
glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
}
By checking the fence with a zero timeout, we ensure that if the GPU is still busy, the CPU simply skips the logic for that frame instead of waiting. This is crucial for maintaining a high and stable frame rate.
References¶
- shader-cross-gpu-compatibility.md - General guidelines
- FXAA 3.11 Whitepaper
- OpenGL Derivatives