GPU Rendering Synchronization: Intel vs NVIDIA¶

Date: 2026-01-30 Status: Resolved Impact: Critical - Visual quality consistency across GPU vendors

Executive Summary¶

Investigation and resolution of rendering differences between Intel and NVIDIA GPUs in the suckless-ogl PBR renderer. Issues manifested as white halos and incorrect FXAA edge blending on NVIDIA hardware.

Visual Comparison¶

GPU	Before Fix	After Fix
Intel	✅ Clean edges, proper FXAA	✅ Unchanged
NVIDIA	❌ White halos, buggy edges	✅ Identical to Intel

Root Causes Identified¶

Issue 1: FXAA Luminance Recalculation¶

File: shaders/postprocess/fxaa.glsl (Lines 176-198)

Problem: Edge search loop recalculated luma using sqrt(), which has different precision on Intel vs NVIDIA.

Fix: Use pre-calculated luma from alpha channel instead.

- lumaEnd1 = FxaaLuma(texture(screenTexture, uv1).rgb);
+ lumaEnd1 = texture(screenTexture, uv1).a;  // Pre-calculated in PBR shader

Issue 2: Derivative-Based Roughness Clamping¶

File: shaders/pbr_functions.glsl (Lines 76-100)

Problem: dFdx()/dFdy() produce different values on Intel vs NVIDIA, causing extreme roughness values at edges on NVIDIA.

Attempted Fixes:

❌ Threshold 0.1 → 0.5: Reduced but didn't eliminate halos
❌ Saturation min(maxVariation, 1.0): Still visible artifacts
✅ Complete removal: Achieved visual parity

Final Solution: Disabled roughness clamping entirely.

float compute_roughness_clamping(vec3 N_val, float roughness_val)
{
    // Disabled: derivatives have different precision on NVIDIA vs Intel
    roughness_val = clamp(roughness_val, 0.0, 1.0);
    return roughness_val;
}

Why Derivatives Differ¶

Vendor	Implementation	Behavior
Intel	Conservative 2x2 quad finite differences	Stable, predictable values
NVIDIA	Optimized hardware units	Different rounding, can spike

Result: pow(maxVariation, 0.1) amplified vendor differences → white halos on NVIDIA.

Trade-offs¶

Lost¶

Geometric anti-aliasing on curved surfaces
Specular aliasing prevention on very smooth metals

Gained¶

✅ Cross-vendor consistency
✅ Predictable behavior
✅ Simplified shader code
✅ Minor performance improvement

Verdict: FXAA already provides excellent AA. Loss of hardware-dependent roughness clamping is compensated by a stable analytic minimum and sphere-specific curvature clamping.

Derivatives vs Analytic Performance¶

A key finding during this synchronization effort was the trade-off between using GLSL built-ins and custom analytic math.

Method	Built-in (`dFdx`, `fwidth`)	Analytic Curvature/Fade
Vendor Consistency	❌ Poor (Driver/Hardware precision)	✅ 100% (Mathematical)
Latency	Medium (Quad-sync required)	Low (Pure ALU)
Logic Safety	❌ Fails in divergent branches	✅ Branch-safe
Implementation	Trivial (1 line)	Complex (Custom math per primitive)

Cost Evaluation: While analytic math uses more ALU cycles (approx. 5-10 more), it avoids the hardware synchronization latency of quads and ensures that regression maps between Intel, NVIDIA, and AMD remain dark (bit-perfect parity).

Validation Results¶

Visual Inspection¶

Before:  Intel ✅  |  NVIDIA ❌ (halos, buggy FXAA)
After:   Intel ✅  |  NVIDIA ✅ (identical rendering)

Reference Metrics (FXAA Synthetic Test)¶

Target values based on correct Intel HD 4600 behavior (Sphere Pattern):

Metric	Reference Value (Intel)	NVIDIA (GTX 950M)	Pass Threshold
Edge Noise (No AA)	~0.0026	0.0026	N/A
Edge Noise (FXAA)	~0.0015	0.0015	< 0.0020
Noise Reduction	~41.88%	41.85%	> 10%

Conclusion: NVIDIA rendering is now mathematically identical to Intel (delta < 0.1%), confirming the precision fix is successful.

Files Modified¶

shaders/postprocess/fxaa.glsl
shaders/pbr_functions.glsl

PBO Mapping Synchronization (Implicit vs. Explicit)¶

One of the most elusive performance issues in OpenGL is the implicit synchronization that occurs during glMapBuffer.

The Symptom¶

ApiTrace reports: api performance issue 1: memory mapping a busy "buffer" BO stalled and took 1.379 ms.

The Cause¶

If you try to map a buffer that is currently being used by a pending GPU command (like a glReadPixels or glTexSubImage2D), the driver must pause the CPU until that command is finished. Even "unsynchronized" mapping can sometimes stall if the buffer hasn't been properly fenced.

The Fix: Explicit Fencing (`GLsync`)¶

Instead of letting the driver guess, we use explicit synchronization:

Fence after Command:

glReadPixels(...);
app->sync[idx] = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);

Wait before Map:

// Non-blocking wait (timeout 0)
GLenum status = glClientWaitSync(app->sync[!idx], GL_SYNC_FLUSH_COMMANDS_BIT, 0);
if (status != GL_TIMEOUT_EXPIRED) {
    void* ptr = glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY);
    // ... process ...
    glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
}

By checking the fence with a zero timeout, we ensure that if the GPU is still busy, the CPU simply skips the logic for that frame instead of waiting. This is crucial for maintaining a high and stable frame rate.

References¶

shader-cross-gpu-compatibility.md - General guidelines
FXAA 3.11 Whitepaper
OpenGL Derivatives