SIMD Optimization: Texture Uploads
Overview
This guide details the implementation of SIMD (Single Instruction, Multiple Data) optimizations used to accelerate the conversion of high-dynamic-range (HDR) texture data from 32-bit floating point (float) to 16-bit half-precision (half) format.
Motivation
Profiling via Tracy revealed that glTexSubImage2D was a significant bottleneck during the loading phase, particularly for large HDR environment maps (4K resolution).
- Problem: The OpenGL driver was performing the Float32 -> Float16 conversion on the CPU using a scalar (single-value) path, leading to high CPU usage and stalling the main thread. See the detailed Mesa F32-to-F16 Analysis for a technical breakdown of this bottleneck.
- Goal: Offload this conversion to optimized hardware vector instructions (AVX2/F16C) before handing the data to the driver, allowing glTexSubImage2D to perform a fast memcpy.
Implementation
The optimization is implemented in src/simd_utils.c and relies on x86-64 extensions:
- AVX2: For processing 8 floats (256 bits) in parallel.
- F16C: For the _mm256_cvtps_ph intrinsic, which converts a vector of 8 floats to 8 half-floats.
Code Path
- Check: The build system determines if __AVX2__ and __F16C__ are available.
- SIMD Path: If available, the code processes the pixel data in batches of 8 floats using intrinsics.
- Scalar Fallback: For the remaining values (the "tail") or on unsupported hardware, a scalar implementation float_to_half_intrinsic is used.
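The code path above can be sketched roughly as follows. This is a minimal illustration, not the contents of src/simd_utils.c: the names convert_f32_to_f16 and float_to_half_sw are hypothetical, the rounding mode (round to nearest even) is an assumption, and the software converter stands in for the project's own scalar fallback. It also quiets NaNs rather than preserving their payload.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__AVX2__) && defined(__F16C__)
#include <immintrin.h>
#endif

/* Hypothetical software float -> half converter (round to nearest even). */
static uint16_t float_to_half_sw(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);               /* safe type pun */
    const uint32_t sign = (bits >> 16) & 0x8000u;
    const uint32_t mag  = bits & 0x7FFFFFFFu;

    if (mag > 0x7F800000u)  return (uint16_t)(sign | 0x7E00u); /* NaN -> quiet NaN */
    if (mag >= 0x477FF000u) return (uint16_t)(sign | 0x7C00u); /* Inf or overflow  */
    if (mag < 0x33000000u)  return (uint16_t)sign;             /* below 2^-25 -> 0 */

    if (mag < 0x38800000u) {                      /* result is a subnormal half */
        const uint32_t shift = 126u - (mag >> 23);           /* 14..24 */
        const uint32_t m     = (mag & 0x007FFFFFu) | 0x00800000u;
        uint32_t       q     = m >> shift;
        const uint32_t rem   = m & ((1u << shift) - 1u);
        const uint32_t half  = 1u << (shift - 1u);
        if (rem > half || (rem == half && (q & 1u))) ++q;    /* ties to even */
        return (uint16_t)(sign | q);
    }

    uint32_t r = mag - 0x38000000u;               /* rebias exponent 127 -> 15 */
    r += 0x0FFFu + ((r >> 13) & 1u);              /* round to nearest even */
    return (uint16_t)(sign | (r >> 13));
}

/* Converts `count` floats to halfs: batches of 8 via F16C, then a scalar tail. */
static void convert_f32_to_f16(const float *src, uint16_t *dst, size_t count)
{
    assert(src != NULL && dst != NULL);
    size_t i = 0;
#if defined(__AVX2__) && defined(__F16C__)
    for (; i + 8 <= count; i += 8) {
        const __m256  v = _mm256_loadu_ps(src + i);
        const __m128i h = _mm256_cvtps_ph(
            v, _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);
        _mm_storeu_si128((__m128i *)(dst + i), h);
    }
#endif
    for (; i < count; ++i)                        /* tail, or whole buffer */
        dst[i] = float_to_half_sw(src[i]);
}
```

Here count is the number of float components, so an RGBA texture would pass Width * Height * 4. When the compiler does not define __AVX2__ and __F16C__, the SIMD loop is compiled out and the scalar loop handles the whole buffer.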
Build System Configuration
To ensure these instructions are used, the build system (CMakeLists.txt) forces the use of native architecture flags when ENABLE_NATIVE_ARCH=ON is set.
option(ENABLE_NATIVE_ARCH "Enable native architecture optimizations" ON)
if(ENABLE_NATIVE_ARCH)
  add_compile_options(-march=native -mavx2 -mf16c)
endif()
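[!IMPORTANT] This requires the host CPU (where the code is compiled) to support these instructions, because -march=native targets whatever the build machine reports. Any machine that runs the resulting binary needs AVX2 and F16C as well. The CI/CD pipeline or distrobox container must expose these CPU flags.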
Performance Analysis
Impact
- Before: 4K HDR upload took ~112ms (driver conversion).
- After: 4K HDR upload takes ~35ms (SIMD conversion + upload).
- Speedup: ~3.2x faster texture uploads.
Memory Trade-off
This optimization requires a temporary buffer to store the converted 16-bit data before upload.
- RAM Usage: Width * Height * 4 (RGBA) * 2 (Bytes)
- Example: A 4K texture (4096x2048) requires a ~64MB temporary allocation.
- Rationale: We trade transient RAM usage for significantly reduced CPU time to avoid frame stutters during loading.
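The buffer size formula can be checked with a small helper. This is a sketch only: staging_buffer_bytes is a hypothetical name, and it assumes four 16-bit channels per pixel, as in the formula above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes needed for a temporary RGBA16F buffer: Width * Height * 4 * 2. */
static size_t staging_buffer_bytes(size_t width, size_t height)
{
    const size_t channels       = 4;                 /* RGBA */
    const size_t bytes_per_half = sizeof(uint16_t);  /* 2 bytes */

    /* Guard against size_t overflow for very large dimensions. */
    assert(height == 0 ||
           width <= SIZE_MAX / channels / bytes_per_half / height);

    return width * height * channels * bytes_per_half;
}
```

For a 4096x2048 texture this gives 4096 * 2048 * 4 * 2 = 67,108,864 bytes, which is exactly 64 MiB.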
Verification
To verify the optimization is active at runtime, check the log output:
If you see Software Fallback (No F16C/AVX), check your build flags and CPU capabilities (lscpu | grep f16c).
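One way such a log line could be produced is to report which path the compiler actually built. The sketch below assumes that approach. The function names and the SIMD label are hypothetical, and only the fallback string is taken from the message quoted above.

```c
#include <assert.h>
#include <stdio.h>

/* Returns a label for the conversion path selected at compile time. */
static const char *conversion_path_name(void)
{
#if defined(__AVX2__) && defined(__F16C__)
    return "SIMD (AVX2/F16C)";                 /* hypothetical label */
#else
    return "Software Fallback (No F16C/AVX)";  /* string quoted in this guide */
#endif
}

/* Writes the active path to the given stream, e.g. stderr at startup. */
static void log_conversion_path(FILE *out)
{
    assert(out != NULL);
    fprintf(out, "Float32 -> Float16 path: %s\n", conversion_path_name());
}
```

Because the check uses compiler-defined macros, it reflects the build flags rather than the CPU the binary is running on, which is why the build flags are the first thing to inspect when the fallback message appears.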