glTexSubImage2D Bottleneck Analysis: Float32 → Float16 Conversion in Mesa 25.0.7¶
Motivation¶
Profiling via Tracy revealed that glTexSubImage2D was a significant bottleneck during the resource loading phase, particularly for large HDR environment maps (4K resolution).
- Problem: The OpenGL driver (Mesa) performs the
Float32 → Float16conversion on the CPU using a scalar (single-value) path, leading to high CPU usage and stalling the main thread. - Goal: Understand why Mesa does not implement vectorized conversion (AVX2/F16C) for this path, and evaluate alternatives.
Full Call Chain¶
glTexSubImage2D() [src/mesa/main/teximage.c:4096]
└─► st_TexSubImage() [src/mesa/state_tracker/st_cb_texture.c:2117]
│
├─ 1. Fast path: texture_subdata (memcpy) [line ~2162]
│ → SKIPPED if source format ≠ dest format (GL_FLOAT ≠ GL_HALF_FLOAT)
│ → Condition: _mesa_texstore_can_use_memcpy() must return true
│
├─ 2. Blit-based path (GPU blit) [line ~2198]
│ → SKIPPED if _mesa_format_matches_format_and_type() fails
│
└─ 3. CPU FALLBACK: _mesa_store_texsubimage() [line ~2398]
└─► store_texsubimage() [src/mesa/main/texstore.c:978]
└─► _mesa_texstore() [src/mesa/main/texstore.c:1091]
└─► texstore_rgba() [src/mesa/main/texstore.c:680]
└─► _mesa_format_convert() [src/mesa/main/format_utils.c:278]
│
├─ If src and dst are array formats:
│ └─► _mesa_swizzle_and_convert() [format_utils.c:1417]
│ └─► convert_half_float() [format_utils.c:913]
│ case MESA_ARRAY_FORMAT_TYPE_FLOAT:
│ SWIZZLE_CONVERT(uint16_t, float,
│ _mesa_float_to_half(src)) ← HERE
│
└─ Otherwise (via float intermediate):
└─► _mesa_pack_float_rgba_row() [format_pack.h:38]
Files Involved (Detailed)¶
1. GL Entry Point — src/mesa/main/teximage.c¶
Line 4096 — _mesa_TexSubImage2D():
_mesa_TexSubImage2D(GLenum target, GLint level,
GLint xoffset, GLint yoffset,
GLsizei width, GLsizei height,
GLenum format, GLenum type, const GLvoid *pixels)
Dispatches to ctx->Driver.TexSubImage which points to st_TexSubImage.
2. State Tracker — src/mesa/state_tracker/st_cb_texture.c¶
Line 2117 — st_TexSubImage():
The state tracker attempts 3 paths, in order:
Fast path: texture_subdata (lines 2157-2196)¶
if (pixels &&
!unpack->BufferObj &&
!is_ms &&
_mesa_texstore_can_use_memcpy(ctx, texImage->_BaseFormat,
texImage->TexFormat, format, type,
unpack)) {
// ... direct memcpy via pipe->texture_subdata()
return;
}
Failure condition: _mesa_texstore_can_use_memcpy() checks that the source format (GL_FLOAT) matches the texture format (HALF_FLOAT) exactly. It does not → skipped.
Blit-based path (lines 2198-2390)¶
Creates an intermediate GPU texture, copies data via memcpy, then blits. But if _mesa_format_matches_format_and_type() already matches (meaning the memcpy fallback would be just as fast), the code jumps to the fallback.
CPU Fallback (lines 2393-2400)¶
fallback:
_mesa_store_texsubimage(ctx, dims, texImage, xoffset, yoffset, zoffset,
width, height, depth, format, type, pixels, unpack);
3. Fallback Texture Store — src/mesa/main/texstore.c¶
Line 978 — store_texsubimage():
Maps the texture into CPU memory, then calls _mesa_texstore() → texstore_rgba().
Line 818 — texstore_rgba() calls _mesa_format_convert():
_mesa_format_convert(dstSlices[img], dstFormat, dstRowStride,
src, srcMesaFormat, srcRowStride,
srcWidth, srcHeight,
needRebase ? rebaseSwizzle : NULL);
4. Format Conversion — src/mesa/main/format_utils.c¶
Line 278 — _mesa_format_convert():
When src and dst are both array formats (FLOAT RGBA and HALF RGBA), the code goes directly through:
_mesa_swizzle_and_convert(dst, dst_type, dst_num_channels,
src, src_type, src_num_channels,
src2dst, normalized, width);
Line 1417 — _mesa_swizzle_and_convert():
The switch on dst_type dispatches to convert_half_float():
case MESA_ARRAY_FORMAT_TYPE_HALF:
convert_half_float(void_dst, num_dst_channels, void_src, src_type,
num_src_channels, swizzle, normalized, count);
break;
Line 913 — convert_half_float():
case MESA_ARRAY_FORMAT_TYPE_FLOAT:
SWIZZLE_CONVERT(uint16_t, float, _mesa_float_to_half(src));
break;
5. The Scalar Loop — SWIZZLE_CONVERT_LOOP Macro¶
Line 728 — The macro expands to:
for (s = 0; s < count; ++s) { // for each pixel
for (j = 0; j < SRC_CHANS; ++j) { // for each channel (R, G, B, A)
SRC_TYPE src = typed_src[j];
tmp[j] = _mesa_float_to_half(src); // ← scalar conversion
}
typed_dst[0] = tmp[swizzle_x];
typed_dst[1] = tmp[swizzle_y];
typed_dst[2] = tmp[swizzle_z];
typed_dst[3] = tmp[swizzle_w];
typed_src += SRC_CHANS;
typed_dst += DST_CHANS;
}
For an RGBA 4K texture (4096×2048), this results in 4096 × 2048 × 4 = ~33 million calls to _mesa_float_to_half().
6. The Scalar Conversion — src/util/half_float.h¶
Line 58 — _mesa_float_to_half():
static inline uint16_t
_mesa_float_to_half(float val)
{
#if defined(USE_X86_64_ASM)
if (util_get_cpu_caps()->has_f16c) {
__m128 in = {val}; // loads ONE float into XMM (128-bit)
__m128i out;
__asm volatile("vcvtps2ph $0, %1, %0" : "=v"(out) : "v"(in));
return out[0]; // extracts ONE half
}
#endif
return _mesa_float_to_half_slow(val); // pure software fallback
}
Even with F16C available, the conversion is scalar: vcvtps2ph can convert 4 floats (XMM) or 8 floats (YMM/AVX2) in a single instruction, but here it converts only one value per call.
7. Software Fallback — src/util/half_float.c¶
Line 57 — _mesa_float_to_half_slow():
Pure software conversion via bit manipulation (mantissa/exponent extraction, denorm/NaN/Inf handling, rounding). Even slower than the scalar F16C path.
Git History (blame) and Merge Requests¶
Key Commits in Chronological Order¶
| Date | Commit | Author | Description | MR |
|---|---|---|---|---|
| 2014-07-11 | d55f77b503ab |
Jason Ekstrand (Intel) | Created the SWIZZLE_CONVERT framework |
pre-GitLab |
| 2014-09-12 | 418da979053d |
Brian Paul (VMware) | Refactored SWIZZLE_CONVERT_LOOP macro |
pre-GitLab |
| 2014-09-12 | cfeb394224f2 |
Brian Paul (VMware) | Extracted convert_half_float() (compile time ÷4) |
pre-GitLab |
| 2014-11-27 | ea79ab3e8c37 |
Iago Toral Quiroga (Igalia) | Replaced GL types → MESA_ARRAY_FORMAT_TYPE_* |
pre-GitLab |
| 2020-09-18 | ffcdf76799b0 |
Marek Olšák (AMD) | Added F16C inline asm (scalar) | !6987 |
| 2021-02-25 | a9618e7c4214 |
Rob Clark | util_get_cpu_caps() accessor |
!9266 |
| 2023-06-27 | c9a5cac4ffa4 |
Konstantin Seurer | immintrin.h → xmmintrin.h (compile time -22%) |
!23871 |
| 2024-10-03 | 3c0bf4238188 |
David Heidelberg | aarch64 fcvt half→float (not float→half!) |
!31564 |
MR !6987 — The F16C Commit (Details)¶
- Title: "Implement F16C using inline assembly on x86_64"
- Author: Marek Olšák (
@mareko, AMD) - Branch:
f16c-v2 - Reviewer: Matt Turner (Intel)
- Merged: October 7, 2020
- Primary motivation: Fix
bptc-float-modestest on llvmpipe - Approach: Optimize the existing scalar
_mesa_float_to_half()function, not create a batch variant
Why Mesa Has No Vectorized Conversion¶
1. Layered Architecture Incompatible with Vectorization¶
The SWIZZLE_CONVERT_LOOP framework (Jason Ekstrand, 2014) was designed as a generic format converter:
- Iterates pixel by pixel, channel by channel
- The conversion is passed as a macro parameter (
CONV = _mesa_float_to_half(src)) - Swizzling (RGBA→BGRA reordering, etc.) is interleaved with conversion
- The code relies on compiler loop unrolling (
nested switchonDST_CHANS×SRC_CHANS) rather than explicit vectorization
Vectorizing this code would require significant refactoring: separating conversion from swizzle, or creating a specialized float[4] → half[4] path with SIMD intrinsics.
2. _mesa_float_to_half Was Designed as a Scalar API¶
When Marek Olšák added F16C support (MR !6987, 2020), he optimized the existing scalar function:
His motivation was to fix a bug (bptc-float-modes on llvmpipe), not to optimize texture upload throughput. He never added a batch variant such as:
// Hypothetical — does NOT exist in Mesa
void _mesa_floats_to_halfs(uint16_t *dst, const float *src, size_t count);
3. llvmpipe Already Does Vectorized Conversion... via JIT¶
In src/gallium/auxiliary/gallivm/lp_bld_conv.c (line 183), llvmpipe uses LLVM to emit vectorized instructions:
if (util_get_cpu_caps()->has_f16c &&
(length == 4 || length == 8)) {
intrinsic = "llvm.x86.vcvtps2ph.128"; // 4 floats at once
// or "llvm.x86.vcvtps2ph.256" // 8 floats at once (AVX)
}
This code is reserved for the JIT rendering pipeline (shaders compiled on-the-fly) and is never used for CPU-side texture uploads via glTexSubImage2D.
4. Historical Issues with AVX Intrinsics¶
Mesa release notes for 18.3.3, 19.0.0, and 19.3.0 document recurring build issues with _mm256_cvtps_ph:
"format_types.h:1220: undefined reference to
_mm256_cvtps_ph"
The SWR driver (Intel Software Rasterizer, now deprecated) emitted AVX/F16C intrinsics without verifying compiler and target support. These bugs made Mesa developers cautious about using SIMD intrinsics in C "hot path" code.
5. The CPU Path Is Not the Primary Use Case¶
For hardware drivers (radeonsi, iris, nouveau...):
st_TexSubImagefirst triestexture_subdata(direct memcpy if formats match)- Then GPU blit (conversion done by the GPU)
- The CPU fallback is only reached when source format ≠ destination format
Mesa developers assume that applications should provide data in the texture's native format (i.e., pass GL_HALF_FLOAT if the internal texture format is GL_RGBA16F).
Concrete Impact for a 4K RGBA HDR Texture¶
| Parameter | Value |
|---|---|
| Resolution | 4096 × 2048 |
| Channels | 4 (RGBA) |
| Total pixels | 8,388,608 |
Calls to _mesa_float_to_half() |
33,554,432 |
| Instruction per call | vcvtps2ph (if F16C) or software fallback |
| XMM register utilization | ¼ (4 slots available, 1 used) |
| YMM register utilization | ⅛ (8 slots available, not used) |
Theoretical Gain from Vectorization¶
| Method | Conversions/instruction | Theoretical speedup |
|---|---|---|
| Current scalar (XMM, 1 float) | 1 | 1× |
| SSE/XMM vectorized (4 floats) | 4 | ~4× |
| AVX/YMM vectorized (8 floats) | 8 | ~8× |
Alternatives and Workarounds¶
Application-side (recommended by Mesa)¶
- Supply data as Float16 directly to
glTexSubImage2D(type = GL_HALF_FLOAT), which activates the fast memcpy path - Perform the F32→F16 conversion application-side using AVX2/F16C intrinsics before the upload, freeing the driver from an expensive conversion
- Use PBOs (Pixel Buffer Objects) to make the upload asynchronous
Driver-side Mesa (potential improvements)¶
- Add a batch variant
_mesa_floats_to_halfs_avx2()usingvcvtps2phon YMM registers - Create a specialized path in
convert_half_float()when the swizzle is identity (RGBA→RGBA) - Integrate vectorized conversion directly into
_mesa_format_convert()for theFLOAT→HALFcase without swizzle
Conclusion¶
The observed bottleneck is real and confirmed by the source code: the Float32→Float16 conversion in Mesa is scalar even on CPUs with F16C support. This is the result of accumulated design decisions:
- A generic conversion framework designed in 2014 for flexibility, not performance
- A 2020 F16C optimization targeting bug fixing, not throughput
- An assumption that applications provide data in the correct format
- The fact that llvmpipe's JIT code already vectorizes this path, but only for shaders
For the specific use case (asynchronous HDR resource loading on CPU threads while the GPU renders), the optimal solution is to perform the conversion application-side with SIMD instructions before calling glTexSubImage2D, allowing the driver to perform a simple memcpy.