
CPU Profiling: perf + FlameGraph Workflow¶

Last Updated: 2026-02-11 · Origin: profiling workflow established during the test_app optimization session

This guide provides a comprehensive, step-by-step methodology for CPU profiling C/C++ applications using perf and FlameGraph visualization. It covers tool installation for Debian-based and Bazzite Linux distributions, profiling execution, and analysis techniques.


📋 Table of Contents¶

  1. Prerequisites & Installation
  2. Build Configuration
  3. Recording Performance Data
  4. Analyzing with perf report
  5. Generating FlameGraphs
  6. Interpreting Results
  7. Real-World Example: test_app
  8. Troubleshooting

📦 Prerequisites & Installation¶

Debian-based Systems (Debian, Ubuntu, Linux Mint, etc.)¶

# Install perf (Linux performance counters)
sudo apt update
sudo apt install linux-perf  # Debian package name
# On Ubuntu the package is named differently:
# sudo apt install linux-tools-common linux-tools-$(uname -r)

# Install FlameGraph tools
cd ~/tools  # or any directory you prefer
git clone https://github.com/brendangregg/FlameGraph.git

# Add to PATH (add to ~/.bashrc for persistence)
export PATH="$HOME/tools/FlameGraph:$PATH"

Bazzite / Fedora Atomic / Immutable Systems¶

Bazzite uses rpm-ostree for system packages. For development tools, use a container (toolbox/distrobox):

# Create a development container
distrobox create --name dev-box --image fedora:latest

# Enter the container
distrobox enter dev-box

# Inside the container:
sudo dnf install perf

# Clone FlameGraph
cd ~/tools
git clone https://github.com/brendangregg/FlameGraph.git
export PATH="$HOME/tools/FlameGraph:$PATH"

Alternative (host install via rpm-ostree):

# On Bazzite host (not recommended for dev tools)
rpm-ostree install perf
sudo systemctl reboot  # Required for atomic layering

Verification¶

# Check perf is installed
perf --version

# Check FlameGraph scripts
ls ~/tools/FlameGraph/*.pl

Expected output:

perf version 6.x.x
flamegraph.pl  stackcollapse-perf.pl  (and others)

🔧 Build Configuration¶

For optimal profiling results, build with debug symbols and optimizations:

CMake Projects¶

# Option 1: RelWithDebInfo (Recommended)
cmake -B build -DCMAKE_BUILD_TYPE=RelWithDebInfo
cmake --build build -j$(nproc)

# Option 2: Profiling build type (if available)
cmake -B build-prof -DCMAKE_BUILD_TYPE=Profiling
cmake --build build-prof -j$(nproc)

Makefile Projects¶

For this project:

make build-prof  # Uses Profiling build type

Key compiler flags:

  • -O3 or -O2: Enable optimizations (realistic performance)
  • -g: Include debug symbols (function names, line numbers)
  • -fno-omit-frame-pointer: Preserve stack frames for accurate call graphs

📊 Recording Performance Data¶

Basic Recording¶

# Record with default settings (perf's default rate, typically 4000 Hz)
perf record -g ./build/tests/test_app

# Record with custom frequency (e.g., 1000 Hz for finer granularity)
perf record -F 1000 -g ./build/tests/test_app

# Record with call-graph (DWARF unwinding, recommended)
perf record -g --call-graph dwarf ./build/tests/test_app

Flags explained:

  • -g: Enable call-graph (stack) recording
  • -F <freq>: Sampling frequency in Hz (perf defaults to ~4000 Hz; 99 Hz is a common choice that avoids lockstep with timer interrupts)
  • --call-graph dwarf: Use DWARF debug info for unwinding (more accurate)

Recording Xvfb Tests (Headless)¶

For tests running in Xvfb (perf follows child processes by default, so the test binary is sampled through the xvfb-run wrapper):

perf record -g xvfb-run -a -s "-screen 0 1024x768x24" ./build/tests/test_app

Output¶

This creates a perf.data file in the current directory (~10-50 MB depending on duration).


📈 Analyzing with perf report¶

Interactive TUI¶

perf report

Navigation:

  • ↑/↓: Navigate functions
  • Enter: Expand call stack
  • a: Annotate selected function (assembly view)
  • q: Quit

Text Report¶

# Summary report (top functions by sample count)
perf report --stdio | head -50

# Detailed call graph
perf report --stdio -g --show-nr-samples | less

Sample Output¶

# Overhead  Command     Shared Object       Symbol
#  21.60%   test_app    libGL.so.1          [.] glTexImage2D
#  17.80%   test_app    libGL.so.1          [.] glTexSubImage2D
#  11.70%   test_app    test_app            [.] shader_compile
#   9.40%   test_app    test_app            [.] icosphere_generate

Interpretation:

  • Overhead: % of samples in this function
  • Symbol: Function name (if symbols available)
  • [.]: Userspace code, [k]: Kernel code

🔥 Generating FlameGraphs¶

FlameGraphs provide an intuitive visual representation of call stacks.

Step 1: Collapse Stack Samples¶

perf script | stackcollapse-perf.pl > perf-folded.txt

  • perf script: Converts perf.data to text format
  • stackcollapse-perf.pl: Aggregates identical stack traces
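The folded format is plain text: one line per unique stack, frames joined by semicolons, sample count last. A hand-written example (function names are illustrative), plus a quick awk sanity check of the total sample count, which is the 100% baseline of the eventual flamegraph:

```shell
# Hand-written folded stacks in the format stackcollapse-perf.pl emits:
# "frame1;frame2;...;leaf <count>" -- one line per unique stack.
cat > /tmp/demo-folded.txt <<'EOF'
main;render_loop;texture_load 40
main;render_loop;shader_compile 25
main;init 10
main;cleanup 5
EOF

# Total samples across all stacks:
awk '{total += $NF} END {print total}' /tmp/demo-folded.txt
# → 80
```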

Step 2: Generate SVG¶

flamegraph.pl perf-folded.txt > flamegraph.svg

One-Liner¶

perf script | stackcollapse-perf.pl | flamegraph.pl > flamegraph.svg
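The pipeline is easy to wrap in a small script. A sketch of a hypothetical profile.sh wrapper (the script name, the DRY_RUN switch, and the -F 99 choice are all assumptions, not part of the project):

```shell
# profile.sh -- hypothetical wrapper around the record + flamegraph pipeline.
# Set DRY_RUN=1 to print the commands instead of executing them.
cat > /tmp/profile.sh <<'EOF'
#!/bin/sh
set -eu
target="${1:?usage: profile.sh <binary> [output.svg]}"
out="${2:-flamegraph.svg}"

run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"          # dry run: show the command only
    else
        eval "$*"          # real run: execute it
    fi
}

run "perf record -F 99 -g --call-graph dwarf $target"
run "perf script | stackcollapse-perf.pl | flamegraph.pl > $out"
EOF
chmod +x /tmp/profile.sh

# Preview what would run, without needing perf installed:
DRY_RUN=1 /tmp/profile.sh ./build/tests/test_app
```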

Viewing¶

# Open in browser
xdg-open flamegraph.svg

# Or copy to artifact directory
cp flamegraph.svg /path/to/artifacts/

🔍 Interpreting Results¶

FlameGraph Anatomy¶

┌────────────────────────────────────────────────────────────┐
│                       main (100%)                          │  ← Entry point
├────────────┬───────────────────────────────┬───────────────┤
│  init()    │      render_loop()            │   cleanup()   │  ← Top-level functions
│   (10%)    │         (85%)                 │    (5%)       │
└────────────┴─────────┬─────────────┬───────┴───────────────┘
                       │ texture_load│ shader_compile
                       │   (40%)     │    (25%)
                       └─────────────┴─────────────────
                              ▲
                         Width = CPU time

  • X-axis (width): Proportion of total CPU time
  • Y-axis (height): Call stack depth (caller → callee)
  • Color: Random (for visual separation only)

Identifying Bottlenecks¶

  1. Wide boxes: Functions consuming significant CPU time
  2. Patterns:
     • Plateau: CPU-bound (good utilization)
     • Tower: Deep call chains (potential overhead)
     • Fragmentation: Many small calls (cache/branch issues)
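Wide boxes can also be found numerically: summing the trailing counts per leaf frame in the folded file gives each function's self time. A sketch with awk, run against a small hand-written folded file (function names are illustrative):

```shell
# Rank functions by self time (samples attributed to the leaf frame),
# straight from a folded-stacks file in "frame1;frame2;leaf <count>" format.
cat > /tmp/demo-folded.txt <<'EOF'
main;render_loop;texture_load 40
main;render_loop;shader_compile 25
main;init 10
main;cleanup 5
EOF

awk '{
    count = $NF; sub(/ [0-9]+$/, "");   # grab then strip the sample count
    n = split($0, frames, ";");         # leaf = last frame in the stack
    self[frames[n]] += count
} END {
    for (f in self) printf "%6d  %s\n", self[f], f
}' /tmp/demo-folded.txt | sort -rn
```

The widest flamegraph box (here texture_load, 40 samples) comes out on top.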

Example Analysis (test_app)¶

From the profiling session of 2026-02-11:

Total samples: 4800
Total time: 17.01 seconds

Top hotspots:
1. Texture Loading (HDR): 6.7s (39%)
   - glTexImage2D: 3.7s
   - glTexSubImage2D: 3.0s

2. Rendering & Capture: 3.8s (22%)
   - app_render: 2.5s
   - glReadPixels: 1.3s

3. Window/Context Init: 2.8s (16%)
   - glfwCreateWindow: 1.8s
   - GL initialization: 1.0s

4. Shader Compilation: 2.0s (12%)
   - shader_compile_from_file: 2.0s

Optimization targets:

  • Cache HDR texture (avoid reload)
  • Use PBO for async glReadPixels
  • Disable expensive GPU effects in tests

🎯 Real-World Example: test_app¶

Context¶

Profiling the test_app integration test to find out why it takes ~17 s to run.

Commands Used¶

# Build with profiling symbols
cmake -B build -DCMAKE_BUILD_TYPE=Profiling
cmake --build build --target test_app -j$(nproc)

# Record execution
perf record -F 1000 -g xvfb-run -a -s "-screen 0 1024x768x24" ./build/tests/test_app

# Analyze interactively
perf report

# Generate flamegraph
perf script | stackcollapse-perf.pl | flamegraph.pl > test_app_flamegraph.svg

Key Findings¶

| Bottleneck         | Time | %      | Solution Implemented           |
|--------------------|------|--------|--------------------------------|
| HDR texture upload | 6.7s | 39%    | ✅ Cache texture between tests  |
| glReadPixels sync  | 1.3s | 8%     | ✅ Use PBO async readback       |
| Post-processing    | 1.5s | 9%     | ❌ Kept for ISO production      |
| IBL generation     | 17s  | Varies | ⚠️ Outside test scope           |

Results¶

  • Before: 17.0s
  • After (with optimizations): 22.7s (IBL variation)
  • Test portion: 3.8s → ~5.2s (with full GPU effects)

Note: The increase is due to IBL generation variability, not regression in test code.


🛠️ Troubleshooting¶

Permission Errors¶

Error: Access to performance monitoring and observability operations is limited

Solution (on host, not in container):

# Temporary
sudo sysctl -w kernel.perf_event_paranoid=1

# Permanent
echo "kernel.perf_event_paranoid = 1" | sudo tee /etc/sysctl.d/99-perf.conf
sudo sysctl --system
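The paranoid levels themselves are easy to forget. A hypothetical check_paranoid.sh helper that explains the current level; it takes an optional file argument so it can be exercised against a test file instead of the real /proc entry (the script name and the wording of each level are assumptions based on the kernel sysctl documentation):

```shell
# check_paranoid.sh -- hypothetical helper explaining perf_event_paranoid.
cat > /tmp/check_paranoid.sh <<'EOF'
#!/bin/sh
file="${1:-/proc/sys/kernel/perf_event_paranoid}"
level=$(cat "$file")
case "$level" in
    -1) echo "$level: no restrictions";;
     0) echo "$level: raw tracepoint access restricted; CPU events allowed";;
     1) echo "$level: per-process profiling allowed; system-wide restricted";;
     2) echo "$level: user-space profiling of own processes only";;
     *) echo "$level: unprivileged profiling disabled (distro-specific level)";;
esac
EOF
chmod +x /tmp/check_paranoid.sh

# Exercise against a fake /proc entry:
echo 2 > /tmp/fake_paranoid
/tmp/check_paranoid.sh /tmp/fake_paranoid
```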

Missing Symbols¶

If you see hex addresses instead of function names:

Check debug symbols:

# List section headers; .debug_* sections indicate debug info is present.
objdump -h ./build/tests/test_app | grep debug
# List symbol names in the binary.
nm ./build/tests/test_app | head

Rebuild with symbols:

cmake -B build -DCMAKE_BUILD_TYPE=RelWithDebInfo

FlameGraph Scripts Not Found¶

# Ensure FlameGraph is in PATH
export PATH="$HOME/tools/FlameGraph:$PATH"

# Or use absolute paths
~/tools/FlameGraph/stackcollapse-perf.pl
~/tools/FlameGraph/flamegraph.pl

perf.data Too Large¶

Reduce sampling frequency:

perf record -F 99 -g ./app  # Default
perf record -F 49 -g ./app  # Half rate

Or limit duration (perf record has no duration flag; use timeout or a sleep workload):

timeout 5 perf record -g ./app    # stop after 5 s; perf finalizes perf.data on SIGTERM
perf record -a -g -- sleep 5      # or: sample system-wide for a fixed window


Changelog:

  • 2026-02-11: Initial comprehensive guide created based on test_app profiling session. Includes Debian + Bazzite installation, complete workflow, and real-world example with flamegraph analysis.