Hi OpenBLAS team,
I have been profiling SGEMM performance on an AMD Zen 4 (EPYC 9334) server and noticed that the build system falls back to the COOPERLAKE target to use the AVX-512/BF16 instructions. The micro-kernels run correctly, but the default COOPERLAKE macro-blocking parameters (P, Q, R) seem to be a bottleneck given the 32 MB of L3 available to each Zen 4 CCX.
I recently used Bayesian optimisation to tune the macro-blocking dimensions for this hardware. Without altering any assembly, adjusting the blocking limits raised throughput from ~2.8 TFLOPS to ~3.0 TFLOPS for square matrices around N=8192. I observed similar improvements on smaller square matrices, while performance on heavily skewed shapes (skinny/wide matrices) remained largely unchanged.
It appears that the default COOPERLAKE blocking parameters, which assume a monolithic L3, might be causing cache contention within the 32 MB L3 of each Zen 4 CCX. The tuned parameters seem to eliminate that L3 thrashing.
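To illustrate the kind of footprint reasoning behind this, here is a rough sketch in Python. It assumes that each thread keeps a packed P x Q block of A and a packed Q x R panel of B, and that all threads on one CCX share its L3. The function names, the thread count of 8, and the P/Q/R values are hypothetical placeholders, not the OpenBLAS defaults and not the tuned values.

```python
# Rough L3 footprint estimate for GEMM macro-blocking (illustrative only).
# Assumption: each thread holds one packed P x Q block of A and one packed
# Q x R panel of B; all threads on a CCX share the same L3.

MIB = 1024 * 1024


def gemm_block_footprint(p, q, r, elem_bytes=4):
    """Return (a_block_bytes, b_panel_bytes) for single-precision by default."""
    a_block_bytes = p * q * elem_bytes
    b_panel_bytes = q * r * elem_bytes
    return a_block_bytes, b_panel_bytes


def fits_in_shared_l3(p, q, r, l3_bytes, threads_sharing_l3, elem_bytes=4):
    """True if the combined per-thread buffers fit in the shared L3."""
    a_bytes, b_bytes = gemm_block_footprint(p, q, r, elem_bytes)
    total = (a_bytes + b_bytes) * threads_sharing_l3
    return total <= l3_bytes


if __name__ == "__main__":
    l3 = 32 * MIB   # per-CCX L3 size mentioned above
    threads = 8     # hypothetical number of threads sharing that L3
    for r in (4096, 2048):  # placeholder R values, not real defaults
        ok = fits_in_shared_l3(384, 384, r, l3, threads)
        print(f"P=384 Q=384 R={r}: fits in 32 MiB L3 -> {ok}")
```

With these placeholder numbers, halving R is what brings the combined buffers under 32 MiB. This is only a capacity check and says nothing about associativity, prefetching, or how OpenBLAS actually schedules the buffers.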
Before I open a PR or formalise these values in param.h, I would like to confirm that this is the right direction, or find out whether I am missing something.
Thanks for your time and guidance!
Kilaru Vasu Deva.