[Optimization] Zen 4 / EPYC Macro-Blocking (P, Q, R) tuning utilizing COOPERLAKE kernels

Hi OpenBLAS team,

I have been profiling SGEMM performance on an AMD Zen 4 (EPYC 9334) server and noticed that the build system correctly falls back to the COOPERLAKE target to use the AVX-512/BF16 instructions. The micro-kernels run without problems, but the default COOPERLAKE macro-blocking parameters (P, Q, R) appear to become a bottleneck at the boundary of AMD's 32 MB per-CCX L3.

I recently ran Bayesian Optimisation to tune the macro-blocking dimensions for this hardware. Without altering any assembly, adjusting the blocking limits raised throughput from ~2.8 TFLOPS to ~3.0 TFLOPS (roughly 7%) for square matrices around N=8192. I observed similar improvements on smaller square matrices, while performance on heavily skewed shapes (skinny/wide matrices) remained largely unchanged.
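For anyone wanting to sanity-check the figures, here is a minimal sketch of the arithmetic behind them. It assumes the usual 2*N^3 operation count for a square GEMM; the helper names (`gemm_flops`, `seconds_at_tflops`, `speedup_percent`) are made up for illustration and are not part of OpenBLAS.

```c
#include <stdint.h>

/* Floating-point operations in C = A*B for square N x N matrices,
 * using the conventional count of N^3 multiplies plus N^3 adds. */
static double gemm_flops(uint64_t n) {
    return 2.0 * (double)n * (double)n * (double)n;
}

/* Wall-clock seconds per call implied by a sustained rate in TFLOPS. */
static double seconds_at_tflops(uint64_t n, double tflops) {
    return gemm_flops(n) / (tflops * 1e12);
}

/* Relative gain in percent when the rate moves from `before` to `after`. */
static double speedup_percent(double before, double after) {
    return (after / before - 1.0) * 100.0;
}
```

With N=8192 this gives about 1.1e12 operations per call, so ~2.8 TFLOPS corresponds to roughly 0.39 s per call and ~3.0 TFLOPS to roughly 0.37 s, a gain of about 7%.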

It appears that the default COOPERLAKE blocking parameters, which assume a monolithic L3, might be causing cache contention at the 32 MB L3 boundary of a Zen 4 CCX. The tuned parameters seem to eliminate that L3 thrashing.
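To make the cache-pressure reasoning concrete, here is a rough sketch of how one might estimate the packed-buffer footprint for a given (P, Q, R) against one L3 slice. It rests on several assumptions: that each thread holds a packed A block of P x Q elements and a packed B panel of Q x R elements, that all threads sharing the L3 hold such buffers at once, and that "32 MB" means 32 MiB. The function names are hypothetical, and the numbers used afterwards are illustrative only, not the defaults from `param.h` and not my tuned values.

```c
#include <stddef.h>

/* Bytes for one packed A block (P x Q) plus one packed B panel (Q x R).
 * Assumption: this approximates one thread's working set. */
static size_t packed_bytes(size_t p, size_t q, size_t r, size_t elem_size) {
    return (p * q + q * r) * elem_size;
}

/* Returns 1 if `threads` such working sets fit in an L3 of `l3_bytes`,
 * 0 otherwise. Ignores C tiles, prefetch traffic and associativity. */
static int fits_in_l3(size_t bytes_per_thread, size_t threads, size_t l3_bytes) {
    return bytes_per_thread * threads <= l3_bytes;
}
```

As an illustration with made-up values, P=512, Q=512, R=4096 in single precision is 9 MiB per thread, so 8 threads sharing a 32 MiB L3 would overflow it, whereas R=512 gives 2 MiB per thread and fits. This is only a back-of-envelope model and does not by itself confirm that L3 thrashing is the cause.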

Before I open a PR or formalise these values into `param.h`, I would like to confirm that this is the right direction, or find out whether I am missing something.

Thanks for your time and guidance!
Kilaru Vasu Deva.

[Optimization] Zen 4 / EPYC Macro-Blocking (P, Q, R) tuning utilizing COOPERLAKE kernels #5837
