As a crucial operator in numerous scientific and engineering computing applications, the automatic optimization of General Matrix Multiplication (GEMM) with full utilization of ever-evolving hardware architectures (e.g., GPUs and RISC-V) is of paramount importance. While Large Language Models (LLMs) can generate functionally correct code for simple tasks, they have yet to produce high-performance code. The key challenge resides in deeply understanding diverse hardware architectures and crafting prompts that effectively unleash the potential of LLMs to generate high-performance code.
In this paper, we propose a novel prompt mechanism called QiMeng-GEMM, which enables LLMs to comprehend the architectural characteristics of different hardware platforms and automatically search for optimization combinations for GEMM. The key to QiMeng-GEMM is a set of informative, adaptive, and iterative meta-prompts. Building on these, a search strategy over combinations of meta-prompts iteratively generates high-performance code. Extensive experiments conducted on 4 leading LLMs, various paradigmatic hardware platforms, and representative matrix dimensions demonstrate QiMeng-GEMM's superior performance in auto-generating optimized GEMM code. Compared to vanilla prompts, our method achieves a performance improvement of up to 113x. Even when compared to human experts, our method reaches 115% of cuBLAS performance on NVIDIA GPUs and 211% of OpenBLAS performance on RISC-V CPUs. Notably, while human experts often take months to optimize GEMM, our approach reduces the development cost by over 240x.