QiMeng-GEMM: Automatically Generating High-Performance Matrix Multiplication Code by Exploiting Large Language Models

1SKL of Processors, Institute of Computing Technology, CAS
2Intelligent Software Research Center, Institute of Software, CAS, Beijing, China
3University of Chinese Academy of Sciences, Beijing, China
4Institute of AI for Industries, CAS, China

Overview

As a crucial operator in numerous scientific and engineering computing applications, the automatic optimization of General Matrix Multiplication (GEMM) with full utilization of ever-evolving hardware architectures (e.g., GPUs and RISC-V) is of paramount importance. While Large Language Models (LLMs) can generate functionally correct code for simple tasks, they have yet to produce high-performance code. The key challenge resides in deeply understanding diverse hardware architectures and crafting prompts that effectively unleash the potential of LLMs to generate high-performance code.
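For reference, the operation being optimized is the standard GEMM update C ← αAB + βC. A naive triple-loop implementation (shown here in Python purely for illustration; the generated code targets platforms such as CUDA and RISC-V C) makes clear what the optimization techniques have to improve:

```python
def naive_gemm(A, B, C, alpha=1.0, beta=1.0):
    """Naive GEMM: C <- alpha * A @ B + beta * C.

    A is M x K, B is K x N, C is M x N (lists of lists).
    High-performance implementations tile these loops for cache
    reuse, vectorize the inner product, and reorder computation
    to keep data in registers -- exactly the kinds of techniques
    the meta-prompts describe.
    """
    M, K, N = len(A), len(B), len(B[0])
    for i in range(M):
        for j in range(N):
            acc = 0.0
            for k in range(K):
                acc += A[i][k] * B[k][j]
            C[i][j] = alpha * acc + beta * C[i][j]
    return C
```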

In this paper, we propose a novel prompt mechanism called QiMeng-GEMM, which enables LLMs to comprehend the architectural characteristics of different hardware platforms and automatically search for the optimization combinations for GEMM. The key of QiMeng-GEMM is a set of informative, adaptive, and iterative meta-prompts. Based on this, a search strategy for optimal combinations of meta-prompts is used to iteratively generate high-performance code. Extensive experiments conducted on 4 leading LLMs, various paradigmatic hardware platforms, and representative matrix dimensions unequivocally demonstrate QiMeng-GEMM's superior performance in auto-generating optimized GEMM code. Compared to vanilla prompts, our method achieves a performance enhancement of up to 113x. Even when compared to human experts, our method can reach 115% of cuBLAS on NVIDIA GPUs and 211% of OpenBLAS on RISC-V CPUs. Notably, while human experts often take months to optimize GEMM, our approach reduces the development cost by over 240x.

We define a set of meta-prompts that incorporate five common GEMM optimization techniques, and we introduce an LLM-based auto-tuning mechanism to automatically search for the optimal combination of meta-prompts. The LLMs first generate initial code, then iteratively optimize the GEMM implementation, and finally produce the optimal code for a specific platform.
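The generate-then-iterate procedure above can be sketched as a greedy search over meta-prompt combinations. Everything in this sketch is a hypothetical illustration of the idea, not the paper's actual implementation: the meta-prompt names, `generate_code` (which would prompt the LLM with the selected meta-prompts), and `benchmark` (which would compile and time the result on the target platform) are all assumed interfaces:

```python
# Hypothetical sketch of the auto-tuning loop: greedily add the
# meta-prompt whose generated code yields the largest measured
# speedup, and stop when no remaining candidate improves performance.
def autotune(meta_prompts, generate_code, benchmark):
    chosen = []                    # meta-prompts applied so far
    code = generate_code(chosen)   # initial, unoptimized GEMM code
    best = benchmark(code)         # e.g. measured GFLOPS on the target
    improved = True
    while improved:
        improved = False
        for mp in meta_prompts:
            if mp in chosen:
                continue
            candidate = generate_code(chosen + [mp])
            perf = benchmark(candidate)
            if perf > best:
                best, code = perf, candidate
                chosen = chosen + [mp]
                improved = True
    return code, chosen, best
```

In a real system the benchmark step dominates the cost, so the search order and early-stopping criterion matter; the greedy loop here is just the simplest strategy that fits the iterative description above.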

Main Results

We evaluate QiMeng-GEMM on several state-of-the-art LLMs to assess its effectiveness. We evaluate the generated GEMM code on multiple platforms, including a CPU (RISC-V XuanTie C910) and a GPU (NVIDIA RTX 4070). We test the performance of vanilla GPT-4o and Claude 3.5 Sonnet across a range of GEMM matrix dimensions.

Implemented on various LLMs, QiMeng-GEMM achieves a performance enhancement of up to 113x over vanilla LLMs. Moreover, QiMeng-GEMM outperforms handcrafted high-performance libraries such as OpenBLAS and cuBLAS.

We have extended QiMeng-GEMM to additional platforms, including RISC-V CPUs and NVIDIA GPUs, and still observe significant performance gains.

We have packaged QiMeng-GEMM as a tool that conveniently and automatically generates GEMM code for different platforms.

BibTex

      
@article{Zhou_Wen_Chen_Gao_Xiong_Li_Guo_Wu_Chen_2025,
  title={QiMeng-GEMM: Automatically Generating High-Performance Matrix Multiplication Code by Exploiting Large Language Models}, 
  volume={39}, 
  url={https://ojs.aaai.org/index.php/AAAI/article/view/34461}, 
  DOI={10.1609/aaai.v39i21.34461}, 
  number={21},
  journal={Proceedings of the AAAI Conference on Artificial Intelligence}, 
  author={Zhou, Qirui and Wen, Yuanbo and Chen, Ruizhi and Gao, Ke and Xiong, Weiqiang and Li, Ling and Guo, Qi and Wu, Yanjun and Chen, Yunji}, 
  year={2025}, 
  month={Apr.}, 
  pages={22982-22990}
}