QiMeng-GEMM: Automatically Generating High-Performance Matrix Multiplication Code by Exploiting Large Language Models

1SKL of Processors, Institute of Computing Technology, CAS
2Intelligent Software Research Center, Institute of Software, CAS, Beijing, China
3University of Chinese Academy of Sciences, Beijing, China
4Institute of AI for Industries, CAS, China

Overview

As a crucial operator in numerous scientific and engineering computing applications, the automatic optimization of General Matrix Multiplication (GEMM) with full utilization of ever-evolving hardware architectures (e.g., GPUs and RISC-V) is of paramount importance. While Large Language Models (LLMs) can generate functionally correct code for simple tasks, they have yet to produce high-performance code. The key challenge resides in deeply understanding diverse hardware architectures and crafting prompts that effectively unleash the potential of LLMs to generate high-performance code.
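For reference, the operation being optimized is the standard GEMM update C ← αAB + βC. A naive triple-loop implementation (shown here in Python purely for illustration; the generated code targets platforms such as CUDA and RISC-V C) makes clear what the optimization techniques have to improve:

```python
def naive_gemm(A, B, C, alpha=1.0, beta=1.0):
    """Naive GEMM: C <- alpha * A @ B + beta * C.

    A is M x K, B is K x N, C is M x N (lists of lists).
    High-performance implementations tile these loops for cache
    reuse, vectorize the inner product, and reorder computation
    to keep data in registers -- exactly the kinds of techniques
    the meta-prompts describe.
    """
    M, K, N = len(A), len(B), len(B[0])
    for i in range(M):
        for j in range(N):
            acc = 0.0
            for k in range(K):
                acc += A[i][k] * B[k][j]
            C[i][j] = alpha * acc + beta * C[i][j]
    return C
```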

In this paper, we propose a novel prompt mechanism called QiMeng-GEMM, which enables LLMs to comprehend the architectural characteristics of different hardware platforms and automatically search for the optimization combinations for GEMM. The key of QiMeng-GEMM is a set of informative, adaptive, and iterative meta-prompts. Based on this, a search strategy for optimal combinations of meta-prompts is used to iteratively generate high-performance code. Extensive experiments conducted on 4 leading LLMs, various paradigmatic hardware platforms, and representative matrix dimensions unequivocally demonstrate QiMeng-GEMM's superior performance in auto-generating optimized GEMM code. Compared to vanilla prompts, our method achieves a performance enhancement of up to 113x. Even when compared to human experts, our method can reach 115% of cuBLAS on NVIDIA GPUs and 211% of OpenBLAS on RISC-V CPUs. Notably, while human experts often take months to optimize GEMM, our approach reduces the development cost by over 240x.

We define a set of meta-prompts that incorporate five common GEMM optimization techniques, and we introduce an LLM-based auto-tuning mechanism to automatically search for the optimal combination of meta-prompts. The LLMs first generate initial code, then iteratively optimize the GEMM implementation, and finally produce the optimal code for a specific platform.
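The generate-then-iterate procedure above can be sketched as a greedy search over meta-prompt combinations. Everything in this sketch is a hypothetical illustration of the idea, not the paper's actual implementation: the meta-prompt names, `generate_code` (which would prompt the LLM with the selected meta-prompts), and `benchmark` (which would compile and time the result on the target platform) are all assumed interfaces:

```python
# Hypothetical sketch of the auto-tuning loop: greedily add the
# meta-prompt whose generated code yields the largest measured
# speedup, and stop when no remaining candidate improves performance.
def autotune(meta_prompts, generate_code, benchmark):
    chosen = []                    # meta-prompts applied so far
    code = generate_code(chosen)   # initial, unoptimized GEMM code
    best = benchmark(code)         # e.g. measured GFLOPS on the target
    improved = True
    while improved:
        improved = False
        for mp in meta_prompts:
            if mp in chosen:
                continue
            candidate = generate_code(chosen + [mp])
            perf = benchmark(candidate)
            if perf > best:
                best, code = perf, candidate
                chosen = chosen + [mp]
                improved = True
    return code, chosen, best
```

In a real system the benchmark step dominates the cost, so the search order and early-stopping criterion matter; the greedy loop here is just the simplest strategy that fits the iterative description above.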

Main Results

We evaluate QiMeng-GEMM on several state-of-the-art LLMs to assess its effectiveness. We evaluate the generated GEMM code on multiple platforms, including a CPU (RISC-V XuanTie C910) and a GPU (NVIDIA RTX 4070). We test the performance of vanilla GPT-4o and Claude 3.5 Sonnet across a range of GEMM matrix dimensions.

Implemented on various LLMs, QiMeng-GEMM achieves a performance enhancement of up to 113x over vanilla LLMs. Moreover, QiMeng-GEMM outperforms handcrafted high-performance libraries such as OpenBLAS and cuBLAS.

We have extended QiMeng-GEMM to additional platforms, including RISC-V CPUs and NVIDIA GPUs, and still observe significant performance gains.

We have packaged QiMeng-GEMM as a tool that conveniently and automatically generates GEMM code for different platforms.

BibTex

      
@article{Zhou_Wen_Chen_Gao_Xiong_Li_Guo_Wu_Chen_2025,
  title={QiMeng-GEMM: Automatically Generating High-Performance Matrix Multiplication Code by Exploiting Large Language Models}, 
  volume={39}, 
  url={https://ojs.aaai.org/index.php/AAAI/article/view/34461}, 
  DOI={10.1609/aaai.v39i21.34461}, 
  number={21},
  journal={Proceedings of the AAAI Conference on Artificial Intelligence}, 
  author={Zhou, Qirui and Wen, Yuanbo and Chen, Ruizhi and Gao, Ke and Xiong, Weiqiang and Li, Ling and Guo, Qi and Wu, Yanjun and Chen, Yunji}, 
  year={2025}, 
  month={Apr.}, 
  pages={22982-22990}
}