Computer Engineering, Vol. 51, Issue 8, 354 (2025)

HPL-MxP Multiple Lookahead Optimization for Kunpeng Processors

GAO Ang1,2, WANG Yinshan1,2,*, YAN Wen1,2, SONG Changcheng3, WANG Long3, and YAO Erlin1,2
Author Affiliations
  • 1Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
  • 2University of Chinese Academy of Sciences, Beijing 101408, China
  • 3Huawei Technologies Co., Ltd., Hangzhou 310052, Zhejiang, China

    Citation: GAO Ang, WANG Yinshan, YAN Wen, SONG Changcheng, WANG Long, YAO Erlin. HPL-MxP Multiple Lookahead Optimization for Kunpeng Processors[J]. Computer Engineering, 2025, 51(8): 354.

    Paper Information


    Received: Nov. 3, 2023

    Accepted: Aug. 26, 2025

    Published Online: Aug. 26, 2025

    Author Email: WANG Yinshan (wangyinshan@ict.ac.cn)

    DOI: 10.19678/j.issn.1000-3428.0068758
