Computer Engineering, Vol. 51, Issue 8, 354 (2025)

HPL-MxP Multiple Lookahead Optimization for Kunpeng Processors

GAO Ang1,2, WANG Yinshan1,2,*, YAN Wen1,2, SONG Changcheng3, WANG Long3, and YAO Erlin1,2
Author Affiliations
  • 1Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
  • 2University of Chinese Academy of Sciences, Beijing 101408, China
  • 3Huawei Technologies Co., Ltd., Hangzhou 310052, Zhejiang, China

    Citation: GAO Ang, WANG Yinshan, YAN Wen, SONG Changcheng, WANG Long, YAO Erlin. HPL-MxP Multiple Lookahead Optimization for Kunpeng Processors[J]. Computer Engineering, 2025, 51(8): 354.

    Paper Information


    Received: Nov. 3, 2023

    Accepted: Aug. 26, 2025

    Published Online: Aug. 26, 2025

    Author Email: WANG Yinshan (wangyinshan@ict.ac.cn)

    DOI: 10.19678/j.issn.1000-3428.0068758
