Spectroscopy and Spectral Analysis, Volume 43, Issue 4, 1043 (2023)

Wavelength Selection Method of Near-Infrared Spectrum Based on Random Forest Feature Importance and Interval Partial Least Square Method

CHEN Rui1, WANG Xue1,2, WANG Zi-wen1, QU Hao1, MA Tie-min1, CHEN Zheng-guang1, and GAO Rui3
Author Affiliations
  • 1[in Chinese]
  • 2[in Chinese]
  • 3[in Chinese]

    When building near-infrared spectroscopy models for rapid quantitative analysis, feature wavelength selection is one of the more effective ways to improve prediction accuracy: selecting informative wavelengths reduces redundant data and makes the data set more effective. Random Forest (RF) is an ensemble algorithm that can be used to calculate the feature importance of each spectral wavelength. Here, feature importance is computed with the mean decrease accuracy (MDA) method on out-of-bag (OOB) data, and the average value of the mean square error is taken as the importance result. Feature variables are selected by setting a feature importance threshold, and the selected variables form the feature wavelength subset. However, there is no theoretical basis for setting the threshold range, so the range of feature importance thresholds has to be explored. In addition, because RF is random, invalid or even interfering variables may enter the feature wavelength subset, so the effectiveness of the selected variables cannot be guaranteed. The RF-iPLS feature wavelength selection algorithm is therefore proposed. The feature wavelength subset is divided into intervals by interval partial least squares (iPLS), which offsets both the invalid variables introduced by RF randomness and the redundant information retained by iPLS. To illustrate the rationality of the RF-iPLS algorithm, an RF-MC-iPLS algorithm is constructed using the Monte Carlo (MC) method, and a comparison feature subset is generated after 500 Monte Carlo samplings. Although the structure of RF-iPLS is similar to that of RF-MC-iPLS, its running time is 11.12% shorter. The results show that feature wavelength selection with RF-iPLS is effective and has low time complexity in the prediction model. Furthermore, to verify the algorithm's effectiveness, RF-iPLS was applied to grain protein near-infrared spectroscopy data sets, and PLSR models were established.
The RF-iPLS-based PLSR model is compared with the full-spectrum PLSR model and with PLSR models based on other wavelength selection methods. Compared with the 117 wavelength points of the full spectrum, RF-iPLS selects 12 feature wavelength points. The RMSEC of the modeling set is reduced from 2.61 to 0.64, a relative reduction of about 75.5% that the authors report as the improvement in prediction accuracy. The RMSEP of the prediction set is reduced from 2.63 to 0.69, a corresponding improvement of about 73.8%. These results show that RF-iPLS is an effective feature wavelength selection method that simplifies the near-infrared spectral quantitative analysis model and achieves efficient dimensionality reduction.

    CHEN Rui, WANG Xue, WANG Zi-wen, QU Hao, MA Tie-min, CHEN Zheng-guang, GAO Rui. Wavelength Selection Method of Near-Infrared Spectrum Based on Random Forest Feature Importance and Interval Partial Least Square Method[J]. Spectroscopy and Spectral Analysis, 2023, 43(4): 1043

    Paper Information

    Received: Feb. 7, 2022

    Accepted: --

    Published Online: May. 3, 2023

    The Author Email:

    DOI:10.3964/j.issn.1000-0593(2023)04-1043-08