1 Introduction
In high-dimensional data mining, one of the most common practical problems is to classify a target based on the features of the data. Many scholars have worked on classification methods to improve their accuracy; however, most traditional methods cannot meet practical needs because of restrictive assumptions and the diversity and complexity of data structures [1,2]. The rise of machine learning has opened a new chapter in the development of data mining technology. The support vector machine (SVM), a classification method built on machine learning theory, rests on the Vapnik-Chervonenkis (VC) dimension and the structural risk minimization principle of statistical learning theory. It is a supervised learning method, used mainly for classification and now also often used in regression analysis [3,4]. Many studies have shown that it is advantageous for small-sample, non-linear, and high-dimensional problems [4]. It therefore helps to alleviate the “curse of dimensionality” and overfitting problems of traditional classification methods.
In the standard binary classification problem, the linear SVM [5,6] is used to find the maximum-margin hyperplane $f\left( {{{\mathbf{x}}_i}} \right) = {\beta _0} + {\mathbf{x}}_i^{\text{T}} {\Cambriabfont\text{β}}$ that separates the categories 1 and –1. Let ${{\mathbf{x}}_i} \in {\mathbb{R}^p}$ (p denotes the number of independent variables in the model) be the predictor and ${y_i} \in \left\{ { - 1{\rm{,}}{\text{ }}1} \right\}{\text{ }}\left( {i = 1{\rm{,}}{\text{ }}2{\rm{,}}{\text{ }} \cdots {\rm{,}}{\text{ }}n} \right)$ be the class label. It is well known that the standard SVM can also be written in the form of a hinge loss with l2-penalized regularization:
$ \mathop {{\rm{min}} }\limits_{{\beta _0}{\rm{,}}{\Cambriabfont\text{β}}} \left( {1/n} \right)\sum\limits_{i = 1}^n {{{\left[ {1 - {y_i}\left( {{\beta _0} + {\mathbf{x}}_i^{\text{T}}{\Cambriabfont\text{β}}} \right)} \right]}_ + }} + \left( {\lambda /2} \right)\left\| {\Cambriabfont\text{β}} \right\|_2^2 $ ()
where β is the vector of coefficients of the variables, β0 is the intercept term, λ is a regularization parameter controlling the shrinkage of the coefficients, the loss $l\left( t \right) = {\left[ {1 - t} \right]_ + }$ is called the hinge loss, and $ {\Vert \cdot \Vert }_{2} $ is the l2-norm.
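As an illustration, the following minimal Python sketch evaluates this l2-penalized hinge-loss objective for a given coefficient vector; the function and variable names are ours and are not part of the original formulation.

```python
import numpy as np

def hinge_l2_objective(beta0, beta, X, y, lam):
    """Hinge loss with an l2 penalty: (1/n) * sum([1 - y_i f(x_i)]_+) + (lam/2) * ||beta||_2^2."""
    margins = y * (beta0 + X @ beta)           # y_i * f(x_i) for every sample
    hinge = np.maximum(0.0, 1.0 - margins)     # [1 - t]_+ applied elementwise
    return hinge.mean() + 0.5 * lam * np.sum(beta ** 2)

# toy usage: 5 samples, 3 features, labels in {-1, +1}
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))
y = np.array([1, -1, 1, 1, -1])
print(hinge_l2_objective(0.0, np.zeros(3), X, y, lam=0.1))  # all-zero coefficients give loss 1.0
```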
Since l2-penalized SVMs use all variables for classification, they face a huge challenge in high-dimensional classification [7]. For this reason, Zhu et al. [7] and Wegkamp and Yuan [8] considered l1-penalized SVMs. Although the l1 penalty also induces sparsity, the number of selected variables cannot exceed the sample size [9]. Thus, when the dimensionality of the variables is high, the l1-penalized model may miss many useful variables, i.e., it does not possess the Oracle consistency property in variable selection. Related studies have also shown that consistent variable selection of l1-penalized models relies on a stringent “irrepresentable condition” on the design matrix, which is easily violated in practice [10,11]. Furthermore, the regularization parameter that makes l1-penalized SVMs consistent in variable selection is not optimal for prediction accuracy [12,13]. To overcome the drawbacks of the l1 penalty, Fan and Li [9] and Zhang and Huang [14] proposed non-convex penalty functions such as the smoothly clipped absolute deviation (SCAD) and the minimax concave penalty (MCP), respectively. For non-convex penalties, Kim et al. [15] and Zhang [16] studied the Oracle property of high-dimensional SCAD- and MCP-penalized least squares regression, respectively. However, since the hinge loss in SVMs is not smooth, it complicates the model computation, and the Karush-Kuhn-Tucker local optimality condition is usually insufficient to characterize a non-smooth and non-convex penalized model [17]. Therefore, many scholars considered replacing the hinge loss with other losses. For instance, Mangasarian [18] used the squared hinge loss in SVMs. Rosset and Zhu [19] suggested the huberized hinge loss and showed that, among penalized models with the logistic regression loss, the huberized loss, and the squared hinge loss, outliers have the least effect on the huberized loss model. Since the huberized loss function and the hinge loss function share a similar shape and classification performance, we replace the hinge loss with the huberized loss and develop a non-convex penalized huberized-SVM model.
In this paper, the generalized coordinate descent (GCD) algorithm is used to solve the non-convex penalized huberized-SVM model, and the model is proved to satisfy the local Oracle property in variable selection. Simulation results demonstrate that the proposed method achieves better variable selection and higher accuracy. It is also demonstrated that the proposed model can effectively predict the financial distress of listed companies.
2 Non-convex penalized huberized-SVM
2.1 Model
In a standard binary classification problem, n pairs of training data, $\left\{ {{{\mathbf{x}}_i}{\rm{,}}{\text{ }}{y_i}} \right\}{\text{ }}\left( {i = 1{\rm{,}}{\text{ }}2{\rm{,}}{\text{ }} \cdots {\rm{,}}{\text{ }}n} \right)$, are given. The model framework is designed as
$ \mathop {{\rm{min}} }\limits_{{\beta _0}{\rm{,}}{\Cambriabfont\text{β}}} \left( {1/n} \right)\sum\limits_{i = 1}^n {{l_\delta }\left( {{y_i}\left( {{\beta _0} + {\mathbf{x}}_i^{\text{T}}{\Cambriabfont\text{β}}} \right)} \right)} + \sum\limits_{j = 1}^p {{P_\lambda }\left( {\left| {{\beta _j}} \right|} \right)} $ (1)
where ${\beta _j}$ is the jth coefficient of ${\Cambriabfont\text{β}}$, $\left| {{\beta _j}} \right|$ denotes the absolute value of ${\beta _j}$, and lδ(·) is the huberized hinge loss:
$ {l_\delta }(t) = \left\{ \begin{array}{ll} 0{\rm{,}} & t > 1 \\ {{{\left( {1 - t} \right)}^2}}/{{\left( {2\delta } \right)}}{\rm{,}} & 1 - \delta < t \leq 1 \\ 1 - t - \delta /2{\rm{,}} & t \leq 1 - \delta \end{array} \right. $ ()
where $\delta > 0$; usually $\delta {\text{ = }}2$ is used, as suggested by Yang and Zou [20,21].
As shown in Fig. 1, the shape of the huberized hinge loss is very similar to that of the hinge loss used in standard SVM. Especially when δ is very small, they almost coincide. Then, it can be easily proved that

Figure 1. Comparison of the hinge loss with the huberized hinge loss.
$ \mathop {\lim }\limits_{\delta \to {0^ + }} {l_\delta }(t) = {[1 - t]_ + } = \left\{ \begin{array}{ll} 1 - t{\rm{,}} & t \leq 1 \\ 0{\rm{,}} & t > 1 \end{array} \right. $ ()
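To make the comparison in Fig. 1 concrete, the following short Python sketch implements the huberized hinge loss and the hinge loss described above (the function names are ours); evaluating the two for a small δ reproduces the near-coincidence noted in the text.

```python
import numpy as np

def hinge_loss(t):
    """Standard hinge loss [1 - t]_+."""
    return np.maximum(0.0, 1.0 - np.asarray(t, dtype=float))

def huberized_hinge_loss(t, delta=2.0):
    """Huberized hinge loss: zero above 1, quadratic on (1 - delta, 1], linear below 1 - delta."""
    t = np.asarray(t, dtype=float)
    loss = np.zeros_like(t)
    quad = (t > 1.0 - delta) & (t <= 1.0)
    lin = t <= 1.0 - delta
    loss[quad] = (1.0 - t[quad]) ** 2 / (2.0 * delta)
    loss[lin] = 1.0 - t[lin] - delta / 2.0
    return loss

t = np.linspace(-3, 3, 7)
print(hinge_loss(t))
print(huberized_hinge_loss(t, delta=0.01))  # nearly identical to the hinge loss for small delta
```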
In (1), $ {P_\lambda }( \cdot ) $ is the non-convex penalty. For example, its specific form of SCAD [9] is
$ {P_\lambda }(x) = \lambda \int_0^{|x|} {{\rm{min}} \left\{ {1{\rm{,}}{\text{ }}{{\left[ {a - t/\lambda } \right]}_ + }/\left( {a - 1} \right)} \right\}} {\text{d}}t{\text{, }}a > 2{\mathrm{.}} $ ()
And its specific form of MCP [16] is
$ {P_\lambda }(x) = \lambda \int_0^{|x|} {{{[1 - t/(a\lambda )]}_ + }} {\text{d}}t{\text{, }}a > 1{\mathrm{.}} $ ()
In both penalties, a is a pre-specified tuning parameter. Fang et al. [22] demonstrated that the choice of a has little effect on the model results. Thus, for convenience, following the suggestions of Fan and Li [9] and Zhang [16], we fix a = 3.7 in SCAD and a = 2 in MCP in this paper.
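For reference, a minimal Python sketch of the SCAD and MCP penalty derivatives $P'_\lambda(\cdot)$ is given below; these derivatives are what the local linear approximation in Section 2.2 actually uses. The closed forms follow from differentiating the integral definitions above, and the function names are ours.

```python
import numpy as np

def scad_derivative(x, lam, a=3.7):
    """Derivative of the SCAD penalty at |x|: lam * min{1, [a - |x|/lam]_+ / (a - 1)}."""
    x = np.abs(np.asarray(x, dtype=float))
    return lam * np.minimum(1.0, np.maximum(0.0, a - x / lam) / (a - 1.0))

def mcp_derivative(x, lam, a=2.0):
    """Derivative of the MCP penalty at |x|: lam * [1 - |x|/(a*lam)]_+."""
    x = np.abs(np.asarray(x, dtype=float))
    return lam * np.maximum(0.0, 1.0 - x / (a * lam))

beta = np.array([0.0, 0.1, 0.5, 2.0, 5.0])
print(scad_derivative(beta, lam=0.5))  # stays at lam near zero, decays to 0 for large |beta|
print(mcp_derivative(beta, lam=0.5))
```

Both derivatives equal λ near the origin (so small coefficients are penalized like the l1 penalty) and vanish for large coefficients, which is what reduces the estimation bias of the non-convex penalties relative to LASSO.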
2.2 Algorithm
The GCD algorithm is used to solve the model (1). Without loss of generality, suppose that the values of x, ${x_{i{\text{,}}j}}$, have been standardized so that $(1/n)\displaystyle\sum\limits_{i = 1}^n {{x_{i{\text{,}}j}} = 0} $ and $(1/n)\displaystyle\sum\limits_{i = 1}^n {x_{i{\text{,}}j}^2 = 1} $ for $j = 1{\text{, }}2{\text{, }} \cdots {\text{, }}p$. Defining the current margin as ${r_i} = {y_i}({\tilde \beta _0} + {\mathbf{x}}_i^{\text{T}}\tilde {\Cambriabfont\text{β}})$, the coordinate descent step for ${\beta _j}$ minimizes the following function:
$ F\left( {{\beta _j}|{{\tilde \beta }_0}{\text{, }}\tilde {\Cambriabfont\text{β}}} \right) = \left( {1/n} \right)\sum\limits_{i = 1}^n {{l_\delta }\left\{ {{r_i} + {y_i}{x_{i{\text{,}}j}}\left( {{\beta _j} - {{\tilde \beta }_j}} \right)} \right\}} + {P_\lambda }\left( {\left| {{\beta _j}} \right|} \right) $ (2)
where $ {\tilde {\beta} _0} $, $ \tilde {\Cambriabfont\text{β}} $, and $ {\tilde \beta _j} $ denote the current values of ${\beta _0}$, β, and ${\beta _j}$ in the calculation process, respectively.
For the non-convex penalty $ {P_\lambda }\left( {|{\beta _j}|} \right) $, the local linear approximation is applied:
$ {P_{\lambda}}\left( {|{\beta _j}|} \right) \approx {P_{\lambda}}\left( {|{{\tilde \beta }_j}|} \right) + {P'_{\lambda}}\left( {|{{\tilde \beta }_j}|} \right)\left( {|{\beta _j}| - |{{\tilde \beta }_j}|} \right){\text{, }}|{\beta _j}| \approx |{\tilde \beta _j}| {\mathrm{.}} $ (3)
Since the huberized hinge loss lδ(·) satisfies the quadratic majorization condition [20], it can be proved that $ F({\beta _j}|{\tilde \beta _0}{\rm{,}}\;\tilde {\Cambriabfont\text{β}}) < \hat F({\beta _j}|{\tilde \beta _0}{\rm{,}}\;\tilde {\Cambriabfont\text{β}}) $, where
$ \hat F\left( {{\beta _j}|{{\tilde \beta }_0}{\rm{,}}{\text{ }}\tilde {\Cambriabfont\text{β}}} \right) = \left( {1/n} \right)\sum\limits_{i = 1}^n {{l_\delta }\left( {{r_i}} \right)} + \left( {1/n} \right)\sum\limits_{i = 1}^n {{l_\delta' }\left( {{r_i}} \right)} {y_i}{x_{i{\text{,}}j}}\left( {{\beta _j} - {{\tilde \beta }_j}} \right) + \left( {1/\delta } \right){\left( {{\beta _j} - {{\tilde \beta }_j}} \right)^2} {\mathrm{.}} $ (4)
Since $ \hat F( \cdot ) $ is a quadratic majorization of the objective function $ F( \cdot ) $, the minimizer of (2) can be approximated by the minimizer of (4). Minimizing (4), with the penalty replaced by its linear approximation (3), yields the iterative update (5):
$ \hat \beta _j^{{\text{new}}} = \mathop {{\rm{arg}}\,{\rm{min}} }\limits_{{\beta _j}} \hat F\left( {{\beta _j}|{{\tilde \beta }_0}{\rm{,}}{\text{ }}\tilde {\Cambriabfont\text{β}}} \right) = \left( {\delta /2} \right)S\left( {z{\rm{,}}{\text{ }}{{P'}_{\lambda}}\left( {\left| {{{\tilde \beta }_j}} \right|} \right)} \right) $ (5)
where
$ S\left(z\rm{,}\text{ }t\right)=\left[\left|z\right|-t\right]_+\text{sign}\left(z\right) $ ()
$ z = \left( {2/\delta } \right){\tilde \beta _j} - \left( {1/n} \right)\sum\limits_{i = 1}^n {{l_\delta' }({r_i})} {y_i}{x_{i{\text{,}}j}}{\mathrm{.}} $ ()
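As a brief derivation sketch (assuming, as in (3), that the linearized penalty $P'_{\lambda}(|\tilde\beta_j|)\,|\beta_j|$ is added to (4)), the update (5) follows from the first-order condition:

```latex
% Minimize \hat F(\beta_j|\tilde\beta_0,\tilde{\beta}) + P'_\lambda(|\tilde\beta_j|)\,|\beta_j| over \beta_j.
\begin{align*}
0 &= \frac{1}{n}\sum_{i=1}^{n} l'_\delta(r_i)\,y_i x_{i,j}
     + \frac{2}{\delta}\bigl(\beta_j - \tilde\beta_j\bigr)
     + P'_\lambda\bigl(|\tilde\beta_j|\bigr)\operatorname{sign}(\beta_j)
     && \text{(first-order condition, } \beta_j \neq 0\text{)}\\
\frac{2}{\delta}\,\beta_j &= z - P'_\lambda\bigl(|\tilde\beta_j|\bigr)\operatorname{sign}(\beta_j)
     && \text{with } z = \frac{2}{\delta}\tilde\beta_j - \frac{1}{n}\sum_{i=1}^{n} l'_\delta(r_i)\,y_i x_{i,j}\\
\hat\beta_j^{\text{new}} &= \frac{\delta}{2}\,S\!\left(z,\ P'_\lambda\bigl(|\tilde\beta_j|\bigr)\right)
     && \text{(combining the sign cases and the case } \beta_j = 0\text{)}
\end{align*}
```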
Similar to (5), it is possible to minimize
$ \hat F\left( {{\beta _0}|{{\tilde \beta }_0}{\rm{,}}\;\tilde {\Cambriabfont\text{β}}} \right) = \left( {1/n} \right)\sum\limits_{i = 1}^n {{l_\delta }\left( {{r_i}} \right)} + \left( {1/n} \right)\sum\limits_{i = 1}^n {{l_\delta' }\left( {{r_i}} \right)} {y_i}\left( {{\beta _0} - {{\tilde \beta }_0}} \right) + \left( {1/\delta } \right){\left( {{\beta _0} - {{\tilde \beta }_0}} \right)^2} $ (6)
to obtain the iterative update (7) for the intercept term β0:
$ \hat \beta _0^{{\text{new}}} = \mathop {{\rm{arg}}\,{\rm{min}} }\limits_{{\beta _0}} \hat F\left( {{\beta _0}|{{\tilde \beta }_0}{\rm{,}}\;\tilde {\Cambriabfont\text{β}}} \right) = {\tilde \beta _0} - \left( {\delta /\left( {2n} \right)} \right)\sum\limits_{i = 1}^n {{l_\delta' }\left( {{r_i}} \right){y_i}} {\mathrm{.}} $ (7)
The resulting algorithm for solving the model (1) is shown in Table 1, and the flowchart is shown in Fig. 2. It is worth noting that the tolerance ε is set to $10^{-6}$ by default.

Table 1. Description of the algorithm for solving the model (1).
Algorithm:
1. Initialize $ {\tilde \beta _0} $ and $ \tilde {\Cambriabfont\text{β}} $.
2. Iterate 1) and 2):
 1) Cyclic coordinate descent: for j = 1, 2, ···, p,
  i) Calculate $ {r_i} = {y_i}({\tilde \beta _0} + {\mathbf{x}}_i^{\text{T}}\tilde {\Cambriabfont\text{β}}) $;
  ii) Calculate $ \hat \beta _j^{{\text{new}}} = \left( {\delta /2} \right)S\left( {z{\rm{,}}{\text{ }}{P_{\lambda}'}\left( {\left| {{{\tilde \beta }_j}} \right|} \right)} \right) $;
  iii) Let $ {\tilde \beta _j} = \hat \beta _j^{{\text{new}}} $.
 2) Update the intercept term:
  i) Calculate $ {r_i} = {y_i}\left( {{{\tilde \beta }_0} + {\mathbf{x}}_i^{\text{T}}\tilde {\Cambriabfont\text{β}}} \right) $;
  ii) Calculate $ \hat \beta _0^{{\text{new}}} = {\tilde \beta _0} - \left( {\delta /\left( {2n} \right)} \right)\sum\limits_{i = 1}^n {{l_\delta' }\left( {{r_i}} \right){y_i}} $;
  iii) Let $ {\tilde \beta _0} = \hat \beta _0^{{\text{new}}} $.
3. If the largest coefficient change in the current cycle, $\mathop {{\rm{max}} }\limits_{0 \leq j \leq p} \left| {\hat \beta _j^{{\text{new}}} - {{\tilde \beta }_j}} \right|$, is less than ε, stop the iteration; otherwise, return to step 2.

Figure 2. Flowchart of the algorithm.
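As an illustration of Table 1, the following Python sketch implements the coordinate-wise updates (5) and (7) for one specific configuration, namely the huberized hinge loss with δ = 2 and the MCP penalty; the function names and the convergence bookkeeping are our own choices rather than a reference implementation.

```python
import numpy as np

def huber_hinge_grad(t, delta=2.0):
    """Derivative l'_delta(t) of the huberized hinge loss."""
    t = np.asarray(t, dtype=float)
    g = np.zeros_like(t)
    g[t <= 1.0 - delta] = -1.0
    mid = (t > 1.0 - delta) & (t <= 1.0)
    g[mid] = -(1.0 - t[mid]) / delta
    return g

def mcp_derivative(x, lam, a=2.0):
    """MCP derivative lam * [1 - |x|/(a*lam)]_+ used in the local linear approximation."""
    return lam * np.maximum(0.0, 1.0 - np.abs(x) / (a * lam))

def soft_threshold(z, t):
    """S(z, t) = [|z| - t]_+ * sign(z)."""
    return np.sign(z) * np.maximum(0.0, np.abs(z) - t)

def gcd_huberized_svm(X, y, lam, delta=2.0, a=2.0, eps=1e-6, max_iter=500):
    """GCD sketch for the MCP-penalized huberized-SVM (Table 1).

    X is assumed column-standardized (mean 0, mean square 1); y takes values in {-1, +1}.
    """
    n, p = X.shape
    beta0, beta = 0.0, np.zeros(p)
    for _ in range(max_iter):
        coef_old = np.append(beta0, beta)
        # 1) cyclic coordinate descent over j = 1, ..., p
        for j in range(p):
            r = y * (beta0 + X @ beta)                      # current margins r_i
            z = (2.0 / delta) * beta[j] - np.mean(huber_hinge_grad(r, delta) * y * X[:, j])
            beta[j] = (delta / 2.0) * soft_threshold(z, mcp_derivative(beta[j], lam, a))
        # 2) update the intercept term
        r = y * (beta0 + X @ beta)
        beta0 = beta0 - (delta / (2.0 * n)) * np.sum(huber_hinge_grad(r, delta) * y)
        # 3) stop when the largest coefficient change over the cycle is below eps
        if np.max(np.abs(np.append(beta0, beta) - coef_old)) < eps:
            break
    return beta0, beta
```

In practice, λ would be chosen by cross-validation, as is done in Section 3.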
2.3 Asymptotic property
In this subsection, the local Oracle property of the model (1) is discussed. Since β0 does not affect variable selection, we set β0 = 0.
Let ${{\Cambriabfont\text{β}}^{\text{*}}}{\text{ = }}\left( {\beta _1^{\text{*}}{\rm{,}}{\text{ }}\beta _2^{\text{*}}{\rm{,}}{\text{ }} \cdots {\rm{,}}{\text{ }}\beta _p^{\text{*}}} \right)$ denote the true value of the model parameters:
$ {{\Cambriabfont\text{β}}^*} = \mathop {{\rm{arg}}\,{\rm{min}} }\limits_{\Cambriabfont\text{β}} L\left( {\Cambriabfont\text{β}} \right) = \mathop {{\rm{arg}}\,{\rm{min}} }\limits_{\Cambriabfont\text{β}} E\left( {{l_\delta }\left( {y{{\mathbf{x}}^{\text{T}}}{\Cambriabfont\text{β}}} \right)} \right) {\mathrm{,}} $ ()
where $ L\left( {\Cambriabfont\text{β}} \right) = E\left( {{l_\delta }\left( {y{{\mathbf{x}}^{\text{T}}}{\Cambriabfont\text{β}}} \right)} \right) $ is the true loss and E(·) denotes the mathematical expectation of a random variable.
Let pn be the number of covariates. Let $A = \left\{ {\left. j \right|\beta _j^* \ne 0{\text{, }}0 \leq j \leq {p_n}} \right\}$ denote the set of subscripts of non-zero coefficients, $ {{\Cambriabfont\text{β}}_A} $ is a vector composed of all non-zero variable coefficients, and $ \boldsymbol{\mathrm{\mathbf{x}}}_A $ is a vector composed of independent variables with non-zero coefficients. Its ith observation vector is denoted by $ \boldsymbol{\mathrm{\mathbf{x}}}_{i\mathrm{,}A} $. ${A^c}$ is the complement set of A, and ${{\Cambriabfont\text{β}}_{{A^c}}}$ is the vector formed by removing the components of the elements of the vector β at the corresponding positions in A.
Thus, the Oracle estimator $ \hat {\Cambriabfont\text{β}} $ of β is defined as
$ \hat {\Cambriabfont\text{β}} = \mathop {{\rm{arg}}\,{\rm{min}} }\limits_{\Cambriabfont\text{β}} {L_n}({\Cambriabfont\text{β}}) = \mathop {{\rm{arg}}\,{\rm{min}} }\limits_{\Cambriabfont\text{β}} \left\{ {\left( {1/n} \right)\sum\limits_{i = 1}^n {{l_\delta }\left( {{y_i}{\mathbf{x}}_i^{\text{T}}{\Cambriabfont\text{β}}} \right)} } \right\} = \mathop {{\rm{arg}}\,{\rm{min}} }\limits_{{{\Cambriabfont\text{β}}_{{A^c}}} = {\mathbf{0}}} \left\{ {\left( {1/n} \right)\sum\limits_{i = 1}^n {{l_\delta }\left( {{y_i}{\mathbf{x}}_{i{\mathrm{,}}A}^{\text{T}}{{\Cambriabfont\text{β}}_A}} \right)} } \right\} $ ()
where $ {\hat \beta _j} = 0{\text{, }}j \in {A^c} $ and Ln(·) denotes the sample loss.
The above definition of the Oracle estimator indicates that the estimate obtained by minimizing the empirical loss is the same as that obtained by minimizing the empirical loss over only the covariates of interest. The Oracle estimator was proposed initially by Fan and Li [9]; it requires that the variables selected by the model coincide with the true set of relevant variables, so whether the model’s estimator is the Oracle estimator is usually used as an important criterion for judging the quality of the model’s variable selection.
To obtain the Oracle properties of the estimator, the following regularity conditions are given. Detailed proofs of the theorems are provided in the Appendix.
Condition 1: The loss function l(·) has a first-order continuous derivative, and $\exists {M_1}{\text{, }}{M_2} > 0$, so that
$ \left|l'(t)\right|\le M_1\left(\left|t\right|+1\right)\rm{,}\text{ }\partial l'(t)\le M_2\rm{,}\text{ }\forall t. $ ()
Condition 2: When $n \to \infty $, we have ${q_n} = O\left( {{n^{{c_1}}}} \right){\rm{,}}{\text{ }}0 \leq {c_1} \leq 1/2$, where qn is the cardinality of the set A, and O(·) denotes a quantity of the same order.
Condition 3: The Hessian matrix ${\mathbf{H}}\left( {{{\Cambriabfont\text{β}}_A}} \right){\text{ = }}E\left( {{\nabla ^2}{l_\delta }\left( {y{\mathbf{x}}_A^{\text{T}}{{\Cambriabfont\text{β}}_A}} \right)} \right)$ satisfies the condition $0 < {M_3} < {\lambda _{{\rm{min}} }}\left\{ {{\mathbf{H}}\left( {{{\Cambriabfont\text{β}}_A}} \right)} \right\} \leq {\lambda _{\rm{max}}}\left\{ {{\mathbf{H}}\left( {{{\Cambriabfont\text{β}}_A}} \right)} \right\} < {M_4} < \infty $, where xA is the matrix formed by removing the columns corresponding to the elements in Ac from the observation matrix x, and λmin and λmax denote the minimum and maximum eigenvalues of the matrix, respectively.
Condition 4: $\exists {M_5} > 0$, so that ${\lambda _{{\rm{max}} }}({n^{ - 1}}{\mathbf{x}}_A^{\text{T}}{{\mathbf{x}}_A}) \leq {M_5}$ and the elements, $ x_{i{\mathrm{,}}j}\text{ }\left(1\le i\le n\rm{,}\text{ }j\in A^c\right) $, are the sub-Gaussian random variables.
Condition 5: $\exists {c_2}{\rm{,}}\;{M_6} > 0$, so that $1 - {c_1} \leq {c_2} \leq 1$, and ${n^{{{\left( {1 - {c_2}} \right)} \mathord{\left/ {\vphantom {{\left( {1 - {c_2}} \right)} 2}} \right. } 2}}}\mathop {{\rm{min}} }\limits_{j \in A} \left| {\beta _j^*} \right| \geq {M_6}$.
Condition 1 requires the loss function to be smooth with a derivative that grows at most linearly, which is satisfied by some common SVM loss functions, such as the huberized hinge loss function and the squared hinge loss function. Condition 2 implies that the divergence rate of the number of non-zero coefficients is not larger than n1/2. Under Condition 3, the Hessian matrix of the loss function is positive definite and its eigenvalues are uniformly bounded. The condition on the maximum eigenvalue of the design matrix assumed in Condition 4 is similar to those in Ref. [15] and Ref. [18]. Condition 5 requires that the signals of the relevant variables do not decay too fast.
Theorem 1. If Conditions 1–5 are satisfied, then the Oracle estimate of the model (1) satisfies
$ {\left\| {{{ \hat {\Cambriabfont\text{β}}}_A} - {\Cambriabfont\text{β}}_A^*} \right\|_2} = {O_p}\left( {\sqrt {{q_n}/n} } \right) $ ()
where Op(·) denotes boundedness in probability at the indicated rate.
By Theorem 1, it can be seen that the Oracle estimator of the model (1) satisfies consistency as long as ${q_n} = O\left( {{n^{{c_1}}}} \right){\rm{,}}{\text{ }}0 \leq {c_1} \leq {1 \mathord{\left/ {\vphantom {1 2}} \right. } 2}$.
Lemma 1. If Conditions 1–5 are satisfied, and ${\rm{log}} \left( {{p_n}} \right){q_n}{\rm{log}} \left( n \right){n^{ - 1/2}} = O\left( {{\lambda _n}} \right)$, then the probability
$ \rm{Pr} \left\{ {\mathop {{\rm{max}} }\limits_{j \in {A^c}} \left| {\left( {1/n} \right)\sum\limits_{i = 1}^n {{y_i}{x_{i{\mathrm{,}}j}}{l_\delta' }\left( {{y_i}{\mathbf{x}}_{i{\mathrm{,}}A}^{\text{T}}{\Cambriabfont\text{β}}_A^*} \right)} } \right| > {\lambda _n}/2} \right\} \to 0{\rm{,}}{\text{ }}n \to \infty . $ ()
Lemma 2. Suppose Conditions 1–5 hold, ${\rm{log}} \left( {{p_n}} \right){q_n}{\rm{log}} \left( n \right){n^{ - 1/2}} = O\left( {{\lambda _n}} \right)$, and ${\lambda _n} = O\left( {{n^{ - (1 - {c_2})/2}}} \right)$. For j = 1, 2, ···, p, let ${s_j}\left( { \hat {\Cambriabfont\text{β}}} \right) = {\partial {L_n}( \hat {\Cambriabfont\text{β}})}/{\partial {\beta _j}}$, where sj(·) denotes the subgradient.
Then, for the Oracle estimator $ \hat {\Cambriabfont\text{β}}$ and the corresponding ${s_j}( \hat {\Cambriabfont\text{β}})$, there is a probability of convergence to 1, satisfying
$ {s_j}\left( { \hat {\Cambriabfont\text{β}}} \right) = 0{\text{, }}\left| {{{\hat \beta }_j}} \right| \geq \left( {a + 1/2} \right){\lambda _n}{\text{, }}j \in A{\rm{;}} $ ()
$ \left| {{s_j}\left( { \hat {\Cambriabfont\text{β}}} \right)} \right| \leq {\lambda _n}{\text{, }}\left| {{{\hat \beta }_j}} \right| = 0{\text{, }}j \in {A^c}{\rm{.}} $ ()
Theorem 2. If Conditions 1–5 are satisfied, let Bn(λn) denote the set of local minimum points of the objective function:
$ {Q_n}\left( {\Cambriabfont\text{β}} \right) = {L_n}\left( {\Cambriabfont\text{β}} \right) + \sum\limits_{j = 1}^p {{P_{{\lambda _n}}}\left( {\left| {{\beta _j}} \right|} \right)} {\rm{.}} $ ()
Then, the Oracle estimator $ \hat {\Cambriabfont\text{β}} = \left( { \hat {\Cambriabfont\text{β}}_A^{\text{T}}{\text{, }}{{\mathbf{0}}^{\text{T}}}} \right) $ of the model (1) satisfies
$ \rm{Pr} \left\{ { \hat {\Cambriabfont\text{β}} \in {B_n}\left( {{\lambda _n}} \right)} \right\} \to 1{\text{, }}n \to \infty $ ()
where ${\rm{log}} \left( {{p_n}} \right){q_n}{\rm{log}} \left( n \right){n^{ - 1/2}} = O\left( {{\lambda _n}} \right)$ and ${\lambda _n} = O\left( {{n^{ - (1 - {c_2})/2}}} \right)$.
According to Theorem 2, the model (1) possesses the local Oracle property when ${\lambda _n} = O\left( {{n^{ - 1/2 + \tau }}} \right)$ with $ {c_1} < \tau < {c_2}/2 $, even if $p = O\left( {\exp \left( {{n^{\left( {\tau - {c_1}} \right)/2}}} \right)} \right)$. Therefore, the local Oracle property of the proposed model still holds even if the number of covariates grows exponentially with the sample size.
3 Experimental analysis
To validate the performance of the proposed non-convex penalized huberized-SVM, experiments were designed for two cases, in which the variables are uncorrelated and correlated, respectively. MCP and SCAD are two of the most commonly used non-convex penalties; here, the MCP huberized-SVM (MCP-HSVM) and the SCAD huberized-SVM (SCAD-HSVM) were selected to represent the non-convex penalized huberized-SVM. The results of MCP-HSVM and SCAD-HSVM were compared with those of the conventional SVM and the classical least absolute shrinkage and selection operator (LASSO)-SVM. The tuning parameters of all the methods in this paper were selected by 5-fold cross-validation.
3.1 Evaluation metrics
To evaluate the prediction performance of the various methods, the out-of-sample prediction accuracy (ACC), true positive rate (TPR), false positive rate (FPR), and area under the receiver operating characteristic (ROC) curve (AUC) were used. The higher the values of ACC, TPR, and AUC, the better the prediction performance of the model; the lower the value of FPR, the better. To measure the performance of variable selection, three indicators were used: the number of selected significant variables (NV), the false negative rate (FNR), and the false discovery rate (FDR). The closer NV is to the actual number of significant variables, the better the variable selection. FNR is the rate at which significant variables are omitted (Type II error rate), and FDR is the rate at which insignificant variables are identified as significant (Type I error rate); the lower these two indicators, the better the variable selection.
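For concreteness, a minimal Python sketch of these indicators is given below; it assumes labels coded as −1/+1, uses scikit-learn only for the AUC, and the helper names are ours.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def prediction_metrics(y_true, y_pred, y_score):
    """ACC, TPR, FPR, and AUC for labels in {-1, +1}; y_score is the decision value."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fn = np.sum((y_pred == -1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == -1))
    tn = np.sum((y_pred == -1) & (y_true == -1))
    return {"ACC": (tp + tn) / len(y_true),
            "TPR": tp / (tp + fn),
            "FPR": fp / (fp + tn),
            "AUC": roc_auc_score(y_true, y_score)}

def selection_metrics(beta_hat, beta_true):
    """NV, FNR, and FDR of variable selection from estimated and true coefficient vectors."""
    selected, relevant = beta_hat != 0, beta_true != 0
    nv = int(np.sum(selected))
    fnr = np.sum(relevant & ~selected) / max(np.sum(relevant), 1)   # missed significant variables
    fdr = np.sum(selected & ~relevant) / max(nv, 1)                 # false discoveries among selected
    return {"NV": nv, "FNR": fnr, "FDR": fdr}
```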
3.2 Cases where variables are not correlated
Here, the variable dimension p = 100, the variables ${{\mathbf{x}}_1}{\rm{,}}{\text{ }}{{\mathbf{x}}_2}{\rm{,}}{\text{ }} \cdots {\rm{,}}{\text{ }}{{\mathbf{x}}_p} \sim \mathbb{N}\left( {0{\rm{,}}{\text{ }}1} \right)$, and the sample size n = 200, where 100 observations were used as the training set and the remaining 100 observations were used as the test set. The parameter ${\Cambriabfont\text{β}} = \left( 1{\rm{,}}\;0.5{\rm{,}}\;0.8{\rm{,}}\; - 0.35{\rm{,}}\;0{\rm{,}}\;1.2{\rm{,}}\;0.92{\rm{,}}\; - 1.1{\rm{,}}\; - 0.2{\rm{,}}\;1{\rm{,}}\;1.5{\rm{,}}\;2.2{\rm{,}}\;0.95{\rm{,}}\; - 1.3{\rm{,}}\; - 0.5{\rm{,}}\;0.6{\rm{,}}\;0.2{\rm{,}} 0{\rm{,}}\; \cdots {\rm{,}}\;0 \right)_p^{\text{T}}$.
Take $\varepsilon \sim \mathbb{N}\left( {0{\rm{,}}{\text{ }}1} \right)$, let $z = {\mathbf{x}}_i^{\text{T}}{\Cambriabfont\text{β}} + 0.5\varepsilon $, and construct the response variable y, so that,
$ \rm{Pr}\left(y=-1\right)={\mathrm{exp}\left(z\right)}/{\left(1+\mathrm{exp}\left(z\right)\right)} $ ()
$ \rm{Pr}\left(y=1\right)={1}/{\left(1+\mathrm{exp}\left(z\right)\right)}\rm{.} $ ()
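As an illustration, a minimal Python sketch of this data-generating process (variable names ours) is given below; it draws one random sample and one training/test split, whereas the experiments described next repeat the whole procedure 100 times with 5-fold cross-validation.

```python
import numpy as np

rng = np.random.default_rng(2024)
n, p = 200, 100
beta = np.zeros(p)
beta[:17] = [1, 0.5, 0.8, -0.35, 0, 1.2, 0.92, -1.1, -0.2, 1,
             1.5, 2.2, 0.95, -1.3, -0.5, 0.6, 0.2]          # remaining coefficients are zero

X = rng.standard_normal((n, p))                              # x_1, ..., x_p ~ N(0, 1), uncorrelated case
z = X @ beta + 0.5 * rng.standard_normal(n)
prob_neg = np.exp(z) / (1.0 + np.exp(z))                     # Pr(y = -1) = exp(z) / (1 + exp(z))
y = np.where(rng.uniform(size=n) < prob_neg, -1, 1)
X_train, y_train = X[:100], y[:100]                          # first 100 observations for training
X_test, y_test = X[100:], y[100:]
```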
According to the above setting, which satisfies the preceding assumptions, a set of observations with a sample size of 200 was randomly selected, and 5-fold cross-validation was used to calculate each evaluation indicator of the classification prediction and variable selection by SVM, LASSO-SVM, MCP-HSVM, and SCAD-HSVM methods, respectively. The cross-validation process was repeated 100 times to obtain the values of 100 sets of evaluation indicators, and according to these evaluation indicators, the results of the classification prediction and variable selection are shown in Table 2 and Table 3, where Min, 1st Qu, Median, Mean, 3rd Qu, and Max represent the minimum, first quartile, median, arithmetic mean, third quartile, and maximum of each indicator, respectively.

Table 2. Results of evaluation indicators for the classification prediction of each model.
Indicators | Methods | Min | 1st Qu | Median | Mean | 3rd Qu | Max
ACC | SVM | 0.7016 | 0.7435 | 0.8010 | 0.8000 | 0.8534 | 0.8952
 | LASSO-SVM | 0.7612 | 0.8114 | 0.8680 | 0.8653 | 0.9025 | 0.9589
 | MCP-HSVM | 0.7773 | 0.8260 | 0.8721 | 0.8749 | 0.9180 | 0.9708
 | SCAD-HSVM | 0.7842 | 0.8396 | 0.8862 | 0.8884 | 0.9372 | 0.9804
AUC | SVM | 0.7840 | 0.8170 | 0.8627 | 0.8608 | 0.9021 | 0.9335
 | LASSO-SVM | 0.8436 | 0.8916 | 0.9378 | 0.9290 | 0.9657 | 0.9902
 | MCP-HSVM | 0.8489 | 0.8801 | 0.9239 | 0.9239 | 0.9625 | 0.9969
 | SCAD-HSVM | 0.8502 | 0.8867 | 0.9328 | 0.9256 | 0.9626 | 0.9974
TPR | SVM | 0.6691 | 0.7075 | 0.7546 | 0.7602 | 0.8055 | 0.8645
 | LASSO-SVM | 0.7553 | 0.7988 | 0.8370 | 0.8463 | 0.8809 | 0.9500
 | MCP-HSVM | 0.7602 | 0.8068 | 0.8434 | 0.8512 | 0.8978 | 0.9581
 | SCAD-HSVM | 0.7595 | 0.7925 | 0.8525 | 0.8542 | 0.9102 | 0.9555
FPR | SVM | 0.2101 | 0.2342 | 0.2535 | 0.2571 | 0.2807 | 0.3097
 | LASSO-SVM | 0.1470 | 0.1710 | 0.1974 | 0.1960 | 0.2195 | 0.2456
 | MCP-HSVM | 0.1643 | 0.1868 | 0.2100 | 0.2121 | 0.2381 | 0.2634
 | SCAD-HSVM | 0.1683 | 0.1982 | 0.2239 | 0.2209 | 0.2457 | 0.2670

Table 3. Results of evaluation indicators for variable selection of LASSO-SVM, MCP-HSVM, and SCAD-HSVM.
Indicators | Methods | Min | 1st Qu | Median | Mean | 3rd Qu | Max
NV | LASSO-SVM | 5.0000 | 10.0000 | 17.0000 | 16.6400 | 24.2500 | 38.0000
 | MCP-HSVM | 6.0000 | 9.0000 | 15.0000 | 19.6000 | 25.7500 | 47.0000
 | SCAD-HSVM | 8.0000 | 9.0000 | 10.0000 | 10.2600 | 11.0000 | 13.0000
FNR | LASSO-SVM | 0.0176 | 0.1029 | 0.2294 | 0.2512 | 0.3247 | 0.5112
 | MCP-HSVM | 0.0765 | 0.1252 | 0.2000 | 0.2182 | 0.3059 | 0.4824
 | SCAD-HSVM | 0.1029 | 0.2294 | 0.2512 | 0.2824 | 0.3471 | 0.5647
FDR | LASSO-SVM | 0.0000 | 0.0764 | 0.2248 | 0.2679 | 0.3564 | 0.4578
 | MCP-HSVM | 0.0000 | 0.1193 | 0.2556 | 0.2880 | 0.3810 | 0.4610
 | SCAD-HSVM | 0.0000 | 0.1000 | 0.1577 | 0.1735 | 0.2662 | 0.3500
As shown in Table 2, in terms of the classification prediction, the ACC, AUC, and TPR values of LASSO-SVM, MCP-HSVM, and SCAD-HSVM are significantly higher than those of SVM while the FPR values are lower, indicating that the prediction accuracy and classification performance are significantly superior to those of traditional SVM methods. In contrast, the ACC, AUC, TPR, and FPR values of MCP-HSVM and SCAD-HSVM are basically slightly higher than those of LASSO-SVM, which indicates that the advantages of MCP-HSVM and SCAD-HSVM over the existing LASSO-SVM are not obvious in terms of the classification prediction.
As shown in Table 3, in terms of variable selection, the arithmetic means and medians of NV obtained by LASSO-SVM and MCP-HSVM are relatively close to the number of true significant variables (17.0000), whereas the maximum NV value of SCAD-HSVM is only 13.0000, which is obviously less than 17.0000, indicating that the omission of significant variables by SCAD-HSVM is more serious. In terms of FNR, its arithmetic means and medians obtained by LASSO-SVM and MCP-HSVM are basically slightly larger than 0.2000, which reveals that the two methods (especially MCP-HSVM) have a relatively low chance of making mistakes due to the omission of significant variables, so that most of the actual significant variables can be selected by applying the two methods. In terms of FDR, the results of SCAD-HSVM are significantly better than those of the other two methods, and the corresponding arithmetic mean and median of FDR are less than 0.2000, indicating that SCAD-HSVM is not likely to select insignificant variables, and therefore the classification prediction results are not easily affected by noisy variables.
3.3 Cases where variables are correlated
Here, the variable dimension p = 100 and the sample size n = 200. The predictors follow x~$\mathbb{N}$(0, Σ) with Σi,j = ρ (i, j = 1, 2, ···, 100), and the true values of the parameter β and the response variable y were constructed in the same way as in the previous setting. Experiments were carried out with ρ = 0.3, ρ = 0.5, and ρ = 0.8, following the previous experimental procedure. The experimental results are shown in Table 4 and Table 5.
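A short Python sketch of this correlated design is given below; the correlation structure used here (unit variances with a common off-diagonal correlation ρ) is an illustrative reading of the stated Σi,j = ρ, and the rest of the generation mirrors the uncorrelated case above.

```python
import numpy as np

rng = np.random.default_rng(2024)
n, p, rho = 200, 100, 0.5
# Assumed structure for illustration: unit variances, common off-diagonal correlation rho.
Sigma = np.full((p, p), rho)
np.fill_diagonal(Sigma, 1.0)
X = rng.multivariate_normal(mean=np.zeros(p), cov=Sigma, size=n)
# beta, z, and y are then constructed exactly as in the uncorrelated case.
```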

Table 4. Results of evaluation indicators for the classification prediction of each model under different correlations.
ρ | Indicators | SVM | LASSO-SVM | MCP-HSVM | SCAD-HSVM
ρ = 0.3 | ACC | 0.7758 | 0.8339 | 0.8436 | 0.8508
 | AUC | 0.8351 | 0.8802 | 0.9050 | 0.8956
 | TPR | 0.7290 | 0.7996 | 0.8306 | 0.8260
 | FPR | 0.2827 | 0.2103 | 0.2206 | 0.2230
ρ = 0.5 | ACC | 0.7496 | 0.8188 | 0.8348 | 0.8310
 | AUC | 0.7828 | 0.8650 | 0.8861 | 0.8853
 | TPR | 0.6939 | 0.7647 | 0.8245 | 0.8130
 | FPR | 0.3238 | 0.2357 | 0.2394 | 0.2345
ρ = 0.8 | ACC | 0.6760 | 0.7786 | 0.8200 | 0.8196
 | AUC | 0.7209 | 0.8212 | 0.8701 | 0.8635
 | TPR | 0.6331 | 0.7211 | 0.8135 | 0.7992
 | FPR | 0.3301 | 0.2567 | 0.2412 | 0.2451

Table 5. Means of evaluation indicators for variable selection of LASSO-SVM, MCP-HSVM, and SCAD-HSVM.
ρ | Indicators | LASSO-SVM | MCP-HSVM | SCAD-HSVM
ρ = 0.3 | NV | 15.8200 | 18.4800 | 14.2600
 | FNR | 0.2424 | 0.1612 | 0.2118
 | FDR | 0.2360 | 0.2057 | 0.1885
ρ = 0.5 | NV | 12.9000 | 16.0800 | 13.7600
 | FNR | 0.3117 | 0.2294 | 0.2941
 | FDR | 0.3298 | 0.2325 | 0.2916
ρ = 0.8 | NV | 10.2400 | 15.7600 | 12.9800
 | FNR | 0.3729 | 0.2800 | 0.3329
 | FDR | 0.1655 | 0.2397 | 0.3244
As shown in Table 4, the ACC, AUC, and TPR values of LASSO-SVM, MCP-HSVM, and SCAD-HSVM are significantly higher than those of traditional SVM, and the FPR values are significantly lower. This means that the three methods outperform traditional SVM in terms of the prediction accuracy and classifier performance, and as the correlation between variables increases, the advantage becomes more apparent. As for the comparison between MCP-HSVM, SCAD-HSVM, and LASSO-SVM, the advantages of MCP-HSVM and SCAD-HSVM in terms of the prediction accuracy and classifier performance are not obvious when the correlation between variables is low, but as the correlation between variables increases, the advantages become more and more obvious. This indicates that the classification prediction performance of MCP-HSVM and SCAD-HSVM is less influenced by the correlation between variables, especially for MCP-HSVM, which is almost unaffected by the correlation between variables.
As shown in Table 5, in terms of variable selection, the arithmetic means of NV selected by LASSO-SVM and SCAD-HSVM are significantly smaller than the number of true significant variables (17.0000) with a certain gap, and the gap becomes larger as the correlation between variables increases, especially for LASSO-SVM. Compared with the two methods, the arithmetic mean of NV of MCP-HSVM is always closer to the true value (17.0000). As the correlation between variables increases, the FNR values of all three methods increase, indicating that the increase in the correlation between variables leads to an increase in the number of missed significant variables. However, MCP-HSVM is affected in a limited way, and the arithmetic mean of its FNR under all three correlations is less than 0.3, indicating that no more than five significant variables are missing. In terms of FDR, there is no significant change in the arithmetic means of FDR of the three methods as the correlation between variables increases. Combining with the data in Table 3, when there is a linear correlation between variables, MCP-HSVM performs the best on variable selection, while LASSO-SVM performs the worst, and it is easy to miss the actual significant variables when there is a strong correlation between variables.
4 Prediction of financial distress for listed companies
4.1 Data sources and processing
The financial index system used in this paper (see Table 6) and its data were taken from Wind Information (http://www.wind.com.cn). A total of 3456 listed companies from A shares in Shanghai and Shenzhen were selected as the basic sample, and 72 companies were screened for financial distress in the past two years as positive examples.

Table 6. Financial indicators used in this paper.
Level 1 indicators | Level 2 indicators
Solvency | Current ratio: X1; Quick ratio: X2; Equity ratio: X3; Total equity/liabilities attributable to shareholders of the parent company: X4; Earnings before interest, taxes, depreciation, and amortization/total liabilities: X5; Net cash flow from operating activities/total liabilities: X6; Interest coverage ratio (earnings before interest and taxes/interest expense): X7
Indicators per share | Earnings per share (RMB): X8; Operating cash flow per share (RMB): X9; Net assets per share (net of minority interests) (RMB): X10
Revenue quality | Net income from operating activities/total profit (%): X11; Net gain from changes in value/total profit (%): X12; Net non-operating income and expenses/total profit (%): X13; Net income after deducting non-recurring gain and losses/net income (%): X14
Profitability | Gross profit margin on sales (%): X15; Net sales margin (%): X16; Net return on assets (%): X17; Net margin on total assets (%): X18
Operating capacity | Inventory turnover ratio (times): X19; Accounts receivable turnover ratio (times): X20; Current assets turnover ratio (times): X21; Fixed assets turnover ratio (times): X22; Total assets turnover ratio (times): X23
Capital structure | Asset-liability ratio (%): X24; Equity multiplier: X25; Current assets/total assets (%): X26; Non-current assets/total assets (%): X27; Current liabilities/total liabilities (%): X28; Non-current liabilities/total liabilities (%): X29
As shown in Table 6, a total of 29 variables were selected. We chose 72 listed companies that have been labeled as special treatment (ST) for two consecutive years in 2020 and 2021 as positive samples and the rest of listed companies as negative samples. The positive samples use the data two years prior to being labeled as ST, i.e., if the company was labeled as ST in 2020, we used the data of 2018, and if the company was labeled as ST in 2021, we used the data of 2019. The four steps of data processing are as follows:
1) Exclude companies listed after 2018.
2) Exclude missing values and outliers. The R package outliers was used to test the sample observations for outliers, and the observations identified as outliers were excluded. Samples with many missing values were also excluded.
3) Standardization. To eliminate the influence of differing magnitudes on the classification results and variable selection, the explanatory variables need to be standardized. In this paper, the Z-score method was used: the standardized value is obtained by subtracting the arithmetic mean from the original value and dividing by the standard deviation.
After the above steps, 1956 sample observations were finally obtained, including 72 positive sample points and 1884 negative sample points, which were used as the final sample data for modeling.
4) Data imbalance processing. After data pre-processing, the ratio of positive to negative cases is close to 1∶26; listed companies in financial distress are a small minority. Such imbalanced sample data pose a challenge to modeling: machine learning algorithms may be unstable on a severely unbalanced sample, and the prediction results may be biased. In this paper, double sampling was used, i.e., oversampling the positive cases and undersampling the negative cases at the same time, so that the ratio of positive to negative cases in the training set is close to 1∶1 and the total number of training samples is 602.
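A minimal Python sketch of the standardization and double-sampling steps is given below; the per-class size (chosen here so that the balanced training set has 602 samples) and the exact resampling scheme are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def zscore(X):
    """Z-score standardization: subtract the column mean and divide by the column standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def double_sample(X, y, n_per_class=301, seed=0):
    """Balance the classes by oversampling the minority (+1) and undersampling the majority (-1)."""
    rng = np.random.default_rng(seed)
    pos, neg = np.where(y == 1)[0], np.where(y == -1)[0]
    pos_idx = rng.choice(pos, size=n_per_class, replace=True)    # oversample positives with replacement
    neg_idx = rng.choice(neg, size=n_per_class, replace=False)   # undersample negatives without replacement
    idx = rng.permutation(np.concatenate([pos_idx, neg_idx]))
    return X[idx], y[idx]
```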
4.2 Analysis of financial distress prediction results
4.2.1 Selection of evaluation indicators
The processed data were randomly split 100 times, with 70% used as the training set for modeling and 30% as the test set for the out-of-sample prediction. Among the 586 sample points in the test set, there are only 21 positive samples. With such an unbalanced test set, a prediction accuracy of 96.3% would be obtained even if all the test samples were classified as negative cases, so the prediction accuracy alone is not a good measure of model effectiveness here. To better measure the accuracy of models on imbalanced samples, the commonly used approach is the AUC value, i.e., the area under the ROC curve; the larger the AUC value, the better the model. Moreover, for a financial distress prediction model of listed companies, the costs of the two types of classification errors differ greatly. Therefore, in addition to the AUC value, particular attention is paid to the TPR value, i.e., the percentage of positive samples that are correctly predicted.
4.2.2 Results of prediction and variable selection
The training set was modeled 100 times by traditional SVM, LASSO-SVM, MCP-HSVM, and SCAD-HSVM, and the average results are shown in Table 7. There is little difference in the average accuracy of the prediction of the four methods, but in terms of TPR and AUC, the latter three methods are significantly superior to traditional SVM, especially the MCP-HSVM and SCAD-HSVM methods.

Table 7. Prediction results of each method (average of 100 predictions).
Methods | ACC | AUC | TPR | NV
SVM | 0.8776 | 0.7676 | 0.7057 | /
LASSO-SVM | 0.8632 | 0.7967 | 0.7301 | 17.8
MCP-HSVM | 0.8652 | 0.8187 | 0.7682 | 23.6
SCAD-HSVM | 0.8754 | 0.8385 | 0.7494 | 22.1
As shown in Table 7, the three penalized methods give relatively consistent results in terms of variable selection. Considering the strong correlation between some indicators of the financial index system, and that the previous simulation experiments show the variable selection of MCP-HSVM to be slightly better than that of the other two methods when the variables are correlated, the results of MCP-HSVM were used as the basis for the analysis. The variable selection results were then rearranged: the selection frequency of each financial indicator over the 100 modeling runs of LASSO-SVM, MCP-HSVM, and SCAD-HSVM is shown in Table 8.

Table 8. Selection frequency of each financial indicator.
Variables | LASSO-SVM | MCP-HSVM | SCAD-HSVM | Variables | LASSO-SVM | MCP-HSVM | SCAD-HSVM
X1 | 0 | 15 | 11 | X16 | 100 | 100 | 100
X2 | 100 | 100 | 100 | X17 | 0 | 93 | 100
X3 | 40 | 50 | 45 | X18 | 100 | 100 | 100
X4 | 0 | 8 | 0 | X19 | 0 | 0 | 55
X5 | 100 | 100 | 100 | X20 | 100 | 100 | 100
X6 | 50 | 100 | 58 | X21 | 100 | 100 | 100
X7 | 0 | 0 | 5 | X22 | 100 | 100 | 100
X8 | 100 | 100 | 100 | X23 | 95 | 100 | 55
X9 | 100 | 100 | 100 | X24 | 91 | 95 | 100
X10 | 100 | 100 | 100 | X25 | 0 | 40 | 0
X11 | 92 | 95 | 100 | X26 | 100 | 100 | 100
X12 | 50 | 95 | 100 | X27 | 100 | 100 | 100
X13 | 0 | 0 | 3 | X28 | 85 | 95 | 100
X14 | 100 | 100 | 100 | X29 | 0 | 55 | 76
X15 | 0 | 51 | 0 | | | |
As shown in Table 8, among the seven indicators of solvency, the selection frequencies of X2, X5, and X6 are high, whereas those of the other indicators are relatively low; the indicators per share are selected in every modeling run; among the four indicators of revenue quality, the selection frequencies of X11, X12, and X14 are high, and only X13 is never selected; among the four indicators of profitability, the selection frequencies of X16, X17, and X18 are high, and the selection frequency of X15 is larger than 50%; among the five indicators of operating capacity, the selection frequencies of X20, X21, X22, and X23 are high, and X19 is never selected; among the six indicators of capital structure, the selection frequencies of X24, X26, X27, and X28 are high, while X25 and X29 are selected 40 times or more. Therefore, among the six aspects, the indicators per share have the greatest impact on the financial distress of listed companies, followed by the indicators of profitability and capital structure, and then the indicators of revenue quality and operating capacity. The indicators of solvency have the weakest impact, although three of them still have an impact on the financial distress of listed companies. The variables categorized by the selection frequency of MCP-HSVM are shown in Table 9.

Table 9. Classification of variables by the selection frequency of MCP-HSVM.
Selection frequency | Level 2 indicators
100 | X2 (quick ratio), X5 (earnings before interest, taxes, depreciation, and amortization/total liabilities), X6 (net cash flow from operating activities/total liabilities), X8 (earnings per share), X9 (operating cash flow per share), X10 (net assets per share), X14 (net income after deducting non-recurring gain and losses/net income), X16 (net sales margin), X18 (net margin on total assets), X20 (accounts receivable turnover ratio), X21 (current assets turnover ratio), X22 (fixed assets turnover ratio), X23 (total assets turnover ratio), X26 (current assets/total assets), and X27 (non-current assets/total assets)
90−100 | X11 (net income from operating activities/total profit), X12 (net gain from changes in value/total profit), X17 (return on net assets), X24 (asset-liability ratio), and X28 (current liabilities/total liabilities)
60−90 | None
40−60 | X3 (equity ratio), X15 (gross profit margin on sales), X25 (equity multiplier), and X29 (non-current liabilities/total liabilities)
20−40 | None
1−20 | X1 (current ratio) and X4 (total equity/liabilities attributable to shareholders of the parent company)
0 | X7 (interest coverage ratio (earnings before interest and taxes/interest expense)), X13 (net non-operating income and expenses/total profit), and X19 (inventory turnover ratio)
5 Conclusion
Through simulation experiments and model comparisons, it is concluded that two non-convex huberized-SVM methods, MCP-HSVM and SCAD-HSVM, outperform the traditional SVM methods in terms of the prediction accuracy and classifier performance. When there is no linear relationship between variables, there is no significant difference between MCP-HSVM, SCAD-HSVM, and the existing LASSO-SVM in terms of the prediction accuracy and classifier performance; in terms of variable selection, the numbers of variables selected by both LASSO-SVM and MCP-HSVM are closer to the number of real significant variables, but FDR of SCAD-HSVM is lower. When there is a linear correlation between variables, MCP-HSVM and SCAD-HSVM outperform SVM and LASSO-SVM in terms of the classification prediction and variable selection, especially when there is a high linear correlation between variables.
Based on this, the above methods were applied to the prediction of financial distress of listed companies in this paper. The prediction results show that, among the six aspects affecting the financial distress of listed companies, the indicators per share have the greatest influence, followed by the indicators of profitability and capital structure, and then the indicators of revenue quality and operating capacity; the indicators of solvency have the weakest influence. Through 100 modeling runs with the 29 indicators from the six aspects, 15 indicators were selected every time, and another 5 indicators were selected with a frequency between 90 and 100. Listed companies can assess their financial situation with these 20 indicators and give early warning of possible financial distress with higher precision.
Appendix: Supplementary Data
Supplementary data to this article can be found online at https://doi.org/10.1016/j.jnlest.2024.100246.
Disclosures
The authors declare no conflicts of interest.