1 Introduction
In high-dimensional data mining, one of the most common practical problems is to classify a target based on the features of the data. Many scholars have worked on classification methods to improve their accuracy; however, most traditional methods cannot meet practical needs because of restrictive assumptions and the diversity and complexity of data structures [1,2]. The rise of machine learning has opened a new chapter in the development of data mining technology. The support vector machine (SVM), a classification method built on machine learning theory, rests on the Vapnik-Chervonenkis (VC) dimension and the structural risk minimization principle of statistical learning theory. It is a supervised learning method, used mainly for classification and now also often used in regression analysis [3,4]. Many studies have shown that it is advantageous for small-sample, non-linear, and high-dimensional problems [4]. It therefore helps to alleviate the “curse of dimensionality” and overfitting problems of traditional classification methods.
In the standard binary classification problem, the linear SVM [5,6] is used to find the maximum-margin hyperplane $f\left( {{{\mathbf{x}}_i}} \right) = {\beta _0} + {\mathbf{x}}_i^{\text{T}} {\Cambriabfont\text{β}}$ that separates the categories 1 and –1. Let ${{\mathbf{x}}_i} \in {\mathbb{R}^p}$ (p denotes the number of independent variables in the model) be the predictor and ${y_i} \in \left\{ { - 1{\rm{,}}{\text{ }}1} \right\}{\text{ }}\left( {i = 1{\rm{,}}{\text{ }}2{\rm{,}}{\text{ }} \cdots {\rm{,}}{\text{ }}n} \right)$ be the class label. It is well known that the standard SVM can also be written in the form of a hinge loss with l2-penalized regularization:
$ \mathop {{\rm{min}} }\limits_{{\beta _0}{\rm{,}}{\Cambriabfont\text{β}}} \left( {1/n} \right)\sum\limits_{i = 1}^n {{{\left[ {1 - {y_i}\left( {{\beta _0} + {\mathbf{x}}_i^{\text{T}}{\Cambriabfont\text{β}}} \right)} \right]}_ + }} + \left( {\lambda /2} \right)\left\| {\Cambriabfont\text{β}} \right\|_2^2 $ ()
where β is the vector of coefficients of the variables, β0 is the intercept term, λ is a regularization parameter controlling the shrinkage of the coefficients, the loss $l\left( t \right) = {\left[ {1 - t} \right]_ + }$ is called the hinge loss, and $ {\Vert \cdot \Vert }_{2} $ is the l2-norm.
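As an illustration, the following minimal Python sketch evaluates this l2-penalized hinge-loss objective for a given coefficient vector; the function and variable names are ours and are not part of the original formulation.

```python
import numpy as np

def hinge_l2_objective(beta0, beta, X, y, lam):
    """Hinge loss with an l2 penalty: (1/n) * sum([1 - y_i f(x_i)]_+) + (lam/2) * ||beta||_2^2."""
    margins = y * (beta0 + X @ beta)           # y_i * f(x_i) for every sample
    hinge = np.maximum(0.0, 1.0 - margins)     # [1 - t]_+ applied elementwise
    return hinge.mean() + 0.5 * lam * np.sum(beta ** 2)

# toy usage: 5 samples, 3 features, labels in {-1, +1}
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))
y = np.array([1, -1, 1, 1, -1])
print(hinge_l2_objective(0.0, np.zeros(3), X, y, lam=0.1))  # all-zero coefficients give loss 1.0
```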
Since l2-penalized SVMs use all variables for classification, they face a huge challenge in high-dimensional classification [7]. For this reason, Zhu et al. [7] and Wegkamp and Yuan [8] considered l1-penalized SVMs. Although the l1 penalty also induces sparsity, the number of selected variables cannot exceed the sample size [9]. Thus, when the dimensionality of the variables is high, the l1-penalized model may miss many useful variables, i.e., it does not possess the Oracle consistency property in variable selection. Related studies have also shown that consistent variable selection of l1-penalized models relies on a stringent “irrepresentable condition” on the design matrix, which is easily violated in practice [10,11]. Furthermore, the regularization parameter that makes l1-penalized SVMs consistent in variable selection is not optimal for prediction accuracy [12,13]. To overcome the drawbacks of the l1 penalty, Fan and Li [9] and Zhang and Huang [14] proposed non-convex penalty functions such as the smoothly clipped absolute deviation (SCAD) and the minimax concave penalty (MCP), respectively. For non-convex penalties, Kim et al. [15] and Zhang [16] studied the Oracle property of high-dimensional SCAD- and MCP-penalized least squares regression, respectively. However, since the hinge loss in SVMs is not smooth, it complicates the model computation, and the Karush-Kuhn-Tucker local optimality condition is usually insufficient to characterize a non-smooth and non-convex penalized model [17]. Therefore, many scholars considered replacing the hinge loss with other losses. For instance, Mangasarian [18] used the squared hinge loss in SVMs. Rosset and Zhu [19] suggested the huberized hinge loss and showed that, among penalized models with the logistic regression loss, the huberized loss, and the squared hinge loss, outliers have the least effect on the huberized loss model. Since the huberized loss function and the hinge loss function share a similar shape and classification performance, we replace the hinge loss with the huberized loss and develop a non-convex penalized huberized-SVM model.
In this paper, the generalized coordinate descent (GCD) algorithm is used to solve the non-convex penalized huberized-SVM model, and the model is proved to satisfy the local Oracle property in variable selection. Simulation results demonstrate that the proposed method achieves better variable selection and higher accuracy. It is also demonstrated that the proposed model can effectively predict the financial distress of listed companies.
2 Non-convex penalized huberized-SVM
2.1 Model
In a standard binary classification problem, n pairs of training data, $\left\{ {{{\mathbf{x}}_i}{\rm{,}}{\text{ }}{y_i}} \right\}{\text{ }}\left( {i = 1{\rm{,}}{\text{ }}2{\rm{,}}{\text{ }} \cdots {\rm{,}}{\text{ }}n} \right)$, are given. The model framework is designed as
$ \mathop {{\rm{min}} }\limits_{{\beta _0}{\rm{,}}{\Cambriabfont\text{β}}} \left( {1/n} \right)\sum\limits_{i = 1}^n {{l_\delta }\left( {{y_i}\left( {{\beta _0} + {\mathbf{x}}_i^{\text{T}}{\Cambriabfont\text{β}}} \right)} \right)} + \sum\limits_{j = 1}^p {{P_\lambda }\left( {\left| {{\beta _j}} \right|} \right)} $ (1)
where ${\beta _j}$ is the jth coefficient of ${\Cambriabfont\text{β}}$, $\left| {{\beta _j}} \right|$ denotes the absolute value of ${\beta _j}$, and lδ(·) is the huberized hinge loss:
$ {l_\delta }(t) = \left\{ \begin{array}{ll} 0{\rm{,}} & t > 1 \\ {{{\left( {1 - t} \right)}^2}}/{{\left( {2\delta } \right)}}{\rm{,}} & 1 - \delta < t \leq 1 \\ 1 - t - \delta /2{\rm{,}} & t \leq 1 - \delta \end{array} \right. $ ()
where $\delta > 0$; usually $\delta {\text{ = }}2$ is used, as suggested by Yang and Zou [20,21].
As shown in Fig. 1, the shape of the huberized hinge loss is very similar to that of the hinge loss used in standard SVM. Especially when δ is very small, they almost coincide. Then, it can be easily proved that

Figure 1. Comparison of the hinge loss with the huberized hinge loss.
$ \mathop {\lim }\limits_{\delta \to {0^ + }} {l_\delta }(t) = {[1 - t]_ + } = \left\{ \begin{array}{ll} 1 - t{\rm{,}} & t \leq 1 \\ 0{\rm{,}} & t > 1 \end{array} \right. $ ()
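To make the comparison in Fig. 1 concrete, the following short Python sketch implements the huberized hinge loss and the hinge loss described above (the function names are ours); evaluating the two for a small δ reproduces the near-coincidence noted in the text.

```python
import numpy as np

def hinge_loss(t):
    """Standard hinge loss [1 - t]_+."""
    return np.maximum(0.0, 1.0 - np.asarray(t, dtype=float))

def huberized_hinge_loss(t, delta=2.0):
    """Huberized hinge loss: zero above 1, quadratic on (1 - delta, 1], linear below 1 - delta."""
    t = np.asarray(t, dtype=float)
    loss = np.zeros_like(t)
    quad = (t > 1.0 - delta) & (t <= 1.0)
    lin = t <= 1.0 - delta
    loss[quad] = (1.0 - t[quad]) ** 2 / (2.0 * delta)
    loss[lin] = 1.0 - t[lin] - delta / 2.0
    return loss

t = np.linspace(-3, 3, 7)
print(hinge_loss(t))
print(huberized_hinge_loss(t, delta=0.01))  # nearly identical to the hinge loss for small delta
```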
In (1), $ {P_\lambda }( \cdot ) $ is the non-convex penalty. For example, its specific form of SCAD [9] is
$ {P_\lambda }(x) = \lambda \int_0^{|x|} {{\rm{min}} \left\{ {1{\rm{,}}{\text{ }}{{\left[ {a - t/\lambda } \right]}_ + }/\left( {a - 1} \right)} \right\}} {\text{d}}t{\text{, }}a > 2{\mathrm{.}} $ ()
And its specific form of MCP [16] is
$ {P_\lambda }(x) = \lambda \int_0^{|x|} {{{[1 - t/(a\lambda )]}_ + }} {\text{d}}t{\text{, }}a > 1{\mathrm{.}} $ ()
In both penalties, a is a pre-specified tuning parameter. Fang et al. [22] demonstrated that the choice of a has little effect on the model results. Thus, for convenience, following the suggestions of Fan and Li [9] and Zhang [16], we fix a = 3.7 in SCAD and a = 2 in MCP in this paper.
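For reference, a minimal Python sketch of the SCAD and MCP penalty derivatives $P'_\lambda(\cdot)$ is given below; these derivatives are what the local linear approximation in Section 2.2 actually uses. The closed forms follow from differentiating the integral definitions above, and the function names are ours.

```python
import numpy as np

def scad_derivative(x, lam, a=3.7):
    """Derivative of the SCAD penalty at |x|: lam * min{1, [a - |x|/lam]_+ / (a - 1)}."""
    x = np.abs(np.asarray(x, dtype=float))
    return lam * np.minimum(1.0, np.maximum(0.0, a - x / lam) / (a - 1.0))

def mcp_derivative(x, lam, a=2.0):
    """Derivative of the MCP penalty at |x|: lam * [1 - |x|/(a*lam)]_+."""
    x = np.abs(np.asarray(x, dtype=float))
    return lam * np.maximum(0.0, 1.0 - x / (a * lam))

beta = np.array([0.0, 0.1, 0.5, 2.0, 5.0])
print(scad_derivative(beta, lam=0.5))  # stays at lam near zero, decays to 0 for large |beta|
print(mcp_derivative(beta, lam=0.5))
```

Both derivatives equal λ near the origin (so small coefficients are penalized like the l1 penalty) and vanish for large coefficients, which is what reduces the estimation bias of the non-convex penalties relative to LASSO.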
2.2 Algorithm
The GCD algorithm is used to solve the model (1). Without loss of generality, suppose that the values of x, ${x_{i{\text{,}}j}}$, have been standardized so that $(1/n)\displaystyle\sum\limits_{i = 1}^n {{x_{i{\text{,}}j}} = 0} $ and $(1/n)\displaystyle\sum\limits_{i = 1}^n {x_{i{\text{,}}j}^2 = 1} $ for $j = 1{\text{, }}2{\text{, }} \cdots {\text{, }}p$. Defining the current margin as ${r_i} = {y_i}({\tilde \beta _0} + {\mathbf{x}}_i^{\text{T}}\tilde {\Cambriabfont\text{β}})$, the coordinate descent step for ${\beta _j}$ minimizes the following function:
$ F\left( {{\beta _j}|{{\tilde \beta }_0}{\text{, }}\tilde {\Cambriabfont\text{β}}} \right) = \left( {1/n} \right)\sum\limits_{i = 1}^n {{l_\delta }\left\{ {{r_i} + {y_i}{x_{i{\text{,}}j}}\left( {{\beta _j} - {{\tilde \beta }_j}} \right)} \right\}} + {P_\lambda }\left( {\left| {{\beta _j}} \right|} \right) $ (2)
where $ {\tilde {\beta} _0} $, $ \tilde {\Cambriabfont\text{β}} $, and $ {\tilde \beta _j} $ denote the current values of ${\beta _0}$, β, and ${\beta _j}$ in the calculation process, respectively.
For the non-convex penalty $ {P_\lambda }\left( {|{\beta _j}|} \right) $, the local linear approximation is applied:
$ {P_{\lambda}}\left( {|{\beta _j}|} \right) \approx {P_{\lambda}}\left( {|{{\tilde \beta }_j}|} \right) + {P'_{\lambda}}\left( {|{{\tilde \beta }_j}|} \right)\left( {|{\beta _j}| - |{{\tilde \beta }_j}|} \right){\text{, }}|{\beta _j}| \approx |{\tilde \beta _j}| {\mathrm{.}} $ (3)
Since the huberized hinge loss lδ(·) satisfies the quadratic majorization condition [20], it can be proved that $ F({\beta _j}|{\tilde \beta _0}{\rm{,}}\;\tilde {\Cambriabfont\text{β}}) < \hat F({\beta _j}|{\tilde \beta _0}{\rm{,}}\;\tilde {\Cambriabfont\text{β}}) $, where
$ \hat F\left( {{\beta _j}|{{\tilde \beta }_0}{\rm{,}}{\text{ }}\tilde {\Cambriabfont\text{β}}} \right) = \left( {1/n} \right)\sum\limits_{i = 1}^n {{l_\delta }\left( {{r_i}} \right)} + \left( {1/n} \right)\sum\limits_{i = 1}^n {{l_\delta' }\left( {{r_i}} \right)} {y_i}{x_{i{\text{,}}j}}\left( {{\beta _j} - {{\tilde \beta }_j}} \right) + \left( {1/\delta } \right){\left( {{\beta _j} - {{\tilde \beta }_j}} \right)^2} {\mathrm{.}} $ (4)
Since $ \hat F( \cdot ) $ is a quadratic majorization of the objective function $ F( \cdot ) $, the minimizer of (2) can be approximated by the minimizer of (4). Minimizing (4), with the penalty replaced by its linear approximation (3), yields the iterative update (5):
$ \hat \beta _j^{{\text{new}}} = \mathop {{\rm{arg}}\,{\rm{min}} }\limits_{{\beta _j}} \hat F\left( {{\beta _j}|{{\tilde \beta }_0}{\rm{,}}{\text{ }}\tilde {\Cambriabfont\text{β}}} \right) = \left( {\delta /2} \right)S\left( {z{\rm{,}}{\text{ }}{{P'}_{\lambda}}\left( {\left| {{{\tilde \beta }_j}} \right|} \right)} \right) $ (5)
where
$ S\left(z\rm{,}\text{ }t\right)=\left[\left|z\right|-t\right]_+\text{sign}\left(z\right) $ ()
$ z = \left( {2/\delta } \right){\tilde \beta _j} - \left( {1/n} \right)\sum\limits_{i = 1}^n {{l_\delta' }({r_i})} {y_i}{x_{i{\text{,}}j}}{\mathrm{.}} $ ()
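As a brief derivation sketch (assuming, as in (3), that the linearized penalty $P'_{\lambda}(|\tilde\beta_j|)\,|\beta_j|$ is added to (4)), the update (5) follows from the first-order condition:

```latex
% Minimize \hat F(\beta_j|\tilde\beta_0,\tilde{\beta}) + P'_\lambda(|\tilde\beta_j|)\,|\beta_j| over \beta_j.
\begin{align*}
0 &= \frac{1}{n}\sum_{i=1}^{n} l'_\delta(r_i)\,y_i x_{i,j}
     + \frac{2}{\delta}\bigl(\beta_j - \tilde\beta_j\bigr)
     + P'_\lambda\bigl(|\tilde\beta_j|\bigr)\operatorname{sign}(\beta_j)
     && \text{(first-order condition, } \beta_j \neq 0\text{)}\\
\frac{2}{\delta}\,\beta_j &= z - P'_\lambda\bigl(|\tilde\beta_j|\bigr)\operatorname{sign}(\beta_j)
     && \text{with } z = \frac{2}{\delta}\tilde\beta_j - \frac{1}{n}\sum_{i=1}^{n} l'_\delta(r_i)\,y_i x_{i,j}\\
\hat\beta_j^{\text{new}} &= \frac{\delta}{2}\,S\!\left(z,\ P'_\lambda\bigl(|\tilde\beta_j|\bigr)\right)
     && \text{(combining the sign cases and the case } \beta_j = 0\text{)}
\end{align*}
```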
Similar to (5), it is possible to minimize
$ \hat F\left( {{\beta _0}|{{\tilde \beta }_0}{\rm{,}}\;\tilde {\Cambriabfont\text{β}}} \right) = \left( {1/n} \right)\sum\limits_{i = 1}^n {{l_\delta }\left( {{r_i}} \right)} + \left( {1/n} \right)\sum\limits_{i = 1}^n {{l_\delta' }\left( {{r_i}} \right)} {y_i}\left( {{\beta _0} - {{\tilde \beta }_0}} \right) + \left( {1/\delta } \right){\left( {{\beta _0} - {{\tilde \beta }_0}} \right)^2} $ (6)
to obtain the iterative update (7) for the intercept term β0:
$ \hat \beta _0^{{\text{new}}} = \mathop {{\rm{arg}}\,{\rm{min}} }\limits_{{\beta _0}} \hat F\left( {{\beta _0}|{{\tilde \beta }_0}{\rm{,}}\;\tilde {\Cambriabfont\text{β}}} \right) = {\tilde \beta _0} - \left( {\delta /\left( {2n} \right)} \right)\sum\limits_{i = 1}^n {{l_\delta' }\left( {{r_i}} \right){y_i}} {\mathrm{.}} $ (7)
The resulting algorithm for solving the model (1) is shown in Table 1, and the flowchart is shown in Fig. 2. It is worth noting that the tolerance ε is set to $10^{-6}$ by default.

Table 1. Description of the algorithm for solving the model (1).
Algorithm:
1. Initialize $ {\tilde \beta _0} $ and $ \tilde {\Cambriabfont\text{β}} $.
2. Iterate 1) and 2):
 1) Cyclic coordinate descent: for j = 1, 2, ···, p,
  i) Calculate $ {r_i} = {y_i}({\tilde \beta _0} + {\mathbf{x}}_i^{\text{T}}\tilde {\Cambriabfont\text{β}}) $;
  ii) Calculate $ \hat \beta _j^{{\text{new}}} = \left( {\delta /2} \right)S\left( {z{\rm{,}}{\text{ }}{P_{\lambda}'}\left( {\left| {{{\tilde \beta }_j}} \right|} \right)} \right) $;
  iii) Let $ {\tilde \beta _j} = \hat \beta _j^{{\text{new}}} $.
 2) Update the intercept term:
  i) Calculate $ {r_i} = {y_i}\left( {{{\tilde \beta }_0} + {\mathbf{x}}_i^{\text{T}}\tilde {\Cambriabfont\text{β}}} \right) $;
  ii) Calculate $ \hat \beta _0^{{\text{new}}} = {\tilde \beta _0} - \left( {\delta /\left( {2n} \right)} \right)\sum\limits_{i = 1}^n {{l_\delta' }\left( {{r_i}} \right){y_i}} $;
  iii) Let $ {\tilde \beta _0} = \hat \beta _0^{{\text{new}}} $.
3. If the largest coefficient change in the current cycle, $\mathop {{\rm{max}} }\limits_{0 \leq j \leq p} \left| {\hat \beta _j^{{\text{new}}} - {{\tilde \beta }_j}} \right|$, is less than ε, stop the iteration; otherwise, return to step 2.

Figure 2. Flowchart of the algorithm.
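As an illustration of Table 1, the following Python sketch implements the coordinate-wise updates (5) and (7) for one specific configuration, namely the huberized hinge loss with δ = 2 and the MCP penalty; the function names and the convergence bookkeeping are our own choices rather than a reference implementation.

```python
import numpy as np

def huber_hinge_grad(t, delta=2.0):
    """Derivative l'_delta(t) of the huberized hinge loss."""
    t = np.asarray(t, dtype=float)
    g = np.zeros_like(t)
    g[t <= 1.0 - delta] = -1.0
    mid = (t > 1.0 - delta) & (t <= 1.0)
    g[mid] = -(1.0 - t[mid]) / delta
    return g

def mcp_derivative(x, lam, a=2.0):
    """MCP derivative lam * [1 - |x|/(a*lam)]_+ used in the local linear approximation."""
    return lam * np.maximum(0.0, 1.0 - np.abs(x) / (a * lam))

def soft_threshold(z, t):
    """S(z, t) = [|z| - t]_+ * sign(z)."""
    return np.sign(z) * np.maximum(0.0, np.abs(z) - t)

def gcd_huberized_svm(X, y, lam, delta=2.0, a=2.0, eps=1e-6, max_iter=500):
    """GCD sketch for the MCP-penalized huberized-SVM (Table 1).

    X is assumed column-standardized (mean 0, mean square 1); y takes values in {-1, +1}.
    """
    n, p = X.shape
    beta0, beta = 0.0, np.zeros(p)
    for _ in range(max_iter):
        coef_old = np.append(beta0, beta)
        # 1) cyclic coordinate descent over j = 1, ..., p
        for j in range(p):
            r = y * (beta0 + X @ beta)                      # current margins r_i
            z = (2.0 / delta) * beta[j] - np.mean(huber_hinge_grad(r, delta) * y * X[:, j])
            beta[j] = (delta / 2.0) * soft_threshold(z, mcp_derivative(beta[j], lam, a))
        # 2) update the intercept term
        r = y * (beta0 + X @ beta)
        beta0 = beta0 - (delta / (2.0 * n)) * np.sum(huber_hinge_grad(r, delta) * y)
        # 3) stop when the largest coefficient change over the cycle is below eps
        if np.max(np.abs(np.append(beta0, beta) - coef_old)) < eps:
            break
    return beta0, beta
```

In practice, λ would be chosen by cross-validation, as is done in Section 3.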
2.3 Asymptotic property
In this subsection, the local Oracle property of the model (1) is discussed. Since β0 does not affect variable selection, we set β0 = 0.
Let ${{\Cambriabfont\text{β}}^{\text{*}}}{\text{ = }}\left( {\beta _1^{\text{*}}{\rm{,}}{\text{ }}\beta _2^{\text{*}}{\rm{,}}{\text{ }} \cdots {\rm{,}}{\text{ }}\beta _p^{\text{*}}} \right)$ denote the true value of the model parameters:
$ {{\Cambriabfont\text{β}}^*} = \mathop {{\rm{arg}}\,{\rm{min}} }\limits_{\Cambriabfont\text{β}} L\left( {\Cambriabfont\text{β}} \right) = \mathop {{\rm{arg}}\,{\rm{min}} }\limits_{\Cambriabfont\text{β}} E\left( {{l_\delta }\left( {y{{\mathbf{x}}^{\text{T}}}{\Cambriabfont\text{β}}} \right)} \right) {\mathrm{,}} $ ()
where $ L\left( {\Cambriabfont\text{β}} \right) = E\left( {{l_\delta }\left( {y{{\mathbf{x}}^{\text{T}}}{\Cambriabfont\text{β}}} \right)} \right) $ is the true loss and E(·) denotes the mathematical expectation of a random variable.
Let pn be the number of covariates. Let $A = \left\{ {\left. j \right|\beta _j^* \ne 0{\text{, }}0 \leq j \leq {p_n}} \right\}$ denote the set of subscripts of non-zero coefficients, $ {{\Cambriabfont\text{β}}_A} $ is a vector composed of all non-zero variable coefficients, and $ \boldsymbol{\mathrm{\mathbf{x}}}_A $ is a vector composed of independent variables with non-zero coefficients. Its ith observation vector is denoted by $ \boldsymbol{\mathrm{\mathbf{x}}}_{i\mathrm{,}A} $. ${A^c}$ is the complement set of A, and ${{\Cambriabfont\text{β}}_{{A^c}}}$ is the vector formed by removing the components of the elements of the vector β at the corresponding positions in A.
Thus, the Oracle estimator $ \hat {\Cambriabfont\text{β}} $ of β is defined as
$ \hat {\Cambriabfont\text{β}} = \mathop {{\rm{arg}}\,{\rm{min}} }\limits_{\Cambriabfont\text{β}} {L_n}({\Cambriabfont\text{β}}) = \mathop {{\rm{arg}}\,{\rm{min}} }\limits_{\Cambriabfont\text{β}} \left\{ {\left( {1/n} \right)\sum\limits_{i = 1}^n {{l_\delta }\left( {{y_i}{\mathbf{x}}_i^{\text{T}}{\Cambriabfont\text{β}}} \right)} } \right\} = \mathop {{\rm{arg}}\,{\rm{min}} }\limits_{{{\Cambriabfont\text{β}}_{{A^c}}} = {\mathbf{0}}} \left\{ {\left( {1/n} \right)\sum\limits_{i = 1}^n {{l_\delta }\left( {{y_i}{\mathbf{x}}_{i{\mathrm{,}}A}^{\text{T}}{{\Cambriabfont\text{β}}_A}} \right)} } \right\} $ ()
where $ {\hat \beta _j} = 0{\text{, }}j \in {A^c} $ and Ln(·) denotes the sample loss.
The above definition of the Oracle estimator indicates that the estimate obtained by minimizing the empirical loss is the same as that obtained by minimizing the empirical loss over only the covariates of interest. The Oracle estimator was proposed initially by Fan and Li [9]; it requires that the variables selected by the model coincide with the true set of relevant variables, so whether the model’s estimator is the Oracle estimator is usually used as an important criterion for judging the quality of the model’s variable selection.
To obtain the Oracle properties of the estimator, the following regularity conditions are given. Detailed proofs of the theorems are provided in the Appendix.
Condition 1: The loss function l(·) has a first-order continuous derivative, and $\exists {M_1}{\text{, }}{M_2} > 0$, so that
$ \left|l'(t)\right|\le M_1\left(\left|t\right|+1\right)\rm{,}\text{ }\partial l'(t)\le M_2\rm{,}\text{ }\forall t. $ ()
Condition 2: When $n \to \infty $, we have ${q_n} = O\left( {{n^{{c_1}}}} \right){\rm{,}}{\text{ }}0 \leq {c_1} \leq 1/2$, where qn is the cardinality of the set A, and O(·) denotes a quantity of the same order.
Condition 3: The Hessian matrix ${\mathbf{H}}\left( {{{\Cambriabfont\text{β}}_A}} \right){\text{ = }}E\left( {{\nabla ^2}{l_\delta }\left( {y{\mathbf{x}}_A^{\text{T}}{{\Cambriabfont\text{β}}_A}} \right)} \right)$ satisfies the condition $0 < {M_3} < {\lambda _{{\rm{min}} }}\left\{ {{\mathbf{H}}\left( {{{\Cambriabfont\text{β}}_A}} \right)} \right\} \leq {\lambda _{\rm{max}}}\left\{ {{\mathbf{H}}\left( {{{\Cambriabfont\text{β}}_A}} \right)} \right\} < {M_4} < \infty $, where xA is the matrix formed by removing the columns corresponding to the elements in Ac from the observation matrix x, and λmin and λmax denote the minimum and maximum eigenvalues of the matrix, respectively.
Condition 4: $\exists {M_5} > 0$, so that ${\lambda _{{\rm{max}} }}({n^{ - 1}}{\mathbf{x}}_A^{\text{T}}{{\mathbf{x}}_A}) \leq {M_5}$ and the elements, $ x_{i{\mathrm{,}}j}\text{ }\left(1\le i\le n\rm{,}\text{ }j\in A^c\right) $, are the sub-Gaussian random variables.
Condition 5: $\exists {c_2}{\rm{,}}\;{M_6} > 0$, so that $1 - {c_1} \leq {c_2} \leq 1$, and ${n^{{{\left( {1 - {c_2}} \right)} \mathord{\left/ {\vphantom {{\left( {1 - {c_2}} \right)} 2}} \right. } 2}}}\mathop {{\rm{min}} }\limits_{j \in A} \left| {\beta _j^*} \right| \geq {M_6}$.
Condition 1 requires the loss function to be smooth with a derivative that grows at most linearly, which is satisfied by some common SVM loss functions, such as the huberized hinge loss function and the squared hinge loss function. Condition 2 implies that the divergence rate of the number of non-zero coefficients is not larger than n1/2. Under Condition 3, the Hessian matrix of the loss function is positive definite and its eigenvalues are uniformly bounded. The condition on the maximum eigenvalue of the design matrix assumed in Condition 4 is similar to those in Ref. [15] and Ref. [18]. Condition 5 requires that the signals of the relevant variables do not decay too fast.
Theorem 1. If Conditions 1–5 are satisfied, then the Oracle estimate of the model (1) satisfies
$ {\left\| {{{ \hat {\Cambriabfont\text{β}}}_A} - {\Cambriabfont\text{β}}_A^*} \right\|_2} = {O_p}\left( {\sqrt {{q_n}/n} } \right) $ ()
where Op(·) denotes boundedness in probability at the indicated rate.
By Theorem 1, it can be seen that the Oracle estimator of the model (1) satisfies consistency as long as ${q_n} = O\left( {{n^{{c_1}}}} \right){\rm{,}}{\text{ }}0 \leq {c_1} \leq {1 \mathord{\left/ {\vphantom {1 2}} \right. } 2}$.
Lemma 1. If Conditions 1–5 are satisfied, and ${\rm{log}} \left( {{p_n}} \right){q_n}{\rm{log}} \left( n \right){n^{ - 1/2}} = O\left( {{\lambda _n}} \right)$, then the probability
$ \rm{Pr} \left\{ {\mathop {{\rm{max}} }\limits_{j \in {A^c}} \left| {\left( {1/n} \right)\sum\limits_{i = 1}^n {{y_i}{x_{i{\mathrm{,}}j}}{l_\delta' }\left( {{y_i}{\mathbf{x}}_{i{\mathrm{,}}A}^{\text{T}}{\Cambriabfont\text{β}}_A^*} \right)} } \right| > {\lambda _n}/2} \right\} \to 0{\rm{,}}{\text{ }}n \to \infty . $ ()
Lemma 2. Suppose Conditions 1–5 hold, ${\rm{log}} \left( {{p_n}} \right){q_n}{\rm{log}} \left( n \right){n^{ - 1/2}} = O\left( {{\lambda _n}} \right)$, and ${\lambda _n} = O\left( {{n^{ - (1 - {c_2})/2}}} \right)$. For j = 1, 2, ···, p, let ${s_j}\left( { \hat {\Cambriabfont\text{β}}} \right) = {\partial {L_n}( \hat {\Cambriabfont\text{β}})}/{\partial {\beta _j}}$, where sj(·) denotes the subgradient.
Then, for the Oracle estimator $ \hat {\Cambriabfont\text{β}}$ and the corresponding ${s_j}( \hat {\Cambriabfont\text{β}})$, there is a probability of convergence to 1, satisfying
$ {s_j}\left( { \hat {\Cambriabfont\text{β}}} \right) = 0{\text{, }}\left| {{{\hat \beta }_j}} \right| \geq \left( {a + 1/2} \right){\lambda _n}{\text{, }}j \in A{\rm{;}} $ ()
$ \left| {{s_j}\left( { \hat {\Cambriabfont\text{β}}} \right)} \right| \leq {\lambda _n}{\text{, }}\left| {{{\hat \beta }_j}} \right| = 0{\text{, }}j \in {A^c}{\rm{.}} $ ()
Theorem 2. If Conditions 1–5 are satisfied, let Bn(λn) denote the set of local minimum points of the objective function:
$ {Q_n}\left( {\Cambriabfont\text{β}} \right) = {L_n}\left( {\Cambriabfont\text{β}} \right) + \sum\limits_{j = 1}^p {{P_{{\lambda _n}}}\left( {\left| {{\beta _j}} \right|} \right)} {\rm{.}} $ ()
Then, the Oracle estimator $ \hat {\Cambriabfont\text{β}} = \left( { \hat {\Cambriabfont\text{β}}_A^{\text{T}}{\text{, }}{{\mathbf{0}}^{\text{T}}}} \right) $ of the model (1) satisfies
$ \rm{Pr} \left\{ { \hat {\Cambriabfont\text{β}} \in {B_n}\left( {{\lambda _n}} \right)} \right\} \to 1{\text{, }}n \to \infty $ ()
where ${\rm{log}} \left( {{p_n}} \right){q_n}{\rm{log}} \left( n \right){n^{ - 1/2}} = O\left( {{\lambda _n}} \right)$ and ${\lambda _n} = O\left( {{n^{ - (1 - {c_2})/2}}} \right)$.
According to Theorem 2, the model (1) possesses the local Oracle property when ${\lambda _n} = O\left( {{n^{ - 1/2 + \tau }}} \right)$ with $ {c_1} < \tau < {c_2}/2 $, even if $p = O\left( {\exp \left( {{n^{\left( {\tau - {c_1}} \right)/2}}} \right)} \right)$. Therefore, the local Oracle property of the proposed model still holds even if the number of covariates grows exponentially with the sample size.
3 Experimental analysis
To validate the performance of the proposed non-convex penalized huberized-SVM, experiments were designed for two cases, in which the variables are uncorrelated and correlated, respectively. MCP and SCAD are two of the most commonly used non-convex penalties; here, the MCP huberized-SVM (MCP-HSVM) and the SCAD huberized-SVM (SCAD-HSVM) were selected to represent the non-convex penalized huberized-SVM. The results of MCP-HSVM and SCAD-HSVM were compared with those of the conventional SVM and the classical least absolute shrinkage and selection operator (LASSO)-SVM. The tuning parameters of all the methods in this paper were selected by 5-fold cross-validation.
3.1 Evaluation metrics
To evaluate the prediction performance of the various methods, the out-of-sample prediction accuracy (ACC), true positive rate (TPR), false positive rate (FPR), and area under the receiver operating characteristic (ROC) curve (AUC) were used. The higher the values of ACC, TPR, and AUC, the better the prediction performance of the model; the lower the value of FPR, the better. To measure the performance of variable selection, three indicators were used: the number of selected significant variables (NV), the false negative rate (FNR), and the false discovery rate (FDR). The closer NV is to the actual number of significant variables, the better the variable selection. FNR is the rate at which significant variables are omitted (Type II error rate), and FDR is the rate at which insignificant variables are identified as significant (Type I error rate); the lower these two indicators, the better the variable selection.
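For concreteness, a minimal Python sketch of these indicators is given below; it assumes labels coded as −1/+1, uses scikit-learn only for the AUC, and the helper names are ours.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def prediction_metrics(y_true, y_pred, y_score):
    """ACC, TPR, FPR, and AUC for labels in {-1, +1}; y_score is the decision value."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fn = np.sum((y_pred == -1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == -1))
    tn = np.sum((y_pred == -1) & (y_true == -1))
    return {"ACC": (tp + tn) / len(y_true),
            "TPR": tp / (tp + fn),
            "FPR": fp / (fp + tn),
            "AUC": roc_auc_score(y_true, y_score)}

def selection_metrics(beta_hat, beta_true):
    """NV, FNR, and FDR of variable selection from estimated and true coefficient vectors."""
    selected, relevant = beta_hat != 0, beta_true != 0
    nv = int(np.sum(selected))
    fnr = np.sum(relevant & ~selected) / max(np.sum(relevant), 1)   # missed significant variables
    fdr = np.sum(selected & ~relevant) / max(nv, 1)                 # false discoveries among selected
    return {"NV": nv, "FNR": fnr, "FDR": fdr}
```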
3.2 Cases where variables are not correlated
Here, the variable dimension p = 100, the variables ${{\mathbf{x}}_1}{\rm{,}}{\text{ }}{{\mathbf{x}}_2}{\rm{,}}{\text{ }} \cdots {\rm{,}}{\text{ }}{{\mathbf{x}}_p} \sim \mathbb{N}\left( {0{\rm{,}}{\text{ }}1} \right)$, and the sample size n = 200, where 100 observations were used as the training set and the remaining 100 observations were used as the test set. The parameter ${\Cambriabfont\text{β}} = \left( 1{\rm{,}}\;0.5{\rm{,}}\;0.8{\rm{,}}\; - 0.35{\rm{,}}\;0{\rm{,}}\;1.2{\rm{,}}\;0.92{\rm{,}}\; - 1.1{\rm{,}}\; - 0.2{\rm{,}}\;1{\rm{,}}\;1.5{\rm{,}}\;2.2{\rm{,}}\;0.95{\rm{,}}\; - 1.3{\rm{,}}\; - 0.5{\rm{,}}\;0.6{\rm{,}}\;0.2{\rm{,}} 0{\rm{,}}\; \cdots {\rm{,}}\;0 \right)_p^{\text{T}}$.
Take $\varepsilon \sim \mathbb{N}\left( {0{\rm{,}}{\text{ }}1} \right)$, let $z = {\mathbf{x}}_i^{\text{T}}{\Cambriabfont\text{β}} + 0.5\varepsilon $, and construct the response variable y, so that,
$ \rm{Pr}\left(y=-1\right)={\mathrm{exp}\left(z\right)}/{\left(1+\mathrm{exp}\left(z\right)\right)} $ ()
$ \rm{Pr}\left(y=1\right)={1}/{\left(1+\mathrm{exp}\left(z\right)\right)}\rm{.} $ ()
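As an illustration, a minimal Python sketch of this data-generating process (variable names ours) is given below; it draws one random sample and one training/test split, whereas the experiments described next repeat the whole procedure 100 times with 5-fold cross-validation.

```python
import numpy as np

rng = np.random.default_rng(2024)
n, p = 200, 100
beta = np.zeros(p)
beta[:17] = [1, 0.5, 0.8, -0.35, 0, 1.2, 0.92, -1.1, -0.2, 1,
             1.5, 2.2, 0.95, -1.3, -0.5, 0.6, 0.2]          # remaining coefficients are zero

X = rng.standard_normal((n, p))                              # x_1, ..., x_p ~ N(0, 1), uncorrelated case
z = X @ beta + 0.5 * rng.standard_normal(n)
prob_neg = np.exp(z) / (1.0 + np.exp(z))                     # Pr(y = -1) = exp(z) / (1 + exp(z))
y = np.where(rng.uniform(size=n) < prob_neg, -1, 1)
X_train, y_train = X[:100], y[:100]                          # first 100 observations for training
X_test, y_test = X[100:], y[100:]
```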
According to the above setting, which satisfies the preceding assumptions, a set of observations with a sample size of 200 was randomly selected, and 5-fold cross-validation was used to calculate each evaluation indicator of the classification prediction and variable selection by SVM, LASSO-SVM, MCP-HSVM, and SCAD-HSVM methods, respectively. The cross-validation process was repeated 100 times to obtain the values of 100 sets of evaluation indicators, and according to these evaluation indicators, the results of the classification prediction and variable selection are shown in Table 2 and Table 3, where Min, 1st Qu, Median, Mean, 3rd Qu, and Max represent the minimum, first quartile, median, arithmetic mean, third quartile, and maximum of each indicator, respectively.

Table 2. Results of evaluation indicators for the classification prediction of each model.
Indicators | Methods | Min | 1st Qu | Median | Mean | 3rd Qu | Max
ACC | SVM | 0.7016 | 0.7435 | 0.8010 | 0.8000 | 0.8534 | 0.8952
 | LASSO-SVM | 0.7612 | 0.8114 | 0.8680 | 0.8653 | 0.9025 | 0.9589
 | MCP-HSVM | 0.7773 | 0.8260 | 0.8721 | 0.8749 | 0.9180 | 0.9708
 | SCAD-HSVM | 0.7842 | 0.8396 | 0.8862 | 0.8884 | 0.9372 | 0.9804
AUC | SVM | 0.7840 | 0.8170 | 0.8627 | 0.8608 | 0.9021 | 0.9335
 | LASSO-SVM | 0.8436 | 0.8916 | 0.9378 | 0.9290 | 0.9657 | 0.9902
 | MCP-HSVM | 0.8489 | 0.8801 | 0.9239 | 0.9239 | 0.9625 | 0.9969
 | SCAD-HSVM | 0.8502 | 0.8867 | 0.9328 | 0.9256 | 0.9626 | 0.9974
TPR | SVM | 0.6691 | 0.7075 | 0.7546 | 0.7602 | 0.8055 | 0.8645
 | LASSO-SVM | 0.7553 | 0.7988 | 0.8370 | 0.8463 | 0.8809 | 0.9500
 | MCP-HSVM | 0.7602 | 0.8068 | 0.8434 | 0.8512 | 0.8978 | 0.9581
 | SCAD-HSVM | 0.7595 | 0.7925 | 0.8525 | 0.8542 | 0.9102 | 0.9555
FPR | SVM | 0.2101 | 0.2342 | 0.2535 | 0.2571 | 0.2807 | 0.3097
 | LASSO-SVM | 0.1470 | 0.1710 | 0.1974 | 0.1960 | 0.2195 | 0.2456
 | MCP-HSVM | 0.1643 | 0.1868 | 0.2100 | 0.2121 | 0.2381 | 0.2634
 | SCAD-HSVM | 0.1683 | 0.1982 | 0.2239 | 0.2209 | 0.2457 | 0.2670

Table 3. Results of evaluation indicators for variable selection of LASSO-SVM, MCP-HSVM, and SCAD-HSVM.
Indicators | Methods | Min | 1st Qu | Median | Mean | 3rd Qu | Max
NV | LASSO-SVM | 5.0000 | 10.0000 | 17.0000 | 16.6400 | 24.2500 | 38.0000
 | MCP-HSVM | 6.0000 | 9.0000 | 15.0000 | 19.6000 | 25.7500 | 47.0000
 | SCAD-HSVM | 8.0000 | 9.0000 | 10.0000 | 10.2600 | 11.0000 | 13.0000
FNR | LASSO-SVM | 0.0176 | 0.1029 | 0.2294 | 0.2512 | 0.3247 | 0.5112
 | MCP-HSVM | 0.0765 | 0.1252 | 0.2000 | 0.2182 | 0.3059 | 0.4824
 | SCAD-HSVM | 0.1029 | 0.2294 | 0.2512 | 0.2824 | 0.3471 | 0.5647
FDR | LASSO-SVM | 0.0000 | 0.0764 | 0.2248 | 0.2679 | 0.3564 | 0.4578
 | MCP-HSVM | 0.0000 | 0.1193 | 0.2556 | 0.2880 | 0.3810 | 0.4610
 | SCAD-HSVM | 0.0000 | 0.1000 | 0.1577 | 0.1735 | 0.2662 | 0.3500
As shown in Table 2, in terms of the classification prediction, the ACC, AUC, and TPR values of LASSO-SVM, MCP-HSVM, and SCAD-HSVM are significantly higher than those of SVM while the FPR values are lower, indicating that the prediction accuracy and classification performance are significantly superior to those of traditional SVM methods. In contrast, the ACC, AUC, TPR, and FPR values of MCP-HSVM and SCAD-HSVM are basically slightly higher than those of LASSO-SVM, which indicates that the advantages of MCP-HSVM and SCAD-HSVM over the existing LASSO-SVM are not obvious in terms of the classification prediction.
As shown in Table 3, in terms of variable selection, the arithmetic means and medians of NV obtained by LASSO-SVM and MCP-HSVM are relatively close to the number of true significant variables (17.0000), whereas the maximum NV value of SCAD-HSVM is only 13.0000, which is obviously less than 17.0000, indicating that the omission of significant variables by SCAD-HSVM is more serious. In terms of FNR, its arithmetic means and medians obtained by LASSO-SVM and MCP-HSVM are basically slightly larger than 0.2000, which reveals that the two methods (especially MCP-HSVM) have a relatively low chance of making mistakes due to the omission of significant variables, so that most of the actual significant variables can be selected by applying the two methods. In terms of FDR, the results of SCAD-HSVM are significantly better than those of the other two methods, and the corresponding arithmetic mean and median of FDR are less than 0.2000, indicating that SCAD-HSVM is not likely to select insignificant variables, and therefore the classification prediction results are not easily affected by noisy variables.
3.3 Cases where variables are correlated
Here, the variable dimension p = 100 and the sample size n = 200. The predictors follow x~$\mathbb{N}$(0, Σ) with Σi,j = ρ (i, j = 1, 2, ···, 100), and the true values of the parameter β and the response variable y were constructed in the same way as in the previous setting. Experiments were carried out with ρ = 0.3, ρ = 0.5, and ρ = 0.8, following the previous experimental procedure. The experimental results are shown in Table 4 and Table 5.
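A short Python sketch of this correlated design is given below; the correlation structure used here (unit variances with a common off-diagonal correlation ρ) is an illustrative reading of the stated Σi,j = ρ, and the rest of the generation mirrors the uncorrelated case above.

```python
import numpy as np

rng = np.random.default_rng(2024)
n, p, rho = 200, 100, 0.5
# Assumed structure for illustration: unit variances, common off-diagonal correlation rho.
Sigma = np.full((p, p), rho)
np.fill_diagonal(Sigma, 1.0)
X = rng.multivariate_normal(mean=np.zeros(p), cov=Sigma, size=n)
# beta, z, and y are then constructed exactly as in the uncorrelated case.
```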

Table 4. Results of evaluation indicators for the classification prediction of each model under different correlations.
ρ | Indicators | SVM | LASSO-SVM | MCP-HSVM | SCAD-HSVM
ρ = 0.3 | ACC | 0.7758 | 0.8339 | 0.8436 | 0.8508
 | AUC | 0.8351 | 0.8802 | 0.9050 | 0.8956
 | TPR | 0.7290 | 0.7996 | 0.8306 | 0.8260
 | FPR | 0.2827 | 0.2103 | 0.2206 | 0.2230
ρ = 0.5 | ACC | 0.7496 | 0.8188 | 0.8348 | 0.8310
 | AUC | 0.7828 | 0.8650 | 0.8861 | 0.8853
 | TPR | 0.6939 | 0.7647 | 0.8245 | 0.8130
 | FPR | 0.3238 | 0.2357 | 0.2394 | 0.2345
ρ = 0.8 | ACC | 0.6760 | 0.7786 | 0.8200 | 0.8196
 | AUC | 0.7209 | 0.8212 | 0.8701 | 0.8635
 | TPR | 0.6331 | 0.7211 | 0.8135 | 0.7992
 | FPR | 0.3301 | 0.2567 | 0.2412 | 0.2451

Table 5. Means of evaluation indicators for variable selection of LASSO-SVM, MCP-HSVM, and SCAD-HSVM.
ρ | Indicators | LASSO-SVM | MCP-HSVM | SCAD-HSVM
ρ = 0.3 | NV | 15.8200 | 18.4800 | 14.2600
 | FNR | 0.2424 | 0.1612 | 0.2118
 | FDR | 0.2360 | 0.2057 | 0.1885
ρ = 0.5 | NV | 12.9000 | 16.0800 | 13.7600
 | FNR | 0.3117 | 0.2294 | 0.2941
 | FDR | 0.3298 | 0.2325 | 0.2916
ρ = 0.8 | NV | 10.2400 | 15.7600 | 12.9800
 | FNR | 0.3729 | 0.2800 | 0.3329
 | FDR | 0.1655 | 0.2397 | 0.3244
As shown in Table 4, the ACC, AUC, and TPR values of LASSO-SVM, MCP-HSVM, and SCAD-HSVM are significantly higher than those of traditional SVM, and the FPR values are significantly lower. This means that the three methods outperform traditional SVM in terms of the prediction accuracy and classifier performance, and as the correlation between variables increases, the advantage becomes more apparent. As for the comparison between MCP-HSVM, SCAD-HSVM, and LASSO-SVM, the advantages of MCP-HSVM and SCAD-HSVM in terms of the prediction accuracy and classifier performance are not obvious when the correlation between variables is low, but as the correlation between variables increases, the advantages become more and more obvious. This indicates that the classification prediction performance of MCP-HSVM and SCAD-HSVM is less influenced by the correlation between variables, especially for MCP-HSVM, which is almost unaffected by the correlation between variables.
As shown in Table 5, in terms of variable selection, the arithmetic means of NV selected by LASSO-SVM and SCAD-HSVM are significantly smaller than the number of true significant variables (17.0000) with a certain gap, and the gap becomes larger as the correlation between variables increases, especially for LASSO-SVM. Compared with the two methods, the arithmetic mean of NV of MCP-HSVM is always closer to the true value (17.0000). As the correlation between variables increases, the FNR values of all three methods increase, indicating that the increase in the correlation between variables leads to an increase in the number of missed significant variables. However, MCP-HSVM is affected in a limited way, and the arithmetic mean of its FNR under all three correlations is less than 0.3, indicating that no more than five significant variables are missing. In terms of FDR, there is no significant change in the arithmetic means of FDR of the three methods as the correlation between variables increases. Combining with the data in Table 3, when there is a linear correlation between variables, MCP-HSVM performs the best on variable selection, while LASSO-SVM performs the worst, and it is easy to miss the actual significant variables when there is a strong correlation between variables.
4 Prediction of financial distress for listed companies
4.1 Data sources and processing
The financial index system used in this paper (see Table 6) and its data were taken from Wind Information (http://www.wind.com.cn). A total of 3456 listed companies from A shares in Shanghai and Shenzhen were selected as the basic sample, and 72 companies were screened for financial distress in the past two years as positive examples.

Table 6. Financial indicators used in this paper.
Level 1 indicators | Level 2 indicators
Solvency | Current ratio: X1; Quick ratio: X2; Equity ratio: X3; Total equity/liabilities attributable to shareholders of the parent company: X4; Earnings before interest, taxes, depreciation, and amortization/total liabilities: X5; Net cash flow from operating activities/total liabilities: X6; Interest coverage ratio (earnings before interest and taxes/interest expense): X7
Indicators per share | Earnings per share (RMB): X8; Operating cash flow per share (RMB): X9; Net assets per share (net of minority interests) (RMB): X10
Revenue quality | Net income from operating activities/total profit (%): X11; Net gain from changes in value/total profit (%): X12; Net non-operating income and expenses/total profit (%): X13; Net income after deducting non-recurring gain and losses/net income (%): X14
Profitability | Gross profit margin on sales (%): X15; Net sales margin (%): X16; Net return on assets (%): X17; Net margin on total assets (%): X18
Operating capacity | Inventory turnover ratio (times): X19; Accounts receivable turnover ratio (times): X20; Current assets turnover ratio (times): X21; Fixed assets turnover ratio (times): X22; Total assets turnover ratio (times): X23
Capital structure | Asset-liability ratio (%): X24; Equity multiplier: X25; Current assets/total assets (%): X26; Non-current assets/total assets (%): X27; Current liabilities/total liabilities (%): X28; Non-current liabilities/total liabilities (%): X29
As shown in Table 6, a total of 29 variables were selected. We chose 72 listed companies that have been labeled as special treatment (ST) for two consecutive years in 2020 and 2021 as positive samples and the rest of listed companies as negative samples. The positive samples use the data two years prior to being labeled as ST, i.e., if the company was labeled as ST in 2020, we used the data of 2018, and if the company was labeled as ST in 2021, we used the data of 2019. The four steps of data processing are as follows:
1) Exclude companies listed after 2018.
2) Exclude missing values and outliers. The R package outliers was used to test the sample observations for outliers, and the observations identified as outliers were excluded. Samples with many missing values were also excluded.
3) Standardization. To eliminate the influence of differing magnitudes on the classification results and variable selection, the explanatory variables need to be standardized. In this paper, the Z-score method was used: the standardized value is obtained by subtracting the arithmetic mean from the original value and dividing by the standard deviation.
After the above steps, 1956 sample observations were finally obtained, including 72 positive sample points and 1884 negative sample points, which were used as the final sample data for modeling.
4) Data imbalance processing. After data pre-processing, the ratio of positive to negative cases is close to 1∶26; listed companies in financial distress are a small minority. Such imbalanced sample data pose a challenge to modeling: machine learning algorithms may be unstable on a severely unbalanced sample, and the prediction results may be biased. In this paper, double sampling was used, i.e., oversampling the positive cases and undersampling the negative cases at the same time, so that the ratio of positive to negative cases in the training set is close to 1∶1 and the total number of training samples is 602.
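A minimal Python sketch of the standardization and double-sampling steps is given below; the per-class size (chosen here so that the balanced training set has 602 samples) and the exact resampling scheme are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def zscore(X):
    """Z-score standardization: subtract the column mean and divide by the column standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def double_sample(X, y, n_per_class=301, seed=0):
    """Balance the classes by oversampling the minority (+1) and undersampling the majority (-1)."""
    rng = np.random.default_rng(seed)
    pos, neg = np.where(y == 1)[0], np.where(y == -1)[0]
    pos_idx = rng.choice(pos, size=n_per_class, replace=True)    # oversample positives with replacement
    neg_idx = rng.choice(neg, size=n_per_class, replace=False)   # undersample negatives without replacement
    idx = rng.permutation(np.concatenate([pos_idx, neg_idx]))
    return X[idx], y[idx]
```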
4.2 Analysis of financial distress prediction results
4.2.1 Selection of evaluation indicators
The processed data were randomly split 100 times, with 70% used as the training set for modeling and 30% as the test set for the out-of-sample prediction. Among the 586 sample points in the test set, there are only 21 positive samples. With such an unbalanced test set, a prediction accuracy of 96.3% would be obtained even if all the test samples were classified as negative cases, so the prediction accuracy alone is not a good measure of model effectiveness here. To better measure the accuracy of models on imbalanced samples, the commonly used approach is the AUC value, i.e., the area under the ROC curve; the larger the AUC value, the better the model. Moreover, for a financial distress prediction model of listed companies, the costs of the two types of classification errors differ greatly. Therefore, in addition to the AUC value, particular attention is paid to the TPR value, i.e., the percentage of positive samples that are correctly predicted.
4.2.2 Results of prediction and variable selection
The training set was modeled 100 times by traditional SVM, LASSO-SVM, MCP-HSVM, and SCAD-HSVM, and the average results are shown in Table 7. There is little difference in the average accuracy of the prediction of the four methods, but in terms of TPR and AUC, the latter three methods are significantly superior to traditional SVM, especially the MCP-HSVM and SCAD-HSVM methods.

Table 7. Prediction results of each method (average of 100 predictions).
Methods | ACC | AUC | TPR | NV
SVM | 0.8776 | 0.7676 | 0.7057 | /
LASSO-SVM | 0.8632 | 0.7967 | 0.7301 | 17.8
MCP-HSVM | 0.8652 | 0.8187 | 0.7682 | 23.6
SCAD-HSVM | 0.8754 | 0.8385 | 0.7494 | 22.1
As shown in Table 7, the three penalized methods give relatively consistent results in terms of variable selection. Considering the strong correlation between some indicators of the financial index system, and that the previous simulation experiments show the variable selection of MCP-HSVM to be slightly better than that of the other two methods when the variables are correlated, the results of MCP-HSVM were used as the basis for the analysis. The variable selection results were then rearranged: the selection frequency of each financial indicator over the 100 modeling runs of LASSO-SVM, MCP-HSVM, and SCAD-HSVM is shown in Table 8.

Table 8. Selection frequency of each financial indicator.
Variables | LASSO-SVM | MCP-HSVM | SCAD-HSVM | Variables | LASSO-SVM | MCP-HSVM | SCAD-HSVM
X1 | 0 | 15 | 11 | X16 | 100 | 100 | 100
X2 | 100 | 100 | 100 | X17 | 0 | 93 | 100
X3 | 40 | 50 | 45 | X18 | 100 | 100 | 100
X4 | 0 | 8 | 0 | X19 | 0 | 0 | 55
X5 | 100 | 100 | 100 | X20 | 100 | 100 | 100
X6 | 50 | 100 | 58 | X21 | 100 | 100 | 100
X7 | 0 | 0 | 5 | X22 | 100 | 100 | 100
X8 | 100 | 100 | 100 | X23 | 95 | 100 | 55
X9 | 100 | 100 | 100 | X24 | 91 | 95 | 100
X10 | 100 | 100 | 100 | X25 | 0 | 40 | 0
X11 | 92 | 95 | 100 | X26 | 100 | 100 | 100
X12 | 50 | 95 | 100 | X27 | 100 | 100 | 100
X13 | 0 | 0 | 3 | X28 | 85 | 95 | 100
X14 | 100 | 100 | 100 | X29 | 0 | 55 | 76
X15 | 0 | 51 | 0 | | | |
As shown in Table 8, among the seven indicators of solvency, the selection frequencies of X2, X5, and X6 are high, whereas those of the other indicators are relatively low; the indicators per share are selected in every modeling run; among the four indicators of revenue quality, the selection frequencies of X11, X12, and X14 are high, and only X13 is never selected; among the four indicators of profitability, the selection frequencies of X16, X17, and X18 are high, and the selection frequency of X15 is larger than 50%; among the five indicators of operating capacity, the selection frequencies of X20, X21, X22, and X23 are high, and X19 is never selected; among the six indicators of capital structure, the selection frequencies of X24, X26, X27, and X28 are high, while X25 and X29 are selected 40 times or more. Therefore, among the six aspects, the indicators per share have the greatest impact on the financial distress of listed companies, followed by the indicators of profitability and capital structure, and then the indicators of revenue quality and operating capacity. The indicators of solvency have the weakest impact, although three of them still have an impact on the financial distress of listed companies. The variables categorized by the selection frequency of MCP-HSVM are shown in Table 9.

Table 9. Classification of variables by the selection frequency of MCP-HSVM.
Selection frequency | Level 2 indicators
100 | X2 (quick ratio), X5 (earnings before interest, taxes, depreciation, and amortization/total liabilities), X6 (net cash flow from operating activities/total liabilities), X8 (earnings per share), X9 (operating cash flow per share), X10 (net assets per share), X14 (net income after deducting non-recurring gain and losses/net income), X16 (net sales margin), X18 (net margin on total assets), X20 (accounts receivable turnover ratio), X21 (current assets turnover ratio), X22 (fixed assets turnover ratio), X23 (total assets turnover ratio), X26 (current assets/total assets), and X27 (non-current assets/total assets)
90−100 | X11 (net income from operating activities/total profit), X12 (net gain from changes in value/total profit), X17 (return on net assets), X24 (asset-liability ratio), and X28 (current liabilities/total liabilities)
60−90 | None
40−60 | X3 (equity ratio), X15 (gross profit margin on sales), X25 (equity multiplier), and X29 (non-current liabilities/total liabilities)
20−40 | None
1−20 | X1 (current ratio) and X4 (total equity/liabilities attributable to shareholders of the parent company)
0 | X7 (interest coverage ratio (earnings before interest and taxes/interest expense)), X13 (net non-operating income and expenses/total profit), and X19 (inventory turnover ratio)
5 Conclusion
Through simulation experiments and model comparisons, it is concluded that two non-convex huberized-SVM methods, MCP-HSVM and SCAD-HSVM, outperform the traditional SVM methods in terms of the prediction accuracy and classifier performance. When there is no linear relationship between variables, there is no significant difference between MCP-HSVM, SCAD-HSVM, and the existing LASSO-SVM in terms of the prediction accuracy and classifier performance; in terms of variable selection, the numbers of variables selected by both LASSO-SVM and MCP-HSVM are closer to the number of real significant variables, but FDR of SCAD-HSVM is lower. When there is a linear correlation between variables, MCP-HSVM and SCAD-HSVM outperform SVM and LASSO-SVM in terms of the classification prediction and variable selection, especially when there is a high linear correlation between variables.
Based on this, the above methods were applied to the prediction of financial distress of listed companies in this paper. The prediction results show that, among the six aspects affecting the financial distress of listed companies, the indicators per share have the greatest influence, followed by the indicators of profitability and capital structure, and then the indicators of revenue quality and operating capacity; the indicators of solvency have the weakest influence. Through 100 modeling runs with the 29 indicators from the six aspects, 15 indicators were selected every time, and another 5 indicators were selected with a frequency between 90 and 100. Listed companies can assess their financial situation with these 20 indicators and give early warning of possible financial distress with higher precision.
Appendix: Supplementary Data
Supplementary data to this article can be found online at https://doi.org/10.1016/j.jnlest.2024.100246.
Disclosures
The authors declare no conflicts of interest.