Journal of Infrared and Millimeter Waves, Vol. 43, Issue 6, 847 (2024)

Deep plug-and-play self-supervised neural networks for spectral snapshot compressive imaging

Xing-Yu ZHANG1,3, Shou-Zheng ZHU1,3, Tian-Shu ZHOU1,3, Hong-Xing QI1,3, Jian-Yu WANG1,2,3, Chun-Lai LI1,2,3 **, and Shi-Jie LIU1,3 *
Author Affiliations
  • 1School of Physics and Optoelectronic Engineering, Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences, Hangzhou 310024, China
  • 2Key Laboratory of Space Active Opto-Electronics Technology, Shanghai Institute of Technical Physics, Chinese Academy of Sciences, Shanghai 200083, China
  • 3University of Chinese Academy of Sciences, Beijing 100049, China

    The coded aperture snapshot spectral imaging system, based on compressive sensing theory, can be regarded as an encoder that efficiently acquires compressed two-dimensional spectral data, which deep neural networks then decode into three-dimensional spectral data. However, training deep neural networks requires a large amount of clean data that is difficult to obtain. To address this shortage of training data, a self-supervised hyperspectral denoising neural network based on neighborhood sampling is proposed. This network is integrated into a deep plug-and-play framework to achieve self-supervised spectral reconstruction. The study also examines the impact of different noise degradation models on the final reconstruction quality. Experimental results demonstrate that the self-supervised learning method enhances the average peak signal-to-noise ratio by 1.18 dB and improves the structural similarity by 0.009 compared with the supervised learning method, while achieving better visual reconstruction results.

    Keywords

    Introduction

    Spectral imaging technology enables the simultaneous acquisition of both spectral and image information [1]. It is a crucial component of modern remote sensing, serving as a significant tool for both earth observation [2] and space exploration [3]. The three-dimensional spectral data cube contains a wealth of information, but its acquisition comes at higher cost, often necessitating multiple scans. In recent years, many advances have been made in the field of computational imaging [4]. The coded aperture snapshot spectral imager (CASSI) offers a solution by compressing hyperspectral images using compressed sensing theory [5]. This system modulates the spectral data cube, compresses it into two-dimensional measurements, and then reconstructs it with algorithms. CASSI has two physical structures, dual-disperser [6] and single-disperser [7]; the single-disperser structure has found applications in fields such as medicine [8] owing to its simpler physical design. The objective of compressed sensing image reconstruction is to restore the original high-dimensional signal $x \in \mathbb{R}^N$ from a reduced set of linear measurements $y \in \mathbb{R}^M$ [9]. Current compressed sensing reconstruction methods fall into two categories: model-driven algorithms based on regularization priors and data-driven algorithms based on deep networks. Classical model-driven algorithms include orthogonal matching pursuit [10], iterative hard thresholding [11-12], and block compressed sensing transforms [13]. While model-driven algorithms offer strong interpretability and generalization, they struggle with extensive spectral data, resulting in slower reconstruction and poorer quality. In recent years, deep learning has evolved rapidly, and some of its techniques have been applied to compressed sensing image reconstruction. Deep learning-based reconstruction algorithms have shown superior performance in both simulations and real-world scenarios [14].

    Xiong et al. pioneered the use of deep learning for hyperspectral compressive sensing reconstruction [15], improving reconstruction with convolutional neural networks and residual connections. Choi et al. developed a convolutional autoencoder to obtain a nonlinear spectral representation of hyperspectral images, combining the learned autoencoder prior and a total variation (TV) prior as a composite regularization term and solving the problem with the alternating direction method of multipliers [16]. Wang et al. designed HyperReconNet to reconstruct hyperspectral images by cascading spatial and spectral networks [17]. Miao et al. proposed λ-net for compressive sensing reconstruction of hyperspectral images and videos in two phases [18], using generative adversarial networks in the first phase and a U-net in the second to enhance reconstruction. Meng et al. proposed TSA-net, incorporating a spatial-spectral self-attention module into U-net and injecting shot noise during training, significantly improving reconstruction of real data [19]. Zheng et al. proposed a flexible plug-and-play (PnP) framework for hyperspectral image reconstruction, enhancing both quality and speed [20]. Wang et al. pioneered the application of the Transformer architecture to compressive sensing spectral imaging with the GAP-CSCoT network, maintaining high reconstruction quality while reducing runtime [21]. Chen et al. proposed a Proximal Gradient Descent Unfolding Dense-spatial Spectral-attention Transformer (PGDUF) that accelerates the training of Transformer-based models without affecting reconstruction quality [22]. Chen et al. combined low-rank subspace representations of hyperspectral images with deep neural networks to achieve better reconstruction and stronger interpretability [23]. Luo et al. proposed a Transformer-based HSI reconstruction method called dual-window multiscale Transformer (DWMT), a coarse-to-fine process that reconstructs the global properties of HSI with long-range dependencies while maintaining high reconstruction quality [24].

    By applying deep learning to compressive sensing reconstruction, high-quality and fast reconstruction can be achieved through powerful deep feature representation. However, most existing research on deep learning-based reconstruction algorithms is limited to supervised learning, restricting applicability in real-world scenarios: supervised approaches require a large number of clean images as labels, and acquiring clean images in the hyperspectral domain is expensive. In the realm of deep learning denoising, self-supervised learning has shown promise. When the noise has zero mean, the Noise2Noise method demonstrates that, for a clean scene x and two independently observed noisy images y and z, a denoising network trained with (y, z) pairs is equivalent to a network trained with (y, x) pairs. Neighbor2Neighbor extends Noise2Noise by downsampling a single noisy image and using the two noisy sub-images obtained from downsampling as model input and label, respectively. However, these self-supervised methods have only been tested on RGB or greyscale images; their performance on hyperspectral images has not been explored.

    In this paper, we propose a self-supervised compressive sensing hyperspectral image reconstruction algorithm based on the PnP method and the Neighbor2Neighbor strategy. First, we train a self-supervised hyperspectral denoising network using Neighbor2Neighbor, incorporating a channel attention mechanism to capture inter-spectral correlations in hyperspectral images. The denoising network is then embedded into a deep plug-and-play framework based on the alternating direction method of multipliers to achieve compressive sensing image reconstruction with the denoising model. We evaluate the algorithm's effectiveness in terms of both quantitative metrics and visual quality, and compare the effects of different noise degradation models on the final reconstruction results.

    The main contributions of this paper are as follows:

    (1) Based on Neighbor2Neighbor and SENet, we propose Self-HSIDeCNN, a self-supervised hyperspectral image denoising network, for subsequent compressed sensing image reconstruction.

    (2) Building on Self-HSIDeCNN and the PnP method, we propose the PnP-Self-HSIDeCNN method to implement self-supervised hyperspectral compressed sensing image reconstruction.

    (3) Through ablation experiments, we verify the effects of several hyperparameters in the PnP-Self-HSIDeCNN method on reconstruction speed and quality, laying a foundation for its practical application.

    The rest of this paper is organized as follows: Section 1 describes the mathematical model of CASSI; Section 2 introduces the mathematical principles of the PnP method and the Neighbor2Neighbor method, and presents the self-supervised denoising network Self-HSIDeCNN; Section 3 demonstrates the effectiveness of the proposed method through extensive experiments; Section 4 concludes the paper.

    1 Mathematical model of CASSI

    Let $X \in \mathbb{R}^{N_x \times N_y \times N_\lambda}$ be the spectral data cube, where $x$ and $y$ are the spatial dimensions and $\lambda$ is the spectral dimension. Let $M^* \in \mathbb{R}^{N_x \times N_y}$ be the physical mask of the CASSI system, which can be regarded as a matrix of size $N_x \times N_y$. Each element of this matrix obeys a 0-1 distribution with probability $p$ and is used to modulate the 3D spectral data cube. Let $X' \in \mathbb{R}^{N_x \times N_y \times N_\lambda}$ be the spectral data after passing through the mask, and for the $n_\lambda$-th band we have

    $X'_{:,:,n_\lambda} = X_{:,:,n_\lambda} \odot M^*$ , (1)

    where $\odot$ represents element-wise multiplication. After passing through the dispersion prism, the data cube $X'$ is shifted along the $y$-axis. Let the shifted signal be $X'' \in \mathbb{R}^{N_x \times (N_y + N_\lambda - 1) \times N_\lambda}$ and $\lambda_c$ be the reference wavelength; then

    $X''(u, v, n_\lambda) = X'(x, y + d(\lambda_n - \lambda_c), n_\lambda)$ , (2)

    where $(u, v)$ are the pixel coordinates in the detector plane, $\lambda_n$ is the wavelength of the $n_\lambda$-th band, $\lambda_c$ is the centre wavelength, and $d(\lambda_n - \lambda_c)$ denotes the spatial offset of the $n_\lambda$-th band. The measured value $y(u, v)$ at detector-plane position $(u, v)$ is then

    $y(u, v) = \int_{\lambda_{\min}}^{\lambda_{\max}} X''(u, v, n_\lambda)\, \mathrm{d}\lambda$ . (3)

    The detector receives signals from all bands and finally obtains a two-dimensional measurement $Y \in \mathbb{R}^{N_x \times (N_y + N_\lambda - 1)}$. Considering the noise $G$ during the measurement, we have

    $Y = \sum_{n_\lambda = 1}^{N_\lambda} X''_{:,:,n_\lambda} + G$ . (4)

    Let the shifted physical mask matrix (this matrix can be fixed or variable [25]) be $M \in \mathbb{R}^{N_x \times (N_y + N_\lambda - 1) \times N_\lambda}$ and the shifted data cube be $\tilde{F} \in \mathbb{R}^{N_x \times (N_y + N_\lambda - 1) \times N_\lambda}$; then

    $M(u, v, n_\lambda) = M^*(x, y + d(\lambda_n - \lambda_c))$ , (5)
    $\tilde{F}(u, v, n_\lambda) = F(x, y + d(\lambda_n - \lambda_c), n_\lambda)$ . (6)

    The final measurement $Y$ can then be expressed as [19]

    $Y = \sum_{n_\lambda = 1}^{N_\lambda} \tilde{F}_{:,:,n_\lambda} \odot M_{:,:,n_\lambda} + G$ . (7)

    The complete process is shown in Figure 1. The data obtained by CASSI is a two-dimensional measurement of the form of $Y$; it carries a large amount of information in a small storage footprint. With appropriate algorithms, it can be recovered into the original three-dimensional data cube.
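    The forward model above is straightforward to simulate. The following is a minimal NumPy sketch (the function name, toy array shapes, and the one-pixel-per-band dispersion step are our own illustrative choices, not taken from the paper):

```python
import numpy as np

def cassi_forward(cube, mask, step=1):
    """Simulate the CASSI forward model: mask modulation of each band,
    a dispersion shift of `step` pixels per band along y, and summation
    of all shifted bands on the detector."""
    nx, ny, nl = cube.shape
    meas = np.zeros((nx, ny + step * (nl - 1)))
    for k in range(nl):
        # modulate band k with the coded aperture, then place it at its
        # dispersed position before accumulating on the detector
        meas[:, k * step:k * step + ny] += cube[:, :, k] * mask
    return meas

rng = np.random.default_rng(0)
cube = rng.random((8, 8, 5))                      # toy 8 x 8 x 5 data cube
mask = (rng.random((8, 8)) < 0.5).astype(float)   # binary coded aperture
meas = cassi_forward(cube, mask)
print(meas.shape)                                  # (8, 12): Ny + Nl - 1 columns
```

    Measurement noise $G$ can be added to `meas` afterwards; in a real system the shift per band is determined by the disperser.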


    Figure 1. CASSI forward model

    2 Method

    2.1 PnP method

    The PnP method decomposes the original problem, which is difficult to solve directly, into subproblems that are easy to solve, with good results [26]. Hyperspectral compressive sensing image reconstruction can be formulated as the following optimization problem:

    $\arg\min_x \frac{1}{2}\|y - Ax\|_2^2 + \lambda R(x)$ , (8)

    where $y$ is the measured value, $x$ is the original signal, and $\lambda R(x)$ is the regularization term. The basic idea of the PnP method for inverse problems is to use a pretrained denoiser for the desired signal as a prior. The method decomposes the whole problem into easier subproblems and solves them alternately in an iterative manner; the denoising network serves as a flexible plug-in (i.e., it can be easily swapped) in this process. Specifically, problem (8) can be decomposed into the following subproblems using the alternating direction method of multipliers (ADMM) [27]:

    $x^{k+1} = \arg\min_x \frac{1}{2}\|Ax - y\|_2^2 + \frac{\rho}{2}\|x - z^k - u^k\|_2^2$ , (9)
    $z^{k+1} = \arg\min_z \lambda R(z) + \frac{\rho}{2}\|z - x^{k+1} + u^k\|_2^2$ , (10)
    $u^{k+1} = u^k + x^{k+1} - z^{k+1}$ , (11)

    where $z$ is the auxiliary variable, $u$ is the multiplier, $\rho$ is the penalty factor, and $k$ is the iteration index. Defining the proximal operator $\mathrm{prox}_g(v) = \arg\min_x g(x) + \frac{1}{2}\|x - v\|_2^2$, Eqs. (9)-(11) can be written in the following form:

    $x^{k+1} = \mathrm{prox}_{f/\rho}(z^k - u^k)$ , (12)
    $z^{k+1} = \mathrm{prox}_{\lambda R/\rho}(x^{k+1} + u^k)$ , (13)
    $u^{k+1} = u^k + x^{k+1} - z^{k+1}$ , (14)

    where $f(x) = \frac{1}{2}\|Ax - y\|_2^2$. Eq. (12) has a closed-form solution, and Eq. (13) can be viewed as a denoising prior. The final PnP-ADMM solution for compressive sensing hyperspectral reconstruction can be written as [28]

    $x^{k+1} = (z^k - u^k) + A^\top\left[\dfrac{y - A(z^k - u^k)}{\mathrm{Diag}(AA^\top) + \rho}\right]$ , (15)
    $z^{k+1} = \mathcal{D}_{\hat{\sigma}_k}(x^{k+1} + u^k)$ , (16)
    $u^{k+1} = u^k + x^{k+1} - z^{k+1}$ , (17)

    where $\hat{\sigma}_k^2 = \lambda/\rho$ is the estimated noise variance and $\mathcal{D}_{\hat{\sigma}_k}$ is the denoiser. It should be noted that the performance of the denoiser $\mathcal{D}_{\hat{\sigma}_k}$ directly affects the final reconstruction results. The denoiser used in this paper is a self-supervised hyperspectral denoising network, for the purpose of fully self-supervised image reconstruction. In practical application, the initial inputs consist of the 2D measurements acquired from the detector and the mask matrix. After several iterations with the pre-trained denoising network, the final reconstructed data cube is obtained. This process is shown in Fig. 2.
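    Because the CASSI sensing operator makes $AA^\top$ diagonal, the x-update is a cheap elementwise correction, so the whole iteration fits in a few lines. A minimal NumPy sketch of the loop (names and shapes are ours; the identity "denoiser" is only a placeholder for the network described later):

```python
import numpy as np

def pnp_admm_sci(y, phi, denoise, rho=0.01, iters=20):
    """PnP-ADMM loop for snapshot compressive imaging.
    y:       (Nx, Ny') 2D measurement
    phi:     (Nx, Ny', Nl) shifted-mask cube; A x = sum over bands of phi * x
    denoise: plug-in denoiser mapping a cube to a cube
    """
    phi_sum = (phi ** 2).sum(axis=2)      # diagonal of A A^T
    z = phi * y[..., None]                # initialise with A^T y
    u = np.zeros_like(z)
    for _ in range(iters):
        v = z - u
        av = (phi * v).sum(axis=2)        # A v
        # x-update: closed form thanks to the diagonal A A^T
        x = v + phi * ((y - av) / (phi_sum + rho))[..., None]
        z = denoise(x + u)                # z-update: plug-and-play denoising prior
        u = u + x - z                     # multiplier update
    return z

rng = np.random.default_rng(1)
truth = rng.random((4, 6, 3))
phi = (rng.random((4, 6, 3)) < 0.5).astype(float)
y = (phi * truth).sum(axis=2)
rec = pnp_admm_sci(y, phi, denoise=lambda c: c)   # identity denoiser for illustration
print(rec.shape)                                   # (4, 6, 3)
```

    With the identity denoiser the loop simply drives the measurement residual toward zero; plugging in a real denoiser adds the prior of Eq. (16).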


    Figure 2. PnP image reconstruction framework

    2.2 Neighbor2Neighbor method

    The most critical part of the PnP-ADMM solution for compressive sensing hyperspectral reconstruction is Eq. (16), which can be approximated using a deep learning model. Specifically, it can be solved using deep learning image denoising models, but these are mostly trained by supervised learning. There are two primary unsupervised deep learning denoising approaches. One relies on the deep image prior, which requires no training but leads to long reconstruction times. The other is based on the Noise2Noise concept, which still needs training but can significantly reduce the reconstruction time.

    The core idea of Noise2Noise is that, for an unobserved clean scene $x$ and two observed independent noisy images $y$ and $z$, a denoising network trained with $(y, z)$ pairs is equivalent to a network trained with $(y, x)$ pairs, provided the noise has zero mean [29]. The optimisation objective of Noise2Noise is

    $\arg\min_\theta \mathbb{E}_{x,y,z}\|f_\theta(y) - z\|_2^2$ , (18)

    where $f_\theta(\cdot)$ is the denoising network. Noise2Noise requires at least two separate noisy images for each scene, which is difficult to satisfy in real scenes. To increase its practical value, the theory of Noise2Noise has been extended: for a single noisy image, one possible way to construct two similar but not identical images is downsampling [30].

    The Neighbor2Neighbor downsampling idea is shown in Fig. 3. For a greyscale image $y$ of size $H \times W$, divide it into $\frac{H}{2} \times \frac{W}{2}$ pixel blocks of size $2 \times 2$. Two separate pixels are then randomly sampled from each pixel block and combined to form two subsampled images $g_1(y)$ and $g_2(y)$ of size $\frac{H}{2} \times \frac{W}{2}$. At this point Eq. (18) becomes

    $\arg\min_\theta \mathbb{E}_{x,y}\|f_\theta(g_1(y)) - g_2(y)\|_2^2$ . (19)
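    A minimal NumPy sketch of this sampler (our own implementation of the idea, for a single-channel image; the paper applies it band-wise to the spectral cube):

```python
import numpy as np

def neighbor_downsample(y, rng):
    """Split an H x W image into 2x2 blocks and draw two *different*
    pixels from each block, producing two H/2 x W/2 sub-images that
    see the same scene but independent noise realisations."""
    h, w = y.shape
    blocks = (y.reshape(h // 2, 2, w // 2, 2)
                .transpose(0, 2, 1, 3)
                .reshape(h // 2, w // 2, 4))
    idx1 = rng.integers(0, 4, size=(h // 2, w // 2))
    idx2 = (idx1 + rng.integers(1, 4, size=idx1.shape)) % 4  # always != idx1
    g1 = np.take_along_axis(blocks, idx1[..., None], axis=2)[..., 0]
    g2 = np.take_along_axis(blocks, idx2[..., None], axis=2)[..., 0]
    return g1, g2

rng = np.random.default_rng(0)
noisy = rng.normal(size=(8, 8))
g1, g2 = neighbor_downsample(noisy, rng)
print(g1.shape, g2.shape)   # (4, 4) (4, 4)
```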


    Figure 3. Image downsampling method

    Ref. [30] demonstrated that using Eq. (19) directly as the optimisation objective yields denoising results that are too smooth. Therefore, a penalty term is added to Eq. (19) to obtain the final optimisation objective

    $\min_\theta \mathbb{E}_{x,y}\|f_\theta(g_1(y)) - g_2(y)\|_2^2 + \gamma\, \mathbb{E}_{x,y}\|f_\theta(g_1(y)) - g_2(y) - g_1(f_\theta(y)) + g_2(f_\theta(y))\|_2^2$ . (20)

    2.3 Hyperspectral denoising network

    Deep denoising models for greyscale and RGB images have matured considerably in recent years. We propose a deep plug-and-play self-supervised hyperspectral image denoising network (Self-HSIDeCNN). FFDNet is chosen as the base framework for our model; it is flexible and fast and has already shown excellent performance in denoising RGB and greyscale images [31]. We use FFDNet as the backbone network for two main reasons. First, compared with U-net, a commonly used backbone in image processing, FFDNet is designed specifically for image denoising, and its noise level map can be varied during training so that the model flexibly handles various noise levels. Second, compared with Transformer architectures, which have stronger feature extraction capability, FFDNet trains and infers faster, and its downsampling module effectively reduces the model's demand on computational resources. The structure of our neural network is shown in Figure 4.


    Figure 4. Self-HSIDeCNN network architecture

    In frame-level denoising, for the $n_\lambda$-th band of the spectral data cube $X \in \mathbb{R}^{N_x \times N_y \times N_\lambda}$, in addition to inputting this band of size $N_x \times N_y$ into the neural network, the images of its $K$ neighbouring bands are also input (in this paper we take $K = 6$). Following the sampling strategy in Section 2.2, two subgraphs of these $K+1$ band images are obtained, both of size $(K+1) \times \frac{N_x}{2} \times \frac{N_y}{2}$. One subgraph $y_1$ is used as the model input, and the other, $y_2$, is used as the label to compute the loss function. To speed up model training, subgraph $y_1$ is downsampled once more after entering the model. Together with the noise level estimation map, the final model input has size $(4K+5) \times \frac{N_x}{4} \times \frac{N_y}{4}$.

    To better capture the correlation between different spectral bands of hyperspectral images, an inter-channel attention structure, SENet, is added after every two convolutional layers in the neural network to form a SEBlock. This structure assigns a separate weight to each channel feature map; it increases the number of parameters but enhances the non-local feature extraction capability of the model. The SENet structure is shown in Fig. 5. A total of seven SEBlocks are used in this paper, each with $3 \times 3$ convolutional kernels and ReLU activation functions (the output of the last SEBlock does not use an activation function). After the seven SEBlocks, the feature map is restored to size $(K+1) \times \frac{N_x}{2} \times \frac{N_y}{2}$ by upsampling, and the loss function is then computed against the subsampled image $y_2$. The noise level $\sigma$ in the noise level estimation map is randomly varied during training, allowing the model to flexibly adapt to different noises.
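    The SEBlock described above can be sketched in PyTorch as follows (channel count, reduction ratio, and exact layer layout here are our illustrative assumptions, not the exact Self-HSIDeCNN configuration):

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Two 3x3 conv layers followed by SENet-style channel attention:
    squeeze (global average pooling) then excite (two 1x1 convs and a
    sigmoid) to produce one weight per channel feature map."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.se = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        feat = self.body(x)
        return feat * self.se(feat)   # per-channel reweighting

x = torch.randn(2, 8, 16, 16)
out = SEBlock(8)(x)
print(out.shape)                      # torch.Size([2, 8, 16, 16])
```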


    Figure 5. SENet architecture

    After obtaining the hyperspectral denoising model, we can train it with the Neighbor2Neighbor self-supervised learning method. According to the optimization objective in Eq. (20), the loss function is designed as

    $\mathcal{L} = \mathcal{L}_{\mathrm{rec}} + \gamma \mathcal{L}_{\mathrm{reg}} = \|f_\theta(g_1(y)) - g_2(y)\|_2^2 + \gamma \|f_\theta(g_1(y)) - g_2(y) - g_1(f_\theta(y)) + g_2(f_\theta(y))\|_2^2$ , (21)

    where $g_1(\cdot)$ and $g_2(\cdot)$ are the stochastic downsampling functions and $f_\theta(\cdot)$ is the denoising neural network. $\gamma$ is a penalty weight (fixed at 5 in this paper) used to balance the level of detail preserved in the denoising results. To keep the gradient stable, the gradients of $g_1(f_\theta(y))$ and $g_2(f_\theta(y))$ are not propagated during training. The training process is shown in Fig. 6. In the inference stage, the noisy image is simply input to the model to obtain the denoised image directly.
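    As a sketch, the loss and the gradient stop can be written in PyTorch like this (the fixed strided-slice downsamplers and the identity placeholder network are our own stand-ins for the random pair sampler and denoiser described in the text):

```python
import torch

def n2n_loss(f, y, g1, g2, gamma=5.0):
    """Neighbor2Neighbor loss: reconstruction term plus gamma-weighted
    regularizer; the gradient through the full-image branch f(y) is
    stopped, matching the training rule described in the text."""
    out = f(g1(y))
    diff = out - g2(y)
    with torch.no_grad():              # no gradient through f(y)
        fy = f(y)
    rec = (diff ** 2).mean()
    reg = ((diff - g1(fy) + g2(fy)) ** 2).mean()
    return rec + gamma * reg

g1 = lambda t: t[..., ::2, ::2]        # fixed downsamplers for illustration
g2 = lambda t: t[..., 1::2, 1::2]
y = torch.rand(2, 1, 8, 8)
loss = n2n_loss(lambda t: t, y, g1, g2)   # identity "denoiser" placeholder
print(loss.item() >= 0)                    # True
```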


    Figure 6. Self-supervised training process

    Denoising models based on supervised learning cannot be trained in the absence of clean images as labels, and for compressed sensing image reconstruction models trained with supervised learning, noisy training data degrades the model (we discuss this in a subsequent section). As can be seen from Figure 6, the self-supervised learning method needs only noisy images to complete training, without clean images as labels. This is a significant advantage in the hyperspectral domain, where noise-free data is very expensive to obtain.

    3 Experiments and results

    Our proposed method consists of two steps. First, we train the hyperspectral denoising network with the self-supervised approach. Then, this network is embedded directly into the PnP framework for self-supervised image reconstruction. In this section, we first describe the dataset and implementation details. Then, to evaluate its effectiveness, the proposed method is compared with models trained by supervised learning. Furthermore, ablation studies are conducted to analyze the effect of hyperparameters on the results.

    3.1 Training details

    Model training was performed on the CAVE hyperspectral dataset, consisting of 32 scenes, each with a resolution of 512 × 512 and 31 bands from 400 nm to 800 nm. Five scenes were selected as the test set and the remaining scenes as the training set. The RGB images of the five test scenes are shown in Fig. 7. To increase the training data, the dataset was randomly cropped and augmented (including flipping, rotating, mirroring, and combinations of these operations). A total of 33,480 samples of size 256 × 256 × 7 were finally obtained for training. For the five test scenes, the data were spatially downsampled to 256 × 256 × 31 data cubes. The code was implemented in the PyTorch framework and trained for 500 epochs with the Adam optimiser. The initial learning rate was set to 0.0001 and multiplied by 0.5 after every 100 epochs. Training the entire network took approximately 8 hours on a machine equipped with an Intel Xeon CPU, 360 GB of memory, and four Nvidia RTX 4090 Ti GPUs with 24 GB of VRAM each.


    Figure 7. RGB images of the test scenes

    3.2 Denoising results

    Before comparing the reconstruction outcomes, it is essential to assess both the supervised and the self-supervised denoising results, to investigate the impact of the Neighbor2Neighbor self-supervised learning strategy on denoising. The identical model, parameters, and training data are employed here, differing solely in the loss function and back-propagated gradients. For the supervised learning approach, the loss function is

    $\mathcal{L} = \|f_\theta(y) - x\|_2^2$ . (22)

    For the self-supervised learning approach, the loss function is given by Eq. (21).

    With a maximum pixel value of 255, let the Gaussian noise standard deviation be $\sigma$ and the Poisson noise intensity be $\lambda$. We compare four different noise degradation models: fixed Gaussian noise ($\sigma = 25$), ranged Gaussian noise ($\sigma \in [5, 50]$), fixed Poisson noise ($\lambda = 30$), and ranged Poisson noise ($\lambda \in [5, 50]$). Peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) are used as evaluation metrics. The metrics of the five test scenes were averaged; the final results are shown in Table 1.

    • Table 1. Comparison of denoising results

      | Noise type | Supervised model | Self-supervised model |
      | --- | --- | --- |
      | $\sigma = 25$ | 38.12 dB, 0.951 | 39.33 dB, 0.978 |
      | $\sigma \in [5, 50]$ | 37.63 dB, 0.924 | 39.16 dB, 0.969 |
      | $\lambda = 30$ | 38.28 dB, 0.964 | 40.25 dB, 0.983 |
      | $\lambda \in [5, 50]$ | 38.25 dB, 0.966 | 39.86 dB, 0.976 |
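    For reference, the PSNR values reported in these tables follow the standard definition with a peak value of 255 (the helper name below is ours; SSIM is more involved and is typically taken from a library such as scikit-image):

```python
import numpy as np

def psnr(ref, est, peak=255.0):
    """Peak signal-to-noise ratio in dB between a reference image and
    an estimate, both on a 0..peak scale."""
    mse = np.mean((np.asarray(ref, float) - np.asarray(est, float)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

ref = np.zeros((4, 4))
est = ref + 25.5          # uniform error of 25.5 on a 0..255 scale
print(psnr(ref, est))     # 20.0
```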

    The self-supervised model, utilizing the Neighbor2Neighbor strategy, demonstrates superior performance to the supervised model across the various noise degradation models. This superiority arises from the slight disparity $\varepsilon$ between the labels of the self-supervised model and the desired denoising results of the model inputs. Additionally, with the penalty term in the loss function, the model generalizes better and performs better on the test set.

    The hyperparameter $\gamma$ is used to avoid overly smooth denoising results. When $\gamma = 0$, the penalty term in the loss function vanishes. In that case, $f_\theta(g_1(y))$ and $f_\theta(g_2(y))$ are not exactly the same, and the model tends to output their average, because the loss function attains its minimum there. To verify the influence of the hyperparameter $\gamma$ on the denoising results, the noise in the training data is kept unchanged (Gaussian noise with $\sigma = 25$ and Poisson noise with $\lambda = 30$) and only $\gamma$ is varied; the denoising performance of the resulting models on the test set is shown in Table 2.

    • Table 2. The influence of hyperparameter γ on denoising results

      | $\gamma$ | PSNR / dB ($\sigma = 25$) | SSIM ($\sigma = 25$) | PSNR / dB ($\lambda = 30$) | SSIM ($\lambda = 30$) |
      | --- | --- | --- | --- | --- |
      | 0 | 39.27 | 0.970 | 40.26 | 0.983 |
      | 1 | 39.23 | 0.962 | 40.23 | 0.981 |
      | 2 | 39.26 | 0.965 | 40.34 | 0.986 |
      | 5 | 39.33 | 0.978 | 40.25 | 0.983 |
      | 20 | 39.30 | 0.977 | 39.89 | 0.979 |

    For Gaussian noise, consider first the results at $\gamma = 0$ and $\gamma = 1$. When $\gamma = 0$, the penalty term is absent, the denoising result is over-smoothed, and the performance of Self-HSIDeCNN is not optimal. However, because of the downsampling and upsampling modules in Self-HSIDeCNN, the excessive smoothing occurs mainly in the downsampled subgraphs; after the subgraphs are upsampled, the high-frequency information in the final output image remains fairly well preserved. This explains why the denoising performance becomes worse at $\gamma = 1$ (the penalty term makes the denoising inadequate). At $\gamma = 5$, although a small amount of noise is left unremoved as $\gamma$ increases, the original image information is better recovered than at $\gamma = 0$. At $\gamma = 20$, more noise is retained along with the high-frequency detail, so the denoising performance worsens again.

    For Poisson noise, the trend is similar to that of Gaussian noise, but the optimum is reached at $\gamma = 2$. This is because the model denoises Poisson noise more effectively, so only a smaller $\gamma$ is needed to alleviate the over-smoothing caused by the self-supervised learning method.

    In this paper we fix $\gamma = 5$ for the experiments; in practical applications, the value of $\gamma$ should be chosen according to the scene characteristics and experimental results.

    3.3 Reconstruction results

    The self-supervised model Self-HSIDeCNN of Section 3.2, trained with Neighbor2Neighbor, can be embedded directly in the deep plug-and-play architecture for compressive sensing image reconstruction; we name the result PnP-Self-HSIDeCNN. Deep plug-and-play frameworks often require a warm start to speed up convergence. Here, the GAP-TV denoiser is used for the first 90 iterations, after which the denoiser is switched to the self-supervised model for better reconstruction results. In addition, the estimated noise level $\sigma$ at reconstruction time is set to 30 regardless of the noise degradation model (normalised to 0 to 1); as the iterations progress, this parameter can be gradually decreased to enhance the reconstruction quality. We also employ a reference reconstruction algorithm named PnP-HSI, which uses a denoising model trained by supervised learning within the PnP framework.

    To compare with end-to-end neural networks, we selected the U-net and TSA-net models. U-net consists of two main components, an encoder and a decoder; each encoding block contains two 3×3 convolutional layers with ReLU activation and a 2×2 max-pooling layer. TSA-net directly uses the structure from Ref. [19] without further changes. The CAVE hyperspectral dataset is again used, randomly cropped to 256 × 256 in the spatial dimension for data augmentation, yielding 5000 training samples of size 256 × 256 × 31. The CASSI forward model was simulated on each data cube before input to the model, so the final model input was a 2D measurement of size 256 × 316. It is important to emphasize that our primary aim is to investigate the use of self-supervised learning in compressive sensing image reconstruction and to analyze the influence of various noise degradation models on the reconstruction outcomes. Consequently, we refrain from comparing against current state-of-the-art models, as those are trained in a supervised manner.

    The final comparison results are shown in Table 3. In terms of reconstruction quality, the self-supervised reconstruction model proposed in this paper outperforms both U-net and TSA-net in PSNR and SSIM. It can also be seen that the best results are achieved by the self-supervised model trained with fixed Gaussian noise. This is because the mathematical form of $\mathcal{D}_{\hat{\sigma}_k}$ in Eq. (13) corresponds to a fixed Gaussian denoiser. Unless otherwise stated, the PnP-Self-HSIDeCNN used in the subsequent comparison experiments was trained with fixed Gaussian noise ($\sigma = 25$).

    • Table 3. Comparison of reconstruction results

      | | U-net | TSA-net | PnP-HSI | Ours ($\sigma = 25$) | Ours ($\sigma \in [5, 50]$) | Ours ($\lambda = 30$) | Ours ($\lambda \in [5, 50]$) |
      | --- | --- | --- | --- | --- | --- | --- | --- |
      | Scene1 | 26.29 dB, 0.843 | 26.47 dB, 0.855 | 29.56 dB, 0.875 | 31.12 dB, 0.892 | 30.44 dB, 0.843 | 28.34 dB, 0.788 | 28.51 dB, 0.798 |
      | Scene2 | 37.20 dB, 0.941 | 36.98 dB, 0.935 | 37.59 dB, 0.959 | 39.62 dB, 0.959 | 37.01 dB, 0.921 | 37.55 dB, 0.963 | 38.39 dB, 0.966 |
      | Scene3 | 33.87 dB, 0.918 | 35.07 dB, 0.917 | 36.27 dB, 0.924 | 35.89 dB, 0.916 | 34.31 dB, 0.888 | 34.14 dB, 0.908 | 34.25 dB, 0.909 |
      | Scene4 | 32.99 dB, 0.910 | 33.16 dB, 0.929 | 34.88 dB, 0.933 | 35.22 dB, 0.928 | 34.13 dB, 0.870 | 32.96 dB, 0.910 | 33.02 dB, 0.903 |
      | Scene5 | 20.11 dB, 0.788 | 21.14 dB, 0.794 | 21.55 dB, 0.797 | 23.89 dB, 0.838 | 23.94 dB, 0.808 | 23.06 dB, 0.810 | 22.70 dB, 0.802 |
      | Mean | 30.09 dB, 0.880 | 30.56 dB, 0.886 | 31.97 dB, 0.898 | 33.15 dB, 0.907 | 31.97 dB, 0.866 | 31.21 dB, 0.876 | 31.37 dB, 0.876 |

    Fig. 8 shows the visual results for one test scene. The left side shows the RGB image of the scene together with the simulated 2D measurement, while the right side shows the reconstruction results of the various algorithms. Fig. 9 presents an enlarged view of the last-band image on the right side of Fig. 8, to facilitate a visual comparison of reconstruction quality.


    Figure 8. Comparison of the visual effect of the reconstruction results of different algorithms: (a) RGB image; (b) 2D measurements; (c) ground truth; (d) U-net; (e) TSA-net; (f) PnP-HSI; (g) PnP-Self-HSIDeCNN


    Figure 9. Comparison of visual reconstruction effects in the final band: (a) ground truth; (b) U-net; (c) TSA-net; (d) PnP-HSI; (e) PnP-Self-HSIDeCNN

    In the evaluation of hyperspectral images, we consider not only spatial metrics but also spectral metrics. Fig. 10 illustrates the reconstructed spectral profile at the central position of this scene. For the spectral curves reconstructed by the different algorithms in Fig. 10(b), we use the Spectral Angle Mapper (SAM) to evaluate their similarity to the real spectral curve; the results are shown in the legend of Fig. 10(b). As can be seen, the spectral curve reconstructed by PnP-Self-HSIDeCNN is the most similar to the real one.
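    The SAM metric is simply the angle between two spectra viewed as vectors; a minimal sketch (the helper name is ours):

```python
import numpy as np

def spectral_angle(a, b):
    """Spectral Angle Mapper: angle (radians) between two spectra.
    0 means identical spectral shape regardless of overall intensity."""
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))

s = np.array([0.2, 0.5, 0.9])
print(spectral_angle(s, 3.0 * s))   # ~0: same shape, different scale
```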


    Figure 10. Comparison of spectral curve reconstruction results: (a) RGB image; (b) spectral curve comparison

    Based on the aforementioned metrics, the self-supervised reconstruction algorithm proposed in this paper, built on the deep plug-and-play framework, yields superior results across various metrics. In terms of runtime, however, the end-to-end neural networks U-net and TSA-net hold a significant advantage, completing reconstruction in less than 1 second after training, whereas algorithms using deep plug-and-play frameworks require several minutes due to the iterative process. Nevertheless, minute-level runtimes are generally acceptable in practical applications. Moreover, the method circumvents the need for clean data as labels during training, substantially mitigating the problem of inadequate training data for compressive sensing deep learning reconstruction models in real-world scenarios.

    3.4 Comparison of generalisability

    In real-world application scenarios, data often contain noise, while clean data with minimal noise are ideally preferred. In our proposed self-supervised reconstruction model, noisy spectral data cubes can be acquired from real scenarios and used directly for training, ensuring generalization. For supervised deep learning models, however, training becomes challenging due to the absence of clean images as labels. One approach is to train directly with noisy images as labels, but this typically degrades model performance. To evaluate the generalizability of the models, Gaussian noise was added to the CAVE dataset to simulate real-world conditions, and U-net and TSA-net were retrained and tested on this noisy data. In contrast, the self-supervised networks, which inherently incorporate noise during training, can be tested directly on the noisy data. The resulting performance is summarized in Table 4.
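    The noisy test set can be generated by perturbing each normalized data cube with zero-mean Gaussian noise at σ=25 on the 0-255 scale; a sketch, with the function name and clipping convention our own:

```python
import numpy as np

def add_gaussian_noise(cube, sigma=25.0, seed=0):
    """Add zero-mean Gaussian noise (sigma given on the 0-255 scale) to a [0, 1] cube."""
    rng = np.random.default_rng(seed)
    noisy = cube + rng.normal(0.0, sigma / 255.0, size=cube.shape)
    return np.clip(noisy, 0.0, 1.0)  # keep values in the valid intensity range
```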

      Table 4. Comparison of generalisability (σ=25; PSNR, SSIM)

      Scene | U-net | TSA-net | PnP-HSI | Ours
      Scene1 | 25.89 dB, 0.824 | 25.88 dB, 0.838 | 29.52 dB, 0.873 | 31.13 dB, 0.890
      Scene2 | 36.41 dB, 0.931 | 35.49 dB, 0.927 | 37.54 dB, 0.957 | 39.59 dB, 0.957
      Scene3 | 31.78 dB, 0.887 | 33.80 dB, 0.904 | 36.23 dB, 0.919 | 35.84 dB, 0.912
      Scene4 | 31.03 dB, 0.901 | 32.03 dB, 0.917 | 34.80 dB, 0.934 | 35.21 dB, 0.929
      Scene5 | 20.14 dB, 0.773 | 20.50 dB, 0.787 | 21.56 dB, 0.798 | 23.84 dB, 0.834
      Mean | 29.05 dB, 0.863 | 29.54 dB, 0.874 | 31.93 dB, 0.899 | 33.12 dB, 0.904
      Difference | -1.04 dB, -0.017 | -1.02 dB, -0.012 | -0.04 dB, -0.001 | -0.03 dB, -0.003
      Percentage | -3.46%, -1.93% | -3.34%, -1.35% | -0.13%, -0.11% | -0.09%, -0.33%

    From Table 4, it is evident that the performance of all algorithms declines when tested on noisy data. However, the degradation for the two algorithms based on the deep plug-and-play framework is considerably smaller than that of the two end-to-end neural networks, indicating that the self-supervised learning model is more robust and generalizes better. Building on these findings, the self-supervised reconstruction algorithm proposed in this article, based on the deep plug-and-play framework, demonstrates superior performance across multiple indicators. Importantly, this method does not require clean data as labels for training, thereby alleviating the challenge of insufficient training data for compressive sensing deep learning reconstruction models in practical applications.
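    The Difference and Percentage rows of Table 4 are simply the absolute and relative drop of each method's mean score when moving from clean to noisy test data. As a quick check (the helper name is ours; U-net's clean-data mean PSNR is recovered from the table as 29.05 + 1.04 = 30.09 dB):

```python
def robustness_drop(clean_mean, noisy_mean):
    """Absolute drop and relative drop (in percent) of a mean metric under noise."""
    drop = noisy_mean - clean_mean
    return drop, 100.0 * drop / clean_mean

# U-net mean PSNR: 30.09 dB on clean data vs 29.05 dB on noisy data (Table 4).
drop, pct = robustness_drop(30.09, 29.05)
```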

    3.5 Ablation experiment

    3.5.1 Hyperparameterisation γ

    In Section 3.2, we verified the effect of the hyperparameter γ on the model's denoising results. Here, we explore its impact on hyperspectral image reconstruction. Using the denoising model from Section 3.2 directly and keeping the other parameters of the PnP-Self-HSIDeCNN algorithm unchanged, we present the reconstruction results in Table 5.

      Table 5. Influence of hyperparameter γ on reconstruction results (PSNR/dB, SSIM)

      Scene | γ=0 | γ=1 | γ=2 | γ=5 | γ=20
      Scene1 | 30.92, 0.862 | 30.42, 0.862 | 30.92, 0.878 | 29.70, 0.785 | 29.89, 0.855
      Scene2 | 37.19, 0.921 | 38.54, 0.940 | 40.19, 0.951 | 35.50, 0.921 | 37.67, 0.944
      Scene3 | 33.93, 0.888 | 36.28, 0.908 | 36.36, 0.913 | 32.74, 0.888 | 34.28, 0.892
      Scene4 | 33.98, 0.883 | 35.07, 0.899 | 35.47, 0.894 | 33.19, 0.861 | 35.38, 0.916
      Scene5 | 22.97, 0.769 | 23.86, 0.833 | 23.80, 0.843 | 23.29, 0.791 | 23.18, 0.803
      Mean | 31.79, 0.865 | 32.83, 0.888 | 33.35, 0.896 | 30.88, 0.849 | 32.08, 0.882

    For scenes 1 to 4, an appropriate penalty term can enhance the model's reconstruction quality. However, excessively large values of γ degrade the model's denoising effect (even below the no-penalty baseline), which in turn affects the final hyperspectral compressive sensing reconstruction. For scene 5, by contrast, the model performs better whenever γ is nonzero; we attribute this to scene 5's rich colour information and more complex spatial-spectral curves, for which the penalty term provides useful regularization. Comparing Tables 2 and 5 reveals that the model's performance on the denoising task and on the image reconstruction task are not fully correlated: on the denoising task the model is best at γ=5, whereas on the reconstruction task it is best at γ=2. In summary, the value of γ should be determined according to the specific usage scenario.
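    The penalty factor γ weights a regularization term in the Neighbor2Neighbor-style self-supervised loss used for training. The sketch below illustrates the idea with a fixed 2x2 neighbor-subsampling pattern and a generic denoise callable; the function names and the fixed (rather than random) sampling pattern are our simplifications, not the paper's exact implementation:

```python
import numpy as np

def neighbor_subsample(img):
    """Split each 2x2 block into two neighboring sub-images (fixed pattern for illustration)."""
    return img[0::2, 0::2], img[0::2, 1::2]

def self_supervised_loss(denoise, noisy, gamma=2.0):
    """Data term between a denoised sub-image and its noisy neighbor, plus a
    gamma-weighted consistency penalty against full-image denoising."""
    g1, g2 = neighbor_subsample(noisy)
    residual = denoise(g1) - g2
    # In training this branch is treated as a constant (no gradient flows through it).
    d1, d2 = neighbor_subsample(denoise(noisy))
    penalty = residual - (d1 - d2)
    return float((residual ** 2).mean() + gamma * (penalty ** 2).mean())
```

    With γ=0 this reduces to a pure Noise2Noise-style loss between neighboring pixels; a larger γ pushes the sub-image denoising toward consistency with full-image denoising, matching the trade-off observed in Table 5.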

    3.5.2 Noise estimation level σ

    In the aforementioned comparisons, the estimated noise level σ is set to 30 irrespective of the noise degradation model. This choice allows the model to be used for reconstruction directly, without a tedious parameter tuning process. Consequently, the reconstruction results of the PnP-Self-HSIDeCNN algorithm reported above represent expected performance in practical use rather than optimal performance. To explore the model's optimal performance, Gaussian fixed noise (σ=25) is used during training, and two noise estimation strategies are employed during testing: a fixed noise level (100, 50, 30, and 5, respectively) and a dynamically estimated level that gradually decreases as the iterations progress. The experimental results are presented in Table 6 and Table 7.

      Table 6. Influence of noise estimation level σ on reconstruction results (PSNR/dB, SSIM)

      Scene | σ=100 | σ=50 | σ=30 | σ=5 | σ∈(0, 50] (dynamic)
      Scene1 | 27.57, 0.750 | 30.21, 0.852 | 31.00, 0.888 | 31.28, 0.897 | 32.01, 0.899
      Scene2 | 33.56, 0.921 | 38.16, 0.941 | 39.38, 0.955 | 40.18, 0.973 | 40.32, 0.977
      Scene3 | 32.00, 0.888 | 34.69, 0.888 | 35.77, 0.911 | 36.20, 0.928 | 36.83, 0.932
      Scene4 | 31.87, 0.861 | 34.59, 0.912 | 35.13, 0.927 | 35.50, 0.929 | 36.14, 0.941
      Scene5 | 21.58, 0.749 | 22.97, 0.803 | 23.75, 0.832 | 24.29, 0.850 | 25.88, 0.859
      Mean | 29.32, 0.834 | 32.12, 0.879 | 33.01, 0.903 | 33.49, 0.915 | 34.23, 0.921
      Table 7. Influence of noise estimation level σ on reconstruction time

      Scene | σ=100 | σ=50 | σ=30 | σ=5 | σ∈(0, 50] (dynamic)
      Scene1 | 53.10 s | 96.57 s | 99.09 s | 105.39 s | 117.74 s
      Scene2 | 51.21 s | 91.74 s | 95.67 s | 101.17 s | 110.92 s
      Scene3 | 48.69 s | 84.51 s | 89.87 s | 94.28 s | 105.03 s
      Scene4 | 51.84 s | 91.35 s | 96.27 s | 103.77 s | 116.19 s
      Scene5 | 55.62 s | 104.61 s | 112.84 s | 119.07 s | 145.56 s
      Mean | 52.09 s | 93.76 s | 98.75 s | 104.74 s | 119.09 s

    When σ=100, the estimate significantly exceeds the noise level present in the denoising model's training data; the iterations converge quickly but the reconstruction quality is poor. Smaller σ values yield better reconstructions but require longer runtimes. Compared with fixed noise estimates, letting the noise estimate decrease gradually with the iterations achieves the best reconstruction results, albeit at the expense of the longest runtime. In summary, σ should be selected according to the requirements at hand; a larger σ can be used to expedite reconstruction when strict reconstruction quality is not necessary.
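    The dynamic strategy lets the noise estimate decay across the plug-and-play iterations instead of fixing it. One simple realization is a geometric decay from σ_max toward σ_min; the decay law here is our assumption, not necessarily the paper's exact schedule:

```python
import numpy as np

def sigma_schedule(n_iters, sigma_max=50.0, sigma_min=1.0):
    """Noise-level estimate decaying geometrically over the PnP iterations."""
    t = np.arange(n_iters) / (n_iters - 1)          # 0 -> 1 across iterations
    return sigma_max * (sigma_min / sigma_max) ** t  # sigma_max -> sigma_min
```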

    3.5.3 Number of warm start iterations

    Deep plug-and-play frameworks typically require a warm start to expedite convergence. In the aforementioned experiments, the GAP-TV denoiser is employed for 80 iterations before transitioning to the self-supervised learning model to achieve improved reconstruction results. However, the GAP-TV denoiser does not necessarily need 80 iterations to reach its denoising performance limit, so reducing the number of warm-start iterations could accelerate reconstruction; ending the warm start prematurely, though, may result in a suboptimal final reconstruction. To investigate this, we select a scene with a fixed noise estimation level of 30 and vary the number of warm-start iterations to study their effect on reconstruction quality and speed. The experimental results are presented in Table 8.

      Table 8. Influence of the number of warm start iterations

      Warm start iterations | Warm start time / s | Self-supervised reconstruction time / s | Total reconstruction time / s | PSNR / dB | SSIM
      0 | 0 | 39.06 | 39.06 | 33.79 | 0.828
      10 | 5.24 | 40.95 | 46.19 | 34.33 | 0.867
      20 | 10.04 | 39.69 | 49.73 | 34.43 | 0.867
      30 | 15.76 | 40.95 | 56.71 | 34.44 | 0.871
      40 | 20.68 | 40.32 | 61.00 | 34.42 | 0.869
      50 | 26.41 | 40.32 | 66.73 | 34.25 | 0.869
      60 | 31.23 | 39.69 | 70.92 | 34.21 | 0.867
      70 | 36.44 | 39.69 | 76.13 | 34.17 | 0.864
      80 | 41.67 | 40.32 | 81.99 | 34.15 | 0.869
      90 | 46.88 | 40.95 | 87.83 | 34.12 | 0.869
      100 | 52.27 | 39.69 | 91.96 | 34.12 | 0.867

    As can be seen from Table 8, regardless of the number of warm-start iterations, the running time of the image reconstruction phase of PnP-Self-HSIDeCNN using the deep model is essentially unchanged, so the total running time is primarily determined by the warm-start time. The total reconstruction time is shortest with no warm start, but this compromises the final reconstruction quality. Quality peaks at around 30 warm-start iterations; beyond that, it declines slightly while the total reconstruction time continues to grow. Therefore, in practice, the number of warm-start iterations can be set to 10-20 to accelerate reconstruction while ensuring satisfactory reconstruction quality.
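    The warm-start logic amounts to switching the plug-and-play denoiser after the first K iterations. Below is a schematic of the loop, with hypothetical forward/adjoint operators standing in for the CASSI sensing model and a plain gradient-style data step rather than the paper's exact GAP update:

```python
def pnp_reconstruct(y, forward, adjoint, tv_denoise, deep_denoise,
                    n_warm=20, n_total=100):
    """Plug-and-play loop: TV denoising for the first n_warm iterations
    (the warm start), then the self-supervised deep denoiser."""
    x = adjoint(y)                          # initial estimate from the measurement
    for k in range(n_total):
        x = x + adjoint(y - forward(x))     # data-fidelity step
        x = tv_denoise(x) if k < n_warm else deep_denoise(x)
    return x
```

    Because the deep-denoising phase dominates total time (Table 8), shrinking n_warm shortens the run almost linearly in the warm-start time while leaving the per-iteration cost of the second phase unchanged.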

    4 Conclusions

    In this paper, we propose a self-supervised learning method, PnP-Self-HSIDeCNN, based on a deep plug-and-play framework and a neighborhood sampling strategy. Unlike traditional methods, it requires only noisy images for training and fully exploits the inter-spectral correlation of hyperspectral images through the SENet structure. Compared with previous end-to-end neural network reconstructions in terms of visual quality, PSNR, and SSIM, the proposed self-supervised method achieves commendable reconstruction results with robust generalization within acceptable runtimes. We conduct a detailed and comprehensive comparison of three hyperparameters (the loss-function penalty factor γ, the noise estimation level σ, and the number of warm-start iterations) to fully explore their impact on reconstruction speed and quality, thereby laying a foundation for the algorithm's future practical applications. Our findings demonstrate that self-supervised learning can yield satisfactory performance even with limited or poor-quality data, providing a feasible approach for the future application of compressed sensing and CASSI systems in real-world scenes. However, compared with end-to-end neural networks, our approach lacks real-time performance, and real-world noise is far more complex than Gaussian and Poisson noise. In future work, we aim to reduce the iteration time of our method and extend it to real-world data.

    [1] Jian-Yu Wang, Rong Shu, Yin-nian Liu et al. Science Press.

    [2] P Z Wu. Characteristics and applications of satellite-borne hyperspectral imaging spectrometer. Remote Sensing of Land Resources, 10(1999).

    [3] Z Y Ouyang. Scientific objectives of China's lunar exploration project. Proceedings of the Chinese Academy of Sciences.

    [4] X Dun, Q Fu, H T Li et al. Advances in the frontiers of computational imaging. Chinese Journal of Image Graphics, 27, 37(2022).

    [22] Z Chen, J Cheng. Proximal gradient descent unfolding dense-spatial spectral-attention transformer for compressive spectral imaging. arXiv preprint(2023).

    [29] J Lehtinen, J Munkberg, J Hasselgren et al. Noise2Noise: learning image restoration without clean data. arXiv preprint(2018).

    Paper Information

    Category: Interdisciplinary Research on Infrared Science

    Received: Feb. 29, 2024


    Published Online: Dec. 13, 2024

    The Author Email: LI Chun-Lai (lichunlai@mail.sitp.ac.cn), LIU Shi-Jie (liushijie@ucas.ac.cn)

    DOI:10.11972/j.issn.1001-9014.2024.06.016
