Journal of Infrared and Millimeter Waves, Vol. 43, Issue 6, 847 (2024)

Deep plug-and-play self-supervised neural networks for spectral snapshot compressive imaging

Xing-Yu ZHANG1,3, Shou-Zheng ZHU1,3, Tian-Shu ZHOU1,3, Hong-Xing QI1,3, Jian-Yu WANG1,2,3, Chun-Lai LI1,2,3 **, and Shi-Jie LIU1,3 *
Author Affiliations
  • 1School of Physics and Optoelectronic Engineering, Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences, Hangzhou 310024, China
  • 2Key Laboratory of Space Active Opto-Electronics Technology, Shanghai Institute of Technical Physics, Chinese Academy of Sciences, Shanghai 200083, China
  • 3University of Chinese Academy of Sciences, Beijing 100049, China

    The coded aperture snapshot spectral imaging system, based on compressive sensing theory, can be regarded as an encoder that efficiently acquires compressed two-dimensional spectral data, which deep neural networks then decode into three-dimensional spectral data. However, training deep neural networks requires a large amount of clean data that is difficult to obtain. To address this shortage of training data, a self-supervised hyperspectral denoising neural network based on neighborhood sampling is proposed. This network is integrated into a deep plug-and-play framework to achieve self-supervised spectral reconstruction. The study also examines the impact of different noise degradation models on the final reconstruction quality. Experimental results demonstrate that the self-supervised learning method enhances the average peak signal-to-noise ratio by 1.18 dB and improves the structural similarity by 0.009 compared with the supervised learning method, while achieving better visual reconstruction results.

    Keywords

    Introduction

    Spectral imaging technology enables the simultaneous acquisition of both spectral and image information [1]. It is a crucial component of modern remote sensing, serving as a significant tool for both earth observation [2] and space exploration [3]. The three-dimensional spectral data cube contains a wealth of information, but its acquisition comes at higher cost, often necessitating multiple scans. In recent years, many advances have been made in the field of computational imaging [4]. The coded aperture snapshot spectral imager (CASSI) offers a solution by compressing hyperspectral images using compressed sensing theory [5]. This system modulates the spectral data cube, compresses it into two-dimensional measurements, and then reconstructs it with algorithms. CASSI has two physical structures, dual-disperser [6] and single-disperser [7]; the single-disperser structure has found applications in fields such as medicine [8] owing to its simpler physical design. The objective of compressed sensing image reconstruction is to restore the original high-dimensional signal $x \in \mathbb{R}^N$ from a reduced set of linear measurements $y \in \mathbb{R}^M$ [9]. Current compressed sensing reconstruction methods fall into two categories: model-driven algorithms based on regularization priors and data-driven algorithms based on deep networks. Classical model-driven algorithms include orthogonal matching pursuit [10], iterative hard thresholding [11-12], and block compressed sensing transforms [13]. While model-driven algorithms offer strong interpretability and generalization, they struggle with extensive spectral data, resulting in slower reconstruction and poorer quality. In recent years, deep learning has evolved rapidly, and some of its techniques have been applied to compressed sensing image reconstruction. Deep learning-based reconstruction algorithms have shown superior performance in both simulations and real-world scenarios [14].

    Xiong et al. pioneered the use of deep learning for hyperspectral compressive sensing reconstruction [15], improving reconstruction with convolutional neural networks and residual connections. Choi et al. developed a convolutional autoencoder to obtain a nonlinear spectral representation of hyperspectral images, combining the learned autoencoder prior and a total variation (TV) prior as a composite regularization term and solving the problem with the alternating direction method of multipliers [16]. Wang et al. designed HyperReconNet to reconstruct hyperspectral images by cascading spatial and spectral networks [17]. Miao et al. proposed λ-net for compressive sensing reconstruction of hyperspectral images and videos in two phases [18], using generative adversarial networks in the first phase and a U-net in the second to enhance reconstruction. Meng et al. proposed TSA-net, incorporating a spatial-spectral self-attention module into U-net and injecting shot noise during training, significantly improving reconstruction of real data [19]. Zheng et al. proposed a flexible plug-and-play (PnP) framework for hyperspectral image reconstruction, enhancing both quality and speed [20]. Wang et al. pioneered the application of the Transformer architecture to compressive sensing spectral imaging with the GAP-CSCoT network, maintaining high reconstruction quality while reducing runtime [21]. Chen et al. proposed a Proximal Gradient Descent Unfolding Dense-spatial Spectral-attention Transformer (PGDUF) that accelerates the training of Transformer-based models without affecting reconstruction quality [22]. Chen et al. combined low-rank subspace representations of hyperspectral images with deep neural networks to achieve better reconstruction and stronger interpretability [23]. Luo et al. proposed a Transformer-based HSI reconstruction method called dual-window multiscale Transformer (DWMT), a coarse-to-fine process that reconstructs the global properties of HSI with long-range dependencies while maintaining high reconstruction quality [24].

    By applying deep learning to compressive sensing reconstruction, high-quality and fast reconstruction can be achieved through powerful deep feature representation. However, most existing research on deep learning-based reconstruction algorithms is limited to supervised learning, restricting applicability in real-world scenarios: supervised approaches require a large number of clean images as labels, and acquiring clean images in the hyperspectral domain is expensive. In the realm of deep learning denoising, self-supervised learning has shown promise. When the noise has zero mean, the Noise2Noise method demonstrates that, for a clean scene x and two independently observed noisy images y and z, a denoising network trained with (y, z) pairs is equivalent to a network trained with (y, x) pairs. Neighbor2Neighbor extends Noise2Noise by downsampling a single noisy image and using the two noisy sub-images obtained from downsampling as model input and label, respectively. However, these self-supervised methods have only been tested on RGB or greyscale images; their performance on hyperspectral images has not been explored.

    In this paper, we propose a self-supervised compressive sensing hyperspectral image reconstruction algorithm based on the PnP method and the Neighbor2Neighbor strategy. First, we train a self-supervised hyperspectral denoising network using Neighbor2Neighbor, incorporating a channel attention mechanism to capture inter-spectral correlations in hyperspectral images. The denoising network is then embedded into a deep plug-and-play framework based on the alternating direction method of multipliers to achieve compressive sensing image reconstruction with the denoising model. We evaluate the algorithm's effectiveness in terms of both quantitative metrics and visual quality, and compare the effects of different noise degradation models on the final reconstruction results.

    The main contributions of this paper are as follows:

    (1) Based on Neighbor2Neighbor and SENet, we propose Self-HSIDeCNN, a self-supervised hyperspectral image denoising network, for subsequent compressed sensing image reconstruction.

    (2) Building on Self-HSIDeCNN and the PnP method, we propose the PnP-Self-HSIDeCNN method to implement self-supervised hyperspectral compressed sensing image reconstruction.

    (3) Through ablation experiments, we verify the effects of several hyperparameters in the PnP-Self-HSIDeCNN method on reconstruction speed and quality, laying a foundation for its practical application.

    The rest of this paper is organized as follows: Section 1 describes the mathematical model of CASSI; Section 2 introduces the mathematical principles of the PnP method and the Neighbor2Neighbor method, and presents the self-supervised denoising network Self-HSIDeCNN; Section 3 demonstrates the effectiveness of the proposed method through extensive experiments; Section 4 concludes the paper.

    1 Mathematical model of CASSI

    Let $X \in \mathbb{R}^{N_x \times N_y \times N_\lambda}$ be the spectral data cube, where $x$ and $y$ are the spatial dimensions and $\lambda$ is the spectral dimension. Let $M^* \in \mathbb{R}^{N_x \times N_y}$ be the physical mask of the CASSI system, which can be regarded as a matrix of size $N_x \times N_y$. Each element of this matrix obeys a 0-1 distribution with probability $p$ and is used to modulate the 3D spectral data cube. Let $X' \in \mathbb{R}^{N_x \times N_y \times N_\lambda}$ be the spectral data after passing through the mask, and for the $n_\lambda$-th band we have

    $X'_{:,:,n_\lambda} = X_{:,:,n_\lambda} \odot M^*$ , (1)

    where $\odot$ represents element-wise multiplication. After passing through the dispersion prism, the data cube $X'$ is shifted along the $y$-axis. Let the shifted signal be $X'' \in \mathbb{R}^{N_x \times (N_y + N_\lambda - 1) \times N_\lambda}$ and $\lambda_c$ be the reference wavelength; then

    $X''(u, v, n_\lambda) = X'(x, y + d(\lambda_n - \lambda_c), n_\lambda)$ , (2)

    where $(u, v)$ are the pixel coordinates in the detector plane, $\lambda_n$ is the wavelength of the $n_\lambda$-th band, $\lambda_c$ is the centre wavelength, and $d(\lambda_n - \lambda_c)$ denotes the spatial offset of the $n_\lambda$-th band. The measured value $y(u, v)$ at detector-plane position $(u, v)$ is then

    $y(u, v) = \int_{\lambda_{\min}}^{\lambda_{\max}} X''(u, v, n_\lambda)\, \mathrm{d}\lambda$ . (3)

    The detector receives signals from all bands and finally obtains a two-dimensional measurement $Y \in \mathbb{R}^{N_x \times (N_y + N_\lambda - 1)}$. Considering the noise $G$ during the measurement, we have

    $Y = \sum_{n_\lambda = 1}^{N_\lambda} X''_{:,:,n_\lambda} + G$ . (4)

    Let the shifted physical mask matrix (this matrix can be fixed or variable [25]) be $M \in \mathbb{R}^{N_x \times (N_y + N_\lambda - 1) \times N_\lambda}$ and the shifted data cube be $\tilde{F} \in \mathbb{R}^{N_x \times (N_y + N_\lambda - 1) \times N_\lambda}$; then

    $M(u, v, n_\lambda) = M^*(x, y + d(\lambda_n - \lambda_c))$ , (5)
    $\tilde{F}(u, v, n_\lambda) = F(x, y + d(\lambda_n - \lambda_c), n_\lambda)$ . (6)

    The final measurement $Y$ can then be expressed as [19]

    $Y = \sum_{n_\lambda = 1}^{N_\lambda} \tilde{F}_{:,:,n_\lambda} \odot M_{:,:,n_\lambda} + G$ . (7)

    The complete process is shown in Figure 1. The data obtained by CASSI is a two-dimensional measurement of the form of $Y$; it carries a large amount of information in a small storage footprint. With appropriate algorithms, it can be recovered into the original three-dimensional data cube.
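    The forward model above is straightforward to simulate. The following is a minimal NumPy sketch (the function name, toy array shapes, and the one-pixel-per-band dispersion step are our own illustrative choices, not taken from the paper):

```python
import numpy as np

def cassi_forward(cube, mask, step=1):
    """Simulate the CASSI forward model: mask modulation of each band,
    a dispersion shift of `step` pixels per band along y, and summation
    of all shifted bands on the detector."""
    nx, ny, nl = cube.shape
    meas = np.zeros((nx, ny + step * (nl - 1)))
    for k in range(nl):
        # modulate band k with the coded aperture, then place it at its
        # dispersed position before accumulating on the detector
        meas[:, k * step:k * step + ny] += cube[:, :, k] * mask
    return meas

rng = np.random.default_rng(0)
cube = rng.random((8, 8, 5))                      # toy 8 x 8 x 5 data cube
mask = (rng.random((8, 8)) < 0.5).astype(float)   # binary coded aperture
meas = cassi_forward(cube, mask)
print(meas.shape)                                  # (8, 12): Ny + Nl - 1 columns
```

    Measurement noise $G$ can be added to `meas` afterwards; in a real system the shift per band is determined by the disperser.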


    Figure 1. CASSI forward model

    2 Method

    2.1 PnP method

    The PnP method decomposes the original problem, which is difficult to solve directly, into subproblems that are easy to solve, with good results [26]. Hyperspectral compressive sensing image reconstruction can be formulated as the following optimization problem:

    $\arg\min_x \frac{1}{2}\|y - Ax\|_2^2 + \lambda R(x)$ , (8)

    where $y$ is the measured value, $x$ is the original signal, and $\lambda R(x)$ is the regularization term. The basic idea of the PnP method for inverse problems is to use a pretrained denoiser for the desired signal as a prior. The method decomposes the whole problem into easier subproblems and solves them alternately in an iterative manner; the denoising network serves as a flexible plug-in (i.e., it can be easily swapped) in this process. Specifically, problem (8) can be decomposed into the following subproblems using the alternating direction method of multipliers (ADMM) [27]:

    $x^{k+1} = \arg\min_x \frac{1}{2}\|Ax - y\|_2^2 + \frac{\rho}{2}\|x - z^k - u^k\|_2^2$ , (9)
    $z^{k+1} = \arg\min_z \lambda R(z) + \frac{\rho}{2}\|z - x^{k+1} + u^k\|_2^2$ , (10)
    $u^{k+1} = u^k + x^{k+1} - z^{k+1}$ , (11)

    where $z$ is the auxiliary variable, $u$ is the multiplier, $\rho$ is the penalty factor, and $k$ is the iteration index. Defining the proximal operator $\mathrm{prox}_g(v) = \arg\min_x g(x) + \frac{1}{2}\|x - v\|_2^2$, Eqs. (9)-(11) can be written in the following form:

    $x^{k+1} = \mathrm{prox}_{f/\rho}(z^k - u^k)$ , (12)
    $z^{k+1} = \mathrm{prox}_{\lambda R/\rho}(x^{k+1} + u^k)$ , (13)
    $u^{k+1} = u^k + x^{k+1} - z^{k+1}$ , (14)

    where $f(x) = \frac{1}{2}\|Ax - y\|_2^2$. Eq. (12) has a closed-form solution, and Eq. (13) can be viewed as a denoising prior. The final PnP-ADMM solution for compressive sensing hyperspectral reconstruction can be written as [28]

    $x^{k+1} = (z^k - u^k) + A^\top\left[\dfrac{y - A(z^k - u^k)}{\mathrm{Diag}(AA^\top) + \rho}\right]$ , (15)
    $z^{k+1} = \mathcal{D}_{\hat{\sigma}_k}(x^{k+1} + u^k)$ , (16)
    $u^{k+1} = u^k + x^{k+1} - z^{k+1}$ , (17)

    where $\hat{\sigma}_k^2 = \lambda/\rho$ is the estimated noise variance and $\mathcal{D}_{\hat{\sigma}_k}$ is the denoiser. It should be noted that the performance of the denoiser $\mathcal{D}_{\hat{\sigma}_k}$ directly affects the final reconstruction results. The denoiser used in this paper is a self-supervised hyperspectral denoising network, for the purpose of fully self-supervised image reconstruction. In practical application, the initial inputs consist of the 2D measurements acquired from the detector and the mask matrix. After several iterations with the pre-trained denoising network, the final reconstructed data cube is obtained. This process is shown in Fig. 2.
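    Because the CASSI sensing operator makes $AA^\top$ diagonal, the x-update is a cheap elementwise correction, so the whole iteration fits in a few lines. A minimal NumPy sketch of the loop (names and shapes are ours; the identity "denoiser" is only a placeholder for the network described later):

```python
import numpy as np

def pnp_admm_sci(y, phi, denoise, rho=0.01, iters=20):
    """PnP-ADMM loop for snapshot compressive imaging.
    y:       (Nx, Ny') 2D measurement
    phi:     (Nx, Ny', Nl) shifted-mask cube; A x = sum over bands of phi * x
    denoise: plug-in denoiser mapping a cube to a cube
    """
    phi_sum = (phi ** 2).sum(axis=2)      # diagonal of A A^T
    z = phi * y[..., None]                # initialise with A^T y
    u = np.zeros_like(z)
    for _ in range(iters):
        v = z - u
        av = (phi * v).sum(axis=2)        # A v
        # x-update: closed form thanks to the diagonal A A^T
        x = v + phi * ((y - av) / (phi_sum + rho))[..., None]
        z = denoise(x + u)                # z-update: plug-and-play denoising prior
        u = u + x - z                     # multiplier update
    return z

rng = np.random.default_rng(1)
truth = rng.random((4, 6, 3))
phi = (rng.random((4, 6, 3)) < 0.5).astype(float)
y = (phi * truth).sum(axis=2)
rec = pnp_admm_sci(y, phi, denoise=lambda c: c)   # identity denoiser for illustration
print(rec.shape)                                   # (4, 6, 3)
```

    With the identity denoiser the loop simply drives the measurement residual toward zero; plugging in a real denoiser adds the prior of Eq. (16).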


    Figure 2. PnP image reconstruction framework

    2.2 Neighbor2Neighbor method

    The most critical part of the PnP-ADMM solution for compressive sensing hyperspectral reconstruction is Eq. (16), which can be approximated using a deep learning model. Specifically, it can be solved using deep learning image denoising models, but these are mostly trained by supervised learning. There are two primary unsupervised deep learning denoising approaches. One relies on the deep image prior, which requires no training but leads to long reconstruction times. The other is based on the Noise2Noise concept, which still needs training but can significantly reduce the reconstruction time.

    The core idea of Noise2Noise is that, for an unobserved clean scene $x$ and two observed independent noisy images $y$ and $z$, a denoising network trained with $(y, z)$ pairs is equivalent to a network trained with $(y, x)$ pairs, provided the noise has zero mean [29]. The optimisation objective of Noise2Noise is

    $\arg\min_\theta \mathbb{E}_{x,y,z}\|f_\theta(y) - z\|_2^2$ , (18)

    where $f_\theta(\cdot)$ is the denoising network. Noise2Noise requires at least two separate noisy images for each scene, which is difficult to satisfy in real scenes. To increase its practical value, the theory of Noise2Noise has been extended: for a single noisy image, one possible way to construct two similar but not identical images is downsampling [30].

    The Neighbor2Neighbor downsampling idea is shown in Fig. 3. For a greyscale image $y$ of size $H \times W$, divide it into $\frac{H}{2} \times \frac{W}{2}$ pixel blocks of size $2 \times 2$. Two separate pixels are then randomly sampled from each pixel block and combined to form two subsampled images $g_1(y)$ and $g_2(y)$ of size $\frac{H}{2} \times \frac{W}{2}$. At this point Eq. (18) becomes

    $\arg\min_\theta \mathbb{E}_{x,y}\|f_\theta(g_1(y)) - g_2(y)\|_2^2$ . (19)
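    A minimal NumPy sketch of this sampler (our own implementation of the idea, for a single-channel image; the paper applies it band-wise to the spectral cube):

```python
import numpy as np

def neighbor_downsample(y, rng):
    """Split an H x W image into 2x2 blocks and draw two *different*
    pixels from each block, producing two H/2 x W/2 sub-images that
    see the same scene but independent noise realisations."""
    h, w = y.shape
    blocks = (y.reshape(h // 2, 2, w // 2, 2)
                .transpose(0, 2, 1, 3)
                .reshape(h // 2, w // 2, 4))
    idx1 = rng.integers(0, 4, size=(h // 2, w // 2))
    idx2 = (idx1 + rng.integers(1, 4, size=idx1.shape)) % 4  # always != idx1
    g1 = np.take_along_axis(blocks, idx1[..., None], axis=2)[..., 0]
    g2 = np.take_along_axis(blocks, idx2[..., None], axis=2)[..., 0]
    return g1, g2

rng = np.random.default_rng(0)
noisy = rng.normal(size=(8, 8))
g1, g2 = neighbor_downsample(noisy, rng)
print(g1.shape, g2.shape)   # (4, 4) (4, 4)
```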


    Figure 3. Image downsampling method

    Ref. [30] demonstrated that using Eq. (19) directly as the optimisation objective yields denoising results that are too smooth. Therefore, a penalty term is added to Eq. (19) to obtain the final optimisation objective

    $\min_\theta \mathbb{E}_{x,y}\|f_\theta(g_1(y)) - g_2(y)\|_2^2 + \gamma\, \mathbb{E}_{x,y}\|f_\theta(g_1(y)) - g_2(y) - g_1(f_\theta(y)) + g_2(f_\theta(y))\|_2^2$ . (20)

    2.3 Hyperspectral denoising network

    Deep denoising models for greyscale and RGB images have matured considerably in recent years. We propose a deep plug-and-play self-supervised hyperspectral image denoising network (Self-HSIDeCNN). FFDNet is chosen as the base framework for our model; it is flexible and fast and has already shown excellent performance in denoising RGB and greyscale images [31]. We use FFDNet as the backbone network for two main reasons. First, compared with U-net, a commonly used backbone in image processing, FFDNet is designed specifically for image denoising, and its noise level map can be varied during training so that the model flexibly handles various noise levels. Second, compared with Transformer architectures, which have stronger feature extraction capability, FFDNet trains and infers faster, and its downsampling module effectively reduces the model's demand on computational resources. The structure of our neural network is shown in Figure 4.


    Figure 4. Self-HSIDeCNN network architecture

    In frame-level denoising, for the $n_\lambda$-th band of the spectral data cube $X \in \mathbb{R}^{N_x \times N_y \times N_\lambda}$, in addition to inputting this band of size $N_x \times N_y$ into the neural network, the images of its $K$ neighbouring bands are also input (in this paper we take $K = 6$). Following the sampling strategy in Section 2.2, two subgraphs of these $K+1$ band images are obtained, both of size $(K+1) \times \frac{N_x}{2} \times \frac{N_y}{2}$. One subgraph $y_1$ is used as the model input, and the other, $y_2$, is used as the label to compute the loss function. To speed up model training, subgraph $y_1$ is downsampled once more after entering the model. Together with the noise level estimation map, the final model input has size $(4K+5) \times \frac{N_x}{4} \times \frac{N_y}{4}$.

    To better capture the correlation between different spectral bands of hyperspectral images, an inter-channel attention structure, SENet, is added after every two convolutional layers in the neural network to form a SEBlock. This structure assigns a separate weight to each channel feature map; it increases the number of parameters but enhances the non-local feature extraction capability of the model. The SENet structure is shown in Fig. 5. A total of seven SEBlocks are used in this paper, each with $3 \times 3$ convolutional kernels and ReLU activation functions (the output of the last SEBlock does not use an activation function). After the seven SEBlocks, the feature map is restored to size $(K+1) \times \frac{N_x}{2} \times \frac{N_y}{2}$ by upsampling, and the loss function is then computed against the subsampled image $y_2$. The noise level $\sigma$ in the noise level estimation map is randomly varied during training, allowing the model to flexibly adapt to different noises.
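    The SEBlock described above can be sketched in PyTorch as follows (channel count, reduction ratio, and exact layer layout here are our illustrative assumptions, not the exact Self-HSIDeCNN configuration):

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Two 3x3 conv layers followed by SENet-style channel attention:
    squeeze (global average pooling) then excite (two 1x1 convs and a
    sigmoid) to produce one weight per channel feature map."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.se = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        feat = self.body(x)
        return feat * self.se(feat)   # per-channel reweighting

x = torch.randn(2, 8, 16, 16)
out = SEBlock(8)(x)
print(out.shape)                      # torch.Size([2, 8, 16, 16])
```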


    Figure 5. SENet architecture

    After obtaining the hyperspectral denoising model, we can train it with the Neighbor2Neighbor self-supervised learning method. According to the optimization objective in Eq. (20), the loss function is designed as

    $\mathcal{L} = \mathcal{L}_{\mathrm{rec}} + \gamma \mathcal{L}_{\mathrm{reg}} = \|f_\theta(g_1(y)) - g_2(y)\|_2^2 + \gamma \|f_\theta(g_1(y)) - g_2(y) - g_1(f_\theta(y)) + g_2(f_\theta(y))\|_2^2$ , (21)

    where $g_1(\cdot)$ and $g_2(\cdot)$ are the stochastic downsampling functions and $f_\theta(\cdot)$ is the denoising neural network. $\gamma$ is a penalty weight (fixed at 5 in this paper) used to balance the level of detail preserved in the denoising results. To keep the gradient stable, the gradients of $g_1(f_\theta(y))$ and $g_2(f_\theta(y))$ are not propagated during training. The training process is shown in Fig. 6. In the inference stage, the noisy image is simply input to the model to obtain the denoised image directly.
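    As a sketch, the loss and the gradient stop can be written in PyTorch like this (the fixed strided-slice downsamplers and the identity placeholder network are our own stand-ins for the random pair sampler and denoiser described in the text):

```python
import torch

def n2n_loss(f, y, g1, g2, gamma=5.0):
    """Neighbor2Neighbor loss: reconstruction term plus gamma-weighted
    regularizer; the gradient through the full-image branch f(y) is
    stopped, matching the training rule described in the text."""
    out = f(g1(y))
    diff = out - g2(y)
    with torch.no_grad():              # no gradient through f(y)
        fy = f(y)
    rec = (diff ** 2).mean()
    reg = ((diff - g1(fy) + g2(fy)) ** 2).mean()
    return rec + gamma * reg

g1 = lambda t: t[..., ::2, ::2]        # fixed downsamplers for illustration
g2 = lambda t: t[..., 1::2, 1::2]
y = torch.rand(2, 1, 8, 8)
loss = n2n_loss(lambda t: t, y, g1, g2)   # identity "denoiser" placeholder
print(loss.item() >= 0)                    # True
```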


    Figure 6. Self-supervised training process

    Denoising models based on supervised learning cannot be trained in the absence of clean images as labels, and for compressed sensing image reconstruction models trained with supervised learning, noisy training data degrades the model (we discuss this in a subsequent section). As can be seen from Figure 6, the self-supervised learning method needs only noisy images to complete training, without clean images as labels. This is a significant advantage in the hyperspectral domain, where noise-free data is very expensive to obtain.

    3 Experiments and results

    Our proposed method consists of two steps. First, we train the hyperspectral denoising network with the self-supervised approach. Then, this network is embedded directly into the PnP framework for self-supervised image reconstruction. In this section, we first describe the dataset and implementation details. Then, to evaluate its effectiveness, the proposed method is compared with models trained by supervised learning. Furthermore, ablation studies are conducted to analyze the effect of hyperparameters on the results.

    3.1 Training details

    Model training was performed on the CAVE hyperspectral dataset, consisting of 32 scenes, each with a resolution of 512 × 512 and 31 bands from 400 nm to 800 nm. Five scenes were selected as the test set and the remaining scenes as the training set. The RGB images of the five test scenes are shown in Fig. 7. To increase the training data, the dataset was randomly cropped and augmented (including flipping, rotating, mirroring, and combinations of these operations). A total of 33,480 samples of size 256 × 256 × 7 were finally obtained for training. For the five test scenes, the data were spatially downsampled to 256 × 256 × 31 data cubes. The code was implemented in the PyTorch framework and trained for 500 epochs with the Adam optimiser. The initial learning rate was set to 0.0001 and multiplied by 0.5 after every 100 epochs. Training the entire network took approximately 8 hours on a machine equipped with an Intel Xeon CPU, 360 GB of memory, and four Nvidia RTX 4090 Ti GPUs with 24 GB of VRAM each.


    Figure 7. RGB images of the test scenes

    3.2 Denoising results

    Before comparing the reconstruction outcomes, it is essential to assess both the supervised and the self-supervised denoising results, to investigate the impact of the Neighbor2Neighbor self-supervised learning strategy on denoising. The identical model, parameters, and training data are employed here, differing solely in the loss function and back-propagated gradients. For the supervised learning approach, the loss function is

    $\mathcal{L} = \|f_\theta(y) - x\|_2^2$ . (22)

    For the self-supervised learning approach, the loss function is given by Eq. (21).

    With a maximum pixel value of 255, let the Gaussian noise standard deviation be $\sigma$ and the Poisson noise intensity be $\lambda$. We compare four different noise degradation models: fixed Gaussian noise ($\sigma = 25$), ranged Gaussian noise ($\sigma \in [5, 50]$), fixed Poisson noise ($\lambda = 30$), and ranged Poisson noise ($\lambda \in [5, 50]$). Peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) are used as evaluation metrics. The metrics of the five test scenes were averaged; the final results are shown in Table 1.

    • Table 1. Comparison of denoising results

      | Noise type | Supervised model | Self-supervised model |
      | --- | --- | --- |
      | $\sigma = 25$ | 38.12 dB, 0.951 | 39.33 dB, 0.978 |
      | $\sigma \in [5, 50]$ | 37.63 dB, 0.924 | 39.16 dB, 0.969 |
      | $\lambda = 30$ | 38.28 dB, 0.964 | 40.25 dB, 0.983 |
      | $\lambda \in [5, 50]$ | 38.25 dB, 0.966 | 39.86 dB, 0.976 |
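    For reference, the PSNR values reported in these tables follow the standard definition with a peak value of 255 (the helper name below is ours; SSIM is more involved and is typically taken from a library such as scikit-image):

```python
import numpy as np

def psnr(ref, est, peak=255.0):
    """Peak signal-to-noise ratio in dB between a reference image and
    an estimate, both on a 0..peak scale."""
    mse = np.mean((np.asarray(ref, float) - np.asarray(est, float)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

ref = np.zeros((4, 4))
est = ref + 25.5          # uniform error of 25.5 on a 0..255 scale
print(psnr(ref, est))     # 20.0
```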

    The self-supervised model, utilizing the Neighbor2Neighbor strategy, demonstrates superior performance to the supervised model across the various noise degradation models. This superiority arises from the slight disparity $\varepsilon$ between the labels of the self-supervised model and the desired denoising results of the model inputs. Additionally, with the penalty term in the loss function, the model generalizes better and performs better on the test set.

    The hyperparameter $\gamma$ is used to avoid overly smooth denoising results. When $\gamma = 0$, the penalty term in the loss function vanishes. In that case, $f_\theta(g_1(y))$ and $f_\theta(g_2(y))$ are not exactly the same, and the model tends to output their average, because the loss function attains its minimum there. To verify the influence of the hyperparameter $\gamma$ on the denoising results, the noise in the training data is kept unchanged (Gaussian noise with $\sigma = 25$ and Poisson noise with $\lambda = 30$) and only $\gamma$ is varied; the denoising performance of the resulting models on the test set is shown in Table 2.

    • Table 2. The influence of hyperparameter γ on denoising results

      | $\gamma$ | PSNR / dB ($\sigma = 25$) | SSIM ($\sigma = 25$) | PSNR / dB ($\lambda = 30$) | SSIM ($\lambda = 30$) |
      | --- | --- | --- | --- | --- |
      | 0 | 39.27 | 0.970 | 40.26 | 0.983 |
      | 1 | 39.23 | 0.962 | 40.23 | 0.981 |
      | 2 | 39.26 | 0.965 | 40.34 | 0.986 |
      | 5 | 39.33 | 0.978 | 40.25 | 0.983 |
      | 20 | 39.30 | 0.977 | 39.89 | 0.979 |

    For Gaussian noise, consider first the results at $\gamma = 0$ and $\gamma = 1$. When $\gamma = 0$, the penalty term is absent, the denoising result is over-smoothed, and the performance of Self-HSIDeCNN is not optimal. However, because of the downsampling and upsampling modules in Self-HSIDeCNN, the excessive smoothing occurs mainly in the downsampled subgraphs; after the subgraphs are upsampled, the high-frequency information in the final output image remains fairly well preserved. This explains why the denoising performance becomes worse at $\gamma = 1$ (the penalty term makes the denoising inadequate). At $\gamma = 5$, although a small amount of noise is left unremoved as $\gamma$ increases, the original image information is better recovered than at $\gamma = 0$. At $\gamma = 20$, more noise is retained along with the high-frequency detail, so the denoising performance worsens again.

    For Poisson noise, the trend is similar to that of Gaussian noise, but the optimum is reached at $\gamma = 2$. This is because the model denoises Poisson noise more effectively, so only a smaller $\gamma$ is needed to alleviate the over-smoothing caused by the self-supervised learning method.

    In this paper we fix $\gamma = 5$ for the experiments; in practical applications, the value of $\gamma$ should be chosen according to the scene characteristics and experimental results.

    3.3 Reconstruction results

    The self-supervised model Self-HSIDeCNN of Section 3.2, trained with Neighbor2Neighbor, can be embedded directly in the deep plug-and-play architecture for compressive sensing image reconstruction; we name the result PnP-Self-HSIDeCNN. Deep plug-and-play frameworks often require a warm start to speed up convergence. Here, the GAP-TV denoiser is used for the first 90 iterations, after which the denoiser is switched to the self-supervised model for better reconstruction results. In addition, the estimated noise level $\sigma$ at reconstruction time is set to 30 regardless of the noise degradation model (normalised to 0 to 1); as the iterations progress, this parameter can be gradually decreased to enhance the reconstruction quality. We also employ a reference reconstruction algorithm named PnP-HSI, which uses a denoising model trained by supervised learning within the PnP framework.

    To compare with end-to-end neural networks, we selected the U-net and TSA-net models. U-net consists of two main components, an encoder and a decoder; each encoding block contains two 3×3 convolutional layers with ReLU activation and a 2×2 max-pooling layer. TSA-net directly uses the structure from Ref. [19] without further changes. The CAVE hyperspectral dataset is again used, randomly cropped to 256 × 256 in the spatial dimension for data augmentation, yielding 5000 training samples of size 256 × 256 × 31. The CASSI forward model was simulated on each data cube before input to the model, so the final model input was a 2D measurement of size 256 × 316. It is important to emphasize that our primary aim is to investigate the use of self-supervised learning in compressive sensing image reconstruction and to analyze the influence of various noise degradation models on the reconstruction outcomes. Consequently, we refrain from comparing against current state-of-the-art models, as those are trained in a supervised manner.

    The final comparison results are shown in Table 3. In terms of reconstruction quality, the self-supervised reconstruction model proposed in this paper outperforms both U-net and TSA-net in PSNR and SSIM. It can also be seen that the best results are achieved by the self-supervised model trained with fixed Gaussian noise. This is because the mathematical form of $\mathcal{D}_{\hat{\sigma}_k}$ in Eq. (13) corresponds to a fixed Gaussian denoiser. Unless otherwise stated, the PnP-Self-HSIDeCNN used in the subsequent comparison experiments was trained with fixed Gaussian noise ($\sigma = 25$).

    • Table 3. Comparison of reconstruction results

      | | U-net | TSA-net | PnP-HSI | Ours ($\sigma = 25$) | Ours ($\sigma \in [5, 50]$) | Ours ($\lambda = 30$) | Ours ($\lambda \in [5, 50]$) |
      | --- | --- | --- | --- | --- | --- | --- | --- |
      | Scene1 | 26.29 dB, 0.843 | 26.47 dB, 0.855 | 29.56 dB, 0.875 | 31.12 dB, 0.892 | 30.44 dB, 0.843 | 28.34 dB, 0.788 | 28.51 dB, 0.798 |
      | Scene2 | 37.20 dB, 0.941 | 36.98 dB, 0.935 | 37.59 dB, 0.959 | 39.62 dB, 0.959 | 37.01 dB, 0.921 | 37.55 dB, 0.963 | 38.39 dB, 0.966 |
      | Scene3 | 33.87 dB, 0.918 | 35.07 dB, 0.917 | 36.27 dB, 0.924 | 35.89 dB, 0.916 | 34.31 dB, 0.888 | 34.14 dB, 0.908 | 34.25 dB, 0.909 |
      | Scene4 | 32.99 dB, 0.910 | 33.16 dB, 0.929 | 34.88 dB, 0.933 | 35.22 dB, 0.928 | 34.13 dB, 0.870 | 32.96 dB, 0.910 | 33.02 dB, 0.903 |
      | Scene5 | 20.11 dB, 0.788 | 21.14 dB, 0.794 | 21.55 dB, 0.797 | 23.89 dB, 0.838 | 23.94 dB, 0.808 | 23.06 dB, 0.810 | 22.70 dB, 0.802 |
      | Mean | 30.09 dB, 0.880 | 30.56 dB, 0.886 | 31.97 dB, 0.898 | 33.15 dB, 0.907 | 31.97 dB, 0.866 | 31.21 dB, 0.876 | 31.37 dB, 0.876 |

    Fig. 8 shows the visual results for one test scene. The left side shows the RGB image of the scene together with the simulated 2D measurement, while the right side shows the reconstruction results of the various algorithms. Fig. 9 presents an enlarged view of the last-band image on the right side of Fig. 8, to facilitate a visual comparison of reconstruction quality.


    Figure 8. Comparison of the visual effect of the reconstruction results of different algorithms: (a) RGB image; (b) 2D measurements; (c) ground truth; (d) U-net; (e) TSA-net; (f) PnP-HSI; (g) PnP-Self-HSIDeCNN


    Figure 9. Comparison of visual reconstruction effects in the final band: (a) ground truth; (b) U-net; (c) TSA-net; (d) PnP-HSI; (e) PnP-Self-HSIDeCNN

    In the evaluation of hyperspectral images, we consider not only spatial metrics but also spectral metrics. Fig. 10 illustrates the reconstructed spectral profile at the central position of this scene. For the spectral curves reconstructed by the different algorithms in Fig. 10(b), we use the Spectral Angle Mapper (SAM) to evaluate their similarity to the real spectral curve; the results are shown in the legend of Fig. 10(b). As can be seen, the spectral curve reconstructed by PnP-Self-HSIDeCNN is the most similar to the real one.
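    The SAM metric is simply the angle between two spectra viewed as vectors; a minimal sketch (the helper name is ours):

```python
import numpy as np

def spectral_angle(a, b):
    """Spectral Angle Mapper: angle (radians) between two spectra.
    0 means identical spectral shape regardless of overall intensity."""
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))

s = np.array([0.2, 0.5, 0.9])
print(spectral_angle(s, 3.0 * s))   # ~0: same shape, different scale
```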


    Figure 10. Comparison of spectral curve reconstruction results: (a) RGB image; (b) spectral curve comparison

    Based on the aforementioned metrics, the self-supervised reconstruction algorithm proposed in this paper, built on the deep plug-and-play framework, yields superior results across various metrics. In terms of runtime, however, the end-to-end neural networks U-net and TSA-net hold a significant advantage, completing reconstruction in less than 1 second after training, whereas algorithms using deep plug-and-play frameworks require several minutes due to the iterative process. Nevertheless, minute-level runtimes are generally acceptable in practical applications. Moreover, the method circumvents the need for clean data as labels during training, substantially mitigating the problem of inadequate training data for compressive sensing deep learning reconstruction models in real-world scenarios.

    3.4 Comparison of generalisability

    In real-world application scenarios, data often contain noise, while clean data with minimal noise are ideally preferred. In our proposed self-supervised reconstruction model, noisy spectral data cubes can be acquired from real scenarios and used directly for training, ensuring generalization. For supervised deep learning models, however, training becomes challenging due to the absence of clean images as labels. One approach is to train directly with noisy images as labels, but this typically degrades model performance. To evaluate the generalizability of the models, Gaussian noise was added to the CAVE dataset to simulate real-world conditions, and U-net and TSA-net were retrained and tested on this noisy data. In contrast, the self-supervised networks, which inherently incorporate noise during training, can be tested directly on the noisy data. The resulting performance is summarized in Table 4.
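    The noisy test set can be generated by perturbing each normalized data cube with zero-mean Gaussian noise at σ=25 on the 0-255 scale; a sketch, with the function name and clipping convention our own:

```python
import numpy as np

def add_gaussian_noise(cube, sigma=25.0, seed=0):
    """Add zero-mean Gaussian noise (sigma given on the 0-255 scale) to a [0, 1] cube."""
    rng = np.random.default_rng(seed)
    noisy = cube + rng.normal(0.0, sigma / 255.0, size=cube.shape)
    return np.clip(noisy, 0.0, 1.0)  # keep values in the valid intensity range
```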

      Table 4. Comparison of generalisability (σ=25; PSNR, SSIM)

      Scene | U-net | TSA-net | PnP-HSI | Ours
      Scene1 | 25.89 dB, 0.824 | 25.88 dB, 0.838 | 29.52 dB, 0.873 | 31.13 dB, 0.890
      Scene2 | 36.41 dB, 0.931 | 35.49 dB, 0.927 | 37.54 dB, 0.957 | 39.59 dB, 0.957
      Scene3 | 31.78 dB, 0.887 | 33.80 dB, 0.904 | 36.23 dB, 0.919 | 35.84 dB, 0.912
      Scene4 | 31.03 dB, 0.901 | 32.03 dB, 0.917 | 34.80 dB, 0.934 | 35.21 dB, 0.929
      Scene5 | 20.14 dB, 0.773 | 20.50 dB, 0.787 | 21.56 dB, 0.798 | 23.84 dB, 0.834
      Mean | 29.05 dB, 0.863 | 29.54 dB, 0.874 | 31.93 dB, 0.899 | 33.12 dB, 0.904
      Difference | -1.04 dB, -0.017 | -1.02 dB, -0.012 | -0.04 dB, -0.001 | -0.03 dB, -0.003
      Percentage | -3.46%, -1.93% | -3.34%, -1.35% | -0.13%, -0.11% | -0.09%, -0.33%

    From Table 4, it is evident that the performance of all algorithms declines when tested on noisy data. However, the degradation for the two algorithms based on the deep plug-and-play framework is considerably smaller than that of the two end-to-end neural networks, indicating that the self-supervised learning model is more robust and generalizes better. Building on these findings, the self-supervised reconstruction algorithm proposed in this article, based on the deep plug-and-play framework, demonstrates superior performance across multiple indicators. Importantly, this method does not require clean data as labels for training, thereby alleviating the challenge of insufficient training data for compressive sensing deep learning reconstruction models in practical applications.
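    The Difference and Percentage rows of Table 4 are simply the absolute and relative drop of each method's mean score when moving from clean to noisy test data. As a quick check (the helper name is ours; U-net's clean-data mean PSNR is recovered from the table as 29.05 + 1.04 = 30.09 dB):

```python
def robustness_drop(clean_mean, noisy_mean):
    """Absolute drop and relative drop (in percent) of a mean metric under noise."""
    drop = noisy_mean - clean_mean
    return drop, 100.0 * drop / clean_mean

# U-net mean PSNR: 30.09 dB on clean data vs 29.05 dB on noisy data (Table 4).
drop, pct = robustness_drop(30.09, 29.05)
```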

    3.5 Ablation experiment

    3.5.1 Hyperparameterisation γ

    In Section 3.2, we verified the effect of the hyperparameter γ on the model's denoising results. Here, we explore its impact on hyperspectral image reconstruction. Using the denoising model from Section 3.2 directly and keeping the other parameters of the PnP-Self-HSIDeCNN algorithm unchanged, we present the reconstruction results in Table 5.

      Table 5. Influence of hyperparameter γ on reconstruction results (PSNR/dB, SSIM)

      Scene | γ=0 | γ=1 | γ=2 | γ=5 | γ=20
      Scene1 | 30.92, 0.862 | 30.42, 0.862 | 30.92, 0.878 | 29.70, 0.785 | 29.89, 0.855
      Scene2 | 37.19, 0.921 | 38.54, 0.940 | 40.19, 0.951 | 35.50, 0.921 | 37.67, 0.944
      Scene3 | 33.93, 0.888 | 36.28, 0.908 | 36.36, 0.913 | 32.74, 0.888 | 34.28, 0.892
      Scene4 | 33.98, 0.883 | 35.07, 0.899 | 35.47, 0.894 | 33.19, 0.861 | 35.38, 0.916
      Scene5 | 22.97, 0.769 | 23.86, 0.833 | 23.80, 0.843 | 23.29, 0.791 | 23.18, 0.803
      Mean | 31.79, 0.865 | 32.83, 0.888 | 33.35, 0.896 | 30.88, 0.849 | 32.08, 0.882

    For scenes 1 to 4, an appropriate penalty term can enhance the model's reconstruction quality. However, excessively large values of γ degrade the model's denoising effect (even below the no-penalty baseline), which in turn affects the final hyperspectral compressive sensing reconstruction. For scene 5, by contrast, the model performs better whenever γ is nonzero; we attribute this to scene 5's rich colour information and more complex spatial-spectral curves, for which the penalty term provides useful regularization. Comparing Tables 2 and 5 reveals that the model's performance on the denoising task and on the image reconstruction task are not fully correlated: on the denoising task the model is best at γ=5, whereas on the reconstruction task it is best at γ=2. In summary, the value of γ should be determined according to the specific usage scenario.
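    The penalty factor γ weights a regularization term in the Neighbor2Neighbor-style self-supervised loss used for training. The sketch below illustrates the idea with a fixed 2x2 neighbor-subsampling pattern and a generic denoise callable; the function names and the fixed (rather than random) sampling pattern are our simplifications, not the paper's exact implementation:

```python
import numpy as np

def neighbor_subsample(img):
    """Split each 2x2 block into two neighboring sub-images (fixed pattern for illustration)."""
    return img[0::2, 0::2], img[0::2, 1::2]

def self_supervised_loss(denoise, noisy, gamma=2.0):
    """Data term between a denoised sub-image and its noisy neighbor, plus a
    gamma-weighted consistency penalty against full-image denoising."""
    g1, g2 = neighbor_subsample(noisy)
    residual = denoise(g1) - g2
    # In training this branch is treated as a constant (no gradient flows through it).
    d1, d2 = neighbor_subsample(denoise(noisy))
    penalty = residual - (d1 - d2)
    return float((residual ** 2).mean() + gamma * (penalty ** 2).mean())
```

    With γ=0 this reduces to a pure Noise2Noise-style loss between neighboring pixels; a larger γ pushes the sub-image denoising toward consistency with full-image denoising, matching the trade-off observed in Table 5.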

    3.5.2 Noise estimation level σ

    In the aforementioned comparisons, the estimated noise level σ is set to 30 irrespective of the noise degradation model. This choice allows the model to be used for reconstruction directly, without a tedious parameter tuning process. Consequently, the reconstruction results of the PnP-Self-HSIDeCNN algorithm reported above represent expected performance in practical use rather than optimal performance. To explore the model's optimal performance, Gaussian fixed noise (σ=25) is used during training, and two noise estimation strategies are employed during testing: a fixed noise level (100, 50, 30, and 5, respectively) and a dynamically estimated level that gradually decreases as the iterations progress. The experimental results are presented in Table 6 and Table 7.

      Table 6. Influence of noise estimation level σ on reconstruction results (PSNR/dB, SSIM)

      Scene | σ=100 | σ=50 | σ=30 | σ=5 | σ∈(0, 50] (dynamic)
      Scene1 | 27.57, 0.750 | 30.21, 0.852 | 31.00, 0.888 | 31.28, 0.897 | 32.01, 0.899
      Scene2 | 33.56, 0.921 | 38.16, 0.941 | 39.38, 0.955 | 40.18, 0.973 | 40.32, 0.977
      Scene3 | 32.00, 0.888 | 34.69, 0.888 | 35.77, 0.911 | 36.20, 0.928 | 36.83, 0.932
      Scene4 | 31.87, 0.861 | 34.59, 0.912 | 35.13, 0.927 | 35.50, 0.929 | 36.14, 0.941
      Scene5 | 21.58, 0.749 | 22.97, 0.803 | 23.75, 0.832 | 24.29, 0.850 | 25.88, 0.859
      Mean | 29.32, 0.834 | 32.12, 0.879 | 33.01, 0.903 | 33.49, 0.915 | 34.23, 0.921
      Table 7. Influence of noise estimation level σ on reconstruction time

      Scene | σ=100 | σ=50 | σ=30 | σ=5 | σ∈(0, 50] (dynamic)
      Scene1 | 53.10 s | 96.57 s | 99.09 s | 105.39 s | 117.74 s
      Scene2 | 51.21 s | 91.74 s | 95.67 s | 101.17 s | 110.92 s
      Scene3 | 48.69 s | 84.51 s | 89.87 s | 94.28 s | 105.03 s
      Scene4 | 51.84 s | 91.35 s | 96.27 s | 103.77 s | 116.19 s
      Scene5 | 55.62 s | 104.61 s | 112.84 s | 119.07 s | 145.56 s
      Mean | 52.09 s | 93.76 s | 98.75 s | 104.74 s | 119.09 s

    When σ=100, the estimate significantly exceeds the noise level present in the denoising model's training data; the iterations converge quickly but the reconstruction quality is poor. Smaller σ values yield better reconstructions but require longer runtimes. Compared with fixed noise estimates, letting the noise estimate decrease gradually with the iterations achieves the best reconstruction results, albeit at the expense of the longest runtime. In summary, σ should be selected according to the requirements at hand; a larger σ can be used to expedite reconstruction when strict reconstruction quality is not necessary.
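    The dynamic strategy lets the noise estimate decay across the plug-and-play iterations instead of fixing it. One simple realization is a geometric decay from σ_max toward σ_min; the decay law here is our assumption, not necessarily the paper's exact schedule:

```python
import numpy as np

def sigma_schedule(n_iters, sigma_max=50.0, sigma_min=1.0):
    """Noise-level estimate decaying geometrically over the PnP iterations."""
    t = np.arange(n_iters) / (n_iters - 1)          # 0 -> 1 across iterations
    return sigma_max * (sigma_min / sigma_max) ** t  # sigma_max -> sigma_min
```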

    3.5.3 Number of warm start iterations

    Deep plug-and-play frameworks typically require a warm start to expedite convergence. In the aforementioned experiments, the GAP-TV denoiser is employed for 80 iterations before transitioning to the self-supervised learning model to achieve improved reconstruction results. However, the GAP-TV denoiser does not necessarily need 80 iterations to reach its denoising performance limit, so reducing the number of warm-start iterations could accelerate reconstruction; ending the warm start prematurely, though, may result in a suboptimal final reconstruction. To investigate this, we select a scene with a fixed noise estimation level of 30 and vary the number of warm-start iterations to study their effect on reconstruction quality and speed. The experimental results are presented in Table 8.

      Table 8. Influence of the number of warm start iterations

      Warm start iterations | Warm start time / s | Self-supervised reconstruction time / s | Total reconstruction time / s | PSNR / dB | SSIM
      0 | 0 | 39.06 | 39.06 | 33.79 | 0.828
      10 | 5.24 | 40.95 | 46.19 | 34.33 | 0.867
      20 | 10.04 | 39.69 | 49.73 | 34.43 | 0.867
      30 | 15.76 | 40.95 | 56.71 | 34.44 | 0.871
      40 | 20.68 | 40.32 | 61.00 | 34.42 | 0.869
      50 | 26.41 | 40.32 | 66.73 | 34.25 | 0.869
      60 | 31.23 | 39.69 | 70.92 | 34.21 | 0.867
      70 | 36.44 | 39.69 | 76.13 | 34.17 | 0.864
      80 | 41.67 | 40.32 | 81.99 | 34.15 | 0.869
      90 | 46.88 | 40.95 | 87.83 | 34.12 | 0.869
      100 | 52.27 | 39.69 | 91.96 | 34.12 | 0.867

    As can be seen from Table 8, regardless of the number of warm-start iterations, the running time of the image reconstruction phase of PnP-Self-HSIDeCNN using the deep model is essentially unchanged, so the total running time is primarily determined by the warm-start time. The total reconstruction time is shortest with no warm start, but this compromises the final reconstruction quality. Quality peaks at around 30 warm-start iterations; beyond that, it declines slightly while the total reconstruction time continues to grow. Therefore, in practice, the number of warm-start iterations can be set to 10-20 to accelerate reconstruction while ensuring satisfactory reconstruction quality.
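    The warm-start logic amounts to switching the plug-and-play denoiser after the first K iterations. Below is a schematic of the loop, with hypothetical forward/adjoint operators standing in for the CASSI sensing model and a plain gradient-style data step rather than the paper's exact GAP update:

```python
def pnp_reconstruct(y, forward, adjoint, tv_denoise, deep_denoise,
                    n_warm=20, n_total=100):
    """Plug-and-play loop: TV denoising for the first n_warm iterations
    (the warm start), then the self-supervised deep denoiser."""
    x = adjoint(y)                          # initial estimate from the measurement
    for k in range(n_total):
        x = x + adjoint(y - forward(x))     # data-fidelity step
        x = tv_denoise(x) if k < n_warm else deep_denoise(x)
    return x
```

    Because the deep-denoising phase dominates total time (Table 8), shrinking n_warm shortens the run almost linearly in the warm-start time while leaving the per-iteration cost of the second phase unchanged.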

    4 Conclusions

    In this paper, we propose a self-supervised learning method, PnP-Self-HSIDeCNN, based on a deep plug-and-play framework and a neighborhood sampling strategy. Unlike traditional methods, it requires only noisy images for training and fully exploits the inter-spectral correlation of hyperspectral images through the SENet structure. Compared with previous end-to-end neural network reconstructions in terms of visual quality, PSNR, and SSIM, the proposed self-supervised method achieves commendable reconstruction results with robust generalization within acceptable runtimes. We conduct a detailed and comprehensive comparison of three hyperparameters (the loss-function penalty factor γ, the noise estimation level σ, and the number of warm-start iterations) to fully explore their impact on reconstruction speed and quality, thereby laying a foundation for the algorithm's future practical applications. Our findings demonstrate that self-supervised learning can yield satisfactory performance even with limited or poor-quality data, providing a feasible approach for the future application of compressed sensing and CASSI systems in real-world scenes. However, compared with end-to-end neural networks, our approach lacks real-time performance, and real-world noise is far more complex than Gaussian and Poisson noise. In future work, we aim to reduce the iteration time of our method and extend it to real-world data.

    [1] Jian-Yu Wang, Rong Shu, Yin-nian Liu et al. Science Press.

    [2] P Z Wu. Characteristics and applications of satellite-borne hyperspectral imaging spectrometer. Remote Sensing of Land Resources, 10(1999).

    [3] Z Y Ouyang. Scientific objectives of China's lunar exploration project. Proceedings of the Chinese Academy of Sciences.

    [4] X Dun, Q Fu, H T Li et al. Advances in the frontiers of computational imaging. Chinese Journal of Image Graphics, 27, 37(2022).

    [22] Z Chen, J Cheng. Proximal gradient descent unfolding dense-spatial spectral-attention transformer for compressive spectral imaging. arXiv preprint(2023).

    [29] J Lehtinen, J Munkberg, J Hasselgren et al. Noise2Noise: learning image restoration without clean data. arXiv preprint(2018).

    Paper Information

    Category: Interdisciplinary Research on Infrared Science

    Received: Feb. 29, 2024


    Published Online: Dec. 13, 2024

    The Author Email: LI Chun-Lai (lichunlai@mail.sitp.ac.cn), LIU Shi-Jie (liushijie@ucas.ac.cn)

    DOI:10.11972/j.issn.1001-9014.2024.06.016
