Advanced Photonics Nexus, Volume 4, Issue 5, 056002 (2025)

Snapshot multispectral imaging through defocusing and a Fourier imager network

Xilin Yang1,2,3, Michael John Fanous1, Hanlong Chen1,2,3, Ryan Lee1, Paloma Casteleiro Costa1,2,3, Yuhang Li1,2,3, Luzhe Huang1,2,3, Yijie Zhang1,2,3, and Aydogan Ozcan1,2,3,*
Author Affiliations
  • 1University of California, Los Angeles, Electrical and Computer Engineering Department, Los Angeles, California, United States
  • 2University of California, Los Angeles, Bioengineering Department, Los Angeles, California, United States
  • 3University of California, Los Angeles, California NanoSystems Institute (CNSI), Los Angeles, California, United States

    Multispectral imaging, which simultaneously captures the spatial and spectral information of a scene, is widely used across diverse fields, including remote sensing, biomedical imaging, and agricultural monitoring. We introduce a snapshot multispectral imaging approach employing a standard monochrome image sensor with no additional spectral filters or customized components. Our system leverages the inherent chromatic aberration of wavelength-dependent defocusing as a natural source of physical encoding of multispectral information; this encoded image information is rapidly decoded via a deep learning-based multispectral Fourier imager network (mFIN). We experimentally tested our method with six illumination bands and demonstrated an overall accuracy of 98.25% for predicting the illumination channels at the input and achieved a robust multispectral image reconstruction on various test objects. This deep learning-powered framework achieves high-quality multispectral image reconstruction using snapshot image acquisition with a monochrome image sensor and could be useful for applications in biomedicine, industrial quality control, and agriculture, among others.

    1 Introduction

    Multispectral imaging, with the capacity of concurrently capturing the spatial and spectral information of objects, has become an indispensable tool in various fields, including remote sensing,1–3 biomedical imaging,4–7 and agricultural and food monitoring,8–12 among others.13,14 However, unlike standard color image sensors that utilize a Bayer filter array to capture the red, green, and blue (RGB) channels, multispectral imaging typically requires bulkier optical setups to separately capture spatial data within each wavelength channel while minimizing spectral crosstalk among the channels of interest. One of the approaches for multispectral imaging involves scanning systems with single-pixel sensors or one-dimensional sensor arrays coupled with wavelength-selective components like color filters15,16 or prisms.17 However, these scanning systems acquire spatial and spectral data sequentially, which compromises frame rate, increases exposure time, and degrades signal-to-noise ratio, thereby limiting their applicability in real-time imaging scenarios. Nonscanning systems, commonly referred to as snapshot imagers,18 aim to capture spatial and spectral information in a single exposure. Examples of snapshot imagers include color-coded apertures,19 Fabry-Pérot filters,20,21 spectral filter arrays,22,23 metasurfaces,24,25 and diffractive optical elements.26–28 Despite their ability to perform real-time imaging, most snapshot multispectral imaging systems require additional components that are relatively complex and less accessible compared with commercially available image sensors used in everyday applications.

    Here, we present a single-shot multispectral imaging system that utilizes a simple monochrome image sensor without any spectral filters or customized components. Our method leverages propagation-induced defocusing as a physical encoding mechanism, significantly reducing the complexity and cost typically associated with traditional multispectral imaging systems. For the multispectral image decoding, we employ a trained neural network termed multispectral Fourier imager network (mFIN), which is optimized to rapidly reconstruct the multispectral image information embedded in the defocused image captured using a monochrome image sensor. We experimentally validated our approach by acquiring 964 multispectral images, where each object was acquired by illuminating a pattern with a distinct combination of light-emitting diodes (LEDs). The entire dataset was divided into 888 multispectral objects used to train the mFIN model, and the remaining 76 multispectral objects, unseen during training, were employed to blindly evaluate the model’s ability to accurately reconstruct images and their corresponding color channels. Our results demonstrate that this multispectral imaging system achieved an overall accuracy of 98.25% in predicting the correct illumination channels of the test objects using a monochrome image sensor array. In addition, the reconstructed multispectral images of test objects exhibit a decent structural quality, as evidenced by various quantitative image metrics including the structural similarity index measure (SSIM) averaging 0.61±0.09, peak signal-to-noise ratio (PSNR) averaging 11.48±1.79 dB, root mean square error (RMSE) averaging 0.27±0.05, and normalized mean squared error (NMSE) averaging 0.077±0.029. These experimental results confirm the effectiveness of our snapshot computational multispectral imaging method using a monochrome image sensor. Our study presents an efficient multispectral imaging solution that integrates natural image defocusing as a physical encoding scheme with a deep learning-based spectral decoding framework. This integration not only simplifies multispectral imaging hardware but also maintains high-quality image reconstructions through snapshot image acquisition, potentially enabling broader applications and enhancing the accessibility of multispectral imaging technologies.

    2 Results

    Figure 1 illustrates our system setup, depicting both the schematic layout [Fig. 1(a)] and the physical components of the system [Fig. 1(b)]. To achieve multispectral imaging, we implemented a custom-designed ring-shaped LED array comprising six different types of LEDs that span the visible spectrum. The multispectral illumination is directed through a collimator, diffuser, and pinhole, ultimately reaching a digital micromirror device (DMD). The DMD projects the spatial information of the image, referred to as the image pattern, thereby forming the final multispectral image. A 3× demagnifying lens pair then focuses this multispectral image onto a specific region of a monochrome complementary metal-oxide-semiconductor (CMOS) image sensor. Due to the inherent chromatic aberration of the optical elements and dispersion, different illumination wavelengths result in varying focal planes. Consequently, our system captures each spectral channel with a distinct level of image defocus and blur, resulting in physical defocus-based multispectral image encoding. This defocused and blurred pattern that is captured by a monochrome image sensor is subsequently decoded by a deep learning-based neural network model, enabling the reconstruction of the multispectral image of an unknown pattern. This multispectral image reconstruction is performed using an mFIN, which features residual connections with channel repetition from the monochrome image input to the multispectral image output, ensuring the alignment of the channel dimensions (see Sec. 4 for details).
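    To make the encoding step concrete, the following Python sketch illustrates the idea of wavelength-dependent defocus acting as a spectral mixer; the Gaussian blur widths are arbitrary placeholders standing in for the actual, unmeasured chromatic point spread functions of the lens pair, and the listed center wavelengths are nominal values.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

# Illustrative model of defocus-based spectral encoding: each spectral channel
# of the projected pattern is blurred by a wavelength-dependent kernel (here a
# Gaussian whose width stands in for chromatic defocus), and the blurred
# channels sum incoherently into a single monochrome snapshot.
DEFOCUS_SIGMA = {397: 4.0, 460: 1.0, 522: 2.0, 590: 3.5, 605: 2.5, 623: 1.2}  # placeholder widths

def encode_snapshot(channels):
    """channels: dict mapping center wavelength (nm) -> 2D intensity pattern."""
    h, w = next(iter(channels.values())).shape
    snapshot = np.zeros((h, w), dtype=np.float32)
    for wavelength, img in channels.items():
        snapshot += gaussian_filter(img.astype(np.float32), sigma=DEFOCUS_SIGMA[wavelength])
    return snapshot / max(float(snapshot.max()), 1e-8)  # normalized monochrome image

# Example: one pattern illuminated by the 460 and 623 nm LEDs simultaneously.
pattern = np.zeros((512, 512), dtype=np.float32)
pattern[200:300, 200:300] = 1.0
mono = encode_snapshot({460: pattern, 623: pattern})
```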

    Figure 1.(a) Schematic layout: overview of the snapshot multispectral imaging system, illustrating the arrangement of the circular LED array, the digital micromirror device (DMD), optical components (collimator, diffuser, and lenses), and the monochrome imaging camera. (b) Experimental setup.

    A total of 964 objects from 120 unique image patterns of test objects (images of human lung tissue samples; see Sec. 4) were imaged, each captured under a distinct illumination condition using various combinations of six LEDs. Of these, 888 multispectral objects from 114 patterns were utilized in the training process of the mFIN model. The training optimized the mFIN network by minimizing the discrepancy between the multispectral images reconstructed by the network and the ground truth image patterns (projected by the DMD), in which each spectral channel contains the pattern only if its corresponding LED was turned on. Further details regarding the dataset and training methodology are provided in Sec. 4.

    Following the network training, we performed a blind evaluation on an independent test set composed of 76 multispectral objects generated from six image patterns that were not used during training. We emphasize that the train–test split was enforced at the pattern level: no image pattern used for training appears in the test set, ensuring a rigorous assessment of the model’s generalization capability. We first evaluated our system’s ability to accurately reconstruct images of test objects and their corresponding color channels under a single-LED illumination, with the results presented in Fig. 2. The left column displays the snapshot grayscale images (resulting from the monochrome image sensor), which exhibit varying degrees of defocus, aberrations, and intensity fluctuations across different illumination wavelengths. For each test object, mFIN accurately predicted both the corresponding illumination wavelength and the underlying image pattern, where each illumination channel was denoted by the center wavelength of its corresponding LED; see Fig. 2.
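    As a minimal illustration of this pattern-level split (hypothetical bookkeeping code, not the authors' pipeline), one can hold out entire DMD patterns so that no test pattern is ever seen during training:

```python
import random

def split_by_pattern(samples, test_pattern_ids):
    """samples: list of dicts with keys 'pattern_id', 'input', and 'target'."""
    train = [s for s in samples if s["pattern_id"] not in test_pattern_ids]
    test = [s for s in samples if s["pattern_id"] in test_pattern_ids]
    return train, test

# Hold out 6 of the 120 unique DMD patterns for blind testing; every
# acquisition (any LED combination) of a held-out pattern goes to the test set.
test_pattern_ids = set(random.sample(range(120), 6))
```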

    Figure 2.(a) and (b) Visualizations of image reconstruction results of two test image patterns illuminated under different wavelengths. Each column refers to an image with the monochrome defocused sensor input, network output, and multispectral target (ground truth). Pseudo colors represent the corresponding illumination LED color (center wavelength).

    In the subsequent phase of the blind testing, we increased the complexity of the multispectral image reconstruction task by testing new image patterns under simultaneous illumination by multiple LEDs, covering various combinations of wavelengths. Figure 3 showcases the test image patterns illuminated by different LED configurations. The input monochrome images captured by the CMOS imager (top row) exhibit significant spatial variations and aberrations due to the superposition of multiple illumination wavelengths. The mFIN-reconstructed output images (left column) are juxtaposed with the target images (right column), revealing a high degree of concordance between the two. These experimental results further confirm that mFIN can blindly generate accurate multispectral image reconstructions from snapshot monochrome images, demonstrating its robustness.

    Figure 3.Visualization of image reconstruction results under various illumination configurations. Each column displays the predicted image alongside the corresponding target (ground truth) image for different combinations of activated LEDs.

    To quantitatively evaluate the performance of our model, we first assessed its spectral channel accuracy using a classification-based approach, followed by an analysis of the structural image metrics for the spatial quality of the reconstructed multispectral image features. For the spectral channel accuracy classification, each illumination LED channel was treated as a binary state: the LED was either on during the test object illumination, contributing to the snapshot monochrome image, or off, leaving the spectral channel of this object blank. Consequently, spectral prediction errors were categorized into two types: false positives (termed “leakages”), where the model incorrectly predicted an image in a blank/off spectral channel, and false negatives (termed “losses”), where the model failed to generate a spectral image or produced one with low intensity in a spectral channel that should contain an image. The accuracy of the spectral predictions was determined by applying thresholds on an energy difference term defined as $I_{\rm dif} = (I_{\rm pred} - I_{\rm target})/I_{\rm target}$, where $I_{\rm pred}$ represents the predicted image intensity of a given output channel, calculated as the sum of all the pixel values in the predicted image, and $I_{\rm target}$ denotes the sum of the pixel intensity values of the ground truth image pattern [see Fig. 4(a) and Sec. 4 for details]. For statistical analysis, each sample pattern under a given illumination configuration was treated as six independent entries—one per channel—resulting in a total of 456 data points derived from 76 test objects never seen during training. Our multispectral image reconstruction model reached an overall spectral accuracy of 98.25% across all data points in the test dataset. The confusion matrix in Fig. 4(a) (see Sec. 4 for additional evaluation protocol) offers a detailed view of the spectral classification performance. Remarkably, the network produced zero false-negative detections and only eight false-positive instances out of 456 channel predictions—corresponding to an error rate of 1.8%—highlighting the model’s reliability in spectral-channel prediction.

    Figure 4.Image reconstruction performance analysis. (a) Confusion matrix of spectral channel accuracy. (b) Classification metrics against the number of concurrent LED illuminations: Plots of sensitivity, specificity, and F1 score against the number of concurrent LED illuminations. (c) Energy difference values as a function of the number of concurrent LED illuminations; boxplots depict the distribution of energy differences for various numbers of concurrent illuminations, indicating the shift in the prediction accuracy and variance. (d) Energy difference values as a function of the illumination wavelengths. We used light green to signify “losses” and light blue for “leakages.”

    Next, we analyzed the model’s performance relative to the number of concurrent illuminations by calculating sensitivity, specificity, and F1 score values (see Sec. 4) across varying illumination configurations. As illustrated in Fig. 4(b), the network achieves very good performance under single-LED illumination, reaching 100% sensitivity, specificity, and F1 score, i.e., no classification errors. When the illumination becomes more complex—with two to five LEDs active simultaneously—occasional misclassifications emerge; nevertheless, all three performance metrics remain consistently high (above 90%), and the specificity stays at 100%. These results confirm the model’s ability to maintain robust spectral predictions even in the presence of significant spectral overlap and multiplexed illumination.

    Classifying the spectral predictions as correct or incorrect [Figs. 4(a) and 4(b)] provides only a binary assessment of performance and does not fully capture the nuances of the reconstructed image quality. To address this, Fig. 4(c) presents the energy difference levels for all test data points using boxplots, categorized by the number of concurrent illuminations. For enhanced visibility, a zoomed-in version of the y-axis is also shown on the right. The mean energy difference ($I_{\rm dif}$) remains centered around zero as the number of simultaneously active LEDs increases; however, the distribution gradually broadens, signaling a larger standard deviation. This widening spread reflects the added difficulty of disentangling more complex spectral superpositions recorded by the monochrome sensor. Importantly, the vast majority of data points still reside within ±0.05 of the ground-truth energy, indicating that reconstruction errors remain well-bounded even under multiplexed illumination. Collectively, these results confirm that the proposed framework retains a high degree of reliability and effectiveness when reconstructing multispectral images in increasingly challenging illumination scenarios.

    Next, we analyzed how the wavelength of the illumination affects the multispectral image reconstruction performance. Similar to the earlier analyses, each test image from a single spectral channel was treated as an independent data point, resulting in six data points per test pattern—one for each spectral channel. We then grouped these data points by their corresponding illumination wavelengths and plotted the energy difference levels in Fig. 4(d). This wavelength-resolved analysis reveals only minor variability in the reconstruction accuracy across the spectrum. At 397 nm, the distribution of the energy-difference metric $I_{\rm dif}$ is shifted slightly above zero, indicating a modest increase in false positives (“leakages”) where the network hallucinates content in an otherwise blank channel. Conversely, at 623 nm, the distribution is biased below zero, leading to a higher incidence of false negatives (“losses”) in which the expected content is missed. The remaining channels cluster tightly around $I_{\rm dif}=0$ with very low error rates and fall in the ±0.05 range. Despite these minor differences, the overall reconstruction fidelity remains high, confirming the method’s robust, wavelength-agnostic performance. The residual discrepancies can be attributed to chromatic-aberration-induced focal-plane shifts. We also evaluated the quality of the reconstructed images using various structural image metrics, including RMSE, SSIM, PSNR, and NMSE (refer to Sec. 4 for details). When assessing these metrics, we focused exclusively on true positive predictions, as leakages or losses cannot be used for quantitative evaluation of image quality. In its blind testing, our model achieved an average SSIM of 0.61±0.09 and a PSNR of 11.48±1.79 dB. In addition, the RMSE averaged 0.27±0.05, and the NMSE metric averaged 0.077±0.029 for the test objects. These results, characterized by low standard deviations, demonstrate a repeatable performance in our model’s multispectral image reconstructions. In Figs. 5(a)–5(d), we further report the distributions of these four quality metrics across different wavelengths. The four image-quality metrics exhibit a consistent trend: reconstructions at 460 and 623 nm outperform the other bands, mirroring the spectral-accuracy results reported earlier. By contrast, the 397 and 590 nm channels show a modest—but still acceptable—decrease in performance. This wavelength-dependent behavior stems from the chromatic-aberration-induced defocus. Inspection of the raw monochrome snapshots in Fig. 2 corroborates this: the 397 and 590 nm channels exhibit markedly greater blur compared to the 460 and 623 nm channels. Consequently, the reconstruction fidelity is higher for the minimally defocused bands and slightly reduced for those with more pronounced optical blur. Crucially, this property also confers application-specific flexibility. By selecting the focal plane to favor a particular band, one can prioritize the image quality of the most mission-critical wavelength.

    Figure 5.Distributions of (a) PSNR, (b) RMSE, (c) SSIM, and (d) NMSE metrics for the reconstructed images at different illumination wavelengths.

    Finally, we evaluated the reconstructed multispectral image quality and the resulting confusion matrices in relation to the number of concurrent LED illuminations, as depicted in Fig. 6. The performance remained relatively stable across varying levels of multiplexed illumination. Despite minor discrepancies observed, the presented snapshot multispectral imaging system performs well across all the tested wavelengths, as indicated by the quantitative structural image quality metrics, confirming the robustness of the reconstruction process across the visible spectrum.

    Figure 6.(a) Distributions of structural quality metrics for the reconstructed multispectral images with respect to the number of concurrent LED illuminations. (b) Confusion matrices with different numbers of concurrent LED illuminations.

    To further evaluate the limits of our approach, we designed a testing scenario involving dual-channel illumination, where two distinct LEDs simultaneously illuminate different image patterns. The training dataset was constructed by randomly selecting two multispectral samples, each captured under a single LED channel, along with their corresponding monochrome inputs. The two monochrome inputs were then summed and normalized to numerically simulate a dual-illumination snapshot containing two distinct image patterns, and their spectral targets were combined accordingly. To ensure diversity and prevent redundancy, pairs with overlapping spectral channels were resampled. During blind testing, the model was evaluated on entirely unseen combinations of image patterns, ensuring strict separation between the training and test sets. As shown in Fig. 7, our model accurately predicted the active spectral channels and successfully disentangled and reconstructed each component with high fidelity. We emphasize that this setup represents a highly ill-posed inverse problem, making it particularly challenging. A few representative failure cases—such as spectral image pattern cross-talk, missed channels/losses, and incorrect channel predictions—are also presented in Supplementary Fig. S1. Nonetheless, the reconstructions confirm the framework’s robustness even under these more challenging conditions. Additional performance gains can be achieved through tailored physical-encoding designs, as detailed in Sec. 3.
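    A minimal sketch of this pairing procedure is given below, assuming each single-LED sample is stored as a monochrome input tensor of shape (1, H, W) and a six-channel spectral target of shape (6, H, W); the variable names are ours and the normalization choice is an assumption rather than the authors' exact recipe.

```python
import torch

def make_dual_sample(sample_a, sample_b):
    """Combine two single-LED acquisitions into one synthetic dual-LED sample."""
    x_a, y_a = sample_a            # monochrome input (1, H, W), spectral target (6, H, W)
    x_b, y_b = sample_b
    ch_a = int(torch.argmax(y_a.sum(dim=(1, 2))))   # active channel of sample A
    ch_b = int(torch.argmax(y_b.sum(dim=(1, 2))))   # active channel of sample B
    if ch_a == ch_b:
        raise ValueError("overlapping spectral channels; resample the pair")
    x = x_a + x_b                                   # superpose the monochrome snapshots
    x = x / x.max().clamp(min=1e-8)                 # renormalize the combined input
    y = y_a + y_b                                   # union of the two spectral targets
    return x, y
```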

    Figure 7.Reconstruction results on synthetic test data with dual-LED spectral illumination on different patterns. Each test input simulates a monochrome snapshot formed by the superposition of two independent single-channel illuminations on two distinct image patterns. The mFIN-generated spectral reconstructions (left) are compared against the combined ground truth targets (right, in each column).

    3 Discussion

    The presented method leverages chromatic aberration-induced defocusing as a physical encoding mechanism, significantly reducing the complexity and cost typically associated with traditional multispectral imaging systems. By eliminating the need for additional optical elements such as spectral filters, prisms, or diffractive optical components, our approach not only simplifies the imaging setup but also enhances accessibility and practical applicability across diverse imaging scenarios. One additional advantage of our framework is that, being data-driven, the encoding scheme does not require rigorous analytical formulation or precise knowledge of the optical properties and aberrations of the imaging system. Instead, our deep learning model learns the inverse mapping implicitly, functioning as a decoder for the optically encoded measurements, even in the presence of unknown sources of aberrations or misalignments. This learning-based decoding is realized through our mFIN model, which enables efficient and robust spectral reconstruction—delivering rapid, accurate, and high-quality multispectral images suitable for potential applications in biomedical imaging, agricultural monitoring, and industrial quality control, among others.

    In some recent works, spectrometers have been utilized to enhance29 or quantitatively evaluate30 the performance of multispectral imaging systems. For instance, ground-truth spectral information from unknown objects was acquired by scanning the scene with a spectrometer to validate the accuracy of a multispectral imager.30 Different from these earlier works, our demonstrated approach operates with discrete, predetermined spectral channels whose ground truth originates from precisely controlled illumination by a programmable LED array. This choice of LED-based ground truth spectral information does not require the use of a spectrometer because the illumination spectrum of each LED in the programmable array is known a priori; this does not limit our method’s validity or applicability in real-world scenarios. To increase the spectral resolution of our multispectral imaging system, one can use programmable spectral filters during the training phase to replace the LED array with an increased number of nonoverlapping spectral channels, forming our ground truth. In a practical deployment, following the training, only a monochrome image sensor and the native chromatic aberration of an off-the-shelf lens would be sufficient—without the use of dedicated dispersive optics or spectral filter wheels. For a specific application of interest, one can also pre-select relevant spectral channels to be used in the training phase based on the application-specific requirements, such as the spectral importance or contrast correlation, allowing our method to adapt effectively to diverse multispectral imaging or sensing scenarios.

    Despite the success of our experimental results, the current implementation is also subject to hardware constraints due to the DMD that is used for training and testing our system. Therefore, extending the framework to alternative imaging hardware and objects remains an important avenue that we leave for future work. In addition, as the inverse problem becomes increasingly more complex, i.e., when addressing multispectral images with finer structural details and higher bit-depth patterns for each spectral channel, image-defocusing-based encoding of spectral information may not sufficiently capture all the information needed for high-fidelity multispectral image reconstructions. To provide better multispectral image encoding strategies, there is growing research interest in utilizing diffractive optical elements to engineer wavelength- and depth-dependent point-spread functions (PSFs), potentially enhancing how information is projected onto image sensors. When combined with deep neural network models, these engineered PSFs have the potential to significantly improve the quality of multispectral image reconstructions. For example, recent advancements have demonstrated that diffractive deep neural networks can effectively design spatially and spectrally varying PSFs for incoherent light sources31,32 with multiple diffractive layers interconnected through free-space light propagation. By incorporating such cascaded structured diffractive layers, one can replace the defocusing-based multispectral image encoding strategy with a fully programmable diffractive approach, providing greater degrees of freedom compared with traditional single-layer diffractive designs.31,33

    To further enhance the robustness and real-world applicability of our approach, future research can leverage joint optimization strategies integrating front-end optical encoding with back-end computational decoding.3436 Specifically, combining multilayer diffractive optical networks as physical encoders with dedicated multispectral image reconstruction networks as digital decoders could offer significant potential to extend the boundaries of what is achievable in multispectral and hyperspectral imaging systems. Such integrated optimization strategies of front-end optical encoder hardware and back-end digital decoder algorithms would allow for the simultaneous refinement of both the physical imaging setup and the computational back-end system, ensuring that the entire imaging pipeline is finely tuned for multispectral image reconstruction fidelity and performance.

    4 Appendix: Materials and Methods

    4.1 Network Architecture, Training, and Performance Evaluation Metrics

    The mFIN architecture is composed of multiple dynamic spatial Fourier transform (dSPAF) modules with dense connections,37 as shown in Fig. 8(a). The dense connections, also referred to as dense links, aggregate the outputs of all previous dSPAF modules into the next one [Fig. 8(b)] so that the feature maps are progressively accumulated, allowing the network to retain information flow while effectively capturing intricate spatial representations. We conducted an ablation study by removing the dense connections, which led to a significant drop in performance, with an accuracy of 73.25% and a PSNR lower than 10 dB. The results of this experiment are presented in Supplementary Fig. S2. The dSPAF module is shown in Fig. 8(c); it utilizes a shallow U-Net38 to dynamically generate spatial frequency filters, enabling adaptive processing and introducing extra degrees of freedom. Input images are first converted to the spatial frequency domain using a two-dimensional fast Fourier transform (FFT); the resulting spectra are then modulated by the dynamically generated weights from the shallow U-Net. Subsequently, an inverse fast Fourier transform (iFFT) is performed to reconstruct the image in the spatial domain. The resulting spatial tensors undergo a parametric rectified linear unit (PReLU)39 activation to introduce nonlinearity. A residual connection was incorporated to provide better convergence by establishing a direct pathway from input to output; channel repetition was also employed for the monochrome input to match the desired number of output channels.
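    The following PyTorch sketch conveys our reading of such a dSPAF-style block; the shallow U-Net is replaced by a small convolutional filter generator for brevity, so this is an illustrative approximation of Fig. 8(c) rather than the released mFIN implementation.

```python
import torch
import torch.nn as nn

# Simplified dSPAF-style block: a small convolutional "filter generator"
# stands in for the shallow U-Net and predicts a complex-valued
# spatial-frequency filter, which modulates the FFT of the input before an
# inverse FFT, PReLU activation, and a residual connection.
class DSPAFBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.filter_gen = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.PReLU(),
            nn.Conv2d(channels, 2 * channels, 3, padding=1),  # real + imaginary parts
        )
        self.act = nn.PReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        c = x.shape[1]
        f = self.filter_gen(x)
        filt = torch.complex(f[:, :c], f[:, c:])   # dynamically generated frequency filter
        spec = torch.fft.fft2(x)                   # to the spatial-frequency domain
        spec = spec * filt                         # frequency-domain modulation
        out = torch.fft.ifft2(spec).real           # back to the spatial domain
        return self.act(out) + x                   # residual connection

# Channel repetition: replicate the single monochrome input to match the
# six-channel multispectral dimensionality used by the residual pathway.
mono = torch.rand(1, 1, 512, 512)
six_channel_input = mono.repeat(1, 6, 1, 1)
out = DSPAFBlock(6)(six_channel_input)
```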

    Figure 8.Network architecture. (a) mFIN. (b) Dense links: each output tensor of the dSPAF group is appended and fed to the subsequent one. (c) Detailed schematic of dSPAF modules. See Sec. 4 for details.

    Our training loss function is a weighted sum of the mean absolute error (MAE) term and the Fourier domain MAE (FDMAE) loss: $\mathrm{Loss}_{\rm mFIN} = \omega_{\rm MAE} L_{\rm MAE} + \omega_{\rm FDMAE} L_{\rm FDMAE}$, where $\omega_{\rm MAE}$ and $\omega_{\rm FDMAE}$ are the weights for each domain and are empirically set to 1 and 0.15, respectively.

    Each loss term is defined as $L_{\rm MAE} = \frac{1}{n}\sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|$ and $L_{\rm FDMAE} = \frac{1}{n}\sum_{i=1}^{n} \left| \mathcal{F}(y)_i - \mathcal{F}(\hat{y})_i \right|$, where $\mathcal{F}$ denotes the 2D discrete Fourier transform, $y$ represents the ground truth, $\hat{y}$ represents the output of the network, and $n$ is the total number of pixels. We conducted an ablation study on the FDMAE loss component, with the results presented in Supplementary Fig. S3. This analysis revealed that the training process is highly sensitive to the choice of FDMAE loss weight. To systematically examine this effect, we trained four separate models using different weight configurations and compared their spectral classification accuracy and structural reconstruction quality, as summarized in Supplementary Fig. S4. Models trained with either very small ($\omega_{\rm FDMAE}=0.001$) or zero FDMAE loss weight exhibited significantly degraded performance. By contrast, assigning a larger weight ($\omega_{\rm FDMAE}=0.15$) resulted in the highest spectral channel accuracy, whereas a more moderate weight ($\omega_{\rm FDMAE}=0.015$) yielded slightly superior image quality while sacrificing some channel prediction accuracy. This trade-off highlights the flexibility of our approach, allowing users to prioritize either accuracy or visual fidelity depending on the specific requirements of their downstream applications.
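    Expressed in code (a direct re-statement of the two loss terms above, not the authors' training script), the combined objective can be written as:

```python
import torch

# Weighted MAE + Fourier-domain MAE loss, following the definitions above.
def mfin_loss(y_hat: torch.Tensor, y: torch.Tensor,
              w_mae: float = 1.0, w_fdmae: float = 0.15) -> torch.Tensor:
    l_mae = torch.mean(torch.abs(y - y_hat))
    # 2D FFT over the spatial dimensions; compare the complex spectra by the
    # magnitude of their difference, averaged over all elements.
    spec_y = torch.fft.fft2(y)
    spec_y_hat = torch.fft.fft2(y_hat)
    l_fdmae = torch.mean(torch.abs(spec_y - spec_y_hat))
    return w_mae * l_mae + w_fdmae * l_fdmae
```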

    We defined the energy difference for each channel as $I_{\rm dif} = (I_{\rm pred} - I_{\rm target})/I_{\rm target}$, where $I_{\rm pred} = \sum_i \hat{y}_i$ and $I_{\rm target} = \sum_i Y_i$, with $i$ indexing all pixels within the image. Note that we calculated $I_{\rm target}$ using the binary image pattern ($Y$), irrespective of whether the channel was illuminated (positive ground truth) or blank (negative ground truth). This approach ensures that $I_{\rm dif}$ can always be normalized without division by zero, even if the channel’s ground truth appears blank (i.e., the corresponding LED is not turned on). We applied two thresholds on the energy difference values to determine whether a sample is considered correctly predicted: 0.2 for false negatives and 0.1 for false positives. More specifically, for a sample where the ground truth is positive, any output with $I_{\rm dif} \le -0.2$ is considered a false negative or loss. Similarly, for a sample where the ground truth is blank, any output with $I_{\rm dif} \ge 0.1$ is considered a false positive or leakage. The thresholds were empirically selected to balance the trade-off between spectral prediction accuracy and overall image quality. In general, applying stricter thresholds (i.e., smaller absolute values) tends to reduce the reconstruction accuracy but enhances structural similarity by imposing tighter constraints on energy difference tolerances. To validate this design choice, we conducted additional experiments using four different threshold settings on the model trained with $\omega_{\rm FDMAE}=0.15$. Here, we denote the false-negative threshold as $T_{\rm FN}$, where $I_{\rm dif} \le -T_{\rm FN}$ is considered a false negative, whereas for false positives, we use $I_{\rm dif} \ge T_{\rm FP}$ as the criterion. The experimental results, summarized in Supplementary Fig. S5, confirmed the expected trade-off behavior between the accuracy and the structural fidelity.
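    A sketch of this channel-level scoring is shown below; the handling of blank channels (subtracting the channel's own ground-truth energy while normalizing by the binary pattern energy) reflects our reading of the description above and should be treated as an assumption.

```python
import torch

# Energy-difference bookkeeping for spectral-channel classification (our
# interpretation of the thresholding rule, not the authors' released code).
def classify_channel(y_hat_ch: torch.Tensor, y_ch: torch.Tensor,
                     pattern: torch.Tensor, led_on: bool,
                     t_fn: float = 0.2, t_fp: float = 0.1) -> str:
    i_pred = float(y_hat_ch.sum())
    i_target = float(y_ch.sum())          # zero for a blank channel
    i_norm = float(pattern.sum())         # binary pattern energy, never zero
    i_dif = (i_pred - i_target) / i_norm
    if led_on:
        return "loss (false negative)" if i_dif <= -t_fn else "true positive"
    return "leakage (false positive)" if i_dif >= t_fp else "true negative"
```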

    For performance evaluations, we used sensitivity, specificity, and F1 score as overall channel accuracy metrics: $\mathrm{Sensitivity} = \frac{TP}{TP + FN}$, $\mathrm{Specificity} = \frac{TN}{TN + FP}$, and $\mathrm{F1\ score} = \frac{2 \times [TP/(TP+FP)] \times [TP/(TP+FN)]}{[TP/(TP+FP)] + [TP/(TP+FN)]}$, where TP, TN, FP, and FN represent the counts of true positives, true negatives, false positives, and false negatives, respectively.
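    For completeness, these three metrics reduce to a few lines of code given the channel-level counts (assuming nonzero denominators, as is the case for the results reported here):

```python
# Sensitivity, specificity, and F1 score from channel-level counts;
# a direct re-expression of the formulas above.
def channel_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {"sensitivity": sensitivity, "specificity": specificity, "f1": f1}
```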

    For reconstructed image quality evaluations, we used four structural metrics: RMSE, SSIM, PSNR, and NMSE. Unless otherwise stated, we calculated all metrics on the raw predicted images without scaling, normalization, or binarization operations. RMSE is defined as $\mathrm{RMSE}(y,\hat{y}) = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$, where $y_i$ represents the ground truth image pixel, $\hat{y}_i$ represents the output image pixel from the model, and $i$ is the index for all pixels within the image. The NMSE metric focuses on the image contrast and uses a normalization factor $\sigma$ to rule out the impact of absolute image intensity level mismatches, i.e., $\mathrm{NMSE}(y,\hat{y}) = \mathrm{RMSE}(\sigma y, \hat{y})$, where $\sigma = \sum_i \hat{y}_i / \sum_i y_i$.

    SSIM evaluates the structural similarity between two images, defined by $\mathrm{SSIM}(y,\hat{y}) = \frac{(2\mu_y \mu_{\hat{y}} + c_1)(2\sigma_{y\hat{y}} + c_2)}{(\mu_y^2 + \mu_{\hat{y}}^2 + c_1)(\sigma_y^2 + \sigma_{\hat{y}}^2 + c_2)}$, where $\mu_y$ and $\mu_{\hat{y}}$ represent the mean pixel values of the ground truth and the output images, $\sigma_y^2$ and $\sigma_{\hat{y}}^2$ are their variances, and $\sigma_{y\hat{y}}$ is the covariance of the ground truth and the output images. $c_1$ and $c_2$ are constants, which were calculated based on the range of the images to be 6.5 and 58.5, respectively.

    The PSNR is defined as $\mathrm{PSNR} = 20 \log_{10}\!\left[\frac{1}{\mathrm{RMSE}(y,\hat{y})}\right]$.
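    The four structural metrics can be computed directly from the definitions above; the SSIM shown here uses global image statistics, whereas a windowed SSIM implementation may have been used in practice (an assumption on our part).

```python
import torch

# RMSE, NMSE, PSNR, and a global-statistics SSIM, computed on raw
# (unscaled) predictions as described above.
def rmse(y, y_hat):
    return torch.sqrt(torch.mean((y - y_hat) ** 2))

def nmse(y, y_hat):
    sigma = y_hat.sum() / y.sum()      # intensity-matching scale factor
    return rmse(sigma * y, y_hat)

def psnr(y, y_hat):
    return 20.0 * torch.log10(1.0 / rmse(y, y_hat))

def ssim(y, y_hat, c1=6.5, c2=58.5):
    mu_y, mu_h = y.mean(), y_hat.mean()
    var_y, var_h = y.var(), y_hat.var()
    cov = ((y - mu_y) * (y_hat - mu_h)).mean()
    return ((2 * mu_y * mu_h + c1) * (2 * cov + c2)) / \
           ((mu_y ** 2 + mu_h ** 2 + c1) * (var_y + var_h + c2))
```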

    The training and testing were performed on a standard workstation with a GeForce RTX 3090 Ti graphics processing unit (GPU), 256 GB of random-access memory (RAM), and an Intel Core i9 central processing unit (CPU). mFIN was implemented using Python version 3.12.0 and PyTorch40 version 2.1.0 with CUDA toolkit version 12.1. The typical inference time is approximately 100 ms per image (512 pixel × 512 pixel), which can be significantly improved using a state-of-the-art GPU.

    4.2 Experimental Setup, Data Acquisition, and Preprocessing

    A custom-designed multispectral LED array was built using six distinct LEDs (Chanzon® 5 mm): red (620 to 625 nm), green (520 to 525 nm), purple (395 to 400 nm), yellow (588 to 592 nm), orange (600 to 610 nm), and blue (455 to 465 nm). The LEDs were arranged in a circular configuration on a 1×1 prototype board, as shown in Fig. 1. The array was modulated using an Arduino UNO R3 microcontroller board, programmed with the Arduino IDE 1.8.19 software41 to manage both the cycling and brightness of the LEDs. The cycling pattern involved several phases to maximize the capture of spectral information. First, each LED was illuminated individually for 5 s. Following this, combinations of two LEDs were activated, cycling through various pairs. After the two-LED combinations, the system progressed to groups of three LEDs, illuminating them simultaneously for further spectral complexity. Further combinations of four and five LEDs were also included, providing a rich dataset for encoding by mixing multiple wavelengths. Each phase of the cycle (ranging from one to five LEDs for each combination) was executed for a designated duration, with brightness levels adjusted to account for varying LED photon energy and ensure roughly uniform brightness throughout our measurements. The Arduino controlled both digital and pulse width modulation pins, allowing for precise control of the intensity of each LED.
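    For illustration only (the Arduino firmware itself is not reproduced here), enumerating every one- to five-LED configuration of the six-LED array yields 62 possible illumination states; the center wavelengths listed below are nominal values inferred from the LED specifications above.

```python
from itertools import combinations

LEDS_NM = [397, 460, 522, 590, 605, 623]  # nominal center wavelengths (nm)

# All illumination configurations with one to five LEDs on simultaneously.
configurations = [combo for k in range(1, 6) for combo in combinations(LEDS_NM, k)]
print(len(configurations))  # 6 + 15 + 20 + 15 + 6 = 62
```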

    The optical system was designed to capture encoded multispectral information using a DMD and a custom camera setup. A Texas Instruments DLP9500 DMD was used, featuring a 10.8-μm pitch size, and was driven by the EasyProj (ViALUX) software to project binary images of lung tissue samples. To counteract the DMD’s 45-deg tilt, the image was displayed in a diamond shape against a black background. The multispectral illumination passed through a collimator and diffuser before striking the DMD and was subsequently focused by two lenses: one with a focal length of 200 mm (Thorlabs LA1708, N-BK7 plano-convex) and the other with a focal length of 75 mm (Thorlabs LA1608, N-BK7 plano-convex), positioned 110 mm apart. The images were captured by a ScopeTek DCM500BW camera housed in a custom 3D-printed casing made using a Stratasys Objet30 Prime™ printer. The camera, equipped with a 5.0-megapixel monochrome image sensor (2592×1944 resolution, 2.2 μm pixel size), recorded the out-of-focus images, which were processed and stored via the ScopePhoto 3.1 software. The imaging speed was approximately 300 ms per capture, with automatic exposure adjustment.

    Formalin-fixed, paraffin-embedded (FFPE) tissue slides stained with hematoxylin and eosin (H&E) from anonymized lung specimens were used for testing. Each slide originated from previously collected anonymized materials that did not contain identifiable patient information. The slides were initially imaged as holographic data using a custom-built lens-free in-line setup.42,43 The reconstructed holographic images of tissue specimens were then binarized using Otsu’s thresholding method,44 which adaptively selects the optimal threshold to separate foreground and background regions. The resulting images were rotated into a diamond shape and positioned against a black background to create an upright frame of suitable size on the CMOS image sensor. The optical configuration resulted in the captured images being 3 times smaller than the projected images, due to the lens system and component distances. The image defocus varied between different illumination wavelengths due to chromatic aberration. Blue light focuses closer to the lens, whereas red light focuses farther away, leading to differences in the amount of defocus for different colors.

    For the training and testing datasets, we collected two sets of image patterns. The first set consisted of 100 image patterns, each with six exposures corresponding to one unique LED illumination at a time. The second set consisted of another 20 image patterns. In addition to the six illumination settings in the first set, each projected image pattern in the second set was illuminated with another 14 illumination settings that had more than one LED on at a time. In total, we captured 964 multispectral images using these two image pattern sets. Note that the total expected dataset size is 1000 images, but 36 were discarded due to illumination synchronization failure in the experiments. The captured raw monochrome images, encompassing the full field-of-view on the image sensor, underwent the following preprocessing steps: we first cropped the center diamond regions and rotated the images by 45 deg to roughly align the captured data to the same orientation as the target images projected to the DMD. A 5 pixel × 5 pixel binning was applied to the input images, which were then resized to 512 pixel × 512 pixel to ensure compatibility with the network model architecture. This resolution was fixed for both the training and evaluation phases to maintain consistency. All resizing operations used bilinear kernels.
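    The preprocessing chain described above can be sketched as follows; the crop coordinates are placeholders rather than calibrated sensor positions, while the binning and resizing steps follow the stated 5 pixel × 5 pixel binning and 512 pixel × 512 pixel output size.

```python
import numpy as np
from scipy.ndimage import rotate
from PIL import Image

# Sketch of the raw-frame preprocessing: crop the central diamond, rotate by
# 45 deg, apply 5x5 binning, and resize to 512x512 with bilinear interpolation.
def preprocess(raw: np.ndarray) -> np.ndarray:
    crop = raw[400:1540, 726:1866].astype(np.float32)   # placeholder center crop
    upright = rotate(crop, angle=45, reshape=False)     # undo the DMD's 45-deg tilt
    h, w = (upright.shape[0] // 5) * 5, (upright.shape[1] // 5) * 5
    binned = upright[:h, :w].reshape(h // 5, 5, w // 5, 5).mean(axis=(1, 3))
    resized = Image.fromarray(binned).resize((512, 512), Image.BILINEAR)
    return np.asarray(resized, dtype=np.float32)
```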

    Acknowledgments

    The Ozcan Lab at UCLA acknowledges valuable discussions with Dr. Douglas M. Baney and funding from Keysight Technologies Inc.

    Biographies of the authors are not available.

    [1] G. A. Shaw and H. K. Burke, “Spectral imaging for remote sensing,” Lincoln Lab. J. 14, 3–28 (2003).

    [40] A. Paszke et al., “PyTorch: an imperative style, high-performance deep learning library,” in Advances in Neural Information Processing Systems (2019).

    [44] N. Otsu, “A threshold selection method from gray-level histograms,” Automatica 11, 23–27 (1975).

    Paper Information

    Category: Research Articles

    Received: Jan. 25, 2025

    Accepted: Jul. 21, 2025

    Published Online: Aug. 5, 2025

    The Author Email: Aydogan Ozcan (ozcan@ucla.edu)

    DOI:10.1117/1.APN.4.5.056002

    CSTR:32397.14.1.APN.4.5.056002
