Advanced Photonics Nexus, Volume 4, Issue 5, 056002 (2025)

Snapshot multispectral imaging through defocusing and a Fourier imager network

Xilin Yang1,2,3, Michael John Fanous1, Hanlong Chen1,2,3, Ryan Lee1, Paloma Casteleiro Costa1,2,3, Yuhang Li1,2,3, Luzhe Huang1,2,3, Yijie Zhang1,2,3, and Aydogan Ozcan1,2,3,*
Author Affiliations
  • 1University of California, Los Angeles, Electrical and Computer Engineering Department, Los Angeles, California, United States
  • 2University of California, Los Angeles, Bioengineering Department, Los Angeles, California, United States
  • 3University of California, Los Angeles, California NanoSystems Institute (CNSI), Los Angeles, California, United States

    Multispectral imaging, which simultaneously captures the spatial and spectral information of a scene, is widely used across diverse fields, including remote sensing, biomedical imaging, and agricultural monitoring. We introduce a snapshot multispectral imaging approach employing a standard monochrome image sensor with no additional spectral filters or customized components. Our system leverages the inherent chromatic aberration of wavelength-dependent defocusing as a natural source of physical encoding of multispectral information; this encoded image information is rapidly decoded via a deep learning-based multispectral Fourier imager network (mFIN). We experimentally tested our method with six illumination bands and demonstrated an overall accuracy of 98.25% for predicting the illumination channels at the input and achieved a robust multispectral image reconstruction on various test objects. This deep learning-powered framework achieves high-quality multispectral image reconstruction using snapshot image acquisition with a monochrome image sensor and could be useful for applications in biomedicine, industrial quality control, and agriculture, among others.

    1 Introduction

    Multispectral imaging, with the capacity of concurrently capturing the spatial and spectral information of objects, has become an indispensable tool in various fields, including remote sensing,1–3 biomedical imaging,4–7 and agricultural and food monitoring,8–12 among others.13,14 However, unlike standard color image sensors that utilize a Bayer filter array to capture the red, green, and blue (RGB) channels, multispectral imaging typically requires bulkier optical setups to separately capture spatial data within each wavelength channel while minimizing spectral crosstalk among the channels of interest. One of the approaches for multispectral imaging involves scanning systems with single-pixel sensors or one-dimensional sensor arrays coupled with wavelength-selective components like color filters15,16 or prisms.17 However, these scanning systems acquire spatial and spectral data sequentially, which compromises frame rate, increases exposure time, and degrades signal-to-noise ratio, thereby limiting their applicability in real-time imaging scenarios. Nonscanning systems, commonly referred to as snapshot imagers,18 aim to capture spatial and spectral information in a single exposure. Examples of snapshot imagers include color-coded apertures,19 Fabry-Pérot filters,20,21 spectral filter arrays,22,23 metasurfaces,24,25 and diffractive optical elements.26–28 Despite their ability to perform real-time imaging, most snapshot multispectral imaging systems require additional components that are relatively complex and less accessible compared with commercially available image sensors used in everyday applications.

    Here, we present a single-shot multispectral imaging system that utilizes a simple monochrome image sensor without any spectral filters or customized components. Our method leverages propagation-induced defocusing as a physical encoding mechanism, significantly reducing the complexity and cost typically associated with traditional multispectral imaging systems. For the multispectral image decoding, we employ a trained neural network termed multispectral Fourier imager network (mFIN), which is optimized to rapidly reconstruct the multispectral image information embedded in the defocused image captured using a monochrome image sensor. We experimentally validated our approach by acquiring 964 multispectral images, where each object was acquired by illuminating a pattern with a distinct combination of light-emitting diodes (LEDs). The entire dataset was divided into 888 multispectral objects used to train the mFIN model, and the remaining 76 multispectral objects, unseen during training, were employed to blindly evaluate the model’s ability to accurately reconstruct images and their corresponding color channels. Our results demonstrate that this multispectral imaging system achieved an overall accuracy of 98.25% in predicting the correct illumination channels of the test objects using a monochrome image sensor array. In addition, the reconstructed multispectral images of test objects exhibit a decent structural quality, as evidenced by various quantitative image metrics including the structural similarity index measure (SSIM) averaging 0.61±0.09, peak signal-to-noise ratio (PSNR) averaging 11.48±1.79 dB, root mean square error (RMSE) averaging 0.27±0.05, and normalized mean squared error (NMSE) averaging 0.077±0.029. These experimental results confirm the effectiveness of our snapshot computational multispectral imaging method using a monochrome image sensor. Our study presents an efficient multispectral imaging solution that integrates natural image defocusing as a physical encoding scheme with a deep learning-based spectral decoding framework. This integration not only simplifies multispectral imaging hardware but also maintains high-quality image reconstructions through snapshot image acquisition, potentially enabling broader applications and enhancing the accessibility of multispectral imaging technologies.

    2 Results

    Figure 1 illustrates our system setup, depicting both the schematic layout [Fig. 1(a)] and the physical components of the system [Fig. 1(b)]. To achieve multispectral imaging, we implemented a custom-designed ring-shaped LED array comprising six different types of LEDs that span the visible spectrum. The multispectral illumination is directed through a collimator, diffuser, and pinhole, ultimately reaching a digital micromirror device (DMD). The DMD projects the spatial information of the image, referred to as the image pattern, thereby forming the final multispectral image. A 3× demagnifying lens pair then focuses this multispectral image onto a specific region of a monochrome complementary metal-oxide-semiconductor (CMOS) image sensor. Due to the inherent chromatic aberration of the optical elements and dispersion, different illumination wavelengths result in varying focal planes. Consequently, our system captures each spectral channel with a distinct level of image defocus and blur, resulting in physical defocus-based multispectral image encoding. This defocused and blurred pattern that is captured by a monochrome image sensor is subsequently decoded by a deep learning-based neural network model, enabling the reconstruction of the multispectral image of an unknown pattern. This multispectral image reconstruction is performed using an mFIN, which features residual connections with channel repetition from the monochrome image input to the multispectral image output, ensuring the alignment of the channel dimensions (see Sec. 4 for details).
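    To make the encoding step concrete, the following Python sketch illustrates the idea of wavelength-dependent defocus acting as a spectral mixer; the Gaussian blur widths are arbitrary placeholders standing in for the actual, unmeasured chromatic point spread functions of the lens pair, and the listed center wavelengths are nominal values.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

# Illustrative model of defocus-based spectral encoding: each spectral channel
# of the projected pattern is blurred by a wavelength-dependent kernel (here a
# Gaussian whose width stands in for chromatic defocus), and the blurred
# channels sum incoherently into a single monochrome snapshot.
DEFOCUS_SIGMA = {397: 4.0, 460: 1.0, 522: 2.0, 590: 3.5, 605: 2.5, 623: 1.2}  # placeholder widths

def encode_snapshot(channels):
    """channels: dict mapping center wavelength (nm) -> 2D intensity pattern."""
    h, w = next(iter(channels.values())).shape
    snapshot = np.zeros((h, w), dtype=np.float32)
    for wavelength, img in channels.items():
        snapshot += gaussian_filter(img.astype(np.float32), sigma=DEFOCUS_SIGMA[wavelength])
    return snapshot / max(float(snapshot.max()), 1e-8)  # normalized monochrome image

# Example: one pattern illuminated by the 460 and 623 nm LEDs simultaneously.
pattern = np.zeros((512, 512), dtype=np.float32)
pattern[200:300, 200:300] = 1.0
mono = encode_snapshot({460: pattern, 623: pattern})
```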

    Figure 1.(a) Schematic layout: overview of the snapshot multispectral imaging system, illustrating the arrangement of the circular LED array, the digital micromirror device (DMD), optical components (collimator, diffuser, and lenses), and the monochrome imaging camera. (b) Experimental setup.

    A total of 964 objects from 120 unique image patterns of test objects (images of human lung tissue samples; see Sec. 4) were imaged, each captured under a distinct illumination condition using various combinations of six LEDs. Of these, 888 multispectral objects from 114 patterns were utilized in the training process of the mFIN model. The training optimized the mFIN network by minimizing the discrepancy between the multispectral images reconstructed by the network and the ground truth image patterns (projected by the DMD), in which each spectral channel contains the pattern only if its corresponding LED was turned on. Further details regarding the dataset and training methodology are provided in Sec. 4.

    Following the network training, we performed a blind evaluation on an independent test set composed of 76 multispectral objects generated from six image patterns that were not used during training. We emphasize that the train–test split was enforced at the pattern level: no image pattern used for training appears in the test set, ensuring a rigorous assessment of the model’s generalization capability. We first evaluated our system’s ability to accurately reconstruct images of test objects and their corresponding color channels under a single-LED illumination, with the results presented in Fig. 2. The left column displays the snapshot grayscale images (resulting from the monochrome image sensor), which exhibit varying degrees of defocus, aberrations, and intensity fluctuations across different illumination wavelengths. For each test object, mFIN accurately predicted both the corresponding illumination wavelength and the underlying image pattern, where each illumination channel was denoted by the center wavelength of its corresponding LED; see Fig. 2.
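    As a minimal illustration of this pattern-level split (hypothetical bookkeeping code, not the authors' pipeline), one can hold out entire DMD patterns so that no test pattern is ever seen during training:

```python
import random

def split_by_pattern(samples, test_pattern_ids):
    """samples: list of dicts with keys 'pattern_id', 'input', and 'target'."""
    train = [s for s in samples if s["pattern_id"] not in test_pattern_ids]
    test = [s for s in samples if s["pattern_id"] in test_pattern_ids]
    return train, test

# Hold out 6 of the 120 unique DMD patterns for blind testing; every
# acquisition (any LED combination) of a held-out pattern goes to the test set.
test_pattern_ids = set(random.sample(range(120), 6))
```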

    Figure 2.(a) and (b) Visualizations of image reconstruction results of two test image patterns illuminated under different wavelengths. Each column refers to an image with the monochrome defocused sensor input, network output, and multispectral target (ground truth). Pseudo colors represent the corresponding illumination LED color (center wavelength).

    In the subsequent phase of the blind testing, we increased the complexity of the multispectral image reconstruction task by testing new image patterns under simultaneous illumination by multiple LEDs, covering various combinations of wavelengths. Figure 3 showcases the test image patterns illuminated by different LED configurations. The input monochrome images captured by the CMOS imager (top row) exhibit significant spatial variations and aberrations due to the superposition of multiple illumination wavelengths. The mFIN-reconstructed output images (left column) are juxtaposed with the target images (right column), revealing a high degree of concordance between the two. These experimental results further confirm that mFIN can blindly generate accurate multispectral image reconstructions from snapshot monochrome images, demonstrating its robustness.

    Figure 3.Visualization of image reconstruction results under various illumination configurations. Each column displays the predicted image alongside the corresponding target (ground truth) image for different combinations of activated LEDs.

    To quantitatively evaluate the performance of our model, we first assessed its spectral channel accuracy using a classification-based approach, followed by an analysis of the structural image metrics for the spatial quality of the reconstructed multispectral image features. For the spectral channel accuracy classification, each illumination LED channel was treated as a binary state: the LED was either on during the test object illumination, contributing to the snapshot monochrome image, or off, leaving the spectral channel of this object blank. Consequently, spectral prediction errors were categorized into two types: false positives (termed “leakages”), where the model incorrectly predicted an image in a blank/off spectral channel, and false negatives (termed “losses”), where the model failed to generate a spectral image or produced one with low intensity in a spectral channel that should contain an image. The accuracy of the spectral predictions was determined by applying thresholds on an energy difference term defined as $I_{\rm dif} = (I_{\rm pred} - I_{\rm target})/I_{\rm target}$, where $I_{\rm pred}$ represents the predicted image intensity of a given output channel, calculated as the sum of all the pixel values in the predicted image, and $I_{\rm target}$ denotes the sum of the pixel intensity values of the ground truth image pattern [see Fig. 4(a) and Sec. 4 for details]. For statistical analysis, each sample pattern under a given illumination configuration was treated as six independent entries—one per channel—resulting in a total of 456 data points derived from 76 test objects never seen during training. Our multispectral image reconstruction model reached an overall spectral accuracy of 98.25% across all data points in the test dataset. The confusion matrix in Fig. 4(a) (see Sec. 4 for additional evaluation protocol) offers a detailed view of the spectral classification performance. Remarkably, the network produced zero false-negative detections and only eight false-positive instances out of 456 channel predictions—corresponding to an error rate of 1.8%—highlighting the model’s reliability in spectral-channel prediction.

    Figure 4.Image reconstruction performance analysis. (a) Confusion matrix of spectral channel accuracy. (b) Classification metrics against the number of concurrent LED illuminations: Plots of sensitivity, specificity, and F1 score against the number of concurrent LED illuminations. (c) Energy difference values as a function of the number of concurrent LED illuminations; boxplots depict the distribution of energy differences for various numbers of concurrent illuminations, indicating the shift in the prediction accuracy and variance. (d) Energy difference values as a function of the illumination wavelengths. We used light green to signify “losses” and light blue for “leakages.”

    Next, we analyzed the model’s performance relative to the number of concurrent illuminations by calculating sensitivity, specificity, and F1 score values (see Sec. 4) across varying illumination configurations. As illustrated in Fig. 4(b), the network achieves very good performance under single-LED illumination, reaching 100% sensitivity, specificity, and F1 score, i.e., no classification errors. When the illumination becomes more complex—with two to five LEDs active simultaneously—occasional misclassifications emerge; nevertheless, all three performance metrics remain consistently high (above 90%), and the specificity stays at 100%. These results confirm the model’s ability to maintain robust spectral predictions even in the presence of significant spectral overlap and multiplexed illumination.

    Classifying the spectral predictions as correct or incorrect [Figs. 4(a) and 4(b)] provides only a binary assessment of performance and does not fully capture the nuances of the reconstructed image quality. To address this, Fig. 4(c) presents the energy difference levels for all test data points using boxplots, categorized by the number of concurrent illuminations. For enhanced visibility, a zoomed-in version of the y-axis is also shown on the right. The mean energy difference ($I_{\rm dif}$) remains centered around zero as the number of simultaneously active LEDs increases; however, the distribution gradually broadens, signaling a larger standard deviation. This widening spread reflects the added difficulty of disentangling more complex spectral superpositions recorded by the monochrome sensor. Importantly, the vast majority of data points still reside within ±0.05 of the ground-truth energy, indicating that reconstruction errors remain well-bounded even under multiplexed illumination. Collectively, these results confirm that the proposed framework retains a high degree of reliability and effectiveness when reconstructing multispectral images in increasingly challenging illumination scenarios.

    Next, we analyzed how the wavelength of the illumination affects the multispectral image reconstruction performance. Similar to the earlier analyses, each test image from a single spectral channel was treated as an independent data point, resulting in six data points per test pattern—one for each spectral channel. We then grouped these data points by their corresponding illumination wavelengths and plotted the energy difference levels in Fig. 4(d). This wavelength-resolved analysis reveals only minor variability in the reconstruction accuracy across the spectrum. At 397 nm, the distribution of the energy-difference metric $I_{\rm dif}$ is shifted slightly above zero, indicating a modest increase in false positives (“leakages”) where the network hallucinates content in an otherwise blank channel. Conversely, at 623 nm, the distribution is biased below zero, leading to a higher incidence of false negatives (“losses”) in which the expected content is missed. The remaining channels cluster tightly around $I_{\rm dif}=0$ with very low error rates and fall in the ±0.05 range. Despite these minor differences, the overall reconstruction fidelity remains high, confirming the method’s robust, wavelength-agnostic performance. The residual discrepancies can be attributed to chromatic-aberration-induced focal-plane shifts. We also evaluated the quality of the reconstructed images using various structural image metrics, including RMSE, SSIM, PSNR, and NMSE (refer to Sec. 4 for details). When assessing these metrics, we focused exclusively on true positive predictions, as leakages or losses cannot be used for quantitative evaluation of image quality. In its blind testing, our model achieved an average SSIM of 0.61±0.09 and a PSNR of 11.48±1.79 dB. In addition, the RMSE averaged 0.27±0.05, and the NMSE metric averaged 0.077±0.029 for the test objects. These results, characterized by low standard deviations, demonstrate a repeatable performance in our model’s multispectral image reconstructions. In Figs. 5(a)–5(d), we further report the distributions of these four quality metrics across different wavelengths. The four image-quality metrics exhibit a consistent trend: reconstructions at 460 and 623 nm outperform the other bands, mirroring the spectral-accuracy results reported earlier. By contrast, the 397 and 590 nm channels show a modest—but still acceptable—decrease in performance. This wavelength-dependent behavior stems from the chromatic-aberration-induced defocus. Inspection of the raw monochrome snapshots in Fig. 2 corroborates this: the 397 and 590 nm channels exhibit markedly greater blur compared to the 460 and 623 nm channels. Consequently, the reconstruction fidelity is higher for the minimally defocused bands and slightly reduced for those with more pronounced optical blur. Crucially, this property also confers application-specific flexibility. By selecting the focal plane to favor a particular band, one can prioritize the image quality of the most mission-critical wavelength.

    Figure 5.Distributions of (a) PSNR, (b) RMSE, (c) SSIM, and (d) NMSE metrics for the reconstructed images at different illumination wavelengths.

    Finally, we evaluated the reconstructed multispectral image quality and the resulting confusion matrices in relation to the number of concurrent LED illuminations, as depicted in Fig. 6. The performance remained relatively stable across varying levels of multiplexed illumination. Despite minor discrepancies observed, the presented snapshot multispectral imaging system performs well across all the tested wavelengths, as indicated by the quantitative structural image quality metrics, confirming the robustness of the reconstruction process across the visible spectrum.

    Figure 6.(a) Distributions of structural quality metrics for the reconstructed multispectral images with respect to the number of concurrent LED illuminations. (b) Confusion matrices with different numbers of concurrent LED illuminations.

    To further evaluate the limits of our approach, we designed a testing scenario involving dual-channel illumination, where two distinct LEDs simultaneously illuminate different image patterns. The training dataset was constructed by randomly selecting two multispectral samples, each captured under a single LED channel, along with their corresponding monochrome inputs. The two monochrome inputs were then summed and normalized to numerically simulate a dual-illumination snapshot containing two distinct image patterns, and their spectral targets were combined accordingly. To ensure diversity and prevent redundancy, pairs with overlapping spectral channels were resampled. During blind testing, the model was evaluated on entirely unseen combinations of image patterns, ensuring strict separation between the training and test sets. As shown in Fig. 7, our model accurately predicted the active spectral channels and successfully disentangled and reconstructed each component with high fidelity. We emphasize that this setup represents a highly ill-posed inverse problem, making it particularly challenging. A few representative failure cases—such as spectral image pattern cross-talk, missed channels/losses, and incorrect channel predictions—are also presented in Supplementary Fig. S1. Nonetheless, the reconstructions confirm the framework’s robustness even under these more challenging conditions. Additional performance gains can be achieved through tailored physical-encoding designs, as detailed in Sec. 3.
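    A minimal sketch of this pairing procedure is given below, assuming each single-LED sample is stored as a monochrome input tensor of shape (1, H, W) and a six-channel spectral target of shape (6, H, W); the variable names are ours and the normalization choice is an assumption rather than the authors' exact recipe.

```python
import torch

def make_dual_sample(sample_a, sample_b):
    """Combine two single-LED acquisitions into one synthetic dual-LED sample."""
    x_a, y_a = sample_a            # monochrome input (1, H, W), spectral target (6, H, W)
    x_b, y_b = sample_b
    ch_a = int(torch.argmax(y_a.sum(dim=(1, 2))))   # active channel of sample A
    ch_b = int(torch.argmax(y_b.sum(dim=(1, 2))))   # active channel of sample B
    if ch_a == ch_b:
        raise ValueError("overlapping spectral channels; resample the pair")
    x = x_a + x_b                                   # superpose the monochrome snapshots
    x = x / x.max().clamp(min=1e-8)                 # renormalize the combined input
    y = y_a + y_b                                   # union of the two spectral targets
    return x, y
```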

    Figure 7.Reconstruction results on synthetic test data with dual-LED spectral illumination on different patterns. Each test input simulates a monochrome snapshot formed by the superposition of two independent single-channel illuminations on two distinct image patterns. The mFIN-generated spectral reconstructions (left) are compared against the combined ground truth targets (right, in each column).

    3 Discussion

    The presented method leverages chromatic aberration-induced defocusing as a physical encoding mechanism, significantly reducing the complexity and cost typically associated with traditional multispectral imaging systems. By eliminating the need for additional optical elements such as spectral filters, prisms, or diffractive optical components, our approach not only simplifies the imaging setup but also enhances accessibility and practical applicability across diverse imaging scenarios. One additional advantage of our framework is that, being data-driven, the encoding scheme does not require rigorous analytical formulation or precise knowledge of the optical properties and aberrations of the imaging system. Instead, our deep learning model learns the inverse mapping implicitly, functioning as a decoder for the optically encoded measurements, even in the presence of unknown sources of aberrations or misalignments. This learning-based decoding is realized through our mFIN model, which enables efficient and robust spectral reconstruction—delivering rapid, accurate, and high-quality multispectral images suitable for potential applications in biomedical imaging, agricultural monitoring, and industrial quality control, among others.

    In some recent works, spectrometers have been utilized to enhance29 or quantitatively evaluate30 the performance of multispectral imaging systems. For instance, ground-truth spectral information from unknown objects was acquired by scanning the scene with a spectrometer to validate the accuracy of a multispectral imager.30 Different from these earlier works, our demonstrated approach operates with discrete, predetermined spectral channels whose ground truth originates from precisely controlled illumination by a programmable LED array. This choice of LED-based ground truth spectral information does not require the use of a spectrometer because the illumination spectrum of each LED in the programmable array is known a priori; this does not limit our method’s validity or applicability in real-world scenarios. To increase the spectral resolution of our multispectral imaging system, one can use programmable spectral filters during the training phase to replace the LED array with an increased number of nonoverlapping spectral channels, forming our ground truth. In a practical deployment, following the training, only a monochrome image sensor and the native chromatic aberration of an off-the-shelf lens would be sufficient—without the use of dedicated dispersive optics or spectral filter wheels. For a specific application of interest, one can also pre-select relevant spectral channels to be used in the training phase based on the application-specific requirements, such as the spectral importance or contrast correlation, allowing our method to adapt effectively to diverse multispectral imaging or sensing scenarios.

    Despite the success of our experimental results, the current implementation is also subject to hardware constraints due to the DMD that is used for training and testing our system. Therefore, extending the framework to alternative imaging hardware and objects remains an important avenue that we leave for future work. In addition, as the inverse problem becomes increasingly more complex, i.e., when addressing multispectral images with finer structural details and higher bit-depth patterns for each spectral channel, image-defocusing-based encoding of spectral information may not sufficiently capture all the information needed for high-fidelity multispectral image reconstructions. To provide better multispectral image encoding strategies, there is growing research interest in utilizing diffractive optical elements to engineer wavelength- and depth-dependent point-spread functions (PSFs), potentially enhancing how information is projected onto image sensors. When combined with deep neural network models, these engineered PSFs have the potential to significantly improve the quality of multispectral image reconstructions. For example, recent advancements have demonstrated that diffractive deep neural networks can effectively design spatially and spectrally varying PSFs for incoherent light sources31,32 with multiple diffractive layers interconnected through free-space light propagation. By incorporating such cascaded structured diffractive layers, one can replace the defocusing-based multispectral image encoding strategy with a fully programmable diffractive approach, providing greater degrees of freedom compared with traditional single-layer diffractive designs.31,33

    To further enhance the robustness and real-world applicability of our approach, future research can leverage joint optimization strategies integrating front-end optical encoding with back-end computational decoding.3436 Specifically, combining multilayer diffractive optical networks as physical encoders with dedicated multispectral image reconstruction networks as digital decoders could offer significant potential to extend the boundaries of what is achievable in multispectral and hyperspectral imaging systems. Such integrated optimization strategies of front-end optical encoder hardware and back-end digital decoder algorithms would allow for the simultaneous refinement of both the physical imaging setup and the computational back-end system, ensuring that the entire imaging pipeline is finely tuned for multispectral image reconstruction fidelity and performance.

    4 Appendix: Materials and Methods

    4.1 Network Architecture, Training, and Performance Evaluation Metrics

    The mFIN architecture is composed of multiple dynamic spatial Fourier transform (dSPAF) modules with dense connections,37 as shown in Fig. 8(a). The dense connections, also referred to as dense links, aggregate the outputs of all previous dSPAF modules into the next one [Fig. 8(b)] so that the feature maps are progressively accumulated, allowing the network to retain information flow while effectively capturing intricate spatial representations. We conducted an ablation study by removing the dense connections, which led to a significant drop in performance, with an accuracy of 73.25% and a PSNR lower than 10 dB. The results of this experiment are presented in Supplementary Fig. S2. The dSPAF module is shown in Fig. 8(c); it utilizes a shallow U-Net38 to dynamically generate spatial frequency filters, enabling adaptive processing and introducing extra degrees of freedom. Input images are first converted to the spatial frequency domain using a two-dimensional fast Fourier transform (FFT); the resulting spectra are then modulated by the dynamically generated weights from the shallow U-Net. Subsequently, an inverse fast Fourier transform (iFFT) is performed to reconstruct the image in the spatial domain. The resulting spatial tensors undergo a parametric rectified linear unit (PReLU)39 activation to introduce nonlinearity. A residual connection was incorporated to provide better convergence by establishing a direct pathway from input to output; channel repetition was also employed for the monochrome input to match the desired number of output channels.
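    The following PyTorch sketch conveys our reading of such a dSPAF-style block; the shallow U-Net is replaced by a small convolutional filter generator for brevity, so this is an illustrative approximation of Fig. 8(c) rather than the released mFIN implementation.

```python
import torch
import torch.nn as nn

# Simplified dSPAF-style block: a small convolutional "filter generator"
# stands in for the shallow U-Net and predicts a complex-valued
# spatial-frequency filter, which modulates the FFT of the input before an
# inverse FFT, PReLU activation, and a residual connection.
class DSPAFBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.filter_gen = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.PReLU(),
            nn.Conv2d(channels, 2 * channels, 3, padding=1),  # real + imaginary parts
        )
        self.act = nn.PReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        c = x.shape[1]
        f = self.filter_gen(x)
        filt = torch.complex(f[:, :c], f[:, c:])   # dynamically generated frequency filter
        spec = torch.fft.fft2(x)                   # to the spatial-frequency domain
        spec = spec * filt                         # frequency-domain modulation
        out = torch.fft.ifft2(spec).real           # back to the spatial domain
        return self.act(out) + x                   # residual connection

# Channel repetition: replicate the single monochrome input to match the
# six-channel multispectral dimensionality used by the residual pathway.
mono = torch.rand(1, 1, 512, 512)
six_channel_input = mono.repeat(1, 6, 1, 1)
out = DSPAFBlock(6)(six_channel_input)
```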

    Figure 8.Network architecture. (a) mFIN. (b) Dense links: each output tensor of the dSPAF group is appended and fed to the subsequent one. (c) Detailed schematic of dSPAF modules. See Sec. 4 for details.

    Our training loss function is a weighted sum of the mean absolute error (MAE) term and the Fourier domain MAE (FDMAE) loss: $\mathrm{Loss}_{\rm mFIN} = \omega_{\rm MAE} L_{\rm MAE} + \omega_{\rm FDMAE} L_{\rm FDMAE}$, where $\omega_{\rm MAE}$ and $\omega_{\rm FDMAE}$ are the weights for each domain and are empirically set to 1 and 0.15, respectively.

    Each loss term is defined as $L_{\rm MAE} = \frac{1}{n}\sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|$ and $L_{\rm FDMAE} = \frac{1}{n}\sum_{i=1}^{n} \left| \mathcal{F}(y)_i - \mathcal{F}(\hat{y})_i \right|$, where $\mathcal{F}$ denotes the 2D discrete Fourier transform, $y$ represents the ground truth, $\hat{y}$ represents the output of the network, and $n$ is the total number of pixels. We conducted an ablation study on the FDMAE loss component, with the results presented in Supplementary Fig. S3. This analysis revealed that the training process is highly sensitive to the choice of FDMAE loss weight. To systematically examine this effect, we trained four separate models using different weight configurations and compared their spectral classification accuracy and structural reconstruction quality, as summarized in Supplementary Fig. S4. Models trained with either very small ($\omega_{\rm FDMAE}=0.001$) or zero FDMAE loss weight exhibited significantly degraded performance. By contrast, assigning a larger weight ($\omega_{\rm FDMAE}=0.15$) resulted in the highest spectral channel accuracy, whereas a more moderate weight ($\omega_{\rm FDMAE}=0.015$) yielded slightly superior image quality while sacrificing some channel prediction accuracy. This trade-off highlights the flexibility of our approach, allowing users to prioritize either accuracy or visual fidelity depending on the specific requirements of their downstream applications.
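    Expressed in code (a direct re-statement of the two loss terms above, not the authors' training script), the combined objective can be written as:

```python
import torch

# Weighted MAE + Fourier-domain MAE loss, following the definitions above.
def mfin_loss(y_hat: torch.Tensor, y: torch.Tensor,
              w_mae: float = 1.0, w_fdmae: float = 0.15) -> torch.Tensor:
    l_mae = torch.mean(torch.abs(y - y_hat))
    # 2D FFT over the spatial dimensions; compare the complex spectra by the
    # magnitude of their difference, averaged over all elements.
    spec_y = torch.fft.fft2(y)
    spec_y_hat = torch.fft.fft2(y_hat)
    l_fdmae = torch.mean(torch.abs(spec_y - spec_y_hat))
    return w_mae * l_mae + w_fdmae * l_fdmae
```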

    We defined the energy difference for each channel as $I_{\rm dif} = (I_{\rm pred} - I_{\rm target})/I_{\rm target}$, where $I_{\rm pred} = \sum_i \hat{y}_i$ and $I_{\rm target} = \sum_i Y_i$, with $i$ indexing all pixels within the image. Note that we calculated $I_{\rm target}$ using the binary image pattern ($Y$), irrespective of whether the channel was illuminated (positive ground truth) or blank (negative ground truth). This approach ensures that $I_{\rm dif}$ can always be normalized without division by zero, even if the channel’s ground truth appears blank (i.e., the corresponding LED is not turned on). We applied two thresholds on the energy difference values to determine whether a sample is considered correctly predicted: 0.2 for false negatives and 0.1 for false positives. More specifically, for a sample where the ground truth is positive, any output with $I_{\rm dif} \le -0.2$ is considered a false negative or loss. Similarly, for a sample where the ground truth is blank, any output with $I_{\rm dif} \ge 0.1$ is considered a false positive or leakage. The thresholds were empirically selected to balance the trade-off between spectral prediction accuracy and overall image quality. In general, applying stricter thresholds (i.e., smaller absolute values) tends to reduce the reconstruction accuracy but enhances structural similarity by imposing tighter constraints on energy difference tolerances. To validate this design choice, we conducted additional experiments using four different threshold settings on the model trained with $\omega_{\rm FDMAE}=0.15$. Here, we denote the false-negative threshold as $T_{\rm FN}$, where $I_{\rm dif} \le -T_{\rm FN}$ is considered a false negative, whereas for false positives, we use $I_{\rm dif} \ge T_{\rm FP}$ as the criterion. The experimental results, summarized in Supplementary Fig. S5, confirmed the expected trade-off behavior between the accuracy and the structural fidelity.
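    A sketch of this channel-level scoring is shown below; the handling of blank channels (subtracting the channel's own ground-truth energy while normalizing by the binary pattern energy) reflects our reading of the description above and should be treated as an assumption.

```python
import torch

# Energy-difference bookkeeping for spectral-channel classification (our
# interpretation of the thresholding rule, not the authors' released code).
def classify_channel(y_hat_ch: torch.Tensor, y_ch: torch.Tensor,
                     pattern: torch.Tensor, led_on: bool,
                     t_fn: float = 0.2, t_fp: float = 0.1) -> str:
    i_pred = float(y_hat_ch.sum())
    i_target = float(y_ch.sum())          # zero for a blank channel
    i_norm = float(pattern.sum())         # binary pattern energy, never zero
    i_dif = (i_pred - i_target) / i_norm
    if led_on:
        return "loss (false negative)" if i_dif <= -t_fn else "true positive"
    return "leakage (false positive)" if i_dif >= t_fp else "true negative"
```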

    For performance evaluations, we used sensitivity, specificity, and F1 score as overall channel accuracy metrics: $\mathrm{Sensitivity} = \frac{TP}{TP + FN}$, $\mathrm{Specificity} = \frac{TN}{TN + FP}$, and $\mathrm{F1\ score} = \frac{2 \times [TP/(TP+FP)] \times [TP/(TP+FN)]}{[TP/(TP+FP)] + [TP/(TP+FN)]}$, where TP, TN, FP, and FN represent the counts of true positives, true negatives, false positives, and false negatives, respectively.
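    For completeness, these three metrics reduce to a few lines of code given the channel-level counts (assuming nonzero denominators, as is the case for the results reported here):

```python
# Sensitivity, specificity, and F1 score from channel-level counts;
# a direct re-expression of the formulas above.
def channel_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {"sensitivity": sensitivity, "specificity": specificity, "f1": f1}
```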

    For reconstructed image quality evaluations, we used four structural metrics: RMSE, SSIM, PSNR, and NMSE. Unless otherwise stated, we calculated all metrics on the raw predicted images without scaling, normalization, or binarization operations. RMSE is defined as $\mathrm{RMSE}(y,\hat{y}) = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$, where $y_i$ represents the ground truth image pixel, $\hat{y}_i$ represents the output image pixel from the model, and $i$ is the index for all pixels within the image. The NMSE metric focuses on the image contrast and uses a normalization factor $\sigma$ to rule out the impact of absolute image intensity level mismatches, i.e., $\mathrm{NMSE}(y,\hat{y}) = \mathrm{RMSE}(\sigma y, \hat{y})$, where $\sigma = \sum_i \hat{y}_i / \sum_i y_i$.

    SSIM evaluates the structural similarity between two images, defined by $\mathrm{SSIM}(y,\hat{y}) = \frac{(2\mu_y \mu_{\hat{y}} + c_1)(2\sigma_{y\hat{y}} + c_2)}{(\mu_y^2 + \mu_{\hat{y}}^2 + c_1)(\sigma_y^2 + \sigma_{\hat{y}}^2 + c_2)}$, where $\mu_y$ and $\mu_{\hat{y}}$ represent the mean pixel values of the ground truth and the output images, $\sigma_y^2$ and $\sigma_{\hat{y}}^2$ are their variances, and $\sigma_{y\hat{y}}$ is the covariance of the ground truth and the output images. $c_1$ and $c_2$ are constants, which were calculated based on the range of the images to be 6.5 and 58.5, respectively.

    The PSNR is defined as $\mathrm{PSNR} = 20 \log_{10}\!\left[\frac{1}{\mathrm{RMSE}(y,\hat{y})}\right]$.
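    The four structural metrics can be computed directly from the definitions above; the SSIM shown here uses global image statistics, whereas a windowed SSIM implementation may have been used in practice (an assumption on our part).

```python
import torch

# RMSE, NMSE, PSNR, and a global-statistics SSIM, computed on raw
# (unscaled) predictions as described above.
def rmse(y, y_hat):
    return torch.sqrt(torch.mean((y - y_hat) ** 2))

def nmse(y, y_hat):
    sigma = y_hat.sum() / y.sum()      # intensity-matching scale factor
    return rmse(sigma * y, y_hat)

def psnr(y, y_hat):
    return 20.0 * torch.log10(1.0 / rmse(y, y_hat))

def ssim(y, y_hat, c1=6.5, c2=58.5):
    mu_y, mu_h = y.mean(), y_hat.mean()
    var_y, var_h = y.var(), y_hat.var()
    cov = ((y - mu_y) * (y_hat - mu_h)).mean()
    return ((2 * mu_y * mu_h + c1) * (2 * cov + c2)) / \
           ((mu_y ** 2 + mu_h ** 2 + c1) * (var_y + var_h + c2))
```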

    The training and testing were performed on a standard workstation with a GeForce RTX 3090 Ti graphics processing unit (GPU), 256 GB of random-access memory (RAM), and an Intel Core i9 central processing unit (CPU). mFIN was implemented using Python version 3.12.0 and PyTorch40 version 2.1.0 with CUDA toolkit version 12.1. The typical inference time is approximately 100 ms per image (512 pixel × 512 pixel), which can be significantly improved using a state-of-the-art GPU.

    4.2 Experimental Setup, Data Acquisition, and Preprocessing

    A custom-designed multispectral LED array was built using six distinct LEDs (Chanzon® 5 mm): red (620 to 625 nm), green (520 to 525 nm), purple (395 to 400 nm), yellow (588 to 592 nm), orange (600 to 610 nm), and blue (455 to 465 nm). The LEDs were arranged in a circular configuration on a 1×1 prototype board, as shown in Fig. 1. The array was modulated using an Arduino UNO R3 microcontroller board, programmed with the Arduino IDE 1.8.19 software41 to manage both the cycling and brightness of the LEDs. The cycling pattern involved several phases to maximize the capture of spectral information. First, each LED was illuminated individually for 5 s. Following this, combinations of two LEDs were activated, cycling through various pairs. After the two-LED combinations, the system progressed to groups of three LEDs, illuminating them simultaneously for further spectral complexity. Further combinations of four and five LEDs were also included, providing a rich dataset for encoding by mixing multiple wavelengths. Each phase of the cycle (ranging from one to five LEDs for each combination) was executed for a designated duration, with brightness levels adjusted to account for varying LED photon energy and ensure roughly uniform brightness throughout our measurements. The Arduino controlled both digital and pulse width modulation pins, allowing for precise control of the intensity of each LED.
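    For illustration only (the Arduino firmware itself is not reproduced here), enumerating every one- to five-LED configuration of the six-LED array yields 62 possible illumination states; the center wavelengths listed below are nominal values inferred from the LED specifications above.

```python
from itertools import combinations

LEDS_NM = [397, 460, 522, 590, 605, 623]  # nominal center wavelengths (nm)

# All illumination configurations with one to five LEDs on simultaneously.
configurations = [combo for k in range(1, 6) for combo in combinations(LEDS_NM, k)]
print(len(configurations))  # 6 + 15 + 20 + 15 + 6 = 62
```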

    The optical system was designed to capture encoded multispectral information using a DMD and a custom camera setup. A Texas Instruments DLP9500 DMD was used, featuring a 10.8-μm pitch size, and was driven by the EasyProj (ViALUX) software to project binary images of lung tissue samples. To counteract the DMD’s 45-deg tilt, the image was displayed in a diamond shape against a black background. The multispectral illumination passed through a collimator and diffuser before striking the DMD and was subsequently focused by two lenses: one with a focal length of 200 mm (Thorlabs LA1708, N-BK7 plano-convex) and the other with a focal length of 75 mm (Thorlabs LA1608, N-BK7 plano-convex), positioned 110 mm apart. The images were captured by a ScopeTek DCM500BW camera housed in a custom 3D-printed casing made using a Stratasys Objet30 Prime™ printer. The camera, equipped with a 5.0-megapixel monochrome image sensor (2592×1944 resolution, 2.2 μm pixel size), recorded the out-of-focus images, which were processed and stored via the ScopePhoto 3.1 software. The imaging speed was approximately 300 ms per capture, with automatic exposure adjustment.

    Formalin-fixed, paraffin-embedded (FFPE) tissue slides stained with hematoxylin and eosin (H&E) from anonymized lung specimens were used for testing. Each slide originated from previously collected anonymized materials that did not contain identifiable patient information. The slides were initially imaged as holographic data using a custom-built lens-free in-line setup.42,43 The reconstructed holographic images of tissue specimens were then binarized using Otsu’s thresholding method,44 which adaptively selects the optimal threshold to separate foreground and background regions. The resulting images were rotated into a diamond shape and positioned against a black background to create an upright frame of suitable size on the CMOS image sensor. The optical configuration resulted in the captured images being 3 times smaller than the projected images, due to the lens system and component distances. The image defocus varied between different illumination wavelengths due to chromatic aberration. Blue light focuses closer to the lens, whereas red light focuses farther away, leading to differences in the amount of defocus for different colors.

    For the training and testing datasets, we collected two sets of image patterns. The first set consisted of 100 image patterns, each with six exposures corresponding to one unique LED illumination at a time. The second set consisted of another 20 image patterns. In addition to the six illumination settings in the first set, each projected image pattern in the second set was illuminated with another 14 illumination settings that had more than one LED on at a time. In total, we captured 964 multispectral images using these two image pattern sets. Note that the total expected dataset size is 1000 images, but 36 were discarded due to illumination synchronization failure in the experiments. The captured raw monochrome images, encompassing the full field-of-view on the image sensor, underwent the following preprocessing steps: we first cropped the center diamond regions and rotated the images by 45 deg to roughly align the captured data to the same orientation as the target images projected to the DMD. A 5 pixel × 5 pixel binning was applied to the input images, which were then resized to 512 pixel × 512 pixel to ensure compatibility with the network model architecture. This resolution was fixed for both the training and evaluation phases to maintain consistency. All resizing operations used bilinear kernels.
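    The preprocessing chain described above can be sketched as follows; the crop coordinates are placeholders rather than calibrated sensor positions, while the binning and resizing steps follow the stated 5 pixel × 5 pixel binning and 512 pixel × 512 pixel output size.

```python
import numpy as np
from scipy.ndimage import rotate
from PIL import Image

# Sketch of the raw-frame preprocessing: crop the central diamond, rotate by
# 45 deg, apply 5x5 binning, and resize to 512x512 with bilinear interpolation.
def preprocess(raw: np.ndarray) -> np.ndarray:
    crop = raw[400:1540, 726:1866].astype(np.float32)   # placeholder center crop
    upright = rotate(crop, angle=45, reshape=False)     # undo the DMD's 45-deg tilt
    h, w = (upright.shape[0] // 5) * 5, (upright.shape[1] // 5) * 5
    binned = upright[:h, :w].reshape(h // 5, 5, w // 5, 5).mean(axis=(1, 3))
    resized = Image.fromarray(binned).resize((512, 512), Image.BILINEAR)
    return np.asarray(resized, dtype=np.float32)
```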

    Acknowledgments

    The Ozcan Lab at UCLA acknowledges valuable discussions with Dr. Douglas M. Baney and funding from Keysight Technologies Inc.

    Biographies of the authors are not available.

    [1] G. A. Shaw and H. K. Burke, “Spectral imaging for remote sensing,” Lincoln Lab. J. 14, 3–28 (2003).

    [40] A. Paszke et al., “PyTorch: an imperative style, high-performance deep learning library,” in Advances in Neural Information Processing Systems (2019).

    [44] N. Otsu, “A threshold selection method from gray-level histograms,” Automatica 11, 23–27 (1975).

    Paper Information

    Category: Research Articles

    Received: Jan. 25, 2025

    Accepted: Jul. 21, 2025

    Published Online: Aug. 5, 2025

    The Author Email: Aydogan Ozcan (ozcan@ucla.edu)

    DOI:10.1117/1.APN.4.5.056002

    CSTR:32397.14.1.APN.4.5.056002
