Advanced Photonics, Vol. 6, Issue 6, 064001 (2024)

Cross-modality transformations in biological microscopy enabled by deep learning

Dana Hassan1,2, Jesús Domínguez1,3, Benjamin Midtvedt1, Henrik Klein Moberg4, Jesús Pineda1, Christoph Langhammer4, Giovanni Volpe1, Antoni Homs Corbera2, and Caroline B. Adiels1,*
Author Affiliations
  • 1University of Gothenburg, Department of Physics, Gothenburg, Sweden
  • 2CherryBiotech, Research and Development Unit, Montreuil, France
  • 3Elvesys – Microfluidics Innovation Center, Elvesys, Paris, France
  • 4Chalmers University of Technology, Department of Physics, Gothenburg, Sweden

    Recent advancements in deep learning (DL) have propelled the virtual transformation of microscopy images across optical modalities, enabling multimodal imaging analyses that were hitherto impossible. Despite these strides, the integration of such algorithms into scientists’ daily routines and clinical trials remains limited, largely due to a lack of recognition within their respective fields and the plethora of available transformation methods. To address this, we present a structured overview of cross-modality transformations, encompassing applications, data sets, and implementations, aimed at unifying this evolving field. Our review focuses on DL solutions for two key applications: contrast enhancement of targeted features within images and resolution enhancement. We recognize cross-modality transformations as a valuable resource for biologists seeking a deeper understanding of the field, as well as for technology developers aiming to better grasp sample limitations and potential applications. Notably, they enable high-contrast, high-specificity imaging akin to fluorescence microscopy without the need for laborious, costly, and disruptive physical-staining procedures. In addition, they facilitate imaging with properties that would otherwise require costly or complex physical modifications, such as superresolution capabilities. By consolidating the current state of research in this review, we aim to catalyze further investigation and development, ultimately bringing the potential of cross-modality transformations into the hands of researchers and clinicians alike.

    1 Introduction

    Modern optical microscopy methods provide researchers with a window into the microscopic world with visual clarity not possible using traditional bright-field microscopy. While bright-field microscopy relies on light absorption by the sample to generate visual contrast, biological specimens often lack sufficient light absorption for clear, analyzable images.1 To overcome this challenge, scientists have traditionally employed various staining techniques and specialized microscopy methods tailored to derive contrast from diverse properties of the sample across different scales, portrayed in Fig. 1. For tissues, researchers use chemical dyes to stain the sample and create contrast.2 Similarly, fluorescent dyes are employed to highlight specific cellular structures.3,4 At the molecular level, fluorophores are commonly utilized to bind to target molecules, enabling researchers to track and observe individual molecules using fluorescence microscopy.5,6

    Figure 1.Applications of cross-modality transformations across biological scales. At the largest scales, virtual staining is used to enhance imaging contrast. At intermediate scales, virtual staining is used in conjunction with noise reduction techniques. At the smallest scales, superresolution is used to study systems far beyond the optical diffraction limit. Image created with the assistance of BioRender.

    However, modern microscopy techniques also present challenges. Staining tissue samples is a laborious, invasive, and often irreversible process, resulting in varying staining outcomes for different tissues and limiting their reuse for alternative purposes.7 Similarly, imaging cellular and subcellular structures poses challenges, such as costly and time-consuming staining procedures that often limit sample utility.8 Furthermore, at the molecular scale, light microscopy encounters limitations in image resolution. Researchers must use specialized objectives, sophisticated setups, and complex fluorophore mixes and buffers to observe sufficiently small structures, resulting in expensive and intricate optical arrangements, e.g., interferometric scattering microscopy9 and direct stochastic optical reconstruction microscopy (STORM).10,11

    Recently, deep learning (DL) has emerged as a potential solution to overcome the aforementioned challenges in microscopy.12 By employing neural networks to perform numerical transformations of images between different optical modalities,13 researchers can capture images using a low-cost modality, such as bright-field microscopy, and convert them to a preferred modality, such as fluorescence microscopy, for simplified analysis.14 This process, known as cross-modality transformation, eliminates the need for costly and invasive staining procedures,15 allowing multiple staining techniques to be reproduced from the same sample with minimal additional expense. Moreover, since the entire process is numerical, results can be easily replicated by independent teams, ensuring reproducible and reliable outcomes.

    In this review, we demonstrate the utilization of cross-modality transformations across biological scales. We outline common strategies for training neural networks for cross-modality transformations, while addressing the specific challenges and possibilities associated with each biological scale. Finally, we summarize the most successful techniques as rules-of-thumb and provide guidelines for the development and utilization of cross-modality transformations.

    2 Introduction to DL for Multimodal Transformations

    DL is a subset of machine learning that uses artificial neural networks to perform specific tasks. Neural networks are complex computational models processing input data to generate output results. The performance of a neural network is determined by its parameters, commonly referred to as weights, which can range from tens of thousands to hundreds of millions, depending on the application. Primarily, the objective of DL is to optimize these weights by a process called training, enabling the neural network to yield desired outcomes for a given input space.

    In cross-modality transformations, the neural networks used are frequently trained using supervised learning.16 During this process, the network is presented with an image captured in one modality (e.g., bright-field) and trained to generate the corresponding image in another modality (e.g., fluorescence). Typically, these training data are obtained using either a dual-modality microscope,17 where the sample is imaged using both modalities, or by imaging the sample twice—before and after a specific treatment (e.g., staining)—with subsequent image processing to align the two views.18 In certain cases, it may be feasible to derive the transformation analytically, so that the sample needs to be imaged in only one modality to train the network to reconstruct the original image. Although the physical staining or alternative imaging process must still be conducted at least once to establish a training pool, subsequent experiments benefit from the simplified workflow.
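    To make the paired-training setup concrete, the sketch below shows one way such co-registered image pairs could be organized for supervised training in Python/PyTorch; the class and variable names are illustrative and not taken from any of the cited works, and the images are assumed to be already registered.

    import torch
    from torch.utils.data import Dataset

    class PairedModalityDataset(Dataset):
        # Wraps pre-aligned (co-registered) image pairs from two modalities,
        # e.g., bright-field inputs and fluorescence targets, as training samples.
        def __init__(self, input_images, target_images):
            assert len(input_images) == len(target_images)
            self.input_images = input_images      # list of 2D numpy arrays (input modality)
            self.target_images = target_images    # list of 2D numpy arrays (target modality)

        def __len__(self):
            return len(self.input_images)

        def __getitem__(self, idx):
            x = torch.from_numpy(self.input_images[idx]).float().unsqueeze(0)   # add channel axis
            y = torch.from_numpy(self.target_images[idx]).float().unsqueeze(0)
            return x, y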

    Cross-modality transformation involves translating images from one modality to another, typically employing encoder–decoder-style fully convolutional neural networks (CNNs). A CNN is a regularized feed-forward network that processes information unidirectionally and learns features automatically through the optimization of convolutional kernels. These networks utilize convolutional operations to process input images and are commonly exemplified by architectures such as U-Net, ResNet, and InceptionNet.

    An essential aspect of training neural networks for cross-modality transformation is the choice of loss function, the metric minimized during training. Traditionally, minimizing the mean absolute error (L1) or mean squared error (L2) distance between the predicted and ground-truth images is common.19 However, this approach often yields low-resolution and nonphysical results. To address this, an auxiliary adversarial loss function is frequently incorporated.20 This involves training a discriminator neural network alongside the main generator network, where the discriminator distinguishes between generated and ground-truth images. The generator is trained to deceive the discriminator by producing physically reasonable results. Such networks are referred to as generative adversarial networks (GANs).
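    As a minimal sketch of how such a combined objective can be assembled (the relative weight and the binary cross-entropy formulation are common choices, not necessarily those of the cited works):

    import torch
    import torch.nn.functional as F

    def generator_loss(discriminator, generated, target, adversarial_weight=0.01):
        # Pixel-wise L1 term keeps the prediction close to the ground truth ...
        pixel_term = F.l1_loss(generated, target)
        # ... while the adversarial term rewards outputs the discriminator judges as real.
        logits_fake = discriminator(generated)
        adversarial_term = F.binary_cross_entropy_with_logits(
            logits_fake, torch.ones_like(logits_fake))
        return pixel_term + adversarial_weight * adversarial_term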

    Alternatively, diffusion models represent a recent approach to generative modeling. These models utilize probabilistic generative techniques in a two-step process: forward diffusion, where noise is iteratively added to images until they become pure Gaussian noise; and reverse diffusion, where images are iteratively denoised using a neural network. By conditioning the reverse diffusion process, diffusion models effectively handle image-to-image transformation tasks.21 While diffusion models can produce higher-quality images than GANs, they come with significantly higher computational costs.
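    The forward process is simple enough to state in a few lines; the sketch below follows the standard DDPM formulation, and the variable names and noise schedule are assumptions rather than details of any specific cited model.

    import torch

    def forward_diffusion(x0, t, alphas_cumprod):
        # Standard DDPM forward step: blend the clean image x0 with Gaussian noise
        # according to the cumulative noise schedule evaluated at timestep t.
        noise = torch.randn_like(x0)
        a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
        xt = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise
        return xt, noise

    # In the reverse process, a denoising network predicts the added noise from (xt, t);
    # conditioning that network on the source-modality image turns iterative denoising
    # into an image-to-image transformation.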

    3 Tissue Imaging/Histology

    Histological staining is a cornerstone of clinical pathology and research, playing a pivotal role in unraveling tissue details at the microscopic level. It enables the visualization of structures crucial for medical diagnosis, scientific study, autopsy, and forensic investigation.22 In recent years, DL advances have revolutionized tissue imaging and histology analysis, offering innovative solutions to overcome the limitations of traditional physical staining methods.23,24 In the following sections, we will explore the transformative impact of DL techniques in substituting conventional staining approaches, with the aim of improving the analysis of histological samples.

    3.1 Limitations of Chemical Staining

    One of the most impactful applications of DL for cross-modality transformation is in histology, where visual tissue analysis often faces challenges associated with traditional physical staining. Tissues, as the largest biological structures routinely observed through optical microscopy, require staining protocols to create visual contrast between features. However, these protocols often rely on chemical dyes that can be hazardous and may adversely affect the samples, especially during critical steps when sample structures are vulnerable.25–27 In addition, manual labor and dexterity are required, and, especially for fluorescent dyes, not all staining procedures are compatible within the same sample, limiting the information obtainable.

    The histological process usually comprises several steps, including fixation, embedding, sectioning, staining, and mounting, although the specific steps may vary according to the staining technique and the target tissue. The first step is fixation, where the full tissue sample is preserved using chemicals, such as formaldehyde or glutaraldehyde. Fixation prevents decay and maintains structural integrity by cross-linking the proteins in the sample; however, the tissue’s original chemistry is altered. An alternative approach is to freeze the sample, often using liquid nitrogen, which can preserve the natural state of proteins and lipids without chemicals. The next step is to dehydrate the tissue sample through a graded series of alcohol solutions and to clear it using clearing agents that dissolve remnant lipids and simultaneously homogenize the refractive index. This process renders the tissue transparent as a consequence of more uniform light scattering across the sample28 and prepares it for infiltration, but it can cause tissue shrinkage and morphological alterations. The sample can then be completely encased in an embedding medium, such as paraffin wax, and left to solidify. Once hardened, the sample is cut with a microtome into very thin slices, around 4 to 10 μm thick, in a step known as sectioning. This step requires high skill and precision, as incorrect microtome alignment or use can tear or crush the tissue and obscure important details. Finally, the slices are placed on microscope glass slides for observation. Handling must avoid stretching or folding, which can distort the samples and hinder analysis. Once mounted, the samples are ready for the next step: staining. During staining, various stains are applied based on the cellular structures of interest. Hematoxylin and eosin (H&E) is among the most widely used histological stains, with hematoxylin staining cell nuclei purple and eosin coloring the cytoplasm and extracellular matrix pink. Other stains target different structures, such as Picrosirius Red (PSR) for collagen fibers or Alcian Blue for acidic mucins and cartilage. An important technique within this process is immunohistochemistry (IHC), which detects specific proteins in the sample using antibodies. This method requires antigen retrieval to unmask target proteins after fixation, alongside a blocking step to reduce nonspecific antibody binding. The staining process is sensitive to factors such as concentration and timing, and if applied unevenly, it can obscure details. Other limitations of IHC include nonspecific binding or cross-reactions of the added antibodies, which can negatively affect the outcome. Finally, the sample is ready for detailed observation, and the histological features can be identified via microscopy imaging.

    Traditionally a qualitative method, chemical staining can also yield quantitative data through image analysis,29 such as measuring staining intensity, to estimate the presence of biomolecules. However, identifying biological features often requires additional context and expert analysis, and variability in staining intensity, reagent quality, and human interpretation can affect results. Standardized protocols and software tools can mitigate some of this variability, though issues persist compared to virtual staining and inherent contrast techniques.

    As an alternative to chemical staining, inherent contrast techniques—such as phase contrast, differential interference contrast (DIC),30,31 and quantitative phase imaging (QPI)—exploit refractive index variations in biological tissues to enhance contrast.32 These methods require minimal additional equipment but lack the specificity of chemical dyes and are prone to optical artifacts, such as halo effects.33 To improve specificity, they are often combined with other techniques. The choice between virtual staining and inherent contrast methods depends on factors such as cost-effectiveness and imaging quality.

    3.2 DL for Tissue Imaging

    DL models can be trained to virtually stain samples, whether they are unstained34 or have been stained using a different method.35 By bypassing the physical staining process, multiple readouts equivalent to different dyes can be obtained from the same image. This not only maximizes information output36 for analysis and diagnostics37 but also simplifies the experimental setup requirements, as shown in Fig. 2.

    Figure 2.Contrast between physical and virtual approaches to obtain a stained image. In the physical approach, the sample undergoes a series of complex procedures, including preparation, staining, and imaging. Tissue preparation may involve fixing, embedding, and sectioning, among other steps. Similarly, histological staining of an unstained sample requires permeabilization, chemical dye application, washing, counterstaining, and protocol optimization before imaging. In contrast, virtual staining offers a simplified alternative to these protocols, eliminating the need for physical processing37 or staining of the sample.38 In the virtual approach, an unaltered or unstained sample is processed through a virtual staining network to generate a stained image, with results equivalent to physical staining. Physically stained images serve as training data, or input, for the model, especially when transforming between different stains is the objective. Created with the assistance of BioRender. (Tissue image adapted from Berkshire Community College Bioscience Image Library.)

    In histology, DL models are trained to virtually stain tissue using collections of stained/unstained image pairs as a reference,36,37,39–47 as shown in Fig. 3(a). This method constitutes a purely virtual approach to staining samples, although the starting point is not always an unstained sample. Some models have the capability to transform one type of staining into another.36 Hong et al.,48 as shown in Fig. 3(b), washed and restained the samples to obtain equivalent pairs for a stain-to-stain translation model. Still, there are cases of cross-modality transformation of tissues based on unpaired samples.49 These cases are notably more intricate, as they typically do not rely on training pools consisting of paired samples50 but rather on disparate samples obtained through different techniques or dyes. In alternative scenarios, a multistain model,51 such as the one shown in Fig. 3(c), is developed, where the model is trained to virtually apply various dyes to an unstained sample, enabling it to transform a stained image into each of the other dye types.36 An illustrative example is provided by Rivenson et al.,52 who used a CNN to transform wide-field autofluorescence images of unlabeled tissue sections53 into images equivalent to bright-field images of histologically stained versions of the same samples. Their study demonstrates the feasibility of this approach to generate multiple types of stains on different tissue types from the autofluorescence signal.

    Figure 3.Representative applications of cross-modality transformations for tissue imaging using DL. (a) Virtual staining of an unlabeled sample image to obtain the equivalent H&E stained image. Adapted from Rana et al.39 (b) Stain-to-stain translation where the input and output are images from two different staining procedures, in this case H&E to IHC staining for cytokeratin (CK). Adapted from Hong et al.48 (c) Multi-stain model that is able to transform unlabeled tissue images into different staining options simultaneously: H&E, orcein, and PSR. Adapted from Li et al.40 (d) Cross-modality transform to apply a segmentation method, or potentially a stain, in a previously incompatible modality. In this case, an AI segmentation for MRI images is transcribed to CT images. Adapted from Dou et al.49 (e) Biopsy-free cross modality transformation, where not only the staining procedure but also the sample preparation is avoided. Using CRM as a noninvasive technique for in vivo measurements, the resulting images incorporate features comparable to H&E, despite being incompatible with traditional staining techniques in such conditions. Adapted from Li et al.37

    Therefore, virtual and histological stainings are not mutually exclusive and can be complementary. While virtual staining offers convenience and reproducibility, it still relies on histological staining to provide the ground truth for generating large training data sets. The main trends are summarized in Table 1 in Sec. 6. Some models benefit from existing databases of images of stained tissues for their training, reducing the manual effort required to obtain a sufficient training set.34,61,62 Virtual staining also offers advantages over traditional histology, including the potential for real-time staining of tissue samples63 and three-dimensional (3D) reconstructions of full tissues.63 In the latter case, Wang et al.63 virtually stained light-field microscopy images (not to be confused with bright-field microscopy) of volumetric samples, merging two typically incompatible techniques. Furthermore, certain virtual staining models extend their capabilities by integrating segmentation of the highlighted region of interest.48,64–66 Thanks to cross-modality transformations, segmentation, and potentially staining protocols, can be applied to imaging techniques in which contrast-enhancing staining is usually not available, as shown in Fig. 3(d). In this example from Dou et al.,49 a segmentation obtained through artificial intelligence on magnetic resonance imaging (MRI) measurements is transferred to CT images. This is particularly relevant for the utilization of machine-learning models for diagnostic purposes, facilitating the differentiation of healthy tissue samples from those associated with conditions, such as infections or tumors.61,62,67,68 Moreover, in certain cases,37 not only is diagnosis accomplished through virtual staining, but suitable input images are also acquired directly from patients via noninvasive methods, as represented in Fig. 3(e), underscoring the potential of this powerful technique. While most cross-modality transformations in tissues are based on the same imaging modality with different nonfluorescent dyes, some examples involve transformation into fluorescent dyes35 or other imaging techniques. Nevertheless, fluorescence is of greater relevance at the cellular level, where emphasis is placed on specific structures and organelles.

    Virtual staining models must undergo rigorous validation against chemically stained samples to ensure accurate representation of biological features, as variations in color, texture, and detail may occur. A comprehensive comparison between traditional and virtual staining methods is therefore crucial for assessing reliability,69 which requires either manual or automated ground-truth data annotation. Objective metrics, such as the structural similarity index measure (SSIM)51,70–72 and the peak signal-to-noise ratio (PSNR),73–75 are used to measure accuracy, while expert visual evaluations are essential to confirm that virtual stains accurately reflect key histological features for research and diagnostic purposes.
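    For reference, both metrics are readily computed with scikit-image; the snippet below is a minimal example that assumes the virtual and ground-truth images are RGB NumPy arrays of matching shape.

    from skimage.metrics import structural_similarity, peak_signal_noise_ratio

    def evaluate_virtual_stain(virtual, ground_truth):
        # Compare the virtually stained image against the chemically stained reference.
        data_range = float(ground_truth.max() - ground_truth.min())
        ssim = structural_similarity(ground_truth, virtual,
                                     data_range=data_range, channel_axis=-1)
        psnr = peak_signal_noise_ratio(ground_truth, virtual, data_range=data_range)
        return ssim, psnr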

    In general, a single dye is insufficient to provide comprehensive information about a particular tissue sample. Instead, various dyes can be applied to different samples of the same original tissue, such as different slices of a single specimen. In the study by Li et al.,40 three distinct dyes were applied to carotid tissue: H&E, PSR, and orcein, the last used to demonstrate elastic fibers. Each dye targets different components of artery tissue, aiding in the identification of coronary artery disease and vascular injuries. First, an independent model was trained for each dye to virtually stain an unstained sample. Then, a complete model was trained to simultaneously produce all three modalities from the original sample. This was accomplished by applying one of the three corresponding staining protocols to the originally unstained samples, generating pairs of stained and unstained images for each type. A total of 60 whole-slide images of each stain, along with their unstained equivalents, were utilized, yielding 1500 to 1800 divided images for training and 150 to 200 images for validation for each staining protocol. Following standard practice, a conditional generative adversarial network (cGAN) was implemented to learn the generation of stained images from the acquired data set. The generator architecture was based on U-Net, while the discriminator comprised a PatchGAN architecture. To accomplish virtual staining according to three distinct protocols, the StarGAN76 architecture was implemented. This framework enables image-to-image translations across multiple domains using only a single model, offering practically unlimited potential for utilizing unstained samples, as they can in principle be transformed into any other protocol with an appropriately trained network.36,77
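    As an illustration of the discriminator side of this standard setup, the sketch below implements a PatchGAN-style network in PyTorch; the layer sizes and channel counts are generic choices rather than those of the cited work.

    import torch.nn as nn

    class PatchDiscriminator(nn.Module):
        # Classifies overlapping image patches as real or fake instead of the whole image,
        # encouraging locally realistic textures in the generated stains.
        def __init__(self, in_channels=6):  # e.g., unstained and stained images concatenated
            super().__init__()
            layers, channels = [], 64
            layers += [nn.Conv2d(in_channels, channels, 4, stride=2, padding=1),
                       nn.LeakyReLU(0.2)]
            for _ in range(3):
                layers += [nn.Conv2d(channels, channels * 2, 4, stride=2, padding=1),
                           nn.InstanceNorm2d(channels * 2),
                           nn.LeakyReLU(0.2)]
                channels *= 2
            layers += [nn.Conv2d(channels, 1, 4, padding=1)]  # one logit per patch
            self.model = nn.Sequential(*layers)

        def forward(self, x):
            return self.model(x)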

    These studies suggest that DL holds significant potential for histological staining, yet its widespread adoption remains limited. Although DL has emerged as a leading choice for analyzing and interpreting histology images, with the potential to enhance medical diagnostics,78 very few algorithms have transitioned to clinical implementation.79 Several challenges persist, notably the need for accurate labeling and for addressing variations in slide colors,80–82 as discussed in Sec. 6. Looking ahead, the availability of high-throughput experimental devices is crucial for optimizing the performance of DL-based digital histopathology methods. Fanous et al.83 showed the potential of DL to speed up data acquisition in scanning microscopes, using a GAN-based image restoration approach to reconstruct motion-blurred scanned images.84 This suggests that DL-based approaches could significantly improve experimentation efficiency and speed, thereby improving algorithm performance. In addition, integrating multiple modalities, such as molecular profiling information,85,86 could further enhance the accuracy of disease classification and prognosis in digital pathology. Furthermore, the development of transfer-learning techniques that leverage models pretrained on large data sets could mitigate the challenge of limited training data in certain biological applications, expanding DL’s utility in digital pathology. Promising advancements are expected in the use of diffusion models for data augmentation and synthetic image generation, as demonstrated by Moghadam et al.,87 whose generated histopathology images were indistinguishable to experienced pathologists.

    Diffusion models have emerged as a powerful tool in histology for virtual staining, addressing some of the limitations seen in traditional models, such as GANs and encoder–decoder networks. For instance, StainDiff is a diffusion probabilistic model designed to improve stain-to-stain transformations by overcoming issues such as mode collapse, where the generator of a GAN produces images covering only a limited range of the training samples, and the posterior mismatching found in other networks.88 Diffusion models have also been applied to generate virtual IHC images from H&E-stained slides, as seen in PST-Diff, which ensures structural and pathological consistency through mechanisms such as asymmetric attention and latent transfer.89 Despite their potential, diffusion models generally require large data sets, making them less feasible for histological applications with limited data. To address this, multitask architectures such as StainDiffuser have been developed to simultaneously generate cell-specific stains and segment cells, optimizing performance even with constrained data sets.90 In addition, advanced methods, such as virtual IHC multiplex staining, utilize large vision-language diffusion models to generate multiple IHC stains from a single H&E image, addressing tissue-preservation challenges often faced in biopsies.91 However, challenges remain. Diffusion models have shown limitations in unpaired image translation tasks, such as slide-free microscopy virtual staining, where the sample preparation process is bypassed along with the staining process. In such cases, they underperform compared to models such as CycleGAN without additional regularization, highlighting the need for further refinement in certain applications.92 Despite these challenges, diffusion models show great promise in virtual staining, with ongoing research focused on enhancing their reliability and applicability in histology.

    4 Cellular and Subcellular Structure Imaging

    Biologists and clinical laboratories routinely employ optical microscopy to examine cell cultures, enabling the study of cellular and subcellular morphologies and physiology. This examination helps in understanding intercellular communication networks, dynamic cell behaviors, and pathophysiological mechanisms.93 For instance, changes in the morphological characteristics of cellular structures serve as effective indicators of a cell culture’s physiological status and its response under drug exposure.94,95 In the subsequent sections of this review, we delve into the limitations of fluorescence staining techniques, shedding light on the challenges associated with both fixed and live staining methods. In addition, we explore how DL approaches are revolutionizing cellular imaging analysis, offering innovative solutions to overcome these limitations and ushering in a new era of advanced and automated cell culture investigations.96

    4.1 Limitations of Fluorescence Staining

    Standard cell imaging workflows typically rely on fluorescence microscopy, employing either fixed or live fluorescent staining techniques to highlight specific cell structures. Despite their widespread use, both fixed and live fluorescent staining methods have limitations. These procedures can be invasive and toxic, potentially impacting cell health and behavior.97 In fixed staining, as for tissue, the fixation process itself can introduce artifacts by altering the native state of cellular components. In addition, the use of permeabilization methods compromises cell membrane integrity.98 Furthermore, fixed staining provides only a static view of cellular processes, limiting the ability to study dynamic processes. Conversely, live staining, while theoretically preserving the native state of cells, often alters their biological activity and can be toxic.99 The availability of specific and effective live-staining dyes can also be limiting, restricting the visualization of certain cellular components. In addition, real-time observation of cellular processes in live staining may be challenging due to phototoxicity and photobleaching over extended imaging periods.100 Lastly, the use of multiple fluorophores can lead to spectral cross talk between fluorescence channels, potentially resulting in misleading results and complicating image analysis. These challenges hinder the acquisition of reliable longitudinal data, which is often crucial for studying the effects of drug exposure over time.101

    4.2 DL for Cellular and Subcellular Imaging

    Recently, research has proposed the use of DL as an alternative to conventional physical staining methods to mitigate these inherent problems. These works suggest replacing physical staining and fluorescence microscopy with a neural network that generates virtual fluorescence-stained images from unlabeled samples.

    Virtual cell staining has been achieved from various imaging modalities, including phase contrast,55,102 QPI,103 and holographic microscopy.104 Moreover, recent studies have shown that bright-field images, despite their limited detail, contain sufficient information for a CNN to reproduce different types of staining.

    For example, Ounkomol et al.105 introduced a CNN-based framework to map the relationship between paired 3D bright-field and fluorescence live-cell images for various key subcellular structures (e.g., DNA, cell membrane, nuclear envelope, and mitochondria). Each cellular component is modeled separately, with a U-Net trained independently for each one. The training process minimizes the mean-squared error between the ground-truth fluorescence image and the predicted image. Once trained, these models can be combined, allowing a single 3D bright-field input to generate multichannel, integrated fluorescence images across multiple subcellular structures. Particularly advantageous, the training requires relatively few paired examples (only 30 pairs per structure), lowering the entry barrier to machine learning.
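    A minimal training loop for this supervised, per-structure setup might look as follows; the function signature, learning rate, and epoch count are placeholders and not values from the cited study.

    import torch
    import torch.nn.functional as F

    def train_structure_model(model, loader, epochs=50, learning_rate=1e-4):
        # One model per subcellular structure: minimize the mean-squared error between
        # the predicted and the measured fluorescence channel.
        optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
        for _ in range(epochs):
            for brightfield, fluorescence in loader:
                prediction = model(brightfield)
                loss = F.mse_loss(prediction, fluorescence)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
        return model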

    The work by Helgadottir et al.54 offers yet another compelling example of the potential of virtual staining of cellular structures from bright-field images. Similar to other approaches, this method relies on a modified version of the U-Net to learn the cross-modality mapping. However, it enhances the reconstruction accuracy by incorporating GAN-based training.

    GANs have become a widely adopted framework in virtual cell staining due to their capacity to generate high-quality, realistic images.54,55,102104 Particularly, Helgadottir et al. employed a conditional GAN to virtually stain lipid droplets, cytoplasm, and nuclei from bright-field images of human stem-cell-derived adipocytes. The generator network, a U-Net, processes a stack of bright-field images captured at multiple z positions and generates virtually stained fluorescence images. The architecture features three independent decoding paths, each dedicated to one cellular component, to effectively decorrelate the features associated with the predicted fluorescence images. The discriminator, a CNN, is trained to distinguish between the synthetically generated virtual stains and actual fluorescently stained samples (conditioned on the input bright-field image). These two neural networks are trained simultaneously until the generator can fool the discriminator by producing images that closely mimic real fluorescence [Fig. 4(a)]. An interesting feature of Helgadottir’s method is its robustness and rapid convergence; the neural network requires relatively few epochs to quantitatively reproduce the corresponding cell structures [Fig. 4(b)].
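    A schematic of such a multi-head generator is sketched below; the skip connections and layer details of the actual U-Net are omitted for brevity, and the class is illustrative rather than the published implementation.

    import torch
    import torch.nn as nn

    class MultiHeadGenerator(nn.Module):
        # One shared encoder processes the bright-field z-stack; independent decoders
        # each predict one virtual stain (e.g., lipid droplets, cytoplasm, nuclei).
        def __init__(self, encoder, decoders):
            super().__init__()
            self.encoder = encoder
            self.decoders = nn.ModuleList(decoders)

        def forward(self, brightfield_stack):
            features = self.encoder(brightfield_stack)
            stains = [decoder(features) for decoder in self.decoders]
            return torch.cat(stains, dim=1)  # one output channel per cellular component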

    Figure 4.Virtual cell staining using DL. (a) Helgadottir et al. introduced a cGAN to virtually stain lipid droplets, cytoplasm, and nuclei using bright-field images of human stem-cell-derived fat cells (adipocytes). The U-Net-based generator processes bright-field image stacks captured at various z positions to generate virtually stained fluorescence images. A CNN-based discriminator is trained to differentiate between the virtually generated stains and real fluorescently stained samples, conditioned on the input bright-field image. (b) The virtual staining of lipid droplets (green channel) and cytoplasm (red channel) exhibits a high degree of fidelity, as evidenced in the fine details of the lipid droplet internal structure and the enhanced contrast among distinct cytoplasmic components (highlighted by the arrows). Panels (a) and (b) adapted from Helgadottir et al.54 (c) Unsupervised cross-modality image transformation using UTOM. Two GANs, G and F, are trained concurrently to learn bidirectional mappings between image modalities. The model incorporates a cycle-consistency loss (Lcycle) to ensure the invertibility of transformations, while a saliency constraint (Lsc) preserves key image features and content in the generated outputs. (d) UTOM achieves performance comparable to a supervised CNN trained on paired samples, without requiring paired training data. Panels (c) and (d) adapted from Li et al.55 (e) Co-registered label-free reflectance and fluorescence images acquired using a multimodal LED array reflectance microscope. (f) Multiplexed prediction displaying DNA (blue), endosomes (red), the Golgi apparatus (yellow), and actin (green). Zoomed-in views, with white circles, highlight representative cell morphology across different phases of the cell cycle. Adapted from Cheng et al.106

    While GANs have proven effective in enhancing the performance of virtual staining networks for various applications, they rely on co-registered input and ground-truth images. Nevertheless, obtaining perfectly co-registered training pairs is often challenging due to the rapid dynamics of biological processes or the incompatibility of different imaging modalities. To address this limitation, Li et al.55 introduced unsupervised content-preserving transformation for optical microscopy (UTOM). This approach utilizes a CycleGAN to transform images between domains without requiring paired data. Unlike traditional GAN models, CycleGANs employ two generator-discriminator pairs, one for each domain, to learn bidirectional mappings between imaging modalities [Fig. 4(c)]. UTOM has been applied, among other examples, to the virtual staining of phase-contrast images of differentiated human motor neurons, notably delivering competitive performance compared to a CNN architecture trained on paired samples under supervision despite the lack of paired training data [Fig. 4(d)].
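    The central ingredient is the cycle-consistency term, sketched below for two generators g_ab and g_ba mapping between modalities A and B; the weight of 10 is a conventional CycleGAN choice, not necessarily that used in UTOM.

    import torch.nn.functional as F

    def cycle_consistency_loss(g_ab, g_ba, real_a, real_b, weight=10.0):
        # Translating an image to the other modality and back should recover the original,
        # which constrains the mappings even without paired training data.
        reconstructed_a = g_ba(g_ab(real_a))  # A -> B -> A
        reconstructed_b = g_ab(g_ba(real_b))  # B -> A -> B
        return weight * (F.l1_loss(reconstructed_a, real_a) +
                         F.l1_loss(reconstructed_b, real_b))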

    Importantly, although the architecture and training of the neural network play a decisive role in the performance of virtual staining models, the input imaging modality must capture sufficient contrast of the different cell structures, providing the network with enough information to learn the transformation to the desired high-contrast, high-specificity fluorescently stained samples.

    Recent research has centered on the development of optical systems that capture the rich structural details of cells and embed inductive bias within the network to enhance its performance. For instance, Cheng et al.106 profited from the rich structural information and high sensitivity of reflectance microscopy to boost the performance of virtual staining models. Specifically, the authors employed an LED array reflectance microscope to acquire co-registered label-free reflectance and fluorescence images.107 This platform collects four dark-field reflectance images using half-annulus LED patterns oriented in different directions (top, bottom, left, and right). From these measurements, two dark-field reflectance differential phase contrast (drDPC) images are computed along orthogonal orientations [Fig. 4(e)]. Interestingly, the oblique-illumination dark-field and drDPC images provide complementary contrast information. While the raw dark-field images highlight subcellular structures, such as nuclei, nucleoli, and hyperreflective areas near the nuclear periphery, the drDPC images emphasize cell membranes with clearly defined boundaries. These images serve as multichannel input for the virtual staining model, which, boosted by the enhanced resolution and sensitivity of the backscattering data, provides a reliable prediction of subcellular features [Fig. 4(f)].

    In a similar vein, Cooke et al.108 proposed incorporating a physical model of the experimental microscope into the virtual staining model. This approach utilizes a CNN which incorporates a “physical layer” representing the microscope’s illumination model. Consequently, during training, the network learns task-specific LED patterns that significantly enhance its ability to infer fluorescence image information from label-free transmission microscopy images. This work, in particular, further underscores the importance of rich input data and highlights the potential combination of programmable optical elements and physics-informed DL to open new possibilities for exploring the structure and function of cells.

    5 Molecular Imaging

    One of the most significant advancements in molecular imaging, which involves the optical imaging of single biological molecules at micro- and nanoscales, has been the introduction of fluorescence microscopy techniques.3 However, unlike in tissue and cellular imaging, the lack of viable techniques for studying molecules without fluorescence currently prevents virtual staining in molecular imaging. Rather, the focus of cross-modality transformations in molecular imaging typically revolves around superresolution microscopy aiming to surpass diffraction-imposed limits in imaging molecules.16 Traditionally, achieving such high resolutions requires either expensive microscopy setups with specialized objectives, complex numerical estimations of the imaging process, or specific fluorophores.109 Nevertheless, recent research has indicated that DL-based approaches using generative learning (Sec. 2) can enhance the resolution of images captured with ordinary objectives, comparable to those obtained with costly specialized objectives. Further, DL-driven cross-modality transformations have demonstrated the ability to achieve superresolution across various microscope modalities.16,110112 In this section, we first survey the physics underlying the spatial resolution limitations and the subsequent methods for achieving superresolution, aiming to overcome these constraints. Thereafter, we present an overview of the relatively recent work in applying DL techniques to transform images into their superresolved counterparts, particularly focused on, but not limited to, applications in molecular imaging.

    5.1 Physics of Superresolution Microscopy

    Spatial resolution, the shortest physical distance between two points that can be distinguished within an image, stands out as the single most important feature in optical microscopy.113 The primary constraint on the achievable spatial resolution stems from an intrinsic phenomenon of diffraction physics. When light from a point-like source (an object of diameter D1) traverses a lens, it undergoes diffraction, producing a characteristic pattern known as the Airy disk: a bright central region of diameter D2 surrounded by concentric rings of diminishing intensity. The Airy disk represents the smallest focal spot achievable by a light beam, and its intensity profile defines the point spread function (PSF) of the imaging system. When two Airy patterns approach each other closely enough to interfere significantly, the contrast between them drops; they merge, become indistinguishable, and thereby limit the spatial resolution. Regardless of lens quality or optical component alignment, a microscope’s resolution ultimately scales with the wavelength of the detected scattered light and inversely with the numerical aperture (NA) of its objective. This relationship is shown in Fig. 5(a), where light from the sample traverses an objective to the image plane, generating a fundamentally limited diffraction pattern, the PSF. The PSF sets the minimal distance between two discernible points in the sample, shown in Fig. 5(b). The full width at half-maximum of the PSF in the lateral direction can be approximated as FWHM = 0.61 λ/NA, where λ represents the light’s wavelength and NA denotes the numerical aperture of the objective. Thus, for a typical oil-immersion objective with NA = 1.4, the resulting PSF has a lateral size of roughly 200 nm and an axial size of roughly 500 nm, effectively restricting the resolution to this range for visible-light studies.114 Comparing these scales with those depicted in Fig. 1, it becomes evident that the diffraction limit rarely poses a challenge in most imaging at organ, tissue, or even cellular levels. However, in cellular exploration, where subcellular and molecular structures are of interest, diffraction limits become prominent. These issues are exacerbated by the typically dense distribution of molecules and subcellular structures, which causes their PSFs to overlap and blurs many intricate details together. Hence, the development of superresolution techniques that surpass the diffraction limit becomes imperative for the further exploration of these structures using noninvasive optical light. Various microscopy techniques have been developed to overcome this limitation, including single-molecule localization microscopy (SMLM) methods, such as STORM,115 photo-activated localization microscopy (PALM),116 and fluorescence photoactivation localization microscopy.117 Other methods of transcending the standard resolution of microscopes exist, including complex numerical estimations of point spread (transfer) functions seeking to estimate the diffraction behavior, illumination-pattern engineering methods that reduce the PSF size,118 as well as specialized fluorophores.109 However, these approaches pose their own challenges, including complex and multivariate dependencies on imaging conditions that make solving the diffraction integrals of PSFs exceedingly difficult for practically relevant systems,119 as well as the increased costs associated with the aforementioned fluorophores.120
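    As a quick numerical check of the figures quoted above (assuming green light at 550 nm, a wavelength chosen here purely for illustration):

    # Lateral diffraction limit for a typical oil-immersion objective.
    wavelength_nm = 550.0        # illustrative visible wavelength (green light)
    numerical_aperture = 1.4     # typical oil-immersion objective
    fwhm_nm = 0.61 * wavelength_nm / numerical_aperture
    print(f"Lateral PSF FWHM ~ {fwhm_nm:.0f} nm")  # ~240 nm, i.e., on the order of 200 nm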

    Figure 5.Superresolution physical principles. (a) Illustration of the PSF resulting from imaging object of diameter D1 below the diffraction limit of an optical system, leading to an image of diameter D2. (b) Low-resolution image of simulated emitters alongside ground-truth emitter positions. Image reproduced with permission from Ref. 56.

    In recent years, another promising avenue for achieving superresolution has emerged as a consequence of the astounding growth and success of DL-based computer vision algorithms. Analogous to the cross-modality transforms mentioned above, the DL approach to superresolution involves training neural networks to transform one imaging modality (regular-resolution images) into another (superresolved images). Some of these approaches utilize generative learning, effectively learning the complex interpolation function between regular- and superresolved images, while others use direct supervised learning, for example, by estimating the positions of the underlying diffraction-limited emitters. The specific techniques for training these networks vary considerably across applications, as elaborated upon further below.

    5.2 DL for Superresolution Microscopy

    In general, DL for superresolution can be categorized into two approaches, each with two learning paradigms. The first approach aims to enhance resolution by training end-to-end, directly transforming low-resolution images into high-resolution ones. This can be achieved through supervised learning, using pairs of simulated or experimentally measured images from the same sample at different resolutions to train neural networks, or through unsupervised learning, where only low-resolution or high-resolution images are obtained, either experimentally or through simulations. The other approach seeks resolution enhancement by training a network to output the position of each individual molecule (or equivalent scattering object) within an image, and then reconstructing the high-resolution image from these positions, thus transcending the diffraction limits. This approach can also be trained either in a supervised or unsupervised fashion. A summary of the different models and their characteristics can be found in Table 1 in Sec. 6 guidelines. Once such a network is trained, it can swiftly generate high-resolution images without the need for parameter adjustment, yielding an efficient algorithm for improving image resolution within a specific modality.121–126

    5.2.1 End-to-end superresolution mapping

    One common approach for supervised end-to-end low- to high-resolution mapping involves pre-upsampling the low-resolution image using a traditional upsampling interpolating algorithm, followed by training a CNN to refine the upsampled image until accurate superresolution is achieved. This approach, initially implemented for single-image superresolution,127 has been used in various biological applications, such as enhancing the resolution of magnetic resonance (MR) images128–132 and X-ray computed tomography (CT) images.133
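    A minimal pre-upsampling network in this spirit is sketched below (an SRCNN-like layout with generic layer sizes, not the exact architectures of the cited works).

    import torch.nn as nn
    import torch.nn.functional as F

    class PreUpsampleSR(nn.Module):
        # Interpolate the low-resolution image first, then refine it with a small CNN.
        def __init__(self, scale=4):
            super().__init__()
            self.scale = scale
            self.refine = nn.Sequential(
                nn.Conv2d(1, 64, 9, padding=4), nn.ReLU(),
                nn.Conv2d(64, 32, 5, padding=2), nn.ReLU(),
                nn.Conv2d(32, 1, 5, padding=2))

        def forward(self, low_res):
            upsampled = F.interpolate(low_res, scale_factor=self.scale,
                                      mode="bicubic", align_corners=False)
            return self.refine(upsampled)  # the CNN restores high-frequency detail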

    Another approach applies the upsampling only after the image has passed through the computationally intensive CNN layers. This reduces the overall computational burden, as most of the computations are performed on low-resolution images. This approach, known for its efficiency, was first introduced by Dong et al.134 and has also been applied in various biological contexts, including superresolution of X-ray images,135 endoscopy images,136,137 cardiac images,138 and MR images.131
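    For comparison, a post-upsampling sketch is given below; it uses a sub-pixel (PixelShuffle) layer as one common way to upsample at the very end, although the cited works may use transposed convolutions instead.

    import torch.nn as nn

    class PostUpsampleSR(nn.Module):
        # All convolutional work happens at low resolution; only the final layer upsamples.
        def __init__(self, scale=4):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 64, 5, padding=2), nn.ReLU(),
                nn.Conv2d(64, 32, 3, padding=1), nn.ReLU())
            self.upsample = nn.Sequential(
                nn.Conv2d(32, scale * scale, 3, padding=1),
                nn.PixelShuffle(scale))  # rearranges channels into a higher-resolution grid

        def forward(self, low_res):
            return self.upsample(self.features(low_res))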

    A different strategy for end-to-end low- to high-resolution mapping involves iteratively up- and downsampling the image using downsampling convolutional layers and upsampling transposed convolutional layers. This technique, utilized in the “back projection” networks presented by Harris et al.,139 incorporates an error feedback mechanism for projection errors at each iteration stage. Each up- and downsampling stage is mutually connected through concatenation, reflecting the mutual dependence of low- and high-resolution image pairs, which the authors demonstrated to yield superior results across multiple data sets, as outlined in Table 1 in Sec. 6. This approach has also been applied in various biological applications, such as the transformation of CT brain images into higher-resolution MRI images140,141 for the detection of multiple sclerosis142 and Alzheimer’s disease,143 as well as cardiac MRI scans144 and 3D scans.145

    Yet another approach involves sequentially upsampling low-resolution images in several separate steps using separate models, as introduced by Lai et al.146 This approach offers two main benefits. First, it allows the user to choose the desired resolution of the high-resolution output without retraining the models. Second, it simplifies the learning task for each individual network, since each performs only part of the upsampling rather than the full superresolution in a single feedforward pass, which may improve the performance of the final models.

    5.2.2 Specific architectures and methods

    Although the generic approaches mentioned above can result in a practically infinite variety of specific architectures, many significant superresolution studies in molecular microscopy have been achieved using a small number of named architectures. The structured-feature superresolution microscopy architecture, shown as an example in Fig. 6, allows for precise live-cell imaging with high spatial and temporal resolution, enabling subcellular dynamics to be monitored continuously over extended periods. Among the most notable architectures are ANNA-PALM,60 a U-Net-based cGAN trained solely on experimental data; Deep-STORM,56 a CNN encoder–decoder network trained on simulated data; and smNet,57 which directly outputs molecule location, dipole orientation, and wavefront distortion from complex and subtle features of the PSF. In single-molecule superresolution microscopy, there is generally a trade-off between throughput and resolution: to construct a high-quality superresolution image, a large number of molecules need to be localized with high precision, requiring sufficient localizations before a structure of interest is adequately sampled.

    Figure 6.Superresolution applied architecture. The superresolution network enhances image resolution by training on pairs of simulated low-resolution (LR) and high-resolution ground-truth images or on wide-field (WF) and STORM images from a STORM microscope. First, the LR/WF image undergoes preprocessing through a subpixel edge detector to generate an edge map, both of which serve as inputs to the network. Training is guided by a multi-component loss function that incorporates the combination of multiscale structure similarity index measure and mean absolute error loss (MS-SSIM L1) to capture pixel-level accuracy between the superresolution (SR) and ground-truth/STORM images through multiscale similarity and mean absolute error, perceptual loss to assess feature map differences via the visual geometry group network, adversarial loss using a U-Net discriminator to differentiate ground-truth/STORM images from SR images, and frequency loss to compare differences in the frequency spectrum between SR and ground-truth/STORM images within a specific frequency range using the fast Fourier transform function. This comprehensive loss function helps the superresolution network model achieve precise and perceptually accurate superresolution imaging. Image adapted from Chen et al.147

    ANNA-PALM accomplishes this by training on a set of blinking single molecules from which a high-quality superresolution image can be experimentally acquired. A subset of these frames is used to generate a (low-quality) “sparse” superresolution image, which, alongside the diffraction-limited image and information about the imaged structure, serves as input to ANNA-PALM. The output, or label, is the full superresolution image reconstructed from all frames. Once trained, ANNA-PALM demonstrated the ability to produce high-quality images of mitochondria, nuclear pore complexes, and microtubules60 at significantly higher speeds than conventional methods.

    ANNA-PALM has proven to be a valuable method for accelerating the acquisition of high-density superresolution images by several orders of magnitude. However, many other DL models are better suited to direct single-molecule localization. An important early model is Deep-STORM, developed to acquire superresolution images of microtubules with single or multiple overlapping PSFs. While non-DL algorithms exist for this purpose,148 they typically suffer from high computational costs and require sample-specific parameter tuning. Deep-STORM is an encoder–decoder CNN trained on simulated images consisting of PSFs placed at various positions on top of experimentally relevant background levels. These simulated images are then upsampled by a constant factor to form the corresponding superresolution targets, and the two versions of the same image are fed to Deep-STORM as input and output, respectively, during training. The model has been used to superresolve images of microtubules and quantum dots,56 and it inspired works on localizing high-density ultrasound scatterers149 and CRISPR-Cas protein–DNA binding events.150
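
    As a rough illustration of this simulation-based training strategy (a sketch of our own, not the Deep-STORM code; the function name render_frame and all parameter values are hypothetical), the following snippet generates one low-resolution frame with Gaussian-approximated PSFs and Poisson noise, together with its upsampled localization target.

    ```python
    import numpy as np

    def render_frame(size=64, upsample=8, n_emitters=10, sigma=1.2,
                     photons=1000, background=100, rng=None):
        """Simulate one low-resolution frame and its upsampled localization target.

        Emitters are placed at random subpixel positions, blurred with a Gaussian
        approximation of the PSF, and corrupted with background and Poisson noise.
        The target is a sparse image on an 'upsample'-times finer grid with a spike
        at each ground-truth emitter position.
        """
        rng = np.random.default_rng() if rng is None else rng
        positions = rng.uniform(0, size, size=(n_emitters, 2))   # (row, col) in pixels
        yy, xx = np.mgrid[0:size, 0:size]

        clean = np.zeros((size, size))
        for r, c in positions:
            clean += photons * np.exp(-((yy - r) ** 2 + (xx - c) ** 2) / (2 * sigma ** 2))
        noisy = rng.poisson(clean + background).astype(float)     # network input

        target = np.zeros((size * upsample, size * upsample))     # network label
        idx = np.clip((positions * upsample).astype(int), 0, size * upsample - 1)
        target[idx[:, 0], idx[:, 1]] = 1.0
        return noisy, target

    # One (input, label) pair; thousands of such pairs would form the training set.
    frame, label = render_frame()
    ```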

    Another impactful model is the aforementioned smNet, which similarly employs a (ResNet-inspired) CNN trained on simulated images for superresolution. The key distinctions are that it operates on 3D images and that the network outputs the 3D coordinates and orientations of PSF-convoluted emitters, from which the superresolution image can be reconstructed. smNet has been demonstrated to localize highly astigmatic single-molecule PSFs in experimental images with significantly higher quality than conventional Gaussian fitting methods.57 This approach is reminiscent of DeepLoco,151 developed around the same time, which trains NNs to reconstruct simulated emitters in 3D through well-defined mathematical models of astigmatic PSFs. Other related examples of 3D superresolution are those of Zhou et al., who used a dual-GAN framework to directly superresolve fluorescence microscopy images of mouse brains and bodies,152 and of Zhang et al.153 and Zelger et al.,154 who used a U-Net-based and a CNN-based approach, respectively, for superresolution in SMLM.

    For images with higher PSF density, Speiser et al. introduced the method known as DECODE.58 This architecture consists of a stack of two U-Nets, where the first U-Net processes a feature representation of a single frame, and the second U-Net processes feature representations of consecutive frames. The output of this method consists of several channels, each containing information in each pixel of the input image regarding (1) the probability of containing an emitter, (2) its brightness, (3) its 3D coordinates, (4) its background intensity, and (5) epistemic uncertainty of its localization and brightness. In the work of Speiser et al.,58 this architecture is trained on simulated PSFs with a loss function connected to all five aforementioned types of pixel-level information and has been successfully applied to microtubules in conditions of low light exposure and ultrahigh sample densities.
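
    The exact DECODE implementation is more involved, but the idea of a per-pixel, multi-channel prediction can be sketched with a minimal PyTorch output head; the channel layout, the number of input feature channels, and the class name LocalizationHead below are illustrative assumptions, and the two-U-Net feature extractor is omitted.

    ```python
    import torch
    import torch.nn as nn

    class LocalizationHead(nn.Module):
        """Per-pixel multi-channel output in the spirit of DECODE (illustrative sketch).

        For every pixel of the feature map, it predicts the probability that the pixel
        contains an emitter (1 channel), its brightness (1), sub-pixel 3D offsets (3),
        the local background (1), and uncertainties for brightness and coordinates (4).
        """

        def __init__(self, in_channels=48):
            super().__init__()
            self.conv = nn.Conv2d(in_channels, 10, kernel_size=1)

        def forward(self, features):
            out = self.conv(features)
            return {
                "p": torch.sigmoid(out[:, 0:1]),                 # emitter probability
                "photons": out[:, 1:2],                          # brightness
                "dxyz": torch.tanh(out[:, 2:5]),                 # sub-pixel 3D offsets
                "bg": out[:, 5:6],                               # background intensity
                "sigma": nn.functional.softplus(out[:, 6:10]),   # predicted uncertainties
            }
    ```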

    5.2.3 Superresolution by emitter localization

    Further, there are DL methods revolving around improving the precision of localizing underlying emitters in images. This is particularly relevant in SMLM imaging, where spatial resolution of the microscope is in practice directly correlated with the localization precision of single molecules.155 Since this localization is normally enabled by conventional heuristic-based fitting algorithms, using DL methods may enhance its performance. BGNet59 is one such architecture, designed to accurately identify the centroid of a PSF. It achieves this by training on (simulated) corrupted PSF images and outputting the background of the image. A trained BGNet can then be used to correct the background of a given image at inference time by subtracting its predicted background. Thus, one obtains background-corrected PSF images, which can be fed into conventional maximum likelihood estimation-fitting algorithms for superresolution, thereby enhancing the overall final output without the need to replace the entire analysis pipeline with an end-to-end DL-based system.
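
    The role such a network plays in an otherwise conventional pipeline can be summarized in a few lines; the helper below is a hypothetical sketch (bg_net stands in for a trained BGNet-like model), not code from the original work.

    ```python
    import torch

    def background_corrected_rois(rois, bg_net):
        """Subtract a learned background estimate from cropped PSF regions.

        `rois` is an (N, 1, H, W) tensor of candidate single-molecule crops and
        `bg_net` a trained background-prediction network. The corrected crops can
        then be passed to a conventional maximum-likelihood PSF-fitting routine
        without changing the rest of the analysis pipeline.
        """
        with torch.no_grad():
            background = bg_net(rois)
        return torch.clamp(rois - background, min=0.0)
    ```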

    5.2.4 Extracting additional information from PSFs

    Further, some DL methods aim to extract more information from the PSFs themselves.10,147,156–160 PSFs contain often unexploited information, such as the emission wavelengths as well as the lateral and axial location and identity of the emitters. Many deep architectures, including smNet, have been developed to exploit this. smNet outputs not only the 3D localization of emitters but also their orientation and wavefront distortions. Another notable architecture is DeepSTORM3D,10 which follows the original Deep-STORM and can identify the location of emitters with multiple overlapping PSFs in highly dense conditions over a large axial range by combining information from multiple 3D PSFs. DeepSTORM3D consists of two components: an image formation model and a decoder CNN. The former takes simulated 3D emitter positions as input and outputs the corresponding low-resolution CCD image, while the latter tries to recover the simulated emitter positions from that image. The difference between simulated and predicted positions is then used to optimize the phase mask in the Fourier plane and the recovery CNN parameters in tandem. This architecture was used by Nehme et al.10 to superresolve mitochondria and to enable the volumetric imaging of telomeres within cells. Further, there are architectures157,158 to classify the color channels of individual PSFs; Hershko et al.158 exploited the chromatic dependence of the PSF to train a CNN to determine the color of an emitter from a gray-scale camera image acquired with a standard fluorescence microscope.

    The progress in DL for superresolution has been astounding in the past half-decade, driven mainly by different forms of CNNs trained in GANs through supervised learning. More recently, there have been highly promising developments in few-shot,161 single-shot,162 and zero-shot163 learning, and even in untrained neural networks for image superresolution.164 Diffusion models have also shown significant promise in improving the fidelity and robustness of image superresolution methods.165–168 Diffusion models represent a recent approach to generative modeling. These models utilize probabilistic generative techniques in a two-step process: forward diffusion, where noise is iteratively added to images until they become pure Gaussian noise, and reverse diffusion, where images are iteratively denoised using a neural network. By conditioning the reverse diffusion process, diffusion models effectively handle image-to-image transformation tasks.21 While these models can produce higher-quality images compared to GANs, they come with significantly higher computational costs.
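
    To make the two-step process concrete, the following sketch implements the standard DDPM updates (a minimal illustration of our own, not tied to any specific work cited here); the linear noise schedule and the interface of the hypothetical noise-prediction network eps_model are assumptions.

    ```python
    import torch

    # Linear noise schedule (values are illustrative).
    T = 1000
    beta = torch.linspace(1e-4, 0.02, T)
    alpha = 1.0 - beta
    alpha_bar = torch.cumprod(alpha, dim=0)
    sigma = beta.sqrt()

    def forward_diffuse(x0, t):
        """Forward process: corrupt a clean image x0 to noise level t."""
        noise = torch.randn_like(x0)
        xt = alpha_bar[t].sqrt() * x0 + (1 - alpha_bar[t]).sqrt() * noise
        return xt, noise

    def reverse_step(xt, t, cond, eps_model):
        """One conditioned reverse (denoising) step.

        `eps_model` is a hypothetical network that predicts the added noise given
        the noisy image concatenated with a conditioning image (e.g., the
        low-resolution input in a superresolution task).
        """
        eps = eps_model(torch.cat([xt, cond], dim=1), t)
        mean = (xt - (1 - alpha[t]) / (1 - alpha_bar[t]).sqrt() * eps) / alpha[t].sqrt()
        if t > 0:
            mean = mean + sigma[t] * torch.randn_like(xt)
        return mean
    ```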

    Diffusion models have recently been used to generate superresolution images of microtubules,169 to reconstruct high-axial-resolution volumes from 3D microscopy data with previously unseen low axial resolution,170 and to outperform the state of the art in high-fidelity continuous image superresolution.171 These advancements suggest that the field will continue to progress significantly in the near future.

    6 Guidelines

    This section provides detailed recommendations for developing cross-modality transformation models in microscopy, with an emphasis on data quality, model architecture selection, and evaluation metrics. Researchers can use this as a framework to navigate the key decisions and challenges associated with their tasks.

    6.1 Data Quality, Augmentations, and Data Normalization

    The quality of data plays a critical role in determining the performance of DL models. Two major issues commonly affect model quality: insufficient data to capture the variability within the data set, and training data that fail to represent the conditions under which the model will be applied.

    To detect the issue of insufficient data, a standard approach is to set aside a validation set that the model never sees during training. If the model’s performance on this validation set is significantly worse than on the training set, it likely indicates a lack of sufficient training data to generalize effectively. A common practice is to allocate ∼20% to 30% of the data set as a validation set. However, it is important to ensure that the validation set is maximally decorrelated from the training set to avoid misleading results. For example, it is ideal to sample from different locations in the sample or even from entirely different experimental videos. A poor sampling strategy, such as selecting every fifth frame from the same video, would introduce a high correlation between the training and validation sets, resulting in overly optimistic performance estimates.
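
    A minimal sketch of such a decorrelated split, assuming each frame is tagged with the identifier of the video it came from (the array names and sizes below are hypothetical):

    ```python
    import numpy as np
    from sklearn.model_selection import GroupShuffleSplit

    # Hypothetical data set: 1000 frames drawn from 20 experimental videos.
    frame_indices = np.arange(1000)
    video_ids = np.repeat(np.arange(20), 50)

    # Splitting by video keeps all frames of one acquisition on the same side of
    # the split, so validation scores are not inflated by frame-to-frame correlation.
    splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
    train_idx, val_idx = next(splitter.split(frame_indices, groups=video_ids))
    ```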

    If detected, data scarcity can be mitigated by data augmentation techniques to synthetically increase the diversity of training data. Techniques such as geometric transformations (rotation, scaling), noise injection, and intensity variation can simulate a broader range of conditions. However, care must be taken to ensure that these transformations can be meaningfully applied across modalities. For instance, intensity variations in quantitative phase contrast imaging hold physical significance and altering them synthetically could distort biologically relevant information. Geometric translations, in most cases, provide limited benefit, as convolutional models are inherently translation-equivariant. However, if the chosen model breaks translation equivariance (such as vision transformers), they may be useful.
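
    A minimal sketch of a modality-aware augmentation, applying only geometric operations and leaving intensities untouched (the function name augment_pair is illustrative):

    ```python
    import numpy as np

    def augment_pair(inp, target, rng=None):
        """Apply the same random geometric augmentation to an input/target pair.

        Only rotations and flips are used; pixel intensities are left untouched,
        which matters for modalities such as quantitative phase imaging where
        intensity carries physical meaning.
        """
        rng = np.random.default_rng() if rng is None else rng
        k = int(rng.integers(0, 4))                 # rotation by a multiple of 90 deg
        inp, target = np.rot90(inp, k), np.rot90(target, k)
        if rng.random() < 0.5:                      # random horizontal flip
            inp, target = np.fliplr(inp), np.fliplr(target)
        return inp.copy(), target.copy()
    ```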

    Regularization techniques also play an essential role in improving model robustness, especially when data are scarce or noisy. Methods such as dropout and weight decay (L2 regularization) are commonly used to prevent overfitting by penalizing overly complex models that fit noise in the data rather than the underlying patterns. In scenarios where the model could easily memorize the training data, these techniques ensure that the model learns generalizable features rather than artifacts specific to the training set. Advanced regularization techniques, such as Bayesian regularization, can further improve robustness by incorporating uncertainty into the model’s predictions, making them especially useful in tasks where noisy or variable data are expected.
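
    In a PyTorch setting, dropout and weight decay amount to a few lines (a generic sketch; the toy model and hyperparameter values are arbitrary):

    ```python
    import torch.nn as nn
    import torch.optim as optim

    # A toy convolutional model with dropout; the weight_decay argument of the
    # optimizer adds an L2 penalty that discourages overly complex solutions.
    model = nn.Sequential(
        nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
        nn.Dropout2d(p=0.2),
        nn.Conv2d(32, 1, kernel_size=3, padding=1),
    )
    optimizer = optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-5)
    ```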

    Transfer learning offers another potential solution to address limited data availability. Pretrained models, especially those trained on large data sets from similar domains, can be fine-tuned to perform specific tasks in microscopy. By leveraging features learned from related tasks, transfer learning reduces the need for extensive training data while still allowing the model to generalize effectively. This approach not only speeds up training but also improves the model’s performance on smaller, domain-specific data sets. In some cases, transfer learning from pretrained models in related fields, such as medical imaging, can be more effective than starting from scratch, especially in scenarios where biological structures share visual characteristics across different imaging modalities.
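
    A typical fine-tuning setup freezes most of a pretrained backbone and retrains only the deepest layers plus a task-specific head; the sketch below uses an ImageNet-pretrained ResNet-18 from torchvision (version 0.13 or later is assumed, and the two-class head is hypothetical).

    ```python
    import torch.nn as nn
    from torchvision import models

    # Load a pretrained backbone and freeze everything except the last residual
    # block and the classification head, which are fine-tuned on the smaller
    # microscopy data set.
    backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
    for name, param in backbone.named_parameters():
        if not name.startswith(("layer4", "fc")):
            param.requires_grad = False
    backbone.fc = nn.Linear(backbone.fc.in_features, 2)   # hypothetical 2-class head
    ```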

    When the training data are nonrepresentative, this issue can be identified by observing a drop in model performance under real-world conditions, even though the performance on the validation set remains strong. This discrepancy often arises due to variations in optical systems, sample preparation protocols, or environmental factors that differ from those present in the training data. For instance, subtle differences in microscope settings, sample staining techniques, or even temperature can cause a shift in the data distribution, leading to poor generalization when the model is applied in different scenarios.

    The primary strategy to address issues of representativeness is through effective data normalization. Normalization techniques aim to reduce variability in the data by standardizing features across data sets, such as intensity scaling, contrast adjustment, or color normalization. This can help minimize discrepancies between data sets generated under different conditions. However, caution must be taken in modalities where quantitative relationships between intensity values are critical. In such cases, aggressive normalization may disrupt important mappings between intensity and biological features, potentially degrading the model’s ability to learn meaningful cross-modality transformations. Furthermore, domain adaptation techniques can be employed to align the distributions of training and application data, improving the robustness of the model across diverse conditions.
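
    A common, relatively gentle choice is percentile-based rescaling (a sketch; the percentile values are typical defaults, and for strictly quantitative modalities an even milder or no normalization may be preferable):

    ```python
    import numpy as np

    def percentile_normalize(img, low=1, high=99):
        """Rescale an image to roughly [0, 1] using robust percentiles.

        This reduces acquisition-to-acquisition variability while preserving the
        relative ordering of intensities within each image.
        """
        lo, hi = np.percentile(img, [low, high])
        return np.clip((img - lo) / (hi - lo + 1e-8), 0, 1)
    ```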

    6.2 Model Selection

    The choice of model architecture can have a significant impact on the performance of the model and depends on several key factors, such as data availability, target task, and specific requirements. Here, we give a general guideline for choosing an appropriate model.

    If your data are not aligned, a CycleGAN is recommended. Aligned data refer to cases where each image in one modality has a direct counterpart in the other modality, meaning that both images capture the same part of the sample under the same conditions, making it possible to map pixel-to-pixel relationships between the two. When such paired data are unavailable, CycleGAN is suitable because it learns to map between modalities without requiring this strict correspondence. However, the less constrained training procedure is also more likely to result in transformations that are less precise or less biologically meaningful, especially when exact quantitative relationships between modalities are required. Careful evaluation and additional constraints may be necessary to ensure that the model’s outputs are meaningful and accurate.
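
    The core of this approach is the cycle-consistency term, which can be sketched as follows (gen_ab and gen_ba are placeholder generators mapping modality A to B and back; the weight of 10 is a commonly used value, not a prescription):

    ```python
    import torch
    import torch.nn.functional as F

    def cycle_consistency_loss(real_a, real_b, gen_ab, gen_ba, weight=10.0):
        """Cycle-consistency term used when no paired images are available.

        Translating an image to the other modality and back should approximately
        recover the original, which constrains the unpaired training.
        """
        rec_a = gen_ba(gen_ab(real_a))
        rec_b = gen_ab(gen_ba(real_b))
        return weight * (F.l1_loss(rec_a, real_a) + F.l1_loss(rec_b, real_b))
    ```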

    Assuming your data are paired and aligned, we recommend starting with a conditional GAN architecture, specifically using a U-Net-like generator and a spatial discriminator. A well-established configuration for this setup is the pix2pix model. Conditional GANs are optimized to generate quantitative, physically meaningful images by leveraging paired data to learn a direct mapping between input and output modalities. The U-Net generator was originally developed for biomedical images and is one of the most proven and widely adopted architectures for tasks involving fine-scale structural details. Spatial discriminators, in turn, evaluate the realism of local regions of the image rather than assessing it as a whole, often resulting in more detailed and accurate outputs.
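
    The corresponding generator objective combines an adversarial term from the patch discriminator with an L1 term against the paired ground truth; the sketch below follows the commonly used pix2pix-style formulation (the weight of 100 on the L1 term is the usual default, and the function names are illustrative):

    ```python
    import torch
    import torch.nn.functional as F

    def generator_loss(discriminator, inp, fake, target, l1_weight=100.0):
        """pix2pix-style generator objective.

        The discriminator scores local patches of the (input, output) pair, and the
        L1 term keeps the generated image close to the paired ground truth.
        """
        pred_fake = discriminator(torch.cat([inp, fake], dim=1))
        adv = F.binary_cross_entropy_with_logits(pred_fake, torch.ones_like(pred_fake))
        return adv + l1_weight * F.l1_loss(fake, target)
    ```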

    However, depending on the specific requirements of your task, other architectures may be better suited. For example, if the goal is simply to enhance the contrast of specific substructures without requiring physical realism in the produced images, it may be more practical to forego generative models entirely. In such cases, direct supervised training of a U-Net can offer a simpler and more stable solution. The drawback of a purely supervised U-Net is that it may lack the ability to generate the nuanced, high-fidelity details that generative models, particularly GANs, are capable of producing. However, for applications where interpretability and stability are more important than photorealism, this trade-off can be worthwhile.

    On the other hand, if maximal photorealism is required, diffusion models are worth considering. These models have consistently been shown to produce highly realistic images, often outperforming GANs in terms of image quality and stability. Diffusion models work by iteratively denoising random noise to generate an image, which allows them to better capture fine-grained details and complex textures. However, diffusion models are typically much more computationally expensive, both to train and to evaluate, compared to GANs. Moreover, one should be careful not to conflate photorealism with better quantitative performance on downstream tasks.

    Another important consideration is the spatial distribution of information in the image. The U-Net generator is highly effective for analyzing local, position-invariant features, making it ideal for tasks where the meaning of a structure does not depend on its specific location within the image. However, for data where the spatial context is crucial, such as brain scans, attention-based models may be more suitable. Attention mechanisms allow the model to focus on specific regions of the image while considering their global relationships, enabling more context-aware analysis. This makes attention-based architectures a better choice for tasks that require understanding both local features and their larger spatial context.

    Finally, for more specialized applications, more complex, hybrid models may be necessary. For tasks where interpretability is a priority, incorporating latent-space constraints can improve both stability and clarity in the results. For example, using a Wasserstein GAN (WGAN) with a carefully designed loss function can provide more control over the training process and generate smoother, more interpretable transformations. In addition, hybrid models that combine multiple architectures, such as variational autoencoders (VAEs) with GANs, can provide both generative flexibility and the ability to impose structural constraints, improving the model’s capacity to generate accurate, interpretable results for complex tasks. In superresolution tasks, specialized models such as Deep-STORM or DECODE utilize domain knowledge to far outperform what standard cGANs can achieve.

      Table 1. Overview of the key parameters for common approaches of DL in microscopy across scales.

      | Method | Architecture (discriminator / generator) | Data sets | Learning type | Significant aspect |
      | --- | --- | --- | --- | --- |
      | Tissue | | | | |
      | Conditional GAN (cGAN)39 | CNN / U-Net | Paired labeled/annotated images | Supervised | Conditioning mechanism based on the additional input information |
      | CycleGAN35 | CNN / U-Net | Unpaired histology images | Unsupervised | Cycle-consistency loss enforces consistency and unsupervised translation |
      | StarGAN40 | PatchGAN / U-Net or ResNet | Unpaired images of tissue structures from multiple domains | Unsupervised | Unified architecture for a single model across multiple domains |
      | Cellular and subcellular structures | | | | |
      | Conditional GAN (cGAN)54 | CNN / U-Net | Paired images from fluorescence, confocal, and electron microscopy | Supervised | Conditioning mechanism based on the additional input information |
      | CycleGAN55 | CNN / U-Net | Unpaired images of bright-field, phase-contrast, fluorescence, and DIC microscopy | Unsupervised | Cycle-consistency loss enforces consistency and unsupervised translation |
      | Molecular structures | | | | |
      | Deep-STORM56 | Encoder–decoder CNN | Fluorescent images from techniques like STORM, PALM, or dSTORM | Supervised, labeled | Trained on simulated data to enhance resolution in SMLM, enabling superresolution imaging of molecular structures with improved accuracy |
      | smNet57 | ResNet-inspired CNN | Fluorescent images from techniques such as STORM, PALM, or SIM | Supervised, labeled | Trained on simulated PSFs with ground-truth 3D position labels; accurately localizes astigmatic single-molecule PSFs |
      | DECODE58 | Stack of two U-Nets | Images captured via optical diffraction tomography | Supervised, labeled | Stacked U-Nets process single and consecutive frames; improved accuracy and resolution under low-light conditions |
      | DeepSTORM3D10 | Encoder–decoder CNN | Simulated images of fluorescent emitters with known positions, including PSFs, background noise, and optical properties of the microscope | Supervised, labeled | Image formation model and decoder CNN pinpoint 3D emitter coordinates from simulated PSFs, enabling high-resolution volumetric molecular imaging |
      | BGNet59 | CNN | Fluorescent images from fluorescence microscopy or SMLM | Supervised, labeled | Identifies PSF centroids for background correction, improving single-molecule localization and overall imaging resolution |
      | ANNA-PALM60 | U-Net-based cGAN | Images of photoactivated single molecules captured by PALM or STORM | Supervised, labeled | Trained on experimental data to rapidly acquire high-density superresolution images, especially of mitochondria, nuclear pore complexes, and microtubules |

    6.3 Evaluation Metrics

    Evaluating the performance of a cross-modality transformation model can be challenging. Typical strategies measure the visual fidelity of the generated images, for example with pixel-wise or structural similarity metrics such as the mean absolute error, the peak signal-to-noise ratio (PSNR), or the structural similarity index measure (SSIM), but these measures may not fully correlate with the retention of biologically relevant information.
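
    Fidelity metrics of this kind are straightforward to compute; here is a minimal sketch using scikit-image on a hypothetical generated/ground-truth pair:

    ```python
    import numpy as np
    from skimage.metrics import structural_similarity, peak_signal_noise_ratio

    # Hypothetical ground-truth and generated images, both scaled to [0, 1].
    rng = np.random.default_rng(0)
    target = rng.random((256, 256))
    generated = np.clip(target + 0.05 * rng.standard_normal((256, 256)), 0, 1)

    ssim = structural_similarity(target, generated, data_range=1.0)
    psnr = peak_signal_noise_ratio(target, generated, data_range=1.0)
    print(f"SSIM: {ssim:.3f}, PSNR: {psnr:.1f} dB")
    ```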

    Another approach is to evaluate the biological relevance of the generated images by performing downstream analyses, such as cell counting or feature segmentation, and comparing the results to known quantities or those obtained from real experimental data. This method more directly assesses the retention of biologically meaningful information, but it introduces additional uncertainties. For example, inaccuracies in the downstream task, such as errors in the cell-counting algorithm, can confound the evaluation of the model’s performance, making it difficult to disentangle the model’s contributions from errors in postprocessing or analysis pipelines.

    6.4 Ethical Considerations

    Ethical considerations are essential for ensuring responsible and fair use of AI in the image analysis of biological samples. A key concern is protecting the privacy of patients and donors, as these samples often contain sensitive personal information. Handling biological samples must comply with data protection laws, which require informed consent from all parties involved and transparency about how the samples will be used. These laws vary by region, with the General Data Protection Regulation (GDPR) in the European Union,172 the Health Insurance Portability and Accountability Act (HIPAA) in the United States,173 the Data Protection Act 2018 in the United Kingdom,174 and the Personal Information Protection Law (PIPL) in China.175 International efforts, such as those led by the World Health Organization,176,177 along with national initiatives, such as India’s data protection frameworks,178 continue to evolve these regulations to keep pace with the rapid growth of AI technologies.179–181 In addition, a growing body of literature addresses these developments across various countries,182 stages of database handling,183 and specific fields.184

    These regulations address issues such as privacy breaches due to improper data use and cybersecurity threats.185 Furthermore, the responsibility for AI models used in diagnostics is a major concern, particularly in relation to bias mitigation. Biased data sets can result in inaccurate or discriminatory outcomes, especially in healthcare applications. Rigorous validation of AI models is critical to ensure accuracy, reproducibility, and the prevention of errors that may lead to misdiagnosis or flawed scientific conclusions. Transparency is also crucial, requiring clear documentation of model training, data sources, usage frameworks, and decision-making processes.186 Guidelines should promote not only sharing data sets but also the trained model weights, enabling researchers to independently validate and replicate findings. Lastly, accountability frameworks are necessary to ensure that researchers and developers are held responsible for the ethical use of AI, with proper oversight to enforce compliance with established guidelines.

    7 Perspectives

    Cross-modality transformations in biological microscopy present advanced techniques with important implications for biology, medicine, and materials science. Although these advances suggest exciting opportunities, where AI is set to increase diagnostic accuracy and improve workflow efficiency, the field still faces ongoing challenges that require innovative solutions before these methods can be implemented and brought to societal use. Figure 7 summarizes the ongoing developments and potential outcomes of combining AI with novel imaging modalities.

    Figure 7. Potential application perspectives of AI on biological sample imaging. Current developments found in the literature are contained in green boxes, while speculative prospects for the future are contained in yellow boxes. Starting from the top left, AI is extensively used in diagnostics such as virtual staining and other cross-modality transforms (image in the green panel adapted from Li et al.187). (a) In the future, this could lead to in vivo analysis of virtual biopsies instead of performing tissue extraction and preparation. (b) Cross-modality transforms could progressively shift the limits of the different scales outlined in this review. (c) As a consequence of broader data availability and in vivo imaging, AI models could also transform and predict sample evolution over time. (d) Further, cross-modality transformations could become more accessible and be routinely integrated into the sampling process, obtaining different modalities simultaneously from a single measurement. As shown in the bottom left panel, these ideas could be integrated into treatment prediction and design for biological samples, leading to personally tailored therapy. The potential of superresolution to better characterize molecular structures is nowadays used for protein-folding determination (image in the green panel adapted from Kumar et al.188), but incorporating information from other modalities could extend the reach of our knowledge (centered bottom panel). The bottom right panel highlights the application of AI to extract cellular information from 3D structures hosted in increasingly complex in vitro systems that better replicate the dynamic conditions of in vivo systems. AI is already implemented in assisted-surgery settings and equipment (right green panel adapted from Zhang et al.189), which could be greatly improved by including real-time contrast enhancement and segmentation from cross-modality transformations. Images in the yellow boxes were created with the assistance of Designer, using DALL·E 3 technology, and BioRender; they are for demonstration purposes only and do not hold scientific meaning beyond visualizing the ideas expressed.

    For example, the synthesis of high-resolution images from less invasive imaging methods such as MRI and CT scans provides tissue insights without the need for biopsies,37,49 illustrated in Fig. 7(a). This approach provides another important advantage: access to living-tissue data, which can have a significant impact on clinical studies. Similar strategies may also advance preclinical in vitro cell culture studies by creating physically relevant 3D environments, such as promoting cell growth into spheroids or inoculating cells on a microphysiology platform.190 Recent advances in this technology have enabled co-cultures of single or multiple cell types that more closely mimic in vivo conditions by implementing 3D architectures, fluid dynamics, and the gradients of materials contained in living tissues. In these environments, extracting probe-free information on cellular behavior, metabolic states, or migration is currently not feasible, but it may soon become achievable through AI and various imaging modalities, as presented in the bottom right panel.

    It is also likely that new imaging technologies and their implementations with AI will be able to integrate data across scales to provide an unprecedented perspective on diseases at the cellular and molecular levels, portrayed in Fig. 7(b). This approach would produce realistic models that combine visual, molecular, and genomic information. By integrating data from a variety of modalities, AI will enable a more comprehensive analysis of biological models, including the practical information required for the diagnosis of complex diseases where tissue morphology and function are crucial, exemplified in Fig. 7(d). In addition, longitudinal disease monitoring could become more sensitive, allowing clinicians to track tissue changes over time, shown in Fig. 7(c), and tailor treatment responses accordingly. This possibility extends to the field of transplantation biology, where the cellular integration, biocompatibility, and biodegradation of transplanted tissues, synthetic materials, or prostheses can be monitored over time. In summary, AI-powered tools will enable faster and more accurate diagnostics with reduced bias, thus minimizing the time required for human review. Over time, these advancements have the potential to make high-quality histological analysis more accessible, particularly in areas with limited pathology expertise, while also standardizing diagnostic protocols across institutions.

    Beyond diagnostics, AI is already being used to guide physicians during advanced surgeries, directing the surgeon’s movements189 with accuracy and precision. Future cross-modality transformations may facilitate real-time tissue mapping during surgery, giving surgeons immediate insights from various imaging techniques, represented in the top right panels. AI-powered methods are also being used in drug discovery research to identify new drug candidates and their potential folding structures,188 as demonstrated in the bottom central panels. Subsequent preclinical studies using cross-modality transformations to predict how tissues will respond to novel treatments could also be envisioned, as seen in the bottom left panel. In the future, AI may simulate the effects of drugs on a patient’s tissue, aiding in the development of personalized therapies. However, challenges remain, including the need for high-quality multimodal data sets to train AI systems and the development of interpretable AI models that biologists and clinicians can trust. In addition, integrating AI into clinical workflows requires careful consideration to ensure that these new technologies are used effectively and ethically by healthcare professionals. Despite these hurdles, the future of AI in cross-modality transformations in biology is promising, with profound implications for both biomedical research and clinical diagnostics.

    8 Conclusions

    The incorporation of DL techniques in biological microscopy represents a significant advancement, with the potential to enhance our understanding of histology, cellular structures, and molecular imaging. While these technologies offer promise, it is essential to acknowledge that the field is still evolving. The current state of these methods often involves grappling with their black-box nature, necessitating further refinement and investigation. Researchers continue to address challenges related to interpretability and the need for extensive developments to unlock the full transformative potential of DL in biological microscopy. Beyond technological advancement, these methods offer a paradigm shift by enabling imaging without reliance on chemical stains and fluorescent labels, simplifying experimental processes while preserving sample integrity. This constitutes a pivotal shift in microscopy toward noninvasive, label-free imaging of intact specimens. Cross-modality transformations have a significant impact not only in laboratory settings but also in clinical diagnostics and fundamental biological research, opening new avenues for discoveries and breakthroughs. Furthermore, these techniques are becoming more accessible and affordable, democratizing access to microscopic exploration and empowering researchers across disciplines.

    Jesús Manuel Antúnez Domínguez is a biophysicist with expertise in microscopic approaches to bacterial collective behaviour. Holding an industrial PhD in biophysics, he has experience in both academia and industry, notably at the Innovation Unit of Elvesys in Paris and the Biophysics Lab in the Department of Physics at the University of Gothenburg. His research interests span microfluidics, active matter, and advanced image analysis.

    Giovanni Volpe is a professor of physics at the University of Gothenburg, with expertise in artificial intelligence, complex systems, and active matter. He leads interdisciplinary research exploring AI-driven solutions to understand emergent behaviors in biological and synthetic systems. He is currently co-authoring the book Deep Learning Crash Course for No Starch Press. His work integrates experimental, theoretical, and computational approaches, contributing widely to scientific literature and innovation.

    Caroline B. Adiels is an associate professor of biophysics at the University of Gothenburg, with a focus on microfluidics, biology, and artificial intelligence applications. She leads an interdisciplinary research group dedicated to advancing single-cell analysis and communication studies using optics and microfluidics, which extends to organ-on-a-chip technology. Her work integrates AI-based image analysis software tailored for life sciences, creating a research portfolio that bridges the fields of physics and biology.

    Biographies of the other authors are not available.

    [1] D. Murphy, M. Davidson. Fundamentals of Light Microscopy and Electronic Imaging(2012).

    [2] K. Suvarna, C. Layton, J. Bancroft. Bancroft’s Theory and Practice of Histological Techniques E-Book(2012).

    [4] I. Johnson. Molecular probes handbook: a guide to fluorescent probes and labeling technologies(2010).

    [6] S. Shashkova, M. Leake. Single-molecule fluorescence microscopy review: shedding new light on old problems. Biosci. Rep., 37, BSR20170031(2017).

    [11] J. Ferreira, L. Groc. Surface Glutamate Receptor Nanoscale Organization with Super-Resolution Microscopy (dSTORM), 35-52(2024).

    [17] X. Cao et al. Deep learning based inter-modality image registration supervised by intra-modality similarity. Lecture Notes in Computer Science, 55-63(2018).

    [19] J. Johnson, A. Alahi, L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. Lecture Notes in Computer Science, 694-711(2016).

    [20] C. Ledig et al. Photo-realistic single image super-resolution using a generative adversarial network, 15640-15649(2017).

    [21] K. Zhang et al. Negative-aware attention framework for image-text matching(2022).

    [22] T. S. Gurina, L. Simms. Histology, staining(2023).

    [25] B. Bai et al. Label-free virtual HER2 immunohistochemical staining of breast tissue using deep learning. BME Frontiers, 2022, 9786242(2022).

    [29] X. Wang et al. Single-shot isotropic differential interference contrast microscopy. Nat. Commun., 14, 2063(2023).

    [34] E. Breznik et al. Cross-modality sub-image retrieval using contrastive multimodal image representations. Sci. Rep., 14, 18798(2024).

    [35] A. Lahiani et al. Virtualization of tissue staining in digital pathology using an unsupervised deep learning approach. Lecture Notes in Comput. Sci., 47-55(2019).

    [48] Y. Hong et al. Deep learning-based virtual cytokeratin staining of gastric carcinomas to measure tumor–stroma ratio. Sci. Rep., 11, 19255(2021).

    [50] J. Y. Zhu et al. Unpaired image-to-image translation using cycle-consistent adversarial networks, 2242-2251(2017).

    [55] X. Li et al. Unsupervised content-preserving transformation for optical microscopy. Light Sci. Appl., 10, 44(2021).

    [61] R. Sanyal, D. Kar, R. Sarkar. Carcinoma type classification from high-resolution breast microscopy images using a hybrid ensemble of deep convolutional features and gradient boosting trees classifiers. IEEE/ACM Trans. Comput. Biol. Bioinf., 19, 2124-2136(2021).

    [65] Q. Dou et al. Unsupervised cross-modality domain adaptation of convnets for biomedical image segmentations with adversarial loss, 691-697(2018).

    [72] M. Dohmen et al. Similarity metrics for MR image-to-image translation(2024).

    [76] M. Luella, B. Paul, A. Javad. Generative AI in medical imaging and its application in low dose computed tomography (CT) image denoising. Applications of Generative AI, 387-401(2024).

    [77] Y. Choi et al. StarGAN: unified generative adversarial networks for multi-domain image-to-image translation, 8789-8797(2018).

    [78] J. Vasiljević et al. HistostarGAN: a unified approach to stain normalisation, stain transfer and stain invariant segmentation in renal histopathology. Knowl.-Based Syst., 277, 110780(2023).

    [82] S. Banerji, S. Mitra. Deep learning in histopathology: a review. WIREs Data Min. Knowl. Discovery, 12, e1439(2022).

    [83] J. Xu et al. Deep Learning for Histopathological Image Analysis: Towards Computerized Diagnosis on Cancers, 73-95(2017).

    [89] P. A. Moghadam et al. A morphology focused diffusion probabilistic model for synthesis of histopathology images, 1999-2008(2023).

    [90] A. Greenspan, Y. Shen, J. Ke et al. StainDiff: transfer stain styles of histology images with denoising diffusion probabilistic models and self-ensemble. Med. Image Computing and Computer Assisted Intervention, 549-559(2023).

    [91] T. Kataria, B. Knudsen, S. Y. Elhabian. StainDiffuser: multitask dual diffusion model for virtual staining(2024).

    [92] S. Dubey, X. Xu et al. VIMS: virtual immunohistochemistry multiplex staining via text-to-stain diffusion trained on uniplex stains. Mach. Learn. Med. Imaging, 143-155(2024).

    [93] T. M. Abraham, R. Levenson. A comparison of diffusion models and CycleGANs for virtual staining of slide-free microscopy images, 1-6(2023).

    [108] C. L. Cooke et al. Physics-enhanced machine learning for virtual fluorescence microscopy, 3803-3813(2021).

    [119] E. Wolf. Progress in Optics(2008).

    [121] C. Ledig et al. Photo-realistic single image super-resolution using a generative adversarial network, 105-114(2016).

    [123] C. Chen et al. Synergistic image and feature adaptation: towards cross-modality domain adaptation for medical image segmentation, 865-872(2019).

    [124] D. Maleki, H. Tizhoosh. LILE: look in-depth before looking elsewhere: a dual attention network using transformers for cross-modal information retrieval in histopathology archives(2022).

    [126] R. Naseem et al. Cross modality guided liver image enhancement of CT using MRI, 46-51(2019).

    [127] C. Dong, D. Fleet et al. Learning a deep convolutional network for image super-resolution, 184-199(2014).

    [133] E. Kang, J. Min, J. C. Ye. A deep convolutional neural network using directional wavelets for low-dose X-ray CT reconstruction. Med. Phys., 44, e360-e375(2017).

    [134] C. Dong, J. M. Leibe, C. C. Loy, X. Tang et al. Accelerating the super-resolution convolutional neural network. Computer Vision–ECCV 2016, 391-407(2016).

    [139] M. Haris, G. Shakhnarovich, N. Ukita. Deep back-projection networks for super-resolution, 1664-1673(2018).

    [144] J.-Y. Lin, Y.-C. Chang, W. H. Hsu. Efficient and phase-aware video super-resolution for cardiac MRI, 66-76(2020).

    [145] Y. Huang, L. Shao, A. F. Frangi. Simultaneous super-resolution and cross-modality synthesis of 3D medical images using weakly-supervised joint convolutional sparse coding, 5787-5796(2017).

    [146] W.-S. Lai et al. Deep Laplacian pyramid networks for fast and accurate super-resolution, 624-632(2017).

    [163] J. Soh, S. Cho, N. Cho. Meta-transfer learning for zero-shot super-resolution, 3513-3522(2020).

    [165] H. Sahak et al. Denoising diffusion probabilistic models for robust image super-resolution in the wild(2023).

    [166] S. Gao et al. Implicit diffusion models for continuous super-resolution, 10021-10030(2023).

    [167] J. Ho et al. Cascaded diffusion models for high fidelity image generation. J. Mach. Learn. Res., 23, 1-33(2021).

    [168] R. Rombach et al. High-resolution image synthesis with latent diffusion models, 10674-10685(2021).

    [169] A. Saguy et al. This microtubule does not exist: super-resolution microscopy image generation by a diffusion model, 2400672(2024).

    [170] H. Greenspan, M. Pan et al. DiffuseIR: diffusion models for isotropic reconstruction of 3D microscopic images, 323-332(2023).

    [171] S. Gao et al. GDPR Requirements for Biobanking Activities Across Europe, 10021-10030(2023).

    [172] V. Colcelli et al. GDPR requirements for biobanking activities across Europe(2023).

    [173] Health insurance portability and accountability act of 1996 (HIPAA), 104-191(1996).

    [174] Data Protection Act 2018, United Kingdom(2018).

    [177] The protection of personal data in health information systems–principles and processes for public health(2020).

    [180] A. S. Pillai. Utilizing deep learning in medical image analysis for enhanced diagnostic accuracy and patient care: challenges, opportunities, and ethical implications. J. Deep Learn. Genomic Data Anal., 1, 1-17(2021).

    [182] N. Forgó et al. Big data, AI and health data: between national, european, and international legal frameworks. Legal Challenges in the New Digital Age, 358-394(2023).

    [186] K. Grünberg et al. Ethical and privacy aspects of using medical image data. Cloud-Based Benchmarking of Medical Image Analysis, 33-43(2017).

    Paper Information

    Category: Reviews

    Received: Jun. 18, 2024

    Accepted: Oct. 28, 2024

    Posted: Oct. 28, 2024

    Published Online: Nov. 29, 2024

    The Author Email: Adiels Caroline B. (caroline.adiels@physics.gu.se)

    DOI:10.1117/1.AP.6.6.064001
