Most optical information processors deal with text or image data, and it is very difficult to handle acoustic data experimentally. Therefore, optical advances that deal with acoustic data are highly desirable in this area. In particular, the development of a voice or acoustic-signal authentication technique using optical correlation can open a new line of research in the field of optical security and could also provide a tool for other applications where comparison of acoustic signals is required. Here, we report holographic acoustic-signal authentication that integrates holographic microphone recording with optical correlation to meet some of the above requirements. The reported method exploits the flexibility of 3D visualization of acoustic signals at sensitive locations and the parallelism offered by an optical correlator/processor. We demonstrate text-dependent optical voice correlation that can determine the authenticity of an acoustic signal by accepting or discarding it in accordance with the reference signal. The developed method has applications in security screening and industrial quality control.
1. INTRODUCTION
Due to the inherent parallelism and speed offered by optics for solving problems with intensive computational needs, systems utilizing optical technology have been considered a natural way of implementing several applications such as optical security, optical correlators, pattern recognition, and metrology [1–9]. In these applications, holography and optical processors have played an important role for decades due to their multi-dimensional and multiple-degrees-of-freedom capabilities [10–22]. Holography, including its optical and digital versions, has contributed enormously to diverse fields such as optical security, display, data storage, and multiplexing [10–14]. Recently, the optical security field has been accelerated by the inclusion of meta-surface and orbital angular momentum (OAM) holography, which provide better security as well as an alternative implementation route [15–19]. In these reported methods, OAM has been used to add a new degree of freedom to the security level [15]. In meta-surface-based security schemes, a tunability element has been added to the meta-surface, allowing multiple segments of encrypted information to be used for security purposes [16]. Furthermore, the benefits of OAM and meta-surfaces have been exploited simultaneously by integrating them [17]. In terms of authentication capability, it has been shown that optical diffractive neural networks using meta-surfaces in the visible spectral range can recognize objects and vortex patterns with high accuracy [18,19]. Despite the large body of reported work on optical security, from traditional to modern schemes, the suitability of these techniques has been limited to image or text data.
In recent years, some work has been done on the optical security of voice data, but no experimental implementation of voice security has been reported [20,21]. The encryption of voice has been proposed, where the voice is recorded by an optical microphone and the encryption has been demonstrated by computer simulation [20]. Due to the use of optical microphone recording, properties such as recording voice information in images, recording without disturbing the field of observation at sensitive places, and visualization at specific locations can be exploited for security and other applications. On the other hand, these schemes are not beneficial in terms of computational cost and require a large storage space, as a sequence of recorded holograms must be encrypted. Furthermore, a scheme has been proposed that uses voice as a security tool for anti-counterfeiting, where a voice watermark is generated from the recorded voice [21]. However, the watermarking scheme was demonstrated by simulation rather than experiment. Therefore, optical security approaches for voice data that can be realized experimentally remain challenging and are much needed to establish a solid foundation for optical voice security.
Among the various variants of optical image security, authentication, verification, or identification of images has been of great interest not only in security but also in applications where comparison of objects is required [7,8,18,19,22,23]. In this direction, the field of pattern recognition has played a major role in applications such as sorting products on industrial assembly lines, recognizing fingerprints, and recognizing enemy tanks and aircraft, all of which are of great importance. Machine learning techniques such as neural networks have also been used for pattern recognition [23,24]. Further, for low-light conditions, a quantum-enabled pattern recognition system has been reported on the ghost imaging platform [25]. Entangled probe states and photon counting can be used to achieve quantum advantages in pattern recognition [26]. On the other hand, some other approaches for defect detection and security screening have recently been reported [27,28]. The work reported above suggests that alternative methods suitable for measuring ultra-fine defects in quality control and for security screening applications are highly desirable.
The optical correlator has also played an important role in implementing pattern recognition, encryption, authentication, and verification [29–32]. The optical correlation technique basically uses the Fourier transform property of a lens to implement these tasks [29–32]. However, the patterns to be recognized are usually present in an image format. Thus, these systems have been applied to image or text data and have not been suitable for other data formats such as voice or acoustic signals. Therefore, any new approach that makes the optical correlation technique suitable for acoustic signals is much needed and could open a new direction of research. As we know, the voice is considered a unique identifier for authentication. Voice biometrics is usually referred to in several versions, including voice verification, speaker recognition, speaker identification, and speaker verification [33]. The purpose of speech recognition is to identify the words of the speaker. By utilizing the analogy of the above strategy, it has also been possible to identify defects in products or materials [9,34,35]. However, these approaches use the traditional way of microphone recording and identification through digital algorithms. Thus, the development of sound/acoustic-signal identification through an optical processor could be advantageous due to its inherent parallelism and speed, and it could provide a new tool.
Here, we propose and develop a holographic voice or acoustic-signal authenticator by integrating a holographic microphone recorder [36–39] with an optical correlation technique [29–32]. The developed system is able to verify text-dependent voice data in the optical domain, taking advantage of optical technologies. First, the voice information of different texts is recorded using the optical setup of parallel phase-shifting digital holography (PPSDH) [39]. Then, the voice information is appropriately converted into a two-dimensional (2D) image from the recorded holograms and then 2D voice image data of different texts can be verified/authenticated by employing an optical processor based on a fringe-adjusted joint transform correlator (JTC) [30,31]. The cross-correlation and autocorrelation peaks obtained by the optical processor can determine the authenticity of a voice of particular interest. The method can test acoustic signals by visualizing specific three-dimensional (3D) locations of sensitive places with parallel processing as it uses holographic recording and an optical correlator. In support of our proposed scheme, we present experimental results of voice data.
2. PROPOSED METHOD
A. Illustration of the Holographic Acoustic-Signal Authenticator
A schematic diagram illustrating the proposed holographic acoustic-signal authenticator is shown in Fig. 1(a). When target and reference acoustic signals are recorded and correlated using the system, it can discriminate between them, as can be seen from the output signal in the diagram. When the target acoustic signal is the same as the reference acoustic signal, the output is notified with a peak signal. On the other hand, there is no peak signal when the input acoustic signals are different. The authentication system uses holographic recording and optical correlation, and a brief description is given in Fig. 1(b). The first step involves holographic recording and visualization of the target and reference acoustic signals separately. After visualization of both signals in 2D format, optical correlation, which uses two successive Fourier transforms (FTs), is performed to give the correlation signals. From the correlation signals, the authenticity of the target acoustic signal can be verified.
Figure 1.Illustrations of (a) holographic acoustic-signal authenticator and (b) holographic recording, visualization, and optical correlation. The acoustic target signal to be compared with corresponding reference signal is given as input to the optical machine containing the holography and correlator that can discriminate between acoustic signals. In holographic recording and visualization, a sequence of holograms is recorded for an acoustic signal and then the corresponding phase sequence is retrieved. Individual pixel values from the phase sequence can be appropriately arranged in two-dimensional (2D) format. Finally, optical correlation, which can be implemented by Fourier transform (FT) and inverse FT, provides correlation output.
B. Optical Implementation of the Holographic Acoustic-Signal Authenticator
The developed optical setup of the holographic acoustic-signal authenticator is shown in Fig. 2, where laser light is used as the illumination source and is segregated into 0-deg and 90-deg polarized light beams using a half-wave plate and a polarizing beam splitter. The beam corresponding to 0 deg serves as the reference beam, and it is further divided by a beam splitter: one part is used as a light source for implementing the optical JTC and the other is kept as the reference beam. The beam corresponding to 90 deg serves as the object beam, where the target and reference acoustic signals can be spoken/placed. Polarizers placed in both beams correct the direction of polarization in the respective beams when their angles are adjusted appropriately. After the two beams are combined through a beam splitter, a quarter-wave plate converts the orthogonal linear polarizations, the 0-deg and 90-deg polarized light, into circular polarizations. Here, the reference wave and the object wave produce an interference fringe pattern in each linear polarization direction of the image sensor used. The holograms are captured by a high-speed polarization image sensor, which can capture four interference fringe patterns with different phase-shift amounts, one for each polarization direction, in a single hologram [21,38–40].
Figure 2.Proposed optical setup for implementing a holographic acoustic-signal authenticator. For voice recording, parallel phase-shifting digital holography is employed while optical correlation of voice signals is obtained by a joint transform correlator. NDF, neutral density filter; SF, spatial filter; CL, collimating lens; PBS, polarizing beam splitter; M, mirror; BS, beam splitter; SLM, spatial light modulator; L, lens; CCD, charge-coupled device camera; P, polarizer; QWP, quarter-wave plate; and HPC, high-speed polarization camera.
The reconstructed voice can be registered to the system where a computer is attached. A schematic diagram showing the detailed procedure of phase-image reconstruction and conversion of the voice phase data into a 2D image is shown in Figs. 3(a) and 3(b). The recorded hologram uses a pixel configuration to simultaneously record the intensity of four linear polarization components of the incident laser beam, corresponding to phase shifts of 0, π/2, π, and 3π/2, as shown in Fig. 3(a). The voice, as the target acoustic signal to be identified, is spoken at the designated place of the system. For recording and reconstruction of the target voice signal, the same procedure is followed. The principle of acoustic-signal recording is to measure the phase change caused by the sound pressure of a particular text voice over time [21,36–38]. The change in phase where light propagates occurs due to a change in the refractive index of the medium, which is directly caused by the changing sound pressure in that medium. Both signals can be correlated by the JTC, which is implemented using a liquid crystal spatial light modulator (SLM) and a lens (2-f system), as shown on the right side of Fig. 2. Here, the target and reference signals do not need to be visualized, as is done in most sound-field imaging. However, an appropriate conversion is required so that they can be used directly for authentication purposes. The detailed recording and reconstruction procedures are given in Sections 2.C and 2.D.
Figure 3.Schematic diagrams of (a) phase image reconstruction from recorded spatially multiplexed hologram and (b) generating 2D voice image from reconstructed phase images. PPSDH, parallel phase-shifting digital holography. The 2D conversion follows a column-wise arrangement.
C. Holographic Recording and Reconstruction of Acoustic Signals
First, a reference voice signal to be recorded is delivered near the object beam at a specific location, as shown in Fig. 2. The system is based on the technique of PPSDH. For recording acoustic-signal or text voice data, the principle of sound-field imaging by optical means is used, where a phase change occurs that is caused by the sound pressure of a particular text voice over time. The change in phase where light propagates occurs due to a change in the refractive index of the medium, which is directly caused by changes in the sound pressure in that medium [36–39]. From the reported literature on acoustic field imaging [36–39], the refractive index change can be derived as
$$n(t) = n_0 + \frac{(n_0 - 1)\,p(t)}{\gamma p_0}.\tag{1}$$
In Eq. (1), the refractive index at atmospheric pressure $p_0$ is represented by the symbol $n_0$. The other symbols, $p$ and $\gamma$, are the increased pressure due to the sound wave and the specific heat ratio, respectively.
The recording principle of the holographic setup can be understood by considering the Mach–Zehnder interferometer formed by the object and reference beams [36–39]. When the sound source is placed near the object beam of the setup, the index of refraction of the medium changes in the region where the acoustic signal propagates. Per Eq. (1), the refractive index is higher in regions where the sound pressure is higher, and vice versa. When the light wave passes through the area where the acoustic phenomena are happening, the phase of the light wave is changed. Mathematically, the complex amplitude carried by the object beam passing through the region where the acoustic signal propagates can be written as
$$u(x, y, t) = u_0(x, y)\exp\!\left[j\frac{2\pi}{\lambda}\int_0^{L} n(x, y, z, t)\,dz\right],\tag{2}$$
where $L$ is the interaction width of the object wave and the acoustic-signal field. In Eq. (2), the symbols $\lambda$ and $j$ are the wavelength and the imaginary unit, respectively. It can be seen from Eqs. (1) and (2) that the acoustic pressure can be measured by estimating the refractive-index distribution of the medium from the phase change of the object beam. For measuring the above phase change, a holographic technique called PPSDH can be used successfully for acoustic-signal or voice recording [39,40].
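As a numerical companion to Eqs. (1) and (2), the sketch below converts a measured phase change into sound pressure and back. This is a minimal illustration: the air constants are standard textbook values, and the interaction width is a hypothetical example, not a value taken from the experiment.

```python
import numpy as np

# Assumed constants (standard values for air; the interaction width L is
# a hypothetical example, not a value from the experiment).
n0 = 1.000293      # refractive index of air at atmospheric pressure
p0 = 101325.0      # atmospheric pressure [Pa]
gamma = 1.4        # specific heat ratio of air
lam = 532e-9       # laser wavelength [m]
L = 0.05           # interaction width of object beam and sound field [m]

def pressure_to_phase(p):
    """Eq. (1) followed by Eq. (2): phase change accumulated over width L."""
    dn = (n0 - 1.0) * p / (gamma * p0)   # refractive-index change
    return 2.0 * np.pi * dn * L / lam    # phase change [rad]

def phase_to_pressure(dphi):
    """Inverse relation: sound pressure from the measured phase change."""
    dn = dphi * lam / (2.0 * np.pi * L)
    return dn * gamma * p0 / (n0 - 1.0)  # acoustic pressure [Pa]
```

For these example values, a 1 Pa sound wave produces a phase change on the order of only 10⁻³ rad, which illustrates why a highly phase-sensitive method such as PPSDH is required.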
In the PPSDH recording scheme, the image sensor detects, pixel by pixel, the amount of phase shift caused by the phase shift of the reference wave. The pixel layout of the polarization camera's image sensor can simultaneously record the intensity of four linear polarization components of the incident laser beam. During reconstruction, image-sensor pixels with the same amount of phase shift are extracted from the recorded hologram, and these are relocated to the same addresses in another image. This is done for all phase-shifted pixels at each address where the original pixels were located, as shown in Fig. 3. Further, the values of the neighboring pixels are used to interpolate all the values of the blanked pixels [39,40]. These steps can be seen in Fig. 3(a). At the end of these procedures, phase-shifted holograms can be obtained in accordance with each phase shift. In our case, we used four-step PPSDH, and therefore four phase-shifted holograms, namely, $I(x, y; 0)$, $I(x, y; \pi/2)$, $I(x, y; \pi)$, and $I(x, y; 3\pi/2)$, are constructed from a single-frame recording by space-division multiplexing.
From the obtained four phase-shifted holograms, the twin-image- and zeroth-order-free hologram/complex amplitude distribution can be obtained as
$$u(x, y) = \frac{1}{4}\left\{\left[I(x, y; 0) - I(x, y; \pi)\right] + j\left[I(x, y; \pi/2) - I(x, y; 3\pi/2)\right]\right\}.\tag{3}$$
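The pixel extraction and four-step combination described above can be sketched digitally as follows. The 2×2 pixel layout of the phase shifts is an assumption for illustration (the actual layout depends on the polarization camera), and the interpolation of the blanked pixels is omitted for brevity.

```python
import numpy as np

def demosaic_ppsdh(hologram):
    """Split a spatially multiplexed PPSDH frame into four phase-shifted
    sub-holograms. The 2x2 layout (0 and pi/2 on the top row; 3pi/2 and
    pi on the bottom row) is an assumed example layout."""
    I0   = hologram[0::2, 0::2]   # phase shift 0
    I90  = hologram[0::2, 1::2]   # phase shift pi/2
    I270 = hologram[1::2, 0::2]   # phase shift 3pi/2
    I180 = hologram[1::2, 1::2]   # phase shift pi
    return I0, I90, I180, I270

def complex_amplitude(I0, I90, I180, I270):
    """Four-step phase-shifting combination: twin-image- and
    zeroth-order-free complex amplitude."""
    return 0.25 * ((I0 - I180) + 1j * (I90 - I270))
```

Applying `np.angle` to the returned array then yields the phase map used in the subsequent steps.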
The complex amplitude distribution at the object plane is derived by the Fresnel transformation:
$$u(x', y') = \frac{\exp(j2\pi d/\lambda)}{j\lambda d}\iint u(x, y)\exp\!\left\{\frac{j\pi}{\lambda d}\left[(x' - x)^2 + (y' - y)^2\right]\right\}dx\,dy,\tag{4}$$
where $d$ is the reconstruction distance.
The phase can be obtained by taking the angle/argument of the above complex amplitude distribution. The above procedure is explained for a single temporal hologram only. However, for any temporal phenomenon, as in this case, a sequence of holograms is recorded. The same procedure is applied to all the recorded holograms to obtain the sequence of phase images.
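The Fresnel transformation described above can be sketched numerically with a single FFT. The constant pre-factors of the Fresnel integral are dropped here, which is harmless in this context because only the temporal change of the phase at each pixel carries the acoustic information; all parameter values are illustrative.

```python
import numpy as np

def fresnel_propagate(u0, wavelength, pitch, d):
    """Single-FFT Fresnel transform of the hologram-plane field u0.
    pitch is the pixel pitch and d the reconstruction distance; constant
    pre-factors of the Fresnel integral are omitted."""
    N, M = u0.shape
    k = 2.0 * np.pi / wavelength
    y, x = np.indices((N, M), dtype=float)
    x = (x - M / 2) * pitch                      # centered coordinates [m]
    y = (y - N / 2) * pitch
    chirp = np.exp(1j * k / (2.0 * d) * (x**2 + y**2))
    return np.fft.fftshift(np.fft.fft2(np.fft.ifftshift(u0 * chirp)))
```

The phase sequence is then obtained by applying `np.angle` to each reconstructed frame of the hologram sequence.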
From the reconstructed phase images, single-pixel phase data are extracted from the same location in each phase image to construct a 1D array of phase values, i.e., the temporal phase profile of the acoustic data. The constructed phase-array values are then arranged in a 2D matrix/image in a controlled way (in vertical order); the details are shown in Fig. 3(b). The arrangement starts from the first vertical column and ends at the last column. It can also be done horizontally; however, the procedure should be the same for both the reference and target acoustic signals. When the phase array does not fill the required square size, non-zero values are allocated to the remaining pixels. Because the same voice data are always arranged in the same manner, the resulting 2D image preserves the voice properties and can be processed using an optical processor.
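The column-wise arrangement of Fig. 3(b) amounts to a Fortran-order reshape. A minimal sketch follows; the padding value is an arbitrary non-zero constant, since the text only specifies that non-zero values are allocated.

```python
import numpy as np

def phase_sequence_to_image(phase_array, fill_value=0.1):
    """Arrange a 1D temporal phase profile into a square 2D 'voice image',
    filling column by column as in Fig. 3(b). Remaining pixels are padded
    with a fixed non-zero value (the exact value is an assumption)."""
    n = len(phase_array)
    side = int(np.ceil(np.sqrt(n)))          # smallest square that fits
    flat = np.full(side * side, fill_value, dtype=float)
    flat[:n] = phase_array
    # order='F' fills the first column top to bottom, then the next column
    return flat.reshape((side, side), order='F')
```

Both the reference and target phase profiles must pass through this same function so that the 2D images remain directly comparable.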
D. Acoustic-Signal Authentication Using an Optical Correlator
The reconstructed acoustic signal, after conversion into a 2D format, is now suitable for processing by an optical processor. In our case, any available 2D optical image correlator can be used to implement an acoustic-signal authenticator. In general, two main processing architectures are used for optical correlators: the VanderLugt correlator and the JTC [7,8,29,30]. The JTC has some advantages, such as robustness to misalignment and implementability in a 2-f system using a single SLM [29,30]. Several variants of the JTC are available in the literature; in this study, we have utilized fringe-adjusted optical correlation on the JTC architecture.
In principle, the JTC is a two-step procedure. In the first step, the joint power spectrum (JPS) is obtained by performing an FT on the joint input. The joint input is constructed by placing the target and reference signals side by side in the same plane, as shown in Fig. 4. Considering $t(x - x_0, y)$ as the target image and $r(x + x_0, y)$ as the reference image, the JPS can be defined as
$$\mathrm{JPS}(u, v) = |T(u, v)|^2 + |R(u, v)|^2 + T(u, v)R^*(u, v)\exp(-j2ux_0) + T^*(u, v)R(u, v)\exp(j2ux_0),\tag{5}$$
where $T(u, v)$ and $R(u, v)$ are the FTs of the target and reference images, respectively, and $*$ denotes the complex conjugate.
Figure 4.Schematic diagram of fringe adjusted joint transform correlation. FT, Fourier transform; IFT, inverse FT; and JPS, joint power spectrum.
In the second step, an inverse FT is performed to get the correlation output:
$$c(x, y) = \mathcal{F}^{-1}\{\mathrm{JPS}(u, v)\},\tag{6}$$
which contains the autocorrelation (zeroth-order) terms at the origin and the cross-correlation terms centered at $(\pm 2x_0, 0)$.
By adjusting the fringes of the JPS, improved discrimination through the correlation signal can be achieved. This technique is called fringe-adjusted JTC, and it is found to yield significantly better correlation output than its classical counterparts [31,32]. In the fringe-adjusted JTC, the JPS is multiplied by an intensity filter known as the fringe-adjusted filter, which is mathematically expressed as [32]
$$H_{\mathrm{faf}}(u, v) = \frac{B(u, v)}{A(u, v) + |R(u, v)|^2},\tag{7}$$
where $A(u, v)$ and $B(u, v)$ are either constants or functions. Thus, the JPS, after adjusting its fringes, is expressed as
$$\mathrm{JPS}_{\mathrm{faf}}(u, v) = H_{\mathrm{faf}}(u, v)\times\mathrm{JPS}(u, v).\tag{8}$$
When $B(u, v) = 1$ and $A(u, v) \ll |R(u, v)|^2$, the fringe-adjusted filter approaches a perfect real-valued inverse filter and, for a matched target with $T(u, v) = R(u, v)$, the fringe-adjusted JPS can be written as
$$\mathrm{JPS}_{\mathrm{faf}}(u, v) \approx 2 + \exp(-j2ux_0) + \exp(j2ux_0).\tag{9}$$
The inverse FT then gives the correlation output, which contains two delta functions located at $(\pm 2x_0, 0)$ along with a zeroth-order term. Obtaining the JPS can be implemented experimentally as well as numerically; both ways give similar results, and the use of a numerically calculated JPS avoids extra components. In our case, the numerically calculated JPS is displayed on the SLM, and its FT is implemented by an additional lens attached to the main experimental setup.
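The fringe-adjusted JTC described above can be checked numerically. The sketch below forms the joint input, computes the JPS, applies the fringe-adjusted filter, and inverse transforms to the correlation plane; the constants A and B and the image sizes are illustrative choices, not values from the experiment.

```python
import numpy as np

def fringe_adjusted_jtc(target, reference, A=1e-6, B=1.0):
    """Digital sketch of the fringe-adjusted JTC: the reference and target
    are placed side by side, the JPS is formed, multiplied by the
    fringe-adjusted filter B / (A + |R|^2), and inverse transformed."""
    h, w = target.shape
    joint = np.zeros((h, 3 * w))
    joint[:, :w] = reference                 # reference on the left
    joint[:, 2 * w:] = target                # target on the right
    jps = np.abs(np.fft.fft2(joint)) ** 2    # joint power spectrum
    R = np.fft.fft2(reference, s=joint.shape)
    faf = B / (A + np.abs(R) ** 2)           # fringe-adjusted filter
    corr = np.abs(np.fft.ifft2(faf * jps))   # correlation plane
    return np.fft.fftshift(corr)             # zeroth order at the center
```

For identical target and reference images, sharp peaks appear at the two positions displaced from the center by the input separation, alongside the central zeroth-order term; for different images, these peaks are strongly suppressed.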
In our study, the computationally obtained JPS of the joint target and reference voice signals is displayed on the SLM, and its FT, performed by the lens, gives the correlation signal. The correlation signals are recorded by an image sensor, as shown on the right side of Fig. 2. From the recorded correlation signal, the authenticity of a particular target acoustic signal can be checked by verifying the cross-correlation and autocorrelation. The technologies used in this proposal can work for authentication of audible as well as ultrasonic acoustic signals, because PPSDH can record acoustic signals in these ranges and the correlator has no such limitation, as it uses the converted acoustic signals in image format. Moreover, the proposed method records the acoustic signals by digital holography, so specific 3D locations of the two signals can be compared, unlike with a point detector or a digital microphone.
3. EXPERIMENTAL RESULTS
An experiment has been carried out to check the validity of the proposed voice verification scheme. Text voice samples have been recorded using the proposed setup by following the reconstruction algorithm based on the PPSDH technique and registered to the system. The correlation of the voice data has been performed by the joint transform correlation approach. A laser (JUNO, KYOCERA SOC) with a wavelength of 532 nm is used as the light source. For recording voice data, a high-speed polarization-imaging camera (FASTCAM SA5-P, Photron) is used, while a normal CCD image sensor is used for performing the optical correlation. Holograms of the acoustic signal are recorded at a speed of 7 kfps (fps, frames per second). Here, samples of the voice texts “Thank you,” “Arigato,” and “Hello” were recorded to test the authentication capability of the system. The normal image sensor, with a pixel pitch of 5.2 μm, is used to record the cross-correlation peaks. A Holoeye Pluto SLM is used to display the JPS obtained from the joint 2D voice images. With a pixel pitch of 8.0 μm, this SLM can modulate the light beam to produce the desired images. The SLM operates along the horizontal (x) direction of the laboratory frame. Usually, SLMs can be controlled by 8-bit digital signals transmitted from a computer via a digital visual interface, so the SLM device can be used easily by driving it as a secondary monitor of the computer. The 8-bit digital signals from the computer are converted to analog signals by 12-bit digital-to-analog converters in the controller. The nonlinear response can be compensated by using a look-up table in the controller of the SLM. In this way, the 2D image can be displayed using the phase modulation of the device, and the modulation is linear up to the 8-bit input signal level.
For reconstructing the temporal phase profile and converting it to 2D format from the sequence of recorded holograms, MATLAB 2023 is used. For auralizing the voice data from the temporal phase profile, the “audiowrite” command is used.
At first, three text voices are recorded, and their temporal phase profiles are obtained using the recording and reconstruction procedures explained in Figs. 2 and 3. The phase profiles of the three recorded acoustic texts, “Thank you,” “Hello,” and “Arigato,” are shown in Figs. 5(a)–5(c), respectively. These data are obtained by picking a single pixel value, at the same pixel address, from all reconstructed phase images, as can be seen in Fig. 3. The obtained values represent the temporal behavior of the acoustic signal in terms of pressure change and carry the information of the acoustic signal. The acoustic signals can be heard after the use of “audiowrite” in MATLAB. Although visualization of the acoustic signal is not required for authentication, the visualized sounds of “Thank you,” “Hello,” and “Arigato” are provided in Visualization 1, Visualization 2, and Visualization 3, respectively, to confirm the appropriate recording. From these phase arrays, 2D images are obtained by following the steps shown in Fig. 3(b). The constructed 2D images corresponding to the three voice texts are shown in Figs. 5(d)–5(f), respectively. From these results, it can be seen that voice data of different texts can be converted into 2D images that are suitable for processing with the optical processor.
Figure 5.Results of optical voice authentication. (a)–(c) are temporal phase profiles of experimentally recorded three voices corresponding to texts “Thank you,” “Hello,” and “Arigato,” respectively. (d)–(f) are corresponding 2D voice images.
Further, we used the fringe-adjusted JTC to obtain the correlation signals. At first, the fringe-adjusted JPSs of the acoustic data are obtained computationally, as shown in Fig. 6. Figures 6(a)–6(c) show the JPSs corresponding to the text voices “Thank you,” “Hello,” and “Arigato,” respectively, when the same voice is used with itself as the joint input. On the other hand, the JPSs when different text voices are used as the joint input are also obtained. The JPS when the text voices “Thank you” and “Hello” are taken as the joint input is shown in Fig. 6(d). The JPSs when the text voices “Thank you” and “Arigato,” and “Hello” and “Arigato,” are taken as the joint inputs are shown in Figs. 6(e) and 6(f), respectively.
Figure 6.Results of JPS. (a)–(c) The obtained JPS corresponding to the text voices of “Thank you,” “Hello,” and “Arigato,” respectively, when the same voices are used with itself as a joint input; (d) the obtained JPS when the text voices “Thank you” and “Hello” are joint input; (e) the obtained JPS when the text voices “Thank you” and “Arigato” are joint input; and (f) the obtained JPS when the text voices “Hello” and “Arigato” are joint input.
This computationally obtained JPS is then displayed on the SLM, and after performing the FT, the correlation signals can be recorded. The proposed scheme uses the parallelism offered by optical processors. The present technique is opto-digital, i.e., it is implemented partially in the digital and partially in the optical domain. Here, we suggest computing the JPS digitally and then implementing the second step experimentally, which reduces the processing time by approximately half; however, both steps can be implemented experimentally. The recorded autocorrelation signals shown in Figs. 7(a)–7(c) correspond to the text voices “Thank you,” “Hello,” and “Arigato,” respectively. The signal terms can be seen on the upper and lower sides of the image, as happens in the JTC. The recorded cross-correlation signals corresponding to “Thank you” and “Hello,” “Thank you” and “Arigato,” and “Hello” and “Arigato” are shown in Figs. 7(d)–7(f), respectively. Here, no signal terms appear, owing to the discriminative power of the correlator used. Furthermore, 3D correlation plots are obtained computationally from the above experimentally recorded 2D correlation signals. The 3D autocorrelation plots corresponding to “Thank you,” “Hello,” and “Arigato” are shown in Figs. 7(g)–7(i), respectively. Here, signal terms in the form of peaks are clearly visible, confirming the match of the acoustic signals. On the other hand, the 3D plots of the cross-correlations corresponding to “Thank you” and “Hello,” “Thank you” and “Arigato,” and “Hello” and “Arigato” are shown in Figs. 7(j)–7(l), respectively. In this case, the peaks disappear, confirming the discriminability of the acoustic signals. From the results presented, it can be claimed that the proposed method can successfully authenticate acoustic signals.
Figure 7.Results of optical voice authentication obtained by fringe-adjusted JTC. (a)–(c) 2D autocorrelation signals, (d)–(f) 2D cross-correlation signals, (g)–(i) 3D plots corresponding to (a)–(c), and (j)–(l) 3D plots corresponding to (d)–(f). In the case of autocorrelation, when the target and reference acoustic signals are the same, the peak signal can be seen, whereas in the case of cross-correlation, when they are different, the peak signal does not appear.
Further, we present results when different kinds of attacks, such as Gaussian noise, speckle noise, and occlusion attacks, are applied to the system to check the discrimination capability of the proposed method. The results of the Gaussian and speckle noise attacks are shown in Figs. 8(a)–8(h). Figures 8(a) and 8(b) show the 2D autocorrelation signals when Gaussian noise of 0.01 variance with means of 0.2 and 0, respectively, is added to the voice image, and the corresponding simulated 3D correlation plots are shown in Figs. 8(c) and 8(d). Figures 8(e) and 8(f) show the 2D autocorrelation signals when speckle noise of variance 0.25 and 0, respectively, is added to the voice image, and the corresponding simulated 3D correlation plots are shown in Figs. 8(g) and 8(h). From the correlation results, it can be confirmed that the method can tolerate Gaussian noise up to a variance of 0.01 and mean 0, and a speckle noise attack up to a variance of 0.
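A purely digital stand-in for this robustness test can be sketched as follows. The noise models follow the MATLAB imnoise parameter conventions (an assumption), and the normalized peak metric is only a digital proxy for the optically recorded correlation peak, not the experimental procedure itself.

```python
import numpy as np

rng = np.random.default_rng(7)

def add_gaussian(img, mean, var):
    """Additive Gaussian noise with the given mean and variance
    (imnoise-style parameters; an assumed convention)."""
    return img + mean + np.sqrt(var) * rng.standard_normal(img.shape)

def add_speckle(img, var):
    """Multiplicative speckle noise: img + img * n with n ~ N(0, var)."""
    return img + img * np.sqrt(var) * rng.standard_normal(img.shape)

def peak_metric(a, b):
    """Normalized cross-correlation peak: equals 1 for identical images
    and drops toward 0 for unrelated ones."""
    fa = np.fft.fft2(a - a.mean())
    fb = np.fft.fft2(b - b.mean())
    c = np.abs(np.fft.ifft2(fa * np.conj(fb)))
    return c.max() / np.sqrt((np.abs(fa) ** 2).mean() * (np.abs(fb) ** 2).mean())
```

Thresholding this metric on noisy and occluded voice images reproduces, in simulation, the kind of tolerance study reported in Figs. 8 and 9.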
Figure 8.Results of Gaussian and speckle noise attack. (a) 2D autocorrelation signals when Gaussian noise of 0.01 variance with mean of 0.2 is added to the voice image, (b) 2D autocorrelation signals when Gaussian noise of 0.01 variance with mean of 0 is added to the voice image, (c) simulated 3D plot corresponding to (a), (d) simulated 3D plot corresponding to (b), (e) 2D autocorrelation signals when speckle noise of variance 0.25 is added to the voice image, (f) 2D autocorrelation signals when speckle noise of variance 0 is added to the voice image, (g) simulated 3D plot corresponding to (e), and (h) simulated 3D plot corresponding to (f).
The results of the occlusion attack are shown in Fig. 9. Figure 9(a) shows the 2D autocorrelation signal when 5% of the voice image is occluded, and the corresponding simulated 3D plot is shown in Fig. 9(b). Occlusion of the JPS is also checked. Figures 9(c) and 9(d) show the 2D autocorrelation signals when 25% and 40% of the JPS are occluded, and the corresponding correlation peaks are shown in Figs. 9(e) and 9(f), respectively. From the correlation results, it can be confirmed that the method can tolerate occlusion attacks up to 5% when the voice image is occluded, while it can tolerate up to 40% when the JPS is occluded.
Figure 9. Results of occlusion attack. (a) 2D autocorrelation signal when 5% of the voice image is occluded, (b) simulated 3D plot corresponding to (a), (c) 2D autocorrelation signal when 25% of the JPS is occluded, (d) 2D autocorrelation signal when 40% of the JPS is occluded, (e) simulated 3D plot corresponding to (c), and (f) simulated 3D plot corresponding to (d).
The present work reports a text-dependent acoustic-signal authenticator with applications in security screening. Voice recognition systems are usually either text-dependent or text-independent. A text-independent system needs a unique voiceprint of the speaker for comparison; it does not require any pre-defined passphrases, and the subject of the analysis is conversational speech. A text-dependent system needs a dictionary of words to identify the speaker’s words, and the system is trained to recognize predetermined voice passphrases from the speaker. In this case, the text is often referred to as a “voice passphrase,” which consists of a few words that are about a second long [33]. In active enrollment, the person must knowingly perform the enrollment and speak the required word. Therefore, the reported method is well suited for text-dependent voice authentication. In the future, tone- and texture-based approaches may be combined with this system to authenticate a specific person’s voice. Here, encryption is not used along with the proposed acoustic-signal correlation, but the security level of authentication can be enhanced because the data are suitable for processing by an optical processor.
The text-dependent acoustic-signal authenticator has several other applications where comparison of acoustic signals is required, such as in industrial quality control to measure micro-defects [9,34,35]. Most of the optical image correlators reported in the literature could be made suitable for voice data by following the proposed approach. This study invites a detailed investigation of the discrimination capability of voice or acoustic signals under different optical correlator architectures, which will promote practical applications of voice authentication and of acoustic-signal comparison in industrial quality control, such as micro-defect measurement. Recently reported digital-holography-based methods can equally be applied to exploit their advantages [41,42].
In this study, the reference and target voice/acoustic signals are considered to be spoken/delivered at almost the same speed. However, the method can also perform the correlation when the target text is spoken at a speed different from that of the reference acoustic signal. In that case, the 2D image corresponding to the target acoustic signal can be resized (compressed or expanded) to match the size of the reference voice image before optical correlation. In the case of security screening, such a slow-to-fast (or vice versa) conversion is required, while for other applications in industrial quality control it is not. A detailed analysis will be considered in future work.
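The resizing step described above can be sketched as follows, assuming the time axis maps to image rows; the nearest-neighbor helper and the image sizes are hypothetical stand-ins for a proper interpolator and the actual voice images.

```python
import numpy as np

def resize_nn(img, out_shape):
    """Nearest-neighbor resize of a 2D image to out_shape
    (stand-in for a proper interpolator)."""
    rows = (np.arange(out_shape[0]) * img.shape[0]) // out_shape[0]
    cols = (np.arange(out_shape[1]) * img.shape[1]) // out_shape[1]
    return img[np.ix_(rows, cols)]

# A target phrase spoken faster yields fewer samples, hence a smaller
# 2D image; rescale it to the reference size before correlation.
ref = np.zeros((128, 128))           # stand-in reference voice image
fast_target = np.zeros((96, 128))    # same phrase, spoken faster
aligned = resize_nn(fast_target, ref.shape)
print(aligned.shape)  # (128, 128)
```

After this alignment, the two images have identical dimensions and can be fed directly into the joint input plane of the correlator.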
4. CONCLUSIONS
In this work, we have proposed and successfully implemented an optical acoustic-signal authenticator based on PPSDH and the JTC. Using the proposed system, text-dependent acoustic-signal verification has been demonstrated. For validation, the voice information of different texts is recorded by PPSDH. Subsequently, the voice/acoustic information is reconstructed, and the obtained temporal profiles of the different texts are converted into 2D images that retain the full data. The correlation is then performed by optical joint transform correlation for authentication of the desired acoustic signals. The obtained cross-correlation and autocorrelation peaks can establish the authenticity of the acoustic signal after processing of the JPS obtained through the optical processor. In support of the proposed scheme, we present optical experimental results of holographic voice verification for different text-dependent voice samples. Results of different noise and occlusion attacks have been presented, which confirm the discrimination capability of the authentication. Owing to the use of digital holography, signals at specific 3D locations can be compared, unlike with a point detector or a digital microphone. To the best of our knowledge, this is the first exploration of holographic voice authentication using an optical processor such as the JTC. We hope that the developed system of optical voice correlation through optical image correlation can open a new research direction in optical security and will also provide a tool for other applications where comparison of voice/acoustic signals is needed.
Acknowledgment
This study was partially supported by the Japan Society for the Promotion of Science (JSPS) KAKENHI Grant-in-Aid for Challenging Exploratory Research 22K18809, Research Activity Start-up 23K1912, and Grant-in-Aid for Transformative Research Areas 20H05887 and 20H05886. N. K. Nishchal wishes to acknowledge funding from the Science and Engineering Research Board, Govt. of India, through grant No. CRG/2021/001763. A. Shikder acknowledges financial support through the Prime Minister's Research Fellowship, Govt. of India (No. 2701807).
Authors’ Contribution. Y. A. and S. K. R. conceived the idea. S. K. R., A. S., and R. T. constructed the optical setup and carried out the experiments. S. K. R., N. K. N., O. M., and Y. A. discussed its implementation and results. Y. A. and N. K. N. jointly supervised the project. All authors contributed to the preparation of the paper.