Chinese Optics Letters, Volume 23, Issue 9, 091101 (2025)

Image-free cross-species pose estimation via an ultra-low sampling rate single-pixel camera

Xin Wu1, Cheng Zhou1,*, Binyu Li2, Jipeng Huang1,**, Yanli Meng1, Lijun Song3,***, and Shensheng Han4
Author Affiliations
  • 1School of Physics, Northeast Normal University, Changchun 130024, China
  • 2Beijing Institute of Space Mechanics and Electricity, Beijing 100094, China
  • 3Changchun Institute of Technology, Changchun 130012, China
  • 4Shanghai Institute of Optics and Fine Mechanics, Chinese Academy of Sciences, Shanghai 201800, China

Cross-species pose estimation plays a vital role in studying neural mechanisms and behavioral patterns while serving as a fundamental tool for behavior monitoring and prediction. However, conventional image-based approaches face substantial limitations, including excessive storage requirements, high transmission bandwidth demands, and massive computational costs. To address these challenges, we introduce an image-free pose estimation framework based on single-pixel cameras operating at ultra-low sampling rates (6.260 × 10⁻⁴). Our method eliminates the need for explicit or implicit image reconstruction, instead directly extracting pose information from highly compressed single-pixel measurements. It dramatically reduces data storage and transmission requirements while maintaining accuracy comparable to that of traditional image-based methods. Our solution provides a practical approach for real-world applications where bandwidth and computational resources are constrained.


    1. Introduction

Recent advances in intelligent perceptual models, powered by optical imaging, deep neural networks[1,2], and enhanced computing capabilities[3], have significantly advanced cross-species pose estimation techniques[4-6]. These developments have not only accelerated computational neurobehavior research[7] but also enabled diverse applications including remote monitoring[8], behavioral analysis[9], precision farm management[10,11], wildlife conservation[12], and cinematic special effects[13]. However, current mobile artificial intelligence (AI) implementations face significant challenges in meeting the computational and energy demands required for real-time pose estimation. Consequently, most practical applications still depend on cloud-based processing[14-16]. A mobile-based solution capable of fast pose estimation with minimal data requirements would overcome bandwidth and communication rate limitations, enabling truly autonomous operation.

The field of cross-species pose estimation has witnessed remarkable progress as image-based techniques have advanced from 2D to 3D and real-time applications[17-21]. Modern approaches leverage pre-trained models trained on large image datasets[22] and employ transfer learning[2] to enable pose analysis across diverse species, thereby facilitating studies of neural mechanisms and behavioral patterns. As illustrated in Fig. 1(a), standard image-based models typically comprise a feature extractor (encoder) and a pose parser (decoder)[23]. Alternative image-free approaches utilizing motion sensors, acoustic sensors, infrared[24], LiDAR[25], radio frequency (RF) systems[26], and Wi-Fi[27] have emerged as viable solutions. However, conventional array camera systems remain constrained by high bandwidth consumption, extensive storage requirements, and computationally intensive data processing, resulting in excessive power demands and operational costs. Recent breakthroughs in single-pixel imaging[28-39] have opened new possibilities for image-free intelligent sensing, even in complex environments.


Figure 1. Overview of plane-array-camera-based and SPC-based pose estimation models. (a) The architecture of the plane-array-camera-based (image-based) pose estimation model. (b) The architecture of the SPC-based (image-free) pose estimation model that performs reconstruction before pose estimation; the two-stage architecture of the image-free model with implicit reconstruction; and the architecture of our proposed image-free and (implicit) reconstruction-free model.

In contrast to conventional imaging, a single-pixel camera (SPC) employs a single-pixel detector to capture spatially encoded signals, enabling object feature extraction through computational methods including deep learning and compressed sensing[40]. This approach exhibits significant advantages in biomedical imaging, remote sensing, security surveillance, and privacy protection[41-43], while simultaneously addressing the bandwidth and storage limitations of traditional imaging. These characteristics make SPCs particularly suitable for cross-species pose estimation and behavioral analysis. Emerging research on SPC-based pose estimation has developed along two distinct pathways [Fig. 1(b)]. The first approach reconstructs images from SPC measurements before performing pose estimation[34], while the second integrates image reconstruction and pose estimation within a single deep neural network[36,44]. Although these methods successfully mitigate the bandwidth, storage, and privacy concerns inherent to conventional approaches, they remain computationally intensive and maintain significant processing complexity.

We present SPCPose, an image-free cross-species pose estimation framework that leverages an SPC operating at ultra-low sampling rates [Fig. 1(b), third row]. Our computational architecture departs from traditional approaches by directly processing minimal single-pixel detection values to achieve pose estimation, completely eliminating the need for any form of image reconstruction. The system operates in three stages: first, the SPC efficiently compresses and encodes object features (including shape and texture) into compact detection values. Subsequently, SPCPose employs a vision transformer to extract features directly from these compressed measurements. Finally, an orthogonal decomposition feature map decoder parses the object pose. This architecture represents a paradigm shift from conventional pose estimation technologies by directly deriving pose characteristics from sparse single-pixel data without intermediate reconstruction layers. Experimental validation confirms the method's accuracy in pose decoding and demonstrates remarkable robustness across diverse species and challenging real-world scenarios. Our work makes three fundamental contributions. (1) High-accuracy pose estimation at an ultra-low sampling rate: to our knowledge, this is the first method to realize image-free cross-species pose estimation without explicit or implicit reconstruction and without reducing pose estimation accuracy; the amount of detection data is four orders of magnitude lower than that of traditional images (sampling rate of 6.260 × 10⁻⁴). (2) Cross-species adaptability: SPCPose demonstrates strong cross-species adaptability, significantly expanding its application scope. It is suitable for both human and animal pose estimation, making it a versatile tool for diverse research and practical applications. (3) Robustness in real-world scenarios: in real-world tests, SPCPose accurately decodes object poses under challenging conditions, such as outdoor scenes and complex backgrounds, underscoring its reliability and effectiveness in practical environments.

    2. Methods and Materials

The realization of ultra-low sampling rate cross-species pose estimation based on an SPC involves two main parts: one is the encoding of object information by the SPC and the measurement and storage of the encoded information; the other is the deep learning process that obtains pose information directly from the extremely small number of SPC measurement values, without explicit image display or implicit image reconstruction.

    2.1. Single-pixel camera

Figure 2(a) illustrates our designed ultra-low sampling rate single-pixel detection system, whose core components are a camera lens, used to collect light signals from the object scene; a digital micromirror device (DMD), used to modulate a series of preset Hadamard-basis light-field patterns; a single-pixel detector, used to measure the DMD-modulated signal containing the object information; and a computer, used to process the optical signal measured by the single-pixel detector and to perform the deep learning computations. Briefly, the camera lens images the object (a human) onto the optical modulation surface of the DMD. A mirror then redirects the +12° outgoing signal light, which is collected by the single-pixel detector through a converging lens.


    Figure 2.Overview of the SPCPose. (a) SPC structure schematic. (b) The architecture of our proposed image-free and (implicit) reconstruction-free cross-species pose estimation model.

The SPC’s schematic diagram is shown in Fig. 2(a). The camera lens projects the object scene onto the optical modulation surface of the DMD. The DMD modulates the incident optical information according to the preset light-field coding pattern and reflects it at a specific angle (+12°), after which the light is collected and recorded by the single-pixel detector through the collecting lens. The object information encoding process of the system can be expressed as $H^{(m)}(x,y)\cdot I(x,y)$, where $I(x,y)$ represents the object, $H^{(m)}(x,y)$ represents the light-field modulation code preset in the DMD memory, and $m = 1, 2, \ldots, M$ indexes the light-field modulation codes. The signal collected by the single-pixel detector can be expressed as
$$B^{(m)} = \iint H^{(m)}(x,y)\, I(x,y)\, \mathrm{d}x\, \mathrm{d}y. \tag{1}$$

The image information can be obtained by calculation through Eq. (2):
$$G(x,y) = \sum_{m=1}^{M} H^{(m)}(x,y)\, B^{(m)}. \tag{2}$$

For more detailed principles and algorithms, please refer to Refs. [40,45]. In fact, no image reconstruction is needed for the pose estimation in this paper: only the detection values need to be acquired and fed into the pose estimation network.
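As a minimal illustration of Eqs. (1) and (2), the following Python sketch simulates Hadamard single-pixel measurement and linear reconstruction on a toy O × O scene. This is our own example rather than the authors' released code; the Sylvester-ordered Hadamard matrix and the random stand-in scene are assumptions.

```python
import numpy as np
from scipy.linalg import hadamard

O = 32                         # pattern resolution O x O (the paper uses 128)
H = hadamard(O * O)            # each row, reshaped to O x O, is one pattern H^(m)
I = np.random.rand(O, O)       # stand-in object scene I(x, y)

# Eq. (1): full-sampling detection values B^(m) = <H^(m), I>
B = H @ I.ravel()

# Eq. (2): G = sum_m H^(m) B^(m); Hadamard rows are orthogonal (H H^T = O^2 Id),
# so the sum recovers the scene up to the scale factor O^2
G = (H.T @ B).reshape(O, O) / O**2
assert np.allclose(G, I)       # full sampling recovers the scene exactly
```

For SPCPose itself, the reconstruction step is skipped entirely: a short subset of B is the network input.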

Figure 2(a) also illustrates the imaging principle of the SPC in this paper, which is mathematically equivalent to Hadamard-transform-based image coding. Specifically, given a λ-channel instance image $I \in \mathbb{R}^{H\times W\times\lambda}$, we first reshape $I$ into $I_{2^N} \in \mathbb{R}^{2^N\times 2^N\times\lambda}$, then restructure $I_{2^N}$ into a five-dimensional tensor $I_{2^N/O} \in \mathbb{R}^{\frac{2^N}{O}\times\frac{2^N}{O}\times O\times O\times\lambda}$, partitioning $I_{2^N}$ according to the light-field pattern size $O \in \{2^x, x\in\mathbb{N}^*\}$. We then sum $I_{2^N/O}$ along its zeroth and first dimensions to obtain $I_{\mathrm{sum}} \in \mathbb{R}^{O\times O\times\lambda}$. Taking the Hadamard matrix $H \in \mathbb{R}^{O^2\times O^2}$ and vectorizing the zeroth and first dimensions of $I_{\mathrm{sum}}$ into $I_{O^2} \in \mathbb{R}^{O^2\times\lambda}$, we apply the Hadamard transform per channel to obtain the full-sampling single-pixel detection result $B = H I_{O^2} \in \mathbb{R}^{O^2\times\lambda}$, with a sampling rate $\mu = \frac{O\times O}{H\times W}$.
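Under our reading of this paragraph, the preprocessing amounts to embedding the image in a 2^N × 2^N canvas, folding (block-summing) it down to the O × O pattern size, and applying the Hadamard transform per channel. The sketch below is an interpretation under those assumptions, not the released code:

```python
import numpy as np
from scipy.linalg import hadamard

def simulate_full_sampling(I, O=32):
    """Simulate full-sampling SPC detection values B from an image I of shape
    (H, W, lam). O is the light-field pattern size (the paper uses O = 128)."""
    Hh, Ww, lam = I.shape
    n = O
    while n < max(Hh, Ww):               # smallest 2^N canvas holding the image
        n *= 2
    I2n = np.zeros((n, n, lam))
    I2n[:Hh, :Ww] = I                    # embed I in the 2^N x 2^N canvas
    # five-dimensional restructuring into (n/O, O, n/O, O, lam); summing the
    # two block indices folds the canvas onto an O x O grid (I_sum)
    I_sum = I2n.reshape(n // O, O, n // O, O, lam).sum(axis=(0, 2))
    H = hadamard(O * O)                  # Hadamard matrix, O^2 x O^2
    B = H @ I_sum.reshape(O * O, lam)    # full sampling: B = H I_{O^2}
    return B                             # sampling rate mu = O*O / (Hh*Ww)
```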

For the full-sampling results, three methods are used to extract the sampled values: natural order (M1), reverse order (M2), and random order (M3). In most cases, $M \in \{2^x, x\in\mathbb{N}^*\}$ values are drawn from each channel and reshaped into an image of $\sqrt{M}\times\sqrt{M}$ resolution, corresponding to a sampling rate of $\frac{M}{H\times W}$.

For example, for the sequential sampling result of B, assuming the original image resolution is 480 × 640, when M = 256 the sampling rate is μ = 8.333 × 10⁻⁴. The detailed code configuration can be found on the project homepage.
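The three extraction strategies can be sketched as follows; the exact index conventions for M1-M3 (e.g., whether reverse order takes the last M values) are our assumptions rather than the authors' definitions:

```python
import numpy as np

def extract(B, M, method="M3", seed=None):
    """Draw M of the O^2 full-sampling values per channel and reshape them into
    a sqrt(M) x sqrt(M) 'image' as network input. B: (O*O, channels)."""
    n = B.shape[0]
    if method == "M1":                   # natural order: first M values
        idx = np.arange(M)
    elif method == "M2":                 # reverse order: last M values, reversed
        idx = np.arange(n - 1, n - M - 1, -1)
    elif method == "M3":                 # random order: M values drawn at random
        idx = np.random.default_rng(seed).choice(n, size=M, replace=False)
    else:
        raise ValueError(method)
    side = int(np.sqrt(M))               # e.g., M = 256 -> a 16 x 16 input
    return B[idx].reshape(side, side, -1)

# 256 of 16384 values on a 480 x 640 image: 256 / (480 * 640) = 8.333e-4
```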

    2.2. SPCPose architecture

Vision-transformer-based perception models have demonstrated superior performance in many tasks compared to convolutional-neural-network-based models[46]. This advantage is attributable to components such as the multi-head attention mechanism, which endow these models with powerful feature extraction capabilities. As shown in Fig. 2(b), SPCPose primarily consists of a vision-transformer-based[20] feature extraction module operating on the single-pixel detection values and a feature-map orthogonal decomposition decoder. It does not incorporate structures that perform implicit reconstruction, maintaining an architecture consistent with image-based intelligent perception models.

Specifically, given a single-pixel detection instance input $B_M \in \mathbb{R}^{M\times 3}$, we reshape it into $B_{H\times W} \in \mathbb{R}^{H\times W\times 3}$, corresponding to an instance with $N_k$ keypoints. Notably, this input is not an image but detection values reshaped into image form to ease the training process. Initially, a positional encoder encodes the input to aid feature extraction, denoted by $F_0 = \mathrm{PatchEmbed}(B_{H\times W})$.

Then, the input is progressively fed into L layers of vision transformer encoder blocks to gradually extract hidden features from the single-pixel detection values, as described by
$$F_{i+1} = \mathrm{ViTEncoderBlock}(F_i),$$
where $F_i \in \mathbb{R}^{\frac{H}{d}\times\frac{W}{d}\times C}$. The output of the last layer, $F_{\mathrm{out}} = F_L \in \mathbb{R}^{\frac{H}{d}\times\frac{W}{d}\times C}$, serves as an implicit representation of the information embedded in the single-pixel detection values; here d (16 by default) is the down-sampling rate of the patch embedding layer and C is the channel dimension. Subsequently, to interpret this implicit representation, we first enhance the resolution of the feature map via deconvolution, as expressed by
$$K_{\mathrm{Deconv}} = \mathrm{Conv}_{1\times 1}[\mathrm{DeconvBlock}_N(F_{\mathrm{out}})],$$
where $K_{\mathrm{Deconv}} \in \mathbb{R}^{\frac{H}{2^{4-N}}\times\frac{W}{2^{4-N}}}$ and N is the number of DeconvBlock repetitions (one by default), controlling the feature map size. To adapt to cross-species and complex scenarios, the feature map then undergoes orthogonal decomposition, as expressed by
$$K_{\mathrm{Flatten}} = \mathrm{Flatten}(K_{\mathrm{Deconv}}),\qquad K_H = \mathrm{Linear}(K_{\mathrm{Flatten}}),\qquad K_V = \mathrm{Linear}(K_{\mathrm{Flatten}}),$$
where $K_H \in \mathbb{R}^{(\gamma W)\times N_k}$ and $K_V \in \mathbb{R}^{(\gamma H)\times N_k}$ are the probability distributions of the keypoints in the horizontal and vertical directions, and γ (two by default) controls the dimensionality of the output feature vectors so as to accommodate feature maps at different scales. Finally, the coordinates of the keypoints are estimated using the KL divergence. For the detailed code configuration, please refer to Ref. [47].
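To make the decoder concrete, the PyTorch sketch below implements the deconvolution and orthogonal-decomposition steps described above. It is an illustration under our assumptions (in particular, the 1 × 1 convolution mapping to N_k keypoint channels and all layer widths are our choices); the released code in Ref. [47] is authoritative.

```python
import torch
import torch.nn as nn

class OrthoDecompDecoder(nn.Module):
    """Sketch of the feature-map orthogonal-decomposition decoder."""
    def __init__(self, C=384, Nk=17, H=128, W=128, d=16, N=1, gamma=2):
        super().__init__()
        layers, ch = [], C
        for _ in range(N):                       # each DeconvBlock upsamples by 2x
            layers += [nn.ConvTranspose2d(ch, ch // 2, 4, stride=2, padding=1),
                       nn.BatchNorm2d(ch // 2),
                       nn.ReLU(inplace=True)]
            ch //= 2
        self.deconv = nn.Sequential(*layers)
        self.conv1x1 = nn.Conv2d(ch, Nk, kernel_size=1)   # one channel per keypoint
        h, w = H // d * 2 ** N, W // d * 2 ** N           # H/2^(4-N) x W/2^(4-N) for d = 16
        self.to_h = nn.Linear(h * w, gamma * W)           # K_H: horizontal distribution
        self.to_v = nn.Linear(h * w, gamma * H)           # K_V: vertical distribution

    def forward(self, F_out):                    # F_out: (B, C, H/d, W/d)
        K = self.conv1x1(self.deconv(F_out))     # (B, Nk, h, w)
        K = K.flatten(2)                         # K_Flatten: (B, Nk, h*w)
        return self.to_h(K), self.to_v(K)        # per-keypoint 1-D logits

x = torch.randn(2, 384, 8, 8)                    # ViT-S features for a 128 x 128 input
K_H, K_V = OrthoDecompDecoder()(x)               # shapes (2, 17, 256) and (2, 17, 256)
```

The two 1-D outputs are compared to discretized ground-truth keypoint distributions via the KL divergence during training.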

    2.3. Datasets for SPCPose

Tri-Mouse. The original mouse dataset was provided by the V. N. Murthy laboratory at Harvard University[6,17]. It documents the behavior of three mice within an experimental setup. The raw images were captured at a resolution of 640 pixel × 480 pixel, with each mouse annotated using 12 keypoints. A total of 112 samples were ultimately used in our analysis.

Horse10. The horse dataset originates from the Mathis laboratory at the École Polytechnique Fédérale de Lausanne (EPFL)[6,48]. The raw images have a resolution of 288 pixel × 162 pixel, and each horse was annotated with 22 keypoints. Our final dataset consists of 8,144 samples.

Human. We utilized two OAK-D-Pro-W cameras to capture the motion of two subjects from different perspectives. The images were saved at a resolution of 852 pixel × 480 pixel, and each human subject was annotated with 17 keypoints. From the recorded videos, we extracted a total of 1,110 samples for our experimental study.

Real-World Human using SPC. The principle and device are shown in Figs. 2(a) and 5(a). The camera lens (focal length: 420-800 mm, aperture: f/3.8-16) collects and images the human object onto the optical modulation surface of the DMD (Vialux DLP650LNIR). The DMD modulates the object light based on a Hadamard pattern (128 pixel × 128 pixel) pre-loaded into memory; 6 × 6 micromirrors of the DMD are binned into a single pattern pixel of 64.8 µm × 64.8 µm, giving a total pattern size of 8294.4 µm × 8294.4 µm. Since the DMD operates at ±12°, we added a mirror along the ±12° propagation direction to redirect the light, which is then focused onto the single-pixel detector (Thorlabs PDA100A2) through a lens. The signal acquired by the single-pixel detector is stored on the computer via a data acquisition card (NI USB-6251).


Figure 3. Visualization of SPCPose on the public datasets. (a) Visualization of SPCPose on Tri-Mouse. (b) Visualization of SPCPose on Horse10. M1 denotes natural order, M2 reverse order, and M3 random order. The numbers on the pictures indicate the sampling rates.

    2.4. SPCPose training setup

    The subsequent sections outline the training configurations for the datasets used in this study. All models are trained using the ViT-MAE[49,50] pre-trained model on a single NVIDIA A800 GPU, with a consistent batch size of 64 and a maximum iteration limit of 10,000. The AdamW optimizer is utilized, initialized at a learning rate of 0.0005, with adjustable hyperparameters to fine-tune performance. Supervised learning of the output probability maps employs the KLDiscretLoss as the loss function. The Tri-Mouse dataset, significantly smaller with only 112 samples, serves exclusively for evaluating few-shot learning scenarios without further subdivision into training, validation, or test sets. The Horse10 dataset encompasses 8,114 samples; here, 60% (4,868 samples) is designated for training, with 20% (1,623 samples) each dedicated to evaluation during training and testing. Lastly, for the Human dataset, which consists of 1,110 samples, 80% of the data (888 samples) is allocated for training purposes, while 10% (111 samples) each is reserved for evaluation during training and final testing. Further details regarding the datasets and experimental setup can be found in Ref. [47].
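A minimal sketch of this training loop, assuming MMPose-style components (the SPCPose constructor, the dataloader fields, and the KLDiscretLoss call signature below are our assumptions; see Ref. [47] for the actual configuration):

```python
import torch

# Hypothetical names: SPCPose, KLDiscretLoss, and `loader` stand in for the
# project's actual modules (Ref. [47]); hyperparameters follow the text.
model = SPCPose(backbone="ViT-S", pretrained="vit-mae").cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4)
criterion = KLDiscretLoss()          # KL divergence over discretized 1-D bins

step = 0
while step < 10_000:                 # maximum iteration limit
    for detections, gt_x, gt_y in loader:          # batch size 64
        pred_x, pred_y = model(detections.cuda())  # K_H, K_V logits
        loss = criterion(pred_x, pred_y, gt_x.cuda(), gt_y.cuda())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        step += 1
        if step >= 10_000:
            break
```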

We used average precision (AP), percentage of correct keypoints (PCK), area under the receiver operating characteristic (ROC) curve (AUC), and end-point error (EPE) as the main evaluation metrics for the majority of the datasets. With the exception of EPE, higher values are better for all metrics.
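For reference, PCK and EPE can be computed as below (standard definitions; the normalization size used for PCK@0.05 is our assumption about the exact convention):

```python
import numpy as np

def epe(pred, gt):
    """End-point error: mean Euclidean distance (in pixels) between
    predicted and ground-truth keypoints; pred, gt: (N_k, 2) arrays."""
    return np.linalg.norm(pred - gt, axis=1).mean()

def pck(pred, gt, norm_size, thr=0.05):
    """PCK@thr: fraction of keypoints whose error is below thr * norm_size
    (norm_size is, e.g., the bounding-box size of the subject)."""
    return (np.linalg.norm(pred - gt, axis=1) < thr * norm_size).mean()
```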

    3. Results

We first designed the generation method for the simulated training dataset, then selected two public datasets[6,17,51] (Tri-Mouse and Horse10) and constructed one dataset (Human) for numerical simulation tests. Finally, we further verified pose estimation using a small number of detection values measured outdoors by our custom-built SPC. In this section, we highlight the performance of the single-pixel detection system combined with SPCPose for cross-species object pose estimation under various imaging modes.

    3.1. Simulation results on the public dataset for SPCPose

This section shows simulation results on two publicly available datasets (Tri-Mouse and Horse10) with the coding method described in Sec. 2. All objects were detected, generating 16,384 single-pixel detection values, referred to as full sampling. We then extracted 4,096, 1,024, and 256 points in natural order (M1), reverse order (M2), and random order (M3) from these values to simulate varying degrees of extraction. Figure 3 displays the pose estimation results of SPCPose, where the first row of each subplot shows the original image (Tri-Mouse resolution: 640 × 480; Horse10 resolution: 288 × 162) and its corresponding under-sampled detection values. The second row presents the keypoint locations extracted by SPCPose (the keypoints in the original image were manually annotated) as the sampling rate decreases. The third row maps the prediction results back onto the original image. The visualization results show that pose estimation can be achieved for both animals at the three lower sampling rates. Even when the sampling rate is as low as 8.333 × 10⁻⁴, SPCPose can still effectively parse the pose information of the object, which demonstrates the superiority of the proposed method in terms of effectiveness, cross-species applicability, and information compression.

Tables 1 and 2 provide a detailed quantitative analysis comparing the pose estimation results of M1, M2, and M3 against image-based methods, namely heatmap-based approaches[52] and regression-based RLE methods[48,53], which used HRNet[49] and ResNet[54] as backbone models. The evaluation criteria include AP@50-95, PCK@0.05, AUC, EPE, parameter count (to evaluate computational space complexity), and GFLOPs (to evaluate actual computational complexity). We chose these models for comparison because their computational complexities are similar to SPCPose's and their detection accuracies are state of the art, facilitating an assessment of whether SPCPose can achieve comparable performance.

Table 1. Performance Comparison of Image-Based and Image-Free Pose Estimation Regarding AP, PCK, AUC, EPE, Params, and GFLOPs on Tri-Mouse[6,51]ᵃ

| Method | Sample rate | Backbone | AP@50-95 | PCK@0.05 | AUC | EPE | Params (M) | GFLOPs |
|---|---|---|---|---|---|---|---|---|
| Heatmap[52] | 1.000 | HRNet-W32 | 99.9 | 99.8 | 91.4 | 1.91 | 28.5 | 10.2 |
| | 1.000 | HRNet-W48 | 100.0 | 99.8 | 92.2 | 1.67 | 63.6 | 21.0 |
| RLE[48] | 1.000 | ResNet50 | 98.4 | 95.8 | 86.4 | 3.43 | 23.7 | 5.4 |
| | 1.000 | ResNet101 | 99.3 | 97.4 | 87.7 | 3.01 | 42.7 | 10.2 |
| | 1.000 | ResNet152 | 98.9 | 96.4 | 87.5 | 3.08 | 58.3 | 15.1 |
| SPCPose-M1 | 5.333 × 10⁻² | ViT-S | 99.9 | 95.9 | 91.4 | 1.90 | 24.3 | 5.9 |
| | 1.333 × 10⁻² | ViT-S | 96.5 | 99.9 | 88.5 | 3.56 | 24.3 | 5.9 |
| | 3.333 × 10⁻³ | ViT-S | 53.0 | 54.8 | 54.5 | 25.39 | 24.3 | 5.9 |
| | 8.333 × 10⁻⁴ | ViT-S | 78.4 | 79.7 | 77.6 | 14.59 | 24.3 | 5.9 |
| SPCPose-M2 | 1.333 × 10⁻² | ViT-S | 100.0 | 100.0 | 93.9 | 1.13 | 24.3 | 5.9 |
| | 3.333 × 10⁻³ | ViT-S | 100.0 | 100.0 | 92.8 | 1.49 | 24.3 | 5.9 |
| | 8.333 × 10⁻⁴ | ViT-S | 100.0 | 100.0 | 93.4 | 1.30 | 24.3 | 5.9 |
| SPCPose-M3 | 1.333 × 10⁻² | ViT-S | 100.0 | 99.7 | 91.5 | 1.87 | 24.3 | 5.9 |
| | 3.333 × 10⁻³ | ViT-S | 100.0 | 99.8 | 91.3 | 1.93 | 24.3 | 5.9 |
| | 8.333 × 10⁻⁴ | ViT-S | 100.0 | 100.0 | 93.4 | 1.28 | 24.3 | 5.9 |

ᵃHeatmap[52] and RLE[48] are common image-based pose estimation methods that use HRNet[49] or ResNet[54], respectively, as high-performance feature extractors.

Table 2. Performance Comparison of Image-Based and Image-Free Pose Estimation Regarding AP, PCK, AUC, EPE, Params, and GFLOPs on Horse10[51]ᵃ

| Method | Sample rate | Backbone | AP@50-95 | PCK@0.05 | AUC | EPE | Params (M) | GFLOPs |
|---|---|---|---|---|---|---|---|---|
| Heatmap[52] | 1.000 | HRNet-W32 | 98.3 | 99.9 | 93.6 | 1.14 | 28.5 | 10.2 |
| | 1.000 | HRNet-W48 | 98.3 | 99.9 | 93.7 | 1.10 | 63.6 | 21.0 |
| RLE[48] | 1.000 | ResNet50 | 95.1 | 99.6 | 93.8 | 1.44 | 23.7 | 5.4 |
| | 1.000 | ResNet101 | 96.1 | 99.7 | 93.9 | 1.36 | 42.7 | 10.2 |
| | 1.000 | ResNet152 | 96.5 | 99.7 | 94.0 | 1.30 | 58.3 | 15.1 |
| SPCPose-M1 | 3.512 × 10⁻¹ | ViT-S | 87.6 | 94.0 | 88.1 | 2.96 | 24.3 | 5.9 |
| | 8.779 × 10⁻² | ViT-S | 76.0 | 86.6 | 83.0 | 4.85 | 24.3 | 5.9 |
| | 2.195 × 10⁻² | ViT-S | 67.4 | 81.5 | 80.2 | 5.83 | 24.3 | 5.9 |
| | 5.487 × 10⁻³ | ViT-S | 72.6 | 83.2 | 81.0 | 5.65 | 24.3 | 5.9 |
| SPCPose-M2 | 8.779 × 10⁻² | ViT-S | 92.7 | 97.7 | 91.3 | 1.87 | 24.3 | 5.9 |
| | 2.195 × 10⁻² | ViT-S | 89.4 | 94.9 | 88.5 | 2.83 | 24.3 | 5.9 |
| | 5.487 × 10⁻³ | ViT-S | 83.3 | 90.3 | 85.0 | 4.10 | 24.3 | 5.9 |
| SPCPose-M3 | 8.779 × 10⁻² | ViT-S | 94.7 | 98.1 | 93.4 | 1.52 | 24.3 | 5.9 |
| | 2.195 × 10⁻² | ViT-S | 91.1 | 96.5 | 90.3 | 2.20 | 24.3 | 5.9 |
| | 5.487 × 10⁻³ | ViT-S | 89.6 | 95.8 | 89.6 | 2.45 | 24.3 | 5.9 |

ᵃHeatmap[52] and RLE[48] are common image-based pose estimation methods that use HRNet[49] or ResNet[54], respectively, as high-performance feature extractors.

In particular, on the Tri-Mouse dataset, SPCPose-M2 and SPCPose-M3 achieved results comparable to the image-based methods even as the sampling rate was reduced to a minimum of 8.333 × 10⁻⁴ (a 1200-fold reduction in data volume). On the Horse10 dataset, SPCPose-M2 and SPCPose-M3 achieved results close to those of the image-based methods. For example, compared to the heatmap-based method with HRNet-W48, at a sampling rate of 8.779 × 10⁻² (about 11 times less data), SPCPose-M3 reduces the number of parameters by a factor of 2.6 and the computation by a factor of 3.6, while AP@50-95 drops by only 3.7%. At a sampling rate of 5.487 × 10⁻³ (about 181 times less data), AP@50-95 drops by only 8.9%.

    3.2. Simulation results on the indoor Human dataset for SPCPose

This section demonstrates the effect of SPCPose on human pose estimation in indoor scenes, using the Human dataset. Original images with a resolution of 852 × 480 were detected, generating 16,384 single-pixel detection values as the full-sampling result. Subsequently, we extracted 4,096, 1,024, and 256 points, corresponding to sampling rates of 1.002 × 10⁻², 2.504 × 10⁻³, and 6.260 × 10⁻⁴, respectively.
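These rates follow directly from the ratio of retained detection values to the original pixel count:

$$\mu=\frac{M}{H\times W}:\qquad \frac{4096}{852\times 480}\approx 1.002\times 10^{-2},\qquad \frac{1024}{852\times 480}\approx 2.504\times 10^{-3},\qquad \frac{256}{852\times 480}\approx 6.260\times 10^{-4}.$$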


Figure 4. Visualization of SPCPose on the indoor Human dataset. The numbers on the pictures indicate the sampling rates.

    3.3. Real-world human experimental results for SPCPose

We conducted tests across multiple outdoor scenes to comprehensively evaluate the robustness of our single-pixel detection system and SPCPose in real-world conditions. The custom-built single-pixel detection system captured human targets in various environments at an image resolution of 377 × 377. Using 256 detection values for pose estimation, the sampling rate was approximately 1.8 × 10⁻³. Section 2 provides detailed system settings and experimental methods.

Figure 5 presents the key results of our real-world tests. We placed the single-pixel imaging system in a nearby building to capture people in three different scenes near the track and field [Fig. 5(a)]. In Fig. 5(b), images taken by a conventional camera illustrate the actual scenes, while Fig. 5(c) showcases the imaging capability of the single-pixel system and the visualization of detected keypoints: the first column shows the SPC imaging results, the second column shows the pose estimation results using the image-based approach, and the third column maps those results onto a reconstructed image for presentation. Figures 5(d), 5(e), and 5(f) show the results of applying SPCPose to the three methods of extracting single-pixel detection values. The first column provides a graphical representation of the single-pixel detection values, an essential step for understanding how the raw data are captured. The second column displays the results after applying SPCPose, converting single-pixel detection values into pose estimation information. The third column maps the keypoint locations onto the reconstructed image, visually displaying the pose estimation outcomes.


Figure 5. Real-world experimental results. (a) Schematic of our customized SPC and the captured scenes. (b) The same subject performing different actions in three real-world scenes, captured with an RGB camera. (c) Scenes reconstructed using our single-pixel detection system and pose estimation results using an image-based approach. (d) Natural-order single-pixel detection value map representation and the SPCPose results. (e) Reverse-order single-pixel detection value map representation and the SPCPose results. (f) Random-order single-pixel detection value map representation and the SPCPose results.

In these experiments, targets performed actions at distances of approximately 100 m in three scenes; the single-pixel detection system captured the target features, and SPCPose parsed the poses. The visualization results show that even under low sampling rates, long distances, and complex backgrounds, SPCPose can effectively parse target poses. This demonstrates the potential of combining a single-pixel detection system with SPCPose, not only for high-throughput post-processing applications but also for advancing image-free intelligent sensing technologies, opening new possibilities for future smart sensing applications.

    Notably, we have implemented the proposed model on two leading-edge computing platforms: the NVIDIA Jetson AGX Orin and the Atlas 200I DK A2. The Jetson AGX Orin, featuring a high-performance NVIDIA Ampere architecture GPU and multi-core CPU, provides abundant computing resources. It allows the model to conduct complex neural network computations with high throughput and low latency, validating its scalability for large-scale edge AI data processing. In contrast, the Atlas 200I DK A2, powered by Huawei’s Ascend 310 AI processor, is tailored for low-power, high-efficiency edge AI. Our successful deployment on it shows the model’s adaptability to resource-constrained setups. Despite having fewer resources than the Jetson AGX Orin, the Atlas 200I DK A2 can execute the model effectively, suggesting its potential in cost and power-conscious edge computing applications.

    In summary, through extensive testing across various imaging modes, species, sampling rates, and real-world scenarios, we have demonstrated the effectiveness and robustness of the single-pixel detection system combined with SPCPose. This combination holds theoretical and technological implications, providing a solid foundation for practical applications and promising to play a crucial role in the future of innovative sensing technology.

    4. Discussion

Firstly, we demonstrated that SPCPose can accurately parse the pose information of targets under different single-pixel detection configurations, highlighting the strong potential of single-stage models to extract implicit features without explicit reconstruction. Secondly, we tested the pose estimation performance of SPCPose at extremely low sampling rates, verifying its capability to perform cross-species pose estimation efficiently and accurately; this underscores SPCPose's proficiency in handling deeply undersampled data and offers the possibility of reducing data transmission bandwidth, storage requirements, and computational complexity. Finally, using a customized single-pixel detection system combined with SPCPose, we conducted real-world pose estimation experiments, fully showcasing the robustness and practicality of the model. Despite the substantial potential demonstrated by a single-stage model without implicit reconstruction, several factors can influence SPCPose's ability to achieve optimal results. This section discusses these factors in detail as a guide for obtaining the best pose estimation outcomes.

The ordering of the Hadamard matrix rows directly affects the result of image-free pose estimation. Besides the three orderings used in our experiments, there are other sequences such as Russian dolls and cake-cutting. We adopted natural-order, reverse-order, and random-order Hadamard matrices, rather than Russian-dolls or cake-cutting sequences, because we wanted any segment of detection values to correspond to Hadamard rows containing both high- and low-resolution (high- and low-order) components, ensuring that pose estimation remains effective for both high- and low-resolution detection. The natural-order and random-order Hadamard matrices have precisely this property, and the natural-order and reverse-order experiments verify the idea: a natural-order Hadamard matrix can be used to detect the target scene, after which a small number of detection values from any segment (the front or the back of the sequence) suffices for pose estimation. This avoids being constrained to the front of the sequence or to a specific data segment and thus increases the flexibility of pose estimation. Russian-dolls and cake-cutting Hadamard sequences may behave differently, and there may exist Hadamard orderings even better suited to image-free pose estimation; investigating them is an important direction that should yield meaningful results.

    4.1. Cross-species adaptability

This section evaluates the adaptability of SPCPose for multi-species pose estimation. The visualization results in Figs. 3 and 4 demonstrate that SPCPose exhibits excellent cross-species adaptability, accurately estimating pose features across different imaging modes and sampling rates. It performs comparably to image-based pose estimation models even in multi-target scenes and complex background environments, indicating that SPCPose works well under simple experimental conditions and provides reliable results in more complex real-world applications.

The results show that when applied to simple-background, multi-object scenarios like those in the Tri-Mouse dataset, SPCPose achieves performance consistent with or superior to image-based models. At sampling rates ranging from 5.333 × 10⁻² to 8.333 × 10⁻⁴, it reduces parameter counts by 14.7% to 61.8% and lowers computational complexity by 42.2% to 71.9%. Notably, SPCPose outperforms the regression-based RLE methods in AUC and has lower EPE, possibly because the single-pixel detection system focuses more on the object, concentrating the features fed into SPCPose and thereby reducing EPE.

    For complex background scenes similar to those in the Horse10 dataset, SPCPose maintains performance close to that of image-based models. However, there is an inevitable decline in AP and PCK metrics. In summary, SPCPose shows significant advantages in multi-species pose estimation, mainly providing high-quality results with low parameter counts and computational complexity.

    4.2. Lower limit of sample rate

    This section explores the significance of single-pixel imaging (SPI) at extremely low sampling rates for high-throughput post-processing applications, aiming to determine the potential lower limit of sampling rates. For example, Fig. 4 illustrates the visual effects of SPCPose on human pose estimation at different sampling rates. Despite some errors relative to ground-truth annotations, these discrepancies do not significantly impact the recognition accuracy required for backend applications.

Tables 1 to 3 show SPCPose's performance at very low sampling rates. In particular, the indoor experiments in Table 3 show that SPCPose-M3 still exceeds 50% AP@50-95 at a very low sampling rate of 6.260 × 10⁻⁴. While it cannot be definitively stated that 6.260 × 10⁻⁴ is the absolute minimum sampling rate at which SPCPose can recognize poses, this lower limit is remarkably low and likely sufficient for most high-throughput applications. From a hardware perspective, the lower limit could be reduced further if the single-pixel detection system featured a higher dynamic range, lower noise, and higher quantum efficiency. Algorithmically, SPCPose directly extracts implicit features from sequential data without relying on explicit reconstruction methods to adapt to environmental changes.

    In conclusion, current limitations are not insurmountable; SPCPose will exhibit broader applicability and superior performance as technology and methodologies advance. Technological progress and algorithmic optimization will improve existing capabilities and expand into new domains, offering more efficient and cost-effective solutions for high-throughput post-processing applications.

    4.3. Effect of extraction methods on pose estimation

Deep-learning-based single-pixel imaging methods can leverage various extraction strategies, such as M1, M2, and M3, for selecting single-pixel detection values for image reconstruction. To investigate whether SPCPose exhibits similar characteristics, we conducted a detailed analysis on the Human dataset. All experiments used a light-field pattern of 128 × 128 resolution to detect objects, obtaining 16,384 single-pixel detection values at full sampling. Subsequently, we extracted 256 detection values to evaluate the impact of the different extraction methods on pose decoding.

Figure 6(a) showcases the subjective visual quality of SPCPose under the different extraction methods. For unobstructed objects, the performance differences among the extraction methods are minimal. Although all methods can detect side-facing objects, M3 yields decoding results closer to the ground-truth annotations. From an objective evaluation perspective, Fig. 6(b) provides quantified results. For SPCPose, M3 excels in both subjective and objective evaluations, especially for parsing complex poses or side-facing objects. M2 is the second-best option, providing reliable results when aliasing imaging is not feasible. M1 is generally not preferred due to its potential to cause hallucinations; however, it can still be considered in specific scenarios if the initial "outliers" can be effectively handled. Additionally, random sampling increases diversity, helping reduce model bias, particularly when the other methods struggle to differentiate between cases.


Figure 6. Effect of extraction methods on SPCPose's performance in parsing object poses at 256 sample points. (a) Visualization of SPCPose processing results with different extraction methods. The numbers on the pictures indicate the sampling rates. (b) Objective evaluation of SPCPose processing performance with different extraction methods.

In conclusion, while SPCPose supports multiple extraction methods, each with varying applicability to pose estimation, we recommend prioritizing M3 for optimal results. Considering the diverse requirements of practical applications, flexibly selecting the extraction strategy best suited to one's specific scenario is crucial. This approach maintains high pose estimation accuracy under low sampling rates and adapts to varying background complexities and object pose changes.

    4.4. Effect of complex scenarios on pose estimation

Although SPCPose has demonstrated its potential across various scenarios, achieving comprehensive coverage of all known contexts remains challenging. Exploring typical complex scenarios is therefore essential for future research.

Firstly, in complex backgrounds, the diminished contrast between the object and its environment imposes higher demands on the model's ability to recognize and accurately locate feature points. We adopted a vision-transformer-based architecture, which is more adept at extracting sequential features than traditional convolutional networks. Future architectures that suppress "hallucination" phenomena better than standard transformers may offer further advantages.

Secondly, the poses of humans and other organisms are highly dynamic and diverse, including standing, walking, jumping, and turning sideways. These variations can cause keypoints to be occluded or morphologically distorted, affecting pose estimation accuracy. We currently address this problem with orthogonal decomposition feature maps; however, exploring higher-dimensional multiscale feature decomposition appears particularly important, as such a method may provide a more effective solution to these challenges.

Furthermore, low sampling rates present an issue that cannot be overlooked when handling complex scenes, as they can result in the loss of critical details and impact the effectiveness of pose estimation. In practical applications, it is necessary to select the sampling rate appropriately and to optimize the light-field modulation mode under passive illumination according to the specific application requirements, ensuring that no vital visual information is omitted.

    5. Conclusion

This study introduced SPCPose, a cross-species pose estimation method using an SPC at ultra-low sampling rates (6.260 × 10⁻⁴). Unlike conventional approaches, SPCPose bypasses image reconstruction entirely, drastically reducing data requirements while maintaining accuracy comparable to that of traditional image-based methods. The framework combines a vision transformer for feature extraction with an orthogonal decomposition decoder for pose estimation, demonstrating robust performance across both human and animal subjects. Validated in challenging real-world scenarios, including outdoor environments and long-distance conditions, SPCPose offers an efficient, low-latency solution well suited to mobile applications where bandwidth and computational resources are constrained.

    [4] J. Cao, H. Tang, H. S. Fang et al. Cross-domain adaptation for animal pose estimation. Proc. IEEE Int. Conf. Comput. Vis. (ICCV), 9497(2019).

    [10] M. Kresovic, T. Nguyen, M. Ullah et al. PigPose: A real-time framework for farm animal pose estimation and tracking. Proc. IFIP Int. Conf. Artif. Intell. Appl. Innov., 204(2022).

    [22] J. Deng, W. Dong, R. Socher et al. ImageNet: A large-scale hierarchical image database. Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 248(2009).

    [27] K. Yan, F. Wang, B. Qian et al. Person-in-WiFi 3D: End-to-end multi-person 3D pose estimation with Wi-Fi. Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 969(2024).

    [48] J. Li, S. Bian, A. Zeng et al. Human pose regression with residual log-likelihood estimation. Proc. IEEE Int. Conf. Comput. Vis. (ICCV), 11005(2021).

    [49] K. Sun, B. Xiao, D. Liu et al. Deep high-resolution representation learning for human pose estimation. Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 5686(2019).

    [50] K. He, X. Chen, S. Xie et al. Masked autoencoders are scalable vision learners. Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 15979(2022).

    [51] A. Mathis, T. Biasi, S. Schneider et al. Pretraining boosts out-of-domain robustness for pose estimation. Proc. IEEE Winter Conf. Appl. Comput. Vis. (WACV), 1858(2021).

    [52] B. Xiao, H. Wu, Y. Wei. Simple baselines for human pose estimation and tracking. Proc. Eur. Conf. Comput. Vis. (ECCV), 472(2018).

    [53] A. Toshev, C. Szegedy. DeepPose: Human pose estimation via deep neural networks. Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 1653(2014).

    [54] K. He, X. Zhang, S. Ren et al. Deep residual learning for image recognition. Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 770(2016).

    Paper Information

    Category: Imaging Systems and Image Processing

    Received: Apr. 1, 2025

Accepted: May 9, 2025

    Published Online: Aug. 22, 2025

    The Author Email: Cheng Zhou (zhoucheng91210@163.com), Jipeng Huang (huangjp848@nenu.edu.cn), Lijun Song (ccdxslj@126.com)

    DOI:10.3788/COL202523.091101

    CSTR:32184.14.COL202523.091101
