Chinese Journal of Liquid Crystals and Displays, Vol. 38, Issue 8, 1095 (2023)

Behavior recognition based on time-dependent attention

Kuan LIU1,2, Wei WANG1,2, Hong-ting SHEN1, Hong-tao HOU1,2, Min-zhen GUO1,2, and Zi-jiang LUO1,*
Author Affiliations
  • 1School of Information, Guizhou University of Finance and Economics,Guiyang 550025, China
  • 2Intelligent Middle, Beijing Cloud Trace Technology Co., Ltd., Beijing 100089, China
Figures & Tables (15)
    SlowFast architecture
    Example of action correlation
    Correlative attention mechanism
    Temporal attention mechanism
    Time-dependent attention mechanism
    Residual block of Res_TCAM
    Statistics of the number of UCF101 video frames
    Training results on UCF101
    Confusion matrix
    • Table 0. Algorithm 1: Automatic frame sampling algorithm
      Input: raw video RawVideo; frame sampling rate r; number of valid sampled frames γ

      Output: buffer holding the sampled frame data: DataBuffer

      Variables: start index start_idx; end index end_idx; valid-sample counter sample_count; frame counter count; number of video frames frame_count; frame sample quota frame_sample_count

      1  get the number of video frames: frame_count=GetFrameCountFun(RawVideo)
      2  initialize parameters:
      3  start_idx=0
      4  end_idx=frame_count-1
      5  sample_count=0
      6  count=0
      7  frame_sample_count=frame_count // r-1
      8  if(frame_count>300)
      9    end_idx=RandInt(300,frame_count)
      10   start_idx=end_idx-300
      11   frame_sample_count=301 // r-1
      12 initialize the data buffer: InitFun(DataBuffer)
      13 while(count<end_idx)do
      14    frame=GetFrame(RawVideo)
      15    if(count<start_idx)
      16       count+=1
      17       continue
      18    if(count>end_idx)
      19       break
      20    if(sample_count<frame_sample_count)
      21       DataBuffer[sample_count]=frame
      22       sample_count+=1
      23    count+=1
      24 if(len(DataBuffer)<γ+2):
      25    delete DataBuffer
      26 else
      27    DataBuffer←draw γ frames from DataBuffer at random,
             preserving temporal order
      28 return DataBuffer
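
      In short: videos longer than 300 frames are cropped to a random 300-frame window, frames inside the window are buffered up to the sampling quota, and γ of them are finally drawn at random in temporal order. A minimal executable Python sketch of Algorithm 1 follows (the sample_frames name and the OpenCV decoding calls are assumptions; the paper specifies only the pseudocode above):

          import random
          import cv2  # assumed video decoder; the paper does not name one

          def sample_frames(video_path, r, gamma):
              """Automatic frame sampling (sketch of Algorithm 1).
              r: frame sampling rate; gamma: number of frames kept."""
              cap = cv2.VideoCapture(video_path)
              frame_count = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))

              # Default window: the whole video.
              start_idx, end_idx = 0, frame_count - 1
              frame_sample_count = frame_count // r - 1
              if frame_count > 300:
                  # Long videos: sample inside a random 300-frame window.
                  end_idx = random.randint(300, frame_count - 1)
                  start_idx = end_idx - 300
                  frame_sample_count = 301 // r - 1

              buffer, count = [], 0
              while count <= end_idx:
                  ok, frame = cap.read()
                  if not ok:
                      break
                  if count >= start_idx and len(buffer) < frame_sample_count:
                      buffer.append(frame)
                  count += 1
              cap.release()

              if len(buffer) < gamma + 2:
                  return None  # too few frames: discard the clip
              # Randomly keep gamma frames, preserving temporal order.
              keep = sorted(random.sample(range(len(buffer)), gamma))
              return [buffer[i] for i in keep]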

    • Table 1. Parameters of network structure

      Stage     | Slow pathway                          | Fast pathway                       | Output size T×S²
      Input     | —                                     | —                                  | 64×224²
      Data      | stride 16, 1²                         | stride 2, 1²                       | slow: 4×224²; fast: 32×224²
      Conv1     | 1×7², 64, stride 1, 2²                | 5×7², 8, stride 1, 2²              | slow: 4×112²; fast: 32×112²
      Pool1     | 1×3², 64, stride 1, 2²                | 1×3², 8, stride 1, 2²              | slow: 4×56²; fast: 32×56²
      Res_TCAM1 | [1×1², 64; 1×3², 64; TCAM, 256]×3     | [3×1², 8; 1×3², 8; TCAM, 32]×3     | slow: 4×56²; fast: 32×56²
      Res_TCAM2 | [1×1², 128; 1×3², 128; TCAM, 512]×4   | [3×1², 16; 1×3², 16; TCAM, 64]×4   | slow: 4×28²; fast: 32×28²
      Res_TCAM3 | [3×1², 256; 1×3², 256; TCAM, 1 024]×6 | [3×1², 32; 1×3², 32; TCAM, 128]×6  | slow: 4×14²; fast: 32×14²
      Res_TCAM4 | [3×1², 512; 1×3², 512; TCAM, 2 048]×4 | [3×1², 64; 1×3², 64; TCAM, 256]×4  | slow: 4×7²; fast: 32×7²
      Head      | global average pool, concat, fc       |                                    | # classes
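
      Read row-wise, each Res_TCAM stage is a SlowFast-style bottleneck whose last 1×1² convolution feeds a TCAM module. The sketch below shows how one such bottleneck could be assembled in PyTorch; the TCAM class here is a bare 1×1×1 projection placeholder (this excerpt does not reproduce the attention internals), and all class and parameter names are assumptions:

          from torch import nn

          class TCAM(nn.Module):
              """Placeholder for the time-dependent attention module: only a
              1x1x1 projection here; the real TCAM also reweights features."""
              def __init__(self, in_ch, out_ch):
                  super().__init__()
                  self.proj = nn.Conv3d(in_ch, out_ch, kernel_size=1, bias=False)
                  self.bn = nn.BatchNorm3d(out_ch)

              def forward(self, x):
                  return self.bn(self.proj(x))

          class ResTCAMBlock(nn.Module):
              """One bottleneck, e.g. slow Res_TCAM1: 1x1²,64 -> 1x3²,64 -> TCAM,256."""
              def __init__(self, in_ch, mid_ch, out_ch, t_kernel=1):
                  super().__init__()
                  self.branch = nn.Sequential(
                      nn.Conv3d(in_ch, mid_ch, (t_kernel, 1, 1),
                                padding=(t_kernel // 2, 0, 0), bias=False),
                      nn.BatchNorm3d(mid_ch), nn.ReLU(inplace=True),
                      nn.Conv3d(mid_ch, mid_ch, (1, 3, 3),
                                padding=(0, 1, 1), bias=False),
                      nn.BatchNorm3d(mid_ch), nn.ReLU(inplace=True),
                      TCAM(mid_ch, out_ch),
                  )
                  self.shortcut = (nn.Identity() if in_ch == out_ch
                                   else nn.Conv3d(in_ch, out_ch, 1, bias=False))
                  self.relu = nn.ReLU(inplace=True)

              def forward(self, x):  # x: (N, C, T, H, W)
                  return self.relu(self.branch(x) + self.shortcut(x))

          # e.g. the slow-pathway Res_TCAM1 row: three blocks of (64, 64, 256)
          stage1 = nn.Sequential(*[ResTCAMBlock(64 if i == 0 else 256, 64, 256)
                                   for i in range(3)])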

    • Table 2. Effect of γ on the model

      Model         | γ   | Top-1/% | GFlops
      SlowFast      | 64  | 92.68   | 27.99
                    | 96  | 93.12   | 41.99
                    | 128 | 94.55   | 55.98
      SlowFast+TCAM | 64  | 95.15   | 28.74
                    | 96  | 96.03   | 43.08
                    | 128 | 96.11   | 57.49
    • Table 3. Effects of different frame sampling rates on the model

      T×τ  | Top-1/% | GFlops
      2×32 | 92.89   | 14.86
      4×16 | 95.15   | 28.53
      8×8  | 96.73   | 56.79
    • Table 4. Effects of time-dependent attention and lateral connectivity on experimental results

      Model               | TCAM | Lateral | Lateral (Ours) | Top-1/%
      SlowFast (baseline) |      |         |                | 92.68
      SlowFast            |      |         |                | 93.79
      Slow Only           |      |         |                | 94.70
      Fast Only (Ours)    |      |         |                | 95.83
      SlowFast            |      |         |                | 95.15
    • Table 5. Comparison of recognition accuracy of TCAM with other methods

      Class        | Method                | Pretraining       | Spatial resolution | Backbone     | UCF101/% | HMDB51/%
      Baseline     | IDT+FV[1]             |                   |                    |              | 85.90    | 57.20
                   | IDT+BOVW[1]           |                   |                    |              | 87.90    | 61.10
                   | Two-stream[7]         | ImageNet          | 224×224            | ConvNets     | 88.00    | 59.40
                   | C3D[8]                |                   | 128×171            | C3D          | 85.2     |
      No-attention | Two-stream-I3D[9]     |                   | 224×224            | Inception V1 | 93.40    | 66.40
                   | Two-stream-I3D[9]     | ImageNet+Kinetics | 224×224            | Inception V1 | 98.00    | 80.70
                   | Hidden Two-stream[26] | UCF101            | 224×224            | I3D          | 97.10    | 78.70
                   | R(2+1)D[27]           | Kinetics          | 112×112            | ResNet34     | 94.97    |
      Attention    | AR3D_V1[22]           |                   | 112×112            | R3D          | 88.39    | 51.53
                   | AR3D_V2[22]           |                   | 112×112            | R3D          | 89.28    | 52.51
                   | STA-LSTM[10]          | ImageNet          | 224×224            | GoogLeNet    | 90.80    | 59.70
                   | STA-LSTM[10]          | Kinetics          | 224×224            | I3D          | 98.66    | 81.45
                   | Chen[28]              | Kinetics          | 224×224            | Inception-v3 | 96.13    | 74.45
                   | Chen[28]              | Kinetics          | 224×224            | BN-Inception | 96.18    | 75.17
                   | NST[29]               | ImageNet+Kinetics | 224×224            | I3D          | 96.00    | 76.10
                   | Ref. [11]             | ImageNet+Kinetics | 224×224            | ResNet50     | 96.3     | 77.7
                   | Ref. [12]             | ImageNet          | 128×128            | InceptionV3  | 94.9     |
                   | Ref. [13]             | ImageNet          | 299×299            | InceptionV3  | 94.9     | 70.8
                   | TCAM (Ours)           |                   | 224×224            | ResNet50     | 95.83    | 76.10
                   | TCAM (Ours)           | Kinetics          | 224×224            | ResNet50     | 98.16    | 82.30
    Citation: Kuan LIU, Wei WANG, Hong-ting SHEN, Hong-tao HOU, Min-zhen GUO, Zi-jiang LUO. Behavior recognition based on time-dependent attention[J]. Chinese Journal of Liquid Crystals and Displays, 2023, 38(8): 1095.
    Paper Information

    Category: Research Articles

    Received: Oct. 11, 2022

    Accepted: --

    Published Online: Oct. 9, 2023

    Author Email: Zi-jiang LUO (luozijiang@mail.gufe.edu.cn)

    DOI: 10.37188/CJLCD.2022-0330
