Interactive Spatiotemporal Token Attention Network for Skeleton-based General Interactive Action Recognition

Yuhang Wen1, Zixuan Tang1, Yunsheng Pang2, Beichen Ding1, Mengyuan Liu3
1Sun Yat-sen University  2Tencent Technology (Shenzhen) Co., Ltd.  3Shenzhen Graduate School, Peking University

General Interactive Actions Addressing the Diversity of Interacting Entities.
(Person-to-person, Hand-to-hand & Hand-to-object)

Abstract

Recognizing interactive actions plays an important role in human-robot interaction and collaboration. Previous methods use late fusion and co-attention mechanisms to capture interactive relations, which either limits their learning capability or makes them inefficient to adapt to more interacting entities. Because they assume that priors of each entity are already known, they also lack evaluation in a more general setting that addresses the diversity of subjects.

To address these problems, we propose an Interactive Spatiotemporal Token Attention Network (ISTA-Net), which simultaneously models spatial, temporal, and interactive relations. Specifically, our network contains a tokenizer that partitions the input into Interactive Spatiotemporal Tokens (ISTs), a unified way to represent the motions of multiple diverse entities. By extending the entity dimension, ISTs provide better interactive representations. To jointly learn along the three dimensions of ISTs, multi-head self-attention blocks integrated with 3D convolutions are designed to capture inter-token correlations. When modeling correlations, a strict entity ordering is usually irrelevant for recognizing interactive actions. To this end, Entity Rearrangement is proposed to eliminate the orderliness in ISTs for interchangeable entities.

Extensive experiments on four datasets verify the effectiveness of ISTA-Net, which outperforms state-of-the-art methods.
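The tokenization idea described in the abstract can be illustrated with a minimal sketch. It assumes skeleton data stored as nested lists indexed by entity, frame, and joint, and it assumes non-overlapping temporal windows. The function name `tokenize`, the `window` parameter, and the token layout are hypothetical choices made for illustration; they are not taken from the ISTA-Net implementation.

```python
def tokenize(skeletons, window):
    """Split multi-entity skeleton data into a flat list of tokens.

    skeletons[e][t][j] is the coordinate tuple of joint j of entity e at
    frame t. Each token (an assumed layout) concatenates the coordinates
    of one joint of one entity over `window` consecutive frames.
    """
    num_entities = len(skeletons)
    num_frames = len(skeletons[0])
    num_joints = len(skeletons[0][0])
    if num_frames % window != 0:
        raise ValueError("number of frames must be divisible by window")

    tokens = []
    for e in range(num_entities):                 # entity dimension
        for w in range(num_frames // window):     # temporal dimension
            for j in range(num_joints):           # spatial dimension
                token = []
                for t in range(w * window, (w + 1) * window):
                    token.extend(skeletons[e][t][j])
                tokens.append(token)
    return tokens


# Toy input: 2 entities, 4 frames, 3 joints, 3D coordinates.
toy = [
    [[(float(e), float(t), float(j)) for j in range(3)] for t in range(4)]
    for e in range(2)
]
tokens = tokenize(toy, window=2)
print(len(tokens), len(tokens[0]))
```

In this sketch every entity contributes its own tokens to one shared sequence, so adding an entity lengthens the sequence without changing the input format that a downstream attention block would see.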

What is a General Interactive Action



What ISTA-Net Features

  • Interactive Spatiotemporal Tokenization: A general solution to represent motion of multiple skeletons including diverse subjects, without the assumption that priors of each interacting entity are already known.
  • Entity Rearrangement: A simple yet effective way to ensure inherent permutation invariance for unordered interacting entities.
  • Token Self-Attention Blocks: Our architecture incorporates a multi-head self-attention mechanism to model the spatial, temporal, and interactive relationships simultaneously.
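As a rough illustration of Entity Rearrangement, the sketch below shuffles the order of interchangeable entities while leaving the others in place. This page does not spell out the mechanism, so applying a random shuffle per sample is an assumption here, and the names `rearrange_entities` and `interchangeable` are hypothetical.

```python
import random


def rearrange_entities(entities, interchangeable, rng=random):
    """Return a copy of `entities` with interchangeable ones shuffled.

    entities: list with one item per interacting entity (for example, one
        skeleton sequence per person, hand, or object).
    interchangeable: list of booleans; True marks entities whose order
        carries no meaning (assumed to be known to the caller).
    rng: any object with a shuffle() method, e.g. random.Random(seed).
    """
    positions = [i for i, flag in enumerate(interchangeable) if flag]
    shuffled = positions[:]
    rng.shuffle(shuffled)

    result = list(entities)
    for src, dst in zip(positions, shuffled):
        result[dst] = entities[src]   # move entity to its new slot
    return result


# Two persons may swap places; the object keeps its position.
sample = ["person_a", "person_b", "object"]
mixed = rearrange_entities(sample, [True, True, False], rng=random.Random(0))
print(mixed)
```

Exposing a model to both orderings of the same interaction is one plausible way to discourage it from relying on which entity happens to come first.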

Benchmark Difficulty



Visualizations

BibTeX

@INPROCEEDINGS{wen2023interactive,
      author={Wen, Yuhang and Tang, Zixuan and Pang, Yunsheng and Ding, Beichen and Liu, Mengyuan},
      booktitle={2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)}, 
      title={Interactive Spatiotemporal Token Attention Network for Skeleton-Based General Interactive Action Recognition}, 
      year={2023},
      volume={},
      number={},
      pages={7886-7892},
      doi={10.1109/IROS55552.2023.10342472}
}