News

  • 2025 Papers accepted to NeurIPS 2025, CVPR 2025, ACL 2025, and IJCV 2025.
  • 2024 Papers accepted to ICLR 2024, Pattern Recognition 2024, and CVPR Workshop 2024.
  • 2023 Papers accepted to ICLR 2023 and IEEE TIP 2023.
  • 2022 Paper accepted to IEEE TBIOM 2022.
  • 2021 Paper accepted to ICCV 2021.

Experience

Research Intern, TikTok (Video-Audio Joint Generation)
2025 - Present
Research Intern, Baidu Inc. (Multimodal Video Understanding)
2022 - 2025

Research Interests

Video-centric Computer Vision: Generation, Understanding, and Recognition.

Publications

JoVA: Unified Multimodal Learning for Joint Video-Audio Generation
Xiaohu Huang, Hao Zhou, Qiangpeng Yang, Shilei Wen, Kai Han
arXiv preprint arXiv:2512.13677, 2025
A unified framework for generating synchronized video and audio through multimodal learning.
MLLMs Need 3D-Aware Representation Supervision for Scene Understanding
Xiaohu Huang, Jingjing Wu, Qunyi Xie, Kai Han
NeurIPS 2025
Proposing 3D-aware representation supervision to enhance scene understanding capabilities in MLLMs.
PruneVid: Visual Token Pruning for Efficient Video Large Language Models
Xiaohu Huang, Hao Zhou, Kai Han
ACL 2025
An efficient visual token pruning method to accelerate Video Large Language Models without sacrificing performance.
Change3D: Revisiting Change Detection and Captioning from A Video Modeling Perspective
Duowang Zhu, Xiaohu Huang, Haiyan Huang, Hao Zhou, Zhenfeng Shao
CVPR 2025 Highlight
Revisiting change detection and captioning tasks through the lens of video modeling techniques.
Skim then Focus: Integrating Contextual and Fine-grained Views for Repetitive Action Counting
Zhengqi Zhao*, Xiaohu Huang*, Hao Zhou, Kun Yao, Errui Ding, Jingdong Wang, Xinggang Wang, Wenyu Liu, Bin Feng
IJCV 2025
A coarse-to-fine framework integrating contextual and fine-grained views for accurate repetitive action counting.
FROSTER: Frozen CLIP Is A Strong Teacher for Open-Vocabulary Action Recognition
Xiaohu Huang, Hao Zhou, Kun Yao, Kai Han
ICLR 2024
Leveraging frozen CLIP models as effective teachers to improve open-vocabulary action recognition.
What's in a Name? Beyond Class Indices for Image Recognition
Kai Han*, Xiaohu Huang*, Yandong Li, Sagar Vaze, Jie Li, Xuhui Jia
CVPR 2024 Workshop Spotlight
Investigating the semantic impact of using class names versus indices in image recognition tasks.
Graph Contrastive Learning for Skeleton-based Action Recognition
Xiaohu Huang, Hao Zhou, Bin Feng, Xinggang Wang, Wenyu Liu, Jian Wang, Haocheng Feng, Junyu Han, Errui Ding, Jingdong Wang
ICLR 2023
A graph contrastive learning framework designed to learn robust representations for skeleton-based action recognition.
Condition-Adaptive Graph Convolution Learning for Skeleton-Based Gait Recognition
Xiaohu Huang, Xinggang Wang, Zhidianqiu Jin, Bo Yang, Botao He, Bin Feng, Wenyu Liu
IEEE TIP 2023
A condition-adaptive graph convolution network for robust gait recognition under complex environmental variations.
Context-Sensitive Temporal Feature Learning for Gait Recognition
Xiaohu Huang, Duowang Zhu, Hao Wang, Xinggang Wang, Bo Yang, Botao He, Wenyu Liu, Bin Feng
ICCV 2021
Learning context-sensitive temporal features to improve the robustness of gait recognition.