The current state of domain adaptation in person re-ID

Onsights.io · 7 min read · Sep 13, 2021

CVPR 21 Highlights

Introduction

Person re-identification (PreID) is the task of associating images or video frames of the same person taken from different cameras, or from the same camera at different times. PreID has a number of applications, including automated contact tracing, missing child detection, and real-time athlete statistics at sporting events. PreID can be categorized along several dimensions, including level of supervision, number of sources, and modality. In this article, we consider unsupervised or weakly supervised, multi-source, unimodal PreID specifically (and drop the adjectives from here on).

Sample image motivating missing child detection through PreID. Photo by Chris Barbalis on Unsplash

Between October 2016 and January 2018, the state of the art in PreID leapt from ~0.5 mAP to ~0.9 mAP, exceeding human-level performance [1]. At present, top-ranking models achieve an mAP near 1. These results, however, are for a specific benchmark dataset, Market-1501, and do not reflect real-world PreID capability. To achieve human-level performance in the real world, PreID needs to overcome the unsupervised domain adaptation gap.
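For readers less familiar with the metric, mAP here is mean average precision over query images: for each query, gallery images are ranked by feature distance, the average precision of that ranking is computed, and the result is averaged over all queries. The minimal sketch below (illustrative function names, NumPy arrays assumed) conveys the idea; benchmark code such as the Market-1501 evaluation additionally filters out same-camera gallery images of the query identity, which is omitted here.

```python
import numpy as np

def average_precision(ranked_match_flags):
    """AP for one query: ranked_match_flags is a boolean array over the
    ranked gallery, True where the gallery identity matches the query."""
    n_relevant = ranked_match_flags.sum()
    if n_relevant == 0:
        return 0.0
    hits = np.cumsum(ranked_match_flags)
    precision_at_k = hits / (np.arange(len(ranked_match_flags)) + 1)
    return float((precision_at_k * ranked_match_flags).sum() / n_relevant)

def mean_average_precision(query_feats, gallery_feats, query_ids, gallery_ids):
    """mAP over all queries, ranking the gallery by Euclidean distance."""
    aps = []
    for feat, qid in zip(query_feats, query_ids):
        dists = np.linalg.norm(gallery_feats - feat, axis=1)
        order = np.argsort(dists)               # closest gallery images first
        matches = gallery_ids[order] == qid
        aps.append(average_precision(matches))
    return float(np.mean(aps))
```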

The UDA Gap

Unsupervised domain adaptation (UDA) is the task of training a statistical model on labeled data from one or more source domains so that it performs well on one or more target domains, with access to only unlabeled data in the target domain(s). The PreID UDA gap is a failure to transfer lessons learned from the source domain(s) to the target domain(s): a model that re-identifies people reliably in the source domain fails to do so in the target domain. Common causes of the UDA gap include changes in clothing, viewing angle, camera source, and lighting conditions.

In theory, this gap can be bridged through deeper abstraction (better generalization) of features during training. Four common approaches to addressing the PreID UDA gap are: 1) using GANs to style-transfer labeled data from the source to the target domain, 2) pseudo-labeling, which iteratively fine-tunes the re-ID model through rounds of clustering on unlabeled target data (a sketch of this follows below), 3) generalization through batch-instance normalization models, and 4) training on multiple source datasets. None of these approaches has fully resolved the UDA gap to date, which explains the large drop in mAP when training and testing on different datasets, compared to training and testing on, e.g., Market-1501 alone, or when testing on more challenging benchmarks, e.g., cloth-changing benchmarks.
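As a rough illustration of the second approach, pseudo-labeling alternates between clustering target-domain features and fine-tuning on the resulting cluster assignments. The sketch below assumes a generic PyTorch-style feature encoder and uses DBSCAN from scikit-learn; the fine-tuning step is intentionally left schematic and is not taken from any specific paper.

```python
import torch
from sklearn.cluster import DBSCAN

def pseudo_label_round(encoder, target_images, eps=0.6):
    """One round of pseudo-labeling: embed unlabeled target images,
    cluster them, and return (image, pseudo-label) pairs for fine-tuning."""
    encoder.eval()
    with torch.no_grad():
        feats = torch.cat([encoder(img.unsqueeze(0)) for img in target_images])
    feats = torch.nn.functional.normalize(feats, dim=1).cpu().numpy()
    labels = DBSCAN(eps=eps, min_samples=4, metric="euclidean").fit_predict(feats)
    # Keep only clustered samples; label -1 marks DBSCAN outliers.
    return [(img, lab) for img, lab in zip(target_images, labels) if lab != -1]

# Iterative fine-tuning: cluster, train on pseudo-labels, repeat.
# `fine_tune` is a placeholder for a standard supervised re-ID training step.
# for r in range(num_rounds):
#     pseudo_set = pseudo_label_round(encoder, target_images)
#     fine_tune(encoder, pseudo_set)
```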

A Complicating Factor

A closed-set problem is one in which no unknown classes arise during testing: in closed-set object classification, every test image either contains one or more of the known classes or none of them. PreID is an open-set problem, as unknown classes (specific persons) will appear at test time. Since inter-class differences are far more subtle for PreID than for broad object detection, directly applying closed-set object detection solutions to PreID may “pull different identities close in the feature space” [2]. Addressing the UDA gap is therefore complicated by this property of PreID and requires models specifically designed for an open-set problem.

A Few CVPR21 PreID Stand-Outs

IEEE’s CVPR conference is one of the most popular venues in the computer vision community. Twenty-six PreID papers were accepted to CVPR21, showcasing advances over the past year. Below is a summary of three papers, each advancing a different aspect of PreID.

Bai et al. [2] attempt to advance PreID given multiple source datasets. To do so, the authors extend mutual mean-teaching (MMT) [5], a pseudo-labeling approach, with two modules: rectification domain-specific batch normalization (RDSBN), which reduces domain-specific characteristics and increases the distinctiveness of person features, and multi-domain information fusion (MDIF), which minimizes domain distances by fusing features of different domains (see Figure 1). The authors’ model, MMT+Ours, is evaluated on the Market-1501, DukeMTMC, CUHK03, and MSMT benchmark datasets. Bai et al. conduct an ablation study to determine the gains made by increasing the number of source datasets (Duke; Duke+CUHK; Duke+CUHK+MSMT) for a fixed target dataset (Market-1501) across four models (direct transfer of a pre-trained model at test time, MMT(DBSCAN), MMT-with-source, and MMT+Ours). Results indicate that every model improves as the number of source datasets increases, but MMT+Ours outperforms the three comparison models by a large margin. The authors do not evaluate MMT+Ours where the source and target datasets are the same, making it difficult to compare against models that do. The best result in the paper is mAP=87.3 (source: CUHK, target: Market-1501), roughly 10 points lower than current state-of-the-art models trained and tested on Market-1501, indicating the UDA gap is still present.

Figure 1 — Overall architecture of Bai et al., which adds the RDSBN and MDIF modules to mutual mean-teaching [2].
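To give a sense of the batch-normalization piece, a heavily simplified domain-specific BN layer can keep separate normalization statistics per source domain. The class below is only an illustrative stand-in for the module described in the paper; RDSBN’s rectification step is omitted.

```python
import torch.nn as nn

class DomainSpecificBN(nn.Module):
    """Simplified domain-specific batch norm: one BN layer per source domain.
    Illustration only; RDSBN additionally rectifies the normalized features
    to suppress domain-specific style."""
    def __init__(self, num_features, num_domains):
        super().__init__()
        self.bns = nn.ModuleList(nn.BatchNorm2d(num_features) for _ in range(num_domains))

    def forward(self, x, domain_idx):
        # Each mini-batch is assumed to come from a single source domain.
        return self.bns[domain_idx](x)
```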

Hong et al. [3] investigate PreID amid changes in clothing, a challenging aspect of the UDA gap since a change of clothing simultaneously produces large intra-class variation and small inter-class variation. The researchers propose a Fine-grained Shape-Appearance Mutual learning framework (FSAM), a two-stream model that learns body-shape features in a shape stream and transfers them to an appearance stream to complement the cloth-specific knowledge of the appearance features (see Figure 2). Specifically, in the shape stream, FSAM generates masks to extract body-shape features that are passed to a pose-specific (front, rear, and side views) multi-branch network. FSAM’s results on the standard cloth-changing benchmark datasets are: PRCC (R-1=38.5, mAP=16.2), LTCC (R-1=54.5, mAP not reported), and VC-Clothes (R-1=78.6, mAP=78.9). FSAM was also evaluated on non-cloth-changing benchmarks: Market-1501 (R-1=94.6, mAP=85.6) and DukeMTMC (results withheld here, as the dataset has been retracted). Despite state-of-the-art results on the cloth-changing benchmarks, there remains a large gap relative to the non-cloth-changing benchmarks. The significant R-1 and mAP gap between the two sets of benchmarks implies that further research in this PreID sub-task is warranted.

Figure 2 — Architecture of Hong et al., which has appearance and shape streams that are fused during training [3].
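The mutual-learning idea can be sketched as two encoders whose embeddings are pushed toward each other during training. The dense, fine-grained interaction that FSAM performs at intermediate layers is not reproduced here, so treat the following as a schematic under that assumption.

```python
import torch.nn.functional as F

def mutual_learning_loss(appearance_feat, shape_feat):
    """Schematic mutual-learning objective: pull the appearance stream's
    embedding toward the shape stream's embedding (and vice versa), so
    cloth-invariant body-shape cues leak into the appearance features."""
    a = F.normalize(appearance_feat, dim=1)
    s = F.normalize(shape_feat, dim=1)
    # Symmetric distillation: stop gradients on the "teacher" side of each term.
    return (1 - (a * s.detach()).sum(dim=1)).mean() + \
           (1 - (s * a.detach()).sum(dim=1)).mean()
```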

Li et al. [4] investigate PreID amid partial occlusion, which is challenging given the myriad types and degrees of occlusion. The authors propose the Part-Aware Transformer (PAT), a weakly supervised encoder-decoder transformer architecture that learns part prototypes, each acting as a part classifier that determines whether pixels belong to a learned part. The model learns part-aware masks by optimizing the similarity between part prototypes and all pixels in the feature map. PAT was evaluated against previous state-of-the-art methods in occluded PreID. Key results include: Occluded-REID (R-1=81.6, mAP=72.1), Partial-iLIDS (R-1=76.5, mAP=88.2), and Market-1501 (R-1=95.4, mAP=88.0). While PAT outperforms previous occlusion-focused approaches, it does not achieve state-of-the-art PreID on the standard Market-1501 benchmark.

Figure 3 — Architecture of Li et al., which comprises a pixel content-based transformer encoder and a part prototype-based transformer decoder [4].
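The part-prototype mechanism can be approximated as a set of learned vectors compared against every pixel of the feature map; the resulting similarity maps act as soft part masks used to pool part-level features. The sketch below is a simplification of the transformer-decoder formulation in the paper, not the authors' implementation.

```python
import torch
import torch.nn as nn

class PartPrototypePooling(nn.Module):
    """Simplified part-aware pooling: each learned prototype produces a soft
    mask over the feature map (pixel-prototype similarity), and part features
    are mask-weighted averages of the pixels. PAT learns its prototypes
    through a transformer decoder; this is only an illustrative reduction."""
    def __init__(self, dim, num_parts):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(num_parts, dim))

    def forward(self, feat_map):                 # feat_map: (B, C, H, W)
        pixels = feat_map.flatten(2)             # (B, C, H*W)
        sim = torch.einsum("pc,bcn->bpn", self.prototypes, pixels)
        masks = sim.softmax(dim=-1)              # soft part masks over pixels
        parts = torch.einsum("bpn,bcn->bpc", masks, pixels)
        return parts                             # (B, num_parts, C) part features
```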

Conclusions

PreID is prone to the UDA gap, which is further complicated by its open-set nature. A change of appearance, posture, or camera angle can cause poor PreID performance, and accuracy can further degrade over time as new classes (persons) are continually observed. Given these challenges, PreID consistently draws time and attention from the computer vision community. CVPR21 highlighted a number of advances that address aspects of the unsupervised domain adaptation gap. We reviewed three papers that tackle, respectively, the use of multiple source datasets, changes of clothing, and partial occlusion. These papers demonstrate sound progress in PreID, but also indicate that ongoing research is needed before PreID exceeds human-level performance in the real world.

References

[1] Xuan Zhang, Hao Luo, Xing Fan, Weilai Xiang, Yixiao Sun, Qiqi Xiao, Wei Jiang, Chi Zhang, Jian Sun; AlignedReID: Surpassing Human-Level Performance in Person Re-Identification (arXiv), 2018.

[2] Zechen Bai, Zhigang Wang, Jian Wang, Di Hu, Errui Ding; Unsupervised Multi-Source Domain Adaptation for Person Re-Identification; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 12914–12923

[3] Peixian Hong, Tao Wu, Ancong Wu, Xintong Han, Wei-Shi Zheng; Fine-Grained Shape-Appearance Mutual Learning for Cloth-Changing Person Re-Identification; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 10513–10522

[4] Yulin Li, Jianfeng He, Tianzhu Zhang, Xiang Liu, Yongdong Zhang, Feng Wu; Diverse Part Discovery: Occluded Person Re-Identification With Part-Aware Transformer; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 2898–2907

[5] Yixiao Ge, Dapeng Chen, Hongsheng Li; Mutual Mean-Teaching: Pseudo Label Refinery for Unsupervised Domain Adaptation on Person Re-identification; International Conference on Learning Representations (ICLR), 2020

About the Author

Christopher Farah is a Senior Data Scientist at Anno.Ai. Chris has over 14 years of experience conducting spatial data mining research in the healthcare and national security sectors. Chris has a Bachelor's in Chemical Engineering from The Cooper Union, an MA in Mathematics from St. Louis University, and a PhD in Spatial Information Science and Engineering from the University of Maine.

Be sure to follow us on Twitter and LinkedIn!

