Re-id: data transforms that work and those that don’t
This blog post describes experiments that evaluate data transformations (called transforms in the ML community) for their ability to improve re-identification (re-id). So what is a transform and what is re-id? To answer the former, consider Anno mascot Potato, centered in the transform wheel in Figure 1. The center image undergoes 6 transforms: manipulations that modify one or more aspects of the image. (See caption for details.)
As for re-id, it is the computer vision task of associating images or video frames of the same entity taken from different cameras or from the same camera at different times. Applications include missing persons detection and automated contact tracing. (See my blog on the state of re-id in 2021 for some more detail.) The better the re-id model, the more likely its applications are to succeed. So why would transforms impact a re-id model’s performance? A transform is selected in the hope of improving model generalization. For example, if we augment our image training data with rotations of the original images, we do so because we hope for rotation invariance: the model provides the same output for a given object, even after that object has been rotated.
Interestingly, a transform can also help us understand what is part of the natural distribution of the data used for training and evaluation. We pose two questions relative to the re-id task:
- Are there transforms that improve a re-id model’s performance?
- How do the transforms inform the distribution of the evaluation data?
To address these questions, we will train the popular Omni-Scale Net Instance Batch Normalization (OSNet IBN) model using 8 different data transforms, plus a baseline that applies no transforms at all (aside from the usual resizing and normalization). Evaluation will be conducted by measuring mean average precision (mAP) and mean inverse negative penalty (mINP). See the Evaluation Metrics section for detail.
- Input: Pre-trained OSNet IBN, training and evaluation data, and 8 transforms, each applied while transfer learning the last 2 layers.
- Output: Model ranking in terms of mAP and mINP, per transform and re-id context.
As described above, a transform is an image transformation. Most commonly, transforms are isometries (distance is preserved) or affine transformations (collinearity and parallelism are preserved). Examples of isometries include translations and reflections (reflection Potatoes at 3 and 9 o’clock in Figure 1); examples of affine transformations that are not isometries include dilations and shears (perspective Potato at 7 o’clock in Figure 1 resembles a shear, although a general perspective transform is projective rather than affine). However, there are transforms that are neither of the above (color jitter Potato at 1 o’clock in Figure 1; see here for more examples). The transforms we consider are: horizontal flip, vertical flip, crop, patch, color jitter, color invert, affine, and perspective. For details, see the Torchreid transforms module and the torchvision transforms. Examples of each are presented in Figure 2, using images from Market-1501 and AICity21 to illustrate.
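The isometry versus affine distinction can be checked numerically. The sketch below is only an illustration with NumPy on three hand-picked points; the helper names (`transform_points`, `pairwise_distances`, `is_collinear`) and the two matrices are assumptions made for this example and are not part of Torchreid or torchvision.

```python
import numpy as np

# Three collinear points, one (x, y) pair per row.
points = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])

# A horizontal flip about the y-axis (an isometry).
reflection = np.array([[-1.0, 0.0],
                       [0.0, 1.0]])

# A horizontal shear (affine, but not an isometry).
shear = np.array([[1.0, 0.5],
                  [0.0, 1.0]])


def transform_points(matrix, pts):
    """Apply a 2x2 linear map to every row of pts."""
    return pts @ matrix.T


def pairwise_distances(pts):
    """Euclidean distance between every pair of points."""
    diff = pts[:, None, :] - pts[None, :, :]
    return np.linalg.norm(diff, axis=-1)


def is_collinear(pts, tol=1e-9):
    """Points are collinear if, relative to the first point, they span at most one direction."""
    centered = pts - pts[0]
    return np.linalg.matrix_rank(centered, tol=tol) <= 1


flipped = transform_points(reflection, points)
sheared = transform_points(shear, points)

# The reflection keeps all distances; the shear changes them.
print(np.allclose(pairwise_distances(flipped), pairwise_distances(points)))
print(np.allclose(pairwise_distances(sheared), pairwise_distances(points)))

# Both keep the three points on a single line.
print(is_collinear(flipped), is_collinear(sheared))
```

Color jitter and color invert do not fit this picture at all, since they change pixel values rather than pixel positions.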
Rather than discuss the sneak preview of results that we’ve provided with the transform examples, we’ll defer speculation to the Discussion.
Evaluation Metrics
We evaluate each re-id model through metrics that measure the quality of gallery images returned for each query image (see Figure 3 for an example). To do so, we use a commonly reported metric, mAP, and a newer metric that is increasing in popularity, mINP.
Mean Average Precision (mAP)
Average precision (AP) is the weighted mean of the precisions (P) achieved at each threshold, where the weight is the change in recall (R) between the current and previous threshold. In this way, it acts as a summary metric of the precision-recall curve. AP is computed over N threshold values for each query i in the set of query images Q. The mean average precision is the mean of these AP scores over all queries in Q. The relationship between mAP, AP, and ΔR is:
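Written out from the definitions above, the relationship takes the following form. The subscripted notation is an assumption made here: P_{i,n} and R_{i,n} denote precision and recall for query i at the n-th threshold, with R_{i,0} = 0.

```latex
\Delta R_{i,n} = R_{i,n} - R_{i,n-1}, \qquad
\mathrm{AP}_i = \sum_{n=1}^{N} P_{i,n}\, \Delta R_{i,n}, \qquad
\mathrm{mAP} = \frac{1}{|Q|} \sum_{i \in Q} \mathrm{AP}_i
```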
One shortcoming of mAP is the potential overemphasis on easy gallery matches in relation to the query. This brings us to mINP.
Mean Inverse Negative Penalty (mINP)
The mean inverse negative penalty is based on the negative penalty (NP). To compute NP for a query, subtract the number of correct gallery (G) matches from the position of the hardest (last) match (H), and divide the result by the position of the hardest match. mINP is then the mean of one minus the NP score over all queries in Q; taking one minus NP is done to provide the same range as mAP. The relationship between mINP and NP is:
(Note: I use H for the hardest match, as opposed to the mINP creator’s use of R, to prevent confusion with recall.) By design, mINP determines how good a model is at returning all gallery matches, including the difficult cases, and is not prone to overemphasizing easy gallery matches. However, mINP does not discriminate model performance as well for large galleries, which is a good reason to keep mAP.
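Written out from the definitions above, with H_i the rank of the hardest match and |G_i| the number of correct gallery matches for query i (the subscripts are notation assumed here), this reads:

```latex
\mathrm{NP}_i = \frac{H_i - |G_i|}{H_i}, \qquad
\mathrm{mINP} = \frac{1}{|Q|} \sum_{i \in Q} \left(1 - \mathrm{NP}_i\right)
             = \frac{1}{|Q|} \sum_{i \in Q} \frac{|G_i|}{H_i}
```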
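To see how the two metrics can disagree, here is a minimal sketch in plain Python that scores two toy rankings. It assumes each query's result is given as a list of 0/1 flags in rank order (1 meaning the gallery image at that rank is a correct match), and it treats every rank as a threshold. The function names and the toy rankings are made up for illustration; this is not the Torchreid evaluation code.

```python
from typing import Sequence


def average_precision(matches: Sequence[int]) -> float:
    """AP for one query, using every rank as a threshold."""
    total = sum(matches)
    if total == 0:
        return 0.0
    hits = 0
    ap = 0.0
    for rank, is_match in enumerate(matches, start=1):
        if is_match:
            hits += 1
            # Recall only changes at a match, by 1 / total each time.
            ap += (hits / rank) * (1.0 / total)
    return ap


def inverse_negative_penalty(matches: Sequence[int]) -> float:
    """1 - NP for one query, where NP = (H - |G|) / H."""
    total = sum(matches)
    if total == 0:
        return 0.0
    hardest = max(rank for rank, m in enumerate(matches, start=1) if m)
    return 1.0 - (hardest - total) / hardest


def mean_over_queries(metric, rankings) -> float:
    """Average a per-query metric over all queries."""
    return sum(metric(r) for r in rankings) / len(rankings)


# Two easy matches on top, but the hardest match is ranked last.
ranking_a = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]
# No match at rank 1, but all three matches appear within the top five.
ranking_b = [0, 1, 0, 1, 1, 0, 0, 0, 0, 0]

for name, ranking in [("A", ranking_a), ("B", ranking_b)]:
    print(name,
          round(average_precision(ranking), 3),
          round(inverse_negative_penalty(ranking), 3))

print(round(mean_over_queries(average_precision, [ranking_a, ranking_b]), 3))
print(round(mean_over_queries(inverse_negative_penalty, [ranking_a, ranking_b]), 3))
```

Ranking A scores higher on AP (about 0.77 versus 0.53) because of its two easy matches, while ranking B scores higher on inverse negative penalty (0.6 versus 0.3) because its hardest match is found much earlier.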
Experiments and Results
All experiments were performed on an Ubuntu 20.04 machine with a GeForce RTX 3080 GPU using the Torchreid repository, with some additions to load the AICity21 data, incorporate mINP, expand the transforms used, and format the results. Models were trained for 10 epochs using a batch size of 256 during training and 64 during evaluation.
Feature extraction results shown in Figure 4 summarize model performance in terms of mAP and mINP for OSNet IBN, trained using AICity21 or Market-1501 for each of 8 transforms, along with no transform, annotated None, as the baseline. (For a dynamic view, please follow this link.)
Discussion of Results
Referring to Figures 2 and 4, we discuss model performance in relation to the transforms and training datasets. First, color jitter, horizontal flip, and perspective improve the performance of OSNet IBN when evaluated for person re-id and for vehicle re-id. This makes sense from a data distribution perspective, as slight variations in color, horizontal flips, and synthetic changes in perspective result in images that fall in line with the natural data distribution, for both AICity21 and Market-1501. By contrast, color invert, vertical flip, and affine change the original image so severely that the transformed image is pushed out of the respective data distribution.
Figures 2 and 4 also highlight a difference. Whereas crop benefitted the AICity21-trained model, it did not benefit the Market-1501-trained model. Meanwhile, patch had the opposite effect, yielding the best model for person re-id and a poorer performing model for vehicle re-id. In reviewing the datasets, Market-1501 has more instances of partial occlusions (pedestrians, bicycles, backpacks) than AICity21, where occlusions do occur (a vehicle passing under a traffic light) but are far less common. Thus, simulating occlusions benefits person re-id but not vehicle re-id. Even after reviewing the data, it is not clear why the AICity21-trained model that used the crop transform performed well.
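To make the occlusion argument concrete, a patch-style transform can be sketched as pasting a region from one image onto another. The code below is a simplified NumPy illustration under stated assumptions: the function names, the `patch_frac` parameter, and the blank stand-in images are hypothetical, and this is not the implementation used by Torchreid or in these experiments.

```python
import numpy as np


def paste_patch(image, patch, top, left):
    """Return a copy of image (H x W x C) with patch pasted at (top, left)."""
    out = image.copy()
    h, w = patch.shape[:2]
    out[top:top + h, left:left + w] = patch
    return out


def random_patch(image, donor, rng, patch_frac=0.3):
    """Occlude image with a randomly placed region cut from donor.

    Assumes donor is at least as large as the patch being cut.
    """
    height, width = image.shape[:2]
    ph = max(1, int(height * patch_frac))
    pw = max(1, int(width * patch_frac))

    # Pick where to cut the patch from the donor image.
    donor_h, donor_w = donor.shape[:2]
    src_top = rng.integers(0, donor_h - ph + 1)
    src_left = rng.integers(0, donor_w - pw + 1)
    patch = donor[src_top:src_top + ph, src_left:src_left + pw]

    # Pick where to paste it on the target image.
    top = rng.integers(0, height - ph + 1)
    left = rng.integers(0, width - pw + 1)
    return paste_patch(image, patch, top, left)


rng = np.random.default_rng(0)
person = np.zeros((256, 128, 3), dtype=np.uint8)       # stand-in for a person crop
donor = np.full((256, 128, 3), 255, dtype=np.uint8)    # stand-in for another image
occluded = random_patch(person, donor, rng)

# Fraction of the image covered by the pasted patch.
print((occluded == 255).mean())
```

With real training images, the pasted region plays the role of a backpack, bicycle, or passer-by partially covering the subject.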
Caveats and Miscellany
As these were same-domain experiments (experiments in which source and target data have the same(ish) distribution), we can’t give recommendations on transforms to use in the cross-domain setting. However, color jitter and horizontal flip (the Torchreid defaults for cross-domain model training), along with perspective, and potentially crop or patch, are worth investigating.
In addition to applying transforms to improve model performance, generative adversarial networks (GANs) can be used to generate life-like synthetic examples, should the training data have gaps in examples that can’t be filled through transforms alone.
It is possible that, in addition to the nature of the data, OSNet IBN’s architecture lends itself to the best performing transforms and that other transforms would perform better for a different re-id model. Hence, explore the use of data transforms for model training with care, and a bit of mirth.
About the Author
Christopher Farah is a Senior Data Scientist at Anno.Ai. Chris has over 14 years of experience conducting computer science research in the healthcare and national security sectors. Chris has a Bachelor’s in Chemical Engineering from The Cooper Union, an MA in Mathematics from St. Louis University, and a PhD in Spatial Information Science and Engineering from the University of Maine.
About the Data
While Market-1501 is publicly available, AICity data is only available with a data use agreement and for non-commercial use only. For details, see here.