Novel Approaches to Suppress Tracking Drift Caused by Similar-looking Distractors

Student thesis: Doctoral Thesis, Doctor of Philosophy


Visual tracking is a fundamental task in computer vision with numerous applications in surveillance, self-driving vehicles and UAV-based monitoring. It is the task of locating the same moving object in each frame of a video sequence, given only the initial appearance of the target object. Most modern trackers treat visual tracking as a classification problem. They learn an appearance model of the target from the initial frame, then use a cross-correlation operation to distinguish the target from the background and other objects and so predict its location in the following frames. Although they ignore other potentially useful sources of information, such as the previous motion trajectory, these Tracking-by-Detection methods achieve impressive performance. However, they can fail when the appearance model misidentifies a similar-looking object (a “distractor”) as the target. This thesis focuses on this tracking drift problem caused by distractors, and proposes three novel solutions:


This thesis demonstrates that increasing the shape bias of a CNN can improve the robustness of template matching and object tracking. A data augmentation technique for image classification, which replaces the texture in the original image with the style of a randomly selected painting through neural style transfer, was first used to train four CNN models with the same network structure but different shape sensitivity. These models were used as feature extractors to obtain deep features for template matching. The results show that training a CNN to learn about texture cues, while biasing it to be more sensitive to shape cues, can improve template matching performance. Given that most current state-of-the-art trackers use a template matching technique (the cross-correlation operation) to locate the target, these results suggest that the same data augmentation method could also improve the matching performance of these trackers. This hypothesis was verified by training state-of-the-art trackers using the same dataset augmentation method. The results show that training trackers to learn about texture cues, while biasing them to be more sensitive to shape cues, can improve their performance.


This thesis proposes a novel online predictor that uses the appearance of distractors detected in previous frames. These are represented as additional appearance models with the same size as the target appearance model. The predicted location of the target takes into consideration not only the tracked object, but also the distractors detected in previous frames. As a consequence, matches between the target appearance model and the surrounding background are suppressed, and the identification of the target is more reliable.
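
The idea of scoring each location against both the target and the stored distractors can be illustrated with a minimal sketch. The abstract does not state how the responses are combined, so the rule used here (subtracting a weighted copy of the strongest distractor response from the target response) is an assumption, and the names `cross_correlate`, `distractor_aware_response`, `locate` and `weight` are hypothetical.

```python
import numpy as np

def cross_correlate(search, template):
    """Slide a (th, tw, C) template over an (H, W, C) feature map
    and return the (H-th+1, W-tw+1) map of dot-product responses."""
    sh, sw, _ = search.shape
    th, tw, _ = template.shape
    out = np.empty((sh - th + 1, sw - tw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(search[y:y + th, x:x + tw] * template)
    return out

def distractor_aware_response(search, target_tpl, distractor_tpls, weight=0.5):
    """Target response minus a weighted copy of the strongest distractor
    response at each location. The subtraction rule and `weight` are
    illustrative assumptions, not the combination used in the thesis."""
    target_resp = cross_correlate(search, target_tpl)
    if len(distractor_tpls) == 0:
        return target_resp
    distractor_resp = np.max(
        [cross_correlate(search, d) for d in distractor_tpls], axis=0)
    return target_resp - weight * distractor_resp

def locate(response):
    """Return the (row, col) of the highest response."""
    idx = np.unravel_index(np.argmax(response), response.shape)
    return tuple(int(i) for i in idx)
```

In this sketch the distractor templates have the same shape as the target template, so all response maps have the same size and can be combined location by location.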


This thesis proposes a novel motion predictor that uses a probabilistic map giving the likelihood that the object is present at each location in the search region. The probabilistic map is generated from the state estimated by a Kalman filter. Unlike previous approaches, the movement of the background between two consecutive frames is estimated, and the filter takes this global movement as input at every frame to compensate for the motion of the camera, which improves the robustness of the Kalman filter. The resulting probabilistic map narrows down the range of locations where the target might appear in the current frame and hence improves tracking performance.
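
A minimal sketch of such a camera-compensated predictor is given below. It assumes a constant-velocity state [x, y, vx, vy], feeds the estimated global background shift into the prediction step as a control input, and builds the map as a normalised Gaussian centred on the predicted position. The state model, the noise values `q` and `r`, the Gaussian form of the map, and all names are assumptions for illustration; the abstract does not specify them.

```python
import numpy as np

class CameraCompensatedKalman:
    """Constant-velocity Kalman filter whose prediction step is shifted by
    the estimated global (background) motion. All parameters are assumed."""

    def __init__(self, x0, y0, q=1.0, r=4.0):
        self.x = np.array([x0, y0, 0.0, 0.0])        # state: x, y, vx, vy
        self.P = np.eye(4) * 10.0                    # state covariance
        self.F = np.array([[1, 0, 1, 0],
                           [0, 1, 0, 1],
                           [0, 0, 1, 0],
                           [0, 0, 0, 1]], dtype=float)
        self.B = np.array([[1, 0],
                           [0, 1],
                           [0, 0],
                           [0, 0]], dtype=float)     # shift affects position only
        self.H = np.array([[1, 0, 0, 0],
                           [0, 1, 0, 0]], dtype=float)
        self.Q = np.eye(4) * q                       # process noise
        self.R = np.eye(2) * r                       # measurement noise

    def predict(self, global_shift):
        """Predict the next state; global_shift = (dx, dy) of the background."""
        u = np.asarray(global_shift, dtype=float)
        self.x = self.F @ self.x + self.B @ u
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:2].copy()

    def update(self, measured_xy):
        """Correct the state with the tracker's measured target position."""
        z = np.asarray(measured_xy, dtype=float)
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ (z - self.H @ self.x)
        self.P = (np.eye(4) - K @ self.H) @ self.P

def probability_map(kf, height, width):
    """Normalised Gaussian over a height x width grid, centred on the
    predicted position with the filter's position covariance."""
    mean, cov = kf.x[:2], kf.P[:2, :2]
    ys, xs = np.mgrid[0:height, 0:width]
    diff = np.stack([xs - mean[0], ys - mean[1]], axis=-1)
    expo = np.einsum('...i,ij,...j->...', diff, np.linalg.inv(cov), diff)
    pmap = np.exp(-0.5 * expo)
    return pmap / pmap.sum()
```

One plausible use, again an assumption, is to multiply the tracker's response map element-wise by this map so that candidates far from the predicted position are down-weighted.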

These methods can be used individually or in combination. When any one of them is added to an existing tracker, including Super_DiMP, ARSuper_DiMP, SiamFC and SiamFC++, the performance of the underlying tracker improves in every case. This indicates that the three methods transfer well and are potentially general approaches that could improve the performance of most current visual trackers that use a Tracking-by-Detection approach. Using them in combination is shown to improve the performance of the current state-of-the-art tracker ARSuper_DiMP. Specifically, when all three methods are used with ARSuper_DiMP, the resulting tracker achieves new state-of-the-art performance on three benchmark datasets: OTB-100, NFS and LaSOT.

Date of Award: 1 Sept 2022
Original language: English
Awarding Institution: King's College London
Supervisor: Michael Spratling (Supervisor) & Kathleen Steinhofel (Supervisor)
