Few-shot Semantic Segmentation in Images

Student thesis: Doctoral Thesis, Doctor of Philosophy

Abstract

Computer vision is a field of computer science that focuses on identifying and understanding images. In the realm of computer vision, semantic segmentation is an important pixel-level classification task. However, achieving precise pixel-level results continues to depend on the availability of extensive datasets. This thesis focuses on the task of few-shot segmentation (FSS), which aims to segment unseen classes with only a few annotated samples. The main challenge of FSS is that the limited annotated data are often insufficient to provide the necessary information for the accurate segmentation of novel classes. Multiple methods are proposed that leverage various characteristics of FSS to address this challenge.

Most current FSS methods focus on using object information extracted, with the aid of human annotators, from support images to identify the same objects in new query images. However, background information can also be useful to distinguish objects from their surroundings. A method named CobNet is proposed that utilizes information about the background extracted from the query images without annotations. A cross-attention mechanism is designed to enable the model to learn the relationship between foreground and background, which can supplement the limited information available for novel classes.
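
As a rough illustration of the cross-attention idea mentioned above, the following sketch implements plain single-head scaled dot-product cross-attention in pure Python. It is not CobNet: there are no learned projections, and the names `cross_attention`, `queries`, `keys` and `values` are illustrative assumptions. In a setting like the one described, the queries might be foreground features and the keys and values background features, but the abstract does not state how CobNet builds them.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention (no learned weights).

    queries: list of vectors that ask for information (assumed: one set of features)
    keys, values: lists of vectors that supply it (assumed: the other set)
    Returns one output vector per query: a weighted average of `values`.
    """
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        # Similarity of this query to every key.
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        # Blend the value vectors according to the attention weights.
        blended = [
            sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))
        ]
        outputs.append(blended)
    return outputs


# A zero query attends equally to every key, so the output is the mean value.
attended = cross_attention(
    queries=[[0.0, 0.0]],
    keys=[[1.0, 0.0], [0.0, 1.0]],
    values=[[2.0, 0.0], [0.0, 4.0]],
)
```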

Experimental results demonstrate that this method achieves mean Intersection-over-Union (mIoU) scores of 61.4% and 37.8% for 1-shot segmentation on PASCAL-5i and COCO-20i, respectively. In addition, for weakly supervised few-shot segmentation, in which weak or no annotations are provided for the novel class images, CobNet achieves an mIoU of 53.7% for 1-shot segmentation on PASCAL-5i without any support annotations. Furthermore, using bounding boxes as a form of weak supervision yields an mIoU of 55.5% for 1-shot segmentation on PASCAL-5i.
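
For readers unfamiliar with the metric, the sketch below shows how an IoU score for one binary mask and a mean over several scores can be computed. The function names and the toy masks are illustrative assumptions. The abstract does not describe the exact evaluation protocol, for example how scores are aggregated over test episodes and classes, so this shows only the basic definition.

```python
def binary_iou(pred, target):
    """IoU between two binary masks given as nested lists of 0/1."""
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # Assumed convention: two empty masks count as a perfect match.
    return intersection / union if union else 1.0


def mean_iou(ious):
    """Plain average of a non-empty list of IoU scores."""
    return sum(ious) / len(ious)


# Toy 2x3 masks: 2 pixels overlap, 4 pixels are covered by either mask.
pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
iou = binary_iou(pred, target)   # 2 / 4
miou = mean_iou([iou, 1.0])      # average with a second, perfect score
```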

Because only a limited number of annotated images are available for novel classes, current FSS methods often misclassify query background pixels that are similar to the support object, leading to a high rate of false positives. To address this issue, a method named QSem is proposed to extract more semantics from the query image itself. This method modifies the training process to associate prototypes with class labels, including known classes from the training data and latent classes representing unknown background objects. The class information is then used to extract a background prototype from the query image, which helps to improve the discrimination between foreground and background features. Experiments achieve mIoU results of 62.7% and 66.7% for 1-shot and 5-shot segmentation on PASCAL-5i, and 37.4% and 44.1% on COCO-20i. Moreover, this approach operates only during training, so results are produced with no extra computational complexity during testing. In addition, for weakly supervised few-shot segmentation, QSem achieves an mIoU of 54.8% for 1-shot segmentation on PASCAL-5i without any support annotations. When QSem uses bounding boxes as a form of weak supervision, it achieves an mIoU of 58.5% for 1-shot segmentation on PASCAL-5i.
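
To make the role of foreground and background prototypes concrete, here is a minimal sketch of generic prototype matching, a common pattern in FSS: a prototype is the average of the feature vectors under a mask, and each query pixel takes the label of the more similar prototype. This is not QSem itself. How QSem derives its background prototype from the query image is not reproduced, so the background prototype is simply passed in, and all names and feature values are hypothetical.

```python
import math


def masked_average(features, mask):
    """Average the feature vectors whose mask entry is 1 (a 'prototype')."""
    selected = [vec for vec, m in zip(features, mask) if m]
    if not selected:
        raise ValueError("mask selects no pixels")
    dim = len(selected[0])
    return [sum(vec[i] for vec in selected) / len(selected) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def segment(query_features, fg_proto, bg_proto):
    """Label a pixel 1 if it is closer to the foreground prototype, else 0."""
    return [
        1 if cosine(f, fg_proto) > cosine(f, bg_proto) else 0
        for f in query_features
    ]


# Hypothetical 2-D features, one vector per pixel.
support_features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
support_mask = [1, 1, 0]
fg_proto = masked_average(support_features, support_mask)
bg_proto = [0.0, 1.0]  # assumed given; stands in for a query-derived prototype

query_features = [[1.0, 0.2], [0.1, 0.9], [0.6, 0.5]]
labels = segment(query_features, fg_proto, bg_proto)
```

A sharper background prototype moves the decision boundary away from background pixels that merely resemble the support object, which is the false-positive problem the paragraph above describes.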

No-masks supervised few-shot segmentation is a task in which no masks are available for any image, further reducing annotation requirements. A real-world example of no-masks supervised FSS is industrial anomaly detection, a task that inherently lacks annotated masks. To address this challenge, ComNet is proposed, which uses an unsupervised comparative training approach to avoid the need for the masks required by FSS and weakly supervised FSS methods. Specifically, for category-agnostic representation learning, the comparison ability is treated as a proxy task, which is then used to identify pixels by comparing the query features to the support features. A training set of images from various categories is used to learn this comparison ability. In addition, a Wasserstein distance-based method is introduced, enabling adaptive selection of optimal augmentations for individual categories. This enhances the capacity to accommodate the distinct distribution characteristics observed across categories. In this exceptionally challenging setting, ComNet achieves AUC results of 90.2% and 67.3% when using only 2 novel class images on the MVTec and MPDD benchmarks, respectively.
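
The Wasserstein distance referred to above can be illustrated in one dimension, where for two equal-sized samples it reduces to the mean absolute difference between their sorted values. The sketch below computes that distance and ranks some candidate augmentations by it. The selection rule (prefer the smallest distance), the augmentation names and the feature values are placeholder assumptions; the abstract does not state which criterion or feature space ComNet uses.

```python
def wasserstein_1d(xs, ys):
    """1-Wasserstein distance between two equal-sized 1-D samples."""
    if not xs or len(xs) != len(ys):
        raise ValueError("samples must be non-empty and of equal size")
    # In 1-D the optimal transport plan pairs the sorted values in order.
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


# Hypothetical scalar feature statistics for one category.
original = [0.0, 1.0, 2.0]
candidates = {
    "flip": [0.0, 1.0, 2.5],
    "rotate": [2.0, 3.0, 4.0],
}

distances = {name: wasserstein_1d(original, feats)
             for name, feats in candidates.items()}

# Placeholder rule (an assumption): keep the augmentation whose feature
# distribution stays closest to the original one.
closest = min(distances, key=distances.get)
```
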

Date of Award: 1 Feb 2024
Original language: English
Awarding Institution: King's College London
Supervisors: Michael Spratling & Frederik Mallmann-Trenn
