TY - CHAP
T1 - SKiT
T2 - 2023 IEEE/CVF International Conference on Computer Vision, ICCV 2023
AU - Liu, Yang
AU - Huo, Jiayu
AU - Peng, Jingjing
AU - Sparks, Rachel
AU - Dasgupta, Prokar
AU - Granados, Alejandro
AU - Ourselin, Sebastien
N1 - Funding Information:
This work was supported by the Engineering & Physical Sciences Research Council Doctoral Training Partnership (EPSRC DTP) grant EP/T517963/1; the Academy of Medical Sciences Springboard Award [SBF005\1131]; King’s funded CDT in Surgical & Interventional Engineering and King’s-China Scholarship Council PhD Scholarship programme (K-CSC).
Publisher Copyright:
© 2023 IEEE.
PY - 2023
Y1 - 2023
N2 - This paper introduces SKiT, a fast Key information Transformer for phase recognition of videos. Unlike previous methods that rely on complex models to capture long-term temporal information, SKiT accurately recognizes high-level stages of videos using an efficient key pooling operation. This operation records important key information by retaining the maximum value recorded from the beginning up to the current video frame, with a time complexity of O(1). Experimental results on the Cholec80 and AutoLaparo surgical datasets demonstrate the ability of our model to recognize phases in an online manner. SKiT achieves higher performance than state-of-the-art methods, with accuracies of 92.5% and 82.9% on Cholec80 and AutoLaparo, respectively, while running the temporal model eight times faster (7 ms vs. 55 ms) than LoViT, which uses ProbSparse to capture global information. We highlight that the inference time of SKiT is constant and independent of the input length, making it a stable choice for keeping a record of the important global information in long surgical videos that is essential for phase recognition. To sum up, we propose an effective and efficient model for surgical phase recognition that leverages key global information. This is of intrinsic value when performing this task in an online manner on long surgical videos for stable real-time surgical recognition systems.
AB - This paper introduces SKiT, a fast Key information Transformer for phase recognition of videos. Unlike previous methods that rely on complex models to capture long-term temporal information, SKiT accurately recognizes high-level stages of videos using an efficient key pooling operation. This operation records important key information by retaining the maximum value recorded from the beginning up to the current video frame, with a time complexity of O(1). Experimental results on the Cholec80 and AutoLaparo surgical datasets demonstrate the ability of our model to recognize phases in an online manner. SKiT achieves higher performance than state-of-the-art methods, with accuracies of 92.5% and 82.9% on Cholec80 and AutoLaparo, respectively, while running the temporal model eight times faster (7 ms vs. 55 ms) than LoViT, which uses ProbSparse to capture global information. We highlight that the inference time of SKiT is constant and independent of the input length, making it a stable choice for keeping a record of the important global information in long surgical videos that is essential for phase recognition. To sum up, we propose an effective and efficient model for surgical phase recognition that leverages key global information. This is of intrinsic value when performing this task in an online manner on long surgical videos for stable real-time surgical recognition systems.
UR - http://www.scopus.com/inward/record.url?scp=85185867869&partnerID=8YFLogxK
U2 - 10.1109/ICCV51070.2023.01927
DO - 10.1109/ICCV51070.2023.01927
M3 - Conference paper
AN - SCOPUS:85185867869
T3 - Proceedings of the IEEE International Conference on Computer Vision
SP - 21017
EP - 21027
BT - Proceedings - 2023 IEEE/CVF International Conference on Computer Vision, ICCV 2023
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 2 October 2023 through 6 October 2023
ER -