Spatial-temporal attention for video-based assessment of intraoperative surgical skill

Paper: https://doi.org/10.1038/s41598-024-77176-1

Overall architecture of the spatial-temporal attention network for surgical skill assessment.

Highlights

Novel spatial-temporal attention mechanism tailored for surgical video analysis.
Automated objective assessment of surgical skill from intraoperative videos.
Attention visualization reveals important surgical actions and anatomical regions correlated with skill level.
State-of-the-art performance on multiple surgical skill assessment benchmarks.

Abstract

Objective assessment of surgical skill is crucial for surgical training, credentialing, and quality improvement. Traditional methods rely on manual expert evaluation, which is subjective, time-consuming, and resource-intensive. We propose an automated surgical skill assessment framework based on spatial-temporal attention mechanisms applied to intraoperative videos. Our method learns to identify and focus on critical surgical actions and anatomical regions that are indicative of skill level. The spatial attention module identifies important regions in each video frame, such as surgical instruments and key anatomical structures. The temporal attention module captures the dynamics of surgical workflow and the temporal patterns that distinguish expert from novice performance. By combining these complementary attention mechanisms, our model achieves objective, consistent, and interpretable surgical skill assessment. Experimental results on multiple surgical datasets demonstrate that our approach achieves superior performance compared to existing methods and provides insights into the visual cues associated with surgical expertise.

Method

Illustration of the spatial attention mechanism identifying critical regions in surgical videos.

Our approach consists of two main components: spatial attention and temporal attention.

The spatial attention module processes each video frame to identify regions that are most relevant for skill assessment. Rather than treating all regions equally, the spatial attention mechanism learns to focus on surgical instruments, target anatomy, and areas where critical actions occur. This is implemented through a learnable attention map that weighs different spatial regions based on their importance for skill classification.
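The idea of a learnable attention map over spatial regions can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the scoring function (a dot product with a learnable vector `w`, followed by a softmax) and the names `softmax` and `spatial_attention_pool` are assumptions made for the example, and the toy numbers are not real data.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def spatial_attention_pool(regions, w):
    """Pool one frame's region features into a single vector.

    regions: list of N region feature vectors, each of length C
             (e.g. the flattened cells of a CNN feature map).
    w:       hypothetical learnable scoring vector of length C.
    Returns (attention_map, pooled_feature).
    """
    # One scalar relevance score per region.
    scores = [sum(wi * xi for wi, xi in zip(w, region)) for region in regions]
    # Normalise the scores so the attention map sums to 1 over the frame.
    attention = softmax(scores)
    # Attention-weighted sum of the region features.
    channels = len(regions[0])
    pooled = [
        sum(a * region[c] for a, region in zip(attention, regions))
        for c in range(channels)
    ]
    return attention, pooled


# Toy frame: 4 regions, 2 channels; the third region scores highest.
frame = [[0.1, 0.0], [0.2, 0.1], [2.0, 1.5], [0.0, 0.3]]
attention, pooled = spatial_attention_pool(frame, w=[1.0, 1.0])
```

In this sketch the returned `attention` list plays the role of the attention map: reshaped to the frame's grid, it is the kind of quantity that could be overlaid on the image for visualization.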

Temporal attention weights across video frames showing important surgical phases.

The temporal attention module analyzes the sequence of frames to capture surgical workflow dynamics and temporal patterns. Expert surgeons tend to exhibit smoother, more efficient movements and closer adherence to optimal surgical sequences. The temporal attention mechanism learns to identify these temporal signatures of expertise by attending to key phases of the procedure and to transitions between surgical actions.
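Attention over time can be sketched the same way, with frames in place of regions. The example below is a simplified illustration under stated assumptions: the per-frame scoring vector `u`, the function name `temporal_attention_pool`, and the toy clip are hypothetical, and the actual module in the paper may score frames differently (for example with a recurrent or transformer encoder).

```python
import math


def temporal_attention_pool(frames, u):
    """Summarise a clip from per-frame feature vectors.

    frames: list of T frame descriptors, each of length D.
    u:      hypothetical learnable scoring vector of length D.
    Returns (weights, clip_feature); weights has one entry per frame.
    """
    # One scalar relevance score per frame.
    scores = [sum(ui * hi for ui, hi in zip(u, h)) for h in frames]
    # Softmax over time (subtract the max for numerical stability).
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Attention-weighted average of the frame descriptors.
    dim = len(frames[0])
    clip = [sum(a * h[d] for a, h in zip(weights, frames)) for d in range(dim)]
    return weights, clip


# Toy clip of 5 frames; frames 2 and 3 stand in for a demanding phase.
clip_frames = [[0.1, 0.2], [0.0, 0.1], [1.5, 1.0], [1.4, 1.2], [0.2, 0.0]]
weights, clip_feature = temporal_attention_pool(clip_frames, u=[1.0, 1.0])

# Indices of the two frames that receive the most attention.
top_frames = sorted(range(len(weights)), key=lambda t: weights[t], reverse=True)[:2]
```

Plotting `weights` against the frame index gives a per-frame importance curve, which is one simple way to inspect which parts of a procedure a model of this kind attends to.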

The spatial and temporal features are integrated through a fusion layer, and the combined representation is used for skill level prediction. This joint spatial-temporal modeling enables comprehensive understanding of surgical performance.
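One simple way to realise such a fusion step is to concatenate the two feature vectors and pass them through a linear classifier. The sketch below assumes exactly that; the post does not specify the fusion layer, so concatenation, the function name `fuse_and_classify`, the two-class setup, and all weights are illustrative assumptions only.

```python
import math


def fuse_and_classify(spatial_feat, temporal_feat, weight, bias):
    """Concatenate two feature vectors and apply a linear classifier.

    weight: one row per skill class, each row of length
            len(spatial_feat) + len(temporal_feat).
    bias:   one value per skill class.
    Returns a list of class probabilities.
    """
    # Fusion by concatenation (an assumption; other fusion schemes exist).
    fused = list(spatial_feat) + list(temporal_feat)
    # Linear layer: one logit per skill class.
    logits = [
        sum(w * x for w, x in zip(row, fused)) + b
        for row, b in zip(weight, bias)
    ]
    # Softmax turns logits into probabilities.
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy example with two skill classes (0 = novice, 1 = expert; labels are illustrative).
probs = fuse_and_classify(
    spatial_feat=[0.5, 1.0],
    temporal_feat=[1.2],
    weight=[[-1.0, -0.5, -1.0], [1.0, 0.5, 1.0]],
    bias=[0.0, 0.0],
)
predicted = max(range(len(probs)), key=lambda k: probs[k])
```

In a trained model the weights and biases would be learned jointly with the attention modules, so that the classification loss also shapes where the model attends.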

Results

Our framework achieves state-of-the-art performance on standard surgical skill assessment benchmarks. The spatial-temporal attention mechanism significantly outperforms methods using only spatial or only temporal features, demonstrating the importance of their combination.

Attention visualizations showing regions and time points the model focuses on for skill assessment.

The attention visualizations provide interpretable insights into what the model considers important for skill assessment. Spatial attention maps highlight surgical instruments and critical anatomical structures. Temporal attention weights reveal that the model learns to focus on challenging phases of the procedure where skill differences are most pronounced.

Comparison of skill assessment performance across different methods and datasets.

Conclusion

This article is intended only as a brief introduction to the paper.

We present a spatial-temporal attention framework for automated surgical skill assessment from intraoperative videos. The spatial attention module identifies critical regions in each frame, while the temporal attention module captures the dynamics of surgical workflow. By combining these complementary attention mechanisms, our model achieves accurate, objective, and interpretable surgical skill assessment. The attention visualizations provide insights into the visual and temporal cues associated with surgical expertise, which could inform surgical training curricula. Our approach demonstrates the potential of deep learning to provide scalable, consistent surgical skill evaluation, supporting surgical education and quality improvement initiatives.

Published Jun 1, 2023

Bohua Wan is a fourth-year PhD student in computer science at Johns Hopkins University.