Surgical-VQA: Visual Question Answering in Surgical Scenes using Transformer

Seenivasan, Lalithkumar; Islam, Mobarakol; Krishna, Adithya K; Ren, Hongliang

Computer Science > Computer Vision and Pattern Recognition

arXiv:2206.11053 (cs)

[Submitted on 22 Jun 2022 (v1), last revised 26 Jun 2022 (this version, v2)]

Abstract: Visual question answering (VQA) in surgery is largely unexplored. Expert surgeons are scarce and are often overloaded with clinical and academic workloads. This overload often limits the time they can spend answering questions about surgical procedures from patients, medical students or junior residents. At times, students and junior residents also refrain from asking too many questions during classes to reduce disruption. While computer-aided simulators and recordings of past surgical procedures have been made available for them to observe and improve their skills, they still rely heavily on medical experts to answer their questions. A Surgical-VQA system serving as a reliable 'second opinion' could act as a backup and ease the load on medical experts in answering these questions. The lack of annotated medical data and the presence of domain-specific terms have limited the exploration of VQA for surgical procedures. In this work, we design a Surgical-VQA task that answers questions on surgical procedures based on the surgical scene. Extending the MICCAI Endoscopic Vision Challenge 2018 dataset and the workflow recognition dataset, we introduce two Surgical-VQA datasets with classification and sentence-based answers. To perform Surgical-VQA, we employ vision-text transformer models. We further introduce a residual MLP-based VisualBERT encoder model that enforces interaction between visual and text tokens, improving performance in classification-based answering. Furthermore, we study the influence of the number of input image patches and temporal visual features on model performance in both classification and sentence-based answering.
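
Illustrative sketch (not the authors' implementation): the abstract describes a residual MLP-based encoder that enforces interaction between visual and text tokens. One plausible reading, assumed here, is a ResMLP-style block in which a linear layer mixes information across the joint sequence of text and visual tokens, followed by a per-token MLP, each wrapped in a residual connection. All function names, dimensions, and random weights below are hypothetical and chosen only to show the data flow in plain Python.

```python
import random

def cross_token_residual(tokens, mix):
    """Mix information across tokens (T x D input, T x T weights), then add the input back."""
    n_tokens, dim = len(tokens), len(tokens[0])
    out = []
    for i in range(n_tokens):
        mixed = [sum(mix[i][j] * tokens[j][d] for j in range(n_tokens)) for d in range(dim)]
        out.append([t + m for t, m in zip(tokens[i], mixed)])
    return out

def cross_channel_residual(tokens, w_in, w_out):
    """Apply a two-layer ReLU MLP to each token independently, then add the input back."""
    out = []
    for tok in tokens:
        hidden = [max(0.0, sum(w * x for w, x in zip(row, tok))) for row in w_in]
        delta = [sum(w * h for w, h in zip(row, hidden)) for row in w_out]
        out.append([x + d for x, d in zip(tok, delta)])
    return out

def random_matrix(rows, cols, rng, scale=0.1):
    """Hypothetical stand-in for learned weights or embeddings."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
dim, hidden = 4, 8
text_tokens = random_matrix(3, dim, rng, 1.0)    # assumed: 3 question-word embeddings
visual_tokens = random_matrix(2, dim, rng, 1.0)  # assumed: 2 image-patch embeddings
tokens = text_tokens + visual_tokens             # joint text + visual sequence

mix = random_matrix(len(tokens), len(tokens), rng)
w_in = random_matrix(hidden, dim, rng)
w_out = random_matrix(dim, hidden, rng)

mixed = cross_token_residual(tokens, mix)
encoded = cross_channel_residual(mixed, w_in, w_out)
```

Because the mixing weights span the whole joint sequence, each text token's output depends on every visual token and vice versa, which is the kind of cross-modal interaction the abstract refers to. The residual paths keep the block close to the identity when the weights are small.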

Comments:	Code: this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO); Image and Video Processing (eess.IV)
Cite as:	arXiv:2206.11053 [cs.CV]
	(or arXiv:2206.11053v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2206.11053

Submission history

From: Lalithkumar Seenivasan
[v1] Wed, 22 Jun 2022 13:21:31 UTC (426 KB)
[v2] Sun, 26 Jun 2022 13:26:20 UTC (425 KB)
