Abstract: Expert surgeons often have heavy workloads and cannot promptly respond to queries from medical students and junior doctors about surgical procedures. Thus, research on Visual Question ...
Foundation models have ushered in a new era for multimodal video understanding by enabling the extraction of rich spatiotemporal and semantic representations. In this work, we introduce a novel ...