Authors
Muhammad Abdullah Jamal, Omid Mohareri
Abstract
Data-driven approaches to assist operating room (OR) workflow analysis depend on large curated datasets that are time-consuming and expensive to collect. On the other hand, we see a recent paradigm shift from supervised learning to self-supervised and/or unsupervised learning approaches that can learn representations from unlabeled datasets. In this paper, we leverage the unlabeled data captured in robotic surgery ORs and propose a novel way to fuse the multi-modal data for a single video frame or image. Instead of producing different augmentations (or “views”) of the same image or video frame, which is a common practice in self-supervised learning, we treat the multi-modal data as different views to train the model in an unsupervised manner via clustering. We compared our method with other state-of-the-art methods, and the results show the superior performance of our approach on surgical video activity recognition and semantic segmentation.
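As a rough illustration of the idea in the abstract, the sketch below treats the prototype scores of the intensity and depth streams of the same frame as two views and applies a SwAV-style swapped prediction loss: each modality is trained to predict the cluster code computed from the other. This is a minimal pure-Python sketch, not the authors' implementation. The function names, the temperature and epsilon values, the number of Sinkhorn iterations, and the toy scores are all assumptions made for illustration.

```python
import math

def softmax(row, temperature):
    """Softmax over one row of prototype scores, with a temperature."""
    scaled = [x / temperature for x in row]
    top = max(scaled)
    exps = [math.exp(x - top) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sinkhorn(scores, epsilon=0.05, n_iters=3):
    """Turn a batch of prototype scores (B x K) into soft codes.

    Alternating column/row normalization spreads the batch roughly
    evenly over the K prototypes, which discourages the trivial
    solution where every sample lands in one cluster.
    """
    b, k = len(scores), len(scores[0])
    top = max(max(row) for row in scores)  # shift for numerical stability
    q = [[math.exp((s - top) / epsilon) for s in row] for row in scores]
    for _ in range(n_iters):
        col = [sum(q[i][j] for i in range(b)) for j in range(k)]
        q = [[q[i][j] / (col[j] * k) for j in range(k)] for i in range(b)]
        q = [[v / (sum(row) * b) for v in row] for row in q]
    # Each row now sums to 1/B; rescale so each row is a distribution.
    return [[v * b for v in row] for row in q]

def cross_entropy(code, probs):
    return -sum(q * math.log(p) for q, p in zip(code, probs))

def swapped_loss(scores_intensity, scores_depth, temperature=0.1):
    """Each modality predicts the code computed from the other modality."""
    q_int = sinkhorn(scores_intensity)
    q_dep = sinkhorn(scores_depth)
    total = 0.0
    for s_i, s_d, q_i, q_d in zip(scores_intensity, scores_depth, q_int, q_dep):
        p_i = softmax(s_i, temperature)
        p_d = softmax(s_d, temperature)
        total += cross_entropy(q_d, p_i) + cross_entropy(q_i, p_d)
    return total / (2 * len(scores_intensity))

# Toy scores (assumed): 3 frames x 3 prototypes for each modality.
intensity = [[0.9, 0.1, -0.2], [0.0, 0.8, 0.1], [-0.3, 0.2, 0.7]]
depth = [[0.8, 0.0, -0.1], [0.1, 0.9, 0.0], [-0.2, 0.1, 0.6]]
loss = swapped_loss(intensity, depth)
```

In a real training loop the scores would come from an encoder followed by a learnable prototype layer, and the codes would be computed without gradient; here they are plain lists so that the loss itself is easy to inspect.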
Link to paper
DOI: https://link.springer.com/chapter/10.1007/978-3-031-16449-1_43
SharedIt: https://rdcu.be/cVRXi
Link to the code repository
N/A
Link to the dataset(s)
N/A
Reviews
Review #1
- Please describe the contribution of the paper
    The authors describe a method for pretraining a CNN to aid in workflow analysis tasks (Activity recognition and semantic segmentation) from the room camera in an operating room. The method is evaluated on two public datasets and compared against 2 other methods from the state of the art. 
- Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    - Relevant problem
- Novel approach
- Good dataset for evaluation
- Comparison against different methods
 
- Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    - Certain details/terms not explained in the paper
- No real discussion of results
- Some grammatical errors, the paper would definitely benefit from another read-through
 
- Please rate the clarity and organization of this paper
    Good 
- Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
    The authors present the method in such a manner that it should be reproducible. All the details provided on the checklist were truthful. 
- Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    One of my main concerns is that a lot of details were missing/unexplained in the paper:
- First, it only becomes apparent relatively late in the paper that the authors are talking about workflow analysis from room cameras, not e.g. from the daVinci endoscope.
- Some abbreviations are never introduced, e.g. SwAV, Bi-GRU.
- The term “prototype” is also never defined, nor is any intuition behind the idea explained.
    There are also some other open questions/concerns from my side:
- The results of the different methods for segmentation appear close together; can you comment on the statistical significance of the difference in results?
- The way I understand Sec. 3.1, you are pretraining the network to produce similar features for the depth and the RGB images. Here it would be interesting to see whether the final network actually needs the multimodal data in the end, and how the network would fare with only one modality.
- How are the modalities merged? Is the input two channels? If yes, how exactly does it work for pretraining? Is one channel just set to 0?
- Did you also perform the analysis for K for the segmentation problem? On what data was K selected; did you use a validation set? Seeing the results in Table 3, wouldn’t it make sense to consider larger values of K?
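One common way to approach the significance question raised above, when only a mean and standard deviation over repeated runs are reported for each method, is Welch's t-test. The sketch below computes the t statistic and the Welch-Satterthwaite degrees of freedom from such summary statistics. It is only an illustration: the function name is hypothetical, and the numbers in the usage lines are placeholders, not results from the paper (the number of runs per method is not stated in this page).

```python
import math

def welch_t(mean_a, std_a, n_a, mean_b, std_b, n_b):
    """Welch's t statistic and degrees of freedom from summary statistics.

    std_a and std_b are sample standard deviations over n_a and n_b runs.
    """
    var_a = std_a ** 2 / n_a
    var_b = std_b ** 2 / n_b
    t = (mean_a - mean_b) / math.sqrt(var_a + var_b)
    df = (var_a + var_b) ** 2 / (
        var_a ** 2 / (n_a - 1) + var_b ** 2 / (n_b - 1)
    )
    return t, df

# Placeholder values (assumed, NOT taken from the paper): two methods,
# five runs each, mean mIoU 0.52 vs 0.50 with standard deviation 0.01.
t_stat, dof = welch_t(0.52, 0.01, 5, 0.50, 0.01, 5)
```

The resulting statistic would then be compared against a t distribution with `dof` degrees of freedom to obtain a p-value; with very few runs, the test has little power, so close results may remain inconclusive.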
 
- Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making
    5 
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    In my opinion, the paper tackles an interesting problem, but it still has a lot of missing information and unclear points that need to be addressed.
- Number of papers in your stack
    2 
- What is the ranking of this paper in your review stack?
    1 
- Reviewer confidence
    Very confident 
- [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed
    5 
- [Post rebuttal] Please justify your decision
    Most of my questions from the original review are still open, so I have not adjusted my opinion 
Review #2
- Please describe the contribution of the paper
    This work proposes to use an unsupervised clustering method, i.e., SwAV, to pretrain the image encoder with multi-modal images. It takes two modalities, intensity and depth images from ToF cameras, as two different views of the same image and trains the encoder to predict the other modality’s pseudo code. Compared to offline clustering methods, SwAV fits the large-scale setting and achieves superior results compared to the other self-supervised methods, especially in few-shot settings.
- Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    - Multi-modality usage: this work proposes to fuse different modalities to pretrain the model encoder. It forces semantically similar samples from different modalities to group together.
- Training process: current methods mostly rely on direct end-to-end training, while few works focus on the pretraining + fine-tuning process. This paper shows a potential way to pretrain the image encoder with multi-modal data.
- Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    - Online vs. offline: This work opts for the online manner to do the pretraining. However, as far as I know, SwAV is not comparable to the other offline methods, such as MoCo. It is not clear why the online method is so necessary.
- Fair comparison: This paper is about multi-modal unsupervised pretraining; however, the compared methods, i.e., pace prediction and clip order prediction, are based on uni-modal video frames, as far as I am concerned. This is not a fair comparison, since it is unclear whether the multi-modal data or the proposed pretraining process helps to improve the performance.
- Please rate the clarity and organization of this paper
    Excellent 
- Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
    According to the materials, the paper is reproducible.
- Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    - Beyond the weaknesses above, I’d like to see more details about the training process, because the pretraining process in this paper is meant to provide a better initialization of the backbone encoder. How is that backbone utilized for the different tasks, i.e., activity recognition and semantic segmentation, and what specific settings need to be applied? A few lines of explanation would make it easier to understand.
- An ablation study on the multi-modal data and the pretraining process.
- A multi-modal pretraining baseline should be added.
- Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making
    5 
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    The use of multi-modal data in pretraining. 
- Number of papers in your stack
    5 
- What is the ranking of this paper in your review stack?
    3 
- Reviewer confidence
    Confident but not absolutely certain 
- [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed
    5 
- [Post rebuttal] Please justify your decision
    The response letter addresses my concerns about offline and online differences, and I think this should be included in the discussion section.
Review #3
- Please describe the contribution of the paper
    This paper relies on multimodal data to pretrain models. The model enforces generating similar prototypes from different modalities. The model has been evaluated on two tasks, where the proposed model achieved superior performance.
- Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    - experiment parameters are provided
- leveraging multimodal data to pretrain models
- experimental results on two tasks
 
- Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    - the proposed model relies on an architecture similar to [4]. It seems that the technical contribution is limited.
- [4] relied on a single modality and used different augmentations to achieve the same goal; I believe the authors should have compared their results with [4] as another baseline.
- while the proposed model used both intensity and depth images, the baseline model only took intensity images as input. I think it would be interesting to establish another baseline that relies on the same data as the proposed model, as depth data was used during pretraining.
- [16, 30] have studied the effect of using both RGB and depth images on improving performance. The authors could have used both modalities when the models are trained for the task.
- there were multiview data. Were the cameras calibrated? As multiview data was available, it would be interesting to look at this aspect as well.
 
- Please rate the clarity and organization of this paper
    Very Good 
- Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
    Model parameters are provided in the supplementary document. It is however important to ensure it will be published along with the paper. The datasets used are not publicly available.
- Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    I think it would be interesting to use [4] as a baseline, as the proposed model shares many concepts with [4], and the model in [4] can be applied to more data because it only relies on a single modality. [20] used DeepLab V3+ as a baseline; it would have been more interesting to use the same baseline. Similarly to the workflow experiment, it would be interesting to see the performance of the semantic segmentation model when 100% of the labeled data is used.
- Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making
    4 
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    The technical contributions are limited. The proposed model borrows many concepts from [4]; hence I was expecting [4] to be used as a baseline.
- Number of papers in your stack
    5 
- What is the ranking of this paper in your review stack?
    3 
- Reviewer confidence
    Very confident 
- [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed
    5 
- [Post rebuttal] Please justify your decision
    Not Answered 
Primary Meta-Review
- Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.
    This paper presents a framework for workflow analysis tasks in the OR including surgical activity recognition and semantic segmentation, using multi-modal unsupervised pre-training. The work proposes a clustering based unsupervised learning approach, for fusing intensity and depth map of a single video frame/image (from time-of-flight sensors) to extract discriminative representations. The paper is well-motivated, the approach (using multi-modal data in pre-training) is novel, and the evaluation experiments are thorough. The main criticisms of the work are regarding technical contribution related to the architecture, concerns around baseline models used and lack of discussion of the results, and missing details regarding experimental setup and model parameters. The following points should be addressed in the rebuttal:
- Clarifications regarding differences in architecture/technical contribution compared to [4], and justification for results not having been compared to this baseline.
- Better discussion and justification of the results and the baseline models used.
- Justification for online vs. offline pre-training
- Missing details regarding the experimental setup and training process
 
- What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).
    4 
Author Feedback
We thank the reviewers for their constructive comments and suggestions. In this work we present a general pre-training approach that can be applied to any multi-modal dataset available in the MICCAI community for emerging applications such as surgical workflow analysis. We first address the major concerns of the reviewers.
Q1: Offline vs. online training. Previous clustering-based approaches are offline in nature: they alternate between a cluster-assignment step, in which the whole dataset is clustered, and the prediction of the cluster assignments (“codes”). Similar to SwAV, we want our approach to scale to any dataset size, which is not feasible with offline clustering approaches. Therefore, we limit ourselves to online learning. We also want to point out that SwAV did compare with MoCo and other approaches (cf. Figure 2, Table 3 in the original SwAV paper). For a fair comparison, we did compare our approach to other clustering-based approaches in Table 4.
Q2: Difference between ours and [4]. We have described in Section 3 the differences between our approach and SwAV [4]. While SwAV relies on producing different augmentations (views) of the same image, we propose a multi-modal fusion approach in which different modalities (intensity and depth in our case) are treated as two different views of the same video frame. Moreover, unlike SwAV [4], we show the effectiveness of our approach on the video domain. As suggested by R3, we train the SwAV model as a baseline for the semantic segmentation task, and the results show that our approach still outperforms the SwAV baseline. For the SwAV baseline, we get mIoU of 0.484±0.006, 0.502±0.006, 0.522±0.004 for 2%, 5% and 10% labeled data, which is still lower compared to ours (cf. Table 4 in the manuscript).
Q3: Training process. The goal of our pre-training approach is to provide better initialization for downstream tasks such as surgical activity recognition, semantic segmentation, etc. At the same time, we wanted to show that our approach is applicable to both the video domain and the image domain. The first stage of the training is similar in both cases: we pre-train the I3D and Resnet-50 backbones using our approach. Next, for activity recognition, we follow [28], which employs a two-stage training process. For the semantic segmentation task, we initialize the backbone of Deeplab-v2 with our pre-training, and then fine-tune the whole network in the low-data regime.
Q4: Multimodal baseline. We train a multi-modal approach called CoCLR [1], and it achieves mAP of 64.87 and 83.74 on 10% and 20% labeled data respectively, which is still lower than our approach (cf. Table 1).
[1] Self-supervised Co-Training for Video Representation Learning.
Minor concerns.
R1-Q2: As mentioned in SwAV [4], and supplementary details, we apply various training improvements to original DeepCluster which might boost the performance. Moreover, when the labeled data is 2%, an improved DeepCluster outperforms SeLA. As we increase the labeled data, the performance gap shrinks.
R3-Q1: We are proposing a general pre-training approach that can be applied to any multi-modal datasets. Our validation is based on an OR workflow dataset captured using ToF sensors with both intensity and depth data streams. While we did not have RGB data available for this work, the approach can be applied to RGB-D or RGB + flow data as an example.
R3-Q2: Multi-view architectures have been out of scope for this work, and we treat each view as an individual input for our model and strictly separate the recordings for each procedure. Our sensors are time-synchronized but multi-camera calibration wasn’t needed for this work.
R3-Q3: Our goal is to propose a method that can be applied to any type of model architecture and backbone. While we did our validations on I3D network (with Inception modules) and Deeplab-v2 with Resnet-50 backbone, we believe that the method is not limited to certain backbones or models.
The final version will include our new baseline results and revisions to address all minor concerns.
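For readers unfamiliar with the segmentation metric quoted in the rebuttal, mean intersection over union (mIoU) is commonly computed from a per-class confusion matrix, as sketched below. This is a generic illustration of the metric, not the authors' evaluation code. The function name and the toy confusion matrix are assumptions, and conventions such as how absent classes are handled can differ between benchmarks.

```python
def mean_iou(confusion):
    """Mean IoU from a square confusion matrix.

    confusion[i][j] is the number of pixels whose true class is i
    and whose predicted class is j.
    """
    n = len(confusion)
    ious = []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                       # missed pixels of class c
        fp = sum(confusion[r][c] for r in range(n)) - tp  # pixels wrongly labeled c
        denom = tp + fp + fn
        if denom > 0:  # skip classes absent from both labels and predictions
            ious.append(tp / denom)
    return sum(ious) / len(ious) if ious else 0.0

# Toy two-class example (assumed values): each class has 3 correct
# pixels, 1 missed pixel, and 1 false positive.
score = mean_iou([[3, 1], [1, 3]])
```

Per-class IoU is the ratio of correctly labeled pixels to the union of predicted and ground-truth pixels for that class; averaging over classes gives every class equal weight regardless of how many pixels it covers.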
Post-rebuttal Meta-Reviews
Meta-review # 1 (Primary)
- Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores,  indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.
    The paper is well-motivated, and the topic is of interest to the CAI community. The approach is novel and validated thoroughly on two different datasets. The rebuttal addresses some of the comments from reviewers (offline vs. online training, difference in contribution compared to [4], training process); these and other comments should be incorporated in the final manuscript.
- After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.
    Accept 
- What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).
    3 
Meta-review #2
- Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores,  indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.
    The authors responded adequately to the reviewers’ comments 
- After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.
    Accept 
- What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).
    4 
Meta-review #3
- Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores,  indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.
    After carefully reviewing the paper, reviews, MRs and rebuttal, I recommend an accept for this paper. The authors should incorporate reviewers’ feedback in the camera-ready version.
- After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.
    Accept 
- What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).
    8 
