Authors

Heming Yao, Adam Pely, Zhichao Wu, Simon S. Gao, Robyn H. Guymer, Hao Chen, Mohsen Hejrati, Miao Zhang

Abstract

The optical coherence tomography (OCT) signs of nascent geographic atrophy (nGA) are highly associated with GA onset. Automatically localizing nGA lesions can assist patient screening and endpoint evaluation in clinical trials. This task can be achieved with supervised object detection models, but they require laborious bounding box annotations. This study thus evaluated whether a weakly supervised method could localize nGA lesions based on the saliency map generated from a deep learning nGA classification model. This multi-instance deep learning model is based on 2D ResNet with late fusion and was trained to classify nGA on OCT volumes. The proposed method was cross-validated using a dataset consisting of 1884 volumes from 280 eyes of 140 subjects, which had volume-wise nGA labels and expert-graded slice-wise lesion bounding box annotations. The area under the Precision-Recall curve (AUPRC) for correctly localized lesions was 0.72 (±0.08), compared to 0.77 (±0.07) from a fully supervised method with YOLO V3. No statistically significant difference is observed between the weakly supervised and fully supervised methods (Wilcoxon signed-rank test, p=1.0).

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43907-0_46

SharedIt: https://rdcu.be/dnwds

Link to the code repository

N/A

Link to the dataset(s)

N/A


Reviews

Review #2

  • Please describe the contribution of the paper

    The authors demonstrate a weakly supervised multi-instance deep learning method (ResNet) to detect nascent geographic atrophy (nGA) in optical coherence tomography (OCT) slices (b-Scans). Labels are given per 3d volume (presence, absence of nGA). Furthermore they use GradCAM to localize the exact location of nGA in a single bScan.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • using weak labels for an otherwise time-consuming annotation process, while still achieving detection rates comparable to a fully supervised approach.
    • Exact localization of potential lesion by using the GradCam saliency map and providing a confidence score of detected lesion using the classification logit.
    • Proper setup of data-split and proper evaluation by comparing with supervised baseline and
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • relatively small dataset, although 5-fold CV was used to mitigate this a bit.
    • moderate precision
  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Method is well described. No code is provided.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    Nice practical solution to a clinically very relevant problem of patient screening in age-related macular degeneration (AMD). Using a grad-cam based detection is a good approach to overcome the limitation of not having exact bounding box annotations. For further work it might be interesting to see if other CAM methods (GradCam++, guided GradCAM) may provide a narrower bounding box.

    Minor:

    • significance level in Wilcoxon signed-rank test cannot be p=1.0, both in abstract and in results. Is it 0.01?
    • Conclusion: I would not say that weakly supervised and supervised methods are “on par”, even though no statistical significance in difference was achieved, as precision differs quite a lot.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Well-structured paper. Solid and effective method that provides a solution to a clinically highly relevant problem when only volume labels are available. Proper evaluation by comparing it with a supervised method.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The paper uses gradCAM with thresholding and connected component analysis to localize nascent geographic atrophy (nGA) in optical coherence tomography (OCT). The proposed method is weakly supervised as it only requires training on nGA classification task. The proposed method is able to achieve high recall scores and obtain similar performance as fully supervised YOLOv3.
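
The post-processing summarized above (thresholding a saliency map, then connected component analysis) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function name `saliency_to_boxes`, the relative threshold of 0.5, the 4-connectivity, and the toy map are all assumptions made for the example.

```python
from collections import deque


def saliency_to_boxes(saliency, rel_threshold=0.5):
    """Return one bounding box per connected region of a thresholded map.

    saliency: 2D list of non-negative floats (e.g. a GradCAM map for one slice).
    rel_threshold: fraction of the map maximum used as cutoff (assumed value).
    Boxes are (row_min, col_min, row_max, col_max), inclusive.
    """
    rows, cols = len(saliency), len(saliency[0])
    peak = max(max(row) for row in saliency)
    if peak <= 0:
        return []
    cutoff = rel_threshold * peak
    mask = [[value >= cutoff for value in row] for row in saliency]
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            # Breadth-first search over 4-connected neighbours.
            queue = deque([(r, c)])
            seen[r][c] = True
            r0, c0, r1, c1 = r, c, r, c
            while queue:
                y, x = queue.popleft()
                r0, c0 = min(r0, y), min(c0, x)
                r1, c1 = max(r1, y), max(c1, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((r0, c0, r1, c1))
    return boxes


# Toy 4x5 saliency map with two separate hot spots (hypothetical values).
toy_map = [
    [0.0, 0.1, 0.0, 0.0, 0.0],
    [0.2, 0.9, 0.8, 0.0, 0.0],
    [0.0, 0.7, 0.1, 0.0, 0.6],
    [0.0, 0.0, 0.0, 0.0, 0.7],
]
boxes = saliency_to_boxes(toy_map)
print(boxes)
```

A lower `rel_threshold` merges and enlarges regions, which is one way the oversized boxes (high recall, lower precision) discussed in this review could arise.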

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Strengths of this paper:

    1. The authors gave a good background introduction on nGA and AMD. This helps readers understand the significance of nGA and potential impact of this work.
    2. The paper is overall well written and easy to follow.
    3. The paper uses comprehensive metrics to evaluate the performance of the proposed method, including AUPRC, 5 fold cross validation, Precision-recall curve and Wilcoxon signed-rank test.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    My biggest concern is the novelty of this paper. Even though the authors claimed that “While existing literature has demonstrated the ability of the GradCAM in post-hoc model interpretation, it is unknown whether the saliency map can help further localize the class related objects” (page 2, last paragraph in Introduction), I do not fully agree with this claim. Using CAM/GradCAM for object localization is a well-studied research area. Many previous works [1,2,3] have explored using CAM/GradCAM as a weakly supervised approach for object localization. In fact, using GradCAM and adaptive thresholding + largest connected component for object localization is a well established method as described in [4]. As described in [3], CAM/GradCAM-based object localization methods have some inherent problems such as small activation area (global average pooling bias and negative weights) and overlapped region thresholding. These inherent problems may help explain why the authors observed lower precision scores and high recall scores (bounding box size is too large). The other concern is the generalizability of the proposed method. It seems that the authors used the same dataset for classification and localization tasks. I’m wondering whether the proposed model can be trained on one OCT dataset and be applied to a different OCT dataset.

    [1] Yang, Seunghan et al. “Combinational Class Activation Maps for Weakly Supervised Object Localization.” 2020 IEEE Winter Conference on Applications of Computer Vision (WACV) (2019): 2930-2938.
    [2] Bae, Wonho et al. “Rethinking Class Activation Mapping for Weakly Supervised Object Localization.” European Conference on Computer Vision (2020).
    [3] Zhang, Xiaolin et al. “Inter-Image Communication for Weakly Supervised Localization.” European Conference on Computer Vision (2020).
    [4] Dogra, M. (2020, December 31). Weakly Supervised Learning for Object Localization. Medium. https://medium.com/analytics-vidhya/weakly-supervised-learning-for-object-localization-4b73d4f4f4a6

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Reproducible.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    While the proposed method can help reduce manual annotation time and could have an impact on AI-assisted diagnosis, there is plenty of room for improvement. First, increasing the precision score without sacrificing the recall score. As suggested in the weakness section, many previous works studied methods to improve gradCAM/CAM-based object localization; these methods could be incorporated into the current work to achieve better precision. Second, evaluating the generalizability of the proposed method. The proposed method still requires some manual labeling (for classification). If the method lacks generalizability to unseen datasets, the amount of human labor it can help reduce will be significantly limited. Third, exploring the effect of architecture. Since the proposed method currently has a lower precision score, it might be beneficial to add skip-connections to the feature extraction backbone for fine-grained detail preservation. In the reference section, there is probably a typo in the gradCAM citation: the original paper was published in 2016 but it is cited as a publication from 2020.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    3

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Although the clinical impact is meaningful, the paper lacks technical innovation. There is huge improvement potential for this work.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    4

  • [Post rebuttal] Please justify your decision

    Multiple instance learning on CT is not actually novel (i.e. labels at the 3D level but using aggregated 2D information for classification). For example, [1] demonstrates the application of such a technique; it also shows attention maps, which is conceptually similar to what the authors are trying to do in this paper. As other reviewers mentioned, the precision score is quite low and is not very comparable to other supervised methods. While I do believe the proposed method can have some clinical applications, the method can be greatly improved.

    [1] https://www.medrxiv.org/content/10.1101/2020.09.14.20194654v1.full.pdf



Review #4

  • Please describe the contribution of the paper

    The authors introduced a weakly supervised technique, utilizing a 2D ResNet-based classification model to identify nGA lesions through saliency maps. This proposed model was evaluated against the YOLO V3 supervised model, which employed expert annotations. The performance of the suggested model proved to be on par with YOLO V3 during cross-validation.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The authors successfully trained a ResNet-18 classification model to generate saliency maps for patients with nGA, eliminating the need for pixel-label annotations or bounding box information from experts. By quantifying the saliency map and comparing it to a YOLO V3 supervised method that uses bounding boxes, the authors demonstrated that their proposed approach, which only requires image/study level labels, can achieve comparable performance to the YOLO V3 model.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The study is limited by its small dataset size and lack of external evaluation, which may affect the generalizability and robustness of the proposed model.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors employed cross-validation on their internal dataset to evaluate the performance of their model. However, no external data was used for validation, which may impact the reproducibility and generalizability of the findings.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    Despite the proposed method achieving comparable performance to the supervised learning approach, the ResNet-18 model’s performance is still 5% lower than that of YOLO V3. To improve performance, the authors could consider using pretrained weights from medical datasets, such as RadImageNet, which includes ultrasound weights. This may help enhance the model’s accuracy and effectiveness in identifying nGA lesions.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Although the authors have presented an interesting weakly supervised approach for localizing nGA lesions, the small dataset size and lack of external evaluation limit the study’s generalizability and robustness. Additionally, the ResNet-18 model’s performance is slightly lower than that of YOLO V3, indicating potential room for improvement. The study would benefit from incorporating pretrained weights from medical datasets and utilizing external validation to enhance its credibility and reproducibility.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    4

  • [Post rebuttal] Please justify your decision

    While the authors assert that their dataset is unique, its unavailability to the public restricts other researchers from using it to evaluate their algorithms. This lack of access, combined with the small size of the dataset, could potentially limit its generalizability. Moreover, the authors’ responses to the reviewers’ comments do not appear to adequately address the concerns raised.




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The paper presents a weakly supervised multi-instance deep learning method to detect nascent geographic atrophy in optical coherence tomography slices.

    Key strengths:

    1. proposes a weakly supervised method using weak labels for an otherwise time-consuming annotation process, and still achieves detection rates comparable to a fully supervised approach.
    2. good ways to visualize the results.
    3. uses comprehensive metrics to evaluate the performance of the proposed method, including AUPRC, 5 fold cross validation, Precision-recall curve and Wilcoxon signed-rank test.

    Key weaknesses:

    1. relatively small data set
    2. used the same dataset for classification and localization tasks.
    3. there is still considerable room for performance improvement.

    In the rebuttal, please especially clarify the dataset, novelty and performance issues.




Author Feedback

Dear Area Chairs,

We appreciate the acknowledgement from the reviewers on the clarity and organization of our paper, and the recognition of our proposed method being a “nice practical solution to a clinically very relevant problem” with “comprehensive metrics to evaluate the performance of the proposed method.” MICCAI welcomes both methodological and application studies, and we are grateful that our application study adapting SOTA methods to demonstrate clinical viability in in-human feasibility studies got recognized.

The reviewers’ main concerns are about the small dataset, model generalizability, and technical novelty.

We understand the reviewers’ concerns about single/small dataset and model generalizability, mostly from the perspectives of real world data on well-abstracted problems. Here, we would like to raise the different challenges in translational research, oriented for impact on clinical trials in a novel disease area with a paucity of data. Our dataset of 1884 volumes is the only currently available dataset worldwide in the nGA disease indication, and the clinical need is to develop a screening algorithm from this dataset for an ongoing clinical trial. Hence, we used 5-fold cross-validation, with each fold consisting of training, tuning and validation test sets (Section 2.3), which as noted by reviewer #2 mitigates the generalizability concerns across patients. In a well-controlled clinical trial setting, we are able to specify device and scan procedure in the study protocol so other domain shift problems are not a major concern.

Another major criticism was the lack of technical novelty in using GradCAM to localize lesions. In this rebuttal, we address this criticism from the following perspectives: 1) The reviewers overlook the fact that, different from previous works localizing an object in a 2D image with 2D-wise labels, our work localizes lesions at the 2D slice level utilizing only 3D volume-wise labels. In other words, our method has this additional fold of weak supervision, which is critical in clinical practice considering the challenges of collecting abnormality-level annotations for training a supervised model (Abstract and the last two paragraphs of Section 1). 2) It is noteworthy that, within the 49 slices of a volume, only a few (typically 3-5) actually contain nGA lesions. As a result, our data has two levels of sparsity: (a) only ~6% of the volumes are label positive, and (b) only ~8% of the slices in a label-positive volume contain lesions. We creatively applied the multi-instance learning paradigm with the simple late fusion architecture to get the presented results on this challenging problem (Section 2.1 and Fig. 2). 3) While GradCAM has been previously used to localize 2D objects in 2D images, most of the existing work focuses on localizing the most prominent objects in natural images. It is unknown whether it will work similarly on medical problems where lesions or abnormalities are often sparse anatomically.
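
The two levels of sparsity quoted above can be turned into rough counts with a few lines of arithmetic. The sketch below uses only the approximate figures stated in this rebuttal (1884 volumes, 49 slices per volume, ~6% and ~8%); the function name `sparsity_summary` and the assumption that both rates apply uniformly are illustrative, so the outputs are order-of-magnitude estimates rather than dataset statistics.

```python
def sparsity_summary(n_volumes, slices_per_volume, positive_volume_rate, lesion_slice_rate):
    """Combine volume-level and slice-level sparsity into rough expected counts."""
    positive_volumes = n_volumes * positive_volume_rate
    lesion_slices_per_positive = slices_per_volume * lesion_slice_rate
    # Fraction of all slices expected to contain a lesion, assuming
    # negative volumes contribute no lesion slices.
    overall_lesion_slice_rate = positive_volume_rate * lesion_slice_rate
    return positive_volumes, lesion_slices_per_positive, overall_lesion_slice_rate


# Approximate figures as quoted in the rebuttal text.
pos_vols, slices_per_pos, overall = sparsity_summary(1884, 49, 0.06, 0.08)
print(f"positive volumes: about {pos_vols:.0f}")
print(f"lesion slices per positive volume: about {slices_per_pos:.1f}")
print(f"share of all slices with a lesion: about {overall:.2%}")
```

Under these assumptions roughly 113 volumes are positive, about 4 of 49 slices in a positive volume carry a lesion (consistent with the "typically 3-5" quoted above), and fewer than 1 in 200 slices overall contains a lesion.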

There seems to be a misconception on our statistical test comparing performances: 1) Reviewer #4 commented that the performance of the weakly supervised method is 0.05 lower than the fully supervised one by comparing the mean AUPRC. Response: Statistically, it is inappropriate to compare multiple measures of two matched samples (i.e. AUPRC) by the mean value alone, considering the large standard deviation. Instead the Wilcoxon signed-rank test showed no significant difference (Section 3.2). 2) Reviewer #2 commented that the significant p-value may be 0.01 instead of 1.0. Response: P-value of 1, which is possible for discrete test statistics, indicates the null hypothesis that “no performance difference” cannot be rejected.
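
The point that a p-value of exactly 1 is possible for a discrete test statistic can be checked by brute force. The sketch below is a minimal exact two-sided Wilcoxon signed-rank test written from scratch; it assumes the test was applied to five paired per-fold scores (the pairing unit is an assumption), and the fold values are hypothetical integers in hundredths of AUPRC, not the paper's results. The helper name `exact_wilcoxon_two_sided` is likewise illustrative.

```python
from itertools import product


def exact_wilcoxon_two_sided(a, b):
    """Exact two-sided Wilcoxon signed-rank test for small paired samples.

    Only handles nonzero, untied differences (enough for this illustration).
    Returns (W_plus, p_value), where W_plus is the sum of ranks of the
    positive differences.
    """
    diffs = [x - y for x, y in zip(a, b)]
    magnitudes = sorted(abs(d) for d in diffs)
    if 0 in magnitudes or len(set(magnitudes)) != len(magnitudes):
        raise ValueError("sketch handles only nonzero, untied differences")
    n = len(diffs)
    w_plus = sum(magnitudes.index(abs(d)) + 1 for d in diffs if d > 0)
    # Null distribution: every rank is positive or negative with equal chance,
    # so enumerate all 2**n sign assignments.
    null = [
        sum(rank for rank, keep in zip(range(1, n + 1), signs) if keep)
        for signs in product((False, True), repeat=n)
    ]
    lower = sum(w <= w_plus for w in null) / len(null)
    upper = sum(w >= w_plus for w in null) / len(null)
    return w_plus, min(1.0, 2 * min(lower, upper))


# Hypothetical per-fold scores (hundredths of AUPRC); not the paper's data.
weak = [63, 79, 71, 66, 75]
full = [75, 75, 68, 68, 76]
w_plus, p_value = exact_wilcoxon_two_sided(weak, full)
print(w_plus, p_value)

# Most extreme case for five pairs: every difference has the same sign.
print(exact_wilcoxon_two_sided([1, 2, 3, 4, 5], [0, 0, 0, 0, 0]))
```

With five pairs the statistic takes only 16 distinct values, so p = 1.0 occurs whenever the rank sum falls in the middle of its range, and the smallest achievable two-sided p-value is 2/32 = 0.0625. Under this assumed setup, the test could not reach significance at the 0.05 level regardless of how large the differences were, which is relevant to how "no significant difference" should be read.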

Finally, we appreciate the reviewers’ suggestions on further improving performance using certain pre-trained weights, etc. However, we note that the performance of our weakly supervised model is already close to the upper limit set by the fully supervised method.




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The paper presents a weakly supervised multi-instance deep learning method to detect nascent geographic atrophy in optical coherence tomography slices.

    Key strengths:

    1. proposes a weakly supervised method using weak labels for an otherwise time-consuming annotation process, and still achieves detection rates comparable to a fully supervised approach.
    2. good ways to visualize the results.
    3. uses comprehensive metrics to evaluate the performance of the proposed method, including AUPRC, 5 fold cross validation, Precision-recall curve and Wilcoxon signed-rank test.

    Key weaknesses:

    1. used the same dataset for classification and localization tasks.
    2. there is still considerable room for performance improvement.

    The rebuttal clarified the data size issue. Indeed, a dataset of 1884 clinical volumes is hard to obtain and is considered large in the medical imaging community. The rebuttal also adequately addresses the performance issue.



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    Considering the relatively small size of the single dataset used by the authors and the absence of a validation set, I have concerns about the generalizability of this method, especially for weakly supervised localization methods that employ MIL. Furthermore, I believe that the authors’ explanations did not adequately address the reviewer’s concerns regarding the novelty of the method. Therefore, I am inclined to reject this article.



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    Considering all the comments from the reviewers and the rebuttal from the authors, although the authors partially addressed the reviewers’ concerns, the major concerns regarding the small dataset, model generalization and incremental technical novelty still exist. Overall, the weaknesses outweigh the merits in the current version.


