
Authors

Yi Zhong, Mengqiu Xu, Kongming Liang, Kaixin Chen, Ming Wu

Abstract

Segmentation of the infected areas of the lung is essential for quantifying the severity of lung diseases such as pulmonary infections. Existing medical image segmentation methods are almost all uni-modal, image-only methods. However, these image-only methods tend to produce inaccurate results unless trained with large amounts of annotated data. To overcome this challenge, we propose a language-driven segmentation method that uses text prompts to improve the segmentation results. Experiments on the QaTa-COV19 dataset indicate that our method improves the Dice score by at least 6.09% compared to uni-modal methods. Besides, our extended study reveals the flexibility of multi-modal methods in terms of the information granularity of text and demonstrates that multi-modal methods have a significant advantage over image-only methods in terms of the size of training data required.

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43901-8_69

SharedIt: https://rdcu.be/dnwEm

Link to the code repository

https://github.com/Junelin2333/LanGuideMedSeg-MICCAI2023

Link to the dataset(s)

https://github.com/HUANGLIZI/LViT


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper presents a language-driven method for segmenting infected areas in lung x-ray images aimed at improving the diagnosis and monitoring of COVID-19. The authors propose a GuideDecoder to fuse textual and visual features from independent text and image encoders at the decoding stage, promoting consistency between the two modalities. The method is evaluated on the QaTa-COV19 dataset, and the results show the advantage of multi-modal approaches over image-only methods in terms of segmentation performance and the size of required training data.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The paper conducts extensive experiments on the QaTa-COV19 dataset, comparing the proposed multi-modal method with image-only methods and analyzing the impact of information granularity in text prompts.
    • The experimental results demonstrate the superiority of the multi-modal approach over image-only methods in terms of segmentation performance and the size of required training data.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The proposed method is only evaluated on the QaTa-COV19 dataset, which may not be sufficient to generalize the results to other datasets or medical imaging tasks.
    • The text annotations in the dataset are not true medical reports but rather “text prompts,” which may limit the practical applicability of the method in real-world clinical settings.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
    • The authors evaluate their method on the QaTa-COV19 dataset and mention that they have corrected errors in the text annotations. While this is a positive aspect, sharing the updated dataset and any pre-processing steps taken would be crucial for reproducibility.
  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    • The authors mention that the text annotations in the dataset are not true medical reports but rather “text prompts.” Please discuss the implications of this limitation for the practical applicability of your method in real-world clinical settings and whether the method can be adapted to work with actual medical reports.
    • The authors have reported both Jaccard and Dice scores as evaluation metrics for the segmentation method. It is important to note that these two metrics have a one-to-one relationship, as the Dice score can be calculated from the Jaccard index (and vice versa); see the identity after this list.
    • Alongside the reported average performance metrics, include measures of spread, such as standard deviation or interquartile range, in your result tables. This will give readers a better understanding of the variability in the performance of your method and the compared methods.
    • To support the significance of your findings, consider conducting statistical tests to compare the performance of your method with other methods. This will help you quantify the significance of the observed differences in performance and provide a more robust comparison.
    • I noticed a reference to Equation 7, but it appears that this equation is missing from the text.
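For reference, the one-to-one relationship between the Dice coefficient D and the Jaccard index J mentioned above is the standard identity:

```latex
% Standard Dice--Jaccard identity for overlap measures on the same segmentation
D = \frac{2J}{1 + J}, \qquad J = \frac{D}{2 - D}
```

Reporting both therefore adds no independent information beyond one of them.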
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I find the paper interesting, as it addresses a significant problem in medical image segmentation for COVID-19 diagnosis and monitoring. The proposed language-driven method with the novel GuideDecoder shows potential in improving the performance of segmentation tasks by fusing textual and visual features. The experimental results on the QaTa-COV19 dataset highlight the advantages of the multi-modal approach over image-only methods.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    This paper is one of the pioneering works that focus on using text information to assist medical image analysis, which is believed to be the future of this field.

    The authors propose a new model that incorporates both text and image features, which leads to a significant improvement in performance.

    Moreover, the authors promise to release a new version of the text annotations, which is a noteworthy contribution to the medical image analysis field, since the study of multi-modality (i.e., text and image) is currently hampered by a lack of good-quality datasets.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Good qualitative analysis: Despite text-image models being black boxes, the authors have smartly designed an experiment (Fig. 3, Table 3) that shows the correlation between text information granularity and segmentation performance. This experiment indirectly proves the validity of their method and highlights the importance of certain types of text in medical data analysis.

    Model design: The model design is straightforward and easy to understand.

    Dataset: The authors provide an amendment to text annotations in [16], which is a valuable contribution to the field.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Motivation: The motivation behind the study is not clearly presented. For example, related work [11-15] is mentioned in the introduction, but only one of these works [15] is discussed. Additionally, the reasoning behind the module design is not apparent in the method section. The authors claim that LViT is not “flexible enough”; what does that mean? Why is the proposed method “flexible”?

    Applicability: It is not clear in what practical scenarios this method can be applied. It requires both text descriptions and images for training and testing, which may not be readily available in most cases. Furthermore, since text descriptions are often the final step in medical decision-making, it is unclear what benefits segmentation could bring after a diagnosis has been made.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Reproducibility is good.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    It would be helpful if the reasoning behind the model design were discussed in more detail.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    It fixes errors in an existing dataset and provides interesting qualitative experiments.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The authors present a multi-modal segmentation method that fuses textual and visual features in the decoding stage and outputs language-driven segmentation results. The authors have also cleaned the errors contained in the text annotations of the QaTa-COV19 dataset, adding more valuable data to the ecosystem.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. A multi-modal segmentation method that uses independent text and image encoders, with a GuideDecoder designed to fuse the features of both modalities at the decoding stage.
    2. The proposed method can adaptively propagate sufficient semantic information from the text prompts into pixel-level visual features, promoting consistency between the two modalities.
    3. Ablation studies were also included to investigate the effectiveness of the GuideDecoder and the text annotations.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. Lack of description of results with central tendency (e.g., mean) and variation (e.g., error bars).
    2. The second case in Fig. 2 is not a strong case for demonstrating the effectiveness of the proposed method.
    3. Equation 7, which should give the detailed loss description, appears to be missing.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Reproducible

    1. Training and Evaluation codes available
    2. Model description included
  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    For future work, I would recommend

    1. Comparison with other text-based feature extraction methods.
    2. Extension to MRI and other modalities.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    A new multi-modal segmentation framework is proposed, with good results obtained by adaptively propagating sufficient semantic information from the text prompts into pixel-level visual features. Results with central tendency (e.g., mean) and variation (e.g., error bars) would be preferred.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This paper proposes to fuse textual and visual features in the decoding stage for language-driven medical image segmentation. The idea is interesting. The proposed method makes sense (simple but effective) and the experimental results look great. The paper is also well written and easy to understand. What’s more, the authors have also cleaned the errors contained in the text annotations of the QaTa-COV19 dataset and promised to release the corrected dataset. The only concern I have is the ability to generalize to other datasets, considering QaTa-COV19 is the only dataset used.




Author Feedback

Thanks to all the reviewers (R1, R2, R3, Meta-R) for their acknowledgement of our methodological contribution and their constructive comments for further clarification.

Q1: About QaTa-COV19 being the only dataset used in the experiments. (R1&Meta-R) A1: To the best of our knowledge, QaTa-COV19 is the only dataset with both medical images and the corresponding reports. However, in clinical scenarios, medical images and reports are always paired. We hope that our work will attract more researchers to invest in the construction of medical multi-modal datasets, which would benefit the development of large models in the field of medical image analysis.

Q2: About the relevance of our method for clinical practice. (R1&R2) A2: First of all, the main benefit of our method for clinical practice comes from the position descriptions of the infected area in the text prompts. We performed some preliminary experiments (not included in the paper) in which we replaced and added other words while keeping the location descriptions; the results remained consistent with the reported performance of the proposed method. In addition, actual medical reports also contain many descriptions of the lesion area, and mining this information is one of the motivations for this work. Our ultimate goal is to build a model capable of both performing higher-quality segmentation and generating image-related reports. The current work is part of this goal, and it remains significant, at least for multi-modal deep learning.

Q3: About the flexibility of our model design. (R2) A3: Our model adopts a modular design, containing a separate visual encoder and text encoder (pre-trained on the MIMIC-CXR dataset), and uses the GuideDecoder to fuse multi-modal information. For other types of images (e.g. MRI) and text data from different domains, the encoders in our model can either be loaded with pre-trained weights from different datasets or be directly replaced with other types of encoders (e.g. replacing the visual encoder with a Swin Transformer), allowing flexible iteration of the whole model. At the same time, only the decoder part of the model needs to be trained, as the encoders have already been pre-trained.
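To make the modular design described above concrete, here is a minimal sketch of such an encoder–decoder layout. It is an illustrative assumption rather than the authors' implementation: the class names (GuideDecoder, LanGuideSeg), feature dimensions, single-decoder layout, and the toy encoders in the usage example are all hypothetical.

```python
# Minimal, hypothetical sketch of a language-guided segmentation model with
# swappable encoders and a cross-attention-based guide decoder.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GuideDecoder(nn.Module):
    """Fuses text features into visual features via cross-attention, then upsamples."""
    def __init__(self, vis_dim, txt_dim, out_dim, num_heads=8):
        super().__init__()
        self.txt_proj = nn.Linear(txt_dim, vis_dim)  # align text dim to visual dim
        self.cross_attn = nn.MultiheadAttention(vis_dim, num_heads, batch_first=True)
        self.out_conv = nn.Conv2d(vis_dim, out_dim, kernel_size=3, padding=1)

    def forward(self, vis_feat, txt_feat):
        # vis_feat: (B, C, H, W), txt_feat: (B, L, txt_dim)
        b, c, h, w = vis_feat.shape
        q = vis_feat.flatten(2).transpose(1, 2)        # (B, H*W, C): queries from pixels
        kv = self.txt_proj(txt_feat)                   # (B, L, C): keys/values from text
        fused, _ = self.cross_attn(q, kv, kv)          # propagate text semantics to pixels
        fused = (q + fused).transpose(1, 2).reshape(b, c, h, w)  # residual connection
        fused = F.interpolate(fused, scale_factor=2, mode="bilinear", align_corners=False)
        return self.out_conv(fused)

class LanGuideSeg(nn.Module):
    """Separate (ideally pre-trained, frozen) encoders plus a trainable GuideDecoder."""
    def __init__(self, image_encoder, text_encoder, vis_dim=256, txt_dim=768):
        super().__init__()
        self.image_encoder = image_encoder   # swappable, e.g. a CNN or Swin Transformer
        self.text_encoder = text_encoder     # swappable, e.g. a BERT pre-trained on MIMIC-CXR
        self.decoder = GuideDecoder(vis_dim, txt_dim, out_dim=1)

    def forward(self, image, text_tokens):
        vis_feat = self.image_encoder(image)        # (B, vis_dim, h, w)
        txt_feat = self.text_encoder(text_tokens)   # (B, L, txt_dim)
        return self.decoder(vis_feat, txt_feat)     # per-pixel logits

# Toy usage with stand-in encoders (illustrative only):
img_enc = nn.Sequential(nn.Conv2d(1, 256, 3, stride=16, padding=1))
txt_enc = nn.Embedding(30522, 768)  # stand-in for a pre-trained text encoder
model = LanGuideSeg(img_enc, txt_enc)
logits = model(torch.randn(2, 1, 224, 224), torch.randint(0, 30522, (2, 24)))
```

Because the encoders are injected as ordinary modules, replacing either one only requires that its output shapes match what the decoder expects, which is the sense in which the design is "flexible".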

Q4: Equation 7 is missing? (R1&R3) A4: In our submission, Equation 7 denotes the total loss function, which is a combination of the Cross-entropy loss and the Dice loss. Since both losses are well known and standard, we removed the detailed formulation to keep the paper concise.
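For readers' reference, a common way to write such a combined objective is sketched below; the weighting factor \lambda and the smoothing constant \epsilon are illustrative assumptions and may differ from the formulation that was removed from the paper.

```latex
% Combined segmentation loss: cross-entropy plus Dice loss.
% \lambda (weighting) and \epsilon (smoothing) are illustrative assumptions.
\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{CE}} + \lambda\, \mathcal{L}_{\text{Dice}},
\qquad
\mathcal{L}_{\text{Dice}} = 1 - \frac{2\sum_i p_i\, g_i + \epsilon}{\sum_i p_i + \sum_i g_i + \epsilon}
```

Here p_i denotes the predicted foreground probability and g_i the ground-truth label at pixel i.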


