
Authors

Akimichi Ichinose, Taro Hatsutani, Keigo Nakamura, Yoshiro Kitamura, Satoshi Iizuka, Edgar Simo-Serra, Shoji Kido, Noriyuki Tomiyama

Abstract

Building a large-scale training dataset is an essential problem in the development of medical image recognition systems. Visual grounding techniques, which automatically associate objects in images with corresponding descriptions, can facilitate labeling of a large number of images. However, visual grounding of radiology reports for CT images remains challenging, because many kinds of anomalies are detectable via CT imaging, and the resulting report descriptions are long and complex. In this paper, we present the first visual grounding framework designed for CT image and report pairs covering various body parts and diverse anomaly types. Our framework combines two components: 1) anatomical segmentation of images, and 2) report structuring. The anatomical segmentation provides multiple organ masks of a given CT image and helps the grounding model recognize detailed anatomies. The report structuring helps to accurately extract information regarding the presence, location, and type of each anomaly described in the corresponding report. Given the two additional image/report features, the grounding model can achieve better localization. In the verification process, we constructed a large-scale dataset with region-description correspondence annotations for 10,410 studies of 7,321 unique patients. We evaluated our framework using grounding accuracy, the percentage of correctly localized anomalies, as a metric, and demonstrated that the combination of the anatomical segmentation and the report structuring improves performance by a large margin over the baseline model (66.0% vs. 77.8%). Comparison with prior techniques also showed the higher performance of our method.
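
For readers unfamiliar with the metric, the sketch below illustrates how grounding accuracy of this kind is typically computed: the overlap between each predicted 3D localization region and its reference region is measured by IoU, and an anomaly counts as correctly localized when the overlap reaches a threshold. This is an illustrative reconstruction, not the authors' code; the function names and the default threshold of 0.1 are assumptions (the reviews below mention thresholds from 0.1 to 0.5).

    import numpy as np

    def iou_3d(pred, gt):
        # Intersection-over-union between two binary 3D masks.
        pred, gt = pred.astype(bool), gt.astype(bool)
        union = np.logical_or(pred, gt).sum()
        return np.logical_and(pred, gt).sum() / union if union else 0.0

    def grounding_accuracy(pred_masks, gt_masks, iou_threshold=0.1):
        # Fraction of anomalies whose predicted region overlaps the
        # reference region with IoU at or above the threshold.
        hits = [iou_3d(p, g) >= iou_threshold
                for p, g in zip(pred_masks, gt_masks)]
        return sum(hits) / len(hits)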

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43904-9_59

SharedIt: https://rdcu.be/dnwH5

Link to the code repository

N/A

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper proposes a multi-stage architecture to predict localization maps on 3D CT scans for anomalies mentioned in medical reports. The authors use instance segmentation maps generated by a U-Net model as auxiliary information to generate source image embeddings, and classify phrases and details into per-anomaly groups using a BERT-like model to generate target text embeddings. The final localization maps are obtained through source-target attention of the embeddings. The authors claim that their approach is effective and achieves higher performance compared to prior techniques. They also constructed a large-scale dataset with region-description correspondence annotations to verify their framework.
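
    To make the final step concrete, the sketch below shows one plausible form of the source-target attention described above: each anomaly's text embedding attends over the image voxel embeddings to produce a 3D localization map. It is a minimal illustration under assumed tensor shapes, not the authors' implementation; in particular, the softmax normalization over voxels is an assumption.

        import numpy as np

        def softmax(x, axis=-1):
            e = np.exp(x - x.max(axis=axis, keepdims=True))
            return e / e.sum(axis=axis, keepdims=True)

        def localize(voxel_emb, anomaly_emb):
            # voxel_emb:   (D, H, W, C) image features (encoder output fused with organ masks)
            # anomaly_emb: (N, C)       one embedding per structured anomaly
            # returns:     (N, D, H, W) attention maps over the volume
            d, h, w, c = voxel_emb.shape
            flat = voxel_emb.reshape(-1, c)              # (D*H*W, C)
            scores = anomaly_emb @ flat.T / np.sqrt(c)   # (N, D*H*W)
            return softmax(scores, axis=-1).reshape(-1, d, h, w)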

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The paper proposes a novel approach for visual grounding on 3D CT scans, a modality that has not been explored before.
    2. The paper is well written, with a clear explanation of the problem and proposed solution.
    3. A large-scale dataset is utilized to demonstrate the effectiveness of the proposed method, which is a significant contribution.
    4. The paper utilizes instance segmentation maps as auxiliary information, which is a novel approach to improve image embeddings.
    5. The results, shown in Fig. 4, cover an increased number of abnormality types in the CT modality.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The reason for using LSTM to obtain the representative embedding in the ‘Anomaly-Wise Feature Aggregator’ section needs to be explained.

    It is unclear whether the ground truth used to calculate IoU for each anomaly is the corresponding binary mask or the bounding box.

    The comparison to prior art in visual grounding is relatively weak, with only two previous transformer-based works included and no non-transformer-based methods evaluated.

    The localization results for other IoU values (from 0.1 to 0.5) should be included, at least in the supplementary material, especially the results at IoU 0.5.

    It is unclear whether the code and dataset used in the paper will be made publicly available in the future.

    The paper does not indicate whether the model follows a multi-stage approach within the Related Work section.

    The authors could have more clearly stated that CT reports are in Japanese, which would explain the use of “character-embeddings”. Diagrams hint at English text.

    The training procedure for Synapse3D is unclear, and external datasets may have been used, making comparisons to models without anatomical segmentation unfair, since additional supervision is used.

    The authors do not specify how they train/adapt TransVG and MDETR models for the 3D modality or how localization maps are generated.

    The authors do not describe how the Anomaly-Wise Feature Aggregator is run without Report Structuring for the baseline methods.

    Given that the improvements suggested can be generalized to 2D X-Ray reports, comparing this method to the state-of-the-art for this modality would better quantify the improvements.

    Papers [20] and [24] referenced to describe the training procedure for Report Structuring are not publicly available.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The code and dataset have not been made publicly available. As stated above, the paper lacks many critical details on how certain subnetworks are trained and how the comparison experiments are conducted.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    See the weakness section.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Good quality paper with a novel idea, a clear introduction of the method, a detailed supplementary, and well-organized experiments, though there are some possible improvements in the comparison study and evaluation data.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    6

  • [Post rebuttal] Please justify your decision

    After reading the reviews and the rebuttal, which has addressed my concerns, I would like to keep my rating.



Review #2

  • Please describe the contribution of the paper

    Different from the mainstream 2D medical visual grounding models, this paper presents a visual grounding framework designed for CT image and report pairs covering various body parts and diverse anomaly types.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This paper provides a novel visual grounding task on 3D medical CT images. The figures and tables provided in this paper are clear.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. This paper proposes a new task for 3D visual grounding, but the method used is ordinary and does not highlight the advantages of solving 3D problems. In addition, the proposed framework relies heavily on existing annotation software, and the difference between the 2D and 3D tasks is not clear.
    2. The experiments are not sufficient: few models participate in the comparative experiments, and there is no comparison with the baseline in the qualitative analysis. In addition, the visual grounding results provided in Figure 3 show no difference from a 2D task.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    This paper does not release the dataset, code, or a link to the image annotation software (Synapse 3D V6.8, FUJIFILM Corporation, Japan), which results in poor reproducibility.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    1. This paper proposes a new task for 3D visual grounding, but the method used is ordinary and does not highlight the advantages of solving 3D problems. The proposed framework relies heavily on existing annotation software, and the difference between the 2D and 3D tasks is not clear.
    2. Section 3 lacks a clear explanation of why complex 3D features can be learned. Moreover, the Anomaly-Wise Feature Aggregator uses an LSTM to aggregate different embeddings, which could be replaced by more effective recent models, such as the attention mechanism.
    3. Using an early VGG model as the visual encoder may not effectively capture 3D visual information, which could affect the performance of later modules.
    4. The experiments are not sufficient: few models participate in the comparative experiments, and there is no comparison with the baseline in Figure 3. In addition, the visual grounding results provided in Figure 3 show no difference from a 2D task.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper provides a novel visual grounding task on 3D medical CT images. However, the proposed framework and the experimental results could not support the claimed superiority in this 3D task.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    This paper proposes a new visual grounding framework for CT images and radiology reports. The framework is composed of three parts: an anatomical segmentation model, a report structuring model and a grounding model. It is claimed as the first visual grounding work for CT images. The proposed method is evaluated on a large in-house dataset.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Both the task of visual grounding on CT images and reports and the proposed method are novel. The experiments verify the effectiveness of the proposed method.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The description of the method is a little confusing. Please re-organize or re-phrase the method section with more explanation of introduced concepts.

    It would be good if more baseline methods were used for comparison.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The results are reproducible only if the code and the dataset are released.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    Please give a clearer and fuller explanation of the grouping of phrases in Section 3.3 and of how the characters are related to anomalies in Section 3.4. As mentioned above, it would be better to compare with more baseline methods.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Although the number of baseline methods is limited, the overall task and method are novel and achieve good performance on a large dataset.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #4

  • Please describe the contribution of the paper

    In this paper, a visual grounding framework designed for CT image and report pairs covering various body parts and diverse anomaly types is presented. The framework combines anatomical segmentation of images and report structuring.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. A large, diverse CT dataset was used;
    2. An innovative visual grounding framework was proposed;
    3. A solid evaluation methodology was applied.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    It would be great if you could release the dataset with the annotations and the code with all the parameters used.

    Would it be possible to apply the trained model to any existing public dataset and provide some results as a benchmark?

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    It would be great if you could release the dataset with the annotations and the code with all the parameters used. Otherwise, it is hard to reproduce.

    Would it be possible to apply the trained model to any existing public dataset and provide some results as a benchmark?

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    It would be great if you could release the dataset with the annotations and the code with all the parameters used. Otherwise, it is hard to reproduce.

    Would it be possible to apply the trained model to any existing public dataset and provide some results as a benchmark?

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    1. A large, diverse CT dataset was used;
    2. An innovative visual grounding framework was proposed;
    3. A solid evaluation methodology was applied.
  • Reviewer confidence

    Somewhat confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This paper received four reviews, with one reject and three accepts. From the detailed comments, this is a paper with mixed opinions, so we invite the authors to provide a rebuttal. The authors are trying to solve an interesting and critical clinical problem that is of strong interest. However, the proposed method seems too preliminary to fully solve this problem. The proposed technical pipeline (which is very basic multi-modality deep learning) achieves some interesting results but has strong limitations with respect to clinical-level performance.




Author Feedback

We thank the reviewers for their thoughtful feedback and valuable comments. Here, we clarify the concerns and will address them in the final paper.

Originality of the method (R2)

We would like to point out that the focus of the work is on the application side and the novelty of the task, this being the first application of visual grounding to 3D medical images. Visual grounding for 3D CT images had not been tackled before because it involves a large number of anomaly types and a wide variety of expressions in the reports. Another of our contributions is the technical approach. We started by defining the organs and anomaly types to be recognized, with the grounding model designed to decrease the complexity of the grounding task by introducing anatomical segmentation and report structuring. These were done with a deep understanding of the content of 3D CT reports. We consider the whole grounding framework to be new.

Comparison with existing SOTA works (R1/R2/R3):

We evaluated several state-of-the-art methods that use attention or transformer architectures. Although these models are known to improve performance in most applications, this comes at a high computation and memory cost and generally requires large amounts of training data. Although we curated a large training dataset for the CT grounding task, we found that the attention models failed to achieve good generalization performance. At present, we conclude that incorporating domain knowledge, such as the organ segmentation and the report structuring, is more important than leveraging the latest deep learning architecture as a backend. In the comparison study, we demonstrated the superiority of the proposed method on this task by comparing it to strong baseline models such as TransVG and MDETR.

Gap with the level of clinical applicability (Meta):

We agree that the current approach is likely not at a clinical level on average, as pointed out. However, our results indicate that the performance reaches a clinical level depending on the body part and the anomaly type. For example, prostate mass and kidney/gallbladder swelling reach 80% by the volume Dice coefficient metric. On the other hand, we find that the model mainly performs poorly for anomalies that are difficult to detect even by humans or for which the number of cases (disease prevalence) is small, such as cysts in the lungs and embolisms in the lower limbs. Improving grounding performance for these targets will be important future work. This paper is a precursor to such studies, and we believe it establishes a methodology that can serve as a base for future work, stimulating research in this exciting new topic.

Reproducibility of the paper (R1/R4):

Due to confidentiality agreements with joint research partners and patient confidentiality, we are unable to release the dataset and the code at the current time. We are in the process of confirming with the relevant parties whether the data can be released; however, despite our desire to make the data and code available, we cannot guarantee that we can make it all fully available.

Clarity (R2/R3):

A. Grouping of phrases: We used the Mention Pooling architecture shown in [1] to infer whether there is a relationship between an anomaly phrase and other phrases, which results in grouping the phrases related to the same anomaly. If multiple anatomical phrases end up in the same group, they are split into separate groups on a rule basis (e.g., ['right S1', 'left S6', 'nodule'] -> ['right S1', 'nodule'], ['left S6', 'nodule']).

B. Comparison study: To adapt TransVG and MDETR to the 3D modality, the backbone was changed to a VGG-like network with 3D convolution layers, the same as in the proposed method. The localization maps were created by filling the detected box regions with the confidence score of each box.
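
To illustrate the two clarifications above, the sketch below shows (A) the rule-based splitting of a phrase group that contains multiple anatomical phrases, and (B) turning detected 3D boxes and confidence scores into a localization map. It is a minimal reconstruction under assumed data structures (typed phrase tuples, z/y/x box coordinates), not the authors' code.

    import numpy as np

    def split_mixed_group(group):
        # (A) If several anatomical phrases share one group with an anomaly phrase,
        # emit one group per anatomical phrase. `group` is a list of (phrase, type)
        # tuples; phrase types are assumed to come from the report structuring model.
        anatomies = [item for item in group if item[1] == "anatomy"]
        others = [item for item in group if item[1] != "anatomy"]
        if len(anatomies) <= 1:
            return [group]
        return [[anat] + others for anat in anatomies]

    # [('right S1', 'anatomy'), ('left S6', 'anatomy'), ('nodule', 'anomaly')] becomes
    # two groups: [('right S1', 'anatomy'), ('nodule', 'anomaly')] and
    #             [('left S6', 'anatomy'), ('nodule', 'anomaly')]

    def boxes_to_map(boxes, scores, volume_shape):
        # (B) Fill each detected box region with its confidence score, keeping the
        # maximum where boxes overlap, to obtain a 3D localization map.
        vol = np.zeros(volume_shape, dtype=np.float32)
        for (z0, y0, x0, z1, y1, x1), s in zip(boxes, scores):
            vol[z0:z1, y0:y1, x0:x1] = np.maximum(vol[z0:z1, y0:y1, x0:x1], s)
        return vol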




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    By integrating all information, this paper is mostly agreed by the reviewers to be, as stated, "the overall task and method is novel, and achieved good performance on a large dataset; evaluation is sufficient." The AC read the paper and agrees with the general assessment by the reviewers. This is an interesting problem. Although the AC may be skeptical about how well the proposed method can actually solve the lesion/finding localization problem (perhaps only for more apparent/obvious/salient findings, not subtle ones), this work deserves to be followed by peers to continue along this direction of work.



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The practical application of the proposed technique is still questionable.



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This paper has merit: a novel idea, as agreed by all reviewers, and some of the unclear points were already addressed in the rebuttal; not all, but it is in good shape. I think the novelty of the approach is on the clinical side, and the application has never been done before for 3D volumetric CT.


