Authors

Cheng Chen, Aoxiao Zhong, Dufan Wu, Jie Luo, Quanzheng Li

Abstract

Self-supervised learning (SSL) of visual representations from paired medical images and text reports has recently shown great promise for various downstream tasks. However, previous work has focused on investigating the effectiveness of two major SSL techniques separately, i.e., contrastive learning and masked autoencoding, without exploring their potential synergies. In this paper, we aim to integrate the strengths of these two techniques by proposing a contrastive masked image-text modeling framework for medical visual representation learning. On one hand, our framework conducts cross-modal contrastive learning between masked medical images and text reports, with a representation decoder being incorporated to recover the misaligned information in the masked images. On the other hand, to further leverage masked autoencoding, a masked image is also required to be able to reconstruct the original image itself and the masked information in the text reports. With pre-training on a large-scale medical image and report dataset, our framework shows complementary benefits of integrating the two SSL techniques on four downstream classification datasets. Extensive evaluations demonstrate consistent improvements of our method over state-of-the-art approaches, especially when very scarce labeled data are available.
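To make the contrastive half of the abstract concrete, the following is a minimal sketch of random patch masking and a symmetric image-text contrastive (InfoNCE-style) loss. It is not taken from the paper or its repository: the function names, the 0.75 mask ratio, and the 0.07 temperature are illustrative assumptions, and the embeddings are plain Python lists rather than encoder outputs.

```python
import math
import random


def random_patch_mask(num_patches, mask_ratio, seed=0):
    """Return the sorted indices of the patches that stay visible.

    mask_ratio is an assumed hyper-parameter, not a value from the paper.
    """
    rng = random.Random(seed)
    num_masked = int(num_patches * mask_ratio)
    masked = set(rng.sample(range(num_patches), num_masked))
    return [i for i in range(num_patches) if i not in masked]


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    top = max(row)
    log_sum = top + math.log(sum(math.exp(x - top) for x in row))
    return [x - log_sum for x in row]


def symmetric_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Average of image-to-text and text-to-image cross-entropy.

    Pair k (image k, report k) is the positive; every other pairing in
    the batch is treated as a negative.
    """
    imgs = [l2_normalize(v) for v in image_emb]
    txts = [l2_normalize(v) for v in text_emb]
    n = len(imgs)
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    image_to_text = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    text_to_image = -sum(log_softmax(columns[k])[k] for k in range(n)) / n
    return 0.5 * (image_to_text + text_to_image)
```

In a setup like the one the abstract describes, the image embeddings would come from an encoder that sees only the visible patches; here they are passed in directly, so the sketch only shows how matched pairs are pulled together and mismatched pairs pushed apart.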

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43904-9_48

SharedIt: https://rdcu.be/dnwHt

Link to the code repository

https://github.com/cchen-cc/CMITM

Link to the dataset(s)

N/A

Reviews

Review #1

Please describe the contribution of the paper

This paper presents a novel investigation of two self-supervised approaches in a combined manner to learn medical visual representation. It takes advantage of cross-modal contrastive learning and masked image-text modeling at the same time. Experimental results on several downstream tasks and comparisons with SOTA approaches show promising performance.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

This paper explores the benefits of combining two SSL approaches to obtain a better medical visual representation. This general approach can be useful for other downstream tasks where labeling data is challenging and labels are limited.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

The two SSL approaches are each already used individually in the medical imaging domain, and their combination has already begun to be explored in the image domain.
Please rate the clarity and organization of this paper

Very Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The proposed approach, hyper-parameters, training procedure, and hardware configuration are well defined. The experimental datasets are publicly available. Considering these points, the paper should be reproducible.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

a) When the approaches are fine-tuned with 1%, 10%, and 100% of the labeled training data, did the authors ensure that the same data subsets were used across all approaches, or were the subsets sampled randomly for each approach? b) The limitations of the proposed approach and future directions are not discussed.
Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

7
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

This paper presents a novel combined SSL approach in the clinical domain and performs proper experiments and an ablation study. Several downstream tasks can benefit from using the representation. Overall, this approach benefits the clinical domain by providing better medical visual representations.
Reviewer confidence

Very confident
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

N/A
[Post rebuttal] Please justify your decision

N/A

Review #2

Please describe the contribution of the paper

This paper proposes an SSL approach for medical visual representation learning by integrating contrastive learning and masked autoencoding. The proposed method conducts cross-modal contrastive learning between masked medical images and text reports, with a representation decoder being used to recover the misaligned information in the masked images.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

1) The paper proposes a new framework combining contrastive learning and masked autoencoding to jointly learn image-text representations, using a staggered learning approach that trains the network with the reconstruction loss first, followed by the contrastive loss.

2) Strong evaluation that compares the proposed approach to SOTA methods such as MAE and MGCA.

3) Very clear description of the pre-training and fine tuning stages and the datasets (splits) involved in each stage.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

The broader ideas of combining contrastive learning and masked autoencoding, and the cascaded training methodology for such a combination, were proposed in a previous work (Layer Grafted Pre-training), which the authors cite. However, this work is still an interesting application of that idea to learn cross-modal representations on medical text and images.
Please rate the clarity and organization of this paper

Very Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The authors clearly describe their experimental and hardware setup and the parameter configurations of the model. Further, the authors will publish the code and the model weights upon acceptance. With this information, I’m confident that the results in this paper can be reproduced.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

“Their effectiveness demonstrates the use of medical reports as a free supervision signal for learning general image representations.” It would be great if the authors could clarify this statement and claim.
Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

6
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The paper proposes an interesting framework that combines contrastive learning and masked autoencoding to learn visual-text representations. The paper clearly articulates the framework design choices and the reasoning behind the training methodology and setup. Finally, the paper has a robust evaluation.
Reviewer confidence

Somewhat confident
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

N/A
[Post rebuttal] Please justify your decision

N/A

Review #3

Please describe the contribution of the paper

The authors present a novel framework for medical visual representation learning by integrating the strengths of both cross-modal contrastive learning and masked image-text modeling. The effectiveness of the proposed method is demonstrated on four downstream classification datasets, consistently improving data efficiency under data-scarce scenarios.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
1. A cross-modal contrastive learning between masked medical images and text reports, with a representation decoder being incorporated to recover the misaligned information in the masked images.
2. The proposed method may be potentially useful for next-generation medical image-text retrieval.
3. Ablation studies were also included to investigate the effectiveness of various modules.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
1. Lack of description of results with central tendency (e.g. mean) & variation (e.g. error bars).
2. The first and second stages of the pre-training process in Fig. 1 are somewhat unclear, and a description of how testing is carried out is missing from the Method section.
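The first weakness above asks for central tendency and variation. A minimal sketch of how per-seed scores could be summarized is shown below; the function names are hypothetical, and any scores passed in would be the user's own repeated-run results, not numbers from the paper.

```python
import statistics


def summarize_runs(scores):
    """Return (mean, sample standard deviation) over repeated runs."""
    mean = statistics.mean(scores)
    # The sample standard deviation needs at least two runs.
    std = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mean, std


def format_mean_std(scores, digits=3):
    """Format repeated-run scores as 'mean +/- std' for a results table."""
    mean, std = summarize_runs(scores)
    return f"{mean:.{digits}f} +/- {std:.{digits}f}"


# Placeholder values for illustration only; they are NOT results from the paper.
hypothetical_auc_by_seed = [0.80, 0.82, 0.84]
print(format_mean_std(hypothetical_auc_by_seed))
```

The standard deviation here could equally be drawn as error bars in a plot; which spread measure to report (standard deviation, standard error, or a confidence interval) is a choice left to the authors.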
Please rate the clarity and organization of this paper

Satisfactory
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
Reproducible
1. Training and Evaluation codes available
2. Model description included
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
For future work, I would recommend,
1. Extension to cross-modal retrieval tasks.
2. Extension to MRI and other modalities.
Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

5
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

A new contrastive masked image-text modeling (CMITM) framework for medical visual representation learning is proposed, with good results. Reporting results with central tendency (e.g. mean) and variation (e.g. error bars) would be preferred.
Reviewer confidence

Confident but not absolutely certain
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

N/A
[Post rebuttal] Please justify your decision

N/A

Primary Meta-Review

Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

In this paper, a novel approach for medical visual representation learning is introduced, which combines contrastive learning and masked autoencoding within a self-supervised learning framework. The proposed method leverages cross-modal contrastive learning by aligning masked medical images with corresponding text reports. A representation decoder is used to reconstruct the misaligned information present in the masked images. The paper clearly communicates the rationale behind the design choices and the training methodology, providing a clear understanding of the framework. Additionally, the paper presents a thorough evaluation of the method. The camera-ready version still needs to address a few remaining questions and suggestions. These include clarification of the data divisions and a discussion of the limitations of the work. Thanks for this high-quality piece of research!

Author Feedback

We thank all the reviewers and the meta-reviewer for their valuable time and affirmative comments on the proposed novel multi-modal self-supervised learning approach for medical visual representation learning and the thorough evaluation of the method.

We will include the clarification on data divisions (which are the same for all the methods) and training/testing process, and also discuss the limitations of the work in our final version.
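One way to keep data divisions identical across methods, as the feedback above states was done, is to derive every labeled subset from a single seeded shuffle. The sketch below is an assumption about how this could be implemented, not the authors' code; the function name, the seed, and the use of integer sample IDs are all hypothetical.

```python
import random


def fixed_label_subsets(sample_ids, fractions=(0.01, 0.1, 1.0), seed=42):
    """Map each label fraction to a reproducible subset of sample IDs.

    All subsets are prefixes of one seeded shuffle, so every method that
    calls this with the same seed fine-tunes on exactly the same samples,
    and each smaller subset is contained in the larger ones.
    """
    order = sorted(sample_ids)  # canonical order before shuffling
    random.Random(seed).shuffle(order)
    return {
        fraction: order[: max(1, round(len(order) * fraction))]
        for fraction in fractions
    }


subsets = fixed_label_subsets(range(1000))
print({fraction: len(ids) for fraction, ids in subsets.items()})
```

Taking prefixes of one shuffle, rather than sampling each fraction independently, also means that going from 1% to 10% only adds labels, which makes the label-efficiency comparison easier to interpret.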


Contrastive Masked Image-Text Modeling for Medical Visual Representation Learning