
Authors

Zhihong Chen, Yuhao Du, Jinpeng Hu, Yang Liu, Guanbin Li, Xiang Wan, Tsung-Hui Chang

Abstract

Medical vision-and-language pre-training provides a feasible solution to extract effective vision-and-language representations from medical images and texts. However, few studies have been dedicated to this field to facilitate medical vision-and-language understanding. In this paper, we propose a self-supervised learning paradigm with multi-modal masked autoencoders (M$^3$AE), which learn cross-modal domain knowledge by reconstructing missing pixels and tokens from randomly masked images and texts. There are three key designs to make this simple approach work. First, considering the different information densities of vision and language, we adopt different masking ratios for the input image and text, where a considerably larger masking ratio is used for images. Second, we use visual and textual features from different layers to perform the reconstruction to deal with the different levels of abstraction in vision and language. Third, we develop different designs for the vision and language decoders (i.e., a Transformer for vision and a multi-layer perceptron for language). To perform a comprehensive evaluation and facilitate further research, we construct a medical vision-and-language benchmark including three tasks. Experimental results demonstrate the effectiveness of our approach, where state-of-the-art results are achieved on all downstream tasks. In addition, we conduct further analysis to better verify the effectiveness of different components of our approach and various settings of pre-training. The source code is available at~\url{https://github.com/zhjohnchan/M3AE}.
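To make the pre-training recipe described in the abstract more concrete, the following is a minimal, self-contained PyTorch sketch of the idea. It is not the authors' implementation (see the linked repository for that); the class name M3AESketch, the feature dimensions, the 75%/15% masking ratios, the reconstruction-layer index, and the decoder shapes are illustrative assumptions only.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F


    class M3AESketch(nn.Module):
        """Toy multi-modal masked autoencoder; all hyper-parameters are illustrative."""

        def __init__(self, dim=256, vocab=30522, num_layers=6, recon_layer=3,
                     patch_dim=16 * 16 * 3, mask_token_id=103):
            super().__init__()
            self.recon_layer = recon_layer
            self.mask_token_id = mask_token_id
            self.patch_embed = nn.Linear(patch_dim, dim)      # flattened 16x16 RGB patches
            self.token_embed = nn.Embedding(vocab, dim)
            self.mask_patch = nn.Parameter(torch.zeros(dim))  # learned [MASK] embedding for patches
            self.layers = nn.ModuleList(
                [nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
                 for _ in range(num_layers)])
            # Different decoder designs: a Transformer block for vision, an MLP for language.
            self.vision_decoder = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
            self.vision_head = nn.Linear(dim, patch_dim)
            self.language_head = nn.Sequential(
                nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, vocab))

        def forward(self, patches, token_ids):
            # Different masking ratios for the two modalities (images carry more redundancy).
            img_mask = torch.rand(patches.shape[:2], device=patches.device) < 0.75
            txt_mask = torch.rand(token_ids.shape, device=patches.device) < 0.15

            v = self.patch_embed(patches)
            v = torch.where(img_mask.unsqueeze(-1), self.mask_patch.expand_as(v), v)
            t = self.token_embed(token_ids.masked_fill(txt_mask, self.mask_token_id))

            x = torch.cat([v, t], dim=1)                      # joint multi-modal encoding
            inter = None
            for i, layer in enumerate(self.layers):
                x = layer(x)
                if i + 1 == self.recon_layer:                 # intermediate visual features for MIM
                    inter = x[:, : v.size(1)]

            pred_pixels = self.vision_head(self.vision_decoder(inter))
            pred_tokens = self.language_head(x[:, v.size(1):])  # last-layer textual features for MLM

            mim_loss = ((pred_pixels - patches) ** 2)[img_mask].mean()
            mlm_loss = F.cross_entropy(pred_tokens[txt_mask], token_ids[txt_mask])
            return mim_loss + mlm_loss


    if __name__ == "__main__":
        model = M3AESketch()
        patches = torch.randn(2, 196, 16 * 16 * 3)            # 2 images, 14x14 patches each
        tokens = torch.randint(0, 30522, (2, 32))             # 2 reports, 32 tokens each
        print(model(patches, tokens))

The sketch masks both modalities in a single pass purely for brevity; the authors' rebuttal below notes that MIM and MLM are run as separate forward passes in their actual training.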

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16443-9_65

SharedIt: https://rdcu.be/cVRzj

Link to the code repository

https://github.com/zhjohnchan/M3AE

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper integrates vision and language using masked autoencoders for joint pre-training. Several important technical modifications and explorations were made on top of the original masked autoencoders, such as the masking ratios, the reconstruction features, and the decoder designs. The pre-trained vision-and-language model yields significant improvement over random initialization and other competitive baseline methods on three representative vision-language tasks, covering five public datasets. An ablation study demonstrates the efficacy of both the vision (MIM) and language (MLM) parts of the model.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Open dataset: All results were obtained from publicly available datasets.
    • Large improvement: Compared with existing methods, this paper achieves noticeable performance gain in multiple public benchmarks.
    • Clear illustration: The description and illustration of the proposed joint pre-training are clear and easy to implement.
    • Sufficient comparison: The proposed method is compared with several competitive methods in each benchmark dataset and shows great improvement.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    No major weakness detected.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    It is easy to implement the idea based on the existing method description.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    Here are some suggestions for improving the paper:

    • Apart from vision-and-language downstream tasks, can this pre-trained model also be useful for vision-only or language-only tasks? How can the pre-trained weights be used for vision and language tasks separately?
    • Please elaborate on how the 15% of masked words in the input text are determined. In the example, the word “opacities” is masked, but most words in a sentence, such as “chest”, “radiograph”, and “shows”, do not carry such critical meaning. Did the authors apply any specific procedure to choose effective masking positions?
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    7

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This is a very nice study that integrates information from chest radiographs and reports for pre-training a ViT. The proposed joint pre-training strategy is fairly novel and easy to implement, and extensive experiments show the pre-trained model is powerful across many vision-language tasks.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    This paper presents a multi-modal masked autoencoder (M^3AE), based on the Transformer, for medical vision-and-language pre-training. Given an image-text pair, the paper introduces a simple training strategy that trains the model to predict the masked regions of the image and the masked words of the text. The experimental results show that the proposed method outperforms the baseline methods on three downstream tasks: medical visual question answering, medical image-text classification, and medical image-caption retrieval.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The proposed approach is simple yet effective.
    2. Experimental results on three downstream tasks show that the proposed method is effective.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. Some key details are not explained well. In the “representation selection for reconstruction” paragraph of Section 2.2, if the image features obtained from the k-th layer are used to reconstruct the input image, how will the last (N_m - k) layers be trained? How will the feed-forward sub-layer of the N_m-th layer be trained?

    2. There are some typos. (1) In the vision-encoder paragraph of Section 2.1, the dimension of $p_n$ should be $p_n \in \mathbb{R}^{P^2 \times C}$. (2) In Fig. 1, the text branch should be labeled “Text Embed” rather than “Image Embed”.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The reproducibility looks good.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    This is a good submission in general. Please consider revising the paper according to the weaknesses.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The proposed approach is simple yet effective. The experimental results on three downstream tasks demonstrate the effectiveness of the proposed method. The ablation study demonstrates the effectiveness of using low-level features for reconstructing the input images. The weaknesses, such as the typos and the missing detailed explanations, could easily be addressed in the final version.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #4

  • Please describe the contribution of the paper

    This paper presents a multi-modal pre-training method for the medical field that learns in a self-supervised manner. The reconstruction uses visual and textual features from different layers to handle the different levels of abstraction in vision and language. The evaluation is performed on multiple multi-modal downstream tasks.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. A multi-modal pre-training method for the medical field is proposed; that is, the proposed multi-modal masked autoencoder learns in a self-supervised way.
    2. Owing to the different information densities, the input image and text use different masking ratios. The reconstruction uses visual and textual features from different layers to handle the different levels of abstraction in vision and language.
    3. Three tasks are evaluated, including Med-VQA, medical image-text classification, and medical image-text retrieval, and large improvements are achieved.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. For the different masking ratios, there is no experiment to verify this perspective, and there is a lack of experiments to explain which ratio is the optimal choice.
    2. For the ablation study on the effectiveness of different layers to perform MIM, the reason why “layer 3 is the best, but the accuracy from layer 4 to layer 6 declines” is not clearly and fully explained and is a bit far-fetched.
    3. There is a lack of experiments to verify the designs that make this simple approach work for the decoders.
    4. The vision encoder is initialized with CLIP-ViT-B, which is equivalent to using the CLIP dataset. In the experiments, the other relevant methods do not seem to be used in this way, which leads to an unfair comparison.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    reproducible

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    Each proposed contribution should be supported by experiments proving that it is effective. Please consider supplementing the relevant experiments to make the paper more solid.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The method is novel for the medical field, and the clarity and organization of this paper are very good. In addition, the method is evaluated on three downstream tasks to demonstrate its effectiveness.

  • Number of papers in your stack

    4

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The reviewers unanimously recommend acceptance of the paper. The paper appears to be timely and introduces novelties to the medical field using recent developments from the machine learning, computer vision, and NLP communities.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    1




Author Feedback

We thank the reviewers for appreciating the contribution of our paper and for constructive comments on potential improvements.

Response to Reviewer #1:

  • To perform vision-only or language-only tasks, there are two possible directions to explore in future work: 1) using only the vision or the language encoder; 2) designing prompts to induce better visual or textual representations via prompt learning.
  • Thanks for the valuable feedback. MLM and MIM are performed in separate forward passes to avoid the mentioned problem during training (a small illustrative sketch follows below). We will elaborate on this more clearly in the final version.
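For illustration only, a training step under such a two-pass scheme might look like the sketch below; `model`, `mim_loss`, and `mlm_loss` are hypothetical names, not identifiers from the released M3AE code.

    def training_step(model, optimizer, patches, token_ids):
        """Hypothetical sketch: MIM and MLM run as two separate forward passes."""
        optimizer.zero_grad()
        # Pass 1: image patches masked, text left intact -> masked image modeling loss.
        loss_mim = model.mim_loss(patches, token_ids)
        # Pass 2: text tokens masked, image left intact -> masked language modeling loss.
        loss_mlm = model.mlm_loss(patches, token_ids)
        (loss_mim + loss_mlm).backward()   # both objectives update the shared encoder
        optimizer.step()
        return loss_mim.item(), loss_mlm.item()

Because the MLM head sits on top of the full encoder, gradients from the MLM loss still reach the layers above the reconstruction layer used for MIM, which is the point made in the response to Reviewer #3 below.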

Response to Reviewer #3:

  • Although the last (N_m-k) layers of the vision part are not used for MIM, they still need to produce representations for the language part, and thus they can be trained during the MLM process.
  • Thanks. We have fixed these typos and checked the paper carefully to fix all the typos.

Response to Reviewer #4:

  • Thanks for the valuable feedback. We will discuss these points using the extra space in the final version.

Response to Meta-Reviews:

  • Thanks for appreciating the contribution.

We will fully address the concerns raised by the reviewers in the final version.


