
Authors

Chenyu You, Ruihan Zhao, Lawrence H. Staib, James S. Duncan

Abstract

Contrastive learning (CL) aims to learn useful representations without relying on expert annotations in the context of medical image segmentation. Existing approaches mainly contrast a single positive vector (i.e., an augmentation of the same image) against a set of negatives within the entire remainder of the batch by simply mapping all input features into the same constant vector. Despite the impressive empirical performance, those methods have the following shortcomings: (1) it remains a formidable challenge to prevent collapse to trivial solutions; and (2) we argue that not all voxels within the same image are equally positive, since there exist dissimilar anatomical structures within the same image. In this work, we present a novel Contrastive Voxel-wise Representation Learning (CVRL) method to effectively learn low-level and high-level features by capturing 3D spatial context and rich anatomical information along both the feature and the batch dimensions. Specifically, we first introduce a novel CL strategy to ensure feature diversity promotion among the 3D representation dimensions. We train the framework through bi-level contrastive optimization (i.e., low-level and high-level) on 3D images. Experiments on two benchmark datasets and different labeled settings demonstrate the superiority of our proposed framework. More importantly, we also prove that our method inherits the benefit of the hardness-aware property from standard CL approaches.



Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16440-8_61

SharedIt: https://rdcu.be/cVRwO

Link to the code repository

N/A

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper introduces an approach to learn semi-supervised 3D medical image segmentation networks by leveraging four key objectives - (1) a simple voxel-wise contrastive learning objective against an EMA target network, contrasting along the feature dimension, (2) a dimensional contrastive objective that contrasts along the batch dimension, (3) a consistency loss which encourages the student to directly match the output of the EMA target, and (4) a supervised objective built around a cross-entropy and Dice loss.
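    A compact way to read these four terms as a single training objective (a sketch only; the loss symbols and trade-off weights below are placeholders, not notation from the paper):

```latex
% Hypothetical summary of the four objectives listed above; \lambda_1..\lambda_3
% are unspecified trade-off weights, not values reported in the paper.
\mathcal{L}_{\mathrm{total}}
  = \underbrace{\mathcal{L}_{\mathrm{ce}} + \mathcal{L}_{\mathrm{dice}}}_{(4)\ \text{supervised}}
  + \lambda_1 \underbrace{\mathcal{L}_{\mathrm{voxel}}}_{(1)\ \text{voxel-wise CL}}
  + \lambda_2 \underbrace{\mathcal{L}_{\mathrm{dim}}}_{(2)\ \text{dimensional CL}}
  + \lambda_3 \underbrace{\mathcal{L}_{\mathrm{con}}}_{(3)\ \text{consistency}}
```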

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    While the novelty of each component is limited, the combined 3D semi-supervised segmentation network learning pipeline is, to the best of my knowledge, novel. In particular however, the main selling point of this work is the final semi-supervised segmentation performance achieved by the proposed method, beating out competing methods by a notable margin especially in the low-supervision regime.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    My primary issue with this paper is the structure and writing, which make it very hard to parse exactly how each of the proposed objectives operates, with confusing or, in parts, missing notation and definitions. In particular,

    • No additional information on the supervised loss is given, only that it comprises a cross-entropy and Dice loss. Are they weighted? Is it a smoothed Dice loss variant? Are any additional hyperparameters introduced here?
    • Throughout the work, the EMA network is treated as a separate entity and, effectively, an independent teacher network (see e.g. Fig. 1, and the separate student/teacher notation throughout the text).
    • In general, it is incredibly hard to understand where and how each objective is applied in the overall quite expansive pipeline setup - the notation of the high- and low-level contrastive losses is introduced at the beginning of sections 1.1 and 1.2, but with hardly any motivation. If I understand correctly, low- and high-level refer to losses applied in the latent spaces of the first encoder and of the last encoder, which is, however, described as being of “similar architecture”. What then makes the contrastive losses high- and low-level? And since parts of Fig. 1 and Fig. 2 are never specifically referenced in the paper, it becomes quite hard to understand where each component is applied. Similarly, equations 1-3 would benefit from more indices to highlight much more precisely which components are contrasted against which.
    • As it is not made clear throughout the paper - are stop-gradient operations applied anywhere? Or does backpropagation also happen into the EMA teacher network?

    In addition, it is not entirely clear what the main novel contribution is - MT [21] introduces output matching to an EMA target, component (de-)correlation along the batch axis has been introduced e.g. in Barlow Twins (although with a different formulation), and the components of the supervised objective are commonly used in the literature. And while the authors claim their regularization to be “anatomy”-informed, it is not entirely clear how this is reflected in the objective. Is it the fact that the volume cube is broken down into voxels?

    There is also no experimental support for the claim that CVRL makes the proposed setup less prone to dimensional collapse - just looking at the baseline Dice/Jaccard performances, which are overall worse, there is no indication that each pipeline component does what the authors claim it should be doing.

    Some smaller issues:

    • Fig. 3 is incredibly cherry-picked and makes CVRL stand out disproportionately.
    • Table 2 technically misses the L^low + L^con and L^high + L^con references. Why were they not included?
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    While the majority of hyperparameters and pipeline settings are listed in the experimental section, the limited clarity of the paper makes reproduction harder than it has to be.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    See Section 5 - in general, the results are really convincing. If the method details and paper are made clearer and the main novel contributions are carved out better, the overall quality would significantly increase.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Given the results and the general pipeline making overall sense, I would still opt for acceptance, but strongly urge the authors to expand the method section with more details.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    3

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    This paper proposes a contrastive semi-supervised learning scheme for 3D image segmentation. The contrastive objective is taken along the feature and the batch dimension, and the optimization is performed at a low level and a high level. The method is evaluated on a dataset for atrial segmentation and a dataset for pancreas segmentation. Results for a comparison to the state of the art and for an ablation study are shown.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The method description is very detailed and clear. Overall, the paper is easy to follow. The evaluation seems to be complete, and the ablation study provides a good validation of the effectiveness of the different loss functions. Grid search for hyperparameter tuning is also performed. According to the results, the proposed method provides a good solution for 3D semi-supervised segmentation if only very few labeled images are available.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • On page 8, it is unclear what is meant by “efficacy of both inter-instance and intra-instance constraints”. Both terms were mentioned before, but what do they mean? Please describe this.
    • In Figure 3, what do the red and blue lines stand for?
    • In Figure 2, the color coding (yellow, red, green and blue) is not clear.
    • Describe what is meant by the hardness-aware property. While it is stated that the hardness-aware property is inherited, it is unclear what advantages this provides and why.
    • In Table A2 in the appendix, I guess these are the results for the LA dataset. This must be written in the caption.
    • In the appendix, I would make two separate sections: Appendix A for additional results, and Appendix B for the proof of the hardness-aware property.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The method and architecture are well described. I trust that the code will be publicly available, as the authors wrote in the abstract.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    Please consider all points listed under “weaknesses”.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The method improves state-of-the-art approaches in cases where only very few labeled images are available. This is a likely scenario in real-world applications. The evaluation of the method is elaborate, and well described. The ablation study covers all building blocks of the method. Overall, the paper is well organized and easy to follow.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The contribution of this paper is a voxel-wise contrastive learning approach that leverages the contrastive loss in both the bottleneck feature space and the segmentation space.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The strength of this paper is to adapt the contrastive learning idea to 3D space and to use the unlabeled dataset with a dimension-wise contrastive objective in a semi-supervised setting.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The main weakness is that a similar idea already exists for pixel-level contrastive learning with a memory bank [1]. Figure 1 specifies the low-level contrastive loss and the high-level contrastive loss; however, it is hard to follow how these losses are defined. Another weakness is the lack of contrastive learning baselines and citations for the segmentation task.

    [1] Alonso, Inigo, et al. “Semi-supervised semantic segmentation with pixel-level contrastive learning from a class-wise memory bank.” Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The reproducibility of the paper is good.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    From the experimental perspective, I would recommend adding more state-of-the-art contrastive learning baselines and citing more contrastive learning methods for segmentation, such as:

    [1] Wang, Wenguan, et al. “Exploring cross-image pixel contrast for semantic segmentation.” Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021.

    [2] Hu, Xinrong, et al. “Semi-supervised contrastive learning for label-efficient medical image segmentation.” International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, Cham, 2021.

    [3] Lee, Ho Hin, et al. “Semantic-Aware Contrastive Learning for Multi-object Medical Image Segmentation.” arXiv preprint arXiv:2106.01596 (2021).

    I also have a concern about the innovation. According to Figure 1, only the unlabeled dataset contributes to the high-level contrastive loss; is it possible to also use the labeled dataset to compute the high-level contrastive loss?

    For the contrastive loss section, is the positive pair defined voxel by voxel? What are the benefits of your proposed contrastive loss over a pixel-wise contrastive loss? It would be great to add experiments and provide more clarity on your innovations.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The lack of contrastive learning baselines and of sufficient citations of contrastive learning methods to establish confidence in the proposed idea.

  • Number of papers in your stack

    2

  • What is the ranking of this paper in your review stack?

    3

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This paper proposes a contrastive semi-supervised learning scheme for 3D image segmentation. The method is evaluated on a dataset for atrial segmentation, and a dataset for pancreas segmentation.

    While the novelty of each component is limited, the combined 3D semi-supervised segmentation network learning pipeline is, to the best of my knowledge, novel. In particular, however, the main selling point of this work is the final semi-supervised segmentation performance achieved by the proposed method, beating out competing methods by a notable margin, especially in the low-supervision regime.

    There is disagreement between the reviewers on the organisation and clarity of the paper. The novelty of contribution (1) is indicated as limited.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    3




Author Feedback

To R1: Thanks!

missing notation and definitions/figures: We will follow your advice and revise the final version accordingly.

supervised loss: We use an equal combination of cross-entropy and Dice loss, following the same setting as [14], and do not introduce additional hyperparameters.
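A minimal sketch of such an equally weighted cross-entropy + Dice objective (PyTorch-style; shapes, the smoothing constant, and function names are our assumptions, not taken from the paper or from [14]):

```python
import torch
import torch.nn.functional as F

def soft_dice_loss(logits, target_onehot, eps=1e-5):
    """Soft Dice averaged over classes; logits and target_onehot: (B, C, D, H, W)."""
    probs = torch.softmax(logits, dim=1)
    dims = (0, 2, 3, 4)                              # sum over batch and spatial dims
    intersection = (probs * target_onehot).sum(dims)
    union = probs.sum(dims) + target_onehot.sum(dims)
    dice = (2.0 * intersection + eps) / (union + eps)
    return 1.0 - dice.mean()

def supervised_loss(logits, target):
    """Equally weighted CE + Dice; target: (B, D, H, W) integer labels."""
    ce = F.cross_entropy(logits, target)
    onehot = F.one_hot(target, num_classes=logits.shape[1]).permute(0, 4, 1, 2, 3).float()
    return ce + soft_dice_loss(logits, onehot)
```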

anatomy-informed: The main novelty of this work lies in the use of a voxel-wise contrastive loss and a dimensional contrastive loss. Recent works usually create a single latent code for each input, and the contrastive loss is computed across different images. In our case, the volumetric latent code is broken down spatially to provide a contrastive learning signal to local regions, making this approach “anatomy-informed”. The dimensional contrastive loss inherits recent findings that the “hardness-aware” property makes it a good regularizer to prevent latent space collapse. This is the first work to apply this approach to medical segmentation.
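A minimal sketch of a voxel-wise contrastive loss along these lines (shapes, the temperature, the normalization, and the exact negative set are our assumptions; in practice the latent volume would be small or subsampled to keep the similarity matrix tractable):

```python
import torch
import torch.nn.functional as F

def voxel_contrastive_loss(student_feat, teacher_feat, tau=0.1):
    """InfoNCE over voxels; student_feat, teacher_feat: (B, C, D, H, W) feature volumes."""
    B, C = student_feat.shape[:2]
    # Break the volumetric latent code down spatially: one embedding per voxel.
    s = F.normalize(student_feat.flatten(2).permute(0, 2, 1).reshape(-1, C), dim=1)
    t = F.normalize(teacher_feat.flatten(2).permute(0, 2, 1).reshape(-1, C), dim=1)
    logits = s @ t.t() / tau                         # (B*V, B*V) voxel-to-voxel similarities
    # Positive = teacher embedding at the same voxel; every other entry acts as a
    # negative (other voxels of the same volume and voxels of other volumes in the batch).
    labels = torch.arange(s.shape[0], device=s.device)
    return F.cross_entropy(logits, labels)
```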

high- vs low-level: Yes. Low- and high-level refer to our proposed contrastive learning losses applied in the output feature space (i.e., after the decoders of the teacher and student models) and in the latent feature space (i.e., after the encoders of the teacher and student models), respectively. We will make sure to add more index details to highlight the contrastive components more precisely.

stop-gradient: We do not update the EMA teacher network through backpropagation. We will follow your great advice and add the stop-gradient to the figure.
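A sketch of an EMA teacher update with no gradient flow into the teacher, consistent with the statement above (the momentum value is an assumption; during training the teacher's forward pass would likewise be run under torch.no_grad() or its outputs detached):

```python
import torch

@torch.no_grad()
def update_teacher(teacher, student, momentum=0.99):
    """Exponential moving average of the student weights; no backpropagation reaches the teacher."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1.0 - momentum)
```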

To R2: Thanks!

red/blue lines: The red line is the output prediction and the blue line is the ground truth.

caption/section: We will follow your advice to revise our final version.

inter- and intra-instance: “Inter-instance” and “intra-instance” describe what constitutes a negative pair in our voxel-wise contrastive loss. “Inter-instance” comes from the fact that latent codes corresponding to the same spatial location of different input instances are considered negative pairs. “Intra-instance” describes the fact that latent codes from different spatial locations of the same input instance are also considered negative pairs.
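One way to write this down (our formalization, not the paper's notation), where z_{i,v} is the student embedding of voxel v in instance i, \tilde{z}_{j,u} the corresponding teacher embedding, and sim(.,.) a similarity with temperature \tau; the exact normalization set used in the paper may differ:

```latex
\ell_{i,v} = -\log
  \frac{\exp\!\big(\mathrm{sim}(z_{i,v}, \tilde{z}_{i,v})/\tau\big)}
       {\exp\!\big(\mathrm{sim}(z_{i,v}, \tilde{z}_{i,v})/\tau\big)
        + \sum_{(j,u)\in\mathcal{N}(i,v)} \exp\!\big(\mathrm{sim}(z_{i,v}, \tilde{z}_{j,u})/\tau\big)},
\qquad
\mathcal{N}(i,v) = \underbrace{\{(j,v) : j \neq i\}}_{\text{inter-instance}}
                 \;\cup\; \underbrace{\{(i,u) : u \neq v\}}_{\text{intra-instance}}
```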

hardness-aware: “Hardness-aware” comes from the fact that negative pairs with a higher chance of being misclassified as positive (i.e., harder negatives) receive larger gradient updates. This is a probabilistic justification of the InfoNCE loss over other contrastive loss functions. In particular, when used with dimensional contrastive training, the loss function encourages each dimension to encode different information; dimensions that are very similar get a big push against each other. This helps prevent latent space collapse.
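A minimal sketch of a dimensional (feature-axis) InfoNCE loss consistent with this description (names, temperature, and the symmetrization are our assumptions): each feature dimension is represented by its vector of values across the batch, the matching dimension of the other view is its positive, and all other dimensions are negatives, so highly similar (redundant) dimensions receive the largest repulsive gradients.

```python
import torch
import torch.nn.functional as F

def dimensional_contrastive_loss(z1, z2, tau=0.1):
    """z1, z2: (N, C) projected embeddings of two views of the same batch."""
    d1 = F.normalize(z1.t(), dim=1)                  # (C, N): one vector per feature dimension
    d2 = F.normalize(z2.t(), dim=1)
    logits = d1 @ d2.t() / tau                       # (C, C) dimension-to-dimension similarities
    labels = torch.arange(z1.shape[1], device=z1.device)
    # Symmetrized InfoNCE along the feature axis: each dimension is pulled toward its
    # counterpart and pushed away from all other dimensions, penalizing redundancy.
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))
```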

To R3: Thanks for your helpful comments! We have added comparisons to recent contrastive learning (CL) methods, and we hope you will re-evaluate our paper.

comparisons: Alonso et al. introduced CL with a large memory bank over feature representations of each class and used pseudo-labeling on the labeled data, which is computationally expensive. In contrast, we use a much simpler strategy with the proposed dimensional CL and do not require external memory. [1] introduced 2D supervised CL at the pixel and region level for semantic segmentation, and [2] used a 2D step-wise, label-based CL loss for medical segmentation in a global and local manner. In contrast, our work leverages both labeled and unlabeled data for 3D segmentation in an end-to-end manner. [3] applied standard 2D CL with attention maps for more discriminative features, which requires high computational cost. More specifically, compared to [1-3], our method not only incorporates dimensional CL to avoid trivial solutions, but also focuses more on anatomical features at both the pixel and instance level. Moreover, compared with [1-3], which focus on 2D segmentation, our work focuses on 3D segmentation, and the experimental settings are significantly different. Nevertheless, we will cite and compare with the 2D segmentation models [1-3] in the revision.

voxel-by-voxel: Yes, the positive pair is defined voxel by voxel; we will make this clearer in the revision.


