
Authors

Xiao Liu, Spyridon Thermos, Pedro Sanchez, Alison Q. O’Neil, Sotirios A. Tsaftaris

Abstract

Training medical image segmentation models usually requires a large amount of labeled data. By contrast, humans can quickly learn to accurately recognise anatomy of interest from medical (e.g. MRI and CT) images with some limited guidance. Such recognition ability can easily generalise to new images from different clinical centres. This rapid and generalisable learning ability is mostly due to the compositional structure of image patterns in the human brain, which is less incorporated in medical image segmentation. In this paper, we model the compositional components (i.e. patterns) of human anatomy as learnable von-Mises-Fisher (vMF) kernels, which are robust to images collected from different domains (e.g. clinical centres). The image features can be decomposed to (or composed by) the components with the composing operations, i.e. the vMF likelihoods. The vMF likelihoods tell how likely each anatomical part is at each position of the image. Hence, the segmentation mask can be predicted based on the vMF likelihoods. Moreover, with a reconstruction module, unlabeled data can also be used to learn the vMF kernels and likelihoods by recombining them to reconstruct the input image. Extensive experiments show that the proposed vMFNet achieves improved generalisation performance on two benchmarks, especially when annotations are limited. Code is publicly available at: \url{https://github.com/vios-s/vMFNet}.
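To make the composition operation concrete, here is a minimal numpy sketch of how vMF likelihoods could be computed from a feature map and a set of kernels. The function name, array shapes, and the shared concentration value `sigma=30.0` are illustrative assumptions, not the paper's exact implementation (which realises this with convolutional operations).

```python
import numpy as np

def vmf_likelihoods(Z, mu, sigma=30.0):
    """Per-position vMF likelihoods (illustrative sketch).

    Z:     (H, W, D) feature map, each position holding a D-dim feature vector.
    mu:    (J, D) learnable vMF kernels (mean directions).
    sigma: shared concentration parameter (hypothetical value).

    Returns (H, W, J): how likely each kernel (anatomical pattern)
    is at each position of the image.
    """
    # Project feature vectors and kernels onto the unit hypersphere.
    Zn = Z / (np.linalg.norm(Z, axis=-1, keepdims=True) + 1e-8)
    mun = mu / (np.linalg.norm(mu, axis=-1, keepdims=True) + 1e-8)
    # vMF density up to a constant: exp(sigma * <mu_j, z_i>).
    logits = sigma * np.einsum("hwd,jd->hwj", Zn, mun)
    # Normalise over kernels so each position yields a distribution.
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)
```

The segmentation module can then predict masks from this (H, W, J) likelihood map, since it retains spatial layout while discarding domain-specific appearance.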

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16449-1_67

SharedIt: https://rdcu.be/cVRXF

Link to the code repository

https://github.com/vios-s/vMFNet

Link to the dataset(s)

https://www.ub.edu/mnms/

http://niftyweb.cs.ucl.ac.uk/challenge/index.php


Reviews

Review #4

  • Please describe the contribution of the paper

    In this paper, the authors develop a 2D semi-supervised vMF-kernel-based model for domain-generalised segmentation. The authors also propose to leverage unlabeled data using a reconstruction module. Experiments on two datasets showed the superiority of vMFNet compared to SOTA methods.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The idea of using vMF kernels is novel and interesting, and has not been well explored in the field of medical image processing. It provides an alternative perspective to decomposing an image into style and content as in traditional disentanglement approaches. Besides, the paper conducts a strong evaluation by comparing against state-of-the-art domain generalization methods in the semi-supervised setting.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Experiments on both datasets were conducted with 2D networks only. It would be interesting to understand the effectiveness of vMF modeling in 3D contexts.
    • Tests of statistical significance are missing: for example, in Table 1, in SCGM 20% (Dice), the vMFNet achieved about 1.5% Dice improvement compared to DGNet but with a standard deviation of 8.8.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors state that the code will be made publicly available.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    • Since, as mentioned in the paper, ‘the vMF likelihoods contain only spatial information of the image’, it would be interesting to further discuss the relationship between modeling via the vMF distribution and traditional disentanglement. In particular, in Figure 2, the visualization of the most informative vMF channels is very similar to that of the content code feature maps in disentanglement.
    • In section 3.1, using ‘D’ as the channels is somewhat confusing. ‘C’ would be a better notation for channels.
    • Typo at section 4.5: ‘subjectc’.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The compositional network with vMF kernels is novel and has not been used to address domain generalization problems in the literature. The decomposition by vMF kernels can serve as a potential alternative to content-style disentanglement and may be applied to fields such as image harmonization. The paper is well written and easy to follow. The experiments are well designed, and the visualization of the compositionality gives readers more insight into the vMF kernels.

  • Number of papers in your stack

    6

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #6

  • Please describe the contribution of the paper

    The paper presents a method for semi-supervised and test-time domain generalization based on the composition of von-Mises-Fisher (vMF) kernels. The proposed method models the distribution of features as a mixture of vMF components. This mixture (clustering) prior is used to regularize training with a loss minimizing the negative log-likelihood. This loss is combined with an unsupervised reconstruction loss and a supervised segmentation loss to learn a representation for segmentation that is more robust to domain shifts. The proposed method is tested on two datasets, for cardiac segmentation (M&M) and spinal grey matter segmentation (SCGM), and shows improved performance compared to different baselines.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The idea of modeling compositional components using vMF kernels has not been explored for domain generalization. Although inspired by [20], the application setting of the proposed method differs from this previous work.

    • Experiment on M&M and SCGM show clear advantages compared to recent baselines for domain generalization.

    • The extension of the method to test time adaptation adds depth to the paper.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The theoretical motivation of the paper is somewhat unclear. The concept of compositional components is defined in vague terms. In practice, the proposed method seems like a regular clustering prior on features based on a mixture model. The advantage of employing vMF kernels instead of Gaussian components as in GMM is unclear.

    • Some elements of the method could be better explained / motivated. For example, I am not sure that the definition of L_vMF achieves the desired goal of aligning mu_j to z_i (missing minus sign?). Also, I fail to see how the image can be reconstructed from such a low-dimensional representation, or how this reconstruction can be useful.

    • The proposed method is similar in essence to having an auto-encoder loss on the features as in [35]. I believe this approach should be included as a baseline in the experiments to demonstrate the advantage of the vMF compositional model.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Except for the potential error in the L_vMF loss, the method and experiments should be reproducible. Sharing the source code would help.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    • I wish the authors could explain in a more formal manner why the proposed model helps learn features that generalize across domains. What is the difference/advantage of the method with respect to a simple clustering prior on the features?

    • Why use vMF kernels instead of a GMM?

    • The loss L_vMF seems incorrect. I believe it should be Sum_i min_j (1 - mu_j^T z_i), since the goal is to minimize the cosine distance, not the cosine similarity.

    • The reconstruction loss in Eq. (2) is somewhat arbitrary. Why would the log-likelihood be used as a weight? How can the image be accurately reconstructed from such a low-dimensional space?

    • Shouldn’t there be another weight in front of the reconstruction loss in Eq. (3)?

    See the main weakness section for other comments.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper brings an interesting concept of component compositionality for domain generalization, and shows the advantage of the method compared to recent baselines. However, the motivation of the method and its loss terms needs clarification.

  • Number of papers in your stack

    7

  • What is the ranking of this paper in your review stack?

    4

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #5

  • Please describe the contribution of the paper
    1. A novel architecture and loss function are proposed for domain generalization in medical image segmentation. The novelties are inspired by recent progress in compositional CNNs.
    2. Experiments on multi-center cardiac and spine MRI datasets reveal improved robustness to domain shifts, as compared to several recent domain generalization methods.
    3. The performance is shown to improve further upon training some blocks of the pipeline at test time, using an unsupervised loss that was also used during the initial training.
  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. This is the first time that I have seen concepts from the compositionality literature used in medical image segmentation and domain generalization papers. In principle, I agree that the notion of compositionality might play an important role in helping us build more robust algorithms. Kudos to the authors for bringing up this topic - I am confident that it will be of high interest to the community.
    2. Strong evaluation for domain generalization. Usage of multiple datasets and anatomies.
    3. Extremely clear writing and overall presentation.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. My main concern is that I find a lack of sufficient explanation / intuition as to why the proposed method should help with domain generalization. Despite the additional loss functions L_{rec} and L_{vMF}, and despite the additional decomposing-composing pipeline, I do not exactly see why a neural network trained on a source distribution will provide correct predictions on a shifted target distribution. Specifically, the feature extractor trained on the source domains may not necessarily extract similar features from a test image. Thus, the fixed kernels may not necessarily correctly highlight different regions in different likelihood channels. Indeed, the fact that test-time training improves performance shows that there are some errors that TTT can alleviate. Nevertheless, the proposed training seems to provide improved robustness even without TTT. Can the authors provide any pointers that can help me better understand this?
  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors have agreed to make their code publicly available. Also public datasets have been used for experimentation. Sufficient implementation details have been provided in the paper.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    1. The feature extractor is almost a ‘full’ U-Net, but without the last upsampling layer. This means that the features Z have half the spatial resolution of the input images. The reconstruction and segmentation networks are quite shallow, with only one upsampling layer. It would be useful to clarify these aspects in Figure 1 - for instance, you could add a partial decoder to the orange feature extractor, and reduce the number of upsampling layers in the blue and green blocks.
    2. It is said that “For data from different domains, the features of the same anatomical part of images from these domains will activate the same kernels.” This is a strong claim. Is this based on empirical results or can the authors support this claim theoretically? In the former case, can the authors provide an intuition why the learned kernels are robust?
    3. In the evaluation, two settings have been mixed - those of semi-supervised learning and domain generalization. While the constraints of both these settings (less labelled data and domain shifts between training and test images) may occur concurrently in practice, the methods that the proposed method has been compared with have all been developed primarily for the domain generalization problem. A fairer comparison with respect to semi-supervised learning would have been to include methods from that setting as well. I admit that this would call for too many comparisons - a leaner way could be to focus on one of the two problems at a time.
    4. Please provide details of which intensity and resolution augmentations are used and with what hyperparameters in the compared method SDNet + Aug.
    5. I am not sure how much can be read into the compositionality visualization in Figure 2. I suspect that if one visualized the different channels of a layer before the last upsampling layer of a normal U-Net trained using a supervised loss, one might also see a similar segregation of structures into different channels. Perhaps the presence of the reconstruction loss also preserves background structures, which would not happen for a U-Net trained only for segmentation. On the other hand, the background structures (e.g. those on the top-right) are not necessarily separated into different channels.
    6. How sensitive is the method to the choice of the number of kernels? Is it important to keep this number relatively low?
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    7

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I believe that the compositionality idea is very interesting from the point of view of domain generalization, and the authors have developed, presented and evaluated the method very nicely. This paper will make a fantastic addition to the conference, in my opinion.

  • Number of papers in your stack

    6

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #7

  • Please describe the contribution of the paper

    This paper proposes to use learnable von-Mises-Fisher kernels to model the compositional components of human anatomy and demonstrates improved generalization performance on cardiac and spinal grey matter image segmentation.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The idea of applying compositional networks and vMF likelihoods to generalized medical image segmentation is novel and intuitive.

    The paper demonstrated strong performance of their method from extensive experiments on both cardiac and gray matter datasets. The results are promising and convincing, and the paper is well written.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Some descriptions of the methods and the experiments are not very clear and need more details and justification. But this could be improved given more space.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The paper is reproducible and the code will be available.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    Since the kernel activation depends on position, how robust is this method to spatial variations such as translations and rotations?

    How does “hard assignment of the feature vectors z_i to the vMF kernels μ_j” work? How often is this done during training?

    For the alignment analysis, the average cross-entropy errors might not be sufficient to assess the alignment of vMF likelihoods from different source domains. Cross-entropy errors of 0.718 and 0.756 are not very different. Standard deviations are needed, and it would be helpful to show qualitative results visualizing X, Z, and Z_{vMF} of images from different domains.

    Section 4.4: “false predictions occur when the wrong kernels are activated” - it would be helpful to include example visualization of failed segmentations and the corresponding kernel activations.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    7

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper demonstrates an effective semi-supervised domain generalization approach that shows a huge improvement in segmentation performance when labeled data is scarce. I think this method can be useful for many medical applications where labeled training data is limited and domain generalization is desired.

    The method showed significant improvement after test-time domain generalization without using additional labeled data in the test domain.

    The idea of using compositionality is intuitive and offers explainability.

  • Number of papers in your stack

    6

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    Interesting idea, sound methodology and convincing experimental results - I recommend acceptance.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    nr




Author Feedback

We thank the reviewers and the meta-reviewer for their valuable comments regarding our novelty, technical contribution, and state-of-the-art performance. Responses to the main comments and suggestions of reviewers R4, R5, R6 and R7 follow below. Other straightforward comments will be addressed in the camera-ready version:

R4: We appreciate the suggestion of extending vMFNet to 3D contexts. We will explore it in future work. We agree that there is a close relation to content-style disentanglement. We plan to include the corresponding discussion.

R5: In terms of why the method improves generalisation: when training on multiple source distributions, the vMF kernels are learnt to detect the same anatomy in images from different domains, i.e. they are aligned or domain-invariant. Such kernels can also generalise to the test domain if the distribution shifts between the source and test domains are not significantly large. The alignment of kernels (i.e. the robustness to distribution shift) is achieved by clustering feature vectors together with the constraint from the limited segmentation masks. Specifically, the kernels are the centres of clusters of similar feature vectors (e.g. the centre of the cluster of all MYO feature vectors from cardiac images of different domains). A vMF kernel corresponding to MYO is constrained to be activated only by MYO feature vectors, under limited segmentation supervision. Regarding the separation of semi-supervised learning and DG and fair comparison, SDNet is one of the previous SOTA semi-supervised segmentation models. With the augmentations (we will include the details of the augmentations), SDNet was adapted to address DG problems. Nevertheless, we will add a relevant discussion of the two aspects of semi-supervised DG. We also plan to include the visualisation of a normal UNet to better demonstrate the compositionality, and a discussion of the number of kernels.

R6: Gaussian kernels could possibly be used to replace the vMF kernels. However, the vMF kernels have been better integrated into CNNs (Eq. 10 in [20]), which makes the implementation easier and the training faster. L_vMF is minimised to maximise the vMF likelihood. Deriving from -Sum_i max_j mu_j^T z_i (see Eq. 14 and Eq. 15 in [20]), we obtain the L_vMF loss reported in the paper. We will clarify this in the updated version. We show, qualitatively in Fig. 2 and quantitatively in the ablation study, that a (good) reconstruction is achieved after training. The reconstruction allows the model to better take advantage of the unlabeled data, which significantly improves segmentation performance. Compared to naive latent feature clustering, we learn aligned/domain-invariant vMF kernels across multiple domains and design the reconstruction module for semi-supervised DG problems. Lastly, the proposed vMFNet can be trained with only three losses (simpler than the 7 losses used by the previous SOTA, DGNet), and training is not sensitive to the weights of the loss terms according to our early experiments, which is one of the advantages of vMFNet. Hence, we set the weights of the losses to 1.
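The clustering objective described in this response, -Sum_i max_j mu_j^T z_i, can be sketched as follows. This is a hedged illustration of the loss under stated assumptions (flattened feature layout, function name `l_vmf`), not the authors' exact code:

```python
import numpy as np

def l_vmf(Z, mu, eps=1e-8):
    """Sketch of the clustering loss  -mean_i max_j <mu_j, z_i>.

    Z:  (N, D) feature vectors (spatial positions flattened).
    mu: (J, D) vMF kernels (mean directions).
    Minimising this drives each feature vector towards its closest kernel,
    i.e. it maximises the cosine similarity to the best-matching kernel.
    """
    Zn = Z / (np.linalg.norm(Z, axis=-1, keepdims=True) + eps)
    mun = mu / (np.linalg.norm(mu, axis=-1, keepdims=True) + eps)
    sims = Zn @ mun.T                # (N, J) cosine similarities
    return -sims.max(axis=1).mean()  # in [-1, 0]; -1 when every z_i aligns with some mu_j
```

Note that this is equivalent, up to an additive constant, to Sum_i min_j (1 - mu_j^T z_i), i.e. minimising the cosine distance to the closest kernel, which may explain the sign confusion raised by R6.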

R7: We appreciate that the reviewer raises an interesting point regarding robustness to spatial variations. The vMF likelihoods preserve the spatial information of the input image. However, the vMF kernels are not position-dependent; in other words, rotations and other transformations will not stop the vMF kernels from detecting the corresponding anatomy/patterns. Hence, the proposed method is robust to spatial variations. L_vMF is minimised to maximise the vMF likelihood. Once the vMF likelihoods are maximised (towards 1), one feature vector will activate only one kernel, hence we can achieve the “hard assignment of the feature vectors z_i to the vMF kernels μ_j”. We thank the reviewer for the suggestions on visualisations. We plan to include these visuals in the camera-ready version.
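The “hard assignment” described in this response amounts to an argmax over the kernel dimension of the vMF likelihoods. A minimal sketch (the function name and shapes are assumptions for illustration):

```python
import numpy as np

def hard_assignment(likelihoods):
    """One-hot assignment of each spatial position to its most likely kernel.

    likelihoods: (H, W, J) vMF likelihoods over J kernels.
    Returns an (H, W, J) one-hot map: 1 for the winning kernel, 0 elsewhere.
    """
    idx = likelihoods.argmax(axis=-1)          # winning kernel per position
    return np.eye(likelihoods.shape[-1])[idx]  # one-hot encode
```

In the limit where each position's likelihood distribution is peaked at a single kernel, the soft likelihoods and this hard assignment coincide.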


