
Authors

Dewen Zeng, Yawen Wu, Xinrong Hu, Xiaowei Xu, Jingtong Hu, Yiyu Shi

Abstract

This paper presents a new way to identify additional positive pairs for BYOL, a state-of-the-art (SOTA) self-supervised learning framework, to improve its representation learning ability. Unlike conventional BYOL, which relies on only one positive pair generated from two augmented views of the same image, we argue that information from different images with the same label can bring more diversity and variation to the target features, thus benefiting representation learning. To identify such pairs without any labels, we investigate TracIn, an instance-based and computationally efficient influence function, for BYOL training. Specifically, TracIn is a gradient-based method that reveals the impact of a training sample on a test sample in supervised learning. We extend it to the self-supervised setting and propose an efficient batch-wise per-sample gradient computation method that estimates pairwise TracIn as a measure of sample similarity within each mini-batch during training. For each image, we select the most similar sample from the other images as an additional positive and pull their features together with the BYOL loss. Experimental results on two public medical datasets (i.e., ISIC 2019 and ChestX-ray) demonstrate that the proposed method improves classification performance over competitive baselines in both semi-supervised and transfer learning settings.
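
To make the batch-wise pairwise TracIn computation concrete, here is a minimal sketch (not the authors' released code; all names are hypothetical) of how per-sample gradients restricted to the last linear layer can yield pairwise similarities within one mini-batch. It uses the fact that for a linear layer each per-sample gradient is an outer product, so pairwise Frobenius inner products factorize into two Gram matrices:

```python
import torch

def tracin_additional_positives(h_in: torch.Tensor, g_out: torch.Tensor) -> torch.Tensor:
    """Pick one additional positive per sample via pairwise last-layer TracIn.

    For a linear layer y = W h, the per-sample gradient is the outer product
    dL_i/dW = g_i h_i^T, so pairwise Frobenius inner products factorize:
        <g_i h_i^T, g_j h_j^T>_F = (g_i . g_j) * (h_i . h_j).

    h_in:  (B, d_in)  inputs to the last linear layer, one row per sample
    g_out: (B, d_out) per-sample loss gradients w.r.t. that layer's output
    Returns a (B,) tensor holding each sample's most TracIn-similar other sample.
    """
    sim = (g_out @ g_out.T) * (h_in @ h_in.T)   # (B, B) pairwise TracIn scores
    sim.fill_diagonal_(float("-inf"))           # a sample cannot pair with itself
    return sim.argmax(dim=1)                    # additional positive per sample
```

The selected sample's features would then be pulled toward the anchor with the same BYOL loss used for the augmented-view pair.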

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43907-0_12

SharedIt: https://rdcu.be/dnwca

Link to the code repository

N/A

Link to the dataset(s)

https://challenge.isic-archive.com/landing/2019/

https://nihcc.app.box.com/v/ChestXray-NIHCC


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper proposes to utilize additional positive pairs for better self-supervised representation learning by adopting TracIn in BYOL. To make TracIn efficient in BYOL training, the authors devise several strategies, e.g., batch-wise computation and the use of pre-trained models to enhance sample variety. Extensive comparative experiments are carried out to verify the effectiveness of the method.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Main strengths: (1) An interesting formulation. It is interesting to introduce additional positive pairs into BYOL for SSL by adopting TracIn to measure image similarity. (2) A well-written paper with a clear structure. (3) Good design of the comparative experiments, as well as extensive experiments for algorithm validation.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Main weaknesses: (1) No discussion of the limitations of TracIn + BYOL; it would be better to discuss the limitations of the method. (2) No quantitative measurement of the computational cost of adding positive pairs; it would be of interest to quantify the cost of using TracIn in BYOL.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Good; the authors state that their source code will be released if the paper is accepted.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    Overall, this paper is well-written and easy to follow, and the problem it studies is valuable and interesting. It would be better for the authors to consider 1) a moderate discussion of the limitations of this study; 2) an evaluation of the additional computation cost of utilizing TracIn in BYOL.

    Minor comments: 1) In Table 1, there should be a space between “0.xxx” and “(0.xxx)”, and likewise in Table 2. 2) The number of additional positive pairs selected by the authors is unclear to me. Would this number influence model performance?

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Overall, this paper is well-written and easy to follow, and the problem it studies is valuable and interesting. Moreover, the authors have demonstrated the effectiveness of their method through extensive comparative experiments.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    This paper proposes an extension of BYOL based on TracIn, an influence function, for detecting additional positive samples in an unsupervised way (no manual labelling needed). The authors compare their method with several baseline methods.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Clear writing
    • The idea of combining TracIn and BYOL is interesting and new
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Some choices about the experimental setting and the baselines are not clear
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    There are enough implementation details, but providing the code would greatly improve reproducibility.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    • The idea of using an influence function to select additional positive samples is interesting and well-motivated. However, the use of another pre-trained model for the pairwise TracIn computation, as well as for the FS pre-trained model, limits the usefulness of the proposed method. Why not use TracIn or FS only after a certain number of epochs (say 50)? This would avoid using a second pre-trained model. Furthermore, more details about this second pre-training are needed. Please clarify.

    • I think the authors should better point out their contribution, which is the use of additional positive samples within BYOL. They propose to do this in three ways: supervised labelling (as in SupCon), or without manual labelling using either FS or TracIn. Looking at the results, all three methods outperform standard BYOL; this should be better highlighted.

    • Related to my previous point, the authors could also use another simple strategy: using more views (augmentations) of the same sample as additional positives. This choice would depend more on the chosen augmentations, but it would also be a very simple one. Please comment on that.

    • Why do the authors say that BYOL is more resilient to the choice of data augmentations? In the following sentence, they say that BYOL has limited feature diversity due to the data augmentation. Please clarify, since data augmentation also seems to be an important choice for your method (different data augmentations give different results).

    • At pag. 4, the authors write that “if the TracIn of two samples is large in the current iteration, this means that the training of one sample can benefit the other sample a lot because they share some common features.” However, since there are no normalisation terms in Eq. 2, the TracIn score can be large either when one (or both) of the two gradients has a large magnitude or when the two gradients are parallel (see the illustration after this list). Are all these cases related to samples that share common features, or is it just when the two gradients are parallel? Please clarify.

    • The authors approximate the entire gradient with just the gradient of the last layer. They should probably run a small test to confirm that this approximation holds.

    • In Fig. 2, the authors check whether TracIn selects as additional positives images with the same label as the anchor image. Why not do this quantitatively? The authors could check during training whether their method converges towards a good solution, namely whether the chosen additional images are indeed samples with the same label as the anchor.

    • The authors should probably also compare the computational time (or number of FLOPs) of the different methods. I guess that FS would take less computational time.

    • What does the std represent in Table 1? Cross-Validation? Different runs with different initialisations?

    • In Eq. 1 and afterwards, I would probably write $l(w_t; x_k)$ to highlight the fact that $x_k$ is fixed and what changes is $w_t$

    • The authors should probably add BYOL to the title

    • (pag. 1) A naïve contrastive learning -> A recent contrastive learning

    • Please clarify in the Introduction that BYOL is not a contrastive method (no negative samples)

    • (pag. 2) It is not clear what “background pathology” means
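
    A hedged illustration of the normalisation point raised above, assuming Eq. 2 reduces to a scaled gradient dot product (notation hypothetical, with $g_i = \nabla_w l(w; x_i)$):

    $\mathrm{TracIn}(x_i, x_j) = \eta \, g_i \cdot g_j = \eta \, \lVert g_i \rVert \, \lVert g_j \rVert \cos\theta_{ij}$

    A large score can therefore come from large gradient magnitudes, from alignment ($\cos\theta_{ij}$ near 1), or from both; only the alignment term indicates shared features, which is exactly the ambiguity the comment raises.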

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Please see my comments above.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The authors present a new way to add positive pairs to BYOL training, by deriving pairs not from the same image, but from images that exhibit some form of similarity. In this work, the similarity metric is computed from an influence function, which would typically be used to measure how a given training sample affects the loss through updates to the model weights during training. Extensive formulations are given for how to adapt the influence function for BYOL training.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • A novel method for deriving positive pairs for training contrastive models. Strong reasoning behind the need, as well as a very clear explanation of how the influence function (IF) is adapted for this purpose.
    • An unusual (and that’s good) approach. IF adoption is non-trivial, and good arguments are given for how to solve this.
    • The method is evaluated on multiple medical datasets.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Ultimately, TracIn requires a pretrained BYOL to start with. This is not highlighted by the authors until the last paragraph of the Method section. This is a significant limitation; in effect, it makes the method a ‘BYOL finetuning strategy’, if I understood correctly.
    • Since TracIn is rather difficult to adopt at train time, a few significant simplifications are made for finding positive pairs. First, only the last linear layer is used to compute gradients on images, which limits the effect of ‘TracIn’ compared to its traditional form. I am curious whether a study could measure how much is lost here, and whether the improvement in results comes just from ‘training for longer’.
    • As gradients are estimated per mini-batch, if I understood correctly, the positive pairs can only be selected from a given mini-batch. That means the batch size needs to be rather large in order to maximize the chances that truly similar pairs of images are found per mini-batch.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    It is mentioned that access to the code and trained models will be provided. I would suggest publishing the code via an anonymous GitHub account and sharing the weights anonymously as well. These can then easily be pointed to non-blind repositories if the manuscript is accepted.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    As per major weaknesses stated above:

    1. Could you clarify to the reader much earlier in the manuscript that BYOL with TracIn is effectively used to finetune a pretrained model?
    2. If you were to simply keep training a pretrained BYOL without TracIn for the same number of epochs, would you also see an improvement in results? Can you verify this?
    3. Can you verify the effect of mini-batch size on the performance of your system? It is an important limitation of your method. Also, if the mini-batch selection is initialized with different seeds, does performance depend on this (as positive pairs are selected per mini-batch)?

    Minor:

    It was not immediately clear to me how you get from equation (1) to equation (2). Can you provide a more extensive step-by-step derivation? I understood why the higher-order term is not significant, but got lost in the other parts where terms are rearranged.
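
    For reference, the standard first-order argument behind TracIn (Pruthi et al., 2020), with notation assumed to match the paper's Eq. 1-2, runs as follows. A first-order Taylor expansion of the loss on a test sample $x'$ around $w_t$ gives

    $l(w_{t+1}; x') \approx l(w_t; x') + \nabla_w l(w_t; x') \cdot (w_{t+1} - w_t)$

    and substituting the SGD update on training sample $x_t$, $w_{t+1} = w_t - \eta_t \nabla_w l(w_t; x_t)$, and rearranging yields

    $l(w_t; x') - l(w_{t+1}; x') \approx \eta_t \, \nabla_w l(w_t; x') \cdot \nabla_w l(w_t; x_t)$

    i.e., the loss reduction on $x'$ attributable to the step on $x_t$ is, to first order, the scaled dot product of their gradients.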

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    7

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Despite certain strong limitations of the proposed method, the method is very novel. Adapting an influence function, which would typically be used on trained models, for training purposes is certainly thought-provoking. All assumptions are well explained, and thorough reasoning is given behind the work.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    Some of the review contents are actually critical, especially regarding how much additional computation cost is demanded, how many additional positive pairs are required to see a performance gain, what limitations the approach has, and the missing implementation details. A question that was alluded to: does “Additional Positive Enables Better Representation Learning” hold for other self-supervised methods? The evaluation is limited to two target tasks, and the performance, especially on ChestX-ray, is significantly lower than the SoTA performance in the literature; publishing such a result may cause misunderstandings of the state of the art in the field. Furthermore, without statistical analysis, it is unknown whether the performance gains from additional positive pairs are significant, even at performance below the SoTA.




Author Feedback

We thank all reviewers and the meta-reviewer for their time and their recognition of this paper’s contributions. Here, we focus on addressing some of the reviewers’ concerns and questions.

Reviewer #1:

  1. Need some discussion of the computation cost and limitations. Response: To find the additional positive, our method introduces almost twice the computation cost of the original BYOL. This is one of the major limitations of this study, and we will add a discussion of the limitations in the revised version.
  2. How many additional positive pairs are selected? Response: In this paper, we use only one additional positive pair for each sample. One can easily extend this to more pairs by modifying the selection criteria. However, more pairs introduce more computation cost and increase the false positive rate, which may degrade performance.

Reviewer #2:

  1. TracIn requires a pre-trained BYOL to start with; this makes it a ‘BYOL finetuning strategy’. Is the improvement in results gained just from ‘training for longer’? Response: A pre-trained model helps TracIn identify additional positives more accurately, thus improving performance. This is different from finetuning a pre-trained BYOL model because BYOL-TracIn is initialized randomly. In fact, we have tested finetuning the pre-trained BYOL for the same number of epochs as BYOL-TracIn, and BYOL-TracIn is still better than the further-finetuned BYOL, although finetuning slightly improves the original BYOL. Due to the page limit, this is not included in the draft. From another view, if we can obtain a better pre-trained model (through further finetuning), we can use it to further improve BYOL-TracIn because it identifies positives more accurately. After all, the main focus is to improve model accuracy.
  2. The batch size needs to be rather large. Response: The batch size needs to be large enough to ensure there are enough potential positives for each sample (from a probability perspective). For example, bs = 256 for CIFAR-10 is fine because there are about 25 positives per class on average in a mini-batch (see the sketch below). Another purpose of computing TracIn within mini-batches is to improve positive diversity, so the batch should not be too large either. In the extreme case where the batch size equals the training set size, each sample gets a fixed additional positive throughout training (if a pre-trained model guides the selection); even though such positives are likely true positives, the improvement in representation learning could be limited because the same two instances are always pulled together.
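
As a back-of-the-envelope check of this batch-size argument (an illustrative sketch assuming a uniform label prior, not taken from the paper):

```python
def expected_same_class(batch_size: int, num_classes: int) -> float:
    """Expected number of same-class candidates per anchor in a mini-batch,
    assuming labels are drawn uniformly at random (illustrative assumption)."""
    # Each of the other (batch_size - 1) samples matches the anchor's class
    # with probability 1 / num_classes.
    return (batch_size - 1) / num_classes

# CIFAR-10 with bs = 256: each anchor has ~25.5 same-class candidates on
# average, matching the estimate in the response above.
print(expected_same_class(256, 10))  # 25.5
```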

Reviewer #3:

  1. Why not use TracIn or FS only after a certain number of epochs (say 50)? Response: In general, a better pre-trained model can identify positives more accurately. In our preliminary experiments, we tested first doing normal BYOL training and then applying TracIn after different numbers of epochs (50, 100, 200, etc.). We also tested extracting the pre-trained model from different epochs. The result is that using the better pre-trained model as a guide is always best. One possible explanation is that a pre-trained model helps the model learn better during all stages of training.
  2. The authors could also use another simple strategy: using more views (augmentations) of the same sample as additional positives. Computation cost? Response: Thanks for the suggestion; using more views of the same sample could be another good baseline, and we will add it in future work. As for computation time, both TracIn and feature similarity need a pre-trained model to perform one forward pass on the current mini-batch (the actual TracIn and feature-similarity computations are small compared to the forward pass), so the computation cost is almost twice that of the original BYOL training.


