
Authors

Ziyue Xu, Andriy Myronenko, Dong Yang, Holger R. Roth, Can Zhao, Xiaosong Wang, Daguang Xu

Abstract

Acquiring pixel-level annotation has been a major challenge for machine learning methods in medical image analysis. Such difficulty mainly comes from two sources: localization requiring high expertise, and delineation requiring tedious and time-consuming work. Existing methods of easing the annotation effort mostly focus on the latter one, the extreme of which is replacing the delineation with a single label for all cases. We postulate that under a clinical-realistic setting, such methods alone may not always be effective in reducing the annotation requirements from conventional classification/detection algorithms, because the major difficulty can come from localization, which is often neglected but can be critical in medical domain, especially for histopathology images. In this work, we performed a worst-case scenario study to identify the information loss from missing detection. To tackle the challenge, we 1) proposed a different annotation strategy to image data with different levels of disease severity, 2) combined semi- and self-supervised representation learning with probabilistic weakly supervision to make use of the proposed annotations, and 3) illustrated its effectiveness in recovering useful information under the same worst-case scenario. As a shift from previous convention, it can potentially save significant time for experts’ annotation for AI model development.

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16434-7_8

SharedIt: https://rdcu.be/cVRq6

Link to the code repository

N/A

Link to the dataset(s)

https://camelyon16.grand-challenge.org/Data/


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper proposes a new semi-supervised segmentation method for histopathological slides. The originality consists in annotating only a single polygon on the major tumor site and giving a rough estimate of the tumor/tissue area ratio.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The proposed architecture, dealing with both a rough polygon containing positive samples and just an estimate of the tumour/tissue ratio, is very original and could be useful in clinical routine.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The method could be compared to more classical semi-supervised approaches. The tumour/tissue ratio is not so easy to estimate; a study of the influence of a “bad” estimate could be given.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Code, data and trained networks are given. Very good reproducibility of the study.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    As already stated, I think that the paper presents a very clear, new and original approach for semi-supervised segmentation of whole slide images. The results seem to show that this approach provides better results than others. I would have liked to read a more complete study of the influence of the method’s parameters, especially the tumour/tissue ratio given by the user. What if the pathologist underestimates or overestimates this ratio? A second experiment on another publicly available dataset would also strengthen the evaluation of the method.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Original approach; good results; interesting topic.

  • Number of papers in your stack

    4

  • What is the ranking of this paper in your review stack?

    2

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered



Review #2

  • Please describe the contribution of the paper

    This paper proposes an annotation strategy for histopathology images. A semi-supervised learning method with probabilistic weak supervision is designed to verify its effectiveness on the Camelyon 16 dataset under a “worst-case” setting. The idea seems interesting and refreshing.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. This paper proposes a new annotation strategy that may be insightful for clinical annotation.
    2. The designed semi-supervised learning method proves its effectiveness on Camelyon 16 under the worst-case setting.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The assumption behind the designed annotation strategy (the “easy” vs. “hard” figure) is not very theoretically reliable.
    2. The strategy is only evaluated on Camelyon 16, with a small number of samples and limited tumor types; thus its generality is limited.
    3. The method is only compared with a limited set of self-supervised and weakly-supervised methods. It would be better to have comparisons with the latest semi-supervised methods, such as FlexMatch, SimMatch, etc.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    This paper gives a detailed explanation of its method and dataset. However, in order to reproduce this work, further details about the experimental setting (such as the choice of hyperparameters) and the data sampling results should be provided.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    1. I wonder whether the assumption (“hard” vs. “easy”) in Figure 1 is theoretically supported by pathologists. Can you provide references or further explanation?
    2. The experiments were conducted on a portion of Camelyon16, which limits the generality of the strategy. Would it be possible to experiment on more datasets?
    3. The method is only compared with a limited set of self-supervised and weakly-supervised methods. It would be better to have comparisons with the latest semi-supervised methods, such as FlexMatch, SimMatch, etc.
    4. All the tables report FROC; it would be better if the annotation time cost of each method were also listed in a table for comparison.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper introduces an interesting annotation strategy for histopathology images, which is shown to be effective on Camelyon16 in comparison with a few self-supervised and weakly-supervised methods.

  • Number of papers in your stack

    3

  • What is the ranking of this paper in your review stack?

    3

  • Reviewer confidence

    Somewhat Confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    5

  • [Post rebuttal] Please justify your decision

    Not Answered



Review #3

  • Please describe the contribution of the paper

    This paper studies an interesting topic: clinical-realistic annotation for histopathology images. It points out several key components that are neglected but could be critical in histopathology image annotation, including localization and boundary delineation. To address them, this paper proposes different annotation strategies for slides with different annotation burdens. In particular, this paper proposes a probabilistic weak-supervision training pipeline, which is novel to me. The overall results are promising; notably, different baseline methods, e.g., MixMatch or SimCLR, can be improved by the probabilistic ratio supervision with a considerable margin.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. Well-organized and well-motivated ideas. This paper starts by posing domain-specific limitations and challenges in the pathology image annotation process and derives its motivation and idea along with the text. I can easily follow the main idea and find it solid.

    2. The proposed probabilistic weak-supervision training method is novel, to me. Using the sorted prediction scores to match the expected labels is a clever way to exploit the ratio information. Although it does strongly rely on good stage-1 training (in Table 2, “skip” stage-1 fails), the proposed probabilistic framework can improve the upstream model’s performance on the downstream task of interest, which is acceptable to me.
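
    For concreteness, a minimal sketch of the sorted-prediction matching described above, assuming sigmoid patch scores and a slide-level tumor-ratio label (our own illustration; the names and exact loss are not taken from the paper):

```python
import torch
import torch.nn.functional as F

def ratio_supervision_loss(patch_logits, tumor_ratio):
    # patch_logits: (N,) raw scores for N patches sampled from one slide
    # tumor_ratio:  annotator's tumor/tissue estimate in [0, 1]
    probs = torch.sigmoid(patch_logits)
    # Sort descending: the top tumor_ratio fraction of patches is expected
    # to be positive, the remainder negative.
    sorted_probs, _ = torch.sort(probs, descending=True)
    n_pos = int(round(tumor_ratio * sorted_probs.numel()))
    expected = torch.zeros_like(sorted_probs)
    expected[:n_pos] = 1.0
    return F.binary_cross_entropy(sorted_probs, expected)
```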

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    If the authors would address my concerns below, I would further adjust my score.

    1. Major concern: estimation of ratios and multipliers. 1) How do you obtain the 4-point or 10-point polygons for the experiments? Is the proposed method sensitive to the quality of the polygons? 2) Estimating the ratio for small/medium focal tumors seems straightforward, by computing poly_area/otsu_foreground_area (see the sketch after this list). But how do you estimate the multiplier for the polygon area vs. the whole tumor area? An accurate estimation of the ratio, in my opinion, is the key to the success of the proposed method. Using the ground-truth label to directly obtain this information can be seen as label leakage, but I’m okay with it if the leakage is not much. The authors should provide information on the distribution of tumor_area vs. tissue_area, or the distribution of the number of small/medium and large tumors. This helps to inspect the degree of leakage.

    2. Missing details about implementation. 1) MC Dropout: where do you put the dropout layer, and how many MC-dropout passes are performed? How do you estimate the uncertainty U? Is it the standard deviation of multiple passes? How much does the runtime increase for Probabilistic + feedback (which uses MC Dropout for feedback)? MC dropout requires many more passes, and the gain seems marginal (0.02 in Table 2). 2) Training details in each step: the paper does not present training details, e.g., how many epochs SimCLR/MixMatch are pre-trained and how many epochs those models are tuned, either using PatchCamelyon or other annotations. 3) Consider specifying what is stored in the sample queue. Are they images? Besides, is Y+ a vector full of ones (Eq. 3)? 4) Will the code be made available?
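
    A minimal sketch of the poly_area/otsu_foreground_area recipe from major concern 1 above, assuming a low-resolution WSI thumbnail and a rasterized polygon mask (function and argument names are hypothetical; the 5% quantization follows the rebuttal below):

```python
import numpy as np
from skimage.color import rgb2gray
from skimage.filters import threshold_otsu

def estimate_tumor_tissue_ratio(thumbnail_rgb, polygon_mask, multiplier=1.0):
    # thumbnail_rgb: (H, W, 3) low-resolution slide thumbnail
    # polygon_mask:  boolean (H, W) mask rasterized from the annotated polygon
    # multiplier:    annotator's estimate of total tumor area relative to
    #                the polygon area (>1 when other tumor foci exist)
    gray = rgb2gray(thumbnail_rgb)
    # Tissue is darker than the bright slide background, so the Otsu
    # foreground is everything below the threshold.
    tissue = gray < threshold_otsu(gray)
    ratio = multiplier * int(polygon_mask.sum()) / max(int(tissue.sum()), 1)
    # Quantize to a 5% stride, as described for the ratio labels.
    return float(np.clip(round(ratio / 0.05) * 0.05, 0.0, 1.0))
```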

    Minor issues:

    1. The tumor region ratio: does it mean tumor_area/slide_area or tumor_area/tissue_area? Both concepts appear in the text, e.g., “while large lesions can occupy more than 50% of the slide” and “the tumor region occupies 51%, 3%, and 0.01% of the foreground tissue”. Consider clarifying this.

    2. Comparison with the MixMatch variant [8]: in Table 2, is the last row equivalent to MixMatch+Polygon+Probabilistic+feedback? Is there any difference in stage 1?

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    More details about each training step, e.g., pre-training, fine-tuning, MC dropout configuration, need to be clarified for reproducibility.

    The authors checked almost all boxes in the checklist, which states that the code will be made available. If that is the case, reproducibility is not an issue.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    Please refer to the weakness section.

    In addition to that, I suggest another ablation that uses the full set of WSIs in Camelyon16 to pre-train the SimCLR encoder. This also matches the realistic scenario: you don’t have to pre-train only on labeled data.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    1. The well-motivated idea.
    2. The novel probabilistic ratio-based weak supervision training pipeline. However, an explanation for my Major concern (2) is needed for my final rating.
  • Number of papers in your stack

    7

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    6

  • [Post rebuttal] Please justify your decision

    I stand with my previous justification.




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The paper introduces a new annotation strategy for histopathology images based on a probabilistic weak-supervision training pipeline, where the annotation comprises a single polygon around the tumour. Experiments are done on the Camelyon 16 dataset under a “worst-case” setting. The paper received a mix of positive and negative comments. The positive aspects identified by the reviewers are: 1) the paper’s idea is original and potentially useful in clinical routine; 2) the results are effective on the Camelyon 16 dataset; and 3) the paper is well written and motivated. However, the negative points are: 1) the assumption behind the designed annotation strategy is not theoretically reliable; 2) the limited evaluation on Camelyon 16 should be discussed; 3) it is unclear why comparisons with the latest semi-supervised learning methods (e.g., FlexMatch and SimMatch) have not been made; 4) how are the 4-point or 10-point polygons obtained? 5) is the method sensitive to the quality of the polygons? 6) how does the paper estimate the multiplier for the polygon area vs. the whole tumour area? 7) the paper should provide information on the distribution of tumor_area vs. tissue_area, or the distribution of the number of small/medium and large tumours; 8) the paper should provide more details about the experiments (e.g., dropout layer, training details, sample queue); and 9) will the code be made available? Based on this assessment, I invite the authors to write a rebuttal for the paper.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    3




Author Feedback

We sincerely thank the reviewers and area chair for their approval of our work and their valuable suggestions to further improve the paper. Below we list the major questions and our clarifications.

  • Assumption behind the annotation strategy (R2) Fig. 1 depicts a “general view” of how “hard” and “easy” the tasks of detection and annotation can be for typical cases of large/medium/small tumor regions. There are two dimensions: 1) annotation difficulty can be derived directly from how many vertices a particular scheme requires, from zero/few clicks to thousands of clicks (delineation for large tumors); 2) localization difficulty comes from how “visually distinctive” the tumor region appears against normal tissue. Having consulted pathologists, we believe it is generally true that smaller tumors are easier to miss (and thus more difficult to locate). Fig. 1 is by no means “quantitative”; its purpose is to convey an intuition for this task, especially to add the dimension of localization, which is often overlooked in existing works.
  • Details of polygon annotation and tumor ratio estimation (R1/3) We thank the reviewers and agree that such a design can indeed be prone to error; the current work made the following efforts to properly model and alleviate this concern:
    1. We are inspired by similar information recorded and used in some clinical practices [19], hence it could be recognized as a “reasonable task” under certain protocols.
    2. We ask annotators to draw a polygon as big as possible while keeping it fully contained inside the major tumor site (by the way, the “10-point” one is not a polygon, but just a point set as suggested by [8]). Under this request, the polygon for small/medium tumors will not vary significantly, because of their limited size and shape irregularity; only for large, irregular tumors might the polygon suffer from inter-observer variability.
    3. To estimate the multiplier, we provide a mesh grid overlaying the image, so that the annotator can roughly count grid cells to make the estimate. Again, this is easy for small/medium tumors, by counting the other visible tumor sites; it can be more difficult for large tumors.
    4. To cover such potential error, based on the observation from our repeated estimations, we chose a ratio stride of 5% (and experimented with a relaxed stride of 10%). Of course, the range can vary depending on the annotator and the tumor. Fortunately, as stated above, the error mainly concerns large/irregular tumors, and for this category the large polygon already covers a significant area compared with the 224x224 patch size, so the contribution from the probability part is not as significant as for the smaller ones.
  • Additional experiments on robustness, dataset, and other methods (R1/2)
    We thank the reviewers for the suggestion. We realize that simply relaxing the overall stride to 10% may not be a precise way to model this issue, because the potential error is not constant across different tumors: the larger/more irregular the tumor is, the more difficult it becomes to estimate the multiplier. A more sophisticated evaluation is desirable to model it. It would also be helpful to include more datasets. Our main motivation for Camelyon is that it is ideal for the mentioned issues: significant size variation, very tedious annotation for large tumors, and the fact that doctors can miss a portion of the smaller tumors. Other datasets, where tumors are of similar sizes, do not necessarily exhibit this issue. As experiments involving substantial extensions to the current paper are discouraged by the guidelines, we will clarify and highlight the limitations and add experiments in future work.
  • Method / data details (R1/2/3) We will compose supplementary material to cover the details of the experiments (dropout, training, sample queue, etc.) and information on the distributions of the tumor/tissue and tissue/image ratios, as well as the distribution of the number of small/medium and large tumors. We will make our code available.
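
For orientation while the supplementary details are pending, a generic MC-dropout sketch (our own illustration, assuming the uncertainty U is the standard deviation over several stochastic forward passes; not the authors’ confirmed configuration):

```python
import torch

def mc_dropout_predict(model, x, n_passes=20):
    # Run n_passes stochastic forward passes with dropout active and
    # return the mean prediction and its standard deviation (uncertainty U).
    model.train()  # keeps dropout stochastic; if the model has batch norm,
                   # toggle only the dropout modules to .train() instead
    with torch.no_grad():
        preds = torch.stack([torch.sigmoid(model(x)) for _ in range(n_passes)])
    return preds.mean(dim=0), preds.std(dim=0)
```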




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The strengths of the paper are: 1) original idea; 2) potentially useful in clinical practice; 3) good results on Camelyon 16; and 4) the paper is well written and motivated. The rebuttal successfully addresses the following weaknesses: 1) the assumption behind the designed annotation strategy, 2) polygon annotation and tumor ratio estimation, and 3) clarifications on experiments on robustness, code, and dataset. The rebuttal did not adequately address the limitations of Camelyon 16, nor the discussion of why there is no comparison with the latest semi-supervised learning methods (e.g., FlexMatch and SimMatch). One of the reviewers raised their score from 4 to 5. Given that I see more positive than negative points, I recommend acceptance.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    1



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    After rebuttal, there is consensus among reviewers that this paper should be accepted.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    8



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This paper proposes a new semi-supervised segmentation method for histopathological slides, where the annotation is in the form of a single polygon around the tumour. Following my reading of the paper, reviews, and rebuttal, it seems the authors have addressed most of the concerns. I recommend acceptance, and the authors should integrate the clarifications given in the rebuttal into the paper if it is finally accepted.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    2


