
Authors

Xierui Wang, Hanning Ying, Xiaoyin Xu, Xiujun Cai, Min Zhang

Abstract

Early diagnosis of focal liver lesions (FLLs) can decrease the fatality rate of liver cancer, but it remains a big challenge. We designed a deep learning approach based on CT to assess and differentiate FLLs. To achieve high accuracy, CTs in different phases are integrated to provide more information than single-phase images. While most of the related studies use convolutional neural networks, we exploit the Transformer for multi-phase liver lesion classification. We propose a hybrid model called TransLiver, which has a transformer backbone and complementary convolutional modules. Specifically, we connect modified transformer blocks with a convolutional encoder and down-samplers. For multi-phase fusion, we utilize cross-phase tokens to reinforce communication between phases. In addition, we introduce a pre-processing unit to resolve realistic annotation issues. Extensive experiments are conducted, in which we achieve an overall accuracy of 90.9% on an in-house dataset of four CT phases and seven liver lesion classes. The results also show distinct advantages in comparison to state-of-the-art approaches in classification.

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43895-0_31

SharedIt: https://rdcu.be/dnwyK

Link to the code repository

https://github.com/sherrydoge/TransLiver

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors propose TransLiver, a hybrid model combining Transformers and convolutional layers for multi-phase focal liver lesion (FLL) classification. Cross-phase tokens are exploited to enhance the fusion between features from multiple phases. A pre-processing unit is employed to obtain lesion areas on multi-phase CTs from annotations marked on a single phase. The proposed 2D classification pipeline is evaluated on a private dataset comprising 761 FLLs from 444 patients and reaches an accuracy of 90.9%.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Innovative hybrid architecture combining Transformers and convolutional layers
    • Encouraging quantitative results with large margins with respect to results from state-of-the-art networks
    • The use of cross-phase tokens is relevant and boosts the classification performance
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Authors should provide more insights to explain both pre-processing unit and fusion strategy
    • Works addressing FLL classification using ViT should be used for comparison purposes
    • 3D classification results are expected, even when using 2D networks
    • The case of lesions not detectable in one (or several) phase(s) is not handled
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
    • The proposed method could be easily re-implemented based on provided architecture and training information
    • Neither code nor data is provided
  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    The method is of interest for the medical community. However, a number of elements needed to meet the MICCAI requirements are missing. Please find below the main and minor aspects which could, in my opinion, improve your paper.

    Introduction:
    1 - The related works part of the introduction does not discuss liver detection / segmentation, which is a pre-requisite for FLL classification.
    2 - You mentioned that few studies exploit a ViT backbone network in liver lesion classification. These studies, whose scope fits the paper's topic exactly, should be mentioned and discussed.

    Methods:
    3 - In the pre-processing unit (Sect.2.1), the registration network based on VoxelMorph employs an auxiliary Dice loss function between fixed image lesion masks and moved image lesion masks to help the registration field. It therefore assumes that FLLs are segmented (or at least detected) in all phases, which contradicts the fact that annotations are performed on a single phase only.
    4 - As stated in Sect.2.1, multi-phase CT scans are registered with respect to the arterial phase. Why not choose the early venous phase as the reference phase, since it provides a larger contrast between HCC lesions (one of the 7 considered classes) and parenchyma?
    5 - The usefulness of the lesion matcher should be highlighted in Sect.2.1 since the VoxelMorph registration step, if accurate, should have already aligned lesions from the different phases.
    6 - Only lesions completely found in all phases are used as inputs of the classification network. This assumption appears restrictive. What about the case of lesions not detectable in one (or several) phase(s) (e.g. non-contrast)? How many FLLs do you lose with this assumption?
    7 - Replacing standard absolute positional encoding with a learnable relative positional encoding should be justified in Sect.2.3.
    8 - According to Sect.2.4, single-phase liver transformer blocks (SPLTB) are phase-specific, meaning that the model parameters for each phase are independent. You should justify this choice, since you could instead have relied on weight sharing between phase-specific branches. Moreover, it would have been interesting to experimentally justify such a late fusion approach compared to more standard early (or intermediate) fusion strategies.
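Comment 3 refers to a Dice term computed between fixed and moved lesion masks. As a minimal sketch of that overlap measure only (not the authors' implementation; the function names, the nested-list mask format, and the smoothing constant `eps` are assumptions made for illustration), it could be written as:

```python
def dice_coefficient(mask_a, mask_b, eps=1e-6):
    """Dice overlap between two binary masks given as equally shaped nested lists.

    `eps` is a hypothetical smoothing constant that avoids division by zero
    when both masks are empty.
    """
    flat_a = [int(v) for row in mask_a for v in row]
    flat_b = [int(v) for row in mask_b for v in row]
    if len(flat_a) != len(flat_b):
        raise ValueError("masks must have the same number of elements")
    # Voxels that are foreground in both masks.
    intersection = sum(a * b for a, b in zip(flat_a, flat_b))
    total = sum(flat_a) + sum(flat_b)
    return (2.0 * intersection + eps) / (total + eps)


def dice_loss(mask_a, mask_b):
    """Loss form: 0 for perfect overlap, close to 1 for disjoint masks."""
    return 1.0 - dice_coefficient(mask_a, mask_b)


# Toy 2x3 masks: three foreground voxels each, two of them shared.
fixed_mask = [[0, 1, 1], [0, 1, 0]]
moved_mask = [[0, 1, 0], [0, 1, 1]]
print(round(dice_coefficient(fixed_mask, moved_mask), 3))  # 0.667
```

The sketch makes the reviewer's point concrete: the term needs a lesion mask for both the fixed and the moved image, so it cannot be evaluated for a phase that has no mask at all.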

    Experiments:
    9 - You adopted a 2D classification pipeline instead of a 3D pre-processing. I question the slice-level classification strategy that was followed, especially regarding assessment. Since FLLs usually cover a certain number of axial slices, a fusion of 2D results is expected to provide 3D classification results. In this direction and compared to 2D pipelines, the 3D pipeline referred to as “Baseline 3” is not prone to redundancy between axial slices in terms of evaluation. One can wonder whether the metric values provided for Baseline 3 (e.g. approximately 60% in accuracy) are comparable with the ones achieved by Baselines 0, 1, 2 and TransLiver.
    10 - Apart from the ablation study (Sect.3.2), comparisons are made in Sect.3.1 using different state-of-the-art classification networks. In my opinion, the studies mentioned in the introduction (even if not referenced) should appear as comparison methods in the experiments.
    11 - The experiments (Tab.4) related to the influence of potential missing phase(s) are interesting. However, you should explain in more detail how to proceed if one (or more) phase(s) is (are) missing for both training (i.e. not the same number of phases across the training set) and inference (e.g. the corresponding SPLTB branch(es) have to be disconnected) procedures.
    12 - To handle dataset imbalance issues, you randomly selected 586 lesions as training/validation sets with no more than 700 axial slices in each lesion type. Isn’t it problematic not to have the same class distribution between training/validation and test sets?
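Comment 9 asks for 2D slice-level outputs to be fused into lesion-level (3D) results. One simple way to do this, sketched below, is a majority vote over the slices of each lesion. This is an illustrative assumption, not a procedure described in the paper or the rebuttal; the function names, the `(lesion_id, predicted_class)` input format, and the alphabetical tie-break are hypothetical.

```python
from collections import Counter, defaultdict


def lesion_level_predictions(slice_predictions):
    """Fuse slice-level predictions into one label per lesion by majority vote.

    `slice_predictions` is an iterable of (lesion_id, predicted_class) pairs,
    one pair per axial slice. Ties are broken alphabetically so the result
    is deterministic.
    """
    votes = defaultdict(Counter)
    for lesion_id, predicted_class in slice_predictions:
        votes[lesion_id][predicted_class] += 1
    fused = {}
    for lesion_id, counter in votes.items():
        # max() keeps the first maximal element, so sorting fixes the tie-break.
        fused[lesion_id] = max(sorted(counter), key=lambda name: counter[name])
    return fused


def lesion_level_accuracy(fused, labels):
    """Fraction of lesions whose fused label equals the ground-truth label."""
    correct = sum(1 for lesion_id, label in labels.items() if fused.get(lesion_id) == label)
    return correct / len(labels)


# Toy example with two lesions of three slices each.
slices = [
    ("lesion_1", "HCC"), ("lesion_1", "HCC"), ("lesion_1", "ICC"),
    ("lesion_2", "HM"), ("lesion_2", "HA"), ("lesion_2", "HA"),
]
fused = lesion_level_predictions(slices)
print(fused)  # {'lesion_1': 'HCC', 'lesion_2': 'HA'}
print(lesion_level_accuracy(fused, {"lesion_1": "HCC", "lesion_2": "HM"}))  # 0.5
```

Reporting accuracy per lesion rather than per slice removes the redundancy between neighbouring slices that the reviewer is concerned about, because each lesion contributes exactly one prediction.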

    Minor comments:
    13 - GeLU (Gaussian Error Linear Unit) activation should be defined since the use of ReLU activations is more common.
    14 - Typos and formulations:
      - “Liver cancer is one of the […] and has the second highest […]” instead of “Liver cancer is one of the […], which has the second highest […]” in Sect.1
      - “As a manner” or “As a way” instead of “As a means” in Sect.1
      - “vision transformers […] has been shown to replace” instead of “vision transformers […] has been shown can replace” in Sect.1
      - “Patch embedding consists of” instead of “Patch embedding is consisted of” in Sect.2.2
      - “which does not enable to […]” instead of “which is difficult to […]” in Sect.2.2
      - “they are also prone to […]” instead of “they are also easy to […]” in Sect.2.3
      - “IRFFN (Inverted Residual FFN) [6]” instead of “[6] IRFFN (Inverted Residual FFN)” in Sect.2.3
      - add a comma between “Then” and “they are separated and […]” in Sect.2.4
      - add a comma between “Then” and “we fuse the features […]” in Sect.4

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    • 3D classification results are expected
    • Existing works addressing FLL classification using Transformers should be referenced and employed for comparison purposes
    • The methodological part is not 100% clear (e.g. description of the pre-processing unit)
  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    4

  • [Post rebuttal] Please justify your decision

    R1Q3,5, R1Q6,11, R1Q8,13, R1Q1 and R1Q12 have been well answered. Ambiguities pointed out in R1Q2,10 have also been addressed in the rebuttal. However, the method described in [22] could have been re-implemented based on the provided architecture and training information. Although I understand that vision Transformers are mostly pre-trained in 2D, 3D classification results are expected even when 2D networks are employed. For these reasons, my overall opinion remains the same as during the review phase.



Review #2

  • Please describe the contribution of the paper

    This paper constructs TransLiver, a hybrid framework with a ViT backbone for liver lesion classification.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper presents a hybrid framework with a ViT backbone. The paper is well organized.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. Our approach involves implementing a multi-stage pyramid structure and incorporating convolutional layers into the original transformer encoder architecture.
    2. Our method has been validated using an in-house dataset. However, we have not yet decided whether we will release this dataset to the public or not.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    no code.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    The method is not novel. Using a multi-stage pyramid structure and adding convolutional layers to the original transformer encoder is not novel at all.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    3

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The method is not novel. Using a multi-stage pyramid structure and adding convolutional layers to the original transformer encoder is not novel at all.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    5

  • [Post rebuttal] Please justify your decision

    After reading the authors’ rebuttal, I would like to increase my score to weak accept.



Review #3

  • Please describe the contribution of the paper

    The authors propose a ViT based liver lesion classification framework for classifying seven different types of focal liver lesions by utilizing multi-phase CT images. The authors designed the framework in such a way that the transformer architecture can be applied to their comparatively smaller in-house dataset without giving up the benefits of a transformer architecture.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The strengths of this paper are:
    (1) Proposing a ViT based framework for liver lesion classification.
    (2) Designing a block called Single-phase Liver Transformer Block, which uses a spatial reduction structure so that the computational overhead is reduced. Transformer based models need large amounts of data to be trained. The authors bypass the need for a large amount of data by introducing this Single-phase Liver Transformer Block, which reduces the size of the key and value of the attention modules by using depthwise convolution modules. The authors include a learnable relative positional encoding in this Single-phase Liver Transformer Block instead of the absolute positional encoding used in the original transformer architecture.
    (3) Using cross phase tokens for multi-phase fusion. The cross phase tokens are utilized to reinforce the communication between phases.
    (4) Introducing a pre-processing unit to acquire multi-phase annotated lesions from single-phase annotated lesions. The pre-processing unit is built on the VoxelMorph and U-Net frameworks.
    (5) Extensive experiments are conducted, and the method is shown to outperform other methods in classifying seven liver lesion classes. Generally, the other frameworks developed for this problem have only utilized four different classes of liver lesions.
    (6) Utilizing multi-phase CT images of the subject.
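Strength (2) attributes the reduced computational overhead to a spatial reduction of the key and value. The arithmetic behind that kind of saving can be sketched as follows, assuming the keys and values of an H x W token grid are downsampled by a factor `reduction` along each side. The grid size and reduction factor used here are illustrative values, not figures taken from the paper, and the function name is hypothetical.

```python
def attention_score_entries(height, width, reduction=1):
    """Count query-key score entries for one attention head.

    Queries keep the full height x width token grid, while keys and values
    are assumed to be downsampled by `reduction` along each spatial side.
    """
    if reduction < 1 or height % reduction or width % reduction:
        raise ValueError("reduction must be a positive divisor of height and width")
    num_queries = height * width
    num_keys = (height // reduction) * (width // reduction)
    return num_queries * num_keys


# Illustrative numbers only: a 56 x 56 token grid and a reduction factor of 8.
full = attention_score_entries(56, 56)
reduced = attention_score_entries(56, 56, reduction=8)
print(full, reduced, full // reduced)  # 9834496 153664 64
```

Under these assumptions the score matrix shrinks by a factor of `reduction ** 2`, which is where the saving in memory and compute would come from.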

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    It would have been nice to have some examples of the seven different types of liver lesions, since the authors claim that the other methods worked on classifying four different types of liver lesions. Adding some visual examples of the cases where the model failed and providing some intuition behind them would have strengthened the paper.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors mention in the reproducibility checklist that the code will be released. The authors provided the necessary details such as hyperparameter settings, training epochs, hardware utilized, batch size, etc. The authors also provided an extensive ablation study in the supplementary material. The results should be reproducible with these details.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    (1) Providing some additional results on hyperparameter tuning would be helpful. (2) Showing/explaining some results where the method does not work as expected would be helpful.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    7

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The adaptation of the ViT architecture to a limited dataset for liver lesion classification is interesting. The authors made careful design choices to learn from multi-phase CT data with the help of cross-phase design tokens. In general, learning from multi-phase data would increase the memory footprint, which the authors are able to avoid here. The authors also utilize a pre-processing unit to generate labels for all the phases from a single-phase annotation.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #4

  • Please describe the contribution of the paper

    The paper presents a transformer based model for classification of liver lesions from multi-phase CT images. The experiments have been carried out with a proprietary dataset. The results of the experiment show that the method is able to classify seven types of liver lesions with an accuracy of 90.9% and, on this and other performance measures, is superior to four reference methods.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The transformer based approach used is state of the art.
    2. The proposed method in general has sufficient novelty.
    3. The pre-processing stage with a registration network seems to be both efficient and novel.
    4. The ablation study shows sufficiently that all the subparts of the model and images from all phases are necessary for obtaining the best classification accuracy and other performance measures.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The small image size of 224x224 pixels may be a limiting factor for the performance. There should have been an ablation study for verifying its sufficiency.
    2. A more in-depth analysis should have been given of the poor classification performance on the HM lesions. Also, HA lesions have a low proportion in the data, but their classification accuracy is high.
    3. Experiments with publicly available datasets should have been presented in the paper, e.g. by including Table 3 of the supplementary material.
    4. A few qualitative samples of failed classification cases should have been provided to give insight in the proprietary dataset.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The experiments have been carried out with a proprietary dataset and the authors do not indicate any aim to release it for public use. This is unfortunate as Table 3 of the supplementary material seems to indicate that the used dataset would have novelty and impact value.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    Please see my comments in the weaknesses section.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The novelty of the method, the experiments and the overall contribution of the paper are valuable and worth publishing in MICCAI.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    6

  • [Post rebuttal] Please justify your decision

    I don’t see any reason to change my rating. R2 seems to be very critical concerning the lack of novelty, but I would not pay too much attention to that because there is novelty in both the method and its application to medical imagery.




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    Reviewers all agree that the proposed model is novel with a hybrid architecture combining Transformers and convolutional layers. The quantitative results are encouraging, with large margins over state-of-the-art networks. However, reviews were mixed, mainly due to the experimental section and the limited analysis of the results, with full 3D classification results expected. Authors should comment on these issues, particularly those from R#1.




Author Feedback

We thank all reviewers for their constructive comments. Below, we clarify the main issues.

R1Q2,10 & R4Q3 (FLL classification methods comparison is expected): We included works studying multi-phase CT FLL classification of more than 4 classes in Table 3 of the supplementary material, and some of them are discussed in the introduction (Para 2). Comparison with their methods is not as objective as we would wish because we cannot ensure fairness, since their data and code are not publicly available, whereas we compare our result with some image classification SOTAs on the same dataset and experiment settings. The word “few” in the introduction (Para 4) might cause misinterpretation, for we only find one similar paper [22], which is also listed in the table mentioned above and uses self-attention (a part of ViT); we will revise it in the final version.

R1Q9 & AC (3D results are expected): We have experimented on a 3D pipeline in Figure 3 of the paper and Table 4 of the supplementary materials. The comparable settings and probable reasons for the poor performance of the 3D pipeline are stated in Sec 3.2. A large portion of lesions in our dataset have few slices, which weakens the redundancy between slices in the 2D pipeline, while the number of slices is still clearly larger than the number of lesions, alleviating the overfitting which usually occurs in transformers, as we explained in the introduction (Para 4). Moreover, vision transformers are mostly pretrained on 2D images, causing poor performance when transferring to a 3D pipeline. We acknowledge that a 3D pipeline is not prone to redundancy between axial slices and is more commonly used in medical images, but its appropriateness may be limited on our dataset or in similar real cases, as we stated in the introduction (Para 3). We will make the assessment more detailed in the final version.

R1Q3,5 (Pre-processing unit is not clear): The single-phase annotated lesion has the position and class labels in all phases but they are not aligned, so when a patient has 2 or more lesions it can be difficult to determine which lesions in different phases are the same. In the pre-processing unit, VoxelMorph registers the CTs for alignment in the original abdominal CT space and the lesion matcher makes lesions extracted from different phases match each other. This step is necessary because CTs are commonly grouped by patients and single-phase annotations can be easily derived from segmentation models, as we clarified in the introduction (Para 3). We will make a clearer description in the final version.

R1Q6,11 (Missing phases case is not handled): The experiment conducted in Figure 4 is an ablation study of phases, where we train specific models for different combinations of phases. We cut the missing phases’ branches when training.

R1Q8,13 & R4Q1 (Design choices are not justified): (R1Q8) We use the late fusion strategy because the semantic concepts are learned in higher layers, which benefits the cross phase connection. (R1Q13 & R4Q1) We follow the design of the original ViT.

R1Q1 (Related work about segmentation is not discussed): Liver segmentation/detection is indeed a pre-requisite for FLL classification and we discussed the relation between them in the introduction (Para 3).

R1Q12 (Problematic dataset split): After handling dataset imbalance issues, we put all remaining lesions into the test split to obtain more samples and make our result robust, and we ensure that the distributions of the train/val and test sets are still close on the whole.

R3 (Dataset examples are not found): We list 4 types of our FLL examples in Figure 1, with several types (ICC, HA) not included in most of the other works.

R2 (Lack of novelty): The novelty of our paper is 3-fold. Besides the hybrid architecture with a ViT backbone and conv layers with improvements over existing methods, we use cross phase tokens to enhance the phase communication in fusion. We also introduce a pre-processing unit to reduce the labor cost.
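The rebuttal (R1Q3,5) says that, after registration, a lesion matcher pairs up lesions extracted from different phases, but it does not state the matching criterion. The sketch below illustrates one plausible criterion only: greedy one-to-one matching by centroid distance. The function name, the centroid representation, and the `max_distance` threshold are all assumptions introduced for illustration and should not be read as the authors' method.

```python
import math


def match_lesions(reference_centroids, other_centroids, max_distance=10.0):
    """Greedily pair lesions across two registered phases by centroid distance.

    Each centroid is an (x, y, z) tuple in the shared registered space.
    Returns a dict mapping reference-lesion index to other-phase index.
    Pairs farther apart than `max_distance` (a hypothetical threshold) are
    never matched, so a lesion can remain unmatched.
    """
    candidates = []
    for i, ref in enumerate(reference_centroids):
        for j, other in enumerate(other_centroids):
            distance = math.dist(ref, other)
            if distance <= max_distance:
                candidates.append((distance, i, j))
    candidates.sort()  # closest pairs are assigned first

    matches, used_ref, used_other = {}, set(), set()
    for _, i, j in candidates:
        if i in used_ref or j in used_other:
            continue  # keep the matching one-to-one
        matches[i] = j
        used_ref.add(i)
        used_other.add(j)
    return matches


# Toy example: two lesions in the reference phase, three in another phase.
arterial = [(10, 10, 5), (40, 42, 7)]
venous = [(41, 40, 7), (11, 9, 5), (90, 90, 20)]
print(match_lesions(arterial, venous))  # {0: 1, 1: 0}
```

In this toy case the third lesion of the second phase has no counterpart and stays unmatched, which corresponds to the situation raised in R1Q6 where a lesion is not found in every phase.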




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    Contrary to what R2 states, I believe the paper does have some novelty with the hybrid architecture combining Transformers and convolutional layers, which reduces the size of the key and value of the attention modules using depthwise convolution modules. The rebuttal has clarified several points raised by the reviewers. I agree with R1 that 3D classification results would have been preferable even though 2D networks were used, and their absence is a limitation of the paper; however, the authors have explained in detail the limitations of the 3D pipeline, which is in my opinion reasonable. Given that the strengths slightly outweigh the weaknesses and most points can be addressed, I would tend towards acceptance.



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This paper proposes TransLiver, a hybrid model that combines Transformers and convolutional layers for multi-phase focal liver lesion classification. The model utilizes cross-phase tokens and a pre-processing unit to enhance feature fusion and obtain lesion areas on multi-phase CT images. This model is tailored to the authors’ smaller in-house dataset and gains the benefits of a transformer architecture. I advocate for the acceptance of this paper. In case of acceptance, I would strongly suggest the authors incorporate discussions brought up during the rebuttal process in the final version of the manuscript.



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The paper develops a transformer based model for classification of liver lesions using multi-phase CT images. The topic and the proposed method are interesting and novel. Though concerns remain about the lack of comparison with one existing method, given the contributions of the paper, I recommend acceptance.


