
Authors

Zixuan Wang, Yifan Shao, Jingyi Sun, Zhili Huang, Su Wang, Qiyong Li, Jinsong Li, Qian Yu

Abstract

Cardiovascular disease is a high-fatality illness. Intravascular Optical Coherence Tomography (IVOCT) can significantly assist in diagnosing and treating cardiovascular diseases. However, locating and classifying lesions across hundreds of IVOCT images is time-consuming and challenging, especially for junior physicians, so an automatic lesion detection and classification model is desirable. To achieve this goal, in this work we first collect an IVOCT dataset comprising 2,988 images from 69 IVOCT acquisitions with 4,734 lesion annotations spanning three categories. Based on the newly collected dataset, we propose a multi-class detection model based on the Vision Transformer, called the G-Swin Transformer. The essential part of our model is grid attention, which is used to model relations among consecutive IVOCT images. Through extensive experiments, we show that the proposed G-Swin Transformer can effectively localize different types of lesions in IVOCT images, significantly outperforming baseline methods in all evaluation metrics. Our code is available at https://github.com/Shao1Fan/G-Swin-Transformer

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43987-2_32

SharedIt: https://rdcu.be/dnwKn

Link to the code repository

https://github.com/Shao1Fan/G-Swin-Transformer

Link to the dataset(s)

https://drive.google.com/drive/folders/1Y3K2SNz26Nl6EMC6PqGYpmqAYFsH2ExZ?usp=drive_link


Reviews

Review #1

  • Please describe the contribution of the paper

    The main contributions of the paper are twofold: first, an IVOCT dataset with 2,988 images and 4,734 annotated lesions is collected, and second, a new detection model named G-Swin is proposed. Ablation experiments on multiple components of the proposed model and on the augmentation scheme are provided to highlight the importance of the individual components.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The authors collected a large dataset of IVOCT images with many annotated bounding box annotations and differentiating three classes.
    • The presented ablation studies underline the contributions of the proposed G-Swin Transformer model and show the performance improvements of individual components.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The naming of different model components throughout the paper can be somewhat confusing since it does not follow the generally used conventions. Usually, Swin Transformers (and also the G-Swin Transformer) are regarded as backbones or encoders which extract features from the images. They are not by themselves detection methods, but can be used in existing detection methods such as Faster R-CNN. This differentiation can also be found in the original Swin Transformer [1] paper, where experiments with multiple detection methods and different backbones are provided. From Figure 2 and the text description it is possible to conclude that the chosen detection method is Faster R-CNN based. (The rest of the review follows the aforementioned nomenclature of detection method and backbone.)

    2. The selected baselines are too weak given the properties of the dataset. Since the frames of the dataset are highly correlated, the task is closer to video-based detection than to single-frame detection. As such, works such as the Video Swin Transformer [2], SlowFast networks [3], etc., which can also utilise temporal information for their predictions, are the appropriate baselines. This would allow for a fair comparison and highlight the novelty of the proposed grid attention scheme for fusing information. Furthermore, stronger baseline models such as YOLOv5 could be utilised to demonstrate the superiority of the proposed method when compared to more modern detection methods.

    3. To strengthen the confidence in the newly proposed backbone, additional ablation experiments with different detection methods could be provided to demonstrate the improved feature extraction in a variety of algorithms.

    4. In some points the manuscript is not sufficiently precise: 1) no information on the used data splits is given; 2) no information on the used hyperparameters such as training length, number of epochs, optimiser, etc. is given, which drastically restricts the reproducibility of the presented work; 3) several points from the reproducibility checklist (where everything was checked) cannot be found in the current version of the manuscript, such as an anonymised link to the dataset and code (publishing these is also not mentioned in the manuscript), the need for ethics approval, etc.

    [1] Liu, Ze, et al. “Swin Transformer: Hierarchical vision transformer using shifted windows.” Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021.
    [2] Liu, Ze, et al. “Video Swin Transformer.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.
    [3] Feichtenhofer, Christoph, et al. “SlowFast networks for video recognition.” Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    As pointed out in the weaknesses section, the reproducibility checklist is fully checked, while the publication of data, code, or ethics approval is not mentioned in the manuscript. Furthermore, no information on hyperparameters is given in the paper, which makes reproducing the results very difficult.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    The authors tackle a relevant medical problem with a unique private dataset uncovering potentially interesting shortcomings of current algorithms. Additional baseline methods are needed in order to ensure that the identified shortcomings were not already tackled by existing models from the video processing domain. Furthermore, an ablation experiment with different detection methods to show the effectiveness of the newly proposed G-Swin Transformer would enrich the manuscript significantly. Finally, additional steps should be taken to add information about the training and evaluation process to make sure that the presented experiments are reproducible and comparable.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The authors collected an interesting and novel dataset for their experiments which builds a great basis for future research. In order to ensure that the proposed methodological novelty outperforms existing works additional baselines and potentially extended ablation experiments are needed.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    5

  • [Post rebuttal] Please justify your decision

    The authors have clarified the dataset and code release. While the dataset won’t be publicly available, at least the code with some exemplary images will be made available if the work is accepted. Updated naming conventions are mentioned, and one additional baseline (YOLOv5) was added to the paper, which improves the baseline section. While video models are certainly not applicable out of the box, they could have been used as an inspiration (i.e. with slight modifications) for further baselines. Furthermore, extended ablation experiments would have strengthened the confidence in the newly proposed backbone (e.g. running RetinaNet with the G-Swin Transformer backbone). In summary, the publicly available code will allow a certain degree of reproducibility of the results, and the addition of the YOLOv5 model improved the baseline selection. While multiple open points remain, this puts the paper marginally above the acceptance threshold, and thus the score was updated from 4 to 5.



Review #2

  • Please describe the contribution of the paper

    This paper presents a vision transformer based deep learning model for multi-class lesion detection on IVOCT images. Moreover, this paper develops an IVOCT dataset with multi-class lesion annotations, including macrophages, cavities/dissections and thrombi.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    • This work combines the Swin Transformer and grid attention to incorporate information from preceding and subsequent OCT frames when making predictions, which leads to an increase in performance.
    • This work contributes a public dataset of IVOCT images for object detection, which can be used by other researchers.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    • Details of the dataset should be provided, for example the number of subjects, mode of the IVOCT system, resolution, size of the imaging area, pixel size, etc.
    • Details of the experimental settings are missing. What are the training/validation/testing sets?

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    This work is reproducible.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    • For Figure 1 and Figure 4, please add scale bars.
    • Cross-validation is needed to achieve a more accurate evaluation of the different models.
    • The size of the OCT images is set to 575x575 pixels. Is there any downsampling operation involved in the conversion from DICOM to PNG? If so, please provide reasons/discussion for the downsampling operation.
    • How does the training/inference time of the proposed model compare with the other methods in Table 1?

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This work presents a vision transformer based deep learning model which takes advantage of multiple OCT frames to perform object detection on multiple types of lesions. Also, this work presents a public IVOCT dataset with multiple types of lesions. However, details of the experimental setup should be provided. Also, details of the dataset should be given to the audience.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    7

  • [Post rebuttal] Please justify your decision

    All my questions are addressed.



Review #5

  • Please describe the contribution of the paper

    In the paper “Vision Transformer based Multi-Class Lesion Detection in IVOCT”, a collection of image data from 70 acquisitions, labeled regarding manifestations of atherosclerotic plaques, is leveraged to propose a Transformer-based automated approach for detecting the location of said manifestations from single axial slices. For the task at hand, a Swin Transformer is extended with a grid attention mechanism in order to leverage neighboring slices. The proposed extensions enhance the performance, as shown in an ablation study. Additionally, the method was compared to some other object detection approaches, where again a superior performance is reported.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The paper is overall well written and easily understood.
    • The motivation behind tackling this task and all constituents of the method are well argued and sound.
    • Data annotation was performed with extensive efforts.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • My biggest concern arises from the fact that the data splits used for creating the results are not reported. With that, it is not clear whether the data was split patient-wise, nor how large the test set really was. Linked to that, the authors ticked several marks in the reproducibility checklist which they don’t adhere to in the paper: they claim to release the data, the source code and the data splits, while none of these points are actually in the paper.
    • Extending the architecture to also use nearby slices is the main methodological novelty, which is a pretty straightforward step. However, the way the data from adjacent frames is merged is not compared to any other, possibly simpler approach, like just concatenating the adjacent slices to the input.
  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    It is really hard to comment on the reproducibility, as the statements in the reproducibility checklist and the actual paper point in completely opposite directions. If one purely looks at the paper, it is as poor as it gets: no public data, no public code, no details on the data split, no details on the value ranges for the data augmentation, etc. Looking at the reproducibility checklist, all of that will be provided.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    • From scanning the literature and from some very limited personal experience, using a polar transformation or directly using the A-scans as input to a deep learning approach is often the better data representation to learn from for IVOCT [1,2,3]. I guess taking the RGB images as input stems from the way the data was annotated. However, this may still hold back the method in its current form.
    • Motivate why YOLOv3 was used instead of a more recent variant.
    • Motivate why rotations, which would be the most straightforward data augmentation strategy to me, were not used.

    [1] https://www.spiedigitallibrary.org/conference-proceedings-of-spie/12034/120340S/Automatic-microchannel-detection-using-deep-learning-in-intravascular-optical-coherence/10.1117/12.2612697.full
    [2] https://www.mdpi.com/2076-3417/11/16/7412
    [3] https://www.spiedigitallibrary.org/journals/journal-of-medical-imaging/volume-6/issue-4/045002/Coronary-calcification-segmentation-in-intravascular-OCT-images-using-deep-learning/10.1117/1.JMI.6.4.045002.full?SSO=1

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    A well-written paper solving a novel task with a novel extension of an existing architecture to solve the problem at hand more robustly. However, the paper in the form submitted is held back by a significant lack of detail, especially regarding the used data splits, which limits a confident assessment of the results presented in the paper. Also, the authors need to clarify the mismatch between the reproducibility checklist and the actual paper. I hope they get the chance to do so in a rebuttal.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    6

  • [Post rebuttal] Please justify your decision

    All of my concerns were dispelled.




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    Strengths: The paper curates an IVOCT dataset with a large number of bounding box annotations. The proposed G-Swin Transformer model and its individual components are evaluated through ablation experiments, highlighting their contributions and performance improvements. The proposed method incorporates information from neighboring OCT frames, resulting in improved performance.

    Weaknesses: The selected baselines for comparison are considered weak, and reviewers suggest using stronger baseline models such as YOLOv5 or models from the video processing domain for a fair comparison. Importantly, reviewers find that the manuscript lacks precision in certain aspects, such as not providing information about the data splits, the hyperparameters used, and other details that affect reproducibility. Finally, two reviewers find that although the authors claim to release the source code and the data splits, none of these are in the paper. The authors are recommended to address the major concerns of the reviewers.




Author Feedback

We thank the reviewers for their positive feedback and constructive comments. We will first address the common concern on the missing details of our dataset, and then address each reviewer’s concerns point by point.

To R1&R2&R5: We apologize for not providing sufficient dataset details in Section 2. Here are the additional details:

1) Dataset splits: The dataset consists of data from 69 subjects, split into training/validation/testing sets with a subject ratio of 55:7:7. The splits contain 2,359/290/339 IVOCT frames, respectively. The data is split at the patient level, ensuring that frames from the same IVOCT acquisition appear in only one split. Our model is trained on the training set, selected based on validation-set performance, and the reported results are obtained on the testing set.
2) Data pre-processing: The dataset’s image format is PNG, and the image size is 575x575. There is no downsampling operation involved in the conversion from DICOM to PNG, in order to retain as much information as possible.
3) Dataset disclosure: We will not release the full dataset due to privacy issues. However, we will provide several examples along with the code if this work is accepted.
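
For illustration, a minimal sketch of a patient-level split like the one described above (the subject counts are from the rebuttal; the data structure, function and variable names are hypothetical, not from the paper):

```python
import random

def patient_level_split(frames_by_subject, n_train=55, n_val=7, n_test=7, seed=0):
    """Split subjects (not frames) so that all frames of one acquisition stay in one split."""
    subjects = sorted(frames_by_subject)              # e.g. 69 subject IDs in total
    random.Random(seed).shuffle(subjects)
    train_ids = subjects[:n_train]
    val_ids = subjects[n_train:n_train + n_val]
    test_ids = subjects[n_train + n_val:n_train + n_val + n_test]
    pick = lambda ids: [f for s in ids for f in frames_by_subject[s]]
    return pick(train_ids), pick(val_ids), pick(test_ids)
```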

To R1:

Q6-1 Naming confusion: Thank you for pointing it out. Our proposed G-Swin Transformer is used as the basic module of the encoder in the full model, which is developed based on Faster R-CNN. We will address the naming confusion in our revised version.

Q6-2 Why not compare to video detection? IVOCT data exhibits anisotropic properties, with lower pixel sampling rate in the z-axis compared to the x- and y-axes. Lesions in IVOCT data rarely appear in more than three consecutive frames, making video detection/object tracking models unsuitable. The characteristics of IVOCT data differ significantly from videos, where objects can persist across hundreds of frames.

Stronger baseline: We have included YOLOv5 as a stronger baseline for comparison. YOLOv5m achieved an mAP50 of 38.3, surpassing Faster R-CNN and RetinaNet but still below the Swin Transformer and our proposed model.

Q6-4 More training details: We apologize for the lack of training details due to space limitations. We trained the model for 60 epochs with the AdamW optimizer, following the Swin Transformer. The learning rate and weight decay are set to 1e-4 and 1e-2, respectively. We will include these details in the revised version.
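
As an illustration only, the stated hyperparameters map onto a standard PyTorch setup roughly as follows (the stand-in module, learning-rate schedule, and any data loading are assumptions and not taken from the paper):

```python
import torch
import torch.nn as nn

# Stand-in module; the actual detector (G-Swin backbone + Faster R-CNN head) is not reproduced here.
model = nn.Linear(8, 3)

# Settings stated in the rebuttal: AdamW, learning rate 1e-4, weight decay 1e-2, 60 epochs.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-2)
num_epochs = 60
```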

To R2:

Q9-1: Thank you for your suggestion! We will add scale bars in Fig. 1 and Fig. 4.

Q9-2 Cross-validation: Cross-validation is not feasible for our new dataset due to variations in lesion label distribution among patients. Simple K-fold splits could result in imbalanced training, validation, and testing sets, leading to unreliable cross-validation results.

To R5:

Q6-2 Comparison with 2.5D convolution: Thanks for the suggestion. We experimented with different ways of incorporating adjacent frames, including a “2.5D” variant that concatenates adjacent slices. However, our proposed method consistently outperformed the 2.5D convolution approach, as shown in Table 4.
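
To make the “2.5D” comparison concrete, a minimal sketch of the simpler alternative the reviewer mentions, i.e. stacking adjacent frames along the channel axis before a standard 2D detector (tensor shapes, names, and the one-frame neighborhood are assumptions):

```python
import torch

def stack_adjacent_frames(frames: torch.Tensor, idx: int) -> torch.Tensor:
    """frames: (T, C, H, W) pullback. Returns the current frame with its two
    neighbors concatenated on the channel axis, i.e. a (3*C, H, W) '2.5D' input."""
    prev_i = max(idx - 1, 0)
    next_i = min(idx + 1, frames.shape[0] - 1)
    return torch.cat([frames[prev_i], frames[idx], frames[next_i]], dim=0)

# Example: a 10-frame grayscale pullback of 575x575 images.
pullback = torch.rand(10, 1, 575, 575)
x = stack_adjacent_frames(pullback, idx=4)   # shape: (3, 575, 575)
```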

Q9-1 Why not use Polar transformation or A-scans? The decision to use RGB images as input was based on the annotation format used during data collection, where bounding boxes were manually drawn on IVOCT frames. Employing polar transformation would invalidate the bounding box annotations. Exploring the utilization of data in the Polar coordinate system is a potential avenue for future work.

Q9-2 Stronger baseline: Please refer to our response to R1 Q6-2.

Q9-3 Why not rotation? While rotation on IVOCT frames is acceptable during inference, it is not applicable during training. Rotating an IVOCT frame would also rotate the bounding boxes within it. Common detection methods struggle to handle rotated bounding boxes accurately, making it challenging to obtain precise bounding boxes after rotation. Consequently, we did not employ rotation augmentation in our training process.
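
A small numeric illustration of this point (the specific box and angle are made up): after rotating a frame, the tightest axis-aligned box that still covers the rotated annotation is larger than the original, so the label becomes loose.

```python
import math

def rotate_point(x, y, cx, cy, deg):
    """Rotate (x, y) around (cx, cy) by deg degrees."""
    r = math.radians(deg)
    dx, dy = x - cx, y - cy
    return (cx + dx * math.cos(r) - dy * math.sin(r),
            cy + dx * math.sin(r) + dy * math.cos(r))

# Hypothetical 100x40 box in a 575x575 frame, rotated by 30 degrees about the frame center.
x1, y1, x2, y2 = 237, 267, 337, 307
corners = [(x1, y1), (x2, y1), (x2, y2), (x1, y2)]
rotated = [rotate_point(x, y, 287.5, 287.5, 30) for (x, y) in corners]
xs, ys = zip(*rotated)
new_w, new_h = max(xs) - min(xs), max(ys) - min(ys)
print(new_w, new_h)  # ~106.6 x ~84.6: the axis-aligned label is now noticeably looser than 100x40
```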




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The authors’ rebuttal addressed the concerns raised by two of the reviewers. However, the rebuttal did not address some important issues, such as the lack of comparison with video models (the authors’ reasons were not sufficiently justified) and the absence of cross-validation (which is also not convincing; for example, according to the authors’ statement, the proposed method can only work on specific data splits, and, importantly, only a very limited number of subjects (n=7) are in the current test set). Nevertheless, overall, the work presented in this paper has a certain level of quality, and all reviewers have now increased their scores to accept it. Therefore, I also recommend accepting the paper, but I suggest that the authors ensure they can provide several examples along with the code, as claimed in the rebuttal (this is likely one of the major reasons all reviewers increased their original scores).



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    Although the reviewers have not responded to the rebuttal or changed their scores, I read the rebuttal and I think the authors addressed the majority of the concerns. The authors need to make sure to include the revisions as promised (e.g., the improved naming conventions) in the final submission.



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The authors’ response was enough for the paper to be accepted by all reviewers. It also (partially) allows for reproducibility of the methods despite the unavailability of the data used. Provided that the authors include the clarifications of the main points (provided in their rebuttal) in the final version, the paper would be acceptable.


