
Authors

Yunan Wu, Francisco M. Castro-Macías, Pablo Morales-Álvarez, Rafael Molina, Aggelos K. Katsaggelos

Abstract

Multiple Instance Learning (MIL) has been widely applied to medical imaging diagnosis, where bag labels are known and instance labels inside bags are unknown. Traditional MIL assumes that instances in each bag are independent samples from a given distribution. However, instances are often spatially or sequentially ordered, and one would expect similar diagnostic importance for neighboring instances. To address this, in this study, we propose a smooth attention deep MIL (SA-DMIL) model. Smoothness is achieved by the introduction of first and second order constraints on the latent function encoding the attention paid to each instance in a bag. The method is applied to the detection of intracranial hemorrhage (ICH) on head CT scans. The results show that this novel SA-DMIL: (a) achieves better performance than the non-smooth attention MIL at both scan (bag) and slice (instance) levels; (b) learns spatial dependencies between slices; and (c) outperforms current state-of-the-art MIL methods on the same ICH test set.
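For orientation, smoothness constraints of this kind are commonly written using the graph Laplacian of the slice-adjacency graph. The sketch below is illustrative only: it is consistent with the authors' later feedback that the regularizer's gradient involves the matrices L and L^T L, but the notation and exact equations may differ from those given in Sec. 2 of the paper.

\[
\mathcal{L}_{S_1}(\mathbf{f}) \;=\; \mathbf{f}^{\top} L\, \mathbf{f} \;=\; \sum_{(i,j) \in E} (f_i - f_j)^2,
\qquad
\mathcal{L}_{S_2}(\mathbf{f}) \;=\; \lVert L\,\mathbf{f} \rVert_2^2 \;=\; \mathbf{f}^{\top} L^{\top} L\, \mathbf{f},
\]

where \(\mathbf{f} = (f_1, \dots, f_N)\) collects the latent attention values of the N slices in a bag, E is the set of edges between neighbouring slices, and \(L = D - A\) is the Laplacian built from the adjacency matrix A and degree matrix D. The first penalty discourages large differences between neighbouring slices (first-order smoothness); the second penalizes the discrete second derivative of f along the scan (second-order smoothness).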

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43904-9_32

SharedIt: https://rdcu.be/dnwHb

Link to the code repository

https://github.com/YunanWu2168/SA-MIL

Link to the dataset(s)

https://www.kaggle.com/competitions/rsna-intracranial-hemorrhage-detection/data


Reviews

Review #2

  • Please describe the contribution of the paper

    The paper proposes a model called SA-DMIL for detecting intracranial hemorrhage (ICH) in head CT scans. This model is implemented by adding smoothing attention regularization terms, and it has three main contributions:

    (1) At scan (bag) and slice (instance) levels, the proposed model achieves better performance than non-smooth attention MIL.

    (2) The model can learn the spatial dependencies between slices.

    (3) It outperforms current state-of-the-art MIL methods on the same ICH test set. In addition, it provides a potential reference for improving the performance of existing automatic diagnostic architectures without increasing their complexity.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The innovation of the paper lies in the following two aspects.

    Firstly, whereas many current so-called innovative intracranial hemorrhage (ICH) detection models achieve higher accuracy by increasing model complexity, the authors simply introduce a smooth attention regularization term on top of a mainstream baseline model for medical images to improve the multiple instance learning results. This adds no further complexity, and it is effective.

    Secondly, the smooth attention mechanism explains how the relationships between adjacent slices are effectively learned, which gives the authors’ work a degree of interpretability. To some extent, the proposed model can serve as a reference to help radiologists detect and diagnose cerebral hemorrhage quickly and reliably.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Since only the RSNA dataset is used to test the model’s performance, it is less convincing that the model would be equally valid on different data. The authors may refer to this paper: “Volumetric memory network for interactive medical image segmentation”.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Given that the data is a publicly available challenge dataset and the code is made available via GitHub, the paper is fully reproducible.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    I have the following minor comments to the paper:

    (1) In subsection 2.2, the symbols “L” and “D” should be accompanied by a brief description of their specific meaning.

    (2) In subsection 2.2, a superscript should be added to the summation symbol in the denominator of formula (3), since the range of the superscript has already been specified.

    (3) In subsection 2.3, the description of symbol “f” in the first line of the text paragraph below formula (6) lacks the superscript “b”.

    (4) In subsection 2.4, the range description of symbol “k” in the first line of the text paragraph below formula (10) should be {1,2}.

    (5) In Fig. 1, the part that passes through classifiers before scan prediction is not well represented.

    (6) The choice of metrics, and the inconsistency between the metrics used at the scan and slice levels, are not explained.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I think the two strengths of the paper stated above support acceptance, and the minor issues in the paper can be addressed through revision. However, given that only one dataset is used, the effectiveness and applicability of the proposed model are not fully verified.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The paper proposed to use a multiple instance learning (MIL) approach with smooth attentional aggregation for the detection of intracranial hemorrhage (ICH). The ICH detection task has been formulated as an MIL problem where a CT scan (bag) is considered positive if it contains at least one slice with evidence of hemorrhage. The authors introduced dependencies between instances based on a probabilistic formulation which imposes smoothness on the latent function that encodes the attention score for each instance.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The proposed idea is novel and it is grounded in statistical theory. It can also be applied to other problems where there is dependency among instances. The results show that this approach can improve the smoothness of the attention weights.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The idea is developed based on equations 4 and 5. However, these equations are not well motivated or discussed, and some questions remain unanswered. If there is any related work or other references that may help readers, the authors should cite them. For example, how were these equations derived? Why is f() used in the formulations rather than s() (the attention weights)?
    • In equation 9, it is unclear why the authors use a convex combination of the CE and Sk objectives, since usually only a regularization coefficient is applied to the regularization term (L_{Sk} here). Does this explain why the authors observed the following behavior: “Note that, as α increases, the performance of the model first improves and then drops, which is consistent with the role played by the SA loss as a regularization term.”? As α increases, the role of the CE objective is reduced as well (see the sketch after this list).
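
    For reference, the two parameterizations are equivalent up to a positive rescaling of the loss. A short sketch, assuming equation (9) has the convex-combination form described in this item (the exact symbols in the paper may differ):

    \[
    (1-\alpha)\,\mathcal{L}_{CE} + \alpha\,\mathcal{L}_{S_k}
    \;\propto\;
    \mathcal{L}_{CE} + \frac{\alpha}{1-\alpha}\,\mathcal{L}_{S_k}
    \;=\; \mathcal{L}_{CE} + \lambda\,\mathcal{L}_{S_k},
    \qquad
    \lambda = \frac{\alpha}{1-\alpha} \;\Longleftrightarrow\; \alpha = \frac{\lambda}{1+\lambda},
    \]

    so driving α toward 1 both increases the effective regularization weight λ and shrinks the relative weight of the CE term, which is consistent with the quoted observation that performance first improves and then drops.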
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Experimental settings are provided in the paper. The authors mention that the code will be made available as well.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    • More citations should be provided for the developed algorithm, if possible.
    • The reason why smooth attention should work better compared to non-smooth attention is unclear.
    • There are some questions that could be answered to make the paper clearer, e.g., why is the smoothness applied to f() rather than to the attention weights s()? Why does equation (5) constrain the second derivative of f()?
    • Redundant or unused information can be removed from the paper to make it easier to understand. For example, if the authors do not use this information: “Page 4: Note that (4) and (5) impose smoothness but they can be modified to model, for example, competition between the attention weights by simply replacing the minus sign with a plus sign.”, it can be removed. Additionally, equation (10) is redundant, since the CE objective is well defined.
    • There are several pooling modules based on Transformers (e.g., the PMA block in Set Transformer [1]) that can model the dependencies between instances in a bag. The authors could compare SA-DMIL with them in the future, if they would like to extend the paper. [1] Lee, Juho, et al. “Set transformer: A framework for attention-based permutation-invariant neural networks.” International Conference on Machine Learning. PMLR, 2019.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The idea is novel and extendable. The results generally show improvement over the baselines. However, there are some details that can be better explained to make the paper clearer.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #1

  • Please describe the contribution of the paper
    • This paper investigates multiple instance learning (MIL) in CT intracranial hemorrhage detection. Different from prior MIL methods which assume instances (i.e., slices) in each bag (i.e., CT scan) are independent, the authors take the spatial continuity in CT scans into account and introduce a smoothing regularizing term to the loss function.
    • The proposed smooth attention deep MIL (SA-DMIL) method leverages a graph to learn the dependency between slices and shows considerable performance improvement compared to prior non-smooth attention MIL methods.
  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The motivation of the proposed SA-DMIL is clear and reasonable.
    • The visualizations in Fig. 2 are convincing and clearly show how the proposed SA-DMIL improves model performance.
    • This overall paper is well organized and written.
    • Promising results are achieved compared to prior MIL methods on an intracranial hemorrhage detection dataset.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The proposed method utilizes a graph to learn the correlation between slices. I am wondering how much additional time/computational cost this brings. It would be better to analyze the trade-off between model performance and these costs.
    • In the results and discussion, it would be better to discuss why SA-DMIL-S1 outperforms SA-DMIL-S2.
    • In page 7, ‘Fig. S2 in the appendix’ should be ‘Fig. S1 in the appendix’. Additionally, I’m wondering why the authors do not plot SA-DMIL-S1 and SA-DMIL-S2 in the same figure.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    This paper shows moderate reproducibility. Many important model and training details have been provided. The code is not available currently, and the authors claim the code will be presented. The dataset used in this paper is publicly available.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    Please refer to the weaknesses section.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    In general, the overall organization and writing of this paper are good. The core idea is simple yet effective. The motivation of the proposed SA-DMIL is clear and convincing. The overall novelty of this paper is adequate, though similar ideas have been explored in other areas. A nontrivial performance improvement is achieved by the proposed method, and the visualizations validate its effectiveness. Thus, my current rating is acceptance.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This paper investigates multiple instance learning (MIL) for CT intracranial hemorrhage detection. The reviewers consistently rated this paper positively, thus accept!

    While this paper has received positive ratings from the reviewers, they also pointed out weaknesses. The authors should address the questions/concerns listed by the reviewers in the final paper, mainly: (1) explain the computational cost and experimental results raised by R#1; (2) discuss the paper mentioned in Item 6 by R#2; (3) motivate the equations questioned by R#3.




Author Feedback

We appreciate the thoughtful and detailed comments made by the reviewers. We reply to their comments below. The final version of the paper will be updated accordingly.

(R1) Discuss computational costs. For the proposed SA-DMIL, additional computational costs are incurred ONLY during the TRAINING stage (the architecture is the same as in Att-MIL). Moreover, these costs are limited: the gradient of the regularizer requires either the L or L^T L matrix (computed only once) and the gradient of f with respect to the parameters of the preceding layers (already computed in the backward pass of Att-MIL). The only overhead introduced is in the gradient updates, where a convolution of the gradient of f with a matrix proportional to either L or L^T L is required (efficiently computed in O(N log N), where N is the bag size). In our experiments, the training times for SA-DMIL-S1, SA-DMIL-S2 and Att-MIL are 12.73, 12.88 and 12.5 mins per epoch, respectively.

(R2) Application to other tasks and reference discussion. Due to page length limitations, we decided to focus on understanding and gaining insights into the proposed methodology and apply it to the relevant problem of ICH detection. The reference [1], provided by R2, proposes a memory-augmented network for segmenting 3D medical images, tested on three public datasets of bag-like data. These offer the opportunity to conduct further experiments to assess the performance of our model, which will be explored in future research.

(R3) Eqs. 4 and 5 are not well motivated. Why smooth f instead of s? More citations on related work. As we devise relationships between the slices of a scan as edges of a graph, the attention mechanism is conceived as a function defined on this graph. We regularize it following previous ideas from Spatial Statistics [2, Ch. 5 Sec. 2] and Manifold Regularization [3, Eq. 4], where graph-based modelling is employed. These references will be added to the manuscript. Regarding the choice to smooth f instead of s: the latter requires a normalization across the instances in a bag, while f is an unconstrained parameter, which ensures consistent smoothing.

(R1) Why SA-DMIL-S1 outperforms SA-DMIL-S2. S1 enforces f to be first-order smooth, while S2 imposes second-order smoothness. Both are used in the image processing literature. However, the difference reported in Table 1 is minimal, with S2 slightly underperforming. Therefore, we do not have a definitive conclusion on which approach is superior, as both offer valuable benefits.

(R3) Why smooth attention works better than non-smooth attention. Non-smooth MIL assumes each instance to be independently distributed. However, in tasks like ICH detection, where neighbouring instances are expected to have similar diagnostic importance, imposing smoothness on the latent function captures this spatial correlation and enhances the performance of the model.

(R2) On the choice of metrics. Why the metrics at instance and bag levels are not the same. The decision to choose specific metrics for evaluation is driven by the imbalanced nature of the RSNA dataset: it is essential to assess the model’s performance separately for positive and negative instances/bags. Additionally, at the instance level only attention weights are computable; they do not have a clear probabilistic interpretation, so AUC is not statistically grounded there.

(R3) Why use a convex combination in the loss? Why not use a regularization coefficient (lambda) for the regularization term? Both approaches are mathematically equivalent (alpha = lambda / (1 + lambda)). However, while lambda ranges from zero to infinity, alpha lies in the interval [0,1], which makes the analysis simpler.

[1] Zhou, T., et al. “Volumetric memory network for interactive medical image segmentation.” MIA, 2023.
[2] Ripley, B. D. “Spatial statistics.” John Wiley & Sons, 2005.
[3] Belkin, M., et al. “Manifold regularization: A geometric framework for learning from labeled and unlabeled examples.” JMLR, 2006.
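
As a purely illustrative companion to the computational-cost reply to R1 above, the NumPy sketch below builds the Laplacian of a chain graph over the N slices of a bag and evaluates first- and second-order smoothness penalties of the form f^T L f and ||L f||^2. The function names, the chain-graph assumption, and the exact penalty forms are illustrative assumptions, not the authors' implementation.

    import numpy as np

    def chain_laplacian(n_slices):
        """Laplacian L = D - A of a chain graph linking each slice to its neighbours."""
        A = np.zeros((n_slices, n_slices))
        idx = np.arange(n_slices - 1)
        A[idx, idx + 1] = 1.0   # edge between slice i and slice i+1
        A[idx + 1, idx] = 1.0
        D = np.diag(A.sum(axis=1))
        return D - A

    def smoothness_penalties(f, L):
        """First-order penalty f^T L f and second-order penalty ||L f||^2."""
        s1 = float(f @ L @ f)             # sum of squared differences between neighbouring slices
        s2 = float(np.sum((L @ f) ** 2))  # squared discrete second derivative (up to boundary terms)
        return s1, s2

    # Toy bag of 6 slices with hypothetical latent attention values f.
    L = chain_laplacian(6)
    f_smooth = np.array([0.1, 0.2, 0.3, 0.3, 0.2, 0.1])
    f_jagged = np.array([0.1, 0.9, 0.0, 0.8, 0.1, 0.7])
    print(smoothness_penalties(f_smooth, L))  # small penalties
    print(smoothness_penalties(f_jagged, L))  # much larger penalties

Multiplying the sparse, tridiagonal Laplacian of a chain graph by a vector is inexpensive, which is consistent with the authors' point that the extra cost appears only in the training-time gradient updates.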


