
Authors

Efklidis Katsaros, Piotr K. Ostrowski, Krzysztof Włódarczak, Emilia Lewandowska, Jacek Ruminski, Damian Siupka-Mróz, Łukasz Lassmann, Anna Jezierska, Daniel Węsierski

Abstract

A microcamera firmly attached to a dental handpiece allows dentists to continuously monitor the progress of conservative dental procedures. Video enhancement in video-assisted dental interventions alleviates the low light, noise, blur, and camera shake that collectively degrade visual comfort. To this end, we introduce a novel deep network for multi-task video enhancement that enables macro-visualization of dental scenes. In particular, the proposed network jointly leverages video restoration and temporal alignment in a multi-scale manner for effective video enhancement. Our experiments on videos of natural teeth in phantom scenes demonstrate that the proposed network achieves state-of-the-art results in multiple tasks with near real-time processing. We release Vident-lab at https://doi.org/10.34808/1jby-ay90, the first dataset of dental videos with multi-task labels to facilitate further research in relevant video processing applications.

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16449-1_18

SharedIt: https://rdcu.be/cVRUY

Link to the code repository

N/A

Link to the dataset(s)

https://doi.org/10.34808/1jby-ay90


Reviews

Review #1

  • Please describe the contribution of the paper

    This manuscript presents a multi-task network architecture that handles three tasks related to dental interventions: video enhancement, binary teeth segmentation, and homography estimation. The authors show that these tasks are correlated and that the overall performance of the proposed architecture across multiple tasks is comparable to that of state-of-the-art methods designed for a single task.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The paper is well organized and written. The figures are very well made, and the method design and experiment setup are well described with sufficient details for readers to understand the work.
    2. The multi-task architecture design seems interesting and reasonably thought out, and the authors demonstrate comparable performance of the proposed method against state-of-the-art works on several tasks.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The evaluation results are not impressive enough in a couple of respects. Firstly, the ablation studies show that the performance degradation after disabling one module is often not significant, which makes me wonder whether the architecture has created enough synergy among these tasks. Secondly, based on the results, it appears that simply running several state-of-the-art models in parallel on the original image input could yield higher FPS and similar or better performance than the proposed method. The authors also mention that MHN works better than the proposed method on the original video frames and becomes significantly better when high-definition images are used instead. All of the above suggests that there is still room for improvement in the proposed method.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    This work is not reproducible because neither the source code nor the dataset will be made available.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    1. For Fig. 1, can the authors explain why Fig. 1 is equivalent to the architecture in Fig. 2? The information path in Fig. 1 is a bit confusing, and I would like to know the path of information flow from input to the final output.
    2. There are some inconsistencies in the use of subscripts and superscripts for several symbols. For example, the symbol O has different layouts of subscript and superscript in the 6th line of Sec. 6, Eq. (1), and the 2nd and 8th lines of “Problem Statement”. Similarly, the appearance of B differs between the 4th line of “Problem Statement”, Eq. (2), and the 2nd and 3rd lines of “Training”. In the last two lines of “Encoders”, H lacks the subscript t-1 -> t that it carries in Fig. 2.
    3. Please carefully go through the text and make the style of all symbols consistent. If the same symbol is used in different ways, please consider adding notes to explain this to the reader.
    4. Fig. 2 is a bit confusing because the symbols are in the same row as the modules, and in some places the arrows between two modules are not shown. Please consider making the data dependencies and information flow clearer.
    5. Fig. 2: in terms of the method design, why do the authors choose to use f_t^{1,2,3} as one input to produce h_t^{1,2,3} instead of the feature maps after channel attention? It looks like the feature maps after channel attention are used for the other two tasks.
    6. Fig. 2: why do the authors upsample only the binary mask output to the finer-scale level? Why not also upsample the deblurred image as input features?
    7. For Sec. 3, “Noise, blur, colorization”: is only camera C_2 used for this frame-to-frame training? Is camera C_1 also involved?
    8. Fig. 3: are images from C_2 used only for color mapping training? Are they also used directly during deblurring training, beyond this color mapping training?
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Evaluation results are not strong enough, suggesting that the design of the multi-task architecture has room for further improvement. Neither the dataset nor the source code will be made available to the community.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    2

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    In this paper, a new dental video enhancement dataset is proposed, accompanied by a strong benchmark model. The dataset is tailored for three tasks: video restoration, segmentation, and homography estimation. The proposed multi-task model is evaluated on all three tasks and compared with other methods.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    (1) The first dataset of dental videos with multi-task labels; this work proposes a new application of a macro camera in dentistry. (2) A solid multi-task benchmark model for three important dental video tasks: video restoration is important for deblurring and noise suppression, segmentation could be used to assist doctors, and homography estimation could be used for video stabilization.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The restoration here seems to emphasize video deblurring, while the authors are also encouraged to include a brief discussion of video stabilization, as the two are closely correlated and often discussed together; indeed, some blur is introduced by camera jitter. Stabilization is related to homography estimation and can be viewed as a smoothing task over the estimated transformations (see the sketch below). Reference: Yu, Jiyang, and Ravi Ramamoorthi. “Learning video stabilization using optical flow.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020.
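
    To make this concrete, below is a minimal sketch of stabilization as homography smoothing. The function name, the window radius, and the element-wise path averaging are illustrative assumptions, not the paper's method; learned approaches such as the optical-flow method referenced above are more principled.

```python
import numpy as np

def stabilization_corrections(homographies, radius=15):
    """Sketch: treat stabilization as smoothing the camera path implied by
    per-frame homographies H_t, each assumed to map frame t to frame t+1.
    Returns one corrective warp C_t = S_t @ inv(P_t) per frame, where P_t
    is the accumulated camera path and S_t its windowed average."""
    path = [np.eye(3)]
    for H in homographies:             # accumulate the camera trajectory
        P = H @ path[-1]
        path.append(P / P[2, 2])       # normalize scale before averaging
    path = np.stack(path)              # (T+1, 3, 3)

    corrections = []
    for t in range(len(path)):
        lo, hi = max(0, t - radius), min(len(path), t + radius + 1)
        S = path[lo:hi].mean(axis=0)   # crude element-wise smoothing
        corrections.append(S @ np.linalg.inv(path[t]))
    return corrections                 # warp frame t with C_t, e.g. cv2.warpPerspective
```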

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    There is no explicit statement about whether the dataset will be made publicly available or whether the code will be released.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    Overall, the paper is in good shape; more discussion of how these three tasks are related would be appreciated.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    7

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    My rating is based on the contributions of the new dataset and the strong benchmark method.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The authors propose a novel deep network for multi-task video enhancement that works in near real time for macro-visualization of dental scenes. The authors also release a new dataset of dental videos with multi-task labels to facilitate further research in video processing applications.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The paper is well written, easy to follow, and addresses an important issue.
    2. An adequate literature review and sufficient experiments have been done.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. Since the authors indicate that the method runs in near real time, an indication of the computational time and memory requirements relative to the current state of the art would be useful.
    2. The model is very task-specific; any indication of whether it can be applied to other, similar applications would be useful.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    No code/dataset link has been provided.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    Please see Weakness section

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Please see Weakness section

  • Number of papers in your stack

    4

  • What is the ranking of this paper in your review stack?

    2

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    All three reviewers concur on the novelty, approach, and validation of the paper. The paper is well-written and organized.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    2




Author Feedback

We thank the Reviewers for the time and effort invested in providing insightful feedback. We have incorporated their comments and suggestions to improve the paper. Below, we first address the Reviewers' main concerns and then their detailed comments.

Our dataset will be publicly available for research on video-based multi-task learning (as stated in the paper). The proposed MOST-Net (Fig. 1) can extend to other multi-task applications (e.g., laparoscopy, colonoscopy) where tasks are amenable to scaling (Eq. 1), such as concurrent semantic segmentation and depth estimation. However, the effectiveness of MOST-Net in other applications is subject to future research, while this study validates the architecture on the task set for dental intervention applications. Computational time (FPS) and space (#P(M)) are shown in Tab. 2.

The ablations in Tab. 2 quantitatively explore the relationship between tasks; the camera-ready version will discuss the relation between video restoration, motion estimation, and teeth segmentation. R1 correctly argues that the proposed solution leaves room for improvement in all tasks. We regard the current limitations of MOST-Net's instantiation (Fig. 2) as motivation for using our dataset as a testbed for multi-task learning. We intentionally compared the performance of MOST-Net and MHN, which both processed noisy videos, to the performance of MHN processing ground-truth videos with low noise, low blur, and bright colors. As expected, in comparison to the MACE of 1.3-1.5 achieved by both networks on noisy videos, MHN achieved a much lower MACE of 0.6 on the ground-truth videos. This performance gap indicates that video restoration is a critical task to focus on, because it affects the homography estimation task used for video stabilization. Like R1, we are convinced that the synergy between the tasks can be further improved, e.g., by another learning algorithm or loss function for MOST-Net; this is the subject of our ongoing research. On the other hand, the restoration task is already markedly affected by the two ablations (a drop of 0.7-0.8 dB) that exclude information from the homography task (NW) and the segmentation task (NS).

R1's observation that several state-of-the-art models running in parallel on the original image input would yield higher FPS than MOST-Net is true, but such a pipeline is not straightforward to implement on a single GPU while maintaining the individual FPS results. While ESTRNN+MHN+DL indeed runs faster than MOST-Net, our solution approaches real-time speed, which is sufficient for our application, and focuses primarily on the video restoration task, achieving higher PSNR, SSIM, and E(W) results than the pipeline of state-of-the-art methods.

Fig. 1 illustrates the general MOST-Net framework, in which information flows in a swirl-like manner. MOST-Net is task-agnostic and adapts to the synergy of the tasks at hand. Fig. 2, a special case of Fig. 1, demonstrates its application-specific instantiation with tailored information flow. The $h_t^s$ indeed utilizes $f_t^s$ instead of $F_t^s$. The attended features $F_t^s$ result from aligning $f_{t-1}^s$ with $f_t^s$ via $H_{t-1 \rightarrow t}^{s}$. Alignment errors, however, would be propagated further into the homography estimation module, especially at earlier training stages; thus, we opted for a design that presumably minimizes negative transfer. Each task employs the features at a different stage, i.e., the raw $f_t^s$ for homography and the attended $F_t^s$ for restoration and segmentation. Regarding label generation, C2 is indeed used exclusively for learning the color mapping function. To employ C2 for image-to-image learning, the spatial alignment between frames from C1 and C2 would need to be flawless; however, cameras with different intrinsic parameters make exact spatio-temporal registration an open and challenging task. Thus, we employ only C1 for F2F denoising, making no assumptions about the noise distribution and using the exact noise maps inferred by F2F.
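
Since no code is released, the following PyTorch sketch illustrates the alignment step described above: previous-frame features are warped into the current frame with the estimated homography before attention-based fusion. The function name, tensor shapes, and the pixel-coordinate convention are our assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def align_prev_features(f_prev, H, eps=1e-8):
    """Warp previous-frame features f_prev (B, C, h, w) into the current
    frame using a 3x3 homography H (B, 3, 3), assumed to map pixel
    coordinates of frame t-1 to frame t. Each current-frame pixel samples
    frame t-1 at H^{-1} x via grid_sample."""
    B, C, h, w = f_prev.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=f_prev.dtype, device=f_prev.device),
        torch.arange(w, dtype=f_prev.dtype, device=f_prev.device),
        indexing="ij",
    )
    ones = torch.ones_like(xs)
    # Homogeneous pixel coordinates of the current frame, shape (h*w, 3).
    grid = torch.stack([xs, ys, ones], dim=-1).reshape(-1, 3)
    src = grid.unsqueeze(0) @ torch.inverse(H).transpose(1, 2)  # (B, h*w, 3)
    src = src[..., :2] / (src[..., 2:3] + eps)                  # dehomogenize
    # Normalize coordinates to [-1, 1] as required by grid_sample.
    x = 2.0 * src[..., 0] / (w - 1) - 1.0
    y = 2.0 * src[..., 1] / (h - 1) - 1.0
    samp = torch.stack([x, y], dim=-1).reshape(B, h, w, 2)
    return F.grid_sample(f_prev, samp, align_corners=True)

# The aligned features would then be fused with the current features
# (e.g., concatenation followed by channel attention) to produce the
# attended features used by the restoration and segmentation heads.
```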


