DDViT: Double-Level Fusion Domain Adapter Vision Transformer


With the help of Vision Transformers (ViTs), medical image segmentation has achieved outstanding performance. In particular, ViTs overcome a limitation of convolutional neural networks (CNNs), which rely on local receptive fields: ViTs use self-attention to model relationships between all image pixels or patches simultaneously. However, they require large datasets for training and do not capture low-level features well. To that end, we propose DDViT, a novel ViT model that incorporates a CNN to alleviate data hunger in medical image segmentation, with two multi-scale feature representations. Notably, our approach combines a ViT with a plug-in domain adapter (DA) and a Double-Level Fusion (DLF) technique, complemented by a mutual knowledge distillation paradigm that facilitates the exchange of knowledge between a universal network and specialized domain-specific network branches. The DLF framework plays a pivotal role in our encoder-decoder architecture, combining the TransFuse module with a robust CNN-based encoder. Extensive experiments across diverse medical image segmentation datasets demonstrate the efficacy of DDViT compared with alternative CNN-based and Transformer-based approaches.
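The global self-attention that the abstract contrasts with local receptive fields can be illustrated with a minimal sketch. This is generic single-head scaled dot-product attention, not the DDViT architecture. The function names, the toy patch vectors, and the use of identity projections (queries, keys, and values all equal to the patch embeddings) are assumptions made for illustration only.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(patches):
    """Single-head scaled dot-product self-attention.

    Assumption for illustration: Q = K = V = patches (no learned
    projections). Every patch attends to every other patch, which is
    the global context that a local convolution kernel lacks.
    """
    d = len(patches[0])
    scale = math.sqrt(d)
    outputs, weights = [], []
    for q in patches:
        # Similarity of this patch to all patches, itself included.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in patches]
        w = softmax(scores)
        weights.append(w)
        # Output is the attention-weighted average of all patch vectors.
        outputs.append(
            [sum(w[j] * patches[j][c] for j in range(len(patches))) for c in range(d)]
        )
    return outputs, weights


# Four hypothetical 2-dimensional patch embeddings.
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
out, weights = self_attention(patches)
```

Each row of `weights` is a probability distribution over all patches, so every output vector mixes information from the whole image rather than from a fixed neighborhood.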


Copyright © 2024, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.



Sun, L., & Sheng, V. S. 2024. DDViT: Double-Level Fusion Domain Adapter Vision Transformer. Proceedings of the AAAI Conference on Artificial Intelligence, 38(21). https://doi.org/10.1609/aaai.v38i21.30516