Objective: Breast ultrasound (BUS) is used to classify benign and malignant breast tumors, and automatic classification can reduce reader subjectivity. However, current convolutional neural networks (CNNs) face challenges in capturing global features, while vision transformer (ViT) networks have limitations in effectively extracting local features. Therefore, this study aimed to develop a deep learning method that enables the interaction and updating of intermediate features between a CNN and a ViT to achieve high-accuracy BUS image classification.
Methods: This study introduced the CNN and transformer multi-stage fusion network (CTMF-Net), consisting of two branches: a CNN branch and a transformer branch. The CNN branch employed the visual geometry group (VGG) network as its backbone, while the transformer branch utilized ViT as its base network. Both branches were divided into four stages. At the end of each stage, a proposed feature interaction module facilitated feature interaction and fusion between the two branches. Additionally, a convolutional block attention module was employed to enhance relevant features after each stage of the CNN branch. Extensive experiments were conducted against various state-of-the-art deep-learning classification methods on three public breast ultrasound datasets (SYSU, UDIAT and BUSI).
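The stage-wise exchange described above can be illustrated with a minimal NumPy sketch. This is a hypothetical simplification, not the authors' implementation: it assumes the feature interaction module aligns the CNN feature map with the ViT token sequence by reshaping and fuses the branches by element-wise addition (the names `interact`, `cnn_to_tokens` and `tokens_to_map` are illustrative only).

```python
import numpy as np

def cnn_to_tokens(feat_map):
    # (C, H, W) CNN feature map -> (H*W, C) token sequence
    c, h, w = feat_map.shape
    return feat_map.reshape(c, h * w).T

def tokens_to_map(tokens, h, w):
    # (H*W, C) token sequence -> (C, H, W) feature map
    n, c = tokens.shape
    return tokens.T.reshape(c, h, w)

def interact(cnn_feat, vit_tokens):
    # At the end of a stage, each branch receives the other's
    # features, aligned in shape, and updates its own representation.
    h, w = cnn_feat.shape[1:]
    cnn_updated = cnn_feat + tokens_to_map(vit_tokens, h, w)
    vit_updated = vit_tokens + cnn_to_tokens(cnn_feat)
    return cnn_updated, vit_updated

# Toy stage outputs: 64 channels over a 14x14 spatial grid.
cnn_feat = np.random.rand(64, 14, 14)      # CNN-branch stage output
vit_tokens = np.random.rand(14 * 14, 64)   # ViT-branch stage output
cnn_out, vit_out = interact(cnn_feat, vit_tokens)
print(cnn_out.shape, vit_out.shape)  # (64, 14, 14) (196, 64)
```

In practice, a learned projection would replace the bare reshape when channel dimensions differ between branches; the sketch only conveys the bidirectional update performed at every stage boundary.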
Results: For internal validation on SYSU and UDIAT, our proposed method, CTMF-Net, achieved the highest accuracy of 90.14 ± 0.58% on SYSU and 92.04 ± 4.90% on UDIAT, showing superior classification performance over other state-of-the-art networks (p < 0.05). Additionally, for external validation on BUSI, CTMF-Net showed outstanding performance, achieving the highest area under the curve score of 0.8704 when trained on SYSU, a 0.0126 improvement over the second-best visual geometry group attention ViT method. Similarly, when trained on UDIAT, CTMF-Net achieved an area under the curve score of 0.8505, surpassing the second-best global context ViT method by 0.0130.
Conclusion: Our proposed method, CTMF-Net, outperforms the compared state-of-the-art methods and can effectively assist doctors in achieving more accurate classification of breast tumors.
Keywords: Breast cancer; Breast ultrasound image; Classification; Deep learning; Feature interaction.
Copyright © 2024 World Federation for Ultrasound in Medicine & Biology. Published by Elsevier Inc. All rights reserved.