Comparative analysis of self-supervised pre-trained vision transformers and convolutional neural networks with CheXNet in classifying lung conditions
Abstract
Classifying lung diseases from images has been a challenging task for Deep Learning (DL) methods. Self-Supervised Learning (SSL) in particular has been widely recognized as effective for pre-training, especially with recent methods such as DINOv2 ViT-S/14 and ConvNeXt-V2. In this research, Transfer Learning (TL) was conducted on these two models using the NIH CXR-14 dataset to perform 15-class classification. Additionally, SwAV ResNet-50, DINO ViT-S/16, and CheXNet were adopted as baselines. Evaluation results showed that DINOv2 ViT-S/14, with a macro-averaged AUC of 0.743, is superior to the other three models pre-trained on ImageNet, but inferior to CheXNet, which was pre-trained on the same NIH CXR-14 dataset, albeit without the "No Finding" class. Even so, CheXNet obtained only 0.773 AUC with 0.328 recall. Further analysis of feature separability showed that neither CheXNet nor DINOv2 ViT-S/14 was able to extract meaningful features that differentiate the "No Finding" class from the other 14 lung conditions, confirming the finding from a previous study that this dataset's labels are noisy, which renders it unsuitable for downstream TL tasks. Nevertheless, DINOv2 ViT-S/14 produced attention visualizations similar to those of CheXNet on some classes despite being pre-trained on natural images from ImageNet. Therefore, despite its unsatisfactory performance on this dataset, DINOv2 holds great potential for similar future studies, although it may require pre-training on medical image datasets.
Commun. Math. Biol. Neurosci.
ISSN 2052-2541
Copyright ©2025 CMBN