In the late 2010s, AI-based large-scale machine learning and DL enabled the accurate diagnosis of medical radiographic images, garnered attention in biomedical engineering, and provided novel insights into precision medicine26,27,28. More recently, deep convolutional neural network algorithms have gained popularity in dentistry and have achieved considerable success in analyzing dental radiographic images29. The potential clinical applications of DL technology depend closely on (1) deeper and more sophisticated neural network structures and (2) large, high-quality annotated datasets. In particular, a gold-standard dataset annotated and verified by medical and dental professionals is essential for creating a reliable radiographic image-based DL model in the medical and dental fields26,27.
To evaluate the performance of DL-based identification and classification of various types of DIS in actual clinical practice, a large, highly accurate, and reliable dataset is necessary. Recently, a large-scale, comprehensive multicenter dataset that could be used in the clinical field for DL-based identification and classification of DIS was collected and openly released by the national initiative. To our knowledge, the dataset used in the present study contained a larger number of radiographic images and types of DIS than any previously reported implant-related dataset. Because the current study used this dataset, its results are expected to show higher feasibility than those of previous implant-related DL research.
Most previous studies evaluated the accuracy of conventional or minimally modified DL architectures (e.g., YOLO, SqueezeNet, ResNet, GoogLeNet, and VGG-16/19) using fewer than a few thousand dental radiographic images and usually fewer than 10 different types of DIS, and reported classification accuracies ranging from 70 to 100%17,18,19,20,21,22,23,24,25. One study that applied a ResNet architecture to 9767 panoramic images of 12 types of DIS reported a high accuracy of 98% or more23. Our previous pilot study, which applied automated DL to 11,980 images of six different types of DIS, also showed reliable outcomes and achieved a very high accuracy of 95.4% (sensitivity: 95.5%; specificity: 85.3%)18. Conversely, another study based on YOLOv3 using 1282 panoramic images showed a relatively low accuracy, in the 70% range on average22.
The automated DL algorithm used in this study achieved an AUC of 0.885 when periapical and panoramic radiographs were combined. When only panoramic radiographs were used, the AUC was 0.878, and when only periapical radiographs were used, the AUC was 0.868. Thus, the combination of periapical and panoramic images yielded the highest classification accuracy and periapical images alone the lowest, but there was no statistically significant difference among the three groups. These outcomes are consistent with the previously reported absence of a significant difference in classification accuracy between panoramic and periapical images and are also likely attributable to the almost threefold larger number of panoramic images (n = 105,080) than periapical images (n = 36,188) used for training and validation17,18.
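As an illustration of how AUC values such as those above can be obtained for a multiclass problem, the following minimal Python sketch computes a one-vs-rest, macro-averaged AUC from predicted class probabilities. It is not the evaluation code used in this study: the averaging scheme, the function names binary_auc and macro_ovr_auc, and the toy inputs are all assumptions made for the example.

```python
from typing import List, Sequence


def binary_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def macro_ovr_auc(prob_rows: List[Sequence[float]],
                  true_classes: Sequence[int],
                  n_classes: int) -> float:
    """Unweighted mean of the one-vs-rest AUCs over all classes (assumed scheme)."""
    aucs = []
    for c in range(n_classes):
        labels = [1 if t == c else 0 for t in true_classes]
        scores = [row[c] for row in prob_rows]
        aucs.append(binary_auc(scores, labels))
    return sum(aucs) / n_classes


# Invented toy data: four images, two hypothetical implant classes.
toy_probs = [[0.9, 0.1], [0.6, 0.4], [0.3, 0.7], [0.2, 0.8]]
toy_truth = [0, 1, 0, 1]
toy_auc = macro_ovr_auc(toy_probs, toy_truth, n_classes=2)  # 0.75
```

The pairwise loop is quadratic in the number of images and is meant only to make the definition explicit; a rank-based implementation would be preferable for datasets of the size discussed here.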
Specifically, the Nobel Biocare Branemark, Megagen Exfeel external, Osstem US III, and Dentsply Xive showed a high classification accuracy of 100.0%, whereas Warantec IT showed a low accuracy (19.0–35.3%) despite having a conventional fixture morphology with an internally tapered shape, owing to the relatively small number of radiographic images available (only 238 panoramic and 208 periapical images). From this perspective, DL has great advantages in identifying and classifying similar types of DIS; however, its accuracy varies significantly with the amount of training data, which is considered a fundamental limitation of existing DL algorithms. Further research should be conducted to confirm whether the amount of data required for training can be reduced by adopting an algorithm that is more specialized for DIS classification than the one used in this study.
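The dependence of accuracy on class size is easiest to see when accuracy is reported per class rather than overall. The following Python sketch shows one way to do this; the function name per_class_accuracy and the labels "A" and "B" are hypothetical, and the toy predictions are invented rather than taken from the study.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def per_class_accuracy(y_true: Sequence[Hashable],
                       y_pred: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Fraction of correctly classified images within each true class."""
    total = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {c: correct[c] / total[c] for c in total}


# Invented example: class "A" is well represented, class "B" is rare.
y_true = ["A", "A", "A", "A", "B", "B"]
y_pred = ["A", "A", "A", "A", "A", "B"]

by_class = per_class_accuracy(y_true, y_pred)  # {"A": 1.0, "B": 0.5}
overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

In this toy case the overall accuracy is 5/6, which hides the fact that only half of the rare class is recognized.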
In the radiographs used in this study, the main ROI was the implant fixture, but a number of confounding conditions (such as the surrounding alveolar bone, cover screw, healing abutment, and provisional or definitive prosthesis) were also included. For use in actual clinical practice, datasets should consist of implant fixtures with different confounding conditions and angles, rather than implant fixtures with perfect/intact shapes and standard angles. Several previous studies, as well as this one, have confirmed that DL models trained on implant datasets with different angles and confounding conditions achieve a high accuracy of over 80%17,18,25. Furthermore, the Gradient-Weighted Class Activation Mapping technique showed that the types of DIS were classified by focusing on the implant fixture itself rather than on the various confounding components. Therefore, confounding factors and angles do not appear to have a significant impact on the accuracy of DL-based implant system classification.
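For readers unfamiliar with Gradient-Weighted Class Activation Mapping, its core computation can be sketched in a few lines of Python. Each feature-map channel is weighted by the spatial average of the gradient of the class score with respect to that channel, and the weighted sum is passed through a ReLU. The sketch assumes that the activations and gradients of one convolutional layer have already been extracted as K x H x W nested lists; the network, the layer, and the implementation used in this study are not reproduced here, and the function name grad_cam_map is hypothetical.

```python
from typing import List

Tensor3D = List[List[List[float]]]  # K channels x H rows x W columns


def grad_cam_map(activations: Tensor3D, gradients: Tensor3D) -> List[List[float]]:
    """Return the H x W Grad-CAM heat map for one image and one target class."""
    n_channels = len(activations)
    height = len(activations[0])
    width = len(activations[0][0])

    # Channel weight: gradient of the class score averaged over all positions.
    weights = [
        sum(sum(row) for row in gradients[k]) / (height * width)
        for k in range(n_channels)
    ]

    # Weighted sum of the feature maps, with negative evidence clipped by ReLU.
    return [
        [
            max(0.0, sum(weights[k] * activations[k][i][j] for k in range(n_channels)))
            for j in range(width)
        ]
        for i in range(height)
    ]


# Invented 2-channel, 2x2 example.
acts = [[[1.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 2.0]]]
grads = [[[1.0, 1.0], [1.0, 1.0]], [[-1.0, -1.0], [-1.0, -1.0]]]
heat = grad_cam_map(acts, grads)  # [[1.0, 0.0], [0.0, 0.0]]
```

In practice the resulting map is upsampled to the radiograph size and overlaid on the image, so that high values indicate the regions that supported the predicted implant type.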
In a recent study wherein healthcare professionals with no coding experience evaluated the feasibility of automated DL models using five publicly available, open-source medical image datasets, most classification models showed accuracy and diagnostic properties comparable to those of state-of-the-art DL algorithms30. Developing customized DL models according to the types and characteristics of datasets requires highly specialized skills and expertise. This study confirmed that an automated DL model built by the algorithm itself, without coding by computer scientists or engineers, achieved an excellent classification accuracy of over 86% in a multiclass classification of 27 different types of DIS with similar designs.
Identifying and classifying DIS with varying features and characteristics from limited clinical and radiographic information is a challenge not only for inexperienced dental professionals but also for dentists with sufficient experience in implant surgery and prosthetics. Several past studies have identified DIS from a forensic perspective based on radiographs, and efforts to classify DIS have continued until recently; however, most of these rely on empirical evidence, making it difficult to achieve high reliability9,10,31. More recently, computer-based implant recognition software and web-based DIS classification platforms have been developed and used; however, most require manual classification of DIS features (such as coronal interface, flange, thread type, taper, and apex shape) or contain only a small number of DIS datasets, limiting their active use in clinical practice32.
The first long-term goal of this research is to build a database of almost all types of DIS used worldwide and to train sophisticated and refined DL algorithms optimized for DIS classification on it, achieving a level of reliability sufficient for actual clinical practice. The second goal is to create a web- or cloud-based environment where datasets can be freely stored, used for training, and validated in real time. Achieving these goals requires the proactive development of standard protocols that facilitate data sharing and integration, secure the transmission and storage of large datasets, and enable federated learning33,34.
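To make the idea of federated learning concrete, the following Python sketch shows the aggregation step of federated averaging, in which each participating center trains locally and shares only its model parameters, which are then averaged with weights proportional to the local dataset size. This is a generic illustration and not a protocol proposed or implemented in this study; the function name federated_average, the flat parameter lists, and the numbers are assumptions.

```python
from typing import List, Sequence


def federated_average(client_weights: List[Sequence[float]],
                      client_sizes: Sequence[int]) -> List[float]:
    """Average client parameter vectors, weighted by local dataset size."""
    if len(client_weights) != len(client_sizes):
        raise ValueError("one dataset size is required per client")
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(n_params)
    ]


# Invented example: two centers, the second holding three times more images.
global_weights = federated_average([[1.0, 2.0], [3.0, 4.0]], [1, 3])  # [2.5, 3.5]
```

Because only parameters leave each center, the radiographs themselves stay on site, which is why this approach is often discussed alongside secure data storage and transmission.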
This study had several limitations. Collecting a dataset for supervised learning requires considerable tangible and intangible resources, including finances, time, trained personnel, hardware, and software. Therefore, unsupervised learning, a technique for overcoming small-scale and imbalanced datasets, has been introduced and cautiously tested in dentistry; however, it remains a challenging approach35. Large-scale, multicenter datasets may be useful for future DL-based research and actual clinical trials to identify and classify various types of DIS. Nevertheless, the dataset used in this study had inherent limitations regarding the interpretability of the results. Although the raw NIA dataset consisted of 165,700 radiographs and 42 different types of DIS, the number of panoramic and periapical images for each type of DIS was highly heterogeneous. In addition, DIS manufactured by foreign companies or made of non-titanium materials (such as non-metallic ceramic zirconia), which are rarely used in South Korea, were scarce or absent in the raw dataset. To mitigate the potential problems of overfitting and selection bias, we selected only DIS with more than 100 panoramic and periapical radiographic images.
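The selection rule described above can be expressed as a simple filter, sketched here in Python. The sketch assumes that the threshold applies to each modality separately, which is one possible reading of the criterion; the function name select_dis_types and the entry "Hypothetical DIS X" are invented for illustration, whereas the Warantec IT counts are those reported in this study.

```python
from typing import Dict, List, Tuple


def select_dis_types(image_counts: Dict[str, Tuple[int, int]],
                     min_images: int = 100) -> List[str]:
    """Keep DIS types with more than min_images panoramic AND periapical images.

    image_counts maps a DIS name to (panoramic count, periapical count).
    Applying the threshold per modality is an assumption of this sketch.
    """
    return [
        name
        for name, (panoramic, periapical) in image_counts.items()
        if panoramic > min_images and periapical > min_images
    ]


counts = {
    "Warantec IT": (238, 208),        # counts reported in this study
    "Hypothetical DIS X": (150, 40),  # invented entry, too few periapical images
}
selected = select_dis_types(counts)  # ["Warantec IT"]
```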