The present study evaluated the efficiency of AI in accurately detecting dental root fracture lines on digital 2D periapical radiographs. It introduces an AI-based voting system that deploys five different algorithms to overcome potential discrepancies between individual models. Such an approach supports less-experienced and undergraduate dentists, addressing the poor performance and reproducibility encountered in detecting root fractures [1].
Current literature supports utilizing AI in the dental healthcare system, albeit with particular attention to its limitations [17]. Recent work has also emphasized the importance of ongoing developments, which can significantly impact diagnosis and decision making in dentistry [18]. Health information systems are a branch of the Information Systems domain that can significantly support the diagnostic process in dentistry; when customized tools or analyses are implemented, the field is referred to as “Health Informatics” [19]. Artificial intelligence is the analytical tool most commonly employed in almost every intelligent system.
In the present study, five state-of-the-art AI algorithms were selected: VGG16, VGG19, ResNet50, DenseNet121, and DenseNet169. These models were chosen for their robustness, high performance, and proven efficacy in image classification tasks. VGG16 and VGG19 were chosen for their simplicity and depth, while ResNet50 has demonstrated high accuracy in detecting vertical root fractures in CBCT scans [20]. DenseNet121 and DenseNet169 leverage dense connectivity patterns to enhance feature reuse and parameter efficiency, making them well suited to image analysis tasks [21, 22].
In contrast to training and validating deep learning models on radiographic images retrieved from a real-world context (i.e., from patients) [23] or on ex vivo models that closely mimic it, such as teeth fixed in cadaver jaws [24], unmounted extracted teeth were utilized here. This choice simplified the testing model, reduced the noise produced by surrounding structures, and emphasized accuracy of detection in a consistent context. However, this ex vivo design does not closely reflect clinical settings, which limits the generalizability of the results; this is due in part to the restricted availability of patient data in real-world contexts. Nonetheless, recent evidence revealed improved diagnostic performance when the dataset was mixed with extracted teeth from an in vitro model, underscoring the viability of such an approach [20].
Dataset augmentation is commonly introduced to improve the generalization of the trained model; proposed methods include, but are not limited to, rotation, flipping, contrast adjustment, and noise introduction [25]. However, this strategy was not adopted here, despite the relatively small training sample, to avoid errors such as overfitting [26]. Data augmentation can perpetuate existing biases and imbalances in the original dataset and may not consistently improve results. Moreover, inappropriate or excessive augmentation can introduce noise and artifacts, potentially degrading the model’s performance, similar to the limitations seen with synthetically produced datasets [27]. The study should therefore reflect more reliable results that mimic the real-world context. At the heart of the methodology lies the handling of the dataset, which was split into training and testing/validation sets at a ratio of 80:20. Allocating 80% for training provides the model with ample data to learn patterns and generalize effectively, while reserving 20% for validation/testing allows an objective assessment of performance, addressing overfitting and underfitting issues [28], in accordance with the best diagnostic performance and in alignment with standard machine learning practice [3, 14].
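As a minimal sketch of the 80:20 partition described above (the file names and fixed random seed are illustrative assumptions, not the study’s actual pipeline):

```python
import random

def split_dataset(samples, train_ratio=0.8, seed=42):
    """Shuffle and split a list of samples into training and testing/validation sets."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = samples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

# With 400 root images, an 80:20 ratio yields 320 training and 80 testing images.
images = [f"root_{i:03d}.png" for i in range(400)]
train_set, test_set = split_dataset(images)
print(len(train_set), len(test_set))  # 320 80
```

Shuffling before the cut avoids an ordering bias (e.g., all fractured roots imaged first ending up in the same partition).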
Matplotlib, a plotting library for Python, was used in the present study to generate curves visualizing the models’ performance metrics. Its flexibility and granular control supported the production of clear, publication-quality graphs that provided an intuitive understanding of each model’s efficacy across diverse evaluation parameters [29]. The current study employed the AI-based system for dental fracture detection on 2D periapical radiographs of extracted teeth; despite the limitations of two-dimensional projections, they remain cost-effective and readily accessible diagnostic tools.
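A minimal sketch of how such an evaluation curve (here, a ROC curve) can be produced with Matplotlib; the labels, scores, and output file name are hypothetical and do not reproduce the study’s actual plotting code:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt

def roc_points(labels, scores):
    """Compute (FPR, TPR) pairs by sweeping a decision threshold over the scores."""
    pairs = sorted(zip(scores, labels), reverse=True)  # descending confidence
    pos = sum(labels)
    neg = len(labels) - pos
    tp = fp = 0
    fpr, tpr = [0.0], [0.0]
    for _score, label in pairs:
        if label == 1:
            tp += 1
        else:
            fp += 1
        fpr.append(fp / neg)
        tpr.append(tp / pos)
    return fpr, tpr

labels = [1, 1, 0, 1, 0, 0, 1, 0]                     # hypothetical fracture ground truth
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.35, 0.1]   # hypothetical model confidences
fpr, tpr = roc_points(labels, scores)
plt.plot(fpr, tpr, label="model")
plt.plot([0, 1], [0, 1], "--", label="chance")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.savefig("roc_curve.png")
```

The same pattern extends to accuracy or loss curves by plotting per-epoch metrics instead of threshold sweeps.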
The present findings are in line with the previous work of Kositbowornchai et al., which was built on radiographic images from an ex vivo model and outlined the efficiency of neural networks for fracture detection. Although overall accuracy varied with the training and test sample sizes, it reached up to 95.7% [3].
The proven reliability of the AI-based system for root fracture detection in the current study accords with the previous work of Guo et al., who incorporated a deep learning algorithm for crack detection and achieved accuracy exceeding 90%, a step forward in diagnostic and decision-making automation [30]. In that study, however, the neural network was tested on one hundred optical photographs rather than radiographic images, at a resolution of 1920 × 1080 pixels.
Notably, the present results revealed inconsistency between the five models, with varying values of PPV and ROC. Among them, DenseNet121 and DenseNet169 achieved the highest sensitivity and specificity rates, making them the most effective models both for detecting fractures and for identifying unfractured roots. The VGG19 and ResNet50 models showed comparable performance and were generally inferior to the others. Moreover, ResNet50 and VGG16 demonstrated a bias towards identifying fractured roots. The VGG16, DenseNet169, and DenseNet121 models exhibited values approaching 100%, in contrast to the remaining models, whose values ranged down to as low as 42% for the sensitivity of ResNet50 in detecting fractured roots.
These findings contradict those of Day et al., in which the best performance was achieved by ResNet50 and the worst by VGG16 [31]. This contradiction could be attributed to methodological differences: that study aimed to detect dental caries in panoramic radiographs, in contrast to the strategy adopted here. It utilized the Dental Caries Detection Net (DCDNet) model with a complex architecture and Multi-Predicted Output (MPO) structure, in which the AI models were deployed as blocks in a “bottleneck” structure. This resulted in varying precision and recall for different types of carious lesions, with cervical caries demonstrating the lowest values. Such discrepancies highlight the differences in system architecture and dataset complexity between the two studies.
On the other hand, these results were contradicted by those of Johari et al., where accuracy, sensitivity, and specificity were 70%, 97.7%, and 67.7%, respectively, on periapical radiographs [15]. This could be attributed to methodological differences: their work aimed to validate AI algorithms for the detection of vertical rather than horizontal root fractures, which pose challenges for detection on periapical radiographs. Notably, their AI model was a probabilistic neural network trained on a significantly smaller dataset of endodontically treated teeth (120 roots), in contrast to the CNNs employed in the current study, which were trained, tested, and validated on a larger dataset comprising 400 root images. The discrepancies between the results of Johari et al. and ours are mainly reflected in accuracy and specificity. Moreover, their algorithms presented a degree, albeit low, of false-positive prediction, attributed to anatomical root configurations such as grooves and invaginations [13].
Additionally, in contradiction to the present findings, a former study reported the superiority of ResNet50 over both VGG19 and DenseNet169, with accuracy, sensitivity, and specificity of 97.8%, 97.0%, and 98.5%, respectively, versus 74%, 42.6%, and 85% in the current study [15]. A possible explanation lies in the methodological discrepancies, as the authors conducted a retrospective evaluation on CBCT images. The superior diagnostic performance of 3D imaging modalities over digital periapical radiographs in AI-based root fracture detection was well established by Johari et al. [14].
Herein lies the significance of the voting system, in which the five models are deployed to obtain a final decision. This approach drew inspiration from the consensus voting employed by radiologists in real-world medical practice. By aggregating insights from diverse models, model ensembling leverages their collective wisdom to achieve more robust and accurate predictions, effectively addressing the challenges posed by overfitting in machine learning applications [26]. Similarly, Shrestha et al. comprehensively examined algorithmic fairness across various domains and advocated voting mechanisms to facilitate democratic decision-making [32].
Therefore, the combined decision is obtained by a majority vote of the individual AI classifiers, which explains the use of an odd number of classifiers to guarantee a decisive outcome [33]. The number of classifiers (n) may be 3, 5, or 7; however, a larger n demands more computing power, while a smaller n reduces the reliability of the results.
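A minimal sketch of such a majority vote over an odd number of classifiers, assuming each model emits a binary label per image (the label strings and example votes are hypothetical):

```python
from collections import Counter

def majority_vote(predictions):
    """Combine binary labels from an odd number of classifiers into one decision."""
    if len(predictions) % 2 == 0:
        # An even panel could tie; an odd n guarantees a decisive outcome.
        raise ValueError("use an odd number of classifiers to avoid ties")
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical outputs from the five models for one periapical image:
votes = ["fractured", "intact", "fractured", "fractured", "intact"]
print(majority_vote(votes))  # fractured
```

With five models, at least three must agree, so a single misclassifying model cannot overturn the combined decision.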
The current findings agree with the superiority of the voting system demonstrated by Shimpi et al. for the detection of oral cancers [27]. The voting algorithm employed here represents an innovative prospect for the present study and a step forward towards automated root fracture detection.
The present study was conducted in an in vitro setting rather than a clinical environment, which limits the applicability of the findings to real-world scenarios. The use of artificially created fractures may not accurately represent the complexity and variability of fracture lines encountered in clinical practice. Future studies should focus on validating these AI models in clinical settings to ensure their effectiveness and reliability under real-world conditions. Additionally, incorporating larger datasets and training on images of multirooted teeth could enhance the robustness and generalizability of the AI systems. Further research should also explore the integration of these voting-based AI systems into routine dental diagnostic workflows.