This study enabled highly accurate cephalometric analysis values to be obtained from lateral facial photographs of patients with malocclusion.
The proposed algorithm and the accuracy of landmark estimation
The algorithm implemented in this study is based on our previous work25, in which we predicted landmarks from lateral facial photographs of patients with normal occlusion. This algorithm comprises two stages: HRNetV2 and an MLP. First, we used HRNetV2 for heatmap regression to estimate the positions of all landmarks in the input image. HRNetV2 applies multiple convolutional layers, enabling a precise understanding of the spatial relationships between each landmark and the facial structures; in this step, the model learns the relationship between local facial features and the landmarks. Subsequently, we introduced coordinate regression using an MLP, which significantly improved the accuracy of landmark estimation. Learning complex spatial relationships with an MLP alone may not be accurate; however, combining the MLP with HRNetV2 can improve the accuracy of landmark estimation. Moreover, the MLP can integrate the input positional information effectively because it is fully connected from the input to the output layer, which enables the model to learn the underlying spatial relationships. Therefore, the MLP can estimate the structural features between landmarks, that is, their relative positions. In this study, this two-stage approach, comprising coarse estimation through HRNetV2 heatmap regression and fine estimation through the MLP, enabled the accurate detection of all landmarks25.
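As a rough illustration of this two-stage design, the sketch below pairs a heatmap backbone with an MLP coordinate regressor. It is a minimal sketch assuming a PyTorch implementation; the HeatmapBackbone placeholder stands in for HRNetV2, and the layer sizes, number of landmarks, and soft-argmax decoding are illustrative choices rather than the settings used in this study.

```python
import torch
import torch.nn as nn

class HeatmapBackbone(nn.Module):
    """Placeholder for an HRNetV2-style network that outputs one heatmap per landmark."""
    def __init__(self, num_landmarks: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, num_landmarks, 3, padding=1),
        )

    def forward(self, x):                      # x: (B, 3, H, W)
        return self.features(x)                # heatmaps: (B, K, H, W)

def soft_argmax(heatmaps):
    """Decode each heatmap into normalized (x, y) coordinates via a spatial softmax."""
    b, k, h, w = heatmaps.shape
    probs = torch.softmax(heatmaps.view(b, k, -1), dim=-1).view(b, k, h, w)
    ys = torch.linspace(0, 1, h, device=heatmaps.device)
    xs = torch.linspace(0, 1, w, device=heatmaps.device)
    y = (probs.sum(dim=3) * ys).sum(dim=2)     # expected y per landmark
    x = (probs.sum(dim=2) * xs).sum(dim=2)     # expected x per landmark
    return torch.stack([x, y], dim=-1)         # (B, K, 2)

class CoordinateRefiner(nn.Module):
    """MLP that refines all coarse coordinates jointly, so it can exploit
    the relative positions of landmarks (fully connected input to output)."""
    def __init__(self, num_landmarks: int, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(num_landmarks * 2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_landmarks * 2),
        )

    def forward(self, coarse):                 # coarse: (B, K, 2)
        b, k, _ = coarse.shape
        offset = self.mlp(coarse.view(b, -1)).view(b, k, 2)
        return coarse + offset                 # refined coordinates

# Two-stage inference: coarse heatmap estimate, then MLP refinement.
K = 18                                         # illustrative number of landmarks
backbone, refiner = HeatmapBackbone(K), CoordinateRefiner(K)
image = torch.randn(1, 3, 256, 256)
refined = refiner(soft_argmax(backbone(image)))   # (1, K, 2) normalized coordinates
```

Because the MLP receives all coarse coordinates at once, it can learn offsets that depend on the relative positions of landmarks, which is the rationale for the second stage described above.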
Even though our study inferred landmarks from lateral facial photographs, the results outperformed those of previous studies that inferred landmarks from cephalograms32,33,34,35. This outcome may be unexpected from a human perspective; however, for the AI, it is possible that HRNetV2 heatmap regression-based methods can capture facial soft-tissue features, such as the eyes, nose, and mouth, more readily in color facial photographs than in radiographic images36.
In our previous study25, we used lateral facial photographs of patients with normal occlusion after orthodontic treatment as the training and test data, and the mean MRE for each landmark was 0.61 ± 0.50 mm. In contrast, this study focused on pre-treatment patients with skeletal Class II and III malocclusion. Malocclusion data are considered more variable than normal occlusion data, which may lead to greater variability in landmark prediction. Nevertheless, the mean MRE for each landmark was 0.42 ± 0.15 mm in the CL II group and 0.46 ± 0.16 mm in the CL III group, which is even more accurate than the results for normal occlusion. Moreover, there were almost no differences among the landmarks.
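For reference, the radial error underlying these figures is the Euclidean distance between a predicted landmark and its ground-truth position, averaged over the test images. A minimal NumPy sketch is shown below, with hypothetical array shapes and a hypothetical pixel-to-millimeter calibration factor.

```python
import numpy as np

def mean_radial_error(pred, true, mm_per_pixel=0.1):
    """Mean radial error (MRE) per landmark, in millimeters.

    pred, true   : arrays of shape (num_images, num_landmarks, 2) in pixels.
    mm_per_pixel : hypothetical calibration factor derived from image resolution.
    """
    radial = np.linalg.norm(pred - true, axis=-1) * mm_per_pixel  # (N, K) distances
    return radial.mean(axis=0), radial.std(axis=0)                # per-landmark mean and SD
```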
In this study, several methods were attempted as a preliminary step, using the algorithm from our previous study to learn from and estimate malocclusion data (Table 10). These methods included training only on skeletal Class II malocclusion data to predict skeletal Class II malocclusion, training only on skeletal Class III malocclusion data to predict skeletal Class III malocclusion, training on a combination of normal occlusion, skeletal Class II malocclusion, and skeletal Class III malocclusion data to predict the mixed data, and adding fine-tuning to these methods. Fine-tuning is the process of selecting a pre-trained model, adding new layers tailored to the target task, and retraining the entire model with these new layers for refinement37,38. After these preliminary experiments, we finally adopted method 6 because it achieved the most accurate landmark estimation. This study was able to predict landmarks more accurately than the previous study for two main reasons: we employed fine-tuning to improve accuracy, and the constructed AI algorithm was additionally trained with malocclusion data. We obtained more precise results because we attempted several fine-tuning processes and selected the one with the highest accuracy, which can be considered an advancement in machine learning.
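As an illustration of the fine-tuning step described above, the following sketch continues from the CoordinateRefiner sketch given earlier: it replaces the output layer for the target task and then retrains the entire model on malocclusion data at a small learning rate. The synthetic data, epoch count, and hyperparameters are hypothetical and do not represent the exact procedure of method 6.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

K = 18                                             # illustrative number of landmarks
model = CoordinateRefiner(K)                       # from the earlier sketch
# In practice, the weights pre-trained on the normal-occlusion data would be
# loaded here, e.g. model.load_state_dict(torch.load("normal_occlusion.pt")).

# Replace the output layer for the malocclusion task, then retrain the
# whole model (no frozen layers) at a reduced learning rate.
model.mlp[-1] = nn.Linear(model.mlp[-1].in_features, K * 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
criterion = nn.MSELoss()

# Hypothetical synthetic pairs standing in for the malocclusion training set.
coarse = torch.rand(160, K, 2)
target = coarse + 0.01 * torch.randn_like(coarse)
loader = DataLoader(TensorDataset(coarse, target), batch_size=16, shuffle=True)

for epoch in range(5):                             # illustrative epoch count
    for batch_coarse, batch_target in loader:
        optimizer.zero_grad()
        loss = criterion(model(batch_coarse), batch_target)
        loss.backward()
        optimizer.step()
```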
Cephalometric measurements
- 1) Accuracy: Most skeletal cephalometric measurements based on the estimated landmarks revealed a discrepancy of approximately 0.5° between the actual and estimated values. Clinically, an error of less than 1° in skeletal measurements is considered acceptable39.
- 2) Agreement: Using Bland–Altman analysis, we evaluated the agreement and error trends between the actual and estimated values. Measurements contain errors, which can be either random or systematic. A random error is scattered around the true value and is caused by uncontrollable factors.
In this analysis, the influence of individual participant differences was removed by considering the difference between the two measurement methods (the actual data and the estimated data) for each participant. Systematic errors, which have a biased tendency with respect to the true value and are caused by controllable factors (e.g., the habits of the measurer or the inadequacy of the measuring instrument), were further classified into fixed and proportional errors. A fixed error has a constant bias in a specific direction regardless of the true value, whereas a proportional error changes in magnitude in proportion to the true value. A scatter plot was created to investigate these systematic errors, with the difference between the estimated and actual values on the vertical axis and the average of the estimated and actual values on the horizontal axis. The errors were then classified into vertical (Y-direction) and anteroposterior (X-direction) components, and the limits of agreement (LOA) were calculated to ensure the reliability of our findings.
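A minimal sketch of this Bland–Altman computation is given below, assuming paired arrays of actual and estimated values for a single measurement item; the 1.96 × SD limits of agreement are the conventional choice and not a claim about the exact software used in this study.

```python
import numpy as np

def bland_altman(actual, estimated):
    """Bland–Altman statistics for one measurement item.

    Returns the mean difference (fixed bias), the 95% limits of agreement,
    and the per-participant differences and averages; plotting `diff`
    against `mean` visualizes proportional error trends.
    """
    actual, estimated = np.asarray(actual), np.asarray(estimated)
    diff = estimated - actual                    # vertical axis of the scatter plot
    mean = (estimated + actual) / 2              # horizontal axis of the scatter plot
    bias = diff.mean()
    sd = diff.std(ddof=1)
    loa = (bias - 1.96 * sd, bias + 1.96 * sd)   # limits of agreement
    return bias, loa, diff, mean
```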
- 3) Correlation: We selected the ANB angle, Wits appraisal, and FMA as the skeletal measurement items in this study to examine the relationship between skeletal deformity and the error of each landmark. The ANB angle and Wits appraisal are generally used to indicate the anteroposterior position of the skeleton, with the ANB angle being more relevant in skeletal Class II malocclusion and the Wits appraisal being more relevant in skeletal Class III malocclusion40,41. FMA was selected because it reflects the vertical position of the skeleton.
In skeletal Class II malocclusion, the greater the anteroposterior skeletal deformity, namely, the larger the ANB angle, the larger the error in estimating the position of the upper incisor edge. In contrast, in skeletal Class III malocclusion, the greater the anteroposterior skeletal deformity, namely, the smaller the Wits appraisal, the larger the error in estimating the position of the lower incisor edge. Notably, errors in locating the edge of the incisor positioned forward in the overjet could cause errors in inclination. This may be because the incisor positioned forward in the overjet is prone to errors in estimated position owing to the thickness of the lips42,43. Therefore, the results suggest that the greater the anteroposterior skeletal deviation, the greater the error in estimating the incisal edge position of the anterior teeth.
Regarding the skeletal structure, the results indicated that the more severe the skeletal Class III malocclusion, the more the estimated position of point A shifted. As noted in the results, the direction of this error was mainly horizontal and in the backward direction, and the tendency was similar for both the ANB angle and the Wits appraisal. Moreover, regarding skeletal vertical positioning, the greater the tendency toward a long face in skeletal Class II malocclusion, the smaller the error in estimating the position of the condylar head (a simple correlation sketch is shown below).
However, all landmarks had errors of less than 0.5 mm, which is considered a suitable accuracy level for clinical applications44.
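The error-versus-deformity relationships discussed above can be examined with a simple correlation test. The sketch below assumes hypothetical per-patient arrays of one skeletal measurement (e.g., the ANB angle) and the estimation error of one landmark, and uses a Pearson correlation as a standard choice rather than the exact statistic reported in this study.

```python
import numpy as np
from scipy import stats

# Hypothetical per-patient values for one group: a skeletal measurement
# (e.g., ANB angle in degrees) and the estimation error of one landmark (mm).
anb_angle = np.array([5.8, 7.2, 6.1, 8.4, 9.0, 6.7])
incisor_error = np.array([0.38, 0.45, 0.40, 0.52, 0.57, 0.43])

# Correlation between the severity of the skeletal deformity and the landmark error.
r, p = stats.pearsonr(anb_angle, incisor_error)
print(f"Pearson r = {r:.2f}, p = {p:.3f}")
```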
Study limitation
A previous study reported that the larger the training sample size, the higher the accuracy45. Although the optimal training sample size for detecting landmarks from photographs has not yet been reported, Moon et al.46 estimated landmarks from cephalograms and reported that 2300 is the optimal training sample size. Arik et al.2, Gilmour et al.32, Li et al.33, Kwon et al.34, and Oh et al.35 each used a total of 400 images, and Kim et al.9 used a total of 2075 (Table 11).
In this study, we additionally trained the algorithm of Takahashi et al.25, using 160 patients with skeletal Class II malocclusion and 160 patients with skeletal Class III malocclusion in addition to the 2000 patients from the previous study; thus, data from 2320 patients were used for training. Based on these related reports, we consider this number to be an optimal training sample size.
In this study, as several researchers worked on the task, assessing both repeatability and reproducibility errors was crucial. A previous study reported that errors between landmarks in cephalometric analysis are acceptable within a range of 2 mm46. The errors calculated in the present study were, on average, less than 0.5 mm. Furthermore, this value is considerably smaller than the acceptable error among orthodontists reported previously, and the inter-measurement error between the two orthodontists in this study was within 1 mm. This implies that the training data themselves already contain an error of approximately 1 mm; namely, the accuracy of the estimation by the algorithm in this study is considered suitably high.
Furthermore, this research includes manual steps in the preparation of the test data, namely the plotting of landmarks and the transfer of landmarks from the cephalograms to the lateral facial photographs. This manual methodology likely affects the accuracy of the prediction, so we needed to perform these steps with great care.
The training data used in this study consisted mainly of participants of Japanese descent, representing a specific racial background. Hence, errors may arise when facial photographs of participants of other races are used as test data for inference, because their facial morphology, soft tissues, and skeletal structures in lateral facial photographs will differ. Therefore, to apply the approach to various races, training data from various races must be included and the accuracy improved. We thus plan to add training data from various races and conduct further research.