
Automatic segmentation and classification of frontal sinuses for sex determination from CBCT scans using a two-stage anatomy-guided attention network



Data acquisition and preparation

We collected a total of 310 CBCT scans acquired from 310 patients (mean age: 26.81 ± 11.36 years; 155 males and 155 females) who underwent CBCT imaging at Seoul National University Dental Hospital from 2020 to 2022. This study was performed with approval from the institutional review board of Seoul National University Dental Hospital (ERI123041). The ethics committee waived informed consent because this was a retrospective study. The study was performed in accordance with the Declaration of Helsinki. CBCT scans were acquired using a CS 9300 unit (Carestream Health, Rochester, USA) with a voxel size of 0.3 × 0.3 × 0.3 mm³, dimensions of 640 × 670 × 670 pixels, and 16-bit depth under conditions of 80 or 90 kVp and 8 or 10 mA. All CBCT scans were anonymized and exported in DICOM format. The inclusion criterion was patients aged 4 to 86 years (Supplementary Fig. S1), while the exclusion criteria were visible trauma, previous surgery, or pathological conditions in the frontal region of the skull.

Among the 310 CBCT scans, we assigned 50 scans to the frontal sinus segmentation task and 260 scans to the sex determination task (Table 1). The 50 CBCT scans used only for frontal sinus segmentation were split into 30, 10, and 10 scans for the training, validation, and test sets, respectively, and each set had the same sex distribution. The training, validation, and test sets comprised 19,200, 6400, and 6400 CBCT images, respectively. We observed differences in volume (Supplementary Fig. S2a), major-axis length (Supplementary Fig. S2b), and minor-axis length (Supplementary Fig. S2c) between the frontal sinuses of males and females in our dataset. A region of interest (ROI) on a CBCT scan was cropped to 122 × 128 × 128 pixels, centered on the frontal sinus region segmented by FSNet. The 260 CBCT scans used only for sex determination were split into 120, 40, and 100 scans for the training, validation, and test sets, respectively, and each set had the same sex distribution. To generate the ground-truth segmentation masks (Fig. 1a,b), frontal sinus regions on CBCT images were labeled by a radiologist with over five years of experience using 3D Slicer software (www.slicer.org).

Table 1 Data configuration for frontal sinus segmentation and sex determination tasks.
Figure 1

(a, b) CBCT images with label masks of the frontal sinus acquired from a female and male, respectively.

We estimated the minimum sample size required to detect significant differences between the accuracy of SDetNet and that of the other networks when both assessed the same subjects (CBCT scans). We designed the study to capture a mean accuracy difference of 0.05 with a standard deviation of 0.10 between SDetNet and the other networks. Based on an effect size of 0.5, a significance level of 0.05, and a statistical power of 0.80, we calculated a required sample size of N = 128 (G*Power for Windows 10, Version 3.1.9.7; Universität Düsseldorf, Germany). Finally, we split the dataset of CBCT scans into 120, 40, and 100 scans for training, validation, and test sets, respectively.
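The reported sample size can be reproduced with a standard power analysis. The sketch below is a minimal check assuming an independent two-sample t-test design (the configuration consistent with the reported N = 128), using the statsmodels power module in place of G*Power.

```python
# Minimal sketch: reproduce the N = 128 sample-size estimate, assuming an
# independent two-sample t-test (effect size 0.5, alpha 0.05, power 0.80).
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.80,
                                   alternative="two-sided")
print(round(n_per_group))      # ~64 subjects per group
print(2 * round(n_per_group))  # ~128 subjects in total
```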

The architecture of a two-stage anatomy-guided attention network

We proposed a two-stage anatomy-guided attention network (SDetNet) for automatic and accurate sex determination from a CBCT scan (Fig. 2). SDetNet consisted of a 2D frontal sinus segmentation network (FSNet) and a 3D anatomy-guided attention network (SDNet). The first stage was 2D frontal sinus segmentation using FSNet, which automatically segmented the frontal sinus regions on CBCT images. Next, 3D sex classification was performed using SDNet, which used the anatomy-guided information from the frontal sinus segmentation to automatically determine the sex of a patient from a CBCT scan. For frontal sinus segmentation on CBCT images, we used FSNet, which had a U-shaped encoder–decoder architecture with transfer learning. Five popular backbones, namely VGG16 [27], ResNet101 [28], DenseNet201 [29], Inception V3 [30], and EfficientNet-B5 [31], were used as encoders in FSNet. The decoder part had five levels of layers with 2D convolution blocks and a 2D transposed convolution layer for 2D up-sampling. The 2D convolution block consisted of a 3 × 3 convolution layer, batch normalization (BN), and rectified linear unit (ReLU) activation. The final output layer in FSNet was a 1 × 1 convolution layer with a Sigmoid activation function.
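A minimal sketch of the decoder components described above (a 2D convolution block of 3 × 3 convolution, BN, and ReLU; 2D transposed convolutions for up-sampling; and a 1 × 1 Sigmoid output), assuming Keras with a TensorFlow backend. The helper names and the skip connections implied by the U-shaped design are our assumptions, not the authors' code.

```python
# Sketch of FSNet-style decoder components (hypothetical names), assuming
# Keras/TensorFlow; skip connections are assumed from the U-shaped design.
from tensorflow.keras import layers

def conv_block_2d(x, filters):
    """2D convolution block: 3 x 3 convolution -> batch normalization -> ReLU."""
    x = layers.Conv2D(filters, kernel_size=3, padding="same")(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU()(x)

def decoder_level(x, skip, filters):
    """One decoder level: 2D transposed convolution for up-sampling, then
    concatenation with the corresponding encoder feature map and a conv block."""
    x = layers.Conv2DTranspose(filters, kernel_size=2, strides=2, padding="same")(x)
    x = layers.Concatenate()([x, skip])
    return conv_block_2d(x, filters)

def segmentation_head(x):
    """Final 1 x 1 convolution with a Sigmoid activation for the binary mask."""
    return layers.Conv2D(1, kernel_size=1, activation="sigmoid")(x)
```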

Figure 2

Overview of the proposed two-stage anatomy-guided attention network (SDetNet). (a) SDetNet consists of the 2D frontal sinus segmentation network (FSNet) and the 3D sex determination network (SDNet). (b) The architecture of the anatomy-guided attention module (AGAM).

After automatic segmentation of the frontal sinus on CBCT images by FSNet, the CBCT scan and the corresponding prediction masks, cropped at the centroid of the frontal sinus segmentation results, were used as multi-channel inputs to SDNet, which was designed for automatic sex determination (Fig. 2a). SDNet had 3D convolutional blocks (ConvBlocks), an anatomy-guided attention module (AGAM), 3D max-pooling (MP), and 3D global average pooling (GAP). Each ConvBlock consisted of a 3 × 3 × 3 convolution layer, BN, and ReLU. MP was used for down-sampling the 3D feature maps, and 3D GAP was employed to average each 3D feature map. The final feature vectors produced by 3D GAP were fed into the output layer with a Sigmoid activation function for sex prediction. The number of feature maps at each level of SDNet increased gradually from 16 to 32, 64, and 128.
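A minimal sketch of the SDNet building blocks described above, assuming Keras 3D layers; the helper names are illustrative assumptions rather than the authors' implementation.

```python
# Sketch of SDNet-style building blocks (hypothetical names), assuming Keras 3D
# layers; channel widths 16 -> 32 -> 64 -> 128 follow the text.
from tensorflow.keras import layers

def conv_block_3d(x, filters):
    """ConvBlock: 3 x 3 x 3 convolution -> batch normalization -> ReLU."""
    x = layers.Conv3D(filters, kernel_size=3, padding="same")(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU()(x)

def downsample_3d(x):
    """3D max-pooling (MP) for down-sampling the feature maps."""
    return layers.MaxPooling3D(pool_size=2)(x)

def classification_head(x):
    """3D global average pooling (GAP) followed by a Sigmoid output for sex prediction."""
    x = layers.GlobalAveragePooling3D()(x)
    return layers.Dense(1, activation="sigmoid")(x)
```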

For accurate sex determination from a CBCT scan, deep learning models need to capture anatomical context information related to variations in the shape and size of the frontal sinuses between males and females (Supplementary Fig. S2a–c). Attention mechanisms in deep learning are inspired by the human visual cognition system and are used to encourage deep learning models to focus on the most relevant regions and ignore the background [32]. Based on this observation, we proposed the AGAM in SDNet to encourage the model to focus more on the frontal sinus regions of a CBCT scan than on the background regions (Fig. 2b). The AGAM was embedded at each level of layers in SDNet to learn the anatomical context of the frontal sinus hierarchically. Frontal sinus segmentation maps \(S_{m}\in \mathbb{R}^{H\times W\times D\times 1}\) are obtained by the ROI extraction process after the inference of FSNet, where H, W, and D indicate the height, width, and depth of the mask maps, respectively. The 3D feature maps \(F_{m}\in \mathbb{R}^{H\times W\times D\times C}\) are acquired by a ConvBlock, where C indicates the number of channels in the feature maps. Then, we apply a 1 × 1 × 1 convolution layer (\(Conv_{1}\)) to \(S_{m}\) and \(F_{m}\) to obtain anatomy-guided feature maps \(A_{m}\in \mathbb{R}^{H\times W\times D\times C}\) and bottleneck feature maps \(B_{m}\in \mathbb{R}^{H\times W\times D\times C}\) as follows:

$$A_{m}=Conv_{1}\left(S_{m}\right),\quad B_{m}=Conv_{1}\left(F_{m}\right)$$

(1)

To extract discriminative features \(F_{dis}\) between \(A_{m}\) and \(B_{m}\), 3D attention maps \(F_{att}\in \mathbb{R}^{H\times W\times D\times C}\) are acquired as follows:

$$F_{dis}=\psi \left({\sigma }_{1}\left(A_{m}+B_{m}\right)\right)$$

(2)

$$F_{att}=GR\left({\sigma }_{2}\left(F_{dis}\right)\right)$$

(3)

where \(\psi\) is a 1 × 1 × 1 convolution layer used to extract the discriminative feature map \(F_{dis}\in \mathbb{R}^{H\times W\times D\times 1}\), and \({\sigma }_{1}\) and \({\sigma }_{2}\) are ReLU and Sigmoid activation functions, respectively. GR denotes a grid resampling operation that restores the dimensions of the discriminative feature map to those of \(F_{m}\) using trilinear interpolation. Finally, 3D attentive feature maps \(F_{n}\) are acquired by element-wise multiplication of \(F_{m}\) and \(F_{att}\) as follows:

$$F_{n}=F_{att}\otimes F_{m}$$

(4)

where \(\otimes\) indicates element-wise multiplication. \(F_{att}\in \left[0,1\right]\), which acts as a saliency map, identifies important regions in the feature maps and prunes the feature responses to retain activations relevant to the foreground while suppressing the background.
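A hedged sketch of the AGAM computation in Eqs. (1)–(4), assuming Keras 3D layers. The mask \(S_{m}\) is assumed to be already resampled to the spatial size of \(F_{m}\), and broadcasting the single-channel attention map across the C channels stands in for the grid resampling (GR) step; these simplifications and the function name are ours.

```python
# Sketch of the anatomy-guided attention module (AGAM), Eqs. (1)-(4);
# an illustrative Keras implementation, not the authors' code.
from tensorflow.keras import layers

def agam(f_m, s_m):
    """f_m: 3D feature maps of shape (B, H, W, D, C); s_m: frontal-sinus mask of
    shape (B, H, W, D, 1), assumed resampled to the spatial size of f_m."""
    c = f_m.shape[-1]
    # Eq. (1): 1 x 1 x 1 convolutions give anatomy-guided (A_m) and bottleneck (B_m) maps.
    a_m = layers.Conv3D(c, kernel_size=1)(s_m)
    b_m = layers.Conv3D(c, kernel_size=1)(f_m)
    # Eq. (2): discriminative map F_dis = psi(ReLU(A_m + B_m)), psi being a 1 x 1 x 1 conv to 1 channel.
    f_dis = layers.Conv3D(1, kernel_size=1)(layers.ReLU()(layers.Add()([a_m, b_m])))
    # Eq. (3): Sigmoid produces the attention map; channel broadcasting replaces GR here.
    f_att = layers.Activation("sigmoid")(f_dis)
    # Eq. (4): element-wise multiplication gates the feature maps (F_n = F_att * F_m).
    return f_m * f_att
```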

We used the Dice similarity coefficient (DL) and binary cross-entropy (BL) losses to train FSNet and SDNet, respectively. DL measured the overlap between the ground truth and segmentation results for the frontal sinus. DL is defined as:

$$DL\left(y,\widehat{y}\right)=1-\frac{2{\sum }_{i}^{n}\left({y}_{i}\times \widehat{{y}_{i}}\right)+\epsilon }{{\sum }_{i}^{n}{y}_{i}+{\sum }_{i}^{n}\widehat{{y}_{i}}+\epsilon }$$

(5)

where \(y\) and \(\widehat{y}\) are the ground truth and segmentation results for the frontal sinus, respectively, and \(n\) is the number of pixels on the CBCT images. \(\epsilon\) provided numerical stability to prevent division by zero, with \(\epsilon\) set to \(10^{-3}\). BL measured the average probability error between the ground truth (actual sex) and the sex predictions. BL is defined as:

$$BL\left(p,\widehat{p}\right)=-\frac{1}{N}{\sum }_{i}^{N}\left({p}_{i}\,\mathrm{log}\,\widehat{{p}_{i}}+\left(1-{p}_{i}\right)\mathrm{log}\left(1-\widehat{{p}_{i}}\right)\right)$$

(6)

where \(p\) and \(\widehat{p}\) are the ground truth and the predicted probability of sex, respectively, and N is the number of CBCT scans.
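The two losses above can be sketched as follows, assuming a Keras/TensorFlow training setup; \(\epsilon = 10^{-3}\) follows the text, and the function name is ours.

```python
# Sketch of the training losses in Eqs. (5)-(6), assuming TensorFlow/Keras.
import tensorflow as tf

def dice_loss(y_true, y_pred, eps=1e-3):
    """Eq. (5): 1 - (2 * intersection + eps) / (sum of ground truth + sum of prediction + eps)."""
    y_true = tf.reshape(tf.cast(y_true, tf.float32), [-1])
    y_pred = tf.reshape(tf.cast(y_pred, tf.float32), [-1])
    intersection = tf.reduce_sum(y_true * y_pred)
    return 1.0 - (2.0 * intersection + eps) / (
        tf.reduce_sum(y_true) + tf.reduce_sum(y_pred) + eps)

# Eq. (6): binary cross-entropy, available directly in Keras.
bce_loss = tf.keras.losses.BinaryCrossentropy()
```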

FSNet was trained for 200 epochs with a mini-batch size of 16. Data augmentation was performed with rotation (−10° to 10°), Gaussian blur (−10% to 10%), and brightness adjustment (−10% to 10%). The Adam optimizer with an initial learning rate of \(10^{-3}\) was used, and the learning rate was halved, down to \(10^{-6}\), whenever the validation loss did not improve for 20 epochs. SDetNet was trained for 100 epochs with a mini-batch size of 1. The Adam optimizer was used with \({\beta }_{1}=0.9\) and \({\beta }_{2}=0.999\), and the learning rate was initially set to \(10^{-4}\) and halved, down to \(10^{-7}\), whenever the validation loss did not improve for 25 epochs. Deep learning models were implemented in Python 3 and Keras with a TensorFlow backend on a workstation with an Intel i9-7900X CPU (3.3 GHz), 256 GB of RAM, and an NVIDIA RTX A6000 GPU with 48 GB of memory.
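The optimizer settings and the plateau-based halving of the learning rate can be expressed with standard Keras callbacks; a minimal sketch, assuming ReduceLROnPlateau monitoring the validation loss (the variable names are ours).

```python
# Sketch of the training configuration described above, assuming Keras callbacks.
import tensorflow as tf

# FSNet: Adam, initial learning rate 1e-3, halved after a 20-epoch plateau, floor 1e-6.
fsnet_optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)
fsnet_lr_callback = tf.keras.callbacks.ReduceLROnPlateau(
    monitor="val_loss", factor=0.5, patience=20, min_lr=1e-6)

# SDetNet: Adam (beta_1=0.9, beta_2=0.999), initial learning rate 1e-4,
# halved after a 25-epoch plateau, floor 1e-7.
sdetnet_optimizer = tf.keras.optimizers.Adam(
    learning_rate=1e-4, beta_1=0.9, beta_2=0.999)
sdetnet_lr_callback = tf.keras.callbacks.ReduceLROnPlateau(
    monitor="val_loss", factor=0.5, patience=25, min_lr=1e-7)
```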

Performance evaluation

We used precision (PR), recall (RC), the Jaccard index (JI), and the F1-score (F1) to evaluate the segmentation performance of the deep learning models for the frontal sinus, and the area under the receiver operating characteristic curve (AUC), Brier score (BR), accuracy (ACC), specificity (SPE), sensitivity (SEN), and the polygon area metric (PAM) to evaluate their performance for sex determination. PR is calculated as the number of true positives (TP) divided by the sum of the TP and false positives (FP): \(PR=\frac{TP}{TP+FP}\). RC is calculated as the number of TPs divided by the sum of the TPs and false negatives (FNs): \(RC=\frac{TP}{TP+FN}\). JI is calculated as the intersection of the predicted segmentation and the ground truth divided by their union: \(JI=\frac{TP}{TP+FP+FN}\). F1 is calculated as the harmonic mean of PR and RC: \(F1=\frac{2\times PR\times RC}{PR+RC}\). ACC is defined as the ratio of the number of correct sex predictions to the total number of input samples: \(ACC=\frac{TP+TN}{TP+TN+FP+FN}\), where TN indicates true negatives. SPE measures a model's ability to correctly predict negative cases and is defined as \(SPE=\frac{TN}{TN+FP}\). SEN, equivalent to RC, measures a model's ability to correctly predict positive cases. BR is calculated as the mean squared difference between the predicted probabilities and the actual outcomes: \(BR=\frac{1}{N}{\sum }_{i}^{N}{\left({y}_{i}-{p}_{i}\right)}^{2}\), where N is the number of CBCT scans and \({y}_{i}\) and \({p}_{i}\) are the ground truth and the predicted probability, respectively. AUC is calculated as the area under the receiver operating characteristic (ROC) curve, which plots the true positive rate against the false positive rate. PAM is calculated from the area of the polygon whose vertices are the ACC, SEN, SPE, AUC, Jaccard index (JI), and F-measure (FM) values placed on a regular hexagon [33]. The PAM is defined as:

$$PAM=\frac{PA}{2.59807}$$

(7)

where PA denotes the area of the polygon. To normalize the PAM to the range [0, 1], PA is divided by 2.59807, the area of the full regular hexagon. For the sex determination results, SDetNet outputs a probability in the range of 0.0 to 1.0, and females and males are classified as 0 and 1, respectively, using a threshold of 0.5. Therefore, SPE reflects the ability of the deep learning algorithm to correctly predict females, and SEN its ability to correctly predict males. A one-way analysis of variance (ANOVA) with Scheffé post hoc tests was performed using IBM SPSS Statistics (Version 26.0 for Windows 10; IBM, Armonk, New York, USA), and the threshold for statistical significance was set at p < 0.05.
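One plausible implementation of the PAM described above places the six metric values on the vertices of a regular hexagon with circumradius 1 and computes the polygon area with the shoelace formula; this construction and the function name are our assumptions based on the description, not a verified reimplementation of reference [33].

```python
# Sketch of the polygon area metric (PAM), Eq. (7): six metrics placed on a
# regular hexagon (circumradius 1); 2.59807 is the area of the full hexagon.
import numpy as np

def polygon_area_metric(acc, sen, spe, auc, ji, fm):
    values = np.array([acc, sen, spe, auc, ji, fm], dtype=float)
    angles = np.linspace(0.0, 2.0 * np.pi, num=6, endpoint=False)
    x = values * np.cos(angles)
    y = values * np.sin(angles)
    # Shoelace formula for the area of the polygon spanned by the six metric points.
    pa = 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))
    return pa / 2.59807  # Eq. (7)
```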

Ethics declarations

This study was performed with approval from the Institutional Review Board (IRB) of Seoul National University Dental Hospital (ERI123041). The IRB of Seoul National University Dental Hospital approved the waiver for informed consent because this was a retrospective study. The study was performed in accordance with the Declaration of Helsinki.


